Documenting Sample Selection
Problem
I have a dataframe on which I perform a series of data selection steps. What I want is to automatically build a table for the appendix of my paper that tells me the number of users left in the data after each selection step.
Here’s a mock dataset:
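The original code cells aren't shown, so here is one way such a dataset might be built (a sketch; the `data` column is random, so the exact values will differ from the table below):

```python
import numpy as np
import pandas as pd

# Four users with two observations each; a stable sort by user_id
# groups the observations by user while preserving the original index
df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 1, 2, 3, 4],
    'data': np.random.rand(8),
}).sort_values('user_id', kind='stable')
df
```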
| | user_id | data |
|---|---|---|
| 0 | 1 | 0.107515 |
| 4 | 1 | 0.306182 |
| 1 | 2 | 0.184724 |
| 5 | 2 | 0.217231 |
| 2 | 3 | 0.688004 |
| 6 | 3 | 0.284524 |
| 3 | 4 | 0.990159 |
| 7 | 4 | 0.466758 |
Here are some selection functions:
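Given the step names and the counts reported later, the selection functions presumably look something like this (a reconstruction, using the values from the mock dataset above so the output matches):

```python
import pandas as pd

# Mock dataset with the values from the table above
df = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3, 3, 4, 4],
    'data': [0.107515, 0.306182, 0.184724, 0.217231,
             0.688004, 0.284524, 0.990159, 0.466758],
}, index=[0, 4, 1, 5, 2, 6, 3, 7])

def first_five(df):
    """Keep the first five observations."""
    return df.head(5)

def n_largest(df, n=3):
    """Keep the n largest datapoints."""
    return df.nlargest(n, 'data')

selected = n_largest(first_five(df))
selected
```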
| | user_id | data |
|---|---|---|
| 2 | 3 | 0.688004 |
| 4 | 1 | 0.306182 |
| 5 | 2 | 0.217231 |
Solution
If we have a single dataframe on which to perform selection, as in the setting above, we can use a decorator and a dictionary.
As a first step, let’s build a decorator that prints out the number of users after applying each function:
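A decorator along these lines does the trick (a sketch; the decorator name is my own):

```python
import functools

import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3, 3, 4, 4],
    'data': [0.107515, 0.306182, 0.184724, 0.217231,
             0.688004, 0.284524, 0.990159, 0.466758],
}, index=[0, 4, 1, 5, 2, 6, 3, 7])

def print_counts(func):
    """Print the number of unique users left after applying func."""
    @functools.wraps(func)
    def wrapper(df, *args, **kwargs):
        result = func(df, *args, **kwargs)
        print(f"{func.__name__}: {result['user_id'].nunique()}")
        return result
    return wrapper

@print_counts
def first_five(df):
    return df.head(5)

@print_counts
def n_largest(df, n=3):
    return df.nlargest(n, 'data')

result = n_largest(first_five(df))
result
```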
first_five: 3
n_largest: 3
| | user_id | data |
|---|---|---|
| 2 | 3 | 0.688004 |
| 4 | 1 | 0.306182 |
| 5 | 2 | 0.217231 |
That’s already nice. But I need those counts for the data appendix of my paper, so what I really want is to store the counts in a container that I can turn into a table. To do this, we can store the counts in a dictionary instead of printing them.
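One way to do this is to have the decorator write into a module-level dictionary (sketch; the names are mine):

```python
import functools

import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3, 3, 4, 4],
    'data': [0.107515, 0.306182, 0.184724, 0.217231,
             0.688004, 0.284524, 0.990159, 0.466758],
}, index=[0, 4, 1, 5, 2, 6, 3, 7])

counts = {}  # global container that all decorated functions write to

def store_counts(func):
    """Store the number of unique users left after applying func."""
    @functools.wraps(func)
    def wrapper(df, *args, **kwargs):
        result = func(df, *args, **kwargs)
        counts[func.__name__] = result['user_id'].nunique()
        return result
    return wrapper

@store_counts
def first_five(df):
    return df.head(5)

@store_counts
def n_largest(df, n=3):
    return df.nlargest(n, 'data')

result = n_largest(first_five(df))
print(counts)
```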
| | user_id | data |
|---|---|---|
| 2 | 3 | 0.688004 |
| 4 | 1 | 0.306182 |
| 5 | 2 | 0.217231 |
{'first_five': 3, 'n_largest': 3}
Next, I want to add the number of users at the beginning and the end of the process (the count at the end is identical to the count after the final selection step, but I think it’s worth adding so readers can easily spot the final numbers).
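The start and end counts can be recorded manually around the selection pipeline (continuing the sketch from above):

```python
import functools

import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3, 3, 4, 4],
    'data': [0.107515, 0.306182, 0.184724, 0.217231,
             0.688004, 0.284524, 0.990159, 0.466758],
}, index=[0, 4, 1, 5, 2, 6, 3, 7])

counts = {}

def store_counts(func):
    @functools.wraps(func)
    def wrapper(df, *args, **kwargs):
        result = func(df, *args, **kwargs)
        counts[func.__name__] = result['user_id'].nunique()
        return result
    return wrapper

@store_counts
def first_five(df):
    return df.head(5)

@store_counts
def n_largest(df, n=3):
    return df.nlargest(n, 'data')

counts['start'] = df['user_id'].nunique()
result = n_largest(first_five(df))
counts['end'] = result['user_id'].nunique()
print(counts)
```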
| | user_id | data |
|---|---|---|
| 2 | 3 | 0.688004 |
| 4 | 1 | 0.306182 |
| 5 | 2 | 0.217231 |
{'start': 4, 'first_five': 3, 'n_largest': 3, 'end': 3}
We’re nearly there. Let’s turn this into a table that we can save to disk (as a LaTeX table, say) and automatically import in our paper.
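Starting from the counts dictionary above, pandas can turn the items directly into a table (column names taken from the output below):

```python
import pandas as pd

counts = {'start': 4, 'first_five': 3, 'n_largest': 3, 'end': 3}

table = pd.DataFrame(
    list(counts.items()),
    columns=['Processing step', 'Number of unique users'],
)
table
```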
| | Processing step | Number of unique users |
|---|---|---|
| 0 | start | 4 |
| 1 | first_five | 3 |
| 2 | n_largest | 3 |
| 3 | end | 3 |
Finally, let’s make sure readers of our paper (and we ourselves a few weeks from now) actually understand what’s going on at each step.
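Descriptive labels can replace the function names via a simple mapping (sketch):

```python
import pandas as pd

counts = {'start': 4, 'first_five': 3, 'n_largest': 3, 'end': 3}
table = pd.DataFrame(
    list(counts.items()),
    columns=['Processing step', 'Number of unique users'],
)

# Map each step name to a label a reader can understand
labels = {
    'start': 'Raw dataset',
    'first_five': 'Keep first five observations',
    'n_largest': 'Keep three largest datapoints',
    'end': 'Final dataset',
}
table['Processing step'] = table['Processing step'].map(labels)
table
```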
| | Processing step | Number of unique users |
|---|---|---|
| 0 | Raw dataset | 4 |
| 1 | Keep first five observations | 3 |
| 2 | Keep three largest datapoints | 3 |
| 3 | Final dataset | 3 |
That’s it. We can now export this as a LaTeX table (or some other format) and automatically load it in our paper.
Multiple datasets
Instead of having a single dataframe on which to perform selection, I actually have multiple pieces of a large dataframe (because the full dataframe doesn’t fit into memory). What I want is to perform the data selection on each chunk separately but have the values in the counter object add up so that – at the end – the counts represent the counts for the full dataset. The solution here is to use collections.Counter() instead of a dictionary.
So, my setup is akin to the following:
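A full dataframe of this shape might be created like so (random data again, so values will differ):

```python
import numpy as np
import pandas as pd

# Twelve users with one observation each
df = pd.DataFrame({'user_id': range(12), 'data': np.random.rand(12)})
df
```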
| | user_id | data |
|---|---|---|
| 0 | 0 | 0.507218 |
| 1 | 1 | 0.933454 |
| 2 | 2 | 0.740951 |
| 3 | 3 | 0.654135 |
| 4 | 4 | 0.952187 |
| 5 | 5 | 0.807332 |
| 6 | 6 | 0.742915 |
| 7 | 7 | 0.344259 |
| 8 | 8 | 0.134813 |
| 9 | 9 | 0.952129 |
| 10 | 10 | 0.859282 |
| 11 | 11 | 0.376175 |
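Split into two chunks of six rows each (in the real setting the chunks would be read from disk one at a time):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'user_id': range(12), 'data': np.random.rand(12)})
chunks = [df.iloc[:6], df.iloc[6:]]
```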
| | user_id | data |
|---|---|---|
| 0 | 0 | 0.507218 |
| 1 | 1 | 0.933454 |
| 2 | 2 | 0.740951 |
| 3 | 3 | 0.654135 |
| 4 | 4 | 0.952187 |
| 5 | 5 | 0.807332 |

| | user_id | data |
|---|---|---|
| 6 | 6 | 0.742915 |
| 7 | 7 | 0.344259 |
| 8 | 8 | 0.134813 |
| 9 | 9 | 0.952129 |
| 10 | 10 | 0.859282 |
| 11 | 11 | 0.376175 |
What happens if we use a dict() as our counts object, as we did above?
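Running the pipeline over both chunks with a plain dict, mirroring the decorator from above (sketch):

```python
import functools

import numpy as np
import pandas as pd

df = pd.DataFrame({'user_id': range(12), 'data': np.random.rand(12)})
chunks = [df.iloc[:6], df.iloc[6:]]

counts = {}

def store_counts(func):
    @functools.wraps(func)
    def wrapper(df, *args, **kwargs):
        result = func(df, *args, **kwargs)
        counts[func.__name__] = result['user_id'].nunique()
        return result
    return wrapper

@store_counts
def first_five(df):
    return df.head(5)

@store_counts
def n_largest(df, n=3):
    return df.nlargest(n, 'data')

for chunk in chunks:
    counts['start'] = chunk['user_id'].nunique()
    result = n_largest(first_five(chunk))
    counts['end'] = result['user_id'].nunique()
    print(counts)
```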
{'start': 6, 'first_five': 5, 'n_largest': 3, 'end': 3}
{'start': 6, 'first_five': 5, 'n_largest': 3, 'end': 3}
The counts are replaced rather than added up, which is how updating works for a dictionary:
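A dict’s update() replaces the value for keys that already exist:

```python
d = {'a': 1, 'b': 2}
d.update({'b': 3, 'c': 4})  # 'b' is overwritten, not incremented
d
```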
{'a': 1, 'b': 3, 'c': 4}
collections.Counter() (docs) solves this problem.
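The same pipeline with a Counter, using counts.update() everywhere so values accumulate across chunks (sketch):

```python
import functools
from collections import Counter

import numpy as np
import pandas as pd

df = pd.DataFrame({'user_id': range(12), 'data': np.random.rand(12)})
chunks = [df.iloc[:6], df.iloc[6:]]

counts = Counter()

def store_counts(func):
    @functools.wraps(func)
    def wrapper(df, *args, **kwargs):
        result = func(df, *args, **kwargs)
        counts.update({func.__name__: result['user_id'].nunique()})
        return result
    return wrapper

@store_counts
def first_five(df):
    return df.head(5)

@store_counts
def n_largest(df, n=3):
    return df.nlargest(n, 'data')

for chunk in chunks:
    counts.update({'start': chunk['user_id'].nunique()})
    result = n_largest(first_five(chunk))
    counts.update({'end': result['user_id'].nunique()})
    print(counts)
```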
Counter({'start': 6, 'first_five': 5, 'n_largest': 3, 'end': 3})
Counter({'start': 12, 'first_five': 10, 'n_largest': 6, 'end': 6})
Now, updating adds up the values for each key, just as we want. We can add the same formatting as we did above and are done with our table.
Background
Other cool stuff Counter() can do
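In contrast to a dict, a Counter’s update() adds values for existing keys, and counts may even go negative (the exact construction here is my guess from the output):

```python
from collections import Counter

c = Counter(a=1, b=2)
c.update({'b': 3, 'c': -4})  # 'b' becomes 2 + 3 = 5; 'c' is negative
c
```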
Counter({'a': 1, 'b': 5, 'c': -4})
Counters can also do cool things like this:
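elements() repeats each key according to its count and skips keys with non-positive counts:

```python
from collections import Counter

c = Counter(a=1, b=5, c=-4)
sorted(c.elements())  # 'c' is dropped because its count is negative
```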
['a', 'b', 'b', 'b', 'b', 'b']
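most_common(n) returns the n keys with the highest counts:

```python
from collections import Counter

c = Counter(a=1, b=5, c=-4)
c.most_common(2)
```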
[('b', 5), ('a', 1)]
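Counters also support multiset operations. One operation that produces output like the one below is intersection, which keeps the minimum of each pair of counts and drops non-positive results (the operands here are my guess):

```python
from collections import Counter

# Intersection: min of each count; keys with non-positive results are dropped
Counter(a=1, b=5, c=-4) & Counter(a=3, b=2)
```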
Counter({'a': 1, 'b': 2})
Why is counts a global variable?
Because I want all decorated functions to write to the same counter object.
Often, decorators make use of closures instead, which have access to a nonlocal variable defined inside the outermost function. Let’s look at what happens if we do this for our user counter.
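A closure-based version of the decorator might look like this (sketch; the small example chunk and its data values are made up):

```python
import functools
from collections import Counter

import pandas as pd

def store_counts(func):
    counts = Counter()  # nonlocal: each decorated function gets its own

    @functools.wraps(func)
    def wrapper(df, *args, **kwargs):
        result = func(df, *args, **kwargs)
        counts.update({func.__name__: result['user_id'].nunique()})
        print(counts)
        return result
    return wrapper

@store_counts
def first_five(df):
    return df.head(5)

@store_counts
def largest(df, n=3):
    return df.nlargest(n, 'data')

chunk = pd.DataFrame({'user_id': range(6),
                      'data': [0.51, 0.93, 0.74, 0.65, 0.95, 0.81]})
result = largest(first_five(chunk))
```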
Counter({'first_five': 5})
Counter({'largest': 3})
Now, each decorated function gets its own counter object, which is not what we want here. For more on decorator state retention options, see chapter 39 in Learning Python.
What are closures and nonlocal variables?
(Disclaimer: Just about all of the text and code on closures is taken – sometimes verbatim – from chapter 7 in Fluent Python. So the point here is not to produce new insight, but to absorb the material and write an easily accessible note to my future self.)
Closures are functions that have access to nonlocal variables: variables that are neither local nor global, but are defined inside an outer function within which the closure was defined, and to which the closure retains access.
Let’s look at an example: a simple function that takes one number as an argument and returns the average of all numbers passed to it since its definition. For this, we need a way to store all previously passed values. One way to do this is to define a class with a __call__ method.
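The class-based version (following the example in Fluent Python):

```python
class Averager:
    def __init__(self):
        self.series = []

    def __call__(self, new_value):
        self.series.append(new_value)
        return sum(self.series) / len(self.series)

avg = Averager()
results = avg(10), avg(20), avg(30)
results
```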
(10.0, 15.0, 20.0)
Another way is to use a closure function and store the series of previously passed numbers as a free variable.
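The closure-based version, where series survives between calls as a free variable:

```python
def make_averager():
    series = []  # free variable: part of the closure of averager

    def averager(new_value):
        series.append(new_value)
        return sum(series) / len(series)

    return averager

avg = make_averager()
results = avg(10), avg(20), avg(30)
results
```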
(10.0, 15.0, 20.0)
This gives the same result, but is arguably simpler than defining a class.
We can improve the above function by storing previous results so that we don’t have to calculate the new average from scratch at every function call.
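Instead of re-summing the whole series on every call, we can keep a running count and total:

```python
def make_averager():
    count = 0
    total = 0

    def averager(new_value):
        nonlocal count, total  # assign to the enclosing function's variables
        count += 1
        total += new_value
        return total / count

    return averager

avg = make_averager()
results = avg(10), avg(11), avg(12)
results
```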
(10.0, 10.5, 11.0)
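The timings below were presumably produced with IPython’s %timeit; the exact benchmark isn’t shown, but a stdlib equivalent for the slow averager looks something like this (absolute numbers depend on the machine and workload):

```python
from timeit import timeit

def make_averager():
    series = []

    def averager(new_value):
        series.append(new_value)
        return sum(series) / len(series)

    return averager

slow_avg = make_averager()
# Each call re-sums the whole (growing) series, so total work is quadratic
seconds = timeit('slow_avg(1)', globals=globals(), number=2_000)
```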
233 ms ± 8.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
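And the same benchmark sketch for the fast averager:

```python
from timeit import timeit

def make_averager():
    count = 0
    total = 0

    def averager(new_value):
        nonlocal count, total
        count += 1
        total += new_value
        return total / count

    return averager

fast_avg = make_averager()
# Constant work per call: two additions and one division
seconds = timeit('fast_avg(1)', globals=globals(), number=2_000)
```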
1.69 ms ± 53.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This simple change gives us a massive speedup.
Notice the nonlocal statement inside the averager function. Why do we need this? Let’s see what happens if we don’t specify it:
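Without the nonlocal declaration the call fails (wrapped in try/except here so the snippet can run; the exact error message varies across Python versions):

```python
def make_averager():
    count = 0
    total = 0

    def averager(new_value):
        count += 1  # assignment makes count local to averager
        total += new_value
        return total / count

    return averager

avg = make_averager()
try:
    avg(10)
except UnboundLocalError as err:
    error = err
print(error)
```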
UnboundLocalError: local variable 'count' referenced before assignment
How come our fast averager can’t find count and total even though our slow averager could find series just fine?
The answer lies in Python’s variable scope rules and the difference between assigning to immutable objects and updating mutable ones.
Whenever we assign to a variable inside a function, it is treated as a local variable.
count += 1 is the same as count = count + 1, so we are assigning to count, which makes it a local variable (the same goes for total). We are assigning a new value to count rather than updating it in place because integers are immutable, so they can’t be updated.
Lists are mutable, so series.append() doesn’t create a new list, but merely appends to it, which doesn’t count as an assignment, so that series is not treated as a local variable.
Hence, we need to explicitly tell Python that count and total are nonlocal variables.