Documenting Sample Selection
Problem
I have a dataframe on which I perform a series of data selection steps. What I want is to automatically build a table for the appendix of my paper that tells me the number of users left in the data after each selection step.
Here’s a mock dataset:
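The original code cells aren't shown, so here is one way such a dataset might be built (a sketch; the `data` column is random, so the exact values will differ from the table below):

```python
import numpy as np
import pandas as pd

# Four users with two observations each; a stable sort by user_id
# groups the observations by user while preserving the original index
df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 1, 2, 3, 4],
    'data': np.random.rand(8),
}).sort_values('user_id', kind='stable')
df
```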
| | user_id | data |
|---|---|---|
| 0 | 1 | 0.107515 |
| 4 | 1 | 0.306182 |
| 1 | 2 | 0.184724 |
| 5 | 2 | 0.217231 |
| 2 | 3 | 0.688004 |
| 6 | 3 | 0.284524 |
| 3 | 4 | 0.990159 |
| 7 | 4 | 0.466758 |
Here are some selection functions:
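Given the step names and the counts reported later, the selection functions presumably look something like this (a reconstruction, using the values from the mock dataset above so the output matches):

```python
import pandas as pd

# Mock dataset with the values from the table above
df = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3, 3, 4, 4],
    'data': [0.107515, 0.306182, 0.184724, 0.217231,
             0.688004, 0.284524, 0.990159, 0.466758],
}, index=[0, 4, 1, 5, 2, 6, 3, 7])

def first_five(df):
    """Keep the first five observations."""
    return df.head(5)

def n_largest(df, n=3):
    """Keep the n largest datapoints."""
    return df.nlargest(n, 'data')

selected = n_largest(first_five(df))
selected
```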
| | user_id | data |
|---|---|---|
| 2 | 3 | 0.688004 |
| 4 | 1 | 0.306182 |
| 5 | 2 | 0.217231 |
Solution
If we have a single dataframe on which to perform selection, as in the setting above, we can use a decorator and a dictionary.
As a first step, let’s build a decorator that prints out the number of users after applying each function:
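A decorator along these lines does the trick (a sketch; the decorator name is my own):

```python
import functools

import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3, 3, 4, 4],
    'data': [0.107515, 0.306182, 0.184724, 0.217231,
             0.688004, 0.284524, 0.990159, 0.466758],
}, index=[0, 4, 1, 5, 2, 6, 3, 7])

def print_counts(func):
    """Print the number of unique users left after applying func."""
    @functools.wraps(func)
    def wrapper(df, *args, **kwargs):
        result = func(df, *args, **kwargs)
        print(f"{func.__name__}: {result['user_id'].nunique()}")
        return result
    return wrapper

@print_counts
def first_five(df):
    return df.head(5)

@print_counts
def n_largest(df, n=3):
    return df.nlargest(n, 'data')

result = n_largest(first_five(df))
result
```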
first_five: 3
n_largest: 3
| | user_id | data |
|---|---|---|
| 2 | 3 | 0.688004 |
| 4 | 1 | 0.306182 |
| 5 | 2 | 0.217231 |
That’s already nice. But I need those counts for the data appendix of my paper, so what I really want is to store the counts in a container that I can turn into a table. To do this, we can store the counts in a dictionary instead of printing them.
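One way to do this is to have the decorator write into a module-level dictionary (sketch; the names are mine):

```python
import functools

import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3, 3, 4, 4],
    'data': [0.107515, 0.306182, 0.184724, 0.217231,
             0.688004, 0.284524, 0.990159, 0.466758],
}, index=[0, 4, 1, 5, 2, 6, 3, 7])

counts = {}  # global container that all decorated functions write to

def store_counts(func):
    """Store the number of unique users left after applying func."""
    @functools.wraps(func)
    def wrapper(df, *args, **kwargs):
        result = func(df, *args, **kwargs)
        counts[func.__name__] = result['user_id'].nunique()
        return result
    return wrapper

@store_counts
def first_five(df):
    return df.head(5)

@store_counts
def n_largest(df, n=3):
    return df.nlargest(n, 'data')

result = n_largest(first_five(df))
print(counts)
```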
| | user_id | data |
|---|---|---|
| 2 | 3 | 0.688004 |
| 4 | 1 | 0.306182 |
| 5 | 2 | 0.217231 |
{'first_five': 3, 'n_largest': 3}
Next, I want to add the number of users at the beginning and the end of the process (the count at the end is identical to the count after the final selection step, but I think it’s worth adding so readers can easily spot the final numbers).
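The start and end counts can be recorded manually around the selection pipeline (continuing the sketch from above):

```python
import functools

import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3, 3, 4, 4],
    'data': [0.107515, 0.306182, 0.184724, 0.217231,
             0.688004, 0.284524, 0.990159, 0.466758],
}, index=[0, 4, 1, 5, 2, 6, 3, 7])

counts = {}

def store_counts(func):
    @functools.wraps(func)
    def wrapper(df, *args, **kwargs):
        result = func(df, *args, **kwargs)
        counts[func.__name__] = result['user_id'].nunique()
        return result
    return wrapper

@store_counts
def first_five(df):
    return df.head(5)

@store_counts
def n_largest(df, n=3):
    return df.nlargest(n, 'data')

counts['start'] = df['user_id'].nunique()
result = n_largest(first_five(df))
counts['end'] = result['user_id'].nunique()
print(counts)
```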
| | user_id | data |
|---|---|---|
| 2 | 3 | 0.688004 |
| 4 | 1 | 0.306182 |
| 5 | 2 | 0.217231 |
{'start': 4, 'first_five': 3, 'n_largest': 3, 'end': 3}
We’re nearly there. Let’s turn this into a table that we can save to disk (as a LaTeX table, say) and automatically import in our paper.
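Starting from the counts dictionary above, pandas can turn the items directly into a table (column names taken from the output below):

```python
import pandas as pd

counts = {'start': 4, 'first_five': 3, 'n_largest': 3, 'end': 3}

table = pd.DataFrame(
    list(counts.items()),
    columns=['Processing step', 'Number of unique users'],
)
table
```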
| | Processing step | Number of unique users |
|---|---|---|
| 0 | start | 4 |
| 1 | first_five | 3 |
| 2 | n_largest | 3 |
| 3 | end | 3 |
Finally, let’s make sure readers of our paper (and we ourselves a few weeks from now) actually understand what’s going on at each step.
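Descriptive labels can replace the function names via a simple mapping (sketch):

```python
import pandas as pd

counts = {'start': 4, 'first_five': 3, 'n_largest': 3, 'end': 3}
table = pd.DataFrame(
    list(counts.items()),
    columns=['Processing step', 'Number of unique users'],
)

# Map each step name to a label a reader can understand
labels = {
    'start': 'Raw dataset',
    'first_five': 'Keep first five observations',
    'n_largest': 'Keep three largest datapoints',
    'end': 'Final dataset',
}
table['Processing step'] = table['Processing step'].map(labels)
table
```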
| | Processing step | Number of unique users |
|---|---|---|
| 0 | Raw dataset | 4 |
| 1 | Keep first five observations | 3 |
| 2 | Keep three largest datapoints | 3 |
| 3 | Final dataset | 3 |
That’s it. We can now export this as a LaTeX table (or some other format) and automatically load it in our paper.
Multiple datasets
Instead of having a single dataframe on which to perform selection, I actually have multiple pieces of a large dataframe (because the full dataframe doesn’t fit into memory). What I want is to perform the data selection on each chunk separately but have the values in the counter object add up so that – at the end – the counts represent the counts for the full dataset. The solution here is to use collections.Counter() instead of a dictionary.
So, my setup is akin to the following:
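A full dataframe of this shape might be created like so (random data again, so values will differ):

```python
import numpy as np
import pandas as pd

# Twelve users with one observation each
df = pd.DataFrame({'user_id': range(12), 'data': np.random.rand(12)})
df
```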
| | user_id | data |
|---|---|---|
| 0 | 0 | 0.507218 |
| 1 | 1 | 0.933454 |
| 2 | 2 | 0.740951 |
| 3 | 3 | 0.654135 |
| 4 | 4 | 0.952187 |
| 5 | 5 | 0.807332 |
| 6 | 6 | 0.742915 |
| 7 | 7 | 0.344259 |
| 8 | 8 | 0.134813 |
| 9 | 9 | 0.952129 |
| 10 | 10 | 0.859282 |
| 11 | 11 | 0.376175 |
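Split into two chunks of six rows each (in the real setting the chunks would be read from disk one at a time):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'user_id': range(12), 'data': np.random.rand(12)})
chunks = [df.iloc[:6], df.iloc[6:]]
```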
| | user_id | data |
|---|---|---|
| 0 | 0 | 0.507218 |
| 1 | 1 | 0.933454 |
| 2 | 2 | 0.740951 |
| 3 | 3 | 0.654135 |
| 4 | 4 | 0.952187 |
| 5 | 5 | 0.807332 |

| | user_id | data |
|---|---|---|
| 6 | 6 | 0.742915 |
| 7 | 7 | 0.344259 |
| 8 | 8 | 0.134813 |
| 9 | 9 | 0.952129 |
| 10 | 10 | 0.859282 |
| 11 | 11 | 0.376175 |
What happens if we use a dict() as our counts object, as we did above?
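Running the pipeline over both chunks with a plain dict, mirroring the decorator from above (sketch):

```python
import functools

import numpy as np
import pandas as pd

df = pd.DataFrame({'user_id': range(12), 'data': np.random.rand(12)})
chunks = [df.iloc[:6], df.iloc[6:]]

counts = {}

def store_counts(func):
    @functools.wraps(func)
    def wrapper(df, *args, **kwargs):
        result = func(df, *args, **kwargs)
        counts[func.__name__] = result['user_id'].nunique()
        return result
    return wrapper

@store_counts
def first_five(df):
    return df.head(5)

@store_counts
def n_largest(df, n=3):
    return df.nlargest(n, 'data')

for chunk in chunks:
    counts['start'] = chunk['user_id'].nunique()
    result = n_largest(first_five(chunk))
    counts['end'] = result['user_id'].nunique()
    print(counts)
```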
{'start': 6, 'first_five': 5, 'n_largest': 3, 'end': 3}
{'start': 6, 'first_five': 5, 'n_largest': 3, 'end': 3}
The counts are replaced rather than added up, which is how updating works for a dictionary:
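A dict’s update() replaces the value for keys that already exist:

```python
d = {'a': 1, 'b': 2}
d.update({'b': 3, 'c': 4})  # 'b' is overwritten, not incremented
d
```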
{'a': 1, 'b': 3, 'c': 4}
collections.Counter() (docs) solves this problem.
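The same pipeline with a Counter, using counts.update() everywhere so values accumulate across chunks (sketch):

```python
import functools
from collections import Counter

import numpy as np
import pandas as pd

df = pd.DataFrame({'user_id': range(12), 'data': np.random.rand(12)})
chunks = [df.iloc[:6], df.iloc[6:]]

counts = Counter()

def store_counts(func):
    @functools.wraps(func)
    def wrapper(df, *args, **kwargs):
        result = func(df, *args, **kwargs)
        counts.update({func.__name__: result['user_id'].nunique()})
        return result
    return wrapper

@store_counts
def first_five(df):
    return df.head(5)

@store_counts
def n_largest(df, n=3):
    return df.nlargest(n, 'data')

for chunk in chunks:
    counts.update({'start': chunk['user_id'].nunique()})
    result = n_largest(first_five(chunk))
    counts.update({'end': result['user_id'].nunique()})
    print(counts)
```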
Counter({'start': 6, 'first_five': 5, 'n_largest': 3, 'end': 3})
Counter({'start': 12, 'first_five': 10, 'n_largest': 6, 'end': 6})
Now, updating adds up the values for each key, just as we want. We can add the same formatting as we did above and are done with our table.
Background
Other cool stuff Counter() can do
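In contrast to a dict, a Counter’s update() adds values for existing keys, and counts may even go negative (the exact construction here is my guess from the output):

```python
from collections import Counter

c = Counter(a=1, b=2)
c.update({'b': 3, 'c': -4})  # 'b' becomes 2 + 3 = 5; 'c' is negative
c
```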
Counter({'a': 1, 'b': 5, 'c': -4})
Counters can also do cool things like this:
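elements() repeats each key according to its count and skips keys with non-positive counts:

```python
from collections import Counter

c = Counter(a=1, b=5, c=-4)
sorted(c.elements())  # 'c' is dropped because its count is negative
```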
['a', 'b', 'b', 'b', 'b', 'b']
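most_common(n) returns the n keys with the highest counts:

```python
from collections import Counter

c = Counter(a=1, b=5, c=-4)
c.most_common(2)
```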
[('b', 5), ('a', 1)]
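Counters also support multiset operations. One operation that produces output like the one below is intersection, which keeps the minimum of each pair of counts and drops non-positive results (the operands here are my guess):

```python
from collections import Counter

# Intersection: min of each count; keys with non-positive results are dropped
Counter(a=1, b=5, c=-4) & Counter(a=3, b=2)
```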
Counter({'a': 1, 'b': 2})
Why is counts a global variable?
Because I want all decorated functions to write to the same counter object.
Often, decorators make use of closures instead, which have access to a nonlocal variable defined inside the outermost function. Let’s look at what happens if we do this for our user counter.
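A closure-based version of the decorator might look like this (sketch; the small example chunk and its data values are made up):

```python
import functools
from collections import Counter

import pandas as pd

def store_counts(func):
    counts = Counter()  # nonlocal: each decorated function gets its own

    @functools.wraps(func)
    def wrapper(df, *args, **kwargs):
        result = func(df, *args, **kwargs)
        counts.update({func.__name__: result['user_id'].nunique()})
        print(counts)
        return result
    return wrapper

@store_counts
def first_five(df):
    return df.head(5)

@store_counts
def largest(df, n=3):
    return df.nlargest(n, 'data')

chunk = pd.DataFrame({'user_id': range(6),
                      'data': [0.51, 0.93, 0.74, 0.65, 0.95, 0.81]})
result = largest(first_five(chunk))
```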
Counter({'first_five': 5})
Counter({'largest': 3})
Now, each decorated function gets its own counter object, which is not what we want here. For more on decorator state retention options, see chapter 39 in Learning Python.
What are closures and nonlocal variables?
(Disclaimer: Just about all of the text and code on closures is taken – sometimes verbatim – from chapter 7 in Fluent Python. So the point here is not to produce new insight, but to absorb the material and write an easily accessible note to my future self.)
Closures are functions that have access to nonlocal variables: variables that are neither local nor global, but are defined inside an outer function within which the closure was defined, and to which the closure retains access.
Let’s look at an example: a simple function that takes one number as an argument and returns the average of all numbers passed to it since its definition. For this, we need a way to store all previously passed values. One way to do this is to define a class with a __call__ method.
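The class-based version (following the example in Fluent Python):

```python
class Averager:
    def __init__(self):
        self.series = []

    def __call__(self, new_value):
        self.series.append(new_value)
        return sum(self.series) / len(self.series)

avg = Averager()
results = avg(10), avg(20), avg(30)
results
```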
(10.0, 15.0, 20.0)
Another way is to use a closure function and store the series of previously passed numbers as a free variable.
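The closure-based version, where series survives between calls as a free variable:

```python
def make_averager():
    series = []  # free variable: part of the closure of averager

    def averager(new_value):
        series.append(new_value)
        return sum(series) / len(series)

    return averager

avg = make_averager()
results = avg(10), avg(20), avg(30)
results
```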
(10.0, 15.0, 20.0)
This gives the same result, but is arguably simpler than defining a class.
We can improve the above function by storing previous results so that we don’t have to calculate the new average from scratch at every function call.
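Instead of re-summing the whole series on every call, we can keep a running count and total:

```python
def make_averager():
    count = 0
    total = 0

    def averager(new_value):
        nonlocal count, total  # assign to the enclosing function's variables
        count += 1
        total += new_value
        return total / count

    return averager

avg = make_averager()
results = avg(10), avg(11), avg(12)
results
```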
(10.0, 10.5, 11.0)
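The timings below were presumably produced with IPython’s %timeit; the exact benchmark isn’t shown, but a stdlib equivalent for the slow averager looks something like this (absolute numbers depend on the machine and workload):

```python
from timeit import timeit

def make_averager():
    series = []

    def averager(new_value):
        series.append(new_value)
        return sum(series) / len(series)

    return averager

slow_avg = make_averager()
# Each call re-sums the whole (growing) series, so total work is quadratic
seconds = timeit('slow_avg(1)', globals=globals(), number=2_000)
```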
233 ms ± 8.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
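And the same benchmark sketch for the fast averager:

```python
from timeit import timeit

def make_averager():
    count = 0
    total = 0

    def averager(new_value):
        nonlocal count, total
        count += 1
        total += new_value
        return total / count

    return averager

fast_avg = make_averager()
# Constant work per call: two additions and one division
seconds = timeit('fast_avg(1)', globals=globals(), number=2_000)
```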
1.69 ms ± 53.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This simple change gives us a massive speedup.
Notice the nonlocal statement inside the averager function. Why do we need this? Let’s see what happens if we don’t specify it:
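Without the nonlocal declaration the call fails (wrapped in try/except here so the snippet can run; the exact error message varies across Python versions):

```python
def make_averager():
    count = 0
    total = 0

    def averager(new_value):
        count += 1  # assignment makes count local to averager
        total += new_value
        return total / count

    return averager

avg = make_averager()
try:
    avg(10)
except UnboundLocalError as err:
    error = err
print(error)
```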
UnboundLocalError: local variable 'count' referenced before assignment
How come our fast averager can’t find count and total even though our slow averager could find series just fine?
The answer lies in Python’s variable scope rules and the difference between assigning to immutable objects and updating mutable ones.
Whenever we assign to a variable inside a function, it is treated as a local variable.
count += 1 is the same as count = count + 1, so we are assigning to count, which makes it a local variable (the same goes for total). We are assigning a new value to count rather than updating it in place because integers are immutable, so they can’t be updated.
Lists are mutable, so series.append() doesn’t create a new list, but merely appends to it, which doesn’t count as an assignment, so that series is not treated as a local variable.
Hence, we need to explicitly tell Python that count and total are nonlocal variables.