A collection of recipes for effectively solving common and not-so-common problems in Pandas.
```python
import numpy as np
import pandas as pd
import seaborn as sns
```
Splitting dataframes based on column values
You want to split the dataframe every time case equals B and store the resulting dataframes in a list.
```python
df = pd.DataFrame(
    data={
        "case": ["A", "A", "A", "B", "A", "A", "B", "A", "A"],
        "data": np.random.randn(9),
    }
)
df
```

|   | case | data |
|---|------|------|
| 0 | A | -0.900566 |
| 1 | A | -1.127128 |
| 2 | A | 0.342823 |
| 3 | B | -0.854102 |
| 4 | A | 1.058840 |
| 5 | A | -0.307737 |
| 6 | B | 0.013741 |
| 7 | A | -1.768852 |
| 8 | A | 0.550569 |
Understanding the cookbook solution
From the cookbook:
```python
dfs = list(
    zip(
        *df.groupby(
            (1 * (df["case"] == "B")).cumsum().rolling(window=3, min_periods=1).median()
        )
    )
)[-1]
dfs
```
( case data
0 A -0.900566
1 A -1.127128
2 A 0.342823
3 B -0.854102,
case data
4 A 1.058840
5 A -0.307737
6 B 0.013741,
case data
7 A -1.768852
8 A 0.550569)
This works. But because it’s so heavily nested and uses methods like rolling() and median() that weren’t really designed for this purpose, the code is impossible to interpret at a glance.
Let’s break this down into separate pieces.
First, the code creates a grouping variable that changes its value each time case equals B on the previous row.
```python
# Creating grouping variable
a = df.case == "B"
b = 1 * (df.case == "B")
c = 1 * (df.case == "B").cumsum()
d = 1 * (df.case == "B").cumsum().rolling(window=3, min_periods=1).median()
a, b, c, d
```
(0 False
1 False
2 False
3 True
4 False
5 False
6 True
7 False
8 False
Name: case, dtype: bool,
0 0
1 0
2 0
3 1
4 0
5 0
6 1
7 0
8 0
Name: case, dtype: int64,
0 0
1 0
2 0
3 1
4 1
5 1
6 2
7 2
8 2
Name: case, dtype: int64,
0 0.0
1 0.0
2 0.0
3 0.0
4 1.0
5 1.0
6 1.0
7 2.0
8 2.0
Name: case, dtype: float64)
Series d above is the argument passed to groupby() in the solution. This works, but it is a very roundabout way to create such a series. I’ll use a different approach below.
Next, the code uses list(), zip(), and argument expansion to pack the data for each group into a single list of dataframes. Let’s look at these one by one.
First, iterating over the object returned by groupby() yields the grouped data as (label, df) tuples.
```python
groups = df.groupby("case")
for g in groups:
    print(type(g))
    print(g, end="\n\n")
```
<class 'tuple'>
('A', case data
0 A -0.900566
1 A -1.127128
2 A 0.342823
4 A 1.058840
5 A -0.307737
7 A -1.768852
8 A 0.550569)
<class 'tuple'>
('B', case data
3 B -0.854102
6 B 0.013741)
Simplified, this is what we work with:
```python
groups2 = [("g1", "data1"), ("g2", "data2")]
groups2
```
[('g1', 'data1'), ('g2', 'data2')]
Argument expansion unpacks elements from a list as separate arguments that can be passed to a function. In our context here, it turns each (label, df) tuple into a separate object.
```python
print(groups2)
print(*groups2)
```
[('g1', 'data1'), ('g2', 'data2')]
('g1', 'data1') ('g2', 'data2')
zip() stitches together the ith elements of each iterable passed to it, effectively separating the “columns”, and returns an iterator.
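Applied to the simplified groups, the two outputs below were presumably produced by calls along these lines:

```python
zip(*groups2)        # a lazy iterator, displayed as something like <zip at 0x...>
list(zip(*groups2))  # consuming it yields the separated "columns"
```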
<zip at 0x164d84200>
[('g1', 'g2'), ('data1', 'data2')]
Putting this all together, zip() is used to separate the group labels from the data, and list() consumes the iterator created by zip() and displays its content.
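The output below was presumably produced by applying this to the groups object created from df.groupby("case") above:

```python
list(zip(*groups))
```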
[('A', 'B'),
( case data
0 A -0.900566
1 A -1.127128
2 A 0.342823
4 A 1.058840
5 A -0.307737
7 A -1.768852
8 A 0.550569,
case data
3 B -0.854102
6 B 0.013741)]
Because we only want the data, we select the last element from the list:
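Presumably with a call like:

```python
list(zip(*groups))[-1]
```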
( case data
0 A 0.684978
1 A 0.000269
2 A -1.040497
4 A 0.448596
5 A 0.222168
7 A -2.208787
8 A -0.440758,
case data
3 B 0.451358
6 B 1.031011)
Now we’re basically done. What remains is to apply the list(zip(*groups)) procedure to the more complicated grouping variable to obtain the original result.
```python
d = 1 * (df.case == "B").cumsum().rolling(window=3, min_periods=1).median()
groups = df.groupby(d)
list(zip(*groups))[-1]
```
( case data
0 A 0.684978
1 A 0.000269
2 A -1.040497
3 B 0.451358,
case data
4 A 0.448596
5 A 0.222168
6 B 1.031011,
case data
7 A -2.208787
8 A -0.440758)
Simplifying the code
I think this can be made much more readable like so:
```python
grouper = df.case.eq("B").cumsum().shift().fillna(0)
dfs = [df for (g, df) in df.groupby(grouper)]
dfs
```
[ case data
0 A -0.900566
1 A -1.127128
2 A 0.342823
3 B -0.854102,
case data
4 A 1.058840
5 A -0.307737
6 B 0.013741,
case data
7 A -1.768852
8 A 0.550569]
Where the grouper works like so:
```python
dd = df.set_index("case", drop=False)  # Use case as index for clarity
a = dd.case.eq("B")  # Boolean logic
b = a.cumsum()  # Create groups
c = b.shift()  # Shift so B included in previous group
d = c.fillna(0)  # Replace 0th element emptied by shift
a, b, c, d
```
(case
A False
A False
A False
B True
A False
A False
B True
A False
A False
Name: case, dtype: bool,
case
A 0
A 0
A 0
B 1
A 1
A 1
B 2
A 2
A 2
Name: case, dtype: int64,
case
A NaN
A 0.0
A 0.0
B 0.0
A 1.0
A 1.0
B 1.0
A 2.0
A 2.0
Name: case, dtype: float64,
case
A 0.0
A 0.0
A 0.0
B 0.0
A 1.0
A 1.0
B 1.0
A 2.0
A 2.0
Name: case, dtype: float64)
Creating new columns based on existing ones using mappings
The below is a straightforward adaptation from the cookbook:
```python
df = pd.DataFrame({"AAA": [1, 2, 1, 3], "BBB": [1, 1, 4, 2], "CCC": [2, 1, 3, 1]})
source_cols = ["AAA", "BBB"]
new_cols = [str(c) + "_cat" for c in source_cols]
cats = {1: "One", 2: "Two", 3: "Three"}
dd = df.copy()
dd[new_cols] = df[source_cols].applymap(cats.get)
dd
```

|   | AAA | BBB | CCC | AAA_cat | BBB_cat |
|---|-----|-----|-----|---------|---------|
| 0 | 1 | 1 | 2 | One | One |
| 1 | 2 | 1 | 1 | Two | One |
| 2 | 1 | 4 | 3 | One | None |
| 3 | 3 | 2 | 1 | Three | Two |
But it made me wonder why applymap requires the use of the get method, while we can map the values of a series like so:
```python
s = pd.Series([1, 2, 3, 1])
s.map(cats)
```
0 One
1 Two
2 Three
3 One
dtype: object
or like so:
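Presumably the cell producing the identical output below passed the dictionary’s get method rather than the dictionary itself:

```python
s.map(cats.get)
```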
0 One
1 Two
2 Three
3 One
dtype: object
The answer is simple: applymap requires a function as argument, while map takes functions or mappings.
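For a dataframe, one way to stay with mappings rather than functions (not from the cookbook, just a sketch) is to map column by column, since apply hands each column to the function as a Series:

```python
# each column is a Series, and Series.map accepts the dict directly
df[source_cols].apply(lambda col: col.map(cats))
```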
One limitation of the cookbook solution above is that it doesn’t seem to allow for default values (notice that 4 gets substituted with “None”).
One way around this is the following:
```python
df[new_cols] = df[source_cols].applymap(lambda x: cats.get(x, "Hello"))
df
```

|   | AAA | BBB | CCC | AAA_cat | BBB_cat |
|---|-----|-----|-----|---------|---------|
| 0 | 1 | 1 | 2 | One | One |
| 1 | 2 | 1 | 1 | Two | One |
| 2 | 1 | 4 | 3 | One | Hello |
| 3 | 3 | 2 | 1 | Three | Two |
Creating dummy variables
A reminder to my future self, based on this great video from Data School.
```python
df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5],
        "quality": ["good", "excellent", "very good", "excellent", "good"],
    }
)
df.head()
```

|   | id | quality |
|---|----|---------|
| 0 | 1 | good |
| 1 | 2 | excellent |
| 2 | 3 | very good |
| 3 | 4 | excellent |
| 4 | 5 | good |
Pandas makes creating dummies easy:
```python
pd.get_dummies(df.quality)
```

|   | excellent | good | very good |
|---|-----------|------|-----------|
| 0 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 |
| 3 | 1 | 0 | 0 |
| 4 | 0 | 1 | 0 |
If you want to label the source of the data, you can use the prefix argument:
```python
pd.get_dummies(df.quality, prefix="quality")
```

|   | quality_excellent | quality_good | quality_very good |
|---|-------------------|--------------|-------------------|
| 0 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 |
| 3 | 1 | 0 | 0 |
| 4 | 0 | 1 | 0 |
Often when we work with dummies from a variable with $n$ distinct values, we create $n-1$ dummies and treat the remaining group as the reference group. Pandas provides a convenient way to do this:
```python
pd.get_dummies(df.quality, prefix="quality", drop_first=True)
```

|   | quality_good | quality_very good |
|---|--------------|-------------------|
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 2 | 0 | 1 |
| 3 | 0 | 0 |
| 4 | 1 | 0 |
Usually, we’ll want to use the dummies together with the rest of the data, so it’s convenient to have them in the original dataframe. One way to do this is to use concat like so:
```python
dummies = pd.get_dummies(df.quality, prefix="quality", drop_first=True)
df_with_dummies = pd.concat([df, dummies], axis=1)
df_with_dummies.head()
```

|   | id | quality | quality_good | quality_very good |
|---|----|---------|--------------|-------------------|
| 0 | 1 | good | 1 | 0 |
| 1 | 2 | excellent | 0 | 0 |
| 2 | 3 | very good | 0 | 1 |
| 3 | 4 | excellent | 0 | 0 |
| 4 | 5 | good | 1 | 0 |
This works. But Pandas provides a much easier way:
```python
df_with_dummies1 = pd.get_dummies(df, columns=["quality"], drop_first=True)
df_with_dummies1
```

|   | id | quality_good | quality_very good |
|---|----|--------------|-------------------|
| 0 | 1 | 1 | 0 |
| 1 | 2 | 0 | 0 |
| 2 | 3 | 0 | 1 |
| 3 | 4 | 0 | 0 |
| 4 | 5 | 1 | 0 |
That’s it. In one line we get a new dataframe that includes the dummies and excludes the original quality column.
Pivot tables
```python
df = sns.load_dataset("planets")
print(df.shape)
df.head(3)
```

(1035, 6)

|   | method | number | orbital_period | mass | distance | year |
|---|--------|--------|----------------|------|----------|------|
| 0 | Radial Velocity | 1 | 269.300 | 7.10 | 77.40 | 2006 |
| 1 | Radial Velocity | 1 | 874.774 | 2.21 | 56.95 | 2008 |
| 2 | Radial Velocity | 1 | 763.000 | 2.60 | 19.84 | 2011 |
Create a table that shows the number of planets discovered by each method in each decade
```python
# method 1: groupby
decade = df.year // 10 * 10
decade = decade.astype("str") + "s"
decade.name = "decade"
df.groupby(["method", decade]).number.sum().unstack().fillna(0).astype("int")
```

| method | 1980s | 1990s | 2000s | 2010s |
|--------|-------|-------|-------|-------|
| Astrometry | 0 | 0 | 0 | 2 |
| Eclipse Timing Variations | 0 | 0 | 5 | 10 |
| Imaging | 0 | 0 | 29 | 21 |
| Microlensing | 0 | 0 | 12 | 15 |
| Orbital Brightness Modulation | 0 | 0 | 0 | 5 |
| Pulsar Timing | 0 | 9 | 1 | 1 |
| Pulsation Timing Variations | 0 | 0 | 1 | 0 |
| Radial Velocity | 1 | 52 | 475 | 424 |
| Transit | 0 | 0 | 64 | 712 |
| Transit Timing Variations | 0 | 0 | 0 | 9 |
```python
# method 2: pivot table
df.pivot_table(
    values="number", index="method", columns=decade, aggfunc="sum", fill_value=0
)
```

| method | 1980s | 1990s | 2000s | 2010s |
|--------|-------|-------|-------|-------|
| Astrometry | 0 | 0 | 0 | 2 |
| Eclipse Timing Variations | 0 | 0 | 5 | 10 |
| Imaging | 0 | 0 | 29 | 21 |
| Microlensing | 0 | 0 | 12 | 15 |
| Orbital Brightness Modulation | 0 | 0 | 0 | 5 |
| Pulsar Timing | 0 | 9 | 1 | 1 |
| Pulsation Timing Variations | 0 | 0 | 1 | 0 |
| Radial Velocity | 1 | 52 | 475 | 424 |
| Transit | 0 | 0 | 64 | 712 |
| Transit Timing Variations | 0 | 0 | 0 | 9 |
Another example
From here
```python
df = pd.DataFrame(
    data={
        "province": ["ON", "QC", "BC", "AL", "AL", "MN", "ON"],
        "city": [
            "Toronto",
            "Montreal",
            "Vancouver",
            "Calgary",
            "Edmonton",
            "Winnipeg",
            "Windsor",
        ],
        "sales": [13, 6, 16, 8, 4, 3, 1],
    }
)
df
```

|   | province | city | sales |
|---|----------|------|-------|
| 0 | ON | Toronto | 13 |
| 1 | QC | Montreal | 6 |
| 2 | BC | Vancouver | 16 |
| 3 | AL | Calgary | 8 |
| 4 | AL | Edmonton | 4 |
| 5 | MN | Winnipeg | 3 |
| 6 | ON | Windsor | 1 |
You want to group sales by province and get a subtotal for each province.
```python
table = (
    df.pivot_table(
        values="sales", index="province", columns="city", aggfunc="sum", margins=True
    )
    .stack()
    .drop("All")
)
table
```
province city
AL Calgary 8.0
Edmonton 4.0
All 12.0
BC Vancouver 16.0
All 16.0
MN Winnipeg 3.0
All 3.0
ON Toronto 13.0
Windsor 1.0
All 14.0
QC Montreal 6.0
All 6.0
dtype: float64
Counting number of equal adjacent values
In the below column, cumulatively count the number of adjacent equal values.
```python
df = pd.DataFrame([1, 5, 7, 7, 1, 1, 1, 0], columns=["a"])
df
```
Solution based on this Stack Overflow answer:
```python
df["count"] = df.groupby((df.a != df.a.shift()).cumsum()).cumcount().add(1)
df
```

|   | a | count |
|---|---|-------|
| 0 | 1 | 1 |
| 1 | 5 | 1 |
| 2 | 7 | 1 |
| 3 | 7 | 2 |
| 4 | 1 | 1 |
| 5 | 1 | 2 |
| 6 | 1 | 3 |
| 7 | 0 | 1 |
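To see why this works, it may help to look at the grouper on its own (a quick sketch): df.a != df.a.shift() flags the first element of each run of equal values, and the cumulative sum turns those flags into a run id that cumcount() then counts within.

```python
starts = df.a != df.a.shift()         # True at the first element of each run
run_id = starts.cumsum()              # 1, 2, 3, 3, 4, 4, 4, 5
df.groupby(run_id).cumcount().add(1)  # position within each run, starting at 1
```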
Reset cumsum() at each missing value
```python
s = pd.Series([1.0, 3.0, 1.0, np.nan, 1.0, 1.0, 1.0, 1.0, np.nan, 1.0])
s
```
0 1.0
1 3.0
2 1.0
3 NaN
4 1.0
5 1.0
6 1.0
7 1.0
8 NaN
9 1.0
dtype: float64
What I don’t want:
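This is just the plain cumulative sum, presumably produced by:

```python
s.cumsum()
```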
0 1.0
1 4.0
2 5.0
3 NaN
4 6.0
5 7.0
6 8.0
7 9.0
8 NaN
9 10.0
dtype: float64
Instead, I want to reset the counter to zero after each missing value. Solution from this SO answer.
```python
cumsum = s.cumsum().ffill()
reset = -cumsum[s.isna()].diff().fillna(cumsum)
result = s.where(s.notna(), reset).cumsum()
result
```
0 1.0
1 4.0
2 5.0
3 0.0
4 1.0
5 2.0
6 3.0
7 4.0
8 0.0
9 1.0
dtype: float64
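The trick becomes clearer when looking at the intermediate pieces (a sketch): at each missing position, reset holds minus the total accumulated since the previous missing value, so the cumulative sum cancels out there and restarts.

```python
cumsum = s.cumsum().ffill()                      # running total, carried across the NaNs
reset = -cumsum[s.isna()].diff().fillna(cumsum)  # index 3: -5.0, index 8: -4.0
patched = s.where(s.notna(), reset)              # [1, 3, 1, -5, 1, 1, 1, 1, -4, 1]
patched.cumsum()                                 # hits 0 at each NaN position and restarts
```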
Apply
Using apply with groupby
From the cookbook:
```python
df = pd.DataFrame(
    {
        "animal": "cat dog cat fish dog cat cat".split(),
        "size": list("SSMMMLL"),
        "weight": [8, 10, 11, 1, 20, 12, 12],
        "adult": [False] * 5 + [True] * 2,
    }
)
df
```

|   | animal | size | weight | adult |
|---|--------|------|--------|-------|
| 0 | cat | S | 8 | False |
| 1 | dog | S | 10 | False |
| 2 | cat | M | 11 | False |
| 3 | fish | M | 1 | False |
| 4 | dog | M | 20 | False |
| 5 | cat | L | 12 | True |
| 6 | cat | L | 12 | True |
```python
# Return size of heaviest animal
df.groupby("animal").apply(lambda g: g.loc[g.weight.idxmax(), "size"])
```
animal
cat L
dog M
fish M
dtype: object
Expanding apply
Assume you want to calculate the cumulative return from a series of one-period returns in an expanding fashion – in each period, you want the cumulative return up to that period.
```python
s = pd.Series([i / 100.0 for i in range(1, 4)])
s
```
0 0.01
1 0.02
2 0.03
dtype: float64
The solution is given here.
```python
import functools


def cum_return(x, y):
    return x * (1 + y)


def red(x):
    res = functools.reduce(cum_return, x, 1)
    return res


s.expanding().apply(red, raw=True)
```
0 1.010000
1 1.030200
2 1.061106
dtype: float64
I found that somewhere between bewildering and magical. To see what’s going on, it helps to add a few print statements:
```python
import functools


def cum_return(x, y):
    print("x:", x)
    print("y:", y)
    return x * (1 + y)


def red(x):
    print("Series:", x)
    res = functools.reduce(cum_return, x, 1)
    print("Result:", res)
    print()
    return res


s.expanding().apply(red, raw=True)
```
Series: [0.01]
x: 1
y: 0.01
Result: 1.01
Series: [0.01 0.02]
x: 1
y: 0.01
x: 1.01
y: 0.02
Result: 1.0302
Series: [0.01 0.02 0.03]
x: 1
y: 0.01
x: 1.01
y: 0.02
x: 1.0302
y: 0.03
Result: 1.061106
0 1.010000
1 1.030200
2 1.061106
dtype: float64
This makes transparent how reduce works: it takes the starting value (1 here) as the initial x and the first value of the series as y, and returns the result of cum_return. Next, it uses that result as x and the second element of the series as y, and calculates the new result of cum_return. This is repeated until it has run through the entire series.
What surprised me is that the calculation always restarts from the beginning for each expanding window, rather than reusing the last calculated result. This seems inefficient, but it follows from how expanding().apply() works: each window is passed to the function independently.
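As a quick sanity check on the reduce mechanics, calling it directly on the full list of returns reproduces the final value (a minimal sketch, reusing cum_return from above):

```python
import functools

# fold by hand: start at 1, then multiply by (1 + y) for each one-period return
functools.reduce(cum_return, [0.01, 0.02, 0.03], 1)  # 1.061106
```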
Sort by sum of group values
```python
df = pd.DataFrame(
    {
        "code": ["foo", "bar", "baz"] * 2,
        "data": [0.16, -0.21, 0.33, 0.45, -0.59, 0.62],
        "flag": [False, True] * 3,
    }
)
df
```

|   | code | data | flag |
|---|------|------|------|
| 0 | foo | 0.16 | False |
| 1 | bar | -0.21 | True |
| 2 | baz | 0.33 | False |
| 3 | foo | 0.45 | True |
| 4 | bar | -0.59 | False |
| 5 | baz | 0.62 | True |
```python
g = df.groupby("code")
sort_order = g["data"].transform(sum).sort_values().index
df.loc[sort_order]
```

|   | code | data | flag |
|---|------|------|------|
| 1 | bar | -0.21 | True |
| 4 | bar | -0.59 | False |
| 0 | foo | 0.16 | False |
| 3 | foo | 0.45 | True |
| 2 | baz | 0.33 | False |
| 5 | baz | 0.62 | True |
Get observation with largest data entry for each group
```python
g = df.groupby("code")
g.apply(lambda g: g.loc[g.data.idxmax()])
```

|     | code | data | flag |
|-----|------|------|------|
| bar | bar | -0.21 | True |
| baz | baz | 0.62 | True |
| foo | foo | 0.45 | True |
Expanding group operations
Based on this answer.
```python
df = pd.DataFrame(
    {
        "code": ["foo", "bar", "baz"] * 4,
        "data": [0.16, -0.21, 0.33, 0.45, -0.59, 0.62] * 2,
        "flag": [False, True] * 6,
    }
)
df
```

|    | code | data | flag |
|----|------|------|------|
| 0  | foo | 0.16 | False |
| 1  | bar | -0.21 | True |
| 2  | baz | 0.33 | False |
| 3  | foo | 0.45 | True |
| 4  | bar | -0.59 | False |
| 5  | baz | 0.62 | True |
| 6  | foo | 0.16 | False |
| 7  | bar | -0.21 | True |
| 8  | baz | 0.33 | False |
| 9  | foo | 0.45 | True |
| 10 | bar | -0.59 | False |
| 11 | baz | 0.62 | True |
```python
g = df.groupby("code")


def helper(g):
    s = g.data.expanding()
    g["exp_mean"] = s.mean()
    g["exp_sum"] = s.sum()
    g["exp_count"] = s.count()
    return g


g.apply(helper).sort_values("code")
```

|    | code | data | flag | exp_mean | exp_sum | exp_count |
|----|------|------|------|----------|---------|-----------|
| 1  | bar | -0.21 | True | -0.210000 | -0.21 | 1.0 |
| 4  | bar | -0.59 | False | -0.400000 | -0.80 | 2.0 |
| 7  | bar | -0.21 | True | -0.336667 | -1.01 | 3.0 |
| 10 | bar | -0.59 | False | -0.400000 | -1.60 | 4.0 |
| 2  | baz | 0.33 | False | 0.330000 | 0.33 | 1.0 |
| 5  | baz | 0.62 | True | 0.475000 | 0.95 | 2.0 |
| 8  | baz | 0.33 | False | 0.426667 | 1.28 | 3.0 |
| 11 | baz | 0.62 | True | 0.475000 | 1.90 | 4.0 |
| 0  | foo | 0.16 | False | 0.160000 | 0.16 | 1.0 |
| 3  | foo | 0.45 | True | 0.305000 | 0.61 | 2.0 |
| 6  | foo | 0.16 | False | 0.256667 | 0.77 | 3.0 |
| 9  | foo | 0.45 | True | 0.305000 | 1.22 | 4.0 |
Sort rows by distance to a given value
```python
df = pd.DataFrame({"a": [4, 5, 6, 7], "b": [10, 20, 30, 40], "c": [100, 50, -30, -50]})
df
```

|   | a | b | c |
|---|---|----|-----|
| 0 | 4 | 10 | 100 |
| 1 | 5 | 20 | 50 |
| 2 | 6 | 30 | -30 |
| 3 | 7 | 40 | -50 |
```python
myval = 34
df.loc[(df.c - myval).abs().argsort()]
```

|   | a | b | c |
|---|---|----|-----|
| 1 | 5 | 20 | 50 |
| 2 | 6 | 30 | -30 |
| 0 | 4 | 10 | 100 |
| 3 | 7 | 40 | -50 |
Reminder of what happens here:
```python
a = (df.c - myval).abs()
b = a.argsort()
a, b
```
(0 66
1 16
2 64
3 84
Name: c, dtype: int64,
0 1
1 2
2 0
3 3
Name: c, dtype: int64)
argsort() returns a series of integer positions, so that df.loc[b] returns the reordered dataframe (positions and index labels coincide here because the index is the default 0, 1, 2, …). The first element in b thus refers to the position of the smallest element in a.
Creating separate dataframe for each group
```python
df = sns.load_dataset("iris")
pieces = dict(list(df.groupby("species")))
pieces["setosa"].head(3)
```

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|--------------|-------------|--------------|-------------|---------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
Aggregating
From here
```python
df = pd.DataFrame(
    {
        "StudentID": ["x1", "x10", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"],
        "StudentGender": ["F", "M", "F", "M", "F", "M", "F", "M", "M", "M"],
        "ExamYear": [
            "2007",
            "2007",
            "2007",
            "2008",
            "2008",
            "2008",
            "2008",
            "2009",
            "2009",
            "2009",
        ],
        "Exam": [
            "algebra",
            "stats",
            "bio",
            "algebra",
            "algebra",
            "stats",
            "stats",
            "algebra",
            "bio",
            "bio",
        ],
        "Participated": [
            "no",
            "yes",
            "yes",
            "yes",
            "no",
            "yes",
            "yes",
            "yes",
            "yes",
            "yes",
        ],
        "Passed": ["no", "yes", "yes", "yes", "no", "yes", "yes", "yes", "no", "yes"],
    },
    columns=[
        "StudentID",
        "StudentGender",
        "ExamYear",
        "Exam",
        "Participated",
        "Passed",
    ],
)
df.columns = [str.lower(c) for c in df.columns]
df
```

|   | studentid | studentgender | examyear | exam | participated | passed |
|---|-----------|---------------|----------|------|--------------|--------|
| 0 | x1 | F | 2007 | algebra | no | no |
| 1 | x10 | M | 2007 | stats | yes | yes |
| 2 | x2 | F | 2007 | bio | yes | yes |
| 3 | x3 | M | 2008 | algebra | yes | yes |
| 4 | x4 | F | 2008 | algebra | no | no |
| 5 | x5 | M | 2008 | stats | yes | yes |
| 6 | x6 | F | 2008 | stats | yes | yes |
| 7 | x7 | M | 2009 | algebra | yes | yes |
| 8 | x8 | M | 2009 | bio | yes | no |
| 9 | x9 | M | 2009 | bio | yes | yes |
```python
numyes = lambda x: sum(x == "yes")
df.groupby("examyear").agg({"participated": numyes, "passed": numyes})
```

| examyear | participated | passed |
|----------|--------------|--------|
| 2007 | 2 | 2 |
| 2008 | 3 | 3 |
| 2009 | 3 | 2 |
Count days since last occurrence of event
We want a count of the number of days passed since the last event happened.
```python
dfr = pd.DataFrame(
    {
        "date": pd.date_range(start="1 Jan 2022", periods=8, freq="d"),
        "event": [np.nan, np.nan, 1, np.nan, np.nan, np.nan, 4, np.nan],
        "result": [np.nan, np.nan, 0, 1, 2, 3, 0, 1],
    }
)
dfr
```

|   | date | event | result |
|---|------------|-----|-----|
| 0 | 2022-01-01 | NaN | NaN |
| 1 | 2022-01-02 | NaN | NaN |
| 2 | 2022-01-03 | 1.0 | 0.0 |
| 3 | 2022-01-04 | NaN | 1.0 |
| 4 | 2022-01-05 | NaN | 2.0 |
| 5 | 2022-01-06 | NaN | 3.0 |
| 6 | 2022-01-07 | 4.0 | 0.0 |
| 7 | 2022-01-08 | NaN | 1.0 |
```python
df = dfr.iloc[:, :-1]
df
```

|   | date | event |
|---|------------|-----|
| 0 | 2022-01-01 | NaN |
| 1 | 2022-01-02 | NaN |
| 2 | 2022-01-03 | 1.0 |
| 3 | 2022-01-04 | NaN |
| 4 | 2022-01-05 | NaN |
| 5 | 2022-01-06 | NaN |
| 6 | 2022-01-07 | 4.0 |
| 7 | 2022-01-08 | NaN |
```python
def add_days_since_event(df):
    ones = df.event.ffill().where(lambda s: s.isna(), 1)
    cumsum = ones.cumsum()
    reset = -cumsum[df.event.notna()].diff().fillna(cumsum).sub(1)
    df["result"] = ones.where(df.event.isna(), reset).cumsum()
    return df


add_days_since_event(df)
```

|   | date | event | result |
|---|------------|-----|-----|
| 0 | 2022-01-01 | NaN | NaN |
| 1 | 2022-01-02 | NaN | NaN |
| 2 | 2022-01-03 | 1.0 | 0.0 |
| 3 | 2022-01-04 | NaN | 1.0 |
| 4 | 2022-01-05 | NaN | 2.0 |
| 5 | 2022-01-06 | NaN | 3.0 |
| 6 | 2022-01-07 | 4.0 | 0.0 |
| 7 | 2022-01-08 | NaN | 1.0 |
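This is the same reset idea as in the cumsum recipe above, shifted by one so the count starts at 0 on event days. Roughly, the intermediates look like this (a sketch):

```python
ones = df.event.ffill().where(lambda s: s.isna(), 1)  # NaN before the first event, then all 1s
cumsum = ones.cumsum()                                # NaN, NaN, 1, 2, 3, 4, 5, 6
reset = -cumsum[df.event.notna()].diff().fillna(cumsum).sub(1)  # index 2: -0.0, index 6: -3.0
ones.where(df.event.isna(), reset).cumsum()           # NaN, NaN, 0, 1, 2, 3, 0, 1
```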
Sources