Pandas cookbook

2020-08-09 3619 words 17 minutes

A collection of recipes for effectively solving common and not so common problems in Pandas.

1
2
3
import numpy as np
import pandas as pd
import seaborn as sns

Splitting dataframes based on column values

You want to split the dataframe every time case equals B and store the resulting dataframes in a list.

1
2
3
4
5
6
7
df = pd.DataFrame(
    data={
        "case": ["A", "A", "A", "B", "A", "A", "B", "A", "A"],
        "data": np.random.randn(9),
    }
)
df

	case	data
0	A	-0.900566
1	A	-1.127128
2	A	0.342823
3	B	-0.854102
4	A	1.058840
5	A	-0.307737
6	B	0.013741
7	A	-1.768852
8	A	0.550569

Understanding the cookbook solution

From the cookbook:

1
2
3
4
5
6
7
8
dfs = list(
    zip(
        *df.groupby(
            (1 * (df["case"] == "B")).cumsum().rolling(window=3, min_periods=1).median()
        )
    )
)[-1]
dfs

(  case      data
 0    A -0.900566
 1    A -1.127128
 2    A  0.342823
 3    B -0.854102,
   case      data
 4    A  1.058840
 5    A -0.307737
 6    B  0.013741,
   case      data
 7    A -1.768852
 8    A  0.550569)

This works. But because it’s so heavily nested and uses methods like rolling() and median() not really designed for that purpose, the code is impossible to interpret at a glance.

Let’s break this down into separate pieces.

First, the code creates a grouping variable that changes its value each time case equaled B on the previous row.

1
2
3
4
5
6
7
8
# Creating grouping variable

a = df.case == "B"
b = 1 * (df.case == "B")
c = 1 * (df.case == "B").cumsum()
d = 1 * (df.case == "B").cumsum().rolling(window=3, min_periods=1).median()

a, b, c, d

(0    False
 1    False
 2    False
 3     True
 4    False
 5    False
 6     True
 7    False
 8    False
 Name: case, dtype: bool,
 0    0
 1    0
 2    0
 3    1
 4    0
 5    0
 6    1
 7    0
 8    0
 Name: case, dtype: int64,
 0    0
 1    0
 2    0
 3    1
 4    1
 5    1
 6    2
 7    2
 8    2
 Name: case, dtype: int64,
 0    0.0
 1    0.0
 2    0.0
 3    0.0
 4    1.0
 5    1.0
 6    1.0
 7    2.0
 8    2.0
 Name: case, dtype: float64)

Series d above is the argument passed to groupby() in the solution. This works, but is a very roundabout way to create such a series. I’ll use a different approach below.

Next, the code uses list(), zip(), and argument expansion to pack the data for each group into a single list of dataframes. Let’s look at these one by one.

First, groupby() stores the grouped data as (label, df) tuples.

1
2
3
4
groups = df.groupby("case")
for g in groups:
    print(type(g))
    print(g, end="\n\n")

<class 'tuple'>
('A',   case      data
0    A -0.900566
1    A -1.127128
2    A  0.342823
4    A  1.058840
5    A -0.307737
7    A -1.768852
8    A  0.550569)

<class 'tuple'>
('B',   case      data
3    B -0.854102
6    B  0.013741)

Simplified, this is what we work with:

1
2
groups2 = [("g1", "data1"), ("g2", "data2")]
groups2

[('g1', 'data1'), ('g2', 'data2')]

Argument expansion unpacks elements from a list as separate arguments that can be passed to a function. In our context here, it turns each (label, df) tuple into a separate object.

1
2
print(groups2)
print(*groups2)

[('g1', 'data1'), ('g2', 'data2')]
('g1', 'data1') ('g2', 'data2')

zip() stitches together the ith elements of each iterable passed to it, effectively separating the “columns”, and returns an iterator.

1
zip(*groups2)

<zip at 0x164d84200>

1
list(zip(*groups2))

[('g1', 'g2'), ('data1', 'data2')]

Putting this all together, zip() is used to separate the group label from the data, and list() consumes the iterator created by zip and displays its content.

1
list(zip(*groups))

[('A', 'B'),
 (  case      data
  0    A -0.900566
  1    A -1.127128
  2    A  0.342823
  4    A  1.058840
  5    A -0.307737
  7    A -1.768852
  8    A  0.550569,
    case      data
  3    B -0.854102
  6    B  0.013741)]

Because we only want the data, we select the last element from the list:

1
list(zip(*groups))[-1]

(  case      data
 0    A  0.684978
 1    A  0.000269
 2    A -1.040497
 4    A  0.448596
 5    A  0.222168
 7    A -2.208787
 8    A -0.440758,
   case      data
 3    B  0.451358
 6    B  1.031011)

Now we’re basically done. What remains is to use the list(zip(*groups)) procedure on the more complicated grouping variable, to obtain the original result.

1
2
3
d = 1 * (df.case == "B").cumsum().rolling(window=3, min_periods=1).median()
groups = df.groupby(d)
list(zip(*groups))[-1]

(  case      data
 0    A  0.684978
 1    A  0.000269
 2    A -1.040497
 3    B  0.451358,
   case      data
 4    A  0.448596
 5    A  0.222168
 6    B  1.031011,
   case      data
 7    A -2.208787
 8    A -0.440758)

Simplifying the code

I think this can be made much more readable like so:

1
df

	case	data
0	A	-0.900566
1	A	-1.127128
2	A	0.342823
3	B	-0.854102
4	A	1.058840
5	A	-0.307737
6	B	0.013741
7	A	-1.768852
8	A	0.550569

1
2
3
grouper = df.case.eq("B").cumsum().shift().fillna(0)
dfs = [df for (g, df) in df.groupby(grouper)]
dfs

[  case      data
 0    A -0.900566
 1    A -1.127128
 2    A  0.342823
 3    B -0.854102,
   case      data
 4    A  1.058840
 5    A -0.307737
 6    B  0.013741,
   case      data
 7    A -1.768852
 8    A  0.550569]

Where the grouper works like so:

1
2
3
4
5
6
dd = df.set_index("case", drop=False)  # Use case as index for clarity
a = dd.case.eq("B")  # Boolean logic
b = a.cumsum()  # Create groups
c = b.shift()  # Shift so B included in previous group
d = c.fillna(0)  # Replace 0th element emptied by shift
a, b, c, d

(case
 A    False
 A    False
 A    False
 B     True
 A    False
 A    False
 B     True
 A    False
 A    False
 Name: case, dtype: bool,
 case
 A    0
 A    0
 A    0
 B    1
 A    1
 A    1
 B    2
 A    2
 A    2
 Name: case, dtype: int64,
 case
 A    NaN
 A    0.0
 A    0.0
 B    0.0
 A    1.0
 A    1.0
 B    1.0
 A    2.0
 A    2.0
 Name: case, dtype: float64,
 case
 A    0.0
 A    0.0
 A    0.0
 B    0.0
 A    1.0
 A    1.0
 B    1.0
 A    2.0
 A    2.0
 Name: case, dtype: float64)

Creating new columns based on existing ones using mappings

The below is a straightforward adaptation from the cookbook:

1
2
3
4
5
6
7
8
9
df = pd.DataFrame({"AAA": [1, 2, 1, 3], "BBB": [1, 1, 4, 2], "CCC": [2, 1, 3, 1]})

source_cols = ["AAA", "BBB"]
new_cols = [str(c) + "_cat" for c in source_cols]
cats = {1: "One", 2: "Two", 3: "Three"}

dd = df.copy()
dd[new_cols] = df[source_cols].applymap(cats.get)
dd

	AAA	BBB	CCC	AAA_cat	BBB_cat
0	1	1	2	One	One
1	2	1	1	Two	One
2	1	4	3	One	None
3	3	2	1	Three	Two

But it made me wonder why applymap required the use of the get method while we can map values of a series like so:

1
2
s = pd.Series([1, 2, 3, 1])
s.map(cats)

0      One
1      Two
2    Three
3      One
dtype: object

or so

1
s.map(cats.get)

0      One
1      Two
2    Three
3      One
dtype: object

The answer is simple: applymap requires a function as argument, while map takes functions or mappings.

One limitation of the cookbook solution above is that is doesn’t seem to allow for default values (notice that 4 gets substituted with “None”).

One way around this is the following:

1
2
df[new_cols] = df[source_cols].applymap(lambda x: cats.get(x, "Hello"))
df

	AAA	BBB	CCC	AAA_cat	BBB_cat
0	1	1	2	One	One
1	2	1	1	Two	One
2	1	4	3	One	Hello
3	3	2	1	Three	Two

Creating dummy variables

A reminder to my future self, based on this great video from Data School.

1
2
3
4
5
6
7
df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5],
        "quality": ["good", "excellent", "very good", "excellent", "good"],
    }
)
df.head()

	id	quality
0	1	good
1	2	excellent
2	3	very good
3	4	excellent
4	5	good

Pandas makes creating dummies easy:

1
pd.get_dummies(df.quality)

	excellent	good	very good
0	0	1	0
1	1	0	0
2	0	0	1
3	1	0	0
4	0	1	0

If you want to label the source of the data, you can use the prefix argument:

1
pd.get_dummies(df.quality, prefix="quality")

	quality_excellent	quality_good	quality_very good
0	0	1	0
1	1	0	0
2	0	0	1
3	1	0	0
4	0	1	0

Often when we work with dummies from a variable with $n$ distinct values, we create $n-1$ dummies and treat the remaining group as the reference group. Pandas provides a convenient way to do this:

1
pd.get_dummies(df.quality, prefix="quality", drop_first=True)

	quality_good	quality_very good
0	1	0
1	0	0
2	0	1
3	0	0
4	1	0

Usually, we’ll want to use the dummies with the rest of the data, so it’s conveninet to have them in the original dataframe. One way to do this is to use concat like so:

1
2
3
dummies = pd.get_dummies(df.quality, prefix="quality", drop_first=True)
df_with_dummies = pd.concat([df, dummies], axis=1)
df_with_dummies.head()

	id	quality	quality_good	quality_very good
0	1	good	1	0
1	2	excellent	0	0
2	3	very good	0	1
3	4	excellent	0	0
4	5	good	1	0

This works. But Pandas provides a much easier way:

1
2
df_with_dummies1 = pd.get_dummies(df, columns=["quality"], drop_first=True)
df_with_dummies1

	id	quality_good	quality_very good
0	1	1	0
1	2	0	0
2	3	0	1
3	4	0	0
4	5	1	0

That’s it. In one line we get a new dataframe that includes the dummies and excludes the original quality column.

Pivot tables

1
2
3
df = sns.load_dataset("planets")
print(df.shape)
df.head(3)

(1035, 6)

	method	number	orbital_period	mass	distance	year
0	Radial Velocity	1	269.300	7.10	77.40	2006
1	Radial Velocity	1	874.774	2.21	56.95	2008
2	Radial Velocity	1	763.000	2.60	19.84	2011

Create a table that shows the number of planets discovered by each method in each decade

1
2
3
4
5
6
7
# method 1: groupby

decade = df.year // 10 * 10
decade = decade.astype("str") + "s"
decade.name = "decade"

df.groupby(["method", decade]).number.sum().unstack().fillna(0).astype("int")

decade	1980s	1990s	2000s	2010s
method
Astrometry	0	0	0	2
Eclipse Timing Variations	0	0	5	10
Imaging	0	0	29	21
Microlensing	0	0	12	15
Orbital Brightness Modulation	0	0	0	5
Pulsar Timing	0	9	1	1
Pulsation Timing Variations	0	0	1	0
Radial Velocity	1	52	475	424
Transit	0	0	64	712
Transit Timing Variations	0	0	0	9

1
2
3
4
5
# method 2: pivot table

df.pivot_table(
    values="number", index="method", columns=decade, aggfunc="sum", fill_value=0
)

decade	1980s	1990s	2000s	2010s
method
Astrometry	0	0	0	2
Eclipse Timing Variations	0	0	5	10
Imaging	0	0	29	21
Microlensing	0	0	12	15
Orbital Brightness Modulation	0	0	0	5
Pulsar Timing	0	9	1	1
Pulsation Timing Variations	0	0	1	0
Radial Velocity	1	52	475	424
Transit	0	0	64	712
Transit Timing Variations	0	0	0	9

Another example

From here

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
df = pd.DataFrame(
    data={
        "province": ["ON", "QC", "BC", "AL", "AL", "MN", "ON"],
        "city": [
            "Toronto",
            "Montreal",
            "Vancouver",
            "Calgary",
            "Edmonton",
            "Winnipeg",
            "Windsor",
        ],
        "sales": [13, 6, 16, 8, 4, 3, 1],
    }
)
df

	province	city	sales
0	ON	Toronto	13
1	QC	Montreal	6
2	BC	Vancouver	16
3	AL	Calgary	8
4	AL	Edmonton	4
5	MN	Winnipeg	3
6	ON	Windsor	1

You want to group sales by province and get subtotal for total state.

1
2
3
4
5
6
7
8
table = (
    df.pivot_table(
        values="sales", index="province", columns="city", aggfunc="sum", margins=True
    )
    .stack()
    .drop("All")
)
table

province  city     
AL        Calgary       8.0
          Edmonton      4.0
          All          12.0
BC        Vancouver    16.0
          All          16.0
MN        Winnipeg      3.0
          All           3.0
ON        Toronto      13.0
          Windsor       1.0
          All          14.0
QC        Montreal      6.0
          All           6.0
dtype: float64

Counting number of equal adjacent values

In the below column, cumulatively count the number of adjacent equal values.

1
2
df = pd.DataFrame([1, 5, 7, 7, 1, 1, 1, 0], columns=["a"])
df

	a
0	1
1	5
2	7
3	7
4	1
5	1
6	1
7	0

Solution based on this Stack Overflow answer:

1
2
df["count"] = df.groupby((df.a != df.a.shift()).cumsum()).cumcount().add(1)
df

	a	count
0	1	1
1	5	1
2	7	1
3	7	2
4	1	1
5	1	2
6	1	3
7	0	1

Reset `cumsum()` at each missing value

1
2
s = pd.Series([1.0, 3.0, 1.0, np.nan, 1.0, 1.0, 1.0, 1.0, np.nan, 1.0])
s

0    1.0
1    3.0
2    1.0
3    NaN
4    1.0
5    1.0
6    1.0
7    1.0
8    NaN
9    1.0
dtype: float64

What I don’t want:

1
s.cumsum()

0     1.0
1     4.0
2     5.0
3     NaN
4     6.0
5     7.0
6     8.0
7     9.0
8     NaN
9    10.0
dtype: float64

Instead, I want to reset the counter to zero after each missing value. Solution from this SO answer.

1
2
3
4
cumsum = s.cumsum().ffill()
reset = -cumsum[s.isna()].diff().fillna(cumsum)
result = s.where(s.notna(), reset).cumsum()
result

0    1.0
1    4.0
2    5.0
3    0.0
4    1.0
5    2.0
6    3.0
7    4.0
8    0.0
9    1.0
dtype: float64

Apply

Using apply with groupby

From the cookbook:

1
2
3
4
5
6
7
8
9
df = pd.DataFrame(
    {
        "animal": "cat dog cat fish dog cat cat".split(),
        "size": list("SSMMMLL"),
        "weight": [8, 10, 11, 1, 20, 12, 12],
        "adult": [False] * 5 + [True] * 2,
    }
)
df

	animal	size	weight	adult
0	cat	S	8	False
1	dog	S	10	False
2	cat	M	11	False
3	fish	M	1	False
4	dog	M	20	False
5	cat	L	12	True
6	cat	L	12	True

1
2
3
# Return size of heaviest animal

df.groupby("animal").apply(lambda g: g.loc[g.weight.idxmax(), "size"])

animal
cat     L
dog     M
fish    M
dtype: object

Expanding apply

Assume you want to calculate the cumulative return from a series of one-period returns in an expanding fashion – in each period, you want the cumulative return up to that period.

1
2
s = pd.Series([i / 100.0 for i in range(1, 4)])
s

0    0.01
1    0.02
2    0.03
dtype: float64

The solution is given here.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
import functools


def cum_return(x, y):
    return x * (1 + y)


def red(x):
    res = functools.reduce(cum_return, x, 1)
    return res


s.expanding().apply(red, raw=True)

0    1.010000
1    1.030200
2    1.061106
dtype: float64

I found that somewhere between bewildering and magical. To see what’s going on, it helps to add a few print statements:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
import functools


def cum_return(x, y):
    print("x:", x)
    print("y:", y)
    return x * (1 + y)


def red(x):
    print("Series:", x)
    res = functools.reduce(cum_return, x, 1)
    print("Result:", res)
    print()
    return res


s.expanding().apply(red, raw=True)

Series: [0.01]
x: 1
y: 0.01
Result: 1.01

Series: [0.01 0.02]
x: 1
y: 0.01
x: 1.01
y: 0.02
Result: 1.0302

Series: [0.01 0.02 0.03]
x: 1
y: 0.01
x: 1.01
y: 0.02
x: 1.0302
y: 0.03
Result: 1.061106

0    1.010000
1    1.030200
2    1.061106
dtype: float64

This makes transparent how reduce works: it takes the starting value (1 here) as the initial x value and the first value of the series as y value, and then returns the result of cum_returns. Next, it uses that result as x, and the second element in the series as y, and calculates the new result of cum_returns. This is then repeated until it has run through the entire series.

What surprised me is to see that reduce always starts the calculation from the beginning, rather than re-using the last calculated result. This seems inefficient, but is probably necessary for some reason.

Sort by sum of group values

1
2
3
4
5
6
7
8
9
df = pd.DataFrame(
    {
        "code": ["foo", "bar", "baz"] * 2,
        "data": [0.16, -0.21, 0.33, 0.45, -0.59, 0.62],
        "flag": [False, True] * 3,
    }
)

df

	code	data	flag
0	foo	0.16	False
1	bar	-0.21	True
2	baz	0.33	False
3	foo	0.45	True
4	bar	-0.59	False
5	baz	0.62	True

1
2
3
g = df.groupby("code")
sort_order = g["data"].transform(sum).sort_values().index
df.loc[sort_order]

	code	data	flag
1	bar	-0.21	True
4	bar	-0.59	False
0	foo	0.16	False
3	foo	0.45	True
2	baz	0.33	False
5	baz	0.62	True

Get observation with largest data entry for each group

1
2
g = df.groupby("code")
g.apply(lambda g: g.loc[g.data.idxmax()])

	code	data	flag
code
bar	bar	-0.21	True
baz	baz	0.62	True
foo	foo	0.45	True

Expanding group operations

Based on this answer.

1
2
3
4
5
6
7
8
df = pd.DataFrame(
    {
        "code": ["foo", "bar", "baz"] * 4,
        "data": [0.16, -0.21, 0.33, 0.45, -0.59, 0.62] * 2,
        "flag": [False, True] * 6,
    }
)
df

	code	data	flag
0	foo	0.16	False
1	bar	-0.21	True
2	baz	0.33	False
3	foo	0.45	True
4	bar	-0.59	False
5	baz	0.62	True
6	foo	0.16	False
7	bar	-0.21	True
8	baz	0.33	False
9	foo	0.45	True
10	bar	-0.59	False
11	baz	0.62	True

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
g = df.groupby("code")


def helper(g):
    s = g.data.expanding()
    g["exp_mean"] = s.mean()
    g["exp_sum"] = s.sum()
    g["exp_count"] = s.count()
    return g


g.apply(helper).sort_values("code")

	code	data	flag	exp_mean	exp_sum	exp_count
1	bar	-0.21	True	-0.210000	-0.21	1.0
4	bar	-0.59	False	-0.400000	-0.80	2.0
7	bar	-0.21	True	-0.336667	-1.01	3.0
10	bar	-0.59	False	-0.400000	-1.60	4.0
2	baz	0.33	False	0.330000	0.33	1.0
5	baz	0.62	True	0.475000	0.95	2.0
8	baz	0.33	False	0.426667	1.28	3.0
11	baz	0.62	True	0.475000	1.90	4.0
0	foo	0.16	False	0.160000	0.16	1.0
3	foo	0.45	True	0.305000	0.61	2.0
6	foo	0.16	False	0.256667	0.77	3.0
9	foo	0.45	True	0.305000	1.22	4.0

Get observation with largest data entry for each group

1
2
df = pd.DataFrame({"a": [4, 5, 6, 7], "b": [10, 20, 30, 40], "c": [100, 50, -30, -50]})
df

	a	b	c
0	4	10	100
1	5	20	50
2	6	30	-30
3	7	40	-50

1
2
myval = 34
df.loc[(df.c - myval).abs().argsort()]

	a	b	c
1	5	20	50
2	6	30	-30
0	4	10	100
3	7	40	-50

Reminder of what happens here:

1
2
3
a = (df.c - myval).abs()
b = a.argsort()
a, b

(0    66
 1    16
 2    64
 3    84
 Name: c, dtype: int64,
 0    1
 1    2
 2    0
 3    3
 Name: c, dtype: int64)

argsort returns a series of indexes, so that df[b] returns an ordered dataframe. The first element in b thus refers to the index of the smallest element in a.

Creating separate dataframe for each group

1
2
3
df = sns.load_dataset("iris")
pieces = dict(list(df.groupby("species")))
pieces["setosa"].head(3)

	sepal_length	sepal_width	petal_length	petal_width	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa

Aggregating

From here

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
df = pd.DataFrame(
    {
        "StudentID": ["x1", "x10", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"],
        "StudentGender": ["F", "M", "F", "M", "F", "M", "F", "M", "M", "M"],
        "ExamYear": [
            "2007",
            "2007",
            "2007",
            "2008",
            "2008",
            "2008",
            "2008",
            "2009",
            "2009",
            "2009",
        ],
        "Exam": [
            "algebra",
            "stats",
            "bio",
            "algebra",
            "algebra",
            "stats",
            "stats",
            "algebra",
            "bio",
            "bio",
        ],
        "Participated": [
            "no",
            "yes",
            "yes",
            "yes",
            "no",
            "yes",
            "yes",
            "yes",
            "yes",
            "yes",
        ],
        "Passed": ["no", "yes", "yes", "yes", "no", "yes", "yes", "yes", "no", "yes"],
    },
    columns=[
        "StudentID",
        "StudentGender",
        "ExamYear",
        "Exam",
        "Participated",
        "Passed",
    ],
)

df.columns = [str.lower(c) for c in df.columns]
df

	studentid	studentgender	examyear	exam	participated	passed
0	x1	F	2007	algebra	no	no
1	x10	M	2007	stats	yes	yes
2	x2	F	2007	bio	yes	yes
3	x3	M	2008	algebra	yes	yes
4	x4	F	2008	algebra	no	no
5	x5	M	2008	stats	yes	yes
6	x6	F	2008	stats	yes	yes
7	x7	M	2009	algebra	yes	yes
8	x8	M	2009	bio	yes	no
9	x9	M	2009	bio	yes	yes

1
2
numyes = lambda x: sum(x == "yes")
df.groupby("examyear").agg({"participated": numyes, "passed": numyes})

	participated	passed
examyear
2007	2	2
2008	3	3
2009	3	2

Count days since last occurrence of event

We want a count of the number of days passed since the last event happened.

1
2
3
4
5
6
7
8
dfr = pd.DataFrame(
    {
        "date": pd.date_range(start="1 Jan 2022", periods=8, freq="d"),
        "event": [np.nan, np.nan, 1, np.nan, np.nan, np.nan, 4, np.nan],
        "result": [np.nan, np.nan, 0, 1, 2, 3, 0, 1],
    }
)
dfr

	date	event	result
0	2022-01-01	NaN	NaN
1	2022-01-02	NaN	NaN
2	2022-01-03	1.0	0.0
3	2022-01-04	NaN	1.0
4	2022-01-05	NaN	2.0
5	2022-01-06	NaN	3.0
6	2022-01-07	4.0	0.0
7	2022-01-08	NaN	1.0

1
2
df = dfr.iloc[:, :-1]
df

	date	event
0	2022-01-01	NaN
1	2022-01-02	NaN
2	2022-01-03	1.0
3	2022-01-04	NaN
4	2022-01-05	NaN
5	2022-01-06	NaN
6	2022-01-07	4.0
7	2022-01-08	NaN

1
2
3
4
5
6
7
8
9
def add_days_since_event(df):
    ones = df.event.ffill().where(lambda s: s.isna(), 1)
    cumsum = ones.cumsum()
    reset = -cumsum[df.event.notna()].diff().fillna(cumsum).sub(1)
    df["result"] = ones.where(df.event.isna(), reset).cumsum()
    return df


add_days_since_event(df)

	date	event	result
0	2022-01-01	NaN	NaN
1	2022-01-02	NaN	NaN
2	2022-01-03	1.0	0.0
3	2022-01-04	NaN	1.0
4	2022-01-05	NaN	2.0
5	2022-01-06	NaN	3.0
6	2022-01-07	4.0	0.0
7	2022-01-08	NaN	1.0

Contents

Pandas cookbook

Splitting dataframes based on column values

Understanding the cookbook solution

Simplifying the code

Creating new columns based on existing ones using mappings

Creating dummy variables

Pivot tables

Another example

Counting number of equal adjacent values

Reset `cumsum()` at each missing value

Apply

Using apply with groupby

Expanding apply

Sort by sum of group values

Get observation with largest data entry for each group

Expanding group operations

Get observation with largest data entry for each group

Creating separate dataframe for each group

Aggregating

Count days since last occurrence of event

Sources

Contents

Pandas cookbook

Splitting dataframes based on column values

Understanding the cookbook solution

Simplifying the code

Creating new columns based on existing ones using mappings

Creating dummy variables

Pivot tables

Another example

Counting number of equal adjacent values

Reset cumsum() at each missing value

Apply

Using apply with groupby

Expanding apply

Sort by sum of group values

Get observation with largest data entry for each group

Expanding group operations

Get observation with largest data entry for each group

Creating separate dataframe for each group

Aggregating

Count days since last occurrence of event

Sources

Reset `cumsum()` at each missing value