
Pandas cookbook

A collection of recipes for effectively solving common and not so common problems in Pandas.

import numpy as np
import pandas as pd
import seaborn as sns

Splitting dataframes based on column values

You want to split the dataframe after every row where case equals "B" and store the resulting dataframes in a list.

df = pd.DataFrame(
    data={
        "case": ["A", "A", "A", "B", "A", "A", "B", "A", "A"],
        "data": np.random.randn(9),
    }
)
df

  case      data
0    A -0.900566
1    A -1.127128
2    A  0.342823
3    B -0.854102
4    A  1.058840
5    A -0.307737
6    B  0.013741
7    A -1.768852
8    A  0.550569

Understanding the cookbook solution

From the cookbook:

dfs = list(
    zip(
        *df.groupby(
            (1 * (df["case"] == "B")).cumsum().rolling(window=3, min_periods=1).median()
        )
    )
)[-1]
dfs
(  case      data
 0    A -0.900566
 1    A -1.127128
 2    A  0.342823
 3    B -0.854102,
   case      data
 4    A  1.058840
 5    A -0.307737
 6    B  0.013741,
   case      data
 7    A -1.768852
 8    A  0.550569)

This works. But because it's so heavily nested and repurposes methods like rolling() and median() for something they weren't designed for, the code is hard to interpret at a glance.

Let’s break this down into separate pieces.

First, the code creates a grouping variable that changes its value on the row after each row where case equals B.

# Creating grouping variable

a = df.case == "B"
b = 1 * (df.case == "B")
c = 1 * (df.case == "B").cumsum()
d = 1 * (df.case == "B").cumsum().rolling(window=3, min_periods=1).median()

a, b, c, d
(0    False
 1    False
 2    False
 3     True
 4    False
 5    False
 6     True
 7    False
 8    False
 Name: case, dtype: bool,
 0    0
 1    0
 2    0
 3    1
 4    0
 5    0
 6    1
 7    0
 8    0
 Name: case, dtype: int64,
 0    0
 1    0
 2    0
 3    1
 4    1
 5    1
 6    2
 7    2
 8    2
 Name: case, dtype: int64,
 0    0.0
 1    0.0
 2    0.0
 3    0.0
 4    1.0
 5    1.0
 6    1.0
 7    2.0
 8    2.0
 Name: case, dtype: float64)

Series d above is the argument passed to groupby() in the solution. This works, but is a very roundabout way to create such a series. I’ll use a different approach below.

Next, the code uses list(), zip(), and argument unpacking (the * operator) to pack the data for each group into a single tuple of dataframes. Let's look at these one by one.

First, iterating over a groupby object yields (label, df) tuples.

groups = df.groupby("case")
for g in groups:
    print(type(g))
    print(g, end="\n\n")
<class 'tuple'>
('A',   case      data
0    A -0.900566
1    A -1.127128
2    A  0.342823
4    A  1.058840
5    A -0.307737
7    A -1.768852
8    A  0.550569)

<class 'tuple'>
('B',   case      data
3    B -0.854102
6    B  0.013741)

Simplified, this is what we work with:

groups2 = [("g1", "data1"), ("g2", "data2")]
groups2
[('g1', 'data1'), ('g2', 'data2')]

Argument unpacking (the * operator) passes the elements of an iterable as separate arguments to a function. In our context here, it passes each (label, df) tuple to zip() as a separate argument.

print(groups2)
print(*groups2)
[('g1', 'data1'), ('g2', 'data2')]
('g1', 'data1') ('g2', 'data2')

zip() stitches together the ith elements of each iterable passed to it, effectively separating the “columns”, and returns an iterator.

zip(*groups2)
<zip at 0x164d84200>
list(zip(*groups2))
[('g1', 'g2'), ('data1', 'data2')]

Putting this all together, zip() is used to separate the group labels from the data, and list() materializes the iterator created by zip() so we can see its contents.

list(zip(*groups))
[('A', 'B'),
 (  case      data
  0    A -0.900566
  1    A -1.127128
  2    A  0.342823
  4    A  1.058840
  5    A -0.307737
  7    A -1.768852
  8    A  0.550569,
    case      data
  3    B -0.854102
  6    B  0.013741)]

Because we only want the data, we select the last element from the list:

list(zip(*groups))[-1]
(  case      data
 0    A  0.684978
 1    A  0.000269
 2    A -1.040497
 4    A  0.448596
 5    A  0.222168
 7    A -2.208787
 8    A -0.440758,
   case      data
 3    B  0.451358
 6    B  1.031011)

Now we’re basically done. What remains is to use the list(zip(*groups)) procedure on the more complicated grouping variable, to obtain the original result.

d = 1 * (df.case == "B").cumsum().rolling(window=3, min_periods=1).median()
groups = df.groupby(d)
list(zip(*groups))[-1]
(  case      data
 0    A  0.684978
 1    A  0.000269
 2    A -1.040497
 3    B  0.451358,
   case      data
 4    A  0.448596
 5    A  0.222168
 6    B  1.031011,
   case      data
 7    A -2.208787
 8    A -0.440758)

Simplifying the code

I think this can be made much more readable like so:

df

  case      data
0    A -0.900566
1    A -1.127128
2    A  0.342823
3    B -0.854102
4    A  1.058840
5    A -0.307737
6    B  0.013741
7    A -1.768852
8    A  0.550569

grouper = df.case.eq("B").cumsum().shift().fillna(0)
dfs = [df for (g, df) in df.groupby(grouper)]
dfs
[  case      data
 0    A -0.900566
 1    A -1.127128
 2    A  0.342823
 3    B -0.854102,
   case      data
 4    A  1.058840
 5    A -0.307737
 6    B  0.013741,
   case      data
 7    A -1.768852
 8    A  0.550569]

Where the grouper works like so:

dd = df.set_index("case", drop=False)  # Use case as index for clarity
a = dd.case.eq("B")  # Boolean logic
b = a.cumsum()  # Create groups
c = b.shift()  # Shift so B included in previous group
d = c.fillna(0)  # Replace 0th element emptied by shift
a, b, c, d
(case
 A    False
 A    False
 A    False
 B     True
 A    False
 A    False
 B     True
 A    False
 A    False
 Name: case, dtype: bool,
 case
 A    0
 A    0
 A    0
 B    1
 A    1
 A    1
 B    2
 A    2
 A    2
 Name: case, dtype: int64,
 case
 A    NaN
 A    0.0
 A    0.0
 B    0.0
 A    1.0
 A    1.0
 B    1.0
 A    2.0
 A    2.0
 Name: case, dtype: float64,
 case
 A    0.0
 A    0.0
 A    0.0
 B    0.0
 A    1.0
 A    1.0
 B    1.0
 A    2.0
 A    2.0
 Name: case, dtype: float64)
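
For comparison, here is a plain loop that performs the same split without groupby at all. This is a sketch I'm adding for illustration, not part of the original recipe:

pieces = []
start = 0
for i, case in enumerate(df["case"]):
    if case == "B":
        # include the "B" row in the current piece, then start a new one
        pieces.append(df.iloc[start : i + 1])
        start = i + 1
if start < len(df):
    # whatever remains after the last "B" forms the final piece
    pieces.append(df.iloc[start:])
len(pieces)  # 3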

Creating new columns based on existing ones using mappings

The code below is a straightforward adaptation from the cookbook:

df = pd.DataFrame({"AAA": [1, 2, 1, 3], "BBB": [1, 1, 4, 2], "CCC": [2, 1, 3, 1]})

source_cols = ["AAA", "BBB"]
new_cols = [str(c) + "_cat" for c in source_cols]
cats = {1: "One", 2: "Two", 3: "Three"}

dd = df.copy()
dd[new_cols] = df[source_cols].applymap(cats.get)
dd

   AAA  BBB  CCC AAA_cat BBB_cat
0    1    1    2     One     One
1    2    1    1     Two     One
2    1    4    3     One    None
3    3    2    1   Three     Two

But it made me wonder why applymap requires the use of the get method, while we can map the values of a series directly like so:

s = pd.Series([1, 2, 3, 1])
s.map(cats)
0      One
1      Two
2    Three
3      One
dtype: object

or so

s.map(cats.get)
0      One
1      Two
2    Three
3      One
dtype: object

The answer is simple: applymap requires a callable as its argument, while map accepts either a callable or a mapping (such as a dict).
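
A quick way to see the difference (a sketch added for illustration): passing the dict itself to applymap raises a TypeError, while Series.map accepts the mapping directly.

try:
    df[source_cols].applymap(cats)  # a dict is not callable
except TypeError as e:
    print("applymap:", e)

df["AAA"].map(cats)  # works: map accepts the mapping itself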

One limitation of the cookbook solution above is that it doesn't allow for default values (notice that 4 gets substituted with None, since dict.get returns None for missing keys).

One way around this is the following:

df[new_cols] = df[source_cols].applymap(lambda x: cats.get(x, "Hello"))
df

   AAA  BBB  CCC AAA_cat BBB_cat
0    1    1    2     One     One
1    2    1    1     Two     One
2    1    4    3     One   Hello
3    3    2    1   Three     Two

Creating dummy variables

A reminder to my future self, based on this great video from Data School.

df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5],
        "quality": ["good", "excellent", "very good", "excellent", "good"],
    }
)
df.head()

   id    quality
0   1       good
1   2  excellent
2   3  very good
3   4  excellent
4   5       good

Pandas makes creating dummies easy:

pd.get_dummies(df.quality)

   excellent  good  very good
0          0     1          0
1          1     0          0
2          0     0          1
3          1     0          0
4          0     1          0

If you want to label the source of the data, you can use the prefix argument:

pd.get_dummies(df.quality, prefix="quality")

   quality_excellent  quality_good  quality_very good
0                  0             1                  0
1                  1             0                  0
2                  0             0                  1
3                  1             0                  0
4                  0             1                  0

Often when we work with dummies from a variable with $n$ distinct values, we create $n-1$ dummies and treat the remaining group as the reference group. Pandas provides a convenient way to do this:

pd.get_dummies(df.quality, prefix="quality", drop_first=True)

   quality_good  quality_very good
0             1                  0
1             0                  0
2             0                  1
3             0                  0
4             1                  0

Usually, we’ll want to use the dummies together with the rest of the data, so it’s convenient to have them in the original dataframe. One way to do this is to use concat like so:

dummies = pd.get_dummies(df.quality, prefix="quality", drop_first=True)
df_with_dummies = pd.concat([df, dummies], axis=1)
df_with_dummies.head()

   id    quality  quality_good  quality_very good
0   1       good             1                  0
1   2  excellent             0                  0
2   3  very good             0                  1
3   4  excellent             0                  0
4   5       good             1                  0

This works. But Pandas provides a much easier way:

df_with_dummies1 = pd.get_dummies(df, columns=["quality"], drop_first=True)
df_with_dummies1

   id  quality_good  quality_very good
0   1             1                  0
1   2             0                  0
2   3             0                  1
3   4             0                  0
4   5             1                  0

That’s it. In one line we get a new dataframe that includes the dummies and excludes the original quality column.

Pivot tables

df = sns.load_dataset("planets")
print(df.shape)
df.head(3)
(1035, 6)

            method  number  orbital_period  mass  distance  year
0  Radial Velocity       1         269.300  7.10     77.40  2006
1  Radial Velocity       1         874.774  2.21     56.95  2008
2  Radial Velocity       1         763.000  2.60     19.84  2011

Create a table that shows the number of planets discovered by each method in each decade

# method 1: groupby

decade = df.year // 10 * 10
decade = decade.astype("str") + "s"
decade.name = "decade"

df.groupby(["method", decade]).number.sum().unstack().fillna(0).astype("int")

decade                         1980s  1990s  2000s  2010s
method
Astrometry                         0      0      0      2
Eclipse Timing Variations          0      0      5     10
Imaging                            0      0     29     21
Microlensing                       0      0     12     15
Orbital Brightness Modulation      0      0      0      5
Pulsar Timing                      0      9      1      1
Pulsation Timing Variations        0      0      1      0
Radial Velocity                    1     52    475    424
Transit                            0      0     64    712
Transit Timing Variations          0      0      0      9

# method 2: pivot table

df.pivot_table(
    values="number", index="method", columns=decade, aggfunc="sum", fill_value=0
)

decade                         1980s  1990s  2000s  2010s
method
Astrometry                         0      0      0      2
Eclipse Timing Variations          0      0      5     10
Imaging                            0      0     29     21
Microlensing                       0      0     12     15
Orbital Brightness Modulation      0      0      0      5
Pulsar Timing                      0      9      1      1
Pulsation Timing Variations        0      0      1      0
Radial Velocity                    1     52    475    424
Transit                            0      0     64    712
Transit Timing Variations          0      0      0      9

Another example

From here

df = pd.DataFrame(
    data={
        "province": ["ON", "QC", "BC", "AL", "AL", "MN", "ON"],
        "city": [
            "Toronto",
            "Montreal",
            "Vancouver",
            "Calgary",
            "Edmonton",
            "Winnipeg",
            "Windsor",
        ],
        "sales": [13, 6, 16, 8, 4, 3, 1],
    }
)
df

  province       city  sales
0       ON    Toronto     13
1       QC   Montreal      6
2       BC  Vancouver     16
3       AL    Calgary      8
4       AL   Edmonton      4
5       MN   Winnipeg      3
6       ON    Windsor      1

You want to group sales by province and city and add a subtotal ("All") row for each province.

table = (
    df.pivot_table(
        values="sales", index="province", columns="city", aggfunc="sum", margins=True
    )
    .stack()
    .drop("All")
)
table
province  city     
AL        Calgary       8.0
          Edmonton      4.0
          All          12.0
BC        Vancouver    16.0
          All          16.0
MN        Winnipeg      3.0
          All           3.0
ON        Toronto      13.0
          Windsor       1.0
          All          14.0
QC        Montreal      6.0
          All           6.0
dtype: float64
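
The heavy lifting here is done by margins=True, which adds an "All" row and column of totals; stack() then moves the cities into the index, and drop("All") removes the grand-total row. To see the intermediate table before stacking (a sketch added for illustration):

df.pivot_table(
    values="sales", index="province", columns="city", aggfunc="sum", margins=True
)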

Counting number of equal adjacent values

In the column below, cumulatively count the number of adjacent equal values.

df = pd.DataFrame([1, 5, 7, 7, 1, 1, 1, 0], columns=["a"])
df

   a
0  1
1  5
2  7
3  7
4  1
5  1
6  1
7  0

Solution based on this Stack Overflow answer:

df["count"] = df.groupby((df.a != df.a.shift()).cumsum()).cumcount().add(1)
df

   a  count
0  1      1
1  5      1
2  7      1
3  7      2
4  1      1
5  1      2
6  1      3
7  0      1
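
To see why this works, it helps to inspect the intermediate grouper (a sketch added for illustration): each run of equal values gets its own block id, and cumcount() then numbers the rows within each block (add(1) makes the count 1-based).

changed = df.a != df.a.shift()  # True whenever the value differs from the previous row
block = changed.cumsum()        # increments at every change, so each run gets one id
pd.concat([df.a, block.rename("block")], axis=1)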

Reset cumsum() at each missing value

s = pd.Series([1.0, 3.0, 1.0, np.nan, 1.0, 1.0, 1.0, 1.0, np.nan, 1.0])
s
0    1.0
1    3.0
2    1.0
3    NaN
4    1.0
5    1.0
6    1.0
7    1.0
8    NaN
9    1.0
dtype: float64

What I don’t want:

s.cumsum()
0     1.0
1     4.0
2     5.0
3     NaN
4     6.0
5     7.0
6     8.0
7     9.0
8     NaN
9    10.0
dtype: float64

Instead, I want the running total to reset to zero at each missing value. Solution from this SO answer.

cumsum = s.cumsum().ffill()
reset = -cumsum[s.isna()].diff().fillna(cumsum)
result = s.where(s.notna(), reset).cumsum()
result
0    1.0
1    4.0
2    5.0
3    0.0
4    1.0
5    2.0
6    3.0
7    4.0
8    0.0
9    1.0
dtype: float64
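
To make the reset step more transparent, here is a sketch (not part of the original answer) that shows the intermediate series:

cumsum = s.cumsum().ffill()                      # running total, carried forward over the NaNs
reset = -cumsum[s.isna()].diff().fillna(cumsum)  # at each NaN: minus what accumulated since the previous NaN
patched = s.where(s.notna(), reset)              # replace the NaNs with the reset values
pd.concat([s, cumsum, patched], axis=1, keys=["s", "cumsum", "patched"])
# patched.cumsum() then returns to zero at every NaN position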

Apply

Using apply with groupby

From the cookbook:

df = pd.DataFrame(
    {
        "animal": "cat dog cat fish dog cat cat".split(),
        "size": list("SSMMMLL"),
        "weight": [8, 10, 11, 1, 20, 12, 12],
        "adult": [False] * 5 + [True] * 2,
    }
)
df

  animal size  weight  adult
0    cat    S       8  False
1    dog    S      10  False
2    cat    M      11  False
3   fish    M       1  False
4    dog    M      20  False
5    cat    L      12   True
6    cat    L      12   True

# Return size of heaviest animal

df.groupby("animal").apply(lambda g: g.loc[g.weight.idxmax(), "size"])
animal
cat     L
dog     M
fish    M
dtype: object
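
An alternative without apply uses idxmax on the grouped column and then selects the matching rows. This is a sketch added for comparison; it returns the winning rows rather than a Series indexed by animal:

df.loc[df.groupby("animal")["weight"].idxmax(), ["animal", "size"]]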

Expanding apply

Assume you want to calculate the cumulative return from a series of one-period returns in an expanding fashion – in each period, you want the cumulative return up to that period.

s = pd.Series([i / 100.0 for i in range(1, 4)])
s
0    0.01
1    0.02
2    0.03
dtype: float64

The solution is given here.

import functools


def cum_return(x, y):
    return x * (1 + y)


def red(x):
    res = functools.reduce(cum_return, x, 1)
    return res


s.expanding().apply(red, raw=True)
0    1.010000
1    1.030200
2    1.061106
dtype: float64

I found that somewhere between bewildering and magical. To see what’s going on, it helps to add a few print statements:

import functools


def cum_return(x, y):
    print("x:", x)
    print("y:", y)
    return x * (1 + y)


def red(x):
    print("Series:", x)
    res = functools.reduce(cum_return, x, 1)
    print("Result:", res)
    print()
    return res


s.expanding().apply(red, raw=True)
Series: [0.01]
x: 1
y: 0.01
Result: 1.01

Series: [0.01 0.02]
x: 1
y: 0.01
x: 1.01
y: 0.02
Result: 1.0302

Series: [0.01 0.02 0.03]
x: 1
y: 0.01
x: 1.01
y: 0.02
x: 1.0302
y: 0.03
Result: 1.061106

0    1.010000
1    1.030200
2    1.061106
dtype: float64

This makes transparent how reduce works: it takes the starting value (1 here) as the initial x and the first value of the series as y, and returns the result of cum_return. Next, it uses that result as x and the second element of the series as y, and calculates the new result of cum_return. This is repeated until it has run through the entire series.
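
For reference, here is reduce on its own, outside of expanding().apply() (a minimal sketch added for illustration):

import functools

# reduce(f, [a, b, c], init) is equivalent to f(f(f(init, a), b), c)
functools.reduce(lambda x, y: x * (1 + y), [0.01, 0.02, 0.03], 1)  # ≈ 1.061106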

What surprised me is that the calculation always restarts from the beginning of the series rather than reusing the last calculated result. This is not reduce’s doing: expanding().apply() calls the function once per expanding window and has no way to carry state from one window to the next, so each window is recomputed from scratch.

Sort by sum of group values

df = pd.DataFrame(
    {
        "code": ["foo", "bar", "baz"] * 2,
        "data": [0.16, -0.21, 0.33, 0.45, -0.59, 0.62],
        "flag": [False, True] * 3,
    }
)

df

  code  data   flag
0  foo  0.16  False
1  bar -0.21   True
2  baz  0.33  False
3  foo  0.45   True
4  bar -0.59  False
5  baz  0.62   True

g = df.groupby("code")
sort_order = g["data"].transform(sum).sort_values().index
df.loc[sort_order]

  code  data   flag
1  bar -0.21   True
4  bar -0.59  False
0  foo  0.16  False
3  foo  0.45   True
2  baz  0.33  False
5  baz  0.62   True
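
The key step is transform(sum), which broadcasts each group's total back onto the original rows, so sorting those values gives the desired row order. A sketch of the intermediate series (added for illustration):

g["data"].transform(sum)  # every row carries its group total: bar = -0.80, foo = 0.61, baz = 0.95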

Get observation with largest data entry for each group

g = df.groupby("code")
g.apply(lambda g: g.loc[g.data.idxmax()])

     code  data  flag
code
bar   bar -0.21  True
baz   baz  0.62  True
foo   foo  0.45  True

Expanding group operations

Based on this answer.

df = pd.DataFrame(
    {
        "code": ["foo", "bar", "baz"] * 4,
        "data": [0.16, -0.21, 0.33, 0.45, -0.59, 0.62] * 2,
        "flag": [False, True] * 6,
    }
)
df

   code  data   flag
0   foo  0.16  False
1   bar -0.21   True
2   baz  0.33  False
3   foo  0.45   True
4   bar -0.59  False
5   baz  0.62   True
6   foo  0.16  False
7   bar -0.21   True
8   baz  0.33  False
9   foo  0.45   True
10  bar -0.59  False
11  baz  0.62   True

g = df.groupby("code")


def helper(g):
    # expanding window over the group's data column, in row order within the group
    s = g.data.expanding()
    g["exp_mean"] = s.mean()
    g["exp_sum"] = s.sum()
    g["exp_count"] = s.count()
    return g


g.apply(helper).sort_values("code")

   code  data   flag  exp_mean  exp_sum  exp_count
1   bar -0.21   True -0.210000    -0.21        1.0
4   bar -0.59  False -0.400000    -0.80        2.0
7   bar -0.21   True -0.336667    -1.01        3.0
10  bar -0.59  False -0.400000    -1.60        4.0
2   baz  0.33  False  0.330000     0.33        1.0
5   baz  0.62   True  0.475000     0.95        2.0
8   baz  0.33  False  0.426667     1.28        3.0
11  baz  0.62   True  0.475000     1.90        4.0
0   foo  0.16  False  0.160000     0.16        1.0
3   foo  0.45   True  0.305000     0.61        2.0
6   foo  0.16  False  0.256667     0.77        3.0
9   foo  0.45   True  0.305000     1.22        4.0

Sort rows by distance to a given value

df = pd.DataFrame({"a": [4, 5, 6, 7], "b": [10, 20, 30, 40], "c": [100, 50, -30, -50]})
df

   a   b    c
0  4  10  100
1  5  20   50
2  6  30  -30
3  7  40  -50

myval = 34
df.loc[(df.c - myval).abs().argsort()]

   a   b    c
1  5  20   50
2  6  30  -30
0  4  10  100
3  7  40  -50

Reminder of what happens here:

a = (df.c - myval).abs()
b = a.argsort()
a, b
(0    66
 1    16
 2    64
 3    84
 Name: c, dtype: int64,
 0    1
 1    2
 2    0
 3    3
 Name: c, dtype: int64)

argsort returns the positions that would sort the series, so df.loc[b] returns the rows ordered by distance (this works here because the index is the default RangeIndex, where positions and labels coincide). The first element of b is thus the position of the smallest element in a.
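
Since argsort returns positions rather than index labels, iloc is the safer way to reorder when the index is not a default RangeIndex (a sketch added for illustration):

df.iloc[(df.c - myval).abs().argsort()]  # same result, robust to a non-default index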

Creating a separate dataframe for each group

df = sns.load_dataset("iris")
pieces = dict(list(df.groupby("species")))
pieces["setosa"].head(3)

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
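
An equivalent spelling uses a dict comprehension, which makes the intent a little more explicit (a matter of taste):

pieces = {species: group for species, group in df.groupby("species")}
pieces["setosa"].head(3)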

Aggregating

From here

df = pd.DataFrame(
    {
        "StudentID": ["x1", "x10", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"],
        "StudentGender": ["F", "M", "F", "M", "F", "M", "F", "M", "M", "M"],
        "ExamYear": [
            "2007",
            "2007",
            "2007",
            "2008",
            "2008",
            "2008",
            "2008",
            "2009",
            "2009",
            "2009",
        ],
        "Exam": [
            "algebra",
            "stats",
            "bio",
            "algebra",
            "algebra",
            "stats",
            "stats",
            "algebra",
            "bio",
            "bio",
        ],
        "Participated": [
            "no",
            "yes",
            "yes",
            "yes",
            "no",
            "yes",
            "yes",
            "yes",
            "yes",
            "yes",
        ],
        "Passed": ["no", "yes", "yes", "yes", "no", "yes", "yes", "yes", "no", "yes"],
    },
    columns=[
        "StudentID",
        "StudentGender",
        "ExamYear",
        "Exam",
        "Participated",
        "Passed",
    ],
)

df.columns = [str.lower(c) for c in df.columns]
df

  studentid studentgender examyear     exam participated passed
0        x1             F     2007  algebra           no     no
1       x10             M     2007    stats          yes    yes
2        x2             F     2007      bio          yes    yes
3        x3             M     2008  algebra          yes    yes
4        x4             F     2008  algebra           no     no
5        x5             M     2008    stats          yes    yes
6        x6             F     2008    stats          yes    yes
7        x7             M     2009  algebra          yes    yes
8        x8             M     2009      bio          yes     no
9        x9             M     2009      bio          yes    yes

numyes = lambda x: sum(x == "yes")
df.groupby("examyear").agg({"participated": numyes, "passed": numyes})

          participated  passed
examyear
2007                 2       2
2008                 3       3
2009                 3       2
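
Since both columns use the same aggregation, an equivalent sketch (added for illustration) selects the columns and passes a single function:

df.groupby("examyear")[["participated", "passed"]].agg(lambda s: (s == "yes").sum())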

Count days since last occurrence of event

We want a count of the number of days that have passed since the last event.

dfr = pd.DataFrame(
    {
        "date": pd.date_range(start="1 Jan 2022", periods=8, freq="d"),
        "event": [np.nan, np.nan, 1, np.nan, np.nan, np.nan, 4, np.nan],
        "result": [np.nan, np.nan, 0, 1, 2, 3, 0, 1],
    }
)
dfr

        date  event  result
0 2022-01-01    NaN     NaN
1 2022-01-02    NaN     NaN
2 2022-01-03    1.0     0.0
3 2022-01-04    NaN     1.0
4 2022-01-05    NaN     2.0
5 2022-01-06    NaN     3.0
6 2022-01-07    4.0     0.0
7 2022-01-08    NaN     1.0

df = dfr.iloc[:, :-1]
df

        date  event
0 2022-01-01    NaN
1 2022-01-02    NaN
2 2022-01-03    1.0
3 2022-01-04    NaN
4 2022-01-05    NaN
5 2022-01-06    NaN
6 2022-01-07    4.0
7 2022-01-08    NaN

def add_days_since_event(df):
    # 1 for every row from the first event onwards, NaN before it
    ones = df.event.ffill().where(lambda s: s.isna(), 1)
    # running count of rows (days) since the first event
    cumsum = ones.cumsum()
    # value to place at each event row so that the running total restarts at 0 there
    reset = -cumsum[df.event.notna()].diff().fillna(cumsum).sub(1)
    # keep the 1s on non-event days, insert the reset values on event days, and cumulate
    df["result"] = ones.where(df.event.isna(), reset).cumsum()
    return df


add_days_since_event(df)

        date  event  result
0 2022-01-01    NaN     NaN
1 2022-01-02    NaN     NaN
2 2022-01-03    1.0     0.0
3 2022-01-04    NaN     1.0
4 2022-01-05    NaN     2.0
5 2022-01-06    NaN     3.0
6 2022-01-07    4.0     0.0
7 2022-01-08    NaN     1.0
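
As a quick check (added for illustration), the reconstructed column matches the result column we dropped from dfr at the start:

df["result"].equals(dfr["result"])  # True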

Sources