Fluent Pandas

This post is part of my series of posts on pandas.

Fluent Pandas contains notes on how to effectively use pandas core features.

Fast pandas contains notes on how to effectively work with large datasets.

Pandas cookbook is a list of recipes for effectively solving common and not so common problems.

import numpy as np
import pandas as pd
import seaborn as sns

Sort and filter

df = sns.load_dataset("diamonds")
print(df.shape)
df.head(2)
(53940, 10)

   carat      cut color clarity  depth  table  price     x     y     z
0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31

Filter data

cutoff = 30_000
a = df.loc[df.price > cutoff]
b = df.query("price > @cutoff")
c = df[df.price > cutoff]
all(a == b) == all(b == c)
True
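The three spellings also extend naturally to compound conditions. A minimal sketch with a made-up toy frame (the column values here are illustrative, not the diamonds data): boolean masks are combined with `&`/`|` and each condition needs its own parentheses, while `query()` expresses the same filter as a string, with `@` referencing local variables.

```python
import pandas as pd

df = pd.DataFrame(
    {"price": [500, 15_000, 31_000, 32_500],
     "cut": ["Fair", "Ideal", "Premium", "Ideal"]}
)

cutoff = 30_000
# Boolean-mask form: parentheses around each condition are required
mask = (df.price > cutoff) & (df.cut == "Ideal")
a = df.loc[mask]
# query() form: same filter as a string, @ pulls in the local variable
b = df.query("price > @cutoff and cut == 'Ideal'")
print(a.equals(b))  # True
```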

Filter columns

df = sns.load_dataset("iris")
df.filter(like="sepal", axis=1).head(2)

   sepal_length  sepal_width
0           5.1          3.5
1           4.9          3.0
df.filter(regex=".+_length").head(2)

   sepal_length  petal_length
0           5.1           1.4
1           4.9           1.4
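Besides `like=` and `regex=`, `filter()` also accepts `items=`, which keeps an explicit list of labels and silently ignores any that are missing. A small sketch on a toy frame (column values made up for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"sepal_length": [5.1], "sepal_width": [3.5], "petal_length": [1.4]}
)

# items= keeps only the listed labels; absent labels are ignored, not an error
print(df.filter(items=["sepal_length", "missing_col"]).columns.tolist())
# → ['sepal_length']
```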

groupby() vs resample()

groupby() implements the split-apply-combine paradigm, while resample() is a convenience method for frequency conversion and resampling of time series. When both are used on a time series, the main difference is that resample() fills in missing dates while groupby() doesn't.

index = pd.date_range("2020", freq="2d", periods=3)
data = pd.DataFrame({"col": range(len(index))}, index=index)
data

            col
2020-01-01    0
2020-01-03    1
2020-01-05    2
data.resample("d").col.sum()
2020-01-01    0
2020-01-02    0
2020-01-03    1
2020-01-04    0
2020-01-05    2
Freq: D, Name: col, dtype: int64
data.groupby(level=0).col.sum()
2020-01-01    0
2020-01-03    1
2020-01-05    2
Freq: 2D, Name: col, dtype: int64
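The two can be bridged: passing `pd.Grouper(freq=...)` to groupby() uses the same binning machinery as resample(), so missing dates are filled in. A small sketch on the same toy data as above:

```python
import pandas as pd

index = pd.date_range("2020", freq="2D", periods=3)
data = pd.DataFrame({"col": range(len(index))}, index=index)

# pd.Grouper(freq="D") makes groupby bin by calendar day, filling empty bins
by_grouper = data.groupby(pd.Grouper(freq="D")).col.sum()
by_resample = data.resample("D").col.sum()
print(by_grouper.equals(by_resample))  # True
```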

Aggregate

count() vs size()

  • count() is a DataFrame, Series, and GroupBy method that returns the count of non-missing rows.
  • size() is a GroupBy method that returns the count of rows per group (including rows with missing values).
  • size is also a DataFrame property that returns the number of elements (including cells with missing values) and a Series property that returns the number of rows (including rows with missing values).
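The distinction is easiest to see on a toy Series and DataFrame with a missing value (data made up for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
print(s.count())  # 2 -> non-missing values only
print(s.size)     # 3 -> all rows, missing included

df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1.0, np.nan, 3.0]})
# count() drops the NaN in group "a"; size() keeps it
print(df.groupby("key").val.count().tolist())  # [1, 1]
print(df.groupby("key").size().tolist())       # [2, 1]
```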
df = sns.load_dataset("titanic")
df.groupby("sex").count()

        survived  pclass  age  sibsp  parch  fare  embarked  class  who  adult_male  deck  embark_town  alive  alone
sex
female       314     314  261    314    314   314       312    314  314         314    97          312    314    314
male         577     577  453    577    577   577       577    577  577         577   106          577    577    577
df.groupby("sex").size()
sex
female    314
male      577
dtype: int64

Naming columns

def spread(s):
    return s.max() - s.min()


df = sns.load_dataset("iris")
df.groupby("species").agg(
    mean_sepal_length=("sepal_length", "mean"),
    max_petal_width=("petal_width", "max"),
    spread_petal_width=("petal_width", spread),
)

            mean_sepal_length  max_petal_width  spread_petal_width
species
setosa                  5.006              0.6                 0.5
versicolor              5.936              1.8                 0.8
virginica               6.588              2.5                 1.1
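The keyword form above is shorthand for pd.NamedAgg, which spells the (column, aggfunc) pair out explicitly. A minimal sketch on a made-up toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica"],
    "petal_width": [0.2, 0.4, 2.5],
})

# pd.NamedAgg is the explicit spelling of the (column, aggfunc) tuple
out = df.groupby("species").agg(
    max_petal_width=pd.NamedAgg(column="petal_width", aggfunc="max"),
)
print(out)
```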

MultiIndex

Working with indices, especially column indices, and especially with hierarchical ones, is an area of pandas I keep finding perplexing. The point of this notebook is to help my future self.

df = sns.load_dataset("iris")
df.head(2)

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa

Create hierarchical column names

df = df.set_index("species")
tuples = [tuple(c) for c in df.columns.str.split("_")]
df.columns = pd.MultiIndex.from_tuples(tuples)
df.head(2)

         sepal        petal
        length width length width
species
setosa     5.1   3.5    1.4   0.2
setosa     4.9   3.0    1.4   0.2
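Once the columns are hierarchical, plain indexing selects everything under a top-level name, and xs() slices on an inner level across all top-level names. A small sketch on a made-up one-row frame with the same column structure:

```python
import pandas as pd

cols = pd.MultiIndex.from_tuples(
    [("sepal", "length"), ("sepal", "width"),
     ("petal", "length"), ("petal", "width")]
)
df = pd.DataFrame([[5.1, 3.5, 1.4, 0.2]], columns=cols)

# df["sepal"] selects all columns under the top level "sepal"
print(df["sepal"].columns.tolist())                      # ['length', 'width']
# xs() slices on the inner level, keeping the top-level names
print(df.xs("width", axis=1, level=1).columns.tolist())  # ['sepal', 'petal']
```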

Flatten column names

names = ["_".join(c) for c in df.columns]
df.columns = names
df.reset_index(inplace=True)
df.head(2)

  species  sepal_length  sepal_width  petal_length  petal_width
0  setosa           5.1          3.5           1.4          0.2
1  setosa           4.9          3.0           1.4          0.2

Flattening using the set_axis() method (from here)

df.set_axis(df.columns.map("_".join), axis=1)

or, of course, with a list comprehension, like so:

df.set_axis(["_".join(c) for c in df.columns], axis=1)

Mappings

apply vs map vs applymap

  • apply applies a function along an axis of a DataFrame or to the values of a Series
  • map applies a correspondence (a dict, Series, or function) to each value in a Series
  • applymap applies a function to each element of a DataFrame (in pandas 2.1+ it has been renamed to DataFrame.map)
data = pd.DataFrame(
    {"gender": ["m", "m", "m"], "merchant": ["aviva", "tesco", "mcdonalds"]}
)
gender = {"m": "male", "f": "female"}
data

  gender   merchant
0      m      aviva
1      m      tesco
2      m  mcdonalds
data.apply(lambda x: x.map(gender))

  gender merchant
0   male      NaN
1   male      NaN
2   male      NaN
data.gender.map(gender)
0    male
1    male
2    male
Name: gender, dtype: object
data.applymap(gender.get)

  gender merchant
0   male     None
1   male     None
2   male     None

get turns a dictionary into a function that takes a key and returns its corresponding value if the key is in the dictionary and a default value otherwise.
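This matters for unmatched keys: map with the bare dict leaves them as NaN, while map with get plus a default substitutes a fallback value. A small sketch (the "x" value is made up to force a miss):

```python
import pandas as pd

gender = {"m": "male", "f": "female"}
s = pd.Series(["m", "f", "x"])

# Bare dict: unmatched keys become NaN
print(s.map(gender).tolist())
# get with a default: unmatched keys get the fallback instead
print(s.map(lambda v: gender.get(v, "unknown")).tolist())
# → ['male', 'female', 'unknown']
```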

Sources