This post is part of my series of posts on pandas
.
Fluent Pandas contains notes on how to effectively use pandas core features.
Fast pandas contains notes on how to effectively work with large datasets.
Pandas cookbook is a list of recipes for effectively solving common and not so common problems.
1
2
3
| import numpy as np
import pandas as pd
import seaborn as sns
|
Sort and filter
1
2
3
| df = sns.load_dataset("diamonds")
print(df.shape)
df.head(2)
|
(53940, 10)
| carat | cut | color | clarity | depth | table | price | x | y | z |
---|
0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
Filter data
1
2
3
4
5
| cutoff = 30_000
a = df.loc[df.amount > cutoff]
b = df.query("amount > @cutoff")
c = df[df.amount > cutoff]
all(a == b) == all(b == c)
|
True
Filter columns
1
| df.filter(like="sepal", axis=1).head(2)
|
| sepal_length | sepal_width |
---|
0 | 5.1 | 3.5 |
1 | 4.9 | 3.0 |
1
| df.filter(regex=".+_length").head(2)
|
| sepal_length | petal_length |
---|
0 | 5.1 | 1.4 |
1 | 4.9 | 1.4 |
groupb()
vs resample()
groupby()
implements the splict-apply-combine paradigm, while resample()
is a convenience method for frequency conversion and resampling of time series. When both are used on time series, the main difference is that resample()
fills in missing dates while groupby()
doesn’t.
1
2
3
| index = pd.date_range("2020", freq="2d", periods=3)
data = pd.DataFrame({"col": range(len(index))}, index=index)
data
|
| col |
---|
2020-01-01 | 0 |
2020-01-03 | 1 |
2020-01-05 | 2 |
1
| data.resample("d").col.sum()
|
2020-01-01 0
2020-01-02 0
2020-01-03 1
2020-01-04 0
2020-01-05 2
Freq: D, Name: col, dtype: int64
1
| data.groupby(level=0).col.sum()
|
2020-01-01 0
2020-01-03 1
2020-01-05 2
Freq: 2D, Name: col, dtype: int64
Aggregate
count()
vs size()
count()
is a DataFrame, Series, and Grouper method that return the count of non-missing rows.size()
is a Grouper method that returns the count of rows per group (including rows with missing elements)size
is also a DataFrame property that returns the number of elements (including cells with missing values) and a Series property that returns the number of rows (including rows with missing values).
1
2
| df = sns.load_dataset("titanic")
df.groupby("sex").count()
|
| survived | pclass | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
---|
sex | | | | | | | | | | | | | | |
female | 314 | 314 | 261 | 314 | 314 | 314 | 312 | 314 | 314 | 314 | 97 | 312 | 314 | 314 |
male | 577 | 577 | 453 | 577 | 577 | 577 | 577 | 577 | 577 | 577 | 106 | 577 | 577 | 577 |
1
| df.groupby("sex").size()
|
sex
female 314
male 577
dtype: int64
Naming columns
1
2
3
4
5
6
7
8
9
| def spread(s):
return s.max() - s.min()
df.groupby("species").agg(
mean_sepal_length=("sepal_length", "mean"),
max_petal_width=("petal_width", "max"),
spread_petal_width=("petal_width", spread),
)
|
| mean_sepal_length | max_petal_width | spread_petal_width |
---|
species | | | |
setosa | 5.006 | 0.6 | 0.5 |
versicolor | 5.936 | 1.8 | 0.8 |
virginica | 6.588 | 2.5 | 1.1 |
MultiIndex
Working with indices, expecially column indices, and especially with hierarchical ones, is an area of Pandas I keep finding perplexing. The point of this notebook is to help my future self.
1
2
| df = sns.load_dataset("iris")
df.head(2)
|
| sepal_length | sepal_width | petal_length | petal_width | species |
---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
Create hierarchical column names
1
2
3
4
| df = df.set_index("species")
tuples = [tuple(c) for c in df.columns.str.split("_")]
df.columns = pd.MultiIndex.from_tuples(tuples)
df.head(2)
|
| sepal | | petal | |
---|
| length | width | length | width |
species | | | | |
setosa | 5.1 | 3.5 | 1.4 | 0.2 |
setosa | 4.9 | 3.0 | 1.4 | 0.2 |
Flatten column names
1
2
3
4
| names = ["_".join(c) for c in df.columns]
df.columns = names
df.reset_index(inplace=True)
df.head(2)
|
| species | sepal_length | sepal_width | petal_length | petal_width |
---|
0 | setosa | 5.1 | 3.5 | 1.4 | 0.2 |
1 | setosa | 4.9 | 3.0 | 1.4 | 0.2 |
Flattening using method (from here)
1
| df.set_axis(df.columns.map("_".join), axis=1)
|
or, of course, with a list comprehension, like so:
1
| df.set_axis(["_".join(c) for c in df.columns], axis=1)
|
Mappings
apply
vs map
vs applymap
apply
applies a function along an axis of a dataframe or on series valuesmap
applies a correspondance to each value in a seriesapplymap
applies a function to each element in a dataframe
1
2
3
| data = df.loc[:2, ["gender", "merchant"]]
gender = {"m": "male", "f": "female"}
data
|
| gender | merchant |
---|
0 | m | aviva |
1 | m | tesco |
2 | m | mcdonalds |
1
| data.apply(lambda x: x.map(gender))
|
| gender | merchant |
---|
0 | male | NaN |
1 | male | NaN |
2 | male | NaN |
1
| data.gender.map(gender)
|
0 male
1 male
2 male
Name: gender, dtype: object
1
| data.applymap(gender.get)
|
| gender | merchant |
---|
0 | male | None |
1 | male | None |
2 | male | None |
get
turns a dictionary into a function that takes a key and returns its corresponding value if the key is in the dictionary and a default value otherwise.
Sources