This post is part of my series of posts on pandas.
Fluent Pandas contains notes on how to effectively use pandas core features.
Fast pandas contains notes on how to effectively work with large datasets.
Pandas cookbook is a list of recipes for effectively solving common and not so common problems.
1
2
3
| import numpy as np
import pandas as pd
import seaborn as sns
|
Sort and filter
1
2
3
| df = sns.load_dataset("diamonds")
print(df.shape)
df.head(2)
|
(53940, 10)
| carat | cut | color | clarity | depth | table | price | x | y | z |
|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
Filter data
1
2
3
4
5
| cutoff = 30_000
a = df.loc[df.amount > cutoff]
b = df.query("amount > @cutoff")
c = df[df.amount > cutoff]
all(a == b) == all(b == c)
|
True
Filter columns
1
| df.filter(like="sepal", axis=1).head(2)
|
| sepal_length | sepal_width |
|---|
| 0 | 5.1 | 3.5 |
| 1 | 4.9 | 3.0 |
1
| df.filter(regex=".+_length").head(2)
|
| sepal_length | petal_length |
|---|
| 0 | 5.1 | 1.4 |
| 1 | 4.9 | 1.4 |
groupb() vs resample()
groupby() implements the splict-apply-combine paradigm, while resample() is a convenience method for frequency conversion and resampling of time series. When both are used on time series, the main difference is that resample() fills in missing dates while groupby() doesn’t.
1
2
3
| index = pd.date_range("2020", freq="2d", periods=3)
data = pd.DataFrame({"col": range(len(index))}, index=index)
data
|
| col |
|---|
| 2020-01-01 | 0 |
| 2020-01-03 | 1 |
| 2020-01-05 | 2 |
1
| data.resample("d").col.sum()
|
2020-01-01 0
2020-01-02 0
2020-01-03 1
2020-01-04 0
2020-01-05 2
Freq: D, Name: col, dtype: int64
1
| data.groupby(level=0).col.sum()
|
2020-01-01 0
2020-01-03 1
2020-01-05 2
Freq: 2D, Name: col, dtype: int64
Aggregate
count() vs size()
count() is a DataFrame, Series, and Grouper method that return the count of non-missing rows.size() is a Grouper method that returns the count of rows per group (including rows with missing elements)size is also a DataFrame property that returns the number of elements (including cells with missing values) and a Series property that returns the number of rows (including rows with missing values).
1
2
| df = sns.load_dataset("titanic")
df.groupby("sex").count()
|
| survived | pclass | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|
| sex | | | | | | | | | | | | | | |
| female | 314 | 314 | 261 | 314 | 314 | 314 | 312 | 314 | 314 | 314 | 97 | 312 | 314 | 314 |
| male | 577 | 577 | 453 | 577 | 577 | 577 | 577 | 577 | 577 | 577 | 106 | 577 | 577 | 577 |
1
| df.groupby("sex").size()
|
sex
female 314
male 577
dtype: int64
Naming columns
1
2
3
4
5
6
7
8
9
| def spread(s):
return s.max() - s.min()
df.groupby("species").agg(
mean_sepal_length=("sepal_length", "mean"),
max_petal_width=("petal_width", "max"),
spread_petal_width=("petal_width", spread),
)
|
| mean_sepal_length | max_petal_width | spread_petal_width |
|---|
| species | | | |
| setosa | 5.006 | 0.6 | 0.5 |
| versicolor | 5.936 | 1.8 | 0.8 |
| virginica | 6.588 | 2.5 | 1.1 |
MultiIndex
Working with indices, expecially column indices, and especially with hierarchical ones, is an area of Pandas I keep finding perplexing. The point of this notebook is to help my future self.
1
2
| df = sns.load_dataset("iris")
df.head(2)
|
| sepal_length | sepal_width | petal_length | petal_width | species |
|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
Create hierarchical column names
1
2
3
4
| df = df.set_index("species")
tuples = [tuple(c) for c in df.columns.str.split("_")]
df.columns = pd.MultiIndex.from_tuples(tuples)
df.head(2)
|
| sepal | | petal | |
|---|
| length | width | length | width |
| species | | | | |
| setosa | 5.1 | 3.5 | 1.4 | 0.2 |
| setosa | 4.9 | 3.0 | 1.4 | 0.2 |
Flatten column names
1
2
3
4
| names = ["_".join(c) for c in df.columns]
df.columns = names
df.reset_index(inplace=True)
df.head(2)
|
| species | sepal_length | sepal_width | petal_length | petal_width |
|---|
| 0 | setosa | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | setosa | 4.9 | 3.0 | 1.4 | 0.2 |
Flattening using method (from here)
1
| df.set_axis(df.columns.map("_".join), axis=1)
|
or, of course, with a list comprehension, like so:
1
| df.set_axis(["_".join(c) for c in df.columns], axis=1)
|
Mappings
apply vs map vs applymap
apply applies a function along an axis of a dataframe or on series valuesmap applies a correspondance to each value in a seriesapplymap applies a function to each element in a dataframe
1
2
3
| data = df.loc[:2, ["gender", "merchant"]]
gender = {"m": "male", "f": "female"}
data
|
| gender | merchant |
|---|
| 0 | m | aviva |
| 1 | m | tesco |
| 2 | m | mcdonalds |
1
| data.apply(lambda x: x.map(gender))
|
| gender | merchant |
|---|
| 0 | male | NaN |
| 1 | male | NaN |
| 2 | male | NaN |
1
| data.gender.map(gender)
|
0 male
1 male
2 male
Name: gender, dtype: object
1
| data.applymap(gender.get)
|
| gender | merchant |
|---|
| 0 | male | None |
| 1 | male | None |
| 2 | male | None |
get turns a dictionary into a function that takes a key and returns its corresponding value if the key is in the dictionary and a default value otherwise.
Sources