Pandas categories

Basics

Values and order:

  • All values of a categorical variable are either in categories or are np.nan.

  • Order is defined by the order of categories, not the lexical order of the values.

Memory structure:

  • Internally, the data structure consists of a categories array and an integer array of codes, which point to the values in the categories array.

  • The memory usage of a categorical variable is proportional to the number of categories plus the length of the data, while that for an object dtype is a constant times the length of the data. As the number of categories approaches the length of the data, memory usage approaches that of object type.
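To make the two internal pieces and the memory claim concrete, here is a small sketch. The toy series and the variable names (s_obj, s_cat, with_nan) are made up for illustration; exact byte counts depend on the pandas version and platform, so I only compare them rather than quote numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three distinct labels repeated many times.
s_obj = pd.Series(["small", "medium", "large"] * 10_000, dtype="object")
s_cat = s_obj.astype("category")

# The two internal pieces: the categories array and the integer codes.
print(s_cat.cat.categories)     # categories are sorted lexically by default
print(s_cat.cat.codes.head(3))  # integers that point into the categories

# Missing values are not a category; they get the sentinel code -1.
with_nan = pd.Series(["a", np.nan, "b"], dtype="category")
print(with_nan.cat.codes.tolist())

# Compare memory footprints (deep=True also counts the Python string objects).
obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)
print(obj_bytes, cat_bytes)
```

With few categories relative to the number of rows, the categorical version should come out much smaller; the gap narrows as the number of distinct values grows.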

Use cases:

  • To save memory (if number of categories is small relative to the number of rows)

  • If logical order differs from lexical order (e.g. ‘small’, ‘medium’, ‘large’)

  • To signal to libraries that column should be treated as a category (e.g. for plotting)

General best practices

Operating on categories:

  • Operate on category values directly rather than column elements (e.g. to rename categories use df.catvar.cat.rename_categories(*args, **kwargs)).

  • If there is no cat method available, consider operating on categories directly with df.catvar.cat.categories.
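A minimal sketch of both points, using a made-up frame (the column name catvar and its values are placeholders, not real data). rename_categories accepts a callable or a dict, and the dict can be built from cat.categories, so the work is done once per category rather than once per row.

```python
import pandas as pd

# Hypothetical example column, for illustration only.
df = pd.DataFrame({"catvar": pd.Categorical(["Cash", "Credit Card", "Cash"])})

# Rename through the cat accessor: only the two categories are touched.
lowered = df.catvar.cat.rename_categories(str.lower)
print(lowered.cat.categories.tolist())

# No dedicated method for the transformation I want? Build a mapping from
# the categories themselves and pass it to rename_categories.
mapping = {c: c.replace(" ", "_") for c in df.catvar.cat.categories}
snake = df.catvar.cat.rename_categories(mapping)
print(snake.tolist())
```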

Merging:

  • Pandas treats categorical variables with different categories as different data types.

  • Category merge keys stay categorical in the merged dataframe only if both sides have the same data type (i.e. the same categories); otherwise they are converted back to objects.
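Here is a small sketch of the merge behaviour, with two made-up frames (left, right and the shared dtype are illustrative names). I only check whether the key is still categorical after the merge, because the exact fallback dtype can differ between pandas versions.

```python
import pandas as pd

left = pd.DataFrame({"key": pd.Categorical(["a", "b"]), "x": [1, 2]})
right = pd.DataFrame({"key": pd.Categorical(["b", "c"]), "y": [3, 4]})

# Different categories mean different dtypes, even though both are "category".
same_dtype = left.key.dtype == right.key.dtype
print(same_dtype)

# Merging on mismatched categoricals loses the categorical dtype on the key.
merged = left.merge(right, on="key", how="outer")
print(merged.key.dtype)

# Casting both keys to one shared CategoricalDtype keeps the key categorical.
shared = pd.CategoricalDtype(["a", "b", "c"])
left["key"] = left.key.astype(shared)
right["key"] = right.key.astype(shared)
merged_cat = left.merge(right, on="key", how="outer")
print(merged_cat.key.dtype)
```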

Grouping:

  • With observed=False, which has long been the default, pandas groups on all categories, not just those present in the data.

  • More often than not, you’ll want to use df.groupby(catvar, observed=True) to only use categories observed in the data.
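The difference is easiest to see on a toy frame where one category never occurs. The frame and column names below are made up; I pass observed explicitly in both calls so the result does not depend on the installed version's default.

```python
import pandas as pd

# "medium" is a declared category but never appears in the data.
df = pd.DataFrame({
    "size": pd.Categorical(
        ["small", "small", "large"],
        categories=["small", "medium", "large"],
    ),
    "n": [1, 2, 3],
})

# observed=False: one group per category, including the empty one.
all_cats = df.groupby("size", observed=False)["n"].sum()
print(all_cats)

# observed=True: only categories that actually occur.
seen = df.groupby("size", observed=True)["n"].sum()
print(seen)
```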

Operations I frequently use

import numpy as np
import pandas as pd
import seaborn as sns

df = sns.load_dataset("taxis")
df["pickup"] = pd.to_datetime(df.pickup)
df["dropoff"] = pd.to_datetime(df.dropoff)
df.head(2)

                pickup              dropoff  passengers  distance  fare   tip  tolls  total   color      payment            pickup_zone           dropoff_zone  pickup_borough  dropoff_borough
0  2019-03-23 20:21:09  2019-03-23 20:27:24           1      1.60   7.0  2.15    0.0  12.95  yellow  credit card        Lenox Hill West    UN/Turtle Bay South       Manhattan        Manhattan
1  2019-03-04 16:11:55  2019-03-04 16:19:00           1      0.79   5.0  0.00    0.0   9.30  yellow         cash  Upper West Side South  Upper West Side South       Manhattan        Manhattan

Convert all string variables to categories

str_cols = df.select_dtypes("object")
df[str_cols.columns] = str_cols.astype("category")

Convert labels of all categorical variables to lowercase

cat_cols = df.select_dtypes("category")
df[cat_cols.columns] = cat_cols.apply(lambda col: col.cat.rename_categories(str.lower))

String and datetime accessors

  • When using the str and dt accessors on a variable of type category, pandas applies the operation on the categories rather than the entire array (which is nice) and then creates and returns a new string or date array (which is often not helpful for me).
df.payment.str.upper().head(3)
0    CREDIT CARD
1           CASH
2    CREDIT CARD
Name: payment, dtype: object
  • For operations that cat provides methods for (e.g. renaming as used above), the solution is to use those methods.

  • For others (e.g. regex searches) the solution is to operate on the categories directly myself.
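A rough sketch of what I mean by operating on the categories directly. The payment series is a small stand-in for the taxis column, and the names cats, matching and is_credit are mine: run the regex once per category, then map the result back onto the rows with isin.

```python
import pandas as pd

# Stand-in for a categorical column such as df.payment (None marks a missing value).
payment = pd.Series(["credit card", "cash", "credit card", None], dtype="category")

# Regex search on the handful of categories instead of on every row.
cats = payment.cat.categories
matching = cats[cats.str.contains(r"^credit\b", regex=True)]

# Map the per-category result back to a per-row boolean mask.
is_credit = payment.isin(matching)
print(is_credit.tolist())

# For label transformations, rename_categories keeps the categorical dtype,
# unlike payment.str.upper().
upper = payment.cat.rename_categories(lambda c: c.upper())
print(upper.dtype)
```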

Object creation

Convert sex and who to the same categorical type, with categories being the union of all unique values of both columns.

# Assumption: titanic is the Titanic dataset that ships with seaborn
titanic = sns.load_dataset("titanic")
cols = ["sex", "who"]
# Distinct values across both columns become the shared categories
unique_values = np.unique(titanic[cols].to_numpy().ravel())
categories = pd.CategoricalDtype(categories=unique_values)
titanic[cols] = titanic[cols].astype(categories)
print(titanic.sex.cat.categories)
print(titanic.who.cat.categories)

Index(['child', 'female', 'male', 'man', 'woman'], dtype='object')
Index(['child', 'female', 'male', 'man', 'woman'], dtype='object')

# restore sex and who to object types
titanic[cols] = titanic[cols].astype("object")

Custom order

# Use a separate name so the taxis data in df is not overwritten
quality_df = pd.DataFrame({"quality": ["good", "excellent", "very good"]})
quality_df.sort_values("quality")

     quality
1  excellent
0       good
2  very good

ordered_quality = pd.CategoricalDtype(["good", "very good", "excellent"], ordered=True)
quality_df.quality = quality_df.quality.astype(ordered_quality)
quality_df.sort_values("quality")

     quality
0       good
2  very good
1  excellent
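
Beyond sorting, an ordered dtype also lets comparisons and min/max follow the logical order. A small sketch using the same three labels as above; the series name quality and the mask name are just for illustration.

```python
import pandas as pd

ordered_quality = pd.CategoricalDtype(["good", "very good", "excellent"], ordered=True)
quality = pd.Series(["good", "excellent", "very good"], dtype=ordered_quality)

# Comparisons against a category label use the declared order, not the alphabet.
at_least_very_good = quality >= "very good"
print(at_least_very_good.tolist())

# min and max follow the category order as well.
print(quality.min(), quality.max())
```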

Unique values

Series.unique returns values in order of appearance, and only returns values that are present in the data, whereas cat.categories lists every category, observed or not.

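# df is the taxis data with string columns converted to categories
dfs = df.head(5)
# Five rows cannot cover every pickup zone, so the two counts differ
assert len(dfs.pickup_zone.unique()) != len(dfs.pickup_zone.cat.categories)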
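
The same contrast on a tiny self-contained series, in case the taxis data is not at hand. The series s is made up; remove_unused_categories is one way to bring the categories back in line with the data when that is what I want.

```python
import pandas as pd

# "c" is declared as a category but never appears in the data.
s = pd.Series(pd.Categorical(["b", "a", "b"], categories=["a", "b", "c"]))

# unique: observed values only, in order of appearance.
observed_values = s.unique().tolist()
print(observed_values)

# categories: everything declared, in category order.
print(s.cat.categories.tolist())

# Drop categories that do not occur in the data.
trimmed = s.cat.remove_unused_categories()
print(trimmed.cat.categories.tolist())
```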
