Pandas categories
Basics
Values and order:
All values of a categorical variable are either in `categories` or are `np.nan`. Order is defined by the order of `categories`, not the lexical order of the values.
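A minimal sketch of both points, using made-up size labels (the data and the name `sizes` are my own illustration):

```python
import pandas as pd

# Hypothetical labels; "huge" is deliberately not one of the declared categories.
sizes = pd.Series(
    pd.Categorical(
        ["small", "large", "medium", "huge"],
        categories=["small", "medium", "large"],
        ordered=True,
    )
)

# Values that are not in `categories` are stored as NaN.
print(sizes.isna().tolist())  # [False, False, False, True]

# Sorting follows the order of `categories`, not alphabetical order;
# the NaN value is placed last.
print(sizes.sort_values().tolist())
```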
Memory structure:
Internally, the data structure consists of a `categories` array and an integer array of `codes`, which point to the values in the `categories` array. The memory usage of a categorical variable is proportional to the number of categories plus the length of the data, while that for an object dtype is a constant times the length of the data. As the number of categories approaches the length of the data, memory usage approaches that of object type.
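To make the structure concrete, here is a small sketch with an invented column of three labels repeated many times. The exact byte counts depend on the pandas version and platform, so I only compare them rather than quote numbers.

```python
import pandas as pd

# Hypothetical data: three distinct labels, 100,000 rows, stored as Python objects.
colors = pd.Series(["yellow", "green", "yellow", "green", "blue"] * 20_000, dtype=object)
as_cat = colors.astype("category")

# Each distinct label is stored once in `categories` ...
print(list(as_cat.cat.categories))  # ['blue', 'green', 'yellow']

# ... and every row holds a small integer code pointing into that array.
print(as_cat.cat.codes.head().tolist())  # [2, 1, 2, 1, 0]

# Compare the memory footprint of the two representations.
obj_bytes = colors.memory_usage(deep=True)
cat_bytes = as_cat.memory_usage(deep=True)
print(cat_bytes < obj_bytes)  # True
```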
Use cases:
To save memory (if the number of categories is small relative to the number of rows)
If the logical order differs from the lexical order (e.g. ‘small’, ‘medium’, ‘large’)
To signal to libraries that a column should be treated as categorical (e.g. for plotting)
General best practices
Operating on categories:
Operate on category values directly rather than column elements (e.g. to rename categories use `df.catvar.cat.rename_categories(*args, **kwargs)`). If there is no `cat` method available, consider operating on categories directly with `df.catvar.cat.categories`.
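As a rough illustration with a toy frame (the column name `catvar` follows the text above, the values are invented), renaming through the `cat` accessor touches the handful of categories rather than every row:

```python
import pandas as pd

# Toy data: four rows, three categories.
df = pd.DataFrame({"catvar": pd.Categorical(["a", "b", "a", "c"])})

# Rename via a mapping; categories not in the mapping ("c") are left unchanged.
df["catvar"] = df["catvar"].cat.rename_categories({"a": "alpha", "b": "beta"})

print(list(df["catvar"].cat.categories))  # ['alpha', 'beta', 'c']
print(df["catvar"].tolist())              # ['alpha', 'beta', 'alpha', 'c']
```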
Merging:
Pandas treats categorical variables with different categories as different data types.
Categorical merge keys stay categorical in the merged dataframe only if they have the same data type (i.e. the same categories); otherwise they are converted back to objects.
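A small sketch of this behaviour with invented frames. The non-categorical result dtype can differ between pandas versions, so the check below only asks whether the key is still categorical.

```python
import pandas as pd

left = pd.DataFrame({"key": pd.Categorical(["a", "b"]), "x": [1, 2]})
# Same categories as `left`: the dtypes are equal.
right_same = pd.DataFrame({"key": pd.Categorical(["a", "b"]), "y": [3, 4]})
# One extra category: a different categorical dtype.
right_diff = pd.DataFrame({"key": pd.Categorical(["a", "b", "c"]), "y": [3, 4, 5]})

same = left.merge(right_same, on="key")
diff = left.merge(right_diff, on="key")

# The key stays categorical only when both sides share the same categories.
print(isinstance(same["key"].dtype, pd.CategoricalDtype))  # True
print(isinstance(diff["key"].dtype, pd.CategoricalDtype))  # False
```

If the key needs to stay categorical, one option is to cast both sides to a shared `pd.CategoricalDtype` before merging.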
Grouping:
With `observed=False` (the default in older pandas versions), `groupby` groups on all categories, not just those present in the data.
More often than not, you’ll want to use `df.groupby(catvar, observed=True)` to only use categories observed in the data.
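The difference is easy to see on a toy frame (invented data) where one category never occurs. I pass `observed` explicitly in both calls so the result does not depend on the installed version's default.

```python
import pandas as pd

df = pd.DataFrame(
    {
        # "medium" is a declared category but never appears in the data.
        "size": pd.Categorical(
            ["small", "small", "large"],
            categories=["small", "medium", "large"],
        ),
        "price": [1.0, 2.0, 5.0],
    }
)

all_cats = df.groupby("size", observed=False)["price"].sum()
seen_only = df.groupby("size", observed=True)["price"].sum()

print(list(all_cats.index))   # ['small', 'medium', 'large']
print(list(seen_only.index))  # ['small', 'large']
```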
Operations I frequently use
| | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
| 1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
Convert all string variables to categories
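One possible way to do this, sketched on a tiny stand-in frame (the column names mirror the table above, the rows are invented). I use `pd.api.types.is_string_dtype` to pick the columns so the check works whether strings are stored as `object` or as a dedicated string dtype.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "color": ["yellow", "green"],
        "payment": ["credit card", "cash"],
        "fare": [7.0, 5.0],
    }
)

# Collect the columns that hold strings, leaving numeric columns alone.
str_cols = [c for c in df.columns if pd.api.types.is_string_dtype(df[c])]

# Cast only those columns to category.
df = df.astype({c: "category" for c in str_cols})

print(str_cols)  # ['color', 'payment']
```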
Convert labels of all categorical variables to lowercase
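A sketch of how this could look on a small invented frame. Note that the renamed categories must stay unique, so this would raise an error if two labels differed only by case.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "borough": pd.Categorical(["Manhattan", "Queens", "Manhattan"]),
        "fare": [7.0, 5.0, 9.3],
    }
)

# Lowercase the labels of every categorical column, one category at a time.
for col in df.select_dtypes(include="category").columns:
    df[col] = df[col].cat.rename_categories(str.lower)

print(list(df["borough"].cat.categories))  # ['manhattan', 'queens']
```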
String and datetime accessors
- When using the `str` and `dt` accessors on a variable of type `category`, pandas applies the operation to the `categories` rather than the entire array (which is nice) and then creates and returns a new string or date array (which is often not helpful for me).
0 CREDIT CARD
1 CASH
2 CREDIT CARD
Name: payment, dtype: object
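Output like the one above can be reproduced with a sketch along these lines (the three-row `payment` series is a stand-in for the real column). The exact dtype of the result may vary by pandas version; the point is that it is no longer categorical.

```python
import pandas as pd

payment = pd.Series(
    ["credit card", "cash", "credit card"], name="payment", dtype="category"
)

# The string accessor works on a categorical column ...
upper = payment.str.upper()
print(upper.tolist())  # ['CREDIT CARD', 'CASH', 'CREDIT CARD']

# ... but the result has lost its categorical dtype.
print(isinstance(upper.dtype, pd.CategoricalDtype))  # False
```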
For operations that `cat` provides methods for (e.g. renaming as used above), the solution is to use those methods. For others (e.g. regex searches) the solution is to operate on the categories directly myself.
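For the regex case, one way to work on the categories directly is sketched below. The zone names are taken from the table earlier; the pattern and variable names are just for illustration. The regex runs once per category rather than once per row, and the column keeps its categorical dtype.

```python
import re

import pandas as pd

zones = pd.Series(
    [
        "Lenox Hill West",
        "Upper West Side South",
        "Lenox Hill West",
        "UN/Turtle Bay South",
    ],
    dtype="category",
)

# Search the (few) categories, not the (many) rows.
pattern = re.compile(r"\bWest\b")
matching = [c for c in zones.cat.categories if pattern.search(c)]

# Map the matching categories back onto the rows.
mask = zones.isin(matching)
print(mask.tolist())  # [True, True, True, False]
```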
Object creation
Convert `sex` and `who` to the same categorical type, with categories being the union of all unique values of both columns.
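A minimal sketch of the idea on an invented three-row frame. The column names `sex` and `who` are chosen to match the categories shown in the output below, and sorting the union is my own choice for a stable category order.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "sex": ["male", "female", "female"],
        "who": ["man", "woman", "child"],
    }
)
cols = ["sex", "who"]

# Build one dtype whose categories are the union of both columns' values.
all_values = sorted(set(df["sex"]) | set(df["who"]))
shared = pd.CategoricalDtype(categories=all_values)

# Cast both columns to that shared dtype.
df = df.astype({c: shared for c in cols})

print(df["sex"].cat.categories)
print(df["who"].cat.categories)
```

Because both columns now share one dtype, they can be compared or used as merge keys without falling back to a non-categorical type.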
Index(['child', 'female', 'male', 'man', 'woman'], dtype='object')
Index(['child', 'female', 'male', 'man', 'woman'], dtype='object')
Custom order
| | quality |
|---|---|
| 1 | excellent |
| 0 | good |
| 2 | very good |
| | quality |
|---|---|
| 0 | good |
| 2 | very good |
| 1 | excellent |
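Both tables above could come from something like the following sketch (the three-row frame is reconstructed from the tables): sorting plain strings gives alphabetical order, while sorting an ordered categorical gives the logical order.

```python
import pandas as pd

df = pd.DataFrame({"quality": ["good", "excellent", "very good"]})

# Plain strings sort alphabetically: excellent, good, very good.
alphabetical = df.sort_values("quality")

# An ordered categorical dtype encodes the logical order instead.
order = pd.CategoricalDtype(["good", "very good", "excellent"], ordered=True)
df["quality"] = df["quality"].astype(order)
logical = df.sort_values("quality")

print(alphabetical.index.tolist())  # [1, 0, 2]
print(logical.index.tolist())       # [0, 2, 1]
```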
Unique values
`Series.unique` returns values in order of appearance, and only returns values that are present in the data.
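A quick sketch with invented data, contrasting the values returned by `unique` with the full list of declared categories:

```python
import pandas as pd

# "c" is a declared category that never appears in the data.
s = pd.Series(pd.Categorical(["b", "a", "b"], categories=["a", "b", "c"]))

# Values present in the data, in order of first appearance.
print(list(s.unique()))  # ['b', 'a']

# All declared categories, in category order.
print(list(s.cat.categories))  # ['a', 'b', 'c']
```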