Categoricals

Dask DataFrame divides categorical data into two types:

  • Known categoricals have their categories known statically (on the _meta attribute). Each partition must have the same categories as those found on the _meta attribute.

  • Unknown categoricals don’t know the categories statically, and may have different categories in each partition. Internally, unknown categoricals are indicated by the presence of dd.utils.UNKNOWN_CATEGORIES in the categories on the _meta attribute. Since most DataFrame operations propagate the categories, the known/unknown status should propagate through operations (similar to how NaN propagates).

For metadata specified as a description (option 2 above), unknown categoricals are created.

Certain operations are only available for known categoricals. For example, df.col.cat.categories only works if df.col has known categories, since the categorical mapping is only known statically on the metadata of known categoricals.

The known/unknown status for a categorical column can be found using the known property on the categorical accessor:

>>> ddf.col.cat.known
False

Additionally, an unknown categorical can be converted to known using .cat.as_known(). If you have multiple categorical columns in a DataFrame, you may instead want to use df.categorize(columns=...), which will convert all specified columns to known categoricals. Since getting the categories requires a full scan of the data, using df.categorize() is more efficient than calling .cat.as_known() for each column (which would result in multiple scans):

>>> col_known = ddf.col.cat.as_known()  # use for single column
>>> col_known.cat.known
True
>>> ddf_known = ddf.categorize()        # use for multiple columns
>>> ddf_known.col.cat.known
True

To convert a known categorical to an unknown categorical, there is also the .cat.as_unknown() method. This requires no computation as it’s just a change in the metadata.

Non-categorical columns can be converted to categoricals in a few different ways:

# astype operates lazily, and results in unknown categoricals
ddf = ddf.astype({'mycol': 'category', ...})
# or
ddf['mycol'] = ddf.mycol.astype('category')

# categorize requires computation, and results in known categoricals
ddf = ddf.categorize(columns=['mycol', ...])

Additionally, with Pandas 0.19.2 and up, dd.read_csv and dd.read_table can read data directly into unknown categorical columns by specifying a column dtype as 'category':

>>> ddf = dd.read_csv(..., dtype={col_name: 'category'})

Moreover, with Pandas 0.21.0 and up, dd.read_csv and dd.read_table can read data directly into known categoricals by specifying instances of pd.api.types.CategoricalDtype:

>>> dtype = {'col': pd.api.types.CategoricalDtype(['a', 'b', 'c'])}
>>> ddf = dd.read_csv(..., dtype=dtype)

If you write a DataFrame to Parquet and read it back, Dask will forget known categories. This happens because, due to performance concerns, all the categories are saved in every partition rather than in the Parquet metadata. It is possible to load the categories manually:

>>> import dask.dataframe as dd
>>> import pandas as pd
>>> df = pd.DataFrame(data=list('abcaabbcc'), columns=['col'])
>>> df.col = df.col.astype('category')
>>> ddf = dd.from_pandas(df, npartitions=1)
>>> ddf.col.cat.known
True
>>> ddf.to_parquet('tmp')
>>> ddf2 = dd.read_parquet('tmp')
>>> ddf2.col.cat.known
False
>>> ddf2.col = ddf2.col.cat.set_categories(ddf2.col.head(1).cat.categories)
>>> ddf2.col.cat.known
True