dask_expr.get_dummies

dask_expr.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=bool, **kwargs)

Convert categorical variable into dummy/indicator variables.

Data must have category dtype to infer result’s columns.

Parameters
data : Series or DataFrame

For Series, the dtype must be categorical. For DataFrame, at least one column must be categorical.

prefix : string, list of strings, or dict of strings, default None

String to prepend to the DataFrame column names. Pass a list with length equal to the number of columns when calling get_dummies on a DataFrame. Alternatively, prefix can be a dictionary mapping column names to prefixes.

prefix_sep : string, default '_'

Separator/delimiter to use between the prefix and the category name. Alternatively, pass a list or dictionary, as with prefix.

dummy_na : bool, default False

Add a column to indicate NaNs. If False, NaNs are ignored.

columns : list-like, default None

Column names in the DataFrame to be encoded. If columns is None then all the columns with category dtype will be converted.

sparse : bool, default False

Whether the dummy columns should be sparse or not. Returns SparseDataFrame if data is a Series or if all columns are included. Otherwise returns a DataFrame with some SparseBlocks.

New in version 0.18.2.

drop_first : bool, default False

Whether to get k-1 dummies out of k categorical levels by removing the first level.

dtype : dtype, default bool

Data type for new columns. Only a single dtype is allowed.

New in version 0.18.2.

Returns
dummies : DataFrame

Examples

Dask’s version only works with Categorical data, as this is the only way to know the output shape without computing all the data.

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> s = dd.from_pandas(pd.Series(list('abca')), npartitions=2)
>>> dd.get_dummies(s)
Traceback (most recent call last):
    ...
NotImplementedError: `get_dummies` with non-categorical dtypes is not supported...

With categorical data:

>>> s = dd.from_pandas(pd.Series(list('abca'), dtype='category'), npartitions=2)
>>> dd.get_dummies(s)  
Dask DataFrame Structure:
                   a      b      c
npartitions=2
0               bool   bool   bool
2                ...    ...    ...
3                ...    ...    ...
Dask Name: get_dummies, 2 graph layers
>>> dd.get_dummies(s).compute()  
       a      b      c
0   True  False  False
1  False   True  False
2  False  False   True
3   True  False  False