dask_expr.get_dummies
- dask_expr.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=bool, **kwargs)
Convert categorical variable into dummy/indicator variables.
Data must have category dtype to infer the result’s columns.
- Parameters
- data : Series or DataFrame
For Series, the dtype must be categorical. For DataFrame, at least one column must be categorical.
- prefix : string, list of strings, or dict of strings, default None
String to prepend to the generated column names. When calling get_dummies on a DataFrame, pass a list with length equal to the number of columns being encoded. Alternatively, prefix can be a dictionary mapping column names to prefixes.
- prefix_sep : string, default ‘_’
Separator to place between the prefix and the category name. Alternatively, pass a list or dictionary, as with prefix.
- dummy_na : bool, default False
Add a column to indicate NaNs. If False, NaNs are ignored.
- columns : list-like, default None
Column names in the DataFrame to be encoded. If columns is None then all the columns with category dtype will be converted.
- sparse : bool, default False
Whether the dummy columns should be sparse or not. Returns SparseDataFrame if data is a Series or if all columns are included. Otherwise returns a DataFrame with some SparseBlocks.
New in version 0.18.2.
- drop_first : bool, default False
Whether to get k-1 dummies out of k categorical levels by removing the first level.
- dtype : dtype, default bool
Data type for new columns. Only a single dtype is allowed.
New in version 0.18.2.
- Returns
- dummies : DataFrame
See also
Examples
Dask’s version only works with Categorical data, as this is the only way to know the output shape without computing all the data.
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> s = dd.from_pandas(pd.Series(list('abca')), npartitions=2)
>>> dd.get_dummies(s)
Traceback (most recent call last):
    ...
NotImplementedError: `get_dummies` with non-categorical dtypes is not supported...
With categorical data:
>>> s = dd.from_pandas(pd.Series(list('abca'), dtype='category'), npartitions=2)
>>> dd.get_dummies(s)
Dask DataFrame Structure:
                  a     b     c
npartitions=2
0              bool  bool  bool
2               ...   ...   ...
3               ...   ...   ...
Dask Name: get_dummies, 2 graph layers
>>> dd.get_dummies(s).compute()
       a      b      c
0   True  False  False
1  False   True  False
2  False  False   True
3   True  False  False