dask.dataframe.DataFrame.shuffle

DataFrame.shuffle(on: str | list | no_default = _NoDefault.no_default, ignore_index: bool = False, npartitions: int | None = None, shuffle_method: str | None = None, on_index: bool = False, force: bool = False, **options)

Rearrange DataFrame into new partitions

Hashes the values of ‘on’ to map rows to output partitions. After this operation, rows with the same value of ‘on’ will be in the same partition.

Parameters
on : str, list of str, Series, Index, or DataFrame

Column names to shuffle by.

ignore_index : bool, optional

Whether to ignore the index. Default is False.

npartitions : int, optional

Number of output partitions. The partition count will be preserved by default.

shuffle_method : str, optional

Desired shuffle method. Default chosen at optimization time.

on_index : bool, default False

Whether to shuffle on the index. Mutually exclusive with ‘on’. Set this to True if ‘on’ is not provided.

force : bool, default False

This forces the optimizer to keep the shuffle even if the final expression could be further simplified.

**options : optional

Algorithm-specific options.

Notes

The result does not preserve a meaningful index or partitioning scheme, and it is not deterministic if computed in parallel.

Examples

>>> df = df.shuffle(df.columns[0])