dask.dataframe.DataFrame.shuffle

DataFrame.shuffle(on, npartitions=None, max_branch=None, shuffle_method=None, ignore_index=False, compute=None)

Rearrange DataFrame into new partitions

Uses hashing of on to map rows to output partitions. After this operation, rows with the same value of on will be in the same partition.

Parameters
on: str, list of str, or Series, Index, or DataFrame

Column(s) or index to be used to map rows to output partitions

npartitions: int, optional

Number of output partitions. The partition count is not changed by default (see the usage sketch after this parameter list).

max_branch: int, optional

The maximum number of splits per input partition. Used within the staged shuffling algorithm.

shuffle_method: {'disk', 'tasks', 'p2p'}, optional

Either 'disk' for single-node operation, or 'tasks' or 'p2p' for distributed operation. If not specified, the method is inferred from the current scheduler.

ignore_index: bool, default False

Ignore index during shuffle. If True, performance may improve, but index values will not be preserved.

compute: bool

Whether or not to trigger an immediate computation. Defaults to False.
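For illustration only, a minimal sketch combining these keywords; the pandas frame and the column name "id" are assumptions, not part of the API:

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame({"id": [1, 2, 1, 3], "x": range(4)})  # assumed toy data
>>> ddf = dd.from_pandas(pdf, npartitions=2)
>>> shuffled = ddf.shuffle(
...     "id",                    # hash rows by the "id" column
...     npartitions=4,           # change the output partition count
...     shuffle_method="tasks",  # request the task-based algorithm
...     ignore_index=True,       # trade the index for a faster shuffle
... )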

Notes

This operation does not preserve a meaningful index or partitioning scheme, and its result is not deterministic when run in parallel.

Examples

>>> df = df.shuffle(df.columns[0])
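
A slightly fuller sketch, assuming a small pandas frame with a "key" column (both the data and the column name are illustrative): after the shuffle, all rows sharing a "key" value land in the same output partition, which can be inspected through the partitions accessor.

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame({"key": [0, 1, 0, 1, 2], "value": range(5)})
>>> ddf = dd.from_pandas(pdf, npartitions=3)
>>> shuffled = ddf.shuffle("key")
>>> shuffled.partitions[0].compute()  # rows whose "key" hashes to the first output partition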