dask.dataframe.DataFrame.repartition

dask.dataframe.DataFrame.repartition#

DataFrame.repartition(divisions: tuple | None = None, npartitions: int | None = None, partition_size: str | None = None, freq=None, force: bool = False)#

Repartition a collection

Exactly one of divisions, npartitions or partition_size should be specified. A ValueError will be raised when that is not the case.

Parameters:

divisionslist, optional: The “dividing lines” used to split the dataframe into partitions. For divisions=[0, 10, 50, 100], there would be three output partitions, where the new index contained [0, 10), [10, 50), and [50, 100), respectively. See https://docs.dask.org/en/latest/dataframe-design.html#partitions.
npartitionsint | Callable, optional: Approximate number of partitions of output. The number of partitions used may be slightly lower than npartitions depending on data distribution, but will never be higher. The Callable gets the number of partitions of the input as an argument and should return an int.
partition_sizestr, optional: Max number of bytes of memory for each partition. Use numbers or strings like 5MB. If specified npartitions and divisions will be ignored. Note that the size reflects the number of bytes used as computed by pandas.DataFrame.memory_usage, which will not necessarily match the size when storing to disk.

Warning

This keyword argument triggers computation to determine the memory size of each partition, which may be expensive.
forcebool, default False: Allows the expansion of the existing divisions. If False then the new divisions’ lower and upper bounds must be the same as the old divisions’.
freqstr, pd.Timedelta: A period on which to partition timeseries data like '7D' or '12h' or pd.Timedelta(hours=12). Assumes a datetime index.

See also

DataFrame.memory_usage_per_partition
pandas.DataFrame.memory_usage

Notes

Exactly one of divisions, npartitions, partition_size, or freq should be specified. A ValueError will be raised when that is not the case.

Also note that len(divisions) is equal to npartitions + 1. This is because divisions represents the upper and lower bounds of each partition. The first item is the lower bound of the first partition, the second item is the lower bound of the second partition and the upper bound of the first partition, and so on. The second-to-last item is the lower bound of the last partition, and the last (extra) item is the upper bound of the last partition.

Examples

>>> df = df.repartition(npartitions=10)
>>> df = df.repartition(divisions=[0, 5, 10, 20])
>>> df = df.repartition(freq='7d')

dask.dataframe.DataFrame.repartition

Contents

dask.dataframe.DataFrame.repartition#