dask.dataframe.DataFrame.set_index

dask.dataframe.DataFrame.set_index

DataFrame.set_index(other: str | dask.dataframe.core.Series, drop: bool = True, sorted: bool = False, npartitions: Optional[Union[int, Literal['auto']]] = None, divisions: collections.abc.Sequence | None = None, inplace: bool = False, sort: bool = True, **kwargs)[source]

Set the DataFrame index (row labels) using an existing column.

If sort=False, this function operates exactly like pandas.set_index and sets the index on the DataFrame. If sort=True (default), this function also sorts the DataFrame by the new index. This can have a significant impact on performance, because joins, groupbys, lookups, etc. are all much faster on that column. However, this performance increase comes with a cost, sorting a parallel dataset requires expensive shuffles. Often we set_index once directly after data ingest and filtering and then perform many cheap computations off of the sorted dataset.

With sort=True, this function is much more expensive. Under normal operation this function does an initial pass over the index column to compute approximate quantiles to serve as future divisions. It then passes over the data a second time, splitting up each input partition into several pieces and sharing those pieces to all of the output partitions now in sorted order.

In some cases we can alleviate those costs, for example if your dataset is sorted already then we can avoid making many small pieces or if you know good values to split the new index column then we can avoid the initial pass over the data. For example if your new index is a datetime index and your data is already sorted by day then this entire operation can be done for free. You can control these options with the following parameters.

Parameters
other: string or Dask Series

Column to use as index.

drop: boolean, default True

Delete column to be used as the new index.

sorted: bool, optional

If the index column is already sorted in increasing order. Defaults to False

npartitions: int, None, or ‘auto’

The ideal number of output partitions. If None, use the same as the input. If ‘auto’ then decide by memory use. Only used when divisions is not given. If divisions is given, the number of output partitions will be len(divisions) - 1.

divisions: list, optional

The “dividing lines” used to split the new index into partitions. For divisions=[0, 10, 50, 100], there would be three output partitions, where the new index contained [0, 10), [10, 50), and [50, 100), respectively. See https://docs.dask.org/en/latest/dataframe-design.html#partitions. If not given (default), good divisions are calculated by immediately computing the data and looking at the distribution of its values. For large datasets, this can be expensive. Note that if sorted=True, specified divisions are assumed to match the existing partitions in the data; if this is untrue you should leave divisions empty and call repartition after set_index.

inplace: bool, optional

Modifying the DataFrame in place is not supported by Dask. Defaults to False.

sort: bool, optional

If True, sort the DataFrame by the new index. Otherwise set the index on the individual existing partitions. Defaults to True.

shuffle_method: {‘disk’, ‘tasks’, ‘p2p’}, optional

Either 'disk' for single-node operation or 'tasks' and 'p2p' for distributed operation. Will be inferred by your current scheduler.

compute: bool, default False

Whether or not to trigger an immediate computation. Defaults to False. Note, that even if you set compute=False, an immediate computation will still be triggered if divisions is None.

partition_size: int, optional

Desired size of each partitions in bytes. Only used when npartitions='auto'

Examples

>>> import dask
>>> ddf = dask.datasets.timeseries(start="2021-01-01", end="2021-01-07", freq="1h").reset_index()
>>> ddf2 = ddf.set_index("x")
>>> ddf2 = ddf.set_index(ddf.x)
>>> ddf2 = ddf.set_index(ddf.timestamp, sorted=True)

A common case is when we have a datetime column that we know to be sorted and is cleanly divided by day. We can set this index for free by specifying both that the column is pre-sorted and the particular divisions along which is is separated

>>> import pandas as pd
>>> divisions = pd.date_range(start="2021-01-01", end="2021-01-07", freq='1D')
>>> divisions
DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06', '2021-01-07'],
              dtype='datetime64[ns]', freq='D')

Note that len(divisions) is equal to npartitions + 1. This is because divisions represents the upper and lower bounds of each partition. The first item is the lower bound of the first partition, the second item is the lower bound of the second partition and the upper bound of the first partition, and so on. The second-to-last item is the lower bound of the last partition, and the last (extra) item is the upper bound of the last partition.

>>> ddf2 = ddf.set_index("timestamp", sorted=True, divisions=divisions.tolist())

If you’ll be running set_index on the same (or similar) datasets repeatedly, you could save time by letting Dask calculate good divisions once, then copy-pasting them to reuse. This is especially helpful running in a Jupyter notebook:

>>> ddf2 = ddf.set_index("name")  # slow, calculates data distribution
>>> ddf2.divisions  
["Alice", "Laura", "Ursula", "Zelda"]
>>> # ^ Now copy-paste this and edit the line above to:
>>> # ddf2 = ddf.set_index("name", divisions=["Alice", "Laura", "Ursula", "Zelda"])