- DataFrame.set_index(other, drop=True, sorted=False, npartitions=None, divisions=None, inplace=False, **kwargs)¶
Set the DataFrame index (row labels) using an existing column.
This realigns the dataset to be sorted by a new column. This can have a significant impact on performance, because joins, groupbys, lookups, etc. are all much faster on that column. However, this performance increase comes with a cost, sorting a parallel dataset requires expensive shuffles. Often we
set_indexonce directly after data ingest and filtering and then perform many cheap computations off of the sorted dataset.
This function operates exactly like
pandas.set_indexexcept with different performance costs (dask dataframe
set_indexis much more expensive). Under normal operation this function does an initial pass over the index column to compute approximate quantiles to serve as future divisions. It then passes over the data a second time, splitting up each input partition into several pieces and sharing those pieces to all of the output partitions now in sorted order.
In some cases we can alleviate those costs, for example if your dataset is sorted already then we can avoid making many small pieces or if you know good values to split the new index column then we can avoid the initial pass over the data. For example if your new index is a datetime index and your data is already sorted by day then this entire operation can be done for free. You can control these options with the following parameters.
- other: string or Dask Series
- drop: boolean, default True
Delete column to be used as the new index.
- sorted: bool, optional
If the index column is already sorted in increasing order. Defaults to False
- npartitions: int, None, or ‘auto’
The ideal number of output partitions. If None, use the same as the input. If ‘auto’ then decide by memory use. Only used when
divisionsis not given. If
divisionsis given, the number of output partitions will be
len(divisions) - 1.
- divisions: list, optional
The “dividing lines” used to split the new index into partitions. For
divisions=[0, 10, 50, 100], there would be three output partitions, where the new index contained [0, 10), [10, 50), and [50, 100), respectively. See https://docs.dask.org/en/latest/dataframe-design.html#partitions. If not given (default), good divisions are calculated by immediately computing the data and looking at the distribution of its values. For large datasets, this can be expensive. Note that if
sorted=True, specified divisions are assumed to match the existing partitions in the data; if this is untrue you should leave divisions empty and call
- inplace: bool, optional
Modifying the DataFrame in place is not supported by Dask. Defaults to False.
- shuffle: string, ‘disk’ or ‘tasks’, optional
'disk'for single-node operation or
'tasks'for distributed operation. Will be inferred by your current scheduler.
- compute: bool, default False
Whether or not to trigger an immediate computation. Defaults to False. Note, that even if you set
compute=False, an immediate computation will still be triggered if
- partition_size: int, optional
Desired size of each partitions in bytes. Only used when
>>> import dask >>> ddf = dask.datasets.timeseries(start="2021-01-01", end="2021-01-07", freq="1H").reset_index() >>> ddf2 = ddf.set_index("x") >>> ddf2 = ddf.set_index(ddf.x) >>> ddf2 = ddf.set_index(ddf.timestamp, sorted=True)
A common case is when we have a datetime column that we know to be sorted and is cleanly divided by day. We can set this index for free by specifying both that the column is pre-sorted and the particular divisions along which is is separated
>>> import pandas as pd >>> divisions = pd.date_range(start="2021-01-01", end="2021-01-07", freq='1D') >>> divisions DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05', '2021-01-06', '2021-01-07'], dtype='datetime64[ns]', freq='D')
len(divisons)is equal to
npartitions + 1. This is because
divisionsrepresents the upper and lower bounds of each partition. The first item is the lower bound of the first partition, the second item is the lower bound of the second partition and the upper bound of the first partition, and so on. The second-to-last item is the lower bound of the last partition, and the last (extra) item is the upper bound of the last partition.
>>> ddf2 = ddf.set_index("timestamp", sorted=True, divisions=divisions.tolist())
If you’ll be running set_index on the same (or similar) datasets repeatedly, you could save time by letting Dask calculate good divisions once, then copy-pasting them to reuse. This is especially helpful running in a Jupyter notebook:
>>> ddf2 = ddf.set_index("name") # slow, calculates data distribution >>> ddf2.divisions ["Alice", "Laura", "Ursula", "Zelda"] >>> # ^ Now copy-paste this and edit the line above to: >>> # ddf2 = ddf.set_index("name", divisions=["Alice", "Laura", "Ursula", "Zelda"])