- DataFrame.set_index(other, drop=True, sorted=False, npartitions=None, divisions=None, inplace=False, **kwargs)¶
Set the DataFrame index (row labels) using an existing column.
This realigns the dataset to be sorted by a new column. This can have a significant impact on performance, because joins, groupbys, lookups, etc. are all much faster on that column. However, this performance increase comes with a cost, sorting a parallel dataset requires expensive shuffles. Often we
set_indexonce directly after data ingest and filtering and then perform many cheap computations off of the sorted dataset.
This function operates exactly like
pandas.set_indexexcept with different performance costs (dask dataframe
set_indexis much more expensive). Under normal operation this function does an initial pass over the index column to compute approximate quantiles to serve as future divisions. It then passes over the data a second time, splitting up each input partition into several pieces and sharing those pieces to all of the output partitions now in sorted order.
In some cases we can alleviate those costs, for example if your dataset is sorted already then we can avoid making many small pieces or if you know good values to split the new index column then we can avoid the initial pass over the data. For example if your new index is a datetime index and your data is already sorted by day then this entire operation can be done for free. You can control these options with the following parameters.
- other: string or Dask Series
- drop: boolean, default True
Delete column to be used as the new index.
- sorted: bool, optional
If the index column is already sorted in increasing order. Defaults to False
- npartitions: int, None, or ‘auto’
The ideal number of output partitions. If None, use the same as the input. If ‘auto’ then decide by memory use.
- divisions: list, optional
Known values on which to separate index values of the partitions. See https://docs.dask.org/en/latest/dataframe-design.html#partitions Defaults to computing this with a single pass over the data. Note that if
sorted=True, specified divisions are assumed to match the existing partitions in the data. If
sorted=False, you should leave divisions empty and call
- inplace: bool, optional
Modifying the DataFrame in place is not supported by Dask. Defaults to False.
- shuffle: string, ‘disk’ or ‘tasks’, optional
'disk'for single-node operation or
'tasks'for distributed operation. Will be inferred by your current scheduler.
- compute: bool, default False
Whether or not to trigger an immediate computation. Defaults to False. Note, that even if you set
compute=False, an immediate computation will still be triggered if
- partition_size: int, optional
Desired size of each partitions in bytes. Only used when
>>> df2 = df.set_index('x') >>> df2 = df.set_index(d.x) >>> df2 = df.set_index(d.timestamp, sorted=True)
A common case is when we have a datetime column that we know to be sorted and is cleanly divided by day. We can set this index for free by specifying both that the column is pre-sorted and the particular divisions along which is is separated
>>> import pandas as pd >>> divisions = pd.date_range('2000', '2010', freq='1D') >>> df2 = df.set_index('timestamp', sorted=True, divisions=divisions)