Indexing into Dask DataFrames

Dask DataFrame supports some of Pandas’ indexing behavior.

DataFrame.iloc

Purely integer-location based indexing for selection by position.

DataFrame.loc

Purely label-location based indexer for selection by label.

Label-based Indexing

Just like Pandas, Dask DataFrame supports label-based indexing with the .loc accessor for selecting rows or columns, and __getitem__ (square brackets) for selecting just columns.

Note

To select rows, the DataFrame’s divisions must be known (see Dask DataFrame Design and Dask DataFrames Best Practices for more information).

>>> import dask.dataframe as dd
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [3, 4, 5]},
...                   index=['a', 'b', 'c'])
>>> ddf = dd.from_pandas(df, npartitions=2)
>>> ddf
Dask DataFrame Structure:
                   A      B
npartitions=1
a              int64  int64
c                ...    ...
Dask Name: from_pandas, 1 tasks

Selecting columns:

>>> ddf[['B', 'A']]
Dask DataFrame Structure:
                   B      A
npartitions=1
a              int64  int64
c                ...    ...
Dask Name: getitem, 2 tasks

Selecting a single column reduces to a Dask Series:

>>> ddf['A']
Dask Series Structure:
npartitions=1
a    int64
c      ...
Name: A, dtype: int64
Dask Name: getitem, 2 tasks

Slicing rows and (optionally) columns with .loc:

>>> ddf.loc[['b', 'c'], ['A']]
Dask DataFrame Structure:
                   A
npartitions=1
b              int64
c                ...
Dask Name: loc, 2 tasks

>>> ddf.loc[ddf["A"] > 1, ["B"]]
Dask DataFrame Structure:
                   B
npartitions=1
a              int64
c                ...
Dask Name: try_loc, 2 tasks

>>> ddf.loc[lambda df: df["A"] > 1, ["B"]]
Dask DataFrame Structure:
                   B
npartitions=1
a              int64
c                ...
Dask Name: try_loc, 2 tasks

Dask DataFrame supports Pandas’ partial-string indexing:

>>> ts = dd.demo.make_timeseries()
>>> ts
Dask DataFrame Structure:
                   id    name        x        y
npartitions=11
2000-01-31      int64  object  float64  float64
2000-02-29        ...     ...      ...      ...
...               ...     ...      ...      ...
2000-11-30        ...     ...      ...      ...
2000-12-31        ...     ...      ...      ...
Dask Name: make-timeseries, 11 tasks

>>> ts.loc['2000-02-12']
Dask DataFrame Structure:
                                  id    name        x        y
npartitions=1
2000-02-12 00:00:00.000000000  int64  object  float64  float64
2000-02-12 23:59:59.999999999    ...     ...      ...      ...
Dask Name: loc, 12 tasks

Positional Indexing

Dask DataFrame does not track the length of its partitions, which makes positional row indexing with .iloc inefficient. DataFrame.iloc() only supports indexers where the row indexer is slice(None) (for which : is shorthand).

>>> ddf.iloc[:, [1, 0]]
Dask DataFrame Structure:
                   B      A
npartitions=1
a              int64  int64
c                ...    ...
Dask Name: iloc, 2 tasks

Trying to select specific rows with iloc will raise an exception:

>>> ddf.iloc[[0, 2], [1]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: 'DataFrame.iloc' does not support slicing rows. The indexer must be a 2-tuple whose first item is 'slice(None)'.

Partition Indexing

In addition to pandas-style indexing, Dask DataFrame also supports indexing at a partition level with DataFrame.get_partition() and DataFrame.partitions. These can be used to select subsets of the data by partition, rather than by position in the entire DataFrame or index label.

Use DataFrame.get_partition() to select a single partition by position.

>>> import dask
>>> ddf = dask.datasets.timeseries(start="2021-01-01", end="2021-01-07", freq="1h")
>>> ddf.get_partition(0)
Dask DataFrame Structure:
                 name     id        x        y
npartitions=1
2021-01-01     object  int64  float64  float64
2021-01-02        ...    ...      ...      ...
Dask Name: get-partition, 2 graph layers

Note that the result is also a Dask DataFrame.

Index into DataFrame.partitions to select one or more partitions. For example, you can select every other partition with a slice:

>>> ddf.partitions[::2]
Dask DataFrame Structure:
                 name     id        x        y
npartitions=3
2021-01-01     object  int64  float64  float64
2021-01-03        ...    ...      ...      ...
2021-01-05        ...    ...      ...      ...
2021-01-06        ...    ...      ...      ...
Dask Name: blocks, 2 graph layers

You can also make more complicated selections based on the data in the partitions themselves, at the cost of computing the DataFrame up to that point. For example, we can create a boolean mask selecting the partitions that have more than some number of unique IDs using DataFrame.map_partitions():

>>> mask = ddf.id.map_partitions(lambda p: len(p.unique()) > 20).compute()
>>> ddf.partitions[mask]
Dask DataFrame Structure:
                 name     id        x        y
npartitions=5
2021-01-01     object  int64  float64  float64
2021-01-02        ...    ...      ...      ...
...               ...    ...      ...      ...
2021-01-06        ...    ...      ...      ...
2021-01-07        ...    ...      ...      ...
Dask Name: blocks, 2 graph layers