.. _dataframe.indexing:

Indexing into Dask DataFrames
=============================

Dask DataFrame supports some of Pandas' indexing behavior.

.. currentmodule:: dask.dataframe

.. autosummary::

   DataFrame.iloc
   DataFrame.loc

Label-based Indexing
--------------------

Just like Pandas, Dask DataFrame supports label-based indexing with the
``.loc`` accessor for selecting rows or columns, and ``__getitem__`` (square
brackets) for selecting just columns.

.. note::

   To select rows, the DataFrame's divisions must be known (see
   :ref:`dataframe.design` and :ref:`dataframe.performance` for more
   information).

.. code-block:: python

   >>> import dask.dataframe as dd
   >>> import pandas as pd
   >>> df = pd.DataFrame({"A": [1, 2, 3], "B": [3, 4, 5]},
   ...                   index=['a', 'b', 'c'])
   >>> ddf = dd.from_pandas(df, npartitions=2)
   >>> ddf
   Dask DataFrame Structure:
                      A      B
   npartitions=1
   a              int64  int64
   c                ...    ...
   Dask Name: from_pandas, 1 tasks

Selecting columns:

.. code-block:: python

   >>> ddf[['B', 'A']]
   Dask DataFrame Structure:
                      B      A
   npartitions=1
   a              int64  int64
   c                ...    ...
   Dask Name: getitem, 2 tasks

Selecting a single column reduces to a Dask Series:

.. code-block:: python

   >>> ddf['A']
   Dask Series Structure:
   npartitions=1
   a    int64
   c      ...
   Name: A, dtype: int64
   Dask Name: getitem, 2 tasks

Slicing rows and (optionally) columns with ``.loc``:

.. code-block:: python

   >>> ddf.loc[['b', 'c'], ['A']]
   Dask DataFrame Structure:
                      A
   npartitions=1
   b              int64
   c                ...
   Dask Name: loc, 2 tasks

   >>> ddf.loc[ddf["A"] > 1, ["B"]]
   Dask DataFrame Structure:
                      B
   npartitions=1
   a              int64
   c                ...
   Dask Name: try_loc, 2 tasks

   >>> ddf.loc[lambda df: df["A"] > 1, ["B"]]
   Dask DataFrame Structure:
                      B
   npartitions=1
   a              int64
   c                ...
   Dask Name: try_loc, 2 tasks

Dask DataFrame supports Pandas' `partial-string indexing
<https://pandas.pydata.org/docs/user_guide/timeseries.html#partial-string-indexing>`_:

.. code-block:: python

   >>> ts = dd.demo.make_timeseries()
   >>> ts
   Dask DataFrame Structure:
                      id    name        x        y
   npartitions=11
   2000-01-31      int64  object  float64  float64
   2000-02-29        ...     ...      ...      ...
   ...               ...     ...      ...      ...
   2000-11-30        ...     ...      ...      ...
   2000-12-31        ...     ...      ...      ...
   Dask Name: make-timeseries, 11 tasks
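As an illustrative sketch (the frame and values below are made up for the
example, not part of the documentation above), a partial date string selects
every row falling within that period, and ``.compute()`` materializes the
result:

.. code-block:: python

   import pandas as pd
   import dask.dataframe as dd

   # A small frame with a daily DatetimeIndex; from_pandas on a sorted
   # index gives the known divisions that row selection requires.
   idx = pd.date_range("2000-01-01", periods=6, freq="D")
   df = pd.DataFrame({"x": range(6)}, index=idx)
   ddf = dd.from_pandas(df, npartitions=2)

   # The string '2000-01-03' covers the whole day, just as in Pandas.
   result = ddf.loc["2000-01-03"].compute()

Here ``result`` is an ordinary pandas DataFrame containing only the single
row whose timestamp falls on that day.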
Specific dates can then be selected by label with ``.loc``:

.. code-block:: python

   >>> ts.loc['2000-02-12']
   Dask DataFrame Structure:
                                     id    name        x        y
   npartitions=1
   2000-02-12 00:00:00.000000000  int64  object  float64  float64
   2000-02-12 23:59:59.999999999    ...     ...      ...      ...
   Dask Name: loc, 12 tasks

Positional Indexing
-------------------

Dask DataFrame does not track the length of partitions, making positional
indexing with ``.iloc`` inefficient for selecting rows. :meth:`DataFrame.iloc`
only supports indexers where the row indexer is ``slice(None)`` (for which
``:`` is shorthand).

.. code-block:: python

   >>> ddf.iloc[:, [1, 0]]
   Dask DataFrame Structure:
                      B      A
   npartitions=1
   a              int64  int64
   c                ...    ...
   Dask Name: iloc, 2 tasks

Trying to select specific rows with ``iloc`` will raise an exception:

.. code-block:: python

   >>> ddf.iloc[[0, 2], [1]]
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   ValueError: 'DataFrame.iloc' does not support slicing rows. The indexer must be a 2-tuple whose first item is 'slice(None)'.

Partition Indexing
------------------

In addition to pandas-style indexing, Dask DataFrame also supports indexing at
a partition level with :meth:`DataFrame.get_partition` and
:attr:`DataFrame.partitions`. These can be used to select subsets of the data
by partition, rather than by position in the entire DataFrame or by index
label.

Use :meth:`DataFrame.get_partition` to select a single partition by position.

.. code-block:: python

   >>> import dask
   >>> ddf = dask.datasets.timeseries(start="2021-01-01", end="2021-01-07", freq="1h")
   >>> ddf.get_partition(0)
   Dask DataFrame Structure:
                    name     id        x        y
   npartitions=1
   2021-01-01     object  int64  float64  float64
   2021-01-02        ...    ...      ...      ...
   Dask Name: get-partition, 2 graph layers

Note that the result is also a Dask DataFrame.
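Because :meth:`DataFrame.get_partition` is lazy, a single partition can be
materialized on its own with ``.compute()``. A minimal sketch, assuming a
small illustrative frame (names and values here are invented for the
example):

.. code-block:: python

   import pandas as pd
   import dask.dataframe as dd

   # Four rows split evenly into two partitions of two rows each.
   df = pd.DataFrame({"A": [1, 2, 3, 4]}, index=["a", "b", "c", "d"])
   ddf = dd.from_pandas(df, npartitions=2)

   # get_partition is lazy: the result is itself a Dask DataFrame.
   part = ddf.get_partition(1)

   # Materialize only that partition as a pandas DataFrame.
   pdf = part.compute()

Only the selected partition's tasks run; the rest of the DataFrame is never
loaded or computed.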
Index into :attr:`DataFrame.partitions` to select one or more partitions. For
example, you can select every other partition with a slice:

.. code-block:: python

   >>> ddf.partitions[::2]
   Dask DataFrame Structure:
                    name     id        x        y
   npartitions=3
   2021-01-01     object  int64  float64  float64
   2021-01-03        ...    ...      ...      ...
   2021-01-05        ...    ...      ...      ...
   2021-01-06        ...    ...      ...      ...
   Dask Name: blocks, 2 graph layers

You can also make more complicated selections based on the data in the
partitions themselves (at the cost of computing the DataFrame up until that
point). For example, we can create a boolean mask selecting the partitions
that have more than some number of unique IDs using
:meth:`DataFrame.map_partitions`:

.. code-block:: python

   >>> mask = ddf.id.map_partitions(lambda p: len(p.unique()) > 20).compute()
   >>> ddf.partitions[mask]
   Dask DataFrame Structure:
                    name     id        x        y
   npartitions=5
   2021-01-01     object  int64  float64  float64
   2021-01-02        ...    ...      ...      ...
   ...               ...    ...      ...      ...
   2021-01-06        ...    ...      ...      ...
   2021-01-07        ...    ...      ...      ...
   Dask Name: blocks, 2 graph layers
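The same mask-then-select pattern can be checked end to end on a small,
deterministic frame. This sketch assumes invented data (three partitions of
two rows each); the threshold and column name are illustrative, not part of
any Dask API:

.. code-block:: python

   import pandas as pd
   import dask.dataframe as dd

   # Six rows split into three partitions of two rows each.
   df = pd.DataFrame({"id": [1, 1, 2, 3, 4, 4]})
   ddf = dd.from_pandas(df, npartitions=3)

   # One boolean per partition: True where 'id' has more than one
   # unique value within that partition.
   mask = ddf.id.map_partitions(lambda p: p.nunique() > 1).compute()

   # Keep only the matching partitions; the result is a smaller
   # Dask DataFrame, still lazy until computed.
   selected = ddf.partitions[mask]

Here only the middle partition (rows ``[2, 3]``) has two distinct IDs, so
``selected`` contains exactly one partition.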