Extending DataFrames

Subclass DataFrames¶

There are a few projects that subclass or replicate the functionality of Pandas objects:

GeoPandas: for Geospatial analytics
cuDF: for data analysis on GPUs
…

These projects may also want to produce parallel variants of themselves with Dask, and may want to reuse some of the code in Dask DataFrame. Subclassing Dask DataFrames is intended for maintainers of these libraries and not for general users.

Implement dask, name, meta, and divisions¶

You will need to implement ._meta, .dask, .divisions, and ._name as defined in the DataFrame design docs.

Extend Dispatched Methods¶

If you are going to pass around Pandas-like objects that are not normal Pandas objects, then we ask you to extend a few dispatched methods: make_meta, get_collection_type, and concat.
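The registration pattern behind these methods is type-based dispatch: the implementation is looked up from the type of the first argument. The sketch below imitates that pattern with the standard library's functools.singledispatch. It is only an analogy and not Dask's own dispatch machinery; ToyFrame and make_empty are hypothetical names.

```python
from functools import singledispatch


class ToyFrame:
    """Hypothetical stand-in for a Pandas-like container."""

    def __init__(self, rows):
        self.rows = list(rows)


@singledispatch
def make_empty(obj):
    # Fallback used when no implementation is registered for type(obj).
    raise TypeError(f"no implementation registered for {type(obj).__name__}")


@make_empty.register(ToyFrame)
def _(obj):
    # Chosen whenever the argument is a ToyFrame (or a subclass of it).
    return ToyFrame([])


@make_empty.register(list)
def _(obj):
    # A second registration, to show that each type gets its own body.
    return []


empty = make_empty(ToyFrame([1, 2, 3]))
print(type(empty).__name__, empty.rows)
```

Registering one function per type keeps the calling code generic: it calls a single function and the right implementation is selected for whatever object it was given.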

make_meta¶

This function returns an empty version of one of your non-Dask objects, given a non-empty non-Dask object:

from dask.dataframe.dispatch import make_meta_dispatch

@make_meta_dispatch.register(MyDataFrame)
def make_meta_dataframe(df, index=None):
    return df.head(0)


@make_meta_dispatch.register(MySeries)
def make_meta_series(s, index=None):
    return s.head(0)


@make_meta_dispatch.register(MyIndex)
def make_meta_index(ind, index=None):
    return ind[:0]
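To illustrate why head(0) is a reasonable body for these functions, the sketch below applies it to a plain pandas DataFrame, which stands in for the hypothetical MyDataFrame. The column names and index name are made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [1, 2, 3], "y": [0.5, 1.5, 2.5]},
    index=pd.Index(["a", "b", "c"], name="key"),
)

# Zero rows, but columns, dtypes, and the index name are kept.
meta = df.head(0)

print(len(meta))
print(list(meta.columns))
print(meta.dtypes)
print(meta.index.name)
```

An object like meta carries the schema without any data, which is what an "empty version" means here.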

For dispatching any arbitrary object types to a respective back-end, we recommend registering a dispatch for make_meta_obj:

from dask.dataframe.dispatch import make_meta_obj

@make_meta_obj.register(MyDataFrame)
def make_meta_object(x, index=None):
    if isinstance(x, dict):
        return MyDataFrame()
    elif isinstance(x, int):
        return MySeries()  # return an instance, not the class
    ...  # handle further input types as needed

Additionally, you should create a similar function that returns a non-empty version of your non-Dask DataFrame objects, filled with a few rows of representative or random data. Dask uses this to guess types when they are not provided. The function receives an empty version of your object that carries the columns, dtypes, and index name, and it should return a non-empty version with the same schema:

from dask.dataframe.utils import meta_nonempty

@meta_nonempty.register(MyDataFrame)
def meta_nonempty_dataframe(df):
    ...
    return MyDataFrame(..., columns=df.columns,
                       index=MyIndex(..., name=df.index.name))


@meta_nonempty.register(MySeries)
def meta_nonempty_series(s):
    ...


@meta_nonempty.register(MyIndex)
def meta_nonempty_index(ind):
    ...
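As a rough illustration of what such a function might do, the sketch below fills an empty pandas DataFrame with placeholder rows while preserving its columns, dtypes, and index name. It does not use Dask and is not Dask's implementation; nonempty_like is a hypothetical helper, and it only handles boolean, numeric, and object columns.

```python
import pandas as pd


def nonempty_like(empty, nrows=2):
    """Hypothetical helper: return `nrows` placeholder rows matching `empty`."""
    columns = {}
    for name, dtype in empty.dtypes.items():
        # Check bool first, because pandas also treats bool as numeric.
        if pd.api.types.is_bool_dtype(dtype):
            values = [True] * nrows
        elif pd.api.types.is_numeric_dtype(dtype):
            values = [1] * nrows
        else:
            values = ["foo"] * nrows  # assumes an object/string column
        columns[name] = pd.Series(values, dtype=dtype)
    # Keep the index name so downstream code sees the same schema.
    index = pd.RangeIndex(nrows, name=empty.index.name)
    return pd.DataFrame(columns, index=index)


empty = pd.DataFrame(
    {
        "a": pd.Series(dtype="int64"),
        "b": pd.Series(dtype="float64"),
        "c": pd.Series(dtype="object"),
    }
)
empty.index.name = "key"

sample = nonempty_like(empty)
print(sample)
```

The placeholder values themselves do not matter; what matters is that operations applied to the sample produce results with the same types they would have on real data.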

get_collection_type¶

Given a non-Dask DataFrame object, return the Dask equivalent:

from dask.dataframe import get_collection_type

@get_collection_type.register(MyDataFrame)
def get_collection_type_dataframe(df):
    return MyDaskDataFrame


@get_collection_type.register(MySeries)
def get_collection_type_series(s):
    return MyDaskSeries


@get_collection_type.register(MyIndex)
def get_collection_type_index(ind):
    return MyDaskIndex

concat¶

Concatenate many of your non-Dask DataFrame objects together. It should expect a list of your objects (homogeneously typed):

from dask.dataframe.methods import concat_dispatch

@concat_dispatch.register((MyDataFrame, MySeries, MyIndex))
def concat_pandas(dfs, axis=0, join='outer', uniform=False, filter_warning=True):
    ...
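For objects backed by pandas, the body of such a function could simply delegate to pandas. The following is a minimal sketch under that assumption: concat_frames is a hypothetical name, it is not registered with any dispatcher, and the uniform and filter_warning parameters are accepted only to match the signature above and are otherwise ignored.

```python
import pandas as pd


def concat_frames(dfs, axis=0, join="outer", uniform=False, filter_warning=True):
    """Hypothetical concat body for a homogeneously typed list of objects."""
    if not dfs:
        raise ValueError("expected at least one object to concatenate")
    # `uniform` and `filter_warning` are unused in this sketch.
    return pd.concat(dfs, axis=axis, join=join)


parts = [pd.DataFrame({"x": [1, 2]}), pd.DataFrame({"x": [3]})]
combined = concat_frames(parts)
print(combined["x"].tolist())
```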

Extension Arrays¶

Rather than subclassing Pandas DataFrames, you may be interested in extending Pandas with Extension Arrays.

All of the first-party extension arrays (those implemented in pandas itself) are supported directly by dask.

Developers implementing third-party extension arrays (outside of pandas) will need to register their ExtensionDtype with Dask so that it works correctly in dask.dataframe.

For example, we’ll register the test-only DecimalDtype from the pandas test suite.

from decimal import Decimal
from dask.dataframe.extensions import make_array_nonempty, make_scalar
from pandas.tests.extension.decimal import DecimalArray, DecimalDtype

@make_array_nonempty.register(DecimalDtype)
def _(dtype):
    return DecimalArray._from_sequence([Decimal('0'), Decimal('NaN')],
                                       dtype=dtype)


@make_scalar.register(Decimal)
def _(x):
    return Decimal('1')

Internally, Dask will use this to create a small dummy Series for tracking metadata through operations.

>>> make_array_nonempty(DecimalDtype())
<DecimalArray>
[Decimal('0'), Decimal('NaN')]
Length: 2, dtype: decimal

So you (or your users) can now create and store a dask DataFrame or Series with your extension array contained within.

>>> from decimal import Decimal
>>> import dask.dataframe as dd
>>> import pandas as pd
>>> from pandas.tests.extension.decimal import DecimalArray
>>> s = pd.Series(DecimalArray([Decimal('0.0')] * 10))
>>> ds = dd.from_pandas(s, 3)
>>> ds
Dask Series Structure:
npartitions=3
0    decimal
4        ...
8        ...
9        ...
dtype: decimal
Dask Name: from_pandas, 3 tasks

Notice the decimal dtype.

Accessors¶

Many extension arrays expose their functionality on Series or DataFrame objects using accessors. Dask provides decorators to register accessors similar to pandas. See the pandas documentation on accessors for more.

dask.dataframe.extensions.register_dataframe_accessor(name)¶

See pandas.api.extensions.register_dataframe_accessor() for more.

dask.dataframe.extensions.register_series_accessor(name)¶

See pandas.api.extensions.register_series_accessor() for more.

dask.dataframe.extensions.register_index_accessor(name)¶

See pandas.api.extensions.register_index_accessor() for more.
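Since the Dask decorators are described as similar to the pandas ones, the pandas side of the pattern is a useful reference point. The sketch below registers a Series accessor with pandas only; the accessor name demo and the class DemoAccessor are hypothetical. With the Dask decorators listed above, the same class shape would presumably receive a Dask collection in __init__ rather than a pandas Series.

```python
import pandas as pd


@pd.api.extensions.register_series_accessor("demo")
class DemoAccessor:
    """Hypothetical accessor, exposed as ``series.demo``."""

    def __init__(self, series):
        # pandas passes the Series the accessor is attached to.
        self._series = series

    def centered(self):
        # Subtract the mean so the result averages to zero.
        return self._series - self._series.mean()


s = pd.Series([1.0, 2.0, 3.0])
print(s.demo.centered().tolist())
```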
