Selecting the collection backend

Warning: Backend-library dispatching at the collection level is still an experimental feature. Both the DaskBackendEntrypoint API and the set of “dispatchable” functions are expected to change.

Changing the default backend library

The Dask-Dataframe and Dask-Array modules were originally designed with the Pandas and Numpy backend libraries in mind, respectively. However, other dataframe and array libraries can take advantage of the same collection APIs for out-of-core and parallel processing. For example, users with cupy installed can change their default Dask-Array backend to cupy with the "array.backend" configuration option:

>>> import dask
>>> import dask.array as da
>>> with dask.config.set({"array.backend": "cupy"}):
...     darr = da.ones(10, chunks=(5,))  # Get cupy-backed collection

This code opts out of the default ("numpy") backend for dispatchable Dask-Array creation functions, and uses the creation functions registered for "cupy" instead. The current set of dispatchable creation functions for Dask-Array is:

  • ones

  • zeros

  • empty

  • full

  • arange
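
These creation functions all respect the "array.backend" option. As a rough sketch (assuming cupy is installed), the following should produce collections whose chunks are cupy arrays, which can be confirmed by inspecting the collection's _meta attribute:

>>> import dask
>>> import dask.array as da
>>> with dask.config.set({"array.backend": "cupy"}):
...     x = da.zeros((10, 10), chunks=(5, 5))  # cupy-backed collection
...     y = da.arange(10, chunks=5)            # cupy-backed collection
>>> type(x._meta)
<class 'cupy.ndarray'>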

The Dask-Array API can also dispatch the backend RandomState class to be used for random-number generation. This means all creation functions in dask.array.random are also dispatchable.
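
For example, a similar sketch (again assuming cupy is installed) should produce cupy-backed random values:

>>> import dask
>>> import dask.array as da
>>> with dask.config.set({"array.backend": "cupy"}):
...     x = da.random.random((10, 10), chunks=(5, 5))  # cupy-backed random values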

The current set of dispatchable creation functions for Dask-Dataframe is:

  • from_dict

  • read_parquet

  • read_json

  • read_orc

  • read_csv

  • read_hdf
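
These functions respect the "dataframe.backend" configuration in the same way. As a rough sketch (assuming cudf and dask_cudf are installed), the following should produce a cudf-backed collection:

>>> import dask
>>> import dask.dataframe as dd
>>> with dask.config.set({"dataframe.backend": "cudf"}):
...     ddf = dd.from_dict({"a": range(10)}, npartitions=2)  # cudf-backed collection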

As the backend-library dispatching system becomes more mature, this set of dispatchable creation functions is likely to grow.

For an existing collection, the underlying data can be explicitly moved to a desired backend using the to_backend method:

>>> import dask
>>> import dask.array as da
>>> darr = da.ones(10, chunks=(5,))  # Creates numpy-backed collection
>>> with dask.config.set({"array.backend": "cupy"}):
...     darr = darr.to_backend()  # Moves numpy data to cupy
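
The target backend can also be passed to to_backend directly, which avoids changing the configuration:

>>> darr = darr.to_backend("cupy")  # Move numpy data to cupy explicitly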

Defining a new collection backend

Warning: Defining a custom backend is not yet recommended for most users and downstream libraries. The backend-entrypoint system should still be treated as experimental.

Dask currently exposes entrypoints under the dask.array.backends and dask.dataframe.backends groups to enable users and third-party libraries to develop and maintain backend implementations for Dask-Array and Dask-Dataframe. A custom Dask-Array backend should define a subclass of DaskArrayBackendEntrypoint (defined in dask.array.backends), while a custom Dask-DataFrame backend should define a subclass of DataFrameBackendEntrypoint (defined in dask.dataframe.backends).

For example, a cudf-based backend definition for Dask-Dataframe would look something like the CudfBackendEntrypoint definition below:

import cudf

from dask.dataframe.backends import DataFrameBackendEntrypoint
from dask.dataframe.dispatch import (
    ...
    make_meta_dispatch,
    ...
)
...

def make_meta_cudf(x, index=None):
    return x.head(0)
...

class CudfBackendEntrypoint(DataFrameBackendEntrypoint):

    def __init__(self):
        # Register compute-based dispatch functions
        # (e.g. make_meta_dispatch, sizeof_dispatch)
        ...
        make_meta_dispatch.register(
            (cudf.Series, cudf.DataFrame),
            func=make_meta_cudf,
        )
        # NOTE: Registration may also be outside __init__
        # if it is in the same module as this class
    ...

    @staticmethod
    def read_orc(*args, **kwargs):
        from .io import read_orc

        # Use dask_cudf version of read_orc
        return read_orc(*args, **kwargs)
    ...

In order to support pandas-to-cudf conversion with DataFrame.to_backend, this class also needs to implement the proper to_backend and to_backend_dispatch methods.
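
A minimal sketch of those two methods is shown below; here, to_cudf_dispatch is a hypothetical Dispatch object assumed to be registered elsewhere for converting pandas partitions to cudf:

import cudf
from dask.dataframe.backends import DataFrameBackendEntrypoint
from dask.utils import Dispatch

# Hypothetical partition-level conversion dispatch (pandas -> cudf)
to_cudf_dispatch = Dispatch("to_cudf_dispatch")
...

class CudfBackendEntrypoint(DataFrameBackendEntrypoint):
    ...

    @staticmethod
    def to_backend_dispatch():
        # Return the partition-level conversion function
        return to_cudf_dispatch

    @staticmethod
    def to_backend(data, **kwargs):
        # Convert each partition of the collection to cudf
        if isinstance(data._meta, (cudf.DataFrame, cudf.Series, cudf.Index)):
            return data  # Already cudf-backed
        return data.map_partitions(to_cudf_dispatch, **kwargs)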

To expose this class as a dask.dataframe.backends entrypoint, the necessary setup.cfg configuration in cudf (or dask_cudf) would be as follows:

[options.entry_points]
dask.dataframe.backends =
   cudf = <module-path>:CudfBackendEntrypoint
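
For a project that declares its metadata in pyproject.toml rather than setup.cfg, the equivalent entrypoint declaration would be:

[project.entry-points."dask.dataframe.backends"]
cudf = "<module-path>:CudfBackendEntrypoint"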

Compute dispatch

Note

The primary dispatching mechanism for array-like compute operations in both Dask-Array and Dask-DataFrame is the __array_function__ protocol defined in NEP-18. For a custom collection backend to be functional, this protocol must cover many common numpy functions for the desired array backend. For example, the cudf backend for Dask-DataFrame depends on the __array_function__ protocol being defined for both cudf and its complementary array backend (cupy). The compute-based dispatch functions discussed in this section correspond to functionality that is not already captured by NEP-18.
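
As a quick illustration of the protocol (assuming a cupy version that implements __array_function__), calling the numpy API on cupy inputs should return a cupy result:

>>> import numpy as np
>>> import cupy
>>> x = cupy.arange(4)
>>> type(np.concatenate([x, x]))  # numpy API dispatches to cupy via NEP-18
<class 'cupy.ndarray'>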

Notice that the CudfBackendEntrypoint class must define a distinct method for each dispatchable creation routine, and register all non-creation (compute-based) dispatch functions within its __init__ logic. These compute dispatch functions do not operate at the collection-API level, but at computation time (within a task). All current “compute” dispatch functions are listed below.

Dask-Array compute-based dispatch functions (as defined in dask.array.dispatch, and defined for Numpy in dask.array.backends):

  • concatenate_lookup

  • divide_lookup

  • einsum_lookup

  • empty_lookup

  • nannumel_lookup

  • numel_lookup

  • percentile_lookup

  • tensordot_lookup
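
Each of these is a Dispatch object, so a backend registers its own implementation for the relevant array types. A rough sketch of a cupy-based concatenate registration (the function name here is hypothetical) might look like:

import cupy
from dask.array.dispatch import concatenate_lookup

@concatenate_lookup.register(cupy.ndarray)
def _concatenate_cupy(arrays, axis=0, **kwargs):
    # Concatenate cupy chunks with cupy's own implementation
    return cupy.concatenate(arrays, axis=axis)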

Dask-Dataframe compute-based dispatch functions (as defined in dask.dataframe.dispatch, and defined for Pandas in dask.dataframe.backends):

  • categorical_dtype_dispatch

  • concat_dispatch

  • get_parallel_type

  • group_split_dispatch

  • grouper_dispatch

  • hash_object_dispatch

  • is_categorical_dtype_dispatch

  • make_meta_dispatch

  • make_meta_obj

  • meta_lib_from_array

  • meta_nonempty

  • pyarrow_schema_dispatch

  • tolist_dispatch

  • union_categoricals_dispatch
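
These are also Dispatch objects, and register can be used either with func= (as in the CudfBackendEntrypoint example above) or as a decorator. A rough sketch of a hypothetical cudf registration for tolist_dispatch:

import cudf
from dask.dataframe.dispatch import tolist_dispatch

@tolist_dispatch.register((cudf.Series, cudf.Index))
def tolist_cudf(obj):
    # Move the data to the host, then convert to a Python list
    return obj.to_pandas().tolist()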

Note that the compute-based dispatching system is subject to change, and implementing a complete backend is still expected to require significant effort. However, the long-term goal is to simplify this process.