Working with Collections

Working with Collections

Often we want to do a bit of custom work with dask.delayed (for example, for complex data ingest), then leverage the algorithms in dask.array or dask.dataframe, and then switch back to custom work. To this end, all collections support from_delayed functions and to_delayed methods.

As an example, consider the case where we store tabular data in a custom format not known by Dask DataFrame. This format is naturally broken apart into pieces and we have a function that reads one piece into a Pandas DataFrame. We use dask.delayed to lazily read these files into Pandas DataFrames, use dd.from_delayed to wrap these pieces up into a single Dask DataFrame, use the complex algorithms within the DataFrame (groupby, join, etc.), and then switch back to dask.delayed to save our results back to the custom format:

import dask.dataframe as dd
from dask.delayed import delayed

from my_custom_library import load, save

filenames = ...
dfs = [delayed(load)(fn) for fn in filenames]

df = dd.from_delayed(dfs)
df = ... # do work with dask.dataframe

dfs = df.to_delayed()
writes = [delayed(save)(df, fn) for df, fn in zip(dfs, filenames)]

dd.compute(*writes)

Data science is often complex, and dask.delayed provides a release valve for users to manage this complexity on their own, and solve the last mile problem for custom formats and complex situations.