Working with Collections¶
Often we want to do a bit of custom work with
dask.delayed (for example
for complex data ingest), then leverage the algorithms in
dask.dataframe, and then switch back to custom work. To this end, all
from_delayed functions and
As an example, consider the case where we store tabular data in a custom format
not known by
dask.dataframe. This format is naturally broken apart into
pieces and we have a function that reads one piece into a Pandas DataFrame.
dask.delayed to lazily read these files into Pandas DataFrames,
dd.from_delayed to wrap these pieces up into a single
dask.dataframe, use the complex algorithms within
(groupby, join, etc..) and then switch back to delayed to save our results
back to the custom format.
import dask.dataframe as dd from dask.delayed import delayed from my_custom_library import load, save filenames = ... dfs = [delayed(load)(fn) for fn in filenames] df = dd.from_delayed(dfs) df = ... # do work with dask.dataframe dfs = df.to_delayed() writes = [delayed(save)(df, fn) for df, fn in zip(dfs, filenames)] dd.compute(*writes)
Data science is often complex,
dask.delayed provides a release valve for
users to manage this complexity on their own, and solve the last mile problem
for custom formats and complex situations.