There are a number of open-source projects that extend the Dask interface and provide different mechanisms for deploying Dask clusters. This list is likely incomplete, so if you spot something missing, please suggest a fix!

Building on Dask

Many packages include built-in support for Dask collections or wrap Dask collections internally to enable parallelization.


  • xarray: Wraps Dask Array, offering the same scalability, but with axis labels which add convenience when dealing with complex datasets.

  • cupy: A NumPy-compatible GPU array library; CuPy arrays can be used as the blocks of Dask Arrays. See the section GPUs for more information.

  • sparse: Implements sparse arrays of arbitrary dimension on top of numpy and scipy.sparse.

  • pint: Attaches physical units to arrays, allowing arithmetic between quantities and conversions between different units.
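To make the pattern concrete, here is a minimal sketch of the core dask.array API that these libraries build on (the shapes and chunk sizes are arbitrary choices for illustration):

```python
import dask.array as da

# A 4000x4000 array of ones, split into 16 blocks of 1000x1000.
# Libraries such as cupy or sparse can supply the per-block arrays
# in place of NumPy.
x = da.ones((4000, 4000), chunks=(1000, 1000))

# Operations build a lazy task graph; .compute() runs it in parallel.
total = (x + x.T).sum().compute()
print(total)  # 32000000.0
```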

DataFrame

  • cudf: Part of the Rapids project, implements GPU-enabled dataframes which can be used as partitions in Dask Dataframes.

  • dask-geopandas: Early-stage subproject of geopandas, enabling parallelization of geopandas dataframes.

SQL

  • blazingSQL: Part of the Rapids project, implements SQL queries using cuDF and Dask, for execution on CUDA/GPU-enabled hardware, including referencing externally-stored data.

  • dask-sql: Adds a SQL query layer on top of Dask. The API matches blazingSQL, but it runs on CPU instead of GPU. It is still under development and not ready for production use.

  • fugue-sql: Adds an abstraction layer that makes code portable across computing frameworks such as pandas, Spark, and Dask.

Machine Learning

  • dask-ml: Implements distributed versions of common machine learning algorithms.

  • scikit-learn: Use the ‘dask’ joblib backend to parallelize scikit-learn algorithms across a Dask cluster.

  • xgboost: Powerful and popular library for gradient boosted trees; includes native support for distributed training using dask.

  • lightgbm: Similar to XGBoost; LightGBM also provides native support for distributed training of gradient-boosted decision trees.
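For example, the scikit-learn integration works through joblib: once a dask.distributed Client exists, entering the ‘dask’ joblib backend sends scikit-learn's internal parallel work to the cluster. A minimal sketch (the dataset size and model settings are arbitrary choices for illustration):

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# In-process cluster for illustration; any Dask cluster works.
client = Client(processes=False)

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0)

# joblib dispatches the per-tree fitting work to the Dask cluster.
with joblib.parallel_backend("dask"):
    model.fit(X, y)

score = model.score(X, y)
client.close()
```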

Deploying Dask

There are many different tools for deploying a Dask distributed cluster on different kinds of infrastructure.

  • dask-jobqueue: Deploy Dask on job queuing systems like PBS, Slurm, MOAB, SGE, LSF, and HTCondor.

  • dask-kubernetes: Deploy Dask workers on Kubernetes from within a Python script or interactive session.

  • dask-helm: Deploy Dask and (optionally) Jupyter or JupyterHub on Kubernetes easily using Helm.

  • dask-yarn / Hadoop: Deploy Dask on YARN clusters, such as are found in traditional Hadoop installations.

  • dask-cloudprovider: Deploy Dask on various cloud platforms such as AWS, Azure, and GCP leveraging cloud native APIs.

  • dask-gateway: Secure, multi-tenant server for managing Dask clusters. Launch and use Dask clusters in a shared, centrally managed cluster environment, without requiring users to have direct access to the underlying cluster backend.

  • dask-cuda: Construct a Dask cluster which resembles LocalCluster and is specifically optimized for GPUs.
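All of these projects ultimately hand you a cluster object that a distributed Client connects to; the interface is the same whether the workers run on Kubernetes, YARN, the cloud, or, as in this local sketch, in your own process:

```python
from dask.distributed import Client, LocalCluster

# LocalCluster is the reference pattern that the deployment projects
# above mirror (KubeCluster, YarnCluster, LocalCUDACluster, ...).
cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=False)
client = Client(cluster)

# Work submitted through the client runs on the cluster's workers.
future = client.submit(sum, range(10))
result = future.result()  # 45

client.close()
cluster.close()
```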