Ecosystem

There are a number of open source projects that extend the Dask interface and provide different mechanisms for deploying Dask clusters. This list is likely incomplete, so if you spot something missing, please suggest a fix!

Building on Dask

Many packages include built-in support for Dask collections or wrap Dask collections internally to enable parallelization.

Array

  • xarray: Wraps Dask Array, offering the same scalability, but with axis labels which add convenience when dealing with complex datasets.

  • cupy: Provides GPU-enabled arrays, commonly used alongside the RAPIDS project, that can serve as the blocks of Dask Arrays. See the section GPUs for more information.

  • sparse: Implements sparse arrays of arbitrary dimension on top of numpy and scipy.sparse.

  • pint: Attaches physical units to arrays, allowing arithmetic operations between quantities and conversions from and to different units.
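
The libraries above share one idea: a Dask Array is a grid of smaller array blocks, and the block type can be swapped. The following is a minimal sketch of that idea using plain NumPy blocks only. The choice of `np.asarray` as the per-block function is an illustrative assumption; a library such as cupy or sparse would supply its own array constructor at that step.

```python
import numpy as np
import dask.array as da

# Build a Dask Array out of ten blocks of 100 elements each.
x = da.arange(1_000, chunks=100, dtype="float64")

# map_blocks applies a function to every block. np.asarray is a no-op for
# NumPy blocks and stands in here (as an assumption for illustration) for a
# constructor from another array library.
y = x.map_blocks(np.asarray)

# Nothing is computed until .compute() is called.
total = y.sum().compute()
```

Because the reduction is expressed on the Dask Array rather than on the blocks, the same `sum()` call works regardless of which array type backs each block.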

DataFrame

  • cudf: Part of the RAPIDS project, implements GPU-enabled dataframes which can be used as partitions in Dask DataFrames.

  • dask-geopandas: Early-stage subproject of geopandas, enabling parallelization of geopandas dataframes.

SQL

  • blazingSQL: Part of the RAPIDS project, implements SQL queries using cuDF and Dask, for execution on CUDA/GPU-enabled hardware, including referencing externally-stored data.

  • dask-sql: Adds a SQL query layer on top of Dask. The API matches blazingSQL, but it runs on CPU instead of GPU. It is still under development and not ready for production use.

  • fugue-sql: Adds an abstraction layer that makes code portable across different computing frameworks such as Pandas, Spark and Dask.

Machine Learning

  • dask-ml: Implements distributed versions of common machine learning algorithms.

  • scikit-learn: Pass 'dask' as the joblib backend to parallelize scikit-learn algorithms, with Dask as the execution engine.

  • xgboost: Powerful and popular library for gradient boosted trees; includes native support for distributed training using dask.

  • lightgbm: Similar to XGBoost; lightgbm also natively supports distributed training for decision trees.
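
The scikit-learn entry above can be sketched as follows. This is a minimal example under stated assumptions, not a recommended setup: the helper name `fit_with_dask`, the synthetic dataset, and the threads-only local client are all chosen for illustration, and a real deployment would connect the client to an existing cluster instead.

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def fit_with_dask(n_samples=200, seed=0):
    """Fit a small random forest while joblib dispatches work to Dask."""
    X, y = make_classification(n_samples=n_samples, n_features=8, random_state=seed)
    model = RandomForestClassifier(n_estimators=10, n_jobs=-1, random_state=seed)

    # A threads-only local client keeps the sketch self-contained
    # (hypothetical setup; no dashboard is started).
    with Client(processes=False, dashboard_address=None):
        # Inside this context, joblib-based parallelism runs on Dask.
        with joblib.parallel_backend("dask"):
            model.fit(X, y)

    # Training accuracy, used here only to confirm the model was fitted.
    return model.score(X, y)


accuracy = fit_with_dask()
```

Only estimators that parallelize through joblib (typically those accepting `n_jobs`) benefit from this backend; the model code itself is unchanged.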

Deploying Dask

There are many different implementations of the Dask distributed cluster.

  • dask-jobqueue: Deploy Dask on job queuing systems like PBS, Slurm, MOAB, SGE, LSF, and HTCondor.

  • dask-kubernetes: Deploy Dask workers on Kubernetes from within a Python script or interactive session.

  • dask-helm: Deploy Dask and (optionally) Jupyter or JupyterHub on Kubernetes easily using Helm.

  • dask-yarn / Hadoop: Deploy Dask on YARN clusters, such as are found in traditional Hadoop installations.

  • dask-cloudprovider: Deploy Dask on various cloud platforms such as AWS, Azure, and GCP leveraging cloud native APIs.

  • dask-gateway: Secure, multi-tenant server for managing Dask clusters. Launch and use Dask clusters in a shared, centrally managed cluster environment, without requiring users to have direct access to the underlying cluster backend.

  • dask-cuda: Construct a Dask cluster which resembles LocalCluster and is specifically optimized for GPUs.
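
Most of these projects provide a cluster object that is then handed to a `Client`. The sketch below shows that pattern with `LocalCluster` from `dask.distributed`, which needs no external infrastructure. The helper name `run_on_local_cluster` and the threads-only configuration are assumptions made for illustration; the cluster classes in the packages above take their own, different arguments.

```python
from dask.distributed import Client, LocalCluster


def run_on_local_cluster(n_workers=2):
    """Start a threads-only LocalCluster, run one task on it, then shut down."""
    # processes=False keeps workers in the current process; the dashboard
    # is disabled so the sketch does not open a port for it.
    with LocalCluster(
        n_workers=n_workers, processes=False, dashboard_address=None
    ) as cluster:
        # The client connects to whichever cluster object it is given.
        with Client(cluster) as client:
            future = client.submit(sum, [1, 2, 3])
            return future.result()


answer = run_on_local_cluster()
```

Keeping cluster creation separate from the code that submits work makes it easier to move from a local cluster to one of the deployment options listed above.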

Commercial Dask Deployment Options

  • You can use Coiled to handle the creation and management of Dask clusters on cloud computing environments (AWS and GCP).