Deploy Dask Clusters
Deploy Dask Clusters¶
dask.distributed scheduler works well on a single machine and scales to many machines
in a cluster. We recommend using
dask.distributed clusters at all scales for the following
It provides access to asynchronous APIs, notably Futures.
It provides a diagnostic dashboard that can provide valuable insight on performance and progress (see Dashboard Diagnostics).
It handles data locality with sophistication, and so can be more efficient than the multiprocessing scheduler on workloads that require multiple processes.
This page describes various ways to set up Dask clusters on different hardware, either locally on your own machine or on a distributed cluster.
You can continue reading or watch the screencast below:
If you import Dask, set up a computation, and call
compute, then you
will use the single-machine scheduler by default.
import dask.dataframe as dd df = dd.read_csv(...) df.x.sum().compute() # This uses the single-machine scheduler by default
To use the
dask.distributed scheduler you must set up a
from dask.distributed import Client client = Client(...) # Connect to distributed cluster and override default df.x.sum().compute() # This now runs on the distributed system
There are many ways to start the distributed scheduler and worker components, however, the most straight forward way is to use a cluster manager utility class.
from dask.distributed import Client, LocalCluster cluster = LocalCluster() # Launches a scheduler and workers locally client = Client(cluster) # Connect to distributed cluster and override default df.x.sum().compute() # This now runs on the distributed system
These cluster managers deploy a scheduler and the necessary workers as determined by communicating with the resource manager. All cluster managers follow the same interface, but with platform-specific configuration options, so you can switch from your local machine to a remote cluster with very minimal code changes.
Dask Jobqueue, for example, is a set of cluster managers for HPC users and works with job queueing systems (in this case, the resource manager) such as PBS, Slurm, and SGE. Those workers are then allocated physical hardware resources.
from dask.distributed import Client from dask_jobqueue import PBSCluster cluster = PBSCluster() # Launches a scheduler and workers on HPC via PBS client = Client(cluster) # Connect to distributed cluster and override default df.x.sum().compute() # This now runs on the distributed system
The following resources explain how to set up Dask on a variety of local and distributed hardware.
Dask runs perfectly well on a single machine with or without a distributed scheduler. But once you start using Dask in anger you’ll find a lot of benefit both in terms of scaling and debugging by using the distributed scheduler.
There are a number of ways to run Dask on a distributed cluster (see the Beginner’s Guide to Configuring a Distributed Dask Cluster).
High Performance Computing¶
See High Performance Computers for more details.
Provides cluster managers for PBS, SLURM, LSF, SGE and other resource managers.
Deploy Dask from within an existing MPI environment.
- Dask Gateway for Jobqueue
Multi-tenant, secure clusters. Once configured, users can launch clusters without direct access to the underlying HPC backend.
See Kubernetes for more details.
- Dask Kubernetes Operator
For native Kubernetes integration for fast moving or ephemeral deployments.
- Dask Gateway for Kubernetes
Multi-tenant, secure clusters. Once configured, users can launch clusters without direct access to the underlying Kubernetes backend.
- Single Cluster Helm Chart
Single Dask cluster and (optionally) Jupyter on deployed with Helm.
See Cloud for more details.
Deploy Dask on YARN clusters, such as are found in traditional Hadoop installations.
- Dask Cloud Provider
Constructing and managing ephemeral Dask clusters on AWS, DigitalOcean, Google Cloud, Azure, and Hetzner
You can use Coiled, a commercial Dask deployment option, to handle the creation and management of Dask clusters on cloud computing environments (AWS and GCP).