Deploy Dask Clusters
Deploy Dask Clusters¶
dask.distributed scheduler works well on a single machine and scales to many machines
in a cluster. We recommend using
dask.distributed clusters at all scales for the following
It provides access to asynchronous APIs, notably Futures.
It provides a diagnostic dashboard that can provide valuable insight on performance and progress (see Dashboard Diagnostics).
It handles data locality with sophistication, and so can be more efficient than the multiprocessing scheduler on workloads that require multiple processes.
This page describes various ways to set up Dask clusters on different hardware, either locally on your own machine or on a distributed cluster.
You can continue reading or watch the screencast below:
If you import Dask, set up a computation, and call
compute, then you
will use the single-machine scheduler by default.
import dask.dataframe as dd df = dd.read_csv(...) df.x.sum().compute() # This uses the single-machine scheduler by default
To use the
dask.distributed scheduler you must set up a
from dask.distributed import Client client = Client(...) # Connect to distributed cluster and override default df.x.sum().compute() # This now runs on the distributed system
There are many ways to start the distributed scheduler and worker components, however, the most straight forward way is to use a cluster manager utility class.
from dask.distributed import Client, LocalCluster cluster = LocalCluster() # Launches a scheduler and workers locally client = Client(cluster) # Connect to distributed cluster and override default df.x.sum().compute() # This now runs on the distributed system
These cluster managers deploy a scheduler and the necessary workers as determined by communicating with the resource manager. All cluster managers follow the same interface, but with platform-specific configuration options, so you can switch from your local machine to a remote cluster with very minimal code changes.
Dask Jobqueue, for example, is a set of cluster managers for HPC users and works with job queueing systems (in this case, the resource manager) such as PBS, Slurm, and SGE. Those workers are then allocated physical hardware resources.
from dask.distributed import Client from dask_jobqueue import PBSCluster cluster = PBSCluster() # Launches a scheduler and workers on HPC via PBS client = Client(cluster) # Connect to distributed cluster and override default df.x.sum().compute() # This now runs on the distributed system
The following resources explain how to set up Dask on a variety of local and distributed hardware.
Dask runs perfectly well on a single machine with or without a distributed scheduler. But once you start using Dask in anger you’ll find a lot of benefit both in terms of scaling and debugging by using the distributed scheduler.
- Default Scheduler
The no-setup default. Uses local threads or processes for larger-than-memory processing
The sophistication of the newer system on a single machine. This provides more advanced features while still requiring almost no setup.
There are a number of ways to run Dask on a distributed cluster (see the Beginner’s Guide to Configuring a Distributed Dask Cluster).
High Performance Computing¶
See High Performance Computers for more details.
Provides cluster managers for PBS, SLURM, LSF, SGE and other resource managers.
Deploy Dask from within an existing MPI environment.
- Dask Gateway for Jobqueue
Multi-tenant, secure clusters. Once configured, users can launch clusters without direct access to the underlying HPC backend.
See Kubernetes for more details.
An easy way to stand up a long-running Dask cluster.
- Dask Kubernetes
For native Kubernetes integration for fast moving or ephemeral deployments.
- Dask Gateway for Kubernetes
Multi-tenant, secure clusters. Once configured, users can launch clusters without direct access to the underlying Kubernetes backend.
See Cloud for more details.
Deploy Dask on YARN clusters, such as are found in traditional Hadoop installations.
- Dask Cloud Provider
Constructing and managing ephemeral Dask clusters on AWS, DigitalOcean, GCP, Azure, and Hertzner
You can use Coiled, a commercial Dask deployment option, to handle the creation and management of Dask clusters on cloud computing environments (AWS and GCP).
- Manual Setup
The command line interface to set up
Use SSH to set up Dask across an un-managed cluster.
- Python API (advanced)
Workerobjects from Python as part of a distributed Tornado TCP application.
You can use Coiled to handle the creation and management of Dask clusters on cloud computing environments (AWS and GCP).
Domino Data Lab lets users create Dask clusters in a hosted platform.
Saturn Cloud lets users create Dask clusters in a hosted platform or within their own AWS accounts.