Cloud

There are a variety of ways to deploy Dask on the cloud. Cloud providers offer managed services, such as VMs, Kubernetes, and Yarn, as well as custom APIs, to which Dask can connect easily.

Some common deployment options you may want to consider are:

  • A commercial Dask deployment option like Coiled to handle the creation and management of Dask clusters on AWS, GCP, and Azure.

  • A managed Kubernetes service and Dask’s Kubernetes integration (see the sketch after this list).

  • Directly launching cloud resources such as VMs or containers via a cluster manager with Dask Cloud Provider.

  • A managed Yarn service, like Amazon EMR or Google Cloud Dataproc, with Dask-Yarn (the Dask-Yarn documentation includes a dedicated guide for the popular Amazon EMR service).
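
For example, the Kubernetes route looks roughly like this once the Dask operator is installed in your Kubernetes cluster. This is a minimal sketch; the cluster name and worker count are placeholders:

>>> from dask_kubernetes.operator import KubeCluster
>>> cluster = KubeCluster(name="demo-cluster", n_workers=3)  # hypothetical name
>>> client = cluster.get_client()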

Cloud Deployment Examples

Coiled deploys managed Dask clusters on AWS, GCP, and Azure. It’s free for most users and has several features that address common deployment pain points like:

  • Easy-to-use API

  • Automatic software synchronization

  • Easy access to any cloud hardware (like GPUs) in any region

  • Robust logging, cost controls, and metrics collection

>>> import coiled
>>> cluster = coiled.Cluster(
...     n_workers=100,             # Size of cluster
...     region="us-west-2",        # Same region as data
...     vm_type="m6i.xlarge",      # Hardware of your choosing
... )
>>> client = cluster.get_client()

Coiled is recommended for deploying Dask on the cloud, though non-commercial, open source options like Dask Cloud Provider, Dask-Gateway, and Dask-Yarn are also available (see the deployment options listed above).

Using Dask Cloud Provider to launch a cluster of VMs on a platform like DigitalOcean can be as convenient as launching a local cluster.

>>> import dask.config
>>> dask.config.set({"cloudprovider.digitalocean.token": "yourAPItoken"})
>>> from dask_cloudprovider.digitalocean import DropletCluster
>>> cluster = DropletCluster(n_workers=1)
Creating scheduler instance
Created droplet dask-38b817c1-scheduler
Waiting for scheduler to run
Scheduler is running
Creating worker instance
Created droplet dask-38b817c1-worker-dc95260d

Many of the cluster managers in Dask Cloud Provider work by launching VMs with a startup script that pulls down the Dask Docker image and runs Dask components within that container. As with all cluster managers, the VM resources, Docker image, and other options are configurable, as the sketch below shows.
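
For example, here is a hedged sketch using the AWS EC2Cluster; the region, instance type, and image values are illustrative:

>>> from dask_cloudprovider.aws import EC2Cluster
>>> cluster = EC2Cluster(
...     region="us-east-1",                  # where to launch the VMs
...     instance_type="t3.large",            # VM resources
...     docker_image="daskdev/dask:latest",  # image pulled by the startup script
...     n_workers=2,
... )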

You can then connect a client and work with the cluster as if it were on your local machine.

>>> client = cluster.get_client()
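
Any computation you trigger now executes on the remote workers. A minimal sketch:

>>> import dask.array as da
>>> x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
>>> x.mean().compute()  # runs on the cluster, not your local machine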

Data Access

In addition to deploying Dask clusters on the cloud, most cloud users will also want to access data hosted on their cloud provider’s object storage.

We recommend installing additional libraries (listed below) for easy data access on your cloud provider. See Connect to remote data for more information.

Use s3fs for accessing data on Amazon’s S3.

pip install s3fs
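
With s3fs installed, Dask can read S3 paths directly. A minimal sketch, assuming credentials are available in your environment and using a hypothetical bucket:

>>> import dask.dataframe as dd
>>> df = dd.read_parquet("s3://my-bucket/data/")  # bucket name is a placeholder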

Use gcsfs for accessing data on Google’s GCS.

pip install gcsfs
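
Likewise, a sketch with a hypothetical GCS bucket:

>>> import dask.dataframe as dd
>>> df = dd.read_csv("gcs://my-bucket/data/*.csv")  # bucket name is a placeholder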

Use adlfs for accessing data on Microsoft’s Data Lake or Blob Storage.

pip install adlfs
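
And a sketch for Azure Blob Storage; the container and account names are hypothetical:

>>> import dask.dataframe as dd
>>> df = dd.read_parquet(
...     "abfs://my-container/data.parquet",          # placeholder container
...     storage_options={"account_name": "myaccount"},  # placeholder account
... )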