Configuration¶

Taking full advantage of Dask sometimes requires user configuration. This might be to control logging verbosity, specify cluster configuration, provide credentials for security, or any of several other options that arise in production.

Configuration is specified in one of the following ways:

YAML files in ~/.config/dask/ or /etc/dask/
Environment variables like DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING=True
Default settings within sub-libraries

This combination makes it easy to specify configuration in a variety of settings ranging from personal workstations, to IT-mandated configuration, to docker images.

Access Configuration¶

dask.config.get(key[, default, config, ...])

Get elements from global config

Dask’s configuration system is usually accessed using the dask.config.get function. You can use . for nested access, for example:

>>> import dask
>>> import dask.distributed  # populate config with distributed defaults

>>> dask.config.get("distributed.client") # use `.` for nested access
{'heartbeat': '5s', 'scheduler-info-interval': '2s'}

>>> dask.config.get("distributed.scheduler.unknown-task-duration")
'500ms'

You may wish to inspect the dask.config.config dictionary to get a sense for what configuration is being used by your current system.

Note that the get function treats underscores and hyphens identically. For example, dask.config.get("temporary-directory") is equivalent to dask.config.get("temporary_directory").
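The lookup rules above (dotted keys for nesting, hyphens and underscores treated alike) can be sketched in plain Python. This is an illustration only, not Dask's implementation; `get_nested` and the sample `config` dictionary are hypothetical names introduced for the example.

```python
def get_nested(config, key, default=None):
    """Look up a dotted key in a nested dict, treating '-' and '_' alike."""
    node = config
    for part in key.split("."):
        if not isinstance(node, dict):
            return default
        # Try the spelling as given, then the hyphen and underscore variants.
        for candidate in (part, part.replace("_", "-"), part.replace("-", "_")):
            if candidate in node:
                node = node[candidate]
                break
        else:
            return default  # no spelling of this part exists at this level
    return node


# Hypothetical sample configuration for the sketch.
config = {"distributed": {"scheduler": {"unknown-task-duration": "500ms"}}}

with_hyphens = get_nested(config, "distributed.scheduler.unknown-task-duration")
with_underscores = get_nested(config, "distributed.scheduler.unknown_task_duration")
```

Both lookups return the same stored value, which mirrors the equivalence described for `dask.config.get`.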

Values like "128 MiB" and "10s" are parsed using the functions in Utilities.
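To give a feel for what parsing such values involves, here is a simplified stand-alone sketch. It does not call the Dask utilities; `parse_quantity` and the two unit tables are assumptions made for this example and cover only a handful of spellings.

```python
import re

# Assumed unit tables for this sketch; real parsers accept many more spellings.
BYTE_UNITS = {"B": 1, "kB": 10**3, "MB": 10**6, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30}
TIME_UNITS = {"ms": 1e-3, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_quantity(text, units):
    """Split '<number><optional space><unit>' and scale the number by the unit."""
    match = re.fullmatch(r"\s*([0-9.]+)\s*([A-Za-z]+)\s*", text)
    if match is None:
        raise ValueError(f"cannot parse {text!r}")
    number, unit = match.groups()
    if unit not in units:
        raise ValueError(f"unknown unit {unit!r}")
    return float(number) * units[unit]


chunk_bytes = parse_quantity("128 MiB", BYTE_UNITS)  # size in bytes
timeout_seconds = parse_quantity("10s", TIME_UNITS)  # duration in seconds
```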

Specify Configuration¶

YAML files¶

You can specify configuration values in YAML files. For example:

array:
  chunk-size: 128 MiB

distributed:
  worker:
    memory:
      spill: 0.85  # default: 0.7
      target: 0.75  # default: 0.6
      terminate: 0.98  # default: 0.95

  dashboard:
    # Locate the dashboard if working on a Jupyter Hub server
    link: /user/<user>/proxy/8787/status

These files can live in any of the following locations:

The ~/.config/dask directory in the user’s home directory
The {sys.prefix}/etc/dask directory local to Python
The {prefix}/etc/dask directories with {prefix} in site.PREFIXES
The root directory (specified by the DASK_ROOT_CONFIG environment variable or /etc/dask/ by default)

Dask searches for all YAML files within each of these directories and merges them together, preferring configuration files closer to the user over system configuration files (preference follows the order in the list above). Additionally, users can specify a path with the DASK_CONFIG environment variable, which takes precedence at the top of the list above.
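The search order described above can be written down as a small function. This is a sketch under stated assumptions, not the code Dask runs: `config_search_paths` and its parameters are hypothetical, and the caller passes in the home directory and prefixes explicitly so the result is easy to inspect.

```python
import posixpath


def config_search_paths(environ, home, sys_prefix, site_prefixes):
    """Return config directories from most preferred to least preferred."""
    paths = []
    if "DASK_CONFIG" in environ:
        paths.append(environ["DASK_CONFIG"])  # explicit user override comes first
    paths.append(posixpath.join(home, ".config", "dask"))
    paths.append(posixpath.join(sys_prefix, "etc", "dask"))
    for prefix in site_prefixes:
        paths.append(posixpath.join(prefix, "etc", "dask"))
    paths.append(environ.get("DASK_ROOT_CONFIG", "/etc/dask"))
    # Drop duplicates while keeping the first (most preferred) occurrence.
    return list(dict.fromkeys(paths))


search_order = config_search_paths(
    environ={"DASK_CONFIG": "/opt/cfg"},
    home="/home/alice",
    sys_prefix="/usr",
    site_prefixes=["/usr"],
)
```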

The contents of these YAML files are merged together, allowing different Dask subprojects like dask-kubernetes or dask-ml to manage configuration files separately, but have them merge into the same global configuration.

Environment Variables¶

You can also specify configuration values with environment variables like the following:

export DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING=True
export DASK_DISTRIBUTED__SCHEDULER__ALLOWED_FAILURES=5
export DASK_DISTRIBUTED__DASHBOARD__LINK="/user/<user>/proxy/8787/status"

resulting in configuration values like the following:

{
    'distributed': {
        'scheduler': {
            'work-stealing': True,
            'allowed-failures': 5
        },
        'dashboard': {
            'link': '/user/<user>/proxy/8787/status'
        }
    }
}

Dask searches for all environment variables that start with DASK_, then transforms keys by converting to lower case and changing double-underscores to nested structures.

Dask tries to parse all values with ast.literal_eval, letting users pass numeric and boolean values (such as True in the example above) as well as lists, dictionaries, and so on with normal Python syntax.

Environment variables take precedence over configuration values found in YAML files.
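The environment-variable rules above (the `DASK_` prefix, lower-casing, double underscores for nesting, and `ast.literal_eval` for values) can be sketched as follows. `collect_env_sketch` is a hypothetical helper, and rewriting single underscores as hyphens is an assumption made here so the result matches the hyphenated keys shown earlier.

```python
import ast


def collect_env_sketch(environ):
    """Build a nested config dict from DASK_* environment variables."""
    config = {}
    for name, raw in environ.items():
        if not name.startswith("DASK_"):
            continue
        # Strip the prefix, lower-case, and split nesting levels on "__".
        parts = name[len("DASK_"):].lower().split("__")
        # Assumption: display single underscores as hyphens.
        parts = [part.replace("_", "-") for part in parts]
        try:
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            value = raw  # not a Python literal, keep the raw string
        node = config
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return config


env_config = collect_env_sketch({
    "DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING": "True",
    "DASK_DISTRIBUTED__SCHEDULER__ALLOWED_FAILURES": "5",
    "DASK_DISTRIBUTED__DASHBOARD__LINK": "/user/<user>/proxy/8787/status",
    "HOME": "/home/alice",  # ignored: no DASK_ prefix
})
```

Values that are not valid Python literals, such as the dashboard path, fall back to plain strings.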

Defaults¶

Additionally, individual subprojects may add their own default values when they are imported. These are always added with lower priority than the YAML files or environment variables mentioned above:

>>> import dask.config
>>> dask.config.config  # no configuration by default
{}

>>> import dask.distributed
>>> dask.config.config  # New values have been added
{
    'scheduler': ...,
    'worker': ...,
    'tls': ...
}

Directly within Python¶

dask.config.set(arg, config, lock, **kwargs)

Temporarily set configuration values within a context manager

Configuration is stored within a normal Python dictionary in dask.config.config and can be modified using normal Python operations.

Additionally, you can temporarily set a configuration value using the dask.config.set function. This function accepts a dictionary as an input and interprets "." as nested access:

>>> dask.config.set({'optimization.fuse.ave-width': 4})

This function can also be used as a context manager for consistent cleanup:

>>> with dask.config.set({'optimization.fuse.ave-width': 4}):
...     arr2, = dask.optimize(arr)

Note that the set function treats underscores and hyphens identically. For example, dask.config.set({'optimization.fuse.ave_width': 4}) is equivalent to dask.config.set({'optimization.fuse.ave-width': 4}).
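The context-manager behaviour can be illustrated with a small stand-alone sketch. It covers only the temporary form (set on entry, restore on exit) and is not how `dask.config.set` is implemented; `temporary_set` and the sample `config` are hypothetical.

```python
import copy
from contextlib import contextmanager


@contextmanager
def temporary_set(config, updates):
    """Apply dotted-key updates to a nested dict and restore it on exit."""
    saved = copy.deepcopy(config)
    try:
        for dotted_key, value in updates.items():
            *parents, leaf = dotted_key.split(".")
            node = config
            for part in parents:
                node = node.setdefault(part, {})
            node[leaf] = value
        yield config
    finally:
        # Restore the original contents in place, even if the body raised.
        config.clear()
        config.update(saved)


config = {"optimization": {"fuse": {"ave-width": 1}}}

with temporary_set(config, {"optimization.fuse.ave-width": 4}):
    inside = config["optimization"]["fuse"]["ave-width"]

after = config["optimization"]["fuse"]["ave-width"]
```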

Finally, note that persistent objects may acquire configuration settings when they are initialized. These settings may also be cached for performance reasons. This is particularly true for dask.distributed objects such as Client, Scheduler, Worker, and Nanny.

Directly from CLI¶

Configuration can also be set and viewed from the CLI.

$ dask config set optimization.fuse.ave-width 4
Updated [optimization.fuse.ave-width] to [4], config saved to ~/dask/dask.yaml

$ dask config get optimization.fuse.ave-width
4

Distributing configuration¶

It may also be desirable to package up your whole Dask configuration for use on another machine. This is used in some Dask Distributed libraries to ensure remote components have the same configuration as your local system.

This is typically handled by the downstream libraries which use base64 encoding to pass config via the DASK_INTERNAL_INHERIT_CONFIG environment variable.

`dask.config.serialize`(data)	Serialize config data into a string.
`dask.config.deserialize`(data)	De-serialize config data into the original object.

Conversion Utility¶

It is possible to configure Dask inline with dot notation, with YAML, or via environment variables.
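As a rough guide to how the three notations relate, the following sketch converts a dotted key into an environment variable name and a YAML-style nesting. The helper names are hypothetical, and like any such converter this is for reference only.

```python
def to_env_var(dotted_key):
    """'a.b-c' -> 'DASK_A__B_C' (dots become '__', hyphens become '_')."""
    return "DASK_" + dotted_key.upper().replace("-", "_").replace(".", "__")


def to_nested(dotted_key, value):
    """'a.b' with value v -> {'a': {'b': v}}."""
    nested = value
    for part in reversed(dotted_key.split(".")):
        nested = {part: nested}
    return nested


def to_yaml_lines(nested, indent=0):
    """Render a nested dict of scalars as indented YAML-style lines."""
    lines = []
    for key, value in nested.items():
        if isinstance(value, dict):
            lines.append(" " * indent + f"{key}:")
            lines.extend(to_yaml_lines(value, indent + 2))
        else:
            lines.append(" " * indent + f"{key}: {value}")
    return lines


key = "optimization.fuse.ave-width"
env_name = to_env_var(key)
nested = to_nested(key, 4)
yaml_text = "\n".join(to_yaml_lines(nested))
```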

Updating Configuration¶

Manipulating configuration dictionaries¶

`dask.config.merge`(*dicts)	Update a sequence of nested dictionaries
`dask.config.update`(old, new[, priority, ...])	Update a nested dictionary with values from another
`dask.config.expand_environment_variables`(config)	Expand environment variables in a nested config dictionary

As described above, configuration can come from many places, including several YAML files, environment variables, and project defaults. Each of these provides a configuration that is possibly nested like the following:

x = {'a': 0, 'c': {'d': 4}}
y = {'a': 1, 'b': 2, 'c': {'e': 5}}

Dask will merge these configurations respecting nested data structures, and respecting order:

>>> dask.config.merge(x, y)
{'a': 1, 'b': 2, 'c': {'d': 4, 'e': 5}}
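A minimal sketch of such a nested merge, assuming plain dictionaries and later arguments taking precedence, is shown here. `merge_nested` is a hypothetical name and this is not the library's own code.

```python
def merge_nested(*dicts):
    """Merge dicts left to right; later values win, nested dicts are combined."""
    result = {}
    for d in dicts:
        for key, value in d.items():
            if isinstance(value, dict) and isinstance(result.get(key), dict):
                result[key] = merge_nested(result[key], value)
            elif isinstance(value, dict):
                result[key] = merge_nested(value)  # copy so inputs stay untouched
            else:
                result[key] = value
    return result


x = {'a': 0, 'c': {'d': 4}}
y = {'a': 1, 'b': 2, 'c': {'e': 5}}
merged = merge_nested(x, y)
```

The inputs `x` and `y` are left unchanged; only the returned dictionary holds the combined result.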

You can also use the update function to update the existing configuration in place with a new configuration. This can be done with priority being given to either config. This is often used to update the global configuration in dask.config.config:

dask.config.update(dask.config.config, new, priority='new')  # Give priority to new values
dask.config.update(dask.config.config, new, priority='old')  # Give priority to old values
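The effect of the two priorities can be seen in a simplified in-place update. This sketch handles only the 'new' and 'old' modes and uses a hypothetical helper name, `update_nested`.

```python
def update_nested(old, new, priority="new"):
    """Update `old` in place from `new`, merging nested dicts."""
    for key, value in new.items():
        if isinstance(value, dict) and isinstance(old.get(key), dict):
            update_nested(old[key], value, priority)  # merge one level down
        elif priority == "new" or key not in old:
            old[key] = value  # overwrite, or fill in a key that was missing
    return old


prefer_new = update_nested({'x': 1, 'y': {'a': 2}}, {'x': 2, 'y': {'b': 3}})
prefer_old = update_nested({'x': 1, 'y': {'a': 2}}, {'x': 2, 'y': {'b': 3}}, priority="old")
```

With priority 'old', existing values such as `x` are kept and only missing keys are filled in.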

Sometimes it is useful to expand environment variables stored within a configuration. This can be done with the expand_environment_variables function:

dask.config.config = dask.config.expand_environment_variables(dask.config.config)
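The idea behind this expansion can be sketched with the standard library alone. `expand_vars` is a hypothetical stand-in, and `DEMO_USER` is an environment variable set only for the purpose of the example.

```python
import os


def expand_vars(value):
    """Recursively expand $VAR and ${VAR} in strings inside dicts and lists."""
    if isinstance(value, dict):
        return {k: expand_vars(v) for k, v in value.items()}
    if isinstance(value, (list, tuple, set)):
        return type(value)(expand_vars(v) for v in value)
    if isinstance(value, str):
        return os.path.expandvars(value)
    return value  # numbers, None, and other objects pass through unchanged


os.environ["DEMO_USER"] = "alice"  # hypothetical variable for the example
expanded = expand_vars({"x": [1, 2, "$DEMO_USER"], "home": "/data/${DEMO_USER}"})
```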

Refreshing Configuration¶

`dask.config.collect`([paths, env])	Collect configuration from paths and environment variables
`dask.config.refresh`([config, defaults])	Update configuration by re-reading yaml files and env variables

If you change your environment variables or YAML files, Dask will not immediately see the changes. Instead, you can call refresh to go through the configuration collection process and update the global configuration:

>>> dask.config.config
{}

>>> # make some changes to yaml files

>>> dask.config.refresh()
>>> dask.config.config
{...}

This function uses dask.config.collect, which returns the configuration without modifying the global configuration. You might use this to determine the configuration of particular paths not yet on the config path:

>>> dask.config.collect(paths=[...])
{...}

Downstream Libraries¶

`dask.config.ensure_file`(source[, ...])	Copy file to default location if it does not already exist
`dask.config.update`(old, new[, priority, ...])	Update a nested dictionary with values from another
`dask.config.update_defaults`(new[, config, ...])	Add a new set of defaults to the configuration

Downstream Dask libraries often follow a standard convention to use the central Dask configuration. This section provides recommendations for integration using a fictional project, dask-foo, as an example.

Downstream projects typically adopt the following convention:

Maintain default configuration in a YAML file within their source directory:

setup.py
dask_foo/__init__.py
dask_foo/config.py
dask_foo/core.py
dask_foo/foo.yaml  # <---

Place configuration in that file within a namespace for the project:

# dask_foo/foo.yaml

foo:
  color: red
  admin:
    a: 1
    b: 2

Within a config.py file (or anywhere) load that default config file and update it into the global configuration:

# dask_foo/config.py
import os
import yaml

import dask.config

fn = os.path.join(os.path.dirname(__file__), 'foo.yaml')

with open(fn) as f:
    defaults = yaml.safe_load(f)

dask.config.update_defaults(defaults)

Ensure that this file is run on import by including it in __init__.py:
```
# dask_foo/__init__.py

from . import config
```

Within dask_foo code, use the dask.config.get function to access configuration values:

# dask_foo/core.py

import dask.config

def process(fn, color=dask.config.get('foo.color')):
    ...

You may also want to ensure that your yaml configuration files are included in your package. This can be accomplished by including the following line in your MANIFEST.in:
```
recursive-include <PACKAGE_NAME> *.yaml
```
and the following in your setup.py setup call:
```
from setuptools import setup

setup(...,
      include_package_data=True,
      ...)
```

This process keeps configuration in a central place, but also keeps it safe within namespaces. It places config files in an easy-to-access location by default (~/.config/dask/*.yaml), so that users can easily discover what they can change, but maintains the actual defaults within the source code, so that they more closely track changes in the library.

However, downstream libraries may choose alternative solutions, such as isolating their configuration within their library, rather than using the global dask.config system. All functions in the dask.config module also work with parameters, and do not need to mutate global state.

API¶

dask.config.get(key: str, default: Any = _NoDefault.no_default, config: dict | None = None, override_with: Any = None) → Any[source]¶

Get elements from global config

If override_with is not None this value will be passed straight back. Useful for getting kwarg defaults from Dask config.

Use ‘.’ for nested access

See also

dask.config.set

Examples

>>> from dask import config
>>> config.get('foo')  
{'x': 1, 'y': 2}

>>> config.get('foo.x')  
1

>>> config.get('foo.x.y', default=123)  
123

>>> config.get('foo.y', override_with=None)  
2

>>> config.get('foo.y', override_with=3)  
3
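The override_with parameter suits functions whose keyword arguments should fall back to configuration. The sketch below imitates that pattern with a stand-in `get` function and a hypothetical `_CONFIG` dictionary, so it runs without Dask; with the real library one would call dask.config.get in the same way.

```python
_MISSING = object()
_CONFIG = {"foo": {"color": "red"}}  # hypothetical stand-in for the global config


def get(key, default=_MISSING, override_with=None):
    """Stand-in lookup: return override_with if given, else the configured value."""
    if override_with is not None:
        return override_with
    node = _CONFIG
    for part in key.split("."):
        if not isinstance(node, dict) or part not in node:
            if default is _MISSING:
                raise KeyError(key)
            return default
        node = node[part]
    return node


def process(fn, color=None):
    # Resolve the default at call time, so later config changes are respected.
    color = get("foo.color", override_with=color)
    return color


from_config = process("data.csv")
from_caller = process("data.csv", color="blue")
```

Resolving the value inside the function body, rather than in the signature, means the configuration is read on each call instead of once at import time.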

dask.config.set(arg: Mapping | None = None, config: dict = None, lock: threading.Lock = <unlocked _thread.lock object>, **kwargs)[source]¶

Temporarily set configuration values within a context manager

Parameters

arg: mapping or None, optional: A mapping of configuration key-value pairs to set.
**kwargs: Additional key-value pairs to set. If arg is provided, values set in arg will be applied before those in kwargs. Double-underscores (__) in keyword arguments will be replaced with ., allowing nested values to be easily set.

See also

dask.config.get

Examples

>>> import dask

Set 'foo.bar' in a context, by providing a mapping.

>>> with dask.config.set({'foo.bar': 123}):
...     pass

Set 'foo.bar' in a context, by providing a keyword argument.

>>> with dask.config.set(foo__bar=123):
...     pass

Set 'foo.bar' globally.

>>> dask.config.set(foo__bar=123)  

dask.config.merge(*dicts: collections.abc.Mapping) → dict[source]¶

Update a sequence of nested dictionaries

This prefers the values in the latter dictionaries to those in the former

See also

dask.config.update

Examples

>>> a = {'x': 1, 'y': {'a': 2}}
>>> b = {'y': {'b': 3}}
>>> merge(a, b)  
{'x': 1, 'y': {'a': 2, 'b': 3}}

dask.config.update(old: dict, new: collections.abc.Mapping, priority: Literal['old', 'new', 'new-defaults'] = 'new', defaults: collections.abc.Mapping | None = None) → dict[source]¶

Update a nested dictionary with values from another

This is like dict.update except that it smoothly merges nested values

This operates in-place and modifies old

Parameters

priority: string {‘old’, ‘new’, ‘new-defaults’}: If ‘new’ (default), the new dictionary takes precedence. If ‘old’, the old dictionary does. If ‘new-defaults’, a mapping of the current defaults should be passed as defaults; a value in old is then updated from new only if it still matches the current default.

See also

dask.config.merge

Examples

>>> a = {'x': 1, 'y': {'a': 2}}
>>> b = {'x': 2, 'y': {'b': 3}}
>>> update(a, b)  
{'x': 2, 'y': {'a': 2, 'b': 3}}

>>> a = {'x': 1, 'y': {'a': 2}}
>>> b = {'x': 2, 'y': {'b': 3}}
>>> update(a, b, priority='old')  
{'x': 1, 'y': {'a': 2, 'b': 3}}

>>> d = {'x': 0, 'y': {'a': 2}}
>>> a = {'x': 1, 'y': {'a': 2}}
>>> b = {'x': 2, 'y': {'a': 3, 'b': 3}}
>>> update(a, b, priority='new-defaults', defaults=d)  
{'x': 1, 'y': {'a': 3, 'b': 3}}

dask.config.collect(paths: list[str] = dask.config.paths, env: collections.abc.Mapping[str, str] | None = None) → dict[source]¶

Collect configuration from paths and environment variables

Parameters

paths: list[str]: A list of paths to search for yaml config files
env: Mapping[str, str]: The system environment variables

Returns

config: dict

See also

dask.config.refresh: collect configuration and update into primary config

dask.config.refresh(config: dict | None = None, defaults: list[collections.abc.Mapping] = dask.config.defaults, **kwargs) → None[source]¶

Update configuration by re-reading yaml files and env variables

This mutates the global dask.config.config, or the config parameter if passed in.

This goes through the following stages:

Clearing out all old configuration
Updating from the stored defaults from downstream libraries (see update_defaults)
Updating from yaml files and environment variables
Automatically renaming deprecated keys (with a warning)

Note that some functionality only checks configuration once at startup and may not change behavior, even if configuration changes. It is recommended to restart your python process if convenient to ensure that new configuration changes take place.
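The first three stages listed above can be sketched as follows. This is a simplified illustration with hypothetical names (`refresh_sketch`, `_update`); it takes the already-collected YAML and environment configuration as an argument and leaves out the renaming of deprecated keys.

```python
import copy


def _update(old, new):
    """Merge `new` into `old` in place, with values from `new` taking priority."""
    for key, value in new.items():
        if isinstance(value, dict) and isinstance(old.get(key), dict):
            _update(old[key], value)
        else:
            old[key] = copy.deepcopy(value)


def refresh_sketch(config, defaults, collected):
    """Rebuild `config` in place from stored defaults, then collected settings."""
    config.clear()                  # stage 1: clear out old configuration
    for layer in defaults:          # stage 2: re-apply stored library defaults
        _update(config, layer)
    _update(config, collected)      # stage 3: YAML files and environment variables


config = {"stale": 1}
alias = config  # other code may hold a reference to the same dict
refresh_sketch(
    config,
    defaults=[{"foo": {"color": "red", "size": 1}}],
    collected={"foo": {"color": "blue"}},
)
```

Because the dictionary is cleared and refilled in place, existing references such as `alias` see the refreshed contents.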

See also

dask.config.collect: for parameters
dask.config.update_defaults

dask.config.ensure_file(source: str, destination: str | None = None, comment: bool = True) → None[source]¶

Copy file to default location if it does not already exist

This tries to copy a default configuration file to a default location if it does not already exist. It also comments out that file by default.

This is to be used by downstream modules (like dask.distributed) that may have default configuration files that they wish to include in the default configuration path.

Parameters

source: string, filename: Source configuration file, typically within a source directory.
destination: string, directory: Destination directory. Configurable by DASK_CONFIG environment variable, falling back to ~/.config/dask.
comment: bool, True by default: Whether or not to comment out the config file when copying.
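The copy-and-comment behaviour can be pictured with the following sketch. It is an approximation under stated assumptions, not the library's implementation: `ensure_file_sketch` is a hypothetical name, the destination directory must be passed explicitly, and the demonstration runs inside a temporary directory.

```python
import os
import tempfile


def ensure_file_sketch(source, destination_dir, comment=True):
    """Copy `source` into `destination_dir` unless a file of that name exists."""
    target = os.path.join(destination_dir, os.path.basename(source))
    if os.path.exists(target):
        return target  # never overwrite a file the user may have edited
    os.makedirs(destination_dir, exist_ok=True)
    with open(source) as f:
        lines = f.readlines()
    if comment:
        # Comment out every non-empty line that is not already a comment.
        lines = [
            "# " + line if line.strip() and not line.lstrip().startswith("#") else line
            for line in lines
        ]
    with open(target, "w") as f:
        f.writelines(lines)
    return target


with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "foo.yaml")
    with open(source, "w") as f:
        f.write("foo:\n  color: red\n")
    target = ensure_file_sketch(source, os.path.join(tmp, "user-config"))
    with open(target) as f:
        copied = f.read()
```

The copied file lists every available option but changes nothing until the user uncomments a line.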

dask.config.expand_environment_variables(config: Any) → Any[source]¶

Expand environment variables in a nested config dictionary

This function will recursively search through any nested dictionaries and/or lists.

Parameters

config: dict, iterable, or str: Input object to search for environment variables

Returns

config: same type as input

Examples

>>> expand_environment_variables({'x': [1, 2, '$USER']})  
{'x': [1, 2, 'my-username']}

Note

It is possible to configure Dask inline with dot notation, with YAML or via environment variables. See the conversion utility for converting the following dot notation to other forms.

Dask ¶

temporary-directory None ¶: Temporary directory for local disk storage, such as /tmp, /scratch, or /local. This directory is used during dask spill-to-disk operations. When the value is "null" (default), dask will create a directory in the location from which dask was launched: `cwd/dask-worker-space`

visualization.engine None ¶: Visualization engine to use when calling ``.visualize()`` on a Dask collection. Currently supports ``'graphviz'``, ``'ipycytoscape'``, and ``'cytoscape'`` (alias for ``'ipycytoscape'``)

tokenize.ensure-deterministic False ¶: If ``true``, tokenize will error instead of falling back to uuids when a deterministic token cannot be generated. Defaults to ``false``.

dataframe.backend pandas ¶: Backend to use for supported dataframe-creation functions. Default is "pandas".

dataframe.shuffle.method None ¶: The default shuffle method to choose. Possible values are disk, tasks, p2p. If null, pick best method depending on application.

dataframe.shuffle.compression None ¶: Compression algorithm used for on-disk shuffling. Partd, the library used for compression, supports ZLib, BZ2, and SNAPPY.

dataframe.parquet.metadata-task-size-local 512 ¶: The number of files to handle within each metadata-processing task when reading a parquet dataset from a LOCAL file system. Specifying 0 will result in serial execution on the client.

dataframe.parquet.metadata-task-size-remote 1 ¶: The number of files to handle within each metadata-processing task when reading a parquet dataset from a REMOTE file system. Specifying 0 will result in serial execution on the client.

dataframe.parquet.minimum-partition-size 75000000 ¶: The minimum in-memory size of a single partition after reading from parquet. Smaller parquet files will be combined into a single partition to reach this threshold.

dataframe.convert-string None ¶: Whether to convert string-like data to pyarrow strings.

dataframe.query-planning None ¶: Whether to use query planning.

array.backend numpy ¶: Backend to use for supported array-creation functions. Default is "numpy".

array.chunk-size 128MiB ¶: The default chunk size to target. Default is "128MiB".

array.rechunk.method tasks ¶: The method to use for rechunking. Must be either "tasks" or "p2p"; default is "tasks". Using "p2p" requires a distributed cluster.

array.rechunk.threshold 4 ¶: The graph growth factor above which task-based shuffling introduces an intermediate step.

array.svg.size 120 ¶: The size in pixels used when displaying a dask array as an SVG image. This is used, for example, for nice rendering in a Jupyter notebook.

array.slicing.split-large-chunks None ¶: How to handle large chunks created when slicing Arrays. By default a warning is produced. Set to ``False`` to silence the warning and allow large output chunks. Set to ``True`` to silence the warning and avoid large output chunks.

optimization.annotations.fuse True ¶: If adjacent blockwise layers have different annotations (e.g., one has retries=3 and another has retries=4), Dask can make an attempt to merge those annotations according to some simple rules. ``retries`` is set to the max of the layers, ``priority`` is set to the max of the layers, ``resources`` are set to the max of all the resources, ``workers`` is set to the intersection of the requested workers. If this setting is disabled, then adjacent blockwise layers with different annotations will *not* be fused.

optimization.fuse.active None ¶: Turn task fusion on/off. This option refers to the fusion of a fully-materialized task graph (not a high-level graph). By default (None), the active task-fusion option will be treated as ``False`` for Dask-Dataframe collections, and as ``True`` for all other graphs (including Dask-Array collections).

optimization.fuse.ave-width 1 ¶: Upper limit for width, where width = num_nodes / height, a good measure of parallelizability

optimization.fuse.max-width None ¶: Don't fuse if total width is greater than this. Set to null to dynamically adjust to 1.5 + ave_width * log(ave_width + 1)

optimization.fuse.max-height inf ¶: Don't fuse more than this many levels

optimization.fuse.max-depth-new-edges None ¶: Don't fuse if new dependencies are added after this many levels. Set to null to dynamically adjust to ave_width * 1.5.

optimization.fuse.subgraphs None ¶: Set to True to fuse multiple tasks into SubgraphCallable objects. Set to None to let the default optimizer of individual dask collections decide. If no collection-specific default exists, None defaults to False.

optimization.fuse.rename-keys True ¶: Set to true to rename the fused keys with `default_fused_keys_renamer`. Renaming fused keys can keep the graph more understandable and comprehensible, but it comes at the cost of additional processing. If False, then the top-most key will be used. For advanced usage, a function to create the new name is also accepted.

admin.traceback.shorten ['concurrent[\\\\\\/]futures[\\\\\\/]', 'dask[\\\\\\/](base|core|local|multiprocessing|optimization|threaded|utils)\\.py', 'dask[\\\\\\/]array[\\\\\\/]core\\.py', 'dask[\\\\\\/]dataframe[\\\\\\/](core|methods)\\.py', 'distributed[\\\\\\/](client|scheduler|utils|worker)\\.py', 'tornado[\\\\\\/]gen\\.py', 'pandas[\\\\\\/]core[\\\\\\/]'] ¶: Clean up Dask tracebacks for readability. Remove all modules that match one of the listed regular expressions. Always preserve the first and last frame.

Distributed Client ¶

distributed.client.heartbeat 5s ¶: This value is the time between heartbeats. The client sends a periodic heartbeat message to the scheduler. If it misses enough of these then the scheduler assumes that it has gone.

distributed.client.scheduler-info-interval 2s ¶: Interval between scheduler-info updates

distributed.client.security-loader None ¶: A fully qualified name (e.g. ``module.submodule.function``) of a callback to use for loading security credentials for the client. If no security object is explicitly passed when creating a ``Client``, this callback is called with a dict containing client information (currently just ``address``), and should return a ``Security`` object to use for this client, or ``None`` to fallback to the default security configuration.

distributed.client.preload [] ¶: Run custom modules during the lifetime of the client. You can run custom modules when the client starts up and closes down. See https://docs.dask.org/en/latest/how-to/customize-initialization.html for more information.

distributed.client.preload-argv [] ¶: Arguments to pass into the preload scripts described above. See https://docs.dask.org/en/latest/how-to/customize-initialization.html for more information.

Distributed Comm ¶

distributed.comm.retry.count 0 ¶: The number of times to retry a connection

distributed.comm.retry.delay.min 1s ¶: The first non-zero delay between retry attempts

distributed.comm.retry.delay.max 20s ¶: The maximum delay between retries

distributed.comm.compression False ¶: The compression algorithm to use. 'auto' defaults to lz4 if installed, otherwise to snappy if installed, otherwise to false. zlib and zstd are only used if explicitly requested here. Uncompressible data and transfers on localhost are always uncompressed, regardless of this setting. See also distributed.worker.memory.spill-compression.

distributed.comm.shard 64MiB ¶: The maximum size of a frame to send through a comm. Some network infrastructure doesn't like sending through very large messages. Dask comms will cut up these large messages into many small ones. This attribute determines the maximum size of such a shard.

distributed.comm.offload 10MiB ¶: The size of message after which we choose to offload serialization to another thread. In some cases, you may also choose to disable this altogether with the value false. This is useful if you want to include serialization in profiling data, or if you have data types that are particularly sensitive to deserialization.

distributed.comm.default-scheme tcp ¶: The default protocol to use, like tcp or tls

distributed.comm.socket-backlog 2048 ¶: When shuffling data between workers, there can be O(cluster size) connection requests on a single worker socket. Make sure the backlog is large enough not to lose any.

distributed.comm.ucx.cuda-copy None ¶: Set environment variables to enable CUDA support over UCX. This may be used even if InfiniBand and NVLink are not supported or are disabled, in which case data is transferred over TCP.

distributed.comm.ucx.tcp None ¶: Set environment variables to enable TCP over UCX, even if InfiniBand and NVLink are not supported or disabled.

distributed.comm.ucx.nvlink None ¶: Set environment variables to enable UCX over NVLink, implies ``distributed.comm.ucx.tcp=True``.

distributed.comm.ucx.infiniband None ¶: Set environment variables to enable UCX over InfiniBand, implies ``distributed.comm.ucx.tcp=True``.

distributed.comm.ucx.rdmacm None ¶: Set environment variables to enable UCX RDMA connection manager support, requires ``distributed.comm.ucx.infiniband=True``.

distributed.comm.ucx.create-cuda-context None ¶: Creates a CUDA context before UCX is initialized. This is necessary to enable UCX to properly identify connectivity of GPUs with specialized networking hardware, such as InfiniBand. This permits UCX to choose transports automatically, without specifying additional variables for each transport, while ensuring optimal connectivity. When ``True``, a CUDA context will be created on the first device listed in ``CUDA_VISIBLE_DEVICES``.

distributed.comm.zstd.level 3 ¶: Compression level, between 1 and 22.

distributed.comm.zstd.threads 0 ¶: Number of threads to use. 0 for single-threaded, -1 to infer from cpu count.

distributed.comm.timeouts.connect 30s ¶: No Comment

distributed.comm.timeouts.tcp 30s ¶: No Comment

distributed.comm.require-encryption None ¶: Whether to require encryption on non-local comms

distributed.comm.tls.ciphers None ¶: Allowed ciphers, specified as an OpenSSL cipher string.

distributed.comm.tls.min-version 1.2 ¶: The minimum TLS version to support. Defaults to TLS 1.2.

distributed.comm.tls.max-version None ¶: The maximum TLS version to support. Defaults to the maximum version supported by the platform.
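
To give a feel for what a minimum TLS version means in practice, here is a minimal sketch using Python's standard `ssl` module. The helper `make_client_context` and its version mapping are assumptions made for illustration; the sketch does not claim to reproduce how Dask builds its own SSL contexts.

```python
import ssl


def make_client_context(min_version: str = "1.2") -> ssl.SSLContext:
    """Build a client-side context that refuses anything below min_version.

    Hypothetical helper; the string-to-enum mapping is an assumption.
    """
    versions = {
        "1.2": ssl.TLSVersion.TLSv1_2,
        "1.3": ssl.TLSVersion.TLSv1_3,
    }
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.minimum_version = versions[min_version]
    return ctx


ctx = make_client_context("1.2")
print(ctx.minimum_version)
```

Leaving `maximum_version` untouched lets the platform pick the highest version it supports, which mirrors the documented default for `max-version`.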

distributed.comm.tls.ca-file None ¶: Path to a CA file, in pem format

distributed.comm.tls.scheduler.cert None ¶: Path to certificate file

distributed.comm.tls.scheduler.key None ¶: Path to key file. Alternatively, the key can be appended to the cert file above, and this field left blank

distributed.comm.tls.worker.key None ¶: Path to key file. Alternatively, the key can be appended to the cert file above, and this field left blank

distributed.comm.tls.worker.cert None ¶: Path to certificate file

distributed.comm.tls.client.key None ¶: Path to key file. Alternatively, the key can be appended to the cert file above, and this field left blank

distributed.comm.tls.client.cert None ¶: Path to certificate file

distributed.comm.websockets.shard 8MiB ¶: The maximum size of a websocket frame to send through a comm. This is somewhat duplicative of distributed.comm.shard, but websockets often have much smaller maximum message sizes than other protocols, so this attribute is used to set a smaller default shard size and to allow separate control of websocket message sharding.

Distributed Dashboard ¶

distributed.dashboard.link {scheme}://{host}:{port}/status ¶: The form for the dashboard links. This is used wherever we print out the link for the dashboard. It is filled in with relevant information like the scheme, host, and port number.
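
The template uses ordinary Python format fields, so filling it in can be sketched with `str.format`. The host and port values below are placeholders chosen for the example.

```python
# Default template from the reference entry above.
template = "{scheme}://{host}:{port}/status"

# Placeholder values for illustration.
link = template.format(scheme="http", host="127.0.0.1", port=8787)
print(link)
```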

distributed.dashboard.export-tool False ¶: No Comment

distributed.dashboard.graph-max-items 5000 ¶: maximum number of tasks to try to plot in "graph" view

distributed.dashboard.prometheus.namespace dask ¶: Namespace prefix to use for all prometheus metrics.

Distributed Deploy ¶

distributed.deploy.lost-worker-timeout 15s ¶: Interval after which to hard-close a lost worker job. Otherwise we wait for a while to see if a worker will reappear.

distributed.deploy.cluster-repr-interval 500ms ¶: Interval between calls to update cluster-repr for the widget

Distributed Scheduler ¶

distributed.scheduler.allowed-failures 3 ¶: The number of retries before a task is considered bad. When a worker dies while a task is running, that task is rerun elsewhere. If many workers die while running this same task then we call the task bad, and raise a KilledWorker exception. This is the number of workers that are allowed to die before this task is marked as bad.

distributed.scheduler.bandwidth 100000000 ¶: The expected bandwidth between any pair of workers. This is used when making scheduling decisions. The scheduler will use this value as a baseline, but also learn it over time.

distributed.scheduler.blocked-handlers [] ¶: A list of handlers to exclude. The scheduler operates by receiving messages from various workers and clients and then performing operations based on those messages. Each message has an operation like "close-worker" or "task-finished". In some high security situations administrators may choose to block certain handlers from running. Those handlers can be listed here. For a list of handlers see the `dask.distributed.Scheduler.handlers` attribute.

distributed.scheduler.contact-address None ¶: The address that the scheduler advertises to workers for communication with it. To be specified when the address to which the scheduler binds cannot be the same as the address that workers use to contact the scheduler (e.g. because the former is private and the scheduler is in a different network than the workers).

distributed.scheduler.default-data-size 1kiB ¶: The default size of a piece of data if we don't know anything about it. This is used by the scheduler in some scheduling decisions

distributed.scheduler.events-cleanup-delay 1h ¶: The amount of time to wait until workers or clients are removed from the event log after they have been removed from the scheduler

distributed.scheduler.idle-timeout None ¶: Shut down the scheduler after this duration if no activity has occurred

distributed.scheduler.no-workers-timeout None ¶: Shut down the scheduler after this duration if there are pending tasks, but no workers that can process them. This can either mean that there are no workers running at all, or that there are idle workers but they've been excluded through worker or resource restrictions. In adaptive clusters, this timeout must be set to be safely higher than the time it takes for workers to spin up. Works in conjunction with idle-timeout.

distributed.scheduler.work-stealing True ¶: Whether or not to balance work between workers dynamically. Sometimes one worker has more work than we expected. The scheduler will move these tasks around as necessary by default. Set this to false to disable this behavior.
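
Any key in this reference can also be set through an environment variable, following the pattern shown earlier (`DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING=True`). The sketch below derives such a name from a dotted key. The helper `config_key_to_env_var` is hypothetical and is inferred from that single example; it is not a Dask function.

```python
def config_key_to_env_var(key: str) -> str:
    """Turn a dotted config key into a DASK_-prefixed environment variable name.

    Hypothetical helper: dots become double underscores, hyphens become
    single underscores, and the result is upper-cased.
    """
    return "DASK_" + key.upper().replace("-", "_").replace(".", "__")


name = config_key_to_env_var("distributed.scheduler.work-stealing")
print(name)
```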

distributed.scheduler.work-stealing-interval 100ms ¶: How frequently to balance worker loads

distributed.scheduler.worker-saturation 1.1 ¶: Controls how many root tasks are sent to workers (like a `readahead`). Up to worker-saturation * nthreads root tasks are sent to a worker at a time. If `.inf`, all runnable tasks are immediately sent to workers. The target number is rounded up, so any `worker-saturation` value > 1.0 guarantees at least one extra task will be sent to workers. Allowing oversaturation (> 1.0) means a worker may start running a new root task as soon as it completes the previous, even if there is a higher-priority downstream task to run. This reduces worker idleness, by letting workers do something while waiting for further instructions from the scheduler, even if it's not the most efficient thing to do. This generally comes at the expense of increased memory usage. It leads to "wider" (more breadth-first) execution of the graph. Compute-bound workloads may benefit from oversaturation. Memory-bound workloads should generally leave `worker-saturation` at 1.0, though 1.25-1.5 could slightly improve performance if ample memory is available.
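
The arithmetic described above can be sketched as follows. The function `root_task_target` is a hypothetical helper that only restates the documented rule (multiply by the thread count, round up, treat infinity as unlimited); it is not the scheduler's actual code.

```python
import math


def root_task_target(nthreads: int, worker_saturation: float) -> float:
    """Upper bound on root tasks sent to one worker at a time.

    Hypothetical helper restating the documented rule.
    """
    if math.isinf(worker_saturation):
        return math.inf  # all runnable tasks are sent immediately
    return math.ceil(worker_saturation * nthreads)


for nthreads in (1, 4):
    print(nthreads, root_task_target(nthreads, 1.0), root_task_target(nthreads, 1.1))
```

With the default of 1.1, a 4-thread worker receives up to 5 root tasks, while a value of exactly 1.0 keeps the count equal to the number of threads.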

distributed.scheduler.worker-ttl 5 minutes ¶: Time to live for workers. If we don't receive a heartbeat faster than this then we assume that the worker has died.

distributed.scheduler.preload [] ¶: Run custom modules during the lifetime of the scheduler. You can run custom modules when the scheduler starts up and closes down. See https://docs.dask.org/en/latest/how-to/customize-initialization.html for more information

distributed.scheduler.preload-argv [] ¶: Arguments to pass into the preload scripts described above. See https://docs.dask.org/en/latest/how-to/customize-initialization.html for more information

distributed.scheduler.unknown-task-duration 500ms ¶: Default duration for all tasks with unknown durations. Over time the scheduler learns a duration for tasks. However when it sees a new type of task for the first time it has to make a guess as to how long it will take. This value is that guess.

distributed.scheduler.default-task-durations.rechunk-split 1us ¶: No Comment

distributed.scheduler.default-task-durations.split-shuffle 1us ¶: No Comment

distributed.scheduler.default-task-durations.split-taskshuffle 1us ¶: No Comment

distributed.scheduler.default-task-durations.split-stage 1us ¶: No Comment

distributed.scheduler.validate False ¶: Whether or not to run consistency checks during execution. This is typically only used for debugging.

distributed.scheduler.dashboard.status.task-stream-length 1000 ¶: The maximum number of tasks to include in the task stream plot

distributed.scheduler.dashboard.tasks.task-stream-length 100000 ¶: The maximum number of tasks to include in the task stream plot

distributed.scheduler.dashboard.tls.ca-file None ¶: No Comment

distributed.scheduler.dashboard.tls.key None ¶: No Comment

distributed.scheduler.dashboard.tls.cert None ¶: No Comment

distributed.scheduler.dashboard.bokeh-application.allow_websocket_origin ['*'] ¶: No Comment

distributed.scheduler.dashboard.bokeh-application.keep_alive_milliseconds 500 ¶: No Comment

distributed.scheduler.dashboard.bokeh-application.check_unused_sessions_milliseconds 500 ¶: No Comment

distributed.scheduler.locks.lease-validation-interval 10s ¶: The interval in which the scheduler validates staleness of all acquired leases. Must always be smaller than the lease-timeout itself.

distributed.scheduler.locks.lease-timeout 30s ¶: Maximum interval to wait for a Client refresh before a lease is invalidated and released.

distributed.scheduler.http.routes ['distributed.http.scheduler.prometheus', 'distributed.http.scheduler.info', 'distributed.http.scheduler.json', 'distributed.http.health', 'distributed.http.proxy', 'distributed.http.statics'] ¶: A list of modules like "prometheus" and "health" that can be included or excluded as desired. These modules will have a ``routes`` keyword that gets added to the main HTTP Server. This is also a list that can be extended with user defined modules.

distributed.scheduler.allowed-imports ['dask', 'distributed'] ¶: A list of trusted root modules the scheduler is allowed to import (incl. submodules). For security reasons, the scheduler does not import arbitrary Python modules.

distributed.scheduler.active-memory-manager.start True ¶: Set to true to auto-start the AMM on Scheduler init.

distributed.scheduler.active-memory-manager.interval 2s ¶: Time expression, e.g. "2s". Run the AMM cycle every <interval>.

distributed.scheduler.active-memory-manager.measure optimistic ¶: One of the attributes of distributed.scheduler.MemoryState

distributed.scheduler.active-memory-manager.policies [{'class': 'distributed.active_memory_manager.ReduceReplicas'}] ¶: No Comment

Distributed Worker ¶

distributed.worker.blocked-handlers [] ¶: A list of handlers to exclude. The scheduler operates by receiving messages from various workers and clients and then performing operations based on those messages. Each message has an operation like "close-worker" or "task-finished". In some high security situations administrators may choose to block certain handlers from running. Those handlers can be listed here. For a list of handlers see the `dask.distributed.Scheduler.handlers` attribute.

distributed.worker.multiprocessing-method spawn ¶: How we create new workers, one of "spawn", "forkserver", or "fork". This is passed to the ``multiprocessing.get_context`` function.
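
For readers unfamiliar with `multiprocessing.get_context`, this short standard-library sketch shows what passing the default value "spawn" to it returns. It illustrates the stdlib call only, not how Dask starts workers.

```python
import multiprocessing

# "spawn" is available on all platforms; "fork" and "forkserver" are not.
ctx = multiprocessing.get_context("spawn")
print(ctx.get_start_method())
```

Processes created from `ctx` (for example with `ctx.Process`) use the chosen start method without changing the global default.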

distributed.worker.use-file-locking True ¶: Whether or not to use lock files when creating workers. Workers create a local directory in which to place temporary files. When many workers are created on the same process at once these workers can conflict with each other by trying to create this directory all at the same time. To avoid this, Dask usually uses a file-based lock. However, on some systems file-based locks don't work. This is particularly common on HPC NFS systems, where users may want to set this to false.

distributed.worker.transfer.message-bytes-limit 50MB ¶: The maximum amount of data for a worker to request from another in a single gather operation. Tasks are gathered in batches, and if the first task in a batch is larger than this value, the task will still be gathered to ensure progress. Hence, this limit is not absolute. Note that this limit applies to a single gather operation and a worker may gather data from multiple workers in parallel.
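
One way to picture why the limit is not absolute is a greedy batching rule that always accepts the first item. The function `select_gather_batch` below is a hypothetical sketch of that idea under simplifying assumptions (a fixed, ordered list of sizes); the worker's real selection logic may differ.

```python
def select_gather_batch(sizes: list[int], limit: int) -> list[int]:
    """Pick a prefix of sizes whose total stays within limit.

    The first item is always taken, even if it alone exceeds the limit,
    so that progress is guaranteed. Hypothetical illustration only.
    """
    batch: list[int] = []
    total = 0
    for size in sizes:
        if batch and total + size > limit:
            break
        batch.append(size)
        total += size
    return batch


LIMIT = 50 * 10**6  # 50MB
print(select_gather_batch([60 * 10**6, 10], LIMIT))
print(select_gather_batch([20 * 10**6] * 3, LIMIT))
```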

distributed.worker.connections.outgoing 50 ¶: No Comment

distributed.worker.connections.incoming 10 ¶: No Comment

distributed.worker.preload [] ¶: Run custom modules during the lifetime of the worker. You can run custom modules when the worker starts up and closes down. See https://docs.dask.org/en/latest/how-to/customize-initialization.html for more information

distributed.worker.preload-argv [] ¶: Arguments to pass into the preload scripts described above. See https://docs.dask.org/en/latest/how-to/customize-initialization.html for more information

distributed.worker.daemon True ¶: Whether or not to run our process as a daemon process

distributed.worker.validate False ¶: Whether or not to run consistency checks during execution. This is typically only used for debugging.

distributed.worker.lifetime.duration None ¶: The time after creation to close the worker, like "1 hour"

distributed.worker.lifetime.stagger 0 seconds ¶: Random amount by which to stagger lifetimes. If you create many workers at the same time, you may want to avoid having them kill themselves all at the same time. To avoid this you might want to set a stagger time, so that they close themselves with some random variation, like "5 minutes". That way some workers can die, new ones can be brought up, and data can be transferred over smoothly.
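
The effect of a stagger can be sketched by adding a random offset to each worker's lifetime. The helper `staggered_lifetime` is hypothetical, and the choice of a symmetric offset drawn uniformly from plus or minus the stagger is an assumption made for illustration.

```python
import random


def staggered_lifetime(duration_s: float, stagger_s: float, rng: random.Random) -> float:
    """Return a lifetime in seconds with a random offset within +/- stagger_s.

    Hypothetical helper; the symmetric uniform offset is an assumption.
    """
    return duration_s + rng.uniform(-stagger_s, stagger_s)


rng = random.Random(0)  # seeded so the example is repeatable
lifetimes = [staggered_lifetime(3600, 300, rng) for _ in range(5)]
print([round(t) for t in lifetimes])
```

With a lifetime of "1 hour" and a stagger of "5 minutes", the five workers close at different times between 55 and 65 minutes rather than all at once.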

distributed.worker.lifetime.restart False ¶: Do we try to resurrect the worker after the lifetime deadline?

distributed.worker.profile.enabled True ¶: Whether or not to enable profiling

distributed.worker.profile.interval 10ms ¶: The time between polling the worker threads, typically short like 10ms

distributed.worker.profile.cycle 1000ms ¶: The time between bundling together this data and sending it to the scheduler. This controls the granularity at which people can query the profile information on the time axis.

distributed.worker.profile.low-level False ¶: Whether or not to use the libunwind and stacktrace libraries to gather profiling information at the lower level (beneath Python). To get this to work you will need to install the experimental stacktrace library with conda install -c numba stacktrace. See https://github.com/numba/stacktrace

distributed.worker.memory.recent-to-old-time 30s ¶: When there is an increase in process memory (as observed by the operating system) that is not accounted for by the dask keys stored on the worker, ignore it for this long before considering it in non-time-sensitive heuristics. This should be set to be longer than the duration of most dask tasks.

distributed.worker.memory.rebalance.measure optimistic ¶: Which of the properties of distributed.scheduler.MemoryState should be used for measuring worker memory usage

distributed.worker.memory.rebalance.sender-min 0.3 ¶: Fraction of worker process memory at which we start potentially transferring data to other workers.

distributed.worker.memory.rebalance.recipient-max 0.6 ¶: Fraction of worker process memory at which we stop potentially receiving data from other workers. Ignored when max_memory is not set.

distributed.worker.memory.rebalance.sender-recipient-gap 0.1 ¶: Fraction of worker process memory, around the cluster mean, where a worker is neither a sender nor a recipient of data during a rebalance operation. E.g. if the mean cluster occupation is 50%, sender-recipient-gap=0.1 means that only nodes above 55% will donate data and only nodes below 45% will receive them. This helps avoid data from bouncing around the cluster repeatedly.
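
The worked example in the entry above (mean 50%, gap 0.1, senders above 55%, recipients below 45%) can be restated as a small function. `rebalance_role` is a hypothetical helper that considers only the gap; it deliberately ignores `sender-min` and `recipient-max`, which also constrain a real rebalance.

```python
def rebalance_role(usage: float, cluster_mean: float, gap: float = 0.1) -> str:
    """Classify a worker by its memory fraction relative to the cluster mean.

    Hypothetical helper; only the sender-recipient-gap rule is modelled.
    """
    half = gap / 2
    if usage > cluster_mean + half:
        return "sender"
    if usage < cluster_mean - half:
        return "recipient"
    return "neither"


for usage in (0.44, 0.50, 0.56):
    print(usage, rebalance_role(usage, cluster_mean=0.50))
```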

distributed.worker.memory.transfer 0.1 ¶: When the total size of incoming data transfers gets above this amount, we start throttling incoming data transfers

distributed.worker.memory.target 0.6 ¶: When the process memory (as observed by the operating system) gets above this amount, we start spilling the dask keys holding the oldest chunks of data to disk

distributed.worker.memory.spill 0.7 ¶: When the process memory (as observed by the operating system) gets above this amount, we spill data to disk, starting from the dask keys holding the oldest chunks of data, until the process memory falls below the target threshold.

distributed.worker.memory.pause 0.8 ¶: When the process memory (as observed by the operating system) gets above this amount, we no longer start new tasks or fetch new data on the worker.

distributed.worker.memory.terminate 0.95 ¶: When the process memory reaches this level the nanny process will kill the worker (if a nanny is present)
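
The four thresholds above form an escalating ladder. The sketch below lists which of the default thresholds a given memory fraction has crossed. `crossed_thresholds` is a hypothetical helper for reading the defaults; it simplifies by comparing a single fraction against every threshold and does not model how the worker or nanny actually reacts.

```python
# Default values taken from the entries above.
THRESHOLDS = {"target": 0.6, "spill": 0.7, "pause": 0.8, "terminate": 0.95}


def crossed_thresholds(memory_fraction: float) -> list[str]:
    """Return the names of all thresholds that memory_fraction is above.

    Hypothetical helper for illustration only.
    """
    return [name for name, limit in THRESHOLDS.items() if memory_fraction > limit]


for fraction in (0.5, 0.75, 0.97):
    print(fraction, crossed_thresholds(fraction))
```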

distributed.worker.memory.max-spill False ¶: Limit of number of bytes to be spilled on disk.

distributed.worker.memory.spill-compression auto ¶: The compression algorithm to use. 'auto' defaults to lz4 if installed, otherwise to snappy if installed, otherwise to false. zlib and zstd are only used if explicitly requested here. Uncompressible data is always uncompressed, regardless of this setting. See also distributed.comm.compression.

distributed.worker.memory.monitor-interval 100ms ¶: Interval between checks for the spill, pause, and terminate thresholds

distributed.worker.http.routes ['distributed.http.worker.prometheus', 'distributed.http.health', 'distributed.http.statics'] ¶: A list of modules like "prometheus" and "health" that can be included or excluded as desired. These modules will have a ``routes`` keyword that gets added to the main HTTP Server. This is also a list that can be extended with user defined modules.

Distributed Nanny ¶

distributed.nanny.preload [] ¶: Run custom modules during the lifetime of the nanny. You can run custom modules when the nanny starts up and closes down. See https://docs.dask.org/en/latest/how-to/customize-initialization.html for more information

distributed.nanny.preload-argv [] ¶: Arguments to pass into the preload scripts described above. See https://docs.dask.org/en/latest/how-to/customize-initialization.html for more information

distributed.nanny.pre-spawn-environ.MALLOC_TRIM_THRESHOLD_ 65536 ¶: No Comment

distributed.nanny.pre-spawn-environ.OMP_NUM_THREADS 1 ¶: No Comment

distributed.nanny.pre-spawn-environ.MKL_NUM_THREADS 1 ¶: No Comment

distributed.nanny.pre-spawn-environ.OPENBLAS_NUM_THREADS 1 ¶: No Comment

Distributed Admin ¶

distributed.admin.large-graph-warning-threshold 10MB ¶: Threshold in bytes for when a warning is raised about a large submitted task graph. Default is 10MB.

distributed.admin.tick.interval 20ms ¶: The time between ticks, default 20ms

distributed.admin.tick.limit 3s ¶: The time allowed before triggering a warning

distributed.admin.tick.cycle 1s ¶: The time in between verifying event loop speed

distributed.admin.max-error-length 10000 ¶: Maximum length of traceback as text. Some Python tracebacks can be very long (particularly in stack overflow errors). If the traceback is larger than this size (in bytes) then we truncate it.

distributed.admin.log-length 10000 ¶: Maximum length of worker/scheduler logs to keep in memory. They can be retrieved with get_scheduler_logs() / get_worker_logs(). Set to null for unlimited.

distributed.admin.log-format %(asctime)s - %(name)s - %(levelname)s - %(message)s ¶: The log format to emit. See https://docs.python.org/3/library/logging.html#logrecord-attributes
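
To see what the default format string produces, the following sketch applies it with Python's standard `logging.Formatter`. The logger name, file name, and message are placeholders chosen for the example.

```python
import logging

# Default value from the reference entry above.
LOG_FORMAT = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"

formatter = logging.Formatter(LOG_FORMAT)

# Build a record by hand so the example has no side effects on global loggers.
record = logging.LogRecord(
    name="distributed.worker",  # placeholder logger name
    level=logging.INFO,
    pathname="example.py",      # placeholder file name
    lineno=1,
    msg="Starting worker",
    args=None,
    exc_info=None,
)
line = formatter.format(record)
print(line)
```

The printed line starts with a timestamp, followed by the logger name, level, and message separated by " - ".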

distributed.admin.low-level-log-length 1000 ¶: Maximum length of various event logs for developers. Set to null for unlimited.

distributed.admin.pdb-on-err False ¶: Enter Python Debugger on scheduling error

distributed.admin.system-monitor.interval 500ms ¶: Polling time to query CPU/memory statistics. Default is 500ms.

distributed.admin.system-monitor.log-length 7200 ¶: Maximum number of samples to keep in memory. Multiply by `interval` to obtain log duration. Set to null for unlimited.
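
As a quick check of the "multiply by `interval`" rule, the defaults of 7200 samples at 500ms each work out as follows. The sketch uses only `datetime.timedelta`.

```python
from datetime import timedelta

interval = timedelta(milliseconds=500)  # default system-monitor.interval
log_length = 7200                       # default system-monitor.log-length

duration = interval * log_length
print(duration)  # 1:00:00
```

So the default settings keep one hour of samples in memory.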

distributed.admin.system-monitor.disk True ¶: Should we include disk metrics? (they can cause issues in some systems)

distributed.admin.system-monitor.host-cpu False ¶: Should we include host-wide CPU usage, with very granular breakdown?

distributed.admin.system-monitor.gil.enabled True ¶: Enable monitoring of GIL contention

distributed.admin.system-monitor.gil.interval 1ms ¶: GIL polling interval. More frequent polling will reflect a more accurate GIL contention metric but will be more likely to impact runtime performance.

distributed.admin.event-loop tornado ¶: The event loop to use. Must be one of tornado, asyncio, or uvloop.

Distributed RMM ¶

distributed.rmm.pool-size None ¶: The size of the memory pool in bytes.
