Dask is a community maintained project. We welcome contributions in the form of bug reports, documentation, code, design proposals, and more. This page provides resources on how best to contribute.
Where to ask for help¶
Dask conversation happens in the following places:
- Stack Overflow #dask tag: for usage questions
- GitHub Issue Tracker: for discussions around new features or established bugs
- Gitter chat: for real-time discussion
For usage questions and bug reports we strongly prefer the use of Stack Overflow and GitHub issues over gitter chat. GitHub and Stack Overflow are more easily searchable by future users and so is more efficient for everyone’s time. Gitter chat is generally reserved for community discussion.
Separate Code Repositories¶
Dask maintains code and documentation in a few git repositories hosted on the
dask organization, https://github.com/dask. This includes the primary
repository and several other repositories for different components. A
non-exhaustive list follows:
- https://github.com/dask/dask: The main code repository holding parallel algorithms, the single-machine scheduler, and most documentation
- https://github.com/dask/distributed: The distributed memory scheduler
- https://github.com/dask/dask-ml: Machine learning algorithms
- https://github.com/dask/s3fs: S3 Filesystem interface
- https://github.com/dask/gcsfs: GCS Filesystem interface
- https://github.com/dask/hdfs3: Hadoop Filesystem interface
Git and GitHub can be challenging at first. Fortunately good materials exist on the internet. Rather than repeat these materials here, we refer you to Pandas’ documentation and links on this subject at https://pandas.pydata.org/pandas-docs/stable/contributing.html
The community discusses and tracks known bugs and potential features in the GitHub Issue Tracker. If you have a new idea or have identified a bug, then you should raise it there to start public discussion.
If you are looking for an introductory issue to get started with development, then check out the “good first issue” label, which contains issues that are good for starting developers. Generally, familiarity with Python, NumPy, Pandas, and some parallel computing are assumed.
Make a fork of the main Dask repository and clone the fork:
git clone https://github.com/<your-github-username>/dask
Contributions to Dask can then be made by submitting pull requests on GitHub.
To build the library you can install the necessary requirements using pip or conda:
python -m pip install -e ".[complete]"
conda env create -n dask-dev -f continuous_integration/environment-latest.yaml conda activate dask-dev python -m pip install --no-deps -e .
Contributing to Code¶
Dask maintains development standards that are similar to most PyData projects. These standards include language support, testing, documentation, and style.
Dask supports Python versions 3.5, 3.6, and 3.7.
Name changes are handled by the
Dask employs extensive unit tests to ensure correctness of code both for today and for the future. Test coverage is expected for all code contributions.
Tests are written in a py.test style with bare functions:
def test_fibonacci(): assert fib(0) == 0 assert fib(1) == 0 assert fib(10) == 55 assert fib(8) == fib(7) + fib(6) for x in [-3, 'cat', 1.5]: with pytest.raises(ValueError): fib(x)
These tests should compromise well between covering all branches and fail cases and running quickly (slow test suites get run less often).
You can run tests locally by running
py.test in the local dask directory:
py.test dask --verbose
You can also test certain modules or individual tests for faster response:
py.test dask/dataframe --verbose py.test dask/dataframe/tests/test_dataframe.py::test_rename_index
Tests run automatically on the Travis.ci and Appveyor continuous testing frameworks on every push to every pull request on GitHub.
Tests are organized within the various modules’ subdirectories:
dask/array/tests/test_*.py dask/bag/tests/test_*.py dask/bytes/tests/test_*.py dask/dataframe/tests/test_*.py dask/diagnostics/tests/test_*.py
For the Dask collections like Dask Array and Dask DataFrame, behavior is
typically tested directly against the NumPy or Pandas libraries using the
import numpy as np import dask.array as da from dask.array.utils import assert_eq def test_aggregations(): nx = np.random.random(100) dx = da.from_array(nx, chunks=(10,)) assert_eq(nx.sum(), dx.sum()) assert_eq(nx.min(), dx.min()) assert_eq(nx.max(), dx.max()) ...
This technique helps to ensure compatibility with upstream libraries and tends
to be simpler than testing correctness directly. Additionally, by passing Dask
collections directly to the
assert_eq function rather than call compute
manually, the testing suite is able to run a number of checks on the lazy
User facing functions should roughly follow the numpydoc standard, including
Examples, and general explanatory prose.
By default, examples will be doc-tested. Reproducible examples in documentation
is valuable both for testing and, more importantly, for communication of common
usage to the user. Documentation trumps testing in this case and clear
examples should take precedence over using the docstring as testing space.
To skip a test in the examples add the comment
# doctest: +SKIP directly
after the line.
def fib(i): """ A single line with a brief explanation A more thorough description of the function, consisting of multiple lines or paragraphs. Parameters ---------- i: int A short description of the argument if not immediately clear Examples -------- >>> fib(4) 3 >>> fib(5) 5 >>> fib(6) 8 >>> fib(-1) # Robust to bad inputs ValueError(...) """
Docstrings are currently tested under Python 3.6 on Travis.ci. You can test docstrings with pytest as follows:
py.test dask --doctest-modules
Docstring testing requires
graphviz to be installed. This can be done via:
conda install -y graphviz
python -m pip install black flake8
and then run from the root of the Dask repository:
black dask flake8 dask
to auto-format your code. Additionally, many editors have plugins that will
black as you edit files.
Optionally, you may wish to setup pre-commit hooks
to automatically run
flake8 when you make a git commit. This
can be done by installing
python -m pip install pre-commit
and then running:
from the root of the Dask repository. Now
flake8 will be run
each time you commit changes. You can skip these checks with
git commit --no-verify.
Contributing to Documentation¶
Dask uses Sphinx for documentation, hosted on https://readthedocs.org .
Documentation is maintained in the RestructuredText markup language (
dask/docs/source. The documentation consists both of prose
and API documentation.
To build the documentation locally, clone this repository and install
the necessary requirements using
git clone https://github.com/dask/dask.git cd dask/docs
python -m pip install -r requirements-docs.txt
conda create -n daskdocs -c conda-forge --file requirements-docs.txt conda activate daskdocs
Then build the documentation with
The resulting HTML files end up in the
You can now make edits to rst files and run
make html again to update
the affected pages.