Changelog

2024.4.2

Highlights

Trivial Merge Implementation

The Query Optimizer will inspect queries to determine whether a merge(...) or groupby(...).apply(...) requires a shuffle. A shuffle can be avoided if the DataFrame was already shuffled on the same columns in a previous step, without any operations in between that change the partitioning layout or the relevant values in each partition.

>>> result = df.merge(df2, on="a")
>>> result = result.merge(df3, on="a")

The Query Optimizer will identify that result was already shuffled on "a" in the first merge and will therefore only shuffle df3 in the second merge operation before performing a blockwise merge.
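
A minimal end-to-end sketch of this pattern (the frames and column names below are hypothetical and for illustration only):

import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({"a": range(100), "b": range(100)}), npartitions=4)
df2 = dd.from_pandas(pd.DataFrame({"a": range(100), "c": range(100)}), npartitions=4)
df3 = dd.from_pandas(pd.DataFrame({"a": range(100), "d": range(100)}), npartitions=4)

result = df.merge(df2, on="a")      # shuffles df and df2 on "a"
result = result.merge(df3, on="a")  # result is already partitioned on "a",
                                    # so only df3 is shuffled before a blockwise merge
result.compute()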

Auto-partitioning in read_parquet

The Query Optimizer will automatically repartition datasets read from Parquet files if individual partitions are too small. This reduces the number of partitions and, consequently, the size of the task graph.

The Optimizer aims to produce partitions of at least 75 MB and will combine multiple files if necessary to reach this threshold. The value can be configured with:

>>> dask.config.set({"dataframe.parquet.minimum-partition-size": 100_000_000})

The value is given in bytes. The default threshold is deliberately conservative to avoid memory issues on worker nodes with a small amount of memory per thread.

Additional changes

2024.4.1

This is a minor bugfix release that fixes an error when importing dask.dataframe with Python 3.11.9.

See GH#11035 and GH#11039 from Richard (Rick) Zamora for details.

Additional changes

2024.4.0

Highlights

Query planning fixes

This release contains a variety of bugfixes in Dask DataFrame’s new query planner.

GPU metric dashboard fixes

GPU memory and utilization dashboard functionality has been restored. Previously these plots were unintentionally left blank.

See GH#8572 from Benjamin Zaitlen for details.

Additional changes

2024.3.1

This is a minor release that primarily demotes an exception to a warning if dask-expr is not installed when upgrading.

Additional changes

2024.3.0

Released on March 11, 2024

Highlights

Query planning

This release enables query planning by default for all users of dask.dataframe.

The query planning functionality represents a rewrite of the DataFrame module using dask-expr. This is a drop-in replacement and we expect that most users will not have to adjust any of their code. Any feedback can be reported on the Dask issue tracker or on the query planning feedback issue.

If you encounter any issues, you can still opt out by setting:

>>> import dask
>>> dask.config.set({'dataframe.query-planning': False})

Sunset of Pandas 1.X support

The new query planning backend requires at least pandas 2.0. This pandas version is installed automatically if you install from conda, or if you install dask[complete] or dask[dataframe] from pip.

The legacy DataFrame implementation still supports pandas 1.X if you install dask without extras.

Additional changes

2024.2.1

Released on February 23, 2024

Highlights

Allow silencing dask.DataFrame deprecation warning

The last release contained a DeprecationWarning that alerts users to an upcoming switch of dask.dataframe to the new backend with support for query planning (see also GH#10934).

This DeprecationWarning is triggered on import of the dask.dataframe module, and the community raised concerns about it being too verbose.

It is now possible to silence this warning:

# via Python
>>> dask.config.set({'dataframe.query-planning-warning': False})

# via CLI
dask config set dataframe.query-planning-warning False

See GH#10936 and GH#10925 from Miles for details.

More robust distributed scheduler for rare key collisions

Blockwise fusion optimization can cause task key collisions that were not handled properly by the distributed scheduler (see GH#9888). Users typically noticed this through one of various internal exceptions that caused a system deadlock or critical failure. While the root cause could not be fixed, the scheduler now implements a mechanism that should mitigate most occurrences and issues a warning if the problem is detected.

See GH#8185 from crusaderky and Florian Jetter for details.

Over the course of this work, various improvements to tokenization have been implemented. See GH#10913, GH#10884, GH#10919, GH#10896 and primarily GH#10883 from crusaderky for more details.

More robust adaptive scaling on large clusters

Adaptive scaling could previously lose data during downscaling if many tasks had to be moved. This typically, but not exclusively, occurred on large clusters; it manifested as recomputation of tasks and could cause clusters to oscillate between upscaling and downscaling without ever finishing.

See GH#8522 from crusaderky for more details.

Additional changes

2024.2.0

Released on February 9, 2024

Highlights

Deprecate Dask DataFrame implementation

The current Dask DataFrame implementation is deprecated. In a future release, Dask DataFrame will use a new implementation that contains several improvements, including logical query planning. The user-facing DataFrame API will remain unchanged.

The new implementation is already available and can be enabled by installing the dask-expr library:

$ pip install dask-expr

and turning the query planning option on:

>>> import dask
>>> dask.config.set({'dataframe.query-planning': True})
>>> import dask.dataframe as dd

API documentation for the new implementation is available at https://docs.dask.org/en/stable/dataframe-api.html

Any feedback can be reported on the Dask issue tracker https://github.com/dask/dask/issues

See GH#10912 from Patrick Hoefler for details.

Improved tokenization

This release contains several improvements to Dask’s object tokenization logic. More objects now produce deterministic tokens, which can lead to improved performance through caching of intermediate results.
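
As a rough illustration (not part of the release notes), deterministic tokenization means that structurally identical objects hash to the same key, which is what allows intermediate results to be reused:

from dask.base import tokenize

# Equal inputs reliably produce equal tokens ...
assert tokenize({"x": 1, "y": [1, 2, 3]}) == tokenize({"x": 1, "y": [1, 2, 3]})
# ... while different inputs produce different tokens
assert tokenize({"x": 1}) != tokenize({"x": 2})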

See GH#10898, GH#10904, GH#10876, GH#10874, and GH#10865 from crusaderky for details.

Additional changes

2024.1.1

Released on January 26, 2024

Highlights

Pandas 2.2 and Scipy 1.12 support

This release contains compatibility updates for the latest pandas and scipy releases.

See GH#10834, GH#10849, GH#10845, and GH#8474 from crusaderky for details.

Deprecations

Additional changes

2024.1.0

Released on January 12, 2024

Highlights

Partial rechunks within P2P

P2P rechunking now utilizes the relationships between input and output chunks. For situations that do not require all-to-all data transfer, this may significantly reduce the runtime and memory/disk footprint. It also enables task culling.
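
A minimal sketch (array shape and chunk sizes are hypothetical) of a rechunk in which each output chunk depends on only a few neighbouring input chunks, so P2P no longer needs an all-to-all transfer:

import dask
import dask.array as da
from dask.distributed import Client

client = Client()  # P2P rechunking requires the distributed scheduler

x = da.random.random((10_000, 1_000), chunks=(100, 1_000))

with dask.config.set({"array.rechunk.method": "p2p"}):
    # Each (500, 1000) output chunk needs data from just 5 adjacent input
    # chunks, so only a small part of the array moves between workers
    x.rechunk((500, 1_000)).sum().compute()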

See GH#8330 from Hendrik Makait for details.

Fastparquet engine deprecated

The fastparquet Parquet engine has been deprecated. Users should migrate to the pyarrow engine by installing PyArrow and removing engine="fastparquet" in read_parquet or to_parquet calls.
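
Migration usually amounts to installing PyArrow and dropping the engine argument; a hedged sketch (paths are hypothetical):

import dask.dataframe as dd

# Before: dd.read_parquet("data/*.parquet", engine="fastparquet")
# After: pyarrow is used by default once it is installed
df = dd.read_parquet("data/*.parquet")
df.to_parquet("out/")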

See GH#10743 from crusaderky for details.

Improved serialization for arbitrary data

This release improves serialization robustness for arbitrary data. Previously there were some cases where serialization could fail for data that is not msgpack-serializable. In those cases we now fall back to using pickle.

See GH#8447 from Hendrik Makait for details.

Additional deprecations

Additional changes

2023.12.1

Released on December 15, 2023

Highlights

Logical Query Planning now available for Dask DataFrames

Dask DataFrames are now much more performant by using a logical query planner. This feature is currently off by default, but can be turned on with:

dask.config.set({"dataframe.query-planning": True})

You also need to have dask-expr installed:

pip install dask-expr

We’ve seen promising performance improvements so far; see this blog post and these regularly updated benchmarks for more information. A more detailed explanation of how the query optimizer works can be found in this blog post.

This feature is still under active development and the API isn’t stable yet, so breaking changes can occur. We expect to make the query optimizer the default early next year.

See GH#10634 from Patrick Hoefler for details.

Dtype inference in read_parquet

read_parquet will now infer the Arrow types pa.date32(), pa.date64() and pa.decimal() as an ArrowDtype in pandas. These dtypes are backed by the original Arrow arrays and thus avoid the conversion to NumPy object dtype. Additionally, read_parquet will no longer infer nested and binary types as strings; they will instead be stored in NumPy object arrays.
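
A small sketch of the new behavior, assuming a local Parquet file with a date column (file and column names are hypothetical):

import pandas as pd
import dask.dataframe as dd

dates = pd.to_datetime(["2023-01-01", "2023-01-02"]).date  # datetime.date objects -> Arrow date32
pd.DataFrame({"when": dates}).to_parquet("dates.parquet")

df = dd.read_parquet("dates.parquet")
print(df.dtypes)  # the "when" column is now an ArrowDtype instead of NumPy object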

See GH#10698 and GH#10705 from Patrick Hoefler for details.

Scheduling improvements to reduce memory usage

This release includes a major rewrite to a core part of our scheduling logic. It includes a new approach to the topological sorting algorithm in dask.order, which determines the order in which tasks are run. Improper ordering is known to be a major contributor to excessive cluster memory pressure.
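
For the curious, the ordering logic lives in dask.order and can be inspected directly on a raw task graph; a tiny hand-built graph (keys and tasks are hypothetical):

from operator import add
from dask.order import order

# "c" depends on "a" and "b"
dsk = {"a": 1, "b": 2, "c": (add, "a", "b")}

# order() assigns each key a priority; lower numbers are scheduled earlier
print(order(dsk))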

Updates in this release fix a couple of performance regressions that were introduced in release 2023.10.0 (see GH#10535). Generally, computations should now be much more eager to release data once it is no longer required in memory.

See GH#10660, GH#10697 from Florian Jetter for details.

Improved P2P-based merging robustness and performance

This release contains several updates that fix a possible deadlock introduced in 2023.9.2 and improve the robustness of P2P-based merging when the cluster is dynamically scaling up.

See GH#8415, GH#8416, and GH#8414 from Hendrik Makait for details.

Removed disabling pickle option

The distributed.scheduler.pickle configuration option is no longer supported. As of the 2023.4.0 release, pickle is used to transmit task graphs, so it can no longer be disabled. We now raise an informative error when distributed.scheduler.pickle is set to False.

See GH#8401 from Florian Jetter for details.

Additional changes

2023.12.0

Released on December 1, 2023

Highlights

PipInstall restart and environment variables

The distributed.PipInstall plugin now has more robust restart logic and also supports environment variables.

The example below shows how users can combine the distributed.PipInstall plugin with a TOKEN environment variable to securely install a package from a private repository:

from dask.distributed import PipInstall

# `client` is an existing dask.distributed.Client connected to the cluster
plugin = PipInstall(packages=["private_package@git+https://${TOKEN}@github.com/dask/private_package.git"])
client.register_plugin(plugin)

See GH#8374, GH#8357, and GH#8343 from Hendrik Makait for details.

Bokeh 3.3.0 compatibility

This release contains compatibility updates for using bokeh>=3.3.0 with proxied Dask dashboards. Previously the contents of dashboard plots wouldn’t be displayed.

See GH#8347 and GH#8381 from Jacob Tomlinson for details.

Additional changes

2023.11.0

Released on November 10, 2023

Highlights

Zero-copy P2P Array Rechunking

Users should see significant performance improvements when using in-memory P2P array rechunking. This is due to no longer copying underlying data buffers.

The example below compares the performance of different rechunking methods.

# Assumes a dask.distributed cluster/client is already running
import dask
import dask.array as da

shape = (30_000, 6_000, 150)   # 201.17 GiB
input_chunks = (60, -1, -1)    # 411.99 MiB
output_chunks = (-1, 6, -1)    # 205.99 MiB

arr = da.random.random(shape, chunks=input_chunks)
with dask.config.set({
    "array.rechunk.method": "p2p",
    "distributed.p2p.disk": True,
}):
    (
        arr
        .rechunk(output_chunks)
        .sum()
        .compute()
    )

The accompanying figure compares rechunking performance between the tasks, p2p-with-disk, and p2p-without-disk methods on different cluster sizes; p2p without disk is up to 60% faster than the default tasks-based approach.

See GH#8282, GH#8318, GH#8321 from crusaderky and (GH#8322) from Hendrik Makait for details.

Deprecating PyArrow <14.0.1

pyarrow<14.0.1 usage is deprecated starting in this release. It is recommended that all users upgrade their version of pyarrow or install pyarrow-hotfix. See this CVE for full details.

See GH#10622 from Florian Jetter for details.

Improved PyArrow filesystem for Parquet

Using filesystem="arrow" when reading Parquet datasets now properly infers the correct cloud region when accessing remote, cloud-hosted data.
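
For example, a sketch of reading cloud-hosted Parquet data with the Arrow filesystem (the bucket and path are hypothetical):

import dask.dataframe as dd

# With filesystem="arrow", the bucket's cloud region is now inferred correctly
df = dd.read_parquet(
    "s3://my-bucket/dataset/",
    filesystem="arrow",
)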

See GH#10590 from Richard (Rick) Zamora for details.

Improve Type Reconciliation in P2P Shuffling

See GH#8332 from Hendrik Makait for details.

Additional changes

2023.10.1

Released on October 27, 2023

Highlights

Python 3.12

This release adds official support for Python 3.12.

See GH#10544 and GH#8223 from Thomas Grainger for details.

Additional changes

2023.10.0

Released on October 13, 2023

Highlights

Reduced memory pressure for multi array reductions

This release contains major updates to Dask’s task graph scheduling logic. The updates here significantly reduce memory pressure on array reductions. We anticipate this will have a strong impact on the array computing community.

See GH#10535 from Florian Jetter for details.

Improved P2P shuffling robustness

There are several updates (listed below) that make P2P shuffling much more robust and less likely to fail.

See GH#8262, GH#8264, GH#8242, GH#8244, and GH#8235 from Hendrik Makait and GH#8124 from Charles Blackmon-Luca for details.

Reduced scheduler CPU load for large graphs

Users should see reduced CPU load on their scheduler when computing large task graphs.

See GH#8238 and GH#10547 from Florian Jetter and GH#8240 from crusaderky for details.

Additional changes

2023.9.3

Released on September 29, 2023

Highlights

Restore previous configuration override behavior

The 2023.9.2 release introduced an unintentional breaking change in how configuration options are overridden in dask.config.get with the override_with= keyword (see GH#10519). This release restores the previous behavior.

See GH#10521 from crusaderky for details.

Complex dtypes in Dask Array reductions

This release includes improved support for using common reductions in Dask Array (e.g. var, std, moment) with complex dtypes.

See GH#10009 from wkrasnicki for details.

Additional changes

2023.9.2

Released on September 15, 2023

Highlights

P2P shuffling now raises when outdated PyArrow is installed

Previously the default shuffling method would silently fall back from P2P to task-based shuffling if an older version of pyarrow was installed. Now we raise an informative error with the minimum required pyarrow version for P2P instead of silently falling back.

See GH#10496 from Hendrik Makait for details.

Deprecation cycle for admin.traceback.shorten

The 2023.9.0 release modified the admin.traceback.shorten configuration option without introducing a deprecation cycle. This resulted in failures to create Dask clusters in some cases. This release introduces a deprecation cycle for this configuration change.

See GH#10509 from crusaderky for details.

Additional changes

2023.9.1

Released on September 6, 2023

Note

This is a hotfix release that fixes a P2P shuffling bug introduced in the 2023.9.0 release (see GH#10493).

Enhancements

Bug Fixes

Maintenance

2023.9.0

Released on September 1, 2023

Bug Fixes

Documentation

Maintenance

2023.8.1

Released on August 18, 2023

Enhancements

Bug Fixes

  • Fix ValueError when running to_csv in append mode with single_file as True (GH#10441) Ben
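
A sketch of the now-working pattern described above (the file name is hypothetical):

import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=2)

df.to_csv("out.csv", single_file=True)                          # initial write
df.to_csv("out.csv", single_file=True, mode="a", header=False)  # append; previously raised ValueError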

Maintenance

2023.8.0

Released on August 4, 2023

Enhancements

Documentation

Maintenance

2023.7.1

Released on July 20, 2023

Note

This release updates Dask DataFrame to automatically convert text data using object data types to string[pyarrow] if pandas>=2 and pyarrow>=12 are installed.

This should result in significantly reduced memory consumption and increased computation performance in many workflows that deal with text data.

You can disable this change by setting the dataframe.convert-string configuration value to False with

dask.config.set({"dataframe.convert-string": False})

Enhancements

Bug Fixes

2023.7.0

Released on July 7, 2023

Enhancements

Bug Fixes

Documentation

  • Add clarification about output shape and reshaping in rechunk documentation (GH#10377) Swayam Patil

Maintenance

2023.6.1

Released on June 26, 2023

Enhancements

Bug Fixes

Deprecations

Maintenance

2023.6.0

Released on June 9, 2023

Enhancements

Bug Fixes

Documentation

Maintenance

2023.5.1

Released on May 26, 2023

Note

This release drops support for Python 3.8. As of this release Dask supports Python 3.9, 3.10, and 3.11. See this community issue for more details.

Enhancements

Bug Fixes

Documentation

Maintenance

2023.5.0

Released on May 12, 2023

Enhancements

Bug Fixes

Documentation

Maintenance

2023.4.1

Released on April 28, 2023

Enhancements

Bug Fixes

Documentation

Maintenance

2023.4.0

Released on April 14, 2023

Enhancements

Bug Fixes

Deprecations

Documentation

Maintenance

2023.3.2

Released on March 24, 2023

Enhancements

Bug Fixes

Documentation

Maintenance

2023.3.1

Released on March 10, 2023

Enhancements

Bug Fixes

Documentation

Maintenance

2023.3.0

Released on March 1, 2023

Bug Fixes

Documentation

Maintenance

2023.2.1

Released on February 24, 2023

Note

This release changes the default DataFrame shuffle algorithm to p2p to improve stability and performance. Learn more here and please provide any feedback on this discussion.

If you encounter issues with this new algorithm, please see the documentation for more information, and how to switch back to the old mode.

Enhancements

Bug Fixes

Documentation

Maintenance

2023.2.0

Released on February 10, 2023

Enhancements

Bug Fixes

Documentation

Maintenance

2023.1.1

Released on January 27, 2023

Enhancements

Bug Fixes

Documentation

Maintenance

2023.1.0

Released on January 13, 2023

Enhancements

Documentation

Maintenance

2022.12.1

Released on December 16, 2022

Enhancements

Bug Fixes

Documentation

Maintenance

2022.12.0

Released on December 2, 2022

Enhancements

Bug Fixes

Maintenance

2022.11.1

Released on November 18, 2022

Enhancements

Maintenance

2022.11.0

Released on November 15, 2022

Enhancements

Bug Fixes

Documentation

Maintenance

2022.10.2

Released on October 31, 2022

This was a hotfix and has no changes in this repository. The necessary fix was in dask/distributed, but we decided to bump this version number for consistency.

2022.10.1

Released on October 28, 2022

Enhancements

Bug Fixes

Documentation

Maintenance

2022.10.0

Released on October 14, 2022

New Features

Enhancements

Bug Fixes

Documentation

Maintenance

2022.9.2

Released on September 30, 2022

Enhancements

Documentation

Maintenance

2022.9.1

Released on September 16, 2022

New Features

Enhancements

Bug Fixes

Deprecations

  • Allow split_out to be None, which then defaults to 1 in groupby().aggregate() (GH#9491) Ian Rose

Documentation

Maintenance

2022.9.0

Released on September 2, 2022

Enhancements

Bug Fixes

Documentation

Maintenance

2022.8.1

Released on August 19, 2022

New Features

Enhancements

Bug Fixes

Documentation

Maintenance

2022.8.0

Released on August 5, 2022

Enhancements

Bug Fixes

Documentation

Maintenance

2022.7.1

Released on July 22, 2022

Enhancements

Bug Fixes

Documentation

Maintenance

2022.7.0

Released on July 8, 2022

Enhancements

Bug Fixes

Documentation

Maintenance

2022.6.1

Released on June 24, 2022

Enhancements

Bug Fixes

Deprecations

Documentation

Maintenance

2022.6.0

Released on June 10, 2022

Enhancements

Bug Fixes

Documentation

Maintenance

2022.05.2

Released on May 26, 2022

Enhancements

Documentation

Maintenance

2022.05.1

Released on May 24, 2022

New Features

Enhancements

Bug Fixes

Deprecations

Documentation

Maintenance

2022.05.0

Released on May 2, 2022

Highlights

This is a bugfix release for this issue.

Documentation

2022.04.2

Released on April 29, 2022

Highlights

This release includes several deprecations/breaking API changes to dask.dataframe.read_parquet and dask.dataframe.to_parquet:

  • to_parquet no longer writes _metadata files by default. If you want to write a _metadata file, you can pass in write_metadata_file=True.

  • read_parquet now defaults to split_row_groups=False, which results in one Dask dataframe partition per parquet file when reading in a parquet dataset. If you’re working with large parquet files you may need to set split_row_groups=True to reduce your partition size.

  • read_parquet no longer calculates divisions by default. If you require read_parquet to return dataframes with known divisions, please set calculate_divisions=True.

  • read_parquet has deprecated the gather_statistics keyword argument. Please use the calculate_divisions keyword argument instead.

  • read_parquet has deprecated the require_extensions keyword argument. Please use the parquet_file_extension keyword argument instead.
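
Putting these together, a hedged sketch of how a call relying on the old defaults might be updated (paths are hypothetical):

import dask.dataframe as dd

df = dd.read_parquet(
    "s3://bucket/dataset/",
    split_row_groups=True,       # opt back in to smaller, row-group-sized partitions
    calculate_divisions=True,    # replaces the deprecated gather_statistics=True
)

df.to_parquet(
    "s3://bucket/output/",
    write_metadata_file=True,    # _metadata is no longer written by default
)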

New Features

Enhancements

Bug Fixes

Deprecations

Documentation

Maintenance

2022.04.1

Released on April 15, 2022

New Features

  • Add missing NumPy ufuncs: abs, left_shift, right_shift, positive. (GH#8920) Tom White
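
A short sketch of the newly added ufuncs on Dask arrays:

import dask.array as da

x = da.arange(8, chunks=4)

da.abs(-x).compute()           # array([0, 1, 2, 3, 4, 5, 6, 7])
da.left_shift(x, 1).compute()  # array([ 0,  2,  4,  6,  8, 10, 12, 14])
da.right_shift(x, 1).compute() # array([0, 0, 1, 1, 2, 2, 3, 3])
da.positive(x).compute()       # array([0, 1, 2, 3, 4, 5, 6, 7])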

Enhancements

Bug Fixes

Deprecations

Documentation

Maintenance

2022.04.0

Released on April 1, 2022

Note

This is the first release with support for Python 3.10

New Features

Enhancements

Bug Fixes

Deprecations

Documentation

Maintenance

2022.03.0

Released on March 18, 2022

New Features

Enhancements

Bug Fixes

Deprecations

Documentation

Maintenance

2022.02.1

Released on February 25, 2022

New Features

  • Add aggregate functions first and last to dask.dataframe.pivot_table, sketched after this list (GH#8649) Knut Nordanger

  • Add std() support for datetime64 dtype for pandas-like objects (GH#8523) Ben Glossner

  • Add materialized task counts to HighLevelGraph and Layer html reprs (GH#8589) kori73
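
A brief sketch of the new first aggregation in pivot_table (the frame and column names are hypothetical):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({
    "store": ["a", "a", "b", "b"],
    "product": ["x", "y", "x", "y"],
    "price": [1.0, 2.0, 3.0, 4.0],
})
df = dd.from_pandas(pdf, npartitions=2)

# pivot_table requires the "columns" column to be categorical
df = df.categorize(columns=["product"])
table = dd.pivot_table(df, index="store", columns="product", values="price", aggfunc="first")
table.compute()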

Enhancements

Bug Fixes

Deprecations

Documentation

Maintenance

2022.02.0

Released on February 11, 2022

Note

This is the last release with support for Python 3.7

New Features

Enhancements

Bug Fixes

Deprecations

Documentation

Maintenance

2022.01.1

Released on January 28, 2022

New Features

Enhancements

Bug Fixes

Deprecations

Documentation

Maintenance

2022.01.0

Released on January 14, 2022

New Features

Enhancements

Bug Fixes

Deprecations

Documentation

Maintenance

2021.12.0

Released on December 10, 2021

New Features

Enhancements

Bug Fixes

Deprecations

Documentation

Maintenance

2021.11.2

Released on November 19, 2021

2021.11.1

Released on November 8, 2021

Patch release to update distributed dependency to version 2021.11.1.

2021.11.0

Released on November 5, 2021

2021.10.0

Released on October 22, 2021

2021.09.1

Released on September 21, 2021

2021.09.0

Released on September 3, 2021

2021.08.1

Released on August 20, 2021

2021.08.0

Released on August 13, 2021

2021.07.2

Released on July 30, 2021

Note

This is the last release with support for NumPy 1.17 and pandas 0.25. Beginning with the next release, NumPy 1.18 and pandas 1.0 will be the minimum supported versions.

2021.07.1

Released on July 23, 2021

2021.07.0

Released on July 9, 2021

2021.06.2

Released on June 22, 2021

2021.06.1

Released on June 18, 2021

2021.06.0

Released on June 4, 2021

2021.05.1

Released on May 28, 2021

2021.05.0

Released on May 14, 2021

2021.04.1

Released on April 23, 2021

2021.04.0

Released on April 2, 2021

2021.03.1

Released on March 26, 2021

2021.03.0

Released on March 5, 2021

Note

This is the first release with support for Python 3.9 and the last release with support for Python 3.6

2021.02.0

Released on February 5, 2021

2021.01.1

Released on January 22, 2021

2021.01.0

Released on January 15, 2021

2020.12.0

Released on December 10, 2020

Highlights

  • Switched to CalVer for versioning scheme.

  • Introduced new APIs for HighLevelGraph to enable sending high-level representations of task graphs to the distributed scheduler.

  • Introduced new HighLevelGraph layer objects including BasicLayer, Blockwise, BlockwiseIO, ShuffleLayer, and more.

  • Added support for applying custom Layer-level annotations like priority, retries, etc. with the dask.annotate context manager (see the sketch after this list).

  • Updated minimum supported version of pandas to 0.25.0 and NumPy to 1.15.1.

  • Added support for the pyarrow.dataset API in read_parquet.

  • Several fixes to Dask Array’s SVD.
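
A minimal sketch of layer-level annotations with the dask.annotate context manager (the delayed function below is hypothetical):

import dask
from dask import delayed

@delayed
def load(i):
    return i * 2

with dask.annotate(priority=10, retries=3):
    # Tasks created inside this block carry the annotations on their graph layer
    parts = [load(i) for i in range(4)]

total = delayed(sum)(parts)
total.compute()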

All changes

2.30.0 / 2020-10-06

Array

2.29.0 / 2020-10-02

Array

Bag

Core

DataFrame

Documentation

2.28.0 / 2020-09-25

Array

Core

DataFrame

2.27.0 / 2020-09-18

Array

Core

DataFrame

Documentation

2.26.0 / 2020-09-11

Array

Core

DataFrame

Documentation

2.25.0 / 2020-08-28

Core

DataFrame

Documentation