dask.bag.Bag.fold

dask.bag.Bag.fold

Bag.fold(binop, combine=None, initial=_NoDefault.no_default, split_every=None, out_type=<class 'dask.bag.core.Item'>)[source]

Parallelizable reduction

Fold is like the builtin function reduce except that it works in parallel. Fold takes two binary operator functions, one to reduce each partition of our dataset and another to combine results between partitions

  1. binop: Binary operator to reduce within each partition

  2. combine: Binary operator to combine results from binop

Sequentially this would look like the following:

>>> intermediates = [reduce(binop, part) for part in partitions]  
>>> final = reduce(combine, intermediates)  

If only one function is given then it is used for both functions binop and combine as in the following example to compute the sum:

>>> def add(x, y):
...     return x + y
>>> import dask.bag as db
>>> b = db.from_sequence(range(5))
>>> b.fold(add).compute()
10

In full form we provide both binary operators as well as their default arguments

>>> b.fold(binop=add, combine=add, initial=0).compute()
10

More complex binary operators are also doable

>>> def add_to_set(acc, x):
...     ''' Add new element x to set acc '''
...     return acc | set([x])
>>> b.fold(add_to_set, set.union, initial=set()).compute()
{0, 1, 2, 3, 4}

See also

Bag.foldby