dask.bag.Bag.fold
dask.bag.Bag.fold¶
- Bag.fold(binop, combine=None, initial=_NoDefault.no_default, split_every=None, out_type=<class 'dask.bag.core.Item'>)[source]¶
Parallelizable reduction
Fold is like the builtin function
reduce
except that it works in parallel. Fold takes two binary operator functions, one to reduce each partition of our dataset and another to combine results between partitionsbinop
: Binary operator to reduce within each partitioncombine
: Binary operator to combine results from binop
Sequentially this would look like the following:
>>> intermediates = [reduce(binop, part) for part in partitions] >>> final = reduce(combine, intermediates)
If only one function is given then it is used for both functions
binop
andcombine
as in the following example to compute the sum:>>> def add(x, y): ... return x + y
>>> import dask.bag as db >>> b = db.from_sequence(range(5)) >>> b.fold(add).compute() 10
In full form we provide both binary operators as well as their default arguments
>>> b.fold(binop=add, combine=add, initial=0).compute() 10
More complex binary operators are also doable
>>> def add_to_set(acc, x): ... ''' Add new element x to set acc ''' ... return acc | set([x]) >>> b.fold(add_to_set, set.union, initial=set()).compute() {0, 1, 2, 3, 4}
See also