dask.bag.Bag

class dask.bag.Bag(dsk: Graph, name: str, npartitions: int)

Parallel collection of Python objects

Examples

Create Bag from sequence

>>> import dask.bag as db
>>> b = db.from_sequence(range(5))
>>> list(b.filter(lambda x: x % 2 == 0).map(lambda x: x * 10))
[0, 20, 40]

Create Bag from filename or globstring of filenames

>>> import json
>>> b = db.read_text('/path/to/mydata.*.json.gz').map(json.loads)

Create manually (expert use)

>>> dsk = {('x', 0): (range, 5),
...        ('x', 1): (range, 5),
...        ('x', 2): (range, 5)}
>>> b = db.Bag(dsk, 'x', npartitions=3)
>>> sorted(b.map(lambda x: x * 10))
[0, 0, 0, 10, 10, 10, 20, 20, 20, 30, 30, 30, 40, 40, 40]
>>> int(b.fold(lambda x, y: x + y))
30
__init__(dsk: Graph, name: str, npartitions: int)

Methods

__init__(dsk, name, npartitions)

accumulate(binop[, initial])

Repeatedly apply binary function to a sequence, accumulating results.

all([split_every])

Are all elements truthy?

any([split_every])

Are any of the elements truthy?

compute(**kwargs)

Compute this dask collection

count([split_every])

Count the number of elements.

distinct([key])

Distinct elements of collection

filter(predicate)

Filter elements in collection by a predicate function.

flatten()

Concatenate nested lists into one long list.

fold(binop[, combine, initial, split_every, ...])

Parallelizable reduction

foldby(key, binop[, initial, combine, ...])

Combined reduction and groupby.

frequencies([split_every, sort])

Count number of occurrences of each distinct element.

groupby(grouper[, method, npartitions, ...])

Group collection by key function

join(other, on_self[, on_other])

Joins collection with another collection.

map(func, *args, **kwargs)

Apply a function elementwise across one or more bags.

map_partitions(func, *args, **kwargs)

Apply a function to every partition across one or more bags.

max([split_every])

Maximum element

mean()

Arithmetic mean

min([split_every])

Minimum element

persist(**kwargs)

Persist this dask collection into memory

pluck(key[, default])

Select item from all tuples/dicts in collection.

product(other)

Cartesian product between two bags.

random_sample(prob[, random_state])

Return elements from bag with probability of prob.

reduction(perpartition, aggregate[, ...])

Reduce collection with reduction operators.

remove(predicate)

Remove elements in collection that match predicate.

repartition([npartitions, partition_size])

Repartition Bag across new divisions.

starmap(func, **kwargs)

Apply a function using argument tuples from the given bag.

std([ddof])

Standard deviation

sum([split_every])

Sum all elements

take(k[, npartitions, compute, warn])

Take the first k elements.

to_avro(filename, schema[, name_function, ...])

Write bag to set of avro files

to_dataframe([meta, columns, optimize_graph])

Create Dask Dataframe from a Dask Bag.

to_delayed([optimize_graph])

Convert into a list of dask.delayed objects, one per partition.

to_textfiles(path[, name_function, ...])

Write dask Bag to disk, one filename per partition, one line per element.

topk(k[, key, split_every])

K largest elements in collection

unzip(n)

Transform a bag of tuples to n bags of their elements.

var([ddof])

Variance

visualize([filename, format, optimize_graph])

Render the computation of this object's task graph using graphviz.
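The methods above build a lazy task graph and only execute when a concrete result is requested, for example via compute() or by iterating the bag. As a rough sketch of how a few of them compose (the records, field names, and values below are made up for illustration and are not part of the class docstring):

>>> # illustrative records; names and amounts are made up
>>> b = db.from_sequence([{'name': 'Alice', 'amount': 100},
...                       {'name': 'Bob', 'amount': 200},
...                       {'name': 'Alice', 'amount': 300}],
...                      npartitions=2)
>>> sorted(b.pluck('name').frequencies())
[('Alice', 2), ('Bob', 1)]
>>> sorted(b.foldby(key=lambda d: d['name'],
...                 binop=lambda total, d: total + d['amount'],
...                 initial=0,
...                 combine=lambda a, b: a + b))
[('Alice', 400), ('Bob', 200)]
>>> list(b.pluck('amount').topk(2))
[300, 200]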

Attributes

str

String processing functions
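
The str accessor applies string operations to each element of a bag of strings, so common transformations do not need explicit map calls. A small sketch, assuming the accessor forwards standard str methods such as lower and provides fnmatch-style match (the sample strings are made up for illustration):

>>> # illustrative data, not from the docstring
>>> b = db.from_sequence(['Alice Smith', 'Bob Jones', 'Charlie Smith'])
>>> list(b.str.lower())
['alice smith', 'bob jones', 'charlie smith']
>>> list(b.str.match('*Smith'))
['Alice Smith', 'Charlie Smith']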