API

Create Bags

from_sequence(seq[, partition_size, npartitions])

Create a dask Bag from a Python sequence.

from_delayed(values)

Create a bag from many dask Delayed objects.

from_url(urls)

Create a dask Bag from one or more URLs.

range(n, npartitions)

Numbers from zero up to (but not including) n

read_text(urlpath[, blocksize, compression, ...])

Read lines from text files

read_avro(urlpath[, blocksize, ...])

Read a set of Avro files
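For illustration, a minimal sketch of the in-memory constructors above (assuming dask is installed; file-based readers such as read_text and read_avro need data on disk and are omitted):

```python
import dask.bag as db

# from_sequence: build a bag from an in-memory iterable, split into 2 partitions.
b = db.from_sequence([1, 2, 3, 4, 5, 6], npartitions=2)
print(b.compute())  # [1, 2, 3, 4, 5, 6]

# range: numbers 0..n-1, like the built-in range, spread over partitions.
r = db.range(5, npartitions=2)
print(r.compute())  # [0, 1, 2, 3, 4]
```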

From dataframe

DataFrame.to_bag([index, format])

Create a Dask Bag from a Dask DataFrame

Series.to_bag([index, format])

Create a Dask Bag from a Series

Top-level functions

concat(bags)

Concatenate many bags together, unioning all elements.

map(func, *args, **kwargs)

Apply a function elementwise across one or more bags.

map_partitions(func, *args, **kwargs)

Apply a function to every partition across one or more bags.

to_textfiles(b, path[, name_function, ...])

Write dask Bag to disk, one filename per partition, one line per element.

zip(*bags)

Partition-wise bag zip
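A brief sketch of the top-level combinators, assuming dask is installed. Note that zip and the multi-bag form of map require the bags to be partitioned the same way:

```python
import dask.bag as db

a = db.from_sequence([1, 2], npartitions=1)
b = db.from_sequence([3, 4], npartitions=1)

# concat: union all elements of the input bags into one bag.
print(db.concat([a, b]).compute())  # [1, 2, 3, 4]

# zip: pair up elements partition-wise; bags must align.
print(db.zip(a, b).compute())  # [(1, 3), (2, 4)]

# map: apply a function elementwise across one or more bags.
print(db.map(lambda x, y: x + y, a, b).compute())  # [4, 6]
```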

Random Sampling

random.choices(population[, k, split_every])

Return a list of k elements chosen with replacement.

random.sample(population, k[, split_every])

Choose k unique random elements from the bag.
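A short sketch of both sampling functions, assuming dask is installed. sample draws without replacement, so its result has no duplicates; choices may repeat elements:

```python
import dask.bag as db
from dask.bag import random as dbr

population = db.from_sequence(range(10), npartitions=2)

# sample: k unique elements, drawn without replacement.
s = dbr.sample(population, k=3).compute()
print(len(s))  # 3

# choices: k elements drawn with replacement, so duplicates may appear.
c = dbr.choices(population, k=5).compute()
print(len(c))  # 5
```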

Turn Bags into other things

Bag.to_textfiles(path[, name_function, ...])

Write dask Bag to disk, one filename per partition, one line per element.

Bag.to_dataframe([meta, columns, optimize_graph])

Create a Dask DataFrame from a Dask Bag.

Bag.to_delayed([optimize_graph])

Convert into a list of dask.delayed objects, one per partition.

Bag.to_avro(filename, schema[, ...])

Write the bag to a set of Avro files

Bag Methods

Bag(dsk, name, npartitions)

Parallel collection of Python objects

Bag.accumulate(binop[, initial])

Repeatedly apply binary function to a sequence, accumulating results.

Bag.all([split_every])

Are all elements truthy?

Bag.any([split_every])

Are any of the elements truthy?

Bag.compute(**kwargs)

Compute this dask collection

Bag.count([split_every])

Count the number of elements.

Bag.distinct([key])

Distinct elements of collection

Bag.filter(predicate)

Filter elements in collection by a predicate function.

Bag.flatten()

Concatenate nested lists into one long list.

Bag.fold(binop[, combine, initial, ...])

Parallelizable reduction

Bag.foldby(key, binop[, initial, combine, ...])

Combined reduction and groupby.

Bag.frequencies([split_every, sort])

Count number of occurrences of each distinct element.

Bag.groupby(grouper[, method, npartitions, ...])

Group collection by key function

Bag.join(other, on_self[, on_other])

Joins collection with another collection.

Bag.map(func, *args, **kwargs)

Apply a function elementwise across one or more bags.

Bag.map_partitions(func, *args, **kwargs)

Apply a function to every partition across one or more bags.

Bag.max([split_every])

Maximum element

Bag.mean()

Arithmetic mean

Bag.min([split_every])

Minimum element

Bag.persist(**kwargs)

Persist this dask collection into memory

Bag.pluck(key[, default])

Select item from all tuples/dicts in collection.

Bag.product(other)

Cartesian product between two bags.

Bag.reduction(perpartition, aggregate[, ...])

Reduce collection with reduction operators.

Bag.random_sample(prob[, random_state])

Return elements from the bag, each kept with probability prob.

Bag.remove(predicate)

Remove elements in collection that match predicate.

Bag.repartition([npartitions, partition_size])

Repartition Bag across new divisions.

Bag.starmap(func, **kwargs)

Apply a function using argument tuples from the given bag.

Bag.std([ddof])

Standard deviation

Bag.sum([split_every])

Sum all elements

Bag.take(k[, npartitions, compute, warn])

Take the first k elements.

Bag.to_avro(filename, schema[, ...])

Write the bag to a set of Avro files

Bag.to_dataframe([meta, columns, optimize_graph])

Create a Dask DataFrame from a Dask Bag.

Bag.to_delayed([optimize_graph])

Convert into a list of dask.delayed objects, one per partition.

Bag.to_textfiles(path[, name_function, ...])

Write dask Bag to disk, one filename per partition, one line per element.

Bag.topk(k[, key, split_every])

K largest elements in collection

Bag.var([ddof])

Variance

Bag.visualize([filename, format, optimize_graph])

Render the computation of this object's task graph using graphviz.
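To show how these methods chain together, a small sketch assuming dask is installed. map and filter run elementwise, frequencies counts distinct values, and foldby fuses a groupby with a reduction (its binop folds elements within a group; for an associative binop like addition it also serves as the combine step):

```python
import dask.bag as db

b = db.from_sequence(range(10), npartitions=2)

# Elementwise map and filter, then a full reduction.
total = b.map(lambda x: x * 2).filter(lambda x: x > 5).sum().compute()
print(total)  # 84

# frequencies: count occurrences of each distinct element.
freqs = dict(db.from_sequence("abba").frequencies().compute())
print(freqs)  # {'a': 2, 'b': 2}

# foldby: group by a key function and reduce each group in one pass.
is_even = lambda x: x % 2 == 0
sums = dict(b.foldby(is_even, lambda acc, x: acc + x, 0).compute())
print(sums)  # {False: 25, True: 20}
```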

Item Methods

Item(dsk, key[, layer])

Item.apply(func)

Item.compute(**kwargs)

Compute this dask collection

Item.from_delayed(value)

Create a bag Item from a dask.delayed value.

Item.persist(**kwargs)

Persist this dask collection into memory

Item.to_delayed([optimize_graph])

Convert into a dask.delayed object.

Item.visualize([filename, format, ...])

Render the computation of this object's task graph using graphviz.
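Items usually arise from full-bag reductions rather than direct construction: methods like Bag.sum or Bag.fold return a lazy scalar wrapped in an Item. A brief sketch, assuming dask is installed:

```python
import dask.bag as db
from dask.bag import Item

b = db.from_sequence([1, 2, 3], npartitions=1)

# Reductions over the whole bag return an Item: a lazy single value.
total = b.sum()
print(isinstance(total, Item))  # True
print(total.compute())  # 6

# Item.apply maps a function over the wrapped value, still lazily.
print(total.apply(lambda x: x * 10).compute())  # 60
```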