Bag.join(other, on_self, on_other=None)

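Joins this collection with another collection. Each match is returned as a tuple of the form (element from other, element from this collection).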

Other collection must be one of the following:

  1. An iterable. We recommend tuples over lists for internal performance reasons.

  2. A delayed object, pointing to a tuple. This is recommended if the other collection is sizable and you’re using the distributed scheduler. Dask is able to pass around data wrapped in delayed objects with greater sophistication.

  3. A Bag with a single partition.
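
The three accepted forms of the other collection listed above can be sketched as follows. This is a minimal illustration, not a recommendation for any particular workload: the variable names are made up for the example, and `scheduler="synchronous"` is passed only to keep the computation in a single process. Results are sorted so that the comparison does not depend on partition order.

```python
import dask
import dask.bag as db

people = db.from_sequence(['Alice', 'Bob', 'Charlie'])
fruit = ('Apple', 'Apricot', 'Banana')  # a tuple rather than a list


def first_letter(x):
    # Key function shared by both collections
    return x[0]


# 1. A plain iterable
from_iterable = people.join(fruit, first_letter)

# 2. A delayed object pointing to a tuple
from_delayed = people.join(dask.delayed(fruit), first_letter)

# 3. A Bag with a single partition
fruit_bag = db.from_sequence(fruit, npartitions=1)
from_bag = people.join(fruit_bag, first_letter)

# Compute each join in-process (illustrative scheduler choice)
results = [
    sorted(bag.compute(scheduler="synchronous"))
    for bag in (from_iterable, from_delayed, from_bag)
]
```

All three joins should produce the same pairs; only the way the other collection reaches the workers differs.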

You might also consider Dask Dataframe, whose join operations are much more heavily optimized.

other: Iterable, Delayed, Bag

Other collection on which to join

on_self: callable

Function to call on elements in this collection to determine a match

on_other: callable (defaults to on_self)

Function to call on elements in the other collection to determine a match


>>> import dask.bag as db
>>> people = db.from_sequence(['Alice', 'Bob', 'Charlie'])
>>> fruit = ['Apple', 'Apricot', 'Banana']
>>> list(people.join(fruit, lambda x: x[0]))
[('Apple', 'Alice'), ('Apricot', 'Alice'), ('Banana', 'Bob')]
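
To make the matching rule and the tuple order in the example above concrete, here is a pure-Python sketch of the same semantics. It is an illustration only, not Dask's implementation: `join_sketch` is a hypothetical helper, and it assumes the join pairs every element of the other collection with every element of this collection whose keys are equal. The second call shows `on_other` differing from `on_self`.

```python
from collections import defaultdict


def join_sketch(self_items, other_items, on_self, on_other=None):
    """Hypothetical helper: pair elements with equal keys as (other, self)."""
    if on_other is None:
        on_other = on_self  # same default as the on_other parameter

    # Index the other collection by its key
    index = defaultdict(list)
    for item in other_items:
        index[on_other(item)].append(item)

    # Walk this collection and emit one tuple per matching element
    return [
        (match, item)
        for item in self_items
        for match in index.get(on_self(item), [])
    ]


people = ['Alice', 'Bob', 'Charlie']
fruit = ['Apple', 'Apricot', 'Banana']

# Same key function on both sides: first letter
same_key = join_sketch(people, fruit, lambda x: x[0])
# [('Apple', 'Alice'), ('Apricot', 'Alice'), ('Banana', 'Bob')]

# Different key functions: a name's length matched against plain integers
lengths = [3, 5]
by_length = join_sketch(people, lengths, on_self=len, on_other=lambda n: n)
# [(5, 'Alice'), (3, 'Bob')]
```

Elements with no counterpart, such as 'Charlie' in both calls, do not appear in the result.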