dask.bag.Bag.join
- Bag.join(other, on_self, on_other=None)
Joins this collection with another collection.
The other collection must be one of the following:
- An iterable. We recommend tuples over lists for internal performance reasons.
- A delayed object, pointing to a tuple. This is recommended if the other collection is sizable and you’re using the distributed scheduler. Dask is able to pass around data wrapped in delayed objects with greater sophistication.
- A Bag with a single partition.
You might also consider Dask Dataframe, whose join operations are much more heavily optimized.
- Parameters
- other: Iterable, Delayed, Bag
Other collection on which to join
- on_self: callable
Function to call on elements in this collection to determine a match
- on_other: callable (defaults to on_self)
Function to call on elements in the other collection to determine a match
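The matching rule described by these parameters can be illustrated with a small plain-Python sketch. This is not Dask's implementation; `join_sketch` is a hypothetical helper that only mirrors the behavior visible in the documented example: elements pair up when `on_other(other_element) == on_self(self_element)`, and each result tuple lists the element from the other collection first.

```python
from collections import defaultdict


def join_sketch(self_items, other, on_self, on_other=None):
    """Hypothetical helper (not part of Dask) that mimics the join semantics."""
    if on_other is None:
        # Same default as the documented signature: reuse on_self.
        on_other = on_self

    # Index the other collection by its join key.
    index = defaultdict(list)
    for item in other:
        index[on_other(item)].append(item)

    # For each element of this collection, emit one pair per matching element.
    return [
        (other_item, self_item)
        for self_item in self_items
        for other_item in index.get(on_self(self_item), [])
    ]


people = ["Alice", "Bob", "Charlie"]
fruit = ("Apple", "Apricot", "Banana")

# Same key function on both sides: match on the first letter.
same_key = join_sketch(people, fruit, lambda x: x[0])

# Different key functions: match a name's length against a plain integer.
by_length = join_sketch(people, (3, 5), on_self=len, on_other=lambda n: n)
```

Elements with no match on the other side, such as 'Charlie' here, do not appear in the result.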
Examples
>>> import dask.bag as db
>>> people = db.from_sequence(['Alice', 'Bob', 'Charlie'])
>>> fruit = ['Apple', 'Apricot', 'Banana']
>>> list(people.join(fruit, lambda x: x[0]))
[('Apple', 'Alice'), ('Apricot', 'Alice'), ('Banana', 'Bob')]
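A further sketch, assuming Dask is installed, shows the other collection passed as a delayed object pointing to a tuple, together with a separate `on_other` function. The lowercase fruit names and the case-insensitive key functions are illustrative choices, not part of the original example. The synchronous scheduler is requested only to keep the example deterministic and free of worker processes.

```python
import dask
import dask.bag as db

people = db.from_sequence(["Alice", "Bob", "Charlie"])

# Wrap the other collection in a delayed object that points to a tuple.
fruit = dask.delayed(("apple", "apricot", "banana"))

joined = people.join(
    fruit,
    on_self=lambda name: name[0].lower(),  # key for elements of `people`
    on_other=lambda item: item[0],         # key for elements of `fruit`
)

# Run in the calling thread rather than in the default worker pool.
result = joined.compute(scheduler="synchronous")
print(result)
```

Each tuple in `result` holds the element from `fruit` first and the element from `people` second, as in the example above.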