dask.dataframe.DataFrame.merge

dask.dataframe.DataFrame.merge

DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, suffixes=('_x', '_y'), indicator=False, npartitions=None, shuffle_method=None, broadcast=None)[source]

Merge the DataFrame with another DataFrame

This will merge the two datasets, either on the indices, a certain column in each dataset or the index in one dataset and the column in another.

Parameters
right: dask.dataframe.DataFrame
how{‘left’, ‘right’, ‘outer’, ‘inner’}, default: ‘inner’

How to handle the operation of the two objects:

  • left: use calling frame’s index (or column if on is specified)

  • right: use other frame’s index

  • outer: form union of calling frame’s index (or column if on is specified) with other frame’s index, and sort it lexicographically

  • inner: form intersection of calling frame’s index (or column if on is specified) with other frame’s index, preserving the order of the calling’s one

onlabel or list

Column or index level names to join on. These must be found in both DataFrames. If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.

left_onlabel or list, or array-like

Column to join on in the left DataFrame. Other than in pandas arrays and lists are only support if their length is 1.

right_onlabel or list, or array-like

Column to join on in the right DataFrame. Other than in pandas arrays and lists are only support if their length is 1.

left_indexboolean, default False

Use the index from the left DataFrame as the join key.

right_indexboolean, default False

Use the index from the right DataFrame as the join key.

suffixes2-length sequence (tuple, list, …)

Suffix to apply to overlapping column names in the left and right side, respectively

indicatorboolean or string, default False

If True, adds a column to output DataFrame called “_merge” with information on the source of each row. If string, column with information on source of each row will be added to output DataFrame, and column will be named value of string. Information column is Categorical-type and takes on a value of “left_only” for observations whose merge key only appears in left DataFrame, “right_only” for observations whose merge key only appears in right DataFrame, and “both” if the observation’s merge key is found in both.

npartitions: int or None, optional

The ideal number of output partitions. This is only utilised when performing a hash_join (merging on columns only). If None then npartitions = max(lhs.npartitions, rhs.npartitions). Default is None.

shuffle_method: {‘disk’, ‘tasks’, ‘p2p’}, optional

Either 'disk' for single-node operation or 'tasks' and 'p2p'` for distributed operation. Will be inferred by your current scheduler.

broadcast: boolean or float, optional

Whether to use a broadcast-based join in lieu of a shuffle-based join for supported cases. By default, a simple heuristic will be used to select the underlying algorithm. If a floating-point value is specified, that number will be used as the broadcast_bias within the simple heuristic (a large number makes Dask more likely to choose the broacast_join code path). See broadcast_join for more information.

Notes

There are three ways to join dataframes:

  1. Joining on indices. In this case the divisions are aligned using the function dask.dataframe.multi.align_partitions. Afterwards, each partition is merged with the pandas merge function.

  2. Joining one on index and one on column. In this case the divisions of dataframe merged by index (\(d_i\)) are used to divide the column merged dataframe (\(d_c\)) one using dask.dataframe.multi.rearrange_by_divisions. In this case the merged dataframe (\(d_m\)) has the exact same divisions as (\(d_i\)). This can lead to issues if you merge multiple rows from (\(d_c\)) to one row in (\(d_i\)).

  3. Joining both on columns. In this case a hash join is performed using dask.dataframe.multi.hash_join.

In some cases, you may see a MemoryError if the merge operation requires an internal shuffle, because shuffling places all rows that have the same index in the same partition. To avoid this error, make sure all rows with the same on-column value can fit on a single partition.