dask.dataframe.DataFrame.merge
dask.dataframe.DataFrame.merge¶
- DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, suffixes=('_x', '_y'), indicator=False, shuffle_method=None, npartitions=None, broadcast=None)[source]¶
Merge the DataFrame with another DataFrame
This will merge the two datasets, either on the indices, a certain column in each dataset or the index in one dataset and the column in another.
- Parameters
- right: dask.dataframe.DataFrame
- how{‘left’, ‘right’, ‘outer’, ‘inner’, ‘leftsemi’}, default: ‘inner’
How to handle the operation of the two objects:
left: use calling frame’s index (or column if on is specified)
right: use other frame’s index
outer: form union of calling frame’s index (or column if on is specified) with other frame’s index, and sort it lexicographically
inner: form intersection of calling frame’s index (or column if on is specified) with other frame’s index, preserving the order of the calling’s one
leftsemi: Choose all rows in left where the join keys can be found in right. Won’t duplicate rows if the keys are duplicated in right. Drops all columns from right.
- onlabel or list
Column or index level names to join on. These must be found in both DataFrames. If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.
- left_onlabel or list, or array-like
Column to join on in the left DataFrame. Other than in pandas arrays and lists are only support if their length is 1.
- right_onlabel or list, or array-like
Column to join on in the right DataFrame. Other than in pandas arrays and lists are only support if their length is 1.
- left_indexboolean, default False
Use the index from the left DataFrame as the join key.
- right_indexboolean, default False
Use the index from the right DataFrame as the join key.
- suffixes2-length sequence (tuple, list, …)
Suffix to apply to overlapping column names in the left and right side, respectively
- indicatorboolean or string, default False
If True, adds a column to output DataFrame called “_merge” with information on the source of each row. If string, column with information on source of each row will be added to output DataFrame, and column will be named value of string. Information column is Categorical-type and takes on a value of “left_only” for observations whose merge key only appears in left DataFrame, “right_only” for observations whose merge key only appears in right DataFrame, and “both” if the observation’s merge key is found in both.
- npartitions: int or None, optional
The ideal number of output partitions. This is only utilised when performing a hash_join (merging on columns only). If
None
thennpartitions = max(lhs.npartitions, rhs.npartitions)
. Default isNone
.- shuffle_method: {‘disk’, ‘tasks’, ‘p2p’}, optional
Either
'disk'
for single-node operation or'tasks'
and'p2p'`
for distributed operation. Will be inferred by your current scheduler.- broadcast: boolean or float, optional
Whether to use a broadcast-based join in lieu of a shuffle-based join for supported cases. By default, a simple heuristic will be used to select the underlying algorithm. If a floating-point value is specified, that number will be used as the
broadcast_bias
within the simple heuristic (a large number makes Dask more likely to choose thebroacast_join
code path). Seebroadcast_join
for more information.
Notes
There are three ways to join dataframes:
Joining on indices. In this case the divisions are aligned using the function
dask.dataframe.multi.align_partitions
. Afterwards, each partition is merged with the pandas merge function.Joining one on index and one on column. In this case the divisions of dataframe merged by index (\(d_i\)) are used to divide the column merged dataframe (\(d_c\)) one using
dask.dataframe.multi.rearrange_by_divisions
. In this case the merged dataframe (\(d_m\)) has the exact same divisions as (\(d_i\)). This can lead to issues if you merge multiple rows from (\(d_c\)) to one row in (\(d_i\)).Joining both on columns. In this case a hash join is performed using
dask.dataframe.multi.hash_join
.
In some cases, you may see a
MemoryError
if themerge
operation requires an internalshuffle
, because shuffling places all rows that have the same index in the same partition. To avoid this error, make sure all rows with the sameon
-column value can fit on a single partition.