dask.dataframe.Series.drop_duplicates

dask.dataframe.Series.drop_duplicates

Series.drop_duplicates(subset=None, split_every=None, split_out=1, shuffle_method=None, ignore_index=False, **kwargs)

Return DataFrame with duplicate rows removed.

This docstring was copied from pandas.core.frame.DataFrame.drop_duplicates.

Some inconsistencies with the Dask version may exist.

Known inconsistencies:

keep=False will raise a NotImplementedError

Considering certain columns is optional. Indexes, including time indexes are ignored.

Parameters
subsetcolumn label or sequence of labels, optional

Only consider certain columns for identifying duplicates, by default use all of the columns.

keep{‘first’, ‘last’, False}, default ‘first’ (Not supported in Dask)

Determines which duplicates (if any) to keep.

  • ‘first’ : Drop duplicates except for the first occurrence.

  • ‘last’ : Drop duplicates except for the last occurrence.

  • False : Drop all duplicates.

inplacebool, default False (Not supported in Dask)

Whether to modify the DataFrame rather than creating a new one.

ignore_indexbool, default False

If True, the resulting axis will be labeled 0, 1, …, n - 1.

Returns
DataFrame or None

DataFrame with duplicates removed or None if inplace=True.

See also

DataFrame.value_counts

Count unique combinations of columns.

Examples

Consider dataset containing ramen rating.

>>> df = pd.DataFrame({  
...     'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
...     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
...     'rating': [4, 4, 3.5, 15, 5]
... })
>>> df  
    brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

By default, it removes duplicate rows based on all columns.

>>> df.drop_duplicates()  
    brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

To remove duplicates on specific column(s), use subset.

>>> df.drop_duplicates(subset=['brand'])  
    brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5

To remove duplicates and keep last occurrences, use keep.

>>> df.drop_duplicates(subset=['brand', 'style'], keep='last')  
    brand style  rating
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
4  Indomie  pack     5.0