API

Dataframe

DataFrame(dsk, name, meta, divisions) Parallel Pandas DataFrame
DataFrame.add(other[, axis, level, fill_value]) Addition of dataframe and other, element-wise (binary operator add).
DataFrame.append(other[, interleave_partitions]) Append rows of other to the end of caller, returning a new object.
DataFrame.apply(func[, axis, broadcast, …]) Parallel version of pandas.DataFrame.apply
DataFrame.assign(**kwargs) Assign new columns to a DataFrame.
DataFrame.astype(dtype) Cast a pandas object to a specified dtype.
DataFrame.categorize([columns, index, …]) Convert columns of the DataFrame to category dtype.
DataFrame.columns
DataFrame.compute(**kwargs) Compute this dask collection
DataFrame.corr([method, min_periods, …]) Compute pairwise correlation of columns, excluding NA/null values.
DataFrame.count([axis, split_every]) Count non-NA cells for each column or row.
DataFrame.cov([min_periods, split_every]) Compute pairwise covariance of columns, excluding NA/null values.
DataFrame.cummax([axis, skipna, out]) Return cumulative maximum over a DataFrame or Series axis.
DataFrame.cummin([axis, skipna, out]) Return cumulative minimum over a DataFrame or Series axis.
DataFrame.cumprod([axis, skipna, dtype, out]) Return cumulative product over a DataFrame or Series axis.
DataFrame.cumsum([axis, skipna, dtype, out]) Return cumulative sum over a DataFrame or Series axis.
DataFrame.describe([split_every, …]) Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
DataFrame.div(other[, axis, level, fill_value]) Floating division of dataframe and other, element-wise (binary operator truediv).
DataFrame.drop([labels, axis, columns, errors]) Drop specified labels from rows or columns.
DataFrame.drop_duplicates([subset, …]) Return DataFrame with duplicate rows removed, optionally only considering certain columns.
DataFrame.dropna([how, subset, thresh]) Remove missing values.
DataFrame.dtypes Return data types
DataFrame.explode(column)
DataFrame.fillna([value, method, limit, axis]) Fill NA/NaN values using the specified method.
DataFrame.floordiv(other[, axis, level, …]) Integer division of dataframe and other, element-wise (binary operator floordiv).
DataFrame.get_partition(n) Get a dask DataFrame/Series representing the nth partition.
DataFrame.groupby([by]) Group DataFrame or Series using a mapper or by a Series of columns.
DataFrame.head([n, npartitions, compute]) First n rows of the dataset
DataFrame.iloc Purely integer-location based indexing for selection by position.
DataFrame.index Return dask Index instance
DataFrame.isna() Detect missing values.
DataFrame.isnull() Detect missing values.
DataFrame.iterrows() Iterate over DataFrame rows as (index, Series) pairs.
DataFrame.itertuples([index, name]) Iterate over DataFrame rows as namedtuples.
DataFrame.join(other[, on, how, lsuffix, …]) Join columns of another DataFrame.
DataFrame.known_divisions Whether divisions are already known
DataFrame.loc Purely label-location based indexer for selection by label.
DataFrame.map_partitions(func, *args, **kwargs) Apply Python function on each DataFrame partition.
DataFrame.mask(cond[, other]) Replace values where the condition is True.
DataFrame.max([axis, skipna, split_every, out]) Return the maximum of the values for the requested axis.
DataFrame.mean([axis, skipna, split_every, …]) Return the mean of the values for the requested axis.
DataFrame.merge(right[, how, on, left_on, …]) Merge the DataFrame with another DataFrame
DataFrame.min([axis, skipna, split_every, out]) Return the minimum of the values for the requested axis.
DataFrame.mod(other[, axis, level, fill_value]) Modulo of dataframe and other, element-wise (binary operator mod).
DataFrame.mul(other[, axis, level, fill_value]) Multiplication of dataframe and other, element-wise (binary operator mul).
DataFrame.ndim Return dimensionality
DataFrame.nlargest([n, columns, split_every]) Return the first n rows ordered by columns in descending order.
DataFrame.npartitions Return number of partitions
DataFrame.partitions Slice dataframe by partitions
DataFrame.pop(item) Return item and drop from frame.
DataFrame.pow(other[, axis, level, fill_value]) Exponential power of dataframe and other, element-wise (binary operator pow).
DataFrame.prod([axis, skipna, split_every, …]) Return the product of the values for the requested axis.
DataFrame.quantile([q, axis, method]) Approximate row-wise and precise column-wise quantiles of DataFrame
DataFrame.query(expr, **kwargs) Filter dataframe with complex expression
DataFrame.radd(other[, axis, level, fill_value]) Addition of dataframe and other, element-wise (binary operator radd).
DataFrame.random_split(frac[, random_state]) Pseudorandomly split dataframe into different pieces row-wise
DataFrame.rdiv(other[, axis, level, fill_value]) Floating division of dataframe and other, element-wise (binary operator rtruediv).
DataFrame.rename([index, columns]) Alter axes labels.
DataFrame.repartition([divisions, …]) Repartition dataframe along new divisions
DataFrame.replace([to_replace, value, regex]) Replace values given in to_replace with value.
DataFrame.reset_index([drop]) Reset the index to the default index.
DataFrame.rfloordiv(other[, axis, level, …]) Integer division of dataframe and other, element-wise (binary operator rfloordiv).
DataFrame.rmod(other[, axis, level, fill_value]) Modulo of dataframe and other, element-wise (binary operator rmod).
DataFrame.rmul(other[, axis, level, fill_value]) Multiplication of dataframe and other, element-wise (binary operator rmul).
DataFrame.rpow(other[, axis, level, fill_value]) Exponential power of dataframe and other, element-wise (binary operator rpow).
DataFrame.rsub(other[, axis, level, fill_value]) Subtraction of dataframe and other, element-wise (binary operator rsub).
DataFrame.rtruediv(other[, axis, level, …]) Floating division of dataframe and other, element-wise (binary operator rtruediv).
DataFrame.sample([n, frac, replace, …]) Random sample of items
DataFrame.set_index(other[, drop, sorted, …]) Set the DataFrame index (row labels) using an existing column.
DataFrame.shape Return a tuple representing the dimensionality of the DataFrame.
DataFrame.std([axis, skipna, ddof, …]) Return sample standard deviation over requested axis.
DataFrame.sub(other[, axis, level, fill_value]) Subtraction of dataframe and other, element-wise (binary operator sub).
DataFrame.sum([axis, skipna, split_every, …]) Return the sum of the values for the requested axis.
DataFrame.tail([n, compute]) Last n rows of the dataset
DataFrame.to_bag([index]) Create Dask Bag from a Dask DataFrame
DataFrame.to_csv(filename, **kwargs) Store Dask DataFrame to CSV files
DataFrame.to_dask_array([lengths]) Convert a dask DataFrame to a dask array.
DataFrame.to_delayed([optimize_graph]) Convert into a list of dask.delayed objects, one per partition.
DataFrame.to_hdf(path_or_buf, key[, mode, …]) Store Dask Dataframe to Hierarchical Data Format (HDF) files
DataFrame.to_json(filename, *args, **kwargs) See dd.to_json docstring for more information
DataFrame.to_parquet(path, *args, **kwargs) Store Dask.dataframe to Parquet files
DataFrame.to_records([index, lengths]) Create Dask Array from a Dask Dataframe
DataFrame.truediv(other[, axis, level, …]) Floating division of dataframe and other, element-wise (binary operator truediv).
DataFrame.values Return a dask.array of the values of this dataframe
DataFrame.var([axis, skipna, ddof, …]) Return unbiased variance over requested axis.
DataFrame.visualize([filename, format, …]) Render the computation of this object’s task graph using graphviz.
DataFrame.where(cond[, other]) Replace values where the condition is False.

Series

Series(dsk, name, meta, divisions) Parallel Pandas Series
Series.add(other[, level, fill_value, axis]) Addition of series and other, element-wise (binary operator add).
Series.align(other[, join, axis, fill_value]) Align two objects on their axes with the specified join method for each axis Index.
Series.all([axis, skipna, split_every, out]) Return whether all elements are True, potentially over an axis.
Series.any([axis, skipna, split_every, out]) Return whether any element is True, potentially over an axis.
Series.append(other[, interleave_partitions]) Concatenate two or more Series.
Series.apply(func[, convert_dtype, meta, args]) Parallel version of pandas.Series.apply
Series.astype(dtype) Cast a pandas object to a specified dtype.
Series.autocorr([lag, split_every]) Compute the lag-N autocorrelation.
Series.between(left, right[, inclusive]) Return boolean Series equivalent to left <= series <= right.
Series.bfill([axis, limit]) Synonym for DataFrame.fillna() with method='bfill'.
Series.cat
Series.clear_divisions() Forget division information
Series.clip([lower, upper, out]) Trim values at input threshold(s).
Series.clip_lower(threshold) Trim values below a given threshold.
Series.clip_upper(threshold) Trim values above a given threshold.
Series.compute(**kwargs) Compute this dask collection
Series.copy() Make a copy of the dataframe
Series.corr(other[, method, min_periods, …]) Compute correlation with other Series, excluding missing values.
Series.count([split_every]) Return number of non-NA/null observations in the Series.
Series.cov(other[, min_periods, split_every]) Compute covariance with Series, excluding missing values.
Series.cummax([axis, skipna, out]) Return cumulative maximum over a DataFrame or Series axis.
Series.cummin([axis, skipna, out]) Return cumulative minimum over a DataFrame or Series axis.
Series.cumprod([axis, skipna, dtype, out]) Return cumulative product over a DataFrame or Series axis.
Series.cumsum([axis, skipna, dtype, out]) Return cumulative sum over a DataFrame or Series axis.
Series.describe([split_every, percentiles, …]) Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
Series.diff([periods, axis]) First discrete difference of element.
Series.div(other[, level, fill_value, axis]) Floating division of series and other, element-wise (binary operator truediv).
Series.drop_duplicates([subset, …]) Return DataFrame with duplicate rows removed, optionally only considering certain columns.
Series.dropna() Return a new Series with missing values removed.
Series.dt Namespace of datetime methods
Series.dtype Return data type
Series.eq(other[, level, fill_value, axis]) Equal to of series and other, element-wise (binary operator eq).
Series.explode()
Series.ffill([axis, limit]) Synonym for DataFrame.fillna() with method='ffill'.
Series.fillna([value, method, limit, axis]) Fill NA/NaN values using the specified method.
Series.first(offset) Convenience method for subsetting initial periods of time series data based on a date offset.
Series.floordiv(other[, level, fill_value, axis]) Integer division of series and other, element-wise (binary operator floordiv).
Series.ge(other[, level, fill_value, axis]) Greater than or equal to of series and other, element-wise (binary operator ge).
Series.get_partition(n) Get a dask DataFrame/Series representing the nth partition.
Series.groupby([by]) Group DataFrame or Series using a mapper or by a Series of columns.
Series.gt(other[, level, fill_value, axis]) Greater than of series and other, element-wise (binary operator gt).
Series.head([n, npartitions, compute]) First n rows of the dataset
Series.idxmax([axis, skipna, split_every]) Return index of first occurrence of maximum over requested axis.
Series.idxmin([axis, skipna, split_every]) Return index of first occurrence of minimum over requested axis.
Series.isin(values) Check whether values are contained in Series.
Series.isna() Detect missing values.
Series.isnull() Detect missing values.
Series.iteritems() Lazily iterate over (index, value) tuples.
Series.known_divisions Whether divisions are already known
Series.last(offset) Convenience method for subsetting final periods of time series data based on a date offset.
Series.le(other[, level, fill_value, axis]) Less than or equal to of series and other, element-wise (binary operator le).
Series.loc Purely label-location based indexer for selection by label.
Series.lt(other[, level, fill_value, axis]) Less than of series and other, element-wise (binary operator lt).
Series.map(arg[, na_action, meta]) Map values of Series according to input correspondence.
Series.map_overlap(func, before, after, …) Apply a function to each partition, sharing rows with adjacent partitions.
Series.map_partitions(func, *args, **kwargs) Apply Python function on each DataFrame partition.
Series.mask(cond[, other]) Replace values where the condition is True.
Series.max([axis, skipna, split_every, out]) Return the maximum of the values for the requested axis.
Series.mean([axis, skipna, split_every, …]) Return the mean of the values for the requested axis.
Series.memory_usage([index, deep]) Return the memory usage of the Series.
Series.min([axis, skipna, split_every, out]) Return the minimum of the values for the requested axis.
Series.mod(other[, level, fill_value, axis]) Modulo of series and other, element-wise (binary operator mod).
Series.mul(other[, level, fill_value, axis]) Multiplication of series and other, element-wise (binary operator mul).
Series.nbytes Number of bytes
Series.ndim Return dimensionality
Series.ne(other[, level, fill_value, axis]) Not equal to of series and other, element-wise (binary operator ne).
Series.nlargest([n, split_every]) Return the largest n elements.
Series.notnull() Detect existing (non-missing) values.
Series.nsmallest([n, split_every]) Return the smallest n elements.
Series.nunique([split_every]) Return number of unique elements in the object.
Series.nunique_approx([split_every]) Approximate number of unique rows.
Series.persist(**kwargs) Persist this dask collection into memory
Series.pipe(func, *args, **kwargs) Apply func(self, *args, **kwargs).
Series.pow(other[, level, fill_value, axis]) Exponential power of series and other, element-wise (binary operator pow).
Series.prod([axis, skipna, split_every, …]) Return the product of the values for the requested axis.
Series.quantile([q, method]) Approximate quantiles of Series
Series.radd(other[, level, fill_value, axis]) Addition of series and other, element-wise (binary operator radd).
Series.random_split(frac[, random_state]) Pseudorandomly split dataframe into different pieces row-wise
Series.rdiv(other[, level, fill_value, axis]) Floating division of series and other, element-wise (binary operator rtruediv).
Series.reduction(chunk[, aggregate, …]) Generic row-wise reductions.
Series.repartition([divisions, npartitions, …]) Repartition dataframe along new divisions
Series.replace([to_replace, value, regex]) Replace values given in to_replace with value.
Series.rename([index, inplace, sorted_index]) Alter Series index labels or name
Series.resample(rule[, closed, label]) Resample time-series data.
Series.reset_index([drop]) Reset the index to the default index.
Series.rolling(window[, min_periods, …]) Provides rolling transformations.
Series.round([decimals]) Round each value in a Series to the given number of decimals.
Series.sample([n, frac, replace, random_state]) Random sample of items
Series.sem([axis, skipna, ddof, split_every]) Return unbiased standard error of the mean over requested axis.
Series.shape Return a tuple representing the dimensionality of a Series.
Series.shift([periods, freq, axis]) Shift index by desired number of periods with an optional time freq.
Series.size Size of the Series or DataFrame as a Delayed object.
Series.std([axis, skipna, ddof, …]) Return sample standard deviation over requested axis.
Series.str Namespace for string methods
Series.sub(other[, level, fill_value, axis]) Subtraction of series and other, element-wise (binary operator sub).
Series.sum([axis, skipna, split_every, …]) Return the sum of the values for the requested axis.
Series.to_bag([index]) Create a Dask Bag from a Series
Series.to_csv(filename, **kwargs) Store Dask DataFrame to CSV files
Series.to_dask_array([lengths]) Convert a dask DataFrame to a dask array.
Series.to_delayed([optimize_graph]) Convert into a list of dask.delayed objects, one per partition.
Series.to_frame([name]) Convert Series to DataFrame.
Series.to_hdf(path_or_buf, key[, mode, append]) Store Dask Dataframe to Hierarchical Data Format (HDF) files
Series.to_string([max_rows]) Render a string representation of the Series.
Series.to_timestamp([freq, how, axis]) Cast to DatetimeIndex of timestamps, at beginning of period.
Series.truediv(other[, level, fill_value, axis]) Floating division of series and other, element-wise (binary operator truediv).
Series.unique([split_every, split_out]) Return Series of unique values in the object.
Series.value_counts([split_every, split_out]) Return a Series containing counts of unique values.
Series.values Return a dask.array of the values of this dataframe
Series.var([axis, skipna, ddof, …]) Return unbiased variance over requested axis.
Series.visualize([filename, format, …]) Render the computation of this object’s task graph using graphviz.
Series.where(cond[, other]) Replace values where the condition is False.

Groupby Operations

DataFrameGroupBy.aggregate(arg[, …]) Aggregate using one or more operations over the specified axis.
DataFrameGroupBy.apply(func, *args, **kwargs) Parallel version of pandas GroupBy.apply
DataFrameGroupBy.count([split_every, split_out]) Compute count of group, excluding missing values.
DataFrameGroupBy.cumcount([axis]) Number each item in each group from 0 to the length of that group - 1.
DataFrameGroupBy.cumprod([axis]) Cumulative product for each group.
DataFrameGroupBy.cumsum([axis]) Cumulative sum for each group.
DataFrameGroupBy.get_group(key) Constructs NDFrame from group with provided name.
DataFrameGroupBy.max([split_every, split_out]) Compute max of group values.
DataFrameGroupBy.mean([split_every, split_out]) Compute mean of groups, excluding missing values.
DataFrameGroupBy.min([split_every, split_out]) Compute min of group values.
DataFrameGroupBy.size([split_every, split_out]) Compute group sizes.
DataFrameGroupBy.std([ddof, split_every, …]) Compute standard deviation of groups, excluding missing values.
DataFrameGroupBy.sum([split_every, …]) Compute sum of group values.
DataFrameGroupBy.var([ddof, split_every, …]) Compute variance of groups, excluding missing values.
DataFrameGroupBy.cov([ddof, split_every, …]) Compute pairwise covariance of columns, excluding NA/null values.
DataFrameGroupBy.corr([ddof, split_every, …]) Compute pairwise correlation of columns, excluding NA/null values.
DataFrameGroupBy.first([split_every, split_out]) Compute first of group values.
DataFrameGroupBy.last([split_every, split_out]) Compute last of group values.
DataFrameGroupBy.idxmin([split_every, …]) Return index of first occurrence of minimum over requested axis.
DataFrameGroupBy.idxmax([split_every, …]) Return index of first occurrence of maximum over requested axis.
SeriesGroupBy.aggregate(arg[, split_every, …]) Aggregate using one or more operations over the specified axis.
SeriesGroupBy.apply(func, *args, **kwargs) Parallel version of pandas GroupBy.apply
SeriesGroupBy.count([split_every, split_out]) Compute count of group, excluding missing values.
SeriesGroupBy.cumcount([axis]) Number each item in each group from 0 to the length of that group - 1.
SeriesGroupBy.cumprod([axis]) Cumulative product for each group.
SeriesGroupBy.cumsum([axis]) Cumulative sum for each group.
SeriesGroupBy.get_group(key) Constructs NDFrame from group with provided name.
SeriesGroupBy.max([split_every, split_out]) Compute max of group values.
SeriesGroupBy.mean([split_every, split_out]) Compute mean of groups, excluding missing values.
SeriesGroupBy.min([split_every, split_out]) Compute min of group values.
SeriesGroupBy.nunique([split_every, split_out])
SeriesGroupBy.size([split_every, split_out]) Compute group sizes.
SeriesGroupBy.std([ddof, split_every, split_out]) Compute standard deviation of groups, excluding missing values.
SeriesGroupBy.sum([split_every, split_out, …]) Compute sum of group values.
SeriesGroupBy.var([ddof, split_every, split_out]) Compute variance of groups, excluding missing values.
SeriesGroupBy.first([split_every, split_out]) Compute first of group values.
SeriesGroupBy.last([split_every, split_out]) Compute last of group values.
SeriesGroupBy.idxmin([split_every, …]) Return index of first occurrence of minimum over requested axis.
SeriesGroupBy.idxmax([split_every, …]) Return index of first occurrence of maximum over requested axis.
Aggregation(name, chunk, agg[, finalize]) User defined groupby-aggregation.

Rolling Operations

rolling.map_overlap(func, df, before, after, …) Apply a function to each partition, sharing rows with adjacent partitions.
Series.rolling(window[, min_periods, …]) Provides rolling transformations.
DataFrame.rolling(window[, min_periods, …]) Provides rolling transformations.
Rolling.apply(func[, args, kwargs]) The rolling function’s apply function.
Rolling.count() The rolling count of any non-NaN observations inside the window.
Rolling.kurt() Calculate unbiased rolling kurtosis.
Rolling.max() Calculate the rolling maximum.
Rolling.mean() Calculate the rolling mean of the values.
Rolling.median() Calculate the rolling median.
Rolling.min() Calculate the rolling minimum.
Rolling.quantile(quantile) Calculate the rolling quantile.
Rolling.skew() Unbiased rolling skewness.
Rolling.std([ddof]) Calculate rolling standard deviation.
Rolling.sum() Calculate rolling sum of given DataFrame or Series.
Rolling.var([ddof]) Calculate unbiased rolling variance.

Create DataFrames

read_csv(urlpath[, blocksize, collection, …]) Read CSV files into a Dask.DataFrame
read_table(urlpath[, blocksize, collection, …]) Read delimited files into a Dask.DataFrame
read_fwf(urlpath[, blocksize, collection, …]) Read fixed-width files into a Dask.DataFrame
read_parquet(path[, columns, filters, …]) Read a Parquet file into a Dask DataFrame
read_hdf(pattern, key[, start, stop, …]) Read HDF files into a Dask DataFrame
read_json(url_path[, orient, lines, …]) Create a dataframe from a set of JSON files
read_orc(path[, columns, storage_options]) Read dataframe from ORC file(s)
read_sql_table(table, uri, index_col[, …]) Create dataframe from an SQL table.
from_array(x[, chunksize, columns]) Read any slicable array into a Dask Dataframe
from_bcolz(x[, chunksize, categorize, …]) Read BColz CTable into a Dask Dataframe
from_dask_array(x[, columns, index]) Create a Dask DataFrame from a Dask Array.
from_delayed(dfs[, meta, divisions, prefix, …]) Create Dask DataFrame from many Dask Delayed objects
from_pandas(data[, npartitions, chunksize, …]) Construct a Dask DataFrame from a Pandas DataFrame
dask.bag.core.Bag.to_dataframe([meta, columns]) Create Dask Dataframe from a Dask Bag.

Store DataFrames

to_csv(df, filename[, single_file, …]) Store Dask DataFrame to CSV files
to_parquet(df, path[, engine, compression, …]) Store Dask.dataframe to Parquet files
to_hdf(df, path, key[, mode, append, …]) Store Dask Dataframe to Hierarchical Data Format (HDF) files
to_records(df) Create Dask Array from a Dask Dataframe
to_bag(df[, index]) Create Dask Bag from a Dask DataFrame
to_json(df, url_path[, orient, lines, …]) Write dataframe into JSON text files

Convert DataFrames

to_dask_array
to_delayed

Reshape DataFrames

get_dummies(data[, prefix, prefix_sep, …]) Convert categorical variable into dummy/indicator variables.
pivot_table(df[, index, columns, values, …]) Create a spreadsheet-style pivot table as a DataFrame.
melt(frame[, id_vars, value_vars, var_name, …]) Unpivots a DataFrame from wide format to long format, optionally leaving identifier variables set.

DataFrame Methods

class dask.dataframe.DataFrame(dsk, name, meta, divisions)

Parallel Pandas DataFrame

Do not use this class directly. Instead use functions like dd.read_csv, dd.read_parquet, or dd.from_pandas.

Parameters:
dsk: dict

The dask graph to compute this DataFrame

name: str

The key prefix that specifies which keys in the dask comprise this particular DataFrame

meta: pandas.DataFrame

An empty pandas.DataFrame with names, dtypes, and index matching the expected output.

divisions: tuple of index values

Values along which we partition our blocks on the index
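
As a rough illustration of how divisions relate index values to partitions, the sketch below uses only the standard library. It assumes the usual Dask convention that a collection with n partitions carries n + 1 division values, that each partition covers a half-open range of the index, and that the last partition also includes the final division value. The helper name partition_for is hypothetical and is not part of the Dask API.

```python
from bisect import bisect_right


def partition_for(divisions, key):
    """Return the number of the partition whose index range contains `key`.

    Hypothetical helper for illustration only; not a Dask function.
    """
    if not divisions[0] <= key <= divisions[-1]:
        raise KeyError(key)
    # bisect_right counts how many division values are <= key.
    # The final division is an inclusive upper bound, so clamp to the last partition.
    return min(bisect_right(divisions, key) - 1, len(divisions) - 2)


# Four division values describe three partitions: [0, 10), [10, 20), [20, 30].
divisions = (0, 10, 20, 30)
for key in (0, 9, 10, 30):
    print(key, "->", partition_for(divisions, key))
```

Known divisions are what allow label-based operations to look at only the partitions that can contain a given index value, rather than scanning all of them.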

abs()

Return a Series/DataFrame with absolute numeric value of each element.

This docstring was copied from pandas.core.frame.DataFrame.abs.

Some inconsistencies with the Dask version may exist.

This function only applies to elements that are all numeric.

Returns:
abs

Series/DataFrame containing the absolute value of each element.

See also

numpy.absolute
Calculate the absolute value element-wise.

Notes

For a complex input a + bj, such as 1.2 + 1j, the absolute value is \(\sqrt{ a^2 + b^2 }\).
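
The formula in this note can be checked with plain Python, without pandas or Dask. This is only a worked instance of the arithmetic, using the same value 1.2 + 1j that appears in the note.

```python
import math

z = 1.2 + 1j  # real part a = 1.2, imaginary part b = 1.0
magnitude = math.sqrt(z.real ** 2 + z.imag ** 2)

# Python's built-in abs() applies the same formula to complex numbers.
assert math.isclose(magnitude, abs(z))
print(round(magnitude, 5))  # 1.56205
```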

Examples

Absolute numeric values in a Series.

>>> s = pd.Series([-1.10, 2, -3.33, 4])  # doctest: +SKIP
>>> s.abs()  # doctest: +SKIP
0    1.10
1    2.00
2    3.33
3    4.00
dtype: float64

Absolute numeric values in a Series with complex numbers.

>>> s = pd.Series([1.2 + 1j])  # doctest: +SKIP
>>> s.abs()  # doctest: +SKIP
0    1.56205
dtype: float64

Absolute numeric values in a Series with a Timedelta element.

>>> s = pd.Series([pd.Timedelta('1 days')])  # doctest: +SKIP
>>> s.abs()  # doctest: +SKIP
0   1 days
dtype: timedelta64[ns]

Select rows with data closest to certain value using argsort (from StackOverflow).

>>> df = pd.DataFrame({  # doctest: +SKIP
...     'a': [4, 5, 6, 7],
...     'b': [10, 20, 30, 40],
...     'c': [100, 50, -30, -50]
... })
>>> df  # doctest: +SKIP
     a    b    c
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50
>>> df.loc[(df.c - 43).abs().argsort()]  # doctest: +SKIP
     a    b    c
1    5   20   50
0    4   10  100
2    6   30  -30
3    7   40  -50
add(other, axis='columns', level=None, fill_value=None)

Addition of dataframe and other, element-wise (binary operator add).

Equivalent to dataframe + other, but with support for substituting a fill_value for missing data in one of the inputs. The reverse version is radd.

One of the flexible wrappers (add, sub, mul, div, mod, pow) around the arithmetic operators +, -, *, /, //, %, **.

Parameters:
other : scalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axis : {0 or ‘index’, 1 or ‘columns’}

Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

level : int or label

Broadcast across a level, matching Index values on the passed MultiIndex level.

fill_value : float or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame

Result of the arithmetic operation.

See also

DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.

Notes

Mismatched indices will be unioned together.
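
The alignment rule described by fill_value and by this note can be sketched without pandas. The code below is a simplified model, not the library's implementation: plain dictionaries stand in for labelled data, None stands in for a missing (NaN) value, and the helper name add_with_fill is hypothetical.

```python
def add_with_fill(left, right, fill_value=None):
    """Add two label -> value mappings using the alignment rule described above.

    Hypothetical helper for illustration; None plays the role of NaN.
    """
    result = {}
    # Mismatched labels are unioned together.
    for label in sorted(set(left) | set(right)):
        a, b = left.get(label), right.get(label)
        if a is None and b is None:
            # Missing on both sides stays missing, even with a fill_value.
            result[label] = None
            continue
        if a is None:
            a = fill_value
        if b is None:
            b = fill_value
        # Without a fill_value, one missing operand makes the result missing.
        result[label] = None if a is None or b is None else a + b
    return result


left = {"a": 1, "b": 2, "d": None}
right = {"b": 10, "c": 20, "d": None}

print(add_with_fill(left, right))                # only "b" has a value
print(add_with_fill(left, right, fill_value=0))  # "a", "b" and "c" have values
```

In the second call, label "d" is still missing because neither input supplies a value for it.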

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  # doctest: +SKIP
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  # doctest: +SKIP
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  # doctest: +SKIP
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)  # doctest: +SKIP
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  # doctest: +SKIP
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  # doctest: +SKIP
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  # doctest: +SKIP
           angles
circle          0
triangle        3
rectangle       4
>>> df * other  # doctest: +SKIP
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)  # doctest: +SKIP
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  # doctest: +SKIP
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  # doctest: +SKIP
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)  # doctest: +SKIP
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
align(other, join='outer', axis=None, fill_value=None)

Align two objects on their axes with the specified join method for each axis Index.

This docstring was copied from pandas.core.frame.DataFrame.align.

Some inconsistencies with the Dask version may exist.

Parameters:
other : DataFrame or Series
join : {‘outer’, ‘inner’, ‘left’, ‘right’}, default ‘outer’
axis : allowed axis of the other object, default None

Align on index (0), columns (1), or both (None)

level : int or level name, default None (Not supported in Dask)

Broadcast across a level, matching Index values on the passed MultiIndex level

copy : boolean, default True (Not supported in Dask)

Always returns new objects. If copy=False and no reindexing is required then original objects are returned.

fill_value : scalar, default np.NaN

Value to use for missing values. Defaults to NaN, but can be any “compatible” value

method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None (Not supported in Dask)

Method to use for filling holes in reindexed Series:

  • pad / ffill : propagate last valid observation forward to next valid
  • backfill / bfill : use NEXT valid observation to fill gap

limit : int, default None (Not supported in Dask)

If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.

fill_axis : {0 or ‘index’, 1 or ‘columns’}, default 0 (Not supported in Dask)

Axis along which missing values are filled; used together with method and limit.

broadcast_axis : {0 or ‘index’, 1 or ‘columns’}, default None (Not supported in Dask)

Broadcast values along this axis, if aligning two objects of different dimensions

Returns:
(left, right) : (DataFrame, type of other)

Aligned objects

all(axis=None, skipna=True, split_every=False, out=None)

Return whether all elements are True, potentially over an axis.

This docstring was copied from pandas.core.frame.DataFrame.all.

Some inconsistencies with the Dask version may exist.

Returns True unless there is at least one element within a Series or along a DataFrame axis that is False or equivalent (e.g. zero or empty).

Parameters:
axis : {0 or ‘index’, 1 or ‘columns’, None}, default 0

Indicate which axis or axes should be reduced.

  • 0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.
  • 1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.
  • None : reduce all axes, return a scalar.
bool_only : bool, default None (Not supported in Dask)

Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.

skipna : bool, default True

Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be True, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.

level : int or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

**kwargs : any, default None

Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:
Series or DataFrame

If level is specified, a DataFrame is returned; otherwise, a Series is returned.

See also

Series.all
Return True if all elements are True.
DataFrame.any
Return True if one (or more) elements are True.

Examples

Series

>>> pd.Series([True, True]).all()  # doctest: +SKIP
True
>>> pd.Series([True, False]).all()  # doctest: +SKIP
False
>>> pd.Series([]).all()  # doctest: +SKIP
True
>>> pd.Series([np.nan]).all()  # doctest: +SKIP
True
>>> pd.Series([np.nan]).all(skipna=False)  # doctest: +SKIP
True

DataFrames

Create a dataframe from a dictionary.

>>> df = pd.DataFrame({'col1': [True, True], 'col2': [True, False]})  # doctest: +SKIP
>>> df  # doctest: +SKIP
   col1   col2
0  True   True
1  True  False

Default behaviour checks if column-wise values all return True.

>>> df.all()  # doctest: +SKIP
col1     True
col2    False
dtype: bool

Specify axis='columns' to check if row-wise values all return True.

>>> df.all(axis='columns')  # doctest: +SKIP
0     True
1    False
dtype: bool

Or axis=None for whether every value is True.

>>> df.all(axis=None)  # doctest: +SKIP
False
any(axis=None, skipna=True, split_every=False, out=None)

Return whether any element is True, potentially over an axis.

This docstring was copied from pandas.core.frame.DataFrame.any.

Some inconsistencies with the Dask version may exist.

Returns False unless there is at least one element within a Series or along a DataFrame axis that is True or equivalent (e.g. non-zero or non-empty).

Parameters:
axis : {0 or ‘index’, 1 or ‘columns’, None}, default 0

Indicate which axis or axes should be reduced.

  • 0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.
  • 1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.
  • None : reduce all axes, return a scalar.
bool_only : bool, default None (Not supported in Dask)

Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.

skipna : bool, default True

Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.

level : int or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

**kwargs : any, default None

Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:
Series or DataFrame

If level is specified, a DataFrame is returned; otherwise, a Series is returned.

See also

numpy.any
Numpy version of this method.
Series.any
Return whether any element is True.
Series.all
Return whether all elements are True.
DataFrame.any
Return whether any element is True over requested axis.
DataFrame.all
Return whether all elements are True over requested axis.

Examples

Series

For Series input, the output is a scalar indicating whether any element is True.

>>> pd.Series([False, False]).any()  # doctest: +SKIP
False
>>> pd.Series([True, False]).any()  # doctest: +SKIP
True
>>> pd.Series([]).any()  # doctest: +SKIP
False
>>> pd.Series([np.nan]).any()  # doctest: +SKIP
False
>>> pd.Series([np.nan]).any(skipna=False)  # doctest: +SKIP
True

DataFrame

Whether each column contains at least one True element (the default).

>>> df = pd.DataFrame({"A": [1, 2], "B": [0, 2], "C": [0, 0]})  # doctest: +SKIP
>>> df  # doctest: +SKIP
   A  B  C
0  1  0  0
1  2  2  0
>>> df.any()  # doctest: +SKIP
A     True
B     True
C    False
dtype: bool

Aggregating over the columns.

>>> df = pd.DataFrame({"A": [True, False], "B": [1, 2]})  # doctest: +SKIP
>>> df  # doctest: +SKIP
       A  B
0   True  1
1  False  2
>>> df.any(axis='columns')  # doctest: +SKIP
0    True
1    True
dtype: bool
>>> df = pd.DataFrame({"A": [True, False], "B": [1, 0]})  # doctest: +SKIP
>>> df  # doctest: +SKIP
       A  B
0   True  1
1  False  0
>>> df.any(axis='columns')  # doctest: +SKIP
0     True
1    False
dtype: bool

Aggregating over the entire DataFrame with axis=None.

>>> df.any(axis=None)  # doctest: +SKIP
True

any for an empty DataFrame is an empty Series.

>>> pd.DataFrame([]).any()  # doctest: +SKIP
Series([], dtype: bool)
append(other, interleave_partitions=False)

Append rows of other to the end of caller, returning a new object.

This docstring was copied from pandas.core.frame.DataFrame.append.

Some inconsistencies with the Dask version may exist.

Columns in other that are not in the caller are added as new columns.

Parameters:
other : DataFrame or Series/dict-like object, or list of these

The data to append.

ignore_index : boolean, default False (Not supported in Dask)

If True, do not use the index labels.

verify_integrity : boolean, default False (Not supported in Dask)

If True, raise ValueError on creating index with duplicates.

sort : boolean, default None (Not supported in Dask)

Sort columns if the columns of self and other are not aligned. The default sorting is deprecated and will change to not-sorting in a future version of pandas. Explicitly pass sort=True to silence the warning and sort. Explicitly pass sort=False to silence the warning and not sort.

New in version 0.23.0.

Returns:
appended : DataFrame

See also

pandas.concat
General function to concatenate DataFrame, Series or Panel objects.

Notes

If a list of dict/series is passed and the keys are all contained in the DataFrame’s index, the order of the columns in the resulting DataFrame will be unchanged.

Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.

Examples

>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))  # doctest: +SKIP
>>> df  # doctest: +SKIP
   A  B
0  1  2
1  3  4
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))  # doctest: +SKIP
>>> df.append(df2)  # doctest: +SKIP
   A  B
0  1  2
1  3  4
0  5  6
1  7  8

With ignore_index set to True:

>>> df.append(df2, ignore_index=True)  # doctest: +SKIP
   A  B
0  1  2
1  3  4
2  5  6
3  7  8

The following, while not recommended methods for generating DataFrames, show two ways to generate a DataFrame from multiple data sources.

Less efficient:

>>> df = pd.DataFrame(columns=['A'])  # doctest: +SKIP
>>> for i in range(5):  # doctest: +SKIP
...     df = df.append({'A': i}, ignore_index=True)
>>> df  # doctest: +SKIP
   A
0  0
1  1
2  2
3  3
4  4

More efficient:

>>> pd.concat([pd.DataFrame([i], columns=['A']) for i in range(5)],  # doctest: +SKIP
...           ignore_index=True)
   A
0  0
1  1
2  2
3  3
4  4
apply(func, axis=0, broadcast=None, raw=False, reduce=None, args=(), meta='__no_default__', **kwds)

Parallel version of pandas.DataFrame.apply

This mimics the pandas version except for the following:

  1. Only axis=1 is supported (and must be specified explicitly).
  2. The user should provide output metadata via the meta keyword.
Parameters:
func : function

Function to apply to each column/row

axis : {0 or ‘index’, 1 or ‘columns’}, default 0
  • 0 or ‘index’: apply function to each column (NOT SUPPORTED)
  • 1 or ‘columns’: apply function to each row
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

args : tuple

Positional arguments to pass to function in addition to the array/series

**kwds

Additional keyword arguments will be passed as keywords to the function.
Returns:
applied : Series or DataFrame

See also

dask.DataFrame.map_partitions

Examples

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
...                    'y': [1., 2., 3., 4., 5.]})
>>> ddf = dd.from_pandas(df, npartitions=2)

Apply a function row-wise, passing in extra arguments via args and kwargs:

>>> def myadd(row, a, b=1):
...     return row.sum() + a + b
>>> res = ddf.apply(myadd, axis=1, args=(2,), b=1.5)  # doctest: +SKIP

By default, dask tries to infer the output metadata by running your provided function on some fake data. This works well in many cases, but can sometimes be expensive, or even fail. To avoid this, you can manually specify the output metadata with the meta keyword. This can be specified in many forms; for more information, see dask.dataframe.utils.make_meta.

Here we specify the output is a Series with name 'x', and dtype float64:

>>> res = ddf.apply(myadd, axis=1, args=(2,), b=1.5, meta=('x', 'f8'))

In the case where the metadata doesn’t change, you can also pass in the object itself directly:

>>> res = ddf.apply(lambda row: row + 1, axis=1, meta=ddf)
applymap(func, meta='__no_default__')

Apply a function to a DataFrame elementwise.

This docstring was copied from pandas.core.frame.DataFrame.applymap.

Some inconsistencies with the Dask version may exist.

This method applies a function that accepts and returns a scalar to every element of a DataFrame.

Parameters:
func : callable

Python function, returns a single value from a single value.

Returns:
DataFrame

Transformed DataFrame.

See also

DataFrame.apply
Apply a function along input axis of DataFrame.

Notes

In the current implementation applymap calls func twice on the first column/row to decide whether it can take a fast or slow code path. This can lead to unexpected behavior if func has side-effects, as they will take effect twice for the first column/row.

Examples

>>> df = pd.DataFrame([[1, 2.12], [3.356, 4.567]])  # doctest: +SKIP
>>> df  # doctest: +SKIP
       0      1
0  1.000  2.120
1  3.356  4.567
>>> df.applymap(lambda x: len(str(x)))  # doctest: +SKIP
   0  1
0  3  4
1  5  5

Note that a vectorized version of func often exists, which will be much faster. You could square each number elementwise.

>>> df.applymap(lambda x: x**2)  # doctest: +SKIP
           0          1
0   1.000000   4.494400
1  11.262736  20.857489

But it’s better to avoid applymap in that case.

>>> df ** 2  # doctest: +SKIP
           0          1
0   1.000000   4.494400
1  11.262736  20.857489
assign(**kwargs)

Assign new columns to a DataFrame.

This docstring was copied from pandas.core.frame.DataFrame.assign.

Some inconsistencies with the Dask version may exist.

Returns a new object with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten.

Parameters:
**kwargs : dict of {str: callable or Series}

The column names are keywords. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not change input DataFrame (though pandas doesn’t check it). If the values are not callable, (e.g. a Series, scalar, or array), they are simply assigned.

Returns:
DataFrame

A new DataFrame with the new columns in addition to all the existing columns.

Notes

Assigning multiple columns within the same assign is possible. For Python 3.6 and above, later items in ‘**kwargs’ may refer to newly created or modified columns in ‘df’; items are computed and assigned into ‘df’ in order. For Python 3.5 and below, the order of keyword arguments is not specified, so you cannot refer to newly created or modified columns: all items are computed first, and then assigned in alphabetical order.

Changed in version 0.23.0: Keyword argument order is maintained for Python 3.6 and later.

Examples

>>> df = pd.DataFrame({'temp_c': [17.0, 25.0]},  # doctest: +SKIP
...                   index=['Portland', 'Berkeley'])
>>> df  # doctest: +SKIP
          temp_c
Portland    17.0
Berkeley    25.0

Where the value is a callable, evaluated on df:

>>> df.assign(temp_f=lambda x: x.temp_c * 9 / 5 + 32)  # doctest: +SKIP
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0

Alternatively, the same behavior can be achieved by directly referencing an existing Series or sequence:

>>> df.assign(temp_f=df['temp_c'] * 9 / 5 + 32)  # doctest: +SKIP
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0

In Python 3.6+, you can create multiple columns within the same assign where one of the columns depends on another one defined within the same assign:

>>> df.assign(temp_f=lambda x: x['temp_c'] * 9 / 5 + 32,  # doctest: +SKIP
...           temp_k=lambda x: (x['temp_f'] +  459.67) * 5 / 9)
          temp_c  temp_f  temp_k
Portland    17.0    62.6  290.15
Berkeley    25.0    77.0  298.15
astype(dtype)

Cast a pandas object to the specified dtype.

This docstring was copied from pandas.core.frame.DataFrame.astype.

Some inconsistencies with the Dask version may exist.

Parameters:
dtype : data type, or dict of column name -> data type

Use a numpy.dtype or Python type to cast entire pandas object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.

copy : bool, default True (Not supported in Dask)

Return a copy when copy=True (be very careful setting copy=False as changes to values then may propagate to other pandas objects).

errors : {‘raise’, ‘ignore’}, default ‘raise’ (Not supported in Dask)

Control raising of exceptions on invalid data for provided dtype.

  • raise : allow exceptions to be raised
  • ignore : suppress exceptions. On error return original object

New in version 0.20.0.

kwargs : keyword arguments to pass on to the constructor
Returns:
casted : same type as caller

See also

to_datetime
Convert argument to datetime.
to_timedelta
Convert argument to timedelta.
to_numeric
Convert argument to a numeric type.
numpy.ndarray.astype
Cast a numpy array to a specified type.

Examples

>>> ser = pd.Series([1, 2], dtype='int32')  # doctest: +SKIP
>>> ser  # doctest: +SKIP
0    1
1    2
dtype: int32
>>> ser.astype('int64')  # doctest: +SKIP
0    1
1    2
dtype: int64

Convert to categorical type:

>>> ser.astype('category')  # doctest: +SKIP
0    1
1    2
dtype: category
Categories (2, int64): [1, 2]

Convert to ordered categorical type with custom ordering:

>>> cat_dtype = pd.api.types.CategoricalDtype(  # doctest: +SKIP
...                     categories=[2, 1], ordered=True)
>>> ser.astype(cat_dtype)  # doctest: +SKIP
0    1
1    2
dtype: category
Categories (2, int64): [2 < 1]

Note that using copy=False and changing data on a new pandas object may propagate changes:

>>> s1 = pd.Series([1,2])  # doctest: +SKIP
>>> s2 = s1.astype('int64', copy=False)  # doctest: +SKIP
>>> s2[0] = 10  # doctest: +SKIP
>>> s1  # note that s1[0] has changed too  # doctest: +SKIP
0    10
1     2
dtype: int64
bfill(axis=None, limit=None)

Synonym for DataFrame.fillna() with method='bfill'.

categorize(columns=None, index=None, split_every=None, **kwargs)

Convert columns of the DataFrame to category dtype.

Parameters:
columns : list, optional

A list of column names to convert to categoricals. By default any column with an object dtype is converted to a categorical, and any unknown categoricals are made known.

index : bool, optional

Whether to categorize the index. By default, object indices are converted to categorical, and unknown categorical indices are made known. Set True to always categorize the index, False to never.

split_every : int, optional

Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used. Default is 16.

kwargs

Keyword arguments are passed on to compute.

clear_divisions()

Forget division information

clip(lower=None, upper=None, out=None)

Trim values at input threshold(s).

This docstring was copied from pandas.core.frame.DataFrame.clip.

Some inconsistencies with the Dask version may exist.

Assigns values outside boundary to boundary values. Thresholds can be singular values or array like, and in the latter case the clipping is performed element-wise in the specified axis.

Parameters:
lower : float or array_like, default None

Minimum threshold value. All values below this threshold will be set to it.

upper : float or array_like, default None

Maximum threshold value. All values above this threshold will be set to it.

axis : int or string axis name, optional (Not supported in Dask)

Align object with lower and upper along the given axis.

inplace : boolean, default False (Not supported in Dask)

Whether to perform the operation in place on the data.

New in version 0.21.0.

*args, **kwargs

Additional keywords have no effect but might be accepted for compatibility with numpy.

Returns:
Series or DataFrame

Same type as calling object with the values outside the clip boundaries replaced

Examples

>>> data = {'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]}  # doctest: +SKIP
>>> df = pd.DataFrame(data)  # doctest: +SKIP
>>> df  # doctest: +SKIP
   col_0  col_1
0      9     -2
1     -3     -7
2      0      6
3     -1      8
4      5     -5

Clips per column using lower and upper thresholds:

>>> df.clip(-4, 6)  # doctest: +SKIP
   col_0  col_1
0      6     -2
1     -3     -4
2      0      6
3     -1      6
4      5     -4

Clips using specific lower and upper thresholds per column element:

>>> t = pd.Series([2, -4, -1, 6, 3])  # doctest: +SKIP
>>> t  # doctest: +SKIP
0    2
1   -4
2   -1
3    6
4    3
dtype: int64
>>> df.clip(t, t + 4, axis=0)  # doctest: +SKIP
   col_0  col_1
0      6      2
1     -3     -4
2      0      3
3      6      8
4      5      3
clip_lower(threshold)

Trim values below a given threshold.

This docstring was copied from pandas.core.frame.DataFrame.clip_lower.

Some inconsistencies with the Dask version may exist.

Deprecated since version 0.24.0: Use clip(lower=threshold) instead.

Elements below the threshold will be changed to match the threshold value(s). Threshold can be a single value or an array, in the latter case it performs the truncation element-wise.

Parameters:
threshold : numeric or array-like

Minimum value allowed. All values below threshold will be set to this value.

  • float : every value is compared to threshold.
  • array-like : The shape of threshold should match the object it’s compared to. When self is a Series, threshold should be the same length as the Series. When self is a DataFrame, threshold should be 2-D and the same shape as self for axis=None, or 1-D and the same length as the axis being compared.
axis : {0 or ‘index’, 1 or ‘columns’}, default 0 (Not supported in Dask)

Align self with threshold along the given axis.

inplace : boolean, default False (Not supported in Dask)

Whether to perform the operation in place on the data.

New in version 0.21.0.

Returns:
Series or DataFrame

Original data with values trimmed.

See also

Series.clip
General purpose method to trim Series values to given threshold(s).
DataFrame.clip
General purpose method to trim DataFrame values to given threshold(s).

Examples

Series single threshold clipping:

>>> s = pd.Series([5, 6, 7, 8, 9])  # doctest: +SKIP
>>> s.clip(lower=8)  # doctest: +SKIP
0    8
1    8
2    8
3    8
4    9
dtype: int64

Series clipping element-wise using an array of thresholds. threshold should be the same length as the Series.

>>> elemwise_thresholds = [4, 8, 7, 2, 5]  # doctest: +SKIP
>>> s.clip(lower=elemwise_thresholds)  # doctest: +SKIP
0    5
1    8
2    7
3    8
4    9
dtype: int64

DataFrames can be compared to a scalar.

>>> df = pd.DataFrame({"A": [1, 3, 5], "B": [2, 4, 6]})  # doctest: +SKIP
>>> df  # doctest: +SKIP
   A  B
0  1  2
1  3  4
2  5  6
>>> df.clip(lower=3)  # doctest: +SKIP
   A  B
0  3  3
1  3  4
2  5  6

Or to an array of values. By default, threshold should be the same shape as the DataFrame.

>>> df.clip(lower=np.array([[3, 4], [2, 2], [6, 2]]))  # doctest: +SKIP
   A  B
0  3  4
1  3  4
2  6  6

Control how threshold is broadcast with axis. In this case threshold should be the same length as the axis specified by axis.

>>> df.clip(lower=[3, 3, 5], axis='index')  # doctest: +SKIP
   A  B
0  3  3
1  3  4
2  5  6
>>> df.clip(lower=[4, 5], axis='columns')  # doctest: +SKIP
   A  B
0  4  5
1  4  5
2  5  6
clip_upper(threshold)

Trim values above a given threshold.

This docstring was copied from pandas.core.frame.DataFrame.clip_upper.

Some inconsistencies with the Dask version may exist.

Deprecated since version 0.24.0: Use clip(upper=threshold) instead.

Elements above the threshold will be changed to match the threshold value(s). Threshold can be a single value or an array, in the latter case it performs the truncation element-wise.

Parameters:
threshold : numeric or array-like

Maximum value allowed. All values above threshold will be set to this value.

  • float : every value is compared to threshold.
  • array-like : The shape of threshold should match the object it’s compared to. When self is a Series, threshold should be the same length as the Series. When self is a DataFrame, threshold should be 2-D and the same shape as self for axis=None, or 1-D and the same length as the axis being compared.
axis : {0 or ‘index’, 1 or ‘columns’}, default 0 (Not supported in Dask)

Align object with threshold along the given axis.

inplace : boolean, default False (Not supported in Dask)

Whether to perform the operation in place on the data.

New in version 0.21.0.

Returns:
Series or DataFrame

Original data with values trimmed.

See also

Series.clip
General purpose method to trim Series values to given threshold(s).
DataFrame.clip
General purpose method to trim DataFrame values to given threshold(s).

Examples

>>> s = pd.Series([1, 2, 3, 4, 5])  # doctest: +SKIP
>>> s  # doctest: +SKIP
0    1
1    2
2    3
3    4
4    5
dtype: int64
>>> s.clip(upper=3)  # doctest: +SKIP
0    1
1    2
2    3
3    3
4    3
dtype: int64
>>> elemwise_thresholds = [5, 4, 3, 2, 1]  # doctest: +SKIP
>>> elemwise_thresholds  # doctest: +SKIP
[5, 4, 3, 2, 1]
>>> s.clip(upper=elemwise_thresholds)  # doctest: +SKIP
0    1
1    2
2    3
3    2
4    1
dtype: int64
combine(other, func, fill_value=None, overwrite=True)

Perform column-wise combine with another DataFrame based on a passed function.

This docstring was copied from pandas.core.frame.DataFrame.combine.

Some inconsistencies with the Dask version may exist.

Combines a DataFrame with other DataFrame using func to element-wise combine columns. The row and column indexes of the resulting DataFrame will be the union of the two.

Parameters:
other : DataFrame

The DataFrame to merge column-wise.

func : function

Function that takes two Series as inputs and returns a Series or a scalar. Used to merge the two DataFrames column by column.

fill_value : scalar value, default None

The value to fill NaNs with prior to passing any column to the merge func.

overwrite : boolean, default True

If True, columns in self that do not exist in other will be overwritten with NaNs.

Returns:
result : DataFrame

See also

DataFrame.combine_first
Combine two DataFrame objects and default to non-null values in frame calling the method.

Examples

Combine using a simple function that chooses the smaller column.

>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]})  # doctest: +SKIP
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})  # doctest: +SKIP
>>> take_smaller = lambda s1, s2: s1 if s1.sum() < s2.sum() else s2  # doctest: +SKIP
>>> df1.combine(df2, take_smaller)  # doctest: +SKIP
   A  B
0  0  3
1  0  3

Example using a true element-wise combine function.

>>> df1 = pd.DataFrame({'A': [5, 0], 'B': [2, 4]})  # doctest: +SKIP
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})  # doctest: +SKIP
>>> df1.combine(df2, np.minimum)  # doctest: +SKIP
   A  B
0  1  2
1  0  3

Using fill_value fills Nones prior to passing the column to the merge function.

>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]})  # doctest: +SKIP
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})  # doctest: +SKIP
>>> df1.combine(df2, take_smaller, fill_value=-5)  # doctest: +SKIP
   A    B
0  0 -5.0
1  0  4.0

However, if the same element in both dataframes is None, that None is preserved.

>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]})  # doctest: +SKIP
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [None, 3]})  # doctest: +SKIP
>>> df1.combine(df2, take_smaller, fill_value=-5)  # doctest: +SKIP
   A    B
0  0  NaN
1  0  3.0

Example that demonstrates the use of overwrite and the behavior when the axes differ between the dataframes.

>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]})  # doctest: +SKIP
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [-10, 1],}, index=[1, 2])  # doctest: +SKIP
>>> df1.combine(df2, take_smaller)  # doctest: +SKIP
     A    B     C
0  NaN  NaN   NaN
1  NaN  3.0 -10.0
2  NaN  3.0   1.0
>>> df1.combine(df2, take_smaller, overwrite=False)  # doctest: +SKIP
     A    B     C
0  0.0  NaN   NaN
1  0.0  3.0 -10.0
2  NaN  3.0   1.0

Demonstrating the preference of the passed in dataframe.

>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1],}, index=[1, 2])  # doctest: +SKIP
>>> df2.combine(df1, take_smaller)  # doctest: +SKIP
     A    B   C
0  0.0  NaN NaN
1  0.0  3.0 NaN
2  NaN  3.0 NaN
>>> df2.combine(df1, take_smaller, overwrite=False)  # doctest: +SKIP
     A    B   C
0  0.0  NaN NaN
1  0.0  3.0 1.0
2  NaN  3.0 1.0
combine_first(other)

Update null elements with value in the same location in other.

This docstring was copied from pandas.core.frame.DataFrame.combine_first.

Some inconsistencies with the Dask version may exist.

Combine two DataFrame objects by filling null values in one DataFrame with non-null values from other DataFrame. The row and column indexes of the resulting DataFrame will be the union of the two.

Parameters:
other : DataFrame

Provided DataFrame to use to fill null values.

Returns:
combined : DataFrame

See also

DataFrame.combine
Perform series-wise operation on two DataFrames using a given function.

Examples

>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [None, 4]})  # doctest: +SKIP
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})  # doctest: +SKIP
>>> df1.combine_first(df2)  # doctest: +SKIP
     A    B
0  1.0  3.0
1  0.0  4.0

Null values still persist if the location of that null value does not exist in other.

>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [4, None]})  # doctest: +SKIP
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1]}, index=[1, 2])  # doctest: +SKIP
>>> df1.combine_first(df2)  # doctest: +SKIP
     A    B    C
0  NaN  4.0  NaN
1  0.0  3.0  1.0
2  NaN  3.0  1.0
compute(**kwargs)

Compute this dask collection

This turns a lazy Dask collection into its in-memory equivalent. For example, a Dask array turns into a NumPy array and a Dask DataFrame turns into a pandas DataFrame. The entire dataset must fit into memory before calling this operation.

Parameters:
scheduler : string, optional

Which scheduler to use like “threads”, “synchronous” or “processes”. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.

optimize_graph : bool, optional

If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.

kwargs

Extra keywords to forward to the scheduler function.

See also

dask.base.compute

copy()

Make a copy of the dataframe

This is strictly a shallow copy of the underlying computational graph. It does not affect the underlying data.

corr(method='pearson', min_periods=None, split_every=False)

Compute pairwise correlation of columns, excluding NA/null values.

This docstring was copied from pandas.core.frame.DataFrame.corr.

Some inconsistencies with the Dask version may exist.

Parameters:
method : {‘pearson’, ‘kendall’, ‘spearman’} or callable
  • pearson : standard correlation coefficient
  • kendall : Kendall Tau correlation coefficient
  • spearman : Spearman rank correlation
  • callable : callable that takes two 1d ndarrays as input and returns a float. New in version 0.24.0.
min_periods : int, optional

Minimum number of observations required per pair of columns to have a valid result. Currently only available for pearson and spearman correlation.

Returns:
y : DataFrame

See also

DataFrame.corrwith, Series.corr

Examples

>>> histogram_intersection = lambda a, b: np.minimum(a, b).sum().round(decimals=1)  # doctest: +SKIP
>>> df = pd.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],  # doctest: +SKIP
...                   columns=['dogs', 'cats'])
>>> df.corr(method=histogram_intersection)  # doctest: +SKIP
      dogs  cats
dogs   1.0   0.3
cats   0.3   1.0
count(axis=None, split_every=False)

Count non-NA cells for each column or row.

This docstring was copied from pandas.core.frame.DataFrame.count.

Some inconsistencies with the Dask version may exist.

The values None, NaN, NaT, and optionally numpy.inf (depending on pandas.options.mode.use_inf_as_na) are considered NA.

Parameters:
axis : {0 or ‘index’, 1 or ‘columns’}, default 0

If 0 or ‘index’ counts are generated for each column. If 1 or ‘columns’ counts are generated for each row.

level : int or str, optional (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DataFrame. A str specifies the level name.

numeric_only : boolean, default False (Not supported in Dask)

Include only float, int or boolean data.

Returns:
Series or DataFrame

For each column/row the number of non-NA/null entries. If level is specified returns a DataFrame.

See also

Series.count
Number of non-NA elements in a Series.
DataFrame.shape
Number of DataFrame rows and columns (including NA elements).
DataFrame.isna
Boolean same-sized DataFrame showing places of NA elements.

Examples

Constructing DataFrame from a dictionary:

>>> df = pd.DataFrame({"Person":  # doctest: +SKIP
...                    ["John", "Myla", "Lewis", "John", "Myla"],
...                    "Age": [24., np.nan, 21., 33, 26],
...                    "Single": [False, True, True, True, False]})
>>> df  # doctest: +SKIP
   Person   Age  Single
0    John  24.0   False
1    Myla   NaN    True
2   Lewis  21.0    True
3    John  33.0    True
4    Myla  26.0   False

Notice the uncounted NA values:

>>> df.count()  # doctest: +SKIP
Person    5
Age       4
Single    5
dtype: int64

Counts for each row:

>>> df.count(axis='columns')  # doctest: +SKIP
0    3
1    2
2    3
3    3
4    3
dtype: int64

Counts for one level of a MultiIndex:

>>> df.set_index(["Person", "Single"]).count(level="Person")  # doctest: +SKIP
        Age
Person
John      2
Lewis     1
Myla      1
cov(min_periods=None, split_every=False)

Compute pairwise covariance of columns, excluding NA/null values.

This docstring was copied from pandas.core.frame.DataFrame.cov.

Some inconsistencies with the Dask version may exist.

Compute the pairwise covariance among the series of a DataFrame. The returned data frame is the covariance matrix of the columns of the DataFrame.

Both NA and null values are automatically excluded from the calculation. (See the note below about bias from missing values.) A threshold can be set for the minimum number of observations for each value created. Comparisons with observations below this threshold will be returned as NaN.

This method is generally used for the analysis of time series data to understand the relationship between different measures across time.

Parameters:
min_periods : int, optional

Minimum number of observations required per pair of columns to have a valid result.

Returns:
DataFrame

The covariance matrix of the series of the DataFrame.

See also

pandas.Series.cov
Compute covariance with another Series.
pandas.core.window.EWM.cov
Exponential weighted sample covariance.
pandas.core.window.Expanding.cov
Expanding sample covariance.
pandas.core.window.Rolling.cov
Rolling sample covariance.

Notes

Returns the covariance matrix of the DataFrame’s time series. The covariance is normalized by N-1.

For DataFrames that have Series that are missing data (assuming that data is missing at random) the returned covariance matrix will be an unbiased estimate of the variance and covariance between the member Series.

However, for many applications this estimate may not be acceptable because the estimated covariance matrix is not guaranteed to be positive semi-definite. This could lead to estimated correlations having absolute values which are greater than one, and/or a non-invertible covariance matrix. See Estimation of covariance matrices for more details.
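The N-1 normalization can be checked by hand. The following pure-Python sketch uses the small dogs/cats data from this entry's examples; sample_cov is a hypothetical helper written for illustration, not a pandas or Dask function.

```python
def sample_cov(x, y):
    """Sample covariance of two equal-length sequences, normalized by N-1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


dogs = [1, 0, 2, 1]
cats = [2, 3, 0, 1]

# Covariance matrix in the same layout that DataFrame.cov returns.
cov_matrix = [
    [sample_cov(dogs, dogs), sample_cov(dogs, cats)],
    [sample_cov(cats, dogs), sample_cov(cats, cats)],
]
```

Rounded to six decimals, the entries are 0.666667, -1.000000 and 1.666667, which agrees with the first df.cov() output in the examples.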

Examples

>>> df = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)],  # doctest: +SKIP
...                   columns=['dogs', 'cats'])
>>> df.cov()  # doctest: +SKIP
          dogs      cats
dogs  0.666667 -1.000000
cats -1.000000  1.666667
>>> np.random.seed(42)  # doctest: +SKIP
>>> df = pd.DataFrame(np.random.randn(1000, 5),  # doctest: +SKIP
...                   columns=['a', 'b', 'c', 'd', 'e'])
>>> df.cov()  # doctest: +SKIP
          a         b         c         d         e
a  0.998438 -0.020161  0.059277 -0.008943  0.014144
b -0.020161  1.059352 -0.008543 -0.024738  0.009826
c  0.059277 -0.008543  1.010670 -0.001486 -0.000271
d -0.008943 -0.024738 -0.001486  0.921297 -0.013692
e  0.014144  0.009826 -0.000271 -0.013692  0.977795

Minimum number of periods

This method also supports an optional min_periods keyword that specifies the required minimum number of non-NA observations for each column pair in order to have a valid result:

>>> np.random.seed(42)  # doctest: +SKIP
>>> df = pd.DataFrame(np.random.randn(20, 3),  # doctest: +SKIP
...                   columns=['a', 'b', 'c'])
>>> df.loc[df.index[:5], 'a'] = np.nan  # doctest: +SKIP
>>> df.loc[df.index[5:10], 'b'] = np.nan  # doctest: +SKIP
>>> df.cov(min_periods=12)  # doctest: +SKIP
          a         b         c
a  0.316741       NaN -0.150812
b       NaN  1.248003  0.191417
c -0.150812  0.191417  0.895202
cummax(axis=None, skipna=True, out=None)

Return cumulative maximum over a DataFrame or Series axis.

This docstring was copied from pandas.core.frame.DataFrame.cummax.

Some inconsistencies with the Dask version may exist.

Returns a DataFrame or Series of the same size containing the cumulative maximum.

Parameters:
axis : {0 or ‘index’, 1 or ‘columns’}, default 0

The index or the name of the axis. 0 is equivalent to None or ‘index’.

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

*args, **kwargs :

Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:
cummax : Series or DataFrame

See also

core.window.Expanding.max
Similar functionality but ignores NaN values.
DataFrame.max
Return the maximum over DataFrame axis.
DataFrame.cummax
Return cumulative maximum over DataFrame axis.
DataFrame.cummin
Return cumulative minimum over DataFrame axis.
DataFrame.cumsum
Return cumulative sum over DataFrame axis.
DataFrame.cumprod
Return cumulative product over DataFrame axis.

Examples

Series

>>> s = pd.Series([2, np.nan, 5, -1, 0])  # doctest: +SKIP
>>> s  # doctest: +SKIP
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

By default, NA values are ignored.

>>> s.cummax()  # doctest: +SKIP
0    2.0
1    NaN
2    5.0
3    5.0
4    5.0
dtype: float64

To include NA values in the operation, use skipna=False

>>> s.cummax(skipna=False)  # doctest: +SKIP
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

DataFrame

>>> df = pd.DataFrame([[2.0, 1.0],  # doctest: +SKIP
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                    columns=list('AB'))
>>> df  # doctest: +SKIP
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

By default, iterates over rows and finds the maximum in each column. This is equivalent to axis=None or axis='index'.

>>> df.cummax()  # doctest: +SKIP
     A    B
0  2.0  1.0
1  3.0  NaN
2  3.0  1.0

To iterate over columns and find the maximum in each row, use axis=1

>>> df.cummax(axis=1)  # doctest: +SKIP
     A    B
0  2.0  2.0
1  3.0  NaN
2  1.0  1.0
cummin(axis=None, skipna=True, out=None)

Return cumulative minimum over a DataFrame or Series axis.

This docstring was copied from pandas.core.frame.DataFrame.cummin.

Some inconsistencies with the Dask version may exist.

Returns a DataFrame or Series of the same size containing the cumulative minimum.

Parameters:
axis : {0 or ‘index’, 1 or ‘columns’}, default 0

The index or the name of the axis. 0 is equivalent to None or ‘index’.

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

*args, **kwargs :

Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:
cummin : Series or DataFrame

See also

core.window.Expanding.min
Similar functionality but ignores NaN values.
DataFrame.min
Return the minimum over DataFrame axis.
DataFrame.cummax
Return cumulative maximum over DataFrame axis.
DataFrame.cummin
Return cumulative minimum over DataFrame axis.
DataFrame.cumsum
Return cumulative sum over DataFrame axis.
DataFrame.cumprod
Return cumulative product over DataFrame axis.

Examples

Series

>>> s = pd.Series([2, np.nan, 5, -1, 0])  # doctest: +SKIP
>>> s  # doctest: +SKIP
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

By default, NA values are ignored.

>>> s.cummin()  # doctest: +SKIP
0    2.0
1    NaN
2    2.0
3   -1.0
4   -1.0
dtype: float64

To include NA values in the operation, use skipna=False

>>> s.cummin(skipna=False)  # doctest: +SKIP
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

DataFrame

>>> df = pd.DataFrame([[2.0, 1.0],  # doctest: +SKIP
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                    columns=list('AB'))
>>> df  # doctest: +SKIP
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

By default, iterates over rows and finds the minimum in each column. This is equivalent to axis=None or axis='index'.

>>> df.cummin()  # doctest: +SKIP
     A    B
0  2.0  1.0
1  2.0  NaN
2  1.0  0.0

To iterate over columns and find the minimum in each row, use axis=1

>>> df.cummin(axis=1)  # doctest: +SKIP
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0
cumprod(axis=None, skipna=True, dtype=None, out=None)

Return cumulative product over a DataFrame or Series axis.

This docstring was copied from pandas.core.frame.DataFrame.cumprod.

Some inconsistencies with the Dask version may exist.

Returns a DataFrame or Series of the same size containing the cumulative product.

Parameters:
axis : {0 or ‘index’, 1 or ‘columns’}, default 0

The index or the name of the axis. 0 is equivalent to None or ‘index’.

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

*args, **kwargs :

Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:
cumprod : Series or DataFrame

See also

core.window.Expanding.prod
Similar functionality but ignores NaN values.
DataFrame.prod
Return the product over DataFrame axis.
DataFrame.cummax
Return cumulative maximum over DataFrame axis.
DataFrame.cummin
Return cumulative minimum over DataFrame axis.
DataFrame.cumsum
Return cumulative sum over DataFrame axis.
DataFrame.cumprod
Return cumulative product over DataFrame axis.

Examples

Series

>>> s = pd.Series([2, np.nan, 5, -1, 0])  # doctest: +SKIP
>>> s  # doctest: +SKIP
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

By default, NA values are ignored.

>>> s.cumprod()  # doctest: +SKIP
0     2.0
1     NaN
2    10.0
3   -10.0
4    -0.0
dtype: float64

To include NA values in the operation, use skipna=False

>>> s.cumprod(skipna=False)  # doctest: +SKIP
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

DataFrame

>>> df = pd.DataFrame([[2.0, 1.0],  # doctest: +SKIP
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                    columns=list('AB'))
>>> df  # doctest: +SKIP
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

By default, iterates over rows and finds the product in each column. This is equivalent to axis=None or axis='index'.

>>> df.cumprod()  # doctest: +SKIP
     A    B
0  2.0  1.0
1  6.0  NaN
2  6.0  0.0

To iterate over columns and find the product in each row, use axis=1

>>> df.cumprod(axis=1)  # doctest: +SKIP
     A    B
0  2.0  2.0
1  3.0  NaN
2  1.0  0.0
cumsum(axis=None, skipna=True, dtype=None, out=None)

Return cumulative sum over a DataFrame or Series axis.

This docstring was copied from pandas.core.frame.DataFrame.cumsum.

Some inconsistencies with the Dask version may exist.

Returns a DataFrame or Series of the same size containing the cumulative sum.

Parameters:
axis : {0 or ‘index’, 1 or ‘columns’}, default 0

The index or the name of the axis. 0 is equivalent to None or ‘index’.

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

*args, **kwargs :

Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:
cumsum : Series or DataFrame

See also

core.window.Expanding.sum
Similar functionality but ignores NaN values.
DataFrame.sum
Return the sum over DataFrame axis.
DataFrame.cummax
Return cumulative maximum over DataFrame axis.
DataFrame.cummin
Return cumulative minimum over DataFrame axis.
DataFrame.cumsum
Return cumulative sum over DataFrame axis.
DataFrame.cumprod
Return cumulative product over DataFrame axis.

Examples

Series

>>> s = pd.Series([2, np.nan, 5, -1, 0])  # doctest: +SKIP
>>> s  # doctest: +SKIP
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

By default, NA values are ignored.

>>> s.cumsum()  # doctest: +SKIP
0    2.0
1    NaN
2    7.0
3    6.0
4    6.0
dtype: float64

To include NA values in the operation, use skipna=False

>>> s.cumsum(skipna=False)  # doctest: +SKIP
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

DataFrame

>>> df = pd.DataFrame([[2.0, 1.0],  # doctest: +SKIP
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                    columns=list('AB'))
>>> df  # doctest: +SKIP
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

By default, iterates over rows and finds the sum in each column. This is equivalent to axis=None or axis='index'.

>>> df.cumsum()  # doctest: +SKIP
     A    B
0  2.0  1.0
1  5.0  NaN
2  6.0  1.0

To iterate over columns and find the sum in each row, use axis=1

>>> df.cumsum(axis=1)  # doctest: +SKIP
     A    B
0  2.0  3.0
1  3.0  NaN
2  1.0  1.0
describe(split_every=False, percentiles=None, percentiles_method='default', include=None, exclude=None)

Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

This docstring was copied from pandas.core.frame.DataFrame.describe.

Some inconsistencies with the Dask version may exist.

Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.

Parameters:
percentiles : list-like of numbers, optional

The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.

include : ‘all’, list-like of dtypes or None (default), optional

A white list of data types to include in the result. Ignored for Series. Here are the options:

  • ‘all’ : All columns of the input will be included in the output.
  • A list-like of dtypes : Limits the results to the provided data types. To limit the result to numeric types submit numpy.number. To limit it instead to object columns submit the numpy.object data type. Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To select pandas categorical columns, use 'category'
  • None (default) : The result will include all numeric columns.
exclude : list-like of dtypes or None (default), optional

A black list of data types to omit from the result. Ignored for Series. Here are the options:

  • A list-like of dtypes : Excludes the provided data types from the result. To exclude numeric types submit numpy.number. To exclude object columns submit the data type numpy.object. Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])). To exclude pandas categorical columns, use 'category'
  • None (default) : The result will exclude nothing.
Returns:
Series or DataFrame

Summary statistics of the Series or Dataframe provided.

See also

DataFrame.count
Count number of non-NA/null observations.
DataFrame.max
Maximum of the values in the object.
DataFrame.min
Minimum of the values in the object.
DataFrame.mean
Mean of the values.
DataFrame.std
Standard deviation of the observations.
DataFrame.select_dtypes
Subset of a DataFrame including/excluding columns based on their dtype.

Notes

For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.

For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.

If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.

For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the dataframe consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.

The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.

Examples

Describing a numeric Series.

>>> s = pd.Series([1, 2, 3])  # doctest: +SKIP
>>> s.describe()  # doctest: +SKIP
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64

Describing a categorical Series.

>>> s = pd.Series(['a', 'a', 'b', 'c'])  # doctest: +SKIP
>>> s.describe()  # doctest: +SKIP
count     4
unique    3
top       a
freq      2
dtype: object

Describing a timestamp Series.

>>> s = pd.Series([  # doctest: +SKIP
...   np.datetime64("2000-01-01"),
...   np.datetime64("2010-01-01"),
...   np.datetime64("2010-01-01")
... ])
>>> s.describe()  # doctest: +SKIP
count                       3
unique                      2
top       2010-01-01 00:00:00
freq                        2
first     2000-01-01 00:00:00
last      2010-01-01 00:00:00
dtype: object

Describing a DataFrame. By default only numeric fields are returned.

>>> df = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']),  # doctest: +SKIP
...                    'numeric': [1, 2, 3],
...                    'object': ['a', 'b', 'c']
...                   })
>>> df.describe()  # doctest: +SKIP
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Describing all columns of a DataFrame regardless of data type.

>>> df.describe(include='all')  # doctest: +SKIP
        categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      c
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

Describing a column from a DataFrame by accessing it as an attribute.

>>> df.numeric.describe()  # doctest: +SKIP
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64

Including only numeric columns in a DataFrame description.

>>> df.describe(include=[np.number])  # doctest: +SKIP
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Including only string columns in a DataFrame description.

>>> df.describe(include=[object])  # doctest: +SKIP
       object
count       3
unique      3
top         c
freq        1

Including only categorical columns from a DataFrame description.

>>> df.describe(include=['category'])  # doctest: +SKIP
       categorical
count            3
unique           3
top              f
freq             1

Excluding numeric columns from a DataFrame description.

>>> df.describe(exclude=[np.number])  # doctest: +SKIP
       categorical object
count            3      3
unique           3      3
top              f      c
freq             1      1

Excluding object columns from a DataFrame description.

>>> df.describe(exclude=[object])  # doctest: +SKIP
       categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0
diff(periods=1, axis=0)

First discrete difference of element.

This docstring was copied from pandas.core.frame.DataFrame.diff.

Some inconsistencies with the Dask version may exist.

Note

Pandas currently uses an object-dtype column to represent boolean data with missing values. This can cause issues for boolean-specific operations, like |. To enable boolean-specific operations, at the cost of metadata that doesn’t match pandas, use .astype(bool) after the shift.

Calculates the difference of a DataFrame element compared with another element in the DataFrame (default is the element in the same column of the previous row).

Parameters:
periods : int, default 1

Periods to shift for calculating difference, accepts negative values.

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

Take difference over rows (0) or columns (1).

New in version 0.16.1.

Returns:
diffed : DataFrame

See also

Series.diff
First discrete difference for a Series.
DataFrame.pct_change
Percent change over given number of periods.
DataFrame.shift
Shift index by desired number of periods with an optional time freq.

Examples

Difference with previous row

>>> df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],  # doctest: +SKIP
...                    'b': [1, 1, 2, 3, 5, 8],
...                    'c': [1, 4, 9, 16, 25, 36]})
>>> df  # doctest: +SKIP
   a  b   c
0  1  1   1
1  2  1   4
2  3  2   9
3  4  3  16
4  5  5  25
5  6  8  36
>>> df.diff()  # doctest: +SKIP
     a    b     c
0  NaN  NaN   NaN
1  1.0  0.0   3.0
2  1.0  1.0   5.0
3  1.0  1.0   7.0
4  1.0  2.0   9.0
5  1.0  3.0  11.0

Difference with previous column

>>> df.diff(axis=1)  # doctest: +SKIP
    a    b     c
0 NaN  0.0   0.0
1 NaN -1.0   3.0
2 NaN -1.0   7.0
3 NaN -1.0  13.0
4 NaN  0.0  20.0
5 NaN  2.0  28.0

Difference with 3rd previous row

>>> df.diff(periods=3)  # doctest: +SKIP
     a    b     c
0  NaN  NaN   NaN
1  NaN  NaN   NaN
2  NaN  NaN   NaN
3  3.0  2.0  15.0
4  3.0  4.0  21.0
5  3.0  6.0  27.0

Difference with following row

>>> df.diff(periods=-1)  # doctest: +SKIP
     a    b     c
0 -1.0  0.0  -3.0
1 -1.0 -1.0  -5.0
2 -1.0 -1.0  -7.0
3 -1.0 -2.0  -9.0
4 -1.0 -3.0 -11.0
5  NaN  NaN   NaN
div(other, axis='columns', level=None, fill_value=None)

Floating division of dataframe and other, element-wise (binary operator truediv).

Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rtruediv.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters:
other : scalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axis : {0 or ‘index’, 1 or ‘columns’}

Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

level : int or label

Broadcast across a level, matching Index values on the passed MultiIndex level.

fill_value : float or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame

Result of the arithmetic operation.

See also

DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  # doctest: +SKIP
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  # doctest: +SKIP
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  # doctest: +SKIP
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)  # doctest: +SKIP
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  # doctest: +SKIP
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  # doctest: +SKIP
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  # doctest: +SKIP
           angles
circle          0
triangle        3
rectangle       4
>>> df * other  # doctest: +SKIP
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)  # doctest: +SKIP
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  # doctest: +SKIP
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  # doctest: +SKIP
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)  # doctest: +SKIP
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
divide(other, axis='columns', level=None, fill_value=None)

Floating division of dataframe and other, element-wise (binary operator truediv).

Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rtruediv.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters:
other : scalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axis : {0 or ‘index’, 1 or ‘columns’}

Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

level : int or label

Broadcast across a level, matching Index values on the passed MultiIndex level.

fill_value : float or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame

Result of the arithmetic operation.

See also

DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  # doctest: +SKIP
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  # doctest: +SKIP
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  # doctest: +SKIP
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)  # doctest: +SKIP
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  # doctest: +SKIP
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  # doctest: +SKIP
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  # doctest: +SKIP
           angles
circle          0
triangle        3
rectangle       4
>>> df * other  # doctest: +SKIP
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)  # doctest: +SKIP
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  # doctest: +SKIP
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  # doctest: +SKIP
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)  # doctest: +SKIP
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
drop(labels=None, axis=0, columns=None, errors='raise')

Drop specified labels from rows or columns.

This docstring was copied from pandas.core.frame.DataFrame.drop.

Some inconsistencies with the Dask version may exist.

Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level.

Parameters:
labels : single label or list-like

Index or column labels to drop.

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).

index, columns : single label or list-like

Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).

New in version 0.21.0.

level : int or level name, optional (Not supported in Dask)

For MultiIndex, level from which the labels will be removed.

inplace : bool, default False (Not supported in Dask)

If True, do operation inplace and return None.

errors : {‘ignore’, ‘raise’}, default ‘raise’

If ‘ignore’, suppress error and only existing labels are dropped.

Returns:
dropped : pandas.DataFrame
Raises:
KeyError

If any of the labels is not found in the selected axis.

See also

DataFrame.loc
Label-location based indexer for selection by label.
DataFrame.dropna
Return DataFrame with labels on given axis omitted where (all or any) data are missing.
DataFrame.drop_duplicates
Return DataFrame with duplicate rows removed, optionally only considering certain columns.
Series.drop
Return Series with specified index labels removed.

Examples

>>> df = pd.DataFrame(np.arange(12).reshape(3,4),  # doctest: +SKIP
...                   columns=['A', 'B', 'C', 'D'])
>>> df  # doctest: +SKIP
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

Drop columns

>>> df.drop(['B', 'C'], axis=1)  # doctest: +SKIP
   A   D
0  0   3
1  4   7
2  8  11
>>> df.drop(columns=['B', 'C'])  # doctest: +SKIP
   A   D
0  0   3
1  4   7
2  8  11

Drop rows by index

>>> df.drop([0, 1])  # doctest: +SKIP
   A  B   C   D
2  8  9  10  11

Drop columns and/or rows of MultiIndex DataFrame

>>> midx = pd.MultiIndex(levels=[['lama', 'cow', 'falcon'],  # doctest: +SKIP
...                              ['speed', 'weight', 'length']],
...                      codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2],
...                             [0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> df = pd.DataFrame(index=midx, columns=['big', 'small'],  # doctest: +SKIP
...                   data=[[45, 30], [200, 100], [1.5, 1], [30, 20],
...                         [250, 150], [1.5, 0.8], [320, 250],
...                         [1, 0.8], [0.3,0.2]])
>>> df  # doctest: +SKIP
                big     small
lama    speed   45.0    30.0
        weight  200.0   100.0
        length  1.5     1.0
cow     speed   30.0    20.0
        weight  250.0   150.0
        length  1.5     0.8
falcon  speed   320.0   250.0
        weight  1.0     0.8
        length  0.3     0.2
>>> df.drop(index='cow', columns='small')  # doctest: +SKIP
                big
lama    speed   45.0
        weight  200.0
        length  1.5
falcon  speed   320.0
        weight  1.0
        length  0.3
>>> df.drop(index='length', level=1)  # doctest: +SKIP
                big     small
lama    speed   45.0    30.0
        weight  200.0   100.0
cow     speed   30.0    20.0
        weight  250.0   150.0
falcon  speed   320.0   250.0
        weight  1.0     0.8
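
The errors parameter described above decides what happens when a requested label is absent. The plain-Python sketch below models that behaviour for column labels only. The helper `drop_columns` and the dict-of-lists table are hypothetical stand-ins for illustration, not the pandas or Dask implementation.

```python
def drop_columns(table, labels, errors="raise"):
    """Return a copy of `table` (dict: column name -> list) without `labels`."""
    if errors not in ("raise", "ignore"):
        raise ValueError("errors must be 'raise' or 'ignore'")
    # Labels that do not exist in the table.
    missing = [label for label in labels if label not in table]
    if missing and errors == "raise":
        raise KeyError(f"{missing} not found in axis")
    to_drop = set(labels)
    # With errors='ignore', only the labels that exist are dropped.
    return {name: list(values) for name, values in table.items()
            if name not in to_drop}


table = {"A": [0, 4, 8], "B": [1, 5, 9], "C": [2, 6, 10], "D": [3, 7, 11]}
kept = drop_columns(table, ["B", "C"])
lenient = drop_columns(table, ["B", "Z"], errors="ignore")
```

Here `kept` holds only columns A and D, while `lenient` silently skips the unknown label Z and drops B.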
drop_duplicates(subset=None, split_every=None, split_out=1, **kwargs)

Return DataFrame with duplicate rows removed, optionally only considering certain columns.

This docstring was copied from pandas.core.frame.DataFrame.drop_duplicates.

Some inconsistencies with the Dask version may exist.

Parameters:
subset : column label or sequence of labels, optional

Only consider certain columns for identifying duplicates; by default, use all of the columns.

keep : {‘first’, ‘last’, False}, default ‘first’ (Not supported in Dask)
  • first : Drop duplicates except for the first occurrence.
  • last : Drop duplicates except for the last occurrence.
  • False : Drop all duplicates.
inplace : boolean, default False (Not supported in Dask)

Whether to drop duplicates in place or to return a copy

Returns:
deduplicated : DataFrame
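
As a rough illustration of how subset changes which rows count as duplicates, the plain-Python sketch below keeps the first row seen for each distinct key. It models only the default 'first' behaviour (keep is listed above as not supported in Dask). The helper `drop_duplicate_rows` and the list-of-dicts layout are assumptions made for this example.

```python
def drop_duplicate_rows(rows, subset=None):
    """Keep the first row for each distinct key; `rows` is a list of dicts."""
    seen = set()
    unique = []
    for row in rows:
        # Without a subset, every column takes part in the comparison.
        columns = subset if subset is not None else sorted(row)
        key = tuple(row[column] for column in columns)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    {"name": "a", "x": 1},
    {"name": "a", "x": 2},
    {"name": "a", "x": 1},
]
by_all_columns = drop_duplicate_rows(rows)
by_name_only = drop_duplicate_rows(rows, subset=["name"])
```

Comparing all columns keeps the first two rows; restricting the comparison to name keeps only the first row.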
dropna(how='any', subset=None, thresh=None)

Remove missing values.

This docstring was copied from pandas.core.frame.DataFrame.dropna.

Some inconsistencies with the Dask version may exist.

See the User Guide for more on which values are considered missing, and how to work with missing data.

Parameters:
axis : {0 or ‘index’, 1 or ‘columns’}, default 0 (Not supported in Dask)

Determine if rows or columns which contain missing values are removed.

  • 0, or ‘index’ : Drop rows which contain missing values.
  • 1, or ‘columns’ : Drop columns which contain missing values.

Deprecated since version 0.23.0: Pass tuple or list to drop on multiple axes. Only a single axis is allowed.

how : {‘any’, ‘all’}, default ‘any’

Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.

  • ‘any’ : If any NA values are present, drop that row or column.
  • ‘all’ : If all values are NA, drop that row or column.
thresh : int, optional

Require that many non-NA values.

subset : array-like, optional

Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.

inplace : bool, default False (Not supported in Dask)

If True, do operation inplace and return None.

Returns:
DataFrame

DataFrame with NA entries dropped from it.

See also

DataFrame.isna
Indicate missing values.
DataFrame.notna
Indicate existing (non-missing) values.
DataFrame.fillna
Replace missing values.
Series.dropna
Drop missing values.
Index.dropna
Drop missing indices.

Examples

>>> df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],  # doctest: +SKIP
...                    "toy": [np.nan, 'Batmobile', 'Bullwhip'],
...                    "born": [pd.NaT, pd.Timestamp("1940-04-25"),
...                             pd.NaT]})
>>> df  # doctest: +SKIP
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

Drop the rows where at least one element is missing.

>>> df.dropna()  # doctest: +SKIP
     name        toy       born
1  Batman  Batmobile 1940-04-25

Drop the columns where at least one element is missing.

>>> df.dropna(axis='columns')  # doctest: +SKIP
       name
0    Alfred
1    Batman
2  Catwoman

Drop the rows where all elements are missing.

>>> df.dropna(how='all')  # doctest: +SKIP
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

Keep only the rows with at least 2 non-NA values.

>>> df.dropna(thresh=2)  # doctest: +SKIP
       name        toy       born
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

Define in which columns to look for missing values.

>>> df.dropna(subset=['name', 'born'])  # doctest: +SKIP
       name        toy       born
1    Batman  Batmobile 1940-04-25

Keep the DataFrame with valid entries in the same variable.

>>> df.dropna(inplace=True)  # doctest: +SKIP
>>> df  # doctest: +SKIP
     name        toy       born
1  Batman  Batmobile 1940-04-25
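
The interaction of how, thresh and subset for row-wise dropping can be modelled in plain Python. The sketch below is not the pandas or Dask implementation: `drop_missing_rows` and `is_missing` are hypothetical helpers, missing values are represented by None or NaN, and thresh (when given) is assumed to take precedence over how.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_missing_rows(rows, how="any", thresh=None, subset=None):
    kept = []
    for row in rows:
        columns = subset if subset is not None else list(row)
        present = sum(not is_missing(row[column]) for column in columns)
        if thresh is not None:
            keep = present >= thresh         # require that many non-NA values
        elif how == "any":
            keep = present == len(columns)   # drop if any value is missing
        elif how == "all":
            keep = present > 0               # drop only if every value is missing
        else:
            raise ValueError("how must be 'any' or 'all'")
        if keep:
            kept.append(row)
    return kept


rows = [
    {"name": "Alfred", "toy": float("nan"), "born": None},
    {"name": "Batman", "toy": "Batmobile", "born": "1940-04-25"},
    {"name": "Catwoman", "toy": "Bullwhip", "born": None},
]
names_any = [r["name"] for r in drop_missing_rows(rows)]
names_all = [r["name"] for r in drop_missing_rows(rows, how="all")]
names_thresh = [r["name"] for r in drop_missing_rows(rows, thresh=2)]
names_subset = [r["name"] for r in drop_missing_rows(rows, subset=["name", "born"])]
```

The four results mirror the row-wise examples above: only Batman survives the default call, all rows survive how='all', and thresh=2 keeps Batman and Catwoman.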
dtypes

Return data types

eq(other, axis='columns', level=None)

Equal to of dataframe and other, element-wise (binary operator eq).

Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.

Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.

Parameters:
other : scalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axis : {0 or ‘index’, 1 or ‘columns’}, default ‘columns’

Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).

level : int or label

Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns:
DataFrame of bool

Result of the comparison.

See also

DataFrame.eq
Compare DataFrames for equality elementwise.
DataFrame.ne
Compare DataFrames for inequality elementwise.
DataFrame.le
Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt
Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge
Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt
Compare DataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
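
The rule that NaN values are considered different follows ordinary floating-point comparison, which can be checked in plain Python without pandas or Dask. The lists below are hypothetical stand-ins for two columns.

```python
import math

nan = float("nan")

# NaN never compares equal, not even to itself.
nan_equals_itself = nan == nan
nan_differs_from_itself = nan != nan

# Element-wise comparison of two columns that both hold NaN in one slot.
left = [1.0, nan, 3.0]
right = [1.0, nan, 4.0]
elementwise_eq = [a == b for a, b in zip(left, right)]

# To detect missing values, test for NaN explicitly instead of using ==.
missing_mask = [math.isnan(a) for a in left]
```

The middle element of `elementwise_eq` is False even though both inputs are NaN, which is why missing values are usually detected with a dedicated check such as isna rather than eq.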

Examples

>>> df = pd.DataFrame({'cost': [250, 150, 100],  # doctest: +SKIP
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df  # doctest: +SKIP
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100  # doctest: +SKIP
    cost  revenue
A  False     True
B  False    False
C   True    False
>>> df.eq(100)  # doctest: +SKIP
    cost  revenue
A  False     True
B  False    False
C   True    False

When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])  # doctest: +SKIP
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')  # doctest: +SKIP
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:

>>> df == [250, 100]  # doctest: +SKIP
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')  # doctest: +SKIP
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},  # doctest: +SKIP
...                      index=['A', 'B', 'C', 'D'])
>>> other  # doctest: +SKIP
   revenue
A      300
B      250
C      100
D      150
>>> df.gt(other)  # doctest: +SKIP
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],  # doctest: +SKIP
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex  # doctest: +SKIP
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)  # doctest: +SKIP
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
eval(expr, inplace=None, **kwargs)

Evaluate a string describing operations on DataFrame columns.

This docstring was copied from pandas.core.frame.DataFrame.eval.

Some inconsistencies with the Dask version may exist.

Operates on columns only, not specific rows or elements. This allows eval to run arbitrary code, which can make you vulnerable to code injection if you pass user input to this function.

Parameters:
expr : str

The expression string to evaluate.

inplace : bool, default False

If the expression contains an assignment, whether to perform the operation inplace and mutate the existing DataFrame. Otherwise, a new DataFrame is returned.

New in version 0.18.0.

kwargs : dict

See the documentation for eval() for complete details on the keyword arguments accepted by query().

Returns:
ndarray, scalar, or pandas object

The result of the evaluation.

See also

DataFrame.query
Evaluates a boolean expression to query the columns of a frame.
DataFrame.assign
Can evaluate an expression or function to create new values for a column.
pandas.eval
Evaluate a Python expression as a string using various backends.

Notes

For more details see the API documentation for eval(). For detailed examples see enhancing performance with eval.

Examples

>>> df = pd.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2)})  # doctest: +SKIP
>>> df  # doctest: +SKIP
   A   B
0  1  10
1  2   8
2  3   6
3  4   4
4  5   2
>>> df.eval('A + B')  # doctest: +SKIP
0    11
1    10
2     9
3     8
4     7
dtype: int64

Assignment is allowed though by default the original DataFrame is not modified.

>>> df.eval('C = A + B')  # doctest: +SKIP
   A   B   C
0  1  10  11
1  2   8  10
2  3   6   9
3  4   4   8
4  5   2   7
>>> df  # doctest: +SKIP
   A   B
0  1  10
1  2   8
2  3   6
3  4   4
4  5   2

Use inplace=True to modify the original DataFrame.

>>> df.eval('C = A + B', inplace=True)  # doctest: +SKIP
>>> df  # doctest: +SKIP
   A   B   C
0  1  10  11
1  2   8  10
2  3   6   9
3  4   4   8
4  5   2   7
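
Because eval can run arbitrary code, one possible precaution is to check untrusted expressions before passing them on. The sketch below is an assumption of this example rather than anything pandas or Dask provides: the hypothetical helper `is_safe_expression` uses the standard-library ast module to accept only basic arithmetic on known column names and numeric literals.

```python
import ast

# Node types permitted in an expression; everything else is rejected.
ALLOWED_NODES = (
    ast.Expression, ast.BinOp, ast.UnaryOp, ast.Name, ast.Load, ast.Constant,
    ast.Add, ast.Sub, ast.Mult, ast.Div, ast.USub, ast.UAdd,
)


def is_safe_expression(expr, columns):
    """Return True if `expr` only does arithmetic on `columns` and numbers."""
    try:
        tree = ast.parse(expr, mode="eval")
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            return False  # e.g. function calls or attribute access
        if isinstance(node, ast.Name) and node.id not in columns:
            return False  # unknown identifier
        if isinstance(node, ast.Constant) and not isinstance(node.value, (int, float)):
            return False  # string or other non-numeric literal
    return True


columns = {"A", "B"}
ok_simple = is_safe_expression("A + B", columns)
ok_scaled = is_safe_expression("A + B * 2", columns)
bad_call = is_safe_expression("__import__('os').system('ls')", columns)
bad_name = is_safe_expression("A + C", columns)
```

This check is deliberately conservative: it also rejects assignment forms such as 'C = A + B', so a real application would need to decide which constructs it wants to permit.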
ffill(axis=None, limit=None)

Synonym for DataFrame.fillna() with method='ffill'.

fillna(value=None, method=None, limit=None, axis=None)

Fill NA/NaN values using the specified method.

This docstring was copied from pandas.core.frame.DataFrame.fillna.

Some inconsistencies with the Dask version may exist.

Parameters:
value : scalar, dict, Series, or DataFrame

Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). (values not in the dict/Series/DataFrame will not be filled). This value cannot be a list.

method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None

Method to use for filling holes in reindexed Series:

  • pad / ffill : Propagate the last valid observation forward to the next valid one.
  • backfill / bfill : Use the next valid observation to fill the gap.

axis : {0 or ‘index’, 1 or ‘columns’}
inplace : boolean, default False (Not supported in Dask)

If True, fill in place. Note: this will modify any other views on this object (e.g. a no-copy slice for a column in a DataFrame).

limit : int, default None

If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.

downcast : dict, default is None (Not supported in Dask)

a dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible)

Returns:
filled : DataFrame

See also

interpolate
Fill NaN values using interpolation.

reindex, asfreq

Examples

>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],  # doctest: +SKIP
...                    [3, 4, np.nan, 1],
...                    [np.nan, np.nan, np.nan, 5],
...                    [np.nan, 3, np.nan, 4]],
...                    columns=list('ABCD'))
>>> df  # doctest: +SKIP
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4

Replace all NaN elements with 0s.

>>> df.fillna(0)  # doctest: +SKIP
    A   B   C   D
0   0.0 2.0 0.0 0
1   3.0 4.0 0.0 1
2   0.0 0.0 0.0 5
3   0.0 3.0 0.0 4

We can also propagate non-null values forward or backward.

>>> df.fillna(method='ffill')  # doctest: +SKIP
    A   B   C   D
0   NaN 2.0 NaN 0
1   3.0 4.0 NaN 1
2   3.0 4.0 NaN 5
3   3.0 3.0 NaN 4

Replace all NaN elements in column ‘A’, ‘B’, ‘C’, and ‘D’, with 0, 1, 2, and 3 respectively.

>>> values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}  # doctest: +SKIP
>>> df.fillna(value=values)  # doctest: +SKIP
    A   B   C   D
0   0.0 2.0 2.0 0
1   3.0 4.0 2.0 1
2   0.0 1.0 2.0 5
3   0.0 3.0 2.0 4

Only replace the first NaN element.

>>> df.fillna(value=values, limit=1)  # doctest: +SKIP
    A   B   C   D
0   0.0 2.0 2.0 0
1   3.0 4.0 NaN 1
2   NaN 1.0 NaN 5
3   NaN 3.0 NaN 4
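
The description of limit for method-based filling (a gap longer than limit is only partially filled) can be made concrete with a plain-Python model of forward fill on a single column. The helper `forward_fill` is hypothetical and only sketches the semantics; it is not how pandas or Dask implement fillna.

```python
import math


def forward_fill(values, limit=None):
    """Propagate the last valid value forward, at most `limit` times per gap."""
    filled = []
    last_valid = None
    run = 0  # consecutive NaNs seen since the last valid value
    for value in values:
        if isinstance(value, float) and math.isnan(value):
            if last_valid is not None and (limit is None or run < limit):
                filled.append(last_valid)
            else:
                filled.append(value)  # nothing to propagate, or limit reached
            run += 1
        else:
            last_valid = value
            run = 0
            filled.append(value)
    return filled


nan = float("nan")
column_a = [nan, 3.0, nan, nan]  # column 'A' from the example above
unlimited = forward_fill(column_a)
limited = forward_fill(column_a, limit=1)
```

Without a limit, both trailing NaNs are replaced by 3.0; with limit=1 only the first NaN of the gap is filled. The leading NaN stays in both cases because there is no earlier value to propagate.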
first(offset)

Convenience method for subsetting initial periods of time series data based on a date offset.

This docstring was copied from pandas.core.frame.DataFrame.first.

Some inconsistencies with the Dask version may exist.

Parameters:
offset : string, DateOffset, dateutil.relativedelta
Returns:
subset : same type as caller
Raises:
TypeError

If the index is not a DatetimeIndex

See also

last
Select final periods of time series based on a date offset.
at_time
Select values at a particular time of the day.
between_time
Select values between particular times of the day.

Examples

>>> i = pd.date_range('2018-04-09', periods=4, freq='2D')  # doctest: +SKIP
>>> ts = pd.DataFrame({'A': [1,2,3,4]}, index=i)  # doctest: +SKIP
>>> ts  # doctest: +SKIP
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4

Get the rows for the first 3 days:

>>> ts.first('3D')  # doctest: +SKIP
            A
2018-04-09  1
2018-04-11  2

Notice that the data for the first 3 calendar days were returned, not the first 3 days observed in the dataset, and therefore data for 2018-04-13 was not returned.
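
The difference between calendar days and observed days can be sketched with the standard-library datetime module. The helper `first_days` is hypothetical and assumes a fixed-length offset measured in whole days and rows already sorted by date.

```python
from datetime import date, timedelta


def first_days(rows, days):
    """rows: list of (date, value) pairs sorted by date."""
    if not rows:
        return []
    # The window is measured from the first date, not by counting rows.
    cutoff = rows[0][0] + timedelta(days=days)
    return [(day, value) for day, value in rows if day < cutoff]


rows = [
    (date(2018, 4, 9), 1),
    (date(2018, 4, 11), 2),
    (date(2018, 4, 13), 3),
    (date(2018, 4, 15), 4),
]
first_three = first_days(rows, 3)
```

The cutoff is 2018-04-12, so only the rows for 2018-04-09 and 2018-04-11 fall inside the window, matching the example above.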

floordiv(other, axis='columns', level=None, fill_value=None)

Integer division of dataframe and other, element-wise (binary operator floordiv).

Equivalent to dataframe // other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rfloordiv.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters:
other : scalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axis : {0 or ‘index’, 1 or ‘columns’}

Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

level : int or label

Broadcast across a level, matching Index values on the passed MultiIndex level.

fill_value : float or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame

Result of the arithmetic operation.

See also

DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  # doctest: +SKIP
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  # doctest: +SKIP
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  # doctest: +SKIP
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)  # doctest: +SKIP
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  # doctest: +SKIP
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  # doctest: +SKIP
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  # doctest: +SKIP
           angles
circle          0
triangle        3
rectangle       4
>>> df * other  # doctest: +SKIP
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)  # doctest: +SKIP
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  # doctest: +SKIP
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  # doctest: +SKIP
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)  # doctest: +SKIP
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
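
The examples above are shared by all arithmetic wrappers and do not show floor division itself. As a rough model of how fill_value interacts with the union of mismatched labels, the plain-Python sketch below floor-divides two label-to-number mappings. The helper `floordiv_with_fill` is hypothetical, missing values are represented by None, and division by zero is not handled.

```python
def floordiv_with_fill(left, right, fill_value=None):
    """Floor-divide two label -> number dicts over the union of their labels."""
    result = {}
    for label in sorted(set(left) | set(right)):
        a = left.get(label)   # None if the label is absent or the value is missing
        b = right.get(label)
        if a is None and b is None:
            result[label] = None          # missing on both sides stays missing
        elif fill_value is None and (a is None or b is None):
            result[label] = None          # no fill value: one missing side wins
        else:
            a = fill_value if a is None else a
            b = fill_value if b is None else b
            result[label] = a // b
    return result


left = {"circle": 7, "triangle": 9, "square": None}
right = {"circle": 2, "rectangle": 4, "square": None}
plain = floordiv_with_fill(left, right)
filled = floordiv_with_fill(left, right, fill_value=1)
```

With fill_value=1, the label present only on the left ('triangle') is divided by 1 and the label present only on the right ('rectangle') becomes 1 // 4, while 'square' stays missing because both inputs lack a value.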
ge(other, axis='columns', level=None)

Greater than or equal to of dataframe and other, element-wise (binary operator ge).

Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.

Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.

Parameters:
other : scalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axis : {0 or ‘index’, 1 or ‘columns’}, default ‘columns’

Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).

level : int or label

Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns:
DataFrame of bool

Result of the comparison.

See also

DataFrame.eq
Compare DataFrames for equality elementwise.
DataFrame.ne
Compare DataFrames for inequality elementwise.
DataFrame.le
Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt
Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge
Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt
Compare DataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).

Examples

>>> df = pd.DataFrame({'cost': [250, 150, 100],  # doctest: +SKIP
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df  # doctest: +SKIP
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100  # doctest: +SKIP
    cost  revenue
A  False     True
B  False    False
C   True    False
>>> df.eq(100)  # doctest: +SKIP
    cost  revenue
A  False     True
B  False    False
C   True    False

When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])  # doctest: +SKIP
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')  # doctest: +SKIP
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:

>>> df == [250, 100]  # doctest: +SKIP
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')  # doctest: +SKIP
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},  # doctest: +SKIP
...                      index=['A', 'B', 'C', 'D'])
>>> other  # doctest: +SKIP
   revenue
A      300
B      250
C      100
D      150
>>> df.gt(other)  # doctest: +SKIP
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],  # doctest: +SKIP
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex  # doctest: +SKIP
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)  # doctest: +SKIP
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
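
The shared examples above use le for the level-based comparison. To show the same broadcasting idea for ge, the plain-Python sketch below compares one flat column against a two-level mapping by matching on the inner label. The helper `ge_by_level` and the dict layout are assumptions for illustration, and every inner label is assumed to exist in the flat column.

```python
def ge_by_level(flat, multi):
    """flat: label -> value; multi: (outer, label) -> value."""
    # Each flat value is reused for every outer key that shares its inner label.
    return {key: flat[key[1]] >= value for key, value in multi.items()}


cost = {"A": 250, "B": 150, "C": 100}
multi_cost = {
    ("Q1", "A"): 250, ("Q1", "B"): 150, ("Q1", "C"): 100,
    ("Q2", "A"): 150, ("Q2", "B"): 300, ("Q2", "C"): 220,
}
result = ge_by_level(cost, multi_cost)
```

All Q1 entries are True because the values are equal, while for Q2 only label A satisfies the comparison (250 >= 150).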
get_dtype_counts()

Return counts of unique dtypes in this object.

This docstring was copied from pandas.core.frame.DataFrame.get_dtype_counts.

Some inconsistencies with the Dask version may exist.

Returns:
dtype : Series

Series with the count of columns with each dtype.

See also

dtypes
Return the dtypes in this object.

Examples

>>> a = [['a', 1, 1.0], ['b', 2, 2.0], ['c', 3, 3.0]]  # doctest: +SKIP
>>> df = pd.DataFrame(a, columns=['str', 'int', 'float'])  # doctest: +SKIP
>>> df  # doctest: +SKIP
  str  int  float
0   a    1    1.0
1   b    2    2.0
2   c    3    3.0
>>> df.get_dtype_counts()  # doctest: +SKIP
float64    1
int64      1
object     1
dtype: int64
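
Conceptually, this count is a tally over the per-column dtypes. A minimal plain-Python sketch, assuming a hypothetical mapping of column names to dtype names in place of the dtypes attribute:

```python
from collections import Counter

# Hypothetical stand-in for the column -> dtype mapping of a frame.
column_dtypes = {
    "name": "object",
    "width": "float64",
    "height": "float64",
    "count": "int64",
}

# Tally how many columns share each dtype.
dtype_counts = Counter(column_dtypes.values())
```

The resulting counter reports two float64 columns and one each of object and int64.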
get_ftype_counts()

Return counts of unique ftypes in this object.

This docstring was copied from pandas.core.frame.DataFrame.get_ftype_counts.

Some inconsistencies with the Dask version may exist.

Deprecated since version 0.23.0.

This is useful for SparseDataFrame or for DataFrames containing sparse arrays.

Returns:
dtype : Series

Series with the count of columns with each type and sparsity (dense/sparse)

See also

ftypes
Return ftypes (indication of sparse/dense and dtype) in this object.

Examples

>>> a = [['a', 1, 1.0], ['b', 2, 2.0], ['c', 3, 3.0]]  # doctest: +SKIP
>>> df = pd.DataFrame(a, columns=['str', 'int', 'float'])  # doctest: +SKIP
>>> df  # doctest: +SKIP
  str  int  float
0   a    1    1.0
1   b    2    2.0
2   c    3    3.0
>>> df.get_ftype_counts()  # doctest: +SKIP
float64:dense    1
int64:dense      1
object:dense     1
dtype: int64
get_partition(n)

Get a dask DataFrame/Series representing the nth partition.

groupby(by=None, **kwargs)

Group DataFrame or Series using a mapper or by a Series of columns.

This docstring was copied from pandas.core.frame.DataFrame.groupby.

Some inconsistencies with the Dask version may exist.

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

Parameters:
by : mapping, function, label, or list of labels

Used to determine the groups for the groupby. If by is a function, it’s called on each value of the object’s index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series’ values are first aligned; see .align() method). If an ndarray is passed, the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted as a (single) key.

axis : {0 or ‘index’, 1 or ‘columns’}, default 0 (Not supported in Dask)

Split along rows (0) or columns (1).

level : int, level name, or sequence of such, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), group by a particular level or levels.

as_index : bool, default True (Not supported in Dask)

For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.

sort : bool, default True (Not supported in Dask)

Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.

group_keys : bool, default True (Not supported in Dask)

When calling apply, add group keys to index to identify pieces.

squeeze : bool, default False (Not supported in Dask)

Reduce the dimensionality of the return type if possible, otherwise return a consistent type.

observed : bool, default False (Not supported in Dask)

This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.

New in version 0.23.0.

**kwargs

Optional, only accepts keyword argument ‘mutated’ and is passed to groupby.

Returns:
DataFrameGroupBy or SeriesGroupBy

Depends on the calling object and returns groupby object that contains information about the groups.

See also

resample
Convenience method for frequency conversion and resampling of time series.

Notes

See the user guide for more.

Examples

>>> df = pd.DataFrame({'Animal' : ['Falcon', 'Falcon',  # doctest: +SKIP
...                                'Parrot', 'Parrot'],
...                    'Max Speed' : [380., 370., 24., 26.]})
>>> df  # doctest: +SKIP
   Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0
>>> df.groupby(['Animal']).mean()  # doctest: +SKIP
        Max Speed
Animal
Falcon      375.0
Parrot       25.0

Hierarchical Indexes

We can group by different levels of a hierarchical index using the level parameter:

>>> arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],  # doctest: +SKIP
...           ['Captive', 'Wild', 'Captive', 'Wild']]
>>> index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))  # doctest: +SKIP
>>> df = pd.DataFrame({'Max Speed' : [390., 350., 30., 20.]},  # doctest: +SKIP
...                    index=index)
>>> df  # doctest: +SKIP
                Max Speed
Animal Type
Falcon Captive      390.0
       Wild         350.0
Parrot Captive       30.0
       Wild          20.0
>>> df.groupby(level=0).mean()  # doctest: +SKIP
        Max Speed
Animal
Falcon      370.0
Parrot       25.0
>>> df.groupby(level=1).mean()  # doctest: +SKIP
         Max Speed
Type
Captive      210.0
Wild         185.0
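
The split, apply and combine steps described at the top of this entry can be sketched in plain Python. The helper `group_mean` and the list-of-dicts layout are hypothetical; this models the idea only and says nothing about how Dask distributes the work across partitions.

```python
from collections import defaultdict
from statistics import mean


def group_mean(rows, key, value):
    """Group `rows` (list of dicts) by `key` and average `value` per group."""
    groups = defaultdict(list)
    for row in rows:                      # split: bucket values by group key
        groups[row[key]].append(row[value])
    # apply + combine: reduce each bucket and collect the results
    return {name: mean(values) for name, values in groups.items()}


rows = [
    {"Animal": "Falcon", "Max Speed": 380.0},
    {"Animal": "Falcon", "Max Speed": 370.0},
    {"Animal": "Parrot", "Max Speed": 24.0},
    {"Animal": "Parrot", "Max Speed": 26.0},
]
means = group_mean(rows, "Animal", "Max Speed")
```

The result matches the first groupby example above: 375.0 for Falcon and 25.0 for Parrot.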
gt(other, axis='columns', level=None)

Greater than of dataframe and other, element-wise (binary operator gt).

Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.

Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.

Parameters:
other : scalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axis : {0 or ‘index’, 1 or ‘columns’}, default ‘columns’

Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).

level : int or label

Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns:
DataFrame of bool

Result of the comparison.

See also

DataFrame.eq
Compare DataFrames for equality elementwise.
DataFrame.ne
Compare DataFrames for inequality elementwise.
DataFrame.le
Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt
Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge
Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt
Compare DataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).

Examples

>>> df = pd.DataFrame({'cost': [250, 150, 100],  # doctest: +SKIP
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df  # doctest: +SKIP
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100  # doctest: +SKIP
    cost  revenue
A  False     True
B  False    False
C   True    False
>>> df.eq(100)  # doctest: +SKIP
    cost  revenue
A  False     True
B  False    False
C   True    False

When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])  # doctest: +SKIP
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')  # doctest: +SKIP
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:

>>> df == [250, 100]  # doctest: +SKIP
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')  # doctest: +SKIP
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},  # doctest: +SKIP
...                      index=['A', 'B', 'C', 'D'])
>>> other  # doctest: +SKIP
   revenue
A      300
B      250
C      100
D      150
>>> df.gt(other)  # doctest: +SKIP
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],  # doctest: +SKIP
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex  # doctest: +SKIP
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)  # doctest: +SKIP
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
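
The examples above mostly exercise eq, ne and le. As a small additional sketch focused on gt itself, the pandas snippet below reuses the cost/revenue frame and compares it first against a scalar and then against a Series along the index axis. It is written against pandas, on the assumption that the Dask method wraps the same operator; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"cost": [250, 150, 100],
                   "revenue": [100, 250, 300]},
                  index=["A", "B", "C"])

# Scalar comparison: equivalent to the operator form `df > 120`.
above_120 = df.gt(120)

# Row-wise comparison: the Series is aligned with the index and
# every column is compared against it.
above_revenue = df.gt(df["revenue"], axis="index")

print(above_120)
print(above_revenue)
```

With axis='index', each column is compared with revenue row by row, so the revenue column of the result is False everywhere.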
head(n=5, npartitions=1, compute=True)

First n rows of the dataset

Parameters:
n : int, optional

The number of rows to return. Default is 5.

npartitions : int, optional

Elements are only taken from the first npartitions, with a default of 1. If there are fewer than n rows in the first npartitions a warning will be raised and any found rows returned. Pass -1 to use all partitions.

compute : bool, optional

Whether to compute the result, default is True.

idxmax(axis=None, skipna=True, split_every=False)

Return index of first occurrence of maximum over requested axis. NA/null values are excluded.

This docstring was copied from pandas.core.frame.DataFrame.idxmax.

Some inconsistencies with the Dask version may exist.

Parameters:
axis : {0 or ‘index’, 1 or ‘columns’}, default 0

0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

Returns:
idxmax : Series
Raises:
ValueError
  • If the row/column is empty

See also

Series.idxmax

Notes

This method is the DataFrame version of ndarray.argmax.
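
Examples

This docstring ships without an example, so here is a short pandas sketch of the axis parameter (the frame is invented for illustration; the Dask version is expected to give the same labels after compute(), though that is an assumption rather than something stated above). It covers idxmin as well, which behaves symmetrically.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 5, 3], "b": [9, 2, 4]},
                  index=["x", "y", "z"])

# axis=0 (default): for each column, the row label of its maximum.
col_max = df.idxmax()

# axis=1: for each row, the column label of its maximum.
row_max = df.idxmax(axis=1)

# idxmin works the same way for minima.
col_min = df.idxmin()
```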

idxmin(axis=None, skipna=True, split_every=False)

Return index of first occurrence of minimum over requested axis. NA/null values are excluded.

This docstring was copied from pandas.core.frame.DataFrame.idxmin.

Some inconsistencies with the Dask version may exist.

Parameters:
axis : {0 or ‘index’, 1 or ‘columns’}, default 0

0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

Returns:
idxmin : Series
Raises:
ValueError
  • If the row/column is empty

See also

Series.idxmin

Notes

This method is the DataFrame version of ndarray.argmin.

iloc

Purely integer-location based indexing for selection by position.

Only indexing the column positions is supported. Trying to select row positions will raise a ValueError.

See Indexing into Dask DataFrames for more.

Examples

>>> df.iloc[:, [2, 0, 1]]  # doctest: +SKIP
index

Return dask Index instance

info(buf=None, verbose=False, memory_usage=False)

Concise summary of a Dask DataFrame.

isin(values)

Whether each element in the DataFrame is contained in values.

This docstring was copied from pandas.core.frame.DataFrame.isin.

Some inconsistencies with the Dask version may exist.

Parameters:
values : iterable, Series, DataFrame or dict

The result will only be true at a location if all the labels match. If values is a Series, that’s the index. If values is a dict, the keys must be the column names, which must match. If values is a DataFrame, then both the index and column labels must match.

Returns:
DataFrame

DataFrame of booleans showing whether each element in the DataFrame is contained in values.

See also

DataFrame.eq
Equality test for DataFrame.
Series.isin
Equivalent method on Series.
Series.str.contains
Test if pattern or regex is contained within a string of a Series or Index.

Examples

>>> df = pd.DataFrame({'num_legs': [2, 4], 'num_wings': [2, 0]},  # doctest: +SKIP
...                   index=['falcon', 'dog'])
>>> df  # doctest: +SKIP
        num_legs  num_wings
falcon         2          2
dog            4          0

When values is a list, check whether every value in the DataFrame is present in the list (which animals have 0 or 2 legs or wings):

>>> df.isin([0, 2])  # doctest: +SKIP
        num_legs  num_wings
falcon      True       True
dog        False       True

When values is a dict, we can pass values to check for each column separately:

>>> df.isin({'num_wings': [0, 3]})  # doctest: +SKIP
        num_legs  num_wings
falcon     False      False
dog        False       True

When values is a Series or DataFrame, the index and column labels must match as well as the values. Note that ‘dog’ has no row in other, so its entries are all False, while ‘falcon’ matches on both columns.

>>> other = pd.DataFrame({'num_legs': [8, 2],'num_wings': [0, 2]},  # doctest: +SKIP
...                      index=['spider', 'falcon'])
>>> df.isin(other)  # doctest: +SKIP
        num_legs  num_wings
falcon      True       True
dog        False      False
isna()

Detect missing values.

This docstring was copied from pandas.core.frame.DataFrame.isna.

Some inconsistencies with the Dask version may exist.

Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.nan, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).

Returns:
DataFrame

Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.

See also

DataFrame.isnull
Alias of isna.
DataFrame.notna
Boolean inverse of isna.
DataFrame.dropna
Omit axes labels with missing values.
isna
Top-level isna.

Examples

Show which entries in a DataFrame are NA.

>>> df = pd.DataFrame({'age': [5, 6, np.nan],  # doctest: +SKIP
...                    'born': [pd.NaT, pd.Timestamp('1939-05-27'),
...                             pd.Timestamp('1940-04-25')],
...                    'name': ['Alfred', 'Batman', ''],
...                    'toy': [None, 'Batmobile', 'Joker']})
>>> df  # doctest: +SKIP
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.isna()  # doctest: +SKIP
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False

Show which entries in a Series are NA.

>>> ser = pd.Series([5, 6, np.nan])  # doctest: +SKIP
>>> ser  # doctest: +SKIP
0    5.0
1    6.0
2    NaN
dtype: float64
>>> ser.isna()  # doctest: +SKIP
0    False
1    False
2     True
dtype: bool
isnull()

Detect missing values.

This docstring was copied from pandas.core.frame.DataFrame.isnull.

Some inconsistencies with the Dask version may exist.

Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.nan, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).

Returns:
DataFrame

Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.

See also

DataFrame.isnull
Alias of isna.
DataFrame.notna
Boolean inverse of isna.
DataFrame.dropna
Omit axes labels with missing values.
isna
Top-level isna.

Examples

Show which entries in a DataFrame are NA.

>>> df = pd.DataFrame({'age': [5, 6, np.nan],  # doctest: +SKIP
...                    'born': [pd.NaT, pd.Timestamp('1939-05-27'),
...                             pd.Timestamp('1940-04-25')],
...                    'name': ['Alfred', 'Batman', ''],
...                    'toy': [None, 'Batmobile', 'Joker']})
>>> df  # doctest: +SKIP
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.isna()  # doctest: +SKIP
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False

Show which entries in a Series are NA.

>>> ser = pd.Series([5, 6, np.nan])  # doctest: +SKIP
>>> ser  # doctest: +SKIP
0    5.0
1    6.0
2    NaN
dtype: float64
>>> ser.isna()  # doctest: +SKIP
0    False
1    False
2     True
dtype: bool
iterrows()

Iterate over DataFrame rows as (index, Series) pairs.

This docstring was copied from pandas.core.frame.DataFrame.iterrows.

Some inconsistencies with the Dask version may exist.

Yields:
index : label or tuple of label

The index of the row. A tuple for a MultiIndex.

data : Series

The data of the row as a Series.

it : generator

A generator that iterates over the rows of the frame.

See also

itertuples
Iterate over DataFrame rows as namedtuples of the values.
iteritems
Iterate over (column name, Series) pairs.

Notes

  1. Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). For example,

    >>> df = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])  # doctest: +SKIP
    >>> row = next(df.iterrows())[1]  # doctest: +SKIP
    >>> row  # doctest: +SKIP
    int      1.0
    float    1.5
    Name: 0, dtype: float64
    >>> print(row['int'].dtype)  # doctest: +SKIP
    float64
    >>> print(df['int'].dtype)  # doctest: +SKIP
    int64
    

    To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally faster than iterrows.

  2. You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
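
To make the first note concrete, the following pandas sketch contrasts the two iterators on a one-row frame with an integer and a float column (the column names n and ratio are illustrative). It only demonstrates the dtype behaviour described above and makes no claim about relative speed.

```python
import pandas as pd

df = pd.DataFrame({"n": [1], "ratio": [1.5]})

# iterrows packs the whole row into one Series, so the integer
# value is upcast to float to share a dtype with `ratio`.
_, series_row = next(df.iterrows())

# itertuples keeps each column's value separate, so `n` stays an integer.
tuple_row = next(df.itertuples(index=False))

print(type(series_row["n"]), type(tuple_row.n))
```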

itertuples(index=True, name='Pandas')

Iterate over DataFrame rows as namedtuples.

This docstring was copied from pandas.core.frame.DataFrame.itertuples.

Some inconsistencies with the Dask version may exist.

Parameters:
index : bool, default True

If True, return the index as the first element of the tuple.

name : str or None, default “Pandas”

The name of the returned namedtuples or None to return regular tuples.

Yields:
collections.namedtuple

Yields a namedtuple for each row in the DataFrame with the first field possibly being the index and following fields being the column values.

See also

DataFrame.iterrows
Iterate over DataFrame rows as (index, Series) pairs.
DataFrame.iteritems
Iterate over (column name, Series) pairs.

Notes

The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore. With a large number of columns (>255), regular tuples are returned.

Examples

>>> df = pd.DataFrame({'num_legs': [4, 2], 'num_wings': [0, 2]},  # doctest: +SKIP
...                   index=['dog', 'hawk'])
>>> df  # doctest: +SKIP
      num_legs  num_wings
dog          4          0
hawk         2          2
>>> for row in df.itertuples():  # doctest: +SKIP
...     print(row)
...
Pandas(Index='dog', num_legs=4, num_wings=0)
Pandas(Index='hawk', num_legs=2, num_wings=2)

By setting the index parameter to False we can remove the index as the first element of the tuple:

>>> for row in df.itertuples(index=False):  # doctest: +SKIP
...     print(row)
...
Pandas(num_legs=4, num_wings=0)
Pandas(num_legs=2, num_wings=2)

Setting the name parameter gives the yielded namedtuples a custom name:

>>> for row in df.itertuples(name='Animal'):  # doctest: +SKIP
...     print(row)
...
Animal(Index='dog', num_legs=4, num_wings=0)
Animal(Index='hawk', num_legs=2, num_wings=2)
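
The renaming rule from the Notes is easiest to see with a column whose name is not a valid Python identifier. In this pandas sketch (the frame is hypothetical) the column 'max speed' is replaced by a positional field name:

```python
import pandas as pd

df = pd.DataFrame({"name": ["falcon"], "max speed": [389.0]})
row = next(df.itertuples())

# 'max speed' contains a space, so it is renamed by position:
# field 0 is Index, field 1 is name, field 2 becomes _2.
fields = row._fields
speed = row._2
```

Positional access such as row[2] works regardless of renaming, which can be more robust when column names are not known in advance.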
join(other, on=None, how='left', lsuffix='', rsuffix='', npartitions=None, shuffle=None)

Join columns of another DataFrame.

This docstring was copied from pandas.core.frame.DataFrame.join.

Some inconsistencies with the Dask version may exist.

Join columns with other DataFrame either on index or on a key column. Efficiently join multiple DataFrame objects by index at once by passing a list.

Parameters:
other : DataFrame, Series, or list of DataFrame

Index should be similar to one of the columns in this one. If a Series is passed, its name attribute must be set, and that will be used as the column name in the resulting joined DataFrame.

on : str, list of str, or array-like, optional

Column or index level name(s) in the caller to join on the index in other, otherwise joins index-on-index. If multiple values given, the other DataFrame must have a MultiIndex. Can pass an array as the join key if it is not already contained in the calling DataFrame. Like an Excel VLOOKUP operation.

how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘left’

How to handle the operation of the two objects.

  • left: use calling frame’s index (or column if on is specified)
  • right: use other’s index.
  • outer: form union of calling frame’s index (or column if on is specified) with other’s index, and sort it lexicographically.
  • inner: form intersection of calling frame’s index (or column if on is specified) with other’s index, preserving the order of the calling frame’s index.
lsuffix : str, default ‘’

Suffix to use from left frame’s overlapping columns.

rsuffix : str, default ‘’

Suffix to use from right frame’s overlapping columns.

sort : bool, default False (Not supported in Dask)

Order result DataFrame lexicographically by the join key. If False, the order of the join key depends on the join type (how keyword).

Returns:
DataFrame

A dataframe containing columns from both the caller and other.

See also

DataFrame.merge
For column(s)-on-column(s) operations.

Notes

Parameters on, lsuffix, and rsuffix are not supported when passing a list of DataFrame objects.

Support for specifying index levels as the on parameter was added in version 0.23.0.

Examples

>>> df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],  # doctest: +SKIP
...                    'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
>>> df  # doctest: +SKIP
  key   A
0  K0  A0
1  K1  A1
2  K2  A2
3  K3  A3
4  K4  A4
5  K5  A5
>>> other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],  # doctest: +SKIP
...                       'B': ['B0', 'B1', 'B2']})
>>> other  # doctest: +SKIP
  key   B
0  K0  B0
1  K1  B1
2  K2  B2

Join DataFrames using their indexes.

>>> df.join(other, lsuffix='_caller', rsuffix='_other')  # doctest: +SKIP
  key_caller   A key_other    B
0         K0  A0        K0   B0
1         K1  A1        K1   B1
2         K2  A2        K2   B2
3         K3  A3       NaN  NaN
4         K4  A4       NaN  NaN
5         K5  A5       NaN  NaN

If we want to join using the key columns, we need to set key to be the index in both df and other. The joined DataFrame will have key as its index.

>>> df.set_index('key').join(other.set_index('key'))  # doctest: +SKIP
      A    B
key
K0   A0   B0
K1   A1   B1
K2   A2   B2
K3   A3  NaN
K4   A4  NaN
K5   A5  NaN

Another option to join using the key columns is to use the on parameter. DataFrame.join always uses other’s index but we can use any column in df. This method preserves the original DataFrame’s index in the result.

>>> df.join(other.set_index('key'), on='key')  # doctest: +SKIP
  key   A    B
0  K0  A0   B0
1  K1  A1   B1
2  K2  A2   B2
3  K3  A3  NaN
4  K4  A4  NaN
5  K5  A5  NaN
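
All of the examples above use the default how='left'. As a sketch of the other options, the pandas snippet below joins two small index-aligned frames (invented for illustration) whose labels only partly overlap:

```python
import pandas as pd

left = pd.DataFrame({"A": ["A0", "A1", "A2"]}, index=["K0", "K1", "K2"])
right = pd.DataFrame({"B": ["B1", "B2", "B3"]}, index=["K1", "K2", "K3"])

inner = left.join(right, how="inner")       # labels present in both frames
outer = left.join(right, how="outer")       # union of labels, sorted
right_join = left.join(right, how="right")  # labels of `right` only

print(inner.index.tolist())       # ['K1', 'K2']
print(outer.index.tolist())       # ['K0', 'K1', 'K2', 'K3']
print(right_join.index.tolist())  # ['K1', 'K2', 'K3']
```

Rows that exist on only one side are filled with NaN in the columns coming from the other frame.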
known_divisions

Whether divisions are already known

last(offset)

Convenience method for subsetting final periods of time series data based on a date offset.

This docstring was copied from pandas.core.frame.DataFrame.last.

Some inconsistencies with the Dask version may exist.

Parameters:
offset : string, DateOffset, dateutil.relativedelta
Returns:
subset : same type as caller
Raises:
TypeError

If the index is not a DatetimeIndex

See also

first
Select initial periods of time series based on a date offset.
at_time
Select values at a particular time of the day.
between_time
Select values between particular times of the day.

Examples

>>> i = pd.date_range('2018-04-09', periods=4, freq='2D')  # doctest: +SKIP
>>> ts = pd.DataFrame({'A': [1,2,3,4]}, index=i)  # doctest: +SKIP
>>> ts  # doctest: +SKIP
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4

Get the rows for the last 3 days:

>>> ts.last('3D')  # doctest: +SKIP
            A
2018-04-13  3
2018-04-15  4

Notice that the data for the last 3 calendar days were returned, not the last 3 observed days in the dataset, and therefore data for 2018-04-11 was not returned.

le(other, axis='columns', level=None)

Less than or equal to of dataframe and other, element-wise (binary operator le).

Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.

Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.

Parameters:
other : scalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axis : {0 or ‘index’, 1 or ‘columns’}, default ‘columns’

Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).

level : int or label

Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns:
DataFrame of bool

Result of the comparison.

See also

DataFrame.eq
Compare DataFrames for equality elementwise.
DataFrame.ne
Compare DataFrames for inequality elementwise.
DataFrame.le
Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt
Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge
Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt
Compare DataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).

Examples

>>> df = pd.DataFrame({'cost': [250, 150, 100],  # doctest: +SKIP
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df  # doctest: +SKIP
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100  # doctest: +SKIP
    cost  revenue
A  False     True
B  False    False
C   True    False
>>> df.eq(100)  # doctest: +SKIP
    cost  revenue
A  False     True
B  False    False
C   True    False

When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])  # doctest: +SKIP
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')  # doctest: +SKIP
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:

>>> df == [250, 100]  # doctest: +SKIP
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')  # doctest: +SKIP
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},  # doctest: +SKIP
...                      index=['A', 'B', 'C', 'D'])
>>> other  # doctest: +SKIP
   revenue
A      300
B      250
C      100
D      150
>>> df.gt(other)  # doctest: +SKIP
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],  # doctest: +SKIP
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex  # doctest: +SKIP
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)  # doctest: +SKIP
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
loc

Purely label-location based indexer for selection by label.

>>> df.loc["b"]  # doctest: +SKIP
>>> df.loc["b":"d"]  # doctest: +SKIP
lt(other, axis='columns', level=None)

Less than of dataframe and other, element-wise (binary operator lt).

Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.

Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.

Parameters:
other : scalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axis : {0 or ‘index’, 1 or ‘columns’}, default ‘columns’

Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).

level : int or label

Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns:
DataFrame of bool

Result of the comparison.

See also

DataFrame.eq
Compare DataFrames for equality elementwise.
DataFrame.ne
Compare DataFrames for inequality elementwise.
DataFrame.le
Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt
Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge
Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt
Compare DataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).

Examples

>>> df = pd.DataFrame({'cost': [250, 150, 100],  # doctest: +SKIP
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df  # doctest: +SKIP
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100  # doctest: +SKIP
    cost  revenue
A  False     True
B  False    False
C   True    False
>>> df.eq(100)  # doctest: +SKIP
    cost  revenue
A  False     True
B  False    False
C   True    False

When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])  # doctest: +SKIP
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')  # doctest: +SKIP
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:

>>> df == [250, 100]  # doctest: +SKIP
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')  # doctest: +SKIP
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},  # doctest: +SKIP
...                      index=['A', 'B', 'C', 'D'])
>>> other  # doctest: +SKIP
   revenue
A      300
B      250
C      100
D      150
>>> df.gt(other)  # doctest: +SKIP
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],  # doctest: +SKIP
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex  # doctest: +SKIP
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)  # doctest: +SKIP
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
map_overlap(func, before, after, *args, **kwargs)

Apply a function to each partition, sharing rows with adjacent partitions.

This can be useful for implementing windowing functions such as df.rolling(...).mean() or df.diff().

Parameters:
func : function

Function applied to each partition.

before : int

The number of rows to prepend to partition i from the end of partition i - 1.

after : int

The number of rows to append to partition i from the beginning of partition i + 1.

args, kwargs :

Arguments and keywords to pass to the function. The partition will be the first argument, and these will be passed after.

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Notes

Given positive integers before and after, and a function func, map_overlap does the following:

  1. Prepend before rows to each partition i from the end of partition i - 1. The first partition has no rows prepended.
  2. Append after rows to each partition i from the beginning of partition i + 1. The last partition has no rows appended.
  3. Apply func to each partition, passing in any extra args and kwargs if provided.
  4. Trim before rows from the beginning of all but the first partition.
  5. Trim after rows from the end of all but the last partition.

Note that the index and divisions are assumed to remain unchanged.

Examples

Given a DataFrame, Series, or Index, such as:

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1, 2, 4, 7, 11],
...                    'y': [1., 2., 3., 4., 5.]})
>>> ddf = dd.from_pandas(df, npartitions=2)

A rolling sum with a trailing moving window of size 2 can be computed by overlapping 2 rows before each partition, and then mapping calls to df.rolling(2).sum():

>>> ddf.compute()
    x    y
0   1  1.0
1   2  2.0
2   4  3.0
3   7  4.0
4  11  5.0
>>> ddf.map_overlap(lambda df: df.rolling(2).sum(), 2, 0).compute()
      x    y
0   NaN  NaN
1   3.0  3.0
2   6.0  5.0
3  11.0  7.0
4  18.0  9.0

The pandas diff method computes a discrete difference shifted by a number of periods (can be positive or negative). This can be implemented by mapping calls to df.diff to each partition after prepending/appending that many rows, depending on sign:

>>> def diff(df, periods=1):
...     before, after = (periods, 0) if periods > 0 else (0, -periods)
...     return df.map_overlap(lambda df, periods=1: df.diff(periods),
...                           before, after, periods=periods)
>>> diff(ddf, 1).compute()
     x    y
0  NaN  NaN
1  1.0  1.0
2  2.0  1.0
3  3.0  1.0
4  4.0  1.0

If you have a DatetimeIndex, you can use a pd.Timedelta for time-based windows.

>>> ts = pd.Series(range(10), index=pd.date_range('2017', periods=10))
>>> dts = dd.from_pandas(ts, npartitions=2)
>>> dts.map_overlap(lambda df: df.rolling('2D').sum(),
...                 pd.Timedelta('2D'), 0).compute()
2017-01-01     0.0
2017-01-02     1.0
2017-01-03     3.0
2017-01-04     5.0
2017-01-05     7.0
2017-01-06     9.0
2017-01-07    11.0
2017-01-08    13.0
2017-01-09    15.0
2017-01-10    17.0
dtype: float64
map_partitions(func, *args, **kwargs)

Apply Python function on each DataFrame partition.

Note that the index and divisions are assumed to remain unchanged.

Parameters:
func : function

Function applied to each partition.

args, kwargs :

Arguments and keywords to pass to the function. The partition will be the first argument, and these will be passed after. Arguments and keywords may contain Scalar, Delayed or regular python objects. DataFrame-like args (both dask and pandas) will be repartitioned to align (if necessary) before applying the function.

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Examples

Given a DataFrame, Series, or Index, such as:

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
...                    'y': [1., 2., 3., 4., 5.]})
>>> ddf = dd.from_pandas(df, npartitions=2)

One can use map_partitions to apply a function on each partition. Extra arguments and keywords can optionally be provided, and will be passed to the function after the partition.

Here we apply a function with arguments and keywords to a DataFrame, resulting in a Series:

>>> def myadd(df, a, b=1):
...     return df.x + df.y + a + b
>>> res = ddf.map_partitions(myadd, 1, b=2)
>>> res.dtype
dtype('float64')

By default, dask tries to infer the output metadata by running your provided function on some fake data. This works well in many cases, but can sometimes be expensive, or even fail. To avoid this, you can manually specify the output metadata with the meta keyword. This can be specified in many forms; for more information, see dask.dataframe.utils.make_meta.

Here we specify the output is a Series with no name, and dtype float64:

>>> res = ddf.map_partitions(myadd, 1, b=2, meta=(None, 'f8'))

Here we map a function that takes in a DataFrame, and returns a DataFrame with a new column:

>>> res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y))
>>> res.dtypes
x      int64
y    float64
z    float64
dtype: object

As before, the output metadata can also be specified manually. This time we pass in a dict, as the output is a DataFrame:

>>> res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y),
...                          meta={'x': 'i8', 'y': 'f8', 'z': 'f8'})

In the case where the metadata doesn’t change, you can also pass in the object itself directly:

>>> res = ddf.map_partitions(lambda df: df.head(), meta=df)

Also note that the index and divisions are assumed to remain unchanged. If the function you’re mapping changes the index/divisions, you’ll need to clear them afterwards:

>>> ddf.map_partitions(func).clear_divisions()  # doctest: +SKIP
mask(cond, other=nan)

Replace values where the condition is True.

This docstring was copied from pandas.core.frame.DataFrame.mask.

Some inconsistencies with the Dask version may exist.

Parameters:
cond : boolean NDFrame, array-like, or callable

Where cond is False, keep the original value. Where True, replace with corresponding value from other. If cond is callable, it is computed on the NDFrame and should return boolean NDFrame or array. The callable must not change input NDFrame (though pandas doesn’t check it).

New in version 0.18.1: A callable can be used as cond.

other : scalar, NDFrame, or callable

Entries where cond is True are replaced with corresponding value from other. If other is callable, it is computed on the NDFrame and should return scalar or NDFrame. The callable must not change input NDFrame (though pandas doesn’t check it).

New in version 0.18.1: A callable can be used as other.

inplace : boolean, default False (Not supported in Dask)

Whether to perform the operation in place on the data.

axis : int, default None (Not supported in Dask)

Alignment axis if needed.

level : int, default None (Not supported in Dask)

Alignment level if needed.

errors : str, {‘raise’, ‘ignore’}, default ‘raise’ (Not supported in Dask)

Note that currently this parameter won’t affect the results and will always coerce to a suitable dtype.

  • raise : allow exceptions to be raised.
  • ignore : suppress exceptions. On error return original object.
try_cast : boolean, default False (Not supported in Dask)

Try to cast the result back to the input type (if possible).

raise_on_error : boolean, default True (Not supported in Dask)

Whether to raise on invalid data types (e.g. trying to where on strings).

Deprecated since version 0.21.0: Use errors.

Returns:
wh : same type as caller

See also

DataFrame.where()
Return an object of same shape as self.

Notes

The mask method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is False the element is used; otherwise the corresponding element from the DataFrame other is used.

The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).

For further details and examples see the mask documentation in indexing.

Examples

>>> s = pd.Series(range(5))  # doctest: +SKIP
>>> s.where(s > 0)  # doctest: +SKIP
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64
>>> s.mask(s > 0)  # doctest: +SKIP
0    0.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64
>>> s.where(s > 1, 10)  # doctest: +SKIP
0    10
1    10
2    2
3    3
4    4
dtype: int64
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])  # doctest: +SKIP
>>> m = df % 3 == 0  # doctest: +SKIP
>>> df.where(m, -df)  # doctest: +SKIP
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
>>> df.where(m, -df) == np.where(m, df, -df)  # doctest: +SKIP
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
>>> df.where(m, -df) == df.mask(~m, -df)  # doctest: +SKIP
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
max(axis=None, skipna=True, split_every=False, out=None)

Return the maximum of the values for the requested axis.

This docstring was copied from pandas.core.frame.DataFrame.max.

Some inconsistencies with the Dask version may exist.

If you want the index of the maximum, use idxmax. This is the equivalent of the numpy.ndarray method argmax.

Parameters:
axis : {index (0), columns (1)}

Axis for the function to be applied on.

skipna : bool, default True

Exclude NA/null values when computing the result.

level : int or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

numeric_only : bool, default None (Not supported in Dask)

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

**kwargs

Additional keyword arguments to be passed to the function.

Returns:
max : Series or DataFrame (if level specified)

See also

Series.sum
Return the sum.
Series.min
Return the minimum.
Series.max
Return the maximum.
Series.idxmin
Return the index of the minimum.
Series.idxmax
Return the index of the maximum.
DataFrame.sum
Return the sum over the requested axis.
DataFrame.min
Return the minimum over the requested axis.
DataFrame.max
Return the maximum over the requested axis.
DataFrame.idxmin
Return the index of the minimum over the requested axis.
DataFrame.idxmax
Return the index of the maximum over the requested axis.

Examples

>>> idx = pd.MultiIndex.from_arrays([  # doctest: +SKIP
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)  # doctest: +SKIP
>>> s  # doctest: +SKIP
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>> s.max()  # doctest: +SKIP
8

Max using level names, as well as indices.

>>> s.max(level='blooded')  # doctest: +SKIP
blooded
warm    4
cold    8
Name: legs, dtype: int64
>>> s.max(level=0)  # doctest: +SKIP
blooded
warm    4
cold    8
Name: legs, dtype: int64
mean(axis=None, skipna=True, split_every=False, dtype=None, out=None)

Return the mean of the values for the requested axis.

This docstring was copied from pandas.core.frame.DataFrame.mean.

Some inconsistencies with the Dask version may exist.

Parameters:
axis : {index (0), columns (1)}

Axis for the function to be applied on.

skipna : bool, default True

Exclude NA/null values when computing the result.

level : int or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

numeric_only : bool, default None (Not supported in Dask)

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

**kwargs

Additional keyword arguments to be passed to the function.

Returns:
mean : Series or DataFrame (if level specified)
melt(id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)

Unpivots a DataFrame from wide format to long format, optionally leaving identifier variables set.

This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.

Parameters:
frame : DataFrame
id_vars : tuple, list, or ndarray, optional

Column(s) to use as identifier variables.

value_vars : tuple, list, or ndarray, optional

Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.

var_name : scalar

Name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’.

value_name : scalar, default ‘value’

Name to use for the ‘value’ column.

col_level : int or string, optional

If columns are a MultiIndex then use this level to melt.

Returns:
DataFrame

Unpivoted DataFrame.

memory_usage(index=True, deep=False)

Return the memory usage of each column in bytes.

This docstring was copied from pandas.core.frame.DataFrame.memory_usage.

Some inconsistencies with the Dask version may exist.

The memory usage can optionally include the contribution of the index and elements of object dtype.

This value is displayed in DataFrame.info by default. This can be suppressed by setting pandas.options.display.memory_usage to False.

Parameters:
index : bool, default True

Specifies whether to include the memory usage of the DataFrame’s index in returned Series. If index=True, the memory usage of the index is the first item in the output.

deep : bool, default False

If True, introspect the data deeply by interrogating object dtypes for system-level memory consumption, and include it in the returned values.

Returns:
sizes : Series

A Series whose index is the original column names and whose values are the memory usage of each column in bytes.

See also

numpy.ndarray.nbytes
Total bytes consumed by the elements of an ndarray.
Series.memory_usage
Bytes consumed by a Series.
pandas.Categorical
Memory-efficient array for string values with many repeated values.
DataFrame.info
Concise summary of a DataFrame.

Examples

>>> dtypes = ['int64', 'float64', 'complex128', 'object', 'bool']  # doctest: +SKIP
>>> data = dict([(t, np.ones(shape=5000).astype(t))  # doctest: +SKIP
...              for t in dtypes])
>>> df = pd.DataFrame(data)  # doctest: +SKIP
>>> df.head()  # doctest: +SKIP
   int64  float64  complex128 object  bool
0      1      1.0      (1+0j)      1  True
1      1      1.0      (1+0j)      1  True
2      1      1.0      (1+0j)      1  True
3      1      1.0      (1+0j)      1  True
4      1      1.0      (1+0j)      1  True
>>> df.memory_usage()  # doctest: +SKIP
Index            80
int64         40000
float64       40000
complex128    80000
object        40000
bool           5000
dtype: int64
>>> df.memory_usage(index=False)  # doctest: +SKIP
int64         40000
float64       40000
complex128    80000
object        40000
bool           5000
dtype: int64

The memory footprint of object dtype columns is ignored by default:

>>> df.memory_usage(deep=True)  # doctest: +SKIP
Index             80
int64          40000
float64        40000
complex128     80000
object        160000
bool            5000
dtype: int64

Use a Categorical for efficient storage of an object-dtype column with many repeated values.

>>> df['object'].astype('category').memory_usage(deep=True)  # doctest: +SKIP
5168
merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, suffixes=('_x', '_y'), indicator=False, npartitions=None, shuffle=None)

Merge the DataFrame with another DataFrame

This will merge the two datasets, either on the indices, a certain column in each dataset or the index in one dataset and the column in another.

Parameters:
right: dask.dataframe.DataFrame
how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default: ‘inner’

How to handle the operation of the two objects:

  • left: use calling frame’s index (or column if on is specified)
  • right: use other frame’s index
  • outer: form union of calling frame’s index (or column if on is specified) with other frame’s index, and sort it lexicographically
  • inner: form intersection of calling frame’s index (or column if on is specified) with other frame’s index, preserving the order of the calling frame’s index
on : label or list

Column or index level names to join on. These must be found in both DataFrames. If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.

left_on : label or list, or array-like

Column to join on in the left DataFrame. Unlike in pandas, arrays and lists are only supported if their length is 1.

right_on : label or list, or array-like

Column to join on in the right DataFrame. Unlike in pandas, arrays and lists are only supported if their length is 1.

left_index : boolean, default False

Use the index from the left DataFrame as the join key.

right_index : boolean, default False

Use the index from the right DataFrame as the join key.

suffixes : 2-length sequence (tuple, list, …)

Suffix to apply to overlapping column names in the left and right side, respectively

indicator : boolean or string, default False

If True, adds a column to output DataFrame called “_merge” with information on the source of each row. If string, column with information on source of each row will be added to output DataFrame, and column will be named value of string. Information column is Categorical-type and takes on a value of “left_only” for observations whose merge key only appears in left DataFrame, “right_only” for observations whose merge key only appears in right DataFrame, and “both” if the observation’s merge key is found in both.

npartitions: int or None, optional

The ideal number of output partitions. This is only utilised when performing a hash_join (merging on columns only). If None then npartitions = max(lhs.npartitions, rhs.npartitions). Default is None.

shuffle: {‘disk’, ‘tasks’}, optional

Either 'disk' for single-node operation or 'tasks' for distributed operation. Will be inferred by your current scheduler.

Notes

There are three ways to join dataframes:

  1. Joining on indices. In this case the divisions are aligned using the function dask.dataframe.multi.align_partitions. Afterwards, each partition is merged with the pandas merge function.
  2. Joining one on index and one on column. In this case the divisions of the dataframe merged by index (\(d_i\)) are used to divide the dataframe merged by column (\(d_c\)) using dask.dataframe.multi.rearrange_by_divisions. The merged dataframe (\(d_m\)) then has exactly the same divisions as \(d_i\). This can lead to issues if you merge multiple rows from \(d_c\) to one row in \(d_i\).
  3. Joining both on columns. In this case a hash join is performed using dask.dataframe.multi.hash_join.
min(axis=None, skipna=True, split_every=False, out=None)

Return the minimum of the values for the requested axis.

This docstring was copied from pandas.core.frame.DataFrame.min.

Some inconsistencies with the Dask version may exist.

If you want the index of the minimum, use idxmin. This is the equivalent of the numpy.ndarray method argmin.

Parameters:
axis : {index (0), columns (1)}

Axis for the function to be applied on.

skipna : bool, default True

Exclude NA/null values when computing the result.

level : int or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

numeric_only : bool, default None (Not supported in Dask)

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

**kwargs

Additional keyword arguments to be passed to the function.

Returns:
min : Series or DataFrame (if level specified)

See also

Series.sum
Return the sum.
Series.min
Return the minimum.
Series.max
Return the maximum.
Series.idxmin
Return the index of the minimum.
Series.idxmax
Return the index of the maximum.
DataFrame.sum
Return the sum over the requested axis.
DataFrame.min
Return the minimum over the requested axis.
DataFrame.max
Return the maximum over the requested axis.
DataFrame.idxmin
Return the index of the minimum over the requested axis.
DataFrame.idxmax
Return the index of the maximum over the requested axis.

Examples

>>> idx = pd.MultiIndex.from_arrays([  # doctest: +SKIP
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)  # doctest: +SKIP
>>> s  # doctest: +SKIP
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>> s.min()  # doctest: +SKIP
0

Min using level names, as well as indices.

>>> s.min(level='blooded')  # doctest: +SKIP
blooded
warm    2
cold    0
Name: legs, dtype: int64
>>> s.min(level=0)  # doctest: +SKIP
blooded
warm    2
cold    0
Name: legs, dtype: int64
mod(other, axis='columns', level=None, fill_value=None)

Modulo of dataframe and other, element-wise (binary operator mod).

Equivalent to dataframe % other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rmod.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters:
other : scalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axis : {0 or ‘index’, 1 or ‘columns’}

Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

level : int or label

Broadcast across a level, matching Index values on the passed MultiIndex level.

fill_value : float or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame

Result of the arithmetic operation.

See also

DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  # doctest: +SKIP
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  # doctest: +SKIP
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  # doctest: +SKIP
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)  # doctest: +SKIP
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  # doctest: +SKIP
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  # doctest: +SKIP
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  # doctest: +SKIP
           angles
circle          0
triangle        3
rectangle       4
>>> df * other  # doctest: +SKIP
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)  # doctest: +SKIP
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  # doctest: +SKIP
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  # doctest: +SKIP
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)  # doctest: +SKIP
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
mul(other, axis='columns', level=None, fill_value=None)

Multiplication of dataframe and other, element-wise (binary operator mul).

Equivalent to dataframe * other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rmul.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters:
other : scalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axis : {0 or ‘index’, 1 or ‘columns’}

Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

level : int or label

Broadcast across a level, matching Index values on the passed MultiIndex level.

fill_value : float or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame

Result of the arithmetic operation.

See also

DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  # doctest: +SKIP
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  # doctest: +SKIP
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  # doctest: +SKIP
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)  # doctest: +SKIP
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  # doctest: +SKIP
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  # doctest: +SKIP
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  # doctest: +SKIP
           angles
circle          0
triangle        3
rectangle       4
>>> df * other  # doctest: +SKIP
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)  # doctest: +SKIP
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  # doctest: +SKIP
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  # doctest: +SKIP
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)  # doctest: +SKIP
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
ndim

Return dimensionality

ne(other, axis='columns', level=None)

Not equal to of dataframe and other, element-wise (binary operator ne).

Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.

Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.

Parameters:
other : scalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axis : {0 or ‘index’, 1 or ‘columns’}, default ‘columns’

Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).

level : int or label

Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns:
DataFrame of bool

Result of the comparison.

See also

DataFrame.eq
Compare DataFrames for equality elementwise.
DataFrame.ne
Compare DataFrames for inequality elementwise.
DataFrame.le
Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt
Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge
Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt
Compare DataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).

Examples

>>> df = pd.DataFrame({'cost': [250, 150, 100],  # doctest: +SKIP
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df  # doctest: +SKIP
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100  # doctest: +SKIP
    cost  revenue
A  False     True
B  False    False
C   True    False
>>> df.eq(100)  # doctest: +SKIP
    cost  revenue
A  False     True
B  False    False
C   True    False

When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])  # doctest: +SKIP
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')  # doctest: +SKIP
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:

>>> df == [250, 100]  # doctest: +SKIP
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')  # doctest: +SKIP
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},  # doctest: +SKIP
...                      index=['A', 'B', 'C', 'D'])
>>> other  # doctest: +SKIP
   revenue
A      300
B      250
C      100
D      150
>>> df.gt(other)  # doctest: +SKIP
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],  # doctest: +SKIP
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex  # doctest: +SKIP
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)  # doctest: +SKIP
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
nlargest(n=5, columns=None, split_every=None)

Return the first n rows ordered by columns in descending order.

This docstring was copied from pandas.core.frame.DataFrame.nlargest.

Some inconsistencies with the Dask version may exist.

Return the first n rows with the largest values in columns, in descending order. The columns that are not specified are returned as well, but not used for ordering.

This method is equivalent to df.sort_values(columns, ascending=False).head(n), but more performant.

Parameters:
n : int

Number of rows to return.

columns : label or list of labels

Column label(s) to order by.

keep : {‘first’, ‘last’, ‘all’}, default ‘first’ (Not supported in Dask)

Where there are duplicate values:

  • first : prioritize the first occurrence(s)
  • last : prioritize the last occurrence(s)
  • all : do not drop any duplicates, even if it means selecting more than n items.

New in version 0.24.0.

Returns:
DataFrame

The first n rows ordered by the given columns in descending order.

See also

DataFrame.nsmallest
Return the first n rows ordered by columns in ascending order.
DataFrame.sort_values
Sort DataFrame by the values.
DataFrame.head
Return the first n rows without re-ordering.

Notes

This function cannot be used with all column types. For example, when specifying columns with object or category dtypes, TypeError is raised.

Examples

>>> df = pd.DataFrame({'population': [59000000, 65000000, 434000,  # doctest: +SKIP
...                                   434000, 434000, 337000, 11300,
...                                   11300, 11300],
...                    'GDP': [1937894, 2583560 , 12011, 4520, 12128,
...                            17036, 182, 38, 311],
...                    'alpha-2': ["IT", "FR", "MT", "MV", "BN",
...                                "IS", "NR", "TV", "AI"]},
...                   index=["Italy", "France", "Malta",
...                          "Maldives", "Brunei", "Iceland",
...                          "Nauru", "Tuvalu", "Anguilla"])
>>> df  # doctest: +SKIP
          population      GDP alpha-2
Italy       59000000  1937894      IT
France      65000000  2583560      FR
Malta         434000    12011      MT
Maldives      434000     4520      MV
Brunei        434000    12128      BN
Iceland       337000    17036      IS
Nauru          11300      182      NR
Tuvalu         11300       38      TV
Anguilla       11300      311      AI

In the following example, we will use nlargest to select the three rows having the largest values in column “population”.

>>> df.nlargest(3, 'population')  # doctest: +SKIP
        population      GDP alpha-2
France    65000000  2583560      FR
Italy     59000000  1937894      IT
Malta       434000    12011      MT

When using keep='last', ties are resolved in reverse order:

>>> df.nlargest(3, 'population', keep='last')  # doctest: +SKIP
        population      GDP alpha-2
France    65000000  2583560      FR
Italy     59000000  1937894      IT
Brunei      434000    12128      BN

When using keep='all', all duplicate items are maintained:

>>> df.nlargest(3, 'population', keep='all')  # doctest: +SKIP
          population      GDP alpha-2
France      65000000  2583560      FR
Italy       59000000  1937894      IT
Malta         434000    12011      MT
Maldives      434000     4520      MV
Brunei        434000    12128      BN

To order by the largest values in column “population” and then “GDP”, we can specify multiple columns like in the next example.

>>> df.nlargest(3, ['population', 'GDP'])  # doctest: +SKIP
        population      GDP alpha-2
France    65000000  2583560      FR
Italy     59000000  1937894      IT
Brunei      434000    12128      BN
notnull()

Detect existing (non-missing) values.

This docstring was copied from pandas.core.frame.DataFrame.notnull.

Some inconsistencies with the Dask version may exist.

Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False values.

Returns:
DataFrame

Mask of bool values for each element in DataFrame that indicates whether an element is not an NA value.

See also

DataFrame.notnull
Alias of notna.
DataFrame.isna
Boolean inverse of notna.
DataFrame.dropna
Omit axes labels with missing values.
notna
Top-level notna.

Examples

Show which entries in a DataFrame are not NA.

>>> df = pd.DataFrame({'age': [5, 6, np.NaN],  # doctest: +SKIP
...                    'born': [pd.NaT, pd.Timestamp('1939-05-27'),
...                             pd.Timestamp('1940-04-25')],
...                    'name': ['Alfred', 'Batman', ''],
...                    'toy': [None, 'Batmobile', 'Joker']})
>>> df  # doctest: +SKIP
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.notna()  # doctest: +SKIP
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True

Show which entries in a Series are not NA.

>>> ser = pd.Series([5, 6, np.NaN])  # doctest: +SKIP
>>> ser  # doctest: +SKIP
0    5.0
1    6.0
2    NaN
dtype: float64
>>> ser.notna()  # doctest: +SKIP
0     True
1     True
2    False
dtype: bool
npartitions

Return number of partitions

nsmallest(n=5, columns=None, split_every=None)

Return the first n rows ordered by columns in ascending order.

This docstring was copied from pandas.core.frame.DataFrame.nsmallest.

Some inconsistencies with the Dask version may exist.

Return the first n rows with the smallest values in columns, in ascending order. The columns that are not specified are returned as well, but not used for ordering.

This method is equivalent to df.sort_values(columns, ascending=True).head(n), but more performant.

Parameters:
n : int

Number of items to retrieve.

columns : list or str

Column name or names to order by.

keep : {‘first’, ‘last’, ‘all’}, default ‘first’ (Not supported in Dask)

Where there are duplicate values:

  • first : take the first occurrence.
  • last : take the last occurrence.
  • all : do not drop any duplicates, even if it means selecting more than n items.

New in version 0.24.0.

Returns:
DataFrame

See also

DataFrame.nlargest
Return the first n rows ordered by columns in descending order.
DataFrame.sort_values
Sort DataFrame by the values.
DataFrame.head
Return the first n rows without re-ordering.

Examples

>>> df = pd.DataFrame({'population': [59000000, 65000000, 434000,  # doctest: +SKIP
...                                   434000, 434000, 337000, 11300,
...                                   11300, 11300],
...                    'GDP': [1937894, 2583560 , 12011, 4520, 12128,
...                            17036, 182, 38, 311],
...                    'alpha-2': ["IT", "FR", "MT", "MV", "BN",
...                                "IS", "NR", "TV", "AI"]},
...                   index=["Italy", "France", "Malta",
...                          "Maldives", "Brunei", "Iceland",
...                          "Nauru", "Tuvalu", "Anguilla"])
>>> df  # doctest: +SKIP
          population      GDP alpha-2
Italy       59000000  1937894      IT
France      65000000  2583560      FR
Malta         434000    12011      MT
Maldives      434000     4520      MV
Brunei        434000    12128      BN
Iceland       337000    17036      IS
Nauru          11300      182      NR
Tuvalu         11300       38      TV
Anguilla       11300      311      AI

In the following example, we will use nsmallest to select the three rows having the smallest values in column “population”.

>>> df.nsmallest(3, 'population')  # doctest: +SKIP
          population  GDP alpha-2
Nauru          11300  182      NR
Tuvalu         11300   38      TV
Anguilla       11300  311      AI

When using keep='last', ties are resolved in reverse order:

>>> df.nsmallest(3, 'population', keep='last')  # doctest: +SKIP
          population  GDP alpha-2
Anguilla       11300  311      AI
Tuvalu         11300   38      TV
Nauru          11300  182      NR

When using keep='all', all duplicate items are maintained:

>>> df.nsmallest(3, 'population', keep='all')  # doctest: +SKIP
          population  GDP alpha-2
Nauru          11300  182      NR
Tuvalu         11300   38      TV
Anguilla       11300  311      AI

To order by the smallest values in column “population” and then “GDP”, we can specify multiple columns like in the next example.

>>> df.nsmallest(3, ['population', 'GDP'])  # doctest: +SKIP
          population  GDP alpha-2
Tuvalu         11300   38      TV
Nauru          11300  182      NR
Anguilla       11300  311      AI
nunique_approx(split_every=None)

Approximate number of unique rows.

This method uses the HyperLogLog algorithm for cardinality estimation to compute the approximate number of unique rows. The approximate error is 0.406%.

Parameters:
split_every : int, optional

Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used. Default is 8.

Returns:
a float representing the approximate number of elements
partitions

Slice dataframe by partitions

This allows partition-wise slicing of a Dask DataFrame. You can use normal NumPy-style slicing, but rather than slicing the elements of an array, you slice along partitions. For example, df.partitions[:5] produces a new Dask DataFrame containing the first five partitions.

Returns:
A Dask DataFrame

Examples

>>> df.partitions[0]  # doctest: +SKIP
>>> df.partitions[:3]  # doctest: +SKIP
>>> df.partitions[::10]  # doctest: +SKIP
persist(**kwargs)

Persist this dask collection into memory

This turns a lazy Dask collection into a Dask collection with the same metadata, but now with the results fully computed or actively computing in the background.

The behavior of this function differs significantly depending on the active task scheduler. If the task scheduler supports asynchronous computing, as the dask.distributed scheduler does, then persist will return immediately and the return value’s task graph will contain Dask Future objects. However, if the task scheduler only supports blocking computation, then the call to persist will block and the return value’s task graph will contain concrete Python results.

This function is particularly useful when using distributed systems, because the results will be kept in distributed memory, rather than returned to the local process as with compute.

Parameters:
scheduler : string, optional

Which scheduler to use like “threads”, “synchronous” or “processes”. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.

optimize_graph : bool, optional

If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.

**kwargs

Extra keywords to forward to the scheduler function.

Returns:
New dask collections backed by in-memory data

See also

dask.base.persist

pipe(func, *args, **kwargs)

Apply func(self, *args, **kwargs).

This docstring was copied from pandas.core.frame.DataFrame.pipe.

Some inconsistencies with the Dask version may exist.

Parameters:
func : function

Function to apply to the NDFrame. args and kwargs are passed into func. Alternatively, a (callable, data_keyword) tuple where data_keyword is a string indicating the keyword of callable that expects the NDFrame.

args : iterable, optional

positional arguments passed into func.

kwargs : mapping, optional

a dictionary of keyword arguments passed into func.

Returns:
object : the return type of func.

Notes

Use .pipe when chaining together functions that expect Series, DataFrames or GroupBy objects. Instead of writing

>>> f(g(h(df), arg1=a), arg2=b, arg3=c)  # doctest: +SKIP

You can write

>>> (df.pipe(h)  # doctest: +SKIP
...    .pipe(g, arg1=a)
...    .pipe(f, arg2=b, arg3=c)
... )

If you have a function that takes the data as (say) the second argument, pass a tuple indicating which keyword expects the data. For example, suppose f takes its data as arg2:

>>> (df.pipe(h)  # doctest: +SKIP
...    .pipe(g, arg1=a)
...    .pipe((f, 'arg2'), arg1=a, arg3=c)
...  )
pivot_table(index=None, columns=None, values=None, aggfunc='mean')

Create a spreadsheet-style pivot table as a DataFrame. The target columns must have category dtype so that the result’s columns can be inferred. index, columns, values and aggfunc must all be scalar.

Parameters:
values : scalar

column to aggregate

index : scalar

column to be index

columns : scalar

column to be columns

aggfunc : {‘mean’, ‘sum’, ‘count’}, default ‘mean’
Returns:
table : DataFrame
pop(item)

Return item and drop from frame. Raise KeyError if not found.

This docstring was copied from pandas.core.frame.DataFrame.pop.

Some inconsistencies with the Dask version may exist.

Parameters:
item : str

Column label to be popped

Returns:
popped : Series

Examples

>>> df = pd.DataFrame([('falcon', 'bird',    389.0),  # doctest: +SKIP
...                    ('parrot', 'bird',     24.0),
...                    ('lion',   'mammal',   80.5),
...                    ('monkey', 'mammal', np.nan)],
...                   columns=('name', 'class', 'max_speed'))
>>> df  # doctest: +SKIP
     name   class  max_speed
0  falcon    bird      389.0
1  parrot    bird       24.0
2    lion  mammal       80.5
3  monkey  mammal        NaN
>>> df.pop('class')  # doctest: +SKIP
0      bird
1      bird
2    mammal
3    mammal
Name: class, dtype: object
>>> df  # doctest: +SKIP
     name  max_speed
0  falcon      389.0
1  parrot       24.0
2    lion       80.5
3  monkey        NaN
pow(other, axis='columns', level=None, fill_value=None)

Exponential power of dataframe and other, element-wise (binary operator pow).

Equivalent to dataframe ** other, but with support to substitute a fill_value for missing data in one of the inputs. The reverse version is rpow.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters:
other : scalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axis : {0 or ‘index’, 1 or ‘columns’}

Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

level : int or label

Broadcast across a level, matching Index values on the passed MultiIndex level.

fill_value : float or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame

Result of the arithmetic operation.

See also

DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  # doctest: +SKIP
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  # doctest: +SKIP
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar using the operator version, which returns the same result.

>>> df + 1  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  # doctest: +SKIP
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)  # doctest: +SKIP
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  # doctest: +SKIP
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  # doctest: +SKIP
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  # doctest: +SKIP
           angles
circle          0
triangle        3
rectangle       4
>>> df * other  # doctest: +SKIP
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)  # doctest: +SKIP
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  # doctest: +SKIP
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  # doctest: +SKIP
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)  # doctest: +SKIP
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
prod(axis=None, skipna=True, split_every=False, dtype=None, out=None, min_count=None)

Return the product of the values for the requested axis.

This docstring was copied from pandas.core.frame.DataFrame.prod.

Some inconsistencies with the Dask version may exist.

Parameters:
axis : {index (0), columns (1)}

Axis for the function to be applied on.

skipna : bool, default True

Exclude NA/null values when computing the result.

level : int or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

numeric_only : bool, default None (Not supported in Dask)

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

min_count : int, default 0

The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.

**kwargs

Additional keyword arguments to be passed to the function.

Returns:
prod : Series or DataFrame (if level specified)

Examples

By default, the product of an empty or all-NA Series is 1

>>> pd.Series([]).prod()  # doctest: +SKIP
1.0

This can be controlled with the min_count parameter

>>> pd.Series([]).prod(min_count=1)  # doctest: +SKIP
nan

Thanks to the skipna parameter, min_count handles all-NA and empty series identically.

>>> pd.Series([np.nan]).prod()  # doctest: +SKIP
1.0
>>> pd.Series([np.nan]).prod(min_count=1)  # doctest: +SKIP
nan
quantile(q=0.5, axis=0, method='default')

Approximate row-wise and precise column-wise quantiles of DataFrame

Parameters:
q : list/array of floats, default 0.5 (50%)

Iterable of numbers ranging from 0 to 1 for the desired quantiles

axis : {0, 1, ‘index’, ‘columns’} (default 0)

0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise

method : {‘default’, ‘tdigest’, ‘dask’}, optional

What method to use. By default will use dask’s internal custom algorithm ('dask'). If set to 'tdigest' will use tdigest for floats and ints and fallback to the 'dask' otherwise.

query(expr, **kwargs)

Filter dataframe with complex expression

Blocked version of pd.DataFrame.query

This is like the sequential version, except that it also runs in many threads. This may conflict with numexpr, which uses multiple threads itself. We recommend that you set numexpr to use a single thread:

import numexpr
numexpr.set_num_threads(1)
radd(other, axis='columns', level=None, fill_value=None)

Addition of dataframe and other, element-wise (binary operator radd).

Equivalent to other + dataframe, but with support to substitute a fill_value for missing data in one of the inputs. The reverse version is add.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters:
other : scalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axis : {0 or ‘index’, 1 or ‘columns’}

Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

level : int or label

Broadcast across a level, matching Index values on the passed MultiIndex level.

fill_value : float or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame

Result of the arithmetic operation.

See also

DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  # doctest: +SKIP
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  # doctest: +SKIP
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar using the operator version, which returns the same result.

>>> df + 1  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  # doctest: +SKIP
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)  # doctest: +SKIP
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  # doctest: +SKIP
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  # doctest: +SKIP
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  # doctest: +SKIP
           angles
circle          0
triangle        3
rectangle       4
>>> df * other  # doctest: +SKIP
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)  # doctest: +SKIP
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  # doctest: +SKIP
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  # doctest: +SKIP
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)  # doctest: +SKIP
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
random_split(frac, random_state=None)

Pseudorandomly split dataframe into different pieces row-wise

Parameters:
frac : list

List of floats that should sum to one.

random_state : int or np.random.RandomState

If int, create a new RandomState with this as the seed. Otherwise, draw from the passed RandomState.

See also

dask.DataFrame.sample

Examples

50/50 split

>>> a, b = df.random_split([0.5, 0.5])  # doctest: +SKIP

80/10/10 split, consistent random_state

>>> a, b, c = df.random_split([0.8, 0.1, 0.1], random_state=123)  # doctest: +SKIP
rdiv(other, axis='columns', level=None, fill_value=None)

Floating division of dataframe and other, element-wise (binary operator rtruediv).

Equivalent to other / dataframe, but with support to substitute a fill_value for missing data in one of the inputs. The reverse version is truediv.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters:
other : scalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axis : {0 or ‘index’, 1 or ‘columns’}

Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

level : int or label

Broadcast across a level, matching Index values on the passed MultiIndex level.

fill_value : float or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame

Result of the arithmetic operation.

See also

DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  # doctest: +SKIP
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  # doctest: +SKIP
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar using the operator version, which returns the same result.

>>> df + 1  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  # doctest: +SKIP
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)  # doctest: +SKIP
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  # doctest: +SKIP
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  # doctest: +SKIP
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  # doctest: +SKIP
           angles
circle          0
triangle        3
rectangle       4
>>> df * other  # doctest: +SKIP
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)  # doctest: +SKIP
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  # doctest: +SKIP
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  # doctest: +SKIP
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)  # doctest: +SKIP
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
reduction(chunk, aggregate=None, combine=None, meta='__no_default__', token=None, split_every=None, chunk_kwargs=None, aggregate_kwargs=None, combine_kwargs=None, **kwargs)

Generic row-wise reductions.

Parameters:
chunk : callable

Function to operate on each partition. Should return a pandas.DataFrame, pandas.Series, or a scalar.

aggregate : callable, optional

Function to operate on the concatenated result of chunk. If not specified, defaults to chunk. Used to do the final aggregation in a tree reduction.

The input to aggregate depends on the output of chunk. If the output of chunk is a:

  • scalar: Input is a Series, with one row per partition.
  • Series: Input is a DataFrame, with one row per partition. Columns are the rows in the output series.
  • DataFrame: Input is a DataFrame, with one row per partition. Columns are the columns in the output dataframes.

Should return a pandas.DataFrame, pandas.Series, or a scalar.

combine : callable, optional

Function to operate on intermediate concatenated results of chunk in a tree-reduction. If not provided, defaults to aggregate. The input/output requirements should match that of aggregate described above.

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

token : str, optional

The name to use for the output keys.

split_every : int, optional

Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used, and all intermediates will be concatenated and passed to aggregate. Default is 8.

chunk_kwargs : dict, optional

Keyword arguments to pass on to chunk only.

aggregate_kwargs : dict, optional

Keyword arguments to pass on to aggregate only.

combine_kwargs : dict, optional

Keyword arguments to pass on to combine only.

kwargs :

All remaining keywords will be passed to chunk, combine, and aggregate.

Examples

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': range(50), 'y': range(50, 100)})
>>> ddf = dd.from_pandas(df, npartitions=4)

Count the number of rows in a DataFrame. To do this, count the number of rows in each partition, then sum the results:

>>> res = ddf.reduction(lambda x: x.count(),
...                     aggregate=lambda x: x.sum())
>>> res.compute()
x    50
y    50
dtype: int64

Count the number of rows in a Series with elements greater than or equal to a value (provided via a keyword).

>>> def count_greater(x, value=0):
...     return (x >= value).sum()
>>> res = ddf.x.reduction(count_greater, aggregate=lambda x: x.sum(),
...                       chunk_kwargs={'value': 25})
>>> res.compute()
25

Aggregate both the sum and count of a Series at the same time:

>>> def sum_and_count(x):
...     return pd.Series({'count': x.count(), 'sum': x.sum()},
...                      index=['count', 'sum'])
>>> res = ddf.x.reduction(sum_and_count, aggregate=lambda x: x.sum())
>>> res.compute()
count      50
sum      1225
dtype: int64

Do the same for a DataFrame. Here chunk returns a DataFrame, so the input to aggregate is a DataFrame whose index has non-unique entries for both ‘x’ and ‘y’. We group by the index and sum each group to get the final result.

>>> def sum_and_count(x):
...     return pd.DataFrame({'count': x.count(), 'sum': x.sum()},
...                         columns=['count', 'sum'])
>>> res = ddf.reduction(sum_and_count,
...                     aggregate=lambda x: x.groupby(level=0).sum())
>>> res.compute()
   count   sum
x     50  1225
y     50  3725
rename(index=None, columns=None)

Alter axes labels.

This docstring was copied from pandas.core.frame.DataFrame.rename.

Some inconsistencies with the Dask version may exist.

Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is. Extra labels listed don’t throw an error.

See the user guide for more.

Parameters:
mapper, index, columns : dict-like or function, optional

Dict-like or function transformations to apply to that axis’ values. Use either mapper and axis to specify the axis to target with mapper, or index and columns.

axis : int or str, optional (Not supported in Dask)

Axis to target with mapper. Can be either the axis name (‘index’, ‘columns’) or number (0, 1). The default is ‘index’.

copy : boolean, default True (Not supported in Dask)

Also copy underlying data

inplace : boolean, default False (Not supported in Dask)

Whether to return a new DataFrame. If True then value of copy is ignored.

level : int or level name, default None (Not supported in Dask)

In case of a MultiIndex, only rename labels in the specified level.

Returns:
renamed : DataFrame

Examples

DataFrame.rename supports two calling conventions:

  • (index=index_mapper, columns=columns_mapper, ...)
  • (mapper, axis={'index', 'columns'}, ...)

We highly recommend using keyword arguments to clarify your intent.

>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})  # doctest: +SKIP
>>> df.rename(index=str, columns={"A": "a", "B": "c"})  # doctest: +SKIP
   a  c
0  1  4
1  2  5
2  3  6
>>> df.rename(index=str, columns={"A": "a", "C": "c"})  # doctest: +SKIP
   a  B
0  1  4
1  2  5
2  3  6

Using axis-style parameters

>>> df.rename(str.lower, axis='columns')  # doctest: +SKIP
   a  b
0  1  4
1  2  5
2  3  6
>>> df.rename({1: 2, 2: 4}, axis='index')  # doctest: +SKIP
   A  B
0  1  4
2  2  5
4  3  6
repartition(divisions=None, npartitions=None, partition_size=None, freq=None, force=False)

Repartition dataframe along new divisions

Parameters:
divisions : list, optional

List of partitions to be used. Only used if npartitions and partition_size aren’t specified.

npartitions : int, optional

Number of partitions of output. Only used if partition_size isn’t specified.

partition_size : int or string, optional

Max number of bytes of memory for each partition. Use numbers or strings like 5MB. If specified, npartitions and divisions will be ignored.

Warning

This keyword argument triggers computation to determine the memory size of each partition, which may be expensive.

freq : str, pd.Timedelta

A period on which to partition timeseries data like '7D' or '12h' or pd.Timedelta(hours=12). Assumes a datetime index.

force : bool, default False

Allows the expansion of the existing divisions. If False, the lower and upper bounds of the new divisions must be the same as those of the old divisions.

Notes

Exactly one of divisions, npartitions, partition_size, or freq should be specified. A ValueError will be raised when that is not the case.

Examples

>>> df = df.repartition(npartitions=10)  # doctest: +SKIP
>>> df = df.repartition(divisions=[0, 5, 10, 20])  # doctest: +SKIP
>>> df = df.repartition(freq='7d')  # doctest: +SKIP
replace(to_replace=None, value=None, regex=False)

Replace values given in to_replace with value.

This docstring was copied from pandas.core.frame.DataFrame.replace.

Some inconsistencies with the Dask version may exist.

Values of the DataFrame are replaced with other values dynamically. This differs from updating with .loc or .iloc, which require you to specify a location to update with some value.

Parameters:
to_replace : str, regex, list, dict, Series, int, float, or None

How to find the values that will be replaced.

  • numeric, str or regex:

    • numeric: numeric values equal to to_replace will be replaced with value
    • str: string exactly matching to_replace will be replaced with value
    • regex: regexes matching to_replace will be replaced with value
  • list of str, regex, or numeric:

    • First, if to_replace and value are both lists, they must be the same length.
    • Second, if regex=True then all of the strings in both lists will be interpreted as regexes; otherwise they will match directly. This doesn’t matter much for value since there are only a few possible substitution regexes you can use.
    • str, regex and numeric rules apply as above.
  • dict:

    • Dicts can be used to specify different replacement values for different existing values. For example, {'a': 'b', 'y': 'z'} replaces the value ‘a’ with ‘b’ and ‘y’ with ‘z’. To use a dict in this way the value parameter should be None.
    • For a DataFrame a dict can specify that different values should be replaced in different columns. For example, {'a': 1, 'b': 'z'} looks for the value 1 in column ‘a’ and the value ‘z’ in column ‘b’ and replaces these values with whatever is specified in value. The value parameter should not be None in this case. You can treat this as a special case of passing two lists except that you are specifying the column to search in.
    • For a DataFrame nested dictionaries, e.g., {'a': {'b': np.nan}}, are read as follows: look in column ‘a’ for the value ‘b’ and replace it with NaN. The value parameter should be None to use a nested dict in this way. You can nest regular expressions as well. Note that column names (the top-level dictionary keys in a nested dictionary) cannot be regular expressions.
  • None:

    • This means that the regex argument must be a string, compiled regular expression, or list, dict, ndarray or Series of such elements. If value is also None then this must be a nested dictionary or Series.

See the examples section for examples of each of these.

value : scalar, dict, list, str, regex, default None

Value to replace any values matching to_replace with. For a DataFrame a dict of values can be used to specify which value to use for each column (columns not in the dict will not be filled). Regular expressions, strings and lists or dicts of such objects are also allowed.

inplace : bool, default False (Not supported in Dask)

If True, perform the replacement in place. Note: this will modify any other views on this object (e.g. a column from a DataFrame). Returns the caller if this is True.

limit : int, default None (Not supported in Dask)

Maximum size gap to forward or backward fill.

regex : bool or same types as to_replace, default False

Whether to interpret to_replace and/or value as regular expressions. If this is True then to_replace must be a string. Alternatively, this could be a regular expression or a list, dict, or array of regular expressions in which case to_replace must be None.

method : {‘pad’, ‘ffill’, ‘bfill’, None} (Not supported in Dask)

The method to use for replacement, when to_replace is a scalar, list or tuple and value is None.

Changed in version 0.23.0: Added to DataFrame.

Returns:
DataFrame

Object after replacement.

Raises:
AssertionError
  • If regex is not a bool and to_replace is not None.
TypeError
  • If to_replace is a dict and value is not a list, dict, ndarray, or Series
  • If to_replace is None and regex is not compilable into a regular expression or is a list, dict, ndarray, or Series.
  • When replacing multiple bool or datetime64 objects and the arguments to to_replace do not match the type of the value being replaced
ValueError
  • If a list or an ndarray is passed to to_replace and value but they are not the same length.

See also

DataFrame.fillna
Fill NA values.
DataFrame.where
Replace values based on boolean condition.
Series.str.replace
Simple string replacement.

Notes

  • Regex substitution is performed under the hood with re.sub. The rules for substitution for re.sub are the same.
  • Regular expressions will only substitute on strings, meaning you cannot provide, for example, a regular expression matching floating point numbers and expect the columns in your frame that have a numeric dtype to be matched. However, if those floating point numbers are strings, then you can do this.
  • This method has a lot of options. You are encouraged to experiment and play with this method to gain intuition about how it works.
  • When a dict is used as the to_replace value, the key(s) in the dict act as the to_replace part and the value(s) in the dict act as the value parameter.

Examples

Scalar `to_replace` and `value`

>>> s = pd.Series([0, 1, 2, 3, 4])  # doctest: +SKIP
>>> s.replace(0, 5)  # doctest: +SKIP
0    5
1    1
2    2
3    3
4    4
dtype: int64
>>> df = pd.DataFrame({'A': [0, 1, 2, 3, 4],  # doctest: +SKIP
...                    'B': [5, 6, 7, 8, 9],
...                    'C': ['a', 'b', 'c', 'd', 'e']})
>>> df.replace(0, 5)  # doctest: +SKIP
   A  B  C
0  5  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e

List-like `to_replace`

>>> df.replace([0, 1, 2, 3], 4)  # doctest: +SKIP
   A  B  C
0  4  5  a
1  4  6  b
2  4  7  c
3  4  8  d
4  4  9  e
>>> df.replace([0, 1, 2, 3], [4, 3, 2, 1])  # doctest: +SKIP
   A  B  C
0  4  5  a
1  3  6  b
2  2  7  c
3  1  8  d
4  4  9  e
>>> s.replace([1, 2], method='bfill')  # doctest: +SKIP
0    0
1    3
2    3
3    3
4    4
dtype: int64

dict-like `to_replace`

>>> df.replace({0: 10, 1: 100})  # doctest: +SKIP
     A  B  C
0   10  5  a
1  100  6  b
2    2  7  c
3    3  8  d
4    4  9  e
>>> df.replace({'A': 0, 'B': 5}, 100)  # doctest: +SKIP
     A    B  C
0  100  100  a
1    1    6  b
2    2    7  c
3    3    8  d
4    4    9  e
>>> df.replace({'A': {0: 100, 4: 400}})  # doctest: +SKIP
     A  B  C
0  100  5  a
1    1  6  b
2    2  7  c
3    3  8  d
4  400  9  e

Regular expression `to_replace`

>>> df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],  # doctest: +SKIP
...                    'B': ['abc', 'bar', 'xyz']})
>>> df.replace(to_replace=r'^ba.$', value='new', regex=True)  # doctest: +SKIP
      A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)  # doctest: +SKIP
      A    B
0   new  abc
1   foo  bar
2  bait  xyz
>>> df.replace(regex=r'^ba.$', value='new')  # doctest: +SKIP
      A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> df.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'})  # doctest: +SKIP
      A    B
0   new  abc
1   xyz  new
2  bait  xyz
>>> df.replace(regex=[r'^ba.$', 'foo'], value='new')  # doctest: +SKIP
      A    B
0   new  abc
1   new  new
2  bait  xyz

Note that when replacing multiple bool or datetime64 objects, the data types in the to_replace parameter must match the data type of the value being replaced:

>>> df = pd.DataFrame({'A': [True, False, True],  # doctest: +SKIP
...                    'B': [False, True, False]})
>>> df.replace({'a string': 'new value', True: False})  # raises  # doctest: +SKIP
Traceback (most recent call last):
    ...
TypeError: Cannot compare types 'ndarray(dtype=bool)' and 'str'

This raises a TypeError because one of the dict keys is not of the correct type for replacement.

Compare the behavior of s.replace({'a': None}) and s.replace('a', None) to understand the peculiarities of the to_replace parameter:

>>> s = pd.Series([10, 'a', 'a', 'b', 'a'])  # doctest: +SKIP

When a dict is used as the to_replace value, the value(s) in the dict play the role of the value parameter. s.replace({'a': None}) is equivalent to s.replace(to_replace={'a': None}, value=None, method=None):

>>> s.replace({'a': None})  # doctest: +SKIP
0      10
1    None
2    None
3       b
4    None
dtype: object

When value=None and to_replace is a scalar, list or tuple, replace uses the method parameter (default ‘pad’) to do the replacement. This is why the ‘a’ values are replaced by 10 in rows 1 and 2 and by ‘b’ in row 4. The command s.replace('a', None) is actually equivalent to s.replace(to_replace='a', value=None, method='pad'):

>>> s.replace('a', None)  # doctest: +SKIP
0    10
1    10
2    10
3     b
4     b
dtype: object
resample(rule, closed=None, label=None)

Resample time-series data.

This docstring was copied from pandas.core.frame.DataFrame.resample.

Some inconsistencies with the Dask version may exist.

Convenience method for frequency conversion and resampling of time series. Object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or pass datetime-like values to the on or level keyword.

Parameters:
rule : str

The offset string or object representing target conversion.

how : str (Not supported in Dask)

Method for down/re-sampling, defaults to ‘mean’ for downsampling.

Deprecated since version 0.18.0: The new syntax is .resample(...).mean(), or .resample(...).apply(<func>)

axis : {0 or ‘index’, 1 or ‘columns’}, default 0 (Not supported in Dask)

Which axis to use for up- or down-sampling. For Series this will default to 0, i.e. along the rows. Must be DatetimeIndex, TimedeltaIndex or PeriodIndex.

fill_method : str, default None (Not supported in Dask)

Filling method for upsampling.

Deprecated since version 0.18.0: The new syntax is .resample(...).<func>(), e.g. .resample(...).pad()

closed : {‘right’, ‘left’}, default None

Which side of bin interval is closed. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.

label : {‘right’, ‘left’}, default None

Which bin edge label to label bucket with. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.

convention : {‘start’, ‘end’, ‘s’, ‘e’}, default ‘start’ (Not supported in Dask)

For PeriodIndex only, controls whether to use the start or end of rule.

kind : {‘timestamp’, ‘period’}, optional, default None (Not supported in Dask)

Pass ‘timestamp’ to convert the resulting index to a DateTimeIndex or ‘period’ to convert it to a PeriodIndex. By default the input representation is retained.

loffset : timedelta, default None (Not supported in Dask)

Adjust the resampled time labels.

limit : int, default None (Not supported in Dask)

Maximum size gap when reindexing with fill_method.

Deprecated since version 0.18.0.

base : int, default 0 (Not supported in Dask)

For frequencies that evenly subdivide 1 day, the “origin” of the aggregated intervals. For example, for ‘5min’ frequency, base could range from 0 through 4. Defaults to 0.

on : str, optional (Not supported in Dask)

For a DataFrame, column to use instead of index for resampling. Column must be datetime-like.

New in version 0.19.0.

level : str or int, optional (Not supported in Dask)

For a MultiIndex, level (name or number) to use for resampling. level must be datetime-like.

New in version 0.19.0.

Returns:
Resampler object

See also

groupby
Group by mapping, function, label, or list of labels.
Series.resample
Resample a Series.
DataFrame.resample
Resample a DataFrame.

Notes

See the user guide for more.

To learn more about the offset strings, please see this link.

Examples

Start by creating a series with 9 one-minute timestamps.

>>> index = pd.date_range('1/1/2000', periods=9, freq='T')  # doctest: +SKIP
>>> series = pd.Series(range(9), index=index)  # doctest: +SKIP
>>> series  # doctest: +SKIP
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: T, dtype: int64

Downsample the series into 3 minute bins and sum the values of the timestamps falling into a bin.

>>> series.resample('3T').sum()  # doctest: +SKIP
2000-01-01 00:00:00     3
2000-01-01 00:03:00    12
2000-01-01 00:06:00    21
Freq: 3T, dtype: int64

Downsample the series into 3 minute bins as above, but label each bin using the right edge instead of the left. Note that the value in the bucket used as the label is not included in the bucket that it labels. For example, in the original series the bucket 2000-01-01 00:03:00 contains the value 3, but the summed value in the resampled bucket with the label 2000-01-01 00:03:00 does not include 3 (if it did, the summed value would be 6, not 3). To include this value, close the right side of the bin interval, as illustrated in the example below this one.

>>> series.resample('3T', label='right').sum()  # doctest: +SKIP
2000-01-01 00:03:00     3
2000-01-01 00:06:00    12
2000-01-01 00:09:00    21
Freq: 3T, dtype: int64

Downsample the series into 3 minute bins as above, but close the right side of the bin interval.

>>> series.resample('3T', label='right', closed='right').sum()  # doctest: +SKIP
2000-01-01 00:00:00     0
2000-01-01 00:03:00     6
2000-01-01 00:06:00    15
2000-01-01 00:09:00    15
Freq: 3T, dtype: int64

Upsample the series into 30 second bins.

>>> series.resample('30S').asfreq()[0:5]   # Select first 5 rows  # doctest: +SKIP
2000-01-01 00:00:00   0.0
2000-01-01 00:00:30   NaN
2000-01-01 00:01:00   1.0
2000-01-01 00:01:30   NaN
2000-01-01 00:02:00   2.0
Freq: 30S, dtype: float64

Upsample the series into 30 second bins and fill the NaN values using the pad method.

>>> series.resample('30S').pad()[0:5]  # doctest: +SKIP
2000-01-01 00:00:00    0
2000-01-01 00:00:30    0
2000-01-01 00:01:00    1
2000-01-01 00:01:30    1
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64

Upsample the series into 30 second bins and fill the NaN values using the bfill method.

>>> series.resample('30S').bfill()[0:5]  # doctest: +SKIP
2000-01-01 00:00:00    0
2000-01-01 00:00:30    1
2000-01-01 00:01:00    1
2000-01-01 00:01:30    2
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64

Pass a custom function via apply

>>> def custom_resampler(array_like):  # doctest: +SKIP
...     return np.sum(array_like) + 5
...
>>> series.resample('3T').apply(custom_resampler)  # doctest: +SKIP
2000-01-01 00:00:00     8
2000-01-01 00:03:00    17
2000-01-01 00:06:00    26
Freq: 3T, dtype: int64

For a Series with a PeriodIndex, the keyword convention can be used to control whether to use the start or end of rule.

Resample a year by quarter using ‘start’ convention. Values are assigned to the first quarter of the period.

>>> s = pd.Series([1, 2], index=pd.period_range('2012-01-01',  # doctest: +SKIP
...                                             freq='A',
...                                             periods=2))
>>> s  # doctest: +SKIP
2012    1
2013    2
Freq: A-DEC, dtype: int64
>>> s.resample('Q', convention='start').asfreq()  # doctest: +SKIP
2012Q1    1.0
2012Q2    NaN
2012Q3    NaN
2012Q4    NaN
2013Q1    2.0
2013Q2    NaN
2013Q3    NaN
2013Q4    NaN
Freq: Q-DEC, dtype: float64

Resample quarters by month using ‘end’ convention. Values are assigned to the last month of the period.

>>> q = pd.Series([1, 2, 3, 4], index=pd.period_range('2018-01-01',  # doctest: +SKIP
...                                                   freq='Q',
...                                                   periods=4))
>>> q  # doctest: +SKIP
2018Q1    1
2018Q2    2
2018Q3    3
2018Q4    4
Freq: Q-DEC, dtype: int64
>>> q.resample('M', convention='end').asfreq()  # doctest: +SKIP
2018-03    1.0
2018-04    NaN
2018-05    NaN
2018-06    2.0
2018-07    NaN
2018-08    NaN
2018-09    3.0
2018-10    NaN
2018-11    NaN
2018-12    4.0
Freq: M, dtype: float64

For DataFrame objects, the keyword on can be used to specify the column instead of the index for resampling.

>>> d = dict({'price': [10, 11, 9, 13, 14, 18, 17, 19],  # doctest: +SKIP
...           'volume': [50, 60, 40, 100, 50, 100, 40, 50]})
>>> df = pd.DataFrame(d)  # doctest: +SKIP
>>> df['week_starting'] = pd.date_range('01/01/2018',  # doctest: +SKIP
...                                     periods=8,
...                                     freq='W')
>>> df  # doctest: +SKIP
   price  volume week_starting
0     10      50    2018-01-07
1     11      60    2018-01-14
2      9      40    2018-01-21
3     13     100    2018-01-28
4     14      50    2018-02-04
5     18     100    2018-02-11
6     17      40    2018-02-18
7     19      50    2018-02-25
>>> df.resample('M', on='week_starting').mean()  # doctest: +SKIP
               price  volume
week_starting
2018-01-31     10.75    62.5
2018-02-28     17.00    60.0

For a DataFrame with MultiIndex, the keyword level can be used to specify on which level the resampling needs to take place.

>>> days = pd.date_range('1/1/2000', periods=4, freq='D')  # doctest: +SKIP
>>> d2 = dict({'price': [10, 11, 9, 13, 14, 18, 17, 19],  # doctest: +SKIP
...            'volume': [50, 60, 40, 100, 50, 100, 40, 50]})
>>> df2 = pd.DataFrame(d2,  # doctest: +SKIP
...                    index=pd.MultiIndex.from_product([days,
...                                                     ['morning',
...                                                      'afternoon']]
...                                                     ))
>>> df2  # doctest: +SKIP
                      price  volume
2000-01-01 morning       10      50
           afternoon     11      60
2000-01-02 morning        9      40
           afternoon     13     100
2000-01-03 morning       14      50
           afternoon     18     100
2000-01-04 morning       17      40
           afternoon     19      50
>>> df2.resample('D', level=0).sum()  # doctest: +SKIP
            price  volume
2000-01-01     21     110
2000-01-02     22     140
2000-01-03     32     150
2000-01-04     36      90
reset_index(drop=False)

Reset the index to the default index.

Note that unlike in pandas, the reset dask.dataframe index will not be monotonically increasing from 0. Instead, it will restart at 0 for each partition (e.g. index1 = [0, ..., 10], index2 = [0, ...]). This is due to the inability to statically know the full length of the index.

For a DataFrame with a multi-level index, returns a new DataFrame with labeling information in the columns under the index names, defaulting to ‘level_0’, ‘level_1’, etc. if any are None. For a standard index, the index name will be used (if set), otherwise a default ‘index’ or ‘level_0’ (if ‘index’ is already taken) will be used.

Parameters:
drop : boolean, default False

Do not try to insert index into dataframe columns.

rfloordiv(other, axis='columns', level=None, fill_value=None)

Integer division of dataframe and other, element-wise (binary operator rfloordiv).

Equivalent to other // dataframe, but with support for substituting a fill_value for missing data in one of the inputs. This is the reversed counterpart of floordiv.

This method is one of the flexible wrappers (add, sub, mul, div, mod, pow) around the arithmetic operators +, -, *, /, //, %, **.

Parameters:
other : scalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axis : {0 or ‘index’, 1 or ‘columns’}

Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

level : int or label

Broadcast across a level, matching Index values on the passed MultiIndex level.

fill_value : float or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame

Result of the arithmetic operation.

See also

DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  # doctest: +SKIP
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  # doctest: +SKIP
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar using the operator version, which returns the same result as the method version.

>>> df + 1  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by a constant, then apply the reverse version.

>>> df.div(10)  # doctest: +SKIP
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)  # doctest: +SKIP
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and a Series along an axis, using both the operator and the method version.

>>> df - [1, 2]  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  # doctest: +SKIP
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply by a DataFrame of a different shape using the operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  # doctest: +SKIP
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  # doctest: +SKIP
           angles
circle          0
triangle        3
rectangle       4
>>> df * other  # doctest: +SKIP
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)  # doctest: +SKIP
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a DataFrame with a MultiIndex, matching on a level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  # doctest: +SKIP
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  # doctest: +SKIP
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)  # doctest: +SKIP
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
rmod(other, axis='columns', level=None, fill_value=None)

Modulo of dataframe and other, element-wise (binary operator rmod).

Equivalent to other % dataframe, but with support for substituting a fill_value for missing data in one of the inputs. This is the reversed counterpart of mod.

This method is one of the flexible wrappers (add, sub, mul, div, mod, pow) around the arithmetic operators +, -, *, /, //, %, **.

Parameters:
other : scalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axis : {0 or ‘index’, 1 or ‘columns’}

Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

level : int or label

Broadcast across a level, matching Index values on the passed MultiIndex level.

fill_value : float or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame

Result of the arithmetic operation.

See also

DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  # doctest: +SKIP
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  # doctest: +SKIP
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar using the operator version, which returns the same result as the method version.

>>> df + 1  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by a constant, then apply the reverse version.

>>> df.div(10)  # doctest: +SKIP
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)  # doctest: +SKIP
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and a Series along an axis, using both the operator and the method version.

>>> df - [1, 2]  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  # doctest: +SKIP
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply by a DataFrame of a different shape using the operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  # doctest: +SKIP
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  # doctest: +SKIP
           angles
circle          0
triangle        3
rectangle       4
>>> df * other  # doctest: +SKIP
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)  # doctest: +SKIP
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a DataFrame with a MultiIndex, matching on a level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  # doctest: +SKIP
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  # doctest: +SKIP
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)  # doctest: +SKIP
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
rmul(other, axis='columns', level=None, fill_value=None)

Multiplication of dataframe and other, element-wise (binary operator rmul).

Equivalent to other * dataframe, but with support for substituting a fill_value for missing data in one of the inputs. The non-reversed counterpart is mul.

One of the flexible wrappers (add, sub, mul, div, mod, pow) around the arithmetic operators +, -, *, /, //, %, **.

Parameters:
other : scalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axis : {0 or ‘index’, 1 or ‘columns’}

Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

level : int or label

Broadcast across a level, matching Index values on the passed MultiIndex level.

fill_value : float or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame

Result of the arithmetic operation.

See also

DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.

Notes

Mismatched indices will be unioned together.
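
As a rough illustration of the index union and of fill_value, the sketch below uses plain pandas and assumes the Dask method follows the same semantics. The frame named other and its values are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"angles": [0, 3, 4], "degrees": [360, 180, 360]},
                  index=["circle", "triangle", "rectangle"])
# Hypothetical right-hand operand: no 'degrees' column, no 'rectangle' row.
other = pd.DataFrame({"angles": [2, 2]}, index=["circle", "triangle"])

# Operator form: labels are unioned, cells missing on either side become NaN.
plain = other * df

# rmul computes other * df, but a cell missing on one side is treated as 1.
filled = df.rmul(other, fill_value=1)

print(plain)
print(filled)
```

With fill_value=1 the cells that exist only in df pass through unchanged, which is often the intent when the other operand is a sparse set of scaling factors.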

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  # doctest: +SKIP
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  # doctest: +SKIP
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same result.

>>> df + 1  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  # doctest: +SKIP
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)  # doctest: +SKIP
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  # doctest: +SKIP
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  # doctest: +SKIP
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  # doctest: +SKIP
           angles
circle          0
triangle        3
rectangle       4
>>> df * other  # doctest: +SKIP
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)  # doctest: +SKIP
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  # doctest: +SKIP
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  # doctest: +SKIP
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)  # doctest: +SKIP
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
rolling(window, min_periods=None, center=False, win_type=None, axis=0)

Provides rolling transformations.

Parameters:
window : int, str, offset

Size of the moving window. This is the number of observations used for calculating the statistic. When not using a DatetimeIndex, the window size must not be so large as to span more than one adjacent partition. If using an offset or offset alias like ‘5D’, the data must have a DatetimeIndex.

Changed in version 0.15.0: Now accepts offsets and string offset aliases

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

center : boolean, default False

Set the labels at the center of the window.

win_type : string, default None

Provide a window type. The recognized window types are identical to pandas.

axis : int, default 0
Returns:
a Rolling object on which to call a method to compute a statistic
round(decimals=0)

Round a DataFrame to a variable number of decimal places.

This docstring was copied from pandas.core.frame.DataFrame.round.

Some inconsistencies with the Dask version may exist.

Parameters:
decimals : int, dict, Series

Number of decimal places to round each column to. If an int is given, round each column to the same number of places. Otherwise dict and Series round to variable numbers of places. Column names should be in the keys if decimals is a dict-like, or in the index if decimals is a Series. Any columns not included in decimals will be left as is. Elements of decimals which are not columns of the input will be ignored.

Returns:
DataFrame

Examples

>>> df = pd.DataFrame(np.random.random([3, 3]),  # doctest: +SKIP
...     columns=['A', 'B', 'C'], index=['first', 'second', 'third'])
>>> df  # doctest: +SKIP
               A         B         C
first   0.028208  0.992815  0.173891
second  0.038683  0.645646  0.577595
third   0.877076  0.149370  0.491027
>>> df.round(2)  # doctest: +SKIP
           A     B     C
first   0.03  0.99  0.17
second  0.04  0.65  0.58
third   0.88  0.15  0.49
>>> df.round({'A': 1, 'C': 2})  # doctest: +SKIP
          A         B     C
first   0.0  0.992815  0.17
second  0.0  0.645646  0.58
third   0.9  0.149370  0.49
>>> decimals = pd.Series([1, 0, 2], index=['A', 'B', 'C'])  # doctest: +SKIP
>>> df.round(decimals)  # doctest: +SKIP
          A  B     C
first   0.0  1  0.17
second  0.0  1  0.58
third   0.9  0  0.49
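
The handling of missing and extra keys in decimals can be sketched with fixed values. This uses pandas directly and assumes the Dask version behaves the same way; the key Z is a deliberately nonexistent column.

```python
import pandas as pd

df = pd.DataFrame({"A": [0.028208, 0.877076],
                   "B": [0.992815, 0.149370]},
                  index=["first", "second"])

# 'A' is rounded to 1 place; 'B' is not listed, so it is left as is;
# 'Z' is not a column of df, so that entry is ignored.
rounded = df.round({"A": 1, "Z": 3})
print(rounded)
```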
rpow(other, axis='columns', level=None, fill_value=None)

Exponential power of dataframe and other, element-wise (binary operator rpow).

Equivalent to other ** dataframe, but with support for substituting a fill_value for missing data in one of the inputs. The non-reversed counterpart is pow.

One of the flexible wrappers (add, sub, mul, div, mod, pow) around the arithmetic operators +, -, *, /, //, %, **.

Parameters:
other : scalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axis : {0 or ‘index’, 1 or ‘columns’}

Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

level : int or label

Broadcast across a level, matching Index values on the passed MultiIndex level.

fill_value : float or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame

Result of the arithmetic operation.

See also

DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.

Notes

Mismatched indices will be unioned together.
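
Because exponentiation is not commutative, the difference between pow and rpow is easy to see with a scalar. The sketch below uses pandas and assumes the Dask method mirrors it.

```python
import pandas as pd

df = pd.DataFrame({"angles": [0, 3, 4]},
                  index=["circle", "triangle", "rectangle"])

forward = df.pow(2)   # df ** 2: each element squared
reverse = df.rpow(2)  # 2 ** df: 2 raised to each element

print(forward)
print(reverse)
```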

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  # doctest: +SKIP
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  # doctest: +SKIP
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same result.

>>> df + 1  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  # doctest: +SKIP
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)  # doctest: +SKIP
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  # doctest: +SKIP
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  # doctest: +SKIP
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  # doctest: +SKIP
           angles
circle          0
triangle        3
rectangle       4
>>> df * other  # doctest: +SKIP
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)  # doctest: +SKIP
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  # doctest: +SKIP
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  # doctest: +SKIP
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)  # doctest: +SKIP
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
rsub(other, axis='columns', level=None, fill_value=None)

Subtraction of dataframe and other, element-wise (binary operator rsub).

Equivalent to other - dataframe, but with support for substituting a fill_value for missing data in one of the inputs. The non-reversed counterpart is sub.

One of the flexible wrappers (add, sub, mul, div, mod, pow) around the arithmetic operators +, -, *, /, //, %, **.

Parameters:
other : scalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axis : {0 or ‘index’, 1 or ‘columns’}

Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

level : int or label

Broadcast across a level, matching Index values on the passed MultiIndex level.

fill_value : float or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame

Result of the arithmetic operation.

See also

DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.

Notes

Mismatched indices will be unioned together.
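
A short sketch of rsub with a Series matched on the columns, using pandas and assuming the Dask method behaves the same. The Series named limits is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"angles": [0, 3, 4], "degrees": [360, 180, 360]},
                  index=["circle", "triangle", "rectangle"])
# Hypothetical per-column upper bounds, keyed by column name.
limits = pd.Series({"angles": 10, "degrees": 360})

# rsub computes limits - df, aligning the Series index with df's columns.
remaining = df.rsub(limits, axis="columns")
print(remaining)
```

The same values with the opposite sign come from df.sub(limits, axis="columns").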

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  # doctest: +SKIP
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  # doctest: +SKIP
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same result.

>>> df + 1  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  # doctest: +SKIP
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)  # doctest: +SKIP
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  # doctest: +SKIP
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  # doctest: +SKIP
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  # doctest: +SKIP
           angles
circle          0
triangle        3
rectangle       4
>>> df * other  # doctest: +SKIP
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)  # doctest: +SKIP
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  # doctest: +SKIP
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  # doctest: +SKIP
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)  # doctest: +SKIP
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
rtruediv(other, axis='columns', level=None, fill_value=None)

Floating division of dataframe and other, element-wise (binary operator rtruediv).

Equivalent to other / dataframe, but with support for substituting a fill_value for missing data in one of the inputs. The non-reversed counterpart is truediv.

One of the flexible wrappers (add, sub, mul, div, mod, pow) around the arithmetic operators +, -, *, /, //, %, **.

Parameters:
other : scalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axis : {0 or ‘index’, 1 or ‘columns’}

Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

level : int or label

Broadcast across a level, matching Index values on the passed MultiIndex level.

fill_value : float or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame

Result of the arithmetic operation.

See also

DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.

Notes

Mismatched indices will be unioned together.
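
As a small illustration, rtruediv with the scalar 1 gives element-wise reciprocals. The sketch uses pandas and assumes the Dask method follows it, including the float result for integer input and inf for division by zero.

```python
import pandas as pd

df = pd.DataFrame({"angles": [0, 4], "degrees": [360, 180]},
                  index=["circle", "rectangle"])

# rtruediv computes 1 / df; integer input still produces float output,
# and dividing by the zero in 'angles' yields inf rather than an error.
reciprocal = df.rtruediv(1)
print(reciprocal)
```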

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  # doctest: +SKIP
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  # doctest: +SKIP
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same result.

>>> df + 1  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  # doctest: +SKIP
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)  # doctest: +SKIP
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  # doctest: +SKIP
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  # doctest: +SKIP
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  # doctest: +SKIP
           angles
circle          0
triangle        3
rectangle       4
>>> df * other  # doctest: +SKIP
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)  # doctest: +SKIP
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  # doctest: +SKIP
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  # doctest: +SKIP
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)  # doctest: +SKIP
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
sample(n=None, frac=None, replace=False, random_state=None)

Random sample of items

Parameters:
n : int, optional

Number of items to return. Not supported by dask; use frac instead.

frac : float, optional

Fraction of axis items to return.

replace : boolean, optional

Sample with or without replacement. Default = False.

random_state : int or np.random.RandomState

If an int, a new RandomState is created with this value as the seed. Otherwise, samples are drawn from the passed RandomState.

select_dtypes(include=None, exclude=None)

Return a subset of the DataFrame’s columns based on the column dtypes.

This docstring was copied from pandas.core.frame.DataFrame.select_dtypes.

Some inconsistencies with the Dask version may exist.

Parameters:
include, exclude : scalar or list-like

A selection of dtypes or strings to be included/excluded. At least one of these parameters must be supplied.

Returns:
subset : DataFrame

The subset of the frame including the dtypes in include and excluding the dtypes in exclude.

Raises:
ValueError
  • If both of include and exclude are empty
  • If include and exclude have overlapping elements
  • If any kind of string dtype is passed in.

Notes

  • To select all numeric types, use np.number or 'number'
  • To select strings you must use the object dtype, but note that this will return all object dtype columns
  • See the numpy dtype hierarchy
  • To select datetimes, use np.datetime64, 'datetime' or 'datetime64'
  • To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'
  • To select Pandas categorical dtypes, use 'category'
  • To select Pandas datetimetz dtypes, use 'datetimetz' (new in 0.20.0) or 'datetime64[ns, tz]'
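
The selectors listed above can be tried on a small mixed-dtype frame. This sketch uses pandas and assumes the Dask method accepts the same selectors; the column names are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "count": [1, 2],
    "ratio": [0.5, 1.5],
    "label": pd.Categorical(["a", "b"]),
    "when": pd.to_datetime(["2020-01-01", "2020-01-02"]),
})

numeric = df.select_dtypes(include="number")        # integer and float columns
categorical = df.select_dtypes(include="category")  # pandas categorical columns
datetimes = df.select_dtypes(include="datetime")    # datetime64 columns
non_numeric = df.select_dtypes(exclude="number")    # everything else

print(list(numeric.columns))
print(list(non_numeric.columns))
```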

Examples

>>> df = pd.DataFrame({'a': [1, 2] * 3,  # doctest: +SKIP
...                    'b': [True, False] * 3,
...                    'c': [1.0, 2.0] * 3})
>>> df  # doctest: +SKIP
   a      b    c
0  1   True  1.0
1  2  False  2.0
2  1   True  1.0
3  2  False  2.0
4  1   True  1.0
5  2  False  2.0
>>> df.select_dtypes(include='bool')  # doctest: +SKIP
       b
0   True
1  False
2   True
3  False
4   True
5  False
>>> df.select_dtypes(include=['float64'])  # doctest: +SKIP
   c
0  1.0
1  2.0
2  1.0
3  2.0
4  1.0
5  2.0
>>> df.select_dtypes(exclude=['int'])  # doctest: +SKIP
       b    c
0   True  1.0
1  False  2.0
2   True  1.0
3  False  2.0
4   True  1.0
5  False  2.0
sem(axis=None, skipna=None, ddof=1, split_every=False)

Return unbiased standard error of the mean over requested axis.

This docstring was copied from pandas.core.frame.DataFrame.sem.

Some inconsistencies with the Dask version may exist.

Normalized by N-1 by default. This can be changed using the ddof argument.
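
The effect of ddof can be checked against a manual calculation. The sketch below uses pandas and assumes the Dask result matches it; the data are illustrative.

```python
import math
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})

sem_default = df.sem()            # ddof=1: variance normalized by N - 1
sem_population = df.sem(ddof=0)   # ddof=0: variance normalized by N

# Manual check: sample standard deviation divided by sqrt(N).
n = len(df)
manual = df["x"].std(ddof=1) / math.sqrt(n)

print(sem_default["x"], manual, sem_population["x"])
```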

Parameters:
axis : {index (0), columns (1)}
skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

ddof : int, default 1

Delta Degrees of Freedom. Th