API

Dataframe
DataFrame (dsk, name, meta, divisions) |
Parallel Pandas DataFrame |
DataFrame.abs () |
Return a Series/DataFrame with absolute numeric value of each element. |
DataFrame.add (other[, axis, level, fill_value]) |
Get Addition of dataframe and other, element-wise (binary operator add). |
DataFrame.align (other[, join, axis, fill_value]) |
Align two objects on their axes with the specified join method. |
DataFrame.all ([axis, skipna, split_every, out]) |
Return whether all elements are True, potentially over an axis. |
DataFrame.any ([axis, skipna, split_every, out]) |
Return whether any element is True, potentially over an axis. |
DataFrame.append (other[, interleave_partitions]) |
Append rows of other to the end of caller, returning a new object. |
DataFrame.apply (func[, axis, broadcast, …]) |
Parallel version of pandas.DataFrame.apply |
DataFrame.applymap (func[, meta]) |
Apply a function to a Dataframe elementwise. |
DataFrame.assign (**kwargs) |
Assign new columns to a DataFrame. |
DataFrame.astype (dtype) |
Cast a pandas object to a specified dtype. |
DataFrame.bfill ([axis, limit]) |
Synonym for DataFrame.fillna() with method='bfill'. |
DataFrame.categorize ([columns, index, …]) |
Convert columns of the DataFrame to category dtype. |
DataFrame.columns |
|
DataFrame.compute (**kwargs) |
Compute this dask collection |
DataFrame.copy () |
Make a copy of the dataframe |
DataFrame.corr ([method, min_periods, …]) |
Compute pairwise correlation of columns, excluding NA/null values. |
DataFrame.count ([axis, split_every]) |
Count non-NA cells for each column or row. |
DataFrame.cov ([min_periods, split_every]) |
Compute pairwise covariance of columns, excluding NA/null values. |
DataFrame.cummax ([axis, skipna, out]) |
Return cumulative maximum over a DataFrame or Series axis. |
DataFrame.cummin ([axis, skipna, out]) |
Return cumulative minimum over a DataFrame or Series axis. |
DataFrame.cumprod ([axis, skipna, dtype, out]) |
Return cumulative product over a DataFrame or Series axis. |
DataFrame.cumsum ([axis, skipna, dtype, out]) |
Return cumulative sum over a DataFrame or Series axis. |
DataFrame.describe ([split_every, …]) |
Generate descriptive statistics. |
DataFrame.diff ([periods, axis]) |
First discrete difference of element. |
DataFrame.div (other[, axis, level, fill_value]) |
Get Floating division of dataframe and other, element-wise (binary operator truediv). |
DataFrame.divide (other[, axis, level, …]) |
Get Floating division of dataframe and other, element-wise (binary operator truediv). |
DataFrame.drop ([labels, axis, columns, errors]) |
Drop specified labels from rows or columns. |
DataFrame.drop_duplicates ([subset, …]) |
Return DataFrame with duplicate rows removed. |
DataFrame.dropna ([how, subset, thresh]) |
Remove missing values. |
DataFrame.dtypes |
Return data types |
DataFrame.eq (other[, axis, level]) |
Get Equal to of dataframe and other, element-wise (binary operator eq). |
DataFrame.eval (expr[, inplace]) |
Evaluate a string describing operations on DataFrame columns. |
DataFrame.explode (column) |
Transform each element of a list-like to a row, replicating index values. |
DataFrame.ffill ([axis, limit]) |
Synonym for DataFrame.fillna() with method='ffill'. |
DataFrame.fillna ([value, method, limit, axis]) |
Fill NA/NaN values using the specified method. |
DataFrame.first (offset) |
Select initial periods of time series data based on a date offset. |
DataFrame.floordiv (other[, axis, level, …]) |
Get Integer division of dataframe and other, element-wise (binary operator floordiv). |
DataFrame.ge (other[, axis, level]) |
Get Greater than or equal to of dataframe and other, element-wise (binary operator ge). |
DataFrame.get_partition (n) |
Get a dask DataFrame/Series representing the nth partition. |
DataFrame.groupby ([by, group_keys, sort, …]) |
Group DataFrame using a mapper or by a Series of columns. |
DataFrame.gt (other[, axis, level]) |
Get Greater than of dataframe and other, element-wise (binary operator gt). |
DataFrame.head ([n, npartitions, compute]) |
First n rows of the dataset |
DataFrame.idxmax ([axis, skipna, split_every]) |
Return index of first occurrence of maximum over requested axis. |
DataFrame.idxmin ([axis, skipna, split_every]) |
Return index of first occurrence of minimum over requested axis. |
DataFrame.iloc |
Purely integer-location based indexing for selection by position. |
DataFrame.index |
Return dask Index instance |
DataFrame.info ([buf, verbose, memory_usage]) |
Concise summary of a Dask DataFrame. |
DataFrame.isin (values) |
Whether each element in the DataFrame is contained in values. |
DataFrame.isna () |
Detect missing values. |
DataFrame.isnull () |
Detect missing values. |
DataFrame.items () |
Iterate over (column name, Series) pairs. |
DataFrame.iteritems |
|
DataFrame.iterrows () |
Iterate over DataFrame rows as (index, Series) pairs. |
DataFrame.itertuples ([index, name]) |
Iterate over DataFrame rows as namedtuples. |
DataFrame.join (other[, on, how, lsuffix, …]) |
Join columns of another DataFrame. |
DataFrame.known_divisions |
Whether divisions are already known |
DataFrame.last (offset) |
Select final periods of time series data based on a date offset. |
DataFrame.le (other[, axis, level]) |
Get Less than or equal to of dataframe and other, element-wise (binary operator le). |
DataFrame.loc |
Purely label-location based indexer for selection by label. |
DataFrame.lt (other[, axis, level]) |
Get Less than of dataframe and other, element-wise (binary operator lt). |
DataFrame.map_partitions (func, *args, **kwargs) |
Apply Python function on each DataFrame partition. |
DataFrame.mask (cond[, other]) |
Replace values where the condition is True. |
DataFrame.max ([axis, skipna, split_every, out]) |
Return the maximum of the values over the requested axis. |
DataFrame.mean ([axis, skipna, split_every, …]) |
Return the mean of the values over the requested axis. |
DataFrame.melt ([id_vars, value_vars, …]) |
Unpivots a DataFrame from wide format to long format, optionally leaving identifier variables set. |
DataFrame.memory_usage ([index, deep]) |
Return the memory usage of each column in bytes. |
DataFrame.memory_usage_per_partition ([…]) |
Return the memory usage of each partition |
DataFrame.merge (right[, how, on, left_on, …]) |
Merge the DataFrame with another DataFrame |
DataFrame.min ([axis, skipna, split_every, out]) |
Return the minimum of the values over the requested axis. |
DataFrame.mod (other[, axis, level, fill_value]) |
Get Modulo of dataframe and other, element-wise (binary operator mod). |
DataFrame.mode ([dropna, split_every]) |
Get the mode(s) of each element along the selected axis. |
DataFrame.mul (other[, axis, level, fill_value]) |
Get Multiplication of dataframe and other, element-wise (binary operator mul). |
DataFrame.ndim |
Return dimensionality |
DataFrame.ne (other[, axis, level]) |
Get Not equal to of dataframe and other, element-wise (binary operator ne). |
DataFrame.nlargest ([n, columns, split_every]) |
Return the first n rows ordered by columns in descending order. |
DataFrame.npartitions |
Return number of partitions |
DataFrame.nsmallest ([n, columns, split_every]) |
Return the first n rows ordered by columns in ascending order. |
DataFrame.partitions |
Slice dataframe by partitions |
DataFrame.pivot_table ([index, columns, …]) |
Create a spreadsheet-style pivot table as a DataFrame. |
DataFrame.pop (item) |
Return item and drop from frame. |
DataFrame.pow (other[, axis, level, fill_value]) |
Get Exponential power of dataframe and other, element-wise (binary operator pow). |
DataFrame.prod ([axis, skipna, split_every, …]) |
Return the product of the values over the requested axis. |
DataFrame.quantile ([q, axis, method]) |
Approximate row-wise and precise column-wise quantiles of DataFrame |
DataFrame.query (expr, **kwargs) |
Filter dataframe with complex expression |
DataFrame.radd (other[, axis, level, fill_value]) |
Get Addition of dataframe and other, element-wise (binary operator radd). |
DataFrame.random_split (frac[, random_state, …]) |
Pseudorandomly split dataframe into different pieces row-wise |
DataFrame.rdiv (other[, axis, level, fill_value]) |
Get Floating division of dataframe and other, element-wise (binary operator rtruediv). |
DataFrame.rename ([index, columns]) |
Alter axes labels. |
DataFrame.repartition ([divisions, …]) |
Repartition dataframe along new divisions |
DataFrame.replace ([to_replace, value, regex]) |
Replace values given in to_replace with value. |
DataFrame.resample (rule[, closed, label]) |
Resample time-series data. |
DataFrame.reset_index ([drop]) |
Reset the index to the default index. |
DataFrame.rfloordiv (other[, axis, level, …]) |
Get Integer division of dataframe and other, element-wise (binary operator rfloordiv). |
DataFrame.rmod (other[, axis, level, fill_value]) |
Get Modulo of dataframe and other, element-wise (binary operator rmod). |
DataFrame.rmul (other[, axis, level, fill_value]) |
Get Multiplication of dataframe and other, element-wise (binary operator rmul). |
DataFrame.round ([decimals]) |
Round a DataFrame to a variable number of decimal places. |
DataFrame.rpow (other[, axis, level, fill_value]) |
Get Exponential power of dataframe and other, element-wise (binary operator rpow). |
DataFrame.rsub (other[, axis, level, fill_value]) |
Get Subtraction of dataframe and other, element-wise (binary operator rsub). |
DataFrame.rtruediv (other[, axis, level, …]) |
Get Floating division of dataframe and other, element-wise (binary operator rtruediv). |
DataFrame.sample ([n, frac, replace, …]) |
Random sample of items |
DataFrame.select_dtypes ([include, exclude]) |
Return a subset of the DataFrame’s columns based on the column dtypes. |
DataFrame.sem ([axis, skipna, ddof, split_every]) |
Return unbiased standard error of the mean over requested axis. |
DataFrame.set_index (other[, drop, sorted, …]) |
Set the DataFrame index (row labels) using an existing column. |
DataFrame.shape |
Return a tuple representing the dimensionality of the DataFrame. |
DataFrame.size |
Size of the Series or DataFrame as a Delayed object. |
DataFrame.squeeze ([axis]) |
Squeeze 1 dimensional axis objects into scalars. |
DataFrame.std ([axis, skipna, ddof, …]) |
Return sample standard deviation over requested axis. |
DataFrame.sub (other[, axis, level, fill_value]) |
Get Subtraction of dataframe and other, element-wise (binary operator sub). |
DataFrame.sum ([axis, skipna, split_every, …]) |
Return the sum of the values over the requested axis. |
DataFrame.tail ([n, compute]) |
Last n rows of the dataset |
DataFrame.to_bag ([index]) |
Create Dask Bag from a Dask DataFrame |
DataFrame.to_csv (filename, **kwargs) |
Store Dask DataFrame to CSV files |
DataFrame.to_dask_array ([lengths, meta]) |
Convert a dask DataFrame to a dask array. |
DataFrame.to_delayed ([optimize_graph]) |
Convert into a list of dask.delayed objects, one per partition. |
DataFrame.to_hdf (path_or_buf, key[, mode, …]) |
Store Dask Dataframe to Hierarchical Data Format (HDF) files |
DataFrame.to_html ([max_rows]) |
Render a DataFrame as an HTML table. |
DataFrame.to_json (filename, *args, **kwargs) |
See dd.to_json docstring for more information |
DataFrame.to_parquet (path, *args, **kwargs) |
Store Dask.dataframe to Parquet files |
DataFrame.to_records ([index, lengths]) |
Create Dask Array from a Dask Dataframe |
DataFrame.to_string ([max_rows]) |
Render a DataFrame to a console-friendly tabular output. |
DataFrame.to_sql (name, uri[, schema, …]) |
See dd.to_sql docstring for more information |
DataFrame.to_timestamp ([freq, how, axis]) |
Cast to DatetimeIndex of timestamps, at beginning of period. |
DataFrame.truediv (other[, axis, level, …]) |
Get Floating division of dataframe and other, element-wise (binary operator truediv). |
DataFrame.values |
Return a dask.array of the values of this dataframe |
DataFrame.var ([axis, skipna, ddof, …]) |
Return unbiased variance over requested axis. |
DataFrame.visualize ([filename, format, …]) |
Render the computation of this object’s task graph using graphviz. |
DataFrame.where (cond[, other]) |
Replace values where the condition is False. |
Series
Series (dsk, name, meta, divisions) |
Parallel Pandas Series |
Series.add (other[, level, fill_value, axis]) |
Return Addition of series and other, element-wise (binary operator add). |
Series.align (other[, join, axis, fill_value]) |
Align two objects on their axes with the specified join method. |
Series.all ([axis, skipna, split_every, out]) |
Return whether all elements are True, potentially over an axis. |
Series.any ([axis, skipna, split_every, out]) |
Return whether any element is True, potentially over an axis. |
Series.append (other[, interleave_partitions]) |
Concatenate two or more Series. |
Series.apply (func[, convert_dtype, meta, args]) |
Parallel version of pandas.Series.apply |
Series.astype (dtype) |
Cast a pandas object to a specified dtype. |
Series.autocorr ([lag, split_every]) |
Compute the lag-N autocorrelation. |
Series.between (left, right[, inclusive]) |
Return boolean Series equivalent to left <= series <= right. |
Series.bfill ([axis, limit]) |
Synonym for DataFrame.fillna() with method='bfill'. |
Series.cat |
|
Series.clear_divisions () |
Forget division information |
Series.clip ([lower, upper, out]) |
Trim values at input threshold(s). |
Series.clip_lower (threshold) |
|
Series.clip_upper (threshold) |
|
Series.compute (**kwargs) |
Compute this dask collection |
Series.copy () |
Make a copy of the dataframe |
Series.corr (other[, method, min_periods, …]) |
Compute correlation with other Series, excluding missing values. |
Series.count ([split_every]) |
Return number of non-NA/null observations in the Series. |
Series.cov (other[, min_periods, split_every]) |
Compute covariance with Series, excluding missing values. |
Series.cummax ([axis, skipna, out]) |
Return cumulative maximum over a DataFrame or Series axis. |
Series.cummin ([axis, skipna, out]) |
Return cumulative minimum over a DataFrame or Series axis. |
Series.cumprod ([axis, skipna, dtype, out]) |
Return cumulative product over a DataFrame or Series axis. |
Series.cumsum ([axis, skipna, dtype, out]) |
Return cumulative sum over a DataFrame or Series axis. |
Series.describe ([split_every, percentiles, …]) |
Generate descriptive statistics. |
Series.diff ([periods, axis]) |
First discrete difference of element. |
Series.div (other[, level, fill_value, axis]) |
Return Floating division of series and other, element-wise (binary operator truediv). |
Series.drop_duplicates ([subset, …]) |
Return DataFrame with duplicate rows removed. |
Series.dropna () |
Return a new Series with missing values removed. |
Series.dt |
Namespace of datetime methods |
Series.dtype |
Return data type |
Series.eq (other[, level, fill_value, axis]) |
Return Equal to of series and other, element-wise (binary operator eq). |
Series.explode () |
Transform each element of a list-like to a row. |
Series.ffill ([axis, limit]) |
Synonym for DataFrame.fillna() with method='ffill'. |
Series.fillna ([value, method, limit, axis]) |
Fill NA/NaN values using the specified method. |
Series.first (offset) |
Select initial periods of time series data based on a date offset. |
Series.floordiv (other[, level, fill_value, axis]) |
Return Integer division of series and other, element-wise (binary operator floordiv). |
Series.ge (other[, level, fill_value, axis]) |
Return Greater than or equal to of series and other, element-wise (binary operator ge). |
Series.get_partition (n) |
Get a dask DataFrame/Series representing the nth partition. |
Series.groupby ([by, group_keys, sort, …]) |
Group Series using a mapper or by a Series of columns. |
Series.gt (other[, level, fill_value, axis]) |
Return Greater than of series and other, element-wise (binary operator gt). |
Series.head ([n, npartitions, compute]) |
First n rows of the dataset |
Series.idxmax ([axis, skipna, split_every]) |
Return index of first occurrence of maximum over requested axis. |
Series.idxmin ([axis, skipna, split_every]) |
Return index of first occurrence of minimum over requested axis. |
Series.isin (values) |
Whether elements in Series are contained in values. |
Series.isna () |
Detect missing values. |
Series.isnull () |
Detect missing values. |
Series.iteritems () |
Lazily iterate over (index, value) tuples. |
Series.known_divisions |
Whether divisions are already known |
Series.last (offset) |
Select final periods of time series data based on a date offset. |
Series.le (other[, level, fill_value, axis]) |
Return Less than or equal to of series and other, element-wise (binary operator le). |
Series.loc |
Purely label-location based indexer for selection by label. |
Series.lt (other[, level, fill_value, axis]) |
Return Less than of series and other, element-wise (binary operator lt). |
Series.map (arg[, na_action, meta]) |
Map values of Series according to input correspondence. |
Series.map_overlap (func, before, after, …) |
Apply a function to each partition, sharing rows with adjacent partitions. |
Series.map_partitions (func, *args, **kwargs) |
Apply Python function on each DataFrame partition. |
Series.mask (cond[, other]) |
Replace values where the condition is True. |
Series.max ([axis, skipna, split_every, out]) |
Return the maximum of the values over the requested axis. |
Series.mean ([axis, skipna, split_every, …]) |
Return the mean of the values over the requested axis. |
Series.memory_usage ([index, deep]) |
Return the memory usage of the Series. |
Series.memory_usage_per_partition ([index, deep]) |
Return the memory usage of each partition |
Series.min ([axis, skipna, split_every, out]) |
Return the minimum of the values over the requested axis. |
Series.mod (other[, level, fill_value, axis]) |
Return Modulo of series and other, element-wise (binary operator mod). |
Series.mul (other[, level, fill_value, axis]) |
Return Multiplication of series and other, element-wise (binary operator mul). |
Series.nbytes |
Number of bytes |
Series.ndim |
Return dimensionality |
Series.ne (other[, level, fill_value, axis]) |
Return Not equal to of series and other, element-wise (binary operator ne). |
Series.nlargest ([n, split_every]) |
Return the largest n elements. |
Series.notnull () |
Detect existing (non-missing) values. |
Series.nsmallest ([n, split_every]) |
Return the smallest n elements. |
Series.nunique ([split_every]) |
Return number of unique elements in the object. |
Series.nunique_approx ([split_every]) |
Approximate number of unique rows. |
Series.persist (**kwargs) |
Persist this dask collection into memory |
Series.pipe (func, *args, **kwargs) |
Apply func(self, *args, **kwargs). |
Series.pow (other[, level, fill_value, axis]) |
Return Exponential power of series and other, element-wise (binary operator pow). |
Series.prod ([axis, skipna, split_every, …]) |
Return the product of the values over the requested axis. |
Series.quantile ([q, method]) |
Approximate quantiles of Series |
Series.radd (other[, level, fill_value, axis]) |
Return Addition of series and other, element-wise (binary operator radd). |
Series.random_split (frac[, random_state, …]) |
Pseudorandomly split dataframe into different pieces row-wise |
Series.rdiv (other[, level, fill_value, axis]) |
Return Floating division of series and other, element-wise (binary operator rtruediv). |
Series.reduction (chunk[, aggregate, …]) |
Generic row-wise reductions. |
Series.repartition ([divisions, npartitions, …]) |
Repartition dataframe along new divisions |
Series.replace ([to_replace, value, regex]) |
Replace values given in to_replace with value. |
Series.rename ([index, inplace, sorted_index]) |
Alter Series index labels or name |
Series.resample (rule[, closed, label]) |
Resample time-series data. |
Series.reset_index ([drop]) |
Reset the index to the default index. |
Series.rolling (window[, min_periods, …]) |
Provides rolling transformations. |
Series.round ([decimals]) |
Round each value in a Series to the given number of decimals. |
Series.sample ([n, frac, replace, random_state]) |
Random sample of items |
Series.sem ([axis, skipna, ddof, split_every]) |
Return unbiased standard error of the mean over requested axis. |
Series.shape |
Return a tuple representing the dimensionality of a Series. |
Series.shift ([periods, freq, axis]) |
Shift index by desired number of periods with an optional time freq. |
Series.size |
Size of the Series or DataFrame as a Delayed object. |
Series.std ([axis, skipna, ddof, …]) |
Return sample standard deviation over requested axis. |
Series.str |
Namespace for string methods |
Series.sub (other[, level, fill_value, axis]) |
Return Subtraction of series and other, element-wise (binary operator sub). |
Series.sum ([axis, skipna, split_every, …]) |
Return the sum of the values over the requested axis. |
Series.to_bag ([index]) |
Create a Dask Bag from a Series |
Series.to_csv (filename, **kwargs) |
Store Dask DataFrame to CSV files |
Series.to_dask_array ([lengths, meta]) |
Convert a dask DataFrame to a dask array. |
Series.to_delayed ([optimize_graph]) |
Convert into a list of dask.delayed objects, one per partition. |
Series.to_frame ([name]) |
Convert Series to DataFrame. |
Series.to_hdf (path_or_buf, key[, mode, append]) |
Store Dask Dataframe to Hierarchical Data Format (HDF) files |
Series.to_string ([max_rows]) |
Render a string representation of the Series. |
Series.to_timestamp ([freq, how, axis]) |
Cast to DatetimeIndex of timestamps, at beginning of period. |
Series.truediv (other[, level, fill_value, axis]) |
Return Floating division of series and other, element-wise (binary operator truediv). |
Series.unique ([split_every, split_out]) |
Return Series of unique values in the object. |
Series.value_counts ([sort, ascending, …]) |
Return a Series containing counts of unique values. |
Series.values |
Return a dask.array of the values of this dataframe |
Series.var ([axis, skipna, ddof, …]) |
Return unbiased variance over requested axis. |
Series.visualize ([filename, format, …]) |
Render the computation of this object’s task graph using graphviz. |
Series.where (cond[, other]) |
Replace values where the condition is False. |
Groupby Operations
DataFrameGroupBy.aggregate (arg[, …]) |
Aggregate using one or more operations over the specified axis. |
DataFrameGroupBy.apply (func, *args, **kwargs) |
Parallel version of pandas GroupBy.apply |
DataFrameGroupBy.count ([split_every, split_out]) |
Compute count of group, excluding missing values. |
DataFrameGroupBy.cumcount ([axis]) |
Number each item in each group from 0 to the length of that group - 1. |
DataFrameGroupBy.cumprod ([axis]) |
Cumulative product for each group. |
DataFrameGroupBy.cumsum ([axis]) |
Cumulative sum for each group. |
DataFrameGroupBy.get_group (key) |
Construct DataFrame from group with provided name. |
DataFrameGroupBy.max ([split_every, split_out]) |
Compute max of group values. |
DataFrameGroupBy.mean ([split_every, split_out]) |
Compute mean of groups, excluding missing values. |
DataFrameGroupBy.min ([split_every, split_out]) |
Compute min of group values. |
DataFrameGroupBy.size ([split_every, split_out]) |
Compute group sizes. |
DataFrameGroupBy.std ([ddof, split_every, …]) |
Compute standard deviation of groups, excluding missing values. |
DataFrameGroupBy.sum ([split_every, …]) |
Compute sum of group values. |
DataFrameGroupBy.var ([ddof, split_every, …]) |
Compute variance of groups, excluding missing values. |
DataFrameGroupBy.cov ([ddof, split_every, …]) |
Compute pairwise covariance of columns, excluding NA/null values. |
DataFrameGroupBy.corr ([ddof, split_every, …]) |
Compute pairwise correlation of columns, excluding NA/null values. |
DataFrameGroupBy.first ([split_every, split_out]) |
Compute first of group values. |
DataFrameGroupBy.last ([split_every, split_out]) |
Compute last of group values. |
DataFrameGroupBy.idxmin ([split_every, …]) |
Return index of first occurrence of minimum over requested axis. |
DataFrameGroupBy.idxmax ([split_every, …]) |
Return index of first occurrence of maximum over requested axis. |
SeriesGroupBy.aggregate (arg[, split_every, …]) |
Aggregate using one or more operations over the specified axis. |
SeriesGroupBy.apply (func, *args, **kwargs) |
Parallel version of pandas GroupBy.apply |
SeriesGroupBy.count ([split_every, split_out]) |
Compute count of group, excluding missing values. |
SeriesGroupBy.cumcount ([axis]) |
Number each item in each group from 0 to the length of that group - 1. |
SeriesGroupBy.cumprod ([axis]) |
Cumulative product for each group. |
SeriesGroupBy.cumsum ([axis]) |
Cumulative sum for each group. |
SeriesGroupBy.get_group (key) |
Construct DataFrame from group with provided name. |
SeriesGroupBy.max ([split_every, split_out]) |
Compute max of group values. |
SeriesGroupBy.mean ([split_every, split_out]) |
Compute mean of groups, excluding missing values. |
SeriesGroupBy.min ([split_every, split_out]) |
Compute min of group values. |
SeriesGroupBy.nunique ([split_every, split_out]) |
Return number of unique elements in the group. |
SeriesGroupBy.size ([split_every, split_out]) |
Compute group sizes. |
SeriesGroupBy.std ([ddof, split_every, split_out]) |
Compute standard deviation of groups, excluding missing values. |
SeriesGroupBy.sum ([split_every, split_out, …]) |
Compute sum of group values. |
SeriesGroupBy.var ([ddof, split_every, split_out]) |
Compute variance of groups, excluding missing values. |
SeriesGroupBy.first ([split_every, split_out]) |
Compute first of group values. |
SeriesGroupBy.last ([split_every, split_out]) |
Compute last of group values. |
SeriesGroupBy.idxmin ([split_every, …]) |
Return index of first occurrence of minimum over requested axis. |
SeriesGroupBy.idxmax ([split_every, …]) |
Return index of first occurrence of maximum over requested axis. |
Aggregation (name, chunk, agg[, finalize]) |
User defined groupby-aggregation. |
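As a quick illustration of the Aggregation interface, here is a minimal sketch of a user-defined groupby mean built from per-partition pieces; the name custom_mean and the example frame are ours, not part of the API.

import pandas as pd
import dask.dataframe as dd

# chunk runs on each partition's groups, agg combines the per-partition
# results, and finalize turns the combined pieces into the output.
custom_mean = dd.Aggregation(
    name='custom_mean',
    chunk=lambda s: (s.count(), s.sum()),
    agg=lambda count, total: (count.sum(), total.sum()),
    finalize=lambda count, total: total / count,
)

df = pd.DataFrame({'g': ['a', 'a', 'b', 'b'], 'x': [1.0, 2.0, 3.0, 4.0]})
ddf = dd.from_pandas(df, npartitions=2)
print(ddf.groupby('g').agg(custom_mean).compute())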
Rolling Operations
rolling.map_overlap (func, df, before, after, …) |
Apply a function to each partition, sharing rows with adjacent partitions. |
Series.rolling (window[, min_periods, …]) |
Provides rolling transformations. |
DataFrame.rolling (window[, min_periods, …]) |
Provides rolling transformations. |
Rolling.apply (func[, raw, engine, …]) |
Apply an arbitrary function to each rolling window. |
Rolling.count () |
The rolling count of any non-NaN observations inside the window. |
Rolling.kurt () |
Calculate unbiased rolling kurtosis. |
Rolling.max () |
Calculate the rolling maximum. |
Rolling.mean () |
Calculate the rolling mean of the values. |
Rolling.median () |
Calculate the rolling median. |
Rolling.min () |
Calculate the rolling minimum. |
Rolling.quantile (quantile) |
Calculate the rolling quantile. |
Rolling.skew () |
Unbiased rolling skewness. |
Rolling.std ([ddof]) |
Calculate rolling standard deviation. |
Rolling.sum () |
Calculate rolling sum of given DataFrame or Series. |
Rolling.var ([ddof]) |
Calculate unbiased rolling variance. |
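A minimal sketch of the rolling API on a Dask Series; the frame below is illustrative. Windows that cross partition boundaries are handled by sharing rows with neighboring partitions, so the result matches the equivalent pandas computation.

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'x': range(10)})
ddf = dd.from_pandas(df, npartitions=2)

# Rolling mean over a window of 3 rows.
print(ddf.x.rolling(window=3).mean().compute())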
Create DataFrames
read_csv (urlpath[, blocksize, …]) |
Read CSV files into a Dask.DataFrame |
read_table (urlpath[, blocksize, …]) |
Read delimited files into a Dask.DataFrame |
read_fwf (urlpath[, blocksize, …]) |
Read fixed-width files into a Dask.DataFrame |
read_parquet (path[, columns, filters, …]) |
Read a Parquet file into a Dask DataFrame |
read_hdf (pattern, key[, start, stop, …]) |
Read HDF files into a Dask DataFrame |
read_json (url_path[, orient, lines, …]) |
Create a dataframe from a set of JSON files |
read_orc (path[, columns, storage_options]) |
Read dataframe from ORC file(s) |
read_sql_table (table, uri, index_col[, …]) |
Create dataframe from an SQL table. |
from_array (x[, chunksize, columns, meta]) |
Read any sliceable array into a Dask Dataframe |
from_bcolz (x[, chunksize, categorize, …]) |
Read BColz CTable into a Dask Dataframe |
from_dask_array (x[, columns, index, meta]) |
Create a Dask DataFrame from a Dask Array. |
from_delayed (dfs[, meta, divisions, prefix, …]) |
Create Dask DataFrame from many Dask Delayed objects |
from_pandas (data[, npartitions, chunksize, …]) |
Construct a Dask DataFrame from a Pandas DataFrame |
dask.bag.core.Bag.to_dataframe ([meta, columns]) |
Create Dask Dataframe from a Dask Bag. |
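Two common entry points, sketched below with an illustrative in-memory frame and a placeholder file glob.

import pandas as pd
import dask.dataframe as dd

# From an existing pandas DataFrame, split into 4 partitions:
ddf = dd.from_pandas(pd.DataFrame({'x': range(100)}), npartitions=4)

# From files on disk; 'data/*.csv' is a placeholder path, and each
# block of roughly blocksize bytes becomes one partition:
# ddf = dd.read_csv('data/*.csv', blocksize='64MB')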
Store DataFrames
to_csv (df, filename[, single_file, …]) |
Store Dask DataFrame to CSV files |
to_parquet (df, path[, engine, compression, …]) |
Store Dask.dataframe to Parquet files |
to_hdf (df, path, key[, mode, append, …]) |
Store Dask Dataframe to Hierarchical Data Format (HDF) files |
to_records (df) |
Create Dask Array from a Dask Dataframe |
to_sql (df, name, uri[, schema, index_label, …]) |
Store Dask Dataframe to a SQL table |
to_json (df, url_path[, orient, lines, …]) |
Write dataframe into JSON text files |
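A minimal sketch of the storage functions, using placeholder output paths.

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({'x': range(10)}), npartitions=2)

# to_csv writes one file per partition, substituting the partition
# number for '*'; pass single_file=True for one combined file.
ddf.to_csv('out/part-*.csv')

# to_parquet requires a parquet engine such as pyarrow or fastparquet.
# ddf.to_parquet('out.parquet')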
Convert DataFrames
DataFrame.to_bag ([index]) |
Create Dask Bag from a Dask DataFrame |
DataFrame.to_dask_array ([lengths, meta]) |
Convert a dask DataFrame to a dask array. |
DataFrame.to_delayed ([optimize_graph]) |
Convert into a list of dask.delayed objects, one per partition. |
Reshape DataFrames
get_dummies (data[, prefix, prefix_sep, …]) |
Convert categorical variable into dummy/indicator variables. |
pivot_table (df[, index, columns, values, …]) |
Create a spreadsheet-style pivot table as a DataFrame. |
melt (frame[, id_vars, value_vars, var_name, …]) |
Unpivots a DataFrame from wide format to long format, optionally leaving identifier variables set. |
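A small sketch of get_dummies; note that, unlike pandas, Dask requires the encoded columns to be categorical with known categories, which is why categorize() precedes the call here.

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'col': ['a', 'b', 'a']})
ddf = dd.from_pandas(df, npartitions=1).categorize(columns=['col'])
print(dd.get_dummies(ddf, columns=['col']).compute())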
DataFrame Methods
class dask.dataframe.DataFrame(dsk, name, meta, divisions)

Parallel Pandas DataFrame

Do not use this class directly. Instead use functions like dd.read_csv, dd.read_parquet, or dd.from_pandas.

Parameters: - dsk : dict
The dask graph to compute this DataFrame
- name : str
The key prefix that specifies which keys in the dask graph comprise this particular DataFrame
- meta : pandas.DataFrame
An empty pandas.DataFrame with names, dtypes, and index matching the expected output.
- divisions : tuple of index values
Values along which we partition our blocks on the index
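A minimal sketch of the recommended construction path, rather than calling the constructor directly:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'x': range(8)})
ddf = dd.from_pandas(pdf, npartitions=2)
print(ddf.npartitions)  # 2
print(ddf.divisions)    # index values bounding each partition, e.g. (0, 4, 7)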
abs()

Return a Series/DataFrame with absolute numeric value of each element.
This docstring was copied from pandas.core.frame.DataFrame.abs.
Some inconsistencies with the Dask version may exist.
This function only applies to elements that are all numeric.
Returns: - abs
Series/DataFrame containing the absolute value of each element.
See also
numpy.absolute
- Calculate the absolute value element-wise.
Notes
For complex inputs, 1.2 + 1j, the absolute value is \(\sqrt{a^2 + b^2}\).

Examples
Absolute numeric values in a Series.
>>> s = pd.Series([-1.10, 2, -3.33, 4])  # doctest: +SKIP
>>> s.abs()  # doctest: +SKIP
0    1.10
1    2.00
2    3.33
3    4.00
dtype: float64
Absolute numeric values in a Series with complex numbers.
>>> s = pd.Series([1.2 + 1j])  # doctest: +SKIP
>>> s.abs()  # doctest: +SKIP
0    1.56205
dtype: float64
Absolute numeric values in a Series with a Timedelta element.
>>> s = pd.Series([pd.Timedelta('1 days')])  # doctest: +SKIP
>>> s.abs()  # doctest: +SKIP
0   1 days
dtype: timedelta64[ns]
Select rows with data closest to certain value using argsort (from StackOverflow).
>>> df = pd.DataFrame({  # doctest: +SKIP
...     'a': [4, 5, 6, 7],
...     'b': [10, 20, 30, 40],
...     'c': [100, 50, -30, -50]
... })
>>> df  # doctest: +SKIP
   a   b    c
0  4  10  100
1  5  20   50
2  6  30  -30
3  7  40  -50
>>> df.loc[(df.c - 43).abs().argsort()]  # doctest: +SKIP
   a   b    c
1  5  20   50
0  4  10  100
2  6  30  -30
3  7  40  -50
add(other, axis='columns', level=None, fill_value=None)

Get Addition of dataframe and other, element-wise (binary operator add).
This docstring was copied from pandas.core.frame.DataFrame.add.
Some inconsistencies with the Dask version may exist.
Equivalent to dataframe + other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, radd.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
Parameters: - other : scalar, sequence, Series, or DataFrame
Any single or multiple element data structure, or list-like object.
- axis : {0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
- level : int or label
Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value : float or None, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
Returns: - DataFrame
Result of the arithmetic operation.
See also
DataFrame.add
- Add DataFrames.
DataFrame.sub
- Subtract DataFrames.
DataFrame.mul
- Multiply DataFrames.
DataFrame.div
- Divide DataFrames (float division).
DataFrame.truediv
- Divide DataFrames (float division).
DataFrame.floordiv
- Divide DataFrames (integer division).
DataFrame.mod
- Calculate modulo (remainder after division).
DataFrame.pow
- Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],  # doctest: +SKIP
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  # doctest: +SKIP
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360
Add a scalar with the operator version, which returns the same results.
>>> df + 1  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
Divide by constant with reverse version.
>>> df.div(10)  # doctest: +SKIP
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)  # doctest: +SKIP
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2]  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  # doctest: +SKIP
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]},  # doctest: +SKIP
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  # doctest: +SKIP
           angles
circle          0
triangle        3
rectangle       4
>>> df * other  # doctest: +SKIP
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)  # doctest: +SKIP
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  # doctest: +SKIP
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  # doctest: +SKIP
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)  # doctest: +SKIP
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
align(other, join='outer', axis=None, fill_value=None)

Align two objects on their axes with the specified join method.
This docstring was copied from pandas.core.frame.DataFrame.align.
Some inconsistencies with the Dask version may exist.
Join method is specified for each axis Index.
Parameters: - other : DataFrame or Series
- join : {‘outer’, ‘inner’, ‘left’, ‘right’}, default ‘outer’
- axis : allowed axis of the other object, default None
Align on index (0), columns (1), or both (None).
- level : int or level name, default None (Not supported in Dask)
Broadcast across a level, matching Index values on the passed MultiIndex level.
- copy : bool, default True (Not supported in Dask)
Always returns new objects. If copy=False and no reindexing is required then original objects are returned.
- fill_value : scalar, default np.NaN
Value to use for missing values. Defaults to NaN, but can be any “compatible” value.
- method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None (Not supported in Dask)
Method to use for filling holes in reindexed Series:
- pad / ffill: propagate last valid observation forward to next valid.
- backfill / bfill: use NEXT valid observation to fill gap.
- limit : int, default None (Not supported in Dask)
If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
- fill_axis : {0 or ‘index’, 1 or ‘columns’}, default 0 (Not supported in Dask)
Filling axis, method and limit.
- broadcast_axis : {0 or ‘index’, 1 or ‘columns’}, default None (Not supported in Dask)
Broadcast values along this axis, if aligning two objects of different dimensions.
Returns: - (left, right) : (DataFrame, type of other)
Aligned objects.
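A short sketch of aligning two Dask DataFrames along their columns (axis=1 is our choice here); the frames are illustrative.

import pandas as pd
import dask.dataframe as dd

left = dd.from_pandas(pd.DataFrame({'a': [1, 2], 'b': [3, 4]}), npartitions=1)
right = dd.from_pandas(pd.DataFrame({'b': [5, 6], 'c': [7, 8]}), npartitions=1)

# With join='outer' both results carry the union of columns a, b, c;
# positions missing from one side become NaN (or fill_value if given).
l, r = left.align(right, join='outer', axis=1)
print(l.compute())
print(r.compute())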
all(axis=None, skipna=True, split_every=False, out=None)

Return whether all elements are True, potentially over an axis.
This docstring was copied from pandas.core.frame.DataFrame.all.
Some inconsistencies with the Dask version may exist.
Returns True unless there is at least one element within a series or along a DataFrame axis that is False or equivalent (e.g. zero or empty).
Parameters: - axis : {0 or ‘index’, 1 or ‘columns’, None}, default 0
Indicate which axis or axes should be reduced.
- 0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.
- 1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.
- None : reduce all axes, return a scalar.
- bool_only : bool, default None (Not supported in Dask)
Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.
- skipna : bool, default True
Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be True, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.
- level : int or level name, default None (Not supported in Dask)
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
- **kwargs : any, default None
Additional keywords have no effect but might be accepted for compatibility with NumPy.
Returns: - Series or DataFrame
If level is specified, then, DataFrame is returned; otherwise, Series is returned.
See also
Series.all
- Return True if all elements are True.
DataFrame.any
- Return True if one (or more) elements are True.
Examples
Series
>>> pd.Series([True, True]).all()  # doctest: +SKIP
True
>>> pd.Series([True, False]).all()  # doctest: +SKIP
False
>>> pd.Series([]).all()  # doctest: +SKIP
True
>>> pd.Series([np.nan]).all()  # doctest: +SKIP
True
>>> pd.Series([np.nan]).all(skipna=False)  # doctest: +SKIP
True
DataFrames
Create a dataframe from a dictionary.
>>> df = pd.DataFrame({'col1': [True, True], 'col2': [True, False]})  # doctest: +SKIP
>>> df  # doctest: +SKIP
   col1   col2
0  True   True
1  True  False
Default behaviour checks if column-wise values all return True.
>>> df.all()  # doctest: +SKIP
col1     True
col2    False
dtype: bool
Specify axis='columns' to check if row-wise values all return True.

>>> df.all(axis='columns')  # doctest: +SKIP
0     True
1    False
dtype: bool
Or axis=None for whether every value is True.

>>> df.all(axis=None)  # doctest: +SKIP
False
any(axis=None, skipna=True, split_every=False, out=None)

Return whether any element is True, potentially over an axis.
This docstring was copied from pandas.core.frame.DataFrame.any.
Some inconsistencies with the Dask version may exist.
Returns False unless there is at least one element within a series or along a Dataframe axis that is True or equivalent (e.g. non-zero or non-empty).
Parameters: - axis : {0 or ‘index’, 1 or ‘columns’, None}, default 0
Indicate which axis or axes should be reduced.
- 0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.
- 1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.
- None : reduce all axes, return a scalar.
- bool_only : bool, default None (Not supported in Dask)
Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.
- skipna : bool, default True
Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.
- level : int or level name, default None (Not supported in Dask)
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
- **kwargs : any, default None
Additional keywords have no effect but might be accepted for compatibility with NumPy.
Returns: - Series or DataFrame
If level is specified, then, DataFrame is returned; otherwise, Series is returned.
See also
numpy.any
- Numpy version of this method.
Series.any
- Return whether any element is True.
Series.all
- Return whether all elements are True.
DataFrame.any
- Return whether any element is True over requested axis.
DataFrame.all
- Return whether all elements are True over requested axis.
Examples
Series
For Series input, the output is a scalar indicating whether any element is True.
>>> pd.Series([False, False]).any()  # doctest: +SKIP
False
>>> pd.Series([True, False]).any()  # doctest: +SKIP
True
>>> pd.Series([]).any()  # doctest: +SKIP
False
>>> pd.Series([np.nan]).any()  # doctest: +SKIP
False
>>> pd.Series([np.nan]).any(skipna=False)  # doctest: +SKIP
True
DataFrame
Whether each column contains at least one True element (the default).
>>> df = pd.DataFrame({"A": [1, 2], "B": [0, 2], "C": [0, 0]}) # doctest: +SKIP >>> df # doctest: +SKIP A B C 0 1 0 0 1 2 2 0
>>> df.any()  # doctest: +SKIP
A     True
B     True
C    False
dtype: bool
Aggregating over the columns.
>>> df = pd.DataFrame({"A": [True, False], "B": [1, 2]}) # doctest: +SKIP >>> df # doctest: +SKIP A B 0 True 1 1 False 2
>>> df.any(axis='columns')  # doctest: +SKIP
0    True
1    True
dtype: bool
>>> df = pd.DataFrame({"A": [True, False], "B": [1, 0]}) # doctest: +SKIP >>> df # doctest: +SKIP A B 0 True 1 1 False 0
>>> df.any(axis='columns')  # doctest: +SKIP
0     True
1    False
dtype: bool
Aggregating over the entire DataFrame with axis=None.

>>> df.any(axis=None)  # doctest: +SKIP
True
any for an empty DataFrame is an empty Series.
>>> pd.DataFrame([]).any()  # doctest: +SKIP
Series([], dtype: bool)
append(other, interleave_partitions=False)

Append rows of other to the end of caller, returning a new object.
This docstring was copied from pandas.core.frame.DataFrame.append.
Some inconsistencies with the Dask version may exist.
Columns in other that are not in the caller are added as new columns.
Parameters: - other : DataFrame or Series/dict-like object, or list of these
The data to append.
- ignore_index : bool, default False (Not supported in Dask)
If True, the resulting axis will be labeled 0, 1, …, n - 1.
- verify_integrity : bool, default False (Not supported in Dask)
If True, raise ValueError on creating index with duplicates.
- sort : bool, default False (Not supported in Dask)
Sort columns if the columns of self and other are not aligned.
Changed in version 1.0.0: Changed to not sort by default.
Returns: - DataFrame
See also
concat
- General function to concatenate DataFrame or Series objects.
Notes
If a list of dict/series is passed and the keys are all contained in the DataFrame’s index, the order of the columns in the resulting DataFrame will be unchanged.
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
Examples
>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))  # doctest: +SKIP
>>> df  # doctest: +SKIP
   A  B
0  1  2
1  3  4
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))  # doctest: +SKIP
>>> df.append(df2)  # doctest: +SKIP
   A  B
0  1  2
1  3  4
0  5  6
1  7  8
With ignore_index set to True:
>>> df.append(df2, ignore_index=True)  # doctest: +SKIP
   A  B
0  1  2
1  3  4
2  5  6
3  7  8
The following, while not recommended methods for generating DataFrames, show two ways to generate a DataFrame from multiple data sources.
Less efficient:
>>> df = pd.DataFrame(columns=['A'])  # doctest: +SKIP
>>> for i in range(5):  # doctest: +SKIP
...     df = df.append({'A': i}, ignore_index=True)
>>> df  # doctest: +SKIP
   A
0  0
1  1
2  2
3  3
4  4
More efficient:
>>> pd.concat([pd.DataFrame([i], columns=['A']) for i in range(5)],  # doctest: +SKIP
...           ignore_index=True)
   A
0  0
1  1
2  2
3  3
4  4
apply(func, axis=0, broadcast=None, raw=False, reduce=None, args=(), meta='__no_default__', result_type=None, **kwds)

Parallel version of pandas.DataFrame.apply
This mimics the pandas version except for the following:
- Only axis=1 is supported (and must be specified explicitly).
- The user should provide output metadata via the meta keyword.
Parameters: - func : function
Function to apply to each column/row
- axis : {0 or ‘index’, 1 or ‘columns’}, default 0
- 0 or ‘index’: apply function to each column (NOT SUPPORTED)
- 1 or ‘columns’: apply function to each row
- meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or an iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
- args : tuple
Positional arguments to pass to function in addition to the array/series
- Additional keyword arguments will be passed as keywords to the function
Returns: - applied : Series or DataFrame
See also
dask.DataFrame.map_partitions
Examples
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
...                    'y': [1., 2., 3., 4., 5.]})
>>> ddf = dd.from_pandas(df, npartitions=2)
Apply a function row-wise, passing in extra arguments in args and kwargs:

>>> def myadd(row, a, b=1):
...     return row.sum() + a + b
>>> res = ddf.apply(myadd, axis=1, args=(2,), b=1.5)  # doctest: +SKIP
By default, dask tries to infer the output metadata by running your provided function on some fake data. This works well in many cases, but can sometimes be expensive, or even fail. To avoid this, you can manually specify the output metadata with the meta keyword. This can be specified in many forms; for more information see dask.dataframe.utils.make_meta.

Here we specify that the output is a Series with name 'x' and dtype float64:

>>> res = ddf.apply(myadd, axis=1, args=(2,), b=1.5, meta=('x', 'f8'))
In the case where the metadata doesn’t change, you can also pass in the object itself directly:
>>> res = ddf.apply(lambda row: row + 1, axis=1, meta=ddf)
applymap(func, meta='__no_default__')

Apply a function to a Dataframe elementwise.
This docstring was copied from pandas.core.frame.DataFrame.applymap.
Some inconsistencies with the Dask version may exist.
This method applies a function that accepts and returns a scalar to every element of a DataFrame.
Parameters: - func : callable
Python function, returns a single value from a single value.
- na_action : {None, ‘ignore’}, default None (Not supported in Dask)
If ‘ignore’, propagate NaN values, without passing them to func.
New in version 1.2.
Returns: - DataFrame
Transformed DataFrame.
See also
DataFrame.apply
- Apply a function along input axis of DataFrame.
Examples
>>> df = pd.DataFrame([[1, 2.12], [3.356, 4.567]])  # doctest: +SKIP
>>> df  # doctest: +SKIP
       0      1
0  1.000  2.120
1  3.356  4.567
>>> df.applymap(lambda x: len(str(x)))  # doctest: +SKIP
   0  1
0  3  4
1  5  5
Like Series.map, NA values can be ignored:
>>> df_copy = df.copy()  # doctest: +SKIP
>>> df_copy.iloc[0, 0] = pd.NA  # doctest: +SKIP
>>> df_copy.applymap(lambda x: len(str(x)), na_action='ignore')  # doctest: +SKIP
      0  1
0  <NA>  4
1     5  5
Note that a vectorized version of func often exists, which will be much faster. You could square each number elementwise.
>>> df.applymap(lambda x: x**2)  # doctest: +SKIP
           0          1
0   1.000000   4.494400
1  11.262736  20.857489
But it’s better to avoid applymap in that case.
>>> df ** 2  # doctest: +SKIP
           0          1
0   1.000000   4.494400
1  11.262736  20.857489
assign(**kwargs)

Assign new columns to a DataFrame.
This docstring was copied from pandas.core.frame.DataFrame.assign.
Some inconsistencies with the Dask version may exist.
Returns a new object with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten.
Parameters: - **kwargs : dict of {str: callable or Series}
The column names are keywords. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not change input DataFrame (though pandas doesn’t check it). If the values are not callable, (e.g. a Series, scalar, or array), they are simply assigned.
Returns: - DataFrame
A new DataFrame with the new columns in addition to all the existing columns.
Notes
Assigning multiple columns within the same assign is possible. Later items in '**kwargs' may refer to newly created or modified columns in 'df'; items are computed and assigned into 'df' in order.

Examples
>>> df = pd.DataFrame({'temp_c': [17.0, 25.0]},  # doctest: +SKIP
...                   index=['Portland', 'Berkeley'])
>>> df  # doctest: +SKIP
          temp_c
Portland    17.0
Berkeley    25.0
Where the value is a callable, evaluated on df:
>>> df.assign(temp_f=lambda x: x.temp_c * 9 / 5 + 32)  # doctest: +SKIP
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0
Alternatively, the same behavior can be achieved by directly referencing an existing Series or sequence:
>>> df.assign(temp_f=df['temp_c'] * 9 / 5 + 32)  # doctest: +SKIP
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0
You can create multiple columns within the same assign where one of the columns depends on another one defined within the same assign:
>>> df.assign(temp_f=lambda x: x['temp_c'] * 9 / 5 + 32,  # doctest: +SKIP
...           temp_k=lambda x: (x['temp_f'] + 459.67) * 5 / 9)
          temp_c  temp_f  temp_k
Portland    17.0    62.6  290.15
Berkeley    25.0    77.0  298.15
astype(dtype)

Cast a pandas object to a specified dtype.

This docstring was copied from pandas.core.frame.DataFrame.astype.
Some inconsistencies with the Dask version may exist.
Parameters: - dtype : data type, or dict of column name -> data type
Use a numpy.dtype or Python type to cast entire pandas object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.
- copy : bool, default True (Not supported in Dask)
Return a copy when copy=True (be very careful setting copy=False as changes to values then may propagate to other pandas objects).
- errors : {‘raise’, ‘ignore’}, default ‘raise’ (Not supported in Dask)
Control raising of exceptions on invalid data for provided dtype.
- raise : allow exceptions to be raised
- ignore : suppress exceptions. On error return original object.
Returns: - casted : same type as caller
See also
to_datetime
- Convert argument to datetime.
to_timedelta
- Convert argument to timedelta.
to_numeric
- Convert argument to a numeric type.
numpy.ndarray.astype
- Cast a numpy array to a specified type.
Examples
Create a DataFrame:
>>> d = {'col1': [1, 2], 'col2': [3, 4]}  # doctest: +SKIP
>>> df = pd.DataFrame(data=d)  # doctest: +SKIP
>>> df.dtypes  # doctest: +SKIP
col1    int64
col2    int64
dtype: object
Cast all columns to int32:
>>> df.astype('int32').dtypes  # doctest: +SKIP
col1    int32
col2    int32
dtype: object
Cast col1 to int32 using a dictionary:
>>> df.astype({'col1': 'int32'}).dtypes  # doctest: +SKIP
col1    int32
col2    int64
dtype: object
Create a series:
>>> ser = pd.Series([1, 2], dtype='int32')  # doctest: +SKIP
>>> ser  # doctest: +SKIP
0    1
1    2
dtype: int32
>>> ser.astype('int64')  # doctest: +SKIP
0    1
1    2
dtype: int64
Convert to categorical type:
>>> ser.astype('category')  # doctest: +SKIP
0    1
1    2
dtype: category
Categories (2, int64): [1, 2]
Convert to ordered categorical type with custom ordering:
>>> cat_dtype = pd.api.types.CategoricalDtype(  # doctest: +SKIP
...     categories=[2, 1], ordered=True)
>>> ser.astype(cat_dtype)  # doctest: +SKIP
0    1
1    2
dtype: category
Categories (2, int64): [2 < 1]
Note that using copy=False and changing data on a new pandas object may propagate changes:

>>> s1 = pd.Series([1, 2])  # doctest: +SKIP
>>> s2 = s1.astype('int64', copy=False)  # doctest: +SKIP
>>> s2[0] = 10  # doctest: +SKIP
>>> s1  # note that s1[0] has changed too  # doctest: +SKIP
0    10
1     2
dtype: int64
Create a series of dates:
>>> ser_date = pd.Series(pd.date_range('20200101', periods=3))  # doctest: +SKIP
>>> ser_date  # doctest: +SKIP
0   2020-01-01
1   2020-01-02
2   2020-01-03
dtype: datetime64[ns]
Datetimes are localized to UTC first before converting to the specified timezone:
>>> ser_date.astype('datetime64[ns, US/Eastern]')  # doctest: +SKIP
0   2019-12-31 19:00:00-05:00
1   2020-01-01 19:00:00-05:00
2   2020-01-02 19:00:00-05:00
dtype: datetime64[ns, US/Eastern]
attrs

Dictionary of global attributes of this dataset.
This docstring was copied from pandas.core.frame.DataFrame.attrs.
Some inconsistencies with the Dask version may exist.
Warning
attrs is experimental and may change without warning.
See also
DataFrame.flags
- Global flags applying to this object.
bfill(axis=None, limit=None)

Synonym for DataFrame.fillna() with method='bfill'.

This docstring was copied from pandas.core.frame.DataFrame.bfill.
Some inconsistencies with the Dask version may exist.
Returns: - Series/DataFrame or None
Object with missing values filled, or None if inplace=True.
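A small sketch of bfill on a Dask DataFrame, equivalent to fillna(method='bfill'); the frame is illustrative.

import numpy as np
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({'x': [np.nan, 2.0, np.nan, 4.0]}),
                     npartitions=1)

# The next valid observation is propagated backward: 2.0, 2.0, 4.0, 4.0
print(ddf.bfill().compute())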
categorize(columns=None, index=None, split_every=None, **kwargs)

Convert columns of the DataFrame to category dtype.
Parameters: - columns : list, optional
A list of column names to convert to categoricals. By default any column with an object dtype is converted to a categorical, and any unknown categoricals are made known.
- index : bool, optional
Whether to categorize the index. By default, object indices are converted to categorical, and unknown categorical indices are made known. Set True to always categorize the index, False to never.
- split_every : int, optional
Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used. Default is 16.
- kwargs
Keyword arguments are passed on to compute.
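A minimal sketch of categorize; note that it scans the data (triggering computation) so that every partition learns the full set of categories.

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({'name': ['alice', 'bob', 'alice']}),
                     npartitions=2)
ddf = ddf.categorize(columns=['name'])
print(ddf.name.dtype)  # category, with the categories now known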
clear_divisions()

Forget division information
clip(lower=None, upper=None, out=None)

Trim values at input threshold(s).
This docstring was copied from pandas.core.frame.DataFrame.clip.
Some inconsistencies with the Dask version may exist.
Assigns values outside boundary to boundary values. Thresholds can be singular values or array like, and in the latter case the clipping is performed element-wise in the specified axis.
Parameters: - lower : float or array_like, default None
Minimum threshold value. All values below this threshold will be set to it.
- upper : float or array_like, default None
Maximum threshold value. All values above this threshold will be set to it.
- axis : int or str axis name, optional (Not supported in Dask)
Align object with lower and upper along the given axis.
- inplace : bool, default False (Not supported in Dask)
Whether to perform the operation in place on the data.
- *args, **kwargs
Additional keywords have no effect but might be accepted for compatibility with numpy.
Returns: - Series or DataFrame or None
Same type as calling object with the values outside the clip boundaries replaced, or None if inplace=True.
See also
Series.clip
- Trim values at input threshold in series.
DataFrame.clip
- Trim values at input threshold in dataframe.
numpy.clip
- Clip (limit) the values in an array.
Examples
>>> data = {'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]}  # doctest: +SKIP
>>> df = pd.DataFrame(data)  # doctest: +SKIP
>>> df  # doctest: +SKIP
   col_0  col_1
0      9     -2
1     -3     -7
2      0      6
3     -1      8
4      5     -5
Clips per column using lower and upper thresholds:
>>> df.clip(-4, 6) # doctest: +SKIP col_0 col_1 0 6 -2 1 -3 -4 2 0 6 3 -1 6 4 5 -4
Clips using specific lower and upper thresholds per column element:
>>> t = pd.Series([2, -4, -1, 6, 3]) # doctest: +SKIP >>> t # doctest: +SKIP 0 2 1 -4 2 -1 3 6 4 3 dtype: int64
>>> df.clip(t, t + 4, axis=0) # doctest: +SKIP col_0 col_1 0 6 2 1 -3 -4 2 0 3 3 6 8 4 5 3
-
combine
(other, func, fill_value=None, overwrite=True)¶ Perform column-wise combine with another DataFrame.
This docstring was copied from pandas.core.frame.DataFrame.combine.
Some inconsistencies with the Dask version may exist.
Combines a DataFrame with other DataFrame using func to element-wise combine columns. The row and column indexes of the resulting DataFrame will be the union of the two.
Parameters: - other : DataFrame
The DataFrame to merge column-wise.
- func : function
Function that takes two series as inputs and returns a Series or a scalar. Used to merge the two dataframes column by column.
- fill_value : scalar value, default None
The value to fill NaNs with prior to passing any column to the merge func.
- overwrite : bool, default True
If True, columns in self that do not exist in other will be overwritten with NaNs.
Returns: - DataFrame
Combination of the provided DataFrames.
See also
DataFrame.combine_first
- Combine two DataFrame objects and default to non-null values in frame calling the method.
Examples
Combine using a simple function that chooses the smaller column.
>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]}) # doctest: +SKIP >>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]}) # doctest: +SKIP >>> take_smaller = lambda s1, s2: s1 if s1.sum() < s2.sum() else s2 # doctest: +SKIP >>> df1.combine(df2, take_smaller) # doctest: +SKIP A B 0 0 3 1 0 3
Example using a true element-wise combine function.
>>> df1 = pd.DataFrame({'A': [5, 0], 'B': [2, 4]}) # doctest: +SKIP >>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]}) # doctest: +SKIP >>> df1.combine(df2, np.minimum) # doctest: +SKIP A B 0 1 2 1 0 3
Using fill_value fills Nones prior to passing the column to the merge function.
>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]}) # doctest: +SKIP >>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]}) # doctest: +SKIP >>> df1.combine(df2, take_smaller, fill_value=-5) # doctest: +SKIP A B 0 0 -5.0 1 0 4.0
However, if the same element in both dataframes is None, that None is preserved:
>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]}) # doctest: +SKIP >>> df2 = pd.DataFrame({'A': [1, 1], 'B': [None, 3]}) # doctest: +SKIP >>> df1.combine(df2, take_smaller, fill_value=-5) # doctest: +SKIP A B 0 0 -5.0 1 0 3.0
Example that demonstrates the use of overwrite and the behavior when the axes differ between the dataframes.
>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]}) # doctest: +SKIP >>> df2 = pd.DataFrame({'B': [3, 3], 'C': [-10, 1], }, index=[1, 2]) # doctest: +SKIP >>> df1.combine(df2, take_smaller) # doctest: +SKIP A B C 0 NaN NaN NaN 1 NaN 3.0 -10.0 2 NaN 3.0 1.0
>>> df1.combine(df2, take_smaller, overwrite=False) # doctest: +SKIP A B C 0 0.0 NaN NaN 1 0.0 3.0 -10.0 2 NaN 3.0 1.0
Demonstrating the preference of the passed in dataframe.
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1], }, index=[1, 2]) # doctest: +SKIP >>> df2.combine(df1, take_smaller) # doctest: +SKIP A B C 0 0.0 NaN NaN 1 0.0 3.0 NaN 2 NaN 3.0 NaN
>>> df2.combine(df1, take_smaller, overwrite=False) # doctest: +SKIP A B C 0 0.0 NaN NaN 1 0.0 3.0 1.0 2 NaN 3.0 1.0
-
combine_first
(other)¶ Update null elements with value in the same location in other.
This docstring was copied from pandas.core.frame.DataFrame.combine_first.
Some inconsistencies with the Dask version may exist.
Combine two DataFrame objects by filling null values in one DataFrame with non-null values from other DataFrame. The row and column indexes of the resulting DataFrame will be the union of the two.
Parameters: - other : DataFrame
Provided DataFrame to use to fill null values.
Returns: - DataFrame
See also
DataFrame.combine
- Perform series-wise operation on two DataFrames using a given function.
Examples
>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [None, 4]}) # doctest: +SKIP >>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]}) # doctest: +SKIP >>> df1.combine_first(df2) # doctest: +SKIP A B 0 1.0 3.0 1 0.0 4.0
Null values still persist if the location of that null value does not exist in other:
>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [4, None]}) # doctest: +SKIP >>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1]}, index=[1, 2]) # doctest: +SKIP >>> df1.combine_first(df2) # doctest: +SKIP A B C 0 NaN 4.0 NaN 1 0.0 3.0 1.0 2 NaN 3.0 1.0
-
compute
(**kwargs)¶ Compute this dask collection
This turns a lazy Dask collection into its in-memory equivalent. For example a Dask array turns into a NumPy array and a Dask dataframe turns into a Pandas dataframe. The entire dataset must fit into memory before calling this operation.
Parameters: - scheduler : string, optional
Which scheduler to use like “threads”, “synchronous” or “processes”. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.
- optimize_graph : bool, optional
If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.
- kwargs
Extra keywords to forward to the scheduler function.
See also
dask.base.compute
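A minimal sketch (the scheduler choice is illustrative; "threads", "synchronous", and "processes" are the options named above):
>>> import dask.dataframe as dd # doctest: +SKIP
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2, 3]}), npartitions=2) # illustrative data # doctest: +SKIP
>>> pdf = ddf.compute(scheduler='threads') # doctest: +SKIP
>>> type(pdf) # doctest: +SKIP
<class 'pandas.core.frame.DataFrame'>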
-
copy
()¶ Make a copy of the dataframe
This is strictly a shallow copy of the underlying computational graph. It does not affect the underlying data.
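As a small sketch, building on the copy (for example by assigning a new column) leaves the original collection unchanged:
>>> import dask.dataframe as dd # doctest: +SKIP
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2]}), npartitions=1) # illustrative data # doctest: +SKIP
>>> ddf2 = ddf.copy() # doctest: +SKIP
>>> ddf2['y'] = ddf2['x'] + 1 # doctest: +SKIP
>>> 'y' in ddf.columns # doctest: +SKIP
False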
-
corr
(method='pearson', min_periods=None, split_every=False)[source]¶ Compute pairwise correlation of columns, excluding NA/null values.
This docstring was copied from pandas.core.frame.DataFrame.corr.
Some inconsistencies with the Dask version may exist.
Parameters: - method : {‘pearson’, ‘kendall’, ‘spearman’} or callable
Method of correlation:
- pearson : standard correlation coefficient
- kendall : Kendall Tau correlation coefficient
- spearman : Spearman rank correlation
- callable : callable with input two 1d ndarrays and returning a float. Note that the returned matrix from corr will have 1 along the diagonals and will be symmetric regardless of the callable’s behavior.
New in version 0.24.0.
- min_periods : int, optional
Minimum number of observations required per pair of columns to have a valid result. Currently only available for Pearson and Spearman correlation.
Returns: - DataFrame
Correlation matrix.
See also
DataFrame.corrwith
- Compute pairwise correlation with another DataFrame or Series.
Series.corr
- Compute the correlation between two Series.
Examples
>>> def histogram_intersection(a, b): # doctest: +SKIP ... v = np.minimum(a, b).sum().round(decimals=1) ... return v >>> df = pd.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)], # doctest: +SKIP ... columns=['dogs', 'cats']) >>> df.corr(method=histogram_intersection) # doctest: +SKIP dogs cats dogs 1.0 0.3 cats 0.3 1.0
-
count
(axis=None, split_every=False)¶ Count non-NA cells for each column or row.
This docstring was copied from pandas.core.frame.DataFrame.count.
Some inconsistencies with the Dask version may exist.
The values None, NaN, NaT, and optionally numpy.inf (depending on pandas.options.mode.use_inf_as_na) are considered NA.
Parameters: - axis : {0 or ‘index’, 1 or ‘columns’}, default 0
If 0 or ‘index’ counts are generated for each column. If 1 or ‘columns’ counts are generated for each row.
- level : int or str, optional (Not supported in Dask)
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DataFrame. A str specifies the level name.
- numeric_only : bool, default False (Not supported in Dask)
Include only float, int or boolean data.
Returns: - Series or DataFrame
For each column/row the number of non-NA/null entries. If level is specified returns a DataFrame.
See also
Series.count
- Number of non-NA elements in a Series.
DataFrame.value_counts
- Count unique combinations of columns.
DataFrame.shape
- Number of DataFrame rows and columns (including NA elements).
DataFrame.isna
- Boolean same-sized DataFrame showing places of NA elements.
Examples
Constructing DataFrame from a dictionary:
>>> df = pd.DataFrame({"Person": # doctest: +SKIP ... ["John", "Myla", "Lewis", "John", "Myla"], ... "Age": [24., np.nan, 21., 33, 26], ... "Single": [False, True, True, True, False]}) >>> df # doctest: +SKIP Person Age Single 0 John 24.0 False 1 Myla NaN True 2 Lewis 21.0 True 3 John 33.0 True 4 Myla 26.0 False
Notice the uncounted NA values:
>>> df.count() # doctest: +SKIP Person 5 Age 4 Single 5 dtype: int64
Counts for each row:
>>> df.count(axis='columns') # doctest: +SKIP 0 3 1 2 2 3 3 3 4 3 dtype: int64
Counts for one level of a MultiIndex:
>>> df.set_index(["Person", "Single"]).count(level="Person") # doctest: +SKIP Age Person John 2 Lewis 1 Myla 1
-
cov
(min_periods=None, split_every=False)[source]¶ Compute pairwise covariance of columns, excluding NA/null values.
This docstring was copied from pandas.core.frame.DataFrame.cov.
Some inconsistencies with the Dask version may exist.
Compute the pairwise covariance among the series of a DataFrame. The returned data frame is the covariance matrix of the columns of the DataFrame.
Both NA and null values are automatically excluded from the calculation. (See the note below about bias from missing values.) A threshold can be set for the minimum number of observations for each value created. Comparisons with observations below this threshold will be returned as NaN.
This method is generally used for the analysis of time series data to understand the relationship between different measures across time.
Parameters: - min_periods : int, optional
Minimum number of observations required per pair of columns to have a valid result.
- ddof : int, default 1 (Not supported in Dask)
Delta degrees of freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
New in version 1.1.0.
Returns: - DataFrame
The covariance matrix of the series of the DataFrame.
See also
Series.cov
- Compute covariance with another Series.
core.window.ExponentialMovingWindow.cov
- Exponential weighted sample covariance.
core.window.Expanding.cov
- Expanding sample covariance.
core.window.Rolling.cov
- Rolling sample covariance.
Notes
Returns the covariance matrix of the DataFrame’s time series. The covariance is normalized by N-ddof.
For DataFrames that have Series that are missing data (assuming that data is missing at random) the returned covariance matrix will be an unbiased estimate of the variance and covariance between the member Series.
However, for many applications this estimate may not be acceptable because the estimate covariance matrix is not guaranteed to be positive semi-definite. This could lead to estimate correlations having absolute values which are greater than one, and/or a non-invertible covariance matrix. See Estimation of covariance matrices for more details.
Examples
>>> df = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)], # doctest: +SKIP ... columns=['dogs', 'cats']) >>> df.cov() # doctest: +SKIP dogs cats dogs 0.666667 -1.000000 cats -1.000000 1.666667
>>> np.random.seed(42) # doctest: +SKIP >>> df = pd.DataFrame(np.random.randn(1000, 5), # doctest: +SKIP ... columns=['a', 'b', 'c', 'd', 'e']) >>> df.cov() # doctest: +SKIP a b c d e a 0.998438 -0.020161 0.059277 -0.008943 0.014144 b -0.020161 1.059352 -0.008543 -0.024738 0.009826 c 0.059277 -0.008543 1.010670 -0.001486 -0.000271 d -0.008943 -0.024738 -0.001486 0.921297 -0.013692 e 0.014144 0.009826 -0.000271 -0.013692 0.977795
Minimum number of periods
This method also supports an optional min_periods keyword that specifies the required minimum number of non-NA observations for each column pair in order to have a valid result:
>>> np.random.seed(42) # doctest: +SKIP >>> df = pd.DataFrame(np.random.randn(20, 3), # doctest: +SKIP ... columns=['a', 'b', 'c']) >>> df.loc[df.index[:5], 'a'] = np.nan # doctest: +SKIP >>> df.loc[df.index[5:10], 'b'] = np.nan # doctest: +SKIP >>> df.cov(min_periods=12) # doctest: +SKIP a b c a 0.316741 NaN -0.150812 b NaN 1.248003 0.191417 c -0.150812 0.191417 0.895202
-
cummax
(axis=None, skipna=True, out=None)¶ Return cumulative maximum over a DataFrame or Series axis.
This docstring was copied from pandas.core.frame.DataFrame.cummax.
Some inconsistencies with the Dask version may exist.
Returns a DataFrame or Series of the same size containing the cumulative maximum.
Parameters: - axis : {0 or ‘index’, 1 or ‘columns’}, default 0
The index or the name of the axis. 0 is equivalent to None or ‘index’.
- skipna : bool, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
- *args, **kwargs
Additional keywords have no effect but might be accepted for compatibility with NumPy.
Returns: - Series or DataFrame
Return cumulative maximum of Series or DataFrame.
See also
core.window.Expanding.max
- Similar functionality but ignores NaN values.
DataFrame.max
- Return the maximum over DataFrame axis.
DataFrame.cummax
- Return cumulative maximum over DataFrame axis.
DataFrame.cummin
- Return cumulative minimum over DataFrame axis.
DataFrame.cumsum
- Return cumulative sum over DataFrame axis.
DataFrame.cumprod
- Return cumulative product over DataFrame axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0]) # doctest: +SKIP >>> s # doctest: +SKIP 0 2.0 1 NaN 2 5.0 3 -1.0 4 0.0 dtype: float64
By default, NA values are ignored.
>>> s.cummax() # doctest: +SKIP 0 2.0 1 NaN 2 5.0 3 5.0 4 5.0 dtype: float64
To include NA values in the operation, use skipna=False:
>>> s.cummax(skipna=False) # doctest: +SKIP 0 2.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0], # doctest: +SKIP ... [3.0, np.nan], ... [1.0, 0.0]], ... columns=list('AB')) >>> df # doctest: +SKIP A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
By default, iterates over rows and finds the maximum in each column. This is equivalent to axis=None or axis='index'.
>>> df.cummax() # doctest: +SKIP A B 0 2.0 1.0 1 3.0 NaN 2 3.0 1.0
To iterate over columns and find the maximum in each row, use axis=1:
>>> df.cummax(axis=1) # doctest: +SKIP A B 0 2.0 2.0 1 3.0 NaN 2 1.0 1.0
-
cummin
(axis=None, skipna=True, out=None)¶ Return cumulative minimum over a DataFrame or Series axis.
This docstring was copied from pandas.core.frame.DataFrame.cummin.
Some inconsistencies with the Dask version may exist.
Returns a DataFrame or Series of the same size containing the cumulative minimum.
Parameters: - axis : {0 or ‘index’, 1 or ‘columns’}, default 0
The index or the name of the axis. 0 is equivalent to None or ‘index’.
- skipna : bool, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
- *args, **kwargs
Additional keywords have no effect but might be accepted for compatibility with NumPy.
Returns: - Series or DataFrame
Return cumulative minimum of Series or DataFrame.
See also
core.window.Expanding.min
- Similar functionality but ignores NaN values.
DataFrame.min
- Return the minimum over DataFrame axis.
DataFrame.cummax
- Return cumulative maximum over DataFrame axis.
DataFrame.cummin
- Return cumulative minimum over DataFrame axis.
DataFrame.cumsum
- Return cumulative sum over DataFrame axis.
DataFrame.cumprod
- Return cumulative product over DataFrame axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0]) # doctest: +SKIP >>> s # doctest: +SKIP 0 2.0 1 NaN 2 5.0 3 -1.0 4 0.0 dtype: float64
By default, NA values are ignored.
>>> s.cummin() # doctest: +SKIP 0 2.0 1 NaN 2 2.0 3 -1.0 4 -1.0 dtype: float64
To include NA values in the operation, use skipna=False:
>>> s.cummin(skipna=False) # doctest: +SKIP 0 2.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0], # doctest: +SKIP ... [3.0, np.nan], ... [1.0, 0.0]], ... columns=list('AB')) >>> df # doctest: +SKIP A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
By default, iterates over rows and finds the minimum in each column. This is equivalent to axis=None or axis='index'.
>>> df.cummin() # doctest: +SKIP A B 0 2.0 1.0 1 2.0 NaN 2 1.0 0.0
To iterate over columns and find the minimum in each row, use axis=1:
>>> df.cummin(axis=1) # doctest: +SKIP A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
-
cumprod
(axis=None, skipna=True, dtype=None, out=None)¶ Return cumulative product over a DataFrame or Series axis.
This docstring was copied from pandas.core.frame.DataFrame.cumprod.
Some inconsistencies with the Dask version may exist.
Returns a DataFrame or Series of the same size containing the cumulative product.
Parameters: - axis : {0 or ‘index’, 1 or ‘columns’}, default 0
The index or the name of the axis. 0 is equivalent to None or ‘index’.
- skipna : bool, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
- *args, **kwargs
Additional keywords have no effect but might be accepted for compatibility with NumPy.
Returns: - Series or DataFrame
Return cumulative product of Series or DataFrame.
See also
core.window.Expanding.prod
- Similar functionality but ignores NaN values.
DataFrame.prod
- Return the product over DataFrame axis.
DataFrame.cummax
- Return cumulative maximum over DataFrame axis.
DataFrame.cummin
- Return cumulative minimum over DataFrame axis.
DataFrame.cumsum
- Return cumulative sum over DataFrame axis.
DataFrame.cumprod
- Return cumulative product over DataFrame axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0]) # doctest: +SKIP >>> s # doctest: +SKIP 0 2.0 1 NaN 2 5.0 3 -1.0 4 0.0 dtype: float64
By default, NA values are ignored.
>>> s.cumprod() # doctest: +SKIP 0 2.0 1 NaN 2 10.0 3 -10.0 4 -0.0 dtype: float64
To include NA values in the operation, use skipna=False:
>>> s.cumprod(skipna=False) # doctest: +SKIP 0 2.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0], # doctest: +SKIP ... [3.0, np.nan], ... [1.0, 0.0]], ... columns=list('AB')) >>> df # doctest: +SKIP A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
By default, iterates over rows and finds the product in each column. This is equivalent to axis=None or axis='index'.
>>> df.cumprod() # doctest: +SKIP A B 0 2.0 1.0 1 6.0 NaN 2 6.0 0.0
To iterate over columns and find the product in each row, use axis=1:
>>> df.cumprod(axis=1) # doctest: +SKIP A B 0 2.0 2.0 1 3.0 NaN 2 1.0 0.0
-
cumsum
(axis=None, skipna=True, dtype=None, out=None)¶ Return cumulative sum over a DataFrame or Series axis.
This docstring was copied from pandas.core.frame.DataFrame.cumsum.
Some inconsistencies with the Dask version may exist.
Returns a DataFrame or Series of the same size containing the cumulative sum.
Parameters: - axis : {0 or ‘index’, 1 or ‘columns’}, default 0
The index or the name of the axis. 0 is equivalent to None or ‘index’.
- skipna : bool, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
- *args, **kwargs
Additional keywords have no effect but might be accepted for compatibility with NumPy.
Returns: - Series or DataFrame
Return cumulative sum of Series or DataFrame.
See also
core.window.Expanding.sum
- Similar functionality but ignores NaN values.
DataFrame.sum
- Return the sum over DataFrame axis.
DataFrame.cummax
- Return cumulative maximum over DataFrame axis.
DataFrame.cummin
- Return cumulative minimum over DataFrame axis.
DataFrame.cumsum
- Return cumulative sum over DataFrame axis.
DataFrame.cumprod
- Return cumulative product over DataFrame axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0]) # doctest: +SKIP >>> s # doctest: +SKIP 0 2.0 1 NaN 2 5.0 3 -1.0 4 0.0 dtype: float64
By default, NA values are ignored.
>>> s.cumsum() # doctest: +SKIP 0 2.0 1 NaN 2 7.0 3 6.0 4 6.0 dtype: float64
To include NA values in the operation, use skipna=False:
>>> s.cumsum(skipna=False) # doctest: +SKIP 0 2.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0], # doctest: +SKIP ... [3.0, np.nan], ... [1.0, 0.0]], ... columns=list('AB')) >>> df # doctest: +SKIP A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
By default, iterates over rows and finds the sum in each column. This is equivalent to axis=None or axis='index'.
>>> df.cumsum() # doctest: +SKIP A B 0 2.0 1.0 1 5.0 NaN 2 6.0 1.0
To iterate over columns and find the sum in each row, use axis=1:
>>> df.cumsum(axis=1) # doctest: +SKIP A B 0 2.0 3.0 1 3.0 NaN 2 1.0 1.0
-
describe
(split_every=False, percentiles=None, percentiles_method='default', include=None, exclude=None)¶ Generate descriptive statistics.
This docstring was copied from pandas.core.frame.DataFrame.describe.
Some inconsistencies with the Dask version may exist.
Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.
Parameters: - percentiles : list-like of numbers, optional
The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.
- include : ‘all’, list-like of dtypes or None (default), optional
A white list of data types to include in the result. Ignored for Series. Here are the options:
- ‘all’ : All columns of the input will be included in the output.
- A list-like of dtypes : Limits the results to the provided data types. To limit the result to numeric types submit numpy.number. To limit it instead to object columns submit the numpy.object data type. Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To select pandas categorical columns, use 'category'.
- None (default) : The result will include all numeric columns.
- exclude : list-like of dtypes or None (default), optional
A black list of data types to omit from the result. Ignored for Series. Here are the options:
- A list-like of dtypes : Excludes the provided data types from the result. To exclude numeric types submit numpy.number. To exclude object columns submit the data type numpy.object. Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])). To exclude pandas categorical columns, use 'category'.
- None (default) : The result will exclude nothing.
- datetime_is_numeric : bool, default False (Not supported in Dask)
Whether to treat datetime dtypes as numeric. This affects statistics calculated for the column. For DataFrame input, this also controls whether datetime columns are included by default.
New in version 1.1.0.
Returns: - Series or DataFrame
Summary statistics of the Series or Dataframe provided.
See also
DataFrame.count
- Count number of non-NA/null observations.
DataFrame.max
- Maximum of the values in the object.
DataFrame.min
- Minimum of the values in the object.
DataFrame.mean
- Mean of the values.
DataFrame.std
- Standard deviation of the observations.
DataFrame.select_dtypes
- Subset of a DataFrame including/excluding columns based on their dtype.
Notes
For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.
For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.
If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.
For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the dataframe consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.
The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.
Examples
Describing a numeric Series.
>>> s = pd.Series([1, 2, 3]) # doctest: +SKIP >>> s.describe() # doctest: +SKIP count 3.0 mean 2.0 std 1.0 min 1.0 25% 1.5 50% 2.0 75% 2.5 max 3.0 dtype: float64
Describing a categorical Series.
>>> s = pd.Series(['a', 'a', 'b', 'c']) # doctest: +SKIP >>> s.describe() # doctest: +SKIP count 4 unique 3 top a freq 2 dtype: object
Describing a timestamp Series.
>>> s = pd.Series([ # doctest: +SKIP ... np.datetime64("2000-01-01"), ... np.datetime64("2010-01-01"), ... np.datetime64("2010-01-01") ... ]) >>> s.describe(datetime_is_numeric=True) # doctest: +SKIP count 3 mean 2006-09-01 08:00:00 min 2000-01-01 00:00:00 25% 2004-12-31 12:00:00 50% 2010-01-01 00:00:00 75% 2010-01-01 00:00:00 max 2010-01-01 00:00:00 dtype: object
Describing a DataFrame. By default only numeric fields are returned.
>>> df = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']), # doctest: +SKIP ... 'numeric': [1, 2, 3], ... 'object': ['a', 'b', 'c'] ... }) >>> df.describe() # doctest: +SKIP numeric count 3.0 mean 2.0 std 1.0 min 1.0 25% 1.5 50% 2.0 75% 2.5 max 3.0
Describing all columns of a DataFrame regardless of data type.
>>> df.describe(include='all') # doctest: +SKIP categorical numeric object count 3 3.0 3 unique 3 NaN 3 top f NaN a freq 1 NaN 1 mean NaN 2.0 NaN std NaN 1.0 NaN min NaN 1.0 NaN 25% NaN 1.5 NaN 50% NaN 2.0 NaN 75% NaN 2.5 NaN max NaN 3.0 NaN
Describing a column from a DataFrame by accessing it as an attribute.
>>> df.numeric.describe() # doctest: +SKIP count 3.0 mean 2.0 std 1.0 min 1.0 25% 1.5 50% 2.0 75% 2.5 max 3.0 Name: numeric, dtype: float64
Including only numeric columns in a DataFrame description.
>>> df.describe(include=[np.number]) # doctest: +SKIP numeric count 3.0 mean 2.0 std 1.0 min 1.0 25% 1.5 50% 2.0 75% 2.5 max 3.0
Including only string columns in a DataFrame description.
>>> df.describe(include=[object]) # doctest: +SKIP object count 3 unique 3 top a freq 1
Including only categorical columns from a DataFrame description.
>>> df.describe(include=['category']) # doctest: +SKIP categorical count 3 unique 3 top d freq 1
Excluding numeric columns from a DataFrame description.
>>> df.describe(exclude=[np.number]) # doctest: +SKIP categorical object count 3 3 unique 3 3 top f a freq 1 1
Excluding object columns from a DataFrame description.
>>> df.describe(exclude=[object]) # doctest: +SKIP categorical numeric count 3 3.0 unique 3 NaN top f NaN freq 1 NaN mean NaN 2.0 std NaN 1.0 min NaN 1.0 25% NaN 1.5 50% NaN 2.0 75% NaN 2.5 max NaN 3.0
-
diff
(periods=1, axis=0)¶ First discrete difference of element.
This docstring was copied from pandas.core.frame.DataFrame.diff.
Some inconsistencies with the Dask version may exist.
Note
Pandas currently uses an object-dtype column to represent boolean data with missing values. This can cause issues for boolean-specific operations, like |. To enable boolean-specific operations, at the cost of metadata that doesn’t match pandas, use .astype(bool) after the shift.
Calculates the difference of a Dataframe element compared with another element in the Dataframe (default is element in previous row).
Parameters: - periods : int, default 1
Periods to shift for calculating difference, accepts negative values.
- axis : {0 or ‘index’, 1 or ‘columns’}, default 0
Take difference over rows (0) or columns (1).
Returns: - Dataframe
First differences of the Dataframe.
See also
Dataframe.pct_change
- Percent change over given number of periods.
Dataframe.shift
- Shift index by desired number of periods with an optional time freq.
Series.diff
- First discrete difference of object.
Notes
For boolean dtypes, this uses operator.xor() rather than operator.sub(). The result is calculated according to current dtype in Dataframe, however dtype of the result is always float64.
Examples
Difference with previous row
>>> df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6], # doctest: +SKIP ... 'b': [1, 1, 2, 3, 5, 8], ... 'c': [1, 4, 9, 16, 25, 36]}) >>> df # doctest: +SKIP a b c 0 1 1 1 1 2 1 4 2 3 2 9 3 4 3 16 4 5 5 25 5 6 8 36
>>> df.diff() # doctest: +SKIP a b c 0 NaN NaN NaN 1 1.0 0.0 3.0 2 1.0 1.0 5.0 3 1.0 1.0 7.0 4 1.0 2.0 9.0 5 1.0 3.0 11.0
Difference with previous column
>>> df.diff(axis=1) # doctest: +SKIP a b c 0 NaN 0 0 1 NaN -1 3 2 NaN -1 7 3 NaN -1 13 4 NaN 0 20 5 NaN 2 28
Difference with 3rd previous row
>>> df.diff(periods=3) # doctest: +SKIP a b c 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN NaN NaN 3 3.0 2.0 15.0 4 3.0 4.0 21.0 5 3.0 6.0 27.0
Difference with following row
>>> df.diff(periods=-1) # doctest: +SKIP a b c 0 -1.0 0.0 -3.0 1 -1.0 -1.0 -5.0 2 -1.0 -1.0 -7.0 3 -1.0 -2.0 -9.0 4 -1.0 -3.0 -11.0 5 NaN NaN NaN
Overflow in input dtype
>>> df = pd.DataFrame({'a': [1, 0]}, dtype=np.uint8) # doctest: +SKIP >>> df.diff() # doctest: +SKIP a 0 NaN 1 255.0
-
div
(other, axis='columns', level=None, fill_value=None)¶ Get Floating division of dataframe and other, element-wise (binary operator truediv).
This docstring was copied from pandas.core.frame.DataFrame.div.
Some inconsistencies with the Dask version may exist.
Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rtruediv.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
Parameters: - other : scalar, sequence, Series, or DataFrame
Any single or multiple element data structure, or list-like object.
- axis : {0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
- level : int or label
Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value : float or None, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
Returns: - DataFrame
Result of the arithmetic operation.
See also
DataFrame.add
- Add DataFrames.
DataFrame.sub
- Subtract DataFrames.
DataFrame.mul
- Multiply DataFrames.
DataFrame.div
- Divide DataFrames (float division).
DataFrame.truediv
- Divide DataFrames (float division).
DataFrame.floordiv
- Divide DataFrames (integer division).
DataFrame.mod
- Calculate modulo (remainder after division).
DataFrame.pow
- Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4], # doctest: +SKIP ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df # doctest: +SKIP angles degrees circle 0 360 triangle 3 180 rectangle 4 360
Add a scalar with the operator version, which returns the same results.
>>> df + 1 # doctest: +SKIP angles degrees circle 1 361 triangle 4 181 rectangle 5 361
>>> df.add(1) # doctest: +SKIP angles degrees circle 1 361 triangle 4 181 rectangle 5 361
Divide by constant with reverse version.
>>> df.div(10) # doctest: +SKIP angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0
>>> df.rdiv(10) # doctest: +SKIP angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2] # doctest: +SKIP angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub([1, 2], axis='columns') # doctest: +SKIP angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), # doctest: +SKIP ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]}, # doctest: +SKIP ... index=['circle', 'triangle', 'rectangle']) >>> other # doctest: +SKIP angles circle 0 triangle 3 rectangle 4
>>> df * other # doctest: +SKIP angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN
>>> df.mul(other, fill_value=0) # doctest: +SKIP angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], # doctest: +SKIP ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex # doctest: +SKIP angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720
>>> df.div(df_multindex, level=1, fill_value=0) # doctest: +SKIP angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
-
divide
(other, axis='columns', level=None, fill_value=None)¶ Get Floating division of dataframe and other, element-wise (binary operator truediv).
This docstring was copied from pandas.core.frame.DataFrame.divide.
Some inconsistencies with the Dask version may exist.
Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rtruediv.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
Parameters: - other : scalar, sequence, Series, or DataFrame
Any single or multiple element data structure, or list-like object.
- axis : {0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
- level : int or label
Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value : float or None, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
Returns: - DataFrame
Result of the arithmetic operation.
See also
DataFrame.add
- Add DataFrames.
DataFrame.sub
- Subtract DataFrames.
DataFrame.mul
- Multiply DataFrames.
DataFrame.div
- Divide DataFrames (float division).
DataFrame.truediv
- Divide DataFrames (float division).
DataFrame.floordiv
- Divide DataFrames (integer division).
DataFrame.mod
- Calculate modulo (remainder after division).
DataFrame.pow
- Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4], # doctest: +SKIP ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df # doctest: +SKIP angles degrees circle 0 360 triangle 3 180 rectangle 4 360
Add a scalar with the operator version, which returns the same results.
>>> df + 1 # doctest: +SKIP angles degrees circle 1 361 triangle 4 181 rectangle 5 361
>>> df.add(1) # doctest: +SKIP angles degrees circle 1 361 triangle 4 181 rectangle 5 361
Divide by constant with reverse version.
>>> df.div(10) # doctest: +SKIP angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0
>>> df.rdiv(10) # doctest: +SKIP angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2] # doctest: +SKIP angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub([1, 2], axis='columns') # doctest: +SKIP angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), # doctest: +SKIP ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]}, # doctest: +SKIP ... index=['circle', 'triangle', 'rectangle']) >>> other # doctest: +SKIP angles circle 0 triangle 3 rectangle 4
>>> df * other # doctest: +SKIP angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN
>>> df.mul(other, fill_value=0) # doctest: +SKIP angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], # doctest: +SKIP ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex # doctest: +SKIP angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720
>>> df.div(df_multindex, level=1, fill_value=0) # doctest: +SKIP angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
-
dot
(other, meta='__no_default__')¶ Compute the dot product between the Series and the columns of other.
This docstring was copied from pandas.core.series.Series.dot.
Some inconsistencies with the Dask version may exist.
This method computes the dot product between the Series and another one, or between the Series and each column of a DataFrame or array.
It can also be called using self @ other in Python >= 3.5.
Parameters: - other : Series, DataFrame or array-like
The other object to compute the dot product with its columns.
Returns: - scalar, Series or numpy.ndarray
Return the dot product of the Series and other if other is a Series; a Series of the dot products of the Series and each column of other if other is a DataFrame; or a numpy.ndarray of the dot products with each column if other is a numpy array.
See also
DataFrame.dot
- Compute the matrix product with the DataFrame.
Series.mul
- Multiplication of series and other, element-wise.
Notes
The Series and other have to share the same index if other is a Series or a DataFrame.
Examples
>>> s = pd.Series([0, 1, 2, 3]) # doctest: +SKIP >>> other = pd.Series([-1, 2, -3, 4]) # doctest: +SKIP >>> s.dot(other) # doctest: +SKIP 8 >>> s @ other # doctest: +SKIP 8 >>> df = pd.DataFrame([[0, 1], [-2, 3], [4, -5], [6, 7]]) # doctest: +SKIP >>> s.dot(df) # doctest: +SKIP 0 24 1 14 dtype: int64 >>> arr = np.array([[0, 1], [-2, 3], [4, -5], [6, 7]]) # doctest: +SKIP >>> s.dot(arr) # doctest: +SKIP array([24, 14])
-
drop
(labels=None, axis=0, columns=None, errors='raise')[source]¶ Drop specified labels from rows or columns.
This docstring was copied from pandas.core.frame.DataFrame.drop.
Some inconsistencies with the Dask version may exist.
Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level.
Parameters: - labels : single label or list-like
Index or column labels to drop.
- axis : {0 or ‘index’, 1 or ‘columns’}, default 0
Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
- index : single label or list-like (Not supported in Dask)
Alternative to specifying axis (labels, axis=0 is equivalent to index=labels).
- columns : single label or list-like
Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).
- level : int or level name, optional (Not supported in Dask)
For MultiIndex, level from which the labels will be removed.
- inplace : bool, default False (Not supported in Dask)
If False, return a copy. Otherwise, do operation inplace and return None.
- errors : {‘ignore’, ‘raise’}, default ‘raise’
If ‘ignore’, suppress error and only existing labels are dropped.
Returns: - DataFrame or None
DataFrame without the removed index or column labels or None if inplace=True.
Raises: - KeyError
If any of the labels is not found in the selected axis.
See also
DataFrame.loc
- Label-location based indexer for selection by label.
DataFrame.dropna
- Return DataFrame with labels on given axis omitted where (all or any) data are missing.
DataFrame.drop_duplicates
- Return DataFrame with duplicate rows removed, optionally only considering certain columns.
Series.drop
- Return Series with specified index labels removed.
Examples
>>> df = pd.DataFrame(np.arange(12).reshape(3, 4), # doctest: +SKIP ... columns=['A', 'B', 'C', 'D']) >>> df # doctest: +SKIP A B C D 0 0 1 2 3 1 4 5 6 7 2 8 9 10 11
Drop columns
>>> df.drop(['B', 'C'], axis=1) # doctest: +SKIP A D 0 0 3 1 4 7 2 8 11
>>> df.drop(columns=['B', 'C']) # doctest: +SKIP A D 0 0 3 1 4 7 2 8 11
Drop a row by index
>>> df.drop([0, 1]) # doctest: +SKIP A B C D 2 8 9 10 11
Drop columns and/or rows of MultiIndex DataFrame
>>> midx = pd.MultiIndex(levels=[['lama', 'cow', 'falcon'], # doctest: +SKIP ... ['speed', 'weight', 'length']], ... codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2], ... [0, 1, 2, 0, 1, 2, 0, 1, 2]]) >>> df = pd.DataFrame(index=midx, columns=['big', 'small'], # doctest: +SKIP ... data=[[45, 30], [200, 100], [1.5, 1], [30, 20], ... [250, 150], [1.5, 0.8], [320, 250], ... [1, 0.8], [0.3, 0.2]]) >>> df # doctest: +SKIP big small lama speed 45.0 30.0 weight 200.0 100.0 length 1.5 1.0 cow speed 30.0 20.0 weight 250.0 150.0 length 1.5 0.8 falcon speed 320.0 250.0 weight 1.0 0.8 length 0.3 0.2
>>> df.drop(index='cow', columns='small') # doctest: +SKIP big lama speed 45.0 weight 200.0 length 1.5 falcon speed 320.0 weight 1.0 length 0.3
>>> df.drop(index='length', level=1) # doctest: +SKIP big small lama speed 45.0 30.0 weight 200.0 100.0 cow speed 30.0 20.0 weight 250.0 150.0 falcon speed 320.0 250.0 weight 1.0 0.8
-
drop_duplicates
(subset=None, split_every=None, split_out=1, ignore_index=False, **kwargs)¶ Return DataFrame with duplicate rows removed.
This docstring was copied from pandas.core.frame.DataFrame.drop_duplicates.
Some inconsistencies with the Dask version may exist.
Considering certain columns is optional. Indexes, including time indexes, are ignored.
Parameters: - subset : column label or sequence of labels, optional
Only consider certain columns for identifying duplicates, by default use all of the columns.
- keep : {‘first’, ‘last’, False}, default ‘first’ (Not supported in Dask)
Determines which duplicates (if any) to keep.
- first : Drop duplicates except for the first occurrence.
- last : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.
- inplace : bool, default False (Not supported in Dask)
Whether to drop duplicates in place or to return a copy.
- ignore_index : bool, default False
If True, the resulting axis will be labeled 0, 1, …, n - 1.
New in version 1.0.0.
Returns: - DataFrame or None
DataFrame with duplicates removed or None if inplace=True.
See also
DataFrame.value_counts
- Count unique combinations of columns.
Examples
Consider a dataset containing ramen ratings.
>>> df = pd.DataFrame({ # doctest: +SKIP ... 'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'], ... 'style': ['cup', 'cup', 'cup', 'pack', 'pack'], ... 'rating': [4, 4, 3.5, 15, 5] ... }) >>> df # doctest: +SKIP brand style rating 0 Yum Yum cup 4.0 1 Yum Yum cup 4.0 2 Indomie cup 3.5 3 Indomie pack 15.0 4 Indomie pack 5.0
By default, it removes duplicate rows based on all columns.
>>> df.drop_duplicates() # doctest: +SKIP brand style rating 0 Yum Yum cup 4.0 2 Indomie cup 3.5 3 Indomie pack 15.0 4 Indomie pack 5.0
To remove duplicates on specific column(s), use subset.
>>> df.drop_duplicates(subset=['brand']) # doctest: +SKIP brand style rating 0 Yum Yum cup 4.0 2 Indomie cup 3.5
To remove duplicates and keep last occurrences, use keep.
>>> df.drop_duplicates(subset=['brand', 'style'], keep='last') # doctest: +SKIP brand style rating 1 Yum Yum cup 4.0 2 Indomie cup 3.5 4 Indomie pack 5.0
-
dropna
(how='any', subset=None, thresh=None)[source]¶ Remove missing values.
This docstring was copied from pandas.core.frame.DataFrame.dropna.
Some inconsistencies with the Dask version may exist.
See the User Guide for more on which values are considered missing, and how to work with missing data.
Parameters: - axis : {0 or ‘index’, 1 or ‘columns’}, default 0 (Not supported in Dask)
Determine if rows or columns which contain missing values are removed.
- 0, or ‘index’ : Drop rows which contain missing values.
- 1, or ‘columns’ : Drop columns which contain missing value.
Changed in version 1.0.0: Pass tuple or list to drop on multiple axes. Only a single axis is allowed.
- how : {‘any’, ‘all’}, default ‘any’
Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.
- ‘any’ : If any NA values are present, drop that row or column.
- ‘all’ : If all values are NA, drop that row or column.
- thresh : int, optional
Require that many non-NA values.
- subset : array-like, optional
Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.
- inplace : bool, default False (Not supported in Dask)
If True, do operation inplace and return None.
Returns: - DataFrame or None
DataFrame with NA entries dropped from it or None if inplace=True.
See also
DataFrame.isna
- Indicate missing values.
DataFrame.notna
- Indicate existing (non-missing) values.
DataFrame.fillna
- Replace missing values.
Series.dropna
- Drop missing values.
Index.dropna
- Drop missing indices.
Examples
>>> df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'], # doctest: +SKIP ... "toy": [np.nan, 'Batmobile', 'Bullwhip'], ... "born": [pd.NaT, pd.Timestamp("1940-04-25"), ... pd.NaT]}) >>> df # doctest: +SKIP name toy born 0 Alfred NaN NaT 1 Batman Batmobile 1940-04-25 2 Catwoman Bullwhip NaT
Drop the rows where at least one element is missing.
>>> df.dropna() # doctest: +SKIP name toy born 1 Batman Batmobile 1940-04-25
Drop the columns where at least one element is missing.
>>> df.dropna(axis='columns') # doctest: +SKIP name 0 Alfred 1 Batman 2 Catwoman
Drop the rows where all elements are missing.
>>> df.dropna(how='all') # doctest: +SKIP name toy born 0 Alfred NaN NaT 1 Batman Batmobile 1940-04-25 2 Catwoman Bullwhip NaT
Keep only the rows with at least 2 non-NA values.
>>> df.dropna(thresh=2) # doctest: +SKIP name toy born 1 Batman Batmobile 1940-04-25 2 Catwoman Bullwhip NaT
Define in which columns to look for missing values.
>>> df.dropna(subset=['name', 'toy']) # doctest: +SKIP name toy born 1 Batman Batmobile 1940-04-25 2 Catwoman Bullwhip NaT
Keep the DataFrame with valid entries in the same variable.
>>> df.dropna(inplace=True) # doctest: +SKIP >>> df # doctest: +SKIP name toy born 1 Batman Batmobile 1940-04-25
-
dtypes
¶ Return data types
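For example (a minimal sketch; the column names and dtypes are illustrative):
>>> import dask.dataframe as dd # doctest: +SKIP
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2], 'y': [0.5, 1.5]}), npartitions=1) # illustrative data # doctest: +SKIP
>>> ddf.dtypes # doctest: +SKIP
x int64
y float64
dtype: object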
-
eq
(other, axis='columns', level=None)¶ Get Equal to of dataframe and other, element-wise (binary operator eq).
This docstring was copied from pandas.core.frame.DataFrame.eq.
Some inconsistencies with the Dask version may exist.
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
Parameters: - other : scalar, sequence, Series, or DataFrame
Any single or multiple element data structure, or list-like object.
- axis : {0 or ‘index’, 1 or ‘columns’}, default ‘columns’
Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
- level : int or label
Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: - DataFrame of bool
Result of the comparison.
See also
DataFrame.eq
- Compare DataFrames for equality elementwise.
DataFrame.ne
- Compare DataFrames for inequality elementwise.
DataFrame.le
- Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt
- Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge
- Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt
- Compare DataFrames for strictly greater than inequality elementwise.
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
>>> df = pd.DataFrame({'cost': [250, 150, 100], # doctest: +SKIP ... 'revenue': [100, 250, 300]}, ... index=['A', 'B', 'C']) >>> df # doctest: +SKIP cost revenue A 250 100 B 150 250 C 100 300
Comparison with a scalar, using either the operator or method:
>>> df == 100 # doctest: +SKIP cost revenue A False True B False False C True False
>>> df.eq(100) # doctest: +SKIP cost revenue A False True B False False C True False
When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:
>>> df != pd.Series([100, 250], index=["cost", "revenue"]) # doctest: +SKIP cost revenue A True True B True False C False True
Use the method to control the broadcast axis:
>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index') # doctest: +SKIP cost revenue A True False B True True C True True D True True
When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:
>>> df == [250, 100] # doctest: +SKIP cost revenue A True True B False False C False False
Use the method to control the axis:
>>> df.eq([250, 250, 100], axis='index') # doctest: +SKIP cost revenue A True False B False True C True False
Compare to a DataFrame of different shape.
>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]}, # doctest: +SKIP ... index=['A', 'B', 'C', 'D']) >>> other # doctest: +SKIP revenue A 300 B 250 C 100 D 150
>>> df.gt(other) # doctest: +SKIP cost revenue A False False B False False C False True D False False
Compare to a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220], # doctest: +SKIP ... 'revenue': [100, 250, 300, 200, 175, 225]}, ... index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'], ... ['A', 'B', 'C', 'A', 'B', 'C']]) >>> df_multindex # doctest: +SKIP cost revenue Q1 A 250 100 B 150 250 C 100 300 Q2 A 150 200 B 300 175 C 220 225
>>> df.le(df_multindex, level=1) # doctest: +SKIP cost revenue Q1 A True True B True True C True True Q2 A False True B True False C True False
-
eval
(expr, inplace=None, **kwargs)[source]¶ Evaluate a string describing operations on DataFrame columns.
This docstring was copied from pandas.core.frame.DataFrame.eval.
Some inconsistencies with the Dask version may exist.
Operates on columns only, not specific rows or elements. This allows eval to run arbitrary code, which can make you vulnerable to code injection if you pass user input to this function.
Parameters: - expr : str
The expression string to evaluate.
- inplace : bool, default False
If the expression contains an assignment, whether to perform the operation inplace and mutate the existing DataFrame. Otherwise, a new DataFrame is returned.
- **kwargs
See the documentation for eval() for complete details on the keyword arguments accepted by query().
Returns: - ndarray, scalar, pandas object, or None
The result of the evaluation or None if inplace=True.
See also
DataFrame.query
- Evaluates a boolean expression to query the columns of a frame.
DataFrame.assign
- Can evaluate an expression or function to create new values for a column.
eval
- Evaluate a Python expression as a string using various backends.
Notes
For more details see the API documentation for eval(). For detailed examples see enhancing performance with eval.
Examples
>>> df = pd.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2)}) # doctest: +SKIP >>> df # doctest: +SKIP A B 0 1 10 1 2 8 2 3 6 3 4 4 4 5 2 >>> df.eval('A + B') # doctest: +SKIP 0 11 1 10 2 9 3 8 4 7 dtype: int64
Assignment is allowed though by default the original DataFrame is not modified.
>>> df.eval('C = A + B') # doctest: +SKIP A B C 0 1 10 11 1 2 8 10 2 3 6 9 3 4 4 8 4 5 2 7 >>> df # doctest: +SKIP A B 0 1 10 1 2 8 2 3 6 3 4 4 4 5 2
Use inplace=True to modify the original DataFrame.
>>> df.eval('C = A + B', inplace=True) # doctest: +SKIP >>> df # doctest: +SKIP A B C 0 1 10 11 1 2 8 10 2 3 6 9 3 4 4 8 4 5 2 7
Multiple columns can be assigned to using multi-line expressions:
>>> df.eval( # doctest: +SKIP ... ''' ... C = A + B ... D = A - B ... ''' ... ) A B C D 0 1 10 11 -9 1 2 8 10 -6 2 3 6 9 -3 3 4 4 8 0 4 5 2 7 3
-
explode
(column)[source]¶ Transform each element of a list-like to a row, replicating index values.
This docstring was copied from pandas.core.frame.DataFrame.explode.
Some inconsistencies with the Dask version may exist.
New in version 0.25.0.
Parameters: - column : str or tuple
Column to explode.
- ignore_index : bool, default False (Not supported in Dask)
If True, the resulting index will be labeled 0, 1, …, n - 1.
New in version 1.1.0.
Returns: - DataFrame
Exploded lists to rows of the subset columns; index will be duplicated for these rows.
Raises: - ValueError
If columns of the frame are not unique.
See also
DataFrame.unstack
- Pivot a level of the (necessarily hierarchical) index labels.
DataFrame.melt
- Unpivot a DataFrame from wide format to long format.
Series.explode
- Explode a DataFrame from list-like columns to long format.
Notes
This routine will explode list-likes including lists, tuples, sets, Series, and np.ndarray. The result dtype of the subset rows will be object. Scalars will be returned unchanged, and empty list-likes will result in a np.nan for that row. In addition, the ordering of rows in the output will be non-deterministic when exploding sets.
Examples
>>> df = pd.DataFrame({'A': [[1, 2, 3], 'foo', [], [3, 4]], 'B': 1}) # doctest: +SKIP >>> df # doctest: +SKIP A B 0 [1, 2, 3] 1 1 foo 1 2 [] 1 3 [3, 4] 1
>>> df.explode('A') # doctest: +SKIP A B 0 1 1 0 2 1 0 3 1 1 foo 1 2 NaN 1 3 3 1 3 4 1
-
ffill
(axis=None, limit=None)¶ Synonym for DataFrame.fillna() with method='ffill'.
This docstring was copied from pandas.core.frame.DataFrame.ffill.
Some inconsistencies with the Dask version may exist.
Returns: - Series/DataFrame or None
Object with missing values filled or None if inplace=True.
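For illustration, a minimal sketch mirroring the bfill example above (data and partitioning are illustrative):
>>> import dask.dataframe as dd # doctest: +SKIP
>>> s = dd.from_pandas(pd.Series([1, None, 3]), npartitions=1) # illustrative data # doctest: +SKIP
>>> s.ffill().compute().tolist() # doctest: +SKIP
[1.0, 1.0, 3.0]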
-
fillna
(value=None, method=None, limit=None, axis=None)¶ Fill NA/NaN values using the specified method.
This docstring was copied from pandas.core.frame.DataFrame.fillna.
Some inconsistencies with the Dask version may exist.
Parameters: - value : scalar, dict, Series, or DataFrame
Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.
- method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None
Method to use for filling holes in reindexed Series. pad / ffill: propagate last valid observation forward to next valid. backfill / bfill: use next valid observation to fill gap.
- axis : {0 or ‘index’, 1 or ‘columns’}
Axis along which to fill missing values.
- inplace : bool, default False (Not supported in Dask)
If True, fill in-place. Note: this will modify any other views on this object (e.g., a no-copy slice for a column in a DataFrame).
- limit : int, default None
If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
- downcast : dict, default is None (Not supported in Dask)
A dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible).
Returns: - DataFrame or None
Object with missing values filled or None if inplace=True.
See also
interpolate
- Fill NaN values using interpolation.
reindex
- Conform object to new index.
asfreq
- Convert TimeSeries to specified frequency.
Examples
>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0], # doctest: +SKIP ... [3, 4, np.nan, 1], ... [np.nan, np.nan, np.nan, 5], ... [np.nan, 3, np.nan, 4]], ... columns=list('ABCD')) >>> df # doctest: +SKIP A B C D 0 NaN 2.0 NaN 0 1 3.0 4.0 NaN 1 2 NaN NaN NaN 5 3 NaN 3.0 NaN 4
Replace all NaN elements with 0s.
>>> df.fillna(0) # doctest: +SKIP A B C D 0 0.0 2.0 0.0 0 1 3.0 4.0 0.0 1 2 0.0 0.0 0.0 5 3 0.0 3.0 0.0 4
We can also propagate non-null values forward or backward.
>>> df.fillna(method='ffill') # doctest: +SKIP A B C D 0 NaN 2.0 NaN 0 1 3.0 4.0 NaN 1 2 3.0 4.0 NaN 5 3 3.0 3.0 NaN 4
Replace all NaN elements in column ‘A’, ‘B’, ‘C’, and ‘D’, with 0, 1, 2, and 3 respectively.
>>> values = {'A': 0, 'B': 1, 'C': 2, 'D': 3} # doctest: +SKIP >>> df.fillna(value=values) # doctest: +SKIP A B C D 0 0.0 2.0 2.0 0 1 3.0 4.0 2.0 1 2 0.0 1.0 2.0 5 3 0.0 3.0 2.0 4
Only replace the first NaN element.
>>> df.fillna(value=values, limit=1) # doctest: +SKIP A B C D 0 0.0 2.0 2.0 0 1 3.0 4.0 NaN 1 2 NaN 1.0 NaN 5 3 NaN 3.0 NaN 4
-
first
(offset)¶ Select initial periods of time series data based on a date offset.
This docstring was copied from pandas.core.frame.DataFrame.first.
Some inconsistencies with the Dask version may exist.
For a DataFrame with dates as its index, this function selects the first few rows based on a date offset.
Parameters: - offset : str, DateOffset or dateutil.relativedelta
The offset length of the data that will be selected. For instance, ‘1M’ will display all the rows having their index within the first month.
Returns: - Series or DataFrame
A subset of the caller.
Raises: - TypeError
If the index is not a DatetimeIndex.
See also
last
- Select final periods of time series based on a date offset.
at_time
- Select values at a particular time of the day.
between_time
- Select values between particular times of the day.
Examples
>>> i = pd.date_range('2018-04-09', periods=4, freq='2D') # doctest: +SKIP >>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i) # doctest: +SKIP >>> ts # doctest: +SKIP A 2018-04-09 1 2018-04-11 2 2018-04-13 3 2018-04-15 4
Get the rows for the first 3 days:
>>> ts.first('3D') # doctest: +SKIP A 2018-04-09 1 2018-04-11 2
Notice that data for the first 3 calendar days was returned, not the first 3 rows observed in the dataset, so data for 2018-04-13 was not included.
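A hedged sketch of the Dask equivalent; first relies on a DatetimeIndex, and from_pandas (used here as an illustrative setup) supplies the known divisions it works from:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> i = pd.date_range('2018-04-09', periods=4, freq='2D')
>>> ts = dd.from_pandas(pd.DataFrame({'A': [1, 2, 3, 4]}, index=i),
...                     npartitions=2)
>>> ts.first('3D').compute()  # doctest: +SKIP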
-
floordiv
(other, axis='columns', level=None, fill_value=None)¶ Get Integer division of dataframe and other, element-wise (binary operator floordiv).
This docstring was copied from pandas.core.frame.DataFrame.floordiv.
Some inconsistencies with the Dask version may exist.
Equivalent to dataframe // other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rfloordiv.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
Parameters: - other : scalar, sequence, Series, or DataFrame
Any single or multiple element data structure, or list-like object.
- axis : {0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
- level : int or label
Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value : float or None, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
Returns: - DataFrame
Result of the arithmetic operation.
See also
DataFrame.add
- Add DataFrames.
DataFrame.sub
- Subtract DataFrames.
DataFrame.mul
- Multiply DataFrames.
DataFrame.div
- Divide DataFrames (float division).
DataFrame.truediv
- Divide DataFrames (float division).
DataFrame.floordiv
- Divide DataFrames (integer division).
DataFrame.mod
- Calculate modulo (remainder after division).
DataFrame.pow
- Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4], # doctest: +SKIP ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df # doctest: +SKIP angles degrees circle 0 360 triangle 3 180 rectangle 4 360
Add a scalar with the operator version, which returns the same results.
>>> df + 1 # doctest: +SKIP angles degrees circle 1 361 triangle 4 181 rectangle 5 361
>>> df.add(1) # doctest: +SKIP angles degrees circle 1 361 triangle 4 181 rectangle 5 361
Divide by constant with reverse version.
>>> df.div(10) # doctest: +SKIP angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0
>>> df.rdiv(10) # doctest: +SKIP angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2] # doctest: +SKIP angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub([1, 2], axis='columns') # doctest: +SKIP angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), # doctest: +SKIP ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]}, # doctest: +SKIP ... index=['circle', 'triangle', 'rectangle']) >>> other # doctest: +SKIP angles circle 0 triangle 3 rectangle 4
>>> df * other # doctest: +SKIP angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN
>>> df.mul(other, fill_value=0) # doctest: +SKIP angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], # doctest: +SKIP ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex # doctest: +SKIP angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720
>>> df.div(df_multindex, level=1, fill_value=0) # doctest: +SKIP angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
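A small sketch of the Dask form (the frame and partitioning are illustrative); both the // operator and the floordiv method apply elementwise within each partition:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'angles': [0, 3, 4],
...                                    'degrees': [360, 180, 360]}),
...                      npartitions=2)
>>> (ddf // 7).compute()       # doctest: +SKIP
>>> ddf.floordiv(7).compute()  # doctest: +SKIP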
-
ge
(other, axis='columns', level=None)¶ Get Greater than or equal to of dataframe and other, element-wise (binary operator ge).
This docstring was copied from pandas.core.frame.DataFrame.ge.
Some inconsistencies with the Dask version may exist.
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
Parameters: - other : scalar, sequence, Series, or DataFrame
Any single or multiple element data structure, or list-like object.
- axis : {0 or ‘index’, 1 or ‘columns’}, default ‘columns’
Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
- level : int or label
Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: - DataFrame of bool
Result of the comparison.
See also
DataFrame.eq
- Compare DataFrames for equality elementwise.
DataFrame.ne
- Compare DataFrames for inequality elementwise.
DataFrame.le
- Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt
- Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge
- Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt
- Compare DataFrames for strictly greater than inequality elementwise.
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
>>> df = pd.DataFrame({'cost': [250, 150, 100], # doctest: +SKIP ... 'revenue': [100, 250, 300]}, ... index=['A', 'B', 'C']) >>> df # doctest: +SKIP cost revenue A 250 100 B 150 250 C 100 300
Comparison with a scalar, using either the operator or method:
>>> df == 100 # doctest: +SKIP cost revenue A False True B False False C True False
>>> df.eq(100) # doctest: +SKIP cost revenue A False True B False False C True False
When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:
>>> df != pd.Series([100, 250], index=["cost", "revenue"]) # doctest: +SKIP cost revenue A True True B True False C False True
Use the method to control the broadcast axis:
>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index') # doctest: +SKIP cost revenue A True False B True True C True True D True True
When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:
>>> df == [250, 100] # doctest: +SKIP cost revenue A True True B False False C False False
Use the method to control the axis:
>>> df.eq([250, 250, 100], axis='index') # doctest: +SKIP cost revenue A True False B False True C True False
Compare to a DataFrame of different shape.
>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]}, # doctest: +SKIP ... index=['A', 'B', 'C', 'D']) >>> other # doctest: +SKIP revenue A 300 B 250 C 100 D 150
>>> df.gt(other) # doctest: +SKIP cost revenue A False False B False False C False True D False False
Compare to a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220], # doctest: +SKIP ... 'revenue': [100, 250, 300, 200, 175, 225]}, ... index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'], ... ['A', 'B', 'C', 'A', 'B', 'C']]) >>> df_multindex # doctest: +SKIP cost revenue Q1 A 250 100 B 150 250 C 100 300 Q2 A 150 200 B 300 175 C 220 225
>>> df.le(df_multindex, level=1) # doctest: +SKIP cost revenue Q1 A True True B True True C True True Q2 A False True B True False C True False
-
get_partition
(n)¶ Get a dask DataFrame/Series representing the nth partition.
-
groupby
(by=None, group_keys=True, sort=None, observed=None, dropna=None, **kwargs)[source]¶ Group DataFrame using a mapper or by a Series of columns.
This docstring was copied from pandas.core.frame.DataFrame.groupby.
Some inconsistencies with the Dask version may exist.
A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
Parameters: - by : mapping, function, label, or list of labels
Used to determine the groups for the groupby. If by is a function, it’s called on each value of the object’s index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series’ values are first aligned; see .align() method). If an ndarray is passed, the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted as a (single) key.
- axis : {0 or ‘index’, 1 or ‘columns’}, default 0 (Not supported in Dask)
Split along rows (0) or columns (1).
- level : int, level name, or sequence of such, default None (Not supported in Dask)
If the axis is a MultiIndex (hierarchical), group by a particular level or levels.
- as_index : bool, default True (Not supported in Dask)
For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.
- sort : bool, default True
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.
- group_keys : bool, default True
When calling apply, add group keys to index to identify pieces.
- squeeze : bool, default False (Not supported in Dask)
Reduce the dimensionality of the return type if possible, otherwise return a consistent type.
Deprecated since version 1.1.0.
- observed : bool, default False
This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
- dropna : bool, default True
If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups
New in version 1.1.0.
Returns: - DataFrameGroupBy
Returns a groupby object that contains information about the groups.
See also
resample
- Convenience method for frequency conversion and resampling of time series.
Notes
See the user guide for more.
Examples
>>> df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', # doctest: +SKIP ... 'Parrot', 'Parrot'], ... 'Max Speed': [380., 370., 24., 26.]}) >>> df # doctest: +SKIP Animal Max Speed 0 Falcon 380.0 1 Falcon 370.0 2 Parrot 24.0 3 Parrot 26.0 >>> df.groupby(['Animal']).mean() # doctest: +SKIP Max Speed Animal Falcon 375.0 Parrot 25.0
Hierarchical Indexes
We can groupby different levels of a hierarchical index using the level parameter:
>>> arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'], # doctest: +SKIP ... ['Captive', 'Wild', 'Captive', 'Wild']] >>> index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type')) # doctest: +SKIP >>> df = pd.DataFrame({'Max Speed': [390., 350., 30., 20.]}, # doctest: +SKIP ... index=index) >>> df # doctest: +SKIP Max Speed Animal Type Falcon Captive 390.0 Wild 350.0 Parrot Captive 30.0 Wild 20.0 >>> df.groupby(level=0).mean() # doctest: +SKIP Max Speed Animal Falcon 370.0 Parrot 25.0 >>> df.groupby(level="Type").mean() # doctest: +SKIP Max Speed Type Captive 210.0 Wild 185.0
We can also choose to include or exclude NA in group keys by setting the dropna parameter; the default setting is True:
>>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]] # doctest: +SKIP >>> df = pd.DataFrame(l, columns=["a", "b", "c"]) # doctest: +SKIP
>>> df.groupby(by=["b"]).sum() # doctest: +SKIP a c b 1.0 2 3 2.0 2 5
>>> df.groupby(by=["b"], dropna=False).sum() # doctest: +SKIP a c b 1.0 2 3 2.0 2 5 NaN 1 4
>>> l = [["a", 12, 12], [None, 12.3, 33.], ["b", 12.3, 123], ["a", 1, 1]] # doctest: +SKIP >>> df = pd.DataFrame(l, columns=["a", "b", "c"]) # doctest: +SKIP
>>> df.groupby(by="a").sum() # doctest: +SKIP b c a a 13.0 13.0 b 12.3 123.0
>>> df.groupby(by="a", dropna=False).sum() # doctest: +SKIP b c a a 13.0 13.0 b 12.3 123.0 NaN 12.3 33.0
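A minimal Dask-flavored sketch (illustrative frame and partitioning); the aggregation is computed per partition and then combined, so the data need not fit in memory:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
...                     'Max Speed': [380., 370., 24., 26.]})
>>> ddf = dd.from_pandas(pdf, npartitions=2)
>>> ddf.groupby('Animal')['Max Speed'].mean().compute()  # doctest: +SKIP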
-
gt
(other, axis='columns', level=None)¶ Get Greater than of dataframe and other, element-wise (binary operator gt).
This docstring was copied from pandas.core.frame.DataFrame.gt.
Some inconsistencies with the Dask version may exist.
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
Parameters: - other : scalar, sequence, Series, or DataFrame
Any single or multiple element data structure, or list-like object.
- axis : {0 or ‘index’, 1 or ‘columns’}, default ‘columns’
Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
- level : int or label
Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: - DataFrame of bool
Result of the comparison.
See also
DataFrame.eq
- Compare DataFrames for equality elementwise.
DataFrame.ne
- Compare DataFrames for inequality elementwise.
DataFrame.le
- Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt
- Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge
- Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt
- Compare DataFrames for strictly greater than inequality elementwise.
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
>>> df = pd.DataFrame({'cost': [250, 150, 100], # doctest: +SKIP ... 'revenue': [100, 250, 300]}, ... index=['A', 'B', 'C']) >>> df # doctest: +SKIP cost revenue A 250 100 B 150 250 C 100 300
Comparison with a scalar, using either the operator or method:
>>> df == 100 # doctest: +SKIP cost revenue A False True B False False C True False
>>> df.eq(100) # doctest: +SKIP cost revenue A False True B False False C True False
When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:
>>> df != pd.Series([100, 250], index=["cost", "revenue"]) # doctest: +SKIP cost revenue A True True B True False C False True
Use the method to control the broadcast axis:
>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index') # doctest: +SKIP cost revenue A True False B True True C True True D True True
When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:
>>> df == [250, 100] # doctest: +SKIP cost revenue A True True B False False C False False
Use the method to control the axis:
>>> df.eq([250, 250, 100], axis='index') # doctest: +SKIP cost revenue A True False B False True C True False
Compare to a DataFrame of different shape.
>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]}, # doctest: +SKIP ... index=['A', 'B', 'C', 'D']) >>> other # doctest: +SKIP revenue A 300 B 250 C 100 D 150
>>> df.gt(other) # doctest: +SKIP cost revenue A False False B False False C False True D False False
Compare to a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220], # doctest: +SKIP ... 'revenue': [100, 250, 300, 200, 175, 225]}, ... index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'], ... ['A', 'B', 'C', 'A', 'B', 'C']]) >>> df_multindex # doctest: +SKIP cost revenue Q1 A 250 100 B 150 250 C 100 300 Q2 A 150 200 B 300 175 C 220 225
>>> df.le(df_multindex, level=1) # doctest: +SKIP cost revenue Q1 A True True B True True C True True Q2 A False True B True False C True False
-
head
(n=5, npartitions=1, compute=True)¶ First n rows of the dataset
Parameters: - n : int, optional
The number of rows to return. Default is 5.
- npartitions : int, optional
Elements are only taken from the first npartitions, with a default of 1. If there are fewer than n rows in the first npartitions, a warning will be raised and any found rows returned. Pass -1 to use all partitions.
- compute : bool, optional
Whether to compute the result, default is True.
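A short sketch (illustrative frame): because only the first partition is read by default, head is one of the cheapest ways to inspect a large collection:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(10)}), npartitions=5)
>>> ddf.head(3)                  # reads only the first partition
>>> ddf.head(8, npartitions=-1)  # search every partition if the first is too small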
-
idxmax
(axis=None, skipna=True, split_every=False)¶ Return index of first occurrence of maximum over requested axis.
This docstring was copied from pandas.core.frame.DataFrame.idxmax.
Some inconsistencies with the Dask version may exist.
NA/null values are excluded.
Parameters: - axis : {0 or ‘index’, 1 or ‘columns’}, default 0
The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
- skipna : bool, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
Returns: - Series
Indexes of maxima along the specified axis.
Raises: - ValueError
- If the row/column is empty
See also
Series.idxmax
- Return index of the maximum element.
Notes
This method is the DataFrame version of ndarray.argmax.
Examples
Consider a dataset containing food consumption in Argentina.
>>> df = pd.DataFrame({'consumption': [10.51, 103.11, 55.48], # doctest: +SKIP ... 'co2_emissions': [37.2, 19.66, 1712]}, ... index=['Pork', 'Wheat Products', 'Beef'])
>>> df # doctest: +SKIP consumption co2_emissions Pork 10.51 37.20 Wheat Products 103.11 19.66 Beef 55.48 1712.00
By default, it returns the index for the maximum value in each column.
>>> df.idxmax() # doctest: +SKIP consumption Wheat Products co2_emissions Beef dtype: object
To return the index for the maximum value in each row, use axis="columns".
>>> df.idxmax(axis="columns") # doctest: +SKIP Pork co2_emissions Wheat Products consumption Beef co2_emissions dtype: object
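The same example on a Dask collection, as a sketch (the partitioning is an illustrative choice); the column-wise result combines per-partition maxima:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame({'consumption': [10.51, 103.11, 55.48],
...                     'co2_emissions': [37.2, 19.66, 1712]},
...                    index=['Pork', 'Wheat Products', 'Beef'])
>>> ddf = dd.from_pandas(pdf, npartitions=2)
>>> ddf.idxmax().compute()  # doctest: +SKIP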
-
idxmin
(axis=None, skipna=True, split_every=False)¶ Return index of first occurrence of minimum over requested axis.
This docstring was copied from pandas.core.frame.DataFrame.idxmin.
Some inconsistencies with the Dask version may exist.
NA/null values are excluded.
Parameters: - axis : {0 or ‘index’, 1 or ‘columns’}, default 0
The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
- skipna : bool, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
Returns: - Series
Indexes of minima along the specified axis.
Raises: - ValueError
- If the row/column is empty
See also
Series.idxmin
- Return index of the minimum element.
Notes
This method is the DataFrame version of ndarray.argmin.
Examples
Consider a dataset containing food consumption in Argentina.
>>> df = pd.DataFrame({'consumption': [10.51, 103.11, 55.48], # doctest: +SKIP ... 'co2_emissions': [37.2, 19.66, 1712]}, ... index=['Pork', 'Wheat Products', 'Beef'])
>>> df # doctest: +SKIP consumption co2_emissions Pork 10.51 37.20 Wheat Products 103.11 19.66 Beef 55.48 1712.00
By default, it returns the index for the minimum value in each column.
>>> df.idxmin() # doctest: +SKIP consumption Pork co2_emissions Wheat Products dtype: object
To return the index for the minimum value in each row, use axis="columns".
>>> df.idxmin(axis="columns") # doctest: +SKIP Pork consumption Wheat Products co2_emissions Beef consumption dtype: object
-
iloc
¶ Purely integer-location based indexing for selection by position.
Only indexing the column positions is supported. Trying to select row positions will raise a ValueError.
See Indexing into Dask DataFrames for more.
Examples
>>> df.iloc[:, [2, 0, 1]] # doctest: +SKIP
-
index
¶ Return dask Index instance
-
isin
(values)¶ Whether each element in the DataFrame is contained in values.
This docstring was copied from pandas.core.frame.DataFrame.isin.
Some inconsistencies with the Dask version may exist.
Parameters: - values : iterable, Series, DataFrame or dict
The result will only be true at a location if all the labels match. If values is a Series, that’s the index. If values is a dict, the keys must be the column names, which must match. If values is a DataFrame, then both the index and column labels must match.
Returns: - DataFrame
DataFrame of booleans showing whether each element in the DataFrame is contained in values.
See also
DataFrame.eq
- Equality test for DataFrame.
Series.isin
- Equivalent method on Series.
Series.str.contains
- Test if pattern or regex is contained within a string of a Series or Index.
Examples
>>> df = pd.DataFrame({'num_legs': [2, 4], 'num_wings': [2, 0]}, # doctest: +SKIP ... index=['falcon', 'dog']) >>> df # doctest: +SKIP num_legs num_wings falcon 2 2 dog 4 0
When values is a list, check whether every value in the DataFrame is present in the list (which animals have 0 or 2 legs or wings):
>>> df.isin([0, 2]) # doctest: +SKIP num_legs num_wings falcon True True dog False True
When values is a dict, we can pass values to check for each column separately:
>>> df.isin({'num_wings': [0, 3]}) # doctest: +SKIP num_legs num_wings falcon False False dog False True
When values is a Series or DataFrame, the index and column must match. Note that ‘falcon’ does not match based on the number of legs in other.
>>> other = pd.DataFrame({'num_legs': [8, 2], 'num_wings': [0, 2]}, # doctest: +SKIP ... index=['spider', 'falcon']) >>> df.isin(other) # doctest: +SKIP num_legs num_wings falcon True True dog False False
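A hedged Dask sketch (illustrative frame); isin is elementwise, so it parallelizes cleanly across partitions:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame({'num_legs': [2, 4], 'num_wings': [2, 0]},
...                    index=['falcon', 'dog'])
>>> ddf = dd.from_pandas(pdf, npartitions=1)
>>> ddf.isin([0, 2]).compute()  # doctest: +SKIP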
-
isna
()¶ Detect missing values.
This docstring was copied from pandas.core.frame.DataFrame.isna.
Some inconsistencies with the Dask version may exist.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).
Returns: - DataFrame
Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.
See also
DataFrame.isnull
- Alias of isna.
DataFrame.notna
- Boolean inverse of isna.
DataFrame.dropna
- Omit axes labels with missing values.
isna
- Top-level isna.
Examples
Show which entries in a DataFrame are NA.
>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN], # doctest: +SKIP ... born=[pd.NaT, pd.Timestamp('1939-05-27'), ... pd.Timestamp('1940-04-25')], ... name=['Alfred', 'Batman', ''], ... toy=[None, 'Batmobile', 'Joker'])) >>> df # doctest: +SKIP age born name toy 0 5.0 NaT Alfred None 1 6.0 1939-05-27 Batman Batmobile 2 NaN 1940-04-25 Joker
>>> df.isna() # doctest: +SKIP age born name toy 0 False True False True 1 False False False False 2 True False False False
Show which entries in a Series are NA.
>>> ser = pd.Series([5, 6, np.NaN]) # doctest: +SKIP >>> ser # doctest: +SKIP 0 5.0 1 6.0 2 NaN dtype: float64
>>> ser.isna() # doctest: +SKIP 0 False 1 False 2 True dtype: bool
-
isnull
()¶ Detect missing values.
This docstring was copied from pandas.core.frame.DataFrame.isnull.
Some inconsistencies with the Dask version may exist.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).
Returns: - DataFrame
Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.
See also
DataFrame.isnull
- Alias of isna.
DataFrame.notna
- Boolean inverse of isna.
DataFrame.dropna
- Omit axes labels with missing values.
isna
- Top-level isna.
Examples
Show which entries in a DataFrame are NA.
>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN], # doctest: +SKIP ... born=[pd.NaT, pd.Timestamp('1939-05-27'), ... pd.Timestamp('1940-04-25')], ... name=['Alfred', 'Batman', ''], ... toy=[None, 'Batmobile', 'Joker'])) >>> df # doctest: +SKIP age born name toy 0 5.0 NaT Alfred None 1 6.0 1939-05-27 Batman Batmobile 2 NaN 1940-04-25 Joker
>>> df.isna() # doctest: +SKIP age born name toy 0 False True False True 1 False False False False 2 True False False False
Show which entries in a Series are NA.
>>> ser = pd.Series([5, 6, np.NaN]) # doctest: +SKIP >>> ser # doctest: +SKIP 0 5.0 1 6.0 2 NaN dtype: float64
>>> ser.isna() # doctest: +SKIP 0 False 1 False 2 True dtype: bool
-
items
()[source]¶ Iterate over (column name, Series) pairs.
This docstring was copied from pandas.core.frame.DataFrame.items.
Some inconsistencies with the Dask version may exist.
Iterates over the DataFrame columns, returning a tuple with the column name and the content as a Series.
Yields: - label : object
The column names for the DataFrame being iterated over.
- content : Series
The column entries belonging to each label, as a Series.
See also
DataFrame.iterrows
- Iterate over DataFrame rows as (index, Series) pairs.
DataFrame.itertuples
- Iterate over DataFrame rows as namedtuples of the values.
Examples
>>> df = pd.DataFrame({'species': ['bear', 'bear', 'marsupial'], # doctest: +SKIP ... 'population': [1864, 22000, 80000]}, ... index=['panda', 'polar', 'koala']) >>> df # doctest: +SKIP species population panda bear 1864 polar bear 22000 koala marsupial 80000 >>> for label, content in df.items(): # doctest: +SKIP ... print(f'label: {label}') ... print(f'content: {content}', sep='\n') ... label: species content: panda bear polar bear koala marsupial Name: species, dtype: object label: population content: panda 1864 polar 22000 koala 80000 Name: population, dtype: int64
-
iterrows
()[source]¶ Iterate over DataFrame rows as (index, Series) pairs.
This docstring was copied from pandas.core.frame.DataFrame.iterrows.
Some inconsistencies with the Dask version may exist.
Yields: - index : label or tuple of label
The index of the row. A tuple for a MultiIndex.
- data : Series
The data of the row as a Series.
See also
DataFrame.itertuples
- Iterate over DataFrame rows as namedtuples of the values.
DataFrame.items
- Iterate over (column name, Series) pairs.
Notes
Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). For example,
>>> df = pd.DataFrame([[1, 1.5]], columns=['int', 'float']) # doctest: +SKIP >>> row = next(df.iterrows())[1] # doctest: +SKIP >>> row # doctest: +SKIP int 1.0 float 1.5 Name: 0, dtype: float64 >>> print(row['int'].dtype) # doctest: +SKIP float64 >>> print(df['int'].dtype) # doctest: +SKIP int64
To preserve dtypes while iterating over the rows, it is better to use itertuples(), which returns namedtuples of the values and which is generally faster than iterrows.
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
-
itertuples
(index=True, name='Pandas')[source]¶ Iterate over DataFrame rows as namedtuples.
This docstring was copied from pandas.core.frame.DataFrame.itertuples.
Some inconsistencies with the Dask version may exist.
Parameters: - index : bool, default True
If True, return the index as the first element of the tuple.
- name : str or None, default “Pandas”
The name of the returned namedtuples or None to return regular tuples.
Returns: - iterator
An object to iterate over namedtuples for each row in the DataFrame with the first field possibly being the index and following fields being the column values.
See also
DataFrame.iterrows
- Iterate over DataFrame rows as (index, Series) pairs.
DataFrame.items
- Iterate over (column name, Series) pairs.
Notes
The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore. On python versions < 3.7 regular tuples are returned for DataFrames with a large number of columns (>254).
Examples
>>> df = pd.DataFrame({'num_legs': [4, 2], 'num_wings': [0, 2]}, # doctest: +SKIP ... index=['dog', 'hawk']) >>> df # doctest: +SKIP num_legs num_wings dog 4 0 hawk 2 2 >>> for row in df.itertuples(): # doctest: +SKIP ... print(row) ... Pandas(Index='dog', num_legs=4, num_wings=0) Pandas(Index='hawk', num_legs=2, num_wings=2)
By setting the index parameter to False we can remove the index as the first element of the tuple:
>>> for row in df.itertuples(index=False): # doctest: +SKIP ... print(row) ... Pandas(num_legs=4, num_wings=0) Pandas(num_legs=2, num_wings=2)
With the name parameter set we set a custom name for the yielded namedtuples:
>>> for row in df.itertuples(name='Animal'): # doctest: +SKIP ... print(row) ... Animal(Index='dog', num_legs=4, num_wings=0) Animal(Index='hawk', num_legs=2, num_wings=2)
-
join
(other, on=None, how='left', lsuffix='', rsuffix='', npartitions=None, shuffle=None)[source]¶ Join columns of another DataFrame.
This docstring was copied from pandas.core.frame.DataFrame.join.
Some inconsistencies with the Dask version may exist.
Join columns with other DataFrame either on index or on a key column. Efficiently join multiple DataFrame objects by index at once by passing a list.
Parameters: - other : DataFrame, Series, or list of DataFrame
Index should be similar to one of the columns in this one. If a Series is passed, its name attribute must be set, and that will be used as the column name in the resulting joined DataFrame.
- on : str, list of str, or array-like, optional
Column or index level name(s) in the caller to join on the index in other, otherwise joins index-on-index. If multiple values given, the other DataFrame must have a MultiIndex. Can pass an array as the join key if it is not already contained in the calling DataFrame. Like an Excel VLOOKUP operation.
- how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘left’
How to handle the operation of the two objects.
- left: use calling frame’s index (or column if on is specified)
- right: use other’s index.
- outer: form union of calling frame’s index (or column if on is specified) with other’s index, and sort it lexicographically.
- inner: form intersection of calling frame’s index (or column if on is specified) with other’s index, preserving the order of the calling frame’s index.
- lsuffix : str, default ‘’
Suffix to use from left frame’s overlapping columns.
- rsuffix : str, default ‘’
Suffix to use from right frame’s overlapping columns.
- sort : bool, default False (Not supported in Dask)
Order result DataFrame lexicographically by the join key. If False, the order of the join key depends on the join type (how keyword).
Returns: - DataFrame
A dataframe containing columns from both the caller and other.
See also
DataFrame.merge
- For column(s)-on-column(s) operations.
Notes
Parameters on, lsuffix, and rsuffix are not supported when passing a list of DataFrame objects.
Support for specifying index levels as the on parameter was added in version 0.23.0.
Examples
>>> df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'], # doctest: +SKIP ... 'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
>>> df # doctest: +SKIP key A 0 K0 A0 1 K1 A1 2 K2 A2 3 K3 A3 4 K4 A4 5 K5 A5
>>> other = pd.DataFrame({'key': ['K0', 'K1', 'K2'], # doctest: +SKIP ... 'B': ['B0', 'B1', 'B2']})
>>> other # doctest: +SKIP key B 0 K0 B0 1 K1 B1 2 K2 B2
Join DataFrames using their indexes.
>>> df.join(other, lsuffix='_caller', rsuffix='_other') # doctest: +SKIP key_caller A key_other B 0 K0 A0 K0 B0 1 K1 A1 K1 B1 2 K2 A2 K2 B2 3 K3 A3 NaN NaN 4 K4 A4 NaN NaN 5 K5 A5 NaN NaN
If we want to join using the key columns, we need to set key to be the index in both df and other. The joined DataFrame will have key as its index.
>>> df.set_index('key').join(other.set_index('key')) # doctest: +SKIP A B key K0 A0 B0 K1 A1 B1 K2 A2 B2 K3 A3 NaN K4 A4 NaN K5 A5 NaN
Another option to join using the key columns is to use the on parameter. DataFrame.join always uses other’s index but we can use any column in df. This method preserves the original DataFrame’s index in the result.
>>> df.join(other.set_index('key'), on='key') # doctest: +SKIP key A B 0 K0 A0 B0 1 K1 A1 B1 2 K2 A2 B2 3 K3 A3 NaN 4 K4 A4 NaN 5 K5 A5 NaN
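A minimal Dask sketch (illustrative frames): joining on the index is cheapest when both collections have known divisions, as from_pandas provides here:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> left = dd.from_pandas(pd.DataFrame({'A': ['A0', 'A1', 'A2']},
...                                    index=['K0', 'K1', 'K2']), npartitions=1)
>>> right = dd.from_pandas(pd.DataFrame({'B': ['B0', 'B1']},
...                                     index=['K0', 'K1']), npartitions=1)
>>> left.join(right, how='left').compute()  # doctest: +SKIP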
-
known_divisions
¶ Whether divisions are already known
-
kurtosis
(axis=None, fisher=True, bias=True, nan_policy='propagate', out=None)¶ Return unbiased kurtosis over requested axis.
This docstring was copied from pandas.core.frame.DataFrame.kurtosis.
Some inconsistencies with the Dask version may exist.
Note
This implementation follows the dask.array.stats implementation of kurtosis and calculates kurtosis without taking into account a bias term for finite sample size, which corresponds to the default settings of the scipy.stats kurtosis calculation. This differs from pandas.
Further, this method currently does not support filtering out NaN values, which is again a difference from pandas.
Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.
Parameters: - axis : {index (0), columns (1)}
Axis for the function to be applied on.
- skipna : bool, default True (Not supported in Dask)
Exclude NA/null values when computing the result.
- level : int or level name, default None (Not supported in Dask)
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
- numeric_only : bool, default None (Not supported in Dask)
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
- **kwargs
Additional keyword arguments to be passed to the function.
Returns: - Series or DataFrame (if level specified)
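Because the Dask calculation follows the scipy.stats defaults rather than the pandas bias correction, results can differ from pandas; a hedged sketch for comparing the two (the data below is illustrative):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame({'x': [1., 2., 3., 4., 100.]})
>>> ddf = dd.from_pandas(pdf, npartitions=2)
>>> pdf.kurtosis()            # doctest: +SKIP
>>> ddf.kurtosis().compute()  # doctest: +SKIP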
-
last
(offset)¶ Select final periods of time series data based on a date offset.
This docstring was copied from pandas.core.frame.DataFrame.last.
Some inconsistencies with the Dask version may exist.
For a DataFrame with dates as its index, this function selects the last few rows based on a date offset.
Parameters: - offset : str, DateOffset, dateutil.relativedelta
The offset length of the data that will be selected. For instance, ‘3D’ will display all the rows having their index within the last 3 days.
Returns: - Series or DataFrame
A subset of the caller.
Raises: - TypeError
If the index is not a DatetimeIndex.
See also
first
- Select initial periods of time series based on a date offset.
at_time
- Select values at a particular time of the day.
between_time
- Select values between particular times of the day.
Examples
>>> i = pd.date_range('2018-04-09', periods=4, freq='2D') # doctest: +SKIP >>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i) # doctest: +SKIP >>> ts # doctest: +SKIP A 2018-04-09 1 2018-04-11 2 2018-04-13 3 2018-04-15 4
Get the rows for the last 3 days:
>>> ts.last('3D') # doctest: +SKIP A 2018-04-13 3 2018-04-15 4
Notice that data for the last 3 calendar days was returned, not the last 3 rows observed in the dataset, so data for 2018-04-11 was not included.
-
le
(other, axis='columns', level=None)¶ Get Less than or equal to of dataframe and other, element-wise (binary operator le).
This docstring was copied from pandas.core.frame.DataFrame.le.
Some inconsistencies with the Dask version may exist.
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
Parameters: - other : scalar, sequence, Series, or DataFrame
Any single or multiple element data structure, or list-like object.
- axis : {0 or ‘index’, 1 or ‘columns’}, default ‘columns’
Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
- level : int or label
Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: - DataFrame of bool
Result of the comparison.
See also
DataFrame.eq
- Compare DataFrames for equality elementwise.
DataFrame.ne
- Compare DataFrames for inequality elementwise.
DataFrame.le
- Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt
- Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge
- Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt
- Compare DataFrames for strictly greater than inequality elementwise.
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
>>> df = pd.DataFrame({'cost': [250, 150, 100], # doctest: +SKIP ... 'revenue': [100, 250, 300]}, ... index=['A', 'B', 'C']) >>> df # doctest: +SKIP cost revenue A 250 100 B 150 250 C 100 300
Comparison with a scalar, using either the operator or method:
>>> df == 100 # doctest: +SKIP cost revenue A False True B False False C True False
>>> df.eq(100) # doctest: +SKIP cost revenue A False True B False False C True False
When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:
>>> df != pd.Series([100, 250], index=["cost", "revenue"]) # doctest: +SKIP cost revenue A True True B True False C False True
Use the method to control the broadcast axis:
>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index') # doctest: +SKIP cost revenue A True False B True True C True True D True True
When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:
>>> df == [250, 100] # doctest: +SKIP cost revenue A True True B False False C False False
Use the method to control the axis:
>>> df.eq([250, 250, 100], axis='index') # doctest: +SKIP cost revenue A True False B False True C True False
Compare to a DataFrame of different shape.
>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]}, # doctest: +SKIP ... index=['A', 'B', 'C', 'D']) >>> other # doctest: +SKIP revenue A 300 B 250 C 100 D 150
>>> df.gt(other) # doctest: +SKIP cost revenue A False False B False False C False True D False False
Compare to a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220], # doctest: +SKIP ... 'revenue': [100, 250, 300, 200, 175, 225]}, ... index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'], ... ['A', 'B', 'C', 'A', 'B', 'C']]) >>> df_multindex # doctest: +SKIP cost revenue Q1 A 250 100 B 150 250 C 100 300 Q2 A 150 200 B 300 175 C 220 225
>>> df.le(df_multindex, level=1) # doctest: +SKIP cost revenue Q1 A True True B True True C True True Q2 A False True B True False C True False
-
loc
¶ Purely label-location based indexer for selection by label.
>>> df.loc["b"] # doctest: +SKIP >>> df.loc["b":"d"] # doctest: +SKIP
-
lt
(other, axis='columns', level=None)¶ Get Less than of dataframe and other, element-wise (binary operator lt).
This docstring was copied from pandas.core.frame.DataFrame.lt.
Some inconsistencies with the Dask version may exist.
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
Parameters: - other : scalar, sequence, Series, or DataFrame
Any single or multiple element data structure, or list-like object.
- axis : {0 or ‘index’, 1 or ‘columns’}, default ‘columns’
Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
- level : int or label
Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: - DataFrame of bool
Result of the comparison.
See also
DataFrame.eq
- Compare DataFrames for equality elementwise.
DataFrame.ne
- Compare DataFrames for inequality elementwise.
DataFrame.le
- Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt
- Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge
- Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt
- Compare DataFrames for strictly greater than inequality elementwise.
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
>>> df = pd.DataFrame({'cost': [250, 150, 100], # doctest: +SKIP ... 'revenue': [100, 250, 300]}, ... index=['A', 'B', 'C']) >>> df # doctest: +SKIP cost revenue A 250 100 B 150 250 C 100 300
Comparison with a scalar, using either the operator or method:
>>> df == 100 # doctest: +SKIP cost revenue A False True B False False C True False
>>> df.eq(100) # doctest: +SKIP cost revenue A False True B False False C True False
When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:
>>> df != pd.Series([100, 250], index=["cost", "revenue"]) # doctest: +SKIP cost revenue A True True B True False C False True
Use the method to control the broadcast axis:
>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index') # doctest: +SKIP cost revenue A True False B True True C True True D True True
When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:
>>> df == [250, 100] # doctest: +SKIP cost revenue A True True B False False C False False
Use the method to control the axis:
>>> df.eq([250, 250, 100], axis='index') # doctest: +SKIP cost revenue A True False B False True C True False
Compare to a DataFrame of different shape.
>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]}, # doctest: +SKIP ... index=['A', 'B', 'C', 'D']) >>> other # doctest: +SKIP revenue A 300 B 250 C 100 D 150
>>> df.gt(other) # doctest: +SKIP cost revenue A False False B False False C False True D False False
Compare to a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220], # doctest: +SKIP ... 'revenue': [100, 250, 300, 200, 175, 225]}, ... index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'], ... ['A', 'B', 'C', 'A', 'B', 'C']]) >>> df_multindex # doctest: +SKIP cost revenue Q1 A 250 100 B 150 250 C 100 300 Q2 A 150 200 B 300 175 C 220 225
>>> df.le(df_multindex, level=1) # doctest: +SKIP cost revenue Q1 A True True B True True C True True Q2 A False True B True False C True False
-
map_overlap
(func, before, after, *args, **kwargs)¶ Apply a function to each partition, sharing rows with adjacent partitions.
This can be useful for implementing windowing functions such as df.rolling(...).mean() or df.diff().
Parameters: - func : function
Function applied to each partition.
- before : int
The number of rows to prepend to partition i from the end of partition i - 1.
- after : int
The number of rows to append to partition i from the beginning of partition i + 1.
- args, kwargs :
Arguments and keywords to pass to the function. The partition will be the first argument, and these will be passed after.
- meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
Notes
Given positive integers before and after, and a function func, map_overlap does the following:
- Prepend before rows to each partition i from the end of partition i - 1. The first partition has no rows prepended.
- Append after rows to each partition i from the beginning of partition i + 1. The last partition has no rows appended.
- Apply func to each partition, passing in any extra args and kwargs if provided.
- Trim before rows from the beginning of all but the first partition.
- Trim after rows from the end of all but the last partition.
Note that the index and divisions are assumed to remain unchanged.
Examples
Given a DataFrame, Series, or Index, such as:
>>> import pandas as pd >>> import dask.dataframe as dd >>> df = pd.DataFrame({'x': [1, 2, 4, 7, 11], ... 'y': [1., 2., 3., 4., 5.]}) >>> ddf = dd.from_pandas(df, npartitions=2)
A rolling sum with a trailing moving window of size 2 can be computed by overlapping 2 rows before each partition, and then mapping calls to df.rolling(2).sum():
>>> ddf.compute() x y 0 1 1.0 1 2 2.0 2 4 3.0 3 7 4.0 4 11 5.0 >>> ddf.map_overlap(lambda df: df.rolling(2).sum(), 2, 0).compute() x y 0 NaN NaN 1 3.0 3.0 2 6.0 5.0 3 11.0 7.0 4 18.0 9.0
The pandas diff method computes a discrete difference shifted by a number of periods (which can be positive or negative). This can be implemented by mapping calls to df.diff to each partition after prepending/appending that many rows, depending on sign:
>>> def diff(df, periods=1): ... before, after = (periods, 0) if periods > 0 else (0, -periods) ... return df.map_overlap(lambda df, periods=1: df.diff(periods), ... before, after, periods=periods) >>> diff(ddf, 1).compute() x y 0 NaN NaN 1 1.0 1.0 2 2.0 1.0 3 3.0 1.0 4 4.0 1.0
If you have a DatetimeIndex, you can use a pd.Timedelta for time-based windows.
>>> ts = pd.Series(range(10), index=pd.date_range('2017', periods=10)) >>> dts = dd.from_pandas(ts, npartitions=2) >>> dts.map_overlap(lambda df: df.rolling('2D').sum(), ... pd.Timedelta('2D'), 0).compute() 2017-01-01 0.0 2017-01-02 1.0 2017-01-03 3.0 2017-01-04 5.0 2017-01-05 7.0 2017-01-06 9.0 2017-01-07 11.0 2017-01-08 13.0 2017-01-09 15.0 2017-01-10 17.0 Freq: D, dtype: float64
-
map_partitions
(func, *args, **kwargs)¶ Apply Python function on each DataFrame partition.
Note that the index and divisions are assumed to remain unchanged.
Parameters: - func : function
Function applied to each partition.
- args, kwargs :
Arguments and keywords to pass to the function. The partition will be the first argument, and these will be passed after. Arguments and keywords may contain Scalar, Delayed, partition_info or regular python objects. DataFrame-like args (both dask and pandas) will be repartitioned to align (if necessary) before applying the function.
- meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
Examples
Given a DataFrame, Series, or Index, such as:
>>> import pandas as pd >>> import dask.dataframe as dd >>> df = pd.DataFrame({'x': [1, 2, 3, 4, 5], ... 'y': [1., 2., 3., 4., 5.]}) >>> ddf = dd.from_pandas(df, npartitions=2)
One can use map_partitions to apply a function on each partition. Extra arguments and keywords can optionally be provided, and will be passed to the function after the partition.
Here we apply a function with arguments and keywords to a DataFrame, resulting in a Series:
>>> def myadd(df, a, b=1): ... return df.x + df.y + a + b >>> res = ddf.map_partitions(myadd, 1, b=2) >>> res.dtype dtype('float64')
By default, dask tries to infer the output metadata by running your provided function on some fake data. This works well in many cases, but can sometimes be expensive, or even fail. To avoid this, you can manually specify the output metadata with the meta keyword. This can be specified in many forms; for more information see dask.dataframe.utils.make_meta.
Here we specify that the output is a Series with no name and dtype float64:
>>> res = ddf.map_partitions(myadd, 1, b=2, meta=(None, 'f8'))
Here we map a function that takes in a DataFrame, and returns a DataFrame with a new column:
>>> res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y)) >>> res.dtypes x int64 y float64 z float64 dtype: object
As before, the output metadata can also be specified manually. This time we pass in a dict, as the output is a DataFrame:
>>> res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y), ... meta={'x': 'i8', 'y': 'f8', 'z': 'f8'})
In the case where the metadata doesn’t change, you can also pass in the object itself directly:
>>> res = ddf.map_partitions(lambda df: df.head(), meta=ddf)
Also note that the index and divisions are assumed to remain unchanged. If the function you’re mapping changes the index/divisions, you’ll need to clear them afterwards:
>>> ddf.map_partitions(func).clear_divisions() # doctest: +SKIP
Your map function gets information about where it is in the dataframe by accepting a special partition_info keyword argument.
>>> def func(partition, partition_info=None): ... pass
This will receive the following information:
>>> partition_info # doctest: +SKIP {'number': 1, 'division': 3}
For each argument and keyword argument that is a dask dataframe, you will receive the number (n), which represents the nth partition of the dataframe, and the division (the first index value in the partition). If divisions are not known (for instance, if the index is not sorted), you will get None as the division.
-
mask
(cond, other=nan)¶ Replace values where the condition is True.
This docstring was copied from pandas.core.frame.DataFrame.mask.
Some inconsistencies with the Dask version may exist.
Parameters: - cond : bool Series/DataFrame, array-like, or callable
Where cond is False, keep the original value. Where True, replace with corresponding value from other. If cond is callable, it is computed on the Series/DataFrame and should return boolean Series/DataFrame or array. The callable must not change input Series/DataFrame (though pandas doesn’t check it).
- other : scalar, Series/DataFrame, or callable
Entries where cond is True are replaced with corresponding value from other. If other is callable, it is computed on the Series/DataFrame and should return scalar or Series/DataFrame. The callable must not change input Series/DataFrame (though pandas doesn’t check it).
- inplace : bool, default False (Not supported in Dask)
Whether to perform the operation in place on the data.
- axis : int, default None (Not supported in Dask)
Alignment axis if needed.
- level : int, default None (Not supported in Dask)
Alignment level if needed.
- errors : str, {‘raise’, ‘ignore’}, default ‘raise’ (Not supported in Dask)
Note that currently this parameter won’t affect the results and will always coerce to a suitable dtype.
- ‘raise’ : allow exceptions to be raised.
- ‘ignore’ : suppress exceptions. On error return original object.
- try_cast : bool, default False (Not supported in Dask)
Try to cast the result back to the input type (if possible).
Returns: - Same type as caller or None if inplace=True.
See also
DataFrame.where()
- Return an object of same shape as self.
Notes
The mask method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is False the element is used; otherwise the corresponding element from the DataFrame other is used.
The signature for DataFrame.where() differs from numpy.where(). Roughly, df1.where(m, df2) is equivalent to np.where(m, df1, df2).
For further details and examples see the mask documentation in indexing.
Examples
>>> s = pd.Series(range(5)) # doctest: +SKIP >>> s.where(s > 0) # doctest: +SKIP 0 NaN 1 1.0 2 2.0 3 3.0 4 4.0 dtype: float64 >>> s.mask(s > 0) # doctest: +SKIP 0 0.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
>>> s.where(s > 1, 10) # doctest: +SKIP 0 10 1 10 2 2 3 3 4 4 dtype: int64 >>> s.mask(s > 1, 10) # doctest: +SKIP 0 0 1 1 2 10 3 10 4 10 dtype: int64
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B']) # doctest: +SKIP >>> df # doctest: +SKIP A B 0 0 1 1 2 3 2 4 5 3 6 7 4 8 9 >>> m = df % 3 == 0 # doctest: +SKIP >>> df.where(m, -df) # doctest: +SKIP A B 0 0 -1 1 -2 3 2 -4 -5 3 6 -7 4 -8 9 >>> df.where(m, -df) == np.where(m, df, -df) # doctest: +SKIP A B 0 True True 1 True True 2 True True 3 True True 4 True True >>> df.where(m, -df) == df.mask(~m, -df) # doctest: +SKIP A B 0 True True 1 True True 2 True True 3 True True 4 True True
-
max
(axis=None, skipna=True, split_every=False, out=None)¶ Return the maximum of the values over the requested axis.
This docstring was copied from pandas.core.frame.DataFrame.max.
Some inconsistencies with the Dask version may exist.
If you want the index of the maximum, use idxmax. This is the equivalent of the numpy.ndarray method argmax.
Parameters: - axis : {index (0), columns (1)}
Axis for the function to be applied on.
- skipna : bool, default True
Exclude NA/null values when computing the result.
- level : int or level name, default None (Not supported in Dask)
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
- numeric_only : bool, default None (Not supported in Dask)
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
- **kwargs
Additional keyword arguments to be passed to the function.
Returns: - Series or DataFrame (if level specified)
See also
Series.sum
- Return the sum.
Series.min
- Return the minimum.
Series.max
- Return the maximum.
Series.idxmin
- Return the index of the minimum.
Series.idxmax
- Return the index of the maximum.
DataFrame.sum
- Return the sum over the requested axis.
DataFrame.min
- Return the minimum over the requested axis.
DataFrame.max
- Return the maximum over the requested axis.
DataFrame.idxmin
- Return the index of the minimum over the requested axis.
DataFrame.idxmax
- Return the index of the maximum over the requested axis.
Examples
>>> idx = pd.MultiIndex.from_arrays([ # doctest: +SKIP ... ['warm', 'warm', 'cold', 'cold'], ... ['dog', 'falcon', 'fish', 'spider']], ... names=['blooded', 'animal']) >>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx) # doctest: +SKIP >>> s # doctest: +SKIP blooded animal warm dog 4 falcon 2 cold fish 0 spider 8 Name: legs, dtype: int64
>>> s.max() # doctest: +SKIP 8
Max using level names, as well as indices.
>>> s.max(level='blooded') # doctest: +SKIP blooded warm 4 cold 8 Name: legs, dtype: int64
>>> s.max(level=0) # doctest: +SKIP blooded warm 4 cold 8 Name: legs, dtype: int64
-
mean
(axis=None, skipna=True, split_every=False, dtype=None, out=None)¶ Return the mean of the values over the requested axis.
This docstring was copied from pandas.core.frame.DataFrame.mean.
Some inconsistencies with the Dask version may exist.
Parameters: - axis : {index (0), columns (1)}
Axis for the function to be applied on.
- skipna : bool, default True
Exclude NA/null values when computing the result.
- level : int or level name, default None (Not supported in Dask)
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
- numeric_only : bool, default None (Not supported in Dask)
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
- **kwargs
Additional keyword arguments to be passed to the function.
Returns: - Series or DataFrame (if level specified)
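Since this entry has no example of its own, here is a hedged sketch (the frame and partitioning are illustrative); column means are combined across partitions, while row means stay within a partition:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2, 3, 4],
...                                    'y': [1., 2., 3., 4.]}),
...                      npartitions=2)
>>> ddf.mean().compute()        # doctest: +SKIP
>>> ddf.mean(axis=1).compute()  # doctest: +SKIP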
-
melt
(id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)[source]¶ Unpivots a DataFrame from wide format to long format, optionally leaving identifier variables set.
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.
Parameters: - frame : DataFrame
- id_vars : tuple, list, or ndarray, optional
Column(s) to use as identifier variables.
- value_vars : tuple, list, or ndarray, optional
Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
- var_name : scalar
Name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’.
- value_name : scalar, default ‘value’
Name to use for the ‘value’ column.
- col_level : int or string, optional
If columns are a MultiIndex then use this level to melt.
Returns: - DataFrame
Unpivoted DataFrame.
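Examples
A minimal sketch (the frame and column names are illustrative); with a single partition the result matches the pandas melt output:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'id': [1, 2], 'a': [3, 4], 'b': [5, 6]})
>>> ddf = dd.from_pandas(df, npartitions=1)
>>> ddf.melt(id_vars='id', value_vars=['a', 'b']).compute()
   id variable  value
0   1        a      3
1   2        a      4
2   1        b      5
3   2        b      6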
-
memory_usage(index=True, deep=False)
Return the memory usage of each column in bytes.
This docstring was copied from pandas.core.frame.DataFrame.memory_usage.
Some inconsistencies with the Dask version may exist.
The memory usage can optionally include the contribution of the index and elements of object dtype.
This value is displayed in DataFrame.info by default. This can be suppressed by setting pandas.options.display.memory_usage to False.
Parameters: - index : bool, default True
Specifies whether to include the memory usage of the DataFrame's index in returned Series. If index=True, the memory usage of the index is the first item in the output.
- deep : bool, default False
If True, introspect the data deeply by interrogating object dtypes for system-level memory consumption, and include it in the returned values.
Returns: - Series
A Series whose index is the original column names and whose values are the memory usage of each column in bytes.
See also
numpy.ndarray.nbytes
- Total bytes consumed by the elements of an ndarray.
Series.memory_usage
- Bytes consumed by a Series.
Categorical
- Memory-efficient array for string values with many repeated values.
DataFrame.info
- Concise summary of a DataFrame.
Examples
>>> dtypes = ['int64', 'float64', 'complex128', 'object', 'bool']  # doctest: +SKIP
>>> data = dict([(t, np.ones(shape=5000, dtype=int).astype(t))  # doctest: +SKIP
...              for t in dtypes])
>>> df = pd.DataFrame(data)  # doctest: +SKIP
>>> df.head()  # doctest: +SKIP
   int64  float64  complex128 object  bool
0      1      1.0    1.0+0.0j      1  True
1      1      1.0    1.0+0.0j      1  True
2      1      1.0    1.0+0.0j      1  True
3      1      1.0    1.0+0.0j      1  True
4      1      1.0    1.0+0.0j      1  True
>>> df.memory_usage()  # doctest: +SKIP
Index           128
int64         40000
float64       40000
complex128    80000
object        40000
bool           5000
dtype: int64
>>> df.memory_usage(index=False)  # doctest: +SKIP
int64         40000
float64       40000
complex128    80000
object        40000
bool           5000
dtype: int64
The memory footprint of object dtype columns is ignored by default:
>>> df.memory_usage(deep=True)  # doctest: +SKIP
Index            128
int64          40000
float64        40000
complex128     80000
object        180000
bool            5000
dtype: int64
Use a Categorical for efficient storage of an object-dtype column with many repeated values.
>>> df['object'].astype('category').memory_usage(deep=True)  # doctest: +SKIP
5244
-
memory_usage_per_partition(index=True, deep=False)
Return the memory usage of each partition.
Parameters: - index : bool, default True
Specifies whether to include the memory usage of the index in returned Series.
- deep : bool, default False
If True, introspect the data deeply by interrogating object dtypes for system-level memory consumption, and include it in the returned values.
Returns: - Series
A Series whose index is the partition number and whose values are the memory usage of each partition in bytes.
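Examples
A minimal sketch (the frame is illustrative; reported byte counts are platform-dependent, hence the skip):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(8)}), npartitions=4)
>>> ddf.memory_usage_per_partition(deep=True).compute()  # doctest: +SKIP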
-
merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, suffixes=('_x', '_y'), indicator=False, npartitions=None, shuffle=None, broadcast=None)
Merge the DataFrame with another DataFrame.
This will merge the two datasets either on their indices, on a certain column in each dataset, or on the index of one dataset and a column of the other.
Parameters: - right: dask.dataframe.DataFrame
- how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default: ‘inner’
How to handle the operation of the two objects:
- left: use calling frame’s index (or column if on is specified)
- right: use other frame’s index
- outer: form union of calling frame’s index (or column if on is specified) with other frame’s index, and sort it lexicographically
- inner: form intersection of calling frame's index (or column if on is specified) with other frame's index, preserving the order of the calling frame's index
- on : label or list
Column or index level names to join on. These must be found in both DataFrames. If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.
- left_on : label or list, or array-like
Column to join on in the left DataFrame. Unlike pandas, arrays and lists are only supported if their length is 1.
- right_on : label or list, or array-like
Column to join on in the right DataFrame. Unlike pandas, arrays and lists are only supported if their length is 1.
- left_index : boolean, default False
Use the index from the left DataFrame as the join key.
- right_index : boolean, default False
Use the index from the right DataFrame as the join key.
- suffixes : 2-length sequence (tuple, list, …)
Suffix to apply to overlapping column names in the left and right side, respectively
- indicator : boolean or string, default False
If True, adds a column to output DataFrame called “_merge” with information on the source of each row. If string, column with information on source of each row will be added to output DataFrame, and column will be named value of string. Information column is Categorical-type and takes on a value of “left_only” for observations whose merge key only appears in left DataFrame, “right_only” for observations whose merge key only appears in right DataFrame, and “both” if the observation’s merge key is found in both.
- npartitions: int or None, optional
The ideal number of output partitions. This is only utilised when performing a hash_join (merging on columns only). If None then npartitions = max(lhs.npartitions, rhs.npartitions). Default is None.
- shuffle: {'disk', 'tasks'}, optional
Either 'disk' for single-node operation or 'tasks' for distributed operation. Will be inferred by your current scheduler.
- broadcast: boolean or float, optional
Whether to use a broadcast-based join in lieu of a shuffle-based join for supported cases. By default, a simple heuristic will be used to select the underlying algorithm. If a floating-point value is specified, that number will be used as the broadcast_bias within the simple heuristic (a large number makes Dask more likely to choose the broadcast_join code path). See broadcast_join for more information.
Notes
There are three ways to join dataframes:
- Joining on indices. In this case the divisions are aligned using the function dask.dataframe.multi.align_partitions. Afterwards, each partition is merged with the pandas merge function.
- Joining one on index and one on column. In this case the divisions of the dataframe merged by index (d_i) are used to divide the column-merged dataframe (d_c) using dask.dataframe.multi.rearrange_by_divisions. In this case the merged dataframe (d_m) has the exact same divisions as (d_i). This can lead to issues if you merge multiple rows from (d_c) to one row in (d_i).
- Joining both on columns. In this case a hash join is performed using dask.dataframe.multi.hash_join.
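Examples
A minimal sketch of a column join (the frames are illustrative; row order after the underlying shuffle is not guaranteed, hence the skip):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> left = dd.from_pandas(pd.DataFrame({'k': [1, 2, 3], 'x': [10, 20, 30]}), npartitions=2)
>>> right = dd.from_pandas(pd.DataFrame({'k': [2, 3, 4], 'y': [200, 300, 400]}), npartitions=2)
>>> left.merge(right, on='k', how='inner').compute()  # doctest: +SKIP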
-
min(axis=None, skipna=True, split_every=False, out=None)
Return the minimum of the values over the requested axis.
This docstring was copied from pandas.core.frame.DataFrame.min.
Some inconsistencies with the Dask version may exist.
If you want the index of the minimum, use idxmin. This is the equivalent of the numpy.ndarray method argmin.
Parameters: - axis : {index (0), columns (1)}
Axis for the function to be applied on.
- skipna : bool, default True
Exclude NA/null values when computing the result.
- level : int or level name, default None (Not supported in Dask)
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
- numeric_only : bool, default None (Not supported in Dask)
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
- **kwargs
Additional keyword arguments to be passed to the function.
Returns: - Series or DataFrame (if level specified)
See also
Series.sum
- Return the sum.
Series.min
- Return the minimum.
Series.max
- Return the maximum.
Series.idxmin
- Return the index of the minimum.
Series.idxmax
- Return the index of the maximum.
DataFrame.sum
- Return the sum over the requested axis.
DataFrame.min
- Return the minimum over the requested axis.
DataFrame.max
- Return the maximum over the requested axis.
DataFrame.idxmin
- Return the index of the minimum over the requested axis.
DataFrame.idxmax
- Return the index of the maximum over the requested axis.
Examples
>>> idx = pd.MultiIndex.from_arrays([  # doctest: +SKIP
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)  # doctest: +SKIP
>>> s  # doctest: +SKIP
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>> s.min()  # doctest: +SKIP
0
Min using level names, as well as indices.
>>> s.min(level='blooded')  # doctest: +SKIP
blooded
warm    2
cold    0
Name: legs, dtype: int64
>>> s.min(level=0)  # doctest: +SKIP
blooded
warm    2
cold    0
Name: legs, dtype: int64
-
mod(other, axis='columns', level=None, fill_value=None)
Get Modulo of dataframe and other, element-wise (binary operator mod).
This docstring was copied from pandas.core.frame.DataFrame.mod.
Some inconsistencies with the Dask version may exist.
Equivalent to dataframe % other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rmod.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
Parameters: - other : scalar, sequence, Series, or DataFrame
Any single or multiple element data structure, or list-like object.
- axis : {0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
- level : int or label
Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value : float or None, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
Returns: - DataFrame
Result of the arithmetic operation.
See also
DataFrame.add
- Add DataFrames.
DataFrame.sub
- Subtract DataFrames.
DataFrame.mul
- Multiply DataFrames.
DataFrame.div
- Divide DataFrames (float division).
DataFrame.truediv
- Divide DataFrames (float division).
DataFrame.floordiv
- Divide DataFrames (integer division).
DataFrame.mod
- Calculate modulo (remainder after division).
DataFrame.pow
- Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],  # doctest: +SKIP
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  # doctest: +SKIP
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360
Add a scalar with the operator version, which returns the same results.
>>> df + 1  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
Divide by a constant with the reverse version.
>>> df.div(10)  # doctest: +SKIP
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)  # doctest: +SKIP
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778
Subtract a list and Series by axis with the operator version.
>>> df - [1, 2]  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  # doctest: +SKIP
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359
Multiply a DataFrame of a different shape with the operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]},  # doctest: +SKIP
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  # doctest: +SKIP
           angles
circle          0
triangle        3
rectangle       4
>>> df * other  # doctest: +SKIP
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)  # doctest: +SKIP
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  # doctest: +SKIP
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  # doctest: +SKIP
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)  # doctest: +SKIP
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
-
mode(dropna=True, split_every=False)
Get the mode(s) of each element along the selected axis.
This docstring was copied from pandas.core.frame.DataFrame.mode.
Some inconsistencies with the Dask version may exist.
The mode of a set of values is the value that appears most often. It can be multiple values.
Parameters: - axis : {0 or ‘index’, 1 or ‘columns’}, default 0 (Not supported in Dask)
The axis to iterate over while searching for the mode:
- 0 or ‘index’ : get mode of each column
- 1 or ‘columns’ : get mode of each row.
- numeric_only : bool, default False (Not supported in Dask)
If True, only apply to numeric columns.
- dropna : bool, default True
Don’t consider counts of NaN/NaT.
New in version 0.24.0.
Returns: - DataFrame
The modes of each column or row.
See also
Series.mode
- Return the highest frequency value in a Series.
Series.value_counts
- Return the counts of values in a Series.
Examples
>>> df = pd.DataFrame([('bird', 2, 2),  # doctest: +SKIP
...                    ('mammal', 4, np.nan),
...                    ('arthropod', 8, 0),
...                    ('bird', 2, np.nan)],
...                   index=('falcon', 'horse', 'spider', 'ostrich'),
...                   columns=('species', 'legs', 'wings'))
>>> df  # doctest: +SKIP
           species  legs  wings
falcon        bird     2    2.0
horse       mammal     4    NaN
spider   arthropod     8    0.0
ostrich       bird     2    NaN
By default, missing values are not considered, and the modes of wings are both 0 and 2. Because the resulting DataFrame has two rows, the second row of species and legs contains NaN.
>>> df.mode()  # doctest: +SKIP
  species  legs  wings
0    bird   2.0    0.0
1     NaN   NaN    2.0
Setting dropna=False, NaN values are considered and they can be the mode (like for wings).
>>> df.mode(dropna=False)  # doctest: +SKIP
  species  legs  wings
0    bird     2    NaN
Setting numeric_only=True, only the mode of numeric columns is computed, and columns of other types are ignored.
>>> df.mode(numeric_only=True)  # doctest: +SKIP
   legs  wings
0   2.0    0.0
1   NaN    2.0
To compute the mode over columns and not rows, use the axis parameter:
>>> df.mode(axis='columns', numeric_only=True)  # doctest: +SKIP
           0    1
falcon   2.0  NaN
horse    4.0  NaN
spider   0.0  8.0
ostrich  2.0  NaN
-
mul(other, axis='columns', level=None, fill_value=None)
Get Multiplication of dataframe and other, element-wise (binary operator mul).
This docstring was copied from pandas.core.frame.DataFrame.mul.
Some inconsistencies with the Dask version may exist.
Equivalent to dataframe * other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rmul.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
Parameters: - other : scalar, sequence, Series, or DataFrame
Any single or multiple element data structure, or list-like object.
- axis : {0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
- level : int or label
Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value : float or None, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
Returns: - DataFrame
Result of the arithmetic operation.
See also
DataFrame.add
- Add DataFrames.
DataFrame.sub
- Subtract DataFrames.
DataFrame.mul
- Multiply DataFrames.
DataFrame.div
- Divide DataFrames (float division).
DataFrame.truediv
- Divide DataFrames (float division).
DataFrame.floordiv
- Divide DataFrames (integer division).
DataFrame.mod
- Calculate modulo (remainder after division).
DataFrame.pow
- Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],  # doctest: +SKIP
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  # doctest: +SKIP
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360
Add a scalar with the operator version, which returns the same results.
>>> df + 1  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
Divide by a constant with the reverse version.
>>> df.div(10)  # doctest: +SKIP
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)  # doctest: +SKIP
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778
Subtract a list and Series by axis with the operator version.
>>> df - [1, 2]  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  # doctest: +SKIP
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359
Multiply a DataFrame of a different shape with the operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]},  # doctest: +SKIP
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  # doctest: +SKIP
           angles
circle          0
triangle        3
rectangle       4
>>> df * other  # doctest: +SKIP
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)  # doctest: +SKIP
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  # doctest: +SKIP
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  # doctest: +SKIP
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)  # doctest: +SKIP
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
-
ndim
Return dimensionality.
-
ne(other, axis='columns', level=None)
Get Not equal to of dataframe and other, element-wise (binary operator ne).
This docstring was copied from pandas.core.frame.DataFrame.ne.
Some inconsistencies with the Dask version may exist.
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
Parameters: - other : scalar, sequence, Series, or DataFrame
Any single or multiple element data structure, or list-like object.
- axis : {0 or ‘index’, 1 or ‘columns’}, default ‘columns’
Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
- level : int or label
Broadcast across a level, matching Index values on the passed MultiIndex level.
Returns: - DataFrame of bool
Result of the comparison.
See also
DataFrame.eq
- Compare DataFrames for equality elementwise.
DataFrame.ne
- Compare DataFrames for inequality elementwise.
DataFrame.le
- Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt
- Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge
- Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt
- Compare DataFrames for strictly greater than inequality elementwise.
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
>>> df = pd.DataFrame({'cost': [250, 150, 100],  # doctest: +SKIP
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df  # doctest: +SKIP
   cost  revenue
A   250      100
B   150      250
C   100      300
Comparison with a scalar, using either the operator or method:
>>> df == 100  # doctest: +SKIP
    cost  revenue
A  False     True
B  False    False
C   True    False
>>> df.eq(100)  # doctest: +SKIP
    cost  revenue
A  False     True
B  False    False
C   True    False
When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:
>>> df != pd.Series([100, 250], index=["cost", "revenue"])  # doctest: +SKIP
    cost  revenue
A   True     True
B   True    False
C  False     True
Use the method to control the broadcast axis:
>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')  # doctest: +SKIP
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True
When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:
>>> df == [250, 100]  # doctest: +SKIP
    cost  revenue
A   True     True
B  False    False
C  False    False
Use the method to control the axis:
>>> df.eq([250, 250, 100], axis='index')  # doctest: +SKIP
    cost  revenue
A   True    False
B  False     True
C   True    False
Compare to a DataFrame of different shape.
>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},  # doctest: +SKIP
...                      index=['A', 'B', 'C', 'D'])
>>> other  # doctest: +SKIP
   revenue
A      300
B      250
C      100
D      150
>>> df.gt(other)  # doctest: +SKIP
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False
Compare to a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],  # doctest: +SKIP
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex  # doctest: +SKIP
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)  # doctest: +SKIP
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
-
nlargest(n=5, columns=None, split_every=None)
Return the first n rows ordered by columns in descending order.
This docstring was copied from pandas.core.frame.DataFrame.nlargest.
Some inconsistencies with the Dask version may exist.
Return the first n rows with the largest values in columns, in descending order. The columns that are not specified are returned as well, but not used for ordering.
This method is equivalent to df.sort_values(columns, ascending=False).head(n), but more performant.
Parameters: - n : int
Number of rows to return.
- columns : label or list of labels
Column label(s) to order by.
- keep : {‘first’, ‘last’, ‘all’}, default ‘first’ (Not supported in Dask)
Where there are duplicate values:
- first : prioritize the first occurrence(s)
- last : prioritize the last occurrence(s)
- all : do not drop any duplicates, even if it means selecting more than n items.
New in version 0.24.0.
Returns: - DataFrame
The first n rows ordered by the given columns in descending order.
See also
DataFrame.nsmallest
- Return the first n rows ordered by columns in ascending order.
DataFrame.sort_values
- Sort DataFrame by the values.
DataFrame.head
- Return the first n rows without re-ordering.
Notes
This function cannot be used with all column types. For example, when specifying columns with object or category dtypes, TypeError is raised.
Examples
>>> df = pd.DataFrame({'population': [59000000, 65000000, 434000,  # doctest: +SKIP
...                                   434000, 434000, 337000, 11300,
...                                   11300, 11300],
...                    'GDP': [1937894, 2583560, 12011, 4520, 12128,
...                            17036, 182, 38, 311],
...                    'alpha-2': ["IT", "FR", "MT", "MV", "BN",
...                                "IS", "NR", "TV", "AI"]},
...                   index=["Italy", "France", "Malta",
...                          "Maldives", "Brunei", "Iceland",
...                          "Nauru", "Tuvalu", "Anguilla"])
>>> df  # doctest: +SKIP
          population      GDP alpha-2
Italy       59000000  1937894      IT
France      65000000  2583560      FR
Malta         434000    12011      MT
Maldives      434000     4520      MV
Brunei        434000    12128      BN
Iceland       337000    17036      IS
Nauru          11300      182      NR
Tuvalu         11300       38      TV
Anguilla       11300      311      AI
In the following example, we will use nlargest to select the three rows having the largest values in column "population".
>>> df.nlargest(3, 'population')  # doctest: +SKIP
        population      GDP alpha-2
France    65000000  2583560      FR
Italy     59000000  1937894      IT
Malta       434000    12011      MT
When using keep='last', ties are resolved in reverse order:
>>> df.nlargest(3, 'population', keep='last')  # doctest: +SKIP
        population      GDP alpha-2
France    65000000  2583560      FR
Italy     59000000  1937894      IT
Brunei      434000    12128      BN
When using keep='all', all duplicate items are maintained:
>>> df.nlargest(3, 'population', keep='all')  # doctest: +SKIP
          population      GDP alpha-2
France      65000000  2583560      FR
Italy       59000000  1937894      IT
Malta         434000    12011      MT
Maldives      434000     4520      MV
Brunei        434000    12128      BN
To order by the largest values in column "population" and then "GDP", we can specify multiple columns like in the next example.
>>> df.nlargest(3, ['population', 'GDP'])  # doctest: +SKIP
        population      GDP alpha-2
France    65000000  2583560      FR
Italy     59000000  1937894      IT
Brunei      434000    12128      BN
-
notnull()
Detect existing (non-missing) values.
This docstring was copied from pandas.core.frame.DataFrame.notnull.
Some inconsistencies with the Dask version may exist.
Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False values.
Returns: - DataFrame
Mask of bool values for each element in DataFrame that indicates whether an element is not an NA value.
See also
DataFrame.notnull
- Alias of notna.
DataFrame.isna
- Boolean inverse of notna.
DataFrame.dropna
- Omit axes labels with missing values.
notna
- Top-level notna.
Examples
Show which entries in a DataFrame are not NA.
>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],  # doctest: +SKIP
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df  # doctest: +SKIP
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.notna()  # doctest: +SKIP
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True
Show which entries in a Series are not NA.
>>> ser = pd.Series([5, 6, np.NaN])  # doctest: +SKIP
>>> ser  # doctest: +SKIP
0    5.0
1    6.0
2    NaN
dtype: float64
>>> ser.notna()  # doctest: +SKIP
0     True
1     True
2    False
dtype: bool
-
npartitions
Return number of partitions.
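Examples
A small illustrative sketch:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(10)}), npartitions=3)
>>> ddf.npartitions
3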
-
nsmallest(n=5, columns=None, split_every=None)
Return the first n rows ordered by columns in ascending order.
This docstring was copied from pandas.core.frame.DataFrame.nsmallest.
Some inconsistencies with the Dask version may exist.
Return the first n rows with the smallest values in columns, in ascending order. The columns that are not specified are returned as well, but not used for ordering.
This method is equivalent to df.sort_values(columns, ascending=True).head(n), but more performant.
Parameters: - n : int
Number of items to retrieve.
- columns : list or str
Column name or names to order by.
- keep : {‘first’, ‘last’, ‘all’}, default ‘first’ (Not supported in Dask)
Where there are duplicate values:
- first : take the first occurrence.
- last : take the last occurrence.
- all : do not drop any duplicates, even if it means selecting more than n items.
New in version 0.24.0.
Returns: - DataFrame
See also
DataFrame.nlargest
- Return the first n rows ordered by columns in descending order.
DataFrame.sort_values
- Sort DataFrame by the values.
DataFrame.head
- Return the first n rows without re-ordering.
Examples
>>> df = pd.DataFrame({'population': [59000000, 65000000, 434000,  # doctest: +SKIP
...                                   434000, 434000, 337000, 337000,
...                                   11300, 11300],
...                    'GDP': [1937894, 2583560, 12011, 4520, 12128,
...                            17036, 182, 38, 311],
...                    'alpha-2': ["IT", "FR", "MT", "MV", "BN",
...                                "IS", "NR", "TV", "AI"]},
...                   index=["Italy", "France", "Malta",
...                          "Maldives", "Brunei", "Iceland",
...                          "Nauru", "Tuvalu", "Anguilla"])
>>> df  # doctest: +SKIP
          population      GDP alpha-2
Italy       59000000  1937894      IT
France      65000000  2583560      FR
Malta         434000    12011      MT
Maldives      434000     4520      MV
Brunei        434000    12128      BN
Iceland       337000    17036      IS
Nauru         337000      182      NR
Tuvalu         11300       38      TV
Anguilla       11300      311      AI
In the following example, we will use nsmallest to select the three rows having the smallest values in column "population".
>>> df.nsmallest(3, 'population')  # doctest: +SKIP
          population    GDP alpha-2
Tuvalu         11300     38      TV
Anguilla       11300    311      AI
Iceland       337000  17036      IS
When using keep='last', ties are resolved in reverse order:
>>> df.nsmallest(3, 'population', keep='last')  # doctest: +SKIP
          population    GDP alpha-2
Anguilla       11300    311      AI
Tuvalu         11300     38      TV
Nauru         337000    182      NR
When using keep='all', all duplicate items are maintained:
>>> df.nsmallest(3, 'population', keep='all')  # doctest: +SKIP
          population    GDP alpha-2
Tuvalu         11300     38      TV
Anguilla       11300    311      AI
Iceland       337000  17036      IS
Nauru         337000    182      NR
To order by the smallest values in column "population" and then "GDP", we can specify multiple columns like in the next example.
>>> df.nsmallest(3, ['population', 'GDP'])  # doctest: +SKIP
          population  GDP alpha-2
Tuvalu         11300   38      TV
Anguilla       11300  311      AI
Nauru         337000  182      NR
-
nunique_approx(split_every=None)
Approximate number of unique rows.
This method uses the HyperLogLog algorithm for cardinality estimation to compute the approximate number of unique rows. The approximate error is 0.406%.
Parameters: - split_every : int, optional
Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used. Default is 8.
Returns: - a float representing the approximate number of elements
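Examples
A minimal sketch (the estimate is probabilistic and may differ slightly from the exact count, hence the skip):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1, 1, 2, 3]}), npartitions=2)
>>> ddf.nunique_approx().compute()  # doctest: +SKIP
3.0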
-
partitions
Slice dataframe by partitions.
This allows partitionwise slicing of a Dask DataFrame. You can perform normal NumPy-style slicing, but now rather than slicing elements of the array you slice along partitions so, for example, df.partitions[:5] produces a new Dask DataFrame of the first five partitions.
Returns: - A Dask DataFrame
Examples
>>> df.partitions[0]  # doctest: +SKIP
>>> df.partitions[:3]  # doctest: +SKIP
>>> df.partitions[::10]  # doctest: +SKIP
-
persist(**kwargs)
Persist this dask collection into memory.
This turns a lazy Dask collection into a Dask collection with the same metadata, but now with the results fully computed or actively computing in the background.
The action of this function differs significantly depending on the active task scheduler. If the task scheduler supports asynchronous computing, as is the case for the dask.distributed scheduler, then persist will return immediately and the return value's task graph will contain Dask Future objects. However, if the task scheduler only supports blocking computation, then the call to persist will block and the return value's task graph will contain concrete Python results.
This function is particularly useful when using distributed systems, because the results will be kept in distributed memory, rather than returned to the local process as with compute.
Parameters: - scheduler : string, optional
Which scheduler to use like “threads”, “synchronous” or “processes”. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.
- optimize_graph : bool, optional
If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.
- **kwargs
Extra keywords to forward to the scheduler function.
Returns: - New dask collections backed by in-memory data
See also
dask.base.persist
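Examples
A minimal sketch, assuming the intermediate result fits in (distributed) memory; what persist returns depends on the active scheduler, hence the skip:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(4)}), npartitions=2)
>>> ddf = ddf.persist()  # doctest: +SKIP
>>> ddf.x.sum().compute()  # doctest: +SKIP
6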
-
pipe(func, *args, **kwargs)
Apply func(self, *args, **kwargs).
This docstring was copied from pandas.core.frame.DataFrame.pipe.
Some inconsistencies with the Dask version may exist.
Parameters: - func : function
Function to apply to the Series/DataFrame. args and kwargs are passed into func. Alternatively a (callable, data_keyword) tuple where data_keyword is a string indicating the keyword of callable that expects the Series/DataFrame.
- args : iterable, optional
Positional arguments passed into func.
- kwargs : mapping, optional
A dictionary of keyword arguments passed into func.
Returns: - object : the return type of func.
See also
DataFrame.apply
- Apply a function along input axis of DataFrame.
DataFrame.applymap
- Apply a function elementwise on a whole DataFrame.
Series.map
- Apply a mapping correspondence on a Series.
Notes
Use .pipe when chaining together functions that expect Series, DataFrames or GroupBy objects. Instead of writing
>>> func(g(h(df), arg1=a), arg2=b, arg3=c)  # doctest: +SKIP
you can write
>>> (df.pipe(h)  # doctest: +SKIP
...    .pipe(g, arg1=a)
...    .pipe(func, arg2=b, arg3=c)
... )
If you have a function that takes the data as (say) the second argument, pass a tuple indicating which keyword expects the data. For example, suppose func takes its data as arg2:
>>> (df.pipe(h)  # doctest: +SKIP
...    .pipe(g, arg1=a)
...    .pipe((func, 'arg2'), arg1=a, arg3=c)
... )
-
pivot_table(index=None, columns=None, values=None, aggfunc='mean')
Create a spreadsheet-style pivot table as a DataFrame. Target columns must have category dtype to infer the result's columns. index, columns, values and aggfunc must all be scalar.
Parameters: - values : scalar
column to aggregate
- index : scalar
column to be index
- columns : scalar
column to be columns
- aggfunc : {‘mean’, ‘sum’, ‘count’}, default ‘mean’
Returns: - table : DataFrame
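Examples
A minimal sketch (names are illustrative); the columns argument must first be made categorical with known categories, e.g. via categorize:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'i': [0, 0, 1, 1], 'k': ['a', 'b', 'a', 'b'], 'v': [1.0, 2.0, 3.0, 4.0]})
>>> ddf = dd.from_pandas(df, npartitions=2).categorize(columns=['k'])
>>> ddf.pivot_table(index='i', columns='k', values='v', aggfunc='sum').compute()  # doctest: +SKIP
k    a    b
i
0  1.0  2.0
1  3.0  4.0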
-
pop(item)
Return item and drop from frame. Raise KeyError if not found.
This docstring was copied from pandas.core.frame.DataFrame.pop.
Some inconsistencies with the Dask version may exist.
Parameters: - item : label
Label of column to be popped.
Returns: - Series
Examples
>>> df = pd.DataFrame([('falcon', 'bird', 389.0),  # doctest: +SKIP
...                    ('parrot', 'bird', 24.0),
...                    ('lion', 'mammal', 80.5),
...                    ('monkey', 'mammal', np.nan)],
...                   columns=('name', 'class', 'max_speed'))
>>> df  # doctest: +SKIP
     name   class  max_speed
0  falcon    bird      389.0
1  parrot    bird       24.0
2    lion  mammal       80.5
3  monkey  mammal        NaN
>>> df.pop('class')  # doctest: +SKIP
0      bird
1      bird
2    mammal
3    mammal
Name: class, dtype: object
>>> df  # doctest: +SKIP
     name  max_speed
0  falcon      389.0
1  parrot       24.0
2    lion       80.5
3  monkey        NaN
-
pow(other, axis='columns', level=None, fill_value=None)
Get Exponential power of dataframe and other, element-wise (binary operator pow).
This docstring was copied from pandas.core.frame.DataFrame.pow.
Some inconsistencies with the Dask version may exist.
Equivalent to dataframe ** other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rpow.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
Parameters: - other : scalar, sequence, Series, or DataFrame
Any single or multiple element data structure, or list-like object.
- axis : {0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
- level : int or label
Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value : float or None, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
Returns: - DataFrame
Result of the arithmetic operation.
See also
DataFrame.add
- Add DataFrames.
DataFrame.sub
- Subtract DataFrames.
DataFrame.mul
- Multiply DataFrames.
DataFrame.div
- Divide DataFrames (float division).
DataFrame.truediv
- Divide DataFrames (float division).
DataFrame.floordiv
- Divide DataFrames (integer division).
DataFrame.mod
- Calculate modulo (remainder after division).
DataFrame.pow
- Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],  # doctest: +SKIP
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  # doctest: +SKIP
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360
Add a scalar with the operator version, which returns the same results.
>>> df + 1  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
Divide by a constant with the reverse version.
>>> df.div(10)  # doctest: +SKIP
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)  # doctest: +SKIP
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778
Subtract a list and Series by axis with the operator version.
>>> df - [1, 2]  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  # doctest: +SKIP
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359
Multiply a DataFrame of a different shape with the operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]},  # doctest: +SKIP
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  # doctest: +SKIP
           angles
circle          0
triangle        3
rectangle       4
>>> df * other  # doctest: +SKIP
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)  # doctest: +SKIP
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  # doctest: +SKIP
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  # doctest: +SKIP
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)  # doctest: +SKIP
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
-
prod(axis=None, skipna=True, split_every=False, dtype=None, out=None, min_count=None)
Return the product of the values over the requested axis.
This docstring was copied from pandas.core.frame.DataFrame.prod.
Some inconsistencies with the Dask version may exist.
Parameters: - axis : {index (0), columns (1)}
Axis for the function to be applied on.
- skipna : bool, default True
Exclude NA/null values when computing the result.
- level : int or level name, default None (Not supported in Dask)
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
- numeric_only : bool, default None (Not supported in Dask)
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
- min_count : int, default 0
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
- **kwargs
Additional keyword arguments to be passed to the function.
Returns: - Series or DataFrame (if level specified)
See also
Series.sum
- Return the sum.
Series.min
- Return the minimum.
Series.max
- Return the maximum.
Series.idxmin
- Return the index of the minimum.
Series.idxmax
- Return the index of the maximum.
DataFrame.sum
- Return the sum over the requested axis.
DataFrame.min
- Return the minimum over the requested axis.
DataFrame.max
- Return the maximum over the requested axis.
DataFrame.idxmin
- Return the index of the minimum over the requested axis.
DataFrame.idxmax
- Return the index of the maximum over the requested axis.
Examples
By default, the product of an empty or all-NA Series is 1.
>>> pd.Series([]).prod()  # doctest: +SKIP
1.0
This can be controlled with the min_count parameter:
>>> pd.Series([]).prod(min_count=1)  # doctest: +SKIP
nan
Thanks to the skipna parameter, min_count handles all-NA and empty series identically.
>>> pd.Series([np.nan]).prod()  # doctest: +SKIP
1.0
>>> pd.Series([np.nan]).prod(min_count=1)  # doctest: +SKIP
nan
-
quantile(q=0.5, axis=0, method='default')
Approximate row-wise and precise column-wise quantiles of DataFrame.
Parameters: - q : list/array of floats, default 0.5 (50%)
Iterable of numbers ranging from 0 to 1 for the desired quantiles
- axis : {0, 1, 'index', 'columns'} (default 0)
0 or 'index' for row-wise, 1 or 'columns' for column-wise
- method : {'default', 'tdigest', 'dask'}, optional
What method to use. By default Dask's internal custom algorithm ('dask') will be used. If set to 'tdigest', tdigest will be used for floats and ints, with a fallback to 'dask' otherwise.
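Examples
A minimal sketch (the frame is illustrative; the default axis-0 quantiles are approximate across partitions, hence the skip):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2, 3, 4]}), npartitions=2)
>>> ddf.quantile([0.25, 0.75]).compute()  # doctest: +SKIP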
-
query(expr, **kwargs)
Filter dataframe with complex expression.
Blocked version of pd.DataFrame.query.
This is like the sequential version except that this will also happen in many threads. This may conflict with numexpr, which will use multiple threads itself. We recommend that you set numexpr to use a single thread:
import numexpr
numexpr.set_num_threads(1)
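Examples
A minimal sketch with a toy frame (the expression is illustrative):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]}), npartitions=2)
>>> ddf.query('x > 1 and y < 6').compute()
   x  y
1  2  5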
-
radd(other, axis='columns', level=None, fill_value=None)
Get Addition of dataframe and other, element-wise (binary operator radd).
This docstring was copied from pandas.core.frame.DataFrame.radd.
Some inconsistencies with the Dask version may exist.
Equivalent to other + dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, add.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
Parameters: - other : scalar, sequence, Series, or DataFrame
Any single or multiple element data structure, or list-like object.
- axis : {0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
- level : int or label
Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value : float or None, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
Returns: - DataFrame
Result of the arithmetic operation.
See also
DataFrame.add
- Add DataFrames.
DataFrame.sub
- Subtract DataFrames.
DataFrame.mul
- Multiply DataFrames.
DataFrame.div
- Divide DataFrames (float division).
DataFrame.truediv
- Divide DataFrames (float division).
DataFrame.floordiv
- Divide DataFrames (integer division).
DataFrame.mod
- Calculate modulo (remainder after division).
DataFrame.pow
- Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],  # doctest: +SKIP
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  # doctest: +SKIP
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360
Add a scalar with the operator version, which returns the same results.
>>> df + 1  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
Divide by a constant with the reverse version.
>>> df.div(10)  # doctest: +SKIP
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)  # doctest: +SKIP
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778
Subtract a list and Series by axis with the operator version.
>>> df - [1, 2]  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  # doctest: +SKIP
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359
Multiply a DataFrame of a different shape with the operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]},  # doctest: +SKIP
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  # doctest: +SKIP
           angles
circle          0
triangle        3
rectangle       4
>>> df * other  # doctest: +SKIP
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)  # doctest: +SKIP
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  # doctest: +SKIP
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  # doctest: +SKIP
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)  # doctest: +SKIP
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
-
random_split(frac, random_state=None, shuffle=False)
Pseudorandomly split dataframe into different pieces row-wise.
Parameters: - frac : list
List of floats that should sum to one.
- random_state : int or np.random.RandomState
If int create a new RandomState with this as the seed. Otherwise draw from the passed RandomState.
- shuffle : bool, default False
If set to True, the dataframe is shuffled (within partition) before the split.
See also
dask.DataFrame.sample
Examples
50/50 split
>>> a, b = df.random_split([0.5, 0.5]) # doctest: +SKIP
80/10/10 split, consistent random_state
>>> a, b, c = df.random_split([0.8, 0.1, 0.1], random_state=123) # doctest: +SKIP
-
rdiv(other, axis='columns', level=None, fill_value=None)
Get Floating division of dataframe and other, element-wise (binary operator rtruediv).
This docstring was copied from pandas.core.frame.DataFrame.rdiv.
Some inconsistencies with the Dask version may exist.
Equivalent to other / dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, truediv.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
Parameters: - other : scalar, sequence, Series, or DataFrame
Any single or multiple element data structure, or list-like object.
- axis : {0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
- level : int or label
Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value : float or None, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
Returns: - DataFrame
Result of the arithmetic operation.
See also
DataFrame.add
- Add DataFrames.
DataFrame.sub
- Subtract DataFrames.
DataFrame.mul
- Multiply DataFrames.
DataFrame.div
- Divide DataFrames (float division).
DataFrame.truediv
- Divide DataFrames (float division).
DataFrame.floordiv
- Divide DataFrames (integer division).
DataFrame.mod
- Calculate modulo (remainder after division).
DataFrame.pow
- Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],  # doctest: +SKIP
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  # doctest: +SKIP
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360
Add a scalar with the operator version, which returns the same results.
>>> df + 1  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
Divide by a constant with the reverse version.
>>> df.div(10)  # doctest: +SKIP
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)  # doctest: +SKIP
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778
Subtract a list and Series by axis with the operator version.
>>> df - [1, 2]  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  # doctest: +SKIP
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359
Multiply a DataFrame of a different shape with the operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]},  # doctest: +SKIP
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  # doctest: +SKIP
           angles
circle          0
triangle        3
rectangle       4
>>> df * other  # doctest: +SKIP
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)  # doctest: +SKIP
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  # doctest: +SKIP
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  # doctest: +SKIP
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)  # doctest: +SKIP
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
-
reduction(chunk, aggregate=None, combine=None, meta='__no_default__', token=None, split_every=None, chunk_kwargs=None, aggregate_kwargs=None, combine_kwargs=None, **kwargs)
Generic row-wise reductions.
Parameters: - chunk : callable
Function to operate on each partition. Should return a pandas.DataFrame, pandas.Series, or a scalar.
- aggregate : callable, optional
Function to operate on the concatenated result of chunk. If not specified, defaults to chunk. Used to do the final aggregation in a tree reduction.
The input to aggregate depends on the output of chunk. If the output of chunk is a:
- scalar: Input is a Series, with one row per partition.
- Series: Input is a DataFrame, with one row per partition. Columns are the rows in the output series.
- DataFrame: Input is a DataFrame, with one row per partition. Columns are the columns in the output dataframes.
Should return a pandas.DataFrame, pandas.Series, or a scalar.
- combine : callable, optional
Function to operate on intermediate concatenated results of chunk in a tree-reduction. If not provided, defaults to aggregate. The input/output requirements should match that of aggregate described above.
- meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
- token : str, optional
The name to use for the output keys.
- split_every : int, optional
Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used, and all intermediates will be concatenated and passed to aggregate. Default is 8.
- chunk_kwargs : dict, optional
Keyword arguments to pass on to chunk only.
- aggregate_kwargs : dict, optional
Keyword arguments to pass on to aggregate only.
- combine_kwargs : dict, optional
Keyword arguments to pass on to combine only.
- kwargs :
All remaining keywords will be passed to chunk, combine, and aggregate.
Examples
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': range(50), 'y': range(50, 100)})
>>> ddf = dd.from_pandas(df, npartitions=4)
Count the number of rows in a DataFrame. To do this, count the number of rows in each partition, then sum the results:
>>> res = ddf.reduction(lambda x: x.count(),
...                     aggregate=lambda x: x.sum())
>>> res.compute()
x    50
y    50
dtype: int64
Count the number of rows in a Series with elements greater than or equal to a value (provided via a keyword).
>>> def count_greater(x, value=0):
...     return (x >= value).sum()
>>> res = ddf.x.reduction(count_greater, aggregate=lambda x: x.sum(),
...                       chunk_kwargs={'value': 25})
>>> res.compute()
25
Aggregate both the sum and count of a Series at the same time:
>>> def sum_and_count(x):
...     return pd.Series({'count': x.count(), 'sum': x.sum()},
...                      index=['count', 'sum'])
>>> res = ddf.x.reduction(sum_and_count, aggregate=lambda x: x.sum())
>>> res.compute()
count      50
sum      1225
dtype: int64
Doing the same, but for a DataFrame. Here chunk returns a DataFrame, meaning the input to aggregate is a DataFrame with an index with non-unique entries for both 'x' and 'y'. We groupby the index, and sum each group to get the final result.
>>> def sum_and_count(x):
...     return pd.DataFrame({'count': x.count(), 'sum': x.sum()},
...                         columns=['count', 'sum'])
>>> res = ddf.reduction(sum_and_count,
...                     aggregate=lambda x: x.groupby(level=0).sum())
>>> res.compute()
   count   sum
x     50  1225
y     50  3725
-
rename(index=None, columns=None)
Alter axes labels.
This docstring was copied from pandas.core.frame.DataFrame.rename.
Some inconsistencies with the Dask version may exist.
Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is. Extra labels listed don’t throw an error.
See the user guide for more.
Parameters: - mapper : dict-like or function (Not supported in Dask)
Dict-like or function transformations to apply to that axis' values. Use either mapper and axis to specify the axis to target with mapper, or index and columns.
- index : dict-like or function (Not supported in Dask)
Alternative to specifying axis (mapper, axis=0 is equivalent to index=mapper).
- columns : dict-like or function
Alternative to specifying axis (mapper, axis=1 is equivalent to columns=mapper).
- axis : {0 or 'index', 1 or 'columns'}, default 0 (Not supported in Dask)
Axis to target with mapper. Can be either the axis name ('index', 'columns') or number (0, 1). The default is 'index'.
- copy : bool, default True (Not supported in Dask)
Also copy underlying data.
- inplace : bool, default False (Not supported in Dask)
Whether to return a new DataFrame. If True then value of copy is ignored.
- level : int or level name, default None (Not supported in Dask)
In case of a MultiIndex, only rename labels in the specified level.
- errors : {‘ignore’, ‘raise’}, default ‘ignore’ (Not supported in Dask)
If ‘raise’, raise a KeyError when a dict-like mapper, index, or columns contains labels that are not present in the Index being transformed. If ‘ignore’, existing keys will be renamed and extra keys will be ignored.
Returns: - DataFrame or None
DataFrame with the renamed axis labels or None if inplace=True.
Raises: - KeyError
If any of the labels is not found in the selected axis and “errors=’raise’”.
See also
DataFrame.rename_axis
- Set the name of the axis.
Examples
DataFrame.rename supports two calling conventions:
- (index=index_mapper, columns=columns_mapper, ...)
- (mapper, axis={'index', 'columns'}, ...)
We highly recommend using keyword arguments to clarify your intent.
Rename columns using a mapping:
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}) # doctest: +SKIP >>> df.rename(columns={"A": "a", "B": "c"}) # doctest: +SKIP a c 0 1 4 1 2 5 2 3 6
Rename index using a mapping:
>>> df.rename(index={0: "x", 1: "y", 2: "z"}) # doctest: +SKIP A B x 1 4 y 2 5 z 3 6
Cast index labels to a different type:
>>> df.index # doctest: +SKIP RangeIndex(start=0, stop=3, step=1) >>> df.rename(index=str).index # doctest: +SKIP Index(['0', '1', '2'], dtype='object')
>>> df.rename(columns={"A": "a", "B": "b", "C": "c"}, errors="raise") # doctest: +SKIP Traceback (most recent call last): KeyError: ['C'] not found in axis
Using axis-style parameters:
>>> df.rename(str.lower, axis='columns')  # doctest: +SKIP
   a  b
0  1  4
1  2  5
2  3  6
>>> df.rename({1: 2, 2: 4}, axis='index')  # doctest: +SKIP
   A  B
0  1  4
2  2  5
4  3  6
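A minimal Dask-level sketch (not part of the original docstring): since mapper, index, and axis are flagged above as not supported in Dask, renaming via the columns keyword is the safe path, and the operation stays lazy:

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
ddf = dd.from_pandas(df, npartitions=2)

renamed = ddf.rename(columns={"A": "a", "B": "b"})
print(renamed.columns)    # Index(['a', 'b'], dtype='object'), visible before compute
print(renamed.compute())  # materializes the renamed frame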
repartition(divisions=None, npartitions=None, partition_size=None, freq=None, force=False)¶
Repartition dataframe along new divisions
Parameters: - divisions : list, optional
The division boundaries (index values) to be used. Only used if neither npartitions nor partition_size is specified. For convenience, if given an integer this will defer to npartitions, and if given a string it will defer to partition_size (see below).
- npartitions : int, optional
Number of partitions of output. Only used if partition_size isn’t specified.
- partition_size : int or string, optional
Max number of bytes of memory for each partition. Use numbers or strings like '5MB'. If specified, npartitions and divisions will be ignored.
Warning
This keyword argument triggers computation to determine the memory size of each partition, which may be expensive.
- freq : str, pd.Timedelta
A period on which to partition timeseries data, like '7D' or '12h' or pd.Timedelta(hours=12). Assumes a datetime index.
- force : bool, default False
Allows the expansion of the existing divisions. If False then the new divisions’ lower and upper bounds must be the same as the old divisions’.
Notes
Exactly one of divisions, npartitions, partition_size, or freq should be specified. A ValueError will be raised when that is not the case.
Examples
>>> df = df.repartition(npartitions=10)  # doctest: +SKIP
>>> df = df.repartition(divisions=[0, 5, 10, 20])  # doctest: +SKIP
>>> df = df.repartition(freq='7d')  # doctest: +SKIP
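A small runnable sketch of the three modes (not part of the original docstring; sizes and counts are illustrative). Note that the partition_size variant triggers computation, as warned above:

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'x': range(1000)},
                  index=pd.date_range('2000-01-01', periods=1000, freq='H'))
ddf = dd.from_pandas(df, npartitions=2)

by_count = ddf.repartition(npartitions=10)       # fixed partition count
by_freq = ddf.repartition(freq='7D')             # weekly partitions; needs a datetime index
by_size = ddf.repartition(partition_size='1MB')  # measures partition memory; triggers computation
print(by_count.npartitions, by_freq.npartitions, by_size.npartitions)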
replace(to_replace=None, value=None, regex=False)¶
Replace values given in to_replace with value.
This docstring was copied from pandas.core.frame.DataFrame.replace.
Some inconsistencies with the Dask version may exist.
Values of the DataFrame are replaced with other values dynamically. This differs from updating with .loc or .iloc, which require you to specify a location to update with some value.
Parameters: - to_replace : str, regex, list, dict, Series, int, float, or None
How to find the values that will be replaced.
numeric, str or regex:
- numeric: numeric values equal to to_replace will be replaced with value
- str: string exactly matching to_replace will be replaced with value
- regex: regexes matching to_replace will be replaced with value
list of str, regex, or numeric:
- First, if to_replace and value are both lists, they must be the same length.
- Second, if regex=True then all of the strings in both lists will be interpreted as regexes, otherwise they will match directly. This doesn’t matter much for value since there are only a few possible substitution regexes you can use.
- str, regex and numeric rules apply as above.
dict:
- Dicts can be used to specify different replacement values for different existing values. For example, {'a': 'b', 'y': 'z'} replaces the value ‘a’ with ‘b’ and ‘y’ with ‘z’. To use a dict in this way the value parameter should be None.
- For a DataFrame a dict can specify that different values should be replaced in different columns. For example, {'a': 1, 'b': 'z'} looks for the value 1 in column ‘a’ and the value ‘z’ in column ‘b’ and replaces these values with whatever is specified in value. The value parameter should not be None in this case. You can treat this as a special case of passing two lists except that you are specifying the column to search in.
- For a DataFrame nested dictionaries, e.g., {'a': {'b': np.nan}}, are read as follows: look in column ‘a’ for the value ‘b’ and replace it with NaN. The value parameter should be None to use a nested dict in this way. You can nest regular expressions as well. Note that column names (the top-level dictionary keys in a nested dictionary) cannot be regular expressions.
None:
- This means that the regex argument must be a string, compiled regular expression, or list, dict, ndarray or Series of such elements. If value is also None then this must be a nested dictionary or Series.
See the examples section for examples of each of these.
- value : scalar, dict, list, str, regex, default None
Value to replace any values matching to_replace with. For a DataFrame a dict of values can be used to specify which value to use for each column (columns not in the dict will not be filled). Regular expressions, strings and lists or dicts of such objects are also allowed.
- inplace : bool, default False (Not supported in Dask)
If True, in place. Note: this will modify any other views on this object (e.g. a column from a DataFrame). Returns the caller if this is True.
- limit : int or None, default None (Not supported in Dask)
Maximum size gap to forward or backward fill.
- regex : bool or same types as to_replace, default False
Whether to interpret to_replace and/or value as regular expressions. If this is True then to_replace must be a string. Alternatively, this could be a regular expression or a list, dict, or array of regular expressions in which case to_replace must be None.
- method : {‘pad’, ‘ffill’, ‘bfill’, None} (Not supported in Dask)
The method to use for replacement, when to_replace is a scalar, list or tuple and value is None.
Returns: - DataFrame or None
Object after replacement or None if inplace=True.
Raises: - AssertionError
- If regex is not a bool and to_replace is not None.
- TypeError
- If to_replace is not a scalar, array-like, dict, or None
- If to_replace is a dict and value is not a list, dict, ndarray, or Series
- If to_replace is None and regex is not compilable into a regular expression or is a list, dict, ndarray, or Series.
- When replacing multiple bool or datetime64 objects and the arguments to to_replace do not match the type of the value being replaced
- ValueError
- If a list or an ndarray is passed to to_replace and value but they are not the same length.
See also
DataFrame.fillna
- Fill NA values.
DataFrame.where
- Replace values based on boolean condition.
Series.str.replace
- Simple string replacement.
Notes
- Regex substitution is performed under the hood with re.sub. The rules for substitution for re.sub are the same.
- Regular expressions will only substitute on strings, meaning you cannot provide, for example, a regular expression matching floating point numbers and expect the columns in your frame that have a numeric dtype to be matched. However, if those floating point numbers are strings, then you can do this.
- This method has a lot of options. You are encouraged to experiment and play with this method to gain intuition about how it works.
- When a dict is used as the to_replace value, the dict’s key(s) act as the to_replace part and the dict’s value(s) act as the value parameter.
Examples
Scalar `to_replace` and `value`
>>> s = pd.Series([0, 1, 2, 3, 4])  # doctest: +SKIP
>>> s.replace(0, 5)  # doctest: +SKIP
0    5
1    1
2    2
3    3
4    4
dtype: int64
>>> df = pd.DataFrame({'A': [0, 1, 2, 3, 4],  # doctest: +SKIP
...                    'B': [5, 6, 7, 8, 9],
...                    'C': ['a', 'b', 'c', 'd', 'e']})
>>> df.replace(0, 5)  # doctest: +SKIP
   A  B  C
0  5  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e
List-like `to_replace`
>>> df.replace([0, 1, 2, 3], 4)  # doctest: +SKIP
   A  B  C
0  4  5  a
1  4  6  b
2  4  7  c
3  4  8  d
4  4  9  e
>>> df.replace([0, 1, 2, 3], [4, 3, 2, 1])  # doctest: +SKIP
   A  B  C
0  4  5  a
1  3  6  b
2  2  7  c
3  1  8  d
4  4  9  e
>>> s.replace([1, 2], method='bfill')  # doctest: +SKIP
0    0
1    3
2    3
3    3
4    4
dtype: int64
dict-like `to_replace`
>>> df.replace({0: 10, 1: 100})  # doctest: +SKIP
     A  B  C
0   10  5  a
1  100  6  b
2    2  7  c
3    3  8  d
4    4  9  e
>>> df.replace({'A': 0, 'B': 5}, 100)  # doctest: +SKIP
     A    B  C
0  100  100  a
1    1    6  b
2    2    7  c
3    3    8  d
4    4    9  e
>>> df.replace({'A': {0: 100, 4: 400}})  # doctest: +SKIP
     A  B  C
0  100  5  a
1    1  6  b
2    2  7  c
3    3  8  d
4  400  9  e
Regular expression `to_replace`
>>> df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],  # doctest: +SKIP
...                    'B': ['abc', 'bar', 'xyz']})
>>> df.replace(to_replace=r'^ba.$', value='new', regex=True)  # doctest: +SKIP
      A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)  # doctest: +SKIP
      A    B
0   new  abc
1   foo  bar
2  bait  xyz
>>> df.replace(regex=r'^ba.$', value='new')  # doctest: +SKIP
      A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> df.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'})  # doctest: +SKIP
      A    B
0   new  abc
1   xyz  new
2  bait  xyz
>>> df.replace(regex=[r'^ba.$', 'foo'], value='new')  # doctest: +SKIP
      A    B
0   new  abc
1   new  new
2  bait  xyz
Compare the behavior of s.replace({'a': None}) and s.replace('a', None) to understand the peculiarities of the to_replace parameter:
>>> s = pd.Series([10, 'a', 'a', 'b', 'a'])  # doctest: +SKIP
When one uses a dict as the to_replace value, it is as if the value(s) in the dict are equal to the value parameter.
s.replace({'a': None}) is equivalent to s.replace(to_replace={'a': None}, value=None, method=None):
>>> s.replace({'a': None})  # doctest: +SKIP
0      10
1    None
2    None
3       b
4    None
dtype: object
When value=None and to_replace is a scalar, list or tuple, replace uses the method parameter (default ‘pad’) to do the replacement. So this is why the ‘a’ values are being replaced by 10 in rows 1 and 2 and ‘b’ in row 4 in this case. The command s.replace('a', None) is actually equivalent to s.replace(to_replace='a', value=None, method='pad'):
>>> s.replace('a', None)  # doctest: +SKIP
0    10
1    10
2    10
3     b
4     b
dtype: object
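A minimal sketch of the same patterns on a Dask DataFrame (not part of the original docstring): only to_replace, value, and regex are accepted here, per the signature above, and everything stays lazy until compute:

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'A': [0, 1, 2], 'B': ['bat', 'foo', 'bait']})
ddf = dd.from_pandas(df, npartitions=2)

print(ddf.replace(0, 5).compute())                          # scalar -> scalar
print(ddf.replace({'A': {1: 100}}).compute())               # nested dict, per column
print(ddf.replace(r'^ba.$', 'new', regex=True).compute())   # regex on strings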
resample(rule, closed=None, label=None)¶
Resample time-series data.
This docstring was copied from pandas.core.frame.DataFrame.resample.
Some inconsistencies with the Dask version may exist.
Convenience method for frequency conversion and resampling of time series. Object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or pass datetime-like values to the on or level keyword.
Parameters: - rule : DateOffset, Timedelta or str
The offset string or object representing target conversion.
- axis : {0 or ‘index’, 1 or ‘columns’}, default 0 (Not supported in Dask)
Which axis to use for up- or down-sampling. For Series this will default to 0, i.e. along the rows. Must be DatetimeIndex, TimedeltaIndex or PeriodIndex.
- closed : {‘right’, ‘left’}, default None
Which side of bin interval is closed. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.
- label : {‘right’, ‘left’}, default None
Which bin edge label to label bucket with. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.
- convention : {‘start’, ‘end’, ‘s’, ‘e’}, default ‘start’ (Not supported in Dask)
For PeriodIndex only, controls whether to use the start or end of rule.
- kind : {‘timestamp’, ‘period’}, optional, default None (Not supported in Dask)
Pass ‘timestamp’ to convert the resulting index to a DateTimeIndex or ‘period’ to convert it to a PeriodIndex. By default the input representation is retained.
- loffset : timedelta, default None (Not supported in Dask)
Adjust the resampled time labels.
Deprecated since version 1.1.0: You should add the loffset to the df.index after the resample. See below.
- base : int, default 0 (Not supported in Dask)
For frequencies that evenly subdivide 1 day, the “origin” of the aggregated intervals. For example, for ‘5min’ frequency, base could range from 0 through 4. Defaults to 0.
Deprecated since version 1.1.0: The new arguments that you should use are ‘offset’ or ‘origin’.
- on : str, optional (Not supported in Dask)
For a DataFrame, column to use instead of index for resampling. Column must be datetime-like.
- level : str or int, optional (Not supported in Dask)
For a MultiIndex, level (name or number) to use for resampling. level must be datetime-like.
- origin : {‘epoch’, ‘start’, ‘start_day’}, Timestamp or str, default ‘start_day’ (Not supported in Dask)
The timestamp on which to adjust the grouping. The timezone of origin must match the timezone of the index. If a timestamp is not used, these values are also supported:
- ‘epoch’: origin is 1970-01-01
- ‘start’: origin is the first value of the timeseries
- ‘start_day’: origin is the first day at midnight of the timeseries
New in version 1.1.0.
- offset : Timedelta or str, default is None (Not supported in Dask)
An offset timedelta added to the origin.
New in version 1.1.0.
Returns: - Resampler object
See also
groupby
- Group by mapping, function, label, or list of labels.
Series.resample
- Resample a Series.
DataFrame.resample
- Resample a DataFrame.
Notes
See the user guide for more.
To learn more about the offset strings, please see the pandas documentation on offset aliases.
Examples
Start by creating a series with 9 one minute timestamps.
>>> index = pd.date_range('1/1/2000', periods=9, freq='T')  # doctest: +SKIP
>>> series = pd.Series(range(9), index=index)  # doctest: +SKIP
>>> series  # doctest: +SKIP
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: T, dtype: int64
Downsample the series into 3 minute bins and sum the values of the timestamps falling into a bin.
>>> series.resample('3T').sum()  # doctest: +SKIP
2000-01-01 00:00:00     3
2000-01-01 00:03:00    12
2000-01-01 00:06:00    21
Freq: 3T, dtype: int64
Downsample the series into 3 minute bins as above, but label each bin using the right edge instead of the left. Please note that the value in the bucket used as the label is not included in the bucket, which it labels. For example, in the original series the bucket 2000-01-01 00:03:00 contains the value 3, but the summed value in the resampled bucket with the label 2000-01-01 00:03:00 does not include 3 (if it did, the summed value would be 6, not 3). To include this value close the right side of the bin interval as illustrated in the example below this one.
>>> series.resample('3T', label='right').sum()  # doctest: +SKIP
2000-01-01 00:03:00     3
2000-01-01 00:06:00    12
2000-01-01 00:09:00    21
Freq: 3T, dtype: int64
Downsample the series into 3 minute bins as above, but close the right side of the bin interval.
>>> series.resample('3T', label='right', closed='right').sum()  # doctest: +SKIP
2000-01-01 00:00:00     0
2000-01-01 00:03:00     6
2000-01-01 00:06:00    15
2000-01-01 00:09:00    15
Freq: 3T, dtype: int64
Upsample the series into 30 second bins.
>>> series.resample('30S').asfreq()[0:5]  # Select first 5 rows  # doctest: +SKIP
2000-01-01 00:00:00    0.0
2000-01-01 00:00:30    NaN
2000-01-01 00:01:00    1.0
2000-01-01 00:01:30    NaN
2000-01-01 00:02:00    2.0
Freq: 30S, dtype: float64
Upsample the series into 30 second bins and fill the NaN values using the pad method.
>>> series.resample('30S').pad()[0:5]  # doctest: +SKIP
2000-01-01 00:00:00    0
2000-01-01 00:00:30    0
2000-01-01 00:01:00    1
2000-01-01 00:01:30    1
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64
Upsample the series into 30 second bins and fill the NaN values using the bfill method.
>>> series.resample('30S').bfill()[0:5]  # doctest: +SKIP
2000-01-01 00:00:00    0
2000-01-01 00:00:30    1
2000-01-01 00:01:00    1
2000-01-01 00:01:30    2
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64
Pass a custom function via apply.
>>> def custom_resampler(array_like):  # doctest: +SKIP
...     return np.sum(array_like) + 5
...
>>> series.resample('3T').apply(custom_resampler)  # doctest: +SKIP
2000-01-01 00:00:00     8
2000-01-01 00:03:00    17
2000-01-01 00:06:00    26
Freq: 3T, dtype: int64
For a Series with a PeriodIndex, the keyword convention can be used to control whether to use the start or end of rule.
Resample a year by quarter using ‘start’ convention. Values are assigned to the first quarter of the period.
>>> s = pd.Series([1, 2], index=pd.period_range('2012-01-01',  # doctest: +SKIP
...                                             freq='A',
...                                             periods=2))
>>> s  # doctest: +SKIP
2012    1
2013    2
Freq: A-DEC, dtype: int64
>>> s.resample('Q', convention='start').asfreq()  # doctest: +SKIP
2012Q1    1.0
2012Q2    NaN
2012Q3    NaN
2012Q4    NaN
2013Q1    2.0
2013Q2    NaN
2013Q3    NaN
2013Q4    NaN
Freq: Q-DEC, dtype: float64
Resample quarters by month using ‘end’ convention. Values are assigned to the last month of the period.
>>> q = pd.Series([1, 2, 3, 4], index=pd.period_range('2018-01-01',  # doctest: +SKIP
...                                                   freq='Q',
...                                                   periods=4))
>>> q  # doctest: +SKIP
2018Q1    1
2018Q2    2
2018Q3    3
2018Q4    4
Freq: Q-DEC, dtype: int64
>>> q.resample('M', convention='end').asfreq()  # doctest: +SKIP
2018-03    1.0
2018-04    NaN
2018-05    NaN
2018-06    2.0
2018-07    NaN
2018-08    NaN
2018-09    3.0
2018-10    NaN
2018-11    NaN
2018-12    4.0
Freq: M, dtype: float64
For DataFrame objects, the keyword on can be used to specify the column instead of the index for resampling.
>>> d = {'price': [10, 11, 9, 13, 14, 18, 17, 19],  # doctest: +SKIP
...      'volume': [50, 60, 40, 100, 50, 100, 40, 50]}
>>> df = pd.DataFrame(d)  # doctest: +SKIP
>>> df['week_starting'] = pd.date_range('01/01/2018',  # doctest: +SKIP
...                                     periods=8,
...                                     freq='W')
>>> df  # doctest: +SKIP
   price  volume week_starting
0     10      50    2018-01-07
1     11      60    2018-01-14
2      9      40    2018-01-21
3     13     100    2018-01-28
4     14      50    2018-02-04
5     18     100    2018-02-11
6     17      40    2018-02-18
7     19      50    2018-02-25
>>> df.resample('M', on='week_starting').mean()  # doctest: +SKIP
               price  volume
week_starting
2018-01-31     10.75    62.5
2018-02-28     17.00    60.0
For a DataFrame with MultiIndex, the keyword level can be used to specify on which level the resampling needs to take place.
>>> days = pd.date_range('1/1/2000', periods=4, freq='D')  # doctest: +SKIP
>>> d2 = {'price': [10, 11, 9, 13, 14, 18, 17, 19],  # doctest: +SKIP
...       'volume': [50, 60, 40, 100, 50, 100, 40, 50]}
>>> df2 = pd.DataFrame(d2,  # doctest: +SKIP
...                    index=pd.MultiIndex.from_product([days,
...                                                      ['morning',
...                                                       'afternoon']]))
>>> df2  # doctest: +SKIP
                      price  volume
2000-01-01 morning       10      50
           afternoon     11      60
2000-01-02 morning        9      40
           afternoon     13     100
2000-01-03 morning       14      50
           afternoon     18     100
2000-01-04 morning       17      40
           afternoon     19      50
>>> df2.resample('D', level=0).sum()  # doctest: +SKIP
            price  volume
2000-01-01     21     110
2000-01-02     22     140
2000-01-03     32     150
2000-01-04     36      90
If you want to adjust the start of the bins based on a fixed timestamp:
>>> start, end = '2000-10-01 23:30:00', '2000-10-02 00:30:00'  # doctest: +SKIP
>>> rng = pd.date_range(start, end, freq='7min')  # doctest: +SKIP
>>> ts = pd.Series(np.arange(len(rng)) * 3, index=rng)  # doctest: +SKIP
>>> ts  # doctest: +SKIP
2000-10-01 23:30:00     0
2000-10-01 23:37:00     3
2000-10-01 23:44:00     6
2000-10-01 23:51:00     9
2000-10-01 23:58:00    12
2000-10-02 00:05:00    15
2000-10-02 00:12:00    18
2000-10-02 00:19:00    21
2000-10-02 00:26:00    24
Freq: 7T, dtype: int64
>>> ts.resample('17min').sum()  # doctest: +SKIP
2000-10-01 23:14:00     0
2000-10-01 23:31:00     9
2000-10-01 23:48:00    21
2000-10-02 00:05:00    54
2000-10-02 00:22:00    24
Freq: 17T, dtype: int64
>>> ts.resample('17min', origin='epoch').sum()  # doctest: +SKIP
2000-10-01 23:18:00     0
2000-10-01 23:35:00    18
2000-10-01 23:52:00    27
2000-10-02 00:09:00    39
2000-10-02 00:26:00    24
Freq: 17T, dtype: int64
>>> ts.resample('17min', origin='2000-01-01').sum()  # doctest: +SKIP
2000-10-01 23:24:00     3
2000-10-01 23:41:00    15
2000-10-01 23:58:00    45
2000-10-02 00:15:00    45
Freq: 17T, dtype: int64
If you want to adjust the start of the bins with an offset Timedelta, the two following lines are equivalent:
>>> ts.resample('17min', origin='start').sum()  # doctest: +SKIP
2000-10-01 23:30:00     9
2000-10-01 23:47:00    21
2000-10-02 00:04:00    54
2000-10-02 00:21:00    24
Freq: 17T, dtype: int64
>>> ts.resample('17min', offset='23h30min').sum()  # doctest: +SKIP
2000-10-01 23:30:00     9
2000-10-01 23:47:00    21
2000-10-02 00:04:00    54
2000-10-02 00:21:00    24
Freq: 17T, dtype: int64
To replace the use of the deprecated base argument, you can now use offset; in this example it is equivalent to base=2:
>>> ts.resample('17min', offset='2min').sum()  # doctest: +SKIP
2000-10-01 23:16:00     0
2000-10-01 23:33:00     9
2000-10-01 23:50:00    36
2000-10-02 00:07:00    39
2000-10-02 00:24:00    24
Freq: 17T, dtype: int64
To replace the use of the deprecated loffset argument:
>>> from pandas.tseries.frequencies import to_offset  # doctest: +SKIP
>>> loffset = '19min'  # doctest: +SKIP
>>> ts_out = ts.resample('17min').sum()  # doctest: +SKIP
>>> ts_out.index = ts_out.index + to_offset(loffset)  # doctest: +SKIP
>>> ts_out  # doctest: +SKIP
2000-10-01 23:33:00     0
2000-10-01 23:50:00     9
2000-10-02 00:07:00    21
2000-10-02 00:24:00    54
2000-10-02 00:41:00    24
Freq: 17T, dtype: int64
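A minimal Dask-level sketch (not part of the original docstring): only rule, closed, and label are accepted here, per the signature above, and the collection needs a datetime index with known divisions:

import pandas as pd
import dask.dataframe as dd

index = pd.date_range('2000-01-01', periods=9, freq='T')
dseries = dd.from_pandas(pd.Series(range(9), index=index), npartitions=3)

# Same semantics as the pandas examples above, evaluated lazily.
print(dseries.resample('3T').sum().compute())
print(dseries.resample('3T', label='right', closed='right').sum().compute())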
reset_index(drop=False)¶
Reset the index to the default index.
Note that unlike in pandas, the reset dask.dataframe index will not be monotonically increasing from 0. Instead, it will restart at 0 for each partition (e.g. index1 = [0, ..., 10], index2 = [0, ...]). This is due to the inability to statically know the full length of the index.
For DataFrame with multi-level index, returns a new DataFrame with labeling information in the columns under the index names, defaulting to ‘level_0’, ‘level_1’, etc. if any are None. For a standard index, the index name will be used (if set), otherwise a default ‘index’ or ‘level_0’ (if ‘index’ is already taken) will be used.
Parameters: - drop : boolean, default False
Do not try to insert index into dataframe columns.
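A short sketch of the per-partition restart described above (not part of the original docstring; partition sizes are illustrative):

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'x': range(6)}, index=list('abcdef'))
ddf = dd.from_pandas(df, npartitions=2)  # two partitions of three rows each

# Each partition's fresh index restarts at 0, so the concatenated result
# repeats: [0, 1, 2, 0, 1, 2] rather than pandas' [0, 1, 2, 3, 4, 5].
print(ddf.reset_index(drop=True).compute().index.tolist())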
rfloordiv(other, axis='columns', level=None, fill_value=None)¶
Get Integer division of dataframe and other, element-wise (binary operator rfloordiv).
This docstring was copied from pandas.core.frame.DataFrame.rfloordiv.
Some inconsistencies with the Dask version may exist.
Equivalent to other // dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, floordiv.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
Parameters: - other : scalar, sequence, Series, or DataFrame
Any single or multiple element data structure, or list-like object.
- axis : {0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
- level : int or label
Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value : float or None, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
Returns: - DataFrame
Result of the arithmetic operation.
See also
DataFrame.add
- Add DataFrames.
DataFrame.sub
- Subtract DataFrames.
DataFrame.mul
- Multiply DataFrames.
DataFrame.div
- Divide DataFrames (float division).
DataFrame.truediv
- Divide DataFrames (float division).
DataFrame.floordiv
- Divide DataFrames (integer division).
DataFrame.mod
- Calculate modulo (remainder after division).
DataFrame.pow
- Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],  # doctest: +SKIP
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  # doctest: +SKIP
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360
Add a scalar with the operator version, which returns the same results.
>>> df + 1  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
Divide by constant with reverse version.
>>> df.div(10)  # doctest: +SKIP
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)  # doctest: +SKIP
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2]  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  # doctest: +SKIP
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]},  # doctest: +SKIP
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  # doctest: +SKIP
           angles
circle          0
triangle        3
rectangle       4
>>> df * other  # doctest: +SKIP
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)  # doctest: +SKIP
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  # doctest: +SKIP
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  # doctest: +SKIP
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)  # doctest: +SKIP
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
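The shared examples above never exercise floor division itself; a minimal sketch (not part of the original docstring) of floordiv against its reverse version:

import pandas as pd

df = pd.DataFrame({'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
                  index=['circle', 'triangle', 'rectangle'])

print(df.floordiv(2))                       # df // 2, element-wise
print(df.rfloordiv(360).equals(360 // df))  # True: rfloordiv reverses the operands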
rmod(other, axis='columns', level=None, fill_value=None)¶
Get Modulo of dataframe and other, element-wise (binary operator rmod).
This docstring was copied from pandas.core.frame.DataFrame.rmod.
Some inconsistencies with the Dask version may exist.
Equivalent to other % dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, mod.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
Parameters: - other : scalar, sequence, Series, or DataFrame
Any single or multiple element data structure, or list-like object.
- axis : {0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
- level : int or label
Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value : float or None, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
Returns: - DataFrame
Result of the arithmetic operation.
See also
DataFrame.add
- Add DataFrames.
DataFrame.sub
- Subtract DataFrames.
DataFrame.mul
- Multiply DataFrames.
DataFrame.div
- Divide DataFrames (float division).
DataFrame.truediv
- Divide DataFrames (float division).
DataFrame.floordiv
- Divide DataFrames (integer division).
DataFrame.mod
- Calculate modulo (remainder after division).
DataFrame.pow
- Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],  # doctest: +SKIP
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  # doctest: +SKIP
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360
Add a scalar with the operator version, which returns the same results.
>>> df + 1  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
Divide by constant with reverse version.
>>> df.div(10)  # doctest: +SKIP
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)  # doctest: +SKIP
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2]  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  # doctest: +SKIP
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]},  # doctest: +SKIP
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  # doctest: +SKIP
           angles
circle          0
triangle        3
rectangle       4
>>> df * other  # doctest: +SKIP
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)  # doctest: +SKIP
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  # doctest: +SKIP
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  # doctest: +SKIP
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)  # doctest: +SKIP
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
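Similarly, the shared examples never show the modulo operation itself; a minimal sketch (not part of the original docstring):

import pandas as pd

df = pd.DataFrame({'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
                  index=['circle', 'triangle', 'rectangle'])

print(df.mod(7))                  # df % 7, element-wise remainder
print(df.rmod(7).equals(7 % df))  # True: rmod reverses the operands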
rmul(other, axis='columns', level=None, fill_value=None)¶
Get Multiplication of dataframe and other, element-wise (binary operator rmul).
This docstring was copied from pandas.core.frame.DataFrame.rmul.
Some inconsistencies with the Dask version may exist.
Equivalent to other * dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, mul.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
Parameters: - other : scalar, sequence, Series, or DataFrame
Any single or multiple element data structure, or list-like object.
- axis : {0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
- level : int or label
Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value : float or None, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
Returns: - DataFrame
Result of the arithmetic operation.
See also
DataFrame.add
- Add DataFrames.
DataFrame.sub
- Subtract DataFrames.
DataFrame.mul
- Multiply DataFrames.
DataFrame.div
- Divide DataFrames (float division).
DataFrame.truediv
- Divide DataFrames (float division).
DataFrame.floordiv
- Divide DataFrames (integer division).
DataFrame.mod
- Calculate modulo (remainder after division).
DataFrame.pow
- Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],  # doctest: +SKIP
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  # doctest: +SKIP
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360
Add a scalar with the operator version, which returns the same results.
>>> df + 1  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)  # doctest: +SKIP
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
Divide by constant with reverse version.
>>> df.div(10)  # doctest: +SKIP
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)  # doctest: +SKIP
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2]  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')  # doctest: +SKIP
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  # doctest: +SKIP
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]},  # doctest: +SKIP
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  # doctest: +SKIP
           angles
circle          0
triangle        3
rectangle       4
>>> df * other  # doctest: +SKIP
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)  # doctest: +SKIP
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], # doctest: +SKIP ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'],