Dask DataFrame API with Logical Query Planning¶
DataFrame¶
|
DataFrame-like Expr Collection. |
Return a Series/DataFrame with absolute numeric value of each element. |
|
|
|
|
Align two objects on their axes with the specified join method. |
|
Return whether all elements are True, potentially over an axis. |
|
Return whether any element is True, potentially over an axis. |
|
Parallel version of pandas.DataFrame.apply |
|
Assign new columns to a DataFrame. |
|
Cast a pandas object to a specified dtype |
|
Fill NA/NaN values by using the next valid observation to fill the gap. |
|
Convert columns of the DataFrame to category dtype. |
|
Compute this DataFrame. |
|
Make a copy of the dataframe |
|
Compute pairwise correlation of columns, excluding NA/null values. |
|
Count non-NA cells for each column or row. |
|
Compute pairwise covariance of columns, excluding NA/null values. |
|
Return cumulative maximum over a DataFrame or Series axis. |
|
Return cumulative minimum over a DataFrame or Series axis. |
|
Return cumulative product over a DataFrame or Series axis. |
|
Return cumulative sum over a DataFrame or Series axis. |
|
Generate descriptive statistics. |
|
First discrete difference of element. |
|
|
|
|
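Tuple of npartitions + 1 values, in ascending order, marking the lower/upper bounds of each partition's index. |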
|
|
Drop specified labels from rows or columns. |
|
Return DataFrame with duplicate rows removed. |
|
Remove missing values. |
Return data types |
|
|
|
|
Evaluate a string describing operations on DataFrame columns. |
|
Transform each element of a list-like to a row, replicating index values. |
|
Fill NA/NaN values by propagating the last valid observation to next valid. |
|
Fill NA/NaN values using the specified method. |
|
|
|
|
Get a dask DataFrame/Series representing the nth partition. |
|
|
Group DataFrame using a mapper or by a Series of columns. |
|
|
|
First n rows of the dataset |
|
Return index of first occurrence of maximum over requested axis. |
|
Return index of first occurrence of minimum over requested axis. |
Purely integer-location based indexing for selection by position. |
|
Return dask Index instance |
|
|
Concise summary of a Dask DataFrame |
|
Whether each element in the DataFrame is contained in values. |
Detect missing values. |
|
DataFrame.isnull is an alias for DataFrame.isna. |
|
Iterate over (column name, Series) pairs. |
|
Iterate over DataFrame rows as (index, Series) pairs. |
|
|
Iterate over DataFrame rows as namedtuples. |
|
Join columns of another DataFrame. |
Whether the divisions are known. |
|
|
|
Purely label-location based indexer for selection by label. |
|
|
|
|
Apply a Python function to each partition |
|
Replace values where the condition is True. |
|
Return the maximum of the values over the requested axis. |
|
Return the mean of the values over the requested axis. |
|
Return the median of the values over the requested axis. |
|
Return the approximate median of the values over the requested axis. |
|
Unpivot a DataFrame from wide to long format, optionally leaving identifiers set. |
|
Return the memory usage of each column in bytes. |
Return the memory usage of each partition |
|
|
Merge the DataFrame with another DataFrame |
|
Return the minimum of the values over the requested axis. |
|
|
|
Get the mode(s) of each element along the selected axis. |
|
|
Return dimensionality |
|
|
|
|
Return the first n rows ordered by columns in descending order. |
Return number of partitions |
|
|
Return the first n rows ordered by columns in ascending order. |
Slice dataframe by partitions |
|
|
Persist this dask collection into memory |
|
Create a spreadsheet-style pivot table as a DataFrame. |
|
Return item and drop from frame. |
|
|
|
Return the product of the values over the requested axis. |
|
Approximate row-wise and precise column-wise quantiles of DataFrame |
|
Filter dataframe with complex expression |
|
|
|
Pseudorandomly split dataframe into different pieces row-wise |
|
|
|
Rename columns or index labels. |
|
Set the name of the axis for the index or columns. |
|
Repartition a collection |
|
Replace values given in to_replace with value. |
|
Resample time-series data. |
|
Reset the index to the default index. |
|
|
|
|
|
|
|
Round a DataFrame to a variable number of decimal places. |
|
|
|
|
|
|
|
Random sample of items |
|
Return a subset of the DataFrame's columns based on the column dtypes. |
|
Return unbiased standard error of the mean over requested axis. |
|
Set the DataFrame index (row labels) using an existing column. |
|
Rearrange DataFrame into new partitions |
Size of the Series or DataFrame as a Delayed object. |
|
|
Sort the dataset by a single column. |
|
Squeeze 1 dimensional axis objects into scalars. |
|
Return sample standard deviation over requested axis. |
|
|
|
Return the sum of the values over the requested axis. |
|
Last n rows of the dataset |
|
Move to a new DataFrame backend |
|
Create a Dask Bag from a Series |
|
See dd.to_csv docstring for more information |
|
Convert a dask DataFrame to a dask array. |
|
Convert into a list of dask.delayed objects, one per partition. |
|
See dd.to_hdf docstring for more information |
|
Render a DataFrame as an HTML table. |
|
See dd.to_json docstring for more information |
|
|
|
|
|
Render a DataFrame to a console-friendly tabular output. |
|
|
|
Cast to DatetimeIndex of timestamps, at beginning of period. |
|
|
Return a dask.array of the values of this dataframe |
|
|
Return unbiased variance over requested axis. |
|
Visualize the expression or task graph |
|
Replace values where the condition is False. |
Series¶
|
Series-like Expr Collection. |
|
|
|
Align two objects on their axes with the specified join method. |
|
Return whether all elements are True, potentially over an axis. |
|
Return whether any element is True, potentially over an axis. |
|
Parallel version of pandas.Series.apply |
|
Cast a pandas object to a specified dtype |
|
Compute the lag-N autocorrelation. |
|
Return boolean Series equivalent to left <= series <= right. |
|
Fill NA/NaN values by using the next valid observation to fill the gap. |
Forget division information. |
|
|
Trim values at input threshold(s). |
|
Compute this DataFrame. |
|
Make a copy of the dataframe |
|
Compute correlation with other Series, excluding missing values. |
|
Count non-NA cells for each column or row. |
|
Compute covariance with Series, excluding missing values. |
|
Return cumulative maximum over a DataFrame or Series axis. |
|
Return cumulative minimum over a DataFrame or Series axis. |
|
Return cumulative product over a DataFrame or Series axis. |
|
Return cumulative sum over a DataFrame or Series axis. |
|
Generate descriptive statistics. |
|
First discrete difference of element. |
|
|
|
|
Return a new Series with missing values removed. |
|
|
|
Transform each element of a list-like to a row. |
|
|
Fill NA/NaN values by propagating the last valid observation to next valid. |
|
Fill NA/NaN values using the specified method. |
|
|
|
|
Get a dask DataFrame/Series representing the nth partition. |
|
|
Group Series using a mapper or by a Series of columns. |
|
|
|
First n rows of the dataset |
|
Return index of first occurrence of maximum over requested axis. |
|
Return index of first occurrence of minimum over requested axis. |
|
Whether each element in the DataFrame is contained in values. |
Detect missing values. |
|
DataFrame.isnull is an alias for DataFrame.isna. |
|
Whether the divisions are known. |
|
|
|
Purely label-location based indexer for selection by label. |
|
|
|
|
Map values of Series according to an input mapping or function. |
|
Apply a function to each partition, sharing rows with adjacent partitions. |
|
Apply a Python function to each partition |
|
Replace values where the condition is True. |
|
Return the maximum of the values over the requested axis. |
|
Return the mean of the values over the requested axis. |
Return the median of the values over the requested axis. |
|
|
Return the approximate median of the values over the requested axis. |
|
Return the memory usage of the Series. |
|
Return the memory usage of each partition |
|
Return the minimum of the values over the requested axis. |
|
|
|
|
Number of bytes |
|
Return dimensionality |
|
|
|
|
Return the largest n elements. |
DataFrame.notnull is an alias for DataFrame.notna. |
|
|
Return the smallest n elements. |
|
Return number of unique elements in the object. |
|
Approximate number of unique rows. |
|
Persist this dask collection into memory |
|
Apply chainable functions that expect Series or DataFrames. |
|
|
|
Return the product of the values over the requested axis. |
|
Approximate quantiles of Series |
|
|
|
Pseudorandomly split dataframe into different pieces row-wise |
|
|
|
Repartition a collection |
|
Replace values given in to_replace with value. |
|
Alter Series index labels or name |
|
Resample time-series data. |
|
Reset the index to the default index. |
|
Provides rolling transformations. |
|
Round a DataFrame to a variable number of decimal places. |
|
Random sample of items |
|
Return unbiased standard error of the mean over requested axis. |
Return a tuple representing the dimensionality of the DataFrame. |
|
|
Shift index by desired number of periods with an optional time freq. |
Size of the Series or DataFrame as a Delayed object. |
|
|
Return sample standard deviation over requested axis. |
|
|
|
Return the sum of the values over the requested axis. |
|
Move to a new DataFrame backend |
|
Create a Dask Bag from a Series |
|
See dd.to_csv docstring for more information |
|
Convert a dask DataFrame to a dask array. |
|
Convert into a list of dask.delayed objects, one per partition. |
|
Convert Series to DataFrame. |
|
See dd.to_hdf docstring for more information |
|
Render a string representation of the Series. |
|
Cast to DatetimeIndex of timestamps, at beginning of period. |
|
|
|
Return Series of unique values in the object. |
|
Return a Series containing counts of unique values. |
Return a dask.array of the values of this dataframe |
|
|
Return unbiased variance over requested axis. |
|
Visualize the expression or task graph |
|
Replace values where the condition is False. |
Index¶
|
Index-like Expr Collection. |
|
|
|
Align two objects on their axes with the specified join method. |
|
Return whether all elements are True, potentially over an axis. |
|
Return whether any element is True, potentially over an axis. |
|
Parallel version of pandas.Series.apply |
|
Cast a pandas object to a specified dtype |
|
Compute the lag-N autocorrelation. |
|
Return boolean Series equivalent to left <= series <= right. |
|
Fill NA/NaN values by using the next valid observation to fill the gap. |
Forget division information. |
|
|
Trim values at input threshold(s). |
|
Compute this DataFrame. |
|
Make a copy of the dataframe |
|
Compute correlation with other Series, excluding missing values. |
|
Count non-NA cells for each column or row. |
|
Compute covariance with Series, excluding missing values. |
|
Return cumulative maximum over a DataFrame or Series axis. |
|
Return cumulative minimum over a DataFrame or Series axis. |
|
Return cumulative product over a DataFrame or Series axis. |
|
Return cumulative sum over a DataFrame or Series axis. |
|
Generate descriptive statistics. |
|
First discrete difference of element. |
|
|
|
|
Return a new Series with missing values removed. |
|
|
|
Transform each element of a list-like to a row. |
|
|
Fill NA/NaN values by propagating the last valid observation to next valid. |
|
Fill NA/NaN values using the specified method. |
|
|
|
|
Get a dask DataFrame/Series representing the nth partition. |
|
|
Group Series using a mapper or by a Series of columns. |
|
|
|
First n rows of the dataset |
Return boolean if values in the object are monotonically decreasing. |
|
Return boolean if values in the object are monotonically increasing. |
|
|
Whether each element in the DataFrame is contained in values. |
Detect missing values. |
|
DataFrame.isnull is an alias for DataFrame.isna. |
|
Whether the divisions are known. |
|
|
|
Purely label-location based indexer for selection by label. |
|
|
|
|
Map values using an input mapping or function. |
|
Apply a function to each partition, sharing rows with adjacent partitions. |
|
Apply a Python function to each partition |
|
Replace values where the condition is True. |
|
Return the maximum of the values over the requested axis. |
Return the median of the values over the requested axis. |
|
|
Return the approximate median of the values over the requested axis. |
|
Memory usage of the values. |
|
Return the memory usage of each partition |
|
Return the minimum of the values over the requested axis. |
|
|
|
|
Number of bytes |
|
Return dimensionality |
|
|
|
|
Return the largest n elements. |
DataFrame.notnull is an alias for DataFrame.notna. |
|
|
Return the smallest n elements. |
|
Return number of unique elements in the object. |
|
Approximate number of unique rows. |
|
Persist this dask collection into memory |
|
Apply chainable functions that expect Series or DataFrames. |
|
|
|
Approximate quantiles of Series |
|
|
|
Pseudorandomly split dataframe into different pieces row-wise |
|
|
|
Alter Series index labels or name |
|
Repartition a collection |
|
Replace values given in to_replace with value. |
|
Resample time-series data. |
|
Reset the index to the default index. |
|
Provides rolling transformations. |
|
Round a DataFrame to a variable number of decimal places. |
|
Random sample of items |
|
Return unbiased standard error of the mean over requested axis. |
Return a tuple representing the dimensionality of the DataFrame. |
|
|
Shift index by desired number of periods with an optional time freq. |
Size of the Series or DataFrame as a Delayed object. |
|
|
|
|
Move to a new DataFrame backend |
|
Create a Dask Bag from a Series |
|
See dd.to_csv docstring for more information |
|
Convert a dask DataFrame to a dask array. |
|
Convert into a list of dask.delayed objects, one per partition. |
|
Create a DataFrame with a column containing the Index. |
|
See dd.to_hdf docstring for more information |
|
Create a Series with both index and values equal to the index keys. |
|
Render a string representation of the Series. |
|
Cast to DatetimeIndex of timestamps, at beginning of period. |
|
|
|
Return Series of unique values in the object. |
|
Return a Series containing counts of unique values. |
Return a dask.array of the values of this dataframe |
|
|
Visualize the expression or task graph |
|
Replace values where the condition is False. |
Accessors¶
Similar to pandas, Dask provides dtype-specific methods under various accessors.
These are separate namespaces within Series
that only apply to specific data types.
Datetime Accessor¶
Methods
|
Perform ceil operation on the data to the specified freq. |
|
Perform floor operation on the data to the specified freq. |
Calculate year, week, and day according to the ISO 8601 standard. |
|
|
Convert times to midnight. |
|
Perform round operation on the data to the specified freq. |
|
Convert to Index using specified date_format. |
Attributes
Returns numpy array of python datetime.date objects. |
|
The day of the datetime. |
|
The day of the week with Monday=0, Sunday=6. |
|
The ordinal day of the year. |
|
The number of days in the month. |
|
The hours of the datetime. |
|
The microseconds of the datetime. |
|
The minutes of the datetime. |
|
The month as January=1, December=12. |
|
The nanoseconds of the datetime. |
|
The quarter of the date. |
|
The seconds of the datetime. |
|
Returns numpy array of datetime.time objects. |
|
Returns numpy array of datetime.time objects with timezones. |
|
Return the timezone. |
|
The week ordinal of the year. |
|
The day of the week with Monday=0, Sunday=6. |
|
The week ordinal of the year. |
|
The year of the datetime. |
String Accessor¶
Methods
Convert strings in the Series/Index to be capitalized. |
|
Convert strings in the Series/Index to be casefolded. |
|
|
|
|
Pad left and right side of strings in the Series/Index. |
|
Test if pattern or regex is contained within a string of a Series or Index. |
|
Count occurrences of pattern in each string of the Series/Index. |
|
Decode character string in the Series/Index using indicated encoding. |
|
Encode character string in the Series/Index using indicated encoding. |
|
Test if the end of each string element matches a pattern. |
|
Extract capture groups in the regex pat as columns in a DataFrame. |
|
Extract capture groups in the regex pat as columns in DataFrame. |
|
Return lowest indexes in each string in the Series/Index. |
|
Find all occurrences of pattern or regular expression in the Series/Index. |
|
Determine if each string entirely matches a regular expression. |
Extract element from each component at specified position or with specified key. |
|
|
Return lowest indexes in each string in Series/Index. |
Check whether all characters in each string are alphanumeric. |
|
Check whether all characters in each string are alphabetic. |
|
Check whether all characters in each string are decimal. |
|
Check whether all characters in each string are digits. |
|
Check whether all characters in each string are lowercase. |
|
Check whether all characters in each string are numeric. |
|
Check whether all characters in each string are whitespace. |
|
Check whether all characters in each string are titlecase. |
|
Check whether all characters in each string are uppercase. |
|
|
Join lists contained as elements in the Series/Index with passed delimiter. |
Compute the length of each element in the Series/Index. |
|
|
Pad right side of strings in the Series/Index. |
Convert strings in the Series/Index to lowercase. |
|
|
Remove leading characters. |
|
Determine if each string starts with a match of a regular expression. |
|
Return the Unicode normal form for the strings in the Series/Index. |
|
Pad strings in the Series/Index up to width. |
|
Split the string at the first occurrence of sep. |
|
Duplicate each string in the Series or Index. |
|
Replace each occurrence of pattern/regex in the Series/Index. |
|
Return highest indexes in each string in the Series/Index. |
|
Return highest indexes in each string in Series/Index. |
|
Pad left side of strings in the Series/Index. |
|
Split the string at the last occurrence of sep. |
|
|
|
Remove trailing characters. |
|
Slice substrings from each element in the Series or Index. |
|
Known inconsistencies: |
|
Test if the start of each string element matches a pattern. |
|
Remove leading and trailing characters. |
Convert strings in the Series/Index to be swapcased. |
|
Convert strings in the Series/Index to titlecase. |
|
|
Map all characters in the string through the given mapping table. |
Convert strings in the Series/Index to uppercase. |
|
|
Wrap strings in Series/Index at specified line width. |
|
Pad strings in the Series/Index by prepending '0' characters. |
Categorical Accessor¶
Methods
|
Add new categories. |
|
Ensure the categories in this series are known. |
|
Set the Categorical to be ordered. |
Ensure the categories in this series are unknown |
|
|
Set the Categorical to be unordered. |
|
Remove the specified categories. |
Removes categories which are not used |
|
|
Rename categories. |
|
Reorder categories as specified in new_categories. |
|
Set the categories to the specified new categories. |
Attributes
The categories of this categorical. |
|
The codes of this categorical. |
|
Whether the categories are fully known |
|
Whether the categories have an ordered relationship |
Groupby Operations¶
DataFrame Groupby¶
|
Aggregate using one or more specified operations |
|
Parallel version of pandas GroupBy.apply |
|
Backward fill the values. |
|
Compute count of group, excluding missing values. |
Number each item in each group from 0 to the length of that group - 1. |
|
|
Cumulative product for each group. |
|
Cumulative sum for each group. |
|
Forward fill the values. |
|
Construct DataFrame from group with provided name. |
|
Compute max of group values. |
|
Compute mean of groups, excluding missing values. |
|
Compute min of group values. |
|
Compute group sizes. |
|
Compute standard deviation of groups, excluding missing values. |
|
Compute sum of group values. |
|
Compute variance of groups, excluding missing values. |
|
Compute pairwise covariance of columns, excluding NA/null values. |
|
Compute pairwise correlation of columns, excluding NA/null values. |
|
Compute the first entry of each column within each group. |
|
Compute the last entry of each column within each group. |
|
Return index of first occurrence of minimum over requested axis. |
|
Return index of first occurrence of maximum over requested axis. |
|
Provides rolling transformations. |
|
Parallel version of pandas GroupBy.transform |
Series Groupby¶
|
Aggregate using one or more specified operations |
|
Parallel version of pandas GroupBy.apply |
|
Backward fill the values. |
|
Compute count of group, excluding missing values. |
Number each item in each group from 0 to the length of that group - 1. |
|
|
Cumulative product for each group. |
|
Cumulative sum for each group. |
|
Forward fill the values. |
Construct DataFrame from group with provided name. |
|
|
Compute max of group values. |
|
Compute mean of groups, excluding missing values. |
|
Compute min of group values. |
|
Return number of unique elements in the group. |
|
Compute group sizes. |
|
Compute standard deviation of groups, excluding missing values. |
|
Compute sum of group values. |
|
Compute variance of groups, excluding missing values. |
|
Compute the first entry of each column within each group. |
|
Compute the last entry of each column within each group. |
|
Return index of first occurrence of minimum over requested axis. |
|
Return index of first occurrence of maximum over requested axis. |
|
Provides rolling transformations. |
|
Parallel version of pandas GroupBy.transform |
Custom Aggregation¶
|
User defined groupby-aggregation. |
Rolling Operations¶
|
Provides rolling transformations. |
|
Provides rolling transformations. |
|
Calculate the rolling custom aggregation function. |
Calculate the rolling count of non NaN observations. |
|
Calculate the rolling Fisher's definition of kurtosis without bias. |
|
Calculate the rolling maximum. |
|
Calculate the rolling mean. |
|
Calculate the rolling median. |
|
Calculate the rolling minimum. |
|
Calculate the rolling quantile. |
|
Calculate the rolling unbiased skewness. |
|
Calculate the rolling standard deviation. |
|
Calculate the rolling sum. |
|
Calculate the rolling variance. |
Create DataFrames¶
|
Read CSV files into a Dask DataFrame |
|
Read delimited files into a Dask DataFrame |
|
Read fixed-width files into a Dask DataFrame |
|
Read a Parquet file into a Dask DataFrame |
|
Read HDF files into a Dask DataFrame |
|
Create a dataframe from a set of JSON files |
|
Read dataframe from ORC file(s) |
|
Read SQL database table into a DataFrame. |
|
Read SQL query into a DataFrame. |
|
Read SQL query or database table into a DataFrame. |
|
Read any sliceable array into a Dask Dataframe |
|
Create a Dask DataFrame from a Dask Array. |
|
Create Dask DataFrame from many Dask Delayed objects |
|
Create a DataFrame collection from a custom function map. |
|
Construct a Dask DataFrame from a Pandas DataFrame |
|
Construct a Dask DataFrame from a Python Dictionary |
Store DataFrames¶
|
Store Dask DataFrame to CSV files |
|
Store Dask DataFrame to Parquet files |
|
Store Dask DataFrame to Hierarchical Data Format (HDF) files |
|
Create Dask Array from a Dask DataFrame |
|
Store Dask DataFrame to a SQL table |
|
Write dataframe into JSON text files |
Convert DataFrames¶
|
Create a Dask Bag from a Series |
|
Convert a dask DataFrame to a dask array. |
|
Convert into a list of dask.delayed objects, one per partition. |
Reshape DataFrames¶
|
Convert categorical variable into dummy/indicator variables. |
|
Create a spreadsheet-style pivot table as a DataFrame. |
|
Concatenate DataFrames¶
|
Merge the DataFrame with another DataFrame |
|
Concatenate DataFrames along rows. |
|
Merge DataFrame or named Series objects with a database-style join. |
|
Perform a merge by key distance. |
Resampling¶
|
Aggregate using one or more operations |
|
Aggregate using one or more operations over the specified axis. |
Compute count of group, excluding missing values. |
|
Compute the first entry of each column within each group. |
|
Compute the last entry of each column within each group. |
|
Compute max value of group. |
|
Compute mean of groups, excluding missing values. |
|
Compute median of groups, excluding missing values. |
|
Compute min value of group. |
|
Return number of unique elements in the group. |
|
Compute open, high, low and close values of a group, excluding missing values. |
|
Compute prod of group values. |
|
Return value at the given quantile. |
|
Compute standard error of the mean of groups, excluding missing values. |
|
Compute group sizes. |
|
Compute standard deviation of groups, excluding missing values. |
|
Compute sum of group values. |
|
Compute variance of groups, excluding missing values. |
Dask Metadata¶
|
This method creates meta-data based on the type of x. |
Query Planning and Optimization¶
|
Create a graph representation of the Expression. |
|
Visualize the expression or task graph |
|
Outputs statistics about every node in the expression. |
Other functions¶
|
Compute several dask collections at once. |
|
Apply Python function on each DataFrame partition. |
|
Apply a function to each partition, sharing rows with adjacent partitions. |
Convert argument to datetime. |
|
|
Convert argument to a numeric type. |
Convert argument to timedelta. |