dask.dataframe.DataFrame

class dask.dataframe.DataFrame(expr)

DataFrame-like Expr Collection.

The constructor takes the expression that represents the query as input. The class is not meant to be instantiated directly. Instead, create a DataFrame through one of Dask's IO connectors, such as `dd.read_parquet`, `dd.read_csv`, or `dd.from_pandas`.

__init__(expr)

Methods

`__init__`(expr)
`abs`()	Return a Series/DataFrame with absolute numeric value of each element.
`add`(other[, axis, level, fill_value])
`add_prefix`(prefix[, axis])	Prefix labels with string prefix.
`add_suffix`(suffix[, axis])	Suffix labels with string suffix.
`align`(other[, join, axis, fill_value])	Align two objects on their axes with the specified join method.
`all`([axis, skipna, split_every])	Return whether all elements are True, potentially over an axis.
`analyze`([filename, format])	Outputs statistics about every node in the expression.
`any`([axis, skipna, split_every])	Return whether any element is True, potentially over an axis.
`apply`(function, *args[, meta, axis])	Parallel version of pandas.DataFrame.apply
`assign`(**pairs)	Assign new columns to a DataFrame.
`astype`(dtypes)	Cast a pandas object to a specified dtype `dtype`.
`bfill`([axis, limit])	Fill NA/NaN values by using the next valid observation to fill the gap.
`categorize`([columns, index, split_every])	Convert columns of the DataFrame to category dtype.
`clear_divisions`()	Forget division information.
`clip`([lower, upper, axis])	Trim values at input threshold(s).
`combine`(other, func[, fill_value, overwrite])	Perform column-wise combine with another DataFrame.
`combine_first`(other)	Update null elements with value in the same location in other.
`compute`(**kwargs)	Compute this dask collection
`compute_current_divisions`([col, set_divisions])	Compute the current divisions of the DataFrame.
`copy`([deep])	Make a copy of the dataframe
`corr`([method, min_periods, numeric_only, ...])	Compute pairwise correlation of columns, excluding NA/null values.
`count`([axis, numeric_only, split_every])	Count non-NA cells for each column or row.
`cov`([min_periods, numeric_only, split_every])	Compute pairwise covariance of columns, excluding NA/null values.
`cummax`([axis, skipna])	Return cumulative maximum over a DataFrame or Series axis.
`cummin`([axis, skipna])	Return cumulative minimum over a DataFrame or Series axis.
`cumprod`([axis, skipna])	Return cumulative product over a DataFrame or Series axis.
`cumsum`([axis, skipna])	Return cumulative sum over a DataFrame or Series axis.
`describe`([split_every, percentiles, ...])	Generate descriptive statistics.
`diff`([periods, axis])	First discrete difference of element.
`div`(other[, axis, level, fill_value])
`divide`(other[, axis, level, fill_value])
`dot`(other[, meta])	Compute the dot product between the Series and the columns of other.
`drop`([labels, axis, columns, errors])	Drop specified labels from rows or columns.
`drop_duplicates`([subset, split_every, ...])	Return DataFrame with duplicate rows removed.
`dropna`([how, subset, thresh])	Remove missing values.
`enforce_runtime_divisions`()	Enforce the current divisions at runtime.
`eq`(other[, level, axis])
`eval`(expr, **kwargs)	Evaluate a string describing operations on DataFrame columns.
`explain`([stage, format])	Create a graph representation of the Expression.
`explode`(column)	Transform each element of a list-like to a row, replicating index values.
`ffill`([axis, limit])	Fill NA/NaN values by propagating the last valid observation to next valid.
`fillna`([value, axis])	Fill NA/NaN values with value.
`floordiv`(other[, axis, level, fill_value])
`from_dict`(data, *[, npartitions, orient, ...])	Construct a Dask DataFrame from a Python Dictionary
`ge`(other[, level, axis])
`get_partition`(n)	Get a dask DataFrame/Series representing the nth partition.
`groupby`(by[, group_keys, sort, observed, dropna])	Group DataFrame using a mapper or by a Series of columns.
`gt`(other[, level, axis])
`head`([n, npartitions, compute])	First n rows of the dataset
`idxmax`([axis, skipna, numeric_only, split_every])	Return index of first occurrence of maximum over requested axis.
`idxmin`([axis, skipna, numeric_only, split_every])	Return index of first occurrence of minimum over requested axis.
`info`([buf, verbose, memory_usage])	Concise summary of a Dask DataFrame
`isin`(values)	Whether each element in the DataFrame is contained in values.
`isna`()	Detect missing values.
`isnull`()	DataFrame.isnull is an alias for DataFrame.isna.
`items`()	Iterate over (column name, Series) pairs.
`iterrows`()	Iterate over DataFrame rows as (index, Series) pairs.
`itertuples`([index, name])	Iterate over DataFrame rows as namedtuples.
`join`(other[, on, how, lsuffix, rsuffix, ...])	Join columns of another DataFrame.
`kurt`([axis, fisher, bias, nan_policy, ...])	Return unbiased kurtosis over requested axis.
`kurtosis`([axis, fisher, bias, nan_policy, ...])	Return unbiased kurtosis over requested axis.
`le`(other[, level, axis])
`lower_once`()
`lt`(other[, level, axis])
`map`(func[, na_action, meta])
`map_overlap`(func, before, after, *args[, ...])	Apply a function to each partition, sharing rows with adjacent partitions.
`map_partitions`(func, *args[, meta, ...])	Apply a Python function to each partition
`mask`(cond[, other])	Replace values where the condition is True.
`max`([axis, skipna, numeric_only, split_every])	Return the maximum of the values over the requested axis.
`mean`([axis, skipna, numeric_only, split_every])	Return the mean of the values over the requested axis.
`median`([axis, numeric_only])	Return the median of the values over the requested axis.
`median_approximate`([axis, method, numeric_only])	Return the approximate median of the values over the requested axis.
`melt`([id_vars, value_vars, var_name, ...])	Unpivot DataFrame from wide to long format, optionally leaving identifiers set.
`memory_usage`([deep, index])	Return the memory usage of each column in bytes.
`memory_usage_per_partition`([index, deep])	Return the memory usage of each partition
`merge`(right[, how, on, left_on, right_on, ...])	Merge the DataFrame with another DataFrame
`min`([axis, skipna, numeric_only, split_every])	Return the minimum of the values over the requested axis.
`mod`(other[, axis, level, fill_value])
`mode`([dropna, split_every, numeric_only])	Get the mode(s) of each element along the selected axis.
`mul`(other[, axis, level, fill_value])
`ne`(other[, level, axis])
`nlargest`([n, columns, split_every])	Return the first n rows ordered by columns in descending order.
`notnull`()	DataFrame.notnull is an alias for DataFrame.notna.
`nsmallest`([n, columns, split_every])	Return the first n rows ordered by columns in ascending order.
`nunique`([axis, dropna, split_every])	Count number of distinct elements in specified axis.
`nunique_approx`([split_every])	Approximate number of unique rows.
`optimize`([fuse])	Optimizes the DataFrame.
`persist`([fuse])	Persist this dask collection into memory
`pipe`(func, *args, **kwargs)	Apply chainable functions that expect Series or DataFrames.
`pivot_table`(index, columns, values[, aggfunc])	Create a spreadsheet-style pivot table as a DataFrame.
`pop`(item)	Return item and drop it from DataFrame.
`pow`(other[, axis, level, fill_value])
`pprint`()	Outputs a string representation of the DataFrame.
`prod`([axis, skipna, numeric_only, ...])	Return the product of the values over the requested axis.
`product`([axis, skipna, numeric_only, ...])	Return the product of the values over the requested axis.
`quantile`([q, axis, numeric_only, method])	Approximate row-wise and precise column-wise quantiles of DataFrame
`query`(expr, **kwargs)	Filter dataframe with complex expression
`radd`(other[, axis, level, fill_value])
`random_split`(frac[, random_state, shuffle])	Pseudorandomly split dataframe into different pieces row-wise
`rdiv`(other[, axis, level, fill_value])
`reduction`(chunk[, aggregate, combine, meta, ...])	Generic row-wise reductions.
`rename`([index, columns])	Rename columns or index labels.
`rename_axis`([mapper, index, columns, axis])	Set the name of the axis for the index or columns.
`repartition`([divisions, npartitions, ...])	Repartition a collection
`replace`([to_replace, value, regex])	Replace values given in to_replace with value.
`resample`(rule[, closed, label])	Resample time-series data.
`reset_index`([drop])	Reset the index to the default index.
`rfloordiv`(other[, axis, level, fill_value])
`rmod`(other[, axis, level, fill_value])
`rmul`(other[, axis, level, fill_value])
`rolling`(window, **kwargs)	Provides rolling transformations.
`round`([decimals])	Round numeric columns in a DataFrame to a variable number of decimal places.
`rpow`(other[, axis, level, fill_value])
`rsub`(other[, axis, level, fill_value])
`rtruediv`(other[, axis, level, fill_value])
`sample`([n, frac, replace, random_state])	Random sample of items
`select_dtypes`([include, exclude])	Return a subset of the DataFrame's columns based on the column dtypes.
`sem`([axis, skipna, ddof, split_every, ...])	Return unbiased standard error of the mean over requested axis.
`set_index`(other[, drop, sorted, ...])	Set the DataFrame index (row labels) using an existing column.
`shift`([periods, freq, axis])	Shift index by desired number of periods with an optional time freq.
`shuffle`([on, ignore_index, npartitions, ...])	Rearrange DataFrame into new partitions
`simplify`()
`skew`([axis, bias, nan_policy, numeric_only])	Return unbiased skew over requested axis.
`sort_values`(by[, npartitions, ascending, ...])	Sort the dataset by a single column.
`squeeze`([axis])	Squeeze 1 dimensional axis objects into scalars.
`std`([axis, skipna, ddof, numeric_only, ...])	Return sample standard deviation over requested axis.
`sub`(other[, axis, level, fill_value])
`sum`([axis, skipna, numeric_only, min_count, ...])	Return the sum of the values over the requested axis.
`tail`([n, compute])	Last n rows of the dataset
`to_backend`([backend])	Move to a new DataFrame backend
`to_bag`([index, format])	Create a Dask Bag from a Series
`to_csv`(filename, **kwargs)	See dd.to_csv docstring for more information
`to_dask_array`([lengths, meta, optimize])	Convert a dask DataFrame to a dask array.
`to_delayed`([optimize_graph])	Convert into a list of `dask.delayed` objects, one per partition.
`to_hdf`(path_or_buf, key[, mode, append])	See dd.to_hdf docstring for more information
`to_html`([max_rows])	Render a DataFrame as an HTML table.
`to_json`(filename, *args, **kwargs)	See dd.to_json docstring for more information
`to_orc`(path, *args, **kwargs)	See dd.to_orc docstring for more information
`to_parquet`(path, **kwargs)
`to_records`([index, lengths])
`to_sql`(name, uri[, schema, if_exists, ...])
`to_string`([max_rows])	Render a DataFrame to a console-friendly tabular output.
`to_timestamp`([freq, how])	Cast PeriodIndex to DatetimeIndex of timestamps, at beginning of period.
`truediv`(other[, axis, level, fill_value])
`var`([axis, skipna, ddof, numeric_only, ...])	Return unbiased variance over requested axis.
`visualize`([tasks])	Visualize the expression or task graph
`where`(cond[, other])	Replace values where the condition is False.

Attributes

`axes`
`columns`
`dask`
`divisions`	Tuple of `npartitions + 1` values, in ascending order, marking the lower/upper bounds of each partition's index.
`dtypes`	Return data types
`empty`
`expr`
`iloc`	Purely integer-location based indexing for selection by position.
`index`	Return dask Index instance
`known_divisions`	Whether the divisions are known.
`loc`	Purely label-location based indexer for selection by label.
`nbytes`
`ndim`	Return dimensionality
`npartitions`	Return number of partitions
`partitions`	Slice dataframe by partitions
`shape`
`size`	Size of the Series or DataFrame as a Delayed object.
`values`	Return a dask.array of the values of this dataframe