dask.dataframe.DataFrame
class dask.dataframe.DataFrame(dsk, name, meta, divisions)
Parallel Pandas DataFrame
Do not use this class directly. Instead use functions like dd.read_csv, dd.read_parquet, or dd.from_pandas.
- Parameters
- dsk: dict
The dask graph to compute this DataFrame
- name: str
The key prefix that specifies which keys in the dask graph make up this particular DataFrame
- meta: pandas.DataFrame
An empty pandas.DataFrame with names, dtypes, and index matching the expected output.
- divisions: tuple of index values
Values along which we partition our blocks on the index
Methods
__init__
(dsk, name, meta, divisions)
abs
()Return a Series/DataFrame with absolute numeric value of each element.
add
(other[, axis, level, fill_value])Get Addition of dataframe and other, element-wise (binary operator add).
add_prefix
(prefix)Prefix labels with string prefix.
add_suffix
(suffix)Suffix labels with string suffix.
align
(other[, join, axis, fill_value])Align two objects on their axes with the specified join method.
all
([axis, skipna, split_every, out])Return whether all elements are True, potentially over an axis.
any
([axis, skipna, split_every, out])Return whether any element is True, potentially over an axis.
apply
(func[, axis, broadcast, raw, reduce, ...])Parallel version of pandas.DataFrame.apply
applymap
(func[, meta])Apply a function to a Dataframe elementwise.
assign
(**kwargs)Assign new columns to a DataFrame.
astype
(dtype)Cast a pandas object to the specified dtype.
bfill
([axis, limit])Fill NA/NaN values by using the next valid observation to fill the gap.
categorize
([columns, index, split_every])Convert columns of the DataFrame to category dtype.
clear_divisions
()Forget division information
clip
([lower, upper, axis])Trim values at input threshold(s).
combine
(other, func[, fill_value, overwrite])Perform column-wise combine with another DataFrame.
combine_first
(other)Update null elements with value in the same location in other.
compute
(**kwargs)Compute this dask collection
compute_current_divisions
([col])Compute the current divisions of the DataFrame.
copy
([deep])Make a copy of the dataframe
corr
([method, min_periods, numeric_only, ...])Compute pairwise correlation of columns, excluding NA/null values.
count
([axis, split_every, numeric_only])Count non-NA cells for each column or row.
cov
([min_periods, numeric_only, split_every])Compute pairwise covariance of columns, excluding NA/null values.
cummax
([axis, skipna, out])Return cumulative maximum over a DataFrame or Series axis.
cummin
([axis, skipna, out])Return cumulative minimum over a DataFrame or Series axis.
cumprod
([axis, skipna, dtype, out])Return cumulative product over a DataFrame or Series axis.
cumsum
([axis, skipna, dtype, out])Return cumulative sum over a DataFrame or Series axis.
describe
([split_every, percentiles, ...])Generate descriptive statistics.
diff
([periods, axis])First discrete difference of element.
div
(other[, axis, level, fill_value])Get Floating division of dataframe and other, element-wise (binary operator truediv).
divide
(other[, axis, level, fill_value])Get Floating division of dataframe and other, element-wise (binary operator truediv).
dot
(other[, meta])Compute the dot product between the Series and the columns of other.
drop
([labels, axis, columns, errors])Drop specified labels from rows or columns.
drop_duplicates
([subset, split_every, ...])Return DataFrame with duplicate rows removed.
dropna
([how, subset, thresh])Remove missing values.
enforce_runtime_divisions
()Enforce the current divisions at runtime
eq
(other[, axis, level])Get Equal to of dataframe and other, element-wise (binary operator eq).
eval
(expr[, inplace])Evaluate a string describing operations on DataFrame columns.
explode
(column)Transform each element of a list-like to a row, replicating index values.
ffill
([axis, limit])Fill NA/NaN values by propagating the last valid observation to next valid.
fillna
([value, method, limit, axis])Fill NA/NaN values using the specified method.
first
(offset)Select initial periods of time series data based on a date offset.
floordiv
(other[, axis, level, fill_value])Get Integer division of dataframe and other, element-wise (binary operator floordiv).
from_dict
(data, *, npartitions[, orient, ...])Construct a Dask DataFrame from a Python Dictionary
ge
(other[, axis, level])Get Greater than or equal to of dataframe and other, element-wise (binary operator ge).
get_partition
(n)Get a dask DataFrame/Series representing the nth partition.
groupby
([by, group_keys, sort, observed, dropna])Group DataFrame using a mapper or by a Series of columns.
gt
(other[, axis, level])Get Greater than of dataframe and other, element-wise (binary operator gt).
head
([n, npartitions, compute])First n rows of the dataset
idxmax
([axis, skipna, split_every, numeric_only])Return index of first occurrence of maximum over requested axis.
idxmin
([axis, skipna, split_every, numeric_only])Return index of first occurrence of minimum over requested axis.
info
([buf, verbose, memory_usage])Concise summary of a Dask DataFrame.
isin
(values)Whether each element in the DataFrame is contained in values.
isna
()Detect missing values.
isnull
()DataFrame.isnull is an alias for DataFrame.isna.
items
()Iterate over (column name, Series) pairs.
iterrows
()Iterate over DataFrame rows as (index, Series) pairs.
itertuples
([index, name])Iterate over DataFrame rows as namedtuples.
join
(other[, on, how, lsuffix, rsuffix, ...])Join columns of another DataFrame.
kurtosis
([axis, fisher, bias, nan_policy, ...])Return unbiased kurtosis over requested axis.
last
(offset)Select final periods of time series data based on a date offset.
le
(other[, axis, level])Get Less than or equal to of dataframe and other, element-wise (binary operator le).
lt
(other[, axis, level])Get Less than of dataframe and other, element-wise (binary operator lt).
map
(func[, meta, na_action])
map_overlap
(func, before, after, *args, **kwargs)Apply a function to each partition, sharing rows with adjacent partitions.
map_partitions
(func, *args, **kwargs)Apply Python function on each DataFrame partition.
mask
(cond[, other])Replace values where the condition is True.
max
([axis, skipna, split_every, out, ...])Return the maximum of the values over the requested axis.
mean
([axis, skipna, split_every, dtype, ...])Return the mean of the values over the requested axis.
median
([axis, method])Return the median of the values over the requested axis.
median_approximate
([axis, method])Return the approximate median of the values over the requested axis.
melt
([id_vars, value_vars, var_name, ...])Unpivots a DataFrame from wide format to long format, optionally leaving identifier variables set.
memory_usage
([index, deep])Return the memory usage of each column in bytes.
memory_usage_per_partition
([index, deep])Return the memory usage of each partition
merge
(right[, how, on, left_on, right_on, ...])Merge the DataFrame with another DataFrame
min
([axis, skipna, split_every, out, ...])Return the minimum of the values over the requested axis.
mod
(other[, axis, level, fill_value])Get Modulo of dataframe and other, element-wise (binary operator mod).
mode
([dropna, split_every, numeric_only])Get the mode(s) of each element along the selected axis.
mul
(other[, axis, level, fill_value])Get Multiplication of dataframe and other, element-wise (binary operator mul).
ne
(other[, axis, level])Get Not equal to of dataframe and other, element-wise (binary operator ne).
nlargest
([n, columns, split_every])Return the first n rows ordered by columns in descending order.
notnull
()DataFrame.notnull is an alias for DataFrame.notna.
nsmallest
([n, columns, split_every])Return the first n rows ordered by columns in ascending order.
nunique
([split_every, dropna, axis])Count number of distinct elements in specified axis.
nunique_approx
([split_every])Approximate number of unique rows.
persist
(**kwargs)Persist this dask collection into memory
pipe
(func, *args, **kwargs)Apply chainable functions that expect Series or DataFrames.
pivot_table
([index, columns, values, aggfunc])Create a spreadsheet-style pivot table as a DataFrame.
pop
(item)Return item and drop from frame.
pow
(other[, axis, level, fill_value])Get Exponential power of dataframe and other, element-wise (binary operator pow).
prod
([axis, skipna, split_every, dtype, ...])Return the product of the values over the requested axis.
product
([axis, skipna, split_every, dtype, ...])Return the product of the values over the requested axis.
quantile
([q, axis, numeric_only, method])Approximate row-wise and precise column-wise quantiles of DataFrame
query
(expr, **kwargs)Filter dataframe with complex expression
radd
(other[, axis, level, fill_value])Get Addition of dataframe and other, element-wise (binary operator radd).
random_split
(frac[, random_state, shuffle])Pseudorandomly split dataframe into different pieces row-wise
rdiv
(other[, axis, level, fill_value])Get Floating division of dataframe and other, element-wise (binary operator rtruediv).
reduction
(chunk[, aggregate, combine, meta, ...])Generic row-wise reductions.
rename
([index, columns])Rename columns or index labels.
repartition
([divisions, npartitions, ...])Repartition dataframe along new divisions
replace
([to_replace, value, regex])Replace values given in to_replace with value.
resample
(rule[, closed, label])Resample time-series data.
reset_index
([drop])Reset the index to the default index.
rfloordiv
(other[, axis, level, fill_value])Get Integer division of dataframe and other, element-wise (binary operator rfloordiv).
rmod
(other[, axis, level, fill_value])Get Modulo of dataframe and other, element-wise (binary operator rmod).
rmul
(other[, axis, level, fill_value])Get Multiplication of dataframe and other, element-wise (binary operator rmul).
rolling
(window[, min_periods, center, ...])Provides rolling transformations.
round
([decimals])Round a DataFrame to a variable number of decimal places.
rpow
(other[, axis, level, fill_value])Get Exponential power of dataframe and other, element-wise (binary operator rpow).
rsub
(other[, axis, level, fill_value])Get Subtraction of dataframe and other, element-wise (binary operator rsub).
rtruediv
(other[, axis, level, fill_value])Get Floating division of dataframe and other, element-wise (binary operator rtruediv).
sample
([n, frac, replace, random_state])Random sample of items
select_dtypes
([include, exclude])Return a subset of the DataFrame's columns based on the column dtypes.
sem
([axis, skipna, ddof, split_every, ...])Return unbiased standard error of the mean over requested axis.
set_index
(other[, drop, sorted, ...])Set the DataFrame index (row labels) using an existing column.
shift
([periods, freq, axis])Shift index by desired number of periods with an optional time freq.
shuffle
(on[, npartitions, max_branch, ...])Rearrange DataFrame into new partitions
skew
([axis, bias, nan_policy, out, numeric_only])Return unbiased skew over requested axis.
sort_values
(by[, npartitions, ascending, ...])Sort the dataset by a single column.
squeeze
([axis])Squeeze 1 dimensional axis objects into scalars.
std
([axis, skipna, ddof, split_every, ...])Return sample standard deviation over requested axis.
sub
(other[, axis, level, fill_value])Get Subtraction of dataframe and other, element-wise (binary operator sub).
sum
([axis, skipna, split_every, dtype, out, ...])Return the sum of the values over the requested axis.
tail
([n, compute])Last n rows of the dataset
to_backend
([backend])Move to a new DataFrame backend
to_bag
([index, format])Create Dask Bag from a Dask DataFrame
to_csv
(filename, **kwargs)Store Dask DataFrame to CSV files
to_dask_array
([lengths, meta])Convert a dask DataFrame to a dask array.
to_delayed
([optimize_graph])Convert into a list of dask.delayed objects, one per partition.
to_hdf
(path_or_buf, key[, mode, append])Store Dask Dataframe to Hierarchical Data Format (HDF) files
to_html
([max_rows])Render a DataFrame as an HTML table.
to_json
(filename, *args, **kwargs)See dd.to_json docstring for more information
to_orc
(path, *args, **kwargs)See dd.to_orc docstring for more information
to_parquet
(path, *args, **kwargs)Store Dask.dataframe to Parquet files
to_records
([index, lengths])Create Dask Array from a Dask Dataframe
to_sql
(name, uri[, schema, if_exists, ...])See dd.to_sql docstring for more information
to_string
([max_rows])Render a DataFrame to a console-friendly tabular output.
to_timestamp
([freq, how, axis])Cast to DatetimeIndex of timestamps, at beginning of period.
truediv
(other[, axis, level, fill_value])Get Floating division of dataframe and other, element-wise (binary operator truediv).
var
([axis, skipna, ddof, split_every, ...])Return unbiased variance over requested axis.
visualize
([filename, format, optimize_graph])Render the computation of this object's task graph using graphviz.
where
(cond[, other])Replace values where the condition is False.
Attributes
attrs
Dictionary of global attributes of this dataset.
axes
divisions
Tuple of npartitions + 1 values, in ascending order, marking the lower/upper bounds of each partition's index.
dtypes
Return data types
empty
iloc
Purely integer-location based indexing for selection by position.
index
Return dask Index instance
known_divisions
Whether divisions are already known
loc
Purely label-location based indexer for selection by label.
ndim
Return dimensionality
npartitions
Return number of partitions
partitions
Slice dataframe by partitions
shape
Return a tuple representing the dimensionality of the DataFrame.
size
Size of the Series or DataFrame as a Delayed object.
values
Return a dask.array of the values of this dataframe