dask.dataframe.DataFrame.describe

DataFrame.describe(split_every=False, percentiles=None, percentiles_method='default', include=None, exclude=None)

Generate descriptive statistics.

This docstring was copied from pandas.DataFrame.describe.

Some inconsistencies with the Dask version may exist.

Dask computes percentiles (used for the 25%, 50%, and 75% statistics) using an approximate algorithm by default. Results may therefore differ slightly from pandas. Use percentiles_method="dask" for the built-in Dask algorithm or percentiles_method="tdigest" for the t-digest algorithm. See dask.dataframe.DataFrame.quantile() for details.

Parameters:
split_every : int or False, optional

Number of partitions to aggregate at once. Defaults to False, which disables tree reduction and aggregates all partitions in a single step.

percentiles : list-like of numbers, optional

The percentiles to include in the output. All should fall between 0 and 1. By default, [0.25, 0.5, 0.75] is used.

percentiles_method : {"default", "tdigest", "dask"}, optional

Method for computing percentiles. "default" uses the internal Dask algorithm. "tdigest" uses the t-digest algorithm for floats and ints and falls back to "dask" otherwise.

include : 'all', list-like of dtypes or None (default), optional

Data types to include in the result. Ignored when analyzing a Series.

exclude : list-like of dtypes or None (default), optional

Data types to omit from the result. Ignored when analyzing a Series.

Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values. Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.

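
To make the split_every parameter described above more concrete, here is a conceptual sketch of combining per-partition results either in a single step or in groups. It is not Dask's implementation: the helpers combine and reduce_partials and the (count, sum) representation are hypothetical names chosen for illustration.

```python
from typing import List, Tuple

Partial = Tuple[int, float]  # hypothetical (count, sum) summary of one partition


def combine(partials: List[Partial]) -> Partial:
    """Merge several (count, sum) pairs into one."""
    return (sum(c for c, _ in partials), sum(s for _, s in partials))


def reduce_partials(partials: List[Partial], split_every=False) -> Partial:
    """Combine per-partition results, optionally in groups of split_every."""
    if not split_every:
        # No tree reduction: merge every partition's result in one step.
        return combine(partials)
    if split_every < 2:
        raise ValueError("split_every must be False or at least 2")
    while len(partials) > 1:
        # Merge neighbours in groups of split_every, then repeat on the
        # shorter list until a single result remains.
        partials = [
            combine(partials[i:i + split_every])
            for i in range(0, len(partials), split_every)
        ]
    return partials[0]


partitions = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]
partials = [(len(p), float(sum(p))) for p in partitions]

count, total = reduce_partials(partials, split_every=2)
print(count, total / count)
```

Both strategies give the same answer; grouping only changes how many intermediate results are merged at a time.
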
Returns:
Series or DataFrame

Summary statistics of the Series or DataFrame provided.

See also

DataFrame.count

Count number of non-NA/null observations.

DataFrame.max

Maximum of the values in the object.

DataFrame.min

Minimum of the values in the object.

DataFrame.mean

Mean of the values.

DataFrame.std

Standard deviation of the observations.

DataFrame.select_dtypes

Subset of a DataFrame including/excluding columns based on their dtype.

Notes

For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.
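
The numeric statistics listed above can be sketched with the Python standard library alone. This is an illustration of what each row means, not how pandas or Dask compute it; it assumes the sample standard deviation and linearly interpolated quartiles, which for this input agree with the numeric Series example in the Examples section.

```python
import statistics

values = [1, 2, 3]

# "inclusive" quantiles interpolate linearly between the observed
# minimum and maximum of the data.
q25, q50, q75 = statistics.quantiles(values, n=4, method="inclusive")

summary = {
    "count": float(len(values)),
    "mean": float(statistics.mean(values)),
    "std": statistics.stdev(values),  # sample standard deviation (n - 1)
    "min": float(min(values)),
    "25%": q25,
    "50%": q50,  # the 50th percentile is the median
    "75%": q75,
    "max": float(max(values)),
}

for name, value in summary.items():
    print(f"{name:<6}{value}")
```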

For object data (e.g. strings), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency.

If multiple object values share the highest count, the top result is chosen arbitrarily from among them.
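
A minimal standard-library sketch of the four object statistics follows. It is only an illustration of the definitions above; None stands in for a missing value, and the tie-breaking shown is that of collections.Counter, which pandas and Dask do not promise to match.

```python
from collections import Counter

values = ["a", "a", "b", "c", None]

# Missing values are left out of every statistic.
observed = [v for v in values if v is not None]

counts = Counter(observed)
# most_common(1) returns the value with the highest count; on ties,
# Counter keeps the first one encountered.
top, freq = counts.most_common(1)[0]

summary = {
    "count": len(observed),
    "unique": len(counts),
    "top": top,
    "freq": freq,
}
print(summary)
```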

For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the DataFrame consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.

The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.
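
The column selection described above can be sketched with pandas, from which this docstring was copied. The frame below is the one used in the Examples section. Relating the default behavior to select_dtypes is an illustration of the idea rather than a statement about the internal implementation; with Dask the same arguments are accepted per the signature above, followed by compute().

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "categorical": pd.Categorical(["d", "e", "f"]),
        "numeric": [1, 2, 3],
        "object": ["a", "b", "c"],
    }
)

# Default: only numeric columns are analyzed when any are present.
default = df.describe()

# Selecting the numeric columns first covers the same columns.
numeric_only = df.select_dtypes(include=[np.number]).describe()

# exclude drops the listed dtypes and analyzes what remains.
non_numeric = df.describe(exclude=[np.number])

print(list(default.columns))
print(list(numeric_only.columns))
print(list(non_numeric.columns))
```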

Examples

Describing a numeric Series.

>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([1, 2, 3])
>>> s.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64

Describing a categorical Series.

>>> s = pd.Series(["a", "a", "b", "c"])
>>> s.describe()
count     4
unique    3
top       a
freq      2
dtype: object

Describing a timestamp Series.

>>> s = pd.Series(
...     [
...         np.datetime64("2000-01-01"),
...         np.datetime64("2010-01-01"),
...         np.datetime64("2010-01-01"),
...     ]
... )
>>> s.describe()
count                      3
mean     2006-09-01 08:00:00
min      2000-01-01 00:00:00
25%      2004-12-31 12:00:00
50%      2010-01-01 00:00:00
75%      2010-01-01 00:00:00
max      2010-01-01 00:00:00
dtype: object

Describing a DataFrame. By default only numeric fields are returned.

>>> df = pd.DataFrame(
...     {
...         "categorical": pd.Categorical(["d", "e", "f"]),
...         "numeric": [1, 2, 3],
...         "object": ["a", "b", "c"],
...     }
... )
>>> df.describe()
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Describing all columns of a DataFrame regardless of data type.

>>> df.describe(include="all")
       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      a
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

Describing a column from a DataFrame by accessing it as an attribute.

>>> df.numeric.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64

Including only numeric columns in a DataFrame description.

>>> df.describe(include=[np.number])
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Including only string columns in a DataFrame description.

>>> df.describe(include=[object])
       object
count       3
unique      3
top         a
freq        1

Including only categorical columns from a DataFrame description.

>>> df.describe(include=["category"])
       categorical
count            3
unique           3
top              d
freq             1

Excluding numeric columns from a DataFrame description.

>>> df.describe(exclude=[np.number])
       categorical object
count            3      3
unique           3      3
top              f      a
freq             1      1

Excluding object columns from a DataFrame description.

>>> df.describe(exclude=[object])
       categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0