dask.dataframe.DataFrame.describe
- DataFrame.describe(split_every=False, percentiles=None, percentiles_method='default', include=None, exclude=None)
Generate descriptive statistics.
This docstring was copied from pandas.DataFrame.describe.
Some inconsistencies with the Dask version may exist.
Dask computes percentiles (used for the 25%, 50%, and 75% statistics) using an approximate algorithm by default. Results may therefore differ slightly from pandas. Use percentiles_method="dask" for the built-in Dask algorithm or percentiles_method="tdigest" for the t-digest algorithm. See dask.dataframe.DataFrame.quantile() for details.
- Parameters:
- split_every : int or False, optional
Number of partitions to aggregate at once. Defaults to False, which uses a single-pass reduction over all partitions.
- percentiles : list-like of numbers, optional
The percentiles to include in the output. All should fall between 0 and 1. By default, [0.25, 0.5, 0.75] is used.
- percentiles_method : {"default", "tdigest", "dask"}, optional
Method for computing percentiles. "default" uses the internal Dask algorithm. "tdigest" uses the t-digest algorithm for floats and ints and falls back to "dask" otherwise.
Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding ``NaN`` values. Analyzes both numeric and object series, as well as ``DataFrame`` column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.
- Returns:
- Series or DataFrame
Summary statistics of the Series or DataFrame provided.
See also
DataFrame.count : Count number of non-NA/null observations.
DataFrame.max : Maximum of the values in the object.
DataFrame.min : Minimum of the values in the object.
DataFrame.mean : Mean of the values.
DataFrame.std : Standard deviation of the observations.
DataFrame.select_dtypes : Subset of a DataFrame including/excluding columns based on their dtype.
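The reductions listed above correspond to individual rows of the numeric describe output. The pandas sketch below, using a made-up three-element Series, checks each row against the matching standalone method; the 50% row is compared with median().

```python
import pandas as pd

# Hypothetical data for illustration.
s = pd.Series([1, 2, 3])
summary = s.describe()

# Compare each describe row with the corresponding standalone reduction.
checks = {
    "count": summary["count"] == s.count(),
    "mean": summary["mean"] == s.mean(),
    "std": summary["std"] == s.std(),
    "min": summary["min"] == s.min(),
    "max": summary["max"] == s.max(),
    "50%": summary["50%"] == s.median(),
}
print(checks)
```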
Notes
For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.
For object data (e.g. strings), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency.
If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.
For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the DataFrame consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.
The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.
Examples
Describing a numeric Series.
>>> s = pd.Series([1, 2, 3])
>>> s.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64
Describing a categorical Series.
>>> s = pd.Series(["a", "a", "b", "c"])
>>> s.describe()
count     4
unique    3
top       a
freq      2
dtype: object
Describing a timestamp Series.
>>> s = pd.Series(
...     [
...         np.datetime64("2000-01-01"),
...         np.datetime64("2010-01-01"),
...         np.datetime64("2010-01-01"),
...     ]
... )
>>> s.describe()
count                      3
mean     2006-09-01 08:00:00
min      2000-01-01 00:00:00
25%      2004-12-31 12:00:00
50%      2010-01-01 00:00:00
75%      2010-01-01 00:00:00
max      2010-01-01 00:00:00
dtype: object
Describing a DataFrame. By default only numeric fields are returned.
>>> df = pd.DataFrame(
...     {
...         "categorical": pd.Categorical(["d", "e", "f"]),
...         "numeric": [1, 2, 3],
...         "object": ["a", "b", "c"],
...     }
... )
>>> df.describe()
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0
Describing all columns of a DataFrame regardless of data type.
>>> df.describe(include="all")
       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      a
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN
Describing a column from a DataFrame by accessing it as an attribute.
>>> df.numeric.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64
Including only numeric columns in a DataFrame description.
>>> df.describe(include=[np.number])
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0
Including only string columns in a DataFrame description.
>>> df.describe(include=[object])
       object
count       3
unique      3
top         a
freq        1
Including only categorical columns from a DataFrame description.
>>> df.describe(include=["category"])
       categorical
count            3
unique           3
top              d
freq             1
Excluding numeric columns from a DataFrame description.
>>> df.describe(exclude=[np.number])
       categorical object
count            3      3
unique           3      3
top              f      a
freq             1      1
Excluding object columns from a DataFrame description.
>>> df.describe(exclude=[object])
       categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0