dask.dataframe.api.GroupBy.aggregate

GroupBy.aggregate(arg=None, split_every=8, split_out=None, shuffle_method=None, **kwargs)

Aggregate using one or more specified operations.

Based on pd.core.groupby.DataFrameGroupBy.agg

Parameters:
arg: callable, str, list or dict, optional

Aggregation spec. Accepted combinations are:

  • callable function

  • string function name

  • list of functions and/or function names, e.g. [np.sum, 'mean']

  • dict of column names -> function, function name or list of such.

  • None only if named aggregation syntax is used

split_every: int >= 2 or dict(axis: int), optional

Number of intermediate partitions that may be aggregated at once. Defaults to 8. This determines the depth of the recursive aggregation. If set to a value greater than or equal to the number of input chunks, the aggregation is performed in two steps: one chunk function per input chunk and a single aggregate function at the end. If set to a smaller value, an intermediate combine function is used, so that no single combine or aggregate function has more than split_every inputs. The depth of the aggregation graph will be $\log_{\text{split\_every}}(\text{number of input chunks along reduced axes})$. Setting a low value can reduce cache size and network transfers, at the cost of more CPU and a larger dask graph.
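The relationship between split_every and the depth of the reduction can be illustrated with a small pure-Python model. This is only a sketch of the arithmetic described above, not dask's actual graph construction; the function name reduction_levels is hypothetical, and it counts only the combine and aggregate levels that follow the per-chunk step.

```python
def reduction_levels(n_chunks: int, split_every: int) -> int:
    """Count the levels needed to reduce n_chunks to a single result
    when each step merges at most split_every inputs."""
    if split_every < 2:
        raise ValueError("split_every must be >= 2")
    levels = 0
    remaining = n_chunks
    while remaining > 1:
        # Ceiling division: each group of up to split_every inputs
        # becomes one output at the next level.
        remaining = -(-remaining // split_every)
        levels += 1
    return levels


# split_every >= number of chunks: a single aggregate level suffices.
one_level = reduction_levels(8, 8)
# 100 chunks with split_every=8: 100 -> 13 -> 2 -> 1.
default_depth = reduction_levels(100, 8)
# A low split_every gives a deeper tree: roughly log2(100).
binary_depth = reduction_levels(100, 2)
```

In this model a lower split_every means fewer inputs per task but more levels, which matches the trade-off between transfer size and graph size noted above.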

split_out: int, optional

Number of output partitions for group-by-like aggregations. Defaults to 1.

shuffle_method: bool or str, optional

Whether a shuffle-based algorithm should be used. A specific algorithm name may also be specified (e.g. "tasks" or "p2p"). The shuffle-based algorithm is likely to be more efficient than shuffle_method=False when split_out > 1 and the number of unique groups is large (high cardinality). Defaults to False when split_out = 1. When split_out > 1, the algorithm set by the shuffle option in the dask config system is used, or "tasks" if nothing is set.

kwargs: tuple or pd.NamedAgg, optional

Used for named aggregations where the keywords are the output column names and the values are tuples where the first element is the input column name and the second element is the aggregation function. pandas.NamedAgg can also be used as the value. To use the named aggregation syntax, arg must be set to None.