dask.array.histogramdd

dask.array.histogramdd(sample, bins, range=None, normed=None, weights=None, density=None)

Blocked variant of numpy.histogramdd().

Chunking of the input data (sample) is only allowed along the 0th (row) axis (the axis corresponding to the total number of samples). Data chunked along the 1st (column) axis is not compatible with this function. If weights are used, they must be chunked along the 0th axis identically to the input sample.
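As a minimal sketch of the constraint above, a sample that is chunked along its columns can be brought into a compatible layout with rechunk. The array shape, chunk sizes, bin counts, and range below are arbitrary choices made for illustration.

```python
import dask.array as da

# Hypothetical sample that is chunked along both rows and columns.
x = da.random.uniform(0, 1, size=(1000, 3), chunks=(250, 1))
print(x.numblocks)  # (4, 3)

# Merge the column chunks into one; the row chunking is left unchanged.
x = x.rechunk({1: -1})
print(x.chunks)  # ((250, 250, 250, 250), (3,))

h, edges = da.histogramdd(x, bins=(4, 4, 4), range=((0, 1),) * 3)
total = int(h.sum().compute())
```

After the rechunk, every (x, y, z) row lives inside a single block, which is the layout the blocked algorithm expects.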

As an example setup for a three dimensional histogram, consider a sample of shape (8, 3) with weights of shape (8,). The sample chunks would be ((4, 4), (3,)) and the weights chunks would be ((4, 4),), giving a table of the structure:

	sample (8 x 3)				weights
chunk	row	x	y	z	row	w
0	0	5	6	6	0	0.5
	1	8	9	2	1	0.8
	2	3	3	1	2	0.3
	3	2	5	6	3	0.7
1	4	3	1	1	4	0.3
	5	3	2	9	5	1.3
	6	8	1	5	6	0.8
	7	3	5	3	7	0.7
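The layout in the table can be reproduced with a short sketch. The values are copied from the table; the bin counts and the (0, 10) range are assumptions chosen only so that every row falls inside the histogram.

```python
import numpy as np
import dask.array as da

# Eight (x, y, z) rows and one weight per row, copied from the table.
rows = np.array([
    [5, 6, 6],
    [8, 9, 2],
    [3, 3, 1],
    [2, 5, 6],
    [3, 1, 1],
    [3, 2, 9],
    [8, 1, 5],
    [3, 5, 3],
])
w = np.array([0.5, 0.8, 0.3, 0.7, 0.3, 1.3, 0.8, 0.7])

# Two row chunks of four samples each; the three columns stay together.
sample = da.from_array(rows, chunks=(4, 3))
weights = da.from_array(w, chunks=4)

print(sample.chunks)   # ((4, 4), (3,))
print(weights.chunks)  # ((4, 4),)

# Illustrative binning: 2 bins per dimension over [0, 10].
h, edges = da.histogramdd(
    sample, bins=(2, 2, 2), range=((0, 10),) * 3, weights=weights
)
total = h.sum().compute()  # equals the sum of the weights
```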

If the sample 0th dimension and weight 0th (row) dimension are chunked differently, a ValueError will be raised. If coordinate groupings ((x, y, z) trios) are separated by a chunk boundary, then a ValueError will be raised. We suggest that you rechunk your data if it is of that form.
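The following sketch illustrates the mismatch case described above and one way to resolve it. The data, chunk sizes, and binning are made up for the example; the flag name mismatch_rejected is hypothetical.

```python
import numpy as np
import dask.array as da

rng = np.random.default_rng(0)
sample = da.from_array(rng.uniform(0, 1, size=(8, 3)), chunks=(4, 3))
weights = da.ones(8, chunks=2)  # row chunks (2, 2, 2, 2) differ from (4, 4)

try:
    da.histogramdd(sample, bins=(2, 2, 2), range=((0, 1),) * 3, weights=weights)
    mismatch_rejected = False
except ValueError:
    # Expected according to the description above.
    mismatch_rejected = True

# Align the weights with the row chunks of the sample, then retry.
weights = weights.rechunk((sample.chunks[0],))
h, edges = da.histogramdd(
    sample, bins=(2, 2, 2), range=((0, 1),) * 3, weights=weights
)
total = h.sum().compute()
```

Rechunking the weights rather than the sample keeps the sample layout untouched; rechunking the sample to match the weights would work equally well as long as the columns remain in a single chunk.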

The chunks property of the data (and of the optional weights) is used to check for compatibility with the blocked algorithm (as described above); therefore, you must call to_dask_array on a collection from dask.dataframe, i.e. dask.dataframe.Series or dask.dataframe.DataFrame.

The function is also compatible with x, y, and z being individual 1D arrays with equal chunking. In that case, the data should be passed as a tuple: histogramdd((x, y, z), ...)

Parameters:

sample : dask.array.Array (N, D) or sequence of dask.array.Array

Multidimensional data to be histogrammed.

Note the unusual interpretation of a sample when it is a sequence of dask Arrays:

When a (N, D) dask Array, each row is an entry in the sample (coordinate in D dimensional space).
When a sequence of dask Arrays, each element in the sequence is the array of values for a single coordinate.
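To make the two interpretations concrete, the sketch below builds the same data both ways and checks that the histograms agree. The coordinate values and bin edges are small illustrative choices; note that stacking 1D arrays produces one chunk per column, so the stacked array is rechunked before use.

```python
import numpy as np
import dask.array as da

x = da.from_array(np.array([2, 4, 2, 4, 2, 4]), chunks=3)
y = da.from_array(np.array([2, 2, 4, 4, 2, 4]), chunks=3)
z = da.from_array(np.array([4, 2, 4, 2, 4, 2]), chunks=3)
bins = ([0, 3, 6],) * 3

# Sequence form: one 1D array per coordinate.
h_seq, _ = da.histogramdd((x, y, z), bins)

# (N, D) form: each row is one point. Stacking yields one chunk per
# column, so merge the column chunks first.
stacked = da.stack([x, y, z], axis=1).rechunk({1: -1})
h_rect, _ = da.histogramdd(stacked, bins)

same = np.array_equal(h_seq.compute(), h_rect.compute())
```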

bins : sequence of arrays describing bin edges, int, or sequence of ints

The bin specification.

The possible binning configurations are:

A sequence of arrays describing the monotonically increasing bin edges along each dimension.
A single int describing the total number of bins that will be used in each dimension (this requires the range argument to be defined).
A sequence of ints describing the total number of bins to be used in each dimension (this requires the range argument to be defined).

When bins are described by arrays, the rightmost edge is included. Bins described by arrays also allow for non-uniform bin widths.
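A small sketch of both properties, using made-up points and edges: the bins have unequal widths, and a point lying exactly on the rightmost edge in each dimension is still counted.

```python
import numpy as np
import dask.array as da

# Four illustrative 2D points; the last one sits on the rightmost
# edge in both dimensions.
pts = np.array([
    [0.0, 0.0],
    [0.5, 2.0],
    [1.0, 10.0],
    [4.0, 10.0],
])
sample = da.from_array(pts, chunks=(2, 2))

# Non-uniform widths in the first dimension: [0, 1) and [1, 4].
edges = [np.array([0.0, 1.0, 4.0]), np.array([0.0, 5.0, 10.0])]

h, _ = da.histogramdd(sample, bins=edges)
result = h.compute()
print(result)
# [[2. 0.]
#  [0. 2.]]
```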

range : sequence of pairs, optional

A sequence of length D, each a (min, max) tuple giving the outer bin edges to be used if the edges are not given explicitly in bins. If defined, this argument is required to have an entry for each dimension. Unlike numpy.histogramdd(), if bins does not define bin edges, this argument is required (this function will not automatically use the min and max of the value in a given dimension because the input data may be lazy in dask).
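If the extent of the data is not known in advance, one option is to compute it explicitly and pass it as range. This is only a sketch: it costs an extra pass over the data, and the array shape, distribution, and bin counts are arbitrary.

```python
import dask
import dask.array as da

x = da.random.uniform(-2, 3, size=(1000, 2), chunks=(250, 2))

# Compute the per-dimension minimum and maximum in a single pass.
lo, hi = dask.compute(x.min(axis=0), x.max(axis=0))
ranges = tuple(zip(lo.tolist(), hi.tolist()))  # one (min, max) per dimension

h, edges = da.histogramdd(x, bins=(5, 5), range=ranges)
total = int(h.sum().compute())
```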

normed : bool, optional

An alias for the density argument that behaves identically. To avoid confusion with the broken normed argument to histogram, density should be preferred.

weights : dask.array.Array, optional

An array of values weighting each sample in the input data. The chunks of the weights must be identical to the chunking along the 0th (row) axis of the data sample.

density : bool, optional

If False (default), the returned array represents the number of samples in each bin. If True, the returned array represents the probability density function at each bin.
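One way to sanity check the density output is to multiply each bin value by the bin volume and sum, which should give approximately 1. The data and the non-uniform edges below are illustrative assumptions.

```python
import numpy as np
import dask.array as da

x = da.random.uniform(0, 1, size=(1000, 2), chunks=(250, 2))
bin_edges = [np.array([0.0, 0.2, 1.0]), np.array([0.0, 0.5, 1.0])]

h, _ = da.histogramdd(x, bins=bin_edges, density=True)

# Area of each 2D bin: outer product of the per-dimension widths.
widths = [np.diff(e) for e in bin_edges]
areas = np.outer(widths[0], widths[1])

# Integrating the density over all bins.
integral = (h.compute() * areas).sum()
```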

Returns:

dask.array.Array: The values of the histogram.
list(dask.array.Array): Sequence of arrays representing the bin edges along each dimension.
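Both return values are lazy and need to be computed before inspection. The sketch below, using randomly generated data and arbitrary edges, compares them with the output of numpy.histogramdd on the same in-memory array.

```python
import numpy as np
import dask.array as da

data = np.random.default_rng(42).normal(size=(500, 3))
sample = da.from_array(data, chunks=(100, 3))
bin_edges = [np.linspace(-4, 4, 9)] * 3  # 8 bins per dimension

h, edges = da.histogramdd(sample, bins=bin_edges)
h_np, edges_np = np.histogramdd(data, bins=bin_edges)

counts_match = np.array_equal(h.compute(), h_np)
edges_match = all(
    np.array_equal(e.compute(), e_np) for e, e_np in zip(edges, edges_np)
)
```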

See also

histogram

Examples

Computing the histogram in 5 blocks using different bin edges along each dimension:

>>> import numpy as np
>>> import dask.array as da
>>> x = da.random.uniform(0, 1, size=(1000, 3), chunks=(200, 3))
>>> edges = [
...     np.linspace(0, 1, 5), # 4 bins in 1st dim
...     np.linspace(0, 1, 6), # 5 in the 2nd
...     np.linspace(0, 1, 4), # 3 in the 3rd
... ]
>>> h, edges = da.histogramdd(x, bins=edges)
>>> result = h.compute()
>>> result.shape
(4, 5, 3)

Defining the bins by total number and their ranges, along with using weights:

>>> bins = (4, 5, 3)
>>> ranges = ((0, 1),) * 3  # expands to ((0, 1), (0, 1), (0, 1))
>>> w = da.random.uniform(0, 1, size=(1000,), chunks=x.chunksize[0])
>>> h, edges = da.histogramdd(x, bins=bins, range=ranges, weights=w)
>>> np.isclose(h.sum().compute(), w.sum().compute())
np.True_

Using a sequence of 1D arrays as the input:

>>> x = da.array([2, 4, 2, 4, 2, 4])
>>> y = da.array([2, 2, 4, 4, 2, 4])
>>> z = da.array([4, 2, 4, 2, 4, 2])
>>> bins = ([0, 3, 6],) * 3
>>> h, edges = da.histogramdd((x, y, z), bins)
>>> h
dask.array<sum-aggregate, shape=(2, 2, 2), dtype=float64, chunksize=(2, 2, 2), chunktype=numpy.ndarray>
>>> edges[0]
dask.array<array, shape=(3,), dtype=int64, chunksize=(3,), chunktype=numpy.ndarray>
>>> h.compute()
array([[[0., 2.],
        [0., 1.]],

       [[1., 0.],
        [2., 0.]]])
>>> edges[0].compute()
array([0, 3, 6])
>>> edges[1].compute()
array([0, 3, 6])
>>> edges[2].compute()
array([0, 3, 6])
