dask.array.stats.skewtest
- dask.array.stats.skewtest(a, axis=0, nan_policy='propagate')
This docstring was copied from scipy.stats.skewtest.
Some inconsistencies with the Dask version may exist.
Test whether the skewness of a sample differs from that of a normal distribution.
This function tests the null hypothesis that the skewness of the population that the sample was drawn from is the same as that of a corresponding normal distribution.
- Parameters
- a : array
The data to be tested. Must contain at least eight observations.
- axis : int or None, default: 0
If an int, the axis of the input along which to compute the statistic. The statistic of each axis-slice (e.g. row) of the input will appear in a corresponding element of the output. If None, the input will be raveled before computing the statistic.
- nan_policy : {‘propagate’, ‘omit’, ‘raise’}
Defines how to handle input NaNs.
propagate: if a NaN is present in the axis slice (e.g. row) along which the statistic is computed, the corresponding entry of the output will be NaN.
omit: NaNs will be omitted when performing the calculation. If insufficient data remains in the axis slice along which the statistic is computed, the corresponding entry of the output will be NaN.
raise: if a NaN is present, a ValueError will be raised.
- alternative : {‘two-sided’, ‘less’, ‘greater’}, optional (Not supported in Dask)
Defines the alternative hypothesis. Default is ‘two-sided’. The following options are available:
‘two-sided’: the skewness of the distribution underlying the sample is different from that of the normal distribution (i.e. 0)
‘less’: the skewness of the distribution underlying the sample is less than that of the normal distribution
‘greater’: the skewness of the distribution underlying the sample is greater than that of the normal distribution
New in version 1.7.0.
- keepdims : bool, default: False (Not supported in Dask)
If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array.
- Returns
- statistic : float
The computed z-score for this test.
- pvalue : float
The p-value for the hypothesis test.
Notes
The sample size must be at least 8.
Beginning in SciPy 1.9, np.matrix inputs (not recommended for new code) are converted to np.ndarray before the calculation is performed. In this case, the output will be a scalar or np.ndarray of appropriate shape rather than a 2D np.matrix. Similarly, while masked elements of masked arrays are ignored, the output will be a scalar or np.ndarray rather than a masked array with mask=False.
References
- [1] R. B. D’Agostino, A. J. Belanger and R. B. D’Agostino Jr., “A suggestion for using powerful and informative tests of normality”, American Statistician 44, pp. 316-321, 1990.
- [2] S. S. Shapiro and M. B. Wilk, “An analysis of variance test for normality (complete samples)”, Biometrika 52(3/4), pp. 591-611, 1965.
- [3] B. Phipson and G. K. Smyth, “Permutation P-values Should Never Be Zero: Calculating Exact P-values When Permutations Are Randomly Drawn”, Statistical Applications in Genetics and Molecular Biology 9(1), 2010.
Examples
Suppose we wish to infer from measurements whether the weights of adult human males in a medical study are not normally distributed [2]. The weights (lbs) are recorded in the array x below.
>>> import numpy as np
>>> x = np.array([148, 154, 158, 160, 161, 162, 166, 170, 182, 195, 236])
The skewness test from [1] begins by computing a statistic based on the sample skewness.
>>> from scipy import stats
>>> res = stats.skewtest(x)
>>> res.statistic
2.7788579769903414
Because normal distributions have zero skewness, the magnitude of this statistic tends to be low for samples drawn from a normal distribution.
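To make the statistic less of a black box, here is a minimal standard-library sketch of the transformation described in [1], which maps the sample skewness to an approximately standard normal z-score. The helper name skewtest_statistic is hypothetical, and the code is an illustration under the assumption of a 1-D, non-constant sample; it is not necessarily the exact code path that SciPy or Dask follows.

```python
import math

def skewtest_statistic(sample):
    """Z-score of the skewness test for a 1-D sample (hypothetical helper)."""
    n = len(sample)
    if n < 8:
        raise ValueError("the skewness test needs at least 8 observations")
    mean = sum(sample) / n
    # Biased central moments (divide by n, not n - 1).
    m2 = sum((v - mean) ** 2 for v in sample) / n
    m3 = sum((v - mean) ** 3 for v in sample) / n
    skewness = m3 / m2 ** 1.5  # assumes the sample is not constant (m2 > 0)
    # Scale the skewness by its approximate standard deviation under normality.
    y = skewness * math.sqrt((n + 1) * (n + 3) / (6.0 * (n - 2)))
    # Constants of the normalizing transformation, which depend only on n.
    beta2 = (3.0 * (n * n + 27 * n - 70) * (n + 1) * (n + 3)
             / ((n - 2.0) * (n + 5) * (n + 7) * (n + 9)))
    w2 = -1.0 + math.sqrt(2.0 * (beta2 - 1.0))
    delta = 1.0 / math.sqrt(0.5 * math.log(w2))
    alpha = math.sqrt(2.0 / (w2 - 1.0))
    # asinh(t) equals log(t + sqrt(t**2 + 1)).
    return delta * math.asinh(y / alpha)

weights = [148, 154, 158, 160, 161, 162, 166, 170, 182, 195, 236]
z = skewtest_statistic(weights)
print(round(z, 4))  # 2.7789, in line with the statistic reported above
```

The constants depend only on the sample size, so the data enter the statistic only through the sample skewness.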
The test is performed by comparing the observed value of the statistic against the null distribution: the distribution of statistic values derived under the null hypothesis that the weights were drawn from a normal distribution.
For this test, the null distribution of the statistic for very large samples is the standard normal distribution.
>>> import matplotlib.pyplot as plt
>>> dist = stats.norm()
>>> st_val = np.linspace(-5, 5, 100)
>>> pdf = dist.pdf(st_val)
>>> fig, ax = plt.subplots(figsize=(8, 5))
>>> def st_plot(ax):  # we'll reuse this
...     ax.plot(st_val, pdf)
...     ax.set_title("Skew Test Null Distribution")
...     ax.set_xlabel("statistic")
...     ax.set_ylabel("probability density")
>>> st_plot(ax)
>>> plt.show()
The comparison is quantified by the p-value: the proportion of values in the null distribution as extreme or more extreme than the observed value of the statistic. In a two-sided test, elements of the null distribution greater than the observed statistic and elements of the null distribution less than the negative of the observed statistic are both considered “more extreme”.
>>> fig, ax = plt.subplots(figsize=(8, 5))
>>> st_plot(ax)
>>> pvalue = dist.cdf(-res.statistic) + dist.sf(res.statistic)
>>> annotation = (f'p-value={pvalue:.3f}\n(shaded area)')
>>> props = dict(facecolor='black', width=1, headwidth=5, headlength=8)
>>> _ = ax.annotate(annotation, (3, 0.005), (3.25, 0.02), arrowprops=props)
>>> i = st_val >= res.statistic
>>> ax.fill_between(st_val[i], y1=0, y2=pdf[i], color='C0')
>>> i = st_val <= -res.statistic
>>> ax.fill_between(st_val[i], y1=0, y2=pdf[i], color='C0')
>>> ax.set_xlim(-5, 5)
>>> ax.set_ylim(0, 0.1)
>>> plt.show()
>>> res.pvalue
0.005455036974740185
If the p-value is “small” - that is, if there is a low probability of sampling data from a normally distributed population that produces such an extreme value of the statistic - this may be taken as evidence against the null hypothesis in favor of the alternative: the weights were not drawn from a normal distribution. Note that:
- The inverse is not true; that is, the test is not used to provide evidence for the null hypothesis.
- The threshold for values that will be considered “small” is a choice that should be made before the data is analyzed [3] with consideration of the risks of both false positives (incorrectly rejecting the null hypothesis) and false negatives (failure to reject a false null hypothesis).
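Since the alternative parameter is marked as not supported in Dask, one-sided p-values may have to be derived from the statistic by hand. The sketch below does this under the asymptotic standard normal null distribution, using only the standard library. The helper names normal_sf and skewtest_pvalue are hypothetical, and the mapping of ‘less’ and ‘greater’ to the lower and upper tails follows the parameter descriptions above.

```python
import math

def normal_sf(z):
    """Survival function (1 - CDF) of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def skewtest_pvalue(z, alternative="two-sided"):
    """Asymptotic p-value for a skewness-test z-score (hypothetical helper)."""
    if alternative == "two-sided":
        # Both tails: values at least as far from zero as the observed z.
        return 2.0 * normal_sf(abs(z))
    if alternative == "greater":
        # Upper tail: evidence that the skewness is greater than zero.
        return normal_sf(z)
    if alternative == "less":
        # Lower tail: the CDF at z, written as the survival function at -z.
        return normal_sf(-z)
    raise ValueError("alternative must be 'two-sided', 'less' or 'greater'")

z = 2.7788579769903414  # statistic observed for the weights example
p_two_sided = skewtest_pvalue(z)
p_greater = skewtest_pvalue(z, "greater")
p_less = skewtest_pvalue(z, "less")
print(p_two_sided, p_greater, p_less)
```

For a positive statistic such as this one, the ‘greater’ p-value is half the two-sided p-value, and the ‘less’ p-value is its complement.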
Note that the standard normal distribution provides an asymptotic approximation of the null distribution; it is only accurate for samples with many observations. For small samples like ours, scipy.stats.monte_carlo_test may provide a more accurate, albeit stochastic, approximation of the exact p-value.
>>> def statistic(x, axis):
...     # get just the skewtest statistic; ignore the p-value
...     return stats.skewtest(x, axis=axis).statistic
>>> res = stats.monte_carlo_test(x, stats.norm.rvs, statistic)
>>> fig, ax = plt.subplots(figsize=(8, 5))
>>> st_plot(ax)
>>> ax.hist(res.null_distribution, np.linspace(-5, 5, 50),
...         density=True)
>>> ax.legend(['asymptotic approximation\n(many observations)',
...            'Monte Carlo approximation\n(11 observations)'])
>>> plt.show()
>>> res.pvalue
0.0062  # may vary
In this case, the asymptotic approximation and Monte Carlo approximation agree fairly closely, even for our small sample.