Statistics functions¶
This chapter describes the statistical functions in the library. The basic statistical functions include routines to compute the mean, variance and standard deviation. More advanced functions allow you to calculate absolute deviations, skewness, and kurtosis as well as the median and arbitrary percentiles.
The algorithms provided here use recurrence relations to compute average quantities in a stable way, without large intermediate values that might overflow. All functions work on any Python sequence (of an appropriate data type), but the kind of input data affects performance; see the section on speed considerations for the advantages and drawbacks of different kinds of input data.
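To illustrate what such a recurrence looks like, here is a minimal pure-Python sketch of a running mean. It is not the GSL source; the helper name running_mean is made up for this example, and the data are chosen so that a naive sum overflows.

```python
def running_mean(data):
    """Mean via the recurrence mean_k = mean_{k-1} + (x_k - mean_{k-1}) / k.

    Hypothetical helper for illustration only, not part of PyGSL.
    """
    mean = 0.0
    for k, x in enumerate(data, start=1):
        # Each update moves the mean by a fraction of the new deviation,
        # so no intermediate value grows beyond the data's own range.
        mean += (x - mean) / k
    return mean


big = [1e308, 1e308, 1e308]
print(running_mean(big))      # stays finite
print(sum(big) / len(big))    # the naive sum overflows to inf first
```

The recurrence never forms the full sum, which is why it stays within range where the naive formula does not.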
For details on the underlying implementation of these functions please consult the GNU Scientific Library reference manual.
Organization of the module¶
Individual parts of the GSL function names, which provide artificial namespaces in C, are mapped to modules and submodules in PyGSL. That is, gsl_stats_mean can be found as pygsl.statistics.mean and gsl_stats_long_mean as pygsl.statistics.long.mean.
The functions in the module are available in versions for datasets in the standard and NumPy floating-point and integer types. The generic versions, available directly in the module, use the generic GSL versions. The submodules use the GSL functions that match the submodule name, e.g. the long submodule uses the GSL functions for C long data.
Implemented submodules are char, uchar, short, int, long, float, and double. The last one also serves as the default and is used whenever you don't explicitly state a different datatype. In most cases it is appropriate to simply use the default implementation, as it covers the widest range of real values, offers high precision, and is therefore simple to use. If you have a sequence of all integer values it is straightforward to use the long functions, as these use an implementation corresponding to Python's int type. These submodules represent all numeric datatypes available in Python (int, float), except complex, which has no representation in standard C, as well as all numeric datatypes available in NumPy that have corresponding implementations in GSL (on 32-bit systems these are: Character, UnsignedInt8, Int16, Int32, Int, Float32, Float).
Available functions¶
Mean, Standard Deviation, and Variance¶
mean(x) Arithmetic mean (sample mean) of x:
\[\hat\mu = \frac{1}{N} \sum_{i=1}^{N} x_i\]
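variance(x) Estimated (sample) variance of x:
\[\hat\sigma^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \hat\mu)^2\]
This function computes the mean via a call to mean. If you have already computed the mean then you can pass it directly to variance_m.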
variance_m(x, mean) Estimated (sample) variance of x relative to mean:
\[\hat\sigma^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - mean)^2\]
sd(x)
sd_m(x, mean) The standard deviation is defined as the square root of the variance of x. These functions return the square root of the respective variance functions above.
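The sample variance and standard deviation can be written out in a few lines of plain Python. This is only a sketch of the textbook formulas with the 1/(N-1) normalization, not the PyGSL code path; the functions below are local stand-ins that merely share the documented names.

```python
import math


def mean(x):
    """Arithmetic mean of a sequence."""
    return sum(x) / len(x)


def variance_m(x, m):
    """Sample variance about a precomputed mean m, using the 1/(N-1) factor."""
    return sum((xi - m) ** 2 for xi in x) / (len(x) - 1)


def variance(x):
    """Sample variance; computes the mean first, then reuses variance_m."""
    return variance_m(x, mean(x))


def sd(x):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(variance(x))


data = [2, 4, 4, 4, 5, 5, 7, 9]
m = mean(data)
print(m, variance_m(data, m), sd(data))
```

Passing a precomputed mean to variance_m avoids a second pass over the data when the mean is already known.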
variance_with_fixed_mean(x, mean) Compute an unbiased estimate of the variance of x when the population mean mean of the underlying distribution is known a priori. In this case the estimator for the variance uses the factor \(1/N\) and the sample mean \(\hat\mu\) is replaced by the known population mean \(\mu\):
\[\hat\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2\]
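As a small illustration of the difference in normalization, the following pure-Python sketch evaluates the fixed-mean estimator with its 1/N factor. It is a stand-in written for this example, not the PyGSL implementation, and the sample values are arbitrary.

```python
def variance_with_fixed_mean(x, mu):
    """Variance estimate about a known population mean mu, using 1/N."""
    return sum((xi - mu) ** 2 for xi in x) / len(x)


samples = [1, 2, 3, 4]
known_mu = 2.5  # assumed to be known a priori for this example
print(variance_with_fixed_mean(samples, known_mu))
```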
Absolute deviation¶
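absdev(data) Compute the absolute deviation from the mean of data. The absolute deviation from the mean is defined as
\[absdev = \frac{1}{N} \sum_{i=1}^{N} |x_i - \hat\mu|\]
where \(x_i\) are the elements of the dataset data. The absolute deviation from the mean provides a more robust measure of the width of a distribution than the variance. This function computes the mean of data via a call to mean.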
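absdev_m(data, mean) Compute the absolute deviation of the dataset data relative to the given value of mean:
\[absdev = \frac{1}{N} \sum_{i=1}^{N} |x_i - mean|\]
This function is useful if you have already computed the mean of data (and want to avoid recomputing it), or wish to calculate the absolute deviation relative to another value (such as zero, or the median).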
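A short pure-Python sketch shows the absolute deviation taken about the mean and about another reference value (here the median). The functions are illustrative stand-ins that reuse the documented names; they are not the PyGSL code.

```python
def absdev_m(data, center):
    """Mean absolute deviation of data about the given center."""
    return sum(abs(x - center) for x in data) / len(data)


def absdev(data):
    """Mean absolute deviation about the arithmetic mean."""
    return absdev_m(data, sum(data) / len(data))


values = [1, 2, 3, 4, 100]   # one outlier
median = 3                   # middle element of the sorted values
print(absdev(values))            # deviation about the mean
print(absdev_m(values, median))  # deviation about the median
```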
Higher moments (skewness and kurtosis)¶
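skew(data) Compute the skewness of data. The skewness is defined as
\[skew = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{x_i - \hat\mu}{\hat\sigma} \right)^3\]
where \(x_i\) are the elements of the dataset data. The skewness measures the asymmetry of the tails of a distribution.
The function computes the mean and estimated standard deviation of data via calls to mean and sd.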
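skew_m_sd(data, mean, sd) Compute the skewness of the dataset data using the given values of the mean mean and standard deviation sd:
\[skew = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{x_i - mean}{sd} \right)^3\]
This function is useful if you have already computed the mean and standard deviation of data and want to avoid recomputing them.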
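The skewness can be sketched in plain Python as the average of the cubed standardized deviations, using the sample standard deviation (1/(N-1) variance). This is an illustration of the definition under that assumption, not the PyGSL implementation.

```python
import math


def skew_m_sd(data, mean, sd):
    """Skewness from a precomputed mean and standard deviation."""
    return sum(((x - mean) / sd) ** 3 for x in data) / len(data)


def skew(data):
    """Skewness; computes the mean and sample standard deviation first."""
    n = len(data)
    m = sum(data) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in data) / (n - 1))
    return skew_m_sd(data, m, sd)


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 1, 10]
print(skew(symmetric))      # close to zero for symmetric data
print(skew(right_tailed))   # positive: long tail to the right
```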
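kurtosis(data) Compute the kurtosis of data. The kurtosis is defined as
\[kurtosis = \left( \frac{1}{N} \sum_{i=1}^{N} \left( \frac{x_i - \hat\mu}{\hat\sigma} \right)^4 \right) - 3\]
The kurtosis measures how sharply peaked a distribution is, relative to its width. The kurtosis is normalized to zero for a Gaussian distribution.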
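kurtosis_m_sd(data, mean, sd) This function computes the kurtosis of the dataset data using the given values of the mean mean and standard deviation sd:
\[kurtosis = \left( \frac{1}{N} \sum_{i=1}^{N} \left( \frac{x_i - mean}{sd} \right)^4 \right) - 3\]
This function is useful if you have already computed the mean and standard deviation of data and want to avoid recomputing them.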
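In the same spirit, the kurtosis (normalized so that a Gaussian gives zero) can be sketched as the average fourth power of the standardized deviations minus 3. The code assumes the sample standard deviation and is an illustration, not the PyGSL source.

```python
import math


def kurtosis_m_sd(data, mean, sd):
    """Excess kurtosis from a precomputed mean and standard deviation."""
    return sum(((x - mean) / sd) ** 4 for x in data) / len(data) - 3.0


def kurtosis(data):
    """Excess kurtosis; computes the mean and sample standard deviation first."""
    n = len(data)
    m = sum(data) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in data) / (n - 1))
    return kurtosis_m_sd(data, m, sd)


flat = [1, 2, 3, 4, 5]   # evenly spread values, flatter than a Gaussian
print(kurtosis(flat))    # negative for this flat dataset
```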
Autocorrelation¶
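lag1_autocorrelation(x) Computes the lag-1 autocorrelation of the dataset x:
\[a_1 = \frac{\sum_{i=2}^{n} (x_i - \hat\mu)(x_{i-1} - \hat\mu)}{\sum_{i=1}^{n} (x_i - \hat\mu)(x_i - \hat\mu)}\]
lag1_autocorrelation_m(x, mean) Computes the lag-1 autocorrelation of the dataset x using the given value of the mean mean.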
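The lag-1 autocorrelation is the sum of products of neighbouring deviations from the mean, divided by the sum of squared deviations. The sketch below evaluates that ratio directly in plain Python; it illustrates the definition and is not the PyGSL implementation.

```python
def lag1_autocorrelation_m(x, mean):
    """Lag-1 autocorrelation about a precomputed mean."""
    dev = [xi - mean for xi in x]
    # Products of each deviation with its predecessor (i = 2 .. n).
    num = sum(dev[i] * dev[i - 1] for i in range(1, len(dev)))
    den = sum(d * d for d in dev)
    return num / den


def lag1_autocorrelation(x):
    """Lag-1 autocorrelation; computes the mean first."""
    return lag1_autocorrelation_m(x, sum(x) / len(x))


trend = [1, 2, 3, 4, 5]
alternating = [1, -1, 1, -1]
print(lag1_autocorrelation(trend))        # positive: neighbours move together
print(lag1_autocorrelation(alternating))  # negative: neighbours alternate
```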
Covariance¶
covariance(x, y) Computes the covariance of the datasets x and y, which must be of the same length:
\[covar = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \hat{x})(y_i - \hat{y})\]
covariance_m(x, y, mean_x, mean_y) Computes the covariance of the datasets x and y using the given values of the means mean_x and mean_y. The datasets x and y must be of equal length.
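A plain-Python sketch of the sample covariance, with the 1/(n-1) factor and an explicit length check, might look as follows. The functions are illustrative stand-ins, and raising ValueError on unequal lengths is a choice made for this example rather than documented PyGSL behaviour.

```python
def covariance_m(x, y, mean_x, mean_y):
    """Sample covariance of x and y about precomputed means."""
    if len(x) != len(y):
        # Assumption for this sketch: unequal lengths are an error.
        raise ValueError("x and y must have the same length")
    total = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return total / (len(x) - 1)


def covariance(x, y):
    """Sample covariance; computes both means first."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return covariance_m(x, y, sum(x) / len(x), sum(y) / len(y))


xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]
print(covariance(xs, ys))
```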
Maximum and Minimum values¶
max(data) This function returns the maximum value in data. The maximum value is defined as the value of the element \(x_i\) which satisfies \(x_i \ge x_j\) for all \(j\).
If you want instead to find the element with the largest absolute magnitude you will need to apply ‘fabs’ or ‘abs’ to your data before calling this function.
min(data) This function returns the minimum value in data. The minimum value is defined as the value of the element \(x_i\) which satisfies \(x_i \le x_j\) for all \(j\).
If you want instead to find the element with the smallest absolute magnitude you will need to apply ‘fabs’ or ‘abs’ to your data before calling this function.
minmax(data) This function returns both the minimum and maximum values of data, determined in a single pass.
max_index(data) This function returns the index of the maximum value in data. The maximum value is defined as the value of the element \(x_i\) which satisfies \(x_i \ge x_j\) for all \(j\). When there are several equal maximum elements then the first one is chosen.
min_index(data) This function returns the index of the minimum value in data. The minimum value is defined as the value of the element \(x_i\) which satisfies \(x_i \le x_j\) for all \(j\). When there are several equal minimum elements then the first one is chosen.
minmax_index(data) This function returns the indexes of the minimum and maximum values of data, determined in a single pass.
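The single-pass search and the first-occurrence rule for ties can be sketched in plain Python. The function below is a stand-in written for this example; the return order (minimum index, then maximum index) follows the description above and is otherwise an assumption.

```python
def minmax_index(data):
    """Indexes of the minimum and maximum, found in one pass.

    Strict comparisons keep the first occurrence when values tie.
    """
    min_i = max_i = 0
    for i, x in enumerate(data):
        if x < data[min_i]:
            min_i = i
        if x > data[max_i]:
            max_i = i
    return min_i, max_i


values = [3, 1, 4, 1, 5, 5]
print(minmax_index(values))

# Largest absolute magnitude: take absolute values before searching.
signed = [2, -7, 5]
print(minmax_index([abs(v) for v in signed])[1])
```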
Median and Percentiles¶
The median and percentile functions described in this section operate on sorted data. For convenience we use “quantiles”, measured on a scale of 0 to 1, instead of percentiles (which use a scale of 0 to 100).
median_from_sorted_data(data) This function returns the median value of data. The elements of the array must be in ascending numerical order. There are no checks to see whether the data are sorted, so the data must always be sorted before calling this function.
When the dataset has an odd number of elements the median is the value of element (n-1)/2. When the dataset has an even number of elements the median is the mean of the two nearest middle values, elements (n-1)/2 and n/2. Since the algorithm for computing the median involves interpolation this function always returns a floating-point number, even for integer data types.
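The odd and even cases described above can be sketched in plain Python. This is an illustration of the rule with integer index arithmetic, not the PyGSL code, and the caller sorts the data first because the function itself does not check.

```python
def median_from_sorted_data(data):
    """Median of data that is already sorted in ascending order."""
    n = len(data)
    lhs = (n - 1) // 2   # lower middle index
    rhs = n // 2         # upper middle index (same as lhs when n is odd)
    if lhs == rhs:
        return float(data[lhs])
    return (data[lhs] + data[rhs]) / 2.0


odd = sorted([5, 1, 3])
even = sorted([4, 1, 3, 2])
print(median_from_sorted_data(odd))
print(median_from_sorted_data(even))
```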
quantile_from_sorted_data(data, F) This function returns a quantile value of data. The elements of the array must be in ascending numerical order. The quantile is determined by F, a fraction between 0 and 1. For example, to compute the value of the 75th percentile F should have the value 0.75.
There are no checks to see whether the data are sorted, so the data must always be sorted before calling this function.
The quantile is found by interpolation, using the formula
\[quantile = (1 - \delta)\, x_i + \delta\, x_{i+1}\]
where \(f\) is the fraction F, \(i\) is \(\lfloor (n-1)f \rfloor\) and \(\delta\) is \((n-1)f - i\).
Thus the minimum value of the array (data[0]) is given by F equal to zero, the maximum value (data[n-1]) is given by F equal to one and the median value is given by F equal to 0.5. Since the algorithm for computing quantiles involves interpolation this function always returns a floating-point number, even for integer data types.
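Linear interpolation between the two neighbouring elements at position (n-1)f can be sketched as follows. This is a plain-Python illustration under the assumption that the data are already sorted; it is not the PyGSL implementation.

```python
import math


def quantile_from_sorted_data(data, f):
    """Quantile of ascending-sorted data for a fraction f in [0, 1]."""
    n = len(data)
    pos = (n - 1) * f
    i = math.floor(pos)     # index of the lower neighbour
    delta = pos - i         # fractional distance to the upper neighbour
    if i >= n - 1:
        # f == 1 lands exactly on the last element; nothing to interpolate.
        return float(data[n - 1])
    return (1.0 - delta) * data[i] + delta * data[i + 1]


data = sorted([40, 10, 30, 20])
for f in (0.0, 0.5, 0.75, 1.0):
    print(f, quantile_from_sorted_data(data, f))
```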
Weighted Samples¶
The functions described in this section allow the computation of statistics for weighted samples. The functions accept an array of samples, \(x_i\), with associated weights, \(w_i\). Each sample \(x_i\) is considered as having been drawn from a Gaussian distribution with variance \(\sigma_i^2\). The sample weight \(w_i\) is defined as the reciprocal of this variance, \(w_i = 1/\sigma_i^2\). Setting a weight to zero corresponds to removing a sample from a dataset.
wmean(w, data) This function returns the weighted mean of the dataset data using the set of weights w. The weighted mean is defined as
\[\hat\mu = \frac{\sum w_i x_i}{\sum w_i}\]
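wvariance(w, data) This function returns the estimated variance of the dataset data, using the set of weights w. The estimated variance of a weighted dataset is defined as
\[\hat\sigma^2 = \frac{\sum w_i}{(\sum w_i)^2 - \sum (w_i^2)} \sum w_i (x_i - \hat\mu)^2\]
Note that this expression reduces to an unweighted variance with the familiar \(1/(N-1)\) factor when there are \(N\) equal non-zero weights.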
wvariance_m(w, data, wmean) This function returns the estimated variance of the weighted dataset data using the given weighted mean wmean.
wsd(w, data) The standard deviation is defined as the square root of the variance. This function returns the square root of the corresponding variance function above.
wsd_m(w, data, wmean) This function returns the square root of the corresponding variance function above.
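The weighted mean and the weighted variance estimate, with its normalization factor sum(w) / (sum(w)^2 - sum(w^2)), can be sketched in plain Python. The functions are stand-ins sharing the documented names and argument order (weights first); they are not the PyGSL code.

```python
import math


def wmean(w, data):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(wi * xi for wi, xi in zip(w, data)) / sum(w)


def wvariance_m(w, data, wm):
    """Weighted variance estimate about a precomputed weighted mean wm."""
    sw = sum(w)
    sw2 = sum(wi * wi for wi in w)
    scale = sw / (sw * sw - sw2)
    return scale * sum(wi * (xi - wm) ** 2 for wi, xi in zip(w, data))


def wvariance(w, data):
    """Weighted variance estimate; computes the weighted mean first."""
    return wvariance_m(w, data, wmean(w, data))


def wsd(w, data):
    """Weighted standard deviation: square root of the weighted variance."""
    return math.sqrt(wvariance(w, data))


# Equal weights reproduce the unweighted 1/(N-1) variance.
print(wvariance([2, 2, 2, 2], [1, 2, 3, 4]))
# A zero weight removes the corresponding sample.
print(wmean([1, 1, 0], [1, 3, 100]))
```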
wvariance_with_fixed_mean(w, data, mean) This function computes an unbiased estimate of the variance of the weighted dataset data when the population mean mean of the underlying distribution is known a priori. In this case the estimator for the variance replaces the sample mean \(\hat\mu\) by the known population mean \(\mu\),
\[\hat\sigma^2 = \frac{\sum w_i (x_i - \mu)^2}{\sum w_i}\]
wsd_with_fixed_mean(w, data, mean) The standard deviation is defined as the square root of the variance. This function returns the square root of the corresponding variance function above.
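When the population mean is known, the weighted estimate is simply the weighted average of the squared deviations from that mean. The following plain-Python sketch illustrates this; it is not the PyGSL implementation and the sample values are arbitrary.

```python
import math


def wvariance_with_fixed_mean(w, data, mu):
    """Weighted variance estimate about a known population mean mu."""
    return sum(wi * (xi - mu) ** 2 for wi, xi in zip(w, data)) / sum(w)


def wsd_with_fixed_mean(w, data, mu):
    """Square root of the fixed-mean weighted variance."""
    return math.sqrt(wvariance_with_fixed_mean(w, data, mu))


weights = [1, 3]
points = [0, 2]
print(wvariance_with_fixed_mean(weights, points, 1))
print(wsd_with_fixed_mean(weights, points, 1))
```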
wabsdev(w, data) This function computes the weighted absolute deviation from the weighted mean of data. The absolute deviation from the mean is defined as
\[absdev = \frac{\sum w_i |x_i - \hat\mu|}{\sum w_i}\]
wabsdev_m(w, data, wmean) This function computes the absolute deviation of the weighted dataset data about the given weighted mean wmean.
wskew(w, data) This function computes the weighted skewness of the dataset data:
\[skew = \frac{\sum w_i \left( (x_i - \hat\mu)/\hat\sigma \right)^3}{\sum w_i}\]
wskew_m_sd(w, data, mean, wsd) This function computes the weighted skewness of the dataset data using the given values of the weighted mean and weighted standard deviation, mean and wsd.
wkurtosis(w, data) This function computes the weighted kurtosis of the dataset data. The kurtosis is defined as
\[kurtosis = \frac{\sum w_i \left( (x_i - \hat\mu)/\hat\sigma \right)^4}{\sum w_i} - 3\]
wkurtosis_m_sd(w, data, mean, wsd) This function computes the weighted kurtosis of the dataset data using the given values of the weighted mean and weighted standard deviation, mean and wsd.
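The weighted higher moments can be sketched as weighted averages of the third and fourth powers of the standardized deviations. The plain-Python functions below take the weighted mean and standard deviation as arguments; they illustrate the definitions under that assumption and are not the PyGSL code.

```python
import math


def wskew_m_sd(w, data, wmean, wsd):
    """Weighted skewness from a given weighted mean and standard deviation."""
    total = sum(wi * ((xi - wmean) / wsd) ** 3 for wi, xi in zip(w, data))
    return total / sum(w)


def wkurtosis_m_sd(w, data, wmean, wsd):
    """Weighted excess kurtosis from a given weighted mean and standard deviation."""
    total = sum(wi * ((xi - wmean) / wsd) ** 4 for wi, xi in zip(w, data))
    return total / sum(w) - 3.0


equal_w = [1, 1, 1, 1, 1]
xs = [1, 2, 3, 4, 5]
m, s = 3.0, math.sqrt(2.5)   # mean and sample sd of xs for equal weights
print(wskew_m_sd(equal_w, xs, m, s))
print(wkurtosis_m_sd(equal_w, xs, m, s))
```

With equal weights both functions reduce to the unweighted skewness and kurtosis of the same data.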
Further Reading¶
See the GSL reference manual for a description of all available functions and the calculations they perform.
The standard reference for almost any topic in statistics is the multi-volume Advanced Theory of Statistics by Kendall and Stuart. Many statistical concepts can be more easily understood by a Bayesian approach. The book by Gelman, Carlin, Stern and Rubin gives comprehensive coverage of the subject. For physicists the Particle Data Group provides useful reviews of Probability and Statistics in the “Mathematical Tools” section of its Annual Review of Particle Physics.
Modules in Testing¶
Modules in this package are often reimplementations of an original package with significant changes to the original. The current rng implementation, for example, started its life here. The sf module implemented here will supersede the sf package in one of the next releases. You are encouraged to try the other modules to see whether they work, but use them with caution in your production code!