Copyright | (c) 2008 Don Stewart, 2009 Bryan O'Sullivan |
---|---|
License | BSD3 |
Maintainer | bos@serpentine.com |
Stability | experimental |
Portability | portable |
Safe Haskell | Safe-Inferred |
Language | Haskell2010 |
Commonly used sample statistics, also known as descriptive statistics.
Synopsis
- type Sample = Vector Double
- type WeightedSample = Vector (Double, Double)
- range :: Vector v Double => v Double -> Double
- mean :: Vector v Double => v Double -> Double
- welfordMean :: Vector v Double => v Double -> Double
- meanWeighted :: Vector v (Double, Double) => v (Double, Double) -> Double
- harmonicMean :: Vector v Double => v Double -> Double
- geometricMean :: Vector v Double => v Double -> Double
- centralMoment :: Vector v Double => Int -> v Double -> Double
- centralMoments :: Vector v Double => Int -> Int -> v Double -> (Double, Double)
- skewness :: Vector v Double => v Double -> Double
- kurtosis :: Vector v Double => v Double -> Double
- variance :: Vector v Double => v Double -> Double
- varianceUnbiased :: Vector v Double => v Double -> Double
- meanVariance :: Vector v Double => v Double -> (Double, Double)
- meanVarianceUnb :: Vector v Double => v Double -> (Double, Double)
- stdDev :: Vector v Double => v Double -> Double
- varianceWeighted :: Vector v (Double, Double) => v (Double, Double) -> Double
- stdErrMean :: Vector v Double => v Double -> Double
- fastVariance :: Vector v Double => v Double -> Double
- fastVarianceUnbiased :: Vector v Double => v Double -> Double
- fastStdDev :: Vector v Double => v Double -> Double
- covariance :: (Vector v (Double, Double), Vector v Double) => v (Double, Double) -> Double
- correlation :: (Vector v (Double, Double), Vector v Double) => v (Double, Double) -> Double
- pair :: (Vector v a, Vector v b, Vector v (a, b)) => v a -> v b -> v (a, b)
Types
type Sample = Vector Double Source #
Sample data.
type WeightedSample = Vector (Double, Double) Source #
Sample with weights. The first element of each pair is the data value, the second is its weight.
Descriptive functions
range :: Vector v Double => v Double -> Double Source #
O(n) Range. The difference between the largest and smallest elements of a sample.
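For example, assuming Data.Vector.Unboxed is imported qualified as U (the convention used by the examples further down this page), one would expect:
range $ U.fromList [4, 1, 9, 2.5] ==> 8.0   -- 9 - 1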
Statistics of location
mean :: Vector v Double => v Double -> Double Source #
O(n) Arithmetic mean. This uses Kahan-Babuška-Neumaier summation, so is more accurate than welfordMean unless the input values are very large.
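A hand-checkable example:
mean $ U.fromList [1, 2, 3, 4] ==> 2.5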
welfordMean :: Vector v Double => v Double -> Double Source #
O(n) Arithmetic mean. This uses Welford's algorithm to provide numerical stability, using a single pass over the sample data.
Compared to mean, this loses a surprising amount of precision unless the inputs are very large.
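For ordinary inputs it agrees with mean; for example, one would expect:
welfordMean $ U.fromList [1, 2, 3, 4] ==> 2.5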
meanWeighted :: Vector v (Double, Double) => v (Double, Double) -> Double Source #
O(n) Arithmetic mean for weighted sample. It uses a single-pass algorithm analogous to the one used by welfordMean.
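Each element is a (value, weight) pair, so for instance one would expect:
meanWeighted $ U.fromList [(10, 1), (20, 3)] ==> 17.5   -- (10*1 + 20*3) / (1 + 3)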
harmonicMean :: Vector v Double => v Double -> Double Source #
O(n) Harmonic mean. This algorithm performs a single pass over the sample.
geometricMean :: Vector v Double => v Double -> Double Source #
O(n) Geometric mean of a sample containing no negative values.
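Hand-checkable examples for both means (the geometric-mean result holds up to floating-point rounding):
harmonicMean  $ U.fromList [1, 4, 4]  ==> 2.0   -- 3 / (1 + 1/4 + 1/4)
geometricMean $ U.fromList [1, 4, 16] ==> 4.0   -- (1 * 4 * 16) ** (1/3)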
Statistics of dispersion
The variance — and hence the standard deviation — of a sample of fewer than two elements are both defined to be zero.
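For instance, one would expect both of the following to hold:
variance $ U.fromList []  ==> 0.0
variance $ U.fromList [5] ==> 0.0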
Functions over central moments
centralMoment :: Vector v Double => Int -> v Double -> Double Source #
Compute the kth central moment of a sample. The central moment is also known as the moment about the mean.
This function performs two passes over the sample, so is not subject to stream fusion.
For samples containing many values very close to the mean, this function is subject to inaccuracy due to catastrophic cancellation.
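For example, the second central moment is the maximum likelihood variance, so one would expect:
centralMoment 2 $ U.fromList [1, 2, 3, 4, 5] ==> 2.0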
centralMoments :: Vector v Double => Int -> Int -> v Double -> (Double, Double) Source #
Compute the kth and jth central moments of a sample.
This function performs two passes over the sample, so is not subject to stream fusion.
For samples containing many values very close to the mean, this function is subject to inaccuracy due to catastrophic cancellation.
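For example, computing the second and fourth central moments of the same sample in one call (values hand-computed):
centralMoments 2 4 $ U.fromList [1, 2, 3, 4, 5] ==> (2.0, 6.8)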
skewness :: Vector v Double => v Double -> Double Source #
Compute the skewness of a sample. This is a measure of the asymmetry of its distribution.
A sample with negative skew is said to be left-skewed. Most of its mass is on the right of the distribution, with the tail on the left.
skewness $ U.fromList [1,100,101,102,103] ==> -1.497681449918257
A sample with positive skew is said to be right-skewed.
skewness $ U.fromList [1,2,3,4,100] ==> 1.4975367033335198
A sample's skewness is not defined if its variance is zero.
This function performs two passes over the sample, so is not subject to stream fusion.
For samples containing many values very close to the mean, this function is subject to inaccuracy due to catastrophic cancellation.
kurtosis :: Vector v Double => v Double -> Double Source #
Compute the excess kurtosis of a sample. This is a measure of the "peakedness" of its distribution. A high kurtosis indicates that more of the sample's variance is due to infrequent severe deviations, rather than more frequent modest deviations.
A sample's excess kurtosis is not defined if its variance is zero.
This function performs two passes over the sample, so is not subject to stream fusion.
For samples containing many values very close to the mean, this function is subject to inaccuracy due to catastrophic cancellation.
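A hand-checkable example (m4 / m2^2 - 3 = 6.8 / 4 - 3, so the result is approximately):
kurtosis $ U.fromList [1, 2, 3, 4, 5] ==> -1.3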
Two-pass functions (numerically robust)
These functions use the compensated summation algorithm of Chan et al. for numerical robustness, but require two passes over the sample data as a result.
Because of the need for two passes, these functions are not subject to stream fusion.
variance :: Vector v Double => v Double -> Double Source #
Maximum likelihood estimate of a sample's variance. Also known as the population variance, where the denominator is n.
varianceUnbiased :: Vector v Double => v Double -> Double Source #
Unbiased estimate of a sample's variance. Also known as the sample variance, where the denominator is n-1.
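For example, for the sample [1..5] (sum of squared deviations 10), one would expect:
variance         $ U.fromList [1, 2, 3, 4, 5] ==> 2.0   -- denominator n = 5
varianceUnbiased $ U.fromList [1, 2, 3, 4, 5] ==> 2.5   -- denominator n - 1 = 4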
meanVariance :: Vector v Double => v Double -> (Double, Double) Source #
Calculate mean and maximum likelihood estimate of variance. This function should be used if both mean and variance are required since it will calculate mean only once.
meanVarianceUnb :: Vector v Double => v Double -> (Double, Double) Source #
Calculate mean and unbiased estimate of variance. This function should be used if both mean and variance are required since it will calculate mean only once.
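For example, combining the two estimates above with the mean in a single call:
meanVariance    $ U.fromList [1, 2, 3, 4, 5] ==> (3.0, 2.0)
meanVarianceUnb $ U.fromList [1, 2, 3, 4, 5] ==> (3.0, 2.5)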
stdDev :: Vector v Double => v Double -> Double Source #
Standard deviation. This is simply the square root of the unbiased estimate of the variance.
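For the same sample the unbiased variance is 2.5, so one would expect roughly:
stdDev $ U.fromList [1, 2, 3, 4, 5] ==> 1.5811...   -- sqrt 2.5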
varianceWeighted :: Vector v (Double, Double) => v (Double, Double) -> Double Source #
Weighted variance. This is a biased estimate.
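For example, for a two-element weighted sample of (value, weight) pairs (hand-computed; the weighted mean is 2), one would expect:
varianceWeighted $ U.fromList [(1, 1), (3, 1)] ==> 1.0   -- sum w*(x - 2)^2 / sum w = 2 / 2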
stdErrMean :: Vector v Double => v Double -> Double Source #
Standard error of the mean. This is the standard deviation divided by the square root of the sample size.
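Continuing the same five-element example (sample size 5), one would expect roughly:
stdErrMean $ U.fromList [1, 2, 3, 4, 5] ==> 0.7071...   -- 1.5811... / sqrt 5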
Single-pass functions (faster, less safe)
The functions prefixed with the name fast below perform a single pass over the sample data using Knuth's algorithm. They usually work well, but see below for caveats. These functions are subject to array fusion.
Note: in cases where most sample data is close to the sample's mean, Knuth's algorithm gives inaccurate results due to catastrophic cancellation.
fastVariance :: Vector v Double => v Double -> Double Source #
Maximum likelihood estimate of a sample's variance.
fastVarianceUnbiased :: Vector v Double => v Double -> Double Source #
Unbiased estimate of a sample's variance.
fastStdDev :: Vector v Double => v Double -> Double Source #
Standard deviation. This is simply the square root of the maximum likelihood estimate of the variance.
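On well-behaved data such as the small sample used above, one would expect these to agree with their two-pass counterparts:
fastVariance         $ U.fromList [1, 2, 3, 4, 5] ==> 2.0
fastVarianceUnbiased $ U.fromList [1, 2, 3, 4, 5] ==> 2.5
fastStdDev           $ U.fromList [1, 2, 3, 4, 5] ==> 1.4142...   -- sqrt 2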
Joint distributions
covariance :: (Vector v (Double, Double), Vector v Double) => v (Double, Double) -> Double Source #
Covariance of a sample of pairs. For an empty sample it is set to zero.
correlation :: (Vector v (Double, Double), Vector v Double) => v (Double, Double) -> Double Source #
Correlation coefficient for a sample of pairs. Also known as Pearson's correlation. For an empty sample it is set to zero.
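For example, for a perfectly linear sample of pairs (values hand-computed, up to floating-point rounding; the covariance shown assumes the divide-by-n form):
covariance  $ U.fromList [(1, 2), (2, 4), (3, 6)] ==> 1.3333...   -- 4/3
correlation $ U.fromList [(1, 2), (2, 4), (3, 6)] ==> 1.0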
pair :: (Vector v a, Vector v b, Vector v (a, b)) => v a -> v b -> v (a, b) Source #
Pair two samples. It's like zip but requires that both samples have equal size.
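For example, pair can be used to build the input for correlation from two separate samples (result up to floating-point rounding):
correlation (pair (U.fromList [1, 2, 3]) (U.fromList [2, 4, 6])) ==> 1.0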
References
- Chan, T. F.; Golub, G.H.; LeVeque, R.J. (1979) Updating formulae and a pairwise algorithm for computing sample variances. Technical Report STAN-CS-79-773, Department of Computer Science, Stanford University. ftp://reports.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf
- Knuth, D.E. (1998) The art of computer programming, volume 2: seminumerical algorithms, 3rd ed., p. 232.
- Welford, B.P. (1962) Note on a method for calculating corrected sums of squares and products. Technometrics 4(3):419–420. http://www.jstor.org/stable/1266577
- West, D.H.D. (1979) Updating mean and variance estimates: an improved method. Communications of the ACM 22(9):532–535. http://doi.acm.org/10.1145/359146.359153