Generating Descriptive Statistics in Python

last updated Jan 11, 2017

Descriptive statistics are statistics that summarize a sample. They range from the simple (mean, median, and mode) to the complex (skew, kurtosis). They can be a great way to get to know a sample and to direct further analysis.

Without external libraries

Many descriptive statistics are easy to generate with the core Python library alone. Python 3.4 introduced a dedicated statistics library with a few of the simpler functions: mean, median, mode, standard deviation, and variance. By way of illustration, here are all the descriptive statistics functions present in Python’s statistics library of as of version 3.6:

>>> import statistics
>>> data = [1,1,2,3,4,100]
>>> statistics.mean(data)
18.5
>>> statistics.median(data)
2.5
>>> statistics.mode(data)
1
>>> statistics.stdev(data)
39.943710393502506
>>> statistics.variance(data)
1595.5

Obviously, these functions are present more for convenience than because they would be genuinely difficult to implement manually. Let’s examine how we might throw them together ourselves. (I’ll ignore mean(), median(), and mode(), which are outright trivial.)

stdev() ($s$) and variance() ($s^2$) are nearly identical functions, the latter being no more than the square of the former. The (bias-corrected) sample standard deviation is defined as:¹

$s = \sqrt{\frac{\sum_{i=1}^{n}{\left( x_i-\bar{x}\right)^2}}{n-1}}$

They might be implemented as follows:

def stdev(data):
  sample_mean = sum(data)/len(data)
  difference_sum = sum((x-sample_mean)**2 for x in data)
  return (difference_sum/(len(data)-1))**0.5

def variance(data):
  sample_mean = sum(data)/len(data)
  difference_sum = sum((x-sample_mean)**2 for x in data)
  return difference_sum/(len(data)-1)

variance() could be implemented in terms of stdev(), but this can introduce very small rounding errors that are obvious when the expected variance is a nice number.

With `scipy`

The statistics library par excellence for Python is scipy.stats.

It contains a function for generating a whole bunch of descriptive statistics at once, scipy.stats.describe():

>>> from scipy import stats
>>> stats.describe([1,1,2,3,4,100])
DescribeResult(nobs=6, minmax=(1, 100), mean=18.5, variance=1595.5, skewness=1.7854344245054827, kurtosis=1.1938486347286092)

nobs stands for “number of observations” and, in the case of single-variable analysis, is equivalent to the output of len(). minmax, mean, and variance should be self-explanatory. Note that describe() does not return the sample standard deviation, probably to avoid redundancy; $s$ can be produced with either **0.5 or the math.sqrt() function.

skewness is a measure of the sample’s asymmetry. If skewness is positive, the sample is skewed right; if skewness is negative, the sample is skewed left. Our sample data set, [1,1,2,3,4,100], obviously has a positive skewness. Fundamentally, the “skew” of a distribution, sample, or population is a matter of visual inspection: a distribution is skewed in a direction if a frequency chart seems to be stretched in one direction or another. This means that calculating a scalar value for skew is always a hair arbitrary.

Nonetheless, there are standard ways of arriving at a value for skewness. SciPy implements the “moment” definition of sample skewness (represented here as $g_1$):²

$g_1 = \frac{m_3}{m_2^{3/2}}$

Where $m_k$ is the $k$th central moment of the data set or distribution function.³ The SciPy documentation describes their algorithm for generating the $k$th central moment symbolically:

$m_k = \frac{1}{n} \sum_{i = 1}^n (x_i - \bar{x})^k$

where $n$ is the sample size and $\bar{x}$ is the sample mean. This could be easily implemented, but scipy complicates things somewhat by employing exponentiation by squares rather than Python’s built-in exponentiation function. This could make a performance difference at scale, but for our purposes the old-fashioned exponentiation function works fine:

def moment(data, k):
  data_mean = sum(data)/len(data)
  return sum((x-data_mean)**k for x in data)/len(data)

def skewness(data):
  return moment(data, 3)/(moment(data, 2)**1.5)

kurtosis is a measure of the sharpness of the sample’s peak. A normal distribution has a kurtosis value of 3, so one version of kurtosis (the Fisher version) subtracts 3 from the value returned by the main equation:

$g_2 = \frac{m_4}{m_2^2}$

In Python, using the moment() function defined above:

def kurtosis(data):
  return moment(data,4)/(moment(data, 2)**2)

SciPy also includes options for bias correction, which uses k-statistics rather than unadjusted moments to compute $g_2$. From the documentation:

If bias is False then the kurtosis is calculated using k statistics to eliminate bias coming from biased moment estimators.⁴

Standard Deviation entry on Wolfram MathWorld. ↩
Skewness entry on Wolfram MathWorld. For precise information on SciPy’s implementation of the skew function, see the relevant code comments. ↩
More information on central moments can be found in the Central Moment article on Wolfram MathWorld. Note that in this case we’re dealing with the sample central moments, which are computed using the discrete rather than continuous methods. ↩
scipy.stats v0.18.1, ll. 1054-55. See also Wolfram MathWorld’s article on kurtosis and its article on k-statistics. ↩

Generating Descriptive Statistics in Python

Without external libraries

With scipy

With `scipy`