Performing a Chi-Squared Goodness of Fit Test in Python
The chi-squared goodness of fit test, also known as Pearson's chi-squared test, is used to assess whether a set of categorical data is consistent with a hypothesized distribution, that is, with proposed expected counts for each category.
The equation for computing the test statistic, \(\chi^2\), may be expressed as:

\[\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}\]

where \(O_i\) is the value observed in the sample and \(E_i\) is the value we would expect to see in the sample if the null hypothesis were true. For a more detailed discussion of the mechanics of performing a chi-squared test, have a look at NIST's Engineering Statistics Handbook.
The easiest way to implement this in Python is to use the scipy.stats.chisquare function, which is part of the SciPy scientific computing package. More advanced usage is covered in the linked documentation, but the basic usage is something along the lines of:
import numpy as np
from scipy.stats import chisquare

observed_values = np.array([18, 21, 16, 7, 15])
expected_values = np.array([22, 19, 44, 8, 16])
chisquare(observed_values, f_exp=expected_values)
This returns:
(18.943480861244019, 0.00080629555484801861)
The first value in the returned tuple is the \(\chi^2\) value itself, while the second value is the p-value, computed with \(\nu = k - 1\) degrees of freedom, where \(k\) is the number of values in each array.¹
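To see where that p-value comes from, you can recompute it directly from the test statistic using the chi-squared distribution in scipy.stats; chi2.sf is the survival function, i.e. \(1 - \mathrm{CDF}\):

from scipy.stats import chi2

# Upper-tail probability at the observed statistic, with
# nu = k - 1 = 4 degrees of freedom.
print(chi2.sf(18.943480861244019, df=4))  # ≈ 0.000806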
This is all well and good, but it isn’t terribly difficult to write your own function to calculate goodness of fit.
Calculating the value of the test statistic, \(\chi^2\), is simple:
def chisquare(observed_values, expected_values):
    # Sum (O - E)^2 / E over all categories.
    test_statistic = 0
    for observed, expected in zip(observed_values, expected_values):
        test_statistic += (float(observed) - float(expected))**2 / float(expected)
    return test_statistic
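Running this on the same observed and expected values as before reproduces the statistic SciPy computed:

observed_values = [18, 21, 16, 7, 15]
expected_values = [22, 19, 44, 8, 16]
print(chisquare(observed_values, expected_values))  # 18.943480861244019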
Things get a bit trickier when we want to compute the p-value for our test statistic. The cumulative distribution function (CDF) of a chi-squared distribution, for a chi-squared value of \(x\) and \(k\) degrees of freedom, is:

\[F(x; k) = \frac{\gamma\left(\frac{k}{2}, \frac{x}{2}\right)}{\Gamma\left(\frac{k}{2}\right)}\]

where \(\gamma\) is the lower incomplete gamma function, which can be evaluated with the power series

\[\gamma(s, z) = \sum_{k=0}^{\infty} \frac{(-1)^k \, z^{s+k}}{k! \, (s + k)}\]

In Python:
import math

def ilgf(s, z):
    # Lower incomplete gamma function gamma(s, z), evaluated from
    # its power series truncated at 100 terms.
    val = 0
    for k in range(100):
        val += ((-1)**k * z**(s + k)) / (math.factorial(k) * (s + k))
    return val
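If SciPy is available, a quick sanity check against its incomplete gamma routines is straightforward. Note that scipy.special.gammainc is the regularized lower incomplete gamma \(\gamma(s, z)/\Gamma(s)\), so we multiply by \(\Gamma(s)\) to compare; the test point is chosen arbitrarily for illustration:

from scipy.special import gamma, gammainc

s, z = 2.0, 1.5
print(ilgf(s, z))                 # series approximation, ≈ 0.44217
print(gammainc(s, z) * gamma(s))  # reference value,      ≈ 0.44217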
The gamma function \(\Gamma\) may be evaluated by numerical approximation of the improper integral:

\[\Gamma(x) = \int_0^{\infty} t^{x-1} e^{-t} \, dt\]
In Python:
def gf(x):
    # Approximate Gamma(x) as a left Riemann sum of t^(x-1) * e^(-t)
    # from 0 to upper_bound.
    upper_bound = 100.0
    resolution = 1000000.0
    step = upper_bound / resolution
    val = step  # start one step in to avoid 0**negative when x < 1
    rolling_sum = 0
    while val <= upper_bound:
        rolling_sum += step * (val**(x - 1) * math.exp(-val))
        val += step
    return rolling_sum
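You can gauge the accuracy of this approximation by comparing it against the exact gamma function in the standard library:

import math

print(gf(2.5))          # numerical approximation
print(math.gamma(2.5))  # exact value, ≈ 1.32934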
The two can then be combined into a function for the upper-tail probability of a chi-squared distribution (one minus the CDF, which is exactly the p-value) and an expanded version of the chisquare clone from above:
import math

def gf(x):
    # Approximate Gamma(x) as a left Riemann sum of t^(x-1) * e^(-t)
    # from 0 to upper_bound.
    # Play with these values to adjust the error of the approximation.
    upper_bound = 100.0
    resolution = 1000000.0
    step = upper_bound / resolution
    val = step  # start one step in to avoid 0**negative when x < 1
    rolling_sum = 0
    while val <= upper_bound:
        rolling_sum += step * (val**(x - 1) * math.exp(-val))
        val += step
    return rolling_sum

def ilgf(s, z):
    # Lower incomplete gamma function gamma(s, z), evaluated from
    # its power series truncated at 100 terms.
    val = 0
    for k in range(100):
        val += ((-1)**k * z**(s + k)) / (math.factorial(k) * (s + k))
    return val

def chisquarecdf(x, k):
    # Upper-tail probability (1 - CDF) of the chi-squared distribution
    # with k degrees of freedom; this is the p-value.
    return 1 - ilgf(k / 2, x / 2) / gf(k / 2)

def chisquare(observed_values, expected_values):
    # Test statistic: sum of (O - E)^2 / E over all categories.
    test_statistic = 0
    for observed, expected in zip(observed_values, expected_values):
        test_statistic += (float(observed) - float(expected))**2 / float(expected)
    df = len(observed_values) - 1
    return test_statistic, chisquarecdf(test_statistic, df)
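Running the hand-rolled version on the same data reproduces the SciPy result, up to the error of the numerical approximations:

observed_values = [18, 21, 16, 7, 15]
expected_values = [22, 19, 44, 8, 16]
print(chisquare(observed_values, expected_values))
# (18.943480861244019, ≈0.000806)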
And you’re done!
¹ \(\nu\) is the degrees of freedom in the sample; in this case, \(\nu = 4\).