Chi-Squared Test for Machine Learning:
A common problem in applied machine learning is determining whether input features are relevant to the outcome to be predicted.
This is the problem of feature selection.
In the case of classification problems where input variables are also categorical, we can use statistical tests to determine whether the output variable is dependent or independent of the input variables. If independent, then the input variable is a candidate for a feature that may be irrelevant to the problem and removed from the dataset.
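To make the feature-selection use case concrete, below is a minimal sketch of chi-squared feature scoring, assuming scikit-learn is available; the ordinal-encoded features and class labels are invented for illustration (one-hot encoding is also common for nominal categories):

# a minimal sketch of chi-squared feature selection (hypothetical data)
from sklearn.feature_selection import SelectKBest, chi2

# hypothetical categorical features, ordinal-encoded as non-negative integers
X = [[0, 1, 2],
     [1, 0, 0],
     [0, 1, 1],
     [1, 0, 2],
     [0, 1, 0],
     [1, 0, 1]]
# hypothetical class labels
y = [0, 1, 0, 1, 0, 1]

# score each feature against the target and keep the two best
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.scores_)    # one chi-squared score per input feature
print(X_selected.shape)    # (6, 2)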
Pearson’s chi-squared statistical hypothesis test is an example of a test for independence between categorical variables.
In this post, we will discover the chi-squared statistical hypothesis test for quantifying the independence of pairs of categorical variables.
Pearson’s Chi-Squared Test:
The Pearson’s Chi-Squared test, or just Chi-Squared test for short, is named for Karl Pearson, although there are variations on the test.
The Chi-Squared test is a statistical hypothesis test that assumes (the null hypothesis) that the observed frequencies for a categorical variable match the expected frequencies for the categorical variable. The test calculates a statistic that has a chi-squared distribution, named for the Greek letter chi (χ), pronounced “ki” as in kite.
Consider an example with two categorical variables, Sex and Interest: the number of observations for a category (such as male and female) may or may not be the same. Nevertheless, we can calculate the expected frequency of observations in each Interest group and see whether the partitioning of interests by Sex results in similar or different frequencies.
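A contingency table for this kind of example can be tabulated from raw observations; the sketch below assumes pandas is available and uses invented data:

import pandas as pd

# hypothetical raw observations of the two categorical variables
df = pd.DataFrame({
    'Sex': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'Interest': ['Science', 'Art', 'Math', 'Science', 'Science', 'Art', 'Math', 'Art'],
})

# cross-tabulate observed frequencies for each (Sex, Interest) pair
table = pd.crosstab(df['Sex'], df['Interest'])
print(table)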
The Chi-Squared test does this for a contingency table, first calculating the expected frequencies for the groups, then determining whether the division of the groups, called the observed frequencies, matches the expected frequencies.
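To make the expected frequencies concrete: under independence, each cell’s expected frequency is its row total times its column total divided by the grand total. A small sketch in plain Python, using the same table as the worked example below:

# expected frequency per cell: row_total * col_total / grand_total
table = [[10, 20, 30],
         [6, 9, 17]]
row_totals = [sum(row) for row in table]        # [60, 32]
col_totals = [sum(col) for col in zip(*table)]  # [16, 29, 47]
n = sum(row_totals)                             # 92
expected = [[r * c / n for c in col_totals] for r in row_totals]
print(expected)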
The result of the test is a test statistic that has a chi-squared distribution and can be interpreted to reject or fail to reject the assumption or null hypothesis that the observed and expected frequencies are the same.
Python code for this post:
# chi-squared test with similar proportions
from scipy.stats import chi2_contingency
from scipy.stats import chi2
# contingency table of observed frequencies
table = [[10, 20, 30],
         [6, 9, 17]]
print(table)
# calculate the test statistic, p-value, degrees of freedom and expected frequencies
stat, p, dof, expected = chi2_contingency(table)
print('dof=%d' % dof)
print(expected)
# interpret the test statistic against the critical value
prob = 0.95
critical = chi2.ppf(prob, dof)
print('probability=%.3f, critical=%.3f, stat=%.3f' % (prob, critical, stat))
if abs(stat) >= critical:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')
# interpret the p-value against the significance level
alpha = 1.0 - prob
print('significance=%.3f, p=%.3f' % (alpha, p))
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')
Running the example first prints the contingency table. The test is then calculated, and the degrees of freedom (dof) are reported as 2, which makes sense given:
degrees of freedom: (rows - 1) * (cols - 1)
degrees of freedom: (2 - 1) * (3 - 1)
degrees of freedom: 1 * 2
degrees of freedom: 2
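The same computation can be read off the table’s shape in code; this short sketch assumes NumPy:

import numpy as np

table = np.array([[10, 20, 30],
                  [6, 9, 17]])
rows, cols = table.shape
dof = (rows - 1) * (cols - 1)
print(dof)  # 2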
Reference: https://machinelearningmastery.com