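The chi-squared test is a widely used statistical test for determining whether the observed frequencies in one or more categorical variables differ significantly from the frequencies expected under some hypothesis. The form covered here, the goodness-of-fit test, compares the observed counts of a single categorical variable with a hypothesized distribution. A related form, the test of independence, asks whether two categorical variables are associated.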
Python is a popular programming language with a number of libraries for statistical analysis, one of which is scipy. The scipy library has a function called scipy.stats.chisquare that can be used to perform a chi-squared goodness-of-fit test.
To understand how the chi-squared test works, let’s consider an example. Imagine that we are conducting a survey to determine the preferred type of ice cream among a group of individuals. We ask 100 individuals to select their preferred type of ice cream from a list of five options: vanilla, chocolate, strawberry, mint, and cookies and cream. After collecting the data, we observe the following frequencies: vanilla (25), chocolate (30), strawberry (20), mint (15), and cookies and cream (10).
We want to know if there is a significant difference between the observed frequencies and the expected frequencies, assuming that all five types of ice cream are equally popular. To perform the chi-squared test, we first need to calculate the expected frequencies for each type of ice cream. In this case, since there are five types of ice cream and we surveyed 100 individuals, the expected frequency for each type of ice cream is 20 (100 / 5 = 20).
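As a rough sketch of the arithmetic behind the test, the expected counts and the test statistic (the sum over all categories of (observed - expected)^2 / expected) can be worked out by hand in plain Python. The variable names below are illustrative only and are not part of scipy.

```python
# Observed counts from the survey, in the order:
# vanilla, chocolate, strawberry, mint, cookies and cream
observed = [25, 30, 20, 15, 10]

# Under the equal-popularity assumption, every flavor gets the same share
total = sum(observed)                               # 100 respondents
expected = [total / len(observed)] * len(observed)  # 20.0 for each flavor

# Chi-squared statistic: sum of (observed - expected)^2 / expected
chi2_by_hand = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(expected)      # [20.0, 20.0, 20.0, 20.0, 20.0]
print(chi2_by_hand)  # 12.5
```

The larger this statistic, the further the observed counts are from what the equal-popularity assumption predicts.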
Now that we have the observed and expected frequencies, we can use the scipy.stats.chisquare function to perform the chi-squared test. We pass it the observed frequencies and, through the f_exp argument, the expected frequencies. If f_exp is omitted, the function assumes that all categories are equally likely. Here is the code:
from scipy import stats
# Observed frequencies
observed_frequencies = [25, 30, 20, 15, 10]
# Expected frequencies
expected_frequencies = [20, 20, 20, 20, 20]
# Perform the chi-squared test
chi2, p = stats.chisquare(observed_frequencies, f_exp=expected_frequencies)
# Print the test statistic and p-value
print("Chi-squared test statistic:", chi2)
print("p-value:", p)
The function returns two values: the chi-squared test statistic and the p-value. The p-value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming that the null hypothesis (that the observed frequencies follow the expected distribution) is true. A small p-value (conventionally less than 0.05) indicates that the observed frequencies differ significantly from the expected frequencies, so the null hypothesis can be rejected.
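To make the link between the statistic and the p-value concrete, here is a minimal sketch that recomputes the p-value from the chi-squared distribution directly. It assumes the default behavior of scipy.stats.chisquare, where the degrees of freedom equal the number of categories minus one. The names statistic, degrees_of_freedom, and p_value are chosen for illustration.

```python
from scipy import stats

statistic = 12.5          # test statistic for the ice cream counts
degrees_of_freedom = 5 - 1  # five categories, minus one

# Survival function: probability of a value at least this large
# under a chi-squared distribution with 4 degrees of freedom
p_value = stats.chi2.sf(statistic, degrees_of_freedom)

print("p-value:", round(p_value, 4))  # roughly 0.014
```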
In this example, the test statistic is 12.5 and the p-value is approximately 0.014. Because this is below 0.05, we can reject the null hypothesis at the 5% significance level and conclude that there is a significant difference between the observed frequencies and the expected frequencies. This suggests that not all types of ice cream are equally popular among the individuals surveyed.
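The decision step can also be written out explicitly. The sketch below is one possible way to do it, assuming a significance level of 0.05 (a common convention rather than a fixed rule). It also omits f_exp, relying on the function's default of equally likely categories, which matches the expected frequencies used in this example. The names alpha and decision are illustrative.

```python
from scipy import stats

observed_frequencies = [25, 30, 20, 15, 10]
alpha = 0.05  # assumed significance level; adjust to suit the study

# With f_exp omitted, chisquare assumes all categories are equally likely
chi2, p = stats.chisquare(observed_frequencies)

# Compare the p-value with the chosen significance level
if p < alpha:
    decision = "reject the null hypothesis"
else:
    decision = "fail to reject the null hypothesis"

print(f"chi2 = {chi2:.2f}, p = {p:.4f}: {decision}")
```

A stricter threshold changes the outcome: with alpha set to 0.01, the same data would not lead to rejecting the null hypothesis.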