The point biserial correlation coefficient is a measure of the correlation between a binary variable (such as a yes/no or pass/fail variable) and a continuous variable. It is similar to the Pearson correlation coefficient, but is used specifically for this type of data.
The point biserial correlation coefficient is calculated as:
r = ((M(1) - M(0)) / s) * sqrt(n(1) * n(0) / n^2)
where:
- M(1) is the mean of the continuous variable for the observations where the binary variable is equal to 1
- M(0) is the mean of the continuous variable for the observations where the binary variable is equal to 0
- s is the population standard deviation of the continuous variable over all observations (dividing by n, not n - 1)
- n(1) is the number of observations of the continuous variable for which the binary variable is equal to 1
- n(0) is the number of observations of the continuous variable for which the binary variable is equal to 0
- n = n(1) + n(0) is the total number of observations
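The calculation can be sketched in plain Python using the standard group-means form of the coefficient: the difference between the two group means, divided by the population standard deviation of the continuous variable and scaled by the group sizes. The helper name point_biserial and the sample data are assumptions made for illustration; this is a minimal sketch, not a replacement for a library routine.

```python
import math


def point_biserial(binary, values):
    """Point biserial correlation from group means (illustrative helper)."""
    if len(binary) != len(values):
        raise ValueError("inputs must have the same length")
    if any(b not in (0, 1) for b in binary):
        raise ValueError("binary variable must be coded as 0 and 1")

    # Split the continuous values by the binary coding
    group1 = [v for b, v in zip(binary, values) if b == 1]
    group0 = [v for b, v in zip(binary, values) if b == 0]
    n1, n0 = len(group1), len(group0)
    n = n1 + n0
    if n1 == 0 or n0 == 0:
        raise ValueError("both groups need at least one observation")

    mean1 = sum(group1) / n1
    mean0 = sum(group0) / n0
    mean_all = sum(values) / n

    # Population standard deviation: divide by n, not n - 1
    s = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    if s == 0:
        raise ValueError("continuous variable is constant")

    return (mean1 - mean0) / s * math.sqrt(n1 * n0 / n ** 2)


# Assumed sample data for illustration
x = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y = [2.5, 3.2, 2.8, 3.7, 3.1, 2.9, 3.2, 2.5, 2.8, 3.0]

r = point_biserial(x, y)
print(f"Point biserial correlation: {r:.4f}")
```

The result is numerically the same as the Pearson correlation of the two lists with the binary variable coded 0/1, which is a handy way to check the computation.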
The point biserial correlation coefficient ranges from -1 to 1. With the binary variable coded 0/1, a positive value means the observations coded 1 tend to have higher values of the continuous variable, and a negative value means they tend to have lower values. Values close to 0 indicate little or no linear association.
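It’s worth noting that the point-biserial correlation assumes a naturally dichotomous variable. When the binary variable was created by splitting an underlying continuous variable at a threshold, the biserial correlation is usually recommended instead. With a small sample, both the coefficient and its p-value are unreliable.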
In Python, you can use the scipy library to calculate the point biserial correlation coefficient. The scipy.stats module includes the pointbiserialr() function, which calculates the coefficient between two variables.
Here is an example of how to use the pointbiserialr() function to calculate the point biserial correlation coefficient between a binary variable x and a continuous variable y:
from scipy.stats import pointbiserialr
# Sample data
x = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y = [2.5, 3.2, 2.8, 3.7, 3.1, 2.9, 3.2, 2.5, 2.8, 3.0]
# Calculate the point biserial correlation coefficient
corr, p = pointbiserialr(x, y)
print("Point biserial correlation coefficient:", corr)
print("p-value:", p)
The pointbiserialr() function returns two values: the point biserial correlation coefficient (corr) and the two-sided p-value (p). The p-value is the probability of observing a correlation at least as extreme as the one computed if the two variables were in fact uncorrelated. It is not the probability that the observed correlation is due to chance. By convention, a p-value of less than 0.05 is often treated as statistically significant.
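To see where the p-value comes from, it can help to reproduce it by hand. The sketch below assumes the usual test of zero correlation, t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom, and checks it against pointbiserialr() and pearsonr() on the same sample data. The variable names are illustrative, and the agreement shown here is what current SciPy versions are expected to give rather than a guarantee for every release.

```python
import math

from scipy import stats

# Same sample data as in the example above
x = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y = [2.5, 3.2, 2.8, 3.7, 3.1, 2.9, 3.2, 2.5, 2.8, 3.0]

r_pb, p_pb = stats.pointbiserialr(x, y)
r_pearson, p_pearson = stats.pearsonr(x, y)

n = len(x)
# t statistic for the null hypothesis of zero correlation
t_stat = r_pb * math.sqrt((n - 2) / (1 - r_pb ** 2))
# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_from_t = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print("pointbiserialr:", r_pb, p_pb)
print("pearsonr:      ", r_pearson, p_pearson)
print("t-based p:     ", p_from_t)
```

The three results should agree up to floating-point error, since the point biserial coefficient is the Pearson coefficient applied to a 0/1 variable. For this small sample the coefficient is close to zero and the p-value is well above 0.05, so the data give no evidence of an association.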