Using SQL to Find the Correlation Between Two Columns: A Practical Guide

KoshurAI
3 min readOct 21, 2024

--

In the world of data analysis, understanding relationships between variables is crucial for making informed decisions. One of the most common ways to explore these relationships is by calculating correlation, which measures the strength and direction of a linear relationship between two variables.

While Python and R are often the go-to tools for calculating correlation, SQL, a powerful language for working with data, can also be used for this purpose. In this article, I will show you how to calculate the correlation between two columns in SQL.

What is Correlation?

Correlation, typically measured by Pearson’s correlation coefficient, ranges from -1 to +1:

  • A correlation of +1 means a perfect positive linear relationship.
  • A correlation of -1 indicates a perfect negative linear relationship.
  • A correlation of 0 implies no linear relationship between the variables.

The Formula for Correlation

The Pearson correlation coefficient rrr between two variables XXX and YYY is calculated using the formula:

Where:

  • cov (X,Y) is the covariance between X and Y.
  • σX​ and σY​ are the standard deviations of X and Y, respectively.

Calculating Correlation in SQL

While SQL doesn’t provide a built-in function to directly calculate correlation, we can break down the formula and compute the correlation coefficient step by step. Here’s how you can do it.

1. Sample Dataset

Let’s say we have a table called sales_data with two columns:

  • price: The price of a product.
  • quantity_sold: The number of units sold.
CREATE TABLE sales_data (
id INT PRIMARY KEY,
price DECIMAL(10,2),
quantity_sold INT
);

2. SQL Query to Calculate Correlation

To calculate the correlation between price and quantity_sold, we need to compute the covariance and the standard deviations. Here’s the query:

SELECT 
(COUNT(*) * SUM(price * quantity_sold) - SUM(price) * SUM(quantity_sold)) /
(SQRT(COUNT(*) * SUM(POWER(price, 2)) - POWER(SUM(price), 2)) *
SQRT(COUNT(*) * SUM(POWER(quantity_sold, 2)) - POWER(SUM(quantity_sold), 2)))
AS correlation_coefficient
FROM sales_data;

Breaking Down the Query:

  • COUNT(*): The total number of rows.
  • SUM(price * quantity_sold): The sum of the products of price and quantity_sold.
  • SUM(price): The sum of the price column.
  • SUM(quantity_sold): The sum of the quantity_sold column.
  • SUM(POWER(price, 2)): The sum of squared values of price.
  • SUM(POWER(quantity_sold, 2)): The sum of squared values of quantity_sold.
  • SQRT(...): Computes the square root, which is used to calculate the standard deviations.

The formula here is based on the Pearson correlation coefficient formula, with SQL syntax to calculate covariance and standard deviations.

3. Example Output

Let’s assume the following sales_data:

id price quantity_sold
1 10.00 100
2 20.00 80
3 15.00 90
4 25.00 70
5 30.00 60

Running the SQL query on this dataset would return the correlation coefficient between the price and quantity_sold. For this hypothetical example, the correlation might be negative, indicating that as the price increases, the number of units sold decreases (which is a common economic trend).

Interpreting the Results

Once you get the correlation coefficient:

  • If the value is close to +1, there’s a strong positive relationship between price and quantity sold (higher price, more units sold).
  • If the value is close to -1, there’s a strong negative relationship (higher price, fewer units sold).
  • If the value is near 0, it means there’s no significant linear relationship between the two variables.

Why Use SQL for Correlation?

While SQL isn’t often the first choice for statistical analysis, using SQL for correlation can be useful when:

  • You are working with large datasets that are stored in databases.
  • You want to calculate correlation directly within your database without exporting data to external tools like Python or Excel.
  • You are building data pipelines where preliminary analysis is done directly in SQL before further processing.

Conclusion

Calculating correlation in SQL can be a bit more manual than using tools like Python, but it’s a powerful skill to have when working with databases. By leveraging SQL’s aggregation functions and arithmetic capabilities, we can easily compute correlation and better understand relationships between variables.

This method is especially useful when dealing with large datasets in production databases, where exporting to another tool might be impractical or slow.

Give it a try with your own datasets and explore how SQL can help you uncover patterns and relationships!

--

--

KoshurAI
KoshurAI

Written by KoshurAI

Passionate about Data Science? I offer personalized data science training and mentorship. Join my course today to unlock your true potential in Data Science.

No responses yet