
A Beginner’s Guide to Using pd.crosstab in Pandas

KoshurAI
Jun 15, 2024

In the world of data analysis, summarizing categorical data efficiently is crucial. Pandas, a powerful Python library, offers a versatile function called pd.crosstab for this very purpose. If you're familiar with pivot tables in Excel, you'll find pd.crosstab remarkably similar and incredibly useful. This article will guide you through its basics with a simple, clear example.

What is pd.crosstab?

pd.crosstab is a function in Pandas that computes a cross-tabulation of two or more factors. By default, it returns a table that counts how often each combination of values occurs, which makes it particularly useful for understanding the relationship between categorical variables.
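To make "frequency distribution" concrete, here is a minimal sketch that compares pd.crosstab with the same counts built by hand using groupby. The Team and Result columns are hypothetical toy data invented for this illustration, not part of the article's dataset.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration
df = pd.DataFrame({
    "Team": ["A", "B", "A", "A"],
    "Result": ["Win", "Win", "Loss", "Win"],
})

# Cross-tabulation: count how often each (Team, Result) pair occurs
counts = pd.crosstab(df["Team"], df["Result"])

# The same counts built by hand: group, count rows, pivot Result into columns
manual = df.groupby(["Team", "Result"]).size().unstack(fill_value=0)

print(counts)
print(manual)
```

Both tables should hold the same numbers. pd.crosstab simply gets there in a single call.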

pd.crosstab Syntax

Here’s the basic syntax for pd.crosstab:

pd.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)
  • index: array-like, Series, or list of arrays/Series. Values to group by in the rows.
  • columns: array-like, Series, or list of arrays/Series. Values to group by in the columns.
  • values: array-like, optional. Array of values to aggregate according to the factors. Requires aggfunc to be specified.
  • aggfunc: function, optional. The function used to aggregate values. If specified, values must be supplied as well.
  • margins: bool, default False. Add row/column margins (subtotals).
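  • normalize: bool, {‘all’, ‘index’, ‘columns’}, or {0/1}, default False. Normalize by dividing all values by the sum of values. ‘all’ or True normalizes over the whole table, ‘index’ over each row, and ‘columns’ over each column.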
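The values and aggfunc parameters work as a pair, which is easiest to see in a small example. The sketch below assumes a hypothetical Score column (invented here, not part of the article's dataset) and computes the average score for each Gender and Favorite_Subject combination instead of a count.

```python
import pandas as pd

# Hypothetical data: the Score column is invented for illustration
df = pd.DataFrame({
    "Gender": ["F", "M", "M", "F"],
    "Favorite_Subject": ["Math", "Math", "Science", "Math"],
    "Score": [90, 70, 80, 70],
})

# values supplies the numbers to aggregate; aggfunc says how to combine them
mean_scores = pd.crosstab(
    index=df["Gender"],
    columns=df["Favorite_Subject"],
    values=df["Score"],
    aggfunc="mean",
)

print(mean_scores)
```

Combinations that never occur in the data (here, a female student whose favorite subject is Science) show up as NaN rather than 0, because there is nothing to average.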

For this article, we’ll focus on a straightforward use case with just two columns: Gender and Favorite_Subject.

Example: Summarizing Favorite Subjects by Gender

Let’s consider a simple dataset that captures students’ genders and their favorite subjects. We’ll use pd.crosstab to summarize this data.

Step-by-Step Guide

Import Pandas and Create the DataFrame

First, we’ll import Pandas and create a DataFrame with our sample data:

import pandas as pd

# Sample data
data = {
    'Gender': ['F', 'M', 'M', 'M', 'F', 'M'],
    'Favorite_Subject': ['Math', 'Math', 'Science', 'Math', 'Science', 'Science']
}

# Creating DataFrame
df = pd.DataFrame(data)

Generate the Crosstab

Next, we’ll use pd.crosstab to generate a summary table:

# Using pd.crosstab to summarize favorite subjects by gender
crosstab_result = pd.crosstab(index=df['Gender'], columns=df['Favorite_Subject'], margins=True)

print(crosstab_result)

Understanding the Output

The resulting crosstab would look like this:

Favorite_Subject  Math  Science  All
Gender
F                    1        1    2
M                    2        2    4
All                  3        3    6

This table provides a clear summary:

  • Female students have their favorite subjects equally split between Math and Science.
  • Male students have an equal split in their favorite subjects between Math and Science as well.
  • The ‘All’ column and row provide the totals for each category and the overall total.
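The equal split described above can also be read off as proportions rather than raw counts. As a small sketch using the same sample data, passing normalize="index" divides each row by its own total:

```python
import pandas as pd

# Same sample data as in the article
df = pd.DataFrame({
    "Gender": ["F", "M", "M", "M", "F", "M"],
    "Favorite_Subject": ["Math", "Math", "Science", "Math", "Science", "Science"],
})

# normalize="index" divides each row by its row total,
# so every row of the result sums to 1
row_shares = pd.crosstab(
    index=df["Gender"],
    columns=df["Favorite_Subject"],
    normalize="index",
)

print(row_shares)
```

For this dataset every cell comes out as 0.5, confirming that both groups are split evenly between Math and Science.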

Explanation of Parameters

  • index: We group by ‘Gender’, meaning the rows of our table will represent different genders.
  • columns: We group by ‘Favorite_Subject’, meaning the columns will represent different subjects.
  • margins: Setting margins=True includes totals for each row and column, making it easier to see the overall distribution.
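As the syntax section notes, index and columns also accept a list of Series, not just a single one. The sketch below assumes a hypothetical Grade column (invented for this illustration) and groups the rows by both Gender and Grade:

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["F", "M", "M", "M", "F", "M"],
    "Favorite_Subject": ["Math", "Math", "Science", "Math", "Science", "Science"],
    # Hypothetical extra column, not part of the article's dataset
    "Grade": [9, 9, 10, 10, 10, 9],
})

# Passing a list to index produces a two-level row index (Gender, Grade)
by_gender_grade = pd.crosstab(
    index=[df["Gender"], df["Grade"]],
    columns=df["Favorite_Subject"],
)

print(by_gender_grade)
```

Each row of the result now represents one Gender and Grade combination, while the columns still represent the subjects.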

Conclusion

pd.crosstab is a powerful and flexible tool for summarizing categorical data in Pandas. It's especially useful for quick exploratory data analysis. By understanding and utilizing pd.crosstab, you can efficiently create summary tables that provide valuable insights into your data.

Experiment with different datasets and parameters to see how pd.crosstab can help you in your data analysis tasks. Happy coding!
