Python Pandas Categorize Data: All You NEED to Know

Created On August 15, 2023

Pandas is one of the most popular Python libraries for data analysis, and categorizing data is one of the techniques analysts use most often. Grouping records into categories makes a dataset easier to analyze and visualize, and it can reduce the computation needed for statistical analysis. In this article, we shall learn how to categorize data in pandas.

What is Data Categorization?

Before moving on, we first need to understand what data categorization is. Data categorization means grouping data based on specific characteristics or properties. For example, suppose you have sales data for a product. You could categorize it by the region where each sale was made, or by product variant to see which one sells the most. Grouping the data this way helps you understand your business better and make better decisions.

How to Categorize Data in Pandas Using groupby

Pandas offers several ways to categorize data. One of the most common is the groupby() method. It takes a column name (or a list of column names) as its argument and returns a GroupBy object rather than a new data frame. Iterating over that object yields each group's key together with a data frame containing that group's rows.

Example (1)

# Import the pandas library as pd
import pandas as pd

# Create a dictionary of data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'Frank'],
    'Age': [25, 30, 35, 40, 45, 50],
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female', 'Male'],
    'Country': ['USA', 'Canada', 'USA', 'USA', 'Canada', 'Canada']
}

# Create a data frame using the dictionary
df = pd.DataFrame(data)

# Group the DataFrame by the 'Gender' column
grouped = df.groupby('Gender')

# Iterate over each group in the grouped object
for key, group in grouped:
    # Print the key (i.e., the gender) of the group
    print(key)
    # Print the group itself, which is a data frame containing all the rows with the current gender
    print(group)

Output:

Female

idx	Name	Age	Gender	Country
0	Alice	25	Female	USA
4	Emily	45	Female	Canada

Male

idx	Name	Age	Gender	Country
1	Bob	30	Male	Canada
2	Charlie	35	Male	USA
3	David	40	Male	USA
5	Frank	50	Male	Canada

Explanation:

In this example, we categorized the data by gender.
First, we imported the pandas library with the import statement. Next, we created the data dictionary and built a data frame from it using the DataFrame constructor.
Then we called the groupby() method to group the rows by the Gender column.
Finally, we iterated over the grouped object and printed each key along with its group.
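
Iterating is not the only way to look inside a grouped object. The sketch below reuses the data from Example (1) and shows two standard GroupBy methods, size() and get_group(). The variable names group_sizes and females are illustrative only.

```python
import pandas as pd

# Same data as Example (1)
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'Frank'],
    'Age': [25, 30, 35, 40, 45, 50],
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female', 'Male'],
    'Country': ['USA', 'Canada', 'USA', 'USA', 'Canada', 'Canada'],
})

grouped = df.groupby('Gender')

# Number of rows in each group, returned as a Series indexed by Gender
group_sizes = grouped.size()
print(group_sizes)

# Pull out a single group as a regular data frame
females = grouped.get_group('Female')
print(females)
```

size() is handy for a quick check of how the rows are distributed, while get_group() is useful when only one category is of interest.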

You can also group a data frame by multiple columns. For example, the following code groups the DataFrame by both the Gender and Country columns:

Example (2)

# Import the pandas library as pd
import pandas as pd

# Create a dictionary of data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'Frank'],
    'Age': [25, 30, 35, 40, 45, 50],
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female', 'Male'],
    'Country': ['USA', 'Canada', 'USA', 'USA', 'Canada', 'Canada']
}

# Create a data frame using the dictionary
df = pd.DataFrame(data)

# Group the DataFrame by both 'Gender' and 'Country' columns
grouped = df.groupby(['Gender', 'Country'])

# Iterate over each group in the grouped object
for key, group in grouped:
    # Print the key (i.e. a tuple of the current gender and country)
    print(key)
    # Print the group itself, which is a data frame containing all the rows with the current gender and country
    print(group)

Output:

('Female', 'Canada')

idx	Name	Age	Gender	Country
4	Emily	45	Female	Canada

('Female', 'USA')

idx	Name	Age	Gender	Country
0	Alice	25	Female	USA

('Male', 'Canada')

idx	Name	Age	Gender	Country
1	Bob	30	Male	Canada
5	Frank	50	Male	Canada

('Male', 'USA')

idx	Name	Age	Gender	Country
2	Charlie	35	Male	USA
3	David	40	Male	USA

Explanation:

In this example, we used the groupby() function to group the DataFrame by the Gender and Country columns.
The resulting groups are then printed to the console. We can see four groups: Female-Canada, Female-USA, Male-Canada, and Male-USA. Each group contains one or more rows that share the same values in the Gender and Country columns.
The example shows that we can group the data frame by multiple columns.
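
When the individual rows are not needed, the same multi-column grouping can be summarized as a count per combination. This is a minimal sketch using the data from Example (2). The names pair_counts and pair_table and the column label Count are chosen here for illustration.

```python
import pandas as pd

# Same data as Example (2)
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'Frank'],
    'Age': [25, 30, 35, 40, 45, 50],
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female', 'Male'],
    'Country': ['USA', 'Canada', 'USA', 'USA', 'Canada', 'Canada'],
})

# Count rows per (Gender, Country) pair; the result has a two-level index
pair_counts = df.groupby(['Gender', 'Country']).size()
print(pair_counts)

# reset_index turns the two index levels back into ordinary columns
pair_table = pair_counts.reset_index(name='Count')
print(pair_table)
```

The flat table produced by reset_index() is often easier to export or plot than the two-level Series.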

Using the apply Method With Grouped Object

Once you have grouped a data frame, you can apply various functions to the resulting groups. For example, you can calculate the mean, median, or sum of the values in each group. You can also apply custom functions to each group using the apply() function.

Example (3)

# Import the pandas library as pd
import pandas as pd

# Create a dictionary of data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'Frank'],
    'Age': [25, 30, 35, 40, 45, 50],
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female', 'Male'],
    'Country': ['USA', 'Canada', 'USA', 'USA', 'Canada', 'Canada']
}

# Create a data frame using the dictionary
df = pd.DataFrame(data)

# Group the DataFrame by the 'Gender' column
grouped = df.groupby('Gender')

# Calculate the mean age for each gender
print(grouped['Age'].mean())

Output:

Gender
Female    35.00
Male      38.75
Name: Age, dtype: float64

Explanation:

In this example, we have used the groupby() function to group the DataFrame by the Gender column. We then used the mean() function to calculate the mean age for each group. The resulting means are then printed to the console.
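
Example (3) uses the built-in mean(), but this section also mentions custom functions with apply(). The following is a minimal sketch of that idea on the same data. The helper age_span is hypothetical and written only for this illustration.

```python
import pandas as pd

# Same data as Example (3)
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'Frank'],
    'Age': [25, 30, 35, 40, 45, 50],
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female', 'Male'],
    'Country': ['USA', 'Canada', 'USA', 'USA', 'Canada', 'Canada'],
})


def age_span(ages):
    """Return the gap between the oldest and youngest age in one group."""
    return ages.max() - ages.min()


# apply() calls age_span once per group, passing that group's Age values
span_by_gender = df.groupby('Gender')['Age'].apply(age_span)
print(span_by_gender)
```

Selecting the Age column before calling apply() means the custom function receives a Series of ages for each gender rather than the whole group.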

How to Categorize Data in Pandas Using agg Method

You can also use the agg() function to apply multiple functions to each group. For example, the following code calculates the mean and median age for each gender:

Example (4)

# Import the pandas library as pd
import pandas as pd

# Create a dictionary of data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'Frank'],
    'Age': [27, 35, 32, 40, 28, 44],
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female', 'Male'],
    'Country': ['USA', 'Canada', 'USA', 'USA', 'Canada', 'Canada']
}

# Create a data frame using the dictionary
df = pd.DataFrame(data)

# Group the DataFrame by 'Gender'
grouped = df.groupby('Gender')

# Calculate the mean and median age for each gender
age_stats = grouped['Age'].agg(['mean', 'median'])

# Print the results
print(age_stats)

Output:

Gender	mean	median
Female	27.50	27.5
Male	37.75	37.5

Explanation:

In this example, we have used the groupby() function to group the DataFrame by the Gender column. We then used the agg() function to calculate each group’s mean and median age. The resulting means and medians are then printed to the console.
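
agg() can also aggregate different columns and give each result its own column name, using pandas named aggregation. The sketch below assumes the data from Example (4). The output column names mean_age, oldest, and members are made up for this illustration.

```python
import pandas as pd

# Same data as Example (4)
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'Frank'],
    'Age': [27, 35, 32, 40, 28, 44],
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female', 'Male'],
    'Country': ['USA', 'Canada', 'USA', 'USA', 'Canada', 'Canada'],
})

# Each keyword is an output column: (input column, aggregation function)
summary = df.groupby('Gender').agg(
    mean_age=('Age', 'mean'),
    oldest=('Age', 'max'),
    members=('Name', 'count'),
)
print(summary)
```

Naming the outputs up front avoids having to rename columns after the aggregation.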

How to Categorize Data in Pandas Using Cut Method

In addition to the groupby() function, Pandas provides other functions for categorizing data. For example, you can use the cut() function to bin values into discrete intervals. This is useful when working with numerical data you want to categorize into groups. For example, you could use the cut() function to categorize ages into different age ranges, such as 18-24, 25-34, 35-44, etc.

Here is an example of how to use the cut() function to bin ages into different age ranges:

Example (5)

import pandas as pd

# Create a data frame with ages
data = {'Age': [18, 25, 30, 35, 40, 45, 50, 55, 60]}
df = pd.DataFrame(data)

# Create bins for age ranges
bins = [18, 24, 34, 44, 54, 64]

# Create labels for age ranges
labels = ['18-24', '25-34', '35-44', '45-54', '55-64']

# Categorize ages into different age ranges
df['AgeRange'] = pd.cut(df['Age'], bins=bins, labels=labels)

print(df)

Output:

idx	Age	AgeRange
0	18	NaN
1	25	25-34
2	30	25-34
3	35	35-44
4	40	35-44
5	45	45-54
6	50	45-54
7	55	55-64
8	60	55-64

Explanation:

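In this example, we used the cut() function to categorize ages into age ranges. The bins parameter specifies the edges of the intervals, and the labels parameter specifies a label for each interval. The resulting data frame includes a new column called AgeRange, which contains the age range for each age. Note that the first row is NaN: by default cut() builds intervals that exclude the left edge and include the right edge, that is (18, 24], (24, 34], and so on, so the value 18 falls outside the first bin. Passing include_lowest=True makes the first interval include its left edge.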
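
As a sketch of how the lowest edge can be included, the variant below passes include_lowest=True to cut() and then counts the ages in each range with groupby(). It reuses the bins and labels from Example (5). The name range_counts is illustrative.

```python
import pandas as pd

# Same ages, bins, and labels as Example (5)
df = pd.DataFrame({'Age': [18, 25, 30, 35, 40, 45, 50, 55, 60]})
bins = [18, 24, 34, 44, 54, 64]
labels = ['18-24', '25-34', '35-44', '45-54', '55-64']

# include_lowest=True closes the first interval on the left, so 18 is binned
df['AgeRange'] = pd.cut(
    df['Age'], bins=bins, labels=labels, include_lowest=True
)
print(df)

# Count the ages in each range; observed=False keeps ranges with no rows
range_counts = df.groupby('AgeRange', observed=False).size()
print(range_counts)
```

Because cut() returns a categorical column, the result can be passed straight to groupby(), which combines both techniques from this article.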

Conclusion

Categorizing data is a powerful technique for working with data in Pandas. Whether you are grouping rows based on the values in one or more columns or binning numerical values into discrete intervals, Pandas provides a range of functions that make it easy to categorize data and extract meaningful insights from it. With these tools, you can quickly and easily analyze large datasets and gain new insights into your data.

Last Updated On August 15, 2023

by Asif Rahaman
