Pandas is one of the most popular Python libraries for data analysis. A technique that data analysts use often is categorizing data, that is, splitting it into groups. Grouped data is easier to summarize and visualize, and statistics can be computed for each group in a single step. In this article, we shall learn how to categorize data in pandas.
What is Data Categorization?
Before moving further, we first need to understand data categorization. Data categorization is grouping data based on specific characteristics or properties. For example, suppose you have sales data for the products you sell. You can categorize that data by the region where each product was sold, by product variant to see which one sells the most, and so on. Categorizing the data this way helps you understand your business better and make better decisions.
How to Categorize Data in Pandas Using groupby
Pandas offers several ways to categorize data. One of the most common is the groupby() method. It takes a column name (or a list of column names) as its argument and returns a GroupBy object rather than a new data frame. We can iterate over this object to access each group's key together with a data frame holding that group's rows.
Example (1)
# Import the pandas library as pd
import pandas as pd
# Create a dictionary of data
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'Frank'],
'Age': [25, 30, 35, 40, 45, 50],
'Gender': ['Female', 'Male', 'Male', 'Male', 'Female', 'Male'],
'Country': ['USA', 'Canada', 'USA', 'USA', 'Canada', 'Canada']
}
# Create a data frame using the dictionary
df = pd.DataFrame(data)
# Group the DataFrame by the 'Gender' column
grouped = df.groupby('Gender')
# Iterate over each group in the grouped object
for key, group in grouped:
    # Print the key (i.e., the gender) of the group
    print(key)
    # Print the group itself, which is a data frame containing all the rows with the current gender
    print(group)
Output:
Female
idx | Name | Age | Gender | Country |
---|---|---|---|---|
0 | Alice | 25 | Female | USA |
4 | Emily | 45 | Female | Canada |
Male
idx | Name | Age | Gender | Country |
---|---|---|---|---|
1 | Bob | 30 | Male | Canada |
2 | Charlie | 35 | Male | USA |
3 | David | 40 | Male | USA |
5 | Frank | 50 | Male | Canada |
Explanation:
- In this example, we have categorized the data based on gender.
- First, we imported the pandas library using the import statement. Next, we created the data dictionary and built a data frame from it using the DataFrame constructor of pandas.
- Next, we used the groupby() method to categorize the data by gender.
- Finally, we iterated over the grouped object to print each key and its group.
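Iterating is not the only way to look inside a grouped object. As a minimal sketch, reusing a trimmed version of the sample data above, the size() and get_group() methods of the GroupBy object can be used to count the rows per group and to pull out a single group. The variable names group_sizes and females are illustrative only.

```python
import pandas as pd

# Trimmed copy of the sample data used in Example (1)
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'Frank'],
    'Age': [25, 30, 35, 40, 45, 50],
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female', 'Male'],
})

grouped = df.groupby('Gender')

# size() returns a Series with the number of rows in each group
group_sizes = grouped.size()
print(group_sizes)

# get_group() returns the rows of one group as a regular data frame
females = grouped.get_group('Female')
print(females)
```

This is handy when only one category is of interest and a full loop over every group is unnecessary.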
You can also group a data frame by multiple columns. For example, the following code groups the DataFrame by both the Gender and Country columns:
Example (2)
# Import the pandas library as pd
import pandas as pd
# Create a dictionary of data
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'Frank'],
'Age': [25, 30, 35, 40, 45, 50],
'Gender': ['Female', 'Male', 'Male', 'Male', 'Female', 'Male'],
'Country': ['USA', 'Canada', 'USA', 'USA', 'Canada', 'Canada']
}
# Create a data frame using the dictionary
df = pd.DataFrame(data)
# Group the DataFrame by both 'Gender' and 'Country' columns
grouped = df.groupby(['Gender', 'Country'])
# Iterate over each group in the grouped object
for key, group in grouped:
    # Print the key (i.e. a tuple of the current gender and country)
    print(key)
    # Print the group itself, which is a data frame containing all the rows with the current gender and country
    print(group)
Output:
('Female', 'Canada')
idx | Name | Age | Gender | Country |
---|---|---|---|---|
4 | Emily | 45 | Female | Canada |
('Female', 'USA')
idx | Name | Age | Gender | Country |
---|---|---|---|---|
0 | Alice | 25 | Female | USA |
('Male', 'Canada')
idx | Name | Age | Gender | Country |
---|---|---|---|---|
1 | Bob | 30 | Male | Canada |
5 | Frank | 50 | Male | Canada |
('Male', 'USA')
idx | Name | Age | Gender | Country |
---|---|---|---|---|
2 | Charlie | 35 | Male | USA |
3 | David | 40 | Male | USA |
Explanation:
- In this example, we used the groupby() function to group the DataFrame by the Gender and Country columns.
- The resulting groups are then printed to the console. We can see four groups: Female-Canada, Female-USA, Male-Canada, and Male-USA. Each group contains one or more rows that share the same values in the Gender and Country columns.
- The example shows that we can group the data frame by multiple columns.
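When grouping by several columns, an aggregation returns a result indexed by every key combination. The following sketch, which assumes the same sample data, computes the mean age per gender and country and then flattens the result with reset_index(). The names mean_age and flat are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'Frank'],
    'Age': [25, 30, 35, 40, 45, 50],
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female', 'Male'],
    'Country': ['USA', 'Canada', 'USA', 'USA', 'Canada', 'Canada'],
})

# The result is a Series whose index has two levels: Gender and Country
mean_age = df.groupby(['Gender', 'Country'])['Age'].mean()
print(mean_age)

# reset_index() turns the two index levels back into ordinary columns
flat = mean_age.reset_index()
print(flat)
```

The flattened form is usually easier to export or plot, while the two-level index is convenient for lookups such as mean_age.loc[('Male', 'USA')].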
Using the apply Method With a Grouped Object
Once you have grouped a data frame, you can apply various functions to the resulting groups. For example, you can calculate the mean, median, or sum of the values in each group. You can also apply custom functions to each group using the apply() function.
Example (3)
# Import the pandas library as pd
import pandas as pd
# Create a dictionary of data
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'Frank'],
'Age': [25, 30, 35, 40, 45, 50],
'Gender': ['Female', 'Male', 'Male', 'Male', 'Female', 'Male'],
'Country': ['USA', 'Canada', 'USA', 'USA', 'Canada', 'Canada']
}
# Create a data frame using the dictionary
df = pd.DataFrame(data)
# Group the DataFrame by the 'Gender' column
grouped = df.groupby('Gender')
# Calculate the mean age for each gender
print(grouped['Age'].mean())
Output:
Gender
Female 35.00
Male 38.75
Name: Age, dtype: float64
Explanation:
In this example, we have used the groupby() function to group the DataFrame by the Gender column. We then used the mean() function to calculate the mean age for each group. The resulting means are then printed to the console.
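The example above uses the built-in mean() method. For a custom calculation, apply() can be called on the grouped column instead. The sketch below is one possible illustration: age_span is a hypothetical helper, not part of pandas, that returns the difference between the oldest and youngest person in a group.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'Frank'],
    'Age': [25, 30, 35, 40, 45, 50],
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female', 'Male'],
})


def age_span(ages):
    """Hypothetical helper: gap between the oldest and youngest age in a group."""
    return ages.max() - ages.min()


# apply() calls age_span once per group, passing that group's 'Age' values as a Series
spans = df.groupby('Gender')['Age'].apply(age_span)
print(spans)
```

Selecting the Age column before calling apply() keeps the custom function simple, because it receives a single Series per group rather than a whole data frame.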
How to Categorize Data in Pandas Using the agg Method
You can also use the agg() function to apply multiple functions to each group. For example, the following code calculates the mean and median age for each gender:
Example (4)
# Import the pandas library as pd
import pandas as pd
# Create a dictionary of data
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'Frank'],
'Age': [27, 35, 32, 40, 28, 44],
'Gender': ['Female', 'Male', 'Male', 'Male', 'Female', 'Male'],
'Country': ['USA', 'Canada', 'USA', 'USA', 'Canada', 'Canada']
}
# Create a data frame using the dictionary
df = pd.DataFrame(data)
# Group the DataFrame by 'Gender'
grouped = df.groupby('Gender')
# Calculate the mean and median age for each gender
age_stats = grouped['Age'].agg(['mean', 'median'])
# Print the results
print(age_stats)
Output:
Gender | mean | median |
---|---|---|
Female | 27.50 | 27.5 |
Male | 37.75 | 37.5 |
Explanation:
In this example, we have used the groupby() function to group the DataFrame by the Gender column. We then used the agg() function to calculate each group’s mean and median age. The resulting means and medians are then printed to the console.
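agg() can also summarize different columns at once and give each result column its own name, using what pandas calls named aggregation. The sketch below assumes the data from Example (4). The output column names mean_age, oldest, and members are chosen here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'Frank'],
    'Age': [27, 35, 32, 40, 28, 44],
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female', 'Male'],
})

# Each keyword is an output column; each value is (input column, aggregation)
summary = df.groupby('Gender').agg(
    mean_age=('Age', 'mean'),
    oldest=('Age', 'max'),
    members=('Name', 'count'),
)
print(summary)
```

Compared with passing a list of function names, this form makes the resulting column names explicit, which helps when the summary is passed on to other code.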
How to Categorize Data in Pandas Using the cut Function
In addition to the groupby() function, Pandas provides other functions for categorizing data. For example, you can use the cut() function to bin values into discrete intervals. This is useful when working with numerical data you want to categorize into groups. For example, you could use the cut() function to categorize ages into different age ranges, such as 18-24, 25-34, 35-44, etc.
Here is an example of how to use the cut() function to bin ages into different age ranges:
Example (5)
import pandas as pd
# Create a data frame with ages
data = {'Age': [18, 25, 30, 35, 40, 45, 50, 55, 60]}
df = pd.DataFrame(data)
# Create bins for age ranges
bins = [18, 24, 34, 44, 54, 64]
# Create labels for age ranges
labels = ['18-24', '25-34', '35-44', '45-54', '55-64']
# Categorize ages into different age ranges
df['AgeRange'] = pd.cut(df['Age'], bins=bins, labels=labels)
print(df)
Output:
idx | Age | AgeRange |
---|---|---|
0 | 18 | NaN |
1 | 25 | 25-34 |
2 | 30 | 25-34 |
3 | 35 | 35-44 |
4 | 40 | 35-44 |
5 | 45 | 45-54 |
6 | 50 | 45-54 |
7 | 55 | 55-64 |
8 | 60 | 55-64 |
Explanation:
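In this example, we have used the cut() function to categorize ages into different age ranges. The bins parameter specifies the edges of the intervals, and the labels parameter specifies the labels for the age ranges. The resulting data frame includes a new column called AgeRange, which contains the age range for each age. Note that the first row is NaN: by default, cut() builds intervals that exclude their left edge, so the first bin is (18, 24] and the age 18 falls outside it. Passing include_lowest=True makes the first interval include its left edge.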
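The NaN for age 18 in the output above comes from the first bin excluding its left edge. As a minimal sketch, the same call with include_lowest=True assigns 18 to the 18-24 range. The binned column can then be passed to groupby() like any other column. The variable name counts is illustrative, and observed=True is passed explicitly so that only age ranges that actually occur are reported.

```python
import pandas as pd

df = pd.DataFrame({'Age': [18, 25, 30, 35, 40, 45, 50, 55, 60]})

bins = [18, 24, 34, 44, 54, 64]
labels = ['18-24', '25-34', '35-44', '45-54', '55-64']

# include_lowest=True makes the first interval include its left edge (18)
df['AgeRange'] = pd.cut(df['Age'], bins=bins, labels=labels, include_lowest=True)
print(df)

# Count how many people fall into each age range
counts = df.groupby('AgeRange', observed=True).size()
print(counts)
```

Combining cut() with groupby() in this way lets numerical columns be summarized by range, just as the earlier examples summarized by gender or country.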
Conclusion
Categorizing data is a powerful technique for working with data in Pandas. Whether you are grouping rows based on the values in one or more columns or binning numerical values into discrete intervals, Pandas provides a range of functions that make it easy to categorize data and extract meaningful insights from it. With these tools, you can quickly and easily analyze large datasets and gain new insights into your data.