How Can We Help?
Introduction
While dealing with the data frames, programmers often need to clean the data. This may also include sorting the data in ascending or descending order.
This article will explain how to sort the pandas data frame.
- Introduction
- What is Sorting in Python Pandas?
- How To Sort Using The sort_values Function
- How To Sort In Descending Order
- How To Sort Multiple Columns In Pandas DataFrame
- How To Deal With NaN In Pandas While Sorting
- How To Sort Pandas Data Frame Using Lambda Function
- Changing The sorting Algorithms
- Conclusion
What is Sorting in Python Pandas?
Sorting in Python Pandas is the process of arranging data in a specific order based on values in one or more columns. Pandas provides two main functions for sorting data: sort_values() and sort_index(). Other related functions include nlargest(), nsmallest(), and rank(). Sorting data in Pandas is essential for efficient data analysis.
How To Sort Using The sort_values Function
The most common technique to sort the data frame is to use the sort_values() function of pandas. The sort_values() is a built-in function of pandas. The function takes the column name as the parameter, then sorts the data frame based on the column.
We can sort both numbers as well as strings using the sort_values() function.
Example (1)
import pandas as pd
# Creating a dictionary with data
data = {'name': ['Alex', 'Bob', 'Charlie','David','Emma','Frank','George','Harry','Igor','Jack'],
'age': [35,26,55,48,22,32,44,39,50,40],
'city': ['Seattle','Chicago','New York','San Francisco','Los Angeles','Houston','Phoenix','Philadelphia','San Diego','Austin'],
'gender':['M','M','M','M','F','M','M','M','M','M'],
'income':[50000,70000,60000,90000,80000,55000,65000,85000,100000,75000]
}
# Creating a DataFrame from the dictionary
df = pd.DataFrame(data)
# Adding a new column
df['Occupation'] = ['Engineer','Data Analyst','Doctor','Lawyer','Designer','Teacher','Accountant','Police','Manager','Architect']
#Printing the DataFrame
print("The data frame before sorting is as follows:")
print(df)
df=df.sort_values(by=['income'])
print("The data frame after sorting is as follows:")
print(df)
Output:
The data frame before sorting is as follows:
IDX | Name | Age | City | Gender | Income | Occupation |
---|---|---|---|---|---|---|
0 | Alex | 35 | Seattle | M | 50000 | Engineer |
1 | Bob | 26 | Chicago | M | 70000 | Data Analyst |
2 | Charlie | 55 | New York | M | 60000 | Doctor |
3 | David | 48 | San Francisco | M | 90000 | Lawyer |
4 | Emma | 22 | Los Angeles | F | 80000 | Designer |
5 | Frank | 32 | Houston | M | 55000 | Teacher |
6 | George | 44 | Phoenix | M | 65000 | Accountant |
7 | Harry | 39 | Philadelphia | M | 85000 | Police |
8 | Igor | 50 | San Diego | M | 100000 | Manager |
9 | Jack | 40 | Austin | M | 75000 | Architect |
IDX | Name | Age | City | Gender | Income | Occupation |
---|---|---|---|---|---|---|
0 | Alex | 35 | Seattle | M | 50000 | Engineer |
1 | Bob | 26 | Chicago | M | 70000 | Data Analyst |
2 | Charlie | 55 | New York | M | 60000 | Doctor |
3 | David | 48 | San Francisco | M | 90000 | Lawyer |
4 | Emma | 22 | Los Angeles | F | 80000 | Designer |
5 | Frank | 32 | Houston | M | 55000 | Teacher |
6 | George | 44 | Phoenix | M | 65000 | Accountant |
7 | Harry | 39 | Philadelphia | M | 85000 | Police |
8 | Igor | 50 | San Diego | M | 100000 | Manager |
9 | Jack | 40 | Austin | M | 75000 | Architect |
The data frame after sorting is as follows:
IDX | Name | Age | City | Gender | Income | Occupation |
---|---|---|---|---|---|---|
0 | Alex | 35 | Seattle | M | 50000 | Engineer |
5 | Frank | 32 | Houston | M | 55000 | Teacher |
2 | Charlie | 55 | New York | M | 60000 | Doctor |
6 | George | 44 | Phoenix | M | 65000 | Accountant |
1 | Bob | 26 | Chicago | M | 70000 | Data Analyst |
9 | Jack | 40 | Austin | M | 75000 | Architect |
4 | Emma | 22 | Los Angeles | F | 80000 | Designer |
7 | Harry | 39 | Philadelphia | M | 85000 | Police |
3 | David | 48 | San Francisco | M | 90000 | Lawyer |
8 | Igor | 50 | San Diego | M | 100000 | Manager |
Notice the change in the index numbers in sequence.
Explanation:
- First, we imported the pandas library into our code.
- Next, we have created a dictionary data type, a collection of key-value pairs. We called this dictionary data.
- Next, we created a data frame from the dictionary using the DataFrame function of pandas.
- Next, we added new columns to the data frame Occupation. We printed the data frame before the sorting process.
- Next, we used the sort_values() function to sort the data frame. We have passed the “income” as the parameter, which signifies that we need to sort the data frame based on the values of the income column.
- Next, we printed the data frame after the sorting process.
How To Sort In Descending Order
Sorting the data frame in descending order is relatively simple. We only need to pass one additional parameter to the function sort_values(), i.e., ascending=False.
Example (2)
import pandas as pd
# Creating a dictionary with data
data = {'name': ['Alex', 'Bob', 'Charlie','David','Emma','Frank','George','Harry','Igor','Jack'],
'age': [35,26,55,48,22,32,44,39,50,40],
'city': ['Seattle','Chicago','New York','San Francisco','Los Angeles','Houston','Phoenix','Philadelphia','San Diego','Austin'],
'gender':['M','M','M','M','F','M','M','M','M','M'],
'income':[50000,70000,60000,90000,80000,55000,65000,85000,100000,75000]
}
# Creating a DataFrame from the dictionary
df = pd.DataFrame(data)
# Adding a new column
df['Occupation'] = ['Engineer','Data Analyst','Doctor','Lawyer','Designer','Teacher','Accountant','Police','Manager','Architect']
#Printing the DataFrame
print("The data frame before sorting is as follows:")
print(df)
df=df.sort_values(by=['income'],ascending=False)
print("The data frame after sorting is as follows:")
print(df)
Output:
The data frame before sorting is as follows:
IDX | Name | Age | City | Gender | Income | Occupation |
---|---|---|---|---|---|---|
0 | Alex | 35 | Seattle | M | 50000 | Engineer |
1 | Bob | 26 | Chicago | M | 70000 | Data Analyst |
2 | Charlie | 55 | New York | M | 60000 | Doctor |
3 | David | 48 | San Francisco | M | 90000 | Lawyer |
4 | Emma | 22 | Los Angeles | F | 80000 | Designer |
5 | Frank | 32 | Houston | M | 55000 | Teacher |
6 | George | 44 | Phoenix | M | 65000 | Accountant |
7 | Harry | 39 | Philadelphia | M | 85000 | Police |
8 | Igor | 50 | San Diego | M | 100000 | Manager |
9 | Jack | 40 | Austin | M | 75000 | Architect |
The data frame after sorting is as follows:
IDX | Name | Age | City | Gender | Income | Occupation |
---|---|---|---|---|---|---|
8 | Igor | 50 | San Diego | M | 100000 | Manager |
3 | David | 48 | San Francisco | M | 90000 | Lawyer |
7 | Harry | 39 | Philadelphia | M | 85000 | Police |
4 | Emma | 22 | Los Angeles | F | 80000 | Designer |
9 | Jack | 40 | Austin | M | 75000 | Architect |
1 | Bob | 26 | Chicago | M | 70000 | Data Analyst |
6 | George | 44 | Phoenix | M | 65000 | Accountant |
2 | Charlie | 55 | New York | M | 60000 | Doctor |
5 | Frank | 32 | Houston | M | 55000 | Teacher |
0 | Alex | 35 | Seattle | M | 50000 | Engineer |
How To Sort Multiple Columns In Pandas DataFrame
Pandas allows us to sort the data frame based on multiple columns.
The rule to sort using multiple columns is as follows:
- You need to pass multiple names of columns as the parameter for sorting the data frame.
- Pandas will sort based on the order you provided to the sort_values() function. Suppose you have provided three columns named A, B, and C as the function’s parameters.
- Pandas will sort based on A. If there is a tie, then sorting will be done based on column B. If there is any tie again, then pandas will sort based on the value of column C.
Example (3)
import pandas as pd
data = {'name': ['Sophia', 'Liam', 'Emma','Noah','Mia','Aiden','Isabella','Jacob','Madison','Ethan'],
'age': [6,6,6,8,4,9,3,10,2,1],
'city': ['New York','Chicago','Los Angeles','Houston','Phoenix','San Francisco','Seattle','Philadelphia','Austin','San Diego'],
'gender':['F','M','F','M','F','M','F','M','F','M'],
'income':[1000,2000,3000,4000,5000,6000,9000,9000,9000,10000]
}
df = pd.DataFrame(data)
df['Occupation'] = ['Student','Student','Student','Student','Student','Student','Student','Student','Student','Student']
#Printing the DataFrame
print("The data frame before sorting is as follows:")
print(df)
df=df.sort_values(by=['income',"age"])
print("The data frame after sorting is as follows:")
print(df)
Output:
The data frame before sorting is as follows:
IDX | Name | Age | City | Gender | Income | Occupation |
---|---|---|---|---|---|---|
0 | Sophia | 6 | New York | F | 1000 | Student |
1 | Liam | 6 | Chicago | M | 2000 | Student |
2 | Emma | 6 | Los Angeles | F | 3000 | Student |
3 | Noah | 8 | Houston | M | 4000 | Student |
4 | Mia | 4 | Phoenix | F | 5000 | Student |
5 | Aiden | 9 | San Francisco | M | 6000 | Student |
6 | Isabella | 3 | Seattle | F | 9000 | Student |
7 | Jacob | 10 | Philadelphia | M | 9000 | Student |
8 | Madison | 2 | Austin | F | 9000 | Student |
9 | Ethan | 1 | San Diego | M | 10000 | Student |
The data frame after sorting is as follows:
IDX | Name | Age | City | Gender | Income | Occupation |
---|---|---|---|---|---|---|
0 | Sophia | 6 | New York | F | 1000 | Student |
1 | Liam | 6 | Chicago | M | 2000 | Student |
2 | Emma | 6 | Los Angeles | F | 3000 | Student |
3 | Noah | 8 | Houston | M | 4000 | Student |
4 | Mia | 4 | Phoenix | F | 5000 | Student |
5 | Aiden | 9 | San Francisco | M | 6000 | Student |
8 | Madison | 2 | Austin | F | 9000 | Student |
6 | Isabella | 3 | Seattle | F | 9000 | Student |
7 | Jacob | 10 | Philadelphia | M | 9000 | Student |
9 | Ethan | 1 | San Diego | M | 10000 | Student |
Notice that pandas have sorted based on the value of their income first. So three rows have the same 9000 values. Next, padas sorted those three rows with respect to age values.
How To Deal With NaN In Pandas While Sorting
Sometimes, we do not have the values for the cells to fill with. In such a situation, pandas put the rows having the nan values at last while sorting, irrespective of whether you are sorting based on the ascending or descending order. However, pandas also allow us to have control over such situations. We can use the parameter na_position to control where we want to place the NaN values.
It takes several values as the arguments like “first”,”last” etc.
Example (4)
import pandas as pd
import numpy as np
data = {'name': ['John', 'Mary', 'Mike','Sara','Kevin','Emily','Daniel','Jessica','Brian','Ashley'],
'age': [35,28,np.nan,32,40,25,np.nan,29,22,37],
'city': ['Houston','Chicago','New York','San Francisco','Los Angeles','Phoenix','Seattle','Philadelphia','San Diego','Austin'],
'gender':['M','F','M','F','M','F','M','F','M','F'],
'income':[50000,70000,60000,90000,80000,55000,65000,85000,100000,np.nan]
}
# Creating a DataFrame from the dictionary
df = pd.DataFrame(data)
# Adding a new column
df['Occupation'] = ['Engineer','Teacher','Doctor','Lawyer','Designer','Accountant','Police','Manager','Architect','Artist']
#Printing the DataFrame
print("The data frame before sorting is as follows:")
print(df)
df=df.sort_values(by=['income','age'],ascending=[False,True], na_position='first')
print("The data frame after sorting is as follows:")
print(df)
Output:
The data frame before sorting is as follows:
IDX | Name | Age | City | Gender | Income | Occupation |
---|---|---|---|---|---|---|
0 | John | 35 | Houston | M | 50000 | Engineer |
1 | Mary | 28 | Chicago | F | 70000 | Teacher |
2 | Mike | NaN | New York | M | 60000 | Doctor |
3 | Sara | 32 | San Francisco | F | 90000 | Lawyer |
4 | Kevin | 40 | Los Angeles | M | 80000 | Designer |
5 | Emily | 25 | Phoenix | F | 55000 | Accountant |
6 | Daniel | NaN | Seattle | M | 65000 | Police |
7 | Jessica | 29 | Philadelphia | F | 85000 | Manager |
8 | Brian | 22 | San Diego | M | 100000 | Architect |
9 | Ashley | 37 | Austin | F | NaN | Artist |
The data frame after sorting is as follows:
IDX | Name | Age | City | Gender | Income | Occupation |
---|---|---|---|---|---|---|
9 | Ashley | 37 | Austin | F | NaN | Artist |
8 | Brian | 22 | San Diego | M | 100000 | Architect |
3 | Sara | 32 | San Francisco | F | 90000 | Lawyer |
7 | Jessica | 29 | Philadelphia | F | 85000 | Manager |
4 | Kevin | 40 | Los Angeles | M | 80000 | Designer |
1 | Mary | 28 | Chicago | F | 70000 | Teacher |
6 | Daniel | NaN | Seattle | M | 65000 | Police |
2 | Mike | NaN | New York | M | 60000 | Doctor |
5 | Emily | 25 | Phoenix | F | 55000 | Accountant |
0 | John | 35 | Houston | M | 50000 | Engineer |
Notice that in the output, the NaN value came first.
How To Sort Pandas Data Frame Using Lambda Function
We can also sort the pandas data frame using the lambda function. For this, we can adopt several logic. One unique way of doing this is to use the natsort function. First, you need to install the natsort library in your system. For this, open your terminal and run the following commands:
pip install natsort
Example (5)
import pandas as pd
import numpy as np
from natsort import index_natsorted
data = {'name': ['John', 'Mary', 'Mike','Sara','Kevin','Emily','Daniel','Jessica','Brian','Ashley'],
'age': [35,28,np.nan,32,40,25,np.nan,29,22,37],
'city': ['Houston','Chicago','New York','San Francisco','Los Angeles','Phoenix','Seattle','Philadelphia','San Diego','Austin'],
'gender':['M','F','M','F','M','F','M','F','M','F'],
'income':[50000,70000,60000,90000,80000,55000,65000,85000,100000,np.nan]
}
# Creating a DataFrame from the dictionary
df = pd.DataFrame(data)
# Adding a new column
df['Occupation'] = ['Engineer','Teacher','Doctor','Lawyer','Designer','Accountant','Police','Manager','Architect','Artist']
#Printing the DataFrame
print("The data frame before sorting is as follows:")
print(df)
df=df.sort_values(
by="income",
key=lambda x: np.argsort(index_natsorted(df["income"]))
)
print("The data frame after sorting is as follows:")
print(df)
Output:
The data frame before sorting is as follows:
IDX | Name | Age | City | Gender | Income | Occupation |
---|---|---|---|---|---|---|
0 | John | 35 | Houston | M | 50000 | Engineer |
1 | Mary | 28 | Chicago | F | 70000 | Teacher |
2 | Mike | NaN | New York | M | 60000 | Doctor |
3 | Sara | 32 | San Francisco | F | 90000 | Lawyer |
4 | Kevin | 40 | Los Angeles | M | 80000 | Designer |
5 | Emily | 25 | Phoenix | F | 55000 | Accountant |
6 | Daniel | NaN | Seattle | M | 65000 | Police |
7 | Jessica | 29 | Philadelphia | F | 85000 | Manager |
8 | Brian | 22 | San Diego | M | 100000 | Architect |
9 | Ashley | 37 | Austin | F | NaN | Artist |
The data frame after sorting is as follows:
IDX | Name | Age | City | Gender | Income | Occupation |
---|---|---|---|---|---|---|
9 | Ashley | 37 | Austin | F | NaN | Artist |
0 | John | 35 | Houston | M | 50000 | Engineer |
5 | Emily | 25 | Phoenix | F | 55000 | Accountant |
2 | Mike | NaN | New York | M | 60000 | Doctor |
6 | Daniel | NaN | Seattle | M | 65000 | Police |
1 | Mary | 28 | Chicago | F | 70000 | Teacher |
4 | Kevin | 40 | Los Angeles | M | 80000 | Designer |
7 | Jessica | 29 | Philadelphia | F | 85000 | Manager |
3 | Sara | 32 | San Francisco | F | 90000 | Lawyer |
8 | Brian | 22 | San Diego | M | 100000 | Architect |
Changing The sorting Algorithms
By default, the sort_values() function uses the quick sort algorithm. However, the users can also choose other algorithms to sort the data. Although the result will be the same, the time complexity and the time taken for execution may differ.
Suppose you want merge sort as the algorithm to sort the data, then use the following code:
df.sort_values(
by=<name of the column>,
ascending=False,
kind="mergesort"
).
Conclusion
In this article, we have learned how to sort the pandas data frame using different techniques in Python. We have seen the usage of sort_values() functions along with their attributes. We strongly recommend the readers practice the discussed concepts and look up the Python pandas documentation to understand the topic better.
References: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html