Data Cleaning Using Pandas
Data preparation involves data collection and data cleaning. When working with multiple sources of data, there are instances where the collected data could be incorrect, mislabeled, or even duplicated. This would lead to unreliable machine learning models and wrong outcomes. Hence, it is important to clean your data and get it into a usable form beforehand. In this article, we cover the concept of data cleaning using Pandas.
As a data scientist, most of your time is going to be spent preparing your data for analysis. In fact, according to Forbes, data preparation is the ‘most time-consuming, least enjoyable data science task’. Naturally, one would want to increase productivity in this phase to move on to the more interesting parts – getting insights from data. Pandas is a very popular Python library mainly used for data pre-processing purposes such as data cleaning, manipulation, and transformation. It provides a quick and efficient way to manage and analyze your data. In this blog on data cleaning using Pandas, we will cover the following sections:
- What is Data Cleaning?
- Data Cleaning Using Pandas
- Finding duplicated values in a DataFrame
- Finding missing elements in a DataFrame
- Filling the missing values in a DataFrame
- Dropping columns in a DataFrame
- Changing the index of a DataFrame
- Renaming columns of a DataFrame
- Converting the data type of columns
What is Data Cleaning?
Data cleaning is the process of dealing with messy, disordered data and eliminating incorrect, missing, duplicated values in your dataset. It improves the quality and accuracy of the data being fed to the algorithms that will solve your data science problem.
Now, let’s get to the fun part, shall we?
Data Cleaning Using Pandas
We are going to perform data cleaning using Pandas. The data used in this blog can be found here. This dataset describes the Airbnb listing activity in New York City for the year 2019. It contains information about hosts, geographical availability, and other metrics required to make predictions and draw conclusions. Let's start by preparing this data for analysis.
Firstly, let’s import the Pandas library:
import pandas as pd
Now, let’s load the dataset:
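Here is a minimal sketch of the loading step, assuming the dataset has been saved locally as a CSV file named 'AB_NYC_2019.csv' (the filename is an assumption; adjust it to match your download):
df = pd.read_csv('AB_NYC_2019.csv')  #Load the listings data into a DataFrame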
Displaying the first 5 rows of the dataset:
df.head()
Use info() to get a summary of the dataset:
df.info()
We can see all 16 columns listed above along with their data types. You can also see the memory usage displayed at the end as 6+ MB.
Let’s start with our data cleaning process now –
Finding duplicated values in a DataFrame
- duplicated(): This function returns a boolean Series with one value per row. False means the row is not a duplicate:
df.duplicated()
A boolean value is displayed against each row of the dataset, showing whether that row is duplicated or not.
But as you can see, there are 48895 elements here and we can’t check against each one individually. So, we will use the any() function to find out if there are any duplicated values at all:
df.duplicated().any()
So, there are no duplicate values. But if there were, we could’ve used the following syntax to remove those:
Syntax –
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
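For illustration, here is a minimal call that would drop duplicate rows in place, keeping the first occurrence of each (a sketch only, since our dataset has no duplicates to remove):
df.drop_duplicates(keep='first', inplace=True)  #Removes duplicate rows, keeping the first occurrence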
Finding missing elements in a DataFrame
There are four ways to find the null values, if present, in the dataset.
- isnull(): This function displays the dataset with boolean values. False means the value is not null:
df.isnull()
- isna(): This function also displays the dataset with boolean values. False means the value is not N/A:
df.isna()
- isna().any(): This function returns one boolean value per column, True if that column contains any missing values:
df.isna().any()
We can see there are 4 columns with null values present: ‘name’, ‘host_name’, ‘last_review’, and ‘reviews_per_month’.
- isna().sum(): This function gives the column-wise sum of the null values present in the dataset.
df.isna().sum()
We can see the number of null values against each of the 4 columns.
Filling the missing values in a DataFrame
- fillna(): This function will replace the null values in a DataFrame with the specified values.
Syntax –
DataFrame.fillna(value, method, axis, inplace, limit, downcast)
The value parameter can be a dictionary with column names as keys and the corresponding fill values as values.
Let’s fill in the values for the ‘name’, ‘host_name’, and ‘last_review’ columns:
df.fillna({'name':'Not Stated','host_name':'Not Stated','last_review':0}, inplace=True)
By default, the method does not modify the object directly. Instead, it returns a modified copy. Setting inplace=True makes the changes on the DataFrame itself.
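Equivalently, without inplace=True, you would reassign the returned copy (same fill dictionary as above):
df = df.fillna({'name':'Not Stated','host_name':'Not Stated','last_review':0})  #Returns a filled copy and reassigns it to df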
Do you want to check if the null values got filled? Let’s do it using the sum() function for missing values again:
df.isna().sum()
As you can see, we have successfully filled the null values in these 3 columns!
Now, what shall we do about the ‘reviews_per_month’ column?
Dropping columns in a DataFrame
- drop(): This function removes the specified rows or columns from the DataFrame.
Syntax –
DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
Let’s drop the ‘reviews_per_month’ column:
df.drop(['reviews_per_month'], axis = 1, inplace=True)
Let’s check whether we’ve dropped it:
#Display all column names
df.columns
We cannot see the ‘reviews_per_month’ column here as it has been successfully removed.
Changing the index of a DataFrame
When dealing with data, it is helpful in most cases to use a uniquely valued identifying field of the data as its index.
In our dataset, we can assume that the ‘id’ field would serve this purpose. So, let’s first check if all the values in this field are unique or not:
df['id'].is_unique
Now that we know that all values in the 'id' column are unique, let's set this column as the index using the set_index() function:
df.set_index('id', inplace=True)
df.head()
You can now access a record by its position using iloc[] as shown:
df.iloc[4]  #Displaying the 5th record from the dataset
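Since 'id' is now the index, you can also look up a listing directly by its id label using loc[]. The id value below is only an illustrative example:
df.loc[2539]  #Displays the record whose 'id' index label is 2539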
Renaming Columns of a DataFrame
In many cases, you might need to rename columns for better interpretation.
You can do this by using a dictionary, where the key is the current column name, and the value is the new column name:
new_col = {'name':'listing_name', 'number_of_reviews':'reviews'}
df.rename(columns=new_col, inplace=True)
df.head()
Converting the Data Type of Columns
While checking the DataFrame info() above, we saw that the ‘last_review’ column was of object type. Let’s recall it here:
df['last_review'].dtype.type
Since the column contains dates, we are going to convert its data type to datetime. One caveat: we filled this column's missing values with 0 earlier, so errors='coerce' is passed to turn those placeholders into NaT instead of raising a parsing error:
df['last_review'] = pd.to_datetime(df['last_review'], format='%Y-%m-%d', errors='coerce')
df['last_review'].dtype.type
Converting the Data Type to Reduce Memory Usage
You can reduce memory usage by changing the data types of columns.
Let’s do it for the ‘host_id’ column. We will convert it from int64 to int32 as shown:
df['host_id'] = df['host_id'].astype('int32')
df.info()
So, we have reduced the memory usage from 6+ MB to 5.4+ MB.
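Note that the figure reported by info() is an estimate, hence the '+'. If you want an exact measurement that also accounts for the contents of object (string) columns, you can pass memory_usage='deep':
df.info(memory_usage='deep')  #Reports exact memory usage, including string contents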
Endnotes
Pandas is a very powerful data processing tool for the Python programming language. It provides a rich set of functions to process various types of file formats from multiple data sources. The Pandas library is specifically useful for data scientists working with data cleaning and analysis. If you seek to learn the basics and various functions of Pandas, you can explore related articles here.