Performing Data Manipulation in Python using Pandas

Updated on Jan 5, 2022 18:05 IST

by Sumeet Jain

Even before the birth of the internet, data was an integral part of our lives, and proper record-keeping and analysis were key features of a successful organization. Now, with the internet, data has become even more crucial: everything from choosing which product to purchase to major life choices is driven by it. Let's take a look at a scenario that shows the importance of data and how we can extract useful information from it.


Sarah works in the HR analytics department of a start-up that handles mostly Big Data and Data Science projects. Being a start-up, they constantly need new resources and have to spend as little as possible training the newbies. Her job is to regularly filter the applications using the internal resume screening tool and clean the data for further analysis. Let's have a look at the latest data that was curated. It contains information about the candidates who have applied for the posted open positions, in the following columns:

  • Application_ID: Unique ID of the application.
  • City: City Name of the applicant.
  • City_Development_Index: Development index of the city as per https://eol.smartcities.gov.in.
  • Major: Major of the applicant.
  • Experience: Experience of the applicant.
  • Prev_Company_Size: Size of the company where the applicant currently works.
  • Company_Type: Type of the company where the applicant currently works.
  • Last_New_Job: Time since the applicant last took a new job.
  • Profile_Match: Score generated by the resume analysis tool.

Let’s have a look at the tasks she has to do:

  • Clean the data: handle empty and misspelled values.
  • Categorize the applications into 5 different groups based on profile match score.
  • Generate a report of the data.
  • Generate statistics for each city, sorted by profile match score, in multiple sheets.

Sarah's daily driver for such manipulation tasks is pandas. But why pandas? Because it is among the most well-known Python libraries for data manipulation and is very efficient at handling data in a variety of formats. She can also quickly generate charts with pandas for her reports.


Reading Data

To get started, you will have to install pandas, as it does not come bundled with Python. You can use the !pip magic command on Colab or in a Jupyter notebook to install it:

!pip install pandas

Sarah has the data in Excel format, so to read it you simply use the read_excel() function:

import pandas as pd
import numpy as np

data = pd.read_excel('applications.xlsx')

Pandas supports various formats like CSV, JSON, and Parquet, and it can even read the results of SQL queries.
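The other readers follow the same pattern; here is a quick sketch with illustrative file names:

# Each reader returns a DataFrame; the file and table names here are hypothetical
df_csv = pd.read_csv('applications.csv')
df_json = pd.read_json('applications.json')
df_parquet = pd.read_parquet('applications.parquet')

# Reading the result of a SQL query (connection details are illustrative)
import sqlite3
conn = sqlite3.connect('applications.db')
df_sql = pd.read_sql('SELECT * FROM applications', conn)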

Viewing Data

To check that the data has been loaded correctly, you can use the head() or tail() functions to view the top or bottom 5 rows of the data. You can also pass a number to change how many rows you want to look at.

data.head(10)

Running the above code shows the top 10 rows of the data:

[Image: first 10 rows of the dataset]

Shape

To check the size of the data, you can use the shape property of the DataFrame:

[Image: output of data.shape]
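The screenshot boils down to something like this; note that shape is an attribute, not a method:

# shape returns a (rows, columns) tuple; the numbers below are illustrative
data.shape  # e.g. (8955, 9)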

Dataset Information

To get a general summary and information on the data, you can use the info() and describe() functions:

[Image: output of data.describe()]

data.info() gives information about the columns, their types, and how much memory the dataset uses, while data.describe() gives general statistical information like count, mean, standard deviation, and max. describe() considers only numerical columns by default, but you can pass include='all' to get details on the categorical columns as well.

[Image: output of data.info()]
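In code, the two calls look like this:

# Column names, dtypes, non-null counts, and memory usage
data.info()

# Summary statistics; include='all' covers categorical columns too
data.describe(include='all')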

Empty Records

Sarah first has to check for null/empty values in each column and handle them separately. To check for null/empty values, pandas has an isnull() function.

If she uses this function directly, it will give her a Boolean dataframe like this:

[Image: Boolean dataframe returned by data.isnull()]

To get the details curated per column, she has to chain the any() function as well:

[Image: per-column output of isnull() combined with any()]

The any() function returns True if there is a single True value along the series. By default, it works column-wise; to use it row-wise, pass the parameter axis=1, where 1 means along the rows. Similarly, the all() function returns True only when all the values are True.
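Putting the pieces together, a short sketch:

# Boolean dataframe: True wherever a value is missing
data.isnull()

# One True/False per column: does the column contain any nulls?
data.isnull().any()

# The same check row-wise
data.isnull().any(axis=1)

# all() is the strict counterpart: True only if every value in the series is null
data.isnull().all()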

Filling Empty Records

To fill the empty/null values, Sarah will be using the fillna() function. This function can be used on the entire dataframe or on a single column (series). To save effort, Sarah fills each column with the desired value in the following manner:

[Image: filling empty records with fillna()]

She creates a dictionary mapping each empty column to its desired fill value and then iterates through it, calling fillna() each time. This way she can reuse the same code with minimal changes. fillna() takes the fill value as its first argument; the inplace=True argument makes the changes directly in the dataframe, otherwise fillna() returns the filled dataframe or series.
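A minimal sketch of that pattern, assuming placeholder fill values (the actual ones are in Sarah's screenshot):

# Hypothetical fill values -- the real ones come from the screenshot above
fill_values = {
    'Major': 'Unknown',
    'Prev_Company_Size': 'Unknown',
    'Company_Type': 'Unknown',
    'Last_New_Job': 'never',
    'Profile_Match': -1,  # sentinel; these rows get regenerated later
}
for column, value in fill_values.items():
    # inplace=True writes the filled values straight into the dataframe
    data[column].fillna(value, inplace=True)

Since fillna() also accepts a dictionary directly, data.fillna(fill_values, inplace=True) would achieve the same in a single call.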

As you can see, the Profile_Match column also had empty values, but the screening tool that Sarah uses never produces empty values. This means she will have to rerun the screening tool for those applications.

Replacing Values

In every dataset, some values are wrong due to human error or machine faults, and Sarah has to deal with these in the current data too. She checks for misspellings beforehand by examining the unique values of each column and then replacing the bad ones.

To check for unique values in a pandas dataframe, you use the unique() function on a column (series). The function returns all the unique values in the series as an array.

[Image: unique values of each column]

As you can see, the columns Major, Prev_Company_Size, Company_Type, and Last_New_Job have some spelling errors and unwanted values like “–select—”.

Let’s see how Sarah replaces these errors:

[Image: replacing the misspelled values]

Sarah uses the built-in replace function to fix the errors. Pandas provides a number of string manipulation functions as well; for example, you can capitalize the Company_Type values:

[Image: capitalizing the Company_Type column]
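Roughly, the inspection and the fixes could look like this sketch (the misspelled values are illustrative; the actual ones are in the screenshots):

# Inspect the distinct values of a column
print(data['Company_Type'].unique())

# Map misspelled and unwanted values to clean ones (hypothetical examples)
data['Company_Type'] = data['Company_Type'].replace({'--select--': 'Unknown',
                                                     'startup': 'Startup'})

# String functions live under the .str accessor
data['Company_Type'] = data['Company_Type'].str.capitalize()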

Editing Single Cell Value

We still have to treat the profile match values that are -1. Sarah reruns the resume screening tool and generates the scores again for the applications that were -1. Now, all she has to do is enter these new values:

[Image: editing single cell values with at[]]

Sarah sets Application_ID as the index in both dataframes and then, for the indices where Profile_Match is -1, updates them to the new values. Then she resets the index. Note that the at[] accessor was used here, which takes an index label and a column name to target a single cell.
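A minimal sketch of that flow, assuming the regenerated scores live in a dataframe called new_scores (a hypothetical name) that also has an Application_ID column:

# Hypothetical dataframe holding the regenerated scores
# new_scores = pd.read_excel('rerun_scores.xlsx')
data = data.set_index('Application_ID')
new_scores = new_scores.set_index('Application_ID')

# For every application still carrying the -1 sentinel,
# at[] targets the single cell (row label, column name) to update
for app_id in data[data['Profile_Match'] == -1].index:
    data.at[app_id, 'Profile_Match'] = new_scores.at[app_id, 'Profile_Match']

data = data.reset_index()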

Grouping in Pandas

Now that Sarah is done with the cleaning part, she has to get some details to create a report, like:

  • How many applications are from each city?
  • How many applicants have a profile match score of more than 65% for different company types?
  • For each city, how many applicants with different years of experience have completed a Bachelor's in Computer Science?
  • Which is the optimal city where experienced candidates are present?

To group the data in pandas, we use the groupby() function, which can take one or more columns to group by.

Let’s have a look at the number of applicants from each city:

[Image: number of applicants from each city]

The above method is one way to iterate through a grouped dataframe, as shown in the sketch below.
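The loop in the screenshot presumably looks something like this:

# Group the applications by city; city_data is reused later on
city_data = data.groupby('City')

# Iterating a grouped dataframe yields (group value, sub-dataframe) pairs
for city, cdata in city_data:
    print(city, len(cdata))

# A more direct equivalent that returns the counts as a Series
city_data.size()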

The number of applicants who have a profile match score of more than 65%, for different company types:

[Image: applicants above 65% per company type]

The variable cdata is simply the dataframe of that group; we can apply functions to it like to any normal dataframe.
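One way to sketch the 65% cutoff (assuming Profile_Match holds percentage values):

# Keep only the applications scoring above 65, then count per company type
data[data['Profile_Match'] > 65].groupby('Company_Type').size()

# Or, iterating group by group as in the screenshot:
for ctype, cdata in data.groupby('Company_Type'):
    print(ctype, (cdata['Profile_Match'] > 65).sum())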

Crosstab

For each city, the number of applicants with different years of experience who hold a Bachelor's in Computer Science can be found using the crosstab() function, which takes two columns as arguments. Sarah needs the details for just one city for now. Let's see how she does this:

[Image: crosstab of Experience vs Major for one city]

The get_group() function takes the group value as an argument and returns the filtered dataframe. The above data shows how many candidates fall into each group; for example, 9 candidates from Pune have a B.E. in Data Science. This is very useful when calculating probabilities.
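In code, that likely amounts to something like:

# Pull out a single city's applications, then cross-tabulate
cdata = city_data.get_group('Pune')
pd.crosstab(cdata['Experience'], cdata['Major'])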

Plotting in Pandas

Plots explain data easily and interactively. Pandas supports basic matplotlib plots by default, and you can use all the parameters you would use in matplotlib here as well. Let's plot experience across different majors:

from pandas.api.types import CategoricalDtype

# Make Experience an ordered categorical so the bars appear in a sensible order
data.Experience = data.Experience.astype(
    CategoricalDtype(categories=['0', '1', '1.2', '1.5', '2', '5', '5+', '10+', '12+'],
                     ordered=True))

# city_data is the groupby('City') object created earlier
cdata = city_data.get_group('Bengaluru')
subplots = pd.crosstab(cdata.Experience, cdata.Major, margins=False).plot(
    kind='bar', subplots=True, layout=(2, 4),
    title='Experience of different Majors',
    figsize=(14, 8), legend=False)

# Put the count labels on top of each bar
for plot in subplots:
    for ax in plot:
        for container in ax.containers:
            ax.bar_label(container)

[Image: bar plots of experience across different majors]

Sarah changed the datatype of the Experience column before plotting, using the astype() function, which takes a type as an argument. CategoricalDtype() converts the column to a categorical type and also lets us set the order. Sarah needed the order of experience preserved, so she had to change the type before plotting. As you can see, B.S. in Data Science and Analytics has the most people with 12+ years of experience.

Experience across profile match score bins:

# Divide the Profile_Match scores into 5 quantile-based groups
bin_labels = ['Bad', 'Low', 'Average', 'Good', 'Very Good']
data['Grade'] = pd.qcut(data.Profile_Match, 5, labels=bin_labels)

subplots = pd.crosstab(data.Experience, data.Grade, margins=False).plot(
    kind='barh', subplots=True, stacked=True,
    layout=(3, 3), title='Experience of different Grades',
    figsize=(24, 8), legend=False)

# Put the count labels on each bar
for plot in subplots:
    for ax in plot:
        for container in ax.containers:
            ax.bar_label(container)

[Image: experience across profile match score bins]

In the above code, Sarah used qcut() to bin the scores into 5 categories and plot them. qcut() takes two arguments by default: the column to cut and the number of quantiles. You can pass the additional labels parameter to apply label names; otherwise the default interval ranges are kept.

Sarah also sometimes uses plots like these to get a quick overview of a dataframe with null values.

Pandas DataFrame Styles

Pandas supports CSS styling of dataframes to make them more interactive and visually appealing. Let's have a look at how Sarah uses this:

# Shade each cell by adding its value to a base colour, giving a rough gradient
def range_color(v, props):
    try:
        return 'background-color: #' + hex(int('bfedaf', 16) + int(v)).replace('0x', '') + ';'
    except (TypeError, ValueError):
        return None

s = pd.crosstab(data.Prev_Company_Size, data.Last_New_Job, margins=False)
s.style.set_table_styles([
    # Internal CSS class for the header cells
    {'selector': 'th', 'props': 'background-color: #ffe6e6;'},
]).applymap(range_color, props='')

[Image: styled crosstab dataframe]

The above dataframe looks really neat, and the colors help you easily differentiate between values. You can set styles in pandas using either properties or styler functions. The styler functions are applied using apply() or applymap(), which work on a column or on the entire dataframe, while set_table_styles() sets the properties of the table itself.

Saving DataFrame

We have gone through a number of functions, and we can finally save the dataframe. The last task is to generate tables for each city, sorted by profile match score, in multiple sheets. To sort the data, Sarah uses the sort_values() function, which sorts the data by a given column. To save the data into multiple sheets, she uses pandas' ExcelWriter:

# One sheet per city, rows sorted by profile match score
writer = pd.ExcelWriter('Curated_Applications.xlsx')
for city, cdata in city_data:
    cdata.sort_values(by='Profile_Match').to_excel(writer, sheet_name=city, index=False)
writer.close()

Always remember to close the writer to save the file.

The final result after saving the file:

[Image: the saved workbook with one sheet per city]

Conclusion

In this article, we saw how Sarah applied various data manipulation tasks using pandas, and we went through a number of pandas functions and methods along the way. Pandas is really useful and a daily driver for data analysis and data science work.
