Performing Data Manipulation in Python using Pandas
by Sumeet Jain
Even before the birth of the internet, data was an integral part of our lives, and proper record-keeping and analysis were key features of a successful organization. Now, with the internet, data has become even more crucial: everything from which product you purchase to major life choices is driven by it. Let's take a look at a scenario that shows the importance of data and how we can extract useful information from it.
Sarah works in the HR analytics department of a start-up that mostly handles Big Data and Data Science projects. Being a start-up, they constantly need new resources and have to spend as little as possible on training new hires. Her job is to regularly filter applications using the internal resume screening tool and clean the data for further analysis. Let's have a look at the latest data that was curated. It contains information about the candidates who have applied for the posted open positions, with the following columns:
- Application_ID: Unique ID of the application.
- City: City Name of the applicant.
- City_Development_Index: Development index of the city as per https://eol.smartcities.gov.in.
- Major: Major of the applicant.
- Experience: Experience of the applicant.
- Prev_Company_Size: Size of the company where the applicant currently works.
- Company_Type: Type of the company where the applicant is working.
- Last_New_Job: Time since the applicant last changed jobs.
- Profile_Match: Score generated by the resume analysis tool.
Let’s have a look at the tasks she has to do:
- Clean the data: handle empty and misspelled values.
- Categorize the applications into 5 different groups based on the profile match score.
- Generate a report of the data.
- Generate statistics for each city, sorted by profile match score, in multiple sheets.
Sarah uses pandas as her daily driver for such manipulation tasks. But why pandas? Because pandas is among the most well-known Python libraries for data manipulation, it is very efficient at handling data in a variety of formats, and she can quickly generate charts with it for her reports.
Reading Data
To get started, you will have to install pandas, as it does not come bundled with Python. You can use the !pip magic in Colab or a Jupyter notebook to install pandas:
!pip install pandas
Sarah has the data in Excel format, so to read it you simply use the read_excel() function:
import pandas as pd
import numpy as np

data = pd.read_excel('applications.xlsx')
Pandas supports various formats like CSV, JSON, and Parquet, and it can even read the results of SQL queries.
Viewing Data
To check that the data has been loaded correctly, you can use the head() or tail() functions to view the top or bottom 5 rows of the data. You can also pass a number to change how many rows are shown.
data.head(10)
When you run the above code, it will show the top 10 rows of the data:
Shape
To check the size of the data, you can use the shape property of the DataFrame:
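A quick check might look like this (the numbers shown are illustrative):

data.shape   # returns a (rows, columns) tuple, e.g. (1000, 9)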
Dataset Information
To get a general summary and information about the data, you can use the info() and describe() functions:
data.info() gives information about the columns, their types, and how much memory the dataset uses, while data.describe() gives general statistical information like count, mean, standard deviation, and max. By default describe() covers only numerical columns, but you can pass the include='all' parameter to get details on the categorical columns as well.
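For reference, the corresponding calls are:

data.info()                   # column names, dtypes, non-null counts, memory usage
data.describe()               # count, mean, std, min, max for numerical columns
data.describe(include='all')  # cover categorical columns as well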
Empty Records
First, Sarah has to check for null/empty values in each column and handle them separately. To check for null/empty values, pandas has an isnull() function.
If she uses this function directly, it will give her a Boolean dataframe like this:
To get the details curated for each column, she has to use the any() function as well:
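A minimal sketch of the chained call:

data.isnull().any()          # per column: True if the column contains any nulls
data.isnull().any(axis=1)    # per row: True if the row contains any nulls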
The any() function returns True if there is at least one True value along the series. By default, it works column-wise; to apply it row-wise, pass axis=1, where 1 means along the rows. Similarly, the all() function returns True only when all the values are True.
Filling Empty Records
To fill the empty/null values, Sarah uses the fillna() function. It can be applied to the entire dataframe or to a single column (series). To save effort, Sarah fills each column with its desired value in the following manner:
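A minimal sketch of this pattern; the fill values below are illustrative choices, except -1 for Profile_Match, which matches the placeholder handled later:

# Map each column that has empty values to its desired fill value
fill_values = {
    'Major': 'Unknown',
    'Prev_Company_Size': 'Unknown',
    'Company_Type': 'Unknown',
    'Last_New_Job': 'Never',
    'Profile_Match': -1,  # placeholder; these rows are fixed after a rerun
}
for column, value in fill_values.items():
    data[column].fillna(value, inplace=True)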
She creates a dictionary mapping the empty columns to their desired fill values and then iterates over it, calling fillna() for each. This way she can reuse the same code with minimal changes. fillna() takes the fill value as its first argument; the inplace=True argument makes the changes directly in the dataframe, since otherwise fillna() returns the filled dataframe or series.
As you can see, the Profile_Match column also had empty values, but the screening tool Sarah uses never produces empty scores. This means she will have to rerun the screening tool for those applications.
Replacing Values
In every dataset, some values are wrong due to human error or machine faults, and Sarah has to deal with these in the current data as well. She has to check for misspellings beforehand, so she inspects the unique values of each column and replaces the bad ones later.
To check for unique values in a pandas dataframe, you use the unique() function on a column (series). The function returns all the unique values in the series as an array.
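For example, looping over the text columns makes typos easy to spot:

# Print the distinct values of each text column
for column in ['Major', 'Prev_Company_Size', 'Company_Type', 'Last_New_Job']:
    print(column, ':', data[column].unique())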
As you can see, the Major, Prev_Company_Size, Company_Type, and Last_New_Job columns have some spelling errors and unwanted values like “–select—”.
Let’s see how Sarah replaces these errors:
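A sketch of the idea; the misspellings below are hypothetical stand-ins for the ones surfaced by unique():

# Fix hypothetical misspellings and unwanted values
data['Major'] = data['Major'].str.replace('Sceince', 'Science')
data['Last_New_Job'] = data['Last_New_Job'].str.replace('--select--', 'Never')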
Sarah uses the string replace() function to fix the errors. Pandas provides a number of string manipulation functions through the .str accessor. You can use other functions as well, like capitalizing the Company_Type values:
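str.capitalize() uppercases the first character of each value and lowercases the rest:

# e.g. 'pvt ltd' becomes 'Pvt ltd'
data['Company_Type'] = data['Company_Type'].str.capitalize()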
Editing a Single Cell Value
We still have to treat the profile match values that are -1. Sarah reruns the resume screening tool and generates the scores once again for those applications. Now, all she has to do is enter the new values:
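One possible sketch, with hypothetical rerun results held in a new_scores dataframe:

# Hypothetical rerun output from the screening tool
new_scores = pd.DataFrame(
    {'Application_ID': ['A1001', 'A1002'], 'Profile_Match': [72, 58]}
).set_index('Application_ID')

data = data.set_index('Application_ID')
for app_id in new_scores.index:
    if data.at[app_id, 'Profile_Match'] == -1:
        data.at[app_id, 'Profile_Match'] = new_scores.at[app_id, 'Profile_Match']
data = data.reset_index()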
Sarah sets Application_ID as the index in both dataframes, and then, for the indices where Profile_Match is -1, she updates the value to the new score. Finally, she resets the index. Here, the at[] accessor was used, which takes an index and a column name to target a single cell.
Grouping in Pandas
Now that Sarah is done with the cleaning, she has to gather some details to create a report, like:
- How many applications are from each city?
- How many applicants have a profile match score of more than 65% for different company types?
- For each city, how many applicants with different years of experience have completed a Bachelor's in Computer Science?
- Which city is optimal for finding experienced candidates?
To group data in pandas, we use the groupby() function. It can take multiple columns to group by.
Let’s have a look at the number of applicants from each city:
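A minimal sketch; the city_data groups created here are also reused in the later snippets:

# Group once, then iterate over (group key, group dataframe) pairs
city_data = data.groupby('City')
for city, cdata in city_data:
    print(city, len(cdata))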
The above method is one way to iterate through a grouped dataframe.
The number of applicants who have a profile match score of more than 65% for different company types:
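One way to sketch this is to group by Company_Type and count the rows above the threshold:

for ctype, cdata in data.groupby('Company_Type'):
    print(ctype, (cdata['Profile_Match'] > 65).sum())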
The variable cdata is nothing but the dataframe of that group; we can apply all the usual functions to it like a normal dataframe.
Crosstab
For each city, the number of applicants with different years of experience who have a Bachelor's in Computer Science can be found using the crosstab() function, which takes two columns as arguments. Sarah needs the details for just one city for now; let's see how she does this:
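A sketch, assuming she starts with the Pune group discussed below:

cdata = city_data.get_group('Pune')
pd.crosstab(cdata.Experience, cdata.Major)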
The get_group() function takes the group key as an argument and returns the filtered dataframe. The table above shows how many candidates fall into each group; for example, we can say that 9 candidates from Pune have a B.E. in Data Science. This is very useful for calculating probabilities.
Plotting in Pandas
Plots explain data easily and interactively. Pandas supports basic matplotlib plots out of the box, and you can use all the parameters you would use in matplotlib here as well. Experience across different majors:
from pandas.api.types import CategoricalDtype

data.Experience = data.Experience.astype(
    CategoricalDtype(categories=['0', '1', '1.2', '1.5', '2', '5', '5+', '10+', '12+'],
                     ordered=True))

cdata = city_data.get_group('Bengaluru')
subplots = pd.crosstab(cdata.Experience, cdata.Major, margins=False).plot(
    kind='bar', subplots=True, layout=(2, 4),
    title='Experience of different Majors', figsize=(14, 8), legend=False)

# To put the labels on top
for plot in subplots:
    for ax in plot:
        for container in ax.containers:
            ax.bar_label(container)
Sarah changed the datatype of the Experience column before plotting using the astype() function, which takes the target type as an argument. CategoricalDtype() converts the column to a categorical type and also lets us set an order. Sarah needed the order of experience preserved, so she had to change the type before plotting. As you can see, B.S. in Data Science and Analytics has the most people with 12+ years of experience.
Experience across profile match score bins:
# Divide Profile Match Scores into 5 different groups
bin_labels = ['Bad', 'Low', 'Average', 'Good', 'Very Good']
data['Grade'] = pd.qcut(data.Profile_Match, 5, labels=bin_labels)

subplots = pd.crosstab(data.Experience, data.Grade, margins=False).plot(
    kind='barh', subplots=True, stacked=True, layout=(3, 3),
    title='Experience of different Grades', figsize=(24, 8), legend=False)

# To put the labels on top
for plot in subplots:
    for ax in plot:
        for container in ax.containers:
            ax.bar_label(container)
In the above code, Sarah used qcut() to bin the scores into 5 different categories and plot them. qcut() takes two arguments by default: the column to cut and the number of quantile bins. You can pass the labels parameter to name the bins; otherwise, the default interval ranges are kept.
Sarah sometimes uses plots to get an overview of the null values in the dataframe:
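One way she might do this is to plot the null count of each column:

data.isnull().sum().plot(kind='bar', title='Null values per column')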
Pandas DataFrame Styles
Pandas supports CSS styling of dataframes to make them more interactive and visually appealing. Let's have a look at how Sarah uses this:
def range_color(v, props):
    # Shift the base color #bfedaf by the cell value to get a gradient
    try:
        return 'background-color: #' + str(hex(int('bfedaf', 16) + int(v))).replace('0x', '') + ';'
    except:
        return None

s = pd.crosstab(data.Prev_Company_Size, data.Last_New_Job, margins=False)
s.style.set_table_styles([
    # create internal CSS classes
    {'selector': 'th', 'props': 'background-color: #ffe6e6;'},
]).applymap(range_color, props='')
The above dataframe looks really neat, and the colors help you differentiate the values easily. You can set styles in pandas using either properties or styler functions. Styler functions are applied using the apply() or applymap() functions, which work on either a column or the entire dataframe, while set_table_styles() sets the properties of the table itself.
Saving DataFrame
We have gone through a number of functions, and finally we can save the dataframe. The last task is to generate a table for each city, sorted by profile match score, in multiple sheets. To sort the data, Sarah uses the sort_values() function, which sorts by a given column, and to save the data into multiple sheets she uses pandas' ExcelWriter:
writer = pd.ExcelWriter('Curated_Applications.xlsx')
for city, cdata in city_data:
    cdata.sort_values(by='Profile_Match').to_excel(writer, sheet_name=city, index=False)
writer.close()
Always remember to close the writer to save the file.
The final result after saving the file:
Conclusion
In this article, we saw how Sarah applied various data manipulation techniques using pandas. We also went through a number of pandas functions and methods. Pandas is really useful and serves as a daily driver for data analysis and data science work.