Pandas Interview Questions for Data Scientists

Updated on Aug 21, 2023 12:35 IST

For an aspiring data scientist, two of the most important data analytics libraries to be well-versed in are NumPy and Pandas. We have already covered NumPy Interview Questions for Data Scientists. This article will help you brush up on the foundations of the Pandas library and prepare you for the most common Pandas interview questions that come up during Data Science interviews.


We will be covering the following sections today:

 Quick Intro to Pandas


Pandas is an open-source Python library and the most popular package for data pre-processing in Data Science. It offers extensive features and powerful built-in functions that help with tasks such as data cleaning, data manipulation, data modeling, and analysis.

Let’s discuss the common Pandas-related questions you might come across during your interviews.

Pandas Interview Questions

What is meant by the mutability of objects?

Mutability refers to the ability to modify an object in place after it has been created, without needing to recreate it. Examples of mutable objects are lists, sets, and dictionaries.

Similarly, objects are said to be immutable if they cannot be modified once created. Examples of immutable objects are integers, floats, strings, and tuples. Dictionary keys must be hashable, which in practice means they are usually immutable objects.
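
As a minimal sketch of the difference (the variable names are illustrative and not from the article), a list can be changed in place, while a tuple rejects item assignment and a string method returns a new object:

```python
# Lists are mutable: append() changes the same object in place.
numbers = [1, 2, 3]
list_id_before = id(numbers)
numbers.append(4)
same_list_object = id(numbers) == list_id_before

# Tuples are immutable: item assignment raises a TypeError.
point = (1, 2)
try:
    point[0] = 10
    tuple_changed = True
except TypeError:
    tuple_changed = False

# Strings are immutable too: upper() builds a new string.
name = "pandas"
upper_name = name.upper()

print(numbers, same_list_object, tuple_changed, name, upper_name)
```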

1. Discuss Pandas Data Structures.

Data structures are specialized ways of organizing, processing, and storing data so that one can perform operations on it.

Pandas offers two main data structures –

Pandas Series:

  • 1D labeled array.
  • Homogeneous i.e., the data types of series elements are the same.
  • Size-immutable i.e., the size of series objects cannot be changed once created.
#pandas.Series
s = pd.Series(data=None, index=None, dtype=None)
 
 
Index  Data
0      element 1
1      element 2
2      element 3
3      element 4
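
To make the Series properties concrete, here is a small sketch with made-up values. The names s and labeled are assumptions for illustration only:

```python
import pandas as pd

# A Series built from a list gets a default integer index (0, 1, 2, ...)
s = pd.Series([10, 20, 30, 40])

# All elements share a single dtype
print(s.dtype)

# A custom index attaches a label to each element
labeled = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(labeled['b'])  # 20
```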

Pandas DataFrames:

  • 2D labeled tabular structure.
  • Heterogeneous i.e., the data frame elements can have different data types.
  • Size-mutable i.e., elements can be added or dropped from an existing data frame.

 

Index  Column 1   Column 2
0      element 1  element a
1      element 2  element b
2      element 3  element c
3      element 4  element d

For example, you can create a sample DataFrame from a dictionary of columns:

import pandas as pd
 
#Creating a sample DataFrame
df = pd.DataFrame({
    'id': [ 101, 129, 120, 126, 136, 125],
    'age': [ 30, 32, 43, 45, 29, 46],
    'gender': ['F', 'M', 'M', 'M', 'F', 'F']
    })
 
df 

2. How do you get the column names of your DataFrame?

Using the sample DataFrame df created above, there are multiple ways to get its column names –
  • Iterating over columns and printing them:
for col in df.columns:
    print(col)  

 

  • Using the columns attribute, which returns the column labels of the DataFrame:
list(df.columns)

   

  • Using the columns.values attribute, which returns a NumPy array of the column labels:
list(df.columns.values)

  • Using the built-in sorted() function, which returns the column names as a list sorted in alphabetical order:
sorted(df)  
 

 3. How do you get the number of rows and columns of a Pandas DataFrame?

You can find this simply through the shape attribute which gives you the shape of the DataFrame:

df.shape

For the sample DataFrame, this returns (6, 3): there are 6 rows and 3 columns in the DataFrame df.
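
Because shape is a plain (rows, columns) tuple, it can be unpacked directly. The sketch below rebuilds the article's sample DataFrame so that it runs on its own. The names n_rows and n_cols are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [101, 129, 120, 126, 136, 125],
    'age': [30, 32, 43, 45, 29, 46],
    'gender': ['F', 'M', 'M', 'M', 'F', 'F']
})

n_rows, n_cols = df.shape   # unpack the (rows, columns) tuple
print(n_rows, n_cols)       # 6 3
print(len(df))              # number of rows only: 6
```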

4. How can you slice a Pandas DataFrame?

Slicing simply means extracting a subset of the DataFrame. To select specific columns, you can pass a list of the relevant column names inside the indexing brackets as shown:

df[['age','id']]
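
Rows can be sliced as well. The following sketch, which rebuilds the sample DataFrame so it is self-contained, shows a positional row slice and a boolean filter. The names first_three and older are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [101, 129, 120, 126, 136, 125],
    'age': [30, 32, 43, 45, 29, 46],
    'gender': ['F', 'M', 'M', 'M', 'F', 'F']
})

# Slice rows by position: rows 0, 1 and 2
first_three = df[0:3]

# Keep only the rows that satisfy a condition
older = df[df['age'] > 40]

print(first_three)
print(older)
```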
   


5. How do you convert a Pandas Series to a DataFrame?

You can use the Series.to_frame() method to convert the series object to a DataFrame:

series = pd.Series(['butter', 'dynamite', 'daydream', 'friends'])
col = 'bts'
series.to_frame(name=col)

     


6. How do you convert a String to DateTime?

You can use the datetime.strptime() method from Python's datetime module to convert a string to a datetime object:

from datetime import datetime 
 
#Define dates in string format 
str1 = 'Wednesday, April 20, 2022' 
str2 = '4/20/22' 
str3 = '04-20-2022' 
 
#Define dates as datetime objects 
date1 = datetime.strptime(str1, '%A, %B %d, %Y') 
date2 = datetime.strptime(str2, '%m/%d/%y') 
date3 = datetime.strptime(str3, '%m-%d-%Y') 
 
#Print the converted dates 
print(date1) 
print(date2) 
print(date3) 

Each of the three calls prints 2022-04-20 00:00:00.
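
For a whole column of strings, pandas itself offers pd.to_datetime(), which the article does not cover. A minimal sketch, assuming a small made-up Series of dates in the same MM-DD-YYYY format used above:

```python
import pandas as pd

# Hypothetical column of date strings
dates = pd.Series(['04-20-2022', '04-21-2022', '04-22-2022'])

# Parse the whole Series at once; format uses the same strptime codes
parsed = pd.to_datetime(dates, format='%m-%d-%Y')

print(parsed)
print(parsed.dt.day_name())
```

Passing an explicit format avoids ambiguity between month-first and day-first strings.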

7. How do you add a single row to an existing DataFrame? 

DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so the examples below use pd.concat(), which returns a new DataFrame and leaves df unchanged.

  • Using a Series: 
#Appending a single row with a Series 
series = pd.Series([272, 22, 'M'], 
                   index=['id', 'age', 'gender']) 
 
#Turn the Series into a one-row DataFrame, then concatenate 
pd.concat([df, series.to_frame().T], ignore_index=True) 

  • Using a Dictionary: 

Please note that in this case, the dictionary keys must match the DataFrame column names: 

#Appending a single row with a dictionary 
new_row = {'id': 169, 'age': 54, 'gender': 'M'} 
 
pd.concat([df, pd.DataFrame([new_row])], ignore_index=True) 

8. How do you add multiple rows to an existing DataFrame? 

  • Using a DataFrame: 
#Creating a new DataFrame that we want to append 
df2 = pd.DataFrame({ 
    'id': [1221, 1200, 1230], 
    'age': [36, 53, 26], 
    'gender': ['M', 'F', 'M'] 
    }) 
 
#Appending multiple rows with pd.concat() (DataFrame.append() was removed in pandas 2.0) 
pd.concat([df, df2]) 

  • Using a List of Dictionaries: 
#Creating a list of dictionaries, one per row 
rows = [ 
  {'id': 1434, 'age': 26, 'gender': 'M'}, 
  {'id': 1578, 'age': 34, 'gender': 'M'}, 
  {'id': 1301, 'age': 23, 'gender': 'F'}, 
  {'id': 1207, 'age': 54, 'gender': 'M'} 
] 
 
#Converting the list of dictionaries to a DataFrame 
df2 = pd.DataFrame(rows) 
 
#Appending multiple rows with a list of dictionaries 
pd.concat([df, df2], ignore_index=True) 

  • Using a List of Lists: 
#Creating a list of lists 
mylist = [ 
  [1434, 26, 'M'], 
  [1578, 34, 'M'], 
  [1301, 23, 'F'], 
  [1207, 54, 'M'] 
] 
 
#Converting a list of lists to a DataFrame 
df2 = pd.DataFrame(mylist, columns=df.columns) 
 
#Appending multiple rows with a list of lists 
pd.concat([df, df2], ignore_index=True) 

9. How can you find the sum of values of a DataFrame column? 

You can use the sum() function to return the sum of the values. This is done with the help of the following parameters: 

  • axis: specifies whether to add up the values down each column (0) or across each row (1). 
  • skipna: is True by default, so missing values are skipped. 

df.sum(axis = 0, skipna = True)  


Note: With axis=0 (the index axis), the values in each column are added up, one column at a time. This returns a Series containing the sum of the values in each column. 
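
To sum a single column, as the question asks, you can select that column first. A short sketch on the article's sample data, where the names age_total, column_totals and row_totals are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [101, 129, 120, 126, 136, 125],
    'age': [30, 32, 43, 45, 29, 46],
    'gender': ['F', 'M', 'M', 'M', 'F', 'F']
})

# Sum of one column
age_total = df['age'].sum()
print(age_total)  # 225

# Column-wise sums of the numeric columns (axis=0)
column_totals = df[['id', 'age']].sum(axis=0)

# Row-wise sums of the numeric columns (axis=1)
row_totals = df[['id', 'age']].sum(axis=1)

print(column_totals)
print(row_totals)
```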

10. How and why should one create a copy of a DataFrame? 

You can simply use the copy() method to copy your existing DataFrame: 

df.copy() 

During data manipulation, it is generally a safe practice to perform operations on a copy of the DataFrame rather than the original one. Even if you work with a subset of the DataFrame, changes made to the subset can show up in the original DataFrame, because indexing may return a view of the original data rather than an independent copy. Whether you get a view or a copy depends on the operation and the pandas version, so an explicit copy() removes the ambiguity. 

So, to ensure that the initial DataFrame remains intact, you’d want to use copies and merge them all at the end. 
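
The sketch below illustrates that a copy is independent of the original. The name df_copy and the replacement value 99 are made up for the example:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [101, 129, 120, 126, 136, 125],
    'age': [30, 32, 43, 45, 29, 46],
    'gender': ['F', 'M', 'M', 'M', 'F', 'F']
})

# copy() makes a deep copy by default
df_copy = df.copy()

# Modify the copy only
df_copy.loc[0, 'age'] = 99

print(df.loc[0, 'age'])       # 30, the original is untouched
print(df_copy.loc[0, 'age'])  # 99
```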


11. How do you sort a DataFrame? 

You can use the sort_values() method to sort a DataFrame by its rows or columns. This can be done with the help of the following parameters: 

  • by: specifies the column/row label(s) that determine the sorting order. 
  • axis: specifies whether to sort the rows (0, the default) or the columns (1). 
  • ascending: by default, a DataFrame is sorted in ascending order. You can specify ascending=False to sort in descending order. 
#Sort by a single column 
df.sort_values(by=['age'])    
#Sort by multiple columns 
df.sort_values(by=['age','id']) 
    
#Sort in descending order 
df.sort_values(by='age', ascending=False) 
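
The ascending parameter also accepts one flag per sort column. As a sketch on the sample data (the name sorted_df is illustrative), the following sorts by gender in ascending order and, within each gender, by age in descending order:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [101, 129, 120, 126, 136, 125],
    'age': [30, 32, 43, 45, 29, 46],
    'gender': ['F', 'M', 'M', 'M', 'F', 'F']
})

# One ascending flag per column listed in `by`
sorted_df = df.sort_values(by=['gender', 'age'], ascending=[True, False])

# sort_values() returns a new DataFrame; reset the index if the old labels are not needed
sorted_df = sorted_df.reset_index(drop=True)
print(sorted_df)
```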
      

12. How can you merge two DataFrames? 

You can use the merge() method to combine your DataFrames with SQL-style joins. There are 4 common types of joins that you can pass to the optional argument how. When no on argument is given, merge() joins on the columns that the two DataFrames have in common (here, Roll No.): 

  • inner (default): output DataFrame consists of only the records whose key values appear in both DataFrames: 
#Creating two dataframes 
data1 = pd.DataFrame({"Roll No.": [101,103,104,107,106],  
                    "Gender": ['M','F','F','M','M'], 
                    "Name": ['Rob','Pam','Max','Tom','Ben'] 
                    }) 
 
data2 = pd.DataFrame({"Roll No.": [101,102,103,104,105],  
                    "Age": [31,35,36,30,38] 
  }) 
 
#Inner merge 
df_merge = pd.merge(left=data1, right=data2, how='inner') 
df_merge        
  • outer: output DataFrame consists of all records from both the left and right DataFrames. Where the keys match, the values are combined; otherwise, the missing values are filled with NaN: 
#Outer merge 
df_merge = pd.merge(left=data1, right=data2, how='outer') 
df_merge 
  • left: output DataFrame consists of all records from the left DataFrame and only the matched records from the right DataFrame: 
#Left join 
df_merge = pd.merge(left=data1, right=data2, how='left') 
df_merge          
  • right: output DataFrame consists of all records from the right DataFrame and only the matched records from the left DataFrame: 
#Right join 
df_merge = pd.merge(left=data1, right=data2, how='right') 
df_merge           
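
To compare the four join types side by side, the sketch below counts the rows each one produces for the article's two DataFrames. The explicit on='Roll No.' and the names row_counts and origin_counts are additions for illustration:

```python
import pandas as pd

data1 = pd.DataFrame({"Roll No.": [101, 103, 104, 107, 106],
                      "Gender": ['M', 'F', 'F', 'M', 'M'],
                      "Name": ['Rob', 'Pam', 'Max', 'Tom', 'Ben']})

data2 = pd.DataFrame({"Roll No.": [101, 102, 103, 104, 105],
                      "Age": [31, 35, 36, 30, 38]})

# Number of rows produced by each join type
row_counts = {}
for how in ['inner', 'outer', 'left', 'right']:
    merged = pd.merge(data1, data2, on='Roll No.', how=how)
    row_counts[how] = len(merged)
print(row_counts)

# indicator=True adds a _merge column that records where each row came from
outer = pd.merge(data1, data2, on='Roll No.', how='outer', indicator=True)
origin_counts = outer['_merge'].value_counts().to_dict()
print(origin_counts)
```

Only roll numbers 101, 103 and 104 appear in both DataFrames, which is why the inner join keeps three rows.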

13. How do you deal with a dataset consisting of missing values? 

For larger datasets, the quickest way to deal with missing values is to simply remove the rows that contain them, for example with dropna(). When only a small fraction of the rows is affected, this usually has little effect on predictions. 

For smaller datasets, you can substitute the missing values with the column's average (mean) value or with zeros. For this, you can use the mean() and fillna() methods: 

df_merge.fillna(0) 
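
The two strategies can be sketched on a small made-up DataFrame. The data and the names scores, dropped and filled are hypothetical and not from the article:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age
scores = pd.DataFrame({
    'name': ['Rob', 'Pam', 'Max', 'Tom'],
    'age': [31, np.nan, 30, 35]
})

# Count missing values per column
print(scores.isna().sum())

# Strategy 1: drop every row that contains a missing value
dropped = scores.dropna()

# Strategy 2: fill the missing age with the column mean
filled = scores.fillna({'age': scores['age'].mean()})

print(dropped)
print(filled)
```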


14. What is the difference between fillna() and interpolate() methods? 

The fillna() method fills the missing or NaN values with a substitute value specified by you. You can pass a single value for the whole DataFrame, or a Python dictionary that maps column names to the fill value for each column. 

On the other hand, the interpolate() method allows you to fill your missing or NaN values with many kinds of interpolations between the values like linear, time, etc. 
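
A minimal sketch of the difference, using a made-up Series with two gaps (the names s, filled and interpolated are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, np.nan, 40.0])

# fillna() substitutes the same value everywhere
filled = s.fillna(0)

# interpolate() estimates the gaps from neighbouring values (linear by default)
interpolated = s.interpolate()

print(filled.tolist())
print(interpolated.tolist())
```

Here fillna(0) yields 10, 0, 0, 40, while linear interpolation yields 10, 20, 30, 40.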

15. How to group DataFrame rows in a list? 

You can use the groupby() method to group on the specified column and then use the apply() method to apply list() to every group, which collects each group's values into a list: 

df1 = df.groupby('age')['gender'].apply(list).reset_index(name='new') 

df1 
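
Since every age in the sample DataFrame is unique, each list above holds a single value. Grouping on a column with repeated values makes the effect easier to see. This sketch groups by gender instead, and the names grouped and ids are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [101, 129, 120, 126, 136, 125],
    'age': [30, 32, 43, 45, 29, 46],
    'gender': ['F', 'M', 'M', 'M', 'F', 'F']
})

# Collect the ids of each gender into one list per group
grouped = df.groupby('gender')['id'].apply(list).reset_index(name='ids')
print(grouped)
```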


16. How do you delete a column or a row from a Pandas DataFrame? 

You can use the drop() method to delete a column or a row from the DataFrame. This can be done with the help of the following parameters: 

  • axis: specifies whether to drop labels from the rows (0, the default) or the columns (1). 
  • inplace: can be set to True to modify the DataFrame in place instead of returning a new one. 
#Dropping a single column  
df.drop(['gender'], axis =1)            
#Dropping multiple columns  
df.drop(['gender', 'id'], axis =1)           

Note: You can also remove duplicate values from the column by using the drop_duplicates() method. 
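
Rows are dropped the same way, by passing index labels with axis=0. The sketch below also shows drop_duplicates() restricted to one column. The names without_rows and unique_genders are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [101, 129, 120, 126, 136, 125],
    'age': [30, 32, 43, 45, 29, 46],
    'gender': ['F', 'M', 'M', 'M', 'F', 'F']
})

# Drop the rows whose index labels are 0 and 1
without_rows = df.drop([0, 1], axis=0)

# Keep only the first row for each distinct gender
unique_genders = df.drop_duplicates(subset=['gender'])

print(without_rows)
print(unique_genders)
```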

17. How can you encode a large Pandas DataFrame? 

For a DataFrame with a lot of columns (aka features), encoding each categorical column by hand is not efficient. Instead, you can use an encoder from Python's scikit-learn library, such as OneHotEncoder for categorical feature columns (LabelEncoder is intended for a single column of target labels): 

from sklearn.preprocessing import OneHotEncoder 
 
OneHotEncoder().fit_transform(df) 
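
If you only need to one-hot encode a few categorical columns and want to stay inside pandas, pd.get_dummies() is an alternative to the scikit-learn encoder. This is a different technique from the one named above, shown here as a sketch on the sample data with the illustrative name encoded:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [101, 129, 120, 126, 136, 125],
    'age': [30, 32, 43, 45, 29, 46],
    'gender': ['F', 'M', 'M', 'M', 'F', 'F']
})

# Replace the gender column with one indicator column per category
encoded = pd.get_dummies(df, columns=['gender'])
print(encoded.columns.tolist())  # ['id', 'age', 'gender_F', 'gender_M']
```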

18. How can you split your Pandas DataFrame into training and testing sets? 

You can use Python’s scikit-learn library to split both NumPy arrays and Pandas DataFrames into training and testing sets to be used for creating ML models. For this, you can import train_test_split to perform the split: 

from sklearn.model_selection import train_test_split 
 
train, test = train_test_split(df, test_size=0.2) 
              

The above code splits the DataFrame into 20% testing set and 80% training set. 

19. How can you randomly sample rows from a pandas DataFrame?

In pandas, you can use the sample() function to randomly sample rows from a DataFrame. To sample, you can simply call df.sample(n=5) to select 5 random rows from the DataFrame df. If you want to sample a specific fraction of the rows, you can use df.sample(frac=0.25) to randomly select 25% of the rows. By default, sample() performs sampling without replacement, but you can enable replacement by setting the replace parameter to True. Additionally, you can set a random seed using the random_state parameter to obtain reproducible results.
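
The options described above can be sketched on the sample DataFrame as follows. The seed 42 and the variable names are arbitrary choices for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [101, 129, 120, 126, 136, 125],
    'age': [30, 32, 43, 45, 29, 46],
    'gender': ['F', 'M', 'M', 'M', 'F', 'F']
})

# Five random rows, without replacement
five = df.sample(n=5, random_state=42)

# Half of the rows
half = df.sample(frac=0.5, random_state=42)

# More rows than the DataFrame holds is only possible with replacement
with_replacement = df.sample(n=10, replace=True, random_state=42)

# The same seed gives the same sample
five_again = df.sample(n=5, random_state=42)
print(five.equals(five_again))
```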

20. What are loc and iloc methods in Pandas? 

The loc[ ] method 

It is a label-based method that is used for selecting data and updating it. You can do so by passing the row or column label you want to select.  

Syntax:

loc[row_label, column_label] 

The iloc[ ] method 

The iloc[ ] is an integer position-based method that is used for data selection. Here, you can pass the row or column positions that you want to select (0-based integer index).  

Syntax:

iloc[row_position, column_position] 
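
The difference is easiest to see when the index labels are not integers. The sketch below gives the sample data a hypothetical string index ('a' to 'f'), which is an assumption made only for this example:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [101, 129, 120, 126, 136, 125],
    'age': [30, 32, 43, 45, 29, 46],
    'gender': ['F', 'M', 'M', 'M', 'F', 'F']
}, index=['a', 'b', 'c', 'd', 'e', 'f'])

# loc selects by label
age_by_label = df.loc['b', 'age']

# iloc selects by 0-based position (second row, second column)
age_by_position = df.iloc[1, 1]

# Label slices include the end label; position slices exclude the end position
label_slice = df.loc['a':'c', ['id', 'age']]
position_slice = df.iloc[0:2, 0:2]

# loc can also be used to update a value
df.loc['a', 'age'] = 31

print(age_by_label, age_by_position)
print(label_slice)
print(position_slice)
```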


Endnotes 

We hope this article helps with your Data Science and Pandas interview preparation. You can also find a similar article on NumPy Interview Questions. If you want to learn more about Python and its libraries, you can explore our related articles. 
