Tutorial – Box Plot in Matplotlib

Updated on Mar 4, 2022 16:12 IST

Matplotlib, Python’s most popular visualization library, provides support for many useful graphical visualizations. In this article, we are going to focus on box plots, a graphical technique used to examine the distribution of your data.

Exploratory Data Analysis (EDA) is an important step in data analysis when working on a machine learning or data science project. EDA helps summarize the main characteristics of your data, mostly employing data visualization methods. Let’s see how to perform EDA with a box plot in Matplotlib.

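We will be covering the following sections:

  • Quick Intro to Box Plots
  • Installing and Importing Matplotlib
  • Creating a Box Plot in Matplotlib
  • Optional Parameters of the Box Plot in Matplotlib
  • Grouping Box Plots in a Single Figure
  • Saving your Box Plot
  • Outlier Detection using Box Plots
  • Endnotes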

Quick Intro to Box Plots

A box plot visually represents the statistical summary of a single variable (attribute) in a dataset. Each box plot displays the following:

  • Minimum
  • First Quartile (Q1)
  • Median
  • Third Quartile (Q3)
  • Maximum
  • Interquartile Range (IQR)
  • Outliers, if any

Just as a median breaks the dataset in half, quartiles are used to tell us about the spread and skewness of data by breaking the dataset into quarters. 

Additionally, you can choose to display the mean and standard deviation of your data. Box plots are also called box-and-whisker plots and are particularly useful for comparing distributions across groups.
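To make these quantities concrete, here is a minimal sketch that computes them with NumPy. The sample values are made up purely for illustration, and the quartiles rely on NumPy's default linear interpolation, so other tools may report slightly different numbers for the same data.

```python
import numpy as np

# Hypothetical sample, used only to illustrate the summary statistics
values = np.array([2, 4, 4, 5, 7, 8, 9, 11, 30])

minimum, maximum = values.min(), values.max()

# Quartiles split the sorted data into four parts
q1, median, q3 = np.percentile(values, [25, 50, 75])

# The interquartile range is the height of the box in a box plot
iqr = q3 - q1

print("min:", minimum, "max:", maximum)
print("Q1:", q1, "median:", median, "Q3:", q3, "IQR:", iqr)
```

For this sample, Q1 is 4, the median is 7, Q3 is 9, and the IQR is therefore 5.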

Now, we will see how to create a box plot to analyze the distribution of variables in a given dataset. The dataset used in this blog can be found here. It contains information on breast cancer. Each record provides the patient’s ID number and the diagnosis, along with ten real-valued features computed for each cell nucleus.

We are going to analyze the relationship between a categorical feature (diagnosis: malignant or benign tumor) and a continuous numerical feature (area_mean).

Let’s get started!

Installing and Importing Matplotlib

First, let’s install the Matplotlib library in your working environment. Execute the following command in your terminal:

 
pip install matplotlib

Now let’s import the libraries we’re going to need today:

 
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In Matplotlib, pyplot is used to create figures and change their characteristics.

%matplotlib inline is a Jupyter magic command, not a Python function. It renders plots directly below the code cell when you are working in a Jupyter Notebook.
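If you are working in a plain .py script rather than a notebook, the %matplotlib inline line is not valid Python and should simply be left out. As a quick sanity check of the installation, the sketch below prints the installed Matplotlib version and the backend it would use for drawing. The backend name you see depends on your environment, so no particular value is assumed here.

```python
import matplotlib
import matplotlib.pyplot as plt

# Confirm that the library imports correctly and report its version
print("Matplotlib version:", matplotlib.__version__)

# The backend decides how figures are rendered (inline, GUI window, file only)
print("Active backend:", matplotlib.get_backend())
```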

Creating a Box Plot in Matplotlib

Load the dataset

Prior to creating our graphs, let’s check out the dataset:

 
#Read the dataset
df = pd.read_csv('data.csv')
df.head()
 
#Check the number of rows and columns
df.shape

There are 33 columns in this dataset. Let’s print them all:

 
#List all column names
print(df.columns)

For our analysis, we are going to focus on the diagnosis and area_mean columns of the dataset.
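Before plotting, it can help to narrow the DataFrame down to the two columns of interest and take a quick look at them. The sketch below uses a tiny stand-in DataFrame with the same column names but made-up values, so that it runs without the CSV file. With the real data you would apply the same calls to df.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: same column names, made-up values
sample = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "diagnosis": ["M", "B", "B", "M", "B"],
    "area_mean": [1001.0, 566.3, 520.0, 1203.0, 477.1],
    "radius_mean": [17.99, 13.54, 13.08, 19.69, 12.45],
})

# Keep only the categorical and the numerical column we want to compare
subset = sample[["diagnosis", "area_mean"]]

# How many records fall into each diagnosis category
print(subset["diagnosis"].value_counts())

# Numerical summary of the feature we are going to plot
print(subset["area_mean"].describe())
```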

Plotting the data

Now, let’s plot a box plot using the plt.boxplot() function:

 
plt.boxplot(df['area_mean'])

The plot we have created would not be easy for someone else to understand without context, so let’s add a few elements to make it more readable:

  • plt.title() for setting a plot title
  • plt.xlabel() and plt.ylabel() for labeling the x-axis and y-axis respectively
  • plt.show() for displaying the plot
 
plt.boxplot(df['area_mean'])
plt.ylabel('Tumor Area Mean Values')
plt.title('Boxplot of Area Mean')
plt.show()

The plot above shows the spread of the data and indicates that values above roughly 1500 are drawn as outliers.
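Reading outlier values off the axis is only approximate. The boxplot() call returns a dictionary of the artists it drew, and the exact outlier values can be pulled from its 'fliers' entry. Here is a minimal sketch with made-up numbers (not the tumor data), assuming a single vertical box:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data with one obviously extreme value
data = np.array([10, 12, 13, 14, 15, 16, 18, 60])

fig, ax = plt.subplots()
result = ax.boxplot(data)

# 'fliers' holds one Line2D per box; its y-data are the outlier values
flier_values = result["fliers"][0].get_ydata()

# 'whiskers' holds two Line2D per box (lower, then upper);
# the second y-value of the upper whisker is where it ends
upper_whisker_end = result["whiskers"][1].get_ydata()[1]

print("Outliers:", flier_values)
print("Upper whisker ends at:", upper_whisker_end)
```

In this sample, 60 is the only point drawn as an outlier, and the upper whisker stops at 18, the largest value that is not an outlier.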

Optional Parameters of the Box Plot in Matplotlib

  • notch: A boolean. False (the default) draws a simple rectangular box, while True draws a notched box. The notches represent the confidence interval (CI) around the median:
 
#Parameter notch
plt.boxplot(df['area_mean'], notch=True)
plt.ylabel('Tumor Area Mean Values')
plt.title('Boxplot of Area Mean')
plt.show()

  • vert: A boolean. True (the default) draws a vertical plot, while False draws a horizontal one:
 
#Parameter vert
plt.boxplot(df['area_mean'], vert=False)
plt.xlabel('Tumor Area Mean Values')
plt.title('Boxplot of Area Mean')
plt.show()

  • patch_artist: A boolean, False by default, in which case the boxes are drawn with Line2D artists. If set to True, the boxes are drawn with Patch artists, which can be filled with color:
 
#Parameter patch_artist
plt.boxplot(df['area_mean'], patch_artist=True)
plt.ylabel('Tumor Area Mean Values')
plt.title('Boxplot of Area Mean')
plt.show()

  • manage_ticks: A boolean, True by default, in which case the tick locations and labels are adjusted to match the boxplot positions. If set to False, the ticks are left as they are:
 
#Parameter manage_ticks
plt.boxplot(df['area_mean'], manage_ticks=False)
plt.ylabel('Tumor Area Mean Values')
plt.title('Boxplot of Area Mean')
plt.show()

  • showmeans: A boolean. If set to True, the mean of the data is marked on the plot:
 
#Parameter showmeans
plt.boxplot(df['area_mean'], vert=False, showmeans=True)
plt.xlabel('Tumor Area Mean Values')
plt.title('Boxplot of Area Mean')
plt.show()

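These parameters are independent of each other, so several can be combined in a single call. The sketch below is one possible combination on randomly generated data (the values and labels are made up for illustration). It uses the object-oriented interface, where the same keyword arguments are passed to ax.boxplot().

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up, reproducible data standing in for a numerical column
rng = np.random.default_rng(0)
values = rng.normal(loc=650, scale=120, size=200)

fig, ax = plt.subplots()

# Notched box, filled with color, with the mean marked
result = ax.boxplot(values, notch=True, patch_artist=True, showmeans=True)

ax.set_ylabel("Value")
ax.set_title("Notched, filled box plot with mean marker")
```

With patch_artist=True, the entries in result["boxes"] are patches rather than lines, which is what makes it possible to fill them.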

Styling the box plot – Changing outlier marker colors

  • flierprops: This parameter is a dictionary that specifies the style of the fliers, that is, the outlier markers:
 
#Changing the outlier markers using parameter flierprops
dots = dict(markerfacecolor='red', marker='o')
plt.boxplot(df['area_mean'], vert=False, flierprops=dots)
plt.xlabel('Tumor Area Mean Values')
plt.title('Boxplot of Area Mean')
plt.show()

Styling the box plot – Changing mean marker colors

  • meanprops: This parameter is a dictionary that specifies the style of the mean marker:
 
#Styling the mean marker using parameter meanprops
mean_shape = dict(markerfacecolor='yellow', marker='D', markeredgecolor='green')
plt.boxplot(df['area_mean'], vert=False, showmeans=True, meanprops=mean_shape)
plt.xlabel('Tumor Area Mean Values')
plt.title('Boxplot of Area Mean')
plt.show()

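The same idea extends to the other parts of the plot. As a sketch under assumed styling choices (the colors, group names, and values below are arbitrary), medianprops and whiskerprops style the median line and whiskers, and with patch_artist=True each box can be given its own fill color through the returned dictionary:

```python
import matplotlib.pyplot as plt

# Two made-up groups of values, for illustration only
groups = [[12, 15, 14, 10, 18], [22, 25, 19, 30, 27]]

fig, ax = plt.subplots()
result = ax.boxplot(
    groups,
    patch_artist=True,                             # boxes become fillable patches
    medianprops=dict(color="black", linewidth=2),  # style of the median line
    whiskerprops=dict(linestyle="--"),             # style of the whiskers
)

# Give each box its own fill color
for box, color in zip(result["boxes"], ["lightblue", "lightgreen"]):
    box.set_facecolor(color)

ax.set_xticklabels(["Group A", "Group B"])
ax.set_title("Styled box plots")
```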

Grouping Box Plots in a Single Figure

Coming back to our objective, we have to analyze the relationship between a categorical feature (diagnosis: malignant or benign tumor) and a continuous numerical feature (area_mean).

We can do this in two ways:

  1. Using Pandas
 
#Plotting the boxplot using pandas
df.boxplot(column = 'area_mean', by = 'diagnosis')
plt.ylabel('Tumor Area Mean Values')
plt.title('')

  2. Using Matplotlib
 
#Creating Series
malignant = df[df['diagnosis']=='M']['area_mean']
benign = df[df['diagnosis']=='B']['area_mean']
#Plotting the boxplot using matplotlib
fig = plt.figure()
ax = fig.add_subplot(111)
ax.boxplot([malignant,benign], labels=['M', 'B'])
plt.ylabel('Tumor Area Mean Values')
plt.title('Boxplot grouped by Diagnosis')

As you can see, malignant tumors tend to have a higher area mean than benign ones, and their values are spread over a wider range.
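When a categorical column has more than two levels, writing one filter per category becomes tedious. A possible alternative is to let pandas split the data with groupby() and pass the pieces to Matplotlib. The sketch below uses a small stand-in DataFrame with made-up values. It labels the boxes with set_xticklabels() rather than the labels argument, because newer Matplotlib releases appear to prefer a differently named argument (tick_labels), a detail worth checking against your installed version.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the real dataset: same column names, made-up values
sample = pd.DataFrame({
    "diagnosis": ["M", "B", "B", "M", "B", "M"],
    "area_mean": [1001.0, 566.3, 520.0, 1203.0, 477.1, 1297.0],
})

# Collect one array of values per category, in sorted category order
names, series = [], []
for name, group in sample.groupby("diagnosis")["area_mean"]:
    names.append(name)
    series.append(group.to_numpy())

fig, ax = plt.subplots()
ax.boxplot(series)
ax.set_xticklabels(names)
ax.set_ylabel("Tumor Area Mean Values")
ax.set_title("Boxplot grouped by Diagnosis")

# A numerical companion to the picture: the median per group
medians = sample.groupby("diagnosis")["area_mean"].median()
print(medians)
```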

Saving your Box Plot

You can save your plot as an image using the savefig() function. Plots can be saved as .png, .jpeg, .pdf, and many other supported formats.

Let’s try saving the ‘Boxplot grouped by Diagnosis’ plot we have created above:

 
fig = plt.figure()
ax = fig.add_subplot(111)
ax.boxplot([malignant,benign], labels=['M', 'B'])
plt.ylabel('Tumor Area Mean Values')
plt.title('Boxplot grouped by Diagnosis')
fig.savefig('boxplot.png')

The image is saved in the current working directory with the filename ‘boxplot.png’.
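savefig() infers the output format from the file extension and accepts a few options that are often useful for reports. The sketch below is one way to use them. The file names are hypothetical, the data is made up, and the files are written to a temporary folder so that nothing in your project is overwritten.

```python
import os
import tempfile
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.boxplot([3, 5, 6, 7, 9, 20])  # made-up values
ax.set_title("Demo box plot")

# Write into a temporary directory (hypothetical file names)
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "boxplot_demo.png")
pdf_path = os.path.join(out_dir, "boxplot_demo.pdf")

# dpi raises the resolution; bbox_inches="tight" trims surrounding whitespace
fig.savefig(png_path, dpi=300, bbox_inches="tight")

# The .pdf extension is enough to switch to a vector format
fig.savefig(pdf_path)

plt.close(fig)
print(os.path.exists(png_path), os.path.exists(pdf_path))
```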

To view the saved image, we’ll use the matplotlib.image module, as shown below:

 
#Displaying the saved image
import matplotlib.image as mpimg
image = mpimg.imread("boxplot.png")
plt.imshow(image)
plt.show()

Outlier Detection using Box Plots

An outlier is an observation that deviates markedly from the other observations in the data. Outliers can be genuine anomalies or errors. So, to decide whether to ignore the outliers, we need to identify them first. Box plots are an excellent statistical tool to visualize outliers.

Let’s take a separate example. We have some data on testosterone levels in males, as given below:

Patients    Testosterone Level (ng/dL)
Male 1      683
Male 2      540
Male 3      938
Male 4      67
Male 5      712
Male 6      594
Male 7      429
Male 8      491
Male 9      803

Let’s create a box plot for this data:

 
import pandas as pd
data = {'Patients': ['Male 1','Male 2','Male 3','Male 4','Male 5','Male 6', 'Male 7','Male 8','Male 9'],
'Testosterone': [683,540,938,67,712,594,429,491,803]}
df = pd.DataFrame(data)
df

import matplotlib.pyplot as plt
plt.boxplot(df['Testosterone'])
plt.show()

As you can see, one outlier is clearly visible in this box plot, and it could easily be removed if needed. Its exact value cannot be read from the plot, but we can tell that it is lower than 200. So, let’s filter the data for values below 200:

 
#Filter the rows with a testosterone level below 200
df[df['Testosterone'] < 200]

There! The box plot detects the value 67 as an outlier in the dataset. Whether this outlier is an anomaly or not is a different question that has to be answered separately, using additional techniques and domain knowledge.
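The threshold of 200 was read off the plot by eye. By default, Matplotlib draws a point as an outlier when it lies more than 1.5 times the IQR beyond the box, and the same rule can be applied numerically. Here is a minimal sketch using the testosterone values from the table above. The quartiles use pandas' default linear interpolation.

```python
import pandas as pd

# Testosterone levels (ng/dL) from the table above
levels = pd.Series([683, 540, 938, 67, 712, 594, 429, 491, 803])

q1 = levels.quantile(0.25)
q3 = levels.quantile(0.75)
iqr = q3 - q1

# Fences of the 1.5 * IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Any value outside the fences is flagged as an outlier
outliers = levels[(levels < lower_fence) | (levels > upper_fence)]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers.tolist())
```

Here Q1 is 491 and Q3 is 712, so the IQR is 221 and the fences sit at 159.5 and 1043.5. The value 67 is the only one that falls outside them, which matches what the box plot shows.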

Endnotes

The box plot is an underrated tool that can summarize a lot of information about your data in a single visualization. When performing exploratory data analysis (EDA), box plots can be a great complement to histograms. Matplotlib is one of the oldest Python visualization libraries and provides a wide variety of charts and plots for better analysis. Interested in learning more about Data Visualization using Python? Explore related articles here.
