Tutorial – Box Plot in Matplotlib
Matplotlib, Python’s most popular visualization library, supports many useful graphical visualizations. This article focuses on box plots, a graphical technique used to examine the distribution of your data.
Exploratory Data Analysis (EDA) is an important step in data analysis when working on a machine learning or data science project. EDA helps summarize the main characteristics of your data, mostly employing data visualization methods. Let’s see how to perform EDA with a box plot in Matplotlib.
We will be covering the following sections:
- Quick Intro to Box Plots
- Installing and Importing Matplotlib
- Creating a Matplotlib Box Plot
- Adding Elements to the Box Plot
- Optional Parameters of the Box Plot
- Grouping Box Plots in a Single Figure
- Saving your Box Plot
- Outlier Detection using Box Plots
Quick Intro to Box Plots
A box plot visually represents the statistical summary of an attribute (a column) in a dataset. Each box plot displays the following:
- Minimum (the lowest value that is not an outlier)
- First Quartile (Q1)
- Median
- Third Quartile (Q3)
- Maximum (the highest value that is not an outlier)
- Interquartile Range (IQR)
- Outliers, if any
Just as a median breaks the dataset in half, quartiles are used to tell us about the spread and skewness of data by breaking the dataset into quarters.
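To make the quartile arithmetic concrete, here is a small sketch using only Python’s standard library. The sample values are hypothetical and chosen just for illustration; the "inclusive" method interpolates linearly between data points, which, as far as we know, matches the way Matplotlib computes quartiles by default.

```python
import statistics

# Hypothetical sample, used only to illustrate the arithmetic
values = [12, 15, 17, 19, 20, 22, 24, 27, 60]

# n=4 splits the data into quarters and returns the three cut points
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")

# The interquartile range is the width of the box in a box plot
iqr = q3 - q1

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
# Q1=17.0, median=20.0, Q3=24.0, IQR=7.0
```

Here the median sits slightly closer to Q1 than to Q3, which hints at a mild right skew in this sample.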
Additionally, you can choose to display the mean and standard deviation of your data. Box plots are also called ‘Box and Whisker plot’ and are particularly useful for comparing distributions across groups.
Now, let’s see how to create a box plot to analyze the distribution of variables in a given dataset. The dataset used in this blog can be found here. It contains information on breast cancer: each record provides the patient’s ID number and diagnosis, along with ten real-valued features computed for each cell nucleus.
We are going to analyze the relationship between a categorical feature (diagnosis: malignant or benign tumor) and a continuous numerical feature (area_mean).
Let’s get started!
Installing and Importing Matplotlib
First, let’s install the Matplotlib library in your working environment. Execute the following command in your terminal:
pip install matplotlib
Now let’s import the libraries we’re going to need today:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
In Matplotlib, pyplot is used to create figures and change their characteristics.
The %matplotlib inline magic command makes plots render directly below the cell when using Jupyter Notebook.
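Note that %matplotlib inline is an IPython magic, so it is only valid inside Jupyter or IPython; in a plain .py script it is a syntax error. Below is a minimal sketch of an equivalent setup for a script, assuming you want to render plots to files without opening a window. The choice of the Agg backend is an assumption made for this sketch, not part of the original tutorial.

```python
import matplotlib

# Select a non-interactive backend; doing this before importing pyplot
# is the safest order
matplotlib.use("Agg")

import matplotlib.pyplot as plt

# Hypothetical values, just to confirm that plotting works without a display
fig, ax = plt.subplots()
ax.boxplot([1, 2, 3, 4, 5])

print(matplotlib.get_backend())
plt.close(fig)
```

In an interactive desktop session you would typically skip the matplotlib.use() call and finish with plt.show() instead.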
Creating a Box Plot in Matplotlib
Load the dataset
Prior to creating our graphs, let’s check out the dataset:
# Read the dataset
df = pd.read_csv('data.csv')
df.head()
# Check out the number of columns
df.shape
There are 33 columns in this dataset. Let’s print them all:
# List all column names
print(df.columns)
Based on our requirement, our focus is going to be on the diagnosis and area_mean columns from the dataset.
Plotting the data
Now, let’s plot a box plot using the plt.boxplot() function:
plt.boxplot(df['area_mean'])
Let’s add a few elements here to help us interpret the visualization in a better way.
Adding Elements to the Box Plot in Matplotlib
The plot we have created would not be easily understandable to a third pair of eyes without context, so let’s try to add different elements to make it more readable:
- plt.title() for setting a plot title
- plt.xlabel() and plt.ylabel() for labeling x and y-axis respectively
- plt.show() for displaying the plot
plt.boxplot(df['area_mean'])
plt.ylabel('Tumor Area Mean Values')
plt.title('Boxplot of Area Mean')
plt.show()
The plot above shows the spread of the data and flags values above roughly 1500 as outliers.
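Rather than estimating outlier values by eye, you can read them from the dictionary of artists that boxplot() returns. The sketch below uses a hypothetical stand-in list instead of df['area_mean'] so that it runs without the CSV file; the Agg backend is likewise an assumption so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Hypothetical stand-in for df['area_mean']; not values from the dataset
area_mean = [380, 420, 450, 500, 550, 600, 650, 700, 2500]

fig, ax = plt.subplots()
parts = ax.boxplot(area_mean)

# boxplot() returns a dict of artists: 'boxes', 'medians', 'whiskers',
# 'caps', 'fliers' and 'means'. For a vertical box, the y-data of the
# flier artist holds the outlier values.
outliers = [float(v) for v in parts["fliers"][0].get_ydata()]
median = float(parts["medians"][0].get_ydata()[0])

print("median:", median)
print("outliers:", outliers)
plt.close(fig)
```

With the real dataset, passing df['area_mean'] in place of the list should give the exact values of the points drawn above the upper whisker.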
Optional Parameters of the Box Plot in Matplotlib
- notch: This Boolean parameter is False for a simple rectangular box and True for a notched box. The notches represent the confidence interval (CI) around the median:
# Parameter notch
plt.boxplot(df['area_mean'], notch=True)
plt.ylabel('Tumor Area Mean Values')
plt.title('Boxplot of Area Mean')
plt.show()
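The notch interval can also be inspected numerically. Here is a minimal sketch using matplotlib.cbook.boxplot_stats, the helper that computes the statistics behind a box plot, on a hypothetical sample. The formula in the comment (median ± 1.57 × IQR / √n) is our understanding of the default, non-bootstrapped interval and should be treated as an assumption rather than something stated in this tutorial.

```python
import math
from matplotlib import cbook

# Hypothetical sample; replace with df['area_mean'] for the real data
sample = [380, 420, 450, 500, 550, 600, 650, 700, 2500]

# boxplot_stats returns one dict of statistics per dataset
stats = cbook.boxplot_stats(sample)[0]

print("median:", stats["med"])
print("notch interval:", stats["cilo"], "to", stats["cihi"])

# Assumed default: half-width = 1.57 * IQR / sqrt(n)
half_width = 1.57 * stats["iqr"] / math.sqrt(len(sample))
print("expected half-width:", half_width)
```

When the notches of two boxes do not overlap, this is commonly read as evidence that the medians differ, although it is a visual guide and not a formal test.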
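- vert: This Boolean parameter draws the box vertically when True (the default) and horizontally when False:
# Parameter vert
plt.boxplot(df['area_mean'], vert=False)
plt.xlabel('Tumor Area Mean Values')
plt.title('Boxplot of Area Mean')
plt.show()
- patch_artist: When set to True, the box is drawn as a filled patch instead of a line outline:
# Parameter patch_artist
plt.boxplot(df['area_mean'], patch_artist=True)
plt.ylabel('Tumor Area Mean Values')
plt.title('Boxplot of Area Mean')
plt.show()
- manage_ticks: When True (the default), the tick locations and labels are adjusted to match the box plot positions; False leaves them untouched:
# Parameter manage_ticks
plt.boxplot(df['area_mean'], manage_ticks=False)
plt.ylabel('Tumor Area Mean Values')
plt.title('Boxplot of Area Mean')
plt.show()
- showmeans: When set to True, this Boolean parameter displays the mean on the plot:
# Parameter showmeans
plt.boxplot(df['area_mean'], vert=False, showmeans=True)
plt.xlabel('Tumor Area Mean Values')
plt.title('Boxplot of Area Mean')
plt.show()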
Styling the box plot – Changing outlier marker colors
- flierprops: This parameter is a dictionary that specifies the style of the fliers (the outlier markers):
# Changing the outlier markers using parameter flierprops
dots = dict(markerfacecolor='red', marker='o')
plt.boxplot(df['area_mean'], vert=False, flierprops=dots)
plt.xlabel('Tumor Area Mean Values')
plt.title('Boxplot of Area Mean')
plt.show()
Styling the box plot – Changing mean marker colors
- meanprops: This parameter is a dictionary that specifies the style of the mean marker:
# Adding the mean using parameter meanprops
mean_shape = dict(markerfacecolor='yellow', marker='D', markeredgecolor='green')
plt.boxplot(df['area_mean'], vert=False, showmeans=True, meanprops=mean_shape)
plt.xlabel('Tumor Area Mean Values')
plt.title('Boxplot of Area Mean')
plt.show()
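The same dictionary pattern applies to the other parts of the plot through boxprops, medianprops, whiskerprops and capprops. The sketch below is one possible styling on hypothetical values; the colors and line styles are arbitrary choices for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Hypothetical values standing in for df['area_mean']
values = [380, 420, 450, 500, 550, 600, 650, 700, 2500]

fig, ax = plt.subplots()
parts = ax.boxplot(
    values,
    patch_artist=True,  # required for the box to accept a face color
    boxprops=dict(facecolor="lightblue", edgecolor="navy"),
    medianprops=dict(color="black", linewidth=2),
    whiskerprops=dict(linestyle="--"),
)
ax.set_ylabel("Tumor Area Mean Values")
ax.set_title("Boxplot of Area Mean")
plt.close(fig)
```

Setting the style at call time keeps all the appearance choices in one place, which tends to be easier to maintain than modifying the returned artists afterwards.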
Grouping Box Plots in a Single Figure
Coming back to our objective, we have to analyze the relationship between a categorical feature (diagnosis: malignant or benign tumor) and a continuous numerical feature (area_mean).
We can do this in two ways:
- Using Pandas
# Plotting the boxplot using pandas
df.boxplot(column='area_mean', by='diagnosis')
plt.ylabel('Tumor Area Mean Values')
plt.title('')
- Using Matplotlib
# Creating Series
malignant = df[df['diagnosis'] == 'M']['area_mean']
benign = df[df['diagnosis'] == 'B']['area_mean']
# Plotting the boxplot using matplotlib
fig = plt.figure()
ax = fig.add_subplot(111)
# Note: Matplotlib 3.9 and later rename the labels parameter to tick_labels
ax.boxplot([malignant, benign], labels=['M', 'B'])
plt.ylabel('Tumor Area Mean Values')
plt.title('Boxplot grouped by Diagnosis')
As you can see, malignant tumors show a wider spread of area_mean values and tend to have a higher area mean than benign tumors.
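The visual comparison can also be checked numerically with a grouped summary. The sketch below uses a handful of hypothetical rows in place of data.csv, so the printed numbers are illustrative only; with the real DataFrame the same groupby call should apply unchanged.

```python
import pandas as pd

# Hypothetical stand-in rows; the real values come from data.csv
df = pd.DataFrame({
    "diagnosis": ["M", "M", "M", "B", "B", "B", "B"],
    "area_mean": [1001.0, 1326.0, 1203.0, 566.3, 520.0, 273.9, 704.4],
})

# One row per diagnosis with the statistics a box plot visualizes
summary = df.groupby("diagnosis")["area_mean"].agg(
    ["count", "min", "median", "max"]
)
print(summary)
```

Comparing the median and the min/max range per group gives the same reading as comparing the two boxes side by side.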
Saving your Box Plot
You can save your plot as an image using the savefig() function. Plots can be saved as .png, .jpeg, .pdf, and many other supported formats.
Let’s try saving the ‘Boxplot grouped by Diagnosis’ plot we have created above:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.boxplot([malignant, benign], labels=['M', 'B'])
plt.ylabel('Tumor Area Mean Values')
plt.title('Boxplot grouped by Diagnosis')
fig.savefig('boxplot.png')
The image is saved in the current working directory with the filename ‘boxplot.png’.
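savefig() also accepts options that control the output quality. Here is a minimal sketch, assuming hypothetical data and a temporary output folder, that saves the same figure as a high-resolution PNG and as a PDF and lists the formats the current backend supports.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.boxplot([[1, 2, 3, 4, 10], [2, 3, 4, 5, 6]])  # hypothetical data
ax.set_xticklabels(["M", "B"])
ax.set_title("Boxplot grouped by Diagnosis")

# Hypothetical output location; use any folder you like
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "boxplot.png")
pdf_path = os.path.join(out_dir, "boxplot.pdf")

# dpi sets the raster resolution; bbox_inches="tight" trims extra whitespace
fig.savefig(png_path, dpi=300, bbox_inches="tight")
# The output format is inferred from the file extension
fig.savefig(pdf_path, bbox_inches="tight")

# List the file types the active backend can write
print(sorted(fig.canvas.get_supported_filetypes()))
plt.close(fig)
```

A vector format such as PDF usually scales better in reports, while PNG is convenient for web pages and notebooks.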
To view the saved image, we’ll use the matplotlib.image module, as shown below:
# Displaying the saved image
import matplotlib.image as mpimg
image = mpimg.imread("boxplot.png")
plt.imshow(image)
plt.show()
Outlier Detection using Box Plots
An outlier is an observation that deviates markedly from the other observations in the data. Outliers can be genuine anomalies or errors. So, to decide whether to ignore the outliers, we need to identify them first. Box plots are an excellent statistical tool to visualize outliers.
Let’s take a separate example. We have some data on testosterone levels in males, as given below:
Patients | Testosterone Levels in ng/dL |
Male 1 | 683 |
Male 2 | 540 |
Male 3 | 938 |
Male 4 | 67 |
Male 5 | 712 |
Male 6 | 594 |
Male 7 | 429 |
Male 8 | 491 |
Male 9 | 803 |
Let’s create a box plot for this data:
import pandas as pd
data = {'Patients': ['Male 1', 'Male 2', 'Male 3', 'Male 4', 'Male 5', 'Male 6', 'Male 7', 'Male 8', 'Male 9'],
        'Testosterone': [683, 540, 938, 67, 712, 594, 429, 491, 803]}
df = pd.DataFrame(data)
df
import matplotlib.pyplot as plt
plt.boxplot(df['Testosterone'])
As you can see, one outlier is clearly visible in this box plot which can easily be removed. The exact value of the outlier is not known from the plot, but we know that it is lower than 200. So, let’s filter the outlier value:
# Show the rows below the threshold read off the plot
df[df['Testosterone'] < 200]
There! The box plot detects the value 67 as an outlier in the dataset. Whether this outlier is an anomaly or not is a separate question that has to be answered using additional techniques and domain knowledge.
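The cutoff of 200 was read off the plot by eye. As a sketch of a less manual approach, the limits can be computed with the 1.5 × IQR rule, which is the convention Matplotlib’s default whisker setting (whis=1.5) follows. The variable names below are hypothetical.

```python
import pandas as pd

levels = pd.Series(
    [683, 540, 938, 67, 712, 594, 429, 491, 803], name="Testosterone"
)

# Quartiles and interquartile range
q1, q3 = levels.quantile(0.25), levels.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box edges are treated as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (levels < lower) | (levels > upper)

print(lower, upper)                 # 159.5 1043.5
print(levels[is_outlier].tolist())  # [67]

# Keep only the values inside the limits
cleaned = levels[~is_outlier]
```

Note that 938 stays inside the upper limit, so only 67 is flagged, which agrees with the single point drawn below the lower whisker.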
Endnotes
The box plot is an underrated tool that can summarize a lot of information about your data in a single visualization. When performing exploratory data analysis (EDA), box plots can be a great complement to histograms. Matplotlib is one of the oldest Python visualization libraries and provides a wide variety of charts and plots for better analysis.