Tutorial – Matplotlib Histogram – Shiksha Online

Updated on Mar 4, 2022 16:15 IST

When you work on a machine learning project, you almost always perform data analysis, and in doing so you are bound to come across the good old histogram. A histogram represents the frequency (count) or proportion (count/total count) of cases for continuous data. Matplotlib, Python’s data visualization library, supports many useful chart types for creating visualizations. In this article, we are going to look at creating histograms with Matplotlib.

Quick Intro to Histograms

A histogram is used to visualize the distribution of one-dimensional numerical data. Histograms plot the frequencies of the data rather than the individual values. They do this by dividing the entire range of values into a series of intervals called bins, counting the number of values that fall in each bin, and drawing one bar per bin whose height is that count.
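
To make the binning step concrete, here is a minimal sketch using NumPy's `histogram` function. The income values are made up for illustration and are not taken from the article's dataset; the choice of 4 bins over the range 0 to 10000 is also just an assumption for the demo.

```python
import numpy as np

# Hypothetical monthly incomes, invented for illustration only
incomes = np.array([1200, 1800, 2500, 2600, 3100, 4800, 5200, 7400, 9900, 10000])

# Split the range 0-10000 into 4 equal-width bins and count the values in each
counts, edges = np.histogram(incomes, bins=4, range=(0, 10000))

# Proportion = count / total count
proportions = counts / counts.sum()

# Print one row per bin: its interval, its count, and its proportion
for left, right, c, p in zip(edges[:-1], edges[1:], counts, proportions):
    print(f"{left:>7.0f} to {right:>7.0f}: count={c}, proportion={p:.2f}")
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both. That is why 2500 lands in the second bin while 10000 still counts in the last one.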

We will understand how this is done through a fun example. The dataset used in this blog contains information on employees working for a company. We need to find out the percentage distribution of employees’ monthly income in this company. 

Let’s get started!

Installing and Importing Matplotlib

First, let’s install the Matplotlib library in your working environment. Execute the following command in your terminal:

 
pip install matplotlib

Now let’s import the libraries we’re going to need today:

 
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In Matplotlib, pyplot is used to create figures and change their characteristics.

%matplotlib inline is an IPython magic command, not a Python function. It makes plots render inside the notebook when using Jupyter Notebook; leave it out when running a plain Python script.

Creating a Matplotlib Histogram

Load the dataset

Prior to creating our graphs, let’s check out the dataset:

 
#Read the dataset
df = pd.read_csv('company.csv')
df.head()
 
#Check the number of rows and columns
#shape is an attribute, not a method; it returns (rows, columns)
df.shape

There are 35 columns (or features) in this dataset. Let’s print them all:

 
#List all column names
print(df.columns)

Based on our requirement, our focus is going to be on the MonthlyIncome column from the dataset. 

Plotting the data

Now, let’s plot a Matplotlib histogram using the plt.hist() function:

plt.hist(df['MonthlyIncome'])

Although the above plot gives a general idea of the distribution of employees’ income, it has no title, axis labels, or legend, so it is hard to read much from it just yet.

Let’s add a few elements here to help us interpret the visualization in a better way. 

Adding Elements to Matplotlib Histogram

The plot we have created would not be easily understandable to a third pair of eyes without context, so let’s try to add different elements to make it more readable:

  • Use plt.title() for setting a plot title
  • Use plt.xlabel() and plt.ylabel() for labeling x and y-axis respectively
  • Use plt.legend() for showing a legend built from the label passed to plt.hist()
  • Use plt.show() for displaying the plot
 
plt.hist(df['MonthlyIncome'], label='Employees’ Income')
plt.ylabel('Frequency')
plt.xlabel('Monthly Income')
plt.title('Income Distribution')
plt.legend()
plt.show()
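
The y-axis above shows raw counts. If you want the percentage of employees in each bin instead, one option is the weights parameter of plt.hist(). The sketch below uses synthetic incomes (generated randomly, not the article's dataset) and selects the non-interactive Agg backend so it runs outside a notebook; both are assumptions made for the demo.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

# Synthetic incomes standing in for df['MonthlyIncome']
rng = np.random.default_rng(0)
incomes = rng.lognormal(mean=8.5, sigma=0.6, size=500)

# Each observation contributes 100/N, so bar heights become percentages
weights = np.full(len(incomes), 100.0 / len(incomes))

heights, edges, _ = plt.hist(incomes, bins=20, weights=weights, edgecolor='k')
plt.ylabel('Percentage of employees')
plt.xlabel('Monthly Income')
plt.title('Income Distribution')

# The bar heights add up to 100 (up to floating-point rounding)
print(round(heights.sum(), 6))
```

In an interactive session, finish with plt.show() as in the rest of this article.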

Parameters of Matplotlib Histogram

Firstly, let’s specify the bins in our graph through the bins parameter:

  • If bins is an integer, it defines the number of equal-width bins in the range
  • If bins is a sequence, it defines the bin edges
 
plt.hist(df['MonthlyIncome'], label='Employees’ Income', bins=20)
plt.ylabel('Frequency')
plt.xlabel('Monthly Income')
plt.title('Income Distribution')
plt.legend()
plt.show()
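
The example above passes bins as an integer. To illustrate the second case, where bins is a sequence of edges, here is a small sketch. The data is synthetic and the edge values are chosen arbitrarily, so treat both as assumptions rather than recommendations.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

# Synthetic incomes between 1000 and 20000
rng = np.random.default_rng(1)
incomes = rng.uniform(1000, 20000, size=300)

# Five edges define four bins of unequal width
edges_in = [1000, 2500, 5000, 10000, 20000]
counts, edges_out, _ = plt.hist(incomes, bins=edges_in, edgecolor='k')
plt.xlabel('Monthly Income')
plt.ylabel('Frequency')

print(counts)     # one count per bin
print(edges_out)  # the same edges that were passed in
```

Unequal bin widths make bar heights harder to compare at a glance, so this is mostly useful when the edges carry meaning, such as salary bands.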

  • The edgecolor parameter is used to highlight the bin edges with the specified color:
 
plt.hist(df['MonthlyIncome'],label='Employees’ Income',bins=20, edgecolor='r')
plt.ylabel('Frequency')
plt.xlabel('Monthly Income')
plt.title('Income Distribution')
plt.legend()
plt.show()

The histtype parameter specifies the type of histogram:

  • bar (default) 
  • barstacked
  • step
  • stepfilled 
 
plt.hist(df['MonthlyIncome'],label='Employees’ Income',bins=20, edgecolor='r', histtype='step')
plt.ylabel('Frequency')
plt.xlabel('Monthly Income')
plt.title('Income Distribution')
plt.legend()
plt.show()

  • The range parameter specifies the lower and upper range of the bins. Values outside this range are ignored, and range has no effect if bins is a sequence:

 
plt.hist(df['MonthlyIncome'], label='Employees’ Income', bins=20, edgecolor='r', range=[1000,25000])
plt.ylabel('Frequency')
plt.xlabel('Monthly Income')
plt.title('Income Distribution')
plt.legend()
plt.show()
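
To see what range does to values outside the interval, here is a minimal sketch using NumPy's `histogram` function, which Matplotlib relies on for binning. The numbers are made up for illustration.

```python
import numpy as np

# Made-up incomes; 500 and 30000 fall outside the range used below
incomes = np.array([500, 1500, 3000, 12000, 24000, 30000])

# Four equal-width bins between 1000 and 25000
counts, edges = np.histogram(incomes, bins=4, range=(1000, 25000))

print(edges)         # bin edges span exactly the requested range
print(counts)        # values outside the range are not counted
print(counts.sum())  # 4 of the 6 values were counted
```

So range=[1000,25000] in the plot above silently drops any income below 1000 or above 25000 rather than clipping it into the first or last bin.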

  • Histograms are vertical by default, but you can change this using the orientation parameter:
 
plt.hist(df['MonthlyIncome'],label='Employees’ Income',bins=20, edgecolor='r', orientation='horizontal')
plt.title('Income Distribution')
plt.xlabel('Frequency')
plt.ylabel('Monthly Income')
plt.legend()
plt.show()

Let’s enlarge our graph to view it clearly:

  • We’ll specify the figsize parameter in the figure() function to set the dimensions of the figure in inches.
 
plt.figure(1, figsize=(15,7))
plt.hist(df['MonthlyIncome'],label='Employees’ Income',bins=20, edgecolor='r', orientation='horizontal')
plt.title('Income Distribution')
plt.xlabel('Frequency')
plt.ylabel('Monthly Income')
plt.legend()
plt.show()
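
As an alternative to plt.figure(), the same figure size can be set through plt.subplots(), which returns the figure and axes objects explicitly. This is only a sketch of that style with a handful of made-up values, not a replacement for the approach above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt

# Made-up incomes for illustration
incomes = [2100, 2500, 2600, 3000, 4500, 5200, 7400, 11000]

# figsize is (width, height) in inches
fig, ax = plt.subplots(figsize=(15, 7))
ax.hist(incomes, bins=5, edgecolor='r', orientation='horizontal')
ax.set_title('Income Distribution')
ax.set_xlabel('Frequency')
ax.set_ylabel('Monthly Income')

# Confirm the dimensions that were applied
width, height = fig.get_size_inches()
print(width, height)
```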

From the above plot, we can make some inferences: 

  1. The majority of the employees in the company are earning around $2500 monthly
  2. Most of the employees have a salary range of $2500-7500
  3. Few employees command a salary higher than $10,000
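
Inferences like these are read off the plot by eye, so it can help to back them with numbers. A minimal sketch, using a short made-up Series in place of df['MonthlyIncome'] (the values and thresholds are assumptions for the demo):

```python
import pandas as pd

# Hypothetical incomes standing in for df['MonthlyIncome']
income = pd.Series([2100, 2400, 2600, 3000, 4500, 5200, 6800, 7400, 11000, 16000])

# Fraction of employees earning between 2500 and 7500 (both ends inclusive)
share_mid = income.between(2500, 7500).mean()

# Fraction of employees earning more than 10000
share_high = (income > 10000).mean()

print(f"2500-7500: {share_mid:.0%}, above 10000: {share_high:.0%}")
```

The mean of a boolean Series is the share of True values, which is why .mean() gives a fraction here.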

Now, what if you want to plot the distribution of monthly incomes department-wise? Let’s try doing that through histograms too!

Let’s group the dataset according to the Department column:

 
dept = df.groupby('Department')
dept.first()

Do you see the three departments shown above? Let’s group our original dataset based on the three departments separately:

 
sales = dept.get_group('Sales')
hr = dept.get_group('Human Resources')
rd = dept.get_group('Research & Development')
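
If groupby() and get_group() are new to you, the following sketch shows what they return on a tiny made-up DataFrame. The rows are invented; only the column names and department labels follow the article.

```python
import pandas as pd

# Tiny made-up dataset with the two columns used in this section
toy = pd.DataFrame({
    'Department': ['Sales', 'Human Resources', 'Research & Development',
                   'Sales', 'Research & Development'],
    'MonthlyIncome': [4000, 3000, 5000, 6000, 7000],
})

dept = toy.groupby('Department')

# get_group returns only the rows belonging to one department
sales = dept.get_group('Sales')

# size() counts the rows in each group
sizes = dept.size()

# A boolean filter selects the same rows without grouping first
sales_alt = toy[toy['Department'] == 'Sales']

print(sales)
print(sizes)
```

For a single department, the boolean filter is often the shorter route; groupby() pays off when you need several groups or per-group summaries.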

Let’s plot a histogram for the ‘Sales’ department:

 
plt.hist(sales['MonthlyIncome'], bins=10, histtype='bar', edgecolor='k', label='Sales')
plt.title('Income Distribution')
plt.ylabel('Frequency')
plt.xlabel('Monthly Income')
plt.legend()
plt.show()

Now, let’s try plotting histograms for all three departments on a single plot:

 
plt.hist(sales['MonthlyIncome'], bins=10, histtype='bar', edgecolor='k', label='Sales')
plt.hist(hr['MonthlyIncome'], bins=10, histtype='bar', edgecolor='k', label='Human Resources')
plt.hist(rd['MonthlyIncome'], bins=10, histtype='bar', edgecolor='k', label='R&D')
plt.title('Income Distribution of Departments')
plt.ylabel('Frequency')
plt.xlabel('Monthly Income')
plt.legend()
plt.show()

Oops! The plots for ‘Sales’ and ‘Human Resources’ are hidden behind the ‘R&D’ histogram, which is drawn last. Wouldn’t it be helpful if the histograms were see-through?

  • The alpha parameter takes a float between 0 (fully transparent) and 1 (fully opaque) and specifies the transparency of each histogram
 
plt.hist(sales['MonthlyIncome'], bins=10, histtype='bar', edgecolor='k', label='Sales', alpha=0.8)
plt.hist(hr['MonthlyIncome'], bins=10, histtype='bar', edgecolor='k', label='Human Resources', alpha=0.5)
plt.hist(rd['MonthlyIncome'], bins=10, histtype='bar', edgecolor='k', label='R&D', alpha=0.2)
plt.title('Income Distribution of Departments')
plt.ylabel('Frequency')
plt.xlabel('Monthly Income')
plt.legend()
plt.show()
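
One caveat when overlaying histograms: with bins=10 in each call, every department gets its own bin edges computed from its own minimum and maximum, so the bars do not line up. A possible fix is to compute one shared set of edges and pass it to every call. The sketch below does this with NumPy's `histogram_bin_edges` on synthetic data; the three arrays are random stand-ins, not the article's departments.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for the three departments' incomes
rng = np.random.default_rng(42)
sales = rng.uniform(2000, 12000, size=200)
hr = rng.uniform(1500, 9000, size=50)
rd = rng.uniform(1000, 20000, size=400)

# One set of 10 equal-width bins spanning all three groups
shared_edges = np.histogram_bin_edges(np.concatenate([sales, hr, rd]), bins=10)

# Plot every group against the same edges and keep the edges each call used
used_edges = {}
for data, name, a in [(sales, 'Sales', 0.8), (hr, 'Human Resources', 0.5), (rd, 'R&D', 0.2)]:
    counts, edges, _ = plt.hist(data, bins=shared_edges, alpha=a, edgecolor='k', label=name)
    used_edges[name] = edges

plt.title('Income Distribution of Departments')
plt.ylabel('Frequency')
plt.xlabel('Monthly Income')
plt.legend()
```

With shared edges, a taller bar in one department means more employees in exactly the same income interval, which makes the comparison fair.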

  • Let’s change the colors of the histograms using the color parameter:
 
plt.hist(sales['MonthlyIncome'], bins=10, histtype='bar', edgecolor='k', label='Sales', alpha=0.8)
plt.hist(hr['MonthlyIncome'], bins=10, histtype='bar', edgecolor='k', label='Human Resources', alpha=0.5, color = 'r')
plt.hist(rd['MonthlyIncome'], bins=10, histtype='bar', edgecolor='k', label='R&D', alpha=0.2)
plt.title('Income Distribution of Departments')
plt.ylabel('Frequency')
plt.xlabel('Monthly Income')
plt.legend()
plt.show()

The plot looks attractive as well as informative, doesn’t it now? We can again make some inferences from the above plot: 

  1. We can infer that the R&D department has the highest number of employees within the salary range of $2000-$7000
  2. Few employees command salaries higher than $10,000 and even then, most of them are from the R&D department
  3. The Human Resources department has the lowest count of employees

Saving your Matplotlib Histogram

You can save your plot as an image using the savefig() function. Plots can be saved as .png, .jpeg, .pdf, and many other supported formats.

Let’s try saving the ‘Income Distribution of Departments’ plot we have created above:

 
fig = plt.figure(1, figsize=(15,7))
plt.hist(sales['MonthlyIncome'], bins=10, histtype='bar', edgecolor='k', label='Sales', alpha=0.8)
plt.hist(hr['MonthlyIncome'], bins=10, histtype='bar', edgecolor='k', label='Human Resources', alpha=0.5, color = 'r')
plt.hist(rd['MonthlyIncome'], bins=10, histtype='bar', edgecolor='k', label='R&D', alpha=0.2)
plt.title('Income Distribution of Departments')
plt.ylabel('Frequency')
plt.xlabel('Monthly Income')
plt.legend()
fig.savefig('histogram.png')
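
savefig() also accepts options that control the output. The sketch below saves a small demo figure as PNG and PDF, sets the resolution with dpi, and trims surrounding whitespace with bbox_inches='tight'. The data, the temporary output folder, and the dpi value are assumptions chosen for the demo.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt

# Small demo histogram with made-up values
fig = plt.figure(figsize=(6, 4))
plt.hist([1, 2, 2, 3, 3, 3, 4], bins=4, edgecolor='k')
plt.title('Demo histogram')

# Write into a temporary folder so the demo leaves no files behind in your project
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, 'histogram.png')
pdf_path = os.path.join(out_dir, 'histogram.pdf')

# The file extension selects the format
fig.savefig(png_path, dpi=150, bbox_inches='tight')
fig.savefig(pdf_path)

# List the formats the current backend can write
print(fig.canvas.get_supported_filetypes())

plt.close(fig)
```

A higher dpi gives a sharper raster image at the cost of file size, while vector formats such as PDF stay sharp at any zoom level.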

To view the saved image, we’ll use the matplotlib.image module, as shown below:

 
#Displaying the saved image
import matplotlib.image as mpimg
image = mpimg.imread("histogram.png")
plt.imshow(image)
plt.show()

Endnotes

Histograms take continuous numerical values and show how frequently they occur in a dataset, which makes them an efficient tool for data analysis. In fact, the histogram is often called the “Unsung Hero of Problem-solving” because it is underutilized. Matplotlib is one of the oldest Python visualization libraries and provides a wide variety of charts and plots for better analysis.


