Histogram in Seaborn

4 mins read1.4K Views Comment

Download as PDF

Shiksha Online

Updated on Nov 22, 2022 19:27 IST

Introduction

A histogram represents the frequency (count) or proportion (count/total count) of cases for continuous data.

We are back with a tutorial on Histogram using Python’s Seaborn library.

Seaborn is built on top of Python’s core visualization library Matplotlib, meaning it makes use of Matplotlib functionalities.

We have already learned Matplotlib Histogram.

Recommended online courses

Best-suited Data Visualization courses for you

Learn Data Visualization with these high-rated online courses

GPS Surveying

NPTELCertificate

Total Fees

Free

Duration

4 weeks

Introduction to Geographic Information Systems

NPTELCertificate

5.0

Total Fees

₹1 K

Duration

4 weeks

Data Science

Seven Mentor Pvt LtdCertificate

Total Fees

– / –

Duration

110 hours

Online Post Graduate Program in Data Engineering (Visualization)

GRV Business Management AcademyCertificate

Total Fees

₹72 K

Duration

6 months

Discontinued (Aug 2024)- Effective Data Visualization for the Data-Driven Organisation (Online)

IIM AhmedabadCertificate

Total Fees

– / –

Duration

22 days

Online Post Graduate Program in Data Engineering (Visualization)

361 Degree MindsCertificate

Total Fees

₹59.8 K

Duration

1 year

Online Post Graduate Program In Data Science And Data Visualization

361 Degree Minds - Annamalai UniversityCertificate

Total Fees

₹89.4 K

Duration

12 months

Data Visualization For Storytelling (Online)

Institute of Product LeadershipCertificate

Total Fees

– / –

Duration

4 weeks

M.Sc. in Data Science

upGrad - Chandigarh University, MumbaiDegree

Total Fees

₹1.2 L

Duration

2 years

Certificate Course in Data Visualisation on Tableau and PowerBI

Calcutta Media InstituteCertificate

Total Fees

₹15 K

Duration

3 months

Introduction to Histogram

A histogram is used to analyze the probability distribution of univariate numerical data by plotting the count of the data instead of the values.

Histogram divides the entire range of values into a series of intervals called bins.

It then counts the number of values that fall in each bin and visualizes the results intuitively.

We will learn how to plot histograms using seaborn through a fun little example.

The dataset used in this blog can be found here. It contains information on the performance of students in math, reading, and writing exams.

We need to ascertain the percentage distribution of students’ marks from this dataset.

Let’s get started, shall we?

Installing and Importing Seaborn

First, let’s install the Seaborn library in your working environment.

Execute the following command in your terminal:

pip install seaborn

Once Seaborn is installed, ensure that you also install the necessary packages and libraries that Seaborn is dependent on:

Pandas
NumPy
Matplotlib
SciPy

Now let’s import the libraries we’re going to need today:

import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Creating a Seaborn Histogram

Load the dataset

Prior to creating our graphs, let’s check out the dataset:

#Read the dataset
df = pd.read_csv('StudentsPerformance.csv')
df.head()

#Check out the number of columns
df.shape

There are 8 columns (or features) in this dataset. Let’s print them all:

#List all column names
print(df.columns)

Based on our requirement, our focus is going to be on the math score, reading score, and writing score columns from the dataset.

Plotting the data

Now, let’s plot a histogram for marks in math using Seaborn’s histplot() function:

#Creating Seaborn Histogram
sns.histplot(data=df, x=df["math score"])

A histogram is generated as shown above.

Since Seaborn is based on Matplotlib, you can add different pyplot elements to interpret a graph better:

plt.title() for setting a plot title
plt.xlabel() and plt.ylabel() for labeling x and y-axis respectively
plt.legend() for the observation variables
plt.show() for displaying the plot

sns.histplot(data=df, x=df["math score"], label='Student Marks')
plt.title("Math Score")
plt.xlabel("Marks")
plt.ylabel("Count")
plt.legend()
plt.show()

Parameters of Histogram

Firstly, let’s specify the bins in our graph through the bins parameter:

If bins is an integer, it defines the number of equal-width bins in the range
If bins is a sequence, it defines the bin edges

#Parameter - bins
sns.histplot(data=df, x=df["math score"], bins=30)
plt.title("Math Score")

The kde parameter is a Boolean value. If True, it computes a kernel density estimate to smooth the distribution shows on the plot as line:

#Parameter - kde
sns.histplot(data=df, x=df["math score"], bins=30, kde=True)
plt.title("Math Score")

color: used to specify the color of the bins:

#Parameter - color
sns.histplot(data=df, x=df["math score"], bins=30, color='green')
plt.title("Math Score")

edgecolor: used to highlight the bin edges with the specified color:

#Parameter - edgecolor
sns.histplot(data=df, x=df["math score"], bins=30, color='green', edgecolor='red')
plt.title("Math Score")

Creating a histogram with multiple categories

Let’s try to plot the distribution of the math score variable for parental level of education of students. We will do this using the hue parameter.

The hue parameter maps a column name to determine the color of plot elements:

#Parameter - hue
sns.histplot(data=df, x=df["math score"], hue='parental level of education')

The alpha parameter takes an integer between 0 and 1 and specifies the transparency of each histogram:

#Parameter - alpha
sns.histplot(data=df, x=df["math score"], hue='parental level of education',  alpha=0.4)

Let’s enlarge one of our graphs to view it clearly:

We’ll specify the figsize parameter in the plt.figure() function of Matplotlib to set the dimensions of the figure in inches.

plt.figure(1, figsize=(10,5))
 
sns.histplot(data=df, x=df["math score"], label='Student Marks')
plt.title("Math Score")
plt.xlabel("Marks")
plt.ylabel("Count")
plt.legend()
plt.show()

From the above plot, we can make some inferences:

The majority of the students have scored between 60-80 marks in their math exam
Few students have scored below 30, so the performance of the students in math seems to be good

Now, what if you want to plot the distribution of scores for reading and writing as well?

Now, plot distributions for math score, reading score, and writing score in a single figure:

plt.figure(1, figsize=(10,5))
 
sns.histplot(data=df, x=df["math score"], label='Student Marks in Math')
sns.histplot(data=df, x=df["reading score"], label='Student Marks in Reading', color='orange')
sns.histplot(data=df, x=df["writing score"], label='Student Marks in Writing', color='pink')
plt.title("Comparison of Scores")
plt.xlabel("Marks")
plt.ylabel("Count")
plt.legend()
plt.show()

Let’s alter the transparencies of the histograms and add the kde density lines for better interpretation:

plt.figure(1, figsize=(10,5))
 
sns.histplot(data=df, x=df["math score"], label='Student Marks in Math',  alpha=0.2, kde=True)
sns.histplot(data=df, x=df["reading score"], label='Student Marks in Reading', color='orange', alpha=0.5,  kde=True)
sns.histplot(data=df, x=df["writing score"], label='Student Marks in Writing', color='pink', alpha=0.8,  kde=True)
plt.title("Comparison of Scores")
plt.xlabel("Marks")
plt.ylabel("Count")
plt.legend()
plt.show()

So, from the above plots, we can infer that the scores of all three exams follow more or less the same distribution.

Conclusion

The histogram is a really efficient tool for univariate data analysis. It is often called the “Unsung Hero of Problem-solving” because of its underutilization.

Seaborn is easier to customize and much more functional and organized than Matplotlib for basic plots.

About the Author

Shiksha Online

This is a collection of insightful articles from domain experts in the fields of Cloud Computing, DevOps, AWS, Data Science, Machine Learning, AI, and Natural Language Processing. The range of topics caters to upski... Read Full Bio

Histogram in Seaborn

Introduction

Best-suited Data Visualization courses for you

GPS Surveying

Introduction to Geographic Information Systems

Data Science

Online Post Graduate Program in Data Engineering (Visualization)

Discontinued (Aug 2024)- Effective Data Visualization for the Data-Driven Organisation (Online)

Online Post Graduate Program in Data Engineering (Visualization)

Online Post Graduate Program In Data Science And Data Visualization

Data Visualization For Storytelling (Online)

M.Sc. in Data Science

Certificate Course in Data Visualisation on Tableau and PowerBI

Table of Content:

Introduction to Histogram

Installing and Importing Seaborn

Creating a Seaborn Histogram

Load the dataset

Plotting the data

Parameters of Histogram

Creating a histogram with multiple categories

Conclusion

Top Picks & New Arrivals