One hot encoding vs label encoding in Machine Learning

Updated on Jan 24, 2023 15:37 IST

As we learned in the previous blog, machine learning models can’t process categorical variables directly. So when we have categorical variables in our dataset, we need to convert them into numerical variables. There are many ways to do this, and each approach has its own trade-offs and impact on the feature set. In the previous blog I explained one hot encoding (dummy variables), and I suggest you go through it in detail before starting this one. Here, I will focus on the two main methods: One-Hot Encoding and Label Encoding.


Many of you might be confused between these two techniques, Label Encoding and One-Hot Encoding. The basic purpose of both is the same, i.e. converting categorical variables into numerical ones, but they are applied differently. So let’s understand the difference between the two with a simple example.


One-Hot Encoding

Encoding simply means converting. One-hot encoding converts categorical data into numeric data by splitting the column into multiple columns. The values are replaced by 1s and 0s, depending on which column holds which value. In our example, we get four new columns, one for each country: India, Australia, Russia, and America.

Note: If you want to study one hot encoding with proper example and python code then click here.
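
Before the full walkthrough later in this blog, here is a minimal sketch of one-hot encoding (the Country column and its values are assumed purely for illustration), using the get_dummies function from pandas:

import pandas as pd

# an assumed example frame with the four countries mentioned above
df = pd.DataFrame({'Country': ['India', 'Australia', 'Russia', 'America', 'India']})

# get_dummies splits the Country column into one 0/1 column per category
one_hot = pd.get_dummies(df, columns=['Country'])
print(one_hot)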

Label Encoding

This approach is very simple: it involves converting each value in a column into a number.

Consider a dataset with many columns. To understand label encoding, we will focus on just one categorical column, State.

The different labels in the State feature are each assigned a different numeric value.

So for its implementation, all we have to do is:

  • Import the LabelEncoder class from the sklearn library
  • Fit and transform the categorical column of the data
  • Replace the existing text data with the new encoded data

Don’t worry, we will also learn it through code. It is very simple!
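
As a quick sketch of those three steps (the State column and its values here are assumed purely for illustration):

import pandas as pd
from sklearn.preprocessing import LabelEncoder   # step 1: import the class

# an assumed example frame with one categorical State column
df = pd.DataFrame({'State': ['Delhi', 'Goa', 'Punjab', 'Goa', 'Delhi', 'Kerala']})

le = LabelEncoder()
df['State'] = le.fit_transform(df['State'])   # step 2: fit and transform the column
print(df)                                     # step 3: the text is now replaced by integers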

Also explore:

Top 10 concepts and Technologies in Machine learning
Handling Categorical Variables with One-Hot Encoding

When to use One-Hot Encoding and Label Encoding?

Which encoding technique to use depends on the data. For example, we encoded the different state names into numerical data in the example above. This categorical data has no relationship of any kind between the rows, so we can use label encoding (but keep the note below in mind).

Label encoding is used when:

  • The number of categories is quite large, as one-hot encoding can lead to high memory consumption.
  • The order does not matter in the categorical feature.

One-hot encoding is used when:

  • The order does not matter in the categorical feature.
  • The categories in a feature are few.

Note: With label encoding, the model may misunderstand the data to be in some kind of order, 0 < 1 < 2. For example, in the six-class ‘State’ example above, the model may assume a relationship between the values such as 0 < 1 < 2 < 3 < 4 < 5. To overcome this problem, we can use one-hot encoding instead.

Difference between One-Hot Encoding and Label Encoding

Label Encoding:

  1. Categorical values are converted into numeric values by assigning each category a number.
  2. No extra columns are added; the different categories are simply mapped to numbers, so fewer computations are needed.
  3. The information stays unique (a single column).
  4. Different integers are used to represent the data.

One-Hot Encoding:

  1. A column with categorical values is split into multiple columns.
  2. It adds more columns and is computationally heavier.
  3. The information is redundant (spread across several columns).
  4. Only 0 and 1 are used to represent the data.

Also read:

Hyperparameter Tuning: Beginners Tutorial
Cross-validation techniques
GridSearchCV and RandomizedSearchCV:Python code

Label Encoding in Python

I implemented this code using a dataset named adult.csv from Kaggle. It is census data, and the goal of this machine learning task is to predict whether a person makes over 50K a year, given their demographic information.

1. Importing the Libraries

 
import pandas as pd
import numpy as np

2. Reading the file

 
df = pd.read_csv("adult.csv")
df

Output:

3. Importing the label encoder

 
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

4. Checking the different columns of the dataset

On checking, we find that each column name starts with a space (between the opening quote and the name), as shown in the output. So while writing code, we also have to include that space.

 
df.columns

Output:
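
If you prefer not to type the leading space every time, one option (not used in the rest of this walkthrough, which keeps the original names) is to strip it from all the column names once:

# remove leading and trailing spaces from every column name
df.columns = df.columns.str.strip()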

5. Fitting and transforming

 
# replace the text categories in ' relationship' with integer codes
df[' relationship'] = le.fit_transform(df[' relationship'])
df

After running this piece of code, if you check the values of ‘relationship’, you’ll see that the categories have been replaced by the numbers 0, 1, 2, 3, 4, and 5.
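
If you want to check which category got which number, the fitted encoder stores the original labels; the numeric code of a label is simply its position in this array:

# original category names, in the order of their numeric codes (0, 1, 2, ...)
print(le.classes_)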

Now you can write the same code for the other categorical columns, such as workclass, marital status, occupation, race, sex, native country, and income, as sketched below.
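
A minimal sketch of that loop is given below; the exact column names (including the leading spaces) depend on your copy of adult.csv, so check the output of df.columns first:

# assumed column names - adjust them to match the output of df.columns
categorical_cols = [' workclass', ' marital-status', ' occupation',
                    ' race', ' sex', ' native-country', ' income']

for col in categorical_cols:
    df[col] = LabelEncoder().fit_transform(df[col])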

Note: Don’t apply label encoding to the education column, because in that category the order matters.

We will talk about it in detail quoting different examples in the next blog.

Assignment

My suggestion is that simply reading the code won’t help you. I suggest you download the adult.csv file from Kaggle (it is freely available).

  1. Try to convert the other categorical features into numerical features (I have converted only one) using label encoding.
  2. Try implementing an algorithm of your choice and find the prediction accuracy (a minimal sketch follows below).
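
For the second step, here is a minimal sketch, assuming the label encoding from the first step is done and the (space-prefixed) income column is the target; any text columns that are still unencoded are simply dropped by keeping numeric columns only:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# keep only numeric columns, so any still-unencoded text columns are dropped
X = df.drop(columns=[' income']).select_dtypes(include='number')
y = df[' income']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
print('Accuracy:', accuracy_score(y_test, model.predict(X_test)))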

Endnotes

Congrats on making it to the end! You should now have an idea of what label encoding is, why it is used, and how to use it. We have to handle categorical variables first, before moving on to other steps like training the model, hyperparameter tuning, cross-validation, evaluating the model, and so on.

I will be coming up with another blog on how to handle ordinal features. Stay tuned!

If you liked my blog consider hitting the stars.
