One-Hot Encoding vs Label Encoding in Machine Learning
As we saw in the previous blog, machine learning models can’t process categorical variables directly, so categorical variables in a dataset must be converted into numerical ones. There are many ways to do this, and each approach has its own trade-offs and impact on the feature set. In the previous blog I explained one hot encoding (dummy variables), and I suggest you go through that post in detail before starting this one. Here I will focus on the two main methods: One-Hot Encoding and Label Encoding.
Table of contents
- One hot Encoding
- Label encoding
- When to use One-Hot encoding and Label encoding?
- Difference between One-Hot encoding and Label encoding
- Label encoding python
- Assignment
- Endnotes
Many of you might be confused between these two techniques, Label Encoding and One Hot Encoding. Their basic purpose is the same, i.e. conversion of categorical variables to numerical variables, but they are applied differently. So let’s understand the difference between the two with a simple example.
One hot Encoding
Encoding is the action of converting. One-hot encoding converts categorical data into numeric data by splitting the column into multiple columns. The values are replaced by 1s and 0s, depending on which column holds which value. In our example, we get four new columns, one for each country: India, Australia, Russia, and America.
Note: If you want to study one hot encoding with proper example and python code then click here.
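As a quick refresher, one-hot encoding can be sketched with pandas’ `get_dummies` (the Country column below is hypothetical, built just to mirror the four countries above):

```python
import pandas as pd

# Hypothetical data mirroring the Country example above
df = pd.DataFrame({"Country": ["India", "Australia", "Russia", "America"]})

# get_dummies splits the column into one 0/1 indicator column per country
encoded = pd.get_dummies(df, columns=["Country"])
print(encoded.columns.tolist())
```

Each row now has a 1 in the column for its own country and 0 (or False, in recent pandas versions) everywhere else.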
Label encoding
This approach is very simple and it involves converting each value in a column into a number.
Consider a dataset with many columns; to understand label encoding, we will focus on a single categorical column, State, which contains the values below. Each distinct label in the State feature is assigned a different numeric value.
So for its implementation, all we have to do is:
- Import the LabelEncoder class from the sklearn library
- Fit and transform the first column of the data
- Replace the existing text data with the new encoded data
Don’t worry, we will also learn it through code. Very simple code!!!
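The three steps above can be sketched as follows (the State values here are hypothetical, just for illustration):

```python
# Step 1: import the LabelEncoder class from the sklearn library
from sklearn.preprocessing import LabelEncoder

import pandas as pd

# Hypothetical State column for illustration
df = pd.DataFrame({"State": ["Maharashtra", "Gujarat", "Kerala", "Gujarat"]})

le = LabelEncoder()
# Steps 2 and 3: fit, transform, and replace the text with the encoded data
df["State"] = le.fit_transform(df["State"])
print(df["State"].tolist())  # [2, 0, 1, 0]: codes follow alphabetical order
```

Note that `LabelEncoder` assigns codes in alphabetical order of the labels, not in the order they appear.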
When to use One-Hot encoding and Label encoding?
The encoding technique is selected depending upon the data. For example, we encoded different state names into numeric data in the example above.
Label encoding is used when:
- The categorical feature is ordinal, i.e. the order of the categories matters.
- The number of categories is quite large, as one-hot encoding would lead to high memory consumption.
One-hot encoding is used when:
- The order does not matter in the categorical feature (nominal data).
- The categories in a feature are few.
Note: With label encoding, the model may misunderstand the data to be in some kind of order, 0 < 1 < 2. For example, in the six-class “State” column above, the model would assume a relationship between the values as follows: 0 < 1 < 2 < 3 < 4 < 5. To overcome this problem for nominal data, we can use one-hot encoding as explained above.
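A small sketch of this difference (the state names are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

states = pd.DataFrame({"State": ["Texas", "Alaska", "Ohio"]})

# Label encoding: the integers suggest Alaska < Ohio < Texas, which is meaningless
labels = LabelEncoder().fit_transform(states["State"])
print(list(labels))  # [2, 0, 1]

# One-hot encoding: each state gets its own 0/1 column, so no order is implied
onehot = pd.get_dummies(states, columns=["State"])
print(onehot.columns.tolist())
```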
Difference between One-Hot encoding and Label encoding
| Label Encoding | One-hot Encoding |
| --- | --- |
| The categorical values are labeled as numeric values by assigning each category a number. | A column with categorical values is split into multiple columns. |
| No extra columns are added; the categories are converted into numeric values in place, so fewer computations. | It adds more columns and is computationally heavier. |
| Each value carries unique information. | The new columns carry redundant information. |
| Different integers are used to represent the data. | Only 0 and 1 are used to represent the data. |
Label encoding python
I implemented this code using a dataset named adult.csv from Kaggle. It is census data, and the goal of this machine learning project is to predict whether a person makes over 50K a year given their demographic attributes.
1. Importing the Libraries
import pandas as pd
import numpy as np
2. Reading the file
df = pd.read_csv("adult.csv")
df
Output:
3. Importing the label encoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
4. Checking the different columns of the dataset
On checking, we find there is a space between the opening quote and each column name, as shown in the output, so we must include that space when writing code.
df.columns
Output:
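As an alternative (not part of the original walkthrough), you could strip the stray spaces from every column name once, up front, and then refer to the columns without the leading space:

```python
import pandas as pd

# Hypothetical frame reproducing the leading-space column names in adult.csv
df = pd.DataFrame({" relationship": ["Husband", "Wife"], " race": ["White", "Black"]})

# Remove leading/trailing whitespace from all column names in one go
df.columns = df.columns.str.strip()
print(df.columns.tolist())  # ['relationship', 'race']
```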
5. Fitting and transforming
df[' relationship'] = le.fit_transform(df[' relationship'])
df
After running this piece of code, if you check the ‘relationship’ column, you’ll see that the categories have been replaced by the numbers 0, 1, 2, 3, 4, and 5.
Now you can write the same code for the other categorical columns, like work class, marital status, occupation, race, sex, native_country, and income.
Note: Don’t apply label encoding to the education column, because in that category the order matters.
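To see why, here is a quick sketch (the education levels are hypothetical examples): LabelEncoder assigns codes alphabetically, which scrambles the real ordering of education levels.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical levels; the true order is HS-grad < Bachelors < Masters
levels = ["Masters", "HS-grad", "Bachelors"]

codes = LabelEncoder().fit_transform(levels)
# Bachelors gets 0 and HS-grad gets 1: alphabetical, not educational, order
print(dict(zip(levels, codes)))
```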
We will talk about it in detail quoting different examples in the next blog.
Assignment
My suggestion: simply reading the code won’t help you. Download the adult.csv file from Kaggle (it is freely available), then:
- Try to convert the other categorical features into numerical features (I converted only one feature) using label encoding
- Try implementing an algorithm of your choice and find the prediction accuracy
Endnotes
Congrats on making it to the end!! You should now have an idea of what label encoding is, why it is used, and how to use it. Categorical variables have to be handled before moving on to other steps like model training, hyperparameter tuning, cross-validation, and model evaluation.
I will be coming up with another blog on how to handle ordinal features. Stay tuned!!!
If you liked my blog consider hitting the stars.