One hot encoding for multi categorical variables

4 mins read8K Views Comment

Updated on Sep 21, 2022 02:15 IST

Machine learning models cannot process categorical variables so they need to be converted to numerical variables such that the model is able to understand and extract valuable information. So for implementing any model we first have to preprocess the data. In preprocessing we can perform One hot encoding. That means making the data ready by doing cleaning and conversion properly. We have different types of categorical data. For eg.

Ordinal categorical data
categorical data with less number of categories.
categorical data with a large number of categories.

In the previous blog, we handled categorical data with fewer categories by using One hot encoding and label encoding. In this blog, we will learn how to handle Ordinal categorical data and categorical data with a large number of categories.

Ordinal Encoding
One hot encoding
One hot Encoding (Handling multiple categories)
One hot encoding(multiclass variables): Python code
Assignment
Endnotes

Recommended online courses

Best-suited Tableau courses for you

Learn Tableau with these high-rated online courses

Capital IQ Fundamentals

Corporate Finance InstituteCertificate

4.5

Total Fees

₹39.54 K

Duration

2 hours

Tableau Certification Training Course

Intelliq IT InstituteCertificate

Total Fees

– / –

Duration

45 hours

Get Started With Tableau

CourseraCertificate

Total Fees

– / –

Duration

2 hours

Use Tableau for Your Data Science Workflow Specialization

University of California - Irvine CampusCertificate

Total Fees

Free

Duration

3 months

Data Literacy - What is it and why does it matter?

CourseraCertificate

Total Fees

Free

Duration

11 hours

Advanced Data Visualization with Tableau

CourseraCertificate

Total Fees

Free

Duration

21 hours

Communicating Data Insights with Tableau

CourseraCertificate

Total Fees

Free

Duration

26 hours

Introduction to Tableau

CourseraCertificate

Total Fees

Free

Duration

19 hours

Use Tableau for your Data Science Workflow

CourseraCertificate

Total Fees

Free

Duration

3 hours

Getting Started with Tableau

CourseraCertificate

Total Fees

Free

Duration

11 hours

What is encoding?

Encoding is a pre-processing step conversion technique that transforms categorical data into numeric data. The two most popular techniques are Ordinal Encoding and One-Hot Encoding. It is a pre-processing step for making the data ready for other steps like training model, hyperparameter tuning, cross-validation, evaluating the model, etc.

Ordinal Encoding

In this encoding technique order of categorical variables matters.

In this, the integer values have a natural ordered relationship with each other.

By default, integer values are assigned to labels in the order that is observed in the data. But if we want to assign the values in some specific order, it can be specified via Ordinal encoding. Ideally, this should be checked and handled properly when preparing the data. But if no relationship exists between the variables we can go for another encoding technique like one-hot encoding or label encoding.

Let’s understand with an example.

In this, the feature Education has different categories like school, graduation,post-graduation, Ph.D. Suppose different students having different educational backgrounds gave a test Like in the above example the highest degree a person possesses, gives vital information about his qualification. So Ph.D. is the highest qualification among all the educations so it is assigned value 4 and post-graduation have lesser value than Ph.D. so it is given value 3 and so on.

While label encoding is used when no notion of order is present, we just have to consider the presence or the absence of a feature.

One hot encoding

In this encoding technique order of categorical variables does not matters.

Categorical data is converted into numeric data by splitting the column into multiple columns. The numbers are replaced by 1s and 0s, depending on which column has what value. For a detailed explanation, you can click here.

Handling multiple categories

One-hot encoding can be used to handle a large number of categories also. How does it do this? Suppose 200 categories are present in a feature then only those 10 categories which are the top 10 repeating categories will be chosen and one-hot encoding is applied to only those categories. And then one dummy variable can be dropped as explained in the previous blog.

One hot encoding vs label encoding in Machine Learning

As in the previous blog, we come to know that the machine learning model can’t process categorical variables. So when we have categorical variables in our dataset then we...read more

Read Later

Handling Categorical Variables with One-Hot Encoding

Handling categorical variables with one-hot encoding involves converting non-numeric categories into binary columns. Each category becomes a column with ‘1’ for the presence of that category and ‘0’ for others,...read more

Read Later

One hot encoding(multiclass variables): Python code

Implemented this code using a dataset named mercedesbenz.csv from Kaggle. This dataset contains an anonymized set of variables, each representing a custom feature in a Mercedes car.

1. Importing the Libraries

import pandas as pd
import numpy as np

2. Reading the file

df = pd.read_csv('mercedesbenz.csv', usecols=['X1', 'X2'])
df.head()

3. Number of different labels

counts = df['X1'].value_counts().sum()
counts
Output: 
4209

Checking the number of different labels in column X1.

4. Checking the top 10 repeating columns

top_10_labels = [y for y in df.X1.value_counts().sort_values(ascending=False).head(10).index]
Top_10_labels
 
Output:
['aa', 's', 'b', 'l', 'v', 'r', 'i', 'a', 'c', 'o']

5. Top 10 columns in ascending order

df.X1.value_counts().sort_values(ascending=False).head(10)
Output:
aa    833
s     598
b     592
l     590
v     408
r     251
i     203
a     143
c     121
o      82
Name: X1, dtype: int64

When you see in this output image you will notice that the aa label is repeating 833 times in the X1 column and going down this number is decreasing.

So we took the top 10 results from the top and we will convert these top 10 results into one-hot encoding and the labels not occurring in this top ten label list will turn into zero.

6. Applying One hot encoding

df=pd.get_dummies(df['X1']).sample(10)
df

Now, here we apply the one-hot encoding to all multi categorical variables. You can see how the top 10 labels are now converted into binary format.

Assignment

It’s my suggestion that a simple reading code won’t help you. I suggest you download the mercedesbenz.csv file from Kaggle(freely available). This dataset has many categories. That is why we implemented this dataset. Try to convert other categorical features X2 into numerical features(I have converted only one feature)by using one-hot encoding. Try implementing an algorithm of your choice and find the prediction accuracy.

Endnotes

Congrats on making it to the end!! You should have an idea of how to handle multiclass categories using one-hot encoding and how to use it. We have to first handle categorical variables before moving to other steps like training model, hyperparameter tuning, cross-validation, evaluating the model, etc.

If you liked my blog consider hitting the stars. You can explore my other data science blogs on this page.

About the Author

Shiksha Online

This is a collection of insightful articles from domain experts in the fields of Cloud Computing, DevOps, AWS, Data Science, Machine Learning, AI, and Natural Language Processing. The range of topics caters to upski... Read Full Bio

One hot encoding for multi categorical variables

Table of contents

Best-suited Tableau courses for you

Capital IQ Fundamentals

Tableau Certification Training Course

Get Started With Tableau

Use Tableau for Your Data Science Workflow Specialization

Data Literacy - What is it and why does it matter?

Advanced Data Visualization with Tableau

Communicating Data Insights with Tableau

Introduction to Tableau

Use Tableau for your Data Science Workflow

Getting Started with Tableau

What is encoding?

Ordinal Encoding

One hot encoding

Handling multiple categories

One hot encoding(multiclass variables): Python code

2. Reading the file

3. Number of different labels

4. Checking the top 10 repeating columns

5. Top 10 columns in ascending order

6. Applying One hot encoding

Assignment

Endnotes

Top Picks & New Arrivals