Curse of Dimensionality


Learn about the ‘Curse of Dimensionality’ and its impact through this article.


The Curse of Dimensionality sounds like something straight out of the wizarding world, but it is actually a very common term you’ll come across in the Machine Learning and Big Data world. It describes how an increase in the dimensions of the input data results in an exponential increase in the computational expense and effort required to process and analyze that data.


What is the Curse of Dimensionality?

The curse of dimensionality refers to the difficulties a machine learning algorithm faces when working with high-dimensional data, difficulties that did not exist in lower dimensions. This happens because as you add dimensions (features), the minimum amount of data required grows rapidly.

This means that as the number of features (columns) increases, you need an exponentially growing number of samples (rows) to have all combinations of feature values well represented in your sample.
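To get a feel for this growth, here is a minimal back-of-the-envelope sketch. It assumes, purely for illustration, that each feature is divided into 10 bins, and it counts the bin combinations that would need to be covered:

bins_per_feature = 10  # assumed granularity per feature; purely illustrative

# The number of feature-value combinations grows exponentially with dimensions
for n_features in [1, 2, 5, 10, 20]:
    combinations = bins_per_feature ** n_features
    print(f"{n_features} features -> {combinations:,} cells to cover")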


With the increase in the data dimensions, your model –

  • would also increase in complexity.
  • would become increasingly dependent on the data it is being trained on.

This leads to overfitting: even though the model performs really well on training data, it fails drastically on real-world data.

Quite a few algorithms perform best on tall, svelte datasets – those with fewer features and more samples. Hence, to remove the curse afflicting your model, you might need to put your data on a diet – i.e., reduce its dimensions through feature selection and feature engineering techniques. Let’s see how this is done!


Dimensionality Reduction to the Rescue

What you need to understand first is that data features are usually correlated. Hence, high-dimensional data is often dominated by a rather small number of underlying features. If we can find a small set of features (or combinations of them) that represents the information almost as well as the original dataset, we can lift the curse of dimensionality!
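As a quick illustration (a minimal sketch, using scikit-learn’s built-in copy of the iris data that we also use in the demo below), you can inspect pairwise feature correlations directly:

from sklearn.datasets import load_iris

# Load the iris data as DataFrames and inspect pairwise feature correlations
iris_bunch = load_iris(as_frame=True)
print(iris_bunch.data.corr().round(2))
# Petal length and petal width are very strongly correlated (~0.96),
# so they carry largely overlapping information.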

This is what dimensionality reduction is – a process of reducing the dimension of your data to a few principal features.

Fewer input dimensions often mean a simpler model with fewer parameters, or degrees of freedom. A model with many degrees of freedom is more prone to overfitting. So, it is desirable to have more generalized models trained on input data with fewer features.

Why is Dimensionality Reduction necessary?

  • Avoids overfitting – a model with fewer features is simpler and generalizes better.
  • Easier computation – the fewer the dimensions, the faster the model trains.
  • Improved model performance – removing redundant features and noise leaves less misleading data and improves model accuracy.
  • Lower-dimensional data requires less storage space.
  • Lower-dimensional data can be used with algorithms that are unsuitable for high-dimensional inputs.

How is Dimensionality Reduction done?

Several techniques can be employed for dimensionality reduction depending on the problem and the data. These techniques are divided into two broad categories:

Feature Selection: Choosing the most important features from the data.

Feature Extraction: Combining features to create new superfeatures.
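As a rough sketch of the difference, the snippet below applies both approaches to the iris data: SelectKBest keeps two of the original columns, while PCA builds two new ‘superfeatures’ out of all four (the choice of two output features here is arbitrary and just for illustration):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Feature selection: keep the 2 original features most related to the target
selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Feature extraction: combine all 4 features into 2 new components
extracted = PCA(n_components=2).fit_transform(X)

print(selected.shape, extracted.shape)  # both (150, 2)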


Now, we are going to demonstrate how to get rid of the curse of dimensionality. We will be performing dimensionality reduction through a common linear method – Principal Component Analysis (PCA).


Demo: Implementing PCA for Dimensionality Reduction

Problem Statement:

A large number of input dimensions can slow down the training and execution of a model. So, we perform Principal Component Analysis (PCA) on the data to speed up the fitting of the ML algorithm.

PCA projects the data onto the directions of greatest variance. These new directions, which are linear combinations of the original features, are called the principal components. Let’s see how to implement PCA using Python:
Dataset Description:

We will be using the iris flower dataset for this.

This dataset has 4 features and a target column:

  • sepal_length – Sepal length in centimeters
  • sepal_width – Sepal width in centimeters
  • petal_length – Petal length in centimeters
  • petal_width – Petal width in centimeters
  • species – Species of iris

The ‘species’ column is our target variable.

Tasks to be performed:

  1. Loading the dataset
  2. Standardizing the data onto a unit scale
  3. Projecting the data to two dimensions with PCA
  4. Concatenating the Principal Components with the target variable
  5. Visualizing the 2D Projection

Step 1 – Loading the dataset

import pandas as pd

# Load the iris dataset from a local CSV file
iris = pd.read_csv('IRIS.csv')
iris.head()

[Output: the first five rows of the iris dataset]

Step 2 – Standardizing the data onto a unit scale

Before we apply PCA, the features in the given data need to be standardized onto a unit scale (mean = 0 and variance = 1). This ensures the optimal performance of many ML algorithms. For this, we use StandardScaler, as shown:

from sklearn.preprocessing import StandardScaler

features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

# Separating the features
x = iris.loc[:, features].values

# Separating the target variable
y = iris.loc[:, ['species']].values

# Standardizing the features
x = StandardScaler().fit_transform(x)
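As an optional sanity check, you can confirm that each scaled feature now has (approximately) zero mean and unit standard deviation:

import numpy as np

# After scaling, each feature should have mean ~0 and standard deviation ~1
print(np.round(x.mean(axis=0), 6))
print(np.round(x.std(axis=0), 6))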

Step 3 – Projecting the data to 2D with PCA

The initial dataset has 4 features: sepal length, sepal width, petal length, and petal width.

Now, we will project this data onto two principal components – PC1 and PC2. These two components capture the directions of greatest variance in the data:

from sklearn.decomposition import PCA

# Reduce the four standardized features to two principal components
pca = PCA(n_components=2)
components = pca.fit_transform(x)

data = pd.DataFrame(data=components, columns=['PC1', 'PC2'])
data.head()

[Output: the first five rows of the principal components DataFrame]
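Before moving on, it is worth checking how much of the original variance these two components retain; for the standardized iris data, the first two components together capture roughly 95–96% of it:

# Fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())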

Step 4 – Concatenating the Principal Components with the target variable

Now, we will concatenate the principal components DataFrame created in the above step with the ‘species’ column, along axis = 1.

final_data = pd.concat([data, iris[['species']]], axis=1)
final_data.head()

[Output: the first five rows of the final DataFrame with PC1, PC2, and species]

Thus, by performing PCA, we have reduced the data from four dimensions to two and lifted the curse of dimensionality that had afflicted it.

Step 5 – Visualizing the 2D Projection

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(1, 1, 1)
ax.set_xlabel('Principal Component 1', fontsize=15)
ax.set_ylabel('Principal Component 2', fontsize=15)
ax.set_title('2 component PCA', fontsize=20)

targets = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
colors = ['red', 'green', 'orange']

# Plot each species in its own color on the PC1 vs PC2 plane
for target, color in zip(targets, colors):
    indicesToKeep = final_data['species'] == target
    ax.scatter(final_data.loc[indicesToKeep, 'PC1'],
               final_data.loc[indicesToKeep, 'PC2'],
               c=color, s=50)

ax.legend(targets)
ax.grid()
plt.show()

[Output: scatter plot of the two principal components, with the three iris species forming distinct clusters]

As demonstrated in the graph above, we can see how well separated the classes are from each other.

Endnotes

Dimensionality Reduction is an important yet often overlooked step in the ML workflow of organizations. In a domain where more data is usually considered a good thing, we have seen how redundant, low-quality features can create more problems than they solve.

Hope this article helped you understand the importance of keeping the number of features in your data to a minimum. Artificial Intelligence & Machine Learning is a rapidly growing domain that has hugely impacted big businesses worldwide. Interested in being a part of this frenzy? Explore related articles here.
