Predicting Categorical Data Using Classification Algorithms
This article will demonstrate how you can build classification models using ML’s favorite programming language – Python.
Classification Algorithms are Supervised Machine Learning Algorithms that use labeled data (aka training datasets) to train classifier models. These models then predict labels for new data (aka testing datasets) fed to them.
The outcome predicted by a classification algorithm is categorical in nature. These algorithms assign inputs to a specific set of classes – for example, an SMS filter on your iPhone classifying a text message as a transaction or a promotion.
We are going to cover the following sections:
- Overview of Classification Algorithms
- How do Classification Algorithms work?
- Types of Classification Algorithms
- Predicting Categorical Values Using Classification Algorithms
- Endnotes
Overview of Classification Algorithms
Classification techniques predict discrete class label output(s) to which the data elements belong. For example, weather prediction is a type of classification problem – ‘hot’ and ‘cold’ being the class labels. This is called binary classification since there are only two classes.
A few more examples of classification problems:
- Speech recognition
- Face detection
- Spam texts/e-mails classification
- Stock market prediction
- Breast cancer detection
- Employee Attrition prediction
How do Classification Algorithms work?
A classifier utilizes known (training) data to understand how the given input (independent) variables relate to the target (dependent) variable.
In the weather prediction example above, we would take the outside temperatures of previous days and use those as the training data. This data is fed into the classifier – if the classifier is trained accurately, it will be able to predict future weather conditions.
We use Binary Classifiers in case there are only two classes and Multi-class Classifiers for more than two class divisions.
Types of Classification Algorithms
Which algorithm to use depends on the application and the nature of the data. The most common classification algorithms include:
- Logistic Regression
- K Nearest Neighbors (KNN)
- Support Vector Machine (SVM)
- Decision Tree
- Random Forest
- Naïve Bayes
Logistic Regression
Note that despite the name, Logistic “Regression” is actually a linear classification algorithm. It is used when the classes are linearly separable and binary – like true (1) or false (0), win (1) or lose (0), etc.
Logistic regression uses the sigmoid function to return the probability of a label. The curve obtained is called a sigmoid curve or an S-curve. By comparing the output probability with a pre-defined threshold, the object is assigned to a label.
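For intuition, here is a minimal sketch of how a sigmoid output is mapped to a class label (the scores and the 0.5 threshold are assumptions for illustration, not values from the dataset used later):

import numpy as np

def sigmoid(z):
    #Squash any real number into the (0, 1) range
    return 1 / (1 + np.exp(-z))

#Hypothetical linear scores produced by a logistic regression model
scores = np.array([-2.0, 0.3, 1.7])
probabilities = sigmoid(scores)

#Compare each probability against a 0.5 threshold to assign labels
labels = (probabilities >= 0.5).astype(int)
print(probabilities)  #approximately [0.119 0.574 0.846]
print(labels)         #[0 1 1]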
K-Nearest Neighbors (KNN)
If your dataset has n features, KNN represents each data point in an n-dimensional space. It then calculates the distances between data points, and an unobserved data point is assigned the majority label of its k nearest observed neighbors. KNN is commonly used for recommender systems, credit scoring, etc.
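A minimal sketch of this idea using Euclidean distance (the 2-D points and k = 3 are assumptions for illustration):

import numpy as np

#Toy labeled points in a 2-dimensional feature space
X_known = np.array([[1, 1], [2, 1], [8, 9], [9, 8]])
y_known = np.array([0, 0, 1, 1])

#A new, unobserved point to classify
x_new = np.array([2, 2])

#Euclidean distance from the new point to every known point
distances = np.linalg.norm(X_known - x_new, axis=1)

#Take the labels of the k nearest neighbors and vote
k = 3
nearest = np.argsort(distances)[:k]
print(np.bincount(y_known[nearest]).argmax())  #0 – the majority label among the 3 closest points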
Support Vector Machine (SVM)
A support vector classifier finds a hyperplane, called the decision boundary, that separates the data points into specific classes. The data points closest to the decision boundary are called support vectors. The margin is the perpendicular distance between the support vectors and the decision boundary; an optimum decision boundary maximizes this margin.
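These ideas can be sketched with scikit-learn’s SVC on toy data (the points below are made up for illustration); the fitted model exposes the support vectors directly:

import numpy as np
from sklearn.svm import SVC

#Two small, linearly separable clusters
X = np.array([[1, 1], [2, 2], [2, 1], [8, 8], [9, 9], [8, 9]])
y = np.array([0, 0, 0, 1, 1, 1])

#A linear kernel keeps the decision boundary a straight hyperplane
clf = SVC(kernel='linear')
clf.fit(X, y)

#The data points closest to the decision boundary
print(clf.support_vectors_)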
Decision Tree
As the name suggests, this algorithm builds “branches” in a hierarchical manner where each branch can be considered as an if-else statement. The branches divide the dataset into subsets based on the most important features. The “leaves” of the decision tree are where the final classifications happen.
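To see this if-else structure, one can fit a small tree and print its learned rules with scikit-learn’s export_text (the one-feature toy data here is an assumption for illustration):

from sklearn.tree import DecisionTreeClassifier, export_text

#Toy data: one feature, two well-separated classes
X = [[1], [2], [3], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

tree = DecisionTreeClassifier(random_state=42).fit(X, y)

#Prints the learned branches as nested if-else rules
print(export_text(tree, feature_names=['feature_0']))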
Random Forest
Just as a forest is a collection of trees, a random forest is a collection of decision trees. This classifier aggregates the predictions from multiple trees using the bagging technique: each tree is trained on a random sample (drawn with replacement) of the original dataset, and the final prediction is the majority vote across trees. A random forest classifier generalizes better than a single decision tree but is less interpretable, since the individual decision rules are hidden inside the ensemble.
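The bagging idea can be sketched by hand: train several trees on bootstrap samples and take a majority vote (a simplified illustration of the principle, not how RandomForestClassifier is implemented internally):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = np.array([[1], [2], [3], [10], [11], [12]])
y = np.array([0, 0, 0, 1, 1, 1])

predictions = []
for _ in range(5):
    #Bootstrap: sample rows with replacement from the original dataset
    idx = rng.integers(0, len(X), len(X))
    tree = DecisionTreeClassifier().fit(X[idx], y[idx])
    predictions.append(tree.predict([[2.5]])[0])

#Majority vote across the individual trees
print(np.bincount(predictions).argmax())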
Naïve Bayes
This classifier sorts data into different classes according to Bayes’ Theorem, but assumes that all input features within a class are independent of one another. Hence, the model is called naïve. The algorithm works relatively well even when the training dataset is small. Naïve Bayes is commonly used for text classification, sentiment analysis, etc.
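A minimal GaussianNB sketch on made-up data (the points are assumptions for illustration); predict_proba shows the per-class probabilities derived from Bayes’ Theorem:

import numpy as np
from sklearn.naive_bayes import GaussianNB

#Toy data: two features, two classes
X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]])
y = np.array([0, 0, 1, 1])

gnb = GaussianNB().fit(X, y)

#Class probabilities for a new point, computed via Bayes' Theorem
#under the naive assumption of feature independence within a class
print(gnb.predict_proba([[2.0, 2.0]]))
print(gnb.predict([[2.0, 2.0]]))  #[0]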
Classification algorithms are either Lazy learners or Eager learners:
- Lazy learners simply store the training data and wait for the testing data. They then perform classification based on the most closely related data in the stored training set. Lazy learners take less time to train but more time to predict. KNN is a lazy learner.
- Eager learners build a classification model from the given training data before receiving the testing data – they do not wait for the test data. Eager learners take longer to train due to model construction but less time to predict. Decision Trees and Naïve Bayes are examples of eager learners. The timing sketch below illustrates this trade-off.
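As a rough illustration, this sketch times fit and predict for a lazy learner (KNN) and an eager learner (Decision Tree) on synthetic data; exact times will vary by machine, but KNN typically fits quickly and predicts slowly, while the tree does the opposite:

import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

#Synthetic data: 5000 samples, 10 features, label depends on feature 0
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))
y = (X[:, 0] > 0).astype(int)

for model in (KNeighborsClassifier(), DecisionTreeClassifier()):
    #Time model construction (training)
    start = time.perf_counter()
    model.fit(X, y)
    fit_time = time.perf_counter() - start
    #Time prediction on the same data
    start = time.perf_counter()
    model.predict(X)
    predict_time = time.perf_counter() - start
    print(f"{type(model).__name__}: fit {fit_time:.4f}s, predict {predict_time:.4f}s")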
Predicting Categorical Values Using Classification Algorithms
For demonstration, we are going to build a model that can predict whether a patient has heart disease based on the features provided in the heart disease dataset (heart.csv) loaded below.
We will use the six classification algorithms we have discussed above. Based on their accuracy scores, we will select the best algorithm.
Let’s get started!
Step 1 – Import the required libraries
We use Python’s scikit-learn package when working with machine learning models:
#Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
Step 2 – Load the dataset
#Read the dataset
data = pd.read_csv('heart.csv')

#Display the first five rows
print(data.head())
Step 3 – Prepare the data
The column HeartDisease is our target variable. We will check how many classes our target variable has:
data.groupby('HeartDisease').count()
Two classes: 0 (False – no disease) and 1 (True – heart disease).
#Check for null values
print(data.isnull().sum())
Step 4 – Transform the data
Check data types of all columns:
#Check data types
data.dtypes
So, there are object data types and a float type as well. We have to convert these labels to numeric (int64) form so that they become machine-readable. This is done through label encoding:
def label_encoder(y):
    le = LabelEncoder()
    data[y] = le.fit_transform(data[y])

label_list = ["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "Oldpeak", "ST_Slope"]

for l in label_list:
    label_encoder(l)

#Display transformed data
data.head()
Step 5 – Split the data
Split the data into training and testing sets:
#Divide the dataset into independent and dependent variables
X = data.drop(["HeartDisease"], axis=1)
y = data['HeartDisease']

#Split the data as 80% train data and 20% test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)

y_train = y_train.values.reshape(-1, 1)
y_test = y_test.values.reshape(-1, 1)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
Step 6 – Standardize the data
We will perform feature scaling to rescale data to have a mean of 0 and standard deviation of 1 (unit variance):
#Feature Scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)

#Transform the test set with the scaler fitted on the training set
#(re-fitting on test data would leak test statistics into the scaling)
X_test = sc.transform(X_test)
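As a quick sanity check, the scaled training features should now have column means near 0 and standard deviations near 1:

#Verify the scaling: means should be ~0, standard deviations ~1
print(X_train.mean(axis=0).round(2))
print(X_train.std(axis=0).round(2))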
Step 7 – Implement Classification Models
We will build all six models and compare their accuracy scores.
#To store results of models, we create two dictionaries
result_dict_train = {}
result_dict_test = {}
- Logistic Regression
reg = LogisticRegression(random_state=42)
accuracies = cross_val_score(reg, X_train, y_train, cv=5)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

#Obtain accuracy
print("Train Score:", np.mean(accuracies))
print("Test Score:", reg.score(X_test, y_test))

#Store results in the dictionaries
result_dict_train["Logistic Train Score"] = np.mean(accuracies)
result_dict_test["Logistic Test Score"] = reg.score(X_test, y_test)
- KNN Classifier
knn = KNeighborsClassifier()
accuracies = cross_val_score(knn, X_train, y_train, cv=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

#Obtain accuracy
print("Train Score:", np.mean(accuracies))
print("Test Score:", knn.score(X_test, y_test))

#Store results in the dictionaries
result_dict_train["KNN Train Score"] = np.mean(accuracies)
result_dict_test["KNN Test Score"] = knn.score(X_test, y_test)
- Support Vector Classifier
svc = SVC(random_state=42)
accuracies = cross_val_score(svc, X_train, y_train, cv=5)
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)

#Obtain accuracy
print("Train Score:", np.mean(accuracies))
print("Test Score:", svc.score(X_test, y_test))

#Store results in the dictionaries
result_dict_train["SVM Train Score"] = np.mean(accuracies)
result_dict_test["SVM Test Score"] = svc.score(X_test, y_test)
- Decision Tree Classifier
dtc = DecisionTreeClassifier(random_state=42)
accuracies = cross_val_score(dtc, X_train, y_train, cv=5)
dtc.fit(X_train, y_train)
y_pred = dtc.predict(X_test)

#Obtain accuracy
print("Train Score:", np.mean(accuracies))
print("Test Score:", dtc.score(X_test, y_test))

#Store results in the dictionaries
result_dict_train["Decision Tree Train Score"] = np.mean(accuracies)
result_dict_test["Decision Tree Test Score"] = dtc.score(X_test, y_test)
- Random Forest Classifier
rfc = RandomForestClassifier(random_state=42)
accuracies = cross_val_score(rfc, X_train, y_train, cv=5)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)

#Obtain accuracy
print("Train Score:", np.mean(accuracies))
print("Test Score:", rfc.score(X_test, y_test))

#Store results in the dictionaries
result_dict_train["Random Forest Train Score"] = np.mean(accuracies)
result_dict_test["Random Forest Test Score"] = rfc.score(X_test, y_test)
- Naïve Bayes Classifier
gnb = GaussianNB()
accuracies = cross_val_score(gnb, X_train, y_train, cv=5)
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)

#Obtain accuracy
print("Train Score:", np.mean(accuracies))
print("Test Score:", gnb.score(X_test, y_test))

#Store results in the dictionaries
result_dict_train["Gaussian NB Train Score"] = np.mean(accuracies)
result_dict_test["Gaussian NB Test Score"] = gnb.score(X_test, y_test)
Step 8 – Compare Accuracy Scores
#Display the train (cross-validation) scores of all models
df_result_train = pd.DataFrame.from_dict(result_dict_train, orient="index", columns=["Score"])
df_result_train
Let’s display the test accuracy scores of all models:
df_result_test = pd.DataFrame.from_dict(result_dict_test, orient="index", columns=["Score"])
df_result_test
Let’s visualize the scores, shall we?
import seaborn as sns

fig, ax = plt.subplots(1, 2, figsize=(20, 5))
sns.barplot(x=df_result_train.index, y=df_result_train.Score, ax=ax[0])
sns.barplot(x=df_result_test.index, y=df_result_test.Score, ax=ax[1])
ax[0].set_xticklabels(df_result_train.index, rotation=75)
ax[1].set_xticklabels(df_result_test.index, rotation=75)
plt.show()
From the above graphs, we can conclude the following:
- The Random Forest classifier has the highest test score.
- The Decision Tree classifier has the lowest test score among all classifiers.
Once you have trained your model, the next important step is to evaluate and optimize the classifier to verify its applicability.
Endnotes
Having a clear understanding of how to choose the classification model best suited to a problem is instrumental in solving supervised Machine Learning problems. Artificial Intelligence & Machine Learning is a rapidly growing domain that has hugely impacted big businesses worldwide.