How to Choose the Value of k in K-fold Cross-Validation
Many readers have questions about how to choose the value of k in K-fold cross-validation. Let's unravel the mystery.
In the previous article, we talked about cross-validation and its different techniques. In this article, we will see how to set the value of k in K-fold cross-validation by working on a cancer dataset, and we will compare the accuracy scores obtained for different values of k.
Cross-validation (CV) is a technique for evaluating a machine learning model and testing its performance. It is commonly used in applied ML tasks and helps in comparing and selecting an appropriate model. CV tends to give a less biased estimate of model performance than other methods, such as a single train/test split. To learn more about cross-validation and its different techniques, explore: Cross-validation techniques.
Here's how to set the value of k in K-fold cross-validation:
Choose the value of k such that the model evaluation doesn't suffer from high variance or high bias. In most cases k is set to 5 or 10, but there is no formal rule. The right value depends on the size of the dataset, and large values of k increase the runtime and computational cost of cross-validation, because the model is trained k times. Let's understand this with Python code. The example in this article uses logistic regression; the same procedure can be applied to other classifiers such as decision tree, random forest, and SVM.
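To make the dependence on dataset size concrete, here is a small sketch that counts how many samples fall into each validation fold for a few values of k. It assumes 569 samples (the size of the dataset used in this article), and the helper name `fold_sizes` is purely illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold


def fold_sizes(n_samples, k):
    """Return a (train_size, validation_size) pair for each of the k folds."""
    indices = np.arange(n_samples)
    cv = KFold(n_splits=k)
    return [(len(train), len(test)) for train, test in cv.split(indices)]


n_samples = 569  # assumed dataset size
for k in (2, 5, 10, 30):
    sizes = fold_sizes(n_samples, k)
    validation_sizes = sorted({test for _, test in sizes})
    print(f"k={k:2d}: {len(sizes)} model fits, validation fold sizes {validation_sizes}")
```

As k grows, the model is fitted more times and each validation fold gets smaller, so the score of any single fold becomes noisier.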
Let's jump to the Python code.
Suppose we want to classify a tumor as benign or malignant.
1. Import Libraries
from numpy import mean
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import pandas as pd
import numpy as np
All the libraries are imported. We are going to use LogisticRegression in this example.
2. Loading the dataset
df = pd.read_csv('/content/cancer_dataset.csv')
df.head()
This loads the cancer_dataset.csv file and displays its first rows.
3. Independent and dependent features
# Independent and dependent features
X = df.iloc[:, 2:]
y = df.iloc[:, 1]
X = X.dropna(axis=1)
y
0 M
1 M
2 M
3 M
4 M
..
564 M
565 M
566 M
567 M
568 B
Name: diagnosis, Length: 569, dtype: object
X holds the independent features and y holds the dependent feature (the diagnosis).
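The same slicing can be tried on a tiny, made-up table to see what each step does. This is only a sketch: the column names and values below are hypothetical, chosen so that the first column is an identifier, the second is the label, and one feature column is entirely empty.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the assumed layout: id, label, then features.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "M"],
    "radius_mean": [17.9, 13.5, 20.6],
    "texture_mean": [10.4, 14.4, 17.8],
    "empty_col": [np.nan, np.nan, np.nan],
})

X_toy = toy.iloc[:, 2:]        # every column from the third one onward
y_toy = toy.iloc[:, 1]         # the second column is the label
X_toy = X_toy.dropna(axis=1)   # drop columns that contain missing values

print(list(X_toy.columns))  # ['radius_mean', 'texture_mean']
print(y_toy.tolist())       # ['M', 'B', 'M']
```

Note that dropna(axis=1) removes whole columns rather than rows, so the number of samples stays the same.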
4. Splitting the dataset into train and test
# Splitting the dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=4)
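The dataset is split into training and test data using train_test_split.
test_size=0.30 means the training data is 70% and the test data is 30% of the dataset. Note that the cross-validation in the following steps runs on the full X and y, so this split is not used there.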
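As a quick check of what a 70/30 split looks like in row counts, the sketch below applies the same call to placeholder arrays. The arrays `X_demo` and `y_demo` are made up for illustration and only assume 569 rows, like the dataset above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the assumed number of rows (569); values are arbitrary.
X_demo = np.arange(569 * 2).reshape(569, 2)
y_demo = np.arange(569) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=4
)

# Report how many rows ended up in each part.
print("train rows:", len(X_tr))
print("test rows:", len(X_te))
```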
5. Define folds to test the values of k in the given range
# define folds to test the values of k in the given range
folds = range(2, 31)
We want to check the accuracy for every k from 2 to 30, so we define that range here.
6. Evaluating the model using a given test condition
# evaluate the model using a given test condition
def evaluate_model(cv):
    # get the dataset: independent and dependent features
    X = df.iloc[:, 2:]
    y = df.iloc[:, 1]
    X = X.dropna(axis=1)
    # get the model
    model = LogisticRegression()
    # evaluate the model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # return the mean, minimum, and maximum fold accuracy
    return mean(scores), scores.min(), scores.max()
Next, evaluate_model(cv) evaluates the model on the dataset. It separates the data into independent and dependent features and uses LogisticRegression() as the model.
cross_val_score() is used to calculate the scores. Our function returns the mean classification accuracy as well as the min and max accuracy scores from the folds.
- n_jobs=-1 sets the number of jobs to run in parallel. Training the estimator and computing the score are parallelized over the cross-validation splits. None means 1 job (unless a joblib parallel backend context is active), and -1 means using all available processors.
- cv determines the cross-validation splitting strategy. Here we pass a KFold object.
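The call can also be tried in isolation. The sketch below makes two assumptions that are not part of the walkthrough above: it loads scikit-learn's bundled breast cancer data with load_breast_cancer instead of reading the CSV file, and it standardizes the features before logistic regression so the solver converges cleanly. It runs with n_jobs=None (a single job); passing n_jobs=-1 instead would spread the folds over all processors.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for cancer_dataset.csv.
X_bc, y_bc = load_breast_cancer(return_X_y=True)

# Assumption: scaling is added here; the article's code fits on raw features.
model = make_pipeline(StandardScaler(), LogisticRegression())

# cv accepts a splitter object; here 5 shuffled folds with a fixed seed.
cv = KFold(n_splits=5, shuffle=True, random_state=10)

# One accuracy score is returned per fold.
scores = cross_val_score(model, X_bc, y_bc, scoring="accuracy", cv=cv, n_jobs=None)

print("scores per fold:", scores)
print("mean=%.3f min=%.3f max=%.3f" % (scores.mean(), scores.min(), scores.max()))
```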
Note: Your results may vary according to the evaluation procedure and stochastic nature of the algorithm or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
7. Evaluating each k value
# evaluate each k value
for k in folds:
    # define the test condition
    cv = KFold(n_splits=k, shuffle=True, random_state=10)
    # record mean and min/max of each set of results
    k_mean, k_min, k_max = evaluate_model(cv)
    # report performance
    print('-> folds=%d, accuracy=%.3f (%.3f,%.3f)' % (k, k_mean, k_min, k_max))
Here we apply KFold cross-validation and report the mean, min, and max accuracy for each k value that is evaluated. random_state seeds the random number generator used for shuffling, so the same splits are produced on every run.
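The effect of the seed can be checked directly. In this sketch the array `data` and the helper `fold_test_indices` are illustrative names, not part of the code above; the helper just collects the validation indices that KFold produces for a given seed.

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(20)  # placeholder samples, only their positions matter


def fold_test_indices(random_state):
    """Return the validation indices of each fold for a given seed."""
    cv = KFold(n_splits=4, shuffle=True, random_state=random_state)
    return [test.tolist() for _, test in cv.split(data)]


same_a = fold_test_indices(10)
same_b = fold_test_indices(10)
other = fold_test_indices(11)

print("same seed, same folds:", same_a == same_b)
print("different seed, same folds:", same_a == other)
```

With the same seed the folds are identical, which is what makes the accuracy figures repeatable from run to run.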
Output:
-> folds=2, accuracy=0.935 (0.905,0.965)
-> folds=3, accuracy=0.944 (0.932,0.963)
-> folds=4, accuracy=0.935 (0.887,0.965)
-> folds=5, accuracy=0.939 (0.895,0.974)
-> folds=6, accuracy=0.942 (0.895,0.968)
-> folds=7, accuracy=0.947 (0.915,0.975)
-> folds=8, accuracy=0.937 (0.859,0.986)
-> folds=9, accuracy=0.951 (0.906,0.984)
-> folds=10, accuracy=0.946 (0.860,1.000)
-> folds=11, accuracy=0.939 (0.846,1.000)
-> folds=12, accuracy=0.954 (0.872,1.000)
-> folds=13, accuracy=0.944 (0.864,1.000)
-> folds=14, accuracy=0.942 (0.854,1.000)
-> folds=15, accuracy=0.947 (0.868,1.000)
-> folds=16, accuracy=0.941 (0.861,1.000)
-> folds=17, accuracy=0.941 (0.853,1.000)
-> folds=18, accuracy=0.946 (0.875,1.000)
-> folds=19, accuracy=0.947 (0.833,1.000)
-> folds=20, accuracy=0.940 (0.828,1.000)
-> folds=21, accuracy=0.947 (0.815,1.000)
-> folds=22, accuracy=0.942 (0.846,1.000)
-> folds=23, accuracy=0.949 (0.840,1.000)
-> folds=24, accuracy=0.946 (0.833,1.000)
-> folds=25, accuracy=0.941 (0.783,1.000)
-> folds=26, accuracy=0.940 (0.818,1.000)
-> folds=27, accuracy=0.947 (0.857,1.000)
-> folds=28, accuracy=0.944 (0.750,1.000)
-> folds=29, accuracy=0.942 (0.750,1.000)
-> folds=30, accuracy=0.942 (0.789,1.000)
We got different accuracies for different values of k. Which one should we choose? The highest mean accuracy, 0.954 (95.4%), occurs at k=12, so we will choose k=12 in this case.
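The selection can also be done programmatically. The sketch below copies the mean accuracies from the output above into a dictionary (the name `mean_accuracy` is just for illustration) and picks the k with the highest value.

```python
# Mean accuracy per k, copied from the printed results for k = 2..30.
mean_accuracy = {
    2: 0.935, 3: 0.944, 4: 0.935, 5: 0.939, 6: 0.942, 7: 0.947,
    8: 0.937, 9: 0.951, 10: 0.946, 11: 0.939, 12: 0.954, 13: 0.944,
    14: 0.942, 15: 0.947, 16: 0.941, 17: 0.941, 18: 0.946, 19: 0.947,
    20: 0.940, 21: 0.947, 22: 0.942, 23: 0.949, 24: 0.946, 25: 0.941,
    26: 0.940, 27: 0.947, 28: 0.944, 29: 0.942, 30: 0.942,
}

# Pick the k whose mean accuracy is highest.
best_k = max(mean_accuracy, key=mean_accuracy.get)
print(best_k, mean_accuracy[best_k])  # 12 0.954

# Difference between the best and the worst mean accuracy.
spread = round(max(mean_accuracy.values()) - min(mean_accuracy.values()), 3)
print(spread)  # 0.019
```

In a full run, the same dictionary could be filled inside the evaluation loop instead of being typed in by hand.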
Note: Accuracy score and k value will vary with different classifiers and different cross-validation techniques.
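To see how the scores shift with the classifier, the procedure can be repeated with other models. The sketch below is only an illustration under stated assumptions: it uses scikit-learn's bundled breast cancer data (load_breast_cancer) rather than the CSV file, default or arbitrarily chosen hyperparameters, and just two values of k to keep the run short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled dataset stands in for cancer_dataset.csv.
X_bc, y_bc = load_breast_cancer(return_X_y=True)

# Assumption: hyperparameters are defaults or arbitrary illustrative choices.
classifiers = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "svm": SVC(),
}

results = {}
for name, clf in classifiers.items():
    for k in (5, 10):
        cv = KFold(n_splits=k, shuffle=True, random_state=10)
        scores = cross_val_score(clf, X_bc, y_bc, scoring="accuracy", cv=cv)
        results[(name, k)] = scores.mean()
        print("%s, folds=%d, accuracy=%.3f" % (name, k, scores.mean()))
```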
Endnotes
I hope this blog answered your questions about the value of k in KFold cross-validation.