How to choose the Value of k in K-fold Cross-Validation

Updated on Jun 14, 2022 11:16 IST

Many readers have questions about how to choose the value of k in K-fold cross-validation. Let’s unravel the mystery.


In the previous article, we talked about cross-validation and its different techniques. In this article, we will see how to set the value of k in K-fold cross-validation by working on a cancer dataset. We will compute the accuracy score for each of a range of k values.

Cross-validation is a technique for evaluating a machine learning model and testing its performance. It is commonly used in applied ML tasks, where it helps in comparing and selecting an appropriate model. Cross-validation generally gives a less biased, less optimistic estimate of model performance than simpler methods such as a single train/test split. To know more about cross-validation and its different techniques, explore: Cross-validation techniques.

Here’s how to set the value of k in K-fold cross-validation

Choose the value of k such that the performance estimate suffers from neither high variance nor high bias. In most cases k is set to 5 or 10, but there is no formal rule. The right value depends on the size of the dataset, and larger values of k increase the runtime and computational cost of cross-validation because the model is trained k times. Let’s understand this with Python code, using a LogisticRegression classifier.
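To make the trade-off concrete, here is a minimal sketch of how k affects the fold sizes and the number of model fits. The helper `fold_sizes` is hypothetical (it is not part of scikit-learn); it follows the rule documented for scikit-learn's KFold, where the first `n_samples % k` folds get one extra sample. The value 569 is the number of rows reported for the cancer dataset in this article.

```python
def fold_sizes(n_samples, k):
    """Return the test-fold sizes that K-fold produces for n_samples rows.

    Hypothetical helper: the first n_samples % k folds get one extra sample,
    which is the rule documented for scikit-learn's KFold.
    """
    base, extra = divmod(n_samples, k)
    return [base + 1 if i < extra else base for i in range(k)]


n_samples = 569  # rows in the cancer dataset used in this article
for k in (5, 10, 30):
    sizes = fold_sizes(n_samples, k)
    # Each value of k means k model fits; larger k gives smaller test folds
    # and larger training sets per fit.
    print(f"k={k}: fits={k}, test fold size {min(sizes)}-{max(sizes)}, "
          f"training rows per fit about {n_samples - max(sizes)}")
```

With a larger k, each fit trains on more data but is scored on fewer rows, and the total number of fits grows with k.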

For more background, you can read these blogs:

Cross-validation techniques
Bias and Variance with Real-Life Examples

Let’s jump to the Python code:

Suppose we want to classify a tumor as benign or malignant.

1. Import Libraries

import pandas as pd
from numpy import mean
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

All the libraries are imported. We are going to use LogisticRegression in this example.

2. Loading the dataset

# load the dataset and preview the first rows
df = pd.read_csv('/content/cancer_dataset.csv')
df.head()

The cancer_dataset.csv file is loaded into a DataFrame.

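3. Independent and dependent features

# independent and dependent features
X = df.iloc[:, 2:]    # feature columns (everything after the diagnosis column)
y = df.iloc[:, 1]     # target column: diagnosis (M = malignant, B = benign)
X = X.dropna(axis=1)  # drop columns that contain missing values

y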

0      M
1      M
2      M
3      M
4      M
      ..
564    M
565    M
566    M
567    M
568    B
Name: diagnosis, Length: 569, dtype: object

X holds the independent features and y holds the dependent feature (the diagnosis).

4. Splitting the dataset into train and test

#Splitting the dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=4)
 

The dataset is split into training and test sets using train_test_split.

test_size=0.30 means that 30% of the data is held out as test data and the remaining 70% is used as training data.
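As a quick check of what a 70/30 split means in row counts, the sketch below runs train_test_split on placeholder arrays with 569 rows, the same length as the dataset in this article. The arrays `X_demo` and `y_demo` are made-up stand-ins, not the real cancer data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the cancer dataset.
X_demo = np.arange(569 * 2).reshape(569, 2)
y_demo = np.arange(569) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=4
)

# The test share is rounded up to a whole number of rows.
print("train rows:", len(X_tr), "test rows:", len(X_te))
```

Because 30% of 569 is not a whole number, the split is only approximately 70/30.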

5. Define folds to test the values of k in the given range

# define folds to test the values of k in the given range
folds = range(2,31)
 

We want to check the accuracy for every value of k from 2 to 30, so we define that range here.

6. Evaluating the model using a given test condition

# evaluate the model using a given test condition
def evaluate_model(cv):
    # get the dataset: independent and dependent features
    X = df.iloc[:, 2:]
    y = df.iloc[:, 1]
    X = X.dropna(axis=1)
    # get the model
    model = LogisticRegression()
    # evaluate the model with the given cross-validation strategy
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # return the mean, minimum, and maximum accuracy across the folds
    return mean(scores), scores.min(), scores.max()
 

Next, evaluate_model(cv) evaluates the model on the dataset: it separates the data into independent and dependent features and creates a LogisticRegression() model.

cross_val_score() computes the accuracy score for each fold. evaluate_model() then returns the mean classification accuracy as well as the minimum and maximum accuracy scores across the folds.

  • n_jobs=-1 sets the number of jobs to run in parallel. Training the estimator and computing the score are parallelized over the cross-validation splits. None means 1 job, and -1 means using all available processors.
  • cv determines the cross-validation splitting strategy.
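The cv parameter accepts either an integer or a splitter object. The sketch below shows both forms on synthetic data from make_classification (an assumption made for illustration; it is not the cancer dataset). Note that when cv is an integer and the estimator is a classifier, scikit-learn uses StratifiedKFold rather than plain KFold. The max_iter=1000 setting is also an illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic binary classification data, used only for illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# Form 1: an integer. For a classifier this means stratified 5-fold splitting.
scores_int = cross_val_score(model, X_demo, y_demo, scoring='accuracy', cv=5)

# Form 2: an explicit splitter object, as used in this article.
cv = KFold(n_splits=5, shuffle=True, random_state=10)
scores_kfold = cross_val_score(model, X_demo, y_demo, scoring='accuracy', cv=cv)

print("cv=5 (stratified):", scores_int.mean())
print("KFold object     :", scores_kfold.mean())
```

Passing a splitter object gives explicit control over shuffling and the random seed, which is why this article builds a KFold object for every k.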

Note: Your results may vary according to the evaluation procedure and stochastic nature of the algorithm or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
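One way to run the procedure several times and average the outcome is scikit-learn's RepeatedKFold, which repeats K-fold with a different shuffle on each repetition. This is a minimal sketch on synthetic data (an assumption for illustration); the values n_splits=10 and n_repeats=3 are arbitrary choices, not recommendations from this article.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic binary classification data, used only for illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# 10-fold cross-validation repeated 3 times gives 30 accuracy scores.
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X_demo, y_demo, scoring='accuracy', cv=cv)

print("scores collected:", len(scores))
print("mean accuracy: %.3f, standard deviation: %.3f" % (mean(scores), std(scores)))
```

Averaging over repetitions smooths out the effect of any single shuffle of the data.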

7. Evaluating each k value

# evaluate each k value
for k in folds:
    # define the test condition
    cv = KFold(n_splits=k, shuffle=True, random_state=10)
    # record mean and min/max of each set of results
    k_mean, k_min, k_max = evaluate_model(cv)
    # report performance
    print('-> folds=%d, accuracy=%.3f (%.3f,%.3f)' % (k, k_mean, k_min, k_max))
 

KFold cross-validation is applied for each k, and the mean, minimum, and maximum accuracy are reported for every value evaluated. random_state seeds the random number generator, so the shuffling, and therefore the splits, are the same on every run.

Output:

-> folds=2, accuracy=0.935 (0.905,0.965)
-> folds=3, accuracy=0.944 (0.932,0.963)
-> folds=4, accuracy=0.935 (0.887,0.965)
-> folds=5, accuracy=0.939 (0.895,0.974)
-> folds=6, accuracy=0.942 (0.895,0.968)
-> folds=7, accuracy=0.947 (0.915,0.975)
-> folds=8, accuracy=0.937 (0.859,0.986)
-> folds=9, accuracy=0.951 (0.906,0.984)
-> folds=10, accuracy=0.946 (0.860,1.000)
-> folds=11, accuracy=0.939 (0.846,1.000)
-> folds=12, accuracy=0.954 (0.872,1.000)
-> folds=13, accuracy=0.944 (0.864,1.000)
-> folds=14, accuracy=0.942 (0.854,1.000)
-> folds=15, accuracy=0.947 (0.868,1.000)
-> folds=16, accuracy=0.941 (0.861,1.000)
-> folds=17, accuracy=0.941 (0.853,1.000)
-> folds=18, accuracy=0.946 (0.875,1.000)
-> folds=19, accuracy=0.947 (0.833,1.000)
-> folds=20, accuracy=0.940 (0.828,1.000)
-> folds=21, accuracy=0.947 (0.815,1.000)
-> folds=22, accuracy=0.942 (0.846,1.000)
-> folds=23, accuracy=0.949 (0.840,1.000)
-> folds=24, accuracy=0.946 (0.833,1.000)
-> folds=25, accuracy=0.941 (0.783,1.000)
-> folds=26, accuracy=0.940 (0.818,1.000)
-> folds=27, accuracy=0.947 (0.857,1.000)
-> folds=28, accuracy=0.944 (0.750,1.000)
-> folds=29, accuracy=0.942 (0.750,1.000)
-> folds=30, accuracy=0.942 (0.789,1.000)
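The effect of random_state in the KFold call above can be checked with a small sketch. The helper `fold_test_indices` is hypothetical and the 20-element array is placeholder data; the point is only that the same seed reproduces the same folds.

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(20)  # placeholder data: 20 samples


def fold_test_indices(seed):
    """Return the test indices of each fold for a shuffled 4-fold split."""
    cv = KFold(n_splits=4, shuffle=True, random_state=seed)
    return [test_idx.tolist() for _, test_idx in cv.split(data)]


first_run = fold_test_indices(10)
second_run = fold_test_indices(10)

# The same seed gives identical folds on every run.
print("same folds with the same seed:", first_run == second_run)
# Every sample lands in exactly one test fold.
print("all samples covered:", sorted(sum(first_run, [])) == list(range(20)))
```

Without a fixed random_state, shuffle=True would produce different folds on each run, and the reported accuracies would change slightly from run to run.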
 

Here we get a different accuracy for each value of k, so which one should we choose? The highest mean accuracy, 0.954 (95.4%), occurs at k=12, so we choose k=12 in this case.
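The selection step can also be done in code rather than by reading the printout. The sketch below copies the mean accuracies from the output above into a dictionary (the name `mean_accuracy` is just for illustration), picks the k with the highest mean, and reports how far apart the best and worst means are.

```python
# Mean accuracy per k, copied from the output above.
mean_accuracy = {
    2: 0.935, 3: 0.944, 4: 0.935, 5: 0.939, 6: 0.942, 7: 0.947, 8: 0.937,
    9: 0.951, 10: 0.946, 11: 0.939, 12: 0.954, 13: 0.944, 14: 0.942,
    15: 0.947, 16: 0.941, 17: 0.941, 18: 0.946, 19: 0.947, 20: 0.940,
    21: 0.947, 22: 0.942, 23: 0.949, 24: 0.946, 25: 0.941, 26: 0.940,
    27: 0.947, 28: 0.944, 29: 0.942, 30: 0.942,
}

# k with the highest mean accuracy.
best_k = max(mean_accuracy, key=mean_accuracy.get)

# Gap between the best and worst mean accuracy across all k.
spread = max(mean_accuracy.values()) - min(mean_accuracy.values())

print("best k:", best_k, "with mean accuracy", mean_accuracy[best_k])
print("spread of mean accuracy across k: %.3f" % spread)
```

In this run the mean accuracies for all k lie within about two percentage points of each other, while the minimum fold accuracy drops as k grows and the test folds get smaller.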

Note: Accuracy score and k value will vary with different classifiers and different cross-validation techniques.


Endnotes

I hope this blog answered your questions about the value of k in K-fold cross-validation. If you liked this blog, please consider rating it with the stars below.
