How to choose the Value of k in K-fold Cross-Validation

Updated on Jun 14, 2022 11:16 IST

Many of you may have questions about how to choose the value of k in K-fold cross-validation. Let’s unravel the mystery.

In the previous article, we talked about cross-validation and its different techniques. In this article, we will see how to set the value of k in K-fold cross-validation by working on a cancer dataset and comparing the accuracy scores obtained for different values of k.

Cross-validation is a technique for evaluating a machine learning model and testing its performance. It is commonly used in applied ML tasks and helps in comparing and selecting an appropriate model. Cross-validation (CV) tends to give a less biased estimate of model performance than a single train/test split. To know more about cross-validation and its different techniques, explore: Cross-validation techniques.

Here’s how to set the value of k in K-fold cross-validation…

Choose the value of k such that the performance estimate doesn’t suffer from high variance or high bias. In most cases k is set to 5 or 10, but there is no formal rule. The right value depends on the size of the dataset, and larger values of k increase the runtime and computational cost of cross-validation, since the model is trained k times. Let’s understand this with Python code by evaluating a LogisticRegression classifier for a range of k values.
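
To make the trade-off concrete, here is a minimal sketch of how the fold sizes and the number of model fits change with k. The helper name kfold_sizes is hypothetical; it mirrors the fold-size rule documented for scikit-learn's KFold (the first n_samples % k folds get one extra sample), and 569 is the number of rows in the dataset used later in this article.

```python
def kfold_sizes(n_samples, k):
    """Return the size of each test fold when n_samples rows are split into k folds."""
    base, extra = divmod(n_samples, k)
    # The first `extra` folds receive one additional sample
    return [base + 1 if i < extra else base for i in range(k)]


n_samples = 569  # rows in the cancer dataset used in this article

for k in (2, 5, 10, 30):
    sizes = kfold_sizes(n_samples, k)
    smallest_train = n_samples - max(sizes)
    print(
        f"k={k:2d}: model fits={k}, "
        f"test fold size={min(sizes)}-{max(sizes)}, "
        f"training rows per fit>={smallest_train}"
    )
```

A small k leaves less data for training in each fit, while a large k means more fits and very small test folds, so each fold's accuracy becomes noisier.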

You can read this blog for more understanding:

Cross-validation techniques

Bias and Variance with Real-Life Examples

This blog revolves around bias and variance and its tradeoff. These concepts are explained with respect to overfitting and underfitting with proper examples.

Let’s jump to the Python code:

Suppose we want to classify a cancer case as benign or malignant.

1. Import Libraries

from numpy import mean
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import pandas as pd
import numpy as np

All the libraries are imported. We are going to use LogisticRegression in this example.

2. Loading the dataset

df = pd.read_csv('/content/cancer_dataset.csv')
df.head()

This loads the cancer_dataset.csv file into a DataFrame and displays its first five rows.

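3. Independent and dependent features

# Independent and dependent features
X = df.iloc[:, 2:]      # feature columns (everything after the diagnosis column)
y = df.iloc[:, 1]       # target column: diagnosis (M = malignant, B = benign)
X = X.dropna(axis=1)    # drop columns that contain missing values
y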

0      M
1      M
2      M
3      M
4      M
      ..
564    M
565    M
566    M
567    M
568    B
Name: diagnosis, Length: 569, dtype: object

X holds the independent features and y holds the dependent feature (the diagnosis).

4. Splitting the dataset into train and test

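# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=4)

The dataset is split into training and test data using train_test_split.

test_size=0.30 means 30% of the data is held out for testing and the remaining 70% is used for training. Note that the cross-validation in the following steps runs on the full X and y, so this split is not used by it.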
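
As a quick check of what a 70/30 split means in rows, the arithmetic can be sketched as follows. This assumes the 569-row dataset shown above and that a fractional test_size is rounded up to a whole number of test rows, which is how I understand train_test_split to behave; treat the rounding rule as an assumption.

```python
import math

n_samples = 569    # rows in the cancer dataset used in this article
test_size = 0.30   # same fraction as in the split above

# Assumption: a fractional test_size is rounded up to a whole number of rows
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"train rows: {n_train}, test rows: {n_test}")
print(f"actual test fraction: {n_test / n_samples:.3f}")
```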

5. Define folds to test the values of k in the given range

# define folds to test the values of k in the given range
folds = range(2, 31)

We want to check accuracies for every k from 2 to 30, so the range is defined as range(2, 31) (the upper bound is exclusive).

6. Evaluating the model using a given test condition

# evaluate the model using a given test condition
def evaluate_model(cv):
    # get the dataset: independent and dependent features
    X = df.iloc[:, 2:]
    y = df.iloc[:, 1]
    X = X.dropna(axis=1)
    # get the model
    model = LogisticRegression()
    # evaluate the model with the given cross-validation strategy
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # return the mean, minimum, and maximum accuracy across the folds
    return mean(scores), scores.min(), scores.max()

Next, evaluate_model(cv) evaluates the model on the dataset: it separates the data into independent and dependent features, creates a LogisticRegression() model, and scores it with cross-validation.

cross_val_score() computes one accuracy score per fold. The function returns the mean classification accuracy as well as the minimum and maximum accuracy scores across the folds.

n_jobs=-1 sets the number of jobs to run in parallel. Training the estimator and computing the score are parallelized over the cross-validation splits. None means 1 job, and -1 means using all available processors.
cv determines the cross-validation splitting strategy; here it is the KFold object passed to the function.
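
To show what cross_val_score() does under the hood, here is a minimal sketch that compares it with an explicit loop over the folds. Several things here are assumptions rather than the article's setup: it uses scikit-learn's built-in load_breast_cancer dataset as a stand-in for cancer_dataset.csv, adds a StandardScaler step and max_iter=1000 so the solver converges cleanly, and fixes k=5 for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (assumption): scikit-learn's built-in breast cancer data
X, y = load_breast_cancer(return_X_y=True)

# Scaling is an addition for this sketch; the article uses a bare LogisticRegression()
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = KFold(n_splits=5, shuffle=True, random_state=10)

# One accuracy score per fold, computed by cross_val_score
scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv)

# The same computation written out: fit a fresh copy on each training part,
# then measure accuracy on the held-out fold
manual_scores = []
for train_idx, test_idx in cv.split(X):
    fitted = clone(model).fit(X[train_idx], y[train_idx])
    manual_scores.append(fitted.score(X[test_idx], y[test_idx]))

print("cross_val_score:", np.round(scores, 3))
print("manual loop:    ", np.round(manual_scores, 3))
print("mean/min/max:   ", scores.mean(), scores.min(), scores.max())
```

Because random_state is fixed, cv.split() yields the same folds both times, so the two sets of scores should match.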

Note: Your results may vary according to the evaluation procedure and stochastic nature of the algorithm or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

7. Evaluating each k value

# evaluate each k value
for k in folds:
    # define the test condition
    cv = KFold(n_splits=k, shuffle=True, random_state=10)
    # record mean and min/max of each set of results
    k_mean, k_min, k_max = evaluate_model(cv)
    # report performance
    print('-> folds=%d, accuracy=%.3f (%.3f,%.3f)' % (k, k_mean, k_min, k_max))

Here KFold cross-validation is applied for each k, and the mean, minimum, and maximum accuracy are reported for every k value evaluated. random_state seeds the random number generator used for shuffling, so the same folds are produced on every run.

Output:

-> folds=2, accuracy=0.935 (0.905,0.965)
-> folds=3, accuracy=0.944 (0.932,0.963)
-> folds=4, accuracy=0.935 (0.887,0.965)
-> folds=5, accuracy=0.939 (0.895,0.974)
-> folds=6, accuracy=0.942 (0.895,0.968)
-> folds=7, accuracy=0.947 (0.915,0.975)
-> folds=8, accuracy=0.937 (0.859,0.986)
-> folds=9, accuracy=0.951 (0.906,0.984)
-> folds=10, accuracy=0.946 (0.860,1.000)
-> folds=11, accuracy=0.939 (0.846,1.000)
-> folds=12, accuracy=0.954 (0.872,1.000)
-> folds=13, accuracy=0.944 (0.864,1.000)
-> folds=14, accuracy=0.942 (0.854,1.000)
-> folds=15, accuracy=0.947 (0.868,1.000)
-> folds=16, accuracy=0.941 (0.861,1.000)
-> folds=17, accuracy=0.941 (0.853,1.000)
-> folds=18, accuracy=0.946 (0.875,1.000)
-> folds=19, accuracy=0.947 (0.833,1.000)
-> folds=20, accuracy=0.940 (0.828,1.000)
-> folds=21, accuracy=0.947 (0.815,1.000)
-> folds=22, accuracy=0.942 (0.846,1.000)
-> folds=23, accuracy=0.949 (0.840,1.000)
-> folds=24, accuracy=0.946 (0.833,1.000)
-> folds=25, accuracy=0.941 (0.783,1.000)
-> folds=26, accuracy=0.940 (0.818,1.000)
-> folds=27, accuracy=0.947 (0.857,1.000)
-> folds=28, accuracy=0.944 (0.750,1.000)
-> folds=29, accuracy=0.942 (0.750,1.000)
-> folds=30, accuracy=0.942 (0.789,1.000)

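We get different accuracies for different values of k. Which one should we choose? The highest mean accuracy, 0.954 (95.4%), is obtained at k=12, so we choose k=12 in this case. Keep in mind that the mean accuracies are close to one another (0.935 to 0.954), while the gap between the minimum and maximum fold accuracy widens as k grows, because each test fold contains fewer samples.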
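
The selection step can also be done programmatically instead of by eye. This is a minimal sketch that copies the mean accuracies from the output above into a dictionary (the name mean_accuracy is hypothetical) and picks the k with the highest value.

```python
# Mean accuracy per k, copied from the output listed above
mean_accuracy = {
    2: 0.935, 3: 0.944, 4: 0.935, 5: 0.939, 6: 0.942, 7: 0.947,
    8: 0.937, 9: 0.951, 10: 0.946, 11: 0.939, 12: 0.954, 13: 0.944,
    14: 0.942, 15: 0.947, 16: 0.941, 17: 0.941, 18: 0.946, 19: 0.947,
    20: 0.940, 21: 0.947, 22: 0.942, 23: 0.949, 24: 0.946, 25: 0.941,
    26: 0.940, 27: 0.947, 28: 0.944, 29: 0.942, 30: 0.942,
}

# Pick the k whose mean accuracy is highest
best_k = max(mean_accuracy, key=mean_accuracy.get)
print(f"best k = {best_k}, mean accuracy = {mean_accuracy[best_k]:.3f}")

# Show how close the alternatives are
spread = max(mean_accuracy.values()) - min(mean_accuracy.values())
print(f"difference between best and worst mean accuracy: {spread:.3f}")
```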

Note: Accuracy score and k value will vary with different classifiers and different cross-validation techniques.

Endnotes

I hope this blog answered your questions about choosing the value of k in K-fold cross-validation.

About the Author

Shiksha Online

This is a collection of insightful articles from domain experts in the fields of Cloud Computing, DevOps, AWS, Data Science, Machine Learning, AI, and Natural Language Processing. The range of topics caters to upski...
