All About Train Test Split
Train test split technique is used to estimate the performance of machine learning algorithms which are used to make predictions on data not used to train the model. In this article you will learn the importance of Train test split technique and its implementation in python.
One of the most important aspects of machine learning is training. The act of providing data to the algorithm training is vital in creating algorithms that perform tasks efficiently and effectively. The training can be time consuming and difficult to perform if using your own computer that’s where the train test split comes in handy.
A train test is the way of structuring your machine learning project so that you can test your hypothesis quickly and inexpensively. Basically it’s a way to divide the training data so that you can try your algorithm to one half and evaluate the result on the other half.
This article will tell you, how to use the train_test_split Sklearn function to split machine learning data into training and test sets.
Table of contents
- Train Test Split
- Parameters of Train Test Split
- When to Use Train Test Split
- Train Test Split in Classification and Regression
- Evaluation Metric for Train Test Split
- Train Test Split Python
- Conclusion
Best-suited Machine Learning courses for you
Learn Machine Learning with these high-rated online courses
Train Test Split
When it comes to data analysis, you can split your data into training and testing sets.
A train test split is when you split your data into a training set and a testing set. The training set is used for training the model, and the testing set is used to test your model. This allows you to train your models on the training set, and then test their accuracy on the unseen testing set. There are a few different ways to do a train test split, but the most common is to simply split your data into two sets. For example 80% for training and 20% for testing. This ensures that both sets are representative of the entire dataset, and gives you a good way to measure the accuracy of your models.
Syntax of Train Test Split
Before continuing, please note that in order to use this feature, you must first import it.
from sklearn.model_selection import train_test_split
After importing the function as above, call it as train_test_split() .
Within the brackets, specify the name of the ‘X’ input data as the first argument. This data should contain characteristic data. Optionally, you can also specify the name of the ‘y’ record containing the label or target record.
Parameters of Train Test Split
- X
- y
- test_size
- random_state
- shuffle
- Stratify
- X (required)
The X argument is the input array containing the feature data (i.e. the variables/columns used to build the model).
This object must be two-dimensional. So for a one dimensional numpy array you may need to reshape the data. This is usually done with code .reshape(-1,1) .
1. Y
The y argument usually contains a vector of target values (that is, data targets or labels).
2. TEST_SIZE
You can specify the size of the output test set using the test_size parameter. The arguments for this parameter can be integers or floating point numbers.
If the argument is an integer, the test set size will be that number.
If the argument is a float, it must be between 0 and 1 and the number represents the percentage of observations in the test set.
3. RANDOM_STATE
If you specify an integer as the argument for this parameter, train_test_split will shuffle the data in the same order before splitting. That means different information is sent to “X_train”, “X_test”, “y_train” and “y_test”.
4. SHUFFLE
By default this is set to shuffle=True. That is, by default the data is randomized before splitting, so observations are randomly assigned to the training and test data. Setting shuffle = False disables random sorting and splits the data in the order it already exists.
Note: If you set shuffle = False, you must set stratify = None. layer
5. STRATIFY
This stratification parameter performs a split so that the ratio of values in the generated samples is the same as the ratio of values provided to the parameter stratification.
By default this is set to Stratify = None.
When to Use Train Test Split
This method is not suitable when the amount of data available is small. This is because there is not enough data in the training dataset for the model to learn an effective mapping from inputs to outputs after the dataset is split into training and testing datasets. Also, the test set does not contain enough data to effectively evaluate the model’s performance. The estimated performance may be too optimistic (good) or too pessimistic (bad).
- In the absence of sufficient data, a suitable alternative model evaluation procedure is the k-fold cross-validation procedure.
- Projects can contain efficient models and large datasets, but sometimes you need a quick estimate of your model’s performance. The train-test-split method is also used in this situation.
- If the dataset is imbalanced then also train test split is not a good option.In that case first you need to balance the data first.
Also read: Difference between Regression and Classification Algorithms
How to Evaluate Train Test Split
There are a few things to consider when evaluating a train test split. The first is the size of the training set. The larger the training set, the more accurate the model will be. However, if the training set is too large, it can take longer to train the model and may overfit the data. The second thing for consideration is the size of the test set. The larger the test set, the more reliable the results will be. However, if the test set is too large, it can take longer to run and may not be representative of all data. Finally, consider the balance of each class in both sets. If one class is much larger than another, it can skew the results.
Also read: Evaluating a machine learning algorithm
Evaluation Metric for Train Test Split
1. Mean Square Error
The evaluation metric for this is mean squared error (or MSE for short). This is a metric that takes the actual value and produces a squared value. This squared value is the mean squared error between the model’s prediction and the actual value.
2. R Squared Error
The best metric for the train test split is the one that balances generalization with accuracy on the testing set. This metric is called R-sq loss. R-sq loss takes both the model’s accuracy and its generalization into account, and produces a single number for judging a model’s performance.
3. Accuracy
Accuracy refers to how close a predicted result is to an actual value. That is, how close the actual measured value is to the standard value. The closer the measurement, the higher the accuracy.
Also read:Difference between Accuracy and Precision
Explore: How to Calculate R squared in Linear Regression
Explore: R-squared vs. adjusted R-squared
Train Test Split Python
We have to predict the species of the fish.
About Data Set
This dataset contains seven species of fish data for market sale. We will perform it using python. The dataset is freely available on Kaggle.
1. Species–species name of fish
2. Weight- the weight of fish in Gram
3. Length1- vertical length in cm
4. Length2- diagonal length in cm
5. Length3- cross length in cm
6. Height- height in cm
7. Width- diagonal width in cm
Importing Libraries and Reading Dataset
import pandas as pdimport numpy as npdf=pd.read_csv('Fish.csv')df
Dependent and Independent Variables
#Get Target datay = dataset[‘Species’]#Load X Variables into a Pandas Dataframe with columnsX = dataset.drop(['Species'], axis = 1)
Splitting Dataset Into Training and Testing Set
# Splitting the dataset into the Training set and Test setfrom sklearn.model_selection import train_test_splitX_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
Next, split both x and y into training and testing sets with the help of the train_test_split() function. In this training data set is 0.8, which means 80%. Random _state =0 which means there is no randomness in the data.
Now after this we can implement any algorithm.Like in this we are implementing Logistic regression.
Importing and Fitting Logistic Regression to the Training Set
# Fitting Multiple Linear Regression to the Training setfrom sklearn.linear_model import LogisticRegressionregressor = LogisticRegression()model=regressor.fit(X_train,y_train)
# Predicting the test set resultsy_pred = model.predict(X_test)print(y_pred)
Checking the Accuracy
from sklearn.metrics import accuracy_score,confusion_matrixscore= accuracy_score(y_test,y_pred)score
Output 0.83333
Conclusion
As you can see, the train test split is an important technique in machine learning. It allows you to train a model on the training set and then test its accuracy on the testing set. This allows you to get a general sense of how well your model is performing, and also tells you whether or not your model is performing as expected. Now that you know how to do a train test split in Python, you can apply this technique to any machine learning problem you might encounter.
This is a collection of insightful articles from domain experts in the fields of Cloud Computing, DevOps, AWS, Data Science, Machine Learning, AI, and Natural Language Processing. The range of topics caters to upski... Read Full Bio