Robust Statistics: Techniques to Deal with Outliers and Non Normal Data
Robust statistics refers to statistical methods and techniques developed to provide reliable and accurate results even in the presence of outliers or violations of assumptions. These methods aim to mitigate the impact of extreme or influential observations that can skew traditional statistical analyses.
In this article on Robust Statistics, we will discuss various approaches to handling outliers and non-normal data.
Introduction
Traditional statistics usually assume that the data follows a normal distribution. This assumption may not hold in real situations, where the data can contain outliers or depart from normality. This is where robust statistics are helpful: they are a group of techniques that can analyse data reliably whether or not it contains outliers or follows a normal distribution.
A dataset’s outliers are data points that stand out substantially from the rest of the data. Outliers may be caused by data-entry mistakes, measurement problems, or simply the data’s own intrinsic variability. Outliers can have a substantial impact on the results of statistical analysis, including the mean and standard deviation.
Non-normal data is data that does not follow a normal distribution. Many statistical techniques assume that the data is normally distributed and are not suitable when this assumption fails; robust statistics offers alternatives that do not rely on it.
Robust Statistics: Techniques for Dealing with Outliers
1. Winsorization
Winsorization is a method for dealing with outliers that involves replacing them with the nearest values that are not outliers. In other words, an outlying data point is replaced by the nearest non-outlying value. The method is named after the biostatistician Charles P. Winsor.
To further understand how Winsorization operates, let’s look at an example. Let’s say we have the information below:
1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 99
Here, 99 is an outlier. Applying Winsorization, we replace it with the nearest non-outlier value, 19. The Winsorized data are therefore:
1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 19
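This replacement can also be done programmatically. Below is a minimal sketch using SciPy's winsorize, clipping only the upper tail; with 11 values, a 10% upper limit happens to cover exactly the one outlier:
Python code:
import numpy as np
from scipy.stats import mstats

data = np.array([1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 99])

# Clip the top 10% of values down to the largest remaining value (99 becomes 19)
winsorized = mstats.winsorize(data, limits=(0, 0.1))
print('Winsorized data:', winsorized)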
2. Trimmed Mean
The trimmed mean is a method for handling outliers that involves removing a specified percentage of the dataset’s largest and smallest values before calculating the mean of the remaining values. The trimming percentage refers to the share of values removed from each end.
To better understand how the trimmed mean functions, let’s look at an example. Let’s say we have the information below:
1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 99
Here, 99 is an outlier. To compute a 10% trimmed mean, we remove the top and bottom 10% of the values; with 11 observations, that means dropping one value from each end (1 and 99), leaving:
3, 5, 7, 9, 11, 13, 15, 17, 19
The mean of these values is:
(3 + 5 + 7 + 9 + 11 + 13 + 15 + 17 + 19) / 9 = 11
So, the trimmed mean of the data is 11.
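As a quick check, the same 10% trimmed mean can be computed with SciPy's trim_mean, which trims the given proportion from each end before averaging:
Python code:
import numpy as np
from scipy import stats

data = np.array([1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 99])

# Trim 10% of the observations from each end, then average the rest
print('10% trimmed mean:', stats.trim_mean(data, proportiontocut=0.1))  # 11.0
print('Raw mean:', data.mean())  # about 18.1, inflated by the outlier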
Robust Statistics: Techniques for Dealing with Non-Normal Data
1. Median
When the data are sorted in ascending or descending order, the median is the middle value of the dataset. Compared with the mean, the median is a more reliable indicator of central tendency because it is far less affected by outliers.
To better grasp the median’s operation, let’s look at an example. Let’s say we have the information below:
1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 99
Here, 99 is an outlier. The median of this data can be calculated as follows:
Arrange the data in ascending order:
1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 99
Find the middle value of the dataset. With 11 observations, the median is the 6th value:
median = 11
So, the median of the data is 11; notice that the outlier 99 barely moves it, whereas it pulls the mean up to about 18.1.
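The same comparison is easy to verify with NumPy:
Python code:
import numpy as np

data = np.array([1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 99])

print('Median:', np.median(data))  # 11.0, barely affected by the outlier
print('Mean:', np.mean(data))      # about 18.1, pulled up by the outlier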
2. Robust Regression
Conventional linear regression assumes that the errors are normally distributed. When the data are not normally distributed, the results of linear regression may not be reliable. Robust regression is a method for handling non-normal data in regression analysis.
In contrast to classical linear regression, many robust regression methods minimise the sum of absolute deviations (the L1 norm), or a mixture of the L1 and L2 criteria, rather than the sum of squared deviations (the L2 norm). Because of this, robust regression is less susceptible to outliers than conventional linear regression.
To further understand how robust regression functions, let’s look at an example. Let’s say we have the information below:
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1, 3, 5, 7, 9, 11, 13, 15, 17, 99]
Here, 99 is an outlier. We can apply robust regression to this data using the Huber loss function, which combines the L1 and L2 norms. Writing the residual as r = y - ax - b, the loss is:
L(a, b) = r^2, if |r| <= k
L(a, b) = k(2|r| - k), if |r| > k
where the tuning parameter k regulates the trade-off between the L1-like and L2-like behaviour.
In contrast to conventional linear regression, robust regression utilising the Huber loss function yields a line of best fit that is less impacted by outliers.
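To make the piecewise definition concrete, here is a small sketch of the loss as a Python function (the name huber_loss and the default k are illustrative choices, not taken from any particular library):
Python code:
import numpy as np

def huber_loss(residual, k=1.35):
    # Quadratic penalty for small residuals, linear penalty for large ones
    r = np.abs(residual)
    return np.where(r <= k, r ** 2, k * (2 * r - k))

# Large residuals are penalised far less harshly than under squared error
print(huber_loss(np.array([0.5, 1.0, 5.0, 50.0])))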
Example Code Snippet:
Here is a sample Python programme that shows how to carry out robust regression by utilising the Huber loss function:
Python code:
from sklearn.linear_model import HuberRegressor
import statsmodels.api as sm
import numpy as np

# Define the data
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([1, 3, 5, 7, 9, 11, 13, 15, 17, 99])

# Create a HuberRegressor object with default parameters
huber = HuberRegressor()

# Fit the model to the data
huber.fit(x.reshape(-1, 1), y)

# Predict the y values using the fitted model
y_pred = huber.predict(x.reshape(-1, 1))

# Print the coefficients of the regression line
print('Coefficients: ', huber.coef_)

# Calculate the R-squared value
r2 = sm.OLS(y, sm.add_constant(y_pred)).fit().rsquared
print('R-squared: ', r2)
Output:
You can observe that compared to traditional linear regression, robust regression produced considerably different coefficients for the line of best fit.
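For comparison (this is not part of the snippet above; it reuses the x, y and huber variables defined there), an ordinary least-squares fit shows how strongly the single outlier pulls the fitted slope, while the Huber fit stays much closer to the underlying slope of 2:
Python code:
from sklearn.linear_model import LinearRegression

# Ordinary least-squares fit on the same data, for comparison with Huber
ols = LinearRegression()
ols.fit(x.reshape(-1, 1), y)
print('OLS coefficient: ', ols.coef_)      # inflated by the outlier at y = 99
print('Huber coefficient: ', huber.coef_)  # close to the slope of the clean points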
3. Winsorizing
Winsorizing is a strategy that replaces extreme values with less extreme values to lessen the impact of outliers. With this method, the data are first sorted in ascending order, and the top and bottom percentages of the data are replaced by the highest and lowest non-outlying values, respectively.
For instance, if we choose to winsorize the top 5% of the data, the 95th-percentile value is used in place of those observations.
Example Code Snippet:
Here is a sample Python programme that shows how to winsorize the given data:
Python code:
import numpy as np
data = np.array([11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 400])
p5 = np.percentile(data, 5)
p95 = np.percentile(data, 95)

print('5th percentile:', p5)
print('95th percentile:', p95)

data_winsorized = np.copy(data)

data_winsorized[data < p5] = p5
data_winsorized[data > p95] = p95

print('Original data:', data)
print('Winsorized data:', data_winsorized)
Output:
As we can see, the value 400 has been replaced by 210, the 95th percentile computed by np.percentile. Note that with only 11 data points the interpolated percentile falls between 20 and 400, so the replacement is not simply the largest non-outlying value (20), but it is still far less extreme than 400.
4. Trimmed Mean
The trimmed mean is a statistic used to estimate the centre of a distribution by deleting a specified percentage of the data from the top and bottom tails of the distribution. This strategy is helpful when the data are non-normal or contain outliers.
For instance, if the top and bottom 10% of the data are removed, the trimmed mean is the mean of the remaining 80% of the data.
Example Code Snippet:
Here is a sample Python programme that shows how to compute the trimmed mean by utilising the numpy library:
Python code:
import numpy as np
data = [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 200]
# Define the percentage to be trimmed from the top and bottom
pct_to_trim = 0.1

# Calculate the number of values to be trimmed from the top and bottom
n_to_trim = int(len(data) * pct_to_trim)

# Sort the data
sorted_data = np.sort(data)

# Trim the top and bottom values
trimmed_data = sorted_data[n_to_trim:-n_to_trim]

# Calculate the mean of the trimmed data
trimmed_mean = np.mean(trimmed_data)
print("Trimmed mean:", trimmed_mean)
Output:
Therefore, after removing the top and bottom 10% of the values, the trimmed mean of the data set [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 200] is 16.0.
5. Robust Regression
Robust regression is a method for fitting a regression model to data that has outliers or isn’t normally distributed. Reducing the impact of outliers on the regression coefficients is the aim of robust regression.
Huber regression is a method that is frequently used for robust regression. In Huber regression, the squared error loss for small residuals and the absolute error loss for big residuals are combined to generate the loss function that is used to estimate the regression coefficients.
Example Code Snippet:
Here is a sample Python programme that uses the scikit-learn module to show how to carry out Huber regression:
Python code:
from sklearn.linear_model import HuberRegressor
import numpy as np

# Define the data
x = np.random.normal(size=(250, 2))
y = 2 * x[:, 0] + np.random.normal(scale=0.5, size=250)
y[0] = 75  # Add an outlier

# Fit the Huber regression model
model = HuberRegressor(epsilon=1.35)
model.fit(x, y)

# Print the estimated coefficients
print('Estimated coefficients:', model.coef_)
Output:
(Note: Every time you execute the code, a different set of input data is generated, therefore the results will be unique.)
The estimated coefficients of the linear model computed by the HuberRegressor are indicated in the output as “Estimated coefficients: [2.05412887 0.01363046]”.
The model estimates the relationship between two input features and a target variable. The coefficient for each input feature indicates the magnitude and direction of its linear relationship with the target.
In more detail, the first coefficient (2.05412887) is the estimated contribution of the first input feature to the prediction of the target variable, and the second coefficient (0.01363046) is the estimated contribution of the second input feature.
This means the model finds a much stronger association between the first input feature and the target than for the second, which matches how the data were generated (y depends on the first feature with a true coefficient of 2, and not at all on the second). Do note that these coefficients are derived from the training data provided and might not generalise to fresh, unseen data.
6. Robust Principal Component Analysis
Principal Component Analysis (PCA) is a technique for reducing the dimensionality of data by finding its most significant patterns. PCA, however, can be sensitive to outliers and non-normal data.
Robust PCA is a method for running PCA on data that contains outliers or is not normally distributed. Principal Component Pursuit (PCP) is frequently used to achieve robust PCA. In PCP, the data matrix is decomposed into a low-rank component that captures the bulk of the structure and a sparse component that captures the outliers, as sketched below.
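Scikit-learn does not ship a PCP implementation, so the following is only a minimal NumPy sketch of the standard augmented-Lagrangian iteration for PCP; the function name and parameter defaults (including the common choice lambda = 1/sqrt(max(m, n))) are illustrative:
Python code:
import numpy as np

def principal_component_pursuit(M, max_iter=500, tol=1e-7):
    # Split M into a low-rank part L and a sparse (outlier) part S
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))           # weight on the sparse component
    mu = (m * n) / (4.0 * np.abs(M).sum())   # augmented Lagrangian step size
    S = np.zeros((m, n))
    Y = np.zeros((m, n))
    norm_M = np.linalg.norm(M, 'fro')
    for _ in range(max_iter):
        # Singular-value thresholding gives the low-rank update
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0)) @ Vt
        # Soft-thresholding (shrinkage) gives the sparse update
        R = M - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0)
        # Dual update and convergence check on the residual
        Y = Y + mu * (M - L - S)
        if np.linalg.norm(M - L - S, 'fro') <= tol * norm_M:
            break
    return L, S
Applied to a data matrix with a few grossly corrupted entries, L recovers the underlying low-rank structure while S isolates the corruptions.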
Example Code Snippet:
Here is a sample Python programme that uses the scikit-learn module; since scikit-learn does not include PCP, it fits standard PCA to data containing an injected outlier to illustrate the setup and how the outlier influences the components:
Python code:
from sklearn.decomposition import PCA
from sklearn.datasets import make_blobs
import numpy as np

# Define the data
X, _ = make_blobs(n_samples=200, centers=3, random_state=42)
X[0] = np.array([200, 200])

# Fit the PCA model
pca = PCA(n_components=2, svd_solver='full')
pca.fit(X)

# Print the principal components
print('Principal components:', pca.components_)
Output:
(Note: Because random_state=42 fixes the synthetic data, the output is reproducible, although the exact values may vary slightly across library versions.)
Accordingly, the first principal component is a linear combination of the two original features: 0.695 times the first feature plus 0.719 times the second. The second principal component is likewise a linear combination of the two: -0.719 times the first feature plus 0.695 times the second.
7. Robust Clustering
The process of clustering is used to group similar data points together. However, clustering can be vulnerable to outliers and non-normal data.
A method called robust clustering is used to cluster data that has outliers or is not normal. One popular approach starts from K-Means, which minimises the sum of squared distances between every data point and its nearest cluster centre, and then reduces the influence of outliers, for example by trimming the points farthest from their centres or by replacing means with medoids.
Example Code Snippet:
Here is a sample Python programme that uses the scikit-learn module to illustrate robust clustering:
Python code:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import numpy as np

# Define the data
X, _ = make_blobs(n_samples=200, centers=3, random_state=42)
X[0] = np.array([200, 200])

# Fit the K-Means model
robust_kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, max_iter=300, tol=1e-4, algorithm='elkan')
robust_kmeans.fit(X)

# Print the cluster labels and number of clusters
print('Cluster labels:', robust_kmeans.labels_)
print('Number of clusters:', len(np.unique(robust_kmeans.labels_)))
In this example, a synthetic dataset with 200 samples and 3 clusters is created using the make_blobs method from sklearn.datasets. By setting the first data point to [200, 200], we also include an outlier.
Next, we construct a KMeans instance with the n_clusters parameter set to 3, the desired number of clusters. To get faster convergence, we additionally set the algorithm to "elkan" and the init parameter to "k-means++".
Finally, we use the fit method to fit the model to our data and display the cluster labels and the cluster count. Each data point's cluster label identifies the specific cluster to which it belongs, and the number of clusters indicates how many clusters were actually formed.
Output:
Three clusters were discovered, meaning the data has been partitioned into three groups based on shared characteristics. However, without more detail about the data and where the outlier ended up, it is difficult to judge the meaning of the clusters or the quality of the grouping.
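Standard K-Means, as used above, still lets the outlier pull one of the centres towards it. A simple way to robustify the result (a sketch of the trimming idea mentioned earlier, not a built-in scikit-learn feature) is to drop the points farthest from their assigned centre and re-fit:
Python code:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same synthetic data as above, with one injected outlier
X, _ = make_blobs(n_samples=200, centers=3, random_state=42)
X[0] = np.array([200, 200])

# Initial fit, then measure each point's distance to its assigned centre
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
dists = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# Treat the most distant 1% of points as outliers and re-fit without them
mask = dists <= np.percentile(dists, 99)
trimmed = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X[mask])
print('Centres after trimming:', trimmed.cluster_centers_)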
Conclusion
Robust statistics is crucial when working with data that may contain outliers or not follow a normal distribution. Outliers can greatly affect the results of statistical analysis, including the mean and standard deviation. There are numerous techniques for dealing with outliers and non-normal data, such as Winsorization, the trimmed mean, the median, and robust regression. The best strategy depends on the particular characteristics of the data and the research question at hand, and each robust method has its own benefits and drawbacks.
Contributed by: Vishwa Kiran