Standardization is a common preprocessing step in machine learning (ML) that transforms the input data onto a common scale. The goal of standardization is to improve the performance of the ML model by reducing the influence of outliers, stabilizing the variance of the input data, and improving the convergence of the model during training.
Here are a few examples of standardization in ML:
Scaling: Scaling is a technique that transforms each feature by subtracting its minimum value and dividing by its range (maximum value minus minimum value). This transformation maps the data to the range [0, 1]. Scaling is often used to standardize features that have a large range of values or that are measured on different scales.

For example, consider a dataset with two features: age (measured in years) and income (measured in dollars). The age feature ranges from 18 to 80, while the income feature ranges from 20,000 to 200,000. To standardize these features using scaling, you would transform the age feature as follows:

age_standardized = (age - age_min) / (age_max - age_min)

where age_min and age_max are the minimum and maximum values of the age feature, respectively. The resulting age_standardized feature has a range of [0, 1].
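The scaling formula above can be sketched with NumPy. The sample values below are made up for illustration; only the endpoints 18 and 80 come from the text:

```python
import numpy as np

# Hypothetical age values (years); 18 and 80 are the min and max from the text.
age = np.array([18.0, 35.0, 52.0, 80.0])

# Min-max scaling: subtract the minimum, divide by the range.
age_standardized = (age - age.min()) / (age.max() - age.min())

print(age_standardized)  # every value now lies in [0, 1]
```

The same two lines apply unchanged to the income feature, since the formula only depends on each feature's own minimum and maximum.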
Normalization: Normalization is a technique that transforms the input data by subtracting the mean of each feature from each data point and dividing the result by the feature's standard deviation. This produces a standardized feature with a mean of 0 and a standard deviation of 1.

For example, consider the same dataset with the age and income features. To standardize the age feature using normalization, you would transform it as follows:

age_standardized = (age - age_mean) / age_std

where age_mean and age_std are the mean and standard deviation of the age feature, respectively. The resulting age_standardized feature has a mean of 0 and a standard deviation of 1.
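As a minimal NumPy sketch of the normalization formula, applied here to a hypothetical income column (the dollar amounts are invented for illustration):

```python
import numpy as np

# Hypothetical income values (dollars).
income = np.array([20_000.0, 45_000.0, 90_000.0, 200_000.0])

# Z-score normalization: subtract the mean, divide by the standard deviation.
income_standardized = (income - income.mean()) / income.std()

print(income_standardized.mean())  # ~0
print(income_standardized.std())   # ~1
```

Note that the mean and standard deviation should be computed on the training set only and then reused to transform validation and test data, so that no information leaks from held-out data into the preprocessing step.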
Whitening: Whitening is a technique that transforms the input data by decorrelating the features and scaling them to have unit variance. This transformation is often used to improve the convergence of neural networks and other types of ML models.
For example, consider a dataset with three features: x1, x2, and x3. To standardize these features using whitening, you would first compute the covariance matrix of the features and then apply a linear transformation to the data using the eigenvectors and eigenvalues of the covariance matrix. The resulting transformed features would be uncorrelated and have unit variance.
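The eigendecomposition-based (PCA) whitening described above can be sketched as follows; the correlated three-feature dataset is generated synthetically for illustration:

```python
import numpy as np

# Synthetic dataset with three correlated features (x1, x2, x3).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
X[:, 1] += 0.8 * X[:, 0]  # make x2 correlated with x1
X[:, 2] += 0.5 * X[:, 0]  # make x3 correlated with x1

# 1. Center the data and compute the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# 2. Eigendecomposition of the (symmetric) covariance matrix.
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Rotate onto the eigenvectors, then scale each direction
#    by 1/sqrt(eigenvalue) so every direction has unit variance.
X_white = (Xc @ eigvecs) / np.sqrt(eigvals)

# The covariance of the whitened data is (approximately) the identity:
print(np.round(np.cov(X_white, rowvar=False), 6))
```

Dividing by the square roots of the eigenvalues is what gives each decorrelated direction unit variance; in practice a small constant is often added under the square root to avoid blowing up directions with near-zero variance.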
Standardization is a useful technique for many types of ML models, including linear models, support vector machines, and neural networks. By putting all features on a comparable scale, it reduces the influence of outliers, stabilizes the variance of the input data, and helps the model converge faster during training.