This article on “Normalization vs Standardization” should clear up the confusion between the two. Normalization and standardization are two common methods used to preprocess data in machine learning. First, let’s look at the term feature scaling: feature scaling is the general technique of rescaling the range of independent variables, or features, of the data. Now, let’s start.
Normalization & Standardization
Normalization rescales the values into the range [0, 1]. This is useful in cases where all features need to share the same positive scale.
X_changed = (X − X_min) / (X_max − X_min)
Standardization, on the other hand, rescales data to have a mean (μ) of 0 and a standard deviation (σ) of 1 (unit variance).
X_changed = (X − μ) / σ
So, in practice, “normalization” typically means that the range of values is rescaled to run from 0.0 to 1.0, whereas “standardization” means that each value is re-expressed as the number of standard deviations it lies from the mean.
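To make the formulas concrete, here is a minimal sketch that applies both transforms by hand to a small, made-up NumPy array:
import numpy as np
# A made-up one-dimensional feature, for illustration only
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
# Normalization: (X - X_min) / (X_max - X_min) maps the values into [0, 1]
x_normalized = (x - x.min()) / (x.max() - x.min())
print(x_normalized)  # [0.   0.25 0.5  0.75 1.  ]
# Standardization: (X - mu) / sigma gives mean 0 and standard deviation 1
x_standardized = (x - x.mean()) / x.std()
print(x_standardized.mean(), x_standardized.std())  # 0.0 and 1.0, up to floating-point precision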
Python Examples
The example below demonstrates data normalization of the Iris flowers dataset.
# Normalize the data attributes of the Iris dataset.
from sklearn.datasets import load_iris
from sklearn import preprocessing
# Load the Iris dataset
iris = load_iris()
print(iris.data.shape)  # (150, 4)
# Separate the data from the target attributes
X = iris.data
y = iris.target
# Rescale each attribute (column) to the range [0, 1]
normalized_X = preprocessing.MinMaxScaler().fit_transform(X)
Explanation:
This Python code uses the scikit-learn library to normalize the data attributes of the Iris dataset.
The Iris dataset is first loaded using the load_iris() function from sklearn.datasets. The iris variable holds the dataset.
Next, the data and target attributes are separated from each other. The data is stored in the X variable, and the target is stored in the y variable.
Finally, the data attributes are rescaled to the range [0, 1] using the MinMaxScaler class from sklearn.preprocessing; its fit_transform() method learns each column’s minimum and maximum and applies the rescaling in one step. The normalized data is stored in the normalized_X variable.
Normalizing data attributes involves rescaling the values of each attribute to fall within a small, specific range, here [0, 1]. This is useful in machine learning when we want to compare attributes with different units and ranges. Normalization ensures that each attribute contributes on a comparable scale to the analysis, which can lead to better performance and faster convergence during model training.
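As a quick sanity check, the short sketch below (continuing from the code above) confirms that every column of normalized_X now spans exactly 0 to 1:
# Continuing from the example above: verify the per-column ranges
import numpy as np
print(np.min(normalized_X, axis=0))  # [0. 0. 0. 0.]
print(np.max(normalized_X, axis=0))  # [1. 1. 1. 1.]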
The example below demonstrates data standardization of the Iris flowers dataset.
# Standardize the data attributes of the Iris dataset.
from sklearn.datasets import load_iris
from sklearn import preprocessing
# Load the Iris dataset
iris = load_iris()
print(iris.data.shape)  # (150, 4)
# Separate the data and target attributes
X = iris.data
y = iris.target
# Rescale each attribute (column) to zero mean and unit variance
standardized_X = preprocessing.scale(X)
Explanation:
This Python code uses the scikit-learn library to standardize the data attributes of the Iris dataset.
First, the code loads the Iris dataset using the load_iris() function from sklearn.datasets. The loaded dataset is stored in the iris variable, which is a dictionary-like object containing the data, target, and other attributes of the dataset.
Next, the code separates the data and target attributes. The data attributes are stored in the X variable, and the target attribute is stored in the y variable.
Finally, the data attributes are standardized using the preprocessing.scale() function from sklearn. The standardized data is stored in the standardized_X variable.
Standardizing data attributes involves scaling the values of each attribute to have zero mean and unit variance. This puts all attributes on a comparable scale, and because it does not force values into a fixed range, it is less distorted by outliers than min-max normalization. Standardization is useful in machine learning when we want to compare attributes with different units and ranges. Standardized data is easier to work with and can lead to better performance and faster convergence during model training.
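Similarly, a small sketch (continuing from the code above) can verify that each column of standardized_X now has a mean of roughly 0 and a standard deviation of 1:
# Continuing from the example above: verify the per-column mean and standard deviation
import numpy as np
print(np.mean(standardized_X, axis=0).round(6))  # [0. 0. 0. 0.] up to floating-point error
print(np.std(standardized_X, axis=0))            # [1. 1. 1. 1.]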
Conclusion
Normalization rescales the values of each attribute to fall within a small, specific range, usually between 0 and 1. This method is useful when the ranges of the attributes vary widely and we want to compare them on the same scale. Normalization also prevents attributes with larger values from dominating the analysis. The MinMaxScaler class from the sklearn.preprocessing module is used to normalize the data.
Standardization scales the values of each attribute to have zero mean and unit variance. This method is useful when we want to compare attributes that have different units of measurement or when the ranges of the attributes vary widely. Standardized data is easier to work with and can lead to better performance and faster convergence during model training. The preprocessing.scale() function from the sklearn library is used to standardize the data.
The choice of whether to use normalization or standardization depends on the nature of the data and the specific requirements of the machine learning problem. In some cases, a combination of both methods may be used to preprocess the data.
A good tip is to create rescaled copies of your dataset and race them against each other using your test harness and a handful of algorithms you want to spot-check, as in the sketch below.
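Here is a minimal sketch of that idea, assuming a simple cross-validation harness; the choice of k-nearest neighbors and logistic regression as spot-check algorithms is arbitrary:
# Race raw, normalized, and standardized copies of the Iris data against each other
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler, scale
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
iris = load_iris()
X, y = iris.data, iris.target
copies = {
    "raw": X,
    "normalized": MinMaxScaler().fit_transform(X),
    "standardized": scale(X),
}
models = {
    "knn": KNeighborsClassifier(),
    "logreg": LogisticRegression(max_iter=1000),
}
for data_name, data in copies.items():
    for model_name, model in models.items():
        score = cross_val_score(model, data, y, cv=5).mean()
        print(f"{data_name:>12} + {model_name}: mean accuracy {score:.3f}")
In a real project, the scaler would go inside a scikit-learn Pipeline so that it is fit only on each training fold, avoiding data leakage; it is applied up front here only to keep the sketch short.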
By now, you should have a clear picture of “Normalization vs Standardization” and when to reach for each.