Data cleaning and preparation are essential steps in data analysis, and identifying and handling outliers is an important part of that process. Outliers are extreme values that lie outside the normal range of the data and can significantly affect the results of analysis and modeling. In this blog, we will discuss some Python snippets for outlier treatment; these are essential snippets for data cleaning and preparation.
Python Snippets for Outliers Treatment: Z-score method
In this section of the blog “Python snippets for outlier treatment”, we will discuss the Z-score method. Z-score is a statistical measure that represents the number of standard deviations an observation is away from the mean. In the z-score method, we calculate the z-score for each data point and identify outliers that fall outside a certain threshold.
import numpy as np

# Generate some random data
data = np.random.normal(0, 1, 1000)

# Calculate the z-score for each data point
z_scores = (data - np.mean(data)) / np.std(data)

# Identify outliers using the z-score
outliers = np.where(np.abs(z_scores) > 3)[0]

# Remove outliers
data_without_outliers = np.delete(data, outliers)

print("Shape of Original data:", data.shape)
print("Shape of Data without Outliers", data_without_outliers.shape)
Output:
Shape of Original data: (1000,)
Shape of Data without Outliers (997,)
In the above snippet, we generated some random data using NumPy’s random.normal() method. Then we calculated the z-score for each data point and identified outliers that have a z-score greater than 3 or less than -3. Finally, we removed outliers using NumPy’s delete() method.
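If you prefer not to compute the scores by hand, SciPy provides scipy.stats.zscore(), which performs the same calculation. The sketch below is a minimal variation of the snippet above, assuming the same kind of randomly generated data and using a boolean mask instead of np.delete(); it is an illustration, not part of the original snippet.

import numpy as np
from scipy import stats

data = np.random.normal(0, 1, 1000)

# scipy.stats.zscore computes (x - mean) / std for each point
z_scores = stats.zscore(data)

# Keep only the points whose absolute z-score is at most 3
data_without_outliers = data[np.abs(z_scores) <= 3]
print("Shape of Data without Outliers", data_without_outliers.shape)

Either approach gives the same result; the mask-based version simply avoids building an index array first.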
Python Snippets for Outliers Treatment: Winsorizing method
In this section of the blog “Python snippets for outlier treatment”, we will discuss the Winsorizing method. Winsorizing is a statistical method used to replace extreme values with less extreme values. In the winsorizing method, we replace the extreme values with the nearest non-extreme value within a certain percentile range.
import numpy as np
from scipy.stats import mstats

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])

# Winsorize data to the 5% and 95% quantiles
data_winsorized = mstats.winsorize(data, limits=[0.05, 0.05])

# Compare original data with winsorized data
print("Original data:", data)
print("Data after winsorizing:", data_winsorized)
Output:
Original data: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20]
Data after winsorizing: [ 2 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 19]
In the above snippet, we took a small sample array just to illustrate. Then we applied winsorizing using the mstats.winsorize() method from the SciPy library, with limits=[0.05, 0.05], which replaces the lowest 5% and highest 5% of values with the nearest remaining values; that is why 1 becomes 2 and 20 becomes 19 in the output. Finally, we compared the original data with the winsorized data using the print() function.
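Winsorizing can also be approximated with pandas by clipping values at chosen quantiles. The sketch below is a minimal, optional variation that assumes the same sample values held in a pandas Series; note that quantile-based clipping produces interpolated cut-offs (here roughly 1.95 and 19.05) rather than the actual nearest data values that mstats.winsorize() uses.

import numpy as np
import pandas as pd

s = pd.Series(np.arange(1, 21))

# Cap values below the 5th percentile and above the 95th percentile
lower, upper = s.quantile(0.05), s.quantile(0.95)
s_capped = s.clip(lower=lower, upper=upper)

print(s_capped.values)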
Python Snippets for Outliers Treatment: IQR method
In this section of the blog “Python snippets for outlier treatment”, we will discuss the IQR method. IQR, or interquartile range, is a measure of statistical dispersion that represents the range between the 25th and 75th percentiles of the data. In the IQR method, we flag as outliers the values that lie more than 1.5 times the IQR below the first quartile or above the third quartile.
import numpy as np

# Generate some random data
data = np.random.normal(0, 1, 1000)

# Calculate the interquartile range
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1

# Identify outliers using the IQR rule
outliers = np.where((data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr))[0]

# Remove outliers
data_without_outliers = np.delete(data, outliers)

print("Shape of Original data:", data.shape)
print("Shape of Data without Outliers", data_without_outliers.shape)
Output:
Shape of Original data: (1000,)
Shape of Data without Outliers (988,)
In the above snippet, we generated some random data using NumPy’s random.normal() method. Then we calculated the interquartile range (IQR) of the data using NumPy’s percentile() method. We identified outliers using the IQR method, where we defined outliers as values that lie below q1 - 1.5 * IQR or above q3 + 1.5 * IQR. Finally, we removed outliers using NumPy’s delete() method.
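Instead of deleting the points outside the fences, you can also cap them at the fence values. The short sketch below is an optional variation, assuming q1, q3, and iqr from the snippet above are still in scope, and uses np.clip() to winsorize at the IQR fences.

# Cap (rather than remove) values outside the IQR fences
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
data_capped = np.clip(data, lower_fence, upper_fence)

print("Min/Max after capping:", data_capped.min(), data_capped.max())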
Python Snippets for Outliers Treatment: Robust Z-score method
In this section of the blog “Python snippets for outlier treatment”, we will discuss the robust Z-score method. The robust Z-score is a modified version of the Z-score method that is less sensitive to outliers. In the robust Z-score method, we replace the mean and standard deviation with their robust equivalents: the median and the median absolute deviation (MAD).
import numpy as np
from statsmodels.robust.scale import mad

# Generate some random data
data = np.random.normal(0, 1, 1000)

# Calculate the robust Z-score for each data point
median = np.median(data)
raw_mad = np.median(np.abs(data - median))  # raw median absolute deviation
robust_z_scores = 0.6745 * (data - median) / raw_mad

# Note: statsmodels' mad() already divides by 0.6745, so
# (data - median) / mad(data) gives the same robust Z-scores.

# Identify outliers using the robust Z-score
outliers = np.where(np.abs(robust_z_scores) > 3)[0]

# Remove outliers
data_without_outliers = np.delete(data, outliers)
In the above snippet, we generated some random data using NumPy’s random.normal() method. Then we calculated the robust Z-score for each data point, replacing the mean and standard deviation with their robust equivalents, the median and the median absolute deviation (MAD). The robust Z-score measures how many MAD units a data point lies away from the median; multiplying by 0.6745 scales this measure so that it is comparable to the number of standard deviations away from the mean under a normal distribution. We identified outliers as values with a robust Z-score greater than 3 or less than -3, and finally removed them using NumPy’s delete() method.
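If SciPy is available, scipy.stats.median_abs_deviation() can compute the MAD directly. The sketch below is a minimal alternative that assumes SciPy 1.5 or later; passing scale='normal' rescales the MAD for normal data, so no separate 0.6745 factor is needed.

import numpy as np
from scipy.stats import median_abs_deviation

data = np.random.normal(0, 1, 1000)
median = np.median(data)

# scale='normal' makes the MAD comparable to a standard deviation
mad_normal = median_abs_deviation(data, scale='normal')
robust_z_scores = (data - median) / mad_normal

data_without_outliers = data[np.abs(robust_z_scores) <= 3]
print("Shape of Data without Outliers", data_without_outliers.shape)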
Python Snippets for Outliers Treatment: Tukey’s method
In this section of the blog “Python snippets for outlier treatment”, we will discuss Tukey’s method. Tukey’s method, also known as the boxplot method, is a graphical way of identifying outliers. In Tukey’s method, we create a boxplot of the data and identify outliers that lie beyond the ends of the whiskers.
import numpy as np
import matplotlib.pyplot as plt

# Generate some random data
data = np.random.normal(0, 1, 1000)

# Create a boxplot of the data
fig, ax = plt.subplots()
box = ax.boxplot(data)

Output: a boxplot of the data, with outliers drawn as individual points beyond the whiskers.

# Identify outliers using the boxplot whiskers
lower_whisker = box["whiskers"][0].get_ydata()[1]
upper_whisker = box["whiskers"][1].get_ydata()[1]
outliers = np.where((data < lower_whisker) | (data > upper_whisker))[0]

# Remove outliers
data_without_outliers = np.delete(data, outliers)

# Create a boxplot of the data without outliers
fig, ax = plt.subplots()
ax.boxplot(data_without_outliers)
In the above snippet, we generated some random data using NumPy’s random.normal() method. Then we created a boxplot of the data using Matplotlib’s boxplot() method, which also returns the plotted elements, including the whisker lines. We identified outliers as values that lie beyond the ends of the whiskers; by default, Matplotlib draws the whiskers at the most extreme data points within 1.5 times the IQR of the quartiles, so this is essentially the IQR rule read off the plot. Finally, we removed the outliers using NumPy’s delete() method.
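If you only need the whisker limits and not the plot itself, Matplotlib’s cbook.boxplot_stats() computes the same statistics without drawing anything. The sketch below is an optional variation, assuming the default whisker setting of 1.5 times the IQR.

import numpy as np
from matplotlib import cbook

data = np.random.normal(0, 1, 1000)

# Compute boxplot statistics (quartiles, whisker limits, fliers) without plotting
stats = cbook.boxplot_stats(data)[0]
outliers = np.where((data < stats["whislo"]) | (data > stats["whishi"]))[0]
data_without_outliers = np.delete(data, outliers)

print("Shape of Data without Outliers", data_without_outliers.shape)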
Conclusion
In this blog “Python Snippets for outlier treatment”, we discussed some Python snippets for data cleaning and preparation specifically related to outliers. These snippets include the z-score method, winsorizing method, IQR method, robust Z-score method, and Tukey’s method. These methods can be used to identify and handle outliers in different scenarios, depending on the nature of the data and the analysis.
I highly recommend checking out this incredibly informative and engaging professional certificate training by Google on Coursera:
Google Advanced Data Analytics Professional Certificate
There are 7 Courses in this Professional Certificate that can also be taken separately.
- Foundations of Data Science: Approx. 21 hours to complete. SKILLS YOU WILL GAIN: Sharing Insights With Stakeholders, Effective Written Communication, Asking Effective Questions, Cross-Functional Team Dynamics, and Project Management.
- Get Started with Python: Approx. 25 hours to complete. SKILLS YOU WILL GAIN: Using Comments to Enhance Code Readability, Python Programming, Jupyter Notebook, Data Visualization (DataViz), and Coding.
- Go Beyond the Numbers: Translate Data into Insights: Approx. 28 hours to complete. SKILLS YOU WILL GAIN: Python Programming, Tableau Software, Data Visualization (DataViz), Effective Communication, and Exploratory Data Analysis.
- The Power of Statistics: Approx. 33 hours to complete. SKILLS YOU WILL GAIN: Statistical Analysis, Python Programming, Effective Communication, Statistical Hypothesis Testing, and Probability Distribution.
- Regression Analysis: Simplify Complex Data Relationships: Approx. 28 hours to complete. SKILLS YOU WILL GAIN: Predictive Modelling, Statistical Analysis, Python Programming, Effective Communication, and Regression Modeling.
- The Nuts and Bolts of Machine Learning: Approx. 33 hours to complete. SKILLS YOU WILL GAIN: Predictive Modelling, Machine Learning, Python Programming, Stack Overflow, and Effective Communication.
- Google Advanced Data Analytics Capstone: Approx. 9 hours to complete. SKILLS YOU WILL GAIN: Executive Summaries, Machine Learning, Python Programming, Technical Interview Preparation, and Data Analysis.
It could be the perfect way to take your skills to the next level! When it comes to investing, there’s no better investment than investing in yourself and your education. Don’t hesitate – go ahead and take the leap. The benefits of learning and self-improvement are immeasurable.