“Curse of Dimensionality: An Intuitive and Practical Explanation with Examples”: this article aims to consolidate your understanding of the concept.
“As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially.”
Charles Isbell, Professor and Senior Associate Dean, School of Interactive Computing, Georgia Tech
Curse of dimensionality
The curse of dimensionality refers to a family of problems that appear when working with high-dimensional data. Their common theme is that when the dimensionality increases, the volume of the space increases so fast that the available data becomes sparse.
This sparsity is problematic for any method that requires statistical significance.
In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with dimensionality.
Also, organizing and searching data often relies on detecting areas where objects form groups with similar properties. In high-dimensional data, however, all objects appear to be sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient.
When distance metrics such as Euclidean distance are used on a dataset with too many dimensions, all observations become approximately equidistant from each other. For example, K-means clustering uses a distance measure such as Euclidean distance to quantify the similarity between observations. If the distances are all approximately equal, then all the observations appear equally alike (or equally different). That means no meaningful clusters can be formed.
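To make the "approximately equidistant" claim concrete, here is a minimal sketch that measures how much the nearest and farthest distances from one point differ as the number of dimensions grows. The function name relative_contrast, the uniform random data, and the sample sizes are assumptions chosen for illustration; they are not part of any particular library or of the original discussion.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """Return (d_max - d_min) / d_min for Euclidean distances from one
    random query point to n_points random points in the unit hypercube.

    Illustrative helper (hypothetical name); data is uniform random.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    distances = [
        math.dist(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    d_min, d_max = min(distances), max(distances)
    return (d_max - d_min) / d_min


# The contrast shrinks as dimensions are added: the farthest point is
# barely farther away than the nearest one.
for dims in (2, 10, 100, 1000):
    print(dims, round(relative_contrast(500, dims), 3))
```

With this setup the printed contrast should drop sharply from 2 dimensions to 1000 dimensions, which is the effect that hurts distance-based methods such as K-means.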
In a high-dimensional feature space with each feature having a range of possible values, typically an enormous amount of training data is required. This is to ensure that there are several samples with each combination of values. A typical rule of thumb is that there should be at least 5 training examples for each dimension in the representation.
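A rough sketch of why "several samples with each combination of values" becomes expensive: if every feature can take the same number of distinct values, the number of combinations is that number raised to the power of the feature count. The function name samples_needed and the choice of 3 samples per combination are assumptions made only for this illustration.

```python
def samples_needed(values_per_feature, n_features, samples_per_combination=3):
    """Training samples required to see every combination of feature
    values samples_per_combination times (assumed value: 3)."""
    combinations = values_per_feature ** n_features
    return combinations * samples_per_combination


# Each feature takes 10 distinct values (an assumed, illustrative figure).
for n_features in (1, 2, 3, 6, 10):
    print(n_features, samples_needed(10, n_features))
```

Going from 3 features to 10 features multiplies the requirement by ten million, even though only 7 features were added.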
The volume (size) of the space increases at an incredible rate relative to the number of dimensions. Even 10 dimensions (which doesn’t seem very ‘high-dimensional’) can bring on the curse.
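One way to see how quickly volume grows, sketched under the assumption that data is spread uniformly over a unit hypercube: to capture a fixed fraction of the data, a small cube around a point needs an edge length of fraction^(1/d). The function name edge_length_for_fraction is hypothetical.

```python
def edge_length_for_fraction(fraction, n_dims):
    """Edge length of a sub-cube that covers the given fraction of a
    unit hypercube's volume in n_dims dimensions."""
    return fraction ** (1.0 / n_dims)


# How wide must a "local" neighborhood be to hold 10% of uniform data?
for dims in (1, 2, 10, 100):
    print(dims, round(edge_length_for_fraction(0.10, dims), 3))
```

In 1 dimension the neighborhood spans 10% of the range, but in 10 dimensions it has to span roughly 79% of the range of every feature, so it is no longer local in any useful sense.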
In short, as the number of dimensions grows, the Euclidean distance between a point and its closest neighbor and the distance between that point and its furthest neighbor tend to become nearly equal, so the idea of a ‘nearest’ neighbor loses much of its meaning.
Explanation of the Curse of dimensionality through examples: 1
Example 1:
Imagine a kid who likes to eat cookies. Let us assume that you have a whole truck of cookies with different colors, shapes, tastes, and prices.
If the kid has to choose but only takes one characteristic into account, e.g. the taste, then there are four possibilities: sweet, salty, sour, and bitter. So, the kid only has to try four cookies to find the favorite.
If the kid likes combinations of taste and color, and there are 4 (I am rather optimistic here) different colors, then the kid already has to choose among 4×4 = 16 different types.
If the kid also wants to take into account the shape of the cookies, and there are 5 different shapes, then the kid will have to try 4×4×5 = 80 cookies.
We could go on, but after eating all these cookies the kid might already have a bellyache before being able to make the best choice. Apart from the bellyache, it can get really difficult to remember the differences in the taste of each cookie.
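The cookie arithmetic above can be checked with a short sketch that enumerates every combination. The four tastes come from the example; the specific color and shape names are made up purely for illustration, as is the function name count_cookie_types.

```python
from itertools import product

tastes = ["sweet", "salty", "sour", "bitter"]
colors = ["red", "green", "blue", "yellow"]            # hypothetical names
shapes = ["round", "square", "star", "heart", "ring"]  # hypothetical names


def count_cookie_types(*characteristics):
    """Count every distinct combination of the given characteristics."""
    return sum(1 for _ in product(*characteristics))


print(count_cookie_types(tastes))                  # 4
print(count_cookie_types(tastes, colors))          # 16
print(count_cookie_types(tastes, colors, shapes))  # 80
```

Each new characteristic multiplies the number of cookies to try, which is the same multiplicative growth that affects feature spaces in machine learning.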
Explanation of the Curse of dimensionality through examples: 2
Example 2:
It’s easy to catch a caterpillar moving in a tube (one dimension). It’s harder to catch a dog running around on a plane (two dimensions). It’s much harder to hunt birds, which have an extra dimension they can move in. If we pretend that ghosts are higher-dimensional beings, those are even more difficult to catch.
Explanation of the Curse of dimensionality through examples: 3
Example 3:
Say you dropped a coin on a 100-meter line. How do you find it? Simple: just walk along the line and search. But what if it’s a 100 × 100 meter field? It’s already getting tough, because you are now searching an area roughly the size of a football field for a single coin. But what if it’s a 100 × 100 × 100 meter space? Now the football field is also as tall as a thirty-story building. Good luck finding a coin there! That, in essence, is the “curse of dimensionality”.
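As a rough sketch of the coin search, assume the space is divided into cells of 1 meter per side and that each cell has to be checked once. The cell size and the function name cells_to_search are assumptions for illustration only.

```python
def cells_to_search(side_length_m, n_dims, cell_size_m=1):
    """Number of cells to inspect in a line, square, or cube whose side
    is side_length_m, using cells of cell_size_m (assumed: 1 meter)."""
    cells_per_side = side_length_m // cell_size_m
    return cells_per_side ** n_dims


print(cells_to_search(100, 1))  # 100 cells on the line
print(cells_to_search(100, 2))  # 10000 cells in the field
print(cells_to_search(100, 3))  # 1000000 cells in the cube
```

The side length never changes, yet each added dimension multiplies the search effort by 100.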
Conclusion
As you can see, things become more complicated as the number of dimensions increases. This holds for adults, for computers, and also for kids. To understand the curse of dimensionality, we have used very simple real-life examples that even a kid can understand, in a non-mathematical way.