Let’s start this post with a question: “How to generate and interpret a ROC curve for binary classification?” This post tries to answer it.
Binary classification is the task of classifying the elements of a set into two groups.
A ROC curve is used to diagnose the performance of a classification model.
This post will take you through the concept of the ROC curve. You will be able to interpret the graph and tweak your classification model accordingly.
In order to answer the question, first, the concept of the confusion matrix must be understood.
Let’s take an example of a binary classification problem. Let the two classes be 1 and 0, indicating the presence and absence of something (that is, whether or not the data belongs to a particular class).
Now there are 4 cases listed below:
- True Positive (TP): The object is in class 1 and the prediction is also class 1
- False Positive (FP): The object is in class 0 but the prediction is class 1
- False Negative (FN): The object is in class 1 but the prediction is class 0
- True Negative (TN): The object is in class 0 and the prediction is also class 0
Components of the Confusion Matrix
Accuracy = (TP+TN) / (TP+FP+TN+FN) i.e. the fraction of correct predictions.
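The four cases and the accuracy formula can be sketched in a few lines of plain Python. This is only an illustration: the function names and the example labels are made up here, and class 1 is assumed to be the positive class.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels where 1 is the positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical labels, invented for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)            # 2 1 1 4
print(accuracy(tp, fp, fn, tn))  # 0.75
```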
It is easy to see why higher accuracy is good. But in certain cases, accuracy may not be the best criterion.
Say you want to design a system that looks at a blood sample and detects cancer. However, only 0.1% of the population has cancer. Now if the system always predicts “No Cancer”, its accuracy is 99.9%, but it is absolutely worthless because it never finds a single case. That is why we need the concepts of precision, sensitivity, and specificity.
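To make the arithmetic concrete, here is a small sketch of the “always predict No Cancer” system. The population size of 100,000 is an assumption chosen only so that 0.1% comes out as a round number.

```python
population = 100_000              # assumed size, for illustration only
with_cancer = population // 1000  # 0.1% prevalence -> 100 people

# A "model" that always predicts "No Cancer" never produces a positive.
tp = 0
fp = 0
fn = with_cancer               # every real case is missed
tn = population - with_cancer  # every healthy person is (trivially) correct

accuracy = (tp + tn) / population
recall = tp / (tp + fn)

print(accuracy)  # 0.999
print(recall)    # 0.0
```

The accuracy looks excellent, yet the recall shows that not one person with cancer is detected.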
Precision = TP / (TP+FP), i.e. out of all those predicted positive, how many are truly positive?
Sensitivity or recall or True Positive Rate (TPR) = TP / (TP+FN), i.e. out of all the actual positives, how many are predicted correctly? This can be seen as the accuracy of predicting only the positive objects.
Specificity or True Negative Rate (TNR) = TN / (TN+FP), i.e. out of all the actual negatives, how many are predicted correctly?
Similarly, we can define the False Positive Rate (FPR) = FP / (FP+TN), i.e. out of all the actual negatives, how many are wrongly predicted as positive? Note that FPR = 1 − specificity.
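As a quick check of these ratios, the sketch below computes them from confusion-matrix counts using the standard definitions (precision = TP/(TP+FP), recall = TP/(TP+FN), FPR = FP/(FP+TN), specificity = TN/(TN+FP)). The function names and the counts are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here, not a rule.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def precision(tp, fp):
    """Of everything predicted positive, the fraction that is truly positive."""
    return safe_ratio(tp, tp + fp)


def recall(tp, fn):
    """Of all actual positives, the fraction predicted positive (TPR, sensitivity)."""
    return safe_ratio(tp, tp + fn)


def false_positive_rate(fp, tn):
    """Of all actual negatives, the fraction wrongly predicted positive."""
    return safe_ratio(fp, fp + tn)


def specificity(fp, tn):
    """Of all actual negatives, the fraction predicted negative (TNR)."""
    return safe_ratio(tn, fp + tn)


# Hypothetical counts, invented for illustration only.
tp, fp, fn, tn = 30, 20, 10, 140

print(precision(tp, fp))            # 0.6
print(recall(tp, fn))               # 0.75
print(false_positive_rate(fp, tn))  # 0.125
print(specificity(fp, tn))          # 0.875
```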
These metrics matter most when the misclassification costs are imbalanced and/or the class distribution is imbalanced (e.g. only 0.1% has cancer, not 50%). In such cases, the same model will have a different TPR, FPR, and precision at different classification thresholds.
To see why, suppose you build an ML tool that decides which patients should be sent for additional cancer tests. The patient data that the tool has been trained on only contains information about risk factors for cancer (family history, age, weight, that kind of thing), and doesn’t contain enough information to accurately tell whether or not an individual has cancer. The training data also records whether or not each patient did end up having cancer, so that the ML tool can learn to tell the two groups apart.
Because of this imperfect information, the tool assigns the patient a score between 0 and 1 – the higher the score, the more confident the tool is that the patient is at risk of having cancer.
Once the tool is trained, you can measure its effectiveness. There are several measures of this – for example, the false-positive rate (how many not-ill people were recommended tests), and the false-negative rate (how many people with cancer were not recommended further tests). Both of these are bad outcomes that we want to minimize, but not equally bad.
Deciding threshold score
However, before you can measure these things, you have a choice to make – what threshold score do you use to decide whether or not a patient gets additional tests? After all, every patient with a non-zero score from the tool has some risk of having cancer. So, there’s an argument to be made to just test everybody.
But the reason we’re building this tool in the first place is to avoid doing that. Tests are expensive, and if we tested everyone, the false positive rate would be very high: we’d test heaps of people who didn’t need the tests. Testing everyone corresponds to a very low threshold score.
On the other hand, we could recommend tests only to people with a very high risk of having cancer – our false positive rate would be low (almost everyone that gets tested would need the tests), but we’d also have a lot of false negatives – we’d send a lot of people with cancer home untested. This corresponds to having a very high threshold score.
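The trade-off between the two extremes can be sketched with a handful of scores. Everything below is illustrative: the scores and outcomes are invented, and treating “score at or above the threshold” as “recommend tests” is an assumption about how the tool turns a score into a decision.

```python
def classify(scores, threshold):
    """Recommend further tests (1) when the score is at or above the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


def count_errors(y_true, y_pred):
    """Return (false positives, false negatives)."""
    fp = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 0)
    return fp, fn


# Hypothetical risk scores and outcomes (1 = had cancer), invented for illustration.
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90]
y_true = [0, 0, 0, 1, 0, 1, 0, 1]

for threshold in (0.0, 0.5, 0.95):
    fp, fn = count_errors(y_true, classify(scores, threshold))
    print(threshold, fp, fn)
# 0.0 5 0   -> test everybody: no missed cases, many unnecessary tests
# 0.5 1 1
# 0.95 0 3  -> test almost nobody: no unnecessary tests, every case missed
```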
So how do you compare threshold values, and decide which threshold is best for your tool? You draw a ROC curve.
Before moving ahead, you must visit this website and try to observe & understand: http://navan.name/roc/
With the graphics as a reference, you can go ahead with this blog. Let’s move forward.
ROC curve (receiver operating characteristic)
I hope you enjoyed playing with the interactive graphics at http://navan.name/roc/
A ROC (Receiver Operating Characteristic) curve is a plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. It is a graph showing the relationship between the true positive rate (TPR) and the false positive rate (FPR) for various classification thresholds.
The ROC curve is plotted with TPR against the FPR where TPR is plotted on the y-axis and FPR is plotted on the x-axis. Each point on the curve represents a different threshold for classifying a positive or negative result. The diagonal line represents a random classifier, while a perfect classifier has an ROC curve that passes through the upper left corner.
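A minimal sketch of how the points of such a curve could be generated by hand is shown below. The function name `roc_points` and the example data are hypothetical, and the sketch assumes a score at or above the threshold counts as a positive prediction. In practice a library routine such as scikit-learn’s `sklearn.metrics.roc_curve` is normally used for this.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per distinct threshold, strictest first."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # A threshold above every score predicts nothing as positive.
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical scores and true labels, invented for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
y_true = [1, 1, 0, 1, 0, 1, 0, 0]

for fpr, tpr in roc_points(y_true, scores):
    print(fpr, tpr)
```

Each lowered threshold either adds a true positive (the curve moves up) or a false positive (the curve moves right), which is why the curve runs from (0, 0) to (1, 1).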
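The area under the ROC curve (AUC) is a commonly used metric to evaluate the performance of a binary classifier. The AUC ranges from 0 to 1: 0.5 corresponds to a random classifier and 1.0 to a perfect classifier, while a value below 0.5 means the model ranks cases worse than random guessing.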
The ROC curve can be used to determine the optimal threshold value for a binary classification model. The threshold value is the probability above which the model predicts the positive class, and below which it predicts the negative class.
In the ROC curve, the x-axis represents the False Positive Rate (FPR) and the y-axis represents the True Positive Rate (TPR). The FPR is the proportion of negative instances that are incorrectly predicted as positive, while the TPR is the proportion of positive instances that are correctly predicted as positive.
The ROC curve plots the TPR against the FPR for different threshold values. The closer the curve is to the top-left corner of the plot, the better the model’s performance. A common heuristic is to pick the threshold whose point on the ROC curve is closest to the top-left corner, which gives a high TPR at a low FPR. When false positives and false negatives have very different costs, as in the cancer example, a different point on the curve may be the better choice.
Therefore, by analyzing the ROC curve, we can identify the threshold value that maximizes the model’s performance, i.e., the threshold that provides the best balance between the true positives and false positives.
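One way to apply the “closest to the top-left corner” idea is to measure the distance from each ROC point to the corner (FPR = 0, TPR = 1) and keep the threshold with the smallest distance. The sketch below assumes false positives and false negatives are weighted equally, which is rarely true in a medical setting, and the (threshold, FPR, TPR) triples are hypothetical.

```python
import math

# Hypothetical (threshold, FPR, TPR) triples read off a ROC curve.
roc = [
    (0.9, 0.00, 0.25),
    (0.8, 0.00, 0.50),
    (0.7, 0.25, 0.50),
    (0.6, 0.25, 0.75),
    (0.5, 0.50, 0.75),
    (0.4, 0.50, 1.00),
    (0.3, 0.75, 1.00),
    (0.2, 1.00, 1.00),
]


def closest_to_top_left(roc):
    """Return the triple whose (FPR, TPR) point is nearest to the corner (0, 1)."""
    return min(roc, key=lambda row: math.hypot(row[1], 1.0 - row[2]))


best_threshold, best_fpr, best_tpr = closest_to_top_left(roc)
print(best_threshold, best_fpr, best_tpr)  # 0.6 0.25 0.75
```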
AUC: The area under the curve
The area under the curve gives you an idea of how good your classifier is.
AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0. A model that guesses at random has an AUC of about 0.5.
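The area itself can be approximated with the trapezoidal rule over the (FPR, TPR) points of the curve. The sketch below is one simple way to do it; the function name and the example curves are hypothetical.

```python
def auc_trapezoid(points):
    """Area under a curve given as (FPR, TPR) points sorted by increasing FPR."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # area of one trapezoid
    return area


# Idealised curves, for illustration only.
perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # passes through the top-left corner
random_ = [(0.0, 0.0), (1.0, 1.0)]              # the diagonal
worst = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]    # every prediction wrong

# A hypothetical step-shaped curve from a small data set.
example = [
    (0.0, 0.0), (0.0, 0.25), (0.0, 0.5), (0.25, 0.5), (0.25, 0.75),
    (0.5, 0.75), (0.5, 1.0), (0.75, 1.0), (1.0, 1.0),
]

print(auc_trapezoid(perfect))  # 1.0
print(auc_trapezoid(random_))  # 0.5
print(auc_trapezoid(worst))    # 0.0
print(auc_trapezoid(example))  # 0.8125
```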
Frequently Asked Questions
What is the difference between AUC and ROC?
AUC (Area Under the Curve) and ROC (Receiver Operating Characteristic) are two related concepts used to evaluate the performance of binary classification models. The ROC curve is a graphical representation of the true positive rate (TPR) versus the false positive rate (FPR) at different classification thresholds. The AUC is a single numeric metric that measures the area under that curve. It ranges from 0 to 1, with 0.5 indicating a classifier no better than random guessing and 1 indicating a perfect classifier. Unlike threshold-dependent metrics such as accuracy or precision, AUC summarizes the performance of the model across all possible classification thresholds.
What is the ROC formula?
The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) for various classification thresholds of a binary classifier. The TPR is calculated as TP / (TP + FN) and the FPR is calculated as FP / (FP + TN), where TP is true positives, FN is false negatives, FP is false positives, and TN is true negatives. These formulae are used to calculate the TPR and FPR for multiple classification thresholds, and the resulting TPR-FPR pairs are plotted on a graph to create the ROC curve.
Why is AUC better than accuracy?
AUC (Area Under the Curve) is often considered a better evaluation metric than accuracy for binary classification models because it is less sensitive to imbalanced class distributions. Accuracy is computed at a single threshold and can be misleading when one class dominates, whereas AUC considers the entire range of thresholds and summarizes the trade-off between TPR and FPR. It therefore gives a more comprehensive picture of the model’s performance.
What does an AUC of 0.8 mean?
If the ROC curve has an AUC of 0.8, the model is good at distinguishing between positive and negative cases: there is an 80% chance that it gives a randomly chosen positive case a higher score than a randomly chosen negative case. It does not mean that 80% of positive cases are correctly identified, since that figure depends on the threshold. However, the specific interpretation of an AUC value depends on the problem domain and requirements of the task.
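A standard way to read an AUC value is as a ranking probability: the chance that a randomly chosen positive case gets a higher score than a randomly chosen negative case. The sketch below computes AUC directly from that definition by comparing every positive with every negative (ties count as half). The function name and data are hypothetical.

```python
def auc_by_ranking(y_true, scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical scores and true labels, invented for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
y_true = [1, 1, 0, 1, 0, 1, 0, 0]

print(auc_by_ranking(y_true, scores))  # 0.8125
```

Here 13 of the 16 positive-negative pairs are ordered correctly, so the AUC is 13/16.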
Is 0.7 AUC good or bad?
An AUC (Area Under the Curve) value ranges from 0 to 1, with higher values indicating better binary classification model performance. An AUC of 0.7 is generally considered to be reasonably good performance, as it indicates that the model performs better than random guessing but still has room for improvement. However, the interpretation of what is considered “good” or “bad” depends on the context of the problem, the task’s requirements, and other evaluation metrics. It’s important to consider these factors when assessing the model’s performance.
“How to generate and interpret a ROC curve for binary classification?” At this point, you should be able to answer this question.
To sum up, you explored the following points:
- Confusion matrix and its components (Evaluation metrics for ML models)
- The need to decide the threshold score to classify.
- Concept of ROC curve
- Concept of the area under the curve (AUC)