Software Engineering for Artificial Intelligence, Spring 2021
Jonathan Soma
February 17, 2021
Paper: “Operational Calibration: Debugging Confidence Errors for DNNs in the Field,” by Zenan Li, Xiaoxing Ma, Chang Xu, Jingwei Xu, Chun Cao, Jian Lü
When a DNN is wrong, it’s often more than just wrong: it’s remarkably confident in its wrong answers. Networks are optimized for correct answers, not correct confidence.
Why is this a problem for you? It encourages you to make decisions that might have grave consequences!
How the problem is created, input-wise: the difference between the training data and the real-world “operational data.”
Do you have millions of dollars to train GPT-3 or AlphaGo? No, you’ll use COTS models and maybe fine-tune them.
Even though we’re given a confidence score by softmax, we can’t trust it.
Correct for systematic bias by finding a function that processes the logit pre-softmax so the result matches the real probability.
But this only corrects systematic bias, i.e. every input that generates the same confidence gets the same output probability.
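A minimal sketch of that kind of logit post-processing, in the spirit of Temperature Scaling; the logits, labels, and grid of temperatures below are made-up placeholders, not anything from the paper:

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=1, keepdims=True)      # for numerical stability
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    def temperature_scale(logits, T):
        # One global parameter T rescales every logit the same way,
        # so it can only fix a *systematic* over/under-confidence.
        return softmax(logits / T)

    def nll(logits, labels, T):
        probs = temperature_scale(logits, T)
        return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

    # Hypothetical usage: pick the T that minimizes NLL on held-out labeled data.
    logits = np.random.randn(100, 10)
    labels = np.random.randint(0, 10, size=100)
    best_T = min(np.linspace(0.5, 5.0, 46), key=lambda T: nll(logits, labels, T))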
It’s a given that the system will produce errors, so calibrate the confidence measures to something within reason. But acknowledge that the confidence error isn’t purely systematic.
Operational Calibration doesn’t change the prediction, only the estimation of the likelihood that it is correct. Why is this important?
Quantify the accuracy of the confidence with the mean-squared error of the estimation.
I(x) indicates whether input x was classified correctly (1) or not (0).
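A tiny sketch of that metric (it is essentially the Brier score restricted to the predicted class); the arrays are invented examples:

    import numpy as np

    def confidence_mse(confidences, correct):
        # confidences: the model's claimed probability that its prediction is right
        # correct:     I(x), 1 if the prediction really was correct, else 0
        confidences = np.asarray(confidences, dtype=float)
        correct = np.asarray(correct, dtype=float)
        return np.mean((confidences - correct) ** 2)

    # A model that claims 90% confidence but is right only half the time scores badly:
    print(confidence_mse([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0]))   # 0.41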
Label and test as much operational data as you can afford (the “budget”), and adjust the confidence accordingly.
The challenge is to strike a balance between the prior learned from a huge training dataset (but suffering from domain shift) and the evidence collected from the operational domain (but limited in volume).
We can’t test everything, so we use a Bayesian technique and model the problem as a Gaussian Process. As we get more data about how far off the confidence scores are, we adjust.
The mode of the posterior distribution is our estimate of the true probability.
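A rough sketch of that idea using scikit-learn’s GaussianProcessRegressor; the random features, the small labeled operational sample, and regressing the gap between claimed confidence and actual correctness are simplifications for illustration, not the paper’s exact formulation:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    rng = np.random.default_rng(0)
    features   = rng.normal(size=(50, 16))         # stand-in last-hidden-layer activations
    conf_error = rng.normal(scale=0.1, size=50)    # I(x) minus the claimed confidence

    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
    gp.fit(features, conf_error)

    # For new operational inputs, predict the correction and how unsure we are about it.
    new_feats = rng.normal(size=(5, 16))
    mean_correction, std = gp.predict(new_feats, return_std=True)
    orig_conf  = np.full(5, 0.8)                   # placeholder original confidences
    calibrated = np.clip(orig_conf + mean_correction, 0.0, 1.0)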
We treat the last hidden layer’s activations as the features of input x.
Assumption: a prediction by the network is more likely to be correct if it’s close in feature space to a correct prediction, and incorrect if it’s close to an incorrect prediction. The same goes for confidences.
Assumption: Feature space is lumpy and clusterable.
Clustering allows different clusters to have different covariance functions.
It also decreases the computational cost of the Gaussian Processes (one GP per cluster).
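A sketch of the clustering step, assuming KMeans over the same stand-in features and one independent GP per cluster; the cluster count and kernel choice are arbitrary placeholders:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    rng = np.random.default_rng(1)
    features   = rng.normal(size=(200, 16))        # stand-in last-hidden-layer activations
    conf_error = rng.normal(scale=0.1, size=200)   # I(x) minus the claimed confidence

    # Lumpy feature space: fit a separate, cheaper GP inside each cluster,
    # each free to learn its own covariance / length scale.
    kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(features)
    gps = {}
    for c in range(4):
        mask = kmeans.labels_ == c
        gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
        gps[c] = gp.fit(features[mask], conf_error[mask])

    # A new input is routed to its cluster's GP for its confidence correction.
    x = rng.normal(size=(1, 16))
    cluster = kmeans.predict(x)[0]
    correction, std = gps[cluster].predict(x, return_std=True)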
Not all mistakes are created equal.
This model assumes no cost or gain if no decision is made, and loss is u for a mistake.
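A sketch of that cost model, adding the assumption (not stated in the notes) that acting on a correct prediction gains 1; you act only when the expected value of acting is positive:

    def expected_value(confidence, u, gain=1.0):
        # Act on the prediction: win `gain` if it is right, lose u if it is wrong.
        # Abstaining costs and gains nothing, per the model above.
        return confidence * gain - (1.0 - confidence) * u

    def break_even_confidence(u, gain=1.0):
        # The confidence at which acting and abstaining are equally good.
        return u / (gain + u)

    # If a mistake costs 4x the gain, only act above 80% (calibrated!) confidence.
    print(break_even_confidence(u=4))   # 0.8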
The focus of operational calibration is working within a budget: you don’t have unlimited time, money, or labels.
How do you pick which data to label in order to improve the scores?
You want to reduce variance as much as possible, and pay special attention to inputs near the break-even threshold to reduce LCE.
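A toy sketch of that selection heuristic, reusing the hypothetical per-cluster GP outputs; this particular score (GP uncertainty minus distance to the threshold) is an invented stand-in for the paper’s acquisition rule:

    import numpy as np

    def pick_for_labeling(confidences, gp_std, threshold, budget):
        # confidences: current corrected confidence estimates for unlabeled inputs
        # gp_std:      GP posterior std dev, i.e. how unsure we still are about each
        # threshold:   break-even confidence from the cost model
        score = gp_std - np.abs(confidences - threshold)
        return np.argsort(score)[-budget:]       # indices of the best candidates

    conf = np.array([0.95, 0.81, 0.78, 0.55, 0.99])
    stds = np.array([0.02, 0.10, 0.12, 0.05, 0.01])
    print(pick_for_labeling(conf, stds, threshold=0.8, budget=2))   # [1 2]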
Correcting systematic error only addresses reliability (which is all Temperature Scaling does). OC also cares about resolution.
If the error were purely systematic, all of this extra work would be for nothing: why put inputs into groups if every group needs the same correction?
When the cost of a false prediction is low enough, confidence errors barely matter and there is little LCE to reduce.
The experiments varied across domains, operational dataset sizes, number of classes (classification difficulty), and parameter counts (model complexity).
Operational calibration worked wonders on the Brier score.
No matter what kind of regression was used, it almost always came out ahead of Temperature Scaling.
Operational Calibration worked both when fine-tuning was effective (simple tasks, e.g. MNIST, binary classification) and when it was ineffective (non-trivial tasks, e.g. ImageNet). Fine-tuning does not necessarily provide accurate confidence.
The Brier score would decrease more if the remaining operational data were spent on calibration than on continued fine-tuning.
Worthwhile in all situations when you want to control the impact of incorrect classifications.
Beat out temperature scaling, Platt scaling, enhanced Platt scaling, and Isotonic Regression.
Also tried two other techniques for regression to see if GPR was the right approach. It was!
While it works for metrics like LCE and the Brier score, what about high-confidence false predictions? They labeled 10% of the operational data.
What about accuracy? Is this actually an improvement? The predictions themselves are unchanged, but LCE went down.
The differentiating factor is that it focuses on operational data and is easily usable with COTS systems. Is a DNN’s output a feature or a bug? You only know once it’s in production!
Inspired by transfer learning, but operational calibration has very limited data from the target/operational domain.
Active learning deliberately selects which examples to label, much like OC does with GPR.