
Detecting and Characterising Dataset Shift [Video]


Rabanser, S., Günnemann, S., & Lipton, Z. (2019). Failing loudly: An empirical study of methods for detecting dataset shift. Advances in Neural Information Processing Systems, 32.

Core Problem:

Machine learning (ML) systems often rely on the assumption that training and real-world data (target data) come from the same distribution. However, this is rarely true in practice. When the distribution of input data changes, this is called dataset shift, and it can lead to silent failures in ML systems. This paper explores methods to detect such shifts, identify the most representative shifted samples, and assess the severity of the shift’s impact on model performance.

Key Objectives:

The research aims to:

- Detect distribution shifts from as few examples as possible.
- Characterize the shift by pinpointing over-represented samples in the target data.
- Quantify shift malignancy to determine whether the shift is harmful to model performance.

Methodology:

The paper primarily focuses on shift detection using a two-sample statistical hypothesis testing framework. The core approach involves:

- Dimensionality Reduction: Applying various techniques (e.g., PCA, autoencoders, or the outputs of a pre-trained label classifier, termed black-box shift detection, BBSD) to reduce the dimensionality of the data. This step is crucial for mitigating the known weaknesses of traditional two-sample tests in high-dimensional spaces.
- Two-Sample Testing: Performing suitable statistical tests (e.g., maximum mean discrepancy (MMD) for multivariate data, the Kolmogorov-Smirnov (KS) test for univariate continuous data, and the chi-squared test for categorical data) on the reduced representations to determine whether the source and target distributions differ significantly.
- Identifying Anomalous Samples: Leveraging a domain classifier trained to distinguish between source and target data. The samples the classifier most confidently assigns to the target domain are considered the most anomalous.
- Determining Shift Malignancy: Comparing the accuracy of the black-box model on source and target data. While target error cannot be computed directly without labels, the paper proposes using the domain classifier's confidence scores as a proxy for estimating the potential performance degradation.
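The reduce-then-test steps above can be sketched in a few lines. This is a minimal illustration, not the paper's reference implementation: it uses PCA for reduction and a per-component KS test with a Bonferroni correction, assuming NumPy, SciPy, and scikit-learn are available; `detect_shift` and its parameters are illustrative names.

```python
# Illustrative sketch: PCA reduction + per-dimension KS two-sample tests.
import numpy as np
from scipy.stats import ks_2samp
from sklearn.decomposition import PCA

def detect_shift(source, target, n_components=5, alpha=0.05):
    """Reduce dimensionality with PCA fit on source data, then run a KS
    test on each retained component. A shift is flagged if any p-value
    falls below the Bonferroni-corrected threshold alpha / n_components."""
    pca = PCA(n_components=n_components).fit(source)
    src_z, tgt_z = pca.transform(source), pca.transform(target)
    p_values = np.array([
        ks_2samp(src_z[:, i], tgt_z[:, i]).pvalue
        for i in range(n_components)
    ])
    return bool((p_values < alpha / n_components).any()), p_values

# Toy data: the target is mean-shifted relative to the source.
rng = np.random.default_rng(0)
source = rng.normal(size=(500, 20))
shifted = rng.normal(loc=1.0, size=(500, 20))
detected, pvals = detect_shift(source, shifted)
print(detected)  # a large mean shift like this is easy to flag
```

Other reductions from the paper (autoencoders, BBSD soft predictions) can be swapped in for PCA without changing the testing step.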

Key Findings:

- BBSD with soft predictions (BBSDs) outperformed other dimensionality reduction techniques across a range of simulated shifts, showcasing its ability to detect shifts even when the label shift assumption is not met.
- Different shift types exhibited varying detectability. Large Gaussian noise, significant image modifications, and even adversarial attacks were easily detectable. In contrast, subtle shifts like small Gaussian noise and class imbalance were more challenging to identify.
- The domain classifier proved helpful for characterizing shifts qualitatively and determining their potential harm. By ranking samples based on their domain assignment confidence scores, the researchers could identify exemplars typical of the shifted distribution.
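The domain-classifier ranking described above can be sketched as follows. This is a hedged illustration assuming scikit-learn; `rank_anomalous` and the toy data are invented for the example, and any classifier that outputs probabilities could stand in for the logistic regression used here.

```python
# Illustrative sketch: rank target samples by domain-classifier confidence.
import numpy as np
from sklearn.linear_model import LogisticRegression

def rank_anomalous(source, target, top_k=5):
    """Train a classifier to separate source (label 0) from target
    (label 1), then rank target samples by the predicted probability
    of belonging to the target domain; the top-ranked samples are the
    most anomalous, i.e. most typical of the shift."""
    X = np.vstack([source, target])
    y = np.concatenate([np.zeros(len(source)), np.ones(len(target))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    target_conf = clf.predict_proba(target)[:, 1]   # P(domain = target)
    return np.argsort(target_conf)[::-1][:top_k]    # most anomalous first

# Toy data: only the first 50 target samples are actually shifted.
rng = np.random.default_rng(1)
source = rng.normal(size=(300, 10))
target = rng.normal(size=(300, 10))
target[:50] += 3.0
top = rank_anomalous(source, target, top_k=10)
print(sorted(int(i) for i in top))
```

On data like this, the top-ranked indices concentrate among the shifted samples, which is what makes the ranking useful for qualitative inspection.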

Significance for Practitioners:

- BBSDs offers a practical way for ML practitioners to incorporate shift detection into their workflows. The same label classifier used for prediction can be repurposed for shift detection, even retrospectively.
- Understanding different shift types and their detectability can guide practitioners in choosing appropriate detection methods and monitoring strategies.
- The domain classifier provides a valuable tool for understanding the nature of the shift and identifying problematic samples. This information can be used to improve data collection or model adaptation strategies.

Important Quotes:

- “We might hope that when faced with unexpected inputs, well-designed software systems would fire off warnings. Machine learning (ML) systems, however, which depend strongly on properties of their inputs (e.g. the i.i.d. assumption), tend to fail silently.”
- “[BBSD] works surprisingly well under a broad set of shifts, even when the label shift assumption is not met.”
- “We note that BBSDs being the best overall method for detecting shift is good news for ML practitioners. When building black-box models with the main purpose of classification, said model can be easily extended to also double as a shift detector.”

Further Research:

- The paper acknowledges limitations in quantifying shift malignancy and suggests further research on estimating target error without labels.
- Exploring alternative dimensionality reduction techniques and two-sample tests could lead to improved detection performance, particularly for subtle shifts.
- Applying the proposed methods to real-world datasets and different ML tasks would further validate their effectiveness and practical applicability.

#research #podcast #ai
