Overview: generalized out-of-distribution detection (anomaly detection, open set recognition, OOD detection)
Out-of-distribution (OOD) detection is essential for ensuring the reliability and safety of machine learning systems. For example, in autonomous driving we want the driving system to issue a warning and hand control back to a human when it encounters an unusual scene or object it has never seen during training. Since the problem was formally proposed in 2017, it has attracted increasing attention from researchers, and a variety of solutions have emerged, including classification-based, density-based, reconstruction-based, and distance-based methods. At the same time, several other problems are closely related to out-of-distribution detection in both motivation and methodology: anomaly detection (AD), novelty detection (ND), open set recognition (OSR), and outlier detection (OD). Although their definitions and problem settings differ, these problems often confuse readers and practitioners, which has led to the misuse of terms in some existing work. In fact, the five problems of AD, ND, OSR, OOD detection, and OD can be unified under a generalized framework of out-of-distribution detection; each can be regarded as a special case or subtask of this framework and can then be clearly distinguished. The survey reviews these five problems in depth by summarizing recent technical developments, and discusses the challenges and potential research directions in this field.
A trustworthy visual recognition system should not only give accurate predictions in known contexts, but also detect unknown samples and either discard them or hand them over to users for safe handling.
For example, a well-trained food classifier should reject non-food images such as user selfies rather than arbitrarily assigning them to a known food category. In safety-critical applications such as autonomous driving, the system should issue a warning and hand over control to the driver when it encounters unusual scenes or objects that never appeared in training.
Most existing machine learning models are trained under the closed-world assumption, that is, the test set and the training set are independent and identically distributed, or both are drawn from the same in-distribution. However, when a model is deployed in an open-world scenario, the distribution of test samples may differ from that of the training set and must be handled with caution. The distribution shift may be a semantic shift (e.g., OOD samples drawn from other categories) or a covariate shift (also called input shift, e.g., OOD samples drawn from other domains).
Only semantic shift and covariate shift are considered here.
The goal of anomaly detection is to detect anomalous samples at test time, where "anomalous" means deviating from a predefined notion of "normal". This deviation may be caused by covariate shift or semantic shift. Anomaly detection can be divided into two subtasks: sensory anomaly detection (detecting covariate shift) and semantic anomaly detection (detecting semantic shift).
Differences from anomaly detection: 1) in motivation, novelty detection does not treat "novel" samples as erroneous or harmful; instead it values them as learning resources for future models; 2) novelty detection mainly focuses on semantic shift; 3) in novelty detection, ID samples are not limited to a single class, and the training set may contain multiple classes.
The goal of novelty detection is to detect test samples that do not belong to any training class. The detected novel samples are usually set aside for future procedures, such as more detailed analysis or incremental learning of the current model. Depending on the number of training classes, novelty detection is divided into one-class novelty detection and multi-class novelty detection.
OSR requires a multi-class classifier that simultaneously 1) accurately classifies test samples from the training classes (ID) and 2) identifies samples that do not belong to any training class (OOD).
OSR = multi-class novelty detection.
Out-of-distribution detection aims to detect test samples whose distribution has shifted away from the training distribution; the model needs to reject samples exhibiting label (semantic) shift to ensure reliable and safe predictions.
A sample is considered an "outlier" when it deviates markedly from the other samples. Anomaly detection, novelty detection, open set recognition, and out-of-distribution detection all involve a training-test procedure in which samples that do not belong to the training distribution must be picked out at test time.
Outlier detection involves no split between a "training distribution" and a "test distribution"; it directly selects, from all observed samples, those that differ significantly from the rest.
Given homogeneous ID data, the most direct approach is 1) the density-based method, which estimates the density of ID data and rejects test samples that fall in low-density regions of the estimate as OOD. Other approaches include 2) identifying anomalous samples from the quality of image reconstruction, 3) directly learning a decision boundary that separates ID from OOD samples, 4) distance-based methods, and 5) meta-learning methods.
Density-based methods attempt to model the distribution of normal (ID) data, based on the practical assumption that anomalous test samples receive low likelihood under the estimated density model, while normal samples receive high likelihood.
Parametric density estimation assumes that the density of ID samples can be expressed by a predefined distribution. One approach fits a multivariate Gaussian to the training data and measures the Mahalanobis distance between a test sample and the mean of the training samples (unlike the Euclidean distance, the Mahalanobis distance accounts for correlations among features). Other work adopts more complex assumptions, modeling the training distribution as a Gaussian mixture or a Poisson distribution.
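To make this concrete, here is a minimal sketch of Mahalanobis-distance scoring under a single fitted Gaussian; the feature arrays and dimensions are synthetic stand-ins rather than any benchmark from the survey:

```python
import numpy as np

def fit_gaussian(train_feats):
    """Fit one multivariate Gaussian to ID training features of shape (N, D)."""
    mean = train_feats.mean(axis=0)
    cov = np.cov(train_feats, rowvar=False)
    cov += 1e-6 * np.eye(cov.shape[0])     # keep the covariance invertible
    return mean, np.linalg.inv(cov)

def mahalanobis_score(x, mean, cov_inv):
    """Larger distance -> lower density under the Gaussian -> more anomalous."""
    diff = x - mean
    return float(np.sqrt(diff @ cov_inv @ diff))

rng = np.random.default_rng(0)
train_feats = rng.normal(size=(1000, 16))               # stand-in ID features
mean, cov_inv = fit_gaussian(train_feats)
print(mahalanobis_score(rng.normal(size=16), mean, cov_inv))            # ID-like, small
print(mahalanobis_score(rng.normal(loc=5.0, size=16), mean, cov_inv))   # shifted, large
```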
Nonparametric density estimation handles the more realistic situation in which no predefined distribution can capture the real one. The training distribution can be approximated simply with a histogram. Kernel density estimation (KDE) further uses a kernel function as a continuous substitute for the discrete histogram and can flexibly control the estimated distribution through point weights and the bandwidth.
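As an illustration, the following sketch scores synthetic features with scikit-learn's KernelDensity; the bandwidth and data are illustrative assumptions, not tuned settings:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
id_feats = rng.normal(size=(500, 8))            # stand-in "normal" training features
test_id = rng.normal(size=(5, 8))
test_ood = rng.normal(loc=4.0, size=(5, 8))     # shifted samples, should score lower

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(id_feats)
print(kde.score_samples(test_id))    # higher log-density for ID-like samples
print(kde.score_samples(test_ood))   # much lower log-density flags anomalies
```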
Although classical density estimation methods achieve good AD performance on many tasks, they are better suited to low-dimensional data.
For the high-dimensional data in computer vision tasks, their computational cost and scalability suffer. To mitigate the curse of dimensionality, some methods reduce the dimensionality through feature engineering [277], [278].
By reconstructing the input from a latent embedding, an autoencoder can learn effective representations of unlabeled data. A variational autoencoder encodes the input image into a latent vector that follows a Gaussian distribution. The learned latent embedding can be regarded as a low-dimensional representation of the input, and classical density estimation methods can then be applied to these deep representations.
A generative adversarial network (GAN) consists of a generator and a discriminator that compete in a zero-sum game. In general, the generator learns to map the latent space to the data distribution under study, while the discriminator tries to distinguish generated data from real data. However, unlike autoencoder/VAE-based approaches, the lack of an encoder makes it difficult for a GAN to directly find the latent embedding corresponding to an input image. To solve this problem, ADGAN [90] searches the latent space for a good representation of a given sample; if no such representation can be found, the sample is regarded as anomalous. This method is computationally expensive.
Normalizing flows describe the transformation of a probability distribution through a sequence of invertible mappings. By repeatedly applying the change-of-variables rule, the initial density "flows through" the sequence of invertible maps, so the likelihood in input space can be estimated directly. Flow-based methods have elegant mathematical formulations, but they are tractable mainly for low-dimensional features; without dimensionality reduction, their computational cost is high.
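The change-of-variables rule behind this can be illustrated with a toy, single-layer invertible map; real normalizing flows stack many learned invertible layers, so this sketch only shows how the log-likelihood in input space is computed:

```python
import numpy as np
from scipy.stats import norm

a, b = 2.0, 1.0                       # parameters of a toy affine "flow" x = a*z + b

def log_prob_x(x):
    z = (x - b) / a                   # inverse map back to the base space
    log_base = norm.logpdf(z)         # standard-normal base density
    log_det = -np.log(abs(a))         # log |dz/dx| for the affine map
    return log_base + log_det

print(log_prob_x(1.0))    # near the mode: high log-likelihood
print(log_prob_x(25.0))   # far in the tail: low log-likelihood, flagged as anomalous
```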
Besides obtaining visual embeddings with generative models, some methods increase the representational power of the extracted features by enlarging model capacity, which can make the normal (ID) data more accurately represented for density estimation. These strategies include data augmentation, adversarial training, distillation, loss-function enhancements, and the use of shallow/local features.
Energy-based methods use a scalar energy score to represent the probability density of a variable, in the form of an unnormalized negative log-probability.
However, compared with standard deep learning models, energy-based methods are expensive to train, because sampling and estimation with Markov chain Monte Carlo (MCMC, which approximates a posterior distribution of the parameters of interest by random sampling in probability space) involve intractable integration.
To address this, researchers have proposed score matching and stochastic-gradient-based techniques to enable efficient training.
Existing work also explores frequency-domain analysis for anomaly detection. Humans understand images mainly through low-frequency information, whereas CNNs rely more heavily on high-frequency information for their decisions. Methods have been proposed to suppress the influence of high-frequency components, such as CNN kernel smoothing and spectrum-guided data augmentation. Some work has also found that adversarial attacks on low-frequency components are difficult to detect.
Frequency-based methods focus on sensory anomaly detection (especially adversarial samples) and may not be suitable for semantic anomaly detection.
The core idea of reconstruction-based methods is that an encoder-decoder framework trained on ID data usually behaves differently on ID and OOD samples.
This difference in model behavior can serve as an indicator for anomaly detection, and it can be measured either in feature space or by the reconstruction error.
Sparse reconstruction assumes that every normal sample can be accurately reconstructed from a limited set of basis functions, whereas anomalous data incur a higher reconstruction cost and therefore yield denser representations. Typical sparse-representation techniques include L1-regularized kernel PCA and low-rank embedding networks.
Reconstruction-error methods rely on the assumption that a reconstruction model trained on normal data produces higher-quality outputs for normal test samples than for anomalous ones. Deep reconstruction models (including autoencoders (AE), variational autoencoders (VAE), generative adversarial networks (GAN), and U-Net) can serve as the backbone of such methods.
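A minimal PyTorch sketch of the reconstruction-error idea with a tiny autoencoder follows; the architecture sizes are illustrative, and the model would need to be trained on ID data before the scores become meaningful:

```python
import torch
import torch.nn as nn

class TinyAE(nn.Module):
    def __init__(self, dim=64, latent=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, latent), nn.ReLU())
        self.dec = nn.Linear(latent, dim)

    def forward(self, x):
        return self.dec(self.enc(x))

def recon_error(model, x):
    """Per-sample mean squared reconstruction error; higher -> more anomalous."""
    with torch.no_grad():
        return ((model(x) - x) ** 2).mean(dim=1)

model = TinyAE()
# After training on ID data only, score a test batch:
print(recon_error(model, torch.randn(4, 64)))
```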
Besides the standard practice of AE/VAE with reconstruction error, other methods use more elaborate strategies, such as augmenting the reconstruction model with a memory of normal patterns, adapting the model architecture, and partial/conditional reconstruction.
In semi-supervised anomaly detection, CoRA trains two autoencoders, one on ID samples and one on OOD samples; the reconstruction errors of the two autoencoders are used as indicators for anomaly detection.
The discriminator in a GAN essentially performs anomaly detection by computing a reconstruction error. In addition, GAN variants such as denoising GANs and class-conditional GANs achieve better performance by increasing the difficulty of reconstruction. Some methods use the performance of reconstructed images on downstream tasks to further amplify the reconstruction error of anomalous samples. Ensembling can also improve model performance.
Anomaly detection and one-class novelty detection are usually formalized as unsupervised learning problems in which all ID samples are regarded as a single class.
[283] studied fully supervised anomaly detection.
In semi-supervised anomaly detection, unlabeled data is also used for model training.
PU (positive-unlabeled) learning has been proposed for this setting.
Self-supervised methods have also been explored.
One-class classification directly learns a decision boundary around the ID data.
Commonality: ID samples contain multiple categories (the training categories).
Difference: open set recognition must also classify ID samples accurately, whereas multi-class novelty detection only needs a binary discriminator between ID and OOD.
Because both open set recognition and multi-class novelty detection involve many training categories, most methods are classification-based. Other methods are based on distance to prototypes or on reconstruction; few are density-based.
Both open set recognition and multi-class novelty detection consider ID data containing multiple classes. In classification problems, category information is usually encoded with one-hot encoding. However, one-hot encoding ignores the inherent relationships between categories: for example, the "dog"-"cat" and "dog"-"car" distances are the same, which is clearly unreasonable. In view of this, some work tries to exploit information in the label space to detect novel categories, for example by redistributing a large semantic space to form a hierarchical taxonomy of the known categories.
Top-down classification strategies based on the reorganized label hierarchy and grouped softmax training have proved effective. Another line of work uses word-vector embeddings to build the label space automatically. In [169], the sparse one-hot labels are replaced by several groups of dense word vectors from different NLP models, forming multiple regression heads for robust training.
At test time, the label whose embedding has the smallest distance to the vectors produced by all the different heads is output as the prediction.
If that minimum distance exceeds a threshold, the sample is classified as "novel". Recent work further uses features from language-image pre-training models to better detect novel categories, since the image encoding space also carries rich information from the label space.
Distance-based open set recognition methods require class-conditional "prototypes" and must maintain classification performance on ID samples.
Clustering and prototype computation are performed on the visual features extracted by the classifier.
OOD samples can then be identified from the distance between a sample and the class clusters.
Some methods also introduce contrastive learning to learn tighter clusters for the known categories, thereby enlarging the distance between ID and OOD samples.
CROSR [177] computes distances on the concatenation of the classifier embedding and the embedding given by a reconstruction model, obtaining enhanced features in an expanded feature space. In addition to the classifier features, GMVAE [178] also extracts features with a reconstruction VAE and models the training-set embeddings as a multi-centroid Gaussian mixture for subsequent distance-based operations. Nearest-neighbor classifiers are also suitable for open set recognition: by storing training samples, the nearest-neighbor distance ratio is used to identify unknown samples at test time.
Reconstruction-based methods expect ID and OOD samples to behave differently during reconstruction. The difference can be captured either in the latent feature space or in the pixel space of the reconstructed image.
By converting images of known categories into sparse representations, open-set samples can be identified because their representations are comparatively dense. Sparse-coding techniques include the sparsity concentration index [180] and kernel null-space methods [181], [182].
A multi-class visual encoder is first fixed after training on ID samples so as to maintain classification performance on ID data. C2AE trains a decoder conditioned on label vectors and applies extreme value theory to the reconstructed images to distinguish unknown categories. Follow-up work uses conditional Gaussian distributions, forcing different latent features to approximate class-wise Gaussian models, so that known-class samples can be classified and unknown-class samples rejected. Other methods generate counterfactual images to help the model focus more on semantics. Adversarial defense [186] also uses this idea to enhance model robustness.
The advantage of post-hoc detection methods is that they can be applied easily without modifying the training procedure or objective, which matters for OOD detection in real production environments. The early method ODIN is a post-processing method that uses temperature scaling and input perturbation to amplify the ID/OOD difference. In this method, a sufficiently large temperature has a strong smoothing effect that pushes the softmax scores back toward the logit space, which helps separate ID from OOD samples. Note that this differs from confidence calibration, where a much milder temperature is used.
Calibration focuses on producing probabilities that faithfully reflect the true likelihood of correctness on ID samples.
ODIN's score is designed to maximize the gap between ID and OOD samples and may no longer be meaningful as a prediction confidence.
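A hedged sketch of ODIN-style scoring (temperature scaling plus input perturbation) is given below; `model` stands for any classifier returning logits, and the temperature T and perturbation size epsilon are hyperparameters chosen on ID data, as described above:

```python
import torch
import torch.nn.functional as F

def odin_score(model, x, T=1000.0, epsilon=0.0014):
    """Max temperature-scaled softmax after an input perturbation; low -> likely OOD."""
    x = x.clone().detach().requires_grad_(True)
    logits = model(x) / T
    # Step the input in the direction that increases the predicted-class confidence.
    loss = F.cross_entropy(logits, logits.argmax(dim=1))
    loss.backward()
    x_pert = x - epsilon * x.grad.sign()
    with torch.no_grad():
        probs = F.softmax(model(x_pert) / T, dim=1)
    return probs.max(dim=1).values
```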
Based on this observation, [189] recently proposed using an energy score for OOD detection. This method requires no hyperparameters and performs as well as or better than ODIN. The energy function maps the logits to a scalar through the convenient logsumexp operator. Test samples with relatively low energy are considered ID, and those with high energy OOD.
[55] further proposed the JointEnergy score for multi-label classification.
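The energy score itself is a one-line computation over the logits, as the following sketch shows (the logit values are made up to contrast a confident ID-like prediction with a flat OOD-like one):

```python
import torch

def energy_score(logits, T=1.0):
    """Free energy E(x) = -T * logsumexp(logits / T); lower energy suggests ID."""
    return -T * torch.logsumexp(logits / T, dim=1)

logits = torch.tensor([[8.0, 1.0, 0.5],    # confident prediction, ID-like
                       [0.4, 0.3, 0.2]])   # flat logits, OOD-like
print(energy_score(logits))                # the ID-like row gets much lower energy
```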
Confidence-based methods tailored to OOD detection can be realized by designing a confidence-estimation branch and class data augmentation, combined with strategies such as leave-out ensembling, adversarial training, stronger data augmentation, uncertainty modeling, and exploiting better deep features.
In particular, to increase sensitivity to covariate shift, some methods focus on the hidden representations in the intermediate layers of the neural network. Generalized ODIN extends ODIN with a DeConf-C training objective and selects the perturbation scale as a hyperparameter on ID data.
Because Generalized ODIN requires a model training process, it is not classified as a post-processing method.
To obtain better hidden-layer features for density estimation, layer-wise Mahalanobis distances and Gram-matrix-based techniques have been introduced.
Another branch of OOD detection uses a collected set of OOD samples (an outlier set) during training to help the model learn the difference between ID and OOD data.
In general, OOD detection with outlier exposure achieves markedly better performance. However, its performance depends heavily on how well the given OOD samples correlate with the true OOD samples encountered at test time; how to generalize from the exposed outliers to a broader range of OOD data needs further exploration.
Outlier exposure relies on the strong assumption that OOD training data is available, which may not hold in practice. When OOD data is unavailable, some methods try to synthesize OOD samples so that ID and OOD can be distinguished. Existing work uses GANs to generate OOD training samples and forces the model to output uniform (low-confidence) predictions on them, to generate boundary samples in low-density regions, or, similarly, to generate high-confidence OOD samples.
Existing OOD detection methods mainly rely on the output or the feature space for OOD scores and ignore information in the gradient space. ODIN [188] first explored the use of gradient information for OOD detection: it preprocesses the input by applying a small perturbation derived from the input gradient. The goal of the ODIN perturbation is to strengthen the model's confidence in the predicted label, thereby increasing the softmax score of any given input. The perturbation ultimately widens the gap between the softmax scores of ID and OOD inputs, making them easier to separate and improving OOD detection performance. ODIN uses gradients only implicitly, through the perturbation. GradNorm instead uses the norm of the gradient backpropagated from the KL divergence between the softmax output and a uniform distribution.
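A hedged GradNorm-style sketch is shown below: the KL divergence between the softmax output and a uniform distribution is backpropagated, and the gradient norm of the final layer serves as the score (a larger norm suggests ID). The tiny linear layer is a stand-in for a real classifier's last layer:

```python
import torch
import torch.nn.functional as F

def gradnorm_score(last_layer, feats, num_classes):
    """Norm of the last-layer gradient of KL(uniform || softmax); larger -> more ID-like."""
    last_layer.zero_grad()
    log_probs = F.log_softmax(last_layer(feats), dim=1)
    uniform = torch.full_like(log_probs, 1.0 / num_classes)
    loss = F.kl_div(log_probs, uniform, reduction="sum")   # KL(uniform || softmax)
    loss.backward()
    return last_layer.weight.grad.abs().sum().item()

last_layer = torch.nn.Linear(16, 10)
print(gradnorm_score(last_layer, torch.randn(1, 16), num_classes=10))
```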
Bayesian models are statistical models that use Bayes' rule to infer all sources of uncertainty in the model. The most representative example is the Bayesian neural network, which captures the model's epistemic uncertainty by sampling from the model posterior via Markov chain Monte Carlo, the Laplace approximation, or variational inference. Their most obvious drawbacks are inaccurate predictions and high computational cost, which make them hard to use in practice. Recent work has turned to several less principled approximations, including MC-dropout [224] and deep ensembles [225], [299], for faster and better uncertainty estimates. These methods are not very competitive for OOD uncertainty estimation; further exploration may require natural-gradient variational inference that retains the benefits of the Bayesian principle while remaining practical and affordable with modern deep-learning training. Dirichlet Prior Networks (DPN) have also been used for OOD detection, with a line of work modeling uncertainty from three sources: model uncertainty, data uncertainty, and distributional uncertainty [227], [228], [229].
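As a small illustration of the approximate-Bayesian route, the sketch below keeps dropout active at test time, averages the softmax over several stochastic forward passes, and uses the predictive entropy as an uncertainty signal; the model and sizes are purely illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 10))

def mc_dropout_entropy(model, x, n_samples=20):
    """Predictive entropy of the MC-dropout average; higher -> more uncertain."""
    model.train()                       # keep dropout stochastic at test time
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(x), dim=1) for _ in range(n_samples)])
    mean_probs = probs.mean(dim=0)
    return -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(dim=1)

print(mc_dropout_entropy(model, torch.randn(4, 16)))
```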
Recent work has pushed large-scale OOD detection closer to practical applications. Two research directions are extending OOD detection to large semantic spaces and exploiting large pre-trained models. For example, [168] points out that methods developed on the CIFAR benchmarks often fail on ImageNet, a benchmark with a much larger semantic space, and emphasizes the need to evaluate OOD detection in large-scale, realistic settings. To address this challenge, the key idea of MOS is to decompose the large semantic space into smaller groups of similar concepts, which simplifies the decision boundary between known and unknown data. Powerful pre-trained models have achieved impressive performance across tasks and modalities, and contemporaneous work [171], [230], [231] showed that pre-trained transformers bring significant improvements on particularly hard OOD tasks.
In OOD detection, density-based methods use probabilistic models to explicitly model the in-distribution data and flag test data in low-density regions as OOD. Although the in-distribution data is multi-class (unlike in anomaly detection), the density estimation methods of Section 3.1.2 can be applied directly by treating the in-distribution data as a whole. When the in-distribution contains multiple classes, class-conditional Gaussian distributions can model the data explicitly, so out-of-distribution samples can be identified from the predicted likelihoods. Flow-based methods [92], [232], [233], [234] can also be used for probabilistic modeling. Directly estimating the OOD probability seems a natural solution, and some methods [235], [236], [237] perform OOD detection by making the model output higher probability predictions on OOD samples. [238] addresses the problem with likelihood ratios. [239] found a clear bias between input complexity and likelihood values and proposed a likelihood-ratio-based score to remove the influence of input complexity. Recent methods have shifted to new scores such as likelihood regret [240] or ensembles of multiple density models [236]. Overall, generative models can be prohibitively hard to train and optimize, and their performance often lags behind that of classification-based methods (Section 3.3).
The basic idea of distance-based methods is that OOD test samples should be relatively far from the centers or prototypes of the in-distribution classes. [207] uses the minimum Mahalanobis distance to all class centers for detection. A follow-up work [241] divides the image into foreground and background and then computes the ratio of the Mahalanobis distances in the two spaces. Some work uses the cosine similarity between test-sample features and class features to identify OOD samples [242], [243].
Other work uses the one-dimensional subspace spanned by the first singular vector of the training features, distances under a radial-basis-function kernel, and the Euclidean distance between the input embedding and the class centers.
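A minimal sketch of centroid-based cosine scoring follows; the feature tensors and class count are synthetic assumptions, and in practice the features would come from a trained classifier:

```python
import torch
import torch.nn.functional as F

def class_centroids(train_feats, train_labels, num_classes):
    """Mean feature vector per training class."""
    return torch.stack([train_feats[train_labels == c].mean(dim=0)
                        for c in range(num_classes)])

def max_cosine_score(test_feats, centroids):
    """Maximum cosine similarity to any class centroid; low values suggest OOD."""
    t = F.normalize(test_feats, dim=1)
    c = F.normalize(centroids, dim=1)
    return (t @ c.t()).max(dim=1).values

train_feats = torch.randn(100, 32)
train_labels = torch.randint(0, 5, (100,))
centroids = class_centroids(train_feats, train_labels, num_classes=5)
print(max_cosine_score(torch.randn(3, 32), centroids))
```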
Since its emergence, the field of OOD detection has developed rapidly, with solutions ranging from classification-based to density-based to distance-based. In the multi-class setting, the typical OOD detection task is an open-set recognition problem (Section 4): accurately classify in-distribution test samples into the class space Y while rejecting out-of-distribution samples whose semantics are not covered by Y. However, OOD detection encompasses a wider range of learning tasks (e.g., multi-label classification) and solutions (e.g., density estimation and outlier exposure). Some methods relax the constraints imposed by open-set recognition and achieve stronger performance.
Outlier detection assumes all samples are observable and aims to detect those that deviate significantly from the majority distribution. Outlier detection methods are usually transductive rather than inductive. The surveys [13], [14], [15], [16] mainly review outlier detection methods from the data mining community; the following instead reviews outlier detection methods designed for computer vision with deep neural networks. Even though deep learning methods rarely tackle outlier detection directly, data-cleaning procedures (a prerequisite for learning from dirty open-set data) and open-set semi-supervised learning methods also address the outlier detection problem.
The basic idea of outlier detection models is to fit the whole dataset with a Gaussian distribution and mark samples more than three standard deviations from the mean as outliers [300], [301]. Other parametric probabilistic methods use the Mahalanobis distance [266] and Gaussian mixture models [302] to model the data density. Similar to the "three standard deviations" rule, the interquartile range can also be used to detect outliers [247] by building a classical nonparametric probabilistic model. For robustness and simplicity, the local outlier factor method [248] estimates the density around a given point via the ratio of the local reachability of its neighbors to its own. RANSAC [252] iteratively estimates the parameters of a mathematical model that fits the data and treats samples contributing little to the estimate as outliers.
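Two of the classical detectors just mentioned can be sketched in a few lines: the three-standard-deviation rule applied to a simple 1-D statistic, and scikit-learn's LocalOutlierFactor; the data here is synthetic with one planted outlier:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(size=(200, 2)), [[8.0, 8.0]]])   # one planted outlier

# Three-standard-deviation rule on the distance from the mean.
dists = np.linalg.norm(data - data.mean(axis=0), axis=1)
print(np.where(dists > dists.mean() + 3 * dists.std())[0])    # flags the planted point

# Local outlier factor: fit_predict returns -1 for outliers.
lof = LocalOutlierFactor(n_neighbors=20)
print(np.where(lof.fit_predict(data) == -1)[0])
```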
In general, the classical density-based methods for anomaly detection, such as kernel density estimation (Section 3.1), can also be applied to outlier detection. Even though such methods are hard to apply directly because of the high dimensionality of image data, this can be alleviated by dimensionality reduction [253], [254] and nearest-neighbor-based density methods (Section 3.1).
A simple way to detect outliers is to count the number of neighbors within a certain radius, or to measure the distance to the k-th nearest neighbor [303], [304]; a small sketch of the latter is given below. The discussion then turns to clustering-based and graph-based methods.
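The k-nearest-neighbor distance score can be computed directly with scikit-learn, as in this small synthetic example (k = 5 is an arbitrary illustrative choice):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(size=(200, 2)), [[10.0, 10.0]]])   # planted outlier

nn = NearestNeighbors(n_neighbors=6).fit(data)   # 6 = the point itself + 5 neighbours
dists, _ = nn.kneighbors(data)
knn_score = dists[:, -1]                         # distance to the 5th true neighbour
print(np.argmax(knn_score))                      # index 200, the planted outlier
```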
DBSCAN [255] groups samples into clusters according to a distance-based density; samples outside the main clusters are identified as outliers. Subsequent work improves the clustering by taking the confidence of cluster labels into account [256].
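A minimal DBSCAN sketch is given below: points labelled -1 fall outside every dense cluster and are treated as outliers; eps and min_samples are illustrative values, not tuned settings:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(size=(100, 2)),
                  rng.normal(loc=5.0, size=(100, 2)),
                  [[20.0, 20.0]]])                 # far from both clusters

labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(data)
print(np.where(labels == -1)[0])                   # noise points, including the planted outlier
```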
Another family of methods uses the relationships between data points to build neighborhood graphs [305], [306] (or variants thereof [307]) and applies graph properties and graph-mining techniques to find anomalous samples [257], [258], for example graph clustering [259], [260], graph partitioning [308], and label propagation with graph neural networks.