Exploring Anomaly Detection Using Multivariate Gaussian Distribution
Anomaly detection is a crucial aspect of data analysis, allowing us to spot irregularities that deviate from expected patterns. This article delves into the foundational concepts of anomaly detection, the properties of the multivariate Gaussian distribution, and provides a basic implementation of an anomaly detection algorithm using Python.
What is Anomaly Detection?
Anomaly detection encompasses identifying data points or events that significantly differ from the norm. The primary objective is to pinpoint occurrences that do not align with established patterns or expectations. Understanding the anticipated pattern is essential; if a new data point falls outside this established norm, it may be classified as an anomaly.
Typically, there are three categories of anomalies:
1. Point Anomalies: Individual instances viewed as unusual compared to the entire dataset (e.g., a car moving at an unusually slow speed on a busy highway).
2. Contextual Anomalies: Instances that are anomalous within a specific context (e.g., a credit card transaction that seems normal among all transactions but is out of character for a particular individual).
3. Collective Anomalies: Groups of instances that collectively appear anomalous, even if each instance could be deemed normal on its own (e.g., a series of rapid-fire purchases on Amazon that seem suspicious).
Anomaly detection methods generally fall into three categories:
1. Supervised Detection: Requires data labeled with both normal and anomalous instances. Techniques like neural networks can classify data points, but heavily imbalanced datasets can hinder effective learning.
2. Semi-supervised Detection: Utilizes partially labeled data, often assuming that only normal (positive) instances are labeled. The model learns the distribution of positive cases and flags deviations from it as potential anomalies.
3. Unsupervised Detection: Works with entirely unlabeled data, establishing a boundary of normality; points outside this boundary are classified as anomalies.
Anomaly detection can be applied across various data types, including time series, tabular data, images, and graph structures.
The applications of anomaly detection are extensive, ranging from fraud detection and cybersecurity to rare disease identification and process monitoring. Access to robust algorithms can significantly impact various fields.
Let's examine a fundamental algorithm for detecting anomalies.
Gaussian Distribution for Anomaly Detection
One foundational technique for anomaly detection involves the Gaussian (or Normal) distribution. This statistical model, recognized for its bell curve shape centered around the mean, is instrumental in identifying outliers.
Gaussian distribution is adept at modeling many natural phenomena. Its probability density function captures the distribution of data, with most data points clustering around the mean and tapering off towards the extremes. Instances farther from the mean are increasingly likely to be classified as outliers.
The probability density function, denoted f(x), quantifies the likelihood of observing an outcome x in our dataset. It is formally defined as follows:
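f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)

where μ is the mean and σ is the standard deviation of the feature.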
Assuming a single feature follows a normal distribution, we can apply f(x) to model our anomaly detection algorithm. A threshold epsilon can be established to determine whether a case is anomalous. The selection of epsilon is heuristic and depends on the desired sensitivity to anomalies.
In a normal distribution, roughly 2.5% of observations lie more than two standard deviations below the mean (and another 2.5% lie more than two standard deviations above it). For a standard normal distribution, the density at two standard deviations from the mean is about 0.054, so setting the threshold there classifies the observations in both tails, roughly 5% of the dataset, as anomalies. Adjusting the threshold influences sensitivity: lower values flag fewer anomalies, while higher values yield more false positives.
In practice, a balance must be struck, as some genuine cases may fall below the threshold while some anomalies might remain undetected. Testing various epsilon values is crucial to identify the most effective setting.
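As a minimal sketch of this thresholding idea (assuming a standard normal feature and using scipy; the sample points and threshold are illustrative), each observation's density can be compared against epsilon:

```python
from scipy.stats import norm

mu, sigma = 0.0, 1.0   # in practice, estimated from the non-anomalous training data
epsilon = 0.054        # density of a standard normal at two standard deviations from the mean

for x in [0.0, 1.5, -2.5, 3.0]:
    density = norm.pdf(x, loc=mu, scale=sigma)
    print(f"x={x:+.1f}  f(x)={density:.4f}  anomaly={density < epsilon}")
```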
When dealing with multiple features, the challenge lies in their potential interdependence. If features are uncorrelated, the product of their probability densities can classify anomalies.
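Under this independence assumption, the joint density is simply the product of the per-feature densities:

p(x) = \prod_{j=1}^{n} f(x_j; \mu_j, \sigma_j)

where n is the number of features.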
If any feature indicates an outlier, the probability of the instance being anomalous increases. However, assuming independence among features is often unrealistic. This is where the multivariate probability density function becomes relevant. By constructing a covariance matrix, we can account for the relationships between features and avoid double-counting.
The formula for the multivariate distribution probability density function is as follows:
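f(x) = \frac{1}{(2\pi)^{k/2} \, |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu)\right)

where k is the number of features.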
Here, x represents the input vector, μ denotes the vector of feature means, and Σ is the covariance matrix.
Using the scipy library simplifies implementation. The scipy.stats.multivariate_normal function accepts a vector of feature means and a covariance matrix, and the resulting frozen distribution offers a .pdf method to obtain the probability density at specified points.
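A minimal sketch of that API, with placeholder means and covariance standing in for values estimated from training data:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Placeholder parameters for two features; in practice these come from the training data.
mu = np.array([10.0, 5.0])
cov = np.array([[1.0, 0.2],
                [0.2, 0.25]])

mvn = multivariate_normal(mean=mu, cov=cov)   # frozen distribution
density = mvn.pdf([10.2, 5.1])                # probability density at one point
```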
Now, let's apply this implementation in a practical example.
Two-feature Model Implementation in Python
In this section, we will examine a two-feature example so that anomalies can be visualized in Euclidean space. I generated two features with 100 samples each from a normal distribution, calculated their means and covariance, and fitted a multivariate normal model using the scipy.stats library. Note that the model was trained exclusively on positive (normal) samples; cleaning the dataset so that the features follow a normal distribution and contain no outliers improves the model's ability to detect anomalies. Finally, I introduced five anomalous samples and used the .pdf method to compute their probability densities.
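The accompanying notebook contains the full code; the sketch below reproduces the general steps with illustrative values (the seed, means, covariance, anomaly locations, and threshold are assumptions, not the original settings):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# 100 positive (normal) samples across two correlated features.
normal_samples = rng.multivariate_normal(
    mean=[0.0, 0.0], cov=[[1.0, 0.3], [0.3, 1.0]], size=100
)

# Fit the model on positive samples only.
mu = normal_samples.mean(axis=0)
cov = np.cov(normal_samples, rowvar=False)
mvn = multivariate_normal(mean=mu, cov=cov)

# Five anomalous samples placed far from the training cloud.
anomalies = rng.normal(loc=7.0, scale=0.5, size=(5, 2))

all_points = np.vstack([normal_samples, anomalies])
probs = mvn.pdf(all_points)

# The threshold is heuristic and must be tuned for the dataset at hand.
epsilon = 1e-6
flagged = all_points[probs < epsilon]
```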
The resulting scatterplot illustrates the relationship between the two features, with anomalies highlighted and colors indicating probabilities derived from the multivariate probability density function.
By adjusting our threshold low enough, we can effectively differentiate anomalies from expected values. The subsequent charts contrast epsilon values of 1x10^-7 and 1x10^-9, revealing that an epsilon of 1x10^-9 captures outliers more accurately, while 1x10^-7 mistakenly classifies some positive samples as anomalies.
This example simplifies epsilon selection due to the visual representation of anomalies. Now, let’s explore a scenario involving additional features.
Multivariate Model Implementation with Python
In this next example, I will utilize the wine dataset from the ODDS library. This dataset comprises 13 numerical features across 129 instances, providing insights into various wine characteristics. For anomaly detection, one target class was downsampled to serve as outliers, totaling 10 anomalies within the dataset (~8%). The dataset is relatively clean, with no missing values.
Our first step is to ensure the features conform to a Gaussian distribution. Where feasible, outliers should be removed, and the distribution normalized. In this dataset, four features naturally follow a normal distribution, while four others can be normalized through logarithmic transformation. I opted to exclude any remaining features that did not meet these criteria. Finally, outliers were removed by excluding instances with feature values exceeding two standard deviations from the mean. The remaining code mirrors the previous example.
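A rough sketch of that preprocessing, with synthetic stand-in data and placeholder column names instead of the actual wine features:

```python
import numpy as np
import pandas as pd

# Stand-in for the wine features; in practice these would be loaded from the ODDS dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    np.column_stack([
        rng.normal(loc=10.0, scale=2.0, size=(129, 4)),     # roughly Gaussian features
        rng.lognormal(mean=1.0, sigma=0.4, size=(129, 4)),  # skewed features
    ]),
    columns=[f"feat_{i}" for i in range(8)],
)

normal_cols = ["feat_0", "feat_1", "feat_2", "feat_3"]  # kept as-is
log_cols = ["feat_4", "feat_5", "feat_6", "feat_7"]     # log-transformed toward normality

X = pd.concat([df[normal_cols], np.log(df[log_cols])], axis=1)

# Drop rows with any feature more than two standard deviations from its mean.
z = (X - X.mean()) / X.std()
X_clean = X[(z.abs() <= 2).all(axis=1)]
```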
Unlike the prior two-feature illustration, visualizing results in a multi-dimensional space is impractical. Instead, we can rely on confusion matrix metrics (such as recall and precision) and the area under the ROC curve to determine the optimal epsilon for our use case.
There is often a trade-off between precision and recall, meaning the choice of epsilon is influenced by the sensitivity requirements of our analysis. For this example, I sought an epsilon that maximizes the area under the curve. Depending on the context, some applications may prioritize identifying as many anomalies as possible (even at the risk of including normal instances), while others may only seek to flag anomalies with high certainty.
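One way to run this comparison is to sweep candidate epsilon values and report the metrics for each. The helper below is a sketch, with hypothetical densities and labels standing in for the wine-model output:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

def evaluate_epsilons(probs, y_true, epsilons):
    """Compare candidate thresholds; a density below epsilon marks an anomaly."""
    for eps in epsilons:
        y_pred = (probs < eps).astype(int)
        precision = precision_score(y_true, y_pred, zero_division=0)
        recall = recall_score(y_true, y_pred, zero_division=0)
        print(f"epsilon={eps:.1e}  precision={precision:.2f}  recall={recall:.2f}")
    # Threshold-free summary: use the negated densities as anomaly scores.
    print("ROC AUC:", roc_auc_score(y_true, -probs))

# Hypothetical densities and labels (1 = anomaly), standing in for real model output.
probs = np.array([0.02, 0.5, 0.0004, 0.3, 0.001])
y_true = np.array([0, 0, 1, 0, 1])
evaluate_epsilons(probs, y_true, epsilons=[1e-2, 3.5e-3, 6.5e-3])
```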
As epsilon increases, recall improves, although precision remains relatively low across the proposed epsilon values, with local peaks around epsilon values of 0.0035 and 0.0065. The area under the curve balances precision and recall and peaks at an epsilon of approximately 0.0065. A review of the confusion matrix at that threshold reveals the model's performance.
The model accurately identifies nearly all anomalies while only missing one, which is commendable given the exclusion of a third of the features. However, it also misclassifies 40 positive instances as anomalies, necessitating manual verification for half of those cases.
To enhance this model, further feature engineering could be beneficial, along with seeking an epsilon value that is less sensitive to outliers. The remainder of this task is straightforward and left for you to explore further.
Potential Drawbacks of Gaussian Anomaly Detection
While the multivariate Gaussian distribution is a straightforward and efficient model for anomaly detection, it does have limitations that may hinder its application in certain scenarios.
Firstly, it may produce extremely small probability density values, which, although generally manageable for modern computing systems, can lead to numerical underflow in specific cases.
Secondly, ensuring features adhere to a normal distribution can be labor-intensive. Although proper feature engineering can mitigate this issue, the associated risks may deter some from pursuing this route.
Thirdly, this model does not accommodate categorical features. When categorical data is present, it necessitates creating separate models for each combination of categorical variables, which can be burdensome.
Lastly, the model operates under the assumption that all features hold equal significance and that no complex interrelationships exist among them. One approach to address this limitation is to develop the multivariate distribution probability density function from scratch, incorporating a parameter for feature importance. Additionally, further feature engineering may help capture relationships, although this process can be intricate and time-consuming.
Despite these drawbacks, employing multivariate Gaussian distribution for anomaly detection serves as an excellent starting point for tabular data analysis. It can establish a benchmark or prove to be a valuable tool for identifying anomalies, providing an intuitive understanding of anomaly detection.
Thank you for reading! I plan to create a comprehensive series on anomaly detection soon, so if this topic piques your interest, stay tuned.
Sources:
- https://www.kaggle.com/code/matheusfacure/semi-supervised-anomaly-detection-survey
- https://ai.googleblog.com/2023/02/unsupervised-and-semi-supervised.html
- Saket Sathe and Charu C. Aggarwal. LODES: Local Density meets Spectral Outlier Detection. SIAM Conference on Data Mining, 2016.
- Nakao, T., Hanaoka, S., Nomura, Y. et al. Unsupervised Deep Anomaly Detection in Chest Radiographs. J Digit Imaging 34, 418–427 (2021). https://doi.org/10.1007/s10278-020-00413-2
- https://github.com/viyaleta/Anomaly-Detection/blob/main/Examples/1%20Anomaly%20Detection%20with%20Guassian%20Distribution.ipynb
Math typesetting courtesy of Codecogs online LaTeX editor.
For source code examples, please refer to the accompanying Jupyter Notebook.