Are Anomaly Detection and Outlier Detection the same thing?
Among machine learning practitioners, there is an ongoing debate about how to categorize two rogue but quite important occurrences: anomalies and outliers.
While anomalies are typically undesirable in most industrial, computational, and other processes, in machine learning they are essential for teaching a system to perform better.
Now, while outliers and anomalies may appear similar, there’s a subtle difference between the two.
Anyone attempting to help a machine make its baby steps toward full, error-free functionality must be aware of this difference, lest they end up torpedoing their robot friend’s career before it even started.
In this article, we’re going to explain this difference in slightly more detail, so you can have a clearer idea of why these two terms matter and whether they’re interchangeable.
Here’s the deal.
What are anomalies?
Computers like logic and 0s and 1s.
You can write hundreds of pages of code for an app as a programmer, but if you’ve missed a single ‘_’, the entire thing won’t work.
The thing is, machine learning takes how we view computers to a whole new level. Now, there are computers that are built to take an interest when a line of code is missing or when some part of it looks fishy.
In data science, an anomaly is some part of a process that isn’t going the way it is supposed to. This could be because of rough weather, cyber terrorism, some part of the system suddenly malfunctioning on its own, you name it.
The reason anomalies are of such major interest to analysts and data scientists is that once they have been detected, they lose much of their potential to cause trouble (provided a solution is devised to tackle them, of course).
This simple principle, which can be quite complex to pull off in practice, is the basis of what is known as anomaly detection.
What is anomaly detection?
An anomaly could show up as a suspicious-looking credit card withdrawal history belonging to a person suspected of wrongdoing.
Or a Tic Tac-shaped UFO that a security camera at a military base caught the night before.
Or a blast furnace that’s inexplicably going cold all the time in a smelting plant.
Anomalies occur in pretty much any automated process, and detecting and thoroughly documenting them has become the obsession of many data scientists in recent decades.
This is for a good reason, too.
The thing is, if you can amass a large enough pool of anomalies that you’ve already recorded and tackled, you can prevent whatever caused them from disturbing the process again in the future.
This way, the maintenance costs of many industrial, banking, military, IT, and other processes can be dramatically cut. At the same time, the security of facilities and research centers that run these processes can be significantly improved.
What are outliers?
In a way, outliers represent the meta version of anomalies.
While anomalies are distortions from what is considered a normal process, outliers represent distortions within the system that is recording the anomalies.
Here’s how this works.
For an anomaly to be found, an analyst gathers a bunch of data. Based on specific patterns of ‘behavior’ of a machine, let’s say this scientist knows what is within the limits of normal operation and what can be considered an anomaly.
Now, if a certain value within this system is dramatically out of line, it’s called an outlier. This outlier value can either be way too high or too low in comparison to other points of data. There are two possible reasons why outliers occur:
1) an error within the measuring system itself, or 2) an anomaly.
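To make "way too high or too low" concrete, here is a minimal sketch of one common way to flag such values, the interquartile range (IQR) rule. The article itself does not prescribe a method, so the function name, the multiplier k=1.5, and the furnace temperature readings are all assumptions made up for illustration.

```python
import statistics

def find_outliers_iqr(values, k=1.5):
    """Return (index, value) pairs that fall outside the IQR fences."""
    # Quartiles of the data (default 'exclusive' method of the stdlib)
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(i, v) for i, v in enumerate(values) if v < low or v > high]

# Hypothetical furnace temperature readings in degrees Celsius
readings = [1602, 1598, 1605, 1600, 1597, 1603, 1950, 1601, 1599, 1210]
outliers = find_outliers_iqr(readings)
print(outliers)  # the reading that is far too high and the one far too low
```

Note that the code only says which values are out of line. It cannot tell you whether each one is a measuring error or a real anomaly; that is still a judgment call.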
The important thing here is the decision that the data scientist needs to make: whether to discard this outlier as just a bug within the measuring system or to consider it a legitimate recording of a potential anomaly.
If you do treat an outlier as an anomaly and use it in your computations, the results you get can be drastically different from what you would have had if you chose to ignore it.
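A small sketch can show how much one value can move the results. It assumes a hypothetical set of temperature readings and a simple "mean plus three standard deviations" alarm threshold; neither comes from a real system, and a real pipeline may well use a different rule.

```python
import statistics

# Hypothetical readings; 1950 is the suspect outlier
readings = [1602, 1598, 1605, 1600, 1597, 1603, 1950, 1601, 1599]
suspect = 1950

def summarize(values):
    """Return (mean, standard deviation, alarm threshold) for the values."""
    mean = statistics.mean(values)
    spread = statistics.stdev(values)
    # Assumed alarm rule: anything above mean + 3 standard deviations
    return mean, spread, mean + 3 * spread

kept = summarize(readings)
dropped = summarize([v for v in readings if v != suspect])

print("outlier kept:    mean=%.1f stdev=%.1f threshold=%.1f" % kept)
print("outlier dropped: mean=%.1f stdev=%.1f threshold=%.1f" % dropped)
```

In this toy data, keeping the suspect value inflates the spread so much that the alarm threshold ends up above the suspect reading itself, so a similar spike would no longer raise an alarm. Dropping it gives a much tighter threshold.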
Failing to detect an outlier can lead to the whole anomaly detection process failing in a rather grand fashion.
Since detecting anomalies is a part of a machine-learning strategy to make machines more efficient, using wrong, outlier-infested data can make all that number-crunching useless.
Worse still, it can give you a false sense of having made progress in making a process more stable and safer while, in reality, making it considerably less safe and more prone to failures.
So, the main purpose of outlier detection is to improve the accuracy of the whole anomaly detection system.
Let’s say an AI-driven metal process control system is undergoing anomaly testing.
If the basic oxygen furnace, for example, is giving off too much heat, this might be a sign of an anomaly. To start the process of determining how to prevent a similar anomaly in the future (if this is an anomaly to begin with), it is necessary to figure out what the outlier values are.
Next, the scientists who conduct this analysis need to decide whether to ignore the outlier spikes they detected or to incorporate some of them as legitimate anomaly values.
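How that decision is made depends on the plant and the sensors, and the article does not name a rule. As one hedged illustration, the sketch below assumes a simple heuristic: a short out-of-range blip is treated as a likely sensor glitch, while a sustained excursion is kept as a candidate anomaly. The function name, the normal range, and the min_run value are all hypothetical.

```python
def classify_flagged(readings, low, high, min_run=3):
    """Label each reading by how long its out-of-range excursion lasts.

    Assumption (not from the article): an excursion shorter than min_run
    samples is a likely glitch; a longer one is a candidate anomaly.
    """
    labels = ["normal"] * len(readings)
    i = 0
    while i < len(readings):
        if low <= readings[i] <= high:
            i += 1
            continue
        # Find the end of this run of out-of-range readings
        j = i
        while j < len(readings) and not (low <= readings[j] <= high):
            j += 1
        label = "candidate anomaly" if j - i >= min_run else "likely glitch"
        for k in range(i, j):
            labels[k] = label
        i = j
    return labels

# Hypothetical furnace temperatures: one isolated spike, one sustained rise
temps = [1600, 1602, 1950, 1601, 1599, 1700, 1710, 1725, 1730, 1603]
labels = classify_flagged(temps, low=1580, high=1620)
for t, label in zip(temps, labels):
    print(t, label)
```

A heuristic like this is only a starting point. An analyst would still want to check flagged values against what is known about the furnace and the measuring equipment.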
Figuring out and recording where the outliers are first is essential to closing in on the anomalies accurately.
Figuring out why your steel plates have come out of the furnace all weird can be much easier if you have a clear idea of the anomaly range for that part of the production process. (If it is steel plates you’re making, of course.)
All in all, and to reiterate, there is a difference between anomaly detection and outlier detection. Sometimes these terms are used interchangeably to no grand negative effect, especially if anomaly detection is described and discussed in broader terms without getting into details.
That said, anyone who wants to understand machine learning and learn about it, as well as discuss it in greater detail, should be aware of the aforementioned difference between outliers and anomalies.
Rick Seidl is a digital marketing specialist with a bachelor’s degree in Digital Media and communications, based in Portland, Oregon. He carries a burning passion for digital marketing, social media, small business development, and establishing its presence in a digital world, and is currently quenching his thirst through writing about digital marketing and business strategies for Find Digital Agency.