The DevOps team at Domotz is constantly working to keep the company's cloud infrastructure reliable. That’s a crucial part of the service quality experienced by our customers. This work requires the ability to find and remediate production issues promptly.
We monitor system parameters such as CPU, memory, number of processes, network latency and so on. The large volume of data collected, which produces historical trends, can be noisy. As a result, simple threshold-based alerts may trigger many false positives.
We’ve started experimenting with anomaly detection algorithms that identify patterns differing from the expected system behavior, and integrating them into our infrastructure.
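To make the false-positive problem concrete, here is a minimal sketch (not our production code) that compares a fixed threshold with a rolling z-score on a made-up CPU series. The function names, the window size, the limits and the sample data are all illustrative assumptions.

```python
from statistics import mean, stdev

def threshold_alerts(series, limit):
    """Return the indices whose value exceeds a fixed limit."""
    return [i for i, value in enumerate(series) if value > limit]

def zscore_alerts(series, window=5, z_limit=3.0, min_sigma=1.0):
    """Return the indices that deviate strongly from the trailing window.

    min_sigma is a floor on the standard deviation, so that tiny jitter
    on a very flat series is not reported as an anomaly.
    """
    alerts = []
    for i in range(window, len(series)):
        past = series[i - window:i]
        mu = mean(past)
        sigma = max(stdev(past), min_sigma)
        if abs(series[i] - mu) / sigma > z_limit:
            alerts.append(i)
    return alerts

# Hypothetical CPU usage (%) of a host that is normally busy,
# with a single real spike at index 8.
cpu = [82, 84, 83, 85, 84, 83, 85, 84, 99, 84, 83, 85]

print(threshold_alerts(cpu, limit=80))  # every sample is above the limit
print(zscore_alerts(cpu))               # only the spike stands out
```

On this series the fixed 80% limit fires on every sample, while the rolling check reports only the spike. The trade-off is that the rolling check needs a few samples of history before it can say anything.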
Saltstack implementations of Anomaly Detection
At Domotz we use Saltstack to manage our cloud infrastructure. In particular, operations such as configuring new virtual machines and deploying and configuring services in production are performed via Saltstack. We have also started leveraging its event-driven system to monitor and manage our cloud infrastructure.
Saltstack provides an event-driven infrastructure that raises events tied to the system parameters you choose to monitor.
We extended the Saltstack event-driven infrastructure to perform advanced anomaly detection with machine learning models.
We have followed two different approaches with Saltstack:
Approach 1: minion oriented
We implemented custom Saltstack ‘beacons’ that run anomaly detection algorithms on monitored resources. Our time series anomaly detection is based on the Luminol Python library by LinkedIn.
Approach 2: master oriented
We adopted the Saltstack/Umbra project to define machine learning pipelines for monitored resources. Umbra leverages the PyOD Python library, which offers several state-of-the-art outlier detection algorithms.
See how Domotz implemented AIOps
At SaltConf19, Giancarlo Fanelli, our CTO, and Massimiliano Cuzzoli, our Head of Cloud and System Engineering, led a breakout session on how Domotz uses SaltStack to deploy features commonly found in AIOps (Artificial Intelligence for IT Operations), specifically anomaly detection and root cause analysis. Check out the video below…