Time series anomaly detection with the “anomalize” library

This post was originally published by Mahbubul Alam at Towards Data Science

As with any other machine learning task, preparing the data is probably the most important step you can take towards anomaly detection. On the positive side, you’ll likely work with only one column at a time. So instead of juggling hundreds of features as in other machine learning techniques, you can focus on the single column being used for modeling.

Make sure you go through the usual ritual of data cleaning and preparation, such as handling missing values. One essential step is to make sure that the final dataset is a tibble or tbl_time object.
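
If your data starts out as a plain data frame, converting it is a one-liner with tibbletime. A minimal sketch, assuming a hypothetical data frame df with a date column:

# convert a plain data frame into a tbl_time object
# (df and its date column are hypothetical placeholders)
library(tibbletime)
df_tbl <- as_tbl_time(df, index = date)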

Let’s first load the libraries we are going to need (install any missing ones with install.packages() first):

# load libraries
library(anomalize)
library(tidyverse)
library(tibbletime)
library(tidyquant)

For this demo we are in luck: no data processing is required. We are going to fetch stock price data using the tidyquant library.

# fetch data
data <- tq_get("AAPL",
               from = "2019-09-01",
               to = "2020-02-28",
               get = "stock.prices")

# take a peek
head(data)

First, let’s implement anomalize with the data that we just fetched and then talk about what’s going on.

# anomalize
anomalized <- data %>%
  time_decompose(close) %>%
  anomalize(remainder) %>%
  time_recompose()

A few things are going on here: the pipeline takes the input data and applies three separate functions to it.

First, the time_decompose() function decomposes the “close” column of the time series into “observed”, “season”, “trend” and “remainder” components.

Second, the anomalize() function performs anomaly detection on the “remainder” column and adds three new columns: “remainder_l1”, “remainder_l2” and “anomaly”. The last column is what we are after: it reads “Yes” if the observation is an anomaly and “No” for a normal data point.

(Figure: outputs of the anomalize implementation)

The final function, time_recompose(), puts everything back together by recomposing the “season” and “trend” components with the remainder limits, adding “recomposed_l1” and “recomposed_l2” bounds around the observed values that are used for plotting.
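
Before plotting, it can be handy to pull out just the flagged rows. A quick sketch with standard dplyr verbs, assuming the index column is named “date” as it is in the tq_get() output:

# list only the observations flagged as anomalies
anomalized %>%
  filter(anomaly == "Yes") %>%
  select(date, observed, remainder, anomaly)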

For all intents and purposes, our anomaly detection is complete in the previous step. But we still need to visualize the data and the anomalies. Let’s do that and visually check out the outliers.

# plot data with anomalies
anomalized %>%
  plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.25) +
  labs(title = "AAPL Anomalies")

The figure is pretty intuitive: each dot is an observed data point in the dataset, and the red circles mark the anomalies identified by the model. The shaded area shows the recomposed upper and lower bounds; points falling outside it are flagged as anomalies.
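
If you also want to see the individual components behind each decision, anomalize provides plot_anomaly_decomposition(). A minimal sketch, run on the result before time_recompose():

# visualize the observed, season, trend and remainder components
data %>%
  time_decompose(close) %>%
  anomalize(remainder) %>%
  plot_anomaly_decomposition() +
  labs(title = "AAPL Anomaly Decomposition")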

If you have come along this far, you have successfully implemented a sophisticated anomaly detection technique in three simple steps. That was easy because we used the default parameters and didn’t change anything. As the figure above shows, this out-of-the-box model performed pretty well in detecting outliers. However, you might come across complex time series data that requires better model performance, which means tuning the parameters in step 2. You can read the model documentation and the quick start guide to get a sense of the parameters, what they do, and how and when to change them; a small tuning sketch follows below.
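
As an illustration of where those parameters live (the argument names come from the package, but the values below are placeholders you would tune for your own data):

# tuned variant: switch the detection method and tighten the anomaly budget
anomalized_tuned <- data %>%
  time_decompose(close, method = "stl", frequency = "auto", trend = "auto") %>%
  anomalize(remainder, method = "gesd", alpha = 0.025, max_anoms = 0.1) %>%
  time_recompose()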
