Anomaly Detection with Sci-kit Learn, Keras and TensorFlow

Anomaly detection in time-series data

Background:

Consider an unsteady heat transfer problem, with a solid plate that sits in between two turbulent media with different temperatures. The rate of change of temperature of the plate is driven by the weighted temperature difference of the two gaseous streams, and the weights -heat transfer coefficients- depend on other variables, such as the air velocity and pressure.

Our system is equipped with sensors, and we want to know whether an anomaly occurred at some point (a heat transfer change due to loss of insulation, a sensor that started giving wrong values, ..). We cannot observe this directly from the measured data. An essential difficulty for a traditional, physical model-based approach, is that our observed and unobserved variables have relationships that are not exactly known to us (for instance, the relationship between the heat transfer coefficient and variables such as temperature, pressure and velocity).

Nevertheless, we would like to be able to detect a hidden change in our system -in this case, a reduction in a heat transfer coefficient of 10%-. We will train a machine-learning model to learn how the observed variables are connected, such that it can detect when something does not look quite right anymore.

As a first step, let's generate data that follows from the following non-dimensionalized ordinary differential equation:

\begin{gather} t_{ref} \frac{dy}{dt} &= \eta_0 \theta_0 + \eta_1 \theta_1 - y(\eta_1 + \eta_0) \\ \end{gather}

Where the $\eta$'s denote the heat transfer coefficients, the $\theta$'s the temperatures (at both sides, indicated with 0 and 1), and $t_{ref}$ the time constant (that depends, among other things, on the material properties and thickness). To give some physical intuition:

  • when both $\eta$'s are identical, the steady-state solution is: $y = (\theta_0 + \theta_1)/2$. Note that this is also a stable solution: whenever y is larger, the rate of change becomes negative, and vice versa
  • More general, the steady-state solution is: $y = \left(\eta_0 \theta_0 + \eta_1 \theta_1\right) / \left(\eta_0 + \eta_1 \right)$

In case $\eta_0$ is much larger than $\eta_1$, the temperature $y$ will tend to $\theta_0$.

Note that this functional form applies to many systems that are governed by gradient-driven transport.

The heat transfer coefficients cannot be directly measured, but depend on other variables that can be measured and are therefore known - those are collectively called $\mathbf{x}$ (e.g., temperatures, pressures and velocities), that vary with time:

\begin{align} \eta_0 = \eta(\mathbf{x}(t)), \qquad \eta_1 = \pi(t) \cdot \eta(\mathbf{x}(t)) \end{align} where $\pi$ is the function that suddenly changes from 1 to a different value, at $t=t_{anomaly}$.

The temperature is supposed to be known on one side, and unknown but dependent on known variables on the other side:

\begin{align} \theta_0 = \theta_0(t), \qquad \theta_1 = \theta_1(\mathbf{x}(t)) \end{align}

A typical dependency of the heat transfer coefficients is that they go proportional to several non-dimensional quantities raised to some power. Although this will not be demonstrated here, when these exact exponents are not known, it is very difficult to reconstruct the occurrence of the anomaly. Machine-based learning, on the other hand, can be quite robust - if we have enough data to train our model with -.

We will use two approaches to this problem: a Gaussian Process regression -that offers the advantage of estimating confidence intervals for our predictions-, and a neural network, using keras with a Tensorflow back-end.

In [1]:
%matplotlib inline
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

1. Setting some parameters

In [2]:
N = int(1E5) # number of timesteps
t_anomaly = 0.8 # fraction of N where anomaly occurs
diff_anomaly = -0.1 # the difference of the coefficient that occurs at t_anomaly (-0.1: 10% reduction)
t_train = 0.6 # the first fraction of the data we can train our model on. 
N_events = 100 # The number of changes that occur in the target values for the Ornstein-Uhlenbeck process that generates X
noise_level = 1.E-4 # Additive white noise on top of the measured signals. 1.E-4: very high SNR, clear detection. 
# 1.E-3: decent detection. 1.E-2: pushing it 

2. Generating the independent (observed) variables

To generate data, we start with piecewise linear target functions of $\mathbf{x}$. Using an Ornstein-Uhlenbeck scheme, we introduce inertia in the response and add some noise in the variable values.

In [3]:
def generate_random_pwlin_curves(N, n_x, y_min, y_max, seed, stoch=True, theta=0.01, sigma=0.01):
    """
    Returns piecewise linear time-series. 
    If stoch == True: additionally applies an Ornstein-Uhlenbeck process with the values 
    from the piecewise linear series as target values. 
    """
    np.random.seed(seed)
    y = np.zeros(N)
    # Generate random times at which the series will change its value
    t_delta = np.array([round(xd) for xd in np.random.random(n_x) * N])
    # First time: 0, then sort the random times
    t_delta_all = np.concatenate( (np.array([0]), np.sort(t_delta), np.array([N]) ) )
    
    # Generate the piecewise linear arrays by assigning constant values between the times
    for i in range(n_x + 1):
        y[int(t_delta_all[i]) : int(t_delta_all[i+1])] = y_min + np.random.random() * (y_max - y_min)
        
    if stoch:
        ## Ornstein Uhlenbeck process: reverting to the "target" (y), with added noise
        noise = np.random.normal(loc=0.0, scale=sigma, size=N) #define noise process
        drift = lambda y, mu: theta * (mu - y) # drift term. theta: the larger, the quicker 
        # the values revert to target
        mu = y.copy() # mu is now the target value for the drift process
        
        # Iterate through time
        for i in range(1, N):
            y[i] = y[i-1] + drift(y[i-1], mu[i-1]) + noise[i]
    return y
    
    
In [4]:
# Generate the x-matrix, with the row-index corresponding to time:
X = np.empty((N, 5))
# Generate the values
X[:,0] = generate_random_pwlin_curves(N, N_events, 1.0, 2.5, seed=79) 
X[:,1] = generate_random_pwlin_curves(N, N_events, 2.0, 5, seed=12) 
X[:,2] = generate_random_pwlin_curves(N, N_events, 0, 1, seed=13) 
X[:,3] = generate_random_pwlin_curves(N, N_events, 0.5, 2, seed=443) 
X[:,4] = generate_random_pwlin_curves(N, N_events, 0.5, 1.5, seed=123) 
In [5]:
matplotlib.rcParams.update({'font.size': 14})

# Plot the independent variables
fig, axs = plt.subplots(2,3,figsize=(15,6))
for i in range(len(axs.flat)):
    try:
        axs.flat[i].plot(X[:,i], lw=0.5)
        axs.flat[i].set_xlabel('Time [-]')
        axs.flat[i].set_ylabel('Value [-]')
        axs.flat[i].ticklabel_format(style='sci', axis='x', scilimits=(0,0))

    except IndexError:  # When exceeding the number of columns of X, make an empty axis
        ax=axs.flat[i]
        axs.flat[i].set_axis_off()
        
plt.suptitle('Time traces of the independent variables X', fontsize=16)
plt.tight_layout(pad=2)

3. Generating the hidden variables

The hidden ("latent") variables are the non-observed ingredients of the model. Some arbitrary dependence is assumed, loosely inspired on correlations for the turbulent heat transfer coefficient.

In [6]:
# Set some variables
t_ref = 10 # 1/(rho cp t)
p = np.ones(N)
p[round(N * t_anomaly)+1:] = 1 + diff_anomaly #This is the anomaly: change in the coefficient from 1 to below 1

# Define the functions that determine theta_1(x), eta_0(x) and eta_1(x):
fun_theta1 = lambda x1, x2 : x1 * (0.775 - 0.1*x2**2 + 0.7*x2**3)
fun_eta = lambda x1, x2, x3 : x1 * x2**0.628 * x3**-0.23

# Calculate the T1 and HTC's
theta_1 = fun_theta1(X[:,1], X[:,2])
eta_0 = fun_eta(X[:,3], X[:,4], X[:,0])
eta_1 = p * fun_eta(X[:,3], X[:,4], theta_1)

# Generate sensor values y
y = np.zeros(N)
y[0] = (eta_0[0] * X[0,0] + eta_1[0] * theta_1[1]) / (eta_0[0] + eta_1[0]) #Start with steady value

# Generate noise
eps = 1E-5 * np.random.random(len(y))

# Iterate through y using the difference equation
for i in range(N-1):
    y[i+1] = y[i] + 1/t_ref * (eta_1[i]*(theta_1[i] - y[i]) + eta_0[i]*(X[i,0] - y[i])) + eps[i]
   
In [7]:
# Add noise to both X and y
X += np.random.normal(loc=0.0, scale=noise_level, size=X.shape) 
y += np.random.normal(loc=0.0, scale=noise_level, size=y.shape)
In [8]:
# Zoom in on some intervals, to show that y (the solid temperature) tends to a value in between T0 and T1
fig, (ax1, ax2) = plt.subplots(2,1, figsize=(14,5))

# 1: before anomaly
t1, t2 = 37000, 41000
ax1.plot(range(t1,t2), theta_1[t1:t2],'k', label='theta_1')
ax1.plot(range(t1,t2), y[t1:t2], lw=3, label='y')
ax1.plot(range(t1,t2), X[t1:t2, 0],'g', label='theta_0')
ax1.legend(fancybox=True, framealpha=0.8, loc=2)
ax1.set_title('Temperatures, before anomaly', fontsize=16);

# 2: around anomaly
t1, t2 = int(N*t_anomaly-2000), int(N*t_anomaly+2000)
ax2.plot(range(t1,t2), theta_1[t1:t2],'k', label='theta_1')
ax2.plot(range(t1,t2), y[t1:t2], lw=3, label='y')
ax2.plot(range(t1,t2), X[t1:t2, 0],'g', label='theta_0')
ax2.vlines(x=N*t_anomaly, ymin=ax2.get_ylim()[0], ymax=ax2.get_ylim()[1], linestyle='--', color='r', label='start of anomaly')
ax2.legend(fancybox=True, framealpha=0.8, loc=2)

ax2.set_title('Temperatures, around anomaly', fontsize=16);
ax2.set_xlabel('Time [-]')
plt.tight_layout()

The top plot shows how $y$ tends to a value that is in between $\theta_0$ and $\theta_1$, before the anomaly.

The bottom plot zooms in around the anomaly at t=80'000. We really cannot easily detect the event visually, because all variables are continuously changing. Let's see if a machine learning model can help us detect the event

4. Generate the features (X) and target (y)

In [9]:
# The target variable: y
y_target = y

# Create features from X, adding an extra 6th column for dy/dt
X_features = np.empty((N, 6))
X_features[:, :5] = X.copy()
X_features[:, 5] = np.concatenate((np.diff(y), np.array([0])))

# Scale the features to have zero mean and unit variance. 
rr = StandardScaler()
rr.fit(X_features)
X_features = rr.transform(X_features)

5. Gaussian Process Regressor from Sci-kit learn

In [10]:
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

We will train a Gaussian Process regression model, to predict the temperature $y$ at t based on the measured properties at $t$ and the change between $y(t+1)$ and $y(t)$.

Note that this not predictive modelling in the traditional sense for time-series (future prediction); we are merely reconstructing the missing piece ($y(t)$) from all the other ingredients that together make up the differential equation at time $t$.

Training a GP model is computationally demanding for large N, with the training computational effort scaling with ~N^3. Sci-kit learn's GP model has an internal optimizer, that yields hyperparameters with maximum-likelihood maximization. Finding this optimum is additional effort. The found optimum for the length scale is 5.4 (note: we are dealing with 6 dimensions) which is given explicitly, along with

optimizer=None

thus avoiding the optimization routine for the demo.

For this smooth problem, the common Gaussian/RBF kernel is well-suited.

In [11]:
N_1, train_step, test_step = int(t_train * N), 40, 10
X_train, y_train = X_features[0:N_1:train_step, :], y_target[0:N_1:train_step] # subsample features
X_test, y_test = X_features[::test_step, :], y_target[::test_step] # predict on entire range, subset later for test accuracy
In [12]:
X_train.shape
Out[12]:
(1500, 6)
In [13]:
GP = GaussianProcessRegressor(kernel=RBF(length_scale=5.4), alpha=noise_level, optimizer=None )
GP.fit(X_train, y_train)
Out[13]:
GaussianProcessRegressor(alpha=0.0001, copy_X_train=True,
             kernel=RBF(length_scale=5.4), n_restarts_optimizer=0,
             normalize_y=False, optimizer=None, random_state=None)
In [14]:
GP = GaussianProcessRegressor(kernel=RBF(length_scale=5.4), n_restarts_optimizer=1, alpha=noise_level)
GP.fit(X_train, y_train)
Out[14]:
GaussianProcessRegressor(alpha=0.0001, copy_X_train=True,
             kernel=RBF(length_scale=5.4), n_restarts_optimizer=1,
             normalize_y=False, optimizer='fmin_l_bfgs_b',
             random_state=None)
In [15]:
y_pred, sigma = GP.predict(X_test, return_std=True) # Also return the estimated uncertainty of the prediction
In [16]:
plt.figure(figsize=(14,8))
plt.plot(range(0, N, test_step), y_test, lw=2, label='actual values')
plt.plot(range(0, N, test_step), y_pred,'k-', label='predicted values')
plt.fill_between(range(0, N, test_step), y_pred- 2*sigma, y_pred+2*sigma, 
                 color='k', alpha=0.2, label='95% conf.int.')
plt.vlines(x=N_1, ymin=1, ymax=5,  color='k', linestyle='--', label='end of training data')
plt.vlines(x=N*t_anomaly, ymin=1, ymax=5, color='r', linestyle='--', label='start of anomaly')
plt.xlim([N_1 - 10000, N*t_anomaly + 10000])
plt.legend()

plt.ylabel('normalized temperature [-]')
plt.xlabel('time [-]')
plt.title('Predictions versus actual y-values', fontsize=16);
In [17]:
plt.figure(figsize=(14,8))
plt.plot(range(0, N, test_step), y_pred - y_test, label='prediction-actual')
plt.fill_between(range(0, N, test_step), -2*sigma, 2*sigma, color='k', alpha=0.5, label='95% conf.int.')
plt.vlines(x=N_1, ymin=-2*np.max(sigma), ymax=2*np.max(sigma), linestyle='--')
plt.vlines(x=N*t_anomaly, ymin=-2*np.max(sigma), ymax=2*np.max(sigma), color='r',
           linestyle='--', label='start of anomaly')
plt.xlim([N_1 - 10000, N*t_anomaly + 10000])
plt.legend()

plt.ylabel('predicted - actual [-]')
plt.xlabel('time [-]', fontsize=14)
plt.title('Residuals and model uncertainty', fontsize=16);

This plot shows the large advantage of Gaussian Process regression: our model indicates how certain it is regarding its predictions. Before the start of the test data (before black dashed line), the uncertainty band is rather narrow: the model has been trained on this data (with subsampling, which is why there is still some variation in the uncertainty within the training domain).

In the first part of the test-data, before the occurrence of the anomaly (red dashed line), the error increases. Since our features have 6 dimensions, and we had a total of 100 "set-points", we did not quite cover the entire parameter space. However, the model indicates this perfectly well: the uncertainty becomes much larger.

In the final part, the predictions generally lie far outside of our confidence interval. Let's calculate the log-likelihood of our data given our model's prediction and uncertainty, and plot this against time.

In [18]:
import scipy.stats
eps = 1E-12 # eps to avoid -inf
LL = np.log(eps + scipy.stats.norm(y_pred, sigma).pdf(y_target[::test_step])) 
plt.figure(figsize=(14,8))
plt.plot(range(0, N, test_step), LL, label='Log-likelihood')
plt.vlines(x=N_1, ymin=min(LL), ymax=10, linestyle='--', label='end of training data')
plt.vlines(x=N*t_anomaly, ymin=min(LL), ymax=10, color='r', linestyle='--', label='start of anomaly')
plt.xlim([N_1 - 10000, N*t_anomaly + 10000])
plt.legend()

plt.ylabel('Log-likelihood [-]')
plt.xlabel('time [-]')
plt.title('Log-likelihood of data given prediction', fontsize=16);

The log-likelihood shows a slight degradation after the start of unseen data, but a drastic jump at $t_{novelty}$. Since the change is so drastic, one can easily automate detection by adding a moving average and thresholding.

In [19]:
# Calculate the R-squared and MSE of the GP model
R2_GP = r2_score(y_pred[N_1//test_step:int(N*t_anomaly)//test_step], 
                 y_test[N_1//test_step:int(N*t_anomaly)//test_step])
MSE_GP = mean_squared_error(y_pred[N_1//test_step:int(N*t_anomaly)//test_step], 
                            y_test[N_1//test_step:int(N*t_anomaly)//test_step])
In [20]:
# save the residuals for later use
res_GP = y_pred - y_test

6. Artificial neural network with Keras and TensorFlow

In [21]:
import keras
from time import time
from keras.models import Sequential
from keras.layers import Dense
from keras import callbacks
from keras.callbacks import TensorBoard
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
Using TensorFlow backend.

As an alternative, we will train a neural network to do the same task. The effort for training a neural network goes linear with the number of training points; we do not to undersample and just use all training data poitns.

In [22]:
X_train, y_train = X_features[:N_1, :], y_target[:N_1]
X_test, y_test = X_features[:, :], y_target[:]
In [23]:
def twolayer_model(activation_1='relu', activation_2='linear'):
    model = Sequential()
    model.add(Dense(6, input_dim=6, kernel_initializer='normal', activation=activation_1))
    model.add(Dense(1,  kernel_initializer='normal', activation=activation_2))
    model.compile(loss='mean_squared_error',optimizer='adam', metrics=['mean_squared_error'])
    return model
In [24]:
model = twolayer_model(activation_1='sigmoid', activation_2='linear')
# The re-lu gives somewhat poorer accuracy
In [25]:
tensorboard = TensorBoard(log_dir="logs/{}".format(time()))
# The tensorboard object will write data for visualization with tensorboard. 
# start tensorboard from the terminal with:
# $ tensorboard --log_dir=logs/
model.fit(X_train, y_train, epochs=200, batch_size=100, verbose=0, callbacks=[tensorboard])
Out[25]:
<keras.callbacks.History at 0x1a2175b240>
In [26]:
y_pred = model.predict(X_test)[:, 0] #keras returns np.ndarray. Take the first column
In [27]:
plt.figure(figsize=(14, 8))
plt.plot(range(len(y_test)), y_test, lw=3, label='actual values')
plt.plot(range(len(y_test)), y_pred, 'k-', label='prediction')
plt.vlines(x=N_1, ymin=0, ymax=5, color='k', linestyle='--')
plt.vlines(x=N*t_anomaly, ymin=0, ymax=5, color='r', linestyle='--')
plt.xlim([N_1 - 10000, N*t_anomaly + 10000])
plt.legend()

plt.ylabel('normalized temperature [-]', fontsize=14)
plt.xlabel('time [-]', fontsize=14)
plt.title('Predictions versus actual y-values', fontsize=14);
In [28]:
fig = plt.figure(figsize=(14,8))
plt.plot(range(len(y_pred)), y_pred - y_test, label='Residuals, NN')
plt.plot(range(0, N, test_step), res_GP, 'k', linestyle=':', label='Residuals, GP', lw=2)
ax = plt.gca()
plt.xlim([N_1 - 10000, N*t_anomaly + 10000])
ymin, ymax = ax.get_ylim()
xmin, xmax = ax.get_xlim()

plt.vlines(x=N_1, ymin=ymin,  ymax=ymax, linestyle='--')
plt.vlines(x=N*t_anomaly, ymin=ymin,  ymax=ymax, color='r', linestyle='--')
plt.hlines(y=0, xmin=xmin, xmax=xmax)
plt.legend()
plt.ylabel('predicted - actual [-]')
plt.xlabel('time [-]')
plt.title('Residuals, GP versus NN', fontsize=16);

Since we do not have an error for each prediction, as in the GP case, we cannot calculate a log-likelihood. It is therefore not possible to distinguish between mismatch because of model uncertainty or because of the anomaly.

In [29]:
# Determine the R-squared and MSE of the neural network model
R2_NN = r2_score(y_pred[N_1:int(N*t_anomaly)], y_test[N_1:int(N*t_anomaly)])
MSE_NN = mean_squared_error(y_pred[N_1:int(N*t_anomaly)], y_test[N_1:int(N*t_anomaly)])
In [30]:
#Compare the GP and NN accuracies:
print('   Model: \t GP \t \t  NN')
print('R-squared \t {:.4f} \t {:.4f}'.format(R2_GP, R2_NN))
print('MSE       \t {:.6f} \t {:.6f}'.format(MSE_GP, MSE_NN))
   Model: 	 GP 	 	  NN
R-squared 	 0.9993 	 0.9991
MSE       	 0.000155 	 0.000187

Concluding:

Both the Gaussian process model and the 2-layer neural network can easily identify the anomaly, that is a mere 10% reduction of a coefficient. Both models succeed to capture the differential equation very well, having good predictive accuracy (R^2 exceeding 99% on test data).

The Gaussian process regressor offers a measure for its prediction uncertainty, which greatly helps detection because it allows for determination of the log-likelihood of the data given the model. This log-likelihood drastically reduces after the anomaly.

Comments