Dimensionality reduction with PCA and SVD

Intuition and mathematics behind PCA

Principal Component Analysis (PCA) is a commonly used method for dimensionality reduction. It is closely related to Singular Value Decomposition (SVD). The aim of this post is to give an intuition on how PCA works, go through the linear algebra behind it, and to illustrate some key properties of the transform.

Data sets often have many (hundreds or even thousands) features or dimensions, typically with many redundancies (i.e.: correlating variables). PCA is a technique based on linear algebra that aims to represent the data as compactly as possible, by introducing a "change of basis". The following text was largely based on this excellent article by Jonathan Shlens. I will use a somewhat different notation/convention though:

  • A data matrix has a shape nxD (or nxR), with n the amount of observations, D (R) the number of (reduced) dimensions
  • The transformed data matrix is called T

Furthermore, note that an unbiased estimator for the variance has $n-1$ in the denominator (in short; because we don't use the true mean but the empirical mean, we underestimate the variance by $(n-1)/n$. I will nevertheless use $n$ to keep the notation short).

The main concepts behind PCA are:

  • Significance of the covariance matrix

When our data is centered (in our $nxD$ convention, meaning that the columns have zero mean), we can calculate the covariance matrix $C_x \equiv 1/n \cdot X^T X$. Large off-diagonal values in $C_x$ signal redundancy. Think for instance of a two-dimensional, diagonal scatter around the line $x=y$: the covariance matrix will have off-diagonal components nearly as large as diagonal components. By rotating our basis by 45°, the off-diagonal components will disappear and we can describe our data quite well with only one component (you have surely seen sketches of this).

This illustrates the main aim of PCA: to transform our data such that the covariance matrix of our transformed data is diagonal. Finally, we want our entries on the diagonal to be ordered from large to smaller, such that we can cut off our data (reduce $D$ to $R$) with as little loss of signal as possible.

  • Covariance matrix of transformed data

Introducing the transformation $T = X \cdot P$, the covariance matrix of our transformed data $T$ is:

$\begin{eqnarray} C_T & \equiv & 1/n \cdot (T^T T) \\ &=& 1/n \cdot (X P)^T (XP) \\ &=& 1/n \cdot P^T X^T X P \\ &=& P^T C_X P \end{eqnarray}$

Which is a simple manipulation of our original covariance matrix $C_X$.

  • Diagonalizing the covariance matrix

A covariance matrix is symmetric: the covariance operator $1/n <x, y>$ does not care about the order of $x$ and $y$. As such we can flip the row and column indices. Any symmetric matrix is diagonalized by a matrix consisting of its eigenvalues, and these eigenvalues are by necessity orthogonal. Because the proof that demonstrates orthogonality is nice and simple, I included it below. The diagonalization of a symmetric matrix $A$ is as follows:

$A = E D E^T$, with $D$ diagonal, and $E$ containing the orthonormal (orthogonal, and unit length) eigenvectors, column-wise. Note that, because $E$ consists of orthonormal vectors, $E E^T = I$, so $E^{-1} = E^T$.

Now the question is: what transformation matrix $P$ will result in $C_T$ being a diagonal matrix? The solution is: choose $P = E$, so the columns in $P$ are also the eigenvectors of $X^TX$:

$\begin{eqnarray} C_T &=& P^T C_X P \\ & =& P^T E D E^T P \\ & =& E^T E D E^T E \\& =& D \end{eqnarray}$

So, concluding:

  • We center our data set (column means are zero), to obtain $X$
  • We compute the covariance matrix $C_x$, its eigenvectors ${\mathbf{u_1}, \mathbf{u_2}, ... \mathbf{u_D}}$ corresponding to the descendingly ordered eigenvalues $\lambda_1, \lambda_2, ..\lambda_D$.
  • We multiply our centered data matrix $X$ ($nxD$) by a $DxD$ matrix $P$: $[\mathbf{u_1}, \mathbf{u_2}, ... \mathbf{u_D}]$ to get the measurements in our new space $T$.
  • The $DxD$ covariance matrix of our data matrix $T$ is diagonal, containing $\lambda_1, \lambda_2, ..\lambda_D$

So far, it seems we haven't gained much. However, now we can reduce the dimensionality of our data as much as we like, throwing away as little information as (linearly) possible, by not considering all columns of $T$. Now, we are compressing the data in a lossy fashion. Because of the ordering of the principal components according to the eigenvalues, we keep as much information as possible by discarding the columns starting from the right. What does it mean for our original data? Instead of being exact, and writing:

$X = T \ \ P^T $, we can approximate:

$X \approx T_r \ \ P_r^T $ , with $T_r$ $(nxR)$, $P_r$ $(RxR)$, and $R < D$.

A concrete example to illustrate this: a hypothetical data set which could be the result of 10 anonymous people rating 6 movies.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
In [2]:
ratings = pd.DataFrame(columns=['The Matrix', 'Star Wars', 'Monsters Inc.', 'Finding Nemo', 'Wall-E', 'FastFurious 8'])
ratings.loc['A', :] = [9, 8, 4, 5, 7, 2]
ratings.loc['B', :] = [3, 2, 8, 8, 6, 1]
ratings.loc['C', :] = [2, 3, 8, 9, 5, 3]
ratings.loc['D', :] = [8, 10, 3, 3, 6, 2]
ratings.loc['E', :] = [9, 7, 2, 1, 5, 2]
ratings.loc['F', :] = [2, 2, 10, 10, 6, 3]
ratings.loc['G', :] = [2, 1, 9, 10, 5, 2]
ratings.loc['H', :] = [7, 9, 1, 1, 5, 2]
ratings.loc['I', :] = [2, 3, 2, 4, 3, 9]
ratings.loc['J', :] = [3, 2, 3, 2, 2, 10]
# show it
The Matrix Star Wars Monsters Inc. Finding Nemo Wall-E FastFurious 8
A 9.0 8.0 4.0 5.0 7.0 2.0
B 3.0 2.0 8.0 8.0 6.0 1.0
C 2.0 3.0 8.0 9.0 5.0 3.0
D 8.0 10.0 3.0 3.0 6.0 2.0
E 9.0 7.0 2.0 1.0 5.0 2.0
F 2.0 2.0 10.0 10.0 6.0 3.0
G 2.0 1.0 9.0 10.0 5.0 2.0
H 7.0 9.0 1.0 1.0 5.0 2.0
I 2.0 3.0 2.0 4.0 3.0 9.0
J 3.0 2.0 3.0 2.0 2.0 10.0
In [3]:
def show_correlationmatrix_df(df, title):
    cov_df = df.cov()
    mask = np.ones_like(cov_df, dtype=np.bool)
    mask[np.tril_indices_from(mask)] = False
    sns.heatmap(cov_df, mask=mask)
    plt.title(title, fontsize=16)
show_correlationmatrix_df(ratings, title='Covariance of ratings along movies')

There are several strong off-diagonal covariances: if you like The Matrix, you probably also like Star Wars (sci-fi/adventure). And if you like Monsters Inc., you probably also like Finding Nemo (animation). Wall-E is a bit of both. And if you like the Fast and the Furious 8, well, .

Let's continue with numpy, and do PCA by eigenvalue decomposition of the covariance matrix.

In [4]:
# Obtain a numpy array from the DataFrame
X = ratings.values
def pca_by_eigdecomp(X, ddof=1):
    Does PCA on the data X (n observations x D features) by eigenvalue decomposition
    of the correlation matrix of X.
    Returns the transformed data T, the principal components P, 
    the eigenvalues, and the mean X_mean
    ddof = 1 results in an unbiased estimator for the covariance
    # Subtract the mean of the original data (and return it for later use)
    X_mean = X.mean(axis=0)
    X_centered = X - X_mean[np.newaxis, :]
    X_c = 1/(X.shape[0] - ddof) * np.dot(np.transpose(X_centered), X_centered)
    # Get the eigenvalues (lambda) and eigenvectors of X_c
    lambdas, P = np.linalg.eig(X_c)
    # Note that the eigenvalues are not necessarily sorted. Sort them, and the corresponding ev's
    idx_sort = np.argsort(-lambdas) # minus to accomplish descending order
    lambdas = lambdas[idx_sort]
    P = P[:, idx_sort]

    # Use P to transform X
    T = np.dot(X_centered, P)
    return T, P, lambdas, X_mean

T, P, lambdas, X_mean = pca_by_eigdecomp(X)    
In [5]:
# Check that P D P^T indeed yields X_c
np.set_printoptions(precision=3, suppress=True)
print(np.dot(np.dot(P, np.diag(lambdas)), np.transpose(P)) - np.cov(np.transpose(X)))
[[ 0.  0.  0.  0.  0. -0.]
 [ 0. -0.  0.  0.  0. -0.]
 [ 0.  0.  0.  0.  0. -0.]
 [ 0.  0.  0.  0.  0. -0.]
 [ 0.  0.  0.  0.  0. -0.]
 [-0. -0. -0. -0. -0.  0.]]
In [6]:
# Verify that the covariance matrix of the transformed data is diagonal
print(np.cov(np.transpose(T))) # NB: np takes the covariance along the columns. Therefore, transpose
show_correlationmatrix_df(pd.DataFrame(T), title= 'Covariance of transformed data')
[[ 37.514  -0.     -0.      0.      0.     -0.   ]
 [ -0.     18.192  -0.     -0.      0.     -0.   ]
 [ -0.     -0.      1.273  -0.     -0.     -0.   ]
 [  0.     -0.     -0.      0.926   0.      0.   ]
 [  0.      0.     -0.      0.      0.296  -0.   ]
 [ -0.     -0.     -0.      0.     -0.      0.098]]
In [7]:
# Verify the reconstruction error 
def reconstruct_X(T, P, X_mean, n_components=None):
    Reconstructs X from the scores T, the principal components P, the mean
    value of X X_mean, taking into account n_components
    When n_components is None, the full reconstruction is done
    if not n_components:
        return X_mean + np.dot(T, np.transpose(P))
        return X_mean + np.dot(T[:, :n_components], np.transpose(P[:, :n_components])) 
In [8]:
# Calculate the reconstruction error as the variance of the difference between the actual and reconstructed X, 
# for all possible number of pca components
rec_error_n_components = [np.var(X - reconstruct_X(T, P, X_mean, i), ddof=1) for i in range(1, 7)]
In [9]:
plt.plot(range(1, 7), rec_error_n_components/np.var(X), label='variance of the error')
plt.plot(range(1, 7), 1- (np.cumsum(lambdas) / np.sum(lambdas)), 's', label='1 - normalized cum.sum of lambda')
plt.xlabel('Number of components', fontsize=14)
plt.ylabel('Normalized reconstruction error', fontsize=14)

We have ascertained that:

  • We can exactly reconstruct X in our transformed space P using the new feature matrix T
  • We can approximate X by reducing the number of columns of T
  • The reconstruction error (variance of the difference) quickly tends to zero, in a convex fashion
  • The reconstruction error can also be directly determined from the $\lambda$'s

Let's look into a final aspect: the variance of the reconstructed data, as a function of the number of components that are considered. We have already seen that these are related to the eigenvalues of $X^T X$.

The total variance is the sum of the diagonal elements of the covariance matrix. Since

$C_X = P C_T P^T = C_X = \Sigma_{i=0}^R \lambda_i u_i u_i^T$ (this is known as a "spectral decomposition"), the sum of the diagonal values are simply $\Sigma_{i=0}^R \lambda_i u_i u_i^T$

we can rewrite the sum of the diagonal, summing over the $R$ components:

$\begin{eqnarray} \mathrm{Var}(X) & = & \Sigma_{i=0}^D (C_x[i, i]) \\ & = & \Sigma_{i=0}^D C_x[i, i] \\ & = & \Sigma_{i=0}^R \lambda_i <u_i, u_i> \\ & = & \Sigma_{i=0}^R \lambda_i \end{eqnarray}$,

because the dot product of the orthonormal eigenvectors with themselves is 1.

Let us verify this result with our toy data:

In [10]:
var_x_reconstruct = np.array([np.sum(np.var(reconstruct_X(T, P, X_mean, i), axis=0, ddof=1)) 
                              for i in range(1, 7) ]) # NB: ddof = 1 makes the denominator equal to n-1 (unbiased)
In [11]:
print('Reconstructed variance: \t {}'.format(var_x_reconstruct) )
print("Cumulative sum of lambda's: \t{}".format(np.cumsum(lambdas)) )
Reconstructed variance: 	 [ 37.514  55.706  56.98   57.906  58.202  58.3  ]
Cumulative sum of lambda's: 	[ 37.514  55.706  56.98   57.906  58.202  58.3  ]

We can therefore look at the lambda's to determine how many components we need to reconstruct the original data.

Sklearn has a pca implementation, such that we don't have to deal with the linear algebra. It is also more efficient for large matrices, as it uses Singular Value decomposition. As a final check, let's compare our results to sklearn's implementation

In [12]:
from sklearn.decomposition import PCA
In [13]:
pca = PCA(n_components=2)
# verify that this is indeed identical
print(T[:, :2])
# Note that the sign is reversed of the second column. 
# This is because the principal component has the opposite direction
[[-4.193 -3.892]
 [ 5.089 -2.419]
 [ 5.597 -0.915]
 [-6.349 -3.031]
 [-6.936 -1.212]
 [ 7.657 -1.78 ]
 [ 7.637 -1.564]
 [-7.539 -0.926]
 [-0.204  7.203]
 [-0.759  8.537]]

[[-4.193  3.892]
 [ 5.089  2.419]
 [ 5.597  0.915]
 [-6.349  3.031]
 [-6.936  1.212]
 [ 7.657  1.78 ]
 [ 7.637  1.564]
 [-7.539  0.926]
 [-0.204 -7.203]
 [-0.759 -8.537]]
In [14]:
# NB: sklearn's PCA does not use the n-1 normalization, but n. 
# Therefore, multiply by (n-1)/n to get identical results
[ 33.763  16.373]
[ 33.763  16.373]

Summarizing, we have looked at some key elements of PCA:

  • By projecting our data on the principal components (the eigenvectors of the covariance matrix of X), we transformed our data X into a new matrix T
  • Using all principal components, T has the same dimension as X, and the reconstruction is exact
  • The covariance matrix of T is diagonal (i.e., the features in the transformed data do not correlate)
  • The eigenvalues of the covariance matrix of X tell us directly how much of the original signal (variance) we get by including the corresponding principal component
  • By leaving away the right column(s) of T, we can reduce the dimensionality of our data with minimal loss of signal
  • In our toy example, we could express roughly 95% of the signal using 2 instead of 6 dimensions

The exact same result was achieved with sklearn's PCA (but internally, the sklearn implementation uses SVD instead of eigendecomposition of the covariance matrix. Note also that sklearn's PCA uses the biased estimator of the covariance - which does not result in significant differences: neither P nor the normalized eigenvalues (normalized by their sum) are affected).

Optimizing with sklearn's GridSearchCV and Pipeline

Using sklearn's gridsearchCV and pipelines for hyperparameter optimization

Sklearn has built-in functionality to scan for the best combinations of hyperparameters (such as regularization strength, length scale parameters) in an efficient manner.

With the Pipeline class, we can also pass data-preprocessing steps such as standardization or PCA. This is a real time-saver. No more writing complex cross-validation loops with .fit() on the train set, calculating the score on the leave-out set, storing it in a dataframe...

We will use the pre-processed data from


that was explored in detail in a previous post.

In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
import numpy as np
import matplotlib
plt.rcParams.update({'font.size': 14})

from sklearn.linear_model import LogisticRegression # Classifier 1
from sklearn.ensemble import RandomForestClassifier # Classifier 2
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, make_scorer, roc_curve, auc
from sklearn.model_selection import GridSearchCV # to optimize hyperparams
from sklearn.preprocessing import StandardScaler # to standardize numerical features
from sklearn.pipeline import Pipeline # to pass the standardization along with the classifier to GridSearchCV

0. Import data

In [2]:
# after downloading, store locally. Set the path here
data_dir = '../data/'
## Let's write the results to disk for analysis. 
X_train = pd.read_pickle(path=data_dir + 'X_train')
X_test = pd.read_pickle(path=data_dir + 'X_test')
X_train_ohe = pd.read_pickle(path=data_dir + 'X_train_ohe')
X_test_ohe = pd.read_pickle(path=data_dir + 'X_test_ohe')
y_train = pd.read_pickle(path=data_dir + 'y_train')
y_test = pd.read_pickle(path=data_dir + 'y_test')

1. Classification

The data set is already cleaned-up, and one-hot encoded:

In [3]:
education_num age hoursperweek capital_gain capital_loss sex_Male race_Asian-Pac-Islander race_Black race_Other race_White workclass_Local-gov workclass_Never-worked workclass_Private workclass_Self-emp-inc workclass_Self-emp-not-inc workclass_State-gov workclass_Without-pay
count 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000
mean 10.080679 38.581647 40.437456 1077.648844 87.303830 0.669205 0.031909 0.095943 0.008323 0.854274 0.064279 0.000215 0.753417 0.034274 0.078038 0.039864 0.000430
std 2.572720 13.640433 12.347429 7385.292085 402.960219 0.470506 0.175761 0.294518 0.090851 0.352837 0.245254 0.014661 0.431029 0.181935 0.268236 0.195642 0.020731
min 1.000000 17.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 9.000000 28.000000 40.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000
50% 10.000000 37.000000 40.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000
75% 12.000000 48.000000 45.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000
max 16.000000 90.000000 99.000000 99999.000000 4356.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

Let's do classification using logistic regression and random-forest, and compare the results. As features, we have:

  • education_num (as a numerical feature, which seems a fairly decent approach)
  • age (numerical). Note that at a certain age, a decline can be expected. Random Forest will be at an advantage here
  • hours per week (numerical)
  • sex (categorical)
  • race (categorical)
  • hoursperweek (numerical)
  • workclass (categorical)
  • capital gain (numerical)
  • capital loss (numerical)

Note that zip code was thrown away from the original dataset. It might carry some useful data after suitable binning (for instance, mapping zip to state), but without such an operation it is likely not a suitable feature.

The website mentions typical accuracy scores of about 85% (after removal of unknown records, which is not what was done here), so let's see if we can get close to that. Note that the dataset has an imbalance with only about 24% being positive, so an accuracy of 76% can be obtained trivially. We will therefore also look at the F1-score and at ROC curves.

What is a meaningful score in real-life depends on the business case. For instance: with a fixed budget for targeted marketing on the positive group, we are probably most interested in the precision of our classifier.

Logistic regression

In [19]:
f1_scorer = make_scorer(f1_score)

pipe = Pipeline([
    ('standardscaler', StandardScaler()), #'standardscaler' is the name we chose to give 
    ('classify', LogisticRegression(random_state=1234))
C_grid = np.logspace(-4, 4, 8) # These are the 'C' parameters that are varied

# the parameter_grid is a list with dictionaries. The naming for keys in that dict is strict. 
# It refers to the the element in the pipe, with a __ followed by the parameter name we want to vary
# Whence, we use 'classify__C'

param_grid = [{'classify__C': C_grid }]

# Give GridSearchCV the pipe, the number of folds cv, the param_grid, the scorer we like to use for assessment
# (if we don't pass "scoring", we will just be using the classifier's default scorer)
grid = GridSearchCV(pipe, cv=10, n_jobs=2, param_grid=param_grid, verbose=1, 
                    return_train_score=True, scoring=f1_scorer)
grid.fit(X_train_ohe, y_train)
Fitting 10 folds for each of 8 candidates, totalling 80 fits
[Parallel(n_jobs=2)]: Done  80 out of  80 | elapsed:   11.4s finished
GridSearchCV(cv=10, error_score='raise',
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('classify', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=1234, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=2,
       param_grid=[{'classify__C': array([  1.00000e-04,   1.38950e-03,   1.93070e-02,   2.68270e-01,
         3.72759e+00,   5.17947e+01,   7.19686e+02,   1.00000e+04])}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=make_scorer(f1_score), verbose=1)

We just did cross-validation, scanning for 8 different C-values, each time fitting the scaler on the train-set, and scoring on the hold-out set. The results are stored in

In [21]:
plt.plot(grid.cv_results_['param_classify__C'],grid.cv_results_['mean_test_score'], label='test')
plt.plot(grid.cv_results_['param_classify__C'],grid.cv_results_['mean_train_score'], label='train')
#plt.ylim([0.8, 0.82])

C is an inverse regularization-strength. For C about 1 or higher, the results do not change, meaning that regularization is not having any influence any more (the penalty it incurs is relatively small).

Because of the relatively large amount of observations (32.5 K) compared to the number of features (17), regularization is clearly not needed. This is also observed in the accordance between test and train score. We can take C=1000.0 to effectively have unregularized regression

In [6]:
sc = StandardScaler()
X_train_ohe = sc.fit_transform(X_train_ohe)
X_test_ohe = sc.transform(X_test_ohe)
LR = LogisticRegression(C=1000.0)
LR.fit(X_train_ohe, y_train)
y_predict_LR = LR.predict(X_test_ohe)
display(pd.crosstab(columns=y_predict_LR, index=y_test, normalize=True))
print('accuracy: {:.3f}'.format(accuracy_score(y_test, y_predict_LR)))
print('f1-score: {:.3f}'.format(f1_score(y_test, y_predict_LR)))
col_0 0 1
0 0.721946 0.041828
1 0.131749 0.104478
accuracy: 0.826
f1-score: 0.546

Note that the actual labels are in the rows, and that predicted labels are displayed column-wise. There seem to be different conventions, but I follow the convention from "The elements of statistical learning (Hastie et al.)" here. Also, giving the 0th index (rows) the true values, and the 1st index (columns) the prediction seems to be more intuitively correct.

Our logistic model has a rather poor recall: of the 23% that are truly positive, less than a half is identified as such. The precision is better. Let us look at the ROC curve, that tells us a bit more:

In [7]:
y_predict_LR_proba = LR.predict_proba(X_test_ohe)
fpr_LR, tpr_LR, thresholds = roc_curve(y_test, y_predict_LR_proba[:,1], pos_label=1)
In [8]:
fig = plt.figure(figsize=(8,6))
plt.plot(fpr_LR, tpr_LR)
plt.title('Logistic Regression, AUC: {:.3f}'.format(auc(x=fpr_LR, y=tpr_LR)), fontsize=14)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate');

Random Forest Classifier

For the random forest classifier, we do not need to worry about standardization and in principle do not need one-hot-encoding. Regularization is enforced by minimizing the complexity of the individual trees.

X_train and X_test contain categorical variables, that unfortunately are not handled by sci-kit learn well. Let us go two routes:

  1. Use one-hot-encoded features
  2. Use coded categories (integers)

We will start off with one-hot-encoded features

In [22]:
pipe = Pipeline([
    ('classify', RandomForestClassifier(random_state=1234, n_estimators=100))
maxdepth_grid = [ 6, 8, 10, 12, 15, 20]
param_grid = [
    {'classify__max_depth': maxdepth_grid },
grid = GridSearchCV(pipe, cv=10, n_jobs=2, param_grid=param_grid, verbose=1, 
                    return_train_score=True, scoring=f1_scorer)
grid.fit(X_train_ohe, y_train)
Fitting 10 folds for each of 6 candidates, totalling 60 fits
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:   49.3s
[Parallel(n_jobs=2)]: Done  60 out of  60 | elapsed:  1.1min finished
GridSearchCV(cv=10, error_score='raise',
     steps=[('classify', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=1234, verbose=0,
       fit_params=None, iid=True, n_jobs=2,
       param_grid=[{'classify__max_depth': [6, 8, 10, 12, 15, 20]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=make_scorer(f1_score), verbose=1)
In [23]:
plt.plot(grid.cv_results_['param_classify__max_depth'],grid.cv_results_['mean_test_score'], label='test')
plt.plot(grid.cv_results_['param_classify__max_depth'],grid.cv_results_['mean_train_score'], label='train')
plt.xlabel('Max depth')

In Random Forest classification, complexity is determined by how large we allow our trees to be. Clearly, we start overfitting on the train data from a depth of 10 or more. However, because we are using a 100 trees (we are bootstrap aggregating), the variance is strongly suppressed. Although the predictive power doesn't increase much from a depth of about 15 or more, the predictions also do not deteriorate. This is in general a very nice property of random forests: grow many trees, and grow them large enough, and you are almost guaranteed to get a good classifier.

In [11]:
RFC = RandomForestClassifier(n_estimators=100, max_depth=20)
RFC.fit(X_train_ohe, y_train)
y_predict_RFC = RFC.predict(X_test_ohe)
display(pd.crosstab(columns=y_predict_RFC, index=y_test, normalize=True))
print('accuracy: {:.3f}'.format(accuracy_score(y_test, y_predict_RFC)))
print('f1-score: {:.3f}'.format(f1_score(y_test, y_predict_RFC)))
col_0 0 1
0 0.722192 0.041582
1 0.118113 0.118113
accuracy: 0.840
f1-score: 0.597
In [12]:
y_predict_RFC_proba = RFC.predict_proba(X_test_ohe)
fpr_RFC, tpr_RFC, thresholds = roc_curve(y_test, y_predict_RFC_proba[:,1], pos_label=1)
In [13]:
fig = plt.figure(figsize=(8,6))
plt.plot(fpr_LR, tpr_LR, label= 'Log. Reg.')
plt.plot(fpr_RFC, tpr_RFC, label='RFC')
plt.title('LR AUC: {:.3f}, RFC AUC: {:.3f}'.format(auc(x=fpr_LR, y=tpr_LR),
                                                  auc(x=fpr_RFC, y=tpr_RFC)), fontsize=14)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate');

By shifting the level at which the label is predicted positive, for an identical FPR (i.e.: what fraction of actual negatives have been classified as positive), the TPR (what fraction of actual positives has been identified as such) of the RFC classifier is larger: it is a stronger classifier.

As a final step, let us see if there is any difference with the integer-coded categorical features (rather than one-hot encoded)

In [14]:
# Convert categorical to codes (integers)
cat_cols = list(X_train.columns[X_train.dtypes == 'category'])
for col in cat_cols:
    X_train[col] = X_train[col].cat.codes
In [15]:
pipe = Pipeline([
    ('classify', RandomForestClassifier(random_state=1234, n_estimators=100))
maxdepth_grid = [ 6, 8, 10, 12, 15, 20]
param_grid = [
    {'classify__max_depth': maxdepth_grid },
grid_2 = GridSearchCV(pipe, cv=10, n_jobs=2, param_grid=param_grid, verbose=1, 
                    return_train_score=True, scoring=f1_scorer)
grid_2.fit(X_train, y_train)
Fitting 10 folds for each of 6 candidates, totalling 60 fits
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:   40.4s
[Parallel(n_jobs=2)]: Done  60 out of  60 | elapsed:   57.9s finished
GridSearchCV(cv=10, error_score='raise',
     steps=[('classify', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=1234, verbose=0,
       fit_params=None, iid=True, n_jobs=2,
       param_grid=[{'classify__max_depth': [6, 8, 10, 12, 15, 20]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=make_scorer(f1_score), verbose=1)
In [25]:
plt.plot(grid.cv_results_['param_classify__max_depth'],grid.cv_results_['mean_test_score'], label='test, ohe',
plt.plot(grid.cv_results_['param_classify__max_depth'],grid.cv_results_['mean_train_score'], label='train, ohe',
plt.plot(grid_2.cv_results_['param_classify__max_depth'],grid_2.cv_results_['mean_test_score'], label='test')
plt.plot(grid_2.cv_results_['param_classify__max_depth'],grid_2.cv_results_['mean_train_score'], label='train')
plt.xlabel('Max depth')


Using sklearn's Pipeline and GridsearchCV, we did the entire hyperparameter optimization loop (for a range of values, standardization, fitting, scoring on test predictions, storing results) in a few lines. We got somewhat better results with the Random Forest Classifier compared to the Logistic Regression classifier, as was to be expected based on the features (for instance, as the expectation of the target is not monotonically increasing with age). One-hot-encoded features and integer-coded features give very comparable performance for RFC.

In the end, our accuracy score ( NB: without removing observations with missing features) are on par with those mentioned on the website (where observations were removed).

As a quick and dirty comparison: the RFC accuracy is 84.40%, whereas the mentioned best scores are C4.5: 84.46%, Naive-Bayes: 83.88%, NBTree: 85.90%)

Anomaly Detection with Sci-kit Learn, Keras and TensorFlow

Anomaly detection in time-series data


Consider an unsteady heat transfer problem, with a solid plate that sits in between two turbulent media with different temperatures. The rate of change of temperature of the plate is driven by the weighted temperature difference of the two gaseous streams, and the weights -heat transfer coefficients- depend on other variables, such as the air velocity and pressure.

Our system is equipped with sensors, and we want to know whether an anomaly occurred at some point (a heat transfer change due to loss of insulation, a sensor that started giving wrong values, ..). We cannot observe this directly from the measured data. An essential difficulty for a traditional, physical model-based approach, is that our observed and unobserved variables have relationships that are not exactly known to us (for instance, the relationship between the heat transfer coefficient and variables such as temperature, pressure and velocity).

Nevertheless, we would like to be able to detect a hidden change in our system -in this case, a reduction in a heat transfer coefficient of 10%-. We will train a machine-learning model to learn how the observed variables are connected, such that it can detect when something does not look quite right anymore.

As a first step, let's generate data that follows from the following non-dimensionalized ordinary differential equation:

\begin{gather} t_{ref} \frac{dy}{dt} &= \eta_0 \theta_0 + \eta_1 \theta_1 - y(\eta_1 + \eta_0) \\ \end{gather}

Where the $\eta$'s denote the heat transfer coefficients, the $\theta$'s the temperatures (at both sides, indicated with 0 and 1), and $t_{ref}$ the time constant (that depends, among other things, on the material properties and thickness). To give some physical intuition:

  • when both $\eta$'s are identical, the steady-state solution is: $y = (\theta_0 + \theta_1)/2$. Note that this is also a stable solution: whenever y is larger, the rate of change becomes negative, and vice versa
  • More general, the steady-state solution is: $y = \left(\eta_0 \theta_0 + \eta_1 \theta_1\right) / \left(\eta_0 + \eta_1 \right)$

In case $\eta_0$ is much larger than $\eta_1$, the temperature $y$ will tend to $\theta_0$.

Note that this functional form applies to many systems that are governed by gradient-driven transport.

The heat transfer coefficients cannot be directly measured, but depend on other variables that can be measured and are therefore known - those are collectively called $\mathbf{x}$ (e.g., temperatures, pressures and velocities), that vary with time:

\begin{align} \eta_0 = \eta(\mathbf{x}(t)), \qquad \eta_1 = \pi(t) \cdot \eta(\mathbf{x}(t)) \end{align} where $\pi$ is the function that suddenly changes from 1 to a different value, at $t=t_{anomaly}$.

The temperature is supposed to be known on one side, and unknown but dependent on known variables on the other side:

\begin{align} \theta_0 = \theta_0(t), \qquad \theta_1 = \theta_1(\mathbf{x}(t)) \end{align}

A typical dependency of the heat transfer coefficients is that they go proportional to several non-dimensional quantities raised to some power. Although this will not be demonstrated here, when these exact exponents are not known, it is very difficult to reconstruct the occurrence of the anomaly. Machine-based learning, on the other hand, can be quite robust - if we have enough data to train our model with -.

We will use two approaches to this problem: a Gaussian Process regression -that offers the advantage of estimating confidence intervals for our predictions-, and a neural network, using keras with a Tensorflow back-end.

In [1]:
%matplotlib inline
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

1. Setting some parameters

In [2]:
N = int(1E5) # number of timesteps
t_anomaly = 0.8 # fraction of N where anomaly occurs
diff_anomaly = -0.1 # the difference of the coefficient that occurs at t_anomaly (-0.1: 10% reduction)
t_train = 0.6 # the first fraction of the data we can train our model on. 
N_events = 100 # The number of changes that occur in the target values for the Ornstein-Uhlenbeck process that generates X
noise_level = 1.E-4 # Additive white noise on top of the measured signals. 1.E-4: very high SNR, clear detection. 
# 1.E-3: decent detection. 1.E-2: pushing it 

2. Generating the independent (observed) variables

To generate data, we start with piecewise linear target functions of $\mathbf{x}$. Using an Ornstein-Uhlenbeck scheme, we introduce inertia in the response and add some noise in the variable values.

In [3]:
def generate_random_pwlin_curves(N, n_x, y_min, y_max, seed, stoch=True, theta=0.01, sigma=0.01):
    Returns piecewise linear time-series. 
    If stoch == True: additionally applies an Ornstein-Uhlenbeck process with the values 
    from the piecewise linear series as target values. 
    y = np.zeros(N)
    # Generate random times at which the series will change its value
    t_delta = np.array([round(xd) for xd in np.random.random(n_x) * N])
    # First time: 0, then sort the random times
    t_delta_all = np.concatenate( (np.array([0]), np.sort(t_delta), np.array([N]) ) )
    # Generate the piecewise linear arrays by assigning constant values between the times
    for i in range(n_x + 1):
        y[int(t_delta_all[i]) : int(t_delta_all[i+1])] = y_min + np.random.random() * (y_max - y_min)
    if stoch:
        ## Ornstein Uhlenbeck process: reverting to the "target" (y), with added noise
        noise = np.random.normal(loc=0.0, scale=sigma, size=N) #define noise process
        drift = lambda y, mu: theta * (mu - y) # drift term. theta: the larger, the quicker 
        # the values revert to target
        mu = y.copy() # mu is now the target value for the drift process
        # Iterate through time
        for i in range(1, N):
            y[i] = y[i-1] + drift(y[i-1], mu[i-1]) + noise[i]
    return y
In [4]:
# Generate the x-matrix, with the row-index corresponding to time:
X = np.empty((N, 5))
# Generate the values
X[:,0] = generate_random_pwlin_curves(N, N_events, 1.0, 2.5, seed=79) 
X[:,1] = generate_random_pwlin_curves(N, N_events, 2.0, 5, seed=12) 
X[:,2] = generate_random_pwlin_curves(N, N_events, 0, 1, seed=13) 
X[:,3] = generate_random_pwlin_curves(N, N_events, 0.5, 2, seed=443) 
X[:,4] = generate_random_pwlin_curves(N, N_events, 0.5, 1.5, seed=123) 
In [5]:
matplotlib.rcParams.update({'font.size': 14})

# Plot the independent variables
fig, axs = plt.subplots(2,3,figsize=(15,6))
for i in range(len(axs.flat)):
        axs.flat[i].plot(X[:,i], lw=0.5)
        axs.flat[i].set_xlabel('Time [-]')
        axs.flat[i].set_ylabel('Value [-]')
        axs.flat[i].ticklabel_format(style='sci', axis='x', scilimits=(0,0))

    except IndexError:  # When exceeding the number of columns of X, make an empty axis
plt.suptitle('Time traces of the independent variables X', fontsize=16)

3. Generating the hidden variables

The hidden ("latent") variables are the non-observed ingredients of the model. Some arbitrary dependence is assumed, loosely inspired on correlations for the turbulent heat transfer coefficient.

In [6]:
# Set some variables
t_ref = 10 # 1/(rho cp t)
p = np.ones(N)
p[round(N * t_anomaly)+1:] = 1 + diff_anomaly #This is the anomaly: change in the coefficient from 1 to below 1

# Define the functions that determine theta_1(x), eta_0(x) and eta_1(x):
fun_theta1 = lambda x1, x2 : x1 * (0.775 - 0.1*x2**2 + 0.7*x2**3)
fun_eta = lambda x1, x2, x3 : x1 * x2**0.628 * x3**-0.23

# Calculate the T1 and HTC's
theta_1 = fun_theta1(X[:,1], X[:,2])
eta_0 = fun_eta(X[:,3], X[:,4], X[:,0])
eta_1 = p * fun_eta(X[:,3], X[:,4], theta_1)

# Generate sensor values y
y = np.zeros(N)
y[0] = (eta_0[0] * X[0,0] + eta_1[0] * theta_1[1]) / (eta_0[0] + eta_1[0]) #Start with steady value

# Generate noise
eps = 1E-5 * np.random.random(len(y))

# Iterate through y using the difference equation
for i in range(N-1):
    y[i+1] = y[i] + 1/t_ref * (eta_1[i]*(theta_1[i] - y[i]) + eta_0[i]*(X[i,0] - y[i])) + eps[i]
In [7]:
# Add noise to both X and y
X += np.random.normal(loc=0.0, scale=noise_level, size=X.shape) 
y += np.random.normal(loc=0.0, scale=noise_level, size=y.shape)
In [8]:
# Zoom in on some intervals, to show that y (the solid temperature) tends to a value in between T0 and T1
fig, (ax1, ax2) = plt.subplots(2,1, figsize=(14,5))

# 1: before anomaly
t1, t2 = 37000, 41000
ax1.plot(range(t1,t2), theta_1[t1:t2],'k', label='theta_1')
ax1.plot(range(t1,t2), y[t1:t2], lw=3, label='y')
ax1.plot(range(t1,t2), X[t1:t2, 0],'g', label='theta_0')
ax1.legend(fancybox=True, framealpha=0.8, loc=2)
ax1.set_title('Temperatures, before anomaly', fontsize=16);

# 2: around anomaly
t1, t2 = int(N*t_anomaly-2000), int(N*t_anomaly+2000)
ax2.plot(range(t1,t2), theta_1[t1:t2],'k', label='theta_1')
ax2.plot(range(t1,t2), y[t1:t2], lw=3, label='y')
ax2.plot(range(t1,t2), X[t1:t2, 0],'g', label='theta_0')
ax2.vlines(x=N*t_anomaly, ymin=ax2.get_ylim()[0], ymax=ax2.get_ylim()[1], linestyle='--', color='r', label='start of anomaly')
ax2.legend(fancybox=True, framealpha=0.8, loc=2)

ax2.set_title('Temperatures, around anomaly', fontsize=16);
ax2.set_xlabel('Time [-]')

The top plot shows how $y$ tends to a value that is in between $\theta_0$ and $\theta_1$, before the anomaly.

The bottom plot zooms in around the anomaly at t=80'000. We really cannot easily detect the event visually, because all variables are continuously changing. Let's see if a machine learning model can help us detect the event

4. Generate the features (X) and target (y)

In [9]:
# The target variable: y
y_target = y

# Create features from X, adding an extra 6th column for dy/dt
X_features = np.empty((N, 6))
X_features[:, :5] = X.copy()
X_features[:, 5] = np.concatenate((np.diff(y), np.array([0])))

# Scale the features to have zero mean and unit variance. 
rr = StandardScaler()
X_features = rr.transform(X_features)

5. Gaussian Process Regressor from Sci-kit learn

In [10]:
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

We will train a Gaussian Process regression model, to predict the temperature $y$ at t based on the measured properties at $t$ and the change between $y(t+1)$ and $y(t)$.

Note that this not predictive modelling in the traditional sense for time-series (future prediction); we are merely reconstructing the missing piece ($y(t)$) from all the other ingredients that together make up the differential equation at time $t$.

Training a GP model is computationally demanding for large N, with the training computational effort scaling with ~N^3. Sci-kit learn's GP model has an internal optimizer, that yields hyperparameters with maximum-likelihood maximization. Finding this optimum is additional effort. The found optimum for the length scale is 5.4 (note: we are dealing with 6 dimensions) which is given explicitly, along with


thus avoiding the optimization routine for the demo.

For this smooth problem, the common Gaussian/RBF kernel is well-suited.

In [11]:
N_1, train_step, test_step = int(t_train * N), 40, 10
X_train, y_train = X_features[0:N_1:train_step, :], y_target[0:N_1:train_step] # subsample features
X_test, y_test = X_features[::test_step, :], y_target[::test_step] # predict on entire range, subset later for test accuracy
In [12]:
(1500, 6)
In [14]:
GP = GaussianProcessRegressor(kernel=RBF(length_scale=5.4), n_restarts_optimizer=1, alpha=noise_level)
GP.fit(X_train, y_train)
GaussianProcessRegressor(alpha=0.0001, copy_X_train=True,
             kernel=RBF(length_scale=5.4), n_restarts_optimizer=1,
             normalize_y=False, optimizer='fmin_l_bfgs_b',
In [15]:
y_pred, sigma = GP.predict(X_test, return_std=True) # Also return the estimated uncertainty of the prediction
In [16]:
plt.plot(range(0, N, test_step), y_test, lw=2, label='actual values')
plt.plot(range(0, N, test_step), y_pred,'k-', label='predicted values')
plt.fill_between(range(0, N, test_step), y_pred- 2*sigma, y_pred+2*sigma, 
                 color='k', alpha=0.2, label='95% conf.int.')
plt.vlines(x=N_1, ymin=1, ymax=5,  color='k', linestyle='--', label='end of training data')
plt.vlines(x=N*t_anomaly, ymin=1, ymax=5, color='r', linestyle='--', label='start of anomaly')
plt.xlim([N_1 - 10000, N*t_anomaly + 10000])

plt.ylabel('normalized temperature [-]')
plt.xlabel('time [-]')
plt.title('Predictions versus actual y-values', fontsize=16);
In [17]:
plt.plot(range(0, N, test_step), y_pred - y_test, label='prediction-actual')
plt.fill_between(range(0, N, test_step), -2*sigma, 2*sigma, color='k', alpha=0.5, label='95% conf.int.')
plt.vlines(x=N_1, ymin=-2*np.max(sigma), ymax=2*np.max(sigma), linestyle='--')
plt.vlines(x=N*t_anomaly, ymin=-2*np.max(sigma), ymax=2*np.max(sigma), color='r',
           linestyle='--', label='start of anomaly')
plt.xlim([N_1 - 10000, N*t_anomaly + 10000])

plt.ylabel('predicted - actual [-]')
plt.xlabel('time [-]', fontsize=14)
plt.title('Residuals and model uncertainty', fontsize=16);

This plot shows the large advantage of Gaussian Process regression: our model indicates how certain it is regarding its predictions. Before the start of the test data (before black dashed line), the uncertainty band is rather narrow: the model has been trained on this data (with subsampling, which is why there is still some variation in the uncertainty within the training domain).

In the first part of the test-data, before the occurrence of the anomaly (red dashed line), the error increases. Since our features have 6 dimensions, and we had a total of 100 "set-points", we did not quite cover the entire parameter space. However, the model indicates this perfectly well: the uncertainty becomes much larger.

In the final part, the predictions generally lie far outside of our confidence interval. Let's calculate the log-likelihood of our data given our model's prediction and uncertainty, and plot this against time.

In [18]:
import scipy.stats
eps = 1E-12 # eps to avoid -inf
LL = np.log(eps + scipy.stats.norm(y_pred, sigma).pdf(y_target[::test_step])) 
plt.plot(range(0, N, test_step), LL, label='Log-likelihood')
plt.vlines(x=N_1, ymin=min(LL), ymax=10, linestyle='--', label='end of training data')
plt.vlines(x=N*t_anomaly, ymin=min(LL), ymax=10, color='r', linestyle='--', label='start of anomaly')
plt.xlim([N_1 - 10000, N*t_anomaly + 10000])

plt.ylabel('Log-likelihood [-]')
plt.xlabel('time [-]')
plt.title('Log-likelihood of data given prediction', fontsize=16);

The log-likelihood shows a slight degradation after the start of unseen data, but a drastic jump at $t_{novelty}$. Since the change is so drastic, one can easily automate detection by adding a moving average and thresholding.

In [19]:
# Calculate the R-squared and MSE of the GP model
R2_GP = r2_score(y_pred[N_1//test_step:int(N*t_anomaly)//test_step], 
MSE_GP = mean_squared_error(y_pred[N_1//test_step:int(N*t_anomaly)//test_step], 
In [20]:
# save the residuals for later use
res_GP = y_pred - y_test

6. Artificial neural network with Keras and TensorFlow

In [21]:
import keras
from time import time
from keras.models import Sequential
from keras.layers import Dense
from keras import callbacks
from keras.callbacks import TensorBoard
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
Using TensorFlow backend.

As an alternative, we will train a neural network to do the same task. The effort for training a neural network goes linear with the number of training points; we do not to undersample and just use all training data poitns.

In [22]:
X_train, y_train = X_features[:N_1, :], y_target[:N_1]
X_test, y_test = X_features[:, :], y_target[:]
In [23]:
def twolayer_model(activation_1='relu', activation_2='linear'):
    model = Sequential()
    model.add(Dense(6, input_dim=6, kernel_initializer='normal', activation=activation_1))
    model.add(Dense(1,  kernel_initializer='normal', activation=activation_2))
    model.compile(loss='mean_squared_error',optimizer='adam', metrics=['mean_squared_error'])
    return model
In [24]:
model = twolayer_model(activation_1='sigmoid', activation_2='linear')
# The re-lu gives somewhat poorer accuracy
In [25]:
tensorboard = TensorBoard(log_dir="logs/{}".format(time()))
# The tensorboard object will write data for visualization with tensorboard. 
# start tensorboard from the terminal with:
# $ tensorboard --log_dir=logs/
model.fit(X_train, y_train, epochs=200, batch_size=100, verbose=0, callbacks=[tensorboard])
<keras.callbacks.History at 0x1a2175b240>
In [26]:
y_pred = model.predict(X_test)[:, 0] #keras returns np.ndarray. Take the first column
In [27]:
plt.figure(figsize=(14, 8))
plt.plot(range(len(y_test)), y_test, lw=3, label='actual values')
plt.plot(range(len(y_test)), y_pred, 'k-', label='prediction')
plt.vlines(x=N_1, ymin=0, ymax=5, color='k', linestyle='--')
plt.vlines(x=N*t_anomaly, ymin=0, ymax=5, color='r', linestyle='--')
plt.xlim([N_1 - 10000, N*t_anomaly + 10000])

plt.ylabel('normalized temperature [-]', fontsize=14)
plt.xlabel('time [-]', fontsize=14)
plt.title('Predictions versus actual y-values', fontsize=14);
In [28]:
fig = plt.figure(figsize=(14,8))
plt.plot(range(len(y_pred)), y_pred - y_test, label='Residuals, NN')
plt.plot(range(0, N, test_step), res_GP, 'k', linestyle=':', label='Residuals, GP', lw=2)
ax = plt.gca()
plt.xlim([N_1 - 10000, N*t_anomaly + 10000])
ymin, ymax = ax.get_ylim()
xmin, xmax = ax.get_xlim()

plt.vlines(x=N_1, ymin=ymin,  ymax=ymax, linestyle='--')
plt.vlines(x=N*t_anomaly, ymin=ymin,  ymax=ymax, color='r', linestyle='--')
plt.hlines(y=0, xmin=xmin, xmax=xmax)
plt.ylabel('predicted - actual [-]')
plt.xlabel('time [-]')
plt.title('Residuals, GP versus NN', fontsize=16);