Outlier Detection: Isolation Forest

Outlier Detection Part III: (Extended) Isolation Forest

This is the third post in a series about outlier detection. Previously, MAD (median absolute deviation from the median) and DBSCAN were explored and applied to four datasets.

The datasets are described in detail in https://donernesto.github.io/blog/outlier-detection-data-preparation.

In this post, we look at the Isolation Forest algorithm.

Isolation Forest

The Isolation Forest algorithm is related to the well-known Random Forest algorithm, and may be considered its unsupervised counterpart. The idea behind the algorithm is that it is easier to separate an outlier from the rest of the data than to do the same with a point that lies in the center of a cluster (and is thus an inlier). The picture from the original paper clarifies this:

In [1]:
from IPython.display import Image
Image(filename='../images/isoforest.png', width=800) 
Out[1]:

As in most machine learning algorithms, there is a training/fitting stage and a prediction stage. During fitting, many trees are built, each trained on a sample of the original data. Unlike in Random Forest, these samples are obtained by subsampling with a fixed, small size (typically 256) rather than by bootstrapping. Also, the partitioning is done at random, without a particular objective (contrary to Random Forest, which chooses splits that optimize some purity measure).

During prediction, instances traverse the trees in the iForest. The quicker an instance reaches a termination node, the higher its outlier score. Note that, because the splits are hyperplanes orthogonal to the coordinate axes, the algorithm is not rotationally invariant, which may be undesirable in some cases.
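
For reference, the outlier score in the original paper is defined as s(x, n) = 2^(-E[h(x)] / c(n)), where E[h(x)] is the average path length of point x over the trees and c(n) is a normalisation constant that depends on the subsample size n. A minimal sketch of this formula (not the scikit-learn internals, just the relation from the paper):

import numpy as np

def c_factor(n):
    # Average path length of an unsuccessful search in a binary search tree with
    # n points (Liu et al., 2008); used to normalise the observed path lengths.
    if n <= 1:
        return 0.0
    harmonic = np.log(n - 1) + 0.5772156649  # approximate harmonic number H(n-1)
    return 2 * harmonic - 2 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    # s(x, n) = 2 ** (-E[h(x)] / c(n)): close to 1 for easily isolated (anomalous)
    # points, well below 0.5 for points deep inside a cluster.
    return 2 ** (-mean_path_length / c_factor(n))

print(anomaly_score(4.0, 256), anomaly_score(12.0, 256))  # shorter path -> higher score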

There are practically no parameters to tune; the default subsample size of 256 and the default of 100 trees are reported to work for many different datasets, which we will also investigate here. The original paper is recommended reading.
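
As a minimal usage sketch of the scikit-learn implementation with these default settings (on toy data, not on the datasets analysed below):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_train = rng.normal(size=(1000, 2))               # inliers around the origin
X_test = np.vstack([X_train[:5], [[6.0, 6.0]]])    # a few inliers plus one obvious outlier

ifo = IsolationForest(n_estimators=100, max_samples=256, random_state=0)
ifo.fit(X_train)
# decision_function is higher for "normal" points; negate it to obtain an outlier score
outlier_scores = -ifo.decision_function(X_test)
print(outlier_scores)  # the last point should receive the highest score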

Extended Isolation Forest

The splits produced by the standard isolation forest are orthogonal cuts, aligned with the coordinate system. This leads to obvious artefacts. A recently proposed solution to this problem (the paper is from November 2018) is to split with hyperplanes of arbitrary orientation in feature space. The following figure illustrates this.

In [2]:
Image(filename='../images/ext_iforest.png', width=800) 
Out[2]:

See the paper for an extensive discussion:

https://arxiv.org/abs/1811.02141

An implementation in Python is available on PyPI as the package eif.
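
A minimal usage sketch of the eif package (the same argument names are used later in this post; as far as I know, ExtensionLevel=0 reproduces the standard, axis-parallel isolation forest, while an ExtensionLevel equal to the number of features minus one gives the fully extended variant):

import numpy as np
import eif

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3))

# Splitting hyperplanes with arbitrary orientation when ExtensionLevel = n_features - 1
ext_ifo = eif.iForest(X, ntrees=100, sample_size=256, ExtensionLevel=X.shape[1] - 1)
scores = ext_ifo.compute_paths(X_in=X)  # higher score = more anomalous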

This is a rather long post. TL;DR? Jump straight to the Conclusions.

In [3]:
# Standard library
import types
import time
import os
import pickle
from collections import Counter, defaultdict
from functools import wraps

# 3rd party imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import (auc, average_precision_score, 
                             roc_auc_score, roc_curve, precision_recall_curve)
from sklearn.ensemble import RandomForestClassifier, IsolationForest
from sklearn.cluster import DBSCAN
from sklearn.linear_model import LinearRegression
import eif 
import seaborn as sns

# misc.
from blogutils import memotodisk, plot_metrics_curves, downsample_scale_split_df
plt.rcParams.update({'font.size': 13})
In [4]:
from sklearn import __version__ ; print(__version__)
0.20.2

0. Preparation

Define functions for constructing isolation forest classifiers and scanning their parameters. We will use the memoizer from this other post and decorate the functions that do the heavy lifting, to make experimentation faster.
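
The actual memotodisk decorator lives in blogutils and is discussed in that post; purely for illustration, a minimal sketch of such a decorator (hypothetical, not the real implementation) could look like this:

import hashlib
import os
import pickle
from functools import wraps

def memotodisk_sketch(func):
    """Cache a function's return value on disk, keyed by a hash of its arguments."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        key = hashlib.md5(pickle.dumps((func.__name__, args, sorted(kwargs.items())))).hexdigest()
        path = f'.memo_{key}.pkl'
        if os.path.exists(path):            # cache hit: load the stored result
            with open(path, 'rb') as f:
                return pickle.load(f)
        result = func(*args, **kwargs)      # cache miss: compute and store
        with open(path, 'wb') as f:
            pickle.dump(result, f)
        return result
    return wrapper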

In [5]:
@memotodisk
def iforest_outlier_predictor(X, n_jobs=2, **kwargs):
    """ Takes an X-array and returns outlier scores and the elapsed time
    """
    t0 = time.perf_counter()
    ifo = IsolationForest(n_jobs=n_jobs, **kwargs)
    # Negate decision_function so that higher values mean "more anomalous"
    y = -ifo.fit(X).decision_function(X)
    t = time.perf_counter() - t0    
    return {'y': y, 't': t}

@memotodisk
def extended_iforest_outlier_predictor(X, **kwargs):
    """ Takes an X-array and returns outlier scores and the elapsed time, using the eif package
    """
    # eif.iForest does not accept a random_state argument, so strip it from the kwargs
    red_kwargs = {k: v for k, v in kwargs.items() if k != 'random_state'}
    t0 = time.perf_counter()
    ext_ifo = eif.iForest(X, ExtensionLevel=0, **red_kwargs)
    y = ext_ifo.compute_paths(X_in=X)
    t = time.perf_counter() - t0    
    return {'y': y, 't': t}
In [6]:
def iforest_parameter_scan(X, y, par_dict_list, verbose=False, extended=False, **kwargs):
    """ Scans over the (single-entry) parameter dicts in par_dict_list and collects
    outlier scores, AUC/AP metrics and run times for each setting
    """
    results_dict = defaultdict(list)
    for par_dict in par_dict_list:
        par_name, par_value = list(par_dict.keys())[0], list(par_dict.values())[0]
        results_dict['param_values'].append(par_value)
        results_dict['param_names'].append(par_name)
        verbose and print('Parameter name: {}. Value: {}'.format(par_name, par_value))
        if extended:
            par_results = extended_iforest_outlier_predictor(X, ntrees=100, **kwargs, **par_dict)
        else:
            par_results = iforest_outlier_predictor(X, n_estimators=100, **kwargs, **par_dict)
        auc, apr = (roc_auc_score(y, par_results['y']), 
                            average_precision_score(y, par_results['y']))
        results_dict['AUC'].append(auc); results_dict['AP'].append(apr);
        results_dict['y'].append(par_results['y'])
        results_dict['t'].append(par_results['t'])
        verbose and print(' AUC:{:.3f}, APR: {:.3f}'.format(auc, apr)) 
    return results_dict
In [7]:
def iforest_stats_gatherer(X, y, n_iterations, par_dict_list, **kwargs):
    for i in range(n_iterations):
        yield iforest_parameter_scan(X, y, par_dict_list=par_dict_list, verbose=False, random_state=i, **kwargs)
        print(f'Iteration {i} completed')
In [8]:
AUC_df = pd.read_pickle('AUC_partII.pkl') ; AP_df = pd.read_pickle('AP_partII.pkl')
AUC_df['iForest'] = '' ; AP_df['iForest'] = ''
AUC_df['ExtiForest'] = '' ; AP_df['ExtiForest'] = ''

1. Credit card dataset

See https://donernesto.github.io/blog/outlier-detection-data-preparation for details on this and other datasets

In [9]:
with open(r"data/anomaly_credit.dat","rb") as f:
    _ = np.load(f)
    _ = np.load(f)   
    X_downsampled_credit = np.load(f)
    y_downsampled_credit = np.load(f)
    X_credit = np.load(f)
    y_credit = np.load(f)

print('Total number of points downsampled: {}. Number of positives: {} (fraction: {:.2%})'.format(
            len(y_downsampled_credit), y_downsampled_credit.sum(), y_downsampled_credit.mean()))
print('Total number of points full: {}. Number of positives: {} (fraction: {:.2%})'.format(
            len(y_credit), y_credit.sum(), y_credit.mean()))
Total number of points downsampled: 28530. Number of positives: 98 (fraction: 0.34%)
Total number of points full: 284807. Number of positives: 492 (fraction: 0.17%)

Isolation Forest

We run a parameter scan on the reduced dataset (~10% of the full dataset)

In [10]:
N_iterations = 5
par_dict_list = [{'max_samples': i} for i in [128, 256, 512, 1024, 2048]]
iforest_scan_dataset1 = pd.concat(pd.DataFrame.from_dict(s) for s in 
                                  iforest_stats_gatherer(X_downsampled_credit, y_downsampled_credit, 
                                                         n_iterations=N_iterations, 
                                                         par_dict_list=par_dict_list))
iforest_scan_dataset1_stat = iforest_scan_dataset1.groupby('param_values').agg(
    {'AUC': ['mean', 'std'], 'AP':['mean', 'std']})
iforest_scan_dataset1_stat.round(3)
Iteration 0 completed
Iteration 1 completed
Iteration 2 completed
Iteration 3 completed
Iteration 4 completed
Out[10]:
param_values   AUC mean   AUC std   AP mean   AP std
128            0.965      0.003     0.330     0.073
256            0.965      0.003     0.355     0.044
512            0.966      0.003     0.353     0.035
1024           0.963      0.001     0.410     0.031
2048           0.965      0.003     0.423     0.049

NB: the std columns show the standard deviation of the scores across iterations, not the uncertainty in the mean estimate

Extended isolation forest

We run a parameter scan on the reduced dataset (~10% of the full dataset)

In [11]:
par_dict_list = [{'sample_size': i} for i in [128, 256, 512, 1024, 2048]]
ext_iforest_scan_dataset1 = pd.concat(pd.DataFrame.from_dict(s) for s in 
                                  iforest_stats_gatherer(X_downsampled_credit, y_downsampled_credit, 
                                                         par_dict_list=par_dict_list,
                                                         extended=True, 
                                                         n_iterations=5))
ext_iforest_scan_dataset1_stat = ext_iforest_scan_dataset1.groupby('param_values').agg(
    {'AUC': ['mean', 'std'], 'AP':['mean', 'std']})
ext_iforest_scan_dataset1_stat.round(3)
Iteration 0 completed
Iteration 1 completed
Iteration 2 completed
Iteration 3 completed
Iteration 4 completed
Out[11]:
param_values   AUC mean   AUC std   AP mean   AP std
128            0.962      0.001     0.359     0.043
256            0.963      0.004     0.377     0.073
512            0.963      0.002     0.398     0.058
1024           0.965      0.001     0.442     0.037
2048           0.966      0.003     0.422     0.028

Comparison

In [12]:
idx = pd.IndexSlice # NB: to make indexing of the MultiIndexed DataFrame easier
fig, ax = plt.subplots(1, 1, figsize=(10, 5))
iforest_scan_dataset1_stat.loc[:, idx['AP', 'mean']].plot(ax=ax, 
                            yerr=1/np.sqrt(N_iterations) * iforest_scan_dataset1_stat.loc[:, idx['AP', 'std']].T.values,
                                                         label='iso Forest')
ext_iforest_scan_dataset1_stat.loc[:, idx['AP', 'mean']].plot(ax=ax, 
                            yerr=1/np.sqrt(N_iterations) * ext_iforest_scan_dataset1_stat.loc[:, idx['AP', 'std']].T.values,
                                                        label='ext. iso Forest')
ax.set_ylabel('Average Precision')
ax.set_xlabel('Sample size')
ax.semilogx()
plt.tight_layout()
plt.title('Average precision, downsampled (10%) Credit dataset')
plt.legend()
plt.show()

Taking a large sample size results in significantly better average precision scores, for both the "vanilla" isolation forest and its extended version.

NB: the error bars indicate the uncertainty in the Average Precision estimate

Final predictions

In [13]:
y_predict = iforest_outlier_predictor(X_credit, n_estimators=100, random_state=0, max_samples=1024)
plot_metrics_curves(y_credit, y_predict['y'], title='Isolation Forest, Credit Card \n')
In [14]:
par_dict_list = ({'max_samples': 1024}, )
iforest_results_dataset1 = iforest_parameter_scan(X_credit, y_credit, par_dict_list=par_dict_list, 
                                                  verbose=True, random_state=0)
Parameter name: max_samples. Value: 1024
 AUC:0.954, APR: 0.180
In [15]:
print('Calculation time Isolation Forest, full dataset: {:.1f} s'.format(iforest_results_dataset1['t'][0]))
Calculation time Isolation Forest, full dataset: 29.0 s
In [16]:
par_dict_list = ({'sample_size': 1024}, )
ext_iforest_results_dataset1 = iforest_parameter_scan(X_credit, y_credit, par_dict_list=par_dict_list, 
                                                  verbose=True, extended=True)
Parameter name: sample_size. Value: 1024
 AUC:0.956, APR: 0.194
In [17]:
print('Run-times for the Credit Card dataset')
print('iForest: samples / sec (downs.): {:.1f}'.format(len(y_downsampled_credit) 
                                                       / iforest_scan_dataset1['t'].mean()))
print('iForest: samples / sec (full): {:.1f}'.format(len(y_credit) 
                                                     / iforest_results_dataset1['t'][0]))
print('Ext iForest: samples / sec (downs.): {:.1f}'.format(len(y_downsampled_credit) 
                                                           / ext_iforest_scan_dataset1['t'].mean()))
print('Ext iForest: samples / sec (full): {:.1f}'.format(len(y_credit) 
                                                         / ext_iforest_results_dataset1['t'][0]))
Run-times for the Credit Card dataset
iForest: samples / sec (downs.): 10045.1
iForest: samples / sec (full): 9823.9
Ext iForest: samples / sec (downs.): 216.8
Ext iForest: samples / sec (full): 195.6

The computation time of both the basic and the extended isolation forest depends linearly on the number of data points. However, the extended isolation forest is roughly 50 times slower, a considerable penalty.

In [18]:
AUC_df.loc['CreditCard', 'iForest']  = 0.95; AUC_df.loc['CreditCard', 'ExtiForest'] = 0.96
AP_df.loc['CreditCard', 'iForest'] = '0.18 (*0.41)'; AP_df.loc['CreditCard', 'ExtiForest'] = '0.19 (*0.44)'
In [19]:
del X_credit, y_credit, X_downsampled_credit, y_downsampled_credit

2. Paysim dataset

In [20]:
with open(r"data/anomaly_paysim.dat","rb") as f:
    X_downsampled_paysim = np.load(f,)
    y_downsampled_paysim = np.load(f,)
    X_paysim = np.load(f,)
    y_paysim = np.load(f,)
    
print('Total number of points downsampled: {}. Number of positives: {} (fraction: {:.2%})'.format(
            len(y_downsampled_paysim), y_downsampled_paysim.sum(), y_downsampled_paysim.mean()))
print('Total number of points full: {}. Number of positives: {} (fraction: {:.2%})'.format(
            len(y_paysim), y_paysim.sum(), y_paysim.mean()))
Total number of points downsampled: 277041. Number of positives: 821 (fraction: 0.30%)
Total number of points full: 2770409. Number of positives: 8213 (fraction: 0.30%)

Isolation Forest

We will do a parameter scan on the downsampled dataset.

In [21]:
N_iterations = 5
par_dict_list = [{'max_samples': i} for i in [128, 256, 512, 1024, 2048]]
iforest_scan_dataset2 = pd.concat(pd.DataFrame.from_dict(s) for s in 
                                  iforest_stats_gatherer(X_downsampled_paysim, y_downsampled_paysim, 
                                                         n_iterations=N_iterations, 
                                                         par_dict_list=par_dict_list))
iforest_scan_dataset2_stat = iforest_scan_dataset2.groupby('param_values').agg(
    {'AUC': ['mean', 'std'], 'AP':['mean', 'std']})
iforest_scan_dataset2_stat.round(3)
Iteration 0 completed
Iteration 1 completed
Iteration 2 completed
Iteration 3 completed
Iteration 4 completed
Out[21]:
param_values   AUC mean   AUC std   AP mean   AP std
128            0.930      0.006     0.230     0.112
256            0.943      0.007     0.338     0.063
512            0.958      0.006     0.365     0.071
1024           0.976      0.003     0.447     0.053
2048           0.980      0.002     0.437     0.013
In [22]:
idx = pd.IndexSlice # NB: to make indexing of the MultiIndexed DataFrame easier
fig, ax = plt.subplots(1, 1, figsize=(10, 5))
iforest_scan_dataset2_stat.loc[:, idx[:, 'mean']].plot(ax=ax, 
                                yerr=1/np.sqrt(N_iterations) * 
                                                       iforest_scan_dataset2_stat.loc[:, idx[:, 'std']].T.values)
ax.set_ylabel('Score')
ax.semilogx()
plt.tight_layout()
plt.title(f'Iso-forest downs. (10%) PaySim ({N_iterations} iterations)')
plt.show()

NB: the error bars indicate the uncertainty in the estimated means (standard error over the iterations)

Final prediction
In [23]:
par_dict_list = ({'max_samples': 1024}, )
iforest_results_dataset2 = iforest_parameter_scan(X_paysim, y_paysim, par_dict_list=par_dict_list, 
                                                  verbose=True, random_state=0)
Parameter name: max_samples. Value: 1024
 AUC:0.978, APR: 0.493

Extended Isolation Forest

Because of the expected duration of the prediction using extended isolation forest (~3 hours), we will limit ourselves to a prediction on the downsampled dataset.

In [24]:
# NB: long calculation ahead (~20 minutes)
par_dict_list = ({'sample_size': 1024}, )
ext_iforest_results_dataset2 = iforest_parameter_scan(X_downsampled_paysim, y_downsampled_paysim, 
                                                      par_dict_list=par_dict_list, verbose=True, extended=True)
Parameter name: sample_size. Value: 1024
 AUC:0.966, APR: 0.456
In [25]:
print('Run-times for the PaySim dataset')
print('iForest: samples / sec (downs.): {:.1f}'.format(len(y_downsampled_paysim) 
                                                       / iforest_scan_dataset2['t'].mean()))
print('Ext iForest: samples / sec (downs.): {:.1f}'.format(len(y_downsampled_paysim) 
                                                           / ext_iforest_results_dataset2['t'][0]))
Run-times for the PaySim dataset
iForest: samples / sec (downs.): 12174.1
Ext iForest: samples / sec (downs.): 202.4
In [26]:
AUC_df.loc['PaySim', 'iForest']  = '0.98 (*0.98)'; AUC_df.loc['PaySim', 'ExtiForest'] = '*0.97'
AP_df.loc['PaySim', 'iForest'] = '0.49 (*0.45)'; AP_df.loc['PaySim', 'ExtiForest'] = '*0.46'
In [27]:
del X_paysim, y_paysim, X_downsampled_paysim, y_downsampled_paysim

3. Forest cover dataset

In [28]:
with open(r"data/anomaly_cover.dat","rb") as f:
    _ = np.load(f, )
    _ = np.load(f, )
    X_cover = np.load(f, )
    y_cover = np.load(f, )
print('Total number of points: {}. Number of positives: {} (fraction: {:.2%})'.format(
            len(y_cover), y_cover.sum(), y_cover.mean()))
Total number of points: 286048. Number of positives: 2747 (fraction: 0.96%)

Isolation Forest

As this dataset is considerably smaller than the previous one, we can run a parameter sweep

In [29]:
par_dict_list = [{'max_samples': i} for i in [128, 256, 512, 1024, 2048]]
iforest_results_dataset3 = iforest_parameter_scan(X_cover, y_cover, par_dict_list=par_dict_list, 
                                                  verbose=True, random_state=0)
iforest_results_dataset3_stat = pd.DataFrame.from_dict(iforest_results_dataset3)
iforest_results_dataset3_stat = (iforest_results_dataset3_stat
                                     .rename(columns={'param_names':'max_samples'})
                                     .set_index('param_values'))
Parameter name: max_samples. Value: 128
 AUC:0.818, APR: 0.044
Parameter name: max_samples. Value: 256
 AUC:0.883, APR: 0.054
Parameter name: max_samples. Value: 512
 AUC:0.887, APR: 0.053
Parameter name: max_samples. Value: 1024
 AUC:0.905, APR: 0.062
Parameter name: max_samples. Value: 2048
 AUC:0.930, APR: 0.080

Extended Isolation Forest

For the Extended Isolation Forest, we evaluate only a single parameter value:

In [30]:
par_dict_list = ({'sample_size': 1024}, )
ext_iforest_results_dataset3 = iforest_parameter_scan(X_cover, y_cover, par_dict_list=par_dict_list, 
                                                  verbose=True, extended=True)
ext_iforest_results_dataset3_stat = pd.DataFrame.from_dict(ext_iforest_results_dataset3)
ext_iforest_results_dataset3_stat = (ext_iforest_results_dataset3_stat
                                     .rename(columns={'param_names':'max_samples'})
                                     .set_index('param_values'))
Parameter name: sample_size. Value: 1024
 AUC:0.887, APR: 0.049
In [31]:
fig, ax = plt.subplots(1, 1, figsize=(10, 5))
ax.plot(iforest_results_dataset3_stat.index, 
        iforest_results_dataset3_stat.loc[:, ['AP']], label='Iso Forest', marker='s')
ax.scatter(ext_iforest_results_dataset3_stat.index, ext_iforest_results_dataset3_stat['AP'],
          marker='s', label='Ext. Iso Forest', c='r')
ax.set_ylabel('Average Precision')
ax.set_xlabel('Sample size')
ax.semilogx()
plt.tight_layout()
plt.title('Average precision, Forest dataset')
plt.legend()
plt.show()

Here too, a larger sample size seems to be beneficial. However, unlike in the earlier results, the extended isolation forest performs worse than the standard one.

In [32]:
AUC_df.loc['ForestCover', 'iForest']  = '0.91'; AUC_df.loc['ForestCover', 'ExtiForest'] = '0.89'
AP_df.loc['ForestCover', 'iForest'] = '0.06'; AP_df.loc['ForestCover', 'ExtiForest'] = '0.05'
In [33]:
del X_cover, y_cover

4. MNIST dataset

Recall: this dataset consists of "randomly sampled features" (pixel intensities, though scaled in an unknown way) from hand-drawn zeros and sixes (the sixes being the outlier class).

In [34]:
with open(r"data/anomaly_mnist.dat","rb") as f:
    X_mnist = np.load(f, )
    y_mnist = np.load(f, )
print('Total number of points: {}. Number of positives: {} (fraction: {:.2%})'.format(
        len(y_mnist), y_mnist.sum(), y_mnist.mean()))
Total number of points: 7603. Number of positives: 700 (fraction: 9.21%)
In [35]:
par_dict_list = [{'max_samples': i} for i in [128, 256, 512, 1024, 2048]]
iforest_results_dataset4 = iforest_parameter_scan(X_mnist, y_mnist, par_dict_list=par_dict_list, 
                                                  verbose=True, random_state=0)
Parameter name: max_samples. Value: 128
 AUC:0.811, APR: 0.296
Parameter name: max_samples. Value: 256
 AUC:0.800, APR: 0.287
Parameter name: max_samples. Value: 512
 AUC:0.788, APR: 0.263
Parameter name: max_samples. Value: 1024
 AUC:0.803, APR: 0.281
Parameter name: max_samples. Value: 2048
 AUC:0.801, APR: 0.278

Extended Isolation Forest

In [36]:
par_dict_list = [{'sample_size': i} for i in [128, 256, 512, 1024, 2048]]
ext_iforest_results_dataset4 = iforest_parameter_scan(X_mnist, y_mnist, par_dict_list=par_dict_list, 
                                                  verbose=True, extended=True)
Parameter name: sample_size. Value: 128
 AUC:0.766, APR: 0.234
Parameter name: sample_size. Value: 256
 AUC:0.804, APR: 0.281
Parameter name: sample_size. Value: 512
 AUC:0.838, APR: 0.340
Parameter name: sample_size. Value: 1024
 AUC:0.829, APR: 0.324
Parameter name: sample_size. Value: 2048
 AUC:0.815, APR: 0.297
In [37]:
AUC_df.loc['MNIST_0/6', 'iForest']  = '0.80'; AUC_df.loc['MNIST_0/6', 'ExtiForest'] = '0.83'
AP_df.loc['MNIST_0/6', 'iForest'] = '0.28'; AP_df.loc['MNIST_0/6', 'ExtiForest'] = '0.32'

5. Conclusion

Both Isolation Forest and its extended version have some very favourable properties:

  • Practically no parameter tuning is needed (the default settings work on all 4 datasets considered)
  • The computation time is proportional to the number of data points N (rather than to N², as is the case for e.g. DBSCAN and LOF)

Rather than using a sample size of 256 (the default setting), a sample size of 1024 is recommended for both algorithms. The extended isolation forest is meant as an improvement over the standard isolation forest, which generates artefacts due to its partitioning with hyperplanes aligned with the coordinate system.

Indeed, the extended isolation forest seems to yield somewhat better results. It does so, however, at a large computational expense: computations are ~50 times slower for the datasets considered here. Whether the improved accuracy is worth this is case-dependent.
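
As a practical takeaway, a sketch of the settings that worked well across the datasets in this series (standard isolation forest, default number of trees, larger-than-default subsample size; the feature matrix below is just a random placeholder):

import numpy as np
from sklearn.ensemble import IsolationForest

X = np.random.RandomState(0).normal(size=(5000, 10))   # placeholder for the dataset at hand
ifo = IsolationForest(n_estimators=100, max_samples=1024, n_jobs=2, random_state=0)
outlier_scores = -ifo.fit(X).decision_function(X)       # higher = more anomalous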

The next and final post of this series will deal with autoencoders.

In the summary tables below, values marked with an asterisk (*) were obtained on the downsampled datasets.

In [38]:
print('Av. precision:')
AP_df.fillna('')
Av. precision:
Out[38]:
             MAD           DBSCAN   iForest        ExtiForest
CreditCard   0.47 (*0.75)  *0.12    0.18 (*0.41)   0.19 (*0.44)
PaySim       0.16                   0.49 (*0.45)   *0.46
ForestCover  0.04                   0.06           0.05
MNIST_0/6    0.14                   0.28           0.32
In [39]:
print('AUC:')
AUC_df.fillna('')
AUC:
Out[39]:
             MAD           DBSCAN   iForest        ExtiForest
CreditCard   0.95 (*0.95)  *0.84    0.95           0.96
PaySim       0.75                   0.98 (*0.98)   *0.97
ForestCover  0.76                   0.91           0.89
MNIST_0/6    0.63                   0.80           0.83
In [40]:
AUC_df.to_pickle('AUC_partIII.pkl'); AP_df.to_pickle('AP_partIII.pkl')
