
Outlier Detection Part I: MAD

This is the first post in a longer series on anomaly detection or, more specifically, outlier detection. Anomaly detection means finding data points that are somehow different from the bulk of the data (outlier detection), or different from previously seen data (novelty detection). The aim of this series is to explore which algorithms have which advantages and disadvantages for outlier detection tasks.

In this post we take the most straightforward approach: we calculate the Median Absolute Deviation from the Median (MAD), and use it as a measure of outlierness.

Note that tackling the problem in this unsupervised way, compared to a supervised solution, is rather tricky. However, it is also more realistic for many practical settings such as fraud detection: a sufficiently large and representative labelled dataset is usually not available.

MAD (Median Absolute Deviation from the Median)

The simplest and quickest outlier detection method is to calculate the median absolute deviation from the median. The second part ("absolute deviation from the median") refers to the per-feature deviation from the column median (so it works in the column direction). The outer, first median is then taken over those per-feature deviations, for each data point. This reduces multivariate data to a single outlier score per point.
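Written out (a sketch of the scoring rule as implemented further below; i indexes samples, j indexes features, and each feature's deviation is scaled by its interquartile range):

$$ \mathrm{score}(x_i) = \operatorname{median}_j\left( \frac{\lvert x_{ij} - \operatorname{median}_i(x_{ij}) \rvert}{\mathrm{IQR}_j} \right) $$

For example, a point whose normalized absolute deviations are (0.2, 3.0, 0.5) gets the score median(0.2, 3.0, 0.5) = 0.5; a single extreme feature does not dominate the score.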

Metrics

Note that we are looking at the Area Under the ROC Curve (AUC score), as well as at the Area under the Precision-Recall curve. Both have their distinct advantages:

  • AUC allows comparison across different class imbalances: the TPR and FPR are within-class rates, so they do not depend on the class sizes
  • The Area under the Precision-Recall curve ("Average Precision" or "APR") answers more practical questions: how many false positives can be expected if we want to find a certain fraction of the positives? Or the other way around: given that we can sample N points, how many actual positives can we expect to find?

So whereas the AUC is a universal measure, the APR can only be interpreted when the class imbalance is known. On the other hand, the APR gives us a more tangible metric.
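
To make this concrete, here is a small synthetic sketch (purely illustrative; the names below are not used elsewhere in this post). The per-class score distributions are kept fixed while the number of negatives is increased: the AUC stays roughly the same, while the APR drops.

import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.RandomState(0)

def toy_scores(n_pos, n_neg):
    # identical per-class score distributions; only the class sizes differ
    y = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
    scores = np.concatenate([rng.normal(1, 1, n_pos), rng.normal(0, 1, n_neg)])
    return y, scores

for n_pos, n_neg in [(1000, 10000), (1000, 100000)]:
    y, scores = toy_scores(n_pos, n_neg)
    print('positive fraction {:.3f}: AUC {:.3f}, APR {:.3f}'.format(
        y.mean(), roc_auc_score(y, scores), average_precision_score(y, scores)))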

Data pre-processing was already done in a previous post, so we can get started immediately.

In [1]:
from collections import defaultdict

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams.update({'font.size': 12})
from sklearn.metrics import (auc, average_precision_score, 
                              roc_auc_score, roc_curve, precision_recall_curve)
from sklearn.preprocessing import StandardScaler, RobustScaler
In [3]:
def plot_metrics_curves(y_true, y_pred, title='', label_extra='',  **kwargs):
    """
    Plots roc and precision-recall curves

    Arguments
    =========
    y_true (iterator) : actual labels
    y_pred (iterator) : predicted labels
    title (str) : overall title (suptitle) for the figure
    label_extra (str) : prefix for the legend labels
    **kwargs are passed on to plt.plot()

    """
    x_roc, y_roc, _ = roc_curve(y_true, y_pred)
    y_prc, x_prc, _ = precision_recall_curve(y_true, y_pred)

    fig, axs = plt.subplots(1, 2, figsize=(10, 4))
    axs[0].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    axs[0].plot(x_roc, y_roc, color='darkorange', lw=2,
                label='{}ROC curve (area {:.2f})'.format(label_extra, roc_auc_score(y_true, y_pred)),
                            **kwargs)
    axs[1].plot(x_prc, y_prc, color='darkorange', lw=2,
                label='{}PRC curve (area {:.2f})'.format(label_extra, average_precision_score(y_true, y_pred)),
                            **kwargs)
    for ax, labels in zip(axs, (('FPR', 'TPR (recall)'), ('TPR (recall)', 'precision'))):
        ax.set_xlim([0.0, 1.0])
        ax.set_ylim([0.0, 1.05])
        ax.set_xlabel(labels[0])
        ax.set_ylabel(labels[1])
        ax.legend(loc="lower right")
    plt.suptitle(title + 'N={} (frac_pos={:.2%})'.format(len(y_true), y_true.mean()))
    plt.tight_layout(rect=(0.0, 0.0, 1.0, 0.9))
    plt.show()
In [4]:
def mad(X):
    """ Returns, for each sample (row), the median over the features of the absolute
    deviation, with each feature (column) normalized by its interquartile range (IQR).
    Note: X is assumed to be median-centered (see the data preparation post), so that
    np.abs(X) is the absolute deviation from the column medians.
    """
    iqr = np.quantile(X, 0.75, axis=0) - np.quantile(X, 0.25, axis=0)
    # the IQR may be zero for some features; to avoid division by zero, fall back on the variance
    iqr[np.where(iqr == 0)] = np.var(X, axis=0)[np.where(iqr == 0)]
    return np.median(np.abs(X) / iqr, axis=1)

def generate_auc_ap_tuple(X, y):
    # compute the MAD scores once, then evaluate both metrics
    scores = mad(X)
    return roc_auc_score(y, scores), average_precision_score(y, scores)
In [5]:
AUC_metrics_dict = {}; AUC_metrics_dict['MAD'] = [''] * 4
AP_metrics_dict = {}; AP_metrics_dict['MAD'] = [''] * 4

1. Credit card dataset

See https://donernesto.github.io/blog/outlier-detection-data-preparation for details on this and other datasets

In [6]:
with open(r"data/anomaly_credit.dat","rb") as f:
    X_downsampled_credit = np.load(f)
    y_downsampled_credit = np.load(f)
    X_credit = np.load(f)
    y_credit = np.load(f)
print('Total number of points: {}. Number of positives: {} (fraction: {:.2%})'.format(
            len(y_credit), y_credit.sum(), y_credit.mean()))
Total number of points: 28530. Number of positives: 98 (fraction: 0.34%)
In [7]:
# Verify the scaling of the features: the IQR should be normalized at 1
np.quantile(X_credit, 0.75, axis=0) - np.quantile(X_credit, 0.25, axis=0)
Out[7]:
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
In [8]:
print('Credit Card (downsampled dataset), AUC: {:.3f}, APR: {:.3f}'.format(*generate_auc_ap_tuple(X_downsampled_credit, 
                                                                                        y_downsampled_credit)))
print('Credit Card (full dataset)       , AUC: {:.3f}, APR: {:.3f}'.format(*generate_auc_ap_tuple(X_credit, 
                                                                                        y_credit)))
Credit Card (downsampled dataset), AUC: 0.946, APR: 0.746
Credit Card (full dataset)       , AUC: 0.949, APR: 0.467
In [9]:
AUC_metrics_dict['MAD'][0]='0.95 (*0.95)'
AP_metrics_dict['MAD'][0]='0.47 (*0.75)'
In [10]:
prediction_df = pd.DataFrame(index=range(len(y_credit)))
prediction_df['y_true'] = y_credit
prediction_df['MAD'] = mad(X_credit)
ax = sns.violinplot(y='MAD', x="y_true", hue="y_true",
                   data=prediction_df, palette="muted");
ax.legend(loc=10)
plt.show();
plot_metrics_curves(y_credit, prediction_df['MAD'], title='MAD, Credit Card \n')

As demonstrated further below, an MSE-based (mean squared error) criterion performs clearly worse than the MAD in terms of average precision (despite a slightly higher AUC). For the full dataset, the MAD achieves an AUC of 0.95 and an average precision of 0.47.

For the downsampled dataset, the AUC is practically identical (as may be expected, since the AUC does not depend on the class imbalance), while the average precision is higher, at 0.75.

These are very good results for such a computationally cheap approach. Note that this dataset might be particularly favourable for this algorithm: the features (apart from the Amount feature) are uncorrelated, since they are the product of a PCA transformation. The "naive" approach of aggregating the per-feature deviations independently is in this case appropriate.
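
As a quick sanity check of this claim, one can look at the largest off-diagonal entry of the feature correlation matrix (a minimal sketch, using X_credit as loaded above):

# quick check: PCA-derived features should be (nearly) uncorrelated
corr = np.corrcoef(X_credit, rowvar=False)
off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
print('largest |off-diagonal| correlation: {:.3f}'.format(np.abs(off_diag).max()))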

As a comparison, the result for an MSE-based criterion is shown below.

In [11]:
# Example: using the MSE (mean squared error) of the standardized features as outlier score.
# Note that mse() centers and scales the features itself (StandardScaler).
def mse(X):
    X = StandardScaler().fit_transform(X)
    return np.mean(X**2, axis=1)

def generate_mse_auc_ap_tuple(X, y):
    # compute the MSE scores once, then evaluate both metrics
    scores = mse(X)
    return roc_auc_score(y, scores), average_precision_score(y, scores)

X_credit_std = StandardScaler().fit_transform(X_credit)
print('MSE (full dataset),        AUC: {:.3f}, APR: {:.3f}'.format(*generate_mse_auc_ap_tuple(
                    X_credit_std, y_credit)))
MSE (full dataset),        AUC: 0.965, APR: 0.281

2. PaySim dataset

The PaySim dataset contains binary features, and an approach such as MAD, which targets continuous variables, seems less appropriate for it (a quick check of the binary features follows after loading the data). We will nevertheless test it.

In [12]:
with open(r"data/anomaly_paysim.dat","rb") as f:
    X_downsampled_paysim = np.load(f,)
    y_downsampled_paysim = np.load(f,)
    X_paysim = np.load(f,)
    y_paysim = np.load(f,)
print('Total number of points: {}. Number of positives: {} (fraction: {:.2%})'.format(
            len(y_paysim), y_paysim.sum(), y_paysim.mean()))
Total number of points: 2770409. Number of positives: 8213 (fraction: 0.30%)
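
The claim that (some of) the features are binary can be verified quickly by counting the distinct values per feature (a minimal sketch, using X_paysim as loaded above):

# count the distinct values per feature; binary (e.g. one-hot encoded) features have exactly 2
n_unique = [np.unique(X_paysim[:, j]).size for j in range(X_paysim.shape[1])]
print('distinct values per feature:', n_unique)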
In [13]:
auc, ap = generate_auc_ap_tuple(X_paysim, y_paysim)
print('PaySim:  AUC: {:.3f}, APR: {:.3f}'.format(auc, ap))
AUC_metrics_dict['MAD'][1], AP_metrics_dict['MAD'][1] = np.round(auc, 2), np.round(ap, 2)
PaySim:  AUC: 0.749, APR: 0.159

A modest performance, but one that still beats random sampling by far: for random scoring, the expected APR would equal the minority fraction of 0.003.
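
This baseline is easy to check directly (a minimal sketch, using the y_paysim labels loaded above): random scores should yield an AUC close to 0.5 and an APR close to the positive fraction.

# reference point: random scores give, in expectation, AUC = 0.5 and APR = positive fraction
rng = np.random.RandomState(0)
random_scores = rng.rand(len(y_paysim))
print('Random scores:  AUC: {:.3f}, APR: {:.3f}'.format(
        roc_auc_score(y_paysim, random_scores),
        average_precision_score(y_paysim, random_scores)))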

3. Forest cover dataset

This dataset is purely numerical again; let's see how MAD performs here.

In [14]:
with open(r"data/anomaly_cover.dat","rb") as f:
    X_downsampled_cover = np.load(f, )
    y_downsampled_cover = np.load(f, )
    X_cover = np.load(f, )
    y_cover = np.load(f, )
print('Total number of points: {}. Number of positives: {} (fraction: {:.2%})'.format(
            len(y_cover), y_cover.sum(), y_cover.mean()))
Total number of points: 286048. Number of positives: 2747 (fraction: 0.96%)
In [15]:
# Verify the scaling of the features: the IQR should be normalized at 1
np.quantile(X_cover, 0.75, axis=0) - np.quantile(X_cover, 0.25, axis=0)
Out[15]:
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
In [16]:
auc, ap = generate_auc_ap_tuple(X_cover, y_cover)
print('Forest Cover:  AUC: {:.3f}, APR: {:.3f}'.format(auc, ap))
AUC_metrics_dict['MAD'][2], AP_metrics_dict['MAD'][2] = np.round(auc, 2), np.round(ap, 2)
Forest Cover:  AUC: 0.763, APR: 0.036

The performance is not overwhelming, but again it beats random sampling by far: the method is still useful.

4. MNIST dataset

Recall: this dataset consists of "randomly sampled features" (pixel intensities, though scaled in an unknown way) from hand-drawn zeros and sixes; the sixes are the outlier class.

In [17]:
with open(r"data/anomaly_mnist.dat","rb") as f:
    X_mnist = np.load(f, )
    y_mnist = np.load(f, )
print('Total number of points: {}. Number of positives: {} (fraction: {:.2%})'.format(
        len(y_mnist), y_mnist.sum(), y_mnist.mean()))
Total number of points: 7603. Number of positives: 700 (fraction: 9.21%)
In [18]:
print('MAD (full dataset)       , AUC: {:.3f}, APR: {:.3f}'.format(*generate_auc_ap_tuple(X_mnist, 
                                                                                        y_mnist)))
MAD (full dataset)       , AUC: 0.626, APR: 0.140
In [19]:
auc, ap = generate_auc_ap_tuple(X_mnist, y_mnist)
print('MNIST:  AUC: {:.3f}, APR: {:.3f}'.format(auc, ap))
AUC_metrics_dict['MAD'][3], AP_metrics_dict['MAD'][3] = np.round(auc, 2), np.round(ap, 2)
MNIST:  AUC: 0.626, APR: 0.140

The AUC is somewhat better than random. Note that with an outlier fraction of almost 10%, an average precision of 14% is not great: the lift over random sampling is only about 1.5.

Conclusion

The Median Absolute Deviation from the Median (MAD) algorithm is very well suited to the first dataset. This is because its features are uncorrelated (they are the product of a PCA transformation), which makes the univariate, feature-wise approach successful.

For the other datasets, the performance is still decent, with AUC values between about 0.6 and 0.75, and average precisions clearly larger than the outlier class fraction.

In [20]:
datasets_names = ['CreditCard', 'PaySim', 'ForestCover', 'MNIST_0/6']
AUC_df = pd.DataFrame.from_dict(AUC_metrics_dict); AUC_df.index=datasets_names
AP_df = pd.DataFrame.from_dict(AP_metrics_dict); AP_df.index=datasets_names
In [21]:
print('AUC:')
AUC_df
AUC:
Out[21]:
             MAD
CreditCard   0.95 (*0.95)
PaySim       0.75
ForestCover  0.76
MNIST_0/6    0.63
In [22]:
print('Av. precision:')
AP_df
Av. precision:
Out[22]:
             MAD
CreditCard   0.47 (*0.75)
PaySim       0.16
ForestCover  0.04
MNIST_0/6    0.14

* Downsampled data

In [23]:
AUC_df.to_pickle('AUC_partI.pkl')
AP_df.to_pickle('AP_partI.pkl')
