Outlier detection: data preparation

Data preparation and feature engineering for Outlier Detection

Anomaly detection means finding data points that differ markedly from the bulk of the data (outlier detection), or from previously seen data (novelty detection). We will focus on the first type: outlier detection. This code prepares the data for use with various algorithms in later posts.

Features are log-transformed when heavily right-tailed, median-imputed when they contain null or null-like values (sometimes with an added indicator column), and dummy-encoded when categorical. Finally, the data is scaled and centered.
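In code, the recipe looks roughly like the following sketch on a small made-up DataFrame (the column names amount, balance, and type are placeholders for illustration, not taken from the datasets below):

# Illustrative sketch of the preprocessing recipe on a toy DataFrame
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'amount': [1.0, 10.0, 1000.0, 0.0],
                   'balance': [0.0, 50.0, 0.0, 20.0],
                   'type': ['A', 'B', 'A', 'B']})

df['amount'] = np.log10(0.01 + df['amount'])                     # log-transform a right-tailed feature
df['balance_zero_ind'] = (df['balance'] == 0).astype(int)        # indicator column for null-like zeros
df.loc[df['balance'] == 0, 'balance'] = df['balance'].median()   # median imputation of the zeros
df = pd.get_dummies(df, columns=['type'], drop_first=True)       # dummy-encode the categorical feature
df[['amount', 'balance']] = StandardScaler().fit_transform(df[['amount', 'balance']])  # scale and center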

Dataset

1. Credit card fraud dataset

https://www.kaggle.com/mlg-ulb/creditcardfraud

This dataset has roughly 285,000 rows and 31 columns; most of the features are the result of a PCA transformation. From the kaggle website: "The datasets contains transactions made by credit cards in September 2013 by european cardholders. This dataset presents transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced, the positive class (frauds) account for 0.172% of all transactions."

All features are numeric and, as a consequence of the PCA transformation, 28 of the 29 features are mutually uncorrelated; only the "Amount" column correlates with the others. Note that the features themselves contain strong outliers: many points lie far from the median, even dozens of IQR-distances away.

2. PaySim synthetic dataset

https://www.kaggle.com/ntnu-testimon/paysim1

This dataset has 6.4 million rows and 11 columns and is synthetic (artificially generated); according to the author, however, the generating algorithm was fed with real transaction data. The dataset contains one categorical variable.

3. Forest cover dataset

http://odds.cs.stonybrook.edu/forestcovercovertype-dataset/

The original ForestCover/Covertype dataset from the UCI machine learning repository is a multiclass classification dataset. The dataset from the website linked above has only 10 quantitative attributes, rather than the full 54 from the original data. From the website: "Instances from class 2 are considered as normal points and instances from class 4 are anomalies. The anomalies ratio is 0.9%. Instances from the other classes are omitted." The dataset has 286,048 rows and 10 feature columns (plus the class label).

4. MNIST dataset

http://odds.cs.stonybrook.edu/mnist-dataset/

This modified MNIST dataset contains the zeros from the original MNIST dataset, with sixes added as outliers. 100 features of the original 784 (number of pixels) were randomly sampled. The dataset has 7603 rows and 100 columns (thus, a lot of features for a relatively small number of observations), and has a rather large fraction of outliers: 9.2%.

Because there is a lot of ground to cover, we will not do much data exploration or go into the details of parameter tuning, but focus on the specifics of the algorithms and their relative performance.

In [1]:
# Standard library imports
import time
import os

# Third party library imports
import scipy.io
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

plt.rcParams.update({'font.size': 12})
In [2]:
def downsample_scale_split_df(df_full, y_column='Class', frac_negative=1, frac_positive=1, scaler=RobustScaler,
                              random_state=1, verbose=False):
    """ Return downsampled X and y, with prescribed downsampling of negatives and positives.
    The labels (y) should take the values 0 and 1 and be located in y_column.
    X is additionally scaled using the passed scaler (unless scaler is None).

    Arguments
    =========
    df_full (pd.DataFrame) : data to be processed
    y_column (str) : name of the column containing the class labels
    frac_negative (float) : fraction of negatives to keep in the returned data
    frac_positive (float) : fraction of positives to keep in the returned data
    scaler : scikit-learn scaler class, or None to skip scaling

    Returns
    ========
    downsampled X (a NumPy array if a scaler is given, otherwise a DataFrame) and downsampled y (Series)
    """
    df_downsampled = (pd.concat([df_full.loc[df_full[y_column] == 0].sample(frac=frac_negative,
                                                                        random_state=random_state),
                                df_full.loc[df_full[y_column] == 1].sample(frac=frac_positive,
                                                                       random_state=random_state)])
                              .sample(frac=1, random_state=random_state)) # a random shuffle to mix both classes
    X_downsampled = df_downsampled.loc[:, df_full.columns != y_column]
    y_downsampled = df_downsampled.loc[:, y_column]
    if scaler is not None:
        X_downsampled = scaler().fit_transform(X_downsampled) # Scale the data
    if verbose:
        print('Number of points: {}, number of positives: {} ({:.2%})'.format(
            len(y_downsampled), y_downsampled.sum(), y_downsampled.mean()))
    return X_downsampled, y_downsampled

1. Credit Card dataset

In [3]:
df_full = pd.read_csv(r"../data/creditcard.csv")
df_full = df_full.drop('Time', axis=1)
df_full = df_full.sample(frac=1) # Shuffle the data set
df_full.shape
num_neg = (df_full.Class==0).sum()
num_pos = df_full.Class.sum()
print('Number of positive / negative samples: {} / {}'.format(num_pos, num_neg))
print('Fraction of positives: {:.2%}'.format(num_pos/num_neg))
Number of positive / negative samples: 492 / 284315
Fraction of positives: 0.17%
In [4]:
print(df_full.columns)
print(df_full.shape)
Index(['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
       'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21',
       'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount', 'Class'],
      dtype='object')
(284807, 30)

Minimal feature transformation (before the general scaling) is needed; only the Amount feature will be log-transformed. Adding 0.01 before taking the logarithm avoids log(0) and yields a nicely symmetric distribution.

In [5]:
df_full.Amount = np.log(0.01 + df_full.Amount)
df_full.Amount.plot(kind='box')
plt.show()

Boxplots of the first features (after robust scaling) confirm that they are characterized by strong outliers, i.e. large deviations from the median measured in IQR units:

In [6]:
df_scaled = pd.concat((pd.DataFrame(RobustScaler().fit_transform(df_full.iloc[:, :-1])), df_full.Class), axis=1)
df_scaled.columns = df_full.columns
In [7]:
range_1 = np.r_[1:10, 29]
df_long = pd.melt(df_scaled.iloc[:, range_1], 
                  "Class", var_name="feature", value_name="value")
sns.factorplot("feature", hue="Class", y="value", data=df_long, kind="box", size=5, aspect=2)
plt.ylim([-20, 20])
plt.show()
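This can also be quantified with a quick check: since RobustScaler centers each feature on its median and divides by the IQR, the absolute scaled values directly measure the distance from the median in IQR units. A minimal sketch, using df_scaled from the cell above:

# Maximum distance from the median per feature, in IQR units
# (the Class column is excluded)
print(df_scaled.iloc[:, :-1].abs().max())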

We will now create the datasets for further use: the full dataset plus two downsampled versions (keeping 20% of the positives, and respectively 10% of the negatives or only about 3,000 negatives), because some algorithms cannot handle the full data computationally.

In [8]:
X_credit, y_credit = downsample_scale_split_df(df_full, verbose=1, random_state=1)
X_downsampled_credit, y_downsampled_credit = downsample_scale_split_df(df_full, frac_positive=0.2, 
                                                            frac_negative=0.1, verbose=1, random_state=1,
                                                            scaler=RobustScaler)

X_downsampled2_credit, y_downsampled2_credit = downsample_scale_split_df(df_full, frac_positive=0.2, 
                                                            frac_negative=3000/len(y_credit), verbose=1, random_state=1,
                                                            scaler=RobustScaler)
Number of points: 284807, number of positives: 492 (0.17%)
Number of points: 28530, number of positives: 98 (0.34%)
Number of points: 3093, number of positives: 98 (3.17%)
In [9]:
with open(r"data/anomaly_credit.dat","wb") as f:
    np.save(f, X_downsampled2_credit)
    np.save(f, y_downsampled2_credit)    
    np.save(f, X_downsampled_credit)
    np.save(f, y_downsampled_credit)
    np.save(f, X_credit)
    np.save(f, y_credit)
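Since the arrays are stacked into a single file with repeated np.save calls, they can be read back in later posts with matching np.load calls on the same file handle, in the same order; a minimal sketch:

# Read the arrays back in the order they were written
# (repeated np.load calls advance through the file)
with open(r"data/anomaly_credit.dat", "rb") as f:
    X_downsampled2_credit = np.load(f)
    y_downsampled2_credit = np.load(f)
    X_downsampled_credit = np.load(f)
    y_downsampled_credit = np.load(f)
    X_credit = np.load(f)
    y_credit = np.load(f)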

2. PaySim dataset

In [10]:
df_full = pd.read_csv(r"../data/PS_20174392719_1491204439457_log.csv")
df_full.head()
print('Shape of DataFrame: {}'.format(df_full.shape))
print('Fraction of positives: {:.3%}'.format(df_full.isFraud.mean()))
Shape of DataFrame: (6362620, 11)
Fraction of positives: 0.129%
In [11]:
(df_full.groupby('type')['isFraud']
     .agg(['count', 'mean'])
     .sort_values('count', ascending=False)
     .rename(columns={'mean':'fraction of positives'}))
Out[11]:
type       count    fraction of positives
CASH_OUT   2237500  0.001840
PAYMENT    2151495  0.000000
CASH_IN    1399284  0.000000
TRANSFER    532909  0.007688
DEBIT        41432  0.000000

Fraud is present in only two categories: TRANSFER and CASH_OUT. We treat this as domain knowledge and keep only these two categories. Furthermore, we implement some feature transformations.

In [12]:
df_full = df_full.loc[(df_full.type == 'CASH_OUT') | (df_full.type == 'TRANSFER'), :]
df_full['errorBalanceOrig'] = df_full.newbalanceOrig + df_full.amount - df_full.oldbalanceOrg
df_full['errorBalanceDest'] = df_full.oldbalanceDest + df_full.amount - df_full.newbalanceDest
df_full = df_full.drop(['nameOrig', 'nameDest', 'isFlaggedFraud'], axis = 1)
df_full = df_full.rename(columns={'isFraud': 'Class'})
In [13]:
# Log-transform the amount and balance columns (which are right-tailed)
cols_to_logtransform = [col for col in df_full.columns if 'balance' in col.lower() or 'amount' in col.lower()]
for col in cols_to_logtransform:
    print('Transforming column', col)
    col_min_val = df_full[col].min()
    # df_full[col] = df_full[col].apply(lambda x: np.log10(1 - col_min_val + x))
    df_full[col] = np.log10(1 - col_min_val + df_full[col])  # the vectorized operation yields an order-of-magnitude speed-up over apply
Transforming column amount
Transforming column oldbalanceOrg
Transforming column newbalanceOrig
Transforming column oldbalanceDest
Transforming column newbalanceDest
Transforming column errorBalanceOrig
Transforming column errorBalanceDest
In [14]:
# the step column gives the time in hours; take it modulo 24 to get the hour of the day
df_full['step'] = np.mod(df_full['step'], 24)

Dummy-encode the categorical type column, dropping the first category to avoid collinearity.

In [15]:
df_full = pd.get_dummies(df_full, columns=["type"], drop_first=True)

Some further feature engineering: the balance columns are a mixture of a delta peak at zero and a roughly normal distribution (after the log transform). We replace the zero values by the column median and add indicator columns marking where the replacement happened.

In [16]:
for col in ['oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']:
    df_full[col + '_zero_ind'] = (df_full[col] == 0).astype(int)
    df_full.loc[df_full[col] == 0, col] = df_full[col].median()

    
In [17]:
num_neg = (df_full.Class==0).sum()
num_pos = df_full.Class.sum()
print('Number of positive / negative samples: {} / {}'.format(num_pos, num_neg))
print('Fraction of positives: {:.2%}'.format(num_pos/len(df_full)))
Number of positive / negative samples: 8213 / 2762196
Fraction of positives: 0.30%
In [18]:
df_full = pd.concat([df_full.loc[:, df_full.columns != 'Class'], 
                    df_full.loc[:, 'Class']], axis=1) # Make the label the final column
print(df_full.shape)
(2770409, 14)

Do a standard scaling of all columns, except the 0/1-encoded dummy and indicator columns.

In [19]:
df_full.iloc[:, :8] = StandardScaler().fit_transform(df_full.iloc[:, :8])
In [20]:
df_full.sample(5)
Out[20]:
step amount oldbalanceOrg newbalanceOrig oldbalanceDest newbalanceDest errorBalanceOrig errorBalanceDest type_TRANSFER oldbalanceOrg_zero_ind newbalanceOrig_zero_ind oldbalanceDest_zero_ind newbalanceDest_zero_ind Class
3937753 1.670886 0.536248 1.011051 -0.327568 2.036591 2.038521 0.463979 0.028012 0 0 1 0 0 0
1083621 -1.575385 -0.363190 0.898819 -0.327568 -0.129012 -1.525351 0.111739 0.028012 0 0 1 1 0 0
4933662 -0.326819 -0.022368 -0.658138 -0.327568 -0.129012 -1.225735 0.320119 0.028012 0 0 1 1 0 0
5169411 1.670886 0.318340 -0.869551 -0.327568 1.501091 1.479247 0.432016 0.028012 0 1 1 0 0 0
2825150 -1.325672 0.794151 -0.869551 -0.327568 0.002054 0.194055 0.586963 0.028012 0 1 1 0 0 0
In [21]:
X_paysim, y_paysim = df_full.loc[:, df_full.columns != 'Class'], df_full.Class
X_downsampled_paysim, y_downsampled_paysim = downsample_scale_split_df(df_full, scaler=None, verbose=1, 
                                                            frac_negative=0.1,
                                                            frac_positive=0.1)
Number of points: 277041, number of positives: 821 (0.30%)
In [22]:
with open(r"data/anomaly_paysim.dat","wb") as f:
    np.save(f, X_downsampled_paysim)
    np.save(f, y_downsampled_paysim)
    np.save(f, X_paysim)
    np.save(f, y_paysim)

3. Forest cover dataset

In [23]:
mat = scipy.io.loadmat(r'../data/cover.mat')
df_full = pd.concat((pd.DataFrame(data=mat['X']), pd.DataFrame(data=mat['y'])), axis=1)
df_full.columns = list(range(df_full.shape[1]-1)) + ['Class']
In [24]:
df_full.describe()
Out[24]:
0 1 2 3 4 5 6 7 8 9 Class
count 286048.000000 286048.000000 286048.000000 286048.000000 286048.000000 286048.000000 286048.000000 286048.000000 286048.000000 286048.000000 286048.000000
mean 2914.242610 151.917224 13.598309 278.255251 45.839107 2414.978643 213.983685 225.246605 142.680092 2155.583857 0.009603
std 197.987324 107.488551 7.138464 210.458091 57.504597 1618.090012 24.955931 18.551910 36.501454 1423.976520 0.097525
min 1988.000000 0.000000 0.000000 0.000000 -173.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 2788.000000 60.000000 8.000000 120.000000 8.000000 1123.000000 201.000000 215.000000 120.000000 1165.000000 0.000000
50% 2933.000000 126.000000 13.000000 240.000000 30.000000 2016.000000 219.000000 227.000000 142.000000 1832.000000 0.000000
75% 3041.000000 241.000000 18.000000 390.000000 67.000000 3386.000000 232.000000 239.000000 167.000000 2647.000000 0.000000
max 3433.000000 360.000000 66.000000 1397.000000 601.000000 7117.000000 254.000000 254.000000 254.000000 7173.000000 1.000000
In [25]:
df_full.isna().sum().sum()
Out[25]:
0
In [26]:
df_scaled = pd.concat((pd.DataFrame(RobustScaler().fit_transform(df_full.iloc[:, :-1])), df_full.Class), axis=1)
df_scaled.columns = df_full.columns
In [27]:
df_long = pd.melt(df_scaled, 
                  "Class", var_name="feature", value_name="value")
sns.factorplot("feature", hue="Class", y="value", data=df_long, kind="box", size=5, aspect=2)
plt.ylim((-5, 5))
plt.show()

Note the large difference in feature 0 between the two classes: in a supervised setting, very good performance could easily be achieved (a quick baseline sketch follows after the next plot). Note also that the features are nicely symmetric when not split by class:

In [28]:
df_long = pd.melt(df_scaled, 
                  "Class", var_name="feature", value_name="value")
sns.factorplot("feature", y="value", data=df_long, kind="box", size=5, aspect=2)
plt.ylim((-5, 5))
plt.show()
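As an aside, a minimal supervised baseline (purely illustrative, since the outlier detection posts themselves stay unsupervised) could look like the sketch below, e.g. with scikit-learn's LogisticRegression on the scaled features:

# Illustrative supervised baseline on the robust-scaled cover data
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_sup = df_scaled.loc[:, df_scaled.columns != 'Class'].values
y_sup = df_scaled['Class'].values
X_tr, X_te, y_tr, y_te = train_test_split(X_sup, y_sup, stratify=y_sup, random_state=1)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print('ROC AUC: {:.3f}'.format(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])))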
In [29]:
df_scaled.Class.value_counts()
Out[29]:
0    283301
1      2747
Name: Class, dtype: int64
In [30]:
X_cover, y_cover = df_scaled.loc[:, df_scaled.columns != 'Class'], df_scaled.Class
X_downsampled_cover, y_downsampled_cover = downsample_scale_split_df(df_scaled, scaler=None, verbose=1, 
                                                            frac_negative=0.2,
                                                            frac_positive=0.2)
Number of points: 57209, number of positives: 549 (0.96%)
In [31]:
with open(r"data/anomaly_cover.dat","wb") as f:
    np.save(f, X_downsampled_cover)
    np.save(f, y_downsampled_cover)
    np.save(f, X_cover)
    np.save(f, y_cover)

4. MNIST dataset

In the original MNIST dataset the feature values range between 0 and 255, since they are 8-bit intensity values. The features here appear to be different, as negative values occur regularly in some of them. I will therefore rescale them to the range 0 to 1.

In [32]:
mat = scipy.io.loadmat(r'../data/mnist.mat')
df_full = pd.concat((pd.DataFrame(data=mat['X']), pd.DataFrame(data=mat['y'])), axis=1)
df_full.columns = list(range(df_full.shape[1]-1)) + ['Class']
print(df_full.shape)
print('Positive class fraction: {:.3%}'.format(df_full.Class.mean()))
(7603, 101)
Positive class fraction: 9.207%
In [34]:
X_min = mat['X'].min()  # global minimum over all pixel values
X_max = mat['X'].max()  # global maximum over all pixel values

Note that some features have zero variance: these might be pixels in the corners that are always black. We will remove those.

In [36]:
df_full = df_full.loc[:, df_full.var(axis=0) > 1E-6]
print(df_full.shape)
(7603, 79)
In [37]:
df_scaled = pd.concat(((df_full.iloc[:, :-1] - X_min)/(X_max - X_min), df_full.Class), axis=1)
df_scaled.columns = df_full.columns
In [38]:
df_long = pd.melt(df_scaled.iloc[:, -10:], 
                  "Class", var_name="feature", value_name="value")
sns.factorplot("feature", hue="Class", y="value", data=df_long, kind="box", size=5, aspect=2)
plt.show()
In [39]:
X_mnist, y_mnist = df_scaled.loc[:, df_scaled.columns != 'Class'], df_scaled.Class  # use the 0-1 rescaled features
with open(r"data/anomaly_mnist.dat","wb") as f:
    np.save(f, X_mnist)
    np.save(f, y_mnist)

This concludes the data pre-processing. Time for the more interesting stuff: outlier detection!
