Optimizing with sklearn's GridSearchCV and Pipeline

Using sklearn's GridSearchCV and Pipeline for hyperparameter optimization

Scikit-learn has built-in functionality to scan for the best combination of hyperparameters (such as regularization strength or length-scale parameters) in an efficient manner.

With the Pipeline class, we can also include data-preprocessing steps such as standardization or PCA in the search. This is a real time-saver: no more writing cross-validation loops by hand, calling .fit() on each training fold, computing the score on the held-out fold and storing the results in a dataframe.
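To make the benefit concrete, here is a minimal sketch (on synthetic data, not the census data used below) of the kind of manual loop that Pipeline and GridSearchCV replace:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.RandomState(0)
X = rng.randn(500, 5)
y = (X[:, 0] + 0.5 * rng.randn(500) > 0).astype(int)

results = {}
for C in np.logspace(-2, 2, 5):
    fold_scores = []
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        sc = StandardScaler().fit(X[train_idx])  # fit the scaler on the training fold only
        clf = LogisticRegression(C=C).fit(sc.transform(X[train_idx]), y[train_idx])
        fold_scores.append(f1_score(y[val_idx], clf.predict(sc.transform(X[val_idx]))))
    results[C] = np.mean(fold_scores)  # mean CV score per candidate C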

We will use the pre-processed data from

https://archive.ics.uci.edu/ml/machine-learning-databases/adult/

that was explored in detail in a previous post.

In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
import numpy as np
import matplotlib
plt.rcParams.update({'font.size': 14})

from sklearn.linear_model import LogisticRegression # Classifier 1
from sklearn.ensemble import RandomForestClassifier # Classifier 2
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, make_scorer, roc_curve, auc
from sklearn.model_selection import GridSearchCV # to optimize hyperparams
from sklearn.preprocessing import StandardScaler # to standardize numerical features
from sklearn.pipeline import Pipeline # to pass the standardization along with the classifier to GridSearchCV

0. Import data

In [2]:
# after downloading, store locally. Set the path here
data_dir = '../data/'
# Read the preprocessed data (stored in the previous post) from disk
X_train = pd.read_pickle(path=data_dir + 'X_train')
X_test = pd.read_pickle(path=data_dir + 'X_test')
X_train_ohe = pd.read_pickle(path=data_dir + 'X_train_ohe')
X_test_ohe = pd.read_pickle(path=data_dir + 'X_test_ohe')
y_train = pd.read_pickle(path=data_dir + 'y_train')
y_test = pd.read_pickle(path=data_dir + 'y_test')

1. Classification

The data set is already cleaned-up, and one-hot encoded:

In [3]:
X_train_ohe.describe()
Out[3]:
education_num age hoursperweek capital_gain capital_loss sex_Male race_Asian-Pac-Islander race_Black race_Other race_White workclass_Local-gov workclass_Never-worked workclass_Private workclass_Self-emp-inc workclass_Self-emp-not-inc workclass_State-gov workclass_Without-pay
count 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000
mean 10.080679 38.581647 40.437456 1077.648844 87.303830 0.669205 0.031909 0.095943 0.008323 0.854274 0.064279 0.000215 0.753417 0.034274 0.078038 0.039864 0.000430
std 2.572720 13.640433 12.347429 7385.292085 402.960219 0.470506 0.175761 0.294518 0.090851 0.352837 0.245254 0.014661 0.431029 0.181935 0.268236 0.195642 0.020731
min 1.000000 17.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 9.000000 28.000000 40.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000
50% 10.000000 37.000000 40.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000
75% 12.000000 48.000000 45.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000
max 16.000000 90.000000 99.000000 99999.000000 4356.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

Let's do classification using logistic regression and random-forest, and compare the results. As features, we have:

  • education_num (as a numerical feature, which seems a fairly decent approach)
  • age (numerical). Note that beyond a certain age, a decline in income can be expected; Random Forest will be at an advantage here
  • hours per week (numerical)
  • sex (categorical)
  • race (categorical)
  • workclass (categorical)
  • capital gain (numerical)
  • capital loss (numerical)

Note that zip code was thrown away from the original dataset. It might carry some useful data after suitable binning (for instance, mapping zip to state), but without such an operation it is likely not a suitable feature.

The website mentions typical accuracy scores of about 85% (after removal of unknown records, which is not what was done here), so let's see if we can get close to that. Note that the dataset has an imbalance with only about 24% being positive, so an accuracy of 76% can be obtained trivially. We will therefore also look at the F1-score and at ROC curves.
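As a quick sanity check on that imbalance (a sketch, assuming y_train and y_test are the 0/1 label series loaded above), the class balance and the trivial majority-class baseline can be computed with:

from sklearn.dummy import DummyClassifier

print('positive fraction in training set: {:.3f}'.format(y_train.mean()))

baseline = DummyClassifier(strategy='most_frequent').fit(X_train_ohe, y_train)
y_base = baseline.predict(X_test_ohe)
print('baseline accuracy: {:.3f}'.format(accuracy_score(y_test, y_base)))
print('baseline f1-score: {:.3f}'.format(f1_score(y_test, y_base)))  # 0: this baseline never predicts positive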

What is a meaningful score in real-life depends on the business case. For instance: with a fixed budget for targeted marketing on the positive group, we are probably most interested in the precision of our classifier.

Logistic regression

In [19]:
f1_scorer = make_scorer(f1_score)

pipe = Pipeline([
    ('standardscaler', StandardScaler()), # 'standardscaler' is the name we chose to give this step
    ('classify', LogisticRegression(random_state=1234))
                ])
C_grid = np.logspace(-4, 4, 8) # These are the 'C' parameters that are varied

# The param_grid is a list of dictionaries. The naming of the keys in these dicts is strict:
# it refers to the name of the step in the pipe, followed by __ and the parameter we want to vary.
# Hence, we use 'classify__C'.

param_grid = [{'classify__C': C_grid }]

# Give GridSearchCV the pipe, the number of folds cv, the param_grid, the scorer we like to use for assessment
# (if we don't pass "scoring", we will just be using the classifier's default scorer)
grid = GridSearchCV(pipe, cv=10, n_jobs=2, param_grid=param_grid, verbose=1, 
                    return_train_score=True, scoring=f1_scorer)
grid.fit(X_train_ohe, y_train)
Fitting 10 folds for each of 8 candidates, totalling 80 fits
[Parallel(n_jobs=2)]: Done  80 out of  80 | elapsed:   11.4s finished
Out[19]:
GridSearchCV(cv=10, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('classify', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=1234, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=2,
       param_grid=[{'classify__C': array([  1.00000e-04,   1.38950e-03,   1.93070e-02,   2.68270e-01,
         3.72759e+00,   5.17947e+01,   7.19686e+02,   1.00000e+04])}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=make_scorer(f1_score), verbose=1)

We just did cross-validation, scanning over 8 different C-values, each time fitting the scaler on the training folds and scoring on the hold-out fold. The results are stored in

grid.cv_results_
In [21]:
plt.plot(grid.cv_results_['param_classify__C'],grid.cv_results_['mean_test_score'], label='test')
plt.plot(grid.cv_results_['param_classify__C'],grid.cv_results_['mean_train_score'], label='train')
plt.legend()
plt.xlabel('C')
plt.ylabel('F1-score')
plt.semilogx();
#plt.ylim([0.8, 0.82])
Interpretation

C is an inverse regularization strength. For C of about 1 or higher, the results no longer change, meaning that regularization no longer has any influence (the penalty it incurs is relatively small).

Because of the relatively large number of observations (32.5 K) compared to the number of features (17), regularization is clearly not needed. This is also reflected in the agreement between the test and train scores. We can take C=1000.0 to effectively obtain unregularized regression.
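Since refit=True by default, the best hyperparameter setting and the pipeline refit on the full training set can also be read directly off the fitted grid object:

print(grid.best_params_)           # the C value with the highest mean CV F1-score
print('best CV F1-score: {:.3f}'.format(grid.best_score_))
best_model = grid.best_estimator_  # the pipeline refit on the full training set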

In [6]:
sc = StandardScaler()
X_train_ohe = sc.fit_transform(X_train_ohe)
X_test_ohe = sc.transform(X_test_ohe)
LR = LogisticRegression(C=1000.0)
LR.fit(X_train_ohe, y_train)
y_predict_LR = LR.predict(X_test_ohe)
display(pd.crosstab(columns=y_predict_LR, index=y_test, normalize=True))
print('accuracy: {:.3f}'.format(accuracy_score(y_test, y_predict_LR)))
print('f1-score: {:.3f}'.format(f1_score(y_test, y_predict_LR)))
col_0 0 1
income_class
0 0.721946 0.041828
1 0.131749 0.104478
accuracy: 0.826
f1-score: 0.546

Note that the actual labels are in the rows, and the predicted labels in the columns. There are different conventions around, but I follow the one from "The Elements of Statistical Learning" (Hastie et al.) here. Giving the 0th index (rows) the true values and the 1st index (columns) the predictions also seems the more intuitive choice.
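The same numbers can be computed directly with sklearn's metrics (confusion_matrix was imported above but not yet used); sklearn follows the same row/column convention:

from sklearn.metrics import precision_score, recall_score

print(confusion_matrix(y_test, y_predict_LR))  # rows: true labels, columns: predicted labels
print('precision: {:.3f}'.format(precision_score(y_test, y_predict_LR)))
print('recall:    {:.3f}'.format(recall_score(y_test, y_predict_LR)))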

Our logistic model has rather poor recall: of the roughly 24% that are truly positive, less than half are identified as such. The precision is better. Let us look at the ROC curve, which tells us a bit more:

In [7]:
y_predict_LR_proba = LR.predict_proba(X_test_ohe)
fpr_LR, tpr_LR, thresholds = roc_curve(y_test, y_predict_LR_proba[:,1], pos_label=1)
In [8]:
fig = plt.figure(figsize=(8,6))
plt.plot(fpr_LR, tpr_LR)
plt.title('Logistic Regression, AUC: {:.3f}'.format(auc(x=fpr_LR, y=tpr_LR)), fontsize=14)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate');

Random Forest Classifier

For the random forest classifier, we do not need to worry about standardization and in principle do not need one-hot encoding. Regularization is enforced by limiting the complexity of the individual trees.

X_train and X_test contain categorical variables, which scikit-learn unfortunately does not handle well. Let us go two routes:

  1. Use one-hot-encoded features
  2. Use coded categories (integers)

We will start off with one-hot-encoded features

In [22]:
pipe = Pipeline([
    ('classify', RandomForestClassifier(random_state=1234, n_estimators=100))
])
maxdepth_grid = [ 6, 8, 10, 12, 15, 20]
param_grid = [
    {'classify__max_depth': maxdepth_grid },
]
grid = GridSearchCV(pipe, cv=10, n_jobs=2, param_grid=param_grid, verbose=1, 
                    return_train_score=True, scoring=f1_scorer)
grid.fit(X_train_ohe, y_train)
Fitting 10 folds for each of 6 candidates, totalling 60 fits
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:   49.3s
[Parallel(n_jobs=2)]: Done  60 out of  60 | elapsed:  1.1min finished
Out[22]:
GridSearchCV(cv=10, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('classify', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=1234, verbose=0,
            warm_start=False))]),
       fit_params=None, iid=True, n_jobs=2,
       param_grid=[{'classify__max_depth': [6, 8, 10, 12, 15, 20]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=make_scorer(f1_score), verbose=1)
In [23]:
plt.plot(grid.cv_results_['param_classify__max_depth'],grid.cv_results_['mean_test_score'], label='test')
plt.plot(grid.cv_results_['param_classify__max_depth'],grid.cv_results_['mean_train_score'], label='train')
plt.xlabel('Max depth')
plt.ylabel('F1-score')
plt.legend();
Interpretation

In Random Forest classification, complexity is determined by how large we allow our trees to be. Clearly, we start overfitting on the training data from a depth of 10 or more. However, because we are using 100 trees (we are bootstrap aggregating), the variance is strongly suppressed. Although the predictive power doesn't increase much beyond a depth of about 15, the predictions also do not deteriorate. This is in general a very nice property of random forests: grow many trees, and grow them large enough, and you are almost guaranteed to get a good classifier.

In [11]:
RFC = RandomForestClassifier(n_estimators=100, max_depth=20)
RFC.fit(X_train_ohe, y_train)
y_predict_RFC = RFC.predict(X_test_ohe)
display(pd.crosstab(columns=y_predict_RFC, index=y_test, normalize=True))
print('accuracy: {:.3f}'.format(accuracy_score(y_test, y_predict_RFC)))
print('f1-score: {:.3f}'.format(f1_score(y_test, y_predict_RFC)))
col_0 0 1
income_class
0 0.722192 0.041582
1 0.118113 0.118113
accuracy: 0.840
f1-score: 0.597
In [12]:
y_predict_RFC_proba = RFC.predict_proba(X_test_ohe)
fpr_RFC, tpr_RFC, thresholds = roc_curve(y_test, y_predict_RFC_proba[:,1], pos_label=1)
In [13]:
fig = plt.figure(figsize=(8,6))
plt.plot(fpr_LR, tpr_LR, label= 'Log. Reg.')
plt.plot(fpr_RFC, tpr_RFC, label='RFC')
plt.legend()
plt.title('LR AUC: {:.3f}, RFC AUC: {:.3f}'.format(auc(x=fpr_LR, y=tpr_LR),
                                                  auc(x=fpr_RFC, y=tpr_RFC)), fontsize=14)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate');
Interpretation

By shifting the probability level above which the label is predicted positive, we move along the curve. For an identical FPR (i.e. the fraction of actual negatives classified as positive), the TPR (the fraction of actual positives identified as such) of the RFC is larger: it is the stronger classifier.
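As a sketch of how such a shift would be implemented: instead of .predict(), threshold the predicted probabilities at a value other than the default 0.5 (the 0.3 below is just an illustrative choice):

threshold = 0.3  # illustrative choice; a threshold below 0.5 trades precision for recall
y_predict_custom = (RFC.predict_proba(X_test_ohe)[:, 1] >= threshold).astype(int)
print('f1-score at threshold {}: {:.3f}'.format(threshold, f1_score(y_test, y_predict_custom)))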

As a final step, let us see if there is any difference with integer-coded categorical features (rather than one-hot-encoded).

In [14]:
# Convert categorical to codes (integers)
cat_cols = list(X_train.columns[X_train.dtypes == 'category'])
for col in cat_cols:
    X_train[col] = X_train[col].cat.codes
In [15]:
pipe = Pipeline([
    ('classify', RandomForestClassifier(random_state=1234, n_estimators=100))
])
maxdepth_grid = [ 6, 8, 10, 12, 15, 20]
param_grid = [
    {'classify__max_depth': maxdepth_grid },
]
grid_2 = GridSearchCV(pipe, cv=10, n_jobs=2, param_grid=param_grid, verbose=1, 
                    return_train_score=True, scoring=f1_scorer)
grid_2.fit(X_train, y_train)
Fitting 10 folds for each of 6 candidates, totalling 60 fits
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:   40.4s
[Parallel(n_jobs=2)]: Done  60 out of  60 | elapsed:   57.9s finished
Out[15]:
GridSearchCV(cv=10, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('classify', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=1234, verbose=0,
            warm_start=False))]),
       fit_params=None, iid=True, n_jobs=2,
       param_grid=[{'classify__max_depth': [6, 8, 10, 12, 15, 20]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=make_scorer(f1_score), verbose=1)
In [25]:
plt.plot(grid.cv_results_['param_classify__max_depth'],grid.cv_results_['mean_test_score'], label='test, ohe',
        linestyle='--')
plt.plot(grid.cv_results_['param_classify__max_depth'],grid.cv_results_['mean_train_score'], label='train, ohe',
        linestyle='--')
plt.plot(grid_2.cv_results_['param_classify__max_depth'],grid_2.cv_results_['mean_test_score'], label='test')
plt.plot(grid_2.cv_results_['param_classify__max_depth'],grid_2.cv_results_['mean_train_score'], label='train')
plt.xlabel('Max depth')
plt.ylabel('F1-score')
plt.legend();

Conclusion

Using sklearn's Pipeline and GridSearchCV, we did the entire hyperparameter optimization loop (scanning a range of values, standardization, fitting, scoring on the hold-out predictions, storing results) in a few lines. We got somewhat better results with the Random Forest Classifier than with the Logistic Regression classifier, as was to be expected based on the features (for instance, the expectation of the target is not monotonically increasing with age). One-hot-encoded features and integer-coded features give very comparable performance for the RFC.

In the end, our accuracy scores (NB: without removing observations with missing features) are on par with those mentioned on the website (where such observations were removed).

As a quick and dirty comparison: the RFC accuracy is 84.40%, whereas the best scores mentioned there are C4.5: 84.46%, Naive-Bayes: 83.88%, NBTree: 85.90%.

Anomaly Detection with Scikit-Learn, Keras and TensorFlow

Anomaly detection in time-series data

Background:

Consider an unsteady heat transfer problem, with a solid plate that sits in between two turbulent media with different temperatures. The rate of change of temperature of the plate is driven by the weighted temperature difference of the two gaseous streams, and the weights -heat transfer coefficients- depend on other variables, such as the air velocity and pressure.

Our system is equipped with sensors, and we want to know whether an anomaly occurred at some point (a heat transfer change due to loss of insulation, a sensor that started giving wrong values, ...). We cannot observe this directly from the measured data. An essential difficulty for a traditional, physical model-based approach is that our observed and unobserved variables have relationships that are not exactly known to us (for instance, the relationship between the heat transfer coefficient and variables such as temperature, pressure and velocity).

Nevertheless, we would like to be able to detect a hidden change in our system -in this case, a reduction in a heat transfer coefficient of 10%-. We will train a machine-learning model to learn how the observed variables are connected, such that it can detect when something does not look quite right anymore.

As a first step, let's generate data governed by the following non-dimensionalized ordinary differential equation:

\begin{align} t_{ref} \frac{dy}{dt} = \eta_0 \theta_0 + \eta_1 \theta_1 - y(\eta_0 + \eta_1) \end{align}

where the $\eta$'s denote the heat transfer coefficients, the $\theta$'s the temperatures (at both sides, indicated with 0 and 1), and $t_{ref}$ the time constant (which depends, among other things, on the material properties and thickness). To give some physical intuition:

  • when both $\eta$'s are identical, the steady-state solution is: $y = (\theta_0 + \theta_1)/2$. Note that this is also a stable solution: whenever y is larger, the rate of change becomes negative, and vice versa
  • More generally, the steady-state solution is: $y = \left(\eta_0 \theta_0 + \eta_1 \theta_1\right) / \left(\eta_0 + \eta_1 \right)$

In case $\eta_0$ is much larger than $\eta_1$, the temperature $y$ will tend to $\theta_0$.

Note that this functional form applies to many systems that are governed by gradient-driven transport.

The heat transfer coefficients cannot be directly measured, but depend on other variables that can be measured and are therefore known; these are collectively called $\mathbf{x}$ (e.g., temperatures, pressures and velocities) and vary with time:

\begin{align} \eta_0 = \eta(\mathbf{x}(t)), \qquad \eta_1 = \pi(t) \cdot \eta(\mathbf{x}(t)) \end{align} where $\pi$ is the function that suddenly changes from 1 to a different value, at $t=t_{anomaly}$.

The temperature is supposed to be known on one side, and unknown but dependent on known variables on the other side:

\begin{align} \theta_0 = \theta_0(t), \qquad \theta_1 = \theta_1(\mathbf{x}(t)) \end{align}

A typical dependency of the heat transfer coefficients is that they are proportional to several non-dimensional quantities raised to some power. Although this will not be demonstrated here, when the exact exponents are not known it is very difficult to reconstruct the occurrence of the anomaly. Machine learning, on the other hand, can be quite robust - provided we have enough data to train our model with.
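For illustration (this is not the exact form used to generate the data below), a classic example of such a power-law correlation is the Dittus-Boelter relation for turbulent pipe flow, expressing the Nusselt number (the non-dimensional heat transfer coefficient) in terms of the Reynolds and Prandtl numbers:

\begin{align} \mathrm{Nu} = \frac{h D}{k} \approx 0.023 \, \mathrm{Re}^{0.8} \, \mathrm{Pr}^{0.4} \end{align}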

We will use two approaches to this problem: a Gaussian Process regression -that offers the advantage of estimating confidence intervals for our predictions-, and a neural network, using keras with a Tensorflow back-end.

In [1]:
%matplotlib inline
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

1. Setting some parameters

In [2]:
N = int(1E5) # number of timesteps
t_anomaly = 0.8 # fraction of N where anomaly occurs
diff_anomaly = -0.1 # the difference of the coefficient that occurs at t_anomaly (-0.1: 10% reduction)
t_train = 0.6 # the first fraction of the data we can train our model on. 
N_events = 100 # The number of changes that occur in the target values for the Ornstein-Uhlenbeck process that generates X
noise_level = 1.E-4 # Additive white noise on top of the measured signals. 1.E-4: very high SNR, clear detection. 
# 1.E-3: decent detection. 1.E-2: pushing it 

2. Generating the independent (observed) variables

To generate data, we start with piecewise linear target functions of $\mathbf{x}$. Using an Ornstein-Uhlenbeck scheme, we introduce inertia in the response and add some noise in the variable values.

In [3]:
def generate_random_pwlin_curves(N, n_x, y_min, y_max, seed, stoch=True, theta=0.01, sigma=0.01):
    """
    Returns piecewise linear time-series. 
    If stoch == True: additionally applies an Ornstein-Uhlenbeck process with the values 
    from the piecewise linear series as target values. 
    """
    np.random.seed(seed)
    y = np.zeros(N)
    # Generate random times at which the series will change its value
    t_delta = np.array([round(xd) for xd in np.random.random(n_x) * N])
    # First time: 0, then sort the random times
    t_delta_all = np.concatenate( (np.array([0]), np.sort(t_delta), np.array([N]) ) )
    
    # Generate the piecewise linear arrays by assigning constant values between the times
    for i in range(n_x + 1):
        y[int(t_delta_all[i]) : int(t_delta_all[i+1])] = y_min + np.random.random() * (y_max - y_min)
        
    if stoch:
        ## Ornstein Uhlenbeck process: reverting to the "target" (y), with added noise
        noise = np.random.normal(loc=0.0, scale=sigma, size=N) #define noise process
        drift = lambda y, mu: theta * (mu - y) # drift term. theta: the larger, the quicker 
        # the values revert to target
        mu = y.copy() # mu is now the target value for the drift process
        
        # Iterate through time
        for i in range(1, N):
            y[i] = y[i-1] + drift(y[i-1], mu[i-1]) + noise[i]
    return y
    
    
In [4]:
# Generate the x-matrix, with the row-index corresponding to time:
X = np.empty((N, 5))
# Generate the values
X[:,0] = generate_random_pwlin_curves(N, N_events, 1.0, 2.5, seed=79) 
X[:,1] = generate_random_pwlin_curves(N, N_events, 2.0, 5, seed=12) 
X[:,2] = generate_random_pwlin_curves(N, N_events, 0, 1, seed=13) 
X[:,3] = generate_random_pwlin_curves(N, N_events, 0.5, 2, seed=443) 
X[:,4] = generate_random_pwlin_curves(N, N_events, 0.5, 1.5, seed=123) 
In [5]:
matplotlib.rcParams.update({'font.size': 14})

# Plot the independent variables
fig, axs = plt.subplots(2,3,figsize=(15,6))
for i in range(len(axs.flat)):
    try:
        axs.flat[i].plot(X[:,i], lw=0.5)
        axs.flat[i].set_xlabel('Time [-]')
        axs.flat[i].set_ylabel('Value [-]')
        axs.flat[i].ticklabel_format(style='sci', axis='x', scilimits=(0,0))

    except IndexError:  # When exceeding the number of columns of X, make an empty axis
        ax=axs.flat[i]
        axs.flat[i].set_axis_off()
        
plt.suptitle('Time traces of the independent variables X', fontsize=16)
plt.tight_layout(pad=2)

3. Generating the hidden variables

The hidden ("latent") variables are the non-observed ingredients of the model. Some arbitrary dependence is assumed, loosely inspired on correlations for the turbulent heat transfer coefficient.

In [6]:
# Set some variables
t_ref = 10 # 1/(rho cp t)
p = np.ones(N)
p[round(N * t_anomaly)+1:] = 1 + diff_anomaly #This is the anomaly: change in the coefficient from 1 to below 1

# Define the functions that determine theta_1(x), eta_0(x) and eta_1(x):
fun_theta1 = lambda x1, x2 : x1 * (0.775 - 0.1*x2**2 + 0.7*x2**3)
fun_eta = lambda x1, x2, x3 : x1 * x2**0.628 * x3**-0.23

# Calculate the T1 and HTC's
theta_1 = fun_theta1(X[:,1], X[:,2])
eta_0 = fun_eta(X[:,3], X[:,4], X[:,0])
eta_1 = p * fun_eta(X[:,3], X[:,4], theta_1)

# Generate sensor values y
y = np.zeros(N)
y[0] = (eta_0[0] * X[0,0] + eta_1[0] * theta_1[0]) / (eta_0[0] + eta_1[0]) # Start at the steady-state value

# Generate noise
eps = 1E-5 * np.random.random(len(y))

# Iterate through y using the difference equation
for i in range(N-1):
    y[i+1] = y[i] + 1/t_ref * (eta_1[i]*(theta_1[i] - y[i]) + eta_0[i]*(X[i,0] - y[i])) + eps[i]
   
In [7]:
# Add noise to both X and y
X += np.random.normal(loc=0.0, scale=noise_level, size=X.shape) 
y += np.random.normal(loc=0.0, scale=noise_level, size=y.shape)
In [8]:
# Zoom in on some intervals, to show that y (the solid temperature) tends to a value in between T0 and T1
fig, (ax1, ax2) = plt.subplots(2,1, figsize=(14,5))

# 1: before anomaly
t1, t2 = 37000, 41000
ax1.plot(range(t1,t2), theta_1[t1:t2],'k', label='theta_1')
ax1.plot(range(t1,t2), y[t1:t2], lw=3, label='y')
ax1.plot(range(t1,t2), X[t1:t2, 0],'g', label='theta_0')
ax1.legend(fancybox=True, framealpha=0.8, loc=2)
ax1.set_title('Temperatures, before anomaly', fontsize=16);

# 2: around anomaly
t1, t2 = int(N*t_anomaly-2000), int(N*t_anomaly+2000)
ax2.plot(range(t1,t2), theta_1[t1:t2],'k', label='theta_1')
ax2.plot(range(t1,t2), y[t1:t2], lw=3, label='y')
ax2.plot(range(t1,t2), X[t1:t2, 0],'g', label='theta_0')
ax2.vlines(x=N*t_anomaly, ymin=ax2.get_ylim()[0], ymax=ax2.get_ylim()[1], linestyle='--', color='r', label='start of anomaly')
ax2.legend(fancybox=True, framealpha=0.8, loc=2)

ax2.set_title('Temperatures, around anomaly', fontsize=16);
ax2.set_xlabel('Time [-]')
plt.tight_layout()

The top plot shows how $y$ tends to a value that is in between $\theta_0$ and $\theta_1$, before the anomaly.

The bottom plot zooms in around the anomaly at t = 80,000. We really cannot easily detect the event visually, because all variables are continuously changing. Let's see if a machine learning model can help us detect the event.

4. Generate the features (X) and target (y)

In [9]:
# The target variable: y
y_target = y

# Create features from X, adding an extra 6th column for dy/dt
X_features = np.empty((N, 6))
X_features[:, :5] = X.copy()
X_features[:, 5] = np.concatenate((np.diff(y), np.array([0])))

# Scale the features to have zero mean and unit variance. 
rr = StandardScaler()
rr.fit(X_features)
X_features = rr.transform(X_features)

5. Gaussian Process Regressor from Sci-kit learn

In [10]:
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

We will train a Gaussian Process regression model to predict the temperature $y$ at $t$ based on the measured properties at $t$ and the change between $y(t+1)$ and $y(t)$.

Note that this is not predictive modelling in the traditional time-series sense (future prediction); we are merely reconstructing the missing piece ($y(t)$) from all the other ingredients that together make up the differential equation at time $t$.

Training a GP model is computationally demanding for large N, with the training effort scaling as ~$N^3$. Scikit-learn's GP model has an internal optimizer that finds the hyperparameters by maximizing the (log-marginal) likelihood. Finding this optimum takes additional effort. The optimum found for the length scale is 5.4 (note: we are dealing with 6 dimensions), which is given explicitly below, along with

optimizer=None

thus avoiding the optimization routine for the demo.

For this smooth problem, the common Gaussian/RBF kernel is well-suited.

In [11]:
N_1, train_step, test_step = int(t_train * N), 40, 10
X_train, y_train = X_features[0:N_1:train_step, :], y_target[0:N_1:train_step] # subsample features
X_test, y_test = X_features[::test_step, :], y_target[::test_step] # predict on entire range, subset later for test accuracy
In [12]:
X_train.shape
Out[12]:
(1500, 6)
In [13]:
GP = GaussianProcessRegressor(kernel=RBF(length_scale=5.4), alpha=noise_level, optimizer=None )
GP.fit(X_train, y_train)
Out[13]:
GaussianProcessRegressor(alpha=0.0001, copy_X_train=True,
             kernel=RBF(length_scale=5.4), n_restarts_optimizer=0,
             normalize_y=False, optimizer=None, random_state=None)
In [14]:
GP = GaussianProcessRegressor(kernel=RBF(length_scale=5.4), n_restarts_optimizer=1, alpha=noise_level)
GP.fit(X_train, y_train)
Out[14]:
GaussianProcessRegressor(alpha=0.0001, copy_X_train=True,
             kernel=RBF(length_scale=5.4), n_restarts_optimizer=1,
             normalize_y=False, optimizer='fmin_l_bfgs_b',
             random_state=None)
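For reference: the second fit above does let the internal optimizer run (with a single restart), so the maximum-likelihood length scale can be read off the fitted object; this is how a value such as 5.4 is found in the first place:

print(GP.kernel_)                         # fitted kernel, with the optimized length scale
print(GP.log_marginal_likelihood_value_)  # log-marginal likelihood at the fitted hyperparameters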
In [15]:
y_pred, sigma = GP.predict(X_test, return_std=True) # Also return the estimated uncertainty of the prediction
In [16]:
plt.figure(figsize=(14,8))
plt.plot(range(0, N, test_step), y_test, lw=2, label='actual values')
plt.plot(range(0, N, test_step), y_pred,'k-', label='predicted values')
plt.fill_between(range(0, N, test_step), y_pred- 2*sigma, y_pred+2*sigma, 
                 color='k', alpha=0.2, label='95% conf.int.')
plt.vlines(x=N_1, ymin=1, ymax=5,  color='k', linestyle='--', label='end of training data')
plt.vlines(x=N*t_anomaly, ymin=1, ymax=5, color='r', linestyle='--', label='start of anomaly')
plt.xlim([N_1 - 10000, N*t_anomaly + 10000])
plt.legend()

plt.ylabel('normalized temperature [-]')
plt.xlabel('time [-]')
plt.title('Predictions versus actual y-values', fontsize=16);
In [17]:
plt.figure(figsize=(14,8))
plt.plot(range(0, N, test_step), y_pred - y_test, label='prediction-actual')
plt.fill_between(range(0, N, test_step), -2*sigma, 2*sigma, color='k', alpha=0.5, label='95% conf.int.')
plt.vlines(x=N_1, ymin=-2*np.max(sigma), ymax=2*np.max(sigma), linestyle='--')
plt.vlines(x=N*t_anomaly, ymin=-2*np.max(sigma), ymax=2*np.max(sigma), color='r',
           linestyle='--', label='start of anomaly')
plt.xlim([N_1 - 10000, N*t_anomaly + 10000])
plt.legend()

plt.ylabel('predicted - actual [-]')
plt.xlabel('time [-]', fontsize=14)
plt.title('Residuals and model uncertainty', fontsize=16);

This plot shows the large advantage of Gaussian Process regression: our model indicates how certain it is regarding its predictions. Before the start of the test data (before black dashed line), the uncertainty band is rather narrow: the model has been trained on this data (with subsampling, which is why there is still some variation in the uncertainty within the training domain).

In the first part of the test-data, before the occurrence of the anomaly (red dashed line), the error increases. Since our features have 6 dimensions, and we had a total of 100 "set-points", we did not quite cover the entire parameter space. However, the model indicates this perfectly well: the uncertainty becomes much larger.

In the final part, the predictions generally lie far outside of our confidence interval. Let's calculate the log-likelihood of our data given our model's prediction and uncertainty, and plot this against time.

In [18]:
import scipy.stats
eps = 1E-12 # eps to avoid -inf
LL = np.log(eps + scipy.stats.norm(y_pred, sigma).pdf(y_target[::test_step])) 
plt.figure(figsize=(14,8))
plt.plot(range(0, N, test_step), LL, label='Log-likelihood')
plt.vlines(x=N_1, ymin=min(LL), ymax=10, linestyle='--', label='end of training data')
plt.vlines(x=N*t_anomaly, ymin=min(LL), ymax=10, color='r', linestyle='--', label='start of anomaly')
plt.xlim([N_1 - 10000, N*t_anomaly + 10000])
plt.legend()

plt.ylabel('Log-likelihood [-]')
plt.xlabel('time [-]')
plt.title('Log-likelihood of data given prediction', fontsize=16);

The log-likelihood shows a slight degradation after the start of the unseen data, but a drastic drop at $t_{anomaly}$. Since the change is so drastic, one can easily automate detection by adding a moving average and thresholding.
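A minimal sketch of such a detector; the window length and threshold below are illustrative choices, not tuned values:

window = 200       # illustrative smoothing window (number of test samples)
threshold = -5.0   # illustrative alarm level for the smoothed log-likelihood
LL_smooth = pd.Series(LL).rolling(window).mean()
alarms = np.where(LL_smooth < threshold)[0]
if len(alarms) > 0:
    print('first alarm raised at t = {}'.format(alarms[0] * test_step))  # LL is sampled every test_step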

In [19]:
# Calculate the R-squared and MSE of the GP model
R2_GP = r2_score(y_pred[N_1//test_step:int(N*t_anomaly)//test_step], 
                 y_test[N_1//test_step:int(N*t_anomaly)//test_step])
MSE_GP = mean_squared_error(y_pred[N_1//test_step:int(N*t_anomaly)//test_step], 
                            y_test[N_1//test_step:int(N*t_anomaly)//test_step])
In [20]:
# save the residuals for later use
res_GP = y_pred - y_test

6. Artificial neural network with Keras and TensorFlow

In [21]:
import keras
from time import time
from keras.models import Sequential
from keras.layers import Dense
from keras import callbacks
from keras.callbacks import TensorBoard
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
Using TensorFlow backend.

As an alternative, we will train a neural network to do the same task. The effort for training a neural network scales linearly with the number of training points, so we do not need to subsample and simply use all training data points.

In [22]:
X_train, y_train = X_features[:N_1, :], y_target[:N_1]
X_test, y_test = X_features[:, :], y_target[:]
In [23]:
def twolayer_model(activation_1='relu', activation_2='linear'):
    model = Sequential()
    model.add(Dense(6, input_dim=6, kernel_initializer='normal', activation=activation_1))
    model.add(Dense(1,  kernel_initializer='normal', activation=activation_2))
    model.compile(loss='mean_squared_error',optimizer='adam', metrics=['mean_squared_error'])
    return model
In [24]:
model = twolayer_model(activation_1='sigmoid', activation_2='linear')
# The re-lu gives somewhat poorer accuracy
In [25]:
tensorboard = TensorBoard(log_dir="logs/{}".format(time()))
# The tensorboard object will write data for visualization with tensorboard. 
# start tensorboard from the terminal with:
# $ tensorboard --logdir=logs/
model.fit(X_train, y_train, epochs=200, batch_size=100, verbose=0, callbacks=[tensorboard])
Out[25]:
<keras.callbacks.History at 0x1a2175b240>
In [26]:
y_pred = model.predict(X_test)[:, 0] #keras returns np.ndarray. Take the first column
In [27]:
plt.figure(figsize=(14, 8))
plt.plot(range(len(y_test)), y_test, lw=3, label='actual values')
plt.plot(range(len(y_test)), y_pred, 'k-', label='prediction')
plt.vlines(x=N_1, ymin=0, ymax=5, color='k', linestyle='--')
plt.vlines(x=N*t_anomaly, ymin=0, ymax=5, color='r', linestyle='--')
plt.xlim([N_1 - 10000, N*t_anomaly + 10000])
plt.legend()

plt.ylabel('normalized temperature [-]', fontsize=14)
plt.xlabel('time [-]', fontsize=14)
plt.title('Predictions versus actual y-values', fontsize=14);
In [28]:
fig = plt.figure(figsize=(14,8))
plt.plot(range(len(y_pred)), y_pred - y_test, label='Residuals, NN')
plt.plot(range(0, N, test_step), res_GP, 'k', linestyle=':', label='Residuals, GP', lw=2)
ax = plt.gca()
plt.xlim([N_1 - 10000, N*t_anomaly + 10000])
ymin, ymax = ax.get_ylim()
xmin, xmax = ax.get_xlim()

plt.vlines(x=N_1, ymin=ymin,  ymax=ymax, linestyle='--')
plt.vlines(x=N*t_anomaly, ymin=ymin,  ymax=ymax, color='r', linestyle='--')
plt.hlines(y=0, xmin=xmin, xmax=xmax)
plt.legend()
plt.ylabel('predicted - actual [-]')
plt.xlabel('time [-]')
plt.title('Residuals, GP versus NN', fontsize=16);

Since we do not have an uncertainty estimate for each prediction, as we do in the GP case, we cannot calculate a log-likelihood. It is therefore not possible to distinguish between a mismatch due to model uncertainty and one due to the anomaly.

In [29]:
# Determine the R-squared and MSE of the neural network model
R2_NN = r2_score(y_pred[N_1:int(N*t_anomaly)], y_test[N_1:int(N*t_anomaly)])
MSE_NN = mean_squared_error(y_pred[N_1:int(N*t_anomaly)], y_test[N_1:int(N*t_anomaly)])
In [30]:
#Compare the GP and NN accuracies:
print('   Model: \t GP \t \t  NN')
print('R-squared \t {:.4f} \t {:.4f}'.format(R2_GP, R2_NN))
print('MSE       \t {:.6f} \t {:.6f}'.format(MSE_GP, MSE_NN))
   Model: 	 GP 	 	  NN
R-squared 	 0.9993 	 0.9991
MSE       	 0.000155 	 0.000187

Concluding:

Both the Gaussian process model and the 2-layer neural network can easily identify the anomaly, which is a mere 10% reduction of a coefficient. Both models succeed in capturing the differential equation very well, with good predictive accuracy ($R^2$ exceeding 99% on test data).

The Gaussian process regressor offers a measure of its prediction uncertainty, which greatly helps detection because it allows the log-likelihood of the data given the model to be computed. This log-likelihood drops drastically after the anomaly.

pandas: data ingestion, cleaning, exploration and preparation

Getting the most out of pandas: data ingestion, cleaning, exploration and preparation

The pandas package offers a lot more than just the data containers Series and DataFrame. Knowing some of its functionality will save a lot of time -- and result in much nicer and faster code -- compared to doing it oneself in native Python.

This notebook shows how to use pandas to:

  • inspect and clean up data (using .apply(), .fillna(), .dropna(), .replace())
  • write data to a database, and load it
  • query a database
  • explore properties of the data using .groupby(), pd.crosstab() and pd.pivot_table()
  • plot the previous results directly with DataFrame.plot()
  • prepare the data for classification

The purpose is to cover a lot of useful pandas features that really speed up coding and analysis.

In [1]:
%matplotlib inline
import pandas as pd
import sqlite3 as sql
import matplotlib.pyplot as plt
from IPython.display import display
import seaborn as sns
import numpy as np
import matplotlib
plt.rcParams.update({'font.size': 14})

0. Import data

Data description

The datasets come from:

https://archive.ics.uci.edu/ml/machine-learning-databases/adult/

It is a rich dataset with a mix of numerical and categorical features that requires some clean-up. The target is a constructed dichotomy: does the person's income exceed 50K per year or not? Note that this binning into two groups means quite a loss of information; one wouldn't do this if it weren't necessary.

In [2]:
# url's to download data
train_url = r"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
test_url = r"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test"
# after downloading, store locally. Set the path here
data_dir = '../data/'
train_path = data_dir + 'adult.data.txt'
test_path = data_dir + 'adult.test.txt'
In [3]:
# define the column_names (these are not stored in the data set)
column_names = ['age', 'workclass', 'zip', 'education', 'education_num', 'marital_status', 'occupation',
          'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 
                'hoursperweek', 'country', 'income_class']

# Try to read the data from disk. If this fails, get it from internet, and save it
try:
    train_data = pd.read_csv(train_path)
    test_data = pd.read_csv(test_path)
    print('read successfully from disk')
except FileNotFoundError:
    print('fetching data from internet')
    train_data = pd.read_csv(train_url, header=None, names=column_names)
    train_data.to_csv(train_path, index=False) # do not write row names
    test_data = pd.read_csv(test_url, header=None, names=column_names, skiprows=1) # skiprows: first row is odd
    test_data.to_csv(test_path, index=False)
read successfully from disk

(Optional) Write/read the data to an SQL db

This is not very useful in this specific case since the data is not relational, but pandas does offer a nice interface for adding data to tables and for sending queries. The example below uses SQLite directly, but pandas also interfaces with many other databases in combination with SQLAlchemy.
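For example, the same round trip through SQLAlchemy would look roughly like this (a sketch; the connection string is a placeholder):

from sqlalchemy import create_engine

engine = create_engine('sqlite:///../data/50K.db')  # placeholder connection string
train_data.to_sql('train', engine, index=False, if_exists='replace')
sample = pd.read_sql('SELECT age, income_class FROM train WHERE age > 60', engine)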

Note that inside a Jupyter Notebook, one can get the documentation by typing

?<object>
In [4]:
?pd.read_sql_table
In [5]:
sqlite_io = False
if sqlite_io: # Do not execute by default
    db_path = './data/50K.db'
    connection = sql.connect(db_path)
    try:
        train_data.to_sql('train', connection, index=False, if_exists='replace')
        test_data.to_sql('test', connection, index=False, if_exists='replace')
    except Exception as e:
        print(e)
    finally:
        connection.close()

    # Reading goes as follows:
    connection = sql.connect(db_path)
    try:
        train_data = pd.read_sql('SELECT * from train', connection)
    except Exception as e:
        print(e)
    finally:
        connection.close()

1. Inspect and clean up data

In [6]:
display(train_data.head(3))
display(test_data.head(3))
age workclass zip education education_num marital_status occupation relationship race sex capital_gain capital_loss hoursperweek country income_class
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
age workclass zip education education_num marital_status occupation relationship race sex capital_gain capital_loss hoursperweek country income_class
0 25 Private 226802 11th 7 Never-married Machine-op-inspct Own-child Black Male 0 0 40 United-States <=50K.
1 38 Private 89814 HS-grad 9 Married-civ-spouse Farming-fishing Husband White Male 0 0 50 United-States <=50K.
2 28 Local-gov 336951 Assoc-acdm 12 Married-civ-spouse Protective-serv Husband White Male 0 0 40 United-States >50K.

Note the occurrence of question marks, and the subtle difference between test and train in the column income_class: the test labels carry a trailing dot after the K.

In [7]:
display(train_data.dtypes)
print(train_data.shape)
age                int64
workclass         object
zip                int64
education         object
education_num      int64
marital_status    object
occupation        object
relationship      object
race              object
sex               object
capital_gain       int64
capital_loss       int64
hoursperweek       int64
country           object
income_class      object
dtype: object
(32561, 15)

There are 6 numeric data columns; the rest (9 out of 15) are text or mixed. There are 32.6 K rows. What we will look for:

Numerical features:

  • Are there NaN's, outliers or dummy values?

Categorical features:

  • Are there missing values? Is clean-up needed?

1.1. Check for missing values/outliers/dummy values in the numerical features

In [8]:
pd.isnull(train_data).sum(axis=0).sum()
# No missing values. So there is nothing to replace
# In case there would be, some options are:
# train_data.fillna(value) # replace nan's with value
# or
# train_data.dropna() # remove rows (or columns) when nan's are present
Out[8]:
0
In [9]:
train_data.describe()
test_data.describe()
# This returns only the numerical columns
Out[9]:
age zip education_num capital_gain capital_loss hoursperweek
count 16281.000000 1.628100e+04 16281.000000 16281.000000 16281.000000 16281.000000
mean 38.767459 1.894357e+05 10.072907 1081.905104 87.899269 40.392236
std 13.849187 1.057149e+05 2.567545 7583.935968 403.105286 12.479332
min 17.000000 1.349200e+04 1.000000 0.000000 0.000000 1.000000
25% 28.000000 1.167360e+05 9.000000 0.000000 0.000000 40.000000
50% 37.000000 1.778310e+05 10.000000 0.000000 0.000000 40.000000
75% 48.000000 2.383840e+05 12.000000 0.000000 0.000000 45.000000
max 90.000000 1.490400e+06 16.000000 99999.000000 3770.000000 99.000000

Concluding: no missing values, no problematic numerical values. The 99999 and 99 values look like dummy values. One could replace these with the median of the columns. This means that these features become rather uninformative for these specific instances, which might be a good thing if the number is misleading, and a bad thing if it actually carries some information.

In [10]:
train_data.replace({'capital_gain' : {99999:train_data.capital_gain.median()},
                             'hoursperweek' : {99:train_data.hoursperweek.median()}}, inplace=False).describe()
Out[10]:
age zip education_num capital_gain capital_loss hoursperweek
count 32561.000000 3.256100e+04 32561.000000 32561.000000 32561.000000 32561.000000
mean 38.581647 1.897784e+05 10.080679 589.339486 87.303830 40.283437
std 13.640433 1.055500e+05 2.572720 2554.334160 402.960219 11.978424
min 17.000000 1.228500e+04 1.000000 0.000000 0.000000 1.000000
25% 28.000000 1.178270e+05 9.000000 0.000000 0.000000 40.000000
50% 37.000000 1.783560e+05 10.000000 0.000000 0.000000 40.000000
75% 48.000000 2.370510e+05 12.000000 0.000000 0.000000 45.000000
max 90.000000 1.484705e+06 16.000000 41310.000000 4356.000000 98.000000

I am not convinced that median-replacement is a good idea here (I interpret the large values as meaningful, if not exact), so I will leave things as they are.

1.2. Check for missing values/dummies and textual problems in the categorical values

Let's convert the text columns to categorical type. This post:

https://blog.dominodatalab.com/pandas-categoricals/

explains nicely how and why. The great benefit: improved performance and decreased memory usage. For the textual clean-up, we need to redefine the categories' values rather than act on the columns themselves with .apply().

Although the text entries look okay, they carry leading whitespace:

In [11]:
train_data.loc[0, 'sex']
Out[11]:
' Male'

In our current dataframe, the columns still contain plain text, which we could clean up as follows:

In [12]:
# Remove the whitespaces from each "object"-type column
train_data.apply(lambda x: x.str.strip() if x.dtype == 'object' else x).loc[0, 'sex']
#test_data = test_data.apply(lambda x: x.str.strip() if x.dtype == 'object' else x)
Out[12]:
'Male'

We will, however, first convert to categorical and then do the clean-up on the category levels. This is more involved than doing it on the unconverted columns, because:

  • type comparisons are less intuitive
  • doing .apply() for cleaning results in object-type columns again

So I would only recommend this when the computational advantage of working directly on the categories matters. We will nevertheless go this route here; the sketch below illustrates the two points.
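A small sketch of those two points, using the 'sex' column (converted to categorical on the fly here; the actual conversion of all text columns follows in the next cell):

s = train_data['sex'].astype('category')

# 1. dtype comparisons are less intuitive: compare against the string 'category',
#    or check for the .cat accessor with hasattr()
print(s.dtype == 'category')   # True
print(hasattr(s, 'cat'))       # True

# 2. .apply() on a categorical column returns a plain object column again,
#    so the benefit of the categorical dtype is lost
print(s.apply(lambda x: x.strip()).dtype)   # object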

In [13]:
text_cols = [col for col in train_data.columns if train_data[col].dtype == 'object' ]
for col in text_cols:
    train_data[col] = train_data[col].astype('category')
    test_data[col] = test_data[col].astype('category')
In [14]:
# strip the text of the categories, rather than applying strip on the columns:
train_data['sex'].cat.categories = train_data['sex'].cat.categories.str.strip()
print(train_data['sex'].cat.categories)
Index(['Female', 'Male'], dtype='object')
In [15]:
cat_cols = list(train_data.dtypes.index[train_data.dtypes == 'category'])
cat_cols
Out[15]:
['workclass',
 'education',
 'marital_status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'country',
 'income_class']
In [16]:
cat_cols = list(train_data.dtypes.index[train_data.dtypes == 'category'])
# Note that dtype comparisons do not always work as expected for
# categorical data; comparing against the string 'category' (as above), or using hasattr(), does work
for col in cat_cols:
    train_data[col].cat.categories = train_data[col].cat.categories.str.strip()
    test_data[col].cat.categories = test_data[col].cat.categories.str.strip()
   
In [17]:
test_data.replace({'income_class': {'>50K.': '>50K', '<=50K.': '<=50K'}}, inplace=True)
In [18]:
# Look at the categories for each columns
for col in cat_cols[:1]:
    print(pd.value_counts(train_data.loc[:, col]))
Private             22696
Self-emp-not-inc     2541
Local-gov            2093
?                    1836
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: workclass, dtype: int64

There are some '?'s, which we need to deal with. I believe that in this case we had better replace them with the mode (the most common value of that feature).

In [19]:
# Note that we can get the levels of categories as follows:
train_data['workclass'].cat.categories
Out[19]:
Index(['?', 'Federal-gov', 'Local-gov', 'Never-worked', 'Private',
       'Self-emp-inc', 'Self-emp-not-inc', 'State-gov', 'Without-pay'],
      dtype='object')
In [20]:
replace_dict = {col: {'?' : train_data[col].mode().values[0]} for col in cat_cols if 
                '?' in train_data[col].cat.categories}
print(replace_dict)
{'workclass': {'?': 'Private'}, 'occupation': {'?': 'Prof-specialty'}, 'country': {'?': 'United-States'}}
In [21]:
for df in train_data, test_data:
    df.replace(replace_dict, inplace=True) 
    
# replace converted the categories to objects. Convert back [NB: not optimal]
for col in replace_dict.keys():
    train_data[col] = train_data[col].astype('category')
    test_data[col] = test_data[col].astype('category')
In [22]:
for col in cat_cols[:1]:
    print(pd.value_counts(train_data.loc[:, col]))
Private             24532
Self-emp-not-inc     2541
Local-gov            2093
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: workclass, dtype: int64

4. Data exploration using pandas

a. Groupby() to compare some conditional stats
In [23]:
# Do a groupby to look at conditional differences
td = train_data.groupby(by='income_class').agg({'sex': lambda x: sum(x=='Male')/len(x), 
                                   'race': lambda x:sum(x=='White')/len(x),
                                   'age': 'mean'})
td.rename(columns={'sex': 'male', 'race': 'white', 'age': 'mean_age'},  inplace=True)
td
Out[23]:
male white mean_age
income_class
<=50K 0.611974 0.837338 36.783738
>50K 0.849637 0.907665 44.249841

The high-earning group is on average older, and predominantly male and white.

b. Single-index crosstabs

Crosstabs are "counters" over a one or more grouped categorical variables. The arguments are array-like (unlike pivot_table, which takes a DataFrame). We can normalize to give fractions

In [24]:
# Determine the fraction in the sample with income higher than 50K
pd.crosstab(index=train_data['income_class'], columns="Fraction", normalize=True)
Out[24]:
col_0 Fraction
income_class
<=50K 0.75919
>50K 0.24081
In [25]:
# Plot the fractions of different races in the data-set
pd.crosstab(index=train_data['race'], columns="Count", normalize=True).plot(kind='bar', fontsize=14);
c. Multi-index crosstabs

By additionally passing one (or more) columns, we can partition further:

In [26]:
# Choose to normalize on index. This gives the fraction of male/female within the income class
fm_fraction_income = pd.crosstab(index=train_data['income_class'], 
                                 columns=[train_data['sex']], normalize='index', margins=True)
display(fm_fraction_income)
sex Female Male
income_class
<=50K 0.388026 0.611974
>50K 0.150363 0.849637
All 0.330795 0.669205
In [27]:
# Let's plot this result
fig, ax = plt.subplots(1,1)
fpl = fm_fraction_income.plot(kind='bar', stacked=True, fontsize=14, ax = ax)
fpl.legend(loc=3, frameon=True, fancybox=True, framealpha=0.5, facecolor='w'); 
# Note the semicolon to suppress text output
# NB: frameon is needed because of the seaborn import

Note that we have been looking at fractions conditional on income_class and other categorical variables (for instance: of those earning more than 50K, what fraction is Female -if we take 'index' as normalizer).

Next, let's study how the fraction of people with income exceeding 50K depends on other variables, which is closer to the final classification task. We will do that with pivot tables.

d. Pivot tables
In [28]:
# simple example of a pivot table. Note that the next pivot table is equivalent to
# train_data.groupby(by='income_class').mean()
pd.pivot_ta