Optimizing with sklearn's GridSearchCV and Pipeline

Using sklearn's gridsearchCV and pipelines for hyperparameter optimization

Sklearn has built-in functionality to scan for the best combinations of hyperparameters (such as regularization strength, length scale parameters) in an efficient manner.

With the Pipeline class, we can also pass data-preprocessing steps such as standardization or PCA. This is a real time-saver. No more writing complex cross-validation loops with .fit() on the train set, calculating the score on the leave-out set, storing it in a dataframe...

We will use the pre-processed data from


In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
import numpy as np
import matplotlib
plt.rcParams.update({'font.size': 14})

from sklearn.linear_model import LogisticRegression # Classifier 1
from sklearn.ensemble import RandomForestClassifier # Classifier 2
from sklearn.metrics import (accuracy_score, f1_score, confusion_matrix, make_scorer, roc_curve, 
                             auc, average_precision_score, roc_auc_score)
from sklearn.model_selection import GridSearchCV # to optimize hyperparams
from sklearn.preprocessing import StandardScaler # to standardize numerical features
from sklearn.pipeline import Pipeline # to pass the standardization along with the classifier to GridSearchCV

0. Import data

In [2]:
data_dir = '../data/'
X_train = pd.read_pickle(path=data_dir + 'X_train')
X_test = pd.read_pickle(path=data_dir + 'X_test')
X_train_ohe = pd.read_pickle(path=data_dir + 'X_train_ohe')
X_test_ohe = pd.read_pickle(path=data_dir + 'X_test_ohe')
y_train = pd.read_pickle(path=data_dir + 'y_train')
y_test = pd.read_pickle(path=data_dir + 'y_test')

1. Classification

The data set is already cleaned-up, and one-hot encoded:

In [3]:
X_train_ohe.describe().loc[['count', 'mean', 'std', 'min', '50%', 'max'], :]
education_num age hoursperweek capital_gain capital_loss sex_Male race_Asian-Pac-Islander race_Black race_Other race_White workclass_Local-gov workclass_Never-worked workclass_Private workclass_Self-emp-inc workclass_Self-emp-not-inc workclass_State-gov workclass_Without-pay
count 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000 32561.000000
mean 10.080679 38.581647 40.437456 1077.648844 87.303830 0.669205 0.031909 0.095943 0.008323 0.854274 0.064279 0.000215 0.753417 0.034274 0.078038 0.039864 0.000430
std 2.572720 13.640433 12.347429 7385.292085 402.960219 0.470506 0.175761 0.294518 0.090851 0.352837 0.245254 0.014661 0.431029 0.181935 0.268236 0.195642 0.020731
min 1.000000 17.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 10.000000 37.000000 40.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000
max 16.000000 90.000000 99.000000 99999.000000 4356.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

Let's do classification using logistic regression and random-forest, and compare the results. As features, we have:

  • education_num (as a numerical feature, which seems a fairly decent approach)
  • age (numerical). Note that at a certain age, a decline can be expected. Random Forest will be at an advantage here
  • hours per week (numerical)
  • sex (categorical)
  • race (categorical)
  • hoursperweek (numerical)
  • workclass (categorical)
  • capital gain (numerical)
  • capital loss (numerical)

Note that zip code was thrown away from the original dataset. It might carry some useful data after suitable binning (for instance, mapping zip to state), but without such an operation it is likely not a suitable feature.

The website mentions typical accuracy scores of about 85% (after removal of unknown records, which is not what was done here), so let's see if we can get close to that. Note that the dataset has an imbalance with only about 24% being positive, so an accuracy of 76% can be obtained trivially.

Logistic regression

In [15]:
f1_scorer = make_scorer(f1_score)

pipe = Pipeline([
    ('standardscaler', StandardScaler()), #'standardscaler' is the name we chose to give 
    ('classify', LogisticRegression(random_state=1234))
C_grid = np.logspace(-4, 4, 8) # These are the 'C' parameters that are varied

# the parameter_grid is a list with dictionaries. The naming for keys in that dict is strict. 
# It refers to the the element in the pipe, with a __ followed by the parameter name we want to vary
# Whence, we use 'classify__C'

param_grid = [{'classify__C': C_grid }]

# Give GridSearchCV the pipe, the number of folds cv, the param_grid, the scorer we like to use for assessment
# (if we don't pass "scoring", we will just be using the classifier's default scorer)
grid = GridSearchCV(pipe, cv=10, n_jobs=2, param_grid=param_grid, verbose=1, 
                    return_train_score=True, scoring=f1_scorer)
grid.fit(X_train_ohe, y_train)
Fitting 10 folds for each of 8 candidates, totalling 80 fits
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    9.6s
[Parallel(n_jobs=2)]: Done  80 out of  80 | elapsed:   15.9s finished
GridSearchCV(cv=10, error_score='raise-deprecating',
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('classify', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=1234, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=2,
       param_grid=[{'classify__C': array([1.00000e-04, 1.38950e-03, 1.93070e-02, 2.68270e-01, 3.72759e+00,
       5.17947e+01, 7.19686e+02, 1.00000e+04])}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=make_scorer(f1_score), verbose=1)

We just did cross-validation, scanning for 8 different C-values, each time fitting the scaler on the train-set, and scoring on the hold-out set. The results are stored in grid.cv_results_

In [22]:
plt.errorbar(grid.cv_results_['param_classify__C'], grid.cv_results_['mean_test_score'], 
             yerr=grid.cv_results_['std_test_score'], label='test')
plt.errorbar(grid.cv_results_['param_classify__C'] * 1.4, grid.cv_results_['mean_train_score'], 
             yerr=grid.cv_results_['std_train_score'], label='train') #multiplier to avoid overlap
#plt.ylim([0.8, 0.82])

C is an inverse regularization-strength. For C about 1 or higher, the results do not change, meaning that regularization is not having any influence any more (the penalty it incurs is relatively small). Note the much larger variance in the hold-out set ("test").

Because of the relatively large amount of observations (32.5 K) compared to the number of features (17), regularization is clearly not needed. This is also observed in the accordance between test and train score.

Note We can use the fitted gridscan object to make predictions. Calling .predict(X_test) is equivalent to training on the entire training data and using the best estimator to make predictions for X_test.

In [6]:
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('classify', LogisticRegression(C=51.79474679231202, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='warn', n_jobs=None, penalty='l2', random_state=1234,
          solver='warn', tol=0.0001, verbose=0, warm_start=False))])
In [7]:
y_predict_LR = grid.predict(X_test_ohe)
display(pd.crosstab(columns=y_predict_LR, index=y_test, normalize=True).round(3))
print('accuracy: {:.3f}'.format(accuracy_score(y_test, y_predict_LR)))
print('f1-score: {:.3f}'.format(f1_score(y_test, y_predict_LR)))
col_0 0 1
0 0.722 0.042
1 0.132 0.104
accuracy: 0.826
f1-score: 0.546

Note that the actual labels are in the rows, and that predicted labels are displayed column-wise, following the convention from "The elements of statistical learning (Hastie et al.)".

Our logistic model has a rather poor recall: of the 23% that are truly positive, less than a half is identified as such. Let's look at the ROC and average precision scores using .predict_proba():

In [8]:
y_predict_LR = grid.predict_proba(X_test_ohe)[:, 1]
fpr_LR, tpr_LR, thresholds = roc_curve(y_test, y_predict_LR, pos_label=1)
print('ROC score: {:.3f}'.format(roc_auc_score(y_test, y_predict_LR)))
print('Average precision score: {:.3f}'.format(average_precision_score(y_test, y_predict_LR)))
ROC score: 0.845
Average precision score: 0.669

Random Forest Classifier

For the random forest classifier, we do not need to worry about standardization and in principle do not need one-hot-encoding. It is however advantageous to do so, since an optimal categorical split might otherwise not be found. Regularization is enforced by limiting the complexity of the individual trees.

In [9]:
pipe = Pipeline([
    ('classify', RandomForestClassifier(random_state=1234, n_estimators=100))
maxdepth_grid = [ 6, 8, 10, 12, 15, 20]
param_grid = [
    {'classify__max_depth': maxdepth_grid },
grid = GridSearchCV(pipe, cv=10, n_jobs=2, param_grid=param_grid, verbose=1, 
                    return_train_score=True, scoring=f1_scorer)
grid.fit(X_train_ohe, y_train)
Fitting 10 folds for each of 6 candidates, totalling 60 fits
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:   43.6s
[Parallel(n_jobs=2)]: Done  60 out of  60 | elapsed:  1.1min finished
GridSearchCV(cv=10, error_score='raise-deprecating',
     steps=[('classify', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=1234, verbose=0,
       fit_params=None, iid='warn', n_jobs=2,
       param_grid=[{'classify__max_depth': [6, 8, 10, 12, 15, 20]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=make_scorer(f1_score), verbose=1)
In [10]:
plt.errorbar(grid.cv_results_['param_classify__max_depth'], grid.cv_results_['mean_test_score'], 
         yerr=grid.cv_results_['std_test_score'], label='test')
plt.errorbar(grid.cv_results_['param_classify__max_depth'], grid.cv_results_['mean_train_score'], 
         yerr=grid.cv_results_['std_train_score'], label='train')
plt.xlabel('Max depth')

In Random Forest classification, complexity is determined by how large we allow our trees to be. From a depth of 10 or more, the test results start flattening out whereas training results keep on improving; we are over-fitting. However, the test scores do not deteriorate.

This is a nice property of random forests: grow many trees, and grow them large enough, and you are almost guaranteed to get a good classifier.

As before, make the predictions on the held-out test set directly from the trained GridSearchCV object.

In [11]:
y_predict_RFC = grid.predict(X_test_ohe)
display(pd.crosstab(columns=y_predict_RFC, index=y_test, normalize=True).round(3))
print('accuracy: {:.3f}'.format(accuracy_score(y_test, y_predict_RFC)))
print('f1-score: {:.3f}'.format(f1_score(y_test, y_predict_RFC)))
col_0 0 1
0 0.720 0.044
1 0.117 0.119
accuracy: 0.839
f1-score: 0.596
In [12]:
y_predict_RFC_proba = grid.predict_proba(X_test_ohe)
fpr_RFC, tpr_RFC, thresholds = roc_curve(y_test, y_predict_RFC_proba[:,1], pos_label=1)
In [13]:
fig = plt.figure(figsize=(8,4))
plt.plot(fpr_LR, tpr_LR, linestyle='-', label= 'Log. Reg.')
plt.plot(fpr_RFC, tpr_RFC, linestyle='-.', lw=2, label='RFC')
plt.title('LR AUC: {:.3f}, RFC AUC: {:.3f}'.format(auc(x=fpr_LR, y=tpr_LR),
                                                  auc(x=fpr_RFC, y=tpr_RFC)), fontsize=14)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate');


Using sklearn's Pipeline and GridsearchCV, we did the entire hyperparameter optimization loop (for a range of values, standardization, fitting, scoring on test predictions, storing results) in a few lines. Note that the "best classifier" is available for prediction, trained with all training data, with the .predict method of the GridSearchCV object.

We got somewhat better results with the Random Forest Classifier compared to the Logistic Regression classifier, as was to be expected based on the features (for instance, as the expectation of the target is not monotonically increasing with age).

In the end, the accuracy score (without removing observations with missing features) are on par with those mentioned on the website.

For comparison: the RFC accuracy is 84.40%, whereas the mentioned best scores are C4.5: 84.46%, Naive-Bayes: 83.88%, NBTree: 85.90%)

Blog with Nikola and Github: setup and workflow

Github blog with Nikola and Jupyter notebooks - setup and workflow

Setting up the github blog and creating a first post with Nikola was surprisingly easy, after having struggled with some alternatives. As easy as it was to set up, it is also easy to forget a few intricacies (although it really isn't that complicated) between one post and the next, a couple of weeks later. So here a short reminder on how to set Nikola up for Github and Jupyter notebooks, and how to post something new. It is assumed that there is already a Github account set up with a .github.io.git repository for your personal page.

There are several articles on the subject that might also be helpful, but they somehow didn't quite contain all my issues with Python (Anaconda), git and Nikola in one. So here it goes:

1. Getting started: Python environment, installing packages and Nikola, starting the git repo

This first part is rather dependent on setup and is documented more completely in various other places, such as the nikola site. Nevertheless, especially if you have Anaconda, the following could be useful.

Nikola has quite a few dependencies that are possibly different from those for everyday Python work. So it makes sense to create a new environment. This is easy with conda (I chose a prompt ending with # because of a conflict with the dollar sign within markdown quotations. I didn't actually do these commands as root, neither should you have to). By default, my conda installation will create Python 3.5, which is what we need.

user@macbook:~/Projects/user.github.io# conda create --name blog
user@macbook:~/Projects/user.github.io# source activate blog

Conda offers a possibility to copy an environment (using conda list --explicit) but this doesn't work cross-platform (I already had an environment set-up on Linux and wanted to have the same on my Mac. Alas.).

Using the list of commands outlined here (NB: check it if you start from scratch and also need to set up a Github account), one installs some necessary packages and Nikola with pip. But not before we have installed pip itself in our new environment:

(blog)user@macbook:~/Projects/user.github.io# which pip 

Not located in our environment, so install pip first:

(blog)user@macbook:~/Projects/user.github.io# conda install pip

That looks better. Now we can install in our environment with pip (NB: from now on, omitting the path and prompt):

pip install nikola pyzmq tornado jinja2 requests sphinx markdown ghp-import2

Upgrade nikola with extras, as recommended on the nikola site

pip install --upgrade "Nikola[extras]"

One thing is still missing: the Jupyter notebook, which adds a rather long list of more packages. Install with conda:

conda install notebook

In my most recent case, I did a git clone into my directory because I already had a Github repo with Nikola. Typically, you will first do a nikola init mysite, and add and commit this to a repo, and set a github location as your remote. I am using it for a personal page, so have the following remote:


See here for some more hints with respect to the git repo. And I liked this post a lot too.

There is something important to keep in mind with the git repo: there is a source branch (src) and a branch where the HTML will be located, in this case master (for project pages this can be a directory "/docs" inside branch master or directly in a branch gh-pages).

A second thing related to git: make sure there is a .gitignore file as described in the nikola handbook. Fixing conflicts with files in /cache is no fun.

To see the file structure of what will be published on the site, do:

git checkout master

This will show an index.html, a directory blogs and several more site-related things. Going back to the src branch with:

git checkout src

you will see a conf.py file for Nikola, plus a bunch of other stuff.

By now, we have a Python environment with all necessary packages including Nikola, and a git repository that is linked to our personal Github repo

2. Adding a Jupyter notebook file as a post

Let's make the new post with a Jupyter notebook:

nikola new_post -f ipynb

Open the Jupyter console and navigate to the created, empty notebook.

jupyter notebook posts/new_post.ipynb

Now the actual fun starts: you can write text, equations (using the magic command %%latex) and of course Python code, including plots if you make them. This post was also written in a Jupyter notebook. You can see how your post is progressing, by telling nikola:

nikola auto -b

When you are done, save the notebook as you want it to appear. We haven't given any tags yet to the notebook, but that can easily be done now. For this post, I did:

nikola tags --add "nikola, conda, jupyter notebook" posts/blog-with-nikola-and-github-setup-and-workflow.ipynb

Note that this modifies your notebook file (metadata). Also note that if one desires to keep a post hidden from the public (for instance, when deploying the site when there is a post that is still under construction), it can be given the tag private. This can of course later be removed by nikola tags --remove

The last action: deploy the post on Github. Here, Nikola really shines by simplicity.

nikola github_deploy

Additional notes:

  • Jupyter notebooks support inline math using \$ . \$ . This doesn't work automatically for Nikola. The conf.py file contains some lines that need to be commented out (search for mathjax to find those lines).