# Optimizing with sklearn's GridSearchCV and Pipeline

### Using sklearn's GridSearchCV and pipelines for hyperparameter optimization

Sklearn has built-in functionality to scan for the best combination of hyperparameters (such as regularization strength or length-scale parameters) in an efficient manner.

With the Pipeline class, we can also include data-preprocessing steps such as standardization or PCA. This is a real time-saver: no more writing complex cross-validation loops that call .fit() on the train set, compute the score on the hold-out set, store it in a dataframe...
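As a sketch of what Pipeline saves us, here is the manual loop next to its two-line replacement, on synthetic stand-in data (make_classification, not the census data used below):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# Manual cross-validation: fit the scaler on each train fold only,
# then score on the corresponding hold-out fold
scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X):
    sc = StandardScaler().fit(X[train_idx])
    clf = LogisticRegression(max_iter=1000).fit(sc.transform(X[train_idx]), y[train_idx])
    scores.append(clf.score(sc.transform(X[test_idx]), y[test_idx]))

# Pipeline + GridSearchCV: the same logic (plus a parameter scan) in two statements
pipe = Pipeline([('scale', StandardScaler()),
                 ('classify', LogisticRegression(max_iter=1000))])
grid = GridSearchCV(pipe, param_grid={'classify__C': [0.1, 1.0]}, cv=5)
grid.fit(X, y)
print(np.mean(scores), grid.best_score_)
```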

We will use the pre-processed data that was explored in detail in a previous post.

In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
import numpy as np
import matplotlib
plt.rcParams.update({'font.size': 14})

from sklearn.linear_model import LogisticRegression # Classifier 1
from sklearn.ensemble import RandomForestClassifier # Classifier 2
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, make_scorer, roc_curve, auc
from sklearn.model_selection import GridSearchCV # to optimize hyperparams
from sklearn.preprocessing import StandardScaler # to standardize numerical features
from sklearn.pipeline import Pipeline # to pass the standardization along with the classifier to GridSearchCV


### 0. Import data

In [2]:
# after downloading, store locally. Set the path here
data_dir = '../data/'
# read back the pickled train/test splits
X_train = pd.read_pickle(path=data_dir + 'X_train')
X_test = pd.read_pickle(path=data_dir + 'X_test')
X_train_ohe = pd.read_pickle(path=data_dir + 'X_train_ohe')
X_test_ohe = pd.read_pickle(path=data_dir + 'X_test_ohe')
y_train = pd.read_pickle(path=data_dir + 'y_train')
y_test = pd.read_pickle(path=data_dir + 'y_test')


### 1. Classification

The data set is already cleaned-up, and one-hot encoded:

In [3]:
X_train_ohe.describe()

Out[3]:
| feature | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| education_num | 32561 | 10.080679 | 2.572720 | 1 | 9 | 10 | 12 | 16 |
| age | 32561 | 38.581647 | 13.640433 | 17 | 28 | 37 | 48 | 90 |
| hoursperweek | 32561 | 40.437456 | 12.347429 | 1 | 40 | 40 | 45 | 99 |
| capital_gain | 32561 | 1077.648844 | 7385.292085 | 0 | 0 | 0 | 0 | 99999 |
| capital_loss | 32561 | 87.303830 | 402.960219 | 0 | 0 | 0 | 0 | 4356 |
| sex_Male | 32561 | 0.669205 | 0.470506 | 0 | 0 | 1 | 1 | 1 |
| race_Asian-Pac-Islander | 32561 | 0.031909 | 0.175761 | 0 | 0 | 0 | 0 | 1 |
| race_Black | 32561 | 0.095943 | 0.294518 | 0 | 0 | 0 | 0 | 1 |
| race_Other | 32561 | 0.008323 | 0.090851 | 0 | 0 | 0 | 0 | 1 |
| race_White | 32561 | 0.854274 | 0.352837 | 0 | 1 | 1 | 1 | 1 |
| workclass_Local-gov | 32561 | 0.064279 | 0.245254 | 0 | 0 | 0 | 0 | 1 |
| workclass_Never-worked | 32561 | 0.000215 | 0.014661 | 0 | 0 | 0 | 0 | 1 |
| workclass_Private | 32561 | 0.753417 | 0.431029 | 0 | 1 | 1 | 1 | 1 |
| workclass_Self-emp-inc | 32561 | 0.034274 | 0.181935 | 0 | 0 | 0 | 0 | 1 |
| workclass_Self-emp-not-inc | 32561 | 0.078038 | 0.268236 | 0 | 0 | 0 | 0 | 1 |
| workclass_State-gov | 32561 | 0.039864 | 0.195642 | 0 | 0 | 0 | 0 | 1 |
| workclass_Without-pay | 32561 | 0.000430 | 0.020731 | 0 | 0 | 0 | 0 | 1 |

Let's do classification using logistic regression and random forest, and compare the results. As features, we have:

• education_num (as a numerical feature, which seems a fairly decent approach)
• age (numerical). Note that at a certain age, a decline can be expected; random forest will be at an advantage here
• hoursperweek (numerical)
• sex (categorical)
• race (categorical)
• workclass (categorical)
• capital_gain (numerical)
• capital_loss (numerical)

Note that zip code was thrown away from the original dataset. It might carry some useful data after suitable binning (for instance, mapping zip to state), but without such an operation it is likely not a suitable feature.

The website mentions typical accuracy scores of about 85% (after removal of unknown records, which is not what was done here), so let's see if we can get close to that. Note that the dataset has an imbalance with only about 24% being positive, so an accuracy of 76% can be obtained trivially. We will therefore also look at the F1-score and at ROC curves.
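To make the imbalance point concrete, here is a quick sketch with synthetic labels at a hypothetical 24% positive rate: always predicting the majority class already reaches the majority fraction in accuracy, while the F1-score exposes the trivial classifier.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels with ~24% positives, mimicking the dataset's imbalance
rng = np.random.default_rng(0)
y = (rng.random(10000) < 0.24).astype(int)

y_trivial = np.zeros_like(y)                     # always predict the majority class (0)
print(accuracy_score(y, y_trivial))              # ~0.76 without learning anything
print(f1_score(y, y_trivial, zero_division=0))   # 0.0: no true positives at all
```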

Which score is meaningful in real life depends on the business case. For instance, with a fixed budget for targeted marketing on the positive group, we are probably most interested in the precision of our classifier.
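Selecting on precision instead is then a one-line change via make_scorer; a sketch on synthetic stand-in data (not the census data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, precision_score
from sklearn.model_selection import GridSearchCV

# Imbalanced toy problem: ~76% negatives, as in the census data
X, y = make_classification(n_samples=300, weights=[0.76], random_state=0)

# Extra keyword arguments to make_scorer are forwarded to the metric
precision_scorer = make_scorer(precision_score, zero_division=0)
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={'C': [0.01, 1.0, 100.0]},
                    cv=5, scoring=precision_scorer)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```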

#### Logistic regression

In [19]:
f1_scorer = make_scorer(f1_score)

pipe = Pipeline([
    ('standardscaler', StandardScaler()),  # 'standardscaler' is the name we chose to give
    ('classify', LogisticRegression(random_state=1234))
])
C_grid = np.logspace(-4, 4, 8)  # these are the 'C' parameters that are varied

# param_grid is a list of dictionaries. The naming of the keys in each dict is strict:
# the name of the element in the pipe, followed by __ and the parameter we want to vary.
# Hence, we use 'classify__C'.
param_grid = [{'classify__C': C_grid}]

# Give GridSearchCV the pipe, the number of folds cv, the param_grid, and the scorer
# we like to use for assessment (if we don't pass "scoring", the classifier's
# default scorer is used)
grid = GridSearchCV(pipe, cv=10, n_jobs=2, param_grid=param_grid, verbose=1,
                    return_train_score=True, scoring=f1_scorer)
grid.fit(X_train_ohe, y_train)

Fitting 10 folds for each of 8 candidates, totalling 80 fits

[Parallel(n_jobs=2)]: Done  80 out of  80 | elapsed:   11.4s finished

Out[19]:
GridSearchCV(cv=10, error_score='raise',
estimator=Pipeline(memory=None,
steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('classify', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=1234, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False))]),
fit_params=None, iid=True, n_jobs=2,
param_grid=[{'classify__C': array([  1.00000e-04,   1.38950e-03,   1.93070e-02,   2.68270e-01,
3.72759e+00,   5.17947e+01,   7.19686e+02,   1.00000e+04])}],
pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
scoring=make_scorer(f1_score), verbose=1)

We just did cross-validation, scanning 8 different C-values, each time fitting the scaler on the train folds and scoring on the hold-out fold. The results are stored in `grid.cv_results_`.
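A sketch (on synthetic data, not the census set) of how to discover the strict `step__parameter` key names from the pipeline itself, and how to pull the winning settings out of the fitted grid:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)
pipe = Pipeline([('standardscaler', StandardScaler()),
                 ('classify', LogisticRegression(max_iter=1000))])

# Every legal param_grid key follows the <step>__<parameter> pattern:
print([k for k in pipe.get_params() if k.startswith('classify__')][:3])

grid = GridSearchCV(pipe, param_grid={'classify__C': [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)   # the winning combination, e.g. {'classify__C': ...}

# cv_results_ is a dict of arrays; a DataFrame makes it easy to inspect
df = pd.DataFrame(grid.cv_results_)
print(df[['param_classify__C', 'mean_test_score']])
```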
In [21]:
plt.plot(grid.cv_results_['param_classify__C'], grid.cv_results_['mean_test_score'], label='test')
plt.plot(grid.cv_results_['param_classify__C'], grid.cv_results_['mean_train_score'], label='train')
plt.legend()
plt.xlabel('C')
plt.ylabel('F1-score')
plt.semilogx();

##### Interpretation

C is an inverse regularization strength. For C of about 1 or higher, the results no longer change, meaning that regularization no longer has any influence (the penalty it incurs is relatively small).

Because of the relatively large number of observations (32.5k) compared to the number of features (17), regularization is clearly not needed. This is also visible in the agreement between test and train scores. We can take C=1000.0 to effectively have unregularized regression.

In [6]:
sc = StandardScaler()
X_train_ohe = sc.fit_transform(X_train_ohe)
X_test_ohe = sc.transform(X_test_ohe)
LR = LogisticRegression(C=1000.0)
LR.fit(X_train_ohe, y_train)
y_predict_LR = LR.predict(X_test_ohe)
display(pd.crosstab(columns=y_predict_LR, index=y_test, normalize=True))
print('accuracy: {:.3f}'.format(accuracy_score(y_test, y_predict_LR)))
print('f1-score: {:.3f}'.format(f1_score(y_test, y_predict_LR)))

| income_class | predicted 0 | predicted 1 |
| --- | --- | --- |
| 0 | 0.721946 | 0.041828 |
| 1 | 0.131749 | 0.104478 |

accuracy: 0.826
f1-score: 0.546


Note that the actual labels are in the rows, and the predicted labels are displayed column-wise. There appear to be different conventions, but I follow the one from "The Elements of Statistical Learning" (Hastie et al.) here. Giving the 0th index (rows) the true values and the 1st index (columns) the predictions also seems the more intuitive choice.
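sklearn's own confusion_matrix follows the same convention, rows true and columns predicted, as a tiny check shows:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

# Rows: true label; columns: predicted label
cm = confusion_matrix(y_true, y_pred)
print(cm)   # [[2 1]
            #  [1 2]]

tn, fp, fn, tp = cm.ravel()
print(precision_score(y_true, y_pred))  # tp / (tp + fp) = 2/3
print(recall_score(y_true, y_pred))     # tp / (tp + fn) = 2/3
```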

Our logistic model has rather poor recall: of the roughly 24% that are truly positive, fewer than half are identified as such. The precision is better. Let us look at the ROC curve, which tells us a bit more:

In [7]:
y_predict_LR_proba = LR.predict_proba(X_test_ohe)
fpr_LR, tpr_LR, thresholds = roc_curve(y_test, y_predict_LR_proba[:,1], pos_label=1)

In [8]:
fig = plt.figure(figsize=(8,6))
plt.plot(fpr_LR, tpr_LR)
plt.title('Logistic Regression, AUC: {:.3f}'.format(auc(x=fpr_LR, y=tpr_LR)), fontsize=14)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate');


#### Random Forest Classifier

For the random forest classifier, we do not need to worry about standardization, and in principle we do not need one-hot encoding either. Regularization is enforced by limiting the complexity of the individual trees.

X_train and X_test contain categorical variables, which scikit-learn unfortunately does not handle well. Let us go two routes:

1. Use one-hot-encoded features
2. Use coded categories (integers)

We will start off with one-hot-encoded features

In [22]:
pipe = Pipeline([
    ('classify', RandomForestClassifier(random_state=1234, n_estimators=100))
])
maxdepth_grid = [6, 8, 10, 12, 15, 20]
param_grid = [
    {'classify__max_depth': maxdepth_grid},
]
grid = GridSearchCV(pipe, cv=10, n_jobs=2, param_grid=param_grid, verbose=1,
                    return_train_score=True, scoring=f1_scorer)
grid.fit(X_train_ohe, y_train)

Fitting 10 folds for each of 6 candidates, totalling 60 fits

[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:   49.3s
[Parallel(n_jobs=2)]: Done  60 out of  60 | elapsed:  1.1min finished

Out[22]:
GridSearchCV(cv=10, error_score='raise',
estimator=Pipeline(memory=None,
steps=[('classify', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
oob_score=False, random_state=1234, verbose=0,
warm_start=False))]),
fit_params=None, iid=True, n_jobs=2,
param_grid=[{'classify__max_depth': [6, 8, 10, 12, 15, 20]}],
pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
scoring=make_scorer(f1_score), verbose=1)
In [23]:
plt.plot(grid.cv_results_['param_classify__max_depth'], grid.cv_results_['mean_test_score'], label='test')
plt.plot(grid.cv_results_['param_classify__max_depth'], grid.cv_results_['mean_train_score'], label='train')
plt.xlabel('Max depth')
plt.ylabel('F1-score')
plt.legend();

##### Interpretation

In random forest classification, complexity is determined by how large we allow the trees to be. Clearly, we start overfitting on the training data from a depth of 10 or more. However, because we are using 100 trees (we are bootstrap aggregating), the variance is strongly suppressed. Although the predictive power does not increase much beyond a depth of about 15, the predictions do not deteriorate either. This is in general a very nice property of random forests: grow many trees, and grow them large enough, and you are almost guaranteed a good classifier.
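One way to check that more trees do not hurt is the out-of-bag score, which reuses each tree's bootstrap hold-out samples so no separate test set is needed. A sketch on synthetic data (`oob_score=True` is standard sklearn; the data is made up):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=17, random_state=0)

# The OOB score stabilizes as the number of trees grows; it does not collapse
for n in [25, 100, 400]:
    rfc = RandomForestClassifier(n_estimators=n, max_depth=20,
                                 oob_score=True, random_state=1234)
    rfc.fit(X, y)
    print(n, round(rfc.oob_score_, 3))
```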

In [11]:
RFC = RandomForestClassifier(n_estimators=100, max_depth=20)
RFC.fit(X_train_ohe, y_train)
y_predict_RFC = RFC.predict(X_test_ohe)
display(pd.crosstab(columns=y_predict_RFC, index=y_test, normalize=True))
print('accuracy: {:.3f}'.format(accuracy_score(y_test, y_predict_RFC)))
print('f1-score: {:.3f}'.format(f1_score(y_test, y_predict_RFC)))

| income_class | predicted 0 | predicted 1 |
| --- | --- | --- |
| 0 | 0.722192 | 0.041582 |
| 1 | 0.118113 | 0.118113 |

accuracy: 0.840
f1-score: 0.597

In [12]:
y_predict_RFC_proba = RFC.predict_proba(X_test_ohe)
fpr_RFC, tpr_RFC, thresholds = roc_curve(y_test, y_predict_RFC_proba[:,1], pos_label=1)

In [13]:
fig = plt.figure(figsize=(8,6))
plt.plot(fpr_LR, tpr_LR, label= 'Log. Reg.')
plt.plot(fpr_RFC, tpr_RFC, label='RFC')
plt.legend()
plt.title('LR AUC: {:.3f}, RFC AUC: {:.3f}'.format(auc(x=fpr_LR, y=tpr_LR),
auc(x=fpr_RFC, y=tpr_RFC)), fontsize=14)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate');

##### Interpretation

By shifting the level at which a label is predicted positive, we see that for an identical FPR (the fraction of actual negatives classified as positive), the TPR (the fraction of actual positives identified as such) of the RFC is larger: it is the stronger classifier.
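Shifting that level simply means thresholding predict_proba somewhere other than the default 0.5; every point on the ROC curve is reachable this way. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, weights=[0.76], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]
results = {}
for threshold in [0.5, 0.3]:
    y_hat = (proba >= threshold).astype(int)
    # Lowering the threshold can only add positive predictions:
    # recall goes up (or stays equal), typically at the cost of precision
    results[threshold] = (precision_score(y_te, y_hat, zero_division=0),
                          recall_score(y_te, y_hat))
    print(threshold, results[threshold])
```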

As a final step, let us see if there is any difference when using integer-coded categorical features (rather than one-hot encoded ones):

In [14]:
# Convert categoricals to codes (integers)
cat_cols = list(X_train.columns[X_train.dtypes == 'category'])
for col in cat_cols:
    X_train[col] = X_train[col].cat.codes
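One caveat with `.cat.codes`: if a test set were converted independently, the same category could map to a different integer. A sketch (with made-up frames, not the census data) of reusing the train categories so both sets share one coding:

```python
import pandas as pd

# Hypothetical train/test frames sharing a categorical column
train = pd.DataFrame({'workclass': pd.Categorical(['Private', 'State-gov', 'Private'])})
test = pd.DataFrame({'workclass': ['State-gov', 'Private']})

# Reuse the *train* categories so the integer coding is identical on both sets
train_cats = train['workclass'].cat.categories
test['workclass'] = pd.Categorical(test['workclass'], categories=train_cats).codes
train['workclass'] = train['workclass'].cat.codes
print(train['workclass'].tolist(), test['workclass'].tolist())
```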

In [15]:
pipe = Pipeline([
    ('classify', RandomForestClassifier(random_state=1234, n_estimators=100))
])
maxdepth_grid = [6, 8, 10, 12, 15, 20]
param_grid = [
    {'classify__max_depth': maxdepth_grid},
]
grid_2 = GridSearchCV(pipe, cv=10, n_jobs=2, param_grid=param_grid, verbose=1,
                      return_train_score=True, scoring=f1_scorer)
grid_2.fit(X_train, y_train)

Fitting 10 folds for each of 6 candidates, totalling 60 fits

[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:   40.4s
[Parallel(n_jobs=2)]: Done  60 out of  60 | elapsed:   57.9s finished

Out[15]:
GridSearchCV(cv=10, error_score='raise',
estimator=Pipeline(memory=None,
steps=[('classify', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
oob_score=False, random_state=1234, verbose=0,
warm_start=False))]),
fit_params=None, iid=True, n_jobs=2,
param_grid=[{'classify__max_depth': [6, 8, 10, 12, 15, 20]}],
pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
scoring=make_scorer(f1_score), verbose=1)
In [25]:
plt.plot(grid.cv_results_['param_classify__max_depth'], grid.cv_results_['mean_test_score'],
         label='test, ohe', linestyle='--')
plt.plot(grid.cv_results_['param_classify__max_depth'], grid.cv_results_['mean_train_score'],
         label='train, ohe', linestyle='--')
plt.plot(grid_2.cv_results_['param_classify__max_depth'], grid_2.cv_results_['mean_test_score'], label='test')
plt.plot(grid_2.cv_results_['param_classify__max_depth'], grid_2.cv_results_['mean_train_score'], label='train')
plt.xlabel('Max depth')
plt.ylabel('F1-score')
plt.legend();


#### Conclusion

Using sklearn's Pipeline and GridSearchCV, we ran the entire hyperparameter optimization loop (scanning a range of values, standardizing, fitting, scoring on test predictions, storing results) in a few lines. We got somewhat better results with the random forest classifier than with logistic regression, as was to be expected given the features (for instance, the expectation of the target is not monotonically increasing with age). One-hot-encoded and integer-coded features give very comparable performance for the RFC.

In the end, our accuracy scores (NB: without removing observations with missing features) are on par with those mentioned on the website (where such observations were removed).

As a quick and dirty comparison: the RFC accuracy is 84.40%, whereas the mentioned best scores are C4.5: 84.46%, Naive Bayes: 83.88%, NBTree: 85.90%.