pandas: data ingestion, cleaning, exploration and preparation

Getting the most out of pandas: data ingestion, cleaning, exploration and preparation

The pandas package offers a lot more than just the data containers Series and DataFrame. Knowing some of its functionality will save a lot of time -- and result in much nicer and faster code -- compared to doing it oneself in native Python.

This notebook shows how to use pandas to:

  • inspect and clean up data (using .apply(), .fillna(), .dropna(), .replace())
  • write data to a database, and load it
  • query a database
  • explore properties of the data using .groupby(), pd.crosstab() and pd.pivot_table()
  • plot the previous results directly with DataFrame.plot()
  • prepare the data for classification

The purpose is to cover a number of useful pandas features that really speed up coding and analysis.

In [1]:
%matplotlib inline
import pandas as pd
import sqlite3 as sql
import matplotlib.pyplot as plt
from IPython.display import display
import seaborn as sns
import numpy as np
import matplotlib
plt.rcParams.update({'font.size': 14})

0. Import data

Data description

The datasets come from the UCI Machine Learning Repository (the "Adult", or census income, data set).

It is a rich dataset with a mix of numerical and categorical features that requires some clean-up. The target is a constructed dichotomy: does the person's income exceed 50K per year or not? Note that this binning into two groups discards quite some information; one wouldn't do it if it weren't necessary.

In [2]:
# url's to download data
train_url = r""
test_url = r""
# after downloading, store locally. Set the path here
data_dir = '../data/'
train_path = data_dir + ''
test_path = data_dir + 'adult.test.txt'
In [3]:
# define the column_names (these are not stored in the data set)
column_names = ['age', 'workclass', 'zip', 'education', 'education_num', 'marital_status', 'occupation',
          'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 
                'hoursperweek', 'country', 'income_class']

# Try to read the data from disk. If this fails, get it from the internet, and save it
try:
    train_data = pd.read_csv(train_path)
    test_data = pd.read_csv(test_path)
    print('read successfully from disk')
except FileNotFoundError:
    print('fetching data from internet')
    train_data = pd.read_csv(train_url, header=None, names=column_names)
    train_data.to_csv(train_path, index=False) # do not write row names
    test_data = pd.read_csv(test_url, header=None, names=column_names, skiprows=1) # skiprows: first row is odd
    test_data.to_csv(test_path, index=False)
read successfully from disk

(Optional) Write/read the data to an SQL db

This is not very useful in this specific case since the data is not relational, but pandas does offer a nice interface for adding data to tables and to send queries. The example below is with SQLite, but pandas offers interfaces to many databases in combination with SQLalchemy.

Note that inside a Jupyter Notebook, one can get the documentation of a function by appending a question mark, e.g. pd.DataFrame.to_sql?

In [5]:
sqlite_io = False
if sqlite_io: # Do not execute by default
    db_path = './data/50K.db'
    connection = sql.connect(db_path)
    try:
        train_data.to_sql('train', connection, index=False, if_exists='replace')
        test_data.to_sql('test', connection, index=False, if_exists='replace')
    except Exception as e:
        print(e)
    finally:
        connection.close()

    # Reading goes as follows:
    connection = sql.connect(db_path)
    try:
        train_data = pd.read_sql('SELECT * from train', connection)
    except Exception as e:
        print(e)
    finally:
        connection.close()

1. Inspect and clean up data

In [6]:
age workclass zip education education_num marital_status occupation relationship race sex capital_gain capital_loss hoursperweek country income_class
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
age workclass zip education education_num marital_status occupation relationship race sex capital_gain capital_loss hoursperweek country income_class
0 25 Private 226802 11th 7 Never-married Machine-op-inspct Own-child Black Male 0 0 40 United-States <=50K.
1 38 Private 89814 HS-grad 9 Married-civ-spouse Farming-fishing Husband White Male 0 0 50 United-States <=50K.
2 28 Local-gov 336951 Assoc-acdm 12 Married-civ-spouse Protective-serv Husband White Male 0 0 40 United-States >50K.

Note the occurrence of question marks, and the subtle difference between test and train for the column income_class: a dot behind the K.

In [7]:
age                int64
workclass         object
zip                int64
education         object
education_num      int64
marital_status    object
occupation        object
relationship      object
race              object
sex               object
capital_gain       int64
capital_loss       int64
hoursperweek       int64
country           object
income_class      object
dtype: object
(32561, 15)

There are 6 numeric data columns, the rest (9 out of 15) are text or mixed. There are 32.6 K rows. What we will look for:

Numerical features:

  • Are there NaN's, outliers or dummy values?

Categorical features:

  • Are there missing values? Is clean-up needed?

1.1. Check for missing values/outliers/dummy values in the numerical features

In [8]:
# No missing values. So there is nothing to replace
# In case there would be, some options are:
# train_data.fillna(value) # replace nan's with value
# or
# train_data.dropna() # remove rows (or columns) when nan's are present
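As a concrete illustration of these options (a minimal sketch on a hypothetical toy frame, not the census data):

```python
import numpy as np
import pandas as pd

# Toy frame with gaps, to illustrate the options above
df = pd.DataFrame({'age': [39, np.nan, 38], 'hoursperweek': [40.0, 13.0, np.nan]})

print(df.isnull().sum())         # count the NaN's per column
filled = df.fillna(df.median())  # replace NaN's, here with the column medians
dropped = df.dropna()            # or drop every row that contains a NaN
```

Whether filling or dropping is appropriate depends on how much data is affected and whether the missingness itself is informative.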
In [9]:
# This returns only the numerical columns
age zip education_num capital_gain capital_loss hoursperweek
count 16281.000000 1.628100e+04 16281.000000 16281.000000 16281.000000 16281.000000
mean 38.767459 1.894357e+05 10.072907 1081.905104 87.899269 40.392236
std 13.849187 1.057149e+05 2.567545 7583.935968 403.105286 12.479332
min 17.000000 1.349200e+04 1.000000 0.000000 0.000000 1.000000
25% 28.000000 1.167360e+05 9.000000 0.000000 0.000000 40.000000
50% 37.000000 1.778310e+05 10.000000 0.000000 0.000000 40.000000
75% 48.000000 2.383840e+05 12.000000 0.000000 0.000000 45.000000
max 90.000000 1.490400e+06 16.000000 99999.000000 3770.000000 99.000000

Concluding: there are no missing values and no impossible numerical values. The maxima of 99999 (capital_gain) and 99 (hoursperweek) do look like dummy or capped values, though. One could replace these with the median of the column. That would make the feature rather uninformative for those specific instances, which is a good thing if the number is misleading, and a bad thing if it actually carries some information.

In [10]:
train_data.replace({'capital_gain' : {99999:train_data.capital_gain.median()},
                             'hoursperweek' : {99:train_data.hoursperweek.median()}}, inplace=False).describe()
age zip education_num capital_gain capital_loss hoursperweek
count 32561.000000 3.256100e+04 32561.000000 32561.000000 32561.000000 32561.000000
mean 38.581647 1.897784e+05 10.080679 589.339486 87.303830 40.283437
std 13.640433 1.055500e+05 2.572720 2554.334160 402.960219 11.978424
min 17.000000 1.228500e+04 1.000000 0.000000 0.000000 1.000000
25% 28.000000 1.178270e+05 9.000000 0.000000 0.000000 40.000000
50% 37.000000 1.783560e+05 10.000000 0.000000 0.000000 40.000000
75% 48.000000 2.370510e+05 12.000000 0.000000 0.000000 45.000000
max 90.000000 1.484705e+06 16.000000 41310.000000 4356.000000 98.000000

I am not convinced that median replacement is an improvement here (I read the large values as meaningful, even if not exact), so we leave things as they are.

1.2. Check for missing values/dummies and textual problems in the categorical values

Let's convert the text columns to categorical type. This post:

explains nicely how and why. The great benefits: improved performance and decreased memory usage. For textual clean-up, we then need to redefine the categories' values rather than act on the columns themselves with .apply().
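The memory claim is easy to verify on a toy column (a sketch; the exact numbers depend on the pandas version):

```python
import pandas as pd

# A column with few distinct values, as is typical for e.g. 'sex' or 'race'
s = pd.Series(['Male', 'Female'] * 10000)
cat = s.astype('category')

# The categorical version stores one small integer code per row plus the
# two category labels, instead of one Python string object per row
print(s.memory_usage(deep=True), cat.memory_usage(deep=True))
```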

Although the text entries look okay, they carry leading whitespace:

In [11]:
train_data.loc[0, 'sex']
' Male'

In our current dataframe, the columns still contain plain text (object dtype), in which case we could strip the whitespace as follows:

In [12]:
# Remove the whitespaces from each "object"-type column
train_data.apply(lambda x: x.str.strip() if x.dtype == 'object' else x).loc[0, 'sex']
#test_data = test_data.apply(lambda x: x.str.strip() if x.dtype == 'object' else x)

We will, however, first convert to categorical and then do the clean-up on the category levels. This is more involved than cleaning the unconverted columns, because:

  • type comparisons are less intuitive for categorical columns
  • using .apply() for cleaning turns the columns back into object type

So I would recommend this route only when the computational advantage of working directly on the categories is relevant. We will follow it here nonetheless.

In [13]:
text_cols = [col for col in train_data.columns if train_data[col].dtype == 'object' ]
for col in text_cols:
    train_data[col] = train_data[col].astype('category')
    test_data[col] = test_data[col].astype('category')
In [14]:
# strip the text of the categories, rather than applying strip on the columns:
train_data['sex'].cat.categories = train_data['sex'].cat.categories.str.strip()
train_data['sex'].cat.categories
Index(['Female', 'Male'], dtype='object')
In [15]:
cat_cols = list(train_data.dtypes.index[train_data.dtypes == 'category'])
In [16]:
# Note that dtype comparison does not always work as expected for
# categorical data; comparing against the string 'category' does work
for col in cat_cols:
    train_data[col].cat.categories = train_data[col].cat.categories.str.strip()
    test_data[col].cat.categories = test_data[col].cat.categories.str.strip()
In [17]:
test_data.replace({'income_class': {'>50K.': '>50K', '<=50K.': '<=50K'}}, inplace=True)
In [18]:
# Look at the value counts (shown here for the first categorical column only)
for col in cat_cols[:1]:
    print(pd.value_counts(train_data.loc[:, col]))
Private             22696
Self-emp-not-inc     2541
Local-gov            2093
?                    1836
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: workclass, dtype: int64

There are some '?' values, which we need to deal with. In this case, I believe we had better replace them with the mode (the most common value of that feature).

In [19]:
# Note that we can get the levels of the categories as follows:
train_data['workclass'].cat.categories
Index(['?', 'Federal-gov', 'Local-gov', 'Never-worked', 'Private',
       'Self-emp-inc', 'Self-emp-not-inc', 'State-gov', 'Without-pay'],
      dtype='object')
In [20]:
replace_dict = {col: {'?' : train_data[col].mode().values[0]} for col in cat_cols if 
                '?' in train_data[col].cat.categories}
print(replace_dict)
{'workclass': {'?': 'Private'}, 'occupation': {'?': 'Prof-specialty'}, 'country': {'?': 'United-States'}}
In [21]:
for df in train_data, test_data:
    df.replace(replace_dict, inplace=True) 
# replace converted the affected columns back to object. Convert back to category [NB: not optimal]
for col in replace_dict.keys():
    train_data[col] = train_data[col].astype('category')
    test_data[col] = test_data[col].astype('category')
In [22]:
for col in cat_cols[:1]:
    print(pd.value_counts(train_data.loc[:, col]))
Private             24532
Self-emp-not-inc     2541
Local-gov            2093
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: workclass, dtype: int64

2. Data exploration using pandas

a. Groupby() to compare some conditional stats
In [23]:
# Do a groupby to look at conditional differences
td = train_data.groupby(by='income_class').agg({'sex': lambda x: sum(x=='Male')/len(x), 
                                   'race': lambda x:sum(x=='White')/len(x),
                                   'age': 'mean'})
td.rename(columns={'sex': 'male', 'race': 'white', 'age': 'mean_age'},  inplace=True)
male white mean_age
<=50K 0.611974 0.837338 36.783738
>50K 0.849637 0.907665 44.249841

The high-earning group is on average older, and predominantly male and white.

b. Single-index crosstabs

Crosstabs are "counters" over one or more grouped categorical variables. The arguments are array-like (unlike pivot_table, which takes a DataFrame). We can normalize to get fractions.

In [24]:
# Determine the fraction in the sample with income higher than 50K
pd.crosstab(index=train_data['income_class'], columns="Fraction", normalize=True)
col_0 Fraction
<=50K 0.75919
>50K 0.24081
In [25]:
# Plot the fractions of different races in the data-set
pd.crosstab(index=train_data['race'], columns="Count", normalize=True).plot(kind='bar', fontsize=14);
c. Multi-index crosstabs

By giving additionally one (or more) column, we can do further partitioning:

In [26]:
# Choose to normalize on index. This gives the fraction of male/female within the income class
fm_fraction_income = pd.crosstab(index=train_data['income_class'], 
                                 columns=[train_data['sex']], normalize='index', margins=True)
sex Female Male
<=50K 0.388026 0.611974
>50K 0.150363 0.849637
All 0.330795 0.669205
In [27]:
# Let's plot this result
fig, ax = plt.subplots(1,1)
fpl = fm_fraction_income.plot(kind='bar', stacked=True, fontsize=14, ax = ax)
fpl.legend(loc=3, frameon=True, fancybox=True, framealpha=0.5, facecolor='w'); 
# Note the semicolon to suppress text output
# NB: frameon is needed because of the seaborn import

Note that we have been looking at fractions conditional on income_class and other categorical variables (for instance: of those earning more than 50K, what fraction is female, if we take 'index' as the normalizer).

Next, let's study how the fraction of people with income exceeding 50K depends on other variables, which is closer to the final classification task. We will do that with pivot tables.

d. Pivot tables
In [28]:
# simple example of a pivot table. Note that the next pivot table is equivalent to
# train_data.groupby(by='income_class').mean()
pd.pivot_table(train_data, index='income_class',  aggfunc='mean')
age capital_gain capital_loss education_num hoursperweek zip
<=50K 36.783738 148.752468 53.142921 9.595065 38.840210 190340.86517
>50K 44.249841 4006.142456 195.001530 11.611657 45.473026 188005.00000

To be able to calculate the fractions of people earning more than 50K, add a column with 0/1 values to indicate income less/more than 50 K, and a column 'age_group' to divide people over and under 50 years of age.

In [29]:
for df in [train_data, test_data]:
    df['larger_50K'] = df['income_class'].apply(lambda x: int(x == '>50K'))
    df['age_group'] = df['age'].apply(lambda x: '>=50' if x >= 50 else '<50')
In [30]:
age                  int64
workclass         category
zip                  int64
education         category
education_num        int64
marital_status    category
occupation        category
relationship      category
race              category
sex               category
capital_gain         int64
capital_loss         int64
hoursperweek         int64
country           category
income_class      category
larger_50K        category
age_group           object
dtype: object

Note that "larger_50K" automatically became categorical. We want an integer, to be able to aggregate:

In [31]:
train_data['larger_50K'] = train_data['larger_50K'].astype('int')
test_data['larger_50K'] = test_data['larger_50K'].astype('int')
In [32]:
pd.pivot_table(train_data, values='larger_50K', 
               index=['race', 'age_group'], columns=['sex'], aggfunc='mean' )
sex Female Male
race age_group
Amer-Indian-Eskimo <50 0.093750 0.123457
>=50 0.130435 0.133333
Asian-Pac-Islander <50 0.115894 0.312274
>=50 0.181818 0.431655
Black <50 0.059748 0.169170
>=50 0.049470 0.273026
Other <50 0.060000 0.099291
>=50 0.000000 0.238095
White <50 0.111750 0.283745
>=50 0.148609 0.427212
In [33]:
fig, axs = plt.subplots(1, 2, figsize=(12, 4))
# Do a pivot_table for the Female population
pd.pivot_table(train_data.loc[train_data['sex'] == 'Female', :], values='larger_50K', index=['race'], 
               columns=['age_group'], aggfunc='mean').plot(kind='bar', fontsize=14, ax=axs[0])
axs[0].set_title('Females', fontsize= 15)
axs[0].set_ylim([0, 0.5])

# Male population
pd.pivot_table(train_data.loc[train_data['sex'] == 'Male', :], values='larger_50K', index=['race'], 
               columns=['age_group'], aggfunc='mean').plot(kind='bar', fontsize=14, ax=axs[1])
axs[1].set_title('Males', fontsize= 15)
axs[1].set_ylim([0, 0.5]);

The chance of being in the over-50K group for females is highest in the over-50, Asian-Pac-Islander group. For males, it is highest for the Asian-Pac-Islander and White groups, also in the over-50 age category.

Note that seaborn offers the FacetGrid, which can produce similar plots. Finally, a (seaborn) pairplot gives some indication of how separable the data is by any pair of features.
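For instance, the pivot-table bar charts above could be reproduced roughly as follows (a sketch using a synthetic stand-in for train_data; sns.catplot, which wraps FacetGrid, requires a recent seaborn):

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for train_data, with just the columns the plot needs
df = pd.DataFrame({
    'sex': ['Male', 'Female'] * 50,
    'age_group': (['<50'] * 50) + (['>=50'] * 50),
    'race': ['White', 'Black', 'Other', 'Black', 'White'] * 20,
    'larger_50K': [0, 1] * 50,
})

# One facet per sex; bar height is the mean of larger_50K per race/age_group
g = sns.catplot(data=df, x='race', y='larger_50K', hue='age_group',
                col='sex', kind='bar')
```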

In [34]:
# Plot only numerical values (the "hue" comes from a categorical one: 'income_class')
num_cols = list(train_data.columns[train_data.dtypes == 'int'])
# Limit the plot in number of points, and do not take all numerical columns (note that categorical variables are
# automatically excluded)
sns.pairplot(train_data.loc[::10,num_cols + ['income_class']], hue='income_class', plot_kws=dict(s=20, alpha=0.8));

It looks like we have good chances to separate the two classes.

A final question to answer: what is education_num? Can we use it as a numerical variable, or is it an (unordered) categorical variable?

In [35]:
train_data.pivot_table(values='education_num', index='education').sort_values(by='education_num')
Preschool 1
1st-4th 2
5th-6th 3
7th-8th 4
9th 5
10th 6
11th 7
12th 8
HS-grad 9
Some-college 10
Assoc-voc 11
Assoc-acdm 12
Bachelors 13
Masters 14
Prof-school 15
Doctorate 16

education_num has a one-to-one correspondence with education, and is thus categorical. However, since it is ordered, it is not a bad idea per se to use it as a numerical feature (if we don't want to one-hot-encode 16 categorical values).
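pandas can also make such an ordering explicit with an ordered categorical dtype (a sketch on a hypothetical subset of the education levels):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical subset of the education levels, in increasing order
levels = ['HS-grad', 'Some-college', 'Bachelors', 'Masters']
edu = pd.Series(['Bachelors', 'HS-grad', 'Masters'],
                dtype=CategoricalDtype(categories=levels, ordered=True))

print(edu > 'Some-college')  # ordered comparisons now work
print(edu.cat.codes)         # the integer codes play the role of education_num
```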

In [36]:
train_data.pivot_table(values='larger_50K', index='education_num').plot(); 
# Note that the mean as aggregate function is the default for pivot_table

3. Preparing for classification

In order to do classification with logistic regression or any other classifier that requires numerical values (neural networks, naive bayes, actually anything which is not tree-based), we need to convert our categories to dummies: "is the value equal to category X or not?".

We want to continue with the following features:

  • education_num (as a numerical feature, which seems a fairly decent approach)
  • age (numerical). Note that at a certain age, a decline can be expected. Random Forest will be at an advantage here
  • hoursperweek (numerical)
  • sex (categorical, needs conversion to numerical for logistic regression)
  • race (categorical, same as above)
  • workclass (categorical)
  • capital_gain (numerical)
  • capital_loss (numerical)

Note that zip might carry some useful data after suitable binning (for instance, mapping zip to state), but cannot be fed into a logistic regression model in its current state.

We will write the data to disk, and look at the classification in a follow-up post.

In [37]:
col_list = ['education_num', 'age', 'sex', 'race', 'hoursperweek', 
            'workclass', 'capital_gain', 'capital_loss']
X_train, X_test = train_data.loc[:, col_list], test_data.loc[:, col_list]
y_train = train_data.loc[:, 'income_class'].apply(lambda x: 1 if x=='>50K' else 0)
y_test = test_data.loc[:, 'income_class'].apply(lambda x: 1 if x=='>50K' else 0)

As it turns out, the test set has an additional value in the "workclass" column. We deal with this by concatenating first, then getting dummies, and then splitting again. This way, the train and test features are identical.

In [38]:
len_train = len(X_train)
X_train_test = pd.concat([X_train, X_test])
X_train_test_ohe = pd.get_dummies(X_train_test, drop_first=True)
# Separate them again into train and test
X_train_ohe, X_test_ohe = X_train_test_ohe.iloc[:len_train, :], X_train_test_ohe.iloc[len_train:, :]
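An alternative to the concatenate-then-split trick is to build the dummies separately and re-align the test frame to the train columns (a sketch on toy frames):

```python
import pandas as pd

train = pd.DataFrame({'workclass': ['Private', 'State-gov']})
test = pd.DataFrame({'workclass': ['Private', 'Never-worked']})  # unseen level

train_ohe = pd.get_dummies(train)
# Align test to the train columns: levels unseen in train are dropped,
# levels missing in test are added as all-zero columns
test_ohe = pd.get_dummies(test).reindex(columns=train_ohe.columns, fill_value=0)
```

This guarantees identical feature columns in both sets, at the cost of silently discarding test-only levels.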
In [39]:
education_num age hoursperweek capital_gain capital_loss sex_Male race_Asian-Pac-Islander race_Black race_Other race_White workclass_Local-gov workclass_Never-worked workclass_Private workclass_Self-emp-inc workclass_Self-emp-not-inc workclass_State-gov workclass_Without-pay
0 13 39 40 2174 0 1 0 0 0 1 0 0 0 0 0 1 0
1 13 50 13 0 0 1 0 0 0 1 0 0 0 0 1 0 0
2 9 38 40 0 0 1 0 0 0 1 0 0 1 0 0 0 0
education_num age hoursperweek capital_gain capital_loss sex_Male race_Asian-Pac-Islander race_Black race_Other race_White workclass_Local-gov workclass_Never-worked workclass_Private workclass_Self-emp-inc workclass_Self-emp-not-inc workclass_State-gov workclass_Without-pay
0 7 25 40 0 0 1 0 1 0 0 0 0 1 0 0 0 0
1 9 38 50 0 0 1 0 0 0 1 0 0 1 0 0 0 0
2 12 28 40 0 0 1 0 0 0 1 1 0 0 0 0 0 0


Using pd.read_csv (pointing directly to a URL, and to a local file on disk) we loaded the data, and saw how to use df.to_sql() and pd.read_sql() to interact with an SQL database.

To judge the data quality we used df.describe(), df.head(), pd.value_counts() and pd.isnull(). We saw how to clean up using df.apply(), and converted our categorical variables to the categorical type using df[column].astype('category'). With df.replace({column: {old_value: new_value}}) we replaced values in specific columns.

Displaying conditional probabilities of the target variable was done efficiently with pd.crosstab(index=..., columns=..., normalize=...), and with pd.pivot_table(df, values, index, columns) after converting the target variable to integer type.

Plotting of conditional distributions -conditional on target- was done with seaborn, using sns.pairplot. Finally, we did one-hot-encoding using pd.get_dummies(df, drop_first=True).

As a last step, we write the results to disk, ready for building statistical models!

In [40]:
## Let's write the results to disk for analysis. 
X_train.to_pickle(path=data_dir + 'X_train')
X_test.to_pickle(path=data_dir + 'X_test')
X_train_ohe.to_pickle(path=data_dir + 'X_train_ohe')
X_test_ohe.to_pickle(path=data_dir + 'X_test_ohe')
y_train.to_pickle(path=data_dir + 'y_train')
y_test.to_pickle(path=data_dir + 'y_test')