Getting the most out of pandas: data ingestion, cleaning, exploration and preparation¶
The pandas package offers a lot more than just the data containers Series and DataFrame. Knowing some of its functionality will save a lot of time -- and result in much nicer and faster code -- compared to doing everything yourself in native Python.
This notebook shows how to use pandas to:
- inspect and clean up data (using .apply(), .fillna(), .dropna(), .replace())
- write data to a database, and load it
- query a database
- explore properties of the data using .groupby(), pd.crosstab() and pd.pivot_table()
- plot the previous results directly with DataFrame.plot()
- prepare the data for classification
The purpose is to cover a number of useful pandas features that really speed up coding and analysis.
%matplotlib inline
import pandas as pd
import sqlite3 as sql
import matplotlib.pyplot as plt
from IPython.display import display
import seaborn as sns
import numpy as np
import matplotlib
plt.rcParams.update({'font.size': 14})
0. Import data¶
Data description¶
The datasets come from:
https://archive.ics.uci.edu/ml/machine-learning-databases/adult/
It is a rich dataset with a mix of numerical and categorical features that requires some clean-up. The target is a constructed dichotomy: does the person's income exceed 50K per year or not? Note that this binning into two groups discards quite some information; one wouldn't do this unless necessary.
# url's to download data
train_url = r"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
test_url = r"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test"
# after downloading, store locally. Set the path here
data_dir = '../data/'
train_path = data_dir + 'adult.data.txt'
test_path = data_dir + 'adult.test.txt'
# define the column_names (these are not stored in the data set)
column_names = ['age', 'workclass', 'zip', 'education', 'education_num', 'marital_status', 'occupation',
'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
'hoursperweek', 'country', 'income_class']
# Try to read the data from disk. If this fails, get it from internet, and save it
try:
train_data = pd.read_csv(train_path)
test_data = pd.read_csv(test_path)
print('read successfully from disk')
except FileNotFoundError:
print('fetching data from internet')
train_data = pd.read_csv(train_url, header=None, names=column_names)
train_data.to_csv(train_path, index=False) # do not write row names
test_data = pd.read_csv(test_url, header=None, names=column_names, skiprows=1) # skiprows: first row is odd
test_data.to_csv(test_path, index=False)
(Optional) Write/read the data to an SQL db¶
This is not very useful in this specific case, since the data is not relational, but pandas does offer a nice interface for writing data to tables and sending queries. The example below uses SQLite, but pandas interfaces with many other databases in combination with SQLAlchemy (a short sketch of that route follows after the SQLite example).
Note that inside a Jupyter Notebook, one can get the documentation by typing
?<object>
?pd.read_sql_table
sqlite_io = False
if sqlite_io: # Do not execute by default
db_path = './data/50K.db'
connection = sql.connect(db_path)
try:
train_data.to_sql('train', connection, index=False, if_exists='replace')
test_data.to_sql('test', connection, index=False, if_exists='replace')
except Exception as e:
print(e)
finally:
connection.close()
# Reading goes as follows:
connection = sql.connect(db_path)
try:
train_data = pd.read_sql('SELECT * from train', connection)
except Exception as e:
print(e)
finally:
connection.close()
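For other databases, the same to_sql()/read_sql() calls work through an SQLAlchemy engine instead of a raw connection. A minimal sketch (assuming SQLAlchemy is installed; the file name in the connection string is made up for illustration):
# Sketch only: write and read back via an SQLAlchemy engine
if sqlite_io:  # keep the same guard as above, so this is not executed by default
    from sqlalchemy import create_engine
    engine = create_engine('sqlite:///../data/50K_sqlalchemy.db')
    train_data.to_sql('train', engine, index=False, if_exists='replace')
    print(pd.read_sql('SELECT * FROM train LIMIT 3', engine).shape)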
1. Inspect and clean up data¶
display(train_data.head(3))
display(test_data.head(3))
Note the occurrence of question marks, and the subtle difference between test and train for the column income_class: a trailing dot after the K.
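A quick check makes both points visible (a small snippet added here for illustration):
# income_class labels differ subtly between train and test (note the trailing dot)
print(train_data['income_class'].unique())
print(test_data['income_class'].unique())
# count the rows containing a question mark in, for example, the workclass column
print(train_data['workclass'].str.contains('?', regex=False).sum())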
display(train_data.dtypes)
print(train_data.shape)
There are 6 numeric data columns; the rest (9 out of 15) are text or mixed. There are 32.6K rows. What we will look for:
Numerical features:
- Are there NaN's, outliers or dummy values?
Categorical features:
- Are there missing values? Is clean-up needed?
1.1. Check for missing values/outliers/dummy values in the numerical features¶
pd.isnull(train_data).sum(axis=0).sum()
# No missing values. So there is nothing to replace
# In case there would be, some options are:
# train_data.fillna(value) # replace nan's with value
# or
# train_data.dropna() # remove rows (or columns) when nan's are present
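For illustration, this is what those two options do on a small throwaway frame (our data itself has no NaN's):
# toy example of the two options mentioned above
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': ['x', None, 'z']})
display(toy.fillna({'a': toy['a'].median(), 'b': 'missing'}))  # fill per column
display(toy.dropna())  # drop any row that contains a NaN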
train_data.describe()
test_data.describe()
# This returns only the numerical columns
Concluding: no missing values, no problematic numerical values. The 99999 and 99 values look like dummy values. One could replace these with the median of the column. That would make these features rather uninformative for those specific instances, which might be a good thing if the number is misleading, and a bad thing if the number actually carries some information.
train_data.replace({'capital_gain' : {99999:train_data.capital_gain.median()},
'hoursperweek' : {99:train_data.hoursperweek.median()}}, inplace=False).describe()
Not being convinced that a median replacement is a good idea (I interpret the large values as meaningful, even if not exact), we will leave things as they are.
1.2. Check for missing values/dummies and textual problems in the categorical values¶
Let's convert the text columns to categorical type. This post:
https://blog.dominodatalab.com/pandas-categoricals/
explains nicely how and why. The great benefit: improved performance and decreased memory usage. For textual clean-up, we then need to redefine the categories' values rather than act on the columns themselves with .apply().
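To get an idea of the memory benefit, here is a small check on one of the text columns (a sketch; exact numbers will vary):
# compare memory usage of a text column as object vs. as category
as_object = train_data['workclass'].memory_usage(deep=True)
as_category = train_data['workclass'].astype('category').memory_usage(deep=True)
print('object: {:.2f} MB, category: {:.2f} MB'.format(as_object / 1e6, as_category / 1e6))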
Although the text entries look okay, there is stray leading whitespace:
train_data.loc[0, 'sex']
In our current dataframe, the columns still contain plain text, which we could clean up as follows:
# Remove the whitespaces from each "object"-type column
train_data.apply(lambda x: x.str.strip() if x.dtype == 'object' else x).loc[0, 'sex']
#test_data = test_data.apply(lambda x: x.str.strip() if x.dtype == 'object' else x)
We will, however, first convert to categorical and then do the clean-up on the category levels. This is more involved than doing it on the unconverted columns, because:
- type comparisons are less intuitive
- doing .apply() for cleaning results in object-type columns again
So I would recommend doing this only when the computational advantage of working directly on the categories is relevant. We will go this route here as well, though.
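A small illustration of both points (a sketch that does not modify the data):
# the categorical dtype does not compare equal to 'object', even though the values are strings
sex_cat = train_data['sex'].astype('category')
print(sex_cat.dtype == 'object')   # False
print(hasattr(sex_cat, 'cat'))     # True: a more robust check
# string cleaning on the whole column silently returns an object-dtype column again
print(sex_cat.str.strip().dtype)   # object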
text_cols = [col for col in train_data.columns if train_data[col].dtype == 'object' ]
for col in text_cols:
train_data[col] = train_data[col].astype('category')
test_data[col] = test_data[col].astype('category')
# strip the text of the categories, rather than applying strip on the columns:
train_data['sex'].cat.categories = train_data['sex'].cat.categories.str.strip()
print(train_data['sex'].cat.categories)
cat_cols = list(train_data.dtypes.index[train_data.dtypes == 'category'])
cat_cols
# Note that type comparisons do not always work as expected for categorical
# data. The dtypes == 'category' comparison above, and hasattr(series, 'cat'), do work
for col in cat_cols:
train_data[col].cat.categories = train_data[col].cat.categories.str.strip()
test_data[col].cat.categories = test_data[col].cat.categories.str.strip()
test_data.replace({'income_class': {'>50K.': '>50K', '<=50K.': '<=50K'}}, inplace=True)
# Look at the value counts of the first categorical column
for col in cat_cols[:1]:
print(pd.value_counts(train_data.loc[:, col]))
There are some '?' values, which we need to deal with. I believe that in this case we had better replace them with the mode (the most common value of that feature).
# Note that we can get the levels of categories as follows:
train_data['workclass'].cat.categories
replace_dict = {col: {'?' : train_data[col].mode().values[0]} for col in cat_cols if
'?' in train_data[col].cat.categories}
print(replace_dict)
for df in train_data, test_data:
df.replace(replace_dict, inplace=True)
# replace converted the affected categorical columns back to object dtype. Convert them back [NB: not optimal]
for col in replace_dict.keys():
    train_data[col] = train_data[col].astype('category')
    test_data[col] = test_data[col].astype('category')
for col in cat_cols[:1]:
print(pd.value_counts(train_data.loc[:, col]))
2. Data exploration using pandas¶
a. Groupby() to compare some conditional stats¶
# Do a groupby to look at conditional differences
td = train_data.groupby(by='income_class').agg({'sex': lambda x: sum(x=='Male')/len(x),
'race': lambda x:sum(x=='White')/len(x),
'age': 'mean'})
td.rename(columns={'sex': 'male', 'race': 'white', 'age': 'mean_age'}, inplace=True)
td
The high-earning group is on average older, and predominantly male and white.
b. Single-index crosstabs¶
Crosstabs are "counters" over one or more grouped categorical variables. The arguments are array-like (unlike pivot_table, which takes a DataFrame). We can normalize to obtain fractions.
# Determine the fraction in the sample with income higher than 50K
pd.crosstab(index=train_data['income_class'], columns="Fraction", normalize=True)
# Plot the fractions of different races in the data-set
pd.crosstab(index=train_data['race'], columns="Count", normalize=True).plot(kind='bar', fontsize=14);
c. Multi-index crosstabs¶
By additionally passing one (or more) columns, we can partition further:
# Choose to normalize on index. This gives the fraction of male/female within the income class
fm_fraction_income = pd.crosstab(index=train_data['income_class'],
columns=[train_data['sex']], normalize='index', margins=True)
display(fm_fraction_income)
# Let's plot this result
fig, ax = plt.subplots(1,1)
fpl = fm_fraction_income.plot(kind='bar', stacked=True, fontsize=14, ax = ax)
fpl.legend(loc=3, frameon=True, fancybox=True, framealpha=0.5, facecolor='w');
# Note the semicolon to suppress text output
# NB: frameon is needed because of the seaborn import
Note that we have been looking at fractions conditional on income_class and other categorical variables (for instance: of those earning more than 50K, what fraction is female, when taking 'index' as the normalizer).
Next, let's study how the fraction of people with income exceeding 50K depends on other variables, which is closer to the final classification task. We will do that with pivot tables.
d. Pivot tables¶
# simple example of a pivot table. Note that the next pivot table is equivalent to
# train_data.groupby(by='income_class').mean()
pd.pivot_table(train_data, index='income_class', aggfunc='mean')
To be able to calculate the fractions of people earning more than 50K, we add a column with 0/1 values indicating income below/above 50K, and a column 'age_group' dividing people into under and over 50 years of age.
for df in [train_data, test_data]:
df['larger_50K'] = df['income_class'].apply(lambda x: int(x == '>50K'))
df['age_group'] = df['age'].apply(lambda x: '>=50' if x >= 50 else '<50')
train_data.dtypes
Note that "larger_50K" automatically became categorical. We want an integer, to be able to aggregate:
train_data['larger_50K'] = train_data['larger_50K'].astype('int')
test_data['larger_50K'] = test_data['larger_50K'].astype('int')
pd.pivot_table(train_data, values='larger_50K',
index=['race', 'age_group'], columns=['sex'], aggfunc='mean' )
fig, axs = plt.subplots(1, 2, figsize=(12, 4))
# Do a pivot_table for the Female population
pd.pivot_table(train_data.loc[train_data.sex=='Female', :], values='larger_50K', index=[ 'race'],
columns=['age_group'], aggfunc='mean' ).plot(kind='bar', fontsize=14, ax=axs[0])
axs[0].set_title('Females', fontsize= 15)
axs[0].set_ylim([0, 0.5])
# Male population
pd.pivot_table(train_data.loc[train_data.sex=='Male', :], values='larger_50K', index=[ 'race'],
columns=['age_group'], aggfunc='mean' ).plot(kind='bar', fontsize=14, ax=axs[1])
axs[1].set_title('Males', fontsize= 15)
axs[1].set_ylim([0, 0.5]);
For females, the chance of being in the over-50K group is highest in the over-50, Asian-Pacific group. For males, it is highest for the Asian-Pacific and White groups, also in the over-50 age category.
Note that seaborn offers FacetGrid, which can do similar plotting (a small sketch is given after the pairplot). Finally, a (seaborn) pairplot, which gives some indication of how separable the data is by any pair of features.
# Plot only numerical values (the "hue" comes from a categorical one: 'income_class')
num_cols = list(train_data.columns[train_data.dtypes == 'int'])
# Limit the plot in number of points, and do not take all numerical columns (note that categorical variables are
# automatically excluded)
sns.set(font_scale=1.5)
sns.pairplot(train_data.loc[::10,num_cols + ['income_class']], hue='income_class', plot_kws=dict(s=20, alpha=0.8));
It looks like we have good chances to separate the two classes.
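As mentioned above, seaborn's FacetGrid offers similar conditional plotting; a minimal sketch (not used further in the analysis):
# one age histogram per sex, split by income_class
g = sns.FacetGrid(train_data, col='sex', hue='income_class')
g.map(plt.hist, 'age', alpha=0.6)
g.add_legend();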
A final question to answer: what is education_num? Can we use it as a numerical variable? Or is it an (unordered) categorical variable?
train_data.pivot_table(values='education_num', index='education').sort_values(by='education_num')
Education_num has a one-to-one correspondence with education, and is thus categorical. However, since it is ordered, it is not per se a bad idea to use it as a numerical feature (if we don't want to do one-hot encoding of 16 categorical values). A sketch of the ordered-categorical alternative follows after the plot below.
train_data.pivot_table(values='larger_50K', index='education_num').plot();
# Note that the mean as aggregate function is the default for pivot_table
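If we preferred to keep education as a categorical feature while preserving its ranking, it could also be turned into an ordered categorical. A sketch (not used further here), deriving the order from education_num:
# order the education levels by their (one-to-one) education_num and build an ordered categorical
edu_order = (train_data.pivot_table(values='education_num', index='education')
                       .sort_values(by='education_num').index.tolist())
edu_ordered = train_data['education'].astype(
    pd.api.types.CategoricalDtype(categories=edu_order, ordered=True))
print(edu_ordered.dtype)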
3. Preparing for classification¶
In order to do classification with logistic regression or any other classifier that requires numerical values (neural networks, naive Bayes, actually anything that is not tree-based), we need to convert our categories to dummies: "is the value equal to category X or not?"
We want to continue with the following features:
- education_num (as a numerical feature, which seems a fairly decent approach)
- age (numerical). Note that at a certain age a decline can be expected; Random Forest will be at an advantage here
- hoursperweek (numerical)
- sex (categorical, needs conversion to numerical for logistic regression)
- race (categorical, same as above)
- workclass (categorical)
- capital_gain (numerical)
- capital_loss (numerical)
Note that zip might carry some useful data after suitable binning (for instance, mapping zip to state), but cannot be fed into a logistic regression model in its current state.
We will write the data to disk, and look at the classification in a follow-up post.
col_list = ['education_num', 'age', 'sex', 'race', 'hoursperweek',
'workclass', 'capital_gain', 'capital_loss']
X_train, X_test = train_data.loc[:, col_list], test_data.loc[:, col_list]
y_train = train_data.loc[:, 'income_class'].apply(lambda x: 1 if x=='>50K' else 0)
y_test = test_data.loc[:, 'income_class'].apply(lambda x: 1 if x=='>50K' else 0)
As it turns out, the test set has an additional value in the "workclass" column. We deal with this by concatenating first, then getting dummies, and then splitting again. This way, the train and test features are identical.
len_train = len(X_train)
X_train_test = pd.concat([X_train, X_test])
X_train_test_ohe = pd.get_dummies(X_train_test, drop_first=True)
# Separate them again into train and test
X_train_ohe, X_test_ohe = X_train_test_ohe.iloc[:len_train, :], X_train_test_ohe.iloc[len_train:, :]
display(X_train_ohe.head(3))
display(X_test_ohe.head(3))
Concluding¶
Using pd.read_csv() (pointing directly to a URL, and to a local file on disk) we loaded the data, and we saw how to use df.to_sql() and pd.read_sql() to interact with an SQL database.
To judge the data quality we used df.describe(), df.head(), pd.value_counts() and pd.isnull(). We saw how to clean up using df.apply(), and we converted our categorical variables to the categorical type using df[column].astype('category'). With df.replace({column: {old_value: new_value}}) we could replace values in specific columns.
Displaying conditional probabilities of the target variable was done efficiently with pd.crosstab(index=..., columns=..., normalize=...), and with pd.pivot_table(df, values=..., index=..., columns=...) after converting the target variable to integer type.
Plotting of distributions conditional on the target was done with seaborn, using sns.pairplot. Finally, we did one-hot encoding using pd.get_dummies(df, drop_first=True).
As a last step, write the results to disk, ready for building statistical models!
## Let's write the results to disk for analysis.
X_train.to_pickle(path=data_dir + 'X_train')
X_test.to_pickle(path=data_dir + 'X_test')
X_train_ohe.to_pickle(path=data_dir + 'X_train_ohe')
X_test_ohe.to_pickle(path=data_dir + 'X_test_ohe')
y_train.to_pickle(path=data_dir + 'y_train')
y_test.to_pickle(path=data_dir + 'y_test')