The goal of data preparation is to clean and shape the data so that it is suitable for the machine learning process. Fortunately, our dataset has no missing values and therefore does not need cleaning. However, as described later in this notebook, our data is unscaled and must be preprocessed before we can use it to fit our prediction models.
import obj  # project helper module for saving/loading objects between notebooks
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Parameters
pd.set_option('display.max_columns', 30) # set pandas to display all columns
sns.set_style('whitegrid') # set plot backgrounds to white
# Set graphics to appear inline with notebook code
%matplotlib inline
Import the raw data DataFrame object from the previous notebook.
data_raw = obj.load('var/data_raw')
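The obj module is defined elsewhere in this project and is used only to persist objects between notebooks. It is assumed here to be a thin wrapper around Python's pickle, mapping a path string to a serialized file; a minimal, hypothetical sketch (the actual implementation may differ) could look like:
import pickle

def save(value, path):
    # Serialize `value` to `path` with pickle
    with open(path, 'wb') as f:
        pickle.dump(value, f)

def load(path):
    # Load and return a pickled object from `path`
    with open(path, 'rb') as f:
        return pickle.load(f)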
Before passing data through a machine learning algorithm, it is important to scale each parameter relative to the others so that each parameter exerts an equal pull on the training process. If left unscaled, data with numerically larger (absolute) values will have a much greater effect on the learning process, skewing the prediction model to favor their patterns whether or not they are useful for prediction. Likewise, numerically smaller data points will exert little pull on the learning process, even if they are strong predictors and highly worthwhile.
data_raw.head()
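A quick way to see this disparity in feature magnitudes (assuming all feature columns are numeric) is to compare their summary statistics before scaling:
# Compare feature magnitudes prior to scaling
data_raw.drop('label', axis=1).describe().loc[['mean', 'std', 'min', 'max']]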
# Import scaler module
from sklearn.preprocessing import StandardScaler
# Separate data from labels
feat_raw = data_raw.drop('label',axis=1)
label = data_raw['label']
# Instantiate scaler
scaler = StandardScaler()
# Shape scaler to data
scaler.fit(feat_raw)
# Scale features using scaler
feat_scale = scaler.transform(feat_raw)
# Create DataFrame
feat_scale = pd.DataFrame(feat_scale, columns=feat_raw.columns)
feat_scale.head()
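As a quick check (not part of the original workflow), each standardized feature should now have a mean of roughly 0 and a standard deviation of roughly 1. Note also that if the data were later split into training and test sets, the scaler should be fit on the training portion only, to avoid leaking test-set statistics into the model.
# Verify that each feature is now centered at 0 with unit variance
feat_scale.describe().loc[['mean', 'std']].round(2)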
# Merge scaled data back with labels
data_scale = pd.concat([feat_scale,label], axis=1)
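Note that pd.concat aligns on the index. feat_scale was rebuilt with a default RangeIndex, so this merge assumes data_raw (and therefore label) also uses a default RangeIndex; if it did not, label.reset_index(drop=True) would be needed before concatenating. An optional sanity check:
# Optional check: confirm the merge did not introduce NaNs through index misalignment
assert not data_scale.isnull().any().any(), 'index misalignment introduced NaNs'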
It will be interesting to test the effect of removing parameters that are highly correlated with other parameters. We will create a modified DataFrame that drops some parameters while keeping their highly correlated partner parameters.
data_scaleNoCorr = data_scale.drop(['kurt','centroid','dfrange','IQR'], axis=1)
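The four columns dropped above were presumably chosen by inspecting the correlation matrix. A programmatic way to flag candidates (a sketch, assuming an absolute-correlation threshold of 0.9; both the threshold and the choice of which partner to drop are judgment calls) is:
# Flag one column from each highly correlated pair (|corr| > 0.9)
corr = feat_scale.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_corr = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(high_corr)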
The plots below show the correlation structure of the full data set (top) versus the reduced subset (bottom) containing only the less-correlated features. Notice the lack of extreme red or blue correlations in the bottom plot, aside from the matrix diagonal.
sns.set_style('whitegrid')
g1 = sns.clustermap(data_raw.corr(), cmap='coolwarm', figsize=(6,6))
for text in g1.ax_heatmap.get_yticklabels():
    text.set_rotation('horizontal')
g2 = sns.clustermap(data_scaleNoCorr.corr(), cmap='coolwarm', figsize=(6,6))
for text in g2.ax_heatmap.get_yticklabels():
    text.set_rotation('horizontal')
With the data scaled, we can now proceed to fit various prediction models. We save the data objects here for use in those modeling notebooks.
obj.save(data_scale, 'var/data_scale')
obj.save(data_scaleNoCorr, 'var/data_scaleNoCorr')