This project analyzes human voice samples with the goal of creating predictive models that can accurately identify a speaker as male or female. The data was sourced from Kaggle and includes 3,168 voice samples that were preprocessed with R's seewave and tuneR packages, generating 20 acoustic parameters per sample.
This first notebook focuses on understanding and visualizing the data before we proceed to build machine learning models.
# Custom module for saving/loading Python objects (described at the end of this notebook)
import obj
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Set plot backgrounds to white
sns.set_style('whitegrid')
# Set graphics to appear inline with notebook code
%matplotlib inline
The following acoustic properties of each voice are measured and included within the CSV:
Parameter | Description |
---|---|
meanfreq | mean frequency (kHz) |
sd | standard deviation of frequency |
median | median frequency (kHz) |
Q25 | first quartile (kHz) |
Q75 | third quartile (kHz) |
IQR | interquartile range (kHz) |
skew | skewness |
kurt | kurtosis |
sp.ent | spectral entropy |
sfm | spectral flatness |
mode | mode frequency |
centroid | frequency centroid |
peakf | peak frequency (frequency with highest energy) |
meanfun | average of fundamental frequency measured across acoustic signal |
minfun | minimum fundamental frequency measured across acoustic signal |
maxfun | maximum fundamental frequency measured across acoustic signal |
meandom | average of dominant frequency measured across acoustic signal |
mindom | minimum of dominant frequency measured across acoustic signal |
maxdom | maximum of dominant frequency measured across acoustic signal |
dfrange | range of dominant frequency measured across acoustic signal |
modindx | modulation index |
label | "male" or "female" |
Load the .csv data into a pandas DataFrame.
data_raw = pd.read_csv('voice.csv')
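Preview the first few rows to confirm the file loaded as expected (this assumes voice.csv sits in the notebook's working directory):
# Preview the first few rows of the dataset
data_raw.head()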
Check for missing entries (if any), view the parameter data types, and generate a statistical summary.
# Discover number of missing entries
print('Missing entries:', data_raw.isnull().sum().sum())
print('\n')
# Show data types
data_raw.info()
print('\n')
# Statistical description of data
data_raw.describe()
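It is also worth checking the class balance, since a heavily skewed label distribution would change how we read the plots below.
# Count samples per label to check class balance
data_raw['label'].value_counts()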
When initially viewing the data with Seaborn's pairplot function, it was plain to see that a small handful of the 20 parameters had fairly distinct histogram distributions across the two labels. These six parameters are plotted below to reveal their histograms as well as their relationships to the other separable parameters.
This degree of separation gives us confidence in our ultimate goal of developing machine learning algorithms that distinguish the gender of the source speaker.
# Choose data with visually separable distribution across label
data_of_interest = ['sd','Q25','IQR','sfm','mode','meanfun']
# Colors to differentiate between "male" and "female" labelling
clr_m = '#3498db' # blue
clr_fm = '#F67088' # pink
# Plot data of interest
sns.set(style='whitegrid', font_scale=1.2)  # preserve the whitegrid style while scaling fonts
g = sns.pairplot(data_raw[data_of_interest + ['label']],
                 palette=sns.color_palette([clr_m, clr_fm]),
                 hue='label',
                 height=2.5)  # 'size' on seaborn versions before 0.9
# Rotate x-axis tick labels for readability
for ax in g.axes.flatten():
    for t in ax.get_xticklabels():
        t.set(rotation=33)
In the plot above, notice that each histogram (on the diagonal) shows distinct differences in shape between the male and female labels. The most telling parameter is meanfun (mean fundamental frequency). As one would expect, the fundamental frequencies exhibited by male voices are much lower than those exhibited by female voices. Our final algorithms will likely place a large predictive weight on the meanfun parameter.
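We can quantify this claim by comparing summary statistics of meanfun between the two labels:
# Compare meanfun summary statistics between male and female samples
data_raw.groupby('label')['meanfun'].describe()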
Another important aspect of the data is the correlation between parameters. Highly correlated parameters carry less independent information, providing fewer distinct patterns from which a machine learning algorithm can learn.
The clustered heat map below uncovers some correlations within the data, but still seems to indicate enough independent information among the many parameters to build successful prediction models.
# Correlate only the numeric columns, excluding the string-valued label
g = sns.clustermap(data_raw.corr(numeric_only=True), cmap='coolwarm', figsize=(8,8))
# Keep y-axis labels horizontal for readability
for text in g.ax_heatmap.get_yticklabels():
    text.set_rotation('horizontal')
Although the plot above does not show much correlation, there are some relationships that we can make sense of, given our knowledge of the data.
Looking into the methodology behind the R language's specprop (spectral properties) function, we can see that skewness (skew) and kurtosis (kurt) are derived via the following equations:
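In their standard form, these are the third and fourth standardized moments (a sketch of the usual definitions, which specprop applies to the frequency spectrum):

$$\mathrm{skewness} = \frac{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^3}{\sigma^3} \qquad\qquad \mathrm{kurtosis} = \frac{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^4}{\sigma^4}$$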
Both skewness and kurtosis describe the shape of a distribution. Skewness describes whether the distribution curve leans to the left or right of the mean, while kurtosis describes how heavy the tails of the distribution are relative to its peak. Because of their similar derivations, it is not surprising that they are highly correlated in the data.
Similar relationships exist between centroid and meanfreq, as well as between maxdom and dfrange. Each pair consists of two different measures that attempt to quantify very similar aspects of the data.
Furthermore, IQR (interquartile range) and Q25 (the 25th percentile) are closely related distribution measurements as well. The interquartile range measures the span that encompasses the "middle 50%" of the data, equal to the difference between the 75th and 25th percentiles. In this case, these two parameters show a strong negative correlation, implying that a low 25th percentile yields a large IQR value. This relationship makes sense in the context of the data. A similar relationship does not exist with the Q75 parameter, likely due to the positive skew of the data.
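We can verify these pairwise relationships directly:
# Print the correlation coefficient for each pair discussed above
pairs = [('skew', 'kurt'), ('centroid', 'meanfreq'), ('maxdom', 'dfrange'), ('Q25', 'IQR')]
for a, b in pairs:
    print(f"corr({a}, {b}) = {data_raw[a].corr(data_raw[b]):.3f}")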
A custom module, obj.py, was created to easily save and load objects between Python environments.
# Save the raw DataFrame for use in subsequent notebooks
obj.save(data_raw, 'var/data_raw')
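The module itself is not reproduced in this notebook; a minimal sketch of what obj.py might look like, assuming it is a thin wrapper around Python's pickle module (the save/load signatures match the calls used here; the internals are an assumption):
# obj.py -- assumed pickle-based save/load helper
import os
import pickle

def save(obj, path):
    # Serialize obj to path, creating parent directories as needed
    d = os.path.dirname(path)
    if d:
        os.makedirs(d, exist_ok=True)
    with open(path, 'wb') as f:
        pickle.dump(obj, f)

def load(path):
    # Deserialize and return the object stored at path
    with open(path, 'rb') as f:
        return pickle.load(f)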