Gender Recognition by Voice | 01 | Data Exploration

Project Overview

This project aims to analyze human voice samples and build predictive models that accurately identify a speaker as male or female. The data was sourced from Kaggle and includes 3,168 voice samples preprocessed with the R packages seewave and tuneR, generating 20 unique acoustic parameters per sample.

This first notebook will focus on understanding and visualizing the data before we proceed to generate various machine learning algorithms.

Import Libraries

In [1]:
import obj  # custom save/load helper module (obj.py, described at the end of this notebook)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set plot backgrounds to white
sns.set_style('whitegrid')

# Set graphics to appear inline with notebook code
%matplotlib inline

Load data into DataFrame

The Dataset

The following acoustic properties of each voice are measured and included within the CSV:

Parameter   Description
meanfreq    mean frequency (kHz)
sd          standard deviation of frequency
median      median frequency (kHz)
Q25         first quartile (kHz)
Q75         third quartile (kHz)
IQR         interquartile range (kHz)
skew        skewness
kurt        kurtosis
sp.ent      spectral entropy
sfm         spectral flatness
mode        mode frequency
centroid    frequency centroid
peakf       peak frequency (frequency with highest energy)
meanfun     average fundamental frequency measured across the acoustic signal
minfun      minimum fundamental frequency measured across the acoustic signal
maxfun      maximum fundamental frequency measured across the acoustic signal
meandom     average dominant frequency measured across the acoustic signal
mindom      minimum dominant frequency measured across the acoustic signal
maxdom      maximum dominant frequency measured across the acoustic signal
dfrange     range of dominant frequency measured across the acoustic signal
modindx     modulation index
label       "male" or "female"

(Note: peakf appears in the dataset description but is absent from the CSV itself, which contains 20 numeric parameters plus the label.)

Load data .csv into pandas DataFrame.

In [2]:
data_raw = pd.read_csv('voice.csv')

Discover missing entries (if any), and view parameter data types.

In [3]:
# Discover number of missing entries
print('Missing entries:', data_raw.isnull().sum().sum())
print('\n')

# Show data types
data_raw.info()
print('\n')

# Statistical description of data
data_raw.describe()
Missing entries: 0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3168 entries, 0 to 3167
Data columns (total 21 columns):
meanfreq    3168 non-null float64
sd          3168 non-null float64
median      3168 non-null float64
Q25         3168 non-null float64
Q75         3168 non-null float64
IQR         3168 non-null float64
skew        3168 non-null float64
kurt        3168 non-null float64
sp.ent      3168 non-null float64
sfm         3168 non-null float64
mode        3168 non-null float64
centroid    3168 non-null float64
meanfun     3168 non-null float64
minfun      3168 non-null float64
maxfun      3168 non-null float64
meandom     3168 non-null float64
mindom      3168 non-null float64
maxdom      3168 non-null float64
dfrange     3168 non-null float64
modindx     3168 non-null float64
label       3168 non-null object
dtypes: float64(20), object(1)
memory usage: 519.8+ KB


Out[3]:
meanfreq sd median Q25 Q75 IQR skew kurt sp.ent sfm mode centroid meanfun minfun maxfun meandom mindom maxdom dfrange modindx
count 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000
mean 0.180907 0.057126 0.185621 0.140456 0.224765 0.084309 3.140168 36.568461 0.895127 0.408216 0.165282 0.180907 0.142807 0.036802 0.258842 0.829211 0.052647 5.047277 4.994630 0.173752
std 0.029918 0.016652 0.036360 0.048680 0.023639 0.042783 4.240529 134.928661 0.044980 0.177521 0.077203 0.029918 0.032304 0.019220 0.030077 0.525205 0.063299 3.521157 3.520039 0.119454
min 0.039363 0.018363 0.010975 0.000229 0.042946 0.014558 0.141735 2.068455 0.738651 0.036876 0.000000 0.039363 0.055565 0.009775 0.103093 0.007812 0.004883 0.007812 0.000000 0.000000
25% 0.163662 0.041954 0.169593 0.111087 0.208747 0.042560 1.649569 5.669547 0.861811 0.258041 0.118016 0.163662 0.116998 0.018223 0.253968 0.419828 0.007812 2.070312 2.044922 0.099766
50% 0.184838 0.059155 0.190032 0.140286 0.225684 0.094280 2.197101 8.318463 0.901767 0.396335 0.186599 0.184838 0.140519 0.046110 0.271186 0.765795 0.023438 4.992188 4.945312 0.139357
75% 0.199146 0.067020 0.210618 0.175939 0.243660 0.114175 2.931694 13.648905 0.928713 0.533676 0.221104 0.199146 0.169581 0.047904 0.277457 1.177166 0.070312 7.007812 6.992188 0.209183
max 0.251124 0.115273 0.261224 0.247347 0.273469 0.252225 34.725453 1309.612887 0.981997 0.842936 0.280000 0.251124 0.237636 0.204082 0.279114 2.957682 0.458984 21.867188 21.843750 0.932374

Plot highly separable data

When initially viewing the data with Seaborn's pairplot function, it was plain to see that a small handful of the 20 parameters were fairly distinct across their histogram distributions. These six parameters are plotted below to reveal their histograms as well as their relationships to the other separable parameters.

This clear separation brings confidence to our ultimate goal of developing machine learning algorithms that distinguish the gender of the source speaker.

In [4]:
# Choose data with visually separable distribution across label
data_of_interest = ['sd','Q25','IQR','sfm','mode','meanfun']

# Colors to differentiate between "male" and "female" labelling
clr_m  = '#3498db'  # blue
clr_fm = '#F67088'  # pink

# Plot data of interest
sns.set(font_scale=1.2)
g = sns.pairplot(data_raw[data_of_interest + ['label']],
                 palette=sns.color_palette([clr_m, clr_fm]), 
                 hue='label',
                 height=2.5)
for ax in g.axes.flatten():
    for t in ax.get_xticklabels():
        t.set(rotation=33)

In the above plot, notice that each histogram (on the diagonal) shows distinct differences in shape between the male and female labels. The most telling parameter is meanfun (mean fundamental frequency). As one would expect, the fundamental frequencies exhibited by male voices are much lower than those exhibited by female voices. Our final algorithms will likely place a large predictive weight on the meanfun parameter.
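As a toy illustration of why a well-separated feature is so powerful, a single threshold on a feature whose class distributions barely overlap can classify nearly perfectly. The means and spreads below are illustrative assumptions, not values taken from voice.csv:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical fundamental-frequency distributions (kHz):
# male voices centered lower than female voices
male = rng.normal(loc=0.115, scale=0.015, size=1000)
female = rng.normal(loc=0.170, scale=0.015, size=1000)

# Classify using a single midpoint threshold on this one feature
threshold = (male.mean() + female.mean()) / 2
accuracy = ((male < threshold).mean() + (female >= threshold).mean()) / 2
print(f'single-threshold accuracy: {accuracy:.3f}')
```

With well-separated class means, even this one-feature rule lands well above 90% accuracy, which is why a learned model is expected to lean heavily on such a feature.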

Plot correlations between data categories

Another important aspect of the data is the correlation between parameters. Highly correlated parameters carry less independent information, providing fewer distinct patterns from which a machine learning algorithm can learn.

The clustered heat map below uncovers some correlations within the data, but overall indicates enough independent information among the many parameters to build successful prediction models.

In [5]:
g = sns.clustermap(data_raw.corr(), cmap='coolwarm', figsize=(8,8))
for text in g.ax_heatmap.get_yticklabels():
    text.set_rotation('horizontal')

Although the plot above does not show much correlation, there are some relationships that we can make sense of, given our knowledge of the data.

Looking into the methodology behind the R language's specprop (spectral properties) function, we can see that skewness (skew) and kurtosis (kurt) are derived via the following equations:

$$ S = \frac{\sum_{i=1}^N(x_i-\bar{x})^3}{(N-1)\sigma^3} $$
$$ K = \frac{\sum_{i=1}^N(x_i-\bar{x})^4}{(N-1)\sigma^4} $$

Both skewness and kurtosis describe the shape of a distribution. Skewness measures whether the distribution curve leans to the left or right of the mean, and kurtosis measures how heavy the tails of the distribution are relative to a normal distribution. Because of their similar derivations, it is not surprising that they are highly correlated in the data.
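The two formulas above can be sketched directly in NumPy. The synthetic samples below are illustrative, not drawn from voice.csv; a symmetric normal sample should give skewness near 0 and kurtosis near 3, while an exponential sample is strongly right-skewed:

```python
import numpy as np

def spec_skewness(x):
    # S = sum((x_i - mean)^3) / ((N - 1) * sigma^3)
    x = np.asarray(x, dtype=float)
    sigma = x.std(ddof=1)
    return np.sum((x - x.mean()) ** 3) / ((x.size - 1) * sigma ** 3)

def spec_kurtosis(x):
    # K = sum((x_i - mean)^4) / ((N - 1) * sigma^4)
    x = np.asarray(x, dtype=float)
    sigma = x.std(ddof=1)
    return np.sum((x - x.mean()) ** 4) / ((x.size - 1) * sigma ** 4)

rng = np.random.default_rng(0)
symmetric = rng.normal(size=10_000)          # skewness ~ 0, kurtosis ~ 3
right_skewed = rng.exponential(size=10_000)  # positive skewness

print(spec_skewness(symmetric), spec_kurtosis(symmetric))
print(spec_skewness(right_skewed))
```

Note that both statistics are built from the same centered moments of the data, divided by powers of the same standard deviation, which is the root of their correlation in the heat map.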

Similar relationships exist between centroid and meanfreq (which, per the statistical summary above, appear identical in this dataset), as well as between maxdom and dfrange. Each pair consists of distinct measurements that quantify very similar aspects of the data.

Furthermore, IQR (interquartile range) and Q25 (25th percentile) are closely related distribution measurements as well. The interquartile range encompasses the "middle 50%" of the data, equal to the difference between the 75th and 25th percentiles. In this case, these two categories show a strong negative correlation, implying that a low 25th percentile yields a large IQR value. This relationship makes sense in the context of the data. A similar relationship does not exist with the Q75 category, likely due to the positive skew of the data.
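Because IQR = Q75 - Q25, a lower quartile that varies much more than the upper quartile mechanically produces a strong negative Q25/IQR correlation and only a weak Q75/IQR one. A quick synthetic sketch (the quartile ranges here are hypothetical, chosen to mimic positively skewed data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Lower quartile varies widely; upper quartile varies comparatively little
q25 = rng.uniform(0.05, 0.20, size=500)
q75 = rng.uniform(0.21, 0.26, size=500)

df = pd.DataFrame({'Q25': q25, 'Q75': q75, 'IQR': q75 - q25})
print(df.corr().round(2))
```

The correlation matrix shows Q25 strongly anti-correlated with IQR while Q75 and IQR remain only weakly related, matching the pattern in the heat map.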

Save raw data for further preparation

A custom module, obj.py, was created to easily save and load objects between Python environments.
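The module itself is not shown in this notebook, so the following pickle-based implementation is only an assumption about what obj.py might contain, including the '.pkl' file suffix:

```python
# Hypothetical sketch of obj.py: pickle-based save/load helpers.
import os
import pickle

def save(item, path):
    """Pickle `item` to '<path>.pkl', creating parent directories as needed."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path + '.pkl', 'wb') as f:
        pickle.dump(item, f)

def load(path):
    """Load a previously pickled object from '<path>.pkl'."""
    with open(path + '.pkl', 'rb') as f:
        return pickle.load(f)
```

Pickle round-trips arbitrary Python objects, including pandas DataFrames, which is why a thin wrapper like this is enough to pass data between notebooks.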

In [6]:
obj.save(data_raw,'var/data_raw')