Gender Recognition by Voice | 01 | Data Exploration

Project Overview

This project aims to analyze human voice samples and build predictive models that accurately identify a speaker as male or female. The data was sourced from Kaggle and includes 3,168 voice samples preprocessed with the R packages seewave and tuneR, generating 20 unique acoustic parameters per sample.

This first notebook will focus on understanding and visualizing the data before we proceed to generate various machine learning algorithms.

Import Libraries

In [1]:
import obj  # custom save/load helper module (obj.py, described at the end of this notebook)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set plot backgrounds to white
sns.set_style('whitegrid')

# Set graphics to appear inline with notebook code
%matplotlib inline

Load data into DataFrame

The Dataset

The following acoustic properties of each voice are measured and included within the CSV:

Parameter   Description
meanfreq    mean frequency (kHz)
sd          standard deviation of frequency
median      median frequency (kHz)
Q25         first quartile (kHz)
Q75         third quartile (kHz)
IQR         interquartile range (kHz)
skew        skewness
kurt        kurtosis
sp.ent      spectral entropy
sfm         spectral flatness
mode        mode frequency
centroid    frequency centroid
peakf       peak frequency (frequency with highest energy)
meanfun     average fundamental frequency measured across the acoustic signal
minfun      minimum fundamental frequency measured across the acoustic signal
maxfun      maximum fundamental frequency measured across the acoustic signal
meandom     average dominant frequency measured across the acoustic signal
mindom      minimum dominant frequency measured across the acoustic signal
maxdom      maximum dominant frequency measured across the acoustic signal
dfrange     range of dominant frequency measured across the acoustic signal
modindx     modulation index
label       "male" or "female"

(Note: peakf appears in the dataset description but is absent from the CSV itself, which contains 20 numeric parameters plus the label.)

Load data .csv into pandas DataFrame.

In [2]:
data_raw = pd.read_csv('voice.csv')

Discover missing entries (if any), and view parameter data types.

In [3]:
# Discover number of missing entries
print('Missing entries:', data_raw.isnull().sum().sum())
print('\n')

# Show data types
data_raw.info()
print('\n')

# Statistical description of data
data_raw.describe()
Missing entries: 0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3168 entries, 0 to 3167
Data columns (total 21 columns):
meanfreq    3168 non-null float64
sd          3168 non-null float64
median      3168 non-null float64
Q25         3168 non-null float64
Q75         3168 non-null float64
IQR         3168 non-null float64
skew        3168 non-null float64
kurt        3168 non-null float64
sp.ent      3168 non-null float64
sfm         3168 non-null float64
mode        3168 non-null float64
centroid    3168 non-null float64
meanfun     3168 non-null float64
minfun      3168 non-null float64
maxfun      3168 non-null float64
meandom     3168 non-null float64
mindom      3168 non-null float64
maxdom      3168 non-null float64
dfrange     3168 non-null float64
modindx     3168 non-null float64
label       3168 non-null object
dtypes: float64(20), object(1)
memory usage: 519.8+ KB


Out[3]:
meanfreq sd median Q25 Q75 IQR skew kurt sp.ent sfm mode centroid meanfun minfun maxfun meandom mindom maxdom dfrange modindx
count 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000
mean 0.180907 0.057126 0.185621 0.140456 0.224765 0.084309 3.140168 36.568461 0.895127 0.408216 0.165282 0.180907 0.142807 0.036802 0.258842 0.829211 0.052647 5.047277 4.994630 0.173752
std 0.029918 0.016652 0.036360 0.048680 0.023639 0.042783 4.240529 134.928661 0.044980 0.177521 0.077203 0.029918 0.032304 0.019220 0.030077 0.525205 0.063299 3.521157 3.520039 0.119454
min 0.039363 0.018363 0.010975 0.000229 0.042946 0.014558 0.141735 2.068455 0.738651 0.036876 0.000000 0.039363 0.055565 0.009775 0.103093 0.007812 0.004883 0.007812 0.000000 0.000000
25% 0.163662 0.041954 0.169593 0.111087 0.208747 0.042560 1.649569 5.669547 0.861811 0.258041 0.118016 0.163662 0.116998 0.018223 0.253968 0.419828 0.007812 2.070312 2.044922 0.099766
50% 0.184838 0.059155 0.190032 0.140286 0.225684 0.094280 2.197101 8.318463 0.901767 0.396335 0.186599 0.184838 0.140519 0.046110 0.271186 0.765795 0.023438 4.992188 4.945312 0.139357
75% 0.199146 0.067020 0.210618 0.175939 0.243660 0.114175 2.931694 13.648905 0.928713 0.533676 0.221104 0.199146 0.169581 0.047904 0.277457 1.177166 0.070312 7.007812 6.992188 0.209183
max 0.251124 0.115273 0.261224 0.247347 0.273469 0.252225 34.725453 1309.612887 0.981997 0.842936 0.280000 0.251124 0.237636 0.204082 0.279114 2.957682 0.458984 21.867188 21.843750 0.932374

Plot highly separable data

When initially viewing the data with Seaborn's pairplot function, it was plain to see that a small handful of the 20 parameters were fairly distinct across their histogram distributions. These six parameters are plotted below to reveal their histograms as well as their relationships to the other separable parameters.

This clear separation brings confidence to our ultimate goal of developing machine learning algorithms that distinguish the gender of the source speaker.

In [4]:
# Choose data with visually separable distribution across label
data_of_interest = ['sd','Q25','IQR','sfm','mode','meanfun']

# Colors to differentiate between "male" and "female" labelling
clr_m  = '#3498db'  # blue
clr_fm = '#F67088'  # pink

# Plot data of interest
sns.set(font_scale=1.2)
g = sns.pairplot(data_raw[data_of_interest + ['label']],
                 palette=sns.color_palette([clr_m, clr_fm]), 
                 hue='label',
                 height=2.5)
for ax in g.axes.flatten():
    for t in ax.get_xticklabels():
        t.set(rotation=33)

In the above plot, notice that each histogram (on the diagonal) shows distinct differences in shape between the male and female labels. The most telling parameter is meanfun (mean fundamental frequency). As one would expect, the fundamental frequencies exhibited by male voices are much lower than those exhibited by female voices. Our final algorithms will likely place a large predictive weight on the meanfun parameter.
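As a toy illustration of why a well-separated feature is so powerful, a single threshold on a feature whose class distributions barely overlap can classify nearly perfectly. The means and spreads below are illustrative assumptions, not values taken from voice.csv:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical fundamental-frequency distributions (kHz):
# male voices centered lower than female voices
male = rng.normal(loc=0.115, scale=0.015, size=1000)
female = rng.normal(loc=0.170, scale=0.015, size=1000)

# Classify using a single midpoint threshold on this one feature
threshold = (male.mean() + female.mean()) / 2
accuracy = ((male < threshold).mean() + (female >= threshold).mean()) / 2
print(f'single-threshold accuracy: {accuracy:.3f}')
```

With well-separated class means, even this one-feature rule lands well above 90% accuracy, which is why a learned model is expected to lean heavily on such a feature.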

Plot correlations between data categories

Another important aspect of the data is the correlation between parameters. Highly correlated parameters carry less independent information, providing fewer distinct patterns from which a machine learning algorithm can learn.

The clustered heat map below uncovers some correlations within the data, but overall indicates enough independent information among the many parameters to build successful prediction models.

In [5]:
g = sns.clustermap(data_raw.corr(), cmap='coolwarm', figsize=(8,8))
for text in g.ax_heatmap.get_yticklabels():
    text.set_rotation('horizontal')

Although the plot above does not show much correlation, there are some relationships that we can make sense of, given our knowledge of the data.

Looking into the methodology behind the R language's specprop (spectral properties) function, we can see that skewness (skew) and kurtosis (kurt) are derived via the following equations:

$$ S = \frac{\sum_{i=1}^N(x_i-\bar{x})^3}{(N-1)\sigma^3} $$
$$ K = \frac{\sum_{i=1}^N(x_i-\bar{x})^4}{(N-1)\sigma^4} $$

Both skewness and kurtosis describe the shape of a distribution. Skewness measures whether the distribution curve leans to the left or right of the mean, and kurtosis measures how heavy the tails of the distribution are relative to a normal distribution. Because of their similar derivations, it is not surprising that they are highly correlated in the data.
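The two formulas above can be sketched directly in NumPy. The synthetic samples below are illustrative, not drawn from voice.csv; a symmetric normal sample should give skewness near 0 and kurtosis near 3, while an exponential sample is strongly right-skewed:

```python
import numpy as np

def spec_skewness(x):
    # S = sum((x_i - mean)^3) / ((N - 1) * sigma^3)
    x = np.asarray(x, dtype=float)
    sigma = x.std(ddof=1)
    return np.sum((x - x.mean()) ** 3) / ((x.size - 1) * sigma ** 3)

def spec_kurtosis(x):
    # K = sum((x_i - mean)^4) / ((N - 1) * sigma^4)
    x = np.asarray(x, dtype=float)
    sigma = x.std(ddof=1)
    return np.sum((x - x.mean()) ** 4) / ((x.size - 1) * sigma ** 4)

rng = np.random.default_rng(0)
symmetric = rng.normal(size=10_000)          # skewness ~ 0, kurtosis ~ 3
right_skewed = rng.exponential(size=10_000)  # positive skewness

print(spec_skewness(symmetric), spec_kurtosis(symmetric))
print(spec_skewness(right_skewed))
```

Note that both statistics are built from the same centered moments of the data, divided by powers of the same standard deviation, which is the root of their correlation in the heat map.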

Similar relationships exist between centroid and meanfreq (which, per the statistical summary above, appear identical in this dataset), as well as between maxdom and dfrange. Each pair consists of distinct measurements that quantify very similar aspects of the data.

Furthermore, IQR (interquartile range) and Q25 (25th percentile) are closely related distribution measurements as well. The interquartile range encompasses the "middle 50%" of the data, equal to the difference between the 75th and 25th percentiles. In this case, these two categories show a strong negative correlation, implying that a low 25th percentile yields a large IQR value. This relationship makes sense in the context of the data. A similar relationship does not exist with the Q75 category, likely due to the positive skew of the data.
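Because IQR = Q75 - Q25, a lower quartile that varies much more than the upper quartile mechanically produces a strong negative Q25/IQR correlation and only a weak Q75/IQR one. A quick synthetic sketch (the quartile ranges here are hypothetical, chosen to mimic positively skewed data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Lower quartile varies widely; upper quartile varies comparatively little
q25 = rng.uniform(0.05, 0.20, size=500)
q75 = rng.uniform(0.21, 0.26, size=500)

df = pd.DataFrame({'Q25': q25, 'Q75': q75, 'IQR': q75 - q25})
print(df.corr().round(2))
```

The correlation matrix shows Q25 strongly anti-correlated with IQR while Q75 and IQR remain only weakly related, matching the pattern in the heat map.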

Save raw data for further preparation

A custom module, obj.py, was created to easily save and load objects between Python environments.
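The module itself is not shown in this notebook, so the following pickle-based implementation is only an assumption about what obj.py might contain, including the '.pkl' file suffix:

```python
# Hypothetical sketch of obj.py: pickle-based save/load helpers.
import os
import pickle

def save(item, path):
    """Pickle `item` to '<path>.pkl', creating parent directories as needed."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path + '.pkl', 'wb') as f:
        pickle.dump(item, f)

def load(path):
    """Load a previously pickled object from '<path>.pkl'."""
    with open(path + '.pkl', 'rb') as f:
        return pickle.load(f)
```

Pickle round-trips arbitrary Python objects, including pandas DataFrames, which is why a thin wrapper like this is enough to pass data between notebooks.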

In [6]:
obj.save(data_raw,'var/data_raw')