Speech Recognition with Neural Networks

Predict the transcription from raw audio.

AI RNN Natural Language Processing (NLP) LSTM TensorFlow

Brief

The goal of this project was to build a full pipeline that takes raw audio and predicts its transcription.

First, I experimented with different ways of extracting features from audio.

Then I tested different Neural Network (NN) architectures to find a model that could predict the transcription from those features.

Some numbers

15 models trained

2.2h average training time

Model with 1,946,957 parameters

Pipeline

The aim is a system that, given raw audio, predicts the transcription of the spoken language. The pipeline is summarized in this image:

The first step is pre-processing the raw audio.
The second is an acoustic model that takes audio features and returns, for each time step, a probability distribution over all possible characters (letters and punctuation).
The last step is to turn those probabilities into a transcription.

1. Acoustic Features for Speech Recognition

I tested two ways of extracting features from audio to see which one gave the best results.

The first one is to take the raw audio and transform it into a spectrogram, where one dimension represents time and the other frequency. This is done using the Fast Fourier Transform (FFT).
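As a rough illustration of the idea, the sketch below cuts a signal into overlapping frames, applies a Hann window, and computes the magnitude of each frequency bin. To stay dependency-free it uses a direct discrete Fourier transform, which gives the same values as an FFT but is much slower; a real pipeline would call an FFT routine such as numpy.fft.rfft. The function name, frame size, and hop length are assumptions made for this example, not the settings used in the project.

```python
import cmath
import math


def spectrogram(samples, frame_size=64, hop=32):
    """Return a list of frames, each a list of magnitudes per frequency bin."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # Hann window to reduce spectral leakage at the frame edges.
        windowed = [
            x * (0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, x in enumerate(frame)
        ]
        # Direct DFT over the non-negative frequency bins (0 .. frame_size/2).
        magnitudes = []
        for k in range(frame_size // 2 + 1):
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames


# Toy input: a sine wave that completes 8 cycles per 64-sample frame.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
spec = spectrogram(tone)
print(len(spec), "frames x", len(spec[0]), "frequency bins")
```

Each row of the result is one time step and each column one frequency bin, which is the time-by-frequency layout described above.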

The second one is to use Mel-Frequency Cepstral Coefficients (MFCC). Roughly, MFCCs describe the spectrum on the mel scale, which is spaced according to how humans perceive pitch, and keep only a few coefficients per frame, which reduces the dimensionality of the data.
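The mel scale at the heart of MFCC can be sketched with one commonly used conversion formula, mel = 2595 * log10(1 + f / 700). This is only the frequency-warping step, not a full MFCC computation (which also involves a filter bank, a logarithm, and a cosine transform), and the function names are made up for this example.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (one common variant of the formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# The same 1000 Hz step covers fewer mels at high frequencies than at low ones,
# mirroring the coarser pitch resolution of human hearing up there.
low_step = hz_to_mel(2000) - hz_to_mel(1000)
high_step = hz_to_mel(8000) - hz_to_mel(7000)
print(low_step > high_step)
```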

2. Deep Neural Networks for Acoustic Modeling

In this part I worked with different Neural Network (NN) architectures to see which one performed best. The options ranged from pure Recurrent Neural Networks (RNN) to combinations of Convolutional Neural Networks (CNN) + RNN. The models were:

RNN
RNN + TimeDistributed Dense
CNN + RNN + TimeDistributed Dense
RNN (2 layers) + TimeDistributed Dense
Bidirectional RNN + TimeDistributed Dense

I ended up using a combination of all of the above:

CNN + 2 Bidirectional RNN + 2 TimeDistributed Dense

The full details of the NN can be seen in the next figure:

I used dropout and batch normalization to avoid overfitting and Leaky ReLU to avoid dead ReLUs.
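A model of this shape could be sketched in Keras roughly as follows. This is not the original network: the layer sizes, kernel size, dropout rate, the choice of LSTM cells, the 13 input features, and the 29 output classes (26 letters, space, apostrophe, and a CTC blank) are all assumptions picked for illustration, so the parameter count will not match the figure quoted earlier.

```python
from tensorflow.keras import layers, models


def build_acoustic_model(n_features=13, n_classes=29, rnn_units=128):
    """CNN + 2 bidirectional RNN + 2 time-distributed dense (illustrative sizes)."""
    # Variable-length sequence of feature vectors (time, n_features).
    inputs = layers.Input(shape=(None, n_features))

    # 1D convolution over the time axis.
    x = layers.Conv1D(filters=128, kernel_size=11, padding="same")(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.LeakyReLU()(x)

    # Two bidirectional recurrent layers, each followed by batch norm and dropout.
    for _ in range(2):
        x = layers.Bidirectional(layers.LSTM(rnn_units, return_sequences=True))(x)
        x = layers.BatchNormalization()(x)
        x = layers.Dropout(0.3)(x)

    # Two dense layers applied independently at every time step.
    x = layers.TimeDistributed(layers.Dense(128))(x)
    x = layers.LeakyReLU()(x)
    x = layers.TimeDistributed(layers.Dense(n_classes))(x)

    # Per-time-step probability distribution over the characters.
    outputs = layers.Activation("softmax")(x)
    return models.Model(inputs, outputs)


model = build_acoustic_model()
model.summary()
```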

To get good results it is important to treat this as a Connectionist Temporal Classification (CTC) problem: the audio frames are not aligned with the letters of the transcription, so the loss function needs to be a CTC loss.
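To make the CTC loss more concrete, here is a minimal pure-Python sketch of the forward computation behind it: it sums the probability of every frame-level path that collapses to the target labels and returns the negative log of that sum. It works with plain probabilities for readability, whereas practical implementations (such as the CTC loss that ships with TensorFlow) work in log space and on batches. The function name and the toy inputs are made up for this example.

```python
import math


def ctc_negative_log_likelihood(probs, labels, blank=0):
    """probs: T frames, each a list of per-class probabilities.
    labels: target class indices (without blanks)."""
    # Extended target: a blank before, between, and after every label.
    extended = [blank]
    for label in labels:
        extended += [label, blank]
    size = len(extended)

    # alpha[s]: total probability of all paths ending at extended[s] so far.
    alpha = [0.0] * size
    alpha[0] = probs[0][blank]
    if size > 1:
        alpha[1] = probs[0][extended[1]]

    for t in range(1, len(probs)):
        new_alpha = [0.0] * size
        for s in range(size):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one position
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and extended[s] != blank and extended[s] != extended[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * probs[t][extended[s]]
        alpha = new_alpha

    # Valid paths end on the last label or on the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return math.inf if likelihood == 0.0 else -math.log(likelihood)


# Two frames, classes {0: blank, 1: "a"}, target "a".
# The paths "a-", "-a" and "aa" all collapse to "a".
uniform = [[0.5, 0.5], [0.5, 0.5]]
print(ctc_negative_log_likelihood(uniform, [1]))
```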

3. Obtain the predictions

I only use the most probable letter at each time step. Adding a language model could improve the performance of the system considerably.

A language model would use the probabilities of each letter and match them against possible words, which should give much better results.
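Taking the most probable letter at each time step can be sketched as a greedy CTC decoder: pick the best class per frame, merge consecutive repeats, and drop the blank symbol. The function name, the three-symbol alphabet, and the toy probabilities are assumptions made for this example.

```python
def greedy_ctc_decode(probs, alphabet, blank=0):
    """probs: T frames of per-class probabilities; alphabet[i] is the
    character for class i (the entry at the blank index is never emitted)."""
    # Most probable class index at each time step.
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in probs]

    chars = []
    previous = None
    for index in best:
        # Emit a character only when the class changes and is not the blank.
        if index != previous and index != blank:
            chars.append(alphabet[index])
        previous = index
    return "".join(chars)


alphabet = ["-", "a", "b"]  # index 0 is the CTC blank
frames = [
    [0.1, 0.8, 0.1],    # a
    [0.2, 0.7, 0.1],    # a (repeat, merged)
    [0.9, 0.05, 0.05],  # blank separates the two a's
    [0.1, 0.8, 0.1],    # a
    [0.1, 0.2, 0.7],    # b
    [0.1, 0.1, 0.8],    # b (repeat, merged)
]
print(greedy_ctc_decode(frames, alphabet))  # prints "aab"
```

Because each frame is decided independently, nothing stops the decoder from producing letter sequences that are not words, which is the gap a language model would fill.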

Results

Below is an example of the real text and the model's output.

True transcription:

mister quilter is the apostle of the middle classes and we are glad to welcome his gospel

Predicted transcription:

mis ter cilder is the aposol of the mitl clasos and were gllad to welkom his gosplel

The prediction is not very good, but the model clearly picks up part of the signal: many words come out as recognizable phonetic spellings. With a language model the results would likely be much better.
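One way to put a number on a comparison like this is the character error rate: the edit distance between the true and predicted text divided by the length of the true text. This metric was not part of the original evaluation; the sketch below, with hypothetical function names, only shows how it could be computed.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the reference length (lower is better)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


true_text = "mister quilter is the apostle of the middle classes and we are glad to welcome his gospel"
predicted = "mis ter cilder is the aposol of the mitl clasos and were gllad to welkom his gosplel"
print(f"character error rate: {character_error_rate(true_text, predicted):.2f}")
```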