Accelerating Speech Recognition System by Adam optimization and CNN for Real Time System using GPU

Rajkumar S. Bhosale; Narendra S. Chaudhari

Abstract

In today’s digital world of computers and mobile systems, real-time automatic speech recognition is a promising application. Fast, responsive speech commands let users access smart systems, laptops, and other devices quickly instead of typing commands. With its high accuracy and its ability to learn from examples, deep learning has pushed machine learning classification to its current peak. Within deep learning, the convolutional neural network (CNN) gives promising results in speech recognition. Previous systems limited speech recognition to only two convolutional layers; the proposed model uses five convolutional layers. The proposed deep neural network classification system is applied to 65,000 WAVE files from the speech commands dataset released by Google’s TensorFlow and AIY teams. Mel spectrograms are extracted from the input speech, and the Adam optimization algorithm is used to train the CNN. The CNN outperforms other models and achieves an accuracy of 95.1% for 6 labels. To improve system performance, we added background noise to the data. With noise added, the network not only recognizes different spoken words but also detects whether the input contains silence or background noise. Data augmentation increases the effective size of the training data and helps prevent the network from overfitting. Because training is a crucial and time-consuming step, it is run on a CPU or a GPU (NVIDIA Tesla K40c) for time efficiency. The newly trained speech command detection network can be tested on streaming audio from a microphone. A confusion matrix is calculated to evaluate the system and its predictions on unknown speech words. The proposed five-layer model also outperforms other models on 11 labels of the Google TensorFlow and AIY teams’ dataset, which contains 105,000 WAVE audio files, achieving an accuracy of 94.9% with a reduced training time of 4.5116 sec using the GPU.
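The abstract names Adam as the training algorithm but does not give its settings. As a rough illustration of the update rule only, the following minimal Python sketch applies Adam to a toy quadratic. The function names `adam_step` and `minimize_quadratic` are hypothetical, and the hyperparameters (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`) are the commonly cited defaults, assumed here rather than taken from the authors' setup.

```python
import math


def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a list of float parameters.

    m and v hold the running first- and second-moment estimates, and t is
    the 1-based step count. Hyperparameter defaults are assumptions, not
    values reported in the paper. Returns the updated (params, m, v).
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # Exponential moving averages of the gradient and squared gradient.
        m_i = beta1 * m_i + (1.0 - beta1) * g
        v_i = beta2 * v_i + (1.0 - beta2) * g * g
        # Bias correction compensates for the zero initialisation of m and v.
        m_hat = m_i / (1.0 - beta1 ** t)
        v_hat = v_i / (1.0 - beta2 ** t)
        # Per-parameter step scaled by the root of the second moment.
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


def minimize_quadratic(start, steps, lr=0.05):
    """Toy usage: minimise f(x) = sum(x_i ** 2), whose gradient is 2 * x_i."""
    params = list(start)
    m = [0.0] * len(params)
    v = [0.0] * len(params)
    for t in range(1, steps + 1):
        grads = [2.0 * p for p in params]
        params, m, v = adam_step(params, grads, m, v, t, lr=lr)
    return params


if __name__ == "__main__":
    print(minimize_quadratic([1.0, -2.0], steps=500))
```

In a CNN the same update would be applied to every weight tensor, with the gradients coming from backpropagation on mini-batches of mel spectrograms; the toy gradient here only stands in for that step.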