Time Delay Neural Network (TDNN) with Frequency Dependent Grid-Recurrent Neural Network (RNN) for Child Speech Recognition
Automatic speech recognition (ASR) is the process of converting or transcribing acoustic human speech (i.e., sound waves) into a symbolic form of a human language such as English. Most ASR research has focused on adult speech, yet nearly 81% of ASR usage is by children aged 6-12, for text messaging, downloading mobile applications, accessing the internet, and searching for content with YouTube Kids voice commands, and the use of voice commands is rapidly increasing. When we tested existing ASR models on child speech corpora, we found that they did not achieve an adequate word error rate (WER), because child speech has highly inconsistent acoustic features and clean, labeled data sets are scarce. In the proposed work, we apply a TDNN with a frequency-dependent Grid-RNN to child speech corpora and obtain better performance. The proposed model is also trained on data in which noise and reverberation were manually added to the original corpus. We use the CMU Kids and CSLU Kids corpora for our experiments and show a 15% improvement in WER compared with a traditional TDNN.
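
Since the results are reported in terms of WER, it may help to recall how the metric is usually computed: the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the scoring tool used in this work; the function name `word_error_rate` and the example sentences are hypothetical and chosen only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative sketch of the standard WER definition; not the
    scoring script used in the paper.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substitution ("sat" -> "sit") and one deletion
# (the second "the") against a six-word reference.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many insertions.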