Deep Learning Based Methods for Detection, Separation, and Recognition of Overlapping Speech
Abstract
All speech technology systems such as Automatic Speech Recognition (ASR), speaker diarization, speaker recognition/verification, and speech synthesis have advanced significantly since the emergence of deep learning techniques. However, the performance of these voice-enabled systems degrades rapidly in non-ideal naturalistic environmental circumstances, specifically in the presence of an interfering talker. This challenge, known as the cocktail party problem, is a psycho-acoustic phenomenon: it refers to the remarkable ability of the human auditory system to selectively attend to, recognize, and extract meaningful information from a complex auditory signal in a noisy environment, where the interfering sounds are produced by competing talkers or a variety of noises. For humans, this perceptual processing is made possible in part by bilateral hearing, whereas for speech technology, single-channel audio streams do not allow for any directional sound processing. Moreover, even for listeners with normal hearing, the capacity of the auditory system to extract and separate simultaneous sources from a mixture can be severely compromised. In this dissertation, we propose novel approaches for designing algorithms to detect, separate, and recognize overlapping speech signals, as well as to extract higher-level information from multi-talker speech segments, in order to reduce the existing gap between real-world naturalistic environmental circumstances and current automatic speech technology systems.
Specifically, we propose (i) three alternate Convolutional Neural Network (CNN) models for detection of overlapping speech in segments as short as 25 ms, (ii) an attention-based CNN architecture, which attends to different sound sources in order to count the number of active speakers, and (iii) a Probabilistic Permutation Invariant Training (Prob-PIT) framework to optimize and train a Long Short-Term Memory (LSTM) network that estimates the speaker-specific speech signals from a single-channel mixed audio recording. Next, we develop a hybrid DNN/HMM speech recognition system to identify and recognize the speech of a desired speaker. Experimental results are provided on simulated overlapping speech signals generated from the WSJ, TIMIT, and GRID datasets, which demonstrate the effectiveness of the proposed approaches for processing overlapping speech signals. The experimental results highlight the capability of the proposed system in detecting overlapping speech frames with 90.5% accuracy, 93.5% precision, 92.7% recall, and a 92.8% F-score on the GRID dataset. Also, experimental results on the TIMIT and GRID datasets show that the proposed Prob-PIT speech separation system significantly outperforms the conventional PIT benchmark in terms of Signal-to-Distortion Ratio (SDR) and Signal-to-Interference Ratio (SIR). The proposed ASR system provides an absolute Word Error Rate (WER) improvement of 7% with respect to a conventional ASR system trained without speaker-specific information. Taken collectively, the advancements in overlapping speech detection, speaker count estimation in multi-speaker scenarios, and speech separation based on probabilistic permutation invariant training constitute important steps toward robust speech technology solutions for naturalistic speech scenarios.
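The permutation-invariant training idea underlying (iii) can be stated compactly: since the assignment of separated outputs to reference speakers is ambiguous, the training loss for each mixture is evaluated under every possible output-to-reference assignment and the minimum is taken. The following NumPy sketch illustrates only this baseline hard-minimum PIT loss (the function name `pit_mse_loss` and the use of mean-squared error are illustrative assumptions; the dissertation's Prob-PIT replaces the hard minimum with a probabilistic treatment of all permutations):

```python
import itertools
import numpy as np

def pit_mse_loss(estimates, references):
    """Baseline permutation-invariant MSE loss (illustrative sketch).

    estimates, references: arrays of shape (num_speakers, num_samples).
    Returns the minimum mean-squared error over all assignments of
    estimated sources to reference sources.
    """
    n = estimates.shape[0]
    best = np.inf
    for perm in itertools.permutations(range(n)):
        # Reorder the estimated sources according to this assignment
        # and score them against the fixed reference ordering.
        loss = np.mean((estimates[list(perm)] - references) ** 2)
        best = min(best, loss)
    return best
```

For example, if a network emits the two sources in swapped order, the naive MSE is large while the PIT loss is zero, which is exactly the label-permutation ambiguity this training criterion removes.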