Domain Adaptation for Speech Based Emotion Recognition
Abstract
One of the main barriers to deploying speech emotion recognition systems in real applications is the lack of generalization of the emotion classifiers. The recognition performance achieved on controlled recordings drops when the models are tested with different speakers, channels, environments, and domain conditions. Annotating data in the new domain is expensive and time-consuming. Therefore, it is important to design strategies that efficiently use limited amounts of labeled data in the new domain and extract as much useful information as possible from the available unlabeled data to improve the robustness of the system. This thesis studies approaches to generalize emotion classifiers to new domains.

First, we explore supervised model adaptation, which modifies the trained model using labeled data from the new domain. We study the data requirements and different approaches for support vector machine (SVM) adaptation in the context of supervised adaptation for speech-based emotion recognition. The results indicate that even a small portion of data used for adaptation can significantly improve performance. Increasing the speaker diversity in the labeled data used for adaptation does not provide a significant gain in performance. We also observe that classifiers trained with naturalistic or acted data achieve similar performance after adapting the models to the target domain.

Second, we propose solutions for semi-supervised domain adaptation. We explore the use of active learning (AL) in speech emotion recognition, where AL selects samples from the new domain that are then used to adapt the classification models. We consider two approaches. The first approach focuses on selecting samples that are most beneficial to the classifier. We propose a novel, fast-converging iterative incremental adaptation algorithm that only uses correctly classified samples at each iteration. This conservative framework creates a sequence of smooth changes in the decision hyperplane, resulting in statistically significant improvements over conventional schemes that adapt the models at once using all the available data. The second approach focuses on selecting the features that optimize performance in the new domain. The method combines AL with feature selection to build a diverse ensemble that performs well in the new domain. Ensembles are an attractive solution, since they can be built to perform well across different mismatches. We study various data selection criteria and sample sizes to determine the best approach toward building a robust and diverse ensemble of classifiers. The results demonstrate that we can achieve a significant improvement by performing feature selection on a small set from the target domain.

Finally, we explore unsupervised adaptation for speech emotion recognition. We propose to use adversarial multitask training to extract a representation shared between the training and testing domains. The primary task is to predict emotional attribute-based descriptors for arousal, valence, or dominance. The secondary task is to learn a common representation in which the train and test domains cannot be distinguished. A gradient reversal layer feeds the gradients coming from the domain classifier back into the shared layers, bringing the source and target domain representations closer together. We show that exploiting unlabeled data consistently leads to better emotion recognition performance across all emotional dimensions.
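To make the gradient reversal mechanism concrete, the following is a minimal sketch in PyTorch. The layer sizes, the single-attribute regression head, and the names GradientReversal and AdversarialEmotionModel are illustrative assumptions for exposition, not the exact architecture used in this thesis.

import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    # Identity in the forward pass; negates (and scales) gradients in the backward pass.
    @staticmethod
    def forward(ctx, x, lambda_):
        ctx.lambda_ = lambda_
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The reversed gradient from the domain classifier pushes the encoder
        # toward representations where source and target cannot be separated.
        return -ctx.lambda_ * grad_output, None

class AdversarialEmotionModel(nn.Module):
    # Shared encoder with a primary attribute regressor and a secondary domain classifier.
    def __init__(self, num_features, hidden=256, lambda_=1.0):
        super().__init__()
        self.lambda_ = lambda_
        self.encoder = nn.Sequential(
            nn.Linear(num_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.attribute_head = nn.Linear(hidden, 1)  # arousal, valence, or dominance score
        self.domain_head = nn.Linear(hidden, 2)     # source vs. target domain

    def forward(self, x):
        z = self.encoder(x)
        attribute = self.attribute_head(z)
        domain = self.domain_head(GradientReversal.apply(z, self.lambda_))
        return attribute, domain

In such a setup, the attribute loss would be computed only on labeled source utterances, while the domain loss uses both source and unlabeled target utterances; the reversed gradient is what drives the shared encoder toward a domain-invariant representation.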
We visualize the effect of adversarial training on the feature representation across the proposed deep learning architecture. The analysis shows that the representations of the train and test domains converge as the data passes through deeper layers of the network. The proposed advances create appealing strategies to build robust speech emotion classifiers that generalize across domains, providing practical affect-aware solutions to real-life problems.
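The conservative selection idea behind the incremental adaptation algorithm from the semi-supervised part can be sketched in a similar spirit. The snippet below uses scikit-learn purely for illustration: it only grows the training set with newly labeled target samples that the current SVM already classifies correctly. The function name and the refit-from-scratch step are simplifying assumptions, since the thesis adapts the model incrementally rather than retraining it each time.

import numpy as np
from sklearn.svm import SVC

def conservative_adaptation(X_src, y_src, X_new, y_new, iterations=5):
    # Start from a model trained on the source domain only.
    clf = SVC(kernel="linear", C=1.0).fit(X_src, y_src)
    X_train, y_train = X_src, y_src
    X_pool, y_pool = X_new, y_new
    for _ in range(iterations):
        if len(y_pool) == 0:
            break
        agree = clf.predict(X_pool) == y_pool  # keep only correctly classified samples
        if not agree.any():
            break
        X_train = np.vstack([X_train, X_pool[agree]])
        y_train = np.concatenate([y_train, y_pool[agree]])
        X_pool, y_pool = X_pool[~agree], y_pool[~agree]
        clf = SVC(kernel="linear", C=1.0).fit(X_train, y_train)  # small, smooth update
    return clf

Because only samples consistent with the current decision boundary are added at each step, the boundary changes gradually, which mirrors the smooth sequence of hyperplane updates described above.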