Methods and Experimental Design for Collecting Emotional Labels Using Crowdsourcing
Abstract
Manual annotations and transcriptions have an ever-increasing importance in areas such as behavioral signal processing, image processing, computer vision, and speech signal processing. Conventionally, this metadata has been collected through manual annotations by experts. With the advent of crowdsourcing services, the scientific community has turned to crowdsourcing for many tasks that researchers deem tedious, but that can be easily completed by many human annotators. While crowdsourcing is a cheaper and more efficient approach, the quality of the annotations often becomes a limitation. This work investigates the use of reference sets with predetermined ground truth to monitor annotators' accuracy and fatigue in real time. The reference set includes evaluations that are identical in form to the relevant questions being collected, so annotators are blind to whether their performance on a given question is being verified. This framework provides a more useful form of verification than traditional ground-truth methods, since the data collected for verification is not pure overhead. The framework is implemented for the emotional annotation of the MSP-IMPROV database. A subset of the MSP-IMPROV database was annotated with emotional labels by over ten evaluators. This set provides a unique resource to investigate the tradeoff between the quality and quantity of the evaluations. The analysis relies on the concept of effective reliability, which suggests that many unreliable annotators can provide labels that are as valuable as labels collected from a few experts. Using a post-processing filter that discards noisy labels, we obtain annotation sets with different reliability levels. The study investigates this tradeoff in the context of machine learning evaluations for speech emotion recognition. It also investigates the incremental value of additional annotations. The emotional labels provided by multiple annotators are commonly aggregated to create consensus labels. We propose a stepwise analysis to investigate changes in the consensus labels as we incrementally add new evaluations. The evaluation demonstrates that the consensus labels are very stable, especially after five evaluations. The protocol for crowdsourcing evaluations and the results of the analyses represent important contributions to the area of affective computing.
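For context, the effective reliability argument is commonly formalized with the Spearman-Brown prediction formula; this specific formulation is stated here as a standard reference point from the reliability literature rather than quoted from the abstract:

    R_eff = (n · r̄) / (1 + (n − 1) · r̄)

where n is the number of annotators whose labels are averaged and r̄ is the mean pairwise inter-annotator reliability. As an illustrative (hypothetical) calculation, ten annotators with a modest individual reliability of r̄ = 0.4 yield R_eff = (10 × 0.4) / (1 + 9 × 0.4) ≈ 0.87, which is why aggregating many imperfect crowdsourced labels can approach the quality of labels from a few highly reliable experts.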