Crowdsourcing construction of information retrieval test collections for conversational speech
Building a test collection for an ad hoc information retrieval system on conversational speech raises new challenges for researchers. Traditional methods for building test collections are costly, and thus they are not feasible to apply to large-scale conversational speech data. Constructing a large, high-quality test collection on conversational speech at low cost is challenging. Crowdsourcing may represent a promising approach: crowd workers tend to be less expensive than professional assessors, and they can work simultaneously to perform jobs at scale. However, despite the benefits of scale and cost, the quality of the results delivered by crowd workers may suffer. This thesis focuses on relevance judging, one of the key components of a test collection. We adopt two crowdsourcing platforms, oDesk and MTurk; use audio clips and various versions of transcripts; conduct multiple experiments under diverse settings; and analyze the results qualitatively and quantitatively. We delve into the factors that influence the quality of relevance judgments on conversational speech. We also investigate differences between relevance judgments from experts and crowd workers. This thesis further describes best practices for the design of crowdsourcing tasks to improve crowd workers' performance. Ultimately, these may assist researchers in using crowdsourcing to build high-quality test collections on conversational speech at low cost and large scale.