Human annotators are critical for creating the datasets necessary to train statistical learning algorithms. However, several factors limit the creation of large annotated datasets, such as annotation cost and limited access to qualified annotators. In recent years, researchers have investigated overcoming this data bottleneck by resorting to crowdsourcing: the delegation of a task to a large group of individuals rather than a single person, usually via an online marketplace.
This thesis is concerned with crowdsourcing annotation tasks that aid the training, tuning, or evaluation of statistical learners, across a variety of tasks in natural language processing. The tasks reflect a spectrum of annotation complexity, from simple class label selection, through selecting textual segments from a document, to composing sentences from scratch. The annotation setups are novel in that they involve new types of annotators, tasks, and data, as well as new algorithms capable of handling such data.
The thesis is divided into two main parts: the first deals with text classification, and the second with machine translation (MT).
The first part covers two examples of the text classification task. The first is identifying dialectal Arabic sentences and distinguishing them from Standard Arabic sentences. We utilize crowdsourcing to create a large annotated dataset of Arabic sentences, which is used to train and evaluate language models for each Arabic variety. The second is a sentiment analysis task: distinguishing positive movie reviews from negative ones. We introduce a new type of annotation, called rationales, which complements the traditional class labels and aids in learning model parameters that generalize better to unseen data.
In the second part, we examine how crowdsourcing can benefit machine translation. We start with the evaluation of MT systems and show the potential of crowdsourcing to edit MT output. We also present a new MT evaluation metric, RYPT, which is based on human judgment and is well suited to a crowdsourced setting. Finally, we demonstrate that crowdsourcing can be helpful in collecting translations to create a parallel dataset. We discuss a set of features that help distinguish well-formed translations from ill-formed ones, and we show that crowdsourced translation yields results of near-professional quality at a fraction of the cost.
Throughout the thesis, we are concerned with ensuring that the collected data is of high quality, and we employ a set of quality control measures for that purpose. These methods are helpful not only in detecting spammers and unfaithful annotators, but also in identifying annotators who are simply unable to perform the task properly, a more subtle form of undesired behavior.