Dr. Watson: Gamification of Crowdsourcing for Information Extraction

From Master Projects
Revision as of 06:20, 24 September 2013 by Laroyo (talk | contribs)


Title: Dr. Watson: Gamification of Crowdsourcing for Information Extraction
Status: finished
Master: Knowledge Technology and Intelligent Internet Applications
Student name: Anca Dumitrache
Start date: 2013/02/01
End date: 2013/08/01
Supervisor: Lora Aroyo
Second supervisor: Robert-Jan Sips
Second reader: Chris Welty
Company: IBM
Thesis: Media:AD_MScThesis.pdf
Poster: Media:Posternaam.pdf


With the growth of the Web, text annotation has come to play an important part in the way that people interact with information online. An example of what semi-structured online data makes possible is Watson, the system designed by IBM that won the Jeopardy! TV quiz show against human competitors. To build its knowledge base, Watson was trained on a series of databases, taxonomies, and ontologies of publicly available data. IBM currently aims to train Watson for question answering in the medical domain. For this reason, training and evaluation data, in the form of annotated medical text, are needed.

Automated tools for natural language processing often have difficulty capturing the ambiguity of expressions within a text (metaphors, for example, are notoriously difficult to parse). This ambiguity is also present in medical text and in the way it is annotated: experts often disagree over how a certain concept should be formalized. However, when experts are asked to annotate text by adhering to strict guidelines, the ambiguity of the language is lost. In addition, annotating text takes time and involves only a small number of experts. Once the time needed to train the experts is factored in, this method becomes costly and not particularly scalable.

In order to capture the ambiguities in medical text, we argue that the task of annotating medical text for Watson can be solved via crowdsourcing. While there is already an ongoing effort to establish an annotation workflow using human resources from crowdsourcing platforms such as CrowdFlower and AMT, this project focuses on exploring how the crowdsourcing annotation effort can be extended with a professional crowd of medical experts. Our hypothesis is that a range of annotation tasks (e.g. identification of demographic factors, symptom factors, relations between entities, etc.) are suitable for a paid crowd of lay workers, who can bootstrap the process with large amounts of annotated data that capture diversity in lexical expression. However, to increase the quality of this annotation data, certain tasks (e.g. verification and diagnosis relations) can also be performed by medical students and professionals, so that the aspects of medical text that require domain knowledge are captured as well.
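
The division of work described in this hypothesis can be illustrated with a minimal Python sketch. It is not the project's implementation: the task names, the crowd names, and both functions are hypothetical and only mirror the examples given in the text. The sketch routes a task type to the lay or the expert crowd, and keeps the full distribution of labels given by workers instead of a single majority label, as one possible way to preserve disagreement.

```python
from collections import Counter

# Hypothetical task categories, named after the examples in the text.
LAY_CROWD_TASKS = {"demographic_factors", "symptom_factors", "entity_relations"}
EXPERT_TASKS = {"verification", "diagnosis_relations"}


def route_task(task_type):
    """Return the crowd a task type would be sent to under the hypothesis."""
    if task_type in EXPERT_TASKS:
        return "expert_crowd"
    if task_type in LAY_CROWD_TASKS:
        return "lay_crowd"
    raise ValueError(f"unknown task type: {task_type}")


def label_distribution(annotations):
    """Map each label to its relative frequency among the workers' answers.

    Keeping all labels (an assumption of this sketch) preserves the
    disagreement between workers rather than collapsing it to one answer.
    """
    counts = Counter(annotations)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}
```

For example, `route_task("symptom_factors")` selects the lay crowd, while `label_distribution` applied to the answers of several workers on the same sentence shows how strongly they agree on each candidate relation.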

This project is performed in collaboration with IBM Amsterdam and IBM Research, New York.

Abstract KIM 1

This presentation will describe the requirements of the gamified crowdsourcing application for the medical domain, both from the point of view of the crowd and of the Watson data that needs to be retrieved. Then, a possible framework for building such an application will be presented, covering features such as learning, playfulness, and personalized game play, as well as possibilities for integrating it with a paid crowdsourcing platform. Finally, an outline of how such a system could be evaluated will be discussed.