Find the Clue: Identify terms in text

From Master Projects

About Find the Clue: Identify terms in text

  • This project has not yet been fulfilled.
  • This project fits in the following master areas: Multimedia, Internet and Web Technology, AI and Communication, Knowledge Technology and Intelligent Internet Applications, Information and Communication Technology, Bioinformatics, Information Sciences


Natural Language Processing tools typically rely on data that has previously been annotated by humans. This so-called Ground Truth data captures the Gold Standard interpretation of what different words in a text represent. However, for some types of words it is difficult for human annotators to reach consensus on what the Gold Standard interpretation is. This Master's project therefore experiments with collecting such annotation data through crowdsourcing tasks. These tasks can be performed in different subject domains, e.g. medical texts, newspapers, historical texts, textual descriptions of TV programs and textual descriptions of museum objects.

For your Master's project you can choose one of these domains and define crowdsourcing task(s) to collect annotations for some of the following types:

  • relationships between terms, e.g. medical terms, events, people
  • terms referring to medical symptoms, observations, etc.
  • terms referring to political and historical events
  • terms referring to political and historical figures and organizations
  • types of political and historical events

The work on this Master's project is performed in collaboration with IBM Research, in particular with Chris Welty, one of the developers of Watson, the IBM computer that defeated the best players on the American game show Jeopardy!


The project involves:

  • setting up crowdsourcing tasks for data collection
  • processing the collected data
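
As an illustration of the data processing step, the following minimal Python sketch aggregates crowd judgments per term and reports how strongly the workers agree. The record layout (worker, sentence, term, label), the sample labels and the function name aggregate are assumptions made for this example; they do not reflect the export format of CrowdFlower or Amazon Mechanical Turk.

```python
from collections import Counter, defaultdict

# Hypothetical crowd judgments: (worker_id, sentence_id, term, label).
# Field names and labels are illustrative assumptions, not a platform export format.
judgments = [
    ("w1", "s1", "fever", "symptom"),
    ("w2", "s1", "fever", "symptom"),
    ("w3", "s1", "fever", "observation"),
    ("w1", "s2", "Watergate", "event"),
    ("w2", "s2", "Watergate", "event"),
    ("w3", "s2", "Watergate", "event"),
]


def aggregate(judgments):
    """Group judgments per (sentence, term) and return, for each unit,
    the label distribution, the most frequent label, and the share of
    judgments that chose that label."""
    votes = defaultdict(Counter)
    for _worker, sentence_id, term, label in judgments:
        votes[(sentence_id, term)][label] += 1

    summary = {}
    for unit, counts in votes.items():
        total = sum(counts.values())
        top_label, top_count = counts.most_common(1)[0]
        summary[unit] = {
            "distribution": dict(counts),
            "top_label": top_label,
            "agreement": top_count / total,
        }
    return summary


summary = aggregate(judgments)
for unit, info in summary.items():
    print(unit, info["top_label"], round(info["agreement"], 2))
```

A low agreement score does not necessarily point to poor workers; it can also flag terms for which a single Gold Standard interpretation is hard to establish.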

Tools and Data

Possible text collections are:

  • Wikipedia texts
  • medical texts + vocabularies
  • newspapers + vocabularies
  • historical texts + vocabularies
  • TV program textual descriptions (including teletext) + vocabularies
  • museum objects textual descriptions + vocabularies

Possible tools are:

  • crowdsourcing platforms: CrowdFlower, Amazon Mechanical Turk
  • data analysis tools, e.g. Excel, R
  • (optionally) Natural Language Processing tools

Recommended prior knowledge

  • Knowledge & Media course
  • Social Web course
  • Research methods course

Extra Information

Contact Lora Aroyo, Michiel Hildebrand or Guus Schreiber for more information about this project.