News Portal Enhancement


About News Portal Enhancement

  • This project has been fulfilled.
  • This project fits in the following master areas: Knowledge Technology and Intelligent Internet Applications, Technical Artificial Intelligence, AI and Communication, Information Sciences



The overall goal is to provide visitors of a news portal with more information about the subjects they are reading about. This means that news items are not presented as stand-alone pieces of information, but are:

  • linked to relevant and related news the viewer might also be interested in reading about
  • augmented with background information (such as an infobox) that summarizes relevant facts

Ultimately, the interest of the news portal owner is to maximize the traffic on the portal, by:

  • stimulating the viewer to read more news contained on the portal
  • providing better information so as to attract visitors back to the portal


In this project, the starting point for linking news and providing background information is the concept of an entity. An entity in the news domain can be a person (such as a politician or a football player), an organization (such as the UN or the WWF), a location (such as Rome or Amsterdam), etc. News items are automatically parsed by Natural Language Processing (NLP) tools, which are available and need not be developed in the project, and entities are detected. We assume that each of these entities has a unique identifier and that the tools produce an XML file in which each entity is annotated with its identifier. The end result of this phase is therefore a repository of news annotated with entities and their identifiers. Entity identifiers provide the following functionality:

  • A means to retrieve documents with 100% recall and precision, since the search can be done using the identifier, which is unambiguous
  • A means to link news to each other (news containing the same entity can be linked together since they are related)
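
As a minimal sketch of these two functions, assume the NLP tools emit XML in which each detected entity is wrapped in an `entity` element carrying an `id` attribute. The element names, attribute names, identifiers, and sample texts below are hypothetical; the actual output format of the tools is not specified here.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical annotated news items; the real schema may differ.
NEWS_XML = """
<repository>
  <news id="n1">
    <entity id="ent:001">Rome</entity> hosts a summit with <entity id="ent:002">WWF</entity>.
  </news>
  <news id="n2">
    <entity id="ent:002">WWF</entity> publishes a new report.
  </news>
  <news id="n3">
    Floods hit <entity id="ent:003">Amsterdam</entity>.
  </news>
</repository>
"""

def build_index(xml_text):
    """Map each entity identifier to the set of news ids that mention it."""
    index = defaultdict(set)
    root = ET.fromstring(xml_text.strip())
    for news in root.iter("news"):
        for entity in news.iter("entity"):
            index[entity.get("id")].add(news.get("id"))
    return index

def retrieve(index, entity_id):
    """Return all news items mentioning the entity, looked up by identifier."""
    return sorted(index.get(entity_id, set()))

def related_news(index, news_id):
    """Return news items that share at least one entity with the given item."""
    related = set()
    for news_ids in index.values():
        if news_id in news_ids:
            related |= news_ids
    related.discard(news_id)
    return sorted(related)

index = build_index(NEWS_XML)
print(retrieve(index, "ent:002"))
print(related_news(index, "n1"))
```

Because the lookup key is the identifier and not the surface string, two mentions written differently would still land in the same index entry, provided the annotation step assigned them the same identifier.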

Even though this functionality is already available and provides an initial starting point, the results from a user point of view can still be considerably improved. What is still missing is:

  • a way to provide information that is not contained in one single document: this requires integrating different sources of knowledge (rather than different documents)
  • a way to link documents that takes context into account: simply providing links to all documents that mention, for example, a particular politician can overwhelm the viewer if no filtering based on contextual information is performed. Moreover, some interesting links may not be based on the presence of the same entities, but may exist, for example, because the documents describe analogous events.
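
One possible form of such filtering, sketched under the assumption that the context of a news item can be approximated by the full set of entities it mentions, is to rank candidates by the overlap of their entity sets instead of listing every item that shares a single entity. The Jaccard measure, the threshold, the function names, and the sample data are illustrative choices, not part of the project description.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def contextual_links(target_id, entities_by_news, min_score=0.3, top_k=3):
    """Return up to top_k news ids whose entity sets overlap most with the target."""
    target = entities_by_news[target_id]
    scored = []
    for news_id, entities in entities_by_news.items():
        if news_id == target_id:
            continue
        score = jaccard(target, entities)
        if score >= min_score:
            scored.append((score, news_id))
    # Highest score first; ties are broken by news id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [news_id for _, news_id in scored[:top_k]]

# Hypothetical entity sets per news item.
entities_by_news = {
    "n1": {"politician_A", "Rome", "summit_X"},
    "n2": {"politician_A", "Rome"},
    "n3": {"politician_A", "football_club_B"},
    "n4": {"Amsterdam"},
}
print(contextual_links("n1", entities_by_news))
```

With these sample sets, the item that shares only the politician falls below the threshold, while the item that shares both the politician and the location is kept. Links between items that describe analogous events without sharing entities would need a different signal, such as the event model discussed in this project.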

Concerning the first bullet point, this project aims to investigate ways to perform entity-based data integration. Publicly available knowledge sources that can be used for experiments include, for example, Freebase and DBpedia. The second bullet point requires modeling knowledge that can be extracted automatically from the news with NLP tools. The aim of this task is to define which interesting aspects of news can be modeled and to encode them in a knowledge source. This knowledge source can be used to establish interesting links between news, as well as in the data integration mentioned above. As an initial step, we are considering modeling events, since these are particularly useful for news.
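
To make the idea of entity-based data integration concrete, the following is a minimal sketch in which two in-memory dictionaries stand in for records obtained from two knowledge sources. The source names, identifiers, attributes, and the `integrate` function are hypothetical; no real Freebase or DBpedia API is used, and how the records would be fetched is left open.

```python
from collections import defaultdict

# Hypothetical records from two sources, keyed by the shared entity identifier.
SOURCES = {
    "source_a": {"ent:001": {"name": "Rome", "country": "Italy"}},
    "source_b": {"ent:001": {"name": "Roma", "type": "city"}},
}

def integrate(entity_id, sources):
    """Merge all attribute values for one entity, remembering where each came from."""
    merged = defaultdict(dict)
    for source_name, records in sources.items():
        for attribute, value in records.get(entity_id, {}).items():
            # Record every distinct value together with the sources asserting it.
            merged[attribute].setdefault(value, []).append(source_name)
    return {attribute: dict(values) for attribute, values in merged.items()}

profile = integrate("ent:001", SOURCES)
for attribute, values in profile.items():
    print(attribute, values)
```

The sketch keeps conflicting values side by side (here, two spellings of the name) and does not pick one, so deciding how to resolve or present such conflicts remains a design question for the project.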

Specific tasks

In this project, the master student can choose to focus on one of the following two tracks:

Track 1

  • investigate the capabilities of existing NLP tools, especially with respect to their ability to model relations between entities and events
  • decide what to model (events being a good candidate, as noted above, if the investigation of the NLP tools reveals that this is possible)
  • experiment with linking news using entities and the aspect modeled in the previous point

Track 2

  • Investigate how to integrate different data sources that describe the same entity
  • Provide background/related information when presenting news on the portal

The tasks can be discussed together with the student depending on his/her interests. A mix of both is also possible.

Background information and related projects

Assigning unique identifiers is the goal of the OKKAM project, which provides the infrastructure for this project.

Example of news portal “information enrichment” can be seen at

Information on several entities can be found on Freebase, DBpedia and other data sets freely available from the Linking Open Data initiative. The University of Trento is experimenting with data integration based on entities; see "LibSwb: Browsing the Entity Context".