Semantic enrichment of scientific articles

status: finished
Student name: Fawad Khan
Start date: 2010/06/01
End date: 2010/11/30
Supervisor: Paul Groth
Second reader: Stefano Bocconi
Poster: Media:Posternaam.pdf


About Semantic enrichment of scientific articles

  • This project has been completed.
  • This project fits in the following master areas: Knowledge Technology and Intelligent Internet Applications, Technical Artificial Intelligence, AI and Communication, Information Sciences



The broad scope of this project is the implementation of a possible scenario for the future of scientific online publishing. The project is inspired on the one hand by current trends in the Semantic Web, and on the other by the efforts of a scientific publisher, Elsevier, to provide value-added services to its customers. The general idea is to enrich an article with information about the concepts contained in the text, so that a reader can not only read the text but also access different sources of related information.


The articles we consider belong to the domain of molecular biology, specifically proteins and their interactions. The general process is the following:

  • At authoring time, the author determines which entities in the article are relevant. A Word plugin helps the author perform this task.
  • These entities are linked in the text to their relevant source of information (a database).
  • All the links are extracted from the text and kept in a separate file, which contains the entities, their links to the external sources of information, and their positions in the text.
  • This separate file is extended with additional information, encoded in a Semantic Web language.
  • When a user views the article, this information is put back into the text, for example as links to relevant information or through special visualization interfaces.
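
To make the extraction and re-insertion steps concrete, here is a minimal Java sketch of a stand-off annotation and of putting the links back into the text at viewing time. It is only an illustration: the class names, the field layout, and the example.org URLs are assumptions, not the project's actual file format, and the sketch assumes that mentions do not overlap and that the text needs no HTML escaping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** One stand-off annotation: an entity mention stored outside the article text. */
final class Annotation {
    final int start;      // offset of the first character of the mention
    final int end;        // offset one past the last character of the mention
    final String entity;  // name of the entity
    final String link;    // URL of the external source (placeholder values below)

    Annotation(int start, int end, String entity, String link) {
        this.start = start;
        this.end = end;
        this.entity = entity;
        this.link = link;
    }
}

final class StandoffDemo {
    /** Puts the annotations back into the plain text as HTML links. */
    static String render(String text, List<Annotation> annotations) {
        List<Annotation> sorted = new ArrayList<>(annotations);
        // Work from the last mention to the first so earlier offsets stay valid.
        sorted.sort(Comparator.comparingInt((Annotation a) -> a.start).reversed());
        StringBuilder out = new StringBuilder(text);
        for (Annotation a : sorted) {
            String mention = text.substring(a.start, a.end);
            out.replace(a.start, a.end, "<a href=\"" + a.link + "\">" + mention + "</a>");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "p53 binds MDM2.";
        List<Annotation> annotations = Arrays.asList(
                new Annotation(0, 3, "p53", "https://example.org/protein/p53"),     // hypothetical URL
                new Annotation(10, 14, "MDM2", "https://example.org/protein/MDM2")); // hypothetical URL
        System.out.println(render(text, annotations));
    }
}
```

Keeping the offsets in a separate structure means the article text itself stays untouched, and the same annotations can feed either plain links or a richer visualization.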

The plugin is currently being developed within the OKKAM project, and the OKKAM concept should be used throughout the process. In essence, OKKAM provides a unique ID for each distinct entity (in our case proteins, article authors, etc.). Information about a particular entity can be encoded in a knowledge base that refers to the entity by its OKKAM ID. Using an entity’s OKKAM ID then makes it possible to connect all the information about the same entity contained in different knowledge bases. Other tools to perform some of the actions described above are available from Elsevier, which is actively experimenting with these ideas.
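
The role of a shared identifier can be sketched in a few lines of Java. The example below assumes, purely for illustration, that each knowledge base is a map from entity ID to a set of properties; the ID string, the property names, and the `merge` helper are hypothetical and do not reflect the real OKKAM identifier format or API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class EntityMergeDemo {
    /**
     * Collects everything the given knowledge bases say about one entity ID.
     * If two knowledge bases use the same property name, the later one wins.
     */
    static Map<String, String> merge(String entityId,
                                     List<Map<String, Map<String, String>>> knowledgeBases) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, Map<String, String>> kb : knowledgeBases) {
            Map<String, String> facts = kb.get(entityId);
            if (facts != null) {
                merged.putAll(facts);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        String id = "okkam:example-p53"; // hypothetical ID, not a real OKKAM identifier

        Map<String, Map<String, String>> proteinDb = new HashMap<>();
        proteinDb.put(id, Collections.singletonMap("name", "p53"));

        Map<String, Map<String, String>> literatureDb = new HashMap<>();
        literatureDb.put(id, Collections.singletonMap("mentionedIn", "article-42"));

        // Both sources refer to the entity by the same ID, so their facts can be joined.
        System.out.println(merge(id, Arrays.asList(proteinDb, literatureDb)));
    }
}
```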

Expected contribution

There are mainly two aspects in this project:

  • To build the infrastructure that takes an article as input and performs the necessary format conversion to and from XML
  • To envision new ways to semantically enrich articles and present them to users. Some of these enrichment and visualization ideas are already being investigated, but there is plenty of room to be inventive.
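
For the first aspect, conversion to and from XML, a starting point could look like the round trip below, which uses only the standard Java DOM and transformer APIs. The element and attribute names (`annotations`, `entity`, `start`, `end`, `link`) are invented for this sketch and are not the schema used by the project or by Elsevier.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class AnnotationXmlDemo {
    /** Serializes one entity mention to a small XML document (made-up element names). */
    static String toXml(String entity, String link, int start, int end) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("annotations");
            doc.appendChild(root);
            Element e = doc.createElement("entity");
            e.setAttribute("start", Integer.toString(start));
            e.setAttribute("end", Integer.toString(end));
            e.setAttribute("link", link);
            e.setTextContent(entity);
            root.appendChild(e);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception ex) {
            throw new IllegalStateException("Could not write annotation XML", ex);
        }
    }

    /** Parses the XML back and returns the first entity element. */
    static Element firstEntity(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("entity");
            return (Element) nodes.item(0);
        } catch (Exception ex) {
            throw new IllegalStateException("Could not read annotation XML", ex);
        }
    }

    public static void main(String[] args) {
        // The URL is a placeholder, not a real database address.
        String xml = toXml("p53", "https://example.org/protein/p53", 0, 3);
        System.out.println(xml);
        System.out.println(firstEntity(xml).getAttribute("link"));
    }
}
```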

The candidate can choose their own balance between the two aspects. Preferably the emphasis should be on the second one, which is the ultimate goal. For the latter we need:

  • Ideas about how to enrich the information beyond the linking to relevant databases.
  • Ideas about how to visualize the information gained from the article.

Technology involved

Programming skills (preferably Java), and knowledge of, or willingness to learn, XML Schema, XPath, text parsing, and Semantic Web languages. Some affinity with biology would be nice.