Linking historical ship records to newspaper archives


Title: Linking historical ship records to newspaper archives
Status: finished
Master: Knowledge Technology and Intelligent Internet Applications
Student name: Andrea Cristina Bravo Balado
Start date: 2013/04/01
End date: 2014/07/31
Supervisor: Victor de Boer
Second reader: Niels Ockeloen
Company: VU
Thesis: Media:Thesis_S2116286_linking-historical-ship-records-to-newspaper-archives.pdf

Final abstract

Linking historical datasets and making them available on the Web has increasingly become a subject of research in the field of digital humanities. In this master project, we focused on discovering links between ships from a dataset of Dutch maritime events and newspaper articles from historical archives. We took a two-stage approach: first, a heuristic-based method for record linkage, and then machine learning algorithms for article classification, used for post-hoc filtering. Evaluation of the linking method showed that certain domain features were indicative of mentions of ships in newspapers. Moreover, the classifiers achieved near-perfect precision in predicting ship-related articles.

More information at the VU Web&Media blog

Abstract KIM 1

In recent years, Information Extraction (IE) has become more relevant in the context of historical research, given its potential to find and extract structured information from unstructured sources such as books, newspapers and other written sources from previous centuries.

In the Netherlands, history is closely tied to maritime activity, and research in this field is very active. In this project we assume that, given the importance of maritime activity in everyday life in the 18th and 19th centuries, announcements of ship departures and arrivals, as well as mentions of accidents or of the cargo each ship carried, can be found in newspapers.

The Koninklijke Bibliotheek has selected, digitized and made available a collection of newspapers from 1618 to 1995, including original scans, PDF versions and text captured using Optical Character Recognition (OCR) technologies. These noisy texts are the main data source of this project.
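Because OCR text is noisy, an exact string search for a ship name can miss mentions in which characters were misread. The sketch below illustrates one possible way to tolerate such errors, using approximate string matching from Python's standard library. It is an assumption made for illustration, not the method used in this project; the function name `best_fuzzy_match`, the ship name and the sample sentence are all hypothetical.

```python
from difflib import SequenceMatcher


def best_fuzzy_match(name, text):
    """Return (window, score) for the token window in `text` most similar to `name`.

    Illustrative only: slides a window with as many tokens as `name`
    over the text and scores each window with SequenceMatcher.ratio().
    """
    name_norm = name.lower()
    tokens = text.lower().split()
    width = len(name_norm.split())
    best, best_score = "", 0.0
    for i in range(len(tokens) - width + 1):
        window = " ".join(tokens[i:i + width])
        score = SequenceMatcher(None, name_norm, window).ratio()
        if score > best_score:
            best, best_score = window, score
    return best, best_score


# Hypothetical ship name and OCR snippet ("m" misread as "rn").
ship_name = "Vrouw Maria"
ocr_text = "het schip vrouw rnaria is binnengekomen"
match, score = best_fuzzy_match(ship_name, ocr_text)
print(match, round(score, 2))
```

A similarity threshold would then decide whether a window counts as a mention; a suitable value depends on the OCR quality and would have to be tuned on annotated examples.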

This project is divided into two parts. In the first part, the main task is to search for ships using information from the “Noordelijke Monsterrollen” dataset as input data and to evaluate the relevance of the results by means of precision and recall, effectively linking ships to the relevant newspaper articles. In the second part, information extraction will be performed, depending on the data obtained in the first part.
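As a reminder of how the evaluation measures mentioned above are defined, precision is the fraction of retrieved articles that are relevant, and recall is the fraction of relevant articles that were retrieved. The following is a minimal sketch of that computation; the function name `precision_recall` and the article identifiers are hypothetical and do not come from the project data.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved items against a gold set."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical article identifiers for one ship query.
retrieved = {"a1", "a2", "a3", "a4"}   # articles returned by the search
relevant = {"a2", "a3", "a5"}          # articles a human judged relevant
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

In this toy case two of the four retrieved articles are relevant, and two of the three relevant articles were found.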

Finally, the goal is to link and enrich the existing collection with relevant information found in newspapers, which would otherwise require extensive annotation work and many man-hours to find. During this presentation, I will describe the problem statement of my research proposal, as well as the main task and goals. The context of this project and a general background on information extraction will be explained. Finally, the current planning of the project will be presented.