Platform for Automatic Co-reference Resolution for Linked Data


Title: A Platform for Automatic Co-reference Resolution for Linked Data
Status: finished
Master: Knowledge Technology and Intelligent Internet Applications
Student name: mpt410
Start date: 2010/07/26
End date: 2010/12/26
Supervisor: Paul Groth
Company: VU
Thesis: Media:Thesis.pdf
Poster: Media:Media:Posternaam.pdf


Thesis proposal: A Platform for Automatic Co-reference Resolution for Linked Data

[introduction] The co-reference problem within the Semantic Web is an ongoing research topic to which several projects have already devoted considerable effort. One of the main co-reference questions is: how do we know that objects which are stored in different locations and structured differently are semantically identical? An underexposed topic in this field, however, is the set of techniques for managing, publishing and using co-reference information: “Research has been and continues to be carried out in this field, developing systematic analysis and heuristic based approaches to identifying co-references in or between datasets, however techniques for managing, publishing and using co-reference information are lacking.” [1]

[problem description] Current co-reference management and publishing solutions all offer the features needed to perform a co-reference resolution: definition of the input dataset and the target datasets, specification of a co-reference algorithm, handling of the co-reference context and, finally, generation of the co-reference information. However, the average data publisher has neither the intention nor the skills to set up such a system, is not aware of which target datasets to choose, and does not know which co-reference algorithms work best for their situation.

An important step in solving this problem is to narrow the problem description down into sub-problems. In this thesis we therefore focus on one sub-problem: how to select available target datasets and how to process the resolution against them. Because the number of datasets in the Web of Data is growing exponentially, the question of how to determine candidate datasets for the co-reference resolution is becoming more urgent, as are the questions of how to increase the efficiency, speed and reliability of the resolution against the different candidate datasets. Without addressing these questions, the co-reference resolution service would:
- naively analyse all available datasets one by one
- naively analyse the co-reference of each individual triple within the selected dataset

Naively running the co-reference resolution against all available datasets is probably a reliable method, but it probably also increases the duration of the co-reference resolution. The exponentially growing number of datasets in the Web of Data also affects the scalability of the co-reference resolution. Inevitably, a longer resolution time decreases the efficiency of the resolution.
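To make the cost of the naive approach concrete, the following is a minimal sketch, not the resolution service used in this thesis. It assumes a toy data model in which a dataset is a mapping from entity URI to label, and it uses case-insensitive label equality as a placeholder matcher; the name `naive_resolution` and the data model are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical toy data model: a dataset maps an entity URI to a label.
Dataset = Dict[str, str]
Link = Tuple[str, str]


def naive_resolution(
    source: Dataset, targets: Dict[str, Dataset]
) -> Tuple[List[Link], int]:
    """Compare every source entity with every entity of every target dataset.

    Returns the co-reference links found and the number of pairwise
    comparisons performed.
    """
    links: List[Link] = []
    comparisons = 0
    for target in targets.values():  # every available dataset, one by one
        for s_uri, s_label in source.items():
            for t_uri, t_label in target.items():
                comparisons += 1
                # Placeholder matcher (assumption): case-insensitive label equality.
                if s_label.lower() == t_label.lower():
                    links.append((s_uri, t_uri))
    return links, comparisons
```

In this sketch the number of comparisons equals the size of the source dataset multiplied by the total number of entities in all target datasets, so every dataset added to the Web of Data adds work whether or not it is relevant to the publisher's data.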

[research question] The main research question is: “Given a dataset and a co-reference resolution service, how can we automatically determine available candidate datasets within the Web of Data and increase the efficiency of co-reference resolution against these candidate datasets?”

To guide the research towards an answer to this main research question, the following sub-questions should be answered:
1. Data-publisher input: how to describe the context of a dataset? The context of a dataset is important for determining the right candidate datasets.
2. Dataset selection: which premises are needed within the Web of Data before an automatic selection of available datasets can be introduced?
3. Dataset selection: how to automatically select available datasets?
4. Dataset selection: based on the context of the data publisher's dataset, which candidate datasets should be selected to perform a mapping to?
5. Co-reference resolution: based on the list of candidate datasets, how does an existing co-reference resolution service perform the resolution against these candidate datasets, and subsequently how can this performance be measured?
6. Co-reference resolution: based on the performance measurements of an existing co-reference resolution service, how can this performance be improved?
7. Which solution is needed to answer the above sub-questions successfully and with reliable results?
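As an illustration of what the dataset-selection sub-questions ask for, the sketch below ranks datasets by how much their context overlaps with the publisher's context. It is only one possible approach and rests on several assumptions that are not part of this proposal: a dataset context is represented as a set of topic keywords, a catalogue of available datasets with such keywords exists, and Jaccard similarity with an arbitrary threshold is used as the selection criterion. The names `jaccard` and `select_candidates` are hypothetical.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two keyword sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def select_candidates(
    context: Set[str],
    catalogue: Dict[str, Set[str]],
    threshold: float = 0.2,
) -> List[str]:
    """Return dataset names whose context overlaps enough with `context`.

    `catalogue` maps a dataset name to its topic keywords (assumed format).
    Results are ordered by decreasing similarity, ties broken by name.
    """
    scored = [(jaccard(context, topics), name) for name, topics in catalogue.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored if score >= threshold]
```

For example, with the context `{"music", "artists"}`, a music catalogue that shares both keywords would be ranked above a film catalogue that shares only `"artists"`, and a geographic catalogue with no shared keywords would be dropped.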

The whole process of answering the numbered sub-questions is depicted in the following picture (not reproduced here).

[assumptions] In order to answer the sub-questions described above, the following assumptions are made:
- The datasets conform to the Linked Data principles
- The data publisher is able to define the dataset context
- An existing co-reference resolution service is used to answer the above sub-questions
- The chosen co-reference resolution service should perform the resolution against physically separated datasets
- The chosen co-reference resolution service is based on instance matching
- The problem of how to select the right algorithm within the co-reference resolution service is outside the scope of this research
- The required premises on the Web of Data are simulated

[evaluation] The first part of the thesis is the simulation of the integration of a co-reference resolution service into the Web of Data. This simulation introduces the problem of how to perform co-reference resolution against a large collection of available datasets. This research assumes that the simulation represents the future Web of Data. The second part of the thesis is the improvement of the co-reference resolution process against a list of datasets. To evaluate this improvement, the performance measurements of an existing co-reference resolution service are compared with those of the improved solution on the same list of test datasets. These performance measurements are concrete quantitative values, e.g. the difference in execution time between the two co-reference resolution services and reliability values that indicate whether all co-references are found. Once sub-question five (‘how to measure the performance’) has been answered, the complete methodology for evaluating the solution will become more concrete.
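The two measurements mentioned above could be collected as in the following sketch. It assumes that "reliability" is read as recall against a known set of correct co-reference links (a gold standard), which is one possible interpretation and not a definition taken from this proposal. The names `recall` and `measure` are hypothetical, and a resolver is modelled simply as a callable that returns links.

```python
import time
from typing import Callable, Iterable, Set, Tuple

Link = Tuple[str, str]


def recall(found: Set[Link], gold: Set[Link]) -> float:
    """Fraction of the gold-standard links that were found."""
    if not gold:
        return 1.0  # nothing to find, so nothing was missed
    return len(found & gold) / len(gold)


def measure(
    resolver: Callable[[], Iterable[Link]], gold: Set[Link]
) -> Tuple[float, float]:
    """Run a resolver once and return (elapsed seconds, recall)."""
    start = time.perf_counter()
    found = set(resolver())
    elapsed = time.perf_counter() - start
    return elapsed, recall(found, gold)
```

Running `measure` on the existing service and on the improved solution with the same test datasets and the same gold standard would give one time difference and one pair of recall values to compare. A single run is noisy, so repeated runs would be needed for a fair timing comparison.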

[1] “Managing Co-reference on the Semantic Web”