Instance Matching in Geospatial Open Data
|Title:||Instance Matching in Geospatial Open Data|
|Master:||Knowledge Technology and Intelligent Internet Applications|
|Student name:||Rutger van Willigen|
|Second supervisor:||Bert Spaan|
|Second reader:||Paul Groth|
|Company:||Waag Society|
|Thesis:||Media:Thesis Rutger van Willigen.pdf|
Governments, municipalities and companies throughout the world are increasingly publishing their data openly. This is a giant step forward for developers, who can use this data to create new and dynamic applications. A problem that arises, however, is that the structure and representation of this data differ from source to source, due to a lack of data standards. Applications for open data therefore usually need to be tailored to the specific dataset(s) they use, and as a result they often cannot cope with other datasets or with changes in dataset structure.
In addition, distinct datasets can contain the same objects. Recognizing these equivalences (Instance Matching) is very useful, because it makes it possible to combine all information on an object from different datasets and to access this information at a central point. It is, however, difficult to determine whether two objects are the same or not.
The City Service Development Kit (CitySDK) project of Waag Society is a data portal that acts as a bridge between open data and developers. It aims to connect all objects from different data sources and to provide all this information through a central API with a fixed data structure. This way, developers can let their applications interact with the CitySDK API and filter out whatever information is relevant to them. The data representation and instance matching problems thereby become CitySDK's responsibility, lowering the bar for developers to create open data applications.
CitySDK specifically contains geospatial datasets, which means all objects in its datasets contain either an address or a geospatial reference (i.e. coordinates). Instance matching could become significantly easier using this data, as geospatial references are reliable location indicators. It is, however, difficult to cope with erroneous, noisy or incomplete data. Furthermore, two objects with the same geospatial reference can still be distinct. A system is needed that is able to interpret such information in order to perform accurate instance matching across these datasets.
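To make the two difficulties above concrete (coordinates may be noisy, and identical coordinates do not guarantee identical objects), the following is a minimal Python sketch of a candidate check that combines a distance tolerance with a name comparison. It is not the CitySDK implementation: the dictionary keys `lat`, `lon` and `name`, the function names, and the thresholds of 50 metres and 0.8 are all assumptions chosen for illustration.

```python
import math
from difflib import SequenceMatcher

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def name_similarity(a, b):
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_candidate_match(obj_a, obj_b, max_distance_m=50.0, min_name_sim=0.8):
    """Hypothetical rule: objects must be close together AND similarly named.

    The distance tolerance absorbs noisy coordinates; the name check
    separates distinct objects that share a location. Both thresholds
    are illustrative assumptions, not values from the thesis.
    """
    distance = haversine_m(obj_a["lat"], obj_a["lon"],
                           obj_b["lat"], obj_b["lon"])
    if distance > max_distance_m:
        return False
    return name_similarity(obj_a["name"], obj_b["name"]) >= min_name_sim
```

Fixed thresholds like these are exactly what breaks down on erroneous or incomplete data, which is why a system that can handle uncertainty is needed rather than a hard-coded rule.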
Instance matching has been the subject of many research projects. Most of this research, however, focuses on matching objects from datasets whose contents belong to the same domain (e.g. city names or medical ontologies). An extra dimension in CitySDK is that the object types in a new dataset can differ greatly from what is already present. It is a challenge to design a system that is able to a) cope with uncertainty, b) recognize the contents of new data and c) find object equivalences correctly. Another issue is that most instance matching is performed on RDF triple stores, whereas CitySDK runs on a PostgreSQL database.
2nd KIM presentation
The second KIM presentation describes the efforts taken to solve this problem. We have split the problem into two parts: an ontology matching step that aims to find similarity or equality between datasets using their attribute names, and an instance matching step for which an active learning loop is built, in which human feedback is used to learn object similarities or equalities based on the geospatial references of each object. We will discuss the setup, the experiments and the results for these two parts.
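As a rough illustration of what an active learning loop of this kind can look like, the Python sketch below learns a single distance threshold from human feedback. It is a deliberately simplified stand-in, not the loop built in the thesis: it assumes that each candidate pair is summarised by one distance in metres, that pairs below some unknown distance are matches and pairs above it are not, and that the reviewer's answers are noise-free. The names `learn_distance_bounds` and `ask_human` are hypothetical.

```python
def learn_distance_bounds(distances, ask_human, max_queries=20):
    """Narrow down a match/non-match distance boundary with few questions.

    distances: distance in metres for each candidate object pair.
    ask_human: callback standing in for the human reviewer; it receives
               a distance and returns True if that pair is a match.
    Returns (lo, hi, queries): the largest distance confirmed as a match,
    the smallest distance confirmed as a non-match, and the number of
    questions asked.
    """
    candidates = sorted(set(distances))
    lo, hi = float("-inf"), float("inf")
    queries = 0
    while queries < max_queries:
        # Only pairs strictly between the bounds are still uncertain.
        uncertain = [d for d in candidates if lo < d < hi]
        if not uncertain:
            break
        # Ask about the middle uncertain pair, so that each answer
        # rules out roughly half of the remaining uncertain pairs.
        query = uncertain[len(uncertain) // 2]
        if ask_human(query):
            lo = query
        else:
            hi = query
        queries += 1
    return lo, hi, queries
```

With seven candidate pairs and a reviewer who accepts every pair closer than 20 metres, for example `learn_distance_bounds([2, 5, 9, 14, 30, 80, 150], lambda d: d <= 20)`, the loop settles on the bounds 14 and 30 after three questions instead of seven. A realistic loop would have to tolerate contradictory feedback and use more than one feature per pair.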