POI extraction from crawled content
|has title::POI extraction from crawled content|
|Master:||project within::Technical Artificial Intelligence|
|Student name:||student name::Marina Boia|
|Second reader:||has second reader::Stefan Schlobach|
TomTom is the world’s leading provider of navigation and location solutions. TomTom Places helps users find destinations in their area, such as stores, companies and brands.
In doing so, TomTom Places aims to offer a set of points of interest (POIs) that is as complete and as informative as possible. The quantity, quality and freshness of this data clearly influence the user experience.
The web contains a wealth of information regarding points of interest. Typically, POI owners create their own websites, through which they provide information about their products and/or services, alongside contact information such as postal address, phone number and email address. However, these websites are created, updated and removed at a pace that makes manual management impossible. Moreover, because these websites do not usually follow a single predefined structure or format, it is difficult to extract the relevant information automatically.
The challenge, therefore, lies in extracting the relevant POI information automatically, independently of the structure/layout of the corresponding POI web pages and of whether these websites refer to a single or multiple points of interest.
This project will explore potential techniques for extracting POI information from crawled content. It will start with the essential aspects, the postal address and the name, and then investigate ways of extracting additional information such as category, contact information (telephone number, website) and product information. These approaches will be tested on a representative data set of annotated crawled content from English-language websites, while keeping in mind that they also need to perform well when trained on data in other languages, containing address structures from different countries.
The project will deliver a proof-of-concept solution as well as a report covering relevant existing work on the topic, the approach taken, the results obtained and a discussion of these results. The discussion will analyze the cases in which the system succeeds in extracting the relevant POI information, as well as the cases in which it does not.
TomTom is the world’s leading provider of location and navigation solutions. Within this context, TomTom Places is a business unit aimed at enabling end users to find destinations in their surroundings, such as stores, restaurants and hotels. Clearly, the end user’s experience is affected by the quality, quantity and freshness of this data. With this in mind, the ultimate goal is to have a set of points of interest that is as complete, accurate and informative as possible.
The Web contains a wealth of information about points of interest (POIs) that people might want to visit. It is common for POI owners to create their own websites for advertisement and awareness purposes. These websites typically contain some overview information about the services or products offered, as well as information on how the physical location can be reached (such as postal address and directions) and on how contact persons can be approached (such as telephone and fax numbers, e-mail addresses). The large number of such websites and the fast pace at which they are created, updated or removed make it impossible to search them manually for relevant point of interest information. The logical conclusion is to build a tool that automatically extracts this sort of information from point of interest websites. However, this approach faces several difficulties. Perhaps the most salient one stems from the fact that these websites do not comply with a single, predefined structure. There are no predefined locations within a web page where we can expect to find the information we are interested in. The layout and display of information vary from website to website.
In view of this, the current project focuses on the two most essential aspects that define a point of interest: postal address and name. That is, this project aims to address the following research questions:
- Is it possible to effectively extract postal addresses from content of websites?
- Is it possible to effectively match a name of a point of interest to a previously extracted address?
- Is it possible to address these issues in a manner that is not fine-tuned for a specific country, but capable of extracting relevant information across several different countries?
To answer the proposed research questions, the following goals were set:
- Achieve and evaluate postal address extraction for US data
- Achieve and evaluate name extraction for US data
- Extend these solutions to one other country (France) and evaluate them
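To make the idea of a country-independent setup more concrete, the following is a minimal sketch of how country address formats could be kept as data rather than hard-coded logic. It is not the method used in this project: the `ADDRESS_PATTERNS` table, the `find_address_candidates` function and both regular expressions are hypothetical and deliberately simplified, and real address formats have many more variants than these patterns cover.

```python
import re

# Hypothetical, heavily simplified address formats, one per country code.
# Supporting another country means adding an entry, not changing the code.
ADDRESS_PATTERNS = {
    # US: house number, street, city, two-letter state, ZIP (optionally ZIP+4)
    "US": re.compile(
        r"\b\d{1,6}\s+[A-Za-z0-9.' ]+?,\s*[A-Za-z.' ]+?,"
        r"\s*[A-Z]{2}\s+\d{5}(?:-\d{4})?\b"
    ),
    # FR: house number (optionally bis/ter), street, five-digit postal code, city
    "FR": re.compile(
        r"\b\d{1,4}(?:\s?(?:bis|ter))?,?\s+[A-Za-zÀ-ÿ' -]+?,"
        r"\s*\d{5}\s+[A-Za-zÀ-ÿ' -]+"
    ),
}


def find_address_candidates(text, country):
    """Return substrings of `text` that match the address format of `country`."""
    pattern = ADDRESS_PATTERNS[country]
    return [match.group(0).strip() for match in pattern.finditer(text)]


if __name__ == "__main__":
    us_text = "Visit us at 1600 Pennsylvania Avenue NW, Washington, DC 20500 or call."
    fr_text = "Retrouvez-nous au 12 rue de la Paix, 75002 Paris."
    print(find_address_candidates(us_text, "US"))
    print(find_address_candidates(fr_text, "FR"))
```

Such patterns can only propose candidates. In particular, the French pattern has no reliable way to tell where the city name ends, which is one reason a format description alone is unlikely to be sufficient.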
The project will output a proof of concept that will illustrate the outcome of the methods employed for the two tasks: postal address extraction and name extraction. This will be backed up by a report documenting the entire process.
Knowledge of points of interest is crucial for TomTom Places, a business unit aimed at directing end users to destinations of interest. One source of information currently exploited is the Web, with its wide range of point of interest websites. This project therefore investigated to what extent relevant point of interest information can be automatically retrieved from such websites. Two issues were addressed: extraction of postal address information and extraction of name information.
Regarding address extraction, specific attention was given to solutions applicable to more than one country. The approach uses machine learning algorithms and country address formats to spot relevant information. Extraction was applied to both United States and German data, with comparable results. For the United States, precision scores reach as high as 0.87, while recall scores approach 0.9. For Germany, precision scores come close to 0.92, whereas recall scores reach 0.86.
With regard to name extraction, the topic was tackled at a conceptual level, by providing an overview of relevant literature and by outlining several directions for future work on the matter.
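For readers unfamiliar with the two metrics, the sketch below shows how precision and recall can be computed for an extraction task. It assumes that an extracted address counts as correct only if it exactly matches an annotated address. That matching criterion, the function name `precision_recall` and the sample data are illustrative assumptions, not details taken from the project.

```python
def precision_recall(extracted, annotated):
    """Compute precision and recall of extracted items against annotations.

    Assumes exact string matching between extracted and annotated items.
    """
    extracted, annotated = set(extracted), set(annotated)
    true_positives = len(extracted & annotated)
    # Precision: share of extracted items that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of annotated items that were found.
    recall = true_positives / len(annotated) if annotated else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up identifiers standing in for address strings.
    extracted = ["addr_a", "addr_b", "addr_c", "addr_d"]
    annotated = ["addr_a", "addr_b", "addr_c", "addr_e", "addr_f"]
    precision, recall = precision_recall(extracted, annotated)
    print(precision, recall)  # 3 of 4 extracted are correct, 3 of 5 annotated are found
```

A looser criterion, such as partial overlap between extracted and annotated address spans, would yield different scores for the same output.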