Extracting Semantic Information from PowerPoint

From Master Projects

Title: PowerPoint Semantics: Extracting semantic content from PowerPoint presentations
Status: finished
Student name: Joserelda Boon
Start date: 2013/06/04
End date: 2013/10/29
Supervisor: Paul Groth
Second reader: Laura Hollink
Company: Vrije Universiteit Amsterdam
Thesis: Media:Master_Thesis-JBOON-Oct2013.pdf
Poster: Media:Thesis app workflow - New Page (2).png




Several online platforms are emerging that aim to build a somewhat structured knowledge base of the PowerPoint presentations available on the web. Retrieval and indexing of PowerPoint documents in search queries relies mainly on the meta-information used to describe the contents on these platforms and, in some cases, on the content of the slideshow itself. Inconsistencies in this information, or failure by humans to provide it, therefore threaten the recall of PowerPoint documents in search engine queries. Ideally, when meta-information is omitted or erroneous, precision and recall for a given slideshow would not be significantly affected, because retrieval would rely on a semantic representation of the PowerPoint contents rather than on human-attributed meta-information.

This article proposes an experimental (semi-automated) approach that extracts relevant PowerPoint content into textual format and converts it into a semantic representation of the document. Two existing techniques are additionally employed to show that the semantic representations produced are appropriate for document classification and retrieval by machines, i.e. search engine bots. The experimental framework consists of three interdependent steps: 1) PowerPoint Content Extraction, 2) PowerPoint Document Classification and 3) PowerPoint Semantics Publication.

1) PowerPoint Content Extraction
A custom application coded in VBA performs the following tasks:
- Open the document and extract slide contents (titles, media, charts, etc.).
- Count the text, media and dynamic objects in each presentation and save these totals, together with the human classification, to an ARFF file used for document classification in step 2.
- Convert the extracted contents to HTML, including markup vocabulary, for publication and semantic representation of the slideshow in step 3.
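To make the ARFF hand-over between steps 1 and 2 concrete, the following is a minimal sketch of how per-presentation counts could be serialized. It is written in Python for illustration only; the original application is coded in VBA. The attribute names, the `SlideshowCounts` type, the `to_arff` helper and the sample counts are all hypothetical and do not come from the thesis.

```python
from dataclasses import dataclass


@dataclass
class SlideshowCounts:
    """Feature counts for one presentation (hypothetical field names)."""
    text: int      # number of text objects
    media: int     # number of media objects
    dynamic: int   # number of dynamic objects (tables, charts)
    label: str     # human-assigned class: "academic" or "corporate"


def to_arff(rows, relation="powerpoint"):
    """Serialize the feature rows as ARFF text that WEKA can load."""
    lines = [
        f"@RELATION {relation}",
        "",
        "@ATTRIBUTE text_objects NUMERIC",
        "@ATTRIBUTE media_objects NUMERIC",
        "@ATTRIBUTE dynamic_objects NUMERIC",
        "@ATTRIBUTE class {academic,corporate}",
        "",
        "@DATA",
    ]
    for r in rows:
        lines.append(f"{r.text},{r.media},{r.dynamic},{r.label}")
    return "\n".join(lines) + "\n"


# Illustrative values only, not data from the study.
rows = [
    SlideshowCounts(text=42, media=7, dynamic=1, label="academic"),
    SlideshowCounts(text=12, media=3, dynamic=9, label="corporate"),
]
arff_text = to_arff(rows)
print(arff_text)
```

Declaring the class as a nominal attribute with the two allowed values lets WEKA treat it as the prediction target without further conversion.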

2) PowerPoint Document Classification
This step uses the ARFF output from step 1 to classify each slideshow as an academic or a corporate document, based on the difference in the number of media, text and dynamic objects it contains. A larger share of dynamic objects (tables, charts) leans towards the corporate class, while more text and/or media objects implies the academic class. Based on this presumed pattern, several classifiers are trained and tested in WEKA to assess whether a significant relationship exists between the features evaluated and the document class pre-assigned by humans. A chance-corrected strength of agreement of at least 80% between observers (human vs. machine) must be achieved to support the suggestion that humans and machines share a substantial classification pattern for PowerPoint documents.
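The chance-corrected agreement criterion can be illustrated with a short calculation. The sketch below assumes the measure is Cohen's kappa, the statistic WEKA reports as "Kappa statistic" in its evaluation output; the text above does not name the measure explicitly. The `cohen_kappa` helper and the label lists are hypothetical examples, not results from the study.

```python
from collections import Counter


def cohen_kappa(human, machine):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(human) != len(machine) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(human)
    # Observed agreement: fraction of documents where both raters agree.
    observed = sum(h == m for h, m in zip(human, machine)) / n
    # Expected agreement if both raters labelled independently at their
    # own class frequencies.
    h_counts = Counter(human)
    m_counts = Counter(machine)
    expected = sum(h_counts[c] * m_counts[c] for c in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical class throughout
    return (observed - expected) / (1.0 - expected)


# Illustrative labels only: ten documents, one disagreement.
human = ["academic"] * 5 + ["corporate"] * 5
machine = ["academic"] * 4 + ["corporate"] * 6
kappa = cohen_kappa(human, machine)
print(f"kappa = {kappa:.2f}")
meets_threshold = kappa >= 0.80 - 1e-9  # tolerance for float rounding
```

In this example the raters agree on 9 of 10 documents (observed agreement 0.9), while chance alone would yield 0.5, so kappa is (0.9 - 0.5) / (1 - 0.5) = 0.8, which sits exactly at the 80% threshold.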

3) PowerPoint Semantics Publication
The classes predicted in the previous step are manually added to the semantic representations (in HTML) produced in step 1, which are then published online for retrieval by machines. To give meaning to these representations, a domain-specific ontology is developed that describes the concept of PowerPoint documents and their relation to the Schema.org vocabularies for knowledge representation. Appendix 7.1 gives an extensive taxonomy of the classes to be mapped.
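As a rough illustration of what a published semantic representation might look like, the sketch below renders extracted content as HTML with Schema.org microdata. The choice of the `PresentationDigitalDocument` type and of the `genre` property to carry the predicted class are assumptions made for this example; the thesis defines its own domain-specific ontology, which is not reproduced here. The `render_semantic_html` helper and the sample values are hypothetical.

```python
from html import escape


def render_semantic_html(title, predicted_class, slide_titles):
    """Render one slideshow as HTML annotated with Schema.org microdata."""
    items = "\n".join(f"    <li>{escape(t)}</li>" for t in slide_titles)
    return (
        '<div itemscope '
        'itemtype="https://schema.org/PresentationDigitalDocument">\n'
        f'  <h1 itemprop="name">{escape(title)}</h1>\n'
        # Assumption: the predicted class is exposed through "genre".
        f'  <meta itemprop="genre" content="{escape(predicted_class)}">\n'
        "  <ol>\n"
        f"{items}\n"
        "  </ol>\n"
        "</div>\n"
    )


# Illustrative input only.
page = render_semantic_html(
    title="Linked Data <Basics>",
    predicted_class="academic",
    slide_titles=["Introduction", "RDF & SPARQL", "Conclusion"],
)
print(page)
```

Escaping every extracted string before it is placed in the markup keeps slide text containing characters such as `<` or `&` from breaking the published page.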