[This post is based on the BSc. Thesis of Kim van Putten (Computer Science, VU Amsterdam)]
As part of the Bachelor’s degree in Computer Science at VU Amsterdam, Kim van Putten wrote her bachelor thesis in the context of the DIVE+ project.
The DIVE+ demonstrator is an event-centric linked data browser which aims to provide exploratory search within a heterogeneous collection of historical media objects. In order to structure and link the media objects in the dataset, the events need to be identified first. Due to the size of the data collection, manually identifying events is infeasible and a more automatic approach is required. The main goal of the bachelor project was to find a more effective way to extract events from the data to improve linkage within the DIVE+ system.
The thesis focused on event extraction from radio news bulletins whose text content was extracted using optical character recognition (OCR). Data preprocessing was performed to remove errors from the OCR’ed data. A Named Entity Recognition (NER) tool was used to extract named events, and a pattern-based approach combined with NER and part-of-speech tagging tools was adopted to find unnamed events in the data. OCR errors in the data were found to degrade the performance of the NER tools, even after data cleaning.
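To make the idea of a pattern-based approach more concrete, here is a minimal sketch in Python. It is not the thesis implementation: the `Token` type, the tag names ("VERB", "PER", "LOC"), the specific pattern (first verb plus all named entities) and the English example sentence are all assumptions made for illustration. The input is assumed to have already been tagged by some NER and part-of-speech tool.

```python
from typing import List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    """Hypothetical pre-tagged token; tag names are illustrative assumptions."""
    text: str
    pos: str  # coarse part-of-speech tag, e.g. "VERB"
    ner: str  # named-entity label, or "O" when the token is not an entity


def group_entities(tokens: List[Token]) -> List[Tuple[str, str]]:
    """Merge consecutive tokens that share a named-entity label."""
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    label = "O"
    for t in tokens:
        if t.ner != "O" and t.ner == label:
            current.append(t.text)
        else:
            if current:
                entities.append((" ".join(current), label))
            current = [t.text] if t.ner != "O" else []
            label = t.ner
    if current:
        entities.append((" ".join(current), label))
    return entities


def extract_event(tokens: List[Token]) -> Optional[dict]:
    """Assumed pattern: the first verb plus all named entities form one event."""
    verb = next((t.text for t in tokens if t.pos == "VERB"), None)
    entities = group_entities(tokens)
    if verb is None or not entities:
        return None
    return {"verb": verb, "entities": entities}


# Made-up example sentence, tagged by hand for illustration.
bulletin = [
    Token("Queen", "PROPN", "PER"),
    Token("Juliana", "PROPN", "PER"),
    Token("opened", "VERB", "O"),
    Token("the", "DET", "O"),
    Token("bridge", "NOUN", "O"),
    Token("in", "ADP", "O"),
    Token("Rotterdam", "PROPN", "LOC"),
]
event = extract_event(bulletin)
print(event)
```

Note that a pattern of this kind yields at most one event candidate per text, since it does not model which entities belong to which verb.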
The results show that the proposed methodology improved upon the old event extraction method. The newly extracted events improved the searchability of the media objects in the DIVE+ system; however, they did not improve the linkage between objects in the linked data structure. Furthermore, the pattern-based method of event extraction was found to be too coarse-grained and only allowed for the extraction of one event per object. To achieve a finer granularity of event extraction, future research is needed to identify the relationships between Named Entities and verbs, and to determine which Named Entities and verbs together describe an event.