My visiting period here at the VU has come to an end. It was an interesting journey discovering natural language processing theories and techniques, the crowd-truth world, and how the human perspective gives a new insight into semantic interpretations.
During the Web&Media meeting on 27 October, I gave a talk summarizing the research pursued in these eight months. I showed how challenging it is to extract relevant entities from TV program descriptions, due to the broad range of topics they cover and the different formats they have. None of the existing tools for automatic Named-Entity Recognition and Classification is suitable for these data on its own. Therefore, an integration approach seems the appropriate way to reach the goal.
I illustrated the workflow established for extracting relevant entities from a text in the entertainment domain, which relies on several different annotators, as well as the issues that arise when integrating their outputs. In order to increase the coverage of the annotation task, metrics based on majority vote are combined with metrics established in crowd-truth evaluation for gold-standard creation. This approach should be able to capture cases that are typically cut off by majority-vote integration techniques, namely unique information and distributed agreement.
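To make the limitation of majority vote concrete, here is a minimal sketch and not the actual workflow from the talk. It assumes each annotator's output is simply a set of entity strings; the annotator names, the example entities, and the `integrate` function are all hypothetical. It separates the entities accepted by a strict majority from the ones proposed by a single annotator, which a plain majority vote would discard.

```python
from collections import Counter


def count_votes(annotations):
    """Count how many annotators proposed each entity.

    `annotations` maps a (hypothetical) annotator name to the set of
    entities it extracted from one text.
    """
    return Counter(
        entity
        for entities in annotations.values()
        for entity in set(entities)
    )


def integrate(annotations):
    """Split entities into majority-accepted and single-source ones."""
    n_annotators = len(annotations)
    votes = count_votes(annotations)
    # Strict majority: more than half of the annotators agree.
    majority = {e for e, v in votes.items() if 2 * v > n_annotators}
    # Proposed by exactly one annotator and rejected by the majority vote.
    single_source = {
        e for e, v in votes.items() if v == 1 and e not in majority
    }
    return majority, single_source


# Illustrative data only, not taken from the real experiments.
example = {
    "annotator_a": {"Amsterdam", "BBC", "Rembrandt"},
    "annotator_b": {"Amsterdam", "BBC"},
    "annotator_c": {"Amsterdam", "Night Watch"},
}

majority, single_source = integrate(example)
print(sorted(majority))       # entities kept by majority vote
print(sorted(single_source))  # candidates that majority vote would drop
```

The single-source set is only a list of candidates. Deciding which of them are actually relevant is where the additional crowd-truth style metrics would come in.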
Several features are computed in order to capture as many characteristics as possible that are useful for assessing the relevance of an entity. The human annotators' results, gathered through a crowdsourcing task, are used to collect positive and negative examples of relevance and, ultimately, will be used to evaluate the precision and recall of the entire system.
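As a reminder of what that final evaluation measures, the following is a small sketch under simplifying assumptions: the system output and the crowd-derived relevant entities are both plain sets of strings, and a match means exact string equality. The function name and the example data are hypothetical.

```python
def precision_recall(predicted, relevant):
    """Compute precision and recall of `predicted` against `relevant`.

    Both arguments are sets of entity strings; matching is by exact
    equality, which is a simplification.
    """
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative data only.
system_output = {"Amsterdam", "BBC", "Rembrandt"}
crowd_relevant = {"Amsterdam", "Rembrandt", "Night Watch", "Vermeer"}

precision, recall = precision_recall(system_output, crowd_relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three extracted entities are relevant, and two of the four relevant entities are found.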