We are happy to announce that our project exploring relation extraction from natural language has two extended abstracts accepted at the Collective Intelligence conference this summer! Here are the papers:
- Crowdsourcing Ambiguity-Aware Ground Truth: we apply the CrowdTruth methodology to collect data over a set of diverse tasks: medical relation extraction, Twitter event identification, news event extraction and sound interpretation. We show that capturing disagreement is essential for acquiring a high-quality ground truth. We achieve this by comparing the quality of the data aggregated with CrowdTruth metrics against majority vote, a method that enforces consensus among annotators. By applying our analysis over a set of diverse tasks we show that, even though ambiguity manifests differently depending on the task, our theory of inter-annotator disagreement as a property of ambiguity is generalizable.
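The contrast between consensus-enforcing aggregation and disagreement-aware scoring can be sketched in a few lines of Python. This is only a toy illustration (the relation labels and vote counts are made up, and the actual CrowdTruth metrics also weight workers and sentences), but it shows what majority vote throws away:

```python
from collections import Counter

def majority_vote(annotations):
    """Collapse worker labels to the single most frequent label."""
    return Counter(annotations).most_common(1)[0][0]

def label_scores(annotations):
    """Score each label by the fraction of workers who chose it,
    preserving the disagreement signal instead of discarding it."""
    counts = Counter(annotations)
    total = len(annotations)
    return {label: n / total for label, n in counts.items()}

# 10 workers annotate one sentence with a candidate medical relation
workers = ["treats"] * 6 + ["causes"] * 3 + ["none"] * 1
print(majority_vote(workers))   # "treats" -- consensus hides the ambiguity
print(label_scores(workers))    # {"treats": 0.6, "causes": 0.3, "none": 0.1}
```

Majority vote reports only "treats", while the score vector keeps the 60/30/10 split, which is exactly the ambiguity signal the paper argues is essential.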
- Disagreement in Crowdsourcing and Active Learning for Better Distant Supervision Quality: we present ongoing work on combining active learning with the CrowdTruth methodology to further improve the quality of distant supervision (DS) training data. We report the results of a crowdsourcing experiment run on 2,500 sentences from the open domain. We show that modeling disagreement can be used to identify interesting types of errors caused by ambiguity in the TAC-KBP knowledge base, and we discuss how an active learning approach can incorporate these observations to utilize the crowd more efficiently.
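As a rough sketch of how disagreement could drive an active learning loop, the snippet below (hypothetical relation labels and vote counts, not the actual experiment pipeline) ranks sentences by how close the crowd's vote split is to 50/50 and picks the most contested ones for further inspection:

```python
def relation_score(votes, relation):
    """Fraction of workers who selected the relation (toy sentence-relation score)."""
    return votes.count(relation) / len(votes)

def most_ambiguous(sentences, relation, k=2):
    """Pick the k sentences whose score is closest to 0.5 -- the ones the
    crowd disagrees on most, and hence the most informative to revisit."""
    return sorted(sentences,
                  key=lambda s: abs(relation_score(sentences[s], relation) - 0.5))[:k]

votes = {
    "s1": ["born_in"] * 9 + ["none"],       # clear positive
    "s2": ["born_in"] * 5 + ["none"] * 5,   # maximal disagreement
    "s3": ["none"] * 8 + ["born_in"] * 2,   # clear negative
    "s4": ["born_in"] * 4 + ["none"] * 6,
}
print(most_ambiguous(votes, "born_in"))  # ['s2', 's4']
```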
For those curious about the Big Data Europe technology stack who would rather watch videos than read descriptions and documentation, we have started a YouTube channel where BDE researchers explain the how, why and what of the BDE stack. Embedded below is a short clip of Hajira Jabeen explaining how BDE enables someone to get started with Big Data. More clips are available on the channel.
Source: Victor de Boer
Our paper “Social Network Analysis for Trust Prediction” by Davide Ceolin and Simone Potenza of KonnektID has been accepted at the IFIP Trust Management Conference 2017. This paper is the result of a Network Institute Voucher to establish a collaboration between our group and KonnektID, and it is also partly funded by the COMMIT Big Data Veracity project.
Abstract: From car rental to knowledge sharing, the connection between online and offline services is increasingly tightening. As a consequence, online trust management becomes crucial for the success of services run in the physical world. In this paper, we outline a framework for identifying social web users more inclined to trust others by looking at their profiles. We use user centrality measures as a proxy of trust, and we evaluate this framework on data from Konnektid, a knowledge-sharing social Web platform. We introduce four metrics for measuring trust, and the framework achieves an accuracy between 43% and 99%.
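As an illustration of using centrality as a trust proxy, here is a minimal degree-centrality computation on a toy undirected interaction graph. The graph and users are invented, and the paper's four trust metrics are not reproduced here:

```python
def degree_centrality(edges, user):
    """Normalized degree centrality: fraction of the other users this
    user is directly connected to (assuming an undirected graph)."""
    nodes = {u for e in edges for u in e}
    degree = sum(1 for e in edges if user in e)
    return degree / (len(nodes) - 1)

# toy interaction graph between four platform users
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
print(degree_centrality(edges, "a"))  # 1.0 -- connected to everyone else
print(degree_centrality(edges, "d"))  # ~0.33
```

Under the trust-proxy assumption, user "a" (connected to all others) would be ranked as more inclined to trust than the peripheral user "d".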
On February 24th 2017 the Kick-off meeting for the Linkflows project took place. The meeting was hosted by Vrije Universiteit Amsterdam. During this meeting, the partners involved in the project were introduced.
Linkflows is an innovation PhD project with two external contributors that introduces the timely topic of semantic publishing and scientific assessment, and links it to the existing research, collections and collaborations central to the Web & Media Group, e.g. linked data, crowdsourcing, quality assessment and multimedia collections.
The aim of the Linkflows project is to make scientific contributions on the Web, e.g. articles, reviews, blog posts, multimedia objects, datasets, individual data entries, annotations, discussions, etc., better valorized and efficiently assessed in a way that allows for their automated interlinking, quality evaluation and inclusion in scientific workflows.
The PhD candidate for this project is Cristina-Iulia Bucur. The daily supervisors are Tobias Kuhn and Davide Ceolin and co-promoter is Lora Aroyo.
The partners involved in the Linkflows project:
On the 10th of March “Narrativizing disruption”, a DIVE+-centered, CLARIAH-funded research pilot, was presented at the CLARIAH toog day. The pilot (2017-2018) focuses on the question of how exploratory search can help media researchers interpret disruptive media events as lucid narratives. Disruptive media events, such as terrorist attacks or environmental disasters, are difficult to interpret due to an inability to grasp the story. This poses problems for media scholars, who analyse how narratives construct different political, economic or cultural meanings around such events. Offering media scholars the ability to explore and create lucid narratives about media events therefore greatly supports their interpretative work.
This project studies how exploratory search can help to understand how ‘disruptive’ events are constructed as narratives across media and instilled with specific cultural-political meanings. It approaches this question by using CLARIAH components (DIVE+’s navigation and bookmarking pane) to examine how scholars use and create narratives to understand media events. Academic insights show how exploratory search supports narrative generation. Software-specific insights yield recommendations at the entity, interface and user level, provide starting points for media research, and inform recommendations for auto-generating narratives based on exploratory search practices.
Our ControCurator paper abstract titled “ControCurator: Understanding Controversy Using Collective Intelligence” has been accepted at Collective Intelligence 2017. In this paper we describe five aspects of controversy: time-persistence, emotion, multiple actors, polarity and openness. Using crowdsourcing, the ControCurator dataset of 31,888 controversy annotations was obtained for the relevance of these aspects to 5,048 Guardian articles. The results indicate that each of these aspects is a positive indicator of controversy, but also that there is a clear difference in their signal strength: most notably, emotion was found to be the strongest indicator, though all measured aspects correlate positively with controversy. These results suggest that the controversy model is accurate and useful for modeling controversy in news articles.
The full dataset with controversy annotations is available for download at https://github.com/ControCurator/controcurator-corpus/releases/tag/1.0
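To make the idea of combining aspect signals concrete, here is a hypothetical weighted-sum controversy score. The aspect values and weights below are invented for illustration (the only grounded choice is weighting emotion highest, since it was found to be the strongest indicator); this is not the model from the paper:

```python
def controversy_score(aspects, weights):
    """Weighted combination of per-aspect signals, each in [0, 1]."""
    total = sum(weights.values())
    return sum(weights[a] * aspects[a] for a in weights) / total

# hypothetical aspect signals for one article (all values in [0, 1])
aspects = {"time_persistence": 0.4, "emotion": 0.9,
           "multiple_actors": 0.7, "polarity": 0.6, "openness": 0.5}
# emotion weighted highest, reflecting its stronger signal
weights = {"time_persistence": 1, "emotion": 3,
           "multiple_actors": 1, "polarity": 2, "openness": 1}
score = controversy_score(aspects, weights)
print(score)
```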
On 7th of March the DIVE+ project was presented at Cross Media Café: Uit het Lab. DIVE+ is the result of a truly interdisciplinary collaboration between computer scientists, humanities scholars, cultural heritage professionals and interaction designers. In this project, we use the CrowdTruth methodology and framework in order to crowdsource events for the news broadcasts from The Netherlands Institute for Sound and Vision (NISV) that are published under open licenses in the OpenImages platform.
As part of the digital humanities effort, DIVE+ is also integrated, next to other media studies research tools, in the CLARIAH (Common Lab Research Infrastructure for the Arts and Humanities) research infrastructure, which aims to support media studies researchers and scholars by providing access to digital data and tools. In order to develop this project we work together with the eScience Center, which is also funding the DIVE+ project.
Check the slides!
Our demo of ControCurator titled “ControCurator: Human-Machine Framework For Identifying Controversy” will be shown at ICT Open 2017. In this demo the ControCurator human-machine framework for identifying controversy in multimodal data is shown. The goal of ControCurator is to enable modern information access systems to discover and understand controversial topics and events by bringing together crowds and machines in a joint active learning workflow for the creation of adequate training data. This active learning workflow allows a user to identify and understand controversy in ongoing issues, regardless of whether there is existing knowledge on the topic.
As a teaser for our upcoming ICT4D students, have a look at this nice video that André Baart made.
Source: Victor de Boer
Our paper “Harnessing Diversity in Crowds and Machines for Better NER Performance” (Oana Inel and Lora Aroyo) has been accepted for the ESWC 2017 Research Track. The paper is to be published together with the proceedings of the conference.
Over the last few years, information extraction tools have gained great popularity and brought significant improvements in extracting meaning from structured or unstructured data. For example, named entity recognition (NER) tools identify types such as people, organizations or places in text. However, despite their high F1 performance, NER tools are still prone to brittleness due to their highly specialized and constrained input and training data. Thus, each tool is able to extract only a subset of the named entities (NE) mentioned in a given text. In order to improve NE coverage, we propose a hybrid approach: we first aggregate the output of various NER tools and then validate and extend it through crowdsourcing. The results from our experiments show that this approach performs significantly better than the individual state-of-the-art tools (including existing tools that already integrate individual outputs). Furthermore, we show that the crowd is quite effective in identifying mistakes, inconsistencies and ambiguities in the currently used ground truth, and that crowdsourcing is a promising approach to gather ground truth annotations for NER that capture a multitude of opinions.
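The aggregation step of the hybrid approach can be sketched as a simple union of per-tool outputs. The tool names, spans and entity types below are invented, and a real pipeline would operate on token offsets rather than surface strings; the merged set is what would then be validated and extended by the crowd:

```python
def merge_ner_outputs(tool_outputs):
    """Union the (mention, type) pairs produced by several NER tools,
    so the merged set covers entities any single tool missed."""
    merged = set()
    for spans in tool_outputs.values():
        merged.update(spans)
    return merged

# hypothetical outputs of two NER tools on the same sentence
tool_outputs = {
    "tool_a": {("Barack Obama", "PERSON"), ("Hawaii", "PLACE")},
    "tool_b": {("Barack Obama", "PERSON"), ("United States", "PLACE")},
}
merged = merge_ner_outputs(tool_outputs)
print(sorted(merged))  # three distinct entity mentions across both tools
```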