Trip Report: Dagstuhl Seminar on Citizen Science

A month ago, I had the opportunity to attend the Dagstuhl Seminar Citizen Science: Design and Engagement. Dagstuhl is really a wonderful place. This was my fifth time there. You can get an impression of the atmosphere from the report I wrote about my first trip there. I have primarily been to Dagstuhl for technical topics in the area of data provenance and semantic data management as well as for conversations about open science/research communication.

This seminar was a great chance for me to learn more about citizen science and discuss its intersection with the practice of open science. There was a great group of people there covering the gamut from creators of citizen science platforms to crowd-sourcing researchers.

As usual with Dagstuhl seminars, it’s less about presentations and more about the conversations. There will be a report documenting the outcome and hopefully a paper describing the common thoughts of the participants. Neal Reeves took vast amounts of notes so I’m sure that this will be a good report :-). Here’s a whiteboard we had full of input:

2017-07-05 11.28.24.jpg

Thus, instead of trying to relay what we came up with (you’ll have to wait for the report), I’ll just pull out some of my own brief highlights.

Background on Citizen Science

There were a lot of good pointers on where to start understanding current thinking around citizen science. First, two tutorials from the seminar:

What do citizen science projects look like:

Example projects:

How should citizen science be pursued:

And a Book:

Open Science & Citizen Science

Claudia Göbel gave an excellent talk about the overlap of citizen science and open science. First, she gave an important reminder that science, particularly in the 1700s, was done as public demonstrations, walking us through the example painting below.

2017-07-04 11.23.02

She then looked at the overlap between citizen science and open science. Summarized below:

citizenopenscience.png

A follow-on discussion with some of the seminar participants led to input for a whitepaper that is being developed through the ECSA on Citizen & Open Science for Europe. Check out the preliminary draft. I look forward to seeing the outcome.

Questioning Assumptions

One thing that I left the seminar thinking about was the need to question my own (and my field’s) assumptions. This was really inspired by talking to Chris Welty and reflecting on his work with Lora Aroyo on the issues in human annotation and the construction of gold sets. Some assumptions to question:

  • What qualifications you need to have to be considered a scientist.
  • Interoperability is a good thing to pursue.
  • Openness is a worthy pursuit.
  • We can safely assume a lack of dynamics in computational systems.
  • That human performance is good performance.

Indeed, Marissa Ponti pointed to the example below and highlighted some of the potential ramifications of these citizen science projects, each of which looks positive at first blush.

2017-07-03 10.06.36

That being said, the ability to rapidly engage more people in the science system seems to be a good thing indeed. An assumption I’m happy to hold.

Random

Filed under: trip report Tagged: citizen science, dagstuhl, open science
Source: Think Links

Posted in Paul Groth, Staff Blogs

Identifying emotions in email with human-level accuracy

As part of the Master’s degree Business Analytics at the VU Amsterdam, Erwin Huijzer completed his master thesis at Anchormen:
“Identifying effective affective email responses; Predicting customer affect after email conversation”

When customers contact a company with regards to queries and complaints, often they prefer to use email. Handling these emails is a massive task for the Customer Support department. Automating email handling can help improve , reduce costs and shorten response time. However, awareness of customer emotion during the conversation is an important aspect in effective email handling.

In the thesis, sentiment analysis was used on incoming customer emails to determine the initial emotion of a customer. Furthermore, affect analysis was applied to predict the customer’s emotion after the response email from Customer Support. Both analyses were executed using supervised machine learning which trains computer models based on labelled data. This required manual labelling of a set of emails with sentiment (None, Neg, Pos, Mix) and emotions (Anger, Disgust, Fear, Joy, Sadness).

Manual labelling revealed that humans find it very difficult to determine emotions in email. Still, using majority vote, a reliable labelset could be determined. Applying machine learning (a voting ensemble of Random Forest and Neural Net) on the labelled data resulted in human-level accuracy for Anger and Joy. For Disgust, the model even significantly outperforms human annotation. Using the same voting ensemble and including SVM leads to human-level performance on Sentiment too. For both sentiment and emotions, the domain-specific models trained on a small set of 742 emails outperform a commercial model that was trained on millions of news sources.
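The majority-vote step used to build the labelset can be sketched as follows. This is a minimal illustration and not the code from the thesis: the function name `majority_label`, the example annotations, and the choice to return `None` when no label has a strict majority are all assumptions made here.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(labels: Sequence[str]) -> Optional[str]:
    """Return the label chosen by more than half of the annotators, else None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    # Require a strict majority so that ties are not resolved arbitrarily.
    return label if count > len(labels) / 2 else None


# Hypothetical annotations: three annotators per email.
annotations = {
    "email_1": ["Anger", "Anger", "Sadness"],
    "email_2": ["Joy", "Sadness", "Fear"],
}
labelset = {email: majority_label(votes) for email, votes in annotations.items()}
print(labelset)
```

Emails without a majority label (such as the hypothetical `email_2`) could be dropped or sent for another round of annotation; which of these the thesis did is not stated here.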

Machine learning to predict customer affect showed low performance. Still, the results are significantly better than the benchmarks. A more direct measurement of customer affect may, however, drastically improve performance.

The full thesis is available for download here. The presentation is available here.

Posted in Masters Projects

A Concentric-based Approach to Represent Topics in Tweets and News

[This post is based on the BSc. Thesis of Enya Nieland and the BSc. Thesis of Quinten van Langen (Information Science Track)]

The Web is a rich source of information that presents events, facts and their evolution across time. People mainly follow events through news articles or through social media, such as Twitter. The main goal of the two bachelor projects was to see whether topics in news articles or tweets can be represented in a concentric model where the main concepts describing the topic are placed in a “core”, and the less relevant concepts are placed in a “crust”. To answer this question, Enya and Quinten built on the research conducted by José Luis Redondo García et al. in the paper “The Concentric Nature of News Semantic Snapshots”.

Enya focused on the tweets dataset and her results show that the approach presented in the aforementioned paper does not work well for tweets. The model had a precision score of only 0.56. After a data inspection, Enya concluded that the high amount of redundant information found in tweets makes them difficult to summarise and makes it hard to identify the most relevant concepts. Thus, after applying stemming and lemmatisation techniques, data cleaning and similarity scores together with various relevance thresholds, she improved the precision to 0.97.

Quinten focused on topics published in news articles. When applying the method described in the reference article, Quinten concluded that relevant entities from news articles can indeed be identified. However, his focus was also to identify the most relevant events that are mentioned when talking about a topic. As an addition, he calculated a term frequency inverse document frequency (TF-IDF) score and an event-relation (temporal relations and event-related concepts) score for each topic. These combined scores determine the new relevance score of the entities mentioned in a news article. These changes improved the ranking of the events, but did not improve the ranking of the other concepts, such as places or actors.
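For readers unfamiliar with TF-IDF, the score can be sketched in a few lines. This is the textbook formulation, not necessarily the exact variant used in the thesis; the function name `tf_idf` and the toy documents are assumptions, and the event-relation score is not covered.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Compute a TF-IDF score for every term in every document.

    tf  = occurrences of the term in the document / document length
    idf = log(number of documents / number of documents containing the term)
    """
    n_docs = len(docs)
    # Count each term once per document to get its document frequency.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical tokenised articles.
docs = [["election", "vote", "election"], ["election", "storm"]]
scores = tf_idf(docs)
print(scores)
```

A term that appears in every document (here "election") gets a score of zero, which is why TF-IDF favours concepts that are specific to one article over those that are common across the collection.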

Below are the final presentations that the students gave to present their work:

A Concentric-based Approach to Represent News Topics in Tweets
Enya Nieland, June 21st 2017

The Relevance of Events in News Articles
Quinten van Langen, June 21st 2017

Posted in CrowdTruth, Projects

Elevator Annotator: Local Crowdsourcing on Audio Annotation

[This post is based on Anggarda Prameswari’s Information Sciences MSc. Thesis]

For her M.Sc. Project, conducted at the Netherlands Institute for Sound and Vision (NISV), Information Sciences student Anggarda Prameswari (pictured right) investigated a local crowdsourcing application to allow NISV to gather crowd annotations for archival audio content. Crowdsourcing and other human computation techniques have proven their use for collecting large numbers of annotations, including in the domain of cultural heritage. Most of the time, crowdsourcing campaigns are done through online tools. Local crowdsourcing is a variant where annotation activities are based on specific locations related to the task.

The two variants of the Elevator Annotator box as deployed during the experiment.

Anggarda, in collaboration with NISV’s Themistoklis Karavellas, developed a platform called “Elevator Annotator”, to be used on-site. The platform is designed as a standalone Raspberry Pi-powered box which can be placed, for example, in an on-site elevator. It features speech recognition software and a button-based UI to communicate with participants (see video below).

The effectiveness of the platform was evaluated in two different locations (at NISV and at Vrije Universiteit) and with two different modes of interaction (voice input and button-based input) through a local crowdsourcing experiment. In this experiment, elevator travellers were asked to participate. Agreeing participants were then played a short sound clip from the collection to be annotated and asked to identify a musical instrument.

The results show that this approach is able to achieve annotations with reasonable accuracy, with up to 4 annotations per hour. Given that these results were acquired from one elevator, this new form of crowdsourcing can be a promising method of eliciting annotations from on-site participants.

Furthermore, a significant difference was found between participants from the two locations. This indicates that indeed, it makes sense to think about localized versions of on-site crowdsourcing.

More information:

Source: Victor de Boer

Posted in Staff Blogs, Victor de Boer

Events panel at DHBenelux2017

At the Digital Humanities Benelux 2017 conference, the e-humanities Events working group organized a panel with the title “A Pragmatic Approach to Understanding and Utilizing Events in Cultural Heritage”. In this panel, researchers from Vrije Universiteit Amsterdam, CWI, NIOD, Huygens ING, and Nationaal Archief presented different views on Events as objects of study and Events as building blocks for historical narratives.

#DHBenelux #panel: understanding events #fullhouse @ChielvdAkker kicks off with #digital #hermeneutics for #interpretation #support pic.twitter.com/0j9kEAF8SG

— Lora Aroyo (@laroyo) July 5, 2017

The session was packed and the introductory talks were followed by a lively discussion. From this discussion it became clear that consensus on the nature of Events, or on what typology of Events would be useful, is not to be expected soon. At the same time, a simple and generic data model for representing Events allows for multiple viewpoints and levels of aggregation to be modeled. The combined slides of the panel can be found below. For those interested in more discussion about Events: a workshop at SEMANTICS2017 will also be organized and you can join!

Source: Victor de Boer

Posted in Staff Blogs, Victor de Boer

DIVE+ receives the Grand Prize at the LODLAM Summit in Venice

We are excited to announce that DIVE+ has been awarded the Grand Prize at the LODLAM Summit, held at the Fondazione Giorgio Cini this week. The summit brought together ~100 experts in the vibrant and global community of Linked Open Data in Libraries, Archives and Museums. It has been organised every two years since 2011. Earlier editions were held in the US, Canada and Australia, making the 2017 edition the first in Europe.

The Grand Prize (USD$2,000) was awarded by the LODLAM community. It is a recognition of how DIVE+ demonstrates the social, cultural and technical impact of linked data. The Open Data Prize (of USD$1,000) was awarded to WarSampo for its groundbreaking approach to publishing open data.

Fondazione Giorgio Cini. Image credit: Johan Oomen CC-BY

Five finalists were invited to present their work, selected from a total of 21 submissions after an open call published earlier this year. Johan Oomen, head of research at the Netherlands Institute for Sound and Vision, presented DIVE+ on day one of the summit. The slides of his pitch have been published, as well as the demo video that was submitted to the open call. Next to DIVE+ (Netherlands) and WarSampo (Finland), the finalists were Oslo public library (Norway), Fishing in the Data Ocean (Taiwan) and Genealogy Project (China). The diversity of the finalists is a clear indication that the use of linked data technology is gaining momentum. Throughout the summit, delegates have been capturing the outcomes of various breakout sessions. Please look at the overview of session notes and follow @lodlam on Twitter to keep track.

Pictured: Johan Oomen (@johanoomen) pitching DIVE+. Photo: Enno Meijers. 

DIVE+ is an event-centric linked data digital collection browser that aims to provide integrated and interactive access to multimedia objects from various heterogeneous online collections. It enriches the structured metadata of online collections with linked open data vocabularies, with a focus on events, people, locations and concepts that are depicted in or associated with particular collection objects. DIVE+ is the result of a truly interdisciplinary collaboration between computer scientists, humanities scholars, cultural heritage professionals and interaction designers. DIVE+ is integrated in the national CLARIAH (Common Lab Research Infrastructure for the Arts and Humanities) research infrastructure.

Pictured: each day experts shape the agenda for that day, following the OpenSpace format. Image credit: Johan Oomen (cc-by)

DIVE+ is a collaborative effort of the VU University Amsterdam (Victor de Boer, Oana Inel, Lora Aroyo, Chiel van den Akker, Susane Legene), Netherlands Institute for Sound and Vision (Jaap Blom, Liliana Melgar, Johan Oomen), Frontwise (Werner Helmich), University of Groningen (Berber Hagendoorn, Sabrina Sauer) and the Netherlands eScience Centre (Carlos Martinez). It is supported by CLARIAH and NWO.

The LODLAM Challenge was generously sponsored by Synaptica. We would also like to thank the organisers, especially Valentine Charles and Antoine Isaac of Europeana and Ingrid Mason of Aarnet for all of their efforts. LODLAM 2017 has been a truly unforgettable experience for the DIVE+ team.

Source: Victor de Boer

Posted in Staff Blogs, Victor de Boer

Getting down with LOD tools at the 2nd CLARIAH Linked Data workshop

[cross-post from clariah.nl]

On Tuesday 13 June 2017, the second CLARIAH Linked Data workshop took place. After the first workshop in September, which was very much an introduction to Linked Data for the CLARIAH community, we wanted to organise a more hands-on workshop where researchers, curators and developers could get their hands dirty.

The main goal of the workshop was to introduce relevant tools to novice as well as more advanced users. After a short plenary introduction, we therefore split up the group: for the novice users the focus was on tools that come with a graphical user interface, like OpenRefine and Gephi, whereas we demonstrated API-based tools to the advanced users, such as the CLARIAH-incubated COW, grlc, Cultuurlink and ANANSI. Our setup, namely to have the participants convert their own dataset to Linked Data and then query and visualise it, was somewhat ambitious as we had not taken into account all data formats or encodings. Overall, participants were able to get started with some data, and ask questions specific to their use cases.

It is impossible to fully clean and convert and analyse a dataset in a single day, so the CLARIAH team will keep investigating ways to support researchers with their Linked Data needs. For now, you can check out the CultuurLink slides and tutorial materials from the workshop and keep an eye out on this website for future CLARIAH LOD events.

Source: Victor de Boer

Posted in Staff Blogs, Victor de Boer

Trip Report: Language, Data and Knowledge 2017

Last week, I was at the first Language, Data and Knowledge Conference (LDK 2017) hosted in Galway, Ireland. If you show up at a natural language processing conference (especially someplace like LREC) you’ll find a group of people who think about and use linked/structured data. Likewise, if you show up at a linked data/semantic web conference, you’ll find folks who think about and use NLP. I would characterize LDK2017 as a place where that intersection of people can hang out for a couple of days.

The conference had ~80 attendees by my count. I enjoyed the setup of a single track, plenty of time to talk, and also really trying to build the community by doing things together. I also enjoyed the fact that there were 4 keynotes for just two days. It really helped give spark to the conference.

Here are some of my take-aways from the conference:

Social science as a new challenge domain

Antal van den Bosch gave an excellent keynote emphasizing the need for what he termed a holistic approach to language, especially for questions in the humanities and social science (tutorial here). This holistic approach takes into account the rich context that words occur in. In particular, he called out the notions of idiolect and sociolect: the ways words are understood and used by an individual and within a particular social group. He argued that understanding these computationally is key to driving tasks like recommendation.

I personally was interested in Antal’s joint work with Folgert Karsdorp (check out his github repos!) on Story Networks – constructing networks of how stories are told and retold. For example, how the story of Red Riding Hood has morphed and changed over time and what its key sources are. This reminded me of the work on information diffusion in social networks. This has direct bearing on how we can detect and track how ideas and technologies propagate in science communication.

I had a great discussion with the SocialAI team (Erica Briscoe & Scott Appling) from Georgia Tech about their work on computational social science. In particular, two pointers: the new DARPA next generation social science program to scale up social science research and their work on characterizing technology capabilities from data for innovation assessment.

Turning toward the long tail of entities

There were a number of talks that focused on how to deal with entities that aren’t necessarily popular. Bichen Shi presented work done at Nokia Bell Labs on entity mention disambiguation. They used Apache Spark to train 700,000 classifiers – one for every entity mention in Wikipedia. This allowed them to obtain much more accurate per-mention entity links. Note they used Gerbil for their evaluation. Likewise, Hendrik ter Horst focused on entity linking specifically targeting technical domains (i.e. MeSH & chemicals). During Q/A it was clear that straight-up gazetteering provides an extremely strong baseline in this task. Marieke van Erp presented work on fine-grained entity typing in Spanish and Dutch using word embeddings to classify hundreds of types.
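To make the gazetteer baseline concrete, here is a minimal sketch of dictionary-based entity linking with greedy longest-match lookup. It is an illustration only and not the system from any of the talks: the function name `gazetteer_link`, the `max_len` parameter, and the gazetteer entries with their identifiers are all made up for this example.

```python
from typing import Dict, List, Tuple


def gazetteer_link(tokens: List[str],
                   gazetteer: Dict[Tuple[str, ...], str],
                   max_len: int = 3) -> List[Tuple[int, int, str]]:
    """Greedy longest-match lookup of token spans in a gazetteer.

    Returns (start, end, entity_id) triples, with end exclusive.
    """
    links = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(t.lower() for t in tokens[i:i + n])
            if span in gazetteer:
                links.append((i, i + n, gazetteer[span]))
                i += n
                break
        else:
            i += 1  # no gazetteer entry starts at this token
    return links


# Hypothetical gazetteer: lower-cased surface forms mapped to made-up identifiers.
gazetteer = {
    ("acetylsalicylic", "acid"): "CHEM:0001",
    ("aspirin",): "CHEM:0001",
    ("acid",): "CHEM:0002",
}
tokens = "Patients received acetylsalicylic acid daily".split()
print(gazetteer_link(tokens, gazetteer))
```

Preferring the longest match means "acetylsalicylic acid" is linked as a whole rather than as the shorter entry "acid". In technical domains with curated vocabularies, this kind of exact lookup already covers many mentions, which is one plausible reason it makes such a strong baseline.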

Natural language generation from KBs is worth a deeper look

Natural language generation from knowledge bases continues apace. Kathleen McKeown’s keynote touched on this, in particular, her recent work on mining paraphrasal templates that combines both knowledge bases and free text. I was impressed with the work of Nina Dethlefs on using deep learning for generating textual descriptions from a knowledge base. The key insight was how to quickly generate systems to do NLG where the data was sparse using hierarchical composition. In googling around when writing this trip report I stumbled upon Ehud Reiter’s blog which is a good read.

A couple of nice overview slides

While not a theme, there were some really nice slides describing fundamentals.

From C. Maria Keet:

2017-06-20 10.09.40

From Christian Chiarcos/Bettina Klimek:

2017-06-20-11-09-34.jpg

From Sangha Nam:

2017-06-19 11.07.02

Overall, it was a good kick-off to a conference. Very well organized and some nice research.

Random Thoughts

Filed under: academia, linked data, trip report Tagged: #ldk2017, linked data, nlp, trip report
Source: Think Links

Posted in Paul Groth, Staff Blogs

Collective Intelligence 2017 – Trip Report

On June 15-16 the Collective Intelligence conference took place at New York University. The CrowdTruth team was present with Lora Aroyo, Chris Welty and Benjamin Timmermans. Together with Anca Dumitrache and Oana Inel we published a total of six papers at the conference.

Keynotes

The first keynote was presented by Geoff Mulgan, CEO of NESTA. He set the context of the conference by stating that there is a problem with technological development, namely that it only takes knowledge out of society and does not put it back in. Also, he made it clear that many of the tools we see today like Google Maps are actually nothing more than companies that were bought and merged together. This combination of things is what creates the power. He also defined what the biggest trends are in collective intelligence: observation (e.g. citizen generated data on floods), predictive models (e.g. fighting fires with data), memory (e.g. what works centers on crime reduction), and judgement (e.g. adaptive learning tools for schools). There are, however, a few open issues with collective intelligence: Who pays for all of this? What skills are needed for CI? What are the design principles of CI? What are the centers of expertise? These are all not yet clear. However, what is clear is that there is a new field emerging through combining AI with CI: Intelligence Design. We used to think systems resolve this intelligence, but actually we need to steer and design it.

In a plenary session there was an interesting talk on public innovation by Thomas Kalil. He defined the value of concreteness as things that happen when particular people or organisations take some action in pursuit of a goal. These actions are more likely to effect change if you can articulate who needs to do what. He said he would like to identify the current barriers to prediction markets and areas where governments could be a user and funder of collective intelligence. This can be achieved through connecting people that are working to solve similar problems locally, e.g. in local education. Then change can be driven realistically, by making clear who needs to do what. It was also noted, though, that people need to be willing and able for change to work.

Parallel Sessions

There were several interesting talks during the parallel sessions. Thomas Malone spoke about using contest webs to address the problem of global climate change. He claims that funding science can be both straightforward and challenging: for instance, government policy does not always correctly address the needs of a domain, and conflicts of interest may even exist. Also, it can be tough to convince the general public of the use of fundamental research, as it is not sexy. Digital entrepreneurship is furthermore something that is often overlooked. There are hard problems, and there are new ways of solving them. It is essential now to split the problems up into parts, solve each of them with AI, and combine them back together.

#CrowdTruth at @cicon17 presented by @cawelty #Crowdsourcing Ambiguity-aware #GroundTruth pic.twitter.com/9jio4GLHR4

— Lora Aroyo (@laroyo) June 15, 2017

Chris Welty presented our work on Crowdsourcing Ambiguity Aware Ground Truth at Collective Intelligence 2017.

Mark Whiting also presented his work on Daemo, a new crowdsourcing platform that has a self-governing marketplace. He stressed the fact that crowdsourcing platforms are notoriously disconnected from user interests. His new platform has a user-driven design, in order to get rid of the flaws that exist in, for instance, Amazon Mechanical Turk.

Plenary Talks

Daniel Weld from the University of Washington presented his work on argumentation support in crowdsourcing. Their work uses argumentation support in crowd tasks to allow workers to reconsider their answers based on the argumentation of others. They found this to significantly increase the annotation quality of the crowd. He also claimed that humans will always need to stay in the loop of machine intelligence, for instance to define what the crowd should work on. Through this, hybrid human-machine systems are predicted to become very powerful.

Hila Lifshitz-Assaf of NYU Stern School of Business gave an interesting talk on changing innovation processes. The process of innovation has changed from a lone inventor, to labs, to collaborative networks, and now into open innovation platforms. The main issue with this is that the best practices of innovation fail in the new environment. In standard research and development there is a clearly defined and selectively permeable boundary, whereas with open innovation platforms this is not the case. Experts can participate from inside and outside the organisation. Open innovation means managing undefined and constantly changing knowledge in which anyone can participate. For this to work, you have to change from being a problem solver to a solution seeker. It is a shift from thinking “the lab is my world” to “the world is my lab”. Still, problem formulation is key as you need to define the problems in ways that cross boundaries. The question always remains, what is really the problem?

Poster Sessions

In the poster sessions there were several interesting works presented, for instance work on real-time synchronous crowdsourcing using “human swarms” by Louis Rosenberg. Their work allows people to change their answers through the influence of the rest of the swarm of people. Another interesting poster was by Jie Ren of Fordham University, who presented a method for comparing the divergent thinking and creative performance of crowds with those of experts. We ourselves had a total of five posters covering both poster sessions, which were received well by the audience.

@8w @cawelty @laroyo presenting Part I of our #CrowdTruth posters with @oana_inel @anouk_anca at the @cicon17 #informationExtraction pic.twitter.com/1lOPFGC2Vp

— Lora Aroyo (@laroyo) June 15, 2017

Posted in CrowdTruth, Projects

ESWC 2017 – Trip Report

Between the 28th of May and the 1st of June 2017, the 14th Extended Semantic Web Conference took place in Portorož, Slovenia. As part of the CrowdTruth team and project, Oana Inel presented her paper written together with Lora Aroyo on the first day of the conference. More about the paper that was presented can be found in a previous post. On the last day of the conference, Lora was the keynote speaker.

The Semantic Web group at the Vrije Universiteit Amsterdam had other great presentations. During the Scientometrics Workshop Al Idrissou talked about the SMS platform that links and enriches data for studying science. During the poster and demo session people were invited to check SPARQL2Git: Transparent SPARQL and Linked Data API Curation via Git by Albert Meroño-Peñuela and Rinke Hoekstra. Furthermore, the Semantic Web group had a candidate paper for the 7-year impact award “OWL reasoning with WebPIE: calculating the closure of 100 billion triples”, by Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen and Henri Bal.

Keynotes

I’ll start by writing a couple of words about the keynotes, which this year covered a wide range of areas, domains and subjects. In the first keynote presentation at ESWC 2017, on Tuesday, Kevin Crosby, from RavenPack, stressed the importance of data as a factor in decision making for financial markets. In his talk entitled “Bringing semantic intelligence to financial markets”, he focused on the current issues related to data analytics in decision making: the lack of skills and expertise, the quality and completeness of data and the timeliness of data. However, the most important issue is the fact that although we live in the age of data, only around 29% of the decisions in the financial market are made based on data.

The second keynote speaker was John Sheridan, the digital director of The National Archives in the UK. While giving a nice overview of British history, he talked about how semantic technologies are used to preserve history at The National Archives, in a talk entitled “Semantic Web technologies for Digital Archives”. Nowadays, semantic technologies are widely used to make cultural heritage collections publicly available online. However, people still struggle to search and browse through archives without having the context of the data. As a take-home message, we need to work towards second-generation digital archives that should measure risks, provide trust evidence, redefine context, embrace uncertainty, and enable use and access.

On the last day of the conference Lora Aroyo gave her keynote presentation, “Disrupting the Semantic Comfort Zone”. Lora started her keynote by looking back into the history of the Semantic Web and AI and how her own journey embraced the changes along the way. Something was clear: humans were always in the centre and they still continue to be. The second part of the presentation focused on introducing the underlying idea of the CrowdTruth project. As a final note, I’ll leave here the following question from Lora: “Will the next AI winter be the winter of human intelligence or not?”

NLP & ML Tracks

Federico Bianchi presented during the ML track an approach that uses active learning to rank semantic associations. The problem is well known: there is an information overload in contextual KB exploration, and even for small amounts of text there is a lot of data to be considered. In order to determine which semantic associations are most interesting to users, Actively Learning to Rank Semantic Associations for Personalized Contextual Exploration of Knowledge Graphs defines a ranking function based on a serendipity heuristic, i.e., relevance and unexpectedness.
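As a rough illustration of what ranking by a serendipity heuristic could look like, the sketch below scores each association with a weighted combination of relevance and unexpectedness. This is an assumption made for illustration and not the ranking function from the paper: the weight `alpha`, the `[0, 1]` score ranges, the class and function names, and the example associations are all hypothetical, and the active learning component is left out entirely.

```python
from typing import List, NamedTuple


class Association(NamedTuple):
    name: str
    relevance: float       # assumed to lie in [0, 1]
    unexpectedness: float  # assumed to lie in [0, 1]


def serendipity(a: Association, alpha: float = 0.5) -> float:
    """Weighted combination of relevance and unexpectedness (illustrative only)."""
    return alpha * a.relevance + (1 - alpha) * a.unexpectedness


def rank(associations: List[Association], alpha: float = 0.5) -> List[Association]:
    """Order associations from most to least serendipitous."""
    return sorted(associations, key=lambda a: serendipity(a, alpha), reverse=True)


# Hypothetical candidate associations with made-up scores.
candidates = [
    Association("A", relevance=0.9, unexpectedness=0.1),
    Association("B", relevance=0.6, unexpectedness=0.7),
    Association("C", relevance=0.2, unexpectedness=0.3),
]
print([a.name for a in rank(candidates)])
```

With `alpha = 1.0` the ranking reduces to pure relevance; lowering `alpha` promotes associations that are less obvious but still related to the text.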

The paper “All that Glitters Is Not Gold – Rule-Based Curation of Reference Datasets for Named Entity Recognition and Entity Linking” by Kunal Jha, Michael Röder and Axel-Cyrille Ngonga Ngomo draws attention to the current gold standards and makes similar claims to the ones we presented in our paper: the gold standards do not share a common set of rules for annotating named entities, they are not thoroughly checked and they are not refined and updated to newer versions. Thus, the need for the EAGLET benchmark curation tool for named entities!

Using semantic annotations to provide better access to scientific publications is a subject that has recently caught the attention of many researchers. Sepideh Mesbah, PhD student at Delft University of Technology, presented “Semantic Annotation of Data Processing Pipelines in Scientific Publications”, a paper that proposes an approach and workflow for extracting semantically rich metadata from scientific publications, by classifying the content of scientific publications and extracting the named entities (objectives, datasets, methods, software, results).

Jose G. Moreno presented the paper “Combining Word and Entity Embeddings for Entity Linking” which introduces a natural idea for entity linking by using a combination of entity and word embeddings. The claims of the authors are the following: you shall know a word by the company it keeps and you shall know an entity by the company it keeps in a KB, word context by alignment, word/entity context by concatenation.

Social Media Track

The Social Media track started with a presentation by Hassan Saif, “A Semantic Graph-based Approach for Radicalisation Detection on Social Media”. The approach presented in the paper uses a semantic graph representation to discover patterns among pro- and anti-ISIS users on social media. Overall, pro-ISIS users tend to discuss religion, historical events and ethnicity, while anti-ISIS users focus more on politics, geographical locations and intervention against ISIS. The second presentation, “Crowdsourced Affinity: A Matter of Fact or Experience” by Chun Lu, took us into a different domain: a travel destination recommendation scenario based on user-entity affinity, i.e., the likelihood of a user being attracted by an entity (book, film, artist) or performing an action (click, purchase, like, share). The main finding of the paper was that, in general, a knowledge graph helps to assess the affinity more accurately, while a folksonomy helps to increase its diversity and novelty. The Social Media track had two papers nominated for best student research paper: the aforementioned paper and the paper “Linked Data Notifications” presented by Sarven Capadisli, Amy Guy, Christoph Lange, Sören Auer, Andrei Sambra and Tim Berners-Lee. The latter was also the winner!

Best student paper award of #eswc2017 goes to @csarven and @rhiaro for Linked Data Notifications pic.twitter.com/7eZauUW6n1

June 1, 2017

In-Use and Industrial Track

Social media was highly relevant for the In-Use track as well. The Swiss Armed Forces is developing a Social Media Analysis system that aims to detect events such as natural disasters and terrorist activity by performing semantic tweet analysis. If you want to know more, you can read the paper “ArmaTweet: Detecting Events by Semantic Tweet Analysis”. This track also had nominations for best in-use paper. The winning paper in this category was “smartAPI: Towards a More Intelligent Network of Web APIs”, presented by Amrapali Zaveri.

Won the best in-use paper award for our #smartAPI work! Congrats to all co-authors! #eswc2017 #api #FAIR pic.twitter.com/FKzAgwuzFU

— Amrapali Zaveri (@AmrapaliZ) June 1, 2017

Open Knowledge Extraction Challenge

During the Open Knowledge Extraction challenge, Raphaël Troncy presented the participating system ADEL, an adaptable entity extraction and linking framework, which was also the challenge-winning entry. The ADEL framework can be adapted to a variety of generic or specific entity types that need to be extracted, as well as to different knowledge bases to disambiguate against (such as DBpedia and MusicBrainz). Overall, this self-configurable system tries to solve a difficult problem with current NER tools, i.e., the fact that they are tailored only to specific data, scenarios and applications.

OKE Challenge winner @ #eswc2017 #oke2017 #benchmarking #bigdata #linkeddata #semanticweb #H2020 https://t.co/Uo4cWeFStS pic.twitter.com/mybaguTdOe

— Project HOBBIT (@hobbit_project) June 2, 2017

Workshops

On Monday, the second day of workshops, I attended two workshops: the 3rd International Workshop on Semantic Web for Scientific Heritage (SW4SH 2017) and Semantic Deep Learning (SemDeep-17), now in its first edition. During the SW4SH 2017 workshop, Francesco Beretta gave a detailed keynote entitled “Collaboratively Producing Interoperable Ontologies and Semantically Annotated Corpora”, in which he presented a couple of projects for digital humanities (symogih.org and the corpus analysis environment TXM, among others) and showed how linked (open) data, ontologies, automated tools for natural language processing and semantics are finding their place in the daily projects of humanities scholars. However, these tools, approaches and technologies are not fully embraced, as humanities scholars are seldom content with precision values of 90% and feel the urge to manually tweak the data until it looks perfect.

During SemDeep-17, Sergio Oramas presented the paper “ELMDist: A vector space model with words and MusicBrainz entities”. The article points out that it is still unclear how NLP and semantic technologies can contribute to Music Information Retrieval areas such as music and artist recommendation and similarity. The approach presented uses NLP to disambiguate the entities in musical texts and then runs the word2vec algorithm over this sense-level space. Overall, the results are promising, meaning that textual descriptions can be used to improve Music Information Retrieval. The last paper of the workshop, “On Semantics and Deep Learning for Event Detection in Crisis Situations”, was presented by Hassan Saif. As the title suggests, the paper tries to solve the problem of detecting events in crisis situations from social media, using Dual-CNN, a semantically-enhanced deep learning model. Although the model is successful in identifying the existence of events and their types, its performance drops significantly when identifying event-related information such as the number of people affected or total damages.
