Last week I was in Portorož, Slovenia to attend the 12th Extended Semantic Web Conference. I was there to present my paper Crowdsourcing Disagreement for Collecting Semantic Annotation during the PhD Symposium, but also to check out what new and interesting research is out in the Semantic Web community, particularly relating to crowdsourcing and natural language processing. Here are my impressions of the conference:
To start, a big thanks to Claudia d’Amato and Philippe Cudré-Mauroux for doing a great job with the organization! First, I think working with a mentor during the rebuttal phase significantly improved the quality of my paper; sometimes you just need that outside perspective. Secondly, it was a great idea to give a poster slot to participants in the symposium. Because of the timing, I think not many people show up at PhD Symposium discussions, and those who do are typically supervisors of the students involved. By having a slot in the poster session, we had the chance to really get involved with the main conference goers, and get a bigger audience for our work.
Overall, the accepted papers skewed more towards applied/in-use Semantic Web than core research on reasoning or standards. Much research seems to be done in closed domains, medical or industry, with significant reliance on domain experts. I guess this means the Semantic Web finally has mainstream recognition?
My presentation went quite well, even though I still feel nervous when speaking in public! We took some risks by forgoing the typical template with bullet points and lots of text, and I think this paid off in the end. I also had an interesting discussion with Marta Sabou on my work: it seems that people are interested in methodology and models for crowdsourcing with disagreement analysis. This seems to be the logical step after the experimental work I have been doing lately, where we try to show that crowdsourcing disagreement actually works in practice. The slides of my talk are here:
Dena Tahvildari also participated in the PhD Symposium, giving a talk on her project about how to implement semantics in laboratory notebooks. Getting scientists to change their lab habits is definitely ambitious work, but I think very important as well, in the spirit of open publishing.
— Anca Dumitrache (@anouk_anca) June 1, 2015
Machine Learning / NLP
ML was such a hot topic last week! In fact, all three keynotes touched on it in one way or another: Viktor Mayer-Schönberger discussed the importance of big data (and implicitly semantics) in improving the quality of ML methods, Lise Getoor talked about how statistics and soft logic models can be used in learning, and Massimo Poesio discussed the ambiguity in NLP gold standards (very relevant for CrowdTruth!). Many papers in the conference also dealt with ML topics. Still, it seems to me that a lot of these approaches are exploratory, and a bit superficial, in the way that they use semantic data. Fadi put it well:
— Fadi Maali (@fsheer) June 1, 2015
On the other hand, I think we have finally reached peak Named Entity Recognition! GERBIL, developed at Universität Leipzig and winner of best demo, tackles the ever more complex problem of picking which NER tool to use for your data. On their project description page, they perfectly capture the main problems in the field:
The idea of GERBIL emerged in September 2014 when a couple of articles released at the same time claimed to be state-of-the-art. Especially, those approaches were not easily comparable due to their heterogeneous set-up, dataset use and evaluation metrics.
Ricardo Usbeck, one of the GERBIL devs, had another interesting talk, this time on hybrid question-answering. Their system combines data from the linked (structured) and document (unstructured) Web to form the answer, sort of like IBM Watson does, but open source. I am curious to see where this idea will go, especially since I suspect a big bottleneck in Q&A systems is not the actual answer selection mechanism, but rather the ability to work with heterogeneous data. Maybe there is a possibility to use crowdsourced answers as another data source?
The conference also had a small, but noteworthy track on crowdsourcing. Seyi Feyisetan from Southampton talked about behaviour patterns in crowdsourcing workers for NER, and found that the input text and the annotation type can be responsible for uneven performance in workers. Mazin Alsarem and Tamara Bobic both presented interesting work on entity ranking with a crowdsourced gold standard (here and here), but I think both could have used more annotators per task, and a more in-depth analysis of what causes disagreement. I have not studied the task of entity ranking in much detail, but I would like to look into the FRanCo dataset to explore ways of modeling ambiguity.
Massimo Poesio’s keynote made the following points:
- ambiguity is an inherent feature in language;
- established methods for collecting ground truth produce unreliable data because they disregard ambiguity;
- crowdsourcing ground truth has the potential to capture ambiguity in language through measuring inter-annotator disagreement.
These findings came through his work with anaphora resolution, though for the most part he has worked with gamification rather than paid crowdsourcing. I remember looking into his Phrase Detectives game when preparing the literature review for my MSc thesis, but I will have to check if he has some papers out there specifically on disagreement analysis. Overall though, it’s so refreshing to see this perspective coming from computational linguistics!
Massimo Poesio: "In NLP we need to start thinking about disagreement more seriously." YES!! #eswc2015
— Anca Dumitrache (@anouk_anca) June 4, 2015
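To make the last point from the keynote a bit more concrete, here is a minimal sketch of one simple way to measure disagreement per item: the Shannon entropy of the labels that annotators gave it. This is only an illustration, not the method from the keynote and not the CrowdTruth metrics. The function name `label_entropy`, the item ids, and the toy annotations are all hypothetical.

```python
from collections import Counter
from math import log2


def label_entropy(labels):
    """Shannon entropy (in bits) of the labels given to one item.

    0.0 means every annotator chose the same label; higher values
    mean the annotators were more evenly split between labels.
    """
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    if len(counts) == 1:
        # Full agreement: no uncertainty in the label distribution.
        return 0.0
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())


# Hypothetical toy data: for each item, the antecedent that each of
# four annotators picked for an ambiguous pronoun.
annotations = {
    "item_1": ["the dog", "the dog", "the dog", "the dog"],
    "item_2": ["the dog", "the cat", "the dog", "the cat"],
}

scores = {item: label_entropy(labels) for item, labels in annotations.items()}
for item, score in scores.items():
    print(item, round(score, 3))
```

Under this sketch, an item with a high score is a candidate for being genuinely ambiguous rather than simply badly annotated, which is the kind of signal that is lost when labels are collapsed by majority vote.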
Sarven Capadisli is trying to get the Web Science community to publish Web-compliant research. His demo called This ‘Paper’ is a Demo was meant to highlight the possibilities for publishing outside of the LaTeX/PDF paradigm, with some performance art thrown in. This is important stuff if you are interested in (Linked) Open Data, and the change has to come from individual researchers. So please consider switching to HTML publishing! I already tried it for one paper; it was a fun exercise, and I hope to keep it up. Sarven has made available his code for publishing papers online (he even has ACM and Springer stylesheets).
As usual, VU University had some great presentations in the conference. Wouter Beek talked about how to fix common problems of semantic data with the LOD Laundromat, and Hamid Bazoobandi presented his work on optimizing dictionary compression for dynamic RDF data using Tries.
And finally, some travel highlights from Slovenia:
- I really enjoyed Piran, a beautiful fishing town with Venice-style classical Italian architecture.
- The seafood is to die for! Try Ribja Kantina in Portorož, or Pirat in Piran.
- Postojna cave is a famous tourist destination, but well worth the hype! Next time I will go for a speleology tour. They even have concerts in there!