[Reading club] Dance in the World of Data and Objects

This is the first post in a new series on our Semantic Web reading club. In this weekly reading club we discuss a research paper related to the Semantic Web, Human Computation or Computer Science in general. Every week, one group member selects and prepares a paper to discuss. This week it was my turn and I chose a paper from 2013: “Dance in the World of Data and Objects” by Katerina El Raheb and Yannis Ioannidis (full citation and abstract below). The paper presents the need for (OWL) ontologies for dance representation. A quite nice slide deck supporting the paper can be found here.

‘Dance’. CC-By (Teresa Alexander-Arab) 

Computer-interpretable knowledge representation for dance is something I have been thinking about for a while now. I am mostly interested in representations that actually match the conceptual level at which dancers and choreographers communicate, and in how these relate to low-level representations such as Labanotation. I am currently supervising two MSc students on this topic.

The paper by El Raheb and Ioannidis, and our discussion afterwards, outlined the potential uses of such a formal representation:

  1. Archiving and retrieval of dance. This is a more ‘traditional’ use of such representations in ICT for Cultural Heritage. An interesting effect of representing this using standard Semantic Web languages is that we can connect deep representations of choreographies to highly heterogeneous knowledge about, for example, dance or musical styles, locations, recordings, emotions, etc. An interesting direct connection could be to Albert Merono’s RDF MIDI representations.
  2. Dance analysis. With large amounts of data in this representation, we can support Digital Humanities research, both in distant reading and, potentially, in closer analysis of dance. Machine learning techniques could be of use here.
  3. Creative support. It would be very interesting to investigate to what extent representations of dance can be used to support the creative process of dancers and choreographers. We can think of pattern-based adaptations of choreographies.
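As a thought experiment, the two levels of representation could be sketched in a few lines of code. This is a hypothetical Python sketch, not the actual DanceOWL vocabulary; all class and field names are my own invention:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical two-level representation: low-level Labanotation-like
# symbols grouped into a named conceptual move, so that queries can
# target either level of abstraction.

@dataclass
class LabanSymbol:
    body_part: str   # e.g. "right_leg"
    direction: str   # e.g. "forward"
    level: str       # "low" | "middle" | "high"
    beats: float     # duration in beats

@dataclass
class Move:
    name: str        # the conceptual label a dancer would use
    symbols: List[LabanSymbol] = field(default_factory=list)

    def uses(self, body_part: str) -> bool:
        # Retrieval at the low level: does the move involve this body part?
        return any(s.body_part == body_part for s in self.symbols)

step = Move("step_forward", [
    LabanSymbol("right_leg", "forward", "middle", 1.0),
    LabanSymbol("left_arm", "backward", "middle", 1.0),
])
```

Archiving and retrieval (use 1 above) would then amount to querying either the conceptual labels or the low-level symbols; in an actual Semantic Web setting these would of course be RDF/OWL individuals rather than Python objects.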

Abstract: In this paper, we discuss the challenges that we have faced and the solutions we have identified so far in our currently on-going effort to design and develop a Dance Information System for archiving traditional dance, one of the most significant realms of intangible cultural heritage. Our approach is based on Description Logics and aims at representing dance moves in a way that is both machine readable and human understandable to support semantic search and movement analysis. For this purpose, we are inspired by similar efforts on other cultural heritage artifacts and propose to use an ontology on dance moves (DanceOWL) that is based on the Labanotation concepts. We are thus able to represent dance movement as a synthesis of structures and sequences at different levels of conceptual abstraction, which serve the needs of different potential users, e.g., dance analysts, cultural anthropologists. We explain the rationale of this methodology, taking into account the state of the art and comparing it with similar efforts that are also in progress, outlining the similarities and differences in our respective objectives and perspectives. Finally, we describe the status of our effort and discuss the steps we intend to take next as we proceed towards the original goal.


Source: Victor de Boer

Posted in Staff Blogs, Victor de Boer

Automatic interpretation of spreadsheets

When humans read a spreadsheet table, they look at both the table design and the text within the table. They interpret the table layout, and use background knowledge, to understand the meaning and context of the data in a spreadsheet table. In our research we teach computers to do the same. We describe our method in the paper “Combining information on structure and content to automatically annotate natural science spreadsheets”.
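As a rough illustration of the idea (not the method from the paper), a minimal sketch that combines a layout cue (row position) with a content cue (cell types) to label spreadsheet rows might look like this:

```python
# Minimal sketch: a row of non-numeric text appearing before the numeric
# rows is likely a header; text rows after the data are likely footnotes.
# The 0.5 threshold and the label set are illustrative assumptions.

def is_numeric(cell: str) -> bool:
    try:
        float(cell)
        return True
    except ValueError:
        return False

def label_rows(rows):
    labels = []
    seen_data = False
    for row in rows:
        numeric_ratio = sum(is_numeric(c) for c in row) / len(row)
        if numeric_ratio >= 0.5:
            seen_data = True
            labels.append("data")
        elif not seen_data:
            labels.append("header")
        else:
            labels.append("annotation")  # trailing text, e.g. footnotes
    return labels

sheet = [
    ["Species", "Count", "Year"],
    ["Parus major", "42", "2015"],
    ["Turdus merula", "17", "2016"],
]
print(label_rows(sheet))  # -> ['header', 'data', 'data']
```

The paper's actual approach combines many more structural and content features, plus background knowledge, but the interplay of the two cue types is the core intuition.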

Read more ›

Posted in Papers, Uncategorized

DIVE+ Submitted to LODLAM

Here’s the submission to the annual LODLAM challenge from the DIVE+ team. In this video, we introduce the ideas behind DIVE+ and take you for an exploratory swim in the linked media knowledge graph!


Source: Victor de Boer

Posted in Staff Blogs, Victor de Boer

Trip Report: Challenges in Extracting and Managing References

Workshop "Challenges in Extracting and Managing References" today. https://t.co/iYWfgoU5h8 excellent speakers @gesis_org #excitews2017 pic.twitter.com/k7mU1gDTMp

— Philipp Mayr (@Philipp_Mayr) March 30, 2017

At the end of last week, I was at a small workshop held by the EXCITE project on the state of the art in extracting references from academic papers (in particular PDFs). This was an excellent workshop that brought together people who are deep in the weeds of this subject, including, for example, the developers of ParsCit and CERMINE. While reference string extraction sounds fairly obscure, the task itself touches on many of the general challenges in making sense of the scholarly literature.

Begin aside: Yes, I did run a conference called Beyond the PDF 2 and have been known to tweet things like:

Data Narratives – from @yolandagil @dgarijov – computers should write the paper! https://t.co/clSiFUxYgp pic.twitter.com/pr0XLq7btJ

— Paul Groth (@pgroth) March 23, 2017

But there’s a lot of great information in papers, so we need to get our machines to read it. End aside.

You can roughly categorize the steps of reference extraction as follows:

  1. Extract the structure of the article (e.g. find the reference section).
  2. Extract the reference string itself.
  3. Parse the reference string into its parts (e.g. authors, journal, issue number, title, …).

Check out these slides from Dominika Tkaczyk that give a nice visual overview of this process. In general, performance is pretty good (~0.9 F1) for the reference parsing step, but the task gets harder when all steps are included.
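As a toy illustration of step 3, a single regex can parse a well-behaved reference string. Real systems such as ParsCit and CERMINE instead use sequence-labelling models (e.g. CRFs or, increasingly, neural networks), which is what makes them robust to the messy variation in actual references; the pattern and field names below are assumptions for this sketch only:

```python
import re

# Toy pattern assuming the shape "Authors. Year. Title. Venue." --
# a big simplification of what reference parsers actually handle.
REF = re.compile(
    r"(?P<authors>.+?)\.\s+"
    r"(?P<year>\d{4})\.\s+"
    r"(?P<title>.+?)\.\s+"
    r"(?P<venue>.+)\."
)

def parse_reference(s: str) -> dict:
    """Split a reference string into labelled fields (step 3 above)."""
    m = REF.match(s)
    return m.groupdict() if m else {}

ref = ("Michael Levin, Stefan Krawczyk, Steven Bethard, and Dan Jurafsky. "
       "2012. Citation-based bootstrapping for large-scale author "
       "disambiguation. Journal of the American Society for Information "
       "Science and Technology.")
print(parse_reference(ref)["title"])
```

The regex falls over as soon as a title contains a period or the author list uses a different convention, which is exactly why the field moved to learned models.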

There were three themes that popped out for me:

  1. The reading experience
  2. Resources
  3. Reading from the image

The Reading Experience

Min-Yen Kan gave an excellent talk about how text mining of the academic literature could improve the ability of researchers to come to grips with the state of science. He positioned the field as one where we have the groundwork and are working on building enabling tools (e.g. search, management, policies), but there’s still a long way to go in really building systems that give insights to researchers. As custodian of the ACL Anthology, he also spoke about trying to put these innovations into practice. Prof. Kan is based in Singapore but gave probably one of the best Skype talks I have ever been part of. Slides are below, but you should check it out on YouTube.

Another example of improving the reading experience was David Thorne‘s presentation on some of the newer things being added to Utopia docs – a souped-up PDF reader. In particular, the work on the Lazarus project, which, by extracting assertions from the full text of the article, allows one to traverse an “idea” graph alongside the “citation” graph. On a small note, I really like how the articles that are found can be traversed in the reader without having to download them separately: you can just follow the links. As usual, the Utopia team wins the “we hacked something really cool just now” award by integrating directly with the EXCITE project’s citation lookup API.

Finally, on the reading experience front, Andreas Hotho presented BibSonomy, the social reference manager his research group has been operating for the past ten years. It’s a pretty amazing success: 23 papers, 160 papers that use the dataset, 96 million Google hits, and ~1000 weekly active users. Obviously, it’s a challenge running this user-facing software from an academic group, but it has clearly paid dividends. The main takeaway I had in terms of reader experience is that it’s important to identify what types of users you have and how the information they produce can help or hinder its application for other users (see this paper).


Resources

The interesting thing about this area is the number of resources available (both software and data) and how resources are also an outcome of the work (e.g. citation databases). Here’s a listing of the open resources that I heard called out:

This is not to mention more general sources of information like CiteSeer, arXiv or PubMed. What was also nice to see is how many systems are built on top of other software. I was also happy to see the following:

both @knmnyn & Roman Kern calling out the ScienceIE #semeval task https://t.co/FKQZSezski @ElsevierLabs @IAugenstein #excitews2017

— Paul Groth (@pgroth) March 30, 2017

An interesting issue was the transparency of algorithms and the quality of the resulting citation databases. Nees Jan van Eck from CWTS, developer of VOSviewer, gave a nice overview of trying to determine the quality of reference matching in the Web of Science. Likewise, Lee Giles gave a review of his work on author disambiguation for CiteSeerX, using an external source to compare that process. A pointer that I hadn’t come across was the work by Jurafsky and colleagues on author disambiguation:

Michael Levin, Stefan Krawczyk, Steven Bethard, and Dan Jurafsky. 2012. Citation-based bootstrapping for large-scale author disambiguation. Journal of the American Society for Information Science and Technology 63:5, 1030-1047.

Reading from the image

On the second day of the workshop, we broke out into discussion groups. In my group, we focused on understanding the role of deep learning in the entire extraction process. Almost all the groups are pursuing this.

Neural ParsCit is coming! – 21% error reduction for reference string parsing – from @knmnyn #excitews2017

— Paul Groth (@pgroth) March 30, 2017

I was thankful to both Akansha Bhardwaj and Roman Kern for walking us through their pipelines. In particular, Akansha is using scanned images of reference sections as her source and is starting to apply CNNs for semantic segmentation, with pretty good success.

We discussed the potential for doing the task completely from the ground up using a deep neural network. This was an interesting discussion, as current state-of-the-art techniques already use quite a lot of positional information for training. This information can be extracted from the PDF, and some of the systems already use the images directly. However, a lot of fiddling is needed to deal with the PDF contents, so maybe the image actually provides a cleaner place to start. But then we get back to the issue of resources and how to appropriately generate the necessary training data.

Random Notes

  • The organizers set up a Slack backchannel, which was useful.
  • I’m not a big fan of Skype talks, but the organizers got two important speakers that way and handled it well. When it’s the difference between having field leaders or not, it’s worth it.
  • EU projects can have a legacy – Roman Kern is still using code from http://code-research.eu where Mendeley was a consortium member.
  • Kölsch is dangerous but tasty.
  • More workshops should try the noon-to-noon format.



Filed under: academia, linked data, trip report Tagged: citation extraction, citations, deep learning, linked data, references
Source: Think Links

Posted in Paul Groth, Staff Blogs

Web and Media at ICT.OPEN2017

On 21 and 22 March, researchers from VU’s Web and Media group attended ICT.OPEN, the principal ICT research conference in the Netherlands. Here over 500 scientists from all ICT research disciplines and interested researchers from industry come together to learn from each other, share ideas and network. The conference featured some great keynote speeches, including one from Nissan’s Erik Vinkhuyzen on the role of anthropological and sociological research in developing better self-driving cars. Barbara Terhal from Aachen University gave a challenging but well-presented talk on robustness for quantum computing.

As last year, the Web and Media group was well represented this year through multiple oral presentations with accompanying posters and demonstrations:

  • Oana Inel, Carlos Martinez and Victor de Boer presented DIVE+. Oana did such a good job presenting the project in the main programme (see Oana’s DIVE+@ICTOpen2017 slides), through the demo, and in front of the poster that it was selected as Best Poster in the SIKS track.
  • Benjamin Timmermans, Tobias Kuhn and Tibor Vermeij presented the ControCurator project with a demonstration and poster presentation. The demo shows the ControCurator human-machine framework for identifying controversy in multimodal data.
  • Tobias Kuhn discussed “Genuine Semantic Publishing” in the Computer Science track on the first day. His slides can be found here. After the talk there was a very interesting discussion about the role of the narrative writing process and how it would relate to semantic publishing.
  • Ronald Siebes and Victor de Boer then discussed how Big and Linked Data technologies developed in the Big Data Europe project are used to deliver pharmacological web-services for drug discovery. You can read more in Ronald’s blog post.
  • Benjamin Timmermans and Zoltan Szlavik also presented the CrowdTruth demonstrator, which is shown in this short demonstrator video.
  • Sabrina Sauer presented the MediaNow project with a nice poster titled MediaNow – using a living lab method to understand media professionals’ exploratory search.


Posted in Events, Trip Reports

IBM Ph.D. Fellowship 2017-2018

Oana Inel received the IBM Ph.D. Fellowship for the second time. Her research focuses on data enrichment with events and event-related entities, combining machine power with crowd potential to identify their relevant dimensions, granularity and perspective. She performs her research and experiments in the context of the CrowdTruth project, a collaboration with the IBM Benelux Centre for Advanced Studies.

Posted in CrowdTruth, Projects

Hands on BDE Health at ICT.OPEN 2017

Last week, BigDataEurope was present at ICT.OPEN, the principal ICT research conference in the Netherlands, where over 500 scientists from all ICT research disciplines and interested researchers from industry come together to learn from each other, share ideas and network.

This was the first time that NWO, the Netherlands Organisation for Scientific Research, added a “Health” track, a recognition of the increased importance of ICT in the domains of diagnosis, drug discovery and health care. We presented a short paper written by Ronald Siebes, Victor de Boer, Bryn Williams-Jones, Kiera McNeice and Stian Soiland-Reyes covering the current state of the SC1 “Health, demographic change and well-being” pilot, which implements the Open PHACTS functionality on the Big Data Europe infrastructure.

We succeeded in demonstrating the ease of use and practical value of the SC1 pilot for researchers in the domain of drug discovery and for developers of Big Linked Data solutions, and we look forward to further strengthening these collaborations. The paper was accepted as a poster presentation and was also selected for an oral presentation in the “Health & ICT” track.



Posted in BigDataEurope, Ronald Siebes, Trip Reports


On March 14th, I presented a paper about the SIRUP project at IUI’17. IUI stands for Intelligent User Interfaces; it is an international conference where the Human-Computer Interaction (HCI) community meets the Artificial Intelligence (AI) community. It is a highly competitive venue, with an acceptance rate below 25%. Our paper introduces a model for serendipity in recommender systems based on curiosity theory. Here is the abstract of the paper:

In this paper, we propose a model to operationalise serendipity in content-based recommender systems. The model, called SIRUP, is inspired by Silvia’s curiosity theory, based on the fundamental theory of Berlyne, and aims at (1) measuring the novelty of an item with respect to the user profile, and (2) assessing whether the user is able to manage such a level of novelty (coping potential). The novelty of items is calculated with cosine similarities between items, using Linked Open Data paths. The coping potential of users is estimated by measuring the diversity of the items in the user profile. We deployed and evaluated the SIRUP model in a TV recommender use case using the BBC programs dataset. Results show that the SIRUP model allows us to identify serendipitous recommendations and, at the same time, to achieve 71% precision.

The paper is available here.
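As I read the abstract, the two SIRUP ingredients could be sketched roughly as follows. This is my own minimal interpretation, not the actual SIRUP implementation: the toy feature vectors simply stand in for the Linked Open Data paths, and the `margin` threshold is an invented parameter:

```python
import math

# Sketch: novelty = 1 - max cosine similarity between a candidate item
# and the profile items; coping potential is approximated by the average
# pairwise distance (diversity) of the profile items.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def novelty(item, profile):
    return 1.0 - max(cosine(item, p) for p in profile)

def diversity(profile):
    pairs = [(p, q) for i, p in enumerate(profile) for q in profile[i + 1:]]
    return sum(1.0 - cosine(p, q) for p, q in pairs) / len(pairs)

def serendipitous(item, profile, margin=0.1):
    # Recommend only novel items whose novelty the user can likely cope with.
    return novelty(item, profile) <= diversity(profile) + margin

profile = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 1.0, 0.0]]
candidate = [0.0, 0.7, 0.7]
print(novelty(candidate, profile), serendipitous(candidate, profile))
```

The intuition matches the abstract: a user with a diverse profile tolerates more novelty, so the same candidate item can be serendipitous for one user and merely alien for another.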


Posted in Uncategorized

Genuine Semantic Publishing

Here you can find the slides for my talk at ICT Open 2017:

Posted in Uncategorized


The DIVE+ team is present on the 21st and 22nd of March at the ICTOpen 2017 conference to present and showcase the latest developments of the tool. As part of these developments, DIVE+ is now also integrated in the CLARIAH (Common Lab Research Infrastructure for the Arts and Humanities) research infrastructure, next to other media studies research tools (the CLARIAH MediaSuite), which aim to support media studies researchers and scholars by providing access to digital data and tools. During the Meet the Demo sessions we also screencast the new DIVE+ interface, which provides support for the automatic generation of narratives and storylines. Below you can check the DIVE+ presentation.

For more insights, you can also check our short demo!

Posted in DIVE+