Dancing and Semantics

This post describes the MSc theses of Ana-Liza Tjon-a-Pauw and Josien Jansen. 

As a semantic web researcher, it is sometimes hard not to see ontologies and triples in aspects of my private life. In this case, through my contacts with dancers and choreographers, I have long been interested in exploring knowledge representation for dance. After a few failed attempts to get a research project funded, I decided to let enthusiastic MSc students have a go at continuing this exploration. This year, two Information Sciences students, Josien Jansen and Ana-Liza Tjon-a-Pauw, were willing to take up this challenge, with great success. With their background as dancers, they not only had the necessary domain knowledge but also access to dancers who could act as study and test subjects.

The questions of the two projects were therefore: 1) how can we model and represent dance in a sensible manner so that computers can make sense of choreographies, and 2) how can we communicate those choreographies to dancers?

Screenshot of the mobile choreography assistant prototype

Josien’s thesis addressed the first question, investigating to what extent choreographers can be supported by semi-automatic analysis of choreographies through the generation of new creative choreography elements. She conducted an online questionnaire among 54 choreographers. The results show that a significant subgroup is willing to use an automatic choreography assistant in their creative process. She further identified requirements for such an assistant, including the semantic levels at which it should operate and communicate with its end users. These requirements were used for the design of a choreography assistant, “Dancepiration”, which we implemented as a mobile application. The tool allows choreographers to enter (parts of) a choreography and uses multiple strategies for generating creative variations in three dance styles. Josien evaluated the tool in a user study in which we tested a) random variations and b) variations based on semantic distance in a dance ontology. The results show that the latter variant is better received by participants. We furthermore identified many differences between the dance styles in the extent to which the assistant supports creativity.
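As a rough illustration of the semantic-distance strategy (a sketch only: the move names, toy taxonomy and candidate set below are invented for illustration, and Dancepiration's actual dance ontology and generation strategies differ), the distance between two moves can be taken as the path length between them in a move taxonomy, and a variation generated by substituting the nearest alternative move:

```python
# Sketch: vary a choreography by replacing a move with its semantically
# closest alternative, where distance = path length in a toy move taxonomy.
import networkx as nx

# Hypothetical mini-taxonomy of dance moves (invented for illustration).
taxonomy = nx.Graph()
taxonomy.add_edges_from([
    ("move", "jump"), ("move", "turn"),
    ("jump", "jete"), ("jump", "sissonne"),
    ("turn", "pirouette"), ("turn", "chaines"),
])

def closest_variation(move, candidates):
    """Return the candidate with the smallest semantic distance to `move`."""
    distances = {c: nx.shortest_path_length(taxonomy, move, c)
                 for c in candidates if c != move}
    return min(distances, key=distances.get)

choreography = ["jete", "pirouette", "sissonne"]   # (part of) a choreography
candidates = ["sissonne", "chaines", "pirouette"]  # moves the tool may substitute

# Replace the first move by its nearest neighbour in the taxonomy.
variation = [closest_variation(choreography[0], candidates)] + choreography[1:]
print(variation)  # ['sissonne', 'pirouette', 'sissonne']
```

A random-variation baseline, as tested in condition a), would simply pick any candidate move instead of the semantically closest one.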

Four participants during the 2nd user experiment. From left to right, variations are presented through textual descriptions, 2D animations, 3D animations, and auditory instructions.

In her thesis, Ana-Liza dove deeper into the human-computer interaction side of the story. Where Josien had classical ballet and modern dance as background and focus, Ana-Liza looked at the Dancehall and Hip-Hop dance styles. For her project, Ana-Liza developed four prototypes that communicate pieces of computer-generated choreography to dancers through textual descriptions, 2D animations, 3D animations, and audio descriptions. Each of these presentation methods has its own advantages and disadvantages, so Ana-Liza conducted an extensive user study with seven domain experts (dancers). Despite the relatively small group of users, there was a clear preference for the 3D animations. Based on the results, Ana-Liza also designed an interactive choreography assistant (IDCAT).

The combined theses formed the basis of a scientific article on dance representation and communication that was accepted for publication at the renowned ACE entertainment conference, co-authored by us and co-supervisor Frank Nack.

You can find more information here:

Source: Victor de Boer

Posted in Staff Blogs, Victor de Boer

ABC-Kb Network Institute project kickoff

The ABC-Kb team, clockwise from top-left: Dana Hakman, Cerise Muller, Victor de Boer, Petra Bos

VU’s Network Institute has a yearly Academy Assistant programme in which small interdisciplinary research projects are funded. Within these projects, Master students from different disciplines get the opportunity to work under the supervision of VU staff members. As in previous years, I am participating as a supervisor in one of these projects, this time in collaboration with Petra Bos from the Applied Linguistics department. After finding two enthusiastic students, Dana Hakman from Information Science and Cerise Muller from Applied Linguistics, the project has just started.

Our project “ABC-Kb: A Knowledge base supporting the Assessment of language impairment in Bilingual Children” aims to support language therapists by (re-)structuring information about language development in bilingual children. Speech-language therapists and clinical linguists face the challenge of diagnosing children as young as possible, even when their home language is not Dutch. For these children, scores on standard (Dutch) language tests are not reliable indicators of a language impairment. If diagnosticians had access to information on language development in the home language of these children, this would be tremendously helpful in the diagnostic process.

This project aims to develop a knowledge base (KB) collecting relevant information on the specificities of 60 different home languages (normal and atypical language development), and on contrastive analyses of any of these languages with Dutch. To this end, we leverage an existing wiki: meertaligheidentaalstoornissenvu.wikispaces.com

Source: Victor de Boer

Posted in Staff Blogs, Victor de Boer

DIVE+ in Europeana Insight

This month’s edition of Europeana Insight features articles from this year’s LODLAM Challenge finalists, which include the winner, DIVE+. The online article “DIVE+: EXPLORING INTEGRATED LINKED MEDIA” discusses the DIVE+ user studies, data enrichment, exploratory interface and impact on the cultural heritage domain.

The paper was co-authored by Victor de Boer, Oana Inel, Lora Aroyo, Chiel van den Akker, Susane Legene, Carlos Martinez, Werner Helmich, Berber Hagendoorn, Sabrina Sauer, Jaap Blom, Liliana Melgar and Johan Oomen.

Screenshot of the Europeana Insight article

Source: Victor de Boer

Posted in Staff Blogs, Victor de Boer

SEMANTiCS2017

This year, I was conference chair of the SEMANTiCS conference, which was held 11-14 September in Amsterdam. The conference was, in my view, a great success, with over 310 visitors across the four days, 24 parallel sessions including academic and industry talks, six keynotes, three awards, many workshops and lots of cups of coffee. I will be posting more looks back soon, but below is a Storify item giving an idea of all the cool things that happened during the week.

Source: Victor de Boer

Posted in Staff Blogs, Victor de Boer

Event Extraction From Radio News Bulletins For Linked Data

[This post is based on the BSc. Thesis of Kim van Putten (Computer Science, VU Amsterdam)]

As part of the Bachelor’s degree in Computer Science at the VU Amsterdam, Kim van Putten conducted her bachelor thesis in the context of the DIVE+ project.

The DIVE+ demonstrator is an event-centric linked data browser that aims to provide exploratory search within a heterogeneous collection of historical media objects. In order to structure and link the media objects in the dataset, the events first need to be identified. Due to the size of the data collection, manually identifying events is infeasible and a more automatic approach is required. The main goal of the bachelor project was to find a more effective way to extract events from the data to improve linkage within the DIVE+ system.

The thesis focused on event extraction from radio news bulletins whose text content was extracted using optical character recognition (OCR). Data preprocessing was performed to remove errors from the OCR’ed data. A Named Entity Recognition (NER) tool was used to extract named events, and a pattern-based approach combining NER and part-of-speech tagging tools was adopted to find unnamed events in the data. OCR errors in the data were found to degrade the performance of the NER tools, even after data cleaning.
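To give an idea of what such a pattern-based combination of NER and part-of-speech tagging can look like (a sketch only: the thesis details the actual tools and patterns used, and the Dutch spaCy model and the sentence-level pattern below are assumptions for illustration):

```python
# Sketch: extract candidate events from bulletin text by pairing verbs with
# named entities that occur in the same sentence.
# Assumes spaCy and a Dutch model are installed:
#   pip install spacy && python -m spacy download nl_core_news_sm
import spacy

nlp = spacy.load("nl_core_news_sm")

def extract_candidate_events(text):
    """Return (verb, [entities]) pairs as crude event candidates."""
    doc = nlp(text)
    events = []
    for sent in doc.sents:
        entities = [(ent.text, ent.label_) for ent in sent.ents]
        verbs = [tok.lemma_ for tok in sent if tok.pos_ == "VERB"]
        # Pattern: a sentence containing at least one verb and one named
        # entity is taken to describe a (possibly unnamed) event.
        if entities and verbs:
            for verb in verbs:
                events.append((verb, entities))
    return events

bulletin = "Koningin Juliana opende gisteren de nieuwe brug in Rotterdam."
for verb, ents in extract_candidate_events(bulletin):
    print(verb, ents)
```

On OCR’ed text, such a pipeline is only as good as its input, which is exactly the sensitivity to OCR errors observed in the thesis.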

The results show that the proposed methodology improved upon the old event extraction method. The newly extracted events improved the searchability of the media objects in the DIVE+ system; however, they did not improve the linkage between objects in the linked data structure. Furthermore, the pattern-based method of event extraction was found to be too coarse-grained and only allowed the extraction of one event per object. To achieve a finer granularity of event extraction, future research is needed to identify the relationships between named entities and verbs, and which named entities and verbs together describe an event.

The full thesis is available for download here and the presentation here. Below is a poster that summarizes the main findings of the thesis.

Poster - Event Extraction from Radio News Bulletins

Posted in DIVE+

Discovering the underlying structure of controversial issues with topic modeling

[This post is by Tibor Vermeij about his Master project]

For his Master project in the Information Sciences programme at the Vrije Universiteit, Tibor Vermeij investigated an approach for discovering the structure of controversial issues on the web. The project was done in collaboration with the Controcurator project.

Detecting controversy computationally has been getting more and more attention. Because a lot of data is available digitally, controversy detection methods that make use of machine learning and natural language processing techniques have become more common. However, many studies try to detect the controversy of individual articles, blog posts or tweets; the relations between controversial entities on the web are not often explored.

To explore the structure of controversial issues, a combination of topic modeling and hierarchical clustering was used. Topic modeling was used to discover the content discussed in a set of Guardian articles, and the resulting topic distributions were then used as input for a hierarchical agglomerative clustering algorithm to find relations between the articles.
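As a rough illustration of this pipeline (a sketch only, on toy data: the thesis's actual corpus, models and parameters differ), topics can be learned with LDA and the resulting per-article topic distributions clustered hierarchically, for example with scikit-learn:

```python
# Sketch: LDA topic modeling followed by hierarchical agglomerative clustering.
# The toy corpus and the numbers of topics/clusters are illustrative choices.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import AgglomerativeClustering

articles = [
    "MPs clash over new vaccination policy for schools",
    "Debate on mandatory vaccination requirements continues",
    "Brexit negotiations stall over the border question",
    "EU and UK remain divided on Brexit terms",
]

# Bag-of-words representation of the articles.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(articles)

# Learn topics; each article becomes a distribution over topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Cluster articles hierarchically on their topic distributions.
clustering = AgglomerativeClustering(n_clusters=2)
labels = clustering.fit_predict(doc_topics)
print(labels)  # e.g. [0 0 1 1]: vaccination articles vs. Brexit articles
```

The linkage structure produced by the agglomerative clustering then gives the hierarchy over (clusters of) articles that the user study evaluated.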

The clusters were evaluated with a user study. A questionnaire was sent out that tested the performance of the pipeline on three aspects: the similarity of articles within a cluster, the cohesion of the clusters and the hierarchy, and the relation between the controversy of individual articles and the controversy of their corresponding clusters.

The questionnaire showed promising results. The approach can be used to get an indication of the general content of the articles. Articles within the same cluster were more similar than articles from different clusters, which means that the chosen clustering method resulted in coherent topics for the controversial clusters that were retrieved. Opinions on controversy itself varied considerably between participants, reinforcing the subjectivity of human controversy estimation. While the deviation between individual assessments was quite high, averaged rater scores were comparable to the calculated scores, which suggests a correlation between the controversy of articles within the same cluster.

The full thesis can be found here https://drive.google.com/file/d/0B6qAc8tgJOHUWVo4UURqWkZ1UDg/view?usp=sharing.

The presentation can be found here https://drive.google.com/open?id=14ELkY_9UxppL62uxLg5cAMk8yYKqmHKFQtbAznT3HX0.

Posted in Masters Projects

Share, repeat and verify Scientific Experiments with Software Containers

[This post is by Rogier Mars about his Master project]

During my years at the VU as an Information Sciences student, I was often asked to form a project group and work on some kind of problem. More often than not, I was the one to implement the technical part once we had agreed on a solution with the team. I always enjoyed this, mainly because of the high variety of the work. A small selection includes an information visualization of crime rates in the Netherlands; an analysis of the influence of weather on Dutch public transportation; using a cognitive system to enhance the performance of a tourist chatbot; programming an AI to compete in games against other student groups; and developing smart home technologies to aid in elderly care.

All of these prototypes and experiments consisted partly of software development and partly of the clever use of existing information technology to make life easier or to arrive at new insights and ideas. As you can imagine, during such a project you are totally sucked into it. There is a deadline to reach, a presentation to prepare, and the pressure to finish the project successfully is high. This pressure often results in sloppier working methods: for example, I did not include any documentation at all. I must confess: it would take me quite some time to dig into any of these projects again. Even if I could find the code and data, it is highly unlikely that I could get them working without some serious troubleshooting.

As it turns out, I’m not the only researcher with these problems. Even for recent ACM conference and journal papers that are backed by code and data, other researchers could not successfully repeat the experiments about fifty percent of the time. This has a negative effect on the efficiency of replication studies, which could result in lower-quality research. This had me thinking: if I could go back in time, could I do it better now? How would I do that, and how hard would it be? Could other researchers do this as well?

The software container platform Docker emerged in 2013 and is widely used by businesses throughout the world for web application hosting. With Docker you can easily create software containers that are portable to any other operating system that runs Docker (currently Windows, Linux and macOS). The literature describes how Docker can be used effectively for research:

“By encapsulating the computational environment of scientific experiments into software containers, you bypass many dependency issues and the need for precise documentation.”
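
As a minimal sketch of this idea (not the setup used in the project: the base image, pinned package and image tag below are invented for illustration), the computational environment of an experiment can be described in a Dockerfile and then built and run programmatically, for example with Docker's Python SDK:

```python
# Sketch: encapsulate an experiment's environment in a container and run it.
# Assumes Docker is installed and the Python SDK is available (pip install docker).
import io
import docker

# A hypothetical Dockerfile that pins the environment of the experiment.
DOCKERFILE = """
FROM python:3.6-slim
RUN pip install numpy==1.13.3
CMD ["python", "-c", "import numpy; print(numpy.random.RandomState(42).rand(3))"]
"""

client = docker.from_env()

# Build an image from the Dockerfile; anyone running Docker can rebuild the
# exact same computational environment from this description.
image, build_logs = client.images.build(
    fileobj=io.BytesIO(DOCKERFILE.encode("utf-8")),
    tag="repro-experiment:1.0",
)

# Run the experiment inside the container and capture its output.
output = client.containers.run("repro-experiment:1.0", remove=True)
print(output.decode("utf-8"))
```

The same image can be shared through a registry, so that a replicator only needs Docker, not the original machine's libraries or configuration.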

In my master project I applied Docker to several scientific experiments with the aim of increasing repeatability, and I evaluated this method with students and researchers in and around Amsterdam. In a controlled experiment, researchers worked with Docker on an example project on their own computers. Afterwards, I evaluated the method using existing scales and measures in questionnaires and compared it with the traditional approach; participants were divided equally over both methods. The focus lay on usability, perceived usefulness and perceived ease of use. How the Docker method worked exactly was harder for participants to grasp than how the existing method worked, but overall they deemed it more useful for repeating and verifying scientific experiments. The method was not perceived as more usable, but it was definitely more reliable. There was still a difference in perceived usefulness between new and existing users: it appears that if you understand how Docker works, you perceive it as more useful both in general and for research.

If I could do it again, I would use Docker to create a computational environment for my experiments, and I encourage other researchers to do the same. The responsibility for successful execution of the code would then shift from the replicator to the creator of the experiment. Eventually, this could make replication studies in computational science more fun and less time consuming.

Posted in Masters Projects

“New life for old media” to be presented at NEM Summit 2017

The extended abstract “Investigations into Speech Synthesis and Deep Learning-based colorization for audiovisual archives” has been accepted for publication at the NEM (New European Media) Summit 2017, to be held in Madrid at the end of November. This paper is based on Rudy Marsman’s thesis “Speech technology and colorization for audiovisual archives” and describes his research on using AI technologies in the context of the Netherlands Institute for Sound and Vision. Specifically, Rudy experimented with developing speech synthesis software based on a library of narrated news videos (using the voice of the late Philip Bloemendal) and with using pre-trained deep learning colorization networks to colorize archival videos.

You can read more in the draft paper [PDF]:

Rudy Marsman, Victor de Boer, Themistoklis Karavellas and Johan Oomen. New life for old media: Investigations into Speech Synthesis and Deep Learning-based colorization for audiovisual archives. Extended abstract proceedings of the NEM Summit 2017.

Source: Victor de Boer

Posted in Staff Blogs, Victor de Boer

Lisbon Machine Learning Summer School 2017 – Trip Report

In the second half of July (20th of July – 27th of July) I attended the Lisbon Machine Learning Summer School (LxMLS2017). As every year, the summer school is held in Lisbon, Portugal, at Instituto Superior Técnico (IST). The summer school is organized jointly by IST, the Instituto de Telecomunicações, the Instituto de Engenharia de Sistemas e Computadores, Investigação e Desenvolvimento em Lisboa (INESC-ID), Unbabel, and Priberam Labs.

Around 170 students (mostly PhD students, but also Master students) attended the summer school. It’s worth mentioning that only around 40% of applicants are accepted, so make sure you have a strong motivation letter! For eight days we learned about machine learning with a focus on natural language processing. Each day was divided into three parts: lectures in the morning, labs in the afternoon and practical talks in the evening (yes, quite a busy schedule).

Morning Lectures

In general, the morning lectures and the labs mapped onto each other really well: first learn the concepts, then put them into practice. During the labs we worked with Python and IPython Notebooks. Most of the labs had the base code already implemented and we just had to fill in some functions; however, for some of the lectures/labs this wasn’t that easy. I’m not going to discuss the morning lectures in detail, but I’ll mention the speakers and their topics (the slides are also available on the website of the summer school):

  • Mario Figueiredo: an introduction to probability theory, which proved to be fundamental for understanding the following lectures.
  • Stefan Riezler: an introduction to linear learners using an analogy with the perceptual system of a frog, i.e., given that the goal of a frog is to capture any object of the size of an insect or worm provided it moves like one, can we build a model of this perceptual system and learn to capture the right objects?
  • Noah Smith: gave an introduction to sequence models such as Markov models and hidden Markov models, and presented the Viterbi algorithm, which is used to find the most likely sequence of hidden states (see the sketch after this list).
  • Xavier Carreras: talked about structured prediction (i.e., given training data, learn a predictor that performs well on unseen inputs), using a named entity recognition task as the running example. He also discussed conditional random fields (CRFs), an approach that gives good results in such tasks.
  • Yoav Goldberg: talked about syntax and parsing, with many examples of their use in sentiment analysis, machine translation and other applications. Compared to the rest of the lectures, this one had much less math and was easy to follow!
  • Bhiksha Raj: gave an introduction to neural networks, more specifically convolutional neural networks (CNNs) and recurrent neural networks (RNNs). He started with the early models of human cognition: associationism (i.e., humans learn through association) and connectionism (i.e., the information is in the connections and the human brain is a connectionist machine).
  • Chris Dyer: discussed modeling sequential data with recurrent networks (but not only). He showed many examples related to language models, long short-term memories (LSTMs) and conditional language models, among others. However, even if it’s easy to think of tasks that could be solved by conditional language models, most of the time the data does not exist, a problem that seems to appear in many fields.
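
As a reminder of how the Viterbi algorithm works, here is a compact sketch on a toy HMM (the weather example and all numbers are illustrative, not material from the summer school):

```python
# Viterbi: most likely sequence of hidden states in an HMM, via dynamic programming.
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """obs: observation indices; start_p: (S,); trans_p: (S, S); emit_p: (S, O)."""
    n_states, T = len(start_p), len(obs)
    # delta[t, s]: log-probability of the best state path ending in s at time t.
    delta = np.zeros((T, n_states))
    backptr = np.zeros((T, n_states), dtype=int)
    delta[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, T):
        for s in range(n_states):
            scores = delta[t - 1] + np.log(trans_p[:, s])
            backptr[t, s] = np.argmax(scores)
            delta[t, s] = scores[backptr[t, s]] + np.log(emit_p[s, obs[t]])
    # Backtrack from the best final state.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Toy weather HMM: states {0: rainy, 1: sunny}, observations {0: walk, 1: shop, 2: clean}.
start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit = np.array([[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]])
print(viterbi([0, 1, 2], start, trans, emit))  # [1, 0, 0] = sunny, rainy, rainy
```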

Practical Talks

In the last part of the day we had practical talks on concrete applications of the techniques learnt during the morning lectures. On the first day we were invited to attend a panel discussion named “Thinking machines: risks and opportunities” at the conference “Innovation, Society and Technology”, where six speakers from the AI field (Fernando Pereira – VP and Engineering Fellow at Google, Luís Sarmento – CTO at Tonic App, André Martins – senior researcher at Unbabel, Mário Figueiredo – Instituto de Telecomunicações at IST, José Santos Victor – president of the Institute for Systems and Robotics at IST, and Arlindo Oliveira – president of Instituto Superior Técnico) discussed the benefits and risks of artificial intelligence and automatic learning. Here are a couple of thoughts:

  • Fernando Pereira: In order to enable people to make better use of technology, we need to make machines smarter at interacting with us and helping us.
  • André Martins pointed out an interesting problem: people spend time on solving very specific things, but these solutions are never generalized. But what if that is not possible?
  • Fernando Pereira: we build smart tools but only a limited number of people are able to control them, so we need to build the systems in a smarter way and make them responsible to humans.

Another evening hosted the Demo Day, an informal gathering that brings together a number of highly technical companies and research institutions, all with the aim of solving machine learning problems through technology. There were a lot of enthusiastic people to talk to, and many demos and products. I even discovered a new crowdsourcing platform, DefinedCrowd, which might soon start competing with CrowdFlower and Amazon Mechanical Turk.

Here are some other interesting talks that we followed:

  • Fernando Pereira – “Learning and representation in language understanding”: talked about learning language representations using machine learning. However, machine understanding of language is not a solved problem: learning from labeled data or learning with distant supervision may not yield the desired results, so it’s time to go implicit. He then introduced the work by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin: Attention Is All You Need. In this paper, the authors claim that you do not need complex CNN or RNN models; attention mechanisms are enough to obtain high-quality machine translation results.
  • Graham Neubig – “Simple and Efficient Learning with Dynamic Neural Networks”: dynamic neural network toolkits such as DyNet can be used as alternatives to TensorFlow or Theano. According to Graham, the advantages of such toolkits are that the API is closer to standard Python/C++ and that it’s easier to implement networks with varying structure; the disadvantages are that it’s harder to optimize graphs (but still possible) and harder to schedule data transfer.
  • Kyunghyun Cho – “Neural Machine Translation and Beyond”: showed why sentence-level and word-level machine translation is not desirable: (1) it’s inefficient to handle the many morphological variants of words, (2) we need good tokenisation for every language (not that easy), and (3) it cannot handle typos or spelling errors. Therefore, character-level translation is what we need, because it’s more robust to errors and handles rare tokens better (tokens which are actually not necessarily rare).
Posted in CrowdTruth, Projects

Trip Report: Dagstuhl Seminar on Citizen Science

A month ago, I had the opportunity to attend the Dagstuhl Seminar Citizen Science: Design and Engagement. Dagstuhl is really a wonderful place. This was my fifth time there. You can get an impression of the atmosphere from the report I wrote about my first trip there. I have primarily been to Dagstuhl for technical topics in the area of data provenance and semantic data management, as well as for conversations about open science and research communication.

This seminar was a great chance for me to learn more about citizen science and discuss its intersection with the practice of open science. There was a great group of people there, covering the gamut from creators of citizen science platforms to crowd-sourcing researchers.

As usual with Dagstuhl seminars, it’s less about presentations and more about the conversations. There will be a report documenting the outcome and hopefully a paper describing the common thoughts of the participants. Neal Reeves took vast amounts of notes, so I’m sure this will be a good report :-). We also filled a whiteboard with input.

Thus, instead of trying to relay what we came up with (you’ll have to wait for the report), I’ll just pull out some of my own brief highlights.

Background on Citizen Science

There were a lot of good pointers on where to start understanding current thinking around citizen science. First, two tutorials from the seminar:

What do citizen science projects look like:

Example projects:

How should citizen science be pursued:

And a Book:

Open Science & Citizen Science

Claudia Göbel gave an excellent talk about the overlap of citizen science and open science. First, she gave an important reminder that science, in particular in the 1700s, was often done as public demonstration, and she walked us through an example painting.

She then looked at the overlap between citizen science and open science.

A follow-on discussion with some of the seminar participants led to input for a whitepaper that is being developed through the ECSA on Citizen & Open Science for Europe. Check out the preliminary draft. I look forward to seeing the outcome.

Questioning Assumptions

One thing that I left the seminar thinking about was the need to question my own (and my field’s) assumptions. This was really inspired by talking to Chris Welty and reflecting on his work with Lora Aroyo on the issues in human annotation and the construction of gold sets. Some assumptions to question:

  • What qualifications you need to have to be considered a scientist.
  • Interoperability is a good thing to pursue.
  • Openness is a worthy pursuit.
  • We can safely assume a lack of dynamics in computational systems.
  • That human performance is good performance.

Indeed, Marissa Ponti pointed to an example and highlighted some of the potential ramifications that each of these (at first blush positive) citizen science projects could have.

That being said, the ability to rapidly engage more people in the science system seems to be a good thing indeed. An assumption I’m happy to hold.

Filed under: trip report Tagged: citizen science, dagstuhl, open science
Source: Think Links

Posted in Paul Groth, Staff Blogs