SEMANTiCS 2019 trip report

Last week, I attended the SEMANTiCS 2019 conference in Karlsruhe, Germany. This was the 15th edition of the conference, which brings together academia and industry around knowledge engineering and semantic technologies, and the good news was that this year's edition was the biggest ever, with 426 unique participants.

Closing Session – The Power Of #KnowledgeGraphs & #SemanticAI:#SemanticsConf 2019 Says Thank You For Your AMAZING Contributions, Participations and Sponsorships!#DataScience #BigData #ML #MachineLearning #SemanticWeb #Semantics #blockchain #AI #KI #IoT #tech #OpenData https://t.co/GBhui4ZCqt

— SEMANTiCS Conference (@SemanticsConf) September 11, 2019

I was not able to join the workshop day or the DBpedia day on Monday and Thursday respectively, but was there for the main programme. The first day opened with a keynote from Oracle's Michael J. Sullivan about hybrid knowledge management architecture and how Oracle is betting on semantic technology working in combination with data lake architectures.

The vision of FAIR and hope for the future @micheldumontier #SEMANTICS2019 #semanticsconf pic.twitter.com/Tm54yPgBt8

— CMP Content Services (@CmpContent) September 10, 2019

The second keynote, by Michel Dumontier of Maastricht University, covered the principles of FAIR data publishing and current advances in actually measuring the FAIRness of datasets.

Robin Keskisärkkä, @evabl444, Kelli Lind and @olafhartig win best Paper Award for RSP-QL* . Congratulations! #Semantics2019 pic.twitter.com/7lRF5jigfj

— Victor de Boer (@victordeboer) September 11, 2019

During one of the parallel sessions I attended the presentation of the eventual best paper winner: Robin Keskisärkkä, Eva Blomqvist, Leili Lind, and Olaf Hartig, RSP-QL*: Enabling Statement-Level Annotations in RDF Streams. This was a very nice talk for a very nice and readable paper. The paper describes how the RDF stream processing language RSP-QL can be extended with the principles of RDF*, which allow statements about statements without traditional reification. The paper nicely mixes formal semantics, an elegant solution, working code, and a clear use case and evaluation. Congratulations to the winners.
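
To give a flavour of the idea, here is my own toy sketch in Python (not the paper's RSP-QL* syntax or implementation, and the sensor reading and property names are made up): a statement-level annotation is simply a triple whose subject is itself a triple, instead of the four bookkeeping triples that traditional reification requires.

```python
# Toy illustration only: not RSP-QL*/RDF* syntax, just the underlying idea.
# A plain RDF triple is modeled as a (subject, predicate, object) tuple.
observation = ("sensor1", "hasHeartRate", 63)  # made-up sensor reading

# Traditional reification: four bookkeeping triples plus the annotation.
reified = [
    ("stmt1", "rdf:type", "rdf:Statement"),
    ("stmt1", "rdf:subject", "sensor1"),
    ("stmt1", "rdf:predicate", "hasHeartRate"),
    ("stmt1", "rdf:object", 63),
    ("stmt1", "ex:confidence", 0.95),  # the statement-level annotation
]

# RDF*-style: the statement itself is the subject of a single triple.
annotated = (observation, "ex:confidence", 0.95)

def embedded_statement(triple):
    """Return the quoted statement if the subject is itself a triple."""
    subject = triple[0]
    return subject if isinstance(subject, tuple) else None

assert embedded_statement(annotated) == ("sensor1", "hasHeartRate", 63)
```

The win is the same as in RDF*: one annotation costs one triple, not five, and the annotated statement stays directly queryable.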

Other award winners included the best poster, which went to our friends over at UvA.

Amsterdam success with a best poster win for Anthi Symeonidou, Viachaslau Sazonau and @pgroth! #semantics2019 pic.twitter.com/cwhrRSZYRc

— Victor de Boer (@victordeboer) September 11, 2019

The second day for me was taken up by the Special Track on Cultural Heritage and Digital Humanities, which consisted of research papers, use case presentations and posters relating to the use of semantic technologies in this domain. The programme was quite nice, as the embedded tweets below hopefully show.

@victordeboer openning special track on cultural heritage and digital humanities at #Semantics2019 pic.twitter.com/cK5UkHKxoE

— Artem Revenko (@revenkoartem) September 11, 2019

The #Semantics2019 special track on Cultural Heritage and #DigitalHumanities starts with a use case talk on KG-based @museodelprado project by Ricardo Alonso Mariana of GLOSS pic.twitter.com/H43NPGtSfM

— Victor de Boer (@victordeboer) September 11, 2019

up next: #LinkedSaeima which publishes Latvia's parliamentary debates as LOD. #semantics2019 pic.twitter.com/BJ6y5xrrel

— Victor de Boer (@victordeboer) September 11, 2019

The always amazing @vpresutti talks about knives and what we know about them in her #semantics2019 keynote on commonsense knowledge. pic.twitter.com/u52Vmv6y2Q

— Victor de Boer (@victordeboer) September 11, 2019

Victoria Eyharabide kicks off the last session of our Special Track with a talk about a #KnowledgeGraph on medieval #music and #iconography. #Semantics2019 @albertmeronyo #digitalhumanities @SorbonneParis1 pic.twitter.com/QFtlTT1tX4

— Victor de Boer (@victordeboer) September 11, 2019

#semantics2019 @heikopaulheim talks extracting numbers from Wikipedia abstracts to enrich #dbpedia. pic.twitter.com/kB0AUVmWRl

— Victor de Boer (@victordeboer) September 11, 2019

The winners of the #codingdavinci Hackathon close the #Semantics2019 special track on #digitalhumanities and #CulturalHeritage with #schmankerl Time machine! pic.twitter.com/n61RTToq5l

— Victor de Boer (@victordeboer) September 11, 2019

All in all, this year's edition of SEMANTiCS was a great one. I hope next year's will be even more interesting (I will be its general chair).


Source: Victor de Boer


Linked Art Provenance

In the past year, together with Ingrid Vermeulen (VU Amsterdam) and Chris Dijkshoorn (Rijksmuseum Amsterdam), I had the pleasure of supervising two VU students, Babette Claassen and Jeroen Borst, who participated in a Network Institute Academy Assistant project on art provenance and digital methods. The growing number of datasets and digital services around art-historical information presents new opportunities for conducting provenance research at scale. The Linked Art Provenance project investigated to what extent it is possible to trace the provenance of artworks using online data sources.

Caspar Netscher, the Lacemaker, 1662, oil on canvas. London: the Wallace Collection, P237

In this interdisciplinary project, Babette (Art Market Studies) and Jeroen (Artificial Intelligence) collaborated to create a workflow model, shown below, to integrate provenance information from various online sources such as the Getty Provenance Index. This included investigating the potential of automatically extracting structured data from these online sources.

The model was validated through a case study in which we investigated whether we could capture information from selected sources about an auction (1804) at which the paintings from the former collection of Pieter Cornelis van Leyden (1732-1788) were dispersed. An example work, the Lacemaker, is shown above. Interviews with various art historians validated the produced workflow model.

The workflow model also provides a basic guideline for provenance research and, together with the Linked Open Data process, can help answer relevant research questions for studies in the history of collecting and the art market.

More information can be found in the final report.


Source: Victor de Boer


Trip Report: SIGMOD/PODS 2019

It's not often that you get a major international conference in your area of interest around the corner from your house. Luckily for me, that just happened: from June 30th to July 5th, SIGMOD/PODS was hosted here in Amsterdam. SIGMOD/PODS is one of the major conferences on databases and data management. Before diving into the event itself, I really want to thank Peter Boncz, Stefan Manegold, Hannes Mühleisen and the whole organizing team (from @CWI_DA and the NL DB community) for getting this massive conference here:

#SIGMOD2019-Opening: This is the 2nd biggest #SIGMOD ever, there are 1050 other participants (up to now) // @SIGMOD2019 pic.twitter.com/GOBthVTbiw

— Benjamin Hättasch (@bhaettasch_cs) July 2, 2019

and pulling off things like this:

A successful #SIGMOD2019 reception at van Gogh museum last night by #MonetDB and @cwi_da, adding a healthy dose of culture to the DBMS community @ACTiCLOUD @FashionBrain1 @ExaNeSt_H2020 pic.twitter.com/J73vk7kSok

— MonetDB Team (@MonetDB) July 3, 2019

Oh, and really nice badges too. Good job!


Surprisingly, this was the first time I've been to SIGMOD. While I'm pretty acquainted with the database literature, I've always just hung out in different spots. Hence, I had some trepidation about attending: would I fit in? Who would I talk to over coffee? Would all the papers be about join algorithms or the implications of cache misses on some new tree data structure variant? Now obviously this is all pretty bogus thinking; just looking at the proceedings would tell you that. But there's nothing like attending in person to bust preconceived notions. Yes, there were papers on hardware performance and join algorithms (which were, by the way, pretty interesting) but there were many papers on other data management problems, many of which we are trying to tackle (e.g. provenance, messy data integration). Also, there were many colleagues that I knew (e.g. Olaf and Jeff). Anyway, perceptions busted! Sorry DB friends, you might have to put up with me some more 😀.

I was at the conference for the better part of six days – that's a lot of material – so I definitely missed a lot, but here are the four themes I took away from the conference.

  1. Data management for machine learning
  2. Machine learning for data management
  3. New applications of provenance
  4. Software & The Data Center Computer

Data Management for Machine Learning


Matei Zaharia (Stanford/Databricks) on the need for data management for ML

The success of machine learning has rightly changed computer science as a field. In particular, the data management community writ large has reacted trying to tackle the needs of machine learning practitioners with data management systems. This was a major theme at SIGMOD.

Really interesting – using a variety of knowledge to do weak supervision at scale – check out the lift #sigmod https://t.co/Pjiz2XyLBw pic.twitter.com/ahPqV3nvad

— Paul Groth (@pgroth) July 2, 2019

There were a number of what I would term holistic systems that helped manage and improve the process of building ML pipelines, including the data. Snorkel DryBell is a holistic system that lets engineers employ external knowledge (knowledge graphs, dictionaries, rules) to reduce the number of training examples needed to create new classifiers. Vizier provides a notebook data science environment backed fully by a provenance data management environment that allows data science pipelines to be debugged and reused. Apple presented their in-house data management system designed specifically for machine learning; from my understanding, all their data is completely provenance-enabled, ensuring that ML engineers know exactly what data they can use for which kinds of model building tasks.
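
To make the weak supervision idea concrete, here is a minimal toy sketch (my own, not Snorkel's API; Snorkel actually learns a generative model over the labeling functions rather than taking a simple majority vote, and the example labeling functions below are made up):

```python
# Toy weak supervision: several noisy labeling functions (encoding rules,
# dictionaries, etc.) vote on unlabeled examples, and their votes are
# combined into training labels. Here we simply take a majority vote.
ABSTAIN, SPAM, HAM = None, 1, 0

def lf_contains_offer(text):    # rule-based labeling function
    return SPAM if "offer" in text.lower() else ABSTAIN

def lf_contains_meeting(text):  # dictionary-style labeling function
    return HAM if "meeting" in text.lower() else ABSTAIN

def lf_many_exclamations(text):
    return SPAM if text.count("!") >= 3 else ABSTAIN

LFS = [lf_contains_offer, lf_contains_meeting, lf_many_exclamations]

def weak_label(text):
    """Majority vote over the labeling functions that did not abstain."""
    votes = [v for v in (lf(text) for lf in LFS) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

assert weak_label("Special offer!!! Click now!!!") == SPAM
assert weak_label("Agenda for tomorrow's meeting") == HAM
```

The point of systems like DryBell is that the "labeling functions" can wrap organizational knowledge resources, so classifiers can be bootstrapped without hand-labeling large training sets.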

I think the other thread here is the use of real world datasets to drive these systems. The example I found the most compelling was Alpine Meadow++, which uses knowledge about ML datasets (e.g. from Kaggle) to improve the suggestion of new ML pipelines in an AutoML setting.

On a similar note, I thought the work of Suhail Rehman from the University of Chicago on using over 1 million Jupyter notebooks to understand data analysis workflows was particularly interesting. In general, the notion is that we need to look at the whole model building and analysis problem holistically, inclusive of data management. This was emphasized by the folks behind the Magellan entity matching project in their paper Entity Matching Meets Data Science.


Machine Learning for Data Management

On the flip side, machine learning is rapidly influencing data management itself. The aforementioned Magellan project has developed a deep learning entity matcher. Knowledge graph construction and maintenance is heavily reliant on ML (see also the new work from Luna Dong & colleagues, which she talked about at SIGMOD). Likewise, ML is being used to detect data quality issues (e.g. HoloDetect).

ML is also impacting even lower levels of the data management stack.


Tim Kraska's list of algorithms that are being MLified

I went to the tutorial on learned data-intensive systems by Stratos Idreos and Tim Kraska. They overviewed how machine learning could be used to replace or augment parts of a whole database system, and when that might be useful.

It was quite good; I hope they put the slides up somewhere. The key notion for me is the idea of instance optimality: by using machine learning we can tailor performance to specific users and applications, whereas in the past this was not cost effective because of the programmer effort required. They suggested four ways to create instance-optimized algorithms and data structures:

  1. Synthesize traditional algorithms using a model
  2. Use a CDF model of the data in your system to tailor the algorithm
  3. Use a prediction model as part of your algorithm
  4. Try to learn the entire algorithm or data structure

They had quite the laundry list of recent papers tackling this approach and this seems like a super hot topic.
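
As a toy illustration of idea 2 (using a CDF model of the data), here is a sketch of a "learned" lookup over a sorted array: a linear approximation of the key distribution predicts a position, and a small local search corrects it. Real learned indexes train models and derive guaranteed error bounds; the linear model and the fixed error bound here are assumptions purely for illustration.

```python
import bisect

def build_linear_cdf(keys):
    """Approximate the CDF of sorted keys with a straight line and use it
    to predict the position of a key. A real system would fit a model."""
    lo, hi, n = keys[0], keys[-1], len(keys)
    def predict(key):
        frac = (key - lo) / (hi - lo) if hi > lo else 0.0
        return min(n - 1, max(0, int(frac * (n - 1))))
    return predict

def learned_lookup(keys, key, predict, max_err=8):
    """Predict a position, then correct within an assumed error bound."""
    pos = predict(key)
    left = max(0, pos - max_err)
    right = min(len(keys), pos + max_err + 1)
    i = bisect.bisect_left(keys, key, left, right)
    if i < len(keys) and keys[i] == key:
        return i
    return bisect.bisect_left(keys, key)  # fall back to full binary search

keys = list(range(0, 1000, 10))  # uniform keys, so the linear CDF is exact
predict = build_linear_cdf(keys)
assert learned_lookup(keys, 500, predict) == 50
```

On data whose distribution the model captures well, the search touches only a handful of elements instead of log(n), which is exactly the instance-optimality argument from the tutorial.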

Another example was SkinnerDB, which uses reinforcement learning to learn optimal join orderings on the fly. I told you there were papers on joins that were interesting.


New Provenance Applications

There was an entire session of SIGMOD devoted to provenance, which was cool. What I liked about the papers was that they presented several new applications of provenance, or optimizations for applications beyond auditing and debugging.

In addition to these new applications, I saw some nice new provenance capture systems:

Software & The Data Center Computer

This is less of a common theme but something that just struck me. Microsoft discussed their upgrade, or rather overhaul, of the database-as-a-service that they offer in Azure. Likewise, Apple discussed FoundationDB, the multi-tenancy database that underlies CloudKit.


JD.com discussed their new file system for dealing with containers and ML workloads across clusters with tens of thousands of servers. These are not applications that are hosted in the cloud; instead, they assume the data center. These applications are fundamentally designed with the idea that they will be executed on a big chunk of an entire data center. I know my friends in supercomputing have been doing this for ages, but I always wonder how to change one's mindset to think about building applications that big, and not only building them but upgrading and maintaining them as well.

Wrap-up

Overall, this was a fantastic conference. Beyond the excellent technical content, from a personal point of view, it was really eye opening to marinate in the community. From the point of view of the Amsterdam tech community, it was exciting to have an Amsterdam Data Science Meetup with over 500 people.

Excited that #SIGMOD2019 is meeting the local Amsterdam data science community @ams_ds pic.twitter.com/NuDUHxCegx

— Paul Groth (@pgroth) July 4, 2019

If you weren’t there, video of much of the event is available.

Random Notes

G. Gottlob presenting his 2009 PODS paper “A General Datalog-based Framework for Tractable Query Answering” which receives the Test of time award. One of the first papers I read when in my PhD Great to see all the theory & how they have taken it to practice w/ Vadalog #sigmod2019 pic.twitter.com/zlfLYi4gnc

— Juan Sequeda (@juansequeda) July 1, 2019


Source: Think Links


Remembering Maarten van Someren

Last week, while abroad, I received the very sad news that Maarten van Someren passed away. Maarten was one of the core teachers and AI researchers at Universiteit van Amsterdam for 36 years and for many people in AI in the Netherlands, he was a great teacher and mentor. For me personally, as my co-promotor he was one of the persons who shaped me into the AI researcher and teacher I am today.

Maarten van Someren at my PhD defense (photo by Jochem Liem)

Before Maarten asked me to do a PhD project under his and Bob Wielinga's supervision, I had known him for several years as UvA's most prolific AI teacher. Maarten was involved in many courses (many in machine learning) and in coordinating roles. I fondly look back on Maarten explaining decision trees, the A* algorithm and the Vapnik–Chervonenkis dimension. He was one of the staff members who really was a bridge between research and education, and he gave students the idea that we were actually part of the larger AI movement in the Netherlands.

After I finished my Master’s at UvA in 2003, I bumped into Maarten in the UvA elevator and he asked me whether I would be interested in doing a PhD project on Ontology Learning. Maarten explained that I would start out being supervised by both him and Bob Wielinga, but that after a while one of them would take the lead, depending on the direction the research took. In the years that followed, I tried to make sure that direction was such that both Bob and Maarten remained my supervisors as I felt I was learning so much from them. From Maarten I learned how to always stay critical about the assumptions in your research. Maarten for example kept insisting that I explain why we would need semantic technologies in the first place, rather than taking this as an assumption. Looking back, this has tremendously helped me sharpen my research and I am very thankful for his great help. I was happy to work further with him as a postdoc on the SiteGuide project before moving to VU.

In the last years, I met Maarten several times at shared UvA-VU meetings and I was looking forward to collaborations in AI education and research. I am very sad that I will no longer be able to collaborate with him. AI in the Netherlands has lost a very influential person in Maarten.


Source: Victor de Boer


Trip Report: ESWC 2019

From June 2 to 6, I had the pleasure of attending the Extended Semantic Web Conference 2019, held in Portorož, Slovenia. After ESWC, I had another semantic web visit with Axel Polleres, Sabrina Kirrane and team in Vienna. We had a great time avoiding the heat and talking about data search and other fun projects. I then paid the requisite price for all this travel and am just now getting around to emptying my notebook. Note to future self: do your trip reports at the end of the conference.

It's been a while since I've been at ESWC, so it was nice to be back. The conference was, I think, down a bit in terms of the number of attendees, but the same community spirit and interesting content (check out the award winners) were there. Shout out to Miriam Fernandez and the team for making it an invigorating event:

BIG THX everyone for all the lovely moments at #eswc2019! Thx to all authors 4 the exciting work and presentations, SPC & PC members, keynote speakers, sponsors, … but specially to an absolutely amaizing OC team! Thanks to all of your for making the SW community so special 🙂 pic.twitter.com/LNtdxHvZcH

— Miriam Fernandez (@miriam_fs) June 7, 2019

So what was I doing there? I was presenting work at the Deep Learning for Knowledge Graph workshop on trying to see if we could answer structured (e.g. SPARQL) queries over text (paper):

The workshop itself was packed; I think there were about 30-40 people in the room. In addition to presenting the workshop paper, I was also one of the mentors for the doctoral consortium. It was really nice to see the next generation of up-and-coming students, who put a lot of work into the session: a paper, a revised paper, a presentation and a poster. Victor and Maria-Esther did a fantastic job organizing this.

So what were my takeaways from the conference? I had many of the same thoughts coming out of this conference that I had at the recent AKBC 2019, especially around the ideas of polyglot representation and scientific literature understanding as an important domain driver (e.g. Predicting Entity Mentions in Scientific Literature and Mining Scholarly Data for Fine-Grained Knowledge Graph Construction), but there were some additional things as well.

Target Schemas

The first was a notion that I'll term "target schemas". Diana Maynard talked about this in her keynote. These are little, conceptually focused ontologies designed specifically for the application domain. She talked about how working with domain experts to put together these little ontologies, which could then be the target for NLP tools, was really a key part of building domain-specific analytical applications. I think this notion of simple schemas is also readily apparent in many commercial knowledge graphs.

The notion of target schemas popped up again in an excellent talk by Katherine Thornton on the use of ShEx. In particular, I would call out the introduction of the EntitySchema part of Wikidata (e.g. the schema for Human Gene or Software Title). These provide little target schemas that say something to the effect of "hey, if your data matches this kind of schema, I can use it in my application". I think this is a really powerful development.
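
In spirit, such a target schema is just a small set of required properties with expected types that an application checks before using an entity. Here is a plain-Python analogy (my own sketch, not ShEx, and the property names are made up):

```python
# A plain-Python analogy for a small "target schema" (not actual ShEx).
# A schema maps required properties to expected value types.
HUMAN_GENE_SCHEMA = {        # hypothetical property names
    "label": str,
    "chromosome": str,
    "encodes_protein": bool,
}

def matches_schema(entity, schema):
    """True if the entity has every required property with the right type."""
    return all(
        prop in entity and isinstance(entity[prop], expected)
        for prop, expected in schema.items()
    )

gene = {"label": "BRCA1", "chromosome": "17", "encodes_protein": True}
assert matches_schema(gene, HUMAN_GENE_SCHEMA)
assert not matches_schema({"label": "BRCA1"}, HUMAN_GENE_SCHEMA)
```

ShEx shapes are far more expressive (cardinalities, value sets, nesting), but the contract is the same: data that conforms to the shape is safe for the consuming application.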

Katherine Thornton presenting shex schema sharing on @wikidata since last Tuesday #eswc2019 pic.twitter.com/nYurHqiZtn

— Paul Groth (@pgroth) June 5, 2019

The third keynote, by Daniele Quercia, was impressive. The Good City Life project, about applying data to understand cities, just makes you think. You really must check it out. More to this point of target schemas, however, was the use of these little conceptual descriptions in the various maps and analyses he did. By thinking about how to define, for example, urban sounds or feelings on a walking route, his team was able to develop these fantastic and useful views of the city.

Impressive data insights into cities from @danielequercia https://t.co/hfGMovdsDS #eswc2019 pic.twitter.com/wrg6WhkUke

— Paul Groth (@pgroth) June 6, 2019

I think the next step will be to automatically generate these target schemas, and there was already some work headed in that direction. One paper was Generating Semantic Aspects for Queries, which was about using document mining to select which attributes one should show for an entity. Think of it as selecting what should show up in a knowledge graph entity panel. Likewise, in the talk on the Latent Relational Model for Relation Extraction, Gaetano Rossiello talked about using analogies between example entities to help extract these kinds of schemas for small domains.


I think this notion is worth exploring more.

Feral Spreadsheets

What more can I say:

Great term – feral spreadsheets – @dianamaynard #eswc2019 pic.twitter.com/maSrOt2DCV

— Paul Groth (@pgroth) June 5, 2019

We need more here. Things like MantisTable help. Data wrangling is the problem. Talking to Daniele about the data behind his maps confirmed this problem as well.

Knowledge Graph Engineering

This was a theme that also came up at AKBC: the challenge of engineering knowledge graphs. As an example, the Knowledge Graph Building workshop was packed. I really enjoyed the discussion, led by Ben De Meester, around how to evaluate the effectiveness of data mapping languages, especially with an emphasis on developer usability. The experiences shared by the industrial automation team from Festo were really insightful. It's amazing to see how knowledge graphs have been used to accelerate their product development process, but also the engineering effort and challenges it took to get there.


Likewise, Peter Haase, in his audacious keynote (no slides, only a demo), showed how far we've come in the underlying platforms and technology for creating commercially useful knowledge graphs. This is really thanks to him and the other people who straddle the commercial/research line. It was neat to see an Open PHACTS-style biomedical knowledge graph being built using SPARQL and API service wrappers.


However, these kinds of wrappers still need to be built, the links need to be created and, more importantly, the data needs to be made available. A summary of the challenges:

#eswc2019 Industry presentation by @Siemens Very interesting analysis of the challenges of constructing and using Knowledge Graphs. @eswc_conf pic.twitter.com/4veU79x0CH

— Miriam Fernandez (@miriam_fs) June 4, 2019

Overall, I really enjoyed the conference. I got a chance to spend some time with a bunch of members of the community, and it's exciting to see the continued excitement and the number of new research questions.

Random Notes


Source: Think Links


Exploring Automatic Recognition of Labanotation Dance Scores

[This post describes the research of Michelle de Böck and is based on her MSc Information Sciences thesis.]

Digitization of cultural heritage content allows for digital archiving, analysis and other processing of that content. The practice of scanning and transcribing books, newspapers and images, 3D-scanning artworks, or digitizing music has opened up this heritage, for example, for digital humanities research or even creative computing. However, with respect to the performing arts, including theater and more specifically dance, digitization is a serious research challenge. Several dance notation schemes exist, the most established being Labanotation, developed in the 1920s by Rudolf von Laban. Labanotation uses a vertical staff notation to record human movement in time, with various symbols for limbs, head movement, and types and directions of movements.

Generated variations of movements used for training the recognizers

Where good translations to digital formats exist for musical scores (e.g. MIDI), for Labanotation these are lacking. While structured formats exist (LabanXML, MovementXML), the majority of content still exists only in non-digitized form (on paper) or in scanned images. The research challenge of Michelle de Böck's thesis therefore was to identify design features for a system capable of recognizing Labanotation in scanned images.

Examples of Labanotation files used in the evaluation of the system.

Michelle designed such a system and implemented it in MATLAB, focusing on a subset of movement symbols. Several approaches were developed and compared, including approaches using pre-trained neural networks for image recognition (AlexNet). This approach outperformed the others, resulting in a classification accuracy of 78.4%. While we are still far from a full-fledged OCR system for Labanotation, this exploration has provided valuable insights into the feasibility and requirements of such a tool.


Source: Victor de Boer


The ESWC2019 PhD Symposium

As part of the ESWC 2019 conference program, the ESWC PhD Symposium was held in wonderful Portorož, Slovenia. The aim of the symposium, this year organized by Maria-Esther Vidal and myself, is to provide a forum for PhD students in the area of the Semantic Web to present their work and discuss their projects with peers and mentors.

Jana Vatascinova talks about the all-important challenge of ontology matching in the biomedical domain. #ESWC2019 pic.twitter.com/EwY2gPhf13

— Victor de Boer (@victordeboer) June 2, 2019

Even though we received only 5 submissions this year, all of them were of high quality, so the full-day symposium featured five talks by both early and middle/late stage PhD students. The draft papers can be found on the symposium web page and our opening slides can be found here. Students were mentored by amazing mentors to improve their papers and presentation slides. A big thank you to those mentors: Paul Groth, Rudi Studer, Maria Maleshkova, Philippe Cudre-Mauroux, and Andrea Giovanni Nuzzolese.

Stefan Schlobach is our keynote speaker in the #ESWC2019 PhD symposium. pic.twitter.com/XMflRJSLIl

— Victor de Boer (@victordeboer) June 2, 2019

The program also featured a keynote by Stefan Schlobach, who talked about the road to a PhD “and back again”. He discussed a) setting realistic goals, b) finding your path towards those goals and c) being a responsible scientist and person after the goal is reached.

#eswc2019 Doctoral Consortium. Very interesting list on how not to get a PhD, or things you should not do if you want to get a PhD. Check the TED talk about procrastination 😉 @Stefan Schlobach pic.twitter.com/nwelUYssCN

— Miriam Fernandez (@miriam_fs) June 2, 2019

Students also presented their work in a poster session, and the posters will also be shown at the main conference poster session on Tuesday 4 June.

No time for an after-lunch dip! Markus Schröder from DFKI started his presentation on semantic enrichment of enterprise data. #ESWC2019 phd symposium pic.twitter.com/BBftLtwgbr

— Victor de Boer (@victordeboer) June 2, 2019

Want to see what the next wave of #semanticweb researchers are working on? Come see us *now* at the #eswc2019 PhD symposium poster session. pic.twitter.com/ojQFXrY94Q

— Victor de Boer (@victordeboer) June 2, 2019


Source: Victor de Boer


Trip Report: AKBC 2019

About two weeks ago, I had the pleasure of attending the 1st Conference on Automated Knowledge Base Construction, held in Amherst, Massachusetts. This conference follows up on a number of successful workshops held at venues like NeurIPS and NAACL. Why a conference and not another workshop? The general chair and host of the conference (and he really did feel like a host), Andrew McCallum, articulated three drivers: 1) the community spans a number of different research areas but is developing its own identity; 2) the workshop was outgrowing typical colocation opportunities; and 3) the motivation to have a smaller event where people could really connect, in comparison to some larger venues.

Automated knowledge base construction is at the intersection of area – Andrew McCallum @akbc_conf #akbc2019 pic.twitter.com/m7DHBat7Ph

— Paul Groth (@pgroth) May 20, 2019

I don't know the exact total, but I think there were just over 110 people at the conference. Importantly, top people in the field were there, and they stuck around and hung out. The size, the location, the social events (a lovely group walk in the forest in Massachusetts): all made it so that the conference achieved the goal of having time to converse in depth. It reminded me a lot of our Provenance Week events in the scale and depth of conversation.

@akbc_conf #AKBC2019 photos. Thank you to all the organizers, staff, speakers and participants who made it such an engaging, insightful, friendly, and fun conference. Already looking forward to #AKBC2020! Some photos: pic.twitter.com/VkhJtZMTfo

— andrewmccallum (@andrewmccallum) May 24, 2019

Oh and Amherst is a terribly cute college town:


Given that the conference subject is really central to my research, I found it hard to boil everything down into a few themes, but I'll give it a shot:

  • Representational polyglotism
  • So many datasets so little time
  • The challenges of knowledge (graph) engineering
  • There’s lots more to do!

Representational polyglotism


One of the main points that came up frequently, both in talks and in conversation, was what one should use as the representation language for knowledge bases, and for what purpose. Typed graphs have clearly shown their worth over the last 10 years, witness the rise of knowledge graphs in a wide variety of industries and applications. The power of the relational approach, especially in its probabilistic form, was shown in excellent talks by Lise Getoor on PSL and by Guy Van den Broeck. For efficient query answering and efficiency in data usage, symbolic solutions work well. On the other hand, the softness of embeddings, or even straight textual representations, enables the kind of fuzziness that's inherent in human knowledge. Currently, our approach to unifying these two views is often to encode the relational representation in an embedding space, reason about it geometrically, and then throw the result back over the wall into symbolic/relational space. This was something that came up frequently, and Van den Broeck took it head on in his talk.
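
A toy sketch of that encode/reason-geometrically/decode loop, in the style of TransE (my own illustration: the 2-d embeddings are hand-picked rather than trained, and the entities and threshold are made up):

```python
import math

# TransE-style toy: embed entities and relations as vectors, score triples
# geometrically, then "throw back over the wall" into symbolic triples.
entity = {
    "Amsterdam":   (0.0, 0.0),
    "Netherlands": (1.0, 0.0),
    "Berlin":      (0.0, 1.0),
    "Germany":     (1.0, 1.0),
}
relation = {"locatedIn": (1.0, 0.0)}  # TransE wants head + relation ≈ tail

def score(head, rel, tail):
    """Geometric plausibility: distance between head + relation and tail."""
    hx, hy = entity[head]
    rx, ry = relation[rel]
    return math.dist((hx + rx, hy + ry), entity[tail])

def predicted_facts(threshold=0.1):
    """Decode geometric conclusions back into symbolic triples."""
    return {
        (h, "locatedIn", t)
        for h in entity
        for t in entity
        if h != t and score(h, "locatedIn", t) < threshold
    }

assert ("Amsterdam", "locatedIn", "Netherlands") in predicted_facts()
assert ("Amsterdam", "locatedIn", "Germany") not in predicted_facts()
```

A real system learns the vectors from data, which is exactly where the softness comes from: near-misses in the geometry correspond to plausible-but-unasserted facts.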

Then there's McCallum's notion of text as a knowledge graph. This approach was used frequently, to different degrees, which is to be expected given that much of the contents of KGs is provided through information extraction. In her talk, Laura Dietz discussed her work annotating the edges of a knowledge graph with paragraph text to improve entity ranking in search. Likewise, the work presented by Yejin Choi around common sense reasoning used natural language as the representational "formalism". She discussed the ATOMIC knowledge graph (paper), which represents crowdsourced common sense knowledge as natural language text triples (e.g. PersonX finds ___ in the literature). She then described transformer-based, BERT-esque architectures (COMET: Commonsense Transformers for Knowledge Graph Construction) that perform well on common sense reasoning tasks based on these kinds of representations.

The performance of BERT-style language models on all sorts of tasks led Sebastian Riedel to consider whether one should treat these models as the KB:


It turns out that out-of-the-box BERT performs pretty well as a knowledge base for single tokens that the model has seen frequently. That’s pretty amazing. Is storing all our knowledge in the parameters of a model the way to go? Maybe not, but it’s surely worth investigating the extent of the possibilities here. I came away from the event thinking that we are moving toward an environment where KBs will maintain heterogeneous representations, and that we need to embrace this range of representations in order to face the challenges of fuzziness. For example, the challenge of reasoning:

Great banquet talk by @earnmyturns, telling us about the challenge of reasoning#AKBC2019 #AKBC #ML #NLProc pic.twitter.com/3f6txOhVYI

— AKBC 2019 (@akbc_conf) May 22, 2019

or of disagreement around knowledge as discussed by Chris Welty:


So many datasets, so little time

Progress in this field is driven by data and there were a lot of new datasets presented at the conference. Here’s my (probably incomplete) list:

  • OPIEC – from the makers of the MINIE open IE system – 300 million open information extraction triples with a bunch of interesting annotations;
  • TREC CAR dataset – cool task: auto-generate articles for a search query;
  • HAnDS – a new dataset for fine-grained entity typing, supporting thousands of types;
  • HellaSwag – a new dataset for common sense inference designed to be hard for state-of-the-art transformer-based architectures (BERT);
  • ShARC – a conversational question answering dataset focused on follow-up questions;
  • Materials Synthesis – annotated data for extracting materials synthesis recipes from text. Look in their GitHub repo for more interesting stuff;
  • MedMentions – annotated corpora of UMLS mentions in biomedical papers from CZI;
  • A bunch of datasets that were submitted to EMNLP, so expect those to come soon – follow @nlpmattg.

The challenges of knowledge (graph) engineering

Juan Sequeda has been on this topic for a while – large scale knowledge graphs are really difficult to engineer. The team at DiffBot – who were at the conference – are doing a great job of supplying this engineering as a service through their knowledge graph API. I’ve been working with another start-up, SeMI, who are also trying to tackle this challenge. But this is still a complicated task, as was underlined for me when talking to Francois Scharffe, who organized the recent industry-focused Knowledge Graph Conference. The complexity of KG (socio-technical) engineering was one of the main themes of that conference. An example at AKBC of the need to tackle this complexity was the work presented on the knowledge engineering behind the KG that powers Apple’s Siri. Xiao Ling emphasized that they spent a lot of their time thinking about and implementing systems for the knowledge base construction developer workflow:

Cool to see Apple using the combination of various public knowledge bases – wikidata, musicbrainz, discogs to power Siri #AKBC2019 https://t.co/2XN44hxVjC pic.twitter.com/G8uduHyw00

— Paul Groth (@pgroth) May 20, 2019

Thinking about these sorts of challenges was also behind several of the presentations in the Open Knowledge Network workshop: Vicki Tardif from the Google Knowledge Graph discussed these issues, in particular with reference to the muddiness of knowledge representation (e.g. how to interpret facets of a single entity, or how to align the inconsistencies of people with those of machines). Jim McCusker and Deborah McGuinness’ work on the provenance/nanopublication-driven WhyIs framework for knowledge graph construction is important in that their software views a knowledge graph not as an output but as a set of tooling for engineering that graph.

The best paper of the conference, Alexandria: Unsupervised High-Precision Knowledge Base Construction using a Probabilistic Program, was also about lowering the barrier to defining knowledge base construction steps, in this case using a simple probabilistic program. Building a KB from a single seed fact is impressive, but it then takes serious engineering effort to scale the probabilistic inference massively.

Alexandra Meliou’s work on using provenance to help diagnose these pipelines was particularly relevant to this issue. I have now added a bunch of her papers to the queue.

There’s lots more to do

One of the things I most appreciated was that many speakers had a set of research challenges at the end of their presentations. So here’s a set of things you could work on in this space curated from the event. Note these may be paraphrased.

  • Laura Dietz:
    • General purpose schema with many types
    • High coverage/recall (40%?)
    • Extraction of complex relations (not just triples + coref)
    • Bridging existing KGs with text
    • Relevant information extraction
    • Query-specific knowledge graphs
  • Fernando Pereira
    • combining source correlation and grounding
  • Guy van den Broeck
    • Do more than link prediction
    • Tear down the wall between query evaluation and knowledge base completion
    • The open world assumption – take it seriously
  • Waleed Ammar
    • Bridge sentence level and document level predictions
    • Summarize published results on a given problem
    • Develop tools to facilitate peer review
    • How do we crowdsource annotations for a specialized domain?
    • What are leading indicators of a paper’s impact?
  • Sebastian Riedel
    • Determine what BERT actually knows or what it’s guessing
  • Xiang Ren
    • Where can we source complex rules that help AKBC?
    • How do we induce transferable latent structures from pre-trained models?
    • Can we have modular neural networks for modeling compositional rules?
    • How do we model “human effort” in the objective function during training?
  • Matt Gardner
    • Make hard reading datasets by baking required reasoning into them

Finally, I think the biggest challenge that was laid down was from Claudia Wagner, which is how to think a bit more introspectively about the theory behind our AKBC methods and how we might even bring the rigor of social science methodology to our technical approaches:

@clauwa bringing social science methodology to #akbc2019 – the need to document design decisions pic.twitter.com/LT3BogSeNq

— Paul Groth (@pgroth) May 21, 2019

I left AKBC 2019 with a fountain of ideas and research questions, which I count as a success. This is a community to watch.  AKBC 2020 is definitely on my list of events to attend next year.


Source: Think Links

Posted in Paul Groth, Staff Blogs

6th International Symposium “Perspectives on ICT4D”

On 23 May, as part of the VU ICT4D course, W4RA and SIKS organized the annual symposium “Perspectives on ICT4D“ for the 6th time. This year’s theme was how to tackle “Global Challenges” in a collaborative, trans-disciplinary way. Food security is one of these Global Challenges, and Lia van Wesenbeeck – Director of the Amsterdam Centre for World Food Studies – gave a great presentation on “Tackling World Food Challenges”.

Our international speaker on the same topic, Mr. Seydou Tangara, coordinator of the AOPP, was unfortunately not able to join due to visa problems. He was replaced by Prof. Hans Akkermans, who presented the Vienna Manifesto on Digital Humanism and its relation to ICT4D.

Andre Baart from UvA talked about the CARPA project and the challenges of developing applications for people in Mali, while Jaap Gordijn discussed the need for business modelling when developing sustainable services, with interesting case studies from Sarawak, Malaysia.

The ICT4D students presented their voice application services during the coffee break. They demonstrated applications ranging from equipment-lending services to seed markets and weather services.

The 6th edition of the @VUamsterdam workshop "Perspectives on #ICT4D" starts now. A nice mixed program with various speakers and student project presentations. pic.twitter.com/7XmfxYt7jE

— Victor de Boer (@victordeboer) May 23, 2019

Andre Baart from #UvA business school talks about #CARPA project for crowdsourcing responsible production in Africa. #perspectives on #ict4d pic.twitter.com/nBkwruz3BX

— Victor de Boer (@victordeboer) May 23, 2019

Next up is Lia van Wesenbeeck – Director of the Amsterdam Centre for World Food  Studies. She talks about tackling world food challenges working with world bank. #perspectives on #ict4d pic.twitter.com/RCN5jKoghA

— Victor de Boer (@victordeboer) May 23, 2019

prof. Hans Akkermans closes the first half of the #perspectives on #ict4d workshop by talking about the Vienna #digitalhumanism manifesto. pic.twitter.com/p6jJuBjtmw

— Victor de Boer (@victordeboer) May 23, 2019

Jaap Gordijn now talks about the importance of sustainable service development in #ict4d. pic.twitter.com/6S5myyT0DV

— Victor de Boer (@victordeboer) May 23, 2019

#ict4d students present their hard work during the symposium poster and demo session. pic.twitter.com/DrZ2Sda8pm

— Victor de Boer (@victordeboer) May 23, 2019


Source: Victor de Boer

Posted in Staff Blogs, Victor de Boer

Digital Humanities in Practice 2018/2019

Last Friday, the students of the 2018/2019 class of the course Digital Humanities and Social Analytics in Practice presented the results of their capstone internship projects. This course and project form the final element of the Digital Humanities and Social Analytics minor programme, in which students from very different backgrounds gain skills and knowledge around this interdisciplinary topic.

Poster presentation of the DHiP projects

The course took the form of a 4-week internship at an organization working with humanities or social science data and challenges, and student groups were asked to use their skills and knowledge to address a research challenge. Projects ranged from cleaning, indexing, visualizing and analyzing humanities data sets to searching for bias in news coverage of political topics. The students showed their competences not only in their research work but also in communicating that research through great posters.

Super excited to see what the @VUamsterdam #DH in Practice have worked on the past month, including three @KNAWHuC projects! pic.twitter.com/57SFRlPvLQ

— Marieke van Erp (@merpeltje) February 1, 2019

The complete list of student projects and collaborating institutions is below:

  • “An eventful 80 years’ war” at Rijksmuseum identifying and mapping historical events from various sources.
  • An investigation into the use of structured vocabularies also at the Rijksmuseum
  • “Collecting and Modelling Event WW2 from Wikipedia and Wikidata” in collaboration with Netwerk Oorlogsbronnen (see poster image below)
  • A project where a search index for development documents governed by the NICC foundation was built.
  • “EviDENce: Ego Documents Events modelliNg – how individuals recall mass violence” – in collaboration with KNAW Humanities Cluster (HUC)
  • “Historical Ecology” – where students searched for mentions of animals in historical newspapers – also with KNAW-HUC
  • Project MIGRANT: Mobilities and connection project in collaboration with KNAW-HUC and Huygens ING
  • Capturing Bias with media data analysis – an internal project at VU looking at identifying media bias
  • Locating the CTA Archive Amsterdam where a geolocation service and search tool was built
  • Linking Knowledge Graphs of Symbolic Music with the Web – also an internal project at VU working with Albert Merono

One of the posters visualizing the events and persons related to the occupation of the Netherlands in WW2


Source: Victor de Boer

Posted in Staff Blogs, Victor de Boer