Trip Report: Dagstuhl Seminar on Knowledge Graphs

Last week, I was at Dagstuhl for a seminar on knowledge graphs, specifically focused on new directions for knowledge representation. Knowledge graphs have exploded in practice since the release of Google’s Knowledge Graph in 2012. Examples include the knowledge graphs at Airbnb, Zalando, and Thomson Reuters. Beyond commercial knowledge graphs, there are many successful academic/public knowledge graphs including Wikidata, YAGO, and NELL.

The emergence of these knowledge graphs has led to expanded research interest in constructing and maintaining knowledge bases. As an indicator, check out the recent growth in papers using the term “knowledge graph” (roughly 10x more papers per year since 2012):

[Figure: growth in the number of papers per year using the term “knowledge graph”]

Research in this area spans fields of computer science, ranging from the semantic web community to natural language processing, machine learning and databases. This is reflected in the recent CFP for the new Automated Knowledge Base Construction conference.

This particular seminar primarily brought together folks who had a “home” community in the semantic web but were deeply engaged with another community. One example is Prof. Maria-Esther Vidal, who is well versed in the database literature. This was nice in that there was already quite a lot of common ground, yet there were people who could effectively communicate, or at least point to, what’s happening in other areas. This was different from many of the other Dagstuhl seminars I’ve been to (this was my 6th), which were much more about bringing together different areas. I think both styles are useful, but it felt like we could go faster here as the language barrier was lower.

Still about to shape the future of #KnowledgeGraphs at @dagstuhl pic.twitter.com/vCt33eKk5Z

— Heiko Paulheim (@heikopaulheim) September 13, 2018

The broad aim of the seminar was to come up with research challenges based on the experience we’ve had over the last 10 years. There will be a follow-up report that should summarize the thoughts of the whole group. There were a lot of sessions and a lot of amazing discussions, both during the day and in the evening facilitated by cheese & wine (a benefit of Dagstuhl), so it’s hard to summarize everything even on a personal level, but I wanted to pull out the things that have stuck with me now that I’m back home:

1) Knowledge Graphs of Everything

We are increasingly seeing knowledge graphs that cover an entire category of entities. For example, Amazon’s product graph aims to be a knowledge graph of all products in the world; one can think of Google and Apple Maps as databases of every location in the world; there are databases of every company that has ever had a web page, or of everyone in India. Two things stand out. One is that these are large sets of instance data; I would contend their focus is not on deeply modeling the domain in some expressive logic à la Cyc. Second, the majority of these databases are built by private companies. I think it’s an interesting question whether things like Wikidata can equal these private knowledge graphs in a public way.

Once you start thinking at this scale, a number of interesting questions arise: how do you keep these massive graphs up to date; can you integrate these graphs; how do you manage access control and policies (“controlled access”); what can you do with this; can we extend these sorts of graphs to the physical world (e.g. in IoT); what about a knowledge graph of happenings (i.e. events)? Fundamentally, I think this “everything” notion is a useful framing device for research challenges.

2) Knowledge Graphs as a communication medium

A big discussion point during the seminar was the integration of symbolic and sub-symbolic representations. That’s an obvious topic given the success of deep learning and, importantly in the representation space, of embeddings. I liked how Michael Witbrock framed symbols as a strong prior on something being the case. Indeed, using background knowledge has been shown to improve learning performance on several tasks (e.g. Baier et al. 2018, Marino et al. 2017).

But this topic in general got us thinking about the usefulness of knowledge graphs as an exchange mechanism for machines. There is a bit of semantic web dogma that expressing things in a variant of logic helps machine-to-machine communication. This is true to some degree, but you can imagine that machines might prefer to consume a massive matrix of numbers instead of human-readable symbols with logical operators.

Given that, then, what’s the role of knowledge graphs? One can hypothesize that it is for the exchange of large-scale information between humanity and machines and vice versa. Currently, when people communicate large amounts of data they turn towards structure (e.g. libraries, websites with strong information architectures, databases). Why not use the same approach to communicate with machines? Thus, knowledge graphs can be thought of as a useful medium of exchange between what machines generate and what humanity would like to consume.

On a somewhat less grand note, we discussed the role of integrating different forms of representation in one knowledge graph. For example, keeping images represented as images and audio represented as audio alongside facts within the same knowledge graph. Additionally, we discussed different mechanisms for attaching semantics to the symbols in knowledge graphs (e.g. latent embeddings of symbols). I tried to capture some of that thinking in a brief overview talk.
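To make the “latent embeddings of symbols” idea concrete, here is a minimal, hypothetical sketch (in Python, not anything presented at the seminar) of attaching vectors to the symbols of a knowledge graph and scoring triples TransE-style, where a triple is plausible if head + relation lands near tail:

```python
import numpy as np

# Minimal TransE-style sketch: each symbol (entity/relation) gets a latent vector,
# and a triple (h, r, t) is scored by how closely h + r approximates t.
rng = np.random.default_rng(0)
dim = 8
entities = {"Amsterdam": rng.normal(size=dim), "Netherlands": rng.normal(size=dim)}
relations = {"capitalOf": rng.normal(size=dim)}

def score(head, rel, tail):
    """Lower score = more plausible triple under the (untrained) embedding."""
    return np.linalg.norm(entities[head] + relations[rel] - entities[tail])

print(score("Amsterdam", "capitalOf", "Netherlands"))
```

In a real system the vectors would of course be learned from the graph rather than sampled at random; the point is only that the semantics of a symbol can live in its vector as well as in logical axioms.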

In general, as we think of knowledge graphs as a communication medium, we should think about how to both tweak and expand the existing languages of expression we use for them, and the semantics of those languages.

3) Knowledge graphs as socio-technical processes

The final thing that stuck in my mind is that, at the scale we are talking about, many of the issues revolve around the complex interplay between humans and machines in producing, using and maintaining knowledge graphs. This was reflected in multiple threads:

  • Juan Sequeda’s thinking, emerging from his practical experience, on the need for knowledge/data engineers to build knowledge graphs and the lack of tooling for them. In some sense, this was a call to revisit the work on ontology engineering, but now in light of this larger scale and extensive adoption.
  • The findings of Wouter Beek and colleagues on empirical semantics, showing that in large-scale knowledge graphs how people actually express information differs from the intended underlying semantics.
  • The notions of how biases and perspectives are reflected in knowledge graphs and the steps being taken to begin to address these. A good example is the work of the Wikidata community to surface the biases and gaps in its knowledge base.
  • The success of schema.org in managing the overlapping needs of communities. This stood out because of the launch of Google’s Dataset Search service, which is based on schema.org metadata.

While not related directly to knowledge graphs, the following piece on the relationship between AI systems and humans was circulating during the seminar:

Kate Crawford and Vladan Joler, “Anatomy of an AI System: The Amazon Echo As An Anatomical Map of Human Labor, Data and Planetary Resources,” AI Now Institute and Share Lab, (September 7, 2018) https://anatomyof.ai

There is a critical need for more data about the interface between the knowledge graph and its maintainers and users.

As I mentioned, there was a lot more discussed, and I hope the eventual report will capture it. Overall, it was fantastic to spend a week with the people below – both fun and thought-provoking.

Random pointers:

Whiteboard action! @dagstuhl #knowledgegraphs pic.twitter.com/cZT0NwWUp2

— Marieke van Erp (@merpeltje) September 12, 2018

Source: Think Links

Posted in Paul Groth, Staff Blogs

The Benefits of Linking Metadata for Internal and External users of an Audiovisual Archive

[This post describes the Master Project work of Information Science students Tim de Bruijn and John Brooks and is based on their theses]

Audiovisual archives adopt structured vocabularies for their metadata management. With Semantic Web and Linked Data now becoming more and more stable and commonplace technologies, organizations are now looking at linking these vocabularies to external sources, for example Wikidata, DBpedia or GeoNames.

However, the benefits of such endeavors to the organizations are generally underexplored. For their master project research, done in the form of an internship at the Netherlands Institute for Sound and Vision (NISV), Tim de Bruijn and John Brooks conducted a case study into the benefits of linking the “Common Thesaurus for Audiovisual Archives” (GTAA) and the general-purpose dataset Wikidata. In their approach, they identified various use cases for user groups both internal (Tim) and external (John) to the organization. Not only were use cases identified and matched to a partial alignment of GTAA and Wikidata, but several proof-of-concept prototypes that address these use cases were also developed.

 

For the internal users, three use cases were elaborated, including a calendar service where personnel receive notifications when an author of a work passed away 70 years ago, thereby changing the copyright status of the work. This information is retrieved from the Wikidata page of the author, aligned with the GTAA entry (see fig. 1 above).
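As an illustration of the kind of lookup such a service could do (a sketch, not the students’ actual implementation), once a GTAA entry has been aligned to a Wikidata QID, the public Wikidata SPARQL endpoint can be queried for the author’s date of death (property P570). The helper function, user-agent string and example QID below are illustrative:

```python
import requests

# Sketch: fetch the date of death (P570) for a Wikidata item, so a calendar
# service could check whether 70 years have passed since the author's death.
QUERY = "SELECT ?dod WHERE { wd:%s wdt:P570 ?dod . }"

def date_of_death(qid: str):
    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY % qid, "format": "json"},
        headers={"User-Agent": "copyright-calendar-sketch/0.1"},
    )
    resp.raise_for_status()
    rows = resp.json()["results"]["bindings"]
    return rows[0]["dod"]["value"] if rows else None

print(date_of_death("Q535"))  # e.g. Q535 = Victor Hugo
```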

A second internal case involves the new ‘story platform’ of NISV. Here, Tim implemented a prototype end-user application to find stories related to the one currently shown to the user, based on the persons occurring in that story (fig. 2).
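A minimal sketch of that idea (with hypothetical data structures, not Tim’s prototype): rank the other stories by how many persons they share with the story currently being shown.

```python
# Illustrative "related stories" ranking: stories that share more persons with
# the current story score higher. The story data here is made up.
stories = {
    "s1": {"persons": {"Person A", "Person B"}},
    "s2": {"persons": {"Person B", "Person C"}},
    "s3": {"persons": {"Person D"}},
}

def related(story_id, top_n=5):
    current = stories[story_id]["persons"]
    scored = [
        (other, len(current & data["persons"]))
        for other, data in stories.items() if other != story_id
    ]
    return sorted([s for s in scored if s[1] > 0], key=lambda x: -x[1])[:top_n]

print(related("s1"))  # -> [('s2', 1)]
```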

The external cases centered on the users of the CLARIAH Media Suite. Several humanities researchers were interviewed to identify worthwhile extensions with Wikidata information. Based on the outcomes of these interviews, John Brooks developed a Wikidata retrieval service (fig. 3).

The research presented in the two theses is a good example of user-centric data science, where affordances provided by data linkages are aligned with various user needs. The tools were evaluated with end users to ensure they match actual needs. The work is reported in a paper to be presented at the MTSR 2018 conference: Victor de Boer, Tim de Bruijn, John Brooks, Jesse de Vos. The Benefits of Linking Metadata for Internal and External Users of an Audiovisual Archive. To appear in Proceedings of MTSR 2018 [Draft PDF].



Source: Victor de Boer

Posted in Staff Blogs, Victor de Boer

Developing a Sustainable Weather Information System in Rural Burkina Faso

[This post describes the Information Sciences Master Project of Hameedat Omoine and is based on her thesis.] 

In the quest to improve the lives of farmers and raise agricultural productivity in rural Burkina Faso, meteorological data has been identified as one of the key information needs of local farmers. Various online weather information services are available, but many are not tailored specifically to this target user group. In a research case study, Hameedat Omoine designed a weather information system that collects not only weather but also related agricultural information, and provides farmers with this information to help improve agricultural productivity and the livelihood of the people of rural Burkina Faso.

The research and design of the system was conducted at and in collaboration with 2CoolMonkeys, a Utrecht-based Open data and App-development company with expertise in ICT for Development (ICT4D).

Following the design science research methodology, Hameedat investigated the requirements for a weather information system and the possible options for ensuring the sustainability of the system. Using a structured approach, she developed the application and evaluated it in the field with potential Burkinabe end users. The mobile interface of the application featured weather information and crop advice (seen in the images above). A demonstration video is shown below.

Hameedat developed multiple alternative models to investigate the sustainability of the application. For this she used the e3value approach and language. The image below shows a model for the case where a local radio station is involved.


Source: Victor de Boer

Posted in Staff Blogs, Victor de Boer


Trip Report: Provenance Week 2018

A couple of weeks ago I was at Provenance Week 2018 – a biennial conference that brings together various communities working on data provenance. Personally, I find it a fantastic event, as it’s an opportunity to see the range of work going on, from provenance in astronomy data to the newest work on database theory for provenance. Bringing these various strands together is important, as work from across computer science touches on data provenance.

James Cheney's TaPP keynote. Different flavors of provenance. #provenanceweek .. pic.twitter.com/OdFCqKQCGs

— Bertram Ludäscher (@ludaesch) July 11, 2018

The week is anchored by the International Provenance and Annotation Workshop (IPAW) and the Theory and Practice of Provenance (TaPP) workshop, and includes events focused on emerging areas of interest, including incremental re-computation, provenance-based security and algorithmic accountability. There were 90 attendees, up from ~60 at the prior events, and here they are:

[Photo: the Provenance Week 2018 attendees]

The folks at King’s College London, led by Vasa Curcin, did a fantastic job of organizing the event, including great social outings on top of their department building and a boat ride along the Thames. They also catered to the World Cup fans. Thanks Vasa!


I had the following major takeaways from the conference:

Improved Capture Systems

The two years since the last Provenance Week have seen a number of improved systems for capturing provenance. In the systems setting, DARPA’s Transparent Computing program has given a boost to scaling out provenance capture systems. These systems use deep operating system instrumentation to capture logs, and over the past several years they have become more efficient and scalable (e.g. CamFlow, SPADE). This connects with the work we’ve been doing on improving capture using whole-system record-and-replay. You can now run these systems almost full-time, although they capture significant amounts of data (3 days = ~110 GB). Indeed, the folks at Galois presented an impressive-looking graph database specifically focused on working with provenance and time-series data streaming from these systems.

Beyond the security use case, sciunit.run was a neat tool that uses execution traces to produce reproducible computational experiments.

There were also a number of systems for improving the generation of instrumentation to capture provenance. UML2PROV automatically generates provenance instrumentation from UML diagrams and source code using the provenance templates approach (also used to capture provenance in an IoT setting). Curator implements provenance capture for micro-services using existing logging libraries. Similarly, UNICORE now implements provenance for its HPC environment. I still believe structured logging is one of the underrated ways of integrating provenance capture into systems.
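As a sketch of what I mean by structured logging for provenance (the field names and setup are illustrative, not any particular system’s API): each log line is a JSON record describing an activity together with the entities it used and generated, which can later be stitched into a PROV-style graph.

```python
import json
import logging
import sys
import time
import uuid

# Emit each provenance record as one structured (JSON) log line on stdout.
logger = logging.getLogger("prov")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(sys.stdout))

def log_activity(name, used, generated):
    """Record an activity with the data it used and the data it generated."""
    record = {
        "activity": name,
        "id": str(uuid.uuid4()),
        "used": used,
        "generated": generated,
        "endedAt": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    logger.info(json.dumps(record))

log_activity("clean_table", used=["raw.csv"], generated=["clean.csv"])
```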

Finally, there was some interesting work on reconstructing provenance. In particular, I liked Alexander Rasin’s work on reconstructing the contents of a database from its environment to answer provenance queries.

Also, the IPAW best paper looked at using annotations in a workflow to infer dependency relations:

Kudos to Shawn and Tim for combining theory & practice (logic inference & #YesWorkflow) in powerful new ways! #IPAW2018 Best Paper is available here: https://t.co/te3ce7mV6Y https://t.co/WT0F0PcMz7

— Bertram Ludäscher (@ludaesch) July 27, 2018

Lastly, there was some initial work on extracting the provenance of health studies directly from the published literature, which I thought was an interesting way of recovering provenance.

Provenance for Accountability

Another theme (mirrored by the event noted above) was the use of provenance for accountability. This has always been a major use for provenance as pointed out by Bertram Ludäscher in his keynote:

The need for knowing where your data comes from all the way from 1929 @ludaesch #provenanceweek https://t.co/TIgPEOFjxb pic.twitter.com/2NpbSMI699

— Paul Groth (@pgroth) July 9, 2018

However, I think that due to increasing awareness around personal data usage and privacy, the need for provenance is being more widely recognized. See, for example, the Royal Society’s report on Data management and use: Governance in the 21st century. At Provenance Week, there were several papers addressing provenance for GDPR; see:

I'd like to shamelessly pitch our GDPRov ontology as a superset of this work. The key difference here being justification is used as a legal concept. We use hasLegalBasis as a property. https://t.co/W4V9r2QXwA

— Harshvardhan Pandit (@coolharsh55) July 10, 2018

Also, I was impressed with the demo from Imosphere using provenance for accountability and trust in health data:

Great to be part of #provenanceweek at @KingsCollegeLon, here's Anthony @ScampDoodle demonstrating the data provenance functionality within Atmolytics at yesterday's sessions. To learn more about the benefits of data provenance in analytics go to https://t.co/8NdmN2ECrP pic.twitter.com/ueDQUi7jSG

— Imosphere (@Imosphere) July 11, 2018

Re-computation & Its Applications

Using provenance to determine what to recompute seems to have a number of interesting applications in different domains. Paolo Missier showed, for example, how it can be used to determine when to recompute in next-generation sequencing pipelines.

Our #provenanceWeek IPAW 2018 conference paper on using provenance to facilitate re-computation analysis in the ReComp project. Link to paper: here: https://t.co/zeZ9xROm2S
Link to presentation: https://t.co/w6cVwpdLGT

— Paolo Missier (@PMissier) July 9, 2018

I particularly liked their notion of a re-computation front – the set of past executions you need to re-execute in order to address a change in the data.
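To illustrate the notion (a toy sketch of my own, not the ReComp implementation): if provenance records which data items each past execution used and generated, the re-computation front for a set of changed items can be found by propagating “dirtiness” forward through the dependency graph.

```python
from collections import deque

# Toy provenance: which data items each past execution used and generated.
used = {"e1": {"d0"}, "e2": {"d1"}, "e3": {"d1", "d2"}, "e4": {"d3"}}
generated = {"e1": {"d1"}, "e2": {"d2"}, "e3": {"d3"}, "e4": {"d4"}}

def recomputation_front(changed):
    """Return the executions that must be re-run when `changed` data items change."""
    front, dirty, queue = set(), set(changed), deque(changed)
    while queue:
        d = queue.popleft()
        for ex, inputs in used.items():
            if d in inputs and ex not in front:
                front.add(ex)
                # Anything this execution produced is now stale too.
                for out in generated.get(ex, ()):
                    if out not in dirty:
                        dirty.add(out)
                        queue.append(out)
    return front

print(recomputation_front({"d1"}))  # -> {'e2', 'e3', 'e4'}
```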

Wrattler was a neat extension of the computational notebook idea that showed how provenance can be used to automatically propagate changes through notebook executions and support suggestions.

Marta Mattoso’s team discussed the application of provenance to track adjustments when steering executions in complex HPC applications.

The work of Melanie Herschel’s team on provenance for data integration points to the benefits of applying provenance-based recomputation to speed up the iterative nature of data integration, as she enumerated in her presentation at the recomputation workshop.

You can see all the abstracts from the workshop here. I understand from Paolo that they will produce a report from the discussions there.

Overall, I left provenance week encouraged by the state of the community, the number of interesting application areas, and the plethora of research questions to work on.


Source: Think Links

Posted in Paul Groth, Staff Blogs

Testimonials: Digital Humanities minor at DHBenelux 2018

At the DHBenelux 2018 conference, students from the VU minor “Digital Humanities and Social Analytics” presented their final DH in Practice work. In this video, the students talk about their experience in the minor and the internship projects. We also meet other participants of the conference talking about the need for interdisciplinary research.

 


Source: Victor de Boer

Posted in Staff Blogs, Victor de Boer

Big Data Europe Project ended

All good things come to an end, and that also holds for our great Horizon 2020 project “Big Data Europe“, in which we collaborated with a broad range of technical and domain partners to develop (semantic) big data infrastructure for a variety of domains. VU was involved as leader of the Pilot and Evaluation work package and co-developed methods to test and apply the BDE stack in the Health, Traffic, Security and other domains.

You can read more about the end of the project in this blog post at the BDE website.


Source: Victor de Boer

Posted in Staff Blogs, Victor de Boer

André Baart and Kasadaka win IXA High Potential Award

[Photo: André and his prize]

On 19 June, André Baart was awarded the High Potential Award at the Amsterdam Science & Innovation en Impact Awards for his and W4RA‘s work on the Kasadaka platform.

Kasadaka (“talking box”) is an ICT for Development (ICT4D) platform for developing voice-based technologies for those who are not connected to the Internet, cannot read and write, and speak under-resourced languages.

As part of a longer-term project, the Kasadaka voice platform and its voice service development kit (VSDK) have been developed by André Baart as part of his BSc and MSc research at VU. In that context it has been extensively tested in the field, for example by Adama Tessougué, journalist and founder of radio Sikidolo in Konobougou, a small village in rural Mali. It was also evaluated in the context of the ICT4D course at VU by 46 master’s students from Computer Science, Information Science and Artificial Intelligence. The Kasadaka is now in Sarawak, Malaysia, where it will soon be deployed in a kampong by Dr. Cheah Waishiang, ICT4D researcher at Universiti Malaysia Sarawak (UNIMAS), and students from VU and UNIMAS.

André is currently pursuing his PhD in ICT4D at the Universiteit van Amsterdam and is still a member of the W4RA team.


Source: Victor de Boer

Posted in Staff Blogs, Victor de Boer

A Brief Trip Report from WebSci 2018

In the early part of last week I attended the Web Science 2018 conference. It was hosted here in Amsterdam, which was nice for me; it’s a pleasure to be at a conference where I can go home in the evening.

Web Science is an interesting research area in that it treats the Web itself as an object of study. It’s a highly interdisciplinary area that combines primarily social science with computer science. I always envision it as a loop: studies of what’s actually going on on the Web lead to new interventions on the Web, which we then need to study in turn.

There were, I’d guess, a hundred or so people there… it’s a small but fun community. I won’t give a complete rundown of the conference – you can find summaries of each day done by Cat Morgan (Workshop Day, Day 1, Day 2, Day 3) – but instead give an assortment of things that stuck out for me:

And some tweets:

The crowd waiting for @timberners_lee Turing Lecture is insane! #WebSci18 pic.twitter.com/2jpdVQZ3sV

— Roy Lee (@SRoyLee) May 29, 2018

Just like Global Warming, Facebook is anthropogenic – humans created it and it’s a lot easier to change (than global warming). You have an obligation to replace and fix it — and it’s an interdisciplinary endeavour to guide us on how #WebSci18 #turingaward @timberners_lee pic.twitter.com/0zm2EdC38d

— electronic max (@emax) May 29, 2018

It's amazing to consider that something so profoundly simple (the humble URL), can be so powerful, and of course, scalable. At the same time, smart people still struggle to grock this concept. #WebSci18 #webscience @W3C #linkeddata https://t.co/a2XrydT3Sm

— Bernadette Hyland (@BernHyland) May 29, 2018

Find more details in the paper https://t.co/fV5LYxFd1N https://t.co/mckKs7doGp

— metrics-project (@metrics_project) May 28, 2018

Lots of case studies here at #websci18 – always highly interesting but I’m wondering about generalizability – maybe need websci meta reviews? https://t.co/4cY4pIdfcS

— Paul Groth (@pgroth) May 30, 2018

Source: Think Links

Posted in Paul Groth, Staff Blogs

Presenting the CARPA project

The ICT4D project CARPA, funded by NWO-WOTRO, had its first stakeholder workshop today at the Amsterdam Business School of UvA. From our project proposal: The context for CARPA (Crowdsourcing App for Responsible Production in Africa) lies in sustainable and responsible business. Firms are under increasing pressure to ensure sustainable, responsible production in their supply chains. Lack of transparency about labour abuses and environmental damage has led some firms to cease purchases from the region.

The first stakeholder workshop at #UvA of #CAPRA project on developing an #ict4d crowdsourcing app for responsible production in #Africa #NWO#WOTRO @AndreBaart @marcelworring pic.twitter.com/sgfTb2P2XE

— Victor de Boer (@victordeboer) May 15, 2018

With an interdisciplinary partnership of local NGOs and universities in DRC, Mali, and South Africa, this project aims to generate new evidence-based knowledge to improve transparency about business impacts on responsible production.

Co-creating a smartphone application, we will use crowdsourcing methods to obtain reports of negative social and environmental business impacts in these regions, and follow them over time to understand access to justice and whether and how remediation of such impacts occurs. Data integration and visualization methods will identify patterns in order to provide context and clarity about business impacts on sustainability over time. A website will be developed to provide ongoing public access to this data, including a mapping function pinpointing impact locations.

The project will be led by Michelle Westermann-Behaylo from UvA, with the research work on the ground being executed by UvA’s Francois Lenfant and André Baart. Marcel Worring and I are involved in supervisory roles.


Source: Victor de Boer

Posted in Staff Blogs, Victor de Boer