Trip Report: ISWC 2018

Two weeks ago, I had the pleasure of attending the 17th International Semantic Web Conference held at the Asilomar Conference Grounds in California, a tremendously beautiful setting in a state park along the ocean. This trip report is somewhat later than normal because I took the opportunity to hang out for another week along the coast of California.

Before getting into the content of the conference, I think it’s worth saying: if you don’t believe that there are capable, talented, smart and awesome women in computer science at every level of seniority, the ISWC 2018 organizing committee + keynote speakers is the mic drop of counterexamples:

Back home after an incredible #iswc2018. Thank you to the team for making it happen. It was a pleasure and an honour to be part of this w/ @AnLiGentile @vrandezo @kbontcheva @jrlsgoncalves @laroyo @miriam_fs @vpresutti @laurakoesten @maribelacosta @merpeltje @iricelino @iswc2018

— Elena Simperl (@esimperl) October 19, 2018

Now some stats:

  •  438 attendees
  •  Papers
    •  Research Track: 167 submissions – 39 accepted – 23% acceptance rate
    •  In Use: 55 submissions – 17 accepted – 31% acceptance rate
    •  Resources: 31 submissions – 6 accepted – 19% acceptance rate
  •  38 Posters & 39 Demos
  •  14 industry presentations
  •  Over 1000 reviews

These are roughly the same as the last time ISWC was held in the United States. So on to the major themes I took away from the conference plus some asides.

Knowledge Graphs as enterprise assets

It was hard to walk away from the conference without being convinced that knowledge graphs are becoming fundamental to delivering modern information solutions in many domains. The enterprise knowledge graph panel was a demonstration of this idea. A big chunk of the majors were represented:

#iswc2018 Enterprise-scale Knowledge Graphs, very exciting panel! Microsoft, Facebook, eBay, Google, IBM, … Fantastic impact of the SW community 🙂 pic.twitter.com/2iONorKt1J

— Miriam Fernandez (@miriam_fs) October 11, 2018

The stats are impressive. Google’s Knowledge Graph has 1 billion things and 70 billion assertions. Facebook’s knowledge graph, which they distinguish from their social graph and which has only ramped up this year, has 50 million entities and 500 million assertions. More importantly, these are critical assets for applications: at eBay the KG is central to creating product pages, at Google and Microsoft KGs are key to entity search and assistants, and at IBM the KG is part of their corporate offerings. But you know it’s really in use when knowledge graphs are used for emoji:

Stickers on fb messages are driven by knowledge graphs :O #ISWC2018 pic.twitter.com/4wWm2h3H8t

— Helena Deus (@hdeus) October 11, 2018

It wasn’t just the majors who have or are deploying knowledge graphs. The industry track in particular was full of good examples of knowledge graphs being used in practice. Some that stood out were: Bosch’s use of knowledge graphs for question answering in the DIY domain; multiple use cases for digital twin management (Siemens, Aibel); use in a healthcare chatbot (Babylon Health); and helping to regulate the US finance industry (FINRA). I was also very impressed with Diffbot’s platform for creating KGs from the Web. I contributed to the industry session, presenting how Elsevier is using knowledge graphs to drive new products in institutional showcasing and healthcare.

Standing room in the industry track of #iswc2018 #iswc_conf. Fantastic to see how Semantic Web and Knowledge Graphs are being used in the real world. It’s not just an academic exercise anymore pic.twitter.com/K0PLbVMjDg

— Juan Sequeda (@juansequeda) October 10, 2018

Beyond the wide use of knowledge graphs, there were a number of things I took away from this thread of industrial adoption.

  1. Technology heterogeneity is really the norm. All sorts of storage, processing and representation approaches were being used. It’s good we have the W3C Semantic Web stack, but it’s even better that the principles of knowledge representation for messy data are being applied. This is exemplified by Amazon Neptune’s support for both TinkerPop and SPARQL (see the query sketch after this list).
  2. It’s still hard to build these things. Microsoft said it was hard at scale. IBM said it was hard for unique domains. I had several people come to me after my talk about Elsevier’s H-Graph to discuss similar challenges faced in other organizations that are trying to bring their data together, especially for machine-learning-based applications. Note, McCusker’s work is some of the better publicly available thinking on trying to address the entire KG construction lifecycle.
  3. Identity is a real challenge. I think one of the important moves in the success of knowledge graphs was not to over-ontologize. However, record linkage and deciding when to unify entities is still not a solved problem. One common approach is to move the creation of an identifiable entity closer to query time in order to deal with the query context, but that removes the shared conceptualization that is one of the benefits of a knowledge graph. Indeed, the clarion call by Google’s Jamie Taylor to teach knowledge representation was an outcome of the need for people who can think about these kinds of problems.
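To make the heterogeneity point concrete, below is a minimal sketch of hitting the same (hypothetical) product graph through the two interfaces Neptune exposes. The endpoint URL, the ex: schema and the property names are my own illustrative assumptions, not Neptune’s actual data model.

```python
# Minimal sketch: the same lookup phrased in SPARQL and as a Gremlin traversal.
# Endpoint URL and the ex:/Product schema are hypothetical.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://my-cluster.neptune.amazonaws.com:8182/sparql")
sparql.setQuery("""
    PREFIX ex: <http://example.org/>
    SELECT ?category WHERE {
      ?product a ex:Product ;
               ex:name "4K Streaming Stick" ;
               ex:category ?category .
    }
""")
sparql.setReturnFormat(JSON)
categories = sparql.query().convert()
print(categories["results"]["bindings"])

# The equivalent property-graph traversal, as it would be sent to the
# Gremlin endpoint (e.g. via gremlinpython):
gremlin = ('g.V().hasLabel("Product").has("name", "4K Streaming Stick")'
           '.out("category").values("name")')
```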

In terms of research challenges, much of what was discussed reflects the same kinds of ideas that were discussed at the recent Dagstuhl Knowledge Graph Seminar so I’ll point you to my summary from that event.

Finally, for most enterprises, their knowledge graph(s) were considered a unique asset to the company. This led to an interesting discussion about how to share “common knowledge” and the need to be able to merge such knowledge with local knowledge. This leads to my next theme from the conference.

Wikidata as the default option

@ma_kr talking @wikidata sparql service. Showing that #semtech is scalable and not too complicated #iswc2018 pic.twitter.com/x6z7vlMRPV

— Paul Groth (@pgroth) October 11, 2018

When discussing “common knowledge”, Wikidata has become a focal point. In the enterprise knowledge graph panel, it was mentioned as the natural place to collaborate on common knowledge. The mechanics of the contribution structure (e.g. open to all, provenance on statements) and institutional attention/authority (i.e. the Wikimedia Foundation) help with this. An example of Wikidata acting as a default is its use to help collate data on genes.

Fittingly enough, Markus Krötzsch and team won the best in-use paper with a convincing demonstration of how well semantic technologies have worked as the query environment for Wikidata. Furthermore, Denny Vrandečić (one of the founders of Wikidata) won the best blue sky paper with the idea of rendering Wikipedia articles directly from Wikidata.
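As a small illustration of how approachable that query environment is, here is the kind of query the gene-collation work runs against the public Wikidata SPARQL endpoint. It is only a sketch: the Q/P identifiers (Q7187 for gene, P703 for “found in taxon”, Q15978631 for Homo sapiens) are quoted from memory and worth double-checking.

```python
# Sketch: fetch a handful of human genes from the Wikidata Query Service.
# The specific Q/P identifiers are from memory and should be verified.
import requests

WDQS = "https://query.wikidata.org/sparql"
query = """
SELECT ?gene ?geneLabel WHERE {
  ?gene wdt:P31 wd:Q7187 ;        # instance of: gene
        wdt:P703 wd:Q15978631 .   # found in taxon: Homo sapiens
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""
resp = requests.get(WDQS, params={"query": query, "format": "json"})
for row in resp.json()["results"]["bindings"]:
    print(row["geneLabel"]["value"])
```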

Deep Learning diffusion

As with practically every other conference I’ve been to this year, deep learning as a technique has really been taken up. It’s become just part of the semantic web researcher’s toolbox. This was particularly clear in the knowledge graph construction area. Papers I liked with DL as part of the solution:

While not DL per se, I’ll lump embeddings into this section as well. Papers I thought were interesting:

The presentation of the above paper was excellent. I particularly liked their slide on related work:

[Slide: related work]

As an aside, the work on learning rules and the complementarity of rules to other forms of prediction was an interesting thread in the conference. Besides the above paper, see the work from Heiner Stuckenschmidt’s group on evaluating rule- and embedding-based approaches for knowledge graph completion. The work of Fabian Suchanek’s group on the representativeness of knowledge bases is applicable here as well, to tell whether rule learning from knowledge graphs is drawing on a representative source, and it is also interesting in its own right. Lastly, I thought the use of rules in Beretta et al.’s work to quantify the evidence for an assertion in a knowledge graph, and so help improve reliability, was neat.
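For readers less familiar with the embedding side of that rules-vs-embeddings conversation, here is a toy numpy sketch of the TransE-style scoring idea that many KG-completion methods build on. The vectors are random placeholders, not taken from any of the papers above.

```python
# Toy TransE-style scoring: a triple (h, r, t) is plausible when h + r ≈ t.
import numpy as np

rng = np.random.default_rng(0)
dim = 8
entities = {name: rng.normal(size=dim) for name in ["Amsterdam", "Netherlands", "Berlin"]}
relations = {"capitalOf": rng.normal(size=dim)}

def score(head, rel, tail):
    """Lower is better: distance between the translated head and the tail."""
    return float(np.linalg.norm(entities[head] + relations[rel] - entities[tail]))

# With trained embeddings, the true triple would score lower than a corrupted one.
print(score("Amsterdam", "capitalOf", "Netherlands"))
print(score("Amsterdam", "capitalOf", "Berlin"))
```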

Information Quality and Context of Use

The final theme is a bit harder for me to solidify and articulate but it lies at the intersection of information quality and how that information is being used. It’s not just knowing the provenance of information but it’s knowing how information propagates and was intended to be used. Both the upstream and downstream need to be considered. As a consumer of information I want to know the reliability of the information I’m consuming. As a producer I want to know if my information is being used for what it was intended for.

The latter problem was demonstrated by the keynote from Jennifer Golbeck on privacy. She touched on a wide variety of work, but in particular it’s clear that people don’t know, yet are concerned about, what is happening to their data.

What we are ready to compromise when it comes to #privacy? @jengolbeck @iswc2018 #iswc_conf #iswc2018 pic.twitter.com/Ez9hZJlvNC

— Angelo A. Salatino (@angelosalatino) October 10, 2018

There was also quite a bit of discussion going on about the decentralized web and Tim Berners-Lee’s Solid project throughout the conference. The workshop on decentralization was well attended. Something to keep your eye on.

The keynote by Natasha Noy also touched more broadly on the necessity of quality information, this time with respect to scientific data.

Natasha Noy presenting @google’s dataset search stressing in the importance of #metadata #dataquality and #provenance #ISWC2018 @iswc2018 pic.twitter.com/b3Dv4yWVhr

— Amrapali Zaveri (@amrapaliz) October 11, 2018

The notion of propagation of bias through our information systems was also touched on and is something I’ve been thinking about in terms of data supply chains:

#ISWC2018 "Debiasing knowledge graphs" Janowicz et all. Biases are in word embeddings (doctor-male/nurse-female), image search, etc. Data is not neutral! In SW what we get are statements but not necessarily facts about the world. How can we really de-bias? pic.twitter.com/HJ9ca7FXaS

— Miriam Fernandez (@miriam_fs) October 11, 2018

That being said, I think there’s an interesting path forward for using technology to address these issues. Yolanda Gil’s work on the need for AI to address our own biases in science is a step in that direction. This is a slide from her excellent keynote at the SemSci Workshop:

[Slide from Yolanda Gil’s SemSci Workshop keynote]

All this is to say that this is an absolutely critical topic and one where the standard “more research is needed” is very true. I’m happy to see this community thinking about it.

Final Thought

The Semantic Web community has produced a lot (see this slide from Natasha’s keynote):

[Slide from Natasha Noy’s keynote]

ISWC 2018 definitely added to that body of knowledge but more importantly I think did a fantastic job of reinforcing and exciting the community.

Really amazed by the community and the quality at #iswc2018. So happy to get to have dinner at #MonterreyAquarium! Thanks to the local organizing committee! @iswc2018 pic.twitter.com/vxHCdXvd5n

— Elisenda Bou (@elisenda_bou) October 11, 2018

Random Notes

Source: Think Links

Posted in Paul Groth, Staff Blogs

Who uses DBPedia anyway?

[this post is based on Frank Walraven’s Master’s thesis]

Who uses DBPedia anyway? This was the question that started a research project for Frank Walraven. The question came up during one of the meetings of the Dutch DBPedia chapter, of which VUA is a member. If usage and users are better understood, this can lead to better servicing of those users, for example by prioritizing the enrichment or improvement of specific sections of DBPedia. Characterizing the use(r)s of a Linked Open Data set is an inherently challenging task: in an open Web world, it is difficult to know who is accessing your digital resources. For his MSc project research, which he conducted at the Dutch National Library supervised by Enno Meijers, Frank used a hybrid approach combining a data-driven method based on user log analysis with a short survey of known users of the dataset. As a scope, Frank selected just the Dutch DBPedia dataset.

For the data-driven part of the method, Frank used a complete user log of HTTP requests on the Dutch DBPedia. This log file (see link below) consisted of over 4.5 million entries and covered both URI lookups and SPARQL endpoint requests. For this research, only the URI lookups were considered.

As a first analysis step, the requests’ origin IPs were categorized. Five classes can be identified (A-E), with the vast majority of IP addresses being in class “A”: very large networks and bots. Most of the IP addresses in this class could be traced back to search engine indexing bots such as those from Yahoo or Google. For the remaining classes, Frank manually traced the top 30 most encountered IP addresses, concluding that even there 60% of the requests came from bots, 10% definitely not from bots, with 30% remaining unclear.
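A minimal sketch of this kind of log triage, assuming a standard combined-log-format access log; the file name and the bot patterns are illustrative assumptions, not Frank’s actual pipeline.

```python
# Sketch: count requests per IP and flag obvious crawler user agents.
# Log path and bot patterns are illustrative assumptions.
import re
from collections import Counter

BOT_PATTERN = re.compile(r"googlebot|slurp|bingbot|crawler|spider", re.IGNORECASE)
ip_counts, bot_counts = Counter(), Counter()

with open("nl_dbpedia_access.log") as log:
    for line in log:
        ip = line.split()[0]
        parts = line.split('"')
        user_agent = parts[5] if len(parts) > 5 else ""
        ip_counts[ip] += 1
        if BOT_PATTERN.search(user_agent):
            bot_counts[ip] += 1

for ip, total in ip_counts.most_common(30):
    share = bot_counts[ip] / total
    print(f"{ip}\t{total} requests\t{share:.0%} flagged as bot traffic")
```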

The second analysis step in the data-driven method consisted of identifying which types of pages were most requested. To cluster the thousands of DBPedia URI requests, Frank retrieved the ‘categories’ of the pages. These categories are extracted from Wikipedia category links. An example is the “Android_TV” resource, which has two categories: “Google” and “Android_(operating_system)”. Following skos:broader links, a ‘level 2 category’ could also be found to aggregate to an even higher level of abstraction. As not all resources have such categories, this does not give a complete image, but it does provide some ideas on the most popular categories of items requested. After normalizing for categories with large amounts of incoming links, for example the category “non-endangered animal”, the most popular categories were: 1. Domestic & International movies, 2. Music, 3. Sports, 4. Dutch & International municipality information and 5. Books.
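For reference, DBPedia exposes these Wikipedia category links via dct:subject (with skos:broader between categories), so a category lookup against the Dutch DBPedia endpoint might look roughly like the sketch below. The endpoint URL and the use of the Android_TV resource as an example are assumptions on my part.

```python
# Sketch: fetch a resource's categories and their broader ('level 2') categories.
# Endpoint URL and resource IRI are illustrative.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://nl.dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dct:  <http://purl.org/dc/terms/>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT ?category ?broader WHERE {
      <http://nl.dbpedia.org/resource/Android_TV> dct:subject ?category .
      OPTIONAL { ?category skos:broader ?broader . }
    }
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["category"]["value"], "->", row.get("broader", {}).get("value", "-"))
```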

Frank also set up a user survey to corroborate this evidence. The survey contained questions about the how and why of the respondents’ Dutch DBPedia use, including the categories they were most interested in. The survey was distributed via the Dutch DBPedia website and via Twitter, but attracted only five respondents. This illustrates the difficulty of the problem: users of the DBPedia resource are not necessarily easily reachable through communication channels. The five respondents were all quite closely related to the chapter, but the results were interesting nonetheless. Most of the users used the DBPedia SPARQL endpoint. The full results of the survey can be found in Frank’s thesis, but in terms of corroboration the survey revealed that four out of the five categories found in the data-driven method were also identified in the top five resulting from the survey. The fifth one identified in the survey was ‘geography’, which could be matched to the fifth from the data-driven method.

Frank’s research shows that although it remains a challenging problem, using a combination of data-driven and user-driven methods it is indeed possible to get an indication of the most-used categories on DBPedia. Within the Dutch DBPedia Chapter, we are currently considering follow-up research questions based on Frank’s research.

Share This:

Source: Victor de Boer

Posted in Staff Blogs, Victor de Boer

Trip Report: Dagstuhl Seminar on Knowledge Graphs

Last week, I was at Dagstuhl for a seminar on knowledge graphs, specifically focused on new directions for knowledge representation. Knowledge graphs have exploded in practice since the release of Google’s Knowledge Graph in 2012. Examples include knowledge graphs at AirBnb, Zalando, and Thomson Reuters. Beyond commercial knowledge graphs, there are many successful academic/public knowledge graphs including Wikidata, YAGO, and NELL.

The emergence of these knowledge graphs has led to expanded research interest in constructing, producing and maintaining knowledge bases. As an indicator, check out the recent growth in papers using the term knowledge graph (roughly a 10x increase in papers per year since 2012):

[Chart: papers per year using the term “knowledge graph”]

The research in this area is found across fields of computer science, ranging from the semantic web community to natural language processing, machine learning and databases. This is reflected in the recent CFP for the new Automated Knowledge Base Construction Conference.

This particular seminar primarily brought together folks who had a “home” community in the semantic web but were deeply engaged with another community. For example, Prof. Maria-Esther Vidal, who is well versed in the database literature. This was nice in that there was already quite a lot of common ground, but with people who could effectively communicate, or at least point to, what’s happening in other areas. This was different from many of the other Dagstuhl seminars I’ve been to (this was my 6th), which were much more about bringing together different areas. I think both styles are useful, but it felt like we could go faster here as the language barrier was lower.

Still about to shape the future of #KnowledgeGraphs at @dagstuhl pic.twitter.com/vCt33eKk5Z

— Heiko Paulheim (@heikopaulheim) September 13, 2018

The broad aim of the seminar was to come up with research challenges based on the experience we’ve gained over the last 10 years. There will be a follow-up report that should summarize the thoughts of the whole group. There were a lot of sessions and a lot of amazing discussions, both during the day and in the evening, facilitated by cheese & wine (a benefit of Dagstuhl), so it’s hard to summarize everything even just on a personal level. Still, I wanted to pull out the things that have stuck with me now that I’m back home:

1) Knowledge Graphs of Everything

We are increasingly seeing knowledge graphs that cover an entire category of entities. For example, Amazon’s product graph aims to be a knowledge graph of all products in the world; one can think of Google and Apple Maps as databases of every location in the world; add to that a database of every company that has ever had a web page, or a database of everyone in India. Two things stand out. One is that these are large sets of instance data; I would contend their focus is not deeply modeling the domain in some expressive logic à la Cyc. Second, a majority of these databases are built by private companies. I think it’s an interesting question whether things like Wikidata can equal these private knowledge graphs in a public way.

Once you start thinking at this scale, a number of interesting questions arise: how do you keep these massive graphs up to date; can you integrate these graphs; how do you manage access control and policies (“controlled access”); what can you do with this; can we extend these sorts of graphs to physical systems (e.g. in IoT); and what about a knowledge graph of happenings (i.e. events)? Fundamentally, I think this “everything notion” is a useful framing device for research challenges.

2) Knowledge Graphs as a communication medium

A big discussion point during the seminar was the integration of symbolic and sub-symbolic representations. I think that’s an obvious topic given the success of deep learning and, importantly for the representation space, embeddings. I liked how Michael Witbrock framed symbols as a strong prior on something being the case. Indeed, using background knowledge has been shown to improve learning performance on several tasks (e.g. Baier et al. 2018, Marino et al. 2017).

But this topic in general got us thinking about the usefulness of knowledge graphs as an exchange mechanism for machines. There is a bit of semantic web dogma that expressing things in a variant of logic helps with machine-to-machine communication. This is true to some degree, but you can imagine that machines might prefer to consume a massive matrix of numbers instead of human-readable symbols with logical operators.

Given that, then, what’s the role of knowledge graphs? One can hypothesize that it is for the exchange of large-scale information between humanity and machines and vice versa. Currently, when people communicate large amounts of data they turn towards structure (i.e. libraries, websites with strong information architectures, databases). Why not use the same approach to communicate with machines? Thus, knowledge graphs can be thought of as a useful medium of exchange between what machines are generating and what humanity would like to consume.

On a somewhat less grand note, we discussed the role of integrating different forms of representation in one knowledge graph. For example, keeping images represented as images and audio represented as audio alongside facts within the same knowledge graph. Additionally, we discussed different mechanisms for attaching semantics to the symbols in knowledge graphs (e.g. latent embeddings of symbols). I tried to capture some of that thinking in a brief overview talk.
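One very small sketch of what “attaching latent semantics to symbols” can look like in practice: each node keeps its symbolic statements and also carries a vector, so similarity questions can be answered numerically alongside logical ones. The data below is entirely made up.

```python
# Sketch: a knowledge graph node carrying both symbolic facts and a latent embedding.
import numpy as np

node = {
    "iri": "http://example.org/Amsterdam",
    "triples": [("ex:Amsterdam", "ex:capitalOf", "ex:Netherlands")],
    "embedding": np.array([0.12, -0.40, 0.88]),  # toy vector
}

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

candidate = np.array([0.10, -0.35, 0.90])  # embedding of some other entity
print(cosine(node["embedding"], candidate))
```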

In general, as we think of knowledge graphs as a communication medium, we should think about how to both tweak and expand the existing languages of expression we use for them, as well as the semantics of those languages.

3) Knowledge graphs as socio-technical processes

The final thing that stuck in my mind is that, at the scale we are talking about, many of the issues revolve around the complex interplay between humans and machines in producing, using and maintaining knowledge graphs. This was reflected in multiple threads:

  • Juan Sequeda’s thinking emerging from his practical experience on the need for knowledge / data engineers to build knowledge graphs and the lack of tooling for them. In some sense, this was a call to revisit the work of ontology engineering but now in the light of this larger scale and extensive adoption.
  • The findings established by the work of Wouter Beek and co. on empirical semantics: in large-scale knowledge graphs, how people actually express information differs from the intended underlying semantics.
  • The notions of how biases and perspectives are reflected in knowledge graphs and the steps taken to begin to address them. A good example is the work of the Wikidata community to present the biases and gaps in its knowledge base.
  • The success of schema.org and managing the overlapping needs of communities. This stood out because of the launch of Google’s Dataset Search service based on schema.org metadata.

While not directly related to knowledge graphs, the following piece on the relationship between AI systems and humans was circulating during the seminar:

Kate Crawford and Vladan Joler, “Anatomy of an AI System: The Amazon Echo As An Anatomical Map of Human Labor, Data and Planetary Resources,” AI Now Institute and Share Lab, (September 7, 2018) https://anatomyof.ai

There is a critical need for more data about the interface between the knowledge graph and its maintainers and users.

As I mentioned, there was lots more that was discussed and I hope the eventual report will capture this. Overall, it was fantastic to spend a week with the people below – both fun and thought provoking.

Random pointers:

Whiteboard action! @dagstuhl #knowledgegraphs pic.twitter.com/cZT0NwWUp2

— Marieke van Erp (@merpeltje) September 12, 2018

Source: Think Links

Posted in Paul Groth, Staff Blogs

The Benefits of Linking Metadata for Internal and External users of an Audiovisual Archive

[This post describes the Master Project work of Information Science students Tim de Bruijn and John Brooks and is based on their theses]

Audiovisual archives adopt structured vocabularies for their metadata management. With Semantic Web and Linked Data now becoming more and more stable and commonplace technologies, organizations are now looking at linking these vocabularies to external sources, for example those of Wikidata, DBPedia or GeoNames.

However, the benefits of such endeavors to the organizations are generally underexplored. For their master project research, done in the form of an internship at the Netherlands Institute for Sound and Vision (NISV), Tim de Bruijn and John Brooks conducted a case study into the benefits of linking the “Common Thesaurus for Audiovisual Archives” (GTAA) and the general-purpose dataset Wikidata. In their approach, they identified various use cases for user groups that are both internal (Tim) and external (John) to the organization. Not only were use cases identified and matched to a partial alignment of GTAA and Wikidata, but several proof-of-concept prototypes that address these use cases were developed.

 

For the internal users, three cases were elaborated, including a calendar service where personnel receive notifications when an author of a work passed away 70 years ago, thereby changing the copyright status of the work. This information is retrieved from the Wikidata page of the author, aligned with the GTAA entry (see fig 1 above).
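A hedged sketch of the check behind such a calendar service: look up the author’s date of death (Wikidata property P570) and test whether 70 years have passed. The example QID is a placeholder; in the actual service the author is resolved via the GTAA-Wikidata alignment.

```python
# Sketch: has an author been dead for 70+ years (copyright expiry check)?
# The QID is a placeholder; the real service resolves authors via the GTAA alignment.
from datetime import datetime, timezone
import requests

WDQS = "https://query.wikidata.org/sparql"
query = 'SELECT ?death WHERE { wd:Q535 wdt:P570 ?death . }'  # Q535 = Victor Hugo, as an example
resp = requests.get(WDQS, params={"query": query, "format": "json"})
value = resp.json()["results"]["bindings"][0]["death"]["value"]
death = datetime.fromisoformat(value.replace("Z", "+00:00"))
years_dead = (datetime.now(timezone.utc) - death).days / 365.25
print("copyright expired" if years_dead >= 70 else "still protected")
```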

A second internal case involves the new ‘story platform’ of NISV. Here Tim implemented a prototype end-user application to find stories related to the one currently shown to the user, based on persons occurring in that story (fig 2).

The external cases centered around the users of the CLARIAH Media Suite. For this part, several humanities researchers were interviewed to identify worthwhile extensions using Wikidata information. Based on the outcomes of these interviews, John Brooks developed the Wikidata retrieval service (fig 3).

The research presented in the two theses is a good example of User-Centric Data Science, where affordances provided by data linkages are aligned with various user needs. The various tools were evaluated with end users to ensure they match their actual needs. The research was reported in a research paper which will be presented at the MTSR2018 conference: (Victor de Boer, Tim de Bruijn, John Brooks, Jesse de Vos. The Benefits of Linking Metadata for Internal and External users of an Audiovisual Archive. To appear in Proceedings of MTSR 2018 [Draft PDF])

Find out more:

Share This:

Source: Victor de Boer

Posted in Staff Blogs, Victor de Boer

Developing a Sustainable Weather Information System in Rural Burkina Faso

[This post describes the Information Sciences Master Project of Hameedat Omoine and is based on her thesis.] 

In the quest to improve the lives of farmers and agricultural productivity in rural Burkina Faso, meteorological data has been identified as one of the key information needs of local farmers. Various online weather information services are available, but many are not tailored specifically to this target user group. In a research case study, Hameedat Omoine designed a weather information system that collects not only weather but also related agricultural information and provides the farmers with this information to allow them to improve agricultural productivity and the livelihood of the people of rural Burkina Faso.

The research and design of the system was conducted at and in collaboration with 2CoolMonkeys, a Utrecht-based Open data and App-development company with expertise in ICT for Development (ICT4D).

Following the design science research methodology, Hameedat investigated the requirements for a weather information system and the possible options for ensuring the sustainability of the system. Using a structured approach, she developed the application and evaluated it in the field with potential Burkinabe end users. The mobile interface of the application featured weather information and crop advice (seen in the images above). A demonstration video is shown below.

Hameedat developed multiple alternative models to investigate the sustainability of the application. For this she used the e3value approach and language. The image below shows a model for the case where a local radio station is involved.

Share This:

Source: Victor de Boer

Posted in Staff Blogs, Victor de Boer

Trip Report: Provenance Week 2018

A couple of weeks ago I was at Provenance Week 2018 – a biennial conference that brings together various communities working on data provenance. Personally, it’s a fantastic event, as it’s an opportunity to see the range of work going on, from provenance in astronomy data to the newest work on database theory for provenance. Bringing together these various strands is important, as there is work from across computer science that touches on data provenance.

James Cheney's TaPP keynote. Different flavors of provenance. #provenanceweek .. pic.twitter.com/OdFCqKQCGs

— Bertram Ludäscher (@ludaesch) July 11, 2018

The week is anchored by the International Provenance and Annotation Workshop (IPAW) and the Theory and Practice of Provenance (TaPP) and includes events focused on emerging areas of interest including incremental re-computation, provenance-based security and algorithmic accountability. There were 90 attendees, up from ~60 at the prior events, and here they are:

[Group photo of the Provenance Week 2018 attendees]

The folks at King’s College London, led by Vasa Curcin, did a fantastic job of organizing the event, including great social outings on top of their department building and a boat ride along the Thames. They also catered to the World Cup fans. Thanks Vasa!


I had the following major takeaways from the conference:

Improved Capture Systems

The two years since the last Provenance Week have seen a number of improved systems for capturing provenance. In the systems setting, DARPA’s Transparent Computing program has given a boost to scaling out provenance capture systems. These systems use deep operating system instrumentation to capture logs, and over the past several years they have become more efficient and scalable (e.g. CamFlow, SPADE). This connects with the work we’ve been doing on improving capture using whole-system record-and-replay. You can now run these systems almost full-time, although they capture significant amounts of data (3 days = ~110 GB). Indeed, the folks at Galois presented an impressive-looking graph database specifically focused on working with provenance and time series data streaming from these systems.

Beyond the security use case, sciunit.run was a neat tool using execution traces to produce reproducible computational experiments.

There were also a number of systems for improving the generation of instrumentation to capture provenance. UML2PROV automatically generates provenance instrumentation from UML diagrams and source code using the provenance templates approach. (It was also used to capture provenance in an IoT setting.) Curator implements provenance capture for micro-services using existing logging libraries. Similarly, UNICORE now implements provenance for its HPC environment. I still believe structured logging is one of the underrated ways of integrating provenance capture into systems.
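Since I keep making that structured-logging claim, here is roughly what I mean, as a minimal sketch using Python’s standard logging module: one JSON record per activity with its inputs, outputs and agent, which is already most of what a PROV-style capture needs. The field names are my own, not any standard.

```python
# Sketch: provenance-flavoured structured logging (field names are illustrative).
import json
import logging
import uuid
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("prov")

def log_activity(activity, used, generated, agent):
    """Emit one JSON line describing an activity and what it used and generated."""
    record = {
        "id": str(uuid.uuid4()),
        "activity": activity,
        "used": used,            # input entities
        "generated": generated,  # output entities
        "agent": agent,
        "endedAt": datetime.now(timezone.utc).isoformat(),
    }
    logger.info(json.dumps(record))

log_activity("clean_csv", used=["raw/2018.csv"], generated=["clean/2018.parquet"], agent="pgroth")
```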

Finally, there was some interesting work on reconstructing provenance. In particular, I liked Alexander Rasin‘s work on reconstructing the contents of a database from its environment to answer provenance queries.

Also, the IPAW best paper looked at using annotations in a workflow to infer dependency relations:

Kudos to Shawn and Tim for combining theory & practice (logic inference & #YesWorkflow) in powerful new ways! #IPAW2018 Best Paper is available here: https://t.co/te3ce7mV6Y https://t.co/WT0F0PcMz7

— Bertram Ludäscher (@ludaesch) July 27, 2018

Lastly, there was some initial work on extracting the provenance of health studies directly from published literature, which I thought was an interesting way of recovering provenance.

Provenance for Accountability

Another theme (mirrored by the event noted above) was the use of provenance for accountability. This has always been a major use for provenance as pointed out by Bertram Ludäscher in his keynote:

The need for knowing where your data comes from all the way from 1929 @ludaesch #provenanceweek https://t.co/TIgPEOFjxb pic.twitter.com/2NpbSMI699

— Paul Groth (@pgroth) July 9, 2018

However, I think that, due to increasing awareness around personal data usage and privacy, the need for provenance is being recognized. See, for example, the Royal Society’s report on Data management and use: Governance in the 21st century. At Provenance Week, there were several papers addressing provenance for GDPR, see:

I'd like to shamelessly pitch our GDPRov ontology as a superset of this work. The key difference here being justification is used as a legal concept. We use hasLegalBasis as a property. https://t.co/W4V9r2QXwA

— Harshvardhan Pandit (@coolharsh55) July 10, 2018

Also, I was impressed with the demo from Imosphere using provenance for accountability and trust in health data:

Great to be part of #provenanceweek at @KingsCollegeLon, here's Anthony @ScampDoodle demonstrating the data provenance functionality within Atmolytics at yesterday's sessions. To learn more about the benefits of data provenance in analytics go to https://t.co/8NdmN2ECrP pic.twitter.com/ueDQUi7jSG

— Imosphere (@Imosphere) July 11, 2018

Re-computation & Its Applications

Using provenance to determine what to recompute seems to have a number of interesting applications in different domains. Paolo Missier showed, for example, how it can be used to determine when to recompute in next-generation sequencing pipelines.

Our #provenanceWeek IPAW 2018 conference paper on using provenance to facilitate re-computation analysis in the ReComp project. Link to paper: here: https://t.co/zeZ9xROm2S
Link to presentation: https://t.co/w6cVwpdLGT

— Paolo Missier (@PMissier) July 9, 2018

I particularly liked their notion of a re-computation front – the set of past executions you need to re-execute in order to address a change in the data.
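My reading of the re-computation front idea, as a toy sketch: given a dependency graph from data items to the executions that consumed them, the front is whatever is reachable downstream from the changed inputs. The pipeline below is made up.

```python
# Toy sketch of a re-computation front: executions downstream of changed data.
from collections import deque

# node -> nodes that depend on it (a hypothetical sequencing pipeline)
depends_on_me = {
    "raw_reads": ["align_run_1"],
    "reference_genome_v2": ["align_run_1", "align_run_2"],
    "align_run_1": ["variant_call_1"],
    "align_run_2": ["variant_call_2"],
    "variant_call_1": [],
    "variant_call_2": [],
}

def recomputation_front(changed_inputs):
    """Return every execution reachable from the changed inputs."""
    seen, queue = set(), deque(changed_inputs)
    while queue:
        node = queue.popleft()
        for downstream in depends_on_me.get(node, []):
            if downstream not in seen:
                seen.add(downstream)
                queue.append(downstream)
    return seen

# Changing the reference genome forces both alignment runs and both variant calls.
print(recomputation_front({"reference_genome_v2"}))
```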

Wrattler was a neat extension of the computational notebook idea that showed how provenance can be used to automatically propagate changes through notebook executions and support suggestions.

Marta Mattoso‘s team discussed the application of provenance to track the adjustments made when steering executions in complex HPC applications.

The work of Melanie Herschel‘s team on provenance for data integration points to the potential of applying provenance-based recomputation to speed up the iterative nature of data integration, as she enumerated in her presentation at the recomputation workshop.

You can see all the abstracts from the workshop here. I understand from Paolo that they will produce a report from the discussions there.

Overall, I left provenance week encouraged by the state of the community, the number of interesting application areas, and the plethora of research questions to work on.

Random Links

 

Source: Think Links

Posted in Paul Groth, Staff Blogs

Testimonials Digital Humanities minor at DHBenelux2018

At the DHBenelux 2018 conference, students from the VU minor “Digital Humanities and Social Analytics” presented their final DH in Practice work. In this video, the students talk about their experience in the minor and the internship projects. We also meet other participants of the conference talking about the need for interdisciplinary research.

 

Share This:

Source: Victor de Boer

Posted in Staff Blogs, Victor de Boer

Big Data Europe Project ended

All good things come to an end, and that also holds for our great Horizon 2020 project “Big Data Europe“, in which we collaborated with a broad range of technical and domain partners to develop (Semantic) Big Data infrastructure for a variety of domains. VU was involved as work package leader of the Pilot and Evaluation work package and co-developed methods to test and apply the BDE stack in Health, Traffic, Security and other domains.

You can read more about the end of the project in this blog post at the BDE website.

Share This:

Source: Victor de Boer

Posted in Staff Blogs, Victor de Boer

André Baart and Kasadaka win IXA High Potential Award

[Photo: André and his prize]

On 19 June, André Baart was awarded the High Potential Award at the Amsterdam Science & Innovation en Impact Awards for his and W4RA‘s work on the Kasadaka platform.

Kasadaka (“talking box”) is an ICT for Development (ICT4D) platform for developing voice-based technologies for those who are not connected to the Internet, cannot read and write, and speak under-resourced languages.

As part of a longer-term project, the Kasadaka Voice platform and software development kit (VSDK) has been developed by André Baart as part of his BSc and MSc research at VU. In that context it has been extensively tested in the field, for example by Adama Tessougué, journalist and founder of radio Sikidolo in Konobougou, a small village in rural Mali. It was also evaluated in the context of the ICT4D course at VU by 46 master’s students from Computer Science, Information Science and Artificial Intelligence. The Kasadaka is now in Sarawak, Malaysia, where it will soon be deployed in a kampong by Dr. Cheah Waishiang, ICT4D researcher at the University of Malaysia Sarawak (UNIMAS), and students from VU and UNIMAS.

André is currently pursuing his PhD in ICT4D at the Universiteit van Amsterdam and is still a member of the W4RA team.

Share This:

Source: Victor de Boer

Posted in Staff Blogs, Victor de Boer