Last week, I hung out in Bethlehem, Pennsylvania for the 14th International Semantic Web Conference (ISWC). Bethlehem is famous for the Lehigh University Benchmark (LUBM) and Bethlehem Steel. This is the major conference focused on the intersection of semantics and web technologies. In addition to being technically super cool, it was a great chance for me to meet many friends and make some new ones.
Let’s begin with some stats:
- ~450 attendees
- The conference continues to be selective:
  - Research track: 22% acceptance rate
  - Empirical studies track: 29% acceptance rate
  - In-use track: 40% acceptance rate
  - Datasets and Ontologies track: 22% acceptance rate
- There were 265 submissions across all tracks, which, surprisingly, is the same number as last year.
- More stats and info in Stefan’s slides (e.g. move to Portugal if you want to get your papers in the conference.)
- Fancy visualizations courtesy of the STKO group
— Jodi Schneider (@jschneider) October 12, 2015
Before getting into what I thought were the major themes of the conference, a brief note. Reviewing is at the heart of any academic conference. While we can always try to improve review quality, it’s worth calling out good reviewing. The best reviewers were Maribel Acosta (research) and Markus Krötzsch (applied). As datasets and ontologies track co-chair, I can attest to how important good reviewers are. For this new track we relied heavily on reviewers being flexible and looking at these sorts of contributions differently. So thanks to them!
For me there were three themes of ISWC:
- The Spectrum of Entity Resolution
- The Spectrum of Linked Data Querying
- Buy more RAM
The Spectrum of Entity Resolution
Maybe it’s because I attended the NLP & DBpedia workshop, or because of the conversation I had about string similarity with Michelle Cheatham, but one theme I saw was the continued amalgamation of natural language processing (NLP) style entity resolution with database-style entity resolution (i.e. record linkage). This movement stems from the fact that an increasing amount of linked data is a combination of data extracted from semi-structured sources and data extracted using NLP. Conversely, NLP pipelines increasingly rely on those same semi-structured data sources to do NLP.
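To make that amalgamation concrete, here’s a toy sketch (entirely my own invention, not from any of the papers) of scoring a textual mention against database records by combining an NLP-style string similarity with record-linkage-style structured evidence (here, type agreement):

```python
# Toy hybrid entity resolution: NLP-style surface similarity plus
# database-style structured evidence. All data here is made up.

def trigrams(s):
    """Character trigrams of a lowercased, padded string (a common NLP similarity unit)."""
    s = f"  {s.lower()} "
    return {s[i:i + 3] for i in range(len(s) - 2)}

def jaccard(a, b):
    """Jaccard overlap of two trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def score(mention, record):
    """Blend surface similarity with record evidence (type agreement),
    a tiny stand-in for joint NLP + record-linkage resolution."""
    sim = jaccard(mention["surface"], record["name"])
    type_bonus = 0.2 if mention["type"] == record["type"] else 0.0
    return sim + type_bonus

mention = {"surface": "Bethlehem Steel Corp.", "type": "Organization"}
records = [
    {"name": "Bethlehem Steel Corporation", "type": "Organization"},
    {"name": "Bethlehem, Pennsylvania", "type": "Place"},
]
best = max(records, key=lambda r: score(mention, r))
print(best["name"])
```

String similarity alone already prefers the right record here; the structured type signal is what keeps the match robust when surface forms get noisier.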
Probably the best example of this idea is the work Andrew McCallum presented in his keynote on “epistemological knowledge bases”.
Epistemological Knowledgebase – Andrew McCallum pic.twitter.com/GGwOvNxzYk
— Paul Groth (@pgroth) October 20, 2015
Briefly, the idea is to reason jointly over all the information coming from basic low-level NLP (e.g. basic NER, or even surface forms) as well as the knowledge base (plus anything else) to generate a knowledge base. One method for doing this is universal schemas. For a good intro, check out Sebastian Riedel’s slides.
— Lora Aroyo (@laroyo) October 14, 2015
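As a rough illustration of the universal schema idea (a toy of my own, not Riedel’s or McCallum’s actual model): rows are entity pairs, columns mix textual surface patterns with KB relations, and a low-rank logistic factorization fills in unobserved cells:

```python
import numpy as np

# Toy "universal schema": rows are entity pairs, columns mix textual
# surface patterns and KB relations. All data and labels are invented.
pairs = ["(Obama, USA)", "(Hollande, France)", "(Obama, Hawaii)"]
relations = ["X-president-of-Y", "kb:leaderOf", "X-born-in-Y", "kb:bornIn"]

# Labeled cells: 1 = pair seen with that pattern/relation, 0 = known false.
# Cell (1, 1) -- does (Hollande, France) hold kb:leaderOf? -- is held out.
labeled = {
    (0, 0): 1, (0, 1): 1, (0, 2): 0, (0, 3): 0,
    (1, 0): 1,            (1, 2): 0, (1, 3): 0,
    (2, 0): 0, (2, 1): 0, (2, 2): 1, (2, 3): 1,
}

rng = np.random.default_rng(0)
k = 4
P = rng.normal(scale=0.1, size=(len(pairs), k))      # entity-pair embeddings
R = rng.normal(scale=0.1, size=(len(relations), k))  # relation embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# SGD on logistic loss over the labeled cells only.
lr = 0.3
for _ in range(1000):
    for (i, j), y in labeled.items():
        g = sigmoid(P[i] @ R[j]) - y
        P[i], R[j] = P[i] - lr * g * R[j], R[j] - lr * g * P[i]

# Score the held-out cell: the textual pattern evidence should make the
# model infer the KB relation too.
held_out = float(sigmoid(P[1] @ R[1]))
print(held_out)
```

Because the pair seen with the “X-president-of-Y” text pattern lands near the pair known to hold `kb:leaderOf`, the held-out KB cell typically comes out with a high score; that text-and-KB-columns-in-one-matrix trick is the heart of universal schemas.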
From McCallum, I liked the following paper, which gives a good justification for, and results from, doing collective/joint inference.
- A Joint Model for Discovering and Linking Entities. Michael Wick, Sameer Singh, Harshal Pandya, Andrew McCallum. Third International Workshop on Automated Knowledge Base Construction (AKBC), 2013.
(Self promotion aside: check out Sara Magliacane’s work on Probabilistic Soft Logics for another way of doing joint inference.)
Following on from this notion of reasoning jointly, Hulpus, Prangnawarat and Hayes showed how to use the graph-based structure of linked data to perform joint entity and word sense disambiguation from text. Likewise, Prokofyev et al. use the properties of a knowledge graph to perform better co-reference resolution. Essentially, they use this background knowledge to split the clusters of co-referent entities produced by Stanford CoreNLP. On the same idea, but for more structured data, the TableEL system uses a joint model with soft constraints to perform entity linking for web tables, improving performance by up to 75% on web tables. (code & data)
One approach to entity linking that I liked came from Raphael Troncy’s crew, titled “Reveal Entities From Texts With a Hybrid Approach” (paper, slides). (Shouldn’t it be “Revealing…”?) They showed that by using, essentially, the provenance of the data sources, they are able to build an adaptive entity linking pipeline. Thus, one doesn’t necessarily have to do as much domain tuning to use these pipelines.
While not specifically about entity resolution, a paper worth pointing out is Type-Constrained Representation Learning in Knowledge Graphs from Denis Krompaß, Stephan Baier and Volker Tresp. They show how background knowledge about entity types can help improve link prediction tasks for generating knowledge graphs. Again, use the kitchen sink and you’ll perform better.
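The type-constraint trick is easy to picture: prune candidate entities that violate a relation’s declared range before any learned scoring happens. A made-up sketch (the schema and scores below are invented, not from the paper):

```python
# Sketch of type-constrained link prediction: candidates whose type
# violates the relation's range are pruned before scoring. The schema,
# entities, and scores are all hypothetical.

entity_types = {
    "Berlin": "City",
    "Germany": "Country",
    "Angela_Merkel": "Person",
}

# Each relation declares the type its object must have (its "range").
relation_range = {"capitalOf": "Country", "bornIn": "City"}

# Stand-in for a learned scoring function (e.g. from embeddings); note the
# nonsense triple that an unconstrained model might still score highly.
raw_scores = {
    ("Berlin", "capitalOf", "Germany"): 0.9,
    ("Berlin", "capitalOf", "Angela_Merkel"): 0.8,
}

def typed_candidates(relation, candidates):
    """Keep only candidate objects compatible with the relation's range."""
    required = relation_range[relation]
    return [c for c in candidates if entity_types[c] == required]

allowed = typed_candidates("capitalOf", ["Germany", "Angela_Merkel"])
best = max(allowed, key=lambda c: raw_scores[("Berlin", "capitalOf", c)])
print(allowed, best)
```

The background knowledge never lets the nonsense candidate into the ranking at all, which is the “use the kitchen sink” point: cheap schema information does work the embeddings don’t have to.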
There were a couple of good resources presented for entity resolution tasks. Bryl, Bizer and Paulheim produced a dataset of surface forms for DBpedia entities. They were able to boost performance by up to 20% for extracting accurate surface forms for entities through filtering. Another tool, LANCE, looks great for systematically generating benchmark and test sets for instance matching (i.e. entity linking). Also, Michel Dumontier presented work that includes a benchmark for entity linking in the life sciences domain.
Finally, as we get better at entity resolution, I think people will turn towards fusion (getting the best possible representation for a real world entity). Examples include:
- Collecting, Integrating, Enriching and Republishing Open City Data as Linked Data. Check out slide 5 and on.
- KR2RML – using the nested relational model to do munging.
- Effective Online Knowledge Graph Fusion
The Spectrum of Linked Data Querying
— Juan Sequeda (@juansequeda) October 12, 2015
So Linked Data Fragments from Ruben Verborgh was the huge breakout of the conference. Oscar Corcho’s excellent COLD keynote was a riff on thinking about the spectrum (from data dumps through to full SPARQL queries) that Ruben introduced. Another example was the work of Maribel Acosta and Maria-Esther Vidal on “Networks of Linked Data Eddies: An Adaptive Web Query Processing Engine for RDF Data”. They developed an adaptive client-side SPARQL query engine for linked data fragments. This allows the server side to support a much simpler API by having a more intelligent client side. (An aside: kids, this is how a technical talk should be done. Precise, clean, technical, understandable. Can’t wait to have the video lecture for reference.)
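The division of labor behind Triple Pattern Fragments is simple to sketch: the server only ever answers single triple patterns, and the client stitches them together. Here’s a toy, network-free version of my own (the “server” is just a local list of triples standing in for an HTTP fragment endpoint, and the data is invented):

```python
# Toy Triple-Pattern-Fragments-style setup: a dumb server, a smart client.
# The triples below are made up for illustration.
TRIPLES = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("bob", "livesIn", "Bethlehem"),
    ("carol", "livesIn", "Eindhoven"),
]

def fragment(s=None, p=None, o=None):
    """All a TPF-style server must do: match ONE triple pattern
    (None plays the role of a variable)."""
    return [t for t in TRIPLES
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Client-side evaluation of the basic graph pattern
#   ?x knows ?y . ?y livesIn ?city
# via a nested-loop join: one fragment request per intermediate binding.
results = []
for (x, _, y) in fragment(p="knows"):
    for (_, _, city) in fragment(s=y, p="livesIn"):
        results.append((x, y, city))
print(results)
```

The server never sees the join, only cacheable single-pattern requests; all the query smarts (and all the approaches on the spectrum, like Acosta and Vidal’s eddies) live on the client.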
Even the most centralized solution, the LOD Laundromat, which is a clean crawl of the entire web of data, supports Linked Data Fragments. In some sense, by asking the server to do less you can handle more linked data, and thus do more powerful analysis. This is exemplified by the best paper, LOD Lab, by Laurens Rietveld, Wouter Beek, and Stefan Schlobach, which allowed for the reproduction of three existing analyses of the web of data at scale.
I think Olaf Hartig, in his paper on LDQL, framed the problem best as (N, Q) (slides). First define the “crawl” of the web you want to query (N), and then define the query (Q). When we think about what and where our crawls are, we can think about what execution strategies and types of queries we can best support. Or put another way:
— Harald Sack (@lysander07) October 13, 2015
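Hartig’s (N, Q) framing can be mimicked in a few lines: first a navigational phase that decides which documents are in scope, then a query phase over only those documents. A toy sketch with an invented web of documents (and `seeAlso` links standing in for LDQL’s link path expressions):

```python
# Toy (N, Q) evaluation: N picks the documents, Q queries only their
# triples. The "web" below is entirely made up.
DOCS = {
    "doc1": [("alice", "knows", "bob"), ("doc1", "seeAlso", "doc2")],
    "doc2": [("bob", "knows", "carol"), ("doc2", "seeAlso", "doc3")],
    "doc3": [("carol", "knows", "dave")],
}

def crawl(seed, max_depth):
    """N: documents reachable from the seed via seeAlso links within max_depth."""
    frontier, seen = [(seed, 0)], set()
    while frontier:
        doc, depth = frontier.pop()
        if doc in seen or doc not in DOCS:
            continue
        seen.add(doc)
        if depth < max_depth:
            for s, p, o in DOCS[doc]:
                if p == "seeAlso":
                    frontier.append((o, depth + 1))
    return seen

def query(docs, p):
    """Q: a one-pattern query (?s p ?o) over the crawled documents only."""
    return [(s, o) for d in docs for s, pp, o in DOCS[d] if pp == p]

n = crawl("doc1", max_depth=1)  # reaches doc1 and doc2, but not doc3
print(sorted(n), sorted(query(n, "knows")))
```

The same Q gives different answers under different Ns, which is exactly why pinning down the crawl first makes the execution-strategy question tractable.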
More Main Memory = Better Triple Stores
Designing scalable graph/triple stores has always been a challenge. We’ve been trapped by the limits of RAM. But computer architecture is changing, and we now have systems with a lot of main memory, either in one machine or across multiple machines. This is a boon to triple stores and graph processing in general. See, for example, the work from Jure Leskovec’s team at SIGMOD:
- Ringo: Interactive Graph Analytics on Big-Memory Machines by Y. Perez, R Sosic, A. Banerjee, R. Puttagunta, M. Raison, P. Shah, J. Leskovec. SIGMOD 2015.
We saw that theme at ISWC as well:
- The folks at Oxford presented RDFox: A Highly-Scalable RDF Store designed for huge main memory machines. It also has a sweet reasoner built in. Datalog in a triple store – sweet.
- Olivier Curé’s work on the beginnings of a scalable triple store over Apache Spark. (On the Evaluation of RDF Distribution Algorithms Implemented over Apache Spark)
- Cray’s Urika appliance for RDF, which again leverages massive RAM and parallel threads. (I enjoyed David Mizell’s talk on its development in the scalable knowledge base workshop.)
- Stardog is also a completely in-memory store.
- …but BlazeGraph uses GPUs – sweet – It’s always fun talking to Bryan Thompson.
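The common trick across these systems can be sketched as an all-in-RAM triple index: keep multiple permutations (SPO, POS, OSP) as hash maps so any triple pattern is a couple of dictionary lookups rather than a disk scan. (A bare-bones toy of mine, nothing like what RDFox or Urika actually do internally:)

```python
from collections import defaultdict

class MemoryTripleStore:
    """Minimal in-memory triple store: one hash index per access path,
    trading RAM for lookup speed. Purely illustrative."""

    def __init__(self):
        self.spo = defaultdict(lambda: defaultdict(set))  # subject -> predicate -> objects
        self.pos = defaultdict(lambda: defaultdict(set))  # predicate -> object -> subjects
        self.osp = defaultdict(lambda: defaultdict(set))  # object -> subject -> predicates

    def add(self, s, p, o):
        # Every triple is stored three times -- that's the RAM cost.
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.osp[o][s].add(p)

    def objects(self, s, p):
        """?o for fixed subject and predicate: two hash lookups."""
        return self.spo[s][p]

    def subjects(self, p, o):
        """?s for fixed predicate and object."""
        return self.pos[p][o]

store = MemoryTripleStore()
store.add("Bethlehem", "locatedIn", "Pennsylvania")
store.add("Lehigh", "locatedIn", "Pennsylvania")
print(sorted(store.subjects("locatedIn", "Pennsylvania")))
```

Triple the storage, constant-time access to any pattern: once the whole graph fits in main memory, that trade is an easy one to make.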
Moral of the story: Buy RAM

Random Notes
- Semantic web people are surprisingly good at dancing …
- Congrats Peter Mika
- Beyond Owl
- What I said about ISWC 2014
- Surprised that the casinos let you smoke indoors – ugh
- ☑ and ❌ seem to be a thing on presentations.
- Tobias Kuhn’s ISWC trip report
- Juan, if I come to Austin will you drink a bud light?
- Optional HTML paper submissions in 2016, why not?
- Ontologies pop up in interesting places: Financial Industry Business Ontology
- Lehigh University has a beautiful campus.
- Markdown + RDFa + nanopublications + prov – @jpmccu
- Drug-Drug interactions as nanopublications
- preprints are on the website
- Spies are everywhere!
- Thomson Reuters does linked data.
- The semantic web challenge is getting professional.
- PROV everywhere (1, 2)