Trip report: Web Science summer school, Southampton

Last week, I had the chance to participate in the Web Science summer school in Southampton. For one week, I was listening to keynotes and tutorials, participating in discussions, and coding away, experimenting with a few new web-dev technologies. I also had a short trip to beautiful Winchester, in the English countryside. A big thanks to Elena Simperl, and all of the organizers, for putting together this nice event, as well as the Network Institute for sponsoring my trip. Here’s a quick summary of what was going on.


WebSci summer school in Southampton starts off with @DameWendyDBE – so excited! #ageofdata

— Anca Dumitrache (@anouk_anca) July 21, 2014

The summer school kicked off with a keynote from Dame Wendy Hall on observing the Web. She discussed the history of Web Science and how it emerged as an interdisciplinary field in the larger context of the history of the Web, and her work with Tim Berners-Lee. Openness and interlinking have been core attributes of the Web since its inception, and as the Web grows, we see both the emergence of large-scale collaborative environments (e.g. Wikipedia) and the aggregation of users into curated communities (e.g. Twitter). This added component of social activity on top of the technical part of Web publishing created the need for an integrated approach to studying the Web.

Web Science is the theory and practice of social machines @DameWendyDBE #ageofdata

— Anca Dumitrache (@anouk_anca) July 21, 2014

The following day, the focus of the discussions shifted to the Semantic Web, starting with a keynote from Jim Hendler on the history of the Semantic Web, and an overview of the technologies associated with the field, starting with RDF. A particularly interesting aspect of this history is the slow adoption process from the academic to the production-ready environment. The maturation of semantic technologies (e.g. embedded markup, user-friendly ontologies/schemes) definitely had an impact on this, as did the growth of the Web and the emergence of big data processing. The takeaway message is that web semantics have slowly become a necessary part of the Web.

Data has the most power when it’s aggregated – @jahendler keynote #ageofdata

— Anca Dumitrache (@anouk_anca) July 22, 2014

In keeping with this theme, Sir Nigel Shadbolt spoke about one of the most successful use cases for linked data: the field of open government. The Open Data Institute, where Shadbolt is currently a collaborator, has been promoting the publication of semantic data belonging to the UK government.

Making data open ends up improving data quality, accuracy in the long run – @Nigel_Shadbolt keynote #ageofdata

— Anca Dumitrache (@anouk_anca) July 22, 2014

Wednesday brought a more unusual keynote, with science fiction writer Cory Doctorow giving us his take on open publishing, DRM and copyright. Doctorow provided an important perspective on the harm of overly restrictive publishing standards, from the point of view of the authors who supposedly stand to benefit the most from these standards. He made the case that digital locks on content block the adoption of new features and technologies. Inadvertently, this can end up promoting security weaknesses that impact both the consumers and the publishers. While the perfect open system for publishing does not exist at the moment, this does not invalidate the need for moving towards more open copyright and publishing standards.

Urinary tract infection business model: getting new features is a long, painful process – @doctorow on copyright restrictions #ageofdata

— Anca Dumitrache (@anouk_anca) July 23, 2014

Chris Welty, my supervisor and founding member of our CrowdTruth team, also gave a keynote, discussing the inner workings of IBM Watson and the Jeopardy! challenge. By being able to compete (and win!) against humans in a question answering (or answer questioning, as per the Jeopardy! format) challenge, Watson introduced a new software paradigm, where large datasets of human-generated knowledge are modelled for machine understanding. In this case, the real challenge is not the knowledge acquisition, but the modelling of the reasoning process for question answering. This is fundamentally different from the way humans approach the same task, which explains why the mistakes that Watson makes (gaps in the understanding of how various content features influence the answer, as in the infamous What is Toronto?????) are completely different from the ones humans make, which are usually due to gaps in knowledge.

#AgeofData, Chris Welty: machine intelligence is NOT human intelligence: make mistakes of different type #IBMWatson

— Lora Aroyo (@laroyo) July 24, 2014

The final keynote came from another familiar figure, Prof. Guus Schreiber from the VU University Amsterdam, with a talk about knowledge and ontology engineering. As a co-chair of the W3C RDF working group, he gave us a useful primer on how ontologies should be used for knowledge sharing. The key message was that vocabulary reuse and alignment is the way to go, in keeping with the Web principles of knowledge sharing and interlinking. This comes with one caveat, however: having a unified vocabulary for all human knowledge is just not possible (as we have also seen in our CrowdTruth work with language ambiguity), so the task of knowledge representation is constantly evolving.

Papers about your own idiosyncratic “university ontology” should be rejected at conferences – @GuusSchreiber keynote #AgeOfData

— Anca Dumitrache (@anouk_anca) July 25, 2014

Tutorials + hands-on

The summer school also offered a series of tutorials coupled with hands-on sessions on a wide variety of topics related to Web Science. We started off with Claudia Wagner’s tutorial on data analytics and data science in general. She gave an overview of the most basic statistical analysis techniques, quickly going through topics from linear algebra to probability distributions to regression. It was also encouraging to see others promote R as a good tool for statistical programming, as this is currently one of my favourite languages to work with.

Next, Max Van Kleek talked us through methods for data visualization, and gave a short primer on D3. I particularly enjoyed the nods to Edward Tufte, as well as this famous graph by Charles Joseph Minard, charting the advance of Napoleon’s troops into Russia.

The second day had tutorials on data publishing and Web observatories. First, Barry Norton talked to us about the differences between Semantic Web data and data in typical table formats (e.g. CSV), XML, SOAP/REST or JSON. In short, having the added dimension of relations between entities makes it possible to express hierarchies that are extensible and easy to edit, while unique identifiers for entities make interlinking with existing web data easier. This talk was coupled with a SPARQL tutorial on a dataset from the British Museum. Ramine Tinati also talked about representing and understanding big data, this time from the perspective of Web observatories (also mentioned in the opening keynote by Wendy Hall).

Why not CSVs or RDBMS? @BarryNorton arguing against Zoidberg about the need for data semantics #ageofdata

— Anca Dumitrache (@anouk_anca) July 22, 2014
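
To make the contrast between a table row and linked data a bit more concrete, here is a minimal sketch in plain Python. It is my own illustration, not something shown in the tutorial: the example.org URIs, the property names and the helper functions are all made up, and the data has nothing to do with the actual British Museum dataset.

```python
# Hypothetical namespace and data, invented purely for illustration.
EX = "http://example.org/"

# A CSV-style row: what "type" means, and how "Vase" relates to other
# categories, is not part of the data itself.
row = {"id": "42", "name": "Amphora", "type": "Vase"}

# The same facts as subject-predicate-object triples. Every entity has a
# unique identifier (a URI), and relations between entities are data too.
triples = {
    (EX + "object/42", EX + "name", "Amphora"),
    (EX + "object/42", EX + "type", EX + "Vase"),
    (EX + "Vase", EX + "subClassOf", EX + "Container"),
    (EX + "Container", EX + "subClassOf", EX + "Artefact"),
}


def superclasses(cls, triples):
    """Follow subClassOf links upward and return every ancestor of cls."""
    found = []
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == EX + "subClassOf" and o not in found:
                found.append(o)
                frontier.append(o)
    return found


def types_of(entity, triples):
    """Return the direct types of an entity plus all inherited ones."""
    result = [o for s, p, o in triples if s == entity and p == EX + "type"]
    for direct_type in list(result):
        for ancestor in superclasses(direct_type, triples):
            if ancestor not in result:
                result.append(ancestor)
    return result


print(types_of(EX + "object/42", triples))
```

Extending the hierarchy only takes one more triple (for instance, a hypothetical `Artefact subClassOf Thing`), with no change to a table schema, which is roughly the extensibility argument from the talk.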

Finally, Lora Aroyo, the leader of our CrowdTruth team, gave a tutorial on crowdsourcing and human computing. We went through the history of human computing, from the Math Rosies of World War II computing weapon trajectories, to the Games With a Purpose revolution pioneered by Luis von Ahn with the ESP game for image tagging. Nowadays, crowdsourcing is studied as a solution for computationally difficult tasks that involve unquantifiable human knowledge (e.g. language understanding). We also touched on some of the core tenets of the CrowdTruth approach — disagreement as an inherent property of crowdsourcing systems, ways to measure and understand its effects, and the shift from a rigid Ground Truth to the less prescriptive CrowdTruth approach, which attempts to account for the variation in human semantics. We also ran through a hands-on session where we were asked to design tasks for some of the CrowdTruth use cases relating to text and image processing. This generated some good discussion on ways to fragment micro-tasks in order to engage the workers, and methods for performing verification of the workers’ output.

Resistance is futile #ageofdata #hcomp #websci w/ @laroyo

— Elena Simperl (@esimperl) July 23, 2014

Poster presentations

As a means to get to know each other and get familiar with the work that the participants in the summer school are doing, we presented posters of our latest research work to our peers. I showed a poster on the CrowdTruth approach and how it works in the medical domain, which received some good feedback and went on to win the Best Poster award at the end of the summer school. Most seemed to appreciate the balance between written content and graphics, as well as the discussion generated around it.

Project work

The final product of our time at the summer school was a group project where we were asked to explore a Web Science topic of our choice, in the hopes of setting up future collaborations. I worked together with Seyi Feyisetan and Fabio Benedetti on designing a Web observatory for studying tech communities, and ways to facilitate communication between them through the cross-posting of relevant content. We studied the use case of comparing Hacker News, a news aggregating website for IT-related topics, and Stack Overflow (which probably needs no introduction). We performed an exploratory data analysis of the most frequent topics discussed in these communities, the results of which are detailed in our project slides.

Assorted thoughts

The summer school program covered a wide array of topics, some of which I was already familiar with, and some that were completely new. Through a nice selection of keynotes and tutorials, we received a thorough overview of the current state of the art in Web Science. I also appreciated the diverse crowd of students, from social scientists to machine learning enthusiasts, which I believe is one of the main driving forces behind Web Science innovation. Probably the only thing I would have changed is having a more flexible schedule, and being able to spend more time on the topics that I was unfamiliar with. We were also lucky enough to enjoy a full week in England without any rain! This made our excursion to the medieval city of Winchester all the more lovely. All in all, it was a great experience.

For more details on what went on, you can check out this very thorough document of open notes taken during the event, set up by Gianfranco Cecconi.


