Discovering the underlying structure of controversial issues with topic modeling

[This post is by Tibor Vermeij about his Master's project]

For his Master's project in the Information Sciences programme at the Vrije Universiteit, Tibor Vermeij investigated a method for discovering the structure of controversial issues on the web. The project was done in collaboration with the Controcurator project.

Detecting controversy computationally has been getting more and more attention. Because a lot of data is available digitally, controversy detection methods that use machine learning and natural language processing techniques have become more common. However, many studies try to detect the controversy of articles, blog posts or tweets individually. The relations between controversial entities on the web are not often explored.

To explore the structure of controversial issues, a combination of topic modeling and hierarchical clustering was used. Topic modeling was used to discover the content discussed in a set of Guardian articles. The resulting topics were then used as input for a hierarchical agglomerative clustering algorithm to find the relations between articles.
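The clustering step can be illustrated with a minimal sketch. This is not the pipeline from the thesis: the per-article topic distributions below are made-up values standing in for topic model output, and the choice of cosine distance, average linkage and the function names are assumptions made for illustration only.

```python
from itertools import combinations
from math import sqrt


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two topic-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def agglomerative_clusters(vectors, n_clusters):
    """Average-linkage agglomerative clustering (illustrative helper).

    Starts with one cluster per article and repeatedly merges the two
    closest clusters until n_clusters remain. Returns the final clusters
    (lists of article indices) and the merge history, which records the
    hierarchy from the bottom up.
    """
    clusters = [[i] for i in range(len(vectors))]
    merges = []

    def linkage(a, b):
        # Average pairwise distance between members of the two clusters.
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest average distance.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        merges.append((clusters[x], clusters[y]))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters, merges


# Hypothetical topic distributions for five articles over three topics
# (each row sums to 1). Real values would come from the topic model.
article_topics = [
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.10, 0.10, 0.80],
]

clusters, merges = agglomerative_clusters(article_topics, n_clusters=3)
print(clusters)
```

In this toy example, articles that put most of their weight on the same topic end up in the same cluster, and the order of the merges gives the hierarchy between clusters.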

The clusters were evaluated with a user study. A questionnaire was sent out that tested the performance of the pipeline in three categories: the similarity of articles within a cluster; the cohesion of the clusters and the hierarchy; and the relation between the controversy of single articles and the controversy of their corresponding clusters.

The questionnaire showed promising results. The approach can be used to get an indication of the general content of the articles. Articles within the same cluster were more similar than articles from different clusters, which means that the chosen clustering method resulted in coherent topics in the controversial clusters that were retrieved. Opinions on controversy itself showed high variance between participants, reinforcing the subjectivity of human controversy estimation. While the deviation between the individual assessments was quite high, averaged rater scores were comparable to calculated scores, which suggests a correlation between the controversy of articles within the same cluster.

The full thesis can be found here

The presentation can be found here

