Historical Toponym Disambiguation

[This blog post is based on the Information Sciences Master's thesis of Bram Schmidt, conducted at the KNAW Humanities Cluster and the IISG. It reuses text from his thesis]

Place names (toponyms) are highly ambiguous and may change over time. This makes it hard to link mentions of places to their corresponding modern entities and coordinates, especially in a historical context. We focus on a historical toponym disambiguation approach in which entity linking is based on context toponyms identified in the surrounding text.

The thesis specifically looks at the American Gazetteer. Its entries typically mention the major places in the vicinity of the place being described. By identifying and exploiting these context toponyms, we aim to estimate the most likely position of the historical entry and link it to its corresponding contemporary counterpart.

Example of a toponym in the Gazetteer

In this case study, Bram Schmidt therefore examined the toponym recognition performance of the state-of-the-art Named Entity Recognition (NER) tools spaCy and Stanza on historical texts, and tested two new heuristics to facilitate efficient entity linking to the GeoNames geographical database.
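
To illustrate the toponym recognition step, here is a minimal sketch of extracting place-type entities from a gazetteer-style lemma with spaCy. The lemma text, the choice of the small English model, and the label set are illustrative assumptions; the thesis evaluates both spaCy and Stanza on the actual gazetteer entries.

```python
import spacy

# Requires an installed English model, e.g.: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Invented lemma in the style of a gazetteer entry (not an actual record).
lemma = ("BRISTOL, a township in Ontario county, New York, "
         "about 12 miles south of Canandaigua.")

doc = nlp(lemma)

# Keep only entities whose label denotes a place.
toponyms = [ent.text for ent in doc.ents if ent.label_ in {"GPE", "LOC", "FAC"}]
print(toponyms)  # e.g. ['Ontario', 'New York', 'Canandaigua']
```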

Experiments with different geo-distance heuristics show that context toponyms can indeed be used to disambiguate place names.

We tested our method against a subset of manually annotated records of the gazetteer. The results show that both NER tools perform insufficiently at automatically identifying the relevant toponyms in the free text of a historical lemma. However, exploiting correctly identified context toponyms by calculating the minimal distance between candidate locations and those context toponyms proves successful, and combining both heuristics into a single algorithm improves recall.
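
The minimal-distance heuristic can be sketched roughly as follows: for each GeoNames candidate of the target toponym, compute its distance to the coordinates of the context toponyms and keep the candidate that lies closest to them. The snippet below is only a simplified illustration under our own assumptions (hand-picked candidate records with approximate coordinates, plain haversine distance); it is not the exact algorithm or data from the thesis.

```python
from math import asin, cos, radians, sin, sqrt

def haversine(p, q):
    """Great-circle distance in kilometres between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def disambiguate(candidates, context_coords):
    """Return the candidate whose minimal distance to any context toponym is smallest."""
    return min(candidates,
               key=lambda c: min(haversine(c["coord"], p) for p in context_coords))

# Hypothetical candidate records for the ambiguous entry "Bristol"
# (names and coordinates are approximate and for illustration only).
candidates = [
    {"name": "Bristol (Ontario county, NY)", "coord": (42.83, -77.45)},
    {"name": "Bristol (RI)", "coord": (41.68, -71.27)},
]
# Coordinates of a context toponym mentioned in the same lemma (Canandaigua, NY).
context_coords = [(42.89, -77.28)]

print(disambiguate(candidates, context_coords)["name"])  # Bristol (Ontario county, NY)
```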

Bram’s thesis was co-supervised by Marieke van Erp and Romke Stapel. His thesis can be found here [pdf]

Source: Victor de Boer

Posted in Staff Blogs, Victor de Boer

InTaVia project started

From 1 November 2020, we are collaborating on connecting tangible and intangible heritage through knowledge graphs in the new Horizon 2020 project "InTaVia".

To facilitate access to rich repositories of tangible and intangible assets, new technologies are needed that enable their analysis, curation and communication for a variety of target groups without computational or technological expertise. In the face of many large, heterogeneous, and unconnected heritage collections, we aim to develop supporting technologies to better access and manage tangible and intangible cultural heritage (CH) data and topics, to better study and analyze them, to curate, enrich and interlink existing collections, and to better communicate and promote their inventories.

tangible and intangible heritage (image from the project proposal)

Our group will contribute to the shared research infrastructure and will be responsible for developing a generic solution for connecting linked heritage data to various visualization tools. We will work on various user-facing services, develop an application shell and front-end for this connection, and be responsible for evaluating the usability of the integrated InTaVia platform for specific user groups. This project will allow for novel user-centric research on topics in Digital Humanities, Human-Computer Interaction and Linked Data service design.

screenshot of the virtual kickoff meeting

Source: Victor de Boer

Posted in Staff Blogs, Victor de Boer

Automating Authorship Attribution

[This blog post was written by Nizar Hirzalla and describes his VU Master AI project conducted at the Koninklijke Bibliotheek (KB), co-supervised by Sara Veldhoen]

Authorship attribution is the process of correctly attributing a publication to its corresponding author, which is often done manually in real-life settings. This task becomes inefficient when there are many candidates to choose from because several authors share the same name. Authors can be characterized by features found in their publications, which suggests that machine learning could automate this process. However, authorship attribution introduces a typical class imbalance problem, due to the vast number of possible labels in a supervised machine learning setting. To complicate matters further, we use problematic input data, as this mimics the data actually available to many institutions: data that is heterogeneous and sparse in nature.

Inside the KB (photo S. ter Burg)

The thesis investigates how authorship attribution can be automated given these known problems and this type of input data, and whether automation is feasible in the first place. It considers children's literature and publications that can have between 5 and 20 potential authors (all with exactly the same name). We implement different types of machine learning methodologies for this task. In addition, we consider all available types of data (as provided by the National Library of the Netherlands), as well as the integration of contextual information.

Furthermore, we consider different computational representations of the textual input (such as the title of the publication) in order to find the most effective representation of sparse text that can serve as input to a machine learning model. These experiments are preceded by a pipeline that consists of data pre-processing, feature engineering and selection, conversion of the data to vector space representations, and integration of linked data. This pipeline is shown to improve performance on the heterogeneous data inputs.
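
As an indication of what the representation step can look like, the sketch below builds a TFIDF representation of short publication titles and trains a simple classifier over candidate author identifiers. It uses scikit-learn and invented toy titles, and only illustrates the representation-plus-classification idea; it is not the actual KB pipeline, model, or data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy data: short, sparse titles labelled with disambiguated author identifiers.
titles = [
    "De kleine draak leert vliegen",
    "Het grote bos avonturenboek",
    "Draken en ridders in de lage landen",
    "Avonturen op de kinderboerderij",
]
authors = ["author_A", "author_B", "author_A", "author_B"]

# Character n-grams cope better with very short titles than word unigrams do.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(titles, authors)

print(model.predict(["Het draken avonturenboek"]))  # predicted author identifier
```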

Implemented neural network architectures for TFIDF (left) and Word2Vec (right) based text classification

Ultimately, the thesis shows that automation can be achieved in up to 90% of the cases, and that it can significantly reduce the cost and time of authorship attribution in a real-world setting, thus facilitating more efficient work procedures. Along the way, the thesis establishes the following key findings:

  1. Two machine learning methodologies are compared: author classification and similarity learning. Author classification yields the best raw performance (F1 0.92), but similarity learning provides more robust predictions and better explainability (F1 0.88). For a real-life setting with end users the latter is recommended, as it is more suitable for integrating machine learning into cataloguers' workflows, at only a small cost in performance (a simplified sketch of the similarity-based approach is shown below the summary).
  2. Adding contextual information increases performance, but the gain depends on the type of information included. Publication metadata and biographical author information are considered for this purpose. Publication metadata yields the best performance (predominantly the publisher and year of publication), whereas biographical author information actually hurts performance.
  3. We consider BERT, word embeddings (Word2Vec and fastText) and TFIDF as representations of the textual input. BERT ultimately gives the best performance, with up to a 200% performance increase compared to word embeddings. BERT is a transformer-based language model, which yields a richer semantic representation of the text that can be used to identify the associated authors.
  4. Based on surveys and interviews, we also find that end users mostly attach importance to author-related information when performing manual authorship attribution. Looking more closely at the machine learning models, we see that these primarily base their predictions on publication metadata features. We find that such differences in the perceived importance of information need not lead to negative experiences, as multiple options exist for harmonizing how both parties use the information.
Summary of the final performance of the best-performing models from the different implemented methodologies
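
To illustrate how the similarity-learning setup differs from direct author classification (see finding 1 above), the sketch below ranks candidate authors by the cosine similarity between a new title and the titles already attributed to each candidate. This is a simplified nearest-neighbour stand-in under our own assumptions (toy titles, TFIDF vectors), not the trained similarity model from the thesis.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented toy data: publications already attributed to each candidate author.
known = {
    "author_A": ["De kleine draak leert vliegen", "Draken en ridders in de lage landen"],
    "author_B": ["Avonturen op de kinderboerderij", "Het grote bos avonturenboek"],
}
new_title = "Het draken avonturenboek"

# Fit one vectorizer on all titles so they share the same vector space.
all_titles = [t for ts in known.values() for t in ts] + [new_title]
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit(all_titles)

# Score each candidate by the highest similarity between the new title and
# any of that author's known titles, then rank the candidates.
scores = {
    author: float(cosine_similarity(vectorizer.transform([new_title]),
                                    vectorizer.transform(titles)).max())
    for author, titles in known.items()
}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```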

Source: Victor de Boer

Posted in Staff Blogs, Victor de Boer

SEMANTiCS 2020 Open Access proceedings

This year's SEMANTiCS conference was a weird one. Like so many other conferences, we had to improvise to deal with the COVID-19 restrictions on travel and event organization. With the help of many people behind the scenes, including the wonderful program chairs Paul Groth and Eva Blomqvist, we did have a relatively normal reviewing process for the Research and Innovation track. In the end, 8 papers were accepted for publication in this year's proceedings. The authors were then asked to present their work in pre-recorded videos, which were shown in a very nice webinar, together with contributions from industry. All in all, we feel this downscaled version of SEMANTiCS was quite successful.

The Open Access proceedings are published in the Springer LNCS series and are now available at https://www.springer.com/gp/book/9783030598327

All presentation videos can be watched at https://2020-eu.semantics.cc/ (program/recordings->videos).

And stay tuned for announcements of SEMANTiCS 2021!!

Source: Victor de Boer

Posted in Staff Blogs, Victor de Boer

Listening to AI: ARIAS workshop report

Last week, I attended the second workshop of the ARIAS working group on AI and the Arts. ARIAS is a platform for research on Arts and Sciences and as such seeks to build a bridge between these disciplines. The new working group looks specifically at the interplay between Arts and AI. Interestingly, this is not only about using AI to make art, but also about exploring what art can do for AI (research). The workshop fell under the ARIAS theme "Art of Listening to the Matter" and consisted of a number of keynote talks and workshop presentations and discussions.

The workshop at the super-hip Butcher’s Tears in Amsterdam, note the 1.5m COVID-distance.

UvA university professor Tobias Blanke kicked off the meeting with an interesting overview of the different 'schools' of AI and how they relate to the humanities. Also quite interesting was the talk by Sabine Niederer (a professor of visual methodologies at the HvA) and Andy Dockett. They presented the results of an experiment in which Climate Fiction (cli-fi) texts were fed to the famous GPT language model. The results were then aggregated, filtered and visualized in a number of riso-print-like pamphlets.

My favourite talk of the day was by writer and critic Flavia Dzodan. Her talk was quite incendiary, presenting a post-colonial perspective on the whole notion of data science. Her point was that data science only truly started with the 'discoveries' of the Americas, the subsequent slave trade, and the counting of people that this required. She then proceeded to point out some of the more nefarious examples of identification, classification and other data-driven ways of dealing with humans, especially those from marginalized groups. Her activist/artistic angle on this problem was quite interesting to me, as it tied together themes around representation and participation that appear in the field of ICT4D with those found in AI and (Digital) Humanities. Food for thought, at least.

The afternoon was reserved for talks by three artists who wanted to highlight various views on AI and art. Femke Dekker, S. de Jager and Martina Raponi all showed art projects that in some way used AI technology, and reflected on the practice and its philosophical implications. Here too GPT popped up a number of times, alongside other methods of visual analysis and generative models.

Source: Victor de Boer

Posted in Staff Blogs, Victor de Boer

Exploring West African Folk Narrative Texts Using Machine Learning

It is so nice when two quite distinct research lines come together. In my case, Digital Humanities and ICT for Development rarely meet directly, but they certainly did when Gossa Lô started her Master AI thesis. Gossa, a long-time collaborator in the W4RA team, chose to focus on the opportunities of Machine Learning and Natural Language Processing for West African folk tales. Her research involved constructing a corpus of West African folk tales, performing various classification and text generation experiments, and even included a field trip to Ghana to elicit information about folk tale structures. The work, done as part of an internship at Bolesian.ai, resulted in a beautiful Master AI thesis, which was awarded a very high grade.

As a follow-up, we decided to rewrite the thesis into an article and submit it to a DH or ICT4D journal. This proved more difficult than expected. Both DH and ICT4D are very multidisciplinary in nature, and the combination of the two proved a bit too much for many journals, with our article being either too technical, not technical enough, or too far out of scope.

But now, the article "Exploring West African Folk Narrative Texts Using Machine Learning" has been published (Open Access) in a special issue of Information on Digital Humanities!

Experiment 1: RNN network architecture of word-level (left) and character-level (right) models
T-SNE visualisation of the 2nd experiment

The paper examines how machine learning (ML) and natural language processing (NLP) can be used to identify, analyze, and generate West African folk tales. Two corpora of West African and Western European folk tales were compiled and used in three experiments on cross-cultural folk tale analysis:

  1. In the text generation experiment, two types of deep learning text generators are built and trained on the West African corpus. We show that although the generated texts vary in semantic and syntactic coherence, each of them contains West African features.
  2. The second experiment further examines the distinction between the West African and Western European folk tales by comparing the performance of an LSTM (acc. 0.79) with a BoW classifier (acc. 0.93), indicating that the two corpora can be clearly distinguished in terms of vocabulary; a minimal sketch of such a bag-of-words setup is shown below the list. An interactive t-SNE visualization of a hybrid classifier (acc. 0.85) highlights the culture-specific words for both.
  3. The third experiment describes an ML analysis of narrative structures. Classifiers trained on parts of folk tales according to the three-act structure are quite capable of distinguishing these parts (acc. 0.78). Common n-grams extracted from these parts not only underline cross-cultural distinctions in narrative structures, but also show the overlap between verbal and written West African narratives.
Example output of the word-level text generator on translated West African folk tale fragments
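
As referenced in the second experiment above, the following is a minimal sketch of a bag-of-words classification setup for separating the two corpora. It uses scikit-learn and invented one-sentence "tales"; the actual experiments use the full folk tale corpora and additionally an LSTM and a hybrid classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented one-sentence "tales"; the real corpora contain full West African
# and Western European folk tale texts.
tales = [
    "Anansi the spider tricked the chief of the village near the baobab tree.",
    "The tortoise carried the calabash of wisdom across the savannah.",
    "The princess in the castle waited for the knight to break the spell.",
    "The woodcutter walked through the dark forest to the dwarf's cottage.",
]
labels = ["west_african", "west_african", "western_european", "western_european"]

# Bag-of-words + Naive Bayes: vocabulary alone already separates the two corpora well.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(tales, labels)

print(clf.predict(["The spider and the tortoise shared the yam under the baobab."]))
```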

All resources, including data and code, can be found at https://github.com/GossaLo/afr-neural-folktales

Source: Victor de Boer

Posted in Staff Blogs, Victor de Boer