Importing data into Wikidata - Current challenges and ideas for future development

By Navino Evans

November 30, 2017

This post is a follow-up to the session I hosted at WikidataCon 2017 along with Stuart Prior from Wikimedia UK (standing in for John Cummings, Wikimedian in Residence at UNESCO).

Data partnerships with reputable third parties are really important for the future of Wikidata. If we get this right it will alleviate a lot of the manual work needed to keep general facts about the world up to date, whilst simultaneously giving greater authority to Wikidata in terms of completeness and accuracy (not to mention consistency!). For example, if there is community consensus that UNESCO is the 'main' authority on World Heritage sites and their data is freely available, it seems entirely logical to automate the process of importing their data as much as possible so that we’re always up to date according to our main source.

Fruits of a successful data import: map of World Heritage sites, colour-coded by inclusion criteria

Of course, the mechanics of "keeping up to date" will have to vary from dataset to dataset, but the process should be pinned down so that each import moves through the same flow of stages, and progress is reported in a standard way in a centralised location. This approach is essential for making imports easier to carry out, and it will also stop us from constantly reinventing the wheel as different segments of the community work through the same problems independently.

Great strides have been made in this direction with the publication of the data import hub and related pages. What’s needed now is a unified community effort to refine this documentation, as well as the data import process itself.

This is a job that will never be finished, but there is still a considerable amount of work needed to get to a ‘reasonable’ starting point. By that I mean that an external organisation can arrive at the data import hub and be gradually directed down a path that is clear and easy to follow, guided by help from the Wikidata community at key stages in the process.

Below is a summary of the WikidataCon session we hosted. It was intended as a platform to discuss some of the challenges faced, and to work towards an action plan for improving the data import process and documentation in a form that the whole community can easily contribute to.

Overview of the data import discussion session at WikidataCon 2017

Session notes, including links to related discussions and resources

You can read more about what was said in the link to the notes page above, but the main message boiled down to the fact that the current process is too difficult/technical for most editors to help with and there are many barriers to entry that make it hard for external organisations to donate data to Wikidata.

What we really needed to get out of the session was the input of the people present. We focused on a few key problems that we’d found particularly difficult to deal with during previous data import projects:

1. It’s hard to keep Wikidata ‘in sync’ with an external data source
2. There’s no easy way to report metrics related to a particular data import
3. Lots of data processing is needed by people with advanced spreadsheet and/or coding skills
4. It’s hard to find which data has been changed since a previous import

To discuss these problems, we split into groups according to which issue people felt best placed to contribute to. We ended up covering problems 1, 2 and 3 (one group per problem).

Note: As it happens, problem 4 was mostly solved straight after WikidataCon by a new tool released by Magnus Manske! It lets you see recent changes to a data set based on a SPARQL query.
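The idea of spotting what has changed since a previous import can be illustrated independently of any particular tool. The sketch below is not how the tool mentioned above works; it simply compares two snapshots of a dataset, each a dictionary keyed by an identifier, and reports which records were added, removed or changed. The function name `diff_snapshots` and the sample records are made up for illustration.

```python
from typing import Dict, List, NamedTuple

Record = Dict[str, str]


class SnapshotDiff(NamedTuple):
    added: List[str]
    removed: List[str]
    changed: List[str]


def diff_snapshots(previous: Dict[str, Record],
                   current: Dict[str, Record]) -> SnapshotDiff:
    """Compare two snapshots keyed by record identifier."""
    # Identifiers that only exist in the newer snapshot.
    added = sorted(k for k in current if k not in previous)
    # Identifiers that have disappeared from the newer snapshot.
    removed = sorted(k for k in previous if k not in current)
    # Identifiers present in both, but whose fields differ.
    changed = sorted(
        k for k in current if k in previous and current[k] != previous[k]
    )
    return SnapshotDiff(added, removed, changed)


# Hypothetical sample data, invented purely for this example.
previous_import = {
    "site-1": {"name": "Example Site A", "country": "X"},
    "site-2": {"name": "Example Site B", "country": "Y"},
}
latest_source = {
    "site-1": {"name": "Example Site A", "country": "X"},
    "site-2": {"name": "Example Site B (extended)", "country": "Y"},
    "site-3": {"name": "Example Site C", "country": "Z"},
}

result = diff_snapshots(previous_import, latest_source)
print(result)
```

In practice the two snapshots might come from a saved copy of the last import and a fresh download from the data partner, but how they are obtained is outside the scope of this sketch.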

The groups were asked to discuss and suggest solutions to their chosen problem. Because a lot of the issues we face now will one day be solved by more advanced tools, the suggested solutions were split into two sections:

What can we do to help right now?
What should we work towards in the future?

Summary of conclusions from discussion groups

What can we do now?

We need to develop the import process and documentation as a community, so all discussion and task management should happen in a centralised place.

We should reach out to experts in other organisations (e.g. Europeana) for help with our import processes.

We need to get more people using the data import hub and giving feedback on areas that need improvement. It’s crucial that we take extra care to document our import experiences and share the problems and solutions found. Lessons learned need to be reviewed and distilled back into the documentation.

We should put together good resources for learning the relatively basic spreadsheet skills needed for a large proportion of data import tasks. This should help grow a larger group of volunteer editors able to jump in and help with mass imports.

A wide range of metrics can already be reported, but we need to find out exactly what data partner organisations need, and all existing metrics resources should be rounded up and added to the data import guide.

All tools that are of use need to be added to the data import guide and kept up to date with new developments.

What should we work towards in the future?

Documentation needs to be completely user centric, and broken down into different tiers by skill level. For example, there is a very different level of information needed if you are helping as a software developer, or as a casual editor who knows how to use a couple of tools.

There should be a range of guides that are domain specific, which are both easy to find and easy to understand (at the skill level they are pitched to). All documentation and guides should be developed with a constant feedback mechanism, learning from the pain/mistakes of the past to improve the model.

The data import hub should be developed into a separate tool so it's easy for people to interact with it without needing wikitext skills. All metrics reporting, synchronisation tasks etc. should be listed in this central place for each data import. This would effectively be a combination of many of the existing tools like Mix’n’match, QuickStatements and recent changes (note: GLAMpipe, which is currently under development, seems to show a lot of promise as an all-in-one solution of the future!).
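To give a sense of the processing that a combined tool would hide, here is a minimal sketch of turning spreadsheet-style rows into commands for QuickStatements, using its tab-separated version 1 text format (item, property, value, with string values in double quotes). The helper `rows_to_quickstatements` and the sample row are hypothetical. Q4115189 is the Wikidata sandbox item, and P757 is assumed here to be the World Heritage Site ID property, so check both before running anything for real.

```python
from typing import Dict, Iterable, List


def rows_to_quickstatements(rows: Iterable[Dict[str, str]],
                            property_id: str) -> List[str]:
    """Build one tab-separated command per row: item, property, "string value"."""
    commands = []
    for row in rows:
        value = row["value"]
        # This sketch refuses awkward values rather than guess at escaping rules.
        if '"' in value or "\t" in value:
            raise ValueError(f"Unsupported character in value: {value!r}")
        commands.append("\t".join([row["qid"], property_id, f'"{value}"']))
    return commands


# Hypothetical row, as it might look after cleaning up a spreadsheet.
# Q4115189 is the sandbox item; the identifier value is invented.
sample_rows = [{"qid": "Q4115189", "value": "12345"}]

# P757 is assumed to be the World Heritage Site ID property.
commands = rows_to_quickstatements(sample_rows, "P757")
print("\n".join(commands))
```

The printed lines are intended to be pasted into QuickStatements by hand, which is exactly the kind of step a dedicated import tool could take off the editor's plate.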

Create a "one click" metrics reporting button available straight from the data import hub, which creates an automatic report on the exposure of the data (e.g. page views from Wikipedia infoboxes, number of articles it’s used in, etc.).

Develop a staging area where people can test their imports before 'going live' on Wikidata. This would ideally be a complete mirror of Wikidata, but with the additional data being tested for import showing up in the interface. Once you are happy it all looks good, you can click "Publish" to go ahead and update Wikidata for real.
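The staging idea can be sketched in miniature: keep proposed changes in an overlay, show the data as it would look with the overlay applied, and only touch the live data on an explicit publish step. Everything below (the `StagingArea` class, its methods and the sample item) is hypothetical and works on plain dictionaries, not on Wikidata itself.

```python
import copy
from typing import Dict

Items = Dict[str, Dict[str, str]]


class StagingArea:
    """Holds proposed changes separately from the live data until published."""

    def __init__(self, live: Items) -> None:
        self.live = live
        self.pending: Items = {}

    def stage(self, item_id: str, statements: Dict[str, str]) -> None:
        # Record a proposed change without modifying the live data.
        self.pending.setdefault(item_id, {}).update(statements)

    def preview(self) -> Items:
        # Return a copy of the live data with the pending changes applied.
        merged = copy.deepcopy(self.live)
        for item_id, statements in self.pending.items():
            merged.setdefault(item_id, {}).update(statements)
        return merged

    def publish(self) -> None:
        # Apply the pending changes to the live data and clear the overlay.
        for item_id, statements in self.pending.items():
            self.live.setdefault(item_id, {}).update(statements)
        self.pending = {}


# Hypothetical sample item, invented for this example.
live_data = {"item-1": {"name": "Example Site A"}}

staging = StagingArea(live_data)
staging.stage("item-1", {"country": "X"})

preview = staging.preview()                       # what the data would look like
untouched = "country" not in live_data["item-1"]  # live data is still unchanged

staging.publish()                                 # the "Publish" click
print(live_data)
```

A real staging area would also need to handle conflicts with edits made on the live site in the meantime, which this sketch ignores.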

Where do we go from here?

There is a huge amount that can be done right away to improve the documentation, which John Cummings and I will be working on over the coming months. But as previously mentioned, this will need input from a wide spectrum of community members, so it seems to me that an essential first step is to create a central area for recording all tasks and discussion related to the data import process and documentation.

The most sensible platform I can think of would be some sort of project on Phabricator, with a tree of subtasks covering different parts of the overall process. I’ll post a link here once this or something similar has taken shape.

Once this is in place, we can begin to make much faster progress as we’ll all be working toward the same goal rather than painstakingly repeating the same work. Having a structure like this is also essential for enlisting the help of community members with different skills.

If you have any thoughts or suggestions to share, please do post them in the comments below or contact me on Twitter @NavinoEvans.