SLiDInG 6

Today, the Semantic Web Lab hosted the 6th Scottish Linked Data Interest Group workshop at Heriot-Watt University. The event was sponsored by the SICSA Data Science Theme. The event was well attended with 30 researchers from across Scotland (and Newcastle) coming together for a day of flash talks and discussions. Live minutes were captured during the […]

Today, the Semantic Web Lab hosted the 6th Scottish Linked Data Interest Group workshop at Heriot-Watt University. The event was sponsored by the SICSA Data Science Theme. The event was well attended with 30 researchers from across Scotland (and Newcastle) coming together for a day of flash talks and discussions. Live minutes were captured during the day and can be found here.

I gave a talk on the successes and challenges of FAIR data. My slides are embedded below.

DUCS not LOD

The follow is an excerpt from a blog by Keir Winesmith, Head of Digital at the San Francisco Museum of Modern Art (@SFMOMAlab) Linked Open Data may sound good and noble, but it’s the wrong way around. It is a truth universally acknowledged, that an organization in possession of good Data, must want it Open (and […]

The follow is an excerpt from a blog by Keir Winesmith, Head of Digital at the San Francisco Museum of Modern Art (@SFMOMAlab)

Linked Open Data may sound good and noble, but it’s the wrong way around. It is a truth universally acknowledged, that an organization in possession of good Data, must want it Open (and indeed, Linked).

Well, I call bullshit. Most cultural heritage organizations (like most organizations) are terrible at data. And most of those who are good at collecting it, very rarely use it effectively or strategically.

Instead of Linked Open Data (LOD), Keir argues for DUCS:

I propose an alternative anagram, and an alternative order of importance.

  • D. Data. Step one, collect the data that is most likely to help you and your organization make better decisions in the future. For example collection breadth, depth, accuracy, completeness, diversity, and relationships between objects and creators.
  • U. Utilise. Actually use the data to inform your decisions, and test your hypotheses, within the bounds of your mission.
  • C. Context. Provide context for your data, both internally and externally. What’s inside? How is represented? How complete is it? How accurate? How current? How was it gathered?
  • S. Share. Now you’re ready to share it! Share it with context. Share it with the communities that are included in it first, follow the cultural heritage strategy of “nothing about me, without me”. Reach out to the relevant students, scholars, teachers, artists, designers, anthropologists, technologists, and whomever could use it. Get behind it and keep it up to date.

I’m against LOD, if it doesn’t follow DUCS first.

If you’re going to do it, do it right.

Source: Against Linked Open Data – Keir Winesmith – Medium

Interoperability and FAIRness through a novel combination of Web technologies

New paper [1] on using Semantic Web technologies to publish existing data according to the FAIR data principles [2]. Abstract: Data in the life sciences are extremely diverse and are stored in a broad spectrum of repositories ranging from those designed for particular data types (such as KEGG for pathway data or UniProt for protein […]

New paper [1] on using Semantic Web technologies to publish existing data according to the FAIR data principles [2].

Abstract: Data in the life sciences are extremely diverse and are stored in a broad spectrum of repositories ranging from those designed for particular data types (such as KEGG for pathway data or UniProt for protein data) to those that are general-purpose (such as FigShare, Zenodo, Dataverse or EUDAT). These data have widely different levels of sensitivity and security considerations. For example, clinical observations about genetic mutations in patients are highly sensitive, while observations of species diversity are generally not. The lack of uniformity in data models from one repository to another, and in the richness and availability of metadata descriptions, makes integration and analysis of these data a manual, time-consuming task with no scalability. Here we explore a set of resource-oriented Web design patterns for data discovery, accessibility, transformation, and integration that can be implemented by any general- or special-purpose repository as a means to assist users in finding and reusing their data holdings. We show that by using off-the-shelf technologies, interoperability can be achieved at the level of an individual spreadsheet cell. We note that the behaviours of this architecture compare favourably to the desiderata defined by the FAIR Data Principles, and can therefore represent an exemplar implementation of those principles. The proposed interoperability design patterns may be used to improve discovery and integration of both new and legacy data, maximizing the utility of all scholarly outputs.

[1] [doi] Mark D. Wilkinson, Ruben Verborgh, Luiz Olavo {Bonino da Silva Santos}, Tim Clark, Morris A. Swertz, Fleur D. L. Kelpin, Alasdair J. G. Gray, Erik A. Schultes, Erik M. van Mulligen, Paolo Ciccarese, Arnold Kuzniar, Anand Gavai, Mark Thompson, Rajaram Kaliyaperumal, Jerven T. Bolleman, and Michel Dumontier. Interoperability and FAIRness through a novel combination of Web technologies. PeerJ Computer Science, 3:e110, apr 2017.
[Bibtex]
@article{Wilkinson2017-FAIRness,
abstract = {Data in the life sciences are extremely diverse and are stored in a broad spectrum of repositories ranging from those designed for particular data types (such as KEGG for pathway data or UniProt for protein data) to those that are general-purpose (such as FigShare, Zenodo, Dataverse or EUDAT). These data have widely different levels of sensitivity and security considerations. For example, clinical observations about genetic mutations in patients are highly sensitive, while observations of species diversity are generally not. The lack of uniformity in data models from one repository to another, and in the richness and availability of metadata descriptions, makes integration and analysis of these data a manual, time-consuming task with no scalability. Here we explore a set of resource-oriented Web design patterns for data discovery, accessibility, transformation, and integration that can be implemented by any general- or special-purpose repository as a means to assist users in finding and reusing their data holdings. We show that by using off-the-shelf technologies, interoperability can be achieved atthe level of an individual spreadsheet cell. We note that the behaviours of this architecture compare favourably to the desiderata defined by the FAIR Data Principles, and can therefore represent an exemplar implementation of those principles. The proposed interoperability design patterns may be used to improve discovery and integration of both new and legacy data, maximizing the utility of all scholarly outputs.},
author = {Wilkinson, Mark D. and Verborgh, Ruben and {Bonino da Silva Santos}, Luiz Olavo and Clark, Tim and Swertz, Morris A. and Kelpin, Fleur D.L. and Gray, Alasdair J.G. and Schultes, Erik A. and van Mulligen, Erik M. and Ciccarese, Paolo and Kuzniar, Arnold and Gavai, Anand and Thompson, Mark and Kaliyaperumal, Rajaram and Bolleman, Jerven T. and Dumontier, Michel},
doi = {10.7717/peerj-cs.110},
issn = {2376-5992},
journal = {PeerJ Computer Science},
month = {apr},
pages = {e110},
publisher = {PeerJ Inc.},
title = {{Interoperability and FAIRness through a novel combination of Web technologies}},
url = {https://peerj.com/articles/cs-110},
volume = {3},
year = {2017}
}
[2] [doi] Mark D. Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino {da Silva Santos}, Philip E. Bourne, Jildau Bouwman, Anthony J. Brookes, Tim Clark, Mercè Crosas, Ingrid Dillo, Olivier Dumon, Scott Edmunds, Chris T. Evelo, Richard Finkers, Alejandra Gonzalez-Beltran, Alasdair J. G. Gray, Paul Groth, Carole Goble, Jeffrey S. Grethe, Jaap Heringa, Peter A. C. {‘t Hoen}, Rob Hooft, Tobias Kuhn, Ruben Kok, Joost Kok, Scott J. Lusher, Maryann E. Martone, Albert Mons, Abel L. Packer, Bengt Persson, Philippe Rocca-Serra, Marco Roos, Rene van Schaik, Susanna-Assunta Sansone, Erik Schultes, Thierry Sengstag, Ted Slater, George Strawn, Morris A. Swertz, Mark Thompson, Johan van der Lei, Erik van Mulligen, Jan Velterop, Andra Waagmeester, Peter Wittenburg, Katherine Wolstencroft, Jun Zhao, and Barend Mons. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3:160018, 2016.
[Bibtex]
@article{Wilkinson2016,
abstract = {There is an urgent need to improve the infrastructure supporting the reuse of scholarly data. A diverse set of stakeholders-representing academia, industry, funding agencies, and scholarly publishers-have come together to design and jointly endorse a concise and measureable set of principles that we refer to as the {FAIR} Data Principles. The intent is that these may act as a guideline for those wishing to enhance the reusability of their data holdings. Distinct from peer initiatives that focus on the human scholar, the {FAIR} Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals. This Comment is the first formal publication of the {FAIR} Principles, and includes the rationale behind them, and some exemplar implementations in the community.},
author = {Wilkinson, Mark D and Dumontier, Michel and Aalbersberg, IJsbrand Jan and Appleton, Gabrielle and Axton, Myles and Baak, Arie and Blomberg, Niklas and Boiten, Jan-Willem and {da Silva Santos}, Luiz Bonino and Bourne, Philip E and Bouwman, Jildau and Brookes, Anthony J and Clark, Tim and Crosas, Merc{\`{e}} and Dillo, Ingrid and Dumon, Olivier and Edmunds, Scott and Evelo, Chris T and Finkers, Richard and Gonzalez-Beltran, Alejandra and Gray, Alasdair J.G. and Groth, Paul and Goble, Carole and Grethe, Jeffrey S and Heringa, Jaap and {'t Hoen}, Peter A.C and Hooft, Rob and Kuhn, Tobias and Kok, Ruben and Kok, Joost and Lusher, Scott J and Martone, Maryann E and Mons, Albert and Packer, Abel L and Persson, Bengt and Rocca-Serra, Philippe and Roos, Marco and van Schaik, Rene and Sansone, Susanna-Assunta and Schultes, Erik and Sengstag, Thierry and Slater, Ted and Strawn, George and Swertz, Morris A and Thompson, Mark and van der Lei, Johan and van Mulligen, Erik and Velterop, Jan and Waagmeester, Andra and Wittenburg, Peter and Wolstencroft, Katherine and Zhao, Jun and Mons, Barend},
doi = {10.1038/sdata.2016.18},
issn = {2052-4463},
journal = {Scientific Data},
month = mar,
pages = {160018},
publisher = {Macmillan Publishers Limited},
title = {{The FAIR Guiding Principles for scientific data management and stewardship}},
url = {http://www.nature.com/articles/sdata201618},
volume = {3},
year = {2016}
}

Shapeshifting LOD Cloud

A new version of the Linked Open Data (LOD) cloud has been produced and it shows quite a shift from the previous version. It is great to see the LOD cloud continue to grow both in scale and diversity. (You can click on the image to get to an interactive version of the cloud with […]

A new version of the Linked Open Data (LOD) cloud has been produced and it shows quite a shift from the previous version. It is great to see the LOD cloud continue to grow both in scale and diversity.

(You can click on the image to get to an interactive version of the cloud with links to the DataHub entries.)

LOD Cloud January 2017

LOD Cloud January 2017

Previously DBPedia and GeoNames were the centre of the LOD universe. While DBPedia still remains an important linking dataset, it is now clear that there are clusterings within application domains. This is most significant in the life sciences.

LOD Cloud August 2014

LOD Cloud August 2014

Attribution: “Linking Open Data cloud diagram 2017, by Andrejs Abele, John P. McCrae, Paul Buitelaar, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/”

Research Blog: Facilitating the discovery of public datasets

Google are doing some interesting work on making datasets, in particular scientific datasets, more discoverable with schema.org markup. This is closely related to the bioschemas community project.
Source: Research Blog: Facilitating the discovery of pu…

Google are doing some interesting work on making datasets, in particular scientific datasets, more discoverable with schema.org markup. This is closely related to the bioschemas community project.

Source: Research Blog: Facilitating the discovery of public datasets

ISWC 2016 Trip Report

It has now been almost two months since ISWC 2016 where I was the Resources Track chair with Marta Sabou. This has given me time to reflect on the conference, in between a hectic schedule of project meetings, workshops, conferences, and a PhD viva. The most enjoyable part of the conference for me was the […]

It has now been almost two months since ISWC 2016 where I was the Resources Track chair with Marta Sabou. This has given me time to reflect on the conference, in between a hectic schedule of project meetings, workshops, conferences, and a PhD viva.

The most enjoyable part of the conference for me was the CoLD Workshop Debate on the State of Linked Data. The workshop organisers had arranged for six prominent proponents of the Linked Data to argue that we have failed and that Linked Data will die away.

  1. Ruben Verborgh argued that Linked Data will be destroyed by the need to centralise data, poor infrastructure, and the research community. (Aside: There was certainly concern on the final point as there were only three females in the room.)
  2. Axel Polleres took the moto, “Let’s make RDF great again!” Axel’s central argument was around the fact that most open data is actually published in CSV format and lots can be achieved with 3* open data.

  3. Paul Groth argued that we should be concentrating on making our data processable by machines. What we currently have is a format that is aimed at both but satisfies neither.

  4. Chris Bizer covered cost incentives. While there is an incentive to provide some basic schema markup on pages, i.e. getting picked up by search engines, there is no financial incentive to provide the links to other resources. My take on this is that there is a disincentive as it would take traffic away from your (eCommerce) site and therefore lose you revenue.
  5. Avi Bernstein then did a fantastic impression of a Wee Free minister and telling us that we had all sinned and were following the wrong path; all fire and brimstone.
  6. Juan Reutter argued that we needed to provide a workable ecosystem.

So the question is, has the Linked Data community failed? I think the debate highlighted that the community had made many contributions in a short space of time but that it is time to get this into the main stream. Perhaps our community is not the best for doing the required sales job, but we have had some success, e.g. EBI RDF platform, Open PHACTS Drug Discovery Platform, BBC Olympic Web Site.

The main conference was underpinned by three fantastic and varied keynotes. First was Kathleen McKeown who gave us insights into the extraction of knowledge from different forms of text. Second was Christian Bizer who’s main message was that we as a community need to take structured data in whatever form it comes; just like search engines have exploited metadata and page structure for a long time. Finally was Hiroaki Kitano from the Sony Corporation. This has got to be the densest keynote I have ever heard with more ideas per minute than a dance tune has beats. His challenge to the community was that we should aim to have an AI system win a scientific nobel prize by 2050. The system should develop a hypothesis, test it, and generate a ground breaking conclusion worthy of the prize.

There were many great and varied talks during the conference. It really is worth looking through the programme to find those of interest to you (all the papers are linked and available). As ever the poster and demo session, advertised in the minute madness session, demonstrated the breadth and cutting edge work going on in the community. As did the lightning talk session.

The final day of the conference was particularly weird for me. As the chair of a session I ended up sharing a bottle of fine Italian wine with a presenter during his talk, it would have been rude not to; and experiencing an earthquake during a presentation on an ontology for modelling the soil beneath our cities, in particular the causes of damage to that soil.

The conference afforded some opportunities for fun as well. A few of the organising committee managed to get visit the k-computer; the worlds fifth fastest super-computer which is cooled with water. The computer was revealed in a very James Bond, “Now I’m going to have to kill you!” reveal of the evil enemy’s master plan. There was also a highly entertaining Samurai sword fighting demonstration during the conference banquet.

During the conference, my Facebook feed was filled with exclamations about the complexity of the toilets. Following the conference, it was filled with exclamations of returning to lands of uncivilised toilets. Make of this what you will.

HCLS Tutorial at SWAT4LS 2016

On 5 December 2016 I presented a tutorial [1] on the Heath Care and Life Sciences Community Profile (HCLS Datasets) at the 9th International Semantic Web Applications and Tools for the Life Sciences Conference (SWAT4LS 2016). Below you can find the slides I presented. The 61 metadata properties from 18 vocabularies reused in the HCLS Community […]

On 5 December 2016 I presented a tutorial [1] on the Heath Care and Life Sciences Community Profile (HCLS Datasets) at the 9th International Semantic Web Applications and Tools for the Life Sciences Conference (SWAT4LS 2016). Below you can find the slides I presented.

The 61 metadata properties from 18 vocabularies reused in the HCLS Community Profile are available in this spreadsheet (.ods).

[1] M. Dumontier, A. J. G. Gray, and S. M. Marshall, “Describing Datasets with the Health Care and Life Sciences Community Profile,” in Semantic Web Applications and Tools for Life Sciences (SWAT4LS 2016), Amsterdam, The Netherlands, 2016.
[Bibtex]
@InProceedings{Gray2016SWAT4LSTutorial,
abstract = {Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting HCLS community profile covers elements of description, identification, attribution, versioning, provenance, and content summarization. The HCLS community profile reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine readable descriptions of versioned datasets. The goal of this tutorial is to explain elements of the HCLS community profile and to enable users to craft and validate descriptions for datasets of interest.},
author = {Michel Dumontier and Alasdair J. G. Gray and M. Scott Marshall},
title = {Describing Datasets with the Health Care and Life Sciences Community Profile},
OPTcrossref = {},
OPTkey = {},
booktitle = {Semantic Web Applications and Tools for Life Sciences (SWAT4LS 2016)},
year = {2016},
OPTeditor = {},
OPTvolume = {},
OPTnumber = {},
OPTseries = {},
OPTpages = {},
month = dec,
address = {Amsterdam, The Netherlands},
OPTorganization = {},
OPTpublisher = {},
note = {(Tutorial)},
url = {http://www.swat4ls.org/workshops/amsterdam2016/tutorials/t2/},
OPTannote = {}
}

HCLS Tutorial at SWAT4LS 2016

On 5 December 2016 I presented a tutorial [1] on the Heath Care and Life Sciences Community Profile (HCLS Datasets) at the 9th International Semantic Web Applications and Tools for the Life Sciences Conference (SWAT4LS 2016). Below you can find the slides I presented. The 61 metadata properties from 18 vocabularies reused in the HCLS Community […]

On 5 December 2016 I presented a tutorial [1] on the Heath Care and Life Sciences Community Profile (HCLS Datasets) at the 9th International Semantic Web Applications and Tools for the Life Sciences Conference (SWAT4LS 2016). Below you can find the slides I presented.

The 61 metadata properties from 18 vocabularies reused in the HCLS Community Profile are available in this spreadsheet (.ods).

[1] M. Dumontier, A. J. G. Gray, and S. M. Marshall, “Describing Datasets with the Health Care and Life Sciences Community Profile,” in Semantic Web Applications and Tools for Life Sciences (SWAT4LS 2016), Amsterdam, The Netherlands, 2016.
[Bibtex]
@InProceedings{Gray2016SWAT4LSTutorial,
abstract = {Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting HCLS community profile covers elements of description, identification, attribution, versioning, provenance, and content summarization. The HCLS community profile reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine readable descriptions of versioned datasets. The goal of this tutorial is to explain elements of the HCLS community profile and to enable users to craft and validate descriptions for datasets of interest.},
author = {Michel Dumontier and Alasdair J. G. Gray and M. Scott Marshall},
title = {Describing Datasets with the Health Care and Life Sciences Community Profile},
OPTcrossref = {},
OPTkey = {},
booktitle = {Semantic Web Applications and Tools for Life Sciences (SWAT4LS 2016)},
year = {2016},
OPTeditor = {},
OPTvolume = {},
OPTnumber = {},
OPTseries = {},
OPTpages = {},
month = dec,
address = {Amsterdam, The Netherlands},
OPTorganization = {},
OPTpublisher = {},
note = {(Tutorial)},
url = {http://www.swat4ls.org/workshops/amsterdam2016/tutorials/t2/},
OPTannote = {}
}

Will the real Kevin Macleod please line up?

Last week I attended the Digitising Scotland Project Colloquium at Raasay House (featured image above) on the Isle of Raasay. The colloquium was a gathering of historians and computer scientists to discuss the challenges of linking the vital records of the people of Scotland between 1851 and 1974. The Digitising Scotland Project is having the birth, marriage, […]

Last week I attended the Digitising Scotland Project Colloquium at Raasay House (featured image above) on the Isle of Raasay. The colloquium was a gathering of historians and computer scientists to discuss the challenges of linking the vital records of the people of Scotland between 1851 and 1974.

The Digitising Scotland Project is having the birth, marriage, and death records of Scotland transcribed from the scans of the original hand written registration books. This process is not without its own challenges, try reading this birth record of a famous Scottish artist and architect, but the focus of the colloquium was on what happens after the records have been transcribed.

Each Scottish vital record identifies several individuals, e.g. on a birth record you will have the baby, their parents, the informant, and the registrar. The same individuals will appear on multiple records relating to events in their own life, e.g. an individual will have a birth record, potentially one or more marriage records, and a death record, assuming that they have not emigrated. They can also appear in the records of other individuals, e.g. as a mother on a birth record, the mother-of-the-bride on a marriage record, or the doctor on a death record. The challenge is how to identify the same individual across all the records, when all you have is a name (first and last) and potentially the age.

The problem is compounded in an area like Skye, which was one of the focus regions of the Digitising Scotland project, because there is a relatively small distribution of names on which to draw upon. For example, a name like Kevin Macleod will appear on multiple records. In some cases the name will correspond to a single Kevin Macleod, in other cases it will be a closely related Kevin Macleod, e.g. Kevin Macleod the father of Kevin Macleod, and in others the two Kevin Macleods will not be related at all. The challenge is how to develop a computer algorithm that is capable of making these distinctions.

The colloquium was a great opportunity for historians and computer scientists to discuss the challenges and help each other to develop a solution. However, first we had to agree on a common understanding for terms such as “record” and “individual”.

Overall, we made great progress on exchanging ideas and techniques. We heard how similar challenges are being addressed in a related project focusing on North Orkney, how historians approach the record linkage challenge, and about work for automatically classifying causes of death to their ICD10 code and jobs to HISCO. There was also time to socialise and enjoy some of the scenery of Raasay, which is a beautiful island the size of Manhattan but with a population of only 160.

View from the meeting room

View from the meeting room

Sunset over Portree, Skye

Sunset over Portree, Skye