Dataset Description

Interoperability and FAIRness through a novel combination of Web technologies

New paper [1] on using Semantic Web technologies to publish existing data according to the FAIR data principles [2].

Abstract: Data in the life sciences are extremely diverse and are stored in a broad spectrum of repositories ranging from those designed for particular data types (such as KEGG for pathway data or UniProt for protein data) to those that are general-purpose (such as FigShare, Zenodo, Dataverse or EUDAT). These data have widely different levels of sensitivity and security considerations. For example, clinical observations about genetic mutations in patients are highly sensitive, while observations of species diversity are generally not. The lack of uniformity in data models from one repository to another, and in the richness and availability of metadata descriptions, makes integration and analysis of these data a manual, time-consuming task with no scalability. Here we explore a set of resource-oriented Web design patterns for data discovery, accessibility, transformation, and integration that can be implemented by any general- or special-purpose repository as a means to assist users in finding and reusing their data holdings. We show that by using off-the-shelf technologies, interoperability can be achieved at the level of an individual spreadsheet cell. We note that the behaviours of this architecture compare favourably to the desiderata defined by the FAIR Data Principles, and can therefore represent an exemplar implementation of those principles. The proposed interoperability design patterns may be used to improve discovery and integration of both new and legacy data, maximizing the utility of all scholarly outputs.

[1] Unknown bibtex entry with key [Wilkinson2017-FAIRness]
[2] Unknown bibtex entry with key [Wilkinson2016]

Supporting Dataset Descriptions in the Life Sciences

Seminar talk given at the EBI on 5 April 2017.

Abstract: Machine-processable descriptions of datasets can help make data more FAIR; that is, Findable, Accessible, Interoperable, and Reusable. However, there are a variety of metadata profiles for describing datasets, some specific to the life sciences and others more generic in their focus. Each profile has its own set of properties and its own requirements as to which must be provided and which are optional. Developing a description for a given dataset that conforms to a specific metadata profile is a challenging process.

In this talk, I will give an overview of some of the dataset description specifications that are available. I will discuss the difficulties in writing a dataset description that conforms to a profile, and the tooling that I’ve developed to support dataset publishers in creating metadata descriptions and validating them against a chosen specification.
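To make the idea of checking a description against a profile concrete, here is a minimal sketch in Python. It is not the tooling discussed in the talk: the `Profile` class, the `validate` function, and the choice of which properties count as required or recommended are all assumptions made up for illustration. The property names are Dublin Core terms, but their requirement levels here do not reflect any actual profile.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """A hypothetical metadata profile: a name plus property requirement levels."""
    name: str
    required: frozenset
    recommended: frozenset = frozenset()


# Illustrative profile only; the requirement levels are assumptions,
# not taken from any published specification.
EXAMPLE_PROFILE = Profile(
    name="example-profile",
    required=frozenset({"dcterms:title", "dcterms:description", "dcterms:publisher"}),
    recommended=frozenset({"dcterms:license"}),
)


def validate(description: dict, profile: Profile) -> dict:
    """Report which required and recommended properties a description lacks.

    A property counts as present only if it has a non-empty value.
    """
    present = {key for key, value in description.items() if value not in (None, "", [])}
    return {
        "profile": profile.name,
        "missing_required": sorted(profile.required - present),
        "missing_recommended": sorted(profile.recommended - present),
        "conforms": profile.required <= present,
    }


if __name__ == "__main__":
    # A made-up description with an empty description field.
    description = {"dcterms:title": "Example dataset", "dcterms:description": ""}
    print(validate(description, EXAMPLE_PROFILE))
```

Even this toy version shows why conformance is awkward to check by hand: the same description can satisfy one profile and fail another, simply because the two profiles disagree on which properties are mandatory.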

Smart Descriptions & Smarter Vocabularies (SDSVoc) Report

In December 2016 I presented the Health Care and Life Sciences Community Profile for describing datasets, and our validation tool (Validata), at the Smart Descriptions and Smarter Vocabularies workshop. Presentations included below.

The purpose of the workshop was to understand current practice in describing datasets and where the DCAT vocabulary needs improvement. Phil Archer has written a very comprehensive report covering the workshop. A charter is being drawn up for a W3C working group to develop the next iteration of the DCAT vocabulary.

HCLS Tutorial at SWAT4LS 2016

On 5 December 2016 I presented a tutorial [1] on the Health Care and Life Sciences Community Profile (HCLS Datasets) at the 9th International Semantic Web Applications and Tools for the Life Sciences Conference (SWAT4LS 2016). Below you can find the slides I presented.

The 61 metadata properties from 18 vocabularies reused in the HCLS Community Profile are available in this spreadsheet (.ods).

[1] Unknown bibtex entry with key [Gray2016SWAT4LSTutorial]

HCLS Community Profile for Dataset Descriptions

My latest publication [1] describes the process followed in developing the W3C Health Care and Life Sciences Interest Group (HCLSIG) community profile for dataset descriptions, which was published last year. The diagram below provides a summary of the data model for describing datasets, which covers 61 metadata terms drawn from 18 vocabularies.

Figure: Overview of the HCLS Community Profile for Dataset Descriptions

[1] Unknown bibtex entry with key [Dumontier2016HCLS]