Schema.org Dataset Descriptions Meeting

On Monday 16 May, several interested individuals (including myself) from ELIXIR, bioCADDIE, Bioschemas and the W3C HCLS Community Profile met with representatives from Google involved in the schema.org activity on describing datasets.

Finding datasets, and understanding their content, is a challenging task for humans and currently not possible to automate. Schema.org is an initiative from the major web search engines to help with the discovery of web resource.

There are multiple parallel activities in the life sciences community working on developing ways to publish metadata about datasets. This is due to the wide variety of use cases that dataset descriptions need to satisfy, including data discovery, data citation, and provenance tracking. bioCADDIE has worked on an extensive analysis of  use cases mapping data models from existing efforts. We agreed this could be used as a basis to improve the existing schema.org dataset type and come up with a new Bioschemas dataset specification.

The outcomes of the meeting with Google was to focus on the find-ability of datasets with an emphasis on data citation. For data citation the following key properties are important (as stated in the FORCE 11 data citation principles).

data-citation

Image from https://www.force11.org/node/4771

The next steps will be to develop some pilot projects to both publish and use dataset descriptions for discovery and citation.

ELIXIR-UK members are heavily involved in Bioschemas.org and will be involved in these efforts as part of the ELIXIR interoperability platform activities in collaboration with NIH and ELIXIR-EBI.

My thanks go to Rafael Jimenez for his help with the preparation of this post which will also appear in the ELIXIR-UK newsletter.