An Identifier Scheme for the Digitising Scotland Project

The Digitising Scotland project is having the vital records of Scotland transcribed from images of the original handwritten civil registers . Linking the resulting dataset of 24 million vital records covering the lives of 18 million people is a major challenge requiring improved record linkage techniques. Discussions within the multidisciplinary, widely distributed Digitising Scotland project […]

The Digitising Scotland project is having the vital records of Scotland transcribed from images of the original handwritten civil registers . Linking the resulting dataset of 24 million vital records covering the lives of 18 million people is a major challenge requiring improved record linkage techniques. Discussions within the multidisciplinary, widely distributed Digitising Scotland project team have been hampered by the teams in each of the institutions using their own identification scheme. To enable fruitful discussions within the Digitising Scotland team, we required a mechanism for uniquely identifying each individual represented on the certificates. From the identifier it should be possible to determine the type of certificate and the role each person played. We have devised a protocol to generate for any individual on the certificate a unique identifier, without using a computer, by exploiting the National Records of Scotland’s registration districts. Importantly, the approach does not rely on the handwritten content of the certificates which reduces the risk of the content being misread resulting in an incorrect identifier. The resulting identifier scheme has improved the internal discussions within the project. This paper discusses the rationale behind the chosen identifier scheme, and presents the format of the different identifiers.

The work reported in the paper was supported by the British ESRC under grants ES/K00574X/1(Digitising Scotland) and ES/L007487/1 (Administrative Data Research Centre – Scotland).

My coauthors are:

  • Özgür Akgün, University of St Andrews
  • Ahamd Alsadeeqi, Heriot-Watt University
  • Peter Christen, Australian National University
  • Tom Dalton, University of St Andrews
  • Alan Dearle, University of St Andrews
  • Chris Dibben, University of Edinburgh
  • Eilidh Garret, University of Essex
  • Graham Kirby, University of St Andrews
  • Alice Reid, University of Cambridge
  • Lee Williamson, University of Edinburgh

The work reported in this talk is the result of the Digitising Scotland Raasay Retreat. Also at the retreat were:

  • Julia Jennings, University of Albany
  • Christine Jones
  • Diego Ramiro-Farinas, Centre for Human and Social Sciences (CCHS) of the Spanish National Research Council (CSIC)

Seminar: PhD Progression Talks

A double bill of PhD progression talks (abstracts below):

Venue: 3.07 Earl Mountbatten Building, Heriot-Watt University, Edinburgh

Time and Date: 11:15, 8 May 2017

Evaluating Record Linkage Techniques

Ahmad Alsadeeqi

Many computer algorithms have been developed to automatically link historical records based on a variety of string matching techniques. These generate an assessment of how likely two records are to be the same. However, it remains unclear how to assess the quality of the linkages computed due to the absence of absolute knowledge of the correct linkage of real historical records – the ground truth. The creation of synthetically generated datasets for which the ground truth linkage is known helps with the assessment of linkage algorithms but the data generated is too clean to be representative of historical records.

We are interested in assessing data linkage algorithms under different data quality scenarios, e.g. with errors typically introduced by a transcription process or where books can be nibbled by mice. We are developing a data corrupting model that injects corruptions into datasets based on given corruption methods and probabilities. We have classified different forms of corruptions found in historical records into four types based on the effect scope of the corruption. Those types are character level (e.g. an f is represented as an s – OCR Corruptions), attribute level (e.g. gender swap – male changed to female due to false entry), record level (e.g. missing records due to different reasons like loss of certificate), and group of records level (e.g. coffee spilt over a page, lost parish records in fire). This will give us the ability to evaluate record linkage algorithms over synthetically generated datasets with known ground truth and with data corruptions matching a given profile.

Computer-Aided Biomimetics: Knowledge Extraction

Ruben Kruiper

Biologically inspired design concerns copying ideas from nature to various other domains, e.g. natural computing. Biomimetics is a sub-field of biologically inspired design and focuses specifically on solving technical/engineering problems. Because engineers lack biological knowledge the process of biomimetics is non-trivial and remains adventitious. Therefore, computational tools have been developed that aim to support engineers during a biomimetics process by integrating large amounts of relevant biological knowledge. Existing tools work apply NLP techniques on biological research papers to build dedicated knowledge bases. However, these existing tools impose an engineering view on biological data. I will talk about the support that ‘Computer-Aided Biomimetics’ tools should provide, introducing a theoretical basis for further research on the appropriate computational techniques.

Supporting Dataset Descriptions in the Life Sciences

Seminar talk given at the EBI on 5 April 2017. Abstract: Machine processable descriptions of datasets can help make data more FAIR; that is Findable, Accessible, Interoperable, and Reusable. However, there are a variety of metadata profiles for describing datasets, some specific to the life sciences and others more generic in their focus. Each profile has […]

Seminar talk given at the EBI on 5 April 2017.

Abstract: Machine processable descriptions of datasets can help make data more FAIR; that is Findable, Accessible, Interoperable, and Reusable. However, there are a variety of metadata profiles for describing datasets, some specific to the life sciences and others more generic in their focus. Each profile has its own set of properties and requirements as to which must be provided and which are more optional. Developing a dataset description for a given dataset to conform to a specific metadata profile is a challenging process.

In this talk, I will give an overview of some of the dataset description specifications that are available. I will discuss the difficulties in writing a dataset description that conforms to a profile and the tooling that I’ve developed to support dataset publishers in creating metadata description and validating them against a chosen specification.

Smart Descriptions & Smarter Vocabularies (SDSVoc) Report

In December 2016 I presented at the Smart Descriptions and Smarter Vocabularies workshop on the Health Care and Life Sciences Community Profile for describing datasets, and our validation tool (Validata). Presentations included below. The purpose of the workshop was to understand current practice in describing datasets and where the DCAT vocabulary needs improvement. Phil Archer has written a very […]

In December 2016 I presented at the Smart Descriptions and Smarter Vocabularies workshop on the Health Care and Life Sciences Community Profile for describing datasets, and our validation tool (Validata). Presentations included below.

The purpose of the workshop was to understand current practice in describing datasets and where the DCAT vocabulary needs improvement. Phil Archer has written a very comprehensive report covering the workshop. A charter is being drawn up for a W3C working group to develop the next iteration of the DCAT vocabulary.

HCLS Tutorial at SWAT4LS 2016

On 5 December 2016 I presented a tutorial [1] on the Heath Care and Life Sciences Community Profile (HCLS Datasets) at the 9th International Semantic Web Applications and Tools for the Life Sciences Conference (SWAT4LS 2016). Below you can find the slides I presented. The 61 metadata properties from 18 vocabularies reused in the HCLS Community […]

On 5 December 2016 I presented a tutorial [1] on the Heath Care and Life Sciences Community Profile (HCLS Datasets) at the 9th International Semantic Web Applications and Tools for the Life Sciences Conference (SWAT4LS 2016). Below you can find the slides I presented.

The 61 metadata properties from 18 vocabularies reused in the HCLS Community Profile are available in this spreadsheet (.ods).

[1] M. Dumontier, A. J. G. Gray, and S. M. Marshall, “Describing Datasets with the Health Care and Life Sciences Community Profile,” in Semantic Web Applications and Tools for Life Sciences (SWAT4LS 2016), Amsterdam, The Netherlands, 2016.
[Bibtex]
@InProceedings{Gray2016SWAT4LSTutorial,
abstract = {Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting HCLS community profile covers elements of description, identification, attribution, versioning, provenance, and content summarization. The HCLS community profile reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine readable descriptions of versioned datasets. The goal of this tutorial is to explain elements of the HCLS community profile and to enable users to craft and validate descriptions for datasets of interest.},
author = {Michel Dumontier and Alasdair J. G. Gray and M. Scott Marshall},
title = {Describing Datasets with the Health Care and Life Sciences Community Profile},
OPTcrossref = {},
OPTkey = {},
booktitle = {Semantic Web Applications and Tools for Life Sciences (SWAT4LS 2016)},
year = {2016},
OPTeditor = {},
OPTvolume = {},
OPTnumber = {},
OPTseries = {},
OPTpages = {},
month = dec,
address = {Amsterdam, The Netherlands},
OPTorganization = {},
OPTpublisher = {},
note = {(Tutorial)},
url = {http://www.swat4ls.org/workshops/amsterdam2016/tutorials/t2/},
OPTannote = {}
}

HCLS Tutorial at SWAT4LS 2016

On 5 December 2016 I presented a tutorial [1] on the Heath Care and Life Sciences Community Profile (HCLS Datasets) at the 9th International Semantic Web Applications and Tools for the Life Sciences Conference (SWAT4LS 2016). Below you can find the slides I presented. The 61 metadata properties from 18 vocabularies reused in the HCLS Community […]

On 5 December 2016 I presented a tutorial [1] on the Heath Care and Life Sciences Community Profile (HCLS Datasets) at the 9th International Semantic Web Applications and Tools for the Life Sciences Conference (SWAT4LS 2016). Below you can find the slides I presented.

The 61 metadata properties from 18 vocabularies reused in the HCLS Community Profile are available in this spreadsheet (.ods).

[1] M. Dumontier, A. J. G. Gray, and S. M. Marshall, “Describing Datasets with the Health Care and Life Sciences Community Profile,” in Semantic Web Applications and Tools for Life Sciences (SWAT4LS 2016), Amsterdam, The Netherlands, 2016.
[Bibtex]
@InProceedings{Gray2016SWAT4LSTutorial,
abstract = {Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting HCLS community profile covers elements of description, identification, attribution, versioning, provenance, and content summarization. The HCLS community profile reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine readable descriptions of versioned datasets. The goal of this tutorial is to explain elements of the HCLS community profile and to enable users to craft and validate descriptions for datasets of interest.},
author = {Michel Dumontier and Alasdair J. G. Gray and M. Scott Marshall},
title = {Describing Datasets with the Health Care and Life Sciences Community Profile},
OPTcrossref = {},
OPTkey = {},
booktitle = {Semantic Web Applications and Tools for Life Sciences (SWAT4LS 2016)},
year = {2016},
OPTeditor = {},
OPTvolume = {},
OPTnumber = {},
OPTseries = {},
OPTpages = {},
month = dec,
address = {Amsterdam, The Netherlands},
OPTorganization = {},
OPTpublisher = {},
note = {(Tutorial)},
url = {http://www.swat4ls.org/workshops/amsterdam2016/tutorials/t2/},
OPTannote = {}
}

Seminar: Managing Domain-Aware Lexical Knowledge

Date: 11:15, 10 October 2016

Venue: F.17. Colin Maclaurin Building, Heriot-Watt University

Title: Managing Domain-Aware Lexical Knowledge

Speaker: David Leoni, Heriot-Watt University

Abstract: The talk will describe the implementation of Diversicon, a new open source system for extending and integrating terminologies as found in Wordnet databases. Issues on knowledge formats, standards, and open source development will be discussed. As a practical use case, we connected Diversicon to the the S-Match semantic matcher tool in order to support domain-aware semantic matching (http://semanticmatching.org).

Seminar: Language-integrated Provenance

Wher ProvenanceDate: 11:15, 3 October 2016

Venue: F.17. Colin Maclaurin Building, Heriot-Watt University

Title: Language-integrated Provenance

Speaker: Stefan Fehrenbach,  Informatics, University of Edinburgh

Abstract: Provenance, or information about the origin or derivation of data, is important for assessing the trustworthiness of data and identifying and correcting mistakes. Most prior implementations of data provenance have involved heavyweight modifications to database systems and little attention has been paid to how the provenance data can be used outside such a system. We present extensions to the Links programming language that build on its support for language-integrated query to support provenance queries by rewriting and normalizing monadic comprehensions and extending the type system to distinguish provenance metadata from normal data. We show that the two most common forms of provenance can be implemented efficiently and used safely as a programming language feature with no changes to the database system.

Bio: Stefan is a second year PhD student at the University of Edinburgh where he works with James Cheney on language support for provenance.

Seminar: Data Integration Support for Offshore Decommissioning Waste Management

Oli RigDate: 11:15, 26 September 2016

Venue: F.17. Colin Maclaurin Building, Heriot-Watt University

Title: Data Integration Support for Offshore Decommissioning Waste Management

Speaker: Abiodun Akinyemi, School of Energy, Geoscience, Infrastructure and Society (EGIS), Heriot-Watt University

Abstract: Offshore decommissioning activities represent a significant business opportunity for UK contracting and consulting companies, albeit they constitute liability to the owners of the assets – because of the cost – and UK government – because of tax relief. The silver lining is that waste reuse can bring some reprieve as savings from the sales of decommissioned facility items can reduce the overall removal cost to an asset owner. However, characterizing an asset inventory to determine which decommissioned facility items can be reused is prone to errors because of the difficulty involved in integrating asset data from different sources in a meaningful way. This research investigates a data integration framework, which enables rapid assessment of items to be decommissioned, to inform circular economy principles. It evaluates existing practices in the domain and devises a mechanisms for higher productivity using the semantic web and ISO 15926.

Bio: Abiodun Akinyemi is a PhD student at the School of Energy, Geoscience, Infrastructure and Society at Heriot-Watt University. He has an MPhil in Engineering from the University of Cambridge and has worked on Asset Information Management in the oil and gas industry for over 8 years.

Open PHACTS Closing Symposium

For the last 5 years I have had the pleasure of working with the Open PHACTS project. Sadly, the project is now at an end. To celebrate we are having a two day symposium to look over the contributions of the project and its future legacy. The project has been hugely successful in developing an […]

For the last 5 years I have had the pleasure of working with the Open PHACTS project. Sadly, the project is now at an end. To celebrate we are having a two day symposium to look over the contributions of the project and its future legacy.

The project has been hugely successful in developing an integrated data platform to enable drug discovery research (see a future post for details to support this claim). The result of the project is the Open PHACTS Foundation which will now own the drug discovery platform and sustain its development into the future.

Here are my slides on the state of the data in the Open PHACTS 2.0 platform.