SLiDInG 6

Today, the Semantic Web Lab hosted the 6th Scottish Linked Data Interest Group workshop at Heriot-Watt University. The event, sponsored by the SICSA Data Science Theme, was well attended: 30 researchers from across Scotland (and Newcastle) came together for a day of flash talks and discussions. Live minutes were captured during the day and can be found here.

I gave a talk on the successes and challenges of FAIR data. My slides are embedded below.

Seminar: Using Interactive Visualisations to Analyse the Structure and Treatment of Topics in Learning Materials

Title: Using Interactive Visualisations to Analyse the Structure and Treatment of Topics in Learning Materials

Speaker: Tanya Howden, Heriot-Watt University

Date: 11:30 on 14 May 2018

Location: CM F.17, Heriot-Watt University

Abstract: With the amount of information available online growing, it is becoming more and more difficult to find what you are looking for, particularly when you’re in an area that you have very little background in. For example, if you were learning about neural networks for the first time, the number of responses you get from a simple Google search can be overwhelming – how do you know where to start?! This is only one of the many challenges faced when searching for appropriate learning materials.

In this talk, I will discuss the motivations behind my research interests before introducing and demonstrating a prototype created with the aim of giving learners a more engaging environment, with unified organisation of and access to different materials on one subject.

UK Ontology Network 2018

This week I went to the UK Ontology Network meeting hosted at Keele University. There was an interesting array of talks in the programme showing the breadth of work going on in the UK.

I gave a talk on the Bioschemas Community (slides below) and Leyla Garcia presented a poster providing more details of the current Bioschemas profiles.

The UK Ontology Network is going through a reflection phase and would like interested parties to complete the following online survey.

 

An Identifier Scheme for the Digitising Scotland Project

The Digitising Scotland project is having the vital records of Scotland transcribed from images of the original handwritten civil registers. Linking the resulting dataset of 24 million vital records covering the lives of 18 million people is a major challenge requiring improved record linkage techniques. Discussions within the multidisciplinary, widely distributed Digitising Scotland project team have been hampered by the teams in each of the institutions using their own identification scheme. To enable fruitful discussions within the Digitising Scotland team, we required a mechanism for uniquely identifying each individual represented on the certificates. From the identifier it should be possible to determine the type of certificate and the role each person played. We have devised a protocol to generate a unique identifier for any individual on a certificate, without using a computer, by exploiting the National Records of Scotland’s registration districts. Importantly, the approach does not rely on the handwritten content of the certificates, which reduces the risk of the content being misread and producing an incorrect identifier. The resulting identifier scheme has improved the internal discussions within the project. This paper discusses the rationale behind the chosen identifier scheme, and presents the format of the different identifiers.
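To make the idea of an identifier that encodes certificate type and role concrete, here is a purely illustrative sketch. The field layout, separator, type codes and role names below are all assumptions made for this example; the actual format is the one defined in the paper and is not reproduced here.

```python
from dataclasses import dataclass

# Hypothetical certificate type codes, invented for this sketch.
CERT_TYPES = {"B": "birth", "D": "death", "M": "marriage"}


@dataclass(frozen=True)
class PersonId:
    """Hypothetical identifier for one person on one certificate."""
    cert_type: str  # one of CERT_TYPES
    year: int       # registration year (assumed field)
    district: str   # registration district number (assumed field)
    entry: int      # entry number within the register (assumed field)
    role: str       # role played on the certificate, e.g. "mother"

    def __str__(self) -> str:
        return f"{self.cert_type}-{self.year}-{self.district}-{self.entry}-{self.role}"


def parse_person_id(text: str) -> PersonId:
    """Split an identifier back into its parts, rejecting unknown types."""
    cert_type, year, district, entry, role = text.split("-")
    if cert_type not in CERT_TYPES:
        raise ValueError(f"unknown certificate type: {cert_type}")
    return PersonId(cert_type, int(year), district, int(entry), role)


pid = PersonId("B", 1861, "114", 27, "mother")
print(pid)
print(CERT_TYPES[parse_person_id(str(pid)).cert_type])
```

The point of the sketch is only that every component comes from the register structure rather than from transcribed names or dates, so two people reading the same certificate would write down the same identifier.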

The work reported in the paper was supported by the British ESRC under grants ES/K00574X/1 (Digitising Scotland) and ES/L007487/1 (Administrative Data Research Centre – Scotland).

My coauthors are:

  • Özgür Akgün, University of St Andrews
  • Ahmad Alsadeeqi, Heriot-Watt University
  • Peter Christen, Australian National University
  • Tom Dalton, University of St Andrews
  • Alan Dearle, University of St Andrews
  • Chris Dibben, University of Edinburgh
  • Eilidh Garrett, University of Essex
  • Graham Kirby, University of St Andrews
  • Alice Reid, University of Cambridge
  • Lee Williamson, University of Edinburgh

The work reported in this talk is the result of the Digitising Scotland Raasay Retreat. Also at the retreat were:

  • Julia Jennings, University of Albany
  • Christine Jones
  • Diego Ramiro-Farinas, Centre for Human and Social Sciences (CCHS) of the Spanish National Research Council (CSIC)

Seminar: PhD Progression Talks

A double bill of PhD progression talks (abstracts below):

Venue: 3.07 Earl Mountbatten Building, Heriot-Watt University, Edinburgh

Time and Date: 11:15, 8 May 2017

Evaluating Record Linkage Techniques

Ahmad Alsadeeqi

Many computer algorithms have been developed to automatically link historical records based on a variety of string matching techniques. These generate an assessment of how likely two records are to be the same. However, it remains unclear how to assess the quality of the linkages computed due to the absence of absolute knowledge of the correct linkage of real historical records – the ground truth. The creation of synthetically generated datasets for which the ground truth linkage is known helps with the assessment of linkage algorithms but the data generated is too clean to be representative of historical records.
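As a rough illustration of string-matching linkage, the sketch below scores a pair of records by averaging per-field string similarities. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure; the field names and example records are invented, and this is not one of the algorithms evaluated in the talk.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Return a similarity in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def record_score(rec_a: dict, rec_b: dict, fields: list) -> float:
    """Average the per-field similarities over the chosen fields."""
    scores = [field_similarity(rec_a.get(f, ""), rec_b.get(f, "")) for f in fields]
    return sum(scores) / len(scores)


# Invented example records: a and b plausibly describe the same person.
a = {"forename": "Margaret", "surname": "McDonald", "parish": "Portree"}
b = {"forename": "Margret", "surname": "MacDonald", "parish": "Portree"}
c = {"forename": "John", "surname": "Smith", "parish": "Leith"}

fields = ["forename", "surname", "parish"]
print(record_score(a, b, fields))  # high score: likely the same person
print(record_score(a, c, fields))  # low score: likely different people
```

A score like this only says how alike two records look. Without ground truth there is no way to tell whether a high-scoring pair is actually a correct link, which is the evaluation problem described above.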

We are interested in assessing data linkage algorithms under different data quality scenarios, e.g. with errors typically introduced by a transcription process or where books have been nibbled by mice. We are developing a data corruption model that injects corruptions into datasets based on given corruption methods and probabilities. We have classified the different forms of corruption found in historical records into four types based on the scope of their effect: character level (e.g. an f is represented as an s – OCR corruptions), attribute level (e.g. gender swap – male changed to female due to a false entry), record level (e.g. records missing for reasons such as the loss of a certificate), and group of records level (e.g. coffee spilt over a page, parish records lost in a fire). This will give us the ability to evaluate record linkage algorithms over synthetically generated datasets with known ground truth and with data corruptions matching a given profile.
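The four corruption levels can be sketched as a small pipeline that applies each corruption method with a given probability. This is a minimal sketch under stated assumptions, not the model being developed in the PhD: the function names, record fields and the way probabilities are supplied are all invented for illustration.

```python
import random


def corrupt_character(record, rng):
    """Character level: replace the first 'f' in the surname with 's', as OCR might.

    rng is unused here but kept so all methods share one signature.
    """
    out = dict(record)
    out["surname"] = out["surname"].replace("f", "s", 1)
    return out


def corrupt_attribute(record, rng):
    """Attribute level: swap the recorded gender."""
    out = dict(record)
    out["gender"] = "female" if out["gender"] == "male" else "male"
    return out


def corrupt_dataset(records, methods, p_missing_record, lost_group=None, seed=0):
    """Return a corrupted copy of records; the input records are not modified.

    methods is a list of (function, probability) pairs applied to each record.
    """
    rng = random.Random(seed)  # seeded so a corruption run can be repeated
    corrupted = []
    for rec in records:
        # Group-of-records level: e.g. a whole parish register lost in a fire.
        if lost_group is not None and lost_group(rec):
            continue
        # Record level: e.g. a single certificate has been lost.
        if rng.random() < p_missing_record:
            continue
        # Character and attribute level corruptions.
        for method, prob in methods:
            if rng.random() < prob:
                rec = method(rec, rng)
        corrupted.append(rec)
    return corrupted


records = [
    {"id": 1, "surname": "fraser", "gender": "male", "parish": "Portree"},
    {"id": 2, "surname": "duff", "gender": "female", "parish": "Portree"},
    {"id": 3, "surname": "ross", "gender": "male", "parish": "Snizort"},
]

noisy = corrupt_dataset(
    records,
    methods=[(corrupt_character, 0.3), (corrupt_attribute, 0.1)],
    p_missing_record=0.05,
    lost_group=lambda r: r["parish"] == "Snizort",
)
print(noisy)
```

Because each surviving record keeps its `id`, the ground truth linkage is still known after corruption, so a linkage algorithm can be scored against it.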

Computer-Aided Biomimetics: Knowledge Extraction

Ruben Kruiper

Biologically inspired design concerns copying ideas from nature to various other domains, e.g. natural computing. Biomimetics is a sub-field of biologically inspired design and focuses specifically on solving technical/engineering problems. Because engineers lack biological knowledge, the process of biomimetics is non-trivial and remains adventitious. Therefore, computational tools have been developed that aim to support engineers during a biomimetics process by integrating large amounts of relevant biological knowledge. Existing tools apply NLP techniques to biological research papers to build dedicated knowledge bases. However, these existing tools impose an engineering view on biological data. I will talk about the support that ‘Computer-Aided Biomimetics’ tools should provide, introducing a theoretical basis for further research on the appropriate computational techniques.

Supporting Dataset Descriptions in the Life Sciences

Seminar talk given at the EBI on 5 April 2017.

Abstract: Machine processable descriptions of datasets can help make data more FAIR; that is Findable, Accessible, Interoperable, and Reusable. However, there are a variety of metadata profiles for describing datasets, some specific to the life sciences and others more generic in their focus. Each profile has its own set of properties and requirements as to which must be provided and which are more optional. Developing a dataset description for a given dataset to conform to a specific metadata profile is a challenging process.

In this talk, I will give an overview of some of the dataset description specifications that are available. I will discuss the difficulties in writing a dataset description that conforms to a profile and the tooling that I’ve developed to support dataset publishers in creating metadata descriptions and validating them against a chosen specification.
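To show what validating a description against a profile involves, here is a minimal sketch that checks a description for required and recommended properties. The profile below is a made-up four-property example, not the real HCLS or DCAT requirements, and plain dictionaries stand in for RDF; it is not a description of the tooling mentioned in the talk.

```python
# Hypothetical profile: property name -> requirement level.
PROFILE = {
    "dct:title": "MUST",
    "dct:description": "MUST",
    "dct:license": "SHOULD",
    "dcat:keyword": "MAY",
}


def validate(description: dict, profile: dict) -> dict:
    """Report missing MUST properties as errors and missing SHOULD ones as warnings."""
    errors = [p for p, level in profile.items()
              if level == "MUST" and not description.get(p)]
    warnings = [p for p, level in profile.items()
                if level == "SHOULD" and not description.get(p)]
    return {"errors": errors, "warnings": warnings}


description = {"dct:title": "Example dataset"}
print(validate(description, PROFILE))
```

Even in this toy form the difficulty is visible: a publisher has to know which properties a given profile requires before they can write a conforming description, and each profile draws that line differently.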

Smart Descriptions & Smarter Vocabularies (SDSVoc) Report

In December 2016 I presented at the Smart Descriptions and Smarter Vocabularies workshop on the Health Care and Life Sciences Community Profile for describing datasets, and our validation tool (Validata). Presentations included below.

The purpose of the workshop was to understand current practice in describing datasets and where the DCAT vocabulary needs improvement. Phil Archer has written a very comprehensive report covering the workshop. A charter is being drawn up for a W3C working group to develop the next iteration of the DCAT vocabulary.

HCLS Tutorial at SWAT4LS 2016

On 5 December 2016 I presented a tutorial [1] on the Health Care and Life Sciences Community Profile (HCLS Datasets) at the 9th International Semantic Web Applications and Tools for the Life Sciences Conference (SWAT4LS 2016). Below you can find the slides I presented.

The 61 metadata properties from 18 vocabularies reused in the HCLS Community Profile are available in this spreadsheet (.ods).

[1] M. Dumontier, A. J. G. Gray, and S. M. Marshall, “Describing Datasets with the Health Care and Life Sciences Community Profile,” in Semantic Web Applications and Tools for Life Sciences (SWAT4LS 2016), Amsterdam, The Netherlands, 2016.
[Bibtex]
@InProceedings{Gray2016SWAT4LSTutorial,
abstract = {Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting HCLS community profile covers elements of description, identification, attribution, versioning, provenance, and content summarization. The HCLS community profile reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine readable descriptions of versioned datasets. The goal of this tutorial is to explain elements of the HCLS community profile and to enable users to craft and validate descriptions for datasets of interest.},
author = {Michel Dumontier and Alasdair J. G. Gray and M. Scott Marshall},
title = {Describing Datasets with the Health Care and Life Sciences Community Profile},
booktitle = {Semantic Web Applications and Tools for Life Sciences (SWAT4LS 2016)},
year = {2016},
month = dec,
address = {Amsterdam, The Netherlands},
note = {(Tutorial)},
url = {http://www.swat4ls.org/workshops/amsterdam2016/tutorials/t2/},
}

Seminar: Managing Domain-Aware Lexical Knowledge

Date: 11:15, 10 October 2016

Venue: F.17, Colin Maclaurin Building, Heriot-Watt University

Title: Managing Domain-Aware Lexical Knowledge

Speaker: David Leoni, Heriot-Watt University

Abstract: The talk will describe the implementation of Diversicon, a new open source system for extending and integrating terminologies as found in WordNet databases. Issues around knowledge formats, standards, and open source development will be discussed. As a practical use case, we connected Diversicon to the S-Match semantic matcher tool in order to support domain-aware semantic matching (http://semanticmatching.org).