Seminar: PhD Progression Talks

A double bill of PhD progression talks (abstracts below):

Venue: 3.07 Earl Mountbatten Building, Heriot-Watt University, Edinburgh

Time and Date: 11:15, 8 May 2017

Evaluating Record Linkage Techniques

Ahmad Alsadeeqi

Many computer algorithms have been developed to automatically link historical records based on a variety of string matching techniques. These generate an assessment of how likely two records are to be the same. However, it remains unclear how to assess the quality of the computed linkages due to the absence of absolute knowledge of the correct linkage of real historical records – the ground truth. The creation of synthetically generated datasets for which the ground truth linkage is known helps with the assessment of linkage algorithms, but the data generated is too clean to be representative of historical records.
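As an illustration of the kind of score such algorithms produce, the sketch below averages a per-field string similarity over a pair of records. The field names and the use of Python's standard-library difflib are illustrative stand-ins for the specialised string matching techniques (e.g. Levenshtein or Jaro-Winkler) that real linkage systems use.

```python
import difflib

def field_similarity(a, b):
    # Similarity in [0, 1] between two field values, case-insensitive.
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(record_a, record_b, fields=("forename", "surname")):
    # Average the per-field similarities into a single linkage score.
    return sum(field_similarity(record_a[f], record_b[f]) for f in fields) / len(fields)

a = {"forename": "Alasdair", "surname": "Gray"}
b = {"forename": "Alastair", "surname": "Grey"}
print(round(match_score(a, b), 2))
```

A linkage algorithm would then threshold this score to decide whether the two records refer to the same person; choosing that threshold is exactly where ground truth is needed.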

We are interested in assessing data linkage algorithms under different data quality scenarios, e.g. with errors typically introduced by a transcription process or where books have been nibbled by mice. We are developing a data corrupting model that injects corruptions into datasets based on given corruption methods and probabilities. We have classified the different forms of corruption found in historical records into four types based on the effect scope of the corruption: character level (e.g. an f represented as an s – OCR corruptions), attribute level (e.g. a gender swap – male changed to female due to a false entry), record level (e.g. missing records due to reasons such as the loss of a certificate), and group-of-records level (e.g. coffee spilt over a page, or parish records lost in a fire). This will give us the ability to evaluate record linkage algorithms over synthetically generated datasets with known ground truth and with data corruptions matching a given profile.
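A minimal sketch of how such a corrupting model might be structured, applying each corruption method with a given probability. The method names and probabilities are illustrative assumptions, not the actual model from the talk.

```python
import random

def ocr_long_s(record):
    # Character-level corruption: OCR confusing a long s with an f,
    # simplified here as replacing 's' with 'f' in the surname.
    record = dict(record)
    record["surname"] = record["surname"].replace("s", "f")
    return record

def gender_swap(record):
    # Attribute-level corruption: gender recorded incorrectly at entry.
    record = dict(record)
    record["gender"] = "F" if record["gender"] == "M" else "M"
    return record

def corrupt_dataset(records, methods, rng=None):
    """Apply each (probability, method) pair independently to every record."""
    rng = rng or random.Random(42)  # fixed seed keeps runs repeatable
    out = []
    for record in records:
        for probability, method in methods:
            if rng.random() < probability:
                record = method(record)
        out.append(record)
    return out

clean = [{"surname": "Wilson", "gender": "M"}]
corrupted = corrupt_dataset(clean, [(1.0, ocr_long_s), (0.0, gender_swap)])
print(corrupted[0]["surname"])
```

Because the clean dataset is kept intact, the true linkage is known, and a linkage algorithm's output over the corrupted copy can be scored against it.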

Computer-Aided Biomimetics: Knowledge Extraction

Ruben Kruiper

Biologically inspired design concerns copying ideas from nature to various other domains, e.g. natural computing. Biomimetics is a sub-field of biologically inspired design and focuses specifically on solving technical/engineering problems. Because engineers lack biological knowledge, the process of biomimetics is non-trivial and remains adventitious. Therefore, computational tools have been developed that aim to support engineers during a biomimetics process by integrating large amounts of relevant biological knowledge. Existing tools apply NLP techniques to biological research papers to build dedicated knowledge bases. However, these tools impose an engineering view on biological data. I will talk about the support that ‘Computer-Aided Biomimetics’ tools should provide, introducing a theoretical basis for further research on the appropriate computational techniques.

Smart Descriptions & Smarter Vocabularies (SDSVoc) Report

In December 2016 I presented at the Smart Descriptions and Smarter Vocabularies workshop on the Health Care and Life Sciences Community Profile for describing datasets, and our validation tool (Validata). The presentations are included below.

The purpose of the workshop was to understand current practice in describing datasets and where the DCAT vocabulary needs improvement. Phil Archer has written a very comprehensive report covering the workshop. A charter is being drawn up for a W3C working group to develop the next iteration of the DCAT vocabulary.
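For concreteness, a minimal DCAT dataset description might look like the following JSON-LD, built here with only the Python standard library. The dataset IRI and property values are invented for the example.

```python
import json

# A minimal DCAT dataset description as JSON-LD. All IRIs and values
# are illustrative, not from a real catalogue.
description = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@id": "http://example.org/dataset/example",
    "@type": "dcat:Dataset",
    "dct:title": "Example dataset",
    "dct:description": "An illustrative dataset description.",
    "dcat:distribution": {
        "@type": "dcat:Distribution",
        "dcat:downloadURL": "http://example.org/dataset/example.csv",
        "dcat:mediaType": "text/csv",
    },
}

print(json.dumps(description, indent=2))
```

Community profiles such as the HCLS one constrain descriptions like this by fixing which properties are mandatory, recommended, or optional.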

Research Blog: Facilitating the discovery of public datasets

Google are doing some interesting work on making datasets, in particular scientific datasets, more discoverable with schema.org markup. This is closely related to the bioschemas community project.

Source: Research Blog: Facilitating the discovery of public datasets
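As an illustration of that markup, a schema.org Dataset description is typically embedded in a page as a JSON-LD script element. All names and values below are invented for the example.

```python
import json

# Sketch of schema.org Dataset markup, of the kind a search engine could
# pick up from a page. Values are illustrative only.
markup = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example protein annotations",
    "description": "An illustrative scientific dataset.",
    "url": "https://example.org/datasets/proteins",
    "keywords": ["proteins", "annotations"],
}

# Wrap the JSON-LD in the script tag that would sit in the page's HTML.
script_tag = (
    '<script type="application/ld+json">\n'
    + json.dumps(markup, indent=2)
    + "\n</script>"
)
print(script_tag)
```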

New Paper: Reproducibility with Administrative Data

Our journal article [1] looks at encouraging good practice to enable reproducible data analysis workflows. It is the result of a collaboration between social scientists and a computer scientist with ADRC-Scotland.

Abstract: Powerful new social science data resources are emerging. One particularly important source is administrative data, which were originally collected for organisational purposes but often contain information that is suitable for social science research. In this paper we outline the concept of reproducible research in relation to micro-level administrative social science data. Our central claim is that a planned and organised workflow is essential for high quality research using micro-level administrative social science data. We argue that it is essential for researchers to share research code, because code sharing enables the elements of reproducible research. First, it enables results to be duplicated and therefore allows the accuracy and validity of analyses to be evaluated. Second, it facilitates further tests of the robustness of the original piece of research. Drawing on insights from computer science and other disciplines that have been engaged in e-Research we discuss and advocate the use of Git repositories to provide a useable and effective solution to research code sharing and rendering social science research using micro-level administrative data reproducible.

[1] [doi] C. J. Playford, V. Gayle, R. Connelly, and A. J. Gray, “Administrative social science data: The challenge of reproducible research,” Big Data & Society, vol. 3, iss. 2, 2016.
@Article{Playford2016BDS,
  author   = {Christopher J Playford and Vernon Gayle and Roxanne Connelly and Alasdair JG Gray},
  title    = {Administrative social science data: The challenge of reproducible research},
  journal  = {Big Data \& Society},
  year     = {2016},
  volume   = {3},
  number   = {2},
  month    = dec,
  url      = {http://journals.sagepub.com/doi/full/10.1177/2053951716684143},
  doi      = {10.1177/2053951716684143},
  abstract = {Powerful new social science data resources are emerging. One particularly important source is administrative data, which were originally collected for organisational purposes but often contain information that is suitable for social science research. In this paper we outline the concept of reproducible research in relation to micro-level administrative social science data. Our central claim is that a planned and organised workflow is essential for high quality research using micro-level administrative social science data. We argue that it is essential for researchers to share research code, because code sharing enables the elements of reproducible research. First, it enables results to be duplicated and therefore allows the accuracy and validity of analyses to be evaluated. Second, it facilitates further tests of the robustness of the original piece of research. Drawing on insights from computer science and other disciplines that have been engaged in e-Research we discuss and advocate the use of Git repositories to provide a useable and effective solution to research code sharing and rendering social science research using micro-level administrative data reproducible.}
}

ISWC 2016 Trip Report

It has now been almost two months since ISWC 2016 where I was the Resources Track chair with Marta Sabou. This has given me time to reflect on the conference, in between a hectic schedule of project meetings, workshops, conferences, and a PhD viva.

The most enjoyable part of the conference for me was the CoLD Workshop Debate on the State of Linked Data. The workshop organisers had arranged for six prominent proponents of Linked Data to argue that we have failed and that Linked Data will die away.

  1. Ruben Verborgh argued that Linked Data will be destroyed by the need to centralise data, poor infrastructure, and the research community. (Aside: there was certainly concern on the final point, as there were only three females in the room.)
  2. Axel Polleres took the motto, “Let’s make RDF great again!” His central argument was that most open data is actually published in CSV format and that a lot can be achieved with 3-star open data.
  3. Paul Groth argued that we should concentrate on making our data processable by machines. What we currently have is a format aimed at both humans and machines that satisfies neither.
  4. Chris Bizer covered cost incentives. While there is an incentive to provide some basic schema markup on pages, i.e. getting picked up by search engines, there is no financial incentive to provide links to other resources. My take on this is that there is a disincentive, as it would take traffic away from your (eCommerce) site and therefore lose you revenue.
  5. Avi Bernstein then did a fantastic impression of a Wee Free minister, telling us that we had all sinned and were following the wrong path; all fire and brimstone.
  6. Juan Reutter argued that we needed to provide a workable ecosystem.

So the question is, has the Linked Data community failed? I think the debate highlighted that the community has made many contributions in a short space of time, but that it is time to get this into the mainstream. Perhaps our community is not the best for doing the required sales job, but we have had some successes, e.g. the EBI RDF platform, the Open PHACTS Drug Discovery Platform, and the BBC Olympics website.

The main conference was underpinned by three fantastic and varied keynotes. First was Kathleen McKeown, who gave us insights into the extraction of knowledge from different forms of text. Second was Christian Bizer, whose main message was that we as a community need to take structured data in whatever form it comes, just as search engines have exploited metadata and page structure for a long time. Finally, Hiroaki Kitano of the Sony Corporation gave what has got to be the densest keynote I have ever heard, with more ideas per minute than a dance tune has beats. His challenge to the community was that we should aim to have an AI system win a scientific Nobel Prize by 2050: the system should develop a hypothesis, test it, and generate a groundbreaking conclusion worthy of the prize.

There were many great and varied talks during the conference. It really is worth looking through the programme to find those of interest to you (all the papers are linked and available). As ever, the poster and demo session, advertised in the minute madness session, demonstrated the breadth of cutting-edge work going on in the community, as did the lightning talk session.

The final day of the conference was particularly weird for me. As the chair of a session I ended up sharing a bottle of fine Italian wine with a presenter during his talk (it would have been rude not to), and experiencing an earthquake during a presentation on an ontology for modelling the soil beneath our cities, in particular the causes of damage to that soil.

The conference afforded some opportunities for fun as well. A few of the organising committee managed to visit the K computer, the world’s fifth-fastest supercomputer, which is cooled with water. The computer was revealed in a very James Bond fashion, like the unveiling of the evil enemy’s master plan: “Now I’m going to have to kill you!” There was also a highly entertaining samurai sword fighting demonstration during the conference banquet.

During the conference, my Facebook feed was filled with exclamations about the complexity of the toilets. Following the conference, it was filled with exclamations of returning to lands of uncivilised toilets. Make of this what you will.

HCLS Tutorial at SWAT4LS 2016

On 5 December 2016 I presented a tutorial [1] on the Health Care and Life Sciences Community Profile (HCLS Datasets) at the 9th International Semantic Web Applications and Tools for the Life Sciences Conference (SWAT4LS 2016). Below you can find the slides I presented.

The 61 metadata properties from 18 vocabularies reused in the HCLS Community Profile are available in this spreadsheet (.ods).
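To give a flavour of the profile, the sketch below describes a single dataset version using a handful of the reused properties (dct:, dcat:, pav:). It is an illustrative fragment with invented IRIs, not the full 61-property profile.

```python
import json

# A versioned dataset description in the spirit of the HCLS Community
# Profile, as JSON-LD. All IRIs and values are illustrative.
version_description = {
    "@context": {
        "dct": "http://purl.org/dc/terms/",
        "dcat": "http://www.w3.org/ns/dcat#",
        "pav": "http://purl.org/pav/",
    },
    "@id": "http://example.org/dataset/example/2016-12",
    "@type": "dcat:Dataset",
    "dct:title": "Example dataset (December 2016 release)",
    "pav:version": "2016-12",
    "pav:previousVersion": {"@id": "http://example.org/dataset/example/2016-11"},
    "dct:isVersionOf": {"@id": "http://example.org/dataset/example"},
}

print(json.dumps(version_description, indent=2))
```

The profile distinguishes summary-level, version-level, and distribution-level descriptions; the fragment above corresponds to the version level, linked back to its summary via dct:isVersionOf.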

[1] M. Dumontier, A. J. G. Gray, and S. M. Marshall, “Describing Datasets with the Health Care and Life Sciences Community Profile,” in Semantic Web Applications and Tools for Life Sciences (SWAT4LS 2016), Amsterdam, The Netherlands, 2016.
@InProceedings{Gray2016SWAT4LSTutorial,
  author    = {Michel Dumontier and Alasdair J. G. Gray and M. Scott Marshall},
  title     = {Describing Datasets with the Health Care and Life Sciences Community Profile},
  booktitle = {Semantic Web Applications and Tools for Life Sciences (SWAT4LS 2016)},
  year      = {2016},
  month     = dec,
  address   = {Amsterdam, The Netherlands},
  note      = {(Tutorial)},
  url       = {http://www.swat4ls.org/workshops/amsterdam2016/tutorials/t2/},
  abstract  = {Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting HCLS community profile covers elements of description, identification, attribution, versioning, provenance, and content summarization. The HCLS community profile reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine readable descriptions of versioned datasets. The goal of this tutorial is to explain elements of the HCLS community profile and to enable users to craft and validate descriptions for datasets of interest.}
}

Celebrating 50 years of Computer Science at HWU

Old hardware

Display of old equipment used within computer science.

This year sees a double celebration in the Department of Computer Science at Heriot-Watt University – it is 50 years since we launched the first BSc Computer Science degree in Scotland, and 50 years since Heriot-Watt was granted university status. To celebrate we had a series of events last week including an open day and dinner for former staff and students.

During the open day we had a variety of displays and activities to highlight the current research taking place in the department. There was a display of some of the old equipment that has been used in the department. While this mostly focused on storage media, it also included my first computer – a BBC Model B. Admittedly, a lot of games were played on it in my youth.

Pepper robot

Demonstration of the Pepper robot that is being used by the Interaction lab to improve speech interactions.

Each of the labs in the department had displays, including the new Pepper robot in the Interaction Lab and one of the Nao robots from the Robotics Lab. The Interactive and Trustworthy Technologies Lab were demonstrating the interactive games they have developed to help with rehabilitation after falls and knee replacements. The Semantic Web Lab were demonstrating the difficulties of reconstructing a family tree using vital records information.

At the dinner in the evening we had two guest speakers: Alex Balfour, the first head of department and instigator of the degree programme, and Ian Ritchie, entrepreneur and former graduate. Both gave entertaining speeches reflecting on their time in the department and their experiences of the Mountbatten Building, now the Apex Hotel in the Grassmarket, where we had the dinner.

See these pages for more about the history of computer science at Heriot-Watt.

Genealogy reconstruction game

Current PhD students attempting to reconstruct a family tree from their entries in the birth, marriage, and death records.

rehab-game

Game to help rehabilitation patients perform their physiotherapy exercises correctly.
