Presentation

Crusade for Big Data Keynote

Today I gave the keynote presentation (slides below) at the Crusade for Big Data in the AAL domain workshop as part of the EU Ambient Assisted Living Forum. I gave an overview of the way that the Open PHACTS project has overcome various Big Data challenges to provide a production-quality data integration platform that is being used to answer real pharmacology business questions.

The workshop then split into five breakout groups to discuss open challenges that Big Data poses for the AAL community. The breakout groups were:

Privacy and Ethics
Business models for sustainability
Data reuse and interoperability
Data quality
Feedback to the users

The organisers of the workshop (Femke Ongenae and Femke De Backere) will be sharing the outcomes of the brainstorming by proposing several working groups to focus on the issues in the area of AAL.

Data Integration in a Big Data Context: An Open PHACTS Case Study from Alasdair Gray

Data Integration in a Big Data Context

Today I had the pleasure of visiting the Urban Big Data Centre (UBDC) to give a seminar on Data Integration in a Big Data context (slides below). The idea for the seminar came about due to my collaboration with Nick Bailey (Associate Director of the UBDC) in the Administrative Data Research Centre for Scotland (ADRC-S).

In the seminar I wanted to highlight the challenges of data integration that arise in a Big Data context and show examples from my past work that would be relevant to those in the UBDC. In the presentation, I argue that RDF provides a good approach for data integration, but it does not solve the basic challenges of messy data and generating mappings between datasets. It does, however, lay these challenges bare on the table, as Frank van Harmelen highlighted in his SWAT4LS keynote in 2013.

The first use case is drawn from my work on the EU SemSorGrid4Env project where we were developing an integrated view for emergency response planning. The particular use case shown is that of coastal flooding on the south coast of England. Although this project finished in 2011, I am still involved with developing RDF and SPARQL continuous data extensions; see the W3C RDF Stream Processing Community Group for details.

The second use case is drawn from my work on the EU Open PHACTS project. I showed the approach we developed for supporting user-controlled views of the integrated data through Scientific Lenses. I also talked about the successes of the project, including the fact that it is being actively used for pharmacology research and receives over 20 million hits a month.

I finished the talk with an overview of the Administrative Data Research Centre for Scotland (ADRC-S) and my work on linking birth, marriage, and death records. I am hoping that we can adopt the lenses approach together with incorporating feedback on the linkages from the researchers who will use the integrated views.

In the discussions following the talk, the notion of FAIR data came up. This is the idea that data should be Findable, Accessible, Interoperable, and Reusable by both humans and machines. RDF is one approach that could lead to this. The other area of discussion was around community initiatives for converting existing open datasets into an RDF format. I advocated adopting the approach followed by the Bio2RDF community who share the tasks of creating and maintaining such scripts for biological datasets. An important part of this jigsaw is tracking the provenance of the datasets, for which the W3C Health Care and Life Sciences Community Profile for Dataset Descriptions could be beneficial (there is nothing specific to the HCLS community in the profile).

Data Integration in a Big Data Context from Alasdair Gray

SICSA Databases for the Environmental and Social Sciences

Today I attended the SICSA Databases for the Environmental and Social Sciences event hosted by Andy Cobley from the University of Dundee. I gave the talk below on the challenges of linking data.

Many areas of scientific discovery rely on combining data from multiple data sources. However, there are many challenges in linking data. This presentation highlights these challenges in the context of using Linked Data for environmental and social science databases.

CIM Best Paper

Our paper [1] presenting a framework for terminology mappings won one of two best paper awards at the First Workshop on Context, Interpretation and Meaning (CIM2014). The other award went to the paper by Amy Guy from the University of Edinburgh.

Kerstin Forsberg from AstraZeneca presented the paper. You can find her slides on SlideShare and embedded below.

A Justification-based Semantic Framework for Representing, Evaluating and Utilizing Terminology Mappings from Kerstin Forsberg

[1] Sajjad Hussain, Hong Sun, Gokce Banu Laleci Erturkmen, Mustafa Yuksel, Charles Mead, Alasdair J. G. Gray, and Kerstin Forsberg. A Justification-based Semantic Framework for Representing, Evaluating and Utilizing Terminology Mappings. In Context Interpretation and Meaning, Riva del Garda, Italy, October 2014.
[Bibtex]

@inproceedings{Hussain2014CIM,
Abstract = {Use of medical terminologies and mappings across them are considered to be crucial pre-requisites for achieving interoperable eHealth applications. However, experiences from several research projects have demonstrated that the mappings are not enough. Also the context of the mappings is needed to enable interpretation of the meaning of the mappings. Built upon these experiences, we introduce a semantic framework for representing, evaluating and utilizing terminology mappings together with the context in terms of the justifications for, and the provenance of, the mappings. The framework offers a platform for i) performing various mappings strategies, ii) representing terminology mappings together with their provenance information, and iii) enabling terminology reasoning for inferring both new and erroneous mappings. We present the results of the introduced framework using the SALUS project where we evaluated the quality of both existing and inferred terminology mappings among standard terminologies.},
Address = {Riva del Garda, Italy},
Author = {Hussain, Sajjad and Sun, Hong and Erturkmen, Gokce Banu Laleci and Yuksel, Mustafa and Mead, Charles and Gray, Alasdair J G and Forsberg, Kerstin},
Booktitle = {Context Interpretation and Meaning},
Title = {{A Justification-based Semantic Framework for Representing, Evaluating and Utilizing Terminology Mappings}},
url = {http://www.macs.hw.ac.uk/~fm206/cim14/cim20140_submission_2.pdf},
Month = oct,
Year = {2014}
}

ISWC2014 In-use Paper

Slides for my ISWC2014 In-use track paper [1] are available below.

Paper abstract:

When are two entries about a small molecule in different datasets the same? If they have the same drug name, chemical structure, or some other criteria? The choice depends upon the application to which the data will be put. However, existing Linked Data approaches provide a single global view over the data with no way of varying the notion of equivalence to be applied.

In this paper, we present an approach to enable applications to choose the equivalence criteria to apply between datasets. Thus, supporting multiple dynamic views over the Linked Data. For chemical data, we show that multiple sets of links can be automatically generated according to different equivalence criteria and published with semantic descriptions capturing their context and interpretation. This approach has been applied within a large scale public-private data integration platform for drug discovery. To cater for different use cases, the platform allows the application of different lenses which vary the equivalence rules to be applied based on the context and interpretation of the links.
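To make the idea of a lens concrete, here is a minimal sketch in plain Python. It is not the Open PHACTS implementation: the identifiers, the justification labels (`same_structure`, `same_drug_name`), the lens names and the `equivalents` function are all hypothetical, and following links transitively is a simplifying assumption of the sketch. It only illustrates the principle that each link carries the reason it was asserted, and that a lens decides which reasons count as equivalence.

```python
from collections import defaultdict

# Hypothetical links between dataset entries. Each link records the
# justification under which the two identifiers are treated as equivalent.
# All identifiers and justification names are invented for illustration.
LINKS = [
    ("chembl:A", "drugbank:X", "same_structure"),
    ("chembl:A", "chemspider:1", "same_structure"),
    ("chembl:A", "chemspider:2", "same_drug_name"),
    ("chemspider:2", "drugbank:Y", "same_structure"),
]

# A lens is modelled here as the set of justifications it accepts.
LENSES = {
    "strict": {"same_structure"},
    "relaxed": {"same_structure", "same_drug_name"},
}


def equivalents(identifier, lens_name):
    """Return all identifiers reachable from `identifier` using only
    links whose justification is accepted by the named lens."""
    accepted = LENSES[lens_name]

    # Build an undirected adjacency map from the accepted links only.
    neighbours = defaultdict(set)
    for source, target, justification in LINKS:
        if justification in accepted:
            neighbours[source].add(target)
            neighbours[target].add(source)

    # Follow accepted links transitively (an assumption of this sketch).
    seen = {identifier}
    frontier = [identifier]
    while frontier:
        current = frontier.pop()
        for other in neighbours[current] - seen:
            seen.add(other)
            frontier.append(other)
    return seen


if __name__ == "__main__":
    print(sorted(equivalents("chembl:A", "strict")))
    print(sorted(equivalents("chembl:A", "relaxed")))
```

Under the strict lens, the query for `chembl:A` only picks up the entries linked by structure, while the relaxed lens also follows the name-based link and so pulls in further entries. The same underlying links give different integrated views depending on the lens the application chooses.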

[1] Unknown bibtex entry with key [iswc2014]

Alasdair J G Gray

Connecting the dots of the world's data