At the end of June, a group of individuals from across Europe came together in Leiden for the first FAIRport–Elixir Bring Your Own Data (BYOD) workshop. None of us quite knew what would happen but we were all excited that such an event was taking place. The result was better than we expected.
This first BYOD workshop brought together Linked Data experts and data providers from MycoBase and the Human Protein Atlas (HPA). The participants were fairly evenly split between data providers (who had some, but not a lot of, RDF knowledge) and trainers (experts in semantic web technologies). The idea of the workshop was to give the data providers a hands-on training event (part tutorial, part hackathon) for making their data findable, accessible, interoperable and reusable (FAIR). The chosen approach to making the data FAIR was RDF [1]. The goal was to develop showcases that would demonstrate the added value of interoperable data for answering questions across multiple resources.
The first day of the workshop was mainly devoted to introducing the ideas of publishing data as interoperable RDF and understanding the datasets represented by the data providers. Before the afternoon was over, we’d split into two teams to work up the showcase studies using either MycoBase or HPA and the experts’ knowledge of related datasets. The day ended with a social meal by the canal in central Leiden.
The next day started with the teams feverishly working up their ideas. There was a general buzz around the room with the experts calling on each other’s knowledge across the teams to bring together working demonstrations. The day closed with a show and tell of what had been accomplished.
The MycoBase showcase focused on discovering which compounds are fermented. About 10,000 fungal strains in MycoBase were represented in RDF, resulting in 2.5 million triples. These were linked to the ChEMBL database by using the Open PHACTS Discovery Platform API to resolve chemical names present in MycoBase to their ChEMBL URIs. This formed the key linkage for integrating the two data sets, pulling in key facts (e.g. molecular weight, logP value and hydrogen bond count) from the ChEMBL database.
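To give a feel for the name-resolution step, here is a minimal Python sketch. It is not the code written at the workshop: the namespaces, the `ferments` predicate, the strain identifiers and the in-memory lookup table are all assumptions made up for illustration, and the lookup table merely stands in for a call to a name-resolution service such as the Open PHACTS API.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical namespaces, for illustration only (not the real MycoBase or ChEMBL URIs).
MYCO = "http://example.org/mycobase/"
CHEM = "http://example.org/chembl/"

# Stand-in for a remote name-resolution service: chemical name -> compound URI.
NAME_TO_URI = {
    "d-glucose": CHEM + "compound/1",
    "sucrose": CHEM + "compound/2",
}


def resolve_name(name: str) -> Optional[str]:
    """Return a compound URI for a chemical name, or None if it cannot be resolved."""
    return NAME_TO_URI.get(name.strip().lower())


def link_strains(
    records: Iterable[Tuple[str, str]],
    resolver: Callable[[str], Optional[str]] = resolve_name,
) -> Tuple[List[Triple], List[str]]:
    """Turn (strain_id, compound_name) pairs into link triples.

    Names that cannot be resolved are collected separately so they can be
    checked by hand instead of being silently dropped.
    """
    triples: List[Triple] = []
    unresolved: List[str] = []
    for strain_id, compound_name in records:
        uri = resolver(compound_name)
        if uri is None:
            unresolved.append(compound_name)
            continue
        triples.append((MYCO + "strain/" + strain_id, MYCO + "vocab/ferments", uri))
    return triples, unresolved


# Made-up example records: two resolvable names and one that is not.
records = [("S001", "D-Glucose"), ("S001", "Sucrose "), ("S002", "unknownose")]
triples, unresolved = link_strains(records)
```

Once a strain points at a compound URI, facts such as molecular weight can be pulled in from the other dataset by following that URI, which is the whole point of the linkage.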
The Human Protein Atlas (HPA) team worked up two possible showcases. The first involved discovering, for a given HPA protein, the pathways in which the protein occurs, sourced from WikiPathways. The second involved linking with the genes present in FANTOM5 and included a resolution step using the Bio2RDF version of EntrezGene. These connections were made possible by lengthy modelling discussions and the development of an RDF-generating script that converted part of the HPA relational database into an RDF representation.
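For readers wondering what an RDF-generating script looks like, the sketch below shows the general shape of one in Python. It is only an illustration under stated assumptions: the `protein` table, its columns, the `http://example.org/hpa/` namespace and the `expressedIn` predicate are invented here and do not reflect the actual HPA schema or the model agreed at the workshop. The output is written as N-Triples, one statement per line.

```python
import sqlite3
from typing import List

BASE = "http://example.org/hpa/"  # hypothetical namespace, not the real HPA one
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def escape_literal(value: str) -> str:
    """Escape the characters that must be escaped inside an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_ntriples(conn: sqlite3.Connection) -> List[str]:
    """Convert each row of the (hypothetical) protein table into N-Triples lines."""
    lines: List[str] = []
    query = "SELECT id, name, tissue FROM protein ORDER BY id"
    for protein_id, name, tissue in conn.execute(query):
        # Assumes ids and tissue names are already safe to embed in a URI.
        subject = f"<{BASE}protein/{protein_id}>"
        lines.append(f'{subject} <{RDFS_LABEL}> "{escape_literal(name)}" .')
        lines.append(f"{subject} <{BASE}vocab/expressedIn> <{BASE}tissue/{tissue}> .")
    return lines


# Build a tiny in-memory database with made-up rows to run the conversion on.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE protein (id TEXT PRIMARY KEY, name TEXT, tissue TEXT)")
conn.executemany(
    "INSERT INTO protein VALUES (?, ?, ?)",
    [("P0001", 'Example "alpha" protein', "liver"), ("P0002", "Another protein", "kidney")],
)
ntriples = rows_to_ntriples(conn)
print("\n".join(ntriples))
```

The hard part, as the teams found, is not the script itself but deciding which columns become URIs, which become literals, and which existing vocabularies to reuse for the predicates.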
Overall, the workshop was a great success. The data providers felt they had learnt about RDF and were happy with the progress that had been made. While it was recognised that modelling the data in RDF was hard, the interoperability possibilities were a great incentive. The trainers were pleased with the ad hoc training approach, although they had some key suggestions for training material for the next BYOD workshop. The facilitators also played a key role in ensuring that there was an appropriate amount of tutorial time and in making sure we caught our planes. Both teams left vowing to continue working up their showcases to completion and aiming to produce a paper about their work. A summer of work lies ahead.
For me, the key measure of success for the workshop will be whether the data providers are now able to find their way into the world of semantic data publishing without further workshops. Only time will tell.
[1] RDF (Resource Description Framework) is a model for representing data using largely the same technologies that made the World Wide Web such a success. While on the World Wide Web the focus is on linking documents, RDF is aimed at linking data with data, and data with ontologies, in a computer-readable format.