Research

BridgeDb GSoC 2019 Student

During the summer, BridgeDb has had a Google Summer of Code student working on extending the system to work with secondary identifiers; these are alternative identifiers for a given resource.

The student Manas Awasthi has maintained a blog of his experiences. Below are some excerpts of his activity.

Google Summer of Code 2019: Dream to Reality

Manas Awasthi
May 28 · 3 min read
Google Summer of Code, an annual Google program which encourages open source contribution from students. The term I was introduced to by my seniors in my freshman year. Having no clue about open source, I started gathering knowledge about ‘How to contribute to open source projects?’ Then I came across version control, being a freshman it was an unknown territory for me. I started using Github for my personal projects which gave me a better understanding of how to use it. Version Control Service was off the checklists. By the time all this was done Google Summer of Code 2018 was announced.

Google Summer of Code 2019: Dream to Reality

Manas Awasthi
Jun 12 · 3 min read
The Coding Period: The First Two Weeks
The coding period of Google Summer of Code started on 27th of May, at the time of publishing it’s been more than 2 weeks, here I am writing this blog to discuss what I have done over the past two weeks, and what a ride it has been already. Plenty of coding along with plenty of learning. From the code base to the test suite.

Google Summer of Code 2019: Dream to Reality

Manas Awasthi
Jun 22 · 3 min read
The Coding Period: Week 3 — Week 4
Hola Amigos!!! Let’s discuss my progress through week 3 and 4 of GSoC’s coding period. So the major part of what I was doing this week was to add support for the secondary identifier (err!!! whats that) to BridgeDb.

Google Summer of Code 2019: Dream to Reality

Manas Awasthi
Aug 21 · 3 min read
Hey Folks, this is probably the last blog in this series outlining my journey as a GSoC student. In this blog I’ll go through the functionality I have added over the summer and why the end-users should use it.

FAIRplus Newsletter 2

Below is the opening exert from the second FAIRplus Newsletter:

Though FAIRplus has been running for just six months, there is already a lot to talk about. Our two task-focused ‘Squads’ have booted up and begun the FAIRification of the first set of four pilot datasets, our industry partners in EFPIA organised the first ‘Bring Your Own Data’ workshop in London, and we’ve been busy explaining our goals and answering many questions from our stakeholders.

You can read about these activities in this second FAIRplus newsletter. On top of that, we bring you an update on upcoming events, news from our partners and also a new section ‘Track our progress’ where you can check for yourself how we are progressing towards our goals and what Deliverables and reports we’ve recently submitted.

Finally, we’ve launched our own LinkedIn page. Besides regular updates on our activities, it will also feature job opportunities and news from the FAIRplus partners.

The next FAIRplus Newsletter will come out in November 2019. In it we’ll present the FAIRplus Fellowship programme, report on the FAIR workshop in October and more.

We wish you a relaxing Summer and look forward to meeting you at our events!

Biohackathon 2018 -Paris

Bioschemas at the Biohackathon

Last November I had the privilege to be one of 150 participants at the Biohackathon organised by ELIXIR. The hackathon was organised into 29 topics, many of which were related to Bioschemas and one directly focused on Bioschemas. For the Bioschemas topic we had up to 30 people working around three themes.

The first theme was to implement markup for the various life sciences resources present. Representatives from ELIXIR Core Data Resources and node resources from the UK and Switzerland were there to work on this thanks to the staff exchange and travel fund. By the end of the week we had new live deploys for 11 additional resources and examples for many more.

The second theme was to refine the types and profiles that Bioschemas has been developing based on the experiences of deploying the markup. Prior to the hackathon, Bioschemas had moved from a minimal Schema.org extension of a single BioChemEntity type to collection of types for the different life science resources, e.g. Gene, Protein, and Taxon. Just before the hackathon a revised set of types and profiles were released. This proved to be useful for discussion, but it very quickly became clear that there was need for further refinement. During the hackathon we started new profiles for DNA, Experimental Studies, and Phenotype, and the Chemical profile was split into MolecularEntity and ChemicalSubstance. Long discussions were held about the types and their structure with early drafts for 17 types being proposed. These are now getting to a state where they are ready for further experimentation.

The third theme was to develop tooling to support Bioschemas. Due to the intensity of the discussions on the types and profiles, there was no time to work on this topic. However, the prototype Bioschemas Generator was extensively tested during the first theme and improvements fed back to the developer. There were also refinements made to the GoWeb tool.

Overall, it was a very productive hackathon. The venue proved to be very conducive to fostering the right atmosphere. During the evenings there were opportunities to socialise or carry on the discussions. Below are two of the paintings that were produced during one of the social activities that capture the Bioschemas discussions.

And there was the food. Wow! Wonderful meals, three times a day.

ISWC 2018

ISWC 2018 Trip Report

Keynotes

There were three amazing and inspiring keynote talks, all very different from each other.

The first was given by Jennifer Golbeck (University of Maryland). While Jennifer did her PhD on the Semantic Web in the early days of social media and Linked Data, she now focuses on user privacy and consent. These are highly relevant topics to the Semantic Web community and something that we should really be considering when linking people’s personal data. While the consequences of linking scientific data might not be as scary, there are still ethical issues to consider if we do not get it right. Check out her TED talk for an abridged version of her keynote.

She also suggested that when reading a companies privacy policy, you should replace the word “privacy” with “consent” and see how it seems then.

The talk also struck an accord with the launch of the SOLID framework by Tim Berners-Lee. There was a good sales pitch of the SOLID framework from Ruben Verborgh in the afternoon of the Decentralising the Semantic Web Workshop.

The second was given by Natasha Noy (Google). Natasha talked about the challenges of being a researcher and engineering tools that support the community. Particularly where impact may only be detect 6 to 10 years down the line. She also highlighted that Linked Data is only a small fraction of the data in the world (the tip of the iceberg), and it is not appropriate to expect all data to become Linked Data.

Her most recent endeavour has been the Google Dataset Search Tool. This has been a major engineering and social endeavour; getting schema.org markup embedded on pages and building a specialist search tool on top of the indexed data. More details of the search framework are in this blog post. The current search interface is limited due to the availability of metadata; most sites only make title and description available. However, we can now start investigating how to return search results for datasets and what additional data might be of use. This for me is a really exciting area of work.

Later in the day I attended a talk on the LOD Atlas, another dataset search tool. While this gives a very detailed user interface, it is only designed for Linked Data researchers, not general users looking for a dataset.

The third keynote was given by Vanessa Evers (University of Twente, The Netherlands). This was in a completely different domain, social interactions with robots, but still raised plenty of questions for the community. For me the challenge was how to supply contextualised data.

Knowledge Graph Panel

The other big plenary event this year was the knowledge graph panel. The panel consisted of representatives from Microsoft, Facebook, eBay, Google, and IBM, all of whom were involved with the development of Knowledge Graphs within their organisation. A major concern for the Semantic Web community is that most of these panelists were not aware of our community or the results of our work. Another concern is that none of their systems use any of our results, although it sounds like several of them use something similar to RDF.

The main messages I took from the panel were

  • Scale and distribution were key

  • Source information is going to be noisy and challenging to extract value from

  • Metonymy is a major challenge

This final point connects with my work on contextualising data for the task of the user [1, 2] and has reinvigorated my interest in this research topic.

Final Thoughts

This was another great ISWC conference, although many familiar faces were missing.

There was a great and vibrant workshop programme. My paper [3] was presented during the Enabling Open Semantic Science workshop (SemSci 2018) and resulted in a good deal of discussion. There were also great keynotes at the workshop from Paul Groth (slides) and Yolanda Gil which I would recommend anyone to look over.

I regret not having gone to more of the Industry Track sessions. The one I did make was very inspiring to see how the results of the community are being used in practice, and to get insights into the challenges faced.

The conference banquet involved a walking dinner around the Monterey Bay Aquarium. This was a great idea as it allowed plenty of opportunities for conversations with a wide range of conference participants; far more than your standard banquet.

Here are some other takes on the conference:

I also managed to sneak off to look for the sea otters.

[1] Unknown bibtex entry with key [BatchelorBCDDDEGGGGHKLOPSSTWWW14]
[Bibtex]
[2] Unknown bibtex entry with key [Gray14]
[Bibtex]
[3] Alasdair J. G. Gray. Using a Jupyter Notebook to perform a reproducible scientific analysis over semantic web sources. In Enabling Open Semantic Science, Monterey, California, USA, oct 2018. Executable version: https://mybinder.org/v2/gh/AlasdairGray/SemSci2018/master?filepath=SemSci2018%20Publication.ipynb
[Bibtex]
@InProceedings{Gray2018:jupyter:SemSci2018,
abstract = {In recent years there has been a reproducibility crisis in science. Computational notebooks, such as Jupyter, have been touted as one solution to this problem. However, when executing analyses over live SPARQL endpoints, we get different answers depending upon when the analysis in the notebook was executed. In this paper, we identify some of the issues discovered in trying to develop a reproducible analysis over a collection of biomedical data sources and suggest some best practice to overcome these issues.},
author = {Alasdair J G Gray},
title = {Using a Jupyter Notebook to perform a reproducible scientific analysis over semantic web sources},
OPTcrossref = {},
OPTkey = {},
booktitle = {Enabling Open Semantic Science},
year = {2018},
OPTeditor = {},
OPTvolume = {},
OPTnumber = {},
OPTseries = {},
OPTpages = {},
month = oct,
address = {Monterey, California, USA},
OPTorganization = {},
OPTpublisher = {},
note = {Executable version: https://mybinder.org/v2/gh/AlasdairGray/SemSci2018/master?filepath=SemSci2018%20Publication.ipynb},
url = {http://ceur-ws.org/Vol-2184/paper-02/paper-02.html},
OPTannote = {}
}

First steps with Jupyter Notebooks

At the 2nd Workshop on Enabling Open Semantic Sciences (SemSci2018), colocated at ISWC2018, I presented the following paper (slides at end of this post):

Title: Using a Jupyter Notebook to perform a reproducible scientific analysis over semantic web sources

Abstract: In recent years there has been a reproducibility crisis in science. Computational notebooks, such as Jupyter, have been touted as one solution to this problem. However, when executing analyses over live SPARQL endpoints, we get different answers depending upon when the analysis in the notebook was executed. In this paper, we identify some of the issues discovered in trying to develop a reproducible analysis over a collection of biomedical data sources and suggest some best practice to overcome these issues.

The paper covers my first attempt at using a computational notebook to publish a data analysis for reproducibility. The paper provokes more questions than it answers and this was the case in the workshop too.

One of the really great things about the paper is that you can launch the notebook, without installing any software, by clicking on the binder button below. You can then rerun the entire notebook and see whether you get the same results that I did when I ran the analysis over the various datasets.