Linked Data Reconciliation in GraphDB

Using DBpedia to Enhance your Data in GraphDB

Following my article on Transforming Tabular Data into Linked Data using OntoRefine in GraphDB, the founder of Ontotext (Atanas Kiryakov) suggested I write a further tutorial using GraphDB for data reconciliation.

In this tutorial we will begin with a .csv of car manufacturers and enhance this with DBpedia. This .csv can be downloaded from here if you want to follow along.

Contents

Setting Up
Constructing the Graph
Reconciling your Data
Exploring the New Graph
Conclusion

Setting Up

First things first, we need to load our tabular data into OntoRefine in GraphDB. Head to the import tab, select “Tabular (OntoRefine)” and upload cars.csv if you are following along.

Click “Next” to start creating the project.

On this screen you need to untick “Parse next 1 line(s) as column headers” as this .csv does not have a header row. Rename the project in the top right corner and click “Create Project”.

You should now have this screen (above) showing one column of car manufacturer names. The column name contains a space, which is awkward to reference in SPARQL queries, so let's rename it.

Click the little arrow next to “Column 1”, open “Edit Column” and then click “Rename this Column”. I called it “carNames” and use that name in the queries below, so remember to adjust them if you name yours something different.

If you ever make a mistake, remember there is an undo/redo tab.

Constructing the Graph

In the top right of the interface there is an orange button titled “SPARQL”. Click this to open the SPARQL interface from which you can query your tabular data.

In the above screenshot I have run the query we want. I have pasted it here so you can see it all, and I go through it in detail below.

I use a CONSTRUCT query here. If you are new to SPARQL entirely then I recommend reading my tutorial on Constructing SPARQL Queries first. I then wrote a second tutorial, which covers constructs, called Constructing More Advanced SPARQL Queries for those who need it.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>

CONSTRUCT {
  ?car rdfs:label ?taggedname ;
       rdf:type dbo:Company ;
       dbo:location ?location .

  ?location rdf:type dbo:Country ;
            rdfs:label ?lname ;
            dbp:populationCensus ?pop .
} WHERE {
  ?c <urn:col:carNames> ?cname .

  BIND(STRLANG(?cname, "en") AS ?taggedname)

  SERVICE <https://dbpedia.org/sparql> {

    ?car rdfs:label ?taggedname ;
         dbo:location | dbo:locationCountry ?location .

    ?location rdf:type dbo:Country ;
              rdfs:label ?lname ;
              dbp:populationCensus | dbo:populationTotal ?pop .

    FILTER (LANG(?lname) = "en")
  }
}

I start this query by defining my prefixes as usual. I want to construct a graph around these car manufacturers, so I design that in my CONSTRUCT clause. I am building a fairly simple graph for this tutorial, so let's run through it quickly.

I want to have entities representing car manufacturers that have a type, label and location. This location is the headquarters of the car manufacturer. In most cases, all entities should have both a type and a human-readable label so I have ensured this here.

Each location is also an entity with an attached type, label and population.

Unlike my superhero tutorial, the .csv only contains the car company names and not all the data we want in our graph. We therefore need to reconcile our data with information in an open linked dataset. In this tutorial we will use DBpedia, the linked data representation of Wikipedia.

To get the information needed to build the graph declared in our CONSTRUCT, we first grab all the names in our .csv and assign them to the variable ?cname. String literals must be language tagged to reconcile with the data in DBpedia, so I BIND the English language tag “en” to each string literal. That is what these lines do:

If you didn’t name the column “carNames” above, you will have to modify the <urn:col:carNames> predicate here.
  ?c <urn:col:carNames> ?cname .

  BIND(STRLANG(?cname, "en") AS ?taggedname)

Following this we use the SERVICE keyword to send the query to DBpedia (this is called a federated query). We find every entity with a label matching one of the language tagged strings from the original .csv.

Once I have those entities, I need to find their locations. DBpedia is a very messy dataset, so we have to use an alternative property path in the query (written with the “pipe” symbol, |). This matches locations connected by either of the given predicates (in this case dbo:location or dbo:locationCountry) and assigns them to the variable ?location.

That explanation refers to these lines:

  ?car rdfs:label ?taggedname ;
       dbo:location | dbo:locationCountry ?location .

Next we want to retrieve the information about each country. The first pattern ensures that the location entity has the type dbo:Country, so that we don't match loads of irrelevant locations.

Following this we grab the label and again use an alternative property path to extract each country's population.

It is important to note that some countries have two different populations attached by these two predicates.

We finally FILTER the country labels to only return those that are in English as that is the language our original dataset is in. Data reconciliation can also be used to extend your data into other languages if it happens to fit a multilingual linked open dataset.

That covers the final few lines of our query:

  ?location rdf:type dbo:Country ;
            rdfs:label ?lname ;
            dbp:populationCensus | dbo:populationTotal ?pop .

  FILTER (LANG(?lname) = "en")

Next we need to insert this graph we have constructed into a GraphDB repository.

Click “SPARQL endpoint” and copy your endpoint (yours will be different) as it will be used later.

Reconciling your Data

If you have not already done so, create a repository and head to the SPARQL tab.

You can see in the top right of this screenshot that I’m using a repository called “cars”.

In this query panel you want to copy the CONSTRUCT query we built and modify it a little. The full query is here:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>

INSERT {
  ?car rdfs:label ?taggedname ;
       rdf:type dbo:Company ;
       dbo:location ?location .

  ?location rdf:type dbo:Country ;
            rdfs:label ?lname ;
            dbp:populationCensus ?pop .
} WHERE {
  SERVICE <http://localhost:7200/rdf-bridge/yourID> {
    ?c <urn:col:carNames> ?cname .

    BIND(STRLANG(?cname, "en") AS ?taggedname)

    SERVICE <https://dbpedia.org/sparql> {

      ?car rdfs:label ?taggedname ;
           dbo:location | dbo:locationCountry ?location .

      ?location rdf:type dbo:Country ;
                rdfs:label ?lname ;
                dbp:populationCensus | dbo:populationTotal ?pop .

      FILTER (LANG(?lname) = "en")
    }
  }
}

The first thing we do is replace CONSTRUCT with INSERT as we now want to ingest the returned graph into our repository.

The next and final thing we must do is nest the entire WHERE clause inside a second SERVICE block. This time, however, the service endpoint is the one you copied at the end of the construction section.

This constructs the graph and inserts it into your repository!

The graph should really be much larger, but the messiness of DBpedia strikes again! Many car manufacturers are connected to the string label of their location rather than to the location entity itself. Those locations therefore have no population and are consequently not returned.
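If you want a quick sanity check on how many manufacturers actually made it through reconciliation, a count over the inserted companies should do it (a minimal sketch, reusing the prefixes from above):

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>

# Count every entity that was inserted with the type dbo:Company
SELECT (COUNT(DISTINCT ?car) AS ?manufacturers)
WHERE {
  ?car rdf:type dbo:Company .
}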

We started with a small .csv of car manufacturer names, so let's explore the graph we now have.

Exploring the New Graph

If we head to the “Explore” tab and view Japan for example, we can see our data.

Japan has the attached type dbo:Country, label, population and has seven car manufacturers.
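If you would rather query for those manufacturers than browse to them, something along these lines should list them (a sketch that assumes the country label was stored as "Japan"@en, as our CONSTRUCT ensured):

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>

# Companies whose location carries the English label "Japan"
SELECT ?manufacturer
WHERE {
  ?manufacturer rdf:type dbo:Company ;
                dbo:location ?country .
  ?country rdfs:label "Japan"@en .
}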

There is no point in linking data if we cannot gain further insight from it, so let's head to the “SPARQL” tab of the workbench.

In this screenshot we can see the results of the below query. This query returns each country alongside the number of people per car manufacturer in that country.

There is nothing new in this query if you have read my SPARQL introduction. I take the MAX of the population because, as noted earlier, some countries in DBpedia have two populations attached.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>

SELECT ?name ((MAX(?pop) / COUNT(DISTINCT ?companies)) AS ?result)
WHERE {
  ?companies rdf:type dbo:Company ;
             dbo:location ?location .

  ?location rdf:type dbo:Country ;
            dbp:populationCensus ?pop ;
            rdfs:label ?name .
}
GROUP BY ?name
ORDER BY DESC(?result)

In the screenshot above you can see that the results (ordered by result in descending order) are:

  • Indonesia
  • Pakistan
  • India
  • China

India of course has a much larger population than Indonesia but also has a lot more car manufacturers (as shown below).
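If you want to check that manufacturer count per country yourself, a simple grouping query along these lines should work (again, just a sketch using the same prefixes as before):

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>

# Count the manufacturers attached to each country label
SELECT ?name (COUNT(DISTINCT ?company) AS ?manufacturers)
WHERE {
  ?company rdf:type dbo:Company ;
           dbo:location ?location .
  ?location rdfs:label ?name .
}
GROUP BY ?name
ORDER BY DESC(?manufacturers)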

If you were a car manufacturer in Asia, Indonesia might be a good market to target for export as it has a high population but very little local competition.

Conclusion

We started with a small list of car manufacturer names but, by using GraphDB and DBpedia, we managed to extend this into a small graph that we could gain actual insight from.

Of course, this particular example is not hugely useful in itself, but perhaps you have a list of local areas or housing statistics that you want to reconcile with mapping or government linked open data. This can be done using the above approach to help you or your business gain insight that you could not otherwise have identified.


Linked Data Reconciliation in GraphDB was originally published in Wallscope on Medium.

Seminar: Building intelligent systems (that can explain)

Title: Building intelligent systems (that can explain)

Speaker: Ilaria Tiddi, Research Associate in the Knowledge Representation and Reasoning group of the Vrije Universiteit of Amsterdam (NL)

Date: 10:00 on 6 March 2019

Location: EM3.03, Heriot-Watt University

Abstract: Explanations have been subject of study in a variety of fields (e.g. philosophy, psychology and social science), experiencing a new wave of popularity in Artificial Intelligence thanks to the success of machine learning (see DARPA’s eXplainable AI). Yet, the events of recent times have shown that the effectiveness of intelligent systems is still limited due to their inability to explain their decisions to human users, hence losing in understandability and trustworthiness. In this talk, I will give an overview of my research, aiming at developing systems able to automatically generate explanations using external background knowledge. In particular, I will show how such systems can be based on the existing research on explanations, combined with AI techniques and the large-scale knowledge sources available nowadays.

Bio: I am a Research Associate in the Knowledge Representation and Reasoning group of the Vrije Universiteit of Amsterdam (NL). My research focuses on creating transparent AI systems that generate explanations through a combination of machine learning, semantic technologies, and knowledge from large, heterogeneous knowledge graphs. As part of my research activities, I am a member of the CEUR-WS Editorial Board and the Knowledge Capture conference (K-CAP) Steering Committee, and I have organised workshop series (Recoding Black Mirror, Application of Semantic Web Technologies in Robotics, Linked Data 4 Knowledge Discovery) and Summer Schools (the 2015 and 2016 Semantic Web Summer School).

Twitter: @IlaTiddi

Website : https://kmitd.github.io/ilaria/

Seminar: Scan-vs-BIM for monitoring in construction

Title: Scan-vs-BIM for monitoring in construction

Speaker: Frédéric Bosché, Associate Professor in Construction Informatics,
Director of the Institute for Sustainable Building Design (ISBD), and Leader of the CyberBuild Lab. Heriot-Watt University

Date: 11:15 on 4 March 2019

Location: CM F.17, Heriot-Watt University

Abstract: When Laser Scanning and Building Information Modelling (BIM) technologies were emerging, the construction industry showed significant interest in what was eventually to be called “Scan-to-BIM”: the process of using laser scanned point clouds to develop BIM models of existing assets. However, with the use of BIM for design, another important use of these technologies is what some have called “Scan-vs-BIM”: the comparison of reality capture 3D point clouds (capturing the as-is states of constructions) to BIM models (representing the as-designed states of constructions). Scan-vs-BIM offers significant opportunities for further automation in construction project delivery, for example for progress or quality control.

This talk will present the Scan-vs-BIM concept and illustrate its process and benefits. The talk will then expand on using the output of Scan-vs-BIM processing to enhance dimensional quality control, with a view to evolving dimensional quality control from a traditionally point-based measurement process to a surface-based measurement process.

Bio: Frédéric holds a PhD in Civil Engineering and worked as a PostDoc in the Computer Vision group of ETH Zurich, Switzerland, for 2.5 years. He is currently Associate Professor in the School of Energy, Geoscience, Infrastructure and Society (EGIS). Frédéric leads the CyberBuild Lab (http://cyberbuild.hw.ac.uk/), and his research covers two main areas:

  1. Processing of reality capture data to enhance asset construction and life cycle management.
  2. Development and use of virtual and mixed reality technology, to support collaborative and engaging design, construction and engineering works, as well as training.

Frédéric has published over 70 peer-reviewed papers in internationally recognised journals and conferences, and his research has received a number of international research and innovation awards, including two CIOB International Research & Innovation awards in 2016 and the IAARC Tucker-Hasegawa Award in 2018 for “distinguished contributions to the field of automation and robotics in construction”. Frédéric is a member of the Executive Committee of the International Association for Automation and Robotics in Construction (IAARC), and he is Associate Editor of Automation in Construction (Elsevier).

Seminar: The Challenges of Automated Ontology Debugging: Experiences and Ideas

Title: The Challenges of Automated Ontology Debugging: Experiences and Ideas

Speaker: Juan Casanova, University of Edinburgh

Date: 11:15 on 18 February 2019

Location: CM F.17, Heriot-Watt University

Abstract: Some of the principal attractive aspects of semantic automated reasoning methods (logic) are, at the same time, what works against them becoming widespread and easily usable for the management of large amounts of data coming from multiple sources (ontologies). Ontology debugging is a fundamental subfield to master if automated ontology-based technologies are to be in charge of large data and knowledge management systems.

I am still a PhD student, and relatively new to the field, but during my work on ontology debugging techniques I feel I have come to identify a few fundamental challenges that we need to be aware of, such as the need for additional information, how big the issue of (local) inconsistency in ontologies can be, and the problem of efficiently finding relevant justifications for inferences.

In this talk, I’ll be briefly explaining my work on automated fault detection using meta-ontologies, in the context of which I have identified and battled with these challenges, and I’ll be presenting my opinions on where these challenges are coming from and what could be done to tackle them. It is likely that some of you will disagree with some of my claims or think that they are obvious, and that is precisely why I think this talk should incentivize some useful and interesting discussion.

Bioschemas at the Biohackathon

Last November I had the privilege to be one of 150 participants at the Biohackathon organised by ELIXIR. The hackathon was organised into 29 topics, many of which were related to Bioschemas and one directly focused on Bioschemas. For the Bioschemas topic we had up to 30 people working around three themes.

The first theme was to implement markup for the various life sciences resources present. Representatives from ELIXIR Core Data Resources and node resources from the UK and Switzerland were there to work on this thanks to the staff exchange and travel fund. By the end of the week we had new live deploys for 11 additional resources and examples for many more.

The second theme was to refine the types and profiles that Bioschemas has been developing based on the experiences of deploying the markup. Prior to the hackathon, Bioschemas had moved from a minimal Schema.org extension of a single BioChemEntity type to a collection of types for the different life science resources, e.g. Gene, Protein, and Taxon. Just before the hackathon a revised set of types and profiles was released. This proved to be useful for discussion, but it very quickly became clear that there was a need for further refinement. During the hackathon we started new profiles for DNA, Experimental Studies, and Phenotype, and the Chemical profile was split into MolecularEntity and ChemicalSubstance. Long discussions were held about the types and their structure, with early drafts for 17 types being proposed. These are now getting to a state where they are ready for further experimentation.

The third theme was to develop tooling to support Bioschemas. Due to the intensity of the discussions on the types and profiles, there was no time to work on this topic. However, the prototype Bioschemas Generator was extensively tested during the first theme and improvements fed back to the developer. There were also refinements made to the GoWeb tool.

Overall, it was a very productive hackathon. The venue proved to be very conducive to fostering the right atmosphere. During the evenings there were opportunities to socialise or carry on the discussions. Below are two of the paintings that were produced during one of the social activities that capture the Bioschemas discussions.

And there was the food. Wow! Wonderful meals, three times a day.

Seminar: Environmental Health Research in the Era of the ‘Exposome’

Title: Environmental Health Research in the Era of the ‘Exposome’

Speaker: Miranda Loh, Sc.D., Senior Scientist at the Institute of Occupational Medicine

Date: 11:15 on February 2019

Location: CM F.17, Heriot-Watt University

Abstract: In 2015, an estimated 9 million premature deaths were caused by pollution, with air pollution as the leading environmental risk factor. The potential environmental burden of disease could be even larger, as there are still many unknown causes of disease. Much of this uncertainty around the cause of diseases comes from poor description of environmental and occupational exposures in epidemiological studies. Current research into characterising the exposome, the sum total of all exposures through an individual’s lifetime, aims at improving exposure science and our understanding of the relationships between environment and health. There has been great interest in the exposome community in using sensors and smart technologies to further assessment of environmental, behavioural, and health information for individuals. This seminar will explore current interests in the use of technology in exposome research.

Seminar: Exploiting Semantic Web Technologies in Open-domain Conversational Agents

Title: Exploiting Semantic Web Technologies in Open-domain Conversational Agents

Speaker: Alessandro Suglia, Heriot-Watt University

Date: 11:15 on 10 December 2018

Location: CM F.17, Heriot-Watt University

Abstract: The Amazon Alexa Prize is an international competition organised to foster the development of sophisticated open-domain conversational agents. During the competition, the systems should be able to support a conversation with a user about several topics ranging from movies to the news in an engaging and coherent way. In order to understand the entities mentioned by the user and to be able to provide interesting information about those, we relied on several Semantic Web Technologies such as the Amazon Neptune cluster for high-performance SPARQL query execution on a customised large-scale knowledge base composed of Wikidata and a fragment of the DBpedia ontology. In this talk, I will provide an overview of the system and I will describe how our system leverages the power of Linked Data in several components of the architecture.

ISWC 2018 Trip Report

Keynotes

There were three amazing and inspiring keynote talks, all very different from each other.

The first was given by Jennifer Golbeck (University of Maryland). While Jennifer did her PhD on the Semantic Web in the early days of social media and Linked Data, she now focuses on user privacy and consent. These are highly relevant topics to the Semantic Web community and something that we should really be considering when linking people’s personal data. While the consequences of linking scientific data might not be as scary, there are still ethical issues to consider if we do not get it right. Check out her TED talk for an abridged version of her keynote.

She also suggested that when reading a company's privacy policy, you should replace the word “privacy” with “consent” and see how it reads then.

The talk also struck a chord with the launch of the SOLID framework by Tim Berners-Lee. There was a good sales pitch of the SOLID framework from Ruben Verborgh in the afternoon of the Decentralising the Semantic Web Workshop.

The second was given by Natasha Noy (Google). Natasha talked about the challenges of being a researcher and of engineering tools that support the community, particularly where impact may only be detected 6 to 10 years down the line. She also highlighted that Linked Data is only a small fraction of the data in the world (the tip of the iceberg), and it is not appropriate to expect all data to become Linked Data.

Her most recent endeavour has been the Google Dataset Search Tool. This has been a major engineering and social endeavour; getting schema.org markup embedded on pages and building a specialist search tool on top of the indexed data. More details of the search framework are in this blog post. The current search interface is limited due to the availability of metadata; most sites only make title and description available. However, we can now start investigating how to return search results for datasets and what additional data might be of use. This for me is a really exciting area of work.

Later in the day I attended a talk on the LOD Atlas, another dataset search tool. While this gives a very detailed user interface, it is only designed for Linked Data researchers, not general users looking for a dataset.

The third keynote was given by Vanessa Evers (University of Twente, The Netherlands). This was in a completely different domain, social interactions with robots, but still raised plenty of questions for the community. For me the challenge was how to supply contextualised data.

Knowledge Graph Panel

The other big plenary event this year was the knowledge graph panel. The panel consisted of representatives from Microsoft, Facebook, eBay, Google, and IBM, all of whom were involved with the development of Knowledge Graphs within their organisation. A major concern for the Semantic Web community is that most of these panelists were not aware of our community or the results of our work. Another concern is that none of their systems use any of our results, although it sounds like several of them use something similar to RDF.

The main messages I took from the panel were:

  • Scale and distribution were key

  • Source information is going to be noisy and challenging to extract value from

  • Metonymy is a major challenge

This final point connects with my work on contextualising data for the task of the user [1, 2] and has reinvigorated my interest in this research topic.

Final Thoughts

This was another great ISWC conference, although many familiar faces were missing.

There was a great and vibrant workshop programme. My paper [3] was presented during the Enabling Open Semantic Science workshop (SemSci 2018) and resulted in a good deal of discussion. There were also great keynotes at the workshop from Paul Groth (slides) and Yolanda Gil which I would recommend anyone to look over.

I regret not having gone to more of the Industry Track sessions. In the one I did make, it was very inspiring to see how the results of the community are being used in practice, and to get insights into the challenges faced.

The conference banquet involved a walking dinner around the Monterey Bay Aquarium. This was a great idea as it allowed plenty of opportunities for conversations with a wide range of conference participants; far more than your standard banquet.

Here are some other takes on the conference:

I also managed to sneak off to look for the sea otters.

[1] [doi] Colin R. Batchelor, Christian Y. A. Brenninkmeijer, Christine Chichester, Mark Davies, Daniela Digles, Ian Dunlop, Chris T. A. Evelo, Anna Gaulton, Carole A. Goble, Alasdair J. G. Gray, Paul T. Groth, Lee Harland, Karen Karapetyan, Antonis Loizou, John P. Overington, Steve Pettifer, Jon Steele, Robert Stevens, Valery Tkachenko, Andra Waagmeester, Antony J. Williams, and Egon L. Willighagen. Scientific Lenses to Support Multiple Views over Linked Chemistry Data. In The Semantic Web – ISWC 2014 – 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part I, page 98–113, 2014.
[Bibtex]
@inproceedings{BatchelorBCDDDEGGGGHKLOPSSTWWW14,
abstract = {When are two entries about a small molecule in different datasets the same? If they have the same drug name, chemical structure, or some other criteria? The choice depends upon the application to which the data will be put. However, existing Linked Data approaches provide a single global view over the data with no way of varying the notion of equivalence to be applied.
In this paper, we present an approach to enable applications to choose the equivalence criteria to apply between datasets. Thus, supporting multiple dynamic views over the Linked Data. For chemical data, we show that multiple sets of links can be automatically generated according to different equivalence criteria and published with semantic descriptions capturing their context and interpretation. This approach has been applied within a large scale public-private data integration platform for drug discovery. To cater for different use cases, the platform allows the application of different lenses which vary the equivalence rules to be applied based on the context and interpretation of the links.},
author = {Colin R. Batchelor and
Christian Y. A. Brenninkmeijer and
Christine Chichester and
Mark Davies and
Daniela Digles and
Ian Dunlop and
Chris T. A. Evelo and
Anna Gaulton and
Carole A. Goble and
Alasdair J. G. Gray and
Paul T. Groth and
Lee Harland and
Karen Karapetyan and
Antonis Loizou and
John P. Overington and
Steve Pettifer and
Jon Steele and
Robert Stevens and
Valery Tkachenko and
Andra Waagmeester and
Antony J. Williams and
Egon L. Willighagen},
title = {Scientific Lenses to Support Multiple Views over Linked Chemistry
Data},
booktitle = {The Semantic Web - {ISWC} 2014 - 13th International Semantic Web Conference,
Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part {I}},
pages = {98--113},
year = {2014},
url = {http://dx.doi.org/10.1007/978-3-319-11964-9_7},
doi = {10.1007/978-3-319-11964-9_7},
}
[2] [doi] Alasdair J. G. Gray. Dataset Descriptions for Linked Data Systems. IEEE Internet Computing, 18(4):66–69, 2014.
[Bibtex]
@article{Gray14,
abstract = {Linked data systems rely on the quality of, and linking between, their data sources. However, existing data is difficult to trace to its origin and provides no provenance for links. This article discusses the need for self-describing linked data.},
author = {Alasdair J. G. Gray},
title = {Dataset Descriptions for Linked Data Systems},
journal = {{IEEE} Internet Computing},
volume = {18},
number = {4},
pages = {66--69},
year = {2014},
url = {http://dx.doi.org/10.1109/MIC.2014.66},
doi = {10.1109/MIC.2014.66},
}
[3] Alasdair J. G. Gray. Using a Jupyter Notebook to perform a reproducible scientific analysis over semantic web sources. In Enabling Open Semantic Science, Monterey, California, USA, 2018. Executable version: https://mybinder.org/v2/gh/AlasdairGray/SemSci2018/master?filepath=SemSci2018%20Publication.ipynb
[Bibtex]
@InProceedings{Gray2018:jupyter:SemSci2018,
abstract = {In recent years there has been a reproducibility crisis in science. Computational notebooks, such as Jupyter, have been touted as one solution to this problem. However, when executing analyses over live SPARQL endpoints, we get different answers depending upon when the analysis in the notebook was executed. In this paper, we identify some of the issues discovered in trying to develop a reproducible analysis over a collection of biomedical data sources and suggest some best practice to overcome these issues.},
author = {Alasdair J G Gray},
title = {Using a Jupyter Notebook to perform a reproducible scientific analysis over semantic web sources},
OPTcrossref = {},
OPTkey = {},
booktitle = {Enabling Open Semantic Science},
year = {2018},
OPTeditor = {},
OPTvolume = {},
OPTnumber = {},
OPTseries = {},
OPTpages = {},
month = oct,
address = {Monterey, California, USA},
OPTorganization = {},
OPTpublisher = {},
note = {Executable version: https://mybinder.org/v2/gh/AlasdairGray/SemSci2018/master?filepath=SemSci2018%20Publication.ipynb},
url = {http://ceur-ws.org/Vol-2184/paper-02/paper-02.html},
OPTannote = {}
}

First steps with Jupyter Notebooks

At the 2nd Workshop on Enabling Open Semantic Sciences (SemSci2018), colocated at ISWC2018, I presented the following paper (slides at end of this post):

Title: Using a Jupyter Notebook to perform a reproducible scientific analysis over semantic web sources

Abstract: In recent years there has been a reproducibility crisis in science. Computational notebooks, such as Jupyter, have been touted as one solution to this problem. However, when executing analyses over live SPARQL endpoints, we get different answers depending upon when the analysis in the notebook was executed. In this paper, we identify some of the issues discovered in trying to develop a reproducible analysis over a collection of biomedical data sources and suggest some best practice to overcome these issues.

The paper covers my first attempt at using a computational notebook to publish a data analysis for reproducibility. The paper provokes more questions than it answers and this was the case in the workshop too.

One of the really great things about the paper is that you can launch the notebook, without installing any software, by clicking on the binder button below. You can then rerun the entire notebook and see whether you get the same results that I did when I ran the analysis over the various datasets.