Comparison of Linked Data Triplestores: A New Contender

First Impressions of RDFox while the Benchmark is Developed

Note: This article is not sponsored. Oxford Semantic Technologies let me try out the new version of RDFox and are keen to be part of the future benchmark.

After reading some of my previous articles, Oxford Semantic Technologies (OST) got in touch and asked if I would like to try out their triplestore called RDFox.

In this article I will share my thoughts and why I am now excited to see how they do in the future benchmark.

They have just released a page on which you can request your own evaluation license to try it yourself.

Contents

Brief Benchmark Catch-up
How I Tested RDFox
First Impressions
Results
Conclusion


Brief Benchmark Catch-up

In December I wrote a comparison of existing triplestores on a tiny dataset. I quickly learned that there were too many flaws in my methodology for the results to be truly comparable.

In February I then wrote a follow-up in which I described many of the flaws and listed many of the details that I will have to pay attention to while developing an actual benchmark.

This benchmark is currently in development. I am now working with developers, working with academics and talking with a high-performance computing centre to give us access to the infrastructure needed to run at scale.

How I Tested RDFox

In the above articles I evaluated five triplestores. They were (in alphabetical order) AnzoGraph, Blazegraph, GraphDB, Stardog and Virtuoso. I would like to include all of these in the future benchmark and now RDFox as well.

Obviously my previous evaluations are not completely fair comparisons (hence the development of the benchmark) but the last one can be used to get an idea of whether RDFox can compete with the others.

For that reason, I loaded the same data and ran the same queries as in my larger comparison to see how RDFox fared. I of course kept all other variables the same such as the machine, using the CLI to query, same number of warm up and hot runs, etc…
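To make the "warm up and hot runs" idea concrete, here is a minimal Python sketch of that kind of timing loop. It is only an illustration: the function name `time_query`, the stand-in `fake_query` and the run counts (3 warm ups, 5 hot runs) are assumptions for the example, not the numbers or tooling used for these results.

```python
import time
from statistics import mean

def time_query(run_query, warm_ups=3, hot_runs=5):
    """Run discarded warm-up executions, then time the hot runs in milliseconds."""
    # Warm-up runs let caches fill; their timings are thrown away.
    for _ in range(warm_ups):
        run_query()
    timings_ms = []
    for _ in range(hot_runs):
        start = time.perf_counter()
        run_query()
        timings_ms.append((time.perf_counter() - start) * 1000)
    return mean(timings_ms), timings_ms

# Hypothetical stand-in for "send this query to the triplestore via the CLI".
calls = []
def fake_query():
    calls.append(1)

average_ms, timings = time_query(fake_query, warm_ups=3, hot_runs=5)
print(f"{len(calls)} executions, {len(timings)} timed, mean {average_ms:.3f}ms")
```

Only the hot runs contribute to the reported average, which is why the warm-up count has to be identical for every triplestore.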

First Impressions

RDFox is very easy to use and well-documented. You can initialise it with custom scripts which is extremely useful as I could start RDFox, load all my gzipped turtle files in parallel, run all my warm up queries and run all my hot queries with one command.

RDFox is an in-memory solution which explains many of the differences in results but they also have a very nice rule system that can be used to precompute results that are used in later queries. These rules are not evaluated when you send a query but in advance.

This allows them to be used to automatically keep consistency of your data as it is added or removed. The rules themselves can even be added or removed during the lifetime of the database.

Note: Queries 3 and 6 use these custom rules. I highlight this on the relevant queries and this is one of the reasons I didn’t just add them to the last article.

You can use these rules for a number of things, for example to precompute alternate property paths (If you are unfamiliar with those then I cover them in this SPARQL tutorial). You could do this by defining a rule that example:predicate represents example:pred1 and example:pred2 so that:

SELECT ?s ?o
WHERE {
  ?s example:predicate ?o .
}

This would return the subjects and objects of the triples:

person:A example:pred1 colour:blue .
person:A example:pred2 colour:green .
person:B example:pred2 colour:brown .
person:C example:pred1 colour:grey .

This makes the use of alternate property paths less necessary.
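The idea of computing a rule in advance can be sketched in a few lines of Python. This is not RDFox's rule language or API, just a toy model over the four example triples above: the names `materialise` and `select_by_predicate` are made up for the illustration.

```python
# The four example triples from the text, as (subject, predicate, object) tuples.
triples = {
    ("person:A", "example:pred1", "colour:blue"),
    ("person:A", "example:pred2", "colour:green"),
    ("person:B", "example:pred2", "colour:brown"),
    ("person:C", "example:pred1", "colour:grey"),
}

# Toy rule: example:predicate holds wherever example:pred1 or example:pred2 holds.
RULE_BODY = {"example:pred1", "example:pred2"}
RULE_HEAD = "example:predicate"

def materialise(store):
    """Apply the rule ahead of query time and return the store plus derived triples."""
    derived = {(s, RULE_HEAD, o) for (s, p, o) in store if p in RULE_BODY}
    return store | derived

def select_by_predicate(store, predicate):
    """Equivalent of SELECT ?s ?o WHERE { ?s <predicate> ?o }."""
    return sorted((s, o) for (s, p, o) in store if p == predicate)

store = materialise(triples)
results = select_by_predicate(store, RULE_HEAD)
for s, o in results:
    print(s, o)
```

The query itself only does a single predicate lookup; the work of combining the two predicates happened when the rule was applied, which is why such queries are hard to compare fairly against engines that evaluate a property path at query time.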

With all that… let’s see how RDFox performs.

Results

Each of the below charts compares RDFox to the averages of the others. I exclude outliers where applicable.

Loading

Right off the bat, RDFox was the fastest at loading which was a huge surprise as AnzoGraph was significantly faster than the others originally (184667ms).

Even if I exclude Blazegraph and GraphDB (which were significantly slower at loading than the others), you can see that RDFox was very fast:

Others = AnzoGraph, Stardog and Virtuoso

Note: I have added who the others are as footnotes to each chart. This is so that the images are not mistaken for fair comparison results (if they showed up in a Google image search for example).

RDFox and AnzoGraph are much newer triplestores which may be why they are so much faster at loading than the others. I am very excited to see how these speeds are impacted as we scale the number of triples we load in the benchmark.

Queries

Overall I am very impressed with RDFox’s performance with these queries.

It is important to note however that the others' results have been public for a while. I ran these queries on the newest version of RDFox and the previous version and did not notice any significant optimisation on these particular queries. These published results are of course from the latest version.

Query 1:

This query is very simple and just counts the number of relationships in the graph.

SELECT (COUNT(*) AS ?triples)
WHERE {
  ?s ?p ?o .
}

RDFox was the second slowest to do this but as mentioned in the previous article, optimisations on this query often reduce correctness.

Faster = AnzoGraph, Blazegraph, Stardog and Virtuoso. Slower = GraphDB

Query 2:

This query returns a list of 1000 settlement names which have airports with identification numbers.

PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?v WHERE {
  { ?v2 a dbo:Settlement ;
        rdfs:label ?v .
    ?v6 a dbo:Airport . }
  { ?v6 dbo:city ?v2 . }
  UNION
  { ?v6 dbo:location ?v2 . }
  { ?v6 dbp:iata ?v5 . }
  UNION
  { ?v6 dbo:iataLocationIdentifier ?v5 . }
  OPTIONAL { ?v6 foaf:homepage ?v7 . }
  OPTIONAL { ?v6 dbp:nativename ?v8 . }
} LIMIT 1000

RDFox was the fastest to complete this query by a fairly significant margin and this is likely because it is an in-memory solution. GraphDB was the second fastest in 29.6ms and then Virtuoso in 88.2ms.

Others = Blazegraph, GraphDB, Stardog and Virtuoso

I would like to reiterate that there are several problems with these queries that will be solved in the benchmark. For example, this query has a LIMIT but no ORDER BY which is highly unrealistic.
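A short Python sketch of why LIMIT without ORDER BY is a problem for comparisons: two engines are free to produce the same solutions in different orders, so truncating them can give different answers. The row values and the two "plans" below are invented purely for illustration.

```python
def limit(rows, n):
    """LIMIT n on whatever order the engine happened to produce."""
    return rows[:n]

def order_by_then_limit(rows, n):
    """ORDER BY followed by LIMIT n: the cut-off no longer depends on engine order."""
    return sorted(rows)[:n]

# Same solutions, hypothetically produced in a different order by two engines.
plan_a = ["Aberdeen", "Dundee", "Edinburgh", "Glasgow"]
plan_b = ["Glasgow", "Edinburgh", "Dundee", "Aberdeen"]

print(limit(plan_a, 2), limit(plan_b, 2))
print(order_by_then_limit(plan_a, 2), order_by_then_limit(plan_b, 2))
```

Without the ordering step the two engines are not even answering the same question, so their timings are not strictly comparable.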

Query 3:

This query nests query 2 to grab information about the 1,000 settlements returned above.

You will notice that this query is slightly different to query 3 in the original article.

PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?v ?v2 ?v5 ?v6 ?v7 ?v8 WHERE {
  ?v2 a dbo:Settlement;
      rdfs:label ?v.
  ?v6 a dbo:Airport.
  { ?v6 dbo:city ?v2. }
  UNION
  { ?v6 dbo:location ?v2. }
  { ?v6 dbp:iata ?v5. }
  UNION
  { ?v6 dbo:iataLocationIdentifier ?v5. }
  OPTIONAL { ?v6 foaf:homepage ?v7. }
  OPTIONAL { ?v6 dbp:nativename ?v8. }
  {
    FILTER(EXISTS{ SELECT ?v WHERE {
      ?v2 a dbo:Settlement;
          rdfs:label ?v.
      ?v6 a dbo:Airport.
      { ?v6 dbo:city ?v2. }
      UNION
      { ?v6 dbo:location ?v2. }
      { ?v6 dbp:iata ?v5. }
      UNION
      { ?v6 dbo:iataLocationIdentifier ?v5. }
      OPTIONAL { ?v6 foaf:homepage ?v7. }
      OPTIONAL { ?v6 dbp:nativename ?v8. }
    }
    LIMIT 1000
    })
  }
}

RDFox was again the fastest to complete query 3 but it is important to reiterate that this query was modified slightly so that it could run on RDFox. The only other query that has the same issue is query 6.

Others = Blazegraph, GraphDB, Stardog and Virtuoso

The results of queries 2 and 3 are very similar of course as query 2 is nested within query 3.

Query 4:

The two queries above were similar but query 4 is a lot more mathematical.

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT (ROUND(?x/?y) AS ?result) WHERE {
  {SELECT (CEIL(?a + ?b) AS ?x) WHERE {
    {SELECT (AVG(?abslat) AS ?a) WHERE {
      ?s1 geo:lat ?lat .
      BIND(ABS(?lat) AS ?abslat)
    }}
    {SELECT (SUM(?rv) AS ?b) WHERE {
      ?s2 dbo:volume ?volume .
      BIND((RAND() * ?volume) AS ?rv)
    }}
  }}

  {SELECT ((FLOOR(?c + ?d)) AS ?y) WHERE {
    {SELECT ?c WHERE {
      BIND(MINUTES(NOW()) AS ?c)
    }}
    {SELECT (AVG(?width) AS ?d) WHERE {
      ?s3 dbo:width ?width .
      FILTER(?width > 50)
    }}
  }}
}

AnzoGraph was the quickest to complete query 4 with RDFox in second place.

Faster = AnzoGraph. Slower = Blazegraph, Stardog and Virtuoso

Virtuoso was the third fastest to complete this query in a time of 519.5ms.

As with all these queries, they do not contain random seeds so I have made sure to include mathematical queries in the benchmark.

Query 5:

This query focuses on strings rather than mathematical functions. It essentially grabs all labels containing the string ‘venus’, all comments containing ‘sleep’ and all abstracts containing ‘gluten’. It then constructs an entity and attaches all of these to it.

I use a CONSTRUCT query here. I wrote a second SPARQL tutorial, which covers constructs, called Constructing More Advanced SPARQL Queries for those who need it.

PREFIX ex: <http://wallscope.co.uk/resource/example/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT {
  ex:notglutenfree rdfs:label ?label ;
                   rdfs:comment ?sab ;
                   dbo:abstract ?lab .
} WHERE {
  { ?s1 rdfs:label ?label .
    FILTER (REGEX(lcase(?label), 'venus'))
  } UNION
  { ?s2 rdfs:comment ?sab .
    FILTER (REGEX(lcase(?sab), 'sleep'))
  } UNION
  { ?s3 dbo:abstract ?lab .
    FILTER (REGEX(lcase(?lab), 'gluten'))
  }
}

As discussed in the previous post, it is uncommon to use REGEX queries if you can run a full text index query on the triplestore. AnzoGraph and RDFox are the only two that do not have built-in full text indexes, hence these results:

Faster = AnzoGraph. Slower = Blazegraph, GraphDB, Stardog and Virtuoso

AnzoGraph is a little faster than RDFox to complete this query but the two of them are significantly faster than the rest. This is of course because you would use the full text index capabilities of the other triplestores.

If we instead run full text index queries, they are significantly faster than RDFox.

Note: To ensure clarity, in this chart RDFox was running the REGEX query as it does not have full text index functionality.

Others = Blazegraph, GraphDB, Stardog and Virtuoso

Whenever I can run a full text index query I will because of the serious performance boost. Therefore this chart is definitely fairer on the other triplestores.
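To illustrate where that performance boost comes from, here is a rough Python sketch contrasting a REGEX-style scan with a simple inverted index. It is a toy model, not how any of these triplestores implement full text search, and the three labels are invented for the example.

```python
import re
from collections import defaultdict

# Invented labels standing in for rdfs:label values.
labels = {
    "ex:1": "Venus flytrap",
    "ex:2": "Transit of Venus",
    "ex:3": "Mars rover",
}

def regex_scan(docs, pattern):
    """Like FILTER(REGEX(lcase(?label), pattern)): every string is inspected per query."""
    compiled = re.compile(pattern)
    return sorted(key for key, text in docs.items() if compiled.search(text.lower()))

def build_index(docs):
    """Built once, ahead of time: maps each lower-cased token to the entries containing it."""
    index = defaultdict(set)
    for key, text in docs.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(key)
    return index

def index_lookup(index, term):
    """A single dictionary lookup at query time, independent of the number of strings."""
    return sorted(index.get(term, set()))

index = build_index(labels)
print(regex_scan(labels, "venus"))
print(index_lookup(index, "venus"))
```

Note that the two are not strictly equivalent: the regular expression matches substrings while this index matches whole tokens, which is one more reason the REGEX and full text results are difficult to compare directly.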

Query 6:

This query finds all soccer players who were born in a country with more than 10 million inhabitants and who played as goalkeeper for a club that has a stadium with more than 30,000 seats, where the club's country is different from the birth country.

Note: This is the second, and final, query that is modified slightly for RDFox. The original query contained both alternate and recurring property paths which were handled by their rule system.

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX : <http://ost.com/>
SELECT DISTINCT ?soccerplayer ?countryOfBirth ?team ?countryOfTeam ?stadiumcapacity
{
  ?soccerplayer a dbo:SoccerPlayer ;
      :position <http://dbpedia.org/resource/Goalkeeper_(association_football)> ;
      :countryOfBirth ?countryOfBirth ;
      dbo:team ?team .
  ?team dbo:capacity ?stadiumcapacity ; dbo:ground ?countryOfTeam .
  ?countryOfBirth a dbo:Country ; dbo:populationTotal ?population .
  ?countryOfTeam a dbo:Country .
  FILTER (?countryOfTeam != ?countryOfBirth)
  FILTER (?stadiumcapacity > 30000)
  FILTER (?population > 10000000)
} order by ?soccerplayer

If interested in alternate property paths, I cover them in my article called Constructing More Advanced SPARQL Queries.

RDFox was fastest again to complete query 6. This speed is probably down to the rule system changes as some of the query is essentially done beforehand.

Others = Blazegraph, GraphDB, Stardog and Virtuoso

For the above reason, RDFox’s rule system will have to be investigated thoroughly before the benchmark.

Virtuoso was the second fastest to complete this query in a time of 54.9ms which is still very fast compared to the average.

Query 7:

Finally, this query finds all people born in Berlin before 1900.

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX : <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?name ?birth ?death ?person
WHERE {
  ?person dbo:birthPlace :Berlin .
  ?person dbo:birthDate ?birth .
  ?person foaf:name ?name .
  ?person dbo:deathDate ?death .
  FILTER (?birth < "1900-01-01"^^xsd:date)
}
ORDER BY ?name

Finishing with a very simple query that uses no custom rules (unlike queries 3 and 6), RDFox was once again the fastest to complete this query.

Others = AnzoGraph, Blazegraph, GraphDB, Stardog and Virtuoso

In this case, the average includes every other triplestore as there were no real outliers. Virtuoso was again the second fastest and completed query 7 in 20.2ms so relatively fast compared to the average. This speed difference is again likely due to the fact that RDFox is an in-memory solution.

Conclusion

To reiterate, this is not a sound comparison so I cannot conclude that RDFox is better or worse than triplestore X and Y. What I can conclude is this:

RDFox can definitely compete with the other major triplestores and initial results suggest that they have the potential to be one of the top performers in our benchmark.

I can also say that RDFox is very easy to use, well documented and the rule system gives the opportunity for users to add custom reasoning very easily.

If you want to try it for yourself, you can request a license here.

Again to summarise the notes throughout: RDFox did not sponsor this article. They have known the others' results since I published my last article but I also tested the previous version of RDFox and didn’t notice any significant optimisation. The rule system makes queries 3 and 6 difficult to compare but that will be investigated before running the benchmark.


Comparison of Linked Data Triplestores: A New Contender was originally published in Wallscope on Medium, where people are continuing the conversation by highlighting and responding to this story.


Constructing More Advanced SPARQL Queries


CONSTRUCT queries, VALUES and more property paths.

It was (quite rightly) pointed out that I strangely did not cover CONSTRUCT queries in my previous tutorial on Constructing SPARQL Queries. Additionally, I then went on to use CONSTRUCT queries in both my Transforming Tabular Data into Linked Data tutorial and the Linked Data Reconciliation article.

So, to finally correct this - I will cover them here!

Contents

SELECT vs CONSTRUCT
First Basic Example
- VALUES
- Alternative Property Paths
Second Basic Example
Example From the Reconciliation Article
Example From the Benchmark (Sneak Preview)

SELECT vs CONSTRUCT

In my last tutorial, I basically ran through SELECT queries from the most basic to some more complex. So what’s the difference?

With selects we are trying to match patterns in the knowledge graph to return results. With constructs we are specifying and building a new graph to return.

In the two tutorials linked (in the intro) I was constructing graphs from tabular data to then insert into a triplestore. I will discuss sections of these later but you should be able to follow the full queries after going through this tutorial.

We usually use CONSTRUCT queries at Wallscope to build a graph for the front-end team. Essentially, we create a portable sub-graph that contains all of the information needed to build a section of an application. Then instead of many select queries to the full database, these queries can be run over the much smaller sub-graph returned by the construct query.

First Basic Example

For this first example I will be querying my Superhero dataset that you can download here.

Each superhero entity in this dataset is connected to their height with the predicate dbo:height as shown here:

Using this basic SELECT query:

PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?hero ?height
WHERE {
  ?hero dbo:height ?height .
}

Now let's modify this query slightly into a CONSTRUCT that is almost the same:

PREFIX dbo: <http://dbpedia.org/ontology/>
CONSTRUCT {
  ?hero dbo:height ?height
} WHERE {
  ?hero dbo:height ?height .
}

As you can see, this returns the same information but in the form: subject, predicate, object.

This is obviously trivial and not entirely useful but we can play with this graph in the construct with only one condition:

All variables in the CONSTRUCT must be in the WHERE clause.

Basically, like in a SELECT query, the WHERE clause matches patterns in the knowledge graph and returns any variables. The difference with a CONSTRUCT is that these variables are then used to build the graph described in the CONSTRUCT clause.

Hopefully that is clear, but it makes more sense if we change the graph description.

For example, if we decided that we wanted to use schema instead of DBpedia’s ontology, we could switch to it in the first clause:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX schema: <http://schema.org/>
CONSTRUCT {
  ?hero schema:height ?height
} WHERE {
  ?hero dbo:height ?height .
}

This then returns the superheroes attached to their heights with the schema:height predicate as the variables are matched in the WHERE clause and then recombined in the CONSTRUCT clause.
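If it helps to see the "match, then recombine" idea outside of SPARQL, here is a small Python sketch of what a CONSTRUCT does conceptually. The hero identifiers and height values are invented, and `where_height` and `construct` are hypothetical helper names, not part of any library.

```python
# Invented stand-in for the superhero graph, as (subject, predicate, object) tuples.
graph = {
    ("hero:A", "dbo:height", "1.88"),
    ("hero:B", "dbo:height", "1.70"),
    ("hero:A", "rdfs:label", "Hero A"),
}

def where_height(store):
    """WHERE { ?hero dbo:height ?height }: one dict of variable bindings per match."""
    return [{"hero": s, "height": o} for (s, p, o) in store if p == "dbo:height"]

def construct(template, bindings):
    """Fill the template triples once per solution to build the new graph."""
    new_graph = set()
    for row in bindings:
        for triple in template:
            # Terms starting with '?' are variables; everything else is kept as-is.
            new_graph.add(tuple(row[t[1:]] if t.startswith("?") else t for t in triple))
    return new_graph

# CONSTRUCT { ?hero schema:height ?height }
template = [("?hero", "schema:height", "?height")]
new_graph = construct(template, where_height(graph))
for triple in sorted(new_graph):
    print(triple)
```

Every variable in the template has to be bound by the WHERE step, which is the same condition stated above for real CONSTRUCT queries.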

This simple predicate switching is not entirely useful on its own (unless you really need to switch ontology for some reason) but is a good first step to understand this type of query.

To create some more useful CONSTRUCT queries, I’ll first go through VALUES and another type of property path.

VALUES

I’m sure there are many use-cases in which the VALUES clause is incredibly useful but I can’t say that I use it often. Essentially, it allows data to be provided within the query.

If you are searching for a particular sport in a dataset, for example, you could match all entities that are sports and then filter the results for it. This gets more complex, however, if you are looking for a few particular sports, in which case you may want to provide them within the query.

With VALUES you can constrain your query by creating a variable (can also create multiple variables) and assigning it some data.
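As a rough illustration of the multiple-variable form, each row is wrapped in parentheses and UNDEF leaves a variable unbound for that row. The capital pairings below are only there to show the syntax and are not used elsewhere in this article:

```sparql
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?country ?capital
WHERE {
  # Two variables, so every row is a parenthesised pair.
  VALUES (?country ?capital) {
    (dbr:Scotland dbr:Edinburgh)
    (dbr:Wales    dbr:Cardiff)
    (dbr:Ireland  UNDEF)        # no value supplied for ?capital in this row
  }
}
```

This runs without touching any stored data, as the rows come entirely from the query itself.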

I tend to use this with federated queries to grab data (usually for insertion into my database) about a few particular entities.

Let’s go through a practical example of this:

PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?country
WHERE {
VALUES ?country {
dbr:Scotland
dbr:England
dbr:Wales
dbr:Northern_Ireland
dbr:Ireland
}
}

In this example I am interested in the five largest countries in the British Isles to compare populations. For reference (I’m from Scotland and had to check I was correct so imagine others may find this useful also):

source

I am using DBpedia for this example so I have assigned the five country entities to the variable ?country and selected them to be returned.

It should therefore be easy enough to grab the corresponding populations you’d think. I add the SERVICE clause to make this a federated query (covered previously). This just sends the countries defined within the query to DBpedia and returns their corresponding populations.

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbp: <http://dbpedia.org/property/>
SELECT ?country ?pop
WHERE {
VALUES ?country {
dbr:Scotland
dbr:England
dbr:Wales
dbr:Northern_Ireland
dbr:Ireland
}

SERVICE <http://dbpedia.org/sparql> {
?country dbp:populationCensus ?pop .
}
}

Here are the results:

You will notice however that Ireland is missing from the results! You will often find this kind of problem with linked open data, as the structure is not always consistent throughout.

To find Ireland’s population we need to switch the predicate from dbp:populationCensus to dbo:populationTotal like so:

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?country ?pop
WHERE {
VALUES ?country {
dbr:Scotland
dbr:England
dbr:Wales
dbr:Northern_Ireland
dbr:Ireland
}

SERVICE <http://dbpedia.org/sparql> {
?country dbo:populationTotal ?pop .
}
}

which returns Ireland alongside its population… but none of the others:

This is of course a problem but before we can construct a solution, let’s run through alternate property paths.

Alternative Property Paths

In my last SPARQL tutorial we covered sequential property paths which (once the benchmark query templates come out) you may notice I am a big fan of.

Another type of property path that I use fairly often is called the Alternative Property Path and is made use of with the pipe (|) character.

If we look back at the problem above in the VALUES section, we can get some populations with one predicate and the rest with another. The alternate property path allows us to match patterns with either! For example, if we modify the population query above we get:

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?country ?pop
WHERE {
VALUES ?country {
dbr:Scotland
dbr:England
dbr:Wales
dbr:Northern_Ireland
dbr:Ireland
}

SERVICE <http://dbpedia.org/sparql> {
?country dbp:populationCensus | dbo:populationTotal ?pop .
}
}

This is such a simple change but so powerful as we now return every country alongside their population with one relatively basic query:
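For comparison, the same pattern can be spelled out with UNION, which is roughly what the pipe is shorthand for. This is just a sketch of the longer form of the query above and not something you need to use:

```sparql
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?country ?pop
WHERE {
  VALUES ?country {
    dbr:Scotland
    dbr:England
    dbr:Wales
    dbr:Northern_Ireland
    dbr:Ireland
  }

  SERVICE <http://dbpedia.org/sparql> {
    # Either branch may bind ?pop, just like the alternative path.
    { ?country dbp:populationCensus ?pop . }
    UNION
    { ?country dbo:populationTotal ?pop . }
  }
}
```

The alternative path keeps this to a single line, which is why I prefer it when the only difference between the branches is the predicate.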

This SELECT is great if we are just looking to find some results but what if we want to store this data in our knowledge graph?

Second Example

It would be a hassle to have to use this alternative property path every time we want to work with country populations. In addition, if users were not aware of this inconsistency, they could find and report incorrect results.

This is why we CONSTRUCT the result graph we want without the inconsistencies. In this case I have chosen dbo:populationTotal as I simply prefer it and use that to connect countries and their populations:

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>
CONSTRUCT {
?country dbo:populationTotal ?pop
} WHERE {
VALUES ?country {
dbr:Scotland
dbr:England
dbr:Wales
dbr:Northern_Ireland
dbr:Ireland
}

SERVICE <http://dbpedia.org/sparql> {
?country dbp:populationCensus | dbo:populationTotal ?pop .
}
}

This query returns the countries and their populations like we saw in the previous section but then connects each country to their population with dbo:populationTotal as described in the CONSTRUCT clause. This returns consistent triples:

This is useful if we wish to store this data as the fact it’s consistent will help avoid the problems mentioned above. I used this technique in one of my previous articles so let’s take a look.

Example From Reconciliation Tutorial

This example is copied directly from my data reconciliation tutorial here. In that article I discuss this query in a lot more detail.

In brief, what I was doing here was grabbing car manufacturer names from tabular data and enhancing that information to store and analyse.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
CONSTRUCT {
?car rdfs:label ?taggedname ;
rdf:type dbo:Company ;
dbo:location ?location .

?location rdf:type dbo:Country ;
rdfs:label ?lname ;
dbp:populationCensus ?pop .
} WHERE {
?c <urn:col:carNames> ?cname .

BIND(STRLANG(?cname, "en") AS ?taggedname)

SERVICE <https://dbpedia.org/sparql> {

?car rdfs:label ?taggedname ;
dbo:location | dbo:locationCountry ?location .

?location rdf:type dbo:Country ;
rdfs:label ?lname ;
dbp:populationCensus | dbo:populationTotal ?pop .

FILTER (LANG(?lname) = "en")
}
}

There is little point repeating myself here so if interested, please take a look. What I am trying to display here is that I have used both the alternative property path (twice!) and the CONSTRUCT clause previously in an example use-case.

Construct queries are perfectly suited to ensuring any data you store is well typed, structured and importantly consistent.

I have been short on time since starting my new project but I am still working on the benchmark in development.

Example From The Benchmark (Sneak Preview)

The benchmark repository is not yet public as I don’t want opinions to be formed before it is fleshed out a little more.

I thought it would be good however to give a real (not made for a tutorial) example query that uses what this article teaches:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX schema: <http://schema.org/>
PREFIX dbp: <http://dbpedia.org/property/>
INSERT {
?city dbo:populationTotal ?pop
} WHERE {
{
SELECT ?city (MAX(?apop) AS ?pop) {
?user schema:location ?city .

SERVICE <https://dbpedia.org/sparql> {
?city dbo:populationTotal | dbp:populationCensus ?apop .
}
}
GROUP BY ?city
}
}

You will notice that this does not contain the CONSTRUCT clause but INSERT instead. You will see me do this switch in both the articles I linked in the introduction. Basically this does nothing too different: the graph that is constructed is inserted into your knowledge graph instead of just returned. The same can be done with the DELETE clause to remove patterns from your knowledge graph.
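As a sketch of DELETE and INSERT used together (assuming your own graph already holds some populations under dbp:populationCensus, which is a made-up scenario for illustration), this update swaps the predicate in place:

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
DELETE {
  # Remove the triples that use the old predicate...
  ?place dbp:populationCensus ?pop
} INSERT {
  # ...and add the same values back under the preferred predicate.
  ?place dbo:populationTotal ?pop
} WHERE {
  ?place dbp:populationCensus ?pop .
}
```

Unlike a CONSTRUCT, this is an update and changes the stored graph, so it is worth running the WHERE clause as a SELECT first to check what it matches.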

This query is very similar to the examples throughout this article (by design of course) but grabs city populations from DBpedia and inserts them into the graph. This is just one point within the query cycle at which the graph changes structure in the benchmark.

Finally, the MAX population is grabbed because some cities in DBpedia have two different populations attached to them…

Conclusion

Hopefully this is useful for some of you! We have covered why and how to use construct queries along with values and alternative property paths.

At the end of May I am going to the DBpedia community meeting in Leipzig so my next linked data article will likely cover things I learned at that event or progress on the benchmark development.

In the meantime I will be releasing my next Computer Vision article and another dive into natural conversation.


Constructing More Advanced SPARQL Queries was originally published in Wallscope on Medium, where people are continuing the conversation by highlighting and responding to this story.

Linked Data Reconciliation in GraphDB

Using DBpedia to Enhance your Data in GraphDB

Following my article on Transforming Tabular Data into Linked Data using OntoRefine in GraphDB, the founder of Ontotext (Atanas Kiryakov) suggested I write a further tutorial using GraphDB for data reconciliation.

In this tutorial we will begin with a .csv of car manufacturers and enhance this with DBpedia. This .csv can be downloaded from here if you want to follow along.

Contents

Setting Up
Constructing the Graph
Reconciling your Data
Exploring the New Graph
Conclusion

Setting Up

First things first, we need to load our tabular data into OntoRefine in GraphDB. Head to the import tab, select “Tabular (OntoRefine)” and upload cars.csv if you are following along.

Click “Next” to start creating the project.

On this screen you need to untick “Parse next 1 line(s) as column headers” as this .csv does not have a header row. Rename the project in the top right corner and click “Create Project”.

You should now have this screen (above) showing one column of car manufacturer names. The column name has a space in it, which is annoying when running SPARQL queries across it, so let’s rename it.

Click the little arrow next to “Column 1”, open “Edit Column” and then click “Rename this Column”. I called it “carNames” and will use this in the queries below so remember if you name it something different.

If you ever make a mistake, remember there is an undo/redo tab.

Constructing the Graph

In the top right of the interface there is an orange button titled “SPARQL”. Click this to open the SPARQL interface from which you can query your tabular data.

In the above screenshot I have run the query we want. I have pasted it here so you can see it all and I go through it in detail below.

I use a CONSTRUCT query here. If you are new to SPARQL entirely then I recommend reading my tutorial on Constructing SPARQL Queries first. I then wrote a second tutorial, which covers constructs, called Constructing More Advanced SPARQL Queries for those who need it.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
CONSTRUCT {
?car rdfs:label ?taggedname ;
rdf:type dbo:Company ;
dbo:location ?location .

?location rdf:type dbo:Country ;
rdfs:label ?lname ;
dbp:populationCensus ?pop .
} WHERE {
?c <urn:col:carNames> ?cname .

BIND(STRLANG(?cname, "en") AS ?taggedname)

SERVICE <https://dbpedia.org/sparql> {

?car rdfs:label ?taggedname ;
dbo:location | dbo:locationCountry ?location .

?location rdf:type dbo:Country ;
rdfs:label ?lname ;
dbp:populationCensus | dbo:populationTotal ?pop .

FILTER (LANG(?lname) = "en")
}
}

I start this query by defining my prefixes as usual. I want to construct a graph around these car manufacturers so I design that in my CONSTRUCT clause. I am building a fairly simple graph for this tutorial so let’s just run through it very quickly.

I want to have entities representing car manufacturers that have a type, label and location. This location is the headquarters of the car manufacturer. In most cases, all entities should have both a type and a human-readable label so I have ensured this here.

Each location is also an entity with an attached type, label and population.

Unlike my superhero tutorial, the .csv only contains the car company names and not all the data we want in our graph. We therefore need to reconcile our data with information in an open linked dataset. In this tutorial we will use DBpedia, the linked data representation of Wikipedia.

To get the information needed to build the graph declared in our CONSTRUCT we first grab all the names in our .csv and assign them to the variable ?cname. String literals must be language tagged to reconcile with the data in DBpedia so I BIND the English language tag “en” to each string literal. This is what the lines below do:

If you didn’t name the column “carNames” above, you will have to modify the <urn:col:carNames> predicate here.
  ?c <urn:col:carNames> ?cname .

BIND(STRLANG(?cname, "en") AS ?taggedname)
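To see what STRLANG does on its own, here is a minimal sketch that needs no data. The two names are placeholders I picked for illustration and are not necessarily in cars.csv:

```sparql
SELECT ?cname ?taggedname
WHERE {
  # Plain string literals, standing in for the values of the carNames column.
  VALUES ?cname { "Toyota" "Honda" }

  # Attach the English language tag to each plain string.
  BIND(STRLANG(?cname, "en") AS ?taggedname)
}
```

Each ?taggedname comes back as the same string with the language tag attached, for example "Toyota"@en. A plain "Toyota" and "Toyota"@en are different literals, which is why the untagged names from the .csv would not match language-tagged labels.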

Following this we use the SERVICE tag to send the query to DBpedia (this is called a federated query). We find every entity with the label matching our language tagged strings from the original .csv.

Once I have those entities, I need to find their locations. DBpedia is a very messy dataset so we have to use an alternative path in the query (represented by the “pipe” | symbol). This finds locations connected by any of the alternate paths given (in this case dbo:location and dbo:locationCountry) and assigns them to the variable ?location.

That explanation is referring to these lines:

    ?car rdfs:label ?taggedname ;
dbo:location | dbo:locationCountry ?location .

Next we want to retrieve the information about each country. The first pattern in the location ensures the entity has the type dbo:Country so that we don’t find loads of irrelevant locations.

Following this we grab the label and again use alternate property paths to extract each country’s population.

It is important to note that some countries have two different populations attached by these two predicates.

We finally FILTER the country labels to only return those that are in English as that is the language our original dataset is in. Data reconciliation can also be used to extend your data into other languages if it happens to fit a multilingual linked open dataset.

That covers the final few lines of our query:

    ?location rdf:type dbo:Country ;
rdfs:label ?lname ;
dbp:populationCensus | dbo:populationTotal ?pop .

FILTER (LANG(?lname) = "en")

Next we need to insert this graph we have constructed into a GraphDB repository.

Click “SPARQL endpoint” and copy your endpoint (will be different) to be used later.

Reconciling your Data

If you have not done already, create a repository and head to the SPARQL tab.

You can see in the top right of this screenshot that I’m using a repository called “cars”.

In this query panel you want to copy the CONSTRUCT query we built and modify it a little. The full query is here:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
INSERT {
?car rdfs:label ?taggedname ;
rdf:type dbo:Company ;
dbo:location ?location .

?location rdf:type dbo:Country ;
rdfs:label ?lname ;
dbp:populationCensus ?pop .
} WHERE { SERVICE <http://localhost:7200/rdf-bridge/yourID> {
?c <urn:col:carNames> ?cname .

BIND(STRLANG(?cname, "en") AS ?taggedname)

SERVICE <https://dbpedia.org/sparql> {

?car rdfs:label ?taggedname ;
dbo:location | dbo:locationCountry ?location .

?location rdf:type dbo:Country ;
rdfs:label ?lname ;
dbp:populationCensus | dbo:populationTotal ?pop .

FILTER (LANG(?lname) = "en")
}}
}

The first thing we do is replace CONSTRUCT with INSERT as we now want to ingest the returned graph into our repository.

The next and final thing we must do is nest the entire WHERE clause into a second SERVICE tag. This time however, the service endpoint is the endpoint you copied at the end of the construction section.

This constructs the graph and inserts it into your repository!

It should be a much larger graph but the messiness of DBpedia strikes again! Many car manufacturers are connected to the string label of their location and not the entity. Therefore, the locations do not have a population and are consequently not returned.

We started with a small .csv of car manufacturer names so let’s explore this graph we now have.

Exploring the New Graph

If we head to the “Explore” tab and view Japan for example, we can see our data.

Japan has the attached type dbo:Country, label, population and has seven car manufacturers.

There is no point in linking data if we cannot gain further insight so let’s head to the “SPARQL” tab of the workbench.

In this screenshot we can see the results of the below query. This query returns each country alongside the number of people per car manufacturer in that country.

There is nothing new in this query if you have read my SPARQL introduction. I have used the MAX population as some countries have two attached populations due to DBpedia.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
SELECT ?name ((MAX(?pop) / COUNT(DISTINCT ?companies)) AS ?result)
WHERE {
?companies rdf:type dbo:Company ;
dbo:location ?location .

?location rdf:type dbo:Country ;
dbp:populationCensus ?pop ;
rdfs:label ?name .
}
GROUP BY ?name
ORDER BY DESC (?result)

In the screenshot above you can see that the results (ordered by result in descending order) are:

  • Indonesia
  • Pakistan
  • India
  • China

India of course has a much larger population than Indonesia but also has a lot more car manufacturers (as shown below).
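If you want to check the manufacturer counts yourself, a query along these lines should list them per country. It is only a sketch that reuses the graph structure inserted earlier, so it assumes you followed the same CONSTRUCT:

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?name (COUNT(DISTINCT ?company) AS ?manufacturers)
WHERE {
  ?company rdf:type dbo:Company ;
           dbo:location ?location .

  ?location rdf:type dbo:Country ;
            rdfs:label ?name .
}
GROUP BY ?name
ORDER BY DESC(?manufacturers)
```

The DISTINCT matters for the same reason as in the query above: a country with two attached populations or labels should not have its manufacturers counted twice.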

If you were a car manufacturer in Asia, Indonesia might be a good market to target for export as it has a high population but very little local competition.

Conclusion

We started with a small list of car manufacturer names but, by using GraphDB and DBpedia, we managed to extend this into a small graph that we could gain actual insight from.

Of course, this example is not entirely useful but perhaps you have a list of local areas or housing statistics that you want to reconcile with mapping or government linked open data. This can be done using the above approach to help you or your business gain further insight that you could not have otherwise identified.


Linked Data Reconciliation in GraphDB was originally published in Wallscope on Medium, where people are continuing the conversation by highlighting and responding to this story.

ISWC 2018

ISWC 2018 Trip Report

Keynotes

There were three amazing and inspiring keynote talks, all very different from each other.

The first was given by Jennifer Golbeck (University of Maryland). While Jennifer did her PhD on the Semantic Web in the early days of social media and Linked Data, she now focuses on user privacy and consent. These are highly relevant topics to the Semantic Web community and something that we should really be considering when linking people’s personal data. While the consequences of linking scientific data might not be as scary, there are still ethical issues to consider if we do not get it right. Check out her TED talk for an abridged version of her keynote.

She also suggested that when reading a company’s privacy policy, you should replace the word “privacy” with “consent” and see how it seems then.

The talk also struck a chord with the launch of the SOLID framework by Tim Berners-Lee. There was a good sales pitch of the SOLID framework from Ruben Verborgh in the afternoon of the Decentralising the Semantic Web Workshop.

The second was given by Natasha Noy (Google). Natasha talked about the challenges of being a researcher and engineering tools that support the community, particularly where impact may only be detected 6 to 10 years down the line. She also highlighted that Linked Data is only a small fraction of the data in the world (the tip of the iceberg), and it is not appropriate to expect all data to become Linked Data.

Her most recent endeavour has been the Google Dataset Search Tool. This has been a major engineering and social endeavour; getting schema.org markup embedded on pages and building a specialist search tool on top of the indexed data. More details of the search framework are in this blog post. The current search interface is limited due to the availability of metadata; most sites only make title and description available. However, we can now start investigating how to return search results for datasets and what additional data might be of use. This for me is a really exciting area of work.

Later in the day I attended a talk on the LOD Atlas, another dataset search tool. While this gives a very detailed user interface, it is only designed for Linked Data researchers, not general users looking for a dataset.

The third keynote was given by Vanessa Evers (University of Twente, The Netherlands). This was in a completely different domain, social interactions with robots, but still raised plenty of questions for the community. For me the challenge was how to supply contextualised data.

Knowledge Graph Panel

The other big plenary event this year was the knowledge graph panel. The panel consisted of representatives from Microsoft, Facebook, eBay, Google, and IBM, all of whom were involved with the development of Knowledge Graphs within their organisation. A major concern for the Semantic Web community is that most of these panelists were not aware of our community or the results of our work. Another concern is that none of their systems use any of our results, although it sounds like several of them use something similar to RDF.

The main messages I took from the panel were:

  • Scale and distribution were key

  • Source information is going to be noisy and challenging to extract value from

  • Metonymy is a major challenge

This final point connects with my work on contextualising data for the task of the user [1, 2] and has reinvigorated my interest in this research topic.

Final Thoughts

This was another great ISWC conference, although many familiar faces were missing.

There was a great and vibrant workshop programme. My paper [3] was presented during the Enabling Open Semantic Science workshop (SemSci 2018) and resulted in a good deal of discussion. There were also great keynotes at the workshop from Paul Groth (slides) and Yolanda Gil which I would recommend anyone to look over.

I regret not having gone to more of the Industry Track sessions. The one I did make was very inspiring to see how the results of the community are being used in practice, and to get insights into the challenges faced.

The conference banquet involved a walking dinner around the Monterey Bay Aquarium. This was a great idea as it allowed plenty of opportunities for conversations with a wide range of conference participants; far more than your standard banquet.

Here are some other takes on the conference:

I also managed to sneak off to look for the sea otters.

[1] Colin R. Batchelor, Christian Y. A. Brenninkmeijer, Christine Chichester, Mark Davies, Daniela Digles, Ian Dunlop, Chris T. A. Evelo, Anna Gaulton, Carole A. Goble, Alasdair J. G. Gray, Paul T. Groth, Lee Harland, Karen Karapetyan, Antonis Loizou, John P. Overington, Steve Pettifer, Jon Steele, Robert Stevens, Valery Tkachenko, Andra Waagmeester, Antony J. Williams, and Egon L. Willighagen. Scientific Lenses to Support Multiple Views over Linked Chemistry Data. In The Semantic Web – ISWC 2014 – 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part I, pages 98–113, 2014. doi:10.1007/978-3-319-11964-9_7
[2] Alasdair J. G. Gray. Dataset Descriptions for Linked Data Systems. IEEE Internet Computing, 18(4):66–69, 2014. doi:10.1109/MIC.2014.66
[3] Alasdair J. G. Gray. Using a Jupyter Notebook to perform a reproducible scientific analysis over semantic web sources. In Enabling Open Semantic Science, Monterey, California, USA, October 2018. http://ceur-ws.org/Vol-2184/paper-02/paper-02.html Executable version: https://mybinder.org/v2/gh/AlasdairGray/SemSci2018/master?filepath=SemSci2018%20Publication.ipynb

SLiDInG 6

Today, the Semantic Web Lab hosted the 6th Scottish Linked Data Interest Group workshop at Heriot-Watt University. The event was sponsored by the SICSA Data Science Theme. The event was well attended with 30 researchers from across Scotland (and Newcastle) coming together for a day of flash talks and discussions. Live minutes were captured during the day and can be found here.

I gave a talk on the successes and challenges of FAIR data. My slides are embedded below.

DUCS not LOD

The following is an excerpt from a blog by Keir Winesmith, Head of Digital at the San Francisco Museum of Modern Art (@SFMOMAlab)

Linked Open Data may sound good and noble, but it’s the wrong way around. It is a truth universally acknowledged, that an organization in possession of good Data, must want it Open (and indeed, Linked).

Well, I call bullshit. Most cultural heritage organizations (like most organizations) are terrible at data. And most of those who are good at collecting it, very rarely use it effectively or strategically.

Instead of Linked Open Data (LOD), Keir argues for DUCS:

I propose an alternative anagram, and an alternative order of importance.

  • D. Data. Step one, collect the data that is most likely to help you and your organization make better decisions in the future. For example collection breadth, depth, accuracy, completeness, diversity, and relationships between objects and creators.
  • U. Utilise. Actually use the data to inform your decisions, and test your hypotheses, within the bounds of your mission.
  • C. Context. Provide context for your data, both internally and externally. What’s inside? How is represented? How complete is it? How accurate? How current? How was it gathered?
  • S. Share. Now you’re ready to share it! Share it with context. Share it with the communities that are included in it first, follow the cultural heritage strategy of “nothing about me, without me”. Reach out to the relevant students, scholars, teachers, artists, designers, anthropologists, technologists, and whomever could use it. Get behind it and keep it up to date.

I’m against LOD, if it doesn’t follow DUCS first.

If you’re going to do it, do it right.

Source: Against Linked Open Data – Keir Winesmith – Medium