Ambient Assisted Living (AAL) Summer School

SICSA is sponsoring the Ambient Assisted Living (AAL) Summer School, which is taking place on 6th–8th August at Heriot-Watt University.

The SICSA Ambient Assisted Living (AAL) Summer School is Scotland’s first ever summer school designed to allow students to explore key concepts for the design of advanced AAL systems.

The program includes presentations from industry representatives and healthcare organisations, in addition to lectures and tutorials on sensing, linked data, machine learning and robotics for AAL applications.

The summer school is open to students and research staff, and employees of charities, non-profit organisations and companies with relevant backgrounds. Financial support is available for students at SICSA institutes to cover their on-campus accommodation and lunch costs, generously sponsored by SICSA, SICSA CPS and AI themes, and Nexus.

The deadline for applications is 30th June. Attendance will be limited, so please apply early.

Full details of the SICSA AAL Summer School can be found here.

If you have any questions or you would like to contact the Organisers, please see here.

Comparison of Linked Data Triplestores: A New Contender

First Impressions of RDFox while the Benchmark is Developed

Note: This article is not sponsored. Oxford Semantic Technologies let me try out the new version of RDFox and are keen to be part of the future benchmark.

After reading some of my previous articles, Oxford Semantic Technologies (OST) got in touch and asked if I would like to try out their triplestore called RDFox.

In this article I will share my thoughts and why I am now excited to see how they do in the future benchmark.

They have just released a page on which you can request your own evaluation license to try it yourself.

Contents

Brief Benchmark Catch-up
How I Tested RDFox
First Impressions
Results
Conclusion


Brief Benchmark Catch-up

In December I wrote a comparison of existing triplestores on a tiny dataset. I quickly learned that there were too many flaws in my methodology for the results to be truly comparable.

In February I then wrote a follow-up in which I described many of those flaws and listed the details I will have to pay attention to while developing an actual benchmark.

This benchmark is currently in development. I am now working with developers and academics, and talking with a high-performance computing centre to gain access to the infrastructure needed to run at scale.

How I Tested RDFox

In the above articles I evaluated five triplestores. They were (in alphabetical order) AnzoGraph, Blazegraph, GraphDB, Stardog and Virtuoso. I would like to include all of these in the future benchmark and now RDFox as well.

Obviously my previous evaluations are not completely fair comparisons (hence the development of the benchmark) but the last one can be used to get an idea of whether RDFox can compete with the others.

For that reason, I loaded the same data and ran the same queries as in my larger comparison to see how RDFox fared. I of course kept all other variables the same such as the machine, using the CLI to query, same number of warm up and hot runs, etc…

First Impressions

RDFox is very easy to use and well-documented. You can initialise it with custom scripts which is extremely useful as I could start RDFox, load all my gzipped turtle files in parallel, run all my warm up queries and run all my hot queries with one command.

RDFox is an in-memory solution, which explains many of the differences in results, but it also has a very nice rule system that can be used to precompute results used by later queries. These rules are evaluated in advance, not when you send a query.

This allows them to be used to automatically keep your data consistent as it is added or removed. The rules themselves can even be added or removed during the lifetime of the database.

Note: Queries 3 and 6 use these custom rules. I highlight this on the relevant queries and this is one of the reasons I didn’t just add them to the last article.

You can use these rules for a number of things, for example to precompute alternate property paths (if you are unfamiliar with those, I cover them in this SPARQL tutorial). You could do this by defining a rule stating that example:predicate represents both example:pred1 and example:pred2, so that:

SELECT ?s ?o
WHERE {
?s example:predicate ?o .
}

This would return the triples:

person:A example:pred1 colour:blue .
person:A example:pred2 colour:green .
person:B example:pred2 colour:brown .
person:C example:pred1 colour:grey .

This makes the use of alternate property paths less necessary.
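RDFox rules are written in its own Datalog dialect, which I won't reproduce here; as a rough SPARQL analogy only (not RDFox's actual rule syntax, and using a hypothetical example: prefix), the effect of such a rule resembles materialising the derived triples in advance with an INSERT:

```sparql
PREFIX example: <http://example.org/>

# Materialise example:predicate from its two source predicates;
# roughly what the rule precomputes ahead of query time.
INSERT { ?s example:predicate ?o }
WHERE {
  { ?s example:pred1 ?o . }
  UNION
  { ?s example:pred2 ?o . }
}
```

The key difference is that RDFox keeps such derived triples up to date incrementally as data is added or removed, whereas an INSERT like this is a one-off snapshot.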

With all that… let’s see how RDFox performs.

Results

Each of the below charts compares RDFox to the averages of the others. I exclude outliers where applicable.

Loading

Right off the bat, RDFox was the fastest at loading, which was a huge surprise as AnzoGraph was significantly faster than the others originally (184,667ms).

Even if I exclude Blazegraph and GraphDB (which were significantly slower at loading than the others), you can see that RDFox was very fast:

Others = AnzoGraph, Stardog and Virtuoso

Note: I have added who the others are as footnotes to each chart. This is so that the images are not mistaken for fair comparison results (if they showed up in a Google image search for example).

RDFox and AnzoGraph are much newer triplestores which may be why they are so much faster at loading than the others. I am very excited to see how these speeds are impacted as we scale the number of triples we load in the benchmark.

Queries

Overall I am very impressed with RDFox’s performance with these queries.

It is important to note however that the others' results have been public for a while. I ran these queries on both the newest version of RDFox and the previous version and did not notice any significant optimisation on these particular queries. The published results are of course from the latest version.

Query 1:

This query is very simple and just counts the number of relationships in the graph.

SELECT (COUNT(*) AS ?triples)
WHERE {
?s ?p ?o .
}

RDFox was the second slowest to do this, but as mentioned in the previous article, optimisations on this query often reduce correctness.

Faster = AnzoGraph, Blazegraph, Stardog and Virtuoso. Slower = GraphDB

Query 2:

This query returns a list of 1000 settlement names which have airports with identification numbers.

PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?v WHERE {
{ ?v2 a dbo:Settlement ;
rdfs:label ?v .
?v6 a dbo:Airport . }
{ ?v6 dbo:city ?v2 . }
UNION
{ ?v6 dbo:location ?v2 . }
{ ?v6 dbp:iata ?v5 . }
UNION
{ ?v6 dbo:iataLocationIdentifier ?v5 . }
OPTIONAL { ?v6 foaf:homepage ?v7 . }
OPTIONAL { ?v6 dbp:nativename ?v8 . }
} LIMIT 1000

RDFox was the fastest to complete this query by a fairly significant margin and this is likely because it is an in-memory solution. GraphDB was the second fastest in 29.6ms and then Virtuoso in 88.2ms.

Others = Blazegraph, GraphDB, Stardog and Virtuoso

I would like to reiterate that there are several problems with these queries that will be solved in the benchmark. For example, this query has a LIMIT but no ORDER BY which is highly unrealistic.
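For instance, a more realistic variant of query 2 would make the returned slice deterministic by sorting before limiting (a sketch, with the rest of the pattern abbreviated):

```sparql
SELECT DISTINCT ?v WHERE {
  ?v2 a dbo:Settlement ;
      rdfs:label ?v .
  # ... rest of the query 2 pattern ...
}
ORDER BY ?v
LIMIT 1000
```

Without the ORDER BY, each store is free to return any 1000 matching labels, which makes the work being timed less comparable.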

Query 3:

This query nests query 2 to grab information about the 1,000 settlements returned above.

You will notice that this query is slightly different to query 3 in the original article.

PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?v ?v2 ?v5 ?v6 ?v7 ?v8 WHERE {
?v2 a dbo:Settlement;
rdfs:label ?v.
?v6 a dbo:Airport.
{ ?v6 dbo:city ?v2. }
UNION
{ ?v6 dbo:location ?v2. }
{ ?v6 dbp:iata ?v5. }
UNION
{ ?v6 dbo:iataLocationIdentifier ?v5. }
OPTIONAL { ?v6 foaf:homepage ?v7. }
OPTIONAL { ?v6 dbp:nativename ?v8. }
{
FILTER(EXISTS{ SELECT ?v WHERE {
?v2 a dbo:Settlement;
rdfs:label ?v.
?v6 a dbo:Airport.
{ ?v6 dbo:city ?v2. }
UNION
{ ?v6 dbo:location ?v2. }
{ ?v6 dbp:iata ?v5. }
UNION
{ ?v6 dbo:iataLocationIdentifier ?v5. }
OPTIONAL { ?v6 foaf:homepage ?v7. }
OPTIONAL { ?v6 dbp:nativename ?v8. }
}
LIMIT 1000
})
}
}

RDFox was again the fastest to complete query 3 but it is important to reiterate that this query was modified slightly so that it could run on RDFox. The only other query that has the same issue is query 6.

Others = Blazegraph, GraphDB, Stardog and Virtuoso

The results of queries 2 and 3 are very similar of course, as query 2 is nested within query 3.

Query 4:

The two queries above were similar but query 4 is a lot more mathematical.

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT (ROUND(?x/?y) AS ?result) WHERE {
{SELECT (CEIL(?a + ?b) AS ?x) WHERE {
{SELECT (AVG(?abslat) AS ?a) WHERE {
?s1 geo:lat ?lat .
BIND(ABS(?lat) AS ?abslat)
}}
{SELECT (SUM(?rv) AS ?b) WHERE {
?s2 dbo:volume ?volume .
BIND((RAND() * ?volume) AS ?rv)
}}
}}

{SELECT ((FLOOR(?c + ?d)) AS ?y) WHERE {
{SELECT ?c WHERE {
BIND(MINUTES(NOW()) AS ?c)
}}
{SELECT (AVG(?width) AS ?d) WHERE {
?s3 dbo:width ?width .
FILTER(?width > 50)
}}
}}
}

AnzoGraph was the quickest to complete query 4 with RDFox in second place.

Faster = AnzoGraph. Slower = Blazegraph, Stardog and Virtuoso

Virtuoso was the third fastest to complete this query in a time of 519.5ms.

As these queries cannot be given random seeds, I have made sure to include mathematical queries in the benchmark.

Query 5:

This query focuses on strings rather than mathematical functions. It essentially grabs all labels containing the string ‘venus’, all comments containing ‘sleep’ and all abstracts containing ‘gluten’. It then constructs an entity and attaches all of these to it.

I use a CONSTRUCT query here. I wrote a second SPARQL tutorial, which covers constructs, called Constructing More Advanced SPARQL Queries for those that need it.

PREFIX ex: <http://wallscope.co.uk/resource/example/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT {
ex:notglutenfree rdfs:label ?label ;
rdfs:comment ?sab ;
dbo:abstract ?lab .
} WHERE {
{?s1 rdfs:label ?label .
FILTER (REGEX(lcase(?label), 'venus'))
} UNION
{?s2 rdfs:comment ?sab .
FILTER (REGEX(lcase(?sab), 'sleep'))
} UNION
{?s3 dbo:abstract ?lab .
FILTER (REGEX(lcase(?lab), 'gluten'))
}
}

As discussed in the previous post, it is uncommon to use REGEX queries if you can run a full text index query on the triplestore. AnzoGraph and RDFox are the only two that do not have built-in full text indexes, hence these results:

Faster = AnzoGraph. Slower = Blazegraph, GraphDB, Stardog and Virtuoso

AnzoGraph is a little faster than RDFox to complete this query but the two of them are significantly faster than the rest. This is of course because you would use the full text index capabilities of the other triplestores.

If we instead run full text index queries, they are significantly faster than RDFox.

Note: To be clear, in this chart RDFox was running the REGEX query, as it does not have full text index functionality.

Others = Blazegraph, GraphDB, Stardog and Virtuoso

Whenever I can run a full text index query I will because of the serious performance boost. Therefore this chart is definitely fairer on the other triplestores.
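As an illustration of what the other stores were running instead of REGEX, Virtuoso exposes its full text index through the bif:contains extension (the exact syntax here is a sketch, and each of the other stores has its own equivalent):

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?s ?label WHERE {
  ?s rdfs:label ?label .
  # Virtuoso-specific full text match, standing in for
  # FILTER (REGEX(lcase(?label), 'venus'))
  ?label bif:contains "venus" .
}
```

The index lookup avoids scanning and lowercasing every label, which is where the performance boost comes from.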

Query 6:

This query finds all soccer players who were born in a country with more than 10 million inhabitants, who played as goalkeeper for a club whose stadium has more than 30,000 seats, and whose club's country is different from their country of birth.

Note: This is the second, and final, query that is modified slightly for RDFox. The original query contained both alternate and recurring property paths which were handled by their rule system.

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX : <http://ost.com/>
SELECT DISTINCT ?soccerplayer ?countryOfBirth ?team ?countryOfTeam ?stadiumcapacity
{
?soccerplayer a dbo:SoccerPlayer ;
:position <http://dbpedia.org/resource/Goalkeeper_(association_football)> ;
:countryOfBirth ?countryOfBirth ;
dbo:team ?team .
?team dbo:capacity ?stadiumcapacity ; dbo:ground ?countryOfTeam .
?countryOfBirth a dbo:Country ; dbo:populationTotal ?population .
?countryOfTeam a dbo:Country .
FILTER (?countryOfTeam != ?countryOfBirth)
FILTER (?stadiumcapacity > 30000)
FILTER (?population > 10000000)
} order by ?soccerplayer

If interested in alternate property paths, I cover them in my article called Constructing More Advanced SPARQL Queries.
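To make that concrete, the two-branch UNIONs in queries 2 and 3 are exactly what an alternate property path expresses in a single triple pattern, for example:

```sparql
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?v6 ?v2 ?v5 WHERE {
  # Each '|' alternate replaces a two-branch UNION from queries 2 and 3.
  ?v6 a dbo:Airport ;
      dbo:city|dbo:location ?v2 ;
      dbp:iata|dbo:iataLocationIdentifier ?v5 .
}
```

The original query 6 used paths like these, which is why it had to be rewritten against RDFox's rule system.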

RDFox was fastest again to complete query 6. This speed is probably down to the rule system changes as some of the query is essentially done beforehand.

Others = Blazegraph, GraphDB, Stardog and Virtuoso

For the above reason, RDFox’s rule system will have to be investigated thoroughly before the benchmark.

Virtuoso was the second fastest to complete this query in a time of 54.9ms which is still very fast compared to the average.

Query 7:

Finally, this query finds all people born in Berlin before 1900.

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX : <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?name ?birth ?death ?person
WHERE {
?person dbo:birthPlace :Berlin .
?person dbo:birthDate ?birth .
?person foaf:name ?name .
?person dbo:deathDate ?death .
FILTER (?birth < "1900-01-01"^^xsd:date)
}
ORDER BY ?name

Finishing with a very simple query and no custom rules (unlike queries 3 and 6), RDFox was once again the fastest to complete this query.

Others = AnzoGraph, Blazegraph, GraphDB, Stardog and Virtuoso

In this case, the average includes every other triplestore, as there were no real outliers. Virtuoso was again the second fastest, completing query 7 in 20.2ms, which is relatively fast compared to the average. RDFox's speed advantage is again likely due to the fact that it is an in-memory solution.

Conclusion

To reiterate, this is not a sound comparison so I cannot conclude that RDFox is better or worse than triplestore X and Y. What I can conclude is this:

RDFox can definitely compete with the other major triplestores and initial results suggest that they have the potential to be one of the top performers in our benchmark.

I can also say that RDFox is very easy to use and well-documented, and its rule system makes it very easy for users to add custom reasoning.

If you want to try it for yourself, you can request a license here.

Again, to summarise the notes throughout: RDFox did not sponsor this article. They did know the others' results, as these have been public since my last article, but I also tested the previous version of RDFox and didn't notice any significant optimisation. The rule system makes queries 3 and 6 difficult to compare, but that will be investigated before running the benchmark.


Comparison of Linked Data Triplestores: A New Contender was originally published in Wallscope on Medium, where people are continuing the conversation by highlighting and responding to this story.


How Dementia Affects Conversation: Building a More Accessible Conversational AI

Diving into the Literature – What do we know?


We all (roughly) know how to naturally converse with one another. This is mostly subconscious and only really noticeable if an interaction skews from what most consider “normal.” In the majority of cases, these are just minor differences, such as someone speaking a little too close or interrupting more often than usual.

However, more significant conversational differences can start to occur when parts of the brain begin to decline in performance.

Contents

Introduction
Overview of Dementia
Papers Covered
Motivation – Why Speech?
Datasets
Models
Important Language Features
Conclusion

Introduction

I am currently working towards creating a more natural conversational agent (such as Siri, Alexa, etc…) for those with cognitive impairments, who can potentially benefit the most from these systems. Currently, we have to adapt how we speak to these systems and know exactly how to ask for certain functions. For those who struggle to adapt, I hope to lower some of these barriers so that they can live more independently for longer. If you want to read more about the overall project, I discussed it in more detail in an interview here.

To kick off this project with Wallscope and The Data Lab, I first investigated some of the research centered on recreating natural conversation with conversational agents. This research all related to a healthy population, but a question arose: do some of these phenomena vary when conversing with those who have forms of cognitive impairment?

In my previous article, I covered two papers that discuss end-of-turn prediction. They created brilliant models to predict when someone has finished their turn to replace models that just wait for a duration of silence.

If someone with Dementia takes a little longer to remember the word(s) they’re looking for, the silence threshold models used in current systems will interrupt them. I suspect the research models would also perform worse than with a healthy population, so I’m collecting a corpus to investigate this.

As my ultimate aim is to make conversational agents more naturally usable for those with dementia, I’ll dive into some of the related research in this article.

Overview of Dementia

I am by no means a dementia expert so this information was all collected from an amazing series of videos by the Alzheimer’s Society.

Their Website

Dementia is not a disease but the name for a group of symptoms that commonly include problems with:

  • Memory
  • Thinking
  • Problem Solving
  • Language
  • Visual Perception

For people with dementia, these symptoms have progressed enough to affect daily life and are not a natural part of aging, as they’re caused by different diseases (I highlight some of them below).

All of these diseases cause the loss of nerve cells, and this gets gradually worse over time, as these nerve cells cannot be replaced.

As more and more cells die, the brain shrinks (atrophies) and symptoms sharpen. Which symptoms set in first depends on which part of the brain atrophies—so people are impacted differently.

source — you can see the black areas expanding as nerve cells die and atrophy progresses.

For example, if the occipital lobe begins to decline, then visual symptoms would progress, whereas losing the temporal lobe would cause language problems…

Other common symptoms impact:

  • Day-to-day memory
  • Concentration
  • Organization
  • Planning
  • Language
  • Visual Perception
  • Mood

There is currently no cure…

Before moving on to cover recent research surrounding language problems, it’s important to note that most research is disease-specific. Therefore, I’ll briefly cover the four most common types of dementia.

All of this information again comes from the series of videos created by the Alzheimer’s Society.

Alzheimer’s Disease

The most common type of dementia is Alzheimer’s Disease (AD), and for this reason, it’s also the most understood (you’ll notice this in the research).

A healthy brain contains proteins (two of which are called amyloid and tau), but if the brain starts to function abnormally, these proteins form abnormal deposits called plaques and tangles.

source

These plaques and tangles damage nerve cells, which causes them to die and the brain to shrink, as shown above.

The hippocampus is usually the first area of the brain to decline in performance when someone has AD. This is unfortunately where memories are formed, so people will often forget what they have just done and may therefore repeat themselves in conversation.

Recent memories are lost first, whereas childhood memories can still be retrieved as they depend less on the hippocampus. Additionally, emotions can usually be recalled as the amygdala is still intact, whereas the facts surrounding those emotions can be lost.

AD gradually progresses, so symptoms worsen and become more numerous slowly over time.

Vascular Dementia

The second most common type of dementia is vascular dementia, which is caused by problems with the brain’s blood supply.

Nerve cells need oxygen and nutrients to survive, so without them they become damaged and die. Therefore, when blood supply is interrupted by a blockage or leak, significant damage can be caused.

Like with AD, symptoms depend on which parts of the brain are impacted. When the parts damaged are responsible for memory, thinking, or language, the person will have problems remembering, thinking or speaking.

source

Vascular dementia can be caused by strokes. Sometimes one major stroke can cause it, but in other cases a person may suffer from multiple smaller strokes that gradually cause damage.

The most common cause of vascular dementia is small-vessel disease, which gradually narrows the vessels in the brain. As the narrowing continues and spreads, more of the brain gets damaged.

Vascular dementia can therefore have a gradual progression like AD or, if caused by strokes, a step-like progression with symptoms worsening after each stroke.

Dementia with Lewy Bodies

Closely related to AD, but less common, is a type of dementia called dementia with Lewy bodies.

Lewy bodies are tiny clumps of protein that develop inside nerve cells in the brain. This prevents communication between cells, which causes them to die.

source

Researchers have not yet identified why Lewy bodies form or how. We do know, however, that they can form in any part of the brain, which, again, leads to varying symptoms.

People can have problems with concentration, movement, alertness, and can even have visual hallucinations. These hallucinations are often distressing and lead to sleep problems.

Dementia with Lewy bodies progresses gradually and spreads as more nerve cells get damaged, so memory is always impacted eventually.

Frontotemporal dementia

The last type of dementia I’ll cover is frontotemporal dementia (FTD), which is a range of conditions in which cells in the frontal and temporal lobes of the brain are damaged.

source

FTD is again a less common type of dementia but is surprisingly more likely to affect younger people (below 65).

The frontal and temporal lobes of the brain control behavior, emotion, and language, and symptoms appear in a different order depending on which lobe is impacted first.

The frontal lobe is usually the first to decline in performance, so changes begin to show through a person’s personality, behavior, and inhibitions.

Alternatively, when the temporal lobe is impacted first, a person will struggle with language. For example, they may struggle to find the right word.

FTD is thought to occur when proteins such as tau build up in nerve cells, but unlike the other causes, this is likely hereditary.

Eventually as FTD progresses, symptoms of frontal and temporal damage overlap, and both occur.

Papers Covered

That overview of dementia was fairly in-depth, so we should now have a common foundation for this article and all subsequent articles.

As we now know, difficulty with language is a common symptom of dementia, so in order to understand how it changes, I’ll cover four papers that investigate this. These include the following:

[1]

A Speech Recognition Tool for Early Detection of Alzheimer’s Disease by Brianna Marlene Broderick, Si Long Tou and Emily Mower Provost

[2]

A Method for Analysis of Patient Speech in Dialogue for Dementia Detection by Saturnino Luz, Sofia de la Fuente and Pierre Albert

[3]

Speech Processing for Early Alzheimer Disease Diagnosis: Machine Learning Based Approach by Randa Ben Ammar and Yassine Ben Ayed

[4]

Detecting Cognitive Impairments by Agreeing on Interpretations of Linguistic Features by Zining Zhu, Jekaterina Novikova and Frank Rudzicz

Note: I will refer to each paper with their corresponding number from now on.

Motivation – Why Speech?

These four papers have a common motivation: to detect dementia in a more cost-effective and less intrusive manner.

These papers tend to focus on Alzheimer’s Disease (AD) because, as [3] mentions, 60–80% of dementia cases are caused by AD. I would add that this is likely why AD features most in existing datasets, also.

Current Detection Methods

[1] points out that dementia is relatively difficult to diagnose as progression and symptoms vary widely. The diagnostic processes are therefore complex, and dementia often goes undiagnosed because of this.

[2] explains that imaging (such as PET or MRI scans) and cerebrospinal fluid analysis can be used to detect AD very accurately, but these methods are expensive and extremely invasive. A lumbar puncture must be performed to collect cerebrospinal fluid, for example.

source – lumbar puncture aka “spinal tap”

[2] also points out that neuropsychological detection methods have been developed that can, to varying levels of accuracy, detect signs of AD. [1] adds that these often require repeat testing and are therefore time-consuming and cause additional stress and confusion to the patient.

As mentioned above, [1] argues that dementia often goes undiagnosed because of these flaws. [2] agrees that it would be beneficial to detect AD pathology long before someone is actually diagnosed in order to implement secondary prevention.

Will Speech Analysis Help?

As repeatedly mentioned in the overview of dementia above, language is known to be impacted through various signs such as struggles with word-finding, understanding difficulties, and repetition. [3] points out that language relies heavily on memory, and for this reason, one of the earliest signs of AD may be in a person’s speech.

source

[2] reinforces this point by highlighting the fact that in order to communicate successfully, a person must be able to perform complex decision making, strategy planning, consequence foresight, and problem solving. These are all impaired as dementia progresses.

Practically, [2] states that speech is easy to acquire and elicit, so they (along with [1], [3], and [4]) propose that speech could be used to diagnose Dementia in a cost-effective, non-invasive, and timely manner.

To start investigating this, we need the data.

Datasets

As you can imagine, it isn’t easy to acquire suitable datasets to investigate this. For this reason [1], [3], and [4] used the same dataset from DementiaBank (a repository within TalkBank) called the Pitt Corpus. This corpus contains audio and transcriptions of people with AD and healthy elderly controls.

To elicit speech, participants (both groups) were asked to describe the Cookie Theft stimulus photo:

source

Some participants had multiple visits, so [1], [3], and [4] had audio and transcriptions for 223 control interviews and 234 AD interviews (these numbers differ slightly between them due to pre-processing, I expect).

[1] points out that the picture description task ensures the vocabulary and speech elicited is controlled around a context, but [2] wanted to investigate a different type of speech.

Instead of narrative or picture description speech, [2] used spontaneous conversational data from the Carolina Conversations Collection (CCC) to create their models.

The corpus contains 21 interviews with patients with AD and 17 dialogues with control patients. These control patients suffered from other conditions such as diabetes, heart problems, etc… None of them had any neuropsychological conditions, however.

The automatic detection of AD developed by [2] was the first use of low-level dialogue interaction data as a basis for AD detection on spontaneous spoken language.

Models

If I’m to build a more natural conversational system, then I must be aware of the noticeable differences in speech between those with dementia and healthy controls. What features inform the models in these papers the most should indicate exactly that.

[1] extracted features that are known to be impacted by AD (I run through the exact features in the next section, as that’s my primary interest). They collected many transcription-based features and acoustic features before using principal component analysis (PCA) to reduce the total number of features to train with. Using the selected features, they trained KNN and SVM classifiers, achieving an F1 score of 0.73 and, importantly, a recall of 0.83, as false negatives could be dangerous.
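As a rough illustration of that dimensionality-reduction step (this is not [1]’s actual pipeline or data), PCA can be sketched in plain NumPy: centre the features, eigendecompose the covariance matrix, and project onto the top components:

```python
import numpy as np

# Minimal PCA sketch on random, purely illustrative data:
# 100 "interviews" x 20 extracted features, reduced to 5 components.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))

Xc = X - X.mean(axis=0)                  # centre each feature
cov = np.cov(Xc, rowvar=False)           # 20 x 20 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: covariance is symmetric
order = np.argsort(eigvals)[::-1]        # sort components by variance explained
components = eigvecs[:, order[:5]]       # keep the top 5 components

X_reduced = Xc @ components              # 100 x 5 matrix for the classifier
print(X_reduced.shape)                   # (100, 5)
```

The reduced matrix is what would then be fed to the KNN or SVM in place of the full feature set.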

[2] decided to rely only on content-free features, including speech rate, turn-taking patterns, and other parameters. They found that when they used speech rate and turn-taking patterns to train a Real AdaBoost algorithm, they achieved an accuracy of 86.5%, and adding more features reduced the number of false positives. They found that other models performed comparably well, but even though Real AdaBoost and decision trees achieved an accuracy of 86.5%, they say there’s still room for improvement.

One point to highlight about [2] is their high accuracy (comparable to the state-of-the-art) despite relying only on content-free features. Their model can therefore be used globally, as the features are not language-dependent like the more complex lexical, syntactic, and semantic features used by other models.
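A minimal sketch of this idea, with synthetic data and made-up feature distributions rather than the CCC corpus or the paper’s exact setup, using scikit-learn’s AdaBoost on just two content-free features:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Synthetic "control" vs "AD" samples described by two content-free features:
# speech rate (words/sec) and mean gap before taking a turn (sec).
# The distributions are invented for illustration only.
rng = np.random.default_rng(1)
n = 100
control = np.column_stack([rng.normal(2.5, 0.3, n), rng.normal(0.4, 0.1, n)])
ad      = np.column_stack([rng.normal(1.6, 0.3, n), rng.normal(0.9, 0.2, n)])
X = np.vstack([control, ad])
y = np.array([0] * n + [1] * n)

clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.score(X, y))  # training accuracy on this cleanly separated toy data
```

Because neither feature depends on what was said, the same pipeline would work on transcripts or audio in any language, which is the point [2] emphasises.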

source

[3] ran feature extraction, feature selection, and then classification. There were many syntactic, semantic, and pragmatic features transcribed in the corpus. They tried three feature selection methods, namely: Information Gain, KNN, and SVM Recursive Feature Elimination. This feature selection step is particularly interesting for my project. Using the features selected by the KNN, their most accurate model was an SVM that achieved precision of 79%.
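The first of those selection methods can be sketched in a few lines. Information Gain scores a feature by how much knowing its value reduces uncertainty about the label, IG(Y; X) = H(Y) − H(Y|X); the toy data below is invented for illustration and is not from the corpus:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(Y; X) = H(Y) - H(Y | X) for one discrete feature."""
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

labels = ["AD", "AD", "AD", "control", "control", "control"]
# A feature that tracks the label perfectly vs. one that tells us nothing:
repetitions   = ["high", "high", "high", "low", "low", "low"]
uninformative = ["a", "b", "a", "a", "b", "a"]

print(information_gain(repetitions, labels))    # 1.0 bit — maximally predictive
print(information_gain(uninformative, labels))  # ≈ 0.0 bits — useless
```

Ranking all features by this score and keeping the top eight gives the Information Gain column of [3]’s selection table.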

[4] takes a completely different (and more interesting) approach from the other papers: they build a Consensus Network (CN).

As [4] uses the same corpus as [1] and [3], there’s a point at which the only two ways to improve upon previous classifiers are to either add more data or calculate more features. Of course, both of those options have limits, so this is why [4] takes a novel approach.

They first split the extracted features into non-overlapping subsets and found that the three naturally occurring groups (acoustic, syntactic, and semantic) garnered the best results.

The 185 acoustic features, 117 syntactic features, and 31 semantic features (plus 80 POS features that were mainly semantic) were used to train three separate neural networks called “ePhysicians”:

[4]

Each ePhysician is a fully connected network with ten hidden layers, Leaky ReLU activations, and batch normalization. The classifier and the discriminator shared this architecture but had no hidden layers.

The output of each ePhysician was passed one-by-one into the discriminator (with noise), which then tried to tell the ePhysicians apart. This encourages the ePhysicians to output representations that are indistinguishable from one another, i.e., to agree.

[4] indeed found that their CN, with the three naturally occurring and non-overlapping subsets of features, outperformed other models with a macro F1 of 0.7998. Additionally, [4] showed that the inclusion of noise and cooperative optimization did contribute to the performance.

In case of confusion, it’s important to reiterate that [2] used a different corpus.

Each paper, especially [4], describes their model in more detail, of course. I’m not primarily interested in the models themselves, as I don’t intend to diagnose dementia. My main focus in this article is to find out which features were used to train these models, as I’ll have to pay attention to the same features.

Important Language Features

In order for a conversational system to perform more naturally for those with cognitive impairments, how language changes must be investigated.

[4] sent all features to their ePhysicians, so they didn’t detail which features were most predictive. They did mention that pronoun-noun ratios were known to change, as those with cognitive impairments use more pronouns than nouns.
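The pronoun-noun ratio itself is trivial to compute once tokens are POS-tagged. The tags below are hand-written for illustration; a real pipeline would use an automatic tagger such as spaCy or NLTK:

```python
# Toy sketch of the pronoun-noun ratio on two hand-tagged utterances.

def pronoun_noun_ratio(tagged_tokens):
    """Ratio of pronoun tokens to noun tokens in a (word, tag) sequence."""
    pronouns = sum(1 for _, tag in tagged_tokens if tag == "PRON")
    nouns = sum(1 for _, tag in tagged_tokens if tag == "NOUN")
    return pronouns / nouns if nouns else float("inf")

# "The boy steals a cookie" vs. "He takes it from the jar" — the second
# leans on pronouns where the first names things.
specific = [("the", "DET"), ("boy", "NOUN"), ("steals", "VERB"),
            ("a", "DET"), ("cookie", "NOUN")]
vague = [("he", "PRON"), ("takes", "VERB"), ("it", "PRON"),
         ("from", "ADP"), ("the", "DET"), ("jar", "NOUN")]

print(pronoun_noun_ratio(specific))  # 0.0 — no pronouns
print(pronoun_noun_ratio(vague))     # 2.0 — two pronouns per noun
```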

[2] interestingly achieved great results using just a person’s speech rate and turn-taking patterns. They did obtain fewer false positives by adding other features but stuck to content-free features, as mentioned. This means that their model does not depend on a specific language and can therefore be used on a global scale.

[1] extracted features that are known to be impacted by AD and additionally noted that patients’ vocabulary and semantic processing had declined.

[1] listed the following transcription-based features:

  • Lexical Richness
  • Utterance Length
  • Frequency of Filler Words
  • Frequency of Pronouns
  • Frequency of Verbs
  • Frequency of Adjectives
  • Frequency of Proper Nouns

and [1] listed the following acoustic features:

  • Word Finding Errors
  • Fluidity
  • Rhythm of Speech
  • Pause Frequency
  • Duration
  • Speech Rate
  • Articulation Rate
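Two of these acoustic features, speech rate and pause frequency, can be sketched from word-level timestamps such as those an ASR system emits. The timings below are hypothetical:

```python
# Compute speech rate and pause count from hypothetical word-level
# (start_sec, end_sec) timestamps.

def speech_features(word_times, pause_threshold=0.25):
    """Return (words per second, number of inter-word gaps > threshold)."""
    duration = word_times[-1][1] - word_times[0][0]
    speech_rate = len(word_times) / duration
    gaps = [b[0] - a[1] for a, b in zip(word_times, word_times[1:])]
    pauses = sum(1 for g in gaps if g > pause_threshold)
    return speech_rate, pauses

# Hypothetical 4-word utterance with one long mid-utterance pause.
words = [(0.0, 0.4), (0.5, 0.9), (1.8, 2.2), (2.3, 2.7)]
rate, pauses = speech_features(words)
print(round(rate, 2))  # 1.48 words/sec over the 2.7 s utterance
print(pauses)          # 1 pause longer than 250 ms
```

Features like fluidity or word-finding errors need much more machinery, but these two fall straight out of the timestamps.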

Brilliantly, [3] performed several feature selection methods upon the following features:

[3]

Upon all of these features, they implemented three feature selection methods to select the top eight features each: Information Gain, KNN, and SVM Recursive Feature Elimination (SVM-RFE).

They output the following:

[3]

Three features were selected by all three methods, suggesting that they’re highly predictive for detecting AD: Word Errors, Number of Prepositions, and Number of Repetitions.
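Finding that three-way overlap is a simple set intersection. The three shared features below are the ones reported by [3]; the remaining entries in each top-eight list are placeholders, not the paper’s actual selections:

```python
# The three features [3] reports as chosen by all methods; "f*", "g*", "h*"
# are hypothetical stand-ins for the rest of each top-8 list.
shared = {"word errors", "prepositions", "repetitions"}
info_gain = shared | {"f1", "f2", "f3", "f4", "f5"}
knn       = shared | {"f2", "f4", "g1", "g2", "g3"}
svm_rfe   = shared | {"f1", "g1", "h1", "h2", "h3"}

print(sorted(info_gain & knn & svm_rfe))
# ['prepositions', 'repetitions', 'word errors']
```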

It’s also important to restate that the most accurate model used the features selected by the KNN method.

Overall, we have identified many features in this section to pay attention to. In particular (drawing from both the four papers and the Alzheimer’s Society videos), we need to pay closest attention to:

  • Word Errors
  • Repetition
  • Pronoun-Noun Ratio
  • Number of Prepositions
  • Speech Rate
  • Pause Frequency

Conclusion

We’ve previously looked into the current research towards making conversational systems more natural, and we now have a relatively short list of features that must be handled if conversational systems are to perform fluidly, even if the user has a cognitive impairment like AD.

Of course, this isn’t an exhaustive list, but it’s a good place to start and points me in the right direction for what to work on next. Stay tuned!




How Dementia Affects Conversation: Building a More Accessible Conversational AI was originally published in Heartbeat on Medium, where people are continuing the conversation by highlighting and responding to this story.

Diving into the Literature - What do we know?

source
We all (roughly) know how to naturally converse with one another. This is mostly subconscious and only really noticeable if an interaction skews from what most consider “normal.” In the majority of cases, these are just minor differences, such as someone speaking a little too close or interrupting more often than usual. However, more significant conversational differences can start to occur when parts of the brain begin to decline in performance.

Contents

Introduction Overview of Dementia Papers Covered Motivation - Why Speech? Datasets Models Important Language Features Conclusion

Introduction

I am currently working towards creating a more natural conversational agent (such as Siri, Alexa, etc…) for those with cognitive impairments that can potentially benefit the most from these systems. Currently we have to adapt how we speak to these systems and have to know exactly how to ask for certain functions. For those that struggle to adapt, I hope to lower some of these barriers so that these people can live more independently for longer. If you want to read more about the overall project then I discussed it in more detail in an interview here. To kick off this project with Wallscope and The Data Lab, I first investigated some of the research centered on recreating natural conversation with conversational agents. This research all related to a healthy population, but a question arose: do some of these phenomena vary when conversing with those that have forms of cognitive impairment? In my previous article, I covered two papers that discuss end-of-turn prediction. They created brilliant models to predict when someone has finished their turn to replace models that just wait for a duration of silence.
If someone with Dementia takes a little longer to remember the word(s) they’re looking for, the silence threshold models used in current systems will interrupt them. I suspect the research models would also perform worse than with a healthy population, so I’m collecting a corpus to investigate this.
As my ultimate aim is to make conversational agents more naturally usable for those with dementia, I’ll dive into some of the related research in this article.

Overview of Dementia

I am by no means a dementia expert so this information was all collected from an amazing series of videos by the Alzheimer’s Society.
Their Website
Dementia is not a disease but the name for a group of symptoms that commonly include problems with:
  • Memory
  • Thinking
  • Problem Solving
  • Language
  • Visual Perception
For people with dementia, these symptoms have progressed enough to affect daily life and are not a natural part of aging, as they’re caused by different diseases (I highlight some of them below). All of these diseases cause the loss of nerve cells, and this gets gradually worse over time, as these nerve cells cannot be replaced. As more and more cells die, the brain shrinks (atrophies) and symptoms sharpen. Which symptoms set in first depends on which part of the brain atrophies—so people are impacted differently.
source — you can see the black areas expanding as nerve cells die and atrophy progresses.
For example, if the occipital lobe begins to decline, then visual symptoms would progress, whereas losing the temporal lobe would cause language problems… Other common symptoms impact:
  • Day-to-day memory
  • Concentration
  • Organization
  • Planning
  • Language
  • Visual Perception
  • Mood
There is currently no cure…
Before moving on to cover recent research surrounding language problems, it’s important to not that most research is disease-specific. Therefore, I’lll briefly cover the four types of Dementia. All of this information again comes from the series of videos created by the Alzheimer’s Society.

Alzheimer's Disease

The most common type of dementia is Alzheimer’s Disease (AD), and for this reason, it’s also the most understood (you’ll notice this in the research). A healthy brain contains proteins (two of which are called amyloid and tau), but if the brain starts to function abnormally, these proteins form abnormal deposits called plaques and tangles.
source
These plaques and tangles damage nerve cells, which causes them to die and the brain to shrink, as shown above. The hippocampus is usually the first area of the brain to decline in performance when someone has AD. This is unfortunately where memories are formed, so people will often forget what they have just done and may therefore repeat themselves in conversation. Recent memories are lost first, whereas childhood memories can still be retrieved as they depend less on the hippocampus. Additionally, emotions can usually be recalled as the amygdala is still intact, whereas the facts surrounding those emotions can be lost. AD gradually progresses, so symptoms worsen and become more numerous slowly over time.

Vascular Dementia

The second most common type of dementia is vascular dementia, which is caused by problems with the brain’s blood supply. Nerve cells need oxygen and nutrients to survive, so without them they become damaged and die. Therefore, when blood supply is interrupted by a blockage or leak, significant damage can be caused. Like with AD, symptoms depend on which parts of the brain are impacted. When the parts damaged are responsible for memory, thinking, or language, the person will have problems remembering, thinking or speaking.
source
Vascular dementia can be caused by strokes. Sometimes one major stroke can cause it, but in other cases a person may suffer from multiple smaller strokes that gradually cause damage. The most common cause of vascular dementia is small-vessel disease, which gradually narrows the vessels in the brain. As the narrowing continues and spreads, more of the brain gets damaged. Vascular dementia can therefore have a gradual progression like AD or, if caused by strokes, a step-like progression with symptoms worsening after each stroke.

Dementia with Lewy Bodies

Closely related to AD, but less common, is a type of dementia called dementia with Lewy bodies. Lewy bodies are tiny clumps of protein that develop inside nerve cells in the brain. This prevents communication between cells, which causes them to die.
source
Researchers have not yet identified why Lewy bodies form or how. We do know, however, that they can form in any part of the brain, which, again, leads to varying symptoms. People can have problems with concentration, movement, alertness, and can even have visual hallucinations. These hallucinations are often distressing and lead to sleep problems. Dementia with Lewy bodies progresses gradually and spreads as more nerve cells get damaged, so memory is always impacted eventually.

Frontotemporal dementia

The last type of dementia I’ll cover is frontotemporal dementia (FTD), which is a range of conditions in which cells in the frontal and temporal lobes of the brain are damaged.
source
FTD is again a less common type of dementia but is surprisingly more likely to effect younger people (below 65). The frontal and temporal lobes of the brain control behavior, emotion, and language, and symptoms occur in the opposite order depending on which lobe is impacted first. The frontal lobe is usually the first to decline in performance, so changes begin to show through a person’s personality, behavior, and inhibitions. Alternatively, when the temporal lobe is impacted first, a person will struggle with language. For example, they may struggle to find the right word. FTD is thought to occur when proteins such as tau build up in nerve cells, but unlike the other causes, this is likely hereditary. Eventually as FTD progresses, symptoms of frontal and temporal damage overlap, and both occur.

Papers Covered

That overview of dementia was fairly in depth, so we should now have a common foundation for this article and all subsequent articles. As we now know, difficulty with language is a common symptom of dementia, so in order to understand how it changes, I’ll cover four papers that investigate this. These include the following: [1]
A Speech Recognition Tool for Early Detection of Alzheimer’s Disease by Brianna Marlene Broderick, Si Long Tou and Emily Mower Provost
[2]
A Method for Analysis of Patient Speech in Dialogue for Dementia Detection by Saturnino Luz, Sofia de la Fuente and Pierre Albert
[3]
Speech Processing for Early Alzheimer Disease Diagnosis: Machine Learning Based Approach by Randa Ban Ammar and Yassine Ben Ayed
[4]
Detecting Cognitive Impairments by Agreeing on Interpretations of Linguistic Features by Zining Zhu, Jekaterina Novikova and Frank Rudzicz
Note: I will refer to each paper with their corresponding number from now on.

Motivation - Why Speech?

These four papers have a common motivation: to detect dementia in a more cost-effective and less intrusive manner. These papers tend to focus on Alzheimer’s Disease (AD) because, as [3] mentions, 60–80% of dementia cases are caused by AD. I would add that this is likely why AD features most in existing datasets, also.

Current Detection Methods

[1] points out that dementia is relatively difficult to diagnose as progression and symptoms vary widely. The diagnostic processes are therefore complex, and dementia often goes undiagnosed because of this. [2] explains that imaging (such as PET or MRI scans) and cerebrospinal fluid analysis can be used to detect AD very accurately, but these methods are expensive and extremely invasive. A lumbar puncture must be performed to collect cerebrospinal fluid, for example.
source - lumbar puncture aka “spinal tap”
[2] also points out that neuropsychological detection methods have been developed that can, to varying levels of accuracy, detect signs of AD. [1] adds that these often require repeat testing and are therefore time-consuming and cause additional stress and confusion to the patient. As mentioned above, [1] argues that dementia often goes undiagnosed because of these flaws. [2] agrees that it would be beneficial to detect AD pathology long before someone is actually diagnosed in order to implement secondary prevention.

Will Speech Analysis Help?

As repeatedly mentioned in the overview of dementia above, language is known to be impacted through various signs such as struggles with word-finding, understanding difficulties, and repetition. [3] points out that language relies heavily on memory, and for this reason, one of the earliest signs of AD may be in a person’s speech.
source
[2] reinforces this point by highlighting the fact that in order to communicate successfully, a person must be able to perform complex decision making, strategy planning, consequence foresight, and problem solving. These are all impaired as dementia progresses. Practically, [2] states that speech is easy to acquire and elicit, so they (along with [1], [3], and [4]) propose that speech could be used to diagnose Dementia in a cost-effective, non-invasive, and timely manner. To start investigating this, we need the data.

Datasets

As you can imagine, it isn’t easy to acquire suitable datasets to investigate this. For this reason [1], [3], and [4] used the same dataset from DementiaBank (a repository within TalkBank) called the Pitt Corpus. This corpus contains audio and transcriptions of people with AD and healthy elderly controls. To elicit speech, participants (both groups) were asked to describe the Cookie Theft stimulus photo:
source
Some participants had multiple visits, so [1], [3], and [4] had audio and transcriptions for 223 control interviews and 234 AD interviews (these numbers differ slightly between them due to pre-processing, I expect). [1] points out that the picture description task ensures the vocabulary and speech elicited is controlled around a context, but [2] wanted to investigate a different type of speech. Instead of narrative or picture description speech, [2] used spontaneous conversational data from the Carolina Conversations Collection (CCC) to create their models. The corpus contains 21 interviews with patients with AD and 17 dialogues with control patients. These control patients suffered from other conditions such as diabetes, heart problems, etc… None of them had any neuropsychological conditions, however. The automatic detection of AD developed by [2] was the first use of low-level dialogue interaction data as a basis for AD detection on spontaneous spoken language.

Models

If I’m to build a more natural conversational system, then I must be aware of the noticeable differences in speech between those with dementia and healthy controls. What features inform the models in these papers the most should indicate exactly that. [1] extracted features that are known to be impacted by AD (I run through the exact features in the next section, as that’s my primary interest). They collected many transcription-based features and acoustic features before using principal component analysis (PCA) to reduce the total number of features to train with. Using the selected features they trained a KNN & SVM to achieve an F1 of 0.73 and importantly, a recall of 0.83 as false negatives could be dangerous. [2] decided to only rely on content-free features including speech rate, turn-taking patterns, and other parameters. They found that when they used speech rate and turn-taking patterns to train a Real AdaBoost algorithm, they achieved an accuracy of 86.5%, and adding more features reduced the number of false positives. They found that other models performed comparably well, but even though Real AdaDoost and decision trees achieved an accuracy of 86.5%, they say there’s still room for improvement. One point to highlight about [2] is their high accuracy (comparable to the state-of-the-art) despite relying only on content-free features. Their model can therefore be used globally, as the features are not language-dependent like the more complex lexical, syntactic, and semantic features used by other models.
source
[3] ran feature extraction, feature selection, and then classification. There were many syntactic, semantic, and pragmatic features transcribed in the corpus. They tried three feature selection methods, namely: Information Gain, KNN, and SVM Recursive Feature Elimination. This feature selection step is particularly interesting for my project. Using the features selected by the KNN, their most accurate model was an SVM that achieved precision of 79%.
[4] introduces a completely different (and more interesting) approach than the other papers, as they build a Consensus Network (CN).
As [4] uses the same corpus as [1] and [3], there's a point at which the only two ways to improve upon previous classifiers are to either add more data or calculate more features. Of course, both of those options have limits, which is why [4] takes a novel approach. They first split the extracted features into non-overlapping subsets and found that the three naturally occurring groups (acoustic, syntactic, and semantic) garnered the best results. The 185 acoustic features, 117 syntactic features, and 31 semantic features (plus 80 POS features that were mainly semantic) were used to train three separate neural networks called "ePhysicians" [4].
Each ePhysician is a fully connected network with ten hidden layers, Leaky ReLU activations, and batch normalization. Both the classifier and discriminator were the same but without any hidden layers. The output of each ePhysician was passed one-by-one into the discriminator (with noise), which then tried to tell the ePhysicians apart. This encourages the ePhysicians to produce representations that are indistinguishable from one another, i.e. to agree. [4] indeed found that their CN, with the three naturally occurring and non-overlapping subsets of features, outperformed other models with a macro F1 of 0.7998. Additionally, [4] showed that the inclusion of noise and cooperative optimization did contribute to the performance.
In case of confusion, it’s important to reiterate that [2] used a different corpus.
Each paper, especially [4], describes their model in more detail, of course. I’m not primarily interested in the models themselves, as I don’t intend to diagnose dementia. My main focus in this article is to find out which features were used to train these models, as I’ll have to pay attention to the same features.

Important Language Features

In order for a conversational system to perform more naturally for those with cognitive impairments, how language changes must be investigated. [4] sent all features to their ePhysicians and so did not detail which features were most predictive. They did mention that pronoun-noun ratios were known to change, as those with cognitive impairments use more pronouns than nouns. [2] interestingly achieved great results using just a person's speech rate and turn-taking patterns. They did obtain fewer false positives by adding other features but stuck to content-free features, as mentioned. This means that their model does not depend on a specific language and can therefore be used on a global scale. [1] extracted features that are known to be impacted by AD and additionally noted that patients' vocabulary and semantic processing had declined. [1] listed the following transcription-based features:
  • Lexical Richness
  • Utterance Length
  • Frequency of Filler Words
  • Frequency of Pronouns
  • Frequency of Verbs
  • Frequency of Adjectives
  • Frequency of Proper Nouns
and [1] listed the following acoustic features:
  • Word Finding Errors
  • Fluidity
  • Rhythm of Speech
  • Pause Frequency
  • Duration
  • Speech Rate
  • Articulation Rate
Brilliantly, [3] performed several feature selection methods upon a wide range of features (the full feature table is given in [3]).
Upon all of these features, they implemented three feature selection methods, each selecting its top eight features: Information Gain, KNN, and SVM Recursive Feature Elimination (SVM-RFE).
Three features were selected by all three methods, suggesting that they're highly predictive for detecting AD: Word Errors, Number of Prepositions, and Number of Repetitions. It's also important to restate that the most accurate model used the features selected by the KNN method. Overall, this section has identified many features to pay attention to. Drawing on both the four papers and the Alzheimer's Society videos, however, we need to pay particular attention to:
  • Word Errors
  • Repetition
  • Pronoun-Noun Ratio
  • Number of Prepositions
  • Speech Rate
  • Pause Frequency
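The agreement between [3]'s three selection methods can be sketched as a simple set intersection. In this Python sketch, only the three consensus features are taken from the text above; the remaining entries in each top-eight list are illustrative placeholders, not [3]'s actual output.

```python
# Sketch: finding the features chosen by all three selection methods,
# as in [3]. Only the three consensus features are from the paper;
# the other entries in each top-eight set are made up for illustration.
info_gain = {"Word Errors", "Number of Prepositions", "Number of Repetitions",
             "Speech Rate", "Utterance Length", "Lexical Richness",
             "Pause Frequency", "Filler Words"}
knn = {"Word Errors", "Number of Prepositions", "Number of Repetitions",
       "Pronoun Frequency", "Articulation Rate", "Verb Frequency",
       "Adjective Frequency", "Duration"}
svm_rfe = {"Word Errors", "Number of Prepositions", "Number of Repetitions",
           "Rhythm of Speech", "Fluidity", "Proper Noun Frequency",
           "Speech Rate", "Pause Frequency"}

# Features appearing in every method's top eight.
consensus = info_gain & knn & svm_rfe
print(sorted(consensus))
```

Features surviving all three cuts are the ones we can be most confident are genuinely predictive.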

Conclusion

We've previously looked into the current research towards making conversational systems more natural, and we now have a relatively short list of features that must be handled if conversational systems are to perform fluidly, even if the user has a cognitive impairment like AD. Of course, this isn't an exhaustive list, but it's a good place to start and points me in the right direction for what to work on next. Stay tuned!

Editor's Note: Join Heartbeat on Slack and follow us on Twitter and LinkedIn for all the latest content, news, and more in machine learning, mobile development, and where the two intersect.
How Dementia Affects Conversation: Building a More Accessible Conversational AI was originally published in Heartbeat on Medium, where people are continuing the conversation by highlighting and responding to this story.

Research Software Engineer Post


We are advertising for a Research Software Engineer to come work on

  1. Identifier linking and Scientific Lenses in the context of the FAIRplus project, and
  2. Data validation and markup support in the context of Bioschemas.

The post is for 12 months. Further details available on the Heriot-Watt Vacancies Site.

Constructing More Advanced SPARQL Queries


CONSTRUCT queries, VALUES and more property paths.

It was (quite rightly) pointed out that I strangely did not cover CONSTRUCT queries in my previous tutorial on Constructing SPARQL Queries. Additionally, I then went on to use CONSTRUCT queries in both my Transforming Tabular Data into Linked Data tutorial and the Linked Data Reconciliation article.

So, to finally correct this - I will cover them here!

Contents

SELECT vs CONSTRUCT
First Basic Example
- VALUES
- Alternative Property Paths
Second Basic Example
Example From the Reconciliation Article
Example From the Benchmark (Sneak Preview)

SELECT vs CONSTRUCT

In my last tutorial, I basically ran through SELECT queries from the most basic to some more complex. So what’s the difference?

With selects we are trying to match patterns in the knowledge graph to return results. With constructs we are specifying and building a new graph to return.

In the two tutorials linked (in the intro) I was constructing graphs from tabular data to then insert into a triplestore. I will discuss sections of these later but you should be able to follow the full queries after going through this tutorial.

We usually use CONSTRUCT queries at Wallscope to build a graph for the front-end team. Essentially, we create a portable sub-graph that contains all of the information needed to build a section of an application. Then instead of many select queries to the full database, these queries can be run over the much smaller sub-graph returned by the construct query.

First Basic Example

For this first example I will be querying my Superhero dataset that you can download here.

Each superhero entity in this dataset is connected to their height with the predicate dbo:height.

We can match each hero and their height using this basic SELECT query:

PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?hero ?height
WHERE {
?hero dbo:height ?height .
}

Now let's modify this query slightly into a CONSTRUCT that is almost the same:

PREFIX dbo: <http://dbpedia.org/ontology/>
CONSTRUCT {
?hero dbo:height ?height
} WHERE {
?hero dbo:height ?height .
}

As you can see, this returns the same information but in the form: subject, predicate, object.

This is obviously trivial and not entirely useful but we can play with this graph in the construct with only one condition:

All variables in the CONSTRUCT must be in the WHERE clause.

Basically, like in a SELECT query, the WHERE clause matches patterns in the knowledge graph and returns any variables. The difference with a CONSTRUCT is that these variables are then used to build the graph described in the CONSTRUCT clause.

Hopefully that is clear, but it makes more sense if we change the graph description.

For example, if we decided that we wanted to use schema.org's vocabulary instead of DBpedia's ontology, we could switch to it in the first clause:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX schema: <http://schema.org/>
CONSTRUCT {
?hero schema:height ?height
} WHERE {
?hero dbo:height ?height .
}

This then returns the superheroes attached to their heights with the schema:height predicate as the variables are matched in the WHERE clause and then recombined in the CONSTRUCT clause.

This simple predicate switching is not entirely useful on its own (unless you really need to switch ontology for some reason) but is a good first step to understanding this type of query.
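For intuition, the match-then-build behaviour can be mimicked in a few lines of plain Python, treating a graph as a set of (subject, predicate, object) tuples. This is only a toy with made-up hero data, not how a triplestore works internally:

```python
# Toy illustration of the CONSTRUCT above: the WHERE clause matches
# triples in the source graph, and the CONSTRUCT clause emits new
# triples built from the matched variables.
DBO_HEIGHT = "http://dbpedia.org/ontology/height"
SCHEMA_HEIGHT = "http://schema.org/height"

source_graph = {
    ("hero:Superman", DBO_HEIGHT, "1.91"),
    ("hero:Batman", DBO_HEIGHT, "1.88"),
    ("hero:Batman", "rdfs:label", "Batman"),  # not matched by the pattern
}

# WHERE { ?hero dbo:height ?height . } -> variable bindings
bindings = [(s, o) for (s, p, o) in source_graph if p == DBO_HEIGHT]

# CONSTRUCT { ?hero schema:height ?height } -> a new graph
constructed = {(hero, SCHEMA_HEIGHT, height) for hero, height in bindings}

for triple in sorted(constructed):
    print(triple)
```

A SELECT would stop at `bindings`; the CONSTRUCT goes one step further and reassembles the bindings into triples.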

To create some more useful CONSTRUCT queries, I’ll first go through VALUES and another type of property path.

VALUES

I’m sure there are many use-cases in which the VALUES clause is incredibly useful but I can’t say that I use it often. Essentially, it allows data to be provided within the query.

If you are searching for a particular sport in a dataset, for example, you could match all entities that are sports and then filter the results for it. This gets more complex, however, if you are looking for a few particular sports; in that case you may want to provide those sports within the query itself.

With VALUES you can constrain your query by creating a variable (or multiple variables) and assigning it some data.

I tend to use this with federated queries to grab data (usually for insertion into my database) about a few particular entities.

Let’s go through a practical example of this:

PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?country ?pop
WHERE {
VALUES ?country {
dbr:Scotland
dbr:England
dbr:Wales
dbr:Northern_Ireland
dbr:Ireland
}
}

In this example I am interested in the five largest countries in the British Isles so that I can compare their populations. (I'm from Scotland and still had to double-check which countries these were, so I imagine others may find this useful also.)

I am using DBpedia for this example, so I have assigned the five country entities to the variable ?country and selected them to be returned.

You'd think it would therefore be easy enough to grab the corresponding populations. I add the SERVICE clause to make this a federated query (covered previously). This just sends the countries defined within the query to DBpedia and returns their corresponding populations.

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbp: <http://dbpedia.org/property/>
SELECT ?country ?pop
WHERE {
VALUES ?country {
dbr:Scotland
dbr:England
dbr:Wales
dbr:Northern_Ireland
dbr:Ireland
}

SERVICE <http://dbpedia.org/sparql> {
?country dbp:populationCensus ?pop .
}
}

Run this query, however, and you will notice that Ireland is missing from the results! You will often find this kind of problem with linked open data: the structure is not always consistent throughout.

To find Ireland’s population we need to switch the predicate from dbp:populationCensus to dbo:populationTotal like so:

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?country ?pop
WHERE {
VALUES ?country {
dbr:Scotland
dbr:England
dbr:Wales
dbr:Northern_Ireland
dbr:Ireland
}

SERVICE <http://dbpedia.org/sparql> {
?country dbo:populationTotal ?pop .
}
}

which returns Ireland alongside its population… but none of the others.

This is of course a problem but before we can construct a solution, let’s run through alternate property paths.

Alternative Property Paths

In my last SPARQL tutorial we covered sequential property paths which (once the benchmark query templates come out) you may notice I am a big fan of.

Another type of property path that I use fairly often is called the Alternative Property Path and is made use of with the pipe (|) character.

If we look back at the problem above in the VALUES section, we can get some populations with one predicate and the rest with another. The alternate property path allows us to match patterns with either! For example, if we modify the population query above we get:

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?country ?pop
WHERE {
VALUES ?country {
dbr:Scotland
dbr:England
dbr:Wales
dbr:Northern_Ireland
dbr:Ireland
}

SERVICE <http://dbpedia.org/sparql> {
?country dbp:populationCensus | dbo:populationTotal ?pop .
}
}

This is such a simple change but so powerful, as we now return every country alongside its population with one relatively basic query.

This SELECT is great if we are just looking to find some results but what if we want to store this data in our knowledge graph?

Second Basic Example

It would be a hassle to have to use this alternative property path every time we want to work with country populations. In addition, if users were not aware of this inconsistency, they could find and report incorrect results.

This is why we CONSTRUCT the result graph we want without the inconsistencies. In this case I have chosen dbo:populationTotal as I simply prefer it and use that to connect countries and their populations:

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>
CONSTRUCT {
?country dbo:populationTotal ?pop
} WHERE {
VALUES ?country {
dbr:Scotland
dbr:England
dbr:Wales
dbr:Northern_Ireland
dbr:Ireland
}

SERVICE <http://dbpedia.org/sparql> {
?country dbp:populationCensus | dbo:populationTotal ?pop .
}
}

This query returns the countries and their populations like we saw in the previous section but then connects each country to their population with dbo:populationTotal as described in the CONSTRUCT clause. This returns consistent triples.

This is useful if we wish to store this data, as its consistency will help avoid the problems mentioned above. I used this technique in one of my previous articles, so let's take a look.
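The normalisation the CONSTRUCT performs can be sketched in plain Python, again treating a graph as tuples: match a population via either predicate (the alternative property path) and always store it under dbo:populationTotal. The population figures below are placeholders, not DBpedia's actual values.

```python
# Sketch of normalising inconsistent predicates into one, as the
# CONSTRUCT above does. Population figures are made up.
DBP_CENSUS = "dbp:populationCensus"
DBO_TOTAL = "dbo:populationTotal"

remote_triples = [
    ("dbr:Scotland", DBP_CENSUS, 5295000),
    ("dbr:England", DBP_CENSUS, 53012456),
    ("dbr:Ireland", DBO_TOTAL, 4761865),  # the inconsistent entry
]

# ?country dbp:populationCensus | dbo:populationTotal ?pop
matches = [(s, o) for (s, p, o) in remote_triples if p in (DBP_CENSUS, DBO_TOTAL)]

# CONSTRUCT { ?country dbo:populationTotal ?pop }
consistent = {(country, DBO_TOTAL, pop) for country, pop in matches}
for t in sorted(consistent):
    print(t)
```

Every stored triple now uses the same predicate, so downstream queries never need the alternative property path again.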

Example From the Reconciliation Article

This example is copied directly from my data reconciliation tutorial here. In that article I discuss this query in a lot more detail.

In brief, what I was doing here was grabbing car manufacturer names from tabular data and enhancing that information to store and analyse.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
CONSTRUCT {
?car rdfs:label ?taggedname ;
rdf:type dbo:Company ;
dbo:location ?location .

?location rdf:type dbo:Country ;
rdfs:label ?lname ;
dbp:populationCensus ?pop .
} WHERE {
?c <urn:col:carNames> ?cname .

BIND(STRLANG(?cname, "en") AS ?taggedname)

SERVICE <https://dbpedia.org/sparql> {

?car rdfs:label ?taggedname ;
dbo:location | dbo:locationCountry ?location .

?location rdf:type dbo:Country ;
rdfs:label ?lname ;
dbp:populationCensus | dbo:populationTotal ?pop .

FILTER (LANG(?lname) = "en")
}
}

There is little point repeating myself here, so if you're interested, please take a look. What I am trying to show here is that I have used both the alternative property path (twice!) and the CONSTRUCT clause previously in an example use-case.

Construct queries are perfectly suited to ensuring any data you store is well typed, well structured and, importantly, consistent.

I have been short on time since starting my new project but I am still working on the benchmark in development.

Example From The Benchmark (Sneak Preview)

The benchmark repository is not yet public as I don’t want opinions to be formed before it is fleshed out a little more.

I thought it would be good however to give a real (not made for a tutorial) example query that uses what this article teaches:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX schema: <http://schema.org/>
PREFIX dbp: <http://dbpedia.org/property/>
INSERT {
?city dbo:populationTotal ?pop
} WHERE {
{
SELECT ?city (MAX(?apop) AS ?pop) {
?user schema:location ?city .

SERVICE <https://dbpedia.org/sparql> {
?city dbo:populationTotal | dbp:populationCensus ?apop .
}
}
GROUP BY ?city
}
}

You will notice that this does not contain the CONSTRUCT clause but INSERT instead. You will see me make this switch in both of the articles I linked in the introduction. Basically, this does nothing too different: the graph that is constructed is inserted into your knowledge graph instead of just returned. The same can be done with the DELETE clause to remove patterns from your knowledge graph.

This query is very similar to the examples throughout this article (by design, of course) but grabs the populations of users' locations from DBpedia and inserts them into the graph. This is just one point within the query cycle at which the graph changes structure in the benchmark.

Finally, the MAX population is grabbed because some countries in DBpedia have two different populations attached to them…
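The GROUP BY ?city with MAX(?apop) in the sub-select is doing ordinary duplicate resolution. A plain-Python equivalent (with invented population figures) looks like this:

```python
# Sketch of the sub-select's GROUP BY ?city / MAX(?apop): when DBpedia
# returns two population values for the same place, keep only the
# largest so a single triple gets inserted. Values are made up.
from collections import defaultdict

raw_results = [  # (?city, ?apop) rows, with a duplicate for Edinburgh
    ("dbr:Edinburgh", 488050),
    ("dbr:Edinburgh", 482005),
    ("dbr:Leipzig", 587857),
]

max_pop = defaultdict(int)
for city, pop in raw_results:
    max_pop[city] = max(max_pop[city], pop)

for city, pop in sorted(max_pop.items()):
    print(city, pop)  # one (city, MAX pop) pair per city, as INSERTed
```

Without the aggregation, both Edinburgh rows would be inserted and the graph would inherit DBpedia's inconsistency.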

Conclusion

Hopefully this is useful for some of you! We have covered why and how to use construct queries along with values and alternative property paths.

At the end of May I am going to the DBpedia community meeting in Leipzig so my next linked data article will likely cover things I learned at that event or progress on the benchmark development.

In the meantime I will be releasing my next Computer Vision article and another dive into natural conversation.


Constructing More Advanced SPARQL Queries was originally published in Wallscope on Medium, where people are continuing the conversation by highlighting and responding to this story.

Beginning to Replicate Natural Conversation in Real Time

A first step into the literature

To start my new project, the first thing I of course have to do is run through the current research and state of the art models.

I was recently interviewed about this new project, but in short (extremely short): I aim to take steps towards making conversational agents more natural to talk with.

I have by no means exhausted the literature in this field; I have barely scratched the surface (link relevant papers below if you know of any I must read). Here is an overview of some of this research and the journey towards more natural conversational agents. In this I will refer to the following papers:

[1]

Investigating Speech Features for Continuous Turn-Taking Prediction Using LSTMs by Matthew Roddy, Gabriel Skantze and Naomi Harte

[2]

Detection of social signals for recognizing engagement in human-robot interaction by Divesh Lala, Koji Inoue, Pierrick Milhorat and Tatsuya Kawahara

[3]

Investigating fluidity for human-robot interaction with real-time, real-world grounding strategies by Julian Hough and David Schlangen

[4]

Towards Deep End-of-Turn Prediction for Situated Spoken Dialogue Systems by Angelika Maier, Julian Hough and David Schlangen

[5]

Coordination in Spoken Human-Robot Interaction by Gabriel Skantze (Lecture Presentation in Glasgow 07/03/2019)

Contents

Introduction
Turn Taking – End of Turn Prediction
Engagement
Embodiment
Fluid Incremental Grounding Strategies
Conclusion

Introduction

If we think of two humans having a fluid conversation, it is very different from conversations between humans and Siri, Google Assistant, Alexa or Cortana.


One reason for this loss of flow is the number of large pauses. For a conversational agent (CA) to detect that you have finished what you are saying (finished your turn), it waits for a duration of silence. If it detects a long pause, it assumes you have finished your turn and then processes your utterance.

This set duration of silence varies slightly between systems. If it is set too low, the CA will interrupt you mid-turn, as human dialogue is littered with pauses. If it is set too high, the system will be more accurate at detecting your end-of-turn but will take painfully long to respond – killing the flow of the conversation and frustrating the user [4].
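This trade-off is easy to see in a sketch of such a silence-threshold detector. The frame representation and timings below are invented for illustration:

```python
# A minimal sketch of silence-threshold end-of-turn detection: the
# agent declares EOT once silence exceeds a fixed threshold. Frames
# are 10ms each; the example utterance is made up.
def detect_eot(voice_activity, threshold_frames):
    """voice_activity: per-frame booleans (True = speech).
    Returns the frame at which EOT is declared, or None."""
    silent = 0
    for i, speaking in enumerate(voice_activity):
        silent = 0 if speaking else silent + 1
        if silent >= threshold_frames:
            return i
    return None

# Speech, a 300ms mid-turn pause, more speech, then the real end of turn.
frames = [True] * 50 + [False] * 30 + [True] * 40 + [False] * 100

print(detect_eot(frames, 20))  # 200ms threshold: cuts in during the pause
print(detect_eot(frames, 50))  # 500ms threshold: waits for the real EOT
```

The low threshold interrupts the speaker mid-turn; the high threshold is accurate but adds half a second of dead air to every response.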

When two humans speak, we tend to minimise the gap between turns in the conversation, and this is cross-cultural. Across the globe, the gap between turns is around 200ms, which is close to the limit of human response time [1]. We must therefore predict the speaker's end-of-turn (EOT) while listening to them speak.

Turn Taking – End of Turn Prediction

To recreate this fluid dialogue with fast turn-switches and slight overlap in CAs, we must first understand how we do it ourselves.

Shameless, but slightly related, self-promotion. In order to work on Computer Vision, we must first understand Human Vision

We subconsciously interpret turn-taking cues to detect when it is our turn to speak, so what cues do we use? Similarly, we do this continuously while listening to someone speak, so can we recreate this incremental processing?

[4] used both acoustic and linguistic features to train an LSTM to tag 10ms windows. Their system is tasked with labelling these windows as either speech, mid-turn pause (MTP) or EOT, but the main focus of course is the first point in a sequence that is labelled as EOT.

The acoustic features used in the LSTM were: raw pitch, smoothed F0, root mean squared signal energy, logarithmised signal energy, intensity, loudness and derived features for each frame.

In addition to these acoustic features, linguistic features consisted of the words and an approximation of the incremental syntactic probability called the weighted mean log trigram probability (WML).

Many other signals that indicate whether the speaker is going to continue speaking or has finished their turn have been identified in [5].

As mentioned, a wait time of 10 seconds for a response is just as irritating as the CA constantly cutting in mid-turn [4]. Multiple silence-threshold baselines between 50ms and 6000ms were therefore considered to ensure multiple trade-offs were represented.

Apart from one case (linguistic features only, 500ms silence threshold), every single model beat the baselines. Using only the linguistic or acoustic features didn't make much of a difference, but performance was always best when the model used both sets of features together. The best overall system had a latency of 1195ms and a cut-in rate of just 18% [4].


[1] states that we predict EOT from multi-modal signals including: prosody, semantics, syntax, gesture and eye-gaze.

Instead of labelling 10ms windows (as speech, MTPs or EOTs), traditional models predict whether a speaker will continue speaking (HOLD) or has finished their turn (SHIFT), but only do this when they detect a pause. One major problem with this traditional approach is that backchannels are neither a HOLD nor a SHIFT, but one of the two is predicted anyway.

LSTMs have been used to make predictions continuously at 50 ms intervals and these models outperform traditional EOT models and even humans when applied to HOLD/SHIFT predictions. Their hidden layers allow them to learn long range dependencies but it is unknown exactly which features influence the performance the most.

In [1], the new system completes three different turn-taking prediction tasks: (1) prediction at pauses, (2) prediction at onset and (3) prediction at overlap.

Prediction at Pauses is the standard prediction that takes place at brief pauses in the interaction to predict whether there will be a HOLD or SHIFT. Essentially, when there is a pause above a threshold time, the person with the highest average output probability (score) is predicted to speak next. This classification model is evaluated with weighted F-scores.

Prediction at Onsets classifies the utterances during speech, not at a pause. This model is slightly different, however, as it predicts whether the currently ongoing utterance will be short or long. As this is also a classifier, it was again evaluated using weighted F-scores.

Prediction at Overlap is introduced for the first time in this paper. This is essentially a HOLD/SHIFT prediction again, but made when an overlapping period of at least 100ms occurs. The decision to HOLD (continue speaking) is predicted when the overlap is a backchannel, and SHIFT when the system should stop speaking. This again was evaluated using weighted F-scores.
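All three tasks are evaluated with weighted F-scores. For reference, here is a plain-Python version of that metric (a support-weighted average of per-class F1 scores); this is my own sketch, not code from [1]:

```python
# Weighted F-score: per-class F1, averaged with each class weighted by
# its number of true instances. The example labels are made up.
from collections import Counter

def weighted_f1(y_true, y_pred):
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(t == p == cls for t, p in zip(y_true, y_pred))
        pred_pos = sum(p == cls for p in y_pred)
        precision = tp / pred_pos if pred_pos else 0.0
        recall = tp / n
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += (n / len(y_true)) * f1  # weight by class support
    return total

y_true = ["HOLD", "HOLD", "SHIFT", "SHIFT", "HOLD", "SHIFT"]
y_pred = ["HOLD", "SHIFT", "SHIFT", "SHIFT", "HOLD", "HOLD"]
print(round(weighted_f1(y_true, y_pred), 3))  # 0.667
```

The support weighting matters here because HOLDs vastly outnumber SHIFTs in real dialogue, so an unweighted average would be misleading.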


As mentioned above, we don’t know exactly which features we use to predict when it is our turn to speak. [1] used many features in different arrangements to distinguish which are most useful. The features used were as follows:

Acoustic features are low level descriptors that include loudness, shimmer, pitch, jitter, spectral flux and MFCCs. These were extracted using the OpenSmile toolkit.

Linguistic features were investigated at two levels: part-of-speech (POS) and word. Literature often suggests that POS tags are good at predicting turn-switches, but POS tags must be extracted from words (from an ASR system), so it is useful to check whether this extra processing is needed.

Using words instead of POS would be a great advantage for systems that need to run in real time.

Phonetic features were output from a deep neural network (DNN) that classifies senones.

Voice activity was included in their transcriptions, so it was also used as a feature.

So what features were the most useful for EOT prediction according to [1]?

Acoustic features were great for EOT prediction; all but one experiment's best results included acoustic features. This was particularly the case for prediction at overlap.

Words mostly outperformed POS tags, apart from prediction at onset, so use POS tags if you want to predict utterance length (as with backchannels).

In all cases, including voice activity improved performance.

In terms of acoustic features, the most important features were loudness, F0, low order MFCCs and spectral slope features.

Overall, the best performance was obtained by using voice activity, acoustic features and words.

As mentioned, the fact that using words instead of POS tags leads to better performance is brilliant for faster processing. This of course is beneficial for real-time incremental prediction – just like what we humans do.

All of these features are not just used to detect when we can next speak but are even used to guide what we say. We will expand on what we are saying, skip details or change topic depending on how engaged the other person is with what we are saying.

Therefore to model natural human conversation, it is important for a CA to measure engagement.

Engagement

Engagement shows interest and attention in a conversation and, as we want users to stay engaged, it influences the dialogue strategy of the CA. This optimisation of the user experience all has to be done in real time to keep the conversation fluid.

[2] detects the following signals to measure engagement: nodding, laughter, verbal backchannels and eye gaze. The fact that these signals show attention and interest is relatively common sense, but they were learned from a large corpus of human-robot interactions.


[2] doesn’t just focus on recognising social signals but also on creating an engagement recognition model.

This experiment was run in Japan, where nodding is particularly common. Seven features were extracted to detect nodding: per frame, the yaw, roll and pitch of the person's head; and per 15 frames, the average speed, average velocity, average acceleration and range of the person's head.
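The per-15-frame statistics can be sketched for a single angle track (here, pitch). The exact definitions are my assumptions, not [2]'s: speed as mean absolute frame-to-frame change, velocity as mean signed change, acceleration as mean change of change, and range as max minus min.

```python
# Sketch of per-window head-movement statistics for nod detection.
# Feature definitions and the pitch values are assumptions for illustration.
def window_features(angles):
    deltas = [b - a for a, b in zip(angles, angles[1:])]   # frame-to-frame change
    accels = [b - a for a, b in zip(deltas, deltas[1:])]   # change of change
    return {
        "avg_speed": sum(abs(d) for d in deltas) / len(deltas),
        "avg_velocity": sum(deltas) / len(deltas),
        "avg_accel": sum(accels) / len(accels),
        "range": max(angles) - min(angles),
    }

# 15 frames of pitch (degrees): a downward nod and return.
pitch = [0, 1, 3, 6, 9, 11, 12, 11, 9, 6, 3, 1, 0, 0, 0]
feats = window_features(pitch)
print(feats["range"])  # 12
```

A nod shows up as high speed and range with near-zero net velocity, since the head returns to where it started.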

Their LSTM model outperformed the other approaches to detect nodding across all metrics.

Smiling is often used to detect engagement but to avoid using a camera (they use microphones + Kinect) laughter is detected instead. Each model was tasked to classify whether an inter-pausal unit (IPU) of sound contained laughter or not. Using both prosody and linguistic features to train a two layer DNN performed the best but using other spectral features instead of linguistic features (not necessarily available from the ASR) could be used to improve the model.

Similarly to nodding, verbal backchannels are more frequent in Japan (called aizuchi). Additionally in Japan, verbal backchannels are often accompanied by head movements but only the sound was provided to the model. Similar to the laughter detection, this model classifies whether an IPU is a backchannel or the person is starting their turn (especially difficult when barging in). The best performing model was found to be a random forest, with 56 estimators, using both prosody and linguistic features. The model still performed reasonably when given only prosodic features (again because linguistic features may not be available from the ASR).

Finally, eye gaze is commonly known as a clear sign of engagement. From the inter-annotator agreement, looking at Erica's head (the robot embodiment in this experiment) for 10 seconds continuously was considered engagement. Periods of less than 10 seconds were therefore negative cases.


The information from the Kinect sensor was used to calculate a vector from the user's head orientation, and the user was considered to be 'looking at Erica' if that vector collided with Erica's head (plus 30cm to accommodate error). This geometry-based model worked relatively well, but the position of Erica's head was estimated, so this will have affected results. It is expected that this model will improve significantly when exact values are known.
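The geometric check amounts to a ray-sphere test: cast a ray from the user's head along their gaze direction and see whether it passes within Erica's head radius plus the 30cm margin. All positions, the head radius, and the vector maths below are my assumptions for illustration:

```python
# Sketch of the gaze collision test. Positions are in metres and made up.
import math

def looking_at(head_pos, gaze_dir, target_pos, radius):
    """True if the gaze ray passes within `radius` of the target."""
    to_target = [t - h for t, h in zip(target_pos, head_pos)]
    norm = math.sqrt(sum(d * d for d in gaze_dir))
    unit = [d / norm for d in gaze_dir]
    t = sum(a * b for a, b in zip(to_target, unit))  # distance along the ray
    if t < 0:  # target is behind the user
        return False
    closest = [h + t * u for h, u in zip(head_pos, unit)]
    return math.dist(closest, target_pos) <= radius

ERICA_HEAD = (0.0, 1.2, 2.0)   # assumed position of Erica's head
RADIUS = 0.12 + 0.30           # assumed head radius + 30cm error margin

print(looking_at((0.0, 1.2, 0.0), (0.0, 0.0, 1.0), ERICA_HEAD, RADIUS))  # True
print(looking_at((0.0, 1.2, 0.0), (1.0, 0.0, 0.2), ERICA_HEAD, RADIUS))  # False
```

The generous margin reflects the paper's point: when the target position is only estimated, the collision volume has to absorb that error.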

This paper doesn’t aim to create the best individual systems but instead hypothesises that these models in conjunction will perform better than the individual models at detecting engagement.


The ensemble of the above models was used as a binary classifier (either a person was engaged or not). In particular, they built a hierarchical Bayesian binary classifier which judged whether the listener was engaged from the 16 possible combinations of outputs from the 4 models above.

From the annotators, a model was built to deduce which features are more or less important when detecting engagement. Some annotators found laughter to be a particularly important factor, for example, whereas others did not. They found that inputting a character variable with three different character types improved the model's performance.

Additionally, including the previous engagement of a listener also improved the model. This makes sense as someone that is not interested currently is more likely to stay uninterested during your next turn.

Measuring engagement can only really be done when a CA is embodied (eye contact with Siri is non-existent for example). Social robots are being increasingly used in areas such as Teaching, Public Spaces, Healthcare and Manufacturing. These can all contain spoken dialogue systems but why do they have to be embodied?


Embodiment

People will travel across the globe to have a face-to-face meeting when they could just phone [5]. We don’t like to interact without seeing the other person as we miss many of the signals that we talked about above. In today’s world we can also video-call but this is still avoided when possible for the same reasons. The difference between talking on the phone or face-to-face is similar to the difference between talking to Siri and an embodied dialogue system [5].

Current voice systems cannot show facial expressions, indicate attention through eye contact or move their lips. Lip reading is obviously very useful for those with impaired hearing but we all lip read during conversation (this is how we know what people are saying even in very noisy environments).

Not only can a face output these signals, it also allows the system to detect who is speaking, who is paying attention, who the actual people are (Rory, Jenny, etc…) and recognise their facial expressions.

Robot faces come in many forms however and some are better than others for use in conversation. Most robot faces, such as the face of Nao, are very static and therefore cannot show a wide range of emotion through expression like we do.


Some more abstract robot face depictions, such as Jibo, can show emotion using shapes and colour but some expressions must be learned.


We know how to read a human face so it makes sense to show a human face. Hyper-realistic robot faces exist but are a bit creepy, like Sophia, and are very expensive.

Sophia: source

They are very realistic but just not quite right which makes conversation very uncomfortable. To combat this, avatars have been made to have conversations on screen.

source

These can mimic humans relatively closely without being creepy as it’s not a physical robot. This is almost like Skype however and this method suffers from the ‘Mona-Lisa effect’. In multi-party dialogue, it is impossible for the avatar on screen to look at one person and not the other. Either the avatar is looking ‘out’ at all parties or away at no one.

Gabrial Skantze (presenter of [5] to be clear) is the Co-Founder of Furhat robotics and argues that Furhat is the best balance between all of these systems. Furhat has been developed to be used for conversational applications as a receptionist, social-trainer, therapist, interviewer, etc…

source

Furhat needs to know where it should be looking, when it should speak, what it should say and what facial expressions it should be displaying [5].

Finally (for now), once embodied – dialogues with a robot need to be grounded in real-time with the real-world. In [3] the example given is a CA embodied in an industrial machine which [5] states is becoming more and more common.

source

Fluid, Incremental Grounding Strategies

For a conversation to be natural, human-robot conversations must be grounded in a fluid manner [3].

With non-incremental grounding, users can give positive feedback and repair but only after the robot has shown full understanding of the request. If you ask a robot to move an object somewhere for example, you must wait until the object is moved before you can correct it with an utterance like “no, move the red one”. No overlapping speech is possible so actions must be reversed entirely if a repair is needed.

With incremental grounding, overlapping is still not possible but feedback can be given at more regular intervals. Instead of the entire task being completed before feedback can be given, feedback can be given at sub-task intervals. “no, move the red one” can be said just after the robot picks up a blue object, repairing quickly. In the previous example, the blue object was then placed in a given location before the repair could be given which resulted in a reversal of the whole task! This is much more efficient but still not fluid like in human-human interactions.

Fluid incremental grounding is possible if overlaps are processed. Allowing and reasoning over concurrent speech and action is much more natural. Continuing with our repair example, “no, move the red one” can be said as soon as the robot is about to pick up the blue object, no task has to be completed and reversed as concurrency is allowed. The pickup task can be aborted and the red object picked up fluidly as you say what to do with it.

[2]

To move towards this more fluid grounding, real-time processing needs to take place. Not only does the system need to process utterances word by word, real-time context needs to be monitored such as the robots current state and planned actions (both of which can change dynamically through the course of an utterance or word).

The robot must know when it has sufficiently shown what it is doing to handle both repairs and confirmations. The robot needs to know what the user is confirming and even more importantly, what is needing repaired.

Conclusion

In this brief overview, I have covered just a tiny amount of the current work towards more natural conversational systems.

Even if turn-taking prediction, engagement measuring, embodiment and fluid grounding were all perfected, CAs would not have conversations like we humans do. I plan to write more of these overviews over the next few years so look out for them if interested.

In the meantime, please do comment with discussion points, critique my understanding and suggest papers that I (and anyone reading this) may find interesting.


Beginning to Replicate Natural Conversation in Real Time was originally published in Wallscope on Medium, where people are continuing the conversation by highlighting and responding to this story.

A first step into the literature

To start my new project, the first thing I of course have to do is run through the current research and state of the art models.

I was recently interviewed, in which I explained this new project, but in short (extremely short): I aim to take steps towards making conversational agents more natural to talk with.

I have by no means exhausted the literature in this field; I have barely scratched the surface (link relevant papers below if you know of any I must read). Here is an overview of some of this research and the journey towards more natural conversational agents. In this I will refer to the following papers:

[1]

Investigating Speech Features for Continuous Turn-Taking Prediction Using LSTMs by Matthew Roddy, Gabriel Skantze and Naomi Harte

[2]

Detection of social signals for recognizing engagement in human-robot interaction by Divesh Lala, Koji Inoue, Pierrick Milhorat and Tatsuya Kawahara

[3]

Investigating fluidity for human-robot interaction with real-time, real-world grounding strategies by Julian Hough and David Schlangen

[4]

Towards Deep End-of-Turn Prediction for Situated Spoken Dialogue Systems by Angelika Maier, Julian Hough and David Schlangen

[5]

Coordination in Spoken Human-Robot Interaction by Gabriel Skantze (Lecture Presentation in Glasgow 07/03/2019)

Contents

Introduction
Turn Taking - End of Turn Prediction
Engagement
Embodiment
Fluid Incremental Grounding Strategies
Conclusion

Introduction

If we think of two humans having a fluid conversation, it is very different from conversations between humans and Siri, Google Assistant, Alexa or Cortana.

source

One reason for this loss of flow is the number of large pauses. For a conversational agent (CA) to detect that you have finished what you are saying (finished your turn), it waits for a duration of silence. If it detects a long pause, it assumes you have finished your turn and then processes your utterance.

This set duration of silence varies slightly between systems. If it is set too low, the CA will interrupt you mid-turn as human dialogue is littered with pauses. If it is set too high, the system will be more accurate at detecting your end-of-turn but the CA will take painfully long to respond - killing the flow of the conversation and frustrating the user [4].
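The silence-threshold mechanism described above can be sketched in a few lines. This is an illustrative toy (not the method of [4]): a detector consumes 10ms voice-activity frames and declares end-of-turn once an unbroken run of silence exceeds a threshold, showing how a low threshold cuts in at mid-turn pauses.

```python
# Illustrative sketch: a naive silence-threshold end-of-turn detector
# operating on 10ms voice-activity frames (True = speech, False = silence).
def detect_eot(voiced_frames, threshold_ms=700, frame_ms=10):
    """Return the index of the frame where end-of-turn is declared, or None."""
    silence_run = 0
    for i, voiced in enumerate(voiced_frames):
        if voiced:
            silence_run = 0  # speech resets the silence counter
        else:
            silence_run += frame_ms
            if silence_run >= threshold_ms:
                return i  # declare end-of-turn here
    return None

# Speech, a 300ms mid-turn pause, more speech, then a long silence.
frames = [True] * 50 + [False] * 30 + [True] * 50 + [False] * 100
print(detect_eot(frames, threshold_ms=700))  # 199: waits out the mid-turn pause
print(detect_eot(frames, threshold_ms=200))  # 69: cuts in during the pause
```

With a 200ms threshold the detector fires inside the mid-turn pause (an interruption); with 700ms it waits, at the cost of a slower response.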

When two humans speak, we tend to minimise the gap between turns in the conversation and this is cross-cultural. Across the globe, the gap between turns is around 200ms which is close to the limit of human response time [1]. We must therefore predict the speaker's end-of-turn (EOT) while listening to someone speak.

Turn Taking - End of Turn Prediction

To recreate this fluid dialogue with fast turn-switches and slight overlap in CAs, we must first understand how we do it ourselves.

Shameless, but slightly related, self-promotion. In order to work on Computer Vision, we must first understand Human Vision

We subconsciously interpret turn-taking cues to detect when it is our turn to speak, so what cues do we use? We also do this continuously while listening to someone speak, so can we recreate this incremental processing?

[4] used both acoustic and linguistic features to train an LSTM to tag 10ms windows. Their system is tasked to label these windows as either Speech, mid-turn pause (MTP) or EOT but the main focus of course is the first point in a sequence which is labelled as EOT.

The acoustic features used in the LSTM were: raw pitch, smoothed F0, root mean squared signal energy, logarithmised signal energy, intensity, loudness and derived features for each frame.

In addition to these acoustic features, linguistic features consisted of the words and an approximation of the incremental syntactic probability called the weighted mean log trigram probability (WML).

Many other signals have been identified to indicate whether the speaker is going to continue speaking or has finished their turn in [5]:

[5]

As mentioned, a wait time of 10 seconds for a response is just as irritating as the CA constantly cutting in mid-turn [4]. Baselines were therefore computed at multiple silence thresholds between 50ms and 6000ms to ensure a range of trade-offs was represented.

Apart from one case (linguistic features only, 500ms silence threshold), every single model beat the baselines. Using only the linguistic or acoustic features didn't make much of a difference but performance was always best when the model used both sets of features together. The best overall system had a latency of 1195ms and a cut-in rate of just 18%.

[4]

[1] states that we predict EOT from multi-modal signals including: prosody, semantics, syntax, gesture and eye-gaze.

Instead of labelling 10ms windows (as speech, MTPs or EOTs), traditional models predict whether a speaker will continue speaking (HOLD) or has finished their turn (SHIFT), but only do this when they detect a pause. One major problem with this traditional approach is that backchannels are neither a HOLD nor a SHIFT, but one of these is predicted anyway.

LSTMs have been used to make predictions continuously at 50 ms intervals and these models outperform traditional EOT models and even humans when applied to HOLD/SHIFT predictions. Their hidden layers allow them to learn long range dependencies but it is unknown exactly which features influence the performance the most.

In [1], the new system completes three different turn-taking prediction tasks: (1) prediction at pauses, (2) prediction at onset and (3) prediction at overlap.

Prediction at Pauses is the standard prediction that takes place at brief pauses in the interaction to predict whether there will be a HOLD or SHIFT. Essentially, when there is a pause above a threshold time, the person with the highest average output probability (score) is predicted to speak next. This classification model is evaluated with weighted F-scores.
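The decision rule just described can be sketched directly: at a pause, each speaker has a stream of model output probabilities, and the speaker with the highest average score is predicted to speak next. The names and numbers below are made up for illustration.

```python
# Hypothetical sketch of "prediction at pauses": pick the speaker with the
# highest average output probability (score) as the predicted next speaker.
def predict_next_speaker(scores_by_speaker):
    """scores_by_speaker: dict mapping speaker -> list of output probabilities."""
    def avg(xs):
        return sum(xs) / len(xs)
    return max(scores_by_speaker, key=lambda s: avg(scores_by_speaker[s]))

# Toy scores for a two-party interaction at a detected pause.
scores = {"A": [0.2, 0.3, 0.4], "B": [0.6, 0.7, 0.5]}
print(predict_next_speaker(scores))  # "B"
```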

Prediction at Onsets classifies the utterances during speech, not at a pause. This model is slightly different however as it predicts whether the currently ongoing utterance will be short or long. Again, as a classifier, this model was evaluated using weighted F-scores.

Prediction at Overlap is introduced for the first time in this paper. This is essentially a HOLD/SHIFT prediction again, but made when an overlapping period of at least 100ms occurs. The decision to HOLD (continue speaking) is predicted when the overlap is a backchannel and SHIFT when the system should stop speaking. This again was evaluated using weighted F-scores.

Here is an example of predicted turn-taking in action:

As mentioned above, we don’t know exactly which features we use to predict when it is our turn to speak. [1] used many features in different arrangements to distinguish which are most useful. The features used were as follows:

Acoustic features are low level descriptors that include loudness, shimmer, pitch, jitter, spectral flux and MFCCs. These were extracted using the OpenSmile toolkit.

Linguistic features were investigated at two levels: part-of-speech (POS) and word. Literature often suggests that POS tags are good at predicting turn-switches, but POS tags must be extracted from words (output by an ASR system), so it is useful to check whether this extra processing step is actually needed.

Using words instead of POS would be a great advantage for systems that need to run in real time.

Phonetic features were output from a deep neural network (DNN) that classifies senones.

Voice activity was included in their transcriptions so also used as a feature.

So what features were the most useful for EOT prediction according to [1]?

Acoustic features were great for EOT prediction; all but one experiment's best results included acoustic features. This was particularly the case for prediction at overlap.

Words mostly outperformed POS tags, apart from prediction at onset, so use POS tags if you want to predict utterance length (like backchannels).

In all cases, including voice activity improved performance.

In terms of acoustic features, the most important features were loudness, F0, low order MFCCs and spectral slope features.

Overall, the best performance was obtained by using voice activity, acoustic features and words.

As mentioned, the fact that using words instead of POS tags leads to better performance is brilliant for faster processing. This of course is beneficial for real-time incremental prediction - just like what we humans do.

All of these features are not just used to detect when we can next speak but are even used to guide what we say. We will expand on what we are saying, skip details or change topic depending on how engaged the other person is with what we are saying.

Therefore to model natural human conversation, it is important for a CA to measure engagement.

Engagement

Engagement shows interest in and attention to a conversation and, as we want users to stay engaged, it influences the dialogue strategy of the CA. This optimisation of the user experience all has to be done in real time to keep a fluid conversation.

[2] detects the following signals to measure engagement: nodding, laughter, verbal backchannels and eye gaze. The fact that these signals show attention and interest is relatively common sense but were learned from a large corpus of human-robot interactions.

[2]

[2] doesn’t just focus on recognising social signals but also on creating an engagement recognition model.

This experiment was run in Japan where nodding is particularly common. Seven features were extracted to detect nodding: (per frame) the yaw, roll and pitch of the person's head; (per 15 frames) the average speed, average velocity, average acceleration and range of the person's head.
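The windowed head-movement statistics described above can be sketched as follows. This is a hedged illustration of the general idea (average speed, velocity, acceleration and range over 15-frame windows of one head angle); the exact feature definitions in [2] may differ.

```python
# Illustrative sketch: per-window head-movement features from a per-frame
# head-pitch series, of the kind used for nod detection.
def window_features(pitch, window=15):
    feats = []
    for start in range(0, len(pitch) - window + 1, window):
        w = pitch[start:start + window]
        vel = [b - a for a, b in zip(w, w[1:])]      # frame-to-frame velocity
        speed = [abs(v) for v in vel]                 # unsigned speed
        acc = [b - a for a, b in zip(vel, vel[1:])]   # change in velocity
        feats.append({
            "avg_speed": sum(speed) / len(speed),
            "avg_velocity": sum(vel) / len(vel),
            "avg_acceleration": sum(acc) / len(acc),
            "range": max(w) - min(w),
        })
    return feats

# A small nod: pitch dips down and back up over 15 frames.
nod = [0, -1, -2, -3, -4, -5, -5, -5, -4, -3, -2, -1, 0, 0, 0]
print(window_features(nod)[0]["range"])  # 5
```

A nod shows up as a large range and high average speed but near-zero average velocity, since the head returns to where it started.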

Their LSTM model outperformed the other approaches to detect nodding across all metrics.

Smiling is often used to detect engagement but, to avoid using a camera (they use microphones + Kinect), laughter is detected instead. Each model was tasked to classify whether an inter-pausal unit (IPU) of sound contained laughter or not. A two-layer DNN trained on both prosodic and linguistic features performed best, but using other spectral features instead of linguistic features (which are not necessarily available from the ASR) could improve the model further.

Similarly to nodding, verbal backchannels are more frequent in Japan (called aizuchi). Additionally in Japan, verbal backchannels are often accompanied by head movements but only the sound was provided to the model. Similar to the laughter detection, this model classifies whether an IPU is a backchannel or the person is starting their turn (especially difficult when barging in). The best performing model was found to be a random forest, with 56 estimators, using both prosody and linguistic features. The model still performed reasonably when given only prosodic features (again because linguistic features may not be available from the ASR).

Finally, eye gaze is commonly known as a clear sign of engagement. Based on the inter-annotator agreement, looking at Erica's head (the robot embodiment in this experiment) for 10 seconds continuously was considered engagement. Periods of less than 10 seconds were therefore negative cases.

Erica: source

The information from the Kinect sensor was used to calculate a vector from the user's head orientation, and the user was considered 'looking at Erica' if that vector collided with Erica's head (plus 30cm to accommodate error). This geometry-based model worked relatively well but the position of Erica's head was estimated, which will have affected results. It is expected that this model will improve significantly when exact values are known.
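The geometric test just described amounts to a ray-sphere check: cast a ray from the user's head along their head orientation and ask whether it passes within some radius of the robot's estimated head position. A minimal sketch under that assumption (the coordinates and 0.3m radius are illustrative):

```python
# Illustrative geometry-based gaze check: does the gaze ray pass within
# `radius` metres of the target (e.g. the robot's estimated head position)?
import math

def looking_at(head_pos, gaze_dir, target_pos, radius=0.3):
    # Normalise the gaze direction.
    norm = math.sqrt(sum(d * d for d in gaze_dir))
    d = [x / norm for x in gaze_dir]
    # Vector from the user's head to the target.
    v = [t - h for t, h in zip(target_pos, head_pos)]
    t = sum(vi * di for vi, di in zip(v, d))  # projection onto the ray
    if t < 0:
        return False  # target is behind the user
    closest = [h + t * di for h, di in zip(head_pos, d)]
    dist = math.sqrt(sum((c - p) ** 2 for c, p in zip(closest, target_pos)))
    return dist <= radius

print(looking_at([0, 0, 0], [0, 0, 1], [0.1, 0, 2.0]))  # roughly facing the robot
print(looking_at([0, 0, 0], [1, 0, 0], [0.1, 0, 2.0]))  # looking away
```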

This paper doesn’t aim to create the best individual systems but instead hypothesises that these models in conjunction will perform better than the individual models at detecting engagement.

[2]

The ensemble of the above models was used as a binary classifier (either a person was engaged or not). In particular, they built a hierarchical Bayesian binary classifier which judged whether the listener was engaged from the 16 possible combinations of outputs from the 4 models above.
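A much-simplified stand-in for that classifier: the four binary detector outputs (nod, laugh, backchannel, gaze) form one of 16 combinations, each mapped to a learned engagement probability. The probabilities below are invented purely to illustrate the lookup structure; the real model in [2] learns them hierarchically.

```python
# Toy ensemble over the four binary detectors: each of the 16 output
# combinations maps to a (made-up) engagement probability.
from itertools import product

# Fake "learned" table: probability grows with the number of cues present,
# with gaze weighted a little more heavily.
engagement_prob = {
    combo: min(1.0, 0.1 + 0.2 * sum(combo) + 0.1 * combo[3])
    for combo in product((0, 1), repeat=4)
}

def is_engaged(nod, laugh, backchannel, gaze, threshold=0.5):
    return engagement_prob[(nod, laugh, backchannel, gaze)] >= threshold

print(is_engaged(0, 0, 0, 0))  # no cues -> not engaged
print(is_engaged(1, 0, 1, 1))  # several cues -> engaged
```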

From the annotators, a model was built to deduce which features are more or less important when detecting engagement. Some annotators found laughter to be a particularly important factor, for example, whereas others did not. They found that inputting a character variable with three different character types improved the model's performance.

Additionally, including the previous engagement of a listener also improved the model. This makes sense as someone that is not interested currently is more likely to stay uninterested during your next turn.

Measuring engagement can only really be done when a CA is embodied (eye contact with Siri is non-existent for example). Social robots are being increasingly used in areas such as Teaching, Public Spaces, Healthcare and Manufacturing. These can all contain spoken dialogue systems but why do they have to be embodied?

[5]

Embodiment

People will travel across the globe to have a face-to-face meeting when they could just phone [5]. We don’t like to interact without seeing the other person as we miss many of the signals that we talked about above. In today's world we can also video-call but this is still avoided when possible for the same reasons. The difference between talking on the phone or face-to-face is similar to the difference between talking to Siri and an embodied dialogue system [5].

Current voice systems cannot show facial expressions, indicate attention through eye contact or move their lips. Lip reading is obviously very useful for those with impaired hearing but we all lip read during conversation (this is how we know what people are saying even in very noisy environments).

Not only can a face output these signals, it also allows the system to detect who is speaking, who is paying attention, who the actual people are (Rory, Jenny, etc…) and recognise their facial expressions.

Robot faces come in many forms however and some are better than others for use in conversation. Most robot faces, such as the face of Nao, are very static and therefore cannot show a wide range of emotion through expression like we do.

Nao: source

Some more abstract robot face depictions, such as Jibo, can show emotion using shapes and colour but some expressions must be learned.

Jibo: source

We know how to read a human face so it makes sense to show a human face. Hyper-realistic robot faces exist but are a bit creepy, like Sophia, and are very expensive.

Sophia: source

They are very realistic but just not quite right which makes conversation very uncomfortable. To combat this, avatars have been made to have conversations on screen.

source

These can mimic humans relatively closely without being creepy as it’s not a physical robot. This is almost like Skype however and this method suffers from the ‘Mona-Lisa effect’. In multi-party dialogue, it is impossible for the avatar on screen to look at one person and not the other. Either the avatar is looking ‘out’ at all parties or away at no one.

Gabriel Skantze (presenter of [5] to be clear) is the Co-Founder of Furhat Robotics and argues that Furhat is the best balance between all of these systems. Furhat has been developed to be used for conversational applications as a receptionist, social-trainer, therapist, interviewer, etc…

source

Furhat needs to know where it should be looking, when it should speak, what it should say and what facial expressions it should be displaying [5].

Finally (for now), once embodied, dialogues with a robot need to be grounded in real time with the real world. In [3] the example given is a CA embodied in an industrial machine, which [5] states is becoming more and more common.

source

Fluid, Incremental Grounding Strategies

For a conversation to be natural, human-robot conversations must be grounded in a fluid manner [3].

With non-incremental grounding, users can give positive feedback and repair but only after the robot has shown full understanding of the request. If you ask a robot to move an object somewhere for example, you must wait until the object is moved before you can correct it with an utterance like “no, move the red one”. No overlapping speech is possible so actions must be reversed entirely if a repair is needed.

With incremental grounding, overlapping is still not possible but feedback can be given at more regular intervals. Instead of the entire task being completed before feedback can be given, feedback can be given at sub-task intervals. “no, move the red one” can be said just after the robot picks up a blue object, repairing quickly. In the previous example, the blue object was then placed in a given location before the repair could be given which resulted in a reversal of the whole task! This is much more efficient but still not fluid like in human-human interactions.

Fluid incremental grounding is possible if overlaps are processed. Allowing and reasoning over concurrent speech and action is much more natural. Continuing with our repair example, “no, move the red one” can be said as soon as the robot is about to pick up the blue object, no task has to be completed and reversed as concurrency is allowed. The pickup task can be aborted and the red object picked up fluidly as you say what to do with it.
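The repair example above can be sketched as a tiny state machine. This is a hypothetical illustration (not the system of [3]): an overlapping "no, ..." repair aborts the in-progress action before it completes, so nothing has to be reversed, and the robot re-plans fluidly.

```python
# Hypothetical sketch of fluid incremental grounding: a planned action can
# be aborted mid-execution when an overlapping repair arrives, instead of
# completing the task and then reversing it.
class Robot:
    def __init__(self):
        self.state = "idle"
        self.target = None

    def plan_pickup(self, obj):
        self.state, self.target = "reaching", obj

    def hear(self, utterance):
        # Process the overlapping repair against the current action context.
        if utterance.startswith("no") and self.state == "reaching":
            self.state = "aborted"        # abort before the pickup completes
            if "red" in utterance:
                self.plan_pickup("red")   # re-plan fluidly, no reversal needed

robot = Robot()
robot.plan_pickup("blue")
robot.hear("no, move the red one")   # overlaps with the reaching action
print(robot.state, robot.target)     # reaching red
```

A real system would of course ground this word by word against continuous perception and motor state rather than on whole utterances.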

[2]

To move towards this more fluid grounding, real-time processing needs to take place. Not only does the system need to process utterances word by word, real-time context needs to be monitored, such as the robot's current state and planned actions (both of which can change dynamically through the course of an utterance or word).

The robot must know when it has sufficiently shown what it is doing in order to handle both repairs and confirmations. The robot needs to know what the user is confirming and, even more importantly, what needs to be repaired.

Conclusion

In this brief overview, I have covered just a tiny amount of the current work towards more natural conversational systems.

Even if turn-taking prediction, engagement measuring, embodiment and fluid grounding were all perfected, CAs would not have conversations like we humans do. I plan to write more of these overviews over the next few years so look out for them if interested.

In the meantime, please do comment with discussion points, critique my understanding and suggest papers that I (and anyone reading this) may find interesting.


Beginning to Replicate Natural Conversation in Real Time was originally published in Wallscope on Medium, where people are continuing the conversation by highlighting and responding to this story.

Seminar: Utilising Linked Data in the Public Sector

Title: Utilising Linked Data in the Public Sector

Speaker: Angus Addlesee, PhD Student, Heriot-Watt University

Date: 11:15 on 25 March 2019

Location: CM F.17, Heriot-Watt University

Abstract: In this presentation I will explain how Wallscope (a small tech company in Edinburgh) is using linked data in public sector projects.

Bio: Angus has worked at Wallscope for two years in various roles and is now studying for his PhD at Heriot-Watt, which is part-funded by Wallscope.

Wallscope uses Machine Learning and Semantic Technologies to build Knowledge Graphs and Linked Data applications. We are motivated to lower the barriers for accessing knowledge to improve the health, wealth and sustainability of the world we share.

Linked Data Reconciliation in GraphDB


Using DBpedia to Enhance your Data in GraphDB

Following my article on Transforming Tabular Data into Linked Data using OntoRefine in GraphDB, the founder of Ontotext (Atanas Kiryakov) suggested I write a further tutorial using GraphDB for data reconciliation.

In this tutorial we will begin with a .csv of car manufacturers and enhance this with DBpedia. This .csv can be downloaded from here if you want to follow along.

Contents

Setting Up
Constructing the Graph
Reconciling your Data
Exploring the New Graph
Conclusion

Setting Up

First things first, we need to load our tabular data into OntoRefine in GraphDB. Head to the import tab, select “Tabular (OntoRefine)” and upload cars.csv if you are following along.

Click “Next” to start creating the project.

On this screen you need to untick “Parse next 1 line(s) as column headers” as this .csv does not have a header row. Rename the project in the top right corner and click “Create Project”.

You should now have this screen (above) showing one column of car manufacturer names. The column name has a space in it, which is annoying when writing SPARQL queries, so let's rename it.

Click the little arrow next to “Column 1”, open “Edit Column” and then click “Rename this Column”. I called it “carNames” and will use this in the queries below, so bear this in mind if you name it something different.

If you ever make a mistake, remember there is an undo/redo tab.

Constructing the Graph

In the top right of the interface there is an orange button titled “SPARQL”. Click this to open the SPARQL interface from which you can query your tabular data.

In the above screenshot I have run the query we want. I have pasted it here so you can see it all, and I go through it in detail below.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
CONSTRUCT {
  ?car rdfs:label ?taggedname ;
       rdf:type dbo:Company ;
       dbo:location ?location .

  ?location rdf:type dbo:Country ;
            rdfs:label ?lname ;
            dbp:populationCensus ?pop .
} WHERE {
  ?c <urn:col:carNames> ?cname .

  BIND(STRLANG(?cname, "en") AS ?taggedname)

  SERVICE <https://dbpedia.org/sparql> {

    ?car rdfs:label ?taggedname ;
         dbo:location | dbo:locationCountry ?location .

    ?location rdf:type dbo:Country ;
              rdfs:label ?lname ;
              dbp:populationCensus | dbo:populationTotal ?pop .

    FILTER (LANG(?lname) = "en")
  }
}

If you are unfamiliar with SPARQL queries then I recommend reading one of my previous articles before reading on.

I start this query by defining my prefixes as usual. I want to construct a graph around these car manufacturers so I design that in my CONSTRUCT clause. I am building a fairly simple graph for this tutorial so let's just run through it very quickly.

I want to have entities representing car manufacturers that have a type, label and location. This location is the headquarters of the car manufacturer. In most cases, all entities should have both a type and a human-readable label so I have ensured this here.

Each location is also an entity with an attached type, label and population.

Unlike my superhero tutorial, the .csv only contains the car company names and not all the data we want in our graph. We therefore need to reconcile our data with information in an open linked dataset. In this tutorial we will use DBpedia, the linked data representation of Wikipedia.

To get the information needed to build the graph declared in our CONSTRUCT, we first grab all the names in our .csv and assign them to the variable ?cname. String literals must be language tagged to reconcile with the data in DBpedia, so I BIND the English language tag “en” to each string literal. That is exactly what the lines below do:

If you didn’t name the column “carNames” above, you will have to modify the <urn:col:carNames> predicate here.
  ?c <urn:col:carNames> ?cname .

BIND(STRLANG(?cname, "en") AS ?taggedname)

Following this we use the SERVICE tag to send the query to DBpedia (this is called a federated query). We find every entity with the label matching our language tagged strings from the original .csv.

Once I have those entities, I need to find their locations. DBpedia is a very messy dataset so we have to use an alternative path in the query (represented by the “pipe” | symbol). This finds locations connected by any of the alternate paths given (in this case dbo:location and dbo:locationCountry) and assigns them to the variable ?location.

That explanation is referring to these lines:

    ?car rdfs:label ?taggedname ;
dbo:location | dbo:locationCountry ?location .

Next we want to retrieve the information about each country. The first pattern in the location ensures the entity has the type dbo:Country so that we don’t find loads of irrelevant locations.

Following this we grab the label and again use alternate property paths to extract each country's population.

It is important to note that some countries have two different populations attached by these two predicates.

We finally FILTER the country labels to only return those that are in English as that is the language our original dataset is in. Data reconciliation can also be used to extend your data into other languages if it happens to fit a multilingual linked open dataset.

That covers the final few lines of our query:

    ?location rdf:type dbo:Country ;
rdfs:label ?lname ;
dbp:populationCensus | dbo:populationTotal ?pop .

FILTER (LANG(?lname) = "en")

Next we need to insert this graph we have constructed into a GraphDB repository.

Click “SPARQL endpoint” and copy your endpoint (yours will be different) to be used later.

Reconciling your Data

If you have not done already, create a repository and head to the SPARQL tab.

You can see in the top right of this screenshot that I’m using a repository called “cars”.

In this query panel you want to copy the CONSTRUCT query we built and modify it a little. The full query is here:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
INSERT {
  ?car rdfs:label ?taggedname ;
       rdf:type dbo:Company ;
       dbo:location ?location .

  ?location rdf:type dbo:Country ;
            rdfs:label ?lname ;
            dbp:populationCensus ?pop .
} WHERE { SERVICE <http://localhost:7200/rdf-bridge/yourID> {
  ?c <urn:col:carNames> ?cname .

  BIND(STRLANG(?cname, "en") AS ?taggedname)

  SERVICE <https://dbpedia.org/sparql> {

    ?car rdfs:label ?taggedname ;
         dbo:location | dbo:locationCountry ?location .

    ?location rdf:type dbo:Country ;
              rdfs:label ?lname ;
              dbp:populationCensus | dbo:populationTotal ?pop .

    FILTER (LANG(?lname) = "en")
  }}
}

The first thing we do is replace CONSTRUCT with INSERT as we now want to ingest the returned graph into our repository.

The next and final thing we must do is nest the entire WHERE clause into a second SERVICE tag. This time however, the service endpoint is the endpoint you copied at the end of the construction section.

This constructs the graph and inserts it into your repository!

It should be a much larger graph but the messiness of DBpedia strikes again! Many car manufacturers are connected to the string label of their location and not the entity. Therefore, the locations do not have a population and are consequently not returned.

We started with a small .csv of car manufacturer names so let's explore this graph we now have.

Exploring the New Graph

If we head to the “Explore” tab and view Japan for example, we can see our data.

Japan has the attached type dbo:Country, label, population and has seven car manufacturers.

There is no point in linking data if we cannot gain further insight so let's head to the “SPARQL” tab of the workbench.

In this screenshot we can see the results of the below query. This query returns each country alongside the number of people per car manufacturer in that country.

There is nothing new in this query if you have read my SPARQL introduction. I have used the MAX population as some countries have two attached populations due to DBpedia.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
SELECT ?name ((MAX(?pop) / COUNT(DISTINCT ?companies)) AS ?result)
WHERE {
?companies rdf:type dbo:Company ;
dbo:location ?location .

?location rdf:type dbo:Country ;
dbp:populationCensus ?pop ;
rdfs:label ?name .
}
GROUP BY ?name
ORDER BY DESC (?result)

In the screenshot above you can see that the results (ordered by result in descending order) are:

  • Indonesia
  • Pakistan
  • India
  • China

India of course has a much larger population than Indonesia but also has a lot more car manufacturers (as shown below).

If you were a car manufacturer in Asia, Indonesia might be a good market to target for export as it has a high population but very little local competition.

Conclusion

We started with a small list of car manufacturer names but, by using GraphDB and DBpedia, we managed to extend this into a small graph that we could gain actual insight from.

Of course, this example is not entirely useful but perhaps you have a list of local areas or housing statistics that you want to reconcile with mapping or government linked open data. This can be done using the above approach to help you or your business gain further insight that you could not have otherwise identified.


Linked Data Reconciliation in GraphDB was originally published in Wallscope on Medium, where people are continuing the conversation by highlighting and responding to this story.

Comparison of Linked Data Triplestores: Developing the Methodology

Inspecting Load and Query Times across DBPedia and Yago

Developers in small to medium scale companies are often asked to test software and decide what’s “best”. I have worked with RDF for a few years now and thought that comparing triplestores would be a relatively trivial task. I was wrong so here is what I have learned so far.

TL;DR – My original comparison had an imperfect methodology, so I developed this one based on community feedback. My queries still bias the results, so I will next create data and query generators.

Contents

Introduction
Methodology – What I am doing differently
Triplestores – Which triplestores I tested
Loading – How fast does each triplestore load the data?
Queries – Query times (and how my queries bias these)
Next Steps – Developing a realistic benchmark
Conclusion
Appendix – Versions, loading and query methods, etc.

Introduction

Over the past few months I have created a small RDF dataset and some SPARQL queries to introduce people to linked data. In December I tied these together to compare some of the existing triplestores (you can read that here). I was surprised by the amount of attention this article got and I received some really great feedback and advice from the community.

Based on this feedback, I realised that the dataset I created was simply too small to compare these systems properly, as time differences were often just a few milliseconds. Additionally, I did not run warm-up queries, which proved to affect results significantly in some cases.

Methodology

I have therefore developed my methodology and run a second comparison to see how these systems perform on a larger scale (not huge due to current hardware restrictions).

I have increased the number of triples to 245,197,165 which is significantly more than the 1,781,625 triples that the original comparison was run on.

I performed three warm-up runs, then ran ten hot runs and charted the average time of those ten.
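Sketched in Python, that measurement loop looks something like this (`run_query` is a placeholder for whatever executes one query against the store under test; it is not part of the original experiment code):

```python
import statistics
import time

def time_query(run_query, warmups: int = 3, hot_runs: int = 10) -> float:
    """Average wall-clock seconds over hot_runs executions,
    after warmups untimed warm-up executions."""
    for _ in range(warmups):
        run_query()  # warm caches/JIT; results are discarded
    times = []
    for _ in range(hot_runs):
        start = time.perf_counter()
        run_query()
        times.append(time.perf_counter() - start)
    return statistics.mean(times)
```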

The machine I used has 32GB of memory and 8 logical cores, running CentOS 7. I used each system one at a time so they did not interfere with each other.

I used the CLI to load and query the data in all systems so that there is no possibility of the UI affecting the times.

I split the RDF into many gzipped files containing 100k triples each. This improves loading times as the process can be optimised across cores.
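A minimal sketch of that pre-processing step, assuming line-based N-Triples input (the file and directory names are illustrative, not the actual scripts used):

```python
import gzip
import itertools
from pathlib import Path

def split_ntriples(src: str, out_dir: str, chunk_size: int = 100_000) -> int:
    """Split a line-based N-Triples file into gzipped chunks of
    chunk_size triples each; return the number of chunks written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    n_chunks = 0
    with open(src, "r", encoding="utf-8") as f:
        lines = (line for line in f if line.strip())  # skip blank lines
        while True:
            chunk = list(itertools.islice(lines, chunk_size))
            if not chunk:
                break
            part = out / f"part-{n_chunks:05d}.nt.gz"
            with gzip.open(part, "wt", encoding="utf-8") as g:
                g.writelines(chunk)
            n_chunks += 1
    return n_chunks
```

Each bulk loader can then be pointed at the directory of `*.nt.gz` parts and parallelise across them.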

If you would like to recreate this experiment, you can find my queries, results and instructions on how to get the data here.

Triplestores

In this comparison I evaluated five triplestores. These were (in alphabetical order) AnzoGraph, Blazegraph, GraphDB, Stardog and Virtuoso.

I have listed the versions, query and load methods in the appendix of this article.

Loading

The first thing I did when evaluating each triplestore was, of course, load the data. Three distinct categories emerged: hours, tens of minutes and minutes.

In each case I loaded all the data with all of the gzipped .ttl files containing 100k triples each.

It is also important to note that loading time can be optimised in each case, so these are not the fastest possible loads, just the defaults. If you are deciding for a business, the vendors are more than happy to help you optimise for your data structure.

Blazegraph and GraphDB load this dataset in roughly 8 hours. Stardog and Virtuoso load this in the 30 to 45 minute range but AnzoGraph loads the exact same dataset in just 3 minutes!

Why these three buckets though? Blazegraph, GraphDB and Stardog are all Java based, so how does Stardog load the data so much faster (with the default settings)? This is likely down to differences in garbage collection: Stardog probably tunes this more aggressively by default than the other two.

Virtuoso is written in C, which has no managed memory, so it is easier to make loading fast than in Java-based systems. AnzoGraph is developed in C/C++, so why is it so much faster still?

The first reason is that it is simply newer and therefore a little more up to date. The second and more important reason is that they optimise highly for very fast loading speed as they are an OLAP database.

Initial loading speed is sometimes extremely important and sometimes relatively insignificant depending on your use case.

If you are setting up a pipeline that requires one initial big loading job to spin up a live system, that one loading time is insignificant in the long run. Basically, a loading time of minutes or hours is of little relevance to kick off a system that will run for weeks or years.

However, if you want to perform deep analysis across all of your data quickly, this loading time becomes very important. Maybe you suspect a security flaw and need to scrutinise huge amounts of your data to find it… Alternatively, you may be running your analysis on AWS as you don’t have the in-house resources to perform such a large scale investigation. In both of these scenarios, time to load your data is crucial and speed saves you money.

Queries

In this section I will analyse the results of each query and discuss why the time differences exist. As I mentioned, this article is more about why there are differences and how to avoid the causes of these differences to create a fair benchmark in the future.

This is not a speed comparison but an analysis of problems to avoid when creating a benchmark (which I am working on).

I briefly go over each query but they can be found here.

Query 1:

This query is very simple but highlights a number of issues. It simply counts the number of triples in the graph.

SELECT (COUNT(*) AS ?triples)
WHERE {
?s ?p ?o .
}

To understand the problems, let’s first take a look at the results:

You can see that we again have significant differences in times (the red bar extends so far that the others were unreadable, so the vertical axis has been cut).

The first problem with this query is that it will never be run in production as it provides no valuable information. Linked data is useful to analyse relationships and grab information for interfaces, etc… not to count the number of triples.

GraphDB, likely for this reason, has not optimised for this query at all. They have in fact tried many optimisations to make counting fast, essentially counting from (specific) indices without iterating over bindings/solutions, but many of those optimisations show great performance only on specific queries while being slow, or returning incorrect results, on real ones.

AnzoGraph equally completes an actual ‘count’ of each triple every time this query is run but the difference is likely a Java vs C difference again (or they have optimised slightly for this query).

Virtuoso is interesting as it is built upon a relational database and therefore keeps a record of the number of triples in the database at all times. It can therefore translate this query to look up that record and not actually ‘count’ like the last two.

Stardog takes another approach which is to run an index to help them avoid counting at all.

Blazegraph perhaps takes this further, which raises another problem with this query (in fact, a problem with all of my queries): they possibly cache the result from the warm-up runs and return it on request.

A major problem is that I run the EXACT same queries repeatedly. After the first run, the result can simply be cached and recalled. This, combined with the need for warm-up runs, creates an unrealistic test.

In production, queries are usually similar but with different entities within. For example, if you click on a person in an interface to bring up a detailed page about them, the information needed is always the same. The query is therefore the same apart from the person entity (the person you click on).

To combat this, I will make sure to have at least one randomly generated seed in each of my query templates.
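To make that concrete, here is a minimal sketch of a query template with a random seed (the template text, entity pool and function names are illustrative, not the actual benchmark code):

```python
import random
from string import Template

# Illustrative template: the $entity slot is re-seeded on every run,
# so an identical query is never sent twice and result caching cannot apply.
QUERY_TEMPLATE = Template("""\
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  <$entity> rdfs:label ?label .
}
""")

# Assumed pool of seed entities drawn from the loaded dataset.
SEED_ENTITIES = [
    "http://dbpedia.org/resource/Japan",
    "http://dbpedia.org/resource/Germany",
    "http://dbpedia.org/resource/Indonesia",
]

def random_query(rng: random.Random) -> str:
    """Instantiate the template with a randomly chosen seed entity."""
    return QUERY_TEMPLATE.substitute(entity=rng.choice(SEED_ENTITIES))
```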

Query 2:

This query, grabbed from this paper, returns a list of 1000 settlement names which have airports with identification numbers.

PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?v WHERE {
{ ?v2 a dbo:Settlement ;
rdfs:label ?v .
?v6 a dbo:Airport . }
{ ?v6 dbo:city ?v2 . }
UNION
{ ?v6 dbo:location ?v2 . }
{ ?v6 dbp:iata ?v5 . }
UNION
{ ?v6 dbo:iataLocationIdentifier ?v5 . }
OPTIONAL { ?v6 foaf:homepage ?v7 . }
OPTIONAL { ?v6 dbp:nativename ?v8 . }
} LIMIT 1000

This is a little more realistic when compared to query 1 but again has the problem that each run sends the exact same query.

In addition, a new issue becomes clear.

Once again, I have chopped the vertical axis so that the results can be shown clearly (and labelled at the base).

The interesting thing here is that all of the triplestores return exactly the same 1,000 labels apart from one: AnzoGraph. This is almost certainly the cause of the time difference, as it returns a different 1,000 settlements each time the query is run.

This is possibly by design so that limits do not skew analytical results. AnzoGraph is the only OLAP database in this comparison so they focus on deep analytics. They therefore would not want limits to return the same results every time, potentially missing something important.

Another important point regarding this query is that we have a LIMIT but no ORDER BY which is extremely unusual in real usage. You don’t tend to want 100 random movies, for example, but the 100 highest rated movies.

On testing this, adding an ORDER BY did increase the response times. This difference then extends into query 3…

Query 3:

This query nests query 2 to grab information about the 1,000 settlements returned above.

PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT * WHERE {
{?v2 a dbo:Settlement;
rdfs:label ?v.
?v6 a dbo:Airport.}
{?v6 dbo:city ?v2.}
UNION
{?v6 dbo:location ?v2.}
{?v6 dbp:iata ?v5.}
UNION
{?v6 dbo:iataLocationIdentifier ?v5.}
OPTIONAL {?v6 foaf:homepage ?v7.}
OPTIONAL {?v6 dbp:nativename ?v8.}
{
SELECT DISTINCT ?v WHERE {
{ ?v2 a dbo:Settlement ;
rdfs:label ?v .
?v6 a dbo:Airport . }
{ ?v6 dbo:city ?v2 . }
UNION
{ ?v6 dbo:location ?v2 . }
{ ?v6 dbp:iata ?v5 . }
UNION
{ ?v6 dbo:iataLocationIdentifier ?v5 . }
OPTIONAL { ?v6 foaf:homepage ?v7 . }
OPTIONAL { ?v6 dbp:nativename ?v8 . }
} LIMIT 1000
}}

As you can imagine, there is a very similar pattern between query 2 and query 3 results.

Remember that each run of this query asks for exactly the same information in each system except for AnzoGraph, which is different every time.

As with all of the other queries, returning the exact same results each run is problematic. Not only is it unrealistic but it is impossible to make a distinction between fast querying and smart caching. It is not bad to cache, it is smart to do for fast response times. The problem is the fact that this type of caching is unlikely to be needed in production.

A nice note to make is that, unlike the others, AnzoGraph is retrieving information about a different 1,000 settlements each run and only takes an additional 300ms to do this. Whether this is impressive or not cannot be known from this experiment.

If caching an answer is possible for some systems and not others, the results cannot be fairly compared. This is of course a problem when developing a benchmark.

Again however, randomly generated seeds would solve this.

Query 4:

To gauge the speed of each system’s mathematical functionality, I created a nonsensical query that uses many of these functions (NOW, SUM, AVG, CEIL, RAND, etc.).

The fact that this is nonsensical is not entirely a problem in this case. The fact that the query is exactly the same each run is however (as always).

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT (ROUND(?x/?y) AS ?result) WHERE {
{SELECT (CEIL(?a + ?b) AS ?x) WHERE {
{SELECT (AVG(?abslat) AS ?a) WHERE {
?s1 geo:lat ?lat .
BIND(ABS(?lat) AS ?abslat)
}}
{SELECT (SUM(?rv) AS ?b) WHERE {
?s2 dbo:volume ?volume .
BIND((RAND() * ?volume) AS ?rv)
}}
}}

{SELECT ((FLOOR(?c + ?d)) AS ?y) WHERE {
{SELECT ?c WHERE {
BIND(MINUTES(NOW()) AS ?c)
}}
{SELECT (AVG(?width) AS ?d) WHERE {
?s3 dbo:width ?width .
FILTER(?width > 50)
}}
}}
}

Essentially, this query is built from multiple nested selects that return and process numbers into a final result.

Once again, I have cut the vertical axis and labelled the bar for clarity.

This is a perfect example of query caching. I would be extremely surprised if AnzoGraph could actually run this query in 20ms. As mentioned above, caching is not cheating – just a problem when the exact same query is run repeatedly which is unrealistic.

It is also important to note that when I say caching, I do not necessarily mean result caching. Query structure can be cached for example to optimise any following queries. In fact, result caching could cause truth maintenance issues in a dynamic graph.

Blazegraph, Stardog and Virtuoso take a little longer but it is impossible to tell whether the impressive speed compared to GraphDB is due to calculation performance or some level of caching.

In conjunction with this, we also cannot conclude that GraphDB is mathematically slow. It of course looks like a clear conclusion, but it is not.

Without knowing what causes the increased performance (likely because the query is exactly the same each run), we cannot conclude what can be deemed poor performance.

Once again (there’s a pattern here) randomly generated seeds within query templates would make this fair as result caching could not take place.

Query 5a (Regex):

This query, like query 4, is nonsensical but aims to evaluate string instead of math queries. It essentially grabs all labels containing the string ‘venus’, all comments containing ‘sleep’ and all abstracts containing ‘gluten’. It then constructs an entity and attaches all of these to it.

PREFIX ex: <http://wallscope.co.uk/resource/example/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT {
ex:notglutenfree rdfs:label ?label ;
rdfs:comment ?sab ;
dbo:abstract ?lab .
} WHERE {
{?s1 rdfs:label ?label .
FILTER (REGEX(lcase(?label), 'venus'))
} UNION
{?s2 rdfs:comment ?sab .
FILTER (REGEX(lcase(?sab), 'sleep'))
} UNION
{?s3 dbo:abstract ?lab .
FILTER (REGEX(lcase(?lab), 'gluten'))
}
}

Regex SPARQL queries are very uncommon as the majority of triplestores have a full text search implementation that is much faster!

If however, you wished to send the same string query to multiple triplestores (you want to use an OLTP and an OLAP database together for example) then you may want to use Regex so you don’t have to customise each query.

AnzoGraph is the only triplestore here that does not have a built in full text indexing tool. This can be added by integrating AnzoGraph with Anzo, a data management and analytics tool.

Blazegraph, GraphDB and Virtuoso therefore do not optimise for this type of query as it is so uncommonly used. AnzoGraph however does optimise for this as users may not want to integrate Anzo into their software.

Searching for all of these literals, constructing the graph and returning the result in half a second is incredibly fast. So fast that I believe we run into the caching problem again.

To reiterate, I am not saying caching is bad! It is just a problem to compare results because my queries are the same every run.

Comparing Regex results is unnecessary when there are better ways to write the exact same query. If you were using different triplestores in production, it would be best to add a query modifier to transform string queries into their corresponding full text search representation.
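Such a query modifier could start as simply as a pattern rewrite. The sketch below (an assumption of mine, not an existing tool) converts the simple single-variable REGEX filters used in query 5a into Stardog's `textMatch` form shown in query 5b:

```python
import re

# Matches only the simple shape used in query 5a:
#   FILTER (REGEX(lcase(?var), 'keyword'))
FILTER_RE = re.compile(
    r"FILTER\s*\(\s*REGEX\s*\(\s*lcase\(\s*(\?\w+)\s*\)\s*,\s*'([^']+)'\s*\)\s*\)",
    re.IGNORECASE,
)

def regex_to_stardog_fts(query: str) -> str:
    """Replace simple REGEX filters with Stardog textMatch triple patterns."""
    return FILTER_RE.sub(
        lambda m: f"{m.group(1)} <tag:stardog:api:property:textMatch> '{m.group(2)}'",
        query,
    )
```

A production modifier would need one such rewrite rule per triplestore dialect, but the calling code could then send one canonical string query everywhere.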

For this reason I will use full text search (where possible) in my benchmark.

Query 5b (Full Text Index):

This query is exactly the same as above but uses each triplestore’s full-text index instead of Regex.

As these are all different, I have the Stardog implementation below (as they were the fastest in this case). The others can be found here.

PREFIX ex: <http://wallscope.co.uk/resource/example/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT {
ex:notglutenfree rdfs:label ?label ;
rdfs:comment ?sab ;
dbo:abstract ?lab .}
WHERE {
{?s1 rdfs:label ?label .
?label <tag:stardog:api:property:textMatch> 'venus'
} UNION {?s2 rdfs:comment ?sab .
?sab <tag:stardog:api:property:textMatch> 'sleep'
} UNION {?s3 dbo:abstract ?lab .
?lab <tag:stardog:api:property:textMatch> 'gluten'
}
}

I did not integrate AnzoGraph with Anzo, so it is not included below.

All of these times are significantly faster than their corresponding times in query 5a. Even the slowest time here is less than half the fastest query 5a time!

This really highlights why I will not include regex queries (where possible) in my benchmark.

Once again, due to the fact that the query is exactly the same each run I cannot compare how well these systems would perform in production.

Query 6:

Queries 1, 4 and 5 (2 and 3 also to an extent) are not like real queries that would be used in a real pipeline. To add a couple more sensible queries, I grabbed the two queries listed here.

This query finds all soccer players who were born in a country with more than 10 million inhabitants and who played as goalkeeper for a club that has a stadium with more than 30,000 seats, where the club’s country is different from the birth country.

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
SELECT DISTINCT ?soccerplayer ?countryOfBirth ?team ?countryOfTeam ?stadiumcapacity
{
?soccerplayer a dbo:SoccerPlayer ;
dbo:position|dbp:position <http://dbpedia.org/resource/Goalkeeper_(association_football)> ;
dbo:birthPlace/dbo:country* ?countryOfBirth ;
dbo:team ?team .
?team dbo:capacity ?stadiumcapacity ; dbo:ground ?countryOfTeam .
?countryOfBirth a dbo:Country ; dbo:populationTotal ?population .
?countryOfTeam a dbo:Country .
FILTER (?countryOfTeam != ?countryOfBirth)
FILTER (?stadiumcapacity > 30000)
FILTER (?population > 10000000)
} order by ?soccerplayer

Of course even with a more realistic query, my main problem remains…

Is the difference in time between Virtuoso and AnzoGraph due to performance or the fact that the same query is run thirteen times? It’s impossible to tell but almost certainly the latter.

This is of course equally true for query 7.

One interesting point to think about is how these stores may perform in a clustered environment. As mentioned, AnzoGraph is the only OLAP database in this comparison so in theory should perform significantly better once clustered. This is of course important when analysing big data.

Another problem I have in this comparison is the scalability of the data. How these triplestores perform as they transition from a single node to a clustered environment is often important for large scale or high growth companies.

To tackle this, a data generator alongside my query generators will allow us to scale from 10 triples to billions.
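A data generator of that kind can start very simply. The sketch below emits synthetic N-Triples deterministically from a seed (the vocabulary and graph shape are invented for illustration), with the triple count as a parameter:

```python
import random
from typing import Iterator

def generate_triples(n: int, seed: int = 42) -> Iterator[str]:
    """Yield n synthetic N-Triples lines deterministically from a seed,
    so the exact same dataset can be regenerated at any scale."""
    rng = random.Random(seed)
    for i in range(n):
        subj = f"<http://example.org/person/{i}>"
        pred = "<http://example.org/knows>"
        # Link each new person to a random earlier one (person 0 to itself).
        obj = f"<http://example.org/person/{rng.randrange(max(i, 1))}>"
        yield f"{subj} {pred} {obj} ."
```

Scaling from ten triples to billions is then just a matter of streaming the generator’s output straight into the gzipped chunk files used for loading.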

Query 7:

This query (found here) finds all people born in Berlin before 1900.

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX : <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?name ?birth ?death ?person
WHERE {
?person dbo:birthPlace :Berlin .
?person dbo:birthDate ?birth .
?person foaf:name ?name .
?person dbo:deathDate ?death .
FILTER (?birth < "1900-01-01"^^xsd:date)
}
ORDER BY ?name

This is a simple extract and filter query that is extremely common.

With a simple query like this across 245 million triples, the maximum time difference is just over 100ms.

I learned a great deal from the feedback following my last comparison but this experiment has really opened my eyes to how difficult it is to find the “best” solution.

Next Steps

I learned recently that benchmarks require significantly more than three warm-up runs. In my benchmark I will run around 1,000.

Of course, this causes problems if my queries do not have random seeds so I think it is clear from this article that I will have at least one random seed in each query template.

Many queries will have multiple random seeds to ensure cached query optimisations cannot carry over and degrade later performance. For example, if one query gathers all football players in Peru and is followed by a search for all la canne players in China, a cached optimisation from the first could slow down the second.

I really want to test the scalability of each solution so alongside my query generator I will create a data generator (this allows clustering evaluation).

Knowledge graphs are rarely static so in my benchmark I will have insert, delete and construct queries.

I will use full text search where possible instead of regex.

I will not use order-less limits as these are not used in production.

My queries will be realistic. If the data generated was real, they would return useful insight into the data. This ensures that I am not testing something that is not optimised for good reason.

I will work with vendors to fully optimise each system. Systems are optimised for different structures of data by default, which affects the results and therefore needs to change. Full optimisation, for the data and queries I create, by system experts ensures a fair comparison.

Conclusion

Fairly benchmarking RDF systems is more convoluted than it initially seems.

Following my next steps with a similar methodology, I believe a fair benchmark will be developed. The next challenge is evaluation metrics… I will turn to literature and use-case experience for this but suggestions would be very welcome!

AnzoGraph is the fastest if you sum the times (even if you switch regex for full-text-index times where possible).

Stardog is the fastest if you sum all query times (including 5a and 5b) but ignore loading time.

Virtuoso is the fastest if you ignore loading time and switch regex for full-text-index times where possible…

If this was a fair experiment, which of these results would be the “best”?

It of course depends on use case so I will have to come up with a few use cases to assess the results of my future benchmark for multiple purposes.

All feedback and suggestions are welcome, I’ll get to work on my generators.

Appendix

Below I have listed each triplestore (in alphabetical order) alongside which version, query method and load method I used:

AnzoGraph

Version: r201901292057.beta

Queried with:
azgi -silent -timer -csv -f /my/query.rq

Loaded with:
azgi -silent -f -timer /my/load.rq

Blazegraph

Version: 2.1.5

Queried with:
Rest API

Loaded with:
Using the dataloader Rest API by sending a dataloader.txt file.

GraphDB

Version: GraphDB-free 8.8.1

Queried with:
Rest API

Loaded with:
loadrdf -f -i repoName -m parallel /path/to/data/directory

It is important to note that with GraphDB I switched to a Parallel garbage collector while loading which will be default in the next release.

Stardog

Version: 5.3.5

Queried with:
stardog query myDB query.rq

Loaded with:
stardog-admin db create -n repoName /path/to/my/data/*.ttl.gz

Virtuoso

Version: VOS 7.2.4.2

Queried within isql-v:
SPARQL PREFIX … rest of query … ;

Loaded within isql-v:
ld_dir ('directory', '*.*', 'http://dbpedia.org') ;
then I ran a load script that ran three loaders in parallel.

It is important to note with Virtuoso that I used:
BufferSize = 1360000
DirtyBufferSize = 1000000

This was a recommended switch in the default virtuoso.ini file.


Comparison of Linked Data Triplestores: Developing the Methodology was originally published in Wallscope on Medium, where people are continuing the conversation by highlighting and responding to this story.

Inspecting Load and Query Times across DBPedia and Yago

Developers in small to medium scale companies are often asked to test software and decide what’s “best”. I have worked with RDF for a few years now and thought that comparing triplestores would be a relatively trivial task. I was wrong so here is what I have learned so far.

TL;DR - My original comparison had an imperfect methodology so I have developed this based on the community feedback. My queries now bias the results so I will next create data and query generators.

Contents

Introduction
Methodology -
What I am doing differently
Triplestores -
Which triplestores I tested.
Loading -
How fast does each triplestore load the data?
Queries -
Query Times (and how my queries bias these)
Next Steps -
Developing a realistic Benchmark
Conclusion
Appendix -
Versions, loading and query methods, etc…

Introduction

Over the past few months I have created a small RDF dataset and some SPARQL queries to introduce people to linked data. In December I tied these together to compare some of the existing triplestores (you can read that here). I was surprised by the amount of attention this article got and I received some really great feedback and advice from the community.

Based on this feedback, I realised that the dataset I created was simply too small to really compare these systems properly as time differences were often just a few milliseconds. Additionally, I did not run warm-up queries which proved to effect results significantly in some cases.

Methodology

I have therefore developed my methodology and run a second comparison to see how these systems perform on a larger scale (not huge due to current hardware restrictions).

I have increased the number of triples to 245,197,165 which is significantly more than the 1,781,625 triples that the original comparison was run on.

I performed three warm-up runs and then ran ten hot runs and chart the average time of those ten.

The machine I used has 32Gb Memory, 8 logical cores and was running Centos 7. I used each system one at a time so they did not interfere with each other.

I used the CLI to load and query the data in all systems so that there can be no possibility that the UI effects the time.

I split the RDF into many gzipped files containing 100k triples each. This improves loading times as the process can be optimised across cores.

If you would like to recreate this experiment, you can find my queries, results and instructions on how to get the data here.

Triplestores

In this comparison I evaluated five triplestores. These were (in alphabetical order) AnzoGraph, Blazegraph, GraphDB, Stardog and Virtuoso.

I have listed the versions, query and load methods in the appendix of this article.

Loading

The first thing I did when evaluating each triplestore was of course load the data. Three distinct categories emerged: hours, 10’s of minutes and minutes.

In each case I loaded all the data with all of the gzipped .ttl files containing 100k triples each.

It is also important to note that loading time can be optimised in each case so these are not the fastest they can load, just the default. If you are a deciding for a business, the vendors are more than happy to help you optimise for your data structure.

Blazegraph and GraphDB load this dataset in roughly 8 hours. Stardog and Virtuoso load this in the 30 to 45 minute range but AnzoGraph loads the exact same dataset in just 3 minutes!

Why these three buckets though? Blazegraph, GraphDB and Stardog are all Java based so how does Stardog load the data so much faster (with the default settings)? This is likely due to differences in garbage collection, Stardog probably manages this more by default than the other two.

Virtuoso is written in C which doesn’t manage memory and is therefore easier to load faster than systems built in Java. AnzoGraph is developed in C/C++ so why is it so much faster?

The first reason is that it is simply newer and therefore a little more up to date. The second and more important reason is that they optimise highly for very fast loading speed as they are an OLAP database.

Initial loading speed is sometimes extremely important and sometimes relatively insignificant depending on your use case.

If you are setting up a pipeline that requires one initial big loading job to spin up a live system, that one loading time is insignificant in the long run. Basically, a loading time of minutes or hours is of little relevance to kick off a system that will run for weeks or years.

However, if you want to perform deep analysis across all of your data quickly, this loading time becomes very important. Maybe you suspect a security flaw and need to scrutinise huge amounts of your data to find it… Alternatively, you may be running your analysis on AWS as you don’t have the in-house resources to perform such a large scale investigation. In both of these scenarios, time to load your data is crucial and speed saves you money.

Queries

In this section I will analyse the results of each query and discuss why the time differences exist. As I mentioned, this article is more about why there are differences and how to avoid the causes of these differences to create a fair benchmark in the future.

This is not a speed comparison but an analysis of problems to avoid when creating a benchmark (which I am working on).

I briefly go over each query but they can be found here.

Query 1:

This query is very simple but highlights a number of issues. It simply counts the number of triples in the graph.

SELECT (COUNT(*) AS ?triples)
WHERE {
?s ?p ?o .
}

To understand the problems, let’s first take a look at the results:

You can see that we again have significant differences in times (Red bar extends so far that the others were unreadable so cut vertical axis).

The first problem with this query is that it will never be run in production as it provides no valuable information. Linked data is useful to analyse relationships and grab information for interfaces, etc… not to count the number of triples.

GraphDB, likely for this reason, has not optimised for this query at all. An additional reason for this is that they have tried many optimisations to make counting fast; essentially counting based on (specific) indices, without iterating bindings/solutions. Many of those optimisations show great performance on specific queries, but are slow or return incorrect results on real queries.

AnzoGraph equally performs an actual ‘count’ over every triple each time this query is run; the time difference is likely the Java vs C difference again (or they have optimised slightly for this query).

Virtuoso is interesting as it is built upon a relational database and therefore keeps a record of the number of triples in the database at all times. It can therefore translate this query to look up that record and not actually ‘count’ like the last two.
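The idea behind Virtuoso's approach can be sketched in a few lines. This is an illustrative toy, not Virtuoso's actual implementation: a store that maintains a running count on every insert and delete, so a COUNT(*) query becomes a constant-time lookup rather than a scan.

```python
# Toy sketch: keep a running triple count updated on every insert/delete,
# so COUNT(*) becomes a constant-time record lookup rather than a full scan.
class TripleStore:
    def __init__(self):
        self._triples = set()
        self._count = 0          # maintained record, like a relational row count

    def insert(self, s, p, o):
        if (s, p, o) not in self._triples:
            self._triples.add((s, p, o))
            self._count += 1

    def delete(self, s, p, o):
        if (s, p, o) in self._triples:
            self._triples.remove((s, p, o))
            self._count -= 1

    def count(self):
        return self._count       # O(1): no iteration over triples

store = TripleStore()
store.insert("ex:a", "ex:knows", "ex:b")
store.insert("ex:b", "ex:knows", "ex:c")
store.insert("ex:a", "ex:knows", "ex:b")   # duplicate, ignored
print(store.count())                        # → 2
```

The trade-off is that every write pays a small bookkeeping cost so that this one read becomes free.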

Stardog takes another approach, using an index to avoid counting at all.

Blazegraph perhaps takes this further, which raises another problem with this query (in fact, a problem with all of my queries): they possibly cache the result from the warm-up runs and simply return it on request.

A major problem is that I run the EXACT same queries repeatedly. After the first run, the result can simply be cached and recalled. This, combined with the need for warm-up runs, creates an unrealistic test.

In production, queries are usually similar but with different entities within. For example, if you click on a person in an interface to bring up a detailed page about them, the information needed is always the same. The query is therefore the same apart from the person entity (the person you click on).

To combat this, I will make sure to have at least one randomly generated seed in each of my query templates.
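The template idea can be sketched as follows. This is a hypothetical illustration: the entity URIs and the query shape are invented, and a real benchmark would sample seeds from the loaded dataset.

```python
import random

# Hypothetical sketch: a query template where the person entity is a
# randomly chosen seed, so repeated runs cannot be served from a result cache.
TEMPLATE = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?name WHERE {{
  <{person}> rdfs:label ?name .
}}
"""

# In a real benchmark these URIs would be sampled from the loaded data;
# this list is purely illustrative.
PEOPLE = [
    "http://dbpedia.org/resource/Ada_Lovelace",
    "http://dbpedia.org/resource/Alan_Turing",
    "http://dbpedia.org/resource/Grace_Hopper",
]

def generate_query(rng: random.Random) -> str:
    """Fill the template with a randomly selected seed entity."""
    return TEMPLATE.format(person=rng.choice(PEOPLE))

rng = random.Random(42)  # fixed seed keeps runs reproducible across stores
print(generate_query(rng))
```

Seeding the random generator means every triplestore sees the same sequence of queries, so the comparison stays fair while result caching is defeated.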

Query 2:

This query, grabbed from this paper, returns a list of 1,000 names of settlements that have airports with identification numbers.

PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?v WHERE {
{ ?v2 a dbo:Settlement ;
rdfs:label ?v .
?v6 a dbo:Airport . }
{ ?v6 dbo:city ?v2 . }
UNION
{ ?v6 dbo:location ?v2 . }
{ ?v6 dbp:iata ?v5 . }
UNION
{ ?v6 dbo:iataLocationIdentifier ?v5 . }
OPTIONAL { ?v6 foaf:homepage ?v7 . }
OPTIONAL { ?v6 dbp:nativename ?v8 . }
} LIMIT 1000

This is a little more realistic when compared to query 1 but again has the problem that each run sends the exact same query.

In addition, a new issue becomes clear.

Once again, I have chopped the vertical axis so that the results can be shown clearly (and labelled at the base).

The interesting thing here is that all of the triplestores return exactly the same 1,000 labels apart from one: AnzoGraph. This is almost certainly the cause of the time difference, as AnzoGraph returns a different 1,000 settlements each time the query is run.

This is possibly by design so that limits do not skew analytical results. AnzoGraph is the only OLAP database in this comparison so they focus on deep analytics. They therefore would not want limits to return the same results every time, potentially missing something important.

Another important point regarding this query is that we have a LIMIT but no ORDER BY, which is extremely unusual in real usage. You don’t tend to want 100 random movies, for example, but the 100 highest-rated movies.

On testing this, adding an ORDER BY did increase the response times. This difference then extends into query 3…

Query 3:

This query nests query 2 to grab information about the 1,000 settlements returned above.

PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT * WHERE {
{?v2 a dbo:Settlement;
rdfs:label ?v.
?v6 a dbo:Airport.}
{?v6 dbo:city ?v2.}
UNION
{?v6 dbo:location ?v2.}
{?v6 dbp:iata ?v5.}
UNION
{?v6 dbo:iataLocationIdentifier ?v5.}
OPTIONAL {?v6 foaf:homepage ?v7.}
OPTIONAL {?v6 dbp:nativename ?v8.}
{
SELECT DISTINCT ?v WHERE {
{ ?v2 a dbo:Settlement ;
rdfs:label ?v .
?v6 a dbo:Airport . }
{ ?v6 dbo:city ?v2 . }
UNION
{ ?v6 dbo:location ?v2 . }
{ ?v6 dbp:iata ?v5 . }
UNION
{ ?v6 dbo:iataLocationIdentifier ?v5 . }
OPTIONAL { ?v6 foaf:homepage ?v7 . }
OPTIONAL { ?v6 dbp:nativename ?v8 . }
} LIMIT 1000
}}

As you can imagine, there is a very similar pattern between query 2 and query 3 results.

Remember that each run of this query asks for exactly the same information in each system except for AnzoGraph, which is different every time.

As with all of the other queries, returning the exact same results each run is problematic. Not only is it unrealistic, it also makes it impossible to distinguish between fast querying and smart caching. Caching is not bad; it is a smart way to achieve fast response times. The problem is that this type of caching is unlikely to help in production.
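The ambiguity can be demonstrated in a few lines. In this toy sketch, `run_query` is a stand-in for a real endpoint call (the sleep mimics query work); once a store memoises a repeated query, the second run measures the cache, not the engine.

```python
import functools
import time

def run_query(query: str) -> list:
    # Stand-in for a real SPARQL endpoint call; the sleep mimics query work.
    time.sleep(0.05)
    return ["result"]

@functools.lru_cache(maxsize=None)
def run_query_cached(query: str) -> tuple:
    return tuple(run_query(query))

query = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o . }"

t0 = time.perf_counter()
run_query_cached(query)          # cold: pays the real query cost
cold = time.perf_counter() - t0

t0 = time.perf_counter()
run_query_cached(query)          # warm: served straight from the cache
warm = time.perf_counter() - t0

# Against an identical repeated query, the warm time says nothing
# about engine speed.
print(f"cold={cold:.3f}s warm={warm:.6f}s")
```

From the outside, a benchmark that repeats one query cannot tell which of these two times it is really measuring.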

It is worth noting that, unlike the others, AnzoGraph retrieves information about a different 1,000 settlements each run and only takes an additional 300ms to do so. Whether this is impressive or not cannot be known from this experiment.

If caching an answer is possible for some systems and not others, the results cannot be fairly compared. This is of course a problem when developing a benchmark.

Again however, randomly generated seeds would solve this.

Query 4:

To gauge the speed of each system’s mathematical functionality, I created a nonsensical query that uses many of these (now, sum, avg, ceil, rand, etc…).

The fact that this is nonsensical is not really a problem here. The fact that the query is exactly the same on each run, however, is (as always).

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT (ROUND(?x/?y) AS ?result) WHERE {
{SELECT (CEIL(?a + ?b) AS ?x) WHERE {
{SELECT (AVG(?abslat) AS ?a) WHERE {
?s1 geo:lat ?lat .
BIND(ABS(?lat) AS ?abslat)
}}
{SELECT (SUM(?rv) AS ?b) WHERE {
?s2 dbo:volume ?volume .
BIND((RAND() * ?volume) AS ?rv)
}}
}}

{SELECT ((FLOOR(?c + ?d)) AS ?y) WHERE {
{SELECT ?c WHERE {
BIND(MINUTES(NOW()) AS ?c)
}}
{SELECT (AVG(?width) AS ?d) WHERE {
?s3 dbo:width ?width .
FILTER(?width > 50)
}}
}}
}

Essentially, this query is built from multiple nested selects that return and process numbers into a final result.

Once again, I have cut the vertical axis and labelled the bar for clarity.

This is a perfect example of query caching. I would be extremely surprised if AnzoGraph could actually run this query in 20ms. As mentioned above, caching is not cheating - just a problem when the exact same query is run repeatedly which is unrealistic.

It is also important to note that when I say caching, I do not necessarily mean result caching. Query structure can be cached for example to optimise any following queries. In fact, result caching could cause truth maintenance issues in a dynamic graph.

Blazegraph, Stardog and Virtuoso take a little longer but it is impossible to tell whether the impressive speed compared to GraphDB is due to calculation performance or some level of caching.

In conjunction with this, we also cannot conclude that GraphDB is mathematically slow. It certainly looks like that could be a clear conclusion, but it is not.

Without knowing what causes the increased performance (most likely the fact that the query is exactly the same each run), we cannot conclude what should be deemed poor performance.

Once again (there’s a pattern here) randomly generated seeds within query templates would make this fair as result caching could not take place.

Query 5a (Regex):

This query, like query 4, is nonsensical but aims to evaluate string instead of math queries. It essentially grabs all labels containing the string ‘venus’, all comments containing ‘sleep’ and all abstracts containing ‘gluten’. It then constructs an entity and attaches all of these to it.

PREFIX ex: <http://wallscope.co.uk/resource/example/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT {
ex:notglutenfree rdfs:label ?label ;
rdfs:comment ?sab ;
dbo:abstract ?lab .
} WHERE {
{?s1 rdfs:label ?label .
FILTER (REGEX(lcase(?label), 'venus'))
} UNION
{?s2 rdfs:comment ?sab .
FILTER (REGEX(lcase(?sab), 'sleep'))
} UNION
{?s3 dbo:abstract ?lab .
FILTER (REGEX(lcase(?lab), 'gluten'))
}
}

Regex SPARQL queries are very uncommon as the majority of triplestores have a full text search implementation that is much faster!

If, however, you wished to send the same string query to multiple triplestores (for example, using an OLTP and an OLAP database together), then you may want to use Regex so you don’t have to customise each query.

AnzoGraph is the only triplestore here that does not have a built in full text indexing tool. This can be added by integrating AnzoGraph with Anzo, a data management and analytics tool.

Blazegraph, GraphDB and Virtuoso therefore do not optimise for this type of query as it is so uncommonly used. AnzoGraph however does optimise for this as users may not want to integrate Anzo into their software.

Searching for all of these literals, constructing the graph and returning the result in half a second is incredibly fast. So fast that I believe we run into the caching problem again.

To reiterate, I am not saying caching is bad! It is just a problem to compare results because my queries are the same every run.

Comparing Regex results is unnecessary when there are better ways to write the exact same query. If you were using different triplestores in production, it would be best to add a query modifier to transform string queries into their corresponding full text search representation.
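A minimal sketch of such a modifier is below, assuming the Stardog textMatch predicate shown in query 5b. The regular expression only handles the narrow filter shape used in this article; a real modifier would need a proper SPARQL parser and one rewrite rule per target store.

```python
import re

# Hedged sketch: rewrite a simple REGEX filter block into Stardog's
# full text search form (the textMatch predicate from query 5b).
# Only handles the narrow pattern shape used in this article.
REGEX_FILTER = re.compile(
    r"\{\s*(\?\w+)\s+(\S+)\s+(\?\w+)\s*\.\s*"
    r"FILTER\s*\(REGEX\(lcase\((\?\w+)\),\s*'([^']+)'\)\)\s*\}"
)

def to_stardog_fts(query: str) -> str:
    def rewrite(m: re.Match) -> str:
        subj, pred, obj, _, term = m.groups()
        return (f"{{{subj} {pred} {obj} .\n"
                f"{obj} <tag:stardog:api:property:textMatch> '{term}'}}")
    return REGEX_FILTER.sub(rewrite, query)

block = "{?s3 dbo:abstract ?lab .\nFILTER (REGEX(lcase(?lab), 'gluten'))}"
print(to_stardog_fts(block))
```

With one such function per store, the same logical query can be dispatched everywhere in its fastest native form.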

For this reason I will use full text search (where possible) in my benchmark.

Query 5b (Full Text Index):

This query is exactly the same as above but uses each triplestore’s full text index instead of Regex.

As these are all different, I have the Stardog implementation below (as they were the fastest in this case). The others can be found here.

PREFIX ex: <http://wallscope.co.uk/resource/example/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT {
ex:notglutenfree rdfs:label ?label ;
rdfs:comment ?sab ;
dbo:abstract ?lab .}
WHERE {
{?s1 rdfs:label ?label .
?label <tag:stardog:api:property:textMatch> 'venus'
} UNION {?s2 rdfs:comment ?sab .
?sab <tag:stardog:api:property:textMatch> 'sleep'
} UNION {?s3 dbo:abstract ?lab .
?lab <tag:stardog:api:property:textMatch> 'gluten'
}
}

I did not integrate AnzoGraph with Anzo, so it is not included below.

All of these times are significantly faster than their corresponding times in query 5a. Even the slowest time here is less than half the fastest query 5a time!

This really highlights why I will not include regex queries (where possible) in my benchmark.

Once again, due to the fact that the query is exactly the same each run I cannot compare how well these systems would perform in production.

Query 6:

Queries 1, 4 and 5 (2 and 3 also to an extent) are not like real queries that would be used in a real pipeline. To add a couple more sensible queries, I grabbed the two queries listed here.

This query finds all soccer players born in a country with more than 10 million inhabitants, who played as goalkeeper for a club that has a stadium with more than 30,000 seats, where the club’s country is different from their birth country.

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
SELECT DISTINCT ?soccerplayer ?countryOfBirth ?team ?countryOfTeam ?stadiumcapacity
{
?soccerplayer a dbo:SoccerPlayer ;
dbo:position|dbp:position <http://dbpedia.org/resource/Goalkeeper_(association_football)> ;
dbo:birthPlace/dbo:country* ?countryOfBirth ;
dbo:team ?team .
?team dbo:capacity ?stadiumcapacity ; dbo:ground ?countryOfTeam .
?countryOfBirth a dbo:Country ; dbo:populationTotal ?population .
?countryOfTeam a dbo:Country .
FILTER (?countryOfTeam != ?countryOfBirth)
FILTER (?stadiumcapacity > 30000)
FILTER (?population > 10000000)
} order by ?soccerplayer

Of course even with a more realistic query, my main problem remains…

Is the difference in time between Virtuoso and AnzoGraph due to performance or the fact that the same query is run thirteen times? It’s impossible to tell but almost certainly the latter.

This is of course equally true for query 7.

One interesting point to think about is how these stores may perform in a clustered environment. As mentioned, AnzoGraph is the only OLAP database in this comparison so in theory should perform significantly better once clustered. This is of course important when analysing big data.

Another limitation of this comparison is that the dataset size is fixed. How these triplestores perform as they transition from a single node to a clustered environment is often important for large scale or high growth companies.

To tackle this, a data generator alongside my query generators will allow us to scale from 10 triples to billions.
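The data generator can be sketched as a toy. The `ex:` namespace and predicate names here are invented for illustration; a real generator would mirror the shape and distributions of the DBpedia data.

```python
import random

# Toy data generator sketch: emits n synthetic triples in N-Triples form.
# The example.org namespace and predicate names are invented.
EX = "http://example.org/"
PREDICATES = ["name", "knows", "locatedIn"]

def generate_triples(n: int, seed: int = 0):
    rng = random.Random(seed)
    for _ in range(n):
        s = f"<{EX}entity/{rng.randrange(n)}>"
        p = f"<{EX}prop/{rng.choice(PREDICATES)}>"
        o = f"<{EX}entity/{rng.randrange(n)}>"
        yield f"{s} {p} {o} ."

# Scale by changing n: 10 triples for a smoke test, millions for a cluster.
for triple in generate_triples(10):
    print(triple)
```

Because the generator is seeded, every store can be loaded with byte-identical data at any scale, which keeps single-node and clustered runs comparable.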

Query 7:

This query (found here) finds all people born in Berlin before 1900.

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX : <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?name ?birth ?death ?person
WHERE {
?person dbo:birthPlace :Berlin .
?person dbo:birthDate ?birth .
?person foaf:name ?name .
?person dbo:deathDate ?death .
FILTER (?birth < "1900-01-01"^^xsd:date)
}
ORDER BY ?name

This is a simple extract and filter query that is extremely common.

With a simple query like this across 245 million triples, the maximum time difference is just over 100ms.

I learned a great deal from the feedback following my last comparison but this experiment has really opened my eyes to how difficult it is to find the “best” solution.

Next Steps

I learned recently that benchmarks require significantly more than three warm up runs. In my benchmark I will run around 1,000.

Of course, this causes problems if my queries do not have random seeds so I think it is clear from this article that I will have at least one random seed in each query template.

Many queries will have multiple random seeds to ensure query caching isn’t storing optimisations that can slow down later performance. For example, if one query gathers all football players in Peru and this is followed by a search for all la canne players in China, a cached optimisation could actually slow down performance.
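The warm-up and seeding steps above can be sketched together as a toy benchmark loop. The `execute` function is a stand-in for a real endpoint call and the entity IDs are invented; the point is the structure, not the numbers.

```python
import random
import statistics
import time

def execute(query: str) -> None:
    """Stand-in for sending a query to a triplestore endpoint."""
    time.sleep(0.001)

def make_query(rng: random.Random) -> str:
    # Each run gets a different seed entity, so result caching cannot
    # short-circuit the measurement (entity IDs are illustrative).
    return f"SELECT ?o WHERE {{ <http://example.org/e/{rng.randrange(10**6)}> ?p ?o }}"

def benchmark(warmups: int = 1000, measured: int = 100, seed: int = 7) -> float:
    rng = random.Random(seed)
    for _ in range(warmups):
        execute(make_query(rng))           # warm caches, JIT, buffers
    times = []
    for _ in range(measured):
        q = make_query(rng)
        t0 = time.perf_counter()
        execute(q)
        times.append(time.perf_counter() - t0)
    return statistics.median(times)        # median resists outliers

print(f"median query time: {benchmark(warmups=10, measured=5):.4f}s")
```

Reporting a median over many seeded runs, after a long warm-up, measures steady-state engine performance rather than cache hits.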

I really want to test the scalability of each solution so alongside my query generator I will create a data generator (this allows clustering evaluation).

Knowledge graphs are rarely static so in my benchmark I will have insert, delete and construct queries.

I will use full text search where possible instead of regex.

I will not use LIMITs without ORDER BY, as these are not used in production.

My queries will be realistic. If the data generated was real, they would return useful insight into the data. This ensures that I am not testing something that is not optimised for good reason.

I will work with vendors to fully optimise each system. Systems are optimised for different structures of data by default, which affects the results and therefore needs to change. Full optimisation, for the data and queries I create, by system experts ensures a fair comparison.

Conclusion

Fairly benchmarking RDF systems is more convoluted than it initially seems.

Following my next steps with a similar methodology, I believe a fair benchmark will be developed. The next challenge is evaluation metrics… I will turn to literature and use-case experience for this but suggestions would be very welcome!

AnzoGraph is the fastest if you sum the times (even if you switch regex for fti times where possible).

Stardog is the fastest if you sum all query times (including 5a and 5b) but ignore loading time.

Virtuoso is the fastest if you ignore loading time and switch regex for fti times where possible…

If this was a fair experiment, which of these results would be the “best”?

It of course depends on use case so I will have to come up with a few use cases to assess the results of my future benchmark for multiple purposes.

All feedback and suggestions are welcome, I’ll get to work on my generators.

Appendix

Below I have listed each triplestore (in alphabetical order) alongside which version, query method and load method I used:

AnzoGraph

Version: r201901292057.beta

Queried with:
azgi -silent -timer -csv -f /my/query.rq

Loaded with:
azgi -silent -timer -f /my/load.rq

Blazegraph

Version: 2.1.5

Queried with:
Rest API

Loaded with:
Using the dataloader Rest API by sending a dataloader.txt file.

GraphDB

Version: GraphDB-free 8.8.1

Queried with:
Rest API

Loaded with:
loadrdf -f -i repoName -m parallel /path/to/data/directory

It is important to note that with GraphDB I switched to a Parallel garbage collector while loading which will be default in the next release.

Stardog

Version: 5.3.5

Queried with:
stardog query myDB query.rq

Loaded with:
stardog-admin db create -n repoName /path/to/my/data/*.ttl.gz

Virtuoso

Version: VOS 7.2.4.2

Queried within isql-v:
SPARQL PREFIX ... rest of query ... ;

Loaded within isql-v:
ld_dir ('directory', '*.*', 'http://dbpedia.org') ;
then I ran a load script that runs three loaders in parallel.

It is important to note with Virtuoso that I used:
BufferSize = 1360000
DirtyBufferSize = 1000000

This was a recommended switch in the default virtuoso.ini file.


Comparison of Linked Data Triplestores: Developing the Methodology was originally published in Wallscope on Medium.