Comparison of Linked Data Triplestores: A New Contender

First Impressions of RDFox while the Benchmark is Developed

Note: This article is not sponsored. Oxford Semantic Technologies let me try out the new version of RDFox and are keen to be part of the future benchmark.

After reading some of my previous articles, Oxford Semantic Technologies (OST) got in touch and asked if I would like to try out their triplestore called RDFox.

In this article I will share my thoughts and why I am now excited to see how they do in the future benchmark.

They have just released a page on which you can request your own evaluation license to try it yourself.

Contents

Brief Benchmark Catch-up
How I Tested RDFox
First Impressions
Results
Conclusion

source

Brief Benchmark Catch-up

In December I wrote a comparison of existing triplestores on a tiny dataset. I quickly learned that there were too many flaws in my methodology for the results to be truly comparable.

In February I then wrote a follow-up in which I described many of those flaws and listed many of the details that I will have to pay attention to while developing an actual benchmark.

This benchmark is currently in development. I am now working with developers and academics, and talking with a high-performance computing centre that can give us access to the infrastructure needed to run at scale.

How I Tested RDFox

In the above articles I evaluated five triplestores. They were (in alphabetical order) AnzoGraph, Blazegraph, GraphDB, Stardog and Virtuoso. I would like to include all of these in the future benchmark and now RDFox as well.

Obviously my previous evaluations are not completely fair comparisons (hence the development of the benchmark) but the last one can be used to get an idea of whether RDFox can compete with the others.

For that reason, I loaded the same data and ran the same queries as in my larger comparison to see how RDFox fared. I of course kept all other variables the same: the same machine, using the CLI to query, the same number of warm-up and hot runs, etc.

First Impressions

RDFox is very easy to use and well-documented. You can initialise it with custom scripts, which is extremely useful: I could start RDFox, load all my gzipped Turtle files in parallel, run all my warm-up queries and run all my hot queries with a single command.

RDFox is an in-memory solution, which explains many of the differences in results, but it also has a very nice rule system that can be used to precompute results that are used in later queries. These rules are not evaluated when you send a query but in advance.

This allows them to be used to automatically maintain the consistency of your data as it is added or removed. The rules themselves can even be added or removed during the lifetime of the database.

Note: Queries 3 and 6 use these custom rules. I highlight this on the relevant queries and this is one of the reasons I didn’t just add them to the last article.

You can use these rules for a number of things, for example to precompute alternate property paths (if you are unfamiliar with those, I cover them in this SPARQL tutorial). You could do this by defining a rule that example:predicate represents both example:pred1 and example:pred2, so that:

SELECT ?s ?o
WHERE {
?s example:predicate ?o .
}

This would return the triples:

person:A example:pred1 colour:blue .
person:A example:pred2 colour:green .
person:B example:pred2 colour:brown .
person:C example:pred1 colour:grey .

These results make the use of alternate property paths less necessary.
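
For comparison, without such a rule you would typically reach for an alternate property path directly in the query. A minimal sketch of the equivalent query, using the same illustrative example: namespace as above, would be:

SELECT ?s ?o
WHERE {
  ?s example:pred1|example:pred2 ?o .
}

With the rule in place, example:predicate is materialised in advance, so the query no longer needs the alternate path at query time.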

With all that… let’s see how RDFox performs.

Results

Each of the below charts compares RDFox to the averages of the others. I exclude outliers where applicable.

Loading

Right off the bat, RDFox was the fastest at loading, which was a huge surprise as AnzoGraph (184,667 ms) was originally significantly faster than the others.

Even if I exclude Blazegraph and GraphDB (which were significantly slower at loading than the others), you can see that RDFox was very fast:

Others = AnzoGraph, Stardog and Virtuoso

Note: I have added who the others are as footnotes to each chart. This is so that the images are not mistaken for fair comparison results (if they showed up in a Google image search for example).

RDFox and AnzoGraph are much newer triplestores which may be why they are so much faster at loading than the others. I am very excited to see how these speeds are impacted as we scale the number of triples we load in the benchmark.

Queries

Overall I am very impressed with RDFox’s performance with these queries.

It is important to note however that the others' results have been public for a while. I ran these queries on both the newest version of RDFox and the previous version and did not notice any significant optimisation on these particular queries. The published results are of course from the latest version.

Query 1:

This query is very simple and just counts the number of relationships in the graph.

SELECT (COUNT(*) AS ?triples)
WHERE {
?s ?p ?o .
}

RDFox was the second slowest to do this but as mentioned in the previous article, optimisations on this query often reduce correctness.

Faster = AnzoGraph, Blazegraph, Stardog and Virtuoso. Slower = GraphDB

Query 2:

This query returns a list of 1000 settlement names which have airports with identification numbers.

PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?v WHERE {
{ ?v2 a dbo:Settlement ;
rdfs:label ?v .
?v6 a dbo:Airport . }
{ ?v6 dbo:city ?v2 . }
UNION
{ ?v6 dbo:location ?v2 . }
{ ?v6 dbp:iata ?v5 . }
UNION
{ ?v6 dbo:iataLocationIdentifier ?v5 . }
OPTIONAL { ?v6 foaf:homepage ?v7 . }
OPTIONAL { ?v6 dbp:nativename ?v8 . }
} LIMIT 1000

RDFox was the fastest to complete this query by a fairly significant margin, likely because it is an in-memory solution. GraphDB was the second fastest at 29.6 ms, followed by Virtuoso at 88.2 ms.

Others = Blazegraph, GraphDB, Stardog and Virtuoso

I would like to reiterate that there are several problems with these queries that will be solved in the benchmark. For example, this query has a LIMIT but no ORDER BY which is highly unrealistic.
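
To make the point concrete, a more realistic version would order the solutions before applying the limit so that the returned subset is deterministic. A simplified, illustrative variant (not one of the benchmark queries) might look like:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?v WHERE {
  ?v2 a dbo:Settlement ;
      rdfs:label ?v .
}
ORDER BY ?v
LIMIT 1000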

Query 3:

This query nests query 2 to grab information about the 1,000 settlements returned above.

You will notice that this query is slightly different to query 3 in the original article.

PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?v ?v2 ?v5 ?v6 ?v7 ?v8 WHERE {
?v2 a dbo:Settlement;
rdfs:label ?v.
?v6 a dbo:Airport.
{ ?v6 dbo:city ?v2. }
UNION
{ ?v6 dbo:location ?v2. }
{ ?v6 dbp:iata ?v5. }
UNION
{ ?v6 dbo:iataLocationIdentifier ?v5. }
OPTIONAL { ?v6 foaf:homepage ?v7. }
OPTIONAL { ?v6 dbp:nativename ?v8. }
{
FILTER(EXISTS{ SELECT ?v WHERE {
?v2 a dbo:Settlement;
rdfs:label ?v.
?v6 a dbo:Airport.
{ ?v6 dbo:city ?v2. }
UNION
{ ?v6 dbo:location ?v2. }
{ ?v6 dbp:iata ?v5. }
UNION
{ ?v6 dbo:iataLocationIdentifier ?v5. }
OPTIONAL { ?v6 foaf:homepage ?v7. }
OPTIONAL { ?v6 dbp:nativename ?v8. }
}
LIMIT 1000
})
}
}

RDFox was again the fastest to complete query 3 but it is important to reiterate that this query was modified slightly so that it could run on RDFox. The only other query that has the same issue is query 6.

Others = Blazegraph, GraphDB, Stardog and Virtuoso

The results of queries 2 and 3 are of course very similar, as query 2 is nested within query 3.

Query 4:

The two queries above were similar but query 4 is a lot more mathematical.

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT (ROUND(?x/?y) AS ?result) WHERE {
{SELECT (CEIL(?a + ?b) AS ?x) WHERE {
{SELECT (AVG(?abslat) AS ?a) WHERE {
?s1 geo:lat ?lat .
BIND(ABS(?lat) AS ?abslat)
}}
{SELECT (SUM(?rv) AS ?b) WHERE {
?s2 dbo:volume ?volume .
BIND((RAND() * ?volume) AS ?rv)
}}
}}

{SELECT ((FLOOR(?c + ?d)) AS ?y) WHERE {
{SELECT ?c WHERE {
BIND(MINUTES(NOW()) AS ?c)
}}
{SELECT (AVG(?width) AS ?d) WHERE {
?s3 dbo:width ?width .
FILTER(?width > 50)
}}
}}
}

AnzoGraph was the quickest to complete query 4 with RDFox in second place.

Faster = AnzoGraph. Slower = Blazegraph, Stardog and Virtuoso

Virtuoso was the third fastest to complete this query in a time of 519.5ms.

As with all of these queries, there are no fixed random seeds (query 4 uses RAND()), so I have made sure to include mathematical queries in the benchmark.

Query 5:

This query focuses on strings rather than mathematical functions. It essentially grabs all labels containing the string ‘venus’, all comments containing ‘sleep’ and all abstracts containing ‘gluten’. It then constructs an entity and attaches all of these to it.

I use a CONSTRUCT query here. I wrote a second SPARQL tutorial, which covers constructs, called Constructing More Advanced SPARQL Queries for those that need it.

PREFIX ex: <http://wallscope.co.uk/resource/example/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT {
ex:notglutenfree rdfs:label ?label ;
rdfs:comment ?sab ;
dbo:abstract ?lab .
} WHERE {
{?s1 rdfs:label ?label .
FILTER (REGEX(lcase(?label), 'venus'))
} UNION
{?s2 rdfs:comment ?sab .
FILTER (REGEX(lcase(?sab), 'sleep'))
} UNION
{?s3 dbo:abstract ?lab .
FILTER (REGEX(lcase(?lab), 'gluten'))
}
}

As discussed in the previous post, it is uncommon to use REGEX queries if you can run a full text index query on the triplestore. AnzoGraph and RDFox are the only two that do not have built-in full text indexes, hence these results:

Faster = AnzoGraph. Slower = Blazegraph, GraphDB, Stardog and Virtuoso

AnzoGraph is a little faster than RDFox to complete this query but the two of them are significantly faster than the rest. This is of course because you would use the full text index capabilities of the other triplestores.

If we instead run full text index queries, they are significantly faster than RDFox.

Note: To ensure clarity, in this chart RDFox was running the REGEX query as it does not have full text index functionality.

Others = Blazegraph, GraphDB, Stardog and Virtuoso

Whenever I can run a full text index query I will, because of the serious performance boost. This chart is therefore definitely fairer on the other triplestores.
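
For illustration, on Virtuoso the label pattern from query 5 can use the built-in full text index via bif:contains instead of REGEX. This is a sketch assuming Virtuoso's syntax; Stardog, GraphDB and Blazegraph each expose full text search through their own mechanisms:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s1 ?label
WHERE {
  ?s1 rdfs:label ?label .
  ?label bif:contains "venus" .
}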

Query 6:

This query finds all soccer players born in a country with more than 10 million inhabitants, who played as goalkeeper for a club that has a stadium with more than 30,000 seats, where the club's country is different from their country of birth.

Note: This is the second, and final, query that is modified slightly for RDFox. The original query contained both alternate and recurring property paths which were handled by their rule system.

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX : <http://ost.com/>
SELECT DISTINCT ?soccerplayer ?countryOfBirth ?team ?countryOfTeam ?stadiumcapacity
{
?soccerplayer a dbo:SoccerPlayer ;
:position <http://dbpedia.org/resource/Goalkeeper_(association_football)> ;
:countryOfBirth ?countryOfBirth ;
dbo:team ?team .
?team dbo:capacity ?stadiumcapacity ; dbo:ground ?countryOfTeam .
?countryOfBirth a dbo:Country ; dbo:populationTotal ?population .
?countryOfTeam a dbo:Country .
FILTER (?countryOfTeam != ?countryOfBirth)
FILTER (?stadiumcapacity > 30000)
FILTER (?population > 10000000)
} order by ?soccerplayer

If you are interested in alternate property paths, I cover them in my article called Constructing More Advanced SPARQL Queries.
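
For context, the custom :position and :countryOfBirth predicates stand in for patterns that would otherwise combine an alternate property path with a recurring (zero-or-more) one. A purely illustrative sketch, not the author's original query, could look like:

# Illustrative only: not the exact original query
?soccerplayer dbo:position|dbp:position <http://dbpedia.org/resource/Goalkeeper_(association_football)> ;
              dbo:birthPlace/dbo:country* ?countryOfBirth .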

RDFox was fastest again to complete query 6. This speed is probably down to the rule system changes as some of the query is essentially done beforehand.

Others = Blazegraph, GraphDB, Stardog and Virtuoso

For the above reason, RDFox’s rule system will have to be investigated thoroughly before the benchmark.

Virtuoso was the second fastest to complete this query, in a time of 54.9 ms, which is still very fast compared to the average.

Query 7:

Finally, this query finds all people born in Berlin before 1900.

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX : <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?name ?birth ?death ?person
WHERE {
?person dbo:birthPlace :Berlin .
?person dbo:birthDate ?birth .
?person foaf:name ?name .
?person dbo:deathDate ?death .
FILTER (?birth < "1900-01-01"^^xsd:date)
}
ORDER BY ?name

Finishing with a very simple query and no custom rules (unlike queries 3 and 6), RDFox was once again the fastest to complete this query.

Others = AnzoGraph, Blazegraph, GraphDB, Stardog and Virtuoso

In this case, the average includes every other triplestore as there were no real outliers. Virtuoso was again the second fastest and completed query 7 in 20.2 ms, which is relatively fast compared to the average. RDFox's speed advantage is again likely due to the fact that it is an in-memory solution.

Conclusion

To reiterate, this is not a sound comparison, so I cannot conclude that RDFox is better or worse than triplestore X or Y. What I can conclude is this:

RDFox can definitely compete with the other major triplestores, and initial results suggest that it has the potential to be one of the top performers in our benchmark.

I can also say that RDFox is very easy to use and well documented, and the rule system makes it very easy for users to add custom reasoning.

If you want to try it for yourself, you can request a license here.

Again, to summarise the notes throughout: RDFox did not sponsor this article. They did know the others' results, since I published them in my last article, but I also tested the previous version of RDFox and didn't notice any significant optimisation. The rule system makes queries 3 and 6 difficult to compare, but that will be investigated before running the benchmark.



How Dementia Affects Conversation: Building a More Accessible Conversational AI

Diving into the Literature – What do we know?

source

We all (roughly) know how to naturally converse with one another. This is mostly subconscious and only really noticeable if an interaction skews from what most consider “normal.” In the majority of cases, these are just minor differences, such as someone speaking a little too close or interrupting more often than usual.

However, more significant conversational differences can start to occur when parts of the brain begin to decline in performance.

Contents

Introduction
Overview of Dementia
Papers Covered
Motivation – Why Speech?
Datasets
Models
Important Language Features
Conclusion

Introduction

I am currently working towards creating a more natural conversational agent (such as Siri, Alexa, etc.) for those with cognitive impairments who can potentially benefit the most from these systems. Currently we have to adapt how we speak to these systems and have to know exactly how to ask for certain functions. For those who struggle to adapt, I hope to lower some of these barriers so that they can live more independently for longer. If you want to read more about the overall project, I discussed it in more detail in an interview here.

To kick off this project with Wallscope and The Data Lab, I first investigated some of the research centered on recreating natural conversation with conversational agents. This research all related to a healthy population, but a question arose: do some of these phenomena vary when conversing with those that have forms of cognitive impairment?

In my previous article, I covered two papers that discuss end-of-turn prediction. They created brilliant models to predict when someone has finished their turn to replace models that just wait for a duration of silence.

If someone with Dementia takes a little longer to remember the word(s) they’re looking for, the silence threshold models used in current systems will interrupt them. I suspect the research models would also perform worse than with a healthy population, so I’m collecting a corpus to investigate this.

As my ultimate aim is to make conversational agents more naturally usable for those with dementia, I’ll dive into some of the related research in this article.

Overview of Dementia

I am by no means a dementia expert so this information was all collected from an amazing series of videos by the Alzheimer’s Society.

Their Website

Dementia is not a disease but the name for a group of symptoms that commonly include problems with:

  • Memory
  • Thinking
  • Problem Solving
  • Language
  • Visual Perception

For people with dementia, these symptoms have progressed enough to affect daily life and are not a natural part of aging, as they’re caused by different diseases (I highlight some of them below).

All of these diseases cause the loss of nerve cells, and this gets gradually worse over time, as these nerve cells cannot be replaced.

As more and more cells die, the brain shrinks (atrophies) and symptoms sharpen. Which symptoms set in first depends on which part of the brain atrophies—so people are impacted differently.

source — you can see the black areas expanding as nerve cells die and atrophy progresses.

For example, if the occipital lobe begins to decline, then visual symptoms would progress, whereas losing the temporal lobe would cause language problems…

Other common symptoms impact:

  • Day-to-day memory
  • Concentration
  • Organization
  • Planning
  • Language
  • Visual Perception
  • Mood

There is currently no cure…

Before moving on to cover recent research surrounding language problems, it’s important to note that most research is disease-specific. Therefore, I’ll briefly cover four common types of dementia.

All of this information again comes from the series of videos created by the Alzheimer’s Society.

Alzheimer’s Disease

The most common type of dementia is Alzheimer’s Disease (AD), and for this reason, it’s also the most understood (you’ll notice this in the research).

A healthy brain contains proteins (two of which are called amyloid and tau), but if the brain starts to function abnormally, these proteins form abnormal deposits called plaques and tangles.

source

These plaques and tangles damage nerve cells, which causes them to die and the brain to shrink, as shown above.

The hippocampus is usually the first area of the brain to decline in performance when someone has AD. This is unfortunately where memories are formed, so people will often forget what they have just done and may therefore repeat themselves in conversation.

Recent memories are lost first, whereas childhood memories can still be retrieved as they depend less on the hippocampus. Additionally, emotions can usually be recalled as the amygdala is still intact, whereas the facts surrounding those emotions can be lost.

AD gradually progresses, so symptoms worsen and become more numerous slowly over time.

Vascular Dementia

The second most common type of dementia is vascular dementia, which is caused by problems with the brain’s blood supply.

Nerve cells need oxygen and nutrients to survive, so without them they become damaged and die. Therefore, when blood supply is interrupted by a blockage or leak, significant damage can be caused.

Like with AD, symptoms depend on which parts of the brain are impacted. When the parts damaged are responsible for memory, thinking, or language, the person will have problems remembering, thinking or speaking.

source

Vascular dementia can be caused by strokes. Sometimes one major stroke can cause it, but in other cases a person may suffer from multiple smaller strokes that gradually cause damage.

The most common cause of vascular dementia is small-vessel disease, which gradually narrows the vessels in the brain. As the narrowing continues and spreads, more of the brain gets damaged.

Vascular dementia can therefore have a gradual progression like AD or, if caused by strokes, a step-like progression with symptoms worsening after each stroke.

Dementia with Lewy Bodies

Closely related to AD, but less common, is a type of dementia called dementia with Lewy bodies.

Lewy bodies are tiny clumps of protein that develop inside nerve cells in the brain. This prevents communication between cells, which causes them to die.

source

Researchers have not yet identified why Lewy bodies form or how. We do know, however, that they can form in any part of the brain, which, again, leads to varying symptoms.

People can have problems with concentration, movement, alertness, and can even have visual hallucinations. These hallucinations are often distressing and lead to sleep problems.

Dementia with Lewy bodies progresses gradually and spreads as more nerve cells get damaged, so memory is always impacted eventually.

Frontotemporal dementia

The last type of dementia I’ll cover is frontotemporal dementia (FTD), which is a range of conditions in which cells in the frontal and temporal lobes of the brain are damaged.

source

FTD is again a less common type of dementia but is surprisingly more likely to affect younger people (below 65).

The frontal and temporal lobes of the brain control behavior, emotion, and language, and symptoms appear in a different order depending on which lobe is impacted first.

The frontal lobe is usually the first to decline in performance, so changes begin to show through a person’s personality, behavior, and inhibitions.

Alternatively, when the temporal lobe is impacted first, a person will struggle with language. For example, they may struggle to find the right word.

FTD is thought to occur when proteins such as tau build up in nerve cells, but unlike the other causes, this is likely hereditary.

Eventually as FTD progresses, symptoms of frontal and temporal damage overlap, and both occur.

Papers Covered

That overview of dementia was fairly in depth, so we should now have a common foundation for this article and all subsequent articles.

As we now know, difficulty with language is a common symptom of dementia, so in order to understand how it changes, I’ll cover four papers that investigate this. These include the following:

[1]

A Speech Recognition Tool for Early Detection of Alzheimer’s Disease by Brianna Marlene Broderick, Si Long Tou and Emily Mower Provost

[2]

A Method for Analysis of Patient Speech in Dialogue for Dementia Detection by Saturnino Luz, Sofia de la Fuente and Pierre Albert

[3]

Speech Processing for Early Alzheimer Disease Diagnosis: Machine Learning Based Approach by Randa Ban Ammar and Yassine Ben Ayed

[4]

Detecting Cognitive Impairments by Agreeing on Interpretations of Linguistic Features by Zining Zhu, Jekaterina Novikova and Frank Rudzicz

Note: I will refer to each paper with their corresponding number from now on.

Motivation – Why Speech?

These four papers have a common motivation: to detect dementia in a more cost-effective and less intrusive manner.

These papers tend to focus on Alzheimer’s Disease (AD) because, as [3] mentions, 60–80% of dementia cases are caused by AD. I would add that this is likely why AD features most in existing datasets, also.

Current Detection Methods

[1] points out that dementia is relatively difficult to diagnose as progression and symptoms vary widely. The diagnostic processes are therefore complex, and dementia often goes undiagnosed because of this.

[2] explains that imaging (such as PET or MRI scans) and cerebrospinal fluid analysis can be used to detect AD very accurately, but these methods are expensive and extremely invasive. A lumbar puncture must be performed to collect cerebrospinal fluid, for example.

source – lumbar puncture aka “spinal tap”

[2] also points out that neuropsychological detection methods have been developed that can, to varying levels of accuracy, detect signs of AD. [1] adds that these often require repeat testing and are therefore time-consuming and cause additional stress and confusion to the patient.

As mentioned above, [1] argues that dementia often goes undiagnosed because of these flaws. [2] agrees that it would be beneficial to detect AD pathology long before someone is actually diagnosed in order to implement secondary prevention.

Will Speech Analysis Help?

As repeatedly mentioned in the overview of dementia above, language is known to be impacted through various signs such as struggles with word-finding, understanding difficulties, and repetition. [3] points out that language relies heavily on memory, and for this reason, one of the earliest signs of AD may be in a person’s speech.

source

[2] reinforces this point by highlighting the fact that in order to communicate successfully, a person must be able to perform complex decision making, strategy planning, consequence foresight, and problem solving. These are all impaired as dementia progresses.

Practically, [2] states that speech is easy to acquire and elicit, so they (along with [1], [3], and [4]) propose that speech could be used to diagnose Dementia in a cost-effective, non-invasive, and timely manner.

To start investigating this, we need the data.

Datasets

As you can imagine, it isn’t easy to acquire suitable datasets to investigate this. For this reason [1], [3], and [4] used the same dataset from DementiaBank (a repository within TalkBank) called the Pitt Corpus. This corpus contains audio and transcriptions of people with AD and healthy elderly controls.

To elicit speech, participants (both groups) were asked to describe the Cookie Theft stimulus photo:

source

Some participants had multiple visits, so [1], [3], and [4] had audio and transcriptions for 223 control interviews and 234 AD interviews (these numbers differ slightly between them due to pre-processing, I expect).

[1] points out that the picture description task ensures the vocabulary and speech elicited is controlled around a context, but [2] wanted to investigate a different type of speech.

Instead of narrative or picture description speech, [2] used spontaneous conversational data from the Carolina Conversations Collection (CCC) to create their models.

The corpus contains 21 interviews with patients with AD and 17 dialogues with control patients. These control patients suffered from other conditions such as diabetes, heart problems, etc… None of them had any neuropsychological conditions, however.

The automatic detection of AD developed by [2] was the first use of low-level dialogue interaction data as a basis for AD detection on spontaneous spoken language.

Models

If I’m to build a more natural conversational system, then I must be aware of the noticeable differences in speech between those with dementia and healthy controls. What features inform the models in these papers the most should indicate exactly that.

[1] extracted features that are known to be impacted by AD (I run through the exact features in the next section, as that’s my primary interest). They collected many transcription-based and acoustic features before using principal component analysis (PCA) to reduce the total number of features to train with. Using the selected features, they trained KNN and SVM classifiers, achieving an F1 of 0.73 and, importantly, a recall of 0.83, as false negatives could be dangerous.

[2] decided to rely only on content-free features, including speech rate, turn-taking patterns, and other parameters. They found that when they used speech rate and turn-taking patterns to train a Real AdaBoost algorithm, they achieved an accuracy of 86.5%, and adding more features reduced the number of false positives. They found that other models performed comparably well, but even though Real AdaBoost and decision trees achieved an accuracy of 86.5%, they say there’s still room for improvement.

One point to highlight about [2] is their high accuracy (comparable to the state-of-the-art) despite relying only on content-free features. Their model can therefore be used globally, as the features are not language-dependent like the more complex lexical, syntactic, and semantic features used by other models.

source

[3] ran feature extraction, feature selection, and then classification. There were many syntactic, semantic, and pragmatic features transcribed in the corpus. They tried three feature selection methods, namely Information Gain, KNN, and SVM Recursive Feature Elimination. This feature selection step is particularly interesting for my project. Using the features selected by the KNN method, their most accurate model was an SVM that achieved a precision of 79%.

[4] introduces a completely different (and more interesting) approach from the other papers, as they build a Consensus Network (CN).

As [4] uses the same corpus as [1] and [3], there’s a point at which the only two ways to improve upon previous classifiers are to either add more data or calculate more features. Of course, both of those options have limits, so this is why [4] takes a novel approach.

They first split the extracted features into non-overlapping subsets and found that the three naturally occurring groups (acoustic, syntactic, and semantic) garnered the best results.

The 185 acoustic features, 117 syntactic features, and 31 semantic features (plus 80 POS features that were mainly semantic) were used to train three separate neural networks called “ePhysicians”:

[4]

Each ePhysician is a fully connected network with ten hidden layers, Leaky ReLU activations, and batch normalization. The classifier and discriminator shared the same set-up but without any hidden layers.

The output of each ePhysician was passed one-by-one into the discriminator (with noise), which then tried to tell the ePhysicians apart. This encourages the ePhysicians to output representations that are indistinguishable from one another (i.e. to agree).

[4] indeed found that their CN, with the three naturally occurring and non-overlapping subsets of features, outperformed other models with a macro F1 of 0.7998. Additionally, [4] showed that the inclusion of noise and cooperative optimization did contribute to the performance.

In case of confusion, it’s important to reiterate that [2] used a different corpus.

Each paper, especially [4], describes their model in more detail, of course. I’m not primarily interested in the models themselves, as I don’t intend to diagnose dementia. My main focus in this article is to find out which features were used to train these models, as I’ll have to pay attention to the same features.

Important Language Features

In order for a conversational system to perform more naturally for those with cognitive impairments, how language changes must be investigated.

[4] sent all features to their ePhysicians, so they didn’t detail which features were most predictive. They did mention that pronoun-noun ratios were known to change, as those with cognitive impairments use more pronouns than nouns.

[2] interestingly achieved great results using just a person’s speech rate and turn-taking patterns. They did obtain fewer false positives by adding other features but stuck to content-free features, as mentioned. This means that their model does not depend on a specific language and can therefore be used on a global scale.

[1] extracted features that are known to be impacted by AD and additionally noted that patients’ vocabulary and semantic processing had declined.

[1] listed the following transcription-based features:

  • Lexical Richness
  • Utterance Length
  • Frequency of Filler Words
  • Frequency of Pronouns
  • Frequency of Verbs
  • Frequency of Adjectives
  • Frequency of Proper Nouns

and [1] listed the following acoustic features:

  • Word Finding Errors
  • Fluidity
  • Rhythm of Speech
  • Pause Frequency
  • Duration
  • Speech Rate
  • Articulation Rate

Brilliantly, [3] performed several feature selection methods upon the following features:

[3]

Upon all of these features, they implemented three feature selection methods to select the top eight features each: Information Gain, KNN, and SVM Recursive Feature Elimination (SVM-RFE).

They output the following:

[3]

Three features were selected by all three methods, suggesting that they’re highly predictive for detecting AD: Word Errors, Number of Prepositions, and Number of Repetitions.

It’s also important to restate that the most accurate model used the features selected by the KNN method.

Overall, we have many features identified in this section to pay attention to. In particular, however, (from both the four papers and the Alzheimer’s Society videos) we need to pay particular attention to:

  • Word Errors
  • Repetition
  • Pronoun-Noun Ratio
  • Number of Prepositions
  • Speech Rate
  • Pause Frequency

Conclusion

We’ve previously looked into the current research towards making conversational systems more natural, and we now have a relatively short list of features that must be handled if conversational systems are to perform fluidly, even if the user has a cognitive impairment like AD.

Of course, this isn’t an exhaustive list, but it’s a good place to start and points me in the right direction for what to work on next. Stay tuned!


The system developed by [2] was the first to use low-level dialogue interaction data as a basis for detecting AD in spontaneous spoken language.

Models

If I’m to build a more natural conversational system, then I must be aware of the noticeable differences in speech between those with dementia and healthy controls. The features that inform the models in these papers the most should indicate exactly that.

[1] extracted features that are known to be impacted by AD (I run through the exact features in the next section, as that’s my primary interest). They collected many transcription-based features and acoustic features before using principal component analysis (PCA) to reduce the total number of features to train with. Using the selected features, they trained KNN and SVM classifiers to achieve an F1 of 0.73 and, importantly, a recall of 0.83, as false negatives could be dangerous.

[2] decided to rely only on content-free features, including speech rate, turn-taking patterns, and other parameters. They found that when they used speech rate and turn-taking patterns to train a Real AdaBoost algorithm, they achieved an accuracy of 86.5%, and adding more features reduced the number of false positives. They found that other models performed comparably well, but even though Real AdaBoost and decision trees achieved an accuracy of 86.5%, they say there’s still room for improvement.

One point to highlight about [2] is their high accuracy (comparable to the state-of-the-art) despite relying only on content-free features. Their model can therefore be used globally, as the features are not language-dependent like the more complex lexical, syntactic, and semantic features used by other models.

source

[3] ran feature extraction, feature selection, and then classification. There were many syntactic, semantic, and pragmatic features transcribed in the corpus. They tried three feature selection methods, namely: Information Gain, KNN, and SVM Recursive Feature Elimination. This feature selection step is particularly interesting for my project. Using the features selected by the KNN, their most accurate model was an SVM that achieved precision of 79%.

[4] introduces a completely different (and more interesting) approach than the other papers, as they build a Consensus Network (CN).

As [4] uses the same corpus as [1] and [3], there’s a point at which the only two ways to improve upon previous classifiers are to either add more data or calculate more features. Of course, both of those options have limits, so this is why [4] takes a novel approach.

They first split the extracted features into non-overlapping subsets and found that the three naturally occurring groups (acoustic, syntactic, and semantic) garnered the best results.

The 185 acoustic features, 117 syntactic features, and 31 semantic features (plus 80 POS features that were mainly semantic) were used to train three separate neural networks called “ePhysicians”:

[4]

Each ePhysician is a fully connected network with ten hidden layers, Leaky ReLU activations, and batch normalization. The classifier and the discriminator used the same setup but without any hidden layers.

The output of each ePhysician was passed one-by-one into the discriminator (with noise), which then tried to tell the ePhysicians apart. This encourages the ePhysicians to output representations that are indistinguishable from one another (i.e. to agree).

[4] indeed found that their CN, with the three naturally occurring and non-overlapping subsets of features, outperformed other models with a macro F1 of 0.7998. Additionally, [4] showed that the inclusion of noise and cooperative optimization did contribute to the performance.

In case of confusion, it’s important to reiterate that [2] used a different corpus.

Each paper, especially [4], describes their model in more detail, of course. I’m not primarily interested in the models themselves, as I don’t intend to diagnose dementia. My main focus in this article is to find out which features were used to train these models, as I’ll have to pay attention to the same features.

Important Language Features

In order for a conversational system to perform more naturally for those with cognitive impairments, how language changes must be investigated.

[4] sent all features to their ePhysicians, so they didn’t detail which features were most predictive. They did mention that pronoun-noun ratios were known to change, as those with cognitive impairments use more pronouns than nouns.

[2] interestingly achieved great results using just a person’s speech rate and turn-taking patterns. They did obtain fewer false positives by adding other features but stuck to content-free features, as mentioned. This means that their model does not depend on a specific language and can therefore be used on a global scale.

[1] extracted features that are known to be impacted by AD and additionally noted that patients’ vocabulary and semantic processing had declined.

[1] listed the following transcription-based features:

  • Lexical Richness
  • Utterance Length
  • Frequency of Filler Words
  • Frequency of Pronouns
  • Frequency of Verbs
  • Frequency of Adjectives
  • Frequency of Proper Nouns

and [1] listed the following acoustic features:

  • Word Finding Errors
  • Fluidity
  • Rhythm of Speech
  • Pause Frequency
  • Duration
  • Speech Rate
  • Articulation Rate

Brilliantly, [3] applied several feature selection methods to the following features:

[3]

On all of these features, they ran three feature selection methods, each selecting the top eight features: Information Gain, KNN, and SVM Recursive Feature Elimination (SVM-RFE).

They output the following:

[3]

Three features were selected by all three methods, suggesting that they’re highly predictive for detecting AD: Word Errors, Number of Prepositions, and Number of Repetitions.

It’s also important to restate that the most accurate model used the features selected by the KNN method.

Overall, this section has identified many features to pay attention to. Drawing on both the four papers and the Alzheimer’s Society videos, however, the most important are:

  • Word Errors
  • Repetition
  • Pronoun-Noun Ratio
  • Number of Prepositions
  • Speech Rate
  • Pause Frequency

Conclusion

We’ve previously looked into the current research towards making conversational systems more natural, and we now have a relatively short list of features that must be handled if conversational systems are to perform fluidly, even if the user has a cognitive impairment like AD.

Of course, this isn’t an exhaustive list, but it’s a good place to start and points me in the right direction for what to work on next. Stay tuned!



How Dementia Affects Conversation: Building a More Accessible Conversational AI was originally published in Heartbeat on Medium, where people are continuing the conversation by highlighting and responding to this story.

Constructing More Advanced SPARQL Queries


CONSTRUCT queries, VALUES and more property paths.

It was (quite rightly) pointed out that I strangely did not cover CONSTRUCT queries in my previous tutorial on Constructing SPARQL Queries. Additionally, I then went on to use CONSTRUCT queries in both my Transforming Tabular Data into Linked Data tutorial and the Linked Data Reconciliation article.

So, to finally correct this - I will cover them here!

Contents

SELECT vs CONSTRUCT
First Basic Example
- VALUES
- Alternative Property Paths
Second Basic Example
Example From the Reconciliation Article
Example From the Benchmark (Sneak Preview)

SELECT vs CONSTRUCT

In my last tutorial, I basically ran through SELECT queries from the most basic to some more complex. So what’s the difference?

With SELECT queries we are trying to match patterns in the knowledge graph to return results. With CONSTRUCT queries we are specifying and building a new graph to return.

In the two tutorials linked (in the intro) I was constructing graphs from tabular data to then insert into a triplestore. I will discuss sections of these later but you should be able to follow the full queries after going through this tutorial.

We usually use CONSTRUCT queries at Wallscope to build a graph for the front-end team. Essentially, we create a portable sub-graph that contains all of the information needed to build a section of an application. Then instead of many select queries to the full database, these queries can be run over the much smaller sub-graph returned by the construct query.
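
As a purely hypothetical sketch of that pattern (the predicates and shape below are illustrative, not Wallscope’s actual data model), a single CONSTRUCT can bundle everything a profile page might need into one sub-graph:

PREFIX schema: <http://schema.org/>
CONSTRUCT {
?person schema:name ?name ;
schema:jobTitle ?role ;
schema:worksFor ?org .
?org schema:name ?orgName .
} WHERE {
?person schema:name ?name ;
schema:jobTitle ?role ;
schema:worksFor ?org .
?org schema:name ?orgName .
}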

First Basic Example

For this first example I will be querying my Superhero dataset that you can download here.

Each superhero entity in this dataset is connected to their height with the predicate dbo:height as shown here:
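
For illustration, the triples have roughly this shape (the superhero: prefix and the height values below are made up for this sketch; the dataset’s actual identifiers may differ):

superhero:Batman dbo:height "1.88"^^xsd:double .
superhero:Superman dbo:height "1.92"^^xsd:double .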

Using this basic SELECT query, we can return each hero alongside their height:

PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?hero ?height
WHERE {
?hero dbo:height ?height .
}

Now let’s modify this query slightly into a CONSTRUCT that is almost the same:

PREFIX dbo: <http://dbpedia.org/ontology/>
CONSTRUCT {
?hero dbo:height ?height
} WHERE {
?hero dbo:height ?height .
}

As you can see, this returns the same information but in the form: subject, predicate, object.

This is obviously trivial and not entirely useful, but we can play with this graph in the CONSTRUCT clause with only one condition:

All variables in the CONSTRUCT must be in the WHERE clause.

Basically, like in a SELECT query, the WHERE clause matches patterns in the knowledge graph and returns any variables. The difference with a CONSTRUCT is that these variables are then used to build the graph described in the CONSTRUCT clause.

Hopefully that is clear, but it makes more sense if we change the graph description.

For example, if we decided that we wanted to use schema instead of DBpedia’s ontology, we could switch to it in the first clause:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX schema: <http://schema.org/>
CONSTRUCT {
?hero schema:height ?height
} WHERE {
?hero dbo:height ?height .
}

This then returns the superheroes attached to their heights with the schema:height predicate as the variables are matched in the WHERE clause and then recombined in the CONSTRUCT clause.

This simple predicate switching is not entirely useful on its own (unless you really need to switch ontologies for some reason) but is a good first step to understanding this type of query.

To create some more useful CONSTRUCT queries, I’ll first go through VALUES and another type of property path.

VALUES

I’m sure there are many use-cases in which the VALUES clause is incredibly useful but I can’t say that I use it often. Essentially, it allows data to be provided within the query.

If you are searching for a particular sport in a dataset, for example, you could match all entities that are sports and then filter the results for it. This gets more complex, however, if you are looking for a few particular sports, in which case you may want to provide those sports within the query itself.

With VALUES you can constrain your query by creating a variable (or several variables at once, as sketched below) and assigning it some data.
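
As a quick sketch of that multi-variable form (the pairings below are just examples), you list the variables together and provide each row of data in parentheses, dropping the block into the WHERE clause with the dbr: prefix declared as usual:

VALUES (?country ?capital) {
(dbr:Scotland dbr:Edinburgh)
(dbr:Wales dbr:Cardiff)
}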

I tend to use this with federated queries to grab data (usually for insertion into my database) about a few particular entities.

Let’s go through a practical example of this:

PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?country
WHERE {
VALUES ?country {
dbr:Scotland
dbr:England
dbr:Wales
dbr:Northern_Ireland
dbr:Ireland
}
}

In this example I am interested in the five largest countries in the British Isles so that I can compare populations. For reference (I’m from Scotland and had to check I was correct, so I imagine others may find this useful also):

source

I am using DBpedia for this example, so I have assigned the five country entities to the variable ?country and selected them to be returned.

You’d think it would therefore be easy enough to grab the corresponding populations. I add the SERVICE clause to make this a federated query (covered previously). This just sends the countries defined within the query to DBpedia and returns their corresponding populations.

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbp: <http://dbpedia.org/property/>
SELECT ?country ?pop
WHERE {
VALUES ?country {
dbr:Scotland
dbr:England
dbr:Wales
dbr:Northern_Ireland
dbr:Ireland
}

SERVICE <http://dbpedia.org/sparql> {
?country dbp:populationCensus ?pop .
}
}

Here are the results:

You will notice however that Ireland is missing from the results! You will often find this kind of problem with linked open data: the structure is not always consistent throughout.

To find Ireland’s population we need to switch the predicate from dbp:populationCensus to dbo:populationTotal like so:

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?country ?pop
WHERE {
VALUES ?country {
dbr:Scotland
dbr:England
dbr:Wales
dbr:Northern_Ireland
dbr:Ireland
}

SERVICE <http://dbpedia.org/sparql> {
?country dbo:populationTotal ?pop .
}
}

which returns Ireland alongside its population… but none of the others:

This is of course a problem but before we can construct a solution, let’s run through alternate property paths.

Alternative Property Paths

In my last SPARQL tutorial we covered sequential property paths which (once the benchmark query templates come out) you may notice I am a big fan of.

Another type of property path that I use fairly often is called the Alternative Property Path and is made use of with the pipe (|) character.

If we look back at the problem above in the VALUES section, we can get some populations with one predicate and the rest with another. The alternate property path allows us to match patterns with either! For example, if we modify the population query above we get:

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?country ?pop
WHERE {
VALUES ?country {
dbr:Scotland
dbr:England
dbr:Wales
dbr:Northern_Ireland
dbr:Ireland
}

SERVICE <http://dbpedia.org/sparql> {
?country dbp:populationCensus | dbo:populationTotal ?pop .
}
}

This is such a simple change but so powerful as we now return every country alongside their population with one relatively basic query:

This SELECT is great if we are just looking to find some results but what if we want to store this data in our knowledge graph?

Second Basic Example

It would be a hassle to have to use this alternative property path every time we want to work with country populations. In addition, if users were not aware of this inconsistency, they could find and report incorrect results.

This is why we CONSTRUCT the result graph we want without the inconsistencies. In this case I have chosen dbo:populationTotal as I simply prefer it and use that to connect countries and their populations:

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>
CONSTRUCT {
?country dbo:populationTotal ?pop
} WHERE {
VALUES ?country {
dbr:Scotland
dbr:England
dbr:Wales
dbr:Northern_Ireland
dbr:Ireland
}

SERVICE <http://dbpedia.org/sparql> {
?country dbp:populationCensus | dbo:populationTotal ?pop .
}
}

This query returns the countries and their populations like we saw in the previous section but then connects each country to their population with dbo:populationTotal as described in the CONSTRUCT clause. This returns consistent triples:

This is useful if we wish to store this data, as its consistency will help avoid the problems mentioned above. I used this technique in one of my previous articles, so let’s take a look.

Example From the Reconciliation Article

This example is copied directly from my data reconciliation tutorial here. In that article I discuss this query in a lot more detail.

In brief, what I was doing here was grabbing car manufacturer names from tabular data and enhancing that information to store and analyse.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
CONSTRUCT {
?car rdfs:label ?taggedname ;
rdf:type dbo:Company ;
dbo:location ?location .

?location rdf:type dbo:Country ;
rdfs:label ?lname ;
dbp:populationCensus ?pop .
} WHERE {
?c <urn:col:carNames> ?cname .

BIND(STRLANG(?cname, "en") AS ?taggedname)

SERVICE <https://dbpedia.org/sparql> {

?car rdfs:label ?taggedname ;
dbo:location | dbo:locationCountry ?location .

?location rdf:type dbo:Country ;
rdfs:label ?lname ;
dbp:populationCensus | dbo:populationTotal ?pop .

FILTER (LANG(?lname) = "en")
}
}

There is little point repeating myself here, so if you’re interested, please take a look. What I am trying to show here is that I have used both the alternative property path (twice!) and the CONSTRUCT clause previously in an example use-case.

Construct queries are perfectly suited to ensuring any data you store is well typed, structured and, importantly, consistent.

I have been short on time since starting my new project but I am still working on the benchmark in development.

Example From The Benchmark (Sneak Preview)

The benchmark repository is not yet public as I don’t want opinions to be formed before it is fleshed out a little more.

I thought it would be good however to give a real (not made for a tutorial) example query that uses what this article teaches:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX schema: <http://schema.org/>
PREFIX dbp: <http://dbpedia.org/property/>
INSERT {
?city dbo:populationTotal ?pop
} WHERE {
{
SELECT ?city (MAX(?apop) AS ?pop) {
?user schema:location ?city .

SERVICE <https://dbpedia.org/sparql> {
?city dbo:populationTotal | dbp:populationCensus ?apop .
}
}
GROUP BY ?city
}
}

You will notice that this does not contain the CONSTRUCT clause but INSERT instead. You will see me do this switch in both the articles I linked in the introduction. Basically this does nothing too different: the graph that is constructed is inserted into your knowledge graph instead of just returned. The same can be done with the DELETE clause to remove patterns from your knowledge graph (a minimal sketch follows below).
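
For example, here is a minimal sketch, assuming your own graph still contains leftover dbp:populationCensus triples that you want to clear out after consolidating on dbo:populationTotal:

PREFIX dbp: <http://dbpedia.org/property/>
DELETE {
?place dbp:populationCensus ?pop
} WHERE {
?place dbp:populationCensus ?pop .
}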

This query is very similar to the examples throughout this article (by design of course) but grabs countries’ populations from DBpedia and inserts them into the graph. This is just one point within the query cycle at which the graph changes structure in the benchmark.

Finally, the MAX population is grabbed because some countries in DBpedia have two different populations attached to them…
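
If you want to check this yourself, a small diagnostic query along these lines (the two entities listed are just examples) counts how many population values DBpedia holds for each one:

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
SELECT ?place (COUNT(?apop) AS ?populationCount)
WHERE {
VALUES ?place {
dbr:Scotland
dbr:Ireland
}
SERVICE <https://dbpedia.org/sparql> {
?place dbo:populationTotal | dbp:populationCensus ?apop .
}
}
GROUP BY ?place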

Conclusion

Hopefully this is useful for some of you! We have covered why and how to use CONSTRUCT queries along with VALUES and alternative property paths.

At the end of May I am going to the DBpedia community meeting in Leipzig so my next linked data article will likely cover things I learned at that event or progress on the benchmark development.

In the meantime I will be releasing my next Computer Vision article and another dive into natural conversation.


Constructing More Advanced SPARQL Queries was originally published in Wallscope on Medium, where people are continuing the conversation by highlighting and responding to this story.

Beginning to Replicate Natural Conversation in Real Time

A first step into the literature

To start my new project, the first thing I of course have to do is run through the current research and state-of-the-art models.

I was recently interviewed about this new project (I explain it in more depth there), but in short (extremely short): I aim to step towards making conversational agents more natural to talk with.

I have by no means exhausted all literature in this field; I have barely scratched the surface (link relevant papers below if you know of any I must read). Here is an overview of some of this research and the journey towards more natural conversational agents. In this I will refer to the following papers:

[1]

Investigating Speech Features for Continuous Turn-Taking Prediction Using LSTMs by Matthew Roddy, Gabriel Skantze and Naomi Harte

[2]

Detection of social signals for recognizing engagement in human-robot interaction by Divesh Lala, Koji Inoue, Pierrick Milhorat and Tatsuya Kawahara

[3]

Investigating fluidity for human-robot interaction with real-time, real-world grounding strategies by Julian Hough and David Schlangen

[4]

Towards Deep End-of-Turn Prediction for Situated Spoken Dialogue Systems by Angelika Maier, Julian Hough and David Schlangen

[5]

Coordination in Spoken Human-Robot Interaction by Gabriel Skantze (Lecture Presentation in Glasgow 07/03/2019)

Contents

Introduction
Turn Taking – End of Turn Prediction
Engagement
Embodiment
Fluid Incremental Grounding Strategies
Conclusion

Introduction

If we think of two humans having a fluid conversation, it is very different from conversations between humans and Siri, Google Assistant, Alexa or Cortana.

source

One reason for this loss of flow is the number of large pauses. For a conversational agent (CA) to detect that you have finished what you are saying (finished your turn), it waits for a duration of silence. If it detects a long pause, it assumes you have finished your turn and then processes your utterance.

This set duration of silence varies slightly between systems. If it is set too low, the CA will interrupt you mid-turn as human dialogue is littered with pauses. If it is set too high, the system will be more accurate at detecting your end-of-turn but the CA will take painfully long to respond – killing the flow of the conversation and frustrating the user [4].

When two humans speak, we tend to minimise the gap between turns in the conversation, and this is cross-cultural. Across the globe, the gap between turns is around 200ms, which is close to the limit of human response time [1]. We must therefore predict the speaker’s end-of-turn (EOT) while listening to them speak.

Turn Taking – End of Turn Prediction

To recreate this fluid dialogue with fast turn-switches and slight overlap in CAs, we must first understand how we do it ourselves.

Shameless, but slightly related, self-promotion. In order to work on Computer Vision, we must first understand Human Vision

We subconsciously interpret turn-taking cues to detect when it is our turn to speak so what cues do we use? Similarly, we do this continuously while listening to someone speak so can we recreate this incremental processing?

[4] used both acoustic and linguistic features to train an LSTM to tag 10ms windows. Their system is tasked with labelling these windows as either speech, mid-turn pause (MTP) or EOT, but the main focus of course is the first point in a sequence that is labelled as EOT.

The acoustic features used in the LSTM were: raw pitch, smoothed F0, root mean squared signal energy, logarithmised signal energy, intensity, loudness and derived features for each frame.

In addition to these acoustic features, linguistic features consisted of the words and an approximation of the incremental syntactic probability called the weighted mean log trigram probability (WML).

In [5], many other signals were identified that indicate whether the speaker is going to continue speaking or has finished their turn:

[5]

As mentioned, a wait time of 10 seconds for a response is just as irritating as the CA constantly cutting into you mid-turn [4]. Multiple baselines were therefore considered, with silence thresholds ranging from 50ms to 6000ms, to ensure multiple trade-offs were included.

Apart from one case (linguistic features only, 500ms silence threshold), every single model beat the baselines. Using only the linguistic or acoustic features didn’t make much of a difference, but performance was always best when the model used both sets of features together. The best overall system had a latency of 1195ms and a cut-in rate of just 18%.

[4]

[1] states that we predict EOT from multi-modal signals including: prosody, semantics, syntax, gesture and eye-gaze.

Instead of labelling 10ms windows (as speech, MTPs or EOTs), traditional models predict whether a speaker will continue speaking (HOLD) or has finished their turn (SHIFT), but they only do this when they detect a pause. One major problem with this traditional approach is that backchannels are neither a HOLD nor a SHIFT, but one of the two is predicted anyway.

LSTMs have been used to make predictions continuously at 50ms intervals, and these models outperform traditional EOT models, and even humans, when applied to HOLD/SHIFT predictions. Their hidden layers allow them to learn long-range dependencies, but it is unknown exactly which features influence the performance the most.

In [1], the new system completes three different turn-taking prediction tasks: (1) prediction at pauses, (2) prediction at onset and (3) prediction at overlap.

Prediction at Pauses is the standard prediction that takes place at brief pauses in the interaction to predict whether there will be a HOLD or SHIFT. Essentially, when there is a pause above a threshold time, the person with the highest average output probability (score) is predicted to speak next. This classification model is evaluated with weighted F-scores.

Prediction at Onsets classifies the utterances during speech, not at a pause. This model is slightly different, however, as it predicts whether the currently ongoing utterance will be short or long. As it is also a classifier, this model was again evaluated using weighted F-scores.

Prediction at Overlap is introduced for the first time in this paper. This is essentially a HOLD/SHIFT prediction again, but made when an overlapping period of at least 100ms occurs. The decision to HOLD (continue speaking) is predicted when the overlap is a backchannel and SHIFT when the system should stop speaking. This again was evaluated using weighted F-scores.

Here is an example of predicted turn-taking in action:


As mentioned above, we don’t know exactly which features we use to predict when it is our turn to speak. [1] used many features in different arrangements to distinguish which are most useful. The features used were as follows:

Acoustic features are low level descriptors that include loudness, shimmer, pitch, jitter, spectral flux and MFCCs. These were extracted using the OpenSmile toolkit.

Linguistic features were investigated at two levels: part-of-speech (POS) and word. The literature often suggests that POS tags are good at predicting turn-switches, but POS tags have to be extracted from words (output by an ASR system), so it is useful to check whether this extra processing is actually needed.

Using words instead of POS would be a great advantage for systems that need to run in real time.

Phonetic features were output from a deep neural network (DNN) that classifies senones.

Voice activity was included in their transcriptions, so it was also used as a feature.

So what features were the most useful for EOT prediction according to [1]?

Acoustic features were great for EOT prediction: all but one experiment’s best results included acoustic features. This was particularly the case for prediction at overlap.

Words mostly outperformed POS tags, apart from prediction at onset, so use POS tags if you want to predict utterance length (like backchannels).

In all cases, including voice activity improved performance.

In terms of acoustic features, the most important features were loudness, F0, low order MFCCs and spectral slope features.

Overall, the best performance was obtained by using voice activity, acoustic features and words.

As mentioned, the fact that using words instead of POS tags leads to better performance is brilliant for faster processing. This of course is beneficial for real-time incremental prediction – just like what we humans do.

All of these features are not just used to detect when we can next speak but are even used to guide what we say. We will expand on what we are saying, skip details or change topic depending on how engaged the other person is with what we are saying.

Therefore to model natural human conversation, it is important for a CA to measure engagement.

Engagement

Engagement shows interest and attention to a conversation and, as we want users to stay engaged, it influences the dialogue strategy of the CA. This optimisation of the user experience all has to be done in real time to keep a fluid conversation.

[2] detects the following signals to measure engagement: nodding, laughter, verbal backchannels and eye gaze. The fact that these signals show attention and interest is relatively common sense, but they were learned from a large corpus of human-robot interactions.

[2]

[2] doesn’t just focus on recognising social signals but also on creating an engagement recognition model.

This experiment was run in Japan, where nodding is particularly common. Seven features were extracted to detect nodding: per frame, the yaw, roll and pitch of the person’s head; and per 15 frames, the average speed, average velocity, average acceleration and range of motion of the person’s head.

Their LSTM model outperformed the other approaches to detect nodding across all metrics.

Smiling is often used to detect engagement but to avoid using a camera (they use microphones + Kinect) laughter is detected instead. Each model was tasked to classify whether an inter-pausal unit (IPU) of sound contained laughter or not. Using both prosody and linguistic features to train a two layer DNN performed the best but using other spectral features instead of linguistic features (not necessarily available from the ASR) could be used to improve the model.

Similarly to nodding, verbal backchannels are more frequent in Japan (called aizuchi). Additionally in Japan, verbal backchannels are often accompanied by head movements but only the sound was provided to the model. Similar to the laughter detection, this model classifies whether an IPU is a backchannel or the person is starting their turn (especially difficult when barging in). The best performing model was found to be a random forest, with 56 estimators, using both prosody and linguistic features. The model still performed reasonably when given only prosodic features (again because linguistic features may not be available from the ASR).

Finally, eye gaze is commonly known as a clear sign of engagement. Based on the inter-annotator agreement, looking at Erica’s head (the robot embodiment in this experiment) for 10 seconds continuously was considered engagement; anything less than 10 seconds was therefore a negative case.

Erica: source

The information from the Kinect sensor was used to calculate a vector from the user’s head orientation, and the user was considered ‘looking at Erica’ if that vector collided with Erica’s head (plus 30cm to accommodate error). This geometry-based model worked relatively well, but the position of Erica’s head was estimated, so this will have affected results. It is expected that this model will improve significantly when exact values are known.

This paper doesn’t aim to create the best individual systems but instead hypothesises that these models in conjunction will perform better than the individual models at detecting engagement.

[2]

The ensemble of the above models was used as a binary classifier (either a person was engaged or not). In particular, they built a hierarchical Bayesian binary classifier which judged whether the listener was engaged from the 16 possible combinations of outputs from the 4 models above.

From the annotators, a model was built to deduce which features are more or less important when detecting engagement. Some annotators found laughter to be a particularly important factor, for example, whereas others did not. They found that inputting a character variable with three different character types improved the model’s performance.

Additionally, including the previous engagement of a listener also improved the model. This makes sense as someone that is not interested currently is more likely to stay uninterested during your next turn.

Measuring engagement can only really be done when a CA is embodied (eye contact with Siri is non-existent for example). Social robots are being increasingly used in areas such as Teaching, Public Spaces, Healthcare and Manufacturing. These can all contain spoken dialogue systems but why do they have to be embodied?

[5]

Embodiment

People will travel across the globe to have a face-to-face meeting when they could just phone [5]. We don’t like to interact without seeing the other person as we miss many of the signals that we talked about above. In today’s world we can also video-call but this is still avoided when possible for the same reasons. The difference between talking on the phone or face-to-face is similar to the difference between talking to Siri and an embodied dialogue system [5].

Current voice systems cannot show facial expressions, indicate attention through eye contact or move their lips. Lip reading is obviously very useful for those with impaired hearing but we all lip read during conversation (this is how we know what people are saying even in very noisy environments).

Not only can a face output these signals, it also allows the system to detect who is speaking, who is paying attention, who the actual people are (Rory, Jenny, etc…) and recognise their facial expressions.

Robot faces come in many forms however and some are better than others for use in conversation. Most robot faces, such as the face of Nao, are very static and therefore cannot show a wide range of emotion through expression like we do.

Nao: source

Some more abstract robot face depictions, such as Jibo, can show emotion using shapes and colour but some expressions must be learned.

Jibo: source

We know how to read a human face so it makes sense to show a human face. Hyper-realistic robot faces exist but are a bit creepy, like Sophia, and are very expensive.

Sophia: source

They are very realistic but just not quite right which makes conversation very uncomfortable. To combat this, avatars have been made to have conversations on screen.

source

These can mimic humans relatively closely without being creepy as it’s not a physical robot. This is almost like Skype however and this method suffers from the ‘Mona-Lisa effect’. In multi-party dialogue, it is impossible for the avatar on screen to look at one person and not the other. Either the avatar is looking ‘out’ at all parties or away at no one.

Gabriel Skantze (presenter of [5], to be clear) is the Co-Founder of Furhat Robotics and argues that Furhat is the best balance between all of these systems. Furhat has been developed to be used for conversational applications as a receptionist, social-trainer, therapist, interviewer, etc…

source

Furhat needs to know where it should be looking, when it should speak, what it should say and what facial expressions it should be displaying [5].

Finally (for now), once embodied, dialogues with a robot need to be grounded in real time with the real world. In [3] the example given is a CA embodied in an industrial machine, which [5] states is becoming more and more common.

source

Fluid, Incremental Grounding Strategies

For a conversation to be natural, human-robot conversations must be grounded in a fluid manner [3].

With non-incremental grounding, users can give positive feedback and repair but only after the robot has shown full understanding of the request. If you ask a robot to move an object somewhere for example, you must wait until the object is moved before you can correct it with an utterance like “no, move the red one”. No overlapping speech is possible so actions must be reversed entirely if a repair is needed.

With incremental grounding, overlapping is still not possible but feedback can be given at more regular intervals. Instead of the entire task being completed before feedback can be given, feedback can be given at sub-task intervals. “no, move the red one” can be said just after the robot picks up a blue object, repairing quickly. In the previous example, the blue object was then placed in a given location before the repair could be given which resulted in a reversal of the whole task! This is much more efficient but still not fluid like in human-human interactions.

Fluid incremental grounding is possible if overlaps are processed. Allowing and reasoning over concurrent speech and action is much more natural. Continuing with our repair example, “no, move the red one” can be said as soon as the robot is about to pick up the blue object, no task has to be completed and reversed as concurrency is allowed. The pickup task can be aborted and the red object picked up fluidly as you say what to do with it.

[2]

To move towards this more fluid grounding, real-time processing needs to take place. Not only does the system need to process utterances word by word, real-time context also needs to be monitored, such as the robot’s current state and planned actions (both of which can change dynamically through the course of an utterance or word).

The robot must know when it has sufficiently shown what it is doing to handle both repairs and confirmations. The robot needs to know what the user is confirming and, even more importantly, what needs to be repaired.

Conclusion

In this brief overview, I have covered just a tiny amount of the current work towards more natural conversational systems.

Even if turn-taking prediction, engagement measuring, embodiment and fluid grounding were all perfected, CAs would not have conversations like we humans do. I plan to write more of these overviews over the next few years so look out for them if interested.

In the meantime, please do comment with discussion points, critique my understanding and suggest papers that I (and anyone reading this) may find interesting.


Beginning to Replicate Natural Conversation in Real Time was originally published in Wallscope on Medium, where people are continuing the conversation by highlighting and responding to this story.

A first step into the literature

To start my new project, the first thing I of course have to do is run through the current research and state of the art models.

I was interviewed recently in which I explain this new project but in short (extremely short): I aim to step towards making conversational agents more natural to talk with.

I have by no means exhausted all literature in this field, I have barely scratched the surface (link relevant papers below if you know of any I must read). Here is an overview of some of this research and the journey towards more natural conversational agents. In this I will refer to the following papers:

[1]

Investigating Speech Features for Continuous Turn-Taking Prediction Using LSTMs by Matthew Roddy, Gabriel Skantze and Naomi Harte

[2]

Detection of social signals for recognizing engagement in human-robot interaction by Divesh Lala, Koji Inoue, Pierrick Milhorat and Tatsuya Kawahara

[3]

Investigating fluidity for human-robot interaction with real-time, real-world grounding strategies by Julian Hough and David Schlangen

[4]

Towards Deep End-of-Turn Predication for Situated Spoken Dialogue Systems by Angelika Maier, Julian Hough and David Schlangen

[5]

Coordination in Spoken Human-Robot Interaction by Gabriel Skantze (Lecture Presentation in Glasgow 07/03/2019)

Contents

Introduction
Turn Taking - End of Turn Prediction
Engagement
Embodiment
Fluid Incremental Grounding Strategies
Conclusion

Introduction

If we think of two humans having a fluid conversation, it is very different from conversations between humans and Siri, Google Assistant, Alexa or Cortana.

source

One reason for this loss of flow is the number of large pauses. For a conversational agent (CA) to detect that you have finished what you are saying (finished your turn), it waits for a duration of silence. If it detects a long pause, it assumes you have finished your turn and then processes your utterance.

This set duration of silence varies slightly between systems. If it is set too low, the CA will interrupt you mid-turn as human dialogue is littered with pauses. If it is set too high, the system will be more accurate at detecting your end-of-turn but the CA will take painfully long to respond - killing the flow of the conversation and frustrating the user [4].

When two humans speak, we tend to minimise the gap between turns in the conversation and this is cross-cultural. Across the globe, the gap between turns is around 200ms which is close to the limit of human response time [1]. We must therefore predict the speakers end-of-turn (EOT) while listening to someone speak.

Turn Taking - End of Turn Prediction

To recreate this fluid dialogue with fast turn-switches and slight overlap in CAs, we must first understand how we do it ourselves.

Shameless, but slightly related, self-promotion. In order to work on Computer Vision, we must first understand Human Vision

We subconsciously interpret turn-taking cues to detect when it is our turn to speak so what cues do we use? Similarly, we do this continuously while listening to someone speak so can we recreate this incremental processing?

[4] used both acoustic and linguistic features to train an LSTM to tag 10ms windows. Their system is tasked to label these windows as either Speech, mid-turn pause (MTP) or EOT but the main focus of course is the first point in a sequence which is labelled as EOT.

The acoustic features used in the LSTM were: raw pitch, smoothed F0, root mean squared signal energy, logarithmised signal energy, intensity, loudness and derived features for each frame as acoustic features.

In addition to these acoustic features, linguistic features consisted of the words and an approximation of the incremental syntactic probability called the weighted mean log trigram probability (WML).

Many other signals have been identified to indicate whether the speaker is going to continue speaking or has finished their turn in [5]:

[5]

As mentioned, a wait time of 10 seconds for a response is just as irritating as the CA cutting into you mid-turn constantly [4]. Multiple baselines were therefore considered regularly between 50ms and 6000ms to ensure multiple trade-offs were included in the baseline.

Apart from one case (linguistic features only, 500ms silence threshold), every single model beat the baselines. Using only the linguistic or acoustic features didn’t make much of a difference but performance was always best when the model used both sets of features together. The best overall system had a latency of 1195ms and cut in rate of just 18%.

[4]

[1] states that we predict EOT from multi-modal signals including: prosody, semantics, syntax, gesture and eye-gaze.

Instead of labelling 10ms windows (as speech, MTPs or EOTs), traditional models predict whether a speaker will continue speaking (HOLD) or is finished their turn (SHIFT) but only does this when it detects a pause. One major problem with this traditional approach is that backchannels are neither a HOLD or SHIFT but one of these are predicted anyway.

LSTMs have been used to make predictions continuously at 50 ms intervals and these models outperform traditional EOT models and even humans when applied to HOLD/SHIFT predictions. Their hidden layers allow them to learn long range dependencies but it is unknown exactly which features influence the performance the most.

In [1], the new system completes three different turn-taking prediction tasks: (1) prediction at pauses, (2) prediction at onset and (3) prediction at overlap.

Prediction at Pauses is the standard prediction that takes place at brief pauses in the interaction to predict whether there will be a HOLD or SHIFT. Essentially, when there is a pause above a threshold time, the person with the highest average output probability (score) is predicted to speak next. This classification model is evaluated with weighted F-scores.

Prediction at Onsets classifies the utterances during speech, not at a pause. This model is slightly different however as it predicts whether the currently ongoing utterance will be short or long. Again, as also a classifier, this model was evaluated using the weighted F-scores.

Prediction at Overlap is introduced for the first time in this paper. This is essentially a HOLD/SHIFT predication again but when an overlapping period of at least 100ms occurs. The decision to HOLD (continue speaking) is predicted when the overlap is a backchannel and SHIFT when the system should stop speaking. This again was evaluated using weighted F-scores.

Here is an example of predicted turn-taking in action:

As mentioned above, we don’t know exactly which features we use to predict when it is our turn to speak. [1] used many features in different arrangements to distinguish which are most useful. The features used were as follows:

Acoustic features are low level descriptors that include loudness, shimmer, pitch, jitter, spectral flux and MFCCs. These were extracted using the OpenSmile toolkit.

Linguistic features were investigated at two levels: part-of-speech (POS) and word. Literature often suggests that POS tags are good at predicting turn-switches but words (from an ASR system) are needed to then extract POS tags from so it is useful to check whether this extra processing is needed.

Using words instead of POS would be a great advantage for systems that need to run in real time.

Phonetic features were output from a deep neural network (DNN) that classifies senones.

Voice activity was included in their transcriptions so also used as a feature.

So what features were the most useful for EOT prediction according to [1]?

Acoustic features were great for EOT prediction, all but one experiments best results included acoustic features. This was particularly the case for prediction at overlap.

Words mostly outperformed POS tags apart from prediction at onset so use POS tags if you are wanting to predict utterance length (like backchannels).

In all cases, including voice activity improved performance.

In terms of acoustic features, the most important features were loudness, F0, low order MFCCs and spectral slope features.

Overall, the best performance was obtained by using voice activity, acoustic features and words.

As mentioned, the fact that using words instead of POS tags leads to better performance is brilliant for faster processing. This of course is beneficial for real-time incremental prediction - just like what we humans do.

All of these features are not just used to detect when we can next speak but are even used to guide what we say. We will expand on what we are saying, skip details or change topic depending on how engaged the other person is with what we are saying.

Therefore to model natural human conversation, it is important for a CA to measure engagement.

Engagement

Engagement shows interest and attention to a conversation and, as we want user’s to stay engaged, influences the dialogue strategy of the CA. This optimisation of the user experience all has to be done in real time to keep a fluid conversation.

[2] detects the following signals to measure engagement: nodding, laughter, verbal backchannels and eye gaze. The fact that these signals show attention and interest is relatively common sense but were learned from a large corpus of human-robot interactions.

[2]

[2] doesn’t just focus on recognising social signals but also on creating an engagement recognition model.

This experiment was run in Japan where nodding is particularly common. Seven features were were extracted to detect nodding: (per frame) the yaw, roll and pitch of the person’s head (per 15 frames) the average speed, average velocity, average acceleration and range of the person’s head.

Their LSTM model outperformed the other approaches to detect nodding across all metrics.

Smiling is often used to detect engagement but to avoid using a camera (they use microphones + Kinect) laughter is detected instead. Each model was tasked to classify whether an inter-pausal unit (IPU) of sound contained laughter or not. Using both prosody and linguistic features to train a two layer DNN performed the best but using other spectral features instead of linguistic features (not necessarily available from the ASR) could be used to improve the model.

Similarly to nodding, verbal backchannels are more frequent in Japan (called aizuchi). Additionally in Japan, verbal backchannels are often accompanied by head movements but only the sound was provided to the model. Similar to the laughter detection, this model classifies whether an IPU is a backchannel or the person is starting their turn (especially difficult when barging in). The best performing model was found to be a random forest, with 56 estimators, using both prosody and linguistic features. The model still performed reasonably when given only prosodic features (again because linguistic features may not be available from the ASR).

Finally, eye gaze is commonly known as a clear sign of engagement. From the inter-annotator agreement, looking at Erica’s head (the robot embodiment in this experiment) for 10 seconds continuously was considered as a engagement. Less than 10 seconds were therefore negative cases.

Erica: source

The information from the kinect sensor was used to calculate a vector from the user’s head orientation and the user was considered ‘looking at Erica’ if that vector collided with Erica’s head (plus 30cm to accommodate error). This geometry based model worked relatively well but the position of Erica’s head was estimated so this will have effected results. It is expected that this model will improve significantly when exact values are known.

This paper doesn’t aim to create the best individual systems but instead hypothesises that these models in conjunction will perform better than the individual models at detecting engagement.

[2]

The ensemble of the above models were used as a binary classifier (either a person was engaged or not). In particular, they built a hierarchical Bayesian binary classifier which judged whether the listener was engaged from the 16 possible combinations of outputs from the 4 models above.

From the annotators, a model was built to deduce which features are more or less important when detecting engagement. Some annotators found laughter to be a particularly important factor for example whereas others did not. They found that inputting a character variable with three different character types improved the models performance.

Additionally, including the previous engagement of a listener also improved the model. This makes sense as someone that is not interested currently is more likely to stay uninterested during your next turn.

Measuring engagement can only really be done when a CA is embodied (eye contact with Siri is non-existent for example). Social robots are being increasingly used in areas such as Teaching, Public Spaces, Healthcare and Manufacturing. These can all contain spoken dialogue systems but why do they have to be embodied?

[5]

Embodiment

People will travel across the globe to have a face-to-face meeting when they could just phone [5]. We don’t like to interact without seeing the other person as we miss many of the signals that we talked about above. In today's world we can also video-call but this is still avoided when possible for the same reasons. The difference between talking on the phone or face-to-face is similar to the difference between talking to Siri and an embodied dialogue system [5].

Current voice systems cannot show facial expressions, indicate attention through eye contact or move their lips. Lip reading is obviously very useful for those with impaired hearing but we all lip read during conversation (this is how we know what people are saying even in very noisy environments).

Not only can a face output these signals, it also allows the system to detect who is speaking, who is paying attention, who the actual people are (Rory, Jenny, etc…) and recognise their facial expressions.

Robot faces come in many forms however and some are better than others for use in conversation. Most robot faces, such as the face of Nao, are very static and therefore cannot show a wide range of emotion through expression like we do.

Nao: source

Some more abstract robot face depictions, such as Jibo, can show emotion using shapes and colour but some expressions must be learned.

Jibo: source

We know how to read a human face so it makes sense to show a human face. Hyper-realistic robot faces exist but are a bit creepy, like Sophia, and are very expensive.

Sophia: source

They are very realistic but just not quite right which makes conversation very uncomfortable. To combat this, avatars have been made to have conversations on screen.

source

These can mimic humans relatively closely without being creepy as there is no physical robot. This is almost like Skype, however, and the method suffers from the ‘Mona-Lisa effect’. In multi-party dialogue, it is impossible for the avatar on screen to look at one person and not the other. Either the avatar is looking ‘out’ at all parties or away at no one.

Gabriel Skantze (presenter of [5] to be clear) is the co-founder of Furhat Robotics and argues that Furhat is the best balance between all of these systems. Furhat has been developed for conversational applications such as a receptionist, social trainer, therapist, interviewer, etc…

source

Furhat needs to know where it should be looking, when it should speak, what it should say and what facial expressions it should be displaying [5].

Finally (for now), once embodied, dialogues with a robot need to be grounded in real time with the real world. In [3] the example given is a CA embodied in an industrial machine, which [5] states is becoming more and more common.

source

Fluid, Incremental Grounding Strategies

For human-robot conversation to be natural, it must be grounded in a fluid manner [3].

With non-incremental grounding, users can give positive feedback and repair but only after the robot has shown full understanding of the request. If you ask a robot to move an object somewhere for example, you must wait until the object is moved before you can correct it with an utterance like “no, move the red one”. No overlapping speech is possible so actions must be reversed entirely if a repair is needed.

With incremental grounding, overlapping is still not possible but feedback can be given at more regular intervals: at the end of each sub-task rather than only once the entire task is complete. “no, move the red one” can be said just after the robot picks up a blue object, repairing quickly. In the previous example, the blue object was placed in a given location before the repair could be given, which resulted in a reversal of the whole task! This is much more efficient but still not fluid like human-human interaction.

Fluid incremental grounding is possible if overlaps are processed. Allowing and reasoning over concurrent speech and action is much more natural. Continuing with our repair example, “no, move the red one” can be said as soon as the robot is about to pick up the blue object; no task has to be completed and reversed as concurrency is allowed. The pickup task can be aborted and the red object picked up fluidly as you say what to do with it.

[2]

To move towards this more fluid grounding, real-time processing needs to take place. Not only does the system need to process utterances word by word, it also needs to monitor real-time context such as the robot’s current state and planned actions (both of which can change dynamically over the course of an utterance or word).

The robot must know when it has sufficiently shown what it is doing to handle both repairs and confirmations. The robot needs to know what the user is confirming and, even more importantly, what needs to be repaired.

Conclusion

In this brief overview, I have covered just a tiny amount of the current work towards more natural conversational systems.

Even if turn-taking prediction, engagement measuring, embodiment and fluid grounding were all perfected, CAs would not have conversations like we humans do. I plan to write more of these overviews over the next few years so look out for them if interested.

In the meantime, please do comment with discussion points, critique my understanding and suggest papers that I (and anyone reading this) may find interesting.


Beginning to Replicate Natural Conversation in Real Time was originally published in Wallscope on Medium, where people are continuing the conversation by highlighting and responding to this story.

Linked Data Reconciliation in GraphDB

Using DBpedia to Enhance your Data in GraphDB

Following my article on Transforming Tabular Data into Linked Data using OntoRefine in GraphDB, the founder of Ontotext (Atanas Kiryakov) suggested I write a further tutorial using GraphDB for data reconciliation.

In this tutorial we will begin with a .csv of car manufacturers and enhance this with DBpedia. This .csv can be downloaded from here if you want to follow along.

Contents

Setting Up
Constructing the Graph
Reconciling your Data
Exploring the New Graph
Conclusion

Setting Up

First things first, we need to load our tabular data into OntoRefine in GraphDB. Head to the import tab, select “Tabular (OntoRefine)” and upload cars.csv if you are following along.

Click “Next” to start creating the project.

On this screen you need to untick “Parse next 1 line(s) as column headers” as this .csv does not have a header row. Rename the project in the top right corner and click “Create Project”.

You should now have this screen (above) showing one column of car manufacturer names. The column name contains a space, which is annoying when writing SPARQL queries against it, so let’s rename it.

Click the little arrow next to “Column 1”, open “Edit Column” and then click “Rename this Column”. I called it “carNames” and will use this name in the queries below, so remember to adjust them if you name it something different.

If you ever make a mistake, remember there is an undo/redo tab.

Constructing the Graph

In the top right of the interface there is an orange button titled “SPARQL”. Click this to open the SPARQL interface from which you can query your tabular data.

In the above screenshot I have run the query we want. I have pasted it here so you can see it all, and I go through it in detail below.

I use a CONSTRUCT query here. If you are new to SPARQL entirely then I recommend reading my tutorial on Constructing SPARQL Queries first. I then wrote a second tutorial, which covers constructs, called Constructing More Advanced SPARQL Queries for those who need it.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
CONSTRUCT {
  ?car rdfs:label ?taggedname ;
       rdf:type dbo:Company ;
       dbo:location ?location .

  ?location rdf:type dbo:Country ;
            rdfs:label ?lname ;
            dbp:populationCensus ?pop .
} WHERE {
  ?c <urn:col:carNames> ?cname .

  BIND(STRLANG(?cname, "en") AS ?taggedname)

  SERVICE <https://dbpedia.org/sparql> {

    ?car rdfs:label ?taggedname ;
         dbo:location | dbo:locationCountry ?location .

    ?location rdf:type dbo:Country ;
              rdfs:label ?lname ;
              dbp:populationCensus | dbo:populationTotal ?pop .

    FILTER (LANG(?lname) = "en")
  }
}

I start this query by defining my prefixes as usual. I want to construct a graph around these car manufacturers so I design that in my CONSTRUCT clause. I am building a fairly simple graph for this tutorial so let’s just run through it very quickly.

I want to have entities representing car manufacturers that have a type, label and location. This location is the country in which the manufacturer is headquartered. In most cases, all entities should have both a type and a human-readable label so I have ensured this here.

Each location is also an entity with an attached type, label and population.
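To make that concrete, here is roughly what the constructed graph should look like for a single manufacturer, written as Turtle. The specific resources and the population figure are illustrative (I am assuming Toyota appears in the .csv and that DBpedia links it to dbr:Japan through one of the two location predicates, with dbr: standing for <http://dbpedia.org/resource/> and the other prefixes as in the query):

dbr:Toyota rdf:type dbo:Company ;
    rdfs:label "Toyota"@en ;
    dbo:location dbr:Japan .

dbr:Japan rdf:type dbo:Country ;
    rdfs:label "Japan"@en ;
    dbp:populationCensus 127000000 .    # illustrative figure, not the real value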

Unlike my superhero tutorial, the .csv only contains the car company names and not all the data we want in our graph. We therefore need to reconcile our data with information in an open linked dataset. In this tutorial we will use DBpedia, the linked data representation of Wikipedia.

To get the information needed to build the graph declared in our CONSTRUCT, we first grab all the names in our .csv and assign them to the variable ?cname. String literals must be language tagged to reconcile with the data in DBpedia, so I BIND the English language tag “en” to each string literal. That is exactly what the lines below do:

If you didn’t name the column “carNames” above, you will have to modify the <urn:col:carNames> predicate here.
  ?c <urn:col:carNames> ?cname .

  BIND(STRLANG(?cname, "en") AS ?taggedname)
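As a small illustration (assuming “Toyota” is one of the names in the .csv), the BIND turns the plain literal into a language-tagged one, which is the form DBpedia’s rdfs:label values take:

  BIND(STRLANG("Toyota", "en") AS ?taggedname)   # ?taggedname is now "Toyota"@en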

Following this we use the SERVICE keyword to send part of the query to DBpedia (this is called a federated query). We find every entity with a label matching one of our language-tagged strings from the original .csv.

Once I have those entities, I need to find their locations. DBpedia is a very messy dataset so we have to use an alternative property path in the query (represented by the “pipe” | symbol). This finds locations connected by any of the alternate paths given (in this case dbo:location and dbo:locationCountry) and assigns them to the variable ?location.

That explanation is referring to these lines:

    ?car rdfs:label ?taggedname ;
         dbo:location | dbo:locationCountry ?location .

Next we want to retrieve the information about each country. The first triple pattern ensures the location entity has the type dbo:Country so that we don’t find loads of irrelevant locations.

Following this we grab the label and again use alternate property paths to extract each country’s population.

It is important to note that some countries have two different population values attached, one by each of these two predicates (dbp:populationCensus and dbo:populationTotal).

We finally FILTER the country labels to only return those that are in English as that is the language our original dataset is in. Data reconciliation can also be used to extend your data into other languages if it happens to fit a multilingual linked open dataset.

That covers the final few lines of our query:

    ?location rdf:type dbo:Country ;
              rdfs:label ?lname ;
              dbp:populationCensus | dbo:populationTotal ?pop .

    FILTER (LANG(?lname) = "en")

Next we need to insert this graph we have constructed into a GraphDB repository.

Click “SPARQL endpoint” and copy your endpoint (yours will be different) to be used later.

Reconciling your Data

If you have not done so already, create a repository and head to the SPARQL tab.

You can see in the top right of this screenshot that I’m using a repository called “cars”.

In this query panel you want to copy the CONSTRUCT query we built and modify it a little. The full query is here:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
INSERT {
  ?car rdfs:label ?taggedname ;
       rdf:type dbo:Company ;
       dbo:location ?location .

  ?location rdf:type dbo:Country ;
            rdfs:label ?lname ;
            dbp:populationCensus ?pop .
} WHERE {
  SERVICE <http://localhost:7200/rdf-bridge/yourID> {
    ?c <urn:col:carNames> ?cname .

    BIND(STRLANG(?cname, "en") AS ?taggedname)

    SERVICE <https://dbpedia.org/sparql> {

      ?car rdfs:label ?taggedname ;
           dbo:location | dbo:locationCountry ?location .

      ?location rdf:type dbo:Country ;
                rdfs:label ?lname ;
                dbp:populationCensus | dbo:populationTotal ?pop .

      FILTER (LANG(?lname) = "en")
    }
  }
}

The first thing we do is replace CONSTRUCT with INSERT as we now want to ingest the returned graph into our repository.

The next and final thing we must do is nest the entire WHERE clause inside a second SERVICE clause. This time, however, the service endpoint is the one you copied at the end of the construction section.

This constructs the graph and inserts it into your repository!

It should be a much larger graph, but the messiness of DBpedia strikes again! Many car manufacturers are connected to a string naming their location rather than to the location entity itself. Those string values have no attached type or population, so the corresponding manufacturers are not returned.
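If you want to see which manufacturers fall into this gap, a query along these lines should work (a sketch, run from the same OntoRefine SPARQL panel as the CONSTRUCT query). It lists manufacturers whose location on DBpedia is a plain literal rather than an entity:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?car ?location
WHERE {
  ?c <urn:col:carNames> ?cname .

  BIND(STRLANG(?cname, "en") AS ?taggedname)

  SERVICE <https://dbpedia.org/sparql> {
    ?car rdfs:label ?taggedname ;
         dbo:location | dbo:locationCountry ?location .

    # keep only the problematic cases where the location is a string
    FILTER (isLiteral(?location))
  }
}

Those rows would need manual cleaning (or a different predicate) before they could be reconciled.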

We started with a small .csv of car manufacturer names, so let’s explore the graph we now have.

Exploring the New Graph

If we head to the “Explore” tab and view Japan for example, we can see our data.

Japan has the attached type dbo:Country, a label, a population and seven car manufacturers.
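If you would rather confirm this with a query than by clicking through the visual graph, something like the following (a sketch, run in the SPARQL tab we turn to below) should list those seven manufacturers:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?mname
WHERE {
  ?manufacturer rdf:type dbo:Company ;
                rdfs:label ?mname ;
                dbo:location ?country .

  ?country rdfs:label "Japan"@en .
}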

There is no point in linking data if we cannot gain further insight, so let’s head to the “SPARQL” tab of the workbench.

In this screenshot we can see the results of the below query. This query returns each country alongside the number of people per car manufacturer in that country.

There is nothing new in this query if you have read my SPARQL introduction. I have used the MAX population as some countries have two attached populations due to DBpedia’s messiness.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
SELECT ?name ((MAX(?pop) / COUNT(DISTINCT ?companies)) AS ?result)
WHERE {
  ?companies rdf:type dbo:Company ;
             dbo:location ?location .

  ?location rdf:type dbo:Country ;
            dbp:populationCensus ?pop ;
            rdfs:label ?name .
}
GROUP BY ?name
ORDER BY DESC(?result)

In the screenshot above you can see that the top results (ordered by ?result in descending order) are:

  • Indonesia
  • Pakistan
  • India
  • China

India of course has a much larger population than Indonesia but also has a lot more car manufacturers (as shown below).
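If you want to reproduce that comparison yourself, a query along these lines (a sketch reusing the same prefixes) should return the number of manufacturers per country:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?name (COUNT(DISTINCT ?companies) AS ?manufacturers)
WHERE {
  ?companies rdf:type dbo:Company ;
             dbo:location ?location .

  ?location rdf:type dbo:Country ;
            rdfs:label ?name .
}
GROUP BY ?name
ORDER BY DESC(?manufacturers)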

If you were a car manufacturer in Asia, Indonesia might be a good market to target for export as it has a high population but very little local competition.

Conclusion

We started with a small list of car manufacturer names but, by using GraphDB and DBpedia, we managed to extend this into a small graph that we could gain actual insight from.

Of course, this example is not entirely useful but perhaps you have a list of local areas or housing statistics that you want to reconcile with mapping or government linked open data. This can be done using the above approach to help you or your business gain further insight that you could not have otherwise identified.


Linked Data Reconciliation in GraphDB was originally published in Wallscope on Medium, where people are continuing the conversation by highlighting and responding to this story.