Pronto – Find Predicates Fast


If you work with linked data or the semantic web, you understand how dull digging through ontologies to find concepts and predicates can be. At Wallscope we understand this too – so we created Pronto, a free tool that makes this work easier and more efficient. (If you are new to the semantic web and linked data, I suggest you have a look at the type of challenges it aims to solve first.)

The Problem

The objective of an ontology is to be reused. Although this is a simple concept, it can prove inconvenient in the long run. The many existing ontologies make searching for concepts and predicates tedious, labour-intensive and time-consuming. One has to iteratively and manually inspect a number of ontologies until a suitable ontological component is found.

At Wallscope this issue impacts us because our work includes building data exploration systems that connect independent and diverse data sources. So we started thinking.
It would be much easier to search through all ontologies — or at least the main ones — at the same time.
As a result, we decided to invest in the creation of Pronto with the aim to overcome this challenge.
Example search of a predicate with Pronto.
The Solution

Pronto allows developers to search for concepts and predicates among a number of ontologies, originally selected from the prefix.cc user-curated “popular” list, along with some others we use. These include:
  • rdf and rdfs
  • foaf
  • schema
  • geo
  • dbo and dbp
  • owl
  • skos
  • xsd
  • vcard
  • dcat
  • dc and dcterms
  • madsrdf
  • bflc
Searching for a concept or a predicate retrieves results from the above ontologies, ordered by relevance. Try it here to see how Pronto works in practice.

Thanks for reading. We would be interested to hear your feedback or suggestions for other ontologies to include. If you find Pronto useful, give the article some claps (up to 50 if you hold the button 😄) so that more people can benefit from this tool!

Comparison of Linked Data Triplestores: A New Contender

First Impressions of RDFox while the Benchmark is Developed

Note: This article is not sponsored. Oxford Semantic Technologies let me try out the new version of RDFox and are keen to be part of the future benchmark.

After reading some of my previous articles, Oxford Semantic Technologies (OST) got in touch and asked if I would like to try out their triplestore called RDFox.

In this article I will share my thoughts and why I am now excited to see how they do in the future benchmark.

They have just released a page on which you can request your own evaluation license to try it yourself.

Contents

Brief Benchmark Catch-up
How I Tested RDFox
First Impressions
Results
Conclusion


Brief Benchmark Catch-up

In December I wrote a comparison of existing triplestores on a tiny dataset. I quickly learned that there were too many flaws in my methodology for the results to be truly comparable.

In February I then wrote a follow-up in which I described many of the flaws and listed many of the details that I will have to pay attention to while developing an actual benchmark.

This benchmark is currently in development. I am now working with developers and academics, and talking with a high-performance computing centre about access to the infrastructure needed to run at scale.

How I Tested RDFox

In the above articles I evaluated five triplestores. They were (in alphabetical order) AnzoGraph, Blazegraph, GraphDB, Stardog and Virtuoso. I would like to include all of these in the future benchmark and now RDFox as well.

Obviously my previous evaluations are not completely fair comparisons (hence the development of the benchmark) but the last one can be used to get an idea of whether RDFox can compete with the others.

For that reason, I loaded the same data and ran the same queries as in my larger comparison to see how RDFox fared. I of course kept all other variables the same, such as the machine, using the CLI to query, the same number of warm-up and hot runs, etc.

First Impressions

RDFox is very easy to use and well documented. You can initialise it with custom scripts, which is extremely useful: I could start RDFox, load all my gzipped Turtle files in parallel, run all my warm-up queries and run all my hot queries with one command.

RDFox is an in-memory solution, which explains many of the differences in results, but it also has a very nice rule system that can be used to precompute results that are used in later queries. These rules are not evaluated when you send a query but in advance.

This allows them to be used to automatically maintain the consistency of your data as it is added or removed. The rules themselves can even be added or removed during the lifetime of the database.

Note: Queries 3 and 6 use these custom rules. I highlight this on the relevant queries and this is one of the reasons I didn’t just add them to the last article.

You can use these rules for a number of things, for example to precompute alternate property paths (if you are unfamiliar with those, I cover them in this SPARQL tutorial). You could do this by defining a rule stating that example:predicate represents both example:pred1 and example:pred2, so that this query:

SELECT ?s ?o
WHERE {
?s example:predicate ?o .
}

This would return the triples:

person:A example:pred1 colour:blue .
person:A example:pred2 colour:green .
person:B example:pred2 colour:brown .
person:C example:pred1 colour:grey .

This makes the use of alternate property paths less necessary.
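
For comparison, without such a rule the same results would have to be gathered with an alternate property path directly in the query. Something along these lines would do it (a sketch only; the example: namespace below is a placeholder I chose for illustration, not a real vocabulary):

PREFIX example: <http://example.org/ns#>
SELECT ?s ?o
WHERE {
  # match either predicate in one pattern instead of relying on the precomputed example:predicate
  ?s example:pred1 | example:pred2 ?o .
}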

With all that… let’s see how RDFox performs.

Results

Each of the below charts compares RDFox to the averages of the others. I exclude outliers where applicable.

Loading

Right off the bat, RDFox was the fastest at loading, which was a huge surprise as AnzoGraph was originally significantly faster than the others (184,667ms).

Even if I exclude Blazegraph and GraphDB (which were significantly slower at loading than the others), you can see that RDFox was very fast:

Others = AnzoGraph, Stardog and Virtuoso

Note: I have added who the others are as footnotes to each chart. This is so that the images are not mistaken for fair comparison results (if they showed up in a Google image search for example).

RDFox and AnzoGraph are much newer triplestores which may be why they are so much faster at loading than the others. I am very excited to see how these speeds are impacted as we scale the number of triples we load in the benchmark.

Queries

Overall I am very impressed with RDFox’s performance with these queries.

It is important to note however that the others' results have been public for a while. I ran these queries on both the newest version of RDFox and the previous version and did not notice any significant optimisation on these particular queries. The published results are of course from the latest version.

Query 1:

This query is very simple and just counts the number of relationships in the graph.

SELECT (COUNT(*) AS ?triples)
WHERE {
?s ?p ?o .
}

RDFox was the second slowest to do this but as mentioned in the previous article, optimisations on this query often reduce correctness.

Faster = AnzoGraph, Blazegraph, Stardog and Virtuoso. Slower = GraphDB

Query 2:

This query returns a list of 1000 settlement names which have airports with identification numbers.

PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?v WHERE {
  { ?v2 a dbo:Settlement ;
        rdfs:label ?v .
    ?v6 a dbo:Airport . }
  { ?v6 dbo:city ?v2 . }
  UNION
    { ?v6 dbo:location ?v2 . }
  { ?v6 dbp:iata ?v5 . }
  UNION
    { ?v6 dbo:iataLocationIdentifier ?v5 . }
  OPTIONAL { ?v6 foaf:homepage ?v7 . }
  OPTIONAL { ?v6 dbp:nativename ?v8 . }
} LIMIT 1000

RDFox was the fastest to complete this query by a fairly significant margin and this is likely because it is an in-memory solution. GraphDB was the second fastest in 29.6ms and then Virtuoso in 88.2ms.

Others = Blazegraph, GraphDB, Stardog and Virtuoso

I would like to reiterate that there are several problems with these queries that will be solved in the benchmark. For example, this query has a LIMIT but no ORDER BY which is highly unrealistic.
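
For illustration, a more realistic pattern would sort the results before limiting them, along these lines (a trimmed sketch of the settlement pattern above, not one of the benchmark queries):

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?v WHERE {
  ?v2 a dbo:Settlement ;
      rdfs:label ?v .
}
ORDER BY ?v
LIMIT 1000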

Query 3:

This query nests query 2 to grab information about the 1,000 settlements returned above.

You will notice that this query is slightly different to query 3 in the original article.

PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?v ?v2 ?v5 ?v6 ?v7 ?v8 WHERE {
  ?v2 a dbo:Settlement;
      rdfs:label ?v.
  ?v6 a dbo:Airport.
  { ?v6 dbo:city ?v2. }
  UNION
  { ?v6 dbo:location ?v2. }
  { ?v6 dbp:iata ?v5. }
  UNION
  { ?v6 dbo:iataLocationIdentifier ?v5. }
  OPTIONAL { ?v6 foaf:homepage ?v7. }
  OPTIONAL { ?v6 dbp:nativename ?v8. }
  {
    FILTER(EXISTS{ SELECT ?v WHERE {
      ?v2 a dbo:Settlement;
          rdfs:label ?v.
      ?v6 a dbo:Airport.
      { ?v6 dbo:city ?v2. }
      UNION
      { ?v6 dbo:location ?v2. }
      { ?v6 dbp:iata ?v5. }
      UNION
      { ?v6 dbo:iataLocationIdentifier ?v5. }
      OPTIONAL { ?v6 foaf:homepage ?v7. }
      OPTIONAL { ?v6 dbp:nativename ?v8. }
    }
    LIMIT 1000
    })
  }
}

RDFox was again the fastest to complete query 3 but it is important to reiterate that this query was modified slightly so that it could run on RDFox. The only other query that has the same issue is query 6.

Others = Blazegraph, GraphDB, Stardog and Virtuoso

The results of query 2 and 3 are very similar of course, as query 2 is nested within query 3.

Query 4:

The two queries above were similar but query 4 is a lot more mathematical.

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT (ROUND(?x/?y) AS ?result) WHERE {
  {SELECT (CEIL(?a + ?b) AS ?x) WHERE {
    {SELECT (AVG(?abslat) AS ?a) WHERE {
      ?s1 geo:lat ?lat .
      BIND(ABS(?lat) AS ?abslat)
    }}
    {SELECT (SUM(?rv) AS ?b) WHERE {
      ?s2 dbo:volume ?volume .
      BIND((RAND() * ?volume) AS ?rv)
    }}
  }}

  {SELECT ((FLOOR(?c + ?d)) AS ?y) WHERE {
    {SELECT ?c WHERE {
      BIND(MINUTES(NOW()) AS ?c)
    }}
    {SELECT (AVG(?width) AS ?d) WHERE {
      ?s3 dbo:width ?width .
      FILTER(?width > 50)
    }}
  }}
}

AnzoGraph was the quickest to complete query 4 with RDFox in second place.

Faster = AnzoGraph. Slower = Blazegraph, Stardog and Virtuoso

Virtuoso was the third fastest to complete this query in a time of 519.5ms.

As with all these queries, they do not contain random seeds so I have made sure to include mathematical queries in the benchmark.

Query 5:

This query focuses on strings rather than mathematical functions. It essentially grabs all labels containing the string ‘venus’, all comments containing ‘sleep’ and all abstracts containing ‘gluten’. It then constructs an entity and attaches all of these to it.

I use a CONSTRUCT query here. For those that need it, I wrote a second SPARQL tutorial which covers constructs, called Constructing More Advanced SPARQL Queries.

PREFIX ex: <http://wallscope.co.uk/resource/example/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT {
  ex:notglutenfree rdfs:label ?label ;
                   rdfs:comment ?sab ;
                   dbo:abstract ?lab .
} WHERE {
  {?s1 rdfs:label ?label .
   FILTER (REGEX(lcase(?label), 'venus'))
  } UNION
  {?s2 rdfs:comment ?sab .
   FILTER (REGEX(lcase(?sab), 'sleep'))
  } UNION
  {?s3 dbo:abstract ?lab .
   FILTER (REGEX(lcase(?lab), 'gluten'))
  }
}

As discussed in the previous post, it is uncommon to use REGEX queries if you can run a full-text index query on the triplestore. AnzoGraph and RDFox are the only two that do not have built-in full-text indexes, hence these results:

Faster = AnzoGraph. Slower = Blazegraph, GraphDB, Stardog and Virtuoso

AnzoGraph is a little faster than RDFox to complete this query but the two of them are significantly faster than the rest. This is of course because you would use the full text index capabilities of the other triplestores.

If we instead run full text index queries, they are significantly faster than RDFox.

Note: To ensure clarity, in this chart RDFox was running the REGEX query as it does not have full text index functionality.

Others = Blazegraph, GraphDB, Stardog and Virtuoso

Whenever I can run a full text index query I will because of the serious performance boost. Therefore this chart is definitely fairer on the other triplestores.
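
For readers unfamiliar with full-text index queries: on Virtuoso, for example, the label pattern from query 5 can use the built-in bif:contains predicate instead of REGEX. The sketch below is only an illustration of that style (the exact store-specific queries used in the comparison are not shown in this article, and other triplestores expose full-text search through their own predicates or functions):

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s1 ?label WHERE {
  ?s1 rdfs:label ?label .
  # bif:contains is a Virtuoso-specific full-text predicate; the bif: prefix is built in on Virtuoso
  ?label bif:contains "venus" .
}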

Query 6:

This query finds all soccer players born in a country with more than 10 million inhabitants, who played as a goalkeeper for a club whose stadium has more than 30,000 seats, and whose club is in a different country from their country of birth.

Note: This is the second, and final, query that is modified slightly for RDFox. The original query contained both alternate and recurring property paths which were handled by their rule system.

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX : <http://ost.com/>
SELECT DISTINCT ?soccerplayer ?countryOfBirth ?team ?countryOfTeam ?stadiumcapacity
{
  ?soccerplayer a dbo:SoccerPlayer ;
     :position <http://dbpedia.org/resource/Goalkeeper_(association_football)> ;
     :countryOfBirth ?countryOfBirth ;
     dbo:team ?team .
  ?team dbo:capacity ?stadiumcapacity ; dbo:ground ?countryOfTeam .
  ?countryOfBirth a dbo:Country ; dbo:populationTotal ?population .
  ?countryOfTeam a dbo:Country .
  FILTER (?countryOfTeam != ?countryOfBirth)
  FILTER (?stadiumcapacity > 30000)
  FILTER (?population > 10000000)
} ORDER BY ?soccerplayer

If interested in alternate property paths, I cover them in my article called Constructing More Advanced SPARQL Queries.

RDFox was fastest again to complete query 6. This speed is probably down to the rule system changes as some of the query is essentially done beforehand.

Others = Blazegraph, GraphDB, Stardog and Virtuoso

For the above reason, RDFox’s rule system will have to be investigated thoroughly before the benchmark.

Virtuoso was the second fastest to complete this query in a time of 54.9ms which is still very fast compared to the average.
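
For readers wondering what the "alternate and recurring property paths" mentioned in the note above look like in plain SPARQL, here is a purely illustrative sketch over this query's vocabulary. It is my own guess at the kind of paths a rule system would replace, not the author's original query:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
SELECT DISTINCT ?soccerplayer ?countryOfBirth WHERE {
  # alternate path: accept the position from either the ontology or the property namespace (illustrative)
  ?soccerplayer a dbo:SoccerPlayer ;
                dbo:position|dbp:position <http://dbpedia.org/resource/Goalkeeper_(association_football)> ;
                # recurring path: follow birthPlace and then zero or more country hops (illustrative)
                dbo:birthPlace/dbo:country* ?countryOfBirth .
  ?countryOfBirth a dbo:Country .
}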

Query 7:

Finally, this query finds all people born in Berlin before 1900.

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX : <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?name ?birth ?death ?person
WHERE {
?person dbo:birthPlace :Berlin .
?person dbo:birthDate ?birth .
?person foaf:name ?name .
?person dbo:deathDate ?death .
FILTER (?birth < "1900-01-01"^^xsd:date)
}
ORDER BY ?name

Finishing with a very simple query and no custom rules (unlike queries 3 and 6), RDFox was once again the fastest to complete this query.

Others = AnzoGraph, Blazegraph, GraphDB, Stardog and Virtuoso

In this case, the average includes every other triplestore as there were no real outliers. Virtuoso was again the second fastest and completed query 7 in 20.2ms so relatively fast compared to the average. This speed difference is again likely due to the fact that RDFox is an in-memory solution.

Conclusion

To reiterate, this is not a sound comparison so I cannot conclude that RDFox is better or worse than triplestore X and Y. What I can conclude is this:

RDFox can definitely compete with the other major triplestores and initial results suggest that they have the potential to be one of the top performers in our benchmark.

I can also say that RDFox is very easy to use and well documented, and the rule system makes it very easy for users to add custom reasoning.

If you want to try it for yourself, you can request a license here.

Again, to summarise the notes throughout: RDFox did not sponsor this article. They did know the others' results, since I published them in my last article, but I also tested the previous version of RDFox and didn't notice any significant optimisation. The rule system makes queries 3 and 6 difficult to compare, but that will be investigated before running the benchmark.




Constructing More Advanced SPARQL Queries


CONSTRUCT queries, VALUES and more property paths.

It was (quite rightly) pointed out that I strangely did not cover CONSTRUCT queries in my previous tutorial on Constructing SPARQL Queries. Additionally, I then went on to use CONSTRUCT queries in both my Transforming Tabular Data into Linked Data tutorial and the Linked Data Reconciliation article.

So, to finally correct this - I will cover them here!

Contents

SELECT vs CONSTRUCT
First Basic Example
- VALUES
- Alternative Property Paths
Second Basic Example
Example From the Reconciliation Article
Example From the Benchmark (Sneak Preview)

SELECT vs CONSTRUCT

In my last tutorial, I basically ran through SELECT queries from the most basic to some more complex. So what’s the difference?

With selects we are trying to match patterns in the knowledge graph to return results. With constructs we are specifying and building a new graph to return.

In the two tutorials linked (in the intro) I was constructing graphs from tabular data to then insert into a triplestore. I will discuss sections of these later but you should be able to follow the full queries after going through this tutorial.

We usually use CONSTRUCT queries at Wallscope to build a graph for the front-end team. Essentially, we create a portable sub-graph that contains all of the information needed to build a section of an application. Then instead of many select queries to the full database, these queries can be run over the much smaller sub-graph returned by the construct query.

First Basic Example

For this first example I will be querying my Superhero dataset that you can download here.

Each superhero entity in this dataset is connected to their height with the predicate dbo:height. We can retrieve each hero alongside their height using this basic SELECT query:

PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?hero ?height
WHERE {
?hero dbo:height ?height .
}

Now let's modify this query slightly into a CONSTRUCT that is almost the same:

PREFIX dbo: <http://dbpedia.org/ontology/>
CONSTRUCT {
?hero dbo:height ?height
} WHERE {
?hero dbo:height ?height .
}

As you can see, this returns the same information but in the form: subject, predicate, object.

This is obviously trivial and not entirely useful, but we can play with the graph in the CONSTRUCT clause with only one condition:

All variables in the CONSTRUCT must be in the WHERE clause.

Basically, like in a SELECT query, the WHERE clause matches patterns in the knowledge graph and returns any variables. The difference with a CONSTRUCT is that these variables are then used to build the graph described in the CONSTRUCT clause.

Hopefully that is clear, but it makes more sense if we change the graph description.

For example, if we decided that we wanted to use schema instead of DBpedia’s ontology, we could switch to it in the first clause:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX schema: <http://schema.org/>
CONSTRUCT {
?hero schema:height ?height
} WHERE {
?hero dbo:height ?height .
}

This then returns the superheroes attached to their heights with the schema:height predicate as the variables are matched in the WHERE clause and then recombined in the CONSTRUCT clause.

This simple predicate switching is not entirely useful on its own (unless you really need to switch ontology for some reason) but it is a good first step to understanding this type of query.

To create some more useful CONSTRUCT queries, I’ll first go through VALUES and another type of property path.

VALUES

I’m sure there are many use-cases in which the VALUES clause is incredibly useful but I can’t say that I use it often. Essentially, it allows data to be provided within the query.

If you are searching for a particular sport in a dataset, for example, you could match all entities that are sports and then filter the results for it. This gets more complex, however, if you are looking for a few particular sports; in that case you may want to provide those sports within the query.

With VALUES you can constrain your query by creating a variable (or multiple variables) and assigning it some data.

I tend to use this with federated queries to grab data (usually for insertion into my database) about a few particular entities.

Let’s go through a practical example of this:

PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?country ?pop
WHERE {
  VALUES ?country {
    dbr:Scotland
    dbr:England
    dbr:Wales
    dbr:Northern_Ireland
    dbr:Ireland
  }
}

In this example I am interested in the five largest countries in the British Isles so that we can compare populations. (I'm from Scotland and still had to check I had the list right, so I imagine others may find this useful also.)

I am using DBpedia for this example, so I have assigned the five country entities to the variable ?country and selected it to be returned.

You would think it should therefore be easy enough to grab the corresponding populations. I add the SERVICE clause to make this a federated query (covered previously). This just sends the countries defined within the query to DBpedia and returns their corresponding populations.

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbp: <http://dbpedia.org/property/>
SELECT ?country ?pop
WHERE {
  VALUES ?country {
    dbr:Scotland
    dbr:England
    dbr:Wales
    dbr:Northern_Ireland
    dbr:Ireland
  }

  SERVICE <http://dbpedia.org/sparql> {
    ?country dbp:populationCensus ?pop .
  }
}

Running this, however, only returns populations for four of the five countries – Ireland is missing from the results! You will often find this kind of problem with linked open data: the structure is not always consistent throughout.

To find Ireland’s population we need to switch the predicate from dbp:populationCensus to dbo:populationTotal like so:

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?country ?pop
WHERE {
  VALUES ?country {
    dbr:Scotland
    dbr:England
    dbr:Wales
    dbr:Northern_Ireland
    dbr:Ireland
  }

  SERVICE <http://dbpedia.org/sparql> {
    ?country dbo:populationTotal ?pop .
  }
}

which returns Ireland alongside its population… but none of the others.

This is of course a problem, but before we can construct a solution, let's run through alternative property paths.

Alternative Property Paths

In my last SPARQL tutorial we covered sequential property paths which (once the benchmark query templates come out) you may notice I am a big fan of.

Another type of property path that I use fairly often is called the Alternative Property Path and is made use of with the pipe (|) character.

If we look back at the problem above in the VALUES section, we can get some populations with one predicate and the rest with another. The alternate property path allows us to match patterns with either! For example, if we modify the population query above we get:

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?country ?pop
WHERE {
  VALUES ?country {
    dbr:Scotland
    dbr:England
    dbr:Wales
    dbr:Northern_Ireland
    dbr:Ireland
  }

  SERVICE <http://dbpedia.org/sparql> {
    ?country dbp:populationCensus | dbo:populationTotal ?pop .
  }
}

This is such a simple change but so powerful, as we now return every country alongside its population with one relatively basic query.

This SELECT is great if we are just looking to find some results but what if we want to store this data in our knowledge graph?

Second Example

It would be a hassle to have to use this alternative property path every time we want to work with country populations. In addition, if users were not aware of this inconsistency, they could find and report incorrect results.

This is why we CONSTRUCT the result graph we want without the inconsistencies. In this case I have chosen dbo:populationTotal as I simply prefer it and use that to connect countries and their populations:

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>
CONSTRUCT {
  ?country dbo:populationTotal ?pop
} WHERE {
  VALUES ?country {
    dbr:Scotland
    dbr:England
    dbr:Wales
    dbr:Northern_Ireland
    dbr:Ireland
  }

  SERVICE <http://dbpedia.org/sparql> {
    ?country dbp:populationCensus | dbo:populationTotal ?pop .
  }
}

This query returns the countries and their populations as we saw in the previous section, but then connects each country to its population with dbo:populationTotal, as described in the CONSTRUCT clause. This returns consistent triples.

This is useful if we wish to store this data, as the fact that it is consistent will help avoid the problems mentioned above. I used this technique in one of my previous articles so let's take a look.

Example From Reconciliation Tutorial

This example is copied directly from my data reconciliation tutorial here. In that article I discuss this query in a lot more detail.

In brief, what I was doing here was grabbing car manufacturer names from tabular data and enhancing that information to store and analyse.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
CONSTRUCT {
  ?car rdfs:label ?taggedname ;
       rdf:type dbo:Company ;
       dbo:location ?location .

  ?location rdf:type dbo:Country ;
            rdfs:label ?lname ;
            dbp:populationCensus ?pop .
} WHERE {
  ?c <urn:col:carNames> ?cname .

  BIND(STRLANG(?cname, "en") AS ?taggedname)

  SERVICE <https://dbpedia.org/sparql> {

    ?car rdfs:label ?taggedname ;
         dbo:location | dbo:locationCountry ?location .

    ?location rdf:type dbo:Country ;
              rdfs:label ?lname ;
              dbp:populationCensus | dbo:populationTotal ?pop .

    FILTER (LANG(?lname) = "en")
  }
}

There is little point repeating myself here so, if you are interested, please take a look. What I am trying to show here is that I have used both the alternative property path (twice!) and the CONSTRUCT clause previously in an example use-case.

Construct queries are perfectly suited to ensuring any data you store is well typed, structured and importantly consistent.
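
As one more illustration of that point (my own sketch, building on the country example above rather than a query from either article), a CONSTRUCT can also assert the type of each entity and cast the population to a consistent datatype before anything is stored:

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
CONSTRUCT {
  ?country a dbo:Country ;
           dbo:populationTotal ?typedPop .
} WHERE {
  VALUES ?country {
    dbr:Scotland
    dbr:England
    dbr:Wales
    dbr:Northern_Ireland
    dbr:Ireland
  }

  SERVICE <http://dbpedia.org/sparql> {
    ?country dbp:populationCensus | dbo:populationTotal ?pop .
  }

  # cast whatever literal DBpedia returns to a plain integer so the stored data is uniform
  BIND(xsd:integer(?pop) AS ?typedPop)
}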

I have been short on time since starting my new project but I am still working on the benchmark in development.

Example From The Benchmark (Sneak Preview)

The benchmark repository is not yet public as I don’t want opinions to be formed before it is fleshed out a little more.

I thought it would be good however to give a real (not made for a tutorial) example query that uses what this article teaches:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX schema: <http://schema.org/>
PREFIX dbp: <http://dbpedia.org/property/>
INSERT {
  ?city dbo:populationTotal ?pop
} WHERE {
  {
    SELECT ?city (MAX(?apop) AS ?pop) {
      ?user schema:location ?city .

      SERVICE <https://dbpedia.org/sparql> {
        ?city dbo:populationTotal | dbp:populationCensus ?apop .
      }
    }
    GROUP BY ?city
  }
}

You will notice that this does not contain the CONSTRUCT clause but INSERT instead. You will see me make this switch in both of the articles I linked in the introduction. Basically this does nothing too different: the graph that is constructed is inserted into your knowledge graph instead of just being returned. The same can be done with the DELETE clause to remove patterns from your knowledge graph.
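
A DELETE takes exactly the same shape. As a hedged sketch (mirroring the INSERT above rather than a query from the benchmark), this would remove the populations previously attached to users' locations:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX schema: <http://schema.org/>
DELETE {
  ?city dbo:populationTotal ?pop
} WHERE {
  # match the populations attached to cities that users are located in, then remove those triples
  ?user schema:location ?city .
  ?city dbo:populationTotal ?pop .
}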

The INSERT query above is very similar to the examples throughout this article (by design of course) but grabs countries' populations from DBpedia and inserts them into the graph. This is just one point within the query cycle at which the graph changes structure in the benchmark.

Finally, the MAX population is grabbed because some countries in DBpedia have two different populations attached to them…

Conclusion

Hopefully this is useful for some of you! We have covered why and how to use construct queries along with values and alternative property paths.

At the end of May I am going to the DBpedia community meeting in Leipzig so my next linked data article will likely cover things I learned at that event or progress on the benchmark development.

In the meantime I will be releasing my next Computer Vision article and another dive into natural conversation.



Beginning to Replicate Natural Conversation in Real Time

A first step into the literature

To start my new project, the first thing I of course have to do is run through the current research and state of the art models.

I was interviewed recently, and in that interview I explain this new project, but in short (extremely short): I aim to take a step towards making conversational agents more natural to talk with.

I have by no means exhausted all the literature in this field; I have barely scratched the surface (link relevant papers below if you know of any I must read). Here is an overview of some of this research and the journey towards more natural conversational agents. In this I will refer to the following papers:

[1] Investigating Speech Features for Continuous Turn-Taking Prediction Using LSTMs by Matthew Roddy, Gabriel Skantze and Naomi Harte

[2] Detection of social signals for recognizing engagement in human-robot interaction by Divesh Lala, Koji Inoue, Pierrick Milhorat and Tatsuya Kawahara

[3] Investigating fluidity for human-robot interaction with real-time, real-world grounding strategies by Julian Hough and David Schlangen

[4] Towards Deep End-of-Turn Prediction for Situated Spoken Dialogue Systems by Angelika Maier, Julian Hough and David Schlangen

[5] Coordination in Spoken Human-Robot Interaction by Gabriel Skantze (lecture presentation in Glasgow, 07/03/2019)

Contents

Introduction
Turn Taking – End of Turn Prediction
Engagement
Embodiment
Fluid Incremental Grounding Strategies
Conclusion

Introduction

If we think of two humans having a fluid conversation, it is very different from conversations between humans and Siri, Google Assistant, Alexa or Cortana.


One reason for this loss of flow is the number of large pauses. For a conversational agent (CA) to detect that you have finished what you are saying (finished your turn), it waits for a duration of silence. If it detects a long pause, it assumes you have finished your turn and then processes your utterance.

This set duration of silence varies slightly between systems. If it is set too low, the CA will interrupt you mid-turn as human dialogue is littered with pauses. If it is set too high, the system will be more accurate at detecting your end-of-turn but the CA will take painfully long to respond – killing the flow of the conversation and frustrating the user [4].

When two humans speak, we tend to minimise the gap between turns in the conversation, and this is cross-cultural. Across the globe, the gap between turns is around 200ms, which is close to the limit of human response time [1]. We must therefore predict the speaker's end-of-turn (EOT) while listening to someone speak.

Turn Taking – End of Turn Prediction

To recreate this fluid dialogue with fast turn-switches and slight overlap in CAs, we must first understand how we do it ourselves.

Shameless, but slightly related, self-promotion. In order to work on Computer Vision, we must first understand Human Vision

We subconsciously interpret turn-taking cues to detect when it is our turn to speak so what cues do we use? Similarly, we do this continuously while listening to someone speak so can we recreate this incremental processing?

[4] used both acoustic and linguistic features to train an LSTM to tag 10ms windows. Their system is tasked to label these windows as either Speech, mid-turn pause (MTP) or EOT but the main focus of course is the first point in a sequence which is labelled as EOT.

The acoustic features used in the LSTM were: raw pitch, smoothed F0, root mean squared signal energy, logarithmised signal energy, intensity, loudness and derived features for each frame.

In addition to these acoustic features, linguistic features consisted of the words and an approximation of the incremental syntactic probability called the weighted mean log trigram probability (WML).

Many other signals that indicate whether the speaker is going to continue speaking or has finished their turn have been identified in [5].

As mentioned, a wait time of 10 seconds for a response is just as irritating as the CA constantly cutting into you mid-turn [4]. Multiple baselines with silence thresholds between 50ms and 6000ms were therefore considered, to ensure multiple trade-offs were included in the baseline.

Apart from one case (linguistic features only, 500ms silence threshold), every single model beat the baselines. Using only the linguistic or only the acoustic features didn't make much of a difference, but performance was always best when the model used both sets of features together. The best overall system had a latency of 1195ms and a cut-in rate of just 18%.


[1] states that we predict EOT from multi-modal signals including: prosody, semantics, syntax, gesture and eye-gaze.

Instead of labelling 10ms windows (as speech, MTPs or EOTs), traditional models predict whether a speaker will continue speaking (HOLD) or has finished their turn (SHIFT), but they only do this when they detect a pause. One major problem with this traditional approach is that a backchannel is neither a HOLD nor a SHIFT, but one of these is predicted anyway.

LSTMs have been used to make predictions continuously at 50ms intervals, and these models outperform traditional EOT models, and even humans, when applied to HOLD/SHIFT predictions. Their hidden layers allow them to learn long-range dependencies, but it is unknown exactly which features influence the performance the most.

In [1], the new system completes three different turn-taking prediction tasks: (1) prediction at pauses, (2) prediction at onset and (3) prediction at overlap.

Prediction at Pauses is the standard prediction that takes place at brief pauses in the interaction to predict whether there will be a HOLD or SHIFT. Essentially, when there is a pause above a threshold time, the person with the highest average output probability (score) is predicted to speak next. This classification model is evaluated with weighted F-scores.

Prediction at Onsets classifies the utterances during speech, not at a pause. This model is slightly different, however, as it predicts whether the currently ongoing utterance will be short or long. Again, as this is also a classifier, it was evaluated using weighted F-scores.

Prediction at Overlap is introduced for the first time in this paper. This is essentially a HOLD/SHIFT prediction again, but made when an overlapping period of at least 100ms occurs. The decision to HOLD (continue speaking) is predicted when the overlap is a backchannel, and SHIFT when the system should stop speaking. This again was evaluated using weighted F-scores.

Video: an example of predicted turn-taking in action.

As mentioned above, we don’t know exactly which features we use to predict when it is our turn to speak. [1] used many features in different arrangements to distinguish which are most useful. The features used were as follows:

Acoustic features are low level descriptors that include loudness, shimmer, pitch, jitter, spectral flux and MFCCs. These were extracted using the OpenSmile toolkit.

Linguistic features were investigated at two levels: part-of-speech (POS) and word. The literature often suggests that POS tags are good at predicting turn-switches, but POS tags have to be extracted from words (from an ASR system), so it is useful to check whether this extra processing is actually needed.

Using words instead of POS would be a great advantage for systems that need to run in real time.

Phonetic features were output from a deep neural network (DNN) that classifies senones.

Voice activity was included in their transcriptions, so it was also used as a feature.

So what features were the most useful for EOT prediction according to [1]?

Acoustic features were great for EOT prediction; all but one experiment's best results included acoustic features. This was particularly the case for prediction at overlap.

Words mostly outperformed POS tags, apart from prediction at onset, so use POS tags if you want to predict utterance length (like backchannels).

In all cases, including voice activity improved performance.

In terms of acoustic features, the most important features were loudness, F0, low order MFCCs and spectral slope features.

Overall, the best performance was obtained by using voice activity, acoustic features and words.

As mentioned, the fact that using words instead of POS tags leads to better performance is brilliant for faster processing. This of course is beneficial for real-time incremental prediction – just like what we humans do.

All of these features are not just used to detect when we can next speak but are even used to guide what we say. We will expand on what we are saying, skip details or change topic depending on how engaged the other person is with what we are saying.

Therefore to model natural human conversation, it is important for a CA to measure engagement.

Engagement

Engagement shows interest and attention to a conversation and, as we want users to stay engaged, it influences the dialogue strategy of the CA. This optimisation of the user experience all has to be done in real time to keep the conversation fluid.

[2] detects the following signals to measure engagement: nodding, laughter, verbal backchannels and eye gaze. The fact that these signals show attention and interest is relatively common sense, but they were learned from a large corpus of human-robot interactions.


[2] doesn’t just focus on recognising social signals but also on creating an engagement recognition model.

This experiment was run in Japan, where nodding is particularly common. Seven features were extracted to detect nodding: (per frame) the yaw, roll and pitch of the person's head, and (per 15 frames) the average speed, average velocity, average acceleration and range of the person's head movement.

Their LSTM model outperformed the other approaches to detect nodding across all metrics.

Smiling is often used to detect engagement but, to avoid using a camera (they use microphones + Kinect), laughter is detected instead. Each model was tasked to classify whether an inter-pausal unit (IPU) of sound contained laughter or not. A two-layer DNN trained on both prosodic and linguistic features performed the best, but other spectral features could be used instead of the linguistic features (which are not necessarily available from the ASR) to improve the model.

Similarly to nodding, verbal backchannels are more frequent in Japan (called aizuchi). Additionally in Japan, verbal backchannels are often accompanied by head movements but only the sound was provided to the model. Similar to the laughter detection, this model classifies whether an IPU is a backchannel or the person is starting their turn (especially difficult when barging in). The best performing model was found to be a random forest, with 56 estimators, using both prosody and linguistic features. The model still performed reasonably when given only prosodic features (again because linguistic features may not be available from the ASR).

Finally, eye gaze is commonly known as a clear sign of engagement. Based on the inter-annotator agreement, looking at Erica's head (the robot embodiment in this experiment) for 10 seconds continuously was considered engagement. Anything less than 10 seconds was therefore a negative case.

Erica: source

The information from the Kinect sensor was used to calculate a vector from the user's head orientation, and the user was considered ‘looking at Erica’ if that vector collided with Erica's head (plus 30cm to accommodate error). This geometry-based model worked relatively well, but the position of Erica's head was estimated, so this will have affected results. It is expected that this model will improve significantly when exact values are known.

This paper doesn’t aim to create the best individual systems but instead hypothesises that these models in conjunction will perform better than the individual models at detecting engagement.


The ensemble of the above models was used as a binary classifier (either a person was engaged or not). In particular, they built a hierarchical Bayesian binary classifier which judged whether the listener was engaged from the 16 possible combinations of outputs from the 4 models above.

From the annotators, a model was built to deduce which features are more or less important when detecting engagement. Some annotators found laughter to be a particularly important factor, for example, whereas others did not. They found that inputting a character variable with three different character types improved the model's performance.

Additionally, including the previous engagement of a listener also improved the model. This makes sense as someone that is not interested currently is more likely to stay uninterested during your next turn.

Measuring engagement can only really be done when a CA is embodied (eye contact with Siri is non-existent for example). Social robots are being increasingly used in areas such as Teaching, Public Spaces, Healthcare and Manufacturing. These can all contain spoken dialogue systems but why do they have to be embodied?


Embodiment

People will travel across the globe to have a face-to-face meeting when they could just phone [5]. We don’t like to interact without seeing the other person as we miss many of the signals that we talked about above. In today’s world we can also video-call but this is still avoided when possible for the same reasons. The difference between talking on the phone or face-to-face is similar to the difference between talking to Siri and an embodied dialogue system [5].

Current voice systems cannot show facial expressions, indicate attention through eye contact or move their lips. Lip reading is obviously very useful for those with impaired hearing but we all lip read during conversation (this is how we know what people are saying even in very noisy environments).

Not only can a face output these signals, it also allows the system to detect who is speaking, who is paying attention, who the actual people are (Rory, Jenny, etc…) and recognise their facial expressions.

Robot faces come in many forms, however, and some are better suited to conversation than others. Most robot faces, such as the face of Nao, are very static and therefore cannot show the wide range of emotion through expression that we do.

Nao: source

Some more abstract robot face depictions, such as Jibo, can show emotion using shapes and colour but some expressions must be learned.

Jibo: source

We know how to read a human face, so it makes sense to show a human face. Hyper-realistic robot faces, like Sophia’s, do exist but they are a bit creepy and very expensive.

Sophia: source

They are very realistic but just not quite right, which makes conversation very uncomfortable. To combat this, avatars have been created to hold conversations on screen.

source

These can mimic humans relatively closely without being creepy, as they are not physical robots. However, this is almost like a Skype call, and the method suffers from the ‘Mona Lisa effect’: in multi-party dialogue, it is impossible for an on-screen avatar to look at one person and not the others. Either the avatar is looking ‘out’ at all parties or away at no one.

Gabriel Skantze (the presenter of [5], to be clear) is the co-founder of Furhat Robotics and argues that Furhat strikes the best balance between all of these approaches. Furhat has been developed for conversational applications such as a receptionist, social trainer, therapist, interviewer, etc.

source

Furhat needs to know where it should be looking, when it should speak, what it should say and what facial expressions it should be displaying [5].

Finally (for now), once a CA is embodied, its dialogues need to be grounded in real time with the real world. The example given in [3] is a CA embodied in an industrial machine, which [5] states is becoming more and more common.

source

Fluid, Incremental Grounding Strategies

For human-robot conversation to feel natural, it must be grounded in a fluid manner [3].

With non-incremental grounding, users can give positive feedback and repair but only after the robot has shown full understanding of the request. If you ask a robot to move an object somewhere for example, you must wait until the object is moved before you can correct it with an utterance like “no, move the red one”. No overlapping speech is possible so actions must be reversed entirely if a repair is needed.

With incremental grounding, overlapping speech is still not possible, but feedback can be given at more regular intervals: rather than waiting for the entire task to finish, feedback can be given at sub-task boundaries. “No, move the red one” can be said just after the robot picks up a blue object, repairing quickly. In the non-incremental example above, the blue object would already have been placed at its target location before the repair could be given, forcing a reversal of the whole task! This is much more efficient, but still not fluid like human-human interaction.

Fluid incremental grounding is possible if overlaps are processed. Allowing and reasoning over concurrent speech and action is much more natural. Continuing with our repair example, “no, move the red one” can be said just as the robot is about to pick up the blue object; no task has to be completed and then reversed because concurrency is allowed. The pickup can be aborted and the red object picked up fluidly while you say what to do with it.
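
To make the idea concrete, here is a toy sketch of fluid grounding using Python’s asyncio. It is my own illustration rather than the system from [3]: the pick-up action runs concurrently with incoming speech, so a repair arriving mid-action simply cancels and redirects it instead of forcing a full reversal.

```python
# Toy illustration of fluid, incremental grounding (not the system from [3]):
# the pick-up action and incoming speech are handled concurrently, so a repair
# like "no, move the red one" can abort and redirect the action mid-way.
import asyncio

async def pick_up(colour):
    print(f"robot: reaching for the {colour} object...")
    await asyncio.sleep(2)              # stands in for the physical action
    print(f"robot: picked up the {colour} object")

async def main():
    action = asyncio.create_task(pick_up("blue"))

    await asyncio.sleep(0.5)            # mid-action, a repair arrives
    print('user: "no, move the red one"')
    action.cancel()                     # abort instead of completing and reversing
    try:
        await action
    except asyncio.CancelledError:
        print("robot: aborting the current pick-up")

    await pick_up("red")                # ground the repaired request fluidly

asyncio.run(main())
```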

[2]

To move towards this more fluid grounding, real-time processing needs to take place. Not only does the system need to process utterances word by word, it also needs to monitor real-time context such as the robot’s current state and planned actions (both of which can change dynamically over the course of an utterance or even a single word).

To handle both repairs and confirmations, the robot must know when it has sufficiently shown what it is doing. It needs to know what the user is confirming and, even more importantly, what needs to be repaired.

Conclusion

In this brief overview, I have covered just a tiny amount of the current work towards more natural conversational systems.

Even if turn-taking prediction, engagement measurement, embodiment and fluid grounding were all perfected, CAs still would not hold conversations the way we humans do. I plan to write more of these overviews over the next few years, so look out for them if you are interested.

In the meantime, please do comment with discussion points, critique my understanding and suggest papers that I (and anyone reading this) may find interesting.


Beginning to Replicate Natural Conversation in Real Time was originally published in Wallscope on Medium, where people are continuing the conversation by highlighting and responding to this story.
