BioHackathon 2019


Petros and Alasdair hacking at the BioHackathon

I once again attended the European BioHackathon which took place in November outside of Paris. It was another intense week with 150 developers from across Europe (and beyond) working together on 34 topics.

Bioschemas was well represented in the topics and the hacking activities of the week. By the end of the week we had an approach for marking up Intrinsically Disordered Proteins and Rare Disease resources. We also had several resources with newly deployed or improved Bioschemas markup.

For a fuller overview of the event and outcomes, please see the ELIXIR news item.

SPARQL For Beginners

SPARQL (SPARQL Protocol And RDF Query Language) is a W3C standard. The protocol part is usually only an issue for people writing programs that pass SPARQL queries back and forth between different machines. For most people, SPARQL's greatest value is as a query language for RDF, another W3C standard. RDF describes data using collections of three-part statements such as "emp3 has a title of Vice President".

We call each statement a triple. A triple consists of three parts: a Subject, a Predicate and an Object.

We can also think of the Subject as an Entity Identifier, the Predicate as an Attribute Name and the Object as an Attribute Value.

The subject and predicate are actually represented using URIs to make it absolutely clear what we are talking about. URIs (Uniform Resource Identifiers) are a kind of URL and often look like them, but they are not locators or addresses; they are just identifiers. In our example, emp3 is a person who works in a specific company, so we can represent them using a URI like http://www.snee.com/hr/emp3, and the title predicate is also a URI from a published ontology (in our case, the vCard business card ontology).

The object, or third part of a triple, can also be a URI. This way, the same resource can be the object of some triples and the subject of others, which lets you connect triples up into networks of data called graphs.
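Suppose, for example, we also recorded who each employee reports to (sn:manager here is a made-up predicate for illustration, not part of the employee data below). A URI in the object position lets one resource point at another:

    sn:emp2   vcard:title   "Engineer" .         # the object is a literal
    sn:emp2   sn:manager    sn:emp3 .            # the object is a URI
    sn:emp3   vcard:title   "Vice President" .

Because sn:emp3 appears as the object of one triple and the subject of another, the statements link up into a small graph.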

To make URIs simpler to write, the popular Turtle syntax for RDF often shortens them by having an abbreviated prefix stand in for everything in the URI before the last part.
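For example, after a prefix declaration, the full URI form and the abbreviated Turtle form say exactly the same thing:

    # Without prefixes:
    <http://www.snee.com/hr/emp3> <http://www.w3.org/2006/vcard/ns#title> "Vice President" .

    # With prefixes:
    @prefix sn:    <http://www.snee.com/hr/> .
    @prefix vcard: <http://www.w3.org/2006/vcard/ns#> .

    sn:emp3 vcard:title "Vice President" .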

Any data can be represented as a collection of triples. For example, we can usually represent each row of a table by using the row identifier as the subject, the column name as the predicate and the cell value as the object.

Convert a Relational Database Table to RDF Statements

    ID    given-name  family-name  title           hireDate    completedOrientation
    emp1  Heidi       Peter        CEO             2016-10-21  2016-10-30
    emp2  John        Peter        Engineer        2016-10-28  2015-01-30
    emp3  Imran       Asif         Vice President  2014-12-03
    emp4  Duke        Walliam      Sales           2015-11-10

We can convert the above table into RDF triple statements. The following statements express the table in Turtle format.
 
                                
    @prefix vcard: <http://www.w3.org/2006/vcard/ns#> .
    @prefix sn: <http://www.snee.com/hr/> .
    

    sn:emp1   vcard:given-name        "Heidi" .
    sn:emp1   vcard:family-name       "Peter" .
    sn:emp1   vcard:title             "CEO" .
    sn:emp1   sn:hireDate             "2016-10-21" .
    sn:emp1   sn:completedOrientation "2016-10-30" .

    sn:emp2   vcard:given-name         "John" .
    sn:emp2   vcard:family-name        "Peter" .
    sn:emp2   sn:hireDate              "2016-10-28" .
    sn:emp2   vcard:title              "Engineer" .
    sn:emp2   sn:completedOrientation  "2015-01-30" .

    sn:emp3   vcard:given-name          "Imran" .
    sn:emp3   vcard:family-name         "Asif" .
    sn:emp3   sn:hireDate               "2014-12-03" .
    sn:emp3   vcard:title               "Vice President" .

    sn:emp4   vcard:given-name          "Duke" .
    sn:emp4   vcard:family-name         "Walliam" .
    sn:emp4   vcard:title               "Sales" .
    sn:emp4   sn:hireDate               "2015-11-10" .

These triples capture every fact in the table. Some of the property names here come from the vCard vocabulary. For those properties that are not available in the vCard vocabulary, I made up my own property names using my own domain name. RDF makes it easy to mix and match standard vocabularies and customisations.

Let’s say that one employee in the above table, John Peter, completed his employee orientation course twice. Storing both of his completed orientation values is no problem in RDF; we simply add another triple. Storing John’s second completed orientation value in a relational database table would have been a lot more difficult.

 
    sn:emp2   vcard:given-name          "John" .
    sn:emp2   vcard:family-name         "Peter" .
    sn:emp2   sn:hireDate               "2016-10-28" .
    sn:emp2   vcard:title               "Engineer" .
    sn:emp2   sn:completedOrientation   "2015-01-30" .
    sn:emp2   sn:completedOrientation   "2015-03-15" .

SPARQL Example Queries


WHERE Clause

Let’s look at a simple SPARQL query that retrieves some of the data from the above RDF triples.
Query 1: We want a list of all employees whose family name is Peter.

                                
    PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
    
    SELECT ?person
    WHERE
    {
      ?person vcard:family-name "Peter" .
    }

Because of the Turtle-style syntax, we can define prefixes at the start of a SPARQL query, so we don’t have to write absolute URIs in the query body. For most SPARQL queries it’s best to look at the WHERE clause first, because it describes which triples we want to pull from the dataset we are querying. The WHERE clause does this with one or more triple patterns, which are like triples with variables as wildcards substituted into one, two or all three of each triple’s parts.

In Query 1, the single triple pattern will match against triples whose predicate is the family-name property from the vCard vocabulary, whose object is the string "Peter", and whose subject is anything at all, because in the subject position this triple pattern has a variable, which I named person.

SELECT Clause

The SELECT clause indicates which variables’ values we want listed after the query executes. Query 1 only has one variable, so that’s the one we want to see.

When the query executes, it finds two triples that match the specified pattern in the RDF dataset and binds the person variable to their subjects: sn:emp1 and sn:emp2. The following listing shows the two matching triples in context.

 
    @prefix vcard: <http://www.w3.org/2006/vcard/ns#> .
    @prefix sn: <http://www.snee.com/hr/> .

    sn:emp1   vcard:given-name        "Heidi" .
    sn:emp1   vcard:family-name       "Peter" .
    sn:emp1   vcard:title             "CEO" .
    sn:emp1   sn:hireDate             "2016-10-21" .
    sn:emp1   sn:completedOrientation "2016-10-30" .

    sn:emp2   vcard:given-name         "John" .
    sn:emp2   vcard:family-name        "Peter" .
    sn:emp2   sn:hireDate              "2016-10-28" .
    sn:emp2   vcard:title              "Engineer" .
    sn:emp2   sn:completedOrientation  "2015-01-30" .
    sn:emp2   sn:completedOrientation  "2015-03-15" .

    sn:emp3   vcard:given-name          "Imran" .
    sn:emp3   vcard:family-name         "Asif" .
    sn:emp3   sn:hireDate               "2014-12-03" .
    sn:emp3   vcard:title               "Vice President" .

    sn:emp4   vcard:given-name          "Duke" .
    sn:emp4   vcard:family-name         "Walliam" .
    sn:emp4   vcard:title               "Sales" .
    sn:emp4   sn:hireDate               "2015-11-10" .

Let’s execute Query 1 and find out who these Peters are. We get the following result:

    person
    -------
    sn:emp1
    sn:emp2

From these results, we can see that emp1 and emp2 are just identifiers; this doesn’t give us a very meaningful result.
Query 2: Now let’s add a second triple pattern to the WHERE clause that matches on the given name of the employee matched by the first triple pattern, and stores that value in a new givenName variable.

                                
    PREFIX vcard: <http://www.w3.org/2006/vcard/ns#> 

    SELECT ?person ?givenName
    WHERE 
    { 
       ?person vcard:family-name "Peter" . 
       ?person vcard:given-name ?givenName .
    }

To explain Query 2: we add the new givenName variable to the SELECT clause because now we want it in the result. The SPARQL query processor finds each triple that matches the first triple pattern and binds its subject to the person variable. It then looks for triples that match the second pattern and have that same subject. In other words, the processor retrieves the given name of every employee whose family name is Peter. So when we run Query 2, we see given name values in the result:

    person     givenName
    ---------  ---------
    sn:emp1    "Heidi"
    sn:emp2    "John"

Query 3: Let’s retrieve the given name, family name and hire date of all the employees.
We can do this with a WHERE clause that has three triple patterns, one for each piece of information we want to retrieve.


    PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
    PREFIX sn: <http://www.snee.com/hr/>

    SELECT ?givenName ?familyName ?hireDate
    WHERE
    {
        ?person vcard:given-name ?givenName .
        ?person vcard:family-name ?familyName .
        ?person sn:hireDate ?hireDate .
    }

When we run Query 3, we get the following results:

    givenName  familyName  hireDate
    ---------  ----------  ------------
    "Heidi"    "Peter"     "2016-10-21"
    "John"     "Peter"     "2016-10-28"
    "Imran"    "Asif"      "2014-12-03"
    "Duke"     "Walliam"   "2015-11-10"

FILTER Keyword

If we want to narrow down the results based on some condition, we can use a FILTER pattern.
Query 4: Let’s say we want a list of employees who were hired before November 1st, 2015, so the FILTER pattern specifies that we only want hireDate values that are less than "2015-11-01".


    PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
    PREFIX sn: <http://www.snee.com/hr/>

    SELECT ?givenName ?familyName ?hireDate
    WHERE
    {
        ?person vcard:given-name ?givenName .
        ?person vcard:family-name ?familyName .
        ?person sn:hireDate ?hireDate .
        FILTER(?hireDate < "2015-11-01")
    }

When we run Query 4, we get the following result (the dates are plain strings in the ISO 8601 format, which sorts chronologically, so the string comparison behaves like a date comparison):

    givenName  familyName  hireDate
    ---------  ----------  ------------
    "Imran"    "Asif"      "2014-12-03"
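If the dates were stored as typed literals rather than plain strings, the FILTER would compare xsd:date values instead. A sketch, assuming the hireDate values were written as "2016-10-21"^^xsd:date and so on:

    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

    ...
    FILTER(?hireDate < "2015-11-01"^^xsd:date)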

Query 5: Let’s remove the FILTER condition and list the employees and their completed orientation values instead of their hire date values.


    PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
    PREFIX sn: <http://www.snee.com/hr/>

    SELECT ?givenName ?familyName ?oDate
    WHERE
    {
        ?person vcard:given-name ?givenName .
        ?person vcard:family-name ?familyName .
        ?person sn:completedOrientation  ?oDate .
    }

When we run Query 5, we get the following results:

    givenName  familyName  oDate
    ---------  ----------  ------------
    "Heidi"    "Peter"     "2016-10-30"
    "John"     "Peter"     "2015-01-30"
    "John"     "Peter"     "2015-03-15"

We see only Heidi and John’s orientation dates, but the other employees don’t appear at all in the results. Why not? Let’s look more closely at the query’s triple patterns. The query first looks for a triple with a given-name value, then a triple with the same subject but a family-name value, and then another triple with the same subject and a completed-orientation value. John and Heidi each have triples that match all three triple patterns, but Imran and Duke cannot match all three. You may have noticed that John actually has two triples matching the third pattern, so the query has two rows of results for him, one for each completed orientation value.

OPTIONAL Keyword

Query 6: Let’s take another example: list all employees and, if they have any, their completed orientation values.
We can tell the query processor that matching on the third triple pattern is OPTIONAL.


    PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
    PREFIX sn: <http://www.snee.com/hr/>

    SELECT ?givenName ?familyName ?oDate
    WHERE
    {
        ?person vcard:given-name ?givenName .
        ?person vcard:family-name ?familyName .
        OPTIONAL { ?person sn:completedOrientation  ?oDate . }
    }

This query asks for everyone with a given name and a family name and, if they have one, their completed orientation value. It shows the following result:

    givenName  familyName  oDate
    ---------  ----------  ------------
    "Heidi"    "Peter"     "2016-10-30"
    "John"     "Peter"     "2015-01-30"
    "John"     "Peter"     "2015-03-15"
    "Imran"    "Asif"
    "Duke"     "Walliam"

NOT EXISTS Keyword

Query 7: Next, let’s say that Heidi is scheduling a new orientation meeting and wants to know whom to invite; in other words, she wants to list all employees who do not have a completed orientation value.
Her query asks for everyone’s given and family names, but only for employees for whom no triple exists listing a completed orientation value. We do this with the keywords NOT EXISTS.


    PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
    PREFIX sn: <http://www.snee.com/hr/>

    SELECT ?givenName ?familyName
    WHERE
    {
        ?person vcard:given-name ?givenName .
        ?person vcard:family-name ?familyName .
        FILTER NOT EXISTS { ?person sn:completedOrientation  ?oDate . }
    }

When we run Query 7, we get the following results:

    givenName  familyName
    ---------  ----------
    "Imran"    "Asif"
    "Duke"     "Walliam"

BIND Keyword

So far, the only way we have seen to store a value in a variable is to include the variable in a triple pattern for the query processor to match against some part of a triple. With the BIND keyword, we can store whatever we like in a variable.


    PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
    PREFIX sn: <http://www.snee.com/hr/>

    SELECT ?givenName ?familyName ?someVariable
    WHERE {
        ?person vcard:given-name ?givenName .
        ?person vcard:family-name ?familyName .
        BIND("some value" AS ?someVariable)
    }

When we run the above query, we get the following results:

    givenName  familyName  someVariable
    ---------  ----------  ------------
    "Heidi"    "Peter"     "some value"
    "John"     "Peter"     "some value"
    "Imran"    "Asif"      "some value"
    "Duke"     "Walliam"   "some value"

This becomes especially useful when the BIND expression uses values from other variables and calls some of SPARQL’s broad range of functions to create a new value. In the following query, the BIND statement uses SPARQL’s CONCAT function to concatenate the given-name value stored by the first triple pattern, a space, and the family-name value stored by the second triple pattern. It stores the result of this concatenation in a new variable called fullName.


    PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
    PREFIX sn: <http://www.snee.com/hr/>

    SELECT ?givenName ?familyName ?fullName
    WHERE {
        ?person vcard:given-name ?givenName .
        ?person vcard:family-name ?familyName .
        BIND(concat(?givenName," ",?familyName) AS ?fullName)
    }

When we run the above query, we get the following results, with a new full name value for each employee:

    givenName  familyName  fullName
    ---------  ----------  --------------
    "Heidi"    "Peter"     "Heidi Peter"
    "John"     "Peter"     "John Peter"
    "Imran"    "Asif"      "Imran Asif"
    "Duke"     "Walliam"   "Duke Walliam"

CONSTRUCT Clause

All the queries we have seen so far have been SELECT queries, which are like SQL SELECT statements. A SPARQL CONSTRUCT query uses the same kind of WHERE clause that a SELECT query can use, but it uses the values stored in the variables to create new triples.


    PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
    PREFIX sn: <http://www.snee.com/hr/>

    CONSTRUCT { ?person vcard:fn ?fullName . }
    WHERE {
        ?person vcard:given-name ?givenName .
        ?person vcard:family-name ?familyName .
        BIND(concat(?givenName," ",?familyName) AS ?fullName)
    }

When we run the above query, we get the following new triples:

    sn:emp1   vcard:fn   "Heidi Peter" .
    sn:emp2   vcard:fn   "John Peter" .
    sn:emp3   vcard:fn   "Imran Asif" .
    sn:emp4   vcard:fn   "Duke Walliam" .

Note how the triple pattern describing the triple to construct is inside curly braces. These braces can enclose multiple triple patterns, which is common practice when, for example, a CONSTRUCT query takes data conforming to one model and creates triples conforming to another. CONSTRUCT queries are great for data integration projects.


    PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
    PREFIX sn: <http://www.snee.com/hr/>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

    CONSTRUCT {
        ?person rdf:type foaf:Person .
        ?person foaf:givenName ?givenName .
        ?person foaf:familyName ?familyName .
        ?person foaf:name ?fullName .
    }
    WHERE {
        ?person vcard:given-name ?givenName .
        ?person vcard:family-name ?familyName .
        BIND(concat(?givenName," ",?familyName) AS ?fullName)
    }

When we run the above query, we get the following new triples.


            @prefix vcard: <http://www.w3.org/2006/vcard/ns#> .
            @prefix sn: <http://www.snee.com/hr/> .
            @prefix foaf: <http://xmlns.com/foaf/0.1/> .
            @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

            sn:emp1   rdf:type          foaf:Person .
            sn:emp1   foaf:familyName   "Peter" .
            sn:emp1   foaf:givenName    "Heidi" .
            sn:emp1   foaf:name         "Heidi Peter" .

            sn:emp2   rdf:type          foaf:Person .
            sn:emp2   foaf:familyName   "Peter" .
            sn:emp2   foaf:givenName    "John" .
            sn:emp2   foaf:name         "John Peter" .

            sn:emp3   rdf:type          foaf:Person .
            sn:emp3   foaf:familyName   "Asif" .
            sn:emp3   foaf:givenName    "Imran" .
            sn:emp3   foaf:name         "Imran Asif" .

            sn:emp4   rdf:type          foaf:Person .
            sn:emp4   foaf:familyName   "Walliam" .
            sn:emp4   foaf:givenName    "Duke" .
            sn:emp4   foaf:name         "Duke Walliam" .

SPARQL Can Do More

SPARQL can do a lot more than what we have seen here. You can:
  • Use datatypes and language tags
  • Sort and aggregate query results
  • Add, delete and update data
  • Retrieve JSON, XML and delimited versions of query results from query processors
  • Send queries to remote private or public data collections, and there are quite a few of those out there
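As a taste of sorting and aggregation on the same employee data, the query below counts hires per year and orders the output. SUBSTR, COUNT, GROUP BY and ORDER BY are all standard SPARQL 1.1; the sn: prefix is the same one used throughout.

    PREFIX sn: <http://www.snee.com/hr/>

    SELECT ?year (COUNT(?person) AS ?hires)
    WHERE {
        ?person sn:hireDate ?hireDate .
        BIND(SUBSTR(?hireDate, 1, 4) AS ?year)
    }
    GROUP BY ?year
    ORDER BY ?year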

References


DuCharme, B., 2013. Learning SPARQL: Querying and Updating with SPARQL 1.1. O’Reilly Media, Inc.

Seminar: Data Quality Issues in Current Nanopublications

Speaker: Imran Asif
Date: Wednesday 18 September 2019
Time: 11:15 – 12:15
Venue: CM T.01 EM1.58

Imran will give a practice version of his workshop paper that will be given at Research Objects 2019 (RO2019).

Abstract: Nanopublications are a granular way of publishing scientific claims together with their associated provenance and publication information. More than 10 million nanopublications have been published by a handful of researchers covering a wide range of topics within the life sciences. We were motivated to replicate an existing analysis of these nanopublications, but then went deeper into the structure of the existing nanopublications. In this paper, we analyse the usage of nanopublications by investigating the distribution of triples in each part and discuss the data quality issues raised by this analysis. From this analysis we argue that there is a need for the community to develop a set of community guidelines for the modelling of nanopublications.

BridgeDb GSoC 2019 Student


During the summer, BridgeDb has had a Google Summer of Code student working on extending the system to work with secondary identifiers; these are alternative identifiers for a given resource.

The student Manas Awasthi has maintained a blog of his experiences. Below are some excerpts of his activity.

Google Summer of Code 2019: Dream to Reality

Manas Awasthi
May 28 · 3 min read
Google Summer of Code is an annual Google program which encourages open source contribution from students. The term I was introduced to by my seniors in my freshman year. Having no clue about open source, I started gathering knowledge about ‘How to contribute to open source projects?’ Then I came across version control, being a freshman it was an unknown territory for me. I started using Github for my personal projects which gave me a better understanding of how to use it. Version Control Service was off the checklists. By the time all this was done Google Summer of Code 2018 was announced.

Google Summer of Code 2019: Dream to Reality

Manas Awasthi
Jun 12 · 3 min read
The Coding Period: The First Two Weeks
The coding period of Google Summer of Code started on 27th of May, at the time of publishing it’s been more than 2 weeks, here I am writing this blog to discuss what I have done over the past two weeks, and what a ride it has been already. Plenty of coding along with plenty of learning. From the code base to the test suite.

Google Summer of Code 2019: Dream to Reality

Manas Awasthi
Jun 22 · 3 min read
The Coding Period: Week 3 — Week 4
Hola Amigos!!! Let’s discuss my progress through week 3 and 4 of GSoC’s coding period. So the major part of what I was doing this week was to add support for the secondary identifier (err!!! whats that) to BridgeDb.

Google Summer of Code 2019: Dream to Reality

Manas Awasthi
Aug 21 · 3 min read
Hey Folks, this is probably the last blog in this series outlining my journey as a GSoC student. In this blog I’ll go through the functionality I have added over the summer and why the end-users should use it.

FAIRplus Newsletter 2


Below is the opening excerpt from the second FAIRplus Newsletter:

Though FAIRplus has been running for just six months, there is already a lot to talk about. Our two task-focused ‘Squads’ have booted up and begun the FAIRification of the first set of four pilot datasets, our industry partners in EFPIA organised the first ‘Bring Your Own Data’ workshop in London, and we’ve been busy explaining our goals and answering many questions from our stakeholders.

You can read about these activities in this second FAIRplus newsletter. On top of that, we bring you an update on upcoming events, news from our partners and also a new section ‘Track our progress’ where you can check for yourself how we are progressing towards our goals and what Deliverables and reports we’ve recently submitted.

Finally, we’ve launched our own LinkedIn page. Besides regular updates on our activities, it will also feature job opportunities and news from the FAIRplus partners.

The next FAIRplus Newsletter will come out in November 2019. In it we’ll present the FAIRplus Fellowship programme, report on the FAIR workshop in October and more.

We wish you a relaxing Summer and look forward to meeting you at our events!

Pronto – Find Predicates Fast


If you work with linked data or the semantic web, you understand how dull digging through ontologies to find concepts and predicates can be. At Wallscope we understand this too, so we created Pronto, a free tool that makes this work easier and more efficient. (If you are new to the semantic web and linked data, I suggest you first have a look at the type of challenges it aims to solve.)

The Problem. The objective of an ontology is to be reused. Although this is a simple concept, it can prove inconvenient in the long run. The many existing ontologies make searching for concepts and predicates tedious, labour-intensive and time-consuming: one has to iteratively and manually inspect a number of ontologies until a suitable ontological component is found. At Wallscope this issue affects us directly, since our work includes building data exploration systems that connect independent and diverse data sources. So we started thinking:
It would be much easier to search through all ontologies — or at least the main ones — at the same time.
As a result, we decided to invest in the creation of Pronto with the aim to overcome this challenge.
Example search of a predicate with Pronto.
The Solution. Pronto allows developers to search for concepts and predicates among a number of ontologies, originally selected from the prefix.cc user-curated “popular” list, along with some others we use. These include:
  • rdf and rdfs
  • foaf
  • schema
  • geo
  • dbo and dbp
  • owl
  • skos
  • xsd
  • vcard
  • dcat
  • dc and dcterms
  • madsrdf
  • bflc
Searching for a concept or a predicate retrieves results from the above ontologies, ordered by relevance. Try it here to see how Pronto works in practice.

Thanks for reading. We would be interested to hear your feedback or suggestions for other ontologies to include. If you find Pronto useful, give the article some claps (up to 50 if you hold the button 😄) so that more people can benefit from this tool!
Pronto - Find Predicates Fast was originally published in Wallscope on Medium, where people are continuing the conversation by highlighting and responding to this story.

Ambient Assisted Living (AAL) Summer School

SICSA is sponsoring the Ambient Assisted Living (AAL) Summer School, which is taking place on 6th–8th August at Heriot-Watt University.

The SICSA Ambient Assisted Living (AAL) Summer School is Scotland’s first ever summer school designed to allow students to explore key concepts for the design of advanced AAL systems.

The program includes presentations from industry representatives and healthcare organisations, in addition to lectures and tutorials on sensing, linked data, machine learning and robotics for AAL applications.

The summer school is open to students and research staff, and employees of charities, non-profit organisations and companies with relevant backgrounds. Financial support is available for students at SICSA institutes to cover their on-campus accommodation and lunch costs, generously sponsored by SICSA, SICSA CPS and AI themes, and Nexus.

The deadline for applications is 30th June. Attendance will be limited, so please apply early.

Full details of the SICSA AAL Summer School can be found here.

If you have any questions or you would like to contact the Organisers, please see here.

Comparison of Linked Data Triplestores: A New Contender

First Impressions of RDFox while the Benchmark is Developed

Note: This article is not sponsored. Oxford Semantic Technologies let me try out the new version of RDFox and are keen to be part of the future benchmark.

After reading some of my previous articles, Oxford Semantic Technologies (OST) got in touch and asked if I would like to try out their triplestore called RDFox.

In this article I will share my thoughts and why I am now excited to see how they do in the future benchmark.

They have just released a page on which you can request your own evaluation license to try it yourself.

Contents

Brief Benchmark Catch-up
How I Tested RDFox
First Impressions
Results
Conclusion

source

Brief Benchmark Catch-up

In December I wrote a comparison of existing triplestores on a tiny dataset. I quickly learned that there were too many flaws in my methodology for the results to be truly comparable.

In February I then wrote a follow-up in which I described many of those flaws and listed the details I will have to pay attention to while developing a proper benchmark.

This benchmark is currently in development. I am now working with developers and academics, and talking with a high-performance computing centre to get access to the infrastructure needed to run at scale.

How I Tested RDFox

In the above articles I evaluated five triplestores. They were (in alphabetical order) AnzoGraph, Blazegraph, GraphDB, Stardog and Virtuoso. I would like to include all of these in the future benchmark and now RDFox as well.

Obviously my previous evaluations are not completely fair comparisons (hence the development of the benchmark) but the last one can be used to get an idea of whether RDFox can compete with the others.

For that reason, I loaded the same data and ran the same queries as in my larger comparison to see how RDFox fared. I of course kept all other variables the same such as the machine, using the CLI to query, same number of warm up and hot runs, etc…

First Impressions

RDFox is very easy to use and well-documented. You can initialise it with custom scripts which is extremely useful as I could start RDFox, load all my gzipped turtle files in parallel, run all my warm up queries and run all my hot queries with one command.

RDFox is an in-memory solution, which explains many of the differences in results, but it also has a very nice rule system that can be used to precompute results used by later queries. These rules are evaluated in advance, not when you send a query.

This allows the rules to be used to automatically maintain the consistency of your data as it is added or removed. The rules themselves can even be added or removed during the lifetime of the database.

Note: Queries 3 and 6 use these custom rules. I highlight this on the relevant queries and this is one of the reasons I didn’t just add them to the last article.

You can use these rules for a number of things, for example to precompute alternate property paths (if you are unfamiliar with those, I cover them in this SPARQL tutorial). You could do this by defining a rule that example:predicate represents example:pred1 and example:pred2, so that:

SELECT ?s ?o
WHERE {
?s example:predicate ?o .
}

This would return the triples:

person:A example:pred1 colour:blue .
person:A example:pred2 colour:green .
person:B example:pred2 colour:brown .
person:C example:pred1 colour:grey .

This makes the use of alternate property paths less necessary.
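For comparison, without such a rule the same rows could be retrieved directly with a SPARQL alternative property path, which matches either predicate in a single pattern:

SELECT ?s ?o
WHERE {
?s example:pred1|example:pred2 ?o .
}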

With all that… let’s see how RDFox performs.

Results

Each of the below charts compares RDFox to the averages of the others. I exclude outliers where applicable.

Loading

Right off the bat, RDFox was the fastest at loading, which was a huge surprise as AnzoGraph was previously significantly faster than the others (184,667 ms).

Even if I exclude Blazegraph and GraphDB (which were significantly slower at loading than the others), you can see that RDFox was very fast:

Others = AnzoGraph, Stardog and Virtuoso

Note: I have added who the others are as footnotes to each chart. This is so that the images are not mistaken for fair comparison results (if they showed up in a Google image search for example).

RDFox and AnzoGraph are much newer triplestores which may be why they are so much faster at loading than the others. I am very excited to see how these speeds are impacted as we scale the number of triples we load in the benchmark.

Queries

Overall I am very impressed with RDFox’s performance with these queries.

It is important to note, however, that the others’ results have been public for a while. I ran these queries on both the newest version of RDFox and the previous version and did not notice any significant optimisation on these particular queries. The published results are of course from the latest version.

Query 1:

This query is very simple and just counts the number of relationships in the graph.

SELECT (COUNT(*) AS ?triples)
WHERE {
?s ?p ?o .
}

RDFox was the second slowest to do this but as mentioned in the previous article, optimisations on this query often reduce correctness.

Faster = AnzoGraph, Blazegraph, Stardog and Virtuoso. Slower = GraphDB

Query 2:

This query returns a list of 1000 settlement names which have airports with identification numbers.

PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?v WHERE {
{ ?v2 a dbo:Settlement ;
rdfs:label ?v .
?v6 a dbo:Airport . }
{ ?v6 dbo:city ?v2 . }
UNION
{ ?v6 dbo:location ?v2 . }
{ ?v6 dbp:iata ?v5 . }
UNION
{ ?v6 dbo:iataLocationIdentifier ?v5 . }
OPTIONAL { ?v6 foaf:homepage ?v7 . }
OPTIONAL { ?v6 dbp:nativename ?v8 . }
} LIMIT 1000

RDFox was the fastest to complete this query by a fairly significant margin and this is likely because it is an in-memory solution. GraphDB was the second fastest in 29.6ms and then Virtuoso in 88.2ms.

Others = Blazegraph, GraphDB, Stardog and Virtuoso

I would like to reiterate that there are several problems with these queries that will be solved in the benchmark. For example, this query has a LIMIT but no ORDER BY which is highly unrealistic.

Query 3:

This query nests query 2 to grab information about the 1,000 settlements returned above.

You will notice that this query is slightly different to query 3 in the original article.

PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?v ?v2 ?v5 ?v6 ?v7 ?v8 WHERE {
?v2 a dbo:Settlement;
rdfs:label ?v.
?v6 a dbo:Airport.
{ ?v6 dbo:city ?v2. }
UNION
{ ?v6 dbo:location ?v2. }
{ ?v6 dbp:iata ?v5. }
UNION
{ ?v6 dbo:iataLocationIdentifier ?v5. }
OPTIONAL { ?v6 foaf:homepage ?v7. }
OPTIONAL { ?v6 dbp:nativename ?v8. }
{
FILTER(EXISTS{ SELECT ?v WHERE {
?v2 a dbo:Settlement;
rdfs:label ?v.
?v6 a dbo:Airport.
{ ?v6 dbo:city ?v2. }
UNION
{ ?v6 dbo:location ?v2. }
{ ?v6 dbp:iata ?v5. }
UNION
{ ?v6 dbo:iataLocationIdentifier ?v5. }
OPTIONAL { ?v6 foaf:homepage ?v7. }
OPTIONAL { ?v6 dbp:nativename ?v8. }
}
LIMIT 1000
})
}
}

RDFox was again the fastest to complete query 3 but it is important to reiterate that this query was modified slightly so that it could run on RDFox. The only other query that has the same issue is query 6.

Others = Blazegraph, GraphDB, Stardog and Virtuoso

The results of query 2 and 3 are very similar of course as query 2 is nested within query 2.

Query 4:

The two queries above were similar but query 4 is a lot more mathematical.

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT (ROUND(?x/?y) AS ?result) WHERE {
{SELECT (CEIL(?a + ?b) AS ?x) WHERE {
{SELECT (AVG(?abslat) AS ?a) WHERE {
?s1 geo:lat ?lat .
BIND(ABS(?lat) AS ?abslat)
}}
{SELECT (SUM(?rv) AS ?b) WHERE {
?s2 dbo:volume ?volume .
BIND((RAND() * ?volume) AS ?rv)
}}
}}

{SELECT ((FLOOR(?c + ?d)) AS ?y) WHERE {
{SELECT ?c WHERE {
BIND(MINUTES(NOW()) AS ?c)
}}
{SELECT (AVG(?width) AS ?d) WHERE {
?s3 dbo:width ?width .
FILTER(?width > 50)
}}
}}
}

AnzoGraph was the quickest to complete query 4 with RDFox in second place.

Faster = AnzoGraph. Slower = Blazegraph, Stardog and Virtuoso

Virtuoso was the third fastest to complete this query in a time of 519.5ms.

As with all these queries, they do not contain random seeds so I have made sure to include mathematical queries in the benchmark.

Query 5:

This query focuses on strings rather than mathematical function. It essentially grabs all labels containing the string ‘venus’, all comments containing ‘sleep’ and all abstracts containing ‘gluten’. It then constructs an entity and attaches all of these to it.

I use a CONSTRUCT query here. I wrote a second SPARQL tutorial, which covers constructs, called Constructing More Advanced SPARQL Queries for those that need.

PREFIX ex: <http://wallscope.co.uk/resource/example/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT {
ex:notglutenfree rdfs:label ?label ;
rdfs:comment ?sab ;
dbo:abstract ?lab .
} WHERE {
{?s1 rdfs:label ?label .
FILTER (REGEX(lcase(?label), 'venus'))
} UNION
{?s2 rdfs:comment ?sab .
FILTER (REGEX(lcase(?sab), 'sleep'))
} UNION
{?s3 dbo:abstract ?lab .
FILTER (REGEX(lcase(?lab), 'gluten'))
}
}

As discussed in the previous post, it is uncommon to use REGEX queries if you can run a full text index query on the triplestore. AnzoGraph and RDFox are the only two that do not have built in full indexes, hence these results:

Faster = AnzoGraph. Slower = Blazegraph, GraphDB, Stardog and Virtuoso

AnzoGraph is a little faster than RDFox to complete this query but the two of them are significantly faster than the rest. This is of course because you would use the full text index capabilities of the other triplestores.

If we instead run full text index queries, they are significantly faster than RDFox.

Note: To ensure clarity, in this chart RDFox was running the REGEX query as it does not have full text index functionality.

Others = Blazegraph, GraphDB, Stardog and Virtuoso

Whenever I can run a full text index query I will because of the serious performance boost. Therefore this chart is definitely fairer on the other triplestores.

Query 6:

This query finds all soccer players that are born in a country with more than 10 million inhabitants, who played as goalkeeper for a club that has a stadium with more than 30.000 seats and the club country is different from the birth country.

Note: This is the second, and final, query that is modified slightly for RDFox. The original query contained both alternate and recurring property paths which were handled by their rule system.

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX : <http://ost.com/>
SELECT DISTINCT ?soccerplayer ?countryOfBirth ?team ?countryOfTeam ?stadiumcapacity
{
?soccerplayer a dbo:SoccerPlayer ;
:position <http://dbpedia.org/resource/Goalkeeper_(association_football)> ;
:countryOfBirth ?countryOfBirth ;
dbo:team ?team .
?team dbo:capacity ?stadiumcapacity ; dbo:ground ?countryOfTeam .
?countryOfBirth a dbo:Country ; dbo:populationTotal ?population .
?countryOfTeam a dbo:Country .
FILTER (?countryOfTeam != ?countryOfBirth)
FILTER (?stadiumcapacity > 30000)
FILTER (?population > 10000000)
} order by ?soccerplayer

If interested in alternate property paths, I cover them in my article called Constructing More Advanced SPARQL Queries.

RDFox was fastest again to complete query 6. This speed is probably down to the rule system changes as some of the query is essentially done beforehand.

Others = Blazegraph, GraphDB, Stardog and Virtuoso

For the above reason, RDFox’s rule system will have to be investigated thoroughly before the benchmark.

Virtuoso was the second fastest to complete this query in a time of 54.9ms which is still very fast compared to the average.

Query 7:

Finally, this query finds all people born in Berlin before 1900.

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX : <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?name ?birth ?death ?person
WHERE {
?person dbo:birthPlace :Berlin .
?person dbo:birthDate ?birth .
?person foaf:name ?name .
?person dbo:deathDate ?death .
FILTER (?birth < "1900-01-01"^^xsd:date)
}
ORDER BY ?name

Finishing with a very simple query and no custom rules (like queries 3 and 6), RDFox was once again the fastest to complete this query.

Others = AnzoGraph, Blazegraph, GraphDB, Stardog and Virtuoso

In this case, the average includes every other triplestore as there were no real outliers. Virtuoso was again the second fastest and completed query 7 in 20.2ms so relatively fast compared to the average. This speed difference is again likely due to the fact that RDFox is an in-memory solution.

Conclusion

To reiterate, this is not a sound comparison so I cannot conclude that RDFox is better or worse than triplestore X and Y. What I can conclude is this:

RDFox can definitely compete with the other major triplestores and initial results suggest that they have the potential to be one of the top performers in our benchmark.

I can also say that RDFox is very easy to use, well documented and the rule system gives the opportunity for users to add custom reasoning very easily.

If you want to try it for yourself, you can request a license here.

Again to summarise the notes throughout: RDFox did not sponsor this article. They did know the other’s results since I published my last article but I did test the previous version of RDFox also and didn’t notice any significant optimisation. The rule system makes queries 3 and 6 difficult to compare but that will be investigated before running the benchmark.


Comparison of Linked Data Triplestores: A New Contender was originally published in Wallscope on Medium, where people are continuing the conversation by highlighting and responding to this story.

First Impressions of RDFox while the Benchmark is Developed

Note: This article is not sponsored. Oxford Semantic Technologies let me try out the new version of RDFox and are keen to be part of the future benchmark.
After reading some of my previous articles, Oxford Semantic Technologies (OST) got in touch and asked if I would like to try out their triplestore called RDFox. In this article I will share my thoughts and why I am now excited to see how they do in the future benchmark. They have just released a page on which you can request your own evaluation license to try it yourself.

Contents

Brief Benchmark Catch-up
How I Tested RDFox
First Impressions
Results
Conclusion

Brief Benchmark Catch-up

In December I wrote a comparison of existing triplestores on a tiny dataset. I quickly learned that my methodology had too many flaws for the results to be truly comparable. In February I wrote a follow-up in which I described many of those flaws and listed the details I will have to pay attention to while developing an actual benchmark. This benchmark is currently in development: I am now working with developers and academics, and talking with a high-performance computing centre to get access to the infrastructure needed to run at scale.

How I Tested RDFox

In the above articles I evaluated five triplestores. They were (in alphabetical order) AnzoGraph, Blazegraph, GraphDB, Stardog and Virtuoso. I would like to include all of these in the future benchmark and now RDFox as well. Obviously my previous evaluations are not completely fair comparisons (hence the development of the benchmark) but the last one can be used to get an idea of whether RDFox can compete with the others. For that reason, I loaded the same data and ran the same queries as in my larger comparison to see how RDFox fared. I of course kept all other variables the same such as the machine, using the CLI to query, same number of warm up and hot runs, etc…
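The warm-up and hot-run procedure can be sketched roughly as follows (a minimal illustration, not my actual harness; `run_query` is a hypothetical stand-in for invoking a triplestore's CLI):

```python
import time

def run_query(query: str) -> None:
    """Stand-in for sending a SPARQL query to a triplestore's CLI."""
    sum(i * i for i in range(10_000))  # dummy work so there is something to time

def benchmark(query: str, warmups: int = 3, hot_runs: int = 10) -> float:
    """Average the timed hot runs after untimed warm-up executions."""
    for _ in range(warmups):
        run_query(query)  # warm-ups let caches settle before timing starts
    timings = []
    for _ in range(hot_runs):
        start = time.perf_counter()
        run_query(query)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)  # mean hot-run time in seconds

avg = benchmark("SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o . }")
```

Only the hot runs are timed, so one-off start-up costs don't skew the averages.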

First Impressions

RDFox is very easy to use and well-documented. You can initialise it with custom scripts, which is extremely useful: I could start RDFox, load all my gzipped Turtle files in parallel, run all my warm-up queries and run all my hot queries with one command. RDFox is an in-memory solution, which explains many of the differences in results, but it also has a very nice rule system that can precompute answers used by later queries. These rules are evaluated not when you send a query but in advance, which allows them to automatically keep your data consistent as it is added or removed. The rules themselves can even be added or removed during the lifetime of the database.
Note: Queries 3 and 6 use these custom rules. I highlight this on the relevant queries and this is one of the reasons I didn’t just add them to the last article.
You can use these rules for a number of things, for example to precompute alternate property paths (if you are unfamiliar with those, I cover them in this SPARQL tutorial). You could do this by defining a rule stating that example:predicate represents both example:pred1 and example:pred2, so that:
SELECT ?s ?o
WHERE {
  ?s example:predicate ?o .
}
This would match each of these triples:
person:A example:pred1 colour:blue .
person:A example:pred2 colour:green .
person:B example:pred2 colour:brown .
person:C example:pred1 colour:grey .
This makes an alternate property path unnecessary here. With all that said, let's see how RDFox performs.
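Before moving on: the precomputation idea can be sketched in plain Python (a toy in-memory materialisation over the example triples above, not RDFox's actual rule engine):

```python
# Toy forward materialisation, mimicking how a rule precomputes a derived
# predicate before any query arrives. Data matches the example triples above.
triples = {
    ("person:A", "example:pred1", "colour:blue"),
    ("person:A", "example:pred2", "colour:green"),
    ("person:B", "example:pred2", "colour:brown"),
    ("person:C", "example:pred1", "colour:grey"),
}

def materialise(store, sources, derived):
    """Add (s, derived, o) for every (s, p, o) whose predicate is in sources."""
    return store | {(s, derived, o) for (s, p, o) in store if p in sources}

store = materialise(triples, {"example:pred1", "example:pred2"}, "example:predicate")

# A query on the derived predicate now needs no alternate property path:
matches = sorted((s, o) for (s, p, o) in store if p == "example:predicate")
```

The derived triples exist before any query arrives, which is why rule-backed queries can answer so quickly.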

Results

Each of the below charts compares RDFox to the averages of the others. I exclude outliers where applicable.

Loading

Right off the bat, RDFox was the fastest at loading, which was a huge surprise, as AnzoGraph (184,667ms) was previously significantly faster than the others. Even if I exclude Blazegraph and GraphDB (which were significantly slower at loading than the others), you can see that RDFox was very fast:
Others = AnzoGraph, Stardog and Virtuoso
Note: I have added who the others are as footnotes to each chart. This is so that the images are not mistaken for fair comparison results (if they showed up in a Google image search for example).
RDFox and AnzoGraph are much newer triplestores which may be why they are so much faster at loading than the others. I am very excited to see how these speeds are impacted as we scale the number of triples we load in the benchmark.

Queries

Overall I am very impressed with RDFox’s performance with these queries.
It is important to note, however, that the others' results have been public for a while. I ran these queries on both the newest version of RDFox and the previous version and did not notice any significant optimisation on these particular queries. The published results are of course from the latest version.
Query 1:

This query is very simple and just counts the number of relationships in the graph.
SELECT (COUNT(*) AS ?triples)
WHERE {
  ?s ?p ?o .
}
RDFox was the second slowest to do this but as mentioned in the previous article, optimisations on this query often reduce correctness.
Faster = AnzoGraph, Blazegraph, Stardog and Virtuoso. Slower = GraphDB
Query 2:

This query returns a list of 1000 settlement names which have airports with identification numbers.
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?v WHERE {
  { ?v2 a dbo:Settlement ;
        rdfs:label ?v .
    ?v6 a dbo:Airport . }
  { ?v6 dbo:city ?v2 . }
  UNION
    { ?v6 dbo:location ?v2 . }
  { ?v6 dbp:iata ?v5 . }
  UNION
    { ?v6 dbo:iataLocationIdentifier ?v5 . }
  OPTIONAL { ?v6 foaf:homepage ?v7 . }
  OPTIONAL { ?v6 dbp:nativename ?v8 . }
} LIMIT 1000
RDFox was the fastest to complete this query by a fairly significant margin and this is likely because it is an in-memory solution. GraphDB was the second fastest in 29.6ms and then Virtuoso in 88.2ms.
Others = Blazegraph, GraphDB, Stardog and Virtuoso
I would like to reiterate that there are several problems with these queries that will be solved in the benchmark. For example, this query has a LIMIT but no ORDER BY, which is highly unrealistic.

Query 3:

This query nests query 2 to grab information about the 1,000 settlements returned above. You will notice that this query is slightly different to query 3 in the original article.
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?v ?v2 ?v5 ?v6 ?v7 ?v8 WHERE {
  ?v2 a dbo:Settlement;
      rdfs:label ?v.
  ?v6 a dbo:Airport.
  { ?v6 dbo:city ?v2. }
  UNION
  { ?v6 dbo:location ?v2. }
  { ?v6 dbp:iata ?v5. }
  UNION
  { ?v6 dbo:iataLocationIdentifier ?v5. }
  OPTIONAL { ?v6 foaf:homepage ?v7. }
  OPTIONAL { ?v6 dbp:nativename ?v8. }
  {
    FILTER(EXISTS{ SELECT ?v WHERE {
      ?v2 a dbo:Settlement;
          rdfs:label ?v.
      ?v6 a dbo:Airport.
      { ?v6 dbo:city ?v2. }
      UNION
      { ?v6 dbo:location ?v2. }
      { ?v6 dbp:iata ?v5. }
      UNION
      { ?v6 dbo:iataLocationIdentifier ?v5. }
      OPTIONAL { ?v6 foaf:homepage ?v7. }
      OPTIONAL { ?v6 dbp:nativename ?v8. }
    }
    LIMIT 1000
    })
  }
}
RDFox was again the fastest to complete query 3 but it is important to reiterate that this query was modified slightly so that it could run on RDFox. The only other query that has the same issue is query 6.
Others = Blazegraph, GraphDB, Stardog and Virtuoso
The results of queries 2 and 3 are very similar, of course, as query 2 is nested within query 3.

Query 4:

The two queries above were similar, but query 4 is a lot more mathematical.
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT (ROUND(?x/?y) AS ?result) WHERE {  
  {SELECT (CEIL(?a + ?b) AS ?x) WHERE {
    {SELECT (AVG(?abslat) AS ?a) WHERE {
    ?s1 geo:lat ?lat .
    BIND(ABS(?lat) AS ?abslat)
    }}
    {SELECT (SUM(?rv) AS ?b) WHERE {
    ?s2 dbo:volume ?volume .
    BIND((RAND() * ?volume) AS ?rv)
    }}
  }}
  
  {SELECT ((FLOOR(?c + ?d)) AS ?y) WHERE {
      {SELECT ?c WHERE {
        BIND(MINUTES(NOW()) AS ?c)
      }}
      {SELECT (AVG(?width) AS ?d) WHERE {
        ?s3 dbo:width ?width .
        FILTER(?width > 50)
      }}
  }}
}
AnzoGraph was the quickest to complete query 4 with RDFox in second place.
Faster = AnzoGraph. Slower = Blazegraph, Stardog and Virtuoso
Virtuoso was the third fastest to complete this query, in a time of 519.5ms. As with all these queries, there are no random seeds, so I have made sure to include randomised mathematical queries in the benchmark.

Query 5:

This query focuses on strings rather than mathematical functions. It essentially grabs all labels containing the string 'venus', all comments containing 'sleep' and all abstracts containing 'gluten'. It then constructs an entity and attaches all of these to it.
I use a CONSTRUCT query here. For those who need it, I cover CONSTRUCT queries in my second SPARQL tutorial, Constructing More Advanced SPARQL Queries.
PREFIX ex: <http://wallscope.co.uk/resource/example/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT {
  ex:notglutenfree rdfs:label ?label ;
                   rdfs:comment ?sab ;
                   dbo:abstract ?lab .
} WHERE {
  {?s1 rdfs:label ?label .
  FILTER (REGEX(lcase(?label), 'venus'))
  } UNION
  {?s2 rdfs:comment ?sab .
  FILTER (REGEX(lcase(?sab), 'sleep'))
  } UNION
  {?s3 dbo:abstract ?lab .
  FILTER (REGEX(lcase(?lab), 'gluten'))
  }
}
As discussed in the previous post, it is uncommon to use REGEX queries when you can run a full text index query on the triplestore. AnzoGraph and RDFox are the only two without built-in full text indexes, hence these results:
Faster = AnzoGraph. Slower = Blazegraph, GraphDB, Stardog and Virtuoso
AnzoGraph is a little faster than RDFox to complete this query, but the two of them are significantly faster than the rest. This is of course because, in practice, you would use the full text index capabilities of the other triplestores. If we instead run full text index queries, the other triplestores are significantly faster than RDFox.
Note: To ensure clarity, in this chart RDFox was running the REGEX query as it does not have full text index functionality.
Others = Blazegraph, GraphDB, Stardog and Virtuoso
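To see why the gap is so large, here is a toy Python sketch (made-up data; a linear REGEX scan versus a prebuilt inverted index, a crude stand-in for the triplestores' real full text indexes):

```python
import re

# Toy data: three "abstract" literals keyed by hypothetical subject URIs.
abstracts = {
    "ex:pasta": "Pasta usually contains gluten.",
    "ex:rice": "Rice is naturally gluten free.",
    "ex:venus": "Venus is the second planet from the Sun.",
}

def regex_scan(term: str) -> set:
    """REGEX-style query: scan every literal on every query - O(n) each time."""
    pattern = re.compile(term)
    return {s for s, text in abstracts.items() if pattern.search(text.lower())}

# Full text index: built once up front, so each query is a dictionary lookup.
index: dict = {}
for s, text in abstracts.items():
    for word in text.lower().strip(".").split():
        index.setdefault(word, set()).add(s)

def indexed_lookup(term: str) -> set:
    return index.get(term, set())
```

Both return the same subjects, but the index answers from a single dictionary lookup instead of rescanning every literal.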
Whenever I can run a full text index query I will, because of the serious performance boost, so this chart is definitely fairer on the other triplestores.

Query 6:

This query finds all soccer players born in a country with more than 10 million inhabitants, who played as goalkeeper for a club whose stadium has more than 30,000 seats, and whose club's country is different from their birth country.
Note: This is the second, and final, query that is modified slightly for RDFox. The original query contained both alternate and recurring property paths which were handled by their rule system.
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX : <http://ost.com/>
SELECT DISTINCT ?soccerplayer ?countryOfBirth ?team ?countryOfTeam ?stadiumcapacity
{ 
?soccerplayer a dbo:SoccerPlayer ;
   :position <http://dbpedia.org/resource/Goalkeeper_(association_football)> ;
   :countryOfBirth ?countryOfBirth ;
   dbo:team ?team .
   ?team dbo:capacity ?stadiumcapacity ; dbo:ground ?countryOfTeam . 
   ?countryOfBirth a dbo:Country ; dbo:populationTotal ?population .
   ?countryOfTeam a dbo:Country .
FILTER (?countryOfTeam != ?countryOfBirth)
FILTER (?stadiumcapacity > 30000)
FILTER (?population > 10000000)
} order by ?soccerplayer
If you are interested in alternate property paths, I cover them in my article Constructing More Advanced SPARQL Queries.
RDFox was fastest again to complete query 6. This speed is probably down to the rule system changes as some of the query is essentially done beforehand.
Others = Blazegraph, GraphDB, Stardog and Virtuoso
For the above reason, RDFox's rule system will have to be investigated thoroughly before the benchmark. Virtuoso was the second fastest to complete this query, in a time of 54.9ms, which is still very fast compared to the average.

Query 7:

Finally, this query finds all people born in Berlin before 1900.
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX : <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?name ?birth ?death ?person
WHERE {
 ?person dbo:birthPlace :Berlin .
 ?person dbo:birthDate ?birth .
 ?person foaf:name ?name .
 ?person dbo:deathDate ?death .
 FILTER (?birth < "1900-01-01"^^xsd:date)
 }
 ORDER BY ?name
Finishing with a very simple query that required no custom rules (unlike queries 3 and 6), RDFox was once again the fastest to complete this query.
Others = AnzoGraph, Blazegraph, GraphDB, Stardog and Virtuoso
In this case, the average includes every other triplestore as there were no real outliers. Virtuoso was again the second fastest and completed query 7 in 20.2ms so relatively fast compared to the average. This speed difference is again likely due to the fact that RDFox is an in-memory solution.

Conclusion

To reiterate, this is not a sound comparison so I cannot conclude that RDFox is better or worse than triplestore X and Y. What I can conclude is this:
RDFox can definitely compete with the other major triplestores and initial results suggest that they have the potential to be one of the top performers in our benchmark.
I can also say that RDFox is very easy to use and well documented, and its rule system lets users add custom reasoning very easily. If you want to try it for yourself, you can request a license here.
Again, to summarise the notes throughout: RDFox did not sponsor this article. They have known the others' results since I published my last article, but I also tested the previous version of RDFox and didn't notice any significant optimisation. The rule system makes queries 3 and 6 difficult to compare, but that will be investigated before running the benchmark.

Comparison of Linked Data Triplestores: A New Contender was originally published in Wallscope on Medium, where people are continuing the conversation by highlighting and responding to this story.

How Dementia Affects Conversation: Building a More Accessible Conversational AI

Diving into the Literature – What do we know?


We all (roughly) know how to naturally converse with one another. This is mostly subconscious and only really noticeable if an interaction skews from what most consider “normal.” In the majority of cases, these are just minor differences, such as someone speaking a little too close or interrupting more often than usual.

However, more significant conversational differences can start to occur when parts of the brain begin to decline in performance.

Contents

Introduction
Overview of Dementia
Papers Covered
Motivation – Why Speech?
Datasets
Models
Important Language Features
Conclusion

Introduction

I am currently working towards creating a more natural conversational agent (such as Siri, Alexa, etc.) for those with cognitive impairments, who can potentially benefit the most from these systems. Currently we have to adapt how we speak to these systems and know exactly how to ask for certain functions. For those who struggle to adapt, I hope to lower some of these barriers so that they can live more independently for longer. If you want to read more about the overall project, I discussed it in more detail in an interview here.

To kick off this project with Wallscope and The Data Lab, I first investigated some of the research centered on recreating natural conversation with conversational agents. This research all related to a healthy population, but a question arose: do some of these phenomena vary when conversing with those that have forms of cognitive impairment?

In my previous article, I covered two papers that discuss end-of-turn prediction. They created brilliant models to predict when someone has finished their turn to replace models that just wait for a duration of silence.

If someone with Dementia takes a little longer to remember the word(s) they’re looking for, the silence threshold models used in current systems will interrupt them. I suspect the research models would also perform worse than with a healthy population, so I’m collecting a corpus to investigate this.

As my ultimate aim is to make conversational agents more naturally usable for those with dementia, I’ll dive into some of the related research in this article.

Overview of Dementia

I am by no means a dementia expert so this information was all collected from an amazing series of videos by the Alzheimer’s Society.

Their Website

Dementia is not a disease but the name for a group of symptoms that commonly include problems with:

  • Memory
  • Thinking
  • Problem Solving
  • Language
  • Visual Perception

For people with dementia, these symptoms have progressed enough to affect daily life and are not a natural part of aging, as they’re caused by different diseases (I highlight some of them below).

All of these diseases cause the loss of nerve cells, and this gets gradually worse over time, as these nerve cells cannot be replaced.

As more and more cells die, the brain shrinks (atrophies) and symptoms worsen. Which symptoms set in first depends on which part of the brain atrophies, so people are impacted differently.

source — you can see the black areas expanding as nerve cells die and atrophy progresses.

For example, if the occipital lobe begins to decline, then visual symptoms would progress, whereas losing the temporal lobe would cause language problems…

Other common symptoms impact:

  • Day-to-day memory
  • Concentration
  • Organization
  • Planning
  • Language
  • Visual Perception
  • Mood

There is currently no cure…

Before moving on to cover recent research surrounding language problems, it's important to note that most research is disease-specific. Therefore, I'll briefly cover four common types of dementia.

All of this information again comes from the series of videos created by the Alzheimer’s Society.

Alzheimer’s Disease

The most common type of dementia is Alzheimer’s Disease (AD), and for this reason, it’s also the most understood (you’ll notice this in the research).

A healthy brain contains proteins (two of which are called amyloid and tau), but if the brain starts to function abnormally, these proteins form abnormal deposits called plaques and tangles.


These plaques and tangles damage nerve cells, which causes them to die and the brain to shrink, as shown above.

The hippocampus is usually the first area of the brain to decline in performance when someone has AD. This is unfortunately where memories are formed, so people will often forget what they have just done and may therefore repeat themselves in conversation.

Recent memories are lost first, whereas childhood memories can still be retrieved as they depend less on the hippocampus. Additionally, emotions can usually be recalled as the amygdala is still intact, whereas the facts surrounding those emotions can be lost.

AD gradually progresses, so symptoms worsen and become more numerous slowly over time.

Vascular Dementia

The second most common type of dementia is vascular dementia, which is caused by problems with the brain’s blood supply.

Nerve cells need oxygen and nutrients to survive, so without them they become damaged and die. Therefore, when blood supply is interrupted by a blockage or leak, significant damage can be caused.

Like with AD, symptoms depend on which parts of the brain are impacted. When the parts damaged are responsible for memory, thinking, or language, the person will have problems remembering, thinking or speaking.


Vascular dementia can be caused by strokes. Sometimes one major stroke can cause it, but in other cases a person may suffer from multiple smaller strokes that gradually cause damage.

The most common cause of vascular dementia is small-vessel disease, which gradually narrows the vessels in the brain. As the narrowing continues and spreads, more of the brain gets damaged.

Vascular dementia can therefore have a gradual progression like AD or, if caused by strokes, a step-like progression with symptoms worsening after each stroke.

Dementia with Lewy Bodies

Closely related to AD, but less common, is a type of dementia called dementia with Lewy bodies.

Lewy bodies are tiny clumps of protein that develop inside nerve cells in the brain. This prevents communication between cells, which causes them to die.


Researchers have not yet identified why Lewy bodies form or how. We do know, however, that they can form in any part of the brain, which, again, leads to varying symptoms.

People can have problems with concentration, movement, alertness, and can even have visual hallucinations. These hallucinations are often distressing and lead to sleep problems.

Dementia with Lewy bodies progresses gradually and spreads as more nerve cells get damaged, so memory is always impacted eventually.

Frontotemporal dementia

The last type of dementia I’ll cover is frontotemporal dementia (FTD), which is a range of conditions in which cells in the frontal and temporal lobes of the brain are damaged.


FTD is again a less common type of dementia but is, surprisingly, more likely to affect younger people (below 65).

The frontal and temporal lobes of the brain control behavior, emotion, and language, and which symptoms appear first depends on which lobe is damaged first.

The frontal lobe is usually the first to decline in performance, so changes begin to show through a person’s personality, behavior, and inhibitions.

Alternatively, when the temporal lobe is impacted first, a person will struggle with language. For example, they may struggle to find the right word.

FTD is thought to occur when proteins such as tau build up in nerve cells, but unlike the other causes, this is likely hereditary.

Eventually as FTD progresses, symptoms of frontal and temporal damage overlap, and both occur.

Papers Covered

That overview of dementia was fairly in depth, so we should now have a common foundation for this article and all subsequent articles.

As we now know, difficulty with language is a common symptom of dementia, so in order to understand how it changes, I’ll cover four papers that investigate this. These include the following:

[1]

A Speech Recognition Tool for Early Detection of Alzheimer’s Disease by Brianna Marlene Broderick, Si Long Tou and Emily Mower Provost

[2]

A Method for Analysis of Patient Speech in Dialogue for Dementia Detection by Saturnino Luz, Sofia de la Fuente and Pierre Albert

[3]

Speech Processing for Early Alzheimer Disease Diagnosis: Machine Learning Based Approach by Randa Ban Ammar and Yassine Ben Ayed

[4]

Detecting Cognitive Impairments by Agreeing on Interpretations of Linguistic Features by Zining Zhu, Jekaterina Novikova and Frank Rudzicz

Note: I will refer to each paper with their corresponding number from now on.

Motivation – Why Speech?

These four papers have a common motivation: to detect dementia in a more cost-effective and less intrusive manner.

These papers tend to focus on Alzheimer's Disease (AD) because, as [3] mentions, 60–80% of dementia cases are caused by AD. I would add that this is also likely why AD features most in existing datasets.

Current Detection Methods

[1] points out that dementia is relatively difficult to diagnose as progression and symptoms vary widely. The diagnostic processes are therefore complex, and dementia often goes undiagnosed because of this.

[2] explains that imaging (such as PET or MRI scans) and cerebrospinal fluid analysis can be used to detect AD very accurately, but these methods are expensive and extremely invasive. A lumbar puncture must be performed to collect cerebrospinal fluid, for example.

source – lumbar puncture aka “spinal tap”

[2] also points out that neuropsychological detection methods have been developed that can, to varying levels of accuracy, detect signs of AD. [1] adds that these often require repeat testing and are therefore time-consuming and cause additional stress and confusion to the patient.

As mentioned above, [1] argues that dementia often goes undiagnosed because of these flaws. [2] agrees that it would be beneficial to detect AD pathology long before someone is actually diagnosed in order to implement secondary prevention.

Will Speech Analysis Help?

As repeatedly mentioned in the overview of dementia above, language is known to be impacted through various signs such as struggles with word-finding, understanding difficulties, and repetition. [3] points out that language relies heavily on memory, and for this reason, one of the earliest signs of AD may be in a person’s speech.

source

[2] reinforces this point by highlighting the fact that in order to communicate successfully, a person must be able to perform complex decision making, strategy planning, consequence foresight, and problem solving. These are all impaired as dementia progresses.

Practically, [2] states that speech is easy to acquire and elicit, so they (along with [1], [3], and [4]) propose that speech could be used to diagnose Dementia in a cost-effective, non-invasive, and timely manner.

To start investigating this, we need the data.

Datasets

As you can imagine, it isn’t easy to acquire suitable datasets to investigate this. For this reason [1], [3], and [4] used the same dataset from DementiaBank (a repository within TalkBank) called the Pitt Corpus. This corpus contains audio and transcriptions of people with AD and healthy elderly controls.

To elicit speech, participants (both groups) were asked to describe the Cookie Theft stimulus photo:

source

Some participants had multiple visits, so [1], [3], and [4] had audio and transcriptions for 223 control interviews and 234 AD interviews (these numbers differ slightly between them due to pre-processing, I expect).

[1] points out that the picture description task ensures the vocabulary and speech elicited is controlled around a context, but [2] wanted to investigate a different type of speech.

Instead of narrative or picture description speech, [2] used spontaneous conversational data from the Carolina Conversations Collection (CCC) to create their models.

The corpus contains 21 interviews with patients with AD and 17 dialogues with control patients. These control patients suffered from other conditions such as diabetes, heart problems, etc… None of them had any neuropsychological conditions, however.

The automatic detection of AD developed by [2] was the first use of low-level dialogue interaction data as a basis for AD detection on spontaneous spoken language.

Models

If I’m to build a more natural conversational system, then I must be aware of the noticeable differences in speech between those with dementia and healthy controls. What features inform the models in these papers the most should indicate exactly that.

[1] extracted features that are known to be impacted by AD (I run through the exact features in the next section, as that’s my primary interest). They collected many transcription-based features and acoustic features before using principal component analysis (PCA) to reduce the total number of features to train with. Using the selected features they trained a KNN & SVM to achieve an F1 of 0.73 and importantly, a recall of 0.83 as false negatives could be dangerous.

[2] decided to only rely on content-free features including speech rate, turn-taking patterns, and other parameters. They found that when they used speech rate and turn-taking patterns to train a Real AdaBoost algorithm, they achieved an accuracy of 86.5%, and adding more features reduced the number of false positives. They found that other models performed comparably well, but even though Real AdaDoost and decision trees achieved an accuracy of 86.5%, they say there’s still room for improvement.

One point to highlight about [2] is their high accuracy (comparable to the state-of-the-art) despite relying only on content-free features. Their model can therefore be used globally, as the features are not language-dependent like the more complex lexical, syntactic, and semantic features used by other models.

source

[3] ran feature extraction, feature selection, and then classification. There were many syntactic, semantic, and pragmatic features transcribed in the corpus. They tried three feature selection methods, namely: Information Gain, KNN, and SVM Recursive Feature Elimination. This feature selection step is particularly interesting for my project. Using the features selected by the KNN, their most accurate model was an SVM that achieved precision of 79%.

[4] introduces a completely different (and more interesting) approach than the other papers, as they build a Consensus Network (CN).

As [4] uses the same corpus as [1] and [3], there’s a point at which the only two ways to improve upon previous classifiers are to either add more data or calculate more features. Of course, both of those options have limits, so this is why [4] takes a novel approach.

They first split the extracted features into non-overlapping subsets and found that the three naturally occurring groups (acoustic, syntactic, and semantic) garnered the best results.

The 185 acoustic features, 117 syntactic features, and 31 semantic features (plus 80 POS features that were mainly semantic) were used to train three separate neural networks called “ePhysicians”:

[4]

Each ePhysician is a fully connected network with ten hidden layers, Leaky ReLU activations, and batch normalization. Both the classifier and discriminator were the same but without any hidden layers.

The output of each ePhysician was passed one-by-one into the discriminator (with noise), and it then tried to to tell the ePhysicians apart. This encourages the ePhysicians to output indistinguishable representations from each other (agree).

[4] indeed found that their CN, with the three naturally occurring and non-overlapping subsets of features, outperformed other models with a macro F1 of 0.7998. Additionally, [4] showed that the inclusion of noise and cooperative optimization did contribute to the performance.

In case of confusion, it’s important to reiterate that [2] used a different corpus.

Each paper, especially [4], describes their model in more detail, of course. I’m not primarily interested in the models themselves, as I don’t intend to diagnose dementia. My main focus in this article is to find out which features were used to train these models, as I’ll have to pay attention to the same features.

Important Language Features

In order for a conversational system to perform more naturally for those with cognitive impairments, how language changes must be investigated.

[4] sent all features to their ePhysicians so didn’t detail which features were most predictive. They did mention that pronoun-noun ratios were known to change, as those with cognitive impairments use more pronouns than nouns.

[2] interestingly achieved great results using just a person’s speech rate and turn-taking patterns. They did obtain less false positives by adding other features but stuck to content-free features, as mentioned. This means that their model does not depend on a specific language and can therefore be used on a global scale.

[1] extracted features that are known to be impacted by AD and additionally noted that patients’ vocabulary and semantic processing had declined.

[1] listed the following transcription-based features:

  • Lexical Richness
  • Utterance Length
  • Frequency of Filler Words
  • Frequency of Pronouns
  • Frequency of Verbs
  • Frequency of Adjectives
  • Frequency of Proper Nouns

and [1] listed the following acoustic features:

  • Word Finding Errors
  • Fluidity
  • Rhythm of Speech
  • Pause Frequency
  • Duration
  • Speech Rate
  • Articulation Rate

Brilliantly, [3] performed several feature selection methods upon the following features:

[3]

Upon all of these features, they implemented three feature selection methods to select the top eight features each: Information Gain, KNN, and SVM Recursive Feature Elimination (SVM-RFE).

They output the following:

[3]

Three features were selected by all three methods, suggesting that they’re highly predictive for detecting AD: Word Errors, Number of Prepositions, and Number of Repetitions.

It’s also important to restate that the most accurate model used the features selected by the KNN method.

Overall, we have many features identified in this section to pay attention to. In particular, however, (from both the four papers and the Alzheimer’s Society videos) we need to pay particular attention to:

  • Word Errors
  • Repetition
  • Pronoun-Noun Ratio
  • Number of Prepositions
  • Speech Rate
  • Pause Frequency

Conclusion

We’ve previously looked into the current research towards making conversational systems more natural, and we now have a relatively short list of features that must be handled if conversational systems are to perform fluidly, even if the user has a cognitive impairment like AD.

Of course, this isn’t an exhaustive list, but it’s a good place to start and points me in the right direction for what to work on next. Stay tuned!

Editor’s Note: Join Heartbeat on Slack and follow us on Twitter and LinkedIn for the all the latest content, news, and more in machine learning, mobile development, and where the two intersect.

https://medium.com/media/05616eaceabf5537ffbda5b6811c367c/href


How Dementia Affects Conversation: Building a More Accessible Conversational AI was originally published in Heartbeat on Medium, where people are continuing the conversation by highlighting and responding to this story.

Diving into the Literature - What do we know?

source

We all (roughly) know how to naturally converse with one another. This is mostly subconscious and only really noticeable if an interaction skews from what most consider “normal.” In the majority of cases, these are just minor differences, such as someone speaking a little too close or interrupting more often than usual.

However, more significant conversational differences can start to occur when parts of the brain begin to decline in performance.

Contents

Introduction
Overview of Dementia
Papers Covered
Motivation - Why Speech?
Datasets
Models
Important Language Features
Conclusion

Introduction

I am currently working towards creating a more natural conversational agent (such as Siri, Alexa, etc.) for those with cognitive impairments, who can potentially benefit the most from these systems. Currently, we have to adapt how we speak to these systems and know exactly how to ask for certain functions. For those who struggle to adapt, I hope to lower some of these barriers so that they can live more independently for longer. If you want to read more about the overall project, I discussed it in more detail in an interview here.

To kick off this project with Wallscope and The Data Lab, I first investigated some of the research centered on recreating natural conversation with conversational agents. This research all related to a healthy population, but a question arose: do some of these phenomena vary when conversing with those that have forms of cognitive impairment?

In my previous article, I covered two papers that discuss end-of-turn prediction. The authors created brilliant models to predict when someone has finished their turn, replacing models that just wait for a fixed duration of silence.

If someone with dementia takes a little longer to remember the word(s) they’re looking for, the silence threshold models used in current systems will interrupt them. I suspect the research models would also perform worse than with a healthy population, so I’m collecting a corpus to investigate this.

As my ultimate aim is to make conversational agents more naturally usable for those with dementia, I’ll dive into some of the related research in this article.

Overview of Dementia

I am by no means a dementia expert, so this information was all collected from an amazing series of videos by the Alzheimer’s Society.

Their Website

Dementia is not a disease but the name for a group of symptoms that commonly include problems with:

  • Memory
  • Thinking
  • Problem Solving
  • Language
  • Visual Perception

For people with dementia, these symptoms have progressed enough to affect daily life and are not a natural part of aging, as they’re caused by different diseases (I highlight some of them below).

All of these diseases cause the loss of nerve cells, and this gets gradually worse over time, as these nerve cells cannot be replaced.

As more and more cells die, the brain shrinks (atrophies) and symptoms sharpen. Which symptoms set in first depends on which part of the brain atrophies—so people are impacted differently.

source — you can see the black areas expanding as nerve cells die and atrophy progresses.

For example, if the occipital lobe begins to decline, then visual symptoms would progress, whereas losing the temporal lobe would cause language problems…

Other common symptoms impact:

  • Day-to-day memory
  • Concentration
  • Organization
  • Planning
  • Language
  • Visual Perception
  • Mood

There is currently no cure…

Before moving on to cover recent research surrounding language problems, it’s important to note that most research is disease-specific. Therefore, I’ll briefly cover four types of dementia.

All of this information again comes from the series of videos created by the Alzheimer’s Society.

Alzheimer's Disease

The most common type of dementia is Alzheimer’s Disease (AD), and for this reason, it’s also the most understood (you’ll notice this in the research).

A healthy brain contains proteins (two of which are called amyloid and tau), but if the brain starts to function abnormally, these proteins form abnormal deposits called plaques and tangles.

source

These plaques and tangles damage nerve cells, which causes them to die and the brain to shrink, as shown above.

The hippocampus is usually the first area of the brain to decline in performance when someone has AD. This is unfortunately where memories are formed, so people will often forget what they have just done and may therefore repeat themselves in conversation.

Recent memories are lost first, whereas childhood memories can still be retrieved as they depend less on the hippocampus. Additionally, emotions can usually be recalled as the amygdala is still intact, whereas the facts surrounding those emotions can be lost.

AD gradually progresses, so symptoms worsen and become more numerous slowly over time.

Vascular Dementia

The second most common type of dementia is vascular dementia, which is caused by problems with the brain’s blood supply.

Nerve cells need oxygen and nutrients to survive, so without them they become damaged and die. Therefore, when blood supply is interrupted by a blockage or leak, significant damage can be caused.

Like with AD, symptoms depend on which parts of the brain are impacted. When the parts damaged are responsible for memory, thinking, or language, the person will have problems remembering, thinking or speaking.

source

Vascular dementia can be caused by strokes. Sometimes one major stroke can cause it, but in other cases a person may suffer from multiple smaller strokes that gradually cause damage.

The most common cause of vascular dementia is small-vessel disease, which gradually narrows the vessels in the brain. As the narrowing continues and spreads, more of the brain gets damaged.

Vascular dementia can therefore have a gradual progression like AD or, if caused by strokes, a step-like progression with symptoms worsening after each stroke.

Dementia with Lewy Bodies

Closely related to AD, but less common, is a type of dementia called dementia with Lewy bodies.

Lewy bodies are tiny clumps of protein that develop inside nerve cells in the brain. These clumps prevent communication between cells, which causes the cells to die.

source

Researchers have not yet identified why Lewy bodies form or how. We do know, however, that they can form in any part of the brain, which, again, leads to varying symptoms.

People can have problems with concentration, movement, alertness, and can even have visual hallucinations. These hallucinations are often distressing and lead to sleep problems.

Dementia with Lewy bodies progresses gradually and spreads as more nerve cells get damaged, so memory is always impacted eventually.

Frontotemporal dementia

The last type of dementia I’ll cover is frontotemporal dementia (FTD), which is a range of conditions in which cells in the frontal and temporal lobes of the brain are damaged.

source

FTD is again a less common type of dementia but is surprisingly more likely to affect younger people (below 65).

The frontal and temporal lobes of the brain control behavior, emotion, and language, and which symptoms appear first depends on which lobe is impacted first.

The frontal lobe is usually the first to decline in performance, so changes begin to show through a person’s personality, behavior, and inhibitions.

Alternatively, when the temporal lobe is impacted first, a person will struggle with language. For example, they may struggle to find the right word.

FTD is thought to occur when proteins such as tau build up in nerve cells, but unlike the other causes, this is likely hereditary.

Eventually as FTD progresses, symptoms of frontal and temporal damage overlap, and both occur.

Papers Covered

That overview of dementia was fairly in depth, so we should now have a common foundation for this article and all subsequent articles.

As we now know, difficulty with language is a common symptom of dementia, so in order to understand how it changes, I’ll cover four papers that investigate this. These include the following:

[1]

A Speech Recognition Tool for Early Detection of Alzheimer’s Disease by Brianna Marlene Broderick, Si Long Tou and Emily Mower Provost

[2]

A Method for Analysis of Patient Speech in Dialogue for Dementia Detection by Saturnino Luz, Sofia de la Fuente and Pierre Albert

[3]

Speech Processing for Early Alzheimer Disease Diagnosis: Machine Learning Based Approach by Randa Ben Ammar and Yassine Ben Ayed

[4]

Detecting Cognitive Impairments by Agreeing on Interpretations of Linguistic Features by Zining Zhu, Jekaterina Novikova and Frank Rudzicz

Note: I will refer to each paper by its corresponding number from now on.

Motivation - Why Speech?

These four papers have a common motivation: to detect dementia in a more cost-effective and less intrusive manner.

These papers tend to focus on Alzheimer’s Disease (AD) because, as [3] mentions, 60–80% of dementia cases are caused by AD. I would add that this is likely why AD features most in existing datasets, also.

Current Detection Methods

[1] points out that dementia is relatively difficult to diagnose as progression and symptoms vary widely. The diagnostic processes are therefore complex, and dementia often goes undiagnosed because of this.

[2] explains that imaging (such as PET or MRI scans) and cerebrospinal fluid analysis can be used to detect AD very accurately, but these methods are expensive and extremely invasive. A lumbar puncture must be performed to collect cerebrospinal fluid, for example.

source - lumbar puncture aka “spinal tap”

[2] also points out that neuropsychological detection methods have been developed that can, to varying levels of accuracy, detect signs of AD. [1] adds that these often require repeat testing and are therefore time-consuming and cause additional stress and confusion to the patient.

As mentioned above, [1] argues that dementia often goes undiagnosed because of these flaws. [2] agrees that it would be beneficial to detect AD pathology long before someone is actually diagnosed in order to implement secondary prevention.

Will Speech Analysis Help?

As repeatedly mentioned in the overview of dementia above, language is known to be impacted through various signs such as struggles with word-finding, understanding difficulties, and repetition. [3] points out that language relies heavily on memory, and for this reason, one of the earliest signs of AD may be in a person’s speech.

source

[2] reinforces this point by highlighting the fact that in order to communicate successfully, a person must be able to perform complex decision making, strategy planning, consequence foresight, and problem solving. These are all impaired as dementia progresses.

Practically, [2] states that speech is easy to acquire and elicit, so they (along with [1], [3], and [4]) propose that speech could be used to diagnose dementia in a cost-effective, non-invasive, and timely manner.

To start investigating this, we need the data.

Datasets

As you can imagine, it isn’t easy to acquire suitable datasets to investigate this. For this reason [1], [3], and [4] used the same dataset from DementiaBank (a repository within TalkBank) called the Pitt Corpus. This corpus contains audio and transcriptions of people with AD and healthy elderly controls.

To elicit speech, participants (both groups) were asked to describe the Cookie Theft stimulus photo:

source

Some participants had multiple visits, so [1], [3], and [4] had audio and transcriptions for 223 control interviews and 234 AD interviews (these numbers differ slightly between them due to pre-processing, I expect).

[1] points out that the picture description task ensures the vocabulary and speech elicited is controlled around a context, but [2] wanted to investigate a different type of speech.

Instead of narrative or picture description speech, [2] used spontaneous conversational data from the Carolina Conversations Collection (CCC) to create their models.

The corpus contains 21 interviews with patients with AD and 17 dialogues with control patients. These control patients suffered from other conditions such as diabetes, heart problems, etc… None of them had any neuropsychological conditions, however.

The automatic detection of AD developed by [2] was the first use of low-level dialogue interaction data as a basis for AD detection on spontaneous spoken language.

Models

If I’m to build a more natural conversational system, then I must be aware of the noticeable differences in speech between those with dementia and healthy controls. The features that most inform the models in these papers should indicate exactly that.

[1] extracted features that are known to be impacted by AD (I run through the exact features in the next section, as that’s my primary interest). They collected many transcription-based features and acoustic features before using principal component analysis (PCA) to reduce the total number of features to train with. Using the selected features they trained a KNN & SVM to achieve an F1 of 0.73 and importantly, a recall of 0.83 as false negatives could be dangerous.
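To make that reduce-then-classify structure concrete, here’s a minimal scikit-learn sketch. The feature matrix below is purely synthetic stand-in data (not the Pitt Corpus features), and the component count and classifier settings are my own assumptions rather than the paper’s exact configuration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a transcription + acoustic feature matrix:
# 200 interviews, 50 features, binary AD/control labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = rng.integers(0, 2, size=200)

# Scale, reduce dimensionality with PCA, then classify with an SVM,
# mirroring the reduce-then-classify structure described in [1].
model = make_pipeline(StandardScaler(), PCA(n_components=10), SVC())

# Recall matters most here, since false negatives could be dangerous.
scores = cross_val_score(model, X, y, cv=5, scoring="recall")
print("mean CV recall: %.2f" % scores.mean())
```

With random labels the recall is meaningless, of course; the point is only the pipeline shape.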

[2] decided to rely only on content-free features, including speech rate, turn-taking patterns, and other parameters. They found that when they used speech rate and turn-taking patterns to train a Real AdaBoost algorithm, they achieved an accuracy of 86.5%, and adding more features reduced the number of false positives. Other models performed comparably well, but even though Real AdaBoost and decision trees achieved an accuracy of 86.5%, they say there’s still room for improvement.
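As a rough illustration of how far two content-free features can go, here’s a toy sketch with invented speech-rate and turn-length distributions. Note that scikit-learn’s AdaBoostClassifier is a close relative of, not identical to, the Real AdaBoost variant in [2], and all the numbers below are made up.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Invented per-speaker features: speech rate (words/sec) and mean turn
# length (sec), with the AD group slower and taking longer turns.
rng = np.random.default_rng(1)
n = 100
speech_rate = np.concatenate([rng.normal(2.5, 0.4, n), rng.normal(1.8, 0.4, n)])
turn_length = np.concatenate([rng.normal(6.0, 1.5, n), rng.normal(9.0, 2.0, n)])
X = np.column_stack([speech_rate, turn_length])
y = np.array([0] * n + [1] * n)  # 0 = control, 1 = AD

# Boosted decision stumps over just these two content-free features.
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)
print("training accuracy: %.2f" % clf.score(X, y))
```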

One point to highlight about [2] is their high accuracy (comparable to the state-of-the-art) despite relying only on content-free features. Their model can therefore be used globally, as the features are not language-dependent like the more complex lexical, syntactic, and semantic features used by other models.

source

[3] ran feature extraction, feature selection, and then classification. There were many syntactic, semantic, and pragmatic features transcribed in the corpus. They tried three feature selection methods, namely: Information Gain, KNN, and SVM Recursive Feature Elimination. This feature selection step is particularly interesting for my project. Using the features selected by the KNN, their most accurate model was an SVM that achieved a precision of 79%.
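Here’s a hedged sketch of that selection step: mutual information stands in for Information Gain, SVM-RFE is used directly, and each method keeps eight features of a synthetic dataset before we intersect the selections, much as [3] compares its methods’ outputs. Everything besides "top eight" is my own stand-in.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.svm import SVC

# Synthetic 20-feature dataset standing in for the transcribed
# syntactic, semantic, and pragmatic features of the corpus.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

# Mutual information (an Information Gain analogue) and SVM-RFE,
# each keeping the top eight features as in [3].
ig = SelectKBest(mutual_info_classif, k=8).fit(X, y)
rfe = RFE(SVC(kernel="linear"), n_features_to_select=8).fit(X, y)

ig_set = set(np.flatnonzero(ig.get_support()))
rfe_set = set(np.flatnonzero(rfe.get_support()))

# The features both methods agree on, analogous to [3] finding that
# Word Errors, Prepositions, and Repetitions were selected by every method.
print("agreed features:", sorted(ig_set & rfe_set))
```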

[4] introduces a completely different (and more interesting) approach than the other papers, as they build a Consensus Network (CN).

As [4] uses the same corpus as [1] and [3], there’s a point at which the only two ways to improve upon previous classifiers are to either add more data or calculate more features. Of course, both of those options have limits, so this is why [4] takes a novel approach.

They first split the extracted features into non-overlapping subsets and found that the three naturally occurring groups (acoustic, syntactic, and semantic) garnered the best results.

The 185 acoustic features, 117 syntactic features, and 31 semantic features (plus 80 POS features that were mainly semantic) were used to train three separate neural networks called “ePhysicians”:

[4]

Each ePhysician is a fully connected network with ten hidden layers, Leaky ReLU activations, and batch normalization. Both the classifier and discriminator were the same but without any hidden layers.

The output of each ePhysician was passed one-by-one into the discriminator (with noise), which then tried to tell the ePhysicians apart. This encourages the ePhysicians to output representations that are indistinguishable from each other (i.e. to agree).

[4] indeed found that their CN, with the three naturally occurring and non-overlapping subsets of features, outperformed other models with a macro F1 of 0.7998. Additionally, [4] showed that the inclusion of noise and cooperative optimization did contribute to the performance.
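I won’t attempt to reproduce the adversarial training here, but the "several specialists, one decision" structure can be sketched with a much simpler, non-adversarial ensemble: one plain logistic-regression "ePhysician" per feature subset, with predictions averaged instead of pushed to agree by a discriminator. Subset sizes and data are invented for the sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 150
y = rng.integers(0, 2, size=n)

# Non-overlapping feature subsets standing in for the acoustic,
# syntactic, and semantic groups (dimensions shrunk for the sketch),
# with a small label-dependent shift so the task is learnable.
subsets = {"acoustic": 18, "syntactic": 12, "semantic": 6}
X = {name: rng.normal(size=(n, dim)) + y[:, None] * 0.5
     for name, dim in subsets.items()}

# One simple "ePhysician" per subset (plain logistic regression here,
# not the ten-layer networks of [4]).
physicians = {name: LogisticRegression(max_iter=1000).fit(X[name], y)
              for name in subsets}

# Averaging the three opinions: a crude, non-adversarial stand-in
# for the consensus step.
probs = np.mean([physicians[name].predict_proba(X[name])[:, 1]
                 for name in subsets], axis=0)
consensus = (probs > 0.5).astype(int)
print("agreement with labels: %.2f" % (consensus == y).mean())
```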

In case of confusion, it’s important to reiterate that [2] used a different corpus.

Each paper, especially [4], describes their model in more detail, of course. I’m not primarily interested in the models themselves, as I don’t intend to diagnose dementia. My main focus in this article is to find out which features were used to train these models, as I’ll have to pay attention to the same features.

Important Language Features

In order for a conversational system to perform more naturally for those with cognitive impairments, we must investigate how their language changes.

[4] sent all features to their ePhysicians so didn’t detail which features were most predictive. They did mention that pronoun-noun ratios were known to change, as those with cognitive impairments use more pronouns than nouns.
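As a toy illustration of that pronoun-noun ratio, here’s a sketch in which a tiny hand-made lexicon stands in for a real POS tagger (the word lists are my own, not from [4]):

```python
# Tiny hand-made lexicons stand in for a real POS tagger; the words
# loosely evoke the Cookie Theft picture but are my own choices.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them"}
NOUNS = {"boy", "cookie", "jar", "sink", "water", "mother", "stool"}

def pronoun_noun_ratio(text: str) -> float:
    """Pronouns per noun; guarded so a noun-free utterance still scores."""
    tokens = text.lower().split()
    pronouns = sum(t in PRONOUNS for t in tokens)
    nouns = sum(t in NOUNS for t in tokens)
    return pronouns / max(nouns, 1)

# A vague, pronoun-heavy description vs a concrete one.
vague = "he is taking it and she is not watching him"
concrete = "the boy is taking a cookie from the jar while the mother washes the sink"
print(pronoun_noun_ratio(vague), pronoun_noun_ratio(concrete))  # 4.0 0.0
```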

[2] interestingly achieved great results using just a person’s speech rate and turn-taking patterns. They did obtain fewer false positives by adding other features but stuck to content-free features, as mentioned. This means that their model does not depend on a specific language and can therefore be used on a global scale.
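Those two measures are simple enough to sketch directly. The transcript below is invented, and the calculations (words per second within a speaker’s turns, and the average silence between turns) follow the plain definitions of speech rate and turn-taking latency, which may differ from the exact parameterisation in [2]:

```python
# An invented transcript of (speaker, start_sec, end_sec, text) turns.
turns = [
    ("patient", 0.0, 4.0, "well the boy is on the stool"),
    ("interviewer", 4.5, 5.0, "mm hmm"),
    ("patient", 6.8, 12.0, "and he is reaching for the um the jar"),
]

def speech_rate(turns, speaker):
    """Words per second across one speaker's turns."""
    words = sum(len(t[3].split()) for t in turns if t[0] == speaker)
    secs = sum(t[2] - t[1] for t in turns if t[0] == speaker)
    return words / secs

def mean_switch_pause(turns):
    """Average silence between consecutive turns (turn-taking latency)."""
    gaps = [b[1] - a[2] for a, b in zip(turns, turns[1:])]
    return sum(gaps) / len(gaps)

print(round(speech_rate(turns, "patient"), 2))  # words per second
print(round(mean_switch_pause(turns), 2))       # seconds of silence
```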

[1] extracted features that are known to be impacted by AD and additionally noted that patients’ vocabulary and semantic processing had declined.

[1] listed the following transcription-based features:

  • Lexical Richness
  • Utterance Length
  • Frequency of Filler Words
  • Frequency of Pronouns
  • Frequency of Verbs
  • Frequency of Adjectives
  • Frequency of Proper Nouns

and [1] listed the following acoustic features:

  • Word Finding Errors
  • Fluidity
  • Rhythm of Speech
  • Pause Frequency
  • Duration
  • Speech Rate
  • Articulation Rate

Brilliantly, [3] performed several feature selection methods upon the following features:

[3]

They implemented three feature selection methods over all of these features, each selecting the top eight: Information Gain, KNN, and SVM Recursive Feature Elimination (SVM-RFE).

They output the following:

[3]

Three features were selected by all three methods, suggesting that they’re highly predictive for detecting AD: Word Errors, Number of Prepositions, and Number of Repetitions.

It’s also important to restate that the most accurate model used the features selected by the KNN method.

Overall, we have many features identified in this section to pay attention to. Drawing from both the four papers and the Alzheimer’s Society videos, however, we need to pay particular attention to:

  • Word Errors
  • Repetition
  • Pronoun-Noun Ratio
  • Number of Prepositions
  • Speech Rate
  • Pause Frequency

Conclusion

We’ve previously looked into the current research towards making conversational systems more natural, and we now have a relatively short list of features that must be handled if conversational systems are to perform fluidly, even if the user has a cognitive impairment like AD.

Of course, this isn’t an exhaustive list, but it’s a good place to start and points me in the right direction for what to work on next. Stay tuned!

Editor’s Note: Join Heartbeat on Slack and follow us on Twitter and LinkedIn for all the latest content, news, and more in machine learning, mobile development, and where the two intersect.


How Dementia Affects Conversation: Building a More Accessible Conversational AI was originally published in Heartbeat on Medium, where people are continuing the conversation by highlighting and responding to this story.