Webinar: Fusion methodology for point clouds deep learning

Speaker: Fauzy Othman, Heriot-Watt University, Edinburgh
Date: 30 May 2022
Time: 15:00 – 16:00
Venue: Online

Fauzy is a PhD student working with Phil Bartie, Oliver Lemon, and Ben Kenwright. During the weekly SWeL meeting, Fauzy will be giving his first year progression presentation.

Abstract: This project aims to address gaps in the autonomous detection of ground objects in aerial surveillance. Point cloud deep learning techniques already exist, but many fall short in detection accuracy and were not developed for the objects of interest in energy-industry surveillance work. The aim is therefore to improve object detection accuracy by exploring deep learning architectures, focusing on point cloud data manipulation and fusion methods. This Year 1 presentation covers the literature review and initial implementation work.

How Dementia Affects Conversation: Building a More Accessible Conversational AI

Diving into the Literature – What do we know?

source

We all (roughly) know how to naturally converse with one another. This is mostly subconscious and only really noticeable if an interaction skews from what most consider “normal.” In the majority of cases, these are just minor differences, such as someone speaking a little too close or interrupting more often than usual.

However, more significant conversational differences can start to occur when parts of the brain begin to decline in performance.

Contents

Introduction
Overview of Dementia
Papers Covered
Motivation – Why Speech?
Datasets
Models
Important Language Features
Conclusion

Introduction

I am currently working towards creating a more natural conversational agent (such as Siri, Alexa, etc…) for those with cognitive impairments who can potentially benefit the most from these systems. Currently, we have to adapt how we speak to these systems and know exactly how to ask for certain functions. For those who struggle to adapt, I hope to lower some of these barriers so that they can live more independently for longer. If you want to read more about the overall project, I discussed it in more detail in an interview here.

To kick off this project with Wallscope and The Data Lab, I first investigated some of the research centered on recreating natural conversation with conversational agents. This research all related to a healthy population, but a question arose: do some of these phenomena vary when conversing with those that have forms of cognitive impairment?

In my previous article, I covered two papers that discuss end-of-turn prediction. They created brilliant models to predict when someone has finished their turn to replace models that just wait for a duration of silence.

If someone with Dementia takes a little longer to remember the word(s) they’re looking for, the silence threshold models used in current systems will interrupt them. I suspect the research models would also perform worse than with a healthy population, so I’m collecting a corpus to investigate this.

As my ultimate aim is to make conversational agents more naturally usable for those with dementia, I’ll dive into some of the related research in this article.

Overview of Dementia

I am by no means a dementia expert so this information was all collected from an amazing series of videos by the Alzheimer’s Society.

Their Website

Dementia is not a disease but the name for a group of symptoms that commonly include problems with:

  • Memory
  • Thinking
  • Problem Solving
  • Language
  • Visual Perception

For people with dementia, these symptoms have progressed enough to affect daily life and are not a natural part of aging, as they’re caused by different diseases (I highlight some of them below).

All of these diseases cause the loss of nerve cells, and this gets gradually worse over time, as these nerve cells cannot be replaced.

As more and more cells die, the brain shrinks (atrophies) and symptoms sharpen. Which symptoms set in first depends on which part of the brain atrophies—so people are impacted differently.

source — you can see the black areas expanding as nerve cells die and atrophy progresses.

For example, if the occipital lobe begins to decline, then visual symptoms would progress, whereas losing the temporal lobe would cause language problems…

Other common symptoms impact:

  • Day-to-day memory
  • Concentration
  • Organization
  • Planning
  • Language
  • Visual Perception
  • Mood

There is currently no cure…

Before moving on to cover recent research surrounding language problems, it’s important to note that most research is disease-specific. Therefore, I’ll briefly cover four common types of dementia.

All of this information again comes from the series of videos created by the Alzheimer’s Society.

Alzheimer’s Disease

The most common type of dementia is Alzheimer’s Disease (AD), and for this reason, it’s also the most understood (you’ll notice this in the research).

A healthy brain contains proteins (two of which are called amyloid and tau), but if the brain starts to function abnormally, these proteins form abnormal deposits called plaques and tangles.

source

These plaques and tangles damage nerve cells, which causes them to die and the brain to shrink, as shown above.

The hippocampus is usually the first area of the brain to decline in performance when someone has AD. This is unfortunately where memories are formed, so people will often forget what they have just done and may therefore repeat themselves in conversation.

Recent memories are lost first, whereas childhood memories can still be retrieved as they depend less on the hippocampus. Additionally, emotions can usually be recalled as the amygdala is still intact, whereas the facts surrounding those emotions can be lost.

AD gradually progresses, so symptoms worsen and become more numerous slowly over time.

Vascular Dementia

The second most common type of dementia is vascular dementia, which is caused by problems with the brain’s blood supply.

Nerve cells need oxygen and nutrients to survive, so without them they become damaged and die. Therefore, when blood supply is interrupted by a blockage or leak, significant damage can be caused.

Like with AD, symptoms depend on which parts of the brain are impacted. When the parts damaged are responsible for memory, thinking, or language, the person will have problems remembering, thinking or speaking.

source

Vascular dementia can be caused by strokes. Sometimes one major stroke can cause it, but in other cases a person may suffer from multiple smaller strokes that gradually cause damage.

The most common cause of vascular dementia is small-vessel disease, which gradually narrows the vessels in the brain. As the narrowing continues and spreads, more of the brain gets damaged.

Vascular dementia can therefore have a gradual progression like AD or, if caused by strokes, a step-like progression with symptoms worsening after each stroke.

Dementia with Lewy Bodies

Closely related to AD, but less common, is a type of dementia called dementia with Lewy bodies.

Lewy bodies are tiny clumps of protein that develop inside nerve cells in the brain. This prevents communication between cells, which causes them to die.

source

Researchers have not yet identified why or how Lewy bodies form. We do know, however, that they can form in any part of the brain, which, again, leads to varying symptoms.

People can have problems with concentration, movement, alertness, and can even have visual hallucinations. These hallucinations are often distressing and lead to sleep problems.

Dementia with Lewy bodies progresses gradually and spreads as more nerve cells get damaged, so memory is always impacted eventually.

Frontotemporal dementia

The last type of dementia I’ll cover is frontotemporal dementia (FTD), which is a range of conditions in which cells in the frontal and temporal lobes of the brain are damaged.

source

FTD is again a less common type of dementia but is, perhaps surprisingly, more likely to affect younger people (below 65).

The frontal and temporal lobes of the brain control behavior, emotion, and language, and the order in which symptoms appear depends on which lobe is impacted first.

The frontal lobe is usually the first to decline in performance, so changes begin to show through a person’s personality, behavior, and inhibitions.

Alternatively, when the temporal lobe is impacted first, a person will struggle with language. For example, they may struggle to find the right word.

FTD is thought to occur when proteins such as tau build up in nerve cells, and unlike the other causes, it is more often hereditary.

Eventually as FTD progresses, symptoms of frontal and temporal damage overlap, and both occur.

Papers Covered

That overview of dementia was fairly in depth, so we should now have a common foundation for this article and all subsequent articles.

As we now know, difficulty with language is a common symptom of dementia, so in order to understand how it changes, I’ll cover four papers that investigate this. These include the following:

[1]

A Speech Recognition Tool for Early Detection of Alzheimer’s Disease by Brianna Marlene Broderick, Si Long Tou and Emily Mower Provost

[2]

A Method for Analysis of Patient Speech in Dialogue for Dementia Detection by Saturnino Luz, Sofia de la Fuente and Pierre Albert

[3]

Speech Processing for Early Alzheimer Disease Diagnosis: Machine Learning Based Approach by Randa Ben Ammar and Yassine Ben Ayed

[4]

Detecting Cognitive Impairments by Agreeing on Interpretations of Linguistic Features by Zining Zhu, Jekaterina Novikova and Frank Rudzicz

Note: I will refer to each paper with their corresponding number from now on.

Motivation – Why Speech?

These four papers have a common motivation: to detect dementia in a more cost-effective and less intrusive manner.

These papers tend to focus on Alzheimer’s Disease (AD) because, as [3] mentions, 60–80% of dementia cases are caused by AD. I would add that this is likely also why AD features most prominently in existing datasets.

Current Detection Methods

[1] points out that dementia is relatively difficult to diagnose as progression and symptoms vary widely. The diagnostic processes are therefore complex, and dementia often goes undiagnosed because of this.

[2] explains that imaging (such as PET or MRI scans) and cerebrospinal fluid analysis can be used to detect AD very accurately, but these methods are expensive and extremely invasive. A lumbar puncture must be performed to collect cerebrospinal fluid, for example.

source – lumbar puncture aka “spinal tap”

[2] also points out that neuropsychological detection methods have been developed that can, to varying levels of accuracy, detect signs of AD. [1] adds that these often require repeat testing and are therefore time-consuming and cause additional stress and confusion to the patient.

As mentioned above, [1] argues that dementia often goes undiagnosed because of these flaws. [2] agrees that it would be beneficial to detect AD pathology long before someone is actually diagnosed in order to implement secondary prevention.

Will Speech Analysis Help?

As repeatedly mentioned in the overview of dementia above, language is known to be impacted through various signs such as struggles with word-finding, understanding difficulties, and repetition. [3] points out that language relies heavily on memory, and for this reason, one of the earliest signs of AD may be in a person’s speech.

source

[2] reinforces this point by highlighting the fact that in order to communicate successfully, a person must be able to perform complex decision making, strategy planning, consequence foresight, and problem solving. These are all impaired as dementia progresses.

Practically, [2] states that speech is easy to acquire and elicit, so they (along with [1], [3], and [4]) propose that speech could be used to diagnose dementia in a cost-effective, non-invasive, and timely manner.

To start investigating this, we need the data.

Datasets

As you can imagine, it isn’t easy to acquire suitable datasets to investigate this. For this reason [1], [3], and [4] used the same dataset from DementiaBank (a repository within TalkBank) called the Pitt Corpus. This corpus contains audio and transcriptions of people with AD and healthy elderly controls.

To elicit speech, participants (both groups) were asked to describe the Cookie Theft stimulus photo:

source

Some participants had multiple visits, so [1], [3], and [4] had audio and transcriptions for 223 control interviews and 234 AD interviews (these numbers differ slightly between them due to pre-processing, I expect).

[1] points out that the picture description task ensures the vocabulary and speech elicited is controlled around a context, but [2] wanted to investigate a different type of speech.

Instead of narrative or picture description speech, [2] used spontaneous conversational data from the Carolina Conversations Collection (CCC) to create their models.

The corpus contains 21 interviews with patients with AD and 17 dialogues with control patients. These control patients suffered from other conditions such as diabetes, heart problems, etc… None of them had any neuropsychological conditions, however.

The automatic detection of AD developed by [2] was the first use of low-level dialogue interaction data as a basis for AD detection on spontaneous spoken language.

Models

If I’m to build a more natural conversational system, then I must be aware of the noticeable differences in speech between those with dementia and healthy controls. Which features inform the models in these papers the most should indicate exactly that.

[1] extracted features that are known to be impacted by AD (I run through the exact features in the next section, as that’s my primary interest). They collected many transcription-based and acoustic features before using principal component analysis (PCA) to reduce the total number of features to train with. Using the selected features, they trained KNN and SVM classifiers to achieve an F1 of 0.73 and, importantly, a recall of 0.83, as false negatives could be dangerous.
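
As a purely illustrative sketch of that kind of pipeline, the snippet below chains PCA with KNN and SVM classifiers in scikit-learn and scores them on F1 and recall. The random feature matrix, component count, and hyperparameters are stand-ins, not the settings used in [1].

```python
# Hedged sketch of a PCA -> classifier pipeline scored on F1 and recall.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(457, 40))    # stand-in for transcription-based + acoustic features
y = rng.integers(0, 2, size=457)  # 0 = control, 1 = AD (placeholder labels)

for name, clf in [("knn", KNeighborsClassifier()), ("svm", SVC())]:
    model = make_pipeline(StandardScaler(), PCA(n_components=10), clf)
    scores = cross_validate(model, X, y, cv=5, scoring=["f1", "recall"])
    print(name, scores["test_f1"].mean(), scores["test_recall"].mean())
```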

[2] decided to rely only on content-free features, including speech rate, turn-taking patterns, and other parameters. They found that when they used speech rate and turn-taking patterns to train a Real AdaBoost algorithm, they achieved an accuracy of 86.5%, and adding more features reduced the number of false positives. They found that other models performed comparably well, but even though Real AdaBoost and decision trees achieved an accuracy of 86.5%, they say there’s still room for improvement.

One point to highlight about [2] is their high accuracy (comparable to the state-of-the-art) despite relying only on content-free features. Their model can therefore be used globally, as the features are not language-dependent like the more complex lexical, syntactic, and semantic features used by other models.

source

[3] ran feature extraction, feature selection, and then classification. There were many syntactic, semantic, and pragmatic features transcribed in the corpus. They tried three feature selection methods, namely Information Gain, KNN, and SVM Recursive Feature Elimination. This feature selection step is particularly interesting for my project. Using the features selected by the KNN, their most accurate model was an SVM that achieved a precision of 79%.
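
To make the selection step concrete, here is a rough scikit-learn sketch of the three routes [3] compares: Information Gain (via mutual information), a KNN-based wrapper (approximated here with sequential forward selection), and SVM-RFE, each keeping the top eight features. The data is random and the exact wrapper used in the paper may differ.

```python
# Hedged sketch of three feature-selection routes, each keeping 8 features.
import numpy as np
from sklearn.feature_selection import RFE, SelectKBest, SequentialFeatureSelector, mutual_info_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(457, 25))    # placeholder syntactic/semantic/pragmatic features
y = rng.integers(0, 2, size=457)

selectors = {
    "information_gain": SelectKBest(mutual_info_classif, k=8),
    "knn_wrapper": SequentialFeatureSelector(KNeighborsClassifier(), n_features_to_select=8),
    "svm_rfe": RFE(LinearSVC(dual=False), n_features_to_select=8),
}
for name, selector in selectors.items():
    selected = selector.fit(X, y).get_support(indices=True)
    print(name, selected)

# The features kept by the KNN-based route would then feed the final SVM classifier.
```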

[4] introduces a completely different (and more interesting) approach from the other papers, as they build a Consensus Network (CN).

As [4] uses the same corpus as [1] and [3], there’s a point at which the only two ways to improve upon previous classifiers are to either add more data or calculate more features. Of course, both of those options have limits, so this is why [4] takes a novel approach.

They first split the extracted features into non-overlapping subsets and found that the three naturally occurring groups (acoustic, syntactic, and semantic) garnered the best results.

The 185 acoustic features, 117 syntactic features, and 31 semantic features (plus 80 POS features that were mainly semantic) were used to train three separate neural networks called “ePhysicians”:

[4]

Each ePhysician is a fully connected network with ten hidden layers, Leaky ReLU activations, and batch normalization. The classifier and the discriminator had the same structure but without any hidden layers.

The output of each ePhysician was passed one-by-one into the discriminator (with noise), which then tried to tell the ePhysicians apart. This encourages the ePhysicians to output representations that are indistinguishable from each other (i.e., to agree).
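
A hedged PyTorch sketch of a single ePhysician as described above (ten fully connected hidden layers, Leaky ReLU, batch normalization), with one network per feature subset. The hidden width and output dimension are my assumptions, and the discriminator and cooperative training loop are omitted.

```python
# Illustrative ePhysician builder: ten hidden layers with BatchNorm + LeakyReLU.
import torch
import torch.nn as nn

def e_physician(n_features: int, hidden: int = 64, out_dim: int = 16) -> nn.Sequential:
    layers, width = [], n_features
    for _ in range(10):                       # ten hidden layers
        layers += [nn.Linear(width, hidden), nn.BatchNorm1d(hidden), nn.LeakyReLU()]
        width = hidden
    layers.append(nn.Linear(width, out_dim))  # representation fed to classifier/discriminator
    return nn.Sequential(*layers)

# One ePhysician per non-overlapping feature subset
acoustic, syntactic, semantic = e_physician(185), e_physician(117), e_physician(31)
print(syntactic(torch.randn(8, 117)).shape)   # torch.Size([8, 16])
```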

[4] indeed found that their CN, with the three naturally occurring and non-overlapping subsets of features, outperformed other models with a macro F1 of 0.7998. Additionally, [4] showed that the inclusion of noise and cooperative optimization did contribute to the performance.

In case of confusion, it’s important to reiterate that [2] used a different corpus.

Each paper, especially [4], describes their model in more detail, of course. I’m not primarily interested in the models themselves, as I don’t intend to diagnose dementia. My main focus in this article is to find out which features were used to train these models, as I’ll have to pay attention to the same features.

Important Language Features

In order for a conversational system to perform more naturally for those with cognitive impairments, how language changes must be investigated.

[4] sent all features to their ePhysicians, so they didn’t detail which features were most predictive. They did mention that pronoun-noun ratios were known to change, as those with cognitive impairments use more pronouns than nouns.

[2] interestingly achieved great results using just a person’s speech rate and turn-taking patterns. They did obtain fewer false positives by adding other features but stuck to content-free features, as mentioned. This means that their model does not depend on a specific language and can therefore be used on a global scale.

[1] extracted features that are known to be impacted by AD and additionally noted that patients’ vocabulary and semantic processing had declined.

[1] listed the following transcription-based features:

  • Lexical Richness
  • Utterance Length
  • Frequency of Filler Words
  • Frequency of Pronouns
  • Frequency of Verbs
  • Frequency of Adjectives
  • Frequency of Proper Nouns

and [1] listed the following acoustic features:

  • Word Finding Errors
  • Fluidity
  • Rhythm of Speech
  • Pause Frequency
  • Duration
  • Speech Rate
  • Articulation Rate

Brilliantly, [3] performed several feature selection methods upon the following features:

[3]

Upon all of these features, they implemented three feature selection methods to select the top eight features each: Information Gain, KNN, and SVM Recursive Feature Elimination (SVM-RFE).

They output the following:

[3]

Three features were selected by all three methods, suggesting that they’re highly predictive for detecting AD: Word Errors, Number of Prepositions, and Number of Repetitions.

It’s also important to restate that the most accurate model used the features selected by the KNN method.

Overall, this section identifies many features to pay attention to. In particular, however (drawing on both the four papers and the Alzheimer’s Society videos), we need to pay closest attention to:

  • Word Errors
  • Repetition
  • Pronoun-Noun Ratio
  • Number of Prepositions
  • Speech Rate
  • Pause Frequency

Conclusion

We’ve previously looked into the current research towards making conversational systems more natural, and we now have a relatively short list of features that must be handled if conversational systems are to perform fluidly, even if the user has a cognitive impairment like AD.

Of course, this isn’t an exhaustive list, but it’s a good place to start and points me in the right direction for what to work on next. Stay tuned!


Beginning to Replicate Natural Conversation in Real Time

A first step into the literature

To start my new project, the first thing I of course have to do is run through the current research and state-of-the-art models.

I was recently interviewed about this new project, but in short (extremely short): I aim to take a step towards making conversational agents more natural to talk with.

I have by no means exhausted the literature in this field; I have barely scratched the surface (link relevant papers below if you know of any I must read). Here is an overview of some of this research and the journey towards more natural conversational agents. In it, I will refer to the following papers:

[1]

Investigating Speech Features for Continuous Turn-Taking Prediction Using LSTMs by Matthew Roddy, Gabriel Skantze and Naomi Harte

[2]

Detection of social signals for recognizing engagement in human-robot interaction by Divesh Lala, Koji Inoue, Pierrick Milhorat and Tatsuya Kawahara

[3]

Investigating fluidity for human-robot interaction with real-time, real-world grounding strategies by Julian Hough and David Schlangen

[4]

Towards Deep End-of-Turn Prediction for Situated Spoken Dialogue Systems by Angelika Maier, Julian Hough and David Schlangen

[5]

Coordination in Spoken Human-Robot Interaction by Gabriel Skantze (Lecture Presentation in Glasgow 07/03/2019)

Contents

Introduction
Turn Taking – End of Turn Prediction
Engagement
Embodiment
Fluid Incremental Grounding Strategies
Conclusion

Introduction

If we think of two humans having a fluid conversation, it is very different from conversations between humans and Siri, Google Assistant, Alexa or Cortana.

source

One reason for this loss of flow is the number of large pauses. For a conversational agent (CA) to detect that you have finished what you are saying (finished your turn), it waits for a duration of silence. If it detects a long pause, it assumes you have finished your turn and then processes your utterance.

This set duration of silence varies slightly between systems. If it is set too low, the CA will interrupt you mid-turn as human dialogue is littered with pauses. If it is set too high, the system will be more accurate at detecting your end-of-turn but the CA will take painfully long to respond – killing the flow of the conversation and frustrating the user [4].
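
To make the baseline behaviour concrete, here is a minimal sketch (not taken from any of the papers) of such a silence-threshold endpointer: end-of-turn is declared once frame energy stays below a threshold for a fixed duration. The frame size, threshold, and energy measure are illustrative choices.

```python
# Naive silence-threshold end-of-turn detection (illustrative values only).
from typing import Optional
import numpy as np

FRAME_MS = 10            # analysis frame length
SILENCE_MS = 700         # silence duration treated as end-of-turn
ENERGY_THRESHOLD = 1e-4  # RMS energy below this counts as silence

def detect_end_of_turn(samples: np.ndarray, sample_rate: int) -> Optional[float]:
    """Return the time (in seconds) at which end-of-turn is declared, or None."""
    frame_len = int(sample_rate * FRAME_MS / 1000)
    needed_silent_frames = SILENCE_MS // FRAME_MS
    silent_run = 0
    for i in range(0, len(samples) - frame_len, frame_len):
        frame = samples[i:i + frame_len]
        rms = np.sqrt(np.mean(frame ** 2))
        silent_run = silent_run + 1 if rms < ENERGY_THRESHOLD else 0
        if silent_run >= needed_silent_frames:
            return (i + frame_len) / sample_rate
    return None

print(detect_end_of_turn(np.zeros(16000), 16000))  # 0.71 (all-silent toy signal)
```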

When two humans speak, we tend to minimise the gap between turns in the conversation, and this is cross-cultural. Across the globe, the gap between turns is around 200ms, which is close to the limit of human response time [1]. We must therefore predict the speaker’s end-of-turn (EOT) while listening to someone speak.

Turn Taking – End of Turn Prediction

To recreate this fluid dialogue with fast turn-switches and slight overlap in CAs, we must first understand how we do it ourselves.

Shameless, but slightly related, self-promotion. In order to work on Computer Vision, we must first understand Human Vision

We subconsciously interpret turn-taking cues to detect when it is our turn to speak, so what cues do we use? Similarly, we do this continuously while listening to someone speak, so can we recreate this incremental processing?

[4] used both acoustic and linguistic features to train an LSTM to tag 10ms windows. Their system is tasked with labelling these windows as either speech, mid-turn pause (MTP), or EOT, but the main focus of course is the first point in a sequence which is labelled as EOT.

The acoustic features used in the LSTM were raw pitch, smoothed F0, root-mean-squared signal energy, logarithmised signal energy, intensity, loudness, and derived features for each frame.

In addition to these acoustic features, linguistic features consisted of the words and an approximation of the incremental syntactic probability called the weighted mean log trigram probability (WML).
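
The sketch below shows, in rough PyTorch terms, the shape of such a frame-level tagger: an LSTM reads a per-frame vector of concatenated acoustic and linguistic features and labels every 10ms window as speech, MTP, or EOT. The layer sizes and feature dimension are placeholders, not the configuration used in [4].

```python
# Illustrative frame-level turn tagger: LSTM over per-frame features.
import torch
import torch.nn as nn

class TurnTagger(nn.Module):
    def __init__(self, n_features: int = 64, hidden: int = 128, n_labels: int = 3):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_labels)  # speech / MTP / EOT

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, n_features) -> label logits per 10ms frame
        outputs, _ = self.lstm(frames)
        return self.classifier(outputs)

# Example: a batch of 2 utterances, 500 frames (5 seconds) each
logits = TurnTagger()(torch.randn(2, 500, 64))
print(logits.shape)  # torch.Size([2, 500, 3])
```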

Many other signals have been identified to indicate whether the speaker is going to continue speaking or has finished their turn in [5]:

[5]

As mentioned, a wait time of 10 seconds for a response is just as irritating as the CA constantly cutting in mid-turn [4]. Multiple baselines with silence thresholds spaced regularly between 50ms and 6000ms were therefore considered, to ensure multiple trade-offs were included in the comparison.

Apart from one case (linguistic features only, 500ms silence threshold), every single model beat the baselines. Using only the linguistic or acoustic features didn’t make much of a difference, but performance was always best when the model used both sets of features together. The best overall system had a latency of 1195ms and a cut-in rate of just 18%.

[4]

[1] states that we predict EOT from multi-modal signals including: prosody, semantics, syntax, gesture and eye-gaze.

Instead of labelling 10ms windows (as speech, MTPs, or EOTs), traditional models predict whether a speaker will continue speaking (HOLD) or has finished their turn (SHIFT), but only do this when they detect a pause. One major problem with this traditional approach is that backchannels are neither a HOLD nor a SHIFT, but one of these is predicted anyway.

LSTMs have been used to make predictions continuously at 50ms intervals, and these models outperform traditional EOT models, and even humans, when applied to HOLD/SHIFT predictions. Their hidden layers allow them to learn long-range dependencies, but it is unknown exactly which features influence performance the most.

In [1], the new system completes three different turn-taking prediction tasks: (1) prediction at pauses, (2) prediction at onset and (3) prediction at overlap.

Prediction at Pauses is the standard prediction that takes place at brief pauses in the interaction to predict whether there will be a HOLD or SHIFT. Essentially, when there is a pause above a threshold time, the person with the highest average output probability (score) is predicted to speak next. This classification model is evaluated with weighted F-scores.

Prediction at Onsets classifies the utterances during speech, not at a pause. This model is slightly different, however, as it predicts whether the currently ongoing utterance will be short or long. Again, as a classifier, this model was evaluated using weighted F-scores.

Prediction at Overlap is introduced for the first time in this paper. This is essentially a HOLD/SHIFT prediction again, but made when an overlapping period of at least 100ms occurs. The decision to HOLD (continue speaking) is predicted when the overlap is a backchannel, and SHIFT when the system should stop speaking. This again was evaluated using weighted F-scores.

(A video example of predicted turn-taking in action was embedded here.)

As mentioned above, we don’t know exactly which features we use to predict when it is our turn to speak. [1] used many features in different arrangements to distinguish which are most useful. The features used were as follows:

Acoustic features are low-level descriptors that include loudness, shimmer, pitch, jitter, spectral flux, and MFCCs. These were extracted using the openSMILE toolkit.
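
For illustration, the snippet below extracts a few comparable low-level descriptors. The paper uses openSMILE; librosa is used here purely as a stand-in, and 'speech.wav' is a placeholder path.

```python
# Rough extraction of a few low-level descriptors (librosa as a stand-in for openSMILE).
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=16000)   # placeholder audio file

mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # low-order MFCCs
rms = librosa.feature.rms(y=y)                       # frame energy (loudness proxy)
f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)        # pitch (F0) track

# Stack into one (time, features) matrix, trimming to the shortest track
n = min(mfccs.shape[1], rms.shape[1], len(f0))
features = np.vstack([mfccs[:, :n], rms[:, :n], f0[None, :n]]).T
print(features.shape)
```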

Linguistic features were investigated at two levels: part-of-speech (POS) and word. The literature often suggests that POS tags are good at predicting turn-switches, but words (from an ASR system) are needed before POS tags can be extracted, so it is useful to check whether this extra processing is necessary.

Using words instead of POS would be a great advantage for systems that need to run in real time.

Phonetic features were output from a deep neural network (DNN) that classifies senones.

Voice activity was included in their transcriptions, so it was also used as a feature.

So what features were the most useful for EOT prediction according to [1]?

Acoustic features were great for EOT prediction; all but one experiment’s best results included acoustic features. This was particularly the case for prediction at overlap.

Words mostly outperformed POS tags, apart from prediction at onset, so use POS tags if you want to predict utterance length (e.g., backchannels).

In all cases, including voice activity improved performance.

In terms of acoustic features, the most important features were loudness, F0, low order MFCCs and spectral slope features.

Overall, the best performance was obtained by using voice activity, acoustic features and words.

As mentioned, the fact that using words instead of POS tags leads to better performance is brilliant for faster processing. This of course is beneficial for real-time incremental prediction – just like what we humans do.

All of these features are not just used to detect when we can next speak but are even used to guide what we say. We will expand on what we are saying, skip details or change topic depending on how engaged the other person is with what we are saying.

Therefore to model natural human conversation, it is important for a CA to measure engagement.

Engagement

Engagement shows interest and attention in a conversation and, as we want users to stay engaged, it influences the dialogue strategy of the CA. This optimisation of the user experience all has to be done in real time to keep the conversation fluid.

[2] detects the following signals to measure engagement: nodding, laughter, verbal backchannels, and eye gaze. The fact that these signals show attention and interest is relatively common sense, but they were learned from a large corpus of human-robot interactions.

[2]

[2] doesn’t just focus on recognising social signals but also on creating an engagement recognition model.

This experiment was run in Japan, where nodding is particularly common. Seven features were extracted to detect nodding: per frame, the yaw, roll, and pitch of the person’s head; and per 15 frames, the average speed, average velocity, average acceleration, and range of the person’s head movement.
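
As a rough illustration (not the paper’s code), the snippet below assembles those seven features from a head-pose track: per-frame yaw/roll/pitch plus motion statistics over 15-frame windows, using the pitch channel as the motion signal, which is an assumption on my part.

```python
# Illustrative nodding-feature extraction from a head-pose track.
import numpy as np

def nodding_features(yaw: np.ndarray, roll: np.ndarray, pitch: np.ndarray,
                     window: int = 15) -> np.ndarray:
    """Return one 7-dimensional feature row per 15-frame window."""
    rows = []
    for start in range(0, len(pitch) - window + 1, window):
        y, r, p = yaw[start:start+window], roll[start:start+window], pitch[start:start+window]
        velocity = np.diff(p)               # frame-to-frame change in pitch
        acceleration = np.diff(velocity)
        rows.append([
            y.mean(), r.mean(), p.mean(),   # head orientation
            np.abs(velocity).mean(),        # average speed
            velocity.mean(),                # average (signed) velocity
            acceleration.mean(),            # average acceleration
            p.max() - p.min(),              # range of movement
        ])
    return np.array(rows)

features = nodding_features(*np.random.randn(3, 150))
print(features.shape)  # (10, 7)
```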

Their LSTM model outperformed the other approaches to detect nodding across all metrics.

Smiling is often used to detect engagement, but to avoid using a camera (they use microphones + a Kinect), laughter is detected instead. Each model was tasked with classifying whether an inter-pausal unit (IPU) of sound contained laughter or not. Using both prosody and linguistic features to train a two-layer DNN performed best, but using other spectral features instead of linguistic features (not necessarily available from the ASR) could improve the model.

Similarly to nodding, verbal backchannels (called aizuchi) are more frequent in Japan. Additionally, in Japan, verbal backchannels are often accompanied by head movements, but only the sound was provided to the model. Similar to the laughter detection, this model classifies whether an IPU is a backchannel or the person starting their turn (especially difficult when they are barging in). The best performing model was found to be a random forest, with 56 estimators, using both prosody and linguistic features. The model still performed reasonably when given only prosodic features (again because linguistic features may not be available from the ASR).

Finally, eye gaze is commonly known as a clear sign of engagement. Based on the inter-annotator agreement, looking at Erica’s head (the robot embodiment in this experiment) for 10 seconds continuously was considered engagement. Looking for less than 10 seconds was therefore a negative case.

Erica: source

The information from the Kinect sensor was used to calculate a vector from the user’s head orientation, and the user was considered ‘looking at Erica’ if that vector collided with Erica’s head (plus 30cm to accommodate error). This geometry-based model worked relatively well, but the position of Erica’s head was estimated, so this will have affected results. It is expected that this model will improve significantly when exact values are known.
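
The geometric test can be sketched as a simple ray-versus-sphere check, as below. Only the 30cm margin comes from the text; the head radius and positions are illustrative assumptions.

```python
# Toy gaze check: does the head-orientation ray pass close enough to the target head?
import numpy as np

def looking_at_target(head_pos, gaze_dir, target_pos,
                      target_radius=0.15, margin=0.30) -> bool:
    """True if the gaze ray passes within target_radius + margin of the target."""
    gaze_dir = gaze_dir / np.linalg.norm(gaze_dir)
    to_target = target_pos - head_pos
    along = np.dot(to_target, gaze_dir)
    if along < 0:                      # target is behind the user
        return False
    closest = head_pos + along * gaze_dir
    return np.linalg.norm(target_pos - closest) <= target_radius + margin

user_head = np.array([0.0, 0.0, 1.6])
erica_head = np.array([0.0, 1.2, 1.5])
gaze = erica_head - user_head + np.array([0.05, 0.0, 0.0])  # roughly towards Erica
print(looking_at_target(user_head, gaze, erica_head))       # True
```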

This paper doesn’t aim to create the best individual systems but instead hypothesises that these models in conjunction will perform better than the individual models at detecting engagement.

[2]

The ensemble of the above models was used as a binary classifier (either a person was engaged or not). In particular, they built a hierarchical Bayesian binary classifier which judged whether the listener was engaged from the 16 possible combinations of outputs from the 4 models above.

From the annotators, a model was built to deduce which features are more or less important when detecting engagement. Some annotators found laughter to be a particularly important factor, for example, whereas others did not. They found that inputting a character variable with three different character types improved the model’s performance.

Additionally, including the previous engagement of a listener also improved the model. This makes sense as someone that is not interested currently is more likely to stay uninterested during your next turn.

Measuring engagement can only really be done when a CA is embodied (eye contact with Siri is non-existent, for example). Social robots are increasingly being used in areas such as teaching, public spaces, healthcare, and manufacturing. These can all contain spoken dialogue systems, but why do they have to be embodied?

[5]

Embodiment

People will travel across the globe to have a face-to-face meeting when they could just phone [5]. We don’t like to interact without seeing the other person as we miss many of the signals that we talked about above. In today’s world we can also video-call but this is still avoided when possible for the same reasons. The difference between talking on the phone or face-to-face is similar to the difference between talking to Siri and an embodied dialogue system [5].

Current voice systems cannot show facial expressions, indicate attention through eye contact or move their lips. Lip reading is obviously very useful for those with impaired hearing but we all lip read during conversation (this is how we know what people are saying even in very noisy environments).

Not only can a face output these signals, it also allows the system to detect who is speaking, who is paying attention, who the actual people are (Rory, Jenny, etc…) and recognise their facial expressions.

Robot faces come in many forms however and some are better than others for use in conversation. Most robot faces, such as the face of Nao, are very static and therefore cannot show a wide range of emotion through expression like we do.

Nao: source

Some more abstract robot face depictions, such as Jibo, can show emotion using shapes and colour but some expressions must be learned.

Jibo: source

We know how to read a human face so it makes sense to show a human face. Hyper-realistic robot faces exist but are a bit creepy, like Sophia, and are very expensive.

Sophia: source

They are very realistic but just not quite right which makes conversation very uncomfortable. To combat this, avatars have been made to have conversations on screen.

source

These can mimic humans relatively closely without being creepy, as it’s not a physical robot. This is almost like Skype, however, and the method suffers from the ‘Mona Lisa effect’: in multi-party dialogue, it is impossible for the avatar on screen to look at one person and not the other. Either the avatar is looking ‘out’ at all parties or away at no one.

Gabriel Skantze (presenter of [5], to be clear) is the co-founder of Furhat Robotics and argues that Furhat is the best balance between all of these systems. Furhat has been developed to be used for conversational applications as a receptionist, social trainer, therapist, interviewer, etc…

source

Furhat needs to know where it should be looking, when it should speak, what it should say and what facial expressions it should be displaying [5].

Finally (for now), once embodied, dialogues with a robot need to be grounded in real time with the real world. In [3] the example given is a CA embodied in an industrial machine, which [5] states is becoming more and more common.

source

Fluid, Incremental Grounding Strategies

For a conversation to be natural, human-robot conversations must be grounded in a fluid manner [3].

With non-incremental grounding, users can give positive feedback and repair but only after the robot has shown full understanding of the request. If you ask a robot to move an object somewhere for example, you must wait until the object is moved before you can correct it with an utterance like “no, move the red one”. No overlapping speech is possible so actions must be reversed entirely if a repair is needed.

With incremental grounding, overlapping is still not possible, but feedback can be given at more regular intervals. Instead of the entire task being completed before feedback can be given, feedback can be given at sub-task intervals. “no, move the red one” can be said just after the robot picks up a blue object, repairing quickly. In the previous example, the blue object was placed in a given location before the repair could be given, which resulted in a reversal of the whole task! This is much more efficient but still not fluid like human-human interactions.

Fluid incremental grounding is possible if overlaps are processed. Allowing and reasoning over concurrent speech and action is much more natural. Continuing with our repair example, “no, move the red one” can be said as soon as the robot is about to pick up the blue object, no task has to be completed and reversed as concurrency is allowed. The pickup task can be aborted and the red object picked up fluidly as you say what to do with it.

[2]

To move towards this more fluid grounding, real-time processing needs to take place. Not only does the system need to process utterances word by word, real-time context needs to be monitored such as the robots current state and planned actions (both of which can change dynamically through the course of an utterance or word).

The robot must know when it has sufficiently shown what it is doing to handle both repairs and confirmations. The robot needs to know what the user is confirming and even more importantly, what is needing repaired.

Conclusion

In this brief overview, I have covered just a tiny amount of the current work towards more natural conversational systems.

Even if turn-taking prediction, engagement measuring, embodiment and fluid grounding were all perfected, CAs would not have conversations like we humans do. I plan to write more of these overviews over the next few years so look out for them if interested.

In the meantime, please do comment with discussion points, critique my understanding and suggest papers that I (and anyone reading this) may find interesting.


Beginning to Replicate Natural Conversation in Real Time was originally published in Wallscope on Medium, where people are continuing the conversation by highlighting and responding to this story.

Beginning to Replicate Natural Conversation in Real Time

A first step into the literature

To start my new project, the first thing I of course have to do is review the current research and state-of-the-art models.

I was recently interviewed about this new project, but in short (extremely short): I aim to take a step towards making conversational agents more natural to talk with.

I have by no means exhausted the literature in this field; I have barely scratched the surface (please link relevant papers below if you know of any I must read). Here is an overview of some of this research and the journey towards more natural conversational agents. In it I will refer to the following papers:

[1]

Investigating Speech Features for Continuous Turn-Taking Prediction Using LSTMs by Matthew Roddy, Gabriel Skantze and Naomi Harte

[2]

Detection of social signals for recognizing engagement in human-robot interaction by Divesh Lala, Koji Inoue, Pierrick Milhorat and Tatsuya Kawahara

[3]

Investigating fluidity for human-robot interaction with real-time, real-world grounding strategies by Julian Hough and David Schlangen

[4]

Towards Deep End-of-Turn Prediction for Situated Spoken Dialogue Systems by Angelika Maier, Julian Hough and David Schlangen

[5]

Coordination in Spoken Human-Robot Interaction by Gabriel Skantze (Lecture Presentation in Glasgow 07/03/2019)

Contents

Introduction
Turn Taking - End of Turn Prediction
Engagement
Embodiment
Fluid Incremental Grounding Strategies
Conclusion

Introduction

If we think of two humans having a fluid conversation, it is very different from conversations between humans and Siri, Google Assistant, Alexa or Cortana.


One reason for this loss of flow is the number of large pauses. For a conversational agent (CA) to detect that you have finished what you are saying (finished your turn), it waits for a duration of silence. If it detects a long pause, it assumes you have finished your turn and then processes your utterance.

This set duration of silence varies slightly between systems. If it is set too low, the CA will interrupt you mid-turn as human dialogue is littered with pauses. If it is set too high, the system will be more accurate at detecting your end-of-turn but the CA will take painfully long to respond - killing the flow of the conversation and frustrating the user [4].
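To make that trade-off concrete, here is a minimal sketch of such a silence-threshold detector in Python. Everything here is illustrative: the `voiced` flags would come from some voice activity detector, and the frame size and 700ms threshold are just example values, not settings from any of the papers.

```python
# Minimal sketch of the silence-threshold approach described above.
# `voiced` is an illustrative list of booleans, one per 10 ms audio frame,
# marking whether speech was detected in that frame (e.g. from a VAD).

FRAME_MS = 10

def detect_end_of_turn(voiced, silence_threshold_ms=700):
    """Return the index of the frame where the system would declare
    end-of-turn, i.e. after `silence_threshold_ms` of continuous silence."""
    needed = silence_threshold_ms // FRAME_MS
    silent_run = 0
    for i, is_speech in enumerate(voiced):
        silent_run = 0 if is_speech else silent_run + 1
        if silent_run >= needed:
            return i  # the system responds here
    return None  # the user never stopped long enough

# A mid-turn pause longer than the threshold triggers a false end-of-turn:
voiced = [True] * 50 + [False] * 80 + [True] * 50   # 0.5 s speech, 0.8 s pause, 0.5 s speech
print(detect_end_of_turn(voiced, silence_threshold_ms=700))  # fires during the pause
```

Lowering `silence_threshold_ms` makes the system interrupt mid-turn pauses sooner; raising it delays every response, which is exactly the trade-off described above.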

When two humans speak, we tend to minimise the gap between turns in the conversation, and this is cross-cultural. Across the globe, the gap between turns is around 200ms, which is close to the limit of human response time [1]. We must therefore predict the speaker's end-of-turn (EOT) while listening to someone speak.

Turn Taking - End of Turn Prediction

To recreate this fluid dialogue with fast turn-switches and slight overlap in CAs, we must first understand how we do it ourselves.

Shameless, but slightly related, self-promotion. In order to work on Computer Vision, we must first understand Human Vision

We subconsciously interpret turn-taking cues to detect when it is our turn to speak, so what cues do we use? Similarly, we do this continuously while listening to someone speak, so can we recreate this incremental processing?

[4] used both acoustic and linguistic features to train an LSTM to tag 10ms windows. Their system is tasked with labelling these windows as either speech, mid-turn pause (MTP) or EOT, but the main focus, of course, is the first point in a sequence labelled as EOT.

The acoustic features used in the LSTM were: raw pitch, smoothed F0, root mean squared signal energy, logarithmised signal energy, intensity, loudness and features derived from these for each frame.

In addition to these acoustic features, linguistic features consisted of the words and an approximation of the incremental syntactic probability called the weighted mean log trigram probability (WML).
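As a rough illustration of the idea (not the exact formulation in [4]), a WML-style feature can be sketched as the weighted mean of log trigram probabilities of the words seen so far; `trigram_logprob` below is a hypothetical stand-in for whatever language model is available.

```python
import math

def weighted_mean_log_trigram(words, trigram_logprob, weights=None):
    """Rough sketch of a WML-style feature: the (optionally weighted) mean of
    log trigram probabilities of the words seen so far. `trigram_logprob` is a
    hypothetical callable (w, w_prev2, w_prev1) -> log P(w | w_prev2, w_prev1).
    The exact weighting used in [4] may differ from this simple version."""
    padded = ["<s>", "<s>"] + list(words)
    logps = [trigram_logprob(padded[i], padded[i - 2], padded[i - 1])
             for i in range(2, len(padded))]
    if weights is None:
        weights = [1.0] * len(logps)
    return sum(w * lp for w, lp in zip(weights, logps)) / sum(weights)

# Toy example with a uniform "language model" over a 1000-word vocabulary:
uniform_lm = lambda w, a, b: math.log(1 / 1000)
print(weighted_mean_log_trigram("i think that is fine".split(), uniform_lm))
```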

Many other signals that indicate whether the speaker is going to continue speaking or has finished their turn are identified in [5]:

[5]

As mentioned, a wait time of 10 seconds for a response is just as irritating as the CA constantly cutting in mid-turn [4]. Multiple silence-threshold baselines, spaced regularly between 50ms and 6000ms, were therefore used so that several points on this trade-off were represented.

Apart from one case (linguistic features only, 500ms silence threshold), every single model beat the baselines. Using only the linguistic or acoustic features didn't make much of a difference, but performance was always best when the model used both sets of features together. The best overall system had a latency of 1195ms and a cut-in rate of just 18%.

[4]

[1] states that we predict EOT from multi-modal signals including: prosody, semantics, syntax, gesture and eye-gaze.

Instead of labelling 10ms windows (as speech, MTPs or EOTs), traditional models predict whether a speaker will continue speaking (HOLD) or has finished their turn (SHIFT), but only do this when they detect a pause. One major problem with this traditional approach is that backchannels are neither a HOLD nor a SHIFT, yet one of the two is predicted anyway.

LSTMs have been used to make predictions continuously at 50ms intervals, and these models outperform traditional EOT models (and even humans) when applied to HOLD/SHIFT predictions. Their hidden layers allow them to learn long-range dependencies, but it is unknown exactly which features influence performance the most.
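For a sense of what "a prediction every 50ms" looks like in code, here is a small PyTorch sketch; the layer sizes and feature dimension are placeholders rather than the architecture actually used in [1].

```python
import torch
import torch.nn as nn

class ContinuousTurnTaking(nn.Module):
    """Sketch of a continuous turn-taking predictor in the spirit of [1]:
    one feature vector per 50 ms frame in, one speech-activity probability
    per frame out. Layer sizes here are illustrative, not the paper's."""

    def __init__(self, n_features=64, hidden=40):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, frames):             # frames: (batch, time, n_features)
        h, _ = self.lstm(frames)           # (batch, time, hidden)
        return torch.sigmoid(self.out(h))  # (batch, time, 1): a score per frame

model = ContinuousTurnTaking()
scores = model(torch.randn(1, 120, 64))    # 120 frames = 6 s of 50 ms windows
print(scores.shape)                        # torch.Size([1, 120, 1])
```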

In [1], the new system completes three different turn-taking prediction tasks: (1) prediction at pauses, (2) prediction at onset and (3) prediction at overlap.

Prediction at Pauses is the standard prediction that takes place at brief pauses in the interaction to predict whether there will be a HOLD or SHIFT. Essentially, when there is a pause above a threshold time, the person with the highest average output probability (score) is predicted to speak next. This classification model is evaluated with weighted F-scores.

Prediction at Onsets classifies utterances during speech, not at a pause. This model is slightly different, however, as it predicts whether the currently ongoing utterance will be short or long. As this is also a classifier, it was again evaluated using weighted F-scores.

Prediction at Overlap is introduced for the first time in this paper. This is essentially a HOLD/SHIFT prediction again, but made when an overlapping period of at least 100ms occurs. The decision to HOLD (continue speaking) is predicted when the overlap is a backchannel, and SHIFT when the system should stop speaking. This again was evaluated using weighted F-scores.
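Loosely, the first of these tasks, prediction at pauses, could look like the sketch below: average each party's predicted score over a recent window and let the higher one speak next. The window length and the exact scoring are my assumptions, not details from [1].

```python
import numpy as np

def predict_at_pause(score_current, score_other, window=10):
    """Sketch of prediction at pauses: when a pause is detected, compare the
    average predicted speech-activity score of the current speaker and the
    other party over the last `window` frames. The higher one is predicted
    to speak next. Window length and details are illustrative."""
    hold = np.mean(score_current[-window:]) >= np.mean(score_other[-window:])
    return "HOLD" if hold else "SHIFT"

# Toy scores: the current speaker's predicted activity is trailing off while
# the other party's is rising, so a SHIFT is predicted.
current = np.linspace(0.9, 0.2, 20)
other = np.linspace(0.1, 0.8, 20)
print(predict_at_pause(current, other))  # SHIFT
```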

Here is an example of predicted turn-taking in action:

As mentioned above, we don’t know exactly which features we use to predict when it is our turn to speak. [1] used many features in different arrangements to distinguish which are most useful. The features used were as follows:

Acoustic features are low-level descriptors that include loudness, shimmer, pitch, jitter, spectral flux and MFCCs. These were extracted using the openSMILE toolkit.
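As a rough stand-in for that extraction step (using librosa rather than the openSMILE toolkit the paper used, and a hypothetical `speech.wav`), frame-level acoustic descriptors can be pulled out like this:

```python
import numpy as np
import librosa

# Rough stand-in for openSMILE low-level descriptors, using librosa instead.
# "speech.wav" is a hypothetical file; ~50 ms windows with 10 ms hops at 16 kHz.
y, sr = librosa.load("speech.wav", sr=16000)
frame, hop = 800, 160

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=frame, hop_length=hop)
rms = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)        # loudness proxy
f0, voiced, _ = librosa.pyin(y, fmin=50, fmax=400, sr=sr,
                             frame_length=frame, hop_length=hop)          # pitch track
flux = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop)           # spectral-flux-like

n = min(mfcc.shape[1], rms.shape[1], len(f0), len(flux))                  # align frame counts
features = np.vstack([mfcc[:, :n], rms[:, :n],
                      np.nan_to_num(f0)[None, :n], flux[None, :n]])
print(features.shape)   # (16, n): one acoustic feature vector per frame
```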

Linguistic features were investigated at two levels: part-of-speech (POS) and word. The literature often suggests that POS tags are good at predicting turn-switches, but POS tags must themselves be extracted from the words produced by an ASR system, so it is useful to check whether this extra processing is actually needed.

Using words instead of POS would be a great advantage for systems that need to run in real time.
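To see why, note the extra step POS features require: the ASR hands you words, and a tagger still has to run on top of them before any POS tags exist. spaCy is used below purely as an example tagger; it is not necessarily what [1] used.

```python
import spacy

# Illustration of the extra processing step POS features require: the ASR
# gives you words, and a tagger (spaCy here, just as an example) has to run
# on top of them before POS tags exist at all.
nlp = spacy.load("en_core_web_sm")

asr_words = ["so", "i", "was", "thinking", "we", "could"]
doc = nlp(" ".join(asr_words))

word_features = [tok.text for tok in doc]   # available immediately from the ASR
pos_features = [tok.pos_ for tok in doc]    # requires this extra tagging pass
print(list(zip(word_features, pos_features)))
```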

Phonetic features were output from a deep neural network (DNN) that classifies senones.

Voice activity was included in their transcriptions, so it was also used as a feature.

So what features were the most useful for EOT prediction according to [1]?

Acoustic features were great for EOT prediction: all but one experiment's best results included acoustic features. This was particularly the case for prediction at overlap.

Words mostly outperformed POS tags, except for prediction at onset, so use POS tags if you want to predict utterance length (for example, to spot short utterances like backchannels).

In all cases, including voice activity improved performance.

In terms of acoustic features, the most important were loudness, F0, low-order MFCCs and spectral slope features.

Overall, the best performance was obtained by using voice activity, acoustic features and words.

As mentioned, the fact that using words instead of POS tags leads to better performance is brilliant for faster processing. This of course is beneficial for real-time incremental prediction - just like what we humans do.

All of these features are not just used to detect when we can next speak but even guide what we say. We will expand on a point, skip details or change topic depending on how engaged the other person is with what we are saying.

Therefore to model natural human conversation, it is important for a CA to measure engagement.

Engagement

Engagement shows interest and attention in a conversation and, as we want users to stay engaged, it influences the dialogue strategy of the CA. This optimisation of the user experience all has to be done in real time to keep the conversation fluid.

[2] detects the following signals to measure engagement: nodding, laughter, verbal backchannels and eye gaze. The fact that these signals show attention and interest is fairly common sense, but they were learned from a large corpus of human-robot interactions.

[2]

[2] doesn’t just focus on recognising social signals but also on creating an engagement recognition model.

This experiment was run in Japan, where nodding is particularly common. Seven features were extracted to detect nodding: per frame, the yaw, roll and pitch of the person's head; and per 15 frames, the average speed, average velocity, average acceleration and range of the person's head movement.
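A rough sketch of those per-15-frame summaries, computed here from the head pitch track alone (the paper's exact definitions of speed versus velocity are not spelled out here, so this is an approximation):

```python
import numpy as np

def nod_window_features(pitch, window=15):
    """Rough sketch of the per-15-frame nodding features described above,
    computed from the head pitch track (one value per video frame). How [2]
    defines "speed" vs "velocity" exactly is not reproduced here; this treats
    speed as |diff| and velocity as the signed diff."""
    p = np.asarray(pitch[-window:], dtype=float)
    diffs = np.diff(p)
    return {
        "avg_speed": np.mean(np.abs(diffs)),
        "avg_velocity": np.mean(diffs),
        "avg_acceleration": np.mean(np.diff(diffs)),
        "range": p.max() - p.min(),
    }

# Toy pitch track: the head dips and comes back up (a nod-like movement).
pitch_track = [0, 2, 5, 9, 12, 13, 12, 9, 5, 2, 0, 1, 0, 1, 0]
print(nod_window_features(pitch_track))
```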

Their LSTM model outperformed the other approaches to detect nodding across all metrics.

Smiling is often used to detect engagement but, to avoid using a camera (they use microphones plus a Kinect), laughter is detected instead. Each model was tasked with classifying whether an inter-pausal unit (IPU) of sound contained laughter or not. A two-layer DNN trained on both prosodic and linguistic features performed best, but substituting other spectral features for the linguistic ones (which are not necessarily available from the ASR) could improve the model further.

Similarly to nodding, verbal backchannels (called aizuchi) are more frequent in Japan. Additionally, in Japan verbal backchannels are often accompanied by head movements, but only the sound was provided to the model. As with the laughter detection, this model classifies whether an IPU is a backchannel or the person starting their turn (especially difficult when they are barging in). The best-performing model was found to be a random forest, with 56 estimators, using both prosodic and linguistic features. The model still performed reasonably when given only prosodic features (again because linguistic features may not be available from the ASR).
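The model itself is easy to picture; here is a sketch of that setup with scikit-learn, where the random feature matrix merely stands in for real per-IPU prosodic and linguistic features (scikit-learn is my choice for illustration, not necessarily what [2] used).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Sketch of the backchannel classifier setup: a random forest with 56
# estimators, as reported for [2]. The feature matrix here is random noise
# standing in for per-IPU prosodic + linguistic features; labels are
# 1 = backchannel, 0 = start of a full turn.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))        # 500 IPUs, 30 placeholder features
y = rng.integers(0, 2, size=500)

clf = RandomForestClassifier(n_estimators=56, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))             # per-IPU backchannel / turn-start decisions
```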

Finally, eye gaze is commonly known as a clear sign of engagement. Based on the inter-annotator agreement, looking at Erica's head (the robot embodiment in this experiment) continuously for 10 seconds was considered engagement; anything less than 10 seconds was therefore a negative case.

Erica: source

The information from the Kinect sensor was used to calculate a vector from the user's head orientation, and the user was considered to be 'looking at Erica' if that vector collided with Erica's head (plus 30cm to accommodate error). This geometry-based model worked relatively well, but the position of Erica's head was estimated, which will have affected results. It is expected that this model will improve significantly when exact values are known.
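Geometrically, that check is just a ray test: cast a ray from the user's head along the orientation vector and see whether it passes close enough to Erica's head. The sketch below keeps the 30cm margin from the paper, but the head radius and positions are illustrative.

```python
import numpy as np

def looking_at(head_pos, gaze_dir, target_pos, target_radius=0.1, margin=0.3):
    """Geometric sketch of the gaze check described above: cast a ray from the
    user's head along their head-orientation vector and test whether it passes
    within `target_radius + margin` of the target (Erica's head). Positions are
    in metres; the head radius is illustrative, the 30 cm margin is kept."""
    d = np.asarray(gaze_dir, dtype=float)
    d /= np.linalg.norm(d)
    v = np.asarray(target_pos, dtype=float) - np.asarray(head_pos, dtype=float)
    t = np.dot(v, d)
    if t < 0:                             # target is behind the user
        return False
    closest = np.linalg.norm(v - t * d)   # perpendicular distance from ray to target
    return closest <= target_radius + margin

# User 1.5 m in front of Erica, facing roughly towards her vs. away from her:
print(looking_at(head_pos=[0, 0, 0], gaze_dir=[0.05, 0, 1], target_pos=[0, 0, 1.5]))  # True
print(looking_at(head_pos=[0, 0, 0], gaze_dir=[1, 0, 0.2], target_pos=[0, 0, 1.5]))   # False
```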

This paper doesn’t aim to create the best individual systems but instead hypothesises that these models in conjunction will perform better than the individual models at detecting engagement.

[2]

The ensemble of the above models was used as a binary classifier (either a person was engaged or not). In particular, they built a hierarchical Bayesian binary classifier which judged whether the listener was engaged from the 16 possible combinations of outputs from the four models above.
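As a simplified picture of that idea (a flat, count-based approximation rather than the hierarchical Bayesian model actually used in [2]), one can estimate an engagement probability for each of the 16 output combinations directly from annotated data:

```python
from collections import Counter
from itertools import product

def fit_combination_table(detector_outputs, labels, alpha=1.0):
    """Flat, count-based approximation of the idea above: estimate
    P(engaged | combination of the four binary detector outputs) from
    annotated data, with Laplace smoothing. This is a simplification of the
    hierarchical Bayesian model actually used in [2]."""
    engaged, total = Counter(), Counter()
    for combo, label in zip(detector_outputs, labels):
        total[tuple(combo)] += 1
        engaged[tuple(combo)] += label
    return {c: (engaged[c] + alpha) / (total[c] + 2 * alpha)
            for c in product([0, 1], repeat=4)}       # all 16 combinations

# Toy data: (nod, laugh, backchannel, gaze) per listener turn, 1 = engaged.
combos = [(1, 0, 1, 1), (0, 0, 0, 0), (1, 0, 1, 1), (0, 0, 0, 1)]
labels = [1, 0, 1, 0]
table = fit_combination_table(combos, labels)
print(table[(1, 0, 1, 1)], table[(0, 0, 0, 0)])   # high vs low engagement probability
```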

From the annotators, a model was built to deduce which features are more or less important when detecting engagement. Some annotators found laughter to be a particularly important factor, for example, whereas others did not. They found that adding a character variable with three different character types improved the model's performance.

Additionally, including the listener's previous engagement also improved the model. This makes sense, as someone who is not currently interested is more likely to stay uninterested during your next turn.

Measuring engagement can only really be done when a CA is embodied (eye contact with Siri is non-existent, for example). Social robots are increasingly being used in areas such as teaching, public spaces, healthcare and manufacturing. These can all contain spoken dialogue systems, but why do they have to be embodied?

[5]

Embodiment

People will travel across the globe to have a face-to-face meeting when they could just phone [5]. We don't like to interact without seeing the other person, as we miss many of the signals discussed above. In today's world we can also video-call, but this is still avoided when possible for similar reasons. The difference between talking on the phone and talking face-to-face is similar to the difference between talking to Siri and talking to an embodied dialogue system [5].

Current voice systems cannot show facial expressions, indicate attention through eye contact or move their lips. Lip reading is obviously very useful for those with impaired hearing, but we all lip-read during conversation (this is how we can tell what people are saying even in very noisy environments).

Not only can a face output these signals, it also allows the system to detect who is speaking, who is paying attention, who the actual people are (Rory, Jenny, etc…) and recognise their facial expressions.

Robot faces come in many forms, however, and some are better suited than others for use in conversation. Most robot faces, such as the face of Nao, are very static and therefore cannot show the wide range of emotion through expression that we do.

Nao: source

Some more abstract robot face depictions, such as Jibo, can show emotion using shapes and colour but some expressions must be learned.

Jibo: source

We know how to read a human face, so it makes sense to show a human face. Hyper-realistic robot faces, like Sophia, do exist but are a bit creepy and very expensive.

Sophia: source

They are very realistic but just not quite right, which makes conversation very uncomfortable. To combat this, on-screen avatars have been built to hold conversations.


These can mimic humans relatively closely without being creepy, as they are not physical robots. This is almost like Skype, however, and the method suffers from the 'Mona Lisa effect': in multi-party dialogue, it is impossible for the avatar on screen to look at one person and not the other. Either the avatar is looking 'out' at all parties or away at no one.

Gabriel Skantze (the presenter of [5], to be clear) is the co-founder of Furhat Robotics and argues that Furhat is the best balance between all of these systems. Furhat has been developed to be used for conversational applications such as a receptionist, social trainer, therapist, interviewer, etc…


Furhat needs to know where it should be looking, when it should speak, what it should say and what facial expressions it should be displaying [5].

Finally (for now), once embodied, dialogues with a robot need to be grounded in real time with the real world. In [3] the example given is a CA embodied in an industrial machine, which [5] states is becoming more and more common.


Fluid, Incremental Grounding Strategies

For a conversation to be natural, human-robot conversations must be grounded in a fluid manner [3].

With non-incremental grounding, users can give positive feedback and repair but only after the robot has shown full understanding of the request. If you ask a robot to move an object somewhere for example, you must wait until the object is moved before you can correct it with an utterance like “no, move the red one”. No overlapping speech is possible so actions must be reversed entirely if a repair is needed.

With incremental grounding, overlapping is still not possible, but feedback can be given at more regular intervals. Instead of waiting for the entire task to be completed, feedback can be given at sub-task intervals: "no, move the red one" can be said just after the robot picks up a blue object, repairing quickly. In the previous example, the blue object would have been placed at the given location before the repair could be given, resulting in a reversal of the whole task! This is much more efficient but still not as fluid as human-human interaction.

Fluid incremental grounding is possible if overlaps are processed: allowing, and reasoning over, concurrent speech and action is much more natural. Continuing with our repair example, "no, move the red one" can be said as soon as the robot is about to pick up the blue object; no task has to be completed and reversed, as concurrency is allowed. The pick-up can be aborted and the red object picked up fluidly as you say what to do with it.
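A toy sketch of what makes this fluid: words are processed one at a time while an action is still running, and a repair can abort that action mid-way instead of waiting for it to finish and be reversed. All of the names below are hypothetical; this is not the actual system from [3].

```python
# Toy sketch of the fluid, incremental idea: words are processed one at a
# time while an action is in progress, and a repair can abort that action
# before it completes. All names here are hypothetical, not taken from [3].

class Robot:
    def __init__(self):
        self.current_action = None    # e.g. ("pick_up", "blue")

    def start(self, action, colour):
        self.current_action = (action, colour)
        print(f"starting: {action} {colour} object")

    def abort_and_switch(self, colour):
        print(f"aborting {self.current_action}, switching to pick_up {colour}")
        self.current_action = ("pick_up", colour)

def process_word(robot, word):
    """Incremental word-by-word processing with repair handling."""
    if word == "no" and robot.current_action:        # repair initiated mid-action
        print("repair detected while acting")
    elif word in ("red", "blue") and robot.current_action:
        if robot.current_action[1] != word:
            robot.abort_and_switch(word)             # concurrent repair, no full reversal

robot = Robot()
robot.start("pick_up", "blue")
for w in "no move the red one".split():              # arrives while the robot is acting
    process_word(robot, w)
```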

[2]

To move towards this more fluid grounding, real-time processing needs to take place. Not only does the system need to process utterances word by word, it also needs to monitor real-time context such as the robot's current state and planned actions (both of which can change dynamically over the course of an utterance or even a word).

The robot must know when it has sufficiently shown what it is doing in order to handle both repairs and confirmations. The robot needs to know what the user is confirming and, even more importantly, what needs repairing.

Conclusion

In this brief overview, I have covered just a tiny amount of the current work towards more natural conversational systems.

Even if turn-taking prediction, engagement measuring, embodiment and fluid grounding were all perfected, CAs would not have conversations like we humans do. I plan to write more of these overviews over the next few years so look out for them if interested.

In the meantime, please do comment with discussion points, critique my understanding and suggest papers that I (and anyone reading this) may find interesting.


Beginning to Replicate Natural Conversation in Real Time was originally published in Wallscope on Medium, where people are continuing the conversation by highlighting and responding to this story.