No problem at all, I love getting comments!


I was also surprised! In the benchmark I am developing I have no limits without orders for this reason. Also, as I mention in the article - it would make no sense in production.

Thanks again,

Angus

Hi Nemanja, firstly – thanks for reading and commenting!


I tested this, so I'm 99% sure that sentence is accurate. I also know from experience that the result set from Blazegraph isn't guaranteed to be the same, but in this case it was over many iterations.

There are multiple possible reasons for this that I can think of off the top of my head, but I am not an expert on Blazegraph so can't confirm what is happening here.

Possibly, like some of the others, the query begins and traverses the graph in the exact same way each time (therefore finding results in the same order). Additionally, as the graph and query are both static and the query is run multiple times in quick succession (never done in production), it is possible that some optimisation causes the same traversal path.

Again, I am not an expert in any one of these triplestores in particular, so I don't know Blazegraph's inner workings in detail and am hypothesising.

If I get time in the coming weeks I’ll maybe retest and play around to see what’s happening!

Thanks again,

Angus

Beginning to Replicate Natural Conversation in Real Time

A first step into the literature

To start my new project, the first thing I of course have to do is run through the current research and state-of-the-art models.

I was interviewed recently, in which I explain this new project, but in short (extremely short): I aim to take a step towards making conversational agents more natural to talk with.

I have by no means exhausted all the literature in this field; I have barely scratched the surface (link relevant papers below if you know of any I must read). Here is an overview of some of this research and the journey towards more natural conversational agents. In this I will refer to the following papers:

[1]

Investigating Speech Features for Continuous Turn-Taking Prediction Using LSTMs by Matthew Roddy, Gabriel Skantze and Naomi Harte

[2]

Detection of social signals for recognizing engagement in human-robot interaction by Divesh Lala, Koji Inoue, Pierrick Milhorat and Tatsuya Kawahara

[3]

Investigating fluidity for human-robot interaction with real-time, real-world grounding strategies by Julian Hough and David Schlangen

[4]

Towards Deep End-of-Turn Prediction for Situated Spoken Dialogue Systems by Angelika Maier, Julian Hough and David Schlangen

[5]

Coordination in Spoken Human-Robot Interaction by Gabriel Skantze (Lecture Presentation in Glasgow 07/03/2019)

Contents

Introduction
Turn Taking – End of Turn Prediction
Engagement
Embodiment
Fluid Incremental Grounding Strategies
Conclusion

Introduction

If we think of two humans having a fluid conversation, it is very different from conversations between humans and Siri, Google Assistant, Alexa or Cortana.

source

One reason for this loss of flow is the number of large pauses. For a conversational agent (CA) to detect that you have finished what you are saying (finished your turn), it waits for a duration of silence. If it detects a long pause, it assumes you have finished your turn and then processes your utterance.

This set duration of silence varies slightly between systems. If it is set too low, the CA will interrupt you mid-turn as human dialogue is littered with pauses. If it is set too high, the system will be more accurate at detecting your end-of-turn but the CA will take a painfully long time to respond – killing the flow of the conversation and frustrating the user [4].
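
To make the trade-off concrete, here is a minimal sketch of the silence-threshold approach described above. The frame size, threshold value and the voice-activity check are illustrative assumptions rather than parameters taken from any of the cited systems.

```python
FRAME_MS = 10                  # audio processed in 10ms frames
SILENCE_THRESHOLD_MS = 700     # too low: interruptions; too high: sluggish replies

def detect_end_of_turn(frames, is_speech):
    """Return the index of the frame where a naive system declares end-of-turn.

    `frames` is any iterable of audio frames and `is_speech(frame)` is any
    voice-activity detector returning True while the user is speaking.
    """
    silence_ms = 0
    for i, frame in enumerate(frames):
        if is_speech(frame):
            silence_ms = 0                       # reset on any detected speech
        else:
            silence_ms += FRAME_MS
            if silence_ms >= SILENCE_THRESHOLD_MS:
                return i                         # assume the turn ended here
    return None                                  # no end-of-turn detected
```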

When two humans speak, we tend to minimise the gap between turns in the conversation and this is cross-cultural. Across the globe, the gap between turns is around 200ms, which is close to the limit of human response time [1]. We must therefore predict the speaker's end-of-turn (EOT) while listening to them speak.

Turn Taking – End of Turn Prediction

To recreate this fluid dialogue with fast turn-switches and slight overlap in CAs, we must first understand how we do it ourselves.

Shameless, but slightly related, self-promotion. In order to work on Computer Vision, we must first understand Human Vision

We subconsciously interpret turn-taking cues to detect when it is our turn to speak so what cues do we use? Similarly, we do this continuously while listening to someone speak so can we recreate this incremental processing?

[4] used both acoustic and linguistic features to train an LSTM to tag 10ms windows. Their system is tasked with labelling these windows as either speech, mid-turn pause (MTP) or EOT, but the main focus of course is the first point in a sequence that is labelled as EOT.

The acoustic features used in the LSTM were: raw pitch, smoothed F0, root mean squared signal energy, logarithmised signal energy, intensity, loudness and derived features for each frame.

In addition to these acoustic features, linguistic features consisted of the words and an approximation of the incremental syntactic probability called the weighted mean log trigram probability (WML).
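
As a rough illustration of this setup, the sketch below wires up a small LSTM tagger in PyTorch that labels each 10ms window as speech, MTP or EOT from concatenated acoustic and linguistic features. The layer sizes and feature dimensions are illustrative assumptions, not values from [4].

```python
import torch
import torch.nn as nn

class TurnTagger(nn.Module):
    """LSTM that tags every 10ms window as speech, mid-turn pause or EOT."""

    def __init__(self, n_acoustic=7, n_linguistic=2, hidden=64, n_labels=3):
        super().__init__()
        self.lstm = nn.LSTM(n_acoustic + n_linguistic, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_labels)      # speech / MTP / EOT

    def forward(self, features):                    # (batch, time, features)
        hidden_states, _ = self.lstm(features)
        return self.out(hidden_states)              # per-window label logits

# Example: one 500-window (5 second) sequence of random stand-in features.
model = TurnTagger()
logits = model(torch.randn(1, 500, 9))              # 7 acoustic + 2 linguistic
predicted = logits.argmax(dim=-1)                   # 0 = speech, 1 = MTP, 2 = EOT
```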

Many other signals have been identified to indicate whether the speaker is going to continue speaking or has finished their turn in [5]:

[5]

As mentioned, a wait time of 10 seconds for a response is just as irritating as the CA constantly cutting in mid-turn [4]. Multiple baseline thresholds were therefore considered, at regular intervals between 50ms and 6000ms, to ensure multiple trade-offs were included in the baselines.

Apart from one case (linguistic features only, 500ms silence threshold), every single model beat the baselines. Using only the linguistic or acoustic features didn't make much of a difference, but performance was always best when the model used both sets of features together. The best overall system had a latency of 1195ms and a cut-in rate of just 18%.

[4]

[1] states that we predict EOT from multi-modal signals including: prosody, semantics, syntax, gesture and eye-gaze.

Instead of labelling 10ms windows (as speech, MTPs or EOTs), traditional models predict whether a speaker will continue speaking (HOLD) or has finished their turn (SHIFT), but only do this when they detect a pause. One major problem with this traditional approach is that backchannels are neither a HOLD nor a SHIFT, but one of these is predicted anyway.

LSTMs have been used to make predictions continuously at 50ms intervals and these models outperform traditional EOT models, and even humans, when applied to HOLD/SHIFT predictions. Their hidden layers allow them to learn long-range dependencies, but it is unknown exactly which features influence performance the most.

In [1], the new system completes three different turn-taking prediction tasks: (1) prediction at pauses, (2) prediction at onset and (3) prediction at overlap.

Prediction at Pauses is the standard prediction that takes place at brief pauses in the interaction to predict whether there will be a HOLD or SHIFT. Essentially, when there is a pause above a threshold time, the person with the highest average output probability (score) is predicted to speak next. This classification model is evaluated with weighted F-scores.

Prediction at Onsets classifies utterances during speech, not at a pause. This model is slightly different however as it predicts whether the currently ongoing utterance will be short or long. Again, as a classifier, this model was evaluated using weighted F-scores.

Prediction at Overlap is introduced for the first time in this paper. This is essentially a HOLD/SHIFT prediction again, but made when an overlapping period of at least 100ms occurs. The decision to HOLD (continue speaking) is predicted when the overlap is a backchannel and SHIFT when the system should stop speaking. This again was evaluated using weighted F-scores.
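
A minimal sketch of the prediction-at-pauses rule described above: once a long enough pause is detected, the party with the higher average model output over a recent window is predicted to speak next. The window length here is an illustrative assumption.

```python
import numpy as np

def predict_next_speaker(probs_a, probs_b, window=60):
    """Given per-frame 'about to speak' probabilities for speakers A and B,
    return who is predicted to take the turn at a detected pause."""
    score_a = np.mean(probs_a[-window:])     # average score over recent frames
    score_b = np.mean(probs_b[-window:])
    return "A" if score_a >= score_b else "B"

# HOLD/SHIFT view: if A was speaking and the prediction is A again, that is a
# HOLD; if the prediction is B, that is a SHIFT.
```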

Here is an example of predicted turn-taking in action:


As mentioned above, we don’t know exactly which features we use to predict when it is our turn to speak. [1] used many features in different arrangements to distinguish which are most useful. The features used were as follows:

Acoustic features are low level descriptors that include loudness, shimmer, pitch, jitter, spectral flux and MFCCs. These were extracted using the OpenSmile toolkit.

Linguistic features were investigated at two levels: part-of-speech (POS) and word. Literature often suggests that POS tags are good at predicting turn-switches, but POS tags have to be extracted from words (from an ASR system), so it is useful to check whether this extra processing step is needed.

Using words instead of POS would be a great advantage for systems that need to run in real time.

Phonetic features were output from a deep neural network (DNN) that classifies senones.

Voice activity was included in their transcriptions so was also used as a feature.
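
For a sense of what frame-level acoustic features look like in practice, here is a rough stand-in using librosa (the paper extracted its features with the OpenSmile toolkit); it computes RMS energy as a loudness proxy, F0 and MFCCs on 10ms hops. The file name and parameter values are illustrative assumptions.

```python
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)      # hypothetical audio file
hop = 160                                            # 160 samples = 10ms at 16kHz

rms  = librosa.feature.rms(y=y, hop_length=hop)                   # loudness proxy
f0   = librosa.yin(y, fmin=60, fmax=400, sr=sr, hop_length=hop)   # pitch track
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)

# Stack everything into one (time, features) matrix for a frame-level model.
n = min(rms.shape[1], len(f0), mfcc.shape[1])
features = np.vstack([rms[:, :n], f0[None, :n], mfcc[:, :n]]).T
```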

So what features were the most useful for EOT prediction according to [1]?

Acoustic features were great for EOT prediction; all but one experiment's best results included acoustic features. This was particularly the case for prediction at overlap.

Words mostly outperformed POS tags, apart from prediction at onset, so use POS tags if you want to predict utterance length (like backchannels).

In all cases, including voice activity improved performance.

In terms of acoustic features, the most important features were loudness, F0, low order MFCCs and spectral slope features.

Overall, the best performance was obtained by using voice activity, acoustic features and words.

As mentioned, the fact that using words instead of POS tags leads to better performance is brilliant for faster processing. This of course is beneficial for real-time incremental prediction – just like what we humans do.

All of these features are not just used to detect when we can next speak but are even used to guide what we say. We will expand on what we are saying, skip details or change topic depending on how engaged the other person is with what we are saying.

Therefore to model natural human conversation, it is important for a CA to measure engagement.

Engagement

Engagement shows interest and attention to a conversation and, as we want users to stay engaged, influences the dialogue strategy of the CA. This optimisation of the user experience all has to be done in real time to keep a fluid conversation.

[2] detects the following signals to measure engagement: nodding, laughter, verbal backchannels and eye gaze. The fact that these signals show attention and interest is relatively common sense, but they were learned from a large corpus of human-robot interactions.

[2]

[2] doesn’t just focus on recognising social signals but also on creating an engagement recognition model.

This experiment was run in Japan where nodding is particularly common. Seven features were extracted to detect nodding: (per frame) the yaw, roll and pitch of the person's head; (per 15 frames) the average speed, average velocity, average acceleration and range of the person's head.
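
A rough sketch of that feature set is below: per-frame head rotation plus simple motion statistics over a 15-frame window. Exactly how [2] defines speed, velocity and acceleration is not given here, so the pitch-based definitions are illustrative assumptions.

```python
import numpy as np

def nod_features(yaw, roll, pitch, window=15):
    """yaw/roll/pitch: 1D arrays of head angles, one value per video frame."""
    feats = []
    for t in range(window, len(pitch)):
        p = pitch[t - window:t]
        velocity = np.diff(p)                        # signed frame-to-frame change
        acceleration = np.diff(velocity)
        feats.append([
            yaw[t], roll[t], pitch[t],               # per-frame orientation
            np.mean(np.abs(velocity)),               # average speed
            np.mean(velocity),                       # average velocity
            np.mean(acceleration),                   # average acceleration
            np.ptp(p),                               # range of head movement
        ])
    return np.array(feats)
```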

Their LSTM model outperformed the other approaches to detect nodding across all metrics.

Smiling is often used to detect engagement but, to avoid using a camera (they use microphones + Kinect), laughter is detected instead. Each model was tasked with classifying whether an inter-pausal unit (IPU) of sound contained laughter or not. Using both prosodic and linguistic features to train a two-layer DNN performed the best, but using other spectral features instead of linguistic features (not necessarily available from the ASR) could improve the model.

Similarly to nodding, verbal backchannels are more frequent in Japan (called aizuchi). Additionally in Japan, verbal backchannels are often accompanied by head movements but only the sound was provided to the model. Similar to the laughter detection, this model classifies whether an IPU is a backchannel or the person is starting their turn (especially difficult when barging in). The best performing model was found to be a random forest, with 56 estimators, using both prosody and linguistic features. The model still performed reasonably when given only prosodic features (again because linguistic features may not be available from the ASR).

Finally, eye gaze is commonly known as a clear sign of engagement. From the inter-annotator agreement, looking at Erica's head (the robot embodiment in this experiment) for 10 seconds continuously was considered as engagement. Gazes of less than 10 seconds were therefore negative cases.

Erica: source

The information from the Kinect sensor was used to calculate a vector from the user's head orientation, and the user was considered 'looking at Erica' if that vector collided with Erica's head (plus 30cm to accommodate error). This geometry-based model worked relatively well but the position of Erica's head was estimated, so this will have affected results. It is expected that this model will improve significantly when exact values are known.
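
The geometric check can be sketched as a simple point-to-ray distance test: cast a ray from the user's head along the head-orientation vector and test whether it passes close enough to Erica's estimated head position. The 30cm margin matches the description above; treating it as a single radius around the head centre is a simplifying assumption.

```python
import numpy as np

def looking_at_erica(head_pos, gaze_dir, erica_head_pos, radius=0.30):
    """All positions in metres; gaze_dir is the user's head-orientation vector."""
    gaze_dir = gaze_dir / np.linalg.norm(gaze_dir)
    to_erica = erica_head_pos - head_pos
    along = np.dot(to_erica, gaze_dir)               # distance along the gaze ray
    if along <= 0:
        return False                                 # Erica is behind the user
    closest = head_pos + along * gaze_dir            # closest point on the ray
    return np.linalg.norm(erica_head_pos - closest) <= radius
```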

This paper doesn’t aim to create the best individual systems but instead hypothesises that these models in conjunction will perform better than the individual models at detecting engagement.

[2]

The ensemble of the above models was used as a binary classifier (either a person was engaged or not). In particular, they built a hierarchical Bayesian binary classifier which judged whether the listener was engaged from the 16 possible combinations of outputs from the 4 models above.

From the annotators, a model was built to deduce which features are more or less important when detecting engagement. Some annotators found laughter to be a particularly important factor, for example, whereas others did not. They found that inputting a character variable with three different character types improved the model's performance.

Additionally, including the previous engagement of a listener also improved the model. This makes sense as someone that is not interested currently is more likely to stay uninterested during your next turn.
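
This is not the paper's hierarchical Bayesian model, but a deliberately simplified stand-in that shows the shape of the problem: four binary detector outputs (one of 16 combinations), optionally modulated by a character type and the previous engagement state. The weights are made up purely for illustration.

```python
def engaged(nod, laugh, backchannel, gaze, prev_engaged=False, character="neutral"):
    """All detector inputs are booleans; returns a binary engagement decision."""
    weights = {"nod": 0.3, "laugh": 0.2, "backchannel": 0.2, "gaze": 0.4}
    score = (weights["nod"] * nod + weights["laugh"] * laugh
             + weights["backchannel"] * backchannel + weights["gaze"] * gaze)
    if prev_engaged:                          # engagement tends to persist
        score += 0.2
    if character == "laughter-sensitive":     # annotators weighted cues differently
        score += 0.2 * laugh
    return score >= 0.5
```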

Measuring engagement can only really be done when a CA is embodied (eye contact with Siri is non-existent for example). Social robots are being increasingly used in areas such as Teaching, Public Spaces, Healthcare and Manufacturing. These can all contain spoken dialogue systems but why do they have to be embodied?

[5]

Embodiment

People will travel across the globe to have a face-to-face meeting when they could just phone [5]. We don’t like to interact without seeing the other person as we miss many of the signals that we talked about above. In today’s world we can also video-call but this is still avoided when possible for the same reasons. The difference between talking on the phone or face-to-face is similar to the difference between talking to Siri and an embodied dialogue system [5].

Current voice systems cannot show facial expressions, indicate attention through eye contact or move their lips. Lip reading is obviously very useful for those with impaired hearing but we all lip read during conversation (this is how we know what people are saying even in very noisy environments).

Not only can a face output these signals, it also allows the system to detect who is speaking, who is paying attention, who the actual people are (Rory, Jenny, etc…) and recognise their facial expressions.

Robot faces come in many forms however and some are better than others for use in conversation. Most robot faces, such as the face of Nao, are very static and therefore cannot show a wide range of emotion through expression like we do.

Nao: source

Some more abstract robot face depictions, such as Jibo, can show emotion using shapes and colour but some expressions must be learned.

Jibo: source

We know how to read a human face so it makes sense to show a human face. Hyper-realistic robot faces exist but are a bit creepy, like Sophia, and are very expensive.

Sophia: source

They are very realistic but just not quite right which makes conversation very uncomfortable. To combat this, avatars have been made to have conversations on screen.

source

These can mimic humans relatively closely without being creepy as it’s not a physical robot. This is almost like Skype however and this method suffers from the ‘Mona-Lisa effect’. In multi-party dialogue, it is impossible for the avatar on screen to look at one person and not the other. Either the avatar is looking ‘out’ at all parties or away at no one.

Gabriel Skantze (presenter of [5] to be clear) is the Co-Founder of Furhat Robotics and argues that Furhat is the best balance between all of these systems. Furhat has been developed to be used for conversational applications as a receptionist, social trainer, therapist, interviewer, etc…

source

Furhat needs to know where it should be looking, when it should speak, what it should say and what facial expressions it should be displaying [5].

Finally (for now), once embodied, dialogues with a robot need to be grounded in real time with the real world. In [3] the example given is a CA embodied in an industrial machine, which [5] states is becoming more and more common.

source

Fluid, Incremental Grounding Strategies

For a conversation to be natural, human-robot conversations must be grounded in a fluid manner [3].

With non-incremental grounding, users can give positive feedback and repair but only after the robot has shown full understanding of the request. If you ask a robot to move an object somewhere for example, you must wait until the object is moved before you can correct it with an utterance like “no, move the red one”. No overlapping speech is possible so actions must be reversed entirely if a repair is needed.

With incremental grounding, overlapping is still not possible but feedback can be given at more regular intervals. Instead of the entire task being completed before feedback can be given, feedback can be given at sub-task intervals. “no, move the red one” can be said just after the robot picks up a blue object, repairing quickly. In the previous example, the blue object was then placed in a given location before the repair could be given which resulted in a reversal of the whole task! This is much more efficient but still not fluid like in human-human interactions.

Fluid incremental grounding is possible if overlaps are processed. Allowing and reasoning over concurrent speech and action is much more natural. Continuing with our repair example, “no, move the red one” can be said as soon as the robot is about to pick up the blue object, no task has to be completed and reversed as concurrency is allowed. The pickup task can be aborted and the red object picked up fluidly as you say what to do with it.
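
A toy sketch of the difference: an action loop that listens for user utterances while a sub-action is still running and can abort it mid-way. The `robot` object and its `start_pick`, `pick_done`, `place` and `abort` methods are hypothetical names used only for illustration.

```python
import queue

def fluid_pick_and_place(robot, obj, location, repairs: queue.Queue):
    robot.start_pick(obj)                     # non-blocking: action runs concurrently
    while not robot.pick_done():
        try:
            utterance = repairs.get_nowait()  # the user may speak over the action
        except queue.Empty:
            continue
        if utterance.startswith("no"):        # e.g. "no, move the red one"
            robot.abort()                     # abort mid-action, nothing to reverse
            return "repaired"
    robot.place(obj, location)                # only reached if no repair arrived
    return "done"
```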

[2]

To move towards this more fluid grounding, real-time processing needs to take place. Not only does the system need to process utterances word by word, it also needs to monitor real-time context such as the robot's current state and planned actions (both of which can change dynamically through the course of an utterance or word).

The robot must know when it has sufficiently shown what it is doing in order to handle both repairs and confirmations. The robot needs to know what the user is confirming and, even more importantly, what needs to be repaired.

Conclusion

In this brief overview, I have covered just a tiny amount of the current work towards more natural conversational systems.

Even if turn-taking prediction, engagement measuring, embodiment and fluid grounding were all perfected, CAs would not have conversations like we humans do. I plan to write more of these overviews over the next few years so look out for them if interested.

In the meantime, please do comment with discussion points, critique my understanding and suggest papers that I (and anyone reading this) may find interesting.


Beginning to Replicate Natural Conversation in Real Time was originally published in Wallscope on Medium, where people are continuing the conversation by highlighting and responding to this story.


The Ancient Secrets of Computer Vision 2 by Joseph Redmon – Condensed


Human Vision - How have we evolved to see the world?

In order to understand Computer Vision, we must first understand how we have evolved to see the world. Not only is it important to investigate how we see but why our sight evolved this way.

What advantages should we ensure we build into our Computer Vision systems?

We use Computer Vision in some of our solutions at Wallscope, so it was important to start from the beginning and ensure I had a solid understanding.

In case you missed the introduction to this series, Joseph Redmon released a series of 20 lectures on Computer Vision in September (2018). As he is an expert in the field, I wrote a lot of notes while going through his lectures. I am tidying my notes for my own future reference but am posting them on Medium also in case these are useful for others.

I highly recommend watching this lecture on Joseph’s Youtube Channel here.

Contents

  • The Evolution of Eyes
  • How Do Our Eyes Work?
  • The Brain - Our Visual Processor
  • 3D Vision
  • Light
  • Recreating Colour on a Screen
  • Conclusion

The Evolution of Eyes

To begin with, we need to consider why we have eyes in the first place. The obvious answer is of course to see the world but, in order to fully understand this, we must start by investigating the most basic form our eyes took.

Eyespots

Simple eyes, named eyespots, are photosensitive proteins with no other surrounding structure. Snails for example have these at the tip or base of their tentacles.

source

Our vision evolved from eyespots and they can only really detect light and a very rough sense of direction. No nerves or brain processing is required as the output is so basic but snails, for example, can use these to detect and avoid bright light to ensure they don’t dry out in the sun.

Importantly, eyespots have extremely low acuity as light from any direction hits the same area of proteins.

Visual Acuity: The relative ability of the visual organ to resolve detail.

Pit Eyes

Slightly more complex are pit eyes: these are essentially eyespots in a shallow, cup-shaped pit. These have slightly more acuity as light from one direction is blocked by the edge of the pit, increasing directionality. If only one side of the cells are detecting light, then the source must be to one side.

from lecture slides

These eyes still have low acuity as they are relatively simple but are very common in animals. Most animal phyla (28 of 33) developed pit eyes independently. This is due to the fact that recessed sensors are a simple mutation and increased directionality is such a huge benefit.

Phylum (singular of phyla): In Biology, a level of taxonomic rank below Kingdom and above Class.

Complex Eyes

Many different complex eye structures now exist as different animals have evolved with various needs in diverse environments.

from lecture slides

Pinhole Eyes are a further development of the pit eye as the ‘pit’ has recessed much further, only allowing light to enter through a tiny hole (much like some cameras). This tiny hole lets light through which is then projected onto the back surface of the eye or camera. As you can see in the diagram above, the projected image is inverted but the brain handles this (post-processing) and the benefits are much more important. If the hole is small enough, the light hits a very small number of receptors and we can therefore detect exactly where the light is coming from.

from lecture slides
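
The inverted projection described above follows from similar triangles in the standard pinhole model: a point at (X, Y, Z) in front of the hole lands at (-fX/Z, -fY/Z) on a surface a distance f behind it. A quick sketch, with an arbitrary eye-sized focal distance:

```python
def pinhole_project(X, Y, Z, f=0.017):       # f of ~17mm, roughly eye-sized
    """Project a 3D point (metres, Z = distance in front of the hole)."""
    x = -f * X / Z                           # the negative signs invert the image
    y = -f * Y / Z
    return x, y

print(pinhole_project(0.5, 1.0, 2.0))        # (-0.00425, -0.0085)
```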

As mentioned, eyespots have basically no acuity and pit eyes have very low acuity as some light is blocked by the edges of the 'pit'. Complex eyes however have very high acuity, so what advantages evolved our eyes even further than pinhole eyes?

Humans have Refractive Cornea Eyes which are similar to pinhole eyes but curiously evolved to have a larger hole… To combat the loss of acuity that this difference causes, a cornea and lens are fitted within the opening.

source

The high acuity of the pinhole eye was a result of the fact that only a tiny amount of light could get through the hole and therefore only a few receptors in the retina get activated. As you can see in the diagram above, the lens also achieves this by focusing the incoming light to a single point on the retina.

The benefit of this structure is that high acuity is maintained, to allow accurate direction, but a lot more light is also allowed in. More light hitting the retina allows more information to be processed which is particularly useful in low-level light (hence why species tend to have at least a lens or cornea). Additionally, this structure gives us the ability to focus.

Focusing incoming light onto the retina is mainly done by the cornea but its focus is fixed. Re-focusing is possible thanks to our ability to alter the refractive index of each lens. Essentially, we can change the shape of the lens to refract light accurately from different sources onto single points on the retina.

source

This ability to change the shape of our lenses is how we can choose to focus on something close to us or in the distance. If you imagine sitting in a train and looking at some houses in the distance, you would not notice a hair on the window. Conversely, if you focused on the hair (by changing the refractive index of your lenses) the houses in the distance would be blurry.

Therefore, focusing on one depth of field sacrifices acuity in the other depths.
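
This trade-off can be made concrete with the thin-lens equation, 1/f = 1/d_object + 1/d_image: in the eye the image distance (lens to retina) is roughly fixed, so the focal length has to change with object distance. The 17mm image distance below is an approximate, illustrative value.

```python
def required_focal_length(d_object, d_image=0.017):
    """Focal length (metres) needed to focus an object d_object metres away."""
    return 1.0 / (1.0 / d_object + 1.0 / d_image)

print(required_focal_length(0.25))   # reading distance -> ~0.0159 m
print(required_focal_length(100.0))  # distant houses   -> ~0.0170 m
```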

As you may have noticed, complex eyes have all evolved with the same goal - better visual acuity. Only 6 of the 33 animal phyla have complex eyes but 96% of all known species have them so they are clearly very beneficial. This is of course because higher acuity increases the ability to perceive food, predators and mates.

How Do Our Eyes Work?

From the above we now know that light passes through our cornea, humours and lens, and is refracted to focus on our retina. We also know this has all evolved to increase acuity with lots of light for information, but what next?

source

Once light hits the retina, it is absorbed by photosensitive cells that emit neuro-transmitters through the optical nerve to be processed by our visual cortex.

Unlike cameras, our photosensitive cells (called rods and cones) are not evenly distributed or even the same as each other.

Rods and Cones

There are around 126 million photosensitive cells in the retina that are found in different areas and used for very different purposes.

Cones are predominantly found in the centre of the retina, called the fovea, and rods are mainly in the peripherals. There is one spot of the retina that contains neither as this is where the optic nerve connects to the retina - commonly known as the blind-spot.

source

Interestingly, octopuses have very similar eyes but do not have a blind-spot. This is because our optic nerve comes out of the retina into the eye and then back out, whereas the optic nerves in an octopus come out in the opposite direction. Light cannot pass through nerves, hence we have a blind-spot.

source

Rods, predominantly found in our peripherals as mentioned, make up the significant majority of our photosensitive cells as we have roughly 120 million of them in each eye!

We use rods predominantly in low light conditions and they do not see colour for this reason. They respond even if hit by only a single photon, so they are very sensitive, but they respond slowly. They take a relatively long time to absorb light before emitting a response to our brain, so rods work together: information from multiple rods is pooled into batches before being transmitted.

Rods are so adapted for low-light vision that they are unfortunately very poor in bright light because they saturate very quickly. This is why it takes so long for our eyes to adjust from bright to low light.

If you have ever gone stargazing for example and then glanced at your phone screen, you will notice that it takes 10 to 15 minutes for your ‘night vision’ to return. This is because the phone light saturates your rods and they have to go through the chemical process to desaturate the proteins for them to absorb light again.

Cones on the other hand are found in the fovea and are much rarer as each eye only contains around 6 million of them. This is a lot less than the number of rods but our cones are a lot more concentrated in the centre of our retina for the specific purpose of fine grained, detailed colour vision (most of our bright and colourful day to day lives).

Our cones can see quick movement and have a very fast response time (unlike rods), so they are brilliant in the quickly changing environments that we live in.

The Fovea is where all the cones are concentrated but is only 1.5mm wide and therefore very densely packed with up to 200,000 cones/mm².

This concentration of cones makes the fovea the area of the retina with the highest visual acuity which is why we move our eyes to read. To process text, the image must be sharp and therefore needs to be projected onto the fovea.

Our Peripheral Vision contains few cones, reducing acuity, but the majority of our rods. This is why we can see shapes moving in our peripherals but not much colour or detail. Try reading this with your peripheral vision, for example: it is blurry and clearly does not offer the same level of vision.

The advantage as mentioned above is ‘night vision’ and this is clear when stargazing as stars appear bright when looking at them in your peripheral vision, but dim when you look directly at one. Pilots are taught to not look directly at other planes for exactly this reason, they can see plane lights better in their peripherals.

There are other differences between peripheral and foveal vision. Look at this illusion and then stare at the cross in the centre:

source

If you look directly at the change in the purple dots, you can clearly see that the purple dots are simply disappearing for a brief moment in a circular motion.

If however you stare at the cross, it looks like all the purple dots disappear and a green dot is travelling in a circle… why?

When using your foveal vision, you are following the movement with your eyes. When fixating on the cross however, you are using your peripheral vision. The important difference is the fact that you are fixating!

The purple light is hitting the exact same points on your retina as you are not moving your eyes. Your rods in those points therefore adjust to the purple so you don’t see them (hence they appear to disappear) and the adjustment makes grey look green.

Our eyes adjusting and losing sensitivity over time when you look directly at something could cause major problems so how do we combat this?

Fixational Eye Movement

There are many ways that we compensate for this loss in sensitivity over time but they all essentially do the same thing - expose different parts of the retina to the light.

There are a couple of large shifts (large being used as a relative term here) and a much smaller movement.

Microsaccades (one of the large movements) are sporadic and random small versions of saccades.

Saccade: (French for jerk) a quick, simultaneous movement of both eyes between two or more phases of fixation in the same direction.

You don’t notice these happening but these tiny short movements expose new parts of the retina to the light.

Ocular Drift is a much slower movement than microsaccades, more of a roaming motion in conjunction with what you are fixating on. This is a random but constant movement.

source

This image illustrates the constant ocular drift combined with sporadic microsaccades.

Finally, Microtremors are tiny vibrations that are so small that light doesn’t always change which receptor it’s hitting, just the angle at which it hits it. Amazingly, these microtremors are synced between eyes to vibrate at the exact same speed.

These three fixational eye movements allow us to see very fine grained detail!

In fact, the resolution of our fovea is not as high as you might expect; microsaccades, ocular drift and microtremors help our brain build a more accurate mental model of what is happening in the world.

The Brain - Our Visual Processor

All the information we have discussed so far gets transmitted through our optical nerves but then what?

Our brain takes all of these signals and processes them to give us vision!

It is predominantly thought that our brains developed after our eyes. Jellyfish for example have very complex eyes that connect directly to their muscle tissue for quick reactions.

There is very little point in having a brain without sensory input so it is probable that we developed brains because we had eyes as this allows complex responses beyond just escape reactions.

Ganglia

There are roughly 1 million ganglia in each eye that transmit info to the brain. We know that there are way more rods than there are ganglia so compression must take place at this point and our photoreceptors must complete some pre-processing.

Retinal Ganglion Cell: A type of neuron that receives visual information from photoreceptors.

There are two types of ganglia: M-cells and P-cells.

M-Cells:
Magnocellular cells transmit information that help us perceive depth, movement, orientation and position of objects.

P-Cells:
Parvocellular cells transmit information that help us perceive colour, shape and very fine details.

These different types of ganglia are connected to different kinds of photoreceptors depending on what they’re responsible for but then all connect to the visual cortex.

Visual Cortex

The visual cortex contains at least 30 different substructures but we don’t know enough to build a coherent model. We do know however that the information from the ganglia is passed to the primary visual cortex followed by the secondary visual cortex.

V1 - Primary Visual Cortex:
This area of the visual cortex performs low level image processing (discussed in part 1) like edge detection for example.

source
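
As a minimal illustration of the kind of low-level processing attributed to V1, here is plain-NumPy Sobel edge detection; the random input exists only so the snippet runs standalone.

```python
import numpy as np

def sobel_edges(img):
    """Return the gradient magnitude of a 2D grayscale image."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            out[i, j] = np.hypot(np.sum(patch * kx), np.sum(patch * ky))
    return out

edges = sobel_edges(np.random.rand(64, 64))  # random stand-in for an image
```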

V2 - Secondary Visual Cortex
Following V1, this area of the visual cortex helps us recognise object sizes, colours and shapes. It is often argued that visual memory is stored in V2.

From V2, the signals are sent to V3, V4, V5 but also fed back to V1 for further processing.

source

It is theorised (and generally accepted) that the information passes through V1, through V2 and then split and streamed to both the ventral and dorsal systems for two very different purposes.

The Ventral Dorsal Hypothesis

Instead of listing the differences between the two systems, I have cut the slide from Joseph Redmon’s lecture:

from lecture slides

Ventral System
This is essentially our conscious, fine-grained, detailed sight that we use for recognition and identification. This system takes the high-detail foveal signals as we need to consciously see in the greatest detail possible. As we need such high detail (and most of this detail comes from the brain's visual processing), the processing speed is relatively slow when compared to the dorsal system.

Dorsal System
Why would we need unconscious vision? If someone threw a ball at you right now, you would move your head to dodge it very quickly, yet the ventral system has a slow processing speed. We can dodge something and then look for the thrown object afterwards because we do not know what was thrown! We therefore did not consciously see the object; we reacted quickly thanks to our very fast, unconscious vision from the dorsal system.

We also use this ‘unconscious vision’ while walking and texting. Your attention is on your phone screen yet you can avoid bins, etc… on the street.

We use both systems together to pick objects up, like a glass for example. The ventral system allows us to see and locate the glass, the dorsal then guides our motor system to pick the glass up.

This split is really seen when sections of the brain are damaged!

Dorsal Damage
If people damage their dorsal system, they can recognise objects without a problem but struggle to then pick objects up for example. They find it really difficult to use vision for physical tasks.

Ventral Damage
The majority of the information in the dorsal system isn’t consciously accessible so ventral damage renders a person blind. Interestingly however, even though they cannot consciously see or recognise objects, they can still do things like walk around obstacles.

This man walks around obstacles in a corridor even though he cannot see and later, when questioned, is not consciously aware of what was in his path:

Our brain and vision have co-evolved and are tightly knit. The visual cortex is the largest system in the brain, accounting for 30% of the cerebral cortex and two thirds of its electrical activity. This tightly knit, complex system is still not fully understood so it is highly researched and new discoveries are made all the time.

3D Vision

We have covered a lot of detail about each eye but we have two. Do we need two eyes to see in three dimensions?

Short answer: No.

There are in fact many elements that help our brain model in three dimensions with information from just a single eye!

One Eye

Focusing, for example, provides a lot of depth information, such as how much the lens has to change and how blurry parts of the image are.

Additionally, movement helps too: a nearby car moves across our field of vision much faster than a plane in the distance (even though the plane is travelling much faster). Finally, if you are moving (on a train for example), this parallax effect of different objects moving at different speeds still exists. We saw this being used to create 3D images in part 1.

All of this helps us judge depth using each eye individually! It is of course widely known however that our ability to see in three dimensions is greatly assisted by combining the information from both eyes.

Two Eyes

What we all mainly consider depth perception is called stereopsis. This uses the differences in the images from both eyes to judge depth. The closer something is to you, the bigger the difference in visual information from each eye. If you hold a finger up in front of you for example and change the distance from your eyes while closing each eye individually - you will see this in action.

If you move your finger really close to your face, you will go cross-eyed. The amount your eyes have to converge to see something also helps with depth perception.
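Computer vision borrows this idea directly: in a stereo setup, the difference between the two images (the disparity) tells you depth. A minimal sketch of that relationship, assuming a simplified pinhole model with made-up focal length and eye-separation values (not how the brain literally computes it):

# The closer an object, the larger the disparity between the two views.
# Depth is inversely proportional to disparity: Z = f * B / d.
def depth_from_disparity(focal_length_px, baseline_m, disparity_px):
    return focal_length_px * baseline_m / disparity_px

# Illustrative numbers only: eyes roughly 6.5cm apart, an arbitrary focal length.
print(depth_from_disparity(800, 0.065, 40))  # large disparity -> close (~1.3m)
print(depth_from_disparity(800, 0.065, 4))   # small disparity -> far (~13m)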

All of this information is great but our brain has to tie it all together, add in its own considerations and build this world model.

Brain

In a similar fashion to stereopsis and motion parallax, our brain perceives kinetic depth. Essentially your brain infers the 3D shape of moving objects. This video illustrates this amazingly:

Our brains can also detect occlusion, such as “I can only see half a person because they are behind a car”. We know the object that is obstructed is further away than the object that is obstructing. Additionally, our brain remembers the general size of things that we are familiar with, so we can judge whether a car is near or far based on how big it appears.

source

This is quite a famous illusion that plays with our brain's understanding of occlusion.

Finally, our brains also use light and shadows to build our 3D model of the world. This face is a good example of this:

source

We can judge the 3D shape of this person's nose and philtrum (between the nose and upper lip) solely based on the highlights and shadows created by the light.

Tying this all together, we are very skilled at perceiving depth.

source

As mentioned earlier, we don’t fully understand our visual processing; we only recently found out that our eyes reset orientation when we blink! (Our eyes rotate a little when watching a rotating object, and blinking resets this.)

We have such complex eyes that use a huge amount of our resources which is likely down to how beneficial vision is to us. Without sight, we would not exist as we do in the world and without light, we couldn’t have sight (as we know it).

Light

All light is electromagnetic radiation, made up of photons that behave like particles and waves…

Light Sources

The wavelength of ‘visible light’ (what our eyes perceive and therefore what we see) is around 400 to 700 nanometres. Thankfully, that is also the wavelength range of sunlight. Of course, these ranges match because we have evolved to see sunlight; it is not just chance.

We do not see X-rays because the sun doesn’t shoot X-rays at us; it sends ‘visible light’.

We see a combination of waves of different wavelengths and in the modern age (now that we have light bulbs and not just the sun), these are quite diverse.

source

As you can see, sunlight contains all wavelengths whereas bulbs have high amounts of more particular wavelengths.

We see objects as a colour based on which wavelengths are reflected off them. A red bottle absorbs most wavelengths but reflects red, hence we see it as red.

The colour of an object therefore depends on the light source. An object cannot reflect wavelengths that did not hit it in the first place, so its colour will appear different in the sun. Our brain judges the light source and compensates for this a little, which is what made this dress so famous!

source

Dive into that page (linked in the image source) and check out the scientific explanation discussing chromatic adaptation.

Colour differences are particularly strange when objects are under fluorescent light as it appears to us as white light. Sunlight appears white and contains all wavelengths whereas fluorescent light appears white but is missing many wavelengths which therefore cannot be reflected.

Colour Perception (Rods and Cones)

The photoreceptors in your eyes have different response curves, and cones have much more varied response curves than rods (hence why rods don't perceive colour well).

There are three types of cones, short, medium and long which correspond to short (blue), medium (green) and long (red) wavelengths.

source

Long cones respond mainly to wavelengths very close to green but extend to red; this is why we can see more shades of green than any other colour. We evolved this to spot hunting targets and dangers in forests and grasslands.

Our perception of colour comes from these cones. Each cone has an output that is roughly calculated by multiplying the input wave by the response curve and integrating to get the area under the resulting curve. The colour we see is then the relative activation of these three types of cones.
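As a toy numerical sketch of that "multiply and integrate" idea (the Gaussian response curves and the incoming spectrum below are invented for illustration, not measured cone data):

import numpy as np

wavelengths = np.arange(400, 701)  # visible range in nanometres

def response_curve(peak, width=40):
    # Illustrative bell-shaped sensitivity curve, not a real cone measurement
    return np.exp(-((wavelengths - peak) ** 2) / (2 * width ** 2))

cones = {"S": response_curve(440), "M": response_curve(530), "L": response_curve(560)}

# A made-up incoming light spectrum, strongest around 600nm (reddish)
spectrum = response_curve(600, width=60)

# Cone output ~ area under (input spectrum x response curve)
activation = {name: np.trapz(spectrum * curve, wavelengths) for name, curve in cones.items()}
print(activation)  # the perceived colour comes from the *relative* S, M and L activations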

source

We have many more red and green cones than blue (another reason why we see a lot more shades of green than any other colour) and this is why green also appears brighter than other colours. You can also see from the image above that there are very few blue cones in the fovea (the centre of the image).

This is important to bear in mind when designing user interfaces as it can sometimes have a significant effect. Reading green text on a black background, for example, is much easier than reading blue.

from lecture slides

Most humans have these three cones but there is a lot of variation in nature. Some animals have more and can therefore perceive even more colours than we can!

from lecture slides

Every additional cone allows the eye to perceive 100 times as many colours as before.

As mentioned, rods don’t really contribute to our perception of colour. They are fully saturated during the day so don’t contribute to our day vision at all. This does not mean they are useless by any means; they just serve very different purposes.

Colourblindness is generally a missing type of cone or a variant in cone wavelength sensitivity. For example, if the red and green cones are even more similar than usual, it becomes very difficult for the person to distinguish between red and green (a very common form of colourblindness).

Recreating Colour on a Screen

If printers and TVs had to duplicate the reflected wavelengths of a colour accurately, they would be extremely hard to make! They instead find metamers that are easier to produce.

Metamerism: A perceived matching of colours with different spectral power distributions. Colours that match this way are called metamers.

Finding easy-to-produce metamers allows us to recreate colours by selectively stimulating cones.

To show that metamers could be created, a group of subjects were gathered and given primary light controls. These controls consisted of three dials that modified the amount of red, green and blue light (RGB) and the subjects were given a target colour. The task was of course to see if the subjects could faithfully reconstruct the target colour by only controlling three primary colours. This was easy for many colours but a touch more complicated for others as negative red light had to be added to recreate some colours.

It was concluded that, given three primary light controls, people can match any colour and, additionally, people choose similar distributions to match the target colour. This means that light can be easily reproduced using combinations of individual wavelengths.
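In code, the same experiment boils down to solving a small linear system: if we know how strongly each primary stimulates the three cone types, the three dial settings fall out directly. The numbers below are invented purely for illustration:

import numpy as np

# Rows: S, M, L cone activation. Columns: red, green, blue primary (made-up values).
primaries = np.array([
    [0.02, 0.10, 0.90],
    [0.30, 0.80, 0.10],
    [0.70, 0.40, 0.05],
])

target = np.array([0.25, 0.60, 0.55])  # cone activations of the colour to match

dials = np.linalg.solve(primaries, target)
print(dials)  # R, G, B dial settings; a negative value means "negative light" is needed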

source

Using this information, a map of all humanly visible colours was then made. To represent colours on a screen however, your images need to be represented by a colour space (of which there are many). The most commonly used is sRGB, which was developed by HP and Microsoft in 1996, but wider colour spaces have been developed since then.

Adobe RGB was developed two years later and is used in tools such as Photoshop. ProPhoto RGB was created by Kodak and is the largest current colour space, which even extends beyond what our eyes can see, so why don’t we all use this?

source

If you want to store your image as a JPEG, view your image in a browser or print your image on a non-specialist printer, you will have to use sRGB. ProPhoto RGB is simply too specialised for day-to-day use, so standard equipment and workflow tools do not support it. Even Adobe RGB images viewed in a browser will often be converted to sRGB first, which is why sRGB is still the most used today.

Images are represented by pixels and colour is represented by RGB so there are colours that we can see that cannot be recreated on a screen.

Printers use more primaries but still some colours cannot be reproduced! Unless in an illusion:

source

Finally, people have mapped colour spaces into cubes:

source

and (more human-like as hue, value, saturation) cylinders:

source
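Moving between the cube and the cylinder is a standard conversion; Python's standard library even ships with it, as this quick sketch shows:

import colorsys

r, g, b = 0.2, 0.8, 0.4  # channels in the 0-1 range

h, s, v = colorsys.rgb_to_hsv(r, g, b)
print(h * 360, s, v)  # hue in degrees, saturation, value

print(colorsys.hsv_to_rgb(h, s, v))  # back to (0.2, 0.8, 0.4)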

Conclusion

Hopefully you are convinced that sight is incredible and Computer Vision is no straightforward challenge!

In the next post in this series, I will cover Joseph’s lecture on basic image manipulation.


The Ancient Secrets of Computer Vision 2 by Joseph Redmon - Condensed was originally published in Wallscope on Medium, where people are continuing the conversation by highlighting and responding to this story.

Seminar: Utilising Linked Data in the Public Sector

Title: Utilising Linked Data in the Public Sector

Speaker: Angus Addlesee, PhD Student, Heriot-Watt University

Date: 11:15 on 25 March 2019

Location: CM F.17, Heriot-Watt University

Abstract: In this presentation I will explain how Wallscope (a small tech company in Edinburgh) is using linked data in public sector projects.

Bio: Angus has worked at Wallscope for two years in various roles and is now studying his PhD at Heriot-Watt which is part funded by Wallscope.

Wallscope uses Machine Learning and Semantic Technologies to build Knowledge Graphs and Linked Data applications. We are motivated to lower the barriers for accessing knowledge to improve the health, wealth and sustainability of the world we share.

Linked Data Reconciliation in GraphDB

Using DBpedia to Enhance your Data in GraphDBFollowing my article on Transforming Tabular Data into Linked Data using OntoRefine in GraphDB, the founder of Ontotext (Atanas Kiryakov) suggested I write a further tutorial using GraphDB for data reconcili…

Using DBpedia to Enhance your Data in GraphDB

Following my article on Transforming Tabular Data into Linked Data using OntoRefine in GraphDB, the founder of Ontotext (Atanas Kiryakov) suggested I write a further tutorial using GraphDB for data reconciliation.

In this tutorial we will begin with a .csv of car manufacturers and enhance this with DBpedia. This .csv can be downloaded from here if you want to follow along.

Contents

Setting Up
Constructing the Graph
Reconciling your Data
Exploring the New Graph
Conclusion

Setting Up

First things first, we need to load our tabular data into OntoRefine in GraphDB. Head to the import tab, select “Tabular (OntoRefine)” and upload cars.csv if you are following along.

Click “Next” to start creating the project.

On this screen you need to untick “Parse next 1 line(s) as column headers” as this .csv does not have a header row. Rename the project in the top right corner and click “Create Project”.

You should now have this screen (above) showing one column of car manufacturer names. The column name has a space in it, which is annoying when running SPARQL queries across it, so let's rename it.

Click the little arrow next to “Column 1”, open “Edit Column” and then click “Rename this Column”. I called it “carNames” and will use this in the queries below, so remember to adjust them if you name it something different.

If you ever make a mistake, remember there is an undo/redo tab.

Constructing the Graph

In the top right of the interface there is an orange button titled “SPARQL”. Click this to open the SPARQL interface from which you can query your tabular data.

In the above screenshot I have run the query we want. I have pasted it here so you can see it all, and I go through it in detail below.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
CONSTRUCT {
?car rdfs:label ?taggedname ;
rdf:type dbo:Company ;
dbo:location ?location .

?location rdf:type dbo:Country ;
rdfs:label ?lname ;
dbp:populationCensus ?pop .
} WHERE {
?c <urn:col:carNames> ?cname .

BIND(STRLANG(?cname, "en") AS ?taggedname)

SERVICE <https://dbpedia.org/sparql> {

?car rdfs:label ?taggedname ;
dbo:location | dbo:locationCountry ?location .

?location rdf:type dbo:Country ;
rdfs:label ?lname ;
dbp:populationCensus | dbo:populationTotal ?pop .

FILTER (LANG(?lname) = "en")
}
}

If you are unfamiliar with SPARQL queries then I recommend reading one of my previous articles before reading on.

I start this query by defining my prefixes as usual. I want to construct a graph around these car manufacturers so I design that in my CONSTRUCT clause. I am building a fairly simple graph for this tutorial so let's just run through it very quickly.

I want to have entities representing car manufacturers that have a type, label and location. This location is the headquarters of the car manufacturer. In most cases, all entities should have both a type and a human-readable label so I have ensured this here.

Each location is also an entity with an attached type, label and population.

Unlike my superhero tutorial, the .csv only contains the car company names and not all the data we want in our graph. We therefore need to reconcile our data with information in an open linked dataset. In this tutorial we will use DBpedia, the linked data representation of Wikipedia.

To get the information needed to build the graph declared in our CONSTRUCT, we first grab all the names in our .csv and assign them to the variable ?cname. String literals must be language tagged to reconcile with the data in DBpedia, so I BIND the English language tag “en” to each string literal. That is what the lines below do:

If you didn’t name the column “carNames” above, you will have to modify the <urn:col:carNames> predicate here.
  ?c <urn:col:carNames> ?cname .

BIND(STRLANG(?cname, "en") AS ?taggedname)

Following this we use the SERVICE tag to send the query to DBpedia (this is called a federated query). We find every entity with the label matching our language tagged strings from the original .csv.

Once I have those entities, I need to find their locations. DBpedia is a very messy dataset so we have to use an alternative path in the query (represented by the “pipe” | symbol). This finds locations connected by any of the alternate paths given (in this case dbo:location and dbo:locationCountry) and assigns them to the variable ?location.

That explanation is referring to these lines:

    ?car rdfs:label ?taggedname ;
dbo:location | dbo:locationCountry ?location .

Next we want to retrieve the information about each country. The first triple pattern ensures the location entity has the type dbo:Country so that we don't find loads of irrelevant locations.

Following this we grab the label and again use alternate property paths to extract each country's population.

It is important to note that some countries have two different populations attached by these two predicates.

We finally FILTER the country labels to only return those that are in English as that is the language our original dataset is in. Data reconciliation can also be used to extend your data into other languages if it happens to fit a multilingual linked open dataset.

That covers the final few lines of our query:

    ?location rdf:type dbo:Country ;
rdfs:label ?lname ;
dbp:populationCensus | dbo:populationTotal ?pop .

FILTER (LANG(?lname) = "en")

Next we need to insert this graph we have constructed into a GraphDB repository.

Click “SPARQL endpoint” and copy your endpoint (yours will be different) to be used later.

Reconciling the Data

If you have not done already, create a repository and head to the SPARQL tab.

You can see in the top right of this screenshot that I’m using a repository called “cars”.

In this query panel you want to copy the CONSTRUCT query we built and modify it a little. The full query is here:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
INSERT {
?car rdfs:label ?taggedname ;
rdf:type dbo:Company ;
dbo:location ?location .

?location rdf:type dbo:Country ;
rdfs:label ?lname ;
dbp:populationCensus ?pop .
} WHERE { SERVICE <http://localhost:7200/rdf-bridge/yourID> {
?c <urn:col:carNames> ?cname .

BIND(STRLANG(?cname, "en") AS ?taggedname)

SERVICE <https://dbpedia.org/sparql> {

?car rdfs:label ?taggedname ;
dbo:location | dbo:locationCountry ?location .

?location rdf:type dbo:Country ;
rdfs:label ?lname ;
dbp:populationCensus | dbo:populationTotal ?pop .

FILTER (LANG(?lname) = "en")
}}
}

The first thing we do is replace CONSTRUCT with INSERT as we now want to ingest the returned graph into our repository.

The next and final thing we must do is nest the entire WHERE clause into a second SERVICE tag. This time however, the service endpoint is the endpoint you copied at the end of the construction section.

This constructs the graph and inserts it into your repository!
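If you would rather script this step than use the workbench, GraphDB also exposes an RDF4J-style REST interface, so a sketch like the following should work (the file name is hypothetical, and you would swap in your own repository name):

import requests

# Standard RDF4J update endpoint exposed by GraphDB; "cars" is the repository used here
endpoint = "http://localhost:7200/repositories/cars/statements"

update = open("insert_cars.rq").read()  # the INSERT query above, saved to a file

resp = requests.post(endpoint, data=update.encode("utf-8"),
                     headers={"Content-Type": "application/sparql-update"})
resp.raise_for_status()  # a 2xx response with no body means the update was applied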

It should be a much larger graph but the messiness of DBpedia strikes again! Many car manufacturers are connected to the string label of their location and not the entity. Therefore, the locations do not have a population and are consequently not returned.

We started with a small .csv of car manufacturer names so let's explore this graph we now have.

Exploring the New Graph

If we head to the “Explore” tab and view Japan for example, we can see our data.

Japan has the attached type dbo:Country, label, population and has seven car manufacturers.

There is no point in linking data if we cannot gain further insight so let's head to the “SPARQL” tab of the workbench.

In this screenshot we can see the results of the below query. This query returns each country alongside the number of people per car manufacturer in that country.

There is nothing new in this query if you have read my SPARQL introduction. I have used the MAX population as some countries have two attached populations due to DBpedia.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
SELECT ?name ((MAX(?pop) / COUNT(DISTINCT ?companies)) AS ?result)
WHERE {
?companies rdf:type dbo:Company ;
dbo:location ?location .

?location rdf:type dbo:Country ;
dbp:populationCensus ?pop ;
rdfs:label ?name .
}
GROUP BY ?name
ORDER BY DESC (?result)

In the screenshot above you can see that the results (ordered by result in descending order) are:

  • Indonesia
  • Pakistan
  • India
  • China

India of course has a much larger population than Indonesia but also has a lot more car manufacturers (as shown below).

If you were a car manufacturer in Asia, Indonesia might be a good market to target for export as it has a high population but very little local competition.
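If you want these numbers programmatically rather than in the workbench, a minimal sketch using SPARQLWrapper against the repository's query endpoint (assuming the default local port, the "cars" repository, and the SELECT query above saved to a hypothetical file):

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:7200/repositories/cars")
sparql.setQuery(open("people_per_manufacturer.rq").read())
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["name"]["value"], row["result"]["value"])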

Conclusion

We started with a small list of car manufacturer names but, by using GraphDB and DBpedia, we managed to extend this into a small graph that we could gain actual insight from.

Of course, this example is not entirely useful but perhaps you have a list of local areas or housing statistics that you want to reconcile with mapping or government linked open data. This can be done using the above approach to help you or your business gain further insight that you could not have otherwise identified.


Linked Data Reconciliation in GraphDB was originally published in Wallscope on Medium, where people are continuing the conversation by highlighting and responding to this story.

Hi Atanas,

Hi Atanas,Firstly, thank you for your detailed response!I will edit my article just now to include your explanation surrounding query 1. I plan to include INSERT and DELETE queries in my benchmark so any optimisations that reduce correctness will be a …

Hi Atanas,

Firstly, thank you for your detailed response!

I will edit my article just now to include your explanation surrounding query 1. I plan to include INSERT and DELETE queries in my benchmark so any optimisations that reduce correctness will be a problem.

To ensure this is noticed, result auditing will be very important! Thanks for mentioning this as I hadn’t given it much thought until now.

I have almost finished designing my knowledge graph and query templates. I have chosen land trading as the topic so I have people, land, trades with dates, etc… I will share this once I have fleshed it out more.

I would love to learn from your (and other LDBC members) experience so lets definitely continue that conversation!

Thanks again,

Angus

Hi Pavel,

Hi Pavel,Thanks for responding! I will edit to include this information now.Good point about truth maintenance!Thanks again,Angus

Hi Pavel,

Thanks for responding! I will edit to include this information now.

Good point about truth maintenance!

Thanks again,

Angus

Comparison of Linked Data Triplestores: Developing the Methodology

Inspecting Load and Query Times across DBPedia and Yago

Developers in small to medium scale companies are often asked to test software and decide what’s “best”. I have worked with RDF for a few years now and thought that comparing triplestores would be a relatively trivial task. I was wrong so here is what I have learned so far.

TL;DR – My original comparison had an imperfect methodology so I have developed this based on the community feedback. My queries now bias the results so I will next create data and query generators.

Contents

Introduction
Methodology –
What I am doing differently
Triplestores –
Which triplestores I tested.
Loading –
How fast does each triplestore load the data?
Queries –
Query Times (and how my queries bias these)
Next Steps –
Developing a realistic Benchmark
Conclusion
Appendix –
Versions, loading and query methods, etc…

Introduction

Over the past few months I have created a small RDF dataset and some SPARQL queries to introduce people to linked data. In December I tied these together to compare some of the existing triplestores (you can read that here). I was surprised by the amount of attention this article got and I received some really great feedback and advice from the community.

Based on this feedback, I realised that the dataset I created was simply too small to really compare these systems properly as time differences were often just a few milliseconds. Additionally, I did not run warm-up queries, which proved to affect results significantly in some cases.

Methodology

I have therefore developed my methodology and run a second comparison to see how these systems perform on a larger scale (not huge due to current hardware restrictions).

I have increased the number of triples to 245,197,165 which is significantly more than the 1,781,625 triples that the original comparison was run on.

I performed three warm-up runs, then ran ten hot runs and charted the average time of those ten.
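The timing loop itself is nothing special; a rough sketch of its shape, where run_query stands in for whichever CLI or HTTP call each triplestore uses:

import statistics
import time

def benchmark(run_query, warmups=3, hot_runs=10):
    for _ in range(warmups):
        run_query()  # results discarded, just warms the system up

    times = []
    for _ in range(hot_runs):
        start = time.perf_counter()
        run_query()
        times.append(time.perf_counter() - start)

    return statistics.mean(times)  # the charted value is the average of the hot runs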

The machine I used has 32GB of memory and 8 logical cores, and was running CentOS 7. I used each system one at a time so they did not interfere with each other.

I used the CLI to load and query the data in all systems so that there is no possibility that the UI affects the time.

I split the RDF into many gzipped files containing 100k triples each. This improves loading times as the process can be optimised across cores.

If you would like to recreate this experiment, you can find my queries, results and instructions on how to get the data here.

Triplestores

In this comparison I evaluated five triplestores. These were (in alphabetical order) AnzoGraph, Blazegraph, GraphDB, Stardog and Virtuoso.

I have listed the versions, query and load methods in the appendix of this article.

Loading

The first thing I did when evaluating each triplestore was of course load the data. Three distinct categories emerged: hours, 10’s of minutes and minutes.

In each case I loaded all the data with all of the gzipped .ttl files containing 100k triples each.

It is also important to note that loading time can be optimised in each case so these are not the fastest they can load, just the default. If you are deciding for a business, the vendors are more than happy to help you optimise for your data structure.

Blazegraph and GraphDB load this dataset in roughly 8 hours. Stardog and Virtuoso load this in the 30 to 45 minute range but AnzoGraph loads the exact same dataset in just 3 minutes!

Why these three buckets though? Blazegraph, GraphDB and Stardog are all Java-based, so how does Stardog load the data so much faster (with the default settings)? This is likely due to differences in garbage collection; Stardog probably manages this more by default than the other two.

Virtuoso is written in C, which doesn't have managed memory, so it is easier to make loading fast than in systems built in Java. AnzoGraph is developed in C/C++, so why is it so much faster still?

The first reason is that it is simply newer and therefore a little more up to date. The second and more important reason is that they optimise highly for very fast loading speed as they are an OLAP database.

Initial loading speed is sometimes extremely important and sometimes relatively insignificant depending on your use case.

If you are setting up a pipeline that requires one initial big loading job to spin up a live system, that one loading time is insignificant in the long run. Basically, a loading time of minutes or hours is of little relevance to kick off a system that will run for weeks or years.

However, if you want to perform deep analysis across all of your data quickly, this loading time becomes very important. Maybe you suspect a security flaw and need to scrutinise huge amounts of your data to find it… Alternatively, you may be running your analysis on AWS as you don’t have the in-house resources to perform such a large scale investigation. In both of these scenarios, time to load your data is crucial and speed saves you money.

Queries

In this section I will analyse the results of each query and discuss why the time differences exist. As I mentioned, this article is more about why there are differences and how to avoid the causes of these differences to create a fair benchmark in the future.

This is not a speed comparison but an analysis of problems to avoid when creating a benchmark (which I am working on).

I briefly go over each query but they can be found here.

Query 1:

This query is very simple but highlights a number of issues. It simply counts the number of triples in the graph.

SELECT (COUNT(*) AS ?triples)
WHERE {
?s ?p ?o .
}

To understand the problems, let’s first take a look at the results:

You can see that we again have significant differences in times (the red bar extends so far that the others were unreadable, so I have cut the vertical axis).

The first problem with this query is that it will never be run in production as it provides no valuable information. Linked data is useful to analyse relationships and grab information for interfaces, etc… not to count the number of triples.

GraphDB, likely for this reason, has not optimised for this query at all. An additional reason for this is that they have tried many optimisations to make counting fast; essentially counting based on (specific) indices, without iterating bindings/solutions. Many of those optimisations show great performance on specific queries, but are slow or return incorrect results on real queries.

AnzoGraph equally completes an actual ‘count’ of each triple every time this query is run but the difference is likely a Java vs C difference again (or they have optimised slightly for this query).

Virtuoso is interesting as it is built upon a relational database and therefore keeps a record of the number of triples in the database at all times. It can therefore translate this query to look up that record and not actually ‘count’ like the last two.

Stardog takes another approach which is to run an index to help them avoid counting at all.

Blazegraph perhaps takes this further, which raises another problem with this query (in fact, this is a problem with all of my queries): it possibly caches the result from the warm-up runs and displays that on request.

A major problem is that I run the EXACT same queries repeatedly. After the first run, the result can simply be cached and recalled. This mixed with the need for warm-up runs creates an unrealistic test.

In production, queries are usually similar but with different entities within. For example, if you click on a person in an interface to bring up a detailed page about them, the information needed is always the same. The query is therefore the same apart from the person entity (the person you click on).

To combat this, I will make sure to have at least one randomly generated seed in each of my query templates.
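For example, a query template with a random seed can be as simple as substituting a randomly chosen entity into an otherwise fixed query. The entities below are just placeholders to show the idea:

import random

TEMPLATE = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {{
  <{person}> a dbo:Person ;
             rdfs:label ?label .
}}
"""

# In the real benchmark these seeds would come from the data generator
people = [
    "http://dbpedia.org/resource/Ada_Lovelace",
    "http://dbpedia.org/resource/Alan_Turing",
    "http://dbpedia.org/resource/Grace_Hopper",
]

query = TEMPLATE.format(person=random.choice(people))
print(query)  # a different concrete query each run, so results cannot simply be cached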

Query 2:

This query, grabbed from this paper, returns a list of 1000 settlement names which have airports with identification numbers.

PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?v WHERE {
{ ?v2 a dbo:Settlement ;
rdfs:label ?v .
?v6 a dbo:Airport . }
{ ?v6 dbo:city ?v2 . }
UNION
{ ?v6 dbo:location ?v2 . }
{ ?v6 dbp:iata ?v5 . }
UNION
{ ?v6 dbo:iataLocationIdentifier ?v5 . }
OPTIONAL { ?v6 foaf:homepage ?v7 . }
OPTIONAL { ?v6 dbp:nativename ?v8 . }
} LIMIT 1000

This is a little more realistic when compared to query 1 but again has the problem that each run sends the exact same query.

In addition, a new issue becomes clear.

Once again, I have chopped the vertical axis so that the results can be shown clearly (and labelled at the base).

The interesting thing here is the fact that all of the triplestores return exactly the same 1,000 labels apart from one – AnzoGraph. This is almost certainly the cause of the time difference as it returns a different 1,000 settlements each time the query is run.

This is possibly by design so that limits do not skew analytical results. AnzoGraph is the only OLAP database in this comparison so they focus on deep analytics. They therefore would not want limits to return the same results every time, potentially missing something important.

Another important point regarding this query is that we have a LIMIT but no ORDER BY which is extremely unusual in real usage. You don’t tend to want 100 random movies, for example, but the 100 highest rated movies.

On testing this, adding an ORDER BY did increase the response times. This difference then extends into query 3…

Query 3:

This query nests query 2 to grab information about the 1,000 settlements returned above.

PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT * WHERE {
{?v2 a dbo:Settlement;
rdfs:label ?v.
?v6 a dbo:Airport.}
{?v6 dbo:city ?v2.}
UNION
{?v6 dbo:location ?v2.}
{?v6 dbp:iata ?v5.}
UNION
{?v6 dbo:iataLocationIdentifier ?v5.}
OPTIONAL {?v6 foaf:homepage ?v7.}
OPTIONAL {?v6 dbp:nativename ?v8.}
{
SELECT DISTINCT ?v WHERE {
{ ?v2 a dbo:Settlement ;
rdfs:label ?v .
?v6 a dbo:Airport . }
{ ?v6 dbo:city ?v2 . }
UNION
{ ?v6 dbo:location ?v2 . }
{ ?v6 dbp:iata ?v5 . }
UNION
{ ?v6 dbo:iataLocationIdentifier ?v5 . }
OPTIONAL { ?v6 foaf:homepage ?v7 . }
OPTIONAL { ?v6 dbp:nativename ?v8 . }
} LIMIT 1000
}}

As you can imagine, there is a very similar pattern between query 2 and query 3 results.

Remember that each run of this query asks for exactly the same information in each system except for AnzoGraph, which is different every time.

As with all of the other queries, returning the exact same results each run is problematic. Not only is it unrealistic, but it is impossible to make a distinction between fast querying and smart caching. It is not bad to cache; it is smart to do for fast response times. The problem is that this type of caching is unlikely to be needed in production.

A nice note to make is that, unlike the others, AnzoGraph is retrieving information about a different 1,000 settlements each run and only takes an additional 300ms to do this. Whether this is impressive or not cannot be known from this experiment.

If caching an answer is possible for some systems and not others, the results cannot be fairly compared. This is of course a problem when developing a benchmark.

Again however, randomly generated seeds would solve this.

Query 4:

To gauge the speed of each system’s mathematical functionality, I created a nonsensical query that uses many of these (now, sum, avg, ceil, rand, etc…).

The fact that this is nonsensical is not entirely a problem in this case. The fact that the query is exactly the same each run is however (as always).

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT (ROUND(?x/?y) AS ?result) WHERE {
{SELECT (CEIL(?a + ?b) AS ?x) WHERE {
{SELECT (AVG(?abslat) AS ?a) WHERE {
?s1 geo:lat ?lat .
BIND(ABS(?lat) AS ?abslat)
}}
{SELECT (SUM(?rv) AS ?b) WHERE {
?s2 dbo:volume ?volume .
BIND((RAND() * ?volume) AS ?rv)
}}
}}

{SELECT ((FLOOR(?c + ?d)) AS ?y) WHERE {
{SELECT ?c WHERE {
BIND(MINUTES(NOW()) AS ?c)
}}
{SELECT (AVG(?width) AS ?d) WHERE {
?s3 dbo:width ?width .
FILTER(?width > 50)
}}
}}
}

Essentially, this query is built from multiple nested selects that return and process numbers into a final result.

Once again, I have cut the vertical axis and labelled the bar for clarity.

This is a perfect example of query caching. I would be extremely surprised if AnzoGraph could actually run this query in 20ms. As mentioned above, caching is not cheating – just a problem when the exact same query is run repeatedly which is unrealistic.

It is also important to note that when I say caching, I do not necessarily mean result caching. Query structure can be cached for example to optimise any following queries. In fact, result caching could cause truth maintenance issues in a dynamic graph.

Blazegraph, Stardog and Virtuoso take a little longer but it is impossible to tell whether the impressive speed compared to GraphDB is due to calculation performance or some level of caching.

In conjunction with this, we can also not conclude that GraphDB is mathematically slow. It of course looks like that could be a clear conclusion but it is not.

Without knowing what causes the increased performance (likely because the query is exactly the same each run), we cannot conclude what can be deemed poor performance.

Once again (there’s a pattern here) randomly generated seeds within query templates would make this fair as result caching could not take place.

Query 5a (Regex):

This query, like query 4, is nonsensical but aims to evaluate string instead of math queries. It essentially grabs all labels containing the string ‘venus’, all comments containing ‘sleep’ and all abstracts containing ‘gluten’. It then constructs an entity and attaches all of these to it.

PREFIX ex: <http://wallscope.co.uk/resource/example/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT {
ex:notglutenfree rdfs:label ?label ;
rdfs:comment ?sab ;
dbo:abstract ?lab .
} WHERE {
{?s1 rdfs:label ?label .
FILTER (REGEX(lcase(?label), 'venus'))
} UNION
{?s2 rdfs:comment ?sab .
FILTER (REGEX(lcase(?sab), 'sleep'))
} UNION
{?s3 dbo:abstract ?lab .
FILTER (REGEX(lcase(?lab), 'gluten'))
}
}

Regex SPARQL queries are very uncommon as the majority of triplestores have a full text search implementation that is much faster!

If however, you wished to send the same string query to multiple triplestores (you want to use an OLTP and an OLAP database together for example) then you may want to use Regex so you don’t have to customise each query.

AnzoGraph is the only triplestore here that does not have a built in full text indexing tool. This can be added by integrating AnzoGraph with Anzo, a data management and analytics tool.

Blazegraph, GraphDB and Virtuoso therefore do not optimise for this type of query as it is so uncommonly used. AnzoGraph however does optimise for this as users may not want to integrate Anzo into their software.

Searching for all of these literals, constructing the graph and returning the result in half a second is incredibly fast. So fast that I believe we run into the caching problem again.

To reiterate, I am not saying caching is bad! It is just a problem to compare results because my queries are the same every run.

Comparing Regex results is unnecessary when there are better ways to write the exact same query. If you were using different triplestores in production, it would be best to add a query modifier to transform string queries into their corresponding full text search representation.

For this reason I will use full text search (where possible) in my benchmark.
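As a rough illustration, such a query modifier could be a simple rewrite of the regex filter pattern used in query 5a into Stardog's textMatch form (a toy regex-based rewrite, not a general SPARQL transformer; the other stores would need their own rewrites):

import re

# Matches the simple pattern used in query 5a: FILTER (REGEX(lcase(?var), 'term'))
FILTER_RE = re.compile(r"FILTER \(REGEX\(lcase\(\?(\w+)\), '([^']+)'\)\)")

def regex_to_stardog_fts(query):
    # Swap each regex filter for Stardog's full text search property
    return FILTER_RE.sub(r"?\1 <tag:stardog:api:property:textMatch> '\2'", query)

print(regex_to_stardog_fts("FILTER (REGEX(lcase(?label), 'venus'))"))
# -> ?label <tag:stardog:api:property:textMatch> 'venus'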

Query 5b (Full Text Index):

This query is exactly the same as above but uses each triplestore's full text index instead of Regex.

As these are all different, I have the Stardog implementation below (as they were the fastest in this case). The others can be found here.

PREFIX ex: <http://wallscope.co.uk/resource/example/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT {
ex:notglutenfree rdfs:label ?label ;
rdfs:comment ?sab ;
dbo:abstract ?lab .}
WHERE {
{?s1 rdfs:label ?label .
?label <tag:stardog:api:property:textMatch> 'venus'
} UNION {?s2 rdfs:comment ?sab .
?sab <tag:stardog:api:property:textMatch> 'sleep'
} UNION {?s3 dbo:abstract ?lab .
?lab <tag:stardog:api:property:textMatch> 'gluten'
}
}

I did not integrate AnzoGraph with Anzo so it is not included below.

All of these times are significantly faster than their corresponding times in query 5a. Even the slowest time here is less than half the fastest query 5a time!

This really highlights why I will not include regex queries (where possible) in my benchmark.

Once again, due to the fact that the query is exactly the same each run I cannot compare how well these systems would perform in production.

Query 6:

Queries 1, 4 and 5 (2 and 3 also to an extent) are not like real queries that would be used in a real pipeline. To add a couple more sensible queries, I grabbed the two queries listed here.

This query finds all soccer players born in a country with more than 10 million inhabitants, who played as goalkeeper for a club that has a stadium with more than 30,000 seats, and whose club's country is different from their birth country.

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
SELECT DISTINCT ?soccerplayer ?countryOfBirth ?team ?countryOfTeam ?stadiumcapacity
{
?soccerplayer a dbo:SoccerPlayer ;
dbo:position|dbp:position <http://dbpedia.org/resource/Goalkeeper_(association_football)> ;
dbo:birthPlace/dbo:country* ?countryOfBirth ;
dbo:team ?team .
?team dbo:capacity ?stadiumcapacity ; dbo:ground ?countryOfTeam .
?countryOfBirth a dbo:Country ; dbo:populationTotal ?population .
?countryOfTeam a dbo:Country .
FILTER (?countryOfTeam != ?countryOfBirth)
FILTER (?stadiumcapacity > 30000)
FILTER (?population > 10000000)
} order by ?soccerplayer

Of course even with a more realistic query, my main problem remains…

Is the difference in time between Virtuoso and AnzoGraph due to performance or the fact that the same query is run thirteen times? It’s impossible to tell but almost certainly the latter.

This is of course equally true for query 7.

One interesting point to think about is how these stores may perform in a clustered environment. As mentioned, AnzoGraph is the only OLAP database in this comparison so in theory should perform significantly better once clustered. This is of course important when analysing big data.

Another problem I have in this comparison is the scalability of the data. How these triplestores perform as they transition from a single node to a clustered environment is often important for large scale or high growth companies.

To tackle this, a data generator alongside my query generators will allow us to scale from 10 triples to billions.
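A sketch of the shape such a generator might take, writing line-based N-Triples so the output can be split and gzipped exactly as before (the land-trading vocabulary here is a simplified placeholder, not the final schema):

import random

def generate_triples(n_people):
    # Line-based N-Triples so the output size scales linearly and splits cleanly
    for i in range(n_people):
        person = f"<http://example.org/person/{i}>"
        land = f"<http://example.org/land/{random.randrange(n_people)}>"
        yield f'{person} <http://xmlns.com/foaf/0.1/name> "Person {i}" .'
        yield f"{person} <http://example.org/ontology/owns> {land} ."

with open("generated.nt", "w") as out:
    for triple in generate_triples(1000):  # scale this number up towards billions
        out.write(triple + "\n")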

Query 7:

This query (found here) finds all people born in Berlin before 1900.

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX : <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?name ?birth ?death ?person
WHERE {
?person dbo:birthPlace :Berlin .
?person dbo:birthDate ?birth .
?person foaf:name ?name .
?person dbo:deathDate ?death .
FILTER (?birth < "1900-01-01"^^xsd:date)
}
ORDER BY ?name

This is a simple extract and filter query that is extremely common.

With a simple query like this across 245 million triples, the maximum time difference is just over 100ms.

I learned a great deal from the feedback following my last comparison but this experiment has really opened my eyes to how difficult it is to find the “best” solution.

Next Steps

I learned recently that benchmarks require significantly more than three warm-up runs. In my benchmark I will run around 1,000.

Of course, this causes problems if my queries do not have random seeds so I think it is clear from this article that I will have at least one random seed in each query template.

Many queries will have multiple random seeds to ensure query caching isn’t storing optimisations that can slow down possible performance. For example, if one query gathers all football players in Peru and this is followed by a search for all la canne players in China – caching optimisation could slow down performance.

I really want to test the scalability of each solution so alongside my query generator I will create a data generator (this allows clustering evaluation).

Knowledge graphs are rarely static so in my benchmark I will have insert, delete and construct queries.

I will use full text search where possible instead of regex.

I will not use order-less limits as these are not used in production.

My queries will be realistic. If the data generated was real, they would return useful insight into the data. This ensures that I am not testing something that is not optimised for good reason.

I will work with vendors to fully optimise each system. Systems are optimised for different structures of data by default, which affects the results and therefore needs to change. Full optimisation, for the data and queries I create, by system experts ensures a fair comparison.

Conclusion

Fairly benchmarking RDF systems is more convoluted than it initially seems.

Following my next steps with a similar methodology, I believe a fair benchmark will be developed. The next challenge is evaluation metrics… I will turn to literature and use-case experience for this but suggestions would be very welcome!

AnzoGraph is the fastest if you sum the times (even if you switch regex for fti times where possible).

Stardog is the fastest if you sum all query times (including 5a and 5b) but ignore loading time.

Virtuoso is the fastest if you ignore loading time and switch regex for fti times where possible…

If this was a fair experiment, which of these results would be the “best”?

It of course depends on use case so I will have to come up with a few use cases to assess the results of my future benchmark for multiple purposes.

All feedback and suggestions are welcome, I’ll get to work on my generators.

Appendix

Below I have listed each triplestore (in alphabetical order) alongside which version, query method and load method I used:

AnzoGraph

Version: r201901292057.beta

Queried with:
azgi -silent -timer -csv -f /my/query.rq

Loaded with:
azgi -silent -f -timer /my/load.rq

Blazegraph

Version: 2.1.5

Queried with:
Rest API

Loaded with:
Using the dataloader Rest API by sending a dataloader.txt file.

GraphDB

Version: GraphDB-free 8.8.1

Queried with:
Rest API

Loaded with:
loadrdf -f -i repoName -m parallel /path/to/data/directory

It is important to note that with GraphDB I switched to a Parallel garbage collector while loading which will be default in the next release.

Stardog

Version: 5.3.5

Queried with:
stardog query myDB query.rq

Loaded with:
stardog-admin db create -n repoName /path/to/my/data/*.ttl.gz

Virtuoso

Version: VOS 7.2.4.2

Queried within isql-v:
SPARQL PREFIX … rest of query … ;

Loaded within isql-v:
ld_dir ('directory', '*.*', 'http://dbpedia.org') ;
then I ran a load script that ran three loaders in parallel.

It is important to note with Virtuoso that I used:
BufferSize = 1360000
DirtyBufferSize = 1000000

This was a recommended switch in the default virtuoso.ini file.


Comparison of Linked Data Triplestores: Developing the Methodology was originally published in Wallscope on Medium, where people are continuing the conversation by highlighting and responding to this story.

Inspecting Load and Query Times across DBPedia and Yago

Developers in small to medium scale companies are often asked to test software and decide what’s “best”. I have worked with RDF for a few years now and thought that comparing triplestores would be a relatively trivial task. I was wrong so here is what I have learned so far.

TL;DR - My original comparison had an imperfect methodology so I have developed this based on the community feedback. My queries now bias the results so I will next create data and query generators.

Contents

Introduction
Methodology -
What I am doing differently
Triplestores -
Which triplestores I tested.
Loading -
How fast does each triplestore load the data?
Queries -
Query Times (and how my queries bias these)
Next Steps -
Developing a realistic Benchmark
Conclusion
Appendix -
Versions, loading and query methods, etc…

Introduction

Over the past few months I have created a small RDF dataset and some SPARQL queries to introduce people to linked data. In December I tied these together to compare some of the existing triplestores (you can read that here). I was surprised by the amount of attention this article got and I received some really great feedback and advice from the community.

Based on this feedback, I realised that the dataset I created was simply too small to really compare these systems properly as time differences were often just a few milliseconds. Additionally, I did not run warm-up queries which proved to effect results significantly in some cases.

Methodology

I have therefore developed my methodology and run a second comparison to see how these systems perform on a larger scale (not huge due to current hardware restrictions).

I have increased the number of triples to 245,197,165 which is significantly more than the 1,781,625 triples that the original comparison was run on.

I performed three warm-up runs and then ran ten hot runs and chart the average time of those ten.

The machine I used has 32Gb Memory, 8 logical cores and was running Centos 7. I used each system one at a time so they did not interfere with each other.

I used the CLI to load and query the data in all systems so that there can be no possibility that the UI effects the time.

I split the RDF into many gzipped files containing 100k triples each. This improves loading times as the process can be optimised across cores.

If you would like to recreate this experiment, you can find my queries, results and instructions on how to get the data here.

Triplestores

In this comparison I evaluated five triplestores. These were (in alphabetical order) AnzoGraph, Blazegraph, GraphDB, Stardog and Virtuoso.

I have listed the versions, query and load methods in the appendix of this article.

Loading

The first thing I did when evaluating each triplestore was of course load the data. Three distinct categories emerged: hours, 10’s of minutes and minutes.

In each case I loaded all the data with all of the gzipped .ttl files containing 100k triples each.

It is also important to note that loading time can be optimised in each case so these are not the fastest they can load, just the default. If you are a deciding for a business, the vendors are more than happy to help you optimise for your data structure.

Blazegraph and GraphDB load this dataset in roughly 8 hours. Stardog and Virtuoso load this in the 30 to 45 minute range but AnzoGraph loads the exact same dataset in just 3 minutes!

Why these three buckets though? Blazegraph, GraphDB and Stardog are all Java based so how does Stardog load the data so much faster (with the default settings)? This is likely due to differences in garbage collection, Stardog probably manages this more by default than the other two.

Virtuoso is written in C which doesn’t manage memory and is therefore easier to load faster than systems built in Java. AnzoGraph is developed in C/C++ so why is it so much faster?

The first reason is that it is simply newer and therefore a little more up to date. The second and more important reason is that they optimise highly for very fast loading speed as they are an OLAP database.

Initial loading speed is sometimes extremely important and sometimes relatively insignificant depending on your use case.

If you are setting up a pipeline that requires one initial big loading job to spin up a live system, that one loading time is insignificant in the long run. Basically, a loading time of minutes or hours is of little relevance to kick off a system that will run for weeks or years.

However, if you want to perform deep analysis across all of your data quickly, this loading time becomes very important. Maybe you suspect a security flaw and need to scrutinise huge amounts of your data to find it… Alternatively, you may be running your analysis on AWS as you don’t have the in-house resources to perform such a large scale investigation. In both of these scenarios, time to load your data is crucial and speed saves you money.

Queries

In this section I will analyse the results of each query and discuss why the time differences exist. As I mentioned, this article is more about why there are differences and how to avoid the causes of these differences to create a fair benchmark in the future.

This is not a speed comparison but an analysis of problems to avoid when creating a benchmark (which I am working on).

I briefly go over each query but they can be found here.

Query 1:

This query is very simple but highlights a number of issues. It simply counts the number of triples in the graph.

SELECT (COUNT(*) AS ?triples)
WHERE {
?s ?p ?o .
}

To understand the problems, let’s first take a look at the results:

You can see that we again have significant differences in times (Red bar extends so far that the others were unreadable so cut vertical axis).

The first problem with this query is that it will never be run in production as it provides no valuable information. Linked data is useful to analyse relationships and grab information for interfaces, etc… not to count the number of triples.

GraphDB, likely for this reason, has not optimised for this query at all. An additional reason for this is that they have tried many optimisations to make counting fast; essentially counting based on (specific) indices, without iterating bindings/solutions. Many of those optimisations show great performance on specific queries, but are slow or return incorrect results on real queries.

AnzoGraph equally completes an actual ‘count’ of each triple every time this query is run but the difference is likely a Java vs C difference again (or they have optimised slightly for this query).

Virtuoso is interesting as it is built upon a relational database and therefore keeps a record of the number of triples in the database at all times. It can therefore translate this query to look up that record and not actually ‘count’ like the last two.

Stardog takes another approach which is to run an index to help them avoid counting at all.

Blazegraph perhaps take this further which raises another problem with this query (in fact this is a problem with all of my queries). They possibly cache the result from the warm-up runs and display that on request.

A major problem is that I run the EXACT same queries repeatedly. After the first run, the result can simply be cached and recalled. This mixed with the need for warm-up runs creates an unrealistic test.

In production, queries are usually similar but with different entities within. For example, if you click on a person in an interface to bring up a detailed page about them, the information needed is always the same. The query is therefore the same apart from the person entity (the person you click on).

To combat this, I will make sure to have at least one randomly generated seed in each of my query templates.

Query 2:

This query, grabbed from this paper, returns a list of 1000 settlement names which have airports with identification numbers.

PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?v WHERE {
{ ?v2 a dbo:Settlement ;
rdfs:label ?v .
?v6 a dbo:Airport . }
{ ?v6 dbo:city ?v2 . }
UNION
{ ?v6 dbo:location ?v2 . }
{ ?v6 dbp:iata ?v5 . }
UNION
{ ?v6 dbo:iataLocationIdentifier ?v5 . }
OPTIONAL { ?v6 foaf:homepage ?v7 . }
OPTIONAL { ?v6 dbp:nativename ?v8 . }
} LIMIT 1000

This is a little more realistic when compared to query 1 but again has the problem that each run sends the exact same query.

In addition, a new issue becomes clear.

Once again, I have chopped the vertical axis so that the results can be shown clearly (and labelled at the base).

The interesting thing here is the fact that all of the triplestores return exactly the same 1,000 labels apart from one - AnzoGraph. This is almost certainly the cause of the time difference as they return a different 1,000 people each time the query is run.

This is possibly by design so that limits do not skew analytical results. AnzoGraph is the only OLAP database in this comparison so they focus on deep analytics. They therefore would not want limits to return the same results every time, potentially missing something important.

Another important point regarding this query is that we have a LIMIT but no ORDER BY which is extremely unusual in real usage. You don’t tend to want 100 random movies, for example, but the 100 highest rated movies.

On testing this, adding an ORDER BY did increase the response times. This difference then extends into query 3…

Query 3:

This query nests query 2 to grab information about the 1,000 settlements returned above.

PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT * WHERE {
{?v2 a dbo:Settlement;
rdfs:label ?v.
?v6 a dbo:Airport.}
{?v6 dbo:city ?v2.}
UNION
{?v6 dbo:location ?v2.}
{?v6 dbp:iata ?v5.}
UNION
{?v6 dbo:iataLocationIdentifier ?v5.}
OPTIONAL {?v6 foaf:homepage ?v7.}
OPTIONAL {?v6 dbp:nativename ?v8.}
{In the contract it might be important to note that my legal first name is John
SELECT DISTINCT ?v WHERE {
{ ?v2 a dbo:Settlement ;
rdfs:label ?v .
?v6 a dbo:Airport . }
{ ?v6 dbo:city ?v2 . }
UNION
{ ?v6 dbo:location ?v2 . }
{ ?v6 dbp:iata ?v5 . }
UNION
{ ?v6 dbo:iataLocationIdentifier ?v5 . }
OPTIONAL { ?v6 foaf:homepage ?v7 . }
OPTIONAL { ?v6 dbp:nativename ?v8 . }
} LIMIT 1000
}}

As you can imagine, there is a very similar pattern between query 2 and query 3 results.

Remember that each run of this query asks for exactly the same information in each system except for AnzoGraph, which is different every time.

As with all of the other queries, returning the exact same results each run is problematic. Not only is it unrealistic but it is impossible to make a distinction between fast querying and smart caching. It is not bad to cache, it is smart to do for fast response times. The problem is the fact that this type of caching is unlikely to be needed in production.

It is worth noting that, unlike the others, AnzoGraph retrieves information about a different 1,000 settlements each run and only takes an additional 300ms to do so. Whether this is impressive or not cannot be known from this experiment.

If caching an answer is possible for some systems and not others, the results cannot be fairly compared. This is of course a problem when developing a benchmark.

Again however, randomly generated seeds would solve this.

Query 4:

To gauge the speed of each system's mathematical functionality, I created a nonsensical query that uses many of these functions (NOW, SUM, AVG, CEIL, RAND, etc.).

The fact that the query is nonsensical is not really a problem in this case. The fact that it is exactly the same each run, however, is (as always).

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT (ROUND(?x/?y) AS ?result) WHERE {
{SELECT (CEIL(?a + ?b) AS ?x) WHERE {
{SELECT (AVG(?abslat) AS ?a) WHERE {
?s1 geo:lat ?lat .
BIND(ABS(?lat) AS ?abslat)
}}
{SELECT (SUM(?rv) AS ?b) WHERE {
?s2 dbo:volume ?volume .
BIND((RAND() * ?volume) AS ?rv)
}}
}}

{SELECT ((FLOOR(?c + ?d)) AS ?y) WHERE {
{SELECT ?c WHERE {
BIND(MINUTES(NOW()) AS ?c)
}}
{SELECT (AVG(?width) AS ?d) WHERE {
?s3 dbo:width ?width .
FILTER(?width > 50)
}}
}}
}

Essentially, this query is built from multiple nested selects that return and process numbers into a final result.

Once again, I have cut the vertical axis and labelled the bar for clarity.

This is a perfect example of query caching. I would be extremely surprised if AnzoGraph could actually run this query in 20ms. As mentioned above, caching is not cheating; it is just a problem for this comparison when the exact same query is run repeatedly, which is unrealistic.

It is also important to note that when I say caching, I do not necessarily mean result caching. The query structure can be cached, for example, to optimise subsequent queries. In fact, result caching could cause truth maintenance issues in a dynamic graph.

Blazegraph, Stardog and Virtuoso take a little longer but it is impossible to tell whether the impressive speed compared to GraphDB is due to calculation performance or some level of caching.

In conjunction with this, we also cannot conclude that GraphDB is slow at mathematical operations. That may look like a clear conclusion, but it is not.

Without knowing what causes the faster times (most likely the fact that the query is exactly the same each run), we cannot say what should be deemed poor performance.

Once again (there is a pattern here), randomly generated seeds within query templates would make this fair, as result caching could not take place.
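
As an example of what a seed could look like in this query, the width threshold in the final subquery could be templated and filled with a random value on each run. The %%THRESHOLD%% placeholder below is hypothetical:

{SELECT (AVG(?width) AS ?d) WHERE {
?s3 dbo:width ?width .
# %%THRESHOLD%% is replaced with a random number before the query is sent
FILTER(?width > %%THRESHOLD%%)
}}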

Query 5a (Regex):

This query, like query 4, is nonsensical but aims to evaluate string operations instead of mathematical ones. It essentially grabs all labels containing the string 'venus', all comments containing 'sleep' and all abstracts containing 'gluten'. It then constructs an entity and attaches all of these to it.

PREFIX ex: <http://wallscope.co.uk/resource/example/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT {
ex:notglutenfree rdfs:label ?label ;
rdfs:comment ?sab ;
dbo:abstract ?lab .
} WHERE {
{?s1 rdfs:label ?label .
FILTER (REGEX(lcase(?label), 'venus'))
} UNION
{?s2 rdfs:comment ?sab .
FILTER (REGEX(lcase(?sab), 'sleep'))
} UNION
{?s3 dbo:abstract ?lab .
FILTER (REGEX(lcase(?lab), 'gluten'))
}
}

Regex SPARQL queries are very uncommon as the majority of triplestores have a full text search implementation that is much faster!

If, however, you wished to send the same string query to multiple triplestores (if you want to use an OLTP and an OLAP database together, for example), then you may want to use regex so that you don't have to customise each query.

AnzoGraph is the only triplestore here that does not have a built in full text indexing tool. This can be added by integrating AnzoGraph with Anzo, a data management and analytics tool.

Blazegraph, GraphDB and Virtuoso therefore do not optimise for this type of query as it is so rarely used. AnzoGraph, however, does optimise for this as users may not want to integrate Anzo into their software.

Searching for all of these literals, constructing the graph and returning the result in half a second is incredibly fast. So fast that I believe we run into the caching problem again.

To reiterate, I am not saying caching is bad! It is just a problem to compare results because my queries are the same every run.

Comparing Regex results is unnecessary when there are better ways to write the exact same query. If you were using different triplestores in production, it would be best to add a query modifier to transform string queries into their corresponding full text search representation.
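
As an illustration, the label branch of query 5a might be rewritten for Virtuoso using its bif:contains free text extension. This is only a sketch (the exact syntax and quoting depend on the version and index configuration); the Stardog form is shown in the next section:

{?s1 rdfs:label ?label .
?label bif:contains "venus"
}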

For this reason I will use full text search (where possible) in my benchmark.

Query 5b (Full Text Index):

This query is exactly the same as above but uses each triplestore's full text index instead of Regex.

As these are all different, I have included the Stardog implementation below (as it was the fastest in this case). The others can be found here.

PREFIX ex: <http://wallscope.co.uk/resource/example/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT {
ex:notglutenfree rdfs:label ?label ;
rdfs:comment ?sab ;
dbo:abstract ?lab .}
WHERE {
{?s1 rdfs:label ?label .
?label <tag:stardog:api:property:textMatch> 'venus'
} UNION {?s2 rdfs:comment ?sab .
?sab <tag:stardog:api:property:textMatch> 'sleep'
} UNION {?s3 dbo:abstract ?lab .
?lab <tag:stardog:api:property:textMatch> 'gluten'
}
}

I did not integrate AnzoGraph with Anzo, so it is not included below.

All of these times are significantly faster than their corresponding times in query 5a. Even the slowest time here is less than half the fastest query 5a time!

This really highlights why I will not include regex queries (where possible) in my benchmark.

Once again, due to the fact that the query is exactly the same each run I cannot compare how well these systems would perform in production.

Query 6:

Queries 1, 4 and 5 (2 and 3 also to an extent) are not like real queries that would be used in a real pipeline. To add a couple more sensible queries, I grabbed the two queries listed here.

This query finds all soccer players who were born in a country with more than 10 million inhabitants and who played as goalkeeper for a club that has a stadium with more than 30,000 seats, where the club's country is different from the birth country.

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
SELECT DISTINCT ?soccerplayer ?countryOfBirth ?team ?countryOfTeam ?stadiumcapacity
{
?soccerplayer a dbo:SoccerPlayer ;
dbo:position|dbp:position <http://dbpedia.org/resource/Goalkeeper_(association_football)> ;
dbo:birthPlace/dbo:country* ?countryOfBirth ;
dbo:team ?team .
?team dbo:capacity ?stadiumcapacity ; dbo:ground ?countryOfTeam .
?countryOfBirth a dbo:Country ; dbo:populationTotal ?population .
?countryOfTeam a dbo:Country .
FILTER (?countryOfTeam != ?countryOfBirth)
FILTER (?stadiumcapacity > 30000)
FILTER (?population > 10000000)
} order by ?soccerplayer

Of course even with a more realistic query, my main problem remains…

Is the difference in time between Virtuoso and AnzoGraph due to performance or the fact that the same query is run thirteen times? It’s impossible to tell but almost certainly the latter.

This is of course equally true for query 7.

One interesting point to think about is how these stores may perform in a clustered environment. As mentioned, AnzoGraph is the only OLAP database in this comparison so in theory should perform significantly better once clustered. This is of course important when analysing big data.

Another problem I have in this comparison is the scalability of the data. How these triplestores perform as they transition from a single node to a clustered environment is often important for large scale or high growth companies.

To tackle this, a data generator alongside my query generators will allow us to scale from 10 triples to billions.

Query 7:

This query (found here) finds all people born in Berlin before 1900.

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX : <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?name ?birth ?death ?person
WHERE {
?person dbo:birthPlace :Berlin .
?person dbo:birthDate ?birth .
?person foaf:name ?name .
?person dbo:deathDate ?death .
FILTER (?birth < "1900-01-01"^^xsd:date)
}
ORDER BY ?name

This is a simple extract and filter query that is extremely common.

With a simple query like this across 245 million triples, the maximum time difference is just over 100ms.

I learned a great deal from the feedback following my last comparison but this experiment has really opened my eyes to how difficult it is to find the “best” solution.

Next Steps

I learned recently that benchmarks require significantly more than three warm-up runs. In my benchmark I will run around 1,000.

Of course, this causes problems if my queries do not have random seeds so I think it is clear from this article that I will have at least one random seed in each query template.

Many queries will have multiple random seeds to ensure query caching isn't storing optimisations that then slow down later queries. For example, if one query gathers all football players in Peru and this is followed by a search for all la canne players in China, cached optimisations from the first query could actually slow down the second.
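
A template with two seeds could look something like the sketch below, where both placeholders are hypothetical and would be filled with random values on each run:

PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?athlete WHERE {
# e.g. %%ATHLETE_TYPE%% -> dbo:SoccerPlayer, %%COUNTRY%% -> dbr:Peru
?athlete a %%ATHLETE_TYPE%% ;
dbo:birthPlace %%COUNTRY%% .
}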

I really want to test the scalability of each solution so alongside my query generator I will create a data generator (this allows clustering evaluation).

Knowledge graphs are rarely static so in my benchmark I will have insert, delete and construct queries.
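
As a rough sketch of the kind of update operations I mean (the resource and label here are purely illustrative, and query 5a above already shows a construct):

PREFIX ex: <http://wallscope.co.uk/resource/example/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# insert an illustrative resource...
INSERT DATA { ex:generatedSettlement rdfs:label "Generated Settlement"@en . } ;
# ...then delete it again
DELETE WHERE { ex:generatedSettlement ?p ?o . }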

I will use full text search where possible instead of regex.

I will not use order-less limits as these are not used in production.

My queries will be realistic. If the data generated was real, they would return useful insight into the data. This ensures that I am not testing something that is not optimised for good reason.

I will work with vendors to fully optimise each system. Systems are optimised for different structures of data by default, which affects the results and therefore needs to change. Full optimisation, for the data and queries I create, by system experts ensures a fair comparison.

Conclusion

Fairly benchmarking RDF systems is more convoluted than it initially seems.

Following my next steps with a similar methodology, I believe a fair benchmark will be developed. The next challenge is evaluation metrics… I will turn to literature and use-case experience for this but suggestions would be very welcome!

AnzoGraph is the fastest if you sum all of the times (even if you swap the regex times for full text index times where possible).

Stardog is the fastest if you sum all query times (including 5a and 5b) but ignore loading time.

Virtuoso is the fastest if you ignore loading time and swap the regex times for full text index times where possible…

If this was a fair experiment, which of these results would be the “best”?

It of course depends on use case so I will have to come up with a few use cases to assess the results of my future benchmark for multiple purposes.

All feedback and suggestions are welcome. In the meantime, I'll get to work on my generators.

Appendix

Below I have listed each triplestore (in alphabetical order) alongside which version, query method and load method I used:

AnzoGraph

Version: r201901292057.beta

Queried with:
azgi -silent -timer -csv -f /my/query.rq

Loaded with:
azgi -silent -f -timer /my/load.rq

Blazegraph

Version: 2.1.5

Queried with:
REST API

Loaded with:
Using the dataloader REST API by sending a dataloader.txt file.

GraphDB

Version: GraphDB-free 8.8.1

Queried with:
REST API

Loaded with:
loadrdf -f -i repoName -m parallel /path/to/data/directory

It is important to note that with GraphDB I switched to a parallel garbage collector while loading, which will be the default in the next release.

Stardog

Version: 5.3.5

Queried with:
stardog query myDB query.rq

Loaded with:
stardog-admin db create -n repoName /path/to/my/data/*.ttl.gz

Virtuoso

Version: VOS 7.2.4.2

Queried within isql-v:
SPARQL PREFIX ... rest of query ... ;

Loaded within isql-v:
ld_dir ('directory', '*.*', 'http://dbpedia.org') ;
then I ran a load script that runs three loaders in parallel.

It is important to note with Virtuoso that I used:
BufferSize = 1360000
DirtyBufferSize = 1000000

This was a recommended switch in the default virtuoso.ini file.

