The ACM Web Conference 2022 summary

The ACM Web Conference (formerly WWW) is the oldest and most prestigious conference focusing on World Wide Web research and technologies. The conference series began in 1994 at CERN, where the Web was born, and has run annually ever since. This year, conference management was transferred from the International World Wide Web Conferences Committee (IW3C2) to ACM. I was able to attend because my RQSS paper was accepted in the PhD Symposium track (after the complex problems that occurred during registration :)). It was a real pleasure. The conference was held virtually from Lyon, France. In this post, I report on the sessions I attended and the points I found most interesting.

Day 1 – Mon 25 April – Graph Learning Workshop and KGTK Tutorial

Monday, April 25 was the first day of the conference. Because I was in Iran during the conference, I was dealing with a 2.5-hour time zone difference, which made it a bit difficult to schedule between personal matters and the conference programme. On the first day (as on other days) there were several interesting tracks that overlapped, so I had to choose between them. The Web for All (W4A) track was one I really wanted to participate in but couldn't. However, because my research currently focuses on the Semantic Web and knowledge graphs, I tried to attend the more semantic-webby sessions.

For the first day, I chose the Graph Learning workshop. The workshop began with a keynote by Wenjie Zhang from the University of New South Wales on the applications and challenges of structure-based large-scale dynamic heterogeneous graph processing. They mentioned improved efficiency in graph learning and having a development model for graph learning as advantages of this approach, and named "detecting money laundering" and "efficient suggestion in social groups" as its applications.

In Technical Session 1, I watched an interesting classic deep learning approach to graph learning called CCGG. A strong team from top-ranked universities in Iran has developed an autoregressive approach for class-conditional graph generation. The interesting thing for me was that the team was led by Dr Soleymani Baghshah, who had previously been my Algorithm Design lecturer at the Shahid Bahonar University of Kerman. Another interesting point was that one of their evaluation datasets consists of protein graphs (each protein represented as a graph of amino acids). I hoped one of the authors would be there live to answer questions, but none was, and the presentation was pre-recorded.

I also watched another interesting work on Heterogeneous Information Networks (HINs), called SchemaWalk. The main point of that research is a schema-aware random walk method, with which the desired results can be obtained without worrying about the structure of the heterogeneous graph. The first keynote and this presentation on HINs caught my attention, and I will certainly read more about heterogeneous graphs in the future.

The next interesting work was on multi-graph based multi-scenario video recommendation. I didn't get much out of the details of the presentation, but trying to learn from multiple graphs at the same time seemed like an interesting task. Keynote 3 of the workshop was about Graph Neural Networks. Michael Bronstein from the University of Oxford gave a nice presentation on how they replaced the common discrete node-and-edge perspective in graph learning with a continuous perspective derived from physics and mathematics. I found Michael's blog post about the study here. Overall, the Graph Learning workshop was very dense and quite useful. The topics are very close to my future research (a possible postdoc), so I'll return and review many of these studies soon.

During the day, I visited the KGTK tutorial two or three times. I like their team; they are very active and have been involved in almost every major web and linked-data event in the last couple of years. KGTK tutorials are especially important to me as we are currently preparing to perform several performance and reliability tests on this tool, and I am in charge of performing the technical test steps, so the more I know about the tool the better I can run the experiments. In this tutorial, a good initial presentation by Hans Chalupsky gave an overview of the tool. Like other KGTK tutorials, this session was followed by demonstrations of the Kypher language's capabilities and the tool's Google Colab notebooks. The full material of the tutorial can be found in this repository.

The hard part of the first day for me was that my presentation was on the second day, and I had the stress and anxiety of that throughout the day. A considerable part of my focus was consumed by my slides, my notes and my presentation style, which I kept tweaking.

Day 2 – Tue 26 April – PhD Symposium

On the second day, I was supposed to present my work at the PhD Symposium. I had planned to attend the Beyond Facts workshop at the beginning of the day, but I needed to tweak my slides and manage my presentation timing because I was exceeding 15 minutes in the first run! Therefore, I dedicated the second day only to the PhD Symposium, and I was completely happy with my choice, as 7 of the 10 accepted papers were related to the Semantic Web.

The PhD Symposium consisted of three main sessions: Knowledge Graphs, SPARQL, and Representation and Interaction. My paper was in the first session, in which about 22 attendees were present. Before my own talk, I enjoyed the "Personal Knowledge Graphs" paper. It was an interesting idea: personal knowledge graphs are small graphs of information formed around an individual person. The authors used these graphs to assist e-learning systems and were able to make a positive impact on personal recommendations.

Then it was my turn to speak. My presentation went very well. As I had practised a lot and edited the slides many times, I had good command of the talk and finished in exactly 15 minutes. There were 22 listeners when I started. Mine was the last presentation of the session, so the Q&A part was more flexible and took more than three minutes. The chair (Elena) asked why the framework has such a high number of metrics. I answered that during the definition phase we tried to be as comprehensive as possible; after implementing the framework and reviewing the results, we may remove some of the metrics, namely those that do not provide useful information. Elena also asked whether we ever used human participation, questionnaires, or any human-opinion data to evaluate quality. I answered that our focus is on objective metrics right now and we do not need to collect human opinions for them (which facilitates the automation of the framework), but we will probably do that in the next steps.

Aleksandr, the author of the first paper of the session, also introduced a tool called Wikidata Complete, a kind of relation extraction and linking tool for Wikidata that can also suggest references. I need to look at it and see whether there are any evaluations of this tool. After finishing my presentation and answering the questions, my duty at the conference was done, so I was happy that I could continue the conference with ease of mind.

Following the PhD Symposium, in the SPARQL session, Angela gave a keynote on query-driven graph processing. The presentation gave a complete overview of graph data models, property graphs, RDF graphs and the different types of graph patterns used in queries, and introduced the concepts of robotic queries and organic queries. Organic queries are those that humans run against KG endpoints; robotic queries are those executed by automated tools such as scripts, bots, and algorithms. Angela argued that robotic queries have a specific pattern; for example, they are on average shorter and use fewer variables. At the end of the presentation, there was an interesting discussion about query streaming.

In the same SPARQL session, we had another presentation on predicting SPARQL query dynamics. The authors were working to predict possible changes in the output of a query, and even to estimate the dynamics of query results based on the variance of changes at the data level. They evaluated their approach on Wikidata and DBpedia.

Day 3 – Wed 27 April – Semantics

I spent the third day attending the Semantics track. The track had two sessions: Semantics: Leveraging KGs, and Taxonomy Learning. The first session had problems at the beginning: the first presentation, a pre-recorded video, had playback problems and took about ten minutes longer than planned. The second presentation was about learning and did not have much graph-related content, and its timing was also a bit messy. The other presentations were also very much machine-learning oriented. The last presentation, "Path Language Modeling over Knowledge Graphs", used reasoning over KGs to provide explainable recommendations, specific to each user. In their experiments, the authors used 3 categories of Amazon consumer transaction datasets and compared their method with 6 baseline approaches, such as KGAT and CAFE; in all categories and for all metrics (Precision, Recall, NDCG, and HR) they obtained the best results.

In the second session, Taxonomy Learning, there were two interesting presentations on taxonomy completion, which made me search a bit about the task. Taxonomy completion is, roughly, enriching existing taxonomies (hierarchies of things) with rapidly emerging new concepts. In the first presentation, a new framework called the Quadruple Evaluation Network (QEN) was introduced; it avoids relying on word vectors, which are hard to obtain for new concepts, and instead benefits from a pretrained language model, which improves performance and reduces online computational cost. In the second presentation, the authors created a self-supervised taxonomy completion approach using entity extraction tools over raw texts; for the evaluation, they used the Microsoft Academic Graph and WordNet. The notions of the session were a bit outside my knowledge and my current research focus, but the problems and the solutions were quite interesting, so they were added to my to-read list! Between the two presentations about taxonomy completion, there was a nice talk on explainable machine learning on knowledge graphs in the presentation of the "EvoLearner" paper.

Taxonomy Completion. Taken from Suyuchen Wang's slides, The ACM Web Conference.

The afternoon featured the opening ceremony of the Web Conference, with the introduction of the organizers, the nominees for the best paper awards, and the official transfer of the conference from IW3C2 to ACM. What stood out was how well the ceremony worked in hybrid form; its discipline and real-time direction were remarkable.

Raphaël Troncy and Frédérique Laforest at the opening ceremony of The ACM Web Conference 2022.

Our keynote speaker on Wednesday was Prabhakar Raghavan from Google, and the title of his presentation was "Search Engines: From the Lab to the Engine Room, and Back". One of the key points was how search engines, including Google, evolved from processing "keywords" and showing a long list of web links, to processing "entities", and finally to processing "actions". Beyond his distinctive presentation style and the important content, an important part of his talk was about information quality and the problem of misinformation. In this regard, I asked a question related to my research on Wikidata references. Since a lot of the information shown for every search comes from KGs, I asked: "Does using references and data sources, as Wikidata does, increase the quality and eliminate misinformation?" Prabhakar's answer was that references can improve data quality and affect the ranking of results, but they are not used to remove any information; instead, only policy violations can lead to the deletion of information.

Keynote 1 at The ACM Web Conference 2022 by Prabhakar Raghavan from Google.

Day 4 – Thu 28 April – Infrastructure

Sessions on the fourth day of the conference were generally about recommendation. Tracks like UMAP, Search, Industry, WMCA, and Web4Good had sessions on recommendation, but I did not find them very relevant to my research direction. That's why I attended the Graphs and Infrastructure session of the Industry track. My guess was that these sessions would say something interesting about the infrastructure needed to process and manage big graphs, and my prediction was correct.

The first paper was about Eviction Prediction in Microsoft Cloud Virtual Machines. Instead of the current hierarchical node-cluster prediction system, a more efficient node-level prediction method was presented. The best part of this presentation was the discussion of limitations and the plans to overcome them. One limitation of the method is the lack of interpretability of node-level predictions, and there are also many more parameters to configure compared to traditional methods. To deal with this, the authors suggested predicting a distribution instead of an exact forecast.

Another paper was about customer segmentation using knowledge graphs. I really liked their explanation of the problem; a good explanation at the beginning of a presentation prepares the audience's mind to follow the story and connect complex concepts together correctly. Traditional methods for customer segmentation are still based on questionnaires and domain experts' opinions, which become very time-consuming on large, unannotated datasets. Instead, the authors used KG-based clustering and tested their implementation on two customer datasets from the beverage industry.

There was also another interesting paper titled "On Reliability Scores for Knowledge Graphs". The authors tried to score the data stored in an industrial KG (a product taxonomy that automatically archives data from different sources) in order to keep only reliable data. They used 5 criteria for impact analysis: numerical-based, constraint-based, clustering, anomaly detection and provenance-based, the last of which is very similar to the "trust" data quality category that I am working on right now. This paper was the closest study to my research at the event.

The last presentation was about DC-GNN, a framework for training on e-commerce datasets with Graph Neural Networks (GNNs). Using GNNs has some benefits for training; for example, features can be extracted from the topology of the data and relations can be exploited. However, the huge amount of data threatens the efficiency of the algorithms and their representations. DC-GNN improves efficiency by decoupling the traditional GNN-based two-tower CTR prediction paradigm into three separate stages.

After the Industry track, I joined the second keynote of the conference, "Responsible AI: From Principles To Action", presented by Virginia Dignum from Umeå University (Sweden). She gave a very nice talk on the current state of AI: the things AI can do today, like identifying patterns in images, texts and videos, extrapolating those patterns to new data and taking actions based on them, as well as the things AI cannot do yet, like learning from a few examples, common-sense reasoning (context and meaning), learning general concepts, and combining learning and reasoning from data. She also mentioned the challenges of learning from data, such as the impact of the wisdom of the crowd, the impact of errors and adversaries, and the impact of biases. The main point of the talk was that we need to consider who should decide what AI can or should do in the future, which values should be considered, and how those values should be prioritized. It was a very nice talk!

Day 5 – Fri 29 April – History of Web

There were some very interesting sessions on the final day of the conference, such as Structured Data and Knowledge Graphs, Graph Search, and Social Web, but I decided to participate in the History of the Web track. My reason was that this way I could get good insight into the history of the Web, the problems that existed in the beginning and how people solved them, as well as the evolution of the WWW conference itself since the early '90s. The title of the track reminded me of a part of the first chapter of Professor Tanenbaum's book Computer Networks, and also of John Naughton's book "A Brief History of the Future", which delves into the history of the Internet.

The History of the Web session consisted of papers and talks that scientifically analysed the Web's development since its birth at CERN. The first presentation, entitled "A Never-Ending Project for Humanity Called the Web", was given by Fabien Gandon and showed the great changes the Web has brought to human life. Its message was that despite the profound impact of the Web on information exchange, the Semantic Web, the Web of Things, multimedia, and authentication, much of the Web's potential has not yet been exploited.

Then we had two guest speakers, George Metakides from the EU and Jean-François Abramatic from INRIA, who shared the history of the W3C during the Web's first ten years.

Another interesting presentation was by Damien Graux, who examined the history of the Web through the lens of the annual Web Conferences (formerly WWW). This study examined how academia has interacted with web technology over the last 30 years. The paper analyses statistics on the main topics and areas, authors and their affiliations, the countries participating in the research, citations of the Web Conference in other fields, and much more. The article has both historical and scientific value; I suggest you take a look at it!

Finally, I attended the closing ceremony. Like the other parts of the conference, it was also interesting. At the end, we were introduced to the General Chair of next year's Web Conference, Dr Juan Sequeda, whom I know from ISWC 2021.

Last words

In closing, The Web Conference 2022 experience was excellent. The presentations were very interesting and understandable, the timing and discipline of the sessions were great, the chairs did their work very well, and the authors' responses were nice and friendly. I liked many of the presentation styles and learned a lot from the speakers. Although the opportunity to be in France and the city of Lyon would have been very precious and was not possible because I attended online, participating in the conference alongside my family in Iran made an incredible memory for me. I hope to be able to attend the conference in Austin, Texas next year, and maybe I'll have something precious to present there 😉

Webinar: SPaR.txt – a cheap Shallow Parsing approach for Regulatory texts

I’ll be presenting my slides for the NLLP (legal NLP) workshop during EMNLP 2021. We’ve created a text processing tool that can be used as a building block for ACC (Automated Compliance Checking) using semantic parsing.  This work has been done in collaboration with Ioannis Konstas, Alasdair Gray, Farhad Sadeghineko, Richard Watson and Bimal Kumar, and is part of the Intelligent Regulatory Compliance (i-ReC) project, a collaboration between Northumbria University and HWU. You can find our data and code at: https://github.com/rubenkruiper/SPaR.txt

Title: SPaR.txt – a cheap Shallow Parsing approach for Regulatory texts

Summary:
Understanding written instructions is a notoriously hard task for computers. One can imagine that the task becomes harder when more and more instructions are given at once, especially when these instructions can be ambiguous or even conflicting. These reasons, amongst others, are why it is so hard to achieve (semi-)automated regulatory compliance checking — something that has been researched in the fields of Architecture, Engineering, and Construction (AEC) since the early 80s. 

One necessary part of the puzzle is that a computer must recognise entities in a text, e.g., that a 'party wall' is not someone dressed up for Halloween but a wall shared by multiple buildings. We'd like a computer to understand such domain-specific terms. But we don't want to spend a lot of time defining exactly which words and word combinations belong to our domain lexicon. Therefore, we developed a tool that can help us identify groups of words (in the context of a sentence) that are likely a term of interest.

Example of the annotation scheme, see Figure 2 in our preprint.

We show that it is easy to train our tool: the annotation task is simple, and only a few training examples are needed to achieve reasonable results. Even using just 200 annotated sentences, the tool achieves 70.3% exact matches and 24.2% partial matches for entities. The output of this tool (currently focused on the AEC domain) can be used to improve Information Retrieval results and to help assemble a lexicon of terminology in support of semantic parsing.

SPARQL For Beginners

SPARQL (the SPARQL Protocol and RDF Query Language) is a W3C standard. The protocol part is usually only an issue for people writing programs that pass SPARQL queries back and forth between different machines. For most people, SPARQL's greatest value is as a query language for RDF, another W3C standard. RDF describes data using a collection of three-part statements such as "emp3 has a title of Vice President".

We call each statement a triple; a triple consists of three parts: Subject, Predicate and Object.

We can also think of the Subject as an Entity Identifier, the Predicate as an Attribute Name and the Object as an Attribute Value.

The subject and predicate are actually represented using URIs to make it absolutely clear what we are talking about. URIs (Uniform Resource Identifiers) are a kind of URL and often look like them, but they are not locators or addresses; they are just identifiers. In our example, emp3 is a person who works in a specific company, so we can represent them with a URI like http://www.snee.com/hr/emp3, and the title predicate is also a URI from a published ontology (in our case the vCard business-card ontology).

The object, or third part of a triple, can also be a URI if you like. This way the same resource can be the object of some triples and the subject of others, which lets you connect triples up into networks of data called graphs.

To make URIs simpler to write, RDF's popular Turtle syntax often shortens them by having an abbreviated prefix stand in for everything in the URI before the last part.

Any data can be represented as a collection of triples. For example, we can usually represent each cell of a table by using the row identifier as the subject, the column name as the predicate and the cell value as the object.

Convert Relational Database Table to RDF Statements
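Laid out as a simple relational table (with one column per property used in the Turtle statements below), the employee data looks like this:

    ID    given-name  family-name  title           hireDate    completedOrientation
    emp1  Heidi       Peter        CEO             2016-10-21  2016-10-30
    emp2  John        Peter        Engineer        2016-10-28  2015-01-30
    emp3  Imran       Asif         Vice President  2014-12-03
    emp4  Duke        Walliam      Sales           2015-11-10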


We can convert the above table into RDF triple statements. The following statements express the table in Turtle format.
 
                                
    @prefix vcard: <http://www.w3.org/2006/vcard/ns#> .
    @prefix sn: <http://www.snee.com/hr/> .
    

    sn:emp1   vcard:given-name        "Heidi" .
    sn:emp1   vcard:family-name       "Peter" .
    sn:emp1   vcard:title             "CEO" .
    sn:emp1   sn:hireDate             "2016-10-21" .
    sn:emp1   sn:completedOrientation "2016-10-30" .

    sn:emp2   vcard:given-name         "John" .
    sn:emp2   vcard:family-name        "Peter" .
    sn:emp2   sn:hireDate              "2016-10-28" .
    sn:emp2   vcard:title              "Engineer" .
    sn:emp2   sn:completedOrientation  "2015-01-30" .

    sn:emp3   vcard:given-name          "Imran" .
    sn:emp3   vcard:family-name         "Asif" .
    sn:emp3   sn:hireDate               "2014-12-03" .
    sn:emp3   vcard:title               "Vice President" .

    sn:emp4   vcard:given-name          "Duke" .
    sn:emp4   vcard:family-name         "Walliam" .
    sn:emp4   vcard:title               "Sales" .
    sn:emp4   sn:hireDate               "2015-11-10" .

This gives us a triple for every fact in the table. Some of the property names here come from the vCard vocabulary. For those properties that are not available in the vCard vocabulary, I made up my own property names using my own domain name. RDF makes it easy to mix and match standard vocabularies and customizations.

Let’s say that the employee in the above table, John Peter completed his employee orientation course twice and if we want to store both of his completed course orientation values in the RDF then there is not a problem with the RDF. But if we want to stored John’s second completed orientation value in a relational database table then it would have been a lot more difficult.

 
    sn:emp2   vcard:given-name          "John" .
    sn:emp2   vcard:family-name         "Peter" .
    sn:emp2   sn:hireDate               "2016-10-28" .
    sn:emp2   vcard:title               "Engineer" .
    sn:emp2   sn:completedOrientation   "2015-01-30" .
    sn:emp2   sn:completedOrientation   "2015-03-15" .

SPARQL Example Queries


WHERE Clause

let’s look at a simple SPARQL query that retrieve some of the data from the above RDF Triples.
Query 1: We want a list of all employees whose last name is Peter.

                                
    PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
    
    SELECT ?person
    WHERE
    {
      ?person vcard:family-name "Peter" .
    }

As in Turtle, we can define prefixes at the start of a SPARQL query, so we don't have to write absolute URIs in the query body. For most SPARQL queries it's best to look at the WHERE clause first, because it describes which triples we want to pull from the dataset we are querying. The WHERE clause does this with one or more triple patterns, which are like triples but with variables acting as wildcards in one, two or all three of the triple's parts.

In Query 1, the single triple pattern matches triples whose predicate is the family-name property from the vCard vocabulary, whose object is the string "Peter", and whose subject is anything at all, because the subject position holds a variable, which I named ?person.

SELECT Clause

The SELECT clause indicates which variables' values we want listed after the query executes. Query 1 only has one variable, so that's the one we want to see.

When the query executes, it finds two triples in the dataset that match the specified pattern and binds the ?person variable to their subjects, emp1 and emp2. The two matching triples are the vcard:family-name statements for sn:emp1 and sn:emp2 in the data below.

 
    @prefix vcard: <http://www.w3.org/2006/vcard/ns#> .
    @prefix sn: <http://www.snee.com/hr/> .

    sn:emp1   vcard:given-name        "Heidi" .
    sn:emp1   vcard:family-name       "Peter" .
    sn:emp1   vcard:title             "CEO" .
    sn:emp1   sn:hireDate             "2016-10-21" .
    sn:emp1   sn:completedOrientation "2016-10-30" .

    sn:emp2   vcard:given-name         "John" .
    sn:emp2   vcard:family-name        "Peter" .
    sn:emp2   sn:hireDate              "2016-10-28" .
    sn:emp2   vcard:title              "Engineer" .
    sn:emp2   sn:completedOrientation  "2015-01-30" .
    sn:emp2   sn:completedOrientation  "2015-03-15" .

    sn:emp3   vcard:given-name          "Imran" .
    sn:emp3   vcard:family-name         "Asif" .
    sn:emp3   sn:hireDate               "2014-12-03" .
    sn:emp3   vcard:title               "Vice President" .

    sn:emp4   vcard:given-name          "Duke" .
    sn:emp4   vcard:family-name         "Walliam" .
    sn:emp4   vcard:title               "Sales" .
    sn:emp4   sn:hireDate               "2015-11-10" .

Let’s executes the Query 1 and find out who are these Peters. we get the following table as a result.

From these results, we can see that emp1 and emp2 are just identifiers; this alone doesn't give us a very meaningful result.
Query 2: Now let's add a second triple pattern to the WHERE clause that matches on the given name of the employee who matched the first triple pattern, and stores that value in a new ?givenName variable.

                                
    PREFIX vcard: <http://www.w3.org/2006/vcard/ns#> 

    SELECT ?person ?givenName
    WHERE 
    { 
       ?person vcard:family-name "Peter" . 
       ?person vcard:given-name ?givenName .
    }

In Query 2 we add the new ?givenName variable to the SELECT clause because we now want it in the result. The SPARQL query processor finds each triple that matches the first triple pattern and binds its subject to the ?person variable, then looks for triples that match the second pattern with the same subject. In other words, the processor retrieves the given name of every employee whose family name is "Peter". So when we run Query 2 we also see the given-name values in the result.
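With the sample data above, the expected result set is (prefixed names shown for readability):

    ?person   ?givenName
    ---------------------
    sn:emp1   "Heidi"
    sn:emp2   "John"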

Query 3: Let’s retrieve the given name, family name and hire date of all the employees.
We can do this with a WHERE Clause that has three triple patterns one for each piece of information that we want to retrieve.


    PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
    PREFIX sn: <http://www.snee.com/hr/>

    SELECT ?givenName ?familyName ?hireDate
    WHERE
    {
        ?person vcard:given-name ?givenName .
        ?person vcard:family-name ?familyName .
        ?person sn:hireDate ?hireDate .
    }

When we run the query 3, we get the following results.
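Based on the triples above, the expected rows are (row order may differ between query processors):

    ?givenName  ?familyName  ?hireDate
    -------------------------------------
    "Heidi"     "Peter"      "2016-10-21"
    "John"      "Peter"      "2016-10-28"
    "Imran"     "Asif"       "2014-12-03"
    "Duke"      "Walliam"    "2015-11-10"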

FILTER Keyword

If we want to narrow down the results based on some condition, we can use a FILTER pattern.
Query 4: Let's say we want a list of employees who were hired before 1 November 2015, so the FILTER specifies that we only want ?hireDate values that are less than "2015-11-01".


    PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
    PREFIX sn: <http://www.snee.com/hr/>

    SELECT ?givenName ?familyName ?hireDate
    WHERE
    {
        ?person vcard:given-name ?givenName .
        ?person vcard:family-name ?familyName .
        ?person sn:hireDate ?hireDate .
        FILTER(?hireDate < "2015-11-01")
    }

When we run the query 4, we get the following results in the ISO 8601 date format.
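Only one employee in the sample data was hired before that date, so the expected result is:

    ?givenName  ?familyName  ?hireDate
    -------------------------------------
    "Imran"     "Asif"       "2014-12-03"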

Query 5: Let’s remove the FILTER condition and list the employees and their completed orientation values instead of their hire date values.


    PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
    PREFIX sn: <http://www.snee.com/hr/>

    SELECT ?givenName ?familyName ?oDate
    WHERE
    {
        ?person vcard:given-name ?givenName .
        ?person vcard:family-name ?familyName .
        ?person sn:completedOrientation  ?oDate .
    }

When we run the query 5, we get the following results.
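Using the dataset that includes both of John's completed-orientation values, the expected result is:

    ?givenName  ?familyName  ?oDate
    -------------------------------------
    "Heidi"     "Peter"      "2016-10-30"
    "John"      "Peter"      "2015-01-30"
    "John"      "Peter"      "2015-03-15"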

We see only Heidi and John’s orientation dates but the other employees don’t appear at all in the results why not? Let’s look more closely at the query triple patterns. The query first looks for a triple with a given name value and then a triple with the same subject as the subject that it found to match the first triple pattern but with a family name value and then another triple with the same subject and a completed orientation value. John and Heidi each have triples that match all the query triple patterns but Imran and Duke cannot match all three triple patterns. You have noted that John actually had two triples that matched the third pattern of the query, so the query had two rows of results for him, one for each completed orientation value.

OPTIONAL Keyword

Query 6: Let’s take another example, list of all employees and if they have any their completed orientation values.
We can tell query processor that matching on the third triple pattern is OPTIONAL.


    PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
    PREFIX sn: <http://www.snee.com/hr/>

    SELECT ?givenName ?familyName ?oDate
    WHERE
    {
        ?person vcard:given-name ?givenName .
        ?person vcard:family-name ?familyName .
        OPTIONAL { ?person sn:completedOrientation  ?oDate . }
    }

This query asks for everyone with a given name and a family name and, where available, their completed-orientation value, giving the following result.
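The expected result now includes every employee, with an empty ?oDate cell where no orientation date exists:

    ?givenName  ?familyName  ?oDate
    -------------------------------------
    "Heidi"     "Peter"      "2016-10-30"
    "John"      "Peter"      "2015-01-30"
    "John"      "Peter"      "2015-03-15"
    "Imran"     "Asif"
    "Duke"      "Walliam"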

NOT EXISTS Keyword

Query 7: Next let’s say that Heidi is scheduling a new orientation meeting and wants to know who to invite, in other words she wants to list all employees who do not have a completed orientation value.
Her query asks for everyone’s given and family names but only if for any employee who matches those first two triple patterns no triple exists that lists a completed orientation value for that employee. We do this with the keywords NOT EXISTS.


    PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
    PREFIX sn: <http://www.snee.com/hr/>

    SELECT ?givenName ?familyName
    WHERE
    {
        ?person vcard:given-name ?givenName .
        ?person vcard:family-name ?familyName .
        FILTER NOT EXISTS { ?person sn:completedOrientation  ?oDate . }
    }

When we run the query 7, we get the following results.
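Only the two employees without a completed-orientation triple remain:

    ?givenName  ?familyName
    ------------------------
    "Imran"     "Asif"
    "Duke"      "Walliam"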

BIND Keyword

So far, the only way we have seen to store a value in a variable is to include that variable in a triple pattern for the query processor to match against some part of a triple. We can use the BIND keyword to store whatever we like in a variable.


    PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
    PREFIX sn: <http://www.snee.com/hr/>

    SELECT ?givenName ?familyName ?someVariable
    WHERE {
        ?person vcard:given-name ?givenName .
        ?person vcard:family-name ?familyName .
        BIND("some value" AS ?someVariable)
    }

When we run the above query, we get the following results.
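Every row simply gets the constant string bound to ?someVariable:

    ?givenName  ?familyName  ?someVariable
    ----------------------------------------
    "Heidi"     "Peter"      "some value"
    "John"      "Peter"      "some value"
    "Imran"     "Asif"       "some value"
    "Duke"      "Walliam"    "some value"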

This can be especially useful when the BIND expression uses values from other variables and calls one of SPARQL's broad range of available functions to create a new value. In the following query, the BIND statement uses SPARQL's CONCAT function to concatenate the given-name value stored by the first triple pattern, a space, and the family-name value stored by the second triple pattern, and stores the result of this concatenation in a new variable called ?fullName.


    PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
    PREFIX sn: <http://www.snee.com/hr/>

    SELECT ?givenName ?familyName ?fullName
    WHERE {
        ?person vcard:given-name ?givenName .
        ?person vcard:family-name ?familyName .
        BIND(concat(?givenName," ",?familyName) AS ?fullName)
    }

When we run the above query, we get the following results with new full name value for each employee.
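Based on the sample data, the expected rows are:

    ?givenName  ?familyName  ?fullName
    --------------------------------------
    "Heidi"     "Peter"      "Heidi Peter"
    "John"      "Peter"      "John Peter"
    "Imran"     "Asif"       "Imran Asif"
    "Duke"      "Walliam"    "Duke Walliam"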

CONSTRUCT Clause

All the queries we have seen so far have been SELECT queries, which are like SQL SELECT statements. A SPARQL CONSTRUCT query uses the same kind of WHERE clause as a SELECT query, but it uses the values stored in the variables to create new triples.


    PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
    PREFIX sn: <http://www.snee.com/hr/>

    CONSTRUCT { ?person vcard:fn ?fullName . }
    WHERE {
        ?person vcard:given-name ?givenName .
        ?person vcard:family-name ?familyName .
        BIND(concat(?givenName," ",?familyName) AS ?fullName)
    }

When we run the above query, we get the following new triples.
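Applied to the sample data, the constructed triples should look like this in Turtle:

    @prefix vcard: <http://www.w3.org/2006/vcard/ns#> .
    @prefix sn: <http://www.snee.com/hr/> .

    sn:emp1   vcard:fn   "Heidi Peter" .
    sn:emp2   vcard:fn   "John Peter" .
    sn:emp3   vcard:fn   "Imran Asif" .
    sn:emp4   vcard:fn   "Duke Walliam" .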

Note how the triple pattern describing the triple to construct is inside curly braces. These braces can enclose multiple triple patterns, which is common when, for example, a CONSTRUCT query takes data conforming to one model and creates triples conforming to another. CONSTRUCT queries are great for data integration projects.


    PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
    PREFIX sn: <http://www.snee.com/hr/>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

    CONSTRUCT {
        ?person rdf:type foaf:Person .
        ?person foaf:givenName ?givenName .
        ?person foaf:familyName ?familyName .
        ?person foaf:name ?fullName .
    }
    WHERE {
        ?person vcard:given-name ?givenName .
        ?person vcard:family-name ?familyName .
        BIND(concat(?givenName," ",?familyName) AS ?fullName)
    }

When we run the above query, we get the following new triples.


            @prefix vcard: <http://www.w3.org/2006/vcard/ns#> .
            @prefix sn: <http://www.snee.com/hr/> .
            @prefix foaf: <http://xmlns.com/foaf/0.1/> .
            @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

            sn:emp1   rdf:type          foaf:Person .
            sn:emp1   foaf:familyName   "Peter" .
            sn:emp1   foaf:givenName    "Heidi" .
            sn:emp1   foaf:name         "Heidi Peter" .

            sn:emp2   rdf:type          foaf:Person .
            sn:emp2   foaf:familyName   "Peter" .
            sn:emp2   foaf:givenName    "John" .
            sn:emp2   foaf:name         "John Peter" .

            sn:emp3   rdf:type          foaf:Person .
            sn:emp3   foaf:familyName   "Asif" .
            sn:emp3   foaf:givenName    "Imran" .
            sn:emp3   foaf:name         "Imran Asif" .

            sn:emp4   rdf:type          foaf:Person .
            sn:emp4   foaf:familyName   "Walliam" .
            sn:emp4   foaf:givenName    "Duke" .
            sn:emp4   foaf:name         "Duke Walliam" .

SPARQL Can Do More

SPARQL can do a lot more than what we have seen here. You can:
  • Use datatypes and language tags
  • Sort and aggregate query results
  • Add, delete and update data
  • Retrieve JSON, XML, and delimited versions of query results from query processors
  • Send queries to remote private or public data collections, and there are quite a few of those out there.

References


DuCharme, B. (2013). Learning SPARQL: Querying and Updating with SPARQL 1.1. O'Reilly Media.

Seminar: Utilising Linked Data in the Public Sector

Title: Utilising Linked Data in the Public Sector

Speaker: Angus Addlesee, PhD Student, Heriot-Watt University

Date: 11:15 on 25 March 2019

Location: CM F.17, Heriot-Watt University

Abstract: In this presentation I will explain how Wallscope (a small tech company in Edinburgh) is using linked data in public sector projects.

Bio: Angus has worked at Wallscope for two years in various roles and is now studying for his PhD at Heriot-Watt, which is part-funded by Wallscope.

Wallscope uses Machine Learning and Semantic Technologies to build Knowledge Graphs and Linked Data applications. We are motivated to lower the barriers for accessing knowledge to improve the health, wealth and sustainability of the world we share.

Seminar: Exploiting Semantic Web Technologies in Open-domain Conversational Agents

Title: Exploiting Semantic Web Technologies in Open-domain Conversational Agents

Speaker: Alessandro Suglia, Heriot-Watt University

Date: 11:15 on 10 December 2018

Location: CM F.17, Heriot-Watt University

Abstract: The Amazon Alexa Prize is an international competition organised to foster the development of sophisticated open-domain conversational agents. During the competition, the systems should be able to support a conversation with a user about several topics ranging from movies to the news in an engaging and coherent way. In order to understand the entities mentioned by the user and to be able to provide interesting information about those, we relied on several Semantic Web Technologies such as the Amazon Neptune cluster for high-performance SPARQL query execution on a customised large-scale knowledge base composed of Wikidata and a fragment of the DBpedia ontology. In this talk, I will provide an overview of the system and I will describe how our system leverages the power of Linked Data in several components of the architecture.

Does Open Scottish Government Data Justify the Proposed Edinburgh Tram Extension?

The Scottish government provides open access to a range of official statistics about Scotland across 17 themes, including crime, the economy, housing, and population growth. The portal is accessible at http://statistics.gov.scot.

This post uses open Scottish government data to measure the 5 year population growth in areas in close proximity to the proposed completion of phase 1a of the tram extension, from York Place to Newhaven.

The tram infrastructure business case

The original business case for the tram infrastructure was set out in a document Edinburgh Tram Draft Final Business Case (The City of Edinburgh Council, December 2006). It justifies the need for a tram system using two broad arguments:

  1. The predicted growth in housing demands for a growing population in the areas on the proposed lines:

“Huge new developments are in the pipeline, especially on the city’s waterfront in Leith Docks and Granton. Edinburgh Waterfront is the largest brownfield development in Scotland, equivalent to a major new town in scale, with the two major development sites able to accommodate up to 29,000 new homes in the longer term (Section 3.3, Why Tram).”

  2. The tram system will encourage economic regeneration and social inclusion:

“Without the tram, access to the major Waterfront developments will simply not be good enough. The Leith Docks proposals would have to be scaled down and the development prospects at Granton would be damaged (Section 3.3, Why Tram). Areas of [..] a zone around Leith Walk [..] are areas where socio economic status is considerably less affluent than surrounding areas and where employment, income levels and car ownership tend to be comparatively low. Opportunities for people living in these areas will be improved by direct connection via tram to the City Centre and other employment areas, including the new development in Granton, Leith and the West of the City at Edinburgh Park and the Airport (Section 1.18, Accessibility and Social Inclusion).”

This document, written 11 years ago, made various projections of population growth, particularly in the north and north east of Edinburgh. For example, it states that 24,000 new houses would be needed in Edinburgh by 2015 (Section 3.3, Why Tram). The Scottish government open data portal can be used to assess the accuracy of those projections.

Party political views on the proposed extension

Due to well documented economic costs and major project problems, the tram stretches only as far as York Place. Edinburgh City Council are considering an extension from York Place down Leith Walk, to Newhaven. It could be partly funded by a City Deal. A definitive yes/no decision on the extension will be made after the local council elections take place on 4th May 2017. The Labour party are in favour, the SNP have recently set conditions on their backing of the extension, whilst the Conservative party have promised to reject the business case for the extension in their council election manifesto.

Zones along the proposed tram extension

The long-term proposed route connects Granton to the West End and Newhaven to York Place, with a line between Granton and Newhaven to complete the ring. The route is shown in the Edinburgh Local Development Plan:

Proposed extension route

Below I focus on the 5-year plan of extending the line from York Place to Newhaven. In the Scottish government open dataset there are 1279 "intermediate geographical zones". The following are the intermediate zones in close proximity to the proposed extension down Leith Walk and to Newhaven:

  • Broughton South
  • Easter Road and Hawkhill Avenue
  • Great Junction Street
  • North Leith and Newhaven
  • Pilrig
  • The Shore and Constitution Street
  • Western Harbour and Leith Docks

Population growth along the proposed extension

The Scottish government open data portal allows us to query the population growth in each Scottish city, and also within each intermediate zone. The graphs below show the rate of growth from 2011 to 2015, first at the city level and then at the intermediate-zone level.


National population growth

Tram extension proximity population growth

Here are 3 observations about these graphs:

  1. The population of Edinburgh city is growing faster than that of Glasgow, Dundee, Stirling and Aberdeen. Between 2011 and 2015, the population of Edinburgh grew by 4.4%.
  2. The population of the intermediate zones in close proximity to the proposed Edinburgh tram extension is growing quickly. Of the 7 identified zones, 5 are growing at a faster rate than the average growth rate across Edinburgh.
  3. The population of the Western Harbour and Leith Docks zone is growing at a very rapid pace: it grew by 22.7% between 2011 and 2015.

Reproducing this analysis

These graphs are generated by

  1. Running a SPARQL query against the Scottish government dataset service.
  2. Plotting graphs from the CSV result set using R.

The SPARQL queries and plotting scripts used for the material in this post are available online at https://github.com/robstewart57/edinburgh-trams .
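As a rough illustration of step 1, the query below sketches how population observations could be pulled from the statistics.gov.scot SPARQL endpoint using the RDF Data Cube vocabulary. The dataset and measure-property URIs shown are assumptions for illustration only and may not match what the repository uses, so check the linked scripts for the exact queries.

    PREFIX qb:      <http://purl.org/linked-data/cube#>
    PREFIX sdmxDim: <http://purl.org/linked-data/sdmx/2009/dimension#>
    PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>

    # Hypothetical sketch: one population figure per area and period from a
    # data cube of population estimates (dataset and measure URIs assumed).
    SELECT ?areaLabel ?period ?population
    WHERE {
      ?obs qb:dataSet <http://statistics.gov.scot/data/population-estimates> ;
           sdmxDim:refArea   ?area ;
           sdmxDim:refPeriod ?period ;
           <http://statistics.gov.scot/def/measure-properties/count> ?population .
      ?area rdfs:label ?areaLabel .
    }
    ORDER BY ?areaLabel ?period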

Summary

This post used open Scottish government data to show that population growth along the proposed tram route extension from York Place to Newhaven was greater than the Edinburgh average between 2011 and 2015 (zonal growth of up to 23%), in a city whose population is itself growing faster than that of the 4 other Scottish cities compared.

This data validates the 2006 projection that housing demand would grow in the north east of Edinburgh, although measuring any discrepancies in exact numbers is difficult because the 2006 projections were not cross-referenced to the intermediate zones that the open dataset uses.

The population data used here is only one of the 17 themes in the Scottish government’s dataset. An extension to this study would explore the other business cases for the tram, namely economic regeneration and social inclusion using the other 16 themes, to evaluate the completeness and granularity of economic and societal data at local levels in the Scottish government dataset.

Seminar: Computer-Aided Biomimetics

Title: Computer-Aided Biomimetics: Computational tools to support engineers during Biologically Inspired Design

Speaker: Ruben Kruiper, Heriot-Watt University

Date: 11:15 on 30 January 2017

Location: CM F.17, Heriot-Watt University

Abstract: Biologically inspired design concerns copying ideas from nature to various other domains, e.g. natural computing. Biomimetics is a sub-field of biologically inspired design and focuses specifically on solving technical/engineering problems. Because engineers lack biological knowledge, the process of biomimetics is non-trivial and remains adventitious. Therefore, computational tools have been developed that aim to support engineers during a biomimetics process by integrating large amounts of relevant biological knowledge. Existing tools apply NLP techniques to biological research papers to build dedicated knowledge bases. However, these existing tools impose an engineering view on biological data. I will talk about the support that 'Computer-Aided Biomimetics' tools should provide, introducing a theoretical basis for further research on the appropriate computational techniques.