Open PHACTS Closing Symposium

For the last 5 years I have had the pleasure of working with the Open PHACTS project. Sadly, the project is now at an end. To celebrate we are having a two-day symposium to look over the contributions of the project and its future legacy.

The project has been hugely successful in developing an integrated data platform to enable drug discovery research (see a future post for details to support this claim). The result of the project is the Open PHACTS Foundation which will now own the drug discovery platform and sustain its development into the future.

Here are my slides on the state of the data in the Open PHACTS 2.0 platform.

Why is there no LearningResource type in schema.org?

A couple of times in the last month or so the question has come up of why there isn't a LearningResource type in schema.org as a subtype of CreativeWork. In case it comes up again, here's my answer.

We took a deliberate decision way back at the start of LRMI not to define a LearningResource as a subtype of CreativeWork. Essentially the problem comes when you try to define what is a Learning Resource. Everyone who has tried so far has come up with something like “a resource which is used in learning, education or training”. That doesn’t rule out anything. Whether a magazine like Germany’s Spiegel is a learning resource depends on whether you are a German speaker or an American studying German. In presentations I have compared this problem to that of defining “what is a seat”. You can get seats in all shapes and forms with many different characteristics: chairs, sofas, saddles, stools; so in the end you just have to say a seat is something you sit on. Rather than rehash the problem of deciding what is and isn’t a learning resource, we took the approach of providing a way by which people can describe the educational properties of any Creative Work.

We recognised that there are some “types” of resource that are specific to learning. You can sensibly talk about textbooks and instructional videos as being qualitatively different to novels and the movies people watch in the cinema, without denying that novels and movies are useful in education. That’s why we have the learningResourceType property. You can think of this as describing the educational genre of the resource.

In practice there are two choices for searching for learning resources. You can search those sites that are curated collections of what someone has decided are educational resources. Or you can search for the educational properties you want. So in our attempt at creating a Google Custom Search Engine we looked for the AlignmentObject. Looking for the presence of a learningResourceType would be another way. The educationalUse property should likewise be a good indicator.
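To make this concrete, here is a sketch (not from the original post; the resource and all property values are invented for illustration) of how the educational characteristics of a Creative Work can be described in schema.org JSON-LD using the LRMI properties:

    {
      "@context": "https://schema.org/",
      "@type": "VideoObject",
      "name": "German vowel sounds",
      "learningResourceType": "instructional video",
      "educationalUse": "pronunciation practice",
      "typicalAgeRange": "16-",
      "educationalAlignment": {
        "@type": "AlignmentObject",
        "alignmentType": "educationalSubject",
        "targetName": "German as a foreign language"
      }
    }

Note that the type remains VideoObject: the educational nature of the resource is carried entirely by the properties, not by a LearningResource type.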

HECoS, a new subject coding system for Higher Education

You may have missed that just before Christmas HECoS (the Higher Education Classification of Subjects) was announced. I worked a little on the project that led up to this, along with colleagues in Cetis (who led the project), Alan Paull Services and Gill Ferrell, so I am especially pleased to see it come to fruition. I believe that as a flexible classification scheme built on semantic web / linked data principles it is a significant contribution to how we share data in HE.

HECoS was commissioned as part of the Higher Education Data & Information Improvement Programme (HEDIIP) in order to find a replacement for JACS, the subject coding scheme currently used in UK HE when information from different institutions needs to be classified by subject. When I was first approached by Gill Ferrell, while she was working on a preliminary study to determine whether JACS needed changing, my initial response was that something much more in tune with semantic web principles would be very welcome (see the second part of this post that I wrote back in 2013). HECoS has been designed from the outset to be semantic web friendly.

One of the issues identified by the initial study was that aggregation of subjects is politically sensitive. For starters, the level of funding can depend on whether a subject is, for example, a STEM subject or not; but there are also factors of how universities as institutions are organised into departments/faculties/schools and how academics identify with disciplines. These lead to unnecessary difficulties in the subject classification of courses: it is easy enough to decide whether a course is about ‘actuarial science’, but deciding whether ‘actuarial science’ should be grouped under ‘business studies’ or ‘mathematics’ is strongly context dependent.

One of the decisions taken in designing HECoS was therefore to separate the politics of how to aggregate subjects from the descriptions of those subjects and their more general relationships to each other. This is in marked contrast to JACS, where the aggregation was baked into the very identifiers used. That is not to say that aggregation hierarchies aren’t important or won’t exist: they are, and they will; indeed there is already one for the purpose of displaying subjects for navigation. But such hierarchies will be created through a governance process that can consider the politics involved separately from describing the subjects. This should make the subject classification terms more widely usable, allowing institutions and agencies who use them to build hierarchies for presentation and analysis that meet their own needs where these differ from those represented by the process responsible for the standard hierarchy. A more widely used classification scheme will have benefits for the information improvement envisaged by HEDIIP.

The next phase of HECoS will be about implementation and adoption: for example, creating the governance processes detailed in the reports, moving HECoS up to proper 5-star linked data, helping with migration from JACS to HECoS, and so on. There’s a useful summary report on the HEDIIP site, and a spreadsheet of the coding system itself. There’s also still the development version Cetis used for consultation, which better represents its semantic webbiness but is non-definitive and temporary.
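To illustrate the separation of description from aggregation (the URIs and labels below are hypothetical placeholders, not definitive HECoS identifiers), a SKOS sketch might look like this:

    @prefix skos: <http://www.w3.org/2004/02/skos/core#> .
    @prefix ex:   <http://example.org/hecos/> .   # hypothetical namespace

    # The subject term itself carries no aggregation information.
    ex:actuarialScience a skos:Concept ;
        skos:prefLabel "actuarial science"@en .

    # Groupings are separate resources created through the governance process;
    # another agency could publish a different hierarchy over the same concepts.
    ex:mathematicalSciences a skos:Concept ;
        skos:prefLabel "mathematical sciences"@en ;
        skos:narrower ex:actuarialScience .

Because the identifier for ‘actuarial science’ says nothing about where it sits in a hierarchy, regrouping it does not break anyone’s data, which is exactly what JACS-style identifiers could not offer.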

Validata: An online tool for testing RDF data conformance

Validata is an online web application for validating an RDF document against a set of constraints. This is useful for data exchange applications or for ensuring conformance of an RDF dataset against a community-agreed standard. Constraints are expressed as a Shape Expression (ShEx) schema.
Validata extends the ShEx functionality to support multiple requirement levels. Validata can be repurposed for different deployments by providing it with a new ShEx schema.
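For a flavour of ShEx (a generic sketch, not a schema from one of the Validata deployments), here is a shape that requires exactly one name and allows any number of mailbox IRIs:

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

    # Nodes validated against <PersonShape> must have exactly one foaf:name
    # (a string) and may have any number of foaf:mbox IRIs.
    <PersonShape> {
      foaf:name xsd:string ;
      foaf:mbox IRI *
    }

A deployment’s requirement levels then determine which constraints data MUST satisfy and which it merely SHOULD.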

The Validata code is available from https://github.com/HW-SWeL/Validata. Existing deployments are available for:

Paper published at SWAT4LS2015.

MACS Christmas Conference

I was asked to speak at the School (Faculty) of Mathematical and Computer Sciences (MACS) Christmas conference. I decided I would have some fun with the presentation.

Title: Project X

Abstract: For the last 11 months I have been working on a top secret project with a world renowned Scandinavian industry partner. We are now moving into the exciting operational phase of this project. I have been granted an early lifting of the embargo that has stopped me talking about this work up until now. I will talk about the data science behind this big data project and how semantic web technology has enabled the delivery of Project X.

You can find more details of flood defence work in this paper.

schema for courses

UPDATE: there is a new W3C community group schema course extend set up to progress these ideas. Please join if you are interested.

This is essentially an invite to get involved with building a schema extension for educational courses, by way of a description of work so far. If you want to reply, this has also been sent as an email to the schema.org mailing list.

About a year ago there was a flurry of discussion about wanting to mark up descriptions of courses in schema. Vicky Tardiff-Holland produced a proposal which we discussed in LRMI and elsewhere, as a result of which various suggestions and comments were added to that proposal.

I also led some work in LRMI around scope, use cases, requirements and existing data, which I hope will lead to validating and refining the proposal against example data that demonstrates it meets the use cases.

I am up for another push on courses. I am sharing the doc I was working on in the hope that it is a good starting point. It’s a bit long, so here is an overview of what it contains:

  • scope: concerning discovery of any type of educational course (online/offline, long/short, scheduled/on-demand). An educational course is defined as “some sequence of events and/or creative works which aims to build knowledge, competence or ability of learners”. (Out of scope: information about students and their progression etc.; information needed internally for course management rather than discovery.)
  • comparators: a review of some established ways of sharing similar data
  • use cases
  • requirements arising from the use cases
  • mapping to some existing examples. I used hypothes.is to annotate existing web pages that describe different types of course, e.g. from Coursera or a university, tagging each annotation with the requirement that the data was relevant to. Here’s an example of a page as tagged (click on a yellow highlight to show the relevant requirement as a comment with a tag).
    hypothes.is aggregates the selected information for each tag, to give a list of the information relevant to each use case, for example cost.

I think the next step would be to review the use cases and requirements in light of some of the observations from the mapping, and to look again at the proposal to see how it reflects the data available/required. But first I want to try to get more people involved, see whether anyone has a better idea for how to progress, or if anyone wants to check the work so far and help move it forward.

Finally, I’m aware the docs and discussions so far around schema for courses are a scattered set of scraps and drafts. If there is enough interest it would be really useful to have it in one place.
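For a flavour of what course markup might eventually look like, here is a speculative JSON-LD sketch loosely based on the draft proposal; the Course type and every property shown were still under discussion at the time, so treat the names and values as provisional, invented for illustration:

    {
      "@context": "https://schema.org/",
      "@type": "Course",
      "name": "Introduction to the Semantic Web",
      "description": "A ten-week campus-based course for undergraduates.",
      "courseCode": "EX101",
      "provider": {
        "@type": "CollegeOrUniversity",
        "name": "Example University"
      }
    }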

On the first day of Christmas

Prompted by

    On the second day of Christmas, my true love sent to me: Anscombe’s quartet https://t.co/0olyAiVaBY
    — Judy Robertson (@JudyRobertsonUK) December 2, 2015

and with apologies:

On the first day of Christmas
My true love gave to me
A testable hypoth-e-sis

On the second day of Christmas
My true love gave to me
Two sample means
And a testable hypothesis

On the third day of Christmas
My true love gave to me
Three peer reviews
Two sample means
And a testable hypothesis

On the fourth day of Christmas
My true love gave to me
Four scatter plots
Three peer reviews
Two sample means
And a testable hypothesis

On the fifth day of Christmas
My true love gave to me
FIIIVE SIGMAA RuuuuLE

(I always thought the carol went downhill from there)

A short project on linking course data from Sharing and learning

During the summer my colleague Phil Barker (author of the Sharing and Learning blog) and I hosted a summer intern, Anna Grant.

Anna’s project was to investigate the feasibility of publishing the data about our courses as Linked Data. Phil subsequently wrote up a blog post about the work which I have been meaning to share for a long time, so here it is, long overdue.

Below I have picked out some quotes from Phil’s original blog post that describe the work that Anna did.

The objectives for Anna’s work were ambitious: survey existing HE [Higher Education] open data and ontologies in use; design an ontology that we can use; develop an interface we can use to create and publish our course data. Anna made great progress on all three fronts.

The ontologies reviewed were: AIISO, Teach, CourseWare, XCRI, MLO, ECIM and CEDS. A live working draft of the summary / review for these is available for comment as a Google Doc.

The final draft [of the extended MLO Ontology] is shown below. Key: Green = MLO; Purple = MLO extension; Blue = ECIM / previous alteration to MLO; Yellow = generic ontologies such as Dublin Core and SKOS.

MLO Extension to capture taught courses and their relationships to degree programmes.

Anna has finished her work here now and returns to Edinburgh Napier University to finish her Master’s project. Alasdair and I think she has done a really impressive job, not least considering she had no previous experience with RDF and semantic technologies. We’ve also found her a pleasure to work with and would like to thank her for her efforts on this project.

Source: A short project on linking course data | Sharing and learning

A library shaped black hole in the web?

A library shaped black hole in the web? was the name of an OCLC event that was getting its second(?) run in Edinburgh last week, looking at how libraries can contribute to the web, using new technologies (for example linked data) to “re-envision, expose and share library data as entities (work, people, places, etc.) and what this means.”

Aside: to suggest that libraries act as a black hole in the web is quite a strong statement: black holes suck in information and at the very least mangle it, if not destroy it completely. Perhaps only a former physicist would read the title that way :-)

We were promised that we would:

learn how entity-based descriptions of library data – powered by linked data – will create new approaches to cataloguing, resource sharing and discovery. We will look at how referencing library data as entities, in Web friendly formats, enables data relationships to be rendered useful in many more contexts increasing the relevance of libraries within the wider information ecosystem.

which I wouldn’t quibble with. Here’s a summary of what I did take from the day.

Owen Stephens got us started with an introduction to the basic RDF model of triples building into a graph, pointing out that the basic services required to start doing this are now available to libraries. So if the statement you wish to make is about the authorship of a book, you need URIs to identify the book, the person and the “has creator” relationship: the first two of these are provided by, for example, the Library of Congress Authorities linked data service, the third by Dublin Core (among others). But Owen stressed that the linked data approach was more than another view of the same data, because other people can make statements about your data.

Owen drew on the distinction made in the Semantic Web community between “open world” and “closed world” approaches to illustrate how this can change your view of data. The library-catalogue-as-inventory is treated “closed world”, that is, all the relevant information can be assumed to be there: if you don’t have information about a book in your inventory then you infer that you don’t have the book. In an open world, however, someone else might have information that would change that inference, so in an open world approach you wouldn’t take lack of information about something to mean that the thing in question did not exist. The advantage of working in an open world is that further information is always being added by others from other fields, so the catalogue-as-information-source can be just one source of data for a web that goes beyond bibliographic data.

Owen gave an example of this from Early English Books, where data extracted from the colophons about the booksellers who had commissioned the printing of each book had been linked to data from historical research on these booksellers (their locations and dates of operation), which greatly enhances the value of the library catalogue data for researchers into the history of publishing. We’ll come back to this theme of enhancing the value of the library catalogue for others.
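A minimal Turtle sketch of Owen’s book example (the book and person URIs are placeholders; real ones would come from authority services such as the Library of Congress):

    @prefix dcterms: <http://purl.org/dc/terms/> .
    @prefix foaf:    <http://xmlns.com/foaf/0.1/> .

    # One institution asserts authorship using shared identifiers...
    <http://example.org/books/moby-dick>
        dcterms:creator <http://example.org/people/herman-melville> .

    # ...and anyone else can add statements about the same URIs (the open world).
    <http://example.org/people/herman-melville>
        foaf:name "Herman Melville" .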

Owen has a more complete summary of his presentation available.

Neil Jefferies of the Bodleian Library built on what Owen had been discussing. He identified the core interest of the library as the intellectual content of the books, letting archives and museums deal with the book as an object, and he mentioned the hierarchical nature of intellectual content: data -> facts -> information -> knowledge. He added that the library’s key strengths are expertise in retention and search, and access to the physical originals; technology, though, has shifted what the library may achieve, so that it should be about creating knowledge, not just holding data or sharing information.

He went on to give more examples of projects showing libraries using linked data to facilitate knowledge creation than I could manage to take notes on, but among the highlights was LD4L, Linked Data for Libraries, a $999k Mellon-funded project involving Cornell, Harvard and Stanford, which aims to create a “Scholarly Resource Semantic Information Store” that works both within individual institutions and links to other domains. The aim is to build this with OSS, and Neil mentioned the VIVO platform and community as an example of this.

Neil also spoke about the richness required to model all the information relevant to knowledge in the library. CAMELOT, the data model used for knowledge held at the Bodleian, includes a lot of provenance and contextual modelling: linked data is about assertions, and you need context and provenance to be able to judge the truth of these (here’s a consequence of the open world nature of linked data: do you know where your data came from? do you know the assumptions made when creating it?). BIBFRAME, or MARC in RDF, is not enough: it holds on to the idea of the central authority of the catalogue(-as-inventory), whereas in linked data authority is more diffuse. The data model for LD4L will likely include BIBFRAME, FaBIO, VIVO-ISF, OpenAnnotation, PAV, OAI-ORE, SKOS, VIAF, ORCID, ISNI, OCLC Works, plus circulation, citation and usage data, and will likely need a good deal of entity reconciliation to deal with many people talking about the same thing.

So much for the idea and the promise of linked data for libraries. I would next like to describe a trio of talks that dealt with the question “what is to be done?”

Cathy Dolbear from Oxford University Press spoke about providing semantic and bibliographic data for libraries. The OUP provide metadata in a lot of different ways, varying from the venerable OAI-PMH (which seems to have little uptake) to RDFa embedded in product web pages (which may soon become JSON-LD). And yet most people find OUP content via direct links and search engines; a spot sample of one day’s referrers showed library discovery services accounted for ~1% of the hits. Cathy stressed that there were patches where library discovery services were more significant, but on the whole it was hard to see library use. Internally OUP have their own schema, OxMetaML, and are moving to a more graph-based approach; they transform this to the standards used by discovery services, e.g. HighWire, PRISM, JATS, PubMed etc. Cathy seemed to want to find ways that OUP metadata could be used to support the endeavours of libraries to use linked data as described above, but wanted to know, if she published linked data, how it would be used: OUP can only spend money on doing things they know people are going to use, and it is hard to see who is using linked data. I got the strong impression that Cathy knew the ideas and was aware of the project work being done with linked data, but her key point was that OUP need more information from libraries about what data is needed for real-world service delivery before they can be sure whether it’s worth creating and delivering metadata.

Ken Chad spoke about “Linked data: why care and what do we do?”, describing the current status of linked data in terms of chasms in the technology adoption lifecycle and troughs of disillusionment in the hype cycle, both of which echo Cathy’s question about how we get beyond interesting projects to real-world service delivery. In my own mind this is key. The initial draft of RDF is about 18 years old. The “linked data” reboot is about 9 years old. When do we stop talking about early adopters and decide we’ve got all the adopters we’re going to get? Or at least decide that if we want more adopters we need a radically different approach. Ken spoke about approaching the problem in terms of the Jobs to be Done (the link to Ken’s presentation above describes that approach), which I have no problem with, and I certainly would agree with Ken’s suggestion that the job to be done is to “design a library website that helps students focus less on finding and more on studying”. However, I do think there is an extra layer to this problem in that it requires other people to provide things you can link to. Buying a phone won’t help get a job done if you’re the only person with a phone.

Gill Hamilton of the National Library of Scotland spoke to the theme of how to be ready for linked data even if you’re not convinced by it. This appealed to me. She gave three top tips: (1) following Google, think of things not strings, and record URIs not names; (2) you probably need a rich and detailed schema for your own specialised uses of the data; don’t dumb this down to a generic ontology, but publish it and map it to the generic; (3) concentrate on what you have that’s unique and let other people handle the generic. To these Gill added three lesser tips: license your metadata as CC0, demand better systems, and use open vocabularies.
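Gill’s first tip, in Turtle terms (the URIs are placeholders; a real catalogue would record identifiers from services such as VIAF or ISNI):

    @prefix dcterms: <http://purl.org/dc/terms/> .

    # A string: nothing here can be linked to or enriched by anyone else's data.
    <http://example.org/records/1> dcterms:creator "Walter Scott" .

    # A thing: a URI that other data about the same person can link to.
    <http://example.org/records/1> dcterms:creator <http://example.org/people/walter-scott> .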

Richard Wallis gave the final presentation, “The web of data is our oyster”, which he started by describing a view of the development of the web from a web of documents, to a web of dynamic documents, to a web of information discovery, to a web of data, to a web of knowledge (with knowledge graphs and data mining). He suggested that libraries were engaged at the start, but became disengaged, maybe even hostile, at the point of the web of discovery. One change that libraries had missed through this was the move from records (by definition, relating to the past) to living descriptions in terms of entities and relationships. This, he suggested, had meant that many library projects on sharing data had led to “linked data silos” which search engines cannot get into. The current approach to giving search engines access to entity and relationship data is schema.org, and Richard described his own work on extending schema for bibliographic data. Echoing Gill’s second tip, he stressed that this was not intended to be an appropriate way to meet libraries’ metadata needs, or even the way that libraries should use the web to share data between themselves, but it is a way that libraries can share their data with the web (of discovery, of data, of knowledge) as a whole.

All in all, a good day. Nothing spectacularly new, but useful to see it all lined up and presented so coherently. Many thanks to Owen, Neil, Cathy, Ken, Gill and Richard, and to OCLC for arranging the event.


Presentation: LRMI – using schema.org to facilitate educational resource discovery on the web and beyond

Today I am in London for the ISKO Knowledge Organisation in Learning and Teaching meeting, where I am presenting on LRMI and schema.org to facilitate educational resource discovery on the web and beyond. My slides are here; mostly they cover similar ground to presentations I’ve given before which have been captured on video or which I have written up in more detail. So here I’ll just point to my slides for today (& below) and summarise the new stuff.

LRMI uptake

People always want to know how much LRMI exists in the wild, and now schema.org reports this information. Go to the schema.org page for any class or property and at the top it says in how many domains markup for it is found. Obviously this misses that not all domains are equal in extent or importance: finding LRMI on pjjk.net should not count as equal to finding it on bbc.co.uk, but as a broad indicator it’s OK: finding a property on 10 domains or 10,000 domains is a valid comparison. LRMI properties are mostly reported as found on 100-1,000 domains (e.g. learning resource type) or 10-100 domains (e.g. educational alignment). A couple of LRMI properties have greater usage, e.g. typical age range and is based on URL (10-50,000 and 1-10,000 domains respectively), but I guess that reflects their generic usefulness beyond learning resources. We know that in some cases LRMI is used for internal systems but not exposed on web pages, but still the level of usage is not as high as we would like.

I also often get asked about support for creating LRMI metadata; this time I’m including a mention of how it is possible to write WordPress plugins and themes with schema / LRMI support, and of the Drupal schema.org plugin. I’m also aware of “tagging tools” associated with various repositories, e.g. the Learning Registry and the Illinois Shared Learning Environment. I think it’s always going to be difficult to answer this one, as the best support will always come from customising whatever CMS an organisation uses to manage their content or metadata, tailored to their workflow and the types of resources and educational contexts they work in.

As far as implementation for search goes, I still cover Google Custom Search, as in previous presentations.

Current LRMI activities

The DCMI LRMI task group is active, and one of our priorities is to improve the support for people who want to use LRMI. Two activities are nearing fruition: firstly, we are hoping to provide examples for relevant properties and types on the schema.org web site; secondly, we want to provide better support for the vocabularies used for properties such as alignment type (in the AlignmentObject), learning resource type etc., by way of clear definitions and machine-readable vocabulary encodings (using SKOS). We are asking for public review and comment on the LRMI vocabularies, so please take a look and get in touch.
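As a sketch of what such a SKOS encoding might look like (the URIs, labels and definition here are hypothetical placeholders, not the draft LRMI concept schemes that are out for review):

    @prefix skos: <http://www.w3.org/2004/02/skos/core#> .
    @prefix ex:   <http://example.org/lrmi/learningResourceType/> .  # hypothetical

    ex:scheme a skos:ConceptScheme ;
        skos:prefLabel "Learning resource types"@en .

    ex:textbook a skos:Concept ;
        skos:inScheme ex:scheme ;
        skos:prefLabel "textbook"@en ;
        skos:definition "A structured exposition of a course of study."@en .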

Other work in progress is around schema for courses and extending some of the vocabularies mentioned above. We have monthly calls, if you would like to lend a hand please do get in touch.