Community Recommendations

Here is a list of the community recommendations that I have been involved with.

2022

  • Ammar Ammar, Ivan Mičetić, and Alasdair J. G. Gray. An ETL pipeline to construct the Intrinsically Disordered Proteins Knowledge Graph (IDP-KG) using Bioschemas JSON-LD data dumps. Technical Report, 2022. doi:10.37044/osf.io/7f95d
    [BibTeX] [Abstract] [Download PDF]

    Schema.org and Bioschemas are lightweight vocabularies that aim at making the contents of web pages machine-readable so that software agents can consume that content and understand it in an actionable way. Due to the time needed to process each page, extracting markup by visiting each page of a site is not practical for huge sites. This approach imposes processing requirements on the publisher and the consumer. In February 2022, the Schema.org community proposed a method for exchanging markup from various pages as a DataFeed published at a recognized address. This would ease publisher and consumer processing requirements and accelerate data collection. In this work, we report on the implementation of a JSON-LD consumer ETL (Extract-Transform-Load) pipeline that enables data dumps to be ingested into knowledge graphs (KG). The pipeline loads scraped JSON-LD from the three sources, converts it to RDF, applies SPARQL CONSTRUCT queries to map the source RDF to a unified Bioschemas-based model, and stores the resulting KG as a Turtle file. This work was conducted during the one-week BioHackathon Europe 2022 in Paris, France, under Project 23, “Publishing and Consuming Schema.org DataFeeds.”

    @techReport{ammar:data-pipeline:biohackrxiv2022,
    abstract={Schema.org and Bioschemas are lightweight vocabularies that aim at making the contents of web pages machine-readable so that software agents can consume that content and understand it in an actionable way. Due to the time needed to process each page, extracting markup by visiting each page of a site is not practical for huge sites. This approach imposes processing requirements on the publisher and the consumer. In February 2022, the Schema.org community proposed a method for exchanging markup from various pages as a DataFeed published at a recognized address. This would ease publisher and consumer processing requirements and accelerate data collection. In this work, we report on the implementation of a JSON-LD consumer ETL (Extract-Transform-Load) pipeline that enables data dumps to be ingested into knowledge graphs (KG). The pipeline loads scraped JSON-LD from the three sources, converts it to RDF, applies SPARQL CONSTRUCT queries to map the source RDF to a unified Bioschemas-based model, and stores the resulting KG as a Turtle file. This work was conducted during the one-week BioHackathon Europe 2022 in Paris, France, under Project 23, “Publishing and Consuming Schema.org DataFeeds.”},
    title={An ETL pipeline to construct the Intrinsically Disordered Proteins Knowledge Graph (IDP-KG) using Bioschemas JSON-LD data dumps},
    url={https://biohackrxiv.org/7f95d/},
    DOI={10.37044/osf.io/7f95d},
    publisher={BioHackrXiv},
    author={Ammar Ammar and Ivan Mičetić and Alasdair J. G. Gray},
    year={2022},
    month=nov
    }
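
    The abstract above describes a three-step pipeline: load scraped JSON-LD, map it to a unified model with SPARQL CONSTRUCT queries, and store the result as Turtle. Below is a minimal sketch of those steps, assuming Python with rdflib; the file name and the identity-style mapping query are illustrative placeholders (the real pipeline used source-specific mappings for the three IDP sources).

    # Extract: parse a scraped Bioschemas JSON-LD file into an RDF graph.
    from rdflib import Graph

    source = Graph()
    source.parse("scraped.jsonld", format="json-ld")

    # Transform: a SPARQL CONSTRUCT mapping to the unified Bioschemas-based model.
    MAPPING = """
    PREFIX schema: <https://schema.org/>
    CONSTRUCT { ?protein a schema:Protein ; schema:name ?name . }
    WHERE     { ?protein a schema:Protein ; schema:name ?name . }
    """
    unified = source.query(MAPPING).graph

    # Load: serialise the resulting knowledge graph as a Turtle file.
    unified.serialize(destination="idp-kg.ttl", format="turtle")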

  • Alasdair J. G. Gray, Petros Papadopoulos, Alban Gaignard, Thomas Rosnet, and Ivan Mičetić. Bioschemas data harvesting project report. Technical Report, 2022. doi:10.37044/osf.io/y6gbq
    [BibTeX] [Abstract] [Download PDF]

    The promise of Bioschemas is that it makes consuming data from multiple resources more straightforward. However, this hypothesis has not been tested by conducting a large scale harvest of deployed markup and making this available for others to reuse. Therefore, the goal of this hackathon project is to harvest a collection of Bioschemas markup from a number of different sites listed on the Bioschemas live deploys page using the Bioschemas Markup Scraper and Extractor (BMUSE). The harvested data will be made available for others and loaded into a triplestore to allow for further exploration.

    @techReport{gray:bioschemas-harvesting:2022,
    abstract={The promise of Bioschemas is that it makes consuming data from multiple resources more straightforward. However, this hypothesis has not been tested by conducting a large scale harvest of deployed markup and making this available for others to reuse. Therefore, the goal of this hackathon project is to harvest a collection of Bioschemas markup from a number of different sites listed on the Bioschemas live deploys page using the Bioschemas Markup Scraper and Extractor (BMUSE). The harvested data will be made available for others and loaded into a triplestore to allow for further exploration.},
    title={Bioschemas data harvesting project report},
    url={https://biohackrxiv.org/y6gbq/},
    DOI={10.37044/osf.io/y6gbq},
    publisher={BioHackrXiv},
    author={Gray, Alasdair J G and Papadopoulos, Petros and Gaignard, Alban and Rosnet, Thomas and Mičetić, Ivan},
    year={2022},
    month=mar
    }
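
    The project's final step, loading harvested data into a triplestore, can be done with the standard SPARQL 1.1 Graph Store Protocol. A minimal sketch, assuming a hypothetical local Fuseki endpoint and a directory of harvested Turtle files; the endpoint URL, directory name, and graph URIs are illustrative assumptions.

    import pathlib
    import requests

    ENDPOINT = "http://localhost:3030/bioschemas/data"  # hypothetical Fuseki GSP endpoint

    for path in pathlib.Path("harvest").glob("*.ttl"):
        with open(path, "rb") as f:
            resp = requests.post(
                ENDPOINT,
                params={"graph": f"http://example.org/graph/{path.stem}"},  # one named graph per file
                data=f,
                headers={"Content-Type": "text/turtle"},
            )
        resp.raise_for_status()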

2021

  • Alasdair J. G. Gray, Petros Papadopoulos, Ivan Mičetić, and András Hatos. Exploiting Bioschemas Markup to Populate IDPcentral. Technical Report, 2021. doi:10.37044/osf.io/v3jct
    [BibTeX] [Abstract] [Download PDF]

    One of the goals of the ELIXIR Intrinsically Disordered Protein (IDP) community is to create a registry called IDPcentral. The registry will aggregate data contained in the community’s specialist data sources such as DisProt, MobiDB, and the Protein Ensemble Database (PED) so that proteins that are known to be intrinsically disordered can be discovered, with summary details of the protein presented and the specialist source consulted for more detailed data. At the ELIXIR BioHackathon-Europe 2020, we aimed to investigate the feasibility of populating IDPcentral by harvesting the Bioschemas markup that has been deployed on the IDP community data sources. The benefit of using Bioschemas markup, which is embedded in the HTML web pages for each protein in the data source, is that a standard harvesting approach can be used for all data sources, rather than needing bespoke wrappers for each data source API. We expect to harvest the markup using the Bioschemas Markup Scraper and Extractor (BMUSE) tool that has been developed specifically for this purpose. The challenge, however, is that the sources contain overlapping information about proteins but use different identifiers for the proteins. After the data has been harvested, it will need to be processed so that information about a particular protein, which will come from multiple sources, is consolidated into a single concept for the protein, with links back to where each piece of data originated. As well as populating the IDPcentral registry, we plan to consolidate the markup into a knowledge graph that can be queried to gain further insight into the IDPs.

    @techReport{gray:bioschemas-idpcntral:2021,
    abstract={One of the goals of the ELIXIR Intrinsically Disordered Protein (IDP) community is to create a registry called IDPcentral. The registry will aggregate data contained in the community's specialist data sources such as DisProt, MobiDB, and the Protein Ensemble Database (PED) so that proteins that are known to be intrinsically disordered can be discovered, with summary details of the protein presented and the specialist source consulted for more detailed data.
    At the ELIXIR BioHackathon-Europe 2020, we aimed to investigate the feasibility of populating IDPcentral by harvesting the Bioschemas markup that has been deployed on the IDP community data sources. The benefit of using Bioschemas markup, which is embedded in the HTML web pages for each protein in the data source, is that a standard harvesting approach can be used for all data sources, rather than needing bespoke wrappers for each data source API. We expect to harvest the markup using the Bioschemas Markup Scraper and Extractor (BMUSE) tool that has been developed specifically for this purpose.
    The challenge, however, is that the sources contain overlapping information about proteins but use different identifiers for the proteins. After the data has been harvested, it will need to be processed so that information about a particular protein, which will come from multiple sources, is consolidated into a single concept for the protein, with links back to where each piece of data originated.
    As well as populating the IDPcentral registry, we plan to consolidate the markup into a knowledge graph that can be queried to gain further insight into the IDPs.},
    title={Exploiting Bioschemas Markup to Populate IDPcentral},
    url={https://biohackrxiv.org/v3jct},
    DOI={10.37044/osf.io/v3jct},
    publisher={BioHackrXiv},
    author={Gray, Alasdair J G and Papadopoulos, Petros and Mičetić, Ivan and Hatos, András},
    year={2021},
    month=jun
    }
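
    The consolidation challenge described above (the same protein appearing under different identifiers in different sources) reduces to grouping harvested records on a shared key. A minimal sketch, assuming records have already been flattened to Python dicts and that a UniProt accession is available as the shared key; the example records are hypothetical.

    from collections import defaultdict

    # Hypothetical harvested records; each source uses its own local identifier.
    records = [
        {"source": "DisProt", "id": "DP00086", "uniprot": "P04637", "name": "p53"},
        {"source": "MobiDB", "id": "P04637", "uniprot": "P04637",
         "name": "Cellular tumor antigen p53"},
    ]

    proteins = defaultdict(lambda: {"names": set(), "seen_in": []})
    for rec in records:
        entry = proteins[rec["uniprot"]]  # consolidate on the shared accession
        entry["names"].add(rec["name"])
        entry["seen_in"].append((rec["source"], rec["id"]))  # link back to the origin

    for accession, entry in proteins.items():
        print(accession, sorted(entry["names"]), entry["seen_in"])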

  • Jose E. Labra-Gayo, Alejandro G. Hevia, Daniel F. Álvarez, Ammar Ammar, Dan Brickley, Alasdair J. G. Gray, Eric Prud’hommeaux, Denise Slenter, Harold Solbrig, Seyed A. H. Beghaeiraveri, et al. Knowledge graphs and Wikidata subsetting. Technical Report, 2021. doi:10.37044/osf.io/wu9et
    [BibTeX] [Abstract] [Download PDF]

    Knowledge graphs have successfully been adopted by academia, government and industry to represent large scale knowledge bases. Open and collaborative knowledge graphs such as Wikidata capture knowledge from different domains and harmonize them under a common format, making it easier for researchers to access the data while also supporting Open Science. Wikidata keeps getting bigger and better, which subsumes integration use cases. Having a large amount of data such as the one presented in a scopeless Wikidata offers some advantages, e.g., a unique access point and common format, but also poses some challenges, e.g., performance. Regular Wikidata users are familiar with running into frequent timeouts of submitted queries. Due to its popularity, limits have been imposed to allow fair access for many. However, this suppresses many interesting and complex queries that require more computational power and resources. Replicating Wikidata on one’s own infrastructure can be a solution, which also offers a snapshot of the contents of Wikidata at some given point in time. There is no need to replicate Wikidata in full; it is possible to work with subsets targeting, for instance, a particular domain. Creating those subsets has emerged as an alternative to reduce the amount and spectrum of data offered by Wikidata. Less data makes more complex queries possible while still keeping compatibility with the whole of Wikidata, as the model is kept. In this paper we report the tasks done as part of a Wikidata subsetting project during the Virtual BioHackathon Europe 2020 and SWAT4(HC)LS 2021, which had already started at the NBDC/DBCLS BioHackathon 2019 in Japan, the SWAT4(HC)LS hackathon 2019, and the Virtual COVID-19 BioHackathon 2019. We describe some of the approaches we identified to create subsets, some subsets from the Life Sciences domain, and other use cases we also discussed.

    @techReport{labra-gayo:kg-subsetting:biohackrxiv2021,
    abstract={Knowledge graphs have successfully been adopted by academia, government and industry to represent large scale knowledge bases.
    Open and collaborative knowledge graphs such as Wikidata capture knowledge from different domains and harmonize them under a common format, making it easier for researchers to access the data while also supporting Open Science.
    Wikidata keeps getting bigger and better, which subsumes integration use cases. Having a large amount of data such as the one presented in a scopeless Wikidata offers some advantages, e.g., a unique access point and common format, but also poses some challenges, e.g., performance.
    Regular Wikidata users are familiar with running into frequent timeouts of submitted queries. Due to its popularity, limits have been imposed to allow fair access for many.
    However, this suppresses many interesting and complex queries that require more computational power and resources. Replicating Wikidata on one's own infrastructure can be a solution, which also offers a snapshot of the contents of Wikidata at some given point in time.
    There is no need to replicate Wikidata in full; it is possible to work with subsets targeting, for instance, a particular domain. Creating those subsets has emerged as an alternative to reduce the amount and spectrum of data offered by Wikidata. Less data makes more complex queries possible while still keeping compatibility with the whole of Wikidata, as the model is kept.
    In this paper we report the tasks done as part of a Wikidata subsetting project during the Virtual BioHackathon Europe 2020 and SWAT4(HC)LS 2021, which had already started at the NBDC/DBCLS BioHackathon 2019 in Japan, the SWAT4(HC)LS hackathon 2019, and the Virtual COVID-19 BioHackathon 2019. We describe some of the approaches we identified to create subsets, some subsets from the Life Sciences domain, and other use cases we also discussed.},
    title={Knowledge graphs and Wikidata subsetting},
    url={https://biohackrxiv.org/wu9et},
    DOI={10.37044/osf.io/wu9et},
    publisher={BioHackrXiv},
    author={Labra-Gayo, Jose E and Hevia, Alejandro G and Álvarez, Daniel F and Ammar, Ammar and Brickley, Dan and Gray, Alasdair J G and Prud'hommeaux, Eric and Slenter, Denise and Solbrig, Harold and Beghaeiraveri, Seyed A H and others},
    year={2021},
    month=apr
    }
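
    One of the subsetting approaches discussed in the report is to extract a domain-specific slice with a SPARQL CONSTRUCT query. Below is a minimal sketch against the public Wikidata Query Service; the query (proteins with UniProt IDs) is an illustrative example, and real subsets were far larger and typically built from dumps to avoid the very timeouts the abstract mentions.

    import requests

    QUERY = """
    CONSTRUCT { ?protein wdt:P352 ?uniprot . }
    WHERE {
      ?protein wdt:P31 wd:Q8054 ;   # instance of: protein
               wdt:P352 ?uniprot .  # UniProt protein ID
    }
    LIMIT 100
    """

    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY},
        headers={"Accept": "text/turtle", "User-Agent": "subset-sketch/0.1"},
    )
    resp.raise_for_status()
    with open("wikidata-protein-subset.ttl", "w", encoding="utf-8") as f:
        f.write(resp.text)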

  • Peter Sefton, Eoghan Ó Carragáin, Stian Soiland-Reyes, Oscar Corcho, Daniel Garijo, Raul Palma, Frederik Coppens, Carole Goble, José María Fernández, Kyle Chard, Jose Manuel Gomez-Perez, Michael R. Crusoe, Ignacio Eguinoa, Nick Juty, Kristi Holmes, Jason A. Clark, Salvador Capella-Gutierrez, Alasdair J. G. Gray, Stuart Owen, Alan R. Williams, Giacomo Tartari, Finn Bacall, Thomas Thelen, Hervé Ménager, Laura Rodríguez-Navas, Paul Walk, brandon whitehead, Mark Wilkinson, Paul Groth, Erich Bremer, LJ Garcia Castro, Karl Sebby, Alexander Kanitz, Ana Trisovic, Gavin Kennedy, Mark Graves, Jasper Koehorst, Simone Leo, and Marc Portier. RO-Crate Metadata Specification 1.1.1. Technical Report, United Kingdom, 2021. Recommendation published by researchobject.org – see https://w3id.org/ro/crate/1.1 for web version. doi:10.5281/zenodo.4541002
    [BibTeX] [Abstract]

    This document specifies a method, known as RO-Crate (Research Object Crate), of aggregating and describing research data with associated metadata. RO-Crates can aggregate and describe any resource including files, URI-addressable resources, or use other addressing schemes to locate digital or physical data. RO-Crates can describe data in aggregate and at the individual resource level, with metadata to aid in discovery, re-use and long term management of data. Metadata includes the ability to describe the context of data and the entities involved in its production, use and reuse. For example: who created it, using which equipment, software and workflows, under what licenses can it be re-used, where was it collected, and/or what is it about. RO-Crate uses JSON-LD to express this metadata using linked data, describing data resources as well as contextual entities such as people, organizations, software and equipment as a series of linked JSON-LD objects – using common published vocabularies, chiefly schema.org. The core of RO-Crate is a JSON-LD file, the RO-Crate Metadata File, named ro-crate-metadata.json. This file contains structured metadata about the dataset as a whole (the Root Data Entity) and, optionally, about some or all of its files. This provides a simple way to, for example, assert the authors (e.g. people, organizations) of the RO-Crate or one of its files, or to capture more complex provenance for files, such as how they were created using software and equipment. While providing the formal specification for RO-Crate, this document also aims to be a practical guide for software authors to create tools for generating and consuming research data packages, with explanation by examples.

    @techreport{RO-Crate-1-1,
    title = "RO-Crate Metadata Specification 1.1.1",
    abstract = "This document specifies a method, known as RO-Crate (Research Object Crate), of aggregating and describing research data with associated metadata. RO-Crates can aggregate and describe any resource including files, URI-addressable resources, or use other addressing schemes to locate digital or physical data. RO-Crates can describe data in aggregate and at the individual resource level, with metadata to aid in discovery, re-use and long term management of data. Metadata includes the ability to describe the context of data and the entities involved in its production, use and reuse. For example: who created it, using which equipment, software and workflows, under what licenses can it be re-used, where was it collected, and/or where is it about.RO-Crate uses JSON-LD to to express this metadata using linked data, describing data resources as well as contextual entities such as people, organizations, software and equipment as a series of linked JSON-LD objects - using common published vocabularies, chiefly schema.org.The core of RO-Crate is a JSON-LD file, the RO-Crate Metadata File, named ro-crate-metadata.json. This file contains structured metadata about the dataset as a whole (the Root Data Entity) and, optionally, about some or all of its files. This provides a simple way to, for example, assert the authors (e.g. people, organizations) of the RO-Crate or one its files, or to capture more complex provenance for files, such as how they were created using software and equipment.While providing the formal specification for RO-Crate, this document also aims to be a practical guide for software authors to create tools for generating and consuming research data packages, with explanation by examples.",
    author = "Peter Sefton and Carrag{\'a}in, {Eoghan {\'O}} and Stian Soiland-Reyes and Oscar Corcho and Daniel Garijo and Raul Palma and Frederik Coppens and Carole Goble and Fern{\'a}ndez, {Jos{\'e} Mar{\'i}a} and Kyle Chard and Gomez-Perez, {Jose Manuel} and Crusoe, {Michael R} and Ignacio Eguinoa and Nick Juty and Kristi Holmes and Clark, {Jason A.} and Salvador Capella-Gutierrez and Gray, {Alasdair J. G.} and Stuart Owen and Williams, {Alan R.} and Giacomo Tartari and Finn Bacall and Thomas Thelen and Herv{\'e} M{\'e}nager and Laura Rodr{\'i}guez-Navas and Paul Walk and brandon whitehead and Mark Wilkinson and Paul Groth and Erich Bremer and Castro, {LJ Garcia} and Karl Sebby and Alexander Kanitz and Ana Trisovic and Gavin Kennedy and Mark Graves and Jasper Koehorst and Simone Leo and Marc Portier",
    note = "Recommendation published by researchobject.org - see https://w3id.org/ro/crate/1.1 for web version.",
    year = "2021",
    month = feb,
    doi = "10.5281/zenodo.4541002",
    publisher = "researchobject.org",
    address = "United Kingdom",
    }
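
    The core structure the specification defines, a JSON-LD Metadata File named ro-crate-metadata.json containing a metadata descriptor and a Root Data Entity, is small enough to sketch directly. A minimal example conforming to the 1.1 layout; the crate name and data file are illustrative.

    import json

    crate = {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {   # the metadata file descriptor, pointing at the root entity
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {   # the Root Data Entity describing the dataset as a whole
                "@id": "./",
                "@type": "Dataset",
                "name": "Example crate",
                "hasPart": [{"@id": "data.csv"}],
            },
            {"@id": "data.csv", "@type": "File", "name": "Example data file"},
        ],
    }

    with open("ro-crate-metadata.json", "w", encoding="utf-8") as f:
        json.dump(crate, f, indent=2)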

2015

  • A. J. G. Gray, Joachim Baran, M. Scott Marshall, and Michel Dumontier (Eds). Dataset Descriptions: HCLS Community Profile. W3C Interest Group Note, 2015.
    [BibTeX] [Download PDF]
    @techreport{Gray2015HCLS,
    author = {A. J. G. Gray and Baran, Joachim and Marshall, M Scott and {Dumontier (Eds)}, Michel},
    month = may,
    publisher = {World Wide Web Consortium},
    type = {{W3C Interest Group Note}},
    title = {{Dataset Descriptions: {HCLS} Community Profile}},
    year = {2015},
    url = {https://www.w3.org/TR/hcls-dataset/}
    }

2013

  • A. J. G. Gray, C. Chichester, K. Burger, S. Kotoulas, A. Loizou, V. Tkachenko, A. Waagmeester, S. Askjaer, S. Pettifer, L. Harland, C. Haupt, C. Batchelor, M. Vazquez, J. María Fernández, J. Saito, A. Gibson, and L. Wich. Guidelines for Nanopublications. Working Draft 1.8-20130102, Concept Web Alliance, 2013.
    [BibTeX] [Download PDF]
    @TechReport{nanopubs,
    author = {A.J.G. Gray and C. Chichester and K. Burger and S. Kotoulas and A. Loizou and V. Tkachenko and A. Waagmeester and S. Askjaer and S. Pettifer and L. Harland and C. Haupt and C. Batchelor and M. Vazquez and J. Mar\'ia Fern\'andez and J. Saito and A. Gibson and L. Wich},
    title = {Guidelines for Nanopublications},
    institution = {Concept Web Alliance},
    year = {2013},
    type = {Working Draft},
    number = {1.8-20130102},
    month = jan,
    url = {http://nanopub.org/guidelines/working_draft/}
    }
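
    The guidelines define a nanopublication as a head graph that names three further named graphs: an assertion, its provenance, and publication information. A minimal sketch of that four-graph shape, assuming Python with rdflib; all URIs and the example assertion are illustrative placeholders.

    from rdflib import Dataset, Literal, Namespace, URIRef
    from rdflib.namespace import DCTERMS, PROV, RDF, XSD

    NP = Namespace("http://www.nanopub.org/nschema#")
    EX = Namespace("http://example.org/np1#")

    ds = Dataset()
    head = ds.graph(EX.Head)
    assertion = ds.graph(EX.assertion)
    provenance = ds.graph(EX.provenance)
    pubinfo = ds.graph(EX.pubinfo)

    # Head graph: declares the nanopublication and names its three parts.
    head.add((EX[""], RDF.type, NP.Nanopublication))
    head.add((EX[""], NP.hasAssertion, EX.assertion))
    head.add((EX[""], NP.hasProvenance, EX.provenance))
    head.add((EX[""], NP.hasPublicationInfo, EX.pubinfo))

    # The assertion itself, with minimal provenance and publication info.
    assertion.add((EX.trastuzumab, EX.isIndicatedFor, EX.breastCancer))
    provenance.add((EX.assertion, PROV.wasDerivedFrom, URIRef("http://example.org/study/42")))
    pubinfo.add((EX[""], DCTERMS.created, Literal("2013-01-02", datatype=XSD.date)))

    print(ds.serialize(format="trig"))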

2012

  • A. J. G. Gray (Ed). Dataset descriptions for the Open Pharmacological Space. Working Draft, Open PHACTS, 2012.
    [BibTeX] [Download PDF]
    @TechReport{OPS-datadesc,
    author = {A.J.G. {Gray (Ed)}},
    title = {Dataset descriptions for the Open Pharmacological Space},
    institution = {Open PHACTS},
    year = {2012},
    type = {Working Draft},
    month = oct,
    url = {http://www.openphacts.org/specs/datadesc/}
    }

  • C. Y. A. Brenninkmeijer, C. Goble, A. J. G. Gray, P. Groth, A. Loizou, and S. Pettifer. Query Strategies to Support Context-Specific Views Through Stand-off Data Mappings. Technical Report, University of Manchester, 2012. (Alphabetic authorship)
    [BibTeX]
    @TechReport{query-expansion,
    author = {C.Y.A. Brenninkmeijer and C. Goble and A.J.G. Gray and P. Groth and A. Loizou and S. Pettifer},
    title = {Query Strategies to Support Context-Specific Views Through Stand-off Data Mappings},
    institution = {University of Manchester},
    year = {2012},
    note = {(Alphabetic authorship)}
    }

2009

  • A. J. G. Gray, N. Gray, F. V. Hessman, and A. Preite Martinez (Eds). Vocabularies in the Virtual Observatory. Recommendation v1.19, IVOA, 2009. http://www.ivoa.net/Documents/latest/Vocabularies.html
    [BibTeX] [Download PDF]
    @techreport{gray-etal:vocab-VO:2009,
    Author = {A.J.G. Gray and N. Gray and F.V. Hessman and A. {Preite Martinez (Eds)}},
    Institution = {IVOA},
    Note = {\url{http://www.ivoa.net/Documents/latest/Vocabularies.html}},
    Number = {v1.19},
    Title = {Vocabularies in the Virtual Observatory},
    Type = {Recommendation},
    Year = {2009},
    url = {http://www.ivoa.net/Documents/latest/Vocabularies.html}
    }