SWAT4(HC)LS 2019: Morning Presentations, Day 2

All talk and poster papers are available for download at http://purl.org/ronco/swat. The organizers would like to have everyone’s suggestions for future conferences, and also for topics and tasks for the hackathon tomorrow.

Data Ownership and protection within the FAIR Framework

Keynote by Dov Greenbaum

This is a talk about the legal issues that come with FAIR. The Zvi Meitar Institute focuses on examining the ethical, legal and social implications of new and emerging technologies. Examples include fake news and space law.

Big data is characterized by velocity, veracity, variety and volume. And a reminder that if something is free, you’re not the customer, you’re the product. The Hague Declaration provides some nice numbers about academic data. By 2020, the expectation is that there will be 44 zettabytes of data, with 90% having been generated in the last 2 years. There were 2.4 million scholarly articles published in 2013, or one every 13 seconds. It’s estimated that big data is worth more than $300 billion per year to companies. UNESCO (specifically, the International Bioethics Committee or IBC) is worried about how data is used. Health data comes from many sources: home monitoring, EHR, lab records, etc.

The IBC has a report from 2017 that asked: Do you own your own data? Is it yours to give? What about ownership and possession?

Now think about this in the context of FAIR, particularly the accessible and reusable parts, which leads to a discussion of licensing. Open source licensing has one problem: in order to license data, you must own it legally. The way we own non-tangible items is mostly through intellectual property law. IP includes patents, copyright, trademark, trade secret, and sui generis (a new form of IP law). In 1916, a UK ruling stated that “the rough practical test that what is worth copying is prima facie worth protecting”. IP rights do not grant you the right to do anything; they just allow you to grant permission – a negative right rather than a positive right.

Is big data patentable? Then you can license it… Otherwise we need to find another way (see IP methods above). To be patentable it must be patentable subject matter, have utility (some kind of use, and can’t be “evil”, or a tax cheat – utility in the US is a very weak definition – allows virtually anything to have utility), and be novel and non-obvious. But Big Data doesn’t fit into the patent definition of “a useful process, machine, manufacture, or composition of matter”. This means also that wild plants, DNA, algorithms and laws of nature are unpatentable. In short, Big Data is not patentable subject matter.

The next form of IP law is copyright. This covers original works of authorship fixed in any tangible medium of expression – this includes things like databases. Copyright is automatic and free (unless you want to register it), unlike patents. If you are using AI to create some of your data, you probably don’t have authorship. What is copyrightable? Literary works, musical works, dramatic works, choreographed works, pictorial, graphic, sculptural and sound works, and architectural works. Ideas, laws, facts and discoveries are all not copyrightable. Data is facts, and therefore not copyrightable. Copyright protects your expression of an idea, not the idea itself. You cannot copyright the building blocks of knowledge – this is to promote innovation. There was a Supreme Court case in 1991 that centered around the Yellow Pages – the Supreme Court decided that there would be no more protection of databases of facts. However, they said that you could protect databases via a contract, as long as the other side agreed to it. For some software, the contract is implicit: it takes effect as soon as you open the CD, or accept a EULA online. The AWS contract mentions a zombie apocalypse 🙂 (clause 47 I think I heard?!) If you have a factual database, how can you prove that someone has copied your data? The easiest way is to salt – dictionaries have fake words, Google Maps has fake cities, phone books have fake people etc. In short, copyright is no good.

Next form of IP law is trade secrets. This is fine, but you want to be FAIR and trade secrets are the opposite of FAIR. So, if you don’t own big data under IP, how do you license it?

WTO says that you should be able to protect computer programs and compilations of code, e.g. the database itself if not the data. Most countries have not incorporated this yet. US has the digital millennium copyright act, but it’s a weird kind of IP right – you can claim ownership of your databases only if you protect it with a digital rights management tool. Then when someone has hacked into you, they have infringed on your rights (kind of backwards really). This is most prominent in DVDs and violating the region protection. The EU database directive contains database protection (1996). They were thinking mostly of betting databases! This is sui generis – a new type of IP protection for databases. However, it really only protects the structure of the database and not the content, and you have to have made a substantial investment and effort. In 2018 an analysis was performed, and academics were very unhappy with the directive as they weren’t sure what classified as violations of the law, and 2/3 thought it had a negative effect – the sui generis right was considered to have a negative effect on legal certainty.

How do you protect data right now? You contract it or you use antitrust law. Ownership grants you benefits and control over that information, but possession does not mean ownership. Possessing data doesn’t mean you own the data. The Hague Declaration is something you can “sign” if you agree, and tries to find solutions to this.

So, what we need is a complete change. We need to think about multiple owners: the patient, the doctor, the informatician. Perhaps we can’t find an ownership right in data – so perhaps we should look at it as having custody rather than ownership of the data. Custody brings liabilities (e.g. GDPR) and responsibilities. Custody also means good stewardship, privacy, security, attribution, FAIR principles. Individuals could transfer some of their rights via informed consent. But this isn’t a full solution, and this is still an ongoing problem.

Q: A lot of life science databases use CC BY, CC BY-SA – is this correct?
A: This is not legal advice. In Europe, via the database directive, databases are protected, so if you take the entire database you are infringing. If you extract data without maintaining the structure of the database, you are not infringing. This is one reason why academics are unsatisfied with the EU database directive.

Q: It’s a creative act to map data to e.g. ontologies. Is this enough for it to count as an act of creation, and so make it protectable?
A: Modified data that is still factual (e.g. cleaning data, adding ontology terms) does not change the fact that it’s factual – and therefore cannot be protected. Might still fall under the database directive, as described above.

Q: Patients possess (not own) their data. In MS research, patient groups take custody of that data. Is this style the future direction?
A: It makes sense to get informed consent from a group – the group could use a trade secret type that requires an NDA to use, for example.

Documentation Gap in Ontology Creation: Insights into the Reality of Knowledge Formalization in a Life Science Company

Marius Michaelis (speaker) and Olga Streibel, Bayer

For them, ontology creation begins with specification / competency questions, then conceptualization of the ontology via use cases, implementation into a formal knowledge model (ontology), then ontology testing. He will concentrate on the conceptualization phase.

Within the conceptualization phase, you have knowledge acquisition by knowledge engineers who, on the one hand, research explicit knowledge, and on the other hand also elicit tacit knowledge. This process of eliciting knowledge takes a lot of effort and is time intensive. How can you document this stage? Such documentation would encourage collaboration, prevent knowledge loss, and encourage clarity.

Within Bayer, they ran a survey of 13 knowledge engineers and 16 domain experts. He will discuss both the timepoint and nature of how knowledge engineers document. Most start documenting while or after the model is being created, which means during implementation rather than during conceptualization. The respondents also had a roughly equal mixture of structured and unstructured methods (don’t necessarily follow guidelines). But what we want is joint, timely, structured documentation.

This work comes from his bachelor’s thesis (University of Applied Sciences Potsdam), “Documentation concept for the exchange of knowledge in the process of creating ontological knowledge models.”

Partitioning of BioPortal Ontologies: An Empirical Study

Alsayed Algergawy and Birgitta König-Ries

BioPortal, AgroPortal, OntoBee, and EcoPortal (Allyson: EcoPortal apparently last updated 2 years ago?) all exist to store publicly-available ontologies.

Most existing studies of BioPortal ontologies focus on ontology reuse and ontology evaluation (the quality of the ontology). A few also consider the partitionability/modularization of ontologies (e.g. ISWC 2011). They also looked at the partitioning of BioPortal records.

Overall workflow: 1. get ontologies 2. transform into owl/obo 3. partition 4. analyse. Some existing partitioning tools include OAPT and PATO(?, not *that* PATO I suppose). They developed OAPT, which has these steps for partitioning: 1. ranking ontology concepts 2. determine cluster heads 3. partition 4. generate modules. In addition, in Manchester they developed a tool called AD (Atomic decomposition). So, they applied AD, OAPT and PATO to BioPortal’s 792 ontologies. For details, see https://github.com/fusion-jena/BioPortal-Ontology-Partitioning.
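
The four partitioning steps listed above can be illustrated with a toy sketch. This is not OAPT’s actual algorithm: the ranking criterion (number of direct neighbours), the nearest-head assignment, the function name `partition_ontology` and the tiny hand-made hierarchy are all assumptions made purely for illustration.

```python
from collections import deque


def partition_ontology(edges, k):
    """Toy partitioning of a concept graph into at most k modules.

    edges: iterable of (child, parent) concept pairs
    k: requested number of modules (one per cluster head)
    """
    # Build an undirected adjacency map from the subclass edges.
    neighbours = {}
    for child, parent in edges:
        neighbours.setdefault(child, set()).add(parent)
        neighbours.setdefault(parent, set()).add(child)

    # Step 1: rank concepts (assumption: by degree, ties broken by name).
    ranked = sorted(neighbours, key=lambda c: (-len(neighbours[c]), c))

    # Step 2: the k top-ranked concepts become cluster heads.
    heads = ranked[:k]

    # Step 3: partition by assigning each concept to the head that
    # reaches it first in a breadth-first search started from all heads.
    owner = {head: head for head in heads}
    queue = deque(heads)
    while queue:
        node = queue.popleft()
        for nxt in sorted(neighbours[node]):
            if nxt not in owner:
                owner[nxt] = owner[node]
                queue.append(nxt)

    # Step 4: generate one module (set of concepts) per cluster head.
    modules = {head: set() for head in heads}
    for concept, head in owner.items():
        modules[head].add(concept)
    return modules


# Hypothetical mini-hierarchy, not taken from any BioPortal ontology.
edges = [
    ("cell", "anatomical_entity"),
    ("neuron", "cell"),
    ("glia", "cell"),
    ("organ", "anatomical_entity"),
    ("heart", "organ"),
    ("lung", "organ"),
    ("liver", "organ"),
]
modules = partition_ontology(edges, k=2)
```

In this sketch, concepts that no cluster head can reach are simply left out of every module; a real tool would need a policy for them.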

They discussed how many modules were created within each ontology – you can specify in OAPT the optimal number of partitions. There were three 0-module ontologies (SEQ, SMASH, ENM). Both 0-module and 1-module ontologies don’t seem to be fully developed. 100 ontologies generated 2 modules. Over half of the accessible ontologies (347) can be partitioned into only 5 modules. Most of the ontologies which failed to partition seemed to do so because of some particular characteristics of the ontologies rather than the tools themselves.

Annotation of existing databases using Semantic Web technologies: making data more FAIR

Johan van Soest (speaker), Ananya Choudhury, Nikhil Gaikwad, Matthijs Sloep, Michel Dumontier and Andre Dekker

They have hospital data that remains in the hospital together with the analysis, and then they submit results to a central location. This is good for patient privacy but not for researchers wishing to access the data. Therefore it relies on good semantic tagging.

Not all hospitals are academic hospitals, and therefore might not have the resources to add the data system to it. So they provided the hospitals with a tool that separates structure from the terminology conversion – this allows the IT person to do the conversion and the domain expert to do the terminology mapping (R2RML). This works but is high maintenance, so they’ve changed the approach. Instead, keep the original data structure and annotate the existing schema.

Their use case was a data set of 4000 rectal cancer patients, and they used Ontotext GraphDB 8.4.1. They had an ontology with 9 equivalent classes and 13 value mappings (e.g. “m” means “male”). There are two parent classes for each – the ontology class and the SQL table column.
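
A minimal sketch of the value-mapping idea, assuming a hypothetical local column `patient.gender` and placeholder concept IRIs under `http://example.org/`; none of these names come from the talk. The point it tries to show is that the local rows stay untouched while the annotation is kept alongside them.

```python
# Hypothetical value mappings: (table column, local value) -> concept IRI.
# The IRIs are placeholders, not terms from a real ontology.
VALUE_MAPPINGS = {
    ("patient.gender", "m"): "http://example.org/onto#Male",
    ("patient.gender", "f"): "http://example.org/onto#Female",
}


def annotate(column, values):
    """Pair each local value with its mapped concept IRI (None if unmapped).

    The input list is not modified, mirroring the idea of annotating
    the existing schema instead of converting the data.
    """
    return [(value, VALUE_MAPPINGS.get((column, value))) for value in values]


local_rows = ["m", "f", "m", "x"]
annotated = annotate("patient.gender", local_rows)
```

Unmapped values such as `"x"` come back paired with `None`, which makes gaps in the terminology mapping easy to spot.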

They are only annotating existing data sources – although they are using non-standard (local) schemas, it does mean there would be no data loss upon any conversion and also they don’t have to make big changes to local systems. Keeping local systems also means that there is a smaller learning curve for the local IT team. They would like to develop a UI that would hide the “ontology-ness” of their work from their users.

FAIRness of openEHR Archetypes and Templates

Caroline Bönisch, Anneka Sargeant, Antje Wulff, Marcel Parciak, Christian R Bauer and Ulrich Sax

HiGHmed was begun with the aim to improve medical research and patient care, and to make data from research and patient care accessible and exchangeable. The project has a number of locations across Germany. Part of this involves the development of an open interoperable and research-compatible eHealth platform to support local and cross-institution patient care and research.

openEHR is a virtual community for transmitting physical health data in electronic form. Archetypes are aggregated into Templates, which then are published and versioned via the CKM (Clinical Knowledge Manager). They have assessed their archetypes and templates in the context of the FAIR principles, and found that they were compliant with 13/15 of the principles.

A Working Model for Data Integration of Occupation, Function and Health

Anil Adisesh (speaker), Mohammad Sadnan Al Manir, Hongchang Bao and Christopher Baker

The Occupational Health Paradigm is a triangle of work-illness-wellness. In Canada they have a NOC Career Handbook with 16 attributes that help define the requirements of various careers. This can be helpful when someone with an injury wishes to change to a different job, but one that is similar enough to their previous job that there isn’t a lot of retraining. A semantic model is populated with coded patient data representing disease (ICD-11), functional impairment (ICF), occupation (NOC), and job attributes (NOC Career Handbook). The NOC coding for the data was done manually initially, and then they developed a coding algorithm to assign the occupations automatically. The algorithm starts with a textual job title and then progresses through a number of steps to get a NOC. They use SPARQL queries and semantic mapping to suggest job transitions to accommodate a functional impairment.

They did some test queries with their model to see if they could classify people in jobs according to their attributes, e.g. if a person has disease J62_8 then what jobs could they do? What job would a patient with a visual impairment be likely to return to (previous job + impairment = new job options)?

This work could be applicable in finding work for immigrants and newcomers, finding comparable work for people with an acquired disability, or people with accidental injuries that could otherwise end up on long-term disability. The model seems fit for purpose to integrate info about occupational functioning and health.

FAIR quantitative imaging in oncology: how Semantic Web and Ontologies will support reproducible science

Alberto Traverso (speaker), Zhenwei Shi, Leonard Wee and Andre Dekker

Medical images are more than pictures, they are big data. Indeed they are the unexplored big data, as many images are stored and not re-used, as well as having more information than is used in the first place. There were 153 exabytes in 2013, and 2,314 exabytes are expected in 2020. Within medical imaging, the combination of the image and “AI” results in what is called quantitative imaging. Machine learning is used to create prediction models (e.g. the probability of developing a tumor, or a second tumor).

There is currently no widespread application of “radiomics” technology in the clinic. The data we produce grows much faster than what they can currently do with the models. Radiomics studies lack reproducibility of results. The challenges and some solutions are: models work but only on their own data (fix with multi-centre validation); a lack of data-driven evidence and external validation (fix with TRIPOD-IV models); lack of reproducibility (fix with sharing of metadata in a privacy-preserving way); explosion of computational packages; how can models be shared when data isn’t; poor quality of reporting; lack of standards; a need for domain knowledge (these last three can be fixed by standardized reporting guidelines, increased agreement, and data-driven meta analyses). FAIR quantitative imaging = AI + medical imaging + open science.

“Ontologies function like a brain: they work and reason with concepts and relationships in ways that are close to the way humans perceive interlinked concepts.”

Radiomics Ontology (RO) – https://github.com/albytrav/RadiomicsOntologyIBSI/wiki/1.-Home 

Appropriate questions for this system: What is the probability of having esophagitis after irradiation in certain patients that received a dose of…? What is the value of Feature X for rectal cancer patients with a poor prognosis computed on an ADC scan?

A FAIR Vocabulary Odyssey

Keynote by Helen Parkinson

The Odyssey has a non-linear plot, and Helen is using the monsters and challenges to hang her topics off of.

Helen asked the question in August: if there are no FAIR-compliant vocabularies, how can you be FAIR? If there aren’t any, then the FAIR indicator cannot be satisfied. Therefore you have a recursive, circular issue 🙂

Why do we need to be FAIR? What tools do we need to be FAIR? How do we know we are FAIR? I2 of FAIR is to use vocabularies that adhere to FAIR principles. EBI is at 273 PB of storage, with 64 million daily web requests in 2018. As with metadata in many projects, the metadata is often troublesome. They would like to build FAIR capable resources (but we’re not quite sure what FAIR capable is yet); acquire, integrate, analyse and present FAIR archival data and knowledgebases (and all the resources are very different – context is important); determine the FAIR context for our resources and data; define what it means to be a FAIR vocabulary.

Within the SPOT group, they develop a number of ontology applications, e.g. ones that make use of the data use ontology. For Helen, there are temptations associated with formalisms – the more complex the ontology, the more time/money it will cost, but you will get stronger semantics. http://FAIRassist.org provides a list of current FAIR tools.

Which features of a FAIR vocabulary are already defined in the OBO Foundry? Many are already aligned, but some parts of the Foundry are deliberately not aligned, including: open, defined scope, commitment to collaboration, reused relations from RO, and naming conventions. These are parts of the Foundry that are not, and probably should not be, required features of a FAIR vocabulary. So then they mapped the aligned part of the Foundry to the FAIR principles.

When talking about “deafness”, we need to consider the assay, the disease and the phenotype – and they all need to be connected – making all this interoperate is important. To help, they developed OXO which provides cross-references between ontologies, but it doesn’t tell you anything about the semantics.

FAIR Vocabulary Features – required (from Helen’s POV)

  • Vocabulary terms are assigned globally unique and persistent identifiers plus provenance and versioning information
  • Vocabularies and their terms are registered or indexed in a searchable resource
  • Vocabularies and their terms are retrievable using a standardised communications protocol
  • the protocol is open, free, and universally implementable
  • Vocabularies and their terms are persistent over time and appropriately versioned
  • Vocabularies and their terms use a formal accessible and broadly applicable language for knowledge representation
  • Vocabularies and their terms use qualified references to other vocabs
  • released with a clear and accessible data usage licence
  • include terms from other vocabs – when this happens, import standards should be used.
  • meet domain relevant community standards.

Why should vocabs be FAIR? Ontologies are data too, and data should be FAIR. How do we know we are FAIR? When the vocab has use, reuse, implementation, and convergence.

Where is FAIR in the Gartner Research’s Hype Cycle? 🙂

Q: should FAIRness be transitive? Should FAIR vocabs only import FAIR vocabs?
A: I would like it to be, but it probably can’t always be.

The OHDSI community has to deal with this issue of transitive FAIRness already. Sometimes they import only certain versions of the ontology. They don’t think it’s possible/practical to move the entire healthcare industry to FAIR standards.

Q: What would be a killer application for speeding up and making data FAIR?
A: Proper public/private collaborative partnerships. Getting large organizations on board, with at a minimum a few exemplar organizations. One field might be antimicrobials and another would be neurodegenerative diseases, as they are biologically difficult and the traditional routes of pharma haven’t worked as well as hoped.

Afternoon Panel

This panel is about the conference itself and our thoughts on it. During the panel session, topics included:

  • How can we improve gender diversity in the organizing group? From this, corollaries came up such as making the conference more relevant to other parts of the world, e.g. Asia. Equally, this workshop has been organized on a shoestring budget and via people volunteering their time. The question is – what do we want from the conference in the future?
  • Do you see food safety as a potential topic for SWAT4HCLS? Yes, but we need to consider what our scope is, and how adding something like “food safety” would impact upon the number of papers submitted. e.g. there has been a big agrisemantics group last year and this, yet the conference name wasn’t changed.
  • Tutorial logistics should be improved for next year. The community haven’t submitted many tutorials this year. Should we keep them?
  • How do we leverage our knowledge to help the community at large? Should we reach out and bring people in via training?
  • The conference has been going on for over a decade; might some kind of retrospective be appropriate? Perhaps combine it with a change in direction, said one panel member. Perhaps someone could present something lighthearted next year that covers the changes?
  • Should we have a student session? Well, we already have some of them, even presenting already at the conference. We should work more to get students from the local university to participate, as they haven’t really taken up the offer in the past.
  • Should we remove the HC, because we continue to expand into other areas and can’t keep adding letters to the name! If so, what should the name be? Probably shouldn’t change it.
  • Where is it going to be next year? They can’t tell you yet.
  • Should we invite a big hitter from a neighbouring area to pull in more communities? Should we expand to include end users?

Please also see the full programme. Please note that this post is merely my notes on the presentations. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any issues you may spot – just let me know!

SWAT4(HC)LS 2019: Afternoon Presentations and Panel, Day 1

Semantics for FAIR research data management

Keynote by Birgitta König-Ries, introduced by Scott Marshall

Resources covered in this talk that are in FAIRsharing: GBIF, Dryad, PANGAEA, Zenodo, Figshare, PROV-O.

FAIR doesn’t cover everything, for instance data quality. The Biodiversity Exploratories project has been running for 13 years and, as there is turnover (as with any project), you cannot guarantee that the people who know about your data, or where it is, will still be present. There are 3 exploratories covering grassland and forest, and the project wishes to discover what drives biodiversity. To do this, they need to be able to share and integrate data from different regions and different groups.

They state that the FAIR requirements can be a little vague, but as far as they can tell they are findable, but the interoperability and reusability is low – they need some semantics work / standards. They have submitted to PLoS One “Do metadata in data repositories reflect scholarly information needs?” (accepted with minor revisions). They made a survey for biodiversity researchers – they are mainly looking for information on environment, organism, quality, material and substances, process and location. They were not interested in person or organization providing the data, or the time in which the data was gathered.

Some data sources include GBIF, Dryad, PANGAEA, Zenodo and Figshare. What semantic building blocks do they need? Tools for creating domain models, a data management system that supports semantics, help in providing semantic data (annotations, provenance management), making use of semantics for finding data, and then ultimately put it all together.

They used a process of ontology modularization, selection and customization. You need to align different ontologies in this step, and in aid of this she would like to point people to the Biodiversity Track in the OAEI (Ontology Alignment Evaluation Initiative). She described Subjective Logic, providing values for belief, disbelief, uncertainty, atomicity. They applied this to ontology merging to create scores for how trustworthy an information source is (discovering which ontology is causing merge inconsistencies).
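
As a worked illustration of the four values mentioned, here is a minimal sketch of a binomial opinion as it is usually defined in subjective logic: belief, disbelief and uncertainty sum to 1, and atomicity (the base rate) weights how much of the uncertainty counts towards the expected probability, `belief + atomicity * uncertainty`. The class and method names are assumptions, and how the speaker’s group turns opinions into trust scores for ontology merging was not described, so that part is not modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Opinion:
    """A binomial opinion (hypothetical class name, for illustration)."""

    belief: float
    disbelief: float
    uncertainty: float
    atomicity: float  # base rate: prior weight given to the uncertain mass

    def __post_init__(self):
        # The three masses must form a partition of 1.
        total = self.belief + self.disbelief + self.uncertainty
        if abs(total - 1.0) > 1e-9:
            raise ValueError("belief + disbelief + uncertainty must equal 1")

    def expected_probability(self):
        """Projected probability: belief plus the base-rate share of uncertainty."""
        return self.belief + self.atomicity * self.uncertainty


# Example: a source we mostly believe, with some remaining uncertainty.
source = Opinion(belief=0.6, disbelief=0.1, uncertainty=0.3, atomicity=0.5)
score = source.expected_probability()
```

With these example numbers the score works out as 0.6 + 0.5 × 0.3 = 0.75.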

BEXIS2 is a data management life cycle platform. It highlights how “varied” the variables are in data sets, even though they are semantically equivalent, e.g. different names for “date”. They provide templates for various data types. Within BEXIS2 is a feature for semi-automatic semantic annotation. Fully automatic approaches might happen as a result of the project outlined in “Semantic Web Challenge on Tabular Data to KG Matching” by Srinivas et al. They’ve also developed an ontology extending PROV-O and P-Plan. They have a website for visualizing provenance data, and the biodiversity knowledge graph that was presented earlier.

Some of their recent work has involved improving the presentation and navigation of results using facets to explore Knowledge Graphs. KGs are tricky as they are large and you are often interested in indirect relationships. You can’t precompute facets as that becomes impractical. When applying facets to KGs, you need to be able to manipulate a subgraph that is the search result. The result set is a list of IRIs, from which you then create candidates, removing those that only appear in a small subset of your result. Then you find those candidates that work well as facets (the goldilocks zone of facet size). You can also filter on facets that are not similar to each other (semantically distant), to make the results more interesting. This methodology “works” but is too slow, and you need a better way of traversing the graph a bit to get more distant information.
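
The candidate filtering described above can be sketched roughly as follows. This is a simplified illustration and not the speaker’s implementation: the function name `select_facets`, the thresholds and the example data are all assumptions, only direct properties are considered, and the semantic-distance filter is left out.

```python
def select_facets(results, min_coverage=0.5, min_values=2, max_values=3):
    """Pick properties that would make usable facets for a result set.

    results: dict mapping a result IRI to a dict of property -> value.
    A property is kept if it appears on enough results (coverage) and
    has neither too few nor too many distinct values ("goldilocks").
    """
    total = len(results)
    if total == 0:
        return []

    values_by_property = {}
    coverage = {}
    for props in results.values():
        for prop, value in props.items():
            values_by_property.setdefault(prop, set()).add(value)
            coverage[prop] = coverage.get(prop, 0) + 1

    facets = []
    for prop in sorted(values_by_property):
        # Drop candidates that only appear in a small subset of the result.
        if coverage[prop] / total < min_coverage:
            continue
        # Keep candidates whose number of distinct values is in range.
        if min_values <= len(values_by_property[prop]) <= max_values:
            facets.append(prop)
    return facets


# Hypothetical result set; the IRIs and properties are made up.
results = {
    "ex:d1": {"habitat": "forest", "taxon": "Fagus", "id": "1"},
    "ex:d2": {"habitat": "grassland", "taxon": "Poa", "id": "2"},
    "ex:d3": {"habitat": "forest", "taxon": "Quercus", "id": "3", "funder": "X"},
    "ex:d4": {"habitat": "grassland", "taxon": "Poa", "id": "4"},
}
facets = select_facets(results)
```

Here `id` is rejected because every result has a different value, and `funder` because it appears on only one of the four results.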

Germany has funded the NFDI (German National Research Data Infrastructure) – 30 consortia covering all research areas. They’ve applied to be part of this project for biodiversity data. The ultimate aim of NFDI for biodiversity is to build a unified research data commons infrastructure that provides customized semantic data to different user groups (technology services, integration, interoperability).

Making clinical trials available at the point of care – connecting Clinical trials to Electronic Health Records using SNOMED CT and HL7 InfoButton standards

Jay Kola (speaker), Wai Keong Wong and Bhavana Buddala

In FAIRsharing and in this talk: ClinicalTrials.gov, SNOMED-CT, ICD 10, MeSH, CDISC ODM.

This is about “using standards in anger”, e.g. making it work! This started because of a simple clinical problem – “I know our hospital is a large center, but I don’t know what trials we are running, or what trial my patient might be eligible for”. There are two external registries – ClinicalTrials.gov and ClinicalTrialsGateway (UK). However, they’re not up to date, and not all local studies/trials are registered on external registries. What did they do? Made it possible for a clinician to bring up all trials the patient is suitable for based on diagnosis, age and gender. It took only 20 minutes to configure/add this functionality to the EHR system.
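
The matching on diagnosis, age and gender could look something like the sketch below. It is only an illustration: the `Trial` class, the `eligible_trials` function and the diagnosis codes are hypothetical placeholders (not real SNOMED CT concepts), and nothing here reflects how KeyTrials is actually implemented.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """A locally registered trial with simple eligibility criteria."""

    name: str
    diagnoses: set  # placeholder diagnosis codes accepted by the trial
    min_age: int
    max_age: int
    genders: set


def eligible_trials(trials, diagnosis, age, gender):
    """Return the names of trials the patient might be eligible for."""
    return [
        trial.name
        for trial in trials
        if diagnosis in trial.diagnoses
        and trial.min_age <= age <= trial.max_age
        and gender in trial.genders
    ]


# Hypothetical local trial registry.
registry = [
    Trial("Trial A", {"DX-ASTHMA"}, 18, 65, {"female", "male"}),
    Trial("Trial B", {"DX-ASTHMA", "DX-COPD"}, 40, 80, {"female"}),
    Trial("Trial C", {"DX-DIABETES"}, 18, 99, {"female", "male"}),
]
matches = eligible_trials(registry, diagnosis="DX-ASTHMA", age=45, gender="female")
```

A hierarchical terminology would additionally let a trial coded with a broad diagnosis match patients coded with a narrower one; this sketch only does exact code matching.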

Their solution is the KeyTrials Platform, which is then integrated into the hospital EHR system. The integration with the local trial registry was done with a batch import with a text file. How do you make the trials “coded” with tags / ontology? ICD 10, SNOMED CT and MeSH were considered, and SNOMED CT was used. ICD 10 wasn’t as rich as SNOMED CT (which is also hierarchical). They also used NLP and SNOMED CT to annotate trials.

The HL7 InfoButton allows systems to request information from knowledge resources using a pre-defined set of parameters. With their system, a query is passed from EHR to the InfoButton which then goes to the KeyTrials Backend and then sends it back to the clinician / EHR system.

Importing data is painful if we have to create a connection for every possible system. CDISC-ODM doesn’t reflect all of the clinical trial data (e.g. eligibilities), and FHIR ResearchStudy is still under development. KeyTrials is open source, so you can use it as you like. Also, post hoc annotation of clinical trials via NLP is avoidable if clinical trials data is coded at the time of creation. They used BioYODIE (GATE NLP engine). Another issue was that ICD 10 is more prevalent than SNOMED CT. Even at their hospital, ICD 10 is used natively… ICD 10 is less granular than SNOMED CT, which can cause issues in mapping existing terms.

KeyTrials and related work is intended to make clinical trials more visible and to increase recruitment in clinical trials. The goal is to make clinicians less likely to miss trials.

A Blueprint for Semantically Lifting Field Trial Data: Enabling Exploration using Knowledge Graphs

Daniel Burkow, Jens Hollunder, Julian Heinrich, Fuad Abdallah, Miguel Rojas-Macias, Cord Wiljes, Philipp Cimiano and Philipp Senger

FAIRsharing records for described resources: JERM, EXACT.

Field trials require a certain process with 4 stages: trial planning, field operations, assessments and analytics/reporting. Based on the type of data, each dataset might end up in one of multiple repositories. Therefore they would like to provide a workflow that overcomes these interoperability issues. To start, you harmonize the data and put it into a KG. The KG is used to find new models (digital experimentation and then analysis). First, they aggregate the data silos using reference ontologies. The KG is then enriched to provide extra value to the end users. Field data has lots of different data types. Their data model is field-trial centric. It isn’t intended to store data, but rather just model it.

They map data resources onto their semantic business data model. Then, they map the data model onto ontologies from various domains (chemistry, experimental description, geospatial, provenance, biology etc). Each domain has their own transformation rules and ontologies (EPPO code, NCBI Taxon, PROV-O, OSPP – soil, EPPO Climatic zone, JERM – just enough results model ontology, Experimental Actions, OECD, GeoNames, Soil Triangle from USDA). They also have additional sources for cleaning, translation, harmonization and enrichment.

They have 500,000 field trials from the last 30 years with over 600 million nodes and 5.3 billion relations. They make use of Virtuoso, Neo4J, ElasticSearch, Oracle, PostgreSQL, GraphQL, SpotFire, Tableau, R/Shiny and Python/Dash. They built a custom application for data exploration to display the graph data.

Panel discussion (Spotlight on Agrisemantics II)

Chaired by Chris Baker

Introduction to GODAN – Suchith Anand

How do we bridge the digital divide? How do we make opportunities in open science available to people in the developing world? How can we use open data to help smallholder farmers? GODAN focuses on topics such as malnutrition, hunger and poverty through open innovation. GODAN supports global efforts to make agriculture and nutrition data available, accessible and usable for unrestricted use worldwide. There are over 100 GODAN partners.

Open data aids impact – his example is the Landsat data, which was made open in 2009. There are 232 Sustainable Development Goals (SDGs) Indicators. How can we link these data sets and get meaning and context from all of this information? For researchers in Africa, open data isn’t enough – you need to have open tools. Open science is a key enabler for capacity development and empowerment for all, especially in the developing world. In this context, Open Science includes open data, open standards, open access to research publications, open education resources, and open software.

It’s a very interesting time to be working in open science. (Examples include linked open data in 2012, and then FAIR from 2015 onwards.) Lots of things are happening in this area, e.g. the European Open Science Cloud. A question for him is how more opportunities can be given to students in digital agriculture.

Panel Discussion

Panel speakers: Fernanda C. Dórea, Monika Solanki, Marie Angélique Laporte, Jens Hollunder

MS states that we should reuse existing work more than we do – and spend less time building tools & ontologies from scratch. FCD says there is a struggle for funding for projects when everyone benefits but nobody profits. We should look for initiatives that bring people together.

Q: Are you taking into account indigenous practice? How about local and neglected languages?

MS: She’s not aware of resources as such, but she can see the problem. Building these things requires a lot of investment, and the countries that would need such efforts don’t have a lot of funding. This is a good example of the need for those countries to fund such projects.
FCD: Seeing initiatives coming from developed to developing countries sometimes seems a little like they should instead come from the developing country.
CH: There is pressure in Brazil surrounding monocultures – there are little or no resources for traditional methods. The governments of developing countries have other things to spend their money on.
MAL: It’s important to keep in mind that the people speaking the local dialects/variants are our end users in many cases.

Follow-up Q: This leads onto another question. Although the UK and US have lots of very large farms, elsewhere in the world it is mostly smallholders. Are we ignoring smallholders (more remote, less internet / connectivity)? I think there are implicit barriers to smallholders, as they won’t have technological access and may not have the educational tools.

JH: Precision farming would not neglect smallholders. Outcome-based models would help the small farms. We need to be constantly creating new data models to correctly match the current state of resources etc.
MS: Agrisemantics is about having a shared conceptualization across the community – anyone can consume such data. Bigger organizations have more technological resources, but within a limited capacity, even smallholders could benefit e.g. having a shared vocabulary to describe things.
MAL: When we developed the Agronomy Ontology, we were looking to see what the impact of management practices is on yield and other characteristics, and this includes smallholders as one of the use cases.
SA: Recently GEO asked members of the indigenous communities to participate in workshops etc, and this worked really well.

Q: Most of the semantic technologies that we work with deal poorly with time, and yet this seems critical, given the changes that are happening and will happen due to climate change. Is this a problem?

MS: I interpret this as a representation of time in the existing suite of ontologies we have – and this has been a big issue. No suitable scalable solution has come up. Version control of linked datasets is one example of such a solution. Temporal reasoning is a very abstract concept at the moment.
JH: I agree this is a problem. Currently the semantics give us a frozen snapshot.

Q: What current Ag Semantic standards are in use, which are obsolete, and what new standardization initiatives are necessary to help with this?

MAL: The Crop Ontology suite of ontologies has been successful so far.
FCD: This isn’t an issue that is just seen in the ag semantics field. In her field, there’s a lot of work on structural interoperability but not a lot of work on semantic interoperability.
MS: AgroVoc is the de facto standard – very comprehensive and multilingual. It’s also lightweight. There are 403 ontologies in AgroPortal, and perhaps some of them are obsolete.
JH: These 403 ontologies are each living in their little domain, when there are many more: chemistry, supply chain, management, biology etc. and they all need to come together and have each of these domains talking to each other.
MAL: Comparison to the OBO Foundry – promotion of requirements.

Follow-up Q: Do you think standards have been divided by public and private interests, or national interests?

JH: He feels what can be shared, will be shared.
MS: A lot of their ontologies are currently closed source, so it’s hard to use them – so there is a corporate divide.
MAL: This is totally driven by the funding and what’s available in each country.
FCD: Her experience with the cycle of funding is that if she spends the funding of a 3-year project on ontology development without focusing also on the community surrounding it, the ontology will die at the end of the 3-year project.

Q: Who should pay for the new (to be shared) agrisemantic models?

MS: We can’t expect private industry to pay for the new data model. We need a combination of experts from various areas to create a platform where such things can be discussed. Comparison with schema.org.
FCD: On the one hand, people might think “what does making my data open do for me?” Equally, in the EU there is more of a push towards open data, which provides impetus.
JH: We should have tangible use cases and start from there – money is given more easily with such use cases.
CH: In the machine learning world, many of the major frameworks are developed by the private sector and then donated to the public. Why is this not happening here?

Q: To what extent is the digitization of agriculture (together with agri semantics) a case of technology push that does not correspond to the real needs of farmers or the environment?

MAL: We need to keep in mind that our users are the farmers, and if we do so then we will make tools that will benefit them.
FCD: Make sure that what you’re doing actually matches the use case.
MS: It’s a problem with the people who are making the technology if this happens, as they are not listening to the farmers. Or alternatively, farmers might hold back and not provide all of the information they should. The Farm Business Survey is used in the UK (run by the government) to get data. Perhaps this also happens in other countries.

Q: What role can semantic technologies play in agricultural biosecurity?

FCD: You can only go so far in aggregation of data without using semantics – so at some stage any field including biosecurity would find semantics useful. However, often converting it into a format useful for semantic technologies might have a time penalty making it harder to be used in situations with a short time frame.

Q: There is a lot of hype and focus on healthy and sustainable food. Consumers want to be well informed about these things. Is there an opportunity for semantic data with respect to these questions?

MS: She has a couple of papers in this area (traceability in the food system).

Q: What is the role of semantics in limiting / accelerating the advance of Ag IOT?

JH: There are a lot of vendors / devices / “standards” – so that is one limitation. It also depends on the sensor being used.
MS: This area does need a lot more work.

Q: What is lacking in the current state of the art in agrisemantics?

MS: The livestock domain still requires some work.
FCD: I agree with this.
JH: What is also missing is the easy accessibility for end users to communicate with this technology. This is a hurdle for increased uptake.

Please also see the full programme. Please note that this post is merely my notes on the presentations. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any issues you may spot – just let me know!

SWAT4(HC)LS 2019: Morning Presentations, Day 1

These are my notes of the first day of talks at SWAT4HCLS.
Apologies for missing the first talk. I arrived just as the keynote by Denny Vrandečić titled “Wikidata and beyond – Knowledge for everyone by everyone” was finishing (the first train from the south only arrived in Edinburgh at 9am). (FAIRsharing Wikidata record.)

Enhancing the maintainability of the Bio2RDF project using declarative mappings

Ana Iglesias-Molina (speaker), David Chaves-Fraga, Freddy Priyatna and Oscar Corcho

FAIRsharing record for Bio2RDF.

The current Bio2RDF scripts are ad hoc PHP scripts. They are proposing the use of OBDA technologies (via declarative mapping). The workflow for OBDA includes 1) a mapping file for the relationships between data source and ontologies (e.g. in RML, R2RML), and 2) a mapping engine to transform the data sources into knowledge. Bio2RDF data source formats are mainly CSV/XLSX, then XML. They are developing a mapping engine for CSV/XLSX using a tool called Mapeathor. Mapeathor is used to generate the knowledge graph by mapping columns from the spreadsheets into appropriate triples.
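
To make the idea of a declarative mapping concrete, here is a minimal sketch of my own (not the authors' code, and not RML, R2RML or Mapeathor syntax). The mapping is plain data, and a small generic engine applies it to spreadsheet rows to produce triples. The IRIs, column names and sample rows are all made up for illustration.

```python
import csv
import io

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical declarative mapping: it only describes WHAT maps to what.
# Changing the output means editing this data, not the engine below.
MAPPING = {
    "subject_template": "http://example.org/drug/{id}",
    "rdf_type": "http://example.org/vocab/Drug",
    "columns": {
        "name": "http://www.w3.org/2000/01/rdf-schema#label",
        "target": "http://example.org/vocab/hasTarget",
    },
}


def rows_to_triples(csv_text, mapping):
    """Apply a mapping to CSV text and return (subject, predicate, object) tuples."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = mapping["subject_template"].format(**row)
        triples.append((subject, RDF_TYPE, mapping["rdf_type"]))
        for column, predicate in mapping["columns"].items():
            value = (row.get(column) or "").strip()
            if value:  # skip empty cells rather than emitting empty literals
                triples.append((subject, predicate, value))
    return triples


# Made-up sample data standing in for a Bio2RDF source spreadsheet.
SAMPLE_CSV = "id,name,target\nD1,aspirin,COX1\nD2,placebo,\n"
triples = rows_to_triples(SAMPLE_CSV, MAPPING)
for triple in triples:
    print(triple)
```

The point of the design is that the engine is generic: swapping in a different engine, or a different source, should only require a new mapping.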

They wish to increase the maintainability of the data transformation using the OBDA options, which 1) enables the use of different engines, and 2) creates a knowledge graph ready for data transformation and query translation. They would like to improve the process to deal with the heterogeneity of the data.

Suggesting Reasonable Phenotypes to Clinicians

Laura Slaughter (speaker) and Dag Hovland

HPO in FAIRsharing.

They support communication between pediatrician and geneticist, to provide a more complete picture of the patient, but not to replace expertise of the physicians. HPO is used to specify abnormalities.

Their workflow is as follows. A pediatrician in intensive care suspects a newborn of having a genetic disorder. Firstly, they need to get consent via a patient portal (DIBS – electronic health record). A requisition form has a subset of HPO codes that the pediatrician can select, and then the form and samples are sent off to the lab. Reporting the HPO codes to the lab helps the lab with their reporting and identification.

Phenotips is one HPO entry system (form-based ontology browse and search). Also extant is a natural language mapping from text. A third is Phenotero, which is a bit of a mixture of the two. When they started, the clinicians wanted to use Phenotips. Another related system is the Phenomizer, which takes a different perspective as it helps with the process of differential diagnosis. The authors thought they would just provide a service where they would suggest additional HPO codes to clinicians. But when they started to work on it, they had to make a new user interface after consultation with the clinicians. Additionally, they discovered that they would also need to provide a differential diagnosis feature.

There were a number of issues with the system that existed before they started. There was an overwhelming number of HPO codes for clinicians to sort through. There was no consistency checking or use of the HPO hierarchy. The NLP detection had a low accuracy and had to be supplemented with manual work. There was also no guidance for prompting for a complete picture of the patient or further work-up (available in Phenomizer).

They suggested that they provide a simple look-up mechanism using disease-HPO associations. Suggestions for clinicians come in the form of HPO codes that point to where further work-up might be needed. They also needed to implement ordering of HPO code candidates, and they did this by using disease information to inform priority settings, e.g. measuring the specificity of the disease given the phenotype entered by the clinician.

They order diseases in increasing order, by the ratio of unselected phenotypes. There is a balance to find between giving the clinician a bias too early, or alternatively only being able to provide feedback in very specific circumstances. They implement their work using a reasoner called Sequoia, input forms and annotation files.
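
As a rough illustration of the ordering described above, here is a minimal sketch of my own. It assumes a simple disease-to-phenotype lookup table and does not use Sequoia or the HPO hierarchy. The disease names and the `HP:x…` codes are placeholders, not real HPO identifiers or real annotations.

```python
# Placeholder disease -> phenotype associations (hypothetical data).
DISEASE_PHENOTYPES = {
    "disease_A": {"HP:x1", "HP:x2", "HP:x3"},
    "disease_B": {"HP:x1", "HP:x4", "HP:x5", "HP:x6"},
    "disease_C": {"HP:x7", "HP:x8"},
}


def rank_diseases(selected, associations):
    """Order candidate diseases by increasing ratio of unselected phenotypes."""
    ranked = []
    for disease, phenotypes in associations.items():
        if not phenotypes & selected:
            continue  # no overlap with what the clinician entered
        unselected = phenotypes - selected
        ratio = len(unselected) / len(phenotypes)
        ranked.append((ratio, disease, sorted(unselected)))
    ranked.sort()
    return ranked


def suggest_codes(selected, associations, limit=5):
    """Suggest not-yet-selected codes, best-matching diseases first."""
    suggestions = []
    for _ratio, _disease, unselected in rank_diseases(selected, associations):
        for code in unselected:
            if code not in suggestions:
                suggestions.append(code)
    return suggestions[:limit]


selected = {"HP:x1", "HP:x2"}
ranking = rank_diseases(selected, DISEASE_PHENOTYPES)
suggested = suggest_codes(selected, DISEASE_PHENOTYPES)
print(ranking)
print(suggested)
```

A low ratio means most of a disease's phenotypes have already been entered, so its remaining phenotypes are the most specific candidates for further work-up.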

They are working with a geneticist and clinicians to find the best method for generating suggestions and to evaluate the UI. They’re also exploring the HPO-ORDO Ontological Module (HOOM), which qualifies the annotations between a clinical entity from ORDO and phenotypic abnormalities from HPO according to frequency and by integrating the notion of diagnostic criteria.

A FHIR-to-RDF converter

Gerhard Kober and Adrian Paschke

FHIR in FAIRsharing. RDF in FAIRsharing.

FHIR is an HL7 standard with more than 100 resources defined. A FHIR-Store is a storage container for different resources, and they would like to run SPARQL queries over the result set. Because FHIR resources are meant to facilitate interoperability (but not semantic interoperability), they cannot be stored directly as RDF. They are implementing a system architecture that would have a FHIR-to-RDF converter sitting in between the client and the HL7 FHIR stores. This would allow the client to interact with RDF.

They have used the HAPI-FHIR and Apache-Jena libraries. The data is transformed from FHIR-JSON to an Apache-Jena RDF model. Searches of FHIR resources are returned as JSON objects. Performance is critical, and there are two time-consuming steps that might become a bottleneck: the HTTP call to the FHIR store and the conversion from FHIR to RDF. To alleviate this, queries to the FHIR store should be specialized. They also need to check if the transformation to Apache-Jena is too expensive.
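
The conversion step can be pictured with a small sketch of my own. It uses only the Python standard library rather than HAPI-FHIR or Apache Jena, and it is a naive flattening of FHIR JSON into triples, not the official FHIR RDF serialization. The base IRI and the sample Patient resource are assumptions for illustration.

```python
import json

FHIR_NS = "http://hl7.org/fhir/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def fhir_json_to_triples(resource_json, base_iri="http://example.org/fhir/"):
    """Flatten one FHIR JSON resource into (subject, predicate, object) tuples."""
    resource = json.loads(resource_json)
    resource_type = resource["resourceType"]
    subject = f"{base_iri}{resource_type}/{resource['id']}"
    triples = [(subject, RDF_TYPE, FHIR_NS + resource_type)]
    blank_count = 0

    def walk(node, path, value):
        nonlocal blank_count
        if isinstance(value, dict):
            # Nested objects become blank nodes hanging off the parent.
            blank_count += 1
            blank = f"_:b{blank_count}"
            triples.append((node, FHIR_NS + path, blank))
            for key, child in value.items():
                walk(blank, f"{path}.{key}", child)
        elif isinstance(value, list):
            for item in value:  # repeated elements share one predicate
                walk(node, path, item)
        else:
            triples.append((node, FHIR_NS + path, value))

    for key, value in resource.items():
        if key not in ("resourceType", "id"):
            walk(subject, f"{resource_type}.{key}", value)
    return triples


# Hypothetical search result from a FHIR store.
PATIENT = json.dumps({
    "resourceType": "Patient",
    "id": "example",
    "gender": "female",
    "name": [{"family": "Doe", "given": ["Jane"]}],
})
triples = fhir_json_to_triples(PATIENT)
for triple in triples:
    print(triple)
```

Even in this toy form, the walk touches every element of every resource, which hints at why narrowing the query to the FHIR store matters for performance.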

A framework for representing clinical research in FHIR

Hugo Leroux, Christine Denney, Smita Hastak and Hugh Glover (speaker)

FHIR in FAIRsharing.

This covers work they’ve done together as part of HL7 over the past 6-8 months. They’ve had a FHIR meeting, “Blazing the Path Forward for Research”, where they agreed to establish a new Accelerator Project to get a set of use cases. FHIR has been widely adopted in clinical care, mainly because of its accessibility and how it is all presented through a website. If you look at a FHIR resource, you get a display containing an identifier and some attributes. For instance, for Research Subject you would get information on state, study link, patient link, consent… Research Study includes identifier, protocol, phase, condition, site / location, inclusion/exclusion.

FHIR tooling enforces quality standards at the time of publishing the data and has publicly-available servers for testing, among other features. It also provides RESTful services for the master data objects that are stateless and non-object oriented.

Much of the work involved is keeping track of the investigators and other trial management. They are looking at using FHIR resources to help with the trial management as well as the more traditional data capture and storage.

People can build tools around FHIR – one example is ConMan, which allows you to graphically pull resource objects in and link them together. When resources are linked together, the resulting graph of objects looks a lot like a vocabulary/ontology relating ResearchStudies to Patients via ResearchSubject and other relationships.

The object model is quite complex. BRIDG is a domain model for FHIR in clinical research. The objective of what they’re doing is to stimulate a discussion on how clinical research semantics and data exchange use cases can be represented in FHIR.

Reconciling author names in taxonomic and publication databases

Roderic Page

LSIDs in FAIRsharing.

LSIDs were used early on in the semantic web – I remember those! However, LSIDs didn’t really work out – data didn’t just “join together” magically, unfortunately. He’s working towards a Biodiversity Knowledge Graph, as there is a lack of identifiers and a lack of links. Taxonomists often feel underappreciated, and under attack from people who are generating loads of new species and aggregated biodiversity data. Taxonomists are much less likely to have ORCIDs than the general research population, so in order to identify people you need to match people using CrossRef and ORCID either using schema:Role, or matching people in IPNI (a taxonomic database that still uses LSIDs?) and ORCID.

Not all ORCID profiles are equal – he shows us an example of one called “Ian”…, though he did figure out who it (probably) is. In conclusion, the semantic web for taxonomic data failed because of the lack of links, and making retrospective links is hard. Additionally, there is the “Gary Larson” problem of people hearing “blah blah RDF blah” 🙂

On Bringing Bioimaging Data into the Open(-World)

Josh Moore (speaking), Norio Kobayashi, Susanne Kunis, Shuichi Onami and Jason R. Swedlow

IDR in FAIRsharing. OME-TIFF in FAIRsharing. OMERO in FAIRsharing.

In imaging, the diversity is visual and you can see how different things are. They are developing a 5D image representation / model: 3D movies in multiple colors. From there it gets more complicated with multilayer plates and tissues. They develop the Image Data Resource. They are interested in well-annotated image data, e.g. from the EBI, as well as controlled vocabularies. They are getting lots of CSV data coming in, which is horrible to process.

They translate over 150 file formats via BioFormats by reverse engineering the file formats – big job! They tried to get everyone using OME-TIFF but it wasn’t completely successful. However, it was a good model of how such things should be done: it’s a community-supported format, for example.

This community is still a bit “closed world”. In 2016 they started development of the IDR, and needed to formalize the key/value pairs. However, the community continues to want to extend it more. As a result, they want to leave the key/value pairs and move back to something more semantic. Use cases include extension of the entire model or conversion of the entire model – Norio Kobayashi converted the entire model into OWL (OME Core ontology <= OME Data Model (which itself is OME-TIFF + OME-XML)). An example of extension is the 4D Nucleome ontology.

He likes the Semantic Web solutions as they reduce the cost of more ad hoc XML extensions. Perhaps they could use JSON-LD, as it may end up being the “exciting” front end? Bio-(light) imaging is relatively new to this and lagging in investment for these sorts of things.

What’s your favorite taxonomic controversy?

In January I’ll be running an event with another STEM ambassador for Years 5 and 6 at a local primary school. One year will be getting the fantastic Mystery Boxes, which I love doing with any age group, and the other year is currently studying Taxonomy and Classification. I love the idea of talking about the big debates that scientists have, and how we scientists aren’t a bunch of homogeneous fact-tellers. Instead we’re messy humans who like having arguments, and I think taxonomy is one of those areas that has many arguments.

So, what debates (historical or modern) do you most enjoy hearing about within taxonomic research? Here are some ideas I have, but would love to hear some specific examples from you all:

  • DNA Barcoding (summarized nicely by the Dept of Sociology at Lancaster Uni, and a 2005 POV article in Systematic Biology),
  • Taxonomy “vandalism” (see this Smithsonian piece), which I hadn’t realized was a thing,
  • Where do hominids fit in with respect to great apes (e.g. this opinion piece)?

I’ll probably simplify the general idea behind this lesson plan and throw in some soft toy animals for the kids to classify, but if you have any interesting ideas please let me know!

Software Ontology – a New Release and a Shiny New Build Procedure

A new SWO release arrives with a shiny new build procedure to herald its arrival.

I had noticed that it had been a while since we had last updated SWO – the Software Ontology. To be honest, it was a little more than “a while”, but…

  • we’re a merry band of volunteers, primarily Robert Stevens (blog, Computer Science at Manchester), Helen Parkinson (EBI), James Malone (blog, SciBite), which means we are time limited
  • our build process was outdated, slow, and tricky. I’ll admit, I had to ask James to finish our 1.6 release as it just wasn’t working for me!

[Figure: A small snippet of SWO – see EBI’s OLS for the full graph]

Does your release spark joy?

We all enjoy talking about software, and I have particularly enjoyed beginning to work on the lovely Licence Hierarchy within SWO that’s been coming along nicely. But every time I thought about updating the external ontologies we imported, or building the release files, I got a bit of a sinking feeling. Then, feeling like I was the last in the class to notice, I had a good read about ROBOT (website, publication), an ontology build tool that lots of projects had been using. I say build tool, but it does all sorts of lovely things. I use it for the following purposes:

  • SPARQL queries: I use SELECT to create summary statistics of quite complex subdivisions of my ontology
  • Bulk annotation: UPDATE commands can also be run, allowing me to add bulk annotations to my file.
  • Bulk imports via spreadsheets: a separate project I’ve been involved in began their ontology development with a spreadsheet and then we bulk converted it to OWL with ROBOT.
  • Merging imports – going from a development file with multiple imports to a single release file
  • Release building – checking and building a release file with appropriate annotation and versioning.
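
To give a flavour of the last two items, here is a small hedged sketch rather than our actual build. It assembles a chained ROBOT call that merges imports into one file and stamps it with ontology and version IRIs. The file names and IRIs are hypothetical; `merge` with `--input` and `annotate` with `--ontology-iri`, `--version-iri` and `--output` are standard ROBOT commands and options, but do check the ROBOT documentation for your version.

```python
def robot_release_command(edit_file, release_file, ontology_iri, version_iri):
    """Build the argument list for a chained `robot merge ... annotate ...` call."""
    return [
        "robot",
        "merge", "--input", edit_file,      # fold the imports into a single ontology
        "annotate",
        "--ontology-iri", ontology_iri,     # stable IRI for the ontology
        "--version-iri", version_iri,       # IRI identifying this particular release
        "--output", release_file,
    ]


# Hypothetical file names and IRIs, for illustration only.
command = robot_release_command(
    "swo-edit.owl",
    "swo-release.owl",
    "http://example.org/swo.owl",
    "http://example.org/releases/1.7/swo.owl",
)
print(" ".join(command))
```

If ROBOT is installed, the list could be handed to `subprocess.run(command, check=True)`, although in practice a Makefile target would normally wrap the same command line.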

And to top it all off, ROBOT suggests that you use a Makefile to control your build. What joy! The last time I used one was during my time at the EBI, and I really do enjoy using them. They are a lightweight, fun way to control a set of commands and dependencies that you need to run, and it was awesome to get back to it.

Decluttering

As it had been a while since we released SWO, it needed a spring clean. With MIREOT and Ontofox, I wasn’t tied to a simple (but crowded) import of entire ontologies. In previous versions, ontologies like EDAM were imported en masse and this caused major versioning issues when releases got out of step. MIREOT solves that by outlining a procedure (implemented by Ontofox) which allows for the selective import of classes and hierarchies of interest from external ontologies.
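
The selective-import idea can be sketched in a few lines. This is only an illustration under my own assumptions: it is not Ontofox's input format or algorithm, and the `ext:` class names are invented. Given a child-to-parent map for an external ontology, it keeps just the requested classes plus their ancestors up to a chosen top class.

```python
# Hypothetical external ontology, as a child -> parent map.
EXTERNAL_PARENTS = {
    "ext:Thing": None,
    "ext:Operation": "ext:Thing",
    "ext:Analysis": "ext:Operation",
    "ext:SequenceAlignment": "ext:Analysis",
    "ext:Data": "ext:Thing",
    "ext:Format": "ext:Thing",
}


def select_import(targets, parents, top):
    """Return a child -> parent map holding only the targets and their ancestors up to `top`."""
    selected = {}
    for term in targets:
        current = term
        while current not in selected:
            if current == top or parents[current] is None:
                selected[current] = None  # becomes a root of the import module
                break
            selected[current] = parents[current]
            current = parents[current]
    return selected


imported = select_import(["ext:SequenceAlignment"], EXTERNAL_PARENTS, top="ext:Operation")
print(imported)
```

Everything outside the requested branch (here `ext:Data` and `ext:Format`) stays out of the import, which is what keeps the release file small and makes it easier to update when the external ontology changes.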

So, we stripped out all of our external classes, and re-imported just the ones we needed. We also took the opportunity to resolve a number of inconsistencies with our IRI naming scheme (and a bunch of other housekeeping issues listed in our GitHub milestone).

Release and Indexing

We released 1.7 at the end of October, and our lovely friends at OLS, BioPortal and Ontobee quickly indexed it. Please feel free to browse it at any of these locations, or to say hello over at our GitHub repo (you’ll always find our latest release here). And with our build procedure now as streamlined as our ontology, updates will be easier and quicker – so let us know what you’d like!

UKON 2018: Session 4

Distinct effects of ontology visualizations in entailment and consistency checking

Yuri Sato (Brighton), Gem Stapleton (Brighton), Mateja Jamnik (Cambridge) and Zohreh Shams (Cambridge)

When describing world knowledge, a choice must be made about its representation. They explore the manner in which ontological knowledge is expressed in ways accessible to humans in general. They compare novice users’ performance in logical task solving using two distinct notations.

SOVA is a full ontology visualization tool for OWL – you can build syntactic units which create a SOVA graph. Other graph tools (OWLViz, VOWL, Jambalaya) are insufficient to express all OWL constructs. Also, existing systems of higraphs, compound digraphs, and constraint diagrams are not expressive enough to deal with ontologies. Stapleton et al (2017, VL/HCC 2017) describe concept diagrams. So therefore we have two methods to explore: topo-spatial and topological. In consistency checking tasks, topological representations performed better, while in entailment judgement topo-spatial representations performed better. In summary, topological representations are suitable for most existing ontology representations, but there is a need to design new ontology visualizations.

Users and Document-centric workflow for ontology development

Aisha Blfgeh and Phillip Lord (Newcastle)

Ontology development is a collaborative process. Tawny OWL allows you to develop ontologies in the same way as you write programs. You can use it to build a document-centric workflow for ontology development. You start with users editing an Excel spreadsheet, which is then used as input in Tawny OWL to ultimately generate OWL. This will also generate a Word document in which users can see the changes.

But how successful would the ontological information in the form of a Word document actually be? It depends on the users, of which there are two types: the users of the ontology and the developers of the ontology. They started by classifying their users into newbies, students, experts and ontologists, and they worked with the pizza ontology.

The participants saw the ontology both as a Word document and in Protege. Errors were introduced and the participants were asked to find them. Reading text in Word helps explain the structure of the ontology, especially for newbies. However, the hierarchy is very useful in Protege. The ability to edit the text in the Word document is quite important for non-experts.

The generation of the Word document is currently not fully automated, so completing the automation is one of the things they plan to do. They also want to develop a Jupyter notebook for the work. Finally, they’d like to repeat this work with ontologists rather than just newbies.

DUO: the Data Use Ontology for compliant secondary use of human genomics data

Melanie Courtot (EBI) – on behalf of The Global Alliance For Genomics And Health Data Use Workstream

Codifying controlled data access consent. Data use restrictions originate from consent forms – and as a researcher, to get the data you have to go via data access committees. The current protocol for data access is: there are data depositors and data requestors. The data access committee sits between the data and the requestors and tries to align the requestors’ needs with the data use limitations. All of this is done manually, and is quite time consuming. Often there isn’t the human capacity to go through all requests. Therefore if we can encode consent codes into an ontology, perhaps the data access process could be more automated.

The use cases for this system would include data discovery, automation of data access, and standardization of data use restrictions and research purposes forms. DUO lives in a GitHub repo where they tag each release. They aim to keep DUO small and to provide clear textual definitions augmented with examples of usage. In addition, DUO provides automated machine-readable coding.
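
As a sketch of what automating part of the data access process might look like, the following is my own simplified illustration, not the GA4GH matching algorithm. It assumes that both the dataset's consent and the requested research purpose are coded with the DUO abbreviations NRES (no restriction), GRU (general research use), HMB (health/medical/biomedical research) and DS (disease-specific research), and that these form a simple broad-to-narrow chain. The function name and the disease strings are hypothetical.

```python
# Broadest permission first; ASSUMED here to form a simple chain where each
# level also permits every narrower purpose that follows it.
PERMISSION_ORDER = ["NRES", "GRU", "HMB", "DS"]


def is_request_permitted(dataset_permission, request_purpose,
                         dataset_disease=None, request_disease=None):
    """Return True if the requested purpose is no broader than the dataset's consent."""
    dataset_level = PERMISSION_ORDER.index(dataset_permission)
    request_level = PERMISSION_ORDER.index(request_purpose)
    if request_level < dataset_level:
        return False  # the request is broader than what was consented to
    if dataset_permission == "DS":
        # Disease-specific consent only covers research on the same disease.
        return request_disease == dataset_disease
    return True


print(is_request_permitted("GRU", "HMB"))                        # narrower request
print(is_request_permitted("HMB", "GRU"))                        # broader request
print(is_request_permitted("DS", "DS", "diabetes", "diabetes"))  # same disease
```

A real implementation would compare disease terms through an ontology hierarchy and handle DUO's additional modifiers, and a data access committee would still need to review anything the automated check cannot decide.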

W3C Data Exchange Working Group (DXWG) – an update

Peter Winstanley (The Scottish Government) and Alejandra Gonzalez-Beltran (Oxford)

Peter co-chairs this working group and is one of their Invited Experts. He shares the burden of chairing and ensures that the processes are adhered to. These processes involve making sure there is openness, adequate minutes, and sensible behaviour. The working group is a worldwide organization, which makes it difficult to organize the weekly meetings (time zones etc). There are also subgroups, which means two awkwardly-timed meetings. This is the context in which the work is done.

The DCAT (Data Catalog Vocabulary) has been around since 2014 as a W3C recommendation. Once people really started using it, issues became apparent. There were difficulties with describing versioning, API access, relationships between catalogs, relations between datasets and temporal aspects of datasets etc. Therefore the way that people have used it is by mixing it with other things as part of an “application profile”. Examples include DCAT-AP, GeoDCAT-AP, HCLS Dataset description, DATS. Different countries have also already started creating their own application profiles as part of a wider programme of AP development (e.g. Core Public Service Vocabulary (CPSV-AP)).

The mission of the DXWG is to revise the DCAT and then to define and publish guidance on the use of APs, and content negotiation when requesting and serving data. There have been a few areas where reduced axiomatisation is being proposed in the re-working of DCAT to increase the flexibility of the model.

You can engage with DXWG via GitHub, the W3C meetings and minutes, and the mailing lists, and you can provide feedback.

Panel Session

Robert Stevens introduced the panel. He stated that one of the reasons he likes this network is its diversity. Panellists: Helen Lippell, Allison Gardener, Melanie Courtot, and Peter Andras. The general area for discussion is: in the era of Big Data and AI, what type of knowledge representation do we need?

Melanie Courtot: It depends on what you’re calling KR… Ontologies take a lot of time to build, and they’re typically not funded. If we’re talking about KR other than ontologies, then you want to ensure that you keep any KR solution lightweight. She liked that a lot of the talks were very practically oriented.

Helen Lippell: She doesn’t work on funded projects at the moment, but instead goes into private sector companies. They have lots of projects on personalization and content filtering. You can’t really do these things without ontologies / domain models / terminologies, and without ensuring these are all referring to the same thing. She’d like to see more people in the private sector working with ontologies – it shouldn’t be just academics – go out and spread your knowledge!

Allison Gardener: From the POV of a biologist coming into Computer Science, she’s primarily concerned with high quality data rather than just lots of data. What features she chose and how she defined these features was really important. Further, how you define a person (and their personal data) would determine how they are treated in a medical environment. Ontologies are really important in the context of Big Data.

Peter Andras: If you look at how KR works in the context of Image Analysis – images are transformed and fed into a Neural Network – you get statistical irregularities in the data space. Your KR should look at these irregularities and structure those in a sensible way that you can use for reasoning. This works for images, but is more difficult / much less clear when you’re looking at text instead. However, if you can add semantics into the text data, perhaps you can more meaningfully derive what transformations make sense to get those high quality irregularities from your analysis. Sociologists have several million documents of transcribed text from interviews – how you analyse this, and get out a meaningful representation of the information contained therein, is difficult and ontologies could be helpful. How can you structure theories and sociological methodologies such that you add more semantics?

Q: Have ontologies over-promised? Did we think it could do more than it has turned out that it could do? Melanie: What are we trying to do here? Trying to make sense of a big bunch of data… As long as the tools work, it doesn’t really matter if we don’t use ontologies. Phil: “Perfection is the enemy of the good.” Peter: There hasn’t been really an over-hype problem. Perhaps you’ll see the development of fewer handcrafted ontologies and more automated ontologies via statistical patterns. But what kind of logic should we use? Alternative measures of logic might apply more – the weighting of logic changes.

Please note that this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know!

UKON 2018: Session 3

A Community-based Framework for Ontology Evaluation

Marzieh Talebpour, Thomas Jackson and Martin Sykora (Loughborough)

There are many systems supporting ontology discovery and selection. She reviewed 40 systems in the literature and came up with a generic framework to describe them. They all have a collection of ontologies gained by various means and then they receive added curation. She wanted to evaluate the quality of ontologies and aid the selection process through metrics. There are three groups of such metrics – internal, metadata and social metrics. Although you can group them in this way, do knowledge and ontology engineers actually consider social metrics when evaluating the ontologies?

She interviewed ontologists to discover what they saw as important. After getting the initial list of metrics from the interviews, she did a survey of a larger group to rank the metrics.

Towards a harmonised subject and domain annotation of FAIRsharing standards, databases and policies

Allyson Lister, Peter Mcquilton, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Milo Thurston, Massimiliano Izzo and Susanna-Assunta Sansone (Oxford)

(This was my talk so I didn’t take any notes, so here’s a summary)

FAIRsharing (https://www.fairsharing.org) is a manually-curated, cross-discipline, searchable portal of three linked registries covering standards, databases and data policies. Every record is designed to be interlinked, providing a detailed description not only of the resource itself, but also its relationship to other resources.

As FAIRsharing has grown, over 1000 domain tags across all areas of research have been added by users and curators. This tagging system, essentially a flat list, has become unwieldy and limited. To provide a hierarchical structure and richer semantics, two application ontologies drawn from multiple community ontologies were created to supplement these user tags. FAIRsharing domain tags are now divided into three separate fields:

  • Subject Resource Application Ontology (SRAO) – a hierarchy of academic disciplines that formalises the re3data subject list (https://www.re3data.org/browse/by-subject/). Combined with subsets of six additional ontologies, SRAO provides over 350 classes.
  • Domain Resource Application Ontology (DRAO) – a hierarchy of specific research domains and descriptors. Fifty external ontologies are used to provide over 1000 classes.
  • Free-text user tags. A small number of FAIRsharing domain tags were not mappable to external ontologies and are retained as user tags. Existing and new user tags may be promoted to either application ontology as required.

From the initial user tags to the development of the new application ontologies, our work has been led by the FAIRsharing community and has drawn on publicly-available resources. The FAIRsharing application ontologies are:

  1. Community driven – our users have created the majority of the terms, providing the initial scope for DRAO and SRAO.
  2. Community derived – to describe the wide range of resources available in FAIRsharing, we imported subsets of over fifty publicly-available ontologies, many of which have been developed as part of the OBO Foundry.
  3. Community accessible – with over 1400 classes described, these cross-domain application ontologies are available from our GitHub repositories (https://github.com/FAIRsharing/subject-ontology, https://github.com/FAIRsharing/domain-ontology) and are covered by a CC BY-SA 4.0 licence.

Guidelines for the Minimum Information for the Reporting of an Ontology (MIRO)

Nicolas Matentzoglu (EMBL-EBI), James Malone (SciBite), Christopher Mungall (The Lawrence Berkeley National Laboratory) and Robert Stevens (Manchester)

Ontologies need metadata, and we need a minimal list of required metadata for ontologies. They started with a self-made list, and then created a survey that was widely distributed. The stats from that survey were then used to discover what was most important to the community. Reporting items include: basics, motivation, scope, knowledge acquisition, ontology content, managing change and quality assurance.

What was surprising was the number of items that were considered very important and ended up with a MUST in MIRO. The ones with the highest score were URL, name, owner and license (clearly). The bottom three were less obvious: content selection, source knowledge location and development environment.

They then tested retrospective compliance by looking through publications – ended up with 15 papers. The scope and coverage, need, KR language, target audience, and axiom patterns were very well represented. Badly represented were ontology license, change management, testing, sustainability, and entity deprecation policy.

Testing was both not reported and not considered important. Allyson’s note: I think that this is self-fulfilling. There is no really good way to test other than running a reasoner; something like Tawny OWL allows more than this, and could therefore create an interest in actually doing so.

Tawny-OWL: A Richer Ontology Development Environment

Phillip Lord (Newcastle)

Tawny OWL is a mature environment for ontology development. It provides a very different model from other existing methods, and it allows for literate ontology development. Most people use Protege, others use the OWL API. The driving use case was the development of an ontology of the human chromosomes, which are complex to describe but regular: 23 chromosomes, 1000 bands, and the Protege UI can’t really handle the number of classes required.

Tawny OWL is an interactive environment built on Clojure, and you can use any IDE or editor that knows about Clojure / Leiningen. You can then replace a lot of the ontology-specific tools with more generic ones: versioning with git, unit testing with Clojure, dependency management with Maven, continuous integration with Travis-CI.

It allows for literate development because it supports fully descriptive documentation as well as implementation comments (stuff you’d put in code that isn’t meant to be user facing), which wasn’t really possible in the past. Version 2.0 has regularization and reimplementation of the core, patternization support (gems and tiers), a 70 page manual, project templates with an integrated web-based IDE, and is internationalizable.

Automating ontology releases with ROBOT

Simon Jupp (EBI), James Overton (Knocean), Helen Parkinson (EBI) and Christopher Mungall (The Lawrence Berkeley National Laboratory)

Why do we automate ontology releases? Because you have a regular release cycle which triggers the release of other services, and because each release involves creating various versions of the ontology. What happens as part of the release? The desired sections of various ontologies are pulled in, with the desired terms kept in a TSV file.

ROBOT is an ontology release toolkit. It is both a library and a command-line tool. Commands can be chained together to create production workflows. Within EFO, the ROBOT commands are added to the EFO makefile, where the ontology release is treated as a compile step. This allows testing to happen prior to release.

ROBOT commands include merging, annotation, querying, reasoning, template (TSV -> OWL), and verification.
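As a rough sketch of what chaining looks like, the snippet below assembles a single ROBOT invocation that merges two files, reasons over the result and annotates it with an ontology IRI. The file names, the IRI and the choice of the ELK reasoner are placeholders of mine rather than details from the talk, and the command is only built and printed here, not executed.

```python
import shlex


def build_robot_chain(inputs, ontology_iri, output):
    """Assemble one chained ROBOT invocation: merge -> reason -> annotate."""
    cmd = ["robot", "merge"]
    for path in inputs:
        # Each file to be merged gets its own --input option.
        cmd += ["--input", path]
    # Chained commands follow one another on the same command line.
    cmd += ["reason", "--reasoner", "ELK"]
    cmd += ["annotate", "--ontology-iri", ontology_iri, "--output", output]
    return cmd


# Hypothetical file names and IRI, for illustration only.
cmd = build_robot_chain(
    ["edit.owl", "imports.owl"],
    "http://example.org/my-ontology.owl",
    "release.owl",
)
print(shlex.join(cmd))
```

In a makefile-driven release like the one described for EFO, a line of this shape would sit in a make target, so that a failure in any step stops the release.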

Bioschemas Community: Developing profiles over Schema.org to make life sciences resources more findable

Alasdair Gray (Heriot-Watt) and The Bioschemas Community (Bioschemas)

They are asking developers to add 6 minimum properties. The specification is added on top of the schema.org specification. Over 200 people have been involved in a number of workshops. To create a Bioschemas profile, they identify use cases and then map to existing ontologies. Then a specification is created, tested and then applied.

They’ve had to create a few new types which schema.org didn’t have (e.g. Lab Protocol, Biological Entity). 16 sites have deployed this, including FAIRsharing. Will other search engines respect this? The 7 major search engines are using schema markup.
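To give a feel for what this markup looks like on a page, here is a minimal sketch that builds schema.org JSON-LD for a dataset and wraps it in the script tag that crawlers read. The six properties are my own pick of existing schema.org properties, not the actual Bioschemas minimum set, and the record itself is invented.

```python
import json


def dataset_markup(name, description, url, identifier, keywords, license_url):
    """Return schema.org JSON-LD for a dataset as a plain Python dict."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "identifier": identifier,
        "keywords": keywords,
        "license": license_url,
    }


# Invented example record; none of these values come from the talk.
markup = dataset_markup(
    name="Example protein interaction set",
    description="A made-up dataset used to illustrate embedded markup.",
    url="https://example.org/datasets/42",
    identifier="example:42",
    keywords=["protein", "interaction"],
    license_url="https://creativecommons.org/licenses/by/4.0/",
)

# JSON-LD is embedded in a page inside a script element of this type.
html_snippet = (
    '<script type="application/ld+json">\n'
    + json.dumps(markup, indent=2)
    + "\n</script>"
)
print(html_snippet)
```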

 


UKON 2018: Session 2

Gender: It’s about more than just gonads.

Phillip Lord (Newcastle)

He begins with a story – what does LGBTQIA+ mean? How do you define this in an ontology? Perhaps start with something simpler… This is about social modelling. Modelling this is a challenge because it is important, and complicated, and sensitive.

First you need to consider gender versus sex. Newcastle has one of the 7 gender dysphoria clinics in the UK. ICD-10 has a classification of disease called “trans-sexual” which has been removed in ICD-11 because it is not a disease. You also have PATO, which describes gender – among other things. PATO’s male and female definitions have their own issues. These definitions are based on gametes, which is problematic – if you are an infertile man you are both female and male (and so on and so forth). Intact, Castrated and other aspects of the PATO definitions have problems. The definition of Castrated Male contradicts the definition of Male.

The beginning of Phil’s ontology is Birth Assigned Gender. Other terms include Affirmed, Man, Woman, pronouns, legal gender and biological gender (biological gender will be dropped). Man and Woman are defined based on your affirmed gender, not your assigned gender.

He’s also started modelling sexuality. The entire area is difficult to model, and is critical for many IT systems, and is very interesting.

SyBiOnt: The Synthetic Biology Ontology

Christian Atallah (Newcastle), James McLaughlin (Newcastle), James Skelton (Newcastle), Goksel Misirli (Keele) and Anil Wipat (Newcastle)

Synthetic Biology: the use of engineering principles for the development of novel biological applications. It follows the classic cycle of build -> test -> learn -> design -> build. Synthetic biology is very modular and includes many different types of biological parts. SBOL is used to visualize and build synthetic biology designs. SyBiOntKB is an example of using the ontology. You can mine SyBiOntKB to get synthetic biology parts. http://sybiont.org

SBOL-OWL: The Synthetic Biology Open Language Ontology

Goksel Misirli (Keele), Angel Goni-Moreno (Newcastle), James McLaughlin (Newcastle), Anil Wipat (Newcastle) and Phillip Lord (Newcastle)

Reproducibility of biological system designs is very important. SBOL has been adopted by over 30 universities and 14 companies worldwide as well as ACS Synthetic Biology. The designs are hierarchical and can be grouped into modules. In order to understand SBOL you need to read the User Guide – it isn’t available computationally. Validation rules are in the appendix, and SBOL refers to external ontologies and CVs to provide definitions. So, how should you formally define this? By providing an ontological representation of the SBOL data model – SBOL-OWL.

Example query – return a list of ComponentDefinitions that are promoters and of type DNARegion and that have some Components.
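To make the shape of that query concrete, here is a plain-Python analogue over a handful of toy records. This is not SBOL-OWL and involves no reasoner: the class, the field names and the example parts are all hypothetical, and in the real setting the same three conditions would be expressed as a class expression and answered by a reasoner over the ontology.

```python
from dataclasses import dataclass, field


@dataclass
class ComponentDefinition:
    """Toy stand-in for an SBOL ComponentDefinition (hypothetical fields)."""
    name: str
    types: set
    roles: set
    components: list = field(default_factory=list)


# Invented example designs.
designs = [
    ComponentDefinition("pTet", {"DNARegion"}, {"promoter"}, ["operator_site"]),
    ComponentDefinition("pLac", {"DNARegion"}, {"promoter"}),
    ComponentDefinition("gfp", {"DNARegion"}, {"CDS"}, ["start_codon"]),
    ComponentDefinition("TetR", {"Protein"}, {"repressor"}),
]


def promoters_with_components(definitions):
    """Names of definitions that are promoters, of type DNARegion,
    and that have at least one Component."""
    return [
        d.name
        for d in definitions
        if "DNARegion" in d.types and "promoter" in d.roles and d.components
    ]


print(promoters_with_components(designs))  # ['pTet']
```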

SBOL-OWL allows computational validation of verification rules, and allows automated comparison of incremental SBOL Specifications. It provides a machine-accessible description of SBOL entities. You can annotate genetic circuit designs with rich semantics, which would allow you to use logical axioms for reasoning over design information.


UKON 2018: Ivana Bartoletti on security versus privacy in data use

Session 2 began with a presentation by Ivana Bartoletti.

What is a good use of data? She works a lot with smart metering, smart homes and connected cities. To address population growth and climate change, we need new thinking. Big Data can help with this, but there are serious privacy concerns about it. Individuals need to be able to discover exactly what data concerns them.

People don’t want to give up Google Maps and Facebook. These free services have become part of our life and help with our daily tasks. Therefore transparency has become more important than ever. The new GDPR is relevant in this context. Privacy terms on websites are difficult to read and are often convoluted.

The new legislation describes the right of erasure. It states that all available technology needs to be used to delete data – this includes contacting any other companies that might have used that person’s data. Transparency is vital, especially as bias can easily creep in when analysing personal data / profiling. GDPR was created to support the digital single market. The free flow of data across the EU is an important discussion point as part of Brexit.

We must redefine the concept of personal data. We should not define personal data as something we own – it’s something we are. If we think of it as a car that we can sell, we are taking the wrong approach. Instead, it’s like our heart – it’s who we are. So, your Facebook account is part of your personality. Shifting this definition can drive and inform the debate about the transaction of data in return for free services.

This will result in a new ethical debate about how personal data is used. Corporations don’t always understand the data, and therefore struggle to govern it properly. You need an open information system with a high level of transparency and interoperability. Ontologies have a very big potential in this area.

Practically speaking, how would changing how we perceive our personal data (as who we are) change our day to day life?

It will be very challenging to remove someone’s personal data from publications / studies. Will papers have to be changed after they’ve been published? You need to consider if it is personal or anonymized data. You need to de-identify data much more carefully. This question shows how important it is to extract data.

There is a conflict between personal data and personal identity. We often use the data to establish the identity rather than just for the purpose of sharing data. Digital identities are an important part of this research.


UKON 2018: Morning Session

Session 1 Chair: Dr Jennifer Warrender

This session contains short 10-minute talks.

Organising Results of Deep Learning from PubMed using the Clinical Evidence-Based Ontology (CEBO)

M. Arguello Casteleiro (Manchester), D. Maseda-Fernandez (NHS), J. Des-Diz (Hospital do Salnes), C. Wroe (BMJ), M.J. Fernandez-Prieto (Salford), G. Demetriou (Manchester), J. Keane (Manchester), G. Nenadic (Manchester) and R. Stevens (Manchester)

They are combining three areas when studying semantic deep learning: natural language processing, deep learning and the semantic web. The purpose of CEBO is to filter the strings into the ones that are the most useful for clinicians. In other words, CEBO filters and organises Deep Learning outputs. It has 658 axioms (177 classes).

Using Card Sorting to Design Faceted Navigation Structures

Ed de Quincey (Keele)

Some of this work is a few years old, but the technique hasn’t been used much and therefore he’d like to present it to make it more visible again. Card sorting begins with topics of content + cards + people who sort them into categories = the creation of information architecture. It can be used to (re)design websites, for example. For the new CS website at Keele, they gave 150 students (in groups of 5) 100+ slides and asked them to categorize. Pictures as well as text can be used, e.g. products. You can also do card sorting with physical objects.

Repeated Single-Criterion Sorting: Rugg and McGeorge have discussed this in a paper. With this technique, because you’re asking them to sort multiple times, you instead use about 8-20 cards at a time. Also, you can get a huge amount of information just by doing this with about 6 people. An example is sorting mobile phones. You ask people to sort the objects into groups based on a particular criterion, e.g. color. Then after sorting, you ask them to sort again, and continue on with a large number of criteria. You ask them to keep sorting until they can’t think of any other ways to sort them. Then you pick a couple at random, and ask the people to describe the main difference between them, which usually gets you another few criteria to sort on.

Overall, this allows you to elicit facets from people, and so to create a user-centered version of faceted navigation. For his work, he looked at music genre, and investigated whether or not it is the best way to navigate music. He asked 51 people to sort based on their own criteria, and got 289 sorts/criteria during this work. This was then reduced to 78 after an independent judge grouped them into superordinate constructs. At first you find commonality around genre, speed and song, but after that it becomes a lot more personal, e.g. “songs I like to listen to in lectures” 😉

Then you can create a co-occurrence matrix for things like gender. There was no agreement with respect to genre, which was interesting. Spotify now supports more personal facets, which weren’t available 8 years ago when this work was first done. As such, this technique could be very useful for developing ontologies.
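One common way to build a co-occurrence matrix from card sorts is to count, for every pair of cards, how many sorts placed them in the same group. I’m assuming that is roughly the kind of matrix meant here; the songs, criteria and groupings below are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Each dict is one sort: a criterion's groups mapped to the cards placed in them.
sorts = [
    {"fast": ["Song A", "Song B"], "slow": ["Song C", "Song D"]},
    {"rock": ["Song A", "Song B", "Song C"], "pop": ["Song D"]},
    {"like": ["Song A", "Song D"], "dislike": ["Song B", "Song C"]},
]


def co_occurrence(all_sorts):
    """Count how often each pair of cards ends up in the same group."""
    counts = Counter()
    for sort in all_sorts:
        for group in sort.values():
            # Sorting the group gives every pair a stable (a, b) key.
            for a, b in combinations(sorted(group), 2):
                counts[(a, b)] += 1
    return counts


matrix = co_occurrence(sorts)
for pair, n in sorted(matrix.items()):
    print(pair, n)
```

Pairs with high counts are cards that people tend to group together regardless of criterion, which is a hint that they may belong under a shared facet value.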

WikiFactMine

Peter Murray-Rust, Charles Matthews and Thomas Arrow (ContentMine)

Peter feels that there is a critical need for Liberation Ontology, and a need to regain control from publishers. Wikidata has about 50 million entities and even more triples, and it’s democratic. He says it is our hope for digital freedom. WikiFactMine (his group) added 13 million new items (scientific articles) to it. There are loads of disparate categories, so if you want ontological content, Wikidata is the first (and only) place to go. A good example of a typical record is Douglas Adams (Q42 – look it up!). Scientific articles can be Wikidata items. They were funded by Wikimedia to set up WikiFactMine for mining anything, but particularly the scholarly literature.

You can create WikiFactMine dictionaries. They are constructed such that there is a hierarchy of types (e.g. the entire animal kingdom in the biology subset). They created a dictionary of drugs just by searching on “drug” and pulling out the information associated with it. There are issues with mining new publications, however. You can then combine dictionaries, e.g. gene, drug, country and virus. By looking at the co-occurrence of country + disease, you may be able to predict outbreaks.
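The dictionary-combination idea can be sketched in a few lines: match two term dictionaries against each document and count which country/disease pairs turn up together. The real dictionaries are derived from Wikidata; the tiny hand-written ones below, the single-word matching and the example sentences are all simplifications of mine.

```python
import re
from collections import Counter
from itertools import product

# Hand-written stand-ins for dictionaries that would come from Wikidata.
dictionaries = {
    "country": {"brazil", "uganda"},
    "disease": {"zika", "ebola"},
}

# Invented snippets standing in for mined article text.
abstracts = [
    "Zika cases were reported in Brazil during the study period.",
    "We sequenced Ebola samples collected in Uganda.",
    "A review of Zika transmission in Brazil and Uganda.",
]


def find_terms(text, terms):
    """Return the dictionary terms that appear as whole words in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return words & terms


def country_disease_pairs(docs):
    """Count documents in which a country and a disease are both mentioned."""
    counts = Counter()
    for doc in docs:
        countries = find_terms(doc, dictionaries["country"])
        diseases = find_terms(doc, dictionaries["disease"])
        for pair in product(sorted(countries), sorted(diseases)):
            counts[pair] += 1
    return counts


pairs = country_disease_pairs(abstracts)
for pair, n in pairs.most_common():
    print(pair, n)
```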

The Right to Read is the Right to Mine. http://contentmine.org

Is there some kind of curation / moderation on Wikidata? There is curation on the properties (the community has to agree to this). With respect to data, if people think it’s too trivial, it can be marked as a candidate for deletion, and discussions can ensue.

A Malay Translated Qur’an Ontology using Indexing Approach for Information Retrieval

Nor Diana Ahmad, Eric Atwell and Brandon Bennett (Leeds)

This work is about improving the query mechanism for retrieval from the Malay-translated Qur’an. Many Muslims, especially Malay readers, read the Qur’an but do not understand Arabic. Most of the Malay-translated applications only offer keyword search methods, which do not help with a deeper understanding. Further, morphological analysis is complicated in Malay, because it has a different structure. They are building a semantic search and an ontology. They wish to improve speed and performance for finding relevant documents in a search query. They have also built a natural-language algorithm for the Malay language.

An ontology + relational database was used, covering ~150,000 words. With keyword search there was 50% precision, and with her new method there was ~80% precision.
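For readers less familiar with the measure, precision is the fraction of retrieved results that are actually relevant. The sketch below uses the standard definition with made-up verse IDs chosen to mirror the two figures above; it is not the evaluation data or code from the talk.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved items that are relevant (0.0 if nothing retrieved)."""
    retrieved = list(retrieved)
    if not retrieved:
        return 0.0
    hits = sum(1 for item in retrieved if item in relevant)
    return hits / len(retrieved)


# Hypothetical verse IDs, for illustration only.
relevant = {"v1", "v2", "v3", "v4"}
keyword_results = ["v1", "v9", "v2", "v8"]          # 2 of 4 are relevant
semantic_results = ["v1", "v2", "v3", "v4", "v7"]   # 4 of 5 are relevant

print(precision(keyword_results, relevant))   # 0.5
print(precision(semantic_results, relevant))  # 0.8
```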


Towards Models of Prospective Curation in the Drug Research Industry

Samiul Hasan (GlaxoSmithKline)

As we think about making precision medicine a reality, it is much more likely that we will fail because of the challenges of data sharing and data curation (Anthony Philippakis, the Broad Institute).

2 important attributes of scientific knowledge management: persistence and vigilance (without access to the right data and prior knowledge at the right time, we risk making very costly, avoidable business decisions). Persistence requires efficient organization, and vigilance requires effective organization. What’s getting in the way of these aspirations is the inconsistent use of language at the source, which creates serious downstream problems. What about implementing reward in data capture steps? How do we not miss vital data later on? Named entity recognition, document classification, reinforcement learning, trigger event detection. You need both vision-based and user-centric software development.

Posters and Demos: 1-minute intros

  • Bioschemas – exploiting schema markup to make biological sources more findable on the web.
  • Document-centric workflow for ontology development – read from an Excel spreadsheet using Tawny OWL and create an ontology which can be easily rebuilt.
  • Tawny OWL – a highly programmatic environment for ontology development (use software engineering tools / IDEs to build up ontologies).
  • Hypernormalising the gene ontology – as ontologies get bigger, they get harder to maintain. Can you use hypernormalization to help this? It is an extension of the normalising methodology.
  • Bootstrapping Biomedical ontologies from literature – from PubMed to ontologies.
  • Meta-ontology fault detection
  • Bioschemas – show the specification and how they’re reusing existing ontologies
  • Get the phenotype community to use logical definitions to increase cohesion within the community (Monarch Consortium)
