SWAT4(HC)LS 2019: Afternoon Presentations and Panel, Day 1

Semantics for FAIR research data management

Keynote by Birgitta König-Ries, introduced by Scott Marshall

Resources covered in this talk that are in FAIRsharing: GBIF, Dryad, PANGAEA, Zenodo, Figshare, PROV-O.

FAIR doesn’t cover everything – for instance, data quality. The Biodiversity Exploratories project has been running for 13 years and, as there is turnover (as with any project), you cannot guarantee that the people who know about your data, and where it is, will still be around. There are three exploratories covering grassland and forest, and the project aims to discover what drives biodiversity. To do this, they need to be able to share and integrate data across different regions and groups.

They state that the FAIR requirements can be a little vague, but as far as they can tell their data is findable, while interoperability and reusability are low – they need some semantics work / standards. They have submitted a paper to PLoS One, “Do metadata in data repositories reflect scholarly information needs?” (accepted with minor revisions). They surveyed biodiversity researchers, who are mainly looking for information on environment, organism, quality, material and substances, process and location. The researchers were not interested in the person or organization providing the data, or the time at which the data was gathered.

Some data sources include GBIF, Dryad, PANGAEA, Zenodo and Figshare. What semantic building blocks do they need? Tools for creating domain models, a data management system that supports semantics, help in providing semantic data (annotations, provenance management), making use of semantics for finding data, and ultimately putting it all together.

They used a process of ontology modularization, selection and customization. You need to align different ontologies in this step, and in aid of this she pointed people to the Biodiversity Track in the OAEI (Ontology Alignment Evaluation Initiative). She described Subjective Logic, which assigns values for belief, disbelief, uncertainty and atomicity (base rate). They applied this to ontology merging to score how trustworthy an information source is (discovering which ontology is causing merge inconsistencies).
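As a rough illustration of the representation behind this (my own sketch, not the authors’ implementation – the trust values below are invented), a subjective-logic opinion can be modelled as a tiny Python class:

```python
from dataclasses import dataclass

@dataclass
class Opinion:
    """A subjective-logic opinion, e.g. about how trustworthy an ontology source is.

    belief + disbelief + uncertainty must sum to 1; the base rate (atomicity)
    is the prior probability assumed in the absence of evidence.
    """
    belief: float
    disbelief: float
    uncertainty: float
    base_rate: float = 0.5

    def __post_init__(self):
        assert abs(self.belief + self.disbelief + self.uncertainty - 1.0) < 1e-9

    def expected_probability(self) -> float:
        # Standard projection of an opinion onto a single probability: E = b + a * u
        return self.belief + self.base_rate * self.uncertainty

# Invented trust opinions for two ontology sources involved in a merge:
source_a = Opinion(belief=0.7, disbelief=0.1, uncertainty=0.2)
source_b = Opinion(belief=0.3, disbelief=0.4, uncertainty=0.3)
print(source_a.expected_probability())  # 0.8
print(source_b.expected_probability())  # 0.45
```

A merge inconsistency traced back to source B would then lower its belief (and raise its disbelief), pushing its trust score down relative to source A.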

BEXIS2 is a data management life cycle platform. It highlights how “varied” the variables in data sets are, even when they are semantically equivalent, e.g. different names for “date”. They provide templates for various data types. Within BEXIS2 is a feature for semi-automatic semantic annotation. Fully automatic approaches might become possible as a result of the project outlined in “Semantic Web Challenge on Tabular Data to KG Matching” by Srinivas et al. They’ve also developed an ontology extending PROV-O and P-Plan. They have a website for visualizing provenance data, and the biodiversity knowledge graph that was presented earlier.
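To make the “different names for date” problem concrete, here is a minimal sketch (my own, not BEXIS2 code) of the kind of suggestion step a semi-automatic annotation feature could perform; the concept IRIs and synonym table are hypothetical:

```python
import difflib

# Hypothetical lookup table mapping known labels to shared concept IRIs.
# In practice this would be backed by ontologies and an annotation service.
CONCEPT_SYNONYMS = {
    "http://example.org/concepts/SamplingDate": {"date", "sampling date", "datum", "collection date"},
    "http://example.org/concepts/PlotID": {"plot", "plot id", "exploratory plot"},
}

def suggest_concept(column_name: str, cutoff: float = 0.8):
    """Suggest a concept IRI for a dataset column by fuzzy-matching its name
    against known synonyms; a human curator then confirms or rejects it."""
    name = column_name.strip().lower().replace("_", " ")
    best_concept, best_score = None, 0.0
    for concept, synonyms in CONCEPT_SYNONYMS.items():
        for synonym in synonyms:
            score = difflib.SequenceMatcher(None, name, synonym).ratio()
            if score > best_score:
                best_concept, best_score = concept, score
    return best_concept if best_score >= cutoff else None

print(suggest_concept("Sampling_Date"))  # .../SamplingDate
print(suggest_concept("biomass_g_m2"))   # None – left for manual annotation
```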

Some of their recent work has involved improving the presentation and navigation of results, using facets to explore knowledge graphs. KGs are tricky because they are large and you are often interested in indirect relationships. You can’t precompute facets, as that becomes impractical. When applying facets to KGs, you need to be able to manipulate the subgraph that is the search result. The result set is a list of IRIs; from it you build candidate facets, removing those that only apply to a small subset of your results. Then you find the candidates that work well as facets (the goldilocks zone of facet size). You can also filter for facets that are not similar to each other (semantically distant), to make the results more interesting. This methodology “works” but is currently too slow, and a better way of traversing the graph is needed to pull in more distant information.
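A toy sketch of the coverage-based filtering step (my own simplification – the triples and thresholds below are made up, and the real system works against a large KG rather than an in-memory list):

```python
from collections import defaultdict

# Toy subgraph around the current result set: (subject IRI, predicate, value).
triples = [
    ("ex:ds1", "ex:type", "ex:Dataset"),      # on every result -> too common to discriminate
    ("ex:ds2", "ex:type", "ex:Dataset"),
    ("ex:ds3", "ex:type", "ex:Dataset"),
    ("ex:ds1", "ex:habitat", "ex:Grassland"),
    ("ex:ds2", "ex:habitat", "ex:Grassland"),
    ("ex:ds3", "ex:habitat", "ex:Forest"),
    ("ex:ds1", "ex:licence", "ex:CC-BY"),     # on a single result -> too rare to be useful
]
results = {"ex:ds1", "ex:ds2", "ex:ds3"}

def candidate_facets(triples, results, min_cover=0.4, max_cover=0.9):
    """Keep (predicate, value) pairs whose coverage of the result set sits in the
    'goldilocks zone': frequent enough to be useful, rare enough to discriminate."""
    coverage = defaultdict(set)
    for s, p, v in triples:
        if s in results:
            coverage[(p, v)].add(s)
    n = len(results)
    return {facet: subjects for facet, subjects in coverage.items()
            if min_cover <= len(subjects) / n <= max_cover}

for (p, v), subjects in candidate_facets(triples, results).items():
    print(p, v, len(subjects))  # only ex:habitat ex:Grassland survives here
```

A further filter on semantic distance between the surviving facets (dropping near-duplicates) would then be applied on top of this.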

Germany has funded the NFDI (German National Research Data Infrastructure) – 30 consortia covering all research areas. They’ve applied to be part of this programme for biodiversity data. The ultimate aim of the NFDI for biodiversity is to build a unified research data commons infrastructure that provides customized semantic data to different user groups (technology services, integration, interoperability).

Making clinical trials available at the point of care – connecting Clinical trials to Electronic Health Records using SNOMED CT and HL7 InfoButton standards

Jay Kola (speaker), Wai Keong Wong and Bhavana Buddala

In FAIRsharing and in this talk: ClinicalTrials.gov, SNOMED-CT, ICD 10, MeSH, CDISC ODM.

This is about “using standards in anger”, i.e. making it work! This started because of a simple clinical problem – “I know our hospital is a large center, but I don’t know what trials we are running, or what trial my patient might be eligible for”. There are two external registries – ClinicalTrials.gov and the UK Clinical Trials Gateway. However, they’re not up to date, and not all local studies/trials are registered on external registries. What did they do? They made it possible for a clinician to bring up all trials a patient is suitable for based on diagnosis, age and gender. It took only 20 minutes to configure/add this functionality to the EHR system.

Their solution is the KeyTrials platform, which is integrated into the hospital EHR system. The integration with the local trial registry was done via a batch import of a text file. How do you make the trials “coded” with tags from an ontology? ICD-10, SNOMED CT and MeSH were considered, and SNOMED CT was chosen: ICD-10 isn’t as rich as SNOMED CT (which is also hierarchical). They also used NLP with SNOMED CT to annotate trials.
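The hierarchy is what makes this coding useful: a trial tagged with a broad SNOMED CT concept should still surface for a patient with a more specific diagnosis. A minimal sketch of that idea (not the KeyTrials implementation – the hierarchy fragment and trial tags are hard-coded for illustration, where a real system would query a terminology server):

```python
# Tiny hand-written fragment of a SNOMED CT is-a hierarchy (child -> parents).
PARENTS = {
    "44054006": {"73211009"},   # type 2 diabetes mellitus is-a diabetes mellitus
    "73211009": {"362969004"},  # diabetes mellitus is-a disorder of endocrine system
}

# Trials tagged (manually or via NLP) with SNOMED CT condition codes.
TRIALS = {
    "TRIAL-001": {"73211009"},   # recruits any diabetes mellitus patient
    "TRIAL-002": {"195967001"},  # asthma trial
}

def ancestors(code):
    """Return the concept itself plus all of its transitive is-a ancestors."""
    seen, stack = {code}, [code]
    while stack:
        for parent in PARENTS.get(stack.pop(), set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def matching_trials(diagnosis_code):
    """A trial matches if any of its condition codes subsumes the patient's diagnosis."""
    patient_concepts = ancestors(diagnosis_code)
    return [trial for trial, codes in TRIALS.items() if codes & patient_concepts]

print(matching_trials("44054006"))  # ['TRIAL-001']: the broader diabetes tag still matches
```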

The HL7 InfoButton standard allows systems to request information from knowledge resources using a pre-defined set of parameters. With their system, a query is passed from the EHR via the InfoButton mechanism to the KeyTrials backend, which sends the matching trials back to the clinician / EHR system.
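For illustration, a knowledge request of this kind could look roughly like the URL built below (a sketch, not their code – the endpoint is invented, and the parameter names follow my reading of the HL7 Infobutton URL-based convention, so check the spec before reusing them):

```python
from urllib.parse import urlencode

# Hypothetical KeyTrials Infobutton endpoint.
KEYTRIALS_ENDPOINT = "https://keytrials.example.org/infobutton"

def build_infobutton_query(snomed_code: str, gender_code: str, age_years: int) -> str:
    params = {
        "mainSearchCriteria.v.c": snomed_code,                # patient's coded diagnosis
        "mainSearchCriteria.v.cs": "2.16.840.1.113883.6.96",  # OID identifying SNOMED CT
        "patientPerson.administrativeGenderCode.c": gender_code,
        "age.v.v": str(age_years),
        "age.v.u": "a",                                       # age unit: years
        "knowledgeResponseType": "application/json",
    }
    return f"{KEYTRIALS_ENDPOINT}?{urlencode(params)}"

# The EHR fires something like this when the clinician clicks the infobutton:
print(build_infobutton_query("44054006", "F", 57))
```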

Importing data is painful if we have to create a connection for every possible system. CDISC ODM doesn’t reflect all of the clinical trial data (e.g. eligibility criteria), and FHIR ResearchStudy is still under development. KeyTrials is open source, so you can use it as you like. Also, post hoc annotation of clinical trials via NLP is avoidable if clinical trials data is coded at the time of creation. They used BioYODIE (a GATE NLP engine). Another issue was that ICD-10 is more prevalent than SNOMED CT – even at their hospital, ICD-10 is used natively. ICD-10 is less granular than SNOMED CT, which can cause issues in mapping existing terms.

KeyTrials and related work is intended to make clinical trials more visible and to increase recruitment in clinical trials. The goal is to make clinicians less likely to miss trials.

A Blueprint for Semantically Lifting Field Trial Data: Enabling Exploration using Knowledge Graphs

Daniel Burkow, Jens Hollunder, Julian Heinrich, Fuad Abdallah, Miguel Rojas-Macias, Cord Wiljes, Philipp Cimiano and Philipp Senger

FAIRsharing records for described resources: JERM, EXACT.

Field trials follow a process with four stages: trial planning, field operations, assessments and analytics/reporting. Based on the type of data, each dataset might end up in one of multiple repositories, so they would like to provide a workflow that overcomes these interoperability issues. To start, you harmonize the data and put it into a KG. The KG is used to find new models (digital experimentation and then analysis). First, they aggregate the data silos using reference ontologies. The KG is then enriched to provide extra value to the end users. Field data has lots of different data types. Their data model is field-trial centric; it isn’t intended to store data, but rather just to model it.

They map data resources onto their semantic business data model. Then, they map the data model onto ontologies from various domains (chemistry, experimental description, geospatial, provenance, biology etc.). Each domain has its own transformation rules and ontologies (EPPO codes, NCBI Taxon, PROV-O, OSPP for soil, EPPO climatic zones, JERM – the Just Enough Results Model ontology, Experimental Actions, OECD, GeoNames, the USDA soil texture triangle). They also have additional sources for cleaning, translation, harmonization and enrichment.
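As a very small sketch of what lifting one flattened trial record into RDF against such ontologies might look like (my own illustration using rdflib – the ft: business-model namespace and its property names are invented; only NCBI Taxon, GeoNames and PROV-O are real vocabularies here):

```python
from rdflib import Graph, Literal, Namespace, RDF, URIRef
from rdflib.namespace import PROV, XSD

# Invented namespace standing in for the field-trial business data model.
FT = Namespace("http://example.org/fieldtrial/")
NCBITAXON = Namespace("http://purl.obolibrary.org/obo/NCBITaxon_")

row = {  # one flattened record from a legacy data silo
    "trial_id": "FT-2018-0042",
    "crop_ncbi_taxon": "4565",       # Triticum aestivum (bread wheat)
    "location_geonames": "2950159",  # a GeoNames feature ID
    "yield_t_ha": 7.3,
}

g = Graph()
trial = FT[row["trial_id"]]
g.add((trial, RDF.type, FT.FieldTrial))
g.add((trial, FT.studiesCrop, NCBITAXON[row["crop_ncbi_taxon"]]))
g.add((trial, FT.hasLocation, URIRef(f"https://sws.geonames.org/{row['location_geonames']}/")))
g.add((trial, FT.yieldTonnesPerHectare, Literal(row["yield_t_ha"], datatype=XSD.decimal)))
g.add((trial, PROV.wasDerivedFrom, FT["silo-A/record-17"]))  # link back to the source record

print(g.serialize(format="turtle"))
```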

They have 500,000 field trials from the last 30 years, with over 600 million nodes and 5.3 billion relations. They make use of Virtuoso, Neo4j, Elasticsearch, Oracle, PostgreSQL, GraphQL, Spotfire, Tableau, R/Shiny and Python/Dash. They built a custom application for data exploration to display the graph data.

Panel discussion (Spotlight on Agrisemantics II)

Chaired by Chris Baker

Introduction to GODAN – Suchith Anand

How do we bridge the digital divide? How do we make opportunities in open science available to people in the developing world? How can we use open data to help smallholder farmers? GODAN focuses on topics such as malnutrition, hunger and poverty through open innovation. GODAN supports global efforts to make agriculture and nutrition data available, accessible and usable for unrestricted use worldwide. There are over 100 GODAN partners.

Open data aids impact – his example is the Landsat data, which was made open in 2009. There are 232 Sustainable Development Goal (SDG) indicators. How can we link these data sets and get meaning and context from all of this information? For researchers in Africa, open data isn’t enough – you need open tools too. Open science is a key enabler for capacity development and empowerment for all, especially in the developing world. In this context, open science includes open data, open standards, open access to research publications, open educational resources, and open software.

It’s a very interesting time to be working in open science (for example, linked open data came to prominence around 2012, and FAIR from 2015 onwards). Lots of things are happening in this area, e.g. the European Open Science Cloud. A question for him is: how can more opportunities be given to students in digital agriculture?

Panel Discussion

Panel speakers: Fernanda C. Dórea, Monika Solanki, Marie Angélique Laporte, Jens Hollunder

MS states that we should reuse existing work more than we do, and spend less time building tools and ontologies from scratch. FCD says there is a struggle for funding for projects where everyone benefits but nobody profits; we should look for initiatives that bring people together.

Q: Are you taking into account indigenous practice? How about local and neglected languages?

MS: She’s not aware of resources as such, but she can see the problem. Building these things requires a lot of investment, and the countries that would need such efforts don’t have a lot of funding. This is a good example of the need for those countries to fund such projects.
FCD: Initiatives directed from developed countries at developing countries sometimes seem like they should instead come from the developing countries themselves.
CH: There is pressure in Brazil surrounding monocultures – there are little or no resources for traditional methods. The governments of developing countries have other things to spend their money on.
MAL: It’s important to keep in mind that the people speaking the local dialects/variants are our end users in many cases.

Follow-up Q: This leads on to another question. Although the UK and US have lots of very large farms, elsewhere in the world most farms are smallholdings. Are we ignoring smallholders (more remote, less internet / connectivity)? I think there are implicit barriers to smallholders, as they won’t have technological access and may not have the educational tools.

JH: Precision farming would not neglect smallholders. Outcome-based models would help the small farms. We need to be constantly creating new data models to correctly match the current state of resources etc.
MS: Agrisemantics is about having a shared conceptualization across the community – anyone can consume such data. Bigger organizations have more technological resources, but within a limited capacity, even smallholders could benefit e.g. having a shared vocabulary to describe things.
MAL: When we developed the Agronomy Ontology, we were looking at the impact of management practices on yield and other characteristics, and smallholders are one of the use cases.
SA: Recently GEO asked members of the indigenous communities to participate in workshops etc, and this worked really well.

Q: Most of the semantic technologies that we work with deal poorly with time, and yet this seems critical, given the changes that are and will happen due to climate change. Is this a problem?

MS: I interpret this as being about the representation of time in the existing suite of ontologies we have – and this has been a big issue. No suitable scalable solution has come up; version control of linked datasets is one example of a partial solution. Temporal reasoning is a very abstract concept at the moment.
JH: I agree this is a problem. Currently the semantics give us a frozen snapshot.

Q: What current Ag Semantic standards are in use, which are obsolete, and what new standardization initiatives are necessary to help with this?

MAL: The Crop Ontology suite of ontologies has been successful so far.
FCD: This isn’t an issue that is just seen in the ag semantics field. In her field, there’s a lot of work on structural interoperability but not a lot of work on semantic interoperability.
MS: AGROVOC is the de facto standard – very comprehensive and multilingual. It’s also lightweight. There are 403 ontologies in AgroPortal; perhaps some of them are obsolete.
JH: These 403 ontologies are each living in their little domain, when there are many more: chemistry, supply chain, management, biology etc. and they all need to come together and have each of these domains talking to each other.
MAL: She drew a comparison to the OBO Foundry and its promotion of shared requirements.

Follow-up Q: Do you think standards have been divided by public and private interests, or national interests?

JH: He feels what can be shared, will be shared.
MS: A lot of ontologies from private companies are currently closed source, so they’re hard to use – so there is a corporate divide.
MAL: It’s totally driven by the funding and what’s available in each country.
FCD: Her experience with the cycle of funding is that if she spends the funding of a 3-year project on ontology development without focusing also on the community surrounding it, the ontology will die at the end of the 3-year project.

Q: Who should pay for the new (to be shared) agrisemantic models?

MS: We can’t expect private industry to pay for the new data model. We need a combination of experts from various areas to create a platform where such things can be discussed. Comparison with schema.org.
FCD: On the one hand, people might ask, “What does making my data open do for me?” On the other hand, in the EU there is more of a push towards open data, which provides impetus.
JH: We should have tangible use cases and start from there – money is given more easily for such use cases.
CH: In the machine learning world, many of the major frameworks are developed by the private sector and then donated to the public. Why is this not happening here?

Q: To what extent is the digitization of agriculture (together with agri semantics) a case of technology push that does not correspond to the real needs of farmers or the environment?

MAL: We need to keep in mind that our users are the farmers, and if we do so then we will make tools that will benefit them.
FCD: Make sure that what you’re doing actually matches the use case.
MS: If this happens, it’s a problem with the people making the technology, as they are not listening to the farmers. Alternatively, farmers might hold back and not provide all of the information they should. The Farm Business Survey (run by the government) is used in the UK to gather such data. Perhaps this also happens in other countries.

Q: What role can semantic technologies play in agricultural biosecurity?

FCD: You can only go so far in aggregating data without using semantics – so at some stage any field, including biosecurity, would find semantics useful. However, converting data into a format useful for semantic technologies carries a time penalty, which makes it harder to use in situations with a short time frame.

Q: There is a lot of hype and focus on healthy and sustainable food. Consumers want to be well informed about these things. Is there an opportunity for semantic data with respect to these questions?

MS: She has a couple of papers in this area (traceability in the food system).

Q: What is the role of semantics in limiting / accelerating the advance of Ag IoT?

JH: There are a lot of vendors / devices / “standards” – so that is one limitation. It also depends on the sensor being used.
MS: This area does need a lot more work.

Q: What is lacking in the current state of the art in agrisemantics?

MS: The livestock domain still requires some work.
FCD: I agree with this.
JH: What is also missing is easy access for end users to interact with this technology. This is a hurdle for increased uptake.

Please also see the full programme. Please note that this post is merely my notes on the presentations. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any issues you may spot – just let me know!

Author: Allyson Lister

Find me at https://orcid.org/0000-0002-7702-4495 and https://www.eng.ox.ac.uk/people/allyson-lister/
