Summer Update: busy days behind us, busy times ahead

As August looms, and most of academia in Europe slows down as researchers take leave to escape the heat, there are a few things I’d like to draw your attention to as you scroll through your news feeds in a more relaxed manner.

  1. Apply for the second ambassador call! Are you interested in becoming an RDA / EOSC Future Domain Ambassador? The second call runs through mid-September, so please consider applying!
  2. My ambassadorship is off and running… I’ve tweeted and chatted, and I’ve emailed everyone I know announcing both the Ambassadorship and the FAIRsharing Community Curation Programme, and we’ve had lots of great applications to join. We’ll announce in September who our first intake of community curators will be, and it’s very exciting!
  3. What advice would I give? I was asked on Twitter this week what advice I would give to people considering applying for the ambassadorship. Trying to condense that into a tweetable reply was hard for one as verbose as I am, but I finally managed.

My advice: how can you create a network of experts to promote data sharing & for domain-specific engagement w/ @EOSCFuture products using @resdatall @RDA_Europe? You can define that domain in *any way*. What’s important TO YOU? See what we chose in call #1— Dr. Allyson Lister (@allysonlister) July 29, 2022

FAIRsharing has gone through lots of changes this academic year – we’ve completed the overhaul of our front- and back-end, and we’ve added loads of new features. We’re taking a break for August, and we’ll see you again in September! We have more ideas than you can shake a stick at, and we aim to implement a number of new features in the coming year, so stay tuned!

Housekeeping & Self References RDA

Talking about ontologies, FAIRsharing and the OeRC on the FAIR Data Podcast #17

Rory MacNeil’s FAIR Data Podcast began in February this year with Julie Goldman talking about FAIR in the context of the Harvard Library Research Data Management Program. In weekly episodes, Rory has interviewed a number of interesting people and covered a variety of topics, from data stewardship and libraries to data management plans and enabling FAIR data.

What stands out for me is that, although every interviewee has come to FAIR from a different starting point (e.g. historians, biologists, neuroscientists, and even a PhD that looked into the composition of human teeth (thank you Esther Plomp!)), we have all reached FAIR as the natural progression of our careers. This tells me that wherever you are within research, FAIR is vital.

Rory asked me to contribute an episode last month, and we were able to release it on June 8 (I’m episode #17) to coincide with the announcement that I’ve been named one of five RDA / EOSC Future Domain Ambassadors. It’s a great series, and I encourage you all to have a listen! I enjoyed the experience – he had some great questions and I was able to spend lots of time talking about ontologies (always a bonus!). I even borrowed my friend’s fancy microphone, so you could tell I was looking forward to it… Generally, we discussed ontologies and how they relate both to my work and to FAIR, as well as my efforts with FAIRsharing, its community curator programme, the RDA and the Data Repository Attributes WG I co-chair, and how the OeRC’s interdisciplinary ethos helps the researchers within it.

My favourite episodes are

  • Episode 2: Chris Erdmann on FAIR in the American Geophysical Union (AGU) and his really interesting past work
  • Episode 4: Sarah Jones on FAIR in Geant
  • Episode 10: Laura Klinkhamer & Niamh MacSweeney discussing ReproducibiliTEA! in Edinburgh and engagement across all levels of the university
  • Episode 11: Esther Plomp talking about the vital role FAIR plays in data stewardship, and how the lines between stewardship, curation, and data creation (via the traditional “researcher” role) are blurring

Please also check out the latest episode (#18) with Kazu Yamaji, Professor at the National Institute of Informatics in Japan and one of the other co-chairs of the Data Repository Attributes WG.

Housekeeping & Self References In The News Outreach

Introducing the new RDA / EOSC Future Domain Ambassadors

I’m excited to announce that I’ve been appointed one of five newly-minted RDA / EOSC Future Domain Ambassadors! We have different domains of interest and different goals (I’ll get onto those in a moment), but what brings us together is a common interest in engagement, both within our communities and globally across everyone interested in research data. In my role as the FAIRsharing Content & Community Lead, you’ll see posts on the FAIRsharing blog and various other outreach and social media methods (please find me and FAIRsharing on Twitter!). However, I’d like to also provide some occasional additional notes here, on my personal blog, in a more informal way.

Who are the RDA / EOSC Future Domain Ambassadors?

You can find us all listed on our official homepage, but here is a short summary of everyone, and what they plan to do.

Geta Mitrea RDA/EOSC Future Domain Ambassador for Social Sciences (Education): University „Stefan cel Mare” of Suceava (Romania). She will be actively promoting what EOSC Future and RDA have to offer, acting as a multiplier across disciplinary networks at regional, national and thematic level (regionally in the North-East area of Romania, and nationally for the sociology / social work / social sciences domain).

Pedro Frietas Mendes RDA/EOSC Future Domain Ambassador for Chemical Engineering / Catalysis. He will focus on increasing FAIR awareness and on working on guidelines for FAIR data sharing in catalysis, and finally on setting the groundwork for FAIR data sharing in his domain.

Anne Cambon-Thomsen RDA/EOSC Future Domain Ambassador for Medical Science. Increasing the awareness and implementation of Open Science and especially Open Data in human immunogenetics, genetics, bioethics and research ethics communities, through participation in meetings and education schemes in these domains.

Francis P. Crowley (GCPA & SIDCER): RDA / EOSC Future Domain Ambassador for Ethics & Law: the goal of this ambassadorship is to evaluate, explain, educate, and promote the use of the EOSC best practices in ethics and law, while advancing the understanding of EOSC Future with scientists and research institutions as well as with patients, communities, health advocates, and the public, providing important emphasis on those instances otherwise under-represented in initiatives for Open Science.

Allyson Lister (0000-0002-7702-4495): RDA / EOSC Future Domain Ambassador for standards, repositories and policies: I will focus on enriching the relationships among standards, repositories and policies through the development of a FAIRsharing Community Curation Programme. This programme will draw upon the collective domain expertise within the RDA and EOSC to build a network of community curators who will “champion” their domain of interest, and gain attribution and recognition in return.

What does it mean to be the RDA / EOSC Future Domain Ambassador for standards, repositories and policies?

In this first post, I am covering what it means to be a Domain Ambassador, and a short summary of how I plan to fulfil my role as the RDA / EOSC Future Domain Ambassador for standards, repositories and policies. In later posts I will look at the specific tasks that the RDA and EOSC Future had in mind when they developed the Ambassador programme, and describe how they relate to my particular appointment.

Engaging and enriching researcher networks

“[The ambassadorships targeted] recognised domain experts and skilled communicators [so that they can] serve as conduits between the RDA, EOSC Future and the wider domain specific communities they belong to.”

Ambassador homepage

There are many worthy and respected projects, infrastructures and communities available worldwide, and it can seem confusing as we try to navigate them all and utilise them to the best of our ability. We ambassadors are all active within the RDA, and can use the network of RDA volunteers to achieve our goals for each of our domains. It might be interesting to note, however, that one requirement of applying for an ambassadorship is that we cannot be a part of the EOSC Future consortium. The idea is to improve collaboration among our communities, the RDA and EOSC Future, but it was important that the ambassadors bring an outside perspective.

As part of my ambassador activities, I will contact the five EOSC ESFRI cluster projects and work from within the RDA itself (e.g. at the RDA Plenary later this month as part of IDW2022) with the aim of promoting these community standards, repositories and policies across research areas by enriching their descriptions within FAIRsharing. To do this, I will be expanding the newly-created FAIRsharing Community Curation Programme. You can see what this programme is by reading the account of one of our early-adopter community curators, Lindsey Anderson. (Please let us know if you’d like to be a part of this!)

Communication and knowledge exchange

By ensuring this communication and knowledge exchange, the grants will support the building of awareness-raising around the work and outputs of EOSC Future and also aim to help discipline experts grow their network and collaborate with like-minded people and organisations via the RDA.

Ambassador homepage

One of my activities as an ambassador will be, together with the community curators, to create training material, support spaces, and other material to continue the programme curators’ (and their organisations’) engagement with FAIRsharing and therefore EOSC information aggregators. The more completely each community curator’s ecosystem of standards, repositories and policies is described within FAIRsharing, the more accessible and discoverable those resources will be, not only to other research and research-support roles (such as data stewards), but also to information aggregators (e.g. our partnership with the OpenAIRE knowledge graph) and third-party FAIR evaluation and assessment tools (such as the Data Stewardship Wizard).


Successful applications should ensure that benefits are brought back to the disciplinary communities and the domain specific needs and input are streamlined into the work supporting the development of EOSC.

Ambassador homepage

My final activity is outreach, where I will give presentations and utilise social media to present the results of this ambassadorship. Through the domain ambassadorship, and through the coordinated efforts of a network of engaged community curators, we will enrich the community “space” within FAIRsharing, but more importantly improve discoverability, functionalities and connectivity of RDA/EOSC outputs.

You can read about all five of us on the Ambassador page, as well as a more detailed summary of my particular domain (standards, repositories and policies) and anticipated activities at my specific ambassador announcement.

Although the funding period runs from now to September 2023, FAIRsharing will continue to support and promote the Community Curation Programme as the network of curators and the growth of their own ‘subject-agnostic community curation space’ will sustain the programme for the long term.


The one where a six-year-old asked for graphs!

I call it a complete and resounding success when, while giving a STEM Ambassador presentation, a little girl at the back raises her hand to say “When will we see the graph?”

The graph in question – and the kids are right; who doesn’t want a good graph?

Early on in the presentation, I was using my hands to show how climate has natural variations (ice ages and the like), and that a changing climate is part of the Earth’s normal patterns. (This was a prelude to me saying that, since the Industrial Revolution, we have added human factors that go beyond these normal patterns.) I then, rather grandly, proclaimed that the children didn’t need to see graphs and that, as lovely as they are, it might be a couple of years before they start learning about graphs.

I should have known better than to say that in front of key stage 1 children (or perhaps it was the perfect thing to say?), because that immediately got a pantomime-like reaction of “We need to see the graphs!” from most of the 70-odd kids in the audience. I laughed and moved on, but a few minutes later, the 6-year-old in the back raised her hand and asked her question. And so, using a slide I had been intending to gloss over, I showed them the graph.

Good graphs are awesome. And kids are pretty awesome. Now all you need to do is become a STEM ambassador and have your own awesome graph-based experiences.


RDA Data Repository Attributes Working Group (DRAWG) officially endorsed by the RDA Council

Today the Research Data Alliance (RDA) Council endorsed the Data Repository Attributes WG. This working group will begin straightaway, and run until 3rd September 2023. It’s exciting for me because, although I have been a member of the RDA for about a year (and have participated informally in RDA working groups before this date), this is the first time that I have co-chaired a group.

The DRAWG chairs are Michael Witt, Kathleen Shearer, Allyson Lister, Matthew Cannon, Washington Luís Ribeiro de Segundo, and Kazu Yamaji. Together with our members (who already number over 30!), we will create a list of common attributes for research data repositories and provide examples of current approaches for the discoverability of those attributes. There will be two documentary outputs over the course of the active period of the WG. As taken from our WG homepage, these outputs are:

  1. a list of common descriptive attributes of a data repository with 
    • a definition of each attribute,
    • a rationale for the use and value of each attribute,
    • the feasibility of its implementation, 
    • a gap analysis of its current availability from data repositories
  2. a selection of examples that illustrate the approaches currently being taken by repositories to express and expose these attributes to users and user agents.

Ultimately, we are looking for this list to be endorsed by the RDA and become an RDA Recommendation, with output 2 (the exemplars) being produced as an RDA Supporting Output.

You can view our full case statement, which expands upon this summary by acknowledging the full range of work in this area that has gone before (and the rationale for this particular WG) and our adoption plan and timeline. It’s rather a large job, but one that has created a large amount of interest. I really enjoy the RDA, as it has a wonderful sense of community for researchers, and I look forward to gently leveraging my fellow researchers within the RDA and making full use of their expertise in this area! Indeed, as the comments section shows, a variety of stakeholders such as OpenAIRE, FAIRsFAIR and CoreTrustSeal have already provided their thoughts on the working group.

If you’re interested, add yourself to our working group as a member, and join us for the next 18 months!

In The News Outreach SOC

Finding Ada in Scientific Data: Ada Lovelace Day 2020

Today is Ada Lovelace Day, as run by the lovely people at Finding Ada and as advocated by the STEM Ambassador Programme of which I am a part. I thought about choosing a famous woman from history, or a contemporary of ours to inspire us. But what really caught my imagination was the wonderful conglomerate of women in science that I have met, worked and become friends with since the start of my career.

But how to properly acknowledge them? As an ontologist, my mind immediately leapt to the creation of an ontology; I could describe the women, our various associations, and how we interrelated in specific and intricate detail. I was all ready to do it when I realized that just uploading an OWL file to my blog wasn’t very visually stimulating. Also, I realized that my eagerness to create an ontology would result in my spending far more time on getting it exactly right than I actually have.1 So, although only yesterday I was scoffing at spreadsheets, I ended up using exactly that kind of “unsuitable” method to quickly do what I needed. The graph below can be copied and modified to allow you to correct any errors and add any of us that I (gasp) am bound to have forgotten.

Update 17.10.20: Groups are from: Me (blue), Dagmar (yellow), Melanie S. (green), Melanie C. (purple), Jane (turquoise), Katherine (orange), and Rachael (dark purple). Details of each person, including ORCIDs, are in the graph and further down the page. Previous iterations of the graph are at the end of the post.

Latest version with a few more connections added by Dagmar:

As this is about Ada Lovelace Day, this is a network of women. And, because of the way I have chosen to celebrate women in STEM, it includes all of those women with whom I have worked directly2. (It is necessarily self-centered, though I was certainly not aiming to center myself!) At every stage of my career, I was one of the lucky ones to have female and male bosses who actively sought out excellence based on merit. I want to celebrate the collective research power of women in the field of data science that I have chosen. I hope it’s the interconnectedness of this graph, rather than the small singular point it began from, that gives you an idea of how much of an impact women have had in our research area.3

I’ve included the ORCIDs, where I know them, to allow you all to peruse the outputs of their research should you so desire. For a few I don’t know their ORCIDs, but you’ll just have to take my word for how fantastic they are, with a little detail on just a few of them.

There’s Katherine C., who was my roommate back at Rice University during our undergraduate days. She majored in Computer Science while I was studying Biology; I didn’t understand that funny programming stuff at all back then. Little did I know that I was headed slowly but surely directly down that path! And Katherine’s interest and brilliance were definitely factors when I changed direction from Biology to Bioinformatics and ultimately to the data curation, ontologies, and data science that I’m involved with today. There’s Melanie, who I always seek out at conferences as I look forward to her sharp mind and great company. There’s Maria-Jesus and Claire, who took a chance and gave me my very first job in this industry in my early 20s. Susanna is the most insightful, gregarious and focused woman I have ever met in my career, and constantly amazes me with her understanding of the research community and how to draw the best out of all of us. Dawn was just marvellous in many ways, and a person I felt a secret association with as we both worked in the UK and came from the US. Trish is smart, engaging and I see way too little of her.

At every stage of my career, and in every conference I went to or workshop I, uh, worked at, I found brilliant women who helped each other along. There are many ways to slice a population of researchers – only one of which is by gender – but I am proud of the women with whom I have worked over the years, and this is just one small way to say thank you. Where do I find Ada? In every single woman I’ve worked with.

Thanks, Ladies.

The STEM Ambassador Hub at DEBP challenged Ambassadors to join in today, and to write about who inspired us, though lots of other groups (such as WISE) are taking part. I’d like to think that some aspiring scientists might come across this, and realize that perhaps they wouldn’t be as alone as they might think, if they chose to take a career in STEM. There are gender issues in STEM that should not be forgotten about, but today is about raising up and celebrating.

  1. I have been researching visualization tools for OWL, RDF and similar formats and have yet to find something I am completely happy with. OLS does virtually everything that I need, but you need to install a local version of it if you want to visualize your own ontologies (I think they would be justified in not accepting a conglomeration of female scientists as a community-driven ontology suitable for inclusion on their site!). WebVOWL is beautiful and allows upload of your own OWL files, but I find it difficult to do all the tweaks I would like. All I wanted to do was to provide a website with a list of ORCIDs and have it pop out a suitable bit of RDF or similar as to how all of these researchers were connected (via their publications, organizations, etc). Then I could tweak the resulting RDF and run it through WebVOWL. I even tweeted about it (without success)… But, as I couldn’t find a tool to do that, and I didn’t have the time to write it, I had to find a quicker alternative. To allow the quick conversion of a list of nodes and edges to a nice visualization, I found Flourish, which is what I used to make the graph in this post.
  2. As this day is focusing on women in STEM I have not added nodes for the many men I have worked with, but you know who you are, and you’re great. 🙂
  3. I’ve included ORCIDs where I can find them, and all of the connections are from my memory (which as I have said is faulty), backed up by publicly-available information. In other words, I haven’t added any information that isn’t already out there on the interwebs. However, if you prefer not to be included in the graph, then please do let me know privately and I’ll remove you.
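
For anyone who wants to try the nodes-and-edges route from footnote 1, here is a minimal Python sketch of turning a list of connections into the two tables (one for nodes, one for edges) that network visualization tools commonly take as input. Everything in it is an assumption for illustration: the names, the groups, the column headers and the `build_tables` helper are hypothetical, and this is not the data or the exact format I used for the graph in this post.

```python
import csv
import io

def build_tables(connections, groups):
    """Return (nodes_csv, edges_csv) as CSV text.

    connections: iterable of (person_a, person_b) pairs.
    groups: dict mapping a person to the group (colour) they belong to.
    """
    # Collect every person mentioned in any connection, in sorted order.
    people = sorted({p for pair in connections for p in pair})

    nodes = io.StringIO()
    writer = csv.writer(nodes, lineterminator="\n")
    writer.writerow(["id", "group"])
    for person in people:
        writer.writerow([person, groups.get(person, "unknown")])

    edges = io.StringIO()
    writer = csv.writer(edges, lineterminator="\n")
    writer.writerow(["source", "target"])
    # Sort each pair so (A, B) and (B, A) count as the same edge.
    for a, b in sorted({tuple(sorted(pair)) for pair in connections}):
        writer.writerow([a, b])

    return nodes.getvalue(), edges.getvalue()

# Hypothetical example data (not the real graph).
example_connections = [("Ada", "Grace"), ("Grace", "Ada"), ("Ada", "Hedy")]
example_groups = {"Ada": "blue", "Grace": "yellow"}
nodes_csv, edges_csv = build_tables(example_connections, example_groups)
```

The two CSV strings can then be pasted into a spreadsheet, which keeps the "quick and unsuitable" spirit of the approach while removing duplicate edges along the way.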

Previous Revisions:
Groups were added by: Me (blue), Dagmar (yellow), Melanie S. (green), Melanie C. (purple), Jane (turquoise), and Katherine (orange).


Update after Rachael Huntley (0000-0001-6718-3559) added her connections: see

Latest update by Katherine James ( Please feel free to duplicate+edit, then let me know and I’ll include it here!

Update 16.10.20: Jane’s version plus extra edits by Melanie Courtot and me.

A version with additional small updates by Melanie Courtot and by me! Find it at to duplicate+edit.

Update 16.10.20: The stupendous Jane Lomax (ORCID: 0000-0001-8865-4321) has also extended the graph! Here it is – feel free to move things about, as it’s getting rather crowded now – perfect!

Jane Lomax’s graph from – you know you want to add to it! 🙂

Update 15.10.20: The fantastic Melanie Courtot (ORCID: 0000-0002-9551-6370) has also extended the graph! I’ve added hers here, with permission. Thanks!

Melanie Courtot’s graph from – thank you all!

Update 15.10.20: The amazing Melanie Stefan (ORCID: 0000-0002-6086-7357) has also extended the graph! I’ve added hers here, with permission. Thanks!

Melanie Stefan’s graph from

Update 14.10.20: The fabulous Dagmar Waltemath (ORCID: 0000-0002-5886-5563) has extended the graph! I’ve added hers here, with permission. Thanks!

Dagmar Waltemath’s graph from . Love how much she has added! (And apologies – I really should have added her in the first place!)
The highly non-scientific network of women in STEM that I have had the pleasure of working with over the years. Completely inadequate as I’m sure I’m missing people (feel free to edit my graph and republish via, but my point isn’t so much about the individuals as it is about how, every step of the way, there are women who help each other and lift each other up in science.
Meetings & Conferences Semantics and Ontologies

SWAT4(HC)LS 2019: Morning Presentations, Day 2

All talk and poster papers are available for download at The organizers would like to have everyone’s suggestions for future conferences, and also for topics and tasks for the hackathon tomorrow.

Data Ownership and protection within the FAIR Framework

Keynote by Dov Greenbaum

This is a talk about the legal issues that come with FAIR. The Zvi Meitar Institute focuses on examining the ethical, legal and social implications of new and emerging technologies. Examples include fake news and space law.

Big data: velocity, veracity, variety, volume. And a reminder that if something is free, you’re not the customer, you’re the product. The Hague declaration provides some nice numbers about academic data. By 2020, the expectation is that there will be 44 zettabytes, with 90% being generated in the last 2 years. 2.4 million scholarly articles were published in 2013, or one every 13 seconds. It’s estimated that big data is worth more than $300 billion per year to companies. UNESCO (specifically, the International Bioethics Committee or IBC) is worried about how data is used. Health data comes from many sources: home monitoring, EHR, lab records, etc.
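
As a quick sanity check on the "one every 13 seconds" figure, here is a small Python calculation. It assumes a 365-day year and takes the 2.4 million figure from the talk at face value; the variable names are my own.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds in a non-leap year
ARTICLES_2013 = 2_400_000              # figure quoted in the talk

# Average gap between articles, in seconds.
seconds_per_article = SECONDS_PER_YEAR / ARTICLES_2013
print(round(seconds_per_article, 1))   # prints 13.1
```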

The IBC has a report from 2017 that asked: Do you own your own data? Is it yours to give? What about ownership and possession?

Now think about this in the context of FAIR, particularly the accessible and reusable parts, which leads to a discussion of licensing. Open source licensing has one problem: in order to license data, you must own it legally. The way we own non-tangible items is mostly through intellectual property law. IP includes patents, copyright, trademark, trade secret, and sui generis (a new form of IP law). In 1916, a UK ruling stated “the rough practical test that what is worth copying is prima facie worth protecting”. IP rights do not grant you the right to do anything, they just allow you to grant permission – a negative right rather than a positive right.

Is big data patentable? Then you can license it… Otherwise we need to find another way (see IP methods above). To be patentable, something must be patentable subject matter, have utility (some kind of use, and can’t be “evil”, or a tax cheat – utility in the US is a very weak definition that allows virtually anything to have utility), and be novel and non-obvious. But Big Data doesn’t fit into the patent definition of “a useful process, machine, manufacture, or composition of matter”. This means also that wild plants, DNA, algorithms and laws of nature are unpatentable. In short, Big Data is not patentable subject matter.

Next form of IP law is copyright. This covers original works of authorship fixed in any tangible medium of expression – this includes things like databases. Copyright is automatic and free (unless you want to register it), unlike patents. If you are using AI to create some of your data, you probably don’t have authorship. What is copyrightable? Literary works, musical works, dramatic works, choreographed works, pictorial, graphic, sculptural and sound works, architectural works. Ideas, laws, facts, discoveries are all not copyrightable. Data is facts, and therefore not copyrightable. Copyright protects your expression of an idea, not the idea itself. You cannot copyright the building blocks of knowledge – to promote innovation. There was a Supreme Court case in 1991 that centered around the yellow pages – the Supreme Court decided that there would be no more protection of databases of facts. However, they said you could protect databases by contract, provided the other side agreed to it. For software, the contract can be implicit: it applies as soon as you open a software CD or accept a EULA online. The AWS contract mentions a zombie apocalypse 🙂 (clause 47 I think I heard?!) If you have a factual database, how can you prove that someone has copied your data? The easiest way is to salt – dictionaries have fake words, Google Maps has fake cities, phone books have fake people etc. In short, copyright is no good.
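
To make the salting idea concrete, here is a minimal Python sketch (my own illustration, not something shown in the talk). It assumes a database is just a list of (name, phone number) records; the records, the salt entry and both helper functions are hypothetical.

```python
def salt_database(records, salt_entries):
    """Return a copy of the records with fictitious 'salt' entries mixed in."""
    return sorted(set(records) | set(salt_entries))

def find_copied_salt(suspect_records, salt_entries):
    """Return the salt entries that appear in a suspected copy."""
    return sorted(set(suspect_records) & set(salt_entries))

# Hypothetical factual database: (name, phone number) pairs.
real_records = [("A. Jones", "555-0101"), ("B. Smith", "555-0102")]
# Fictitious entries known only to the database owner.
salt = [("Z. Nobody", "555-0199")]

published = salt_database(real_records, salt)
# Someone compiling the same facts independently would not include the salt.
independent = [("A. Jones", "555-0101"), ("C. Lee", "555-0103")]

copied_hits = find_copied_salt(published, salt)
independent_hits = find_copied_salt(independent, salt)
```

Because the salt entries are invented, finding them in someone else's database is strong evidence of copying, whereas overlapping real facts prove nothing.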

Next form of IP law is trade secrets. This is fine, but you want to be FAIR and trade secrets are the opposite of FAIR. So, if you don’t own big data under IP, how do you license it?

WTO says that you should be able to protect computer programs and compilations of code, e.g. the database itself if not the data. Most countries have not incorporated this yet. The US has the Digital Millennium Copyright Act, but it’s a weird kind of IP right – you can claim ownership of your database only if you protect it with a digital rights management tool. Then, when someone has hacked into it, they have infringed on your rights (kind of backwards really). This is most prominent in DVDs and violating the region protection. The EU database directive contains database protection (1996). They were thinking mostly of betting databases! This is sui generis – a new type of IP protection for databases. However, it really only protects the structure of the database and not the content, and you have to have made a substantial investment and effort. In 2018 an analysis was performed, and academics were very unhappy with the directive as they weren’t sure what classified as violations of the law, and 2/3 thought it had a negative effect – the sui generis right was considered to have a negative effect on legal certainty.

How do you protect data right now? You contract it or you use antitrust law. Ownership grants you benefits and control over that information, but possession does not mean ownership. Possessing data doesn’t mean you own the data. The Hague Declaration is something you can “sign” if you agree, and tries to find solutions to this.

So, what we need is a complete change. We need to think about multiple owners: the patient, the doctor, the informatician. Perhaps we can’t find an ownership right in data – so perhaps we should look at it as having custody rather than ownership of the data. Custody brings liabilities (e.g. GDPR) and responsibilities. Custody also means good stewardship, privacy, security, attribution, FAIR principles. Individuals could transfer some of their rights via informed consent. But this isn’t a full solution, and this is still an ongoing problem.

Q: A lot of life science databases use CC BY, CC BY-SA – is this correct?
A: This is not legal advice. In Europe, via the database directive, databases are protected, so if you take the entire database you are infringing. If you extract data without maintaining the structure of the database, you are not infringing. This is one reason why academics are unsatisfied with the EU database directive.

Q: It’s a creative act to map data to e.g. ontologies. Is this enough for it to count as an act of creation, making it protectable?
A: Modified data that is still factual (e.g. cleaning data, adding ontology terms) does not change the fact that it’s factual – and therefore cannot be protected. Might still fall under the database directive, as described above.

Q: Patients possess (not own) their data. In MS research, patient groups take custody of that data. Is this style the future direction?
A: It makes sense to get informed consent from a group – the group could use a trade secret type that requires an NDA to use, for example.

Documentation Gap in Ontology Creation: Insights into the Reality of Knowledge Formalization in a Life Science Company

Marius Michaelis (speaker) and Olga Streibel, Bayer

For them, ontology creation begins with specification / competency questions, then conceptualization of the ontology via use cases, implementation into a formal knowledge model (ontology), then ontology testing. He will concentrate on the conceptualization phase.

Within the conceptualization phase, you have knowledge acquisition by knowledge engineers who, on the one hand, research explicit knowledge, and on the other hand also elicit tacit knowledge. This process of eliciting knowledge takes a lot of effort and is time intensive. How can you document this stage? Such documentation would encourage collaboration, prevent knowledge loss, and encourage clarity.

Within Bayer, they ran a survey of 13 knowledge engineers and 16 domain experts. He will discuss both the timepoint and nature of how knowledge engineers document. Most start documenting while or after the model is being created, which means during implementation rather than during conceptualization. The respondents also had a roughly equal mixture of structured and unstructured methods (don’t necessarily follow guidelines). But what we want is joint, timely, structured documentation.

His bachelor’s thesis (University of applied sciences Potsdam), “Documentation concept for the exchange of knowledge in the process of creating ontological knowledge models.”

Partitioning of BioPortal Ontologies: An Empirical Study

Alsayed Algergawy and Birgitta König-Ries

BioPortal, AgroPortal, OntoBee, and EcoPortal (Allyson: EcoPortal apparently last updated 2 years ago?) all exist to store publicly-available ontologies.

Most existing studies of BioPortal ontologies focus on ontology reuse and ontology evaluation (the quality of the ontology). A few also consider the partitionability/modularization of ontologies (e.g. ISWC 2011). They also looked at the partitioning of BioPortal records.

Overall workflow: 1. get ontologies 2. transform into owl/obo 3. partition 4. analyse. Some existing partitioning tools include OAPT and PATO(?, not *that* PATO I suppose). They developed OAPT, which has these steps for partitioning: 1. ranking ontology concepts 2. determine cluster heads 3. partition 4. generate modules. In addition, in Manchester they developed a tool called AD (Atomic decomposition). So, they applied AD, OAPT and PATO to BioPortal’s 792 ontologies. For details, see

They discussed how many modules were created within each ontology – you can specify in OAPT the optimal number of partitions. There were three 0-module ontologies (SEQ, SMASH, ENM). Both 0-module and 1-module ontologies don’t seem to be fully developed. 100 ontologies generated 2 modules. Over half of the accessible ontologies (347) can be partitioned into only 5 modules. Most of the ontologies which failed to partition seemed to fail because of some particular characteristics of the ontologies rather than the tools themselves.
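The four partitioning steps listed above (rank concepts, determine cluster heads, partition, generate modules) can be illustrated with a toy sketch. This is not OAPT's actual algorithm: ranking by node degree and assigning concepts to the nearest head by breadth-first search are assumptions made purely for illustration, and the function name and example concepts are hypothetical.

```python
from collections import deque

def partition(edges, k):
    """Toy partitioning of a concept graph into at most k modules."""
    # Build an undirected adjacency map from (child, parent) pairs.
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    # 1. Rank concepts (assumption: by degree, ties broken alphabetically).
    ranked = sorted(graph, key=lambda c: (-len(graph[c]), c))
    # 2. Determine cluster heads: the k best-ranked concepts.
    heads = ranked[:k]
    # 3. Partition: multi-source BFS, so each concept joins its nearest head.
    owner = {h: h for h in heads}
    queue = deque(heads)
    while queue:
        node = queue.popleft()
        for neighbour in sorted(graph[node]):
            if neighbour not in owner:
                owner[neighbour] = owner[node]
                queue.append(neighbour)
    # 4. Generate modules: group concepts by their cluster head.
    modules = {h: set() for h in heads}
    for concept, head in owner.items():
        modules[head].add(concept)
    return modules

edges = [
    ("heart", "organ"), ("lung", "organ"), ("liver", "organ"),
    ("organ", "anatomical_entity"),
    ("neuron", "cell"), ("hepatocyte", "cell"),
    ("cell", "anatomical_entity"),
]
modules = partition(edges, k=2)
```

Concepts that cannot be reached from any cluster head are simply left out of every module in this sketch; a real tool would need a policy for them.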

Annotation of existing databases using Semantic Web technologies: making data more FAIR

Johan van Soest (speaker), Ananya Choudhury, Nikhil Gaikwad, Matthijs Sloep, Michel Dumontier and Andre Dekker

They have hospital data that remains in the hospital together with the analysis, and then they submit results to a central location. This is good for patient privacy but not for researchers wishing to access the data. Therefore it relies on good semantic tagging.

Not all hospitals are academic hospitals, and therefore might not have the resources to add the data system to it. So they provided the hospitals with a tool that separates structure from the terminology conversion – this allows the IT person to do the conversion and the domain expert to do the terminology mapping (R2RML). This works but is high maintenance, so they’ve changed the approach. Instead, keep the original data structure and annotate the existing schema.

Their use case was data from a set of 4000 rectal cancer patients, and they used Ontotext GraphDB 8.4.1. They had an ontology with 9 equivalent classes and 13 value mappings (e.g. “m” means “male”). Two parent classes for each – the ontology class and the SQL table column.
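To make the idea of annotating an existing schema more concrete, here is a minimal sketch, assuming a hypothetical `patient.gender` column and a made-up `ex:` vocabulary (neither comes from the talk). The local value is kept as-is and the ontology term is attached alongside it, which is why nothing is lost in conversion.

```python
# Hypothetical annotations for a local SQL schema; all names are illustrative.
COLUMN_CLASSES = {("patient", "gender"): "ex:BiologicalSex"}
VALUE_MAPPINGS = {("patient", "gender"): {"m": "ex:Male", "f": "ex:Female"}}

def annotate(table, column, raw_value):
    """Attach an ontology class and mapped term to a raw cell value."""
    key = (table, column)
    # Normalise only for lookup; the stored raw value is left untouched.
    concept = VALUE_MAPPINGS.get(key, {}).get(raw_value.strip().lower())
    return {
        "column_class": COLUMN_CLASSES.get(key),
        "raw": raw_value,
        "concept": concept,  # None when no value mapping exists yet
    }

annotated = annotate("patient", "gender", "M")
unmapped = annotate("patient", "gender", "unknown")
```

Returning `None` for unmapped values, rather than raising an error, lets the domain expert see which local codes still need a mapping.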

They are only annotating existing data sources – although they are using non-standard (local) schemas, it does mean there would be no data loss upon any conversion and also they don’t have to make big changes to local systems. Keeping local systems also means that there is a smaller learning curve for the local IT team. They would like to develop a UI that would hide the “ontology-ness” of their work from their users.

FAIRness of openEHR Archetypes and Templates

Caroline Bönisch, Anneka Sargeant, Antje Wulff, Marcel Parciak, Christian R Bauer and Ulrich Sax

HiGHmed was begun with the aim to improve medical research and patient care, and to make data from research and patient care accessible and exchangeable. The project has a number of locations across Germany. Part of this involves the development of an open interoperable and research-compatible eHealth platform to support local and cross-institution patient care and research.

openEHR is a virtual community for transmitting physical health data in electronic form. Archetypes are aggregated into Templates, which then are published and versioned via the CKM (Clinical Knowledge Manager). They have assessed their archetypes and templates in the context of the FAIR principles, and found that they were compliant with 13/15 of the principles.

A Working Model for Data Integration of Occupation, Function and Health

Anil Adisesh (speaker), Mohammad Sadnan Al Manir, Hongchang Bao and Christopher Baker

The Occupational Health Paradigm is a triangle of work-illness-wellness. In Canada they have a NOC Career Handbook with 16 attributes that help define the requirements of various careers. This can be helpful when someone with an injury wishes to change to a different job, but one that is similar enough to their previous job that there isn’t a lot of retraining. A semantic model is populated with coded patient data representing disease (ICD-11), functional impairment (ICF), occupation (NOC), and job attributes (NOC Career Handbook). The NOC coding for the data was done manually initially, and then they developed a coding algorithm to assign the occupations automatically. The algorithm starts with a textual job title and then progresses through a number of steps to get a NOC. They use SPARQL queries and semantic mapping to suggest job transitions to accommodate a functional impairment.

They did some test queries with their model to see if they could classify people in jobs according to their attributes, e.g. if a person has disease J62_8 then what jobs could they do? What job would a patient with visual impairment likely return to (previous job + impairment = new job options)?
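The "previous job + impairment = new job options" idea can be sketched without any semantic tooling. The speakers used SPARQL over a populated model; the version below is only an assumption-laden stand-in using plain Python sets, with invented job attributes and an invented impairment table rather than real NOC or ICF content.

```python
# Hypothetical job attributes and impairment effects; not real NOC/ICF data.
JOB_REQUIREMENTS = {
    "truck driver": {"far vision", "sitting", "hand-arm coordination"},
    "dispatcher": {"sitting", "verbal communication"},
    "warehouse picker": {"far vision", "standing", "lifting"},
}
IMPAIRMENT_BLOCKS = {"visual impairment": {"far vision"}}

def suggest_jobs(previous_job, impairment):
    """Rank jobs the person could still do by similarity to the previous job."""
    blocked = IMPAIRMENT_BLOCKS.get(impairment, set())
    previous = JOB_REQUIREMENTS[previous_job]
    options = []
    for job, needs in JOB_REQUIREMENTS.items():
        # Skip the old job and any job needing an attribute the impairment blocks.
        if job == previous_job or needs & blocked:
            continue
        # Jaccard overlap: more shared attributes suggests less retraining.
        overlap = len(needs & previous) / len(needs | previous)
        options.append((job, round(overlap, 2)))
    return sorted(options, key=lambda item: (-item[1], item[0]))

options = suggest_jobs("truck driver", "visual impairment")
```

Here only "dispatcher" survives the filter, because "warehouse picker" also requires far vision.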

This work could be applicable in finding work for immigrants and newcomers, finding comparable work for people with an acquired disability, or people with accidental injuries that could otherwise end up on long-term disability. The model seems fit for purpose to integrate info about occupational functioning and health.

FAIR quantitative imaging in oncology: how Semantic Web and Ontologies will support reproducible science

Alberto Traverso (speaker), Zhenwei Shi, Leonard Wee and Andre Dekker

Medical images are more than pictures, they are big data. Indeed they are the unexplored big data as many images are stored and not re-used, as well as having more information than is used in the first place. There were 153 exabytes in 2013, and 2,314 exabytes expected in 2020. Within medical imaging, the combination of the image and “AI” results in what is called quantitative imaging. Machine learning is used to create prediction models (e.g. the probability of developing a tumor, or a second tumor).

There is currently no widespread application of “radiomics” technology in the clinic. The data we produce grows much faster than what they can currently do with the models. Radiomics studies lack reproducibility of results. The challenges and some solutions are: models work but only on their own data (fix with multi-centre validation); a lack of data-driven evidence and external validation (fix with TRIPOD-IV models); lack of reproducibility (fix with sharing of metadata in a privacy-preserving way); explosion of computational packages; how can models be shared when data isn’t; poor quality of reporting; lack of standards; a need for domain knowledge (these last three can be fixed by standardized reporting guidelines, increased agreement, and data-driven meta analyses). FAIR quantitative imaging = AI + medical imaging + open science.

“Ontologies function like a brain: they work and reason with concepts and relationships in ways that are close to the way humans perceive interlinked concepts.”

Radiomics Ontology (RO) – 

Appropriate questions for this system: What is the probability of having esophagitis after irradiation in certain patients that received a dose of…? What is the value of Feature X for rectal cancer patients with a poor prognosis computed on an ADC scan?

A FAIR Vocabulary Odyssey

Keynote by Helen Parkinson

The Odyssey has a non-linear plot, and Helen is using the monsters and challenges to hang her topics off of.

Helen asked the question in August: if there are no FAIR-compliant vocabularies, how can you be FAIR? If there aren’t any, then the FAIR indicator cannot be satisfied. Therefore you have a recursive, circular issue 🙂

Why do we need to be FAIR? What tools do we need to be FAIR? How do we know we are FAIR? I2 of FAIR is to use vocabularies that adhere to FAIR principles. EBI is at 273 PB of storage, with 64 million daily web requests in 2018. As with metadata in many projects, the metadata is often troublesome. They would like to build FAIR capable resources (but we’re not quite sure what FAIR capable is yet); acquire, integrate, analyse and present FAIR archival data and knowledgebases (and all the resources are very different – context is important); determine the FAIR context for our resources and data; define what it means to be a FAIR vocabulary.

Within the SPOT group, they develop a number of ontology applications, e.g. that make use of the data use ontology. For Helen, there are temptations associated with formalisms – the more complex the ontology, the more time/money it will cost but you will get a strong semantics. provides a list of current FAIR tools.

Which features of a FAIR vocabulary are already defined in the OBO Foundry? Many are already aligned, but some parts of the Foundry are deliberately not aligned, including: open, defined scope, commitment to collaboration, reused relations from RO, and naming conventions. These are parts of the Foundry that are not, and probably should not be, required features of a FAIR vocabulary. So then they mapped the aligned part of the Foundry to the FAIR principles.

When talking about “deafness”, we need to consider the assay, the disease and the phenotype – and they all need to be connected – making all this interoperate is important. To help, they developed OXO which provides cross-references between ontologies, but it doesn’t tell you anything about the semantics.

FAIR Vocabulary Features- required (from Helen’s POV)

  • Vocabulary terms are assigned globally unique and persistent identifiers plus provenance and versioning information
  • Vocabularies and their terms are registered or indexed in a searchable resource
  • Vocabularies and their terms are retrievable using a standardised communications protocol
  • the protocol is open, free, and universally implementable
  • Vocabularies and their terms are persistent over time and appropriately versioned
  • Vocabularies and their terms use a formal accessible and broadly applicable language for knowledge representation
  • Vocabularies and their terms use qualified references to other vocabs
  • released with a clear and accessible data usage licence
  • include terms from other vocabs – when this happens, import standards should be used.
  • meet domain relevant community standards.

Why should vocabs be FAIR? Ontologies are data too, and data should be FAIR. How do we know we are FAIR? When the vocab has use, reuse, implementation, and convergence.

Where is FAIR in the Gartner Research’s Hype Cycle? 🙂

Q: should FAIRness be transitive? Should FAIR vocabs only import FAIR vocabs?
A: I would like it to be, but it probably can’t always be.

The OHDSI community has to deal with this issue of transitive FAIRness already. Sometimes they import only certain versions of the ontology. They don’t think it’s possible/practical to move the entire healthcare industry to FAIR standards.

Q: What would be a killer application for speeding up and making data FAIR?
A: Proper public/private collaborative partnerships. Getting large organizations on board, with at a minimum a few exemplar organizations. One field might be antimicrobials and another would be neurodegenerative diseases, as they are biologically difficult and traditional routes of pharma haven’t worked as well as hoped.

Afternoon Panel

This panel is about the conference itself and our thoughts on it. During the panel session, topics included:

  • How can we improve gender diversity in the organizing group? From this, corollaries came up such as making the conference more relevant to other parts of the world, e.g. Asia. Equally, this workshop has been organized on a shoestring budget and via people volunteering their time. The question is – what do we want from the conference in the future?
  • Do you see food safety as a potential topic for SWAT4HCLS? Yes, but we need to consider what our scope is, and how adding something like “food safety” would impact upon the number of papers submitted. e.g. there has been a big agrisemantics group last year and this, yet the conference name wasn’t changed.
  • Tutorial logistics should be improved for next year. The community haven’t submitted many tutorials this year. Should we keep them?
  • How do we leverage our knowledge to help the community at large? Should we reach out and bring people in via training?
  • The conference has been going on for over a decade – might some kind of retrospective be appropriate? Perhaps combine it with a change in direction, said one panel member. Perhaps someone could present something lighthearted next year that covers the changes?
  • Should we have a student session? Well, we already have some of them, even presenting already at the conference. We should work more to get students from the local university to participate, as they haven’t really taken up the offer in the past.
  • Should we remove the HC, because we continue to expand into other areas and can’t keep adding letters to the name! If so, what should the name be? Probably shouldn’t change it.
  • Where is it going to be next year? They can’t tell you yet.
  • Should we invite a big hitter from a neighbouring area to pull in more communities? Should we expand to include end users?

Please also see the full programme. Please note that this post is merely my notes on the presentations. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any issues you may spot – just let me know!

Meetings & Conferences Semantics and Ontologies

SWAT4(HC)LS 2019: Afternoon Presentations and Panel, Day 1

Semantics for FAIR research data management

Keynote by Birgitta König-Ries, introduced by Scott Marshall

Resources covered in this talk that are in FAIRsharing: GBIF, Dryad, PANGAEA, Zenodo, Figshare, PROV-O.

FAIR doesn’t cover everything, for instance data quality. The Biodiversity Exploratories project has been running for 13 years and, as there is turnover (as with any project), you cannot guarantee that the same people will be present who will know about your data or where it is. There are 3 exploratories within grassland and forest, and they wish to discover what drives biodiversity. To do this, they need to be able to share data and integrate data in different regions and by different groups.

They state that the FAIR requirements can be a little vague, but as far as they can tell they are findable, but the interoperability and reusability are low – they need some semantics work / standards. They have submitted to PLoS One “Do metadata in data repositories reflect scholarly information needs?” (accepted with minor revisions). They made a survey for biodiversity researchers – they are mainly looking for information on environment, organism, quality, material and substances, process and location. They were not interested in the person or organization providing the data, or the time in which the data was gathered.

Some data sources include GBIF, Dryad, PANGAEA, Zenodo and Figshare. What semantic building blocks do they need? Tools for creating domain models, a data management system that supports semantics, help in providing semantic data (annotations, provenance management), making use of semantics for finding data, and then ultimately put it all together.

They used a process of ontology modularization, selection and customization. You need to align different ontologies in this step, and in aid of this she would like to point people to the Biodiversity Track in the OAEI (Ontology Alignment Evaluation Initiative). She described Subjective Logic, providing values for belief, disbelief, uncertainty, atomicity. They applied this to ontology merging to create scores for how trustworthy an information source is (discovering which ontology is causing merge inconsistencies).
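For readers unfamiliar with Subjective Logic, a binomial opinion can be sketched as below. The constraint that belief, disbelief and uncertainty sum to 1, and the projected probability `belief + base_rate * uncertainty`, are standard in the formalism; how the speakers turned such opinions into trust scores for ontology merging is not covered here, and the class and example values are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Opinion:
    """A binomial opinion in Subjective Logic."""
    belief: float
    disbelief: float
    uncertainty: float
    base_rate: float  # called "atomicity" in older Subjective Logic papers

    def __post_init__(self):
        # The three masses must sum to 1 (allowing for float rounding).
        total = self.belief + self.disbelief + self.uncertainty
        if abs(total - 1.0) > 1e-9:
            raise ValueError("belief + disbelief + uncertainty must equal 1")

    def projected_probability(self):
        # Uncertain mass is shared out according to the base rate.
        return self.belief + self.base_rate * self.uncertainty

# Hypothetical opinion about whether one source ontology can be trusted.
source_trust = Opinion(belief=0.6, disbelief=0.1, uncertainty=0.3, base_rate=0.5)
score = source_trust.projected_probability()
```

With these example values the score is 0.6 + 0.5 × 0.3, i.e. about 0.75.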

BEXIS2 is a data management life cycle platform. It highlights how “varied” the variables are in data sets, even though they are semantically equivalent, e.g. different names for “date”. They provide templates for various data types. Within BEXIS2 is a feature for semi-automatic semantic annotation. Fully automatic approaches might happen as a result of the project outlined in “Semantic Web Challenge on Tabular Data to KG Matching” by Srinivas et al. They’ve also developed an ontology extending PROV-O and P-Plan. They have a website for visualizing provenance data, and the biodiversity knowledge graph that was presented earlier.

Some of their recent work has involved improving the presentation and navigation of results using facets to explore Knowledge Graphs. KGs are tricky as they are large and you are often interested in indirect relationships. You can’t precompute facets as that becomes impractical. When applying facets to KGs, you need to be able to manipulate a subgraph that is the search result. The result set is a list of IRIs, from which you then create candidate results, removing those that only appear in a small subset of your result. Then you find those candidate results that work well as facets (the goldilocks zone of facet size). You can also filter on facets that are not similar to each other (semantically distant), to make the results more interesting. This methodology “works” but is too slow, and you need a better way of traversing the graph a bit to get more distant information.
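The facet selection described above can be approximated in a few lines. This is a simplified sketch, assuming triples are plain tuples and considering only direct properties of the results; the thresholds, function name and example data are all hypothetical, and the semantic-distance filter and indirect relationships from the talk are not attempted.

```python
from collections import defaultdict

def candidate_facets(triples, result_iris, min_coverage=0.5,
                     min_values=2, max_values=5):
    """Pick properties of the result set that would make usable facets."""
    results = set(result_iris)
    if not results:
        return {}
    covered = defaultdict(set)  # property -> results that have it
    values = defaultdict(set)   # property -> distinct values seen
    for subject, prop, obj in triples:
        if subject in results:
            covered[prop].add(subject)
            values[prop].add(obj)
    facets = {}
    for prop in covered:
        coverage = len(covered[prop]) / len(results)
        # Drop properties that appear on only a small subset of the results,
        # and keep those in the "goldilocks zone" of distinct-value counts.
        if coverage >= min_coverage and min_values <= len(values[prop]) <= max_values:
            facets[prop] = sorted(values[prop])
    return facets

triples = [
    ("d1", "habitat", "forest"), ("d2", "habitat", "grassland"),
    ("d3", "habitat", "forest"),
    ("d1", "id", "1"), ("d2", "id", "2"), ("d3", "id", "3"),
    ("d1", "curator", "ann"),
]
facets = candidate_facets(triples, ["d1", "d2", "d3"], max_values=2)
```

In this example "id" is rejected for having as many values as results, and "curator" for covering only one of three results.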

Germany has funded the NFDI (German National Research Data Infrastructure) – 30 consortia covering all research areas. They’ve applied to be part of this project for biodiversity data. The ultimate aim of NFDI for biodiversity is to build a unified research data commons infrastructure that provides customized semantic data to different user groups (technology services, integration, interoperability).

Making clinical trials available at the point of care – connecting Clinical trials to Electronic Health Records using SNOMED CT and HL7 InfoButton standards

Jay Kola (speaker), Wai Keong Wong and Bhavana Buddala

In FAIRsharing and in this talk:, SNOMED-CT, ICD 10, MeSH, CDISC ODM.

This is about “using standards in anger”, e.g. making it work! This started because of a simple clinical problem – “I know our hospital is a large center, but I don’t know what trials we are running, or what trial my patient might be eligible for”. There are two external registries – and ClinicalTrialsGateway (UK). However, they’re not up to date, and not all local studies/trials are registered on external registries. What did they do? Made it possible for a clinician to bring up all trials the patient is suitable for based on diagnosis, age and gender. It took only 20 minutes to configure/add this functionality to the EHR system.

Their solution is the KeyTrials Platform, which is then integrated into the hospital EHR system. The integration with the local trial registry was done with a batch import with a text file. How do you make the trials “coded” with tags / ontology? ICD 10, SNOMED CT and MeSH were considered; SNOMED CT was used. ICD 10 wasn’t as rich as SNOMED CT (which is also hierarchical). They also used NLP and SNOMED CT to annotate trials.

The HL7 InfoButton allows systems to request information from knowledge resources using a pre-defined set of parameters. With their system, a query is passed from EHR to the InfoButton which then goes to the KeyTrials Backend and then sends it back to the clinician / EHR system.
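As a rough illustration of what such a request could look like, the sketch below builds a URL-style query from a diagnosis code, gender and age. The parameter names are my understanding of the HL7 URL-based Infobutton convention and should be checked against the implementation guide; the base URL, function name and example concept identifier are placeholders, not the KeyTrials API.

```python
from urllib.parse import urlencode

# OID commonly used to identify SNOMED CT as a code system.
SNOMED_CT_OID = "2.16.840.1.113883.6.96"

def build_infobutton_url(base_url, snomed_code, gender_code, age_years):
    """Assemble a query string carrying the patient context."""
    params = {
        "mainSearchCriteria.v.c": snomed_code,      # diagnosis code
        "mainSearchCriteria.v.cs": SNOMED_CT_OID,   # code system of that code
        "patientPerson.administrativeGenderCode.c": gender_code,
        "age.v.v": str(age_years),
        "age.v.u": "a",  # age unit: years
    }
    return base_url + "?" + urlencode(params)

# Hypothetical endpoint and example values.
url = build_infobutton_url(
    "https://trials.example.org/infobutton", "363346000", "F", 54
)
```

The knowledge resource would then use these parameters to select matching trials and return them to the EHR.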

Importing data is painful if we have to create a connection for every possible system. CDISC-ODM doesn’t reflect all of the clinical trial data (e.g. eligibilities), and FHIR ResearchStudy is still under development. KeyTrials is open source, so you can use it as you like. Also, post hoc annotation of clinical trials via NLP is avoidable if clinical trials data is coded at the time of creation. They used BioYODIE (GATE NLP engine). Another issue was that ICD 10 is more prevalent than SNOMED CT. Even at their hospital, ICD 10 is used natively… ICD 10 is less granular than SNOMED CT, which can cause issues in mapping existing terms.

KeyTrials and related work is intended to make clinical trials more visible and to increase recruitment in clinical trials. The goal is to make clinicians less likely to miss trials.

A Blueprint for Semantically Lifting Field Trial Data: Enabling Exploration using Knowledge Graphs

Daniel Burkow, Jens Hollunder, Julian Heinrich, Fuad Abdallah, Miguel Rojas-Macias, Cord Wiljes, Philipp Cimiano and Philipp Senger

FAIRsharing records for described resources: JERM, EXACT.

Field trials require a certain process with 4 stages: trial planning, field operations, assessments and analytics/reporting. Based on the type of data, each dataset might end up in one of multiple repositories. Therefore they would like to provide a workflow that overcomes these interoperability issues. To start, you harmonize the data and put it into a KG. The KG is used to find new models (digital experimentation and then analysis). First, they aggregate the data silos using reference ontologies. The KG is then enriched to provide extra value to the end users. Field data has lots of different data types. Their data model is field-trial centric. It isn’t intended to store data, but rather just model it.

They map data resources onto their semantic business data model. Then, they map the data model onto ontologies from various domains (chemistry, experimental description, geospatial, provenance, biology etc). Each domain has their own transformation rules and ontologies (EPPO code, NCBI Taxon, PROV-O, OSPP – soil, EPPO Climatic zone, JERM – just enough results model ontology, Experimental Actions, OECD, GeoNames, Soil Triangle from USDA). They also have additional sources for cleaning, translation, harmonization and enrichment.

They have 500,000 field trials from the last 30 years with over 600 million nodes and 5.3 billion relations. They make use of Virtuoso, Neo4J, ElasticSearch, Oracle, PostgreSQL, GraphQL, SpotFire, Tableau, R/Shiny and Python/Dash. They built a custom application for data exploration to display the graph data.

Panel discussion (Spotlight on Agrisemantics II)

Chaired by Chris Baker

Introduction to GODAN – Suchith Anand

How do we bridge the digital divide? How do we make opportunities in open science available to people in the developing world? How can we use open data to help smallholder farmers? GODAN focuses on topics such as malnutrition, hunger and poverty through open innovation. GODAN supports global efforts to make agriculture and nutrition data available, accessible and usable for unrestricted use worldwide. There are over 100 GODAN partners.

Open data aids impact – his example is the Landsat data, which was made open in 2009. There are 232 Sustainable Development Goals (SDGs) Indicators. How can we link these data sets and get meaning and context from all of this information? For researchers in Africa, open data isn’t enough – you need to have open tools. Open science is a key enabler for capacity development and empowerment for all, especially in the developing world. In this context, Open Science includes open data, open standards, open access to research publications, open education resources, and open software.

It’s a very interesting time to be working in open science (his examples: linked open data in 2012, and then FAIR from 2015 onwards). Lots of things are happening in this area, e.g. the European Open Science Cloud. A question for him is how can more opportunities be given to students in digital agriculture?

Panel Discussion

Panel speakers: Fernanda C. Dórea, Monika Solanki, Marie Angélique Laporte, Jens Hollunder

MS states that we should reuse existing work more than we do – and spend less time building tools & ontologies from scratch. FCD says there is a struggle for funding for projects when everyone benefits but nobody profits. Should look for initiatives that bring people together.

Q: Are you taking into account indigenous practice? How about local and neglected languages?

MS: She’s not aware of resources as such, but she can see the problem. Building these things requires a lot of investment, and the countries that would need such efforts don’t have a lot of funding. This is a good example of the need for those countries to fund such projects.
FCD: Seeing initiatives coming from developed to developing countries sometimes seems a little like they should instead come from the developing country.
CH: There is pressure in Brazil surrounding monocultures – there are little or no resources for traditional methods. The governments of developing countries have other things to spend their money on.
MAL: It’s important to keep in mind that the people speaking the local dialects/variants are our end users in many cases.

Follow-up Q: This leads onto another question. The UK and US have lots of very large farms, but elsewhere in the world it is mostly smallholders. Are we ignoring smallholders (more remote, less internet / connectivity)? I think there are implicit barriers to smallholders, as they won’t have technological access and may not have the educational tools.

JH: Precision farming would not neglect smallholders. Outcome-based models would help the small farms. We need to be constantly creating new data models to correctly match the current state of resources etc.
MS: Agrisemantics is about having a shared conceptualization across the community – anyone can consume such data. Bigger organizations have more technological resources, but within a limited capacity, even smallholders could benefit e.g. having a shared vocabulary to describe things.
MAL: When we developed the Agronomy Ontology, we were looking to see what the impact of management practices is on yield and other characteristics, and this includes smallholders as one of the use cases.
SA: Recently GEO asked members of the indigenous communities to participate in workshops etc, and this worked really well.

Q: Most of the semantic technologies that we work with deal poorly with time, and yet this seems critical, given the changes that are happening and will happen due to climate change. Is this a problem?

MS: I interpret this as a representation of time in the existing suite of ontologies we have – and this has been a big issue. No suitable scalable solution has come up. Version control of linked datasets is one example of such a solution. Temporal reasoning is a very abstract concept at the moment.
JH: I agree this is a problem. Currently the semantics give us a frozen snapshot.

Q: What current Ag Semantic standards are in use, which are obsolete, and what new standardization initiatives are necessary to help with this?

MAL: The Crop Ontology suite of ontologies has been successful so far.
FCD: This isn’t an issue that is just seen in the ag semantics field. In her field, there’s a lot of work on structural interoperability but not a lot of work on semantic interoperability.
MS: AgroVoc is the de facto standard – very comprehensive and multilingual. It’s also lightweight. There are 403 ontologies in AgroPortal, perhaps some of them are obsolete.
JH: These 403 ontologies are each living in their little domain, when there are many more: chemistry, supply chain, management, biology etc. and they all need to come together and have each of these domains talking to each other.
MAL: Comparison to the OBO Foundry – promotion of requirements.

Follow-up Q: Do you think standards have been divided by public and private interests, or national interests?

JH: He feels what can be shared, will be shared.
MS: A lot of their ontologies are currently closed source, so it’s hard to use them – so there is a corporate divide.
MAL: totally driven by the funding and what’s available in each country.
FCD: Her experience with the cycle of funding is that if she spends the funding of a 3-year project on ontology development without focusing also on the community surrounding it, the ontology will die at the end of the 3-year project.

Q: Who should pay for the new (to be shared) agrisemantic models?

MS: We can’t expect private industry to pay for the new data model. We need a combination of experts from various areas to create a platform where such things can be discussed. Comparison with
FCD: On the one hand, people might think what does making my data open do for me? Equally, in the EU there is more of a push towards open data providing impetus.
JH: we should have tangible use cases and start from there – money is given more easily with such use cases.
CH: In the machine learning world, many of the major frameworks are developed by the private sector and then donated to the public. Why is this not happening here?

Q: To what extent is the digitization of agriculture (together with agri semantics) a case of technology push that does not correspond to the real needs of farmers or the environment?

MAL: We need to keep in mind that our users are the farmers, and if we do so then we will make tools that will benefit them.
FCD: Make sure that what you’re doing actually matches the use case.
MS: It’s a problem with the people who are making the technology if this happens, as they are not listening to the farmers. Or alternatively, farmers might hold back and not provide all of the information they should. The Farm Business Survey is used in the UK (run by the government) to get data. Perhaps this also happens in other countries.

Q: What role can semantic technologies play in agricultural biosecurity?

FCD: You can only go so far in aggregation of data without using semantics – so at some stage any field including biosecurity would find semantics useful. However, often converting it into a format useful for semantic technologies might have a time penalty making it harder to be used in situations with a short time frame.

Q: There is a lot of hype and focus on healthy and sustainable food. Consumers want to be well informed about these things. Is there an opportunity for semantic data with respect to these questions?

MS: She has a couple of papers in this area (traceability in the food system).

Q: What is the role of semantics in limiting / accelerating the advance of Ag IOT?

JH: There are a lot of vendors / devices / “standards” – so that is one limitation. It also depends on the sensor being used.
MS: This area does need a lot more work.

Q: What is lacking in the current state of the art in agrisemantics?

MS: The livestock domain still requires some work.
FCD: I agree with this.
JH: What is also missing is the easy accessibility for end users to communicate with this technology. This is a hurdle for increased uptake.

Please also see the full programme. Please note that this post is merely my notes on the presentations. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any issues you may spot – just let me know!

Meetings & Conferences Semantics and Ontologies

SWAT4(HC)LS 2019: Morning Presentations, Day 1

These are my notes of the first day of talks at SWAT4HCLS.
Apologies for missing the first talk. I arrived just as the keynote by Denny Vrandečić titled “Wikidata and beyond – Knowledge for everyone by everyone” was finishing (the first train from the south only arrived in Edinburgh at 9am). (FAIRsharing Wikidata record.)

Enhancing the maintainability of the Bio2RDF project using declarative mappings

Ana Iglesias-Molina (speaker), David Chaves-Fraga, Freddy Priyatna and Oscar Corcho

FAIRsharing record for Bio2RDF.

The current Bio2RDF scripts are ad hoc PHP scripts. They are proposing the use of OBDA technologies (via declarative mapping). The workflow for OBDA includes 1) writing a mapping file for the relationships between data source and ontologies (e.g. in RML, R2RML), and 2) using a mapping engine to transform the data sources into knowledge. Bio2RDF data source formats are mainly CSV/XLSX, then XML. They are developing a mapping engine for CSV/XLSX using a tool called Mapeathor. Mapeathor is used to generate the knowledge graph by mapping columns from the spreadsheets into appropriate triples.

They wish to increase the maintainability of the data transformation using the OBDA options, which 1) enables the use of different engines, and 2) creates a knowledge graph ready for data transformation and query translation. They would like to improve the process to deal with the heterogeneity of the data.
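To make the idea of a declarative mapping a bit more concrete, here is a minimal sketch of my own in Python. It is not Mapeathor, RML or R2RML: the mapping layout, the `example.org` IRIs and the function name are all assumptions made up for illustration. The point is only that the transformation logic stays generic while the column-to-predicate relationships live in a separate, editable structure.

```python
import csv
import io

# Hypothetical declarative mapping: which column identifies the subject,
# and which predicate IRI each remaining column maps to.
# All IRIs here are invented for illustration.
MAPPING = {
    "subject_column": "id",
    "subject_template": "http://example.org/gene/{}",
    "predicates": {
        "symbol": "http://example.org/vocab/symbol",
        "taxon": "http://example.org/vocab/taxon",
    },
}


def rows_to_ntriples(csv_text, mapping):
    """Turn CSV rows into N-Triples lines, driven only by the mapping."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = mapping["subject_template"].format(row[mapping["subject_column"]])
        for column, predicate in mapping["predicates"].items():
            value = row.get(column, "")
            if value:  # skip empty cells rather than emit empty literals
                escaped = value.replace("\\", "\\\\").replace('"', '\\"')
                triples.append(f'<{subject}> <{predicate}> "{escaped}" .')
    return triples


# Made-up sample spreadsheet content; the second row has no taxon value.
SAMPLE = "id,symbol,taxon\n1,BRCA2,9606\n2,TP53,\n"

for line in rows_to_ntriples(SAMPLE, MAPPING):
    print(line)
```

Changing what gets generated then means editing `MAPPING` rather than the script, which is roughly the maintainability argument made in the talk.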

Suggesting Reasonable Phenotypes to Clinicians

Laura Slaughter (speaker) and Dag Hovland

HPO in FAIRsharing.

They support communication between pediatrician and geneticist, to provide a more complete picture of the patient, but not to replace expertise of the physicians. HPO is used to specify abnormalities.

Their workflow: a pediatrician in intensive care suspects a newborn of having a genetic disorder. Firstly, they need to get consent via a patient portal (DIBS – electronic health record). A requisition form has a subset of HPO codes that the pediatrician can select, and then the form and samples are sent off to the lab. Reporting the HPO codes to the lab helps the lab with their reporting and identification.

Phenotips is one HPO entry system (form-based ontology browse and search). Also extant is a natural language mapping from text. A third is Phenotero, which is a bit of a mixture of the two. When they started, the clinicians wanted to use Phenotips. Another related system is the Phenomizer, which offers a different perspective as it helps with the process of differential diagnosis. The authors thought they would just provide a service where they would suggest additional HPO codes to clinicians. But when they started to work on it, they had to make a new user interface after consultation with the clinicians. Additionally, they discovered that they would also need to provide a differential diagnosis feature.

There were a number of issues with the system that existed before they started. There was an overwhelming number of HPO codes for clinicians to sort through. There was no consistency checking or use of the HPO hierarchy. The NLP detection had a low accuracy and had to be supplemented with manual work. There was also no guidance for prompting for a complete picture of the patient or further work-up (available in Phenomizer).

They suggested that they provide a simple look-up mechanism using disease-HPO associations. Suggestions for clinicians come in the form of HPO codes that point to where further work-up might be needed. They also needed to implement ordering of HPO code candidates, and they did this by using disease information to inform priority settings, e.g. measuring the specificity of the disease given the phenotype entered by the clinician.

They order diseases in increasing order, by the ratio of unselected phenotypes. There is a balance to find between giving the clinician a bias too early, or alternatively only being able to provide feedback in very specific circumstances. They implement their work using a reasoner called Sequoia, input forms and annotation files.
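As I understood it, the ordering works roughly like the sketch below. This is my own reconstruction in Python, not the authors' implementation (which uses the Sequoia reasoner and real annotation files). The disease names are placeholders, the disease-to-HPO associations are invented, and dropping diseases that share no phenotype with the clinician's selection is an assumption on my part.

```python
# Hypothetical disease -> HPO code associations. The disease names are
# placeholders and the code sets are illustrative only.
DISEASE_PHENOTYPES = {
    "disease_A": {"HP:0001250", "HP:0001263", "HP:0000252"},
    "disease_B": {"HP:0001250", "HP:0004322"},
    "disease_C": {"HP:0000365", "HP:0000505", "HP:0001263", "HP:0004322"},
}


def rank_diseases(selected, disease_phenotypes):
    """Return (disease, unselected ratio, unselected codes), lowest ratio first."""
    ranked = []
    for disease, codes in disease_phenotypes.items():
        if not codes & selected:
            continue  # assumption: ignore diseases with no overlap at all
        unselected = codes - selected
        ratio = len(unselected) / len(codes)
        ranked.append((disease, ratio, sorted(unselected)))
    # Sort by ratio, then by name so that ties are ordered deterministically.
    return sorted(ranked, key=lambda item: (item[1], item[0]))


# Codes the clinician has already entered (again, purely illustrative).
selected_codes = {"HP:0001250", "HP:0001263"}

for disease, ratio, suggestions in rank_diseases(selected_codes, DISEASE_PHENOTYPES):
    print(f"{disease}: {ratio:.2f} unselected, candidate codes: {suggestions}")
```

The unselected codes of the highest-ranked diseases are then the natural candidates to suggest for further work-up.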

They are working with a geneticist and clinicians to find the best method for generating suggestions and evaluate the UI. They’re also exploring the ORDO Ontological Module (HOOM), which qualifies the annotations between a clinical entity from ORDO and phenotypic abnormalities from HPO according to frequency and by integrating the notion of diagnostic criteria.

A FHIR-to-RDF converter

Gerhard Kober and Adrian Paschke

FHIR in FAIRsharing. RDF in FAIRsharing.

FHIR is an HL7 standard with more than 100 resources defined. A FHIR-Store is a storage container for different resources, and they would like to run SPARQL queries over the result set. Because FHIR resources are meant to facilitate interoperability (but not semantic interoperability), storage in RDF is not possible. They are implementing a system architecture that would have a FHIR-to-RDF converter sitting in between the client and the HL7 FHIR stores. This would allow the client to interact with RDF.

They have used the HAPI-FHIR and Apache-Jena libraries. The data is transformed from FHIR-JSON to Apache-Jena-RDF-Model. Searches of FHIR resources are returned as JSON objects. Performance is critical, and there are two time-consuming steps, the HTTP call to the FHIR store and the conversion from FHIR to RDF, either of which might be a bottleneck. To alleviate this, queries to the FHIR store should be specialized. They also need to check if the transformation to Apache-Jena is too expensive.
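The conversion step can be pictured with a small sketch. The authors work in Java with HAPI-FHIR and Apache Jena; the Python below is only my own simplified illustration, and it does not follow the official FHIR RDF serialization. The `example.org` base IRI, the dotted predicate naming and the function name are assumptions.

```python
import json

FHIR_NS = "http://hl7.org/fhir/"  # namespace IRI used as a predicate prefix


def fhir_json_to_triples(resource, base="http://example.org/fhir/"):
    """Flatten a FHIR JSON resource into (subject, predicate, object) tuples."""
    rtype = resource["resourceType"]
    root = f"{base}{rtype}/{resource['id']}"
    triples = [(root, "rdf:type", FHIR_NS + rtype)]
    counter = 0

    def walk(subject, path, value):
        nonlocal counter
        if isinstance(value, dict):
            # Nested objects become blank nodes hanging off the parent.
            counter += 1
            node = f"_:b{counter}"
            triples.append((subject, FHIR_NS + path, node))
            for key, child in value.items():
                walk(node, f"{path}.{key}", child)
        elif isinstance(value, list):
            # Repeating elements yield one statement per item.
            for item in value:
                walk(subject, path, item)
        else:
            triples.append((subject, FHIR_NS + path, value))

    for key, value in resource.items():
        if key not in ("resourceType", "id"):
            walk(root, f"{rtype}.{key}", value)
    return triples


# A made-up Patient resource, as it might come back from a FHIR store search.
patient = json.loads(
    '{"resourceType": "Patient", "id": "example", "gender": "female",'
    ' "name": [{"family": "Doe", "given": ["Jane"]}]}'
)

for triple in fhir_json_to_triples(patient):
    print(triple)
```

Even in this toy form you can see why the conversion cost grows with the size of the result set: every field of every returned resource is visited, which is why narrowing the query to the FHIR store matters.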

A framework for representing clinical research in FHIR

Hugo Leroux, Christine Denney, Smita Hastak and Hugh Glover (speaker)

FHIR in FAIRsharing.

This covers work they’ve done together as part of HL7 over the past 6-8 months. They’ve had a FHIR meeting, “Blazing the Path Forward for Research”, where they agreed to establish a new Accelerator Project to get a set of use cases. FHIR has been widely adopted in clinical care, mainly because of its accessibility and how it is all presented through a website. If you look at a FHIR resource, you get a display containing an identifier and some attributes. For instance, for Research Subject you would get information on state, study link, patient link, consent… Research Study includes identifier, protocol, phase, condition, site / location, inclusion/exclusion.

FHIR tooling enforces quality standards at the time of publishing the data and offers publicly available servers for testing, among other things. It also provides RESTful services for the master data objects that are stateless and non-object oriented.

Much of the work involved is keeping track of the investigators and other trial management. They are looking at using FHIR resources to help with the trial management as well as the more traditional data capture and storage.

People can build tools around FHIR – one example is ConMan, which allows you to graphically pull resource objects in and link them together. When resources are linked together, the resulting graph of objects looks a lot like a vocabulary/ontology relating ResearchStudies to Patients via ResearchSubject and other relationships.
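The linking of studies to patients through ResearchSubject can be sketched in a few lines of Python. This is my own illustration, not ConMan or anything shown in the talk. The field names `study` and `individual` follow my understanding of the FHIR R4 ResearchSubject resource; the ids, the title and the function name are made up.

```python
# Minimal stand-ins for FHIR resources, reduced to the linking fields.
# The ids and the title are invented for illustration.
RESOURCES = [
    {"resourceType": "ResearchStudy", "id": "study-1", "title": "Example trial"},
    {"resourceType": "Patient", "id": "pat-1"},
    {"resourceType": "Patient", "id": "pat-2"},
    {
        "resourceType": "ResearchSubject",
        "id": "subj-1",
        "study": {"reference": "ResearchStudy/study-1"},
        "individual": {"reference": "Patient/pat-1"},
    },
    {
        "resourceType": "ResearchSubject",
        "id": "subj-2",
        "study": {"reference": "ResearchStudy/study-1"},
        "individual": {"reference": "Patient/pat-2"},
    },
]


def patients_by_study(resources):
    """Follow ResearchSubject links to map each study to its patients."""
    links = {}
    for resource in resources:
        if resource["resourceType"] == "ResearchSubject":
            study = resource["study"]["reference"]
            patient = resource["individual"]["reference"]
            links.setdefault(study, []).append(patient)
    return links


print(patients_by_study(RESOURCES))
```

Each ResearchSubject acts as an edge between a study node and a patient node, which is why the linked resources end up looking so much like a small ontology.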

The object model is quite complex. BRIDG is a domain model for FHIR in clinical research. The objective of what they’re doing is to stimulate a discussion on how clinical research semantics and data exchange use cases can be represented in FHIR.

Reconciling author names in taxonomic and publication databases

Roderic Page

LSIDs in FAIRsharing.

LSIDs were used early on in the semantic web – I remember those! However, LSIDs didn’t really work out – data didn’t just “join together” magically, unfortunately. He’s working towards a Biodiversity Knowledge Graph, as there is a lack of identifiers and a lack of links. Taxonomists often feel underappreciated, and under attack from people who are generating loads of new species and aggregated biodiversity data. Taxonomists are much less likely to have ORCIDs than the general research population, so in order to identify people you need to match people using CrossRef and ORCID either using schema:Role, or matching people in IPNI (a taxonomic database that still uses LSIDs?) and ORCID.

Not all ORCID profiles are equal – he shows us an example of one called “Ian”…, though he did figure out who it (probably) is. In conclusion, the semantic web for taxonomic data failed because of the lack of links, and making retrospective links is hard. Additionally, there is the “Gary Larson” problem of people hearing “blah blah RDF blah” 🙂

On Bringing Bioimaging Data into the Open(-World)

Josh Moore (speaking), Norio Kobayashi, Susanne Kunis, Shuichi Onami and Jason R. Swedlow

IDR in FAIRsharing. OME-TIFF in FAIRsharing. OMERO in FAIRsharing.

In imaging, the diversity is visual and you can see how different things are. They are developing a 5D image representation / model: 3D movies in multiple colors. From there it gets more complicated with multilayer plates and tissues. They develop the Image Data Resource. They are interested in well-annotated image data, e.g. from the EBI, as well as controlled vocabularies. They are getting lots of CSV data coming in, which is horrible to process.

They translate over 150 file formats via BioFormats by reverse engineering the file formats – big job! They tried to get everyone using OME-TIFF but it wasn’t completely successful. However, it was a good model of how such things should be done: it’s a community-supported format, for example.

This community is still a bit “closed world”. In 2016 they started development of the IDR, and needed to formalize the key/value pairs. However, the community continues to want to extend it more. As a result, they want to leave the key/value pairs and move back to something more semantic. Use cases include extension of the entire model or conversion of the entire model – Norio Kobayashi converted the entire model into OWL (OME Core ontology <= OME Data Model (which itself is OME-TIFF + OME-XML)). Extension is the 4D Nucleome ontology.

He likes the Semantic Web solutions as they reduce the cost of more ad hoc XML extensions. Perhaps they could use JSON-LD, as it may end up being the “exciting” front end? Bio-(light) imaging is relatively new to this and lagging in investment for these sorts of things.

Please also see the full programme. Please note that this post is merely my notes on the presentations. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any issues you may spot – just let me know!

In The News Outreach

What’s your favorite taxonomic controversy?

In January I’ll be running an event with another STEM ambassador for Years 5 and 6 at a local primary school. One year will be getting the fantastic Mystery Boxes, which I love doing with any age group, and the other year is currently studying Taxonomy and Classification. I love the idea of talking about the big debates that scientists have, and how we scientists aren’t a bunch of homogeneous fact-tellers. Instead we’re messy humans who like having arguments, and I think taxonomy is one of those areas that has many arguments.

So, what debates (historical or modern) do you most enjoy hearing about within taxonomic research? Here are some ideas I have, but would love to hear some specific examples from you all:

  • DNA Barcoding (summarized nicely by the Dept of Sociology at Lancaster Uni, and a 2005 POV article in Systematic Biology),
  • Taxonomy “vandalism” (see this Smithsonian piece), which I hadn’t realized was a thing,
  • Where do hominids fit in with respect to great apes (e.g. this opinion piece)?

I’ll probably simplify the general idea behind this lesson plan and throw in some soft toy animals for the kids to classify, but if you have any interesting ideas please let me know!