SysMO-DB and Carole Goble, BBSRC Systems Biology Workshop

BBSRC Systems Biology Grantholder Workshop, University of Nottingham, 16 December 2008.

SysMO stands for Systems Biology of Microorganisms: 11 projects from 91 institutes, whose aim is to record and describe the dynamic molecular processes occurring in microorganisms in a comprehensive way. These projects have no single concept of experimentation or modelling, which makes information exchange tough. Further, there are issues of people having their own solutions, suspicions (about sharing data, for instance), data issues (many don't have data or don't store it in a standard way) and resource issues (no extra resources). SysMO-DB started in July 2008, and is a 3-year funded effort (3+3 people in 3 teams over 3 sites). Its aim is to provide a web-based solution to exchange, search, and disseminate data. They need to retrofit a data access, model handling and data integration platform. Because of the large number of groups and projects, they are going to aim for low-hanging fruit and early wins: be realistic, don't reinvent, be sustainable, and encourage standards adoption.

Just like at CISBAN, where we have implemented a web-based data integration, storage, exchange, and dissemination platform in a standards-compliant way (SyMBA), they have three types of user: experimentalists, bioinformaticians, and modellers. They're lucky, though, in that they have 6 people to develop SysMO-DB, when CISBAN only has 1. 🙂 And, as with CISBAN and many other data integration efforts, much of the work is social: that is, encouraging those three user types to collaborate and understand each other's work. The social solutions include questionnaires, "PALS" (postdocs and PhD students), and audits and sharing of methods, data, and models. They discuss things like what people need or don't need from MIAME. (Personal opinion and question: MIAME is intended as a minimal information checklist. What kind of things, then, don't they need? And would it be worth taking this information back to the MIAME people to possibly modify the guidelines if some aspects of it aren't truly minimal? End personal questions.)

Discovery is done via SysMO-SEEK. The questions are how to catalog the metadata, and then how to provide mechanisms for accessing the data from locations other than the host site. There is a single search point over the "yellow pages" and the assets catalogue. They store metadata on results, not the results themselves (again, just like SyMBA, which stores the metadata in a database, and the results in a remote file store). They use myExperiment for linking both the people and the assets. For models, they're using a local installation of JWS Online, which is a database of curated models and a model simulator. There are also some links to semantic SBML from the TRANSLUCENT project.
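
The "metadata here, results elsewhere" idea can be sketched in a few lines of Python. This is only my own illustration, not SysMO-SEEK's actual design: the class names, fields and the example URL are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AssetRecord:
    """Metadata about one result; the result itself stays at its host site."""
    title: str
    project: str
    owner: str
    location: str  # hypothetical pointer to where the data actually lives
    tags: set = field(default_factory=set)


class AssetsCatalogue:
    """A single search point that holds metadata records only."""

    def __init__(self):
        self._records = []

    def register(self, record):
        self._records.append(record)

    def search(self, term):
        """Return records whose title contains the term or that carry it as a tag."""
        term = term.lower()
        return [
            r for r in self._records
            if term in r.title.lower() or term in {t.lower() for t in r.tags}
        ]


catalogue = AssetsCatalogue()
catalogue.register(AssetRecord(
    title="Glucose pulse time series",
    project="ExampleProject",            # made-up project name
    owner="A. Researcher",               # made-up owner
    location="https://example.org/data/1",
    tags={"metabolomics"},
))
for hit in catalogue.search("glucose"):
    print(hit.title, "->", hit.location)
```

The point of the sketch is that a search never touches the data files: it returns a pointer, and access control can stay with the host site.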

There are two kinds of processes to store. The first is experimental processes, e.g. SOPs and protocols. They use the Nature Protocols format, with the addition of high-level classification through tags. (Personal note: What is the underlying format for storing protocols?) The second type of process is bioinformatics processes, which are stored as workflows. (Question: Why don't you store protocols as workflows? They can be chained in the same way.) Taverna is used for this work. One bit of work was using libSBML inside Taverna for collaborative model development (Peter Li et al). Another automated (definition of automated in this context?) workflow goes from microarray to pathways and published abstracts. Their consortium wants to exchange information from public data sources, SysMO itself, and Excel spreadsheets.

(Another personal aside. FuGE (object model for experimental metadata) and ISA-TAB (tabular format, e.g. spreadsheets) are becoming interchangeable – work is going on between FuGE and ISA-TAB people right now – most recent workshop was last week. This is important, as it was mentioned that bioinformaticians have to deal with spreadsheets (which is true enough!). So, you get the best of both worlds with FuGE / ISA-TAB, without having to define yet another schema. A personal question would be: Why build these various metadata schemas and parsers for spreadsheets (e.g. whatever is used for the Assets catalogue and JERM parsing of spreadsheets) rather than use pre-existing models such as FuGE and formats such as ISA-TAB? Using the FuGE object model does not mean that you have to use all aspects of it – you can just take what you need. Perhaps it was due to the maturity of ISA-TAB at the time the project started, though the specification is now in version 1.0. Will SysMO-DB export and import these formats? There was no time for questions at the end of the talk, so I will try to find out during the lunch period. End aside.)

Trying to map to the relevant MIBBI standard. There is a nice feature that reads spreadsheets from specific locations and automatically loads them into the Assets catalogue. (You can still load them directly into that catalogue.) They are performing a 4-site JERM exchange pilot scheme in Spring 2009.
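
To give a feel for what such a "harvest from a known location" feature involves, here is a rough Python sketch. It is an assumption-laden stand-in rather than how JERM works: the talk did not describe the mechanism, CSV files stand in for Excel spreadsheets, and the function name and the `source_file` field are my own.

```python
import csv
from pathlib import Path


def harvest_spreadsheets(folder):
    """Read every CSV file in `folder` and return one metadata dict per row."""
    records = []
    for path in sorted(Path(folder).glob("*.csv")):
        with path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                record = dict(row)
                # Remember which spreadsheet each record came from.
                record["source_file"] = path.name
                records.append(record)
    return records
```

Calling `harvest_spreadsheets("/path/to/shared/folder")` on a schedule would give the catalogue fresh records without anyone uploading by hand, while direct loading remains possible alongside it.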

Great talk – thanks 🙂

These are just my notes and are not guaranteed to be correct. Please feel free to let me know about any errors, which are all my fault and not the fault of the speaker. 🙂



Housekeeping & Self References Semantics and Ontologies Standards

This site now listed in Nature Blogs, and the reason behind my keyword choices

Last week when scanning through Friendfeed, someone mentioned Nature Blogs. A number of my friends and fellow friendfeeders (1,2,3,4,5,6,etc.) already have their blogs registered there. I took the plunge and submitted my request last week, and this site was accepted for inclusion in the list this week. You can find it listed under the bioinformatics category. In honor of that occasion, I've decided to post a summary of the tags I chose to mark this blog with on Nature Blogs, and the reasons for them. (The obvious one, bioinformatics, wasn't necessary as far as I could tell because that is the top-level category I've placed the site into.)

  • data integration: It is the main focus of my research, and one of the biggest challenges facing bioinformatics and the life sciences in general. So many formats, so little time! Reconciling these using brute force, standardization, semantics, and sneakiness are what it's all about.
  • ontologies: I like ontologies for many reasons, not the least their potential for reconciling the many different ways of defining and naming things in our lives. We need a common ground from which to perform successful integration and analysis, and I think a well-written ontology (or set of them) is a beautiful thing. They are a major tool in my research bag of tricks. Not only that, but I also help develop a community-driven ontology for describing life-science experiments (OBI).
  • workshops: my method of remembering what goes on in workshops and conferences is to take notes, and I can be a pretty fast typist. I enjoy blogging on each lecture at such an event as they happen, and you'll notice a lot of workshop and conference posts on this site. They are mainly written while the speaker is speaking, with a minimum (if any) of later editing. However, if any speaker reads my notes and would like to suggest areas where I made a mistake, I am more than happy to make those sorts of changes. One of my favorite ways of blogging.
  • systems biology: that's the field in which my bioinformatics research is applied, which makes it an immediately-applicable tag for this blog. But try to define it and, as with so many things in this world, you could get as many definitions as there are people. (Ok, perhaps a slight exaggeration for dramatic effect.) So, I'll not try to define it today, and just say that my posts often deal with work in this field.
  • science outreach: My Mom is a teacher, my Dad was a teacher and remains working in Education. If it wasn't so much hard work, I'd consider it as a career myself. 🙂 However, I do enjoy trying to pass on my enjoyment of and interest in the sciences. Some of my more recent posts talk about the work I'm doing with the Teacher Scientist Network. Outreach is just fantastic, especially when explaining science to kids, and it's something I like to talk about in this site, when the opportunity arises.
  • standards: Perhaps it's because I spent years working at the EBI, where they provide databases and services in specific syntaxes. Perhaps it's just the way my personality is. Whatever the reason, I really enjoy working with data standards. I'm lucky enough to be directly involved with two at the moment (FuGE and OBI), and peripherally involved in other efforts such as SBO (by peripherally I mean that I've nagged them in the past about the whys and wherefores of various aspects of their ontology) and MIGS (I was involved in the initial work on the checklist, and provided advice on FuGE). I'm a bit of a standards fiend, and try to remind myself that not everyone finds them as interesting (though everyone should at least find them relevant!).



Meetings & Conferences

Questionnaire Design

I spent today in a 1-day
course on Questionnaire Design organized by the Newcastle University Staff Development Unit, and run by Dr. Pamela Campanelli, a Survey Methods
consultant and UK Chartered Statistician. While I won’t recreate her slides
here, as that would be long, irrelevant and possibly infringe some copyrights,
I wanted to present some of the most interesting comments she had to make on the design and analysis of questionnaires and the responses returned.

          I signed up to this course as my PhD project includes, as one of its
(smaller) objectives, the comparison of the perceived level of collaboration
between the various research groups within the Centre I belong to both before
and after my PhD project is made available. Part of that project is to provide
an application accessible to all researchers that will
automatically use the output of certain research groups to inform the research
of other groups. (Yes, I am being deliberately vague here.)
In summary, the ability to provide my target audience with a simple, clear
questionnaire that will additionally produce responses that can be
statistically analyzed in a useful manner is important. As I have no previous
experience writing a questionnaire, a crash-course seemed like a good idea.
Forgive any errors in the points that follow: I am sure they are all due to my
lack of comprehension rather than to the quality of the training course!

          Of most relevance to me, Pam mentioned that when a questionnaire
will be given at multiple time points (i.e. before and
after my work is available to the researchers), you should use an identical
questionnaire every time you provide it, so that changes in the responses
cannot be attributed to changes in the questionnaire design.

          The most important thing I learnt from the day’s training
is this: always think very carefully
about what you want to ask, and ensure that every question you ask has a
relevant objective and is written with an eye for balancing brevity and clarity
(with clarity being the more important of the two). For instance, in English
“you” may be plural or singular, and which is intended should be made clear.
Equally, words like “doctor” have many meanings: your GP, your specialist, a
PhD. Some may even check “yes” to a question asking if they have seen their
doctor if they have been to the surgery/office and seen the nurse, or even
if they have chatted with their doctor on a chance meeting at the grocery

          Pam mentioned a resource that has been useful to her in the
past, called the CASS Question Bank.
This presents – for free – the information in the
data archive. Not only might a question you wish to use already be written,
but in some cases you can see how often such a question was answered (and
perhaps also the frequencies of each possible answer). It should be noted,
however, that just because a question or questionnaire has been published doesn’t
mean it is perfect. Also, there is no “ideal response rate” for questionnaires that
can be applied across the board. Instead, the rate will naturally differ
between country and even academic discipline (or other grouping). Further, the
people who actually respond to questionnaires have different traits than those
who don’t respond (when under their own recognizance).

          Incentives were also discussed, as I had toyed with the
idea of encouraging people to fill out my questionnaire by having a prize draw
for respondents for chocolate. Interestingly, Pam mentioned that prize draws
can be the worst of the incentive choices available. One study (sorry, I didn’t
catch the reference) compared promising a larger, guaranteed reward on return
of the form with giving a much smaller reward before
the respondent filled out the form. The control response rate (no incentives)
was 50%. Where the respondents were guaranteed $50 if they sent back the form,
the response rate rose to 57%. However, when $5 was included in the initial
posting with the questionnaire, the response rate rose to 67%! Whether it was
the respondent’s belief in reciprocity or their feelings of guilt, it seems
that providing the carrot at the same time as the stick was useful. On a
smaller scale, including a tea bag (as was done by a PhD student) proved popular as well.
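
Out of curiosity, here is a quick Python calculation of what those figures mean in cost terms. The response rates are the ones quoted above; the mailing size of 100 and the payment rule (a promised incentive is paid only to those who respond, a prepaid one goes to everyone mailed) are my own assumptions and not part of the study as reported.

```python
def cost_per_extra_response(mailed, control_pct, incentive_pct, amount, prepaid):
    """Incentive spend divided by the responses gained over the control group."""
    responders = mailed * incentive_pct // 100
    extra = responders - mailed * control_pct // 100
    # Assumption: a prepaid incentive goes to everyone mailed,
    # while a promised one is paid only to those who respond.
    paid_out = (mailed if prepaid else responders) * amount
    return paid_out / extra


promised_cost = cost_per_extra_response(100, 50, 57, amount=50, prepaid=False)
prepaid_cost = cost_per_extra_response(100, 50, 67, amount=5, prepaid=True)
print(f"promised $50: ${promised_cost:.2f} per extra response")
print(f"prepaid $5:   ${prepaid_cost:.2f} per extra response")
```

Under those assumptions the promised $50 costs roughly $407 per additional response, against roughly $29 for the prepaid $5, so the small upfront incentive is both more effective and far cheaper.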

          Memory is often overestimated. Reports vary about how large
working memory is, but both 7 +/- 2 items and 5 +/-
2 items were mentioned. As Pam suggested, imagine a scenario where you are at a restaurant and
the waiter is telling you the specials. Most people find it difficult to keep
more than 5 or 6 specials in their head: after that, they start forgetting the
earlier items. This holds just as true for self-completion questionnaires (which
I’m interested in), and questionnaires in general. Therefore, the more clauses
in a question, or the more radio buttons in a range of possible responses, the
less likely that the responder will answer with their “correct” answer. In a
similar vein, you should try not to force respondents to do mathematics in
their head (“How often per day, on average, do you visit the coffee lounge at work?”).
The more mathematics you make them do, the less likely their answer will be the
one they intended. Instead, a couple of simpler questions from which the designer can calculate the value is better.
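
As a small illustration of letting the designer do the sums, here is a Python sketch. The two replacement questions ("How many days did you work last week?" and "How many times in total did you visit the coffee lounge last week?") are my own hypothetical wording, not examples from the course.

```python
def visits_per_working_day(days_worked, total_visits):
    """Derive the daily average from two easy-to-answer counts."""
    if days_worked <= 0:
        # No working days reported, so a daily average is undefined.
        return None
    return total_visits / days_worked


print(visits_per_working_day(days_worked=5, total_visits=12))
```

The respondent only has to recall two simple counts, and the averaging (including the awkward zero-days case) happens once, consistently, at analysis time.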

          She also says that the most common problem she encounters
is trying to answer too many questions with a single item, with her example being “Would you like
to be rich and famous?”: this sentence is alright for those who want either
both or neither, but is not appropriate for those who want one or the other.

          What is most interesting are the social aspects of
questionnaire design. If you have a range of 5 possible answers for a question
(very positive, generally positive, neutral, generally negative, very
negative), you need to decide whether you want to force your respondents to
take a side. To do this, you remove the
“neutral” option, forcing the respondents to get off the fence. You should also be
sparing in your use of “don’t know” as an option, as many people will use that
in preference to thinking about the question. Also, in many cases it is simply
not appropriate: for instance, “don’t know” is not really
applicable to the question “How happy are you with your new TV?”. Further, vague,
subjective quantifiers should be avoided wherever possible. Words like “often”,
“sometimes” and “rarely” mean different things to different people. Instead,
measuring frequencies with phrases like “every day” and “about once a week” is
better, though they may not be suitable if the respondent’s behavior is not
regular. Questions using these words must be written clearly so that
respondents can make a decision easily. Finally, numeric scales should at a
minimum have the midpoint and the two extremes named with appropriate adjectives.
If, for instance, you have the range 0-10 and have not marked 5 as the
midpoint, some people may mistake the scale for a unipolar (any number over 0
is positive) rather than a bipolar one (any number over 5 is positive). The course covered many more topics than I've mentioned here. Included below are the references she recommended for further reading.

References Suggested (the starred reference was the one she mentioned the most)

Tourangeau, R. et al. (2000), The Psychology of Survey Response.

Fowler, F. J. Jr. (1995), Improving Survey Questions: Design and Evaluation. Sage.

Dillman, D. (2007), Mail and Internet Surveys: The Tailored Design Method, 2nd Edition.

Fowler, F. J. Jr. (2002), Survey Research Methods, 3rd Edition.

Czaja, Ronald and Blair, J. (2005), Designing Surveys – a guide to decisions and procedures. Pine Forge.



Meetings & Conferences Semantics and Ontologies Standards

3rd OBI Workshop: Day 3

Today was a highly informative combination of talks and further improvement of OBI. Hopefully, you'll find these musings on the day's work helpful, either in jogging your own memory of the events or in giving you an idea of what went on in our heads.

Outside OBO
Ontologies – How do we integrate and/or make use of them?

  • Can we, at the moment or in future, place
    parent classes for all OBO ontologies in OBI? Definitely not now, as they don't share the same ULO (Upper Level Ontology). Some work is being done by the OBO-UBO group on mapping OBO ontologies to ULOs like BFO. (See the OBO-UBO web page for more information)

    • In a related question, should all OBO
      ontologies use BFO? It would make integration a much more straightforward process. In my opinion, this would be a great idea in the long term; however, practicalities may prevent it. 🙂

  • Should things like
    BioTop be integrated
    into OBO, under BFO but before OBI? In my opinion (though today was the first time I have read about BioTop so it isn't the most informed one), in our case probably not, as resolving the three may be problematic. However, some terms or ideas might be useful to share.

Formal OWL, aka making OBI Formally correct

  • Should be assigned
    to someone/some people for later, after more classes have been
    created. There is simply too much flux in the file at the moment. Get the graphs in place first, perhaps working on some
    complex relations as you go. Further, the definitions must explicitly hold information
    on creating these relations, irrespective of whether or not you make the relationships as you go or at the end.

  • BFO and OBI use
    different metadata tags, and there should be a
    shared set of tags.

    • The metadata tags
      used in BFO are part of snap/span, I think. We would need to bring up the idea of metadata resolution (if possible, and if we all agree it should be pursued) with that group too.

  • Barry Smith will bring OBI's information object and plan terms to the BFO group.

  • A milestone has been added (see the OBI Wiki) to
    hammer out exact implementation of the metadata list, and to work
    with other communities as appropriate (e.g. BFO, OBO Foundry =
    Barry, M Ashburner, Suzie, Chris M.).

Clinical Trial
Ontology – Simona Carini
& Barry Smith


  • Rctbank is a
    clinical trial db – information on all published clinical trials.
    (from journal articles)

  • Its purpose is to provide enough
    information to allow evaluation of these trials

  • RCT = randomized
    controlled trials

  • Epoch and Clinical
    Trial Ontology (CTO) are the other two that are being developed.

  • Barry Smith is involved in CTO, which is therefore being built with OBI
    in mind, but it is still very small

  • RCT and Epoch
    aren’t close to being OBO/OBI compliant.

    • Developed

    • Their choices are
      in conflict with the choices we’ve made

    • that does NOT mean that they aren't eminently useful (which they are), just that merging would be problematic
  • There has been
    agreement between Epoch and RCT that all should work towards a CTO
    that will work within the OBI framework

    • This necessary
      reconciliation is one of the goals of the CTO workshop in May.

  • There are people
    claiming to develop a CTO but it is actually a CT database
    ontology (I missed the name of the people being referred to here). It isn’t
    the same beast. Understanding the data is not equivalent to
    understanding the processes in a trial.

RCT Schema – Barry

  • Built
    independently of OWL or Protégé, and is more correctly
    a database schema, though it is called an ontology.

  • Top-level class:

    • 2o study

    • Trial-details

    • Trial

    • Concept

      • Subclasses

  • Not the right way
    to do it – it is unbalanced: no place for a study, though is a
    place for a 2o study.

  • 2o study seems to
    be at the wrong level in the hierarchy

  • it is unclear what
    trial details means

  • When the same term (or portion of a term) is repeated
    over and over, it is often a sign of a mistake, of redundancy

  • One of the
    children of population concept is population.

    • An ontology is
      important for reasoning using the is_a hierarchy, which can be reasoned
      over: Population is NOT a population concept and is NOT a concept

    • Reasoning is
      blocked here “from both directions”

    • Further, a recruitment
      flowchart is not a population concept

  • These things, like
    population concept, are headers/labels/conveniences, but they are not
    ontological forms. Some options for restructuring could be the following two things:

  • Population/protocol/design
    is_a continuant is_a entity

  • Trial is_a
    occurrent is_a entity

  • Not all RCT terms have
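
The is_a point above can be made concrete with a few lines of Python. This is my own toy illustration, not anything shown at the workshop: it assumes a single parent per term, and the two dictionaries simply transcribe the problematic RCT-style hierarchy and the restructuring suggested in the notes.

```python
def ancestors(term, is_a):
    """Follow is_a links upwards and return every inferred parent of `term`."""
    found = []
    parent = is_a.get(term)
    while parent is not None and parent not in found:
        found.append(parent)
        parent = is_a.get(parent)
    return found


# Hierarchy as criticised in the talk: a label used as a parent class.
rct_style = {
    "population": "population concept",
    "population concept": "concept",
}

# One possible restructuring, following the two options listed above.
restructured = {
    "population": "continuant",
    "protocol": "continuant",
    "design": "continuant",
    "trial": "occurrent",
    "continuant": "entity",
    "occurrent": "entity",
}

print(ancestors("population", rct_style))
print(ancestors("population", restructured))
print(ancestors("trial", restructured))
```

With the first hierarchy, anything that walks the is_a links concludes that a population is a concept, which is exactly the false inference being warned about. With the second, a population comes out as a continuant and a trial as an occurrent.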

Epoch Ontology (Dave
Parrish in charge of it) – Barry Smith

  • There are parts of
    this ontology that don’t belong in the CTO, but do belong in OBI

  • Originally
    developed to support the immune tolerance network (ITN), a big
    clinical trial resource: they fund, implement, monitor and assess
    clinical trials, and provide data services.

    • Informatics dept
      of ITN perform operations (generation and collection) -> data
      management -> analysis

  • They have an
    ontology of the kind of analytical steps their software needs to
    perform, and it helps them configure the software application.

  • For example, elements are claimed to be
    nouns, and represent the physical objects of the system. Classes of
    elements are domain types, containers, relationships. These are not
    physical objects always – they’re sometimes processes. Also,
    they are not always nouns.

  • Fits in with the
    community milestones, i.e. we could get many terms from the clinical trials community.

Branches have been assigned. See the OBI Branches Wiki Page for up to date information.


  • Mapping between
    current terms in various OBO ontologies to BFO

    • E.g. GO
      biological process is_a span:process

  • Gramene has
    already developed an environmental ontology in a plant context,
    which we should remember and hopefully incorporate useful terms in the first round of community term dates.

More general

  • Have moved all terms
    that would fall under PATO out of the ontology, e.g. state and
    anything under quality.

  • Do we really need
    "in vitro state" as well as "in vitro"? Terms such as
    these are always tied to objects like cells – these are not design
    as much as the state of the cells.

    • Is in vivo
      a location or a state? You can take in vitro cells and put
      them into “vivo”, and they are still in vitro cells,
      which means in vitro is a BFO quality.

  • The interior of
    your gut is the site for your gut bacteria. The interior of gut (IG)
    is also a type/node in the FMA (as a location). IG has qualities
    (shape, etc). In addition to these qualities it has others that
    determine its roles (having certain pressure, pH value). How to
    distinguish what FMA means from what an environment ontology means?

  • If we remove
    in-vivo_state, we run into problems with multiple inheritance. We
    needed to separate out the state of a biomaterial from the
    biomaterial itself, i.e. don’t have in-vivo_material as a child of

  • What terms do we
    need to use to describe diseases?

    • Disease (hook for
      disease ontology), disease_symptoms, disease_stages,

  • Ended up going through the entire ontology, resolving many problems. There is a new OWL file, but it is not yet ready for public consumption, so it won't be posted here until it is available from the official OBI pages.

There is general consensus among the workshop attendees that a very large amount of work is getting done, and there is a lot of positive feeling that the Milestones developed this week are giving us hard dates for inclusion of many more terms. The addition of terms can only truly start once the high-level structure has been decided, and this workshop has moved in great leaps and bounds towards a final structure of the higher levels of OBI. The "higher levels" have been generally defined at this meeting as the top two levels of OBI below BFO. This is what was completed today: the two levels directly below BFO have been studied by the group and cleaned.
