Categories
Meetings & Conferences

Integrative Bioinformatics 2007, Day 3: Searls Keynote

David Searls (GlaxoSmithKline Pharmaceuticals, USA)

Other than where specified, these are my notes from the IB07 Conference, and not expressions of opinion. Any errors are probably just due to my
own misunderstanding.

A metaphor for systems biology: the organizing paradigms of systems and languages map neatly onto each other. The "parts list" is the vocabulary, or lexicon; "connectivity" is the rule-based syntax that determines how words may be arranged; and "function" is the subject matter of semantics. Analogous organizing paradigms can be found in a number of related domains, including systems, whose componentry, connectivity and behaviour match those same three terms. The equivalent triple for proteins is sequence, structure and function, and in some ways you can think of proteins as systems themselves. The semantics, or meaning, of a system is separate from its pragmatics, which is what it does, usually in the larger context of a discourse; this matches the distinction between a protein's pure function and its actual role.

How does complexity arise in biological networks? Pleiotropy & redundancy. Networks ramify by overlapping function. Pleiotropy (multifunctionality) is common, as in the case of "moonlighting" proteins. Redundancy of function is the flip side of pleiotropy, and such redundancy (full or partial) contributes robustness. Linguists have similar terms for wordnets: polysemy and synonymy.

Network emergent properties: in connecting pathways into networks, it has been suggested that important novel properties emerge, and this idea has started to take hold in the biology community. The phrase can be traced back to the early part of the 20th century and the idea of the unity of science. Reductionism: science generally proceeds by reduction to fundamental components and behaviours. Emergence: complex systems are thought to demonstrate this, such that "the whole is greater than the sum of the parts", and such behaviour could not be predicted a priori. Reductionism seems to be "under fire" at the moment, as if something completely new and different were needed instead. However, it is fairer to say that systems biology suggests reductionism will no longer do by itself.

The 19th-century logician Gottlob Frege set up competing principles of "meaning": firstly, compositionality (the meaning of the whole is a function of the meaning of the parts), and secondly, contextuality (no meaning can exist independently of the overall context). Contextuality can be dealt with in a compositional setting if you know how much context will be necessary. For instance, substrings have variable pronunciations: "ough" has six different pronunciations, but by looking at the surrounding letters you can determine which one applies. But how many more letters do we need to look at? The same thing happens with whole words: "does" (the verb) versus "does" (the plural of doe). How do we use the context to determine pronunciation here? In proteins, the string ASVKQVS is part of a beta-sheet in an amino-peptidase, and an alpha-helix in a guanylate kinase.

From a compositional viewpoint, examine the example of artificial neural networks, where you imitate life to try to "learn" functions of many variables. Minsky & Papert showed that early nets couldn't classify some functions, such as exclusive-OR (that is, X or Y but not both), but adding a "hidden layer" of neurons fixed this. This seems to be a case for emergent properties. But is it really? If you design from scratch, ab initio, you can get it, so it is just a case of simple logic. However, could emergence simply be a matter of scale? Would imponderable properties arise in larger hidden layers?
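
To make the XOR point concrete, here is a minimal sketch (mine, not from the talk) of a tiny network with one hidden layer learning exclusive-OR by plain gradient descent; a single-layer perceptron cannot represent this function at all. All names and parameters are illustrative.

    import numpy as np

    # XOR truth table: not linearly separable, so no single-layer perceptron
    # can compute it, but one hidden layer suffices (illustrative sketch).
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # 2 inputs -> 4 hidden units
    W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # 4 hidden units -> 1 output
    lr = 0.5

    for _ in range(10000):
        h = sigmoid(X @ W1 + b1)                    # hidden layer activations
        out = sigmoid(h @ W2 + b2)                  # network output
        # backpropagate the squared error
        d_out = (out - y) * out * (1 - out)
        d_h = (d_out @ W2.T) * h * (1 - h)
        W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * X.T @ d_h;  b1 -= lr * d_h.sum(axis=0)

    print(np.round(out, 2).ravel())                 # typically close to [0, 1, 1, 0]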

There are some interesting parallels between the neural network research of 20 years ago and the omic datasets of 5 years ago. For instance, in NNs there was a belief or concern that a net is a "black box" whose architecture is opaque to interpretation (researchers then worked on rule extraction); in omics, profiles may emerge in the absence of any clues to the mechanism. Secondly, in NNs, if hidden layers are too big, nets tend not to generalize but just to memorize (overfit); in omics, high-dimensional data with few samples can produce statistical artifacts. Finally, NNs learn differently upon being retrained (nondeterminism). Luckily, these concerns in NN research faded over time.

In what ways might complex biological systems resist reductionist description? Firstly, dependency: they may be too highly interconnected to afford discrete, mechanistic explanations. Secondly, ambiguity: they may be too pleiotropic and nondeterministic for definitive or tractable analysis. Dependency is certainly present in biological systems: nucleotide base pairs embody dependencies in structural RNAs, and secondary structure is an abstraction of this; the dependencies are "stretched" by linearizing the primary sequence. Likewise, side-chain interactions embody dependencies in folded protein chains, where secondary structure is a modular abstraction and the dependencies include parallel/antiparallel orientations and chirality.

Folding ambiguity example: attenuators use alternative RNA secondary structures by exploiting the syntactic ambiguity of the underlying grammar. He then introduced the Chomsky hierarchy, but it was quite a complex table and cannot be reproduced here. Not only is the Chomsky hierarchy useful for understanding modularity; integrated circuits, as abstracted hierarchical modules, should also be considered.
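
As a rough illustration of the ambiguity point (my own sketch, not the grammar from the talk), a toy context-free grammar for base-paired RNA can derive the same sequence in several ways, with each derivation corresponding to a different secondary structure:

    from functools import lru_cache

    # Toy ambiguous grammar for non-crossing RNA secondary structure (illustrative):
    #   S -> x S y S   (x, y a Watson-Crick pair, enclosing a stem)
    #   S -> x S       (unpaired base)
    #   S -> epsilon
    # One sequence can be derived in several ways; each derivation is a structure.
    PAIRS = {("a", "u"), ("u", "a"), ("g", "c"), ("c", "g")}

    def structures(seq):
        """Return the distinct dot-bracket structures derivable for `seq`."""
        @lru_cache(maxsize=None)
        def parse(i, j):                       # structures for seq[i:j]
            if i == j:
                return frozenset({""})
            out = set()
            for rest in parse(i + 1, j):       # first base left unpaired
                out.add("." + rest)
            for k in range(i + 1, j):          # first base pairs with position k
                if (seq[i], seq[k]) in PAIRS:
                    for inner in parse(i + 1, k):
                        for rest in parse(k + 1, j):
                            out.add("(" + inner + ")" + rest)
            return frozenset(out)
        return parse(0, len(seq))

    for s in sorted(structures("gcaugc")):
        print("gcaugc", s)                     # one sequence, several structures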

Rosetta Stone proteins: proteins that interact or participate in the same pathway are often fused, and catalogues of fusions can be used to predict function. Circuit design has steadily evolved to higher levels of abstraction and modularity: standard-cell VLSI design uses libraries of validated, reusable circuit building blocks, full-custom design is reserved for optimization, and hardware description languages (HDLs) let chips be designed much like writing software. Hardware and software therefore form a continuum: microcode, programmable gate arrays, etc. Some bioinformaticists have written pseudocode to describe biological pathways. In 1968 the computer scientist Edsger Dijkstra wrote a now-classic short note entitled "Go To Statement Considered Harmful", in which he criticized programming constructs that allow undisciplined jumps in the flow of control, leading to so-called "spaghetti code" that made larger programs unwieldy. In doing so, he helped to launch the structured programming movement, which enforced a strictly nested modularity for more manageable growth, debugging, modification, etc.

Does nature write spaghetti code? Well, it looks like pasta to the uninitiated 😉 However, if you actually look at how things are put together, it probably doesn't. Protein domains combine predominantly by concatenation or insertion, as seen in pyruvate kinase. Do domains interleave? Proteins very rarely seem to have interleaved domain structures; D-maltodextrin binding protein, with three inter-domain crossings (perhaps due to a translocation?), is an exception. This puts biological structure and human language at the same level in the Chomsky hierarchy.
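
The distinction between concatenation, insertion, and interleaving of domains can be made concrete with a small sketch of my own (the segment boundaries below are invented): list each domain's segments along the chain and flag any pair of domains whose segments alternate (A B A B), since that crossing pattern is the one nesting alone cannot produce.

    # Illustrative check for interleaved (crossing) domain arrangements.
    def interleaved_pairs(segments):
        """segments: list of (start, end, domain) along the chain."""
        order = [dom for _, _, dom in sorted(segments)]
        crossings = set()
        for a in sorted(set(order)):
            for b in sorted(set(order)):
                if a >= b:
                    continue
                seq = [d for d in order if d in (a, b)]
                # collapse consecutive repeats: A A B -> A B
                collapsed = [seq[0]] + [d for prev, d in zip(seq, seq[1:]) if d != prev]
                if len(collapsed) >= 4:        # A B A B ... -> interleaved
                    crossings.add((a, b))
        return crossings

    concatenation = [(1, 100, "A"), (101, 200, "B")]                       # A B
    insertion     = [(1, 100, "A"), (101, 180, "B"), (181, 250, "A")]      # A B A (nested)
    interleaving  = [(1, 80, "A"), (81, 150, "B"), (151, 220, "A"), (221, 300, "B")]

    for name, segs in [("concatenation", concatenation),
                       ("insertion", insertion),
                       ("interleaving", interleaving)]:
        print(name, interleaved_pairs(segs))   # only the last reports a crossing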

Organizing paradigms for linguistics can readily extend to proteins and systems. Abstracted, hierarchical modularity is a means to support "controlled" growth in complexity, in both design and evolution. The Chomsky hierarchy offers tools to measure and analyze this complexity. Proteins and systems form a continuum, exhibiting both compositionality and contextuality (but emergence…? Perhaps not).

The "Computational Thinking" Movement has been growing in recent years, and such work could help people who aren't used to thinking of modularization.

My opinion: Fantastic talk! Great way to start the day.


Categories
Meetings & Conferences

Integrative Bioinformatics 2007, Day 2: Protein Info 2 Disease Terms, Mottaz et al

Presented by Anna-Lise Veuthey, from SIB.

Other than where specified, these are my notes from the IB07 Conference, and not expressions of opinion. Any errors are probably just due to my
own misunderstanding. 🙂

The aim is to increase interoperability between molecular biology and clinical resources by indexing UniProtKB with medical terminologies, including MeSH. Related work includes GenesTrace, PhenoGO, and MedGene; these systems use text-mining methods, or knowledge- and semantic-based methods that use ontological relationships between terms.

Why use MeSH? MeSH is a hierarchical CV developed by NLM. It is part of UMLS and thus is linked to other medical terminologies. Further, it is used to index the biomedical literature.

200 disease names from 97 Swiss-Prot entries were manually mapped to MeSH terms; this set was used to evaluate the procedure in terms of recall and precision, and to set up a score threshold.
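
As a rough illustration of this kind of evaluation (my own sketch, not the authors' code, with invented disease names and illustrative MeSH identifiers), automatic mappings above a score threshold can be compared against the manually curated set to trade precision against recall:

    # Evaluate automatic disease-name -> MeSH mappings against a manual gold standard.
    def precision_recall(candidates, gold, threshold):
        predicted = {name: mesh for name, (mesh, score) in candidates.items()
                     if score >= threshold}
        tp = sum(1 for name, mesh in predicted.items() if gold.get(name) == mesh)
        precision = tp / len(predicted) if predicted else 1.0
        recall = tp / len(gold) if gold else 0.0
        return precision, recall

    gold = {"cystic fibrosis": "D003550", "sickle cell anemia": "D000755"}
    candidates = {"cystic fibrosis": ("D003550", 0.95),
                  "sickle cell anemia": ("D000755", 0.62),
                  "made-up syndrome": ("D999999", 0.40)}     # a spurious match

    for t in (0.3, 0.6, 0.9):
        p, r = precision_recall(candidates, gold, t)
        print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")

Raising the threshold trims spurious matches (higher precision) at the cost of missing true mappings (lower recall), which is the tuning choice described here.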

The mapping system was tuned for high precision to provide a fully automated procedure, but recall needs to be improved by: including NLP techniques in the disease extraction and matching procedures, refining the score with other parameters, trying to map to other terminologies such as SNOMED CT, and using information from the literature that is indexed with MeSH terms.

They developed a generic terminology mapping procedure which can be used to link various biomedical resources. Further, indexing SP with medical terms opens new possibilities of searching and mining data relevant for clinical research.


Categories
Meetings & Conferences

Integrative Bioinformatics 2007, Day 2: Coherency in HPI, Futschik et al

Other than where specified, these are my notes from the IB07 Conference, and not expressions of opinion. Any errors are probably just due to my
own misunderstanding. 🙂

Protein-protein interaction networks: fundamental for comprehensive systems biology? PPIs are crucial for cellular processes. Structuring the "hairball": modularity is a major design principle of biological systems, so networks are divided into modules, i.e. clusters of proteins that are highly interconnected. Integrating human protein interaction (HPI) maps meant including over 160,000 interactions between over 17,000 proteins. The PPI network itself was assembled via data extraction from the literature, giving 35,000 interactions between 8,500 non-redundant proteins (based on Entrez ID).

Identification of modular structures: most communities have fewer than 15 proteins, in the "mesoscale" range of cluster sizes. Most of the proteins attributed to modules are found in only one community. The next step is functional interactomics: linking modules to functions, and examining the interactions within and between modular structures. Characterization and annotation of the detected modules were done using GO information and expression data.
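
For a flavour of what module identification looks like in practice (my own minimal sketch, not the authors' pipeline), a toy interaction graph can be split into communities by modularity maximisation with networkx:

    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    # Toy protein-protein interaction graph: two dense triangles joined by a bridge.
    edges = [("p1", "p2"), ("p2", "p3"), ("p1", "p3"),
             ("p4", "p5"), ("p5", "p6"), ("p4", "p6"),
             ("p3", "p4")]
    G = nx.Graph(edges)

    modules = greedy_modularity_communities(G)      # Clauset-Newman-Moore greedy method
    for i, module in enumerate(modules, start=1):
        print(f"module {i}: {sorted(module)}")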

Cellular localization of modules: the analysis was based on 20 informative GO categories, and almost half of the proteins were assigned to "nucleus". Of 316 modules with k > 3, 170 contained only proteins allocated to a single location. Co-expression, co-localization, and common functional annotation: co-expression and co-localization correlate, but only modestly (0.27), and 34 of 51 large modules (k > 10) are significantly co-expressed.


Categories
Meetings & Conferences

Integrative Bioinformatics 2007, Day 2: MIN to ODEs, Yartseva et al

Other than where specified, these are my notes from the IB07 Conference, and not expressions of opinion. Any errors are probably just due to my
own misunderstanding. 🙂

From the MIN model to ordinary differential equations. A MIN model is a knowledge management formalism for biology. A model should enable knowledge integration, hypothesis testing, prediction of response, and discovery of fundamental processes.

MIN has: universality (the integration of the various kinds of biological data available today), parsimony (the simplest possible representation of the data), incrementality (construction of more complex models from simpler ones), precision (expression of relations in a non-ambiguous mathematical way), and transposability (formal rules for translating the information contained in the model into commonly used target modelling formalisms). MIN improves on the MIB model: it is a bipartite graph with labelled nodes and arcs.

Putting microscopic and macroscopic data together: in an example, she described a relation "F" which enumerates the experimentally observed system states expressed through the variables' values. Translation into a multivalued logical formalism: the translation procedure produces candidate models for further analysis. From there, there is a direct translation into ODEs.
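
To give a sense of what the final step might look like, here is a generic illustration of my own (not the MIN formalism itself, and with invented species and rate constants): a qualitative relation "A activates B, C represses B" rendered as a single ODE and integrated numerically.

    import numpy as np
    from scipy.integrate import odeint

    def hill(x, K=1.0, n=2):
        """Sigmoidal activation used to translate a qualitative influence into a rate."""
        return x**n / (K**n + x**n)

    def dB_dt(B, t, A, C, k=1.0, gamma=0.5):
        # production driven by activator A, damped by repressor C, plus decay of B
        return k * hill(A) * (1.0 - hill(C)) - gamma * B

    t = np.linspace(0.0, 20.0, 200)
    B = odeint(dB_dt, 0.0, t, args=(2.0, 0.1))   # A high, C low: B rises to a steady state
    print(round(float(B[-1, 0]), 3))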

Note: I lost my way about here, but it sounded really interesting nonetheless. Refer to the paper for more details.


Categories
Meetings & Conferences

Integrative Bioinformatics 2007 Day 2: Multi-value networks, Banks et al.

Other than where specified, these are my notes from the IB07 Conference, and not expressions of opinion. Any errors are probably just due to my
own misunderstanding. 🙂

The talk was about multi-value networks, high-level Petri nets, and the differences from Boolean networks. Formal methods are required to model and analyse complex regulatory interactions. Boolean networks offer a good starting point, but are often too simplistic. Multi-value networks (MVNs) are qualitative, and are seen as a middle ground between differential equation models and Boolean networks.

He has applied high-level Petri net techniques and a wide range of analysis tools. In MVNs, entities assume a range of values (0…n). Each entity has a neighbourhood of other entities that affect it, and the behaviour of each entity is described using state tables. However, such models cannot easily be analysed directly: that's where Petri nets come in. They have a graphical notation with mathematical semantics and can model choice, synchronization and concurrency. They offer an expressive framework with data types and an equational description of behaviour, and there is a wide range of analysis techniques and tool support, e.g. model checking. Petri nets use a kind of token-passing system.

Their approach was as follows. They defined a set of state transition tables that completely define the model. Equational definitions are extracted from these tables, and then a Petri net is constructed. They also apply multi-value logic minimization to each state transition table to simplify the information it contains. Construction of the high-level Petri net begins with a single place for each entity connected to a central transition; the transition encodes the equational specification of the network's behaviour, and each place x is connected to the transition node with an input arc x and an output arc x'.
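
As a minimal sketch of the multi-value network idea itself (my own illustration; the entities, state tables, and update semantics are invented rather than taken from the paper), each entity takes a bounded integer value and is updated according to a table over its neighbourhood:

    from itertools import product

    MAX = {"A": 2, "B": 1}                      # A takes values 0..2, B takes 0..1

    def next_A(state):                          # state table for A over neighbourhood (A, B)
        table = {(0, 0): 1, (0, 1): 0, (1, 0): 2, (1, 1): 0,
                 (2, 0): 2, (2, 1): 1}
        return table[(state["A"], state["B"])]

    def next_B(state):                          # B switches on once A reaches its top level
        return 1 if state["A"] == 2 else 0

    UPDATES = {"A": next_A, "B": next_B}

    def async_successors(state):
        """States reachable by updating a single entity (asynchronous semantics)."""
        successors = []
        for entity, fn in UPDATES.items():
            target = fn(state)
            if target != state[entity]:
                successors.append({**state, entity: target})
        return successors

    for a, b in product(range(MAX["A"] + 1), range(MAX["B"] + 1)):
        s = {"A": a, "B": b}
        print(s, "->", async_successors(s))     # the full (small) state graph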

They showed how this worked using carbon starvation in E. coli. Exponential growth occurs while there is sufficient carbon, but the cells enter a stationary phase when the carbon is depleted. The model is validated by checking known properties; then you can look at dynamic properties. A mutant analysis was also done, where you can knock out or overexpress key genes and observe the effect.

Finally, they compared the model with its Boolean network equivalent. There are differences, which leads to some interesting questions: how much detail is required in the model? Is the model representable in the Boolean domain?

My opinion: A great, interesting talk that flowed well and was easy to understand. Slides were a little overfull, but it didn't detract. A natural speaker.


Categories
Meetings & Conferences

Integrative Bioinformatics 2007, Day 2: Reactome, de Bono et al

Other than where specified, these are my notes from the IB07 Conference, and not expressions of opinion. Any errors are probably just due to my
own misunderstanding. 🙂

Most diseases are multifactorial. We've now reached a critical mass where we need to put all the information together, to understand the whole rather than the bits. This is the main aim of the Reactome database: to create a single, hand-curated model of human molecular biology, in pathway form.

There are three problems with the currently available biological information: data is lacking; the data that is out there is dispersed over a vast array of literature; and often the data is inapplicable (how do we discover which information is pertinent to us?). They extract knowledge from the experts and insert it into Reactome via its data model. The core of Reactome is the mapping: they use external vocabularies and other external information. They map to proteins via UniProt and obtain further cross-references through that database. Other primary resources include GO and ChEBI. They also provide an API in both Perl and Java that can be used for querying and working with Reactome, and there is an online application called Reactome Mart.

How does Reactome represent its data? The better the way of describing biological structure, the better the method for describing biological activity. There is a physical entity class that describes what entities are and what state they are in. What then followed was a very nice description of the data flow and the data model of Reactome, specifically how it deals with the combinatorial explosion of possible complex types in a given pathway.

Even though they concentrate on human, they can use orthology information to create putative skeleton models for other organisms. They also have a tool called SkyPainter, where you can paste in your favourite IDs and gene expression data and have the Reactome pathways coloured according to your data or IDs. All of the work and tools are freely available from the Reactome website.


Categories
CISBAN Meetings & Conferences

Integrative Bioinformatics 2007, Day 2: Model Format OWL (MFO), Lister et al.

Integration of constraints documented in SBML, SBO, and the SBML Manual facilitates validation of biological models

Published September 2007 by the Journal of Integrative Bioinformatics

Allyson L. Lister1,2, Matthew Pocock2, Anil Wipat1,2,*
1 Centre for Integrated Systems Biology of Ageing and Nutrition (http://www.cisban.ac.uk)
2 School of Computing Science (http://www.cs.ncl.ac.uk),
Newcastle University (http://www.ncl.ac.uk)*

Abstract

The creation of
quantitative, simulatable, Systems Biology Markup Language (SBML) models that
accurately simulate the system under study is a time-intensive manual process
that requires careful checking. Currently, the rules and constraints of model
creation, curation, and annotation are distributed over at least three separate
documents: the SBML schema document (XSD), the Systems Biology Ontology (SBO),
and the “Structures and Facilities for Model Definition” document. The latter
document contains the richest set of constraints on models, and yet it is not
amenable to computational processing. We have developed a Web Ontology Language
(OWL) knowledge base that integrates these three structure documents, and that
contains a representative sample of the information contained within them. This
Model Format OWL (MFO) performs both structural and constraint integration and
can be reasoned over and validated. SBML models are represented as individuals
of OWL classes, resulting in a single computationally amenable resource for
model checking. Knowledge that was only accessible to humans is now explicitly
and directly available for computational approaches. The integration of all
structural knowledge for SBML models into a single resource creates a new style
of model development and checking.

Introduction

Systems Biology Markup Language[1] (SBML) is an XML format that has emerged as the de facto standard file format for
describing computational models in systems biology. It is supported by a
vibrant community who have developed a wide range of tools, allowing models to
be generated, analysed and curated in any one of many independently maintained
software applications[1].
The Systems Biology Ontology
[2][2] (SBO)
was developed to enable a useful understanding of the biology to which a model
relates, and to provide well-understood terms for describing common modelling
concepts. The community is engaged in an on-going effort to develop the SBML
standard in ways needed to support systems biology applications. As part of
this process, a manual is maintained that describes and defines SBML and SBO[3].

The biological knowledge used to create and
annotate a high-quality SBML model is typically analysed and integrated by a researcher.
These modellers know and understand both the systems they are modelling and the
intricacies of SBML. However, as with most areas of biology, the amount of data
that is relevant to generating even a relatively small and well-scoped model is
overwhelming. In order to extend the range of modelling tasks that can be
automated, it is necessary to both capture the salient biological knowledge in
a form that computers can process, and represent the SBML rules in a way
computers can systematically interpret. Here we address the latter issue:
describing SBML, SBO and the rules about what constitutes a correctly formed
model in a way suitable for computational manipulation.

The Semantic Web[4]
can be seen as today’s incarnation of the goal to allow computers to go beyond
performing numerical computations, and to share and integrate information more
easily. There are now several standards forming within the Semantic Web
community that together formalise computational languages for representing
knowledge and strictly define what conclusions can be reached from facts
expressed in these languages. The Web Ontology Language
[3][5] (OWL) is
one such language that enjoys strong tools support and which is used for
capturing biological and medical knowledge (e.g. OBI[6],
BioPax[7],
EXPO
[4], and FMA[5] and GALEN[6] in OWL). Once the information about the domain has been modelled in
an OWL file, a software application called a reasoner
[7,
8]
can automatically deduce all other facts that must logically follow
as well as find inconsistencies between asserted facts.

The knowledge about a system described in
SBML can be divided into two parts. Firstly, there is the biological knowledge. This includes information about the
biological entities involved and their biological roles. Secondly, there is the structural knowledge, describing how the
biological knowledge must be captured in well-formed documents suitable for
processing by applications. In the case of a high-quality SBML model, the structural knowledge required to create
such a model is tied up in three main locations:

  • The Systems Biology Markup Language (SBML[1][8])
    XML Schema Document (XSD[9]),
    describing the range of XML documents considered to be in SBML syntax,
  • The Systems Biology Ontology (SBO[2][10]),
    describing the range of terms that can be used to describe parts of the
    model in a way understandable to the community using the Open Biological
    Ontologies (OBO[11])
    format, and
  • The "Structures and Facilities for Model Definition"
    document[12]
    (hereafter referred to as the "SBML Manual"), describing many
    additional restrictions and constraints upon SBML documents, and the
    context within which SBO terms can be used, as well as information about
    how conformant documents should be interpreted.

From a knowledge-engineering point of view,
it makes sense to represent these sources of structural knowledge as part of a
single knowledge base. Although, to a knowledge-engineer, this current
separation of documents could appear arbitrary, it is in fact well-motivated
according to consumers of each type of information. The portion of the
knowledge codified in SBML transmits all of and only the information needed to
parameterise and run a computational simulation of the system. The knowledge in
SBO is intended to aid humans in understanding what is being modelled. The SBML
Manual is aimed at tools developers needing to ensure that software developed
is fully compliant with the specification.

Only two of these three sources of
structural knowledge are directly computationally amenable. SBML has an
associated XSD that describes the range of legal XML documents, which elements
and attributes must appear, and constraints on the values of text within the
file. SBO captures a term hierarchy containing human-readable descriptions and
labels for each term and a machine-readable ID for each term. Neither of these documents
contains much information about how XML elements or SBO terms should be used in
practice, how the two interact, or what a particular conformant SBML document
should mean to an end-user. The majority of information required to develop a
format-compliant model is in the SBML Manual, in formal English. Anything more
than simple programmatic steps, such as XML validation, can currently only be
done by manually encoding the English descriptions in the SBML Manual into
rules in a program. libSBML[13]
is the reference implementation of this procedure, capturing the process of
validating constraints. Manual encoding provides scope for misinterpretation of
the intent of the SBML Manual or may produce code that accepts or generates
non-compliant documents due to silent bugs. In practice, these problems are
ameliorated by regular SBML Hackathons[14]
and the use of libSBML by many SBML applications. However, the need for a more
formal and complete description of the information in the SBML Manual becomes
more pressing as the community grows beyond the point where all of the relevant
developer groups can be adequately served by face-to-face meetings.

We find that some of these issues can be
avoided by combining the structural knowledge currently spread across three
documents in three formats into a single computationally amenable resource.
This method of constraint integration for all information pertinent to SBML
will require a degree of rigour that can only improve the clarity of the
specification. Once established, standard OWL tools can be used to validate and
reason over SBML models, to check their conformance and to derive any
conclusions that follow from the facts stated in the document, all without
manual intervention.

To address this proposition, we have
developed the Model Format OWL (MFO), implemented in OWL-DL and capturing the
SBML structure plus a representative sample of SBO and human-readable
constraints from the SBML Manual. We demonstrate that MFO is capable of directly
capturing many of the structural rules and semantic constraints documented in
the SBML Manual. The mapping between SBML documents and the OWL representation
is bi-directional: information can be parsed as OWL individuals from an SBML
document, manipulated and studied, and then serialized back out again as SBML.
We demonstrate feasibility with two simple, illustrative, examples. In future,
we hope to use this as the basis for a method of automatically improving the
annotation of SBML models with rich biological knowledge, and as an aid to principled
automated model improvement and merging.
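
A rough, hypothetical sketch of the general idea (not MFO's actual classes, property names, or API): SBML element types become OWL classes, and the parts of a particular model become OWL individuals, here using the owlready2 library. All names and the SBO term below are illustrative.

    from owlready2 import get_ontology, Thing, DataProperty, FunctionalProperty

    onto = get_ontology("http://example.org/mfo-sketch.owl")

    with onto:
        class Species(Thing): pass              # hypothetical classes, not MFO's own
        class Reaction(Thing): pass
        class hasSBOTerm(DataProperty, FunctionalProperty):
            domain = [Reaction]
            range = [str]

    glucose = Species("glucose")                # individuals standing in for model elements
    phosphorylation = Reaction("glucose_phosphorylation")
    phosphorylation.hasSBOTerm = "SBO:0000176"  # an SBO term identifier (illustrative)

    onto.save(file="mfo-sketch.owl")
    # Attaching a DL reasoner (e.g. owlready2's sync_reasoner(), which requires Java)
    # would then surface constraint violations as logical inconsistencies.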

The integration of all structural knowledge
for SBML models into a single resource creates a new style of model document
development, which we believe will greatly reduce the overheads associated with
computational transformations between biological knowledge and high-quality
systems biology models. MFO is not intended to be a replacement for any of the
APIs or software programs available to the SBML community today. It addresses
the very specific need of a sub-community within SBML that wishes to be able to
express their models in OWL for the purpose of reasoning, validation, and
querying. It has also been created as the first step in a larger data
integration strategy that will eventually encompass the biological as well as
structural knowledge present in SBML documentation and models.


[1]       Hucka,
M. et al.: The systems biology markup
language (SBML): a medium for representation and exchange of biochemical
network models. Bioinformatics (Oxford, England) 19 (2003) 524-531

[2]       Le Novere, N.: Model storage, exchange
and integration. BMC Neurosci 7 Suppl 1 (2006) S11

[3]       Horrocks, I., Patel-Schneider, P.F., van
Harmelen, F.: From SHIQ and RDF to OWL: The making of a web ontology language.
J. of Web Semantics 1 (2003) 7-26

[4]       Soldatova, L.N., King, R.D.: An ontology
of scientific experiments. Journal of the Royal Society, Interface / the Royal
Society 3 (2006) 795-803

[5]       Heja, G., Varga, P., Pallinger, P.,
Surjan, G.: Restructuring the foundational model of anatomy. Studies in health
technology and informatics 124 (2006) 755-760

[6]       Heja, G., Surjan, G., Lukacsy, G.,
Pallinger, P., Gergely, M.: GALEN based formal representation of ICD10.
International journal of medical informatics 76 (2007) 118-123

Enjoyed this? To read the rest, please see the Journal of Integrative Bioinformatics


Categories
Meetings & Conferences

Integrative Bioinformatics 2007, Day 2: myExperiment, Goble Keynote

Other than where specified, these are my notes from the IB07 Conference, and not expressions of opinion. Any errors are probably just due to my
own misunderstanding. 🙂

Carole compares it to MySpace. First, an introduction to workflows, where all of the standard bioinformatics statements are made: workflows are fantastic but laborious, this is the era of Service Oriented Architecture, and the aim is to make the repetitive and mundane stuff easier. This leads nicely into a mention of Taverna, a "workflow workbench"; Taverna 2 is being built now. They are getting up to 15,000 downloads per month in a good month. Carole then spent a slide on the phrase "in the cloud", a descriptor for independent third-party applications, tools and software. Taverna was designed for people who have little access to resources. She then described a few good examples of how Taverna could be used, and suggested doing more in Taverna than just workflows, for example SBML models or lab protocols. She also gave some reasons why using experimental data standards is a good idea, and mentioned that workflows (rather than just data) could be included in peer-reviewed articles.

myExperiment is meant to make it easy for scientists to pool information and data, and is meant to look like a social networking site, with collaborative social bookmarking and content sharing. They want to leverage and serve the long tail of the "cloud". They'll use it as a gateway to other publishing environments and a platform for launching workflows, and it is an "Open Archives" Initiative effort. They want to be able to launch and run Taverna workflows from myExperiment, and also to encourage workflow "mashups" and publishing.

Here's where my opinion goes: by halfway through the talk, she hadn't said what she meant by myExperiment, other than that it is meant to be the MySpace for bioinformatics. However, she did spend at least the last 15 minutes discussing it. Further, while she talked about the usefulness of experimental data standards in relation to adding lab protocols to Taverna, she never related this to FuGE. As FuGE is being published in Nature Biotech and is being touted as a possible standard experimental data exchange format, it seems an odd omission (especially as one of the main developers of FuGE, until recently, also worked at the University of Manchester). In conclusion, while very interesting, it was not as "meaty" as I'd like. A good talk overall, and I think I'll sign up for myExperiment!


Categories
Meetings & Conferences

Integrative Bioinformatics 2007, Day 1: The OXL format, Taubert et al

Other than where specified, these are my notes from the IB07 Conference, and are in no
way expressions of opinion, and any errors are probably just due to my
own misunderstanding.

OXL is the ONDEX data format, and they are presenting it as a possible format for the exchange of integrated data. OXL is based on an ontology (opinion/question: a true ontology, or a CV?) of concepts and relations. ONDEX itself is an open-source data warehouse in Java that performs ontology-based data integration. OXL is in RDF, and there are two ways to use RDF: firstly, model relations as predicates (but then they cannot carry attributes), and secondly, model them as classes. However, it also seems that they have OXL in XML format, using an XSD.
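
A small sketch of those two RDF styles (my own illustration using rdflib, not OXL's actual vocabulary): a relation modelled as a bare predicate cannot carry attributes, whereas reifying it as a class instance lets the relation hold, for example, an evidence score.

    from rdflib import Graph, Literal, Namespace, RDF

    EX = Namespace("http://example.org/")
    g = Graph()

    # Style 1: relation as a predicate -- nowhere to hang attributes on the edge.
    g.add((EX.geneA, EX.interactsWith, EX.geneB))

    # Style 2: relation as a class -- the interaction node can carry attributes.
    g.add((EX.int1, RDF.type, EX.Interaction))
    g.add((EX.int1, EX.fromConcept, EX.geneA))
    g.add((EX.int1, EX.toConcept, EX.geneB))
    g.add((EX.int1, EX.evidenceScore, Literal(0.87)))

    print(g.serialize(format="turtle"))

The second style is also one way that weighted edges, of the kind I ask about in the opinion below, could be represented.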

In their XML format they don't use any cross-references: everything is fully expanded. Yes, it generates a lot of XML, but with file compression that isn't a problem. It does make whole-document validation more difficult, but they're working on it. This method makes the format more human-readable.

They then presented some examples. The first was the identification of possible pathogenicity genes in Vibrio salmonicida (with the University of Tromsø): identify clusters of orthologs involving V. salmonicida, then colour the nodes according to pathogenicity phenotype.

http://ondex.sf.net

Here are my opinions: a well-presented talk on the whole. I don't mean to keep harping on about architecture slides today, but they're important when describing software; they had some, but they were so small they were pretty hard to read. Also, I've never been convinced by the "human-readable" justification for changing an XSD: XML is simply not meant to be human-readable, and changes shouldn't be made to the XSD to make it so. However, ONDEX is a reasonably mature application, and so it may be worthwhile to ask others to use their format. My main question is about probabilities: a lot of similar work uses weights on edges in data integration, so how can these be modelled with OXL?


Categories
Meetings & Conferences

Integrative Bioinformatics 2007, Day 1: CABiNet, Oesterheld et al

Other than where specified, these are my notes from the IB07 Conference, and are in no way
expressions of opinion, and any errors are probably just due to my own
misunderstanding.

There are a large number of concepts and methods, and a need to integrate different networks and additional information networks.

CABiNet (Comprehensive Analysis of Biological Networks) is a generic network analysis suite with a semi-automatic network processing pipeline and methods for the exploration of a protein's functional network. The work was driven by the need to identify the substructures of a network (via clustering, network topologies, and known communities), and by the knowledge that networks are incomplete (hence the superposition of networks). A problem with the former is that different algorithms may lead to different results. CABiNet was created with a component-oriented architecture. The architecture slides were very clear, but necessarily hard to reproduce in these notes. It uses Hibernate, Spring, and EJBs, among others; the "n-tier" architecture they use is called Genre, and there is asynchronous invocation of the processing pipeline based on Message Driven Beans.

Via their integration tier they have access to SIMAP, which is used in the detection of orthologs. It contains millions of proteins and FASTA hits, and they can do real-time orthology searches. There is both web-service and EJB access.

One example pipeline would be functional classification based on multiple biological networks. Another would be the identification of functional modules from gene expression data (you can upload microarray data). They also showed a real example of a protein interaction network from a nucleosomal complex, and then added more and more data sources until the network grew quite complex.

Here comes my opinion: the clearest talk of the afternoon, dealing with an externally available application that allows the user to create pipelines to perform network analysis. This one did have good architecture slides: perhaps a few too many 😉. Their architecture makes me wonder if they use AndroMDA or some other MDA tool, as such Maven plugins can generate Hibernate/Spring/EJB/web-service layers. They should have mentioned other similar projects and how theirs differs, as I wouldn't know. However, that's a small beef. Finally, how do they know how good each data source is? Do they have a rating for each data source, for instance against a gold standard?
