Data Integration

Chapter 2: SyMBA (Thesis)

[Previous: Data integration methodologies for systems biology]
[Next: Chapter 3: Saint]

SyMBA: a Web-based metadata and data archive for biological experiments


With the development and adoption of high throughput experimentation techniques in the life sciences, vast quantities of heterogeneous data are being produced. Consequently, there is a need to integrate these diverse datasets to build a more system-level view of organisms under study. These datasets must be archived, curated with appropriate metadata, and made available in a suitable form for subsequent analysis and integration [1].

There is also an urgent need to provide tools and resources that allow the digital curation and management of scientific data in a fashion amenable to standardisation and capable of facilitating data sharing and knowledge discovery [2]. It is desirable that this data storage and annotation process is implemented as early as possible in the data management cycle to prevent data loss and ensure traceability throughout.

This chapter describes the development of the SyMBA, a data and metadata archive for scientific data [3]. SyMBA addresses the need for data standardisation described by Paton [2] by implementing a method of experimental metadata integration for systems biology. SyMBA is an open source software system providing a mechanism to collate the content, syntax and semantics of scientific data, regardless of type. It can be used to review, share and download metadata and data associated with stored experiments through a simple Web interface. SyMBA accommodates both the representation of community-endorsed data standards and the development of bespoke representations to suit users’ digital curation needs.

Section 1 contains a summary of life-science data standards and how they are utilised by SyMBA. A detailed description of how SyMBA was implemented follows in Section 2, including initial requirements, architecture, data storage and retrieval techniques, and versioning methodology. Section 3 explains how SyMBA can be used, while Section 4 describes similar work and discusses the benefits of SyMBA as part of a larger life-science research life cycle.

1 Availability

A sandbox installation of SyMBA is freely available to test at Researchers wishing to use SyMBA as a local data and metadata archive may evaluate it by downloading the system and customising it for their needs. The Web application is implemented in GWT (Note:, with all major browsers supported. SyMBA is built with Apache Maven 2 (Note:, a software project management and build system similar to but more comprehensive than other build tools such as Apache Ant or GNU Make. SyMBA has a project website (Note: and a Newcastle University project page (Note: The project website provides access to the Subversion repository (Note: as well as various explanatory documents for programmers (Note: http://,

SyMBA is licensed by Allyson Lister and Newcastle University under the LGPL (Note: For more information including licensing for third-party libraries used in the application, see LICENSE.txt in the SyMBA Subversion project. Installation and running of SyMBA has been tested on 32-bit Ubuntu 8.10 and higher. Development questions can be directed to

1 Data Standards

To ensure maximum reuse of published data, Lynch [4] has stated that scientists should act as data stewards in three ways:

  • honour established disciplinary data standards;
  • record appropriate metadata to make future interpretation of the data possible; and
  • define metadata such as provenance and parameters, ideally at the time of data capture [4].

SyMBA makes use of a number of standards, emerging standards, and common formats to save time downstream of data generation by ensuring compatibility with other centres of research and journals complying with the same standards. SyMBA can help research groups follow all of the stewardship methods described above and can limit tedious and repetitive data entry. A brief overview of content, syntax and semantic data standards that are or can be integrated within SyMBA is provided in this section. A more detailed description of these standards is available in Section 1.4.

1.1 Data Content

The results and conclusions of scientific investigations are dependent upon the context, methods and data of the associated experiments. Defining a list of commonly required metadata to guide researchers in reporting the context of these experiments is increasingly gaining favour with data repositories, journals and funders [5]. These metadata reporting checklists outline the important or minimal information integral to the evaluation, interpretation and dissemination of an experiment; such guidelines effectively define what is contained within a scientific dataset and how the set was generated. The MIBBI Registry (Note: provides a listing of these checklists. SyMBA is specifically designed to allow users to store and annotate their data according to any content standard such as those registered with MIBBI. Users can create and save standards templates for guidelines such as MIAME [6] and MINI electrophysiology [7]. Saved templates are then accessible to all users of SyMBA.

1.2 Syntax

Digital curation in the life sciences is predominately database centric [8]. While repositories have been developed for single datatype ‘omics’ experiments [9, 10, 11, 12], until recently the lack of suitable data standards has hampered the development of a data storage system capable of storing multiple data types in combination with a uniform set of experimental metadata.

Recently, two related projects have been created to address the need for a common syntax to describe experimental metadata: FuGE and ISA-TAB [13]. The FuGE project was formed with the aim of standardising the experimental metadata for a range of omics experiments. The FuGE standard contains a model of experimental objects such as samples, protocols, instruments, and software, and provides extension points for the creation of technology-specific data standards [14]. The FuGE project’s native format is UML, with the FuGE-ML XML export as its most commonly used format. The availability of FuGE makes the development of a generic data and metadata storage portal feasible not only to perform data capture, but also for the purpose of integrating metadata from a range of experimental datasets. Among others, the proteomics (Note: Proteomics Standards Initiative,, metabolomics (Note: Metabolomics Standards Initiative, url, microarray (Note: FGED Society, and flow cytometry (Note: Flow Informatics and Computational Cytometry Society, standards groups have adopted the FuGE standard either wholly or for certain aspects of their formats. The structure of SyMBA is based on version 1.0 of FuGE-ML and accepts input and provides output in that format.

The ISA-TAB project is a standard for describing experiments using a tabular format, and is ideal for researchers primarily using spreadsheets. ISA-TAB has been developed alongside FuGE and can be converted into FuGE if required [15]. While spreadsheet formats are useful for data entry and record keeping on a small scale, an XML format—and the associated benefits such as XPath and the large library of tools for the XML format—was deemed more suitable for use within SyMBA.

1.3 Semantics

Describing the meaning, or semantics, of data and its associated metadata can be a complex and difficult process currently being addressed in the life sciences through the use of controlled vocabularies or ontologies [16]. The use of ontologies for describing data has a twofold purpose. Firstly, ontologies help ensure consistent annotation, such as the spellings and choice of terms, such that a single word-definition pair is used to describe a single entity. Secondly, ontologies can add human- and computationally-amenable semantics to the data. Related datasets can be integrated and queried via a common ontology, for example linking experimental results with associated publications. In SyMBA 1.0, users could pre-load controlled vocabularies and portions of ontologies to limit the descriptor choices made available to users. However, it was a complex feature to implement and was not heavily used. In SyMBA 2.0 the functionality was modified such that term choices previously saved by users are offered as auto-complete suggestions in some text fields. When an interface better suited to the addition of subsets of ontologies and controlled vocabularies has been designed, this feature may be reintroduced.

2 Implementation

One of the major requirements of representing scientific data is to be able to store raw, unprocessed data together with information about the equipment, processes, people and conditions under which they were generated [2]. Data arises from a variety of sources and is generated by many different protocols and experimental procedures. Metadata, or information about how the data was generated, should also be captured to ensure that recommended minimal information standards are met. Further there should be methods, whether directly provided from data producers or indirectly via public repositories, to export the data in standard formats. Users also need to be able to retrieve their data and track modifications to their metadata through the assignment of appropriate unique identifiers to ensure traceability and data provenance. Finally, and perhaps most importantly to users of a bioinformatics application, there is a need for an effective user interface. The rest of this section describes the tools and techniques used to meet these requirements.

2.1 Toolkit capabilities

As shown in Figure 1, the SyMBA toolkit provides a number of features:

  • the choice of a FuGE-structured RDBMS or in-memory back-end;
  • an object-relational persistence and query layer using Hibernate (Note:;
  • a set of plain Java classes representing FuGE-ML entities which also connect to the database via hyperjaxb3 (Note:;
  • unit tests of all queries and version retrieval functionality; and
  • a user-friendly Web interface that interacts with the chosen back-end and a remote file store for raw data outputted from experimental assays.

Figure 1: An architectural overview of SyMBA. Blue shading signifies hand-written code, while unshaded areas designate automatically-generated code. For the back-end, automatically-generated code is created with hyperjaxb3 (shown with dashed arrows). GWT automatically converts the front-end Java code into a compiled Web archive. Circles mark agent access points; solid arrows indicate direction of data flow.

FuGE provides all of the structures required to adequately describe an experimental protocol, including data, analyses, and information about the structure of the experiment. At a basic level, information in FuGE-ML may be stored simply as a collection of XML documents or as records in an XML database. However, the use of a RDBMS allows the storage of metadata in a format amenable to searching via technology that is trusted, scalable and familiar to the bioinformatics community. SyMBA uses hyperjaxb3 to create a Hibernate persistence layer and database as well as an automatically-generated FuGE Java API based on the FuGE XSD. Hibernate is a service that can be used with any compatible RDBMS and which allows the creation of an object/relational database persistence and query layer, thus abstracting the low-level database code away from the programmer interface.

In addition to the database back-end, SyMBA also implements a memory-only back-end, which facilitates the testing and initial set-up of SyMBA without needing to first install a database. The in-memory version of SyMBA has all of the capabilities of the database implementation except the ability to store the entered metadata beyond the lifetime of the Web application itself. If an in-memory version of SyMBA is restarted, all data previously uploaded is deleted. This makes the memory implementation ideal for test or sandbox installations.

2.2 Architecture

On the client side, SyMBA is a simple Web application for describing experiments and storing experimental data. Behind the user interface, SyMBA is implemented in GWT and hosted on a Tomcat (Note: server. GWT was chosen because it abstracts away the details of the user interface and allows the programmer to create a working Web application quickly and without needing to know about differences between Web browsers. Figure 1 shows an architectural overview of SyMBA and how GWT relates to the application as a whole, while Figure 2 shows how GWT can be used to create a Web application.

Figure 2: A simplified overview of how Google Web Toolkit creates a Web application. Hosted mode is a development environment for Google Web Toolkit applications, which allows more detailed debugging than is available when running the application using a standard Web server.

With GWT, the programmer writes using standard Java. The Java code is separated into client-side and server-side classes. During testing, all of the code is compiled using a Java compiler and run in hosted mode, a development environment for GWT where a dedicated browser runs the Web application and any errors in the code are reported with standard Java exceptions. In this way, hosted mode makes debugging the application easier. When ready for production, server-side classes are compiled with a Java compiler and the client-side classes are converted into JavaScript using the GWT compiler. This results in JavaScript created specifically for each of the common browser types. This creation of multiple versions of the application allows the programmer to ignore the details of cross-browser compatibility. At the end of the production compilation, a Web archive is produced which can be deployed on an appropriate server such as Tomcat.

SyMBA 1.0 made use of Java Server Pages, which have the benefit of being simple to learn and quick to implement. However, much of the processing code and display code was mixed within the pages, and the application as a whole was not highly configurable. Templates for new experimental methods had to be created by hand, or through a convoluted method of building template FuGE-ML files which would be converted on-the-fly by the Web front end. SyMBA 2.0 allows all users to generate templates via the Web interface rather than by hand-coding XML or Java Server Pages. The new template generation functionality opens up the use of SyMBA templates to a larger community of biologists rather than just bioinformaticians.

2.3 Identifiers

All objects in FuGE-ML inherit from the Describable complex type. This entity allows basic information such as audit trails, additional ontology terms and textual descriptions to be added to all objects. Not all FuGE objects require unique identification, and as such entities that inherit only from Describable have no requirement for identifiers. Most objects in FuGE which describe entities rather than collections of those entities also extend Identifiable, which adds a unique identifier and human-readable name to objects of this type.

SyMBA takes this level of identification one step further, and ensures that all identifiers created within SyMBA are based on LSIDs and therefore globally unique. To facilitate versioning (described further in Section 2.4), SyMBA introduced the abstract Endurant entity as a new child of Describable and sibling of Identifiable, as shown in Figure 3. Both Endurant and Identifiable objects are identified within SyMBA using LSIDs.

Figure 3: The relationship of the SyMBA-specific Endurant element to existing FuGE structures. Abstract entities are represented as rectangles, and concrete entities as circles. The Endurant element and its children (shown in yellow) were added to the FuGE XSD to provide all Identifiable elements with a mechanism to group versions together. The Describable and Identifiable elements are shown in blue and are key elements of the original FuGE XSD. Both Endurant and Identifiable entities are children of Describable, and both are identified in SyMBA with an LSID.

In order to describe how the identifiers are created within SyMBA, the LSID specification must be examined more closely. LSIDs contain both metadata and data. LSID metadata is information about the LSID itself, while LSID data is the information of interest associated with that LSID. An important requirement of LSIDs is that the returned data for a particular LSID must always be identical to the data retrieved in past queries of that LSID. Unlike its data, an LSID’s metadata can change at any time and is not required to always return the same object. LSID metadata includes an expiration timestamp which, if null, means the LSID does not expire, as well as any other information that the supplier of the LSID wishes to provide. In SyMBA, the timestamp will always be null, and the provided metadata is an RDF-formatted object with a title and a value. The metadata value is the LSID of the most recent Identifiable object in that version group. More information on these methods is available in Section 3.6 and Table 1.

The use of the terms data and metadata within the LSID specification should not be confused with their definitions within the FuGE standard. FuGE is a format standard for describing the metadata associated with an investigation. Some raw data can be defined within the FuGE structure, although in SyMBA associated data files are always stored separately from the FuGE metadata.

Over the past five years there has been a large amount of controversy on whether LSIDs are sustainable in the long term, or if HTTP URIs are a better solution (Note:,, Recently, a number of high-profile biology standards such as MIRIAM have developed non-LSID schemes such as, which uses URIs to provide resolvable, permanent identifiers for many life sciences databases. As SyMBA is not intended to be a public large-scale warehouse of experimental data but rather an application for individual research groups to store their data locally and in a standard format, URIs based on the HTTP protocol are not required. Whether or not LSIDs become a heavily-used standard, as URNs they can be easily converted to any other identifier system which does gain widespread acceptance.

2.4 Versioning

The requirement for maintaining data provenance means that modification histories for all metadata must be recorded. The issue of versioning is therefore important. The core FuGE structure can record simple audit information but does not store histories and details of changes to experimental objects. To allow access to previous versions of objects, each versioned object has a stable and globally-unique identifier. Reflecting its more permanent nature, versioning is used when running the database implementation of SyMBA but not with the in-memory implementation.

As shown in Figure 3, the three main classes involved in versioning in SyMBA are Describable, Endurant and Identifiable. All three are abstract, and may not be directly instantiated. Endurant and its child classes were created to add versioning to the FuGE model. Endurant objects do not change their intrinsic value over time, and are identified with an LSID. It is useful to distinguish between a modification to an actual experimental object, such as the addition of a new camera to a robot, and a modification of the stored metadata, such as fixing a spelling mistake in the name of the camera (see Figure 4). However, while the SyMBA back end allows these two types of versioning, the front end will currently always attach the new Identifiable to the existing Endurant. The difference between the two types of versioning can be difficult for users to grasp and therefore at this time only one method has been added to the user interface.

Endurant objects point to one or more versions of non-enduring Identifiable objects and are unresolvable by the LSID Web services. This restriction prevents the LSID Authority from breaking the LSID specification, as the authority must not allow the same LSID to resolve to different objects. All other objects identified with an LSID are children of the Identifiable entity, and identify a particular version of the object associated with the Endurant. As such, Identifiable LSIDs will always point to the same, single, version of the object and are resolvable.

The Endurant classes allow every Identifiable in FuGE to represent a single revision in the history of the Endurant: new objects are added to the database with every change, however minute. The state of an object at any point in its history is easily retrievable, as is the latest version of each object. Error-correction or addition of information, such as fixing a typo, will create an additional Identifiable associated with the same Endurant.


Figure 4: The implementation of Endurant objects within SyMBA. In a), a superficial change to the spelling of an equipment name creates a new Identifiable assigned to the existing Endurant. In b), modifications to the type of equipment used requires the creation of a new Endurant as well as a new Identifiable object.

Both the addition of new objects and updates to existing objects have the same cost within SyMBA, as both actions result in the creation of a new object; no cascade event or propagation of the new object is required. Such a model is suitable for an archival database as it is during data retrieval, rather than data storage, when new versions of each object are queried for and retrieved. This produces an application that is quick to update, but not as quick to retrieve.

2.5 Investigations and Experiments

SyMBA uses a specific terminology for describing experiments based on the FuGE standard. An investigation is a high-level concept which contains hypotheses and conclusions as well as a set of one or more experiments. As such, every experiment in SyMBA must be contained within a top-level investigation. Experiments can be arranged in an ordered hierarchy of steps and may have data files associated with them (see Figure 8 for an example).

3 Data and metadata storage and retrieval

Traditionally, laboratories stored raw data and metadata in disparate, interconnected resources such as spreadsheets, file stores, local databases and lab books. While these are useful tools for a conventional approach to biological research, they do not meet the needs of an environment where the generation of high throughput datasets is commonplace. Further downstream analysis of the raw data requires data to be adequately annotated and available in a standard format. In many cases, the researcher carrying out the analysis will not be the same person who generated the data. SyMBA has been developed to meet the need for a centralised raw data repository that collates and stores data and captures sufficient metadata in an integrated fashion to facilitate downstream analysis, export and publication. This section describes how SyMBA is utilised by individual users to store metadata, create templates and generate FuGE-ML for download.

3.1 Welcome and creating an account

The main entry point for end users is a Web-based graphical user interface that displays a simplified view of the underlying data structure and allows upload of data and description of experimental metadata. The welcome screen is shown in Figure 5. There are three main sections to the SyMBA Web application which are named hereafter using compass directions for their positions on the page. The header along the top of the page, or north panel, contains the SyMBA logo and name on the left, a variety of icons on the right, and the name of the user on the far right. The SyMBA logo will always take the user back to this welcome page. Named from left to right, the four icons on the right-hand side of the header are shortcuts to add a new experiment, view existing experiments, get data files or metadata FuGE-ML for download, and browse the help topics. If the user is not yet logged in, only the links to the welcome and help pages are active.

Figure 5: The welcome screen for the SyMBA Web application.

To the west is the main SyMBA usage panel. Initially, the west panel contains a short summary of SyMBA and a login form. Once the user begins work within SyMBA, the west panel displays summaries of experiments, the experimental metadata and much more. In the east is an optional status panel providing both contextual help and status messages. The status panel can be hidden at any time by clicking on “Hide this panel”.

If SyMBA is being used in a sandbox or test installation, the user may prefer to login with a preprepared guest account from the default installation of the application. A guest logs in by choosing “Guest Account” from the pull-down menu of users in the west panel, and clicking Login. Otherwise, a new user may be created by clicking on “add new user and log in” directly to the right of the login button. The user then fills out the contact form that appears and clicks Save Contact. The contact is saved and the user is automatically logged in. A named account, rather than a guest account, allows the stored metadata to be identified with a particular user.

3.2 Add and view buttons

As described earlier, the right side of the north panel contains a number of quick link buttons. Clicking on the add button results in the appearance of a pop-up window for creating new investigations as shown in Figure 6. This pop-up presents the user with three options. Choosing “Add New” will take the user directly to an empty investigation. The second option is a pull-down menu of existing investigations. Selecting one of these and pressing the Copy button will take the user to the standard investigation form pre-populated with a copy of the selected investigation. Finally, the “View Investigations” link allows the user to opt out of the add menu and instead view a list of existing investigations. This link works in a way identical to the view button from the north panel.

Figure 6: The pop-up window which appears when the add button from the north panel of SyMBA is pressed.

The view button present in the north panel takes the user to a summary of all investigations stored within SyMBA. Figure 7 shows an example summary page containing template, completed and standard investigations. Standard, modifiable investigations are shown in plain text, while read-only investigations are displayed in italics. Further, if an investigation is either completed or a template, this information is displayed in front of the investigation name. Both template and completed investigations are described in detail in the next section.

As explained in the contextual help section of the east panel, the user may either click on an investigation name to view its details or click on the radio button next to its name and then either Copy or Delete the investigation. SyMBA can be modified to disable or completely remove the Delete button.

Figure 7: The summary of investigations stored within SyMBA shown when the view button from the north panel is pressed.

3.3 Templates and completed investigations

If a user wishes to save a series of experimental steps for anyone to copy and use, a template can be created. There are a number of benefits to the use of templates.

  • Content standards. SyMBA templates can be built to community standards such as those present in the MIBBI registry.
  • Conformance. If there are standard protocols for certain types of experiments within a research group, each protocol can get their own template in SyMBA. As each protocol is realised in a particular experiment, users can copy the template and fill in specific experimental details.
  • Reuse. Templates make it easier to add information to experiments which are repeated multiple times. Rather than creating a completely new experiment description and introducing potential errors due to mistakes and missing information a template, for instance based on a completed description, can be used as many times as required to ensure the same information is filled in each time.

A template can either be begun with a blank investigation or based on an existing investigation. Once a set of experimental steps has been created, saving the investigation as a template simply requires the selection of the “Set As Template” checkbox, as shown in Figure 8. If an existing investigation is to be converted to a template a copy should be made of the investigation first. As the conversion removes all links to files and marks the investigation as read-only the copy, rather than the original, should be the one converted to a template. All information contained within each experimental step other than the file links will be retained in the template. The retained information includes parameters, descriptions, names, input materials and output materials. For parameters, the name of the parameter and the relationship the parameter name has with its value will be marked as read-only. Therefore, when a user starts working with an investigation based on that template wishes to fill in that parameter, they must retain the name and relationship of each templated parameter already provided. More information on parameters is available in the next section.

If the description of an investigation within SyMBA has been completed and all parameters and associated metadata have been filled in, the investigation can be frozen, which marks it as completed and prevents further modification. Freezing an investigation is achieved by selecting the “Freeze” checkbox visible in Figure 8. Italics are used to mark read-only investigations in the summary of experiments as shown in Figure 7.

Figure 8: The “View Existing Investigation” screen in SyMBA. This investigation has had some experimental steps and parameters added in preparation for saving it as a template. The user has selected the “Set As Template” checkbox. Clicking on Save or Save and Finish will result in this investigation being saved as a new template within SyMBA.

3.4 Storing experimental metadata and data

When a user is ready to start entering metadata and uploading data files, an existing template can be copied or a brand new experiment description can be created as described in Section 3.2. The act of copying or creation brings the user to the detailed investigation view where SyMBA users can create a new experiment, fill in information about the experiment and upload associated raw data files.

Figure 8 shows an overview of the investigation detail page. Each experimental step can have zero or more parameters, input materials and output materials. Parameters are structured according to a statement of three required and one optional attributes, as shown in Figure 9. These attributes are structured as NameRelationshipValueUnit, with the Unit being optional. In templates the Value can also be missing, but in standard investigations the Value attribute is mandatory. At any time, parameter statements can be deleted. While the NameRelationshipValue portion of the parameter statement is similar to the SubjectPredicateObject of RDF triples, RDF is not used within SyMBA as all metadata can be captured within the existing FuGE structure.

Figure 9: Pop-up window allowing user input of parameters and materials.

FuGE parameters have three types: number, boolean and string literal. These types can be manually specified by the user by selecting the appropriate radio button at the end of the parameter input line, as shown in Figure 9. However, these parameter types are relatively easy to programmatically ascertain, and once a Value is provided SyMBA will automatically select the radio button corresponding to its best guess. This selection can be overridden by the user at any time.

Parameters are highly configurable and can be built to describe virtually any statements required by the user. Concentrations of reagents, types of equipment, taxonomic information and repeat numbers are just some of the vital pieces of experimental information which can be modelled with parameters. In theory, materials could also be stored as parameters, but adding materials separately allows sets of commonly-used materials to be created and reused across investigations.

The name and relationship attributes within parameters are suitable candidates for restriction using ontology terms. Although this functionality was partly implemented in SyMBA version 1.0, it has not yet been added to version 2.0 as the best display of this information has not yet been determined. For instance, OBI could be used for equipment and protocol names, while RO could be used for some of the relationship attributes. In future, equipment will be organised as materials are now, but are currently modelled using parameters.

3.5 Retrieving FuGE-ML

SyMBA uses and manipulates Java objects built from the FuGE XSD, converting upon request between those objects and FuGE-ML for download and batch upload. The get button is used to view and download the metadata for a given investigation in FuGE-ML. Figure 10 shows the pop-up menu which appears when the get button is pressed, while Figure 11 shows a partial screenshot of the resulting FuGE-ML.

Figure 10: The pop-up download menu which appears as a result of pressing the get button in the north panel of SyMBA.

Figure 11: Example FuGE-ML retrieved when an investigation is selected for retrieval in SyMBA.

3.6 SyMBA Web Services

Access to SyMBA is available via a Web interface and through the SyMBA Web services. Web services provide straightforward access to applications via the Web (Note: For more information see SyMBA makes use of Crossfire (Note: via a Maven 2 plugin to package Web services into a Web archive which is then loaded into a Tomcat server.

Currently, SyMBA has three services containing a total of four methods. All of these services relate to the LSIDs contained within SyMBA:

  • The LsidAssigner service contains one method, assignLsid(). This method creates a new LSID based on the default values within the application.
  • The LsidResolver service contains one method, getAvailableServices(). This method checks that the available Web services can deal with the provided LSID.
  • The LsidDataRetriever service provides the getMetadata() and getData() methods from the LSID specification. The former retrieves the LSID of the latest Identifiable object associated with the provided LSID. The latter returns the appropriate FuGE-ML or data file associated with the provided LSID. This may be a full document or XML snippet, depending on the LSID. A RuntimeException is returned if the LSID is not stored within SyMBA. LSIDs based on Endurants will also return a RuntimeException, as there is no way to be sure which version of the corresponding Identifiable objects have been requested.

The LSID Web services currently implement only those parts of the specification (Note: in active use within SyMBA. In particular, these Web services implement LSIDs as Java strings rather than as complex objects. Table 1 shows how the LSID specifications’ getMetadata() and getData() methods are handled. To retrieve FuGE metadata for a given LSID, call the getData() method in the LSID Data Retrieval Web service. To retrieve the data file itself, the getData() method is passed the LSID that points to the file of interest. Only those methods from the LSID specification that are useful within SyMBA have been implemented as it is unclear how complete the long-term uptake of LSIDs will be, and it is important to keep the SyMBA implementation as straightforward as possible.

getMetadata() getData()
Endurant LSID of latest Identifiable empty array of bytes
Identifiable LSID of latest version FuGE-ML for that object
Data File Identifiable LSID of latest holding investigation original file + SyMBA identifier
Timestamped LSID of latest version FuGE-ML as at the stated time
Table 1: Each column describes a different LSID-specified method, and the rows display the behaviour of each method when that object type is passed as a parameter. An Endurant does not point to a specific object, and therefore will never return anything in the getData() call, though it will return the LSID of the latest Identifiable associated with it via getMetadata(). Any time-stamped LSID, whether Endurant or Identifiable, will always return the FuGE-ML that was present at the requested time.

4 Discussion

With the development and application of high-throughput methodologies and experimentation in the life sciences, researchers are seeing what is often referred to as the life-science data deluge [17, 18]. To cope with the data deluge and preserve and discover the knowledge contained within it, there needs to be support for both public centralised archiving, such as that provided by the EBI (Note: and the National Center for Biotechnology Information (Note:, and smaller-scale local archiving such as that possible with a local SyMBA installation.

Data and metadata archiving that utilises content, syntax and semantic standards facilitates data longevity, provenance and interchange. SyMBA has been designed to meet this requirement for research groups and is capable of archiving numerous types of life-science data using FuGE, a community developed and endorsed data standard. SyMBA provides a straightforward capture method for essential experimental metadata and a safe location to store original, unmodified data. Metadata capture is centralised, enabling users to access and view their data in a uniform fashion. SyMBA is the first complete implementation of FuGE Version 1.0 from database back-end through to Web front-end, and includes a full library of code to manipulate and export/import information from FuGE-ML input files and the FuGE-ML-based database. Other applications such as the ISA Software Suite have been built around sister standards such as ISA-TAB.

Within CISBAN, SyMBA was the official archiving platform, and is used as one part of a larger data integration infrastructure. CISBAN has generated metadata and data from approximately 400 data files requiring 300 Gb of space, and covering high-throughput and large-scale transcriptomics, proteomics and time-lapse, multi-dimension, live-cell imaging experiments. SyMBA 1.0 was designed to be a required archival step in the data processing pipeline for all wet lab researchers within CISBAN. However, while this initial implementation had a back end database and support services which were suitable for the task, users found the front end difficult to manage. In practice, each researcher needed the help of a bioinformatician to work through the front end metadata deposition forms. This resulted in low usage of the services, and requests for refinements and improvements to the front end. These changes resulted in SyMBA 2.0, which was primarily a front end user interface change, although streamlining was also performed for non-user-facing parts of the architecture such as the persistence layer. However, the upgrade to SyMBA 2.0 was ultimately never put into production, as CISBAN reached the end of the lifetime of the grant. It remains a useful tool for interested research groups to download and install, allowing flexible metadata entry as well as storage and export of that metadata in a published, standardised format.

In the spirit of the community standards that SyMBA can accommodate, the SyMBA code base is open source. New users and developers are encouraged to interact and contribute to the SyMBA system in an open and collaborative manner, to suit their specific needs. A wider application and contribution to SyMBA will benefit and help solve standards based data management issues in the life-sciences. These features make SyMBA an ideal infrastructural component for enabling interoperability, integration and analysis across many different types of experimental data, and for supporting data management and knowledge discovery throughout the research life cycle. SyMBA aids data integration by providing a common format for multi-omics experimental metadata, minimising format challenges as described in Section 1.6.

SyMBA provides a browsable history of experimental data and metadata while allowing users to compare aspects of studies such as controls and treatments. The unified and simple interface for data capture makes it easier for researchers to archive and store important information about their experiment. Easy copying of the structure and metadata from one investigation to another improves repeatability of experiments. SyMBA also provides improved metadata organization through the use of a nascent standard (FuGE), making preparations for publication and data interchange simpler. Additionally, SyMBA provides a computationally amenable starting point for data analysis pipelines via SyMBA’s ability to store disparate data outputs of laboratories in a single repository with a single metadata format.


Alvis Brazma, Maria Krestyaninova, and Ugis Sarkans. Standards for systems biology. Nature Reviews Genetics, 7(8):593–605, August 2006.
Norman W. Paton. Managing and sharing experimental data: standards, tools and pitfalls. Biochemical Society Transactions, 36(1):33–36, February 2008.
A. L. Lister, A. R. Jones, M. Pocock, O. Shaw, and A. Wipat. CS-TR Number 1016: Implementing the FuGE Object Model: a Systems Biology Data Portal and Integrator. Technical report, Newcastle University, April 2007.
Clifford Lynch. Big data: How do your data grow? Nature, 455(7209):28–29, September 2008.
Chris F. Taylor, Dawn Field, Susanna-Assunta Sansone, Jan Aerts, Rolf Apweiler, Michael Ashburner, Catherine A. Ball, Pierre-Alain Binz, Molly Bogue, Tim Booth, Alvis Brazma, Ryan R. Brinkman, Adam Michael Clark, Eric W. Deutsch, Oliver Fiehn, Jennifer Fostel, Peter Ghazal, Frank Gibson, Tanya Gray, Graeme Grimes, John M. Hancock, Nigel W. Hardy, Henning Hermjakob, Randall K. Julian, Matthew Kane, Carsten Kettner, Christopher Kinsinger, Eugene Kolker, Martin Kuiper, Nicolas L. Novere, Jim Leebens-Mack, Suzanna E. Lewis, Phillip Lord, Ann-Marie Mallon, Nishanth Marthandan, Hiroshi Masuya, Ruth McNally, Alexander Mehrle, Norman Morrison, Sandra Orchard, John Quackenbush, James M. Reecy, Donald G. Robertson, Philippe Rocca-Serra, Henry Rodriguez, Heiko Rosenfelder, Javier Santoyo-Lopez, Richard H. Scheuermann, Daniel Schober, Barry Smith, Jason Snape, Christian J. Stoeckert, Keith Tipton, Peter Sterk, Andreas Untergasser, Jo Vandesompele, and Stefan Wiemann. Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nature Biotechnology, 26(8):889–896, August 2008.
Alvis Brazma, Pascal Hingamp, John Quackenbush, Gavin Sherlock, Paul Spellman, Chris Stoeckert, John Aach, Wilhelm Ansorge, Catherine A. Ball, Helen C. Causton, Terry Gaasterland, Patrick Glenisson, Frank C. P. Holstege, Irene F. Kim, Victor Markowitz, John C. Matese, Helen Parkinson, Alan Robinson, Ugis Sarkans, Steffen Schulze-Kremer, Jason Stewart, Ronald Taylor, Jaak Vilo, and Martin Vingron. Minimum information about a microarray experiment (MIAME)—toward standards for microarray data. Nature Genetics, 29(4):365–371, December 2001.
Frank Gibson, Paul G. Overton, Tom V. Smulders, Simon R. Schultz, Stephen J. Eglen, Colin D. Ingram, Stefano Panzeri, Phil Bream, Evelyne Sernagor, Mark Cunningham, Christopher Adams, Christoph Echtermeyer, Jennifer Simonotto, Marcus Kaiser, Daniel C. Swan, Marty Fletcher, and Phillip Lord. Minimum Information about a Neuroscience Investigation (MINI) Electrophysiology : Nature Precedings. Nature Precedings, March 2008.
Doug Howe, Maria Costanzo, Petra Fey, Takashi Gojobori, Linda Hannick, Winston Hide, David P. Hill, Renate Kania, Mary Schaeffer, Susan St Pierre, Simon Twigger, Owen White, and Seung Yon Y. Rhee. Big data: The future of biocuration. Nature, 455(7209):47–50, September 2008.
Ron Edgar and Tanya Barrett. NCBI GEO standards and services for microarray data. Nature Biotechnology, 24(12):1471–1472, December 2006.
Juan Antonio A. Vizca\'{\i }no, Richard Côté, Florian Reisinger, Joseph M. Foster, Michael Mueller, Jonathan Rameseder, Henning Hermjakob, and Lennart Martens. A guide to the Proteomics Identifications Database proteomics data repository. Proteomics, 9(18):4276–4283, September 2009.
B. Aranda, P. Achuthan, Y. Alam-Faruque, I. Armean, A. Bridge, C. Derow, M. Feuermann, A. T. Ghanbarian, S. Kerrien, J. Khadake, J. Kerssemakers, C. Leroy, M. Menden, M. Michaut, L. Montecchi-Palazzi, S. N. Neuhauser, S. Orchard, V. Perreau, B. Roechert, K. van Eijk, and H. Hermjakob. The IntAct molecular interaction database in 2010. Nucleic Acids Research, 38(Database issue):D525–D531, October 2009.
Helen Parkinson, Ugis Sarkans, Nikolay Kolesnikov, Niran Abeygunawardena, Tony Burdett, Miroslaw Dylag, Ibrahim Emam, Anna Farne, Emma Hastings, Ele Holloway, Natalja Kurbatova, Margus Lukk, James Malone, Roby Mani, Ekaterina Pilicheva, Gabriella Rustici, Anjan Sharma, Eleanor Williams, Tomasz Adamusiak, Marco Brandizi, Nataliya Sklyar, and Alvis Brazma. ArrayExpress update—an archive of microarray and high-throughput sequencing-based functional genomics experiments. Nucleic Acids Research, 39(suppl 1):D1002–D1004, January 2011.
Philippe Rocca-Serra, Marco Brandizi, Eamonn Maguire, Nataliya Sklyar, Chris Taylor, Kimberly Begley, Dawn Field, Stephen Harris, Winston Hide, Oliver Hofmann, Steffen Neumann, Peter Sterk, Weida Tong, and Susanna-Assunta Sansone. ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level. Bioinformatics, 26(18):2354–2356, September 2010.
Andrew R. Jones, Michael Miller, Ruedi Aebersold, Rolf Apweiler, Catherine A. Ball, Alvis Brazma, James DeGreef, Nigel Hardy, Henning Hermjakob, Simon J. Hubbard, Peter Hussey, Mark Igra, Helen Jenkins, Randall K. Julian, Kent Laursen, Stephen G. Oliver, Norman W. Paton, Susanna-Assunta Sansone, Ugis Sarkans, Christian J. Stoeckert, Chris F. Taylor, Patricia L. Whetzel, Joseph A. White, Paul Spellman, and Angel Pizarro. The Functional Genomics Experiment model (FuGE): an extensible framework for standards in functional genomics. Nature Biotechnology, 25(10):1127–1133, October 2007.
Philippe Rocca-Serra, Susanna Sansone, Andy Jones, Allyson Lister, Frank Gibson, Ryan Brinkman, Josef Spindlen, and Michael Miller. XSL transformations for FuGE and FuGE extension documents for HTML and tab-delimited rendering., June 2008.
Barry Smith, Michael Ashburner, Cornelius Rosse, Jonathan Bard, William Bug, Werner Ceusters, Louis J. Goldberg, Karen Eilbeck, Amelia Ireland, Christopher J. Mungall, Neocles Leontis, Philippe Rocca-Serra, Alan Ruttenberg, Susanna-Assunta Sansone, Richard H. Scheuermann, Nigam Shah, Patricia L. Whetzel, and Suzanna Lewis. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature Biotechnology, 25(11):1251–1255, November 2007.
Fran Berman, Geoffrey Fox, and Anthony J. G. Hey. Grid Computing: Making the Global Infrastructure a Reality. John Wiley & Sons, Inc., New York, NY, USA, 2003.
Judith A. Blake and Carol J. Bult. Beyond the data deluge: data integration and bio-ontologies. Journal of biomedical informatics, 39(3):314–320, June 2006.
Meetings & Conferences Software and Tools Standards

SyMBA Demo causes pondering: how should a bioinformatician choose their output format(s)?

BBSRC Systems Biology Grantholder Workshop, University of Nottingham, 16 December 2008.

SyMBA Demo. The lunch hour was also the demo hour. People came to visit me at the SyMBA demo desk for the whole hour, and we had some interesting conversations. There is one particular question I would like to relate from that hour: what should a bioinformatician choose as an output/export format for multi-omics data? This post relates my thoughts about this challenge. It's not meant to be comprehensive: just some ramblings.

I solve this challenge in SyMBA by storing everything as FuGE objects, which can be exported to FuGE-ML. FuGE-ML can be converted into ISA-TAB and into an html format that mimics ISA-TAB using an XSLT. Therefore, because of this interlink between FuGE and ISA-TAB, you can leverage two complementary formats.

But to a bioinformatician who has just been tasked with building an application (and generally on a short time-scale), how do they choose what export format to use, e.g. FuGE or ISA-TAB? There are considerations of:

  • scale: lightweight or heavyweight implementation. A lightweight implementation might favor your own version of ISACreator and the use of ISA-TAB, or a FuGE-based archive (but not a full-blown LIMS) like SyMBA. A heavyweight solution might be a full LIMS such as PIMS, or another FuGE implementation in development called SysFusion.
  • intent: what is the purpose of storing this data? Is it for later analysis? For later deposition to a public database, e.g. at the EBI? Is it archiving? Is it a combination of these things? Your intent will shape what type of application you build, and what formats you focus your effort on. If your intent is storage only, choose whatever is most convenient for your users. However, these days there is always some aspect of data sharing or publishing. If you need further analysis of the data, then you probably want to be able to produce a computationally-friendly format such as XML. If your intent is submission to public databases, you need to ensure you export in a format they import.

Unfortunately, what this means is that the decision depends on the circumstances. FuGE and ISA-TAB are linked, and so you really get two for the price of one with those. I see this sort of thing as a positive – you have a choice as to the representation, storage and export of your data – a choice of formats! And many, like FuGE and ISA-TAB, are going to be easily convertable. The choice depends on your needs, but there is one easy choice: use something that's already been developed – don't reinvent the wheel!

Anyone else have any further suggestions?

Read and post comments |
Send to a friend



SysMO-DB and Carole Goble, BBSRC Systems Biology Workshop

BBSRC Systems Biology Grantholder Workshop, University of Nottingham, 16 December 2008.

Systems Biology of Microorganisms. 11 projects from 91 institutes, whose aim is to record and describe the dynamic molecular processes occurring in microorganisms in a comprehensive way. These projects have no one concept of experimentation or modelling, which makes it tough for information exchange. Further, there are issues of people having their own solutions, suspicions (about sharing data, for instance), data issues (many don't have data or don't store it in a standard way) and resource issues (no extra resources). SysMO-DB started in July 2008, and is a 3 year funded effort (3+3 people in 3 teams over 3 sites). Provide a web-based solution to exchange, search, and disseminate data. Need to retrofit data access, model handling and data integration platform. Because of the large number of groups and projects, they are going to aim for low-hanging fruit and early wins: be realistic, not reinvent, sustainable, and encourage standards adoption.

Just like at CISBAN, where we have implemented a web-based data integration, storage, exchange, and dissemination platform in a standards-compliant way (SyMBA), they have three users: experimentalists, bioinformaticians, and modellers. They're lucky, though, in that they have 6 people to develop SysMO-DB, when CISBAN only has 1. 🙂 And, as with CISBAN and many other data integration efforts, much of the work is social: that is, encouraging those three user types to collaborate and understand each other's work. The social solutions include questionnairs, "PALS" (postdocs and phd students), and Audits and sharing of methods, data, models. They discuss things like what people need or don't need from MIAME. (Personal opinion and question: MIAME is intended as a minimal information checklist. What kind of things, then, don't they need? And would it be worth taking this information back to the MIAME people to possibly modify the guidelines if some aspects of it aren't truly minimal? End personal questions.)

Discovery is done via SysMo-SEEK. How to catalog the metadata, and then have mechanisms for accessing the data from locations other than the host site? There is a single search point over "yellow pages" and assets catalogue. They store metadata on results, not the results themselves (again, just like SyMBA, which stores the metadata in a database, and the results in a remote file store). They use myExperiment for both linking the people and the assets. For models, they're using a local installation of JWS online, which is a database of curated models and a model simulator. There is also some links to semantic SBML from the TRANSLUCENT project.

There are two kinds of processes to store. The first is experimental processes, e.g. SOPs and protocols. They use the Nature protocols format, with the addition of high-level classification through tags. (Personal note: What is the underlying format for storing protocols?) The second type of process is Bioinformatics processes, which are stored as workflows. (Question: Why don't you store protocols as workflows? They can be chained in the same way.)  Taverna is used for this work. One bit of work was using libSBML inside taverna for collaborative model development (Peter Li et al). Another automated (definition of automated in this context?) workflow goes from microarray to pathways and published abstracts. Their consortium wants to exchange information from public data sources, SysMO itself, and excel spreadsheets.

(Another personal aside. FuGE (object model for experimental metadata) and ISA-TAB (tabular format, e.g. spreadsheets) are becoming interchangeable – work is going on between FuGE and ISA-TAB people right now – most recent workshop was last week. This is important, as it was mentioned that bioinformaticians have to deal with spreadsheets (which is true enough!). So, you get the best of both worlds with FuGE / ISA-TAB, without having to define yet another schema. A personal question would be: Why build these various metadata schemas and parsers for spreadsheets (e.g. whatever is used for the Assets catalogue and JERM parsing of spreadsheets) rather than use pre-existing models such as FuGE and formats such as ISA-TAB? Using the FuGE object model does not mean that you have to use all aspects of it – you can just take what you need.Perhaps it was due to the maturity of ISA-TAB at the time the project started, though the specification is now in version 1.0. Will SysMO-DB export and import these formats? There was no time for questions at the end of the talk, so I will try to find out during the lunch period. End aside.)

Trying to map to the relevant MIBBI standard. There is a nice feature that reads spreadsheets from specific locations and automatically loads them into the Assets catalogue. (You can still load them directly into that catalogue.) They are performing a 4-site JERM exchange pilot scheme in Spring 2009.

Great talk – thanks 🙂

These are just my notes and are not guaranteed to be correct. Please feel free to let me know about any errors, which are all my fault and not the fault of the speaker. 🙂

Read and post comments |
Send to a friend


CISBAN Software and Tools Standards

SyMBA (Multi-omics (meta)data archive) on SourceForge!

We have just opened up the Systems and Molecular Biology Data and
Metadata Archive (SyMBA, formerly known as the CISBAN DPI) to the
community under the terms of the GNU LGPL. Its new home is SourceForge,
and there is a subversion repository, installation instructions,
mailing list, issue tracker etc available.

The main URL is:

The Project Page on SF (where you can get to screenshots, subversion browsing etc) is here:

changed the name of the project to reflect the wider diversity of
developers now contributing to the project. I
will be sending information, announcements, and answers to questions on
the symba developers mailing list, which everyone can subscribe to:

If you wish to subscribe, please go here:

We'd be happy to have additional developers on the project,
and if there is any feature or bug you'd like to report, please use our
issue trackers:

you'd like to take a more hands-on approach, then please email me your
sourceforge user id, and I'll add you as a developer on the project.

SyMBA was initially developed (and is still mainly developed) by the Integrative Bioinformatics Group, headed by Neil Wipat and part of CISBAN. Many thanks to all who have helped, via code or comment, and also to the current SourceForge Developers listed below:

Allyson Lister (CISBAN)
Olly Shaw (CISBAN)
Dan Swan (Newcastle Bioinformatics Support Unit)
Frank Gibson (CARMEN Neuroscience Project)

SyMBA is also being evaluated by other members of the CARMEN project and by CSBE.

sandbox to play around with SyMBA is now up again at after a major disk malfunction on the old server. I'll transfer
all old logins in the old system now having logins on the new one, and
if you'd like a login once the new server is up, please drop me a line.
In the meantime, please have a look around the new SourceForge site and
also the code, if you like! All comments and suggestions welcome.

Some general information:
The Centre for Integrated Systems Biology of Ageing and Nutrition has developed a data archive and
(SyMBA) based on Milestone 3 of the
Functional Genomics Experiment (FuGE)
Object Model (FuGE-OM), and which archives, stores, and retrieves raw high-throughput data. Until now,
published systems have successfully integrated multiple omics data types and information about
in a single database. SyMBA is the first published implementation of FuGE that includes a database
expert and standard interfaces, and a Life Science Identifier (LSID) Resolution and Assigning service to
identify objects and provide programmatic access to the database. Having a central data repository
deletion, loss, or accidental modification of primary data, while giving convenient access to the data
publication and analysis. It also provides a central location for storage of metadata for the
high-throughput data sets, and will facilitate subsequent data integration strategies.

Read and post comments |
Send to a friend


CISBAN Papers Software and Tools Standards

Newcastle University Technical Report: CISBAN DPI

A Technical Report for the School of Computing Science of Newcastle University was released last month describing the CISBAN DPI, an implementation of the FuGE Milestone 3 STK. You can find and download that technical report here:

The Abstract follows:

The Centre for Integrated Systems Biology of Ageing and Nutrition has
developed a Data Portal and Integrator (CISBAN DPI) that is based on
the FuGE Object Model and which archives, stores, and retrieves raw
high-throughput data. Until now, few published systems have
successfully integrated multiple omics data types and information about
experiments in a single database. The CISBAN DPI is the first published
implementation of FuGE that includes a database back-end, expert and
standard interfaces, and utilizes a Life Science Identifier (LSID)
Resolution and Assigning service to identify objects and provide
programmatic access to the database. Having a central data
repository prevents deletion, loss, or accidental modification of
primary data, while giving convenient access to the data for
publication and analysis. It also provides a central location for
storage of metadata for the high-throughput data sets, and will
facilitate subsequent data integration strategies.


Functional Genomics, High-Throughput
Experiments, FuGE, LSID, Experimental Workflows, Databases, Data
Standards, Data Sharing, Metadata, Data Integration.

CS-TR: 1016 Implementing the FuGE Object Model: a Systems Biology Data Portal and Integrator
Lister, A. L., Jones, A. R., Pocock, M., Shaw, O., Wipat, A.
School of Computing Science, Newcastle University, Apr 2007

Read and post comments |
Send to a friend