Chapter 2: SyMBA (Thesis)

SyMBA: a Web-based metadata and data archive for biological experiments

Introduction

With the development and adoption of high throughput experimentation techniques in the life sciences, vast quantities of heterogeneous data are being produced. Consequently, there is a need to integrate these diverse datasets to build a more system-level view of organisms under study. These datasets must be archived, curated with appropriate metadata, and made available in a suitable form for subsequent analysis and integration [1].

There is also an urgent need to provide tools and resources that allow the digital curation and management of scientific data in a fashion amenable to standardisation and capable of facilitating data sharing and knowledge discovery [2]. It is desirable that this data storage and annotation process is implemented as early as possible in the data management cycle to prevent data loss and ensure traceability throughout.

This chapter describes the development of SyMBA, a data and metadata archive for scientific data [3]. SyMBA addresses the need for data standardisation described by Paton [2] by implementing a method of experimental metadata integration for systems biology. SyMBA is an open source software system providing a mechanism to collate the content, syntax and semantics of scientific data, regardless of type. It can be used to review, share and download metadata and data associated with stored experiments through a simple Web interface. SyMBA accommodates both the representation of community-endorsed data standards and the development of bespoke representations to suit users’ digital curation needs.

Section 1 contains a summary of life-science data standards and how they are utilised by SyMBA. A detailed description of how SyMBA was implemented follows in Section 2, including initial requirements, architecture, data storage and retrieval techniques, and versioning methodology. Section 3 explains how SyMBA can be used, while Section 4 describes similar work and discusses the benefits of SyMBA as part of a larger life-science research life cycle.

Availability

A sandbox installation of SyMBA is freely available to test at http://symba.ncl.ac.uk. Researchers wishing to use SyMBA as a local data and metadata archive may evaluate it by downloading the system and customising it for their needs. The Web application is implemented in GWT (Note: http://code.google.com/webtoolkit/), with all major browsers supported. SyMBA is built with Apache Maven 2 (Note: http://maven.apache.org/), a software project management and build system similar to but more comprehensive than other build tools such as Apache Ant or GNU Make. SyMBA has a Sourceforge.net project website (Note: http://symba.sourceforge.net) and a Newcastle University project page (Note: http://cisban-silico.cs.ncl.ac.uk/). The Sourceforge.net project website provides access to the Subversion repository (Note: http://symba.svn.sourceforge.net/viewvc/symba/) as well as various explanatory documents for programmers (Note: http://symba.sourceforge.net/symba-books/general-information/index.html, http://symba.sourceforge.net/symba-books/installation/index.html).

SyMBA is licensed by Allyson Lister and Newcastle University under the LGPL (Note: http://www.gnu.org/copyleft/lesser.html). For more information, including licensing for third-party libraries used in the application, see LICENSE.txt in the SyMBA Subversion project. Installation and running of SyMBA have been tested on 32-bit Ubuntu 8.10 and higher. Development questions can be directed to symba-devel@lists.sourceforge.net.

1 Data Standards

To ensure maximum reuse of published data, Lynch [4] has stated that scientists should act as data stewards in three ways:

  • honour established disciplinary data standards;
  • record appropriate metadata to make future interpretation of the data possible; and
  • define metadata such as provenance and parameters, ideally at the time of data capture [4].

SyMBA makes use of a number of standards, emerging standards, and common formats to save time downstream of data generation by ensuring compatibility with other centres of research and journals complying with the same standards. SyMBA can help research groups follow all of the stewardship methods described above and can limit tedious and repetitive data entry. A brief overview of content, syntax and semantic data standards that are or can be integrated within SyMBA is provided in this section. A more detailed description of these standards is available in Section 1.4.

1.1 Data Content

The results and conclusions of scientific investigations are dependent upon the context, methods and data of the associated experiments. Defining a list of commonly required metadata to guide researchers in reporting the context of these experiments is increasingly gaining favour with data repositories, journals and funders [5]. These metadata reporting checklists outline the important or minimal information integral to the evaluation, interpretation and dissemination of an experiment; such guidelines effectively define what is contained within a scientific dataset and how the set was generated. The MIBBI Registry (Note: http://mibbi.org) provides a listing of these checklists. SyMBA is specifically designed to allow users to store and annotate their data according to any content standard such as those registered with MIBBI. Users can create and save standards templates for guidelines such as MIAME [6] and MINI electrophysiology [7]. Saved templates are then accessible to all users of SyMBA.

1.2 Syntax

Digital curation in the life sciences is predominately database centric [8]. While repositories have been developed for single datatype ‘omics’ experiments [9, 10, 11, 12], until recently the lack of suitable data standards has hampered the development of a data storage system capable of storing multiple data types in combination with a uniform set of experimental metadata.

Recently, two related projects have been created to address the need for a common syntax to describe experimental metadata: FuGE and ISA-TAB [13]. The FuGE project was formed with the aim of standardising the experimental metadata for a range of omics experiments. The FuGE standard contains a model of experimental objects such as samples, protocols, instruments, and software, and provides extension points for the creation of technology-specific data standards [14]. The FuGE project’s native format is UML, with the FuGE-ML XML export as its most commonly used format. The availability of FuGE makes the development of a generic data and metadata storage portal feasible not only to perform data capture, but also for the purpose of integrating metadata from a range of experimental datasets. Among others, the proteomics (Note: Proteomics Standards Initiative, http://www.psidev.info), metabolomics (Note: Metabolomics Standards Initiative, http://msi-workgroups.sourceforge.net), microarray (Note: FGED Society, http://www.mged.org) and flow cytometry (Note: Flow Informatics and Computational Cytometry Society, http://www.ficcs.org) standards groups have adopted the FuGE standard either wholly or for certain aspects of their formats. The structure of SyMBA is based on version 1.0 of FuGE-ML and accepts input and provides output in that format.

The ISA-TAB project is a standard for describing experiments using a tabular format, and is ideal for researchers primarily using spreadsheets. ISA-TAB has been developed alongside FuGE and can be converted into FuGE if required [15]. While spreadsheet formats are useful for data entry and record keeping on a small scale, an XML format—and the associated benefits such as XPath and the large library of tools for the XML format—was deemed more suitable for use within SyMBA.
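The XPath benefit mentioned above can be illustrated with a minimal sketch using only the Python standard library. The element and attribute names below are simplified placeholders invented for this example; they are not taken from the FuGE-ML schema, and the snippet is not part of SyMBA.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified XML. The element and attribute names are
# illustrative only and do NOT follow the real FuGE-ML schema.
DOC = """
<investigation name="Yeast stress response">
  <protocol identifier="p1" name="Heat shock">
    <parameter name="temperature" value="37" unit="degree Celsius"/>
  </protocol>
  <protocol identifier="p2" name="Microscopy">
    <parameter name="magnification" value="40"/>
  </protocol>
</investigation>
"""

root = ET.fromstring(DOC)

# ElementTree supports a subset of XPath: select protocols by attribute value.
heat_shock = root.findall(".//protocol[@name='Heat shock']")

# Select every parameter element that declares a unit attribute.
with_unit = root.findall(".//parameter[@unit]")

protocol_ids = [p.get("identifier") for p in heat_shock]
unit_names = [p.get("name") for p in with_unit]
```

The equivalent query over a spreadsheet export would typically need bespoke row and column handling, which is the kind of work a path language avoids.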

1.3 Semantics

Describing the meaning, or semantics, of data and its associated metadata can be a complex and difficult process currently being addressed in the life sciences through the use of controlled vocabularies or ontologies [16]. The use of ontologies for describing data has a twofold purpose. Firstly, ontologies help ensure consistent annotation, such as the spellings and choice of terms, such that a single word-definition pair is used to describe a single entity. Secondly, ontologies can add human- and computationally-amenable semantics to the data. Related datasets can be integrated and queried via a common ontology, for example linking experimental results with associated publications. In SyMBA 1.0, users could pre-load controlled vocabularies and portions of ontologies to limit the descriptor choices made available to users. However, it was a complex feature to implement and was not heavily used. In SyMBA 2.0 the functionality was modified such that term choices previously saved by users are offered as auto-complete suggestions in some text fields. When an interface better suited to the addition of subsets of ontologies and controlled vocabularies has been designed, this feature may be reintroduced.
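The auto-complete behaviour described for SyMBA 2.0 can be sketched as a case-insensitive prefix lookup over previously saved terms. This is an illustration under assumptions, not SyMBA's code: the class name `TermSuggester` and its methods are hypothetical.

```python
from bisect import bisect_left


class TermSuggester:
    """Offers previously saved terms as suggestions (hypothetical helper)."""

    def __init__(self):
        # Sorted list of (case-folded term, term as originally typed).
        self._terms = []

    def remember(self, term):
        """Record a term a user has saved; duplicates are ignored."""
        key = (term.casefold(), term)
        i = bisect_left(self._terms, key)
        if i == len(self._terms) or self._terms[i] != key:
            self._terms.insert(i, key)

    def suggest(self, prefix, limit=5):
        """Return up to `limit` saved terms starting with `prefix`."""
        folded = prefix.casefold()
        i = bisect_left(self._terms, (folded, ""))
        matches = []
        while (i < len(self._terms)
               and self._terms[i][0].startswith(folded)
               and len(matches) < limit):
            matches.append(self._terms[i][1])
            i += 1
        return matches
```

Because suggestions are drawn only from earlier entries, this approach encourages consistent spelling without requiring an ontology to be loaded, at the cost of not enforcing any vocabulary.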

2 Implementation

One of the major requirements of representing scientific data is to be able to store raw, unprocessed data together with information about the equipment, processes, people and conditions under which they were generated [2]. Data arises from a variety of sources and is generated by many different protocols and experimental procedures. Metadata, or information about how the data was generated, should also be captured to ensure that recommended minimal information standards are met. Further, there should be methods, whether directly provided from data producers or indirectly via public repositories, to export the data in standard formats. Users also need to be able to retrieve their data and track modifications to their metadata through the assignment of appropriate unique identifiers to ensure traceability and data provenance. Finally, and perhaps most importantly to users of a bioinformatics application, there is a need for an effective user interface. The rest of this section describes the tools and techniques used to meet these requirements.

2.1 Toolkit capabilities

As shown in Figure 1, the SyMBA toolkit provides a number of features:

  • the choice of a FuGE-structured RDBMS or in-memory back-end;
  • an object-relational persistence and query layer using Hibernate (Note: http://www.hibernate.org);
  • a set of plain Java classes representing FuGE-ML entities which also connect to the database via hyperjaxb3 (Note: http://confluence.highsource.org/display/HJ3/Home);
  • unit tests of all queries and version retrieval functionality; and
  • a user-friendly Web interface that interacts with the chosen back-end and a remote file store for raw data output from experimental assays.

Figure 1: An architectural overview of SyMBA. Blue shading signifies hand-written code, while unshaded areas designate automatically-generated code. For the back-end, automatically-generated code is created with hyperjaxb3 (shown with dashed arrows). GWT automatically converts the front-end Java code into a compiled Web archive. Circles mark agent access points; solid arrows indicate direction of data flow.

FuGE provides all of the structures required to adequately describe an experimental protocol, including data, analyses, and information about the structure of the experiment. At a basic level, information in FuGE-ML may be stored simply as a collection of XML documents or as records in an XML database. However, the use of an RDBMS allows the storage of metadata in a format amenable to searching via technology that is trusted, scalable and familiar to the bioinformatics community. SyMBA uses hyperjaxb3 to create a Hibernate persistence layer and database as well as an automatically-generated FuGE Java API based on the FuGE XSD. Hibernate is a service that can be used with any compatible RDBMS and which allows the creation of an object/relational database persistence and query layer, thus abstracting the low-level database code away from the programmer interface.

In addition to the database back-end, SyMBA also implements a memory-only back-end, which facilitates the testing and initial set-up of SyMBA without needing to first install a database. The in-memory version of SyMBA has all of the capabilities of the database implementation except the ability to store the entered metadata beyond the lifetime of the Web application itself. If an in-memory version of SyMBA is restarted, all data previously uploaded is deleted. This makes the memory implementation ideal for test or sandbox installations.
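The idea of two interchangeable back-ends can be sketched as a single storage interface with a memory-only and a database-backed implementation. The sketch below is an assumption-laden illustration: SyMBA itself uses Java with Hibernate, whereas here Python's built-in `sqlite3` module stands in for the RDBMS, and the names `MetadataStore`, `InMemoryStore`, `SqliteStore` and `open_store` are hypothetical.

```python
import sqlite3
from abc import ABC, abstractmethod


class MetadataStore(ABC):
    """Hypothetical storage interface shared by both back-ends."""

    @abstractmethod
    def save(self, identifier, document):
        ...

    @abstractmethod
    def load(self, identifier):
        ...


class InMemoryStore(MetadataStore):
    """Keeps records in a dict; everything is lost when the process ends."""

    def __init__(self):
        self._records = {}

    def save(self, identifier, document):
        self._records[identifier] = document

    def load(self, identifier):
        return self._records.get(identifier)


class SqliteStore(MetadataStore):
    """Persists records to a database file (sqlite3 as a stand-in RDBMS)."""

    def __init__(self, path):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS metadata "
            "(identifier TEXT PRIMARY KEY, document TEXT NOT NULL)"
        )

    def save(self, identifier, document):
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT OR REPLACE INTO metadata VALUES (?, ?)",
                (identifier, document),
            )

    def load(self, identifier):
        row = self._conn.execute(
            "SELECT document FROM metadata WHERE identifier = ?",
            (identifier,),
        ).fetchone()
        return row[0] if row else None

    def close(self):
        self._conn.close()


def open_store(path=None):
    """Choose the back-end: a database file if given, otherwise memory."""
    return SqliteStore(path) if path else InMemoryStore()
```

Code written against the shared interface does not need to know which back-end is active, which is what makes a database-free sandbox installation possible.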

2.2 Architecture

On the client side, SyMBA is a simple Web application for describing experiments and storing experimental data. Behind the user interface, SyMBA is implemented in GWT and hosted on a Tomcat (Note: http://tomcat.apache.org/) server. GWT was chosen because it abstracts away the details of the user interface and allows the programmer to create a working Web application quickly and without needing to know about differences between Web browsers. Figure 1 shows an architectural overview of SyMBA and how GWT relates to the application as a whole, while Figure 2 shows how GWT can be used to create a Web application.

Figure 2: A simplified overview of how Google Web Toolkit creates a Web application. Hosted mode is a development environment for Google Web Toolkit applications, which allows more detailed debugging than is available when running the application using a standard Web server.

With GWT, the programmer writes using standard Java. The Java code is separated into client-side and server-side classes. During testing, all of the code is compiled using a Java compiler and run in hosted mode, a development environment for GWT where a dedicated browser runs the Web application and any errors in the code are reported with standard Java exceptions. In this way, hosted mode makes debugging the application easier. When ready for production, server-side classes are compiled with a Java compiler and the client-side classes are converted into JavaScript using the GWT compiler. This results in JavaScript created specifically for each of the common browser types. This creation of multiple versions of the application allows the programmer to ignore the details of cross-browser compatibility. At the end of the production compilation, a Web archive is produced which can be deployed on an appropriate server such as Tomcat.

SyMBA 1.0 made use of Java Server Pages, which have the benefit of being simple to learn and quick to implement. However, much of the processing code and display code was mixed within the pages, and the application as a whole was not highly configurable. Templates for new experimental methods had to be created by hand, or through a convoluted method of building template FuGE-ML files which would be converted on-the-fly by the Web front end. SyMBA 2.0 allows all users to generate templates via the Web interface rather than by hand-coding XML or Java Server Pages. The new template generation functionality opens up the use of SyMBA templates to a larger community of biologists rather than just bioinformaticians.

2.3 Identifiers

All objects in FuGE-ML inherit from the Describable complex type. This entity allows basic information such as audit trails, additional ontology terms and textual descriptions to be added to all objects. Not all FuGE objects require unique identification, and as such entities that inherit only from Describable have no requirement for identifiers. Most objects in FuGE which describe entities rather than collections of those entities also extend Identifiable, which adds a unique identifier and human-readable name to objects of this type.

SyMBA takes this level of identification one step further, and ensures that all identifiers created within SyMBA are based on LSIDs and therefore globally unique. To facilitate versioning (described further in Section 2.4), SyMBA introduced the abstract Endurant entity as a new child of Describable and sibling of Identifiable, as shown in Figure 3. Both Endurant and Identifiable objects are identified within SyMBA using LSIDs.

Figure 3: The relationship of the SyMBA-specific Endurant element to existing FuGE structures. Abstract entities are represented as rectangles, and concrete entities as circles. The Endurant element and its children (shown in yellow) were added to the FuGE XSD to provide all Identifiable elements with a mechanism to group versions together. The Describable and Identifiable elements are shown in blue and are key elements of the original FuGE XSD. Both Endurant and Identifiable entities are children of Describable, and both are identified in SyMBA with an LSID.

In order to describe how the identifiers are created within SyMBA, the LSID specification must be examined more closely. LSIDs contain both metadata and data. LSID metadata is information about the LSID itself, while LSID data is the information of interest associated with that LSID. An important requirement of LSIDs is that the returned data for a particular LSID must always be identical to the data retrieved in past queries of that LSID. Unlike its data, an LSID’s metadata can change at any time and is not required to always return the same object. LSID metadata includes an expiration timestamp which, if null, means the LSID does not expire, as well as any other information that the supplier of the LSID wishes to provide. In SyMBA, the timestamp will always be null, and the provided metadata is an RDF-formatted object with a title and a value. The metadata value is the LSID of the most recent Identifiable object in that version group. More information on these methods is available in Section 3.6 and Table 1.
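As an illustration of the identifier structure discussed above, the following sketch parses and builds LSIDs of the general form urn:lsid:authority:namespace:object with an optional trailing revision, and assembles a simple metadata record. The authority and namespace values, the function names, and the use of a plain dictionary in place of an RDF document are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Lsid:
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None

    def __str__(self):
        parts = ["urn", "lsid", self.authority, self.namespace, self.object_id]
        if self.revision is not None:
            parts.append(self.revision)
        return ":".join(parts)


def parse_lsid(text):
    """Split an LSID string into its components, rejecting other input."""
    parts = text.split(":")
    if (len(parts) not in (5, 6)
            or parts[0].lower() != "urn"
            or parts[1].lower() != "lsid"):
        raise ValueError(f"not an LSID: {text!r}")
    if any(part == "" for part in parts[2:5]):
        raise ValueError(f"LSID has an empty component: {text!r}")
    return Lsid(*parts[2:])


def describe(lsid, latest_in_group):
    """Build a metadata record (a plain dict here, standing in for RDF).

    Unlike LSID data, metadata may change over time; here it points at
    the most recent version in the same version group.
    """
    return {
        "lsid": str(lsid),
        "expires": None,  # a null timestamp means the LSID never expires
        "title": "latest version",  # hypothetical title text
        "value": str(latest_in_group),
    }
```

For example, `parse_lsid("urn:lsid:example.org:Protocol:abc123")` yields an `Lsid` whose `authority` is `example.org`, and converting it back with `str()` reproduces the input string.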

The use of the terms data and metadata within the LSID specification should not be confused with their definitions within the FuGE standard. FuGE is a format standard for describing the metadata associated with an investigation. Some raw data can be defined within the FuGE structure, although in SyMBA associated data files are always stored separately from the FuGE metadata.

Over the past five years there has been a large amount of controversy on whether LSIDs are sustainable in the long term, or if HTTP URIs are a better solution (Note: http://lists.w3.org/Archives/Public/www-tag/2006Jul/0041, http://i9606.blogspot.com/2007/06/main-problem-with-lsids.html, http://iphylo.blogspot.com/2007/07/lsid-wars.html). Recently, a number of high-profile biology standards such as MIRIAM have developed non-LSID schemes such as identifiers.org, which uses URIs to provide resolvable, permanent identifiers for many life sciences databases. As SyMBA is not intended to be a public large-scale warehouse of experimental data but rather an application for individual research groups to store their data locally and in a standard format, URIs based on the HTTP protocol are not required. Whether or not LSIDs become a heavily-used standard, as URNs they can be easily converted to any other identifier system which does gain widespread acceptance.

2.4 Versioning

The requirement for maintaining data provenance means that modification histories for all metadata must be recorded. The issue of versioning is therefore important. The core FuGE structure can record simple audit information but does not store histories and details of changes to experimental objects. To allow access to previous versions of objects, each versioned object has a stable and globally-unique identifier. Reflecting its more permanent nature, versioning is used when running the database implementation of SyMBA but not with the in-memory implementation.

As shown in Figure 3, the three main classes involved in versioning in SyMBA are Describable, Endurant and Identifiable. All three are abstract, and may not be directly instantiated. Endurant and its child classes were created to add versioning to the FuGE model. Endurant objects do not change their intrinsic value over time, and are identified with an LSID. It is useful to distinguish between a modification to an actual experimental object, such as the addition of a new camera to a robot, and a modification of the stored metadata, such as fixing a spelling mistake in the name of the camera (see Figure 4). However, while the SyMBA back end allows these two types of versioning, the front end will currently always attach the new Identifiable to the existing Endurant. The difference between the two types of versioning can be difficult for users to grasp and therefore at this time only one method has been added to the user interface.

Endurant objects point to one or more versions of non-enduring Identifiable objects and are unresolvable by the LSID Web services. This restriction prevents the LSID Authority from breaking the LSID specification, as the authority must not allow the same LSID to resolve to different objects. All other objects identified with an LSID are children of the Identifiable entity, and identify a particular version of the object associated with the Endurant. As such, Identifiable LSIDs will always point to the same, single, version of the object and are resolvable.

The Endurant classes allow every Identifiable in FuGE to represent a single revision in the history of the Endurant: new objects are added to the database with every change, however minute. The state of an object at any point in its history is easily retrievable, as is the latest version of each object. Error-correction or addition of information, such as fixing a typo, will create an additional Identifiable associated with the same Endurant.

Figure 4: The implementation of Endurant objects within SyMBA. In a), a superficial change to the spelling of an equipment name creates a new Identifiable assigned to the existing Endurant. In b), modifications to the type of equipment used require the creation of a new Endurant as well as a new Identifiable object.

Both the addition of new objects and updates to existing objects have the same cost within SyMBA, as both actions result in the creation of a new object; no cascade event or propagation of the new object is required. Such a model suits an archival database because the work of finding the newest version of each object is deferred to data retrieval rather than incurred at data storage. The result is an application that is quick to update, but not as quick to retrieve.
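The append-only versioning model can be sketched as follows. This is a simplified illustration, not SyMBA's implementation: the class `VersionStore`, its method names and the `example.org` identifiers are hypothetical, and an in-memory list makes finding the latest version trivial, whereas a relational back-end would need a query at retrieval time.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class Identifiable:
    """One immutable version of an object, tied to its version group."""
    identifier: str
    endurant_id: str
    name: str


class VersionStore:
    """Hypothetical append-only store: every change adds a new version."""

    def __init__(self):
        self._versions = {}  # endurant id -> versions, oldest first
        self._by_id = {}     # version identifier -> Identifiable
        self._counter = itertools.count(1)

    def _new_id(self, kind):
        # Illustrative identifiers only; the authority is made up.
        return f"urn:lsid:example.org:{kind}:{next(self._counter)}"

    def create(self, name):
        """Start a new version group and store its first version."""
        endurant_id = self._new_id("Endurant")
        self._versions[endurant_id] = []
        return self.revise(endurant_id, name)

    def revise(self, endurant_id, name):
        """Store a new version; earlier versions are never modified."""
        version = Identifiable(self._new_id("Identifiable"), endurant_id, name)
        self._versions[endurant_id].append(version)
        self._by_id[version.identifier] = version
        return version

    def latest(self, endurant_id):
        return self._versions[endurant_id][-1]

    def history(self, endurant_id):
        return list(self._versions[endurant_id])

    def resolve(self, identifier):
        """Resolve version identifiers only; Endurant ids return None,
        so one identifier can never resolve to different objects."""
        return self._by_id.get(identifier)
```

Correcting a misspelt name with `revise` leaves the original version retrievable through `history`, while `latest` returns the corrected one.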

2.5 Investigations and Experiments

SyMBA uses a specific terminology for describing experiments based on the FuGE standard. An investigation is a high-level concept which contains hypotheses and conclusions as well as a set of one or more experiments. As such, every experiment in SyMBA must be contained within a top-level investigation. Experiments can be arranged in an ordered hierarchy of steps and may have data files associated with them (see Figure 8 for an example).
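The containment described above, an investigation holding an ordered hierarchy of experimental steps with optional data files, can be sketched with two small data classes. The class and field names are assumptions made for illustration and do not mirror the FuGE object model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """An experiment or sub-step; order within the list is significant."""
    name: str
    data_files: List[str] = field(default_factory=list)
    substeps: List["Step"] = field(default_factory=list)


@dataclass
class Investigation:
    """Top-level container; every experiment must live inside one."""
    name: str
    hypothesis: str = ""
    conclusion: str = ""
    experiments: List[Step] = field(default_factory=list)


def outline(investigation):
    """Return the hierarchy as numbered, indented lines of text."""
    lines = [investigation.name]

    def walk(steps, prefix, depth):
        for number, step in enumerate(steps, start=1):
            label = f"{prefix}{number}"
            files = f" [{len(step.data_files)} file(s)]" if step.data_files else ""
            lines.append(f"{'  ' * depth}{label} {step.name}{files}")
            walk(step.substeps, label + ".", depth + 1)

    walk(investigation.experiments, "", 1)
    return lines
```

Calling `outline` on an investigation with nested steps produces labels such as 1, 1.1 and 1.2, reflecting the ordering of steps within their parent.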

3 Data and metadata storage and retrieval

Traditionally, laboratories stored raw data and metadata in disparate, interconnected resources such as spreadsheets, file stores, local databases and lab books. While these are useful tools for a conventional approach to biological research, they do not meet the needs of an environment where the generation of high throughput datasets is commonplace. Further downstream analysis of the raw data requires data to be adequately annotated and available in a standard format. In many cases, the researcher carrying out the analysis will not be the same person who generated the data. SyMBA has been developed to meet the need for a centralised raw data repository that collates and stores data and captures sufficient metadata in an integrated fashion to facilitate downstream analysis, export and publication. This section describes how SyMBA is utilised by individual users to store metadata, create templates and generate FuGE-ML for download.

3.1 Welcome and creating an account

The main entry point for end users is a Web-based graphical user interface that displays a simplified view of the underlying data structure and allows upload of data and description of experimental metadata. The welcome screen is shown in Figure 5. There are three main sections to the SyMBA Web application which are named hereafter using compass directions for their positions on the page. The header along the top of the page, or north panel, contains the SyMBA logo and name on the left, a variety of icons on the right, and the name of the user on the far right. The SyMBA logo will always take the user back to this welcome page. Named from left to right, the four icons on the right-hand side of the header are shortcuts to add a new experiment, view existing experiments, get data files or metadata FuGE-ML for download, and browse the help topics. If the user is not yet logged in, only the links to the welcome and help pages are active.

Figure 5: The welcome screen for the SyMBA Web application.

To the west is the main SyMBA usage panel. Initially, the west panel contains a short summary of SyMBA and a login form. Once the user begins work within SyMBA, the west panel displays summaries of experiments, the experimental metadata and much more. In the east is an optional status panel providing both contextual help and status messages. The status panel can be hidden at any time by clicking on “Hide this panel”.

If SyMBA is being used in a sandbox or test installation, the user may prefer to log in with a pre-prepared guest account from the default installation of the application. A guest logs in by choosing “Guest Account” from the pull-down menu of users in the west panel, and clicking Login. Otherwise, a new user may be created by clicking on “add new user and log in” directly to the right of the login button. The user then fills out the contact form that appears and clicks Save Contact. The contact is saved and the user is automatically logged in. A named account, rather than a guest account, allows the stored metadata to be identified with a particular user.

3.2 Add and view buttons

As described earlier, the right side of the north panel contains a number of quick link buttons. Clicking on the add button results in the appearance of a pop-up window for creating new investigations as shown in Figure 6. This pop-up presents the user with three options. Choosing “Add New” will take the user directly to an empty investigation. The second option is a pull-down menu of existing investigations. Selecting one of these and pressing the Copy button will take the user to the standard investigation form pre-populated with a copy of the selected investigation. Finally, the “View Investigations” link allows the user to opt out of the add menu and instead view a list of existing investigations. This link works in a way identical to the view button from the north panel.

Figure 6: The pop-up window which appears when the add button from the north panel of SyMBA is pressed.

The view button present in the north panel takes the user to a summary of all investigations stored within SyMBA. Figure 7 shows an example summary page containing template, completed and standard investigations. Standard, modifiable investigations are shown in plain text, while read-only investigations are displayed in italics. Further, if an investigation is either completed or a template, this information is displayed in front of the investigation name. Both template and completed investigations are described in detail in the next section.

As explained in the contextual help section of the east panel, the user may either click on an investigation name to view its details or click on the radio button next to its name and then either Copy or Delete the investigation. SyMBA can be modified to disable or completely remove the Delete button.

Figure 7: The summary of investigations stored within SyMBA shown when the view button from the north panel is pressed.

3.3 Templates and completed investigations

If a user wishes to save a series of experimental steps for anyone to copy and use, a template can be created. There are a number of benefits to the use of templates.

  • Content standards. SyMBA templates can be built to community standards such as those present in the MIBBI registry.
  • Conformance. If there are standard protocols for certain types of experiments within a research group, each protocol can get its own template in SyMBA. As each protocol is realised in a particular experiment, users can copy the template and fill in specific experimental details.
  • Reuse. Templates make it easier to add information to experiments which are repeated multiple times. Rather than creating a completely new experiment description and introducing potential errors due to mistakes and missing information, a template, for instance one based on a completed description, can be used as many times as required to ensure the same information is filled in each time.

A template can either be begun with a blank investigation or based on an existing investigation. Once a set of experimental steps has been created, saving the investigation as a template simply requires the selection of the “Set As Template” checkbox, as shown in Figure 8. If an existing investigation is to be converted to a template, a copy should be made of the investigation first. As the conversion removes all links to files and marks the investigation as read-only, the copy, rather than the original, should be the one converted to a template. All information contained within each experimental step other than the file links will be retained in the template. The retained information includes parameters, descriptions, names, input materials and output materials. For parameters, the name of the parameter and the relationship the parameter name has with its value will be marked as read-only. Therefore, when a user working with an investigation based on that template wishes to fill in a parameter, they must retain the name and relationship of each templated parameter already provided. More information on parameters is available in the next section.
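The conversion rules described above (work on a copy, drop file links, mark the result read-only, and lock parameter names and relationships) can be sketched as follows. The data classes and the functions `to_template` and `instantiate` are hypothetical simplifications written for this illustration, not SyMBA's code.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Parameter:
    name: str
    relationship: str
    value: Optional[str] = None
    name_locked: bool = False  # True once name and relationship are fixed


@dataclass
class Step:
    name: str
    parameters: List[Parameter] = field(default_factory=list)
    file_links: List[str] = field(default_factory=list)


@dataclass
class Investigation:
    name: str
    steps: List[Step] = field(default_factory=list)
    is_template: bool = False
    read_only: bool = False


def to_template(original):
    """Return a template built from a deep copy; the original is untouched."""
    template = copy.deepcopy(original)
    for step in template.steps:
        step.file_links.clear()            # templates keep no file links
        for parameter in step.parameters:
            parameter.name_locked = True   # name and relationship are fixed
    template.is_template = True
    template.read_only = True
    return template


def instantiate(template, name):
    """Copy a template into a new, editable investigation."""
    working = copy.deepcopy(template)
    working.name = name
    working.is_template = False
    working.read_only = False
    return working
```

Because `to_template` deep-copies its argument, the file links of the source investigation survive the conversion, which is the reason for converting a copy rather than the original.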

If the description of an investigation within SyMBA has been completed and all parameters and associated metadata have been filled in, the investigation can be frozen, which marks it as completed and prevents further modification. Freezing an investigation is achieved by selecting the “Freeze” checkbox visible in Figure 8. Italics are used to mark read-only investigations in the summary of experiments as shown in Figure 7.

Figure 8: The “View Existing Investigation” screen in SyMBA. This investigation has had some experimental steps and parameters added in preparation for saving it as a template. The user has selected the “Set As Template” checkbox. Clicking on Save or Save and Finish will result in this investigation being saved as a new template within SyMBA.

3.4 Storing experimental metadata and data

When a user is ready to start entering metadata and uploading data files, an existing template can be copied or a brand new experiment description can be created as described in Section 3.2. The act of copying or creation brings the user to the detailed investigation view where SyMBA users can create a new experiment, fill in information about the experiment and upload associated raw data files.

Figure 8 shows an overview of the investigation detail page. Each experimental step can have zero or more parameters, input materials and output materials. Parameters are structured according to a statement of three required and one optional attributes, as shown in Figure 9. These attributes are structured as Name-Relationship-Value-Unit, with the Unit being optional. In templates the Value can also be missing, but in standard investigations the Value attribute is mandatory. At any time, parameter statements can be deleted. While the Name-Relationship-Value portion of the parameter statement is similar to the Subject-Predicate-Object of RDF triples, RDF is not used within SyMBA as all metadata can be captured within the existing FuGE structure.
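The Name-Relationship-Value-Unit statement and its rules (Unit always optional, Value optional only in templates) can be sketched as a small data class. The class name `ParameterStatement`, the `problems` method and the example terms are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParameterStatement:
    name: str
    relationship: str
    value: Optional[str] = None
    unit: Optional[str] = None

    def problems(self, in_template=False):
        """List rule violations; an empty list means the statement is valid."""
        found = []
        if not self.name.strip():
            found.append("name is required")
        if not self.relationship.strip():
            found.append("relationship is required")
        missing_value = self.value is None or not self.value.strip()
        if missing_value and not in_template:
            found.append("value is required outside templates")
        return found

    def __str__(self):
        # A missing value (allowed in templates) is shown as "?".
        parts = [self.name, self.relationship, self.value or "?"]
        if self.unit:
            parts.append(self.unit)
        return " ".join(parts)
```

A statement such as `ParameterStatement("temperature", "equals", "37", "degree Celsius")` passes validation in both contexts, while the same statement without a value is acceptable only when `in_template=True`.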

Figure 9: Pop-up window allowing user input of parameters and materials.

FuGE parameters have three types: number, boolean and string literal. These types can be manually specified by the user by selecting the appropriate radio button at the end of the parameter input line, as shown in Figure 9. However, these parameter types are relatively easy to programmatically ascertain, and once a Value is provided SyMBA will automatically select the radio button corresponding to its best guess. This selection can be overridden by the user at any time.
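
The text does not describe how SyMBA makes its best guess, so the following is only one plausible sketch of such a heuristic. The function name and the exact rules (case-insensitive "true"/"false" for booleans, anything Python's float() accepts for numbers) are assumptions.

```python
def guess_parameter_type(value: str) -> str:
    """Guess whether a Value is a 'boolean', 'number' or 'string' literal."""
    text = value.strip()
    if text.lower() in ("true", "false"):
        return "boolean"
    try:
        float(text)  # accepts integers, decimals and scientific notation
    except ValueError:
        return "string"
    return "number"
```

Because the guess is only a default, a user interface built on it would still need to let the user override the selection, as described above.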

Parameters are highly configurable and can be built to describe virtually any statements required by the user. Concentrations of reagents, types of equipment, taxonomic information and repeat numbers are just some of the vital pieces of experimental information which can be modelled with parameters. In theory, materials could also be stored as parameters, but adding materials separately allows sets of commonly-used materials to be created and reused across investigations.

The name and relationship attributes within parameters are suitable candidates for restriction using ontology terms. For instance, OBI could be used for equipment and protocol names, while RO could be used for some of the relationship attributes. Although this functionality was partly implemented in SyMBA version 1.0, it has not yet been added to version 2.0, as the best way to display this information has not yet been determined. Equipment is currently modelled using parameters; in future, it will be organised in the same way as materials are now.

3.5 Retrieving FuGE-ML

SyMBA uses and manipulates Java objects built from the FuGE XSD, converting upon request between those objects and FuGE-ML for download and batch upload. The get button is used to view and download the metadata for a given investigation in FuGE-ML. Figure 10 shows the pop-up menu which appears when the get button is pressed, while Figure 11 shows a partial screenshot of the resulting FuGE-ML.

Figure 10: The pop-up download menu which appears as a result of pressing the get button in the north panel of SyMBA.

Figure 11: Example FuGE-ML retrieved when an investigation is selected for retrieval in SyMBA.

3.6 SyMBA Web Services

Access to SyMBA is available via a Web interface and through the SyMBA Web services. Web services provide straightforward access to applications via the Web (Note: For more information see http://www.w3schools.com/webservices/ws_intro.asp). SyMBA makes use of Apache CXF (Note: http://cxf.apache.org/) via a Maven 2 plugin to package Web services into a Web archive, which is then loaded into a Tomcat server.

Currently, SyMBA has three services containing a total of four methods. All of these services relate to the LSIDs contained within SyMBA:

  • The LsidAssigner service contains one method, assignLsid(). This method creates a new LSID based on the default values within the application.
  • The LsidResolver service contains one method, getAvailableServices(). This method checks that the available Web services can deal with the provided LSID.
  • The LsidDataRetriever service provides the getMetadata() and getData() methods from the LSID specification. The former retrieves the LSID of the latest Identifiable object associated with the provided LSID. The latter returns the appropriate FuGE-ML or data file associated with the provided LSID; this may be a full document or an XML snippet, depending on the LSID. A RuntimeException is thrown if the LSID is not stored within SyMBA. LSIDs based on Endurants will also result in a RuntimeException, as there is no way to be sure which version of the corresponding Identifiable object has been requested.
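
Since the services above all operate on LSIDs, a small sketch of the identifier format may help. An LSID is a URN of the form urn:lsid:authority:namespace:object, with an optional trailing revision. The helper functions below keep LSIDs as plain strings; their names, and the default authority and namespace used by assign_lsid(), are hypothetical placeholders rather than SyMBA's actual defaults.

```python
import uuid
from typing import Optional, Tuple


def build_lsid(authority: str, namespace: str, object_id: str,
               revision: Optional[str] = None) -> str:
    """Assemble an LSID string from its components."""
    parts = ["urn", "lsid", authority, namespace, object_id]
    if revision is not None:
        parts.append(revision)
    return ":".join(parts)


def split_lsid(text: str) -> Tuple[str, str, str, Optional[str]]:
    """Split an LSID into (authority, namespace, object_id, revision)."""
    parts = text.split(":")
    prefix_ok = [p.lower() for p in parts[:2]] == ["urn", "lsid"]
    if not prefix_ok or len(parts) not in (5, 6) or not all(parts[2:]):
        raise ValueError(f"not a valid LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return parts[2], parts[3], parts[4], revision


def assign_lsid(authority: str = "example.org",
                namespace: str = "ExampleNamespace") -> str:
    """Create a new LSID from default values (placeholders in this sketch)."""
    return build_lsid(authority, namespace, uuid.uuid4().hex)
```

Using a random UUID for the object component is simply a convenient way to obtain unique identifiers here; the text does not say how SyMBA generates that component.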

The LSID Web services currently implement only those parts of the specification (Note: http://lsids.sourceforge.net/quick-links/lsid-spec/) in active use within SyMBA. In particular, these Web services implement LSIDs as Java strings rather than as complex objects. Table 1 shows how the LSID specification’s getMetadata() and getData() methods are handled. To retrieve the FuGE-ML metadata for a given LSID, call the getData() method of the LsidDataRetriever Web service with the LSID of the Identifiable object of interest. To retrieve a data file itself, pass getData() the LSID that points to the file of interest. Only those methods from the LSID specification that are useful within SyMBA have been implemented, as it is unclear how complete the long-term uptake of LSIDs will be and it is important to keep the SyMBA implementation as straightforward as possible.

Object type passed | getMetadata() | getData()
Endurant | LSID of latest Identifiable | empty array of bytes
Identifiable | LSID of latest version | FuGE-ML for that object
Data File | Identifiable LSID of latest holding investigation | original file + SyMBA identifier
Timestamped | LSID of latest version | FuGE-ML as at the stated time
Table 1: Each column describes a different LSID-specified method, and the rows display the behaviour of each method when that object type is passed as a parameter. An Endurant does not point to a specific object, and therefore will never return anything in the getData() call, though it will return the LSID of the latest Identifiable associated with it via getMetadata(). Any time-stamped LSID, whether Endurant or Identifiable, will always return the FuGE-ML that was present at the requested time.
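
The behaviour summarised in Table 1 can be sketched with a small in-memory store. This is an illustrative model only: the LsidStore class and its method names are hypothetical, Python's RuntimeError stands in for the Java RuntimeException, time-stamped LSIDs are omitted, and for data files only the original file bytes are returned. For Endurants the sketch follows Table 1 and returns an empty byte array from get_data().

```python
class LsidStore:
    """Toy model of getMetadata() and getData() as laid out in Table 1."""

    def __init__(self):
        self.versions = {}     # Endurant LSID -> Identifiable LSIDs, oldest first
        self.fuge_ml = {}      # Identifiable LSID -> FuGE-ML text
        self.endurant_of = {}  # Identifiable LSID -> Endurant LSID
        self.files = {}        # file LSID -> (bytes, Endurant LSID of holding investigation)

    def add_version(self, endurant, identifiable, xml):
        self.versions.setdefault(endurant, []).append(identifiable)
        self.fuge_ml[identifiable] = xml
        self.endurant_of[identifiable] = endurant

    def add_file(self, file_lsid, content, investigation_endurant):
        self.files[file_lsid] = (content, investigation_endurant)

    def get_metadata(self, lsid):
        """Return the LSID of the latest relevant Identifiable."""
        if lsid in self.versions:
            return self.versions[lsid][-1]
        if lsid in self.endurant_of:
            return self.versions[self.endurant_of[lsid]][-1]
        if lsid in self.files:
            return self.versions[self.files[lsid][1]][-1]
        raise RuntimeError(f"LSID not stored: {lsid}")

    def get_data(self, lsid):
        """Return FuGE-ML or file content as bytes."""
        if lsid in self.versions:
            return b""  # an Endurant does not point to a specific object
        if lsid in self.fuge_ml:
            return self.fuge_ml[lsid].encode("utf-8")
        if lsid in self.files:
            return self.files[lsid][0]
        raise RuntimeError(f"LSID not stored: {lsid}")
```

Keeping the version list per Endurant makes "latest" a simple last-element lookup, which is why every get_metadata() branch resolves to an Endurant first.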

4 Discussion

With the development and application of high-throughput methodologies and experimentation in the life sciences, researchers are seeing what is often referred to as the life-science data deluge [17, 18]. To cope with the data deluge and preserve and discover the knowledge contained within it, there needs to be support for both public centralised archiving, such as that provided by the EBI (Note: http://www.ebi.ac.uk) and the National Center for Biotechnology Information (Note: http://www.ncbi.nlm.nih.gov/), and smaller-scale local archiving such as that possible with a local SyMBA installation.

Data and metadata archiving that utilises content, syntax and semantic standards facilitates data longevity, provenance and interchange. SyMBA has been designed to meet this requirement for research groups and is capable of archiving numerous types of life-science data using FuGE, a community developed and endorsed data standard. SyMBA provides a straightforward capture method for essential experimental metadata and a safe location to store original, unmodified data. Metadata capture is centralised, enabling users to access and view their data in a uniform fashion. SyMBA is the first complete implementation of FuGE Version 1.0 from database back-end through to Web front-end, and includes a full library of code to manipulate and export/import information from FuGE-ML input files and the FuGE-ML-based database. Other applications such as the ISA Software Suite have been built around sister standards such as ISA-TAB.

Within CISBAN, SyMBA was the official archiving platform, and was used as one part of a larger data integration infrastructure. CISBAN has generated metadata and data from approximately 400 data files requiring 300 Gb of space, and covering high-throughput and large-scale transcriptomics, proteomics and time-lapse, multi-dimension, live-cell imaging experiments. SyMBA 1.0 was designed to be a required archival step in the data processing pipeline for all wet lab researchers within CISBAN. However, while this initial implementation had a back-end database and support services which were suitable for the task, users found the front end difficult to manage. In practice, each researcher needed the help of a bioinformatician to work through the front-end metadata deposition forms. This resulted in low usage of the services, and in requests for refinements and improvements to the front end. These changes resulted in SyMBA 2.0, which was primarily a change to the front-end user interface, although streamlining was also performed for non-user-facing parts of the architecture such as the persistence layer. However, the upgrade to SyMBA 2.0 was ultimately never put into production, as CISBAN reached the end of the lifetime of its grant. SyMBA remains a useful tool for interested research groups to download and install, allowing flexible metadata entry as well as storage and export of that metadata in a published, standardised format.

In the spirit of the community standards that SyMBA can accommodate, the SyMBA code base is open source. New users and developers are encouraged to interact with and contribute to the SyMBA system in an open and collaborative manner, to suit their specific needs. Wider use of, and contribution to, SyMBA will help solve standards-based data management issues in the life sciences. These features make SyMBA an ideal infrastructural component for enabling interoperability, integration and analysis across many different types of experimental data, and for supporting data management and knowledge discovery throughout the research life cycle. SyMBA aids data integration by providing a common format for multi-omics experimental metadata, minimising format challenges as described in Section 1.6.

SyMBA provides a browsable history of experimental data and metadata while allowing users to compare aspects of studies such as controls and treatments. The unified and simple interface for data capture makes it easier for researchers to archive and store important information about their experiments. Easy copying of the structure and metadata from one investigation to another improves repeatability of experiments. SyMBA also provides improved metadata organisation through the use of a nascent standard (FuGE), making preparations for publication and data interchange simpler. Additionally, SyMBA provides a computationally amenable starting point for data analysis pipelines via its ability to store the disparate data outputs of laboratories in a single repository with a single metadata format.

Bibliography

[1]
Alvis Brazma, Maria Krestyaninova, and Ugis Sarkans. Standards for systems biology. Nature Reviews Genetics, 7(8):593–605, August 2006.
[2]
Norman W. Paton. Managing and sharing experimental data: standards, tools and pitfalls. Biochemical Society Transactions, 36(1):33–36, February 2008.
[3]
A. L. Lister, A. R. Jones, M. Pocock, O. Shaw, and A. Wipat. CS-TR Number 1016: Implementing the FuGE Object Model: a Systems Biology Data Portal and Integrator. Technical report, Newcastle University, April 2007.
[4]
Clifford Lynch. Big data: How do your data grow? Nature, 455(7209):28–29, September 2008.
[5]
Chris F. Taylor, Dawn Field, Susanna-Assunta Sansone, Jan Aerts, Rolf Apweiler, Michael Ashburner, Catherine A. Ball, Pierre-Alain Binz, Molly Bogue, Tim Booth, Alvis Brazma, Ryan R. Brinkman, Adam Michael Clark, Eric W. Deutsch, Oliver Fiehn, Jennifer Fostel, Peter Ghazal, Frank Gibson, Tanya Gray, Graeme Grimes, John M. Hancock, Nigel W. Hardy, Henning Hermjakob, Randall K. Julian, Matthew Kane, Carsten Kettner, Christopher Kinsinger, Eugene Kolker, Martin Kuiper, Nicolas Le Novère, Jim Leebens-Mack, Suzanna E. Lewis, Phillip Lord, Ann-Marie Mallon, Nishanth Marthandan, Hiroshi Masuya, Ruth McNally, Alexander Mehrle, Norman Morrison, Sandra Orchard, John Quackenbush, James M. Reecy, Donald G. Robertson, Philippe Rocca-Serra, Henry Rodriguez, Heiko Rosenfelder, Javier Santoyo-Lopez, Richard H. Scheuermann, Daniel Schober, Barry Smith, Jason Snape, Christian J. Stoeckert, Keith Tipton, Peter Sterk, Andreas Untergasser, Jo Vandesompele, and Stefan Wiemann. Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nature Biotechnology, 26(8):889–896, August 2008.
[6]
Alvis Brazma, Pascal Hingamp, John Quackenbush, Gavin Sherlock, Paul Spellman, Chris Stoeckert, John Aach, Wilhelm Ansorge, Catherine A. Ball, Helen C. Causton, Terry Gaasterland, Patrick Glenisson, Frank C. P. Holstege, Irene F. Kim, Victor Markowitz, John C. Matese, Helen Parkinson, Alan Robinson, Ugis Sarkans, Steffen Schulze-Kremer, Jason Stewart, Ronald Taylor, Jaak Vilo, and Martin Vingron. Minimum information about a microarray experiment (MIAME)—toward standards for microarray data. Nature Genetics, 29(4):365–371, December 2001.
[7]
Frank Gibson, Paul G. Overton, Tom V. Smulders, Simon R. Schultz, Stephen J. Eglen, Colin D. Ingram, Stefano Panzeri, Phil Bream, Evelyne Sernagor, Mark Cunningham, Christopher Adams, Christoph Echtermeyer, Jennifer Simonotto, Marcus Kaiser, Daniel C. Swan, Marty Fletcher, and Phillip Lord. Minimum Information about a Neuroscience Investigation (MINI) Electrophysiology. Nature Precedings, March 2008.
[8]
Doug Howe, Maria Costanzo, Petra Fey, Takashi Gojobori, Linda Hannick, Winston Hide, David P. Hill, Renate Kania, Mary Schaeffer, Susan St Pierre, Simon Twigger, Owen White, and Seung Yon Y. Rhee. Big data: The future of biocuration. Nature, 455(7209):47–50, September 2008.
[9]
Ron Edgar and Tanya Barrett. NCBI GEO standards and services for microarray data. Nature Biotechnology, 24(12):1471–1472, December 2006.
[10]
Juan Antonio A. Vizcaíno, Richard Côté, Florian Reisinger, Joseph M. Foster, Michael Mueller, Jonathan Rameseder, Henning Hermjakob, and Lennart Martens. A guide to the Proteomics Identifications Database proteomics data repository. Proteomics, 9(18):4276–4283, September 2009.
[11]
B. Aranda, P. Achuthan, Y. Alam-Faruque, I. Armean, A. Bridge, C. Derow, M. Feuermann, A. T. Ghanbarian, S. Kerrien, J. Khadake, J. Kerssemakers, C. Leroy, M. Menden, M. Michaut, L. Montecchi-Palazzi, S. N. Neuhauser, S. Orchard, V. Perreau, B. Roechert, K. van Eijk, and H. Hermjakob. The IntAct molecular interaction database in 2010. Nucleic Acids Research, 38(Database issue):D525–D531, October 2009.
[12]
Helen Parkinson, Ugis Sarkans, Nikolay Kolesnikov, Niran Abeygunawardena, Tony Burdett, Miroslaw Dylag, Ibrahim Emam, Anna Farne, Emma Hastings, Ele Holloway, Natalja Kurbatova, Margus Lukk, James Malone, Roby Mani, Ekaterina Pilicheva, Gabriella Rustici, Anjan Sharma, Eleanor Williams, Tomasz Adamusiak, Marco Brandizi, Nataliya Sklyar, and Alvis Brazma. ArrayExpress update—an archive of microarray and high-throughput sequencing-based functional genomics experiments. Nucleic Acids Research, 39(suppl 1):D1002–D1004, January 2011.
[13]
Philippe Rocca-Serra, Marco Brandizi, Eamonn Maguire, Nataliya Sklyar, Chris Taylor, Kimberly Begley, Dawn Field, Stephen Harris, Winston Hide, Oliver Hofmann, Steffen Neumann, Peter Sterk, Weida Tong, and Susanna-Assunta Sansone. ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level. Bioinformatics, 26(18):2354–2356, September 2010.
[14]
Andrew R. Jones, Michael Miller, Ruedi Aebersold, Rolf Apweiler, Catherine A. Ball, Alvis Brazma, James DeGreef, Nigel Hardy, Henning Hermjakob, Simon J. Hubbard, Peter Hussey, Mark Igra, Helen Jenkins, Randall K. Julian, Kent Laursen, Stephen G. Oliver, Norman W. Paton, Susanna-Assunta Sansone, Ugis Sarkans, Christian J. Stoeckert, Chris F. Taylor, Patricia L. Whetzel, Joseph A. White, Paul Spellman, and Angel Pizarro. The Functional Genomics Experiment model (FuGE): an extensible framework for standards in functional genomics. Nature Biotechnology, 25(10):1127–1133, October 2007.
[15]
Philippe Rocca-Serra, Susanna Sansone, Andy Jones, Allyson Lister, Frank Gibson, Ryan Brinkman, Josef Spindlen, and Michael Miller. XSL transformations for FuGE and FuGE extension documents for HTML and tab-delimited rendering. http://isatab.sourceforge.net/docs/FUGE-and-XSL-transformations-R1.doc, June 2008.
[16]
Barry Smith, Michael Ashburner, Cornelius Rosse, Jonathan Bard, William Bug, Werner Ceusters, Louis J. Goldberg, Karen Eilbeck, Amelia Ireland, Christopher J. Mungall, Neocles Leontis, Philippe Rocca-Serra, Alan Ruttenberg, Susanna-Assunta Sansone, Richard H. Scheuermann, Nigam Shah, Patricia L. Whetzel, and Suzanna Lewis. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature Biotechnology, 25(11):1251–1255, November 2007.
[17]
Fran Berman, Geoffrey Fox, and Anthony J. G. Hey. Grid Computing: Making the Global Infrastructure a Reality. John Wiley & Sons, Inc., New York, NY, USA, 2003.
[18]
Judith A. Blake and Carol J. Bult. Beyond the data deluge: data integration and bio-ontologies. Journal of biomedical informatics, 39(3):314–320, June 2006.