Housekeeping & Self References Meetings & Conferences

Reloaded…Again…Again, or Memories of ISMBs Past

Who out there remembers ISMB 2009? If, like me, you need a refresher then take a look at the ISMB 2009 Summary and Highlights article that I wrote at the end of it. We all enjoyed Sweden, and had a lot of fun. Due to an increase in my family size by two over the intervening years, and a concomitant switch to part-time work, I haven’t been back to an ISMB since. Shocking, I know! I love conferences, and hope to start going again soon, but until then, take a good long look at my badge-based blast from the past, pictured on the right.

Were you carbon based or silicon based? Do you still have your badge? Please let me know!

Now, I’m pretty sure that this badge came from the conference party at ISMB 2009. Please can someone confirm this though, as clearly 2009 is a Very Long Time Ago.

…Anyway, at least I remember the party! As we came in, we were each given a pair of sunglasses and had to choose a badge. Carbon-based badges were for people who had studied the life sciences before they came to bioinformatics, and silicon-based badges were for those with a Computing Science background. I chose carbon based as my undergraduate work was in Biology, although all subsequent study was in bioinformatics.

Anyone remember ISMB 2009 and have any experiences they’d like to share, especially with regards to this badge? Were there better parties at other conferences? I’m pretty sure Tainted Love came up at one point while a large proportion of the delegates were dancing. But then, I may be confusing my conferences. Hmmm, must take notes on more than just the lectures next time!

Housekeeping & Self References Meetings & Conferences Papers Science Online

Social Networking and Guidelines for Life Science Conferences
I had a great time in Sweden this past summer, at ISMB 2009 (ISMB/ECCB 2009 FriendFeed room). I listened to a lot of interesting talks, reconnected with old friends and met new ones. I went to an ice bar, explored a 17th-century ship that had been dragged from the bottom of the sea, and visited the building where the Nobel Prizes are handed out.

While there, many of us took notes and provided commentary through live blogging either on our own blogs or via FriendFeed and Twitter. The ISCB were very helpful, having announced and advertised the live blogging possibilities prior to the event. Once at the conference, they provided internet access, and even provided extension cords where necessary so that we could continue blogging on mains power.

Those of us who spent a large proportion of our time live blogging were asked to write a paper about our experiences. This quickly became two papers, as there were two clear subjects on our minds: firstly, how the live blogging went in the context of ISMB 2009 specifically; and secondly, how our experiences (and that of the organisers) might form the basis of a set of guidelines to conference organisers trying to create live blogging policies. The first paper became the conference report, a Message from ISCB published today in PLoS Computational Biology. This was published in conjunction with the second paper, a Perspective published jointly today in PLoS Computational Biology, that aims to help organisers create policies of their own. Particularly, it provides “top ten”(-ish) lists for organisers, bloggers and presenters.

So, thanks again to my co-authors:
Ruchira S. Datta: Blog FriendFeed
Oliver Hofmann: Blog FriendFeed Twitter
Roland Krause: Blog FriendFeed Twitter
Michael Kuhn: Blog FriendFeed Twitter
Bettina Roth
Reinhard Schneider: Blog FriendFeed
(you can find links to my social networking accounts on the About page on this blog)

If you have any questions or comments about either of these articles, please comment on the PLoS articles themselves, so there can be a record of the discussion.

Lister, A., Datta, R., Hofmann, O., Krause, R., Kuhn, M., Roth, B., & Schneider, R. (2010). Live Coverage of Scientific Conferences Using Web Technologies PLoS Computational Biology, 6 (1) DOI: 10.1371/journal.pcbi.1000563

Lister, A., Datta, R., Hofmann, O., Krause, R., Kuhn, M., Roth, B., & Schneider, R. (2010). Live Coverage of Intelligent Systems for Molecular Biology/European Conference on Computational Biology (ISMB/ECCB) 2009 PLoS Computational Biology, 6 (1) DOI: 10.1371/journal.pcbi.1000640

Housekeeping & Self References Meetings & Conferences

Highest-Viewed Blog Posts and Personal Thoughts on ISMB 2009

ISMB 2009 has come to a close, and with its end I’d like to chat a little about three topics: which ISMB 2009 blog posts readers clicked on the most, which presentations I (personally) found the best, and what I thought about the parts of the conference where no slides were involved (the social aspects).

If you want to check out all of my ISMB 2009 posts, remember you can always search on ‘ismb 2009‘. And don’t forget to check out the other bloggers: Oliver Hofmann, Cass Johnston, and Mikhail Spivakov. If there are more of you out there, let me know and I’ll include you here.

Most Highly-Viewed Blog Posts

Below you’ll find a top-ten list of my blog posts of the talks I attended at ISMB. This top ten is based on number of views according to the stats pages WordPress provides. Of course, this ranking is not very scientific. And additionally, this is just a little bit of fun and doesn’t represent any kind of relative merit of these talks. 🙂 I just wanted a snapshot of what the immediate interest was, both from attendees and non-attendees who followed the conference via FriendFeed or similar, and from there found my blog. Some more thoughts about this list:

  • It could be said to either positively or negatively relate to the quality of the FriendFeed comments. People liking the FF comments may have wanted to learn more, and thus clicked through to my posts. Conversely, people not getting enough information from the FF comments may have clicked through to learn more.
  • It could definitely also be said that the simple viewing of one of my posts doesn’t mean the user received any benefits, or indeed liked my post at all!
  • This may be obvious, but I only blogged those talks I attended. Therefore this list isn’t a representation of the popularity of all presentations, just of the number of views of the blog posts about presentations that I actually attended.
  • If I ever want to do a further ranking, this post will probably influence the numbers 🙂
  • It’s just a ranking of the most-viewed pages over the past 7 days, which pretty much covers the SIGs and the main conference. These numbers can and will change over the coming days and weeks. In fact, the positions shifted slightly while I was writing this, but I kept to the original list from this morning.

I hope nobody takes this little bit of fun too seriously, and enjoy!

The top posts, listed with the most-visited one first (as of the morning of July 3, 2009):

  1. TT:23 Utopia Documents: The Digital Meta-Library, Steve Pettifer
  2. Keynote: New Challenges and Opportunities in Network Biology, Trey Ideker
  3. Research reproducibility through GenePattern, Jill Mesirov, from the DAM SIG
  4. Keynote: Information and Biology, Pierre-Henri Gouyon
  5. TT40: BioCatalogue: A Curated Web Service Registry for the Life Science Community, Franck Tanoh
  6. Keynote: Computational Neuroscience: Models of the Visual System, Tomaso A. Poggio
  7. Special Session 4: Adam Arkin on Synthetic Biology, part of the Special Session on Advances and Challenges in Computational Biology, hosted by PLoS Computational Biology
  8. Annotation of SBML Models Through Rule-Based Semantic Integration, Allyson Lister, from the Bio-Ontologies SIG
  9. HL53: Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project, Chris Taylor
  10. Workflow development and reuse in Taverna, Carole Goble, from the DAM SIG

It’s nice to see a standards talk in the top ten (the MIBBI talk, at number 9). And yes, that is my presentation at number 8, but I promise it was really there in the list, and that it wasn’t me: WordPress doesn’t count my own visits to my blog!

And now onto the talks I liked the most…

This section has two parts: talks that I liked the most, and my favorite talk. Please note that these are in addition to the top ten I already mentioned above: those will not be getting a double mention here. Also, I’m not mentioning any of the papers I was involved in here, deliberately!

Firstly, presentations I heard that I enjoyed, in no particular order:

And, my absolute favorite? I’ll have to choose a keynote for that, Pierre-Henri Gouyon’s talk on Information and Biology. He was the most engaging of all of the speakers, and had the best style of speech. His talk was funny, intelligent and well constructed. A great way to start the conference.

The ISMB 2009 social scene (no, you’ll find no dirt here!)

The FriendFeed group was mainly sober, serious, and related directly to the presentations. Over 137 subscribed, though fewer people contributed. That’s no bad thing, though – it’s more important to encourage readers to discover the FF group and make use of it than it is to get loads of people writing in the group. Getting readers for the group is the hard part: once people are aware it exists, it’s a lot easier for them to start contributing to the dialog once they’re comfortable. It was on FF that I learnt that people had items stolen from the Light Factory party, which was one of the very few downsides to this conference.

However, it wasn’t all serious. Ruchira Datta started an open thread that was lively from the beginning. There was a Twitterer who was worried about the quality of music in the rooms prior to the talk (here’s just one example of his thoughts on the matter), and more than one mention of where power sockets could be found (in the open thread and here). Lars (who wasn’t at the conference but followed along on FF) provided a number of wordles concerning both content and authorship of FF comments. Neil (one of the main bloggers from last year) still eagerly awaits photos (I promise I’ll put some up this weekend, and am myself looking forward to Ruchira’s pics of us FFers at the Thai place!)

It wasn’t all online: many attendees managed to actually meet and talk in person 🙂 . I felt the Vasa Museum was a fabulous place to have the dinner on the Wednesday, and having the initial drinks reception at the City Hall impressed both me and everyone I spoke with. With alcoholic drinks roughly twice the price of their UK counterparts, I didn’t do much drinking, but then also didn’t miss it. I was kindly invited to the press conference (an experience which I may write up separately later), which was a fantastic first for me. I met people that I had only interacted with online before.

While I have been to other ISMBs before, I think in terms of my work and research, this was the best one. (The Brisbane ISMB was my favorite for non-work reasons, as there I got to cuddle a koala and take a 2-week break in Oz afterwards!)

Finally, I’d like to thank the organisers (especially Reinhard Schneider and the people who embedded all the FF sections into the ISMB pages – well done!), the people who toughed it out through my talk on the Sunday, the other FFers attending both remotely and physically, and the bosses (Tom Kirkwood and Neil Wipat from CISBAN here at Newcastle Uni) who let me attend.

Meetings & Conferences Semantics and Ontologies Software and Tools

TT47: Semantic Data Integration for Systems Biology Research (ISMB 2009)

Chris Rawlings. Also speaking: Catherine Canevet and Paul Fisher

BBSRC-funded research collaboration in Newcastle, Manchester, and Rothamsted: ONDEX and Taverna. Demo: Integration and augmentation of yeast metabolome model (Nature Biotech October 2008 26(10)). Presented: Taverna and ONDEX. In ONDEX, everything can be seen as a network. To help with this, ONDEX contains an ontology of concept classes, relation types, and additional properties. Their example is yeast jamboree data integration. They have both specific (e.g. KEGG) and generic (e.g. tab delimited) parsers to load in data.

When ONDEX works with Taverna, instead of using the pipeline manager you use the ONDEX web services and access ONDEX from Taverna. This means you can use Taverna to pull data into ONDEX. So, first parse jamboree data into ONDEX and remove currency metabolites (e.g. ATP, NAD). Add publications to the graph, from which domain experts can view and manually curate that data. Finally, annotate the graph using network analysis results. Then switch to Taverna and identify orphans discovered in ONDEX. Retrieve the enzymes relating to the orphans, assemble the PubMed query, and then add hits back to the ONDEX graph. Finally, have a look at the completed visualization. Use the ONDEX pipeline manager to upload data – it’s all in a GUI, which is good.
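To make the first and middle steps of that workflow a bit more concrete, here is a minimal sketch in plain Python. It does not use the ONDEX or Taverna APIs at all: the toy network, the function names and the query format are all made up for illustration. The notes also don't define what counts as an "orphan", so this sketch simply assumes it means a metabolite that takes part in only one reaction once the currency metabolites are gone.

```python
# Toy reaction network: reaction id -> set of metabolites it touches.
# All identifiers here are invented for illustration.
reactions = {
    "R1": {"glucose", "ATP", "G6P", "ADP"},
    "R2": {"G6P", "F6P"},
    "R3": {"F6P", "ATP", "FBP", "ADP"},
    "R4": {"FBP", "NAD", "X1"},
}

# Currency metabolites link almost everything and obscure the real structure.
CURRENCY = {"ATP", "ADP", "NAD", "NADH"}


def remove_currency(network, currency):
    """Return a copy of the network without currency metabolites."""
    return {rid: mets - currency for rid, mets in network.items()}


def find_orphans(network):
    """Metabolites appearing in exactly one reaction (assumed definition)."""
    counts = {}
    for mets in network.values():
        for m in mets:
            counts[m] = counts.get(m, 0) + 1
    return sorted(m for m, n in counts.items() if n == 1)


def build_query(terms):
    """Join terms into a simple OR query string (not real PubMed syntax)."""
    return " OR ".join(f'"{t}"' for t in terms)


cleaned = remove_currency(reactions, CURRENCY)
orphans = find_orphans(cleaned)
query = build_query(orphans)
```

In the workflow as presented, the query would be built from the enzymes related to the orphans rather than from the orphans themselves, and the hits would be added back to the graph; both steps are left out here.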

Then followed a live demo.


Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!

Meetings & Conferences Standards

HL53: Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project (ISMB 2009)

Chris Taylor

Standards are hugely dependent on their respective communities for requirements gathering, development, testing, and uptake by stakeholders. In modeling the biosciences there are a few generic features such as description of source material and experimental design components. Then there are biologically-delineated and technologically-delineated views of the world. These views are still common across many different areas of the life sciences. Much of it can fall under an ISA (Investigation-Study-Assay) structure.

You should then use three types of standards: syntax (images of FuGE, ISA-TAB etc), semantics, and scope. MIBBI is all about scope. How well are things working? Well, there is still separation, but things are getting better. There aren’t many carrots, though there are some sticks for using these standards. Why do we care about standards? Data exchange, comprehensibility, and scope for reuse.  Many funders (esp public funders) are now requiring data sharing or ability for data storage and exchange.

“Metaprojects”: FuGE, OBI, ISA-TAB – draw together many different domains and present them in a structure/semantics useful across all. Many of the “MI” (Minimum Information) guidelines are developed independently, and are sometimes defunct. It’s also hard to track what’s going on in these projects, they can be redundant, and it is difficult to obtain an overview of the full range of checklists. When the MI projects overlap, arbitrary decisions on wording and substructuring make integration difficult. This makes it hard to take parts of different guidelines – not very modular. Enter MIBBI. Two distinct goals: portal (registry of guidelines) and foundry (integration and modularization).

There’s lots of enthusiasm for the project (drafters, users, funders, journals). MIBBI raises awareness of various checklists and promotes gradual integration of checklists. Nature Biotechnology 26, 889 – 896 (2008) doi:10.1038/nbt0808-889 for the paper. He’s performed clustering and analysis of the different guidelines: displayed MIs in Cytoscape and in a fake phylogenetic tree. By the end of the year they’ll have a shopping-basket-based tool, MICheckout, to get all concepts together and then you get your own specialized checklist as output. You can make use of ISAcreator and its configuration to set mandatory parameters etc.
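The shopping-basket idea can be pictured with a small sketch. This is not MICheckout itself, which I haven't seen: the module names, the concepts and the `checkout` function are all hypothetical, and it assumes that a specialized checklist is just the de-duplicated union of the modules you put in your basket.

```python
# Hypothetical checklist modules: module name -> required concepts.
MODULES = {
    "sample": ["organism", "tissue", "sample preparation"],
    "mass_spec": ["instrument", "ion source", "sample preparation"],
    "imaging": ["microscope", "magnification"],
}


def checkout(selected, modules=MODULES):
    """Merge the selected modules into one de-duplicated, ordered checklist."""
    checklist = []
    for name in selected:
        for concept in modules[name]:
            # A concept shared by two modules should only be asked for once.
            if concept not in checklist:
                checklist.append(concept)
    return checklist


basket = ["sample", "mass_spec"]
my_checklist = checkout(basket)
```

The shared "sample preparation" concept is the interesting part: it only appears once in the output, which is the kind of modularity that independently worded guidelines make hard.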

The objections to fuller reporting. Why should I share? Funders and publishers are starting to require a bare minimum of metadata – however, researchers will then just do the bare minimum. Some people think that this is just a ‘make work’ scheme for bioinformaticians, or that bioinformaticians are parasitic. Some people don’t trust what others have done, but then that’s what the reporting guidelines are for in the first place – so you can figure out if you should trust it. Problems of quality are justified to an extent, but what of people lacking resource for large-scale work, or people who want to refer to proteomics data but don’t do proteomics? How should they follow these guidelines? Perception is that there is no money for this, and no mature free tools, and there are worries about vendor support. Vendors will support what researchers say they need.

Credit: data sharing is more or less a given now, and need central registries of data sets that can record reuse (also openids, DOIs for data). Side benefits and challenges include clearing up problems with paper authorship wrt reporting who’s done which bit. Would also enable other kinds of credit, and may have to be self-policing. Finally, the problem of micro data sets and legacy data. Example of the former is EMBL entries – when searching against EMBL, you’re using the data in some way, even if you don’t pull it out for later analysis.

FriendFeed Discussion

Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!

Meetings & Conferences

Keynote: A global view on protein expression based on the Human Protein Atlas (ISMB 2009)

Mathias Uhlen, Royal Institute of Technology

Introduction: Works a lot on affinity reagents. Invented and developed pyrosequencing technology (now used in 454).

Systematic biology introduction. 18th century – biologist. 19th – chemist (1/3 of elements discovered in Sweden in this century). 20th – physicists and at the end, computer scientist. He’d now like to say that the 21st century is the century of medicine. He spends most of his time on proteins, which are more complicated from a computational POV, but that does make it more fun. An impressive log-scale plot of number of bases sequenced since 1965. Pyrosequencing in 1998 all the way through to PacBio in 2010. Therefore can talk about personalized genomics. Bioinformatics is the key in the new era of genomics.

Systems biology /omics is going to be fantastic in the next 10 years. Genomics will continue to be a fundamental resource. Image of contradictory sign in Paris: you know where you want to go, but not how to get there. So there is a real need for protein probes (antibodies) – this isn’t easy, and there is a nice article about this in Nature, July 7 2007, “The generation game”. Therefore they have the Human Antibody Initiative (HAI). One of the efforts at the HAI is to look at commercially-available antibodies: they analyzed >5000 antibodies from 51 commercial companies, and looked at the success rate. For some companies virtually 100% worked, and for others 0%. About 50% of the antibodies seem to work (average success rate; Berglund et al. 2008).

So therefore they developed antibodypedia, which is a portal for validated antibodies. If you have 2 antibodies, you can compare results in various assay platforms so he wants to develop paired antibodies for every protein target. Will take a while. 🙂 Also, perform epitope mapping of antibodies (Nature Methods 2008). Epitope mapping leads to therapeutic targets, including Her2 (Rockberg and Uhlen 2009 Molecular Oncology in press).

The HPR project. Human Proteome Resource. Public multidisciplinary resource involving systematic generation of antibodies. 65 researchers at KTH Stockholm, 25 in Uppsala, 15 in India, and a couple other places incl China. It’s a factory of generating clones -> protein factory -> immunization -> affinity purification -> human protein atlas portal. The gene factory does about 200 clones per week, and is in full production. Generating 2 TB data per week.

The antigen design uses PRESTIGE, which is a bioinformatics approach to select antigens using the protein epitope signature tag (PrEST). 19832 genes initiated. They have an automated annotation system for cells, and use pathologists for tissues. They also work on HT subcellular localization. With confocal microscopy can get “exquisite” resolution. Fantastic images, but difficult to scale up to HT. They have an SVM that seems to be able to annotate 28 different parts of the cell. Current weekly output: 50 proteins per week, 50000 images. All goes into the HPA.

The Human Protein Atlas update. All data publicly available. Expression data not downloadable, but hope to change that in future. About 2/3 of data comes from their own antibodies, and 1/3 from commercial antibodies. Have doubled the number of antibodies last year. Have about 1/3 of the human proteins based on UniProt. About a further 50% of the human genes are in the pipeline. About 11% have been started and failed, so need to start again. Therefore only 6% they haven’t started, but will start this year. Most recent release: 7 mln images. The next 5 yrs are also about getting the paired antibodies mentioned earlier. All antibodies available to the public via Prestige Antibodies.

They’ve also started on the Rodent Brain Protein Atlas. The majority of antibodies developed for the human system also work for rodents.

Global expression analysis. How many proteins are expressed in a given cell? How large is the proteome? Ensembl thinks that the genes are up to 23,000, but UniProt thinks 20,000, and the number is probably within that range (for genes coding for proteins). In how many cells is a particular protein expressed? Housekeeping proteins – in lots of cells. Tissue-specific proteins – in few cells. They do analysis using various subfractions of antibodies. Looked at global expression in 45 human cell lines. Look at global expression in 3 cell lines via immunofluorescence (IF). Very few proteins are cell-type specific, confirmed by expression profiles in those 3 cell lines via IF and via Cytoscape visualization. Look at the tissues and see instead the same level of expression in all three tissues, and more that are only expressed in one tissue, but it is still less than is expressed in all (50% are expressed in all 3 tissues). <2% expressed exclusively in a single cell type (84 proteins), including some previously uncharacterized ones. PROSPECTS: PROteomics SPECification in Time and Space. Look at MCF-7 – human breast cancer cell line. Also working on next-generation sequencing of cDNAs from U2-OS human cell line.

A high fraction of all proteins expressed everywhere – few cell-specific proteins and group-specific proteins. Global expression profiling harmonizing well with the current concept of embryology and histology. Tissue specificity is achieved by precise regulation of protein levels in space and time.

Biobank profiling (translational medicine). How to use the above for biomarker discovery. Important to find biomarkers for early detection of disease. Have used suspension bead arrays. Looked at patients with different kidney diseases. Took 4 hrs to run 26000 assays looking at targets and the signals from the assay. The good ones you do the analysis of with the plasma. Then identify biomarkers very quickly. Found 2 really good candidate biomarkers for the disease, which needs to be tested now in larger clinical cohorts. Next-gen plasma profiling for biomarker discovery… They’re part of ENGAGE.

Nature, “The big ome” – 24 April 2008, editorial – he found it balanced. “Proteomics Ponders Prime Time”, Science, 26 September 2008, in response to the Nature article.

FriendFeed Discussion

Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!

Meetings & Conferences

Keynote: Chasing the AIDS Virus

Thomas Lengauer, Max-Planck Institute for Informatics

Background in mathematics and Computer Science. A very good example of a cross-disciplinary researcher. He is a founding member of the ECCB.

Start when the drugs are available on the marketplace and they support personalized medicine, and which drugs to give to AIDS patients. AIDS has killed over 25 million people since 1981 and 33 million infected with HIV as of 2007 (Source UNAIDS). AIDS awareness campaigns have waned in recent years, and as a consequence there is an increase in infection rates again. AIDS virus has a small genome, only 10,000 nucleotides. Attaches via surface proteins, integrates into cell and sheds capsid, exposes RNA genome, and then reverse transcriptase makes double-strand DNA which is then transported into the nucleus and is then spliced into the genome and is then an inseparable part of the infected cell. Can sit for a long time until the cell divides, and then the cell machinery builds the viral particle. The virus borrows a bit of the cellular membrane for its shell, and then there is a maturation phase. The protease is important at this stage. And then the virus eventually becomes infective again. The AIDS virus is by far the most well-understood virus.

There are a number of drugs that block the fusion of the virus with the cell, 17 blocking reverse transcriptase, etc. It is extremely dynamic in the rate of its evolution. The AIDS patient can have a turnover of 10 billion virus particles per day – and there are many variants of the virus – a drug may be effective against the wt, but then the minority population will grow. So, what do you do? This is the main medical question. We don’t have a drug cocktail that can catch all of them – no drug therapy works forever. In the drug therapy, you combine different classes of drugs with HAART.

In the past, they’ve built mutation tables – global collection of clinical experience. An expert group will build this table, as there may be resistance and they don’t want to subject patients to now-useless drugs: this also has limited expressivity. Expert systems can help with this (medical communities call this algorithms, which is wrong: they are rule-based expert systems!). Interdependencies between mutations cannot be captured by mutation tables. Rule-based expert systems do exactly that. Is this kind of resistance analysis objective?

Experimental resistance data includes: phenotypic data (extract virus and culture and expose to drugs in rising concentrations; curve comparison can figure out which drugs the resistance occurs at; this is called resistance factor; but this data is too expensive and too slow to make for clinicians), and genotypic data (id the genome of the viral variant).
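The talk didn't spell out how the resistance factor is computed from the curve comparison. A common convention, which I am assuming here rather than reporting from the talk, is the ratio between the drug concentration that halves replication (IC50) for the patient's virus and the same value for a reference strain. The sketch below estimates each IC50 by log-linear interpolation between measured points; the measurements and function names are invented for illustration.

```python
import math


def ic50(curve):
    """Concentration where replication drops to 50%, by log-linear interpolation.

    `curve` is a list of (concentration, fraction_of_replication) pairs,
    sorted by rising concentration.
    """
    for (c_lo, r_lo), (c_hi, r_hi) in zip(curve, curve[1:]):
        # Find the pair of neighbouring points that brackets the 50% level.
        if r_lo >= 0.5 >= r_hi and r_lo != r_hi:
            t = (r_lo - 0.5) / (r_lo - r_hi)
            log_c = math.log10(c_lo) + t * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_c
    raise ValueError("curve never crosses 50% replication")


def resistance_factor(patient_curve, reference_curve):
    """Fold change in IC50 of the patient virus relative to the reference."""
    return ic50(patient_curve) / ic50(reference_curve)


# Made-up measurements, for illustration only.
wild_type = [(0.01, 1.0), (0.1, 0.75), (1.0, 0.25), (10.0, 0.0)]
patient = [(0.1, 1.0), (1.0, 0.75), (10.0, 0.25), (100.0, 0.0)]

rf = resistance_factor(patient, wild_type)
```

With these toy curves the patient's virus needs roughly ten times more drug than the reference to reach the same level of inhibition.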

Analyzing the current virus. They’re doing multivariate statistical learning with additional traditional techniques. The training data is the genotype-phenotype pairs of 1000+ HIV variants. Quality criteria are predictive power and interpretability. Then there is regression and classification (grouping into categories). Classification comes with cut-offs (resistant/susceptible), but things aren’t always that simple. The classic interpretive model is a decision tree, where you find the mutation that best separates resistant from susceptible viruses. Continue analogously in the two resulting data subsets. Example: protease inhibitor Saquinavir, so far most used clinical tool.
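Choosing "the mutation that best separates resistant from susceptible viruses" is the root split of a decision tree. Here is a minimal sketch of that one step. The talk didn't say which splitting criterion they use, so I've swapped in weighted Gini impurity as a stand-in; the mutation names and the six genotype-phenotype pairs are placeholders, not real data.

```python
def gini(labels):
    """Gini impurity of a list of boolean labels (True = resistant)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(samples):
    """Return the mutation whose presence/absence best separates the labels."""
    mutations = sorted(set().union(*(muts for muts, _ in samples)))
    best, best_score = None, float("inf")
    for m in mutations:
        with_m = [label for muts, label in samples if m in muts]
        without_m = [label for muts, label in samples if m not in muts]
        # Weighted impurity of the two subsets: lower means a cleaner split.
        score = (len(with_m) * gini(with_m)
                 + len(without_m) * gini(without_m)) / len(samples)
        if score < best_score:
            best, best_score = m, score
    return best, best_score


# Made-up genotype/phenotype pairs; mutation names are placeholders.
samples = [
    ({"M1", "M2"}, True),
    ({"M1"}, True),
    ({"M2"}, False),
    (set(), False),
    ({"M1", "M3"}, True),
    ({"M3"}, False),
]

mutation, impurity = best_split(samples)
```

To grow the full tree you would call `best_split` again on each of the two subsets, which is the "continue analogously" step.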

Genotype is aligned to the wt and mutations are identified. Using linear SVM for regression: a line for each drug with the estimated resistance factor, normalization with Z-score, and the scored mutations. Some mutations which confer resistance to some things (e.g. 76V) actually confer re-sensitisation and therefore would have a positive effect.
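As a rough picture of what a linear model with scored mutations looks like, here is a small sketch. Nothing is fitted: the weights are invented where the talk's would come from the linear SVM regression, and the mutation names are placeholders. I'm also assuming that the Z-score standardises a prediction against some reference population, since the notes don't say what it is normalized against.

```python
from statistics import mean, pstdev

# Hypothetical per-mutation weights for one drug. In the talk these would come
# from a fitted linear SVM regression; here they are simply invented.
WEIGHTS = {"M1": 0.9, "M2": 0.4, "M3": -0.6}  # negative = re-sensitising
BIAS = 0.1


def predicted_log_rf(mutations):
    """Linear model: bias plus the weight of every mutation present."""
    return BIAS + sum(WEIGHTS.get(m, 0.0) for m in mutations)


def z_score(value, reference_values):
    """Standardise a prediction against a reference population (assumed)."""
    return (value - mean(reference_values)) / pstdev(reference_values)


# Invented reference genotypes used only to give the Z-score a scale.
reference = [predicted_log_rf(g) for g in (set(), {"M2"}, {"M3"}, {"M2", "M3"})]
score = z_score(predicted_log_rf({"M1", "M2"}), reference)
```

The negative weight on `M3` is the point of the example: adding that mutation to a genotype lowers the predicted resistance, which is how a re-sensitising mutation would show up in a linear model.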

Estimating the Viral Evolution. Often the virus follows specific mutational paths into resistance, and these are partly known from clinical practice. Can they find such paths in their database? They have lots of patients, but only a few time points on each patient (no longitudinal data). The TAM1 path is found by seeing the virus does *not* follow every possible path (Thanks Ruchira – missed that in the talk). They model the viral evolution to the resistance by tree structure, where every tree represents several alternatives for viral evolution. One tree collects the noise in the data, and the results can be mapped along a timeline. In this way you can get a probability of resistance in a quantitative time frame. Therapy optimization with THEO.

They have since gone European with their data: Euresist database (started Oct 2008). They built 3 prediction engines, of which THEO is one. Error in classifying the therapy into effective/not effective without THEO is above 24% and with THEO is 15%. Practices and labs that treat 2/3 of the AIDS patients in Germany use the geno2pheno software, and the server for it is accessed from about 30 countries.

Coreceptor usage. 1% of Caucasian population does not have this coreceptor. These people cannot be infected by HIV. A couple of different coreceptors are targeted, and the virus can switch once you’re in therapy. People with the CCR5 deletion don’t get AIDS, so the virus first goes through here, but it later switches to the other one (sentence from Ruchira, as I missed it – see FF discussion below for source).

Genotypic prediction of viral tropism: input around 35 aa of the V3 loop of the viral surface protein gp120. Output is a score that is larger the more likely the virus uses CXCR4. The method used is SVM. Accuracy increases if structural data is added. There are two kinds of data: clonal (in research setting) and bulk data (clinical routine). Sensitivity goes from 80% to 40% if you move from clonal to bulk data. Adding info on clinical correlates raises prediction accuracy. The power of predicting clinical follow-up is much higher than of predicting tropism phenotype. Go away from Sanger sequencing to increase accuracy – use ultra-deep sequencing. It yields 1000s of sequences per patient sample.

FriendFeed Discussion

Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!

Meetings & Conferences

PT47: Predicting and Understanding the Stability of G-Quadruplexes (ISMB 2009)

Oliver Stegle

Quadruplexes: stable structures that can come from DNA + RNA. They play a role in regulation of transcription. A stable fold-back structure can emerge in the presence of a cation. Are these patterns really stable? Which carry a functional role? First indicator of functionality is stability. Melting temp will be a proxy for stability. UV-Melt experiments are low-throughput. Further, the rules for quadruplex stability are complicated and have non-linear relationships.

They want to solve the regression problem mapping from a quadruplex input to its melting temperature in a supervised setting. They are using a gaussian processes regression, with marginal posterior mean and error bars. There is predictive uncertainty.

The sequence is handled with a spectrum kernel to capture local sequence differences, followed by a squared exponential kernel for candidate features. This regression method was chosen because the data used for training is noisy, with outliers. Also, these structures are so stable that sometimes they never melt. Hence a non-Gaussian likelihood model: a robust mixture model likelihood accounting for outliers, and a step function to incorporate observations that are bounds.
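For anyone else who found the kernels a bit abstract, here is a minimal sketch of the two named in the talk. A spectrum kernel compares two sequences by the k-mers they share, and a squared exponential kernel compares numeric feature vectors by distance. The value of k, the feature vectors and the way the two kernels are combined (a plain sum here) are my assumptions, as the talk didn't give those details.

```python
import math
from collections import Counter


def spectrum_kernel(seq_a, seq_b, k=2):
    """Count shared k-mers: dot product of the two k-mer count vectors."""
    kmers_a = Counter(seq_a[i:i + k] for i in range(len(seq_a) - k + 1))
    kmers_b = Counter(seq_b[i:i + k] for i in range(len(seq_b) - k + 1))
    return sum(count * kmers_b[kmer] for kmer, count in kmers_a.items())


def squared_exponential(x, y, length_scale=1.0):
    """exp(-||x - y||^2 / (2 * length_scale^2)) for numeric feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * length_scale ** 2))


def combined_kernel(a, b):
    """Sum of both kernels; each input is (sequence, feature_vector)."""
    return spectrum_kernel(a[0], b[0]) + squared_exponential(a[1], b[1])


# Two invented quadruplex-like sequences with made-up numeric features.
q1 = ("GGGTGGG", [3.0, 1.0])
q2 = ("GGGAGGG", [3.0, 1.0])
similarity = combined_kernel(q1, q2)
```

The two sequences differ by a single base, so they still share all their `GG` 2-mers and score as highly similar, which is the "local sequence differences" behaviour the spectrum kernel is meant to capture.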

Predictive accuracy: they look at 260 quadruplexes (one of the first data sets available for quadruplexes). They compared linear regression, SVM, GP, and GP + robust noise. As the task gets harder, the error goes up (as expected). The linear model is not adequate. The robust GP was the best of the lot: it gains significantly over the other models and is better at determining confidence levels. With a 50/50 training split, the predictions (with the error bars) always overlap with the “truth” line, sometimes with a large uncertainty. Everything is predicted within 10 degrees C.

Genome-wide quadruplex candidates. Structures are taken from (360,000 candidate seqs). Can we predict them? Is there any relation between location in the genome and temperature? Quadruplexes are overrepresented in promoter regions by an order of magnitude compared with anywhere else.

The current training dataset, with 260 observations, is not very big. Also, which sequences should be tested next? Active selection has been applied to 10987 quadruplexes in promoter regions: 30 measurements were selected actively and 30 at random. In summary, he presented a Gaussian process scheme for regression of quadruplex stability, with good estimates of predictive uncertainty.
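My notes do not record which criterion was used for the active selection. One common choice, shown here purely as an assumption, is to measure next the candidates whose predictions are the most uncertain. A minimal Python sketch, with hypothetical candidate names and invented prediction values:

```python
def select_most_uncertain(predictions, n):
    """Pick the n candidates with the largest predictive standard deviation.

    predictions: dict mapping candidate id -> (predicted mean, predicted std).
    """
    ranked = sorted(predictions, key=lambda c: predictions[c][1], reverse=True)
    return ranked[:n]


# Invented predictions (melting temperature in degrees C, error bar).
candidates = {
    "cand_1": (62.0, 1.5),
    "cand_2": (71.0, 8.0),
    "cand_3": (55.0, 0.7),
    "cand_4": (80.0, 4.2),
}

print(select_most_uncertain(candidates, 2))  # ['cand_2', 'cand_4']
```

In a real active learning loop the model would normally be refitted after each new measurement, so that later picks take the earlier results into account.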

Allyson’s notes: Some of this was a little over my head, so there may be more than a normal chance of me getting some of these notes wrong! 🙂

FriendFeed Discussion


Meetings & Conferences Software and Tools

TT42: Computational Biology in the cloud, towards a federative and collaborative R-based platform (ISMB 2009)

Eamonn Maguire talking on behalf of Karim Chine

BIOCEP-R has advanced graphics, more than regular R. It is a universal platform for scientific and statistical computing, intended to create an open, federative and collaborative environment for the production, sharing, and reuse of all the artifacts of computing. It puts new analytical, numerical and processing capabilities in the hands of everyone (open science). BIOCEP is a Java app built on top of R and Scilab: anything that you can do within those environments is accessible through BIOCEP. It has a RESTful API.

The BIOCEP computational open platform ecosystem: computational data sources, resources, components, GUIs, web services and scripts. The R Virtualization is like a mini-desktop – virtual R workbench. There is also a plugin repository, including GUI plugins. Firefox plugin called ElasticFox.

Here comes another demo – so fewer notes now… (but FriendFeed is made for this sort of thing, so look there – link below) 🙂 But the R console looks much easier to use than trying to use R on its own, with your own data only. The web services part means you can use BIOCEP to connect to a cloud instance.

FriendFeed Discussion


Meetings & Conferences Software and Tools

TT40: BioCatalogue: A Curated Web Service Registry for the Life Science Community (ISMB 2009)

Franck Tanoh

They estimate there are 3000+ web services in the life sciences, and we need to find out information about them beyond just where they can be found. People who have an interest in such services: users, developers, service providers (big and small), and tool developers. Their curation consists of: free text, tags, CV, automated WSDL ripping and analytics, automated monitoring and testing, and partner feeds.

Next came a demo of BioCatalogue. You can bookmark lots of services, even without signing up. Categories are created based on service function and discipline. There is also a history of who added what and when, to aid attribution. The state of the service is shown with an icon. You can find the description and information on any costs or licensing restrictions. The inputs and outputs of the services have their own descriptions. Soon, they’ll support batch services.

They’ll have test scripts that monitor the services, and they’d love to get loads of people involved.

FriendFeed Discussion
