Patrick Lambrix: Ontology alignment and applications
February 1, 2010
1 February 2010, Workshop on Modeling biological systems: Bio-model similarities and differences based on semantic information
Presentation by Patrick Lambrix
Abstract: “The use of ontologies is a key technology for the Semantic Web and in particular in the biomedical field many ontologies have already been developed. Many of these ontologies, however, contain overlapping information and to make full use of the advantages of ontologies it is important to know the inter-ontology relationships, i.e. we need to align the ontologies. Knowledge of these alignments would lead to improvements in search, integration and analysis of biomedical data. In this talk we give an overview of techniques for ontology alignment with a focus on approaches that compute similarity values between terms in the different ontologies. Further, we discuss the results of evaluations of these techniques using biomedical ontologies. We also give examples of the use of ontology alignment in applications such as literature search.”
Ontologies are used for communication between people and organisations (to paraphrase Michael Ashburner, it’s easier for biologists to share their toothbrushes than to share their definition of a gene). Ontologies are also used for enabling knowledge re-use and sharing, and as a basis for interoperability between systems. They are repositories of information, and as such can be queried or used as a query model for information sources. They are considered a key technology for the Semantic Web.
(He then described GO and OBO.) Examples of large-scale biomedical ontology efforts are the OBO Foundry and the NCBO. In systems biology, there are also the SBO, BioPAX, and PSI efforts.
There are many biomedical ontologies, and a large number of those are used by researchers. As often happens when multiple ways of describing things are used, many ontologies have overlapping concepts. For instance, two ontologies may have different views on the same domain or different levels of granularity; in other cases, you may have built an in-house custom ontology that you wish to align with a standard one; finally, you may be building a bottom-up ontology that you wish to attach to an upper-level ontology.
Lambrix and others have developed a standard alignment framework. A number of different matchers are used to compare ontologies. Suggestions are validated by the user, and accepted and rejected choices are stored and used in later iterations. Preprocessing of the ontologies begins with the selection of relevant features of the ontologies and the selection of the search space (you can suggest appropriate “highly mappable” areas).
The matchers have a number of strategies. The first depends on linguistic matching, e.g. of names, definitions and synonyms. These linguistic matchers look at the edit distance (the number of character edits needed to turn one string into the other) and at n-gram sets: an n-gram is a sequence of n consecutive characters in a string, and similarity is based on a comparison of the n-grams of the two strings. Next come the structure-based strategies. Here, if it is known that two concepts in different ontologies are equivalent, then chances are that their children and parents have similarities as well. This is called propagation of similarities. Anchored matching is where two pairs at different levels in the hierarchy can be called equivalent. The third type of matcher strategy is the constraint-based approach. On their own, these approaches aren’t good at finding new mappings, but they are good at helping to reject prospective matches. A fourth type is instance-based matchers, where you could use life-science literature as instances: you can try to use entities annotated with a particular concept as instances of that concept. You could define a similarity measure between concepts using a basic naïve Bayes matcher (one per ontology). Here, each concept was used as a query term in PubMed, and the most recent PubMed abstracts for it were retrieved. Each Bayes classifier took the abstracts related to one ontology and classified each of them according to the concept in the other ontology with the highest posterior probability. A final type of matcher is one that uses auxiliary information. For example, you could use WordNet to find synonyms.
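To make the two linguistic measures concrete, here is a minimal Python sketch. It is not SAMBO's implementation: the function names are hypothetical, the edit distance shown is the standard Levenshtein distance, and the n-gram comparison uses the Dice coefficient over trigram sets, which is only one of several reasonable ways to compare n-grams.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def ngrams(s, n=3):
    """Set of all runs of n consecutive characters in s (case-insensitive)."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Dice coefficient over the two n-gram sets (an assumed choice of measure)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        # Both strings are shorter than n: fall back to exact comparison.
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))
```

Either score could then be turned into a similarity value between two term names, with synonyms and definitions compared in the same way.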
Most systems that use ontology alignment use different matcher methodologies. They have written a paper listing many of these systems and, more recently, a book chapter. Most systems use single-value threshold filtering, where pairs of concepts with a similarity higher than or equal to the threshold become mapping suggestions. With this method, if you set a high threshold, you don’t find that many suggestions, but those you find are good. How do you find the best threshold level?
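The trade-off behind the threshold question can be shown with a small Python sketch. The similarity values, term labels and reference alignment below are made up for illustration, and the function names are hypothetical rather than taken from any alignment system.

```python
def single_threshold_filter(similarities, threshold):
    """Keep every pair whose similarity is at least the threshold."""
    return {pair for pair, sim in similarities.items() if sim >= threshold}


def precision_recall(suggested, reference):
    """Precision and recall of a set of suggestions against a reference alignment.
    An empty suggestion set is given precision 0.0 here by convention."""
    true_positives = len(suggested & reference)
    precision = true_positives / len(suggested) if suggested else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical similarity values between terms of ontology A and ontology B.
similarities = {
    ("a1", "b1"): 0.9,
    ("a2", "b2"): 0.7,
    ("a2", "b3"): 0.6,
    ("a3", "b3"): 0.5,
    ("a4", "b1"): 0.3,
}
# Hypothetical reference alignment (the "correct" mappings).
reference = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}

strict = single_threshold_filter(similarities, 0.8)
loose = single_threshold_filter(similarities, 0.5)
```

With these invented numbers the strict threshold returns a single, correct suggestion, while the loose one recovers every reference mapping at the cost of one wrong suggestion.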
Lambrix and others suggest a method called double-threshold filtering. Here, pairs of concepts with a similarity higher than or equal to the upper threshold are suggestions. Those between the lower and upper threshold are mapping suggestions if they make sense with respect to the structure of the ontologies and the suggestions according to those above the upper threshold. This approach works very well if your structure is good (by which he means complete). From this, you can build an alignment system: theirs is called SAMBO.
The biggest evaluation done for alignments is via the Ontology Alignment Evaluation Initiative (OAEI), which has been around since 2004. There are different tracks. One track in which most groups participate is the comparison/benchmark track (open: you know what the result should be). Most of the others are blind: you don’t know what the answer should be ahead of time. In 2007, 17 groups participated; AOAS (from the UMLS people) came first in anatomy and SAMBO came second (though it was better than AOAS in terms of non-trivial matches). Still, 40% of the non-trivial matches were not found.
In the 2008 anatomy track, the idea was to align mouse anatomy and the NCI anatomy. The organisers say that there are 1544 mappings, of which 934 are trivial as these two groups have been working together already. The tasks included align and optimise, and seeing whether you could use already-given mappings to improve your results. SAMBO didn’t participate, but the organisers think that they would have won if they had entered. Interestingly, their “improved” SAMBO (SAMBOdtf) didn’t perform well. They think this was because the structure wasn’t complete, and the SAMBO people found a number of missing is-a relationships in the starting ontology(ies). 50% of mappings were found both by systems using background knowledge (BK) and by those not using it. 13% were found by each type of system but not by the other. The rest (approx 25%) were not found by either type of system. SAMBOdtf uses the structure, and they found that it could improve both precision and recall in other circumstances.
Current issues in this field include: complex ontologies, use of instance-based techniques, the use of alignment types (determining the type of relations rather than just equivalence), complex mappings (1-n, m-n), and determining which alignment strategies work best for connection ontology types. Evaluation of alignment strategies will be done in future with SEALS (Semantic Evaluation At Large Scale). Other topics for current/future discussion include recommending ‘best’ alignment strategies, and the use of a partial reference alignment. Also, they’ve just started on the integration of ontology alignment and repair of the structure of the ontologies (they published something on repair last year, and are working on the integration now).
Ontology alignment can be used to aid literature searches. How do you know what is in the repository (lack of knowledge of the domain)? How do you compose an expressive query, if you don’t know the language/method to do it? Commonly now, you do a keyword search and what you get back is not knowledge, but documents. If you’re interested in the relationships, then you use multiple keywords with or without boolean expressions. Still, you just get back documents. But what would be better is to get back knowledge together with its provenance documents. For multiple search terms, you could constrain the query to only allow terms with a certain degree of relatedness.
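As a rough illustration of double-threshold filtering, the Python sketch below keeps a middle-band pair only if it preserves the is-a ordering relative to every pair above the upper threshold. That consistency test is my own simplified reading of "makes sense with respect to the structure", not necessarily the check SAMBOdtf performs, and all names and values are hypothetical.

```python
def ancestors(term, parents):
    """All strict is-a ancestors of term; parents maps a term to its direct parents."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def is_consistent(pair, anchor, parents_a, parents_b):
    """Assumed test: (a, b) agrees with anchor (x, y) if a sits below/above x
    exactly when b sits below/above y."""
    (a, b), (x, y) = pair, anchor
    a_below = x in ancestors(a, parents_a)
    b_below = y in ancestors(b, parents_b)
    a_above = a in ancestors(x, parents_a)
    b_above = b in ancestors(y, parents_b)
    return a_below == b_below and a_above == b_above


def double_threshold_filter(similarities, lower, upper, parents_a, parents_b):
    """Pairs at or above upper are kept; pairs in [lower, upper) are kept only
    if consistent with every pair at or above upper."""
    anchors = [p for p, s in similarities.items() if s >= upper]
    middle = [p for p, s in similarities.items() if lower <= s < upper]
    kept = [p for p in middle
            if all(is_consistent(p, anchor, parents_a, parents_b) for anchor in anchors)]
    return anchors + kept


# Hypothetical toy hierarchies and similarity values.
parents_a = {"a:inner_ear": {"a:ear"}}
parents_b = {"b:internal_ear": {"b:ear"}, "b:nasal_bone": {"b:nose"}}
similarities = {
    ("a:ear", "b:ear"): 0.95,
    ("a:inner_ear", "b:internal_ear"): 0.60,
    ("a:inner_ear", "b:nasal_bone"): 0.55,
    ("a:nose", "b:ear"): 0.20,
}
suggestions = double_threshold_filter(similarities, 0.5, 0.9, parents_a, parents_b)
```

In this toy case the pair linking "inner ear" to "nasal bone" is rejected because it contradicts the high-confidence "ear" mapping. The sketch also shows why an incomplete structure hurts: a missing is-a link changes the outcome of the consistency test.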
To do this type of improved querying, you need to first define what a relevant query is: in an ontology, this would be a subgraph of your ontology which contains both/all of your query terms. All of the relevant queries are called a “slice” (there may be different routes encompassing both terms).
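One simple way to picture a relevant query is as a path through the ontology graph that connects the query terms. The Python sketch below finds one shortest such path with a breadth-first search. The graph, its term names and the function name are invented for illustration, and a real slice would collect all such connecting subgraphs rather than a single path.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search over an undirected graph given as an adjacency dict.
    Returns one shortest list of nodes from start to goal, or None if unreachable."""
    if start == goal:
        return [start]
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                if neighbour == goal:
                    # Walk back through the predecessors to rebuild the path.
                    path = [goal]
                    while previous[path[-1]] is not None:
                        path.append(previous[path[-1]])
                    return path[::-1]
                queue.append(neighbour)
    return None


# Hypothetical ontology fragment: nodes are concepts, edges are relations.
graph = {
    "Lipid": ["Protein", "Disease"],
    "Protein": ["Lipid", "Disease"],
    "Disease": ["Lipid", "Protein", "Ovarian cancer"],
    "Ovarian cancer": ["Disease"],
}
path = shortest_path(graph, "Lipid", "Ovarian cancer")
```

Each edge along the returned path corresponds to a relation that could appear in the generated query.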
In a framework implementing these, there would be a number of instantiated ontologies which are connected to the knowledge base, which in turn is connected to a query formulator. There is also an associated natural language query generator, which is used once a user enters keywords. This system doesn’t exist yet, though they’ve implemented each of the components separately. They tested the system independently (non-integrated, sending data manually to each of the participating groups). They looked at information concerning lipids in ovarian cancer. Input ontologies are the lipid ontology and alignment ontology (Allyson: ? Not sure I got the second ontology name right).
To instantiate the KB, they got the content of the document, extracted sentences, detected relevant sentences, recognised appropriate entities (term identification), normalised the results, extracted the relations; classified to identify ontology classes; populated an OWL ontology using the JENA API. After KB instantiation comes slice generation and alignment (ontologies connected using shortest path). Then these slices are used to generate nRQL queries and from there into natural language queries (use triples to generate sentences). Once you click on one of the NL sentences, the nRQL query is sent to the database, and the result retrieved. There is a tradeoff in terms of query generation: how many queries do you generate? Which ones do you show to the user? You probably need to perform some relevance matching and query ranking, which isn’t done yet.
How is the mapping implemented/executed in SAMBO? Have you used SWRL at all? They just use what is used in the competition, which is essentially OWL-like subsumption / equivalence / is-a statements.
Please note that this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know!
Ontogenesis: rapid reviewing and publishing of articles on semantics and ontologies
January 25, 2010
What happens when an ontologist or two gets frustrated at the long-scale publication process that is the norm when publishing scientific books? You get Ontogenesis: a quick-turnaround, low-maintenance solution using WordPress blog software. Next, a bunch of other ontologists are invited to a two-day, intensive writing and peer-reviewing workshop and the initial content is created. Result? Well, my favourite result was Kendall Clark tweeting this: “#Ontogenesis is awesome: http://ontogenesis.knowledgeblog.org/”.
What is Ontogenesis?
Phil Lord had the idea, and together with Robert Stevens and others organised the 2-day Ontogenesis workshop that occurred last week, 21-22 January 2010. Why look around for an alternative to traditional publishing methods? When writing a book, accepting the invitation might take 5 minutes, but getting around to doing it can take 6 months or more. You may only spend a couple of days writing the article, but then need to wait months for reviews (and do reviews for the other authors’ articles). Then there is the formatting and camera-ready copy. Then you may wait many months for proofs and then only get a few days to make corrections. Then, you can wait a year or so for actual publication, by which time it is possibly out of date. Not ideal, but still necessary for some forms of publishing.
There are a number of benefits to using blog software, and to the Ontogenesis model in general:
- stable, permanent URLs: Permanent URLs for articles and peer reviews. DOIs have been discussed as well, and are being considered.
- automatic linking of peer reviews and related online articles. The WordPress software automatically adds trackbacks, pingbacks, etc. as comments on the relevant articles, making it easy for interested readers to visit the peer reviews written for that article.
- completely open review system. Unlike many peer-review systems in use today, the reviewer (publicly) publishes his/her review in Ontogenesis.
- less work and quick turnaround time for the editors, reviewers, and authors. Once you have written your article (in whatever format you like, other than a few broad suggestions about licensing and intention), you publish it as “Uncategorised” in the system, and then once reviewers have agreed to look at it, move it to “Under Review”. Once reviews are complete, and the editors have checked everything, it is moved to “Reviewed”. Pretty simple.
A blog that isn’t a blog
But is Ontogenesis a blog? Not really. Is it a book? Not in the traditional sense. While it seems to be correct to call it a blog, how the blog software is being used isn’t the way many people use it. And, though Duncan has called it “blogging a book”, this isn’t quite right either: while content, once completed, will not be changed, new content will be continually added. Phil discussed this point in his introduction to the workshop. He stated that wikis are best suited for certain styles of articles, but not for this sort of infrequently-updated information. Further, in wikis in general, crediting is poor. Google Knol is a nice idea, but not many people are using it. If it’s just a plain website, then there is no real way to have (and to show, more importantly), peer review.
To me, and to the general agreement of the people at the workshop, Ontogenesis can be viewed as a title/proper noun, in the same way as Nature is a title of a journal. Ontogenesis is the first of a class of websites called Knowledge Blogs. It has more in common with the high-quality, article-style blogging of ScienceBlogs or Research Blogging than it does with the short, informal blogging style that is used by most bloggers. Each article stands on its own, is of a high quality, and describes a topic of interest to both ontologists and novices in the ontology world. Each article is aimed at a general life science readership, ensuring accessibility of knowledge and broad appeal.
My experiences as a contributor
I was lucky enough to be invited to the workshop last week, and had a great time. After an introductory set of presentations, we all got started writing our articles. The idea was that, once written, each article would be peer reviewed by at least 2 others at the workshop. Once the peer reviews were complete, the article would be re-categorised from “Under Review” to “Reviewed”. As Phil said in a recent blog post, we wrote a large number of articles, though the number that have gone through the full review process was not as high. We expect that over the next few days, the number of completed articles will rise.
My article on Semantic Integration in the Life Sciences was the first to come out of peer review. Thanks are very much due to Frank Gibson, Michel Dumontier, and David Shotton for their peer reviewing and constructive criticism: it is a much better article for their input. I also reviewed a couple of articles (1,2) by Helen Parkinson and James Malone, which should be moved to a Reviewed status soon.
Ok, but what’s the downside?
Well, it is new, and there are some kinks to work out. This workshop highlighted a number of them, such as the difficulty people unfamiliar with WordPress had using its UI. Sean has posted a useful summary of his thoughts on the pluses and minuses, which I encourage you to have a read of and comment on. Here are a few thoughts on how to improve the experience in future, as mentioned during the meeting:
- Enable the plugin for autogenerating related articles to improve cross-links.
- The Table of Contents has been started, but different “pathways” for different intended readerships to help guide them through the articles would be helpful.
- Reviewers should be able to change Categories in any article so they can mark when it is Under Review, rather than waiting for the Authors to do this.
- The article-specific Table of Contents is very helpful, but it might be better to move it to a different location in the post (e.g. the top rather than the bottom).
- Have a way to mark yourself as willing to accept papers to review, for instance if you have some time in your schedule that week: authors could then preferentially choose you.
- The ability for your name in the byline of an article to link to your profile on Ontogenesis. Currently, the profiles are private and some authors have put their profiles into the article text as a temporary alternative.
- Add the Stats WordPress plugin.
- Comments do not have the author of the comment within them, e.g. pingbacks to reviews have to be clicked through to find out who wrote the review.
- Dealing with references/citations will be done better in future, when an appropriate plugin is found. Currently, basic HTML links to DOIs are the standard way to go.
Conclusions? Be an author yourself, and try it out!
This method of publishing is new, interesting, and quick. If you have a topic you’d like to write about, are interested in peer reviewing, or are just interested in reading the articles then please visit Ontogenesis and have a go, and then let us know what you think!
Please note: as mentioned in the main text, I am one of the authors of articles and peer reviews in Ontogenesis.
Using the glossaries package in Latex and Linux, Kile
December 16, 2009
I was recently frustrated by the limitations of the acronym and glossary packages: I wanted to have something that joined the functionality of both together. Luckily, I found that with the glossaries package, which actually states that it is the replacement for the now-obsolete glossary package.
In order to make this tutorial, I have used the following resources, which you may also find useful: the CTAN glossaries page; the glossaries INSTALL file; (one, two) links from the latex community pages; and a page from the Cambridge Uni Engineering Department. These instructions work for Ubuntu Karmic Koala: please modify where necessary for your system.
Installing glossaries
Note for Windows users: While the makeglossaries command is a Perl script for Unix users, there is also a .bat version of the file for Windows users. However, I don’t know how to set up MiKTeX or equivalent to use this package. Feel free to add a comment if you can add information about this step.
1. Get and unzip the glossaries package. I downloaded it from here. Though you can download the source and compile, I found it much easier to simply download the tex directory structure (tds) zip file. Unfortunately, the texlive-latex-extra package available on Ubuntu or Kubuntu does not contain the glossaries package – it only contains glossary and acronym. I unzipped the contents of the zip file into a directory called “texmf” in my home directory. You’ll also want to run “texhash ~/texmf/” to update the latex database, according to the INSTALL instructions.
2. (Optionally) get the xfor package. If your system is like mine, after you’ve installed the glossaries package latex will complain that it doesn’t have the xfor package (which also is not available via apt-get in Ubuntu). Download this package from here.
3. Open the glossaries zip as root in a nautilus window, terminal window, or equivalent. You’ll be copying the contents to various locations in the root directory structure, and will need root access to do this.
4. Find the location of your root texmf directory. In Karmic, this is /usr/share/texmf/, though it may be in another location on your system.
5. Copy the contents of the tex and doc directories from the glossaries zip into the matching directory structure in your texmf directory. For me, this meant copying the “doc/latex/glossaries” subdirectory in the zip file to “/usr/share/texmf/doc/latex/”, and the same for the tex directory (copy “tex/latex/glossaries” subdirectory in the zip file to “/usr/share/texmf/tex/latex/”). In theory, you can also copy the scripts/ directory in the same way, but I did step 7 instead, as this is what was suggested in the INSTALL document.
6. Update the master latex database. Simply run the command “sudo mktexlsr”.
7. Add the location of your scripts/glossaries directory to your $PATH. This gives programs access to makeglossaries, the Perl script you will be using (if you’re in Linux/Unix). If you followed my default instructions in step 1, this location will be “/home/yourname/texmf/scripts/glossaries”.
8. Test the installation. Change into the directory you created in step 1, into the “doc/latex/glossaries/samples/” subdirectory. There, run “latex minimalgls”. If you get an error about xfor, please see step 9. Otherwise, run “makeglossaries minimalgls” and then “latex minimalgls” again. If everything works, the package is set up for command-line use. You may wish to modify your Kile setup to use glossaries – go to step 10 if this is the case.
9. Set up the xfor package. Run steps 3-6 again, but with the xfor.tds.zip file instead of the glossaries zip file. This package is simpler than glossaries, and does not contain a scripts/ subdirectory, so you will not need to do step 7. After installation, try running step 8 again: everything should work.
10. Setting up Kile. Though I’m using Ubuntu, I find the Kubuntu latex editor Kile to be my favourite (just “sudo apt-get install kile” on Ubuntu). To set up Kile for using glossaries, you need to add another build tool that runs makeglossaries.
    1. Go to Settings -> Configure Kile
    2. Select the “Build” choice, which is a submenu of “Tools” on the left-hand menu.
    3. This brings up a “Select a Tool” pane and a “Choose a configuration for the tool …” pane.
    4. Click on the “New” button at the bottom of the “Select a Tool” pane.
    5. Type in a descriptive name for the tool such as “MakeGlossaries”, and click “Next”.
    6. The next prompt will be to select the default behaviour (or class) of the tool. I based MakeGlossaries on MakeIndex, as they both run in similar ways. Click “Finish” to finish.
    7. For some reason for me, Kile wasn’t initially picking up my changes in my $PATH, so in the General tab of the “Choose a configuration for the tool MakeGlossaries” pane, I put the full path plus the name of the “makeglossaries” script in the “Command” field. You may only need to put in “makeglossaries”.
    8. In the “Options” field of the same tab and pane as step 7, just put in ‘%S’.
    9. Change the selected tab from “General” to “Menu”. In the “Add tool to Build menu:” field, select “Compile” from the pull-down menu. This allows it to appear in the quick compile menu in the main editor window.
    10. I didn’t change any other options. Press “OK” in the main Settings window.
    11. You should now be able to access MakeGlossaries within Kile. Remember, you have to run latex (e.g. PDFLatex) as normal first, to generate the initial file; then run MakeGlossaries; then run PDFLatex or similar again.
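If you would rather script the three-pass build than click through it, the order of the passes can be captured in a few lines of Python. This is only a sketch: the function names are my own, and it assumes that pdflatex and makeglossaries are both on your $PATH and that you pass the file name without the .tex extension.

```python
import subprocess


def glossary_build_commands(basename, engine="pdflatex"):
    """Return the three commands, in order, for a document that uses glossaries:
    a LaTeX pass, makeglossaries, then a second LaTeX pass."""
    return [
        [engine, basename],            # first pass writes the glossary input files
        ["makeglossaries", basename],  # builds the sorted glossary files
        [engine, basename],            # second pass pulls the glossary into the output
    ]


def build(basename, engine="pdflatex"):
    """Run the passes in order, stopping at the first one that fails."""
    for command in glossary_build_commands(basename, engine):
        subprocess.run(command, check=True)
```

For the sample file from the installation test, calling build("minimalgls", engine="latex") in the samples directory would run the same three commands described there.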
Good luck, and I hope this helps people!
Tips on using glossaries
I usually keep all of my acronyms/glossary entries read by the glossaries file in a glossaries-file.tex or similar, and use “\include” to pass it to my main tex file. The links I posted at the top of this tutorial contain a number of useful examples, and included below are my favorites from those locations as well as a few of my own.
Note on usage within your document: To reference these entries, use \gls{entrylabel}, whether the entry is an acronym or a glossary entry. Further, to access the plural version of either, use \glspl{entrylabel}. By default, you do NOT need to put in a plural form of an acronym: latex will add an “s” to the expanded form and to the short form when you reference the acronym with \glspl{TLA} rather than \gls{TLA}.
A plain glossary entry that is not also an acronym. The first “sample” is the label used to reference this entry. The second “name={sample}” is the name of the glossary entry, as viewed once the glossary is compiled. The description is the actual definition for the glossary entry:
\newglossaryentry{sample}{name={sample},description={a sample entry}}
A plain acronym entry that is not also a glossary entry. The TLA acronym below illustrates the very basic acronym form. The “aca” example after it illustrates how to add non-normal plurals to the short and long form of the acronym. Then, again, the first instance of “aca” is the label with which to reference the acronym, and the second instance is the name as viewed in the compiled document. The final {} section is the expanded form of the acronym:
\newacronym[]{TLA}{TLA}{Three-letter acronym}
\newacronym[\glsshortpluralkey=cas,\glslongpluralkey=contrived acronyms]{aca}{aca}{a contrived acronym}
Using an acronym within the glossary definition of a glossary entry. If you wish to make use of an acronym within the glossary definition, and have that acronym indexed properly within the glossary as well as the main text, here is what you do. First, make the acronym. Note that there is nothing special about this acronym:
\newacronym[]{DL}{DL}{Description Logics}
Second, make a normal glossary entry, and reference the acronym as normal. No special work necessary! Please also note that you can put in \cite references within a glossary entry with no problem at all:
\newglossaryentry{TBox}{name={TBox},description={This component of a \gls{DL}-based ontology describes
"intensional knowledge", or knowledge about the problem domain in general. The "T" in TBox could,
therefore, mean "Terminology" or "Taxonomy". It is considered to be unchanging
knowledge~\cite[pg. 12]{2003Description}. Deductions, or \textit{logical implications},
for TBoxes can be done by verifying that a generic relationship follows logically
from TBox declarations~\cite[pg. 14]{2003Description}.}}
Using an acronym as the name of a glossary entry. You sometimes want to use a defined acronym as the name for a glossary entry – this allows you to create a definition for an acronym. In this case, build your acronym as follows. Note that you need to add a “description” field to the square brackets:
\newacronym[description={\glslink{pos}{Part of Speech}}]{POS}{POS}{Part Of Speech}
Then, reference the acronym in the glossary entry as follows (notice the different label for this entry):
\newglossaryentry{pos}{name=\glslink{POS}{Part Of Speech},text=Part of Speech,
description={``Part of Speech''Description}}
Good luck, and have fun.
Inspiring Science Autumn Newsletter
December 7, 2009
I’ve been meaning to link to this Autumn’s Inspiring Science newsletter, put out by Claire Willis and others at the Science Learning Centre North-East. Not only does it have interesting articles on the science outreach they’ve been involved with recently and what’s coming up in the near future, but it also has a short article on me and my partnered teacher, Louise, as part of the Teacher Scientist Network. Find more about the programme on the Inspiring Science website. Enjoy!
Short Tutorial on using Pellet + OWLAPI + SWRL Rules
December 1, 2009
I’ve been looking through Pellet and OWLAPI documentation over the past few days, looking for a good example of running existing SWRL rules via the OWLAPI using Pellet’s built-in DL-safe SWRL support. SWRL is used in ontology mapping, and is a powerful tool. Up until now, I’ve just used the SWRLTab, but needed to start running my rules via plain Java programs, and so needed to code the running of the mapping rules in the OWLAPI (which I’m more familiar with than Jena). Once I clean up the test code, I’ll link it from here so others can take a look if they feel like it.
This example uses the following versions of the software:
- Pellet 2.0.0
- OWLAPI 1.1 (that is, the 1_1 part of the Subversion repository). I foresee no problems with using the new OWLAPI for OWL2, but I haven’t installed that yet.
Pre-existing Examples
Pellet provides a SWRL rule example (RulesExample.java in the Pellet download), but only for Jena, and not the OWLAPI. The OWLAPI Example3.java covers the creation of SWRL rules, but not their running. Therefore, to help others who may be walking the same path as I, a short example of OWLAPI + Pellet + SWRL follows.
New Example
This example assumes that you already have the classes, individuals, and rules mentioned below in an OWL file or files. Here is how the test ontology looks, before running the rule (you can use reasoner.getKB().printClassTree() to get this sort of output):
owl:Thing
   source:SourceA - (source:indvSourceA)
   source:SourceB - (source:indvSourceB)
   target:TargetA
   target:TargetB
The example SWRL rule is this (the rule.toString() method prints this kind of output, while iterating over ontology.getRules()):
Rule( antecedent(SourceA(?x)) consequent(TargetA(?x)) )
Please note that if you want to modularise your OWL files, as I do (I have different files for the source classes, the target classes, the source individuals, the target individuals, and the rules) then make sure your owl:imports in the primary OWL ontology are correct, and that you’ve mapped them correctly with the SimpleURIMapper class and the manager.addURIMapper(mapper) method. I will update this post with some unit tests of this setup once I’ve cleaned up the code for public consumption.
Once you have your ontology properly loaded into an OWLAPI OWLOntology class, you should simply realize the ontology with the following command to run the SWRL rules:
getReasoner().getKB().realize();
After this command, all that’s left to do is save the new inferences. In this simple case, one individual is asserted to also be a child of the TargetA class, as follows:
owl:Thing
   source:SourceA - (source:indvSourceA)
   source:SourceB - (source:indvSourceB)
   target:TargetA - (source:indvSourceA)
   target:TargetB
You can do this by using the following code to explicitly save the new inferences to a separate ontology file. You can modify InferredOntologyGenerator to just save a subset of the inferences, if you like. Have a look in the OWLAPI code or javadoc for more information. Alternatively, you could just iterate over the ABox and just save the new individuals to a file. Here’s the code for saving the ontology to a new location:
OWLOntology exportedOntology = manager.createOntology( URI.create( outputLogicalUri ) );
InferredOntologyGenerator generator = new InferredOntologyGenerator( reasoner );
generator.fillOntology( manager, exportedOntology );
manager.saveOntology( exportedOntology, new RDFXMLOntologyFormat(), URI.create( outputPhysicalUri ) );
I hope this helps some people!
Science Commons provide a list of considerations for researchers looking to license their ontology
November 12, 2009
Back in March, I wrote a blog post about my experiences trying to find out a) if ontologies should be licensed, b) if ontologies could be licensed, and c) what sort of license would be appropriate. After all, it isn’t clear what sort of thing an ontology is: is it software, or is it a document, or is it something else completely? In this post, I included a response I had received from the nice folks over at Science Commons, giving their perspective on the situation.
Today, I came across a Science Commons blog post by Kaitlin Thaney announcing OWL 2. In it, she also mentions that Science Commons now have a Reading Room article on Ontology Copyright Licensing Considerations which is well worth a read. It updates the information contained in my March post, and provides some useful thoughts on how we should go about licensing ontologies. The section below was the part that particularly caught my eye:
For sharing ontologies in a community or publicly, it would be prudent to think about copyright and licensing. For example, the ontology creator could say that “to the extent I may have copyright in my ontology, I license it in the following way.” In that way, she can reassure the community that even in the event copyright is later found to exist, they may rely upon her offer of a license. This provides an important “safety net” for the community of users, given the uncertainty about whether a given ontology may be copyrightable.
The above section seems to be the biggest new point compared with their earlier statement. While they primarily recommend CC0, they do acknowledge that many researchers may wish to choose an attribution-based licence such as the CC Attribution license.
If you create ontologies, then you should read this article: it’s short, easy to understand, and gives you the information you need to make your own decisions.
I live blogged Cameron Neylon’s talk today at Newcastle University, and I did it in a Wave. There were a few pluses, and a number of minuses. Still, it’s early days yet and I’m willing to take a few hits and see if things get better (perhaps by trying to write my own robots, who knows?). In effect, today was just an exercise, and what I wrote in the Wave could have equally well been written directly in this blog.
(You’ll get the context of this post if you read my previous post on trying to play around with Google Wave. Others, since, have had a similar experience to mine. Even so, I’m still smiling – most of the time.)
Pluses: The Wave was easy to write in, and easy to create. It was a very similar experience to my normal WordPress blogging experience.
Minuses: I wanted to make the Wave public from the start, but have yet to succeed in this. Adding public@a.googlewave.com or public@a.gwave.com just didn’t work: nothing I tried was effective. Also, the copying and pasting simply failed to work when copying the content of the Wave from Iron into my WordPress post in Firefox: while I could copy into other windows and editors, I simply couldn’t copy into WordPress. When I logged into Wave via Firefox, the copy-and-paste worked, but automatically included the highlighting that occurred due to my selecting the text, and then I couldn’t un-highlight the wave! What follows is a very colorful copy of my notes. I’ve removed the highlighting now, to make it more readable.
I’d like to embed the Wave here directly. In theory, I can do this with the following command:
[wave id="googlewave.com!w%252BtZ-uDfrYA.2"]
Unfortunately, it seems this Wavr plugin is not available via the wordpress.com setup. So, I’ll just post the content of the Wave below, so you can all read about Cameron Neylon’s fantastic presentation today, even if my first experiment in Wave wasn’t quite what I expected. Use the Wave id above to add this Wave to your inbox, if you’d like to discuss his presentation or fix any mistakes of mine. It should be public, but I’m having some issues with that, too!
Cameron Neylon’s talk on Capturing Process and Science Online. Newcastle University, 15 October 2009.
Please note that all the mistakes are mine, and no-one else’s. I’m happy to fix anything people spot!
We’re either on top of a dam about to burst, or under it about to get flooded. He showed a graph of data entering GenBank. Interestingly, the graph is no longer exponential, and this is because most of the sequence data isn’t going into GenBank, but is being put elsewhere.
The human scientist does not scale. But the web does scale! The scientist needs help with their data, with their analysis etc. They’ll go to a computer scientist to help them out. The CS person gives them a load of technological mumbo jumbo that they are suspicious of. What they need is someone to translate between the computer stuff and the biologist. They may try an ontologist; however, that also isn’t always too productive: the message they’re getting is that they’re being told how to do stuff, which doesn’t go down very well. People are shouting, but not communicating. This is because all the people might want different things (scientists want to record what’s happening in the lab, the ontologist wants to ensure that communication works, and the CS person wants to be able to take the data and do cool stuff with it).
Scientists are worried that other people might want to use their work. Let’s just assume they think that sharing data is exciting. Science wants to capture first and communicate second, ontologists want to communicate, and CS wants to process. There are lots of ways to publish on the web, in an appropriate way. However, useful sharing is harder than publishing. We need the agreed structure to do the communication, because machines need structure. However, that’s not the way humans work: humans tell stories. We’ve created a disconnect between these two things. The journal article is the story, but isn’t necessarily providing access to all the science.
So, we need to capture research objects, publish those objects, and capture the structure through the storytelling. Use the MyTea project as an example/story: a fully semantic (RDF-backed) laboratory record for synthetic chemistry. This is a structured discipline which has very consistent workflows. This system was tablet-based. It is effective and is still being used. However, it didn’t work for molecular biology, bioengineering etc., which cover a much wider range of things than just chemistry. So Cameron and others got some money to modify the system: take MyTea (a highly structured and specific system) and extend it into molecular biology. Could they make it more general, more unstructured? One thing that immediately stands out as unstructured/flexible is blogs. So, they thought that they could make a blog into a lab notebook. Blogs already have time stamps and authors, but there isn’t much revision history, so that got built into the new system.
However, was this unstructured system a recipe for disaster? Well, yes it is, to start with. What warrants a post, for example? Should a day be one post? An experiment? There was little in the way of context or links. People who also kept a physical lab book ended up having huge lists of lab book references. So, even though there was a decent amount of good things (Google indexing etc.) it was still too messy. However, as more information was added, help came from an unexpected source: post metadata. They found that pull-down menus for templates were being populated by the titles of the posts: the metadata from the posts was used to generate the pull-down menu. In the act of choosing a post from the menu, a link is created from that post to the new page made by the template. The templates depend on the metadata, and because the templates are labor saving, users will put in metadata! Templates feed on metadata, which feed the templates, and so on: a reinforcing system.
An ontology was “self-assembled” out of this research work and the metadata used for the templates. They compared their terms to the Sequence Ontology and found some exact matches, as well as some places where they identified possible errors in the Sequence Ontology (e.g. conflation of purpose into one term). They’re capturing first, and then the structure gets added afterwards. They can then map their process and ontologies onto agreed vocabularies for the purpose of a particular story. They do this because they want to communicate with other communities and researchers that are interested in their work.
So, you need tools to do this. Luckily, there are tools available that exploit structure where it already exists (like they’ve done in their templates, aka workflows). You can imagine instruments as bloggers (take the human out of the loop). However, we also need tools to tell stories: to wire up the research objects into particular stories / journal articles. This allows people who are telling different stories to connect to the same objects. You could aggregate a set of web objects into one feed, and link them together with specific predicates such as vocabs, relationships, etc. This isn’t very narrative, though. So, we need tools that interact with people while they’re doing things – hence Google Wave.
An example is Igor, the Google Wave citation robot. You’re having a “conversation” with this Robot: it’s offering you links, choices, etc while having it look and feel like you’re writing a document. Another is the ChemSpider Robot, written by Cameron. Here, you can create linked data without knowing you’ve done it. The Robots will automatically link your story to the research objects behind it. Robots can work off of each other, even if they aren’t intended to work together. Example: Janey-robot plus Graphy. If you pull the result from a series of robots into a new Wave, the entire provenance from the original wave is retained, and is retained over time. Workflows, data, or workflows+data can be shared.
Where does this take us? Let’s say we type “the new rt-pcr sample”. The system could check for previous rt-pcr samples, and choose the most recent one to link to in the text (after asking the user if they’re sure). As a result of typing this (and agreeing with the robot), another robot will talk to a MIBBI standard to get the required minimum information checklist and create a table based on that checklist. And always, adding links as you type. Capture the structure: it’s coming from the knowledge that you’re talking about an rt-pcr reaction. This is easier than writing it out by hand. As you get a primer, you drop it into your database of primers (which is also a Wave), and then it can be automatically linked in your text. This allows you to tell a structured story.
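To make the “most recent sample” step concrete, here is a minimal sketch in plain Java. It is not the Wave robot API and not anything shown in the talk: the SampleLinker class, the Sample record fields and the mostRecent method are hypothetical names, and the dates are made up for illustration.

```java
import java.time.LocalDate;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration of how a robot might pick the sample to link to.
public class SampleLinker {

    // A recorded sample: an identifier, a type such as "rt-pcr", and a creation date.
    static final class Sample {
        final String id;
        final String type;
        final LocalDate created;

        Sample(String id, String type, LocalDate created) {
            this.id = id;
            this.type = type;
            this.created = created;
        }
    }

    // Returns the newest sample of the given type, or empty if there is none.
    static Optional<Sample> mostRecent(List<Sample> samples, String type) {
        return samples.stream()
                .filter(s -> s.type.equalsIgnoreCase(type))
                .max(Comparator.comparing((Sample s) -> s.created));
    }

    public static void main(String[] args) {
        List<Sample> samples = Arrays.asList(
                new Sample("sample-1", "rt-pcr", LocalDate.of(2009, 10, 1)),
                new Sample("sample-2", "rt-pcr", LocalDate.of(2009, 10, 12)),
                new Sample("sample-3", "primer", LocalDate.of(2009, 10, 14)));

        // The robot would then ask the user to confirm before inserting the link.
        mostRecent(samples, "rt-pcr")
                .ifPresent(s -> System.out.println("Suggest link to " + s.id));
    }
}
```

Returning an Optional keeps the “no previous sample” case explicit, which is where a robot would fall back to asking the user to create a new record.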
Natural user interaction: easy user interaction with web services and databases. You have to be careful: you don’t want to be going back to the chemical database every time you type He, is, etc. In the Wave, you could somehow state that you’re NOT doing arsenic chemistry (the robot could learn and save your preferences on a per-user, per-wave basis). There are problems with Wave: one is the client interface, another is user understanding. In the client, some strange decisions have been made – it seems to have been made the way that people in Google think. However, the client is just a client. Specialized clients, or just better clients, will be some of the first useful tools. In terms of user understanding, none of us quite understands yet what Wave is.
We’re not getting any smarter. Experimentalists need help, and many recognize this and are hoping to use these new technologies. To provide help, we need structure so machines can understand things. However, we need to recognize and leverage the fact that humans tell stories. We need to have structure, but we need to use that structure in a narrative. Try to remember that capturing and communication are two different things.
The sound of two hands Waving
October 13, 2009
I got a Google Wave account (grin) via Cameron Neylon on Monday morning (thanks, Cameron!). I’m trying not to get caught up in all the hype, but I can’t help grinning when I’m using it, even though I don’t really know what I’m doing, and even after seeing the Science Online Demo and a couple Google videos.
But where and how will we get the benefit of the Wave?
I’ve read a few articles, and played around a little, and chatted with people, but I’m still a complete novice. So, I’m not going to talk about technical aspects of waving here. However, even now I can see that the power of Wave will not be in what’s available by default (as was the case with Gmail – you got an account, started using it, and that was pretty much it). The most value will be in the new applications, interfaces and, most especially, the Robots that will be riding the Wave with us. OK, so I’ve only had an account for one day, but even as a beginner I can see that what we create for ourselves and our communities to use will make or break this new thing. And, as ‘we’ are so much a requirement for this to work, my next point becomes pretty important.
What will it really take to get the best out of Wave for us researchers and scientists?
It will take many, many scientists participating. Social networking needs to get a lot more important to people who currently may just make use of e-mail and web browsing. This is exciting, but we’ll need their help. A very good slideshow by Sacha Chua about this can be found on Slideshare. Use it to convince your friends!
First steps.
As for me, I’ll be waving with both hands this Thursday at 2pm, when Cameron Neylon comes to talk about open science, Google Wave, and more. Unless Cameron is a fantastic multitasker, I may be the only one with an account at the presentation. Not sure how interesting it will be if I am the only one waving. I’ll keep you updated, and post my experience with live blogging with Wave here, and let you know how it goes.
I’m also hoping that I can get some of my research out there into the wider world via Wave robots. I have an interest in structured information (ontologies, data standards etc) and think this may lead to some interesting things.
So, the sound of two hands waving? Pretty quiet, I think. But add another few hundred pairs of hands, and things may get a lot louder.
