Breakout 3: Author identity – Creating a new kind of reputation online (Science Online London 2009)

Duncan Hull, Geoffrey Bilder, Michael Habib, Reynold Guida

ResearcherID, Contributor ID, Scopus Author ID, etc. help to connect your scientific record. How do these tools connect to your online identity, and how can OpenID and other tools be integrated? How can we build an online reputation and when should we worry about our privacy?

Geoff Bilder:

Almost every aspect of a person can change without the person themselves changing. So, you want to have an identifier that is a hook to you, and which is better than a name (which is changeable). What about retinal scans? Fingerprints? OpenID? Where does your profile come in? A profile is a collection of attributes that you use to describe who you are. With author identity, what we want is the ability to get at the profile of a person in an unambiguous manner. Until we have such a thing, how do you tell people what your canonical profile is? To complicate matters even more, each user will want multiple personas, each with their own profiles.

When talking about identity, two issues are often conflated: identity for authentication and identity for knowledge discovery. That is, you must be more rigorous in determining who someone is (logging into your identity) than in figuring out who wrote a paper. Further complications arise from the lossy conversion of authors’ names between languages.

Whatever is done has to be done on an international scale, and must be interdisciplinary and interinstitutional. The oldest content cited thus far in CrossRef (with a DOI) is from the 1600s. What happens to your identifier when you die? A final issue is scale: there are about 200K new DOIs per month, and even if we guess at 5 authors per DOI, then there could be between 5-21K failures of identification per month if you estimate a 96-97% success rate for author identification.
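As a rough check on the scale argument, here is a minimal sketch in Python that takes the figures in these notes at face value (about 200K DOIs per month, and 5 authors per DOI, which was itself a guess). The function name is my own. Under those inputs the count comes out higher than the 5-21K range recorded above, so that range presumably rested on assumptions that these notes do not capture.

```python
def expected_failures(dois_per_month, authors_per_doi, success_rate):
    """Estimate how many author identifications fail per month."""
    # Every author listed on every new DOI needs to be identified once.
    author_instances = dois_per_month * authors_per_doi
    return round(author_instances * (1 - success_rate))


# Figures as recorded in the notes; the 5 authors per DOI is a guess.
best_case = expected_failures(200_000, 5, 0.97)
worst_case = expected_failures(200_000, 5, 0.96)
print(best_case, worst_case)
```

Whatever the exact inputs, the point stands: a small percentage error rate still means tens of thousands of misidentifications each month.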

Duncan Hull:

He spoke about OpenID in science, among other things. Currently, authentication of people differs from one online application to the next, and is generally done with only a simple username and password combination. Simon Willison (The Guardian) estimates that the average online user has at least 18 user accounts and 3.49 passwords. OpenID aims for a situation where there are fewer usernames AND passwords.

OpenID works by redirecting you to your OpenID provider to log in, and then sending you back to the location you started at. However, having a URL as a username is not very intuitive. Further, logging in via redirection can be confusing. Therefore, while adoption of OpenID is growing, it may not properly take off until browsers and other vendors support it better. He mentioned myExperiment as one site that accepts OpenID.
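To make the redirect step concrete, here is a minimal sketch in Python of how a site might build the URL that sends you to your provider. The parameter names come from the OpenID Authentication 2.0 specification. The function name, the provider endpoint and all of the example URLs are hypothetical, and the sketch leaves out discovery, association and verification of the provider's signed response, all of which a real login needs.

```python
from urllib.parse import urlencode, urlparse, parse_qs

OPENID_NS = "http://specs.openid.net/auth/2.0"


def build_auth_redirect(provider_endpoint, claimed_id, return_to):
    """Build the URL that the browser is redirected to for login."""
    params = {
        "openid.ns": OPENID_NS,
        "openid.mode": "checkid_setup",   # interactive login request
        "openid.claimed_id": claimed_id,  # the URL you use as your username
        "openid.identity": claimed_id,
        "openid.return_to": return_to,    # where the provider sends you back
    }
    return provider_endpoint + "?" + urlencode(params)


# All three URLs below are made up for illustration.
url = build_auth_redirect(
    "https://provider.example/openid/auth",
    "https://alice.example/",
    "https://journal.example/login/return",
)
query = parse_qs(urlparse(url).query)
print(query["openid.mode"][0])
print(query["openid.return_to"][0])
```

The return address travels inside the request, which is how the provider knows where to send you once you have logged in.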

Michael Habib:

Michael presented a nice diagram: a square divided into 4 parts, with “about me” and “not about me” across the top, and “by me” and “not by me” down the side. It is the “not” category for both where the disambiguation of people is the most important. He used the example of Einstein and the LC Authority Files to figure out what all of the different versions of his name are.

Completely different from the LC Authority Files, which are manually and carefully checked by only certain people, is ClaimID. ClaimID is a way to collect all aspects of your identity in one place. However, it is dependent upon each individual being truthful about what they have ownership over.

Another approach is the Scopus Author ID, which is completely machine aggregated. It is validated by publications, and scales well. It has 99% precision and 95% recall. The downside is that it is impersonal, and those precision and recall values really aren’t very good when you consider that this is about ownership of an article, and that there are a very large number of people.
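To see why 99% precision and 95% recall can still hurt, here is a small sketch in Python using the standard definitions of the two measures. The counts are invented purely so that they reproduce the quoted rates. They are not Scopus data.

```python
def precision(true_pos, false_pos):
    """Share of the papers assigned to an author that are really theirs."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos, false_neg):
    """Share of an author's real papers that were assigned to them."""
    return true_pos / (true_pos + false_neg)


# Hypothetical counts chosen to match the quoted 99% / 95% figures.
correctly_assigned = 1881  # true positives
wrongly_assigned = 19      # false positives: someone else's papers
missed = 99                # false negatives: real papers left out

print(round(precision(correctly_assigned, wrongly_assigned), 2))
print(round(recall(correctly_assigned, missed), 2))
```

In this made-up sample of 1,980 genuine papers, 99 are missing from their authors' records and 19 papers are credited to the wrong person. Scale that up to the whole literature and the errors are no longer a rounding issue.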

There is also 2collab, where you can combine author ids (that you know about) into one identity. Then, you can add any other item on the web that is about you.

Reynold Guida (from Thomson Reuters):

They’ve built software to try to address author identity and attribution. If you look at the literature since 2000, communication and scientific collaboration have really changed. What we notice is that the number of multi-author papers has started to increase, while single-author papers have decreased. A Google search for common surnames really highlights the problems associated with identity. Name ambiguity is a real problem. The connection between the researcher and the institution and the community is a real problem. Two of the most important parts in this discussion are who do I know, and who do I want to know? The connections a person makes affect all aspects of their career.

Therefore they have created ResearcherID (free, secure, open). Privacy options are controlled by the user, even if the institution created the record. There is integration with EndNote, Web of Knowledge, and other systems to help build publication lists. You can link to / visualize your ResearcherID profile really easily from your own websites.


Question: Has anyone thought through the security implications of these single ID systems: one slip-up and your entire identity has been hacked? GB: Multiple identities encourage poor behaviour, as the thought of changing your password everywhere is so overwhelming that people don’t do it. But yes, these problems exist. However, to their minds the tradeoffs make it worthwhile. You should NOT conflate knowledge issues with security issues. This is because information for your scholarly profile is, by definition, public anyway.

Question: Do the different OpenID providers, author IDs and researcher IDs know about each other in the computational sense? Not really yet.

Question: What about just making the markup of the web more semantically friendly? DH: The Google approach is a good one. RG: It’s all about getting the information into the workflow.

Question (Phil Lord): What worries me is that there has been a big land grab for author identity space: for example, you cannot log into Yahoo with any OpenID other than a Yahoo OpenID. There’s a lot of value in being in control of someone’s ID. Therefore there is a big potential danger. GB: For every distributed system, you need a centralized index to get it to work correctly. Therefore we need to make sure that if a centralized system appears, there is accountability.


Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!

SysMO-DB and Carole Goble, BBSRC Systems Biology Workshop

BBSRC Systems Biology Grantholder Workshop, University of Nottingham, 16 December 2008.

SysMO stands for Systems Biology of Microorganisms. It has 11 projects from 91 institutes, whose aim is to record and describe the dynamic molecular processes occurring in microorganisms in a comprehensive way. These projects have no one concept of experimentation or modelling, which makes information exchange tough. Further, there are issues of people having their own solutions, suspicions (about sharing data, for instance), data issues (many don't have data or don't store it in a standard way) and resource issues (no extra resources). SysMO-DB started in July 2008, and is a 3 year funded effort (3+3 people in 3 teams over 3 sites). The aim is to provide a web-based solution to exchange, search, and disseminate data. They need to retrofit a data access, model handling and data integration platform. Because of the large number of groups and projects, they are going to aim for low-hanging fruit and early wins: be realistic, don't reinvent, be sustainable, and encourage standards adoption.

Just like at CISBAN, where we have implemented a web-based data integration, storage, exchange, and dissemination platform in a standards-compliant way (SyMBA), they have three users: experimentalists, bioinformaticians, and modellers. They're lucky, though, in that they have 6 people to develop SysMO-DB, when CISBAN only has 1. 🙂 And, as with CISBAN and many other data integration efforts, much of the work is social: that is, encouraging those three user types to collaborate and understand each other's work. The social solutions include questionnaires, "PALS" (postdocs and PhD students), and audits and sharing of methods, data, and models. They discuss things like what people need or don't need from MIAME. (Personal opinion and question: MIAME is intended as a minimal information checklist. What kind of things, then, don't they need? And would it be worth taking this information back to the MIAME people to possibly modify the guidelines if some aspects of it aren't truly minimal? End personal questions.)

Discovery is done via SysMO-SEEK. The questions are how to catalogue the metadata, and then how to provide mechanisms for accessing the data from locations other than the host site. There is a single search point over the "yellow pages" and the assets catalogue. They store metadata on results, not the results themselves (again, just like SyMBA, which stores the metadata in a database, and the results in a remote file store). They use myExperiment for linking both the people and the assets. For models, they're using a local installation of JWS Online, which is a database of curated models and a model simulator. There are also some links to semantic SBML from the TRANSLUCENT project.
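The idea of cataloguing metadata while leaving the results at the host site can be sketched in a few lines of Python. This is only my illustration of the general pattern. The class names, fields and example URL are all hypothetical, and none of it is taken from the actual SysMO-SEEK or SyMBA schemas.

```python
from dataclasses import dataclass


@dataclass
class AssetRecord:
    """Metadata about a result; the result itself stays at the host site."""
    title: str
    owner: str
    data_url: str      # pointer to the remote file, not the file contents
    tags: tuple = ()


class AssetsCatalogue:
    """A single search point over the registered metadata records."""

    def __init__(self):
        self._records = []

    def register(self, record):
        self._records.append(record)

    def search(self, term):
        term = term.lower()
        return [
            r for r in self._records
            if term in r.title.lower()
            or any(term == tag.lower() for tag in r.tags)
        ]


catalogue = AssetsCatalogue()
catalogue.register(AssetRecord(
    title="Glycolysis time course",
    owner="Lab A",
    data_url="https://lab-a.example/results/run42.xls",  # made-up location
    tags=("microarray", "yeast"),
))
hits = catalogue.search("glycolysis")
print(hits[0].data_url)
```

A search returns only the pointer, so each site keeps control over who can actually fetch its data.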

There are two kinds of processes to store. The first is experimental processes, e.g. SOPs and protocols. They use the Nature Protocols format, with the addition of high-level classification through tags. (Personal note: What is the underlying format for storing protocols?) The second type of process is bioinformatics processes, which are stored as workflows. (Question: Why don't you store protocols as workflows? They can be chained in the same way.) Taverna is used for this work. One bit of work was using libSBML inside Taverna for collaborative model development (Peter Li et al). Another automated (definition of automated in this context?) workflow goes from microarray to pathways and published abstracts. Their consortium wants to exchange information from public data sources, SysMO itself, and Excel spreadsheets.

(Another personal aside. FuGE (object model for experimental metadata) and ISA-TAB (tabular format, e.g. spreadsheets) are becoming interchangeable – work is going on between FuGE and ISA-TAB people right now – most recent workshop was last week. This is important, as it was mentioned that bioinformaticians have to deal with spreadsheets (which is true enough!). So, you get the best of both worlds with FuGE / ISA-TAB, without having to define yet another schema. A personal question would be: Why build these various metadata schemas and parsers for spreadsheets (e.g. whatever is used for the Assets catalogue and JERM parsing of spreadsheets) rather than use pre-existing models such as FuGE and formats such as ISA-TAB? Using the FuGE object model does not mean that you have to use all aspects of it – you can just take what you need. Perhaps it was due to the maturity of ISA-TAB at the time the project started, though the specification is now in version 1.0. Will SysMO-DB export and import these formats? There was no time for questions at the end of the talk, so I will try to find out during the lunch period. End aside.)

They are trying to map to the relevant MIBBI standard. There is a nice feature that reads spreadsheets from specific locations and automatically loads them into the Assets catalogue. (You can still load them directly into that catalogue.) They are performing a 4-site JERM exchange pilot scheme in Spring 2009.

Great talk – thanks 🙂

These are just my notes and are not guaranteed to be correct. Please feel free to let me know about any errors, which are all my fault and not the fault of the speaker. 🙂
