…Or, on defining a protein.
For some time now, I have been interested in the meanings of things. I am working with ontologies and terminologies in my area of research, and producing rigorous definitions that can be written logically requires deep thinking on the meanings of things. I'm writing this up after a discussion I had with a number of colleagues1, and it was such an interesting example of the thought process that goes into creating definitions suitable for ontologies using Description Logics (which many of mine do) that I thought I would reproduce a distillation of that conversation here. The reason I was delving into what a protein is (or, rather, how I wish a protein to be modelled in an ontology) isn't so important – it's all about the journey, in this case.
Step 1: I have the terms of "polypeptide chain", "protein", and "protein complex". What exactly are they?
You don't always need to start with this question: indeed, you may just start with a biological domain (e.g. translation), and build up what you want. However, in this case I was constrained, and needed to answer this particular question. It quickly became clear that, as with many common words, we had a feel for what each term meant, but didn't really know where one term ended and the other began. We looked at the GO definition of protein complex (which we liked), multiple online definitions of protein, and many definitions of polypeptide, including the one from SO (aka the Sequence Types and Feature Ontology). SO had polypeptide as a synonym for protein, while others said proteins could have more than one polypeptide chain. And if the latter is true, then where do you draw the line between a protein and a protein complex? So, we end up at Step 2.
Step 2: What are the differentiating features that make up the concept of "polypeptide chain", "protein", and "protein complex"?
This is actually a more telling question, as it will lead you to a list of things that make each concept unique. In Description Logics, you have two main ideas behind defining a concept: what is necessary about a concept, and what is both necessary and sufficient. Take a simple example: we can make a statement about the normal2 state of the concept dog saying that a dog has 4 legs. This statement is necessary because if we declare something to be a dog, then we can infer that it has 4 legs. However, it is not sufficient to describe the concept of a dog unambiguously: if something has 4 legs, we cannot say with certainty that it must be a dog. When you make a definition of any concept, keeping the ideas of "necessary" and "necessary and sufficient" in your head is very handy. You'll generally end up with a list of statements, some of which together make up a necessary and sufficient list, and some of which are simply necessary. So, by listing differentiating features of these three terms, we started thinking about their definitions in a logical way. Here were some thoughts we had, as an example of the process we went through:
- Can the concept of a protein include extra features such as metals, or are these just cofactors and not what makes a protein, a protein?
- What is different about a protein and a polypeptide chain? Is a protein just a specialization of a polypeptide chain (a parent-child relationship), or not (e.g. a sibling or even further apart)?
- If a protein can have multiple polypeptide chains, then what differentiates the concept of a protein from that of a protein complex?
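Before moving on to the answers, the necessary versus necessary-and-sufficient distinction from the dog example can be sketched in a few lines of Python. This is only an illustration, not how a Description Logic reasoner works: the Animal class, the has_four_legs function and the three sample animals are all made up for the purpose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Animal:
    kind: str  # what we have declared the thing to be (hypothetical label)
    legs: int


def has_four_legs(animal: Animal) -> bool:
    """The necessary condition from the text: a (normal) dog has 4 legs."""
    return animal.legs == 4


# Made-up sample data, purely for illustration.
animals = [Animal("dog", 4), Animal("cat", 4), Animal("hen", 2)]

# Necessary: everything declared to be a dog satisfies the condition.
dogs_all_pass = all(has_four_legs(a) for a in animals if a.kind == "dog")

# Not sufficient: some things satisfy the condition without being dogs.
non_dogs_that_pass = [
    a.kind for a in animals if has_four_legs(a) and a.kind != "dog"
]

print(dogs_all_pass)
print(non_dogs_that_pass)
```

The second list being non-empty is exactly why "has 4 legs" cannot be used on its own to infer that something is a dog.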
Step 3: The Answers
We weren't out to create an entire ontology here (I'll leave that for another post some other time), just think of some sensible starting definitions for these concepts that would be unambiguous and useful in another context. However, we did try to think of relationships between concepts as a fundamental part of their definitions: what are relationships but another type of logic statement that falls into either the necessary (N) or necessary and sufficient (N&S) categories?
PLEASE NOTE: By creating these first-attempt definitions, we are not trying to define these concepts for the entire biological world. The point in my mind is not that we get a definition that 100% of people agree on. The actual point is to get an unambiguous definition that, if written in an appropriate language, would be intelligible by both programs and people. If a group of you share a common understanding of a concept (a bit like a group hallucination? 🙂 ) then you can all talk about it sensibly, and then magic things with inference and integration of data can happen!
Polypeptide Chain (PC): It's all about a lack of tertiary structure, and a multiplicity of 1. We had a number of starting points for this definition, as this concept was already in the ontology under discussion. In the end, we were happy to keep that definition, which boiled down to the following set of logic statements. (I'm paraphrasing here to keep the post as generic as possible.) Necessary: A string of amino acids linked by peptide bonds. However, this is not N&S, as there are many things which would fit this sentence but which also have other parts that would prevent them from being just a PC. If we wanted to make this a N&S statement, we could change it to be something like: has exactly one string of amino acids linked by peptide bonds, and has no other parts3. By stating that there can be only the one component to make a PC, we can infer that any object meeting this set of criteria must be of type "polypeptide chain". That is what the N&S statements give us.
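As a rough sketch of the difference between the two PC statements above, here they are as two Python checks. Everything in it is an assumption made for illustration: the Entity class, the part labels such as "amino_acid_chain" and "metal_ion", and the idea of representing "parts" as a flat tuple of labels (which sidesteps the vagueness of "parts" rather than resolving it).

```python
from dataclasses import dataclass
from typing import Tuple

CHAIN = "amino_acid_chain"  # hypothetical label for a peptide-bonded string


@dataclass(frozen=True)
class Entity:
    name: str
    parts: Tuple[str, ...]  # flat list of part labels (an assumption)


def meets_necessary_condition(entity: Entity) -> bool:
    """Necessary only: there is a string of amino acids among the parts."""
    return CHAIN in entity.parts


def is_polypeptide_chain(entity: Entity) -> bool:
    """N&S sketch: exactly one amino acid string, and no other parts."""
    return entity.parts.count(CHAIN) == 1 and len(entity.parts) == 1


lone_chain = Entity("lone chain", (CHAIN,))
chain_with_metal = Entity("chain plus metal ion", (CHAIN, "metal_ion"))
two_chains = Entity("two chains", (CHAIN, CHAIN))

for e in (lone_chain, chain_with_metal, two_chains):
    print(e.name, meets_necessary_condition(e), is_polypeptide_chain(e))
```

All three entities pass the necessary check, but only the lone chain passes the N&S one, so only there could we infer the type "polypeptide chain".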
Protein: It's all about a presence of tertiary structure, and can be composed of either one chain or multiple "permanent", covalently-linked chains. To differentiate it from a PC, we had the common-sense statement that a PC does not have any appreciable tertiary structure, while a protein does (stabilised by, e.g., disulphide bonds). Such a trait is N and not N&S: there are many other things in biology which have tertiary structure and which are not proteins. If that were the only defining feature, we could have placed protein as a child of PC, as protein would have been a more specific type of PC. This is not the case, however: things which are commonly called proteins often have multiple PCs, and we wanted to include this usage in our definition of protein. So, we've differentiated it from a PC based on both its structure and its multiplicity.
Next, how can we define a protein in a way that separates it from protein complex, which also has multiple PCs? Could it be that a multiple-PC protein only ever has PCs that are from the same transcript? No, they can be encoded by multiple transcripts and still be called a protein in conventional use. We quickly realized that finding a clear definition would be hard. In the end, the two main distinguishing features between protein and protein complex seemed to be that proteins, even if containing multiple PCs, were in a more permanent state of attachment than complexes, and that proteins always had covalently-linked subunits. An example of this is the insulin receptor, commonly classed as a protein, and whose alpha and beta subunits are connected with disulphide bonds.
Protein Complex (PCX): It's all about transience of the association between proteins (and perhaps a little about non-covalently linked subunits). More than one protein joined together, generally non-covalently, in a more transient way than with a multi-chain protein. This is N, as other objects may have non-covalently-joined proteins and are not protein complexes. An example of a PCX would be a transcription factor. This is our most problematic definition, and at the time we couldn't think of a good N&S statement.
Since the original discussion, more colleagues have joined in, noting that there are some things we call protein complexes which have covalently-bonded subunits. For example, ubiquitination (a covalent modification) is seen as something suitable for a protein complex. This leaves us with just the transience argument. You can see how this could be a problem, and why this sort of work can be difficult and frustrating.
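To make the difficulty concrete, here is a small Python sketch that paraphrases the three first-attempt definitions as checks and reports which concepts a candidate is not ruled out of. It is a simplification under stated assumptions: the Assembly class, its four boolean/integer fields, and the sample candidates are hypothetical, and the checks use transience alone for protein complex, as the ubiquitination point suggests.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Assembly:
    name: str
    chains: int                # number of polypeptide chains
    tertiary_structure: bool
    covalently_linked: bool    # are the chains/subunits covalently linked?
    transient: bool            # is the association transient?


def could_be(a: Assembly) -> List[str]:
    """Concepts not ruled out by the paraphrased conditions (a sketch)."""
    candidates = []
    if a.chains == 1 and not a.tertiary_structure:
        candidates.append("polypeptide chain")
    stable_covalent = a.covalently_linked and not a.transient
    if a.tertiary_structure and (a.chains == 1 or stable_covalent):
        candidates.append("protein")
    if a.chains > 1 and a.transient:
        candidates.append("protein complex")
    return candidates


# Hypothetical candidates, not measurements of real molecules.
samples = [
    Assembly("single unfolded chain", 1, False, False, False),
    Assembly("stable covalent multi-chain", 2, True, True, False),
    Assembly("transient covalent assembly", 2, True, True, True),
    Assembly("stable non-covalent assembly", 2, True, False, False),
]

for s in samples:
    print(s.name, could_be(s))
```

The last candidate comes back with an empty list: a stable, non-covalently linked, multi-chain assembly falls between these particular checks, which is the kind of gap that makes the definitions hard to pin down.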
There will always be multiple, contradictory meanings for words like gene and protein, which are in such common use. The thing to do isn't to try to change that – just to try to provide a rigorous way of capturing each useful orthologous definition, and having differently-named concepts for each one.
I'm not sure we scientists think enough about rigorous definitions for such important words (and we should), and I'm not sure any life-sciences ontology or terminology has gotten it completely right (or completely "complete") yet. Most terminologies/ontologies are currently very good at labelling (creating lots of term lists and hierarchies), but not yet very good at classification (they don't provide precise, rigorous definitions). As one of my colleagues said, most of what's available now are "natural histories" of a term such as protein rather than a true definition of the biology behind it. I'm sure we'll get there, though!
So, paraphrasing the words of a well-known Spaniard, next time you use the words gene or protein, think also of the people you're using them with, and have a ponder on whether or not you mean what you think they mean.
1 "Colleagues", if you wish to be named, just let me know. I didn't want to take liberties!
2 Yes, we can discuss all day about what "normal" is. Just take it in the spirit it's written, for this small example. However, if you wish to have a discussion, just let me know…! 🙂
3 "parts" is vague, most definitely. However, this is a general discussion of the process, and saying anything more about what "parts" are at this stage would necessitate the creation of an entire ontology, not just these three concepts that we were interested in! If we carry it to the extreme, we'd have to create new concepts in our ontology for virtually all of the nouns that we use in our definitions. This is the proper way to build an ontology, but this post is about a short exercise in thinking about good definitions, NOT about building an entire ontology.