Indexing with UniProt
Sylvain Gaudan * Miguel Arregui * Harald Kirsch * Vivian Lee * Dietrich Rebholz-Schuhmann
EMBL. Wolfson College. EBI. Cambridge University.
Mum, Dad, the fish and other species.
I like fish. My favorite is Zebrafish. It’s called like that because, from a fish point of view, it looks like a Zebra. But still, it’s a fish, so it’s a Zebrafish. Of course, they have fins and eyes so that they can see and quickly hide from the starving ugly big fish.
It’s so nice to look at them. At the beginning, it’s only an egg, and then it becomes a fish! With fins, mouth and eyes! I heard that it’s all done by the genes.
For example, dad told me that there’s a gene called six 3 that has to do with the eyes. He didn’t say much. So I thought that I could get more information about six 3 on the Internet. That’s when problems started.
I typed six 3 in the little box and I started to read the articles. Many were not about my gene. Then, when it was about six 3, it wasn’t about Zebrafish (I don’t care about Chicken or Elephant!). So, I went to see dad. Dad said that it’s because the Internet is about too many different things. He said I have to be more precise. Ah! I thought I just have to ask. Also, he said I forgot to put Sine oculis blabla 3 something because I should also look for the synonyms. From that moment I decided to go to see mum. Dad wasn’t funny any more.
Mum said that I shouldn’t listen to dad. That wasn’t the first time she said that. She said that I should forget all about these strange names and just use the UniProt ID (whatever it is). She just said it’s O73708 for six3 in Zebrafish and that’s enough to find all the publications and that I don’t have to worry about the synonyms. Mum is fun. The UniProt Index too.
1) Why is it so difficult to find publications related to a protein?
Fact: Protein names are highly ambiguous
Numbers: More than 600 protein names from Swiss-Prot are also English words such as ’Had’, ’Great’, ’This’. Also, around 6 000 names from Swiss-Prot are abbreviations with several potential expansions. For instance, ADM abbreviates the gene name ’adrenomedullin’ as well as the drug name ’adriamycin’.
Consequence: Search engine results can be unrelated to the protein of interest.
Fact: Protein names are not species specific
Numbers: Around 90 000 protein names from UniProt are shared across several species.
Consequence: When a protein name is mentioned in the text, it is not obvious which species is concerned.
Fact: Proteins have several names
Numbers: Around 84% of UniProt entries reference more than one name per protein. Half of Swiss-Prot proteins have at least three names.
Consequence: Search engine results are incomplete.
What: Acronym disambiguation
How: Acronyms can be resolved with their long-forms. Either the long-form of the abbreviation is contained in the document, or the context of the document allows the long-form to be inferred. Once the acronym is resolved, its long-form can be classified as a protein name or not.
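The first case above (the long-form is contained in the document) can be sketched in a few lines of Python. This is only an illustration, not the method used to build the index: the `SENSES` table and the function name `resolve_acronym` are hypothetical, the two senses of ADM are the ones quoted in section 1, and the second case (inferring the long-form from context) is not covered.

```python
import re

# Hypothetical sense inventory: acronym -> {long-form: is it a protein/gene name?}
# The two senses of ADM are the ones quoted on this poster.
SENSES = {
    "ADM": {"adrenomedullin": True, "adriamycin": False},
}

def resolve_acronym(acronym, text):
    """Return (long_form, is_protein) if a known long-form of the acronym
    occurs in the text, otherwise None."""
    lowered = text.lower()
    for long_form, is_protein in SENSES.get(acronym, {}).items():
        # Word-boundary match so that parts of longer words do not count.
        if re.search(r"\b" + re.escape(long_form) + r"\b", lowered):
            return long_form, is_protein
    return None

print(resolve_acronym("ADM", "Adrenomedullin (ADM) is a vasodilator peptide."))
print(resolve_acronym("ADM", "Patients received adriamycin (ADM)."))
```

Only mentions resolved to a protein sense would then be passed on to the index.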
What: Solving the species
How: Publications mentioning protein names often contain information about the studied organism. It can be the name of the organism itself, or of an ancestor or even of a descendant. Using the NCBI taxonomy, the most probable species is selected given the organisms cited in the document.
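A minimal sketch of this idea, assuming a simple counting rule: each candidate species scores one point for every cited organism that is the candidate itself, one of its ancestors, or one of its descendants. The `PARENT` table is a heavily abridged toy tree written for this example, not data copied from the NCBI taxonomy, and the scoring rule is an assumption rather than the rule actually used for the index.

```python
# Toy, heavily abridged taxonomy (child -> parent); illustrative only.
PARENT = {
    "Danio rerio": "Danio",
    "Danio": "Cyprinidae",
    "Cyprinidae": "Vertebrata",
    "Mus musculus": "Mus",
    "Mus": "Muridae",
    "Muridae": "Vertebrata",
    "Gallus gallus": "Gallus",
    "Gallus": "Vertebrata",
}

def lineage(taxon):
    """Return the taxon followed by all of its ancestors."""
    chain = [taxon]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def related(a, b):
    """True if a and b are the same taxon, or one is an ancestor of the other."""
    return a in lineage(b) or b in lineage(a)

def most_probable_species(candidates, mentions):
    """Pick the candidate supported by the most cited organisms, or None."""
    if not candidates:
        return None
    scores = {c: sum(related(c, m) for m in mentions) for c in candidates}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(most_probable_species(
    ["Danio rerio", "Mus musculus", "Gallus gallus"],
    ["Danio", "Cyprinidae"],
))
```

A mention of a very general taxon supports every candidate equally, so in practice ties would need a further rule; this sketch simply returns the first best-scoring candidate.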
What: Including synonyms in the search
How: Swiss-Prot is the most comprehensive and accurate source of names and synonyms for proteins. All the protein names, once disambiguated, are indexed under their names as well as under the unique form that represents the protein in the correct organism: the UniProt primary accession number (PAN).
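The double indexing described above can be pictured as a small inverted index. The sketch below is an assumption about the general shape of such an index, not the EBI implementation: the function `build_index` and the document identifiers are hypothetical, while the names and the accession number O08663 are the ones used in the query example of section 3.

```python
from collections import defaultdict

def build_index(resolved_mentions):
    """Build term -> set of document ids.

    resolved_mentions: iterable of (doc_id, name_as_written, pan) tuples,
    assumed to come out of the disambiguation steps.
    """
    index = defaultdict(set)
    for doc_id, name, pan in resolved_mentions:
        index[name.lower()].add(doc_id)  # searchable by the name as written
        index[pan].add(doc_id)           # and by the accession number
    return index

# Hypothetical documents, each using a different synonym of the same protein.
mentions = [
    ("doc1", "MetAP2", "O08663"),
    ("doc2", "methionine aminopeptidase 2", "O08663"),
    ("doc3", "peptidase M2", "O08663"),
]
index = build_index(mentions)

print(sorted(index["O08663"]))  # all three documents
print(sorted(index["metap2"]))  # only the document that wrote "MetAP2"
```

A search on the accession number retrieves every document, whichever synonym its authors used.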
Examples of acronyms with both a gene sense and another sense:
ADMR: ADrenoMedullin Receptor (gene) / Average Daily Metabolic Rate
AES: Amino-terminal Enhancer of Split (gene) / Anterior Ectosylvian Sulcus
AMFR: Autocrine Motility Factor Receptor (gene) / Amplitude-Modulation Following Response
2) What can be done?
What: Name disambiguation
How: Protein names that are also English words can be identified by analyzing their frequencies in general English text such as the British National Corpus (BNC).
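As a rough illustration of the frequency test, the sketch below flags a protein name as a likely English word when its log frequency in a general-English word list exceeds a threshold. The frequency values are made-up placeholders, not real BNC figures, and the threshold and the function name `looks_like_english_word` are assumptions chosen for the example.

```python
import math

# Placeholder frequencies (occurrences per million words); NOT real BNC data.
GENERAL_ENGLISH_FREQ = {
    "this": 4500.0,
    "had": 4000.0,
    "great": 450.0,
}

def looks_like_english_word(name, threshold_log10=1.0):
    """Flag a protein name whose general-English frequency is high.

    A name absent from the word list is treated as not ambiguous.
    The threshold is an arbitrary assumption for this sketch.
    """
    freq = GENERAL_ENGLISH_FREQ.get(name.lower(), 0.0)
    if freq <= 0.0:
        return False
    return math.log10(freq) > threshold_log10

for name in ["Had", "Great", "MetAP2"]:
    print(name, looks_like_english_word(name))
```

Names flagged this way would need supporting evidence from the document before being indexed as proteins.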
3) So, how does it work?
When using EbiMed for retrieving publications related to a protein, simply use the protein’s UniProt PAN instead of using one of the protein’s names in conjunction with an organism name. For instance, instead of the query:
(“methionine aminopeptidase 2” OR “peptidase M2” OR MAP2 OR MetAP2) AND (mouse OR mice) AND “tooth germ”
Simply use the query:
O08663 AND “tooth germ”
Figure: Frequencies of protein names in the BNC (log)
4) Sounds good. Where can I use it?
For Biologists: The Protein Index is available on EbiMed, a Web portal developed at the European Bioinformatics Institute. EbiMed retrieves abstracts from Medline and also builds a condensed view of the biomedical terminology contained in the result (e.g. protein names, GO terms, ...).
For Bioinformaticians: Access the Protein Index via the EBI’s Web Services (SOAP/HTTP).
S. Gaudan *, H. Kirsch and D. Rebholz-Schuhmann: Resolving abbreviations to their senses in Medline. Bioinformatics September 2005
EbiMed: http://www.ebi.ac.uk/Rebholz-srv/ebimed/ and http://www.ebi.ac.uk/Rebholz-srv/whatizit/
Sylvain Gaudan is supported by an “E-STAR” fellowship funded by the EC’s FP6 Marie Curie Host fellowship for Early Stage Research Training under contract number MEST-CT-2004-504640.