Indexing with UniProt

Sylvain Gaudan * Miguel Arregui * Harald Kirsch * Vivian Lee * Dietrich Rebholz-Schuhmann

EMBL. Wolfson College. EBI. Cambridge University.

Mum, Dad, the fish and other species.


I like fish. My favorite is Zebrafish. It’s called like that because, from a fish point of view, it looks like a Zebra. But still, it’s a fish, so it’s a Zebrafish. Of course, they have fins and eyes so that they can see and quickly hide from the starving ugly big fish.

It’s so nice to look at them. At the beginning, it’s only an egg, and then it becomes a fish! With fins, mouth and eyes! I heard that it’s all done by the genes.

For example, dad told me that there’s a gene called six 3 that has to do with the eyes. He didn’t say much. So I thought that I could get more information about six 3 on the Internet. That’s when problems started.

I typed six 3 in the little box and I started to read the articles. Many were not about my gene. Then, when it was about six 3, it wasn’t about Zebrafish (I don’t care about Chicken or Elephant!). So, I went to see dad. Dad said that it’s because the Internet is about too many different things. He said I have to be more precise. Ah! I thought I just have to ask. Also, he said I forgot to put Sine oculis blabla 3 something because I should also look for the synonyms. From that moment I decided to go to see mum. Dad wasn’t funny any more.

Mum said that I shouldn’t listen to dad. That wasn’t the first time she said that. She said that I should forget all about these strange names and just use the UniProt ID (whatever it is). She just said it’s O73708 for six3 in Zebrafish and that’s enough to find all the publications and that I don’t have to worry about the synonyms. Mum is fun. The UniProt Index too.

1) Why is it so difficult to find publications related to a protein?

Fact: Protein names are highly ambiguous

Numbers: More than 600 protein names from Swiss-Prot are also English words such as ’Had’, ’Great’, ’This’. Also, around 6 000 names from Swiss-Prot are abbreviations with several potential expansions. For instance, ADM abbreviates the gene name ’adrenomedullin’ as well as the drug name ’adriamycin’.

Consequence: Search engine results can be unrelated to the protein of interest.


Fact: Protein names are not species-specific

Numbers: Around 90 000 protein names from UniProt are shared across several species.

Consequence: When a protein name is mentioned in the text, it is not obvious which species is concerned.


Fact: Proteins have several names

Numbers: Around 84% of UniProt entries reference more than one name per protein. Half of Swiss-Prot proteins have at least three names.

Consequence: Search engine results are incomplete.

What: Acronym disambiguation

How: Acronyms can be resolved through their long forms. Either the long form of the abbreviation is contained in the document, or the context of the document allows the long form to be inferred. Once the acronym is resolved, the long form can be classified as a protein name or not.
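The first case, where the long form is written out in the document, can be illustrated with a minimal Python sketch. It assumes the common "long form (ACRONYM)" pattern and a simple letter-matching heuristic; the function names and the one-entry lexicon `PROTEIN_LONG_FORMS` are hypothetical and are not taken from the system described here. The second case, inferring the long form from context, is not covered.

```python
from typing import Optional

# Hypothetical toy lexicon of long forms known to be protein names.
PROTEIN_LONG_FORMS = {"adrenomedullin"}


def is_subsequence(short: str, long: str) -> bool:
    """True if the characters of `short` appear in `long` in the same order."""
    remaining = iter(long)
    return all(ch in remaining for ch in short)


def find_long_form(acronym: str, text: str) -> Optional[str]:
    """Look for 'long form (ACRONYM)' and return the long form, if any."""
    marker = f"({acronym})"
    pos = text.find(marker)
    if pos == -1:
        return None
    words = text[:pos].split()
    # Assumption: the long form spans at most twice as many words as the
    # acronym has letters.
    max_words = min(len(words), 2 * len(acronym))
    for size in range(1, max_words + 1):
        phrase = " ".join(words[-size:])
        starts_alike = phrase[0].lower() == acronym[0].lower()
        if starts_alike and is_subsequence(acronym.lower(), phrase.lower()):
            return phrase
    return None


def is_protein_acronym(acronym: str, text: str) -> bool:
    """Resolve the acronym, then check the long form against the lexicon."""
    long_form = find_long_form(acronym, text)
    return long_form is not None and long_form.lower() in PROTEIN_LONG_FORMS
```

With this sketch, `ADM` in "The peptide adrenomedullin (ADM) lowers blood pressure." resolves to a protein name, while `ADM` in "Patients received adriamycin (ADM) weekly." resolves to a long form that is not in the lexicon.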


What: Resolving the species

How: Publications mentioning protein names often contain information about the studied organism. It can be the name of the organism itself, of one of its ancestors, or even of one of its descendants in the taxonomy. Using the NCBI taxonomy, the most probable species is selected given the organisms cited in the document.
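The selection step can be sketched in Python as follows. The scoring rule (one point per cited organism that is the candidate itself, one of its ancestors, or one of its descendants) is an assumption made for illustration, since the exact rule is not given here. The `PARENT` table is a tiny hand-written excerpt with intermediate ranks omitted, not the real NCBI taxonomy, and all function names are hypothetical.

```python
from typing import Dict, List, Optional

# Toy excerpt of a taxonomy: child -> nearest listed ancestor.
# Intermediate ranks are omitted on purpose; this is not the NCBI data.
PARENT: Dict[str, str] = {
    "Danio rerio": "Danio",
    "Danio": "Teleostei",
    "Teleostei": "Vertebrata",
    "Mus musculus": "Mus",
    "Mus": "Mammalia",
    "Mammalia": "Vertebrata",
    "Gallus gallus": "Gallus",
    "Gallus": "Aves",
    "Aves": "Vertebrata",
}


def ancestors(taxon: str) -> List[str]:
    """All listed ancestors of a taxon, nearest first."""
    chain = []
    while taxon in PARENT:
        taxon = PARENT[taxon]
        chain.append(taxon)
    return chain


def related(a: str, b: str) -> bool:
    """True if a and b are the same taxon or lie on one lineage."""
    return a == b or a in ancestors(b) or b in ancestors(a)


def most_probable_species(candidates: List[str],
                          mentions: List[str]) -> Optional[str]:
    """Pick the candidate species supported by the most cited organisms."""
    scores = {s: sum(related(s, m) for m in mentions) for s in candidates}
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] == 0:
        return None  # no cited organism supports any candidate
    return best
```

For a protein name shared by zebrafish, mouse and chicken, a document citing "Teleostei", "Danio" and "Vertebrata" would favour "Danio rerio" under this rule, because only that candidate is supported by all three mentions.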


What: Including synonyms in the search

How: Swiss-Prot is the most comprehensive and accurate source of names and synonyms for proteins. All the protein names, once disambiguated, are indexed under their names as well as under the unique identifier that represents the protein in the correct organism: the UniProt primary accession number (PAN).
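A minimal Python sketch of such a double index is given below. It assumes that each mention has already been disambiguated and assigned a species. The lookup table `NAME_TO_PAN` is a hand-written stand-in for Swiss-Prot that reuses the names and accession number of the mouse example from this presentation; the function and variable names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Stand-in for the Swiss-Prot name table: (lower-cased name, species) -> PAN.
NAME_TO_PAN: Dict[Tuple[str, str], str] = {
    ("methionine aminopeptidase 2", "mouse"): "O08663",
    ("peptidase m2", "mouse"): "O08663",
    ("map2", "mouse"): "O08663",
    ("metap2", "mouse"): "O08663",
}


def build_index(mentions: Iterable[Tuple[str, str, str]]) -> Dict[str, Set[str]]:
    """Index documents under each protein name and under its PAN.

    `mentions` yields (document id, protein name, resolved species).
    """
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, name, species in mentions:
        key = name.lower()
        index[key].add(doc_id)            # entry under the name as written
        pan = NAME_TO_PAN.get((key, species))
        if pan is not None:
            index[pan].add(doc_id)        # entry under the species-specific PAN
    return index


index = build_index([
    ("doc1", "MetAP2", "mouse"),
    ("doc2", "peptidase M2", "mouse"),
    ("doc3", "MAP2", "human"),  # same name, other species: no mouse PAN
])
```

Looking up `index["O08663"]` then returns the documents for every mouse synonym at once, while the human mention of MAP2 stays out of that entry.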


Examples: acronyms with both a gene and a non-gene long form

ADMR: ADrenoMedullin Receptor (gene) or Average Daily Metabolic Rate

AES: Amino-terminal Enhancer of Split (gene) or Anterior Ectosylvian Sulcus

AMFR: Autocrine Motility Factor Receptor (gene) or Amplitude-Modulation Following Response


2) What can be done?

What: Name disambiguation

How: Protein names that are also English words can be identified by analyzing their frequencies in general English text such as the British National Corpus (BNC).
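The frequency check can be sketched in Python as a simple threshold filter. The counts in `TOY_COUNTS` are invented for illustration and are not real BNC figures, and both the cut-off of 1 000 occurrences and the function name are assumptions.

```python
from typing import Dict, Iterable, Set

# Invented word counts standing in for a general English corpus.
TOY_COUNTS: Dict[str, int] = {
    "had": 400_000,
    "great": 60_000,
    "this": 450_000,
}


def ambiguous_names(protein_names: Iterable[str],
                    corpus_counts: Dict[str, int],
                    threshold: int = 1_000) -> Set[str]:
    """Return the protein names that are frequent general English words."""
    return {
        name for name in protein_names
        if corpus_counts.get(name.lower(), 0) >= threshold
    }


flagged = ambiguous_names(["Had", "Great", "This", "MetAP2"], TOY_COUNTS)
```

Names flagged this way would need extra evidence from the surrounding text before being treated as protein mentions, whereas a name such as MetAP2, absent from general English, can be accepted directly.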


[Figure: Frequencies of protein names in the BNC (log scale)]

3) So, how does it work?

When using EbiMed to retrieve publications related to a protein, simply use the protein’s UniProt PAN instead of one of the protein’s names in conjunction with an organism name. For instance, instead of the query:

(“methionine aminopeptidase 2” OR “peptidase M2” OR MAP2 OR MetAP2) AND (mouse OR mice) AND “tooth germ”

Simply use the query:

O08663 AND “tooth germ”
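The difference between the two query styles can be made concrete with a short Python sketch that builds both strings. The helper names are hypothetical, the Boolean operators are written in upper case, and straight quotes are used around phrases; whether the search engine requires exactly this syntax is an assumption.

```python
from typing import List


def quote(term: str) -> str:
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def expanded_query(names: List[str], organisms: List[str], topic: str) -> str:
    """Query listing every synonym and every organism name by hand."""
    name_clause = " OR ".join(quote(n) for n in names)
    organism_clause = " OR ".join(quote(o) for o in organisms)
    return f"({name_clause}) AND ({organism_clause}) AND {quote(topic)}"


def pan_query(pan: str, topic: str) -> str:
    """Query using only the UniProt PAN, which already implies the species."""
    return f"{pan} AND {quote(topic)}"


long_query = expanded_query(
    ["methionine aminopeptidase 2", "peptidase M2", "MAP2", "MetAP2"],
    ["mouse", "mice"],
    "tooth germ",
)
short_query = pan_query("O08663", "tooth germ")
```

The hand-written query is only as complete as the list of synonyms the user happens to know, while the PAN query relies on the index to cover all of them.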

4) Sounds good. Where can I use it?

For Biologists: The Protein Index is available on EbiMed, a Web portal developed at the European Bioinformatics Institute. EbiMed retrieves abstracts from Medline and also builds a condensed view of the biomedical terminology contained in the result (e.g. protein names, GO terms, ...).

For Bioinformaticians: Access the Protein Index via the EBI’s Web Services (SOAP/HTTP).


S. Gaudan *, H. Kirsch and D. Rebholz-Schuhmann: Resolving abbreviations to their senses in Medline. Bioinformatics September 2005

EbiMed: and

Sylvain Gaudan is supported by an “E-STAR” fellowship funded by the EC’s FP6 Marie Curie Host fellowship for Early Stage Research Training under contract number MEST-CT-2004-504640.