UniProt Indexing for Biologists: Finding Protein Publications Made Simple

Indexing with UniProt

Sylvain Gaudan, Miguel Arregui, Harald Kirsch, Vivian Lee, Dietrich Rebholz-Schuhmann
EMBL, EBI, Wolfson College, Cambridge University (3)

Mum, Dad, the fish and other species.

I like fish. My favorite is Zebrafish. It’s called like that because, from a fish point of view, it looks like a Zebra. But still, it’s a fish, so it’s a Zebrafish. Of course, they have fins and eyes so that they can see and quickly hide from the starving ugly big fish. It’s so nice to look at them. At the beginning, it’s only an egg, and then it becomes a fish! With fins, mouth and eyes! I heard that it’s all done by the genes. For example, dad told me that there’s a gene called six 3 that has to do with the eyes. He didn’t say much.

So I thought that I could get more information about six 3 on the Internet. That’s when problems started. I typed six 3 in the little box and I started to read the articles. Many were not about my gene. Then, when it was about six 3, it wasn’t about Zebrafish (I don’t care about Chicken or Elephant!).

So, I went to see dad. Dad said that it’s because the Internet is about too many different things. He said I have to be more precise. Ah! I thought I just have to ask. Also, he said I forgot to put Sine oculis blabla 3 something because I should also look for the synonyms. From that moment I decided to go to see mum. Dad wasn’t funny any more.

Mum said that I shouldn’t listen to dad. That wasn’t the first time she said that. She said that I should forget all about these strange names and just use the UniProt ID (whatever it is). She just said it’s O73708 for six3 in Zebrafish and that’s enough to find all the publications and that I don’t have to worry about the synonyms. Mum is fun. The UniProt Index too.

1) Why is it so difficult to find publications related to a protein?

Fact: Protein names are highly ambiguous
Numbers: More than 600 protein names from Swiss-Prot are also English words such as ‘Had’, ‘Great’, ‘This’.
Also, around 6 000 names from Swiss-Prot are abbreviations with several potential expansions. For instance, ADM abbreviates the gene name ‘adrenomedullin’ as well as the drug name ‘adriamycin’.
Consequence: Search engine results can be unrelated to the protein of interest.

Examples of ambiguous abbreviations (1):
ADMR: ADrenoMedullin Receptor (gene); Average Daily Metabolic Rate
AES: Amino-terminal Enhancer of Split (gene); Anterior Ectosylvian Sulcus
AMFR: Autocrine Motility Factor Receptor (gene); Amplitude-Modulation Following Response

~~~

Fact: Protein names are not species specific
Numbers: Around 90 000 protein names from UniProt are shared across several species.
Consequence: When a protein name is mentioned in a text, it is not obvious which species is concerned.

~~~

Fact: Proteins have several names
Numbers: Around 84% of UniProt entries reference more than one name per protein. Half of Swiss-Prot proteins have at least three names.
Consequence: Search engine results are incomplete.

2) What can be done?

What: Acronym disambiguation
How: Acronyms can be resolved through their long forms. Either the long form of the abbreviation is contained in the document, or the context of the document allows the long form to be inferred. Once the acronym is resolved, its long form can be classified as a protein name or not.

~~~

What: Solving the species
How: Publications mentioning protein names often contain information about the studied organism. It can be the name of the organism itself, of an ancestor, or even of a descendant. Using the NCBI taxonomy, the most probable species is selected given the organisms cited in the document.

~~~

What: Including synonyms in the search
How: Swiss-Prot is the most comprehensive and accurate source of names and synonyms for proteins. All the protein names, once disambiguated, are indexed under their names as well as under the unique form that represents the protein in the correct organism: the UniProt PAN.
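The synonym indexing described above can be illustrated with a minimal sketch. It assumes a toy in-memory index: the synonym list for O08663 is taken from the example query on this poster, while the function name `build_index`, the data layout and the three abstracts are hypothetical and invented for illustration. Name disambiguation and species resolution are deliberately left out, and matching is plain substring search, so this is not the actual EBI implementation.

```python
from collections import defaultdict

# Synonyms for mouse entry O08663, as listed in the poster's example query.
# The dictionary layout is a hypothetical choice for this sketch.
SYNONYMS = {
    "O08663": ["methionine aminopeptidase 2", "peptidase M2", "MAP2", "MetAP2"],
}

# Hypothetical abstracts, invented for this example.
DOCUMENTS = {
    "doc1": "MetAP2 expression in the mouse tooth germ.",
    "doc2": "Methionine aminopeptidase 2 activity in mice.",
    "doc3": "Zebrafish fin regeneration.",
}


def build_index(documents, synonyms):
    """Map each accession to the ids of documents mentioning any of its names."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        lowered = text.lower()
        for accession, names in synonyms.items():
            # Naive case-insensitive substring match; a real system would
            # disambiguate the name and resolve the species first.
            if any(name.lower() in lowered for name in names):
                index[accession].add(doc_id)
    return index


index = build_index(DOCUMENTS, SYNONYMS)
print(sorted(index["O08663"]))
```

Once the index is built, a lookup by accession returns every document that mentions any synonym, so the person searching no longer has to enumerate the names.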
What: Name disambiguation
How: Protein names that are also English words can be identified by analyzing their frequencies in general English text such as the British National Corpus (BNC).

[Figure: Frequencies of protein names in the BNC (log scale, 10 to 10 000), with a marked cut-off. Names shown: Bnac2, Aciculin, Oxytocin, Insulin, April, Task, Light.]

~~~

3) So, how does it work?

When using EbiMed to retrieve publications related to a protein, simply use the protein’s UniProt PAN instead of one of the protein’s names combined with an organism name.

For instance, instead of the query:
(“methionine aminopeptidase 2” OR “peptidase M2” OR MAP2 OR MetAP2) and (mouse or mice) and “tooth germ”

Simply use the query:
O08663 AND “tooth germ”

4) Sounds good. Where can I use it?

For biologists: The Protein Index is available on EbiMed, a Web portal developed at the European Bioinformatics Institute. EbiMed retrieves abstracts from Medline and also builds a condensed view of the biomedical terminology contained in the results (e.g. protein names, GO terms, ...).

For bioinformaticians: Access the Protein Index via the EBI’s Web Services (SOAP/HTTP).

(2) S. Gaudan, H. Kirsch and D. Rebholz-Schuhmann: Resolving abbreviations to their senses in Medline. Bioinformatics, September 2005.

EbiMed: http://www.ebi.ac.uk/Rebholz-srv/ebimed/ and http://www.ebi.ac.uk/Rebholz-srv/whatizit/

Sylvain Gaudan is supported by an “E-STAR” fellowship funded by the EC’s FP6 Marie Curie Host fellowship for Early Stage Research Training under contract number MEST-CT-2004-504640.

http://www.ebi.ac.uk/Rebholz/
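The frequency-based name disambiguation from section 2 can be sketched as a simple cut-off filter. The counts and the cut-off value below are hypothetical placeholders, not real BNC figures, and the function name `is_common_english_word` is an assumption made for this sketch. Which names actually fall above the cut-off on the poster's figure is not recoverable here, so the split shown is illustrative only.

```python
# Hypothetical occurrence counts in a general-English corpus.
# These are NOT real BNC frequencies; they only illustrate the idea.
CORPUS_COUNTS = {
    "light": 25000,
    "task": 9000,
    "april": 7000,
    "insulin": 120,
    "aciculin": 0,
}

CUTOFF = 1000  # hypothetical threshold


def is_common_english_word(name, counts, cutoff):
    """Flag a protein name whose general-English frequency reaches the cut-off."""
    # Names absent from the corpus count as frequency 0.
    return counts.get(name.lower(), 0) >= cutoff


names = ["Light", "Task", "April", "Insulin", "Aciculin", "Bnac2"]
ambiguous = [n for n in names if is_common_english_word(n, CORPUS_COUNTS, CUTOFF)]
print(ambiguous)
```

Names flagged this way would need extra context before being treated as protein mentions, while rare names can be accepted more readily.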
