
Discussion Points for 3rd ENCODE Pseudogene Call

Mark Gerstein, 6 Oct 2005, 10:30 EDT.


Presentation Transcript


  1. Discussion Points for 3rd ENCODE Pseudogene Call. Mark Gerstein, 6 Oct 2005, 10:30 EDT

  2. Preliminaries
     • A Tentative Plan
        • 15 Oct freeze to create a consensus list
        • Then intersect with genomics experiments
        • Subgroup call the week of the 17th or 24th to discuss
        • Presentation on 28 Oct to G&T group
     • Tentative Figures
        • Table or bar graph of pseudogenes vs ENCODE region
        • Statistics on the number transcribed and the number having an upstream binding site
     • Browser update
        • http://genome-test.cse.ucsc.edu/cgi-bin/hgTracks?hgsid=1289656&db=hg17&position=chr21%3A32790943-32791489

  3. Intersection of Pseudogenes from Three Groups: Original
     • Havana-Gencode: 167 pseudogenes
     • Yale: 184 pseudogenes
     • UCSC retrogenes: 15 expressed (7-8 pseudogenes) + 143 not expressed (all pseudogenes)
     • [Venn diagram; region counts as extracted: 42, 45, 35, 21, 86, 87, 87, 18, 17, 18, 16, 22]
     • 86 Havana pseudogenes overlap at least one Yale pseudogene, and 87 Yale pseudogenes overlap at least one Havana pseudogene (likewise for retrogenes). These are global counts: at some loci three Havana pseudogenes may overlap a single Yale pseudogene, while at other loci several Yale pseudogenes overlap one Havana pseudogene.
     • Provided by France.
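The asymmetry between the two overlap counts on this slide (86 vs 87) can be illustrated with a minimal sketch. The coordinates and the helper names `overlaps` and `count_overlapping` are hypothetical, chosen only to show why "A overlapping any B" and "B overlapping any A" need not agree; they are not the groups' actual data or code.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def count_overlapping(query, reference):
    """Number of query intervals that overlap at least one reference interval."""
    return sum(any(overlaps(q, r) for r in reference) for q in query)


# Hypothetical calls: three short Havana-style calls fall inside one long Yale-style call.
havana = [("chr21", 100, 200), ("chr21", 220, 300), ("chr21", 320, 400), ("chr22", 50, 150)]
yale = [("chr21", 90, 410), ("chr22", 500, 600)]

havana_hits = count_overlapping(havana, yale)  # Havana calls overlapping any Yale call
yale_hits = count_overlapping(yale, havana)    # Yale calls overlapping any Havana call
print(havana_hits, yale_hits)
```

Because the counts are taken per call rather than per locus, a many-to-one overlap raises one count without raising the other.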

  4. Intersection of Pseudogenes from 4 Groups: Status as of 22-Sept-05 call
     • Havana-Gencode: 167 pseudogenes
     • Yale: 164 pseudogenes
     • UCSC retrogenes: 146 not expressed
     • [Venn diagram; region counts as extracted: 52 (2), 14 (2), 16 (0), 82 (34), 15 (1), 17 (7), 33 (1)]
     • Rough agreement now is: 82 + 52 - 7 = 127 out of 229 total
     • What to do with the remaining 102?
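The arithmetic on this slide can be checked with a short sketch. The region counts are taken in the order they appear on the slide; the variable names are illustrative, and the sketch assumes the seven counts are the seven Venn regions without needing to know which region each one belongs to.

```python
# (count, GIS count in parentheses) for each Venn region, in slide order.
regions = [(52, 2), (14, 2), (16, 0), (82, 34), (15, 1), (17, 7), (33, 1)]

total = sum(count for count, _gis in regions)  # all pseudogenes across the regions
agreed = 82 + 52 - 7                           # the slide's "rough agreement" figure
unresolved = total - agreed                    # the slide's "what to do with" figure

print(total, agreed, unresolved)
```

The seven region counts do sum to the slide's stated total, and the difference matches the 102 left to resolve.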

  5. Intersection of Pseudogenes from 4 Groups: Updated for 6-Oct-05 call
     • Havana-Gencode: 165 pseudogenes (167 - 2)
     • Yale: 167 pseudogenes (164 + 3)
     • UCSC retrogenes: 146 not expressed
     • [Venn diagram; region counts as extracted: 54 (2), 17 (2), 16 (0), 81 (34), 15 (1), 16 (7), 33 (1)]
     • The numbers in parentheses are pseudogenes from GIS.

  6. Intersection of Pseudogenes from 4 Groups: In relation to Havana ppt from 22-Sept call
     • Havana-Gencode: 165 pseudogenes (167 - 2)
     • Yale: 167 pseudogenes (164 + 3)
     • UCSC retrogenes: 146 not expressed
     • [Venn diagram; region counts as extracted: 54 (2), 17 (2), 16 (0), 81 (34), 15 (1), 16 (7), 33 (1)]
     • First annotation:
        • 7 Havana agrees to be added (8, 11, 40, 59, 139, 152, 169).
        • 4 at coding loci. [Yale agrees to delete]
        • 1 with weak sequence identity.*
        • 5 with “non-real” proteins.*
     • Numbers according to Adam’s note
     • Second annotation:
        • 9 Havana agrees to be added.
        • 2 at coding loci. [Yale agrees to delete]
        • 1 with weak sequence identity.*
        • 2 with “non-real” proteins.*
     • * Solved by consistent protein set & threshold

  7. Intersection of Pseudogenes from 4 Groups: Similar analysis looking at unique Havana pseudogenes
     • Havana-Gencode: 165 pseudogenes (167 - 2)
     • Yale: 167 pseudogenes (164 + 3)
     • UCSC retrogenes: 146 not expressed
     • [Venn diagram; region counts as extracted: 54 (2), 17 (2), 16 (0), 81 (34), 15 (1), 16 (7), 33 (1)]
     • 11: Yale agrees these were missed and should be added
     • 2: Not sure what is happening. Pseudogenes with several introns and no disablements (AC011330.8 and AC011330.5). [Havana should re-justify]
     • 1: without protein.*
     • 2: overlap with ENSEMBL exons.#
     • * Solved by consistent protein set & threshold
     • # Solved by consistent gene annotation

  8. How to generate a consensus for pseudogenes?
     • Stick with the intersection
     • Develop consistent criteria for identifying pseudogenes and apply them uniformly to ENCODE
        • E.g. protein matches with disablements found from a pipeline
        • Ignores tricky cases flagged by manual annotation
     • Do a simple union of UCSC, Havana & Yale
        • GIS is a subset of the other 3
        • Describe pseudogenes as being identified by multiple approaches and then explicitly flag each group’s unique ones in the final annotation
        • Easy but perhaps biases stats
     • Do a qualified union…
     • Represent the above consensus pseudogenes and their alignments in the UCSC browser
        • Track group showing each group below a consensus
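The intersection and simple-union options above can be sketched with plain set operations. This is a minimal illustration that assumes each group's calls have already been mapped to shared locus identifiers (in practice they would be matched by coordinate overlap); the locus names and the `calls` dictionary are hypothetical.

```python
# Hypothetical locus identifiers per group; real calls would be matched by overlap first.
calls = {
    "Havana": {"locus1", "locus2", "locus3"},
    "Yale": {"locus2", "locus3", "locus4"},
    "UCSC": {"locus3", "locus5"},
}

# Option: stick with the intersection (loci called by every group).
intersection = set.intersection(*calls.values())

# Option: simple union (loci called by any group).
union = set.union(*calls.values())

# For the union, record which groups support each locus so that
# each group's unique calls can be flagged in the final annotation.
support = {
    locus: sorted(group for group, loci in calls.items() if locus in loci)
    for locus in sorted(union)
}
unique = {locus: groups[0] for locus, groups in support.items() if len(groups) == 1}

print(intersection, len(union), unique)
```

Keeping the per-locus `support` list is one way to report a union while still letting downstream statistics be computed on the better-supported subset.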

  9. Once we have a consensus, how do we agree on pseudogene boundaries?
     • Keep each group’s boundaries unchanged
     • If pseudogenes overlap, take the largest region (union) or the smallest
     • Develop uniform criteria for assigning pseudogene boundaries and apply them to each pseudogene in the consensus set
     • Could just take each pseudogene in the consensus and have one group realign it against its parent
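The "largest region (union) or smallest" option can be sketched as follows. This assumes the overlapping calls for one pseudogene are given as (start, end) half-open intervals on the same chromosome, and it reads "smallest" as the region shared by all calls, which is only one possible interpretation of the slide. The function names are hypothetical.

```python
def largest_region(intervals):
    """Span covering every call: leftmost start to rightmost end."""
    return (min(start for start, _ in intervals), max(end for _, end in intervals))


def smallest_region(intervals):
    """Region common to every call, or None if the calls share no base."""
    start = max(start for start, _ in intervals)
    end = min(end for _, end in intervals)
    return (start, end) if start < end else None


# Hypothetical boundaries for one pseudogene as called by two groups.
group_calls = [(100, 500), (150, 620)]

print(largest_region(group_calls))
print(smallest_region(group_calls))
```

Neither rule uses the parent protein, which is why the slide also raises realignment against the parent as a uniform alternative.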

  10. A proposal for a qualified union with uniform criteria for boundaries
     • Identify a “good” set of human proteins (the HAVANA set?)
     • Remove pseudogenes (from all 4 groups) that overlap current GENCODE exons (does GENCODE have an updated version?)
     • Create a union of the remaining pseudogenes
     • Find the “best” matching protein for each pseudogene; remove entries without a BLAST hit (e-value cutoff issue?)
     • Realign each pseudogene to its parent protein to produce a uniform alignment and to define the start and end coordinates
     • Apply a threshold to sequence identity and coverage? (No.)
     • Classify pseudogenes into processed and non-processed (how?)
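The filtering and union steps of this proposal can be sketched as below. This is a rough outline under stated assumptions, not the groups' pipeline: intervals are hypothetical (chrom, start, end) tuples, `qualified_union` is an illustrative name, and the BLAST step is represented only by a caller-supplied `best_hit` lookup returning a parent protein or `None`. Realignment and processed/non-processed classification are not shown.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def qualified_union(calls_by_group, exons, best_hit):
    # Step 1: drop calls that overlap an annotated exon.
    kept = [
        (region, group)
        for group, regions in calls_by_group.items()
        for region in regions
        if not any(overlaps(region, exon) for exon in exons)
    ]
    # Step 2: union, merging overlapping calls into one locus and tracking support.
    kept.sort(key=lambda item: item[0])
    merged = []
    for region, group in kept:
        if merged and overlaps(merged[-1]["region"], region):
            chrom, start, end = merged[-1]["region"]
            merged[-1]["region"] = (chrom, start, max(end, region[2]))
            merged[-1]["groups"].add(group)
        else:
            merged.append({"region": region, "groups": {group}})
    # Step 3: keep only loci with a matching parent protein.
    return [locus for locus in merged if best_hit(locus["region"]) is not None]


# Hypothetical inputs for illustration only.
calls_by_group = {
    "Havana": [("chr21", 100, 200), ("chr21", 1000, 1100)],
    "Yale": [("chr21", 150, 260), ("chr21", 2000, 2100)],
    "UCSC": [("chr21", 3000, 3100)],
}
exons = [("chr21", 1050, 1200)]
hits = {("chr21", 100, 260): "P1", ("chr21", 3000, 3100): "P2"}  # stand-in for BLAST results

result = qualified_union(calls_by_group, exons, hits.get)
for locus in result:
    print(locus["region"], sorted(locus["groups"]))
```

In this toy input one call is removed for overlapping an exon, two overlapping calls merge into a single locus supported by two groups, and one locus is dropped for lacking a protein hit.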
