UCSC Known Genes Version 3 Take 10

Overall Pipeline
  • Get alignments etc. from database
  • Remove antibody fragments
  • Clean alignments, project to genome
  • Cluster into splicing graph
  • Add EST, Exoniphy, OrthoSplice info.
  • Walk unique transcripts out of graph.
  • Assign coding regions (CDS) to transcripts.
  • Classify into coding, antisense, noncoding.
  • Remove weak transcripts.
  • Assign accessions.
Removing Antibody Var Regions
  • Chromosomes 2, 14, and 22 contain antibody regions.
  • Thousands of transcripts for these in GenBank.
  • Gaps are from genomic rearrangements, not splicing. Millions of possibilities.
  • Identify regions by:
    • Searching for words like ‘immunoglobulin’ and ‘variable’ to make an initial set of Ab fragments.
    • Treat anything that overlaps these as an Ab fragment too.
    • Cluster together putative Ab fragments.
    • Take 4 largest clusters as the 4 variable regions. (One is just a pseudogene of a real variable region.)
  • Remove all alignments in Ab clusters.
  • Replace with a single noncoding gene for each cluster near end of gene build.
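
The identification steps above can be sketched roughly as follows. This is only an illustration, not the pipeline's actual code: the `Aln` record, the function names, the single-pass overlap expansion, and the choice to rank clusters by number of member alignments are all assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical record: one aligned transcript plus its GenBank description.
Aln = namedtuple("Aln", "name chrom start end description")

AB_WORDS = ("immunoglobulin", "variable")  # seed words named on the slide


def overlaps(a, b):
    """True if two alignments share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def cluster_by_overlap(alns):
    """Merge overlapping alignments into clusters with a sorted sweep."""
    clusters = []
    for a in sorted(alns, key=lambda x: (x.chrom, x.start)):
        last = clusters[-1] if clusters else None
        if last and last["chrom"] == a.chrom and a.start < last["end"]:
            last["end"] = max(last["end"], a.end)
            last["members"].append(a)
        else:
            clusters.append({"chrom": a.chrom, "start": a.start,
                             "end": a.end, "members": [a]})
    return clusters


def find_ab_clusters(alns, n_regions=4):
    """Return the n_regions largest clusters of putative Ab fragments."""
    # Initial set: descriptions containing one of the seed words.
    seeds = [a for a in alns
             if any(w in a.description.lower() for w in AB_WORDS)]
    # Assumption: one pass of overlap expansion (seeds overlap themselves).
    putative = [a for a in alns if any(overlaps(a, s) for s in seeds)]
    clusters = cluster_by_overlap(putative)
    # Assumption: "largest" means most member alignments.
    clusters.sort(key=lambda c: len(c["members"]), reverse=True)
    return clusters[:n_regions]


def remove_ab_alignments(alns, clusters):
    """Drop every alignment that belongs to one of the Ab clusters."""
    doomed = {m.name for c in clusters for m in c["members"]}
    return [a for a in alns if a.name not in doomed]
```

Each returned cluster keeps its chromosome and span, which is the information needed later to stand in a single noncoding gene for the region.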
Cluster into splicing graph
  • Make graph where vertices are begin/ends of exons, edges are exons and introns.
  • Multiple input transcripts can share vertices and edges.
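
A minimal sketch of the graph described above, assuming each transcript is given as a sorted list of `(start, end)` exon coordinates on one chromosome and strand. The function name and data layout are hypothetical, and the sketch does not model soft versus hard ends.

```python
from collections import defaultdict


def build_splicing_graph(transcripts):
    """Build a splicing graph from transcripts.

    transcripts: dict mapping transcript name -> list of (start, end) exons,
    sorted by start. Returns (vertices, edges), where vertices is a set of
    exon boundary positions and edges maps (from, to, kind) to the set of
    transcript names that support that edge.
    """
    vertices = set()
    edges = defaultdict(set)
    for name, exons in transcripts.items():
        for i, (start, end) in enumerate(exons):
            # Exon begin and end positions are the vertices.
            vertices.update((start, end))
            edges[(start, end, "exon")].add(name)
            # An intron edge joins this exon's end to the next exon's begin.
            if i + 1 < len(exons):
                next_start = exons[i + 1][0]
                edges[(end, next_start, "intron")].add(name)
    return vertices, dict(edges)
```

Because edges are keyed by their coordinates, two transcripts that use the same exon or intron automatically share one edge, and the set of names on that edge records which inputs support it.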
[Figure: two panels, "Make graph" and "Snap soft ends to hard"]

Adding Evidence to Graph
  • Initial evidence for each edge comes from mRNAs.
  • Add EST evidence if an edge is supported by at least 2 ESTs. (A single EST is likely the same clone as a single RNA…) Only spliced ESTs are used.
  • Make graph in mouse and map via chains. Reinforce orthologous human edges.
  • Reinforce exon edges that overlap Exoniphy predictions.
  • Evidence weight: RefSeq 100, each mRNA 2, EST pair 1, mouse ortholog 1, Exoniphy 1.
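
One way to read the weights above as code is sketched below. The function and parameter names are hypothetical, and two points are assumptions rather than something the slide states: RefSeq support is counted once per edge, and EST support contributes a flat 1 once at least two spliced ESTs agree.

```python
def edge_weight(has_refseq, n_mrna, n_spliced_ests,
                has_mouse_ortho, overlaps_exoniphy):
    """Sum the evidence weights for one edge of the splicing graph."""
    weight = 0
    if has_refseq:
        weight += 100          # RefSeq support (assumed counted once)
    weight += 2 * n_mrna       # each supporting mRNA
    if n_spliced_ests >= 2:
        weight += 1            # EST pair (assumed flat, not per pair)
    if has_mouse_ortho:
        weight += 1            # orthologous edge in the mouse graph
    if overlaps_exoniphy:
        weight += 1            # overlaps an Exoniphy prediction
    return weight
```

Under this reading, an edge seen in a single mRNA needs one more line of evidence, such as an EST pair, to reach a weight of 3.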
Walking graph
  • Weight of 3 on an edge is good enough.
  • Single exon gene edges take 4 though.
  • Rank input RNAs by whether they are RefSeq, then by the number of good edges they use.
  • If the first RNA uses any good edges, output a transcript consisting of the edges it uses.
  • Output a transcript based on the next RNA if the good edges it uses have not been output in the same order before.
  • Continue until the last RNA is reached.
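
The walk above might look something like the following sketch. The data layout (each RNA as a dict with `name`, `is_refseq`, and an ordered `edges` list) is hypothetical. It also assumes that a single-exon gene is an RNA with exactly one edge, that "not output before" means the exact same ordered tuple of good edges, and that only the good edges are emitted.

```python
def walk_transcripts(rnas, edge_weights, good=3, good_single_exon=4):
    """Return (rna_name, edge_path) for each unique path of good edges."""

    def good_edges(rna):
        # Single-exon RNAs need the stricter threshold.
        threshold = good_single_exon if len(rna["edges"]) == 1 else good
        return tuple(e for e in rna["edges"]
                     if edge_weights.get(e, 0) >= threshold)

    # RefSeq first, then RNAs that use more good edges.
    ranked = sorted(rnas,
                    key=lambda r: (r["is_refseq"], len(good_edges(r))),
                    reverse=True)

    seen = set()
    output = []
    for rna in ranked:
        path = good_edges(rna)
        if path and path not in seen:
            seen.add(path)
            output.append((rna["name"], path))
    return output
```

Ranking RefSeq RNAs first means that when several inputs trace the same path, the transcript is attributed to the best-supported one.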
Assigning Coding Regions
  • Score each ORF as follows:
    • 1 point for each base in ORF
    • 50 points for initial ATG
    • 100 points if ATG follows Kozak rules
      • G after ATG or A/G 3 bases before
    • -400 points if nonsense mediated decay
      • Last intron more than 55 bases past stop codon
    • -0.5 points for each base in upstream ORF
    • -0.5 points for each base in upstream Kozak ORF
    • +1 point for each base that is also in an ORF in other species
      • Rhesus, mouse, dog
  • Scheme agrees with RefSeq reviewed ~96% of the time.
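
The scoring scheme above can be written out as a small function. This is a sketch under stated assumptions: the names are hypothetical, the Kozak bonus is taken to be additive with the ATG bonus, positions are 0-based transcript coordinates on an uppercase sequence, and the cross-species base count is supplied by the caller as a single total.

```python
def has_kozak(seq, atg_pos):
    """Slide rule: G right after the ATG, or A/G three bases before it."""
    after = seq[atg_pos + 3:atg_pos + 4]
    before = seq[atg_pos - 3:atg_pos - 2] if atg_pos >= 3 else ""
    return after == "G" or before in ("A", "G")


def is_nmd_target(stop_end, last_intron_pos):
    """Slide rule: last intron more than 55 bases past the stop codon."""
    return last_intron_pos - stop_end > 55


def score_orf(orf_len, starts_with_atg, kozak, nmd,
              upstream_orf_bases, upstream_kozak_orf_bases,
              conserved_orf_bases):
    """Combine the per-ORF bonuses and penalties into one score."""
    score = 1.0 * orf_len                     # 1 point per base in ORF
    if starts_with_atg:
        score += 50                           # initial ATG
        if kozak:
            score += 100                      # ATG follows Kozak rules
    if nmd:
        score -= 400                          # nonsense mediated decay
    score -= 0.5 * upstream_orf_bases         # upstream ORF penalty
    score -= 0.5 * upstream_kozak_orf_bases   # upstream Kozak ORF penalty
    score += 1.0 * conserved_orf_bases        # also ORF in other species
    return score
```

The highest-scoring candidate ORF in a transcript would then be taken as its CDS; the large NMD penalty means a long ORF can still lose to a shorter one that ends in the last exon.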
Comparing ORF Finders

Comparison vs. RefSeq reviewed ORF annotations.

*twinOrf only makes a prediction when a homologous sequence is available. This run used dog, so the total only adds up to 97.2%.

Classifying and Weeding
  • The transcripts are classified into:
    • Coding: CDS survives trimming stage
    • Near-coding: overlap coding by at least 20 bases on same strand
    • Antisense: overlap coding by at least 20 bases on opposite strand
    • Noncoding: other transcripts
  • Near-coding transcripts that show signs of incomplete splicing (a retained intron, or an exon that bleeds more than 100 bases into an intron) are removed.
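
The four classes above can be illustrated with a short sketch. The transcript layout (a dict with `chrom`, `strand`, `exons`, and `cds`) is hypothetical. The sketch assumes that overlap is measured in exonic bases against transcripts already classified as coding, and that a same-strand overlap takes precedence over an opposite-strand one.

```python
def exon_overlap(exons_a, exons_b):
    """Total bases shared by two exon lists (exons within a list are disjoint)."""
    return sum(max(0, min(ea, eb) - max(sa, sb))
               for sa, ea in exons_a
               for sb, eb in exons_b)


def classify(tx, coding_txs, min_overlap=20):
    """Label a transcript as coding, nearCoding, antisense, or noncoding."""
    if tx["cds"] is not None:
        return "coding"            # CDS survived the trimming stage
    label = "noncoding"
    for c in coding_txs:
        if c["chrom"] != tx["chrom"]:
            continue
        if exon_overlap(tx["exons"], c["exons"]) >= min_overlap:
            if c["strand"] == tx["strand"]:
                return "nearCoding"    # same strand wins (assumption)
            label = "antisense"
    return label
```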
Take 10 Statistics

RefSeq Statistics
