
Genomic Analysis


zenia-chang


Presentation Transcript


  1. Genomic Analysis

  2. Flowchart • get genome sequence – genome assembly • find genes • translate genes • all against all, self-comparison • all against all, interproteome • functional classification • synteny analysis • microarrays

  3. Contigs • Sequences are obtained from pieces of DNA cloned into plasmids • One sequencing reaction can resolve at most about 800 base pairs • Overlapping fragments allow deduction of the complete sequence

  4. Fragment Assembly package in GCG • This package of programs allows you to input fragment sequences, make the contigs, and then edit the final contigs.

  5. Contigs: the algorithm • First, find regions of overlap that contain a minimum number of identities (sliding window with an identity matrix) • Second, save those overlaps whose identities/overlap ratio meets a threshold criterion (80% in GelMerge)

  6. Identity/overlap ratio
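
As a rough illustration of the identity/overlap ratio, the sketch below scans ungapped overlaps between the end of one fragment and the start of another and keeps the best one that passes both tests. It is a simplified stand-in, not GelMerge's actual code: the function name `best_overlap` and the default `min_identities` value are assumptions, and only the 80% ratio comes from the slides.

```python
def best_overlap(a, b, min_identities=5, min_ratio=0.8):
    """Best ungapped overlap between the end of a and the start of b.

    Returns (overlap_length, identities, ratio), or None if no overlap
    passes both thresholds. Illustrative only; thresholds are assumptions.
    """
    best = None
    for length in range(1, min(len(a), len(b)) + 1):
        suffix, prefix = a[-length:], b[:length]
        identities = sum(x == y for x, y in zip(suffix, prefix))
        ratio = identities / length  # the identity/overlap ratio
        if identities >= min_identities and ratio >= min_ratio:
            if best is None or identities > best[1]:
                best = (length, identities, ratio)
    return best


frag1 = "ACGTACGTTTGACCAGT"
frag2 = "TTGACCAGTGGCATCA"
print(best_overlap(frag1, frag2))
```

Raising `min_ratio` makes the merge stricter; lowering it tolerates more sequencing errors in the overlap at the cost of more false joins.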

  7. To save the overlaps that meet the threshold, we must align them • This is a global alignment that does not penalize overhanging ends • So F(i,0) = 0 and F(0,j) = 0 (top row and leftmost column are all 0 so we can start anyplace along the top or left border)

  8. Start the traceback from the maximum value on the right or bottom border: F(max) is the largest of the values F(i,m) (right border) and F(n,j) (bottom border)
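
The border rules on slides 7 and 8 can be sketched as a small dynamic-programming routine. This is a minimal sketch under stated assumptions: the match, mismatch, and gap values are illustrative choices, the function name `overlap_alignment` is hypothetical, and only the score and the traceback starting cell are returned, not the alignment itself.

```python
def overlap_alignment(x, y, match=1, mismatch=-1, gap=-2):
    """Global alignment with free overhanging ends.

    Returns (score, i, j): the best border score and the cell where the
    traceback would start. Scoring values are assumptions.
    """
    n, m = len(x), len(y)
    # F(i,0) = 0 and F(0,j) = 0: the alignment may start anywhere
    # along the top or left border without penalty.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align x[i-1] with y[j-1]
                          F[i - 1][j] + gap,     # gap in y
                          F[i][j - 1] + gap)     # gap in x
    # The traceback starts at the maximum on the bottom row F(n,j)
    # or the right column F(i,m).
    border = [(F[n][j], n, j) for j in range(m + 1)]
    border += [(F[i][m], i, m) for i in range(n + 1)]
    return max(border)


score, i, j = overlap_alignment("ACGTACGTTTGACCAGT", "TTGACCAGTGGCATCA")
print(score, i, j)
```

Because the first row and column are zero and the end point is taken from the border, unaligned ends on either fragment cost nothing, which is what allows two partially overlapping reads to be joined.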

  9. GelMerge then aligns the two pieces (contigs) with the longest overlap and assembles a single piece of DNA from that; this process is repeated until there are no remaining overlaps in the fragment database being used

  10. In-class exercise • Open the file called fragments; this contains truncated regions of the file named geneseq. • In the editor, select all the sequences. • Select Functions --> Fragment Assembly --> GelStart; enter a project name and select Begin a new project; select Run

  11. In-class exercise, cont • Go back to Fragment Assembly, select GelEnter; in the green GelEnter box it should say selected sequences from Editor (make sure all sequences in Editor are still selected); select Enter the selected sequences from main window; Run

  12. In-class exercise, cont • Go back to Fragment Assembly again; select GelMerge; Run. • Go back to Fragment Assembly again; select GelView; Run • Now go back and look at options, especially in the GelMerge program; try changing them and seeing what happens.

  13. Genome project programs • PHRED: analyses raw sequence to produce a ‘base call’ with an associated ‘quality score’ for each sequence position • Phred scores are reported as q = -10*log10(p), where p is the probability of the base call being wrong • The error probability at q of 20 is 10x that at q of 30 • PHRAP: assembles raw sequence into sequence contigs and assigns each position in the sequence a ‘quality score’ based on the Phred scores of the raw sequence reads (same scale as Phred).
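
A short worked example may help with the Phred scale. It uses the standard definition q = -10*log10(p); the function names are illustrative and are not part of PHRED or PHRAP.

```python
import math


def phred_quality(p_error):
    """Quality score q for a base-call error probability p."""
    return -10 * math.log10(p_error)


def error_probability(q):
    """Error probability p for a quality score q (inverse of the above)."""
    return 10 ** (-q / 10)


for q in (10, 20, 30, 40):
    print(q, error_probability(q))

# Each step of 10 in q changes the error probability by a factor of 10.
print(error_probability(20) / error_probability(30))
```

So q of 20 corresponds to 1 wrong call in 100, and q of 30 to 1 wrong call in 1000.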

  14. GigAssembler: merges the information from individual sequenced clones into a draft genome sequence.

  15. Chromosomal Map from Mycobacterium tuberculosis (TIGR)

  16. Gene and regulatory region finding • Sequencing a million base pairs is relatively easy • Identifying open reading frames (eukaryotic) in that million base pairs is quite difficult (because of intervening sequences, introns, etc.) • Identifying regulatory sequences is very difficult – such sequences are short, and can be separated from the ORF by 50,000 base pairs

  17. Gene finding by similarity • Screen genomic sequence against known cDNA sequences in database; if you find a significant match, that’s probably an ORF! (usual first step with genomic sequence) • This will miss lots of genes ...

  18. Genomic DNA BLAST results • Input: genomic DNA fragment from E. coli • BLASTX against the nr protein database at NCBI • Output follows

  19. This is a pretty trivial example, but you can see how this works for actual unknown genome sequences

  20. Major methods of gene finding • Pattern discrimination • Find metrics that correlate with usage in coding regions • Generate a way to separate coding/noncoding regions according to that metric • Others (HMM, neural net, genetic algorithm, …)

  21. ORF patterns • 7 major metrics: • Frame bias: find the frame that matches codon bias of that organism • Fickett algorithm: amalgam of several tests involving 3-periodicity of query DNA vs. known 3-periodicity of known coding DNA; and also overall base composition

  22. Fractal dimension: common codons clustered with common codons, or uncommon with uncommon, has low fractal dimension, which is typical of exons • Coding 6-tuple word preferences: compare occurrence to known coding vs noncoding regions in database • Coding 6-tuple in-frame preferences: compare occurrence to known in-frame vs. out-of-frame preferences • Word commonality: exons use rare, introns use common 6-tuples • Repetitive 6-tuple preferences
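
The 6-tuple word preference idea can be sketched as a log-odds score: count how often each 6-tuple occurs in known coding and known noncoding sequence, then score a query by summing the log ratios. This is a minimal sketch, not the scoring used by any particular gene finder; the function names, the pseudocount, and the toy training sequences are all assumptions for illustration.

```python
import math
from collections import Counter


def hexamer_counts(seqs):
    """Count every overlapping 6-tuple in a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - 5):
            counts[s[i:i + 6]] += 1
    return counts


def log_odds_score(query, coding, noncoding, pseudocount=1.0):
    """Sum of log(P(word | coding) / P(word | noncoding)) over the query.

    A pseudocount is added to every one of the 4**6 possible words so that
    unseen words do not produce a division by zero.
    """
    coding_total = sum(coding.values()) + pseudocount * 4 ** 6
    noncoding_total = sum(noncoding.values()) + pseudocount * 4 ** 6
    score = 0.0
    for i in range(len(query) - 5):
        word = query[i:i + 6]
        p_coding = (coding[word] + pseudocount) / coding_total
        p_noncoding = (noncoding[word] + pseudocount) / noncoding_total
        score += math.log(p_coding / p_noncoding)
    return score


# Toy training data (hypothetical, far too small for real use).
coding = hexamer_counts(["ATGGCTGCTGCTGCTGCT"])
noncoding = hexamer_counts(["TATATATATATATATATA"])
print(log_odds_score("GCTGCTGCT", coding, noncoding))   # positive: coding-like
print(log_odds_score("TATATATA", coding, noncoding))    # negative: noncoding-like
```

The in-frame variant on the same slide would keep three separate count tables, one per codon position of the word's first base, instead of a single table.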

  23. Each of these metrics by itself is not very good at predicting ORF’s; integrating all this information is much more likely to be successful • Such integration is species specific, and also somewhat regionally specific within species; nonetheless very useful

  24. Gene prediction in prokaryotes (and yeast) • Little intergenic DNA, lack of introns, highly conserved regulatory region patterns make gene prediction easier in prokaryotes • Markov models (GeneMark) and hidden Markov models (GeneMark.hmm) work because predictable patterns give reasonable estimates of probabilities for transitions between coding and non-coding regions

  25. In class exercise: GeneMark and GeneMark.hmm • Go to GeneMark website http://opal.biology.gatech.edu/GeneMark/ • Use text editor to open ecoli_lac_operon.txt file (Troy: local guest directory; Hartford: my directory); this contains genomic sequence from E. coli • Use GeneMark webserver to get predicted ORFs using both GeneMark and GeneMark.hmm • Compare outputs; how would you find out if these ORFs correspond to your results from exercise I?

  26. Glimmer • Higher order HMM’s • Instead of looking at just the previous state, use information from the previous n states (e.g., 5th order) • Interpolated HMM’s = IMM’s • Incorporate highest-order information possible that preserves statistical discrimination • Glimmer is TIGR’s main gene finding tool

  27. Gene finding in eukaryotes • Significant intergenic DNA, less conserved patterns for regulatory regions, significant numbers of introns, more complicated chromosome structure • Gene finding in eukaryotes significantly more difficult than prokaryotes

  28. Neural Net • Attempts to mimic neural patterns of learning • Set up network of inputs that give outputs only if threshold is reached (like neurons); thresholds can be reached in a variety of different ways • The network is a set of “hidden layers” that provide the information for the final output

  29. Simple neural net (figure: two sensors feed a node, which gives the output) • Node might only give output if both sensors +; or only if both -; or only if one +, one -

  30. More complex neural net (figure labels: output, hidden net layers)
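
The three behaviours listed for the simple net can be sketched with threshold nodes. Here a sensor reading of + is encoded as 1 and - as 0; the weights and thresholds are hand-picked for illustration rather than learned, and all names are hypothetical. The "one +, one -" case cannot be produced by a single threshold node, so the sketch uses one hidden layer for it.

```python
def node(inputs, weights, threshold):
    """Give output 1 only if the weighted input sum reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0


def both_plus(a, b):
    """Fires only if both sensors are +."""
    return node((a, b), (1, 1), 2)


def both_minus(a, b):
    """Fires only if both sensors are -."""
    return node((a, b), (-1, -1), 0)


def one_of_each(a, b):
    """Fires only if one sensor is + and the other is -.

    Needs a hidden layer: h1 detects 'at least one +', h2 detects 'both +',
    and the output node fires when h1 is on and h2 is off.
    """
    h1 = node((a, b), (1, 1), 1)
    h2 = both_plus(a, b)
    return node((h1, h2), (1, -1), 1)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, both_plus(a, b), both_minus(a, b), one_of_each(a, b))
```

In a trained network the weights and thresholds would be adjusted from examples instead of being set by hand.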

  31. Construct network of nodes and connections • Train on sequences with known properties; adjust weights for connections to optimize for desired outcome on training set • GRAIL works by using 7 algorithms in a neural net trained on a large set of human sequences with known coding and noncoding regions • GRAIL won’t work for every human sequence; won’t necessarily work for non-human sequences; nonetheless, works quite well

  32. Exercise • Human genomic DNA • Use GRAIL EXP to find exons • Compare to GeneMark.hmm

  33. Bayesian methods: use comparison of sequences from fairly close species (mouse and human) -- look for regions that align, ignore the rest • Based on the idea that those regions that are conserved are likely to be coding or regulatory regions; those that are not conserved are likely not to be

  34. Regulatory region finding • Again use comparison but this time look in regions outside open reading frame • This has been done successfully using Bayesian methods

  35. All-against-all self-comparison of proteome • Translate all identified ORFs • BLAST each translated ORF against all other translated ORFs within that proteome • Identify paralogs = separate genes that arose by duplication • Identify gene families

  36. All-against-all interproteome comparison • Like self comparison, only between organisms • Identify orthologs = genes with same function conserved between species • Identify gene families • Identify conserved domains
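
One common heuristic for picking candidate orthologs out of an all-against-all comparison is the reciprocal best hit: protein a in organism A and protein b in organism B are paired if each is the other's top-scoring match. The slides do not name a specific method, so this is only one possible approach; the score tables below are made-up numbers standing in for BLAST results, and the function names are hypothetical.

```python
def best_hits(scores):
    """Map each query to its top-scoring subject.

    scores: dict mapping (query, subject) -> similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical scores, for illustration only.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 50,
          ("a2", "b2"): 80, ("a3", "b1"): 120}
b_vs_a = {("b1", "a1"): 190, ("b1", "a3"): 110, ("b2", "a2"): 75}
print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

Here a3 also matches b1, but b1's best match is a1, so a3 is left unpaired; such extra hits are often candidates for paralogs or gene family members.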

  37. Functional classification • Useful as a precursor to data mining for finding genes related by function, etc.

  38. Synteny analysis • Arrangement of genes (ORFs) on a chromosome is preserved to a greater or lesser extent depending on the relatedness of the organisms • Computational analysis of synteny very similar to sequence alignment methods • Isochores = “long regions of homogeneous base composition” • 1M base pairs • GC content uniform throughout (GC content of any sliding window would differ by no more than 1% from the overall GC content of the isochore) • H = high density – rich in genes • L = low density – poor in genes
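
The GC uniformity criterion for isochores can be sketched as a sliding-window check. This sketch assumes the 1% figure means one percentage point of absolute GC content (a tolerance of 0.01); the window and step sizes in the example are toy values, and the function names are illustrative.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def window_gc_deviations(seq, window, step):
    """(start, window GC minus overall GC) for each sliding window."""
    overall = gc_content(seq)
    return [(start, gc_content(seq[start:start + window]) - overall)
            for start in range(0, len(seq) - window + 1, step)]


def is_homogeneous(seq, window, step, tolerance=0.01):
    """True if every window's GC content is within tolerance of the overall."""
    return all(abs(dev) <= tolerance
               for _, dev in window_gc_deviations(seq, window, step))


uniform = "ACGT" * 50                 # GC is 0.5 in every window
patchy = "AT" * 50 + "GC" * 50        # GC jumps from 0.0 to 1.0
print(is_homogeneous(uniform, window=20, step=20))
print(is_homogeneous(patchy, window=20, step=20))
```

On real chromosomes the window would be far larger than in this toy example, since short windows fluctuate in GC content by chance alone.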

  39. Global gene regulation • Microarray analysis • Beyond scope of this class • See discussion in text

  40. Other molecular biology applications • PCR primer finding • How do you think this algorithm works? • Restriction enzyme mapping • How do you think this algorithm works?
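
As one possible answer to the restriction mapping question, the simplest approach is a string search for the enzyme's recognition site followed by computing the fragment lengths between cut positions. The sketch assumes a linear molecule and uses EcoRI (site GAATTC, cutting after the first G) as the example; the function names and the test sequence are made up for illustration.

```python
def find_sites(seq, site):
    """Start positions (0-based) of every occurrence of a recognition site."""
    seq, site = seq.upper(), site.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)  # allow overlapping occurrences
    return positions


def fragment_lengths(seq, site, cut_offset):
    """Fragment lengths after digesting a linear sequence.

    cut_offset is the number of bases into the site where the enzyme cuts.
    """
    cuts = [p + cut_offset for p in find_sites(seq, site)]
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


dna = "AAGAATTCTTTTGAATTCGG"
print(find_sites(dna, "GAATTC"))
print(fragment_lengths(dna, "GAATTC", cut_offset=1))
```

Because the EcoRI site is palindromic, scanning one strand is enough; a non-palindromic site would also need a search for its reverse complement.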
