Genomic Analysis

Flowchart • get genome sequence – genome assembly • find genes • translate genes • all against all, self-comparison • all against all, interproteome • functional classification • synteny analysis • microarrays

Contigs • Sequences are obtained by cloning pieces of DNA into plasmids • One sequencing reaction can resolve only about 800 base pairs at most • Overlapping fragments allow the complete sequence to be deduced

Fragment Assembly package in GCG • This package of programs allows you to input fragment sequences, make the contigs, and then edit the final contigs.

Contigs: the algorithm • First, find regions of overlap that contain a minimum number of identities (sliding window with an identity matrix) • Second, save those overlaps whose identities/overlap ratio meets a threshold criterion (80% in GelMerge)
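The two steps above can be sketched in a few lines of Python. This is only an illustration for ungapped suffix/prefix overlaps, not GelMerge's actual code: the function names and the `min_identities=8` default are assumptions, and only the 80% threshold comes from the slide.

```python
def count_identities(left, right, overlap_len):
    """Count identical positions when the last overlap_len bases of `left`
    are laid over the first overlap_len bases of `right` (no gaps)."""
    tail = left[-overlap_len:]
    head = right[:overlap_len]
    return sum(1 for a, b in zip(tail, head) if a == b)


def candidate_overlaps(left, right, min_identities=8, threshold=0.80):
    """Return (overlap_len, ratio) for every overlap that passes both steps.

    Step 1: the overlap must contain at least min_identities identities
            (min_identities is an assumed, illustrative default).
    Step 2: identities / overlap length must meet the threshold.
    """
    hits = []
    for k in range(1, min(len(left), len(right)) + 1):
        identities = count_identities(left, right, k)
        ratio = identities / k
        if identities >= min_identities and ratio >= threshold:
            hits.append((k, ratio))
    return hits
```

For example, `candidate_overlaps("ACGTACGTTTGACCA", "TTGACCAGGCT", min_identities=5)` keeps only the 7-base overlap `TTGACCA`, whose identity/overlap ratio is 1.0.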

Identity/overlap ratio

To save the overlaps that meet the threshold, we must first align them • This is a global alignment that does not penalize overhanging ends • So F(i,0) = 0 and F(0,j) = 0 (the top row and leftmost column are all 0, so the alignment can start anywhere along the top or left border)

Start the traceback from the maximum value on the right or bottom border: F(max) = max{F(i,m), F(n,j)}, where i runs from 0 to n and j runs from 0 to m
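A minimal sketch of this overlap alignment, assuming a simple scoring scheme (match +1, mismatch -1, linear gap -2); these scores and the function name are illustrative choices, not the values used by GCG. The borders are initialized to 0 and the traceback starts from the best cell on the bottom row or right column, as described above.

```python
def overlap_alignment(x, y, match=1, mismatch=-1, gap=-2):
    """Global alignment with free overhanging ends.

    Returns (score, aligned_x, aligned_y) for the overlapping region only.
    """
    n, m = len(x), len(y)
    # F(i,0) = 0 and F(0,j) = 0: overhangs at the start cost nothing.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align x[i-1] with y[j-1]
                          F[i - 1][j] + gap,     # gap in y
                          F[i][j - 1] + gap)     # gap in x
    # Best cell on the bottom row (i = n) or the right column (j = m).
    border = [(F[n][j], n, j) for j in range(m + 1)]
    border += [(F[i][m], i, m) for i in range(n + 1)]
    best, i, j = max(border)
    # Trace back until the top or left border is reached.
    ax, ay = [], []
    while i > 0 and j > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))
```

Dividing the number of identical columns in the returned alignment by its length gives the identity/overlap ratio that is compared with the threshold.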

GelMerge then aligns the two pieces (contigs) with the longest overlap and assembles a single piece of DNA from that; this process is repeated until there are no remaining overlaps in the fragment database being used
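The merge loop can be illustrated with a greedy sketch. It is a simplification: it uses exact suffix/prefix matches instead of the thresholded alignment, ignores reverse complements, and all names and the `min_len` parameter are hypothetical rather than taken from GelMerge.

```python
def exact_overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair with the longest overlap until none remain."""
    pieces = list(fragments)
    while True:
        best = (0, None, None)
        for i, a in enumerate(pieces):
            for j, b in enumerate(pieces):
                if i != j:
                    k = exact_overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            return pieces  # no remaining overlaps
        merged = pieces[i] + pieces[j][k:]
        pieces = [p for idx, p in enumerate(pieces) if idx not in (i, j)]
        pieces.append(merged)
```

With the fragments `["ATGGCGTA", "GCGTACGTT", "ACGTTAGC"]` the loop performs two merges and returns the single contig `ATGGCGTACGTTAGC`.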

In-class exercise • Open the file called fragments; this contains truncated regions of the file named geneseq. • In the editor, select all the sequences. • Select Functions -->Fragment Assembly--> GelStart; enter a project name and select Begin a new project; select Run

In-class exercise, cont • Go back to Fragment Assembly, select GelEnter; the green GelEnter box should say selected sequences from Editor (make sure all sequences in the Editor are still selected); select Enter the selected sequences from main window; Run

In-class exercise, cont • Go back to Fragment Assembly again; select GelMerge; Run. • Go back to Fragment Assembly again; select GelView; Run • Now go back and look at options, especially in the GelMerge program; try changing them and seeing what happens.

Genome project programs • PHRED: analyzes raw sequence to produce a 'base call' with an associated 'quality score' for each sequence position • Phred scores are reported as q = -10*log10(p), where p is the probability that the base call is wrong • The error probability at q = 20 (1 in 100) is 10x the error probability at q = 30 (1 in 1,000) • PHRAP: assembles raw sequence reads into contigs and assigns each contig position a quality score based on the Phred scores of the underlying reads (same scale as Phred).
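The Phred scale is the logarithmic relation q = -10*log10(p). A short worked example follows; the function names are illustrative and not part of PHRED itself.

```python
import math


def phred_quality(p_error):
    """Convert an error probability into a Phred quality score."""
    return -10 * math.log10(p_error)


def error_probability(q):
    """Convert a Phred quality score back into an error probability."""
    return 10 ** (-q / 10)


# A base call with q = 20 is ten times more likely to be wrong
# than a base call with q = 30.
ratio = error_probability(20) / error_probability(30)
```

Each increase of 10 in q therefore corresponds to a tenfold drop in the chance that the base call is wrong.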

GigAssembler: merges the information from individual sequenced clones into a draft genome sequence.

Chromosomal Map from Mycobacterium tuberculosis (TIGR)

Gene and regulatory region finding • Sequencing a million base pairs is relatively easy • Identifying open reading frames in a million base pairs of eukaryotic sequence is quite difficult (because of introns and other intervening sequences) • Identifying regulatory sequences is very difficult: such sequences are short and can be separated from the ORF by 50,000 base pairs

Gene finding by similarity • Screen genomic sequence against known cDNA sequences in database; if you find a significant match, that’s probably an ORF! (usual first step with genomic sequence) • This will miss lots of genes ...

Genomic DNA BLAST results • Input: genomic DNA fragment from E. coli • BLASTX of nr protein database at NCBI • Output follows

This is a pretty trivial example, but you can see how this works for actual unknown genome sequences

Major methods of gene finding • Pattern discrimination • Find metrics that correlate with usage in coding regions • Generate way to separate coding/noncoding regions according to that metric • Others (HMM, neural net, genetic algorithm, …)

ORF patterns • 7 major metrics: • Frame bias: find the frame that matches codon bias of that organism • Fickett algorithm: amalgam of several tests involving 3-periodicity of query DNA vs. known 3-periodicity of known coding DNA; and also overall base composition

Fractal dimension: common codons clustered with common codons, or uncommon with uncommon, has low fractal dimension, which is typical of exons • Coding 6-tuple word preferences: compare occurrence to known coding vs noncoding regions in database • Coding 6-tuple in-frame preferences: compare occurrence to known in-frame vs. out-of-frame preferences • Word commonality: exons use rare, introns use common 6-tuples • Repetitive 6-tuple preferences

Each of these metrics by itself is not very good at predicting ORF’s; integrating all this information is much more likely to be successful • Such integration is species specific, and also somewhat regionally specific within species; nonetheless very useful
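As an illustration of one of the metrics above, frame bias, the sketch below learns codon frequencies from a known coding sequence and then scores the three forward reading frames of a query. The training string, the pseudocount, and the function names are assumptions chosen for the example; real tools use organism-specific codon tables and combine several metrics.

```python
import math
from collections import Counter


def codons(seq, frame):
    """Split seq into complete codons starting at offset `frame` (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def train_codon_log_freq(coding_seq):
    """Return a function giving the log frequency of a codon, estimated from
    an in-frame coding sequence with a pseudocount of 1 per possible codon."""
    counts = Counter(codons(coding_seq, 0))
    total = sum(counts.values())
    return lambda codon: math.log((counts[codon] + 1) / (total + 64))


def best_frame(seq, log_freq):
    """Score each forward frame by mean codon log frequency; pick the best."""
    scores = {}
    for frame in range(3):
        cs = codons(seq, frame)
        scores[frame] = sum(log_freq(c) for c in cs) / len(cs)
    return max(scores, key=scores.get), scores


# Toy data: the "organism" prefers the codons GCT, AAA and GAA.
log_freq = train_codon_log_freq("GCTAAAGAAGCTAAAGAA" * 5)
frame, scores = best_frame("C" + "GCTGAAAAAGCTGAA", log_freq)
```

Here the query is offset by one base, so frame 1 is the only frame made up entirely of preferred codons and receives the highest score.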

Gene prediction in prokaryotes (and yeast) • Little intergenic DNA, lack of introns, and highly conserved regulatory region patterns make gene prediction easier in prokaryotes • Markov models (GeneMark) and hidden Markov models (GeneMark.hmm) work because predictable patterns give reasonable estimates of the probabilities for transitions between coding and non-coding regions
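To make the Markov-model idea concrete, here is a deliberately simplified sketch: a first-order, homogeneous Markov chain is trained on example coding and non-coding sequences, and a query is scored by its log-odds under the two models. The models used by GeneMark are more elaborate (higher order and frame dependent); the toy training strings and function names are assumptions.

```python
import math


def train_markov(seqs, alphabet="ACGT", pseudo=1.0):
    """Estimate first-order transition probabilities P(next | current)."""
    counts = {a: {b: pseudo for b in alphabet} for a in alphabet}
    for s in seqs:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1
    return {a: {b: counts[a][b] / sum(counts[a].values()) for b in alphabet}
            for a in alphabet}


def log_odds(seq, coding, noncoding):
    """Positive values favor the coding model, negative the non-coding one."""
    return sum(math.log(coding[a][b] / noncoding[a][b])
               for a, b in zip(seq, seq[1:]))


# Toy training sets (made up for illustration only).
coding_model = train_markov(["GCGCGCGCGC"])
noncoding_model = train_markov(["ATATATATAT"])
```

Sliding such a score along a genome gives a profile in which coding stretches stand out as runs of positive log-odds.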

In-class exercise: GeneMark and GeneMark.hmm • Go to the GeneMark website http://opal.biology.gatech.edu/GeneMark/ • Use a text editor to open the ecoli_lac_operon.txt file (Troy: local guest directory; Hartford: my directory); this contains genomic sequence from E. coli • Use the GeneMark webserver to get predicted ORFs using both GeneMark and GeneMark.hmm • Compare the outputs; how would you find out if these ORFs correspond to your results from exercise I?

Glimmer • Higher-order Markov models • Instead of looking at just the previous state, use information from the previous n states (e.g., 5th order) • Interpolated Markov models = IMMs • Incorporate the highest-order information possible that preserves statistical discrimination • Glimmer is TIGR's main gene-finding tool

Gene finding in eukaryotes • Significant intergenic DNA, less conserved patterns for regulatory regions, significant numbers of introns, more complicated chromosome structure • Gene finding in eukaryotes significantly more difficult than prokaryotes

Neural Net • Attempts to mimic neural patterns of learning • Set up network of inputs that give outputs only if threshold is reached (like neurons); thresholds can be reached in a variety of different ways • The network is a set of “hidden layers” that provide the information for the final output

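Simple neural net [figure: two sensors feed a single node, which produces the output] • The node might give output only if both sensors are +; or only if both are -; or only if one is + and one is -

More complex neural net [figure: hidden layers sit between the inputs and the output]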
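The two slides above can be illustrated with threshold nodes. In this sketch a sensor reading of + is encoded as 1 and - as 0; the weights and thresholds are hand-picked assumptions rather than trained values. The "one +, one -" case cannot be produced by a single threshold node of this kind, so it is built from a small hidden layer.

```python
def node(inputs, weights, threshold):
    """Fire (return 1) only if the weighted sum of inputs reaches the threshold."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0


def both_plus(a, b):
    """Single node that fires only if both sensors are +."""
    return node((a, b), (1, 1), 2)


def both_minus(a, b):
    """Single node that fires only if both sensors are -."""
    return node((a, b), (-1, -1), 0)


def exactly_one_plus(a, b):
    """Two hidden nodes feed an output node: fires if one is + and one is -."""
    either = node((a, b), (1, 1), 1)   # hidden node: at least one sensor is +
    both = node((a, b), (1, 1), 2)     # hidden node: both sensors are +
    return node((either, both), (1, -1), 1)
```

Training a network amounts to adjusting such weights and thresholds automatically until the outputs match a set of examples with known answers.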

Construct network of nodes and connections • Train on sequences with known properties; adjust weights for connections to optimize for desired outcome on training set • GRAIL works by using 7 algorithms in a neural net trained on a large set of human sequences with known coding and noncoding regions • GRAIL won’t work for every human sequence; won’t necessarily work for non-human sequences; nonetheless, works quite well

Exercise • Human genomic DNA • Use GRAIL EXP to find exons • Compare to GeneMark.hmm

Bayesian methods: use comparison of sequences from fairly close species (mouse and human) -- look for regions that align, ignore the rest • Based on the idea that those regions that are conserved are likely to be coding or regulatory regions; those that are not conserved are likely not to be

Regulatory region finding • Again use comparison but this time look in regions outside open reading frame • This has been done successfully using Bayesian methods

All-against-all self-comparison of proteome • Translate all identified ORFs • BLAST each translated ORF against all other translated ORFs within that proteome • Identify paralogs = separate genes that arose by duplication • Identify gene families

All-against-all interproteome comparison • Like self-comparison, only between organisms • Identify orthologs = genes in different species that descend from a single ancestral gene and usually keep the same function • Identify gene families • Identify conserved domains
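One common way to turn an all-against-all interproteome comparison into candidate orthologs is the reciprocal best hit rule. The sketch below is an assumption-laden toy: it replaces BLAST with the similarity ratio from Python's standard `difflib` module, and the protein names and sequences are invented for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in for a BLAST score: similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a, b).ratio()


def best_hits(queries, targets):
    """For each query protein, the name of its most similar target protein."""
    return {q: max(targets, key=lambda t: similarity(seq, targets[t]))
            for q, seq in queries.items()}


def reciprocal_best_hits(proteome_a, proteome_b):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    a_to_b = best_hits(proteome_a, proteome_b)
    b_to_a = best_hits(proteome_b, proteome_a)
    return sorted((a, b) for a, b in a_to_b.items() if b_to_a[b] == a)


# Invented toy proteomes.
proteome_a = {"a1": "MKTAYIAKQR", "a2": "MSLLPVDGHE"}
proteome_b = {"b1": "MKTAYIAKQK", "b2": "MSLLPVEGHE", "b3": "WWWWCCCC"}
pairs = reciprocal_best_hits(proteome_a, proteome_b)
```

Running the same comparison within a single proteome, while excluding self-hits, gives candidate paralogs instead.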

Functional classification • Useful as a precursor to data mining for finding genes related by function, etc.

Synteny analysis • The arrangement of genes (ORFs) on a chromosome is preserved to a greater or lesser extent depending on the relatedness of the organisms • Computational analysis of synteny is very similar to sequence alignment methods • Isochores = “long regions of homogeneous base composition” • 1M base pairs • GC content is uniform throughout (the GC content of any sliding window differs from the overall GC content of the isochore by no more than 1%) • H = high density, rich in genes • L = low density, poor in genes
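The isochore criterion above can be checked with a sliding-window GC calculation. This sketch uses non-overlapping windows and tiny toy sequences for brevity; the function names and the window size are assumptions, and only the 1% tolerance comes from the slide.

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    return sum(1 for base in seq if base in "GC") / len(seq)


def is_homogeneous(seq, window, tolerance=0.01):
    """True if every window's GC content is within `tolerance` of the overall
    GC content. Also returns the per-window deviations."""
    overall = gc_content(seq)
    deviations = [abs(gc_content(seq[i:i + window]) - overall)
                  for i in range(0, len(seq) - window + 1, window)]
    return all(d <= tolerance for d in deviations), deviations


uniform, _ = is_homogeneous("ACGT" * 25, window=20)
mixed, _ = is_homogeneous("AT" * 20 + "GC" * 20, window=20)
```

The first toy sequence has 50% GC in every window, so it passes; the second has the same overall GC content but is split into an AT-only half and a GC-only half, so it fails.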

Global gene regulation • Microarray analysis • Beyond scope of this class • See discussion in text

Other molecular biology applications • PCR primer finding • How do you think this algorithm works? • Restriction enzyme mapping • How do you think this algorithm works?
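One plausible answer for restriction enzyme mapping is a plain string search for the enzyme's recognition site, followed by cutting at a fixed offset within each site. The sketch below uses EcoRI (site GAATTC, cut after the first G) on a linear toy sequence; the function names are illustrative, and real tools also handle circular DNA and degenerate sites.

```python
def find_sites(seq, site):
    """Start positions (0-based) of every occurrence of `site` in `seq`."""
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions


def digest_linear(seq, site, cut_offset):
    """Fragments produced by cutting a linear sequence `cut_offset` bases
    into each occurrence of `site`."""
    cuts = [p + cut_offset for p in find_sites(seq, site)]
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


# EcoRI recognizes GAATTC and cuts between G and AATTC.
fragments = digest_linear("AAGAATTCTTTTGAATTCCC", "GAATTC", cut_offset=1)
```

The fragment lengths from such a digest are what a restriction map predicts for the bands on a gel.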
