BIOINFORMATICS AND GENE DISCOVERY

University of North Carolina at Chapel Hill, Bioinformatics Tutorials
Iosif Vaisman, 1998

From genes to proteins

DNA (promoter elements) -> transcription -> RNA (splice sites) -> splicing -> mRNA (start codon, stop codon) -> translation -> protein


Comparative Sequence Sizes (bases)
• Yeast chromosome 3: 350,000
• Escherichia coli (bacterium) genome: 4,600,000
• Largest yeast chromosome now mapped: 5,800,000
• Entire yeast genome: 15,000,000
• Smallest human chromosome (Y): 50,000,000
• Largest human chromosome (1): 250,000,000
• Entire human genome: 3,000,000,000

Low-resolution physical map of chromosome 19

Chromosome 19 gene map

Computational Gene Prediction
• Where are genes unlikely to be located?
• How do transcription factors know where to bind a region of DNA?
• Where are the transcription, splicing, and translation start and stop signals?
• What does a coding region do that non-coding regions do not?
• Can we learn from examples?
• Does this sequence look familiar?

Artificial Intelligence in Biosciences
• Neural Networks (NN)
• Genetic Algorithms (GA)
• Hidden Markov Models (HMM)
• Stochastic context-free grammars (SCFG)

Information Theory
• One choice between two equally likely alternatives (0 or 1): 1 bit
• One choice among four equally likely alternatives (00, 01, 10, 11): 1 bit + 1 bit = 2 bits
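The bit counts on the Information Theory slides can be checked numerically. The sketch below is not from the original slides; it assumes the alternatives are equally likely, and the function names `bits_for_choices` and `shannon_entropy` are made up for illustration.

```python
import math


def bits_for_choices(n_equally_likely):
    """Bits needed to pick one of n equally likely alternatives: log2(n)."""
    return math.log2(n_equally_likely)


def shannon_entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


print(bits_for_choices(2))                        # 1.0  (0 or 1)
print(bits_for_choices(4))                        # 2.0  (00, 01, 10, 11)
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
```

For DNA, four equally likely nucleotides therefore carry 2 bits per position; any bias in nucleotide usage lowers that figure.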

Scientific Models: physical models, mathematical models
• Mechanistic models: mechanism
• Stochastic models: black box
• Hidden Markov models: stochastic mechanism
• Qualities compared: predictive power, elegance, consistency

Neural Networks
• An interconnected assembly of simple processing elements (units or nodes)
• Node functionality is similar to that of the animal neuron
• Processing ability is stored in the inter-unit connection strengths (weights)
• Weights are obtained by a process of adaptation to, or learning from, a set of training patterns
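To make "weights are learned from training patterns" concrete, here is a minimal sketch of a single threshold unit trained with the classic perceptron rule. It is an illustration only, not a method taken from the slides; the logical-AND training set and the names `train_perceptron` and `predict` are assumptions.

```python
def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Adapt the weights of one threshold unit to a set of (inputs, target) patterns."""
    n_inputs = len(samples[0][0])
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            activation = sum(w * x for w, x in zip(weights, inputs)) + bias
            output = 1 if activation > 0 else 0
            error = target - output  # -1, 0, or +1
            # Perceptron rule: nudge each weight in proportion to its input.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


def predict(weights, bias, inputs):
    """Fire (1) when the weighted sum of the inputs exceeds zero."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0


# Hypothetical training patterns: logical AND of two binary inputs.
AND_PATTERNS = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(AND_PATTERNS)
print([predict(weights, bias, x) for x, _ in AND_PATTERNS])  # [0, 0, 0, 1]
```

Everything the unit "knows" after training is held in `weights` and `bias`. Practical gene-finding networks use many such units arranged in layers and gradient-based training rather than this single-unit rule.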

Genetic Algorithms
Search or optimization methods using simulated evolution. A population of potential solutions is subjected to natural selection, crossover, and mutation.
    choose initial population
    evaluate each individual's fitness
    repeat
        select individuals to reproduce
        mate pairs at random
        apply crossover operator
        apply mutation operator
        evaluate each individual's fitness
    until terminating condition

Crossover: Parent A and Parent B are cut at a crossover point and recombined into Child AB and Child BA
Mutation
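The genetic algorithm loop, with single-point crossover and mutation, can be sketched as follows. This is a toy illustration under stated assumptions, not the slides' implementation: the fitness function (count of 1 bits), the tournament selection, the fixed generation count used as the terminating condition, and all parameter values are hypothetical choices.

```python
import random


def fitness(individual):
    """Toy objective (an assumption): the number of 1 bits in the string."""
    return sum(individual)


def crossover(parent_a, parent_b, point):
    """Single-point crossover: swap the tails of the two parents."""
    child_ab = parent_a[:point] + parent_b[point:]
    child_ba = parent_b[:point] + parent_a[point:]
    return child_ab, child_ba


def mutate(individual, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in individual]


def select(population, rng):
    """Tournament of two: the fitter of two random individuals reproduces."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b


def evolve(length=20, pop_size=30, generations=40, mutation_rate=0.01, seed=0):
    rng = random.Random(seed)
    # Choose the initial population at random.
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    # Terminating condition here: a fixed number of generations.
    for _ in range(generations):
        next_population = []
        while len(next_population) < pop_size:
            parent_a = select(population, rng)
            parent_b = select(population, rng)
            point = rng.randint(1, length - 1)
            for child in crossover(parent_a, parent_b, point):
                next_population.append(mutate(child, mutation_rate, rng))
        population = next_population[:pop_size]
    return max(population, key=fitness)


best = evolve()
print(fitness(best), best)
```

Fitness is evaluated inside `select` and in the final `max`, which plays the role of the "evaluate each individual's fitness" steps in the loop.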

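Markov Model (or Markov Chain)
Example sequence: A T C T A G
• The probability of each character depends only on a fixed number of preceding characters in the sequence
• Number of preceding characters = order of the Markov model
• Probability of the sequence under a second-order model, where P[x,y,z] is the probability of z given the preceding characters x,y:
P(s) = P[A] P[A,T] P[A,T,C] P[T,C,T] P[C,T,A] P[T,A,G]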
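The product of conditional probabilities above can be computed as in the following sketch. It assumes the model is estimated from character counts in training sequences, with a pseudocount so that unseen contexts do not give zero probability; the functions `train` and `log_probability` and the training data are hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train(sequences, order=2, pseudocount=1.0):
    """Count how often each character follows each context of up to `order` characters."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i, char in enumerate(seq):
            context = seq[max(0, i - order):i]  # shorter at the start of a sequence
            counts[context][char] += 1

    def probability(context, char):
        seen = counts.get(context, {})
        total = sum(seen.values()) + pseudocount * len(ALPHABET)
        return (seen.get(char, 0.0) + pseudocount) / total

    return probability


def log_probability(seq, probability, order=2):
    """log P(s): the sum of the log conditional probabilities of each character."""
    return sum(
        math.log(probability(seq[max(0, i - order):i], char))
        for i, char in enumerate(seq)
    )


# Hypothetical training data: five copies of the example sequence.
model = train(["ATCTAG"] * 5)
print(log_probability("ATCTAG", model))  # terms P[A], P[A,T], P[A,T,C], ...
print(log_probability("GGGGGG", model))  # a lower (more negative) value
```

Working with logarithms avoids numerical underflow when the sequence is long, since the raw product of many small probabilities quickly rounds to zero.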

Hidden Markov Models
• States: well-defined conditions
• Edges: transitions between the states
• Each transition is assigned a probability
• Example model (figure): A, then T or C, then G or T, then A, then C, which emits ATGAC, ATTAC, ACGAC, or ACTAC
Probability of a sequence:
• single path with the highest probability: the Viterbi path
• sum of the probabilities over all paths: the forward algorithm, which Baum-Welch training also uses

Hidden Markov Model of Biased Coin Tosses
• States (Si): two biased coins {C1, C2}
• Outputs (Oj): two possible outputs {H, T}
• p(Output Oij): p(C1, H), p(C1, T), p(C2, H), p(C2, T)
• Transitions: from state X to Y {A11, A22, A12, A21}
• p(Initial Si): p(I, C1), p(I, C2)
• p(End Si): p(C1, E), p(C2, E)
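The two ways of scoring a sequence with an HMM (best single path versus sum over all paths) can be sketched on the two-coin model. All probability values below are invented for illustration, and the explicit end state is left out to keep the example short; neither choice comes from the slides.

```python
STATES = ("C1", "C2")

# Hypothetical parameters: C1 is a fair coin, C2 is biased towards heads.
INITIAL = {"C1": 0.5, "C2": 0.5}
TRANSITION = {"C1": {"C1": 0.9, "C2": 0.1},
              "C2": {"C1": 0.2, "C2": 0.8}}
EMISSION = {"C1": {"H": 0.5, "T": 0.5},
            "C2": {"H": 0.9, "T": 0.1}}


def forward(observations):
    """Probability of the observations summed over all state paths."""
    alpha = {s: INITIAL[s] * EMISSION[s][observations[0]] for s in STATES}
    for symbol in observations[1:]:
        alpha = {
            s: EMISSION[s][symbol]
            * sum(alpha[prev] * TRANSITION[prev][s] for prev in STATES)
            for s in STATES
        }
    return sum(alpha.values())


def viterbi(observations):
    """Return (probability, path) for the single most probable state path."""
    best = {s: (INITIAL[s] * EMISSION[s][observations[0]], [s]) for s in STATES}
    for symbol in observations[1:]:
        new_best = {}
        for s in STATES:
            # Best way to arrive in state s from any previous state.
            prob, path = max(
                ((best[prev][0] * TRANSITION[prev][s], best[prev][1])
                 for prev in STATES),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * EMISSION[s][symbol], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


tosses = "HHHHTHTT"
print(forward(tosses))
print(viterbi(tosses))
```

The Viterbi probability can never exceed the forward probability, because the best path is just one of the terms in the sum over all paths.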

Hidden Markov Model for Exon and Stop Codon (VEIL Algorithm)

GRAIL gene identification program (figure labels: possible exons, refined exon positions, final exon candidates)

Suboptimal Solutions for the Human Growth Hormone Gene (GeneParser)

Measures of Prediction Accuracy: Nucleotide Level
Each nucleotide is coding (c) or non-coding (nc) in reality and in the prediction:
                  REALITY c    REALITY nc
PREDICTION c      TP           FP
PREDICTION nc     FN           TN
Sensitivity: Sn = TP / (TP + FN)
Specificity: Sp = TP / (TP + FP)
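As a worked example of the nucleotide-level measures, the sketch below counts TP, FP, FN, and TN from two annotation strings and applies the Sn and Sp formulas. The string encoding ('c' for coding, 'n' for non-coding), the function name, and the example annotations are assumptions made for illustration.

```python
def nucleotide_accuracy(actual, predicted):
    """Return (Sn, Sp, counts) from per-nucleotide labels: 'c' coding, 'n' non-coding."""
    if len(actual) != len(predicted):
        raise ValueError("annotations must cover the same nucleotides")
    pairs = list(zip(actual, predicted))
    tp = sum(a == "c" and p == "c" for a, p in pairs)
    fp = sum(a == "n" and p == "c" for a, p in pairs)
    fn = sum(a == "c" and p == "n" for a, p in pairs)
    tn = sum(a == "n" and p == "n" for a, p in pairs)
    # Report 0.0 when a measure is undefined (no coding bases or no predictions).
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp, {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


reality = "nnccccnnnn"     # hypothetical true annotation
prediction = "nncccnnccn"  # hypothetical predicted annotation
sn, sp, counts = nucleotide_accuracy(reality, prediction)
print(counts)  # {'TP': 3, 'FP': 2, 'FN': 1, 'TN': 4}
print(sn, sp)  # 0.75 0.6
```

Note that Sp as defined here, TP / (TP + FP), is the quantity other fields call precision; TN does not enter either formula.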

Measures of Prediction Accuracy: Exon Level
Sensitivity: Sn = number of correct exons / number of actual exons
Specificity: Sp = number of correct exons / number of predicted exons
Figure labels (reality vs. prediction): correct exon, missing exon, wrong exon
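The exon-level measures can be sketched in the same way. This example assumes that an exon counts as correct only when both boundaries match exactly, that a missing exon is an actual exon overlapping no prediction, and that a wrong exon is a predicted exon overlapping no actual exon; the slides do not spell out these definitions, and the coordinates are invented.

```python
def overlaps(a, b):
    """True when two (start, end) intervals with inclusive ends share a position."""
    return a[0] <= b[1] and b[0] <= a[1]


def exon_accuracy(actual, predicted):
    """Return (Sn, Sp, missing_exons, wrong_exons) for lists of (start, end) exons."""
    actual_set, predicted_set = set(actual), set(predicted)
    # Assumption: "correct" means both boundaries match exactly.
    correct = len(actual_set & predicted_set)
    sn = correct / len(actual_set) if actual_set else 0.0
    sp = correct / len(predicted_set) if predicted_set else 0.0
    missing = [a for a in actual if not any(overlaps(a, p) for p in predicted)]
    wrong = [p for p in predicted if not any(overlaps(p, a) for a in actual)]
    return sn, sp, missing, wrong


actual_exons = [(10, 50), (100, 150), (200, 260)]     # hypothetical annotation
predicted_exons = [(10, 50), (100, 140), (300, 320)]  # hypothetical prediction
sn, sp, missing, wrong = exon_accuracy(actual_exons, predicted_exons)
print(sn, sp)   # one exon in three is exactly right on each side
print(missing)  # [(200, 260)]
print(wrong)    # [(300, 320)]
```

The prediction (100, 140) overlaps a real exon but misses its boundary, so it is neither correct nor wrong under these definitions; it lowers both Sn and Sp without being counted in either list.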

GeneMark Accuracy Evaluation

Bibliography
http://linkage.rockefeller.edu/wli/gene/list.html
http://www-hto.usc.edu/software/procrustes/fans_ref/
Gene Discovery Exercise
http://metalab.unc.edu/pharmacy/Bioinfo/Gene
