BioInformatics Consultation Presentation 5 Gábor Pauler, Ph.D. Tax.reg.no: 63673852-3-22

BioInformatics Consultation Presentation 5 Gábor Pauler, Ph.D. Tax.reg.no: 63673852-3-22 Bank account: 50400113-11065546 Location: 1st Széchenyi str., 7666 Pogány, Hungary Tel: +36-309-015-488 E-mail: pauler@t-online.hu

Content of the Presentation
• Gene Search in Eukaryotes
• Hidden Markov Models (HMM):
  • Basic definition
  • STEP1: Weight matrix
  • STEP2: Probability scores
  • STEP3: Gene syntax
  • STEP4: Assembling Bayesian Network
  • STEP5: Markov Chain Models
  • STEP6: Maximum Likelihood
    • Dynamic Programming
    • Genetic Algorithm
  • HMM Software
    • GeneMark
• Codon Usage Statistics:
  • Basic terms
  • Codon Usage optimization
  • Codon Usage Tables
  • Codon Usage in ORFs
  • Frameshift detection
  • Alternative start codon detection
  • Problems in Codon Usage statistics
• Codon Usage software:
  • Codon Usage databases: Kazusa
  • Codon Usage software: EBI, GCUA
• References

Gene Search in Eukaryotes: Hidden Markov Models: Weight matrix, Probability score
• This is a more complex task than simple ORF analysis in Prokaryotes and in cDNA, because here exons and introns must be separated to determine the coding parts of the analyzed sequence:
• Hidden Markov Model, HMM (Rejtett Markov-modell): It can predict the location of genes and their introns/exons in the 6 possible reading frames of an analyzed sequence, using statistical/optimization methods to look for probabilistic signals (Valószínűsíthető jelzések):
  • Donor (Donor) / Acceptor (Fogadó) splice sites,
  • Start and stop codons,
  • Transcription termination sequences,
  • Polyadenylation sites,
  • Ribosome binding sites,
  • Transcription factor binding sites,
  • Elements of the promoter, TATA-box.
• STEP1: Set up a database of Signal Sensor (Jelzés érzékelő) Weight Matrices (Súlymátrixok):
  • A weight matrix is a statistical summary of the probability of occurrence of the A, T, C, G bases at each position of a signal (eg. intron start). It can be shown on a diagram where the size of each nucleotide symbol is proportional to its probability
  • Weight matrices of signals are stored for each species and continuously refined with Learning algorithms (Tanuló algoritmus):
    • A first prediction of the weight matrices can be set up from similar known sequences of similar species. Therefore providing the species of origin (or a related species) of the analyzed sequence greatly helps gene prediction
    • Moreover, from a longer (10 Mbase) unknown sequence the weight matrices can be learned standalone, as it provides a sizeable sample for statistical analysis
• STEP2: Determine Probability Scores (Valószínűségi score-ok): Based on the Bayesian Conditional Probability (Bayes-i feltételes valószínűség) theorem, the probability that the analyzed sequence fits a signal starting at a hypothetical position can be determined:
  • The probability of joint occurrence (A∩B) of two independent events (A, B) equals the product of their individual probabilities: P(A∩B) = P(A)×P(B)
  • We consider the nucleotide positions of the unknown sequence independent of each other in those parts where no signal has been identified yet
  • Therefore the probability of signal fit is the product of the position fit probabilities: Eg. the probability that the sequence part GTAAGT is an intron start = 100%G × 100%T × 50%A × 60%A × 70%G × 40%T = 8.4% (probabilities are from the intron start weight matrix on the figure)
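The product rule of STEP2 can be sketched in a few lines of Python. This is only an illustration: the six consensus probabilities come from the GTAAGT example, while the names `build_weight_matrix`, `signal_fit` and `scan`, and the even spreading of the remaining probability over the other three bases, are assumptions made to get a complete matrix.

```python
from math import prod

# Probability of the consensus base at each position of the intron-start
# signal, taken from the GTAAGT example of the slide.
CONSENSUS = [("G", 1.0), ("T", 1.0), ("A", 0.5), ("A", 0.6), ("G", 0.7), ("T", 0.4)]


def build_weight_matrix(consensus):
    """Return one dict per signal position, mapping base -> probability.

    ASSUMPTION: the slide gives only the consensus-base probability, so the
    remaining probability is spread evenly over the other three bases.
    """
    matrix = []
    for base, p in consensus:
        rest = (1.0 - p) / 3
        matrix.append({b: (p if b == base else rest) for b in "ACGT"})
    return matrix


def signal_fit(window, matrix):
    """Product of the per-position probabilities (positions are independent)."""
    if len(window) != len(matrix):
        raise ValueError("window length must equal signal length")
    return prod(column[base] for base, column in zip(window, matrix))


def scan(sequence, matrix):
    """Fit probability of the signal at every possible start position."""
    width = len(matrix)
    return [(i, signal_fit(sequence[i:i + width], matrix))
            for i in range(len(sequence) - width + 1)]


INTRON_START = build_weight_matrix(CONSENSUS)
print(round(signal_fit("GTAAGT", INTRON_START), 3))  # 0.084
```

Calling `scan(sequence, INTRON_START)` on a longer sequence gives the list of hypothetical start positions with their fit probabilities, which is the kind of input the later steps work with.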

Gene Search in Eukaryotes: Hidden Markov Models: Assembling Bayesian networks
• STEP3: Set up a database of Gene Syntax (Gén szintaktikai) probabilities:
  • Probabilities are assigned to specific successions of signals (eg. after an intron start it is much more likely that an intron end will follow than another intron start)
• STEP4: Assembling the Bayesian Network (Bayes-féle hálózat): Data from the weight matrices, the probability of fit scores and the follow-up probabilities is summarized on a Directed Graph (Irányított gráf) where:
  • Nodes (Csomópontok) denote States (Állapotok). States can be:
    • Observable Output (Megfigyelhető): We can observe the nucleotides at the positions of the analyzed sequence (see state y1 on the graph = there is a T at the 41st position)
    • Hidden (Rejtett) states: They cannot be observed, we just assume that a given type of signal starts from a given hypothetical position in the sequence (see state x1 on the graph = last position of intron start, most likely T). Hence the model is called „Hidden”
  • Edges (Élek) denote Transition Probabilities (Tranzíciós valószínűségek) between 2 given states, which can be computed from:
    • The matching probability of a given signal at a given position of the analyzed sequence (see b11 on the graph: the T at the 41st position (y1) quite probably resulted from the last position of intron start (x1))
    • Syntactic rules between signals (see a12 on the graph: the last character of intron start (x1) is less likely to be followed by the first character of intron end (x2), as this would result in a 0-length intron)
    • The sum of the probabilities of the edges departing from one node is 1 (the intron end can be at many places (y1, y2, y3, y4), but it is certain to be somewhere, so b11 + b12 + b13 + b14 = 1)
  • The graph can have any type of network structure
Analyzed sequence on the graph: TCCTTTAAATCCCTTACATGATCTGAGTTCAGACCGGCGTGAGCCAGGTCGGTTTCT
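A minimal sketch of how such a network can be stored in Python: one table for the hidden-to-hidden edges (a) and one for the hidden-to-observable edges (b). The two states and all numbers below are hypothetical and chosen only to make the example runnable; they are not values from the graph on the slide.

```python
# ASSUMPTION: two toy hidden states and invented probabilities.
START = {"exon": 0.5, "intron": 0.5}

TRANSITIONS = {  # a[x_i][x_j]: hidden state -> next hidden state
    "exon":   {"exon": 0.8, "intron": 0.2},
    "intron": {"exon": 0.3, "intron": 0.7},
}

EMISSIONS = {  # b[x_i][y]: hidden state -> observed nucleotide
    "exon":   {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def check_rows(table):
    """Edges departing from one node must have probabilities summing to 1."""
    for state, row in table.items():
        if abs(sum(row.values()) - 1.0) > 1e-9:
            raise ValueError(f"probabilities leaving {state!r} do not sum to 1")


def joint_probability(path, observed):
    """Probability of one hidden path together with the observed bases."""
    p = START[path[0]] * EMISSIONS[path[0]][observed[0]]
    for prev, cur, base in zip(path, path[1:], observed[1:]):
        p *= TRANSITIONS[prev][cur] * EMISSIONS[cur][base]
    return p


check_rows(TRANSITIONS)
check_rows(EMISSIONS)
print(joint_probability(["exon", "intron"], "GT"))
```

Keeping the two tables separate mirrors the two sources of edge probabilities listed above: signal matching (b) and syntactic rules (a).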

Gene Search in Eukaryotes: Hidden Markov Models: Markov Chain Models
• STEP5: Markov-Chain (Markov-lánc) models:
  • They assume that, from the graph network of probabilistic transitions, only a tree-like subset can be valid at one moment, called a Decision Tree (Döntési fa):
    • In the tree, the probability of each state is influenced by at most 1 Ascendant (Előzmény/Felmenő/Gyökér) state, so only 1 edge can arrive at 1 node (eg. there cannot be both T and G at the same position of the analyzed sequence at the same time),
    • But 1 state can have several Descendant (Következmény/Leszármazott) states, each with its given probability (eg. a given T can be part of an intron start and also of a TATA-box). This is repeated from the Root (Gyökér) state of the tree to the Leaves (Levelek)
  • Multiplying the probabilities of the edges leading from the root to a given leaf, we can compute the Aggregate Probability (Aggregált valószínűség) that the analyzed sequence contains the given type of signals at the given positions
  • Only 1 leaf element (and 1 possible chain (Lánc) route of edges) can be valid at a time
• EXAMPLE5-1: To understand how the decision tree and probability aggregation work, here is a sample application computing the probability that your partner will buy up the whole mall if you leave him/her alone with your credit card:
  • A general partner is the root event
  • He/she cannot have multiple Sex, Hair color and IQ values at the same time, so the descendant events Brown, Blonde, Stupid, Clever form a decision tree
  • At a given leaf-partner, only one chain of edges leading from the root will be valid, and its probability can be computed by multiplying the probabilities of the edges: 0.4(Female)×0.5(Blonde)×0.4(Stupid) = 0.08
[Figure: decision tree of the shopping example. Node rules: IF SEX=FEMALE THEN X=1; IF SEX=MALE THEN X=2; IF X=1 AND HAIR=BROWN THEN Y=1; IF X=1 AND HAIR=BLONDE THEN Y=2; IF X=2 AND HAIR=BROWN THEN Y=3; IF X=2 AND HAIR=BLONDE THEN Y=4; IF X=2 AND IQ=STUPID THEN SHOPPING=800$; IF X=2 AND IQ=CLEVER THEN SHOPPING=400$. Edge probabilities: 0.6/0.4, 0.5/0.5, 0.7/0.3, 0.4/0.6. Aggregated probabilities: 0.60, 0.40, 0.6×0.7=0.42, 0.6×0.3=0.18, 0.4×0.5=0.2, 0.4×0.5=0.2, 0.4×0.5×0.6=0.12, 0.4×0.5×0.4=0.08]
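The aggregation in EXAMPLE5-1 can be reproduced with a short Python sketch. Only the Female → Blonde branch is fully labelled in the text, so only that branch is expanded here; the dictionary layout and the function names are illustrative assumptions.

```python
from math import prod

# Edge probabilities of the branch spelled out in the example.
# ASSUMPTION: Male = 0.6 is taken as the complement of Female = 0.4 and is
# left unexpanded, because the text does not label its sub-branches.
TREE = {
    "Partner": {"Female": 0.4, "Male": 0.6},
    "Female": {"Blonde": 0.5, "Brown": 0.5},
    "Blonde": {"Stupid": 0.4, "Clever": 0.6},
}


def chain_probability(tree, path):
    """Multiply the edge probabilities along one root-to-leaf chain."""
    return prod(tree[parent][child] for parent, child in zip(path, path[1:]))


def leaf_probabilities(tree, node, path=(), p=1.0):
    """Aggregate probability of every leaf reachable from `node`."""
    path = path + (node,)
    children = tree.get(node)
    if not children:
        return {path: p}
    leaves = {}
    for child, edge_p in children.items():
        leaves.update(leaf_probabilities(tree, child, path, p * edge_p))
    return leaves


chain = ["Partner", "Female", "Blonde", "Stupid"]
print(round(chain_probability(TREE, chain), 2))  # 0.08
```

Because exactly one leaf is valid at a time, the aggregate probabilities returned by `leaf_probabilities(TREE, "Partner")` add up to 1.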

Gene Search in Eukaryotes: Hidden Markov Models: Maximum Likelihood (ML)
[Figure: Promoter signal over sequence positions 1 to 15]
• STEP6: Maximum Likelihood (ML) (Maximális valószínűség):
  • We use an Optimization Algorithm (Optimalizációs algoritmus) on the analyzed DNA sequence to:
    • Change the starting points of possible signals as discrete-valued decision variables (if a signal is considered absent from the sequence, its starting point is pushed to the end of the sequence)
    • In all 6 possible reading frames of the analyzed sequence
    • To get the maximal aggregated probability, which combines:
      • Signal nucleotide weights (T, A, C, G),
      • Signal matching probabilities (big or small, as marked on the figure),
      • Signal syntactic probabilities (big or small, as marked on the figure)
  • This goal function is a nasty stepped, multimodal, nonlinear monster with tons of local sub-optima
  • Using the signal starting positions of the Optimal Solution (Optimális megoldás), the analyzed sequence can be translated into gene structure and exons/introns, and the coding parts can be translated into proteins
  • The protein products can be further analyzed with other tools, to get not just the location of the gene but also its function
Analyzed sequence on the figure: TCCTTTAAATCCCTTACATGATCTGAGTTCAGACCGGCGTGAGCCAGGTCGGTTTCT

Gene Search in Eukaryotes: Hidden Markov Models: Optimization methods 1
• Which optimization methods can we use from the collection we learned earlier?
  • You can forget Analytic Optimization (Analitikus optimalizáció), as it cannot handle discrete variables and stepped functions
  • You can also forget the Gradient Descent / hill climbing (Hegymászó) and Simulated Annealing (Szimulált hűtés) algorithms, as the probability function is so multimodal that they tend to get stuck in the first sub-optimum
  • So, older software uses Linear Programming (Lineáris programozás) with Branch and Bound (Korlátozás-szétválasztás), which can handle discrete variables, and a nonlinear goal function through linearization. But this explodes the model size and has such a huge computational requirement that the examined species will be quite extinct by the time it finds the optimal solution and finishes the gene search
  • But there is a special technique called Dynamic Programming (Dinamikus programozás): It uses the principle of a series of „bottlenecks”:
    • IF a system has a consecutive series of states (eg. time periods)
    • AND it can transition only into the next neighboring state (it cannot jump several states in one step or go back, see stages 3, 2, 1, 0 on the figure) (eg. time goes forward continuously, except for the pyjama-clothed guys in Star Trek)
    • THEN optimizing the neighboring state transitions individually, in a series of small, easy to compute models, will optimize the whole system (over all time)
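To make the „bottleneck” principle concrete, here is a sketch of the Viterbi algorithm, the textbook dynamic-programming method for finding the most probable hidden path of an HMM. The slides do not name this algorithm, so treat it as an illustration of the principle and not as the method of any particular gene finder. The two states and all probabilities are invented for the example.

```python
import math

# ASSUMPTION: toy model with GC-rich "exon" and AT-rich "intron" states.
# All probabilities are strictly positive so that logarithms are defined.
STATES = ("exon", "intron")
START = {"exon": 0.5, "intron": 0.5}
TRANS = {"exon": {"exon": 0.9, "intron": 0.1},
         "intron": {"exon": 0.1, "intron": 0.9}}
EMIT = {"exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}


def viterbi(observed, states, start, trans, emit):
    """Return (most probable hidden path, its probability)."""
    # best[s] = log-probability of the best path that ends in state s
    best = {s: math.log(start[s]) + math.log(emit[s][observed[0]])
            for s in states}
    back = []  # one dict per step: state -> best predecessor state
    for base in observed[1:]:
        pointers, new_best = {}, {}
        for s in states:
            # Only the previous stage matters: this is the "bottleneck".
            prev = max(states, key=lambda r: best[r] + math.log(trans[r][s]))
            pointers[s] = prev
            new_best[s] = (best[prev] + math.log(trans[prev][s])
                           + math.log(emit[s][base]))
        back.append(pointers)
        best = new_best
    last = max(states, key=best.get)
    path = [last]
    for pointers in reversed(back):  # walk the back-pointers to the start
        path.append(pointers[path[-1]])
    path.reverse()
    return path, math.exp(best[last])


path, probability = viterbi("GCGCATAT", STATES, START, TRANS, EMIT)
print(path)
```

The work grows linearly with sequence length and quadratically with the number of states, instead of exponentially as it would when enumerating every possible path.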

Gene Search in Eukaryotes: Hidden Markov Models: Optimization methods 2
• EXAMPLE5-2: To make this math blah-blah more understandable: the principle holds not only in time but also in space. Eg. if you have to cross 3 rivers consecutively, and there is only 1 bridge over each river, it is enough to optimize your route BETWEEN NEIGHBORING BRIDGES, and your whole route will be optimal.
  • And this is exactly what went badly wrong in Operation Market Garden in 1944…
• EXAMPLE5-3: We can also exploit the fact that on the chromosome a gene is coded in one consecutive direction (5’-3’ on the upper strand or 3’-5’ on the reverse strand), not back and forth. So, if we have already found a promoter with a quite high matching probability at the starting part of the sequence, it is not worth searching further for upstream elements (eg. expression factors), which usually reside BEFORE the promoter. Instead, we search only for intron starts/ends
• This way, „frog-leaping” only forward with a window of the sequence in each of the 3 reading frames of the 2 DNA strands, and using the output of the last model run (the recognized start positions of signals) as input of the next model, tremendously reduces the computational requirement!
• In newer software a Genetic Algorithm (Genetikus algoritmus) is used to solve the discrete, multimodal, nonlinear optimization problem directly, without the need for dynamic programming, because its computational requirement is much lower than that of LP-B&B
Analyzed sequence on the figure: TCCTTTAAATCCCTTACATGATCTGAGTTCAGACCGGCGTGAGCCAGGTCGGTTTCT

Gene Search in Eukaryotes: Hidden Markov Models: Software
• Hidden Markov Model-based gene search software: GeneMark (http://opal.biology.gatech.edu/GeneMark/)
  • At the Start Screen:
    • Give a Title
    • Give the Sequence in FASTA format
    • Select the Species for the signal weight matrices
    • Press the Start GeneMark.hmm button
  • It gives the estimated position of genes and their introns/exons:
    • In the Startframe/endframe columns it gives the reading frame of the probable exon starts/ends.
    • If the starting and ending reading frames do not match, there is a Frameshift mutation (Leolvasási keret mutáció) in the exon: 1 or 2 extra bases are inserted into/deleted from its sequence
  • It also splices the exons (unfortunately only in one version, corresponding to their original order) and translates them into a protein sequence in FASTA format
    • This can be analyzed later with Protein Structure software
Sample output:
>gene_1|GeneMark.hmm|389_aa
MSAPAKRSSTDTQDKDLMLAADKDMEKDTWNFKSMTDDDPMDFGFGSPAKNKKNAFKLDM
GFDLDGDFGSSFKMDMPDFDFSSPAKKTTKTKETSDDKPSGNSKQKKNPFAFSYDFDALD
DFDLGSSPPKKGSKTTTKSMDCEEICASSKSDKSDDLDFGLDLPITRQVPSKANTDVQAK
ASAEKESQNYKTTDTLVVNKSKNSNQAALESMGDFEAVESPQGSRKKASQTHTMCVQPQS
VDTSPLKTSCSKVEEKNEPCPSNETIAPSPLHASEIAHIAVNRETSPDIHELCRSGTKED
CPIDPENANKKMITTMESSYEKIEQTSPSISSHLCSDKIEHQQEEMGTDTQAEIQDNTKG
ALYNSDAGHSLTTLSGKISPGTRTSQTAK

Gene Search in Eukaryotes: Codon Usage Statistics: Basic terms 1
• There is considerable evolutionary pressure for codons in coding parts to be much more organized than in non-coding parts, as badly coded proteins (eg. infested with frameshift mutations) will not work and reduce survival
• Redundant coding (Redundáns kódolás): 1 amino acid is usually coded by several codons, for two reasons:
  • There are several tRNAs with different anticodons that still carry the same amino acid
  • In some tRNAs the spatial structure of the anticodon allows the tRNA to „wobble” when binding to the mRNA during protein synthesis, so it can deliver the same amino acid for several codons. The point is that the probability of the alternate „wobbled” matches varies considerably across species!
• Codon Usage Optimization (Kodonhasználat-optimalizáció):
  • In rapidly reproducing organisms (eg. yeast), where fast protein synthesis is the basis of survival, a preference for some codon alternatives has formed during evolution
  • Optimized codon usage probably had a synergic effect on tRNA evolution too, but little is known about this issue yet
  • In slowly reproducing organisms (eg. man) codon usage optimization is less important
• Codon Usage Table (Kodonhasználati tábla):
  • A database broken down by species/sub-species:
    • What is the percentage share of each amino acid in coding parts
    • What is the percentage share of the alternative codons coding the same amino acid
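The two kinds of percentages stored in a codon usage table can be computed from any coding sequence as sketched below. The function name `codon_usage` and the sample sequence are made up for illustration; the codon-to-amino-acid mapping is the standard genetic code, with `*` standing for a stop codon.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {"".join(codon): aa
                for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_usage(cds):
    """Return (amino acid shares, codon shares within each amino acid).

    `cds` is a coding sequence read in frame from its first base;
    trailing bases that do not fill a whole codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    aa_totals = Counter()
    for codon, n in counts.items():
        aa_totals[GENETIC_CODE[codon]] += n
    total = sum(counts.values())
    aa_share = {aa: n / total for aa, n in aa_totals.items()}
    codon_share = {codon: n / aa_totals[GENETIC_CODE[codon]]
                   for codon, n in counts.items()}
    return aa_share, codon_share


# Hypothetical toy sequence: Met, Ala, Ala, Ala, stop.
aa_share, codon_share = codon_usage("ATGGCTGCTGCCTAA")
print(aa_share["A"], codon_share["GCT"])
```

A real table would be built from all known coding sequences of a species and would list codons with zero count as well; this sketch only reports the codons it has seen.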

Gene Search in Eukaryotes: Codon Usage Statistics: Basic terms 2
• Codon Usage analysis in ORFs (Nyitott leolvasási keretek kodonhasználata):
  • It examines all ORFs from all alternative ATG start codons in all 6 possible reading frames
  • It computes which are the preferred and the Scarce codons (p<10%) in a given ORF
  • It computes the frequency of AT/GC pairs in the middle (non-wobbling) codon positions in a given ORF
  • It compares the data with the frequencies stored in the database (if possible for the matching sub-species) and estimates which of the ORFs can be a coding ORF
• Frameshift mutation detection (Leolvasási keret hiba detekció):
  • If the positions of 2 coding ORFs are almost consecutive, but they are in different reading frames, then it is very likely that there was an insert/delete frameshift mutation at their border
  • This can identify the location of a frameshift mutation more exactly than a Hidden Markov Model, where we only get an indirect warning that a consecutive intron start and intron end are not in the same reading frame
• Alternative start codon detection (Alternatív startkodon detekció):
  • ATG is the most frequent start codon, but not the only possible one
  • If the part of the actual ORF before the first ATG seems to be coding, it is likely that the start codon is one of the scarcer alternatives GTG or TTG
  • Additional signals help to decide which can be the real start codon:
    • The start codon is usually preceded, about 10 base pairs upstream, by the ca. 3-5 base pair wide Shine-Dalgarno Ribosome Binding Site, RBS (Riboszóma bekötőhely), which is complementary to a ribosomal RNA
    • Or, even before this, we can find the TATA box of the promoter
• Problems in codon preference analysis:
  • Lack of species-related scarce codons
  • In Eukaryotes, intron content can blur the statistics, so introns should be removed first by HMM
  • mRNA Editing (mRNS editálás): In Eukaryotes, post-transcriptional processing of the mRNA can put an unexpected stop codon on the mRNA that leaves no trace in the DNA
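The ORF scan and the frameshift check described above can be sketched as follows. This is a simplified illustration under stated assumptions: every ORF runs from the first ATG in its frame to the next in-frame stop codon (alternative start codons are not considered), coordinates on the "-" strand refer to the reverse complement, and the `max_gap` threshold and the test sequence are invented for the example.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_codons=2):
    """Yield (start, end) of ATG...stop ORFs in one reading frame.

    `end` is exclusive and includes the stop codon.
    """
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOPS:
            if (i - start) // 3 >= min_codons:
                yield start, i + 3
            start = None


def six_frame_orfs(seq, min_codons=2):
    """List (strand, frame, start, end) for all 6 reading frames."""
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                found.append((strand, frame, start, end))
    return found


def frameshift_candidates(orfs, max_gap=10):
    """Pairs of ORFs on one strand, in different frames, that nearly touch."""
    return [(a, b) for a in orfs for b in orfs
            if a[0] == b[0] and a[1] != b[1] and 0 <= b[2] - a[3] <= max_gap]


# Hypothetical sequence: two short ORFs separated by one extra base (G).
SEQ = "ATGAAACCCTAA" + "G" + "ATGGGGTTTTAG"
orfs = six_frame_orfs(SEQ)
print(orfs)
print(frameshift_candidates(orfs))
```

In a real analysis each ORF would first be scored against the codon usage table, and only ORFs judged to be coding would be passed to `frameshift_candidates`.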

Gene Search in Eukaryotes: Codon Usage Statistics: Software 1
• Database of codon usage tables by species: Kazusa (http://www.kazusa.or.jp/codon/)
• Simple codon usage statistics software: EBI (http://www.ebioinfogen.com/biotools/codon-usage.htm):
  • At the Start Screen:
    • Upload the analyzed DNA sequence in FASTA format,
    • Select the code table: Standard for vertebrate genomic sequences
    • Press the Submit button
  • In the output table, the amino acids and their codon usage frequencies can be seen

Gene Search in Eukaryotes: Codon Usage Statistics: Software 2
• Complex codon usage analysis software: GCUA (http://gcua.schoedl.de/):
  • At the Start Screen we can select 2 modes:
    • The first mode is Analyze triplets in all ORFs of the analyzed sequence based on the codon usage of a selected species:
      • Give the Sequence name
      • Give the organism of the analyzed sequence
      • Give the Analyzed sequence in FASTA format
      • Select the Codon usage table by species
      • Press the Submit button
    • It gives the Relative adaptiveness score (Relatív adaptivitási score) of the triplet codons in all ORFs on bar charts: Relative adaptiveness = (freq. of the given alternative codon) / (freq. of the most preferred codon coding the given amino acid)
    • Scarce codons are marked with grey and red on the charts, signalling possible non-coding regions
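The relative adaptiveness formula is simple enough to check by hand or with a few lines of Python. The sketch below is not GCUA's code; the codon frequencies are invented, and the 10% cut-off for "scarce" is an assumption borrowed from the p<10% rule mentioned for scarce codons earlier.

```python
def relative_adaptiveness(usage):
    """Map each codon to its frequency as a percentage of the most
    preferred codon coding the same amino acid.

    `usage` has the form {amino_acid: {codon: frequency}}.
    """
    scores = {}
    for codons in usage.values():
        best = max(codons.values())
        for codon, freq in codons.items():
            scores[codon] = 100.0 * freq / best if best else 0.0
    return scores


def scarce_codons(scores, threshold=10.0):
    """Codons whose relative adaptiveness is below `threshold` percent."""
    return sorted(codon for codon, s in scores.items() if s < threshold)


# Hypothetical frequencies (per thousand codons) for two amino acids.
USAGE = {
    "Leu": {"CTG": 40.0, "CTC": 20.0, "TTA": 2.0},
    "Lys": {"AAG": 32.0, "AAA": 24.0},
}

scores = relative_adaptiveness(USAGE)
print(scores["CTC"], scores["AAA"])  # 50.0 75.0
print(scarce_codons(scores))         # ['TTA']
```

The most preferred codon of each amino acid always scores 100, so bars of different amino acids are directly comparable on one chart.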

Gene Search in Eukaryotes: Codon Usage Statistics: Software 3
• The second mode is Compare the codon usage table of the analyzed sequence with a known species:
  • Black bars are the relative adaptiveness of codons in the analyzed sequence
  • Red bars are the relative adaptiveness of codons in the selected species
  • It also provides the Mean percentage difference between them

References
• Gene search in Eukaryotes
  • Hidden Markov Models
    • Bayesian probability: http://en.wikipedia.org/wiki/Bayesian_probability
    • Bayesian network: http://www.cs.columbia.edu/~sal/notes/AISP05/m14-bayesian.ppt, http://www.niedermayer.ca/papers/bayesian/
    • Decision trees: http://en.wikipedia.org/wiki/Decision_tree
    • Markov-chains: http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/Chapter11.pdf
    • Hidden Markov Models (HMM): http://jedlik.phy.bme.hu/~gerjanos/HMM/node2.html
  • Optimization algorithm:
    • Dynamic Programming: http://en.wikipedia.org/wiki/Dynamic_programming
  • HMM Software:
    • GeneMark: http://opal.biology.gatech.edu/GeneMark/
  • Codon Usage/Preference:
    • Codon Usage databases:
      • Kazusa: http://www.kazusa.or.jp/codon/
    • Codon Usage software:
      • EBI: http://www.ebioinfogen.com/biotools/codon-usage.htm
      • GCUA: http://gcua.schoedl.de/
