Welcome to lecture 4: An introduction to modular PERL IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of Chemistry & Biochemistry, UCLA
Last time… • We covered a bit of material… • Try to keep up with the reading – it’s all in there! • We’ve covered variables, control structures, data structures, functions… • Now we’ll cover modular programming • We’ll create libraries of our own to use • We’ll take an example of a biological problem that incorporates everything we’ve learned and used so far…
How to find a gene given a sequence? Conversely, what is the likelihood that a given region of sequence is a coding region? Note that we used the term ‘likelihood’: • A simplified statistical approach • Codon usage throughout an organism’s genome is non-uniform • Non-coding regions differ from coding regions in their codon usage • We can use this information to test putative ORFs • A more traditional approach • Makes use of homology of putative regions to other known protein sequences • Does not require prior information regarding codon bias • Suffers from problems inherent in homology-based analysis
How to find a gene given a sequence? We don’t have to choose; we can use both (heuristics). We’ll be dealing with sometimes complex and contradictory information. Genes are non-linear: • There may be many methionine codons present • There are different reading frames possible • There are intron/exon combinations (alt. splicing) possible [Diagram: possible start (ATG) and stop codons along a sequence; an in-frame start-to-stop span is a putative ORF]
Note that we used the term ‘likelihood’: We need to introduce, briefly, some probability and statistics. Just a little Evil…
Permutations • Ordered arrangements of things • How many 3-letter permutations of the letters a, b & c are there? • abc, acb, bac, bca, cba, cab: 6 total • Can use the Basic Principle of Counting: • 3*2*1 = 6 • General formula: n!/(n-k)! • n = total number of things • k = size of the groups you’re taking • k ≤ n • 3!/(3-3)! = 6
Permutations • What if some of the things are identical? • How many permutations of the letters a, a, b, c, c & c are there? • General formula: n!/(n1! n2! … nr!), where n1, n2, … nr are the number of objects that are alike • 6!/(2!3!) = 60
Combinations • Groups of things (order doesn’t matter) • How many 3-letter combinations of the letters a, b & c are there? 1: abc • How many 2-letter combinations of the letters a, b & c are there? 3: ab, ac, bc • ab = ba; ac = ca; bc = cb (order doesn’t matter) • General formula: n!/(k!(n-k)!), read “n choose k” • n = total number of things • k = size of the groups you’re taking • k ≤ n
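The counting rules above can be checked directly. A quick sketch in Python (the workshop itself builds toward Perl, but the arithmetic is identical):

```python
# Counting permutations and combinations with Python's math module.
from math import factorial, perm, comb

# Permutations of 3 letters taken 3 at a time: n! / (n-k)!
assert perm(3, 3) == factorial(3) // factorial(3 - 3) == 6

# Permutations with repeated items: n! / (n1! * n2! * ... * nr!)
# a, a, b, c, c, c  ->  6! / (2! * 3!)
assert factorial(6) // (factorial(2) * factorial(3)) == 60

# Combinations ("n choose k"): n! / (k! * (n-k)!)
assert comb(3, 3) == 1   # abc
assert comb(3, 2) == 3   # ab, ac, bc
```

`math.perm` and `math.comb` require Python 3.8 or later; on older versions the factorial forms shown in the comments work the same way.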
Set Theory The sample space S of an experiment is the set of all possible values/outcomes of the experiment: S = {a, b, c, d, e, f, g, h, i, j} S = {Heads, Tails} S = {1, 2, 3, 4, 5, 6} E = {a, b, c, d} F = {b, c, d, e, f, g} E ⊆ S, F ⊆ S Ec = {e, f, g, h, i, j} Fc = {a, h, i, j} E ∪ F = {a, b, c, d, e, f, g} E ∩ F = EF = {b, c, d}
Venn Diagrams [Diagram: events E, F, G within sample space S]
Simple Probability • Frequent assumption: all outcomes equally likely to occur • The probability of an event E is simply: • (number of possible outcomes in E) / (number of total possible outcomes) • S = {a, b, c, d, e, f, g, h, i, j} • E = {a, b, c, d} F = {b, c, d, e, f, g} • P(E) = 4/10 P(F) = 6/10 P(S) = 1 0 ≤ P(E) ≤ 1 P(Ec) = 1 - P(E) P(E ∪ F) = P(E) + P(F) - P(EF)
Independence Two events E & F are independent if neither outcome depends on the outcome of the other. So if E & F are independent, then: P(EF) = P(E)*P(F) If E, F & G are independent, then: P(EFG) = P(E)*P(F)*P(G)
Conditional Probability Given E, the probability of F is: P(F|E) = P(EF)/P(E) • Similarly: P(E|F) = P(EF)/P(F) [Venn diagram: EF is the overlap of E and F within S]
Assumptions Here’s a simple question. I come from a family of two children (prior information: I am a male). What’s the probability that my sibling is a sister? • Is each outcome (sex of offspring) equally likely? • Is it 0.5? Something else? • What is the question really asking?
Assumptions Here’s a simple question. I come from a family of two children. What’s the probability that my sibling is a sister? • Sample space is actually four pairs of possible siblings (in order of birth): {(B,B),(B,G),(G,B),(G,G)} • Let U be the event “one child is a girl” • Let V be the event “one child is a boy (Mike)” • We want to calculate P(U|V)
Assumptions… Here’s a simple question. I come from a family of two children. What’s the probability that my sibling is a sister? • P(U|V) = P(U ∩ V)/P(V) • = P(one child is B, one is G)/P(one is B) • = (2/4)/(3/4) = 2/3 • biologists cringe now…
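The 2/3 answer can be verified by brute force, enumerating the four equally likely birth-order pairs (a Python sketch):

```python
# Brute-force check of the two-children puzzle by enumerating the sample space.
from itertools import product

space = list(product("BG", repeat=2))   # (B,B), (B,G), (G,B), (G,G)
v = [s for s in space if "B" in s]      # V: at least one child is a boy
u_and_v = [s for s in v if "G" in s]    # U ∩ V: one boy AND one girl
p = len(u_and_v) / len(v)               # P(U|V) = P(U ∩ V) / P(V)
print(p)                                # 0.666... i.e. 2/3
```

Note the subtlety the slide hints at: conditioning on "at least one boy" (3 of 4 pairs) rather than "the first-born is a boy" (2 of 4 pairs) is exactly what makes the answer 2/3 instead of 1/2.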
Random Variables • Definition: a variable that can take different values, each with its own probability • X = result of a coin toss • Heads 50%, Tails 50% • Y = result of a die roll • 1, 2, 3, 4, 5, 6, each 1/6
Discrete vs. Continuous • Discrete random variables can only take a finite set of different values • Die roll, coin flip • Continuous random variables can take on an infinite number of (real) values • Time of day of an event, height of a person
Probability Density Function Many problems don’t have simple probabilities. For those, the probabilities are expressed as a function, aka a “pdf”: plug a into some function, e.g. f(a) = 2a^2 - a^3
Some Useful pdfs Simple cases (like fair/loaded coins, dice, etc.): the uniform random variable (“equally likely”): f(a) = 1/2 for a = Heads, f(a) = 1/2 for a = Tails
pdf of a Binomial Very useful function! P(k successes in n trials) = (n choose k) p^k q^(n-k), where p = P(success) & q = P(failure), and P(success) + P(failure) = 1. “n choose k” is the total number of possible ways to get k successes in n attempts.
Hypergeometric distribution P(X = x) = (M choose x)(N - M choose K - x) / (N choose K) • Tends towards the binomial distribution when N is large • We can use combinatorics to test for enrichment; i.e., is the number found greater than expected by chance?
Hypergeometric distribution • Our microarray of 9300 probesets (genes with some duplication) yields 200 upregulated genes in response to substance X. • We use gene ontology to cluster these genes into 4 biological process clusters: 160 genes in mitosis, 80 in oncogenesis, 60 in cell proliferation, and 40 in glucose transport. • Is substance X related to cancer? • Need to account for total number of genes queried by microarray in each category… • An enrichment problem (N total genes on the array, M categorical genes on the array, K regulated genes, and x regulated genes observed in the category). • Source: Data Analysis Tools for DNA Microarrays. Sorin Draghici, 2003.
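The enrichment calculation sketched above can be written directly from the pdf. In this Python sketch the category size M and observed count x are invented for illustration (the lecture gives cluster sizes but not the per-category totals on the array):

```python
# Hypergeometric enrichment test sketch. N = genes on the array, M = genes in
# the category, K = regulated genes, x = regulated genes in the category.
from math import comb

def hypergeom_pmf(x, N, M, K):
    """P(exactly x of the K regulated genes fall in a category of size M)."""
    return comb(M, x) * comb(N - M, K - x) / comb(N, K)

def enrichment_pvalue(x, N, M, K):
    """P(X >= x): chance of at least x category hits among K regulated genes."""
    return sum(hypergeom_pmf(i, N, M, K) for i in range(x, min(M, K) + 1))

# 9300 probesets, 200 regulated; suppose (hypothetically) 100 glucose-transport
# genes on the array, 10 of which appear among the 200 regulated genes.
# Expected by chance: 200 * 100 / 9300, about 2.2 hits, so 10 looks enriched.
p = enrichment_pvalue(10, N=9300, M=100, K=200)
print(p)   # a small p-value means more category hits than expected by chance
```

The design choice here mirrors the slide: the p-value is an upper tail sum of the pmf, which is what "greater than expected by chance" means operationally.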
Hypergeometric distribution • We may find that the inferred effect of substance X is very different from our initial impression… • Glucose transport: 4x more than expected by chance; oncogenesis: no better than chance… Maybe the correlation is with diabetes instead?
Using the p.d.f. • What is the probability of getting 3 heads in 5 coin tosses? (Same as 2T in 5 tosses) • n = 5 tosses, k = 3 heads • p = P(H) = .5, q = P(T) = .5 • P(3H in 5 tosses) = (5 choose 3) p^3 q^2 = 10 p^3 q^2 • = 10*P(H)^3*P(T)^2 • = 10(.5)^3(.5)^2 = 0.3125
Notice how these are binomials… • What is the probability of winning the lottery in 2 of your next 3 tries? • n = 3 tries, k = 2 wins • Assume P(win) = 10^-7, P(lose) = 1 - 10^-7 • P(win 2 of 3 lotto) = (3 choose 2) P(win)^2 P(lose) • = 3(10^-7)^2(1 - 10^-7) • ≈ 3*10^-14 • That’s about a 3 in 100 trillion chance. • Good luck!
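Both worked examples drop out of one small pmf helper (a Python sketch of the binomial formula above):

```python
# Checking the two worked binomial examples with a small pmf helper.
from math import comb

def binom_pmf(k, n, p):
    """P(exactly k successes in n independent trials, success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# 3 heads in 5 fair coin tosses
print(binom_pmf(3, 5, 0.5))     # 0.3125

# 2 lottery wins in 3 tries with P(win) = 1e-7
print(binom_pmf(2, 3, 1e-7))    # roughly 3e-14
```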
Expectation of a Discrete Random Variable The weighted average of a random variable: E[X] = Σ x·P(X = x) …and of a function of it: E[g(X)] = Σ g(x)·P(X = x)
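A minimal sketch of both sums, using the die roll from the earlier slide (the `expectation` helper name is ours, not from the lecture):

```python
# Expected value of a discrete random variable: E[g(X)] = sum of g(x) * P(X=x).
def expectation(pmf, g=lambda x: x):
    """Weighted average of g(X) over a pmf given as {value: probability}."""
    return sum(g(x) * p for x, p in pmf.items())

die = {x: 1/6 for x in range(1, 7)}     # fair six-sided die
print(expectation(die))                  # E[X]  ~ 3.5
print(expectation(die, g=lambda x: x**2))  # E[X^2] = 91/6, ~ 15.17
```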
Measures of central tendency • Sample mean: the sum of measurements divided by the number of subjects. • Sample median: the measurement that falls at the middle of the ordered sample.
Variance Variation, or spread, of the values of a random variable: Var(X) = E[(X - μ)^2] = E[X^2] - μ^2, where μ = E[X]
Variance and standard deviation: measures of variation in statistics: • Variance (s^2): the mean of the squared deviations for a sample: s^2 = Σ(y - ȳ)^2 / (n - 1) • Standard deviation (s): the square root of the variance, or the root mean squared deviation, labelled s
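The sample statistics above, written out explicitly (a Python sketch; the n - 1 denominator is the usual sample variance convention):

```python
# Sample mean, variance (n - 1 denominator), and standard deviation.
from math import sqrt

def sample_stats(ys):
    n = len(ys)
    mean = sum(ys) / n                                  # ybar
    var = sum((y - mean) ** 2 for y in ys) / (n - 1)    # s^2
    return mean, var, sqrt(var)                         # ybar, s^2, s

mean, var, sd = sample_stats([2, 4, 4, 4, 5, 5, 7, 9])
print(mean, var, sd)   # 5.0, then s^2 = 32/7, then its square root
```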
Statistics of populations • The equations so far are for sample statistics • A statistic is a single number estimated from a sample • We use the sample to make inferences about the population • A parameter is a single number that summarizes some quality of a variable in a population • The term for the population mean is μ (mu), and Ȳ is a sample estimator of μ • The term for the population standard deviation is σ (sigma), and s is a sample estimator of σ • Note that μ and σ are both parameters of the normal probability curve. Source: http://www.bsos.umd.edu/socy/smartin/601/
Measuring probabilities under the normal curve • We can make transformations by scaling everything with respect to the mean and standard deviation • Let z = the number of standard deviations above or below the population mean • z = 0: y = μ • z = 1: y = μ ± σ (p = 0.68) • z = 2: y = μ ± 2σ (p = 0.95) • z = 3: y = μ ± 3σ (p = 0.997)
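The z transformation itself is one line; the μ and σ values below are chosen purely for illustration:

```python
# z-score: how many population standard deviations a value lies from the mean.
def z_score(y, mu, sigma):
    return (y - mu) / sigma

# e.g. with mu = 100, sigma = 15 (illustrative values only):
print(z_score(115, 100, 15))   # 1.0 -> within mu +/- sigma, ~68% of values
print(z_score(130, 100, 15))   # 2.0 -> within mu +/- 2 sigma, ~95% of values
```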
Stem-and-leaf plots: an ordered array (e.g. from a radix sort) yields a stem-and-leaf plot, which also reveals whether rounding occurred.
The normal curve is difficult to integrate analytically, but probabilities have been mapped out (tabulated) for this curve, and transformations from other curves are possible…
Box plots (box-and-whisker plots; Tukey, 1977) • Upper fence/whisker: min(Q3 + 1.5·IQR, largest X) • Q3 • Median • Q1 • IQR = Q3 - Q1 • Lower fence/whisker: max(Q1 - 1.5·IQR, smallest X) • Points beyond the fences are outliers
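The fences can be computed directly from the quartiles. A Python sketch (using the exclusive-median quartile convention, one of several in use):

```python
# Quartiles, IQR, and Tukey fences for a box-and-whisker plot.
def five_number_summary(xs):
    xs = sorted(xs)
    n = len(xs)

    def median(a):
        m = len(a) // 2
        return a[m] if len(a) % 2 else (a[m - 1] + a[m]) / 2

    q1 = median(xs[:n // 2])          # lower half (median excluded if n odd)
    q2 = median(xs)
    q3 = median(xs[(n + 1) // 2:])    # upper half
    iqr = q3 - q1
    upper = min(q3 + 1.5 * iqr, xs[-1])   # upper fence / whisker
    lower = max(q1 - 1.5 * iqr, xs[0])    # lower fence / whisker
    return q1, q2, q3, iqr, lower, upper

print(five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 50]))
# 50 lies beyond the upper fence, so it would be drawn as an outlier
```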
Back to our task – gene finding Let’s start with a simple model that utilizes codon bias. What we need: • A routine for reading and accessing the data • A statistical construct for evaluating all possible codons within the data • A way to reuse segments of our code when appropriate [Diagram: possible start (ATG) and stop codons along a sequence; an in-frame start-to-stop span is a putative ORF]
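The first ingredient, finding in-frame start-to-stop spans, can be sketched as below. This is a Python illustration of the idea (the lecture series builds its version in Perl, and the `min_codons` cutoff is our own illustrative parameter):

```python
# Minimal ORF scanner: find in-frame ATG ... stop spans on the forward strand.
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=10):
    """Yield (start, end, frame) for ATG-to-stop spans; 0-based coordinates."""
    seq = seq.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i                      # first in-frame start codon
            elif codon in STOPS and start is not None:
                if (i + 3 - start) // 3 >= min_codons:
                    yield (start, i + 3, frame)
                start = None                   # look for the next ORF

print(list(find_orfs("ATGAAATTTGGGTAA", min_codons=2)))   # [(0, 15, 0)]
```

A full gene finder would also scan the reverse complement and then feed each putative ORF to the codon-bias statistics developed next.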
Codon bias assumptions • Codons are independent of each other • So if E & F are independent, then: • P(EF) = P(E)*P(F) • Codon frequencies are not uniform across the genome
Conditional Probability • Given E, the probability of F is: P(F|E) = P(EF)/P(E) • (this is the likelihood) • We can evaluate competing likelihoods through a ratio, called the log-odds ratio, or LOD
Conditional Probability • Our LOD is culled from the following information: LOD(sequence) = log[ P(sequence | coding model) / P(sequence | random model) ] = Σ over in-frame codons of log[ p(codon | coding) / p(codon | random) ]
Our model • To get our codon model, we need a TRAINING SET of data for known coding regions… • We then simply count the frequencies of each codon occurrence • We can often get this information from genomic databases in the form of ORF-only FASTA files…
Our model • To get our random model, it is typical to model noncoding sequences as random DNA (uniform distribution: each codon with probability 1/64)
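Putting the two models together gives the LOD score of a putative ORF. In this Python sketch the tiny `coding_freqs` table is invented for illustration; real frequencies would be counted from an ORF-only FASTA training set as the slides describe:

```python
# Log-odds scoring of a putative ORF: codon-bias model vs. uniform random DNA.
from math import log

def lod_score(orf, coding_freqs, random_freq=1/64):
    """Sum of log2(p(codon|coding) / p(codon|random)) over in-frame codons."""
    score = 0.0
    for i in range(0, len(orf) - 2, 3):
        codon = orf[i:i + 3].upper()
        p_coding = coding_freqs.get(codon, 1e-6)   # small pseudo-frequency
        score += log(p_coding / random_freq, 2)
    return score

# Hypothetical trained frequencies (a real table covers all 64 codons and sums to 1)
coding_freqs = {"ATG": 0.025, "AAA": 0.050, "GGT": 0.030, "TAA": 0.002}
print(lod_score("ATGAAAGGTTAA", coding_freqs))
# a positive score favors the coding model; a negative one favors random DNA
```

Because the codons are assumed independent, the product of per-codon ratios becomes a sum of logs, which is both numerically stable and easy to accumulate while scanning a sequence.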