
Theoretical distributions Lorenz Wernisch


Presentation Transcript


  1. Theoretical distributions. Lorenz Wernisch

  2. Distributions as compact data descriptions. A list of numbers (66, 2463, 933, 1287, 297, 777, 1431, 954, 588, 567, 714, 591, ...) becomes organized in a histogram; the distribution captures the statistical essentials. [Figure: histogram of the data with a fitted Gamma density, l = 2.660301, s = 358.4419.]

  3. A scoring scheme. Compare a new target (keyword, DNA sequence) with objects (HTML pages, known DNA sequences) in a database to find matching objects. Each comparison results in a score. [Figure: target compared against database entries, yielding scores such as -46, -42, -59, -45, -53, and one high score of 330.]

  4. Compare the target score with the rest of the scores. [Figure: histogram of the scores of unrelated (random) database hits; x-axis: score, y-axis: number. The score 330 lies far outside the heap of random scores.]

  5. Fit a Normal (Gaussian) distribution. [Figure: histogram of the random scores with a fitted normal density, μ = -47.1, σ = 20.8; the target score s = 330 lies far in the right tail.]

  6. Probability for high scores. [Figure: normal density with the right tail above 0 shaded red.] The red area is the probability that a random N(-47.1, 20.8) distributed variable has score > 0: Pr[s > 0] = 0.0117.
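This tail probability can be checked directly in R with pnorm, a quick sketch using the fitted parameters from the slide:

  # probability that a N(-47.1, 20.8) variable exceeds 0
  1 - pnorm(0, mean = -47.1, sd = 20.8)   # about 0.0117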

  7. Objectives • At the end of the day you should be able to • do some combinatorial counting • mention one typical example for each of the following distributions: Binomial, Poisson, Normal, Chi-square, t, Gamma, Extreme Value • plot these distributions in R • fit one of the above distributions to a set of discrete or continuous data using maximum likelihood estimates in R

  8. How many ways are there to order numbers? • 3 numbers? 123, 213, ... (3! = 6 orderings) • 4 numbers, using the result for 3 numbers? • n numbers, using the result for (n-1) numbers? (Each of the n choices for the first number leaves the (n-1)! orderings of the rest, so there are n! = n · (n-1)! orderings.)

  9. The subsets of size k in a set of size n. How many subsets of k = 3 in a set of n = 10 elements? Idea: enumerate all orderings of the 10 elements and take the first three of each. [Figure: example orderings of 1..10; orderings that agree in their first three elements, in any order, yield the same subset.] How many orderings give the same result?

  10. Binomial coefficient. The number of ways of choosing subsets of k from a set of n, "n choose k": $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$. Alternative form: $\binom{n}{k} = \frac{n(n-1)\cdots(n-k+1)}{k!}$.
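In R the binomial coefficient is available directly as choose(); a quick check of both forms for n = 10, k = 3:

  choose(10, 3)                                   # 120
  factorial(10) / (factorial(3) * factorial(7))   # same: n! / (k! (n-k)!)
  (10 * 9 * 8) / factorial(3)                     # same, via the alternative form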

  11. Very simple: the Bernoulli distribution. A coin with two sides, side 0 (tail) and side 1 (head), has been tested with thousands of throws: 60% gives 0, 40% gives 1, so P[0] = 0.6 and P[1] = 0.4. The coin is a Bernoulli variable X with p = 0.4, with expectation $E[X] = p = 0.4$ and variance $V[X] = p(1-p) = 0.24$.

  12. Flipping a coin. What is the probability of each of the following runs? • P[0 1 0 1 1 0 0 0 1 0] = • P[1 1 0 0 0 0 1 0 0 1] = • P[0 0 0 1 1 1 0 1 0 0] = • How many such runs with four 1s exist? • What is the total probability of four 1s in 10 throws? (A sketch answering these in R follows.)
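A minimal sketch of the answers in R, assuming p = 0.4 as on the previous slide. Each run above contains four 1s and six 0s, so all three have the same probability:

  p <- 0.4
  p^4 * (1 - p)^6                   # any one specific run with four 1s: 0.001194
  choose(10, 4)                     # number of runs with four 1s: 210
  choose(10, 4) * p^4 * (1 - p)^6   # total probability of four 1s in 10 throws
  dbinom(4, 10, p)                  # the same, via the binomial density: about 0.251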

  13. Binomial distribution. 1 experiment: 100 coin tosses with p = Pr[side 1] = 0.4; expecting about 40 1s (but not exactly 40). 1000 experiments (1000 × 100 coin tosses). [Figure: histogram over the 1000 experiments; x-axis: number of 1s out of 100, y-axis: number of experiments out of 1000.]

  14. R program for binomial plot

  # generate 1000 simulations
  x <- rbinom(1000, 100, 0.4)
  # generate vector from 20 to 60
  z <- 20:60
  # plot the histogram from 20 to 60 with no labels
  hist(x, breaks=z, xlab="", ylab="", main="", xlim=c(20,60))
  # draw points of the binomial distribution,
  # scaled to 1000 experiments, line width 2
  lines(z, 1000*dbinom(z, 100, 0.4), type="p", lwd=2)

  15. Binomial distribution. [Figure: cumulative distribution function (cdf) and probability density function (pdf) of the binomial with n = 100, p = 0.4; at x = 44 the cdf reaches about 0.82.]

  16. R functions for distributions • Each distribution comes with a set of R functions. • Binomial distribution with parameters • n = size, p = prob • probability density function, pdf • dbinom(x,size,prob) • cumulative distribution function, cdf • pbinom(x,size,prob) • generate random examples • rbinom(number,size,prob) • More information: help(dbinom)

  17. R program for two binomial plots

  # make space for two plots
  par(mfrow=c(2,1))
  z <- 20:60
  # cdf first, "s" for a step plot
  plot(z, pbinom(z, 100, 0.4), type="s")
  # pdf next, "h" for vertical lines
  plot(z, dbinom(z, 100, 0.4), type="h")
  # add some points, "p" for points
  lines(z, dbinom(z, 100, 0.4), type="p")
  # one plot only in future
  par(mfrow=c(1,1))

  18. Binomial for large n and small p. The number n of throws is very large, the probability p of throwing a 1 is tiny, and the expected number of 1s stays constant: λ = np. [Figure: two binomial densities over the unit interval, n = 50, p = 0.06 and n = 100, p = 0.03, both with λ = 3.]

  19. Poisson distribution. If, on average, one sees λ heads between 0 and 1, what is the probability of seeing k heads? Make the grid between 0 and 1 finer and finer, keeping λ = np constant: $\lim_{n \to \infty} \binom{n}{k} p^k (1-p)^{n-k} = \frac{\lambda^k}{k!} e^{-\lambda}$. For example, the step $\lim_{n\to\infty} (1 - \lambda/n)^n = e^{-\lambda}$ requires some calculation with limits.
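The convergence is easy to see numerically in R; a small sketch comparing binomial probabilities with the Poisson limit for λ = 3:

  k <- 0:10
  # binomial with ever larger n and smaller p, lambda = np = 3 fixed
  rbind(n100  = dbinom(k, 100, 0.03),
        n1000 = dbinom(k, 1000, 0.003),
        pois  = dpois(k, 3))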

  20. Poisson distribution. pdf: $P[X = k] = \frac{\lambda^k}{k!} e^{-\lambda}$; cdf: $P[X \le k] = \sum_{i=0}^{k} \frac{\lambda^i}{i!} e^{-\lambda}$; expectation $E[X] = \lambda$; variance $V[X] = \lambda$. R functions: ppois(x,lambda), dpois(x,lambda), rpois(n,lambda). [Figure: Poisson density with λ = 2.7.]

  21. Application of Poisson: database search. A query sequence is compared against a database of "random" (that is, unrelated) sequences: p = Pr[a database sequence looks significant, a hit], n = number of sequences in the database. n is large and p is small; the "E-value" is λ = np. The chance to get at least one wrong (random) hit: $P[\text{at least one hit}] = 1 - P[X = 0] = 1 - e^{-\lambda}$.
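A minimal sketch in R, assuming a hypothetical E-value of λ = 0.1 (the value is made up for illustration):

  lambda <- 0.1          # hypothetical E-value
  1 - exp(-lambda)       # chance of at least one random hit: about 0.095
  1 - dpois(0, lambda)   # the same, via the Poisson density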

  22. Normal distribution N(μ, σ). Cumulative distribution function: $F(x) = \int_{-\infty}^{x} f(t)\,dt$ (no closed form); probability density function: $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}$. Expectation $E[X] = \mu$, variance $V[X] = \sigma^2$. [Figure: cdf and pdf of the normal with μ = 5, σ = 2; P[X < 7] = 0.841.]
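The marked probability can be reproduced with pnorm; 7 lies one standard deviation above the mean of N(5, 2):

  pnorm(7, mean = 5, sd = 2)   # P[X < 7] = 0.8413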

  23. R program for two normal plots of N(5,2)

  par(mfrow=c(2,1))
  z <- seq(-2, 12, 0.01)
  plot(z, pnorm(z,5,2), type="l", frame.plot=F,
       xlab="", ylab="", lwd=2)
  plot(z, dnorm(z,5,2), type="l", frame.plot=F,
       xlab="", ylab="", lwd=2)
  # hatch the area under the pdf up to 7
  z <- seq(-2, 7, 0.1)
  lines(z, dnorm(z,5,2), type="h")
  par(mfrow=c(1,1))

  24. Normal distribution as limit of the Binomial. Increase n with p constant: the binomial becomes less and less skewed and approaches the normal. [Figure: binomial densities for increasing n with fitted normal curves.]

  25. Central limit theorem. Independent random variables Xi, all with the same distribution as X: E[X] = μ and V[X] = σ². The standardized sum $Z_n = \frac{\sum_{i=1}^n X_i - n\mu}{\sigma\sqrt{n}}$ is normal N(0,1) in the limit: $\lim_{n\to\infty} P[Z_n \le z] = \Phi(z)$. Or equivalently: the density of the mean $\bar{X} = \frac{1}{n}\sum_i X_i$ approaches that of a normal with mean μ and standard deviation $\sigma/\sqrt{n}$.
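A quick simulation sketch of the theorem in R: sums of uniform variables (which are themselves far from normal) standardize to an N(0,1) shape.

  # 1000 sums of 50 uniform variables, standardized
  sums <- replicate(1000, sum(runif(50)))
  z <- (sums - mean(sums)) / sd(sums)
  hist(z, breaks = 30, freq = FALSE)
  curve(dnorm(x), add = TRUE, lwd = 2)   # N(0,1) density for comparison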

  26. Fitting a distribution. 200 data points in a histogram seem to come from a normal distribution. To fit a normal distribution N(μ, σ), we need to know two parameters: μ and σ. How do we find them?

  27. Maximum likelihood estimators (MLE). Density function f(x; μ, σ). Sample x = {x1, x2, ..., xn} from pdf f(x; μ, σ). The likelihood for the fixed sample depends on the parameters μ and σ: $L(\mu, \sigma) = \prod_{i=1}^{n} f(x_i; \mu, \sigma)$. Log-likelihood: $\ell(\mu, \sigma) = \sum_{i=1}^{n} \log f(x_i; \mu, \sigma)$. We now try to find the μ and σ that maximize the likelihood or log-likelihood.

  28. MLE for normal distribution. Setting the derivatives of the log-likelihood with respect to μ and σ to zero yields $\hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i$ and $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \hat{\mu})^2$.

  29. MLE for normal distribution is simple. For the normal distribution the MLE of μ and σ is straightforward: it is just the mean and standard deviation of the data:

  # step size stsz of the histogram, data vector x
  z <- seq(min(x)-stsz, max(x)+stsz, 0.01)
  # overlay the fitted density, scaled to the histogram counts
  lines(z, length(x)*stsz*dnorm(z, mean(x), sd(x)))

  30. Fitting a distribution by MLE in R. Data points xi: x <- scan("numbers.dat"). For each data point xi there is a density value f(xi; μ, σ). We are looking for the μ and σ that make all these density values as large as possible together. The parameters μ and σ should maximize sum(log(dens(x,mu,sigma))), or equivalently minimize -sum(log(dens(x,mu,sigma))).

  31. MLE for normal distribution in R

  # define the function to minimize;
  # it takes a vector p of 2 parameters
  f.neglog <- function(p)
    -sum(log(dnorm(x, p[1], p[2])))
  # starting values for the 2 parameters
  start.params <- c(0, 1)
  # minimize the target function
  mle.fit <- nlm(f.neglog, start.params)
  # get the estimates for the parameters
  mle.fit$estimate

  32. Points to notice about MLE in R. The nlm algorithm is an iterative heuristic minimization algorithm and can fail! The choice of the right starting values for the parameters is crucial: an initial guess from the histogram can help. For location and scale parameters, mean(x) and sd(x) are often a good guess. Check the result by fitting the density function with the calculated parameters to the histogram.

  33. Exponential distribution Exp(λ). cdf: $F(x) = 1 - e^{-\lambda x}$; pdf: $f(x) = \lambda e^{-\lambda x}$ for x ≥ 0. Expectation $E[X] = 1/\lambda$, variance $V[X] = 1/\lambda^2$. R functions: pexp(x,lambda), dexp(x,lambda), rexp(n,lambda). [Figure: exponential density with λ = 4.]

  34. Exponential distribution and Poisson process. The exponential distribution as a limiting process: tiny probability p, many trials n for throwing a 1 in the unit interval [0,1], with λ = np. How long does it take until the first 1? [Figure: unit interval with a marked region; 18.1% chance of seeing a 1 in the indicated region.]

  35. MLE estimate for the exponential distribution. Maximum likelihood estimate for λ in $f(x) = \lambda e^{-\lambda x}$:
  1) Likelihood: $L(\lambda) = \prod_{i=1}^n \lambda e^{-\lambda x_i} = \lambda^n e^{-\lambda \sum_i x_i}$
  2) Log-likelihood: $\ell(\lambda) = n \log \lambda - \lambda \sum_i x_i$, and $\frac{d\ell}{d\lambda} = \frac{n}{\lambda} - \sum_i x_i = 0$
  3) $\hat{\lambda} = \frac{n}{\sum_i x_i} = \frac{1}{\bar{x}}$
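A short R sketch, on simulated data, checking the analytic estimate 1/mean(x) against a numeric fit with nlm (the same pattern as the normal MLE above):

  x <- rexp(1000, rate = 4)     # simulated data with true lambda = 4
  1 / mean(x)                   # analytic MLE, close to 4
  f.neglog <- function(p) -sum(log(dexp(x, p[1])))
  nlm(f.neglog, 1)$estimate     # numeric MLE, should agree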

  36. The Gamma function. $\Gamma(a) = \int_0^\infty x^{a-1} e^{-x}\,dx$ is the continuation of the factorial n! to real numbers: $\Gamma(n+1) = n!$ for integer n, and it is used in many distribution functions. Moreover, $\Gamma(a+1) = a\,\Gamma(a)$.

  37. Gamma distribution. Density function, pdf: $f(x) = \frac{\lambda^a}{\Gamma(a)}\, x^{a-1} e^{-\lambda x}$ for x ≥ 0. Expectation $E[X] = a/\lambda$, variance $V[X] = a/\lambda^2$. R functions: pgamma(x,alpha,lambda), dgamma(x,alpha,lambda), rgamma(n,alpha,lambda). [Figure: Gamma density with a = 3, λ = 4.]
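A quick numerical sanity check of the moment formulas in R, using the parameters from the figure (a = 3, λ = 4):

  g <- rgamma(100000, 3, 4)   # shape a = 3, rate lambda = 4
  mean(g)                     # close to a/lambda = 0.75
  var(g)                      # close to a/lambda^2 = 0.1875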

  38. Shape parameter of the Gamma distribution. [Figure: Gamma densities for shape parameters a = 1/5, 1/3, 1, 3, 5.]

  39. Gamma distribution and Poisson process. The Gamma distribution as a limiting process: tiny probability p, many trials n for throwing a 1 in the unit interval [0,1], with λ = np. How long does it take until the third 1? [Figure: unit interval with a marked region; 0.11% chance of seeing three 1s in the indicated region.]

  40. Example: waiting times in a queue. The time to service a person at the desk is exponential with λe = 1/(average waiting time). If there are n persons in front of you, how is the time you have to wait distributed? It turns out that the sum of n exponentially distributed variables (the waiting time for each person) is Gamma distributed with a = n and λ = λe.
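This can be verified by simulation in R; a sketch with n = 5 people and rate λe = 2 (values chosen arbitrarily for illustration):

  n <- 5; rate <- 2
  # total waiting time: sum of n exponential service times, repeated 10000 times
  waits <- replicate(10000, sum(rexp(n, rate)))
  hist(waits, breaks = 50, freq = FALSE)
  curve(dgamma(x, shape = n, rate = rate), add = TRUE, lwd = 2)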

  41. Extreme value distribution. Cumulative distribution, cdf: $F(x) = \exp\!\left(-e^{-(x-\mu)/\sigma}\right)$; probability density, pdf: $f(x) = \frac{1}{\sigma}\, e^{-(x-\mu)/\sigma} \exp\!\left(-e^{-(x-\mu)/\sigma}\right)$. There is no simple form for the expectation and variance. [Figure: extreme value density with μ = 3, σ = 4.]

  42. Examples for the Extreme Value Distribution (EVD). 1) An example from meteorology: wind speed is measured daily at noon and follows a normal distribution around the average wind speed; the monthly maximum wind speed is recorded as well. The monthly maximum wind speed does not follow a normal distribution; it follows an EVD. 2) Scores of sequence alignments (local alignments) often follow an EVD.

  43. R code for the Extreme Value distribution. There are no predefined procedures for the EVD in R, so define them, for example:

  pexval <- function(x, mu=0, sigma=1)
    exp(-exp(-(x-mu)/sigma))
  # the density is the derivative of the cdf;
  # note the minus sign in the exponent
  dexval <- function(x, mu=0, sigma=1)
    1/sigma * exp(-(x-mu)/sigma) * pexval(x, mu, sigma)

  pexval(x, 3, 4)
  dexval(x, 3, 4)
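A short usage sketch plotting the density with the parameters from the previous slide (μ = 3, σ = 4):

  z <- seq(-10, 30, 0.01)
  plot(z, dexval(z, 3, 4), type = "l", lwd = 2)   # EVD density, mu = 3, sigma = 4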

  44. χ² distribution. For standard normal random variables Xi, Xi ~ N(0,1), the variable $V = \sum_{i=1}^n X_i^2$ (note: squared!) has a $\chi^2_n$ distribution with n degrees of freedom. Density: $f(x) = \frac{1}{2^{n/2}\Gamma(n/2)}\, x^{n/2-1} e^{-x/2}$; expectation $E[V] = n$; variance $V[V] = 2n$. R functions: pchisq(x,n), dchisq(x,n), rchisq(num,n).
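A simulation sketch in R of the definition, with n = 4 degrees of freedom:

  # sums of squares of 4 standard normals, compared with the chi-square density
  v <- replicate(10000, sum(rnorm(4)^2))
  hist(v, breaks = 60, freq = FALSE)
  curve(dchisq(x, 4), add = TRUE, lwd = 2)
  mean(v); var(v)   # close to n = 4 and 2n = 8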

  45. Shape of the χ² distribution. [Figure: χ² densities for n = 1, 2, 4, 6, 10.] The χ²_n distribution is actually a Gamma distribution with a = n/2 and λ = 1/2.

  46. t distribution. If Z ~ N(0,1) is independent of U ~ χ²_n, then $T = \frac{Z}{\sqrt{U/n}}$ has a t distribution with n degrees of freedom. Density: $f(x) = \frac{\Gamma((n+1)/2)}{\sqrt{n\pi}\,\Gamma(n/2)} \left(1 + \frac{x^2}{n}\right)^{-(n+1)/2}$. R functions: pt(x,n), dt(x,n), rt(num,n).
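The definition can be checked by simulation in R; a sketch with n = 5 degrees of freedom:

  n <- 5
  # t variable built from a normal and an independent chi-square sample
  t.samples <- rnorm(10000) / sqrt(rchisq(10000, n) / n)
  hist(t.samples, breaks = 100, freq = FALSE, xlim = c(-6, 6))
  curve(dt(x, n), add = TRUE, lwd = 2)   # t density with n df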

  47. Shape of the t distribution. [Figure: t densities for n = 1, 3, 5, 10 together with the N(0,1) density.] The t distribution approaches the normal N(0,1) distribution for large n (n > 20 or 30).

  48. Fitting a chi-squared or t distribution. We need a location parameter μ and a scale parameter σ; the new density is $\frac{1}{\sigma}\, f\!\left(\frac{x-\mu}{\sigma}\right)$. In R, for a t density with n = 3:

  mu <- 3
  sig <- 2
  1/sig * dt((x-mu)/sig, 3)

  [Figure: shifted and scaled t density with n = 3, μ = 3, σ = 2.]

  49. MLE for t distribution

  # function to minimize with 3 parameters
  # p[1] is mu, p[2] is sigma, p[3] is n
  f.neglog <- function(p)
    -sum(log(1/p[2] * dt((x-p[1])/p[2], p[3])))
  # starting values for the 3 parameters
  start.params <- c(0, 1, 1)
  # minimize the target function
  mle.fit <- nlm(f.neglog, start.params)
  # get the estimates for the parameters
  mle.fit$estimate
