Statistical & Uncertainty Analysis UC Summer REU Programs

Instructor: Lilit Yeghiazarian, PhD, Environmental Engineering Program

Instructor
Dr. Lilit Yeghiazarian
Environmental Engineering Program
Office: 746 Engineering Research Center (ERC)
Email: yeghialt@ucmail.uc.edu
Phone: 513-556-3623

Textbooks
• Applied Numerical Methods with MATLAB for Engineers and Scientists, 3rd edition, S.C. Chapra, McGraw-Hill Companies, Inc., 2012
• An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements, 2nd edition, J.R. Taylor, University Science Books, Sausalito, CA

Outline for today
• Error
  • numerical error
  • data uncertainty (measurement error)
• Statistics & Curve Fitting
  • mean
  • standard deviation
  • linear regression
  • t-test
  • ANOVA

Types of Error
• General Error (cannot blame computer):
  • Blunders: human error
  • Formulation or model error: incomplete mathematical model
  • Data uncertainty: limited to significant figures in physical measurements
• Numerical Error:
  • Round-off error (due to computer approximations)
  • Truncation error (due to mathematical approximations)

Gare Montparnasse, Paris, 1895

Accuracy and Precision
• Accuracy: how closely a computed/measured value agrees with the true value
• Precision: how closely individual computed/measured values agree with each other
Figure 4.1, Chapra: (a) inaccurate and imprecise; (b) accurate and imprecise; (c) inaccurate and precise; (d) accurate and precise
Note: Inaccuracy = bias; Imprecision = uncertainty

Accuracy and Precision
• Inaccuracy: systematic deviation from the truth
• Imprecision: magnitude of scatter

Error, Accuracy and Precision
In this class we use Error as a collective term for both the inaccuracy and the imprecision of our predictions

Round-off Errors
• Occur because digital computers have a limited ability to represent numbers
• Digital computers have size & precision limits on their ability to represent numbers
• Some numerical manipulations are highly sensitive to round-off errors arising from
  • mathematical considerations and/or
  • performance of arithmetic operations on computers

Computer Representation of Numbers
• Numerical round-off errors are directly related to the way numbers are stored in a computer
• The fundamental unit whereby information is represented is a word
• A word consists of a string of binary digits, or bits (off = “0”, on = “1”)
• Numbers are stored in one or more words, e.g., -173 could look like this in binary on a 16-bit computer (sign stored separately from the magnitude):
$(10101101)_2 = 2^7 + 2^5 + 2^3 + 2^2 + 2^0 = (173)_{10}$
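The binary expansion above can be checked in MATLAB. This is a minimal sketch using the built-in `dec2bin` and `bin2dec` functions on the magnitude 173 only; the variable names are illustrative and not from the slides.

```matlab
% Convert the magnitude 173 to binary and back (the sign is stored separately).
bits = dec2bin(173);        % char vector of '0'/'1' digits
value = bin2dec(bits);      % convert the bit string back to decimal
powers = [7 5 3 2 0];       % positions of the "on" bits, as listed on the slide
check = sum(2.^powers);     % 128 + 32 + 8 + 4 + 1
fprintf('%s -> %d (sum of powers of two: %d)\n', bits, value, check);
```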

As good as it gets on our PCs … (64-bit, IEEE double-precision format systems)
• Largest magnitude: $\pm 1.797693134862316 \times 10^{308}$; anything beyond this is overflow
• Smallest magnitude: $\pm 2.225073858507201 \times 10^{-308}$; anything closer to zero is underflow (the “hole” on either side of zero)
• 15 significant figures

Implications of Finite Number of Bits (1)
• Range
  • Finite range of numbers a computer can represent
  • Overflow error: bigger than the computer can handle
    • For double precision (MATLAB and Excel): $> 1.7977 \times 10^{308}$
  • Underflow error: smaller than the computer can handle
    • For double precision (MATLAB and Excel): $< 2.2251 \times 10^{-308}$
  • Can set format long and use realmax and realmin in MATLAB to test your computer for range
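A short sketch of the range test suggested above, using the built-in `realmax` and `realmin`. The scale factors 2 and 1e20 are arbitrary choices for illustration; any value far enough past the limits should behave the same way.

```matlab
format long
big = realmax;                  % largest finite double, about 1.7977e308
small = realmin;                % smallest normalized double, about 2.2251e-308
overflowed = realmax * 2;       % exceeds the range, so the result is Inf
underflowed = realmin / 1e20;   % far below the range, so the result is 0
disp([big; small]);
disp(overflowed);
disp(underflowed);
```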

Implications of Finite Number of Bits (2)
• Precision
  • Some numbers cannot be expressed with a finite number of significant figures, e.g., $\pi$, $e$, $\sqrt{7}$
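The same limit applies to ordinary decimal fractions, which is easy to see in MATLAB. The sketch below is an illustration added here, not taken from the slides: 0.1, 0.2, and 0.3 have no exact binary representation, so the stored sum differs from the stored 0.3 by an amount smaller than `eps`.

```matlab
a = 0.1 + 0.2;
isEqual = (a == 0.3);       % false: the two stored values differ in the last bit
gap = abs(a - 0.3);         % size of the discrepancy
fprintf('pi is stored as %.20f\n', pi);
fprintf('0.1 + 0.2 == 0.3 ? %d (difference %.3g, eps = %.3g)\n', isEqual, gap, eps);
```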

Round-Off Error and Common Arithmetic Operations
• Addition
  • Mantissa of the number with the smaller exponent is modified so both exponents are the same and the decimal points are aligned
  • Result is chopped
  • Example: hypothetical 4-digit mantissa & 1-digit exponent computer
    $1.557 + 0.04341 = 0.1557 \times 10^1 + 0.004341 \times 10^1$ (so they have the same exponent)
    $= 0.160041 \times 10^1 = 0.1600 \times 10^1$ (because of the 4-digit mantissa)
• Subtraction
  • Similar to addition, but the sign of the subtrahend is reversed
  • Severe loss of significance during subtraction of nearly equal numbers → one of the biggest sources of round-off error in numerical methods – subtractive cancellation
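The addition example can be imitated in MATLAB. This is only a sketch of the hypothetical 4-digit machine: MATLAB itself works in double precision, so the chopping is simulated with `fix`, and the names `nd`, `m`, and `mChopped` are made up for the illustration.

```matlab
nd = 4;                             % mantissa digits of the hypothetical machine
s = 1.557 + 0.04341;                % sum carried with all digits: 1.60041
e = floor(log10(abs(s))) + 1;       % exponent that puts the mantissa in [0.1, 1)
m = s / 10^e;                       % normalized mantissa, 0.160041
mChopped = fix(m * 10^nd) / 10^nd;  % chop (not round) to nd digits
sChopped = mChopped * 10^e;         % value the 4-digit machine would store
fprintf('sum %.5f is stored as %.4f x 10^%d\n', s, mChopped, e);
```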

Round-Off Error and Large Computations
Even though an individual round-off error could be small, the cumulative effect over the course of a large computation can be significant!
• Large numbers of computations
• Computations interdependent
  • Later calculations depend on results of earlier ones
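A simple way to see the cumulative effect is to add 0.0001 ten thousand times. The exact answer is 1, but each addition carries a tiny round-off error, so the computed sum is expected to come out very slightly different from 1. The loop below is a minimal sketch; the repeat count and increment are chosen only for illustration.

```matlab
s = 0;
for i = 1:10000
    s = s + 0.0001;     % 0.0001 is not exactly representable in binary
end
fprintf('sum = %.16f, error = %.3g\n', s, abs(s - 1));
```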

Particular Problems Arising from Round-Off Error (1)
• Adding a small and a large number
  • Common problem in summing infinite series (like the Taylor series) where initial terms are large compared to the later terms
  • Mitigate by summing in the reverse order so each new term is comparable in size to the accumulated sum (add small numbers first)
• Subtractive cancellation
  • Round-off error induced from subtracting two nearly equal floating-point numbers
  • Example: finding roots of a quadratic equation or parabola
  • Mitigate by using an alternative formulation of the model to minimize the problem
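The quadratic-root case can be sketched as follows. The coefficients are hypothetical, picked so that $b^2 \gg 4ac$ and the cancellation is easy to see; the alternative formulation used here is the rationalized form $x = -2c / (b + \sqrt{b^2 - 4ac})$, which avoids subtracting nearly equal numbers when $b > 0$.

```matlab
a = 1; b = 1e8; c = 1;              % hypothetical coefficients with b^2 >> 4ac
disc = sqrt(b^2 - 4*a*c);           % nearly equal to b
xNaive = (-b + disc) / (2*a);       % standard formula: subtracts nearly equal numbers
xAlt = (-2*c) / (b + disc);         % alternative formula: no cancellation
xRef = -1e-8;                       % small root, -c/b to leading order
relNaive = abs(xNaive - xRef) / abs(xRef);
relAlt = abs(xAlt - xRef) / abs(xRef);
fprintf('standard:    %.16g (relative error %.3g)\n', xNaive, relNaive);
fprintf('alternative: %.16g (relative error %.3g)\n', xAlt, relAlt);
```

In double precision the standard formula should lose most of its significant digits for the small root, while the alternative form stays accurate to about machine precision.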

Particular Problems Arising from Round-Off Error (2)
• Smearing
  • Occurs when individual terms in a summation are larger than the summation itself (positive and negative numbers in the summation)
  • Really a form of subtractive cancellation – mitigate by using an alternative formulation of the model to minimize the problem
• Inner Products
  • Common problem in the solution of simultaneous linear algebraic equations
  • Use double precision to mitigate the problem (MATLAB does this automatically)

Truncation Errors
• Occur when exact mathematical formulations are represented by approximations
• Example: Taylor series

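• Taylor series are widely used to express functions in an approximate fashion
• Taylor’s Theorem: any smooth function can be approximated as a polynomial
• Taylor series expansion, shown with its zero- through fourth-order terms:
$f(x_{i+1}) \approx f(x_i) + f'(x_i)\,h + \frac{f''(x_i)}{2!}h^2 + \frac{f^{(3)}(x_i)}{3!}h^3 + \frac{f^{(4)}(x_i)}{4!}h^4$
where $h = x_{i+1} - x_i$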

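Each term adds more information, e.g., $f(x) = -0.1x^4 - 0.15x^3 - 0.5x^2 - 0.25x + 1.2$, expanded about $x_i = 0$ with $h = 1$ to predict $f(1)$ (Figure 4.6, Chapra, p. 93):
• Zero order: $f(1) \approx 1.2$
• First order: $f(1) \approx 1.2 - 0.25(1) = 0.95$
• Second order: $f(1) \approx 1.2 - 0.25(1) - \frac{1.0}{1 \cdot 2}(1)^2 = 0.45$
• Third order: $f(1) \approx 1.2 - 0.25(1) - \frac{1.0}{1 \cdot 2}(1)^2 - \frac{0.9}{1 \cdot 2 \cdot 3}(1)^3 = 0.3$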
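The successive approximations can be reproduced with a few lines of MATLAB. This sketch assumes the expansion point $x_i = 0$ and step $h = 1$; the derivative values are worked out by hand from the polynomial, and the variable names are illustrative.

```matlab
% Derivatives of f at x = 0: f, f', f'', f''', f''''
d = [1.2, -0.25, -1.0, -0.9, -2.4];
h = 1;                                  % step from x = 0 to x = 1
n = 0:4;                                % order of each Taylor term
terms = d .* h.^n ./ factorial(n);      % individual Taylor series terms
approx = cumsum(terms);                 % zero- through fourth-order estimates
ftrue = polyval([-0.1 -0.15 -0.5 -0.25 1.2], 1);   % exact value of f(1)
disp(approx);
disp(abs(approx - ftrue));              % truncation error at each order
```

Because the function is a fourth-order polynomial, the fourth-order estimate matches the true value $f(1) = 0.2$ up to round-off.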

Total Numerical Error (Figure 4.10, Chapra, p. 104)
• Sum of
  • round-off error and
  • truncation error
• As step size ↓, # computations ↑
  • round-off error ↑ (e.g., due to subtractive cancellation or large numbers of computations)
  • truncation error ↓
• Point of diminishing returns is when round-off error begins to negate the benefits of step-size reduction
• Trade-off here
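The trade-off can be observed with a numerical experiment. The sketch below is an illustration chosen here, not taken from the slides: it estimates the derivative of $f(x) = -0.1x^4 - 0.15x^3 - 0.5x^2 - 0.25x + 1.2$ at $x = 0.5$ with a centered difference and shrinks the step size by factors of ten. The true derivative, $-0.9125$, follows from differentiating the polynomial by hand.

```matlab
f = @(x) -0.1*x.^4 - 0.15*x.^3 - 0.5*x.^2 - 0.25*x + 1.2;
dftrue = -0.9125;                               % exact f'(0.5)
h = 10.^(0:-1:-10);                             % step sizes 1, 0.1, ..., 1e-10
dfapprox = (f(0.5 + h) - f(0.5 - h)) ./ (2*h);  % centered difference estimates
err = abs(dfapprox - dftrue);                   % total numerical error
[~, k] = min(err);                              % index of the best step size
for j = 1:numel(h)
    fprintf('h = %8.1e   error = %10.3e\n', h(j), err(j));
end
fprintf('smallest error at h = %g\n', h(k));
```

The error should fall as the step shrinks (truncation error dominates), reach a minimum at an intermediate step size, and then grow again as round-off error takes over.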

Control of Numerical Errors
• Experience and judgment of the engineer
• Practical programming guidelines:
  • Avoid subtracting two nearly equal numbers
  • Sort the numbers and work with the smallest numbers first
• Use theoretical formulations to predict total numerical errors when possible (small-scale tasks)
• Check results by substituting back into the original model and seeing if they actually make sense
• Perform numerical experiments to increase awareness
  • Change step size or method to cross-check
  • Have two independent groups perform the same calculations

Measurements & Uncertainty

Errors as Uncertainties
• Error in scientific measurement means the inevitable uncertainty that accompanies all measurements
• As such, errors are not mistakes; you cannot eliminate them by being very careful
• The best we can hope to do is to ensure that errors are as small as reasonably possible
• In this section, the words error and uncertainty are used interchangeably

Inevitability of Uncertainty
• A carpenter wants to measure the height of a doorway before installing a door
• First rough measurement: 210 cm
• If pressed, the carpenter might admit that the height is anywhere between 205 & 215 cm
• For a more precise measurement, he uses a tape measure: 211.3 cm
• How can he be sure it’s not 211.3001 cm? Use a more precise tape?


Measuring Length with Ruler
• Note: markings are 1 mm apart
• Best estimate of length = 82.5 mm
• Probable range: 82 to 83 mm
• We have measured the length to the nearest millimeter

How To Report & Use Uncertainties
• Best estimate ± uncertainty
• In general, the result of any measurement of quantity x is stated as
  (measured value of x) = $x_{best} \pm \Delta x$
• $\Delta x$ is called the uncertainty, or error, or margin of error
• $\Delta x$ is always positive

Basic Rules About Uncertainty
• $\Delta x$ cannot be known/stated with too much precision; it cannot conceivably be known to 4 significant figures
• Rule for stating uncertainties: experimental uncertainties should almost always be rounded to one significant figure
• Example: if some calculation yields $\Delta x = 0.02385$, it should be rounded to $\Delta x = 0.02$

Basic Rules About Uncertainty
• Rule for stating answers: the last significant figure in any stated answer should usually be of the same order of magnitude (in the same decimal position) as the uncertainty
• Examples:
  • The answer 92.81 with uncertainty 0.3 should be rounded as 92.8 ± 0.3
  • If the uncertainty is 3, then the answer should be rounded as 93 ± 3
  • If the uncertainty is 30, then the answer should be rounded as 90 ± 30
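The two rounding rules can be sketched in MATLAB. This is an illustrative helper, not code from the course: it finds the decimal position of the leading digit of the uncertainty and rounds both the uncertainty and the answer to that position. It does not handle the edge case where rounding pushes the uncertainty to the next decade (e.g., 0.097 becoming 0.1).

```matlab
x = 92.81;                          % answer from the slide
dxList = [0.3 3 30];                % uncertainties from the slide
xOut = zeros(size(dxList));
dxOut = zeros(size(dxList));
for k = 1:numel(dxList)
    dx = dxList(k);
    p = floor(log10(dx));                   % decimal position of the leading digit
    dxOut(k) = round(dx / 10^p) * 10^p;     % uncertainty to one significant figure
    xOut(k) = round(x / 10^p) * 10^p;       % answer to the same decimal position
    fprintf('%g +/- %g\n', xOut(k), dxOut(k));
end
% Rounding an uncertainty alone to one significant figure:
dxRaw = 0.02385;
dxOne = round(dxRaw / 10^floor(log10(dxRaw))) * 10^floor(log10(dxRaw));
fprintf('%g rounds to %g\n', dxRaw, dxOne);
```

For the slide's examples this should reproduce 92.8 ± 0.3, 93 ± 3, and 90 ± 30, and round 0.02385 to 0.02.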

Propagation Of Uncertainty

Statistics & Curve Fitting

Curve Fitting (Figure PT4.1, Chapra)
• Could plot points and sketch a curve that visually conforms to the data
• Three different ways shown:
  • Least-squares regression for data with scatter (covered)
  • Linear interpolation for precise data
  • Curvilinear interpolation for precise data

Curve Fitting and Engineering Practice
• Estimation of intermediate numbers from tables in design handbooks → interpolation
• Trend analysis – use pattern of data to make predictions:
  • Imprecise or “noisy” data → regression (least-squares)
  • Precise data → interpolation (interpolating polynomials)
• Hypothesis testing – compare existing mathematical model with measured data
  • Determine unknown model coefficient values … or …
  • Compare predicted values with observed values to test model adequacy

You’ve Got a Problem … especially if you are this guy (Figures 13.1 and 13.2, Chapra)
• Wind tunnel data relating force of air resistance (F) to wind velocity (v) for our friend the bungee jumper
• The data can be used to discover the relationship and find a drag coefficient ($c_d$)
• As v ↑, F ↑
• Data are not smooth, especially at higher v’s
• If F = 0 at v = 0, then the relationship may not be linear
How to fit the “best” line or curve to these data?

Before We Can Discuss Regression Techniques … We Need To Review
• basic terminology
• descriptive statistics for talking about sets of data

Basic Terminology (data from Table 13.3)
• Individual data points, $y_i$: $y_1 = 6.395$, $y_2 = 6.435$, …, $y_{24} = 6.775$
• Number of observations? $n = 24$
• Degrees of freedom? $n - 1 = 23$
• Maximum? 6.775
• Minimum? 6.395
• Range? 6.775 - 6.395 = 0.380
• Residual?

Use Descriptive Statistics To Characterize Data Sets:
Location of center of distribution of the data
• Arithmetic mean
• Median (midpoint of data, or 50th percentile)
• Mode (value that occurs most frequently)
Degree of spread of the data set
• Standard deviation
• Variance
• Coefficient of variation (c.v.)

Data from TABLE 13.3 Arithmetic Mean

Data from TABLE 13.3 Standard Deviation
$S_t$: total sum of squares of the residuals between the data points and the mean

Data from TABLE 13.3 Variance

Data from TABLE 13.3 Coefficient of Variation (c.v.)
c.v. = standard deviation / mean
Normalized measure of spread
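For reference, the descriptive statistics named on these slides can be written out as below. These are the standard sample definitions, assumed here to match the course's usage: $n$ observations $y_i$, $n - 1$ degrees of freedom, and $S_t$ as the total sum of squares of the residuals about the mean. The symbols $\bar{y}$ and $s_y$ are notation chosen for this sketch.

```latex
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
\qquad
S_t = \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2
\qquad
s_y = \sqrt{\frac{S_t}{n-1}}
\qquad
s_y^2 = \frac{S_t}{n-1}
\qquad
\text{c.v.} = \frac{s_y}{\bar{y}}
```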

Histogram of data (data from Table 13.3; Figure 12.4, Chapra)
For a large set of data, the histogram can be approximated by a smooth, symmetric bell-shaped curve → normal distribution

Confidence Intervals
• If a data set is normally distributed, ~68% of the total measurements will fall within the range defined by $\bar{y} - s_y$ to $\bar{y} + s_y$
• Similarly, ~95% of the total measurements will be encompassed by the range $\bar{y} - 2s_y$ to $\bar{y} + 2s_y$
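The 68% and 95% figures follow from the normal distribution and can be checked with the built-in error function, since the probability of falling within $k$ standard deviations of the mean is $\mathrm{erf}(k/\sqrt{2})$. The sketch below also applies the two ranges to the mean (6.6) and standard deviation (0.097133) reported for the course data set; the variable names are illustrative.

```matlab
p1 = erf(1 / sqrt(2));              % fraction within one standard deviation
p2 = erf(2 / sqrt(2));              % fraction within two standard deviations
ybar = 6.6;                         % mean reported for the data set
sy = 0.097133;                      % standard deviation reported for the data set
range68 = [ybar - sy, ybar + sy];
range95 = [ybar - 2*sy, ybar + 2*sy];
fprintf('within 1 s: %.4f, range %.4f to %.4f\n', p1, range68(1), range68(2));
fprintf('within 2 s: %.4f, range %.4f to %.4f\n', p2, range95(1), range95(2));
```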

Descriptive Statistics in MATLAB
>> % s holds data from Table 13.2
>> s=[6.395;6.435;6.485;…;6.775]
>> mean(s), median(s), mode(s)
ans = 6.6
ans = 6.61
ans = 6.555
>> min(s), max(s)
ans = 6.395
ans = 6.775
>> var(s), std(s)
ans = 0.0094348
ans = 0.097133
>> range=max(s)-min(s)
range = 0.38
>> [n,x]=hist(s)
n = 1 1 3 1 4 3 5 2 2 2
x = 6.414 6.452 6.49 6.528 6.566 6.604 6.642 6.68 6.718 6.756
n is the number of elements in each bin; x is a vector specifying the midpoint of each bin

Back to the Bungee Jumper Wind Tunnel Data … (Figures 13.2, 13.8a, and 12.1, Chapra)
• Is the mean a good fit to the data?
• Not very! The distribution of the residuals is large

Curve Fitting Techniques (Figure 13.8b, Chapra)
• Least-squares regression
  • Linear
  • Polynomial
  • General linear least-squares
  • Nonlinear
• Interpolation (not covered)
  • Polynomial
  • Splines
Curve-fitting techniques such as linear least-squares regression can reduce the distribution of the residuals

Linear Least-Squares Regression
• Linear least-squares regression, the simplest example of a least-squares approximation, is fitting a straight line to a set of paired observations: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
• Mathematical expression for a straight line: $y = a_0 + a_1 x + e$, where $a_0$ is the intercept, $a_1$ is the slope, and $e$ is the error or residual
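A straight-line fit of this form can be sketched in MATLAB. The data below are hypothetical, made up only for the illustration. The slope and intercept come from the standard closed-form least-squares solution (an assumption here, since the formulas do not appear on this slide), and the built-in `polyfit` is used as a cross-check.

```matlab
x = [1 2 3 4 5];                    % hypothetical independent variable
y = [2.2 4.1 6.2 7.9 10.1];         % hypothetical noisy observations
n = numel(x);
% Closed-form least-squares estimates of slope and intercept
a1 = (n*sum(x.*y) - sum(x)*sum(y)) / (n*sum(x.^2) - sum(x)^2);
a0 = mean(y) - a1*mean(x);
e = y - (a0 + a1*x);                % residuals about the fitted line
Sr = sum(e.^2);                     % sum of squared residuals
p = polyfit(x, y, 1);               % cross-check: returns [slope, intercept]
fprintf('a0 = %.4f, a1 = %.4f, Sr = %.4f\n', a0, a1, Sr);
fprintf('polyfit: a0 = %.4f, a1 = %.4f\n', p(2), p(1));
```

For these made-up points the hand formulas give a slope of 1.96 and an intercept of 0.22, and `polyfit` should agree to within round-off.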
