Signal and Data Processing

Presentation Transcript


  1. Signal and Data Processing CSC 508 Basic Methods in Data Reduction

  2. 3.0 Basic Methods in Data Reduction In this section we study some of the more important methods in statistical data analysis and reduction. We establish a foundation for the concepts of significance testing, correlation between data sets, independence, and hypothesis testing. The definition of data reduction we will use in this course includes any process that extracts information from measurements and presents it in a more compact form. This lecture is an introduction to almost everything else that we will cover in this course. We will learn a number of basic definitions and data reduction techniques common to all aspects of signal and data processing. 3.1 Random Vectors Last week we reviewed some basics of probability and statistics applicable to analysis of measurements in noise. Some of the data sets we used would fit the definition of a vector; now we formally introduce the concept of a random vector. A vector is a collection of two or more values with a fixed order, associated with an object or other entity from which each element of the vector can be measured. For example, a person's height and weight can be measured and saved as a two-element vector.

  3. These are a few examples of such vectors, in which the first element is a person's height in inches and the second element is the same person's weight in pounds: (72,180) (70,198) (68,112) (77,205) (66, 98) (71,230) (69,123) (58, 84). A random vector is one in which the measurement and/or the elements themselves exhibit some amount of uncertainty. The (height,weight) vector fits the definition of a random vector, and these 8 samples can be plotted as a scatter plot, as shown on the slide. Repeated measurements of a single quantity may also be used to form a random vector; elements of a random vector may be obtained from direct measurement or may be derived from other quantities. Successive measurements of the same quantity over time form a type of random vector referred to as a time series.

  4. 3.2 Propagation of Errors Keep in mind that the apparent randomness in a sequence of measurements can be due to noise in the measuring process or due to variations in the entity being measured. The part of the uncertainty due to measurement noise is called the error. When we are dealing with values that are direct measurements of fixed quantities, we can determine an estimate of the amount of error in the measurements by computing their standard deviations. Many times we will use two or more direct measurements to calculate some other quantity that cannot be directly measured. We will need to know how to estimate the amount of error in the derived quantity given estimates of the errors in the direct measurements that constitute it. First we will assume that the direct measurements used to obtain the derived quantity are independent of each other; that is, the magnitude of one of the measurements is not affected by the magnitude of any other measurement in a particular trial. Later, we will worry about cases for which there is dependence between the measurements. Given that a derived quantity g is a known function of a random vector X=(x1,x2,...,xn), we can obtain an estimate of the standard deviation of g by sg = sqrt[ (dg/dx1)^2·s1^2 + (dg/dx2)^2·s2^2 + ... + (dg/dxn)^2·sn^2 ], where si is the estimated standard deviation of xi.

  5. where dg/dxi is the partial derivative of g with respect to the ith element of the random vector X; that is, the amount of variation in the derived quantity due to a variation in the value of xi. As with most closed-form representations it is only notional: we can't do anything with this formula until we know something specific about the relationships between the direct measurements and the derived quantity. An example will be of great help at this point. Imagine that you are measuring the dimensions of a box in order to determine its volume. The volume of the box is the derived quantity and the dimensions of the box are the direct measurements. The volume of the box is the product of its height H, width W and length L: V = H x W x L. (The slide shows a box with dimensions H +/- dH, W and L.) Notice that the amount of error in the volume of the box due to a given error in the height measurement is a function of the width and length of the box (weird!).

  6. We will use the average values of H, W and L as our best estimates of the height, width and length of the box. The estimated error in the volume of the box is therefore sV = sqrt[ (W·L)^2·sH^2 + (H·L)^2·sW^2 + (H·W)^2·sL^2 ]. For now, it is sufficient for you to gain an intuitive understanding of what this equation is telling us. The first term says that the amount to which an error in the height affects the volume is proportional to the area of the base (or top) of the box. This means that the change in the volume due to a change in the height of the box is given by the height change times the area of the bottom (or top) of the box. The same argument holds for the other two dimensions and their relative errors. In general the partial derivatives can be approximated by varying each of the elements of the random vector and determining how this affects the value of the derived quantity. Propagation of errors will be an important part of our study of identification theory. You may be asking yourself why we computed the resultant standard deviation of the derived quantity by taking the square root of the weighted sum of the squares of the individual standard deviations of the direct measurements. (Even if you're not asking yourself this question, we need to discuss it.)

  7. Root Sum Squares (RSS) The reason we use the root-sum-squares (RSS) instead of simply adding the weighted sums of the standard deviations is that the standard deviations are measures of the expected deviation from the means of random quantities. The individual samples are varying in unpredictable and independent ways, so that when a measurement of one of the elements is above its mean value, another element's measurement may be below its mean value. That is, one error can compensate for another. (This happens whenever the separate elements of the random vector are not affected by one another.) Attempting to compute the overall standard deviation by simply adding the individual standard deviations would imply that the random variations in the components of the vector were somehow correlated to each other. Sometimes the separate elements of a random vector are correlated. Consider again the example of height and weight. Are the height and weight of a person independent? In other words, can we make a better estimate of the weight of a person if we know their height? Even though height and weight are not perfectly correlated (i.e. they are not deterministic functions of each other) we can improve our estimate of one of these quantities by knowing the other. If these quantities were completely independent, the points in the scatter-plot of height vs. weight would be distributed in a circle (assuming normalized scales for each axis).
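To illustrate both the propagation-of-errors formula and the RSS idea, here is a minimal Python sketch. The box dimensions and their measurement errors below are assumed example values, not taken from the slides. It approximates each partial derivative numerically by perturbing one dimension at a time, as suggested above, and checks the result against the analytic W·L, H·L, H·W weights.

```python
import numpy as np

def volume(dims):
    """Derived quantity: box volume V = H * W * L."""
    h, w, l = dims
    return h * w * l

def propagated_sigma(g, x, sigmas, eps=1e-6):
    """Estimate the standard deviation of g(x) by root-sum-squares,
    approximating each partial derivative dg/dxi with a central difference."""
    x = np.asarray(x, dtype=float)
    var = 0.0
    for i, s in enumerate(sigmas):
        step = np.zeros_like(x)
        step[i] = eps
        dg_dxi = (g(x + step) - g(x - step)) / (2 * eps)   # numerical partial derivative
        var += (dg_dxi * s) ** 2                           # weighted square
    return np.sqrt(var)                                    # root of the sum of squares

# Assumed example values: mean dimensions (inches) and their measurement errors
dims   = [10.0, 5.0, 20.0]      # H, W, L
sigmas = [0.1, 0.1, 0.1]        # sH, sW, sL

print("V       =", volume(dims))
print("sigma_V =", propagated_sigma(volume, dims, sigmas))
# Analytic check: sqrt((W*L*sH)^2 + (H*L*sW)^2 + (H*W*sL)^2)
print("check   =", np.sqrt((5*20*0.1)**2 + (10*20*0.1)**2 + (10*5*0.1)**2))
```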

  8. The Mean Function Just as we computed the mean of a fixed quantity, we can compute the mean function describing the relationship between components of a random vector. In this case the mean function is the relationship between height and weight derived from our sample set. Note that none of the samples need fall on the mean-function line; however, this line is our best estimate of the average height for a given weight. With sufficient numbers of samples, we could compute the standard deviation of the height as a function of the weight and vice-versa. You might also be able to prove that, rather than being over-weight, you are under-tall. When two components of a random vector are not completely independent we say that the variance of one component is correlated with the variance of the other component. This relationship is called the covariance. A simple form of the covariance between two components x and y can be computed as Cxy = (1/N)·Σ (xi - mx)(yi - my), where mx and my are the component means and the sum runs over the N samples.

  9. Covariance Matrix Notice that when we change y to x in the expression for the covariance we obtain the expression for the variance in x. The variances for x and y and the covariances between x and y are combined in the covariance matrix C = [[Cxx, Cxy], [Cyx, Cyy]]. The covariance matrix is always symmetric; that is, Cij=Cji for all i,j=1,...,n in an nxn covariance matrix. The main diagonal terms (Cii) represent the variances of the individual components of the vector. When all components of a random vector are independent, the off-diagonal terms of the covariance matrix are zero. For this reason the covariance matrix can be used to determine the level of independence between the elements of a random vector. It is also used as a measure of the potential for each element to provide additional information about a derived quantity.
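To make the matrix concrete, here is a small numpy sketch that computes the covariance matrix for the eight (height, weight) samples from the earlier slide, using the simple 1/N form of the covariance given above.

```python
import numpy as np

# The eight (height, weight) samples from the earlier slide
samples = np.array([(72,180), (70,198), (68,112), (77,205),
                    (66, 98), (71,230), (69,123), (58, 84)], dtype=float)

def covariance_matrix(data):
    """Covariance matrix C with Cij = mean((xi - mean_i) * (xj - mean_j))."""
    centered = data - data.mean(axis=0)
    return centered.T @ centered / len(data)   # 1/N (population) form

C = covariance_matrix(samples)
print(C)
# The main diagonal holds the variances of height and weight;
# the off-diagonal terms hold the height/weight covariance (C is symmetric).
# np.cov(samples.T, bias=True) gives the same result.
```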

  10. 3.3 Estimation One of the major tools of data reduction is statistical inference, which is a method of making generalizations about populations (not just people but any collection of entities) based on a small sample of the population. Statistical inference can be divided into two major categories, estimation and hypothesis testing. We will study estimation first. Estimation is the collection of data by sampling a population in order to determine an approximate measure of an unknown parameter based on the sample, together with an analysis of the accuracy of the estimate. The accuracy of an estimate is a function of the sample size and the sample distribution. Hypothesis testing involves the establishment of an educated guess (the hypothesis) and the subsequent testing of this guess based on a sampling of the population. In hypothesis testing we do not attempt to estimate a parameter but rather use the sample data to accept or reject our hypothesis. We have already covered the simplest form of estimation, called a point estimate. Examples of point estimates include the sample mean as an estimate of the population mean and the sample standard deviation as an estimate of the population standard deviation.

  11. Using our small sample of heights we can compute a point estimate of the average height as 68.9 inches. Assuming that our sample is taken from the population distribution, how close is our sample mean to the population mean? What we need is a method of expressing the possible difference between our best estimate of the mean and the true mean. If height follows a normal distribution we can estimate the deviation in the sample mean obtained from independent samples of size n (here n=8) by sm = ss / sqrt(n). Remember that ss is the standard deviation of a single measurement from our sample mean. The value sm is the standard deviation of the sample mean from the population mean. This equation holds only if the distribution of the sample matches the distribution of the population and the population is large. We can derive the probability that the sample mean lies within a specified range of the population mean by integrating the normal probability density function between the specified limits.
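A quick numerical check of the point estimate and of sm, as a sketch; the slides do not specify whether ss uses the n or n-1 normalization, so the n-1 sample form is assumed here.

```python
import numpy as np

# Heights (inches) from the eight samples introduced earlier
heights = np.array([72, 70, 68, 77, 66, 71, 69, 58], dtype=float)
n = len(heights)

mean = heights.mean()             # point estimate of the population mean (68.9 in)
s_s  = heights.std(ddof=1)        # standard deviation of a single measurement
s_m  = s_s / np.sqrt(n)           # standard deviation of the sample mean

print(f"sample mean = {mean:.1f} in, s_s = {s_s:.2f}, s_m = {s_m:.2f}")
```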

  12. Bayes Estimate The Bayesian method combines measurements with a priori information to provide an improved estimate of statistical quantities. We will use a graphical approach to gain an intuitive understanding of this method. Consider a probability space partitioned into n mutually exclusive events Ai, i=1,...,n. The a priori (prior) probability that event Ai will occur is P(Ai). The probability that some event Ai, i=1,...,n will occur is P(A1)+P(A2)+...+P(An) = 1. The shaded area in the right-hand figure represents the information gained when we have knowledge that condition B is true.

  13. In our graphical example the probability that event Ai will occur can be interpreted as the area of Ai divided by the area of the entire probability space. This ratio is analogous to the probability that a randomly thrown dart (that falls inside the probability space) has fallen inside the area Ai. If we are told that the dart has fallen within the area defined by B, we can re-compute the probability that the dart is also inside Ai. The probability of Ai given B is written as P(Ai|B) and is given by the ratio of the area of the portion of Ai in B to the area of B itself. The area of B can also be obtained by intersecting B with each of the Aj and summing, so that P(Ai|B) = (Ai ∩ B) / [ (A1 ∩ B) + ... + (An ∩ B) ], reading each term as an area.

  14. In our graphical example we are using areas, but in reality we are working with probabilities. The intersection of two overlapping areas in probability space is analogous to the probability of the first event times the probability of the second event given that the first event has occurred. Using this relationship we can rewrite our original expression as P(Ai|B) = P(Ai)·P(B|Ai) / [ P(A1)·P(B|A1) + ... + P(An)·P(B|An) ]. This is Bayes' Rule. In words it states, "The probability that Ai is true given that condition B has occurred is equal to the prior probability that Ai is true, times the probability that condition B has occurred given that Ai is true, divided by the total probability that B occurs." OK, maybe another example is in order....

  15. The Willies A medical journal announces the availability of a new diagnostic test. The announcement states the following: "An incredibly accurate indicator for the presence of the willies has recently been developed by Hokes Laboratories that will give a positive reading on an infected patient with probability 0.998 and has a false positive reading in only 2 out of 1000 patients. This modern miracle will revolutionize..." Even though you know that only 1 in 10,000 people in the world have the disease, you have long suspected that you have the willies. You rush to your doctor and demand to be tested. Confirming your suspicions, the test comes back positive. Based on this one test, what is the probability that you are really infected? In this example, the probability space is partitioned into two regions: A1 = you have the willies; and A2 = you do not have the willies. You also have the information that 1 in 10,000 persons in the population actually has the willies, which is equivalent to an a priori probability that you are infected of 0.0001. This means that the a priori probability that you do not have the willies is 0.9999.

  16. The test is positive, therefore B is true in our example. In this case we can apply values to the following probabilities: P(A1) = 0.0001, P(A2) = 0.9999, P(B|A1) = 0.998, P(B|A2) = 0.002. Before you continue, take a moment to be sure you understand the meaning of each of these probabilities. Now we can determine the probability that you actually have the willies by direct application of Bayes' Rule: P(A1|B) = P(A1)·P(B|A1) / [ P(A1)·P(B|A1) + P(A2)·P(B|A2) ] = (0.0001)(0.998) / [ (0.0001)(0.998) + (0.9999)(0.002) ] ≈ 0.048. In other words, there is less than a 5% chance that you actually have the willies.
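The same arithmetic as a short Python snippet, plugging the four probabilities above directly into Bayes' Rule:

```python
# Direct application of Bayes' Rule to the diagnostic-test example
p_a1 = 0.0001         # prior: you have the willies
p_a2 = 0.9999         # prior: you do not
p_b_given_a1 = 0.998  # positive test given infected
p_b_given_a2 = 0.002  # positive test given not infected (false positive rate)

p_b = p_a1 * p_b_given_a1 + p_a2 * p_b_given_a2   # total probability of a positive test
p_a1_given_b = p_a1 * p_b_given_a1 / p_b

print(f"P(infected | positive test) = {p_a1_given_b:.4f}")   # about 0.048
```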

  17. Homework 1. Consider the 10 samples of the two-component random vectors listed in (a) and (b) below: a. (98,21) (16,57) (38,86) (95,2) (47,38) (96,75) (52,84) (16,47) (79,20) (87,86) b. (2,7) (97,90) (96,101) (39,35) (18,29) (12,0) (45,35) (40,37) (59,57) (68,83) Sketch scatter plots of the two sample sets. What can you infer from these points about the populations from which they were taken? Compute the covariance matrices for these two sets. Are the elements of the random vector sampled in (a) correlated? How about (b)? 2. Use Bayes' Rule to determine the probability that the red die in a pair of dice (one red, one green) is a five, given that the total on the two dice is 9. 3. The meteorologist at the local TV station reports, "There is an 80% chance of rain and a 20% chance that I know what I'm talking about." Given that it rains about 65 days a year, what is the actual probability of rain?

  18. 3.4 Methods of Hypothesis Testing One of the most important tasks in signal processing applications is the detection of a signal in the presence of noise. Imagine that we have made a number of observations (collected measurement vectors) in which some object of interest (target in background) was present, and we then made observations in the same environment but without the object of interest (background only). We can use these data to construct models for the presence or absence of the object of interest in an observation. This figure shows the PDF's for a background model and for a target-in-background model. We assume that these two measurement sets are normally distributed. The PDF's show the relative probabilities that a particular sample belongs to the background model or to the target-in-background model as a function of the value of the measurement vector X.

  19. Maximum Likelihood The maximum likelihood criterion states that we will associate the sample x with the target model if the target PDF is greater than the background PDF at x; that is, we choose H1 when the likelihood ratio L(x) = p(x|H1) / p(x|H2) > 1 and choose H2 otherwise, where H2 is the hypothesis that the sample is a measurement of background only and H1 is the hypothesis that x is a measurement that includes the target in the background. If the target and background PDF's are normally distributed we may compute the conditional probabilities by p(x|Hi) = exp(-(x - mi)^2 / (2·si^2)) / (si·sqrt(2π)), where mi and si are the mean and standard deviation of the corresponding model.
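A minimal Python sketch of the maximum likelihood decision for two one-dimensional normal models; the means and standard deviations below are assumed example values, not numbers from the slides.

```python
import math

def normal_pdf(x, mean, sigma):
    """p(x | H) for a one-dimensional normal model."""
    return math.exp(-(x - mean)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

def classify_ml(x, m_tgt, s_tgt, m_bkg, s_bkg):
    """Maximum likelihood: pick the hypothesis whose PDF is larger at x."""
    L = normal_pdf(x, m_tgt, s_tgt) / normal_pdf(x, m_bkg, s_bkg)   # likelihood ratio
    return "target (H1)" if L > 1.0 else "background (H2)"

# Assumed example models: background mean 0.0, target mean 2.0, unit sigmas
print(classify_ml(0.7, m_tgt=2.0, s_tgt=1.0, m_bkg=0.0, s_bkg=1.0))   # background (H2)
print(classify_ml(1.2, m_tgt=2.0, s_tgt=1.0, m_bkg=0.0, s_bkg=1.0))   # target (H1)
```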

  20. Misclassification Errors In many applications the target and background PDF's overlap enough that the probability of classification errors is significant. When we associate a sample with the wrong model we are making an error called a misclassification. When we label a target sample as background it is called a miss, and when we label a background sample as a target it is called a false alarm. We wish to place the decision threshold at a location that minimizes these two types of error. Where we want to place the threshold depends on the relative costs of a miss and a false alarm. If these two costs are equal we will place the threshold so that the probabilities of these two errors are equal.

  21. If we know that the models are normally distributed we can compute the probabilities of a miss and a false alarm by Pm(T) = ∫ from -∞ to T of p(x|H1) dx and Pfa(T) = ∫ from T to +∞ of p(x|H2) dx, where T is the threshold value (taking the target mean to lie above the background mean). The closed forms of these probabilities as shown above are not really practical for computations. Recall that there is no analytical form for the integral of the normal PDF, therefore we will have to use numerical methods to compute these probabilities. Generally, the cost of a miss Cm is not equal to the cost of a false alarm Cfa. Consider the example of a smoke detector. The cost of having a smoke detector go off when you burn the toast is much less than the cost of it missing an occasional house fire.

  22. Cost/Benefit Analysis We will learn about ways to affect the shape and amount of separation of the two model PDF's. The ideal case is one in which there is no overlap between our target and background models. Unfortunately such a situation is rare in a real application. Once we have done all we can to minimize the overlap between the models, we want to choose a threshold that minimizes the total cost of errors while at the same time maximizing the benefits of correct decisions. If we only know the costs of errors we choose a value for the threshold T that minimizes Ctot(T) = CmPm(T) + CfaPfa(T). When we know both costs and benefits, we minimize Ctot(T) = CmPm(T) + CfaPfa(T) - CdPd(T) - CbPb(T), where Pd(T) and Pb(T) are the probabilities of correctly classifying a target sample and a background sample respectively, and Cd and Cb are the benefits of these correct classifications (entering with a negative sign). The probabilities are related by Pd(T) = 1 - Pm(T) and Pb(T) = 1 - Pfa(T). Substituting these into our cost equation, we can express the total cost/benefit in terms of the error probabilities alone: Ctot(T) = (Cm + Cd)Pm(T) + (Cfa + Cb)Pfa(T), plus a constant term -(Cd + Cb) that does not depend on T and so does not affect where the minimum lies.
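One way to find the minimizing threshold numerically, sketched in Python for the simple cost-only criterion Ctot(T) = Cm·Pm(T) + Cfa·Pfa(T). The model means, standard deviations and costs below are assumed example values (deliberately different from the homework), and the normal CDF is evaluated with the error function rather than by direct integration.

```python
import math
import numpy as np

def norm_cdf(x, mean, sigma):
    """Cumulative normal distribution via the error function."""
    return 0.5 * (1 + math.erf((x - mean) / (sigma * math.sqrt(2))))

def total_cost(T, m_tgt, s_tgt, m_bkg, s_bkg, c_miss, c_fa):
    """Ctot(T) = Cm*Pm(T) + Cfa*Pfa(T), target mean above background mean."""
    p_miss = norm_cdf(T, m_tgt, s_tgt)        # target samples falling below T
    p_fa   = 1 - norm_cdf(T, m_bkg, s_bkg)    # background samples falling above T
    return c_miss * p_miss + c_fa * p_fa

# Brute-force search over candidate thresholds (assumed example models)
thresholds = np.linspace(-2.0, 4.0, 6001)
costs = [total_cost(T, m_tgt=2.0, s_tgt=1.0, m_bkg=0.0, s_bkg=1.0,
                    c_miss=1.0, c_fa=1.0) for T in thresholds]
best = thresholds[int(np.argmin(costs))]
print(f"threshold minimizing total cost: T = {best:.2f}")
# With equal sigmas and equal costs this lands at the midpoint of the two means (1.00).
```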

  23. Homework Assume the mean of a background model mbkg=1.5 and the mean of a target model mtgt=3.8 in the following: a. If the background and target standard deviations are equal, and the cost of a miss equals the cost of a false alarm, find the value of the threshold T that minimizes total cost. b. If stgt=1.0 and sbkg=0.5 and the error costs are equal, find the value of the threshold T that minimizes total cost. c. If the background and target standard deviations are equal, and the cost of a miss is 10 times the cost of a false alarm, find the value of the threshold T that minimizes total cost. d. If stgt=1.0 and sbkg=0.5 and the cost of a miss is 10 times the cost of a false alarm, find the value of the threshold T that minimizes total cost. Bonus: Describe a situation (i.e. give means and standard deviations) in which there would need to be two thresholds defined to minimize total cost for classification between two one-dimensional models.

  24. 3.5 Curve Fitting We have just seen methods of data reduction in which a set of measurements is used to make a decision about the class membership of a sample measurement. If we are only interested in model association, we may discard the sample data and keep just the model identification once the classification has been made. Sometimes we wish to know more about the measurement set than which model it best fits. We will now study another type of data reduction in which a large data set can be represented by a few parameters in a manner that permits us to recover an approximation of the original data. This technique is known as curve fitting. Let's look at a set of N data pairs Z(xi,yi), i=1,...,N, which represents a series of measurements on some set of objects or some time-varying phenomenon. We might first compute the covariance between the x's and the y's to determine the extent to which these two elements are correlated. If we find that they are highly correlated, we will then want to describe the relationship between them, y=f(x). The form of the functional relationship between the x's and the y's depends on the characteristics of the data set that we wish to emphasize and those that we wish to ignore. In other words, we decide, based on the application, what is the signal and what is the noise.

  25. Method of Least Squares One of the simplest functional relationships we can apply is the first order polynomial, or linear function: f(x) = a + bx. Once we establish a criterion for minimizing the discrepancies between the chosen function f(x) and the y components of the data set, we can determine values for the coefficients a and b that best fit the function to the data. There is no absolute method for determining the best values for a and b, but we will argue for the use of a particular technique called the method of least squares. First we want to choose coefficients a and b to minimize the differences between the actual y values and those calculated by the function f(x), that is, the differences dyi = yi - a - bxi for all i=1,...,N. We could simply add all the dyi terms, but this would allow positive and negative differences to cancel each other, possibly masking larger discrepancies between f(x) and the measured values of y. Instead we can minimize the sum of the squares of the differences, Σ dyi².

  26. Finally, we might decide to normalize each dyi term by dividing by the standard deviation of the yi values, giving χ² = Σ [ (yi - a - bxi) / si ]², where si is the standard deviation associated with yi. (There are other reasons for dividing by the standard deviation which need not concern us here.) This quantity is called the "chi-squared" value. It is also the exponent of the normal probability density function, so when we minimize this term we simultaneously maximize the probability that the calculated yi will equal the measured yi. The yi's are the measured values and the a + bxi are the derived values. To find the values of the coefficients that minimize this sum, we compute the partial derivatives of χ² with respect to a and b, and set them equal to zero.

  27. Now we see that our concern about the normalization constant was unfounded in this case: when the si are all equal, the constant divides out once the derivatives are set to zero. Simplifying, we have a·N + b·Σxi = Σyi and a·Σxi + b·Σxi² = Σxiyi, where the sums are all from 1 to N. These equations may be solved simultaneously for the values of a and b which minimize the chi-squared value.
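A small Python sketch that solves these two normal equations directly; the data points below are made-up illustration values. In practice numpy's polyfit produces the same line.

```python
import numpy as np

def linear_least_squares(x, y):
    """Solve the two normal equations
         a*N      + b*sum(x)   = sum(y)
         a*sum(x) + b*sum(x^2) = sum(x*y)
       for the coefficients a and b of f(x) = a + b*x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    N = len(x)
    A   = np.array([[N,       x.sum()],
                    [x.sum(), (x**2).sum()]])
    rhs = np.array([y.sum(), (x*y).sum()])
    a, b = np.linalg.solve(A, rhs)
    return a, b

# Made-up illustration data: roughly y = 2 + 0.5x with a little noise
x = np.array([0, 1, 2, 3, 4, 5])
y = np.array([2.1, 2.4, 3.1, 3.4, 4.1, 4.4])
a, b = linear_least_squares(x, y)
print(f"f(x) = {a:.2f} + {b:.2f} x")
# np.polyfit(x, y, 1) returns the same line (coefficients in the other order).
```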

  28. Similar derivations are possible for higher order polynomial functions. As shown below we can obtain a least-squares fit to a quadratic (2nd order), a cubic (3rd order) and higher order polynomials for the 6 data points. Generally, we wish to avoid polynomial orders approaching the number of data points. Even though higher order polynomials fit the data points better, they tend to become erratic between the data points. As shown in this example, the 5th order function passes through all the data points, but it is obviously not a good representation for the behavior of the data set.

  29. Restricted Interpolation Procedure We may ask, "Which order polynomial fit should we choose when performing a curve fit?" The answer depends on the application. If you want a definite rule that works for most situations, consider the restricted interpolation procedure (a sketch of the procedure is given after this slide). Step 1: Determine the piece-wise linear function represented by connecting the successive data points with straight lines. Step 2: Set the current polynomial order N to 1. Step 3: Compute the least-squares fit to an Nth order polynomial. Step 4: Determine the area between the current-order polynomial function and the piece-wise linear function. Save this value. Step 5: If N=1 then set N=2 and return to Step 3, else go to Step 6. Step 6: Compare the areas for polynomials of order N and N-1. If the area for N is smaller than the area for N-1 then set N=N+1 and return to Step 3, else go to Step 7. Step 7: Choose the N-1 polynomial as the best fit to the data set.
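Here is one possible Python sketch of the restricted interpolation procedure, under the assumption that the area in Step 4 is approximated numerically on a dense grid of x values; the data points are made-up illustration values.

```python
import numpy as np

def area_between(f_poly, x, y, samples_per_gap=50):
    """Step 4: approximate area between a fitted polynomial and the
    piece-wise linear interpolation of the data points (Step 1)."""
    xs = np.linspace(x.min(), x.max(), samples_per_gap * (len(x) - 1))
    piecewise = np.interp(xs, x, y)
    return np.trapz(np.abs(f_poly(xs) - piecewise), xs)

def restricted_interpolation_order(x, y, max_order=None):
    """Steps 2-7: raise the polynomial order while the area keeps shrinking;
    return the order just before it stops improving."""
    max_order = max_order or len(x) - 1
    prev_area = None
    for n in range(1, max_order + 1):                    # Steps 2, 3, 5
        poly = np.poly1d(np.polyfit(x, y, n))            # least-squares fit of order n
        area = area_between(poly, x, y)
        if prev_area is not None and area >= prev_area:  # Step 6: no longer improving
            return n - 1                                 # Step 7
        prev_area = area
    return max_order

x = np.array([0., 1., 2., 3., 4., 5.])
y = np.array([2.1, 2.4, 3.1, 3.4, 4.1, 4.4])
print("chosen polynomial order:", restricted_interpolation_order(x, y))
```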

  30. The figure below shows a comparison of the quadratic function and the linear function with respect to the piece-wise linear interpolation of the data points. Using the restricted interpolation procedure, we find that the best least-squares fit for this data is a linear function. It should be clear that this approach will work for any fitting function for which we can compute the chi-square values.

  31. Correlation Coefficient Finally, we need to consider whether we are justified in our presumption that there is a linear (or any other kind of) relationship between the x and y data pairs. In other words, are the values of x and y correlated? A popular method for determining the level of correlation between two or more measurement sets is to compute the covariance matrix C, and then use the terms of C to compute the correlation coefficient rxy = Cxy / sqrt(Cxx·Cyy) between the components x and y. When rxy is close to zero the data sets are probably independent; conversely, when the value of rxy is close to +/-1 the elements of the data sets are probably correlated. Keep in mind that when the number of data pairs is small, the correlation coefficient cannot reliably determine the level of correlation.
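A short numpy sketch computing rxy from the covariance matrix, applied to the eight height/weight samples used earlier (subject to the small-sample caveat above).

```python
import numpy as np

def correlation_coefficient(x, y):
    """r_xy = Cxy / sqrt(Cxx * Cyy), built from the 2x2 covariance matrix."""
    C = np.cov(x, y, bias=True)            # 1/N covariance matrix of the two components
    return C[0, 1] / np.sqrt(C[0, 0] * C[1, 1])

# The height/weight samples used earlier
h = np.array([72, 70, 68, 77, 66, 71, 69, 58], dtype=float)
w = np.array([180, 198, 112, 205, 98, 230, 123, 84], dtype=float)
print(f"r_hw = {correlation_coefficient(h, w):.2f}")
# A value near +/-1 suggests correlation; near zero suggests independence,
# keeping in mind that only 8 samples is a small set.
```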
