13 Collecting Statistical Data

rlincoln

Presentation Transcript


  1. 13 Collecting Statistical Data 13.1 The Population 13.2 Sampling 13.3 Random Sampling 13.4 Sampling: Terminology and Key Concepts 13.6 Clinical Studies

  2. The Population Every statistical statement refers, directly or indirectly, to some group of individuals or objects. In statistical terminology, this collection of individuals or objects is called the population. The first question we should ask ourselves when trying to make sense of a statistical statement is, “What is the population to which the statement applies?”

  3. The N Value Given a specific population, an obviously relevant question is, “How many individuals or objects are there in that population?” This number is called the N-value of the population. (It is common practice in statistics to use capital N to denote population sizes.) It is important to keep in mind the distinction between the N-value (a number specifying the size of the population) and the population itself.

  4. Example 13.2 The Return of the Bald Eagle Over a period of many years, the United States Fish and Wildlife Service was able to keep a remarkably accurate tally of the number of bald eagle breeding pairs in the contiguous 48 states. (Breeding pairs are used as a proxy for the health of the overall population.) A tremendous amount of effort has gone into collecting and verifying these N-values, which, for a wildlife population, are of remarkable accuracy.

  5. Example 13.2 The Return of the Bald Eagle: Part 2 The figure on the next slide summarizes the population numbers over the period 1963–2000. (No tallies were conducted in 1964–1973, 1975–1980, 1983, and 1985.) Since 2000 the bald eagle population has grown to the point that the U.S. Fish and Wildlife Service has discontinued the annual tallies.

  6. Example 13.2 The Return of the Bald Eagle: Part 2 [Figure: bald eagle breeding pairs in the contiguous 48 states, 1963–2000]

  7. Example 13.3 N Is in the Eye of the Beholder Andy has a coin jar full of quarters. He is hoping that there is enough money in the jar to pay for a new baseball glove. Dad says to go count them, and if there isn’t enough, he will lend Andy the difference. Andy dumps the quarters out of the jar, makes a careful tally, and comes up with a count of 116 quarters.

  8. Example 13.3 N Is in the Eye of the Beholder What is the N-value here? The answer depends on how we define the population. Are we counting coins or money? To Dad, who will end up stuck with all the quarters, the total number of coins might be the most relevant issue. Thus, to Dad, N = 116. Andy, on the other hand, is concerned with how much money is in the jar. If he were to articulate his point of view in statistical language, he would say that N = 29 (dollars).
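
The arithmetic behind the two N-values can be checked with a short sketch. The variable names are illustrative only; the figures (116 quarters, 25 cents each) come from the example.

```python
# The same jar, described as two different populations (Example 13.3).
QUARTER_VALUE_DOLLARS = 0.25

quarters_counted = 116

# Dad's population: the coins themselves.
n_coins = quarters_counted

# Andy's population: the dollars in the jar.
n_dollars = quarters_counted * QUARTER_VALUE_DOLLARS

print(f"N (coins)   = {n_coins}")
print(f"N (dollars) = {n_dollars:.0f}")
```

Nothing about the jar changes between the two lines; only the definition of the population does.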

  9. Data The word data is the plural of the Latin word datum, meaning “something given,” and in ordinary usage has a somewhat broader meaning than the one we will give it in this chapter. For our purposes we will use the word data to mean any type of information packaged in numerical form, and we will adhere to the standard convention that as a noun it can be used in both singular (“the data is…”) and plural (“the data are…”) forms.

  10. Census The process of collecting data by going through every member of the population is called a census. The idea behind a census is simple enough, but in practice a census requires a great deal of “cooperation” from the population. For larger, more dynamic populations (wildlife, humans, etc.), accurate tallies are inherently difficult if not impossible, and in these cases the best we can hope for is a good estimate of the N-value.

  11. 2000 Census Undercounts The most notoriously difficult N-value question around is, “What is the N-value of the national population of the United States?” This is a question the United States Census tries to answer every 10 years, with very little success. The 2000 U.S. Census was the largest single peacetime undertaking of the federal government (it employed over 850,000 people and cost about $6.5 billion), and yet it missed counting between 3 and 4 million people.

  12. Example 13.4 2000 Census Undercounts

  13. Example 13.4 2000 Census Undercounts Given the critical importance of the U.S. Census and given the tremendous resources put behind the effort by the federal government, why is the head count so far off?

  14. CASE STUDY 1 THE U.S. CENSUS The Constitution of the United States mandates that a national census be conducted every 10 years. The original intent of the census was to “count heads” for a twofold purpose: taxes and political representation. The count was to exclude “Indians not taxed” and to count slaves as “three-fifths of a free Person.”

  15. CASE STUDY 1 THE U.S. CENSUS Since then, the scope and purpose of the U.S. Census have been modified and expanded by the 14th Amendment and the courts in many ways: ■ Besides counting heads, the U.S. Census Bureau now collects additional information about the population: sex, age, race, ethnicity, marital status, housing, income, and employment data. Some of this information is updated on a regular basis, not just every 10 years.

  16. CASE STUDY 1 THE U.S. CENSUS ■ Census data are now used for many important purposes beyond the original ones of taxation and representation: the allocation of billions of federal dollars to states, counties, cities, and municipalities; the collection of other important government statistics such as the Consumer Price Index and the Current Population Survey; the redrawing of legislative districts within each state; and the strategic planning of production and services by business and industry.

  17. CASE STUDY 1 THE U.S. CENSUS ■ For the purposes of the Census, the United States population is defined as consisting of “all persons physically present and permanently residing in the United States.” Citizens, legal resident aliens, and even illegal aliens are meant to be included.

  18. Taking a Census Nowadays, the notion that if we put enough money and effort into it, all individuals living in the United States can be counted like coins in a jar is unrealistic. In 1790, when the first U.S. Census was carried out, the population was smaller and relatively homogeneous, people tended to stay in one place, and, by and large, they felt comfortable in their dealings with the government. Under these conditions it might have been possible for census takers to count heads accurately.

  19. Taking a Census Today’s conditions are completely different. People are constantly on the move. Many distrust the government. In large urban areas many people are homeless or don’t want to be counted. And then there is the apathy of many people who think of a census form as another piece of junk mail.

  20. Taking a Census If the Census undercount were consistent among all segments of the population, the undercount problem could be solved easily. Unfortunately, the modern U.S. Census is plagued by what is known as a differential undercount. Ethnic minorities, migrant workers, and the urban poor populations have significantly larger undercount rates than the population at large, and the undercount rates vary significantly within these groups.
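
A small numerical sketch may help show why a consistent undercount is easy to repair and a differential one is not. Every figure below is hypothetical (the two groups, their sizes, and the miss rates are invented for illustration and are not Census numbers), and the helper names `raw_count` and `scale_up` are likewise made up for this sketch.

```python
# Hypothetical two-group population; none of these figures come from the Census.
true_counts = {"group_a": 900_000, "group_b": 100_000}


def raw_count(true, miss_rates):
    """Head count after each group loses its own share of uncounted people."""
    return {g: round(n * (1 - miss_rates[g])) for g, n in true.items()}


def scale_up(counted, overall_miss_rate):
    """Correct every group by one common factor, as if the undercount were uniform."""
    return {g: round(n / (1 - overall_miss_rate)) for g, n in counted.items()}


# Case 1: a consistent 2% undercount in both groups.
uniform = raw_count(true_counts, {"group_a": 0.02, "group_b": 0.02})
uniform_fixed = scale_up(uniform, 0.02)

# Case 2: a differential undercount (1% vs. 11%); the overall miss rate is still 2%.
differential = raw_count(true_counts, {"group_a": 0.01, "group_b": 0.11})
differential_fixed = scale_up(differential, 0.02)

print("uniform, corrected:     ", uniform_fixed)
print("differential, corrected:", differential_fixed)
```

In the first case a single scaling factor recovers both groups. In the second, the same factor overstates the well-counted group and still understates the poorly counted one, even though the national total comes out right.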

  21. Taking a Census Using modern statistical techniques, it is possible to make adjustments to the raw Census figures that correct some of the inaccuracy caused by the differential undercount, but in 1999 the Supreme Court ruled in Department of Commerce et al. v. United States House of Representatives et al. that only the raw numbers, and not statistically adjusted numbers, can be used for the purposes of apportionment of Congressional seats among the states.

  22. 13 Collecting Statistical Data 13.1 The Population 13.2 Sampling 13.3 Random Sampling 13.4 Sampling: Terminology and Key Concepts 13.5 The Capture-Recapture Method 13.6 Clinical Studies

  23. A Survey The practical alternative to a census is to collect data only from some members of the population and use that data to draw conclusions and make inferences about the entire population. Statisticians call this approach a survey (or a poll when the data collection is done by asking questions). The subgroup chosen to provide the data is called the sample, and the act of selecting a sample is called sampling.

  24. A Survey Ideally, every member of the population should have an opportunity to be chosen as part of the sample, but this is possible only if we have a mechanism to identify each and every member of the population. In many situations this is impossible. Say we want to conduct a public opinion poll before an election. The population for the poll consists of all voters in the upcoming election, but how can we identify who is and is not going to vote ahead of time? We know who the registered voters are, but among this group there are still many nonvoters.

  25. A Survey The first important step in a survey is to distinguish between the population to which the survey applies (the target population) and the actual subset of the population from which the sample will be drawn, called the sampling frame. The ideal scenario is when the sampling frame is the same as the target population, which would mean that every member of the target population is a candidate for the sample. When this is impossible (or impractical), an appropriate sampling frame must be chosen.

  26. Example 13.5 Sampling Frames Can Make a Difference A CNN/USA Today/Gallup poll conducted right before the November 2, 2004, national election asked the following question: “If the election for Congress were being held today, which party’s candidate would you vote for in your congressional district, the Democratic Party’s candidate or the Republican Party’s candidate?”

  27. Example 13.5 Sampling Frames Can Make a Difference When the question was asked of 1866 registered voters nationwide, the results of the poll were 49% for the Democratic Party candidate, 47% for the Republican Party candidate, 4% undecided. When exactly the same question was asked of 1573 likely voters nationwide, the results of the poll were 50% for the Republican Party candidate, 47% for the Democratic Party candidate, 3% undecided.

  28. Example 13.5 Sampling Frames Can Make a Difference Clearly, one of the two polls had to be wrong, because in the first poll the Democrats beat out the Republicans, whereas in the second poll it was the other way around. The only significant difference between the two polls was the choice of the sampling frame: in the first poll the sampling frame used was all registered voters, and in the second poll the sampling frame used was all likely voters.

  29. Example 13.5 Sampling Frames Can Make a Difference Although neither one faithfully represents the target population of actual voters, using likely voters instead of registered voters for the sampling frame gives much more reliable data. (The second poll predicted very closely the average results of the 2004 congressional races across the nation.) So, why don’t all pre-election polls use likely voters as a sampling frame instead of registered voters?

  30. Example 13.5 Sampling Frames Can Make a Difference The answer is economics. Registered voters are relatively easy to identify: every county registrar can produce an accurate list of registered voters. Not every registered voter votes, though, and it is much harder to identify those who are “likely” to vote. Typically, one has to look at demographic factors (age, ethnicity, etc.) as well as past voting behavior to figure out who is likely to vote and who isn’t. Doing that takes a lot more effort, time, and money.

  31. Sampling The basic philosophy behind sampling is simple and well understood: if we have a sample that is “representative” of the entire population, then whatever we want to know about a population can be found out by getting the information from the sample. If we are to draw reliable data from a sample, we must (a) find a sample that is representative of the population, and (b) determine how big the sample should be. These two issues go hand in hand, and we will discuss them next.

  32. Sampling Sometimes a very small sample can be used to get reliable information about a population, no matter how large the population is. This is the case when the population is highly homogeneous. The more heterogeneous a population gets, the more difficult it is to find a representative sample. The difficulties can be well illustrated by taking a look at the history of public opinion polls.

  33. CASE STUDY 2 THE 1936 LITERARY DIGEST POLL The U.S. presidential election of 1936 pitted Alfred Landon, the Republican governor of Kansas, against the incumbent Democratic President, Franklin D. Roosevelt. At the time of the election, the nation had not yet emerged from the Great Depression, and economic issues such as unemployment and government spending were the dominant themes of the campaign.

  34. CASE STUDY 2 THE 1936 LITERARY DIGEST POLL The Literary Digest, one of the most respected magazines of the time, conducted a poll a couple of weeks before the election. The magazine had used polls to accurately predict the results of every presidential election since 1916, and their 1936 poll was the largest and most ambitious poll ever. The sampling frame for the Literary Digest poll consisted of an enormous list of names that included:

  35. CASE STUDY 2 THE 1936 LITERARY DIGEST POLL (1) every person listed in a telephone directory anywhere in the United States, (2) every person on a magazine subscription list, and (3) every person listed on the roster of a club or professional association. From this sampling frame a list of about 10 million names was created, and every name on this list was mailed a mock ballot and asked to mark it and return it to the magazine.

  36. CASE STUDY 2 THE 1936 LITERARY DIGEST POLL Based on the poll results, the Literary Digest predicted a landslide victory for Landon with 57% of the vote, against Roosevelt’s 43%. Amazingly, the election turned out to be a landslide victory for Roosevelt with 62% of the vote, against 38% for Landon. The difference between the poll’s prediction and the actual election results was a whopping 19 percentage points, the largest error ever in a major public opinion poll.
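
The size of the miss is just the gap between predicted and actual vote shares, measured in percentage points. A minimal sketch, using the four figures quoted above (the function name `point_errors` is an illustrative choice):

```python
# Predicted and actual vote shares, in percent, as quoted in the case study.
predicted = {"Landon": 57, "Roosevelt": 43}
actual = {"Landon": 38, "Roosevelt": 62}


def point_errors(predicted, actual):
    """Signed error per candidate, in percentage points (prediction minus result)."""
    return {name: predicted[name] - actual[name] for name in predicted}


errors = point_errors(predicted, actual)
largest_miss = max(abs(e) for e in errors.values())

print(errors)        # {'Landon': 19, 'Roosevelt': -19}
print(largest_miss)  # 19
```

Because there are only two candidates in these figures, overstating one by 19 points necessarily understates the other by the same amount.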

  37. CASE STUDY 2 THE 1936 LITERARY DIGEST POLL For the same election, a young pollster named George Gallup was able to predict accurately a victory for Roosevelt using a sample of “only” 50,000 people. In fact, Gallup also publicly predicted, to within 1%, the incorrect results that the Literary Digest would get, using a sample of just 3000 people taken from the same sampling frame the magazine was using. What went wrong with the Literary Digest poll and why was Gallup able to do so much better?

  38. CASE STUDY 2 THE 1936 LITERARY DIGEST POLL The first thing seriously wrong with the Literary Digest poll was the sampling frame, consisting of names taken from telephone directories, lists of magazine subscribers, rosters of club members, and so on. Telephones in 1936 were something of a luxury, and magazine subscriptions and club memberships even more so, at a time when 9 million people were unemployed.

  39. CASE STUDY 2 THE 1936 LITERARY DIGEST POLL When it came to economic status, the Literary Digest sample was far from being a representative cross section of the voters. This was a critical problem, because voters often vote on economic issues, and given the economic conditions of the time, this was especially true in 1936.

  40. CASE STUDY 2 THE 1936 LITERARY DIGEST POLL When the choice of the sample has a built-in tendency (whether intentional or not) to exclude a particular group or characteristic within the population, we say that a survey suffers from selection bias. It is obvious that selection bias must be avoided, but it is not always easy to detect it ahead of time. Even the most scrupulous attempts to eliminate selection bias can fall short.
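
To see how a skewed sampling frame distorts a result no matter how many people are polled, consider the sketch below. All numbers are hypothetical: the two groups, their shares of the electorate and of the frame, and their support levels are invented for illustration and are not estimates for 1936.

```python
# Hypothetical electorate; every share and support level here is invented.
# Each entry: (share of all voters, share of the sampling frame, support for candidate R)
groups = {
    "has_phone": (0.40, 0.90, 0.45),
    "no_phone": (0.60, 0.10, 0.70),
}


def weighted_support(groups, weight_index):
    """Average support for candidate R, weighting each group by the chosen share."""
    return sum(g[weight_index] * g[2] for g in groups.values())


true_support = weighted_support(groups, 0)   # weights: shares of the electorate
frame_support = weighted_support(groups, 1)  # weights: shares of the sampling frame

print(f"support among all voters:     {true_support:.1%}")
print(f"support within sampling frame: {frame_support:.1%}")
```

The second figure is what a poll would report even if it reached every single person in the frame, so enlarging the sample cannot remove this kind of error.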

  41. CASE STUDY 2 THE 1936 LITERARY DIGEST POLL The second serious problem with the Literary Digest poll was the issue of nonresponse bias. In a typical survey it is understood that not every individual is willing to respond to the survey request (and in a democracy we cannot force them to do so). Those individuals who do not respond to the survey request are called nonrespondents, and those who do are called respondents. The percentage of respondents out of the total sample is called the response rate.

  42. CASE STUDY 2 THE 1936 LITERARY DIGEST POLL For the Literary Digest poll, out of a sample of 10 million people who were mailed a mock ballot only about 2.4 million mailed a ballot back, resulting in a 24% response rate. When the response rate to a survey is low, the survey is said to suffer from nonresponse bias.
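
The response rate is simply respondents divided by the size of the sample. A short sketch using the approximate figures quoted above (the function name `response_rate` is an illustrative choice):

```python
def response_rate(respondents, sample_size):
    """Fraction of the sample that actually responded."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return respondents / sample_size


# Approximate Literary Digest figures from the case study.
ballots_mailed = 10_000_000
ballots_returned = 2_400_000

rate = response_rate(ballots_returned, ballots_mailed)
print(f"response rate: {rate:.0%}")
```

The remaining 76% of the sample are nonrespondents, and nothing in the returned ballots says how they would have voted.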

  43. CASE STUDY 2 THE 1936 LITERARY DIGEST POLL One of the significant problems with the Literary Digest poll was that the poll was conducted by mail. This approach is the most likely to magnify nonresponse bias, because people often consider a mailed questionnaire just another form of junk mail. Of course, given the size of their sample, the Literary Digest hardly had a choice. This illustrates another important point: Bigger is not always better, and a big sample can be more of a liability than an asset.

  44. CASE STUDY 2 THE 1936 LITERARY DIGEST POLL The Literary Digest story has two morals: (1) You’ll do better with a well-chosen small sample than with a badly chosen large one, and (2) watch out for selection bias and nonresponse bias.

  45. Quota Sampling Quota sampling is a systematic effort to force the sample to be representative of a given population through the use of quotas: the sample should have so many women, so many men, so many blacks, so many whites, so many people living in urban areas, so many people living in rural areas, and so on. The proportions in each category in the sample should be the same as those in the population.

  46. Quota Sampling If we can assume that every important characteristic of the population is taken into account when the quotas are set up, it is reasonable to expect that the sample will be representative of the population and produce reliable data.
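
Setting quotas amounts to splitting the sample size in the same proportions as the population. The sketch below shows one way to do that; the categories and their shares are hypothetical, and rounding by largest remainder is an assumption of this sketch rather than a method described in the slides.

```python
def quotas(population_shares, sample_size):
    """Split sample_size across categories in proportion to their population shares.

    Leftover places after rounding down go to the categories with the
    largest fractional remainders, so the quotas always sum to sample_size.
    """
    exact = {k: share * sample_size for k, share in population_shares.items()}
    counts = {k: int(v) for k, v in exact.items()}
    leftover = sample_size - sum(counts.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts


# Hypothetical population shares (invented for illustration; they sum to 1).
shares = {
    "urban_women": 0.30,
    "urban_men": 0.28,
    "rural_women": 0.22,
    "rural_men": 0.20,
}

result = quotas(shares, 1000)
print(result)
```

Matching these proportions only guarantees that the sample resembles the population on the characteristics that were given quotas; it says nothing about characteristics that were left out.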

  47. CASE STUDY 3 THE 1948 PRESIDENTIAL ELECTION George Gallup had introduced quota sampling as early as 1935 and had used it successfully to predict the winner of the 1936, 1940, and 1944 presidential elections. Quota sampling thus acquired the reputation of being a “scientifically reliable” sampling method, and by the 1948 presidential election all three major national polls (the Gallup poll, the Roper poll, and the Crossley poll) used quota sampling to make their predictions.

  48. CASE STUDY 3 THE 1948 PRESIDENTIAL ELECTION For the 1948 election between Thomas Dewey and Harry Truman, Gallup conducted a poll with a sample of approximately 3250 people. Each individual in the sample was interviewed in person by a professional interviewer to minimize nonresponse bias, and each interviewer was given a very detailed set of quotas to meet: for example, 7 white males under 40 living in a rural area, 5 black males over 40 living in a rural area, 6 white females under 40 living in an urban area, and so on.

  49. CASE STUDY 3 THE 1948 PRESIDENTIAL ELECTION By the time all the interviewers met their quotas, the entire sample was expected to accurately represent the entire population in every respect: gender, race, age, and so on. Based on his sample, Gallup predicted that Dewey, the Republican candidate, would win the election with 49.5% of the vote to Truman’s 44.5% (with third-party candidates Strom Thurmond and Henry Wallace accounting for the remaining 6%).

  50. CASE STUDY 3 THE 1948 PRESIDENTIAL ELECTION The Roper and Crossley polls also predicted an easy victory for Dewey. The actual results of the election turned out to be almost the exact reverse of Gallup’s prediction: Truman got 49.9% and Dewey 44.5% of the national vote. Truman’s victory was a great surprise to the nation as a whole. So convinced was the Chicago Daily Tribune of Dewey’s victory that it went to press on its early edition for November 3, 1948, with the headline “Dewey Defeats Truman.”
