Designing Samples and Experiments: Understanding Data Production Methods

Chapter 3 Producing Data 3.1 Designing Samples 3.2 Designing Experiments

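Motivation • Infer the characteristics of a population from a (small) sample. • Choose a sample that represents the population. • Obtain the information while disturbing the population as little as possible.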

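Observation vs. Experiment • Observational study: observe each individual and measure the variables of interest without attempting to influence the responses. • A sample survey is one example. • Experimental study: impose a specific treatment on each individual, then observe and measure the responses of the variables of interest. • Randomized comparative experiments are the preferred design.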

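Example 3.1 • Helping mothers on welfare find jobs. • Question: are mothers who take part in a job-training program more likely to find work? • In an observational study we cannot control background factors that affect job-finding, such as age, education, and health. An observational study therefore cannot tell us the effect of the policy.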

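Confounding • Two or more variables (explanatory variables or lurking variables) are confounded when their effects on the response variable cannot be distinguished from one another. • Example: the effect of job training and placement assistance cannot be distinguished from the effect of the recipients' own backgrounds (education level, age, and so on). • More elaborate experimental designs are generally used to separate the effects.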

3.1 Designing Samples

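Population and Sample • The entire group about which we want to draw conclusions is the population. The part of the population that is selected, and from which we actually collect data, is the sample. • The method used to choose the sample is the sample design.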

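Voluntary Response Samples and Convenience Sampling • Voluntary response sample: people choose for themselves to express an opinion on an issue. • Examples: call-in polls and web questionnaires. Voluntary respondents mostly hold strong opinions, so these samples are usually biased. • Convenience sampling: the sample is chosen subjectively, according to what is easiest for the investigator. • Examples: street interviews and questionnaires handed out in shopping malls. The opinions collected are biased to varying degrees by the subjectively chosen location and method. • Bias: a sample design is biased if it systematically favors certain conclusions.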

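Selecting Samples by Probability • Each individual in the population is assigned a known probability of selection (between 0 and 1). A sample chosen according to these known probabilities is a probability sample.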

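Simple Random Sample • A simple random sample (SRS) of size n is chosen so that every individual in the population has the same chance of being selected and every possible set of n individuals has the same chance of being the sample. • An SRS is a probability sample in which all chances are equal. • It is usually drawn with a computer program or statistical software, or from a table of random digits.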
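Drawing an SRS with software can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `simple_random_sample`, the 30 labels, and the seed are all assumptions chosen for the example. It relies on the standard library's `random.sample`, which draws without replacement.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct individuals so that every set of size n is equally likely."""
    rng = random.Random(seed)          # a seed makes the draw reproducible
    return rng.sample(population, n)   # sampling without replacement

# Hypothetical population: 30 individuals labeled 01 to 30.
labels = [f"{i:02d}" for i in range(1, 31)]
chosen = simple_random_sample(labels, 5, seed=1)
print(chosen)
```

Fixing the seed is useful for teaching or auditing, because the same sample can be drawn again later.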

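Systematic Random Sample • Systematic random sample: for a sample of size n from a population of size N, let k = N/n (assumed here to be a whole number). Choose a number a at random from 1 to k. Then {a, a+k, a+2k, …, a+(n-1)k} is a systematic random sample of size n. • Every individual has the same chance of being selected (1/k). • But not every set of n individuals has the same chance of being the sample: only k different samples are possible.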
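The systematic rule above can be written as a short Python sketch. The function name `systematic_sample` and the figures N = 100, n = 5 are illustrative assumptions, and the sketch only handles the case where n divides N exactly.

```python
import random

def systematic_sample(N, n, seed=None):
    """Return labels a, a+k, ..., a+(n-1)k with k = N/n and a random start a in 1..k."""
    if n <= 0 or N % n != 0:
        raise ValueError("this sketch assumes N is a positive multiple of n")
    k = N // n
    a = random.Random(seed).randint(1, k)   # random start, both ends inclusive
    return [a + i * k for i in range(n)]

sample = systematic_sample(100, 5, seed=0)
print(sample)
```

Because the start a determines the whole sample, only k = 20 different samples can occur in this example, which is why a systematic sample is not an SRS.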

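Choosing a Simple Random Sample • Step 1: Label. Assign a number to every individual in the population. • Step 2: Table. Use a table of random digits to select labels. • Example: choose 5 of 30 businesses. • Step 1: List the businesses and label them 01 to 30. • Step 2: Read the table. Line 130 reads • 69051 64817 87174 09517 84534 06489 87201 97245 • The first 10 two-digit groups are 69 05 16 48 17 87 17 40 95 17. • Ignore 00 and 31 to 99. This selects 05, 16, and 17; the later 17s are repeats, so skip them and continue. • The next 10 two-digit groups are 84 53 40 64 89 87 20 19 72 45. • These add 20 and 19, so the final sample is 05, 16, 17, 20, 19.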
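The table lookup above can be replayed in Python as a check on the arithmetic. The function name `select_labels` is an assumption made for this sketch; the digits are line 130 as quoted on the slide.

```python
def select_labels(digits, n_labels, n_sample, width=2):
    """Read fixed-width groups from a digit string, keeping valid, unrepeated labels."""
    digits = "".join(ch for ch in digits if ch.isdigit())  # drop the spaces
    chosen = []
    for i in range(0, len(digits) - width + 1, width):
        label = int(digits[i:i + width])
        # Skip 00, anything above n_labels, and labels already chosen.
        if 1 <= label <= n_labels and label not in chosen:
            chosen.append(label)
            if len(chosen) == n_sample:
                break
    return chosen

line_130 = "69051 64817 87174 09517 84534 06489 87201 97245"
picked = select_labels(line_130, n_labels=30, n_sample=5)
print(picked)  # [5, 16, 17, 20, 19]
```

The result matches the sample 05, 16, 17, 20, 19 obtained by hand.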

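Stratified Sample • Choosing a stratified sample • Step 1: Divide the individuals in the population into groups, called strata, according to a characteristic of special interest or so that the individuals within each group are similar. • Step 2: Take a separate SRS in each stratum and combine these to form the stratified sample.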
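The two steps can be sketched in Python as follows. Everything named here is hypothetical: the function `stratified_sample`, the station identifiers, and the two strata are invented for illustration, and the sketch takes the same number of individuals from every stratum, which is only one of several possible allocations.

```python
import random
from collections import defaultdict

def stratified_sample(individuals, stratum_of, n_per_stratum, seed=None):
    """Step 1: group individuals into strata. Step 2: take an SRS in each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for ind in individuals:
        strata[stratum_of(ind)].append(ind)
    sample = []
    for name in sorted(strata):                    # fixed order for reproducibility
        members = strata[name]
        take = min(n_per_stratum, len(members))    # never ask for more than exist
        sample.extend(rng.sample(members, take))
    return sample

# Hypothetical population: 12 stations, each tagged with a community type.
stations = [(f"station_{i:02d}", "urban" if i % 3 else "rural") for i in range(1, 13)]
sample = stratified_sample(stations, stratum_of=lambda s: s[1], n_per_stratum=2, seed=7)
print(sample)
```

Sampling within each stratum guarantees that every stratum is represented, which a single SRS of the whole population does not.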

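Distributing Song Royalties • The composers' organization ASCAP collects $435 million a year in performance royalties from radio stations (which play 53 million hours of music a year) and distributes the money among its member songwriters. • All stations are divided into 432 strata by type of community (metropolitan, rural, and so on), geographic region (New England, Pacific, and so on), and the size of the license fee paid (which reflects audience size). • In each stratum a few stations are chosen at random and recorded at random times for several hours, 60,000 hours in all. Experts identify the composer and lyricist of every song played, and the royalties are distributed in proportion to the recorded plays.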

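Multistage Samples • Samples obtained with a multistage sample design. • Commonly used for national household surveys, because: • An SRS would be scattered too widely, which makes any survey other than a telephone survey impractical. • The population structure is complex, so a representative sample is hard to obtain.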

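A Multistage Sample in Practice • A national household survey • Step 1: Divide the United States into 2,007 geographic areas called primary sampling units (PSUs). Select 754 PSUs: the 428 PSUs with the largest populations, plus others chosen at random. • Step 2: Divide each selected PSU into smaller areas called blocks. Stratify the blocks by ethnicity and other factors, and take a stratified sample of blocks. • Step 3: Within each selected block, group neighboring housing units into clusters of four. Select clusters at random and survey them.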
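The three stages can be sketched on toy data in Python. This is a simplified illustration rather than the survey's actual procedure: the function `multistage_sample`, the six made-up PSUs, and all counts are assumptions, and stage 2 uses a plain SRS of blocks instead of the stratified selection described above.

```python
import random

def multistage_sample(psus, n_certain, n_random, blocks_per_psu,
                      clusters_per_block, seed=None):
    """Stage 1: pick PSUs. Stage 2: pick blocks. Stage 3: pick clusters."""
    rng = random.Random(seed)
    # Stage 1: the largest PSUs are included with certainty, the rest at random.
    by_size = sorted(psus, key=lambda p: p["population"], reverse=True)
    chosen_psus = by_size[:n_certain] + rng.sample(by_size[n_certain:], n_random)
    result = []
    for psu in chosen_psus:
        # Stage 2 (simplified): SRS of blocks, not a stratified sample.
        for block in rng.sample(psu["blocks"], blocks_per_psu):
            # Stage 3: random clusters of housing units within the block.
            for cluster in rng.sample(block["clusters"], clusters_per_block):
                result.append((psu["name"], block["name"], cluster))
    return result

# Toy data: 6 PSUs, each with 3 blocks, each block with 4 clusters.
psus = [
    {"name": f"PSU{i}", "population": 1000 * i,
     "blocks": [{"name": f"B{i}-{j}",
                 "clusters": [f"C{i}-{j}-{c}" for c in range(4)]}
                for j in range(3)]}
    for i in range(1, 7)
]
sample = multistage_sample(psus, n_certain=2, n_random=2,
                           blocks_per_psu=2, clusters_per_block=1, seed=0)
print(len(sample))  # 4 PSUs x 2 blocks x 1 cluster = 8
```

Each later stage samples only inside the units chosen at the previous stage, so the interviewers' travel stays concentrated in a limited number of areas.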

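Cautions About Sample Surveys • Undercoverage • The list of the population is incomplete. • Some groups cannot be contacted and are therefore left out. • Examples in a telephone survey: students living in dormitories, prison inmates, and homeless people. • The omitted groups may tend to be poorer. • Nonresponse • Selected individuals cannot be reached or refuse to answer. • This is the more serious source of bias. • Response bias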

Wording Effects • A survey sponsored by makers of disposable diapers • "It is estimated that disposable diapers account for less than 2% of the trash in today's landfills. In contrast, beverage containers, third-class mail and yard wastes are estimated to account for about 21% of the trash in landfills. Given this, in your opinion, would it be fair to ban disposable diapers?" • The question supplies only information favorable to the sponsor.

Wording Effects (continued) • Opinion polls about the Holocaust • First poll, 1992 • The question was "Does it seem possible or does it seem impossible to you that the Nazi extermination of the Jews never happened?" 22% answered "possible." • Second poll • The question was "Does it seem possible to you that the Nazi extermination of the Jews never happened, or do you feel certain that it happened?" 1% answered "possible." • The convoluted wording of the first question confused respondents.

Cautions About Sample Surveys (continued) • Never trust the results of a sample survey until you have read the exact questions posed. • The amount of nonresponse and the date of the survey are also important. • Good statistical design is a part, but only a part, of a trustworthy survey.

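Inference About the Population • Inferences about a population that are based on a probability sample are less likely to be biased. • Results differ from sample to sample. Because the sample is chosen by chance, we can use the probability behavior of the sampling results to work out how large the error of an inference is likely to be. • In general, the larger the probability sample, the smaller the error in inferences about the population.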
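A small simulation can illustrate the last point. This is a sketch under invented assumptions: a hypothetical population of 10,000 individuals in which 60% have some characteristic, sample sizes of 25 and 400, and 500 repeated samples of each size. The function name `spread_of_sample_proportion` is also made up for the example.

```python
import random
import statistics

def spread_of_sample_proportion(population, n, repeats, rng):
    """Standard deviation of the sample proportion over repeated SRSs of size n."""
    props = [sum(rng.sample(population, n)) / n for _ in range(repeats)]
    return statistics.pstdev(props)

rng = random.Random(42)
population = [1] * 6000 + [0] * 4000   # hypothetical population, true proportion 0.6

spread_small = spread_of_sample_proportion(population, 25, 500, rng)
spread_large = spread_of_sample_proportion(population, 400, 500, rng)
print(f"n=25:  spread {spread_small:.3f}")
print(f"n=400: spread {spread_large:.3f}")
```

The sample proportions from the larger samples cluster much more tightly around 0.6. For an SRS the spread shrinks roughly in proportion to the square root of n, so a sample 16 times larger gives about a quarter of the spread.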
