**2. **Analyzing Data: NHANES 1999-2004
Preparing your data files: downloading demographic, questionnaire, exam and lab files.
Files are no longer available as self-extracting zip files.
Documentation and procedure files are now in Adobe PDF format and can be viewed or accessed directly via the web link.
Clicking on the data link will allow you to store the data file or open it directly with SAS.
Data files are in SAS transport (.xpt) format.
The first part of analyzing NHANES data is to create and prepare your data set. Ms. Louis has already discussed with you how to find and download the demographic, lab, exam and questionnaire files.
In the past, files were available both as a single self-extracting zip file and individually as .xpt or .pdf files. Now only the .xpt and .pdf files are provided; there are no more zip files.

**3. **Know your data
Read the documentation!!
Before you begin to really work with your data the most important thing to do is READ the documentation.

**4. **Preparing your data files: Merging
Merge all files by sequence number to the demographic file.
Verify the numbers of records merged and the final sample number against the published frequencies on the web.
Be sure they are what you expected and all merges worked correctly.
It is important that you use the demographic file as your primary file and merge the relevant lab, exam or questionnaire files onto it by sequence number.
Review your SAS log to verify that records have merged successfully and that the total number of records equals the number of records in the demographic file.
Run frequencies on variables from each file and check them against the expected numbers found in the data file documentation.
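The merge-and-verify step can be sketched outside SAS as well. The following is a minimal Python illustration, not the presenters' code: the helper `merge_on_seqn` and the toy records are hypothetical, `SEQN` stands for the sequence number, and `BMXBMI` is used only as an example exam variable.

```python
def merge_on_seqn(demo, *component_files):
    """Left-join component records onto the demographic records by SEQN.

    The demographic file is the primary file, so every demographic record
    is kept and component records without a demographic match are ignored.
    """
    merged = {rec["SEQN"]: dict(rec) for rec in demo}
    for component in component_files:
        for rec in component:
            if rec["SEQN"] in merged:
                merged[rec["SEQN"]].update(rec)
    return list(merged.values())


# Toy data (hypothetical values, for illustration only).
demo = [
    {"SEQN": 1, "RIAGENDR": 2},
    {"SEQN": 2, "RIAGENDR": 1},
    {"SEQN": 3, "RIAGENDR": 2},
]
exam = [
    {"SEQN": 1, "BMXBMI": 24.1},
    {"SEQN": 3, "BMXBMI": 31.0},
]

merged = merge_on_seqn(demo, exam)

# Verification: the merged file has exactly one record per demographic record.
assert len(merged) == len(demo)

# Count how many records actually picked up exam data, to compare with
# the published frequencies for that component.
n_with_exam = sum(1 for rec in merged if "BMXBMI" in rec)
```

The record count check plays the role of reading the SAS log; `n_with_exam` is the kind of number to compare against the documentation.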

**5. **Know your data
Run basic frequencies and cross tabulations.
Know your target population.
Understand how each item was measured
(how the item is defined, topcoded, recoded).
Recode variables as necessary
(example: age groups, positive/negative lab tests, high/low BP, high/low cholesterol, etc.).
Recode unknown/refusals as missing data
(77, 99 recode to missing).
Check your coding: run frequencies in SAS.
Run basic frequencies on ALL discrete data to see how it is coded.
Understand how items are measured and outcomes defined before you proceed.
You can then recode your variables as necessary. Many outcome variables that are reported as raw numbers (lab titers, BP, height/weight) you will want to recode as categorical variables (e.g. pos/neg, high/low, obese/overweight).
Understand the background literature on how best to recode these variables.
Recode refused/unknown codes (such as 77/99) to missing, as appropriate for each item.
Be sure to verify any recoding you do by running frequencies and cross tabulations in SAS.
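As a rough illustration of the recode-then-check routine, here is a small Python sketch. The codes 77 and 99, the age cut points, and the function names are assumptions chosen for the example; the documentation for each item determines which values really mean "refused" or "don't know".

```python
from collections import Counter

# Assumed refused / don't-know codes. These differ by item, and for some
# variables 77 or 99 can be a legitimate value, so check the documentation.
MISSING_CODES = {77, 99}


def recode_missing(value, missing_codes=MISSING_CODES):
    """Return None for missing-type codes, otherwise the value unchanged."""
    if value is None or value in missing_codes:
        return None
    return value


def age_group(age):
    """Collapse age in years into three illustrative categories."""
    if age is None:
        return None
    if age < 20:
        return "<20"
    if age < 60:
        return "20-59"
    return "60+"


raw = [3, 77, 12, 99, None, 5]
clean = [recode_missing(v) for v in raw]

# Frequency check after recoding, the analogue of running frequencies in SAS.
freq = Counter("missing" if v is None else "valid" for v in clean)
```

Comparing `freq` before and after the recode is the quickest way to confirm that only the intended codes became missing.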

**6. **Know your data: Continuous Outcome Data
Look for outliers in your measure.
Run Proc Univariate.
Look for outliers among the weights.
Use Proc Univariate on the weight variable.
Outlying values, especially those with large weights, can strongly influence your estimates.
Look at normality.
Consider transformations.
Log, square root, power.
For continuous variables, again read the documentation and recode the missing values as appropriate.
In addition...
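The slides rely on Proc Univariate for this screening. As a hedged stand-in, the Python sketch below flags values outside Tukey's 1.5 x IQR fences and applies a log transformation. The fence rule is an assumption chosen for illustration, not a rule stated in the slides, and the same check can be run on the weight variable.

```python
import math
import statistics


def flag_outliers(values, k=1.5):
    """Return values outside the Tukey fences (k times the IQR beyond Q1/Q3)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def log_transform(values):
    """Natural-log transform; only defined for strictly positive values."""
    return [math.log(v) for v in values]


# Toy measurements with one obviously extreme value (hypothetical data).
measure = [10, 12, 11, 13, 12, 11, 500]
outliers = flag_outliers(measure)
logged = log_transform(measure)
```

A flagged value is a prompt to go back to the documentation, not an automatic reason to drop the record.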

**7. **NHANES Sample Design
NHANES is a complex, multistage, probability cluster design of the civilian, noninstitutionalized US population.
We've discussed the data: downloading, documentation and recoding.
Now we will talk a little about the survey sample design.
The sample does not include persons residing in nursing homes, members of the armed forces, institutionalized persons, or U.S. nationals living abroad.

**8. **Sample Weights
Because the sample is drawn from a complex sampling scheme, you must use the sample weights provided when analyzing NHANES data. The weights account for:

**9. **The NHANES sample was obtained by first sampling counties, then segments within counties, then households within segments, and finally individuals within households.
The sample is a sample of individuals, not households (HH), although a household sample was used to reach them.
Not everyone in a HH was interviewed. About 1.6 members of each HH were selected on average.
1st stage: counties were sampled as the primary sampling units (PSUs). Counties were divided into strata based on size, and on average 2 counties (PSUs) were selected per stratum.
2nd stage: the PSUs were divided into segments (city blocks or their equivalent) and segments were randomly selected.
3rd stage: households in each segment were listed and a subsample was drawn. The probability of selection was greater in geographic areas whose population had a higher proportion of the subpopulations being oversampled (e.g. Mexican Americans; see Step 1.2 for other oversampled groups).
4th stage: individuals within households were selected. All persons within selected households were listed, and subsamples were drawn at random within age-sex-race/ethnicity subdomains. On average 1.6 persons were selected per household.

**10. **2. Oversampling
NHANES 1999-2004 oversampled:
African Americans
Mexican Americans
Persons with low income
Adolescents aged 12-19
Persons aged 60+
NHANES was designed to sample larger numbers of certain subgroups, such as those listed above for 1999-2004, so that reliable estimates of health status could be produced for these population subgroups.
Subgroups oversampled have varied in prior NHANES surveys and will continue to vary in future survey cycles.
It is vital to carefully review the documentation for each survey cycle to determine which subgroups were oversampled.

**11. **Non-response to the interview & exam: sample persons age 20+
Finally, the weights account for non-response between selection into the sample at the screening interview and completion of the household (HH) interview. These are the interview weights.
In addition, there are weights that account for non-response between completing the HH interview and the MEC exam. These are the MEC exam weights.

**12. **Non-response issues for NHANES
Non-response:
Most components have some level of individual item or component non-response.
ONLY non-response to the interview and exam has already been accounted for in the weights.
All additional non-response to the outcome measure of interest should be examined against all possible predictors.
Potential biases should be discussed.
If non-response is "high", re-weighting should be considered.

**13. **Why weight?
In the 2000 US Census, 13% of the population was non-Hispanic black. The unweighted NHANES 1999-2002 sample was 25% non-Hispanic black because non-Hispanic blacks were oversampled in NHANES.
Once the appropriate weights are applied, the weighted sample is only 12% non-Hispanic black, much closer to the 2000 US Census figure (numbers differ slightly due to rounding).
Similarly, Mexican Americans and persons age 12-19 years were also oversampled in NHANES. Both the US Census population and the weighted sample are 9% Mexican American and 12% persons age 12-19 years, but the percentages in the unweighted sample (28% and 24% respectively) were much greater for these two subpopulations.
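The effect of weighting on an oversampled group can be seen in a toy calculation. The Python sketch below uses four made-up records (the group labels and weights are hypothetical, not NHANES values) to compare the unweighted share of a group with its weighted share.

```python
def unweighted_share(records, group):
    """Fraction of sample records that belong to the group."""
    return sum(1 for r in records if r["group"] == group) / len(records)


def weighted_share(records, group):
    """Fraction of the total sample weight carried by the group."""
    total = sum(r["weight"] for r in records)
    return sum(r["weight"] for r in records if r["group"] == group) / total


# Hypothetical sample: the oversampled group makes up half of the records,
# but each of its members represents far fewer people in the population.
sample = [
    {"group": "oversampled", "weight": 1000},
    {"group": "oversampled", "weight": 1000},
    {"group": "other", "weight": 7000},
    {"group": "other", "weight": 7000},
]

raw_share = unweighted_share(sample, "oversampled")    # 0.5
population_share = weighted_share(sample, "oversampled")  # 0.125
```

The unweighted share reflects the sampling plan, while the weighted share estimates the group's share of the population, which is the pattern the percentages above describe.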

**14. **Sample weights: Which weights?
Now you know why we use weights. We will now discuss which weight to use.
If you plan to analyze 2 years of data (99/00, 01/02 or 03/04) and are using ONLY the HH interview data, use the weight WTINT2YR.
If you include any exam variables, you are essentially subsetting the sample to only those with MEC exam data.
Therefore, you use the WTMEC2YR weight.
For 4 years of data from 1999-2002 you MUST use the 4 year weights provided. This is because:
Sample weights for NHANES 1999-2000 were based on population estimates developed by the Bureau of the Census before the Year 2000 Decennial Census counts became available.
The two year sample weights for NHANES 2001-2002 and all other two-year cycles are based on population estimates that incorporate the year 2000 Census counts.
The two year weights for 1999-2000 and 2001-2002 are not directly comparable since different population bases were used.
Therefore, when combining 1999-2000 with 2001-2002, the analyst must use the four-year sample weights provided, which were created to account for the two different reference populations used in post-stratification.
For 4 years of data from 99-02 only, the variables are WTINT4YR for interview ONLY data
and WTMEC4YR for any MEC exam data.
To combine all 6 years of data, or to combine any other pair of two year cycles (such as 01/02 with 03/04), weights are calculated from the 2 or 4 year weight variables above as follows:

**15. **Two, Four, Six, Eight - How can we estimate?
For 4 years of data from 2001-2004:
MEC4YR = 1/2 * WTMEC2YR ;
For 6 years of data from 1999-2004:
if sddsrvyr=1 or sddsrvyr=2 then
MEC6YR = 2/3 * WTMEC4YR ; /* for 1999-2002 */
if sddsrvyr=3 then
MEC6YR = 1/3 * WTMEC2YR ; /* for 2003-2004 */
* Only when analyzing 1999-2002 on its own should you not combine the 2 year weights; use the 4 year weights provided instead.
For 4 years of data that do not include 1999-2000, halve the 2 year weights, as in the 2001-2004 statement above.
For 6 years of data from 1999-2004, use the 4 year weights for 1999-2002 and combine them with the 2 year weights for 2003-2004, as in the 1999-2004 statements above.

**16. **Two, Four, Six, Eight - How can we estimate?
Future years of data will be combined similarly:
For 6 years of data from 2001-2006:
if sddsrvyr in (2,3,4) then
MEC6YR = 1/3 * WTMEC2YR ;
For 8 years of data from 1999-2006:
if sddsrvyr=1 or sddsrvyr=2 then
MEC8YR = 1/2 * WTMEC4YR ; /* for 1999-2002 */
if sddsrvyr=3 or sddsrvyr=4 then
MEC8YR = 1/4 * WTMEC2YR ; /* for 2003-2006, etc. */
Combining future years of data and their corresponding weights will continue to be done in this same fashion.
All other 4 year combinations of data can be calculated using the method stated in the 2001-2004 example above.
The only 4 years of data that will have precalculated 4 year weights will be 1999-2002.
All remaining 2 year data releases will be combined with other years of data when appropriate to do so as you see here.
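The 4, 6 and 8 year rules follow one pattern: with k two-year cycles in the analysis, each cycle contributes 1/k of its 2 year weight, except that 1999-2000 and 2001-2002 together contribute 2/k of the 4 year weight. The Python sketch below generalizes the SAS statements on that reading. The function name is hypothetical, and raising an error when 1999-2000 is combined with later cycles but without 2001-2002 is my assumption, since the slides do not cover that case.

```python
def combined_mec_weight(sddsrvyr, wtmec2yr, wtmec4yr, cycles):
    """Combined MEC weight for one record.

    sddsrvyr : survey cycle of the record (1 = 1999-2000, 2 = 2001-2002, ...)
    wtmec2yr : the record's 2 year MEC weight (may be None if unused)
    wtmec4yr : the record's 4 year MEC weight (only exists for cycles 1 and 2)
    cycles   : set of cycle numbers included in the analysis
    """
    cycles = set(cycles)
    if sddsrvyr not in cycles:
        raise ValueError("record's cycle is not part of the analysis")
    k = len(cycles)
    if {1, 2} <= cycles and sddsrvyr in (1, 2):
        # 1999-2002 enters as one 4 year block: 2 of the k cycles.
        return (2 / k) * wtmec4yr
    if 1 in cycles and 2 not in cycles and k > 1:
        # Assumption: not covered by the slides, so refuse rather than guess.
        raise ValueError("combining 1999-2000 without 2001-2002 is not covered")
    return wtmec2yr / k


# 6 years, 1999-2004
w_6yr_early = combined_mec_weight(1, None, 30000.0, {1, 2, 3})
w_6yr_late = combined_mec_weight(3, 12000.0, None, {1, 2, 3})
# 8 years, 1999-2006
w_8yr_early = combined_mec_weight(2, None, 8000.0, {1, 2, 3, 4})
w_8yr_late = combined_mec_weight(4, 8000.0, None, {1, 2, 3, 4})
# 4 years, 2001-2004
w_4yr = combined_mec_weight(2, 10000.0, None, {2, 3})
```

With `cycles={1, 2}` the function returns the 4 year weight unchanged, which matches the rule that 1999-2002 alone uses the weights as provided.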

**17. **Sample Weights - Subsamples
Subsamples and appropriate weights:
Look at your primary variable of interest and the corresponding weight.
Look at all other variables you want to combine with it.
Are all from the interview? Exam? A subsample (e.g. fasting, audiometry, dioxin, VOCs)?
Use the weight from the smallest subsample for your analysis.
Be consistent!
In addition to interview and exam weights, some components of the survey were performed on a subsample of individuals.
For example, there are fasting blood samples, audiometry, and testing for heavy metals, dioxin, TSH/T4 (thyroid stimulating hormone/thyroxine), VOCs (volatile organic compounds) and pesticides.
Weights created for each of these subsamples reside on the same file as the data itself.
Take care to combine data sets appropriately and to use the correct weight for the analysis.
The basic approach is to use the weight from the smallest subsample in your analysis.
When combining demographic, exam and fasting data, use the fasting weight because it belongs to the smallest sample.
If the main focus of your analysis is the data from the fasting sample, use those individuals only and their corresponding weights throughout your analysis. Be consistent throughout and explain your choice.
If you use both the full sample and a subsample, specify in your methods which weight you are using for each part of your analysis, and be sure you use the appropriate one.

**18. **Sample Weights - Subsamples
Subsamples and appropriate weights:
Be careful about combining subsamples beyond MEC + VOCs, Interview + Dioxin, etc.
Combining subsamples such as Environmental + AM fasting could be problematic.
Some subsamples are mutually exclusive.
Weights were not designed for combining subsamples and may not produce good estimates.

**19. **Preparing for Analyses: Subsetting the data for SUDAAN
If using MEC exam weights, SUBSET the data on those MEC EXAMINED in SAS before using SUDAAN.
If using other subsample weights, subset the data on those in the subsample corresponding to the weights you are using.
Then use the SUBPOPN statement in the SUDAAN procedure to further subset your data by age, gender, etc. to reflect the target population you are interested in analyzing.
To estimate the variance more precisely, SUDAAN requires that the analysis procedure include all individuals who have the weight you are using.
If using MEC weights, subset the data in SAS to only those with a MEC weight; if using a subsample, subset in SAS to those with that subsample weight (items 1 and 2 above).
To further subset the data to reflect the age, gender, or whatever target population you wish to analyze, you MUST use the SUBPOPN statement in the procedure itself.
If you subset in SAS on age or other factors, and so use a smaller subdomain than that reflected by the weights you are using, you will get incorrect SE estimates.
SUDAAN needs the whole data set reflected by the weights being used in order to calculate the correct SE.
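One way to see why dropping records before the procedure causes trouble is to count the PSUs left in each stratum. The Python sketch below is only an illustration of that single symptom, using four hypothetical records; it does not reproduce SUDAAN's internal calculations. Physically deleting non-target records can leave a stratum with one PSU, while flagging the subpopulation (the analogue of SUBPOPN) leaves the design intact.

```python
from collections import defaultdict


def psus_per_stratum(records):
    """Count distinct PSUs observed in each stratum."""
    psus = defaultdict(set)
    for rec in records:
        psus[rec["SDMVSTRA"]].add(rec["SDMVPSU"])
    return {stratum: len(ids) for stratum, ids in psus.items()}


# Hypothetical design: 2 strata with 2 PSUs each, one person per PSU.
records = [
    {"SDMVSTRA": 1, "SDMVPSU": 1, "RIAGENDR": 2},
    {"SDMVSTRA": 1, "SDMVPSU": 2, "RIAGENDR": 1},
    {"SDMVSTRA": 2, "SDMVPSU": 1, "RIAGENDR": 2},
    {"SDMVSTRA": 2, "SDMVPSU": 2, "RIAGENDR": 2},
]

# Deleting everyone outside the target group (females) loses a whole PSU.
dropped = [r for r in records if r["RIAGENDR"] == 2]

# Flagging keeps every record, so the full design structure is still visible.
flagged = [dict(r, in_subpop=(r["RIAGENDR"] == 2)) for r in records]

after_drop = psus_per_stratum(dropped)   # stratum 1 is left with one PSU
after_flag = psus_per_stratum(flagged)   # both strata keep two PSUs
```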

**20. **Sample Weights: Example
You are interested in examining the association of high triglycerides, blood pressure, and body mass index (BMI), controlling for race/ethnicity, among females age 20-59, using the 6 years of data from 1999-2004.

**21. **Sample Weights: Step 1 - Determine the smallest sample population for the analysis in order to choose the correct weight.
Race/ethnicity, gender and age are in the interview.
Blood pressure and weight come from the MEC exam, a subset of those interviewed.
Triglycerides were measured on a subsample of those MEC examined who fasted for 8 hours and came to the AM MEC exam.
Therefore, the fasting subsample is the smallest subsample in the analysis and you would use the AM fasting weights (WTSAF2YR and WTSAF4YR).

**22. **Sample Weights: Step 2 - Combine weights in SAS prior to the SUDAAN procedure for the 6 years from 1999-2004:
if sddsrvyr in (1,2) then
WEIGHT6 = 2/3 * WTSAF4YR ; /* 1999-2002 */
if sddsrvyr=3 then
WEIGHT6 = 1/3 * WTSAF2YR ; /* 2003-2004 */

**23. **Sample Weights: Step 3 - Subset your data set in SAS to reflect the weight being used (AM fasting weights WTSAF2YR or WTSAF4YR):
SAS Code:
IF WTSAF2YR ne . or WTSAF4YR ne . ;

**24. **Sample Weights: Step 4 - Last, specify the correct weight using the WEIGHT statement in SUDAAN
and subset your data to the subpopulation of interest (females age 20-59) using the SUBPOPN statement in SUDAAN:
WEIGHT WEIGHT6 ;
SUBPOPN riagendr=2 and ridageyr > 19 and ridageyr < 60 ;
The subpopulation is females age 20-59.
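Steps 2 through 4 can be traced on a few toy records. The Python sketch below mirrors the SAS and SUDAAN statements under stated assumptions: the three people and their weights are hypothetical, and `in_subpopulation` only imitates the SUBPOPN condition by flagging records rather than deleting them.

```python
def has_fasting_weight(rec):
    """Step 3: keep records that carry a 2 year or 4 year fasting weight."""
    return rec.get("WTSAF2YR") is not None or rec.get("WTSAF4YR") is not None


def weight6(rec):
    """Step 2: six year fasting weight for 1999-2004."""
    if rec["SDDSRVYR"] in (1, 2):
        return 2 / 3 * rec["WTSAF4YR"]   # 1999-2002
    if rec["SDDSRVYR"] == 3:
        return 1 / 3 * rec["WTSAF2YR"]   # 2003-2004
    raise ValueError("cycle outside 1999-2004")


def in_subpopulation(rec):
    """Step 4: females age 20-59, same condition as the SUBPOPN statement."""
    return rec["RIAGENDR"] == 2 and 19 < rec["RIDAGEYR"] < 60


# Hypothetical records: two with fasting weights, one without.
people = [
    {"SEQN": 1, "SDDSRVYR": 1, "WTSAF2YR": 60000.0, "WTSAF4YR": 30000.0,
     "RIAGENDR": 2, "RIDAGEYR": 35},
    {"SEQN": 2, "SDDSRVYR": 3, "WTSAF2YR": 45000.0, "WTSAF4YR": None,
     "RIAGENDR": 1, "RIDAGEYR": 40},
    {"SEQN": 3, "SDDSRVYR": 3, "WTSAF2YR": None, "WTSAF4YR": None,
     "RIAGENDR": 2, "RIDAGEYR": 25},
]

# Subset on the weight being used, then attach the combined weight and the
# subpopulation flag. Records outside the subpopulation stay in the data.
fasting = [
    dict(p, WEIGHT6=weight6(p), IN_SUBPOP=in_subpopulation(p))
    for p in people
    if has_fasting_weight(p)
]
```

The male respondent stays in `fasting` with `IN_SUBPOP` set to False, which is the point of using SUBPOPN instead of deleting him in SAS.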

**25. **NHANES 1999-2000 Variance Estimation
Why must you use the sample design to estimate the variance?
NHANES is a cluster design.
Individuals within a cluster are more similar to one another than to those in other clusters.
This homogeneity, or clustering, reduces our effective sample size because we choose individuals within clusters rather than randomly throughout the population.
Typically, individuals within a cluster (e.g. school, city, census block) are more similar to one another than to those in other clusters.
This homogeneity of individuals within a given cluster (homogeneity of the variance) is measured by the intra-cluster correlation.
Ideally, in a complex sample, we would want to decrease the amount of correlation between sample persons within clusters.
To do this we would want fewer sample persons within each cluster and more clusters.
But because of logistical limitations (e.g. the cost of moving the survey MECs, geographic distances between primary sampling units) NHANES can only sample 15 PSUs within a two year survey cycle.
Another way to think about clustering is as a loss of precision and a reduction in the effective sample size, because we choose individuals within clusters instead of sampling them randomly throughout the population. When the design effect is greater than one, the effective sample size is less than the number of sample persons but greater than the number of clusters.
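The link between intra-cluster correlation, design effect and effective sample size can be made concrete with a common approximation for equal-sized clusters, deff = 1 + (m - 1) x rho, where m is the number of sample persons per cluster and rho is the intra-cluster correlation. The slides do not state this formula, so the Python sketch below is an illustration under that assumption, with made-up numbers.

```python
def design_effect(cluster_size, icc):
    """Approximate design effect for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(n_persons, cluster_size, icc):
    """Number of simple-random-sample persons giving the same precision."""
    return n_persons / design_effect(cluster_size, icc)


# Hypothetical survey: 100 clusters of 21 persons, intra-cluster correlation 0.05.
n_clusters, cluster_size, icc = 100, 21, 0.05
n_persons = n_clusters * cluster_size

deff = design_effect(cluster_size, icc)
n_eff = effective_sample_size(n_persons, cluster_size, icc)

# With a correlation between 0 and 1, the effective sample size falls
# between the number of clusters and the number of sample persons.
assert n_clusters <= n_eff <= n_persons
```

Under this approximation, fewer persons per cluster or a lower correlation both move the effective sample size back toward the number of sample persons, which is why more, smaller clusters would be preferred if logistics allowed.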

**26. **NHANES 1999-2004 Variance Estimation
Why must you use the sample design to estimate the variance?
Variance estimates that do not account for this intra-cluster correlation are too low and biased.
Survey software such as SUDAAN or the SAS Survey procedures must be used to account for the complex design and produce unbiased variance estimates.
These procedures require information on the sample design (i.e. identification of the PSU and stratum) for each sample person.
In a complex sample survey setting such as NHANES, variance estimates computed using standard statistical software packages that assume simple random sampling are generally too low and biased.
Software such as SUDAAN or the SAS Survey procedures, which account for the sampling design effect, must be used to calculate an unbiased estimate of the variance and should be used for all statistical tests and the construction of confidence limits.
These procedures require information on the first stage of the sample design (identification of the PSU and stratum) for each sample person.

**27. **NHANES 1999-2000 Variance Estimation
For the initial 1999-2000 data release we recommended:
Using the JK-1 (jackknife, "leave-one-out") procedure.
It required 52 replicate weights, one for each of the 52 groups created. These were provided only for 1999-2000.
It can still be used if you have software that can produce the replicate weights.
Replicate weights for this procedure will no longer be created on the data set.
It was too cumbersome.
Now that we have discussed how and why to use the weights to get a good estimate, we turn to incorporating the complex sample design in order to get correct variance and SE estimates.
For analyzing NHANES 1999-2000 we initially recommended the JK-1 procedure.
It preserved the basic design structure and did not disclose geographic identity, but it required creating 52 replicate weights for each data set.
Creating and using these weights would quickly have become extremely cumbersome as the years and combinations of years of data increased.
Why did we use JK-1? For the first data release we faced confidentiality issues that required us not to release the stratum and PSU designators. As a temporary solution, we recommended the JK-1 procedure. Further research showed that this procedure underestimated the SEs more than we liked. In addition, releasing the replicate weights for all 4, 6, 8, etc. year combinations would be extremely cumbersome and not useful. As a result we moved to linearization methods for analysis.
We recommend using a stratum and PSU design method, but you can use replicate methods if you have access to a software package that can produce the replicate weights.

**28. **NHANES 1999-2004 Variance Estimation
We now recommend:
Using the Taylor series (linearization) method.
Same as that used in NHANES III.
We now provide "Masked Variance Units" (MVUs) in place of primary sampling units (PSUs) to maintain confidentiality.
Design variables are called SDMVSTRA and SDMVPSU.
NOW Taylor
MVUs for PSUs as in NH3
Variables are
We now recommend, for all analyses (both 2-year, 99/00 or 01/02, and 4-year, 99-02), returning to the Taylor series linearization method.
This is the same as NH3
Provided MVUs vs PSUs to maintain confidentiality
The design variables are called SDMVSTRA and SDMVPSU.

**29. **Design Variables SDMVSTRA and SDMVPSU
Found in the demographic file.
Found in all two-year data sets and can be combined for 4-, 6-, or more-year data sets.
Can be used the same as the actual stratum and PSU variables.
Produce variance estimates close to those using the "true" design.
Data MUST be sorted by SDMVSTRA and SDMVPSU first, before using SUDAAN.
DEMO file
ALL years
Same as PSUs
Close to true design
Sort first
Don't forget to sort first in SAS
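
To make the linearization idea concrete, here is a minimal Python sketch (not SUDAAN or SAS code) of the Taylor series standard error of a weighted mean under a with-replacement design with PSUs nested in strata. The function name `linearized_mean_se` and the tuple layout are assumptions made for illustration; the stratum and PSU columns play the role of SDMVSTRA and SDMVPSU, and the data values are made up.

```python
from collections import defaultdict
from math import sqrt


def linearized_mean_se(records):
    """Weighted mean and Taylor-linearized SE.

    records: iterable of (stratum, psu, weight, y) tuples, where stratum
    and psu stand in for SDMVSTRA and SDMVPSU (hypothetical layout).
    """
    records = list(records)
    total_w = sum(w for _, _, w, _ in records)
    mean = sum(w * y for _, _, w, y in records) / total_w

    # Linearized value of each person, accumulated to PSU totals.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, w, y in records:
        psu_totals[stratum][psu] += w * (y - mean) / total_w

    variance = 0.0
    for psus in psu_totals.values():
        n_h = len(psus)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PSUs")
        center = sum(psus.values()) / n_h
        # Between-PSU variation within the stratum.
        variance += n_h / (n_h - 1) * sum((z - center) ** 2 for z in psus.values())
    return mean, sqrt(variance)


# Made-up example: two strata, two PSUs each.
sample = [
    (1, 1, 1200.0, 5.1), (1, 1, 900.0, 4.7),
    (1, 2, 1500.0, 5.6), (2, 1, 800.0, 6.0),
    (2, 2, 1100.0, 5.2), (2, 2, 1000.0, 5.9),
]
mean, se = linearized_mean_se(sample)
print(f"mean={mean:.3f} se={se:.3f}")
```

Unlike SUDAAN, this sketch does not need the input to be sorted, because it groups PSUs by stratum in a dictionary.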

**30. **Sample SUDAAN Code
Here is an example of BOTH the SAS and SUDAAN code for any SUDAAN procedure that defines the survey design, the weights to be used, and the target subpopulation. PROC DESCRIPT is being used as a generic example, but these statements apply to all SUDAAN procedures.
First, in SAS, subset the data set to those individuals with weights that correspond to those being used in this procedure (e.g., MEC exam wts).
For every SUDAAN procedure you must specify the input data set. The input data file must be sorted first in SAS by the same design variables as listed on the NEST statement.
In addition, in the procedure statement you designate the appropriate sample design using the DESIGN parameter.
SUDAAN offers six design options including simple random sampling (SRS). SUDAAN assumes a with-replacement (WR) design if the DESIGN parameter is omitted. For NHANES data we use the WR design.
The NEST statement lists the variables that identify or label the sampling levels or stages used in your sample design, in the order of sample selection. For all designs except SRS the NEST statement is required.
For the multistage with-replacement design (WR), you must specify, for the first stage, both the strata and PSU variables.
For NHANES the variables that identify the first-stage sampling units are SDMVSTRA (stratum) and SDMVPSU (PSUs).
The WEIGHT statement in SUDAAN is required for all NHANES analyses. It identifies the variable whose values are the sample weights.
In this example, the MEC weight for four years of data is used.
You can then subset your data to your target population of interest using the SUBPOPN statement, e.g., 12-49 year olds tested for toxo.
Finally, you can specify the subdomains or stratification you wish to look at using the TABLES statement.

**31. **Preparing for Analysis Setting up the procedure in SAS Surveymeans
I am now going to show a brief example of how to use a basic SAS Survey procedure, using PROC SURVEYMEANS as our generic example. (This procedure can be used to calculate means and standard errors.)
As with SUDAAN, for every SAS Survey procedure we will specify the input data set. Unlike in SUDAAN, it does NOT have to be presorted by the sample design variables.
With the SAS Survey procedures, the sample design is not directly specified in the PROC statement but is inferred from the cluster and strata variables that are specified.
Use the STRATA statement to specify the strata and account for the design effects of stratification. For NHANES the variable that identifies the sample strata is SDMVSTRA (stratum).
Use the CLUSTER statement to specify the primary sampling unit (PSU) to account for the design effects of clustering. For NHANES the variable that identifies the sample clusters is SDMVPSU (primary sampling units, or PSUs).
As stated previously, in a probability-based sample survey like NHANES each sampled individual has a sample weight associated with his/her data.
Use the WEIGHT statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for four years of data is used.
Use the DOMAIN statement, not the BY statement, to obtain estimates for a particular domain or subpopulation.
You should not use a WHERE clause or BY-group processing in order to analyze a subpopulation with the survey procedures.
The steps in this task explained the most basic statements used to specify the complex sample design when using a SAS survey procedure.
There are many more options within these statements, and many more statements that can be used in these procedures, that affect how variance estimates and statistics are calculated as well as how to customize the output and information produced by the procedure.
For more information on these various options and statements, please consult the SAS/STAT manual for each survey procedure.
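
The reason for using DOMAIN rather than WHERE or BY can be sketched in Python (an illustration of the idea, not of SAS internals). In a domain analysis every PSU in the full design stays in the variance calculation, even PSUs with no sample persons in the subpopulation; subsetting first would silently drop them. The function name `domain_mean_se`, the dictionary keys, and the data are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def domain_mean_se(records, in_domain):
    """Domain mean and linearized SE computed from the FULL sample.

    records: list of dicts with keys "stratum", "psu", "weight", "y"
    (hypothetical names); in_domain: predicate selecting the subpopulation.
    """
    flags = [1.0 if in_domain(r) else 0.0 for r in records]
    dom_w = sum(d * r["weight"] for d, r in zip(flags, records))
    mean = sum(d * r["weight"] * r["y"] for d, r in zip(flags, records)) / dom_w

    # Register every PSU first so PSUs without domain members still count.
    psu_totals = defaultdict(dict)
    for r in records:
        psu_totals[r["stratum"]][r["psu"]] = 0.0
    for d, r in zip(flags, records):
        psu_totals[r["stratum"]][r["psu"]] += d * r["weight"] * (r["y"] - mean) / dom_w

    variance = 0.0
    for psus in psu_totals.values():
        n_h = len(psus)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PSUs")
        center = sum(psus.values()) / n_h
        variance += n_h / (n_h - 1) * sum((z - center) ** 2 for z in psus.values())
    return mean, sqrt(variance)


# Made-up data: stratum 2, PSU 2 has nobody aged 12-49.
full_sample = [
    {"stratum": 1, "psu": 1, "weight": 1.0, "y": 1.0, "age": 20},
    {"stratum": 1, "psu": 2, "weight": 1.0, "y": 3.0, "age": 35},
    {"stratum": 2, "psu": 1, "weight": 1.0, "y": 4.0, "age": 48},
    {"stratum": 2, "psu": 2, "weight": 1.0, "y": 10.0, "age": 70},
]
mean, se = domain_mean_se(full_sample, lambda r: 12 <= r["age"] <= 49)
print(f"domain mean={mean:.3f} se={se:.3f}")
```

Subsetting to the domain first would leave stratum 2 with a single PSU, so its contribution to the variance could not be computed at all.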

**32. **Other data analysis issues from NHANES Calculating Population Totals
Estimates of the number of persons in the U.S. population with a particular condition must be done carefully.
Recommended procedure is to:
First, estimate the proportion with the condition for each subdomain of interest.
Multiply that by the population control totals for that subdomain.
Tables are available on the NCHS web site with the current March 2001 CPS control totals as part of the analytic guidelines.
Many people are interested not only in an estimate of the prevalence of a given condition
but also in an estimate of the number of persons in the US population affected by that condition.
These estimates must be done carefully.
Est prev and est # persons
Careful
Est. prop each subdomain
Multiply prop by pop control totals
See NCHS website for 2001 CPS by subdomains
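
The recommended two-step calculation can be sketched as follows in Python. The function name, subdomain labels, proportions, and control totals below are invented placeholders, not NCHS figures; the real control totals come from the CPS tables in the analytic guidelines.

```python
def estimated_persons(proportions, control_totals):
    """Multiply each subdomain's estimated proportion by its control total."""
    return {
        domain: proportions[domain] * control_totals[domain]
        for domain in proportions
    }


# Hypothetical inputs: weighted prevalence estimates and population
# control totals for two subdomains (placeholder values only).
proportions = {"men 20-39": 0.25, "women 20-39": 0.125}
control_totals = {"men 20-39": 40_000_000, "women 20-39": 40_000_000}

by_domain = estimated_persons(proportions, control_totals)
total_persons = sum(by_domain.values())
print(by_domain, total_persons)
```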

**33. **Other data analysis issues from NHANES Calculating Population Totals
Estimates of number of persons with a condition can be obtained by summing the weights of those positive.
These estimates will be less reliable due to
item nonresponse
and sampling error.
Not the recommended method.
Can also sum wts.
Less reliable: item NR and sampling error not accounted for
Not recommended

**34. **Analyzing within NHANES 1999-2004 Things to consider:
Data released in two year cycles.
We STRONGLY RECOMMEND using two or more cycles (4 or more years) to produce reliable estimates.
Verify data items collected were comparable in wording and methods.
When combining years remember to use correct combined weights.
Now I am going to review a variety of issues you must consider when analyzing NHANES data
2 yrs
4-6 recommended
Verify methods
Correct combined wts
OK to look at the last 2 years for the latest estimates.
For in-depth analyses of risk factors, etc., you should use 4 or more years of data.
Trends between 2-year intervals should be analyzed carefully.
Modelling procedures for analyzing these trends will be discussed in greater detail in the near future in the analytic guidelines available on our website.
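
As a sketch of what "correct combined weights" can mean for six years of data, the Python function below assumes the rule described in the NHANES analytic guidelines: persons from 1999-2002 get two-thirds of their 4-year weight and persons from 2003-2004 get one-third of their 2-year weight. The function name is hypothetical, SDDSRVYR is assumed to be the release-cycle indicator (1, 2, 3), and the rule should be checked against the current guidelines before use.

```python
def mec_weight_6yr(sddsrvyr, wtmec4yr=None, wtmec2yr=None):
    """Six-year (1999-2004) MEC weight for one sample person (assumed rule)."""
    if sddsrvyr in (1, 2):        # 1999-2000 or 2001-2002
        return 2 * wtmec4yr / 3
    if sddsrvyr == 3:             # 2003-2004
        return wtmec2yr / 3
    raise ValueError(f"unexpected survey cycle: {sddsrvyr}")


# Made-up weights for two sample persons.
print(mec_weight_6yr(1, wtmec4yr=30000.0))
print(mec_weight_6yr(3, wtmec2yr=45000.0))
```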

**35. **Analyzing trends with NHANES NHANES III to NHANES 1999-2004 Things to consider:
What is your sample from each survey (age)?
How differently was the question worded, and did the interview methods differ?
How different were the lab or exam methodologies? Cutoffs used? Definitions?
For current NHANES 1999-2004, sample sizes may be smaller depending on the number of years measured, especially in subdomains
Larger sampling variation.
May need to limit comparisons.
Other things you must consider are:
Sample
Word/interview change
Lab (measles): changed, can't compare trends to NH3
99+ smaller sample, larger variance.
1. Which age groups were tested/questioned?
2. Did the questions differ? Did the interview method change (e.g., from interviewer-based to audio CASI or CAPI)?
3. Did the lab test or exam method change? Did a different lab perform the assay?
4. Also, because the NHANES 99/02 sample is smaller, especially in specific age/race/gender subdomains, you will face larger variances and may need to limit the detail of your comparisons.

**36. **Race/Ethnicity NHANES 1999-2004 Two variables available
RIDRETH1
&
RIDRETH2
One of the most used demographic variables is race/ethnicity.
You will immediately notice that two race/ethnicity variables have been provided.

**37. **Race/Ethnicity NHANES 1999-2004 RIDRETH1: Use for analyses of 1999-2004 data alone.
1=Mexican American
2=other Hispanic
3=non-Hispanic white
4=non-Hispanic black
5=other races including multiracial.
For 2 and 4 years of data we know there is insufficient sample size to analyze "other Hispanics" (group 2) alone or to analyze "all Hispanics".
Analyses to evaluate whether 6 years of data (1999-2004) are sufficient to analyze these Hispanic groups are ongoing.
Groups 2 and 5 can AND should continue to be combined to represent all other races.
Code
2-4 yrs no other/all Hispanic
6 maybe
2 and 5 make other
The first, RIDRETH1, is to be used for analyses of 1999-2004 data alone.
We know that for 2 and 4 years of data there is NOT sufficient sample size to analyze the other Hispanic group alone or to combine groups 1 and 2 to analyze all Hispanics.
We are currently working on some sample analyses to determine whether there are enough sample persons represented in the sample to analyze these two categories with the 6 years of data.
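
The recommended collapsing of RIDRETH1 can be written as a small recode. This is a Python sketch with hypothetical names (`RIDRETH1_ANALYSIS_GROUPS`, `collapse_ridreth1`); the category labels follow the slide.

```python
RIDRETH1_ANALYSIS_GROUPS = {
    1: "Mexican American",
    2: "Other",               # other Hispanic, combined with group 5
    3: "Non-Hispanic white",
    4: "Non-Hispanic black",
    5: "Other",               # other races including multiracial
}


def collapse_ridreth1(code):
    """Map a RIDRETH1 code to the four analysis groups."""
    return RIDRETH1_ANALYSIS_GROUPS[code]


print([collapse_ridreth1(c) for c in (1, 2, 3, 4, 5)])
```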

**38. **Race/Ethnicity NHANES 1999-2004 RIDRETH2
Use for analyzing trends from NHANES III to NHANES 1999-2004.
Most comparable to race/ethnicity variable collected in NHANES III.
Coded as:
1=non-Hispanic white
2=non-Hispanic black
3=Mexican American
4=other, including multiracial
5=other Hispanic
The 2nd race/ethnicity variable is to be used for:
Trends NH3

**39. **Analyzing data from NHANES 1999-2004 Crude versus Age Standardized Estimates:
Age distributions within survey samples vary by racial/ethnic group.
Age distributions also vary by survey ? NHANES III vs. NHANES 1999-2004.
When comparing estimates across racial/ethnic groups or between surveys you may need to age standardize.
Also present all age-specific estimates!
One issue to consider when analyzing any trends between surveys is whether to use
crude vs. age-standardized estimates.
Within a sample: age dist varied by race (Mex Amer younger than whites)
Between surveys: age dist also varied (different subgroups oversampled)
Age standardization must be considered
Age-specific estimates should also be provided.

**40. **Analyzing data from NHANES 1999-2004 When Age Standardizing:
Use the 2000 U.S. Census population, for consistency, for both NHANES III and all NHANES from 1999-2000 onward.
For guidelines and population proportions see the website below for the Klein and Schoenborn HP2010 Statistical Notes on "Age Adjustment using the 2000 Projected U.S. Population".
http://www.cdc.gov/nchs/data/statnt/statnt20.pdf
USE 2000 census both
Population proportions: reference
Website

**41. **Analyzing data from NHANES 1999-2004 When Age Standardizing:
In SUDAAN, use the STDVAR and STDWGT statements.
STDVAR: variable name for the age groups.
STDWGT: corresponding proportion of the 2000 U.S. Census population for each age subgroup.
These statements can be used in PROC DESCRIPT when finding prevalence estimates and means, doing contrasts, and calculating percentiles.
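
What STDVAR and STDWGT accomplish is direct standardization: a weighted average of the age-specific estimates, with the standard population proportions as weights. A minimal Python sketch follows; the function name, age groups, rates, and standard proportions are all made-up placeholders (the real proportions are in the Klein and Schoenborn note).

```python
def age_standardize(age_specific, std_props):
    """Directly standardized estimate: sum of rate x standard proportion."""
    total = sum(std_props.values())  # renormalize in case props do not sum to 1
    return sum(age_specific[g] * std_props[g] for g in std_props) / total


# Placeholder standard population proportions (not the Census values).
std_props = {"20-39": 0.4, "40-59": 0.35, "60+": 0.25}

# Placeholder age-specific prevalences for two groups.
group_a = {"20-39": 0.10, "40-59": 0.20, "60+": 0.40}
group_b = {"20-39": 0.15, "40-59": 0.25, "60+": 0.45}

print(round(age_standardize(group_a, std_props), 4))
print(round(age_standardize(group_b, std_props), 4))
```

Because both groups are weighted by the same standard proportions, any remaining difference between them is not due to their different age distributions.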

**42. **Age standardization for NHANES Crude vs. Age Standardized Estimates Example:
Example of why to age standardize using NH3 data:
No difference between whites and Mexican Americans (MA) in the crude data.
Because the variable increased with age, standardization decreased the prevalence for whites, who were older,
and increased the prevalence for MA, who were younger.
After age standardization the difference between whites and MA was significant.

**43. **OH9900 Other data analysis issues from NHANES Design Effect
Sample design effect: the ratio of the variance estimated under the complex sample design to the variance under simple random sampling
Var (CSD) / Var (SRS)
SUDAAN: DEFT2 option in PROC DESCRIPT
Design effect can be averaged
DEFF = var complex / var SRS
Deft2
Average across
We discussed earlier that variance estimates under a complex sample design are usually larger than those under SRS.
Design effect is defined as the ratio of the variance under the complex sample design to the variance under SRS.
In SUDAAN we use the DEFT2 option to get the appropriate design effect.
Design effects can be averaged across relevant subdomains of interest (e.g., age/race/sex) for a given outcome variable.
This is especially useful when the design effect is unstable across subdomains.
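
As a rough Python sketch of the ratio and of averaging across subdomains (the function names and numbers are illustrative assumptions, not SUDAAN output):

```python
def design_effect(var_complex, var_srs):
    """DEFF = variance under the complex design / variance under SRS."""
    return var_complex / var_srs


def srs_variance_of_proportion(p, n):
    """Variance of a proportion under simple random sampling."""
    return p * (1.0 - p) / n


def average_design_effect(deffs):
    """Simple average of design effects across subdomains."""
    return sum(deffs) / len(deffs)


# Made-up subdomain results: (prevalence, n, SE from the complex design).
subdomains = [(0.20, 400, 0.030), (0.25, 300, 0.035), (0.10, 500, 0.018)]
deffs = [
    design_effect(se ** 2, srs_variance_of_proportion(p, n))
    for p, n, se in subdomains
]
print([round(d, 2) for d in deffs], round(average_design_effect(deffs), 2))
```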

**44. **Other data analysis issues from NHANES Effective Sample sizes
Sample sizes should be adjusted by the sample design effect (DEFF)
Effective N = N/DEFF
Minimum sample size for reporting each individual estimate depends on the statistic being calculated, its relative size, stability of the SE estimate, degrees of freedom and other special circumstances.
Please refer to the Analytic Guidelines on our web site for more details.
Detail: look at SS
Eff SS is SS/DE
Another factor to consider when deciding at what level of detail to provide an estimate is the minimum effective sample size.
The effective sample size is the sample size for that subgroup divided by the design effect.
See the NHANES AG for details on what minimum sample size is considered sufficient for a particular estimate.
These detailed guidelines take into consideration the type of estimate being considered (means, proportions, percentiles, totals), its relative size (0.01 to 0.5), the method being used to calculate CLs (binomial, arcsine, Woodruff), the degrees of freedom represented in the sample, and other factors.
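
The adjustment itself is a one-line calculation; a Python sketch with made-up numbers (the function name is an assumption):

```python
def effective_sample_size(n, deff):
    """Effective N = N / DEFF."""
    if deff <= 0:
        raise ValueError("design effect must be positive")
    return n / deff


# A subgroup of 300 sample persons with a design effect of 1.5
# carries about as much information as an SRS of this size:
print(effective_sample_size(300, 1.5))
```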

**45. **Other data analysis issues from NHANES Estimate Stability Relative Standard Errors:
For estimates such as means/prevalences, calculate the relative standard error (RSE) as follows: (SE mean / mean) × 100%
For prevalence estimates near 100% (i.e., > 90%), look at the RSE for the percent negative, not just the percent positive.
i.e., calculate the RSE for the minimum of p and 1-p
Other data analysis issues one should consider, especially when analyzing NHANES data, include looking at estimate stability by examining the relative standard error of your estimate.
NHANES was designed to provide reliable estimates of conditions by demographic groups based on a minimum 10 percent prevalence with an RSE of <30%.
SE/Mean * 100
Min of pos or neg
P or 1-p
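
A small Python sketch of the two RSE calculations, with hypothetical function names and made-up inputs:

```python
def rse_percent(estimate, se):
    """Relative standard error as a percentage of the estimate."""
    return 100.0 * se / estimate


def prevalence_rse_percent(p, se):
    """RSE for a prevalence, based on the smaller of p and 1 - p."""
    return 100.0 * se / min(p, 1.0 - p)


# Made-up example: 95% prevalence with an SE of 1.5 percentage points.
print(round(rse_percent(0.95, 0.015), 1))             # RSE of percent positive
print(round(prevalence_rse_percent(0.95, 0.015), 1))  # RSE of percent negative
```

The percent-positive RSE looks reassuringly small here, while the percent-negative RSE shows how unstable the estimate of the remaining 5% really is.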

**46. **Other data analysis issues from NHANES Relative Standard Errors and "Rare" Events:
RSEs <20%: estimates are most likely reportable.
RSEs >30%: consider whether the estimate provides useful information.
An estimate of 50% with an SE of 15% and RSE of 30% gives a 95% CI of approximately 20-80%. Is this really useful information?
An estimate of low prevalence (e.g., 5%) with an SE of 1.5% also gives an RSE of 30%, but the 95% CI is approximately 2-8%. This may be very useful information.
<20 OK
>30 maybe not
50% no
5% probably OK
Between 20 and 30: OK, use discretion.
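
The two cases on this slide can be reproduced with a few lines of Python. This sketch assumes a plain normal multiplier of 1.96 rather than a t-statistic based on the design degrees of freedom; the function name is hypothetical.

```python
def wald_ci(estimate, se, z=1.96):
    """Normal-approximation 95% confidence interval."""
    return estimate - z * se, estimate + z * se


# Same RSE (30%), very different usefulness of the interval.
for estimate, se in [(50.0, 15.0), (5.0, 1.5)]:
    rse = 100.0 * se / estimate
    low, high = wald_ci(estimate, se)
    print(f"estimate={estimate}% RSE={rse:.0f}% CI=({low:.1f}%, {high:.1f}%)")
```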

**47. **Other data analysis issues from NHANES Confidence Limits for "rare" (>90% or <10%) events:
Standard normal approaches for calculating 95% CIs may give lower bounds < 0 or upper bounds > 100.
Statistical literature describes alternative methods under these situations.
Evaluation of these various methods: see analytic guidelines on NCHS web site.
NH was designed to provide reliable estimates of conditions by demographic groups based on a 10 percent prevalence with an RSE of <30%. But we include many measures in our survey that do not meet these criteria.
Again, care must be taken when estimating 95% CIs on what we will call "rare" events.
These are prevalence estimates >90% or <10%.
Alternative methods exist. These include the arcsine and logit transformations, exact methods such as the exact binomial (Clopper-Pearson), and the Wilson method.
These methods vary in their usefulness in different situations.
See AG on NCHS web for evaluation of these different methods.
Normal approximations (Wald): not good near 0 or 100
Mantel-Haenszel asymmetric: not good near 100
Collett or Clopper-Pearson: exact methods
Arcsine, log transformed, Wilson
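
To illustrate why the alternatives matter, the Python sketch below compares the Wald interval with the Wilson score interval for a rare event. It is a textbook version for a simple binomial proportion; for survey data `n` would ordinarily be replaced by an effective sample size, which is an assumption of this sketch and not a statement of the analytic guidelines. Function names are hypothetical.

```python
from math import sqrt


def wald_interval(p, n, z=1.96):
    """Normal-approximation interval; can fall outside [0, 1]."""
    half = z * sqrt(p * (1.0 - p) / n)
    return p - half, p + half


def wilson_interval(p, n, z=1.96):
    """Wilson score interval; always stays inside [0, 1]."""
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half


# Made-up rare event: 2% prevalence with (effective) n = 100.
print(wald_interval(0.02, 100))
print(wilson_interval(0.02, 100))
```

With these inputs the Wald lower bound is negative, while the Wilson interval stays between 0 and 1 and is asymmetric around the estimate.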

**48. **Other data analysis issues from NHANES Degrees of freedom (DF) for t-statistics:
Must calculate the DF to obtain a correct t-statistic for calculating confidence limits.
DF = number of clusters in the 2nd level of sampling (# PSUs) minus number of clusters in the 1st level of sampling (# strata) in your subgroup of interest.
Same for both SAS and SUDAAN when all strata and PSUs are represented in your subgroup.
Often estimates are calculated for various subgroups of interest within the total NHANES population.
To calculate the correct value for the t-statistic from a t-distribution at a selected level of significance, you must calculate the proper degrees of freedom for the estimate for your subpopulation of interest.
Degrees of freedom are properly calculated by subtracting the number of clusters in the first level of sampling (strata) from the number of clusters in the second level of sampling (PSUs) for each subgroup you are analyzing (# PSUs - # strata).
NOTE: For both SUDAAN and SAS Survey Procedures the degrees of freedom are calculated in the same way when looking at the entire sample population or at subgroups where all strata and PSUs are represented.
But they vary when not all strata and PSUs are represented by the sample persons in a subgroup. Neither SAS nor SUDAAN corrects for missing PSUs and strata in the calculation of their CIs in these situations.
We still

**49. **Other data analysis issues from NHANES Degrees of freedom (DF) for t-statistics:
SAS and SUDAAN do not calculate DF the same way when your subgroup is NOT represented in all PSUs and strata.
SAS is currently working on correcting this.
In SUDAAN, to calculate DF you must output the # strata and the # PSUs using the ATLEVEL1=1 and ATLEVEL2=2 options in your PROC DESCRIPT or PROC CROSSTAB.
BUT, when one analyzes data on a subgroup of sample persons who may not be represented in all strata and PSUs (e.g., Mexican Americans), the SAS Survey procedures such as PROC SURVEYMEANS compute the degrees of freedom as the number of clusters (PSUs) in the non-empty strata minus the number of non-empty strata.
This means that if you have empty strata (no sample persons represented in either PSU) it would actually increase the number of degrees of freedom. This is incorrect, and SAS is currently working on correcting the method used to calculate the degrees of freedom in its survey procedures.
In contrast, SUDAAN will count the number of PSUs and strata with at least one valid observation for each cell of the table being requested for the subpopulation you are analyzing.
In SUDAAN, one must specify the ATLEVEL1 and ATLEVEL2 options in the PROC statement in PROC DESCRIPT or PROC CROSSTAB to request the counting of the PSUs and strata. The ATLEVEL1=1 and ATLEVEL2=2 options specify the sampling stages (stratum and PSU) for which you want counts per table cell. The values 1 and 2 are the positions on the NEST statement of the variables used to designate the stages of sampling.
These options are associated with the keywords ATLEV1 and ATLEV2, respectively, on the PRINT or OUTPUT statements. ATLEV1 is the number of strata with at least one valid observation and ATLEV2 is the number of PSUs with at least one valid observation. These numbers are used to calculate degrees of freedom.
In addition, 95% confidence limits are available in both the SAS Survey Procedures (PROC SURVEYMEANS) and, starting in SUDAAN ver 9.1, PROC DESCRIPT. These 95% CIs are calculated using the Wald method based on a t-statistic for the number of degrees of freedom in the entire NHANES sample, which DOES NOT correct for the reduction in the degrees of freedom in subdomains where not all strata and PSUs are represented.

**50. **Analyzing Data from NHANES 1999-2004 Analytic Guidelines:
Detailed guidelines for working with NHANES data can be found at:
http://www.cdc.gov/nchs/nhanes.htm
This document contains everything discussed today and will continue to grow to include guidelines for statistical tests, multivariate analyses, modeling and more!
A web-based tutorial is also currently available and is continuously being developed.
Although I've touched on several topics today, you most likely have other questions related to analysis of the NHANES data.
Our current analytic guidelines, available on the web at the address given here, cover everything I've discussed today in more detail.
In addition, we are currently designing a web-based tutorial program that will assist users, whether novice or seasoned, in accessing, downloading, merging, analyzing and interpreting NHANES data.
Parts of this tutorial are already available. They cover how to find, access, download, merge, check and recode NHANES data, as well as how to understand NHANES sampling and weighting.
Other modules on variance estimation, calculating confidence limits, and various statistical procedures for survey data will be added soon.
Consider visiting our web site at least monthly to see what new data are available and what new analytic guidance is provided in the guidelines document or in the tutorial.
This document will continue to grow as we add guidelines on statistical testing, multivariate analyses and modeling, trend analysis and more.
Thank you!