SIAP-SRTC Training Course on Sampling, Acceed Center, AIM, Makati, Philippines, 4 April 2002



    2. OUTLINE: Statistical Computing Resources; Data Management with Stata; Table Generation (Tab and Table Commands); Survey Commands

    4. Computing Resources: The age of ICT has brought about a synergy of computing and communications. Implications: more data collected, more data stored, more data accessible and distributed.

    5. Computing Resources: A host of statistical software packages provide pre-programmed analytical and data management capabilities. These packages may be classified according to use and cost.

    6. Computing Resources: Types of statistical software by usage. General purpose: SAS, SPSS, R, S-Plus, Statistica, Stata. Special purpose: econometric modeling (EViews), seasonal adjustment (X12), Bayesian modeling (WinBUGS), survey data tabulation and variance estimation (IMPS, CENVAR).

    7. Computing Resources: Types of statistical software by cost. Commercial software: SAS, SPSS, Stata, S-Plus. Freeware: R, IMPS, X12.

    8. Computing Resources FOR SURVEY DATA: Bascula from Statistics Netherlands; CENVAR (and IMPS) from U.S. Bureau of the Census; CLUSTERS from University of Essex; Epi Info from Centers for Disease Control; Generalized Estimation System (GES) from Statistics Canada; IVEware (beta version) from University of Michigan.

    9. Computing Resources FOR SURVEY DATA (continued): PCCARP from Iowa State University; SAS/STAT from SAS Institute; Stata from Stata Corporation; SUDAAN from Research Triangle Institute; VPLX from U.S. Bureau of the Census; WesVar from Westat, Inc.

    10. Computing Resources: Lists of Statistical Software
        http://members.aol.com/johnp71/javasta2.html
        http://www.stir.ac.uk/Departments/HumanSciences/SocInfo/Statistical.htm
        http://www.fas.harvard.edu/~stats/survey-soft/
        http://www.feweb.vu.nl/econometriclinks/software.html

    11. Computing Resources: This afternoon, we will demonstrate how to use Stata for some of the most common tasks of data management, statistical computing, and analysis of survey data.

    12. Computing Resources: Stata. Estimation of means, totals, ratios, and proportions; linear regression, logistic regression, and probit. Point estimates, associated standard errors, confidence intervals, and design effects for the full population or subpopulations are displayed.

    13. Computing Resources: Stata. Auxiliary commands display various information for linear combinations (e.g., differences) of estimators, and conduct hypothesis tests. New in Stata: contingency tables with Rao-Scott corrections of chi-squared tests; new survey-corrected regression commands including tobit, interval, censored, instrumental variables, multinomial logit, ordered logit and probit, and Poisson.

    14. Computing Resources: Stata handles stratified designs and cluster sampling. FPCs can be calculated for simple random sampling without replacement of sampling units within strata. Variance estimation for multistage sample data is carried out through the customary between-PSU squared-differences calculation.

    15. Computing Resources: Stata. Variance estimation is done through Taylor-series linearization in the survey analysis commands. There are also commands for jackknife and bootstrap variance estimation, but these are not specifically oriented toward survey data.
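The between-PSU squared-differences calculation mentioned in the slides can be sketched outside Stata. The Python below is a minimal illustration for an estimated total, assuming PSUs are treated as sampled with replacement within strata (no FPC). The function name, the record layout, and the numbers are hypothetical and are not taken from the course data; for a total the linearization step is trivial, so this shows only the variance formula, not Stata's implementation.

```python
from collections import defaultdict


def total_and_variance(records):
    """Estimate a weighted total and its variance.

    records: iterable of (stratum, psu, weight, y) tuples.
    Assumes with-replacement sampling of PSUs within strata (no FPC).
    """
    # Weighted PSU totals: z_hi = sum of weight * y inside PSU i of stratum h.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, weight, y in records:
        psu_totals[stratum][psu] += weight * y

    estimate = 0.0
    variance = 0.0
    for psus in psu_totals.values():
        z = list(psus.values())
        n_h = len(z)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PSUs")
        z_bar = sum(z) / n_h
        estimate += sum(z)
        # Between-PSU squared differences, scaled by n_h / (n_h - 1).
        variance += n_h / (n_h - 1) * sum((v - z_bar) ** 2 for v in z)
    return estimate, variance


# Hypothetical data: two strata with two PSUs each.
records = [
    (1, "a", 10, 2.0), (1, "a", 10, 4.0), (1, "b", 10, 5.0),
    (2, "c", 20, 1.0), (2, "d", 20, 3.0), (2, "d", 20, 2.0),
]
estimate, variance = total_and_variance(records)
print(estimate, variance, variance ** 0.5)
```

The standard error is the square root of the returned variance. Strata with a single PSU are rejected because the between-PSU difference is undefined there.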

    16. Computing Resources: Note: We will demonstrate the use of Stata version 6. The current version is version 7, which also comes in a Special Edition (SE) that can handle up to 32,766 variables, strings of up to 244 characters, and matrices of up to 11,000 x 11,000.

    18. Data Management: STARTING UP. Go to Start, Programs, Stata, Intercooled Stata. Alternatively, from Windows Explorer, go to the folder c:\stata and double-click wstata.exe.

    19. Data Management

    20. Data Management CREATING A NEW DATASET Open the STATA spreadsheet editor

    21. Data Management: CREATING A NEW DATASET. Enter data into the editor; when done, close the editor.

    22. Data Management: CREATING A NEW DATASET. In the Stata Command window, enter the command
            save newfile

    23. Data Management: NOTE. A Stata dataset has the file extension .dta; that is, newfile is actually newfile.dta. Public-use files of some surveys, e.g. the VLSS (Vietnam Living Standards Survey), are in Stata format.

    24. Data Management: INSPECTING THE DATASET. In the Stata Command window, enter the following commands
            describe
            list
            summarize

    25. Data Management: NOTE. Stata is case sensitive, so commands are typed in lowercase. Stata commands may be abbreviated, e.g. d for describe, sum for summarize, etc. We may use the Page Up/Page Down keys or the mouse to re-select commands in the Review window.

    26. Data Management: NOTE. Commands and output are shown in the Results window. Windows may be resized. Commands and output may be logged to a log file by pressing the Open Log button.

    27. Data Management: RENAMING VARIABLES. ONE WAY (from the Data Editor): double-click anywhere in the variable's column to bring up a dialogue box.

    28. Data Management: RENAMING VARIABLES. SECOND WAY (in the Stata Command window): enter
            rename var1 domain
            rename var2 hcn
            rename var3 age
            label variable age "HH head age"
            d

    29. Data Management: SAVING THE EDITED DATASET. In the Stata Command window, enter the following command
            save newfile, replace
        Note: typing only save newfile will result in an error message, because newfile.dta already exists.

    30. Data Management: READING A PRE-EXISTING STATA DATASET. If the dataset is in folder c:\fies2000 and the filename is "fies00small.dta", enter
            clear
            set mem 64m
            cd c:\fies2000
            use fies00small

    32. Data Management: IMPORTING DATA. Suppose we have a dataset try.txt in the c:\fies2000 folder. Use the infile command with syntax
            infile variable-list using filename.raw
        In particular, enter
            cd c:\fies2000
            infile domain hcn age using try.txt, automatic

    33. Data Management: TRIVIA ON STRING VARIABLES. When using the infile command with character (string) variables, we need to identify these variables. For instance
            infile domain hcn str30 prov using try.txt
        For more details regarding infile, enter
            help infile1

    35. Data Management: IMPORTING DATA. Suppose we have a dataset try2.txt in the c:\fies2000 folder with the data in specific fields. Use the infix command
            infix domain 1 hcn 2 age 3-4 using try2.txt, clear
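To make the column specification concrete, here is a small Python sketch of what reading "domain 1 hcn 2 age 3-4" means for one fixed-format record: domain is taken from column 1, hcn from column 2, and age from columns 3 to 4. The sample records are made up for illustration and are not the contents of try2.txt; the function name is hypothetical.

```python
def parse_fixed(line):
    """Split one fixed-format record using the slide's 1-based column positions."""
    return {
        "domain": int(line[0:1]),  # column 1
        "hcn": int(line[1:2]),     # column 2
        "age": int(line[2:4]),     # columns 3-4
    }


# Hypothetical records, not taken from try2.txt.
sample_lines = ["1245", "2367"]
rows = [parse_fixed(line) for line in sample_lines]
for row in rows:
    print(row)
```

Python slices are 0-based and exclude the end position, so columns 3-4 become the slice 2:4.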

    36. Data Management: Thus, Stata can read text files with
        infile (if the data in the text file are separated by spaces and there are no strings, or strings are just one word, or all strings are enclosed in quotes)
        infix (fixed-format text)
        insheet (if the text file was created by a spreadsheet or database program)

    37. Data Management: NOTE. The commands infile, infix, and insheet read data from ASCII files. The outfile command saves the data in ASCII. There are third-party programs, especially Stat/Transfer and DBMS/COPY, that translate from one data format (e.g., dBASE, Excel, SAS, SPSS, Stata) to another.

    38. Data Management

    39. Data Management: OTHER USEFUL COMMANDS
        To sort the dataset by age
            sort age
        To get a listing of the dataset
            list
        To get a listing of the 2nd to 4th observations
            list in 2/4

    40. Data Management: OTHER USEFUL COMMANDS
        To summarize the dataset restricted to households whose head's age is less than or equal to 50
            summarize if age <=50
        To restrict to households whose head's age is between 35 and 50
            summarize if age <50 & age >35

    41. Data Management: Comparison operators: >, >=, ==, <, <=, !=. Logical operators: & (and), | (or), ! (not), ~ (not).

    42. Data Management: OTHER USEFUL COMMANDS
        To tabulate domain
            tab domain
        To generate contingency tables
            tab domain hcn if age>35
        To get the correlation matrix
            correlate x y z

    43. Data Management: GENERATING & REPLACING VARIABLES. Suppose we want to obtain the per capita income (pci) of FIES 2000 households
            clear
            cd d:\fies00
            use fies00small
            gen pci=toinc/hsize

    44. Data Management: GENERATING & REPLACING VARIABLES. Now tag a household as poor (1) if pci is below some threshold, say 13823, and determine the percentage of households that are poor.
            gen poor=1 if pci < 13823
            replace poor=0 if poor==.
            sum poor [aw=rfact]
            save fies00small, replace
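The weighted mean of a 0/1 indicator is the weighted share of households flagged as poor, which is why summarizing poor with rfact as the weight gives the poverty incidence. A minimal Python sketch of that arithmetic follows; the three households are invented for illustration and the function name is hypothetical. Only the threshold 13823 comes from the slide.

```python
THRESHOLD = 13823  # per capita threshold quoted on the slide


def poverty_incidence(households, threshold=THRESHOLD):
    """Weighted share of poor households.

    households: list of (toinc, hsize, rfact) tuples, where rfact is the weight.
    """
    weighted_poor = 0.0
    total_weight = 0.0
    for toinc, hsize, rfact in households:
        pci = toinc / hsize                    # per capita income
        poor = 1 if pci < threshold else 0     # 0/1 poverty indicator
        weighted_poor += rfact * poor
        total_weight += rfact
    return weighted_poor / total_weight


# Made-up households: (total income, household size, weight).
households = [(60000, 6, 150), (120000, 4, 100), (50000, 5, 250)]
rate = poverty_incidence(households)
print(rate)
```

Multiplying the result by 100 gives the percentage of households that are poor.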

    45. Data Management: NOTE. A small portion of the FIES 2000 data set was used. The Family Income and Expenditure Survey (FIES) is conducted by the National Statistics Office (NSO) every 3 years. Data may be purchased through the NSO website: www.census.gov.ph

    47. Data Management: RECALL that if we use our fies2000 data set
            set mem 64m
            cd c:\fies2000
            use fies00small
            sum poor [aw=rfact]
        Note: the poverty line we provided is a weighted average of the variable poverty lines in the Philippines (for urban-rural areas across the different regions).

    49. Estimating Food Poverty Line: The food poverty line is estimated from low-cost one-day menus (breakfast, lunch, supper, snack) constructed for each urban-rural area of a region by the Food and Nutrition Research Institute (FNRI). The menus meet 100% sufficiency in energy and protein requirements and 80% sufficiency in other nutrients and vitamins. RDA for energy: 2000 kcal per person. RDA for protein: 50 grams per person. 29 such menus were constructed on the basis of the 1988 Food Consumption Survey.

    50. Annual Per Capita Food Line Urban, by Region

    51. Annual Per Capita Food Line Rural, by Region

    52. Estimating Poverty Line: Poverty Line = Food Threshold / Engel's Coefficient. Engel's coefficient is estimated by analyzing the consumption pattern of families having incomes within plus or minus 10 percentage points of the food threshold. Engel's Coefficient = Food Expenditure / Total Basic Expenditure.
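As a worked illustration of the two formulas on this slide, the Python sketch below divides a food threshold by an Engel coefficient. All figures are hypothetical round numbers chosen to keep the arithmetic easy to follow; they are not official Philippine poverty statistics, and the function names are illustrative.

```python
def engel_coefficient(food_exp, total_basic_exp):
    """Share of total basic expenditure that goes to food."""
    return food_exp / total_basic_exp


def poverty_line(food_threshold, engel):
    """Scale the food threshold up to cover non-food basic needs."""
    return food_threshold / engel


# Hypothetical reference families spend 6,000 of every 10,000 on food.
engel = engel_coefficient(food_exp=6000.0, total_basic_exp=10000.0)
line = poverty_line(food_threshold=9000.0, engel=engel)
print(round(engel, 2), round(line, 2))
```

Because the Engel coefficient is below 1, the poverty line always exceeds the food threshold; the smaller the food share, the larger the gap.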

    53. Annual Per Capita Poverty Line Urban, by Region

    54. Annual Per Capita Poverty Line Rural, by Region

    55. Poverty Statistics (Family)

    56. Poverty Incidence All Areas, by Region

    57. Small Area Poverty Stats? Stata has some add-ons for generating standard errors (SEs) for poverty statistics. If we try to generate provincial poverty statistics, we will find that the SEs are too high, i.e. the figures are unreliable.

    60. Data Management: NOTE. Stata uses several types of weights
        fw  frequency weights
        aw  analytic weights
        iw  importance weights
        pw  probability weights

    61. Data Management: NOTE. Within the generate or replace command, we may transform or create variables by using functions, e.g.,
            generate loginc=ln(toinc)
            generate y=cos(x*_pi/180)
            replace newvar=normd(z)
            generate rvar=uniform()

    62. Data Management: DELETING VARIABLES/DATA
        To drop a variable, say age
            drop age
        To drop some observations
            drop in 2/3
        Try also the command keep.
        To drop all data in memory
            clear

    63. Data Management NOTE: So far we have used STATA interactively. We can also do batch processing through the DO FILE editor.

    64. Data Management: NOTE. The Stata toolbar has 13 buttons. The first three are to OPEN a Stata dataset, SAVE the resident dataset to disk, and PRINT a graph or log.

    65. Data Management: The next five are for starting/stopping/suspending a LOG, bringing the Log to the front, bringing the Dialog to the front, bringing the Results to the front, and bringing the Graph to the front.

    66. Data Management: The last five are for opening the DO FILE editor, opening the DATA editor, opening the DATA browser, telling Stata to continue when it has paused in the middle of long output, and stopping the current task.

    67. Exercise What is the average income of families that are below or above the mean family expenditure?

    68. Exercise Compare correlation of food expenditures (fexp) and nonfood expenditures for families in rural & urban areas.

    69. Extra Enter graph food nfood

    70. Extra: Now try
            sort urb
            graph food nfood, by(urb)
            graph food nfood, by(urb) total

    71. Extra Matrix plots graph toinc food nfood, matrix

    73. Table Generation w/ tab: Earlier, we showed the use of the tab(ulate) command. Try
            tab urb
            tab urb [aw=rfact]
            tab urb [iw=rfact]
            tab urb regn

    74. Tab: The tab command has options for generating 1-way tables of freqs
            tab urb, summ(toinc)
        and two-way tables
            tab urb sex
            tab urb sex, row
            tab urb sex, row col chi2
            tab urb sex, all exact

    75. Table Generation w/ table: Aside from the tab command, we can generate tables of statistics with the table command. Compare
            tab urb
        with
            table urb

    76. Table: To generate the average (family) income and average (family) expenditure across urban and rural areas, enter
            table urb, c(mean toinc mean toexp)
        Using weights
            table urb [aw=rfact], c(mean toinc mean toexp)
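To show what the weighted version of this table computes, here is a Python sketch of weighted group means: within each group, each value is multiplied by its weight and the sum is divided by the group's total weight. The rows are invented for illustration, the function name is hypothetical, and the group codes 1 and 2 are an assumption about how urb might be coded.

```python
from collections import defaultdict


def weighted_means_by_group(rows):
    """Weighted means of income and expenditure per group.

    rows: list of (group, weight, toinc, toexp) tuples.
    Returns {group: (mean_toinc, mean_toexp)}.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])  # weight, w*toinc, w*toexp
    for group, weight, toinc, toexp in rows:
        acc = sums[group]
        acc[0] += weight
        acc[1] += weight * toinc
        acc[2] += weight * toexp
    return {g: (a[1] / a[0], a[2] / a[0]) for g, a in sums.items()}


# Made-up families: (urb code, weight, total income, total expenditure).
rows = [
    (1, 100, 200000, 150000),
    (1, 300, 100000, 90000),
    (2, 200, 80000, 70000),
]
means = weighted_means_by_group(rows)
for group, (inc, exp) in sorted(means.items()):
    print(group, inc, exp)
```

With equal weights in every row the result reduces to the ordinary group means, which is what the unweighted table reports.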

    77. Table: The contents option may specify at most five of the following statistics:
        freq (for frequency)
        mean varname (for mean of varname)
        sd varname (for standard deviation)
        sum varname (for sum)
        rawsum varname (for sums ignoring the optionally specified weight)
        count varname (for count of nonmissing data)

    78. Table: The contents option (continued):
        n varname (same as count)
        max varname (for maximum)
        min varname (for minimum)
        median varname (for median)
        p1 varname (for 1st percentile)
        p2 varname (for 2nd percentile)
        ...
        iqr varname (for interquartile range)

    79. Exercise Using Table
        Obtain the average and median per capita income of households by sex of household head
            table sex, c(mean pci median pci)
        Obtain the "weighted" frequency of poor and nonpoor households across regions
            table poor regn [iw=rfact]

    80. Using Survey Commands: Stata has a family of commands designed especially for sample surveys. These commands all begin with svy
        svyset    setting variables
        svydes    describe strata and PSUs
        svymean   estimate population and subpopulation means
        svytotal  estimate population and subpopulation totals

    81. Using Survey Commands: Svy commands
        svyprop    estimate population and subpopulation proportions
        svyratio   estimate population and subpopulation ratios
        svytab     for two-way tables
        svyreg     for regression
        svyivreg   for instrumental variables regression
        svylogit   for logit regression
        svyprobit  for probit regression

    82. Using Survey Commands: Svy commands
        svytest   for hypothesis testing
        svylc     for estimating linear combinations
        svymlog   for multinomial logistic regression
        svyolog   for ordered logistic regression
        svyoprob  for ordered probit regression
        svypois   for Poisson regression
        svyintrg  for censored and interval regression

    83. Using Survey Commands: Before issuing any svy estimation command, we identify the weight, strata, and PSU identifier variables
            svyset pweight rfact
            svyset strata domain
            svyset psu hcn

    84. Using Survey Commands
        To obtain the average family income and average family expenditure
            svymean toinc toexp
        To obtain the total family income and total family expenditure by region
            svytotal toinc toexp, by(regn)

    85. Using Survey Commands
        To obtain the per capita income and per capita expenditure
            svyratio toinc/fsize toexp/fsize
        To obtain pci and pce by urban/rural
            svyratio toinc/fsize toexp/fsize, by(urb)
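A ratio estimate of per capita income divides the weighted total of income by the weighted total of family size, which is generally not the same as averaging each family's own per capita income. The Python sketch below contrasts the two with two made-up families; the function names and data are hypothetical and the code illustrates only the point estimates, not their standard errors.

```python
def ratio_of_totals(rows):
    """Weighted total income divided by weighted total family size.

    rows: list of (toinc, fsize, weight) tuples.
    """
    numerator = sum(w * y for y, x, w in rows)
    denominator = sum(w * x for y, x, w in rows)
    return numerator / denominator


def mean_of_ratios(rows):
    """Weighted mean of each family's own per capita income."""
    total_weight = sum(w for _, _, w in rows)
    return sum(w * y / x for y, x, w in rows) / total_weight


# Made-up families: (total income, family size, weight).
rows = [(60000, 6, 100), (120000, 4, 100)]
per_capita = ratio_of_totals(rows)
family_average = mean_of_ratios(rows)
print(per_capita, family_average)
```

The ratio of totals answers "how much income per person", while the mean of ratios answers "what is the typical family's per capita income"; larger families pull the first figure toward their own level.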

    86. Using Survey Commands: Linear regression of ln(pci)
            gen loginc=ln(pci)
            svyreg loginc age fsize sex prov urb
        Compare the results with the regular regression command
            reg loginc age fsize sex prov urb

    87. Using Survey Commands: Two-way tables
            svytab urb poor, row se
        compared with
            tab urb poor [aw=rfact], nofreq row

    89. Learning More about Stata: For the online tutorial, type
            tutorial intro
        List of Tutorials
        Tutorial    Description
        -----------------------------------------------------
        intro       An introduction to Stata
        graphics    How to make graphs
        tables      How to make tables
        regress     Estimating regression models, inc 2SLS
        anova       Estimating one-, two- and N-way ANOVA and ANCOVA models

    90. Learning More about Stata
        Tutorial    Description
        -----------------------------------------------------
        logit       Estimating maximum-likelihood logit and probit models
        survival    Estimating ML survival models
        factor      Estimating factor and principal component models
        ourdata     Description of the data we provide
        yourdata    How to input your own data into Stata

    91. Learning More about Stata: Email distribution list. Send email to Majordomo@hsphsun2.harvard.edu. In the body of your email message, type
            subscribe statalist email@address
        or, for a daily summary,
            subscribe statalist-digest email@address
