2. OUTLINE Statistical Computing Resources
Data Management with Stata
Table Generation
Tab and Table Commands
Survey Commands
4. Computing Resources The Age of ICT has brought about a synergy of computing and communications
Implications:
More DATA collected
More DATA stored
More DATA accessible and distributed
5. Computing Resources A host of statistical software packages provide pre-programmed analytical and data management capabilities. These packages may be classified according to use and cost.
6. Computing Resources Types of Stat Software by usage
General Purpose -- SAS, SPSS, R, S-plus, Statistica, Stata
Special Purpose -- econometric modeling (EViews), seasonal adjustment (X12), Bayesian modeling (WinBUGS), survey data tabulation & variance estimation (IMPS, CENVAR)
7. Computing Resources Types of Stat Software by cost
Commercial Software - SAS, SPSS, Stata, S-plus
Freeware - R, IMPS, X12
8. Computing Resources FOR SURVEY DATA
Bascula from Statistics Netherlands.
CENVAR (& IMPS) from U.S. Bureau of the Census.
CLUSTERS from University of Essex.
Epi Info from Centers for Disease Control.
Generalized Estimation System (GES) from Statistics Canada.
IVEWare (beta version) from University of Michigan.
9. Computing Resources FOR SURVEY DATA
PCCARP from Iowa State University.
SAS/STAT from SAS Institute.
Stata from Stata Corporation.
SUDAAN from Research Triangle Institute.
VPLX from U.S. Bureau of the Census.
WesVar from Westat, Inc.
10. Computing Resources Lists of Statistical Software
http://members.aol.com/johnp71/javasta2.html
http://www.stir.ac.uk/Departments/HumanSciences/SocInfo/Statistical.htm
http://www.fas.harvard.edu/~stats/survey-soft/
http://www.feweb.vu.nl/econometriclinks/software.html
11. Computing Resources This afternoon, we will provide a demonstration on how to use STATA for accomplishing some of the most common tasks of data management, statistical computing and analysis of survey data.
12. Computing Resources Stata
Estimation of means, totals, ratios, and proportions;
linear regression, logistic regression, and probit.
Point estimates, associated standard errors, confidence intervals, and design effects for the full population or subpopulations are displayed.
13. Computing Resources Stata
Auxiliary commands display various information for linear combinations (e.g., differences) of estimators, and conduct hypothesis tests.
New in Stata : contingency tables with Rao-Scott corrections of chi-squared tests; new survey-corrected regression commands including tobit, interval, censored, instrumental variables, multinomial logit, ordered logit and probit, and Poisson
14. Computing Resources Stata
stratified designs;
cluster sampling;
FPCs can be calculated for simple random sampling w/o replacement of sampling units within strata;
variance estimation for multistage sample data carried out through the customary between-PSU-squared-differences calculation.
15. Computing Resources Stata
Variance estimation is done thru Taylor-series linearization in the survey analysis commands. There are also commands for jackknife and bootstrap variance estimation, but these are not specifically oriented toward survey data.
16. Computing Resources Note:
We will demonstrate the use of STATA version 6. The current version is version 7, which is also available in a Special Edition (SE) that can handle up to 32,766 variables, strings of up to 244 characters, and matrices of up to 11,000 x 11,000.
18. Data Management STARTING UP
Go to Start, Programs, Stata, Intercooled Stata
Alternatively, from Windows Explorer, go to folder
c:\stata
Double click
wstata.exe
20. Data Management CREATING A NEW DATASET
Open the STATA spreadsheet editor
21. CREATING A NEW DATASET
Enter data into the editor; when done, close the editor.
22. CREATING A NEW DATASET
In the STATA COMMAND window enter the command
save newfile
23. NOTE
A STATA dataset has the file extension .dta. That is, newfile is actually saved as newfile.dta
Public use files of some surveys, e.g. VLSS (Vietnam Living Standards Survey), are in Stata format.
24. INSPECTING DATA BASE
In the STATA COMMAND window enter the following commands
describe
list
summarize
25. NOTE:
Stata is case sensitive.
Stata commands may be abbreviated, e.g. d for describe, sum for summarize, etc.
We may use the Page Up/Down keys or the mouse to re-select commands in the Review window.
26. NOTE:
Commands and output are shown in Results window. Windows may be re-sized.
Commands and output may be logged into a log file by pressing the Open Log button.
27. RENAMING VARIABLES
ONE WAY: (From Data Editor) Double click anywhere in the variable's column to bring up a dialogue box
28. RENAMING VARIABLES
SECOND WAY: (In the STATA COMMAND window) enter
rename var1 domain
rename var2 hcn
rename var3 age
label variable age "HH head age"
d
29. SAVING EDITED DATABASE
In the STATA COMMAND window enter the following commands
save newfile, replace
Note: typing only
save newfile
will result in an error message, because newfile.dta already exists
30. READING PRE-EXISTING
STATA DATASET
If dataset is in folder c:\fies2000 and filename is “fies00small.dta”, enter
clear
set mem 64m
cd c:\fies2000
use fies00small
32. IMPORTING DATA
Suppose we have a dataset try.txt in c:\fies2000 folder
Use the infile command with syntax
infile variable-list using filename.raw
In particular, enter
cd c:\fies2000
infile domain hcn age using try.txt, automatic
33. TRIVIA ON STRING VARIABLES
When using the infile command for character (string) variables, we need to identify these variables. For instance
infile domain hcn str30 prov using tr.txt
For more details regarding infile, enter
help infile1
35. IMPORTING DATA
Suppose we have a dataset try2.txt in c:\fies2000 folder with the data in specific fields
Use the infix command
infix domain 1 hcn 2 age 3-4 using try2.txt, clear
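A minimal sketch of the fixed-format read above, written for a recent Stata release rather than the version 6 used in these slides. The file name try2_demo.txt and its three records are made up for illustration, and the file command used to write them is an assumption about your Stata version (it is not available in version 6).

```stata
* Write a tiny fixed-format file (hypothetical data), then read it with infix.
tempname fh
file open `fh' using "try2_demo.txt", write text replace
file write `fh' "1234" _n
file write `fh' "2145" _n
file write `fh' "1367" _n
file close `fh'

* Column 1 = domain, column 2 = hcn, columns 3-4 = age
infix domain 1 hcn 2 age 3-4 using "try2_demo.txt", clear
list
```

Each record carries no separators, so the column positions in the infix command are the only thing that tells Stata where one variable ends and the next begins.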
36. Thus, Stata can read text files with
infile (if the data in the text file are separated by spaces and have no strings, or if strings are just one word, or if all strings are enclosed in quotes)
infix (fixed format text)
insheet (if the text file was created by a spreadsheet or database program)
37. NOTE:
The commands infile, infix, insheet read data from ASCII files. The outfile command saves the data in ASCII.
There are third party programs, esp. Stat/Transfer and DBMS/COPY, that perform translations from one data format (e.g., dBASE, Excel, SAS, SPSS, Stata) to another.
39. Data Management OTHER USEFUL COMMANDS
To sort the dataset by age
sort age
To get a listing of the dataset
list
To get a listing of the 2nd to 4th observations
list in 2/4
40. Data Management OTHER USEFUL COMMANDS
To summarize the restricted dataset of HHs whose head's age is less than or equal to 50
summarize if age <=50
HH head age strictly between 35 and 50
summarize if age <50 & age >35
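The effect of the if qualifier can be checked on a toy dataset. The sketch below uses six made-up ages, one of them missing, and is meant to be run from a do-file. It also illustrates a point worth keeping in mind: Stata treats a missing value as larger than any number, so a condition such as age > 35 is true for missing ages unless they are excluded explicitly.

```stata
clear
input age
30
35
42
50
61
.
end

* Strictly between 35 and 50: only age 42 qualifies
count if age > 35 & age < 50

* Inclusive range 35 to 50: ages 35, 42 and 50 qualify
count if inrange(age, 35, 50)

* A bare "greater than" also counts the missing age (42, 50, 61 and .)
count if age > 35

* Excluding missing values explicitly leaves 42, 50 and 61
count if age > 35 & !missing(age)
```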
41. Data Management Comparison operators
> >= ==
< <= !=
Logical operators
& (and) ! (not)
| (or) ~ (not)
42. Data Management OTHER USEFUL COMMANDS
To tabulate domain
tab domain
To generate contingency tables
tab domain hcn if age>35
To get the correlation matrix
correlate x y z
43. Data Management GENERATING & REPLACING VARIABLES
Suppose we want to obtain per capita income (pci) of FIES 2000 households
clear
cd d:\fies00
use fies00small
gen pci=toinc/hsize
44. Data Management GENERATING & REPLACING VARIABLES
Now tag the household as poor (1) if pci is below some threshold, say 13823, and determine the percent of HHs that are poor.
gen poor=1 if pci < 13823
replace poor=0 if poor==.
sum poor [aw=rfact]
save fies00small, replace
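As an alternative sketch, the indicator can be built in one step. The four households below are made up for illustration; only the variable names pci, poor and rfact and the threshold 13823 come from the slides. Unlike the two-step gen/replace approach, this version leaves poor missing when pci is missing instead of coding such households as nonpoor.

```stata
clear
input pci rfact
10000 100
20000 200
12000 100
.      50
end

* 1 if below the line, 0 otherwise, missing when pci is missing
generate byte poor = (pci < 13823) if !missing(pci)

* Weighted share of poor households (the mean of a 0/1 variable)
summarize poor [aw = rfact]
display "weighted poverty incidence = " r(mean)
```

Whether households with missing income should count as nonpoor or be left out is a judgment call for the analyst; the sketch only makes the choice explicit.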
45. Data Management NOTE
Only a small portion of the FIES 2000 data set was used. The Family Income and Expenditure Survey (FIES) is conducted by the National Statistics Office (NSO) every 3 years. Data may be purchased through the NSO website:
www.census.gov.ph
47. Data Management RECALL
That with our fies2000 data set we entered
set mem 64m
cd c:\fies2000
use fies00small
sum poor [aw=rfact]
Note: the poverty line we provided is a weighted average of the varying poverty lines in the Philippines (for urban-rural areas across the different regions)
49. Estimating Food Poverty Line The food poverty line is estimated from low-cost one-day menus (breakfast, lunch, supper, snack) constructed for each urban-rural area of a region by the Food and Nutrition Research Institute (FNRI). The menus meet 100% sufficiency in energy and protein requirements and 80% sufficiency in other nutrients and vitamins.
RDA’s for energy: 2000 Kcal per person
RDA’s for protein: 50 grams per person
29 such menus constructed on the basis of the 1988 Food Consumption Survey
50. Annual Per Capita Food Line Urban, by Region
51. Annual Per Capita Food Line Rural, by Region
52. Estimating Poverty Line Poverty Line= Food Threshold/ Engel’s Coefficient
Engel’s coefficient is estimated by analyzing the consumption pattern of families with incomes within plus or minus 10 percentage points of the food threshold.
Engel’s coeff = Food Exp/ Total Basic Exp
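The arithmetic behind the two formulas can be sketched with Stata scalars. The food threshold and Engel's coefficient below are hypothetical round numbers chosen for illustration, not official figures, and the scalar names are our own.

```stata
* Hypothetical inputs, for illustration only (not official figures)
scalar food_threshold = 9000
scalar engel_coeff    = 0.60

* Poverty Line = Food Threshold / Engel's Coefficient
scalar poverty_line = food_threshold / engel_coeff
display "poverty line = " %9.0f poverty_line
```

Because Engel's coefficient is a share between 0 and 1, dividing by it always scales the food threshold up: the smaller the food share of basic expenditure, the higher the resulting poverty line.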
53. Annual Per Capita Poverty Line Urban, by Region
54. Annual Per Capita Poverty Line Rural, by Region
55. Poverty Statistics (Family)
56. Poverty Incidence All Areas, by Region
57. Small Area Poverty Stats? Stata has some add-ons for generating SEs for poverty stats
If we try to generate provincial poverty statistics, we will find that the SEs are too high, i.e. the figures are unreliable
60. Data Management NOTE:
STATA uses several types of weights
fw frequency weights
aw analytic weights
iw importance weights
pw probability weights
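A small sketch of how two of these weight types differ in practice, using three made-up rows. With frequency weights each row stands for w identical observations; with analytic weights the weighted mean is the same, but the number of observations stays at the number of rows.

```stata
clear
input x w
1 1
2 1
3 2
end

* Frequency weights: the data are treated as 4 observations (1, 2, 3, 3)
summarize x [fw = w]
display "fw: N = " r(N) "  mean = " r(mean)

* Analytic weights: same weighted mean, N remains the 3 rows in memory
summarize x [aw = w]
display "aw: N = " r(N) "  mean = " r(mean)
```

The summarize command does not accept pw, which is one reason the slides use aw with sum and reserve probability weights for the survey commands.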
61. Data Management NOTE:
Within the command generate or replace, we may transform or create variables by using functions, e.g.,
generate loginc=ln(toinc)
generate y=cos(x*_pi/180)
replace newvar=normd(z)
generate rvar=uniform()
62. Data Management DELETING VARIABLES/DATA
To drop a variable, say age
drop age
To drop some observations
drop in 2/3
Try also the command keep.
To drop all data in memory
clear
63. Data Management NOTE:
So far we have used STATA interactively. We can also do batch processing through the DO FILE editor.
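A do-file is simply a text file of commands. The sketch below is one possible skeleton, assuming a recent Stata release; the file name example.do and the log name example_log are hypothetical, and the auto dataset that ships with Stata stands in for the survey data.

```stata
* example.do : run it with   do example.do
* Close any log left open by an earlier run, then start a fresh one
capture log close
log using example_log, text replace

* Commands that would otherwise be typed one at a time
sysuse auto, clear
describe
summarize price mpg

log close
```

Keeping the commands in a do-file makes the analysis repeatable, and the log file records both the commands and their output.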
64. Data Management NOTE:
The STATA toolbar has 13 buttons.
The first three are to OPEN a Stata dataset
SAVE to the disk the resident dataset
PRINT a graph or log
65. Data Management
The next five are for Starting/stopping/suspending a LOG
Bringing the Log to the Front
Bringing the Dialog to Front
Bringing the Results to Front
Bringing the Graph to Front
66. Data Management
The last five are for
Opening the DO FILE editor
Opening the DATA editor
Opening the DATA Browser
Telling Stata to continue when it has paused in the middle of long output
Stopping the current task
67. Exercise What is the average income of families that are below or above the mean family expenditure?
68. Exercise Compare correlation of food expenditures (fexp) and nonfood expenditures for families in rural & urban areas.
69. Extra Enter
graph food nfood
70. Extra Now try
sort urb
graph food nfood, by(urb)
graph food nfood, by(urb) total
71. Extra Matrix plots
graph toinc food nfood, matrix
73. Table Generation w/ tab Earlier, we showed the use of the tab(ulate) command. Try
tab urb
tab urb [aw=rfact]
tab urb [iw=rfact]
tab urb regn
74. Tab The tab command has options for generating 1-way tables of freqs
tab urb, summ(toinc)
and two way tables
tab urb sex
tab urb sex, row
tab urb sex, row col chi2
tab urb sex, all exact
75. Table Generation w/ table Aside from the tab command, we can generate tables of statistics with the table command. Compare
tab urb
with
table urb
76. Table To generate the average (family) income and average (family) expenditure across urban and rural areas, enter
table urb, c(mean toinc mean toexp)
Using weights
table urb [aw=rfact], c(mean toinc mean toexp)
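The table syntax above is that of the Stata version used in these slides. In recent releases (Stata 17 and later, as far as we know) the table command was redesigned and the c() option is accepted only under version control, e.g. version 16: table urb, c(mean toinc). A sketch that should work across versions uses tabstat instead; the four households below are made up for illustration, with variable names taken from the slides.

```stata
clear
input urb toinc toexp rfact
1 100  80 1
1 200 150 3
2  50  40 2
2  70  60 2
end

* Weighted mean income and expenditure by urban/rural group
tabstat toinc toexp [aw = rfact], by(urb) statistics(mean)
```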
77. Table The contents option may specify at most five of the following statistics:
freq (for frequency)
mean varname (for mean of varname)
sd varname (for standard deviation)
sum varname (for sum)
rawsum varname (for sums ignoring optionally specified weight)
count varname (for count of nonmissing data)
78. Table The contents option may specify at most five of the following statistics (continued):
n varname (same as count)
max varname (for maximum)
min varname (for minimum)
median varname (for median)
p1 varname (for 1st percentile)
p2 varname (for 2nd percentile)
...
iqr varname (for interquartile range)
79. Exercise Using Table Obtain the average and median per capita income of households by sex of household head
table sex, c(mean pci median pci)
Obtain the “weighted” frequency of poor and nonpoor households across regions
table poor regn [iw=rfact]
80. Using Survey Commands STATA has designed a family of commands especially for sample surveys. These commands all begin with svy
svyset set the survey design variables
svydes describe strata and PSUs
svymean estimate popn & subpop means
svytotal estimate popn & subpop totals
81. Using Survey Commands Svy commands
svyprop estimate popn & subpop props
svyratio estimate popn & subpop ratios
svytab for two way tables
svyreg for regression
svyivreg for instrumental variables reg
svylogit for logit reg
svyprobit for probit reg
82. Using Survey Commands Svy commands
svytest for hypothesis testing
svylc for estimating linear combs
svymlog for multinomial logistic reg
svyolog for ordered logistic reg
svyoprob for ordered probit reg
svypois for poisson reg
svyintrg for censored & interval reg
83. Using Survey Commands Before issuing any svy estimation command, we identify the weight, strata and PSU identifier variables
svyset pweight rfact
svyset strata domain
svyset psu hcn
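The three svyset lines above follow the version 6 syntax. In later Stata releases the design is declared in a single svyset call and the estimation commands take the svy: prefix. The sketch below shows that form on simulated data; the design (4 strata, 10 PSUs per stratum, 5 households per PSU) and all values are hypothetical, and only the variable names come from the slides.

```stata
clear
set seed 12345
set obs 200

* Hypothetical design: 4 strata, 10 PSUs per stratum, 5 households per PSU
generate domain = ceil(_n/50)
generate hcn    = ceil(_n/5)
generate rfact  = 100 + 10*domain
generate fsize  = 1 + floor(6*runiform())
generate toinc  = 50000 + 20000*runiform()

* One svyset call declares the PSU, the weight and the strata together
svyset hcn [pweight = rfact], strata(domain)

* Estimation commands take the svy: prefix
svy: mean toinc
svy: total toinc, over(domain)
svy: ratio toinc/fsize
```

The settings are stored with the dataset, so they need to be declared only once before any number of svy: commands.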
84. Using Survey Commands To obtain the average family income & average family expenditure
svymean toinc toexp
To obtain the total family income, total family expenditure by region
svytotal toinc toexp, by(regn)
85. Using Survey Commands To obtain the per capita income & per capita expenditure
svyratio toinc/fsize toexp/fsize
pci & pce by urban/rural
svyratio toinc/fsize toexp/fsize, by(urb)
86. Using Survey Commands Linear regression of ln(pci)
gen loginc=ln(pci)
svyreg loginc age fsize sex prov urb
Compare the results with the regular regression command
reg loginc age fsize sex prov urb
87. Using Survey Commands Two way tables
svytab urb poor, row se
compared with
tab urb poor [aw=rfact], nofreq row
89. Learning More about Stata Online tutorial, type
tutorial intro
List of Tutorials
Tutorial Description
-----------------------------------------------------
intro An introduction to Stata
graphics How to make graphs
tables How to make tables
regress Estimating regression models, inc 2SLS
anova Estimating one-, two- and N-way ANOVA and ANCOVA models
90. Learning More about Stata Tutorial Description
-----------------------------------------------------
logit Estimating maximum-likelihood logit and probit models
survival Estimating ML survival models
factor Estimating factor and principal component models
ourdata Description of the data we provide
yourdata How to input your own data into Stata
91. Learning More about Stata Email distribution list. Send email to
Majordomo@hsphsun2.harvard.edu
In the body of your email message type
subscribe statalist email@address
or, for a daily summary,
subscribe statalist-digest email@address