When Setup Files Go Bad….  Debugging your SAS, SPSS, and STATA code so it works

Presentation Transcript

  1. When Setup Files Go Bad….  Debugging your SAS, SPSS, and STATA code so it works • Felicia B. LeClere, Ph.D., Director, Data Sharing for Demographic Research

  2. Overview of webinar • Broaden the scope a bit….. • No set up files ---this is where we learn to debug • Set up files ---things that might not work • When the double click doesn’t work….

  3. Things we will be looking for • What it looks like when it runs… • When things don’t work… • How to diagnose what’s wrong…

  4. It’s just numbers…what do I do? • Many of our historical files require you to create syntax on your own…that means learning to read in ASCII data • You know you are in trouble when the download page looks like this…

  5. Instead of this……..

  6. What to do…. • Find the documentation and look for the following language • Column locations, field length, or variable position • These describe where your variables are in the ASCII data file and mark how you will read them in….

  7. What you will see…. • How to read the data • The data file location • The data file

  8. What you need to do • This is from the codebook…called tape position index • Variable location • Variable

  9. Variable descriptions

  10. And you know the drill…. • Identify the method for reading ASCII data in your favorite stat package • Use fixed format infile to read the fields • And build…..
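
Whatever package you use, fixed-format reading comes down to slicing each record at the column positions the codebook gives. As a package-neutral illustration (this is Python, not SAS, SPSS, or Stata syntax), here is a minimal sketch of that slicing. The layout borrows the positions used in the SAS example in this talk (pid 1-5, exam 16, lang 17); the function name `read_fixed` and the sample record are invented for illustration.

```python
# Codebook-style layout: variable name -> (first column, last column),
# 1-based and inclusive, the way codebooks usually count columns.
LAYOUT = {"pid": (1, 5), "exam": (16, 16), "lang": (17, 17)}


def read_fixed(record, layout):
    """Slice one fixed-width record into a dict of raw string fields."""
    fields = {}
    for name, (start, end) in layout.items():
        # Python slices are 0-based and exclude the end, hence start - 1.
        fields[name] = record[start - 1:end]
    return fields


# Hypothetical 17-column record, made up for this sketch:
# pid in columns 1-5, blanks in 6-15, exam in 16, lang in 17.
sample = "16785" + " " * 10 + "1" + "2"
print(read_fixed(sample, LAYOUT))
```

Keeping the fields as strings at this stage makes it easy to spot blanks and stray characters before converting anything to a number.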

  11. How do you know when it’s gone wrong • Says the file doesn’t exist or can’t be read, or gives some other message • Doesn’t read a variable or doesn’t recognize a variable name • Frequency counts really don’t match what is in the documentation • The valid values for a variable don’t match what’s in the codebook • The number of cases doesn’t match the number given in the codebook

  12. This looks right

      libname in "D:\fleclere\Desktop\misc documents\10294302\ICPSR_08535\DS0002";
      data new;
      infile "D:\fleclere\Desktop\misc documents\10294302\ICPSR_08535\DS0002\08535-0002-data.txt";
      input pid 1-5 exam 16 lang 17;
      proc freq;
      tables lang;
      run;

      NOTE: The infile "D:\fleclere\Desktop\misc documents\10294302\ICPSR_08535\DS0002\08535-0002-data.txt" is:
            File Name=D:\fleclere\Desktop\misc documents\10294302\ICPSR_08535\DS0002\08535-0002-data.txt,
            RECFM=V,LRECL=256
      NOTE: 11653 records were read from the infile "D:\fleclere\Desktop\misc documents\10294302\ICPSR_08535\DS0002\08535-0002-data.txt".
            The minimum record length was 256.
            The maximum record length was 256.
            One or more lines were truncated.
      NOTE: The data set WORK.NEW has 11653 observations and 3 variables.
      NOTE: DATA statement used (Total process time):
            real time 1.75 seconds
            cpu time 0.15 seconds

      6 proc freq;
      7 tables lang;
      8 run;

      NOTE: There were 11653 observations read from the data set WORK.NEW.
      NOTE: PROCEDURE FREQ used (Total process time):
            real time 0.96 seconds
            cpu time 0.00 seconds

  13. The SAS System          10:27 Tuesday, April 28, 2009   1

      The FREQ Procedure

      lang   Frequency   Percent   Cumulative Frequency   Cumulative Percent
      ----------------------------------------------------------------------
      1           5986     51.53                   5986                51.53
      2           5631     48.47                  11617               100.00

      Frequency Missing = 36

      Looks good!!

  14. Not so right …….

      libname in "D:\fleclere\Desktop\misc documents\10294302\ICPSR_08535\DS0002";
      data new;
      infile "D:\fleclere\Desktop\misc documents\10294302\ICPSR_08535\DS0002\08535-0002-data.txt";
      input pid 1-5 exam 16 lang 17 Bite 407;
      proc freq;
      tables bite;
      run;

      NOTE: The infile "D:\fleclere\Desktop\misc documents\10294302\ICPSR_08535\DS0002\08535-0002-data.txt" is:
            File Name=D:\fleclere\Desktop\misc documents\10294302\ICPSR_08535\DS0002\08535-0002-data.txt,
            RECFM=V,LRECL=256
      NOTE: LOST CARD.
      pid=16785 exam=1 lang=1 Bite=. _ERROR_=1 _N_=5827
      NOTE: 11653 records were read from the infile "D:\fleclere\Desktop\misc documents\10294302\ICPSR_08535\DS0002\08535-0002-data.txt".
            The minimum record length was 256.
            The maximum record length was 256.
            One or more lines were truncated.
      NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
      NOTE: The data set WORK.NEW has 5826 observations and 4 variables.
      NOTE: DATA statement used (Total process time):
            real time 0.34 seconds
            cpu time 0.21 seconds

      14 proc freq;
      15 tables bite;
      16 run;

      NOTE: There were 5826 observations read from the data set WORK.NEW.
      NOTE: PROCEDURE FREQ used (Total process time):
            real time 0.01 seconds
            cpu time 0.01 seconds

  15. • Allowable values don’t match • Frequencies don’t match

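  16. Why? • The record length SAS was using is 256 (LRECL=256 in the log), and the log was telling us so: “One or more lines were truncated.” • Once we got past the field position of 256…we got lost. Language was at position 17 ….and Bite at 407. SAS went to the next line to look for Bite, which is why only 5826 of the 11653 cases were kept • Solution: reset lrecl in the infile statement (lrecl=815)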
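
Before running the real job, it can help to ask which variables in your layout sit beyond the record length the software will actually read. The following is a package-neutral sketch in Python (not SAS syntax); the positions 17 and 407 and the lengths 256 and 815 come from the slide above, while the function name `fields_past_limit` is made up for illustration.

```python
# Variable name -> (first column, last column), 1-based and inclusive.
LAYOUT = {"pid": (1, 5), "exam": (16, 16), "lang": (17, 17), "bite": (407, 407)}


def fields_past_limit(layout, lrecl):
    """Return the names of fields whose last column lies beyond lrecl."""
    return [name for name, (start, end) in layout.items() if end > lrecl]


print(fields_past_limit(LAYOUT, 256))  # ['bite']
print(fields_past_limit(LAYOUT, 815))  # []
```

Any name this returns is a variable that cannot be read until the record length is raised.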

  17. Other reasons things go bad • Multiple lines per record --- a product of times when data were on cards and the record length was fixed at 80 • You read a string as a numeric or vice versa • Data errors or non-standard characters (files converted from mainframes or other formats)
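
For the multiple-lines-per-record case, a quick check is whether the number of physical lines divides evenly by the number of cards per case the codebook reports. This is a package-neutral sketch in Python (not SAS, SPSS, or Stata syntax); the function name `group_cards` and the tiny sample are invented, and it assumes every case has the same number of cards in the same order.

```python
def group_cards(lines, cards_per_case):
    """Group physical lines into cases of cards_per_case lines each."""
    if len(lines) % cards_per_case != 0:
        # A leftover line usually means a missing or extra card somewhere.
        raise ValueError(
            f"{len(lines)} lines is not a multiple of {cards_per_case}"
        )
    return [
        lines[i:i + cards_per_case]
        for i in range(0, len(lines), cards_per_case)
    ]


# Hypothetical file with two cards per case.
cards = ["0001 card1", "0001 card2", "0002 card1", "0002 card2"]
for case in group_cards(cards, 2):
    print(case)
```

If the check fails, the case count in your output will be off even when the column positions are right.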

  18. You have a syntax file … • You find a file and you download it for your favorite flavor of software • You decide to keep all the variables • You know where the data went (i.e. where you downloaded it to)

  19. Initial steps to test • Get rid of formatting • Add a frequency check for variables at the beginning and end • Simplify if you can (do you really need all those variables…)

  20. libname in "D:\fleclere\Desktop\misc documents\10294302\ICPSR_08535\DS0002";
      DATA;
      INFILE "D:\fleclere\Desktop\misc documents\10294302\ICPSR_08535\DS0002\22627-0001-Data.txt" LRECL=2983;
      INPUT CASEID 1-8 GENDER 9 AGE 10-11 ETHNONAT 12-13 ETHNOS10 14-15
            PANETH4 16-17 GENERAT3 18-20 .1 GENERAT4 21-23 .1 AGEARRV 24-25
            ABUELOFB 26-27 QUOGRPS 28-31 INTLANG 32-35 SAMPLE 36-39
            QS2AM 40-43 QS2AF 44-47 QS5A 48-51 QS5B 52-55 QS6A 56-59
            QS6B 60-63 QS7 64-67 QS8 68-71

      What happened?

      ERROR: Physical file does not exist, D:\fleclere\Desktop\misc documents\10294302\ICPSR_08535\DS0002\22627-0001-Data.txt.
      NOTE: The SAS System stopped processing this step because of errors.
      WARNING: The data set WORK.DATA1 may be incomplete. When this step was stopped there were 0 observations and 657 variables.
      NOTE: DATA statement used (Total process time):
            real time 0.29 seconds
            cpu time 0.29 seconds

      1534 Proc freq;
      1535 Tables gender polparty;
      1536
      1537 RUN ;

      NOTE: No observations in data set WORK.DATA1.
      NOTE: PROCEDURE FREQ used (Total process time):
            real time 0.00 seconds
            cpu time 0.00 seconds

  21. NOTE: The infile "D:\fleclere\Desktop\misc documents\10294358\ICPSR_22627\DS0001\22627-0001-Data.txt" is:
            File Name=D:\fleclere\Desktop\misc documents\10294358\ICPSR_22627\DS0001\22627-0001-Data.txt,
            RECFM=V,LRECL=2983
      NOTE: 4655 records were read from the infile "D:\fleclere\Desktop\misc documents\10294358\ICPSR_22627\DS0001\22627-0001-Data.txt".
            The minimum record length was 2983.
            The maximum record length was 2983.
      NOTE: The data set WORK.DATA3 has 4655 observations and 657 variables.
      NOTE: DATA statement used (Total process time):
            real time 2.51 seconds
            cpu time 0.73 seconds

      4576 Proc freq;
      4577 Tables gender polparty;
      4578
      4579 RUN ;

      NOTE: There were 4655 observations read from the data set WORK.DATA3.
      NOTE: PROCEDURE FREQ used (Total process time):
            real time 0.01 seconds
            cpu time 0.01 seconds
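
The "Physical file does not exist" error is usually a path problem: the folder is right but the file name is not, or the other way round. Here is a package-neutral sketch in Python (not SAS syntax) of checking where the data went before running anything. The function name `check_data_file` is made up, and the demonstration builds a throwaway folder rather than touching any real path.

```python
import tempfile
from pathlib import Path


def check_data_file(path):
    """Return None if path is an existing file, else a short diagnosis."""
    p = Path(path)
    if p.is_file():
        return None
    if not p.parent.is_dir():
        return f"folder not found: {p.parent}"
    # The folder exists, so list what is actually in it.
    nearby = sorted(q.name for q in p.parent.iterdir() if q.is_file())
    return f"file not found: {p.name}; folder contains: {nearby}"


# Demonstration in a temporary folder: the folder holds one study's data
# file, but we ask for a different study's file name.
with tempfile.TemporaryDirectory() as folder:
    (Path(folder) / "08535-0002-data.txt").write_text("demo\n")
    print(check_data_file(Path(folder) / "22627-0001-Data.txt"))
```

Seeing the names that are actually in the folder tends to make a mixed-up study number obvious.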

  22. This is from our codebook

  23. What else should I check? • This is from the original survey documentation before ICPSR standardization • Always validate the data produced against documentation from the original data set to be sure • The syntax and the ICPSR codebook have the same origins --- an error in one may be reproduced in the other • Total case counts and frequencies
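
Comparing frequencies against the documentation is easy to do by eye for one variable and tedious for many. The sketch below, in Python rather than any of the three packages, shows the comparison itself; the function name `compare_frequencies` and the toy values are invented, and the expected counts would come from your codebook.

```python
from collections import Counter


def compare_frequencies(values, expected):
    """Return {value: (observed, expected)} for every value whose counts differ.

    values   -- raw field values as strings, one per case
    expected -- codebook counts, keyed by the same string values
    """
    observed = Counter(values)
    mismatches = {}
    for value in sorted(set(observed) | set(expected)):
        got = observed.get(value, 0)
        want = expected.get(value, 0)
        if got != want:
            mismatches[value] = (got, want)
    return mismatches


# Toy data: one "2" is missing and a blank shows up that the
# (hypothetical) codebook does not list.
values = ["1", "1", "1", "2", "2", " "]
print(compare_frequencies(values, {"1": 3, "2": 3}))
```

An empty result means every value and count agrees with the documentation. Anything else tells you which values to look at first.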

  24. If the frequencies or case counts don’t match • Check the lrecl against the documentation • Check the field lengths ----the codebooks should contain for each variable its location and field length • Punctuation counts …SAS likes its semicolons and SPSS its spaces and periods and STATA is fussy about what goes before and after a comma
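
One way to check the lrecl against the documentation is to measure the raw file directly. This is a minimal package-neutral sketch in Python (not SAS, SPSS, or Stata syntax); the function name `record_lengths` is made up, and the demonstration uses an in-memory stand-in for a data file.

```python
import io


def record_lengths(stream):
    """Return (number of records, shortest length, longest length)."""
    # Strip only the line terminator so trailing blanks still count.
    lengths = [len(line.rstrip("\r\n")) for line in stream]
    if not lengths:
        return (0, 0, 0)
    return (len(lengths), min(lengths), max(lengths))


# In-memory stand-in for a data file: three records, one of them short.
demo = io.StringIO("AAAA\nBBBB\nCC\n")
print(record_lengths(demo))  # (3, 2, 4)
```

For a real file you would pass an open file object instead of the `StringIO`. The record count should match the codebook's case count (times the number of lines per case, if there are several), and the longest length should not exceed the documented record length.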

  25. If variable looks weird • Print observations …. Everything checks out but there are weird fields or non-numeric items in a frequency display. Print a record or 2.
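
"Print a record or 2" is most useful when you print the records that are actually odd. As a package-neutral sketch in Python (not SAS, SPSS, or Stata syntax), the function below returns the first few records whose field is neither blank nor all digits. The name `odd_records` and the sample lines are invented. It assumes an unsigned integer field, so a legitimate minus sign or decimal point would also be flagged.

```python
def odd_records(lines, start, end, limit=2):
    """Return up to `limit` (line number, record) pairs whose field in
    columns start..end (1-based, inclusive) is not blank and not all digits."""
    found = []
    for number, line in enumerate(lines, start=1):
        field = line[start - 1:end].strip()
        if field and not field.isdigit():
            found.append((number, line))
            if len(found) == limit:
                break
    return found


# Hypothetical records with a one-column field in column 5.
sample = ["00011", "0002A", "0003 ", "0004?"]
for number, record in odd_records(sample, 5, 5):
    print(number, repr(record))
```

Looking at the whole record, not just the field, often shows whether the problem is a bad character or a column position that is off by one.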

  26. I pointed, I clicked, and … • Things to ask yourself Is it a version issue? (SAS in particular has problems reading different versions) Do you have the software? (the icon will not look right)

  27. Why should you always run the ASCII syntax instead? • It allows you to customize the file • It forces you to know where the data are • It forces you to read the log files even if all you are doing is watching them go by • You have to open the software version --- and it will run it in the version you have and create the file the way you need it • It prevents you from being complacent

  28. Steps to prevent bugs • Simplify the syntax…take out all the extraneous stuff • Pick fewer variables • Always add frequency counts • Always check case counts

  29. Steps to prevent bugs • Know where you put the data • Read the documentation first • Save log files as well as program files • Verify, verify, verify

  30. If you find errors in ICPSR syntax • Please help us … send a corrected file and a description of the error to: netmail@icpsr.umich.edu We do updates all the time

  31. If you need help with the basics • Our help site for reading in data: Using Data • Great help in building and debugging statistical software programs: UCLA