
Summarizing Data

Summarizing Data. STT 501 Spring 2007. Summary Statistics. Two base SAS procedures can be used to generate summary statistics for quantitative values—PROC MEANS and PROC UNIVARIATE

lavina




Presentation Transcript


  1. Summarizing Data STT 501 Spring 2007

  2. Summary Statistics
     • Two base SAS procedures can be used to generate summary statistics for quantitative values: PROC MEANS and PROC UNIVARIATE.
     • Both procedures are capable of computing several summary statistics: mean, standard deviation, quantiles and several others.
     • PROC UNIVARIATE has a more comprehensive set of data analysis tools available.

  3. Means Procedure
     • General syntax (lots of stuff that we won’t always use):

         PROC MEANS <option(s)> <statistic-keyword(s)>;
            BY <DESCENDING> variable-1 <... <DESCENDING> variable-n> <NOTSORTED>;
            CLASS variable(s) </ option(s)>;
            FREQ variable;
            ID variable(s);
            OUTPUT <OUT=SAS-data-set> <output-statistic-specification(s)>
                   <id-group-specification(s)> <maximum-id-specification(s)>
                   <minimum-id-specification(s)> </ option(s)>;
            TYPES request(s);
            VAR variable(s) </ WEIGHT=weight-variable>;
            WAYS list;
            WEIGHT variable;
         RUN;

  4. Means Procedure
     • Most likely used statements:
         • var: specifies the variables you wish to summarize.
         • class: specifies variables used to group the data before summarizing.
     • Sometimes used statements:
         • freq: specifies a variable that indicates the frequency of each observation. This is useful if the data contains several occurrences of the same value that have been grouped together.
         • id: specifies variables that are not used in the analysis that you would like to keep for creation of an output data set.
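As a rough illustration of the freq statement described above, the sketch below uses a small made-up data set. The data set work.scores and its variables score and count are hypothetical and are not part of the course data. Each row stands for count identical observations.

```sas
/* Hypothetical toy data: one row per distinct score, plus how often it occurred */
data work.scores;
   input score count;
   datalines;
10 3
20 1
30 2
;
run;

/* FREQ makes each row count as 'count' observations */
proc means data=work.scores n mean;
   var score;
   freq count;
run;
```

With the freq statement, N is reported as 6 rather than 3 and the mean is 110/6, or about 18.33. Without it, PROC MEANS would treat the three rows as three observations and report a mean of 20.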

  5. Means Procedure
     • Use the “projects” data set in our data folder:

         proc means data=stt501.projects;
         run;

     • What output appears?

  6. Means Procedure • Default behavior is to summarize ALL numeric variables, whether it makes sense or not (SAS doesn’t know any better). Use the var statement to override this behavior. What do you think of these?
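A minimal sketch of that override, assuming the stt501 libref has already been assigned to the course data folder and that personel and jobtotal (two variables that appear elsewhere in this deck) are the ones worth summarizing:

```sas
/* Assumes: a LIBNAME statement for stt501 has already been run */
proc means data=stt501.projects;
   var personel jobtotal;   /* summarize only these two numeric variables */
run;
```

Only the variables named in the var statement appear in the output; any other numeric columns in the data set are ignored.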

  7. Means Procedure
     • Use a class statement to break up the analysis over 2 or more categories:

         proc means data=stt501.projects;
            class region;
            var personel;
         run;

     • Summaries for personnel costs are produced for each different value of the variable region.

  8. Means Procedure
     • The variable listed in the var statement is noted here.
     • Each level of the class variable produces a separate summary.

  9. Means Procedure
     • Specifying statistics keywords:

         proc means data=stt501.projects min q1 median q3 max;
            class region pol_type;
            var personel;
         run;

     • Keywords override the default summary statistics. You can find a listing of the available keywords in the help section.
     • Note: with two variables in the class statement, a summary is produced for each combination of values for those variables.

  10. Means Procedure
     • With two variables in the class statement, a summary is produced for each combination.
     • Statistics are provided in the order listed in the proc means statement.
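The general syntax on slide 3 lists an OUTPUT statement that the examples do not demonstrate. The following is a sketch only, again assuming the stt501 libref is assigned. The output data set name work.region_summary and the variable names mean_personel and median_personel are made up for illustration.

```sas
/* NOPRINT suppresses the printed table; results go to a data set instead */
proc means data=stt501.projects noprint;
   class region;
   var personel;
   output out=work.region_summary
          mean=mean_personel
          median=median_personel;
run;

/* Inspect the saved summary statistics */
proc print data=work.region_summary;
run;
```

The output data set also carries the automatic variables _TYPE_ and _FREQ_. With a class statement it includes an overall row (_TYPE_ = 0) in addition to one row per region, unless the NWAY option is added to the proc means statement.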

  11. Univariate Procedure • The univariate procedure is similar to the means procedure in that it accepts var and class statements. • However, it has many more statements/options available to it, and it produces a greater amount of output by default.

  12. Univariate Procedure
     • Try this code:

         proc univariate data=stt501.projects;
            var personel;
         run;

     • The default output includes a more extensive list of summary statistics than the means procedure.

  13. Univariate Procedure
     • These tables contain some basic summary statistics like mean, median and standard deviation.
     • This table gives percentiles/quantiles, including the five number summary.
     • This table gives results of testing whether the mean is zero or not.
     • The second page of output has a table containing the five largest and smallest values.
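If only some of these tables are wanted, one option (not covered in the slides) is an ODS SELECT statement placed before the procedure. The sketch below assumes the stt501 libref is assigned and keeps just the basic measures and quantiles tables.

```sas
/* Keep only two of the default PROC UNIVARIATE tables for the next step */
ods select BasicMeasures Quantiles;

proc univariate data=stt501.projects;
   var personel;
run;
```

The other default tables have the ODS names Moments, TestsForLocation and ExtremeObs, so any of those can be added to the list. The selection applies to the next procedure step only.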

  14. Univariate Procedure
     • The univariate procedure supports class statements in the same manner as PROC MEANS.
     • Try this code:

         proc univariate data=stt501.projects;
            class region pol_type;
            var personel;
         run;

     • The default output is now produced for each combination of values for the variables listed in the class statement.

  15. Univariate Procedure
     • Univariate can also produce histograms (for any variable that was listed in the var statement). Try this code:

         proc univariate data=stt501.projects;
            class pol_type;
            var personel;
            histogram personel;
         run;

     • This produces a histogram of personnel costs for each type of pollution.

  16. Univariate Procedure Note the common scaling on both the vertical and horizontal axes
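As an optional extension that the slides do not cover, the histogram statement accepts options after a slash. The sketch below, which assumes the stt501 libref is assigned, adds the NORMAL option to overlay a fitted normal density on each histogram.

```sas
proc univariate data=stt501.projects;
   class pol_type;
   var personel;
   histogram personel / normal;   /* overlay a fitted normal curve */
run;
```

Because a class statement is present, one panel is still drawn per pollution type, and the fitted curve is estimated separately within each panel.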

  17. Boxplots
     • We can make boxplots in SAS using (surprisingly enough) proc boxplot. Its general form is:

         proc boxplot data=dataset;
            plot analysis-var*group-var;
         run;
         quit;

     • The data must be pre-sorted on the group variable.

  18. Boxplots
     • For example:

         proc sort data=stt501.projects out=proj_sort;
            by region;
         run;

         proc boxplot data=proj_sort;
            plot jobtotal*region;
         run;
         quit;

     • Produces a set of side-by-side boxplots.

  19. Boxplots By default, SAS produces skeletal boxplots—no outlier detection is done or shown on the graph. This can be changed with the boxstyle option.

  20. Boxplots
     • To get an outlier boxplot, use:

         proc boxplot data=proj_sort;
            plot jobtotal*region / boxstyle=schematic;
         run;
         quit;

     • Each outlier can be tagged with an id variable as well:

         proc boxplot data=proj_sort;
            plot jobtotal*region / boxstyle=schematicid;
            id date;
         run;
         quit;

  21. Boxplots If outliers are compressed together, ids will be also.
