Programming tips and techniques for SAS SQL (no sqealing allowed)

Programming tips and techniques for SAS SQL(no sqealing allowed) Dan Gottlieb, MS, Research Associate Center for the Evaluative Clinical Sciences Dartmouth Medical School Hanover, NH 3/21/03 NH/VT SAS User Group

What is SQL? • Structured Query language. Developed by IBM decades ago. • A method to provide understandable access to database data. • Data merges, manipulation, statistical functions and presentation all in one! • “Industry standards” exist, but SAS does not comply 3/18/2003 NH/VT SAS User Group

Advantages of SQL? • Portability: code can be converted to oracle, SQL server, etc • All in one programming- combine effects of multiple proc and data steps in one statement • flexibility: nesting of statements, different merge types, selective printing 3/18/2003 NH/VT SAS User Group

Advantages of SQL • precise selection of variables in merges • ‘readable’ code- can be understood by non-programmers • No sorting necessary for merges! • Can be more efficient than standard SAS methods • Good for simulation (all possible combinations) • you must specify data set names 3/18/2003 NH/VT SAS User Group

Disadvantages? • Order of commands is important. • Can not specify lists of variables (e.g. var1- var10 or var1--var10) explicitly. • Resulting data set size will explode if not careful when merging! • Unexpected results when not careful or duplicates exist. • You have to specify data set names 3/18/2003 NH/VT SAS User Group

Disadvantages? • Not designed for simply combining multiple data sets (e.g. data one; set two three four;) • not as flexible in printing as proc report or proc print (e.g. no id var or obs # and always prints labels) • Limited in the number of data sets that can be merged (14?) at once • awkward for altering data using conditions (if, case) 3/18/2003 NH/VT SAS User Group

Goals for today • Understand when SQL is preferable to data/proc steps. • Learn merging techniques • Learn debugging uses and techniques. • Learn how to use summarizing techniques 3/18/2003 NH/VT SAS User Group

Disclaimers/warnings • My “style” may not match that of others or best practices among “real” sql programmers • ~3 years of experience and still learning • Use subqueries/nested queries with caution! • Use data/proc steps and sql together when learning/testing. 3/18/2003 NH/VT SAS User Group

Database vernacular • Row=observation • column=variable • table=data set • join=merge • select=print, keep • quit=run • query= where/if or any merge • create table xx = data xx; 3/18/2003 NH/VT SAS User Group

left merge: build a data set with age, id and score with only people who exist in data set one. data set one data set two id score id age wt 1 88 1 25 120 2 85 2 28 130 3 80 3 22 140 4 29 150 3/18/2003 NH/VT SAS User Group

“SAS” method: proc sort data=one; by id; run; proc sort data=two; by id; run; data three; merge one (in=inone) two; by id; if inone; drop weight; run; SEE EXAMPLE 2 for merges SQL technique proc sql; create table three as select a.* , b.age from one a left join two b on a.id=b.id; quit; 3/18/2003 NH/VT SAS User Group

Issues: • “left join” = “merge by” + “in=“ • We specified where the variables came from in sql (if variable score existed in data two, you would not know where the variable came from and SAS wouldn’t warn you. • Created aliases “a” and “b” • Must have commas between variable names • use comma if match merging using where statement 3/18/2003 NH/VT SAS User Group

Printing versus creating data sets (example 1) • proc sql; • create table x as select * from y; • ** creates data set x **; • select var1 format=7., var2 label=‘average’ from y; • **prints to standard output**; • To prevent printing, limit # of observations printed or generated: • proc sql noprint; or.. • Proc sql outobs=10; 3/18/2003 NH/VT SAS User Group

More joins (example 2) Full join: build a data set with id, score and age where ID exists in both data sets. data set one data set two id score id age 1 88 1 25 2 85 2 28 3 80 3 22 5 55 4 29 3/18/2003 NH/VT SAS User Group

“SAS” method: proc sort data=one; by id; run; proc sort data=two; by id; run; data three; merge one (in=inone) two (in =intwo); by id; if inone and intwo; drop weight; run; SQL technique proc sql; create table three as select a.id, a.score , b.age from one a , two b where a.id=b.id order by b.weight; *or use inner join; create table three as select a.id, a.score from one a inner join two b on a.id=b.id; quit; 3/18/2003 NH/VT SAS User Group

Issues: • “where” = “merge by” + “inone and intwo” • Here I opted to specify the source of the variables. • “order by” sorts output data set (even using a variable not used) • quit; ends a SQL set of statements. 3/18/2003 NH/VT SAS User Group

Example 3: Full inner join: merging data to include all observations in both data sets Data set method: proc sort data=car_1995; by model; proc sort data=car_2000; by model; data cars; merge car_1995 car_2000; by model; **need rename; run; SQL technique proc sql; create table cars as select coalesce(a.model,b.model) as model, a.mileage as mileag95, b.mileage as mileage00 from car_1995 a full join car_2000 b on a.model=b.model 3/18/2003 NH/VT SAS User Group

Issues: • Data step method fails- will have to use multiple set statements? • Rename in sql: id as idbase • If you specify “id” SQL will fail because it is in both data sets. • DANGER: if each data set has 1000 observations, this produces 1000000 observations! 3/18/2003 NH/VT SAS User Group

Printing out only a few observations (EXAMPLE 4) Debugging: limit # of observations printed and use a where statement data set one data set two id score id age 1 88 1 25 2 85 2 28 3 80 3 22 5 55 4 29 3/18/2003 NH/VT SAS User Group

“SAS” method: proc print data=one (obs=5); var id score; where score>70; run; options obs=5; print print data=one; where score>70; run; options obs=max; SQL technique proc sql outobs=5; select * from one where score>70; quit; 3/18/2003 NH/VT SAS User Group

full outer join (example 5) Simulation (or perhaps data set explosion..). Create data set that contains all combinations of acid and base reaction data sets. Do not overwrite variable names! data set acid data set base id ph id ph 1 1.2 1 12.5 2 2.3 2 10.2 3 3.3 3 13.0 5 4.0 4 9.9 3/18/2003 NH/VT SAS User Group

“SAS” method: Data all; merge acid (rename=(id=idacid ph=phacid)) base (rename= (id=idbase ph=phbase); run; SQL technique proc sql; create table all as select a.id as idacid, a.ph as phacid, b.id as idbase, b.ph as phbase from acid a, base b; quit; Not all combinations! See SAS example 5 3/18/2003 NH/VT SAS User Group

Issues: • need to either keep model name from both data sets using different names or use coalesce. • If you do not specify which data set variable comes from, sql will fail (good thing) 3/18/2003 NH/VT SAS User Group

Example adding/selecting using summary variables: Add the mean of the overall ph to the data set as well as mean ph by group. Determine difference between ph and mean ph values (see SAS example 6) 3/18/2003 NH/VT SAS User Group

Other features: • select distinct…. Only unique values. Can be used like nodupkey or nodups • select distinct * from one; is = • proc sort noduplicates; • select distinct group, ph from acid is similar to • proc sort nodupkey; • by group ph; 3/18/2003 NH/VT SAS User Group

Summary functions • n, sum, range, mean, max, min , range • cv (coefficient of variation), prt (T distribution), nmiss are useful for trouble shooting or searching for outliers • select model, mileage from car_2000 group by country having cv(mileage)>50 • See SAS example 7 3/18/2003 NH/VT SAS User Group

Other features: • can use drop/keep on data statements: create table new as select a.*, b.weight from acid (keep=id ph) a, base b where a.id=b.id; • adding constants: select a.*, 1 as one from acid; • coalesce: take the first non-missing value • alter/case/insert: data step logic 3/18/2003 NH/VT SAS User Group

Other features: • global macro variable creation using insert into (SAS example 6) • Proc sql; • select distinct state into :state1-:state999 from people; • select count(*) into :nstates from (select distinct state from people) 3/18/2003 NH/VT SAS User Group

Other features: • Count(*): proc freq replacement (ex 14b) • select distinct dept, count(*) as nppl from employee group by dept having nppl>10 • Selecting ranges of variables • select a.* , b.id from employee (keep=id val1-val10) a , empfile b where a.empid=b.empid; 3/18/2003 NH/VT SAS User Group

Views • Views are not “real” tables, but instead instructions for creating a table • Create view myview (label=‘monthly sales report’) as select dept, sum(sales) as sales from monthly group by dept; • Data version: • data myview / view=myview; • set … • run; 3/18/2003 NH/VT SAS User Group

Selecting data IN: All cars made in USA or Japan select model from car_2000 having country in (‘USA’,’Japan’); All cars in 2000 from countries that produced cars in 1995 Select model from car_2000 having country in (select distinct country from car_1995) 3/18/2003 NH/VT SAS User Group

Subquery logic subqueries are any nested selection within another query select model, mileage from (select distinct mileage from car_2000); NO!! Subquery must contain variable(s) that outer query is expecting (and in the correct form) select model, mileage from (select mileage, model from car_2000 group by country having mileage>avg(mileage)); 3/18/2003 NH/VT SAS User Group

Creating new variables Nearly all functions are available (*,/,+, logic) select mileage, miles/(mileage-avg(mileage) as wasted_gas from car_2000; select mileage, model, mileage>25 as goodmpg; (created binary variable goodmpg) select country, model, “Mexico” as real_source from car_2000 where model contains “Olds”; (example 22) 3/18/2003 NH/VT SAS User Group

Selecting data • contains: analagous to index command. Searches for the STRING (not for numeric variables). Is case sensitive • All cars made by toyota • select model from car_2000 where upcase(model) contains “TOY”; • All or any on subqueries: • Select model, mileage from car_2000 where mileage> all (select mileage from car_1995 where country=‘USA’); 3/18/2003 NH/VT SAS User Group

Having vs where: select car from car_2000 having mileage>20;OK select car from car_2000 where mileage>20;OK select car, country, mileage from car_2000 group by country having avg(mileage)>25; OK select car, country, mileage from car_2000 group by country where avg(mileage)>25; NO Where is for individual values, having applies to groups 3/18/2003 NH/VT SAS User Group

Order matters! • select car, country, mileage from car_2000 having avg(mileage)>20 group by country; NO! • If using a summary function, must use group before summary. • Select car, mileage from car_2000 group by country having avg(mileage)>20; OK! • Normal ordering: • select, from, group, where/having , order 3/18/2003 NH/VT SAS User Group

Looking for observations NOT in another table create a data set that contains all values of id, ph from acid where id does NOT exist in base- good for evaluating missing data data set acid data set base id ph id ph 1 1.2 1 12.5 2 2.3 2 10.2 3 3.3 3 13.0 5 4.0 4 9.9 3/18/2003 NH/VT SAS User Group

“SAS” method: proc sort data=acid; by id; run; proc sort data=base; by id; run; data three; merge one (in=inone) two (in=intwo keep=id); by id; if inone and not intwo; run; ** time consuming if large data sets! **; SQL technique proc sql; create table three as select a.* from acid a having id not in (select distinct id from base); See example 21 for an alternative method 3/18/2003 NH/VT SAS User Group

Additional info • SAS guide to the SQL procedure • zillions of database books (amazon) • SAS short courses • 3rd party short courses (destiny, paragon) • Google! 3/18/2003 NH/VT SAS User Group

Recommendations for a neophyte • Start slowly • Limit use of subqueries • Get a book! • Play with sample code 3/18/2003 NH/VT SAS User Group

3/18/2003 NH/VT SAS User Group

Thanks!!!! 12/13/02 NH/VT SAS User Group

Programming tips and techniques for SAS SQL (no sqealing allowed)