Efficient Data Restructuring Techniques for Relational Models

HRP 223 - 2008 Topic 6 – Relational Data

HW 2 • SORRY! I apologize for not getting it posted before yesterday! It is due in two weeks. • The new datasets have less variables and one variable renamed. You want to change newID to have the same name as the old subject ID. You can put a rename command on the line that does the import: proc import out = wide (rename = (dude = subjectID)) datafile = "C:\Projects\classes\HRP223-2008\day6\wideDx.xls" replace; mixed = yes; sheet = "Sheet1"; run;

Flat Files • Some people try to store all their data in a single file. This causes lots of extra work because of holes in the tables and repeated information. • Both problems can be fixed by a relational model. • Split the data into many tables. • You need to use SQL to work with data split across multiple tables.

Not Normalized I frequently get data, from people who are not professional programmers, where the diagnosis data is organized “wide” across the page. Where the first diagnosis is in the first column, the second is in the second, etc. and the task is to find or fix a diagnosis.

Subsetting Based on 5 Variables

SQL vs. Datastep The GUI generates this code: Or you could write a little data step program:

Change All 9s to 999s? It is a lot of clicking.

Code The SQL is a bit complicated

As Data Step If it is more than 5 columns, things get unruly. Imagine doing this across 20 possible diagnoses. There is an easy solution in data step code. First, the SQL code can be done easily in a data step.

A List As you can see, there is a list of variables and you are doing the same things over and over. You want to make a list called dx and have the 1st element refer to dx1, the 2nd thing refer to dx2, etc. The concept of a named list of variables or an alias to a bunch of variables is instantiated as an array.

Arrays A major improvement….. Ummmm. You want to process the same one line over and over. You need to count from 1 to 5…. Sounds like a loop.

Change Lots of Things If you have an array, you can process wide files easily.

Restructuring with Arrays You can use similar code to restructure data so that you have only a couple of columns of data. Add a new column that is called dxNum and another called theDX. Those two columns plus the subject ID number can contain the same information without all the “holes”.

How does that work? • Go through all five variables, one at a time. • If the variable is not missing, you need to do three things: • Copy the diagnosis counter number into the dxNum variable. • Copy the diagnosis code number into the variable called theDx. • Write to the new data set.

Repeated Ifs This is a lot of typing and it obscures the fact that you are doing three things if a condition is true:

do end You need both do and end You have seen do statements in the context where you do stuff over and over. There is also a do end command for when you need to do a block of instructions if a condition is true.

Actual Code

Normalization Part 2 • I got data where I needed to analyze age for people who have a particular diagnosis. The data was a not-normalized mess:

You will end up with the same code repeated as many times as you have repetitions. Normalization Part 2 The Wrong Way • If your database is like this, you need code like this: data bad2; set bad; if (dob1 ne . and not missing(dx1)) thendo; if code1= 22then IsCase1=1; else Iscase1=0; end; if (dob2 ne . and not missing(dx2)) thendo; if code2=22then IsCase2=1; else Iscase2=0; end; if (dob3 ne . and not missing(dx3)) thendo; if code3=22then IsCase3=1; else Iscase3=0; end; if (dob4 ne . and not missing(dx4)) thendo; if code4=22then IsCase4=1; else Iscase4=0; end; if (dob5 ne . and not missing(dx5)) thendo; if code5=22then IsCase5=1; else Iscase5=0; end; run;

Normalization Part 2 The Right Way • Instead, you should have a record in a table corresponding to each repetition. • With code like this: data good2; set good; if code= 22then isCase1=1; else isCase1=0; run;

But you end up with just as many blocks of code. • Your first attempt could go something like this: data normal1 (keep = sid mid dob dx code); set bad; format dob dx mmddyy8.; if (dob1 ne . and dx1 ne . and code1 ne .) thendo; mid = 1; dob = dob1; dx = dx1; code = code1; output; end; if (dob2 ne . and dx2 ne . and code2 ne .) thendo; mid = 2; dob=dob2; dx=dx2; code=code2; output; end; if (dob3 ne . and dx3 ne . and code3 ne .) thendo; mid=3; dob=dob3; dx=dx3; code=code3; output; end; if (dob4 ne . and dx4 ne . and code4 ne .) thendo; mid=4; dob=dob4; dx=dx4; code=code4; output; end; if (dob5 ne . and dx5 ne . and code5 ne .) thendo; mid=5; dob=dob5; dx=dx5; code=code5; output; end; run;

Setting up Aliases (Arrays) • What you want is a way to repeat this code over the five sets of variables: if (dob1 ne . and dx1 ne . and code1 ne .) thendo; mid = 1; dob = dob1; dx = dx1; code = code1; output; end; • You need: • A dob alias (dob_a) to refer to dob1, dob2, dob3, dob4 and dob5 • A dx alias (dx_a) to refer to dx1, dx2, dx3, dx4 and dx5 • A code alias (code_a) to refer to code1, code2, code3, code4 and code5

This sets up the arrays but they are not used in this program. Setting up Aliases (Arrays) data normal2a; set bad; array dob_a dob1-dob5; array dx_a dx1-dx5; array code_a code1-code5; if (dob1 ne . and dx1 ne . and code1 ne .) thendo; mid = 1; dob = dob1; dx = dx1; code = code1; output; end; run;

Setting up Aliases (Arrays) data normal2a; set bad; array dob_a dob1-dob5; array dx_a dx1-dx5; array code_a code1-code5; if (dob_a[1] ne . and dx_a[1] ne . and code_a[1] ne .) thendo; mid = 1; dob = dob_a[1]; dx = dx_a[1]; code = code_a[1]; output; end; run;

Setting up Aliases (Arrays) data normal2c (keep = sid mid dob dx code); set bad; array dob_a dob1-dob5; array dx_a dx1-dx5; array code_a code1-code5; do c = 1to5by1; if (dob_a[c] ne . and dx_a[c] ne . and code_a[c] ne .) thendo; mid = c; dob = dob_a[c]; dx = dx_a[c]; code = code_a[c]; output; end; end; run;

Arrays • You can tell SAS that a set of variables are related by putting them into an array statement. • Arrays in SAS are not like arrays in other languages like BASIC or C. SAS arrays are only aliases to an existing set of variables. They are created using the array statement: My notation for arrays array times_a [365] time1-time365; An optional size of the array What the array refers to

Arrays(2) • If your array references variables that do not exist, they will be created. Make sure to use the $ if you intend to create character variables. • If you want to reference all numeric variables between theValue and thingy2, do it like this: array x theValue -- thingy2 _numeric_; --means all values between and including the starting and ending variables - indicates the numeric sequence starting with the first variable and ending with the second

SQL and Colors You may have noticed that the guys who made the enhanced editor don’t know SQL commands because some of the key words were not colorized. There are lots of them, but they can be easily fixed.

Fix Color Go to Tools > Options ….> SAS Programs and then click Editor Options… then User Defined Keywords

Missing Words • Add • calculated • coalesce • corresponding • except • full • group • inner • intersect • join • left • on • or • order • outer • right • union

Minimal SQL Put a comma-delimited list of variables here. Specify a library.table here. Print a report showing the contents of variables from a single data set.

What variables? Use an * to indicate that you want all variables instead of typing them all. There is no syntax to specify variables based on position in the source files. That is, you can not specify that you want to select the 2nd and 7th variables (from left to right) or to select the first 3 variables.

Likely Tweaks You can rename a variable in the list with an as statement. You can also specify variable formats and labels.

More Tweaks Print the column called dude from the table reportedCancers which is in the ovCancer library. Here the c. is optional because dude is only in one table (the query only uses one table). The from line references tables which are in libraries. Complex queries require you to reference the table name over and over again. Instead of having to type the long library and dataset names repeatedly, you can refer to the files as an alias.

Stacking You already know how to use proc append or the Data > Append Table menu item to combine two sets of data on top of one another. How do you “copy/paste” to insert columns from one table into another?

The GUI can do easy SQL. You could write data step or proc sql code. Happily, most of the merges you need are in the graphical user interface.

How are tables linked? You need to tell it who is matched with whom in the tables. If you have a demographics table and a disease table, you need to specify which column says which disease belongs to which person. In this case you would say match on the subject ID numbers in the two tables using a key column.

Inner Join • If you want records where there is a match in both tables, you want an inner join (aka, equijoin or natural join). • For example, which subjects have demographic and cancer information?

Alternate Syntax This is what I write.

All Information from the Left Table If you want all the demographics, as well as the cancers if they occur:

Left Join Code

All Information from the Right Table

If you wanted the cancers info plus demographics where there were any:

Right Join Code

Full Join It would be nice if you could combine the two dude variables so the first not –missing value was used. If you wanted all information:

Full Join Code

Coalesce Coalesce says take the first not-missing value from the set of variables.

Checking for ID Numbers with SQL • A task that I need to do frequently is to build a list of all subject IDs when data is coming from multiple sources. • List IDs with duplicates. • List unique ID numbers. • List who is in both files. • List who is in one file but not the second. • Make a summary showing all IDs and an indicator for who appears where.

PROC SQL - Set OperatorsNO GUI • Outer Union Corresponding • concatenates • Unions • unique rows from both queries • Except • rows that are part of first query • Intersect • rows common to both queries

Efficient Data Restructuring Techniques for Relational Models

Efficient Data Restructuring Techniques for Relational Models

Presentation Transcript

Philosophy 223

Philosophy 223

HRP 223 - 2008

HRP 223 - 2008

HRP 223 - 2008

Philosophy 223

Philosophy 223

Philosophy 223

HRP 223 – 2008

HRP 223 - 2008

HRP

HRP 223 - 2008

Physics 223

CH 223

Philosophy 223

Physics 223

HRP 223 - 2008

223 Ammo

Урок 223