1 / 75

Getting data

Getting data. The first step in data analysis. Learning Objective. What is a relational database? Contact your DBA for how to connect to your database(s)? How to write queries using PROC SQL. Using SAS/BASE® to connect to third-party relational data base software to extract data needed for

suchi
Download Presentation

Getting data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Getting data The first step in data analysis

  2. Learning Objective What is a relational database? Contact your DBA for how to connect to your database(s)? How to write queries using PROC SQL Using SAS/BASE® to connect to third-party relational data base software to extract data needed for • program evaluation • research using administrative data • operational reports e.g. routine surveillance SHRUG, 2014-05-02

  3. What is a relational database? • Set of tables • tables made up of rows and columns • Trade names of relational databases (RDB): • Oracle, Teradata, SQL Server, DB2, Access • RDB is software which is designed to retain large amounts of data • transactional DB • reporting/warehousing DB SHRUG, 2014-05-02

  4. What is a relational database? • Transactional DB designed to increase the speed for front-end users • complex table and table join structures • Warehousing DB designed for efficient storage and retrieval for reporting • simpler table designs and table join structures • Queries for either design use same syntax (code) • queries for warehouses will be simpler to write SHRUG, 2014-05-02

  5. What is a relational database? • Why use relational databases? • relational databases use a concept called “normalization” • Normalization reduces the amount of redundant data and allows for updates to data with less error • There are degrees of normalization • first degree • second degree • third degree and higher degrees SHRUG, 2014-05-02

  6. First degree normalization Table 1: Subjects with demographic information • First degree normalization • each row pertains to a single entity: a patient, an encounter, a physician • each column pertains to a characteristic of the entity: e.g. date of birth, sex, date of encounter, etc SHRUG, 2014-05-02

  7. Violation of first degree normalization Table 1: Subjects with improper 1NF • What impact does violating the first degree normalization have on your query • if you want all patients born in Canada? • if you want all male patients? SHRUG, 2014-05-02

  8. Second degree normalization Table 2: Business addresses • Table 2 has employer information about rows in Table 1 • The table above has some redundant information: • name is repeated from Table 1, province is embedded in the postal code • Better design – two or even 3 tables SHRUG, 2014-05-02

  9. Second degree normalization Table 2: Revised with 2NF Table 3: Creating a secondary table for 2NF SHRUG, 2014-05-02

  10. Second degree normalization • Table 2 now no longer contains name – it’s replaced with the subject ID • to get the subject’s name we link the table to the table in the first example, using SUBJID/ID column • we get the province and city by linking Table 2 and 3 using the POSTALCODE column • SUBJID is a primary key in Tables 1 and 2 • POSTALCODE is a foreign key in Table 2, but a primary key in Table 3 SHRUG, 2014-05-02

  11. Primary/Foreign Keys • primary key – a column or combination of columns that uniquely identify each row in the table • e.g. patient medical record needs at least 3 columns to identify a unique record: patient ID, date of encounter, and provider ID • foreign key – a column or combination of columns that is used to link data between two tables SHRUG, 2014-05-02

  12. Questions about 2NF? • Can you see the advantage of splitting the data into different tables? • share examples of your data where normalization is used • higher degrees of normalization work similarly to the examples above • you have to go through more tables for higher levels of normalization in order to link to the data that you need SHRUG, 2014-05-02

  13. Getting access to data: What do you need from DBA? • Explain to DBA that you need to query data, but have no need to write to the database • this helps them to determine where you belong on a user matrix • DBA or IT install necessary software on your machine • Google has lots of information on SAS Connect • SAS Connect documentation SHRUG, 2014-05-02

  14. How SAS authenticates User name is provided by DBA/IT In this example the password is held in the macro DBPASS Statement to have Oracle print any messages to the SAS log proc sql; connect to oracle (user = <userid> password="&dbpass” path = prod ); %put &sqlxmsg; This is an example of “pass-through” code SHRUG, 2014-05-02

  15. Using a LIBNAME to connect • Recall that slide 13 showed pass-through facility in SAS • most of the query is done on the database • Can use libname statement to connect instead of pass-through • advantage to this method is that you are programming in SAS (using SAS functions and formats) • SAS determines which program (SAS or RDB) will handle statements more efficiently SHRUG, 2014-05-02

  16. Using a LIBNAME to connect 1. 2. 3. • The name of the library • Tells SAS that you are using an ODBC engine • DSN – use the name of the database that was used to set up the odbc connection • NOTE: schema statement is not always required Example using a libname statement: libnameoncoodbcdsn='Oncolog' schema=dbo; SHRUG, 2014-05-02

  17. Seeing your data - Views Once view is created, you use the EXPLORER tab in SAS and use as normal dataset SHRUG, 2014-05-02

  18. Seeing your data - Views Using the “view columns” in SAS EXPLORER SHRUG, 2014-05-02

  19. Seeing your data - Views • Double click on table to get to see the data • NOTE: columns that identify personal information have been removed from this screen shot SHRUG, 2014-05-02

  20. Other ways to view data • You may have software from the RDB: • TOAD (for Oracle) • SQL Developer (for Oracle) • SQL Server • Teradata • All vendors may have some limited function “development” software that allows: • Viewing data • Viewing the “type” of a column: char, num, date, etc. • Writing SQL queries SHRUG, 2014-05-02

  21. Sample view from SQL Developer SHRUG, 2014-05-02

  22. Syntax: Single table - 1 of 2 PROC SQL DATA STEP proc sql; create <table/view> <name> as select <var1> , <var2> , etc from <table/view> where <apply data filters> quit; data <dataset name>; set <dataset> ( keep= <list of variables> where=(<apply filters>)); run; Example: Create a dataset (table) with men aged 50 to 74. Assume the source table is called “demographics” and contains variables: subjectID, age and sex SHRUG, 2014-05-02

  23. Syntax: Single table – 2 of 2 PROC SQL DATA STEP proc sql; create table men5074 as select subjectID , age from work.demographics where sex=‘M’ and age between 50 and 74 ; quit; data men5074 (drop=sex); set work.demographics (keep=subjectid sex age where=(sex='M' and 50<=age<=74)); run; SHRUG, 2014-05-02

  24. Syntax: Multiple tables Request for report received Realize as you look through data elements needed for complete the request that relevant columns reside in two or more tables Background Information • Need to know which tables and which columns are relevant. Useful to have data dictionary, otherwise 3rd party software very helpful • Need to know what filters to apply: sex, time period of interest, diagnosis codes, etc are all commonly applied filters SHRUG, 2014-05-02

  25. Map: Multiple tables • Create a map to guide your query • names of tables that go in ‘FROM’ statement of SQL or ‘SET’ statement in DATA step • names of columns that you need • use meaningful arrows to connect SHRUG, 2014-05-02

  26. Syntax: Multiple table DATA step – 1 of 2 *** sort the first dataset; proc sort data=<dataset1>; by <var(s)>; run; *** sort the second dataset; proc sort data=<dataset2>; by <var(s)>; *** same var(s) as first sort; run; SHRUG, 2014-05-02

  27. Syntax: Multiple table DATA step - 2 of 2 *** find records common to both tables; data <result dataset>; merge <dataset1> (in=in_a) <dataset2> (in=in_b); by <var(s)>; *** we only want a list of records with data in table a AND in table b; if in_a and in_b; run; SHRUG, 2014-05-02

  28. Syntax: Multiple table PROC SQL using temporary tables proc sql feedback; create table <result table name> as select <columns from either or both tables below> from *** temporary table from table1; (select <column(s)> from <table 1> where <apply filter(s)>) a inner join *** temporary table from table2; (select <column(s)> from <table 2> where <apply filter(s)>) b on a.<pk>=b.<pk> ; quit; SHRUG, 2014-05-02

  29. Syntax: Multiple table PROC SQL. “Oracle-style (PL/SQL)” proc sql feedback; create table <result table> as select <columns from one or more tables> from <table1> a , <table2> b where a.pk=b.pk <apply additional filters> ; quit; SHRUG, 2014-05-02

  30. Types of joins RIGHT join – join Table A to B only if an observation exists in Table B SHRUG, 2014-05-02

  31. Compare syntax PROC SQL DATA STEP • no need to sort temporary tables • needed to think about type of join • in this case wanted patients only if they were in both tables • join columns need to be same type but can have different names (slides 6 and 9) • needed to sort data by subjectID • key variable to join demographic to cancersite table • “by” variables need to have same name and type What would you do if you found out that one record in table 1 matched to multiple records in table 2? SHRUG, 2014-05-02

  32. Issues arising from multiple table queries • Table relationships are important: • one-to-one: each record in first table has a maximum of one record in the second table (through primary key) • one-to-many: each record in one table may have multiple rows in second table. Example: • Table 1 contains all patients with a flag indicating whether or not they are “active” • Table 2 contains all GP appointments for each patient • many-to-many SHRUG, 2014-05-02

  33. Examples! Task 1 - single table Task 2 – two tables Task 3 – multiple tables Task 4 – reusing a table multiple times SHRUG, 2014-05-02

  34. Task 1a - Participants Approach 1 Create view of the table required and use SAS to analyze data You are asked to provide a count of the female participants in a cancer screening program who are aged 50 years as of May 31, 2013. Break down the birth dates by month SHRUG, 2014-05-02

  35. Task 1a – background information • demographic data for participants is stored in table “PARTICIPANTS” • sex_cd is a coded variable: 222=F, 223=M, 240=U • birth_dt is the column containing birth dates • although birth_dt appears as a date type column, in SAS Oracle dates are datetime types in SAS • For a participant to be considered 50 years of age on May 31, 2013, their birthday must occur between June 1, 1962 and May 31, 1963 SHRUG, 2014-05-02

  36. Task 1a – Create view using pass-through code proc sql feedback noprint; connect to oracle as myconn (user=&userid password=&pw path=&path); create view participant as select * from connection to myconn (select * from csprod.participant where sex_cd=222 and trunc(birth_dt) between to_date('19620601','YYYYMMDD‘) and to_date('19630501','YYYYMMDD‘) and del_dt is null ); disconnect from myconn; quit; SHRUG, 2014-05-02

  37. Task 1a – Create view: Parsing the code. Slide 1 of 3 • Create view participant • this syntax translates to “Create a view called ‘participant’ • select * • ‘*’ is a wildcard and means select all • where • “%”, “_” – see Task 1b • multiple (%) or single (_) byte of data, in contrast to the entire column. Only used to scan a column. SHRUG, 2014-05-02

  38. Task 1a – Create view: Parsing the code. Slide 2 of 3 • trunc() • recall that SAS will treat Oracle, Teradata, SQL dates as DATETIME • trunc() is an Oracle function that looks only at the DATE part of the column SHRUG, 2014-05-02

  39. Task 1a – Create view: Parsing the code. Slide 3 of 3 • to_date(‘<yourdate>’, ‘<yourdate format>’) • in this example to_date(‘19620601’, ‘YYYYMMDD’) • take the string 19620601 and treat it as a date with the format YYYYMMDD • could use other formats: YYYYMONDD, MM-DD-YYYY, etc • BETWEEN operator – works as you would expect, includes both lower and upper limits specified SHRUG, 2014-05-02

  40. Task 1a – Contents of view NOTE: The contents show exactly the same columns as slide 31 SHRUG, 2014-05-02

  41. Task 1a – Count of birth month • More detailed analysis uncovered that missing month/day combinations were defaulted to January 1 SHRUG, 2014-05-02

  42. Task 1b - Like operator (1 of 4) REQUEST Find a list of CCI codes for hysterectomy use of single table example of PROC SQL with SAS data filter using “like” operator SHRUG, 2014-05-02

  43. Task 1b – Like operator (2 of 4) SHRUG, 2014-05-02

  44. Task 1b – Like operator (3 of 4) proc sql feedback; create table hystcd as select a.f1 as cci_code , a.f3 as long_desc , substr(a.f1,1,5) as rubric from work.cci_raw as a where upcase(a.f3) like '%HYSTERECTOMY%' or (upcase(a.f3) like '%EXCISION%' and upcase(a.f3) like '%UTERUS%') or (upcase(a.f3) like '%EXCISION%' and upcase(a.f3) like '%CERVIX%') order by a.f1 ; quit; SHRUG, 2014-05-02

  45. Task 1b – Like operator (4 of 4) RESULTING DATASET SHRUG, 2014-05-02

  46. Task 2 – Participants in the CRC program • Report the number of men and women who turned 60 as of May 31, 2013, enrolled in the colorectal cancer screening program. Do not include participants with unknown sex • participant table contains demographic information: sex, birth date • participant_program table contains data for participants and screening program • program_id=1 indicates colorectal cancer screening program • program_status_cd=263 indicates that a participant is active in the program SHRUG, 2014-05-02

  47. Task 2 – Mapping your query SHRUG, 2014-05-02

  48. Translating your map into sql code - 1 of 3 *** METHOD 2 - Oracle pass through. Simple code; proc sql feedback noprint; connect to oracle as myconn (user=&userid password=&pw path=&path); create table part60 as select * from connection to myconn (select ptc.gender , count(*) from (select participant_id , sex_cd SHRUG, 2014-05-02

  49. Translating your map into sql code – 2 of 3 *** METHOD 2 - Oracle pass through. Simple code; , case when sex_cd=222 then 'F' else 'M' end as gender from csprod.participant where trunc(birth_dt) between to_date('19520601','YYYYMMDD') and to_date('19530531','YYYYMMDD') and sex_cd <> 240 and del_dt is null) ptc SHRUG, 2014-05-02

  50. Translating your map into sql code – 3 of 3 inner join (select participant_id from csprod.participant_program where program_id=1 and program_status_cd=263 and del_dt is null) pp on ptc.participant_id=pp.participant_id group by ptc.gender ; disconnect from myconn; quit; SHRUG, 2014-05-02

More Related