1 / 42

Motivation

Dynamic approaches for conforming data from multiple EXCEL spreadsheets Larry Worley, Ph.D. Texas Education Agency Prepared for 2010 SCSUG Conference Austin, Tx. Motivation. Data for Program Evaluations 40 to 60 evaluations per year May have scores of campuses where data is collected

dimaia
Download Presentation

Motivation

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Dynamic approachesfor conforming datafrom multiple EXCELspreadsheets Larry Worley, Ph.D.Texas Education AgencyPrepared for2010 SCSUG ConferenceAustin, Tx

  2. Motivation Data for Program Evaluations 40 to 60 evaluations per year May have scores of campuses where data is collected Data collection is treated as an Ad Hoc process. Spreadsheets are used for gathering data Some programmatically filled by local IT systems Others completed by hand Design Creep in the spreadsheets Data delivery is done by Ad Hoc processes.

  3. Motivation So in a nutshell, the files to be processed may be in a messy state. In past, cleansing has been performed on individual worksheets. This works fine for projects with 1 or 2 files but can break down very quickly when scores of spreadsheets are received for a particular project. So we need something scalable for a large number of spread sheets.

  4. Outline Obtain dynamic list of received files – illustrates using filename piping. Obtain spreadsheet metadata on sheets and ranges and on spreadsheet columns – Uses Excellibname statements Conform column names and verify correct sheet or range is read from each file. Read appropriate spreadsheet sheets and ranges – dynamic code(using macro) and Excel Libname Fix column type mismatches – dictionary views and dynamic code (call executes) Set individual data sets together to provide single file for project -- dictionary views and dynamic code (call executes)

  5. 1. Build List of Files – File name statement %let path = C:\Documents and Settings\lworley\Desktop\Present\Input ; filename fldrlst pipe "dir ""&path."" /b /a:-D" ; /b – use ‘bare’ format – only file name /a:-D – a display files with specifed attributes ‘-’ without instead of with ‘D’  subdirectory files Use ‘dir /?’ to get a list of all options for DOS dir command.

  6. 1. Build List of Files – Data Step data file_names ; length file_type $ 9 full_path_name $ 256 ; infilefldrlst length = reclen ; input file_name_full $varying256. reclen ; file_type = scan(file_name_full,2,'.') ; file_number + 1 ; full_path_name = "&path.\"||trim(file_name_full) ; run ;

  7. 1. Build List of Files – Data Set File_names.

  8. 1. Build List of Files – Checking code procsql ; create table type_counts as select file_type, count(*) as num_occurences from file_names group by file_type; quit ; This is an useful check to run to ensure you do not have some bad file types.

  9. 2. Get Spreadsheet Metadata – Data Step OverviewStep i data _null_ ; set file_names; lib_path = "%trim(&path)\"||trim(file_name_full) ; 1. call execute( <setup excel libname>) ; 2. call execute( <proc contents to get metadata>) ; 3. call execute( < process metadata data set>) ; 4. call execute( 'Libname XLS_LIB clear ;' ) ; 5. call execute( <proc append>) ; run ;

  10. 2. Get Spreadsheet Metadata – setup excel libnameStep i -- Call Execute #1 call execute( 'Libname XLS_LIB excel “' ||trim(lib_path) ||‘” scan_textsize = yes' ||' stringdates=yes' ||' mixed=yes ;' ) ;

  11. 2. Get Spreadsheet Metadata – proc contentsStep i -- Call Execute #2 call execute( 'proc contents' ||' out=work.xls_col_def‘ ||put(file_number,z4.) ||' data = XLS_LIB._all_ ;' ||' run ; ' ) ;

  12. 2. Get Spreadsheet Metadata – data stepStep i -- Call Execute #3 call execute( ' data work.xls_col_def'||put(file_number,z4.) ||' ; set work.xls_col_def'||put(file_number,z4.) ||' ; by memnamenotsorted ; ' ||' if first.memname then member_counter+1 ;' ||' file_number = '||put(file_number,best4.) ||' ; run ; ' ) ; Adds 1. Excel file counter – file_number And 2. Sheet/Range counter – member_number

  13. 2. Get Spreadsheet Metadata – proc appendStep i -- Call Execute #5 call execute( 'proc append out=work.file_columns_in' ||' data=work.xls_col_def‘ ||put(file_number,z4.) ||' ; run ;' ) ;

  14. 2. Get Spreadsheet Metadata – Expanded CodeIteration 1 -- Call Execute #1 Libname XLS_LIB excel "C:\Documents and Settings\lworley\Desktop\Present\Input\MIC Student Roster Upload Fall 09 Abilene ISD.xls" scan_textsize= yes stringdates=yes mixed=yes ;

  15. 2. Get Spreadsheet Metadata – Expanded CodeIteration 1 -- Call Execute #2 data work.xls_col_def0001 ; set work.xls_col_def0001 ; by memnamenotsorted ; if first.memnamethen member_counter+1 ; file_number= 1 ; run ;

  16. 3. Conform Column Names – Data Step OverviewStep i proc sort data=file_columns_in (where = ( substr(name,1,1) ne 'F' or anyalpha(substr(name,2)) > 0 )) out=file_columns; by file_numbermember_countervarnum; run ; This step deletes columns with no headers. Unless ranges are defined for only the relevant data, you can get a large number of empty columns with names like ‘F25’, ‘F26’, etc.

  17. 3. Conform Column Names – Data Step OverviewStep iia data sheet_meta (keep = memnamefile_numbermember_counternum_varsselect_sheet) col_meta; set file_columns (keep = memname name type length varnum label format formatl formatd nobs member_counter file_number) ; by file_numbermember_counter ; length rename_val $ 32 ;

  18. 3. Conform Column Names – Data Step OverviewStep iib if last.member_counter then do ; * depending on your standards you might want to modify this ; if member_counter = 1 then select_sheet = 1 ; else select_sheet= 0 ; num_vars= varnum ; output sheet_meta ; end ; Note: Sheet_meta dataset could be edited to switch which sheet/range is read in later. Only those with select_sheet= 1 will be included Also num_varscan be used for diagnostic purposes.

  19. 3. Conform Column Names – Data Step OverviewStep iic rename_val = name ; if substr(rename_val,1,1) = "_" then rename_val = substr(rename_val,2) ; if substr(rename_val,length(rename_val),1) = "_" then rename_val = substr(rename_val,1,length(rename_val) -1) ; rename_val = propcase(tranwrd(rename_val,"__","_" )) ; rename_val= tranwrd(rename_val,"_"," " ) ; rename_val = tranwrd(trim(rename_val)," ","_" ) ; output col_meta ; run ; Code eliminates extra underscores from the SAS-generated name.

  20. 3. Conform Column Names – Data Step OverviewStep iii procsql; create table col_counts_in as select rename_val ,count(*) as ncount label = "number of occurences of column name" from work.col_meta where member_counter = 1 group by rename_val order by rename_val; quit ; This a check to determine all necessary column renaming is being done.

  21. 3. Conform Column Names – Checking Column Names Step iii Course_id value was changed to Course_code, along with other mis-matched column names. The edited data set is called in the next step.

  22. 4. Read the data in – Macro codeOverview For each input sheet or range, two data sets will be created in the work directory. WORK.FILE0xxx – this is the file as directly created by reading member from Excel libname with some column renaming, but no rows are deleted. WORK.STRIPPED0xxx – reduces the rows present by eliminating those without appropriate keys.

  23. 4. Read the data in – Macro codePart i %macro iter() ; procsql ; select left(put(count(*),best.)) ,put(min(file_number),z4.) ,put(max(file_number),z4.) into :number_of_spreadsheets, :file_min_index ,:file_max_index from work.file_names ; quit ;

  24. 4. Read the data in – Macro codePart ii %do i_file = &file_min_index %to &file_max_index; procsql ; select quote(trim(full_path_name)) into :lib_path from work.file_names where file_number = &i_file ;

  25. 4. Read the data in – Macro codePart iii create table temp_controlas select sheet_meta.file_number,sheet_meta.memname ,col_meta.name ,col_meta.rename_val ,col_meta.label from work.sheet_meta inner join control.'col_meta$'ncol_meta on sheet_meta.file_number = col_meta.file_number and sheet_meta.member_counter = col_meta.member_counter where sheet_meta.file_number = &i_file and sheet_meta.select_sheet = 1 ;

  26. 4. Read the data in – Macro codePart iv select distinct quote(trim(memname))||'n' into :member from temp_control; select trim(name) || ' = ' ||trim(rename_val) into :rename_stringseparated by ' ' from temp_control; select trim(name) into :keep_listseparated by ' ' from temp_control; quit ;

  27. 4. Read the data in – Macro codePart v Libname XLS_LIB excel &lib_pathscan_textsize = yes stringdates=yes mixed=yes ; data file%sysfunc(putn(&i_file,z4.)) ; set xls_lib.&member ( rename = ( &rename_string ) keep = &keep_list) ; run ;

  28. 4. Read the data in – Macro codePart vi data stripped%sysfunc(putn(&i_file,z4.)) ; set file%sysfunc(putn(&i_file,z4.)) ; if student_first_name ne '' and student_last_name ne '' ; file_number = &i_file ; run ; %end ; %mend ; %iter() ;

  29. 5. Fix column type mismatches Query dictionary view to see if we have any. procsql ; create table cdc_2fix as select memname from sashelp.vcolumn where libname = 'WORK' and memname like 'STRIPPED0%' and UPCASE(name) = 'CDC_NUMBER' and type = 'num' ; quit ;

  30. 5. Fix column type mismatches Data Step Call Execute Code. data _null_ ; set cdc_2fix ; number = substr(memname,5) ; call execute ( 'data '||trim(memname) ||' ; set '||trim(memname) ||' (rename = ( CDC_Number = cdc_file)) ; ' ||' drop cdc_file ; ' ||' length Cdc_Number $ 9 ; ' ||' CDC_Number = put(cdc_file,z9.) ;' ||' run ; ' ) ; run ;

  31. 5. Fix column type mismatches Expanded Call Execute Code. data STRIPPED0002 ; set STRIPPED0002 (rename = ( CDC_Number = cdc_file)) ; drop cdc_file ; length Cdc_Number$ 9 ; CDC_Number = put(cdc_file,z9.) ; run ;

  32. 6. Stack the data togetherBuild Control Table -- a procsql ; create table ds_to_combine as select memname from sashelp.vtable where libname = 'WORK' and memname like 'STRIPPED0% ' ; create table column_positions_step1 as select name ,varnum,count(*) as position_count from sashelp.vcolumn where libname = 'WORK' and memname like 'FILE0% ' group by name , varnum;

  33. 6. Stack the data togetherBuild Control Table -- b create table column_positions as select varnum, name from column_positions_step1 group by name having position_count = max(position_count) ; create table column_lengths as select name, type ,max(length) as length from sashelp.vcolumn where libname = 'WORK' and memname like 'STRIPPED0% ' group by name, type ;

  34. 6. Stack the data togetherBuild Control Table -- c create table column_defs as select column_lengths.name ,column_lengths.type, column_lengths.length ,coalesce(column_positions.varnum,999) as varnum from column_lengths left join column_positions on upcase(column_lengths.name) = upcase(column_positions.name) order by varnum ; quit ;

  35. 6. Stack the data togetherControl Table Explanations Table : column_positions Table : column_positions_step1

  36. 6. Stack the data togetherControl Table Explanations Table : column_def

  37. 6. Stack the data togetherData Step Call Execute - a data _null_ ; if _n_ eq 1 then call execute ('data work.combined1 ; ') ; retain start_data_sets 1 ; if not all_columns then do ; set column_defs end = all_columns; if type = 'char' then length_format = '$' ; else length_format = ' ' ; call execute ( 'length '||trim(name) ||' '||trim(length_format)||put(length,3.) ||' ;' ); end ;

  38. 6. Stack the data togetherData Step Call Execute - b else if not all_data_sets then do ; if start_data_sets then do ; call execute ('set ') ; start_data_sets = 0 ; end ; set ds_to_combine end = all_data_sets ; call execute ( trim(memname) ); end ; else call execute (' ; run; ') ; run ;

  39. 6. Stack the data togetherExpanded Call Execute - c NOTE: CALL EXECUTE generated line. 1 + data work.combined1 ; + length Cdc_number $ 9 ; … 20 + length file_number 8 ; 21 + set 22 + STRIPPED0001 … 57 + STRIPPED0036 58 + ; run;

  40. Conclusion We first showed how to read a list of files from a given directory. This list will be used to tell subsequent programs what Excel files are available in a given directory. We illustrated using the ‘filename pipe’ command. We then showed a process for obtaining information about the sheets, ranges and column headers in the Excel files by using SAS Excel libname statements and proc contents. This required 5 steps for each Excel file. We used “Call execute” commands to implement these steps as we iterated across the Excel files.

  41. Conclusion Using the output from step 2 as input, this step used standard data step and procsql code to generate metadata tables for the Excel sheet/ranges and columns. These two tables are modified to control the mapping of column names and which sheets/ranges are read. This step used the control tables created in step 3 and macro code to create two data sets in the work directory for each input sheet/range. One data set had all original records but did have conformed variable names. The second data set had eliminated rows without valid keys and did have the conformed variable names.

  42. Conclusion This step fixed column type mismatches. It used the sashelp.vcolumn dictionary view to determine what input datasets had numerical columns where strings where expected. It then executed a data step with call executes which re-cast that column as character. This step used dictionary views to determine the maximum length and the appropriate position for the variables. From the resulting data set, a data step used call executes to put the data together Author Contact Info: larry.worley@ymail.com

More Related