Topics in data management
Download
1 / 16

Topics in Data Management - PowerPoint PPT Presentation


  • 259 Views
  • Updated On :

Topics in Data Management. SAS Data Step. Combining Data Sets I - SET Statement. Data available on common variables from different sources. Multiple datasets with common variable names, possibly different sampling/experimental units Exam scores from students in various sections of STA 2023

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Topics in Data Management' - Sophia


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

Combining data sets i set statement l.jpg
Combining Data Sets I- SET Statement

  • Data available on common variables from different sources. Multiple datasets with common variable names, possibly different sampling/experimental units

    • Exam scores from students in various sections of STA 2023

    • County level data from different state databases

    • Flight departure/arrival data from different months


Combining data sets i set statement3 l.jpg
Combining Data Sets I- SET Statement

options nodate nonumber ps=54 ls=80;

data one;

input student $ 1-8 idnum 9-12 exam1 14-16 exam2 18-20 exam3 22-24;

section=1;

cards;

Amy 1456 98 78 84

Zed 2234 68 84 75

;

run;

data five;

input student $ 1-8 idnum 9-12 exam1 14-16 exam2 18-20 exam3 22-24;

section=5;

cards;

Alex 3410 74 68 .

Zach 4561 92 74 88

;

run;

data all;

set one five;

run;

proc print;

run;

quit;


Combining data sets i set statement4 l.jpg
Combining Data Sets I- SET Statement

The SAS System

Obs student idnum exam1 exam2 exam3 section

1 Amy 1456 98 78 84 1

2 Zed 2234 68 84 75 1

3 Alex 3410 74 68 . 5

4 Zach 4561 92 74 88 5


Combining data sets ii merge statement l.jpg
Combining Data Sets II - MERGE Statement

  • Data on common sampling/experimental units, different variables/characteristics measured in different datasets.

    • County data from different government sources

    • Store sales data updated over time


Combining data sets ii merge statement6 l.jpg
Combining Data Sets II - MERGE Statement

options nodate nonumber ps=54 ls=80;

data s2003;

input store $ 1-8 sales03 10-14;

cards;

Atlanta 1459

Zurich 1383

;

run;

data s2004;

input store $ 1-8 sales04 10-14;

cards;

Atlanta 1459

Zurich 1383

;

run;

proc sort data=s2003; by store;

proc sort data=s2004; by store;

data s0304;

merge s2003 s2004;

by store;

run;

proc print;

run;

quit;

The SAS System

Obs store sales03 sales04

1 Atlanta 1459 1459

2 Zurich 1383 1383


Creating new variables from existing ones l.jpg
Creating New Variables From Existing Ones

  • Creating Final Grade for Students (Exams 1 and 2 Each Count 30%, Exam 3 40%)

    • Total = (0.3*Exam1)+(0.3*Exam2)+(0.4*Exam3)

  • Obtaining Sales Growth (%) for stores

    • Grow0403=100*(sales04-sales03)/sales03


Grades example l.jpg
Grades Example

data all;

set one five;

total=(0.3*exam1)+(0.3*exam2)+(0.4*exam3);

run;

proc print;

var student idnum total;

run;

quit;

The SAS System

Obs student idnum total

1 Amy 1456 86.4

2 Zed 2234 75.6

3 Alex 3410 .

4 Zach 4561 85.0


Building case histories l.jpg
Building Case Histories

  • Have multiple observations of same variable on individual units (not necessarily the same number across individuals).

  • Want to summarize the measurements for each individual and obtain single “record”.

    • Summary of all Delta flights for each ATL route to other cities for October 2004

    • Arrest record for juveniles over a 5 year period

    • Sales histories for individual stores in a retail chain


Building case histories10 l.jpg
Building Case Histories

  • Step 1: SORT dataset on the variable(s) that define(s) the individual units/cases.

  • Step 2: Set the previous dataset into a new one, using the same BY statement as in the SORT.

    • The new dataset “sees” the old dataset as a series of “blocks” of measurements by individual cases

  • Step 3: Define any variables you want to use to summarize cases in RETAIN statement.

  • Step 4: At beginning of each individual, reset variables in Step 3 (typically to 0)

  • Step 5: At end of each individual OUTPUT record


Example brookstone store sales inventory l.jpg
Example - Brookstone Store Sales&Inventory

  • 8 EXCEL Spreadsheets - 4 Quarters X 2 Measures

  • 520 stores observed over 52 weeks

  • Typical Spreadsheet Portion (4 stores X 6 weeks):

Note that the company provides 13 columns representing the 13 weeks in the quarter for each store…not the way we want to analyze it. Also, got rid of commas in EXCEL before exporting to text file.


Reading the data in sas l.jpg
Reading the Data in SAS

Data inv1;

infile ‘filename’;

input storeid 6-8 storename $ 10-38 @;

do week=1 to 13;

input inv @;

output;

end;

run;

This creates 13 “observations” per store and single inv variable



Building a store record for year l.jpg
Building a Store Record for Year

  • Suppose Management wants following summary measures for each store:

    • Total sales

    • Average sales to inventory ratio

    • Mean and standard deviation of sales

    • Correlation between sales and inventory

  • We need the following quantities counted across weeks:

    • SALES, SALES2, INV, INV2, SALES*INV, SALES/INV


Sas code to obtain measures by store p1 l.jpg
SAS Code to Obtain Measures by Store (P1)

Data inv; set inv1-inv4; run; proc sort; by storeid; run;

Data sales; set sales1-sales4; run; proc sort; by storeid; run;

Data invsales; merge inv sales; by storeid; run;

proc sort; by storeid; run;

Data invsales1;

set invsales;

by storeid;

retain sumsales sumsales2 suminv suminv2 salesxinv sales_inv;


Slide16 l.jpg

If first.storeid then do;

sumsales=0; sumsales2=0; suminv=0; suminv2=0; salesxinv=0; sales_inv=0;

end;

sumsales=sumsales+sales; sumsales2=sumsales2+(sales**2);

suminv=suminv+inv; suminv2=suminv2+(inv**2);

salesxinv=salesxinv+(sales*inv); sales_inv=sales_inv+(sales/inv);

if last.storeid then do;

totsales=sumsales; meansal_inv=sales_inv/52;

meansales=totsales/52;

varsales=(sumsales2-(sumsales**2)/52)/51; stdsales=sqrt(varsales);

varinv=(suminv2-(suminv**2)/52)/51; stdinv=sqrt(varinv);

covslinv=(salesxinv-(sumsales*suminv)/52)51;

corrslinv=covslinv/(stdsales*stdinv);

output; end; run;


ad