Aggregate data and statistics l.jpg
This presentation is the property of its rightful owner.
Sponsored Links
1 / 103

Aggregate Data and Statistics PowerPoint PPT Presentation


  • 101 Views
  • Uploaded on
  • Presentation posted in: General

Aggregate Data and Statistics. Wendy Watkins Carleton University. Chuck Humphrey University of Alberta. Statistics Canada Data Liberation Initiative . Outline. What are aggregate data? Why aggregate? How to aggregate? Computing exercise. What are aggregate data?.

Download Presentation

Aggregate Data and Statistics

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Aggregate data and statistics l.jpg

Aggregate Data and Statistics

Wendy Watkins Carleton University

Chuck Humphrey University of Alberta

Statistics Canada Data Liberation Initiative


Outline l.jpg

Outline

  • What are aggregate data?

  • Why aggregate?

  • How to aggregate?

  • Computing exercise


What are aggregate data l.jpg

What are aggregate data?

Let’s start with the relationship between statistics and data.


Statistics and data l.jpg

Statistics and Data

Data

  • numeric files created and organized for analysis

  • requires processing

  • not ready for display

Statistics

  • numeric facts/figures

  • created from data, i.e, already processed

  • presentation-ready


Statistics and data5 l.jpg

Statistics and Data


Statistics and data7 l.jpg

Statistics and Data


Statistics and data8 l.jpg

Statistics and Data

In short, statistics are created from data and represent summaries of the detail observed in the data.


What is aggregation l.jpg

What is aggregation?

Building on this example, let’s explore aggregation.

We see a table with the number of smokers summarized over categories for age, education, sex, geography, and different time points.


Slide10 l.jpg

Categories of Periods

Statistics

Categories of Sex

Categories of Region

Age and Education are in the background and display totals.


What is aggregation11 l.jpg

What is aggregation?

Aggregation involves tabulating a summary statistic across all of the categories or levels of a set of variables.


The summary statistic l.jpg

The summary statistic

The summary statistic in this example is the total number of smokers.


Variables and categories l.jpg

Variables and categories

The variables and their categories are:

Region (11): Canada and the ten provinces

Age (5) : Total, 15-19, 20-44, 45-64, 65+

Sex (3) : Total, Female, Male

Education (4) : Total, Some secondary or less, Secondary graduate or more, Not stated

Periods (5) : 1985, 1989, 1991, 1994-95, 1996-97


Variables and categories14 l.jpg

Variables and categories

The tabulation consists of determining the combinations of all categories across variables and then counting the number of smokers within each of these combinations.

11 x 5 x 3 x 4 x 5 = 3300 category

combinations


Tabulating or aggregating l.jpg

Tabulating or aggregating

One might be wondering if there is a difference between tabulating and aggregating.

Usually, they are the same thing.


Tabulating aggregating l.jpg

Tabulating = aggregating

In creating tables from data, the variables are arranged in various combinations along the columns and the rows.


Tabulating aggregating17 l.jpg

Tabulating = aggregating

Placing multiple variables along the columns or rows is called nesting.

Tables may have variables nested on both the columns and rows.


Slide18 l.jpg

Categories of Sex nested within Periods


Slide19 l.jpg

Categories of Education nested within Sex

Categories of Sex nested within Region


A quick summary l.jpg

A quick summary

Up to this point, we have noted that

  • statistics are created from data

  • aggregations consist of tabulating statistics within the categories of selected variables

  • variables may be nested within columns and rows to display these tabulations


What are aggregate data21 l.jpg

What are aggregate data?

Q: What is the difference between an aggregation or tabulation and aggregate data?

A: The display of the aggregation (that is, the structure of the tabulated output).


Statistical data structure l.jpg

Statistical data structure

A statistical data structure is a fixed, two-dimensional matrix with the variables in the columns and cases in the rows.

V1

V2

V3

V4

V5

V6

V7

Case 1

Case 2

Case 3

Case 4

Case 5

Case 6

Case 7


Statistical data structure23 l.jpg

Statistical data structure

Aggregate data require the same type of statistical data structure.

With aggregate data, the variables are the cells of a tabulation while the cases are the categories of one or more of the table’s dimensions.


What are aggregate data24 l.jpg

What are aggregate data?

Here is an example of a tabulation of three variables. One variable consists of five levels and will be used to represent the cases in an aggregate data file. The other two variables make up a two-way table and the cells in this table will be the variables in the aggregate data file.


From tabulation to data l.jpg

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

From tabulation to data

Dimension 1

5

4

Dimension 3

3

2

1

Dimension 2


From tabulation to data26 l.jpg

From tabulation to data

Start with dimension 3 and convert this dimension into the rows or cases of an aggregate data file.


From tabulation to data27 l.jpg

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

From tabulation to data

Dimension 1

5

4

Dimension 3

3

2

1

Dimension 2


From tabulation to data28 l.jpg

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

From tabulation to data

Dimension 1

Dimension 2


From tabulation to data29 l.jpg

1

2

3

4

5

From tabulation to data

Dimension 3


From tabulation to data30 l.jpg

From tabulation to data

Working with the six cells from the tabulation for level 1 of dimension 3, locate these six cells as six variables in the new data structure.


From tabulation to data31 l.jpg

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

From tabulation to data

R1C1

Dimension 1

Dimension 2


From tabulation to data32 l.jpg

1

2

3

4

5

From tabulation to data

R1C1

Dimension 3


From tabulation to data33 l.jpg

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

From tabulation to data

R2C1

Dimension 1

Dimension 2


From tabulation to data34 l.jpg

1

2

3

4

5

From tabulation to data

R1C1

R2C1

Dimension 3


From tabulation to data35 l.jpg

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

From tabulation to data

Dimension 1

R3C1

Dimension 2


From tabulation to data36 l.jpg

1

2

3

4

5

From tabulation to data

R1C1

R2C1

R3C1

Dimension 3


From tabulation to data37 l.jpg

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

From tabulation to data

R1C2

Dimension 1

Dimension 2


From tabulation to data38 l.jpg

1

2

3

4

5

From tabulation to data

R1C2

R1C1

R2C1

R3C1

Dimension 3


From tabulation to data39 l.jpg

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R2C2

R3 C2

From tabulation to data

R1C2

Dimension 1

Dimension 2


From tabulation to data40 l.jpg

1

2

3

4

5

From tabulation to data

R1C2

R2C2

R1C1

R2C1

R3C1

Dimension 3


From tabulation to data41 l.jpg

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R3 C2

From tabulation to data

Dimension 1

R3C2

Dimension 2


From tabulation to data42 l.jpg

1

2

3

4

5

From tabulation to data

R1C2

R2C2

R3C2

R1C1

R2C1

R3C1

Dimension 3


From tabulation to data43 l.jpg

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

From tabulation to data

Dimension 1

Dimension 2


From tabulation to data44 l.jpg

From tabulation to data

Repeat this for level 2 of dimension 3.


From tabulation to data45 l.jpg

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

From tabulation to data

Dimension 1

Dimension 2


From tabulation to data46 l.jpg

1

2

3

4

5

R1C2

R1C2

R2C2

R2C2

R3C2

R3C2

R1C1

R1C1

R2C1

R2C1

R3C1

R3C1

From tabulation to data

Dimension 3


From tabulation to data47 l.jpg

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

From tabulation to data

Dimension 1

Dimension 2


From tabulation to data48 l.jpg

From tabulation to data

Now for level 3 of dimension 3, etc.


From tabulation to data49 l.jpg

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

From tabulation to data

Dimension 1

Dimension 2


From tabulation to data50 l.jpg

1

2

3

4

5

R1C2

R1C2

R1C2

R2C2

R2C2

R2C2

R3C2

R3C2

R3C2

R1C1

R1C1

R1C1

R2C1

R2C1

R2C1

R3C1

R3C1

R3C1

From tabulation to data

Dimension 3


From tabulation to data51 l.jpg

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

From tabulation to data

Dimension 1

Dimension 2


From tabulation to data52 l.jpg

1

2

3

4

5

R1C2

R1C2

R1C2

R1C2

R2C2

R2C2

R2C2

R2C2

R3C2

R3C2

R3C2

R3C2

R1C1

R1C1

R1C1

R1C1

R2C1

R2C1

R2C1

R2C1

R3C1

R3C1

R3C1

R3C1

From tabulation to data

Dimension 3


From tabulation to data53 l.jpg

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

From tabulation to data

Dimension 1

Dimension 2


From tabulation to data54 l.jpg

1

2

3

4

5

R1C2

R1C2

R1C2

R1C2

R1C2

R2C2

R2C2

R2C2

R2C2

R2C2

R3C2

R3C2

R3C2

R3C2

R3C2

R1C1

R1C1

R1C1

R1C1

R1C1

R2C1

R2C1

R2C1

R2C1

R2C1

R3C1

R3C1

R3C1

R3C1

R3C1

From tabulation to data

Dimension 3


Aggregate data structure l.jpg

Aggregate data structure

The data structure just shown is the one used in the aggregate data files for 1991 and earlier censuses.


Aggregate data structure56 l.jpg

Aggregate data structure

Another data structure for aggregate files consists of records built using nested variables along the rows of a table. The categories of these nested variables serve as index values.


Aggregate data structure57 l.jpg

Aggregate data structure

This is the data structure for aggregate data from the 1996 Census.


Slide58 l.jpg

(11)

(5)

(3)

(4)

(5)


Aggregate data structure59 l.jpg

Aggregate data structure

age by sex by education by periods

5 x 3 x 4 x 5 = 300 cells

For each level of geography (region), there are 300 cells.


Aggregate data structure60 l.jpg

Aggregate data structure

With the previous data structure, there would be one case for each level of region with 300 variables on each line.


Aggregate data structure61 l.jpg

Aggregate data structure

With a fully “indexed” aggregate data structure, there are 300 lines for each level of region with each line representing a combination of the values of the remaining nested or “index” variables, i.e., age, sex, education, period.


Aggregate data structure62 l.jpg

Aggregate data structure

In Beyond 20/20, the smokers’ statistics are arranged as the column dimension while region, age, sex, education and period nested along the rows.


Slide63 l.jpg

(11)

(5)

(3)

(4)

(5)


Aggregate data structure64 l.jpg

Aggregate data structure

Because the original table was not created using numeric codes for the categories of the “index” variables, writing the data to a file produces the following format. Notice that the categories are in string or text format.


Aggregate data structure66 l.jpg

Aggregate data structure

A recode of these “string” values to numeric codes displays the index structure for this type of aggregate data file.


Aggregate data structure68 l.jpg

Aggregate Data Structure

This is a fully indexed data structure. One row in the file represents one cell from the table while the index values define the combination of categories that define the cell.


Aggregate data structure69 l.jpg

Aggregate Data Structure

A partial index data structure is used by Statistics Canada in distributing the 1996 aggregate data. This kind of structure keeps one variable in the column of the table.


Another example l.jpg

Another example

Let’s look at a table that consists of the average length of stay in a hospital by sex, age, diagnostic chapter, region, and time period.


Variables and categories75 l.jpg

Variables and categories

diagnostic chapter : 19 levels

sex : 3 levels

age : 6 levels

region : 13 levels

period : 28 levels


Aggregate data structure76 l.jpg

Aggregate data structure

The number of category combinations is equal to:

13 x 28 x 3 x 6 x 19 = 124,488 category

combinations


Aggregate data structure77 l.jpg

Aggregate data structure

Let’s convert this into a partially indexed aggregate data file structure with the categories of region making up the cases; the column variable consisting of the three latest time periods; and the summary statistic representing the average number of days in the hospital.


Aggregate average length of hospital stay in days l.jpg

Aggregate average length of hospital stay in days

The aggregate structure consists of three index variables (sex, age, and chapter) with 342 indexed levels or records per category of region.

3 x 6 x 19 = 342


What are aggregate data81 l.jpg

What are aggregate data?

Definition:Statistical summaries over categorical variables representing social phenomena, geography, and time that are organized in a statistical data structure.


Time series aggregate data l.jpg

Time series aggregate data

When the data structure of the summaries is organized around time, these aggregate statistics are called a time series.


Slide83 l.jpg

Time Series aggregate data structure


Slide84 l.jpg

Annual Time Series


Geo spatial aggregate data l.jpg

Geo-spatial aggregate data

When the data structure of the summaries is organized around geography, we recognize these aggregations as geo-referenced data.


Slide86 l.jpg

Geo-spatial aggregate data structure


Slide87 l.jpg

Province

Census Divisions

Census Sub-divisions


Why aggregate l.jpg

Why aggregate?

Statistics Canada creates aggregate statistics from its major surveys, including the Census, to publish select findings.


Why aggregate89 l.jpg

Why aggregate?

The release of aggregate statistics instead of the raw data is a way of protecting the confidentiality of the respondents and safeguarding against disclosure.


Why aggregate90 l.jpg

Why aggregate?

Furthermore, the geographic distribution of statistics in Canada is important. As a result, aggregate statistics are released by Statistics Canada for different levels of geography – from the nation to small areas.


Why aggregate91 l.jpg

Why aggregate?

Statistics organized into time series is another way in which Statistics Canada publishes a large amount of statistical information. These time series reflect summaries of data that are repeatedly collected over time and permit studies about trends and change.


Why aggregate92 l.jpg

Why aggregate?

  • To publish findings

  • To safeguard against disclosure

  • To provide geographic distributions of statistics

  • To present statistics over time


Why aggregate93 l.jpg

Why aggregate?

DLI reasons to aggregate

  • To modify geo-referenced statistics for GIS applications

    • for example, finding postal codes within their corresponding EA and then aggregating data from the postal code level up to the EA level


Why aggregate94 l.jpg

Why aggregate?

Other reasons to aggregate

  • To change the unit of analysis

    • for the purposes of a specific research question

    • to create a common, higher-level unit of analysis that can be used in merging files


How does one aggregate l.jpg

How does one aggregate?

1. Identify the grouping structure that represents all of the variables and their categories over which the aggregation is to be conducted.

  • This group structure defines a new unit of analysis; and

  • with an indexed structure, the levels of indexing.


How does one aggregate96 l.jpg

How does one aggregate?

2. Establish the sort-order for the grouping variables, i.e., decide which variable increments the slowest, the next slowest, until you reach the variable that changes the fastest.


How does one aggregate98 l.jpg

How does one aggregate?

3. Select the summary statistics, such as sums, averages, minimums, maximums, etc.


How does one aggregate99 l.jpg

How does one aggregate?

As an example, here are the summary statistics available in the SPSS Aggregate procedure.


How does one aggregate100 l.jpg

How does one aggregate?

The actual aggregation is performed using statistical software such as SAS or SPSS.

SAS offers the Data step and a few procedures that can be used to aggregate data, including Proc Summary, Proc Tabulate, and Proc Means.


How does one aggregate101 l.jpg

How does one aggregate?

SPSS, as already shown, provides the Aggregate procedure.


Time for an exercise l.jpg

Time for an exercise


  • Login