1 / 103

Aggregate Data and Statistics - PowerPoint PPT Presentation

Aggregate Data and Statistics. Wendy Watkins Carleton University. Chuck Humphrey University of Alberta. Statistics Canada Data Liberation Initiative . Outline. What are aggregate data? Why aggregate? How to aggregate? Computing exercise. What are aggregate data?.

Related searches for Aggregate Data and Statistics

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

Aggregate Data and Statistics

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Aggregate Data and Statistics

Wendy Watkins Carleton University

Chuck Humphrey University of Alberta

Outline

• What are aggregate data?

• Why aggregate?

• How to aggregate?

• Computing exercise

Statistics and Data

Data

• numeric files created and organized for analysis

• requires processing

Statistics

• numeric facts/figures

• created from data, i.e, already processed

Statistics and Data

In short, statistics are created from data and represent summaries of the detail observed in the data.

What is aggregation?

Building on this example, let’s explore aggregation.

We see a table with the number of smokers summarized over categories for age, education, sex, geography, and different time points.

Categories of Periods

Statistics

Categories of Sex

Categories of Region

Age and Education are in the background and display totals.

What is aggregation?

Aggregation involves tabulating a summary statistic across all of the categories or levels of a set of variables.

The summary statistic

The summary statistic in this example is the total number of smokers.

Variables and categories

The variables and their categories are:

Region (11): Canada and the ten provinces

Age (5) : Total, 15-19, 20-44, 45-64, 65+

Sex (3) : Total, Female, Male

Education (4) : Total, Some secondary or less, Secondary graduate or more, Not stated

Periods (5) : 1985, 1989, 1991, 1994-95, 1996-97

Variables and categories

The tabulation consists of determining the combinations of all categories across variables and then counting the number of smokers within each of these combinations.

11 x 5 x 3 x 4 x 5 = 3300 category

combinations

Tabulating or aggregating

One might be wondering if there is a difference between tabulating and aggregating.

Usually, they are the same thing.

Tabulating = aggregating

In creating tables from data, the variables are arranged in various combinations along the columns and the rows.

Tabulating = aggregating

Placing multiple variables along the columns or rows is called nesting.

Tables may have variables nested on both the columns and rows.

Categories of Sex nested within Periods

Categories of Education nested within Sex

Categories of Sex nested within Region

A quick summary

Up to this point, we have noted that

• statistics are created from data

• aggregations consist of tabulating statistics within the categories of selected variables

• variables may be nested within columns and rows to display these tabulations

What are aggregate data?

Q: What is the difference between an aggregation or tabulation and aggregate data?

A: The display of the aggregation (that is, the structure of the tabulated output).

Statistical data structure

A statistical data structure is a fixed, two-dimensional matrix with the variables in the columns and cases in the rows.

V1

V2

V3

V4

V5

V6

V7

Case 1

Case 2

Case 3

Case 4

Case 5

Case 6

Case 7

Statistical data structure

Aggregate data require the same type of statistical data structure.

With aggregate data, the variables are the cells of a tabulation while the cases are the categories of one or more of the table’s dimensions.

What are aggregate data?

Here is an example of a tabulation of three variables. One variable consists of five levels and will be used to represent the cases in an aggregate data file. The other two variables make up a two-way table and the cells in this table will be the variables in the aggregate data file.

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

Dimension 1

5

4

Dimension 3

3

2

1

Dimension 2

From tabulation to data

Start with dimension 3 and convert this dimension into the rows or cases of an aggregate data file.

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

Dimension 1

5

4

Dimension 3

3

2

1

Dimension 2

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

Dimension 1

Dimension 2

1

2

3

4

5

Dimension 3

From tabulation to data

Working with the six cells from the tabulation for level 1 of dimension 3, locate these six cells as six variables in the new data structure.

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C1

Dimension 1

Dimension 2

1

2

3

4

5

R1C1

Dimension 3

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R2C1

Dimension 1

Dimension 2

1

2

3

4

5

R1C1

R2C1

Dimension 3

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

Dimension 1

R3C1

Dimension 2

1

2

3

4

5

R1C1

R2C1

R3C1

Dimension 3

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

Dimension 1

Dimension 2

1

2

3

4

5

R1C2

R1C1

R2C1

R3C1

Dimension 3

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R2C2

R3 C2

R1C2

Dimension 1

Dimension 2

1

2

3

4

5

R1C2

R2C2

R1C1

R2C1

R3C1

Dimension 3

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R3 C2

Dimension 1

R3C2

Dimension 2

1

2

3

4

5

R1C2

R2C2

R3C2

R1C1

R2C1

R3C1

Dimension 3

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

Dimension 1

Dimension 2

From tabulation to data

Repeat this for level 2 of dimension 3.

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

Dimension 1

Dimension 2

1

2

3

4

5

R1C2

R1C2

R2C2

R2C2

R3C2

R3C2

R1C1

R1C1

R2C1

R2C1

R3C1

R3C1

Dimension 3

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

Dimension 1

Dimension 2

From tabulation to data

Now for level 3 of dimension 3, etc.

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

Dimension 1

Dimension 2

1

2

3

4

5

R1C2

R1C2

R1C2

R2C2

R2C2

R2C2

R3C2

R3C2

R3C2

R1C1

R1C1

R1C1

R2C1

R2C1

R2C1

R3C1

R3C1

R3C1

Dimension 3

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

Dimension 1

Dimension 2

1

2

3

4

5

R1C2

R1C2

R1C2

R1C2

R2C2

R2C2

R2C2

R2C2

R3C2

R3C2

R3C2

R3C2

R1C1

R1C1

R1C1

R1C1

R2C1

R2C1

R2C1

R2C1

R3C1

R3C1

R3C1

R3C1

Dimension 3

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

Dimension 1

Dimension 2

1

2

3

4

5

R1C2

R1C2

R1C2

R1C2

R1C2

R2C2

R2C2

R2C2

R2C2

R2C2

R3C2

R3C2

R3C2

R3C2

R3C2

R1C1

R1C1

R1C1

R1C1

R1C1

R2C1

R2C1

R2C1

R2C1

R2C1

R3C1

R3C1

R3C1

R3C1

R3C1

Dimension 3

Aggregate data structure

The data structure just shown is the one used in the aggregate data files for 1991 and earlier censuses.

Aggregate data structure

Another data structure for aggregate files consists of records built using nested variables along the rows of a table. The categories of these nested variables serve as index values.

Aggregate data structure

This is the data structure for aggregate data from the 1996 Census.

(11)

(5)

(3)

(4)

(5)

Aggregate data structure

age by sex by education by periods

5 x 3 x 4 x 5 = 300 cells

For each level of geography (region), there are 300 cells.

Aggregate data structure

With the previous data structure, there would be one case for each level of region with 300 variables on each line.

Aggregate data structure

With a fully “indexed” aggregate data structure, there are 300 lines for each level of region with each line representing a combination of the values of the remaining nested or “index” variables, i.e., age, sex, education, period.

Aggregate data structure

In Beyond 20/20, the smokers’ statistics are arranged as the column dimension while region, age, sex, education and period nested along the rows.

(11)

(5)

(3)

(4)

(5)

Aggregate data structure

Because the original table was not created using numeric codes for the categories of the “index” variables, writing the data to a file produces the following format. Notice that the categories are in string or text format.

Aggregate data structure

A recode of these “string” values to numeric codes displays the index structure for this type of aggregate data file.

Aggregate Data Structure

This is a fully indexed data structure. One row in the file represents one cell from the table while the index values define the combination of categories that define the cell.

Aggregate Data Structure

A partial index data structure is used by Statistics Canada in distributing the 1996 aggregate data. This kind of structure keeps one variable in the column of the table.

Another example

Let’s look at a table that consists of the average length of stay in a hospital by sex, age, diagnostic chapter, region, and time period.

Variables and categories

diagnostic chapter : 19 levels

sex : 3 levels

age : 6 levels

region : 13 levels

period : 28 levels

Aggregate data structure

The number of category combinations is equal to:

13 x 28 x 3 x 6 x 19 = 124,488 category

combinations

Aggregate data structure

Let’s convert this into a partially indexed aggregate data file structure with the categories of region making up the cases; the column variable consisting of the three latest time periods; and the summary statistic representing the average number of days in the hospital.

Aggregate average length of hospital stay in days

The aggregate structure consists of three index variables (sex, age, and chapter) with 342 indexed levels or records per category of region.

3 x 6 x 19 = 342

What are aggregate data?

Definition:Statistical summaries over categorical variables representing social phenomena, geography, and time that are organized in a statistical data structure.

Time series aggregate data

When the data structure of the summaries is organized around time, these aggregate statistics are called a time series.

Time Series aggregate data structure

Annual Time Series

Geo-spatial aggregate data

When the data structure of the summaries is organized around geography, we recognize these aggregations as geo-referenced data.

Geo-spatial aggregate data structure

Province

Census Divisions

Census Sub-divisions

Why aggregate?

Statistics Canada creates aggregate statistics from its major surveys, including the Census, to publish select findings.

Why aggregate?

The release of aggregate statistics instead of the raw data is a way of protecting the confidentiality of the respondents and safeguarding against disclosure.

Why aggregate?

Furthermore, the geographic distribution of statistics in Canada is important. As a result, aggregate statistics are released by Statistics Canada for different levels of geography – from the nation to small areas.

Why aggregate?

Statistics organized into time series is another way in which Statistics Canada publishes a large amount of statistical information. These time series reflect summaries of data that are repeatedly collected over time and permit studies about trends and change.

Why aggregate?

• To publish findings

• To safeguard against disclosure

• To provide geographic distributions of statistics

• To present statistics over time

Why aggregate?

DLI reasons to aggregate

• To modify geo-referenced statistics for GIS applications

• for example, finding postal codes within their corresponding EA and then aggregating data from the postal code level up to the EA level

Why aggregate?

Other reasons to aggregate

• To change the unit of analysis

• for the purposes of a specific research question

• to create a common, higher-level unit of analysis that can be used in merging files

How does one aggregate?

1. Identify the grouping structure that represents all of the variables and their categories over which the aggregation is to be conducted.

• This group structure defines a new unit of analysis; and

• with an indexed structure, the levels of indexing.

How does one aggregate?

2. Establish the sort-order for the grouping variables, i.e., decide which variable increments the slowest, the next slowest, until you reach the variable that changes the fastest.

How does one aggregate?

3. Select the summary statistics, such as sums, averages, minimums, maximums, etc.

How does one aggregate?

As an example, here are the summary statistics available in the SPSS Aggregate procedure.

How does one aggregate?

The actual aggregation is performed using statistical software such as SAS or SPSS.

SAS offers the Data step and a few procedures that can be used to aggregate data, including Proc Summary, Proc Tabulate, and Proc Means.

How does one aggregate?

SPSS, as already shown, provides the Aggregate procedure.