- 106 Views
- Uploaded on
- Presentation posted in: General

Aggregate Data and Statistics

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Aggregate Data and Statistics

Wendy Watkins Carleton University

Chuck Humphrey University of Alberta

Statistics Canada Data Liberation Initiative

- What are aggregate data?
- Why aggregate?
- How to aggregate?
- Computing exercise

Let’s start with the relationship between statistics and data.

Data

- numeric files created and organized for analysis
- requires processing
- not ready for display

Statistics

- numeric facts/figures
- created from data, i.e, already processed
- presentation-ready

In short, statistics are created from data and represent summaries of the detail observed in the data.

Building on this example, let’s explore aggregation.

We see a table with the number of smokers summarized over categories for age, education, sex, geography, and different time points.

Categories of Periods

Statistics

Categories of Sex

Categories of Region

Age and Education are in the background and display totals.

Aggregation involves tabulating a summary statistic across all of the categories or levels of a set of variables.

The summary statistic in this example is the total number of smokers.

The variables and their categories are:

Region (11): Canada and the ten provinces

Age (5) : Total, 15-19, 20-44, 45-64, 65+

Sex (3) : Total, Female, Male

Education (4) : Total, Some secondary or less, Secondary graduate or more, Not stated

Periods (5) : 1985, 1989, 1991, 1994-95, 1996-97

The tabulation consists of determining the combinations of all categories across variables and then counting the number of smokers within each of these combinations.

11 x 5 x 3 x 4 x 5 = 3300 category

combinations

One might be wondering if there is a difference between tabulating and aggregating.

Usually, they are the same thing.

In creating tables from data, the variables are arranged in various combinations along the columns and the rows.

Placing multiple variables along the columns or rows is called nesting.

Tables may have variables nested on both the columns and rows.

Categories of Sex nested within Periods

Categories of Education nested within Sex

Categories of Sex nested within Region

Up to this point, we have noted that

- statistics are created from data
- aggregations consist of tabulating statistics within the categories of selected variables
- variables may be nested within columns and rows to display these tabulations

Q: What is the difference between an aggregation or tabulation and aggregate data?

A: The display of the aggregation (that is, the structure of the tabulated output).

A statistical data structure is a fixed, two-dimensional matrix with the variables in the columns and cases in the rows.

V1

V2

V3

V4

V5

V6

V7

Case 1

Case 2

Case 3

Case 4

Case 5

Case 6

Case 7

Aggregate data require the same type of statistical data structure.

With aggregate data, the variables are the cells of a tabulation while the cases are the categories of one or more of the table’s dimensions.

Here is an example of a tabulation of three variables. One variable consists of five levels and will be used to represent the cases in an aggregate data file. The other two variables make up a two-way table and the cells in this table will be the variables in the aggregate data file.

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

Dimension 1

5

4

Dimension 3

3

2

1

Dimension 2

Start with dimension 3 and convert this dimension into the rows or cases of an aggregate data file.

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

Dimension 1

5

4

Dimension 3

3

2

1

Dimension 2

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

Dimension 1

Dimension 2

1

2

3

4

5

Dimension 3

Working with the six cells from the tabulation for level 1 of dimension 3, locate these six cells as six variables in the new data structure.

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C1

Dimension 1

Dimension 2

1

2

3

4

5

R1C1

Dimension 3

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R2C1

Dimension 1

Dimension 2

1

2

3

4

5

R1C1

R2C1

Dimension 3

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

Dimension 1

R3C1

Dimension 2

1

2

3

4

5

R1C1

R2C1

R3C1

Dimension 3

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

Dimension 1

Dimension 2

1

2

3

4

5

R1C2

R1C1

R2C1

R3C1

Dimension 3

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R2C2

R3 C2

R1C2

Dimension 1

Dimension 2

1

2

3

4

5

R1C2

R2C2

R1C1

R2C1

R3C1

Dimension 3

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R3 C2

Dimension 1

R3C2

Dimension 2

1

2

3

4

5

R1C2

R2C2

R3C2

R1C1

R2C1

R3C1

Dimension 3

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

Dimension 1

Dimension 2

Repeat this for level 2 of dimension 3.

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

Dimension 1

Dimension 2

1

2

3

4

5

R1C2

R1C2

R2C2

R2C2

R3C2

R3C2

R1C1

R1C1

R2C1

R2C1

R3C1

R3C1

Dimension 3

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

Dimension 1

Dimension 2

Now for level 3 of dimension 3, etc.

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

Dimension 1

Dimension 2

1

2

3

4

5

R1C2

R1C2

R1C2

R2C2

R2C2

R2C2

R3C2

R3C2

R3C2

R1C1

R1C1

R1C1

R2C1

R2C1

R2C1

R3C1

R3C1

R3C1

Dimension 3

R1C1

R2 C1

R3 C1

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

R1C2

R2 C2

R3 C2

Dimension 1

Dimension 2

1

2

3

4

5

R1C2

R1C2

R1C2

R1C2

R2C2

R2C2

R2C2

R2C2

R3C2

R3C2

R3C2

R3C2

R1C1

R1C1

R1C1

R1C1

R2C1

R2C1

R2C1

R2C1

R3C1

R3C1

R3C1

R3C1

Dimension 3

R1C1

R2 C1

R3 C1

R1C2

R2 C2

R3 C2

Dimension 1

Dimension 2

1

2

3

4

5

R1C2

R1C2

R1C2

R1C2

R1C2

R2C2

R2C2

R2C2

R2C2

R2C2

R3C2

R3C2

R3C2

R3C2

R3C2

R1C1

R1C1

R1C1

R1C1

R1C1

R2C1

R2C1

R2C1

R2C1

R2C1

R3C1

R3C1

R3C1

R3C1

R3C1

Dimension 3

The data structure just shown is the one used in the aggregate data files for 1991 and earlier censuses.

Another data structure for aggregate files consists of records built using nested variables along the rows of a table. The categories of these nested variables serve as index values.

This is the data structure for aggregate data from the 1996 Census.

(11)

(5)

(3)

(4)

(5)

age by sex by education by periods

5 x 3 x 4 x 5 = 300 cells

For each level of geography (region), there are 300 cells.

With the previous data structure, there would be one case for each level of region with 300 variables on each line.

With a fully “indexed” aggregate data structure, there are 300 lines for each level of region with each line representing a combination of the values of the remaining nested or “index” variables, i.e., age, sex, education, period.

In Beyond 20/20, the smokers’ statistics are arranged as the column dimension while region, age, sex, education and period nested along the rows.

(11)

(5)

(3)

(4)

(5)

Because the original table was not created using numeric codes for the categories of the “index” variables, writing the data to a file produces the following format. Notice that the categories are in string or text format.

A recode of these “string” values to numeric codes displays the index structure for this type of aggregate data file.

This is a fully indexed data structure. One row in the file represents one cell from the table while the index values define the combination of categories that define the cell.

A partial index data structure is used by Statistics Canada in distributing the 1996 aggregate data. This kind of structure keeps one variable in the column of the table.

Let’s look at a table that consists of the average length of stay in a hospital by sex, age, diagnostic chapter, region, and time period.

diagnostic chapter : 19 levels

sex : 3 levels

age : 6 levels

region : 13 levels

period : 28 levels

The number of category combinations is equal to:

13 x 28 x 3 x 6 x 19 = 124,488 category

combinations

Let’s convert this into a partially indexed aggregate data file structure with the categories of region making up the cases; the column variable consisting of the three latest time periods; and the summary statistic representing the average number of days in the hospital.

The aggregate structure consists of three index variables (sex, age, and chapter) with 342 indexed levels or records per category of region.

3 x 6 x 19 = 342

Definition:Statistical summaries over categorical variables representing social phenomena, geography, and time that are organized in a statistical data structure.

When the data structure of the summaries is organized around time, these aggregate statistics are called a time series.

Time Series aggregate data structure

Annual Time Series

When the data structure of the summaries is organized around geography, we recognize these aggregations as geo-referenced data.

Geo-spatial aggregate data structure

Province

Census Divisions

Census Sub-divisions

Statistics Canada creates aggregate statistics from its major surveys, including the Census, to publish select findings.

The release of aggregate statistics instead of the raw data is a way of protecting the confidentiality of the respondents and safeguarding against disclosure.

Furthermore, the geographic distribution of statistics in Canada is important. As a result, aggregate statistics are released by Statistics Canada for different levels of geography – from the nation to small areas.

Statistics organized into time series is another way in which Statistics Canada publishes a large amount of statistical information. These time series reflect summaries of data that are repeatedly collected over time and permit studies about trends and change.

- To publish findings
- To safeguard against disclosure
- To provide geographic distributions of statistics
- To present statistics over time

DLI reasons to aggregate

- To modify geo-referenced statistics for GIS applications
- for example, finding postal codes within their corresponding EA and then aggregating data from the postal code level up to the EA level

Other reasons to aggregate

- To change the unit of analysis
- for the purposes of a specific research question
- to create a common, higher-level unit of analysis that can be used in merging files

1. Identify the grouping structure that represents all of the variables and their categories over which the aggregation is to be conducted.

- This group structure defines a new unit of analysis; and
- with an indexed structure, the levels of indexing.

2. Establish the sort-order for the grouping variables, i.e., decide which variable increments the slowest, the next slowest, until you reach the variable that changes the fastest.

3. Select the summary statistics, such as sums, averages, minimums, maximums, etc.

As an example, here are the summary statistics available in the SPSS Aggregate procedure.

The actual aggregation is performed using statistical software such as SAS or SPSS.

SAS offers the Data step and a few procedures that can be used to aggregate data, including Proc Summary, Proc Tabulate, and Proc Means.

SPSS, as already shown, provides the Aggregate procedure.