Aggregate Data and Statistics. Wendy Watkins Carleton University. Chuck Humphrey University of Alberta. Statistics Canada Data Liberation Initiative . Outline. What are aggregate data? Why aggregate? How to aggregate? Computing exercise. What are aggregate data?.
Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.
Wendy Watkins Carleton University
Chuck Humphrey University of Alberta
Statistics Canada Data Liberation Initiative
Let’s start with the relationship between statistics and data.
In short, statistics are created from data and represent summaries of the detail observed in the data.
Building on this example, let’s explore aggregation.
We see a table with the number of smokers summarized over categories for age, education, sex, geography, and different time points.
Categories of Sex
Categories of Region
Age and Education are in the background and display totals.
Aggregation involves tabulating a summary statistic across all of the categories or levels of a set of variables.
The summary statistic in this example is the total number of smokers.
The variables and their categories are:
Region (11): Canada and the ten provinces
Age (5) : Total, 15-19, 20-44, 45-64, 65+
Sex (3) : Total, Female, Male
Education (4) : Total, Some secondary or less, Secondary graduate or more, Not stated
Periods (5) : 1985, 1989, 1991, 1994-95, 1996-97
The tabulation consists of determining the combinations of all categories across variables and then counting the number of smokers within each of these combinations.
11 x 5 x 3 x 4 x 5 = 3300 category
One might be wondering if there is a difference between tabulating and aggregating.
Usually, they are the same thing.
In creating tables from data, the variables are arranged in various combinations along the columns and the rows.
Placing multiple variables along the columns or rows is called nesting.
Tables may have variables nested on both the columns and rows.
Categories of Sex nested within Region
Up to this point, we have noted that
Q: What is the difference between an aggregation or tabulation and aggregate data?
A: The display of the aggregation (that is, the structure of the tabulated output).
A statistical data structure is a fixed, two-dimensional matrix with the variables in the columns and cases in the rows.
Aggregate data require the same type of statistical data structure.
With aggregate data, the variables are the cells of a tabulation while the cases are the categories of one or more of the table’s dimensions.
Here is an example of a tabulation of three variables. One variable consists of five levels and will be used to represent the cases in an aggregate data file. The other two variables make up a two-way table and the cells in this table will be the variables in the aggregate data file.
Start with dimension 3 and convert this dimension into the rows or cases of an aggregate data file.
Working with the six cells from the tabulation for level 1 of dimension 3, locate these six cells as six variables in the new data structure.
Repeat this for level 2 of dimension 3.
Now for level 3 of dimension 3, etc.
The data structure just shown is the one used in the aggregate data files for 1991 and earlier censuses.
Another data structure for aggregate files consists of records built using nested variables along the rows of a table. The categories of these nested variables serve as index values.
This is the data structure for aggregate data from the 1996 Census.
age by sex by education by periods
5 x 3 x 4 x 5 = 300 cells
For each level of geography (region), there are 300 cells.
With the previous data structure, there would be one case for each level of region with 300 variables on each line.
With a fully “indexed” aggregate data structure, there are 300 lines for each level of region with each line representing a combination of the values of the remaining nested or “index” variables, i.e., age, sex, education, period.
In Beyond 20/20, the smokers’ statistics are arranged as the column dimension while region, age, sex, education and period nested along the rows.
Because the original table was not created using numeric codes for the categories of the “index” variables, writing the data to a file produces the following format. Notice that the categories are in string or text format.
A recode of these “string” values to numeric codes displays the index structure for this type of aggregate data file.
This is a fully indexed data structure. One row in the file represents one cell from the table while the index values define the combination of categories that define the cell.
A partial index data structure is used by Statistics Canada in distributing the 1996 aggregate data. This kind of structure keeps one variable in the column of the table.
Let’s look at a table that consists of the average length of stay in a hospital by sex, age, diagnostic chapter, region, and time period.
diagnostic chapter : 19 levels
sex : 3 levels
age : 6 levels
region : 13 levels
period : 28 levels
The number of category combinations is equal to:
13 x 28 x 3 x 6 x 19 = 124,488 category
Let’s convert this into a partially indexed aggregate data file structure with the categories of region making up the cases; the column variable consisting of the three latest time periods; and the summary statistic representing the average number of days in the hospital.
The aggregate structure consists of three index variables (sex, age, and chapter) with 342 indexed levels or records per category of region.
3 x 6 x 19 = 342
Definition:Statistical summaries over categorical variables representing social phenomena, geography, and time that are organized in a statistical data structure.
When the data structure of the summaries is organized around time, these aggregate statistics are called a time series.
When the data structure of the summaries is organized around geography, we recognize these aggregations as geo-referenced data.
Statistics Canada creates aggregate statistics from its major surveys, including the Census, to publish select findings.
The release of aggregate statistics instead of the raw data is a way of protecting the confidentiality of the respondents and safeguarding against disclosure.
Furthermore, the geographic distribution of statistics in Canada is important. As a result, aggregate statistics are released by Statistics Canada for different levels of geography – from the nation to small areas.
Statistics organized into time series is another way in which Statistics Canada publishes a large amount of statistical information. These time series reflect summaries of data that are repeatedly collected over time and permit studies about trends and change.
DLI reasons to aggregate
Other reasons to aggregate
1. Identify the grouping structure that represents all of the variables and their categories over which the aggregation is to be conducted.
2. Establish the sort-order for the grouping variables, i.e., decide which variable increments the slowest, the next slowest, until you reach the variable that changes the fastest.
3. Select the summary statistics, such as sums, averages, minimums, maximums, etc.
As an example, here are the summary statistics available in the SPSS Aggregate procedure.
The actual aggregation is performed using statistical software such as SAS or SPSS.
SAS offers the Data step and a few procedures that can be used to aggregate data, including Proc Summary, Proc Tabulate, and Proc Means.
SPSS, as already shown, provides the Aggregate procedure.