
Find a chair.

Any chair.

Sit down.

Relax.

The course outline is on the course website

WEBSITE ADDRESS

www.geography.ryerson.ca/coppack/geo161

THIS IS NOT A BLACKBOARD SITE

ALERT!

What the website looks like…

www.geography.ryerson.ca/coppack/geo161


RYERSON UNIVERSITY

Department of Geography

GEO 161: INTRODUCTORY ANALYTICAL TECHNIQUES

FALL 2014

(Also known as “Yippee! It’s statistics!”)

Instructor: Dr. Philip Coppack

Office: JOR 608

Phone: (416) 979-5000 ex. (I don’t respond well to phone calls but…)

E-mail: pcoppack@arts.ryerson.ca (e-mails - within reason - I will answer)

Office Hours: Posted - By chance or appointment.


COURSE DESCRIPTION - The fine print:

Welcome to Analytical Techniques I, a one-semester professional course within the Geographic Analysis program. You’ll be happy to note that no familiarity with the fundamental elements of statistics is assumed (even though you started doing this in grade 4), though some keyboarding and operating systems experience with microcomputers is – which basically means that if you can log on and navigate Windows and have heard of Excel, you’ll be fine.

The larger context for this course is the stuff you’re getting about geographic research in GEO 141. The current course, GEO 161, also sets the stage for GEO 361, inferential statistics, which you will have next term. My goal is to provide you with the fundamentals of data and information extraction, descriptive statistics, picturing your data, and sampling distributions, and to expose you to computer programmes commonly used in geographic research. While you will take courses in many different aspects of geography, your principal career path will be as a research analyst able to gather, order, and analyze data, extract information from those data and present your findings in a workplace environment using common computer software. Thus, this course provides the groundwork for all future courses you will take.

My approach in this course is to ease the fears students usually have about “statistics” – or more to the point, numbers. My basic premise is that statistics can be considered as a set of related concepts driven by some fairly simple arithmetic. Knowledge of the mathematical derivations that underpin statistics is not required for this course. If you can add, subtract, divide, and multiply then you have all the numeracy skills you need.

I agree


COURSE EVALUATION:

- Lab assignments (5 x 10%): 50% (see schedule below)
- In-class multiple choice quizzes (5 x 10%): 50% (see schedule below)

THERE IS NO FINAL EXAM SO YOU CANNOT MISS QUIZZES

- NOTE: Quizzes will be run in the first 40 minutes of the lecture period and will consist of 40 multiple choice questions.
- If you miss a quiz, you lose the grade for it – no exceptions including begging, otherwise you will be swamped.
The labs are designed to be fairly short and self-contained, and most of the calculation work can be done during the lab session in which they are distributed and when I am around. It is expected that you will stay on schedule. If you don’t, you’re going to fail – this is a challenging course!

If you miss a lab, you lose the grade for it – no exceptions including begging, otherwise you will be swamped.

- REQUIRED TEXT
- Miethe, Terance, and Jane Florence Gauthier (2008). Simple Statistics. Oxford University Press.
- You should also use my course PowerPoint shows, available at:

www.geography.ryerson.ca/coppack/geo161


RATING SCHEME FOR TOPIC DIFFICULTY

From easiest to hardest:

- easy
- fairly easy
- moderate
- fairly difficult
- difficult

It’s all relative, after all.

You started stats in grade 4.

For some of you, all of this will be easy.

For others, none of it will be.

Here’s where you look for your assignment handout and due dates. The colours of the cells refer to the topics covered by each assignment.

Here’s when your first assignment is due. IN THE CLASS OF THAT WEEK.

Here’s where you look for your quiz due dates. They will cover all topics since the last quiz.

This is a challenging course, but you will most likely pass it if you:

Attend lectures and labs.

Do your readings.

Ask questions in the lab or class or office if you don’t understand something.

Hand in all the assignments. It is hard for me to give you anything but an ‘F’ if I have nothing to grade.

Be good – don't cheat (working together is not cheating – handing in someone else’s work is.)

Of the course:

To teach you the basic toolkit of the research analyst – statistics.

Of today:

To alleviate your abject terror of having to deal with…

numbers!

[Slide graphic: a scatter of numbers and symbols, including ∑x/n-1, x², n, 0, 1, -1, 2 and 3.2]

What is Statistics?

Statistics (the discipline) is a way of reasoning plus a collection of tools and methods designed to help us understand the world.

Statistics (plural) are particular calculations made from data.

Data are values within a variable. Their purpose is to help provide information about research questions.

Information allows us to answer a specific research question.

Descriptive Statistics (this course).

This type of statistics describes a dataset.

They tell you only about that specific dataset.

They are a crucial first step in analysis, often the only step.

You cannot have good inference without good description.

Inferential Statistics (GEO 361 next semester)

This type of statistics allows you to infer from a sample dataset.

They allow you to say things about a population using only a sample of that population.

They require very rigorous rules about sampling and data distributions.

They require that the margins of error and the confidence in the size of that error between the sample and the population be quantified.

How Does Statistics Work?

Statistics is really about measuring variation.

First, we collect data that answer a research question: e.g. which political party is most popular.

But measurements are always imperfect for reasons we’ll look at in a moment, so…

Second, we try to estimate how imperfect the measurements are. That is, how far from reality is the picture the data are painting.

These variations between reality and the statistical picture of reality are called the margins of error and we must quantify them using statistical tools.

We must measure how big the error is and how sure we can be about that measurement.

Imperfections come from two principal places:

Measurement:

They come from the fact that we live in a world about which we do not or cannot know everything, or they come from imperfections in the way we measure.

Sampling:

Most often we are dealing with data collected from a sample of the world and not from all of the world.

The data we have are only a small part of what are actually out there.

Some data we choose to count, most we do not.

Some data we can count, some we cannot.

Counting some things can make other things uncountable.

We choose one spatial scale and ignore others.

We choose one time period and ignore others.

#2: Sampling – Why we do it.

Essential part of statistics.

Need to do it because:

Population size may be large and thus too expensive to count in its entirety (e.g. Canada, the galaxy).

Population may be too large to count at all (e.g. quantity of water in Great Lakes).

Population size may be unknown (e.g. # of naked mole rats in the world).

Collecting millions of samples would allow human error to creep in anyway.

Differences between a population & a sample

A sample is only a small part of the population, so:

Chance of population and sample statistics (e.g. the mean and standard deviation) being exactly the same is very small.

Chance of population and sample statistics (mean and standard deviation) being close can be very high depending on size of sample.
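The effect of sample size can be seen in a small simulation. This is only a sketch: the "population" is 100,000 invented incomes drawn from a bell-shaped distribution, and the function name `typical_gap` is made up for the illustration.

```python
import random
import statistics

random.seed(161)  # fixed seed so the sketch is repeatable

# Hypothetical "population": 100,000 made-up incomes (not real data).
population = [random.gauss(50_000, 15_000) for _ in range(100_000)]
population_mean = statistics.fmean(population)

def typical_gap(sample_size, repeats=200):
    """Average absolute gap between a sample mean and the population mean."""
    gaps = []
    for _ in range(repeats):
        sample = random.sample(population, sample_size)
        gaps.append(abs(statistics.fmean(sample) - population_mean))
    return statistics.fmean(gaps)

gap_small = typical_gap(25)
gap_large = typical_gap(2_500)
print(f"n = 25:    typical gap ≈ ${gap_small:,.0f}")
print(f"n = 2,500: typical gap ≈ ${gap_large:,.0f}")
```

The sample mean almost never equals the population mean exactly, but the typical gap shrinks noticeably as the sample gets larger.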

That’s how statistics works.

Differences between a population & a sample

But to make it work we need to know two things:

What is the quantitative difference between the population and sample statistics?

This is called the margin of error.

and

How sure can you be in those margins of error?

This is called the confidence limit.

These are explained as…

Margins of Error & Confidence Limits

Margin of Error

This number tells you where the unknown population statistic’s value will lie in relation to your known sample statistic’s value.

It is always in the same units as the data it measures and it is always a plus or minus value (hence “margins” of error).

For example, assume that a sample of Toronto’s population tells you that Toronto’s average income is $50,000 ± $7,000. This means that the actual population average income is somewhere between $43,000 and $57,000. Note that you never know where it is exactly.

Margins of Error & Confidence Limits

Confidence Limits

This number tells you how certain you can be that the margin of error you calculated is correct.

It is always expressed as a percentage (most often quoted as an “alpha” value of .05 or .01, which is one minus the confidence level), and the number is one you decide upon arbitrarily, though it is usually never less than 95% (alpha of .05) and is often 99% (alpha of .01) or more.

For example, when you calculated the ± $7,000 margin of error you would have also added in a confidence limit as part of the formula (say 95%). This means that you can be 95% sure that the real population average income is somewhere between $43,000 and $57,000.
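As a rough sketch of how the confidence limit enters the calculation, the code below uses one common textbook formula for the margin of error of a mean (z times the standard deviation, divided by the square root of n). The slides give only the result ($50,000 ± $7,000), so the standard deviation and sample size here are invented to reproduce it, and the multiplier 1.96 is the usual normal-curve value for 95%.

```python
import math

def margin_of_error(sample_sd, n, z):
    """Margin of error for a sample mean: z * sd / sqrt(n) (normal approximation)."""
    return z * sample_sd / math.sqrt(n)

# Hypothetical sample: the standard deviation and sample size are invented
# so that the result matches the $50,000 ± $7,000 example.
sample_mean = 50_000
sample_sd = 25_000
n = 49
z_95 = 1.96  # standard normal multiplier for 95% confidence

moe = margin_of_error(sample_sd, n, z_95)
low, high = sample_mean - moe, sample_mean + moe
print(f"${sample_mean:,} ± ${moe:,.0f}  ->  ${low:,.0f} to ${high:,.0f} at 95% confidence")
```

With a sample this small a slightly larger multiplier (from the t distribution) would normally be used; 1.96 keeps the arithmetic simple.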

Margins of Error & Confidence Limits

Some Final Points

Statisticians are always very conservative about their margins of error so they always use very high confidence limits such as 95% or 99%.

But there is a trade-off between being sure of the numbers you get and how tight the margins around them are:

Rule #1: the higher the confidence limit you use, the larger the margins of error become.

You can compensate for this by increasing your sample size, so another rule is:

Rule #2: the larger the sample you use, the smaller the margins of error become.
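Both rules can be checked with a few lines of arithmetic. The sketch below assumes the same textbook formula for the margin of error of a mean (z * sd / sqrt(n)); the standard deviation is a made-up figure, and 1.96 and 2.576 are the usual normal-curve multipliers for 95% and 99%.

```python
import math

def margin_of_error(sample_sd, n, z):
    """Margin of error for a sample mean (normal approximation)."""
    return z * sample_sd / math.sqrt(n)

Z = {"95%": 1.96, "99%": 2.576}
sample_sd = 25_000  # hypothetical standard deviation of incomes

for n in (49, 196, 784):
    row = ", ".join(
        f"{level}: ±${margin_of_error(sample_sd, n, z):,.0f}" for level, z in Z.items()
    )
    print(f"n = {n:>3}  {row}")
```

Reading across a row shows Rule #1 (99% gives a wider margin than 95%); reading down a column shows Rule #2. Note that the sample has to be four times larger to cut the margin in half.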

BUT…

BECAUSE YOU ARE BASING YOUR STATISTICS ON A DATA SAMPLE, THE RESULTS YOU GET ARE ONLY AS GOOD AS THE DATA YOU COLLECTED, HENCE THE GIGO RULE…

GARBAGE IN, GARBAGE OUT (GIGO)

Margins of error arise from...

Where margins of error come from…

Research question:

do you think possession of a concealed firearm should merit a life sentence?

Response:

70% yes

30% no

Canadian population:

34,911,537

…is the population really representative of the population?

… in time?

… in space?

Thus we would be 100% sure that 70% of Canadians agreed with the question and 30% did not. (Not quite as we will see in a minute, but for now…)

…is the sample representative of the population?

BUT

the likelihood of the sample proportion response rate being exactly the same as the population proportion response rate is very small. That is why we have to have the margins of error calculations.

…how sure can we be about the margins of error on the sample response rates?

…is the sample data accurate, precise and truthful?

Research question:

do you think possession of a concealed firearm should merit a life sentence?

Canadian population:

34,911,537

Response:

70% yes

30% no

Sample of Canadian population:

2,500

…is the sample size large enough?

Thus we could be ≥95% sure that 70% of Canadians ± ‘x’% agreed with the question and 30% ± ‘x’% did not.
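To put a number on ‘x’, here is a minimal sketch using the standard normal-approximation formula for the margin of error of a proportion. It assumes a simple random sample and a 95% confidence limit (multiplier 1.96); the function name is made up for the illustration.

```python
import math

def proportion_margin(p, n, z=1.96):
    """Margin of error for a sample proportion (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)

p_yes = 0.70   # sample share answering "yes" (from the slide)
n = 2_500      # sample size (from the slide)

moe = proportion_margin(p_yes, n)
print(f"yes: {p_yes:.0%} ± {moe:.1%}")
print(f"no:  {1 - p_yes:.0%} ± {moe:.1%}")
```

Under these assumptions ‘x’ comes out at roughly 1.8 percentage points, which holds only if the sample really is representative and the answers are truthful.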

Statistics can be conceptually tricky but it is not numerically difficult.

Statistics gives us a way to work with the variability in the world around us and answer research questions.

Statistics is an essential part of your skill set as a research analyst and your life skills as an individual.

What Isn’t Statistics?

It is NOT mathematics – at least the way we will use it.

It is…

Arithmetic

+

Symbols

+

Concepts/logic

… that will make you appear very smart.

Consider the following…

(1 + 2 + 3 + 4 + 5) / 5 = ?

15 / 5 = 3

You add.

You divide.

You get your answer.

How easy is that?

No surprises here

you did this in grade 4 remember?

(1 + 2 + 3 + 4 + 5) / 5 = ?

This is a concept.

It says that you can take a series of numbers (data values) and calculate a single number that represents them all.

This type of statistic is called a measure of central tendency.

Symbols simplify, standardise, and generalise the representation of long forms of arithmetic so as to:

- take up less room
- mean the same thing to everyone
- be useful for all series of numbers
And conventionally the symbols used are…

For sample statistics the Latin alphabet is mostly used – e.g.:

a, b, c, x, n, ∑ , √

For population statistics the Greek alphabet is mostly used – e.g.:

α, β, λ,δ, ∑ , √

Note that science and engineering may use the same letters to mean different things.

(1 + 2 + 3 + 4 + 5) / 5 = ?

What do you have here?

1, 2, 3, 4, 5: are individual data values

+, / (the division bar) and =: are arithmetic operators

5: is the number of cases

?: is the answer

If you want a number that best represents the data values in the dataset then you add up the data values and divide the sum of all data values by the number of data values.

But to shorten this long-winded way of stating the operation we can use a formula:

$$\bar{x} = \frac{1}{n} \sum x$$

Where:

$\bar{x}$: the arithmetic mean (‘ex’ bar) – your answer

∑: the sum of all x’s

x: a value in the dataset

n: the number of cases in the dataset

So 1/n = 1/5 = 0.2

Sum of all x’s = 15

0.2 * 15 = 3

This is the most common expression for the arithmetic mean of a dataset.
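The same steps in code, as a small sketch: it computes the mean both ways, first as 1/n times the sum (the steps just shown) and then by dividing the sum by n directly. The variable names are made up for the illustration.

```python
data = [1, 2, 3, 4, 5]
n = len(data)        # number of cases
total = sum(data)    # sum of all x's

mean_common = (1 / n) * total   # 0.2 * 15
mean_direct = total / n         # 15 / 5

print(mean_common, mean_direct)
```

Both routes give the same answer; the second just skips a step.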

…and you can get the tee shirt.

But we will use an easier one…

$$\bar{x} = \frac{\sum x}{n}$$

where:

$\bar{x}$: the arithmetic mean (‘ex’ bar) – your answer

∑: the sum of all x’s

x: a value in the dataset

n: the number of cases in the dataset

This is the arithmetic mean of a dataset…

Wow! This is impressive!

In words…

The standard deviation of a dataset is equal to the square root of the sum of the squared differences between each data value ($x$) and the mean ($\bar{x}$) of the dataset, divided by the number of data values in the dataset ($n$) minus 1.

Phew! So much easier with the formula.
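Translated term by term, the sentence in words corresponds to the formula below. The symbol $s$ for the sample standard deviation is an assumption (it is the usual convention); the other symbols are the ones already defined for the mean.

```latex
s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}
```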

It measures deviation among the data values – that is, how much on average each data value varies from the arithmetic mean – but lots more on that later.
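Here is the same calculation as a short sketch, following the definition in words step by step for the data values 1 to 5 and then checking the result against Python's built-in `statistics.stdev`, which also divides by n minus 1.

```python
import math
import statistics

data = [1, 2, 3, 4, 5]
n = len(data)
mean = sum(data) / n

# Squared difference between each data value and the mean: 4, 1, 0, 1, 4
squared_diffs = [(x - mean) ** 2 for x in data]

# Sum them, divide by n - 1, take the square root.
sd = math.sqrt(sum(squared_diffs) / (n - 1))

print(round(sd, 3))
print(math.isclose(sd, statistics.stdev(data)))
```

Nothing here goes beyond adding, subtracting, multiplying, dividing and one square root.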

DESCRIPTION → EXPLANATION → PREDICTION → PRESCRIPTION

Describe form → Explain process → Predict outcomes → Change outcomes

What it is → Why it is → What it should be → What you want it to be

STATISTICS (spanning the sequence above): Descriptive, Inferential, Forecasting

Structure and Content of the Course

[Diagram: concept map of the course. Its labels, grouped by branch where the grouping is clear:]

- Data: population and sample; types (spatial, attribute or thematic, temporal; categorical, quantitative; discrete, continuous); levels (nominal, ordinal, interval, ratio); relatives (percents, contingency, index #s, coefficients); data errors
- Information extraction and description: central tendency (mean, median, mode); dispersion (range, var, sd); patterns (graphs, maps, histograms, ogives, frequency distributions); shape and symmetry (normal curves, non-normal curves, Chebyshev)
- Sampling: theory, methods, distributions, error, SE, CLT
- Inference and problem solving: hypothesis testing, errors 1&2, significance tests, confidence tests, relationships, differences, 1, 2, ..n samples; bi-variate & multivariate parametric techniques; non-parametric techniques

Data

Why Collect Data?

- > Informed decisions come from examining information and data.
- > You are learning to be research analysts for decision support.
- > Good analysis cannot proceed without good data collection.
- > That means:
- being as objective as possible.
- being as honest as possible.
- being as unbiased as possible.
- being as un-opinionated as possible.
- proposing meaningful research questions.
- proposing do-able research questions.
- quantifying margins of error.
- having no bias towards a given outcome.

The Wisdom of the Crowd

The behaviour or opinions or decisions of individuals can be chaotic, but the average behaviour of a crowd can be highly predictable and surprisingly accurate.

Mathematician Marcus Du Sautoy’s jellybean experiment:

5,410 jelly beans

160 guesses

Guesses ranged from 450 to 50,000

Average of all guesses was 5,414 – just 4 over the real number.

You can see this excerpt from his show The Code at http://www.youtube.com/watch?v=982E49KAMyw

Data are observations about phenomena

For example:

- Quantitative observations such as rainfall in cm/hour on a given date
- Number of families in Peterborough with household incomes less than $50,000
- Observational “facts” such as perceptions about the scenic beauty value of a given environment
- Think of data as being the numbers that you look at in a spreadsheet when all the tallying has been done

Information is data that have been given some meaning.

Meaning comes from context and the problem to be solved or research question to be answered.

For example, here’s some data:

And here’s the context:

For example, 40 cm/hour of rainfall on June 14th 2004 led to devastating floods in Peterborough that caused over $1 billion in damage, primarily in lower income neighbourhoods. Research is now under way to ascertain the 50 year flood levels for flood control planning by damming a scenic river valley.

Now we have information.


IMPACT ANALYSIS

MITIGATION ANALYSIS

Data

Flavours

Data come in a variety of flavours…

FIRST, REMEMBER THAT DATA ARE PLURAL!

They come in groups called variables.

They can be categorical or quantitative

They can be continuous or discrete

They can be primary or secondary

They can be temporal, thematic (or attribute) or spatial – or all three at the same time.

They can be nominal, ordinal, interval, ratio

Variables are groups of data that vary in value, such as:

Population, birth, death counts.

Type of restaurants, political party, religious affiliations.

Temperature, precipitation, wind speed.

Tour de France average speeds, number of stage wins.

Bird beak and seed sizes, e. coli counts.

They can be based on raw data, such as:

Population counts.

Birth counts.

Death counts.

Area in square kilometres.

Or they can be derived, such as:

Birth or death rate per 1,000 population.

Population density in persons per sq.km.

Quantitative variables collect numeric data values, usually for individuals or samples, that answer quantitative questions about the variable: e.g. how much, how many, how often etc.

For example, each of your individual best-six high school grade averages.

Categorical variables either group or collect data values by categories and answer questions about the categories themselves: e.g. which grade category had the highest number of students.

For example, each of your high school best-six average grades as belonging to one of the categories 70-74.99, 75-79.99, 80-84.99, etc.
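The difference between the two kinds of variable can be sketched in a few lines: the same grades are first kept as numbers (quantitative) and then sorted into 5-point bands (categorical). The grades are invented, the bands are assumed to be 5 points wide starting at 70, and the function name `grade_category` is made up for the illustration.

```python
from collections import Counter

# Hypothetical best-six averages, invented for illustration.
grades = [71.5, 78.2, 83.0, 76.4, 88.9, 74.99, 80.0, 79.99, 92.3]

def grade_category(grade, width=5, lowest=70):
    """Return the label of the 5-point band a grade falls into, e.g. '75-79.99'."""
    start = lowest + width * int((grade - lowest) // width)
    return f"{start}-{start + width - 0.01:.2f}"

# Quantitative question: how much? (the average grade)
print("average grade:", round(sum(grades) / len(grades), 2))

# Categorical question: which category had the most students?
counts = Counter(grade_category(g) for g in grades)
for category, count in sorted(counts.items()):
    print(category, count)
print("most common category:", counts.most_common(1)[0][0])
```

Once the grades are stored only as categories, the individual values can no longer be recovered from them.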

A discrete variable is one that comes in individual indivisible packets, for example people.

A continuous variable is one that can be continuously divided up into smaller packets, for example temperature.

Some variables can be either. For example water in bottles (discrete) or water in an ocean (continuous).

Occasionally discrete variables are used in a continuous form. For example:

Population 2013 = 35,158,304 (discrete value)

Births in Canada 2013 = 383,822 (discrete value)

Birth rate: 10.9/1,000 (continuous value)

(given by (births/population)*1,000)
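The calculation behind the birth rate is short enough to show in full. This sketch uses the two counts quoted above; the variable names are made up.

```python
population_2013 = 35_158_304   # discrete count
births_2013 = 383_822          # discrete count

# Derived, continuous value: births per 1,000 population.
birth_rate = births_2013 / population_2013 * 1_000
print(f"{birth_rate:.1f} births per 1,000 population")
```

Two whole-number counts go in, and a value with decimals comes out.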

Primary data are data collected directly in relation to a problem being investigated.

e.g. counts of physical property losses in the Peterborough floods, surveys of attitudes.

Secondary data are any other data, collected in general, that might have some bearing on the problem at hand.

e.g. census data at the D.A. level on areas flooded, historical river gauge data.

The main distinction between the two types of data is that secondary data are almost always available prior to the problem definition for which you are collecting the primary data.

The temporal dimension refers to the times or periodicities over which you collect your data.

e.g. monthly? annual?

The thematic or attribute dimension is a characterisation of the data you are collecting.

e.g. population? Population change? Population density? Incomes?

The spatial dimension of data refers to the location attributes of a piece of data.

e.g. latitude, longitude, UTM grid, elevation or a street address for a house or business.

Scales of Measurement

High Level Data

Ratio

Interval

Ordinal

Nominal

Low Level Data

You can aggregate higher level data to lower level

but you cannot disaggregate lower level data to higher level.

Type of data determines what statistics you can use with the data.
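A small sketch of what moving down the scale looks like, using the temperature values from the ratio-level example in these slides. Each step throws information away, which is why the trip cannot be reversed. The 0 °C cut-off for "warm" versus "cold" is an arbitrary choice made for the illustration.

```python
# Ratio-level temperatures in kelvins.
kelvin = [278.15, 273.15, 268.15]

# Ratio -> interval: shift the zero point to get Celsius.
celsius = [round(k - 273.15, 2) for k in kelvin]

# Interval -> ordinal: keep only the ranking (1 = warmest).
ranked = sorted(celsius, reverse=True)
ranks = [ranked.index(c) + 1 for c in celsius]

# Ordinal -> nominal: keep only a label (cut-off at 0 °C is arbitrary).
labels = ["warm" if c > 0 else "cold" for c in celsius]

print(celsius, ranks, labels)
```

Given only the labels, there is no way to work back to the ranks, and given only the ranks, no way to work back to the temperatures.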


Scales of Measurement

Figuring out types of data is tricky, but it is important because the type of data you have governs the type of stats you can do.

The scale above is a common one, but some agencies, Statistics Canada among them, combine interval and ratio into one category called “numeric” or “quantitative”.

Also, some of the rules can be and are frequently ‘bent’ depending on how rigorous you want to be.

For example, interval level data cannot be arithmetically divided but calculating average Celsius temperature (an interval level variable) requires a division of the sum by ‘n’ as you know.

Scales of Measurement – Nominal Level Data

- Simplest form of data but can’t do much with it.
- Data values are not arithmetic – cannot +, -, *, /.
- Data values are labels (or names, thus “nominal”) – e.g. restaurants by type, political party affiliation.
- Very few statistical tools can be used – e.g. counts, maybe mode, thematic maps by type, frequencies.
- Comparisons are qualitative only - no scaling factor is implied - i.e. there is no “better than” or “worse than”. But can have a quantitative value attached…
- Temperature Example: Warm and Cold

Scales of Measurement – Ordinal Level Data

- Next level of data – can do quite a bit more.
- Data values are not truly arithmetic – cannot +, -, *, /.
- Data values are ranked or ‘ordered’ (thus “ordinal”) e.g. 1st, 2nd, 3rd.
- Some statistical tools can be used (mode, median, frequency), some specially designed for ordinal data called non-parametric statistics.
- Comparisons are qualitative but can be ordered or ranked according to a rudimentary scaling factor such as “better than” or “worse than” so forms the basis of semantic and Likert scales:
- Bad 1 2 3 4 5 Good
- where scores are “ordered” on a judgement scale, but ‘2’ is not twice as good as ‘1’.
- Temperature Example: Cold, Colder, Coldest
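As an illustration of the tools that do suit ordinal data, the sketch below takes some invented responses on the 1 (Bad) to 5 (Good) scale and reports the frequency of each score, the mode and the median.

```python
import statistics

# Hypothetical responses on the 1 (Bad) to 5 (Good) scale.
responses = [4, 2, 5, 4, 3, 4, 1, 5, 2]

frequencies = {score: responses.count(score) for score in sorted(set(responses))}

print("frequencies:", frequencies)
print("mode:       ", statistics.mode(responses))
print("median:     ", statistics.median(responses))
```

The mean is left out on purpose: it would treat the gap between 1 and 2 as equal to the gap between 4 and 5, which an ordinal scale does not promise.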

Scales of Measurement – Interval Level Data

- Minimum level of data required for parametric stats.
- Data values allow some arithmetic comparisons: +, -, sometimes *, /.
- Data values have consistent arithmetic intervals between them but do not have an absolute zero base starting point.
- Most statistical tools can be used but must be careful.
- Arithmetic comparisons can be made: 10 is twice 5, and 20 is 2*10 and 4*5. But 10°C is not twice as warm as 5°C.
- Temperature Example: 5°C, 0°C, -5°C
- NOTE: There are minus Celsius temperatures because there is no absolute base starting point.

Scales of Measurement – Ratio Level Data

- Highest level of data, data values allow all arithmetic operations (+, -, *, /)
- Data values have consistent arithmetic intervals between them and an absolute zero base starting point.
- All statistical tools and mathematical operations can be used.
- All arithmetic comparisons can be made: 10 is twice 5, etc. and two hundred degrees Kelvin is twice as ‘warm’ as one hundred degrees Kelvin, as measured by molecular motion (i.e. kinetic energy potential).
- Temperature Example: 278.15 K, 273.15 K, 268.15 K
- 0°C = 273.15 K and 0 K = minus 273.15°C
- NOTE: There are no minus Kelvin temperatures because there is an absolute base starting point.
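The difference between interval and ratio data shows up as soon as you divide. This sketch compares 10°C with 5°C twice: once using the Celsius numbers as they are, and once after converting to kelvins, where zero really means zero.

```python
def celsius_to_kelvin(c):
    """Shift a Celsius temperature onto the Kelvin scale."""
    return c + 273.15

warm_c, cool_c = 10.0, 5.0

naive_ratio = warm_c / cool_c                                        # interval data: not meaningful
true_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)   # ratio data: meaningful

print(f"Celsius ratio: {naive_ratio:.2f}")
print(f"Kelvin ratio:  {true_ratio:.3f}")
```

On the Kelvin scale the warmer temperature is only about 2% higher, not twice as high.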

Data Formats

These refer to what operations can be performed on data and Excel’s are:

General:

Lets Excel decide – probably not good.

Numeric:

The main format for quantitative data. Allows you to set the number of decimal places, and hence whether values display as integers.

Accounting, Currency:

Interprets cell values as dollars (or other currencies).

Date, Time:

Interprets cell values as date or time format.

Percentage:

Calculates cell values as percentages.

Text:

Written text, sometimes called a string variable (e.g. in SPSS).

READ YOURSELF FOR TESTS

- Accuracy refers to whether your data are wrong or right.
- Precision refers to whether your data measure the phenomena you are trying to measure at a refined enough spatial or temporal scale to be useful.
- Fidelity refers to the “truthfulness” or “trustworthiness” of your data.
- Data redundancy refers to the repetition of data.
- Data integrity refers to the problems associated with keeping all of the above errors out of your database. Integrity checks can be divided into four groups: type checks, redundancy checks, range checks, and comparison checks.

- Type checks ensure that the data values in a column of data are of the correct type for that variable.
- Redundancy checks ensure that you have as little repeat data as possible in your dataset, or even the potential for repeat data.
- Range checks ensure that a data value falls within a specified range of values. For example if you have an age variable, the range check on that variable will ensure that no values put in are < 0 or > 120.
- Comparison checks cover a number of things. For example, the check could be performed on the salaries of employees to ensure that the maximum salary for one group of people does not exceed the minimum salary for another group of people in the next higher salary group.
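The four kinds of integrity check can be sketched as four small functions. The records, field names and group names below are all invented for the illustration; a real database would have its own rules.

```python
# Hypothetical records, invented for illustration.
records = [
    {"id": 1, "age": 34, "salary_group": "junior", "salary": 48_000},
    {"id": 2, "age": 151, "salary_group": "junior", "salary": 52_000},
    {"id": 3, "age": "forty", "salary_group": "senior", "salary": 50_000},
    {"id": 3, "age": "forty", "salary_group": "senior", "salary": 50_000},
]

def type_errors(rows):
    """Type check: age must be a whole number."""
    return [r["id"] for r in rows if not isinstance(r["age"], int)]

def range_errors(rows, low=0, high=120):
    """Range check: numeric ages must fall between low and high."""
    return [r["id"] for r in rows
            if isinstance(r["age"], int) and not low <= r["age"] <= high]

def duplicate_ids(rows):
    """Redundancy check: ids that appear more than once."""
    seen, repeats = set(), set()
    for r in rows:
        if r["id"] in seen:
            repeats.add(r["id"])
        seen.add(r["id"])
    return sorted(repeats)

def salary_overlap(rows, lower="junior", higher="senior"):
    """Comparison check: True if the lower group's maximum exceeds the higher group's minimum."""
    top_of_lower = max(r["salary"] for r in rows if r["salary_group"] == lower)
    bottom_of_higher = min(r["salary"] for r in rows if r["salary_group"] == higher)
    return top_of_lower > bottom_of_higher

print("type errors:   ", type_errors(records))
print("range errors:  ", range_errors(records))
print("duplicate ids: ", duplicate_ids(records))
print("salary overlap:", salary_overlap(records))
```

Each function flags a problem for a person to look at; none of them decides what the correct value should be.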

- Sampling error arises because the data you collect are for a sample of the population rather than all of the population.
- Relates to the paradox of sampling, which states that in order to know whether your sample represents your population on a given variable, you need to know the population value of the variable and hence don’t need the sample!
- Sampling error is a quantification of the potential that your sample value might be (will be) wrong and by how much.
- For example, I would be able to say that I am 95% sure that Ryerson’s entering average was 78% ± 5%, based on this class as a sample of Ryerson – or all Ontario entering students.