
70-208: Regression - PowerPoint PPT Presentation



70-208: Regression. Lecture 1: Introduction to Regression Analysis. Spring 2014. John Gasper. Welcome. What is Regression? Why should we care? What can we do with it? How much do sales increase with every advertisement placed? How do wages of employees depend on education?


70-208: Regression

Lecture 1: Introduction to Regression Analysis

Spring 2014

John Gasper


Welcome

  • What is Regression? Why should we care? What can we do with it?

    • How much do sales increase with every advertisement placed?

    • How do wages of employees depend on education?

    • How will the price of a stock change?

    • Estimating demand (optimal pricing)

    • Estimating effects and Prediction/Forecasting


Welcome

Teaching staff:

  • Who am I?

  • Who are you? Stop by my office.

    • Office hours: Mon / Wed: 1-2pm and 4:30-5:30pm

      • And by appointment

  • Teaching Assistants for the course:

    • TA: Adriana Lopez ([email protected] ) (CMUQ 1171)

      • Office hours: by appointment only

    • Undergrad Course Assistants: office hours TBA

      • Syed Tanveer Haider, Akhmed Sungurov, Flavio Fenley, Noor-Ul-Huda Admaney, and Tanzeel Huda


Course Details

  • Textbooks:

    • Statistics for Business (main text; you should have it)

    • Next Generation Excel (supplemental text – on reserve in Library)

  • Attendance and participation

    • Required. Clickers – bring them to every class.

    • Blackboard + Piazza discussion site

  • Cell phones and laptops

    • Turn off your phones.

    • Computers OK for taking notes and working through data. NOT OK to check news, Facebook, Twitter, YouTube…

      • Seriously. If I or a TA sees you, odds are that I’ll ask you to leave. It’s disrespectful to me and other students.


Course Details

  • Grades: (aka what you stress over but shouldn’t)

    • How do you get a good grade in this class?

    • The only way to learn the material is to do it.

    • Homework Exercises = 7%

      • Problem Sets graded on Check System.

    • Lab Quizzes (x5) = 4% each (20%)

      • Attend 95% of classes and only your best 4 of 5 are scored.

    • Midterm Exams (x3) = 15% each (45%)

    • Final Exam = 25%

    • Participation = 3%

      • Attendance (clickers) + Discussion site (Piazza).

  • Academic Integrity


Course Details

  • Warning: There is a lot of material in the course and we’ll move quickly.

  • Any questions?


Review

Data: what is it?

  • Types of measurements: nominal, ordinal, interval, and ratio

  • Categorical data

    • Measures of Centrality: median, mode

  • Numerical

    • Measures of Centrality: median, mode, mean

    • Measures of Spread: variance/standard dev, range, interquartile range, etc
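The measures of centrality and spread listed above can be computed in a few lines. This is a minimal sketch using Python's standard `statistics` module; the `scores` list is invented purely for illustration and does not come from the course data.

```python
import statistics

# Hypothetical exam scores, invented for illustration only.
scores = [55, 62, 68, 70, 70, 74, 79, 85, 91, 96]

# Measures of centrality
mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)

# Measures of spread
variance = statistics.variance(scores)   # sample variance (n - 1 denominator)
std_dev = statistics.stdev(scores)       # square root of the sample variance
data_range = max(scores) - min(scores)
q1, q2, q3 = statistics.quantiles(scores, n=4)  # quartile cut points
iqr = q3 - q1                            # interquartile range

print(mean, median, mode, data_range, iqr)
```

Note that `statistics.quantiles` uses one of several common quartile conventions, so the interquartile range may differ slightly from what Excel reports.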


Review: Describing Data

  • There are many ways to describe and examine data, and at a basic level that is what we’ll be doing in this class.

  • You should be familiar with:

    • Categorical

      • 1 variable: bar charts, pie charts, etc.

      • 2 variables: Contingency tables (x-tabs); Chi-sq tests

    • Numerical

      • 1 variable: histograms, boxplots, cumulative distribution

      • 2 variables: scatterplots, correlation, t-test, etc…


Common Plots

  • Histogram, PDF and CDF of exam scores:

  • Scatterplot of Exam 1 and Exam 2:

    (different class)



Review: Graphical summaries

[Figure: boxplot and histogram of the same data]


Review: Graphical summaries

Center: Median?

  • 3.5


Review: Graphical summaries

Inter quartile range?

  • First to third quartile


Review: Graphical summaries

Center: Mean?

  • 3.8 Why?


Review: Graphical summaries

Center: Mean?

  • The mean is greater than the median here because the data are slightly right-skewed (mean 3.8 vs. median 3.5).
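A small numerical sketch of why skew pulls the mean above the median. The values below are hypothetical, chosen only so that they reproduce a mean of 3.8 and a median of 3.5; they are not the data behind the slide's plot.

```python
import statistics

# Hypothetical data with a longer right tail (one large value, 7).
data = [2, 3, 3, 3, 3, 4, 4, 4, 5, 7]

mean = statistics.mean(data)      # pulled upward by the large value
median = statistics.median(data)  # depends only on the middle of the ordering

# Replacing the largest value with something even larger moves the mean
# but leaves the median untouched.
stretched = data[:-1] + [17]
mean_stretched = statistics.mean(stretched)
median_stretched = statistics.median(stretched)

print(mean, median, mean_stretched, median_stretched)
```

The median is robust to the tail while the mean is not, which is why comparing the two gives a quick read on skew.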


Excel review

  • There will be a review session this Wednesday / Thursday from 12:15-1:15

    • Adriana will remind you how to use Excel to generate these graphical displays and quantitative summaries.

      • 9am class (section W): Wednesday 12:15 - 1:15

      • 10:30am class (section X): Thursday 12:15 - 1:15

      • Computer cluster 1185

    • Come on time. If you’re late, you’ll be asked to leave.

    • You can pick up your clickers at the review session

    • While I can’t require you to go, I would strongly recommend it. I won’t be slowing down to go over this stuff again.

    • Homework 1 (distributed today) is a review – you should have seen it before and I won’t cover it during class.

      • Due 1 week from today


Probability Review

  • What does ‘P(heads) = .5’ mean?

  • What about ‘P(“Alice will get an A in Regression”) = .75’?

    • Frequentist vs Bayesian interpretations. Differences don’t matter for this class and I’ll use language from both.

  • Basic properties:

    • 0 ≤ P(A) ≤ 1

    • P(A) = 1 – P(A^c), where A^c is the complement of A (“not A”)

    • P(A or B) = P(A) + P(B) – P(A and B)

    • Events A and B are independent if the occurrence of one doesn’t tell you anything about the occurrence of the other.
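The basic properties above can be checked by brute-force counting on a small sample space. This is a minimal sketch assuming one roll of a fair six-sided die; the events A and B are illustrative choices, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die (an illustrative assumption).
OUTCOMES = frozenset(range(1, 7))

def prob(event):
    """Probability of an event, given as a set of equally likely outcomes."""
    return Fraction(len(event & OUTCOMES), len(OUTCOMES))

A = {2, 4, 6}                 # roll is even
B = {1, 2}                    # roll is at most 2
A_complement = OUTCOMES - A   # roll is odd

p_a, p_b = prob(A), prob(B)
p_a_and_b = prob(A & B)
p_a_or_b = prob(A | B)

complement_rule_holds = p_a == 1 - prob(A_complement)
addition_rule_holds = p_a_or_b == p_a + p_b - p_a_and_b
# For independent events the joint probability factors: P(A and B) = P(A) * P(B).
independent = p_a_and_b == p_a * p_b

print(p_a, p_b, p_a_and_b, p_a_or_b, independent)
```

Here knowing the roll is at most 2 leaves the chance of an even roll at 1/2, so these two events happen to be independent.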


Conditional Probability

  • P(A and B) is often called the “joint probability”

  • P(A) is the “marginal probability”

    • P(A and B) + P(A and ~B) = P(A)

  • The conditional probability

    • P(A|B) = P(A and B) / P(B)

    • P(A|B) is very different than P(B|A).
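To see how different P(A|B) and P(B|A) can be, here is a small sketch on an assumed fair six-sided die, with A = "roll is 6" and B = "roll is even". The helper names `prob` and `cond_prob` are hypothetical, introduced only for this example.

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die (an illustrative assumption).
OUTCOMES = frozenset(range(1, 7))

def prob(event):
    """Probability of an event, given as a set of equally likely outcomes."""
    return Fraction(len(event & OUTCOMES), len(OUTCOMES))

def cond_prob(a, b):
    """P(a | b) = P(a and b) / P(b); requires P(b) > 0."""
    return prob(a & b) / prob(b)

A = {6}          # roll is 6
B = {2, 4, 6}    # roll is even

p_a_given_b = cond_prob(A, B)   # chance of a 6 once we know the roll is even
p_b_given_a = cond_prob(B, A)   # chance the roll is even once we know it is 6

# Marginal from joints: P(A and B) + P(A and ~B) = P(A)
marginal_a = prob(A & B) + prob(A & (OUTCOMES - B))

print(p_a_given_b, p_b_given_a, marginal_a)
```

Given an even roll, a 6 has probability 1/3; given a 6, an even roll is certain.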


More review: Normal Distribution

What is the Normal distribution?

  • Often called the “Bell Shaped Curve.”

  • This isn’t quite right. It is bell shaped, but there are many bell shaped distributions that aren’t the Normal dist.

    Normal, or Gaussian, distributions are going to be very important for us.

  • Often we’ll need to assume that a random variable X is Normally distributed, denoted X ~ N(μ, σ²)


Normal Distribution

Different μ

Different σ


Random Variables

  • Random doesn’t mean haphazard. Consider an uncertain investment: X

    • X could lose 1000 (with probability = .3)

    • X could gain 10000 (with probability = .2)

    • X could gain 100 (with probability = .5)

  • X is a Random Variable. What is the expectation of X?

    • E(X) = p(x₁)x₁ + p(x₂)x₂ + … + p(xₙ)xₙ

    • E(X) = 0.5*100 + 0.2*10000 + 0.3*(-1000) = 1750 = μ

  • Variance of X?

    • Var(X) = E[(X – μ)²] = σ²

    • = (x₁ – μ)² p(x₁) + (x₂ – μ)² p(x₂) + … + (xₙ – μ)² p(xₙ)

    • = (100 – 1750)² * 0.5 + (10000 – 1750)² * 0.2 + (-1000 – 1750)² * 0.3

  • And higher-order moments: skew, kurtosis, etc.

  • Regression is basically about Conditional Expectation: E(Y|X)

    • I.e., what do we expect about Y given we have some information X
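The expectation and variance of the investment X can be worked through directly from the definitions. This is a minimal sketch using the payoffs and probabilities from the slide; the variable names are my own. The variance comes out to 17,242,500, so the standard deviation is roughly 4,152.

```python
import math

# (payoff, probability) pairs for the uncertain investment X from the slide.
outcomes = [(-1000, 0.3), (10000, 0.2), (100, 0.5)]

# The probabilities of all possible outcomes must sum to 1.
total_probability = sum(p for _, p in outcomes)

# E(X): probability-weighted average of the payoffs.
mu = sum(p * x for x, p in outcomes)

# Var(X): probability-weighted average squared distance from the mean.
variance = sum(p * (x - mu) ** 2 for x, p in outcomes)
sigma = math.sqrt(variance)

print(f"E(X) = {mu:.0f}, Var(X) = {variance:.0f}, SD(X) = {sigma:.1f}")
```

The standard deviation is more than twice the expected gain, which is one way of quantifying how uncertain this investment is.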


More on the Normal Dist

Normality

  • Why assume Normality? The Central Limit Theorem tells us that we’re often OK:

    • The probability distribution of a mean (or sum) of IID random variables tends to a Normal distribution (asymptotically)

    • Several versions of the CLT but we won’t go through the proofs here (they can be a little nasty)

    • So why are we OK?

      • Observed data are often (not always) the accumulation of many small factors (e.g., the value of the stock market depends on many investors, or scores on an exam)
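A quick simulation can make the Central Limit Theorem concrete. This sketch assumes draws from an exponential distribution with mean 1 and standard deviation 1 (a clearly non-Normal, skewed choice picked for illustration), and checks that the means of samples of size n cluster around 1 with spread close to 1/√n.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n IID draws from an exponential distribution with mean 1."""
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

n = 50                 # size of each sample
replications = 2000    # number of sample means to simulate
means = [sample_mean(n) for _ in range(replications)]

center = statistics.mean(means)    # should be near the population mean, 1
spread = statistics.stdev(means)   # should be near sigma / sqrt(n)
predicted_spread = 1.0 / math.sqrt(n)

print(f"center = {center:.3f}, spread = {spread:.3f}, predicted = {predicted_spread:.3f}")
```

A histogram of `means` would look roughly bell shaped even though the individual draws are strongly skewed.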


Quantile Plots

  • A visual check on Normality

    • Why wouldn’t just looking at the density or histogram work?

      • Sometimes skew, kurtosis, etc, is easy to see but often it is not unless you look at a quantile plot

If data track the diagonal line, you can safely assume it’s a Normal distribution.
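The points on a Normal quantile plot can be computed by hand, which shows what "tracking the diagonal" means. This is a rough sketch, not the procedure used to make the slide's plot: the plotting positions (i + 0.5)/n and the use of a correlation coefficient as a summary of straightness are my assumptions, and the helper names are hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev

def qq_points(data):
    """Pair each sorted observation with the Normal quantile it should match."""
    xs = sorted(data)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    # Plotting positions (i + 0.5) / n avoid probabilities of exactly 0 or 1.
    return [(fitted.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(xs)]

def pearson_r(points):
    """Correlation between theoretical and observed quantiles."""
    mt = mean(t for t, _ in points)
    mo = mean(o for _, o in points)
    cov = sum((t - mt) * (o - mo) for t, o in points)
    var_t = sum((t - mt) ** 2 for t, _ in points)
    var_o = sum((o - mo) ** 2 for _, o in points)
    return cov / (var_t * var_o) ** 0.5

random.seed(1)
normal_data = [random.gauss(0, 1) for _ in range(500)]
skewed_data = [random.expovariate(1.0) for _ in range(500)]

r_normal = pearson_r(qq_points(normal_data))   # points hug the diagonal
r_skewed = pearson_r(qq_points(skewed_data))   # points bend away from it

print(f"r (Normal data) = {r_normal:.4f}, r (skewed data) = {r_skewed:.4f}")
```

Plotting the pairs from `qq_points` as a scatterplot gives the quantile plot itself; the skewed sample curves away from the line in the tails.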


Standardizing a Variable: z-scores

What is a z-score?

  • Transforms a variable to standard deviation units away from the mean. Centered at 0.

  • Why would we use it?
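Standardizing can be sketched in a few lines: subtract the mean and divide by the standard deviation, z = (x – μ) / σ. The `heights_cm` values are invented for illustration, and the example uses the population standard deviation (`pstdev`) as an assumption; with sample data one might use `stdev` instead.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Convert each value to standard-deviation units away from the mean."""
    mu = mean(values)
    sigma = pstdev(values)   # population standard deviation
    return [(v - mu) / sigma for v in values]

# Hypothetical measurements, invented for illustration only.
heights_cm = [150, 160, 170, 180, 190]
z = z_scores(heights_cm)

print([round(value, 3) for value in z])
```

After the transformation the values have mean 0 and standard deviation 1, so variables measured in different units become directly comparable.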


Probabilities and Percentiles

  • What is P(X = 600)?

  • What is P(X >= 600)?
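The transcript does not say what distribution X has on this slide, so the sketch below assumes, purely for illustration, that X ~ N(μ = 500, σ = 100). Under any continuous distribution the probability of hitting one exact value is 0, while the probability of a range comes from the cumulative distribution.

```python
from statistics import NormalDist

# Hypothetical parameters: the slide's actual mean and SD are not given.
X = NormalDist(mu=500, sigma=100)

# For a continuous variable, a single point carries no probability mass.
# The density X.pdf(600) is positive, but it is a height, not a probability.
p_exactly_600 = X.cdf(600) - X.cdf(600)

# P(X >= 600) is the area to the right of 600: one minus the cumulative area.
p_at_least_600 = 1 - X.cdf(600)

print(p_exactly_600, round(p_at_least_600, 4))
```

With these assumed parameters 600 sits one standard deviation above the mean, so about 16% of the distribution lies at or above it.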


Percentiles

  • The lifetime (in km) of a certain brand of automobile tire is a normally distributed random variable:

  • X ~ N(μ=40,000 km, σ=2000 km)

  • In a shipment of 3000 tires, how many tires are expected to have a lifetime that is less than 35,000 km?

  • E(# of tires) = P(X < 35000) * 3000

  • So how do we calculate P( X < 35000)?

    • Z-scores. Or very easy in Excel: NORM.DIST()

      • norm.dist(x, μ, σ, Cumulative?)

      • norm.dist(35000, 40000, 2000, TRUE) = .0062

      • E(# tires) = .0062 * 3000 = 18.6 ≈ 19
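The same tire calculation can be reproduced outside Excel. This sketch uses `statistics.NormalDist` from Python's standard library as a stand-in for `NORM.DIST(x, μ, σ, TRUE)`, and also shows the z-score route; the variable names are my own.

```python
from statistics import NormalDist

mu, sigma = 40_000, 2_000     # tire lifetime in km, from the slide
threshold = 35_000
shipment_size = 3_000

# Direct route: cumulative probability under N(mu, sigma).
p_below = NormalDist(mu, sigma).cdf(threshold)

# z-score route: standardize, then use the standard Normal N(0, 1).
z = (threshold - mu) / sigma
p_below_via_z = NormalDist().cdf(z)

expected_tires = p_below * shipment_size

print(f"z = {z}, P(X < 35000) = {p_below:.4f}, expected tires = {expected_tires:.1f}")
```

Both routes give a probability of about .0062, matching the Excel result, and an expected count of about 18.6 tires.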


Next time

  • If any of the topics today seem hazy, review those chapters (take note of chapters 4, 12, and 15).

  • Problem Set 1 due next Monday 9am.

  • First quiz next Wednesday

  • Pick up your clicker this week

    • In the Excel review (9am section: Wednesday; 10:30am section: Thursday)

    • Sunday 10:30-11 in Adriana’s office.

    • Must have it by next Monday’s class

  • Excel review on Wednesday and Thursday (depending on section)

    • Don’t come late – it’s distracting – we’re starting at 12:15 to give you a 25min break for lunch.

    • You’ll be asked to leave if you’re late. Again, it’s very distracting for the students who came on time.

