Introduction

Malathi Veeraraghavan, Professor
Charles L. Brown Dept. of Electrical and Computer Engineering
University of Virginia

Outline
  • Increasing interest in data
  • Course: From Data to Knowledge
  • Summary
“The data deluge” “Data, data everywhere”
  • Economist Special Issue Feb 27-Mar. 5, 2010
  • Walmart databases alone are estimated at more than 2.5 petabytes (a petabyte is 1 million gigabytes): 2010 numbers
  • From businesses to governments, data collection and analysis is rapidly becoming the next big thing.
  • 2012:
“The data deluge”
  • “A new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data.”
  • Hal Varian, Google’s chief economist, notes that “Data are widely available; what is scarce is the ability to extract wisdom from them.”
Business intelligence
  • Nestlé sells more than 100,000 products in 200 countries, using 550,000 suppliers
  • Problem: not using its huge buying power effectively
  • Used SAP software and analyzed its data
  • For just one ingredient, vanilla, its American operation reduced the number of specifications and used fewer suppliers, saving $30M per year
  • Annual savings from such operational improvements: $1 billion

Economist special issue

Medical use
  • Dr. Carolyn McGregor of the University of Ontario Institute of Technology
  • Goal: spot fatal infections in premature babies
  • Monitors subtle changes in 7 streams of real-time data, such as heart rate, blood pressure, etc.
  • ECG alone takes 1000 readings/second
  • Infections are detected before obvious symptoms emerge
  • Naked eye cannot see it, but the computer can!
  • Who programs these? Stats experts.
  • Another term: Evidence Based Medicine

Economist special issue

Government usage
  • An add-on to a 1986 law required firms to disclose the harmful chemicals they release.
  • Once the public started tracking these numbers, American businesses reduced their emissions of the chemicals covered by the law by 40% by 2000

Economist special issue

Best sellers
  • “Super Crunchers: Why Thinking-by-Numbers Is the New Way to Be Smart” by Ian Ayres
  • “Moneyball: The Art of Winning an Unfair Game” by Michael Lewis
  • “The Long Tail” by Chris Anderson
  • Malcolm Gladwell books - Outliers
  • Microtrends – Mark Penn (elections)
  • Freakonomics – S. Dubner and S. Levitt
Moneyball example
  • 2002 season: the richest team, the NY Yankees, had a payroll of $126 million, while the Oakland A’s had less than a third of that, about $40 million. Yet the A’s had reached the playoffs three years in a row and took the Yankees close to elimination. How did they do it?
  • Billy Beane, general manager of Oakland A’s
    • Respected statistics
    • Hired Paul DePodesta, a Harvard graduate, who applied Bill James’ formulas and selected players based on their statistics.
    • Runs created = (Hits + Walks) × Total Bases / (At Bats + Walks)
    • Jeremy Brown – only player in the history of the SEC with 300 hits and 200 walks, but he was overweight
    • Scouts vs. statisticians!
  • Everyone tends to generalize wildly from their own experience; most people think their own experience is typical!
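
The runs-created formula on this slide can be sketched in Python as follows. This is a minimal illustration only: the function name `runs_created` and the season line used in the example are hypothetical and do not come from the slides or from any real player.

```python
def runs_created(hits: int, walks: int, total_bases: int, at_bats: int) -> float:
    """Basic runs-created estimate: (H + BB) * TB / (AB + BB)."""
    opportunities = at_bats + walks
    if opportunities <= 0:
        raise ValueError("at_bats + walks must be positive")
    return (hits + walks) * total_bases / opportunities


# Hypothetical season line, made up for illustration (not real data).
example_rc = runs_created(hits=150, walks=60, total_bases=250, at_bats=500)
print(f"Estimated runs created: {example_rc:.2f}")
```

Walks appear in both the numerator and the denominator, so a player who walks often is credited for getting on base even without a hit, which is the kind of value the statisticians looked for.
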
Malcolm Gladwell’s “Outliers”: hockey players story
  • Why Canadian hockey players born early in the year have a big advantage; cutoff date was Jan. 1
  • ESPN conducted a little study: All the 2008 season NHL players who were born from 1980 to 1990. [Later disputed for 2011 players]
  • Sure enough: Many more were born early in the year than late.

Examples from “The Long Tail”
  • Rhapsody, an online music store that had 1.5M tracks in Dec. 2005, reported that even its 100,000th most popular track was downloaded thousands of times per month, whereas a Walmart store, the largest brick-and-mortar music retailer, stocks only 55,000 tracks.
  • Rhapsody reports that 40% of its total sales came from the Long Tail products, i.e., those not available in retail stores.
  • Anderson gives several such examples, calling these businesses Long-Tail aggregators
    • Google as the long-tail aggregator of advertising
    • eBay of goods
    • Amazon of books
    • Apple of music
    • Netflix of movies
Experts vs. intuition
  • Ian Ayres’ book
    • “The future belongs to people like Wolfers who are comfortable with both intuition and numbers”
    • Wolfers analyzed 44,000 college basketball games (> 16 years)
  • Also see Jonah Lehrer’s “How We Decide” – another bestseller

Ian Ayres’ book, page 220

What Wolfers did
  • Plot density function of number of games that beat the Las Vegas spread
    • Perfect normal bell curve!
  • Just look at games with point spreads less than or equal to 12
    • Perfect normal bell curve
  • Look at games with point spread > 12
    • 47% chance that the favored team beat the spread (53% failed to cover the spread)
    • more than 20% of games fell in this category of games with >12 spreads
    • Is it point shaving?
  • Look at the score five minutes before the end of the game – right on track to beat the spread 50% of the time!
    • Indeed a stronger case for point shaving

Ian Ayres’ book, page 216
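
One way to check whether the 47% figure could be chance is to compare it with the spread expected from a fair coin. The sketch below uses only the numbers on the slides (44,000 games, 47% covering), plus one labeled assumption: the slide says "more than 20%" of games had spreads above 12, and the code takes exactly 20% as a conservative lower bound on the sample size.

```python
import math


def sd_of_proportion(p: float, n: int) -> float:
    """Standard deviation of a sample proportion when the true rate is p."""
    return math.sqrt(p * (1.0 - p) / n)


total_games = 44_000        # from the slide
share_big_spread = 0.20     # ASSUMPTION: slide says "more than 20%"; 20% is a lower bound
n_big = round(total_games * share_big_spread)

observed_cover_rate = 0.47  # from the slide
fair_rate = 0.50            # what a fair, unbiased spread would give

sd = sd_of_proportion(fair_rate, n_big)
z = (observed_cover_rate - fair_rate) / sd

print(f"Games with spread > 12 (assumed): {n_big}")
print(f"SD of the cover rate under a fair spread: {sd:.4f}")
print(f"Observed rate is {abs(z):.1f} SDs below 50%")
```

Under these assumptions the observed rate lies well beyond two standard deviations from 50%, so sampling noise alone is an unlikely explanation. That does not by itself establish point shaving; it only shows the gap is statistically significant.
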

2SD Rule: To understand variability
  • There is a 95% chance that a normally distributed variable will fall within two standard deviations (plus or minus) of its mean
  • Statistical significance – simple intuitive concept – there is less than 5% chance that a random variable will be more than two standard deviations away from the mean.
  • Stanford Law school students knew that professors were required to give a 3.2 mean. They wanted to know if the professor was a “spreader” or a “clumper”!

Ian Ayres’ book, page 221
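
The 95% figure in the 2SD rule can be checked numerically. This is a small sketch using only the Python standard library; the helper names `normal_cdf` and `prob_within` are made up for this example.

```python
import math


def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_within(k: float) -> float:
    """Probability that a normal variable falls within k SDs of its mean."""
    return normal_cdf(k) - normal_cdf(-k)


within_two_sd = prob_within(2.0)
print(f"P(within 2 SD)  = {within_two_sd:.4f}")
print(f"P(outside 2 SD) = {1.0 - within_two_sd:.4f}")
```

The exact probability within two standard deviations is slightly above 95%; the exact 95% interval corresponds to about 1.96 standard deviations, which is why "two" is used as a convenient rule of thumb.
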

“Margin of error”
  • News article says “Laverne is leading Shirley 51% to 49% with a margin of error of 2%” and so the race is a “statistical dead heat.”
  • Ayres declares this “balderdash!” Why?
  • Margin of error = 2SD
  • So standard deviation is 1%
  • This means there is an 84% chance that Laverne leads in the polls (i.e., has more than 50% of the vote)

Ian Ayres’ book, page 224
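
The 84% figure follows from the steps on this slide: a 2% margin of error means a 1% standard deviation, so 50% sits one standard deviation below Laverne's 51%. A minimal sketch, assuming the poll estimate is normally distributed (the variable names are illustrative):

```python
import math


def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


poll_share = 51.0        # Laverne's share in the poll, in percent
margin_of_error = 2.0    # reported margin of error, in percentage points
sd = margin_of_error / 2.0   # margin of error = 2 SD

# How many SDs the 50% threshold lies below the poll estimate.
z = (poll_share - 50.0) / sd

# Probability that the true share exceeds 50%.
prob_leads = normal_cdf(z)
print(f"Chance that Laverne is really ahead: {prob_leads:.1%}")
```
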

  • See if you can use the 2SD rule and just your intuition to derive a number for the standard deviation for adult male height
  • Estimate two things: mean and standard deviation

Ian Ayres’ book, page 214
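
The exercise can be worked backwards from the 2SD rule: guess a range that covers about 95% of cases, take its midpoint as the mean, and divide its width by four to get the standard deviation. The range used below (64 to 76 inches) is a purely hypothetical guess for illustration, not a figure from the slides.

```python
def mean_and_sd_from_range(low: float, high: float) -> tuple[float, float]:
    """Turn an intuitive 95% range into (mean, SD) using the 2SD rule.

    The range is assumed to span mean - 2*SD to mean + 2*SD.
    """
    if high <= low:
        raise ValueError("high must be greater than low")
    mean = (low + high) / 2.0
    sd = (high - low) / 4.0
    return mean, sd


# HYPOTHETICAL guess: about 95% of adult men are between 64 and 76 inches tall.
mean_height, sd_height = mean_and_sd_from_range(64.0, 76.0)
print(f"Implied mean: {mean_height} in, implied SD: {sd_height} in")
```
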

Technology trends enabling all this data analysis
  • Cloud computing
    • Amazon, Google, Yahoo, Microsoft
  • Open source software
    • R programming language
      • NY Times article, Jan. 7, 2009
    • Hadoop allows ordinary PCs to analyze huge quantities of data that previously required supercomputers

Economist special issue

Technology or techniques?
  • Moore’s Law
    • Processing power doubles every two years
    • Supercrunching does need CPUs, but computing power has been available
  • More important: Kryder’s Law
    • Storage capacity of hard drives has been doubling every two years
    • Named after Mark Kryder, chief technology officer of hard drive manufacturer Seagate

Ian Ayres’ book, page 151
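
To get a feel for what "doubling every two years" means, the compounding can be computed directly. This is a simple arithmetic sketch; the function name `growth_factor` and the ten-year horizon are chosen for illustration only.

```python
def growth_factor(years: float, doubling_period: float = 2.0) -> float:
    """Multiplicative growth after `years` if capacity doubles every `doubling_period` years."""
    if doubling_period <= 0:
        raise ValueError("doubling_period must be positive")
    return 2.0 ** (years / doubling_period)


for horizon in (2, 10, 20):
    print(f"After {horizon:2d} years: x{growth_factor(horizon):,.0f}")
```
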

Three techniques
  • Regressions
    • error term ~ N(0, σ²)
  • Randomization
    • Run experiments by treating different samples in different ways
  • Neural networks
    • Functional form is not assumed to be linear or anything specific

Ian Ayres’ book
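
As an illustration of the first technique, the sketch below fits a straight line by ordinary least squares to simulated data whose error term is drawn from N(0, σ²). Everything about the data is an assumption made for the example: the "true" intercept, slope, σ, sample size, and random seed are hypothetical, and the course itself uses R rather than Python.

```python
import random


def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Simulated data (hypothetical parameters, fixed seed for repeatability).
rng = random.Random(0)
true_intercept, true_slope, sigma = 1.0, 2.0, 1.0
xs = [rng.uniform(0.0, 10.0) for _ in range(1000)]
ys = [true_intercept + true_slope * x + rng.gauss(0.0, sigma) for x in xs]

intercept, slope = fit_line(xs, ys)
print(f"Estimated intercept: {intercept:.3f}, estimated slope: {slope:.3f}")
```

With enough observations the estimates land close to the values used to generate the data, even though each individual point is perturbed by noise.
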

Course material
  • From Data to Knowledge
  • Focus on data sets
  • Less on details of statistical techniques
  • Learn R programming through class-provided R programs and assignments
  • Importance of data analysis
    • in every walk of life!
  • How to extract the “story” hidden in the data set?