Mining Baseball Statistics


PowerPoint Slideshow about 'Mining Baseball Statistics' - nasya


Presentation Transcript
slide1

Mining Baseball Statistics

Data Mining – CSE881

Paul Cornwell

Kajal Miyan

Mojtaba Solgi

Project URL: http://kmp-cse881.appspot.com/

slide2

Overview of Baseball

  • Baseball is a team sport
  • There are two major leagues: AL (American), NL (National)
  • Many statistics characterizing player performance are published yearly
  • Each league names one player MVP (Most Valuable Player) each year according to a vote
  • People place bets on who will be MVP


slide3

Overview

  • Application: (motivation)
    • Can we predict who will be named MVP?
    • Learn how to do data mining
    • Learn about baseball
    • Impress sabermetricians
    • Baseball: it’s not diseases, crime, or pollution
  • Baseball statistics
  • Main task: predict MVPs for a given year
  • Use SVM to rank players


slide4

playerID   yearID  stint  teamID  lgID  Gbat   AB   R   H  2B  3B  HR  RBI  SB  SO
aasedo01   1985    1      BAL     AL      54    0   0   0   0   0   0    0   0   0
abregjo01  1985    1      CHN     NL       6    9   0   0   0   0   0    1   0   2
ackerji01  1985    1      TOR     AL      61    0   0   0   0   0   0    0   0   0
adamsri02  1985    1      SFN     NL      54  121  12  23   3   1   2   10   1  23
agostju01  1985    1      CHA     AL      54    0   0   0   0   0   0    0   0   0
aguaylu01  1985    1      PHI     NL      91  165  27  46   7   3   6   21   1  26
aguilri01  1985    1      NYN     NL      22   36   1  10   2   0   0    2   0   5

Overview of Data and Mining

  • Data: 5 CSV files (Batting, Fielding, Master, Awards, Salaries)
  • Data Mining:
    • Ranking (similar to classification)
    • Anomaly detection (maybe)


slide5

Methodology - Preprocessing

  • Initial Data: ~90,000 rows in Batting table, 1871-2007
    • One row: one player/year/stint/team
  • Cut to 1985-2007 (~28,000 rows) because Salaries data begins in 1985 and because of rule changes
  • Perl script to merge tables by playerID/yearID/stint
    • BattingFieldingAwards(MVP)SalariesMaster = 48 columns
    • ~14 hours, but I got to relearn Perl!
  • Discovered: infeasible to use WEKA, need to use SVM-Light
  • Reformatted from CSV to space-delimited SVM-Light format
    • replace every “value” with “attribute:value”
    • replace commas, spaces
    • deleted 131 rows without a fielding record (the three largest had 26, 21, and 16 at-bats)
    • create (binary) rank value based on MVP status
    • replace all MM/DD/YYYY with YYYY
    • insert “qid” column according to year/league (46 qids)
    • ...
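
The reformatting steps above can be sketched roughly as follows. This is a minimal illustration, not the team's Perl script: the merged CSV layout, the `mvp` column name, the four-feature subset, and the `example01` row are all assumptions made for the example. Each output line follows the SVM-Light ranking layout `<target> qid:<q> <index>:<value> ...`, with one qid per year/league pair.

```python
import csv
import io

# Hypothetical merged CSV: three rows copied from the slide-4 sample plus one
# made-up row (example01) so that the MVP label appears at least once.
MERGED_CSV = """playerID,yearID,lgID,Gbat,R,HR,RBI,mvp
adamsri02,1985,NL,54,12,2,10,0
aasedo01,1985,AL,54,0,0,0,0
aguaylu01,1985,NL,91,27,6,21,0
example01,1985,AL,150,100,30,110,1
"""

FEATURES = ["Gbat", "R", "HR", "RBI"]  # assumed subset of the merged columns


def to_svmlight(rows, features=FEATURES):
    """Return SVM-Light ranking lines: '<target> qid:<q> <index>:<value> ...'."""
    # Group rows by (year, league) so that each group becomes one query.
    rows = sorted(rows, key=lambda r: (r["yearID"], r["lgID"]))
    qids = {}
    lines = []
    for row in rows:
        key = (row["yearID"], row["lgID"])
        qid = qids.setdefault(key, len(qids) + 1)  # 1, 2, ... in sorted order
        target = 1 if row["mvp"] == "1" else 0  # binary rank value
        feats = " ".join(
            f"{i}:{float(row[name]):g}" for i, name in enumerate(features, start=1)
        )
        # Text after '#' is a comment; here it keeps the playerID for later lookup.
        lines.append(f"{target} qid:{qid} {feats} # {row['playerID']}")
    return lines


svm_lines = to_svmlight(list(csv.DictReader(io.StringIO(MERGED_CSV))))
print("\n".join(svm_lines))
```

Keeping the playerID in the trailing comment is one way to make the later "match ranks with corresponding players" step easier, since the output order of a ranker follows the input order.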


slide6

Methodology – Data Mining

  • Classification is not apt to give good results, hence ranking with
    • SVM-Light (Cornell University)
      • Training generates a model which can rank input
      • Training phase: leave one (year) out
      • Testing: rank the players for that year
  • Postprocessing
    • SVM-Light returns only ranks of the players as integers
    • match ranks with corresponding players
    • Reformat data for visualization
    • Ranked the data for each attribute
  • Anomaly detection (in progress)
    • KNN on 4 attributes (Gbat, R, HR, RBI) for players in >= 10 games
    • Compute z-scores for each attribute/year
    • Rank players by distance from nearest neighbor
    • Compare ranks in various attributes for detecting anomalies
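
The anomaly-detection idea above (z-scores per attribute, then ranking by distance to the nearest neighbor) might look like the sketch below. It assumes a single year of data, uses only the nearest neighbor (k = 1), and takes Euclidean distance in z-score space; none of these choices is confirmed by the slides. The sample rows come from the slide-4 table.

```python
import math
import statistics

ATTRS = ("Gbat", "R", "HR", "RBI")


def zscores(values):
    """Standardize a column; a constant column maps to all zeros."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [0.0 if sd == 0 else (v - mean) / sd for v in values]


def anomaly_ranking(players, attrs=ATTRS, min_games=10):
    """Rank players by Euclidean distance to their nearest neighbor (largest first)."""
    players = [p for p in players if p["Gbat"] >= min_games]
    columns = [zscores([p[a] for p in players]) for a in attrs]
    points = list(zip(*columns))  # one z-score vector per player
    dists = [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]
    order = sorted(range(len(players)), key=lambda i: dists[i], reverse=True)
    return [(players[i]["playerID"], dists[i]) for i in order]


# 1985 rows from the slide-4 sample (playerID, Gbat, R, HR, RBI).
sample = [
    {"playerID": pid, "Gbat": g, "R": r, "HR": hr, "RBI": rbi}
    for pid, g, r, hr, rbi in [
        ("aasedo01", 54, 0, 0, 0),
        ("abregjo01", 6, 0, 0, 1),  # fewer than 10 games, so filtered out
        ("ackerji01", 61, 0, 0, 0),
        ("adamsri02", 54, 12, 2, 10),
        ("agostju01", 54, 0, 0, 0),
        ("aguaylu01", 91, 27, 6, 21),
        ("aguilri01", 22, 1, 0, 2),
    ]
]

ranking = anomaly_ranking(sample)
for player_id, distance in ranking:
    print(f"{player_id}  {distance:.2f}")
```

On this tiny sample the most isolated player is the one with the most games, runs, home runs, and RBI, while two players with identical lines sit at distance zero from each other.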


slide7

Methodology - Visualization

  • Bar charts of top 20 ranked players for various attributes
    • Python
    • Google App Engine
    • Google Charts tool
  • U.S. map of player birthState density


slide8

Team Roles

  • Roles of team members
    • Planning - Everyone
    • Preprocessing – Paul Cornwell
    • Data Mining – Kajal Miyan
    • Visualization – Mojtaba Solgi


slide9

Related Work

  • No apparent academic work on predicting MLB MVPs
  • PECOTA
    • Baseball Prospectus
      • www.baseballprospectus.com/pecota/
      • Baseball “forecasting”
      • Makes statistical predictions about players
      • No MVP prediction evident
      • subscription service
  • Books are available with baseball forecasts
    • apparently for one year only


slide10

Experimental Setup

  • Raw data downloaded from http://baseball1.com/content/view/58/82/
  • Preprocessing done using Perl, Nano, Excel, OpenOffice.org (OOo), TextPad
  • Preprocessing yields a table with ~28K rows and 45 columns
  • Experiments were conducted on a 2 GHz P4 machine running Kubuntu 8.04 with 1GB RAM
  • Data Mining and postprocessing with SVM-Light, Visual C#, Matlab
  • Visualization done using Python, Google App Engine
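
For readers who want a feel for the table merge done in preprocessing (joining by playerID/yearID/stint and dropping rows without a fielding record), here is a small sketch. The original used a Perl script; this version is an illustration only. The function name `merge_by_key` is hypothetical, the fielding row and its `POS` value are made up, and the sketch assumes at most one fielding row per key.

```python
# Key columns named on the preprocessing slide.
KEY = ("playerID", "yearID", "stint")


def merge_by_key(left, right, key=KEY):
    """Inner-join two lists of dict rows on the key columns."""
    # Simplifying assumption: at most one right-hand row per key.
    index = {tuple(r[k] for k in key): r for r in right}
    merged = []
    for row in left:
        match = index.get(tuple(row[k] for k in key))
        if match is None:
            continue  # e.g. a batting row with no fielding record is dropped
        merged.append({**row, **match})
    return merged


# Batting values are taken from the slide-4 sample; the fielding row and
# its POS column are made up for illustration.
batting = [
    {"playerID": "adamsri02", "yearID": 1985, "stint": 1, "AB": 121, "HR": 2},
    {"playerID": "aguaylu01", "yearID": 1985, "stint": 1, "AB": 165, "HR": 6},
]
fielding = [
    {"playerID": "adamsri02", "yearID": 1985, "stint": 1, "POS": "SS"},
]

merged = merge_by_key(batting, fielding)
print(merged)
```

Repeating the same join against the Awards, Salaries, and Master rows would widen each record further, which is how a single wide table per player/year/stint could be built up.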


slide11

Experimental Evaluation

  • Preliminary results
    • SVM-Light trained on 1985-2006 data
    • tested on 2007
    • ranked actual MVPs #1 and #11 (out of 1242 players) (2nd NL, #2)
      • (there is one MVP for each league each year: AL, NL)
      • 2006: ranks 7, 16 (1371 players)
      • 2005: ranks 1, 4 (1322 players)
      • 2004: ranks 1, 3 (1342 players)
      • 2003: ranks 3, 32 (1341 players)
      • 2002: ranks 1, 11 (1316 players)
  • Final evaluation (pending)
    • Leave-one-out
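
A leave-one-year-out evaluation of the kind planned here could be organized as in the sketch below. It does not call SVM-Light: the scorer is a stand-in (it simply uses RBI as the score), the rows are toy data, and the assumption that a higher score means a better rank is ours. The point is only the bookkeeping: hold out one year, score that year's players, and report where the actual MVPs land.

```python
def leave_one_year_out(rows):
    """Yield (held_out_year, train_rows, test_rows) for every year present."""
    for year in sorted({r["yearID"] for r in rows}):
        train = [r for r in rows if r["yearID"] != year]
        test = [r for r in rows if r["yearID"] == year]
        yield year, train, test


def mvp_ranks(test_rows, scores):
    """1-based positions of the true MVPs when sorted by descending score."""
    order = sorted(range(len(test_rows)), key=lambda i: scores[i], reverse=True)
    position = {idx: rank for rank, idx in enumerate(order, start=1)}
    return sorted(position[i] for i, r in enumerate(test_rows) if r["mvp"])


# Toy rows, made up for illustration; not real player data.
rows = [
    {"playerID": "a", "yearID": 2006, "RBI": 100, "mvp": False},
    {"playerID": "b", "yearID": 2006, "RBI": 120, "mvp": True},
    {"playerID": "c", "yearID": 2007, "RBI": 90, "mvp": True},
    {"playerID": "d", "yearID": 2007, "RBI": 130, "mvp": False},
    {"playerID": "e", "yearID": 2007, "RBI": 50, "mvp": False},
]

results = {}
for year, train, test in leave_one_year_out(rows):
    # Stand-in scorer: a real run would train a ranker on `train` here
    # and score `test` with the resulting model.
    scores = [r["RBI"] for r in test]
    results[year] = mvp_ranks(test, scores)

print(results)
```

Reporting the MVP positions per held-out year mirrors the way the preliminary results are listed on this slide (for example "ranks 1, 4" for a given year).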


slide12

Visualization Demo

  • http://kmp-cse881.appspot.com/


slide13

Conclusions

  • MVP ranking was surprisingly successful
  • Early results suggest that it is feasible to predict MVPs with some accuracy
  • Lessons learned
    • Data mining is hard work
    • Baseball statistics are actually sort of interesting
  • Future work
    • Leave-one-out validation
    • Incorporate team statistics in player evaluations (expert advice)
