Shawn Cicoria, John Sherlock, Manoj Muniswamaiah, Lauren Clarke

Shawn Cicoria, John Sherlock, Manoj Muniswamaiah, Lauren Clarke Seidenberg School of Computer Science and Information Systems Pace University White Plains, NY, US shawn@cicoria.com, {js20454w,mm42526w,lc18948w}@pace.edu.

Presentation Transcript


  1. Shawn Cicoria, John Sherlock, Manoj Muniswamaiah, Lauren Clarke • Seidenberg School of Computer Science and Information Systems, Pace University, White Plains, NY, US • shawn@cicoria.com, {js20454w,mm42526w,lc18948w}@pace.edu • Classification of Titanic Passenger Data and Chances of Surviving the Disaster • Data Mining with Weka and Kaggle Competition Data

  2. Background • Titanic disaster: April 15, 1912 • 1,502 of the 2,224 passengers and crew perished [2] • Researchers still try to identify what determined a person's chance of survival [2,3]
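The figures on the Background slide imply an overall survival rate, which gives a rough reference point for judging any classifier. The following is a minimal sketch of that arithmetic; the function and constant names are illustrative, and only the two totals come from the slide.

```python
# Totals quoted on the Background slide.
TOTAL_ABOARD = 2224
PERISHED = 1502


def survival_rate(total, perished):
    """Fraction of people aboard who survived."""
    return (total - perished) / total


rate = survival_rate(TOTAL_ABOARD, PERISHED)
print(f"Survivors: {TOTAL_ABOARD - PERISHED}")   # Survivors: 722
print(f"Overall survival rate: {rate:.1%}")      # Overall survival rate: 32.5%
```

On these totals, always predicting "perished" would be right about 67.5% of the time. The Kaggle data covers only a subset of the people aboard, so its proportions may differ.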

  3. Kaggle.com • Crowdsourced competitions for analytics and data mining • Online presence • Example competition: General Electric (GE) offering $200,000 [1]

  4. Weka: Waikato Environment for Knowledge Analysis • Open-source tool • http://www.cs.waikato.ac.nz/ml/weka/ • Collection of machine learning algorithms and analytical tools • Cross-platform, Java-based • Primary authors: researchers at the University of Waikato, NZ

  5. Basic Premise • Which classes of passengers affected survivability in the Titanic disaster? • Sex • Cabin Class • Point of Departure • Age

  6. Source Data • Kaggle (Kaggle.com) • Titanic Disaster Competition • https://www.kaggle.com/c/titanic-gettingStarted • Used Test Data set

  7. Data Set – Coaxing for Weka • Original Data • Data Modifications

  8. Final Data Format • Final CSV • ARFF Format
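The slide names the two formats but the transcript does not preserve the files themselves. As a hedged illustration of what moving from CSV to Weka's ARFF involves, the sketch below writes an ARFF header (`@relation`, one `@attribute` per column, `@data`) followed by the rows, using `?` for missing values. The column names, the three sample rows, and the `csv_to_arff` helper are assumptions made for illustration, not the authors' actual data or procedure.

```python
import csv
import io

# Hypothetical sample rows; column names are assumed, not taken from the slides.
SAMPLE_CSV = """Survived,Pclass,Sex,Age,Embarked
0,3,male,22,S
1,1,female,38,C
1,3,female,,S
"""

# Columns treated as nominal (categorical); everything else is numeric.
NOMINAL = {"Survived", "Pclass", "Sex", "Embarked"}


def csv_to_arff(csv_text, relation, nominal_columns):
    """Convert CSV text into ARFF text, writing '?' for empty cells."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    names = list(rows[0].keys())
    lines = ["@relation " + relation, ""]
    for name in names:
        if name in nominal_columns:
            # Nominal attributes list every observed value in braces.
            values = sorted({row[name] for row in rows if row[name] != ""})
            lines.append("@attribute " + name + " {" + ",".join(values) + "}")
        else:
            lines.append("@attribute " + name + " numeric")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(row[n] if row[n] != "" else "?" for n in names))
    return "\n".join(lines) + "\n"


arff_text = csv_to_arff(SAMPLE_CSV, "titanic", NOMINAL)
print(arff_text)
```

Declaring `Survived` and `Pclass` as nominal rather than numeric matters for a tree learner such as J48, which expects a nominal class attribute.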

  9. J48 Classifier • C4.5-based • 81% correct classification • 42nd in Kaggle if submitted! • Information gain: the amount of information gained by knowing the value of the attribute = (entropy of the distribution before the split) − (entropy of the distribution after it) • Claude Shannon, American mathematician and scientist, 1916–2001
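The information-gain definition on the J48 slide can be made concrete with a small calculation. The sketch below computes Shannon entropy before and after splitting on an attribute. The eight rows are invented toy data chosen to make the arithmetic easy, not the project's data, and the function names are illustrative. C4.5 normalizes this quantity into a gain ratio; only the plain gain from the slide is shown here.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target before the split minus the weighted entropy after it."""
    before = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return before - after


# Invented toy rows (not the Kaggle data): 3 of 4 women and 1 of 4 men survive.
ROWS = [
    {"sex": "female", "pclass": "1", "survived": "yes"},
    {"sex": "female", "pclass": "1", "survived": "yes"},
    {"sex": "female", "pclass": "3", "survived": "yes"},
    {"sex": "female", "pclass": "3", "survived": "no"},
    {"sex": "male", "pclass": "1", "survived": "no"},
    {"sex": "male", "pclass": "1", "survived": "no"},
    {"sex": "male", "pclass": "3", "survived": "yes"},
    {"sex": "male", "pclass": "3", "survived": "no"},
]

gain_sex = information_gain(ROWS, "sex", "survived")
gain_pclass = information_gain(ROWS, "pclass", "survived")
print(f"gain(sex)    = {gain_sex:.4f}")     # gain(sex)    = 0.1887
print(f"gain(pclass) = {gain_pclass:.4f}")  # gain(pclass) = 0.0000
```

In this toy set the split on `sex` reduces entropy while the split on `pclass` does not, so a tree learner driven by information gain would place `sex` at the root.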

  10. J48 Tree Diagram • Sex largest impact • Cabin Class • Departure point

  11. Simple K Means Clustering • Sex had clear clustering impact

  12. Simple K Means Clustering • Cabin Class showed significant clustering • 3rd class had poor survival

  13. Simple K Means Clustering • Age Group • Hard to distinguish any clusters • Least influential attribute in J48

  14. Simple K Means • Point of Departure • Southampton seems significant • We didn’t identify whether departure point was associated with Cabin Class; another study is needed.

  15. Simple K Means • Point of Departure vs. Survived • Instances colored by Class (1st, 2nd, 3rd) • Shows a strong association between point of embarkation and class
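The clustering slides rely on k-means, which alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. The following is a minimal sketch of that loop, not Weka's SimpleKMeans implementation. The points are invented (sex, survived) pairs encoded as 0/1, and the starting centroids are fixed by hand so the run is deterministic.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))


def k_means(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters


# Invented toy points (is_female, survived); not the project's data.
POINTS = [(1, 1)] * 3 + [(1, 0)] + [(0, 0)] * 3 + [(0, 1)]
START = [(0.0, 0.2), (1.0, 0.8)]  # hand-picked starting centroids (assumption)

final_centroids, final_clusters = k_means(POINTS, START)
print(final_centroids)  # [(0.0, 0.25), (1.0, 0.75)]
```

In this toy run the two clusters separate cleanly by the first coordinate, and the second coordinate of each centroid is that group's survival fraction.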

  16. Conclusions and Summary • Sex clearly had the most significant impact on the survival rate • J48 classifier ~ 81% correctly classified instances • Kaggle competition 43rd place.

  17. Finally • Weka is powerful; however, • It requires significant coaxing of the data into a more amenable format • At first, we had chosen baseball statistics • Became overwhelmed • Baseball statistics were tossed out very late in our project • Kaggle to the rescue • Stumbled upon this dataset • Simple manipulation produced a compatible ARFF format • Demonstrated which classes of passengers had the greatest impact on survivability.

  18. References [1] GE, “Flight Quest Challenge,” Kaggle.com. [Online]. Available: https://www.gequest.com/c/flight2-main. [Accessed: 13-Dec-2013]. [2] “Titanic: Machine Learning from Disaster,” Kaggle.com. [Online]. Available: https://www.kaggle.com/c/titanic-gettingStarted. [Accessed: 13-Dec-2013]. [3] Wiki, “Titanic.” [Online]. Available: http://en.wikipedia.org/wiki/Titanic. [Accessed: 13-Dec-2013]. [4] Kaggle, “Data Science Community.” [Online]. Available: http://www.kaggle.com/. [Accessed: 13-Dec-2013]. [5] “Weka 3: Data Mining Software in Java.” [Online]. Available: http://www.cs.waikato.ac.nz/ml/weka/. [Accessed: 13-Dec-2013]. [6] Wikipedia, “C4.5 Algorithm,” Wikimedia Foundation. [Online]. Available: http://en.wikipedia.org/wiki/C4.5_algorithm. [Accessed: 13-Dec-2013].