Data Mining 157A, Fall Semester 2006 Brent Turner
Presentation Contents:
• What Is Data Mining
• Data Mining Ideas
• The DM Process
• Advantages and Problems in DM
• Example 1 – web searches
• Example 2 – buying habits
• Example 3 – basketball stats
• References
The DM process
• Data gathering
• Data cleansing: eliminate errors and/or bogus data
• Feature extraction: obtaining only the interesting attributes of the data
• Pattern extraction and discovery
• Visualization of the data
• Evaluation of results
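The first steps of the process can be sketched on a toy data set. This is only an illustration under stated assumptions: the record layout, the function names, and the sample purchases are all hypothetical, and evaluation of the results is left to the analyst.

```python
from collections import Counter

def cleanse(records):
    """Data cleansing: drop records with missing or bogus values."""
    clean = []
    for r in records:
        price = r.get("price")
        if r.get("customer") and r.get("item") and isinstance(price, (int, float)) and price >= 0:
            clean.append(r)
    return clean

def extract_features(records):
    """Feature extraction: keep only the interesting attributes (who bought what)."""
    return [(r["customer"], r["item"]) for r in records]

def discover_patterns(pairs):
    """Pattern discovery: count how many distinct customers bought each item."""
    return Counter(item for _, item in set(pairs))

def visualize(counts):
    """Visualization: a crude text bar chart, most popular item first."""
    return [f"{item:<12}{'#' * n}" for item, n in counts.most_common()]

# Data gathering: hypothetical purchase records (made up for this sketch).
raw = [
    {"customer": "c1", "item": "Gladiator", "price": 15.0},
    {"customer": "c2", "item": "Gladiator", "price": 15.0},
    {"customer": "c2", "item": "Braveheart", "price": 12.0},
    {"customer": None, "item": "Patriot", "price": 14.0},  # missing customer
    {"customer": "c3", "item": "Patriot", "price": -1.0},  # bogus price
]

counts = discover_patterns(extract_features(cleanse(raw)))
for line in visualize(counts):
    print(line)
```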
Data Mining Ideas
Search the data space for a new “golden” relationship.
• Brute force: with only 40 data items there are 2^40 = 1,099,511,627,776 (about a trillion) possible item subsets to look at. Restricting the search to pairs alone still leaves 780 candidates.
• Smarter search: infer or guess relationships based on other known data (association rules, causality, frequent item sets)
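The brute-force count can be checked directly, and the "smarter search" idea can be sketched with a simple pruning step in the spirit of frequent item set mining: only items that are frequent on their own are paired up. The function name `frequent_pairs`, the threshold, and the baskets are assumptions made for illustration.

```python
from itertools import combinations
from math import comb

n_items = 40
print(2 ** n_items)      # number of possible item subsets
print(comb(n_items, 2))  # number of possible pairs

def frequent_pairs(baskets, min_count):
    """Return pairs of items that occur together in at least min_count baskets.

    Pruning: a pair can only be frequent if both of its items are frequent,
    so infrequent items are discarded before any pairs are formed.
    """
    item_counts = {}
    for basket in baskets:
        for item in set(basket):
            item_counts[item] = item_counts.get(item, 0) + 1
    frequent_items = {i for i, c in item_counts.items() if c >= min_count}

    pair_counts = {}
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        for pair in combinations(kept, 2):
            pair_counts[pair] = pair_counts.get(pair, 0) + 1
    return {p: c for p, c in pair_counts.items() if c >= min_count}

# Hypothetical baskets, made up for this sketch.
baskets = [["a", "b", "c"], ["a", "b"], ["a", "d"], ["b", "c"]]
print(frequent_pairs(baskets, min_count=2))
```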
Advantages of Data Mining
• Provides new knowledge from existing data
  • Public databases
  • Government sources
  • Company databases
• Old data can be used to develop new knowledge
• New knowledge can be used to improve services or products
• Improvements lead to:
  • Bigger profits
  • More efficient service
Some problems to consider in DM
• Privacy – data dealing with personal information (e.g. medical history) may need to be kept private from employers, insurance companies, etc.
• Legality – can DM be used to screen out high-risk persons or help prosecute a crime?
• Ethics – should we create software that can be used in unethical ways? What should be done with the new knowledge?
Example 1 – Web Search
a. PageRank, for discovering the most “important” pages on the Web, as used in Google.
b. Hubs and authorities, a more detailed evaluation of the importance of Web pages using a variant of the eigenvector calculation used for PageRank.
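The eigenvector calculation behind PageRank can be approximated by repeated redistribution of rank along links (power iteration). The following is a minimal textbook-style sketch, not Google's actual implementation; the damping factor of 0.85, the iteration count, the handling of pages without outgoing links, and the four-page toy web are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page with no outgoing links: spread its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical four-page web, made up for this sketch.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

In this toy web, page C collects links from three other pages and ends up with the highest score, while D, which nothing links to, keeps only the baseline share.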
Example 2 – Buying habits
Historic data might identify that customers who purchase the Gladiator DVD and the Patriot DVD also purchase the Braveheart DVD. The historic data might indicate that the first two DVDs are purchased together by only 5% of all customers, but 70% of those customers then also purchase Braveheart.
Support = 5%: the share of customers who bought Gladiator and Patriot.
Confidence = 70%: the share of those customers who will also buy Braveheart.
Conclusion: use real-time web advertising to get more sales.
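Support and confidence, as used above, can be computed from a list of transactions. This is a small sketch: the helper names are hypothetical, and the 200 transactions are made up so that the results match the 5% and 70% figures quoted here.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing antecedent, the fraction that also contain consequent."""
    base = support(transactions, antecedent)
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base if base else 0.0

# Hypothetical data: 10 of 200 customers buy Gladiator and Patriot,
# and 7 of those 10 also buy Braveheart.
transactions = (
    [["Gladiator", "Patriot", "Braveheart"]] * 7
    + [["Gladiator", "Patriot"]] * 3
    + [["Other DVD"]] * 190
)

s = support(transactions, ["Gladiator", "Patriot"])
c = confidence(transactions, ["Gladiator", "Patriot"], ["Braveheart"])
print(f"support = {s:.0%}, confidence = {c:.0%}")
```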
Example 3 – basketball stats
In one application, IBM's Advanced Scout was developed to identify different strategies employed by basketball players in the NBA.
Discoveries include the observation that Scottie Pippen's favorite move on the left block is a right-handed hook to the middle.
And when guard Ron Harper penetrates the lane, he shoots the ball 83% of the time.
Also, it was noticed that 17% of Michael Jordan's offence comes on isolation plays, during which he tends to take two or three dribbles before pulling up for a jumper.
References
• Oo, Aung, “Data Mining”, 2005; at www.cs.sjsu.edu/faculty/lee/cs157, accessed 11-29-2006.
• Ullman, Jeffrey D., “Data Mining Lecture Notes”; at infolab.stanford.edu/~ullman/mining, accessed 11-29-2006.
• Williams, Graham, “DATA MINING Desktop Survival Guide”; at www.togaware.com/datamining/survivor, accessed 11-29-2006.
• Pinker, Steven; at pinker.wjh.harvard.edu, accessed 11-27-2006.
• Photographs at www.nba.com, accessed 11-29-2006.