1 / 29

Data Mining

Data Mining. Ryan Benton Center for Advanced Computer Studies University of Louisiana at Lafayette Lafayette, La., USA January 13, 2011. Important Note. This presentation was obtained from Dr. Vijay Raghavan Obtained on January 12, 2011.

ledell
Download Presentation

Data Mining

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Data Mining Ryan Benton Center for Advanced Computer Studies University of Louisiana at Lafayette Lafayette, La., USA January 13, 2011

  2. Important Note • This presentation was obtained from Dr. Vijay Raghavan • Obtained on January 12, 2011. • Dr. Raghavan is a member of the Center for Advanced Computer Studies, University of Louisiana at Lafayette, Lafayette, La., USA • They are being used with his permission. • Some modifications have been made.

  3. CONTENTS • The Motivation • Knowledge Discovery in Databases (KDD) • Data Mining • Related Fields • Research Issues • Tasks • Association Mining Problem • Classification Mining Problem • Conclusions

  4. THE MOTIVATION “We are drowning in information, but starving for knowledge.” John Naisbett

  5. KNOWLEDGE DISCOVERY IN DATABASES- Definition • A hot buzzword for a class of database applications that look for patterns or relationships in data that are: • Hidden, • Previously unknown and • Potentially useful

  6. KDD: Definition • Extract (discover): • interesting and • previously unknown knowledge from very large real world databases.

  7. KDD: Definition • More formally: • Valid, • Novel, Potentially useful or Desired • Ultimately understandable.

  8. KDD- PROCESS Note, order of these steps And what is include in Each step may vary depending On who you talk to. And maybe even this

  9. KDD vs. DATA MINING • Synonyms (?) • KDD • More than just finding pattern • Mining, dredging and fishing

  10. KDD- Related Fields • Data Warehousing • On-Line Analytical Processing (OLAP) • Database Marketing • Exploratory Data Analysis (EDA)

  11. Data Warehousing • A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management’s decision making process.

  12. WEB Relational Data Marts Query Reporting Tools Portal Data Warehouse OLAP Tools External Source MDD Data Marts GIS Tools File Server Geo- Reference Data Meta Data OLAP and Data Warehousing User

  13. Visualization Machine Learning Data Mining Statistics Artificial Intelligence Expert Systems Pattern Recognition Data Mining: Related Areas Database Management Systems • Other Areas: • Neural Networks • Evolutionary Methods • Information Retrieval • Etc. The ‘Parent’ of Data Mining Sometimes considered an subarea of AI Sometimes considered an subarea of AI

  14. Database versus Data Mining • Query • DB: Well Defined & SQL • DM: Poorly Defined & Various Languages • Data • DB: Operational (and generally relational) • DM: Not Operational. • Output • DB: Precise, subset of the database. • DM: Varies.

  15. Examples • Database • Find all people with last name Raghavan. • Identify all customers who have bought more than 10,000 dollars • Data Mining • Find those who have poor credit • Find all those who like the same cars • Find all items that are often (frequently) purchased with milk. • Predict the value of the housing market.

  16. Statistics • Simple descriptive models • Traditionally: • A model created from a sample of the data to the entire dataset. • Exploratory Data Analysis: • Data can actually drive the creation of the model • Opposite of traditional statistical view. • Presupposes a distribution

  17. Machine Learning • Machine Learning: area of AI that examines how to write programs that can learn. • Types of models • Classification • Prediction (Regression) • Clustering • Types of Learning: • Supervised • Unsupervised • Traditionally • Small Datasets • ‘Complete’ Data • Changed since about mid-1990’s • Examples on non-complete from earlier • Field isn’t static (ideas flow between)

  18. Data Mining: Research Issues • Ultra large data • Noisy data • Null values • Incomplete data • Note, somewhat related to NULL values • Redundant data • Dynamic aspects of data

  19. Data Mining: Tasks • Association • Classification • Clustering • Estimation • Data Visualization • Deviation Analysis • etc Generally, implication that you are seeking Relationships and/or can manipulate the data interactively.

  20. Data Mining Models and Tasks

  21. Deriving association rules from data: Given a set of items I = {i1,i2, . . . , in} and a set of transactions S = {s1, s2, . . ., sm}, each transaction si S, such that si  I, an association rule is defined as X  Y, where X  I, Y  I and X  Y = , describes the existence of a relationship between the two itemsets X and Y. ASSOCIATION MINING PROBLEM

  22. Measurements Measures to define the strength of the relationship between two itemsets X and Y

  23. The percentage of transactions that contain Y among those transaction containing X. Measure of Confidence

  24. Applications of Associations • I = Products, S = Baskets • I = Cited Articles, S = Technical Articles • I = Incoming Links, S = Web pages • I = Keywords, S = Documents • I = Term papers, S = Sentences

  25. Classification Mining Problem • Pattern Recognition and Machine Learning communities • Generally aimed at models of the data. • Often includes both • Categorization • Prediction (Regression) • Supervised.

  26. Clustering Mining Problem • Assumption: Data, naturally, falls into groups. • Overlapping or Non-Overlapping • What are the groups? • And what data falls within each group. • Unsupervised.

  27. Measures • Error • Categorization • Number Bad Assignments/Total Assignments • Prediction • Mean Squared Error • In truth, a number of measures have been proposed.

  28. Note about ‘Data’ • Various types: • Text • Strings • Numeric • Sound • Image • Relations • Etc.

  29. CONCLUSIONS KDD has interesting problems • It is an inter-disciplinary field • No matter your expertise, you can find an interesting niche • Many high-demand applications • Customer Relationship Management • Suggestive Sales • Amazon • Search • Ebay • Stock Prediction • Well, would be nice.

More Related