CSE 8392 SPRING 1999 DATA MINING: PART I

CSE 8392 SPRING 1999 DATA MINING: PART I. Professor Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University Dallas, Texas 75275 (214) 768-3087 fax: (214) 768-3085 email: mhd@seas.smu.edu

Presentation Transcript


  1. CSE 8392 SPRING 1999 DATA MINING: PART I Professor Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University Dallas, Texas 75275 (214) 768-3087 fax: (214) 768-3085 email: mhd@seas.smu.edu www: http://www.seas.smu.edu/~mhd January 1999

  2. CSE 8392 SPRING 1999 OUTLINE • Course Objective: To examine Data Mining concepts. A database perspective (rather than AI or statistics) is taken. • I. Introduction and Related Topics • II. Core Topics • III. Advanced Topics • IV. Case Studies • V. Student Presentations • VI. Summary and Future Trends CSE 8392 Spring 1999

  3. INTRODUCTION AND RELATED TOPICS • Section Objective: Provide an introduction to data mining concepts. Briefly examine related concepts and background topics. • Historical Perspective • Gleaning Knowledge from the Data • User expectations increase as the amount and sophistication of collected data increase. • Reality vs Extracted Data [Figure: Physical View (Reality, Information Need) alongside Database View (Data, Query)]

  4. Related Topics (to be covered) • Knowledge Discovery • Information Retrieval • Fuzzy Sets • Data Warehousing and OLAP • Dimensional Modeling

  5. Data Mining Overview • What is Data Mining? • Definition: Fayyad, p. 9 • A.k.a. • Exploratory data analysis • Unsupervised pattern recognition • Data driven discovery • Deductive learning • Data Mining determines patterns in the data • Non-trivial • Valid • Novel • Potentially useful • Interesting • General and simple • Understandable

  6. DM Techniques (R[1]) • DM involves many different algorithms to accomplish different things. All share the following components. • Model (Must fit a model to the data.) • Function/Purpose • Representation • Preference Criteria (How to choose one model over another?) • Search Algorithm (How to search the data?) • Example (Loan Data, fig 1.1 p6 in Fayyad): • Model: Classification, Linear Function • Preference: What best fits data? (Fig 1.2 or 1.4) • Search Algorithm: Linear search of database
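
The three components (model, preference criterion, search algorithm) can be sketched on a toy version of the loan example. Everything concrete here is an assumption for illustration: the data values, the single-attribute threshold model, and the error-count preference criterion are hypothetical and are not taken from Fayyad's figures.

```python
# Hypothetical loan records: (income in $1000s, loan repaid?)
loans = [(20, False), (35, False), (45, False), (50, True), (60, True), (80, True)]


def errors(threshold, data):
    """Preference criterion: count records the model misclassifies.

    Model: predict 'repaid' whenever income >= threshold.
    """
    return sum((income >= threshold) != repaid for income, repaid in data)


def best_threshold(data):
    """Search algorithm: linear scan over candidate thresholds in the data."""
    candidates = sorted(income for income, _ in data)
    return min(candidates, key=lambda t: errors(t, data))


chosen = best_threshold(loans)
print(chosen, errors(chosen, loans))
```

A different preference criterion (for example, one that penalizes complex models) could select a different model from the same data.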

  7. DM Model Functions (R[1]) • Classification - Map data into predefined groups • Regression - Map data to a real-valued prediction variable • Clustering - Map data into groups defined by the data itself • Summarization - Map subsets of data into a simple description • Dependency Modeling - Identify dependencies among data items • Link Analysis - Identify other relationships among data (association rules) • Sequence Analysis - Identify sequential patterns in data

  8. DM Historical Perspective • Late 70’s: Spreadsheet analysis • 80’s: Transactional databases support data storage and retrieval • Early 90’s: Growing interest in end user support (a.k.a. decision support) • Issue: transactional databases are not designed for decision support • Mid 90’s: Dedicated data warehouses for decision support and multidimensional analysis • Late 90’s: Proliferation; new concepts (data marts) • DM Tools: Neovista, Red Brick

  9. Data Mining Metrics • Berson, Tables 17-1, 17-2, 17-3, p. 347 • Accuracy • Clarity • Dirty Data • Dimensionality • Raw Data (Preprocessing) • RDBMS embedding • Scalability • Speed • Validation

  10. DM Issues • Overfitting • Outliers • Closed World Assumption • Database schemas and database models • Algorithms for data mining • Interpretation and visualization of results • Size of databases • Multimedia data, Spatio-Temporal Data • Changing data • Integration • DM Applications • Market basket analysis • Stock analysis and selection • Fraud detection and prevention • Crisis prediction and prevention

  11. KNOWLEDGE DISCOVERY IN DATABASES (KDD) • “Overall process of discovering useful knowledge from data.” (p28 in R[1]) • Defn: R[1] p 30 • Steps Fig 1, p29 R[1] (Fig 1.3 in Fayyad) • Data Mining is one step in the KDD process • The KDD objective is not usually clear or exact. It may require time with the customer to understand their needs. • Data usually has problems and needs cleaning • Incorrect/missing data • Extract from multiple sources and compare • Delete anomalous data and sources • Different data types/metrics
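
The cleaning problems listed above (missing data, anomalous data, different metrics) can be illustrated with a minimal sketch. The record layout, the unit codes, and the 50 to 250 cm plausibility range are all assumptions chosen for the example, not part of the course material.

```python
def clean(records):
    """Drop missing or anomalous heights and convert all values to centimeters."""
    cleaned = []
    for record in records:
        height = record.get("height")
        if height is None:
            continue  # missing data: skip the record
        value, unit = height
        cm = value * 2.54 if unit == "in" else value  # unify metrics
        if not 50 <= cm <= 250:
            continue  # anomalous data: outside the assumed plausible range
        cleaned.append({"name": record["name"], "height_cm": round(cm, 1)})
    return cleaned


raw = [
    {"name": "a", "height": (72, "in")},
    {"name": "b", "height": None},
    {"name": "c", "height": (180, "cm")},
    {"name": "d", "height": (1800, "cm")},
]
print(clean(raw))
```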

  12. FUZZY SETS and LOGIC • Set membership described by a real-valued membership function with values in [0,1] • Ex: Set of all tall people • Set membership function: f(x): x is tall iff height(x) > 6 ft. • Note that this is a simple classification problem. As in the loan example, the results are not exact. • Basis of many classification and clustering approaches • In a conventional DB how do you retrieve all tall people? • Three valued logic: True, False, Maybe • Multi-valued logic: More than 2 values
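
The difference between the crisp "tall" rule and a fuzzy membership function can be sketched as follows. The crisp rule uses the 6 ft cutoff from the slide; the linear ramp between 5.5 ft and 6.5 ft is a hypothetical choice made only for illustration.

```python
def crisp_tall(height_ft: float) -> bool:
    """Crisp set: x is tall iff height(x) > 6 ft."""
    return height_ft > 6.0


def fuzzy_tall(height_ft: float, low: float = 5.5, high: float = 6.5) -> float:
    """Fuzzy set: degree of membership in [0, 1], rising linearly from low to high."""
    if height_ft <= low:
        return 0.0
    if height_ft >= high:
        return 1.0
    return (height_ft - low) / (high - low)


for h in (5.0, 5.9, 6.0, 6.1, 7.0):
    print(h, crisp_tall(h), round(fuzzy_tall(h), 2))
```

With the crisp rule, people of 6.0 ft and 6.1 ft fall on opposite sides of the boundary; with the fuzzy function their membership degrees differ only slightly.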

  13. Fuzzy Logic • Reasoning with uncertainty • Extends multivalued logic; allows user to communicate using imprecise concepts, e.g. • “good” and “bad” • “close to” and “far away” • Avoids brittleness of rule based reasoning by introducing degrees of set membership • Allows for smoother transition between classification sets in the domain • Example • Berson figure 16.2, page 325

  14. INFORMATION RETRIEVAL • Store and retrieve documents based on fuzzy queries • Predecessor of web based access • Ex: Store information about all articles in all IEEE Transactions journals and retrieve all documents dealing with heaps. • Overview • Conventional IR Systems • Query Structures (Keywords) • Matching (Multivalued logic) • Measures • Text Analysis Techniques • IR Related Topics

  15. Conventional IR Systems • Library card catalogs • Documents (Library Science) • Formatted • Unformatted (Text) • Mixed • Document Surrogates • Identifiers • Titles, names, and dates • Abstracts, extracts, reviews • Summaries of Numerical Data • Image Descriptions

  16. IR Queries • Query Structures • Matching Criteria • Boolean Queries • Vector • Fuzzy • Natural Language • Logical combination of keywords • Weight associated with keywords • Similarity measures
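
A Boolean query is a logical combination of keywords. The sketch below assumes each document is represented by a set of keywords; the document identifiers and keywords are hypothetical.

```python
# Hypothetical document surrogates: document id -> set of keywords
docs = {
    "d1": {"heap", "priority", "queue"},
    "d2": {"heap", "sort"},
    "d3": {"hash", "table"},
}


def boolean_and(docs, *terms):
    """Return ids of documents containing every query term."""
    return sorted(d for d, keywords in docs.items() if all(t in keywords for t in terms))


def boolean_or(docs, *terms):
    """Return ids of documents containing at least one query term."""
    return sorted(d for d, keywords in docs.items() if any(t in keywords for t in terms))


print(boolean_and(docs, "heap", "sort"))
print(boolean_or(docs, "heap", "hash"))
```

A Boolean match is all-or-nothing; vector and fuzzy queries replace it with weights and similarity scores so that results can be ranked.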

  17. Similarity Measures • Document Vector: • Different Measures: • Salton and McGill, Introduction to Modern Information Retrieval, 1984, McGraw-Hill, pp. 201-204. • Similarity uses: • Document-Document • Query-Query • Document-Query
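
One widely used vector similarity measure is the cosine of the angle between two term-weight vectors. Whether this is the specific measure shown on the original slide is an assumption; the vectors below are hypothetical term weights over a shared vocabulary.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for an empty vector
    return dot / (norm_a * norm_b)


document = [1, 2, 0]   # hypothetical weights for three index terms
query = [2, 4, 0]
print(cosine_similarity(document, query))
```

The same function serves document-document, query-query, and document-query comparison, since all three are represented as vectors over the same terms.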

  18. IR Document/Query Matching • Matching Process • Relevance and Similarity Measures • Boolean based matching • Logical match • Vector based matching • Threshold match • Probabilistic Match • P(relevant) = n / N, where n = number of relevant documents and N = total number of documents • Fuzzy Matching • Proximity Matching • Weighting • Relative Importance of Items

  19. IR Matching • Scaling • Impact of Sample Size • Clustering • Centroids • Measures • Precision • Recall
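
Precision and recall follow their standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. A minimal sketch, with hypothetical document ids:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
print(p, r)
```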

  20. IR Indexing • Text Analysis • Indexing is the assignment of keywords or terms that represent document content • Originally a library science problem that has grown with the advent of web based searches • Indexing types • Automated vs. manual • Controlled vs. uncontrolled • Single term vs. terms in context • Deep vs. shallow

  21. IR Indexing • General Steps • 1. Assignment of terms or concepts capable of representing content • 2. Assignment of a weight or value to each term • Indexing • Vector based • Start with excerpts, remove high frequency words • Stop list • Thesaurus • Compute discrimination values of terms
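
The first indexing steps (extract terms, remove high frequency words with a stop list, attach a weight) can be sketched as below. The stop list is a small hypothetical sample, and raw term counts stand in for the weight; thesaurus lookup and discrimination values are not shown.

```python
import re
from collections import Counter

# Hypothetical stop list of high frequency words
STOP_LIST = {"a", "an", "and", "the", "of", "in", "is", "to"}


def index_terms(text):
    """Return a Counter mapping each non-stop-list term to its frequency."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_LIST)


print(index_terms("The heap is a tree; the heap property."))
```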

  22. IR Retrieval • Retrieval or Classification • Vector based • Same starting point as with indexing • Compute weighting factors • Assign to each document a weighted term vector • Similarity Measures • Measure similarity between document/query • Results normalized to range between 0 and 1

  23. IR Retrieval • Inverse Document Frequency • Assumes a term's importance is proportional to its occurrence frequency within a document and inversely proportional to the number of documents in which the term occurs. • Also used for similarity measurement • Inverted Indexing of Document • Concept Hierarchy • DAG of concepts • Follow nodes from general to more specific • Tag articles with low level concepts so that each may be distinguished from ancestors
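
An inverted index and an inverse-document-frequency weight can be sketched together. The logarithmic form tf * log(N / df) is one common variant and is an assumption here, as are the sample documents.

```python
import math
from collections import defaultdict

# Hypothetical documents, already reduced to index terms
docs = {
    "d1": ["heap", "sort", "heap"],
    "d2": ["heap", "queue"],
    "d3": ["hash", "table"],
}


def inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def tf_idf(term, doc_id, docs, index):
    """Term frequency in the document times log(N / document frequency)."""
    tf = docs[doc_id].count(term)
    df = len(index.get(term, ()))
    if df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)


index = inverted_index(docs)
print(sorted(index["heap"]), tf_idf("heap", "d1", docs, index))
```

A term that appears in every document gets weight 0 under this variant, which reflects that it does not help distinguish documents.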

  24. IR Related Topics • Information Retrieval Related Topics • Text Analysis • Fuzzy Sets • Extending Databases • Hypertext • Digital Libraries • Data Mining • Web based browsers

  25. DATA WAREHOUSING AND OLAP • Preparations for Mining: Data Warehousing • Extracting the data (from RDBMS) • Storing the data • Data warehouse or data mart • Cleansing the data • Mining the data • Often with multidimensional queries • Definition • Blend of technologies • Integration • Enables Strategic Use of Data • Architecture • Figure 6.1, page 116

  26. DW Migration • Migration from Relational Database to Data Warehouse • Differences (Relational vs. Data Warehouse) • Procedure for Migration • Extraction • Cleanup • Transformation • Migration • Issues • Multiple sources • Database Heterogeneity • Data Heterogeneity

  27. DW Design • Data Warehouse Design Considerations - Nine Step Method: • Subject Matter • Fact Table contents • Dimensioning • Fact Selection • Precalculations • Rounding out dimension table • Duration selection • What about change? • Query priorities • Technical Considerations • Hardware • Communications Infrastructure • Data Structures

  28. More on DW • Benefits • Development of strategic information and resources • Hypothesis testing • Knowledge discovery • Data Marts • Definition: a mini data warehouse for data mining • Directed at a partition of data • Dedicated user group • May be physically separate • Drivers • Urgent user requirements • Small budget • Absence of sponsor • Decentralization • Smaller project size

  29. DIMENSIONAL MODELING • Dimensional Modeling • Describes relationships in the data that will be mined • Relatively new concept, still developing • A technique for visualizing data models • Schema (Star and Snowflake) • Facts - A collection of related data items, consisting of measures and context data • Dimensions - A collection of members or units of the same type of view. Axis for modeling. Sets the context for the facts. • Measures - Numeric attribute of fact (What is stored about sales data) • Focus - Tends to be on numeric data • MD Analysis vs. DM - Figure 4, R[3]

  30. Data Cube • Way to visualize facts and dimensions • Hypercube (more than 3 dimensions) • May be nested • Figure 13.1, p. 249, Berson • Figure 15, R[3]

  31. Star Schema [Figure: Sales Facts table surrounded by Time, Customer, Part No., Salesperson, and Product dimension tables] • Contains a large fact table and a surrounding set of dimension tables • A.k.a. constellation or multistar model • Figure 9.1, p. 171, Berson • Following from Figure 18, R[3]
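
A star schema can be sketched with Python's built-in `sqlite3` module: one fact table whose foreign keys point at small dimension tables. The table names, column names, and rows below are hypothetical and cover only two of the dimensions in the figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim    (time_id INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, name TEXT);
-- Fact table: one row per sale, keyed by its dimensions, with a numeric measure
CREATE TABLE sales_fact (
    time_id    INTEGER REFERENCES time_dim(time_id),
    product_id INTEGER REFERENCES product_dim(product_id),
    amount     REAL
);
INSERT INTO time_dim    VALUES (1, 'Jan'), (2, 'Feb');
INSERT INTO product_dim VALUES (10, 'widget'), (20, 'gadget');
INSERT INTO sales_fact  VALUES (1, 10, 100.0), (1, 20, 50.0), (2, 10, 75.0);
""")

# Join the fact table to one dimension and aggregate the measure
rows = conn.execute("""
SELECT t.month, SUM(f.amount)
FROM sales_fact AS f
JOIN time_dim AS t ON f.time_id = t.time_id
GROUP BY t.time_id, t.month
ORDER BY t.time_id
""").fetchall()
print(rows)
```

Queries against a star schema typically have this shape: join the fact table to the dimensions of interest, then group by dimension attributes and aggregate a measure.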

  32. Snowflake Schema [Figure: snowflake schema with Sales Facts at the center and Week, Month, Time, Customer, Part No., Salesperson, Product, Location, and Manager dimensions] • Sometimes dimensions have hierarchies among themselves • N:1 relationships among members of a dimension may be subdivided • Decomposition yields a snowflake-like schema

  33. OLAP (Online Analytical Processing) • Multidimensional database • Allows user to analyze data using elaborate, multidimensional, complex views • MOLAP - Multidimensional OLAP. Supported by specialized DBMS/software systems. (Data structures, temporal) • May not be general enough for other uses • Access limited and optimized for OLAP processing • Fig 13.3, p. 253, Berson • ROLAP - Underlying data stored in traditional (relational) DBMS and accessed by traditional query language (SQL). • Layer on top of DBMS. Middleware. • May have poor performance for OLAP applications • Fig 13.4, p. 254, Berson

  34. OLAP Operations • Move view of facts down/up dimensions • Drill Down • Roll Up • Figure 3, R[3] • Figure 16, R[3] • Look at data by partitioning the cube • Slice - Look at subcube to get more specific data • Dice - Rotate cube to look at another dimension • Figure 17, R[3]
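
Roll up and slice can be sketched on a tiny cube held in a dictionary. The dimension names, members, and sales figures are hypothetical; drill down is simply the reverse of roll up (keeping more dimensions).

```python
from collections import defaultdict

DIMS = ("month", "product", "region")

# Hypothetical cube: (month, product, region) -> sales
cube = {
    ("Jan", "widget", "East"): 100,
    ("Jan", "widget", "West"): 40,
    ("Jan", "gadget", "East"): 50,
    ("Feb", "widget", "East"): 75,
}


def roll_up(cube, keep):
    """Aggregate the measure over every dimension not listed in keep."""
    positions = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for key, value in cube.items():
        totals[tuple(key[i] for i in positions)] += value
    return dict(totals)


def slice_cube(cube, dim, member):
    """Keep only the cells where dimension dim equals member."""
    i = DIMS.index(dim)
    return {key: value for key, value in cube.items() if key[i] == member}


print(roll_up(cube, ("month",)))
print(roll_up(cube, ("month", "product")))   # drill down from month to product
print(slice_cube(cube, "region", "West"))
```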
