Data Mining: Current Status and Research Directions

Presentation Transcript


  1. Data Mining: Current Status and Research Directions • Jiawei Han, Intelligent Database Systems Research Lab, School of Computing Science, Simon Fraser University, Canada • http://www.cs.sfu.ca/~han • Data Mining: Status and Directions

  2. Outline • Why is data mining hot? • Current status: Major technical progress • Is data mining flying high, or not? • How to fly data mining high? Research directions on data mining

  3. Why Is Data Mining Hot? • Data mining (knowledge discovery in databases) • Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information (knowledge) or patterns from data in large databases or other information repositories • Necessity is the mother of invention • Data is everywhere, so data mining should be everywhere, too! • Understanding and using data: an imminent task!

  4. Data, Data, Everywhere!! • Relational databases: a commodity of every enterprise • Huge data warehouses are under construction • POS (point of sale): Transactional DBs in terabytes • Object-relational databases, distributed, heterogeneous, and legacy databases • Spatial databases (GIS), remote sensing databases (EOS), and scientific/engineering databases • Time-series data (e.g., stock trading) and temporal data • Text (documents, emails) and multimedia databases • WWW: A huge, hyper-linked, dynamic, global information system

  5. Data Mining Is Everywhere, Too! A Multi-Dimensional View of Data Mining • Databases to be mined • Relational, transactional, object-relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW, etc. • Knowledge to be mined • Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc. • Techniques utilized • Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc. • Applications adapted • Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.

  6. Data Mining: Confluence of Multiple Disciplines • [Figure: data mining at the center, drawing on database technology, statistics, machine learning (AI), visualization, information science, and other disciplines]

  7. Data Mining: One Can Trace It Back to Early Civilization • Most scientific discoveries involve “data mining” • Kepler’s laws, Newton’s laws, periodic table of chemical elements, …, from “big bang” to DNA • Statistics: A discipline dedicated to data analysis • Then why data mining? What are the differences? • Huge amounts of data: gigabytes to terabytes • Fast computers: quick response, interactive analysis • Multi-dimensional, powerful, thorough analysis • High-level, “declarative”: user’s ease and control • Automated or semi-automated: mining functions hidden or built into many systems

  8. A Brief History of Data Mining Activities • 1989 IJCAI Workshop on Knowledge Discovery in Databases • Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991) • 1991-1994 Workshops on Knowledge Discovery in Databases • Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996) • 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98) • Journal of Data Mining and Knowledge Discovery (1997) • 1998 ACM SIGKDD, SIGKDD’1999-2001 conferences, and SIGKDD Explorations • More conferences on data mining • PAKDD, PKDD, SIAM-Data Mining, (IEEE) ICDM, DaWaK, SPIE-DM, etc.

  9. Research Progress in the Last Decade • Multi-dimensional data analysis: Data warehouse and OLAP (on-line analytical processing) • Association, correlation, and causality analysis • Classification: scalability and new approaches • Clustering and outlier analysis • Sequential patterns and time-series analysis • Similarity analysis: curves, trends, images, texts, etc. • Text mining, Web mining and Weblog analysis • Spatial, multimedia, scientific data analysis • Data preprocessing and database compression • Data visualization and visual data mining • Many others, e.g., collaborative filtering

  10. Multi-Dimensional Data Analysis • Data warehousing: integration from heterogeneous or semi-structured databases • Multi-dimensional modeling of data: star & snowflake schemas • Efficient and scalable computation of data cubes or iceberg cubes • OLAP (on-line analytical processing): drilling, dicing, slicing, etc. • Discovery-driven exploration of data cubes • From OLAP to OLAM: A multi-dimensional view for on-line analytical mining
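
To make the data cube and iceberg cube idea concrete, here is a minimal sketch that is not taken from the slides. It counts every group-by over every subset of the dimensions and keeps only the cells whose count reaches a threshold. The function name `iceberg_cube`, the sample `sales` rows, and the threshold are all illustrative assumptions, and the brute-force enumeration is deliberately naive; the scalable cube algorithms the slide alludes to avoid materializing every group-by.

```python
from collections import Counter
from itertools import combinations


def iceberg_cube(rows, dims, min_count):
    """Return {(group_by_dims, cell_values): count} for cells with count >= min_count.

    Naive illustration: every subset of `dims` is aggregated separately.
    """
    cube = {}
    for k in range(len(dims) + 1):
        for group in combinations(dims, k):
            # Aggregate: count rows per combination of values on this group-by.
            counts = Counter(tuple(row[d] for d in group) for row in rows)
            for cell, n in counts.items():
                if n >= min_count:  # the "iceberg" condition
                    cube[(group, cell)] = n
    return cube


# Hypothetical fact table with three dimensions.
sales = [
    {"city": "Vancouver", "item": "milk", "month": "Feb"},
    {"city": "Vancouver", "item": "milk", "month": "Mar"},
    {"city": "Vancouver", "item": "bread", "month": "Feb"},
    {"city": "Burnaby", "item": "milk", "month": "Feb"},
]

cube = iceberg_cube(sales, ("city", "item", "month"), min_count=2)
for (group, cell), n in sorted(cube.items()):
    print(group, cell, n)
```

The empty group-by `()` is the apex of the cube (the grand total); rolling up from `("city", "item")` to `("city",)` corresponds to dropping a dimension from the key.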

  11. Association and Frequent Pattern Analysis • Efficient mining of frequent patterns and association rules: • Apriori and FP-growth algorithms • Multi-level, multi-dimensional, quantitative association mining • From association to correlation, sequential patterns, partial periodicity, cyclic rules, ratio rules, etc. • Query and constraint-based association analysis
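
As an illustration of the Apriori approach named above, the following is a small level-wise sketch, not the author's code. It relies on the Apriori property that every subset of a frequent itemset must itself be frequent. The `baskets` data and the absolute `min_support` count are assumptions made for the example, and no attempt is made at the scalability tricks that real implementations need.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets.

    `min_support` is an absolute count of transactions (an assumption of
    this sketch; thresholds are often given as percentages instead).
    """
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Count the support of each candidate k-itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-itemset candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical market-basket data.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

frequent = apriori(baskets, min_support=3)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

FP-growth reaches the same set of frequent itemsets without candidate generation by compressing the transactions into a prefix tree; that variant is not shown here.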

  12. Classification: Scalable Methods and Handling of Complex Types of Data • Classification has been an essential theme in machine learning and statistics research • Decision trees, Bayesian classification, neural networks, k-nearest neighbors, etc. • Tree pruning, boosting, and bagging techniques • Efficient and scalable classification methods • Exploration of attribute-class pairs • SLIQ, SPRINT, RainForest, BOAT, etc. • Classification of semi-structured and unstructured data • Classification by clustering association rules (ARCS) • Association-based classification • Web document classification

  13. Clustering and Outlier Analysis • Partitioning methods • k-means, k-medoids, CLARANS • Hierarchical methods: micro-clusters • BIRCH, CURE, Chameleon • Density-based methods: • DBSCAN, OPTICS, and DENCLUE • Grid-based methods • STING, CLIQUE, WaveCluster • Outlier analysis: • statistics-based, distance-based, deviation-based • Constraint-based clustering • COD (Clustering with Obstructed Distance) • User-specified constraints
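
Of the partitioning methods listed above, k-means is the simplest to sketch. The version below is an illustrative outline only: the starting centroids are passed in explicitly (an assumption made so that the run is deterministic), the sample points are invented, and Euclidean distance is used throughout.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Plain k-means: alternate assignment and centroid update until stable."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D data with two well-separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

k-medoids differs in that each cluster is represented by one of its actual members rather than by a mean, which makes it less sensitive to outliers.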

  14. Sequential Patterns and Time-Series Analysis • Trend analysis • Trend movement vs. cyclic variations, seasonal variations and random fluctuations • Similarity search in time-series databases • Handling gaps, scaling, etc. • Indexing methods and query languages for time-series • Sequential pattern mining • Various kinds of sequences, various methods • From GSP to PrefixSpan • Periodicity analysis • Full periodicity, partial periodicity, cyclic association rules

  15. Similarity Search: Similar Curves, Trends, Images, and Texts • Various kinds of data, various similarity mining methods • Discovery of similar trends in time-series data • Data transformation & high-dimensional structures • Finding similar images based on color, texture, etc. • Content-based vs. keyword-based retrieval • Color histogram-based signature • Multi-feature composed signature • Finding documents with similar texts • Similar keywords (synonymy & polysemy) • Term frequency matrix • Latent semantic indexing
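
The term frequency idea for finding documents with similar texts can be sketched as follows. This is an assumption-laden illustration: documents are tokenized by whitespace only, each row of the term frequency matrix is held as a `Counter`, and cosine similarity is the chosen measure. The three sample documents are invented. Latent semantic indexing would additionally apply a singular value decomposition to the matrix to cope with synonymy and polysemy; that step is not shown.

```python
import math
from collections import Counter


def term_frequencies(text):
    """One row of a term frequency matrix: term -> number of occurrences."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents.
d1 = term_frequencies("data mining finds patterns in data")
d2 = term_frequencies("mining patterns in large data")
d3 = term_frequencies("stock prices fell sharply")

print(cosine_similarity(d1, d2))  # shares several terms with d1
print(cosine_similarity(d1, d3))  # shares no terms with d1
```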

  16. Spatial, Multimedia, Scientific Data Analysis • Multi-dimensional analysis of spatial, multimedia and scientific data • Geo-spatial data cube and spatial OLAP • The curse of dimensionality problem • Association analysis • A progressive refinement methodology • Micro-clustering can be used for preprocessing in the analysis of complex types of data • Classification • Association-based for handling high-dimensional and sparse data

  17. Data Mining Industry and Applications • From research prototypes to data mining products, languages, and standards • IBM Intelligent Miner, SAS Enterprise Miner, SGI MineSet, Clementine, MS/SQLServer 2000, DBMiner, BlueMartini, MineIt, DigiMine, etc. • A few data mining languages and standards (esp. MS OLEDB for Data Mining). • Application achievements in many domains • Market analysis, trend analysis, fraud detection, outlier analysis, Web mining, etc.

  18. Is Data Mining Flying? Or Not?? • Data mining is flying • R&D has made great strides • Applications have been broadened substantially • But not as high as some may have hoped. Why not? • Hoping to see billions of dollars within years? • A young, up-and-coming technology, not hype! • Not bread-and-butter but a value-added service • DBMS, WWW, and other information systems will still be a “data mining” aircraft carrier • Not off-the-shelf in nature • Needs training, understanding, and customizing (redevelopment) • A young technology: much R&D is needed to fly high • Much research, development, and real problem solving!

  19. How to Fly Data Mining High? Research Directions • Web mining • Towards integrated data mining environments and tools • “Vertical” (or application-specific) data mining • Invisible data mining • Towards intelligent, efficient, and scalable data mining methods

  20. Web Mining: A Fast-Expanding Frontier in Data Mining • Mine what Web search engines find • Automatic classification of Web documents • Discovery of authoritative Web pages, Web structures and Web communities • Meta-Web Warehousing: Web yellow page service • Web usage mining

  21. Mine What Web Search Engines Find • Current Web search engines: A convenient source for mining • keyword-based; return too many, often low-quality answers; still miss a lot; not customized, etc. • Data mining will help: • coverage: “Enlarge and then shrink,” using synonyms and conceptual hierarchies • better search primitives: user preferences/hints • linkage analysis: authoritative pages and clusters • Web-based languages: XML + WebSQL + WebML • customization: home page + Weblog + user profiles

  22. Discovery of Authoritative Pages in WWW • PageRank method (Brin and Page, 1998): • Rank the "importance" of Web pages, based on a model of a "random browser." • Hub/authority method (Kleinberg, 1998): • Prominent authorities often do not endorse one another directly on the Web. • Hub pages have a large number of links to many relevant authorities. • Thus hubs and authorities exhibit a mutually reinforcing relationship. • Both the PageRank and hub/authority methodologies have been shown to provide qualitatively good search results for broad query topics on the WWW.
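
The "random browser" model behind PageRank can be sketched as a power iteration: at each step the browser follows a random outgoing link with some probability and otherwise jumps to a random page. The sketch below is an illustration under stated assumptions, not the published algorithm in full: the damping value 0.85, the fixed iteration count, the handling of pages without outgoing links, and the four-page `web` graph are all choices made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores by repeated redistribution of rank.

    `links` maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Random-jump share: every page receives the same base amount.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Follow-a-link share: split this page's rank among its targets.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Page with no outgoing links: spread its rank over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank


# Hypothetical link graph: page -> pages it links to.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 4))
```

In this graph page C collects links from A, B, and D and so ends up ranked highest, while D, which nothing links to, keeps only the random-jump share. The hub/authority method instead maintains two scores per page and updates them in alternation; it is not shown here.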

  23. Automatic Classification of Web Documents • Web document classification: • Good human classification: Yahoo!, CS term hierarchies • These classifications can be used as training sets to build learning models • Keyword-based classification is different from multi-dimensional classification • Association or clustering-based classification is often more effective • Multi-level classification is important

  24. A Multiple-Layered Meta-Web Architecture • [Figure: a stack of layers, with Layer 0 at the bottom, Layer 1 (generalized descriptions) above it, ..., up to Layer n (more generalized descriptions)]

  25. Web Yellow Page Service: A Multi-Layer, Meta-Web Approach • XML: facilitates structured and meta-information extraction • Automatic classification of Web documents: • based on Yahoo!, etc. as training set + keyword-based correlation/classification analysis (IR/AI assistance) • Automatic ranking of important Web pages • authoritative site recognition and clustering of Web pages • Generalization-based multi-layer meta-Web construction • With the assistance of clustering and classification analysis • Meta-Web can be warehoused and incrementally updated • Querying and mining can be performed on or assisted by meta-Web

  26. Importance of Constructing Multi-Layer Meta-Web • Benefits of Multi-Layer Meta-Web: • Multi-dimensional Web info summary analysis • Approximate and intelligent query answering • Web high-level query answering (WebSQL, WebML) • Web content and structure mining • Observing the dynamics/evolution of the Web • Is it realistic to construct such a meta-Web? • It is beneficial even if only partially constructed • The benefit may justify the cost of tool development, standardization, and partial restructuring

  27. Web Usage (Click-Stream) Mining • Weblogs provide rich information about Web dynamics • Multidimensional Weblog analysis: • disclose potential customers, users, markets, etc. • Plan mining (mining general Web accessing regularities): • Web linkage adjustment, performance improvements • Web accessing association/sequential pattern analysis: • Web caching, prefetching, swapping • Trend analysis: • Dynamics of the Web: what has been changing? • Customized to individual users

  28. Towards Integrated Data Mining Environments and Tools • OLAP Mining: Integration of Data Warehousing and Data Mining • Querying and Mining: An Integrated Information Analysis Environment • Basic Mining Operations and Mining Query Optimization • “Vertical” (or application-specific) data mining • Invisible data mining

  29. OLAP Mining: An Integration of Data Mining and Data Warehousing • Coupling of data mining systems, DBMS, and data warehouse systems • No coupling, loose coupling, semi-tight coupling, tight coupling • On-line analytical mining of data • integration of mining and OLAP technologies • Interactive mining of multi-level knowledge • Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc. • Integration of multiple mining functions • Characterized classification, first clustering and then association

  30. An OLAM Architecture • [Figure: four layers. Layer 4, user interface: the user submits a mining query and receives the mining result through a user GUI API. Layer 3, OLAP/OLAM: an OLAM engine and an OLAP engine sit on a data cube API. Layer 2, MDDB: multidimensional databases and meta data, reached from Layer 1 through a database API with filtering and integration. Layer 1, data repository: databases and a data warehouse, with data cleaning, data integration, and filtering]

  31. Querying and Mining: An Integrated Information Analysis Environment • Data mining as a component of DBMS, data warehouse, or Web information system • Integrated information processing environment • MS/SQLServer-2000 (Analysis service) • IBM IntelligentMiner on DB2 • SAS EnterpriseMiner: data warehousing + mining • Query-based mining • Querying database/DW/Web knowledge • Efficiency and flexibility: preprocessing, on-line processing, optimization, integration, etc.

  32. Basic Mining Operations and Mining Query Optimization • Relational databases: There is a set of basic relational operations and a standard query language, SQL • E.g., selection, projection, join, set difference, intersection, Cartesian product, etc. • Is there a set of standard data mining operations on which optimizations can be done? • Difficulty: differing definitions of operations • Importance: optimization can be performed on them systematically; standardization facilitates information exchange and system interoperability

  33. “Vertical” Data Mining • Generic data mining tools? Too simple to match domain-specific, sophisticated applications • Expert knowledge and business logic represent many years of work in their own fields! • Data mining + business logic + domain experts • A multi-dimensional view of data miners • Complexity of data: Web, sequence, spatial, multimedia, … • Complexity of domains: DNA, astronomy, market, telecom, … • Domain-specific data mining tools • Provide concrete, killer solutions to specific problems • Feedback to build more powerful tools

  34. Invisible Data Mining • Build mining functions into daily information services • Web search engines (link analysis, authoritative pages, user profiles), adaptive Web sites, etc. • Improvement of query processing: history + data • Making services smart and efficient • Benefits from/to data mining research • Data mining research has produced many scalable, efficient, novel mining solutions • Applications feed new challenge problems to research

  35. Towards Intelligent Tools for Data Mining • Integration paves the way to intelligent mining • Smart interface brings intelligence • Easy to use, understand and manipulate • One picture may be worth 1,000 words • Visual and audio data mining • Human-Centered Data Mining • Towards self-tuning, self-managing, self-triggering data mining

  36. Integrated Mining: A Booster for Intelligent Mining • Integration paves the way to intelligent mining • Data mining integrates with DBMS, DW, WebDB, etc. • Integration inherits the power of up-to-date information technology: querying, MD analysis, similarity search, etc. • Mining can be viewed as querying database knowledge • Integration leads to standard interface/language, function/process standardization, utility, and reachability • Efficiency and scalability bring intelligent mining to reality

  37. One Picture May Be Worth 1000 Words! • Visual Data Mining • Visualization of data • Visualization of data mining results • Visualization of data mining processes • Interactive data mining: visual classification • One melody may be worth 1000 words too! • Audio data mining: turn data into music and melody! • Uses audio signals to indicate the patterns of data or the features of data mining results

  38. Visualization of data mining results in SAS Enterprise Miner: scatter plots

  39. Visualization of association rules in MineSet 3.0

  40. Visualization of a decision tree in MineSet 3.0

  41. Visualization of Data Mining Processes by Clementine

  42. Interactive Visual Mining by Perception-Based Classification (PBC)

  43. Human-Centered Data Mining • Finding all the patterns autonomously in a database? Unrealistic, because the patterns could be too many and mostly uninteresting • Data mining should be an interactive process • The user directs what is to be mined • Users must be provided with a set of primitives for communicating with the data mining system, i.e., a data mining query language • The user should provide constraints on what is to be mined • The system should use such constraints to guide the mining process (constraint-based mining or mining query optimization)

  44. Constraint-Based Mining • What kinds of constraints can be used in mining? • Knowledge type constraint: classification, association, etc. • Data constraint: SQL-like queries • Find products sold together in Vancouver in Feb.’01. • Dimension/level constraints: • relevant to region, price, brand, customer category. • Rule constraints: • small sales (price < $10) trigger big sales (sum > $200). • Interestingness constraints: • E.g., strong rules (min_support ≥ 3%, min_confidence ≥ 60%, min_lift > 3.0).
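
The interestingness constraints above can be checked mechanically once support, confidence, and lift are computed for a rule. The sketch below is illustrative only: the helper names `rule_metrics` and `is_strong` and the `baskets` data are assumptions, while the thresholds in the usage lines are the ones quoted on the slide (3% support, 60% confidence, lift above 3.0).

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent => consequent."""
    a, c = frozenset(antecedent), frozenset(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= t)         # contain the antecedent
    n_c = sum(1 for t in transactions if c <= t)         # contain the consequent
    n_ac = sum(1 for t in transactions if (a | c) <= t)  # contain both
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift


def is_strong(metrics, min_support, min_confidence, min_lift):
    """Apply the interestingness constraints to one rule's metrics."""
    support, confidence, lift = metrics
    return (support >= min_support
            and confidence >= min_confidence
            and lift > min_lift)


# Hypothetical market-basket data.
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "diapers", "beer", "eggs"}),
    frozenset({"milk", "diapers", "beer", "cola"}),
    frozenset({"bread", "milk", "diapers", "beer"}),
    frozenset({"bread", "milk", "diapers", "cola"}),
]

metrics = rule_metrics(baskets, {"diapers"}, {"beer"})
print(metrics)
print(is_strong(metrics, min_support=0.03, min_confidence=0.60, min_lift=3.0))
```

On this toy data the rule clears the support and confidence thresholds but not the lift threshold, which shows why lift is a useful extra filter: beer is already common, so diapers raise its likelihood only modestly. In constraint-based mining such checks are pushed into the mining process rather than applied afterwards, where the constraint's properties allow it.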

  45. Rule Constraints: A Classification • [Figure: classes of rule constraints: succinctness, anti-monotonicity, monotonicity, convertible constraints, inconvertible constraints]

  46. Constraint-Based Clustering Analysis • User-specified constraints: no cluster has fewer than 1,000 gold customers • Resource allocation (clustering) with obstacles

  47. Towards Automated Data Mining? • It is not realistic to automatically find all the knowledge in a large database • Thus we promote human-centered, constraint-based mining • However, to achieve genuinely intelligent data mining, the data mining process should be self-tuning, self-managing, and self-triggering • Functions should be developed to achieve such performance

  48. Conclusions • Data mining: a promising research frontier • Data mining research has made great strides in the last decade • However, data mining, as an industry, has not been flying as high as expected • Much research and application exploration is needed • Web mining • Towards integrated data mining environments and tools • Towards intelligent, efficient, and scalable data mining methods

  49. Thank you! • http://www.cs.sfu.ca/~han • http://db.cs.sfu.ca

  50. References • J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001. • J. Han, L. V. S. Lakshmanan, and R. T. Ng, "Constraint-Based, Multidimensional Data Mining", COMPUTER (special issue on Data Mining), 32(8): 46-50, 1999.
