
A Brief Overview of Data Mining

IR Group Meeting, 04/11/2006. Qiaozhu Mei.



  1. A Brief Overview of Data Mining - IR Group Meeting 04/11/2006 Qiaozhu Mei

  2. Outline • Introduction • Functionalities • Hot topics • Research Groups • Useful Resources

  3. Part 1: Introduction • Introduction • What is data mining? • General Process • Related Fields • Different Views • Functionalities • Hot topics • Research Groups • Useful Resources

  4. What is Data Mining? • (From Prof. Jiawei Han’s slides): Data mining (knowledge discovery from data) • Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data • (From Prof. Sunita Sarawagi’s slides): Process of semi-automatically analyzing large databases to find patterns that are • valid: hold on new data with some certainty • novel: non-obvious to the system • useful: should be possible to act on the item • understandable: humans should be able to interpret the pattern • (From Prof. Vipin Kumar’s slides): Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

  5. What is Data Mining? (cont.) • Under these definitions: • What is not Data Mining? • Looking up a phone number in a phone directory • Querying a Web search engine for information about “Amazon” • What is Data Mining? • Discovering that certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area) • Grouping together similar documents returned by a search engine according to their context - Tan, Steinbach, Kumar, Introduction to Data Mining

  6. General Process of KDD • Data mining: core of knowledge discovery process • [Figure: KDD pipeline: Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge] - Han & Kamber, Data Mining: Concepts and Techniques

  7. Related Fields • Confluence of Multiple Disciplines • [Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Algorithm, Other Disciplines] - Han & Kamber, Data Mining: Concepts and Techniques • Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems • But different… • [Figure: Data Mining overlapping Statistics/AI, Machine Learning/Pattern Recognition, Database systems] - Tan, Steinbach, Kumar, Introduction to Data Mining

  8. Differences from Related Fields • Traditional techniques may be unsuitable due to • Enormity of data • High dimensionality of data • Heterogeneous, distributed nature of data • Overlaps with machine learning, statistics, artificial intelligence, databases, visualization, but puts more stress on • scalability in the number of features and instances • algorithms and architectures, whereas the foundations of methods and formulations are provided by statistics and machine learning • automation for handling large, heterogeneous data • From Prof. Vipin Kumar’s slides • From Prof. Sunita Sarawagi’s slides

  9. Different Views of Data Mining • Categorize a data mining task from different views • By general functionality and operations: • Descriptive data mining • Find human-interpretable patterns that describe the data. • Clustering / similarity matching • Association rules and variants • Deviation detection • Predictive data mining • Use some variables to predict unknown or future values of other variables. • Regression • Classification • Collaborative Filtering

  10. Different Views of Data Mining (II) • By data to be mined • Relational, data warehouse, transactional, stream, object-oriented, sequence, graph, social network, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW • By knowledge to be discovered • Characterization, discrimination, frequent patterns, association, classification, clustering, trend/deviation, outlier analysis, etc • By techniques utilized • Database-oriented, data warehouse (OLAP), combinational algorithms, machine learning, statistics, visualization, etc. • By application adapted • Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc. - Han & Kamber, Data Mining: Concepts and Techniques

  11. Part 2: Functionalities • Introduction • Functionalities • Data Warehousing and OLAP • Frequent patterns, association, correlation and causality • Classification and prediction • Clustering • Outlier analysis, Trend and evolution analysis • Hot topics • Research Groups • Useful Resources

  12. Data Warehousing and OLAP • Data Warehousing: • “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon) • OLAP: on-line analytical processing • Major task of data warehouse system • Data analysis and decision making • Drill-down, roll-up, exception/discovery driven • Methodology • Data Cubing • Iceberg cube • Multi-way, BUC, Star, MM, shell, close-cube, etc. • [Figure: cuboid lattice over the dimensions product, date, country: all; product; date; country; (product, date); (product, country); (date, country); (product, date, country)] - Han & Kamber, Data Mining: Concepts and Techniques
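The cuboid lattice on the OLAP slide (all, product, date, country, and their combinations) can be sketched as a brute-force full cube. This is only an illustration under stated assumptions: the function name `full_cube`, the sample rows, and the `units` measure are hypothetical, and none of the scalable cubing methods named on the slide (Multi-way, BUC, Star, etc.) is implemented here.

```python
from collections import defaultdict
from itertools import combinations

def full_cube(rows, dimensions, measure):
    """Aggregate `measure` over every subset of `dimensions` (one cuboid each)."""
    cube = {}
    for r in range(len(dimensions) + 1):
        for dims in combinations(dimensions, r):
            cuboid = defaultdict(float)
            for row in rows:
                # Group-by key: the row's values on the chosen dimensions.
                cuboid[tuple(row[d] for d in dims)] += row[measure]
            cube[dims] = dict(cuboid)
    return cube

# Hypothetical sales records, for illustration only.
sales = [
    {"product": "tv", "date": "2006-Q1", "country": "US", "units": 10},
    {"product": "tv", "date": "2006-Q2", "country": "CA", "units": 5},
    {"product": "pc", "date": "2006-Q1", "country": "US", "units": 7},
]
cube = full_cube(sales, ("product", "date", "country"), "units")
```

The empty key `()` is the apex ("all") cuboid, and `cube[("product",)]` is a roll-up of the base cuboid `cube[("product", "date", "country")]`. With d dimensions the sketch materializes 2^d cuboids, which is the cost that iceberg and shell cubes try to avoid.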

  13. Frequent Patterns and Associations • Frequent pattern: a pattern (itemsets, subsequences, substructures, etc.) that occurs frequently in a data set • Compare with n-grams, phrases, etc. • Motivation: Finding inherent regularities in data • Applications: Basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log (click stream) analysis, and DNA sequence analysis • Association rule mining: • Given a set of records, each of which contains some number of items from a given collection; • Produce dependency rules which will predict the occurrence of an item based on occurrences of other items. • Frequent pattern → association rules → correlations
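A dependency rule of the kind described above is usually scored by support and confidence. The slide does not define these measures, so the sketch below uses the standard textbook definitions as an assumption; the helper names and the basket data are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from transaction counts."""
    antecedent = set(antecedent)
    both = antecedent | set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_both = sum(1 for t in transactions if both <= set(t))
    return n_both / n_antecedent if n_antecedent else 0.0

# Hypothetical market baskets, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]
rule_support = support(baskets, {"diaper", "beer"})
rule_confidence = confidence(baskets, {"diaper"}, {"beer"})
```

For the rule {diaper} → {beer}, three of the five baskets contain both items and four contain diaper, so the support is 3/5 and the confidence is 3/4.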

  14. Mining Frequent Patterns • Types of data: • Itemsets, sequences, graphs. • Scalable mining methods: Three major approaches • Apriori (Agrawal & Srikant @VLDB’94) • FP-growth (Han, Pei & Yin @SIGMOD’00) • Other pattern-growth methods: PrefixSpan, CloSpan, gSpan, CloseGraph, etc. • Vertical data format approach (Charm, Zaki & Hsiao @SDM’02) • Apriori: • Candidate pattern generation and pruning • Breadth-first search over pattern space • FP-growth: • Pattern growth through FP-tree, no candidate generation • Depth-first search, doing pruning smartly
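The Apriori bullets (candidate generation, pruning, breadth-first search) can be sketched as follows. This is a minimal teaching version, not the optimized algorithm from the VLDB'94 paper: the function name, the absolute `min_support` count, and the sample baskets are assumptions, and candidates are joined naively rather than by prefix.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset occurring in at least
    `min_support` transactions (an absolute count, by assumption)."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        candidates = set()
        # Candidate generation: join pairs of frequent (k-1)-itemsets.
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) != k:
                continue
            # Pruning: every (k-1)-subset of a candidate must be frequent.
            if all(frozenset(s) in frequent for s in combinations(union, k - 1)):
                candidates.add(union)
        # Breadth-first: one pass over the data per itemset size.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result

# Hypothetical market baskets, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]
patterns = apriori(baskets, min_support=3)
```

On this data the candidate {bread, diaper, beer} is never counted, because its subset {bread, beer} is infrequent; that is the pruning step at work.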

  15. Classification and Prediction • Supervised Learning, already discussed in Machine Learning • Classification: constructs a model from the training set and the values (categorical class labels) of a classifying attribute, then uses the model to classify new data • Prediction: models continuous-valued functions, i.e., predicts unknown or missing values • Algorithms: • Decision Tree based: C4.5, ID3, RainForest, etc. • Bayesian Methods: Naïve Bayes, Bayesian networks, and many others covered in Machine Learning • Discriminative: Perceptron/Winnow, NN, SVM, CB-SVM, etc. • Rule-based, Associative, k-NN, etc. • Prediction: Regression • Bagging, Boosting, Model Selection, Cross-Validation

  16. Clustering • Unsupervised Learning, as discussed in Machine Learning • Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that • Data points in one cluster are more similar to one another. • Data points in separate clusters are less similar to one another. • Similarities/distances: many! • Algorithms: • Partition based: K-means, K-Medoids, CLARA, etc • Hierarchical: Bottom-up (single/complete/average link), top-down, Birch • Density-based/Grid-based: DBSCAN, DENCLUE, CLIQUE, etc. • Model-based: EM, COBWEB, SOM, etc. • High-Dimensional, Constraint based
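Of the partition-based algorithms listed, K-means is the simplest to sketch. The version below assumes squared Euclidean distance as the dissimilarity measure and random initial centroids; the function name, parameters, and sample points are illustrative choices, not something the slides specify.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd-style K-means on tuples of floats; returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Two hypothetical, well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
```

K-means only finds a local optimum, so on less cleanly separated data the result depends on the initial centroids; running it with several seeds is a common workaround.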

  17. Outlier, Trend and Evolution • outliers: The set of objects that are considerably dissimilar from the remainder of the data • Statistical: hypothesis testing, bug mining • Density based • Clustering based, etc • Deviation/Anomaly Detection • Fraud Detection • Trend and Evolution: • Usually coupled with outlier analysis • Basic functionalities in temporal data mining • Trend, cycle, seasonal, irregular patterns
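As a small stand-in for the statistical approach to outliers, the sketch below flags values whose z-score exceeds a threshold. This is a simple rule of thumb rather than a formal hypothesis test; the function name, the threshold of 2.0, and the readings are assumptions made for illustration.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` population standard
    deviations away from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values identical: nothing can stand out
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical measurements with one value far from the rest.
readings = [10, 11, 9, 10, 12, 10, 11, 50]
flagged = zscore_outliers(readings)
```

Because the outlier itself inflates the mean and standard deviation, z-scores in small samples are bounded, which is why the threshold here is 2.0 rather than the more familiar 3.0.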

  18. Part 3: Hot Topics • Introduction • Functionalities • Hot topics • Mining data stream, Mining time series, Spatiotemporal data mining, mining Social Networks, Sequential data mining, Graph Mining, Biology data mining, Privacy Preserving Data Mining • Text and Web mining • Research Groups • Useful Resources

  19. Mining Data Streams • Data: Data streams—continuous, ordered, changing fast, huge amount • Characteristics and Challenges : • Huge volumes • Fast changing, requires fast and real-time response • Random access is expensive — need single scan algorithms • Difficult to keep the universe — need approximations • Basic problems: • Multi-dimensional on-line analysis of streams • Mining outliers and unusual patterns in stream data • Clustering data streams • Classification of stream data

  20. Mining Data Streams (II) • Methods: • Basic: Sliding windows, Tilted time frames • Counting (FP mining, etc): • Random sampling • Approximated counting • OLAP: • Keep Critical layers in stream cube computation • Partial materialization • Outlier: exception-based exploration • Clustering: • Online microclustering and offline macroclustering • Text Related Applications: • Web logs and Web page click streams
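The slide lists random sampling as a basic stream method without naming an algorithm. One common single-scan choice is reservoir sampling, sketched here as an assumption about what such a sampler could look like; the function name and parameters are illustrative.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown
    length, using one pass and O(k) memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                reservoir[j] = item     # item i survives with probability k/(i+1)
    return reservoir

# A generator stands in for a stream that cannot be rewound.
sample = reservoir_sample((x * x for x in range(10_000)), k=5, seed=42)
```

The sampler never stores more than k items and never revisits earlier ones, which matches the single-scan requirement stated for stream algorithms.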

  21. Mining Time series • Data: Time-series database • Consists of sequences of values or events changing with time • Data is recorded at regular intervals • Characteristics and Challenges : • Characteristic time-series components: Trend, cycle, seasonal, irregular patterns • Basic Problems: • Trends discovery, Similarity Search, outlier detection, prediction and clustering

  22. Mining Time series (II) • Methods: • Statistical modeling (Regression, Spline, Mixture Model, etc) • Data transformation (DFT, DWT) • Sliding windows, Atomic matching, window stitching, Subsequence Ordering • Clustering • Text Related Applications: • Transliteration mining, Temporal text mining, word bursting, etc. - Han & Kamber, Data Mining: Concepts and Techniques
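Trend discovery by regression can be illustrated with an ordinary least-squares line fitted to a regularly sampled series. This is a minimal sketch assuming time steps 0, 1, 2, …; the function name and the synthetic series are hypothetical.

```python
def linear_trend(values):
    """Least-squares fit of y = slope * t + intercept for t = 0, 1, 2, ..."""
    n = len(values)
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    covariance = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    variance = sum((t - t_mean) ** 2 for t in range(n))
    slope = covariance / variance
    intercept = y_mean - slope * t_mean
    return slope, intercept

# Synthetic series with a known upward trend of 2 units per step.
series = [2 * t + 1 for t in range(10)]
slope, intercept = linear_trend(series)
```

Subtracting the fitted line from the series leaves the residual in which cyclic, seasonal, and irregular components can then be examined.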

  23. Spatiotemporal data mining • Data: object data sets, spatial/spatiotemporal databases and data warehouses • Characteristics and Challenges: • Generalize detailed geographic points into clustered regions, such as business, residential, industrial, or agricultural areas, according to land usage • handling objects in space that have identity and well-defined extents, locations, and relationships. • Require the merge of a set of geographic areas by spatial operations • Basic Problems: • Querying objects; distribution/cluster/correlation/evolution/trend analysis

  24. Spatiotemporal data mining (II) • Methods • GIS (Geographic Information System): Analysis and visualization of geographic data • Search, Location analysis, Terrain analysis, Distribution, Spatial analysis/statistics, Measurement • Indexing Spatial data (R-tree, etc.) • Modeling single objects with points, lines and regions • Modeling spatially related collections of objects: plane partitions and networks. • Spatiotemporal patterns, correlations, trend analysis, clustering… • Text Related Applications: • Spatiotemporal text mining; community evolution in weblogs; • Information diffusion; web evolution

  25. Special topics in Frequent Pattern Mining • Association rule mining and frequent itemset mining are pretty old topics • However, some special topics of frequent pattern mining are still hot • Sequential pattern mining • Graph mining • Pattern post-processing

  26. Sequential pattern mining • Data: sequence database • Basic problems: • Discovery of frequent subsequences (gaps allowed, in contrast to n-grams); closed subsequences • Sequence Similarity Search, Sequence Alignment • Methods: • Apriori: GSP • FP-Growth: PrefixSpan, CloSpan • BLAST, Hidden Markov models, CRF, etc. • Text Related Applications: • Most text patterns are sequential patterns • Phrase extraction, entity/relation extraction, opinion mining, etc • Biology sequence modeling - Han & Kamber, Data Mining: Concepts and Techniques

  27. Graph Mining • Data: graph databases (like social networks, but multiple graphs, more general), examples include • Chemical compounds, protein structures, program flow, XML/Web, • Directed, undirected, labeled/unlabeled, weighted, 2-D/3-D, etc. • Characteristics and Challenges: • Theoretically, most problems are of high complexity, but practically, the graphs are solvable. • Too many substructures to index • … • Basic problems • Frequent subgraph mining • Closed subgraph mining • Graph indexing by substructures • Similarity search - Han & Kamber, Data Mining: Concepts and Techniques

  28. Graph Mining (II) • Methods: • Subgraph mining: Apriori (e.g. FSG), Pattern Growth (e.g. gSpan) • gSpan: pattern growth, depth-first search, active elimination of duplicated subgraphs; flatten a graph into a sequence using depth-first search; enumerate graphs using right-most extension. • CloseGraph: mining closed subgraph patterns • gIndex: identify frequent structures, prune redundancy to maintain discriminative structures, create an index on such structures. • Similarity search: indexing; feature based similarities; estimate missing features • Text Related Applications: • Multi-resolution topic map, entity-relation network, pathway extraction, etc.

  29. Graph Mining (III): Graph Indexing & Querying • More on Graph Indexing and Similarity Search • Comparing to Text Retrieval:

  30. Graph Mining (IV): Graph Indexing & Querying • What if we want to index on phrases instead of words? • Need to extract phrases first • N-grams/sequential patterns, have to remove redundancy • E.g. “natural language processing” vs. “language processing” • Substructures are like phrases… • Can IR help? • Representation and Similarity measures? (Vector Space Models, Probabilistic models…) • How to weight features? (TF-IDF, …) • Generative models? • Query expansion? Feedback?

  31. Pattern Post-processing • Data: frequent patterns extracted by mining algorithms • Challenge: • Mining algorithms output an explosively large number of patterns • How to interpret the frequent patterns extracted • Basic Problems: • Pattern summarization • Mining compressed patterns • Top-K patterns • Pattern annotation • User-oriented ranking • Methods: • Modeling Pattern profiles, coverage and contexts • Using Clustering to summarize and compress patterns • Bridging IR/NLP and frequent pattern mining: profile, context, ranking, feedback, filtering, summarization, MMR, etc.

  32. Mining Social Networks • Data: Graphs/networks with nodes and links • Example: communication networks, webpages, citations, biological pathways, etc. • Characteristics and Challenges: • Connected Components: few • Network diameter: small • Clustering: high degree • Degree distribution: heavy-tailed • Modeling Logical/statistical dependencies • Basic Problems: • Model the generation of graphs/networks • Link based object ranking, classification, Identification, Clustering, entity resolution • Link Prediction, querying, community discovery - H. Jeong, S.P. Mason, A.-L. Barabasi, Z.N. Oltvai, Nature 411, 41-42 (2001)

  33. Mining Social Networks (II) • Methods: • Graph Generation Models: trying to derive generative models that explain the characteristics and evolution of social networks/graphs. • Vertex Ranking: PageRank, HITS, etc. • Community Detection: Hierarchical Clustering, Spectral clustering, Stochastic modeling, etc. • Link based classification: semi-supervised learning, propagation • Entity resolution: duplicate prediction, collective resolution, probabilistic models • Link Prediction: binary classification problem, local conditional probabilistic models • Substructure mining: graph pattern mining, indexing
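Vertex ranking with PageRank can be sketched as a power iteration over an adjacency dictionary. The damping factor of 0.85, the fixed iteration count, the uniform handling of dangling nodes, and the toy graph are all assumptions chosen for illustration; the slides only name the method.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank. `links` maps a node to the nodes it links to."""
    nodes = set(links) | {v for targets in links.values() for v in targets}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node receives the "random jump" share first.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            targets = links.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                for v in nodes:
                    new_rank[v] += damping * rank[u] / n
        rank = new_rank
    return rank

# Hypothetical four-page web graph.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
```

Page C collects links from A, B, and D and ends up with the highest score, while D, which nothing links to, keeps only the random-jump share.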

  34. Mining Social Networks (III) • Generative Models of social network/graph generation and evolution • Random graphs (Erdős-Rényi models) • Fix vertices, generate each edge independently with probability p • N(N-1)/2 trials of a biased coin flip, p ~ 1/N • Degree distribution is Poisson, E[d] = p(N-1); E[# of edges] = pN(N-1)/2 • Parameter: p • Graph process model: • starting with no edges, just keep adding one edge at a time • always choose next edge randomly from among all missing edges
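The random-graph model on this slide amounts to one biased coin flip per vertex pair, which the following sketch makes explicit. The function name and the use of a seeded `random.Random` are illustrative choices.

```python
import random
from itertools import combinations

def erdos_renyi(n, p, seed=None):
    """Return the edge list of a random graph on vertices 0..n-1 in which
    each of the n(n-1)/2 possible edges is present with probability p."""
    rng = random.Random(seed)
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

n, p = 200, 0.05
edges = erdos_renyi(n, p, seed=7)
expected_edges = p * n * (n - 1) / 2    # E[# of edges] from the slide
expected_degree = p * (n - 1)           # E[d] from the slide
```

Comparing `len(edges)` with `expected_edges` over several seeds is a quick way to check the formula for the expected number of edges.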

  35. Mining Social Networks (IV) • α-model (Watts-Strogatz models, Small-world) • For vertices u, v, define m(u,v) to be the number of common neighbors (so far) • Define the propensity R(u,v) of u to connect to v • if m(u,v) >= k, R(u,v) = 1 (share too many friends, must connect) • if m(u,v) = 0, R(u,v) = p (no mutual friends → no bias to connect) • else, R(u,v) = p + (m(u,v)/k)^α (1-p) → biased to connect • Generate network incrementally, with R(u,v) as the edge probability • α → ∞ is similar to Erdős-Rényi models • Need to tune parameters α, p, k
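The three cases of the propensity R(u,v) can be written as a single function of the number of mutual neighbors m. The sketch assumes the exponent on (m(u,v)/k) is α, which is how the piecewise definition is read here; the function name and the sample parameter values are hypothetical.

```python
def propensity(m, k, p, alpha):
    """Propensity R(u, v) of the alpha-model, given m mutual neighbors."""
    if m >= k:
        return 1.0                       # enough mutual friends: must connect
    if m == 0:
        return p                         # no mutual friends: baseline probability
    return p + (m / k) ** alpha * (1.0 - p)

# Hypothetical parameters: threshold k = 4, baseline p = 0.01.
values = [propensity(m, k=4, p=0.01, alpha=2.0) for m in range(6)]
nearly_random = propensity(2, k=4, p=0.01, alpha=1000.0)
```

With a very large α the middle case collapses to p for every m below k, so edges appear almost independently with probability p, which is the sense in which the model approaches a random graph.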

  36. Mining Social Networks (V) • Scale free models: N (# of vertices) is not fixed • Start with (say) two vertices connected by an edge • let Z = Σ d(j) where d(j) = degree of vertex j so far • add new vertex i with k edges back to {1, …, i-1}: i is connected back to j with probability d(j)/Z • The rich get richer… • Evaluation of generative models • Can they explain all the characteristics of social networks? • Parameter tuning? • Other models for Social network analysis • Copying model: leads to communities • Forest Fire Model • Electricity network (not generative model, but interesting)
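The growth rule for scale-free models can be sketched with a list of edge endpoints, so that a uniform draw from the list selects vertex j with probability d(j)/Z. Vertices are numbered from 0 here, and for k > 1 the sketch redraws duplicate targets, which departs slightly from exact d(j)/Z sampling; both choices, and the function name, are assumptions.

```python
import random

def preferential_attachment(n, k=1, seed=None):
    """Grow a graph to n vertices; each new vertex i attaches to up to k
    existing vertices, chosen in proportion to their current degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]          # start with two vertices joined by one edge
    endpoints = [0, 1]        # vertex j appears d(j) times in this list
    for i in range(2, n):
        targets = set()
        while len(targets) < min(k, i):
            targets.add(rng.choice(endpoints))   # probability d(j) / Z
        for j in sorted(targets):
            edges.append((j, i))
            endpoints.extend((j, i))
    return edges

edges = preferential_attachment(500, k=1, seed=3)
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
```

Sorting `degree.values()` typically shows a few early vertices with many links and a long tail of vertices with degree one, the heavy-tailed pattern mentioned among the characteristics of social networks.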

  37. Mining Social Networks (VI) • Text Related Applications: quite a lot! • Ranking webpages • Multi-resolution Concept/Topic Map • Citation Impact of scientific literature • Entity-relation extraction • Bioinformatics: Pathway extraction • Reference Reconciliation • Web structure evolution • Community discovery in Weblogs..

  38. Text and Web mining • Data: text, unstructured/semi-structured; webpages with linkages, user logs; • E.g. webpage, news, email, weblogs, scientific literature, citations, customer reviews, forums, search logs, chatting logs, legal documents, etc. • Challenges: • Modeling unstructured/semi-structured data • Coupling with Natural Language Processing • Handling high dimensionality • Handling data sparseness and ambiguity • The Web is too complicated!

  39. Text and Web mining (II) • Selected Problems: • Text categorization/clustering (Already covered in NLP and ML) • Word sense disambiguation (Covered in NLP) • Information Extraction (Covered in NLP) • Dimension Reduction (Overlapping with ML and IR) • Collaborative Filtering, User-interest modeling • Topic Detection and Tracking • Comparative Text Mining, Theme based text mining • Transliteration mining • Email clustering / spam detection • Opinion mining (Overlapping with NLP) • Social Networks Related (Already covered) • Temporal Text Mining • Vision based page segmentation / Block based search

  40. Text and Web mining (III) • Methods: Confluence of Multiple Disciplines • Database: data integration, schema matching, XML • Data mining: sequential pattern mining, association rule mining, … • IR: Search, language models, feedback, … • Machine Learning: SVD, Supervised/unsupervised learning, semi-supervised learning, Topic-models, … • NLP: POS tagging, parsing, context modeling, sentiment extraction, entity extraction, … • Statistical Learning: Bayesian methods, word bursting, time-series analysis, hypothesis testing, other statistical models, …

  41. Text and Web mining (IV) • Resolution: • Word level: Word sense disambiguation, word bursting, transliteration mining • Entity level: information extraction, entity-relation network • Pattern level: opinion mining, relation extraction • Document level: document classification/clustering • Theme level: PLSI, LDA, comparative text mining, temporal text mining/spatiotemporal text mining • Topic level: topic detection and tracking, email threading • Web level: social network, weblog mining, block based search • Selected topics will be discussed in next meeting..

  42. Part 4: Research Groups • Introduction • Functionalities • Hot topics • Research Groups • Stanford, CMU, UIUC, Wisc, Helsinki, UMN • IBM, Microsoft, MSRA, Yahoo! • Others • Useful Resources

  43. Research Groups • Rakesh Agrawal • One of the Leaders in Data Mining • Frequent patterns, Privacy Preserved Data Mining • Stanford: Jerome H. Friedman • http://www-stat.stanford.edu/~jhf/ • Strong Statistical flavor, machine learning, boosting • CMU: Christos Faloutsos • http://www.cs.cmu.edu/~christos/ • Graph mining, Social Networks, Stream data mining, Image/Multimedia mining, time-series mining • UIUC: Jiawei Han • http://www-sal.cs.uiuc.edu/~hanj/ • Many! Frequent pattern mining, graph mining, OLAP/Cubing, Stream data mining, Classification, Clustering, …

  44. Research Groups (II) • University of Helsinki: Heikki Mannila • http://www.cs.helsinki.fi/research/fdk/ • http://www.cs.helsinki.fi/u/mannila/ • Frequent itemset mining, computational biology • Wisconsin: Raghu Ramakrishnan • http://www.cs.wisc.edu/dmi/ • http://www.cs.wisc.edu/~raghu/ • Data warehousing, cubing, classification/clustering, • Minnesota: Vipin Kumar • http://www-users.cs.umn.edu/~kumar/ • Spatiotemporal data mining • IBM T.J Watson: Philip S. Yu • http://domino.research.ibm.com/comm/research.nsf/pages/r.kdd.html • http://www.research.ibm.com/people/p/psyu/index.html • Frequent pattern mining, graph mining, data streams

  45. Research Groups (III) • Microsoft Research Redmond: Surajit Chaudhuri • http://research.microsoft.com/dmx/ • Data base related, Data cleaning, etc. • Microsoft Research Redmond: Eric Brill • http://research.microsoft.com/tmsn/ • http://research.microsoft.com/~brill/ • Text Mining, Search and Navigation Research, NLP • Microsoft Research Asia: • http://research.microsoft.com/wsm/ • Web search, web/text mining • Yahoo! Research: Prabhakar Raghavan • http://research.yahoo.com/researcher.shtml • http://theory.stanford.edu/~pragh/ • Web/Text Mining, Social Networks

  46. Research Groups (IV) • IBM Webfountain • http://www.almaden.ibm.com/webfountain/ • UIC: Bing Liu • http://www.cs.uic.edu/~liub/ • Association rule mining, web/text mining • UNC: Wei Wang • http://www.cs.unc.edu/~weiwang/ • Biology data mining, frequent pattern mining • Simon Fraser: Jian Pei • http://www.cs.sfu.ca/~jpei/ • Sequential pattern mining, OLAP • National University of Singapore: Anthony K.H. Tung • http://www.comp.nus.edu.sg/~atung/ • Spatial data mining, Biology data mining • …

  47. Part 5: Useful Resources • Introduction • Functionalities • Hot topics • Research Groups • Useful Resources • Text Books • Toolkits • Conferences • Others

  48. Text Books • S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002 • R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000 • T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003 • U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996 • U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001 • J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd ed., 2006 • D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001 • T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-Verlag, 2001 • T. M. Mitchell, Machine Learning, McGraw Hill, 1997 • G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991 • P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005 • S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998 • I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2nd ed. 2005 - From Prof. Jiawei Han’s slides

  49. Toolkits • Weka: Data mining software in Java • http://www.cs.waikato.ac.nz/%7Eml/weka/ • IlliMine (Illinois Data Mining System) • http://illimine.cs.uiuc.edu/ • Data Cubing • Frequent Pattern Mining • Sequential pattern mining • Graph pattern Mining • Classification • Collected by Vipin Kumar: • http://www-users.cs.umn.edu/~kumar/dmbook/resources.htm

  50. Conferences • Other related conferences • ACM SIGMOD • VLDB • (IEEE) ICDE • WWW, SIGIR • ICML, CVPR, NIPS • Journals • Data Mining and Knowledge Discovery (DAMI or DMKD) • IEEE Trans. On Knowledge and Data Eng. (TKDE) • KDD Explorations • ACM Trans. on KDD • KDD Conferences • ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD) • SIAM Data Mining Conf. (SDM) • (IEEE) Int. Conf. on Data Mining (ICDM) • Conf. on Principles and practices of Knowledge Discovery and Data Mining (PKDD) • Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD) - From Prof. Jiawei Han’s slides
