
Data Mining: Introduction


wiedeman

Presentation Transcript


  1. Data Mining: Introduction • CENG 514

  2. What is Data Mining? • Data mining (knowledge discovery from data) • Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data • Alternative names • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

  3. What is Data Mining? Definition by Gartner Group • “Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.”

  4. What is Not Data Mining? • (Deductive) query processing • Expert systems or small ML/statistical programs

  5. Why Mine Data? Commercial Viewpoint • Lots of data is being collected and warehoused • Web data, e-commerce • purchases at department/grocery stores • Bank/Credit Card transactions • Computers have become cheaper and more powerful • Competitive Pressure is Strong • Provide better, customized services for an edge (e.g. in Customer Relationship Management)

  6. Why Mine Data? Scientific Viewpoint • Data collected and stored at enormous speeds (GB/hour) • remote sensors on a satellite • telescopes scanning the skies • microarrays generating gene expression data • scientific simulations generating terabytes of data • Traditional techniques infeasible for raw data • Data mining may help scientists • in classifying and segmenting data • in Hypothesis Formation

  7. Mining Large Data Sets - Motivation • There is often information “hidden” in the data that is not readily evident • Human analysts may take weeks to discover useful information • Much of the data is never analyzed at all • Figure: The Data Gap (total new disk (TB) since 1995 vs. number of analysts) • From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”

  8. Data Mining in Relation with Other Technologies • Diagram relating Data Mining to: Artificial Intelligence, Machine Learning, Database Management, Statistics, Visualization, Algorithms

  9. Data Mining: History of the Field • Knowledge Discovery in Databases workshops started in 1989 • Now a conference under the auspices of ACM SIGKDD • IEEE conference series started in 2001

  10. Applications • Market Analysis, Customer Relationship Management (CRM) • Churn Analysis • Risk Analysis and Management • Fraud Detection, Counter Terrorism • Network Intrusion Detection • Web Site Restructuring • Recommendation • Scientific Applications

  11. Applications: Corporate Analysis & Risk Management • Finance planning and asset evaluation • cash flow analysis and prediction • contingent claim analysis to evaluate assets • cross-sectional and time series analysis (financial-ratio, trend analysis, etc.) • Resource planning • summarize and compare the resources and spending • Competition • monitor competitors and market directions • group customers into classes and apply a class-based pricing procedure • set pricing strategy in a highly competitive market

  12. Applications: Fraud Detection & Mining Unusual Patterns • Approaches: Clustering & model construction for frauds, outlier analysis • Applications: Health care, retail, credit card service, telecomm. • Auto insurance: ring of collisions • Money laundering: suspicious monetary transactions • Medical insurance • Professional patients, ring of doctors, and ring of references • Unnecessary or correlated screening tests • Telecommunications: phone-call fraud • Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm • Anti-terrorism
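The phrase "analyze patterns that deviate from an expected norm" on the fraud-detection slide can be illustrated with a very simple outlier check. The sketch below is one possible reading, not the method the slides prescribe: it treats the mean of a subscriber's call durations as the norm and flags calls whose z-score exceeds a threshold. The function name `flag_deviations`, the threshold of 2.0, and the call durations are all assumptions made up for illustration.

```python
from statistics import mean, stdev


def flag_deviations(values, threshold=2.0):
    """Return indices of values lying more than `threshold` sample
    standard deviations away from the mean (a basic z-score test)."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        # All values identical: nothing deviates from the norm.
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical call durations (minutes) for one subscriber.
durations = [3, 4, 5, 4, 3, 5, 4, 60]
suspicious = flag_deviations(durations)
print(suspicious)  # index of the 60-minute call
```

A real phone-call model would use several attributes at once (destination, duration, time of day or week), and a single extreme value inflates the standard deviation, so robust statistics are often preferred in practice.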

  13. Example: Use in retailing • Goal: Improved business efficiency • Improve marketing (advertise to the most likely buyers) • Inventory reduction (stock only needed quantities) • Information source: Historical business data • Example: Supermarket sales records • Size ranges from 50k records (research studies) to terabytes (years of data from chains) • Data is already being warehoused • Sample question – what products are generally purchased together? • The answers are in the data, if only we could see them

  14. Example: Churn Analysis • Business Problem: Prevent loss of customers, avoid adding churn-prone customers • Solution: Use neural nets, time series analysis to identify typical patterns of telephone usage of likely-to-defect and likely-to-churn customers • Benefit: Retention of customers, more effective promotions

  15. Example: Clicks to Customers • Business problem: 50% of Dell’s clients order their computer through the web. However, the retention rate is 0.5%, i.e. only 0.5% of visitors to Dell’s web page become customers. • Solution approach: Through the sequence of their clicks, cluster customers and design website interventions to maximize the number of customers who eventually buy. • Benefit: Increase revenues

  16. What Can Data Mining Do? • Cluster • Classify (categorical, regression) • Summarize (summary statistics, summary rules) • Link Analysis / Model Dependencies (association rules) • Sequence analysis (time-series analysis, sequential associations) • Detect Deviations

  17. Clustering • Find groups of similar data items • Statistical techniques require some definition of “distance” (e.g. between travel profiles) while conceptual techniques use background concepts and logical descriptions • Example: “Group people with similar travel profiles” • Groups shown: George, Patricia; Jeff, Evelyn, Chris; Rob
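To make the role of a "distance" in statistical clustering concrete, here is a minimal k-means sketch using Euclidean distance. It is only one of many clustering techniques and is not claimed to be the one behind the slide's figure. The travel profiles (trips per year, average trip length in days) and the choice of initial centroids are invented for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, centroids, iterations=10):
    """Plain k-means: alternate nearest-centroid assignment and mean update."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids


# Hypothetical travel profiles: (trips per year, average trip length in days).
profiles = {
    "George": (2, 14), "Patricia": (3, 12),
    "Jeff": (20, 2), "Evelyn": (22, 3), "Chris": (18, 2),
    "Rob": (10, 30),
}
names = list(profiles)
initial = [profiles["George"], profiles["Jeff"], profiles["Rob"]]  # assumed seeds
assignment, centroids = k_means(list(profiles.values()), initial)

groups = {}
for name, cluster in zip(names, assignment):
    groups.setdefault(cluster, []).append(name)
print(groups)
```

With a different distance definition or different seeds, the same data could be grouped differently, which is why the choice of distance matters.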

  18. Classification • Find ways to separate data items into pre-defined groups • Requires “training data”: data items where the group is known • Example: “Route documents to most likely interested parties” • English or non-English? • Domestic or foreign?
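The dependence of classification on training data with known groups can be sketched with a k-nearest-neighbour classifier, chosen here only because it is short; the slides do not name a specific algorithm. The two document features (fraction of words found in an English word list, fraction of non-ASCII characters) and all values are hypothetical.

```python
import math
from collections import Counter


def knn_classify(training, query, k=3):
    """Label `query` by majority vote among its k nearest training items.

    `training` is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data: (english-word fraction, non-ASCII fraction) -> label.
training = [
    ((0.95, 0.00), "english"),
    ((0.90, 0.01), "english"),
    ((0.88, 0.02), "english"),
    ((0.10, 0.30), "non-english"),
    ((0.20, 0.25), "non-english"),
    ((0.15, 0.40), "non-english"),
]

label_a = knn_classify(training, (0.85, 0.03))
label_b = knn_classify(training, (0.12, 0.35))
print(label_a, label_b)
```

Unlike clustering, the groups here are fixed in advance by the labels in the training set; the algorithm only learns how to assign new items to them.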

  19. Association Rules • Identify dependencies in the data: X makes Y likely • Indicate significance of each dependency • Example: “Find groups of items commonly purchased together” • People who purchase fish are extraordinarily likely to purchase wine • People who purchase turkey are extraordinarily likely to purchase cranberries
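Two measures commonly used to indicate how significant a rule "X makes Y likely" is are support (how often X and Y occur together) and confidence (how often Y occurs given X). The sketch below computes both over a handful of invented baskets; the function names and the transactions are assumptions for illustration only.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent); 0.0 if the antecedent never occurs."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base


# Hypothetical market baskets.
transactions = [
    {"fish", "wine", "bread"},
    {"fish", "wine"},
    {"turkey", "cranberries"},
    {"fish", "milk"},
    {"bread", "milk"},
]

fish_wine_support = support(transactions, {"fish", "wine"})
fish_to_wine = confidence(transactions, {"fish"}, {"wine"})
turkey_to_cranberries = confidence(transactions, {"turkey"}, {"cranberries"})
print(fish_wine_support, fish_to_wine, turkey_to_cranberries)
```

A rule with high confidence but very low support (here, turkey and cranberries appear in a single basket) may still be unreliable, which is why both numbers are usually reported together.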

  20. Sequential Associations • Find event sequences that are unusually likely • Example: “Find common sequences of warnings/faults within 10 minute periods” • Warn 2 on Switch C preceded by Fault 21 on Switch B • Fault 17 on any switch preceded by Warn 2 on any switch
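A first step toward finding such sequences is simply counting how often one event is preceded by another within the time window. The following sketch does that by brute force for a 10-minute window; it only counts pairs and does not test whether a count is "unusually" high. The log entries and the function name `preceded_by_counts` are hypothetical.

```python
from collections import Counter


def preceded_by_counts(events, window_minutes=10):
    """Count (earlier, later) event pairs occurring within the window.

    `events` is a list of (minute, name) tuples.
    """
    events = sorted(events)
    counts = Counter()
    for i, (t_later, later) in enumerate(events):
        for t_earlier, earlier in events[:i]:
            if 0 < t_later - t_earlier <= window_minutes:
                counts[(earlier, later)] += 1
    return counts


# Hypothetical switch log: (minute, event).
log = [
    (0, "Fault 21 on Switch B"),
    (4, "Warn 2 on Switch C"),
    (30, "Fault 21 on Switch B"),
    (37, "Warn 2 on Switch C"),
    (60, "Warn 2 on Switch C"),
]

counts = preceded_by_counts(log)
print(counts.most_common())
```

The nested loop is quadratic in the number of events; for long logs, a sliding window over the sorted events would avoid rescanning old entries.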

  21. Recommendation Techniques • Given a database of user preferences, predict the preference of a new user • Example: Predict what new movies you will like based on • your past preferences • others with similar past preferences • their preferences for the new movies • Predict what books/CDs a person may want to buy (and suggest it, or give discounts to tempt the customer)
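The three ingredients listed on the recommendation slide (your past preferences, others with similar past preferences, and their preferences for the new item) map directly onto user-based collaborative filtering. The sketch below is a minimal version under stated assumptions: similarity is defined ad hoc as 1 / (1 + mean absolute rating difference) on co-rated items, and the users, movies, and ratings are invented.

```python
def similarity(a, b):
    """Ad hoc similarity in (0, 1]: higher when shared ratings agree."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    mean_diff = sum(abs(a[i] - b[i]) for i in shared) / len(shared)
    return 1.0 / (1.0 + mean_diff)


def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        w = similarity(ratings[user], theirs)
        num += w * theirs[item]
        den += w
    return num / den if den else None


# Hypothetical ratings on a 1-5 scale; "ann" has not seen movie "D".
ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 5, "B": 5, "C": 1, "D": 5},
    "cat": {"A": 1, "B": 2, "C": 5, "D": 1},
}

predicted = predict(ratings, "ann", "D")
print(round(predicted, 2))
```

Because ann's past ratings resemble bob's far more than cat's, bob's opinion of "D" dominates the prediction.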

  22. Knowledge Discovery in Databases: Process • Data → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Data Mining) → Patterns → (Interpretation/Evaluation) → Knowledge • Adapted from: U. Fayyad, et al. (1995), “From Knowledge Discovery to Data Mining: An Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press

  23. Data Mining and Business Intelligence • Layered diagram; potential to support business decisions increases toward the top • Making Decisions • Data Presentation: Visualization Techniques • Data Mining: Information Discovery • Data Exploration: Statistical Analysis, Querying and Reporting • Data Warehouses / Data Marts: OLAP, MDA • Data Sources: Paper, Files, Information Providers, Database Systems, OLTP • Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA

  24. Key Steps in Data Mining • Learning the application domain • relevant prior knowledge and goals of application • Creating a target data set: data selection • Data cleaning and preprocessing: (may take 60% of effort!) • Data reduction and transformation • Find useful features, dimensionality/variable reduction, invariant representation • Choosing functions of data mining • summarization, classification, regression, association, clustering • Choosing the mining algorithm(s) • Data mining: search for patterns of interest • Pattern evaluation and knowledge presentation • visualization, transformation, removing redundant patterns, etc. • Use of discovered knowledge

  25. Major Issues • Mining methodology • Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web • Performance: efficiency, effectiveness, and scalability • Pattern evaluation: the interestingness problem • Incorporation of background knowledge • Data quality: handling noise and incomplete data • Parallel, distributed and incremental mining methods • Integration of the discovered knowledge with existing knowledge: knowledge fusion • User interaction • Data mining query languages and ad-hoc mining • Expression and visualization of data mining results • Interactive mining of knowledge at multiple levels of abstraction • Applications and social impacts • Domain-specific data mining & invisible data mining • Protection of data security, integrity, and privacy

  26. What is Data? • Collection of data objects and their attributes • An attribute is a property or characteristic of an object • Examples: eye color of a person, temperature, etc. • Attribute is also known as variable, field, characteristic, or feature • A collection of attributes describes an object • Object is also known as record, point, case, sample, entity, or instance • Figure labels: Attributes (columns), Objects (rows)

  27. Attribute Values • Attribute values are numbers or symbols assigned to an attribute • Distinction between attributes and attribute values • Same attribute can be mapped to different attribute values • Example: height can be measured in feet or meters • Different attributes can be mapped to the same set of values • Example: Attribute values for ID and age are integers • But properties of attribute values can be different • ID has no limit but age has a maximum and minimum value

  28. Types of data sets • Record (Data Matrix, Document Data, Transaction Data) • Graph (World Wide Web, Molecular Structures) • Ordered (Spatial Data, Temporal Data, Sequential Data, Genetic Sequence Data)

  29. Record Data • Data that consists of a collection of records, each of which consists of a fixed set of attributes

  30. Data Matrix • If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute • Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute
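The m by n layout described on the data-matrix slide can be written down directly as a list of rows. In this small sketch the attribute names (height in metres, weight in kilograms, body temperature in degrees Celsius) and all values are made up; the point is only that rows are objects and columns are attributes.

```python
# Hypothetical data matrix: one row per object, one column per attribute
# (height_m, weight_kg, temperature_c).
data_matrix = [
    [1.70, 65.0, 36.6],
    [1.82, 80.0, 36.9],
    [1.60, 55.0, 37.1],
    [1.75, 72.0, 36.5],
]

m = len(data_matrix)       # number of objects (rows)
n = len(data_matrix[0])    # number of attributes (columns)

# zip(*rows) transposes the matrix, giving one tuple per attribute column.
column_means = [sum(column) / m for column in zip(*data_matrix)]
print(m, n, column_means)
```

Each row can also be read as the coordinates of a point in n-dimensional space, which is what makes distance-based methods applicable to this kind of data.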

  31. Document Data • Each document becomes a `term' vector, • each term is a component (attribute) of the vector, • the value of each component is the number of times the corresponding term occurs in the document.
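The term-vector representation on the document-data slide can be built by counting word occurrences against a shared vocabulary. The sketch below assumes naive lowercase, whitespace-based tokenization and uses two invented documents; real systems normally add stemming, stop-word removal, and weighting.

```python
from collections import Counter


def term_vectors(documents):
    """Return (vocabulary, vectors): one term-count vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Component i is the number of times vocabulary[i] occurs in the document.
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


# Hypothetical documents.
documents = ["data mining finds patterns in data", "mining patterns"]
vocabulary, vectors = term_vectors(documents)
print(vocabulary)
print(vectors)
```

Stacking these vectors row by row yields a document-term matrix, which is typically sparse because most documents use only a small part of the vocabulary.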

  32. Transaction Data • A special type of record data, where each record (transaction) involves a set of items. • For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

  33. Graph Data • Examples: Generic graph and HTML Links

  34. Ordered Data • Sequences of transactions • Figure labels: items/events; an element of the sequence

  35. Discrete and Continuous Attributes • Discrete Attribute • Has only a finite or countably infinite set of values • Examples: zip codes, counts, or the set of words in a collection of documents • Often represented as integer variables. • Note: binary attributes are a special case of discrete attributes • Continuous Attribute • Has real numbers as attribute values • Examples: temperature, height, or weight. • Practically, real values can only be measured and represented using a finite number of digits. • Continuous attributes are typically represented as floating-point variables.

  36. Important Characteristics of Structured Data • Dimensionality (curse of dimensionality) • Sparsity (only presence counts) • Resolution (patterns depend on the scale)
