
1. Data Mining (or KDD)



Presentation Transcript


  1. 1. Data Mining (or KDD)
     Definition := “Data Mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad)
     Let us find something interesting!

  2. Why Mine Data? Scientific Viewpoint
     • Data collected and stored at enormous speeds (GB/hour)
       • remote sensors on a satellite
       • telescopes scanning the skies
       • microarrays generating gene expression data
       • scientific simulations generating terabytes of data
       • GIS
     • Traditional techniques infeasible for raw data
     • Data mining may help scientists
       • in classifying and segmenting data
       • in hypothesis formation

  3. 2.1 Supervised Clustering (Ch. Eick)
     [Figure: three scatter plots over Attribute1 and Attribute2 showing class 1, class 2, and unclassified objects, with clusters labeled A through L; panels: a. Unsupervised Clustering, b. Semi-supervised Clustering, c. Supervised Clustering]
     Applications of Supervised Clustering Include:
     • Learning Subclasses
     • for Region Discovery in Spatial Datasets
     • Distance Function Learning
     • Data Set Compression (reduce size of dataset by using cluster representatives)
     • Adaptive Supervised Clustering

  4. Example: Finding Subclasses
     [Figure: scatter plot over Attribute1 and Attribute2 of vehicles marked as Ford or GMC; the clusters shown are Ford Trucks, Ford Vans, Ford SUV, GMC Trucks, GMC Van, and GMC SUV]

  5. SC Algorithms Investigated
     • Representative-based Clustering Algorithms
       • Supervised Partitioning Around Medoids (SPAM)
       • Single Representative Insertion/Deletion Steepest Descent Hill Climbing with Randomized Restart (SRIDHCR)
       • Supervised Clustering using Evolutionary Computing (SCEC)
     • Agglomerative Hierarchical Supervised Clustering (AHSC)
     • Grid-Based Supervised Clustering (GRIDSC)
       • Naïve approach
       • Hierarchical Grid-based Clustering relying on data cubes
       • Grid-based Clustering relying on density estimation techniques
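To make the representative-based idea concrete, here is a minimal sketch in the spirit of SRIDHCR (single insertion/deletion of a representative, steepest descent, randomized restarts). The slides do not give the fitness function or the neighborhood definition, so the impurity-plus-penalty fitness `q`, the parameter `beta`, and every function name below are assumptions for illustration only, not the authors' implementation.

```python
import random
from collections import Counter

def assign(points, reps):
    """Assign each point index to its nearest representative (squared Euclidean)."""
    clusters = {r: [] for r in reps}
    for i, p in enumerate(points):
        nearest = min(reps, key=lambda r: sum((a - b) ** 2 for a, b in zip(p, points[r])))
        clusters[nearest].append(i)
    return clusters

def q(points, labels, reps, beta=0.1):
    """Assumed fitness (lower is better): fraction of objects that are not in
    the majority class of their cluster, plus a penalty per cluster."""
    minority = 0
    for members in assign(points, reps).values():
        if members:
            counts = Counter(labels[i] for i in members)
            minority += len(members) - max(counts.values())
    n = len(points)
    return minority / n + beta * len(reps) / n

def hill_climb(points, labels, rng, beta=0.1):
    """One steepest-descent run: apply the best single insertion or deletion
    of a representative until no neighbor improves the fitness."""
    reps = set(rng.sample(range(len(points)), 2))  # needs at least 2 points
    best = q(points, labels, reps, beta)
    while True:
        neighbors = [reps | {i} for i in range(len(points)) if i not in reps]
        if len(reps) > 1:
            neighbors += [reps - {r} for r in reps]
        cand = min(neighbors, key=lambda s: q(points, labels, s, beta))
        cand_q = q(points, labels, cand, beta)
        if cand_q >= best:
            return reps, best
        reps, best = cand, cand_q

def sridhcr_sketch(points, labels, restarts=5, seed=0, beta=0.1):
    """Run several randomized restarts and keep the best (reps, fitness) pair."""
    rng = random.Random(seed)
    runs = [hill_climb(points, labels, rng, beta) for _ in range(restarts)]
    return min(runs, key=lambda run: run[1])
```

Called as `sridhcr_sketch(points, labels)`, with `points` a list of coordinate tuples and `labels` the class of each point, it returns the indices of the chosen representatives and their fitness. A single run can get stuck in a local optimum, which is why restarts are part of the scheme.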

  6. 2.2 Spatial Data Mining (SPDM)
     • SPDM := the process of discovering interesting, useful, non-trivial patterns from (large) spatial datasets.
     • Spatial patterns
       • Spatial outliers, discontinuities
         • bad traffic sensors on highways
       • Location prediction models
         • model to identify habitat of endangered species
       • Spatial clusters
         • crime hot-spots, poverty clusters
       • Co-location patterns
         • identify arsenic risk zones in Texas and determine if there is a correlation between the arsenic concentrations of the major Texas aquifers and cultural factors such as population, farm density, and the geology of the aquifers
     Idea: Reuse the supervised clustering algorithms that already exist by running them with a different fitness function that corresponds to a particular measure of interestingness.
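The "different fitness function" idea can be sketched as a clustering evaluator that takes the per-cluster reward as a parameter. Everything here is hypothetical: the function names, the size-weighted averaging, the hot-spot reward, and the toy well records are made up for illustration and do not come from the slides.

```python
from collections import Counter
from statistics import mean

def fitness(clusters, reward):
    """Size-weighted average of a per-cluster reward (higher is better)."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) * reward(c) for c in clusters) / total

def purity_reward(cluster):
    """Supervised clustering: share of the majority class label in the cluster."""
    counts = Counter(obj["label"] for obj in cluster)
    return max(counts.values()) / len(cluster)

def make_hotspot_reward(attribute, threshold):
    """Spatial mining: reward regions whose mean value of `attribute`
    exceeds `threshold`; regions at or below the threshold earn nothing."""
    def reward(cluster):
        return max(0.0, mean(obj[attribute] for obj in cluster) - threshold)
    return reward

# Toy, made-up records; the values carry no real-world meaning.
wells = [
    {"label": "high", "arsenic": 40.0},
    {"label": "high", "arsenic": 60.0},
    {"label": "low", "arsenic": 5.0},
    {"label": "high", "arsenic": 15.0},
]
regions = [wells[:2], wells[2:]]

purity_score = fitness(regions, purity_reward)
hotspot_score = fitness(regions, make_hotspot_reward("arsenic", 10.0))
print(purity_score, hotspot_score)
```

The search procedure only ever calls `fitness`, so swapping the reward changes what counts as an interesting region without touching the clustering algorithm itself.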

  7. Example: Discovery of “Interesting Regions” in Wyoming Census 2000 Datasets

  8. 2.3 Distance Function Learning
     Example: How to Find Similar Patients?
     Task: Construct a distance function that measures patient similarity
     Motivation: Finding a “good” distance function is important for:
     • Case-based reasoning
     • Clustering
     • Instance-based classification (e.g. nearest neighbor classifiers)
     Our Approach: Learn distance functions based on training examples and user feedback

  9. Motivating Example: How To Find Similar Patients?
     The following relation is given (with 10000 tuples): Patient(ssn, weight, height, cancer-sev, eye-color, age, …)
     Attribute Domains
     • ssn: 9 digits
     • weight: between 30 and 650; μ_weight = 158, σ_weight = 24.20
     • height: between 0.30 and 2.20 in meters; μ_height = 1.52, σ_height = 19.2
     • cancer-sev: 4=serious 3=quite_serious 2=medium 1=minor
     • eye-color: {brown, blue, green, grey}
     • age: between 3 and 100; μ_age = 45, σ_age = 13.2
     Task: Define Patient Similarity
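One possible way to define patient similarity for such mixed attributes is a Gower-style weighted distance: numeric and ordinal attributes contribute their absolute difference divided by the attribute range, and nominal attributes contribute 0 on a match and 1 on a mismatch. The sketch below uses the ranges listed on the slide; the choice of range normalization, the weights, the underscore attribute names, and the decision to ignore `ssn` (an identifier) are assumptions, not the authors' definition.

```python
# Attribute ranges taken from the slide; cancer_sev is treated as ordinal (1..4).
RANGES = {
    "weight": (30.0, 650.0),
    "height": (0.30, 2.20),
    "age": (3.0, 100.0),
    "cancer_sev": (1.0, 4.0),
}
NOMINAL = ("eye_color",)

def patient_distance(p, q, weights):
    """Weighted Gower-style distance in [0, 1].

    `weights` must hold one non-negative weight for every attribute in
    RANGES and NOMINAL, and the weights must not all be zero.
    """
    total = 0.0
    for attr, (lo, hi) in RANGES.items():
        total += weights[attr] * abs(p[attr] - q[attr]) / (hi - lo)
    for attr in NOMINAL:
        total += weights[attr] * (0.0 if p[attr] == q[attr] else 1.0)
    return total / sum(weights.values())

# Hypothetical patients and equal weights, for illustration only.
equal_weights = {"weight": 1.0, "height": 1.0, "age": 1.0, "cancer_sev": 1.0, "eye_color": 1.0}
patient_a = {"weight": 150.0, "height": 1.70, "age": 40.0, "cancer_sev": 2, "eye_color": "brown"}
patient_b = {"weight": 150.0, "height": 1.70, "age": 40.0, "cancer_sev": 4, "eye_color": "blue"}
print(patient_distance(patient_a, patient_b, equal_weights))
```

The weights are exactly what a distance function learning approach would tune: raising the weight of `cancer_sev`, for example, makes patients with different severities look less similar.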

  10. Idea: Coevolving Clusters and Distance Functions
      [Figure: loop in which clustering with distance function Q produces a clustering X, the clustering evaluation q(X) measures the goodness of the distance function Q, and a weight updating scheme / search strategy adjusts Q; two insets of o and x objects contrast the clusters obtained with a “bad” distance function Q1 and a “good” distance function Q2]

  11. Distance Function Learning Framework
      [Figure: framework diagram with two components, Distance Function Evaluation and Weight-Updating Scheme / Search Strategy; boxes: K-Means, Supervised Clustering, NN-Classifier, Inside/Outside Weight Updating, Work By Karypis, Randomized Hill Climbing, Adaptive Clustering, Other Research …; annotations: Current Research, [CHEN05], [ERBV04], [BECV05]]
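As an illustration of how an evaluation component and a search strategy might fit together, the sketch below pairs two of the names that appear in the framework: a nearest-neighbor classifier as the evaluator and randomized hill climbing as the search strategy. The pairing, the weighted Manhattan distance, the Gaussian perturbation, and all function and parameter names are assumptions; the slides do not specify any of these details.

```python
import random

def weighted_dist(a, b, w):
    """Weighted Manhattan distance between two numeric vectors."""
    return sum(wi * abs(x - y) for wi, x, y in zip(w, a, b))

def loo_nn_accuracy(X, y, w):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier under weights w."""
    hits = 0
    for i, xi in enumerate(X):
        others = (k for k in range(len(X)) if k != i)
        j = min(others, key=lambda k: weighted_dist(xi, X[k], w))
        hits += y[j] == y[i]
    return hits / len(X)

def learn_weights(X, y, steps=200, scale=0.2, seed=0):
    """Randomized hill climbing: perturb the weights, renormalize them to
    sum to 1, and keep the candidate only if the evaluation improves."""
    rng = random.Random(seed)
    d = len(X[0])
    w = [1.0 / d] * d
    best = loo_nn_accuracy(X, y, w)
    for _ in range(steps):
        cand = [max(0.0, wi + rng.gauss(0.0, scale)) for wi in w]
        s = sum(cand)
        if s == 0.0:
            continue  # degenerate candidate, try another perturbation
        cand = [c / s for c in cand]
        score = loo_nn_accuracy(X, y, cand)
        if score > best:
            w, best = cand, score
    return w, best
```

`learn_weights(X, y)` returns the learned attribute weights and the evaluation score they achieve. Replacing `loo_nn_accuracy` with a clustering-based evaluation, such as the purity of a clustering computed with the candidate weights, would give a different instance of the same framework.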

  12. 2.4 Adaptive Data Mining
      [Figure: system diagram with the labels Clustering, Supervised Clustering, Algorithm, Summary, Inputs, Changes, Adaptation System, Evaluation System, Feedback, Domain Expert, Past Experience, Quality, and Fitness Functions (Predefined) q(X), …]

  13. 2.5 Signatures of Data Sets
      Input: a set of classified examples
      Output: Signatures in the dataset that characterize
      • how the examples of a class distribute (in relationship to the examples of the other classes) in the dataset
      • how many regions dominated by a single class exist in the data set
      • which regions dominated by one class border regions dominated by another class
      • where the regions identified in steps 2 and 3 are located
      • what the density attractors (maxima of the density function) of the classes in the data set are
      Why are we creating those signatures?
      • As a preprocessing step to develop smarter classifiers
      • To understand why a particular data mining technique works well or does not work well for a particular dataset → meta learning
      Methods employed: density estimation techniques, supervised clustering, proximity graphs (e.g. Delaunay, Gabriel graphs), …
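As one example of how a proximity graph could expose class borders, the sketch below builds a Gabriel graph by brute force and keeps the edges whose endpoints carry different class labels. Two points are Gabriel neighbors when no other point lies in the closed ball whose diameter is the segment between them. Using these mixed-class edges as a border signature is an assumption made for illustration; the slides only name the graph types.

```python
from itertools import combinations

def sq_dist(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def gabriel_edges(points):
    """Brute-force O(n^3) Gabriel graph.

    Edge (i, j) exists iff every other point k lies strictly outside the
    ball with diameter ij, i.e. d(i,k)^2 + d(j,k)^2 > d(i,j)^2.
    """
    edges = []
    for i, j in combinations(range(len(points)), 2):
        d_ij = sq_dist(points[i], points[j])
        if all(
            sq_dist(points[i], points[k]) + sq_dist(points[j], points[k]) > d_ij
            for k in range(len(points))
            if k not in (i, j)
        ):
            edges.append((i, j))
    return edges

def border_edges(points, labels):
    """Gabriel edges that connect examples of different classes."""
    return [(i, j) for i, j in gabriel_edges(points) if labels[i] != labels[j]]
```

For four collinear points `(0, 0), (1, 0), (2, 0), (3, 0)` labeled `a, a, b, b`, only consecutive points are Gabriel neighbors, and the single mixed-class edge joins the second and third point, which is where the two classes meet.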

  14. Example: Signatures of Data Sets
      [Figure: three scatter plots over Attribute1 and Attribute2 showing class 1, class 2, and unclassified objects, with clusters labeled A through L; panels: a. Unsupervised Clustering, b. Semi-supervised Clustering, c. Supervised Clustering]

  15. Applications of Creating Signatures:
      • Class Decomposition (see also [VAE03])
      [Figure: three plots over Attribute 1 and Attribute 2]

  16. 2.6 Research Christoph F. Eick 2005-2007
      • Clustering for Classification
      • Creating Signatures For Datasets
      • Editing / Data Set Compression
      • Supervised Clustering
      • Distance Function Learning
      • Spatial Data Mining
      • Adaptive Clustering
      • Mining Data Streams
      • Online Data Mining
      • Mining Sensor Data
      • Measures of Interestingness
      • Evolutionary Computing
      • Mining Semi-Structured Data
      • Web Annotation
      • File Prediction

  17. 3. UH Data Mining and Machine Learning Group (UH-DMML)
      Co-Directors: Christoph F. Eick and Ricardo Vilalta
      Goal: Development of data analysis and data mining techniques and the application of these techniques to challenging problems in physics, geology, astronomy, environmental sciences, and medicine.
      Topics investigated:
      • Meta Learning
      • Classification and Learning from Examples
      • Clustering
      • Distance Function Learning
      • Using Reinforcement Learning for Data Mining
      • Spatial Data Mining
      • Knowledge Discovery
