Lecture 3: Descriptive Data Mining. Peter van der Putten (putten_at_liacs.nl)



  1. Lecture 3: Descriptive Data Mining. Peter van der Putten (putten_at_liacs.nl)

  2. Course Outline • Objective • Understand the basics of data mining • Gain understanding of the potential for applying it in the bioinformatics domain • Hands-on experience • Schedule • Evaluation • Practical assignment (2nd) plus take-home exercise • Website • http://www.liacs.nl/~putten/edu/dbdm05/

  3. Agenda Today: Descriptive Data Mining • Before Starting to Mine…. • Descriptive Data Mining • Dimension Reduction & Projection • Clustering • Hierarchical clustering • K-means • Self-organizing maps • Association rules • Frequent item sets • Association Rules • APRIORI • Bioinformatics case: FSG for frequent subgraph discovery

  4. Before starting to mine…. • Pima Indians Diabetes Data • X = body mass index • Y = age

  5. Before starting to mine….

  6. Before starting to mine….

  7. Before starting to mine…. • Attribute Selection • This example: InfoGain by Attribute • Keep the most important ones
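
A minimal sketch of univariate attribute selection with information gain (entropy of the class minus the weighted entropy after splitting on the attribute). The tiny discretised BMI/outcome vectors below are made up purely for illustration; on the Pima data you would first discretise numeric attributes, e.g. into equal-width bins.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(attribute_values, labels):
    """Class entropy minus the weighted entropy after splitting on the attribute."""
    n = len(labels)
    split_entropy = 0.0
    for v in set(attribute_values):
        subset = [l for a, l in zip(attribute_values, labels) if a == v]
        split_entropy += len(subset) / n * entropy(subset)
    return entropy(labels) - split_entropy

# hypothetical discretised attribute vs. diabetes outcome
bmi_bin = ['high', 'high', 'low', 'low', 'high', 'low']
outcome = ['pos',  'pos',  'neg', 'neg', 'pos',  'neg']
print(info_gain(bmi_bin, outcome))  # 1.0: this toy attribute separates the classes perfectly
```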

  8. Before starting to mine…. • Types of Attribute Selection • Univariate versus multivariate (subset selection) • The fact that attribute x is a strong univariate predictor does not necessarily mean it will add predictive power to a set of predictors already used by a model • Filter versus wrapper • Wrapper methods involve the subsequent learner (classifier or other)

  9. Dimension Reduction • Projecting high-dimensional data into a lower dimension • Principal Component Analysis • Independent Component Analysis • Fisher Mapping, Sammon’s Mapping, etc. • Multi-Dimensional Scaling • See Pattern Recognition Course (Duin)
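
A minimal sketch of projecting data down to two dimensions with PCA, using scikit-learn; the 100x20 data matrix is synthetic, and other mappings (e.g. MDS from sklearn.manifold) could be substituted.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))           # 100 instances, 20 attributes (synthetic)

pca = PCA(n_components=2)                # keep the two directions of largest variance
X_2d = pca.fit_transform(X)              # projected data, one 2-D point per instance
print(X_2d.shape)                        # (100, 2)
print(pca.explained_variance_ratio_)     # fraction of variance retained per component
```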

  10. Data Mining Tasks: Clustering • Clustering is the discovery of groups in a set of instances • Groups are different, instances in a group are similar • In 2 to 3 dimensional pattern space you could just visualise the data and leave the recognition to a human end user • (Scatter plot; axes e.g. weight and age)

  11. Data Mining Tasks: Clustering • Clustering is the discovery of groups in a set of instances • Groups are different, instances in a group are similar • In 2 to 3 dimensional pattern space you could just visualise the data and leave the recognition to a human end user • In >3 dimensions this is not possible • (Scatter plot; axes e.g. weight and age)

  12. Clustering Techniques • Hierarchical algorithms • Agglomerative • Divisive • Partition based clustering • K-Means • Self Organizing Maps / Kohonen Networks • Probabilistic Model based • Expectation Maximization / Mixture Models

  13. Hierarchical clustering • Agglomerative / Bottom up • Start with single-instance clusters • At each step, join the two closest clusters • Methods to compute the distance between clusters x and y: single linkage (distance between the closest points in x and y), average linkage (average distance between all points), complete linkage (distance between the furthest points), centroid (distance between cluster centroids) • Distance measure: Euclidean, correlation, etc. • Divisive / Top down • Start with all data in one cluster • Split into two clusters based on category utility • Proceed recursively on each subset • Both methods produce a dendrogram
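
A minimal sketch of agglomerative clustering with SciPy; the method argument corresponds to the linkage options above (single, average, complete, centroid) and the metric to the distance measure. The two-blob data set is synthetic.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (10, 2)),    # two synthetic groups of 10 points
               rng.normal(3, 0.5, (10, 2))])

Z = linkage(X, method='average', metric='euclidean')  # bottom-up merge tree
labels = fcluster(Z, t=2, criterion='maxclust')       # cut the tree into 2 clusters
print(labels)

# dendrogram(Z)  # with matplotlib available, this draws the merge tree (dendrogram)
```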

  14. Levels of Clustering Agglomerative Divisive Dunham, 2003

  15. Hierarchical Clustering Example • Clustering Microarray Gene Expression Data • Gene expression measured using microarrays under a variety of conditions • On budding yeast Saccharomyces cerevisiae • Efficiently groups together genes of known similar function • Data taken from: Cluster analysis and display of genome-wide expression patterns. Eisen, M., Spellman, P., Brown, P., and Botstein, D. (1998). PNAS, 95:14863-14868; picture generated with J-Express Pro

  16. Hierarchical Clustering Example • Method • Genes are the instances, samples the attributes! • Agglomerative • Distance measure = correlation • Data taken from: Cluster analysis and display of genome-wide expression patterns. Eisen, M., Spellman, P., Brown, P., and Botstein, D. (1998). PNAS, 95:14863-14868; Picture generated with J-Express Pro

  17. Simple Clustering: K-means • Pick a number (k) of cluster centers (at random) • Cluster centers are sometimes called codes, and the k codes a codebook • Assign every item to its nearest cluster center • F.i. Euclidean distance • Move each cluster center to the mean of its assigned items • Repeat until convergence • change in cluster assignments less than a threshold KDnuggets
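
The loop described above can be written down directly. A minimal NumPy sketch, on synthetic two-blob data, with convergence taken as "no change in cluster assignments":

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    codes = X[rng.choice(len(X), size=k, replace=False)].copy()  # random initial codes
    assign = np.full(len(X), -1)
    for _ in range(max_iter):
        dist = np.linalg.norm(X[:, None, :] - codes[None, :, :], axis=2)
        new_assign = dist.argmin(axis=1)                 # assign each item to nearest code
        if np.array_equal(new_assign, assign):
            break                                        # no reassignments: converged
        assign = new_assign
        for j in range(k):
            if np.any(assign == j):
                codes[j] = X[assign == j].mean(axis=0)   # move code to mean of its items
    return codes, assign

# two synthetic blobs; k-means should place one code near each blob centre
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
codes, assign = kmeans(X, k=2)
print(codes)
```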

  18. K-means example, step 1 • Initially distribute the codes k1, k2, k3 randomly in pattern space KDnuggets

  19. K-means example, step 2 • Assign each point to the closest code KDnuggets

  20. K-means example, step 3 • Move each code to the mean of all its assigned points KDnuggets

  21. K-means example, step 2 • Repeat the process: reassign the data points to the codes • Q: Which points are reassigned? KDnuggets

  22. K-means example • Repeat the process: reassign the data points to the codes • Q: Which points are reassigned? KDnuggets

  23. K-means example • Re-compute cluster means KDnuggets

  24. K-means example • Move cluster centers to cluster means KDnuggets

  25. K-means clustering summary • Advantages • Simple, understandable • Items automatically assigned to clusters • Disadvantages • Must pick the number of clusters beforehand • All items forced into a cluster • Sensitive to outliers • Extensions • Adaptive k-means • K-medoids (based on the median instead of the mean) • 1, 2, 3, 4, 100 → average 22, median 3
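
The slide's 1, 2, 3, 4, 100 example, checked with NumPy: the outlier drags the mean up to 22 while the median stays at 3, which is why k-medoids is less sensitive to outliers than k-means.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 100])
print(np.mean(x))    # 22.0 -- pulled up by the outlier
print(np.median(x))  # 3.0  -- unaffected by the outlier
```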

  26. Biological Example • Clustering of yeast cell images • Two clusters are found • Left cluster: primarily cells with a thick capsule; right cluster: thin capsule • Difference caused by the media, a proxy for sick vs. healthy

  27. Self Organizing Maps (Kohonen Maps) • Claim to fame • Simplified models of cortical maps in the brain • Things that are near in the outside world link to areas that are near in the cortex • For a variety of modalities: touch, motor, … up to echolocation • Nice visualization • From a data mining perspective: • SOMs are simple extensions of k-means clustering • Codes are connected in a lattice • In each iteration, codes neighboring the winning code in the lattice are also allowed to move
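
A minimal NumPy sketch of that extension: for each input, the winning code and its lattice neighbours move toward the input, with a neighbourhood (and learning rate) that shrinks over time. The grid size, schedules, and Gaussian input data below are illustrative choices, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
grid = 10                                   # 10x10 lattice of codes
codes = rng.random((grid, grid, 2))         # code vectors live in the 2-D input space
X = rng.normal(0.5, 0.15, size=(2000, 2))   # synthetic Gaussian input data

n_iter = 5000
for t in range(n_iter):
    x = X[rng.integers(len(X))]
    lr = 0.5 * (1 - t / n_iter)             # learning rate decays over time
    sigma = 3.0 * (1 - t / n_iter) + 0.5    # neighbourhood radius also shrinks

    # winning code: the code vector closest to the input
    d = np.linalg.norm(codes - x, axis=2)
    wi, wj = np.unravel_index(d.argmin(), d.shape)

    # move the winner and its lattice neighbours toward the input
    ii, jj = np.meshgrid(np.arange(grid), np.arange(grid), indexing='ij')
    lattice_dist2 = (ii - wi) ** 2 + (jj - wj) ** 2
    h = np.exp(-lattice_dist2 / (2 * sigma ** 2))       # neighbourhood function
    codes += lr * h[:, :, None] * (x - codes)
```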

  28. SOM • 10x10 SOM • Gaussian distribution

  29. SOM

  30. SOM

  31. SOM

  32. SOM example

  33. Famous example: Phonetic Typewriter • The SOM lattice (below left) is trained on spoken letters; after convergence the codes are labeled • This creates a ‘phonotopic’ map • A spoken word creates a sequence of labels

  34. Famous example: Phonetic Typewriter • Criticism • The topology-preserving property is not used, so why use SOMs and not, for instance, adaptive k-means? • K-means could also create a sequence • This is true for most SOM applications! • Is using clustering for classification optimal?

  35. Bioinformatics ExampleClustering GPCRs • Clustering G Protein Coupled Receptors (GPCRs) [Samsanova et al, 2003, 2004] • Important drug target, function often unknown

  36. Bioinformatics ExampleClustering GPCRs

  37. Association Rules Outline • What are frequent item sets & association rules? • Quality measures: support, confidence, lift • How to find item sets efficiently? APRIORI • How to generate association rules from an item set? • Biological examples KDnuggets

  38. Market Basket Example / Gene Expression Example • Frequent item set • {MILK, BREAD} = 4 • Association rule • {MILK, BREAD} ⇒ {EGGS} • Frequency / importance = 2 (‘Support’) • Quality = 50% (‘Confidence’) • What genes are expressed (‘active’) together? • Interaction / regulation • Similar function

  39. Association Rule Definitions • Set of items: I = {I1, I2, …, Im} • Transactions: D = {t1, t2, …, tn}, tj ⊆ I • Itemset: {Ii1, Ii2, …, Iik} ⊆ I • Support of an itemset: percentage of transactions which contain that itemset • Large (frequent) itemset: itemset whose number of occurrences is above a threshold Dunham, 2003

  40. Frequent Item Set Example • I = {Beer, Bread, Jelly, Milk, PeanutButter} • Support of {Bread, PeanutButter} is 60% Dunham, 2003
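
A minimal sketch of computing itemset support. The five transactions below are illustrative stand-ins for the Dunham table (not reproduced here), chosen so that {Bread, PeanutButter} occurs in 3 of 5 transactions, i.e. 60% support.

```python
transactions = [
    {'Bread', 'Jelly', 'PeanutButter'},
    {'Bread', 'PeanutButter'},
    {'Bread', 'Milk', 'PeanutButter'},
    {'Beer', 'Bread'},
    {'Beer', 'Milk'},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

print(support({'Bread', 'PeanutButter'}, transactions))  # 0.6
```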

  41. Association Rule Definitions • Association Rule (AR): implication X ⇒ Y where X, Y ⊆ I and X, Y are disjoint • Support of AR (s) X ⇒ Y: percentage of transactions that contain X ∪ Y • Confidence of AR (α) X ⇒ Y: ratio of the number of transactions that contain X ∪ Y to the number that contain X Dunham, 2003

  42. Association Rules Ex (cont’d) Dunham, 2003

  43. Association Rule Problem • Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn} where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X ⇒ Y with a minimum support and confidence. • NOTE: Support of X ⇒ Y is the same as support of X ∪ Y. Dunham, 2003

  44. Association Rules Example • Q: Given frequent set {A,B,E}, what association rules have minsup = 2 and minconf = 50%? • A, B => E : conf = 2/4 = 50% • A, E => B : conf = 2/2 = 100% • B, E => A : conf = 2/2 = 100% • E => A, B : conf = 2/2 = 100% • Don’t qualify: • A => B, E : conf = 2/6 = 33% < 50% • B => A, E : conf = 2/7 = 28% < 50% • __ => A, B, E : conf = 2/9 = 22% < 50% KDnuggets
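
A small sketch that reproduces these confidence checks. The support counts are read off the slide (9 transactions in total, count({A,B,E}) = 2, count({A,B}) = 4, and so on); everything else is just enumeration of antecedents.

```python
from itertools import combinations

# support counts taken from the slide's numbers
counts = {frozenset(): 9,
          frozenset('A'): 6, frozenset('B'): 7, frozenset('E'): 2,
          frozenset('AB'): 4, frozenset('AE'): 2, frozenset('BE'): 2,
          frozenset('ABE'): 2}

itemset, minconf = frozenset('ABE'), 0.5
for r in range(len(itemset)):                           # antecedent sizes 0, 1, 2
    for antecedent in map(frozenset, combinations(sorted(itemset), r)):
        conf = counts[itemset] / counts[antecedent]     # conf(antecedent => rest)
        lhs = set(antecedent) if antecedent else 'true'
        verdict = 'qualifies' if conf >= minconf else 'below minconf'
        print(lhs, '=>', set(itemset - antecedent),
              f'conf = {counts[itemset]}/{counts[antecedent]} = {conf:.2f} ({verdict})')
```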

  45. Solution Association Rule Problem • First, find all frequent itemsets with sup >= minsup • Exhaustive search won’t work • Assume we have a set of m items → 2^m subsets! • Exploit the subset property (APRIORI algorithm) • For every frequent item set, derive rules with confidence >= minconf KDnuggets

  46. Finding itemsets: next level • Apriori algorithm (Agrawal & Srikant) • Idea: use one-item sets to generate two-item sets, two-item sets to generate three-item sets, etc. • Subset Property: if (A B) is a frequent item set, then (A) and (B) have to be frequent item sets as well! • In general: if X is a frequent k-item set, then all (k-1)-item subsets of X are also frequent • Compute candidate k-item sets by merging (k-1)-item sets KDnuggets

  47. An example • Given: five three-item sets (A B C), (A B D), (A C D), (A C E), (B C D) • Candidate four-item sets: • (A B C D) Q: OK? A: Yes, because all 3-item subsets are frequent • (A C D E) Q: OK? A: No, because (C D E) is not frequent KDnuggets
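
A minimal sketch of the Apriori candidate-generation step: merge (k-1)-item sets whose union has size k, then prune any candidate that has an infrequent (k-1)-subset. It reproduces the example above: (A B C D) survives, (A C D E) is pruned.

```python
from itertools import combinations

def apriori_gen(frequent_k_minus_1):
    """Generate candidate k-item sets from frequent (k-1)-item sets, with pruning."""
    frequent = set(map(frozenset, frequent_k_minus_1))
    k = len(next(iter(frequent))) + 1
    candidates = set()
    for a in frequent:
        for b in frequent:
            union = a | b
            if len(union) == k:                                  # merge step
                # prune step: every (k-1)-subset must itself be frequent
                if all(frozenset(s) in frequent for s in combinations(union, k - 1)):
                    candidates.add(union)
    return candidates

threes = [('A', 'B', 'C'), ('A', 'B', 'D'), ('A', 'C', 'D'), ('A', 'C', 'E'), ('B', 'C', 'D')]
print(apriori_gen(threes))   # {frozenset({'A', 'B', 'C', 'D'})}; (A C D E) is pruned
```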

  48. From Frequent Itemsets to Association Rules • Q: Given frequent set {A,B,E}, what are possible association rules? • A => B, E • A, B => E • A, E => B • B => A, E • B, E => A • E => A, B • __ => A,B,E (empty rule), or true => A,B,E KDnuggets

  49. Example: Generating Rules from an Itemset • Frequent itemset from golf data: • Seven potential rules: KDnuggets

  50. Example: Generating Rules • Rules with support > 1 and confidence = 100%: • In total: 3 rules with support four, 5 with support three, and 50 with support two KDnuggets
