
The Challenge of Heterogeneity

Explore the challenges of heterogeneity in data and users when structuring music collections. Learn how Web 2.0 and hierarchical classification can help organize large data collections with semantic annotations and user-given taggings.

kcarson


Presentation Transcript


  1. The Challenge of Heterogeneity

  2. Overview • Heterogeneity in Data • Distributed Data • Web 2.0 • Heterogeneity of Users • Structuring music collections • Structuring tag collections

  3. Heterogeneity in Data • Databases • Fixed set of attributes • Declared data types • Multi-relational • Very large number of records • Preparation for mining • Extract, Transform, Load • Select attributes • Declare label for learning • Handle missing values • Compose new attributes • Schema mapping for re-use of DM • Example: MiningMart application to customer churn (Telecom Italia)

  4. Heterogeneity in Data • Time series data • Measurements over time • Business • Medicine • Production • Handwriting • Pictures • Music • Prediction • Classification • Clustering • Signal to Symbol

  5. Heterogeneity in Data • Texts • High dimensional vectors • Sparse word vectors • Texts of the same class need not share a word! • Syntactic, semantic structures • Classification • Clustering • Named Entity Recognition, Information Extraction
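The remark that texts of the same class need not share a word can be illustrated with a minimal sketch. The two example sentences and the helper names `bag_of_words` and `cosine` are assumptions made for illustration; they are not taken from the slides.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Sparse word vector: only words that actually occur are stored."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity of two sparse word vectors."""
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Two invented texts that a reader would both file under "football",
# although they have no word in common.
doc_a = bag_of_words("striker scores late goal")
doc_b = bag_of_words("keeper saves penalty kick")

similarity = cosine(doc_a, doc_b)  # no shared word, so the dot product is zero
```

Because the vectors are sparse, a purely word-based similarity sees these two texts as unrelated, which is one reason text classification is harder than it looks.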

  6. Distributed Data • Distributed databases of the same schema • Distributed databases of different schemas • Low-level, low capacity sensors • Peer-to-peer networks

  7. Heterogeneity of Users • The same label name does not necessarily mean the same concept. • Different names may refer to the same set of items. • Users apply diverse aspects, e.g., genre, time of day, episodes (summer 99), ... • Users share some set of items (possibly under different names). [Figure: tag hierarchies of several users, with tags such as classic, guitar, piano, pop, hip hop, alternative, metal, jazz, death metal, true metal, favourites, blues, modern, home, work, office, plane]

  8. Web 2.0 • Organizing large data collections requires semantic annotations. • Users annotate items with arbitrary tags. • No common ontology is required (“folksonomies”). • Users want to keep their tags, but like to benefit from efforts of others.

  9. Structuring Music Collections • A concept’s meaning is its extension, e.g., some music. • A concept’s meaning can be expressed by a classifier. • A concept hierarchy for each aspect --> hierarchical classification. • Acquiring the hierarchy by clustering under the assumption that user-given taggings are kept. [Figure: example hierarchies with tags pop, rock, blues, metal, bad, good, aggressive over items a, b, d, e, f]

  10. Localized Alternative Cluster Ensembles (ECML 2006) • Acquiring hierarchical clusterings from • Own partial clusterings • Clusterings of other peers • Preserve taggings of users • Produce several alternatives • Exploit input clusterings • Consider locality instead of global consensus [Figure: tag hierarchies of several users, with tags such as classic, guitar, piano, pop, hip hop, alternative, metal, jazz, death metal, true metal, favourites, blues, modern, home, work, office, plane]

  11. LACE Algorithm • Items are represented by IDs. [Figure: query items a to g next to tagged cluster nodes (alternative metal, death metal, true metal, pop, hip hop)]

  12. LACE Algorithm • Best matching cluster node is selected by f-measure. [Figure: query items a to g next to tagged cluster nodes (alternative metal, death metal, true metal, pop, hip hop)]
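A minimal sketch of selecting the best matching cluster node by f-measure. It assumes the usual definition (harmonic mean of precision and recall between the query item set and a node's item set); the node contents and the names `f_measure` and `best_matching_node` are invented for illustration and are not read off the slide's figure.

```python
def f_measure(query, cluster):
    """Harmonic mean of precision and recall between two item-id sets."""
    overlap = len(query & cluster)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)  # share of the node's items that are in the query
    recall = overlap / len(query)       # share of the query items found in the node
    return 2 * precision * recall / (precision + recall)

def best_matching_node(query, nodes):
    """Return the tag of the node whose item set scores highest."""
    return max(nodes, key=lambda tag: f_measure(query, nodes[tag]))

# Hypothetical cluster nodes of another peer, keyed by tag.
nodes = {
    "metal": {"a", "b", "c"},
    "death metal": {"b", "c"},
    "pop": {"d", "f"},
}
query = {"a", "b", "c", "d"}

best = best_matching_node(query, nodes)
```

Here "metal" wins because it covers three of the four query items without containing any foreign item.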

  13. LACE Algorithm • Items that are sufficiently similar to items in the best matching clustering are deleted from the query set. [Figure: the matched cluster node (alternative metal, death metal, true metal) and the reduced query set]

  14. LACE Algorithm • A new query is posed containing the remaining items. • Only tags not used yet are considered. [Figure: the matched cluster node (alternative metal, death metal, true metal) and the remaining query items matched against pop, hip hop]

  15. LACE Algorithm • The process continues until all items are covered, no additional match is possible, or a maximal number of rounds is reached. [Figure: result 1 combining the matched nodes (alternative metal, death metal, true metal; pop, hip hop), with items e and g still uncovered]
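The covering loop described on slides 12 to 15 can be sketched as follows. This is a simplified reading, not the published LACE implementation: "sufficiently similar" is replaced by exact membership of the item ID in the matched node, and the names `cover_query`, `peer_nodes` and `max_rounds` are assumptions.

```python
def f_measure(query, cluster):
    """Harmonic mean of precision and recall between two item-id sets."""
    overlap = len(query & cluster)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)
    recall = overlap / len(query)
    return 2 * precision * recall / (precision + recall)

def cover_query(query, peer_nodes, max_rounds=10):
    """Greedily cover the query items with the best matching nodes."""
    remaining = set(query)
    unused = dict(peer_nodes)  # only tags not used yet are considered
    selected = []
    for _ in range(max_rounds):            # stop: maximal number of rounds
        if not remaining or not unused:    # stop: all items covered
            break
        best = max(unused, key=lambda tag: f_measure(remaining, unused[tag]))
        if f_measure(remaining, unused[best]) == 0.0:
            break                          # stop: no additional match possible
        selected.append(best)
        remaining -= unused.pop(best)      # simplification: exact id match
    return selected, remaining

# Hypothetical nodes offered by other peers.
peer_nodes = {
    "alternative metal": {"a", "b", "c"},
    "pop": {"d", "f"},
    "jazz": {"x"},
}
selected, remaining = cover_query({"a", "b", "c", "d", "e", "f", "g"}, peer_nodes)
```

In this toy run two nodes are selected and the items e and g stay uncovered, which is the situation the following classification step addresses.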

  16. LACE Algorithm • Remaining items are added by classification (kNN). [Figure: result 1 with the remaining items e and g assigned to the matched nodes]
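A minimal k-nearest-neighbour sketch for assigning leftover items to existing nodes. The two-dimensional feature vectors, the value k=3 and the function name `knn_assign` are assumptions; the slides do not state which features or which k are used.

```python
import math
from collections import Counter

def knn_assign(item, labeled, features, k=3):
    """Assign `item` the majority tag among its k nearest labeled items."""
    neighbors = sorted(
        labeled,
        key=lambda other: math.dist(features[item], features[other]),
    )[:k]
    votes = Counter(labeled[n] for n in neighbors)
    return votes.most_common(1)[0][0]

# Invented audio-like features for items a to g.
features = {
    "a": (0.90, 0.10), "b": (0.80, 0.20), "c": (0.85, 0.15),
    "d": (0.10, 0.90), "f": (0.20, 0.80),
    "e": (0.15, 0.85), "g": (0.75, 0.25),
}
# Items already covered by a matched node, with the node's tag.
labeled = {
    "a": "alternative metal", "b": "alternative metal", "c": "alternative metal",
    "d": "pop", "f": "pop",
}

tag_for_e = knn_assign("e", labeled, features)
tag_for_g = knn_assign("g", labeled, features)
```

With these invented features, e lands next to the pop items and g next to the metal items.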

  17. LACE Algorithm • Process starts anew until no more matches are possible or the maximal number of results is reached. [Figure: result 1 (alternative metal, death metal, true metal; pop, hip hop) alongside the peers' cluster nodes]

  18. LACE Algorithm • Process starts anew until no more matches are possible or the maximal number of results is reached. [Figure: alternative results 1, 2, 3, ..., k, including one built from the tags home, work, office, plane]

  19. LACE Algorithm • Ad hoc peer-to-peer network. [Figure: alternative results 1, 2, 3, ..., k obtained from cluster nodes of peers in the P2P network]

  20. Structuring Music Collections • Challenge of music data: • There is no perfect feature set for all mining tasks. • Learning feature extraction for a classification task (Mierswa/Morik, MLJ 2005) • Structuring music collections (Wurst/Morik/Mierswa, ECML 2006) • User views are local models, no global consensus wanted! (Mierswa/Morik/Wurst, in: Masseglia, Poncelet, and Teisseire (editors), Successes and New Directions in Data Mining, 2007)

  21. Structuring Tag Collections • Users annotate resources with arbitrary tags. • Frequency of tags is shown by the tag cloud. • Tags structure the collection.

  22. Navigation • A user may select a tag and see its resources. • A user may follow related tags. • Problem: • No hierarchical structure. • Navigation is restricted to the given tags. • No navigation according to subsets. • Photography and art cannot be found!

  23. Given: Folksonomy • A folksonomy (U, T, R, Y), with • U users • T tags • R resources • Y ⊆ U × T × R • A record (u, t, r) ∈ Y means that user u has annotated resource r with tag t.

  24. Wanted: Tagset clustering • Hierarchical clustering of tags for navigation, • based on frequency: how many users used tag t? • supp_U: P(T) --> ℕ, supp_U(T') = |{u ∈ U | ∀ t ∈ T' ∃ r ∈ R: (u, t, r) ∈ Y}| for a tag set T' ⊆ T • Subset of the lattice of frequent tag sets that optimizes clustering criteria.
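A small sketch of the user-based support of a tag set over a folksonomy. It reads the support as the number of users who have used every tag of the set on some resource, which is an interpretation of the slide's formula; the triples in `Y` and the name `user_support` are invented for illustration.

```python
def user_support(tagset, Y):
    """Count users who have used every tag in `tagset` on some resource."""
    tags_by_user = {}
    for user, tag, _resource in Y:
        tags_by_user.setdefault(user, set()).add(tag)
    wanted = set(tagset)
    return sum(1 for used in tags_by_user.values() if wanted <= used)

# Hypothetical folksonomy records (user, tag, resource).
Y = {
    ("u1", "sun", "r1"), ("u1", "beach", "r2"),
    ("u2", "sun", "r1"),
    ("u3", "sun", "r3"), ("u3", "beach", "r3"), ("u3", "fun", "r4"),
}

support_sun = user_support({"sun"}, Y)
support_sun_beach = user_support({"sun", "beach"}, Y)
support_all = user_support({"sun", "fun", "beach"}, Y)
```

Support can only shrink when a tag is added to the set, which is what makes the lattice of frequent tag sets usable as a search space.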

  25. Starting Point: Termset Clustering • Termset clustering: how many resources support a term? • Given frequent term sets, form a clustering with small overlap and large coverage. • Beil, Ester, Xu (2002): Frequent Term-Based Text Clustering, in KDD 2002 • Fung, Wang, Ester (2003): Hierarchical Document Clustering Using Frequent Itemsets, in SDM 2003 • Heuristics for minimizing overlap, maximizing coverage. [Figure: lattice of frequent term sets from { } over {sun}, {beach}, {sun, fun}, {fun, beach}, {sun, beach} to {sun, fun, beach}, each annotated with its supporting documents among D1, ..., D16]
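Coverage and overlap can be made concrete with a minimal sketch. The two measures below are simple stand-ins chosen for illustration (share of documents in at least one cluster; average number of extra clusters per covered document); they are not the exact definitions of the cited papers, and the document IDs are invented.

```python
from collections import Counter

def coverage(clusters, documents):
    """Fraction of all documents that belong to at least one cluster."""
    covered = set().union(*clusters.values())
    return len(covered & documents) / len(documents)

def overlap(clusters):
    """Average number of additional clusters per covered document."""
    counts = Counter(doc for docs in clusters.values() for doc in docs)
    if not counts:
        return 0.0
    return sum(n - 1 for n in counts.values()) / len(counts)

documents = {"D1", "D2", "D3", "D4", "D5", "D6"}
# Hypothetical clustering: each frequent term set with its supporting documents.
clusters = {
    ("sun",): {"D1", "D2", "D3"},
    ("beach",): {"D3", "D4"},
}

cov = coverage(clusters, documents)  # D1..D4 of six documents are covered
ovl = overlap(clusters)              # only D3 sits in a second cluster
```

Adding more term sets to the clustering tends to raise coverage and overlap together, which is why the two goals have to be traded off.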

  26. Heterogeneous Preferences Child-count vs. completeness (left); coverage vs. overlap (right)

  27. Multi-objective Optimization • Given frequent tag sets • Find all optimal clusterings according to two orthogonal criteria. • Orthogonal criteria can only be determined empirically. • Childcount: number of successors of a cluster • Overlap: average overlap of clusters at each level. • Completeness: how much of the lattice is retained?
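"All optimal clusterings according to two criteria" means the non-dominated (Pareto-optimal) ones. A minimal sketch, assuming both criteria are to be maximized; the score pairs and the names `dominates` and `pareto_front` are invented for illustration.

```python
def dominates(p, q):
    """p dominates q if it is at least as good everywhere and better somewhere."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))

def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (criterion 1, criterion 2) scores of five candidate clusterings.
candidates = [(0.9, 0.2), (0.5, 0.5), (0.2, 0.9), (0.4, 0.4), (0.9, 0.1)]

front = pareto_front(candidates)
```

A criterion that should be minimized can be handled by negating its value before the comparison. None of the remaining points is better than another on both criteria, so the choice among them is left to the user.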

  28. GA for Optimization • NSGA-II algorithm: Deb, Agrawal, Pratap, Meyarivan (2000), in Procs. Parallel Problem Solving from Nature • Delivers all Pareto-optimal clusterings to a partial lattice of frequent tag sets. [Figure: GA loop with initial population, fitness, stop?, selection, crossover, mutation, output]

  29. Encoding Frequent Tag Sets • Given the lattice of possibly frequent tag sets, • a binary vector indicates the inclusion of a tag set into the clustering. • A vector can be mutated by flipping bits. • Two vectors can be combined into a new one by crossover.
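The encoding can be sketched in a few lines. The tag sets, the mutation rate and the choice of one-point crossover are assumptions for illustration; the slides only say that bits are flipped and vectors are combined, and do not fix the operators.

```python
import random

# Hypothetical lattice nodes; position i of a bit vector refers to tag_sets[i].
tag_sets = [("sun",), ("beach",), ("fun",), ("sun", "beach"), ("sun", "fun")]

def decode(bits):
    """Tag sets included in the clustering that `bits` encodes."""
    return [ts for ts, bit in zip(tag_sets, bits) if bit]

def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]

def one_point_crossover(left, right, rng):
    """Child takes a prefix of `left` and the matching suffix of `right`."""
    cut = rng.randint(1, len(left) - 1)
    return left[:cut] + right[cut:]

rng = random.Random(0)  # seeded so that runs are repeatable
parent_a = [1, 1, 0, 0, 0]
parent_b = [0, 0, 0, 1, 1]
child = one_point_crossover(parent_a, parent_b, rng)
mutant = mutate(child, rate=0.2, rng=rng)
```

Every bit vector of the right length decodes to some subset of the lattice, so mutation and crossover never produce an invalid individual.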

  30. Result: Points of Pareto-front • Childcount vs. Completeness • Pareto-front for different minimal support • Instances

  31. Application • Bibsonomy social bookmark system: Hotho, Jäschke, Schmitz, Stumme 2006 • 780 users, 59,000 resources, 25,000 tags • 4,000 frequent tag sets • Optimization according to Childcount vs. Completeness and Overlap vs. Coverage

  32. Multi-objective Tagset Clustering • Multi-objective optimization allows the user to select among equally good clusterings --> heterogeneity of users is respected • High scalability, high dimensionality • Understandable labels (tags) • Hierarchical structure for navigation.

  33. Challenges for Data Mining • High dimensional data • High throughput data • Distributed Data • P2P networks • Web 2.0 • Diverse user preferences • Service for end-user systems, e.g. mobile “phones”
