
Generating Semantic Annotations for Frequent Patterns with Context Analysis

Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai. University of Illinois at Urbana-Champaign. June 6, 2014.



Presentation Transcript


  1. Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University of Illinois at Urbana-Champaign June 6, 2014

  2. Frequent Pattern Mining ([Agrawal & Srikant 94] and many others) • Itemsets: diaper, milk; camera, film; … • Sequential patterns: ... Mining Closed Frequent Graph Patterns …; … Mining Graph and Structured Patterns in ... … • Subgraph patterns • Database → Frequent Patterns: A, B, C, D, E, F, AB, EF, AE, CD, CE, DE, AF, BE, BF, CDE, ABE, ABF

  3. Toward Understanding the Patterns: Find Canonical Patterns (Yan et al. '05; Xin et al. '05) • Database → Frequent Patterns: A, B, C, D, E, F, AB, EF, AE, CD, CE, DE, AF, BE, BF, CDE, ABE, ABF

  4. Toward Understanding the Patterns: How to Interpret Patterns? • Do they all make sense? • What do they mean? • How are they useful? • Example patterns: diaper, beer; female sterile (2) tekele • Morphological info. and simple statistics → semantic information • Not all frequent patterns are useful, only those with meaning • Our goal: annotate patterns with semantic information

  5. Challenges • How can we represent the semantics of a frequent pattern? (Annotate a pattern with what?) • How can we infer pattern semantics? (How to annotate?) • How can we do it in a general way? (Do it for all kinds of patterns) • Once such annotations are generated, what can we use them for? (Applications)

  6. A Dictionary Analogy • Word: “pattern” (from Merriam-Webster) • Non-semantic info. • Definitions indicating semantics • Examples of usage • Synonyms • Related words

  7. What about a “Pattern Dictionary”? Semantic Pattern Annotation (SPA) • Dictionary entry. Word: “pattern”. Non-semantic: function; pronunciation; date; etc. Definitions: A form or model proposed for … Examples: a dressmaker’s pattern; a pattern of dissent. Synonyms: design, device, motif, motive, … Related words: original, constellation, … • Pattern entry. Pattern: “latent semantic analysis”. Non-semantic: sequential; closed; sup = 0.1%. Context Indicators (CI): “indexing”, “semantic”, “S. Dumais”, “singular value decomposition”, … Representative Transactions: “index by latent semantic analysis”; “probabilistic latent semantic analysis”. Semantically Similar Patterns (SSP): “latent semantic indexing”, “LSA”, “PLSA”

  8. How Can We Generate Such an Entry? • Database → Frequent Patterns (P1: AB, P2: CD, P3: …, Pn) → ? → Semantic Annotations • How to infer the semantics of a frequent pattern?

  9. Continue the Analogy… • “You shall know a word by the company it keeps.” (Firth 1957) • Data … association … pattern … MINE … algorithm … • mountain … Africa … diamond … MINE … weight … • You’ll know the meaning of a pattern by its context • Pattern and context: {A,B}: { … Baby, Milk, Diaper, Toy, Soymilk, … }; {C,D}: { … Printer, Film, Camera, Lens, … }

  10. Our Approach: Model the Context • Database → Frequent Patterns (P1: AB, P2: CD, …, Pn) → Context Units → Semantic Annotations • Context units = objects co-occurring with p, e.g. <E, F, …, EF, …, ABE> and <E, F, …, EF, …, CDEF>

  11. Semantic Analysis with Context Models • Task1: Model the context of a frequent pattern • Based on the context model: • Task2: Extract strongest context indicators • Task3: Extract representative transactions • Task4: Extract semantically similar patterns

  12. Task1: Context Modeling, a Vector Space Model • Database → Frequent Patterns (P1: AB, P2: CD, …, Pn) → Context Units → Semantic Annotations • Each pattern is represented as a weighted vector over its context units, e.g. P1: AB over <E, F, …, EF, …, ABE> with weights <2.0, 2.0, …, 1.0, …, 1.0>; P2: CD over <E, F, …, EF, …, CDEF> • Context unit weight: co-occurrence, mutual information, … • Context similarity: cosine similarity, Pearson coefficient, …
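
The vector space model on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes itemset patterns stored as frozensets, uses raw co-occurrence counts as context unit weights, and compares two context vectors with cosine similarity. The function names and the toy transactions are hypothetical.

```python
from math import sqrt

def cooccurrence(pattern, unit, transactions):
    """Count transactions that contain both the pattern and the context unit."""
    return sum(1 for t in transactions if pattern <= t and unit <= t)

def context_vector(pattern, units, transactions):
    """Weight of each context unit for this pattern (co-occurrence weighting)."""
    return [float(cooccurrence(pattern, u, transactions)) for u in units]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy data (hypothetical), loosely following the diaper/camera example.
transactions = [
    frozenset({"diaper", "milk", "babywear", "lotion"}),
    frozenset({"diaper", "milk", "toy"}),
    frozenset({"camera", "film", "printer"}),
    frozenset({"camera", "film", "lens"}),
]
units = [frozenset({x}) for x in ["babywear", "lotion", "toy", "printer", "lens"]]

v_diaper_milk = context_vector(frozenset({"diaper", "milk"}), units, transactions)
v_camera_film = context_vector(frozenset({"camera", "film"}), units, transactions)
v_diaper = context_vector(frozenset({"diaper"}), units, transactions)

print(cosine(v_diaper_milk, v_camera_film))  # disjoint contexts
print(cosine(v_diaper_milk, v_diaper))       # identical contexts
```

Patterns whose contexts share no units score 0, while patterns that always appear in the same company score close to 1. Other weightings or similarity measures named on the slide could be swapped in without changing the overall structure.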

  13. Context Unit Selection • Example transactions: t1: diaper, milk, babywear, lotion; t2: camera, memory stick, printer • Valid context units: single items (diaper, milk, printer, …), itemsets (milk, lotion; camera, …), transactions (t1, t2) • In general, context units are frequent patterns

  14. Context Unit Selection: Redundancy Removal • Problem: too many valid context units, most are redundant • { Diaper, milk, babywear }: “diaper”, “diaper, milk”, “milk, babywear”, “milk, lotion”, … • Solution: • use closed patterns • micro-clustering (hierarchical or one-pass) • Jaccard distance between patterns (γ: threshold to stop clustering)
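
The transcript does not preserve the distance formula, so the sketch below makes two assumptions: the Jaccard distance is the standard one, computed on the sets of transactions that support each pattern, and the one-pass variant assigns each pattern to the first cluster whose representative lies within γ, opening a new cluster otherwise. All names and data are hypothetical.

```python
def support_set(pattern, transactions):
    """Indices of the transactions that contain the pattern."""
    return {i for i, t in enumerate(transactions) if pattern <= t}

def jaccard_distance(a, b):
    """Standard Jaccard distance between two sets: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def one_pass_clusters(patterns, transactions, gamma):
    """Greedy single-pass clustering (an assumed reading of 'one-pass')."""
    clusters = []  # each entry: (support set of the representative, member patterns)
    for p in patterns:
        s = support_set(p, transactions)
        for rep, members in clusters:
            if jaccard_distance(rep, s) <= gamma:
                members.append(p)
                break
        else:
            # No existing cluster is close enough: p starts a new one.
            clusters.append((s, [p]))
    return [members for _, members in clusters]

transactions = [
    frozenset({"diaper", "milk", "babywear"}),
    frozenset({"diaper", "milk", "lotion"}),
    frozenset({"camera", "film"}),
]
patterns = [
    frozenset({"diaper"}),
    frozenset({"diaper", "milk"}),
    frozenset({"milk"}),
    frozenset({"camera"}),
]
clusters = one_pass_clusters(patterns, transactions, gamma=0.2)
print(clusters)
```

Here "diaper", "diaper, milk" and "milk" are supported by exactly the same transactions, so they collapse into one micro-cluster and only one of them needs to be kept as a context unit.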

  15. Task2: Extract Context Indicators (Context Unit Weighting) • Database → Frequent Patterns (P1: AB, P2: CD, …, Pn) → Context Units → Semantic Annotations • Candidate context units: <A, B, AB, C, D, CD, E, F, EF, AE, BF, …, ABE, ABF, …, ABEF> • Selected context units: <AB, CD, …, EF, …, ABE, …> • Context vector: <3.0, 0, …, 2.0, …, 1.0, …> • Ranked context indicators: AB 3.0, EF 2.0, ABE 1.0, …
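
One way to picture the weighting and ranking step is the sketch below. It assumes mutual information between the binary events "transaction contains the pattern" and "transaction contains the unit" as the weight (one of the options named for Task1), then returns the k highest-weighted units. The function names and toy transactions are hypothetical, and the transaction list is assumed to be non-empty.

```python
from math import log2

def mutual_information(pattern, unit, transactions):
    """MI in bits between 'contains pattern' and 'contains unit'."""
    n = len(transactions)
    n_p = sum(1 for t in transactions if pattern <= t)
    n_u = sum(1 for t in transactions if unit <= t)
    n_pu = sum(1 for t in transactions if pattern <= t and unit <= t)
    # Joint counts for the four (pattern present?, unit present?) cells.
    cells = {
        (1, 1): n_pu,
        (1, 0): n_p - n_pu,
        (0, 1): n_u - n_pu,
        (0, 0): n - n_p - n_u + n_pu,
    }
    mi = 0.0
    for (x, y), count in cells.items():
        if count == 0:
            continue  # an empty cell contributes nothing
        p_xy = count / n
        p_x = (n_p if x else n - n_p) / n
        p_y = (n_u if y else n - n_u) / n
        mi += p_xy * log2(p_xy / (p_x * p_y))
    return mi

def top_context_indicators(pattern, units, transactions, k):
    """Return the k context units with the largest weight, as (name, weight)."""
    scored = [(mutual_information(pattern, u, transactions), name)
              for name, u in units.items()]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [(name, w) for w, name in scored[:k]]

transactions = [frozenset("ABEF"), frozenset("ABE"), frozenset("CD"), frozenset("CDF")]
units = {"E": frozenset("E"), "F": frozenset("F")}
indicators = top_context_indicators(frozenset("AB"), units, transactions, k=1)
print(indicators)
```

Note that mutual information is also large for units that systematically avoid the pattern, so a practical version would likely require a minimum co-occurrence count before ranking.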

  16. Task3: Extract Representative Transactions • Database → Frequent Patterns → Context Units → Semantic Annotations • Context units: <AB, CD, …, EF, …, ABE, …> • P1: AB: <3.0, 0, …, 2.0, …, 1.0> • T1: <1.0, 0, …, 1.0, …, 1.0>; T5: … • Semantic similarity to AB: T5 0.8, T1 0.6, T3 0.6, …
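
A rough sketch of this ranking, under stated assumptions: each transaction is mapped to a binary vector over the same context units (1.0 if it contains the unit), and transactions are ranked by cosine similarity to the pattern's context vector. The helper names and the toy transactions are hypothetical; the pattern vector reuses the weights shown on the slide.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def transaction_vector(transaction, units):
    """Binary vector: 1.0 where the transaction contains the context unit."""
    return [1.0 if u <= transaction else 0.0 for u in units]

def representative_transactions(pattern_vec, transactions, units, k):
    """Top-k transactions by similarity to the pattern's context vector."""
    scored = [(cosine(pattern_vec, transaction_vector(t, units)), name)
              for name, t in transactions.items()]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [(name, score) for score, name in scored[:k]]

# Context units <AB, CD, EF, ABE> and the context vector of pattern AB.
units = [frozenset("AB"), frozenset("CD"), frozenset("EF"), frozenset("ABE")]
pattern_vec = [3.0, 0.0, 2.0, 1.0]
transactions = {
    "T1": frozenset("ABEF"),
    "T2": frozenset("CD"),
    "T3": frozenset("AB"),
}
top = representative_transactions(pattern_vec, transactions, units, k=2)
print(top)
```

T1 ranks first because it contains every context unit that carries weight for AB, while T2 shares none of them and scores 0.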

  17. Task4: Extract Semantically Similar Patterns • Database → Frequent Patterns → Context Units → Semantic Annotations • Context units: <AB, CD, …, EF, …, ABE, …> • P1: AB: <3.0, 0, …, 2.0, …, 1.0> • P2: CD: <0, 3.0, …, 2.0, …, 0.5>; Pk: EF: … • Semantic similarity to AB: CD 0.7, BF 0.5, EF 0.3, …

  18. Experiments • Three different real world applications • Annotating DBLP title/authors Patterns • Motif/Gene-Ontology (GO) matching • Gene Synonyms extraction • Study the effectiveness of the proposed SPA methods • Explore applications of SPA to different real world tasks

  19. Annotating DBLP Co-authorship and Title Pattern • Database (Authors | Title): X. Yan, P. Yu, J. Han | Substructure Similarity Search in Graph Databases; … | … • Frequent patterns: P1: { x_yan, j_han } (frequent itemset); P2: “substructure search” (frequent sequential pattern) • Context units: < { p_yu, j_han }, { d_xin }, …, “graph pattern”, …, “substructure similarity”, … > • Semantic annotations

  20. DBLP Results: Frequent Itemset Pattern= {xifeng_yan, jiawei_han} Annotations:

  21. DBLP Results: Freq. Seq. Pattern Pattern= “Information … retrieval” Annotations:

  22. Motif-GO Matching • Motif: a subsequence pattern in the sequences • Gene Ontology (GO) terms: annotating the functionality of sequences and motifs • Diagram: Sequence 1 contains motif1, motif2; Sequence 2 contains motif2, motif3; Sequence 3 contains motif2, motif4, motif5; the sequences are annotated with GO terms 1 to 5 • Which GO terms match motif2?

  23. Motif-GO Matching (Cont.) • Database (Protein Sequence | GO terms): sequence containing Motif 1 | GOTerm1; GOTerm2; GOTerm3; … | GOTerm3; … | … • Frequent patterns: P1: Motif1 (sequential pattern); P2: GOTerm2 (single item pattern) • Context units: < Motif1, Motif3, …, GOTerm1, GOTerm2, … > • Semantic annotations → Motif-GO matching

  24. Motif/GO Matching: Evaluation • Gold standard generated by human experts • Measure: Mean reciprocal rank (MRR) • Reflects ranking accuracy (the higher the better) • 1/Rank (0.5 means the correct answer is ranked as the 2nd ) • Results: Weights for Context Units: Ranking Strategy
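
For reference, mean reciprocal rank can be computed as in the sketch below. The rankings and the gold standard shown here are made-up placeholders, not data from the talk; the function names are hypothetical.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item in the ranking, or 0.0 if none appears."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(rankings, gold):
    """Average reciprocal rank over all queries (here: motifs)."""
    return sum(reciprocal_rank(rankings[q], gold[q]) for q in rankings) / len(rankings)

# Placeholder rankings of GO terms per motif and a placeholder gold standard.
rankings = {
    "motif1": ["GOTerm2", "GOTerm1", "GOTerm3"],
    "motif2": ["GOTerm3", "GOTerm1", "GOTerm2"],
}
gold = {
    "motif1": {"GOTerm1"},  # correct answer ranked 2nd: reciprocal rank 0.5
    "motif2": {"GOTerm3"},  # correct answer ranked 1st: reciprocal rank 1.0
}
mrr = mean_reciprocal_rank(rankings, gold)
print(mrr)
```

An MRR of 0.5 therefore corresponds to the correct answer sitting at rank 2 on average, which is the reading given on the slide.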

  25. Gene Synonym Extraction • Gene Synonyms: • A Sequential Pattern in the textual database • Matching gene synonyms: a challenging and important new problem in mining biology data • Analogy: thesaurus or synonyms in dictionary

  26. Gene Synonym Extraction (Cont.) • Database (biomedical sentences): “… D. melanogaster gene Female sterile (2) Tekele …”; “… Female sterile (2) Tekele, abbreviated as Fs(2)Tek …”; … • Frequent patterns: P1: female sterile (2) tekele (sequential pattern); P2: Fs(2)Tek (sequential pattern) • Context units: < gene, female, …, d. melanogaster gene, … >; context units can be single words or sequential patterns • Semantic annotations → matched synonyms

  27. Gene Synonym Extraction: Results • [Charts: MRR (hierarchical); MRR (one-pass); running time (hierarchical); running time (one-pass)] • Effective! MRR > 0.5 • frequent pattern >> single words • Micro-clustering is useful

  28. Conclusions • A novel problem: semantic pattern annotation • A structured annotation for frequent patterns • A general method based on context modeling • A general post-processing procedure for frequent pattern mining on any type of pattern • Applicable to and effective for quite different tasks • Future work: • Tune for specific tasks • Better context unit weights, redundancy removal, etc.

  29. Thanks and Questions
