
Compact Encodings for All Local Path Information in Web Taxonomies with Application to WordNet

Svetlana Strunjaš-Yoshikawa, joint with Fred Annexstein and Kenneth Berman ({strunjs,annexste,berman}@ececs.uc.edu), University of Cincinnati.


Presentation Transcript


  1. Compact Encodings for All Local Path Information in Web Taxonomies with Application to WordNet
     Svetlana Strunjaš-Yoshikawa, joint with Fred Annexstein and Kenneth Berman
     {strunjs,annexste,berman}@ececs.uc.edu
     University of Cincinnati

  2. Introduction
     • Consider the Lowest Common Ancestor (LCA) query problem
     • Find the most specific common generalization, or least common subsumer, among two or more terms or attributes in a large hierarchical (classification) data set
     • Constraint: evaluate queries without indirection
     • Goal: compact labeling schemes for taxonomies

  3. Introduction (cont'd)
     • Applications
       - Fast classification of sets and similarity, e.g., prediction sets similar to Google Sets (given "Bush" and "Clinton", it predicts all other US presidents)
       - Fast answers to ancestor queries in XML search, e.g., testing whether two terms share a parent node without loading the XML file (see [1], [2])
       - Fast navigation through voluminous web taxonomies (see [3])

  4. Data Model
     • Structural properties found in well-known web taxonomies:
       - large variance in out-degree (Δ), i.e., some nodes have many subclasses
       - small range and variance of in-degree (δ)
       - small (logarithmic) depth (σ)
       - small number (>1) of paths from the root
     • See the paper for a table of statistical values for the WordNet, ODP, and Math taxonomies
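The structural properties listed above can be measured directly on a DAG. The following is a minimal sketch, not the authors' tooling: the function name `dag_statistics`, the dictionary representation, and the toy taxonomy are all assumptions made for illustration, and depth is taken here to mean the longest root-to-node path.

```python
from collections import deque

def dag_statistics(children, root):
    """Summarize a rooted DAG given as {node: [child, ...]}.

    Assumes the graph is acyclic and every node is reachable from `root`.
    """
    nodes = set(children)
    for kids in children.values():
        nodes.update(kids)

    in_deg = {v: 0 for v in nodes}
    for kids in children.values():
        for c in kids:
            in_deg[c] += 1

    # Visit nodes in topological order (Kahn's algorithm) so that a node's
    # depth and path count are final before its children are updated.
    pending = dict(in_deg)
    depth = {v: 0 for v in nodes}   # longest path from the root, in edges
    paths = {v: 0 for v in nodes}   # number of distinct root-to-node paths
    paths[root] = 1
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for c in children.get(u, []):
            depth[c] = max(depth[c], depth[u] + 1)
            paths[c] += paths[u]
            pending[c] -= 1
            if pending[c] == 0:
                queue.append(c)

    return {
        "max_out_degree": max(len(children.get(v, [])) for v in nodes),
        "max_in_degree": max(in_deg.values()),
        "depth": max(depth.values()),
        "max_root_paths": max(paths.values()),
    }

# Hypothetical four-node taxonomy: "dog" has two parents.
taxonomy = {
    "entity": ["animal", "pet"],
    "animal": ["dog"],
    "pet": ["dog"],
    "dog": [],
}
stats = dag_statistics(taxonomy, "entity")
```

In this toy graph, "dog" is reachable by two root paths, which is the kind of small multiplicity the data model describes.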

  5. Our Approach
     • Given: large, rooted web taxonomies, represented abstractly as a directed acyclic graph (DAG) with the above statistics
     • Problem: label each node of the DAG so that all local path information for each taxonomy element is preserved in the encoding
     • Our labeling scheme is a variable-length, prefix-based scheme, built up in two stages

  6. Our Approach (cont'd)
     1. Greedy Dewey Labeling for Trees (TGDL)
        - Identifies a breadth-first tree T in the DAG
        - Encodes path information for the paths in T
        - Labels each node with the concatenation of edge labels
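The tree stage can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the slides do not say how the greedy scheme orders children, so this sketch simply numbers tree children in the order they are listed and gives the i-th one the i-th shortest binary string (0, 1, 00, 01, ...). The names `shortlex_label`, `tgdl_labels`, and `tree_lca_label` are hypothetical. Labels are kept as tuples of edge labels so that the lowest common ancestor inside T is just the longest common prefix.

```python
from collections import deque

def shortlex_label(i):
    """i-th binary string (i = 0, 1, 2, ...) in the order 0, 1, 00, 01, 10, 11, 000, ..."""
    return bin(i + 2)[3:]

def tgdl_labels(children, root):
    """Label each node with the edge labels along its path in a BFS tree.

    `children` maps a node to its list of children in a rooted DAG.
    Child ordering is an assumption; the greedy criterion is not given in the slides.
    """
    labels = {root: ()}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        rank = 0
        for c in children.get(u, []):
            if c not in labels:  # first discovery makes (u, c) a BFS tree edge
                labels[c] = labels[u] + (shortlex_label(rank),)
                rank += 1
                queue.append(c)
    return labels

def tree_lca_label(a, b):
    """Label of the lowest common ancestor in T: the longest common prefix."""
    common = []
    for x, y in zip(a, b):
        if x != y:
            break
        common.append(x)
    return tuple(common)

# Hypothetical DAG: "d" has parents "a" and "b"; only (a, d) is a tree edge.
dag = {"r": ["a", "b", "c"], "a": ["d"], "b": ["d", "e"]}
labels = tgdl_labels(dag, "r")
```

Because the answer is computed from the two labels alone, no lookup into the graph is needed, which is one way to read the "without indirection" constraint. Paths through non-tree edges such as (b, d) are not visible to this stage.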

  7. GDL example

  8. TGDL example

  9. Analysis of the Length of TGDL Labels
     • Performed in two steps
     • First step: assume that the delimiting labels are empty and bound the number of bits in the label of each node v
     • Second step: estimate an upper bound on node label length under different edge-delimiting schemes

  10. Delimiting schemes
      • They encode the length of each tree-edge label
      • Two approaches tested:
        - Unary Length Encoding
        - Fixed Binary Length Encoding

  11. Unary Length Encoding (ULE)
      • Comparable to the Elias gamma code:

            n   Gamma     ULE
            1   1         10
            2   010       11
            3   011       0100
            4   00100     0101
            5   00101     0110
            6   00110     0111
            7   00111     001000
            8   0001000   001001

      • ULE assigns a zero prefix |e|-1 bits long to an edge label e whose GDL label has length |e|
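The ULE column can be reproduced with a short sketch. Two things here are assumptions read off the table rather than stated on the slide: that the n-th GDL edge label is the n-th shortest binary string (0, 1, 00, 01, ...), and that a single 1 bit separates the zero prefix from the label. The function names are hypothetical.

```python
def elias_gamma(n):
    """Elias gamma code of a positive integer n."""
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def gdl_edge_label(n):
    """Assumed GDL label of the n-th child (n = 1, 2, ...): 0, 1, 00, 01, 10, 11, 000, ..."""
    return bin(n + 1)[3:]

def ule(n):
    """|e|-1 zeros, a terminating 1 (inferred from the table), then the label e."""
    e = gdl_edge_label(n)
    return "0" * (len(e) - 1) + "1" + e

def decode_ule_path(bits):
    """Split a concatenation of ULE codes back into GDL edge labels.

    Assumes `bits` is a well-formed concatenation of ULE codes.
    """
    labels = []
    pos = 0
    while pos < len(bits):
        k = 1
        while bits[pos] == "0":  # each leading zero adds one bit to the label length
            k += 1
            pos += 1
        pos += 1                 # skip the terminating 1
        labels.append(bits[pos:pos + k])
        pos += k
    return labels

table = [(n, elias_gamma(n), ule(n)) for n in range(1, 9)]
path = ule(3) + ule(1) + ule(7)
```

Under these assumptions a ULE code is always twice as long as the label it delimits, and a node label built by concatenating such codes can be split into its edges by scanning left to right.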

  12. Unary Length Encoding (ULE) Analysis
      Theorem: for an arbitrary node v in a tree T, the TGDL label length with ULE delimiters has an upper bound expressed in terms of
      - the depth of v in T
      - n, the number of nodes in T

  13. Fixed Binary Length Encoding (FBLE)
      • For an edge e, this encoding is the binary representation of the length of GDL(e)
      • Encoded with a fixed number of bits, determined by Δ
        - Δ is the maximum node out-degree in T
        - uses 4 bits in our application

  14. FBLE example
      - 4 bits encode the delimiters for any T with maximum out-degree < 2^16
      - Let e be an edge in T with a given GDL label, e.g., GDL(e) = 0000111111. FBLE then produces the delimiter 1010 (the label length, 10, in binary), so the label for e is 10100000111111
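The worked example above can be checked with a small sketch. The function names and the `width` parameter are hypothetical; the default of 4 bits matches the delimiter width used in the slides.

```python
def fble_encode(gdl_label, width=4):
    """Prefix a GDL edge label with its length as a fixed-width binary number."""
    if len(gdl_label) >= 2 ** width:
        raise ValueError("label too long for the delimiter width")
    return format(len(gdl_label), "0{}b".format(width)) + gdl_label

def fble_decode_path(bits, width=4):
    """Split a concatenation of FBLE-delimited edge labels back into GDL labels.

    Assumes `bits` is a well-formed concatenation produced by fble_encode.
    """
    labels = []
    pos = 0
    while pos < len(bits):
        length = int(bits[pos:pos + width], 2)  # fixed-width length field
        pos += width
        labels.append(bits[pos:pos + length])
        pos += length
    return labels

edge = fble_encode("0000111111")
path = fble_encode("0") + edge
```

Compared with a unary delimiter, the cost per edge is a constant `width` bits regardless of label length, which favors trees where a few nodes have very large out-degree.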

  15. Fixed Binary Length Encoding (FBLE) Analysis
      • An upper bound on the TGDL label length with FBLE delimiters is given for an arbitrary node v in a tree T

  16. Our Approach (cont'd 2)
      2. Extended Greedy Dewey Labeling for DAGs (EGDL)
         - Augments the codes generated in stage 1
         - Used for inferring paths that are not part of the breadth-first tree
         - Adds TGDL node label pairs for non-tree edges
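A rough sketch of the raw material for this stage: collect, for every DAG edge that is not in the breadth-first tree, the pair of TGDL labels of its endpoints. How EGDL attaches these pairs to node codes is not described in the slides, so this sketch stops at listing them. The function names, the dotted label notation, child ordering, and the toy graph are assumptions; every node is assumed reachable from the root, with no duplicate edges.

```python
from collections import deque

def bfs_tree_labels(children, root):
    """Return dotted tree labels and the set of BFS tree edges of a rooted DAG."""
    labels = {root: ""}
    tree_edges = set()
    queue = deque([root])
    while queue:
        u = queue.popleft()
        rank = 0
        for c in children.get(u, []):
            if c not in labels:
                # Assumed edge labels: 0, 1, 00, 01, ... in listing order.
                labels[c] = labels[u] + "." + bin(rank + 2)[3:]
                tree_edges.add((u, c))
                rank += 1
                queue.append(c)
    return labels, tree_edges

def non_tree_label_pairs(children, root):
    """List (label of tail, label of head) for each edge outside the BFS tree."""
    labels, tree_edges = bfs_tree_labels(children, root)
    pairs = []
    for u, kids in children.items():
        for c in kids:
            if (u, c) not in tree_edges:
                pairs.append((labels[u], labels[c]))
    return pairs

# Hypothetical DAG: "c" has parents "a" and "b"; (b, c) is the non-tree edge.
dag = {"r": ["a", "b"], "a": ["c"], "b": ["c"]}
pairs = non_tree_label_pairs(dag, "r")
```

Each pair records that a path exists from the first label's node to the second label's node even though the tree labels alone do not show it.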

  17. EGDL Labeling - Example (labels shown in the figure: .01*.0.01, .01*.0.0, .0.01*.0.01)

  18. Experimental Results for the WordNet taxonomy (n = 80K)

  19. Experimental Results: Label Lengths (encoding length, WordNet 2.1 statistics)

  20. References
      [1] Budanitsky, A., Hirst, G. Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. Workshop on WordNet and Other Lexical Resources, Second meeting of the North American Chapter of the Association for Computational Linguistics, Pittsburgh, PA, 2001.
      [2] Resnik, P. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI), pages 448–453, 1995.
      [3] Christophides, V., Plexousakis, D. On Labeling Schemes for the Semantic Web. In Proceedings of the 12th International Conference on World Wide Web, pages 544–555, Budapest, Hungary.
      [4] Abiteboul, S., Kaplan, H., Milo, T. Compact labeling schemes for ancestor queries. In Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 547–556, Washington, D.C., 2001.
      [5] Strunjaš-Yoshikawa, S., Annexstein, F., Berman, K. Compact Encodings for All Local Path Information in Web Taxonomies with Application to WordNet. In Proceedings of the 32nd International Conference on Current Trends in Theory and Practice of Computer Science, Merin, Czech Republic, January 21-27, 2006.
