
Measuring the Structural Similarity of Semistructured Documents Using Entropy



Presentation Transcript


1. Measuring the Structural Similarity of Semistructured Documents Using Entropy
Sven Helmer, Birkbeck, University of London
VLDB'07, September 23-28, 2007, Vienna, Austria
Summarized and presented by Seungseok Kang, IDS Lab., Seoul National University (February 5, 2008)

2. Outline
• Introduction
• Measuring Entropy
• Information Distance
• Measuring Structural Similarity
  • Extracting structural information
  • Ziv-Lempel Encoding
  • Ziv-Merhav Crossparsing
• Competitors
  • Tree-editing distance
  • Discrete Fourier Transformation
  • Path shingles
• Experiments
  • Clustering Quality
  • Document Collections
  • Results
• Conclusion

3. Introduction
• Semistructured documents: HTML, XHTML, XML, …
• In traditional IR, detecting similarities is widely used
  • For querying
  • For clustering
• There are many similarity measures for text documents
• Semistructured documents pose new challenges: measuring structural similarity
  • Semistructured documents show great structural diversity
  • Measuring structural similarity can help integrate heterogeneous data sources

4. Measuring Entropy
• Bennett et al. introduced the concept of a universal information metric
  • Based on Kolmogorov complexity
• Given a data object x, the Kolmogorov complexity K(x) is the length of the shortest program that outputs x
• Conditional Kolmogorov complexity K(x|y)
  • Generalized form of Kolmogorov complexity
  • Length of the shortest program that outputs x given y as input

5. Information Distance
• The similarity of two data objects can be measured by the normalized information distance
  • Based on conditional Kolmogorov complexity
• It has some nice properties
  • It is (almost) a metric
  • It is a lower bound for admissible distances
• Unfortunately, Kolmogorov complexity is not computable in general
  • It can be approximated by compression, where C(x) is the length of the compressed x (see the sketch below)
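The formulas behind these bullets are the standard ones from the NID/NCD literature: the normalized information distance is NID(x, y) = max(K(x|y), K(y|x)) / max(K(x), K(y)), and its compression-based approximation is NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)). Below is a minimal Python sketch of the NCD; the choice of gzip as the compressor C and the function names are illustrative assumptions, not prescribed by the paper.

import gzip

def C(data: bytes) -> int:
    # Length of the gzip-compressed data, used as a computable stand-in for K().
    return len(gzip.compress(data))

def ncd(x: bytes, y: bytes) -> float:
    # Normalized compression distance: close to 0 for similar inputs, close to 1 for unrelated ones.
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

doc_a = b"<a><b/><b/><c/></a>" * 50
doc_b = b"<a><b/><c/><c/></a>" * 50
doc_c = b"<x><y>text</y></x>" * 50
print(ncd(doc_a, doc_b))  # structurally similar pair: smaller distance
print(ncd(doc_a, doc_c))  # structurally different pair: larger distance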

6. Measuring Structural Similarity
• Just compressing the XML files does not get the job done
• Instead, extract the structural information first (a sketch follows this list)
  • Tags: list element/attribute names in document order
  • Pairwise: like Tags, but each name is paired with its parent's name
  • Path: like Tags, but with the full path to the root
  • Family order: family-order traversal of the document
• Except for Path, all extractions can be done in linear time
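A minimal Python sketch of the first three extraction variants, assuming only element names matter (the paper also considers attribute names); the function and variable names are illustrative.

import xml.etree.ElementTree as ET

def extract(xml_text: str):
    # Return the Tags, Pairwise and Path sequences for one document.
    root = ET.fromstring(xml_text)
    tags, pairwise, paths = [], [], []

    def visit(node, parent_tag, path):
        tags.append(node.tag)                    # Tags: element name in document order
        pairwise.append((parent_tag, node.tag))  # Pairwise: element name plus parent name
        full_path = path + "/" + node.tag
        paths.append(full_path)                  # Path: full path from the root
        for child in node:
            visit(child, node.tag, full_path)

    visit(root, None, "")
    return tags, pairwise, paths

tags, pairwise, paths = extract("<a><b><c/></b><b/></a>")
print(tags)   # ['a', 'b', 'c', 'b']
print(paths)  # ['/a', '/a/b', '/a/b/c', '/a/b']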

7. Measuring Structural Similarity (cont.)
• After extracting the structural information, come up with a similarity measure
  • NCD with GZIP (Ziv-Lempel encoding)
  • Ziv-Merhav crossparsing
  • Both can be done in linear time
• The actual value of the NCD is calculated either by encoding or by crossparsing
• Crossparsing is an alternative way of measuring the relative entropy between the sequences in the files, instead of compressing them

8. Measuring Structural Similarity (cont.)
• Ziv-Lempel Encoding
  • Builds a dictionary for encoding by reusing substrings that have already appeared in the previously encoded part of the document
  • Emits triples <distance, length, symbol>
  • Example: the string "aababbccccd" is encoded as <0,0,a>, <1,1,b>, <2,2,b>, <0,0,c>, <1,3,d>
• Ziv-Merhav Crossparsing
  • Repeatedly find the longest prefix of string x that appears as a substring in y, until the whole document x has been parsed (see the sketch below)
  • Example: for x = "abbbbaaabba" and y = "baababaabba", c(x|y) = 3, the three phrases being "abb", "bba", "aabba"
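A minimal Python sketch of the crossparsing step, assuming the simple rule described above (always take the longest prefix of the remaining part of x that occurs anywhere in y, and consume a single symbol if there is no match); it reproduces the example from the slide.

def cross_parse_count(x: str, y: str) -> int:
    # Ziv-Merhav crossparsing: number of phrases needed to parse x against y.
    phrases, i = 0, 0
    while i < len(x):
        # Grow the current phrase while it still occurs somewhere in y.
        length = 0
        while i + length < len(x) and x[i:i + length + 1] in y:
            length += 1
        # A symbol that never occurs in y still consumes one character.
        i += max(length, 1)
        phrases += 1
    return phrases

print(cross_parse_count("abbbbaaabba", "baababaabba"))  # 3, phrases "abb", "bba", "aabba"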

9. Competitors
• Tree-editing distance (Nierman and Jagadish)
• Discrete Fourier Transformation (Flesca et al.)
• Path shingles (Buttler)

10. Tree-editing Distance
• Measures the minimum number of edit operations needed to transform one document tree into the other, analogous to the Levenshtein distance between strings (a sketch of the string case follows this list)
• Five different edit operations
  • Relabel
  • Insert
  • Delete
  • Insert tree
  • Delete tree
• Quadratic runtime: O(|d1| |d2|), where |d| is the size of document d
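Nierman and Jagadish's measure generalizes the classic string edit distance to trees; a full tree-edit implementation is beyond a single slide, so the Python sketch below only shows the underlying string case (Levenshtein distance) for intuition, not the tree algorithm from the paper.

def levenshtein(s: str, t: str) -> int:
    # Minimum number of relabel/insert/delete operations turning s into t.
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]  # dp[i][j]: distance between s[:i] and t[:j]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1   # relabel cost
            dp[i][j] = min(dp[i - 1][j] + 1,          # delete
                           dp[i][j - 1] + 1,          # insert
                           dp[i - 1][j - 1] + cost)   # relabel / match
    return dp[m][n]

print(levenshtein("abab", "abba"))  # 2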

11. Discrete Fourier Transformation
• Called the DFT technique
• The features of an XML document relevant to its structure are described via a numerical encoding
• The numerical values are interpreted as a time series
  • Rotate the document by 90°, interpret the indentations as a time series
• Use the DFT to compute similarity (a simplified sketch follows this list)
  • Tag encoding
  • Document encoding
  • Document similarity
• Runtime: O(N log N), where N = max(|d1|, |d2|)
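The actual tag and document encodings in Flesca et al. are considerably more elaborate; the Python sketch below only illustrates the general idea under simple assumptions (each tag is mapped to an integer, sequences are zero-padded to a common length, and the magnitude spectra are compared directly).

import numpy as np

def encode(tags, tag_ids):
    # Map a tag sequence to a numeric time series (illustrative encoding).
    return np.array([tag_ids[t] for t in tags], dtype=float)

def dft_distance(tags1, tags2, tag_ids):
    # Compare the magnitude spectra of the two encoded documents.
    s1, s2 = encode(tags1, tag_ids), encode(tags2, tag_ids)
    n = max(len(s1), len(s2))          # zero-pad the shorter sequence
    f1 = np.abs(np.fft.fft(s1, n))
    f2 = np.abs(np.fft.fft(s2, n))
    return float(np.linalg.norm(f1 - f2) / n)

ids = {"a": 1, "b": 2, "c": 3}
print(dft_distance(list("abab"), list("abba"), ids))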

12. Path Shingles
• Extract the structural information using the Full Path variant
• Compute a hash value hj for each path
• A shingle of width w is the combination of w consecutive hash values
• Compute the similarity between two documents using the Dice coefficient on the two sets of shingles (see the sketch below)
• Does not have linear runtime
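A minimal sketch, assuming a shingle is simply the tuple of w consecutive path hashes and that the Dice coefficient is computed on the resulting sets; the parameter names are illustrative.

def shingles(paths, w=3):
    # Hash each root-to-node path and group w consecutive hashes into shingles.
    hashes = [hash(p) for p in paths]
    if len(hashes) < w:
        return {tuple(hashes)}
    return {tuple(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}

def dice(a: set, b: set) -> float:
    # Dice coefficient: 1.0 for identical shingle sets, 0.0 for disjoint ones.
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

p1 = ["/a", "/a/b", "/a/b/c", "/a/b", "/a/d"]
p2 = ["/a", "/a/b", "/a/b/c", "/a/d"]
print(dice(shingles(p1), shingles(p2)))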

13. Clustering Quality
• Measure the quality of a similarity measure by clustering
• Hierarchical agglomerative clustering (a sketch follows this list)
• Quality is expressed as the number of misclusterings in the dendrogram
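A minimal sketch of how such a clustering could be run on a precomputed distance matrix using SciPy; the paper does not prescribe this library, and the single-linkage choice and the toy distances here are assumptions for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Pairwise distances (e.g. NCD values) between four documents: symmetric, zero diagonal.
D = np.array([[0.0, 0.1, 0.8, 0.9],
              [0.1, 0.0, 0.7, 0.8],
              [0.8, 0.7, 0.0, 0.2],
              [0.9, 0.8, 0.2, 0.0]])

Z = linkage(squareform(D), method="single")       # build the dendrogram
labels = fcluster(Z, t=2, criterion="maxclust")   # cut it into two clusters
print(labels)  # documents 0/1 and 2/3 end up in the same clusters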

14. Document Collections
• Three different document collections are used for the experiments
  • Real data sets: SIGMOD Record, INEX 2005, and music sheets encoded in XML
  • Synthetically generated data sets from the DFT paper
  • The author's own synthetically generated data sets, varying
    • Element names
    • Element frequencies
    • Element positions
    • Element depths

15. Overall Results
• Number of misclusterings for the different methods

16. Detailed Results
• The different methods have different strengths and weaknesses
  • Tree-edit: generally good, but has problems with largely varying document sizes
  • DFT: good at frequencies; bad at element names, positions, and depths
  • GZIP/Ziv-Merhav: bad at frequencies; good at element names, positions, and depths
• DFT and GZIP/Ziv-Merhav are complementary to each other
  • The performance of a hybrid becomes even better
  • However, the hybrid approach does not have linear runtime

17. Runtime
• Shallow documents
  • Crossparsing/pairwise and the path-shingle algorithm
• Deep documents
  • Crossparsing/pairwise

18. Conclusion
• Determining the similarity of documents based on measuring entropy
• The encoding/crossparsing approach is fundamentally different from previous approaches
  • It can be done in linear time, which is important for large document collections
• Future work
  • More sophisticated ways of encoding the document structure
  • Different entropy measures better suited to structural information
