
Integrating Semantics-Based Access Mechanisms with P2P File Systems



Presentation Transcript


  1. Integrating Semantics-Based Access Mechanisms with P2P File Systems Yingwu Zhu, Honghao Wang and Yiming Hu

  2. Outline • Background • System Design • Related Work • Conclusions and Future Work

  3. Background • Current P2P file systems (e.g., CFS and PAST) • Layer FS functionality on a distributed hash table (DHT), e.g., Chord or Pastry • Do not support semantics-based access • Because DHTs support only exact-match lookups

  4. Background • [Figure: software layering in a P2P file system]

  5. Motivation • A problem with P2P file systems • They support only exact-match lookups given a file object identifier (fileID) • get(fileID): retrieves the file corresponding to the fileID • put(fileID, file): stores the file with the fileID as a DHT key • Extending exact-match lookups to semantic access is non-trivial
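The exact-match limitation can be illustrated with a minimal in-memory sketch. `ToyDHT`, its node count, and the example fileIDs are hypothetical stand-ins, not the interface of CFS, PAST, Chord, or Pastry; the point is only that a key that differs at all maps to an unrelated lookup.

```python
import hashlib


class ToyDHT:
    """Hypothetical in-memory stand-in for a DHT (not a real CFS/PAST API)."""

    def __init__(self, num_nodes=8):
        # Each "peer node" is modeled as a plain dictionary.
        self.nodes = [dict() for _ in range(num_nodes)]

    def _node_for(self, key):
        # Hash the key with SHA-1 and map the digest onto one node.
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return self.nodes[int.from_bytes(digest, "big") % len(self.nodes)]

    def put(self, file_id, data):
        # Store the file under its fileID, used as the DHT key.
        self._node_for(file_id)[file_id] = data

    def get(self, file_id):
        # Exact-match lookup: returns None unless the key matches exactly.
        return self._node_for(file_id).get(file_id)


dht = ToyDHT()
dht.put("report-2003.txt", b"contents")
hit = dht.get("report-2003.txt")   # exact key: found
miss = dht.get("report-2003")      # near-identical key: not found
```

Because the node is chosen by hashing the whole key, similar keys give no hint about where similar files live, which is why semantic access needs a separate indexing scheme.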

  6. Motivation • A challenge for P2P file systems • Provide convenient access to vast amounts of information • E.g., provide semantics-based search capabilities to efficiently locate semantically close files for browsing, purging, etc.

  7. System Design • Targeted Application • System Architecture • Semantic Indexing and Locating • Evaluation

  8. Targeted Application • Semantic search is expressed in natural language • Query: “locate files similar to f1” • The query results are materialized via semantic directories • Not a simple keyword match such as “locate files with k1, k2 and k3”, where k1, k2 and k3 are three distinct keywords

  9. System Architecture • Extends a P2P file system to support semantics-based access • Major Components • Semantic Extractor Registry • Semantic Indexing and Locating Utility

  10. System Architecture • [Figure: major components of the system architecture. Labeled components: Application/User, FS, Extractor Registry, Semantic Indexing and Locating Utility, DHT]

  11. Semantic Extractor Registry • A set of semantic extractors • Leverage IR algorithms such as the vector space model (VSM) and latent semantic indexing (LSI) • Represent a file as a semantic vector (SV), typically 200-300 keywords • Semantically close files have similar SVs
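As a rough illustration of what an extractor produces, the sketch below builds a keyword-set SV from plain text. It is a crude term-frequency stand-in, not VSM or LSI as used by the authors; `extract_sv`, the stopword list, and the sample documents are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not from the slides).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on", "for"}


def extract_sv(text, max_terms=300):
    """Return up to max_terms of the most frequent non-stopword terms as a set."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return {term for term, _ in counts.most_common(max_terms)}


doc1 = "peer to peer file systems store files on a distributed hash table"
doc2 = "a distributed hash table lets peer nodes store and locate files"
doc3 = "the recipe needs flour sugar butter and eggs"

sv1, sv2, sv3 = extract_sv(doc1), extract_sv(doc2), extract_sv(doc3)
shared_close = sv1 & sv2   # related documents share several keywords
shared_far = sv1 & sv3     # unrelated documents share none here
```

A real extractor would weight and reduce terms far more carefully; the sketch only shows why related files end up with overlapping SVs.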

  12. Semantic Indexing and Locating Utility • Provides semantics-based indexing and retrieval capabilities • Relies on the properties of locality sensitive hash (LSH) functions • Derives a small number of semantic identifiers (semIDs) from a file’s SV, which serve as the DHT keys for indexing and locating

  13. Semantic Indexing and Locating Utility • Goals • The indices of semantically close files are clustered onto the same peer nodes with high probability (nearly 100%) • Semantically close files can be located efficiently by searching a small number of peer nodes (e.g., 20)

  14. Locality Sensitive Hashing • A family of hash functions F is locality sensitive if, for h ∈ F operating on two sets A and B, we have $\Pr_{h \in F}[h(A) = h(B)] = \mathrm{sim}(A, B)$ • Min-wise independent permutations are LSH • Similarity function: $\mathrm{sim}(A, B) = |A \cap B| \,/\, |A \cup B|$
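The property on this slide can be checked empirically. The sketch below approximates min-wise independent permutations with salted SHA-1 hashes (an assumption; the slides do not say how the permutations are realized) and compares the fraction of hash functions on which two sets agree with their exact similarity. The sets `A` and `B` and all function names are illustrative.

```python
import hashlib


def h(seed, x):
    """Salted SHA-1 hash of x, truncated to 64 bits (stand-in for a permutation)."""
    data = f"{seed}:{x}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def min_hash(seed, items):
    """Min-wise hash of a set: the smallest hash value over its elements."""
    return min(h(seed, x) for x in items)


def jaccard(a, b):
    """sim(A, B) = |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b)


def estimate_sim(a, b, num_hashes=1000):
    """Fraction of hash functions for which the two min-hashes coincide."""
    agree = sum(min_hash(s, a) == min_hash(s, b) for s in range(num_hashes))
    return agree / num_hashes


A = set(range(0, 100))
B = set(range(50, 150))
exact = jaccard(A, B)        # 50 shared elements out of 150 distinct ones
approx = estimate_sim(A, B)  # should land close to the exact value
```

With more hash functions the estimate tightens around the exact similarity, which is the behavior the semantic indexing scheme relies on.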

  15. Semantic Indexing • Given a file’s SV • Step 1: Derive a small number of semIDs from the SV using LSH • Step 2: Index the file by using these semIDs as the DHT keys

  16. Semantic Indexing • Using n groups of m hash functions • Result: the indices of semantically close files are hashed to the same peers with probability $1-(1-p^m)^n$, where p = sim(f1, f2) is the similarity between the two files’ SVs • p is expected to be high for semantically close files, and so is this probability
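The probability on this slide follows from a short argument: one group yields the same semID for both files when all m of its hash functions agree, which happens with probability about p^m if the functions are treated as independent; all n groups fail with probability (1 - p^m)^n, so at least one matches with probability 1 - (1 - p^m)^n. The sketch below evaluates the formula; the values m = 5 and n = 10 are illustrative assumptions, not parameters reported in the slides.

```python
def match_probability(p, m, n):
    """Probability that at least one of n groups of m hash functions
    yields the same semID for two files with similarity p."""
    return 1.0 - (1.0 - p ** m) ** n


# Illustrative parameters (assumed for this example only).
m, n = 5, 10
close = match_probability(0.9, m, n)   # semantically close files
far = match_probability(0.3, m, n)     # dissimilar files

print(f"p=0.9: {close:.4f}")
print(f"p=0.3: {far:.4f}")
```

Raising m makes accidental matches between dissimilar files rarer, while raising n recovers the matches lost for similar files, at the cost of more semIDs (and DHT entries) per file.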

  17. Semantic Indexing • Given a file’s SV A:

      proc sem_index(A)
        convert A into A';              // A' is a set of integers obtained by applying SHA-1
        for each g[j] do                // g[j] is one of the n groups of hash functions
          semID[j] = 0;
          for each h[i] in g[j] do      // g[j] has m hash functions
            semID[j] ^= h[i](A');       // ^ is the XOR operation
          endfor
        endfor
        for each semID[j] do
          insert the tuple <semID[j], fileID, A> into the DHT,
            using semID[j] as the DHT key   // semantic indexing
        endfor
      endproc
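The sem_index procedure above can be sketched in runnable Python under a few assumptions: min-wise hashing is approximated with salted SHA-1, the DHT is a plain dictionary, and the defaults n = 5 and m = 2 are arbitrary. None of these choices come from the slides.

```python
import hashlib


def _sha1_int(text):
    """SHA-1 of a string, truncated to a 64-bit integer."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest()[:8], "big")


def to_int_set(sv):
    """Convert the SV A into A', a set of integers, using SHA-1."""
    return {_sha1_int(term) for term in sv}


def min_hash(seed, ints):
    """One salted min-wise hash over the integer set (assumed LSH function)."""
    return min(_sha1_int(f"{seed}:{x}") for x in ints)


def sem_ids(sv, n=5, m=2):
    """Derive n semIDs, each the XOR of m min-wise hashes of A'."""
    ints = to_int_set(sv)
    ids = []
    for j in range(n):                 # one semID per group g[j]
        sem_id = 0
        for i in range(m):             # XOR the m hashes of the group
            sem_id ^= min_hash(j * m + i, ints)
        ids.append(sem_id)
    return ids


def sem_index(dht, file_id, sv, n=5, m=2):
    """Insert <semID, fileID, A> into the DHT under each semID."""
    for sem_id in sem_ids(sv, n, m):
        dht.setdefault(sem_id, []).append((file_id, frozenset(sv)))


dht = {}  # dictionary standing in for the DHT
sv = {"peer", "file", "hash", "table"}
sem_index(dht, "f1", sv)
```

In a real deployment each semID would be routed by the DHT to the peer responsible for that key, so the n entries for a file generally land on different peers.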

  18. Semantic Locating • Given a query’s SV • Step 1: Derive a small number of semIDs from the SV using LSH • Step 2: Locate those semantically close files by having these semIDs as the DHT keys • Goal: answer a query by consulting only a small number of peer nodes
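The two steps above can be sketched end to end. The example below is a simplified illustration: it hashes SV keywords directly with salted SHA-1, uses a dictionary in place of the DHT, and ranks candidates by the set similarity of their stored SVs. The function names, the parameters n = 5 and m = 2, and the sample files are assumptions; query SVs are assumed to be non-empty.

```python
import hashlib


def _h(text):
    """SHA-1 of a string, truncated to a 64-bit integer."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest()[:8], "big")


def sem_ids(sv, n=5, m=2):
    """n semIDs, each the XOR of m salted min-wise hashes of the SV."""
    ids = []
    for j in range(n):
        sem_id = 0
        for i in range(m):
            seed = j * m + i
            sem_id ^= min(_h(f"{seed}:{term}") for term in sv)
        ids.append(sem_id)
    return ids


def sem_index(dht, file_id, sv):
    """Index a file under each of its semIDs."""
    for sem_id in sem_ids(sv):
        dht.setdefault(sem_id, []).append((file_id, frozenset(sv)))


def sem_locate(dht, query_sv):
    """Look up only the query's semIDs and rank the hits by similarity."""
    query = frozenset(query_sv)
    found = {}
    for sem_id in sem_ids(query):          # at most n DHT lookups
        for file_id, sv in dht.get(sem_id, []):
            found[file_id] = len(query & sv) / len(query | sv)
    return sorted(found.items(), key=lambda kv: kv[1], reverse=True)


dht = {}
sem_index(dht, "f1", {"peer", "file", "hash", "table"})
sem_index(dht, "f3", {"flour", "sugar", "eggs"})
results = sem_locate(dht, {"peer", "file", "hash", "table"})
```

A query touches only the peers responsible for its n semIDs, which is how the scheme keeps the number of consulted nodes small; files whose SVs share no semID with the query are simply never seen, so results are approximate.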

  19. Demonstration of Semantic Indexing and Locating • [Figure: files A, B and C are indexed onto peer nodes; two users (User1, User2); query: “locate files similar to D”] • A, B, C and D are semantically close files

  20. Evaluation • Load distribution of semantic indexing • Semantic indices per peer node • Performance of semantic locating • Percentage of semantically close files that can be located (Recall)

  21. Semantic Indexing • [Figure: load distribution when the system indexes 10,000 files. Axes: number of file indexes per node, number of peer nodes]

  22. Semantic Indexing • [Figure: load distribution in a 1000-node system. Axes: number of file indexes per node, number of indexed files (x1000)]

  23. Perf. of Semantic Locating • [Figure: recall for different values of n and m] • [1] Apply n groups of m hash functions • [2] Percentage of files located (128-byte fingerprint limit as a SV) • [3] m and n determine the performance of semantic locating

  24. Related Work • P2P file systems like CFS and PAST • Exact-match lookups in DHTs • Traditional semantic file systems like SFS and HAC • IR algorithms such as VSM and LSI • LSH and its related applications (e.g., the nearest neighbor problem, cached data location in databases)

  25. Conclusions • The first step toward supporting semantics-based access in P2P file systems • LSH-based semantic indexing and locating approach • Imposes small storage overhead (several MBs per node) • Efficiency: answers a query by consulting a small number of peers (e.g., 20) • Results are approximate, but acceptable

  26. Future Work • Query consistency and refinement • Evaluation using IR workloads (e.g., TREC data sets)
