
Connections: Using Context to Enhance File Search, from SOSP '05

Russell Greenspan, CS 523, April 2006



  1. Connections: Using Context to Enhance File Search, from SOSP '05 • Russell Greenspan • CS 523 • April 2006

  2. File System 1.0 • Folder-based • Too many files to organize into folders in a meaningful way • Folder placement does not reflect how files are interrelated, e.g. a folder containing the papers read in this course • Attribute-based • But users are unwilling to manually assign attributes to files

  3. Content-based File System • Indexed • Keep full inverted indices where every occurrence of every word is indexed with a pointer to the document that contains it and the location of the word • Return results of all occurrences of the search term(s) • Index size is comparable to document collection size
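A minimal sketch of the full inverted index described on this slide, assuming whitespace tokenization and in-memory documents. The function name `build_positional_index` and the sample documents are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to a list of (doc_id, word_position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        # Every occurrence is recorded, which is why such an index can
        # grow to roughly the size of the document collection itself.
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return index

# Hypothetical sample documents.
docs = {
    "notes.txt": "context improves file search",
    "paper.txt": "file search using context",
}
index = build_positional_index(docs)
print(index["context"])
```

Because each entry keeps the word position, a query can return every occurrence of a term rather than just the matching documents.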

  4. Content-based File System • Ranked results • Use inverted indices (as opposed to full inverted indices) that store the documents in which each term appears, not its position within the document • To judge relevance, use probabilistic methods such as term frequency within a document and inverse document frequency over the collection
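The ranking idea on this slide can be sketched with a plain tf-idf score. This is a textbook formulation under simple assumptions (whitespace tokenization, natural-log idf), not the scoring used by Indri or any specific tool; `tf_idf_rank` is a hypothetical name.

```python
import math
from collections import Counter

def tf_idf_rank(query_terms, docs):
    """Return doc ids ordered by summed tf * idf over the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter()
    for words in tokenized.values():
        doc_freq.update(set(words))
    scores = {}
    for doc_id, words in tokenized.items():
        if not words:
            continue
        counts = Counter(words)
        score = 0.0
        for term in query_terms:
            if doc_freq[term] == 0:
                continue
            tf = counts[term] / len(words)              # frequency within this document
            idf = math.log(n_docs / doc_freq[term])     # rarity across the collection
            score += tf * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)
```

Note that only per-document term counts are needed here, which is why this kind of index can drop word positions.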

  5. Content-based File System: Limitations • How to index binary data? • How to manipulate contextual details? • Google Desktop Search: no substring search; only the first 10,000 words of each document and the first 100,000 documents are indexed, likely because of index size limits

  6. Glimpse: Two-level Content Indexing • To deal with bloated indices, Glimpse offers a hybrid of full inverted indices and sequential search • Subdivide file space into manageable blocks, then index occurrence of terms within each block • Occurrences of the same term in the same block are stored only once, greatly reducing index size • On query, use index to find blocks with input search terms, then use sequential search tool like grep within the blocks
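A rough sketch of the two-level scheme described on this slide, assuming the file space has already been divided into blocks and that an in-memory scan stands in for the sequential search tool. The names `build_block_index` and `search` are hypothetical and do not reflect Glimpse's actual interface.

```python
from collections import defaultdict

def build_block_index(blocks):
    """Map each word to the set of block ids that contain it.

    blocks: {block_id: {file_name: file_text}}
    A term is stored once per block, however often it occurs there.
    """
    index = defaultdict(set)
    for block_id, files in blocks.items():
        for text in files.values():
            for word in text.lower().split():
                index[word].add(block_id)
    return index

def search(term, blocks, index):
    """Use the index to pick candidate blocks, then scan only those blocks."""
    hits = []
    for block_id in sorted(index.get(term, ())):
        # Sequential scan of the block, standing in for a grep-like tool.
        for name, text in blocks[block_id].items():
            if term in text.lower().split():
                hits.append(name)
    return hits
```

The index stays small because it records blocks rather than positions; the cost moves to query time, where only the candidate blocks are scanned.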

  7. Connections: Context-based File System • Web-based context • “Authority” nodes (pages that are linked to often) and “hub” nodes (pages that link to many other pages) • Web pages linked within a specified vicinity of other pages; a virtual neighborhood • How can context be applied in file systems?

  8. Connections Architecture • Find “Temporal Locality” • Tracer • Sits at system call layer in kernel, monitoring file system and process management calls • Relation Graph • Stores graph of relationships between files • Nodes in graph are files, edges between nodes indicate a contextual relationship between files, with weight of edges indicating strength of relationship

  9. Identifying Relationships • Relation window • Files accessed within a given window of time; too short a time might miss relationships, while too long a window connects unrelated files • Increment edge weight by 1 for duplicated operations • Do not re-increment weight if same input is in Relation window

  10. Identifying Relationships • Operation Type • open – temporal relationship between files (accessed at nearby points in time) • read/write – causal relationship, since data read from file A can affect data later written to file B • all-ops – input is the source file of an mmap, stat, dup, link, or rename operation; output is the destination of a dup, link, or rename
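One way to read the relation window together with the read/write filter is sketched below. It assumes a time-ordered trace of `(timestamp, op, path)` tuples and a fixed window length, and it treats "same input in the window" as one increment per write. These details, and the name `build_relation_graph`, are assumptions for illustration rather than the system's actual implementation.

```python
from collections import defaultdict

def build_relation_graph(events, window):
    """Build weighted edges (input file -> output file) from a trace.

    events: time-ordered (timestamp, op, path) tuples, op in {'read', 'write'}
    window: relation window length, in the same units as the timestamps
    Returns {(src, dst): weight}.
    """
    weights = defaultdict(int)
    recent_reads = {}  # path -> timestamp of its most recent read
    for timestamp, op, path in events:
        if op == "read":
            # Keyed by path, so repeated reads of one file count once per write.
            recent_reads[path] = timestamp
        elif op == "write":
            for src, read_time in recent_reads.items():
                if src != path and timestamp - read_time <= window:
                    weights[(src, path)] += 1
    return dict(weights)
```

For instance, a trace that reads A and then writes C twice inside the window yields a single edge A -> C with weight 2, while a write that happens after the window has elapsed adds no edge.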

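  11. Identifying Relationships – Relation Graph Example • open(A), open(B): graph contains nodes A, B • read(A), write(C): graph contains nodes A, B, C • dup(C, D), read(A), write(D): graph contains nodes A, B, C, D • [Figure: relation graph after each step]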

  12. Context-based Search • Take results of content-based search (e.g. Indri) • For each file in the results, perform a breadth-first search starting at the file’s node; store all nodes touched in a separate subgraph • Limit path length so that only relevant files are reached • Apply an edge-weight cutoff so that frequently accessed files are reached only from the files most strongly related to them
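The bounded breadth-first search can be sketched as follows. The cutoff rule used here (skip an edge whose weight is below a given fraction of its source node's total outgoing weight) is an assumed interpretation of the edge-weight limit, and `context_subgraph` is a hypothetical name.

```python
from collections import deque

def context_subgraph(graph, start, max_path_length, min_weight_fraction):
    """Collect nodes reachable from start under a path limit and weight cutoff.

    graph: {node: {neighbor: edge_weight}}
    Returns the set of nodes reached, including start.
    """
    reached = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_path_length:
            continue  # path-length limit: do not expand further
        edges = graph.get(node, {})
        total = sum(edges.values())
        for neighbor, weight in edges.items():
            # Assumed cutoff: ignore edges that carry too small a share
            # of this node's outgoing weight.
            if total == 0 or weight / total < min_weight_fraction:
                continue
            if neighbor not in reached:
                reached.add(neighbor)
                queue.append((neighbor, depth + 1))
    return reached
```

Running this once per content-matched file and merging the reached sets gives the candidate files for context-based ranking.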

  13. Ranking Results • If a file is rarely used in association with content-matched files, we want it to receive a lower ranking • Take node’s content-matched ranking and augment with contextual relationships from Relation graph

  14. Basic-BFS • Content-based rankings: A=4, B=1, C=0, D=2 • Consider node D • Update D’s rankval with the rankvals of the nodes on its incoming edges, each scaled by that edge’s share of D’s total incoming edge weight • For example, A->D contributes (2/10) * 4 (A’s rankval) = 0.8 • Repeat to get the total weight pushed to each node from all contextual relationships • [Figure: relation graph over nodes A, B, C, D with edge weights 5, 1, 2, 8]
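The push step for a single node can be sketched as below. The content-based rankvals and the A->D share of 2 out of 10 come from the slide; the remaining incoming edge (C->D with weight 8) is a hypothetical filler so that D's incoming weights total 10, and `pushed_rank` is an illustrative name.

```python
def pushed_rank(incoming, rankvals):
    """Total rank pushed to one node from the sources of its incoming edges.

    incoming: {source: edge_weight} for the node being updated
    rankvals: {node: content-based rankval}
    Each source contributes its rankval scaled by its edge's share of
    the node's total incoming weight.
    """
    total = sum(incoming.values())
    if total == 0:
        return 0.0
    return sum((weight / total) * rankvals[src] for src, weight in incoming.items())

rankvals = {"A": 4, "B": 1, "C": 0, "D": 2}   # from the slide
incoming_d = {"A": 2, "C": 8}                 # C->D = 8 is an assumption
print(pushed_rank(incoming_d, rankvals))      # A's share is (2/10) * 4
```

How the pushed value is combined with D's own content-based rankval is not spelled out on the slide, so the sketch returns only the pushed amount.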

  15. Ranking Results • HITS algorithm • Improve ranking of “authorities”, nodes linked to many times, and “hubs”, nodes with many links • PageRank • Rank by the probability of reaching a particular node on a random walk of the graph (Google’s ranking algorithm)
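For reference, the random-walk idea behind PageRank can be sketched with a short power iteration. This is the generic textbook form on an unweighted graph, with an assumed damping factor of 0.85 and a fixed iteration count; it is not the variant applied to the relation graph in the presented system.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Approximate PageRank scores by power iteration.

    graph: {node: [nodes it links to]}
    Returns {node: score}, with scores summing to 1.
    """
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Random-jump component shared by every node.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / len(nodes)
        rank = new_rank
    return rank
```

A node that many others point to accumulates a high score, while a node with no incoming links keeps only the random-jump share.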

  16. Evaluation • Compared to content-only ranking (via Indri) • Recall (reducing false negatives) increased from 13% to 22% for top-10 and 34% to 74% overall • Precision (reducing false positives) increased from 23% to 29% for top-10 and 15% to 16% overall • Best precision from: • Read/write filter • Path length of 3
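As a reminder of how the two metrics are computed, here is a small sketch using the standard definitions: precision is the fraction of returned files that are relevant, and recall is the fraction of relevant files that were returned. The function name and the sample file names are hypothetical.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query.

    returned: files the search returned
    relevant: files the user actually considers relevant
    """
    returned, relevant = set(returned), set(relevant)
    true_positives = len(returned & relevant)
    # Precision drops with false positives (irrelevant files returned).
    precision = true_positives / len(returned) if returned else 0.0
    # Recall drops with false negatives (relevant files missed).
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 files returned, 3 files truly relevant, 2 in common.
print(precision_recall(["a", "b", "c", "d"], ["a", "b", "e"]))
```

Context-based search adds files that content search missed, which is why it tends to move recall more than precision.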

  17. Performance • Background service requires 23 seconds per day on average to merge trace results into Relation Graphs • On average, index size is less than 1% of data set size • Queries take 2.62 seconds on average (0.98s for content search and 1.64s for context search)

  18. Discussion • Other applicable context information? Applications, user personalization • Deleted files: should they be left in? • Network file access? • Implement closer to the kernel? Could better handle renamed files, organize a virtual directory structure, and assign attributes

  19. References • C. Soules and G. Ganger. Connections: Using Context to Enhance File Search. Symposium on Operating Systems Principles, October 2005. • C. Soules and G. Ganger. Why Can't I Find My Files? New Methods for Automating Attribute Assignment. 9th Workshop on Hot Topics in Operating Systems (HotOS IX), May 2003. • D. Metzler, T. Strohman, H. Turtle, and W. B. Croft. Indri at TREC 2004: Terabyte Track. Text Retrieval Conference, 2004. • U. Manber and S. Wu. GLIMPSE: A Tool to Search Through Entire File Systems. Winter USENIX Technical Conference, pages 23–32. USENIX Association, 1994.
