Special Topics in Computer Science
Advanced Topics in Information Retrieval
Lecture 4 (book chapter 8): Indexing and Searching

Alexander Gelbukh

Previous Chapter: Conclusions
  • Main measures: Precision & Recall.
    • For sets
    • Rankings are evaluated through initial subsets
  • There are measures that combine them into one
    • Involve user-defined preferences
  • Many (other) characteristics
    • An algorithm can be good at some and bad at others
    • Averages are used, but are not always meaningful
  • Reference collections with known answers exist for evaluating new algorithms
Previous Chapter: Research topics
  • Different types of interfaces
  • Interactive systems:
    • What measures to use?
    • Such as informativeness
Types of searching
  • Indexed
    • Semi-static
    • Space overhead
  • Sequential
    • Small texts
    • Volatile, or space limited
  • Combined
    • Index into large portions, then sequential inside portion
    • Best combination of speed / overhead
Inverted files
  • Vocabulary: ~sqrt(n) (Heaps' law). 1 GB → 5M
  • Occurrences: n * 40% (stopwords)
    • positions (word, char), files, sections...
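The two parts of an inverted file (vocabulary and occurrences) can be illustrated with a minimal sketch. It assumes word positions as the addressing unit and a simple regular-expression tokenizer; the function name `build_inverted_index` is hypothetical, not from the lecture.

```python
import re
from collections import defaultdict

def build_inverted_index(text):
    """Map each vocabulary word to the sorted list of its word positions."""
    index = defaultdict(list)
    # Assumption: a "word" is a maximal run of \w characters, case-folded.
    for position, match in enumerate(re.finditer(r"\w+", text.lower())):
        index[match.group()].append(position)
    return dict(index)

index = build_inverted_index("to be or not to be")
vocabulary = sorted(index)      # the vocabulary part
occurrences = index["to"]       # the occurrence list of one word
```

The keys form the vocabulary; the lists are the occurrences. Storing character offsets, file numbers, or section identifiers instead of word positions only changes what is appended to each list.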
Compression: Block addressing
  • Block addressing: 5% overhead
    • 256, 64K, ... blocks (1, 2, ... bytes per block pointer)
    • Equal size (faster search) or logical sections (retrieval units)
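Block addressing can be sketched as follows, assuming equal-size blocks measured in words (the slides do not fix the unit). The index stores only block numbers, so exact positions are recovered by a sequential scan inside each candidate block. All names here are hypothetical.

```python
import re

def tokenize(text):
    """Assumption: words are maximal runs of \\w characters, case-folded."""
    return re.findall(r"\w+", text.lower())

def build_block_index(words, block_size):
    """Map each word to the sorted list of block numbers that contain it."""
    index = {}
    for position, word in enumerate(words):
        block = position // block_size
        blocks = index.setdefault(word, [])
        if not blocks or blocks[-1] != block:   # store each block only once
            blocks.append(block)
    return index

def find_positions(words, index, word, block_size):
    """Use the index to pick blocks, then scan each block sequentially."""
    positions = []
    for block in index.get(word, []):
        start = block * block_size
        for offset, candidate in enumerate(words[start:start + block_size]):
            if candidate == word:
                positions.append(start + offset)
    return positions

words = tokenize("a rose is a rose is a rose")
block_index = build_block_index(words, block_size=3)
```

Larger blocks mean fewer, shorter pointers (less space) but more sequential scanning per query.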
Searching in inverted files
  • Vocabulary search
    • Separate file
    • Many searching techniques
    • Lexicographic: log V (voc. size) = ½ log n (Heaps)
    • Hashing is not good for prefix search
  • Retrieval of occurrences
  • Manipulation with occurrences: ~sqrt (n) (Heaps, Zipf)
    • Boolean operations. Context search
      • Merging occurrences
      • For AND: one list is usually shorter (Zipf's law) → sublinear!
  • Only inverted files allow both sublinear space and sublinear time
    • Suffix trees and signature files don’t
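The lexicographic vocabulary search can be illustrated with a short sketch, assuming the vocabulary is kept as a sorted in-memory list. Binary search gives the O(log V) lookup, and the same ordering supports prefix queries, which a hash table cannot answer without scanning every key. The function names are hypothetical.

```python
from bisect import bisect_left

def lookup(vocabulary, word):
    """Binary search: index of word in the sorted vocabulary, or -1."""
    i = bisect_left(vocabulary, word)
    return i if i < len(vocabulary) and vocabulary[i] == word else -1

def prefix_range(vocabulary, prefix):
    """All words starting with prefix; they are contiguous in sorted order."""
    lo = bisect_left(vocabulary, prefix)
    hi = lo
    while hi < len(vocabulary) and vocabulary[hi].startswith(prefix):
        hi += 1
    return vocabulary[lo:hi]

vocabulary = sorted(["index", "indexing", "inverted",
                     "search", "suffix", "signature"])
```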
Building inverted file: 1
  • Infinite memory? Use trie to store vocabulary. O(n)
    • append positions
  • Finite memory? Build in chunks, merge. Almost O(n)
  • Insertion: index + merge. Deleting: O(n). Very fast.
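The finite-memory case can be sketched like this, assuming each chunk fits in memory and that partial indices are merged one at a time into the running index (the book's hierarchical pairwise merging on disk is not reproduced here). A plain dictionary stands in for the trie, and the names are hypothetical.

```python
import re
from collections import defaultdict

def index_chunk(words, offset):
    """Index one in-memory chunk; offset is the position of its first word."""
    partial = defaultdict(list)
    for i, word in enumerate(words):
        partial[word].append(offset + i)
    return partial

def merge_into(index, partial):
    """Merge a partial index into the running index.

    The chunk covers later positions than everything indexed so far,
    so appending keeps every occurrence list sorted.
    """
    for word, positions in partial.items():
        index.setdefault(word, []).extend(positions)

def build_in_chunks(text, chunk_size):
    words = re.findall(r"\w+", text.lower())
    index = {}
    for start in range(0, len(words), chunk_size):
        merge_into(index, index_chunk(words[start:start + chunk_size], start))
    return index
```

Inserting new text works the same way: index the new portion, then merge it into the existing index.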
Suffix trees
  • Text as one long string. No words.
    • Genetic databases
    • Complex queries
    • Compacted trie structure
    • Problem: space
  • For text retrieval, inverted files are better
Suffix array
  • All suffixes (by position) in lexicographic order
  • Allows binary search
  • Much less space: 40% n
  • Supra-index: sampling, for better disk access
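A small sketch of a suffix array and its binary search, assuming the whole text fits in memory. The construction here is a naive sort of all suffixes, chosen for brevity; it is not the large-text construction algorithm the lecture refers to. The function names are hypothetical.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, in lexicographic order of the suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """All start positions of pattern, found by two binary searches."""
    m = len(pattern)

    def prefix(k):
        start = suffix_array[k]
        return text[start:start + m]

    # First suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # First suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])

text = "banana"
suffix_array = build_suffix_array(text)
```

Because every occurrence of a pattern is a prefix of some suffix, all matches occupy one contiguous range of the array.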
Suffix tree and suffix array: Searching. Construction
  • Patterns, prefixes, phrases. Not only words
  • Suffix tree: O(m), but: space (m = query size)
  • Suffix array: O(log n) (n = database size)
  • Construction of arrays: sorting
    • Large text: n^2 log(M)/M (M = memory size), more than for inverted files
    • Skip details
  • Addition: n n' log(M)/M (n' is the size of the new portion)
  • Deletion: n
Signature files
  • Usually worse than inverted files
  • Words are mapped to bit patterns
  • Blocks are mapped to ORs of their word patterns
  • If a block contains a word, all bits of its pattern are set
  • Sequential search for blocks
  • False drops!
    • Design of the hash function
    • Have to traverse the block
  • Good to search ANDs or proximity queries
    • bit patterns are ORed
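The signature mechanism can be sketched as follows. The signature width, the number of bits per word, and the use of SHA-256 as the hash are all assumptions made for illustration; the lecture does not prescribe them.

```python
import hashlib

SIGNATURE_BITS = 64   # assumed signature width
BITS_PER_WORD = 3     # assumed number of bits set per word

def word_signature(word):
    """Map a word to a bit pattern with up to BITS_PER_WORD bits set."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    signature = 0
    for i in range(BITS_PER_WORD):
        bit = int.from_bytes(digest[2 * i:2 * i + 2], "big") % SIGNATURE_BITS
        signature |= 1 << bit
    return signature

def block_signature(words):
    """A block's signature is the OR of its word signatures."""
    signature = 0
    for word in words:
        signature |= word_signature(word)
    return signature

def search(blocks, signatures, query_words):
    """Blocks containing all query words (an AND query)."""
    query = block_signature(query_words)   # query patterns are ORed too
    hits = []
    for number, signature in enumerate(signatures):
        if signature & query == query:     # candidate block
            # The signature test can give false drops, so verify
            # by traversing the block itself.
            if all(word in blocks[number] for word in query_words):
                hits.append(number)
    return hits

blocks = [["inverted", "files"], ["suffix", "trees"], ["signature", "files"]]
signatures = [block_signature(words) for words in blocks]
```

A block whose signature lacks any query bit cannot contain the word, so only the candidates need the expensive traversal.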
Boolean operations
  • Merging file (occurrences) lists
    • AND: to find repetitions
  • According to query syntax tree
  • Complexity linear in intermediate results
    • Can be slow if they are huge
  • There are optimization techniques
    • E.g.: merge small list with a big one by searching
    • This is a usual case (Zipf)
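The AND operation and the optimization mentioned above can be sketched with two hypothetical functions, assuming occurrence lists are sorted lists of integers. The first is the plain linear merge; the second looks up each element of the short list in the long one by binary search.

```python
from bisect import bisect_left

def intersect_merge(a, b):
    """Linear merge of two sorted lists: O(len(a) + len(b))."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

def intersect_by_search(short_list, long_list):
    """Binary-search each short-list entry: O(len(short) * log len(long))."""
    result = []
    lo = 0
    for item in short_list:
        lo = bisect_left(long_list, item, lo)   # never look behind lo again
        if lo == len(long_list):
            break
        if long_list[lo] == item:
            result.append(item)
    return result

rare = [2, 5, 9]
frequent = [1, 2, 3, 4, 5, 6, 7, 8, 10]
```

When one list is much shorter than the other, the second variant touches only a small part of the long list.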
Sequential search
  • Necessary part of many algorithms (e.g., block addr)
  • Brute force: O(nm) worst-case, O(n) on average
  • MANY faster algorithms, but more complicated
    • See the book
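For reference, the brute-force algorithm can be written in a few lines. This is a minimal sketch with a hypothetical function name; n is the text length and m the pattern length.

```python
def brute_force_search(text, pattern):
    """Try every start position; compare character by character."""
    n, m = len(text), len(pattern)
    matches = []
    for start in range(n - m + 1):
        k = 0
        # On typical text this loop stops after a few characters,
        # which gives the O(n) average; the worst case is O(nm).
        while k < m and text[start + k] == pattern[k]:
            k += 1
        if k == m:
            matches.append(start)
    return matches
```

A text such as "aaaa...a" with the pattern "aa...ab" forces nearly m comparisons at every start position, which is the O(nm) worst case.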
Approximate string matching
  • Match with k errors, select the one with min k
  • Levenshtein distance between strings s1 and s2
    • The minimum number of editing operations to make one from another
    • Symmetric for standard sets of operations
    • Operations: deletion, addition, change
    • Sometimes weighted
  • Solution: dynamic programming. O(mn), O(kn)
    • m, n are lengths of the two strings
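The dynamic-programming solution can be sketched as below, assuming unit costs for deletion, addition, and change. The first function computes the Levenshtein distance between two strings; the second adapts the same recurrence to find where a pattern occurs in a text with at most k errors. Both are the plain O(mn) version; the O(kn) refinement is not shown. The function names are hypothetical.

```python
def levenshtein(s1, s2):
    """Edit distance with unit-cost deletion, addition, and change."""
    previous = list(range(len(s2) + 1))      # distance from "" to s2[:j]
    for i, c1 in enumerate(s1, start=1):
        current = [i]                        # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            current.append(min(previous[j] + 1,               # delete c1
                               current[j - 1] + 1,            # add c2
                               previous[j - 1] + (c1 != c2))) # change or match
        previous = current
    return previous[-1]

def search_with_errors(text, pattern, k):
    """End positions in text where pattern matches with at most k errors."""
    m = len(pattern)
    column = list(range(m + 1))
    ends = []
    for pos, ch in enumerate(text):
        new = [0]                            # a match may start anywhere
        for i in range(1, m + 1):
            new.append(min(column[i] + 1,
                           new[i - 1] + 1,
                           column[i - 1] + (pattern[i - 1] != ch)))
        column = new
        if column[m] <= k:
            ends.append(pos)
    return ends
```

The only difference between the two is the top row: it is 0, 1, 2, ... for whole-string distance and all zeros for searching, since leading text before a match costs nothing.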
Regular expressions
  • Regular expressions
    • Automaton: O(m 2^m) + O(n), bad for long patterns
    • There are better methods, see book
  • Using indices to search for words with errors
    • Inverted files: search in vocabulary
    • Suffix trees and suffix arrays: the same algorithms as for search without errors! Just allow deviations from the path
Search over compression
  • Improves both space AND time (fewer disk operations)
  • Compress query and search
    • Huffman compression, words as symbols, bytes
      • (frequencies: most frequent shorter)
    • Search each word in the vocabulary → its code
    • More sophisticated algorithms
  • Compressed inverted files: less disk → less time
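The "compress the query and search" idea can be sketched with a word-based Huffman code. This sketch makes simplifying assumptions: codes are strings of bits rather than the byte-oriented codes the slide mentions, and the search walks the compressed stream codeword by codeword instead of using a fast string matcher. All names are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(words):
    """Word-based Huffman code: frequent words get shorter bit strings."""
    freq = Counter(words)
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    tiebreak = count()   # unique counter so the dicts are never compared
    heap = [(f, next(tiebreak), {w: ""}) for w, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)
        f2, _, codes2 = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in codes1.items()}
        merged.update({w: "1" + c for w, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

def compress(words, codes):
    return "".join(codes[w] for w in words)

def search_compressed(bits, codes, word):
    """Word positions of word, comparing codewords without decoding text."""
    if word not in codes:
        return []                       # not in vocabulary: no occurrences
    target = codes[word]                # the compressed query
    codewords = set(codes.values())
    positions, start, index = [], 0, 0
    while start < len(bits):
        end = start + 1
        # The code is prefix-free, so the first codeword found is the right one.
        while end <= len(bits) and bits[start:end] not in codewords:
            end += 1
        if end > len(bits):
            raise ValueError("stream does not end on a codeword boundary")
        if bits[start:end] == target:
            positions.append(index)
        start, index = end, index + 1
    return positions

words = "to be or not to be".split()
codes = huffman_codes(words)
bits = compress(words, codes)
```

The query word is looked up in the vocabulary once to obtain its code, and after that only compressed data is read.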

Text and index compression can be combined

  • Suffix trees can be compressed almost to the size of suffix arrays
  • Suffix arrays can't be compressed (almost random), but can be constructed over compressed text
    • instead of Huffman, use a code that respects alphabetic order
    • almost the same compression
  • Signature files are sparse, so can be compressed
    • ratios up to 70%
Research topics
  • Perhaps, new details in integration of compression and search
  • “Linguistic” indexing: allowing linguistic variations
    • Search in plural or only singular
    • Search with or without synonyms
  • Inverted files seem to be the best option
  • Other structures are good for specific cases
    • Genetic databases
  • Sequential searching is an integral part of many indexing-based search techniques
    • Many methods to improve sequential searching
  • Compression can be integrated with search

Thank you!

Till April 26, 6 pm