### Special Topics in Computer Science: Advanced Topics in Information Retrieval. Lecture 4 (book chapter 8): Indexing and Searching

Alexander Gelbukh

www.Gelbukh.com

Previous Chapter: Conclusions

- Main measures: Precision & Recall.
- For sets
- Rankings are evaluated through initial subsets

- There are measures that combine them into one
- Involve user-defined preferences

- Many (other) characteristics
- An algorithm can be good at some and bad at others
- Averages are used, but they are not always meaningful

- Reference collections with known answers exist for evaluating new algorithms

Previous Chapter: Research topics

- Different types of interfaces
- Interactive systems:
- What measures to use?
- Such as informativeness

Types of searching

- Indexed
- Semi-static
- Space overhead

- Sequential
- Small texts
- Volatile, or space limited

- Combined
- Index into large portions, then sequential inside portion
- Best combination of speed / overhead

Inverted files

- Vocabulary: ~sqrt(n) by Heaps’ law; e.g., about 5 MB for 1 GB of text
- Occurrences: about 40% of n (stopwords omitted)
- positions (word, char), files, sections...
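
As a minimal sketch of the structure described above, the following builds a vocabulary that maps each word to its list of occurrences (word positions here). The function name `build_inverted_index`, the tokenization by `\w+`, and the lowercasing are illustrative assumptions, not part of the lecture.

```python
import re
from collections import defaultdict


def build_inverted_index(text, stopwords=frozenset()):
    """Map each non-stopword to the list of word positions where it occurs."""
    index = defaultdict(list)
    # Assumption: a "word" is a maximal run of \w characters, case-folded.
    for position, match in enumerate(re.finditer(r"\w+", text.lower())):
        word = match.group()
        if word not in stopwords:
            index[word].append(position)
    return dict(index)


index = build_inverted_index("to be or not to be", stopwords={"or"})
```

Positions are appended in increasing order, so each occurrence list is already sorted, which the merging operations on later slides rely on.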

Compression: Block addressing

- Block addressing: 5% overhead
- 256, 64K, ... blocks (pointers of 1, 2, ... bytes)
- Equal size (faster search) or logical sections (retrieval units)
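
A rough sketch of block addressing, assuming equal-size blocks measured in words: the index stores only block numbers, and exact positions are recovered by a sequential scan inside the candidate blocks. The names `build_block_index` and `find_positions` are hypothetical.

```python
def build_block_index(words, words_per_block):
    """Map each word to the sorted list of block numbers that contain it."""
    index = {}
    for i, word in enumerate(words):
        block = i // words_per_block
        blocks = index.setdefault(word, [])
        # Store each block number once, however often the word repeats in it.
        if not blocks or blocks[-1] != block:
            blocks.append(block)
    return index


def find_positions(words, index, word, words_per_block):
    """Scan sequentially inside each candidate block to get exact positions."""
    positions = []
    for block in index.get(word, []):
        start = block * words_per_block
        end = min(start + words_per_block, len(words))
        positions.extend(i for i in range(start, end) if words[i] == word)
    return positions


words = "a b a c d a e f".split()
block_index = build_block_index(words, words_per_block=3)
```

The index shrinks because repeated occurrences within one block collapse into a single entry, at the price of the scan at query time.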

Searching in inverted files

- Vocabulary search
- Separate file
- Many searching techniques
- Lexicographic: log V (voc. size) = ½ log n (Heaps)
- Hashing is not good for prefix search

- Retrieval of occurrences
- Manipulation with occurrences: ~sqrt (n) (Heaps, Zipf)
- Boolean operations. Context search
- Merging occurrences
- For AND: one list is usually shorter (Zipf’s law), so the merge is sublinear!

- Boolean operations. Context search
- Only inverted files allow both sublinear space and sublinear time
- Suffix trees and signature files don’t
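
The point about lexicographic vocabulary search and prefix queries can be illustrated as follows: with the vocabulary kept sorted, all words sharing a prefix form one contiguous range that two binary searches delimit. This is only a sketch; `prefix_range` is a hypothetical name, and it assumes no vocabulary word contains the sentinel character `chr(0x10FFFF)`.

```python
from bisect import bisect_left


def prefix_range(vocabulary, prefix):
    """Return all words of a sorted vocabulary that start with prefix."""
    lo = bisect_left(vocabulary, prefix)
    # Assumption: chr(0x10FFFF) sorts after every character used in the words.
    hi = bisect_left(vocabulary, prefix + chr(0x10FFFF))
    return vocabulary[lo:hi]


vocabulary = sorted(["index", "inverted", "information", "suffix", "signature"])
```

A hash table would find an exact word quickly but has no such contiguous range, which is why it does not help with prefix search.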

Building inverted file: 1

- Infinite memory? Use trie to store vocabulary. O(n)
- append positions

- Finite memory? Build in chunks, merge. Almost O(n)
- Insertion: index + merge. Deleting: O(n). Very fast.
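
The finite-memory case might be sketched like this: index each chunk separately, keep the partial indexes sorted, and merge them. The sketch holds everything in memory and uses a single k-way merge (`heapq.merge`) for brevity; a real build would write the partial indexes to disk and could merge them in several passes. The function names are hypothetical.

```python
from heapq import merge
from itertools import groupby


def index_chunk(words, offset):
    """Sorted (word, position) pairs for one chunk that fits in memory."""
    return sorted((word, offset + i) for i, word in enumerate(words))


def build_by_merging(words, chunk_size):
    """Index the text chunk by chunk, then merge the sorted partial indexes."""
    chunks = [
        index_chunk(words[i:i + chunk_size], i)
        for i in range(0, len(words), chunk_size)
    ]
    merged = merge(*chunks)  # k-way merge of already sorted runs
    return {
        word: [position for _, position in group]
        for word, group in groupby(merged, key=lambda pair: pair[0])
    }


merged_index = build_by_merging("to be or not to be".split(), chunk_size=2)
```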

Suffix trees

- Text as one long string. No words.
- Genetic databases
- Complex queries
- Compacted trie structure
- Problem: space

- For text retrieval, inverted files are better

Suffix array

- All suffixes (by position) in lexicographic order
- Allows binary search
- Much less space: 40% n
- Supra-index: sampling, for better disk access
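
A small sketch of a suffix array and its binary search, assuming the whole text fits in memory. The construction below simply sorts the suffixes, which is fine for illustration but far too slow for large texts; `build_suffix_array` and `find_occurrences` are hypothetical names.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, in lexicographic order of the suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Binary-search the range of suffixes that begin with pattern."""
    m = len(pattern)

    def prefix(k):
        start = suffix_array[k]
        return text[start:start + m]

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


text = "abracadabra"
suffix_array = build_suffix_array(text)
```

Because every text position is indexed, the pattern may be any substring (a prefix, a phrase), not only a whole word.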

Suffix tree and suffix array: Searching. Construction

Searching

- Patterns, prefixes, phrases. Not only words
- Suffix tree: O(m), but: space (m = query size)
- Suffix array: O(log n) (n = database size)
- Construction of arrays: sorting
- Large text: n² log(M)/M, more than for inverted files
- Skip details

- Addition: n n' log (M)/M. (n' is the size of new portion)
- Deletion: n

Signature files

- Usually worse than inverted files
- Words are mapped to bit patterns
- Blocks are mapped to ORs of their word patterns
- If a block contains a word, all bits of its pattern are set
- Sequential search for blocks
- False drops!
- Design of the hash function
- Have to traverse the block

- Good to search ANDs or proximity queries
- bit patterns are ORed

- False drop: letters in 2nd block
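
The mechanism can be sketched as follows, under illustrative assumptions: 64-bit signatures, 3 bits set per word, and bit positions taken from a SHA-256 digest. None of these parameters come from the lecture, and all names are hypothetical.

```python
import hashlib

SIGNATURE_BITS = 64  # assumed signature width
BITS_PER_WORD = 3    # assumed number of bits set per word


def word_signature(word):
    """Hash a word to a bit pattern with up to BITS_PER_WORD bits set."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    signature = 0
    for i in range(BITS_PER_WORD):
        signature |= 1 << (digest[i] % SIGNATURE_BITS)
    return signature


def block_signature(words):
    """OR together the patterns of all words in the block."""
    signature = 0
    for word in words:
        signature |= word_signature(word)
    return signature


def search(blocks, word):
    """Scan the signatures sequentially, then verify to discard false drops."""
    pattern = word_signature(word)
    candidates = [
        i for i, block in enumerate(blocks)
        if (block_signature(block) & pattern) == pattern
    ]
    # A candidate may be a false drop, so the block itself must be traversed.
    return [i for i in candidates if word in blocks[i]]


blocks = [["text", "retrieval"], ["signature", "files"], ["inverted", "files"]]
```

In practice the block signatures would be computed once and stored; they are recomputed inside `search` only to keep the sketch short.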

Boolean operations

- Merging file (occurrences) lists
- AND: to find repetitions

- According to query syntax tree
- Complexity linear in intermediate results
- Can be slow if they are huge

- There are optimization techniques
- E.g.: merge small list with a big one by searching
- This is a usual case (Zipf)
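
The optimization mentioned above, merging a small list with a big one by searching, might look like this for AND over sorted occurrence lists. It is a sketch with hypothetical names; each element of the short list is located in the long list by binary search, resuming from the previous hit.

```python
from bisect import bisect_left


def intersect(short_list, long_list):
    """AND of two sorted lists: search each short-list item in the long list."""
    result = []
    lo = 0
    for item in short_list:
        # Both lists are sorted, so the search can resume where it left off.
        lo = bisect_left(long_list, item, lo)
        if lo == len(long_list):
            break
        if long_list[lo] == item:
            result.append(item)
    return result


def and_query(list_a, list_b):
    """Always search the shorter list in the longer one."""
    if len(list_a) > len(list_b):
        list_a, list_b = list_b, list_a
    return intersect(list_a, list_b)
```

The cost is roughly the short length times the logarithm of the long length, instead of the sum of both lengths for a plain linear merge.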

Sequential search

- Necessary part of many algorithms (e.g., block addr)
- Brute force: O(nm) worst-case, O(n) on average
- MANY faster algorithms, but more complicated
- See the book
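
For comparison, here is brute-force search next to one of the faster alternatives. Horspool's variant of Boyer-Moore is chosen here purely as an illustration; the slide does not say which algorithms the lecture covered. Function names are hypothetical.

```python
def brute_force(text, pattern):
    """Try every alignment: O(nm) in the worst case."""
    n, m = len(text), len(pattern)
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]


def horspool(text, pattern):
    """Horspool: shift by the text character under the pattern's last position."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Distance from the rightmost occurrence of each character (excluding the
    # last pattern position) to the end of the pattern.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    hits = []
    i = 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            hits.append(i)
        # Characters that do not occur in the pattern allow a full shift of m.
        i += shift.get(text[i + m - 1], m)
    return hits
```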

Approximate string matching

- Match with k errors, select the one with min k
- Levenshtein distance between strings s1 and s2
- The minimum number of editing operations to make one from another
- Symmetric for standard sets of operations
- Operations: deletion, addition, change
- Sometimes weighted

- Solution: dynamic programming. O(mn), O(kn)
- m, n are lengths of the two strings
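
The O(mn) dynamic programming solution can be sketched as below, assuming unit cost for deletion, addition, and change. Only two rows of the table are kept; `levenshtein` is a hypothetical name.

```python
def levenshtein(s1, s2):
    """Edit distance with unit-cost deletion, addition, and change."""
    # previous[j] = distance between the processed prefix of s1 and s2[:j].
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # add c2
                previous[j - 1] + (c1 != c2),  # change c1 to c2, or match
            ))
        previous = current
    return previous[-1]
```

For instance, `levenshtein("survey", "surgery")` is 2: change `v` to `g` and add `r`.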

Regular expressions

- Regular expressions
- Automaton: O(m 2^m) + O(n) – bad for long patterns
- There are better methods, see book

- Using indices to search for words with errors
- Inverted files: search in vocabulary
- Suffix trees and suffix arrays: the same algorithms as for search without errors! Just allow deviations from the path

Search over compression

- Improves both space AND time (fewer disk operations)
- Compress query and search
- Huffman compression, words as symbols, bytes
- (frequencies: most frequent shorter)

- Look up each query word in the vocabulary to get its code
- More sophisticated algorithms

- Huffman compression, words as symbols, bytes
- Compressed inverted files: less disk, hence less time
- Text and index compression can be combined
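
To illustrate Huffman compression with words as symbols, the sketch below assigns shorter codes to more frequent words. It is a simplification: the codes are bit strings rather than the byte-oriented codes mentioned on the slide, and all names are hypothetical.

```python
import heapq
from collections import Counter


def huffman_codes(words):
    """Word-based Huffman: frequent words get codes that are no longer."""
    counts = Counter(words)
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    # Heap entries: (frequency, tie-breaker, {word: code so far}).
    heap = [(count, i, {word: ""}) for i, (word, count) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        count1, _, codes1 = heapq.heappop(heap)
        count2, _, codes2 = heapq.heappop(heap)
        merged = {word: "0" + code for word, code in codes1.items()}
        merged.update({word: "1" + code for word, code in codes2.items()})
        heapq.heappush(heap, (count1 + count2, tie, merged))
        tie += 1
    return heap[0][2]


def encode(words, codes):
    return "".join(codes[word] for word in words)


def decode(bits, codes):
    """Codes are prefix-free, so decoding can proceed greedily bit by bit."""
    by_code = {code: word for word, code in codes.items()}
    words, current = [], ""
    for bit in bits:
        current += bit
        if current in by_code:
            words.append(by_code[current])
            current = ""
    return words


sample = "to be or not to be that is the question to be".split()
codes = huffman_codes(sample)
```

Encoding the query words with the same `codes` gives the pattern to look for in the compressed text. Matching it reliably also requires respecting codeword boundaries, which this sketch does not handle.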

...compression

- Suffix trees can be compressed almost to the size of suffix arrays
- Suffix arrays can’t be compressed (almost random), but can be constructed over compressed text
- instead of Huffman, use a code that respects alphabetic order
- almost the same compression

- Signature files are sparse, so can be compressed
- ratios up to 70%

Research topics

- Perhaps, new details in integration of compression and search
- “Linguistic” indexing: allowing linguistic variations
- Search in plural or only singular
- Search with or without synonyms

Conclusions

- Inverted files seem to be the best option
- Other structures are good for specific cases
- Genetic databases

- Sequential searching is an integral part of many indexing-based search techniques
- Many methods to improve sequential searching

- Compression can be integrated with search

Till April 26, 6 pm