Information Retrieval

Information Retrieval, CSE 8337, Spring 2003: Simple Text Processing. Material for these slides obtained from: Data Mining Introductory and Advanced Topics by Margaret H. Dunham, http://www.engr.smu.edu/~mhd/book.



  1. Information Retrieval CSE 8337 Spring 2003 Simple Text Processing Material for these slides obtained from: Data Mining Introductory and Advanced Topics by Margaret H. Dunham http://www.engr.smu.edu/~mhd/book

  2. Text Processing TOC • Simple Text Storage • String Matching • String-to-String Correction (Approximate matching)

  3. Text storage • EBCDIC/ASCII • Array of characters • Linked list of characters • Trees: B-tree, trie • Stuart E. Madnick, “String Processing Techniques,” Communications of the ACM, Vol. 10, No. 7, July 1967, pp. 420-424.
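
As an illustration of the trie option listed above, here is a minimal sketch that stores strings one character per edge. The class and method names (`Trie`, `insert`, `contains`, `has_prefix`) are chosen for this example and do not come from the slides.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the child TrieNode
        self.is_end = False   # True if a stored string ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        """Store word by walking/creating one node per character."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def _walk(self, prefix):
        """Return the node reached by prefix, or None if the path is missing."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_end

    def has_prefix(self, prefix):
        return self._walk(prefix) is not None
```

Strings that share a prefix share the nodes for that prefix, which is why a trie suits dictionary-style lookups.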

  4. Pattern Matching (Recognition) • Pattern matching finds occurrences of a predefined pattern in the data. • Applications include speech recognition, information retrieval, and time series analysis.

  5. Similarity Measures • Determine similarity between two objects. • Similarity characteristics: • Alternatively, distance measures measure how unlike or dissimilar objects are.

  6. String Matching Problem • Input: • Pattern of length m • Text string of length n • Find one (next, all) occurrences of the pattern in the text string • Ex: • String: 00110011011110010100100111 • Pattern: 011010

  7. String Matching Algorithms • Brute Force • Knuth-Morris-Pratt • Boyer-Moore • P209 in text

  8. Brute Force String Matching • Handbook of Algorithms and Data Structures http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/711a.srch.c.html • Space O(m+n) • Time O(mn) • Figure: the pattern 011010 aligned at successive positions of the string 00110011011110010100100111
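
The brute-force approach can be sketched as follows: try every alignment of the pattern against the text and compare character by character. This is a minimal illustration, not the handbook's C code; the function name `brute_force_search` is an assumption of this example.

```python
def brute_force_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    matches = []
    if m == 0:
        return matches
    for start in range(n - m + 1):          # each possible alignment
        j = 0
        while j < m and text[start + j] == pattern[j]:
            j += 1                           # extend the match one character
        if j == m:                           # whole pattern matched
            matches.append(start)
    return matches
```

Each of the up to n - m + 1 alignments can cost up to m comparisons, which is where the O(mn) time bound comes from.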

  9. FSR

  10. Creating FSR • Create FSM: • Construct the “correct” spine. • Add a default “failure bus” to state 0. • Add a default “initial bus” to state 1. • For each state, decide its attachments to failure bus, initial bus, or other failure links.
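
One way to make the idea of a pattern-recognizing state machine concrete is the standard table-driven construction below. It is a sketch under assumptions: it builds a full transition table over a given alphabet rather than the "failure bus" and "initial bus" attachments described on the slide, and the names `build_dfa` and `run_dfa` are hypothetical.

```python
def build_dfa(pattern, alphabet):
    """Build dfa[state][char] -> next state; state len(pattern) is accepting.

    Assumes every character of pattern is in alphabet.
    """
    m = len(pattern)
    dfa = [{c: 0 for c in alphabet} for _ in range(m + 1)]
    if m == 0:
        return dfa
    dfa[0][pattern[0]] = 1                   # first step along the spine
    x = 0                                    # restart state for mismatches
    for j in range(1, m + 1):
        for c in alphabet:
            dfa[j][c] = dfa[x][c]            # mismatch: act as the restart state would
        if j < m:
            dfa[j][pattern[j]] = j + 1       # match: advance along the spine
            x = dfa[x][pattern[j]]           # update the restart state
    return dfa


def run_dfa(dfa, text, m):
    """Feed text through the machine; return start indices of matches."""
    state, hits = 0, []
    for i, ch in enumerate(text):
        state = dfa[state].get(ch, 0)        # unknown characters reset to state 0
        if state == m:
            hits.append(i - m + 1)
    return hits
```

For example, `run_dfa(build_dfa("011010", "01"), text, 6)` scans `text` once, one table lookup per character.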

  11. Knuth-Morris-Pratt • Apply FSM to string by processing characters one at a time. • Accepting state is reached when pattern is found. • Space O(m+n) • Time O(m+n) • Handbook of Algorithms and Data Structures http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/712.srch.c.html
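
A minimal sketch of Knuth-Morris-Pratt follows. It uses the common failure-function (prefix table) formulation rather than an explicit state diagram; the function names `build_failure` and `kmp_search` are assumptions of this example and are not taken from the handbook code.

```python
def build_failure(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]                  # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    fail = build_failure(pattern)
    matches = []
    k = 0                                    # number of pattern characters matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]                  # reuse what is already matched
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):                # accepting state reached
            matches.append(i - k + 1)
            k = fail[k - 1]                  # continue, allowing overlaps
    return matches
```

The text index never moves backward, so the scan takes O(n) time after O(m) preprocessing, consistent with the O(m+n) bound on the slide.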

  12. Boyer-Moore • Scan pattern from right to left • Skip many positions when the text contains a character that cannot match (an “illegal” character). • Worst-case time O(mn) • Expected time better than KMP • Handbook of Algorithms and Data Structures http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/713.preproc.c.html
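
The right-to-left scan and the skipping behavior can be sketched with the bad-character rule alone. Note that this is a simplification: the full Boyer-Moore algorithm also uses a good-suffix rule, which is omitted here. The function name `boyer_moore_bad_char` is an assumption of this example.

```python
def boyer_moore_bad_char(text, pattern):
    """Boyer-Moore search using only the bad-character rule (simplified)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Rightmost index at which each character occurs in the pattern.
    last = {ch: i for i, ch in enumerate(pattern)}
    matches = []
    s = 0                                    # current alignment of pattern in text
    while s <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1                           # compare from right to left
        if j < 0:
            matches.append(s)
            s += 1                           # conservative shift after a match
        else:
            # Align the mismatched text character with its rightmost
            # occurrence in the pattern, or skip past it if it is absent.
            s += max(1, j - last.get(text[s + j], -1))
    return matches
```

When the mismatched text character does not occur in the pattern at all, the alignment jumps j + 1 positions at once, which is the source of the good expected behavior.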

  13. String-to-String Correction • Measure of similarity between strings • Can be used to determine how to convert one string into another • Cost to convert one to the other • Transformations • Match: current characters in both strings are the same • Delete: delete the current character in the input string • Insert: insert the current character of the target string into the input string
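
The three transformations above can be turned into a cost by dynamic programming. The sketch below assumes a cost of 0 for a match and 1 for each delete or insert; these costs, and the function name `correction_cost`, are assumptions of this example, and substitution is left out because it is not among the listed transformations.

```python
def correction_cost(source, target):
    """Minimum number of deletes and inserts that turn source into target."""
    m, n = len(source), len(target)
    # cost[i][j] = cost to convert source[:i] into target[:j]
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0] = i                       # delete every source character
    for j in range(1, n + 1):
        cost[0][j] = j                       # insert every target character
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = min(cost[i - 1][j] + 1,   # delete source[i-1]
                       cost[i][j - 1] + 1)   # insert target[j-1]
            if source[i - 1] == target[j - 1]:
                best = min(best, cost[i - 1][j - 1])  # match, no cost
            cost[i][j] = best
    return cost[m][n]
```

Under these costs, changing one character into another counts as a delete followed by an insert, so it costs 2.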

  14. Distance Between Strings

  15. Approximate String Matching • Find patterns “close to” the string • Fuzzy matching • Applications: • Spelling checkers • IR • Define similarity (distance) between string and pattern
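
Finding text "close to" a pattern can be sketched with a variant of the edit-distance table in which a match may begin at any text position. The sketch assumes unit costs for insert, delete, and substitution (Levenshtein distance), which is one possible definition of distance and not necessarily the one intended on the slides; the name `approximate_matches` is hypothetical.

```python
def approximate_matches(text, pattern, max_dist):
    """Return end indices in text where some substring ending there is
    within max_dist edits of pattern."""
    m = len(pattern)
    # prev[i] = distance between pattern[:i] and the empty text seen so far
    prev = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text):
        cur = [0] * (m + 1)                  # cur[0] = 0: a match may start anywhere
        for i in range(1, m + 1):
            cur[i] = min(
                prev[i - 1] + (pattern[i - 1] != ch),  # match or substitute
                prev[i] + 1,                           # extra character in text
                cur[i - 1] + 1,                        # pattern character missing
            )
        if cur[m] <= max_dist:
            ends.append(j)
        prev = cur
    return ends
```

With `max_dist` set to 0 this reduces to exact matching; a spelling checker would instead compare a word against dictionary entries and keep those within a small distance.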
