
Lecture 4: Indexing Files


fagan

Presentation Transcript


  1. Lecture 4: Indexing Files • Inverted File • Lexical Analysis • Stop lists

  2. Indexing • Arrangement of data (data structure) to permit fast searching • Which list is easier to search?
     List 1: sow fox pig eel yak hen ant cat dog hog
     List 2: ant cat dog eel fox hen hog pig sow yak
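The question on this slide can be made concrete with a small sketch. The function names `linear_search` and `binary_search` are illustrative assumptions, not part of the lecture; the point is only that the sorted list allows a search that halves the range at each step, while the unsorted list has to be scanned entry by entry.

```python
from bisect import bisect_left

unsorted_words = ["sow", "fox", "pig", "eel", "yak", "hen", "ant", "cat", "dog", "hog"]
sorted_words = sorted(unsorted_words)


def linear_search(words, target):
    """Scan entries one by one; works on any list, O(n) comparisons."""
    for position, word in enumerate(words):
        if word == target:
            return position
    return -1


def binary_search(words, target):
    """Halve the search range at each step; needs sorted input, O(log n)."""
    position = bisect_left(words, target)
    if position < len(words) and words[position] == target:
        return position
    return -1


print(linear_search(unsorted_words, "hog"))  # position in the unsorted list
print(binary_search(sorted_words, "hog"))    # position in the sorted list
```

With ten words the difference is negligible; it matters once the vocabulary grows to many thousands of terms.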

  3. Creating inverted files • Original Documents → Word Extraction → Inverted Files • Inverted Files map Word IDs to Document IDs:
     W1: d1,d2,d3
     W2: d2,d4,d7,d9
     Wn: di,…dn

  4. Creating Inverted file • Map the file names to file IDs • Consider the following Original Documents:
     D1 The Department of Computer Science was established in 1984.
     D2 The Department launched its first BSc(Hons) in Computer Studies in 1987.
     D3 followed by the MSc in Computer Science which was started in 1991.
     D4 The Department also produced its first PhD graduate in 1994.
     D5 Our staff have contributed intellectually and professionally to the advancements in these fields.

  5. Creating Inverted file • Red: stop word
     D1 The Department of Computer Science was established in 1984.
     D2 The Department launched its first BSc(Hons) in Computer Studies in 1987.
     D3 followed by the MSc in Computer Science which was started in 1991.
     D4 The Department also produced its first PhD graduate in 1994.
     D5 Our staff have contributed intellectually and professionally to the advancements in these fields.

  6. Creating Inverted file • After stemming, make lowercase (option), delete numbers (option)
     D1 depart comput scienc establish
     D2 depart launch bsc hons comput studi
     D3 follow msc comput scienc start
     D4 depart produc phd graduat
     D5 staff contribut intellectu profession advanc field

  7. Creating Inverted file (unsorted)
     Words        Documents
     depart       d1,d2,d4
     comput       d1,d2,d3
     scienc       d1,d3
     establish    d1
     launch       d2
     bsc          d2
     hons         d2
     studi        d2
     follow       d3
     msc          d3
     start        d3
     produc       d4
     phd          d4
     graduat      d4
     staff        d5
     contribut    d5
     intellectu   d5
     profession   d5
     advanc       d5
     field        d5

  8. Creating Inverted file (sorted)
     Words        Documents
     advanc       d5
     bsc          d2
     comput       d1,d2,d3
     contribut    d5
     depart       d1,d2,d4
     establish    d1
     field        d5
     follow       d3
     graduat      d4
     hons         d2
     intellectu   d5
     launch       d2
     msc          d3
     phd          d4
     produc       d4
     profession   d5
     scienc       d1,d3
     staff        d5
     start        d3
     studi        d2
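The unsorted and sorted inverted files can be reproduced from the stemmed documents of slide 6 with a few lines of code. This is a minimal sketch: `build_inverted_file` is a hypothetical helper name, and a plain dictionary stands in for whatever on-disk structure a real system would use.

```python
# Stemmed, stop-word-free documents as given on slide 6.
docs = {
    "d1": "depart comput scienc establish",
    "d2": "depart launch bsc hons comput studi",
    "d3": "follow msc comput scienc start",
    "d4": "depart produc phd graduat",
    "d5": "staff contribut intellectu profession advanc field",
}


def build_inverted_file(documents):
    """Map each term to the list of document IDs that contain it."""
    inverted = {}
    for doc_id, text in documents.items():
        for term in text.split():
            postings = inverted.setdefault(term, [])
            if doc_id not in postings:  # record each document once per term
                postings.append(doc_id)
    return inverted


inverted_file = build_inverted_file(docs)

# Dictionaries keep insertion order, so this prints the unsorted file.
for term, postings in inverted_file.items():
    print(term, ",".join(postings))

# Sorting the terms gives the sorted file.
for term in sorted(inverted_file):
    print(term, ",".join(inverted_file[term]))
```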

  9. Searching on Inverted File • Binary Search • Used at small scale • Create a thesaurus and combine it with techniques such as: • Hashing • B+tree • Pointer to the address in the indexed file
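Two of the lookup techniques named on this slide can be sketched side by side. The sketch uses a handful of entries from the worked example; `lookup_binary` and `lookup_hash` are illustrative names, and Python's built-in `dict` is used here as a stand-in for a hash index. A B+tree is not shown.

```python
from bisect import bisect_left

# A few (term, postings) pairs from the sorted inverted file, kept in term order.
entries = [
    ("advanc", ["d5"]),
    ("bsc", ["d2"]),
    ("comput", ["d1", "d2", "d3"]),
    ("depart", ["d1", "d2", "d4"]),
    ("scienc", ["d1", "d3"]),
]
terms = [term for term, _ in entries]


def lookup_binary(term):
    """Binary search over the sorted term list; returns the postings or []."""
    i = bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return entries[i][1]
    return []


# Hashing: a dict gives expected constant-time lookup without needing sorted terms.
hash_index = dict(entries)


def lookup_hash(term):
    """Hash lookup; returns the postings or []."""
    return hash_index.get(term, [])


print(lookup_binary("comput"))
print(lookup_hash("scienc"))
```

In a larger system the postings list would typically be replaced by a pointer (file offset) into the indexed file, as the last bullet suggests.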

  10. Lexical Analysis for indexing • Word extraction • Spaces as English word boundaries • Chinese word segmentation • Stop word elimination • “a”, “an”, “the”, “about”, “etc”, “every”, “you”, etc. • Word stemming

  11. Lexical Analysis • Lexical analysis is the process of converting an input stream of characters into a stream of words or tokens • Lexical analysis is the first stage of: • Automatic indexing • Query processing

  12. Lexical Analysis for Automatic Indexing • What counts as a word or token in the indexing scheme? (an easy problem?) • Digits • “Year 2000”, “Y2K” • Hyphens • “F-16”, “MS-DOS” • Other Punctuation • “COMMAND.COM”, “max_size” (often in C code) • Case • IBM or ibm

  13. Lexical Analysis for Automatic Indexing (cont.) • No technical difficulty in solving any of these problems • Must think about them carefully • Tradeoff between recall and precision • Breaking up hyphenated terms increases recall but decreases precision • Preserving case distinctions enhances precision but decreases recall
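The choices about digits, hyphens, other punctuation and case can be exposed as switches on a tokenizer. The sketch below is one possible design, not the lecture's: the function `tokenize`, its parameters and the regular expression are all assumptions made for illustration.

```python
import re

# A token is a run of letters/digits, optionally joined by single internal
# hyphens, dots or underscores (so "MS-DOS", "COMMAND.COM", "max_size" survive).
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:[-._][A-Za-z0-9]+)*")


def tokenize(text, split_hyphens=False, fold_case=True, keep_digits=True):
    """Convert a character string into a list of index tokens."""
    if fold_case:
        text = text.lower()  # "IBM" and "ibm" become the same term
    tokens = TOKEN_PATTERN.findall(text)
    if split_hyphens:
        # "ms-dos" -> "ms", "dos": more matches (recall), less exact (precision)
        tokens = [part for token in tokens for part in token.split("-")]
    if not keep_digits:
        tokens = [token for token in tokens if not token.isdigit()]
    return tokens


sample = "MS-DOS and F-16 in Year 2000"
print(tokenize(sample))
print(tokenize(sample, split_hyphens=True))
print(tokenize(sample, fold_case=False, keep_digits=False))
```

Whatever settings are chosen for indexing must also be applied to queries, otherwise a query token such as "MS-DOS" would never match an index built with hyphen splitting.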

  14. Lexical Analysis for Query Processing • Depends on the design of the lexical analyzer for automatic indexing • Distinguish operators (Boolean operators, weighting function operators etc.) • Process certain characters: • Control characters • “” for phrase search, {} for priority • Disallowed punctuation characters (error)

  15. STOPLISTS • Many of the most frequently occurring words in English (“the”, “of”, etc.) are worthless as index terms • Eliminating such words • Speeds processing • Saves huge amounts of space in indexes • Does not damage retrieval effectiveness • Stoplists are used to eliminate such words. E.g., • http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words • http://bll.epnet.com/help/ehost/Stop_Words.htm • http://www.syger.com/jsc/docs/stopwords/english.htm
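Applying a stoplist is a simple filter over the token stream. In the sketch below, `STOP_WORDS` is a tiny illustrative sample assembled from the words quoted in these slides plus "was" and "in" (which the worked example treats as stop words); it is an assumption, not one of the published lists linked above. The helper name `remove_stop_words` is likewise hypothetical.

```python
# Illustrative sample only; real stoplists contain hundreds of words.
STOP_WORDS = frozenset(
    {"a", "an", "the", "about", "etc", "every", "you", "of", "was", "in"}
)


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stoplist (compared case-insensitively)."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = "The Department of Computer Science was established in 1984".split()
print(remove_stop_words(tokens))
```

A set (here a `frozenset`) keeps the membership test constant-time, so filtering costs one lookup per token.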

  16. STOPLISTS • The choice of words in a stop list may vary from person to person. • The general idea is to find words that occur so often that they are not good terms for information retrieval. • How can the vector space model be used to find a list of stop words? • How can stop words be found in Chinese?
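The "words that occur often" idea can be illustrated by counting document frequency, that is, in how many documents each word appears. This sketch does not answer the open questions on the slide; it only shows one simple, assumed heuristic. The function name `stop_word_candidates` and the 0.8 threshold are arbitrary choices for illustration.

```python
import re
from collections import Counter

# The five original documents from the worked example.
documents = [
    "The Department of Computer Science was established in 1984.",
    "The Department launched its first BSc(Hons) in Computer Studies in 1987.",
    "followed by the MSc in Computer Science which was started in 1991.",
    "The Department also produced its first PhD graduate in 1994.",
    "Our staff have contributed intellectually and professionally "
    "to the advancements in these fields.",
]


def stop_word_candidates(docs, min_fraction=0.8):
    """Return words that occur in at least min_fraction of the documents."""
    document_frequency = Counter()
    for text in docs:
        # Count each word once per document (alphabetic words only).
        document_frequency.update(set(re.findall(r"[a-z]+", text.lower())))
    threshold = min_fraction * len(docs)
    return sorted(w for w, df in document_frequency.items() if df >= threshold)


print(stop_word_candidates(documents))
```

On a collection this small, lowering the threshold quickly starts to flag content words such as "department" and "computer", which is one reason the choice of stop words involves judgment.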
