
Information Retrieval



  1. Information Retrieval • scope, basic concepts • system architectures, modes of operation Introduction – Beeri/Feitelson

  2. Scope of IR: Getting data relevant to user information needs from: free text (unstructured); recently also: hypertext (WWW), semi-structured (XML). Of the linguistic levels (lexical, syntactic, semantic), it uses (almost) only the first: the basic building block of queries is the term. (For hypertext, links are also used in query processing.)

  3. An IR system architecture/type is determined by: • What is the DB? Full text, or only meta-data? • How are information needs expressed – what is a query? • What are results, how are they presented? What is the interaction with the user? (e.g. feedback) • The indices & query processing algorithms (mostly determined by the above) Systems continue to evolve/expand by adding functionality: clustering, classification

  4. Architecture A: text pattern match • Full text • No indices • Query = regular expression • Query processing: read docs sequentially, match with the pattern • Results: show matching docs Example: grep on UNIX Properties: • Can be used with compression • Practical for “small” DBs (up to a few MBs)
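Architecture A amounts to a linear scan with a regex. A minimal Python sketch (the mini-corpus and the function name are invented for illustration):

```python
import re

# Hypothetical mini-corpus; in Architecture A the "DB" is just raw text.
docs = {
    "d1": "Information retrieval finds relevant documents.",
    "d2": "Regular expressions match text patterns.",
    "d3": "grep scans files sequentially.",
}

def pattern_match(pattern, docs):
    """Read each doc sequentially; return the ids of docs matching the pattern."""
    rx = re.compile(pattern)
    return [doc_id for doc_id, text in docs.items() if rx.search(text)]
```

Since every query re-reads the whole collection, cost grows linearly with DB size, which is why this is only practical for small DBs.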

  5. Architecture B: Hierarchical browsing • DB: a hierarchy of categories, with associated bib records | full-text docs • “Query”: browsing the hierarchy Examples: Yahoo, bow Issues: manual | (semi-)automatic construction of: • Hierarchy • Classification of entries Properties: • Browsing takes time/effort • Restricted coverage

  6. Architecture C: Boolean queries • DB: full text | bib records | abstracts | … • Query: lists of terms connected with boolean ops: and, or, not • Query processing: • List of terms – retrieve docIds (docs containing all terms) • And|or|not: combine using boolean set ops (note: “not computer” is not a query!) • Results: precise Example: LexisNexis

  7. Arch. C, cont’d: Extensions: • Proximity operations, for example: computer science • Together • In this order • Within 5 words of each other • … • Wild cards Example: LexisNexis
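A “within k words” proximity test can be sketched directly from token positions (the function name and sample texts are hypothetical; real systems work on positions stored in the index rather than on raw text):

```python
def within_k(text, t1, t2, k):
    """True if terms t1 and t2 occur within k words of each other in text."""
    words = text.lower().split()
    pos1 = [i for i, w in enumerate(words) if w == t1]
    pos2 = [i for i, w in enumerate(words) if w == t2]
    # Compare every occurrence pair; fine for short docs, inefficient at scale.
    return any(abs(i - j) <= k for i in pos1 for j in pos2)
```

The “together, in this order” variant would additionally require j - i == 1 instead of the absolute-distance test.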

  8. Properties: (mostly disadvantages) • End users have difficulties translating info needs into boolean queries (may require use of distributive laws for and, or, …) • Precise boolean formulation is often large/complex • No standard for syntax, additional operations, … • Extreme situations are possible: • Answer too small – not clear how to weaken the query • Answer too large – requires browsing & filtering, or transforming the query to return fewer answers (not easy) • Terms cannot be weighted by the user ⇒ Often requires an information professional to help users

  9. Architecture D: ranked queries (actually, ranked answers) • DB: full text | bib records | abstracts | … • Query: a list of (possibly weighted) terms • Result: docs, ranked by similarity (~relevance) • Query processing: • List of terms – retrieve docIds of docs containing some of the terms (scored by # of term occurrences in doc, significance of term, …) A popular system architecture; underlies many commercial products & search engines

  10. Properties: • Query = doc (list of term occurrences) • Easy to write/change queries (no expertise needed) • More relevant answers are shown first (provided the system performs well) • User’s feedback on relevance can be used to improve the query – relevance feedback (find similar pages) • Possible to miss some relevant docs • Possible to include some/many irrelevant docs

  11. Side trip: on measuring the quality of an answer: Precision = (# of relevant docs in answer) / |answer| Recall = (# of relevant docs in answer) / (# of relevant docs in DB) Issues: • How is relevance measured? (need users’ opinions) Substitute: user satisfaction (e.g., # of clicks) • Even if relevance is available, how is recall measured?
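Precision and recall can be computed directly from sets of docIds (a sketch; the function name is illustrative):

```python
def precision_recall(answer, relevant):
    """answer: set of retrieved docIds; relevant: set of all relevant docs in the DB.
    Returns (precision, recall)."""
    hits = len(answer & relevant)          # relevant docs that were retrieved
    precision = hits / len(answer) if answer else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

This also makes the measurement problem on the slide concrete: computing recall requires knowing the full `relevant` set for the whole DB, which is exactly what is hard to obtain in practice.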

  12. Index structure for boolean/ranked: Inverted file DB is ~ a boolean sparse matrix: d-t(i,j) = 1 if tj occurs in di, 0 if not A doc di is a boolean (sparse) vector (easily obtained) “Transposing the matrix”, we obtain t-d(j,i) = 1 if tj occurs in di, 0 if not A term tj is ~ a boolean (sparse) vector ⇒ vector space model

  13. A compact representation of the vector of tj is an inverted list (postings list). Inverted list for tj – IL(tj): • docId (logical pointer) for each doc that contains tj • (optional, for ranked) # of occurrences (or some measure of frequency, significance) of tj in the doc • (optional, for proximity) position of each occurrence of tj in the doc: • Section/paragraph/sentence • Word #, char # (More info ⇒ larger lists, costlier storage/retrieval)
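A toy in-memory construction of inverted lists with word positions (real systems build them on disk in several passes; the dict-of-dicts layout and naive whitespace tokenization here are illustrative assumptions):

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each term to its inverted list: {docId: [word positions]}.
    The term's frequency in a doc is then just len(positions)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index
```

Storing positions supports proximity queries; dropping them (keeping only counts, or only docIds) gives the smaller lists mentioned on the slide.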

  14. Typical inverted file organization: a 3-level architecture: • Lexicon: the terms, organized for fast search. Often can reside in main memory. For each term: • (optional) global data such as weight / total # of occurrences • Address of the inverted list on disk • Inverted lists file (on disk) – allows retrieving each inverted list • The DB of documents Also need a docId-to-docAddress mapper

  15. Advantages of the organization: • Can quickly find lexicon entries for query terms (would be much slower if the lexicon were combined with the IL’s) • Can quickly retrieve the IL’s for these terms • Then manipulate docIds before deciding which documents to retrieve (boolean, proximity, ranking can be done on docId lists) • Compression reduces storage costs and, more importantly, I/O costs (up to 70% of space) ⇒ can compress IL’s and DB separately, using appropriate methods for each

  16. Processing of boolean queries: • List of terms: retrieve their IL’s, intersect them • And: intersect lists • Or: union lists • And not: take the relative complement Finally, retrieve the docs Proximity: use positional information in the IL’s for filtering, if present; or retrieve the docs, then filter as they arrive
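The list operations above can be sketched as merges over sorted docId lists (illustrative code, not a production implementation; real systems stream the lists from disk):

```python
def intersect(il1, il2):
    """AND: linear merge of two sorted docId lists."""
    i = j = 0
    out = []
    while i < len(il1) and j < len(il2):
        if il1[i] == il2[j]:
            out.append(il1[i]); i += 1; j += 1
        elif il1[i] < il2[j]:
            i += 1
        else:
            j += 1
    return out

def and_not(il1, il2):
    """AND NOT: docIds in il1 but not in il2 (relative complement)."""
    excluded = set(il2)
    return [d for d in il1 if d not in excluded]
```

The merge does all work on docIds, so documents themselves are fetched only after the final answer list is known, which is the point of the 3-level organization.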

  17. Processing of ranked queries, naively: • Retrieve & merge the IL’s of all terms (like or) • Compute similarity/distance of the query to each doc • Present to the user, ranked by similarity What is naïve about the above?
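The naive scheme can be sketched as score accumulation over the merged IL’s (`toy_index` and the occurrence-count scoring are illustrative assumptions, not the slides’ exact weighting):

```python
def ranked_retrieve(query_terms, index, k=10):
    """Naive ranked retrieval: union the inverted lists of all query terms,
    score each doc by its total # of matching term occurrences, rank by score."""
    scores = {}
    for t in query_terms:
        for doc_id, positions in index.get(t, {}).items():
            scores[doc_id] = scores.get(doc_id, 0) + len(positions)
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

# Hypothetical toy index: term -> {docId: [positions]}
toy_index = {
    "cat": {"d1": [0], "d2": [1, 5, 7]},
    "dog": {"d1": [2]},
}
```

What is naive here: every doc containing any query term is fully scored and sorted, even though only the top few are shown, which motivates the fast/approximate evaluation topics later in the course.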

  18. How is similarity computed? Assume n terms • The query q is an n-vector (q1, q2, …, qn) • A doc d is an n-vector (d1, d2, …, dn) (The entries can be 0/1, or real numbers – frequencies/weights of terms in documents/query) • The similarity sim(q,d) is computed as: sim(q,d) = Σi qi·di / (‖q‖·‖d‖) This measures the (cosine of the) angle between the vectors in n-dimensional space

  19. • The 1/‖q‖ factor is fixed (the same for every doc) – can be omitted • The 1/‖d‖ factor normalizes document length – without it, long documents have an unfair advantage There are many variations on the cosine formula
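The cosine measure discussed on these slides, with both normalization factors written out (a short sketch; vectors are assumed to be plain lists of term weights):

```python
import math

def cosine(q, d):
    """Cosine of the angle between query and doc vectors in n-dimensional space."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))   # fixed per query; may be omitted
    norm_d = math.sqrt(sum(di * di for di in d))   # normalizes document length
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)
```

Note that scaling a document (e.g. doubling all its term counts) leaves the cosine unchanged, which is exactly the length-normalization effect the slide describes.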

  20. IR topics for the course: • Terms and weights: • What are terms? • Which weights give good results? • Index – inverted file (efficient) construction • Process many-GB DBs in hours • Other index structures (signature files) • Are inverted files the best solution? • Efficient & fast processing of ranked answers • Which weights to use in the cosine formula? • Fast computation of similarity for many docs • Approximate ⇒ find highly relevant docs early

  21. Compression: • Disk capacities grow, but collections grow faster • The main cost is I/O cost; compression reduces it • CPU speeds grow at a faster rate than I/O speeds – it pays to always use compression for disk data – an I/O rate vs. CPU usage tradeoff • Tailor compression methods to the different kinds of data: DB docs, IL’s, images, … • Search engines • How can link information improve answer quality? • Relevance feedback: • How can user feedback be used to improve answers? • Answer quality & presentation: • Revisit precision, recall • Clustering for improved presentation
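The slides do not fix a compression method for the IL’s; a common textbook choice, sketched here as an assumption, is to store docId gaps rather than docIds and encode each gap with a variable-byte code:

```python
def vbyte_encode_gaps(doc_ids):
    """Compress a sorted docId list: store d-gaps, each as a variable-byte code.
    Each byte carries 7 payload bits; the high bit marks a number's last byte."""
    out = bytearray()
    prev = 0
    for d in doc_ids:
        gap = d - prev
        prev = d
        chunks = []                     # 7-bit chunks, least significant first
        while True:
            chunks.append(gap & 0x7F)
            gap >>= 7
            if gap == 0:
                break
        for b in reversed(chunks[1:]):  # continuation bytes, high bit clear
            out.append(b)
        out.append(chunks[0] | 0x80)    # final byte, high bit set
    return bytes(out)

def vbyte_decode_gaps(data):
    """Inverse of vbyte_encode_gaps: rebuild the sorted docId list."""
    ids, n, prev = [], 0, 0
    for b in data:
        if b & 0x80:                    # last byte of this gap
            n = (n << 7) | (b & 0x7F)
            prev += n
            ids.append(prev)
            n = 0
        else:
            n = (n << 7) | b
    return ids
```

Gaps in a long inverted list are small, so most fit in a single byte: this is the space (hence I/O) saving the slide refers to, paid for with a little extra CPU work on decode.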
