
Thanks to Bill Arms, Marti Hearst




Presentation Transcript


  1. Documents Thanks to Bill Arms, Marti Hearst

  2. Last time • Size of information • Continues to grow • IR is an old field, going back to the 1940s • IR is an iterative process • Search engines are the most popular information retrieval model • New ones are still being built

  3. Focus on documents • Documents are what we: • Crawl (harvest) • Index • Retrieve with a query • Evaluate • Rank • IR is an iterative process

  4. IR is an Iterative Process (diagram labels: Repositories, Goals, Workspace)

  5. (Diagram) User's Information Need → text input → Parse → Query

  6. (Diagram) Collections → Pre-process → Index

  7. (Diagram) The two paths combined: the Query and the Index both feed Rank or Match

  8. (Diagram) The full loop: the output of Rank or Match goes to Evaluation, and Query Reformulation feeds back into the Query

  9. Definitions Collections consist of Documents • Document • The basic unit which we will automatically index • usually a body of text which is a sequence of terms • has to be digital • Tokens or terms • Basic units of a document, usually consisting of text • semantic word or phrase, numbers, dates, etc • Collections or repositories • particular collections of documents • sometimes called a database • Query • request for documents on a topic
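The definitions on slide 9 can be made concrete with a small sketch. The collection contents, the document ids, and the `tokenize` rule below are all assumptions for illustration; real systems use far more careful tokenization.

```python
import re

# Hypothetical toy collection: document id -> text of the document.
collection = {
    "d1": "Information retrieval goes back to the 1940s.",
    "d2": "A query is a request for documents on a topic.",
}

def tokenize(text):
    """Split text into lowercase word and number tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Each document becomes a sequence of tokens (terms).
tokens_by_doc = {doc_id: tokenize(text) for doc_id, text in collection.items()}

# A query is tokenized the same way so it can be compared with documents.
query_tokens = tokenize("Documents on a topic")
```

Here the collection is the dictionary, each value is a document, and each element of a token list is a term.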

  10. Collection vs documents vs terms (diagram labels: Collection, Document, Terms or tokens)

  11. What is a Document? • A document is a digital object with an operational definition • Indexable • Can be queried and retrieved. • Many types of documents • Text or part of text • Image • Audio • Video • Blogs • Data • Email • Tweet • Etc.

  12. Text Documents A digital text document consists of a sequence of words and other symbols, e.g., punctuation. The individual words and other symbols are known as tokens or terms. A textual document can be: • Free text, also known as unstructured text, which is a continuous sequence of tokens. • Fielded text, also known as structured text, in which the text is broken into sections that are distinguished by tags or other markup. Example?
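One possible answer to the "Example?" prompt on slide 12 is an email message, shown below as both free text and fielded text. The field names, the sample message, and the `search_field` helper are hypothetical; the point is only that markup lets a query target one section of a document.

```python
import re

# Free (unstructured) text: one continuous sequence of tokens.
free_text = "Meeting moved to Friday. Please bring the index design notes."

# Fielded (structured) text: the same content split into named sections.
fielded_text = {
    "from": "alice@example.org",
    "subject": "Meeting moved",
    "body": "Meeting moved to Friday. Please bring the index design notes.",
}

def search_field(doc, field, term):
    """Return True if `term` occurs as a token in the named field of `doc`."""
    tokens = re.findall(r"\w+", doc.get(field, "").lower())
    return term.lower() in tokens

# A fielded query can distinguish where a term occurs.
in_subject = search_field(fielded_text, "subject", "friday")
in_body = search_field(fielded_text, "body", "friday")
```

With free text the only question is whether "friday" occurs at all; with fielded text the query can ask whether it occurs in the subject or in the body.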

  13. Why the focus on text? • Language is the most powerful query model • Language can be treated as text • Text has many interesting properties • Others?

  14. Information Retrieval from Collections of Textual Documents Major Categories of Methods • Exact matching (Boolean) • Ranking by similarity to query (vector space model) • Ranking of matches by importance of documents (PageRank) • Combination methods (what happens in major search engines)
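The first category, exact matching with Boolean operators, can be sketched with a tiny inverted index. This is a minimal illustration under assumed data: the three documents and the helper names (`build_index`, `boolean_and`, `boolean_or`) are made up for this example and do not come from the slides.

```python
import re
from collections import defaultdict

# Hypothetical toy collection: document id -> text.
docs = {
    1: "search engines index documents",
    2: "documents contain terms",
    3: "search engines rank documents",
}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def build_index(docs):
    """Map each term to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Documents containing every one of the terms."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def boolean_or(index, *terms):
    """Documents containing at least one of the terms."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))

index = build_index(docs)
both = boolean_and(index, "search", "rank")
either = boolean_or(index, "index", "terms")
```

A Boolean match is all or nothing: a document either satisfies the expression or it does not, so the result is an unordered set rather than a ranked list.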

  15. Text Based Information Retrieval Most matching methods are based on Boolean operators. Most ranking methods are based on the vector space model. Web search methods combine the vector space model with ranking based on the importance of documents. Many practical systems combine features of several approaches. In the basic form, all approaches treat words as separate tokens with minimal attempt to interpret them linguistically.
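Ranking with the vector space model can be sketched as follows. This is one common variant (term frequency times a smoothed inverse document frequency, compared by cosine similarity), chosen here as an assumption; the slides do not specify a weighting scheme, and the documents and function names are hypothetical. In line with the last point on slide 15, words are treated as independent tokens with no linguistic interpretation.

```python
import math
import re
from collections import Counter

# Hypothetical toy collection: document id -> text.
docs = {
    "d1": "search engines rank documents",
    "d2": "documents contain terms and more terms",
    "d3": "boolean operators match documents exactly",
}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def idf_table(docs):
    """Inverse document frequency per term: rarer terms get larger weights."""
    n = len(docs)
    df = Counter(term for text in docs.values() for term in set(tokenize(text)))
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}

def vectorize(text, idf):
    """Sparse TF-IDF vector; terms unseen in the collection are dropped."""
    tf = Counter(tokenize(text))
    return {term: count * idf[term] for term, count in tf.items() if term in idf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    idf = idf_table(docs)
    query_vec = vectorize(query, idf)
    scores = {doc_id: cosine(query_vec, vectorize(text, idf))
              for doc_id, text in docs.items()}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

results = rank("rank documents", docs)
```

Unlike a Boolean match, every document receives a score, so partially matching documents still appear, lower in the list. A web search engine would additionally combine such a similarity score with a measure of document importance.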
