
Information Retrieval

Ugochukwu Chimbo EJIKEME

Structured Vs Unstructured Data
  • Corporate information that is not stored in a database
  • In general, “structure” can refer to:
    • The structure of the data itself.
    • The structure of the container that hosts the data.
    • The structure of the access method used to access the data.
Information Retrieval Systems (IRS)
  • Information-retrieval systems are used to store and query textual data such as documents. They use a simpler data model than do database systems. Traditional examples of information-retrieval systems are online library catalogs and online document-management systems such as those that store newspaper articles.
Characteristics of IRS
  • Documents are typically described by a set of keywords.
  • Information in the database is organized simply as a collection of unstructured documents.
  • Transactional requirements matter less than they do in database systems.
Relevance Ranking
  • Using Terms (Keywords)
    • Ranking Using TF-IDF
    • Similarity-Based Retrieval
  • Hyperlinks (WEB)
    • Popularity Ranking (prestige ranking)
    • PageRank
    • Combining TF-IDF and Popularity Ranking Measures
Ranking using TF-IDF
  • Term Frequency (TF) – Relevance of a document (d) to a term (t).
  • “Multiple keyword” queries: for a query with keywords $t_1, \ldots, t_n$, sum the per-term relevance of document d:
    • $\sum_{i=1}^{n} TF(d, t_i)$
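The multi-keyword sum can be sketched in Python as follows. The slide does not define TF itself; the logarithmic form `log(1 + n(d,t) / n(d))` used here is one commonly used definition and is an assumption, as are the function names.

```python
import math

def term_frequency(doc_terms, term):
    """TF(d, t) = log(1 + n(d, t) / n(d)); an assumed, commonly used definition."""
    if not doc_terms:
        return 0.0
    return math.log(1 + doc_terms.count(term) / len(doc_terms))

def tf_relevance(doc_terms, query_terms):
    """Relevance of a document to a multi-keyword query: the sum of TF(d, t_i)."""
    return sum(term_frequency(doc_terms, t) for t in query_terms)

# Toy document, split into terms on whitespace.
doc = "information retrieval systems store and query textual information".split()
score = tf_relevance(doc, ["information", "retrieval"])
```

The logarithm keeps a term that occurs many times from dominating the score, which is one reason this form is often preferred over a raw count.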


Inverse document frequency (IDF)

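    • Example query “Facebook Ugo”: should both keywords count equally? IDF gives less weight to terms that occur in many documents and more weight to rare terms.
    • Relevance therefore combines both measures over the query keywords: $\sum_{i=1}^{n} TF(d, t_i) \cdot IDF(t_i)$
    • Proximity: the closer the query words are to each other in the document, the higher the rank.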
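A minimal sketch of TF-IDF ranking for a two-keyword query is given below. The definitions are assumptions, not taken from the slides: TF is the logarithmic form `log(1 + n(d,t) / n(d))`, and IDF is `1 / n(t)`, where `n(t)` is the number of documents containing the term. Many systems use a logarithmic IDF instead. The documents are made up for illustration.

```python
import math

def tf(doc_terms, term):
    # Assumed definition: log(1 + occurrences of term / document length).
    return math.log(1 + doc_terms.count(term) / len(doc_terms)) if doc_terms else 0.0

def idf(docs, term):
    # Assumed definition: 1 / (number of documents that contain the term).
    n_t = sum(1 for d in docs if term in d)
    return 1.0 / n_t if n_t else 0.0

def tf_idf_relevance(docs, doc_terms, query_terms):
    # Sum of TF * IDF over the query keywords.
    return sum(tf(doc_terms, t) * idf(docs, t) for t in query_terms)

# Hypothetical corpus: "facebook" is common, "ugo" is rare.
docs = [
    "facebook news and facebook updates".split(),
    "ugo joined facebook".split(),
    "facebook privacy settings".split(),
]
query = ["facebook", "ugo"]
ranked = sorted(
    range(len(docs)),
    key=lambda i: tf_idf_relevance(docs, docs[i], query),
    reverse=True,
)
```

Because “ugo” occurs in only one document, it dominates the score, so the second document ranks first even though the first document mentions “facebook” twice.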
Similarity-Based Retrieval
  • Retrieve documents similar to a given document.
  • Similarity may be defined on the basis of common terms.
  • Cosine similarity metric
  • Relevance feedback – start a new search based on user feedback on a prior search.
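The cosine similarity metric can be sketched as follows. Each document is treated as a vector of term weights; raw term counts are used as the weights here, which is a simplifying assumption (TF-IDF weights are a common alternative).

```python
import math
from collections import Counter

def cosine_similarity(terms_a, terms_b):
    """Cosine of the angle between two term-count vectors."""
    a, b = Counter(terms_a), Counter(terms_b)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

same = cosine_similarity("web search engine".split(), "web search engine".split())
partial = cosine_similarity(["a", "b"], ["a", "c"])
disjoint = cosine_similarity(["a"], ["b"])
```

Documents with identical term distributions score close to 1, and documents with no terms in common score 0.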
Hyperlink
  • Popularity Ranking
  • Rank “popular” documents higher among the set of documents containing the specified keywords.
  • Determining “Popularity”
    • Access rate?
      • How to get accurate data?
    • Bookmarks?
      • Might be private?
    • Links to related pages?
      • Using web crawler to analyze external links.

Transfer of Prestige

  • A link from a popular page x to a page y is treated as conferring more prestige on page y than a link from a not-so-popular page z.
PageRank
  • A measure of popularity of a page based on the popularity of pages that link to the page.
  • Understanding PageRank.
    • Random walk model:
      • The PageRank of a page is the probability that a random walker is visiting a page at any given point in time.
  • Drawback:
    • does not take query keywords into account.
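The random walk model can be sketched as an iterative computation. This is a minimal illustration, not a production implementation: the `damping` value of 0.85 (the probability that the walker follows a link rather than jumping to a random page), the fixed iteration count, and the tiny link graph are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # With probability (1 - damping) the walker jumps to a random page.
        new_rank = {p: (1 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Otherwise the walker follows one outgoing link, chosen uniformly.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links: jump to a random page.
                for t in pages:
                    new_rank[t] += damping * rank[p] / n
        rank = new_rank
    return rank

# Hypothetical graph: x and z both link to y, and y links back to x.
ranks = pagerank({"x": ["y"], "z": ["y"], "y": ["x"]})
```

In this graph y collects prestige from two pages and passes it on to x, while z has no incoming links and keeps only the random-jump share. The ranks sum to 1, as expected for a probability distribution over pages.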
Other Measures of Popularity
  • Click fraction
    • The search engine provides an indirect link through its own site, which records the click and transparently redirects the browser to the original link.
  • Anchor text + Page Rank
  • Anchor text + Page Rank + TF–IDF measures

The HITS algorithm:

    • Computes popularity using only a set of related pages.
  • Hubs and Authorities
  • Hub - A page that stores links to many related pages (may not in itself contain actual information on a topic)
  • Authority - A page that contains actual information on a topic (may not store links to many related pages).
  • Each page gets a prestige value as a hub (hub-prestige), and another prestige value as an authority (authority-prestige).
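The mutual reinforcement between hubs and authorities can be sketched as a simple iteration. The normalization step, the fixed iteration count, and the page names are assumptions made for illustration.

```python
import math

def hits(links, iterations=20):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority-prestige: sum of the hub-prestige of pages linking to the page.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub-prestige: sum of the authority-prestige of the pages it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical set of related pages.
hub, auth = hits({"portal": ["guide_a", "guide_b"], "blog": ["guide_a"]})
```

Here `portal` ends up as the strongest hub because it links to both guides, and `guide_a` as the strongest authority because both hubs point to it.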
Search Engine Spamming
  • Practice of creating Web pages, or sets of Web pages, designed to get a high relevance rank for some queries, even though the sites are not actually popular sites.
Synonyms, Homonyms, and Ontologies
  • Synonyms
    • Define alternative words for keywords
      • E.g., Class room <==> (Class or Lecture) room
  • Homonyms
    • single words with multiple meanings
  • Concept-based querying
    • analyze each document to disambiguate each word in the document, and replace it with the concept that it represents; disambiguation is usually done by looking at other surrounding words in the document.
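Synonym handling can be sketched as query expansion: each keyword is replaced by the set containing the keyword and its synonyms, and a document matches if it contains at least one member of every set. The synonym table below is hypothetical; a real system would load it from a thesaurus.

```python
# Hypothetical synonym table, for illustration only.
SYNONYMS = {
    "repair": {"maintenance"},
    "motorcycle": {"motorbike"},
}

def expand(keyword):
    """Return the keyword together with its known synonyms."""
    return {keyword} | SYNONYMS.get(keyword, set())

def matches(doc_terms, query_terms):
    """True if, for every query keyword, the document has it or a synonym."""
    terms = set(doc_terms)
    return all(expand(k) & terms for k in query_terms)

hit = matches("motorbike maintenance tips".split(), ["motorcycle", "repair"])
miss = matches("motorcycle racing news".split(), ["motorcycle", "repair"])
```

Expansion by plain word lookup does not address homonyms, since a synonym of one meaning of a word may retrieve documents about another meaning. That is the gap concept-based querying tries to close.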

Ontologies are hierarchical structures that reflect relationships between concepts.

    • Common relationships include is-a and part-of.
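An is-a hierarchy can be sketched as a map from each concept to its parent, with a lookup that walks up the chain. The concepts below are illustrative assumptions, and the sketch assumes the hierarchy has no cycles.

```python
# Hypothetical is-a hierarchy: each concept points to its parent concept.
IS_A = {
    "leopard": "mammal",
    "mammal": "animal",
}

def is_a(concept, ancestor):
    """True if `ancestor` is reachable from `concept` by following is-a links."""
    while concept in IS_A:
        concept = IS_A[concept]
        if concept == ancestor:
            return True
    return False

leopard_is_animal = is_a("leopard", "animal")
animal_is_leopard = is_a("animal", "leopard")
```

A query for a general concept such as “animal” could use this lookup to also retrieve documents about more specific concepts.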
Indexing of Documents
  • Inverted index
    • Maps each keyword Ki to a list Si of the documents that contain Ki.
      • Example entries for one keyword:
        • Document 1 (d1): 56, 89, 201
        • Document 2 (d2): 12, 18, 19
        • Document 3 (d3): 5
      • Inverted index = “d1/56,89,201; d2/12,18,19; d3/5”
      • May also include the term frequency in each document.
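Building and querying an inverted index can be sketched as follows. Storing the word positions of each keyword per document is an assumption made here; the length of a position list then also gives the keyword's number of occurrences in that document. The documents and function names are illustrative.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword to {doc_id: [positions of the keyword in that doc]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def and_query(index, keywords):
    """Documents containing all keywords: intersect the per-keyword doc sets."""
    doc_sets = [set(index[k]) if k in index else set() for k in keywords]
    return set.intersection(*doc_sets) if doc_sets else set()

docs = {
    "d1": "information retrieval systems",
    "d2": "database systems",
    "d3": "information systems and retrieval",
}
index = build_inverted_index(docs)
result = and_query(index, ["information", "retrieval"])
```

A keyword query never scans the documents themselves: it only looks up one list per keyword and intersects them.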
Measuring Retrieval Effectiveness
  • Keywords are maintained in a compressed form (to keep space usage of the index low).
    • The index is sometimes stored such that retrieval is approximate: a few relevant documents may not be retrieved (called a false drop or false negative), or a few irrelevant documents may be retrieved (called a false positive).
Measurement metrics
  • Precision
    • Measures the percentage of retrieved documents that are relevant to the query.
  • Recall
    • Measures the percentage of the documents relevant to the query that were retrieved.
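Both metrics can be computed from two sets: the documents the system retrieved and the documents that are actually relevant. The document identifiers below are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) as fractions between 0 and 1."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Four documents retrieved, two of them relevant; one relevant document missed.
precision, recall = precision_recall(
    retrieved={"d1", "d2", "d3", "d4"},
    relevant={"d1", "d3", "d5"},
)
# precision = 2/4 = 0.5, recall = 2/3
```

In this example d2 and d4 are false positives and d5 is a false negative (false drop).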
Beyond Page Ranking
  • Information Extraction
    • convert information from textual form to a more structured form.
    • Sample application: Google Scholar.
  • Question Answering
    • system attempts to provide direct answers to questions posed by users.
Summary
  • Information-retrieval systems are used to store and query textual data such as documents.
  • Queries attempt to locate documents that are of interest by specifying, for example, sets of keywords.
  • Relevance ranking makes use of several types of information, such as:
    • Term frequency: how important each term is to each document.
    • Inverse document frequency.
    • Popularity ranking.
  • Search engine spamming attempts to get an (undeserved) high ranking for a page.
  • Synonyms and homonyms complicate the task of information retrieval. Concept-based querying aims at finding documents containing specified concepts, regardless of the exact words (or language) in which the concept is specified. Ontologies are used to relate concepts using relationships such as is-a or part-of.

  • Inverted indices are used to answer keyword queries.
  • Precision and recall are two measures of the effectiveness of an information retrieval system.
  • Techniques have been developed to extract structured information from textual data and to give direct answers to simple questions posed in natural language.