Quality of a search engine

Paolo Ferragina

Dipartimento di Informatica

Università di Pisa

Reading 8

Is it good?

  • How fast does it index

    • Number of documents/hour

    • (Average document size)

  • How fast does it search

    • Latency as a function of index size

  • Expressiveness of the query language

Measures for a search engine

  • All of the preceding criteria are measurable

  • The key measure: user happiness

    …useless answers won’t make a user happy

Happiness: elusive to measure

  • The most common proxy is the relevance of search results

    • How do we measure it?

  • Requires 3 elements:

    • A benchmark document collection

    • A benchmark suite of queries

    • A binary assessment of either Relevant or Irrelevant for each query-doc pair

Evaluating an IR system

  • Standard benchmarks

    • TREC: the National Institute of Standards and Technology (NIST) has run a large IR testbed for many years

    • Other doc collections: human experts mark, for each query and each doc, Relevant or Irrelevant

  • On the Web everything is more complicated, since we cannot mark the entire corpus!

General scenario

Precision vs. Recall

  • Precision: % of retrieved docs that are relevant [issue: how much “junk” is returned]

  • Recall: % of relevant docs that are retrieved [issue: how much “info” is found]




How to compute them

  • Precision: fraction of retrieved docs that are relevant

  • Recall: fraction of relevant docs that are retrieved

  • Precision P = tp/(tp + fp)

  • Recall R = tp/(tp + fn)

    (tp = true positives, fp = false positives, fn = false negatives; see the sketch below)
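To make the formulas concrete, here is a minimal Python sketch; the function name and the document ids are illustrative, not from the slides:

def precision_recall(retrieved, relevant):
    # Compute precision and recall from two sets of doc ids.
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # relevant docs we returned
    fp = len(retrieved - relevant)   # junk we returned
    fn = len(relevant - retrieved)   # relevant docs we missed
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# 3 of the 4 retrieved docs are relevant, out of 6 relevant docs overall
print(precision_recall([1, 2, 3, 4], [2, 3, 4, 5, 6, 7]))  # (0.75, 0.5)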

Some considerations

  • Can get high recall (but low precision) by retrieving all docs for all queries!

  • Recall is a non-decreasing function of the number of docs retrieved

  • Precision usually decreases

Precision-Recall curve

  • We measure Precision at various levels of Recall

  • Note: it is an AVERAGE over many queries

[Figure: Precision-Recall curve]
A common picture

[Figure: the typical precision vs. recall trade-off]
F measure

  • Combined measure (weighted harmonic mean): F = 1 / (α·(1/P) + (1 − α)·(1/R))

  • People usually use the balanced F1 measure

    • i.e., with α = ½, thus 1/F = ½ (1/P + 1/R), i.e. F1 = 2PR/(P + R)

  • Use this if you need to optimize a single measure that balances precision and recall (see the sketch below).
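A minimal sketch of the weighted harmonic mean above; the helper name and the alpha parameter spelling are illustrative:

def f_measure(p, r, alpha=0.5):
    # Weighted harmonic mean of precision P and recall R.
    # alpha = 0.5 gives the balanced F1 = 2PR / (P + R).
    if p == 0 or r == 0:
        return 0.0
    return 1.0 / (alpha / p + (1 - alpha) / r)

print(f_measure(0.75, 0.5))  # balanced F1 = 0.6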

Recommendation systems

Paolo Ferragina

Dipartimento di Informatica

Università di Pisa



  • We have a list of restaurants

    • with positive and negative ratings for some of them

      Which restaurant(s) should I recommend to Dave?

Basic Algorithm

  • Recommend the most popular restaurants

    • say, score = # positive votes minus # negative votes (see the sketch below)

  • What if Dave does not like Spaghetti?
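A minimal sketch of this popularity score; the user names (other than Estie) and the +1/-1 vote encoding are assumptions:

from collections import Counter

# ratings[user][restaurant] = +1 (positive vote) or -1 (negative vote)
ratings = {
    "Alice": {"Straits Cafe": 1, "Spaghetti House": 1},
    "Bob":   {"Spaghetti House": 1, "Thai Palace": -1},
    "Estie": {"Straits Cafe": 1, "Spaghetti House": 1, "Thai Palace": 1},
}

popularity = Counter()
for votes in ratings.values():
    for restaurant, vote in votes.items():
        popularity[restaurant] += vote   # net score: positives minus negatives

print(popularity.most_common(1))  # [('Spaghetti House', 3)] -- bad luck, Dave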

Smart Algorithm

  • Basic idea: find the person “most similar” to Dave according to cosine similarity (i.e., Estie), and then recommend something this person likes (see the sketch below).

    • Perhaps recommend Straits Cafe to Dave

  • But do you want to rely on one person’s opinions?
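A sketch of the cosine-similarity step, assuming the same +1/-1 rating vectors as above (Dave’s and Bob’s ratings are hypothetical):

import math

def cosine(u, v):
    # Cosine similarity between two sparse rating dicts.
    dot = sum(u[k] * v[k] for k in set(u) & set(v))
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

dave  = {"Spaghetti House": -1, "Thai Palace": 1}
estie = {"Straits Cafe": 1, "Thai Palace": 1}
bob   = {"Spaghetti House": 1, "Thai Palace": -1}

print(cosine(dave, estie))  # 0.5  -> Estie is most similar to Dave
print(cosine(dave, bob))    # -1.0
# Recommend what Estie likes and Dave has not tried: Straits Cafe.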

Main idea



What do we suggest to U?
A glimpse on XML retrieval (eXtensible Markup Language)

Paolo Ferragina

Dipartimento di Informatica

Università di Pisa

Reading 10

XML vs. HTML


  • HTML is a markup language for a specific purpose (display in browsers)

  • XML is a framework for defining markup languages

  • HTML has a fixed set of markup tags; XML does not

  • HTML can be formalized as an XML language (XHTML)

XML Example (visual)

XML Example (textual)

<chapter id="cmds">

<chaptitle> FileCab </chaptitle>

<para>This chapter describes the commands that manage the <tm>FileCab</tm>inet application.</para>

</chapter>

Basic Structure

  • An XML doc is an ordered, labeled tree (see the sketch after this list)

  • character data: leaf nodes contain the actual data (text strings)

  • element nodes: each labeled with

    • a name (often called the element type), and

    • a set of attributes, each consisting of a name and a value,

    • can have child nodes
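A small sketch of this tree view, run on the chapter example above with Python’s standard xml.etree.ElementTree:

import xml.etree.ElementTree as ET

doc = """<chapter id="cmds">
  <chaptitle>FileCab</chaptitle>
  <para>This chapter describes the commands that manage
  the <tm>FileCab</tm>inet application.</para>
</chapter>"""

root = ET.fromstring(doc)
print(root.tag, root.attrib)   # element name + attributes: chapter {'id': 'cmds'}
for child in root:             # ordered child element nodes
    print(child.tag, "->", (child.text or "").strip())  # leaf text (up to <tm>)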

XML: Design Goals

  • Separate syntax from semantics to provide a framework for structuring information

  • Allow tailor-made markup for any imaginable application domain

  • Support internationalization (Unicode) and platform independence

  • Be the standard for (semi)structured information (do some of the work now done by databases)

Why Use XML?

  • Represent semi-structured data

  • XML is more flexible than DBs

  • XML is more structured than simple IR

  • You get a massive infrastructure for free

Data vs. Text-centric XML

  • Data-centric XML: used for messaging between enterprise applications

    • Mainly a recasting of relational data

  • Text-centric XML: used for annotating content

    • Rich in text

    • Demands good integration of text retrieval functionality

    • E.g., find me the ISBN #s of Books with at least three Chapters discussing cocoa production, ranked by Price

IR Challenges in XML

  • There is no document unit in XML

  • How do we compute tf and idf?

  • Indexing granularity

  • Need to go to the document to retrieve or display a fragment

    • E.g., give me the Abstracts of Papers on existentialism

  • Need to identify similar elements in different schemas

    • Example: employee

XQuery: SQL for XML?

  • Simple attribute/value

    • /play/title contains “hamlet”

  • Path queries (see the sketch after this list)

    • title contains “hamlet”

    • /play//title contains “hamlet”

  • Complex graphs

    • Employees with two managers

  • What about relevance ranking?
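A full XQuery engine is beyond these slides, but the two path queries above can be sketched with ElementTree’s limited XPath support; the toy document is illustrative:

import xml.etree.ElementTree as ET

play = ET.fromstring(
    "<play><title>Hamlet</title>"
    "<act><scene><title>Hamlet speaks</title></scene></act></play>")

# /play/title contains "hamlet": titles that are direct children of the root
print([t.text for t in play.findall("title")
       if "hamlet" in t.text.lower()])      # ['Hamlet']

# /play//title contains "hamlet": titles at any depth below the root
print([t.text for t in play.findall(".//title")
       if "hamlet" in t.text.lower()])      # ['Hamlet', 'Hamlet speaks']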

Data structures for XML retrieval

  • Inverted index: give me all elements matching text query Q

    • We know how to do this – treat each element as a document

  • Give me all elements below any instance of the Book element (Parent/child relationship is not enough)

Positional containment

  • Example query: droppeth under Verse under Play

[Figure: positional postings for Play, Verse, and droppeth]

  • Containment can be viewed as merging postings, as in the sketch below



Summary of data structures

  • Path containment etc. can essentially be solved by positional inverted indexes

  • Retrieval consists of “merging” postings

  • All the compression tricks are still applicable

  • Complications arise from insertion/deletion of elements, text within elements

    • Beyond the scope of this course

Search Engines


Classic approach…

[Figure: classic ad targeting: socio-demographic, geographic, contextual]

Search Engines vs. Advertisement

  • First generation: use only on-page, web-text data

    • Word frequency and language

  • Second generation: use off-page, web-graph data

    • Link (or connectivity) analysis

    • Anchor-text (How people refer to a page)

  • Third generation: answer “the need behind the query”

    • Focus on “user need”, rather than on query

    • Integrate multiple data-sources

    • Click-through data

Pure search vs. Paid search

  • Ads shown next to search results, ranked by who pays more (Goto/Overture)

  • 2003: Google/Yahoo

  • New model: all players now have a search engine, plus an advertising platform and network

The new scenario

  • SEs make possible

    • aggregation of interests

    • unlimited selection (Amazon, Netflix,...)

  • Incentives for specialized niche players

The biggest money is in the smallest sales!

Two new approaches

  • Sponsored search: Ads driven by search keywords

    (and the profile of the user issuing them)

  • Context match: Ads driven by the content of a web page

    (and the profile of the user reaching that page)



How does it work?

  • Match Ads to the query or to the page content

  • Order the Ads

  • Pricing on a click-through (see the sketch below)
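The slides do not fix the mechanism; as one concrete illustration, here is a sketch in the spirit of a generalized second-price auction, ordering ads by bid × estimated click-through rate and charging per click (all bids, CTRs, and the reserve price are made up):

# Each ad: (name, bid per click, estimated click-through rate)
ads = [("ad1", 2.00, 0.05), ("ad2", 1.50, 0.10), ("ad3", 1.00, 0.08)]

# Order ads by expected revenue per impression: bid * CTR
ranked = sorted(ads, key=lambda a: a[1] * a[2], reverse=True)

# Second-price-style pricing: a clicked ad pays the next ad's
# score divided by its own CTR (a reserve price for the last slot).
for (name, bid, ctr), nxt in zip(ranked, ranked[1:] + [None]):
    price = (nxt[1] * nxt[2]) / ctr if nxt else 0.10
    print(name, "pays", round(min(price, bid), 2), "per click")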

Usage data!

  • Visited pages

  • Clicked banners

  • Web searches

  • Clicks on search results

Dictionary problem

A new game

  • Similar to web searching, but: the Ad-DB is smaller, Ad-items are small pages, and ranking depends on clicks

  • For advertisers:

    • What words to buy, how much to pay

    • SPAM is an economic activity

  • For search engine owners:

    • How to price the words

    • Find the right Ad

    • Keyword suggestion, geo-coding, business control, language restriction, proper Ad display
