Internet search engines: Fluctuations in document accessibility

Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit Brussel, and Universitaire Instelling Antwerpen, Belgium Hanneke Smulders Infomare Consultancy, The Netherlands

Internet search engines: Fluctuations in document accessibility

  • Wouter Mettrop

    CWI, Amsterdam, The Netherlands

  • Paul Nieuwenhuysen

    Vrije Universiteit Brussel, and Universitaire Instelling Antwerpen, Belgium

  • Hanneke Smulders

    Infomare Consultancy, The Netherlands

    http://www.cwi.nl/cwi/projects/IRT

    Presented at Internet Librarian International 2000 in London, England, March 2000


WWW: growing number of WWW servers



Internet based information sources: how many? how much?

In 2000:

  • about 1 billion = 1000 million unique URLs in the total Internet

  • about 10 terabytes (= 10 000 gigabytes) of text data


Internet information retrieval systems in 2000

  • Several types of systems exist to retrieve information:

    • Directories of selected sources categorised by subject, made by humans, mainly for browsing.

    • Search systems, based on databases with machine made indexes, for word-based searching!

    • “Meta-search” or “multi-threaded” search systems.

  • We have studied and compared several well-known international (and a few national) word-based Internet search engines.


Internet information retrieval systems: evaluation criteria

  • Many aspects/criteria can be considered in the evaluation of an Internet search engine, including

    • coverage of documents present on WWW (studies exist)

    • number of elements of a document that are indexed to make them usable for retrieval

    • fluctuations over time in the result sets offered by a search engine

  • We set out to study the depth of indexing and were soon confronted with fluctuations in the performance of the engines.


Internet information retrieval systems: our research group

The following persons have been involved in the research:

  • Louise Beijer (Hogeschool van Amsterdam, The Netherlands)

  • Hans de Bruin (Unilever Research Laboratorium, Vlaardingen, The Netherlands)

  • Hans de Man (JdM Documentaire Informatie, Vlaardingen, The Netherlands)

  • Rudy Dokter (PNO Consultants, Hengelo, The Netherlands)

  • Marten Hofstede (Rijksuniversiteit Leiden, The Netherlands)

  • Wouter Mettrop (CWI, Amsterdam, The Netherlands)

  • Paul Nieuwenhuysen (Vrije Universiteit Brussel, Belgium)

  • Eric Sieverts (Hogeschool van Amsterdam, and RUU, The Netherlands)

  • Hanneke Smulders (Infomare, Terneuzen, The Netherlands)

  • Hans van der Laan (Consultant, Leiderdorp, The Netherlands)

  • Ditmer Weertman (ADLIB, Utrecht, The Netherlands)


Internet search engines: research on indexing functionality

  • assessing the indexing functionality

    • test document

    • test method

  • conclusions concerning indexing functionality


Number of our test documents that were retrieved


Internet search engines: elements of test document studied

  • title tag

  • META-tags: keywords, description and author

  • comment tag

  • ALT tag

  • text/URL of a link to a document

  • H3 tag

  • table header

  • text of: an internal link, a reference anchor, a link to a sound file

  • name of a sound file (au/wav/aiff/ra)

  • text of a link to an image

  • name of an image file (gif or jpg; inline or linked to)

  • name of a Java applet (with or without extension class)

  • terms after the first 100 lines in a document (200/…/700)

  • the URL of a document


Internet search engines: part of the test document source code

<HTML> <HEAD>
<TITLE>Test pagina</TITLE>
<META NAME="keywords" CONTENT="een, twee, drie">
<META NAME="description" CONTENT="This test page, containig a small part of the Secret Garden (by Frances Hodgson Burnett) is part of a larger site about the IRT project. vier, vijf, zes">
<META NAME="Subject" CONTENT="zeven">
<META NAME="Subject" CONTENT="acht">
<META NAME="Subject" CONTENT="negen">
<META NAME="Title" CONTENT="tien hoofdstukken uit The Secret Garden">
<META NAME="Title:Subtitle" content="elf">
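Each META element in the test document carries its own distinct Dutch terms, so a hit on one term points to one specific element. As an illustration only, the following Python sketch lists the name/content pairs of a shortened copy of the fragment using the standard library's html.parser. The class name MetaCollector and the shortened snippet are our own assumptions and were not part of the study.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects (name, content) pairs from META tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.meta = []

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names; values are kept as written.
        if tag == "meta":
            attributes = dict(attrs)
            self.meta.append((attributes.get("name"), attributes.get("content")))


# Shortened copy of the test document fragment, for illustration.
snippet = """<HTML> <HEAD>
<TITLE>Test pagina</TITLE>
<META NAME="keywords" CONTENT="een, twee, drie">
<META NAME="Subject" CONTENT="zeven">
<META NAME="Title:Subtitle" content="elf">
"""

collector = MetaCollector()
collector.feed(snippet)
collector.close()

for name, content in collector.meta:
    print(f"{name}: {content}")
```

A search for "zeven" that retrieves the page would then suggest that the engine indexes the Subject META-tag.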


Number of the studied document elements that were indexed


Internet search engines: reachability

  • 14 528 queries sent to 13 search engines

  • 721 times unreachable

  • The percentage of unreachability varies from nearly 0% to nearly 15%.

  • The studied search engines were reachable for 95% of the queries.
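The overall 95% follows directly from the two counts above. A quick check in Python (the variable names are ours):

```python
# Counts taken from the reachability slide.
queries_sent = 14528
times_unreachable = 721

# Share of queries for which the engine could not be reached.
unreachable_rate = times_unreachable / queries_sent
reachable_rate = 1 - unreachable_rate

print(f"unreachable: {unreachable_rate:.1%}")  # unreachable: 5.0%
print(f"reachable:   {reachable_rate:.1%}")    # reachable:   95.0%
```

This is the average over all 13 engines; per engine the unreachable share ranged from nearly 0% to nearly 15%.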


Search engine indexing functionality: conclusions

  • Not “all of the web” is indexed.

    • Not all of our test documents.

    • Not all HTML elements of our test document.

  • Some of the studied search engines showed changes in the indexing policy.

  • No relation between the number of indexed test documents or HTML elements and the size of a search engine was found during our study.


Internet search engines: fluctuations - definition

  • A fluctuation appears when the result set of an observation misses documents with respect to a frame of reference.

    • Observation: one query or a set of queries.

    • Frame of reference: other observations and knowledge about Web reality.


Internet search engines: detecting fluctuations

  • Through time: comparing result sets of one observation, repeatedly performed

    • Observation = one query or set of queries

    • Frame of reference = other observations & web-knowledge

  • One moment: consistency of result sets

    • Observation = one query in set of queries

    • Frame of reference = other observations
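The "through time" comparison can be sketched as a set difference. The following Python fragment is a minimal illustration, not the procedure used in the study. It assumes that each observation is stored as a set of document identifiers and that knowledge about the Web is available as a set of documents known to be still online. The function name and the sample data are hypothetical.

```python
from typing import Dict, List, Set


def missed_documents(observations: List[Set[str]], known_on_web: Set[str]) -> Dict[int, Set[str]]:
    """For each repeated observation, return the documents it missed."""
    # Frame of reference: everything returned by any observation,
    # restricted to documents known to still exist on the Web.
    reference = set().union(*observations) & known_on_web
    return {index: reference - result for index, result in enumerate(observations)}


# Hypothetical result sets of the same query, repeated three times.
runs = [
    {"doc1", "doc2", "doc3"},
    {"doc1", "doc3"},          # doc2 is missing here
    {"doc1", "doc2", "doc3"},
]
still_online = {"doc1", "doc2", "doc3"}

missed = missed_documents(runs, still_online)
print(missed[1])  # {'doc2'}
```

Restricting the reference to documents that are still online keeps genuine deletions from being counted as fluctuations.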


Internet search engines: types of fluctuations

  • Through time: comparing result sets of one observation repeatedly performed

    • “Document fluctuations”

    • “Indexing fluctuations”

  • One moment: consistency of result sets

    • “Element fluctuations”
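For the "one moment" case, the frame of reference consists only of the other queries in the same set. A minimal Python sketch, under the assumption that every query in the set targets a different element of the same test documents and that the engine is known to index all of those elements, so each query should return the same documents. The function name, query labels and document identifiers are hypothetical.

```python
from typing import Dict, Set


def inconsistent_queries(results: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return, per query, the documents that other queries found but this one missed."""
    # Frame of reference: the union of all result sets obtained at the same moment.
    reference = set().union(*results.values())
    all_gaps = {query: reference - found for query, found in results.items()}
    # Keep only the queries that actually missed something.
    return {query: missing for query, missing in all_gaps.items() if missing}


# Hypothetical results of three queries sent at the same moment.
results = {
    "title term": {"a", "b"},
    "h3 term": {"a"},        # document b is missed via its H3 element
    "alt term": {"a", "b"},
}

gaps = inconsistent_queries(results)
print(gaps)  # {'h3 term': {'b'}}
```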


Document fluctuations: example 1


Document fluctuations: example 2


Document fluctuations: experimental results


Indexing fluctuations: experimental results


Element fluctuations: example


Element fluctuations: experimental results


Percentage of documents missed due to fluctuations


Internet search engines: fluctuations - quantitative conclusions

  • Many element fluctuations many document and indexing fluctuations and many document elements indexed

  • Many document fluctuations not always many element fluctuations

  • Few document elements indexed few element fluctuations


Fluctuations: remarks on “correctness”

  • Fluctuations can be seen as “correct”, if they are reflections of alterations in:

    • (web-) reality

      • then document, indexing and element fluctuations are incorrect

    • the indexed database of a search engine

      • then only element fluctuations are incorrect

  • Users do not care; they miss documents


Fluctuations: remarks on “size”

  • No relation was found between document / element fluctuations and “size”.

  • Percentage missed documents determines (with other reducing effects, such as depth of indexing) the effective size of an engine
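The slides give no formula for effective size. Purely as an illustration, the Python sketch below assumes a simple multiplicative model in which the nominal size is reduced by the fraction of missed documents and, optionally, by a factor for depth of indexing. The model, the function name and the sample figures are our assumptions, not results of the study.

```python
def effective_size(nominal_size: float, missed_fraction: float, indexed_fraction: float = 1.0) -> float:
    """Estimate effective size under an assumed multiplicative model."""
    if not 0.0 <= missed_fraction <= 1.0:
        raise ValueError("missed_fraction must lie between 0 and 1")
    if not 0.0 <= indexed_fraction <= 1.0:
        raise ValueError("indexed_fraction must lie between 0 and 1")
    # Assumption: the reducing effects are independent and simply multiply.
    return nominal_size * (1.0 - missed_fraction) * indexed_fraction


# Hypothetical engine: 200 million documents claimed, 25% missed due to fluctuations.
print(effective_size(200.0, 0.25))        # 150.0 (million documents)
print(effective_size(200.0, 0.25, 0.5))   # 75.0 when only half of the elements are indexed
```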


Internet search engines: conclusions of our research

  • Search engines differ in depth of indexing.

  • Search engines show fluctuations in their result sets:

    • They are subject to changes in indexing policy (“indexing fluctuations”).

    • They forget documents completely (“document fluctuations”).

    • They miss documents in their result sets (“element fluctuations”).


Internet search engines: recommendations related to fluctuations

  • Fluctuations are “normal”; do not be surprised; do not worry.

  • Do not try to find a simple explanation to fully understand what happens.

  • Known item searchers should repeat the search:

    • when using an engine with many element fluctuations: use other search terms;

    • when using an engine with many document fluctuations: repeat later.

  • Further research on effective size.

