Explore the challenges of finding relevant information in the vastness of the internet, and the role of search engines in indexing and providing access to the web. Learn about the limitations of search engine coverage, biases, and the need for improved timeliness and comprehensiveness.
Accessing digital information: Search engines
I. The problem
• How to find the needle in the haystack
II. How search engines work
III. Building the index
IV. Types of search engines
V. Problems with search engines
The Problem
The WWW contains more than 2.5 billion pages, with 7.3 million pages added each day
The surface Web contains 19 terabytes (trillions of bytes); this is where most of our stuff is
There are 7,500 terabytes hidden in the "deep" Web, largely proprietary information, dynamically generated pages, or pages behind firewalls
http://www.howstuffworks.com/news-item127.htm
Without the URLs of the particular pages you want, you must rely on search engines to uncover potentially relevant pages
The web lacks the bibliographic control standards that we take for granted in the print world
There is no equivalent of the ISBN to uniquely identify a document
There is no standard system of cataloguing or classification analogous to those developed by the Library of Congress
There is no central catalogue of the Web's holdings
Many documents lack the name of the author and the date of publication
Updating? Version control? Not likely!
The net is not a digital library
It was not designed to support the organized publication and retrieval of information
It has become a chaotic repository for the output of hundreds of thousands of digital publishers
It is filled with an amazing variety of digital artifacts; the ephemeral mixes everywhere with works of lasting importance
The librarian's classification and selection skills must be complemented by the computer scientist's ability to automate the task of indexing, storing, and providing access to information
Lynch (1999) http://www.sciam.com/0397issue/0397lynch.html
How can we find what we want when we want it?
This is becoming an increasingly important problem for people in the information professions
Strategies:
Follow links from page to page, hoping that you will stumble across the pages that will help you answer the question
Maintain a personal collection of bookmarks
Use search engines
~85% of all web sessions begin with or involve a search
Search engines are a critical web tool
The Web offers a choice of hundreds of different search tools
The problem is that each has its own database, command language, search capabilities, and method of displaying results
Each covers a different portion of the web, with some overlap
This means that you have to learn a variety of search tools and develop effective search techniques to take advantage of Web resources
Search engine coverage relative to the estimated size of the publicly indexable web has decreased substantially since December 1997
No engine indexes more than about 16% of the web
The combined coverage of eleven major search engines is 42% of the Web
Overlap between individual search engines is low, with approximately 40% of a given search engine's content unique
The biggest search engines are currently FAST and Google, with over 600 million pages indexed
http://healthlinks.washington.edu/hsl/liaisons/stanna/navweb/nav2.html
Search engines are typically more likely to index sites that have more links to them (more 'popular' sites)
Google does this
Google interprets a link from page A to page B as a vote, by page A, for page B
But Google looks at more than the sheer volume of votes, or links, a page receives
It also analyzes the page that casts the vote
Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important"
http://www.google.com/technology/index.html
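To make the voting idea concrete, here is a minimal sketch of link-based ranking in the spirit of PageRank. The toy link graph, damping factor, and iteration count are illustrative assumptions, not Google's actual data, parameters, or algorithm.

```python
def rank(links, damping=0.85, iterations=20):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    score = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue  # dangling page: its vote is simply dropped here
            share = damping * score[page] / len(outlinks)  # weight of one vote
            for target in outlinks:
                new[target] += share  # votes from "important" pages count more
        score = new
    return score

web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(rank(web))  # C, the most-voted-for page, accumulates the highest score
```

Because a page's score is built from the scores of the pages linking to it, a vote from an already well-linked page is worth more than a vote from an obscure one, which is exactly the weighting described above.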
Other difficulties with search engine coverage
Search engines are more likely to index US sites than non-US sites (AltaVista is an exception), and more likely to index .com sites than .edu sites
Indexing of new or modified pages by just one of the major search engines can take months
The pages that one engine indexes do not overlap extensively with other engines' databases
Lawrence and Giles (1999) http://www.wwwmetrics.com/
85% of users use search engines to find information (GVU survey)
We use search engines to locate and buy goods and to research many decisions
Search engines are currently lacking in timeliness and comprehensiveness, and they do not index sites equally
The current state of search engines can be compared to a phone book which is:
Updated irregularly
Biased toward listing more popular information
Missing many pages
Filled with duplicate listings
Search engine indexing and ranking may have economic, social, political, and scientific effects
Indexing and ranking of online stores can substantially affect their economic viability
Some search engines charge for placement, and companies are willing to pay
Delayed indexing of scientific research can lead to the duplication of work
Delayed or biased indexing may affect social or political decisions
What do search engines do?
They attempt to index and provide access to the "relevant web"
This is defined differently by different engines; it ranges from brute-force indexing to the use of algorithms to gauge relevance and popularity
Search engines/tools have four components:
The collection of entries for their databases
The structure of their database
The search process
The interface
Data collection is done by:
Humans, who review and index in the employ of the search engine company (Yahoo!)
Humans, by self-submission
Software
Software collection agents include automated robot wanderers, spiders, harvesters, bots, and crawlers
They roam the internet (mostly www, gopher, and ftp sites) and bring back copies of resources
This actually means systematically downloading pages and following links, as in the sketch below
The indexing software then sorts the copies, indexes them, and creates database entries from them
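A minimal sketch of that download-and-follow-links loop, assuming a seed URL and a small page limit for illustration; a real spider would also honor robots.txt, throttle its requests, and parse HTML properly rather than with a regular expression.

```python
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def crawl(seed, max_pages=10):
    """Breadth-first crawl: download pages, extract links, follow them."""
    seen, queue, store = {seed}, deque([seed]), {}
    while queue and len(store) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except OSError:
            continue  # dead or unreachable link; skip it
        store[url] = html  # "bring back a copy" for the indexer
        for href in re.findall(r'href="([^"]+)"', html):
            link = urljoin(url, href)  # resolve relative links
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return store
```

The dictionary of fetched pages is what the indexing software then sorts and turns into database entries.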
The search component concerns the end user
It involves the interface between the human searcher and the indexed database of resources
Several factors determine the success of a search engine:
The size of the database
The content and coverage of the database
The currency of the entries and the frequency of updating
The elimination of redundancy and dead links
The speed of searching
The availability of advanced search features
The interface design and ease of use
Search engines provide "electronic egalitarianism"
Indexing and cataloguing tools are highly democratic
They categorize information differently than human indexers do
Machine-based approaches to information gathering, organization, and retrieval provide uniform and equal access to all the information on the Net
This is a source of one of our problems with search engines
We type in a search request and receive thousands of URLs in response
These results frequently contain references to irrelevant Web sites while leaving out others that hold important material
Accessing digital information: Search engines
I. The problem
• How to find the needle in the haystack
II. How search engines work
III. Building the index
IV. Types of search engines
V. Problems with search engines
How search engines work
Many search engines use two interdependent approaches:
Browsing through subject trees and hierarchies
Keyword searching of an extensive database
A subject tree provides a structured and organized hierarchy of categories for browsing for information by subject
Under each category and/or sub-category, links to appropriate Web pages are listed
Web pages are assigned categories either by the author or by subject tree administrators
Many subject trees also have their own keyword-searchable indexes
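To make the structure concrete, here is a toy subject tree sketched as nested dictionaries with link lists at the leaves; the category names and URLs are invented for illustration.

```python
# A toy subject tree: nested categories ending in lists of page links.
subject_tree = {
    "Science": {
        "Physics": ["http://example.edu/mechanics"],
        "Biology": ["http://example.edu/genetics"],
    },
    "Arts": {
        "Music": ["http://example.org/jazz"],
    },
}

def browse(tree, path):
    """Follow category names down the hierarchy to a sub-tree or link list."""
    node = tree
    for category in path:
        node = node[category]
    return node

print(browse(subject_tree, ["Science", "Physics"]))
# ['http://example.edu/mechanics']
```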
Search tools with elaborate subject trees present links with brief annotations
Examples include Yahoo, Galaxy, and the WWW Virtual Library
Search engines allow keyword searching of indexes
These are automatically compiled by robots and spiders, which are constantly collecting net resources
Searchers enter keywords to query the index
Some allow Boolean operators and other advanced features
Web pages and other Internet resources that satisfy the query are identified and listed
Search engines compete on the:
Size of their indexes
Frequency of updating the index
Range of advanced search options
Speed of returning a result set
Result set presentation
Relevance of the items included in a result set
Design of the interface
Overall ease of use
Range of additional services offered
Claimed size and "obscure search" test results

Engine          Size (millions)  Expected score  Actual score  Rank
Google          560              1.0             1.0           1
FAST            340              2.0             1.8           2
Northern Light  265              3.0             2.3           3
HotBot          110              4.0             2.3           3
iWon            110              4.0             2.3           3
AltaVista       350              2.0             2.5           4
Yahoo-Google    560              1.0             3.0           5
Excite          250              3.0             3.0           5
Yahoo-Inktomi   110              4.0             4.3           6

Data from: July 2002 Search Engine Showdown
http://www.searchenginewatch.com/sereport/00/07-sizetest.html
How big are they?
[Chart comparing the index sizes of Google, FAST, AltaVista, Inktomi, and Northern Light. Source: SearchEngineWatch, 7/02, http://www.searchenginewatch.com/reports/sizes.html]
Recent activity (indexing)
[Chart of recent indexing activity for Google, FAST, AltaVista, Inktomi, and Northern Light. Source: SearchEngineWatch, 7/02, http://www.searchenginewatch.com/reports/sizes.html]
Search engines are powered by robots, indexing software, and "ontologists" who classify, sort, and arrange the Web into a searchable matrix
The most popular search engines are always among the most visited sites on the Net
Competition is high for the advertising dollars that keep these search tools free of charge
Despite their similar approaches to scanning the Internet, search engines don't always turn up the same results
Depending on the type of search being conducted, one engine might give you more satisfactory results than another
Three methods for indexing web resources:
Full-text index
Includes all terms and URLs
Uses filters to remove words not important to searching
Keyword index
Based on the location and frequency of words and phrases
If a term is mentioned only once or twice, it won't be indexed
Human index
Created by individuals who review pages and select the best words and phrases to describe their content
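A minimal sketch combining the two automated methods: an inverted index with a stop-word filter (the full-text approach) and a frequency threshold that drops terms mentioned only once or twice (the keyword approach). The stop-word list and threshold value are illustrative assumptions.

```python
from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}  # assumed filter list

def build_index(pages, min_count=3):
    """pages maps URL -> page text; returns term -> set of matching URLs.
    min_count=3 encodes the rule that a term mentioned only once or
    twice is not indexed."""
    index = defaultdict(set)
    for url, text in pages.items():
        counts = Counter(w for w in text.lower().split()
                         if w not in STOP_WORDS)
        for term, n in counts.items():
            if n >= min_count:  # skip rarely-mentioned terms
                index[term].add(url)
    return index
```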
Engines use index searches, concept searches, or browsing
Index searching
Many search engines use this method because it casts a wider net than a catalog does
Results come from a dynamic index of pages and use an algorithm to sort documents and determine relevance
For instance, the number of times a keyword appears, as well as its proximity to the top of the document
They don't recognize context, synonyms, or homonyms
Searching "beat" returns Ginsberg and Burroughs but also pages on metronomes, raves, and gingersnaps
There are also problems of redundancy and dead links
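A minimal sketch of that ranking heuristic: score a page by how often the keyword appears and how close its first occurrence is to the top. The exact weighting is an illustrative assumption; real engines combine many more signals.

```python
def relevance(keyword, text):
    """Score a document by keyword frequency and proximity to the top."""
    words = text.lower().split()
    hits = [i for i, w in enumerate(words) if w == keyword.lower()]
    if not hits:
        return 0.0
    frequency = len(hits) / len(words)   # how often the keyword appears
    proximity = 1.0 / (1 + hits[0])      # earlier first occurrence scores higher
    return frequency + proximity

# Note the homonym problem: both documents match "beat" purely literally.
docs = {"poetry": "ginsberg read his poems to a beat",
        "music": "set the metronome to beat four beats per bar"}
print(sorted(docs, key=lambda d: relevance("beat", docs[d]), reverse=True))
```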
Concept searching
With this type of search, your search term is treated as a concept rather than a keyword
If you type a word in the search box, you search for that word, other forms of the word, and synonyms
The search also includes other words that are highly statistically related to that word
A concept search looks for ideas related to a literal query
Excite uses this strategy
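A minimal sketch of concept-style query expansion, using a hand-made related-terms table as a stand-in for the word forms, synonyms, and co-occurrence statistics a real engine would compute.

```python
# Hypothetical related-terms table; a real engine derives this statistically.
RELATED = {
    "car": ["cars", "automobile", "vehicle", "sedan"],
    "doctor": ["doctors", "physician", "medical"],
}

def expand(query):
    """Turn a literal query into the set of terms actually searched."""
    terms = set()
    for word in query.lower().split():
        terms.add(word)
        terms.update(RELATED.get(word, []))
    return terms

print(expand("car repair"))
# {'car', 'cars', 'automobile', 'vehicle', 'sedan', 'repair'}
```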
Browsing services exist in great numbers on the net
These are systematically grouped hotlists, starting points, or systematic lists of interesting resources
These pages are typically smaller and well-maintained
The browsing structures typically do not use a controlled system of knowledge structuring or an established classification system
Selection, classification, and description of the resources are made by the list owner using idiosyncratic criteria
Browsing systems covering rapidly changing areas are more difficult to maintain because they often don't have automatic mechanisms for rapid and continual updating
Accessing digital information: Search engines
I. The problem
• How to find the needle in the haystack
II. How search engines work
III. Building the index
IV. Types of search engines
V. Problems with search engines
What are the different types of search engines?
Single, niche, and multiple-threaded search engines
Single search engines
These engines operate alone; your query is run against a single database and/or index
A directory search tool allows searches by subject matter
It is a hierarchical search that starts with a general subject heading and follows with more specific sub-headings
The information is reviewed and indexed by humans; however, the number of reviews is limited
Yahoo is an example
Niche search engines
These engines are like single search engines, except that they cover a restricted subset of resources
Examples might include engines for business, engineering, physics, or government information
A very restricted version of a niche engine only allows you to search that site
Northern Light is a good example
Multiple-threaded search engines
These are also called meta-search engines
These engines submit your query to two or more search engines simultaneously
They gather and display the results as a single page
These engines compete on the basis of the number and variety of engines they allow you to search
These engines are becoming more popular
One problem is the amount of redundancy in the returns
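A minimal sketch of the fan-out-and-merge idea, with two stand-in engine functions (not real APIs) and simple URL de-duplication to address the redundancy problem.

```python
def engine_a(query):
    return ["http://x.com/1", "http://y.com/2"]   # stand-in result list

def engine_b(query):
    return ["http://y.com/2", "http://z.com/3"]   # note the overlap

def metasearch(query, engines):
    """Fan the query out to each back-end engine and merge the results."""
    seen, merged = set(), []
    for engine in engines:
        for url in engine(query):
            if url not in seen:  # drop duplicate returns across engines
                seen.add(url)
                merged.append(url)
    return merged

print(metasearch("digital libraries", [engine_a, engine_b]))
# ['http://x.com/1', 'http://y.com/2', 'http://z.com/3']
```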
Accessing digital information: Search engines
I. The problem
• How to find the needle in the haystack
II. How search engines work
III. Building the index
IV. Types of search engines
V. Problems with search engines
What are the problems with search engines?
There are weaknesses and problems common to all attempts to index the Internet
These matter even more than the limitations of any single search service
The theoretical problem of indexing virtual hypertext:
It is not economical, and not even possible, to index all information on the Internet in "full text"
It is necessary to define the limits of documents and information units in order to allow target-oriented access while searching
In comparison with the world of printed information, this involves considerable difficulties
The information units are considerably smaller and less well defined
"Containers" like a book, a series, a journal title, or an issue do not occur often
The information units range in size from a whole server or service to single text strings or icons
The mix of different types of information on the net makes uniform and homogeneous indexing and searching impossible
Document types include: directories, lists, menus, full text of everyday electronic mail, scientific articles and books, field-structured database records, software, audio, video, images, and numerical information
Considering the great number of authors on the net and their differing abilities, the quality of input into the search services varies a great deal
Often it is so poor that the search results are seriously degraded
There is incorrect, uncontrolled HTML coding and incomplete use of important content-describing metadata like titles or keywords
Incorrect functional text markup, and its abuse for layout purposes, occurs, as does the reverse: the abuse of layout markup as functional characterization
Other problems include:
Terminological weaknesses, incorrect formulation of titles and headings, and ambiguity
Inability to distinguish between permanent and temporary documents
Problems with harvesting methods, indexing programs, IR methods, and user interfaces
Performance problems
Dead links on search engines

Search engine   % Dead links  % 400 errors
AltaVista       13.7%         9.3%
Excite          8.7%          5.7%
Northern Light  5.7%          2.0%
Google          4.3%          3.3%
HotBot          2.3%          2.0%
FAST            2.3%          1.8%
MSN Inktomi     1.7%          1.0%
Anzwers         1.3%          0.07%

Data from: Aug. 14, 2001 Search Engine Showdown
http://www.notess.com/search/stats/dead.shtml