Theorie van databanken en zoekmachines Theory of databases and search engines

Paul Nieuwenhuysen • Vrije Universiteit Brussel • Information and Library Science, University of Antwerp, Belgium. Slides used to support this presentation at the one-day workshop “Technisch-wetenschappelijke databanken” (“Databases for Technology and Science”), organised by KVIV, 27 November 2003, Katholieke Universiteit Leuven, Heverlee-Leuven, Belgium

Contents, summary, structure and overview of this lecture • Databases and computerised information retrieval • Knowledge organisation: classifications and thesaurus systems • Online access information sources and services, including Internet directories and text search engines • Evaluations in information retrieval

Databases and computerized information retrieval

Types of databases: examples The databases that form the basis for • catalogues of books or other types of documents • computerized bibliographies • address directories • a full-text newspaper, newsletter, magazine or journal, and collections of these • WWW and Internet search engines • intranet search engines • ...

Information retrieval: via a database to the user [Diagram elements: information content; linear file; inverted file; database; search engine; search interface; user]

Information retrieval: building a database [Diagram elements: records fed into the database management system; indexing; records derived from the input and stored in the database; inverted file, index, register of the database; retrieval; user]
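The indexing step on this slide can be sketched in a few lines of Python. This is a minimal sketch, not the method of any particular database system: the function name `build_inverted_file` and the three sample records are invented for illustration, and tokenisation is assumed to be simple lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_file(records):
    """Map each word to the sorted list of record numbers that contain it."""
    index = defaultdict(set)
    for record_no, text in records.items():
        # Assumption: a "word" is any whitespace-separated token, lowercased.
        for word in text.lower().split():
            index[word].add(record_no)
    return {word: sorted(numbers) for word, numbers in index.items()}


# Hypothetical sample records (the "linear file").
records = {
    1: "Theory of databases",
    2: "Search engines and databases",
    3: "Theory of search",
}

inverted_file = build_inverted_file(records)
print(inverted_file["databases"])  # [1, 2]
print(inverted_file["theory"])     # [1, 3]
```

Retrieval then consults the inverted file instead of scanning every record, which is why the index is built once when records are fed in.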

Information retrieval: the basic processes in search systems [Diagram elements: information problem; representation; query; text documents; representation; indexed documents; comparison; retrieved, sorted documents; evaluation and feedback]
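As a rough illustration of the representation, comparison and sorting steps named on this slide, the sketch below represents the query and each document as a set of words and ranks documents by the number of shared words. The scoring rule, the function names and the sample documents are assumptions made for this example; real systems use more refined representations and comparison measures.

```python
def represent(text):
    """Representation step: reduce a text to a set of lowercased words."""
    return set(text.lower().split())


def retrieve(query, documents):
    """Comparison step: score by word overlap, then sort the matches."""
    query_words = represent(query)
    scored = []
    for doc_id, text in documents.items():
        score = len(query_words & represent(text))
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties broken by document id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


# Hypothetical documents.
documents = {
    "a": "database search engines",
    "b": "thesaurus systems",
    "c": "search interface",
}

print(retrieve("database search", documents))  # ['a', 'c']
```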

Information retrieval systems: many components make up a system • Any retrieval system is built up of many more or less independent components. • These components can be modified to increase the quality of the results more or less independently.

Information retrieval systems: important components • the information content • system to describe formal aspects of information items • system to describe the subjects of information items • concrete descriptions of information items = application of the used information description systems • information storage and retrieval computer program(s) • computer system used for retrieval • type of medium or information carrier used for distribution

What determines the results of a search in a retrieval system? • the information retrieval system (= contents + system) • the user of the retrieval system and the search strategy applied to the system [Diagram: both lead to the result of a search]

Layered structure of a database • Database (file) • Records • Fields • Characters + in many systems: relations / links between records

A simple database architecture: all records together form a database The ‘salami architecture’ = ‘sliced bread architecture’ • the salami or the bread is a “database” • each slice of salami or bread is a “database record” • there are no relations between slices / records • the retrieval system tries to offer the appropriate slices / records to the user

Structure of a bibliographic file • Record No. 1: Title; Author 1: name + first name; Author 2: ...; Source; Descriptor 1; Descriptor 2; ... • Record No. 2 [Diagram labels: sub-fields; repeated fields]
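One way to picture this record structure in code is sketched below, assuming that "repeated fields" refers to the authors and descriptors and that "sub-fields" refers to the name and first name inside an author field. The class names and the sample values are placeholders invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Author:
    # Sub-fields of one author field.
    name: str
    first_name: str


@dataclass
class BibliographicRecord:
    record_no: int
    title: str
    source: str
    # Repeated fields: a record can hold any number of each.
    authors: list[Author] = field(default_factory=list)
    descriptors: list[str] = field(default_factory=list)


# Placeholder content, not taken from a real bibliography.
record = BibliographicRecord(
    record_no=1,
    title="An example title",
    source="An example journal",
    authors=[Author("Doe", "Jane"), Author("Roe", "Richard")],
    descriptors=["databases", "information retrieval"],
)

print(len(record.authors))        # 2
print(record.authors[0].name)     # Doe
```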

Text retrieval and language: an overview Problems related to language / terminology occur 1. even when the same language is used in searching and in the searched databases 2. in the case of “multi-linguality” (“cross-language information retrieval”), that is, when more than one language is used • in the search terms • in the contents of the searched database(s) and/or in the subject descriptors of the searched database(s)

Text retrieval and language: enhancing retrieval • Retrieval can be enhanced by coping with the problems caused by the use of natural language. • Contributions to this enhancement of retrieval can be made by • the database producer • the computerized retrieval system • the searcher/user • (The distinction between these is not very sharp and clear in all cases.)

Text retrieval and language: a word is not a concept Problem: A word or phrase or term is not the same as a concept or subject or topic. [Diagram: two words linked to one concept]

Text retrieval and language: a word is not a concept So, to ‘cover’ a concept in a search, to increase the recall of a search, the user of a retrieval system should consider an expansion of the query; that is: the user should also include other words in the query to ‘cover’ the concept.

Text retrieval and language: ambiguity of meaning Problem: Ambiguity of meaning may be the cause of low precision. [Diagram: one word linked to two concepts]

Text retrieval and language: relation with recall and precision Recapitulating the two problems discussed, we can say that • Expanding the query can increase the recall. • Disambiguating the query can increase the precision.
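The two measures can be made concrete with a small calculation. The sketch below uses the standard definitions (recall = relevant retrieved / all relevant; precision = relevant retrieved / all retrieved); the record numbers are invented, and the function name is a choice made for this example.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for one search result."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical situation: records 1-4 are relevant to the information problem.
relevant = {1, 2, 3, 4}

# A narrow query finds two relevant records and nothing else.
print(recall_and_precision({1, 2}, relevant))        # (0.5, 1.0)

# An expanded query finds one more relevant record, plus one irrelevant one.
print(recall_and_precision({1, 2, 3, 7}, relevant))  # (0.75, 0.75)
```

In this made-up case the expansion raises recall from 0.5 to 0.75 while precision drops from 1.0 to 0.75, which is the kind of trade-off that disambiguation tries to counter.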

Text retrieval and language: phrases composed of words • Problem: Most retrieval systems can search for words, but they do not directly recognize or ‘know’ phrases / terms composed of more than one word.
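One common way for a system to support phrases, offered here only as an illustrative sketch and not as something described on the slide, is to store word positions in the index and to require that the words of the phrase occur next to each other. The function names and sample records are hypothetical.

```python
from collections import defaultdict


def build_positional_index(records):
    """Map word -> record number -> list of positions of that word."""
    index = defaultdict(lambda: defaultdict(list))
    for record_no, text in records.items():
        for position, word in enumerate(text.lower().split()):
            index[word][record_no].append(position)
    return index


def phrase_search(index, phrase):
    """Return record numbers in which the words occur consecutively."""
    words = phrase.lower().split()
    if not words:
        return []
    matches = []
    for record_no, starts in index.get(words[0], {}).items():
        for start in starts:
            # Every following word must sit exactly one position further on.
            if all(
                start + offset in index.get(word, {}).get(record_no, [])
                for offset, word in enumerate(words[1:], start=1)
            ):
                matches.append(record_no)
                break
    return sorted(matches)


records = {
    1: "information retrieval systems",
    2: "retrieval of information",
    3: "systems for information retrieval",
}
index = build_positional_index(records)
print(phrase_search(index, "information retrieval"))  # [1, 3]
```

Record 2 contains both words but not as a phrase, so a plain word search would retrieve it while the phrase search does not.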

Enhanced text retrieval using natural language processing [Diagram elements: information problem; representation; query; text documents; representation; indexed documents; natural language processing of the documents AND of the query; comparison and matching of both; retrieved, sorted documents; evaluation and feedback]

Text retrieval and language: conclusions • The use of terms and language to retrieve information from databases/collections/corpora causes many problems. • Many users of search/retrieval systems do not recognize these problems or underestimate them; in other words, many users overestimate the power of retrieval systems. • Much research and development is still needed to enhance text retrieval.

Hints on how to use information sources: Boolean combinations Most text search systems understand the basic Boolean operators: • OR = obtain records that contain one or both search terms • AND = obtain records that contain both search terms • NOT = exclude records that contain a search term

Hints on how to use information sources: Boolean combinations In the case of computer-based information sources, use Boolean combinations of search terms when appropriate and when possible. (term x1 OR term x2 OR term x3) AND (term y1 OR term y2 OR term y3) AND (term z1 OR term z2 OR term z3) AND ...
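The Boolean operators map directly onto set operations on record numbers: OR is union, AND is intersection, NOT is difference. The sketch below assumes a small, invented inverted file and hypothetical helper names; it follows the pattern of OR within a concept and AND between concepts.

```python
def search_term(index, term):
    """Records that contain a single term (empty set if the term is unknown)."""
    return set(index.get(term, ()))


def any_of(index, terms):
    """OR: records that contain at least one of the terms."""
    result = set()
    for term in terms:
        result |= search_term(index, term)
    return result


# Hypothetical inverted file: term -> record numbers.
index = {
    "car": {1, 2},
    "automobile": {3},
    "safety": {2, 3, 4},
    "toy": {2},
}

concept_x = any_of(index, ["car", "automobile"])   # x1 OR x2
concept_y = any_of(index, ["safety"])              # y1
both = concept_x & concept_y                       # (x1 OR x2) AND y1
without_toys = both - search_term(index, "toy")    # ... NOT toy

print(sorted(concept_x))     # [1, 2, 3]
print(sorted(both))          # [2, 3]
print(sorted(without_toys))  # [3]
```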

Knowledge organisation: classifications and thesaurus systems

Knowledge organisation: introduction • To organise knowledge / documents / books / reports / information / data / records / things / items / materials for more efficient storage and retrieval, some related, similar tools / systems / methods / approaches are used. • Often but not yet always, this process is assisted by a computer system. • Good systems are expanded and updated when the need arises.

Knowledge organisation: some tools • Various related tools / systems / methods / approaches are available: • Classification • Taxonomy • Controlled list of selected keywords • Thesaurus • Ontology • Subject-related metadata • …

Thesaurus: description • Thesaurus (contents) = • system to control a vocabulary (= words and phrases + their relations) • + the contents of this vocabulary • Thesaurus program = program to create, manage, modify and/or search a thesaurus using a computer

Thesaurus relations • BT (= Broader Term): term(s) with broader meaning • NT (= Narrower Term): term(s) with narrower meaning • RT (= Related Term): other term(s) • UF (= Use(d) For): synonym(s)

Thesaurus applications related to information searching • For users (!) of a database: When the database to be searched is produced with added descriptors (words and terms) taken from a controlled list of approved, selected words and terms, the searcher can first use a printed or computer-based system to find more, and more suitable, ‘correct’ words and terms that belong to that controlled list of descriptors; the searcher can then use these descriptors (and only these words or terms) in a database query.

Thesaurus applications related to information searching • For users (!) of a database: When the database to be searched is NOT produced with added descriptors (words and terms) that are taken from a controlled list of words and terms, then the searcher can use one or several thesaurus systems first, to find more words and terms and more suitable words and terms; then the searcher can use these found words and terms to formulate a query for that database (to increase recall and precision).
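A thesaurus with BT / NT / RT / UF relations can be held in a simple nested mapping, and looking up a term then yields candidate words for the query. This is a minimal sketch under stated assumptions: the single thesaurus entry is invented, the function name `expand_term` is hypothetical, and the choice to expand with UF and NT terms by default is just one plausible policy.

```python
# Hypothetical thesaurus entry; real thesauri are far larger.
thesaurus = {
    "car": {
        "BT": ["vehicle"],     # broader term
        "NT": ["taxi"],        # narrower term
        "RT": ["road"],        # related term
        "UF": ["automobile"],  # synonym: "car" is used for "automobile"
    },
}


def expand_term(term, thesaurus, relations=("UF", "NT")):
    """Return the term plus the terms linked to it by the chosen relations."""
    entry = thesaurus.get(term, {})
    expanded = [term]
    for relation in relations:
        for other in entry.get(relation, []):
            if other not in expanded:
                expanded.append(other)
    return expanded


print(expand_term("car", thesaurus))
# ['car', 'automobile', 'taxi']
print(expand_term("car", thesaurus, relations=("BT",)))
# ['car', 'vehicle']
```

The expanded list would typically be combined with OR in the query; adding broader terms tends to raise recall at the cost of precision, so the searcher chooses the relations to follow.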

Thesaurus systems that cover all subjects • General systems • Universal systems • Covering all subjects • Broad and shallow systems • Horizontal systems

Thesaurus systems that cover all subjects: examples (1) • Library of Congress Subject Headings (LCSH) • thesaurus system built into word processing software • thesaurus system that runs on a PC (independent of the Internet); see for instance http://www.wordweb.co.uk/free/ • thesaurus systems that can be used free of charge through the WWW • http://education.yahoo.com/reference/thesaurus/index.html • http://thesaurus.plumbdesign.com/

Thesaurus systems focused on a particular subject • Focused on a particular subject domain = narrow and deep, vertical systems • Examples: • INSPEC: physics, electronics, information technology • Medline (the Medical Subject Headings = MeSH) • ...

Online access information sources and services

Internet: subject-oriented meta-information offered via WWW Information about information sources: in the form of • subject guides = texts with references • subject hypertext directories = subject guides • key word indexes, generated automatically, for searching • collections of links or forms to the above • (multi-threaded search systems)

Internet global subject directories: introduction • They are virtual libraries with open shelves, for browsing. • They are manually generated, man-made by many people. • They can be browsed following a tree structure or a more complicated variation. • The most famous of these systems belong to the most popular and most visited sites on the WWW: e.g. Yahoo!

Internet global subject directories: pros and cons • They cover a small number of selected WWW sites, in comparison with the total number of sites that are accessible.  + The selected, included sites should be better than average. - They are not suitable for deep, detailed, specific searches with a high coverage.

Internet global subject directories: why use one? • They are suitable mainly for broad searches that can be difficult to formulate in words, but NOT for more specific searches that require combinations of several concepts.

Internet global subject directories: searching directories with a query • Many of the Internet directories include an index to search their contents with a query. • However, the assisting classification structure is then not well exploited, and the user should be aware of the problems and difficulties of information retrieval with natural language queries. • Furthermore, the possibility to use the system in this way may be confusing, as these directories are not real full-text Internet indexes, like those provided by other search tools.

Internet indexes: automated search tools • Several systems make it possible to search for and locate many items (addressable resources) on the Internet in a more systematic, direct way than by only browsing/navigating. • These systems do NOT search the contents of computers across the live Internet, completely and in real time, when a user makes a query. Searching in that way would be much too slow due to limitations in the technology.

Internet indexes: scheme of the mechanism [Diagram elements: user searching for Internet-based information; Internet client hardware and software; user interface to a search engine; Internet information source; Internet index search engine; Internet crawler and indexing system; database of Internet files, including an index]

Internet indexes: description of the mechanism Each of these search systems is based on: • a database of links to pages / URLs that can be retrieved by searching with queries through a big index that is built automatically on the basis of the contents, the texts, of these pages (to build this database and to keep it up to date, pages are continuously collected from the Internet by a “robot” computer software system) • a search system with a user interface in a WWW form, to allow the user to search through that database
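The two parts described above, a robot that collects pages into an index and a search function that consults only that index, can be sketched without any network access. Everything here is an assumption made for illustration: the toy "web" is an in-memory dictionary with invented example.org URLs, and the function names are hypothetical.

```python
from collections import deque

# Toy stand-in for the Internet: URL -> (page text, outgoing links).
TOY_WEB = {
    "http://example.org/a": ("databases and search engines",
                             ["http://example.org/b"]),
    "http://example.org/b": ("thesaurus systems",
                             ["http://example.org/a", "http://example.org/c"]),
    "http://example.org/c": ("search engines index pages", []),
}


def crawl_and_index(seed, web):
    """Follow links from the seed page and index the words of each page."""
    index = {}
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        text, links = web[url]
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return index


def search(index, word):
    """Answer a query from the stored index only, not from the live pages."""
    return sorted(index.get(word.lower(), set()))


index = crawl_and_index("http://example.org/a", TOY_WEB)
print(search(index, "search"))
# ['http://example.org/a', 'http://example.org/c']
```

Because `search` never touches `TOY_WEB`, the answer reflects the pages as they were when the robot last visited them, which is why such a database has to be refreshed continuously.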

Internet indexes: building their database [Diagram elements: Internet documents fed into the database management system; indexing; inverted file, full-text index, register of the database; records derived from the input and stored in the database; retrieval; user]

Internet indexes: AltaVista • The primary search interface can be found in the US. The following addresses all lead to the same information: • http://www.altavista.com/ • http://www.av.com/ • http://av.com/ • Mirror site in the UK: • http://uk.altavista.com/ • http://www.altavista.co.uk/

Internet indexes: All the Web • The search interface can be found at: http://www.alltheweb.com/ or http://alltheweb.com/ • You can search the WWW and ftp servers. • The database is one of the biggest. • Not only HTML and plain text files, but also the full text of many Adobe PDF files is indexed. • Also offers a module to search for pictures/images. • Offers spelling suggestions in the search interface.

Internet indexes: Google (Part 1) • http://www.google.com/ • Full-text searching is possible for many files that are available through the WWW. • Not only HTML and plain text pages are covered; the first part of many files in other file formats is also indexed, such as • Adobe PDF • Microsoft Word, Microsoft Excel, Microsoft PowerPoint • Rich Text Format…

Internet indexes: Google (Part 2) • One of the most popular systems in 2001, 2002, 2003… • For retrieval, an algorithm is used that takes into account the links between WWW pages. A retrieved page is ranked higher when • many sites/pages point to it • “important” sites/pages point to it
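The slide does not spell out the ranking algorithm. As an illustration of the two stated principles only, the sketch below uses a simplified PageRank-style iteration, in which each page passes a share of its own score to the pages it links to. The damping value of 0.85, the number of iterations, the function name and the four-page link graph are all assumptions for this example, not details taken from the slides.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Iteratively score pages by the scores of the pages that link to them."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page divides its score evenly over its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its score over all pages.
                share = damping * rank[page] / len(pages)
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: a, b and d all point to c; c points to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = link_rank(links)
print(sorted(ranks, key=ranks.get, reverse=True)[:2])  # ['c', 'a']
```

Page c ranks first because many pages point to it; page a ranks above b and d because its single incoming link comes from the "important" page c.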

Internet indexes: Google refers to a thesaurus • In Google, the words used in a search query are returned to the user with hyperlinks to a dictionary and to a thesaurus on the WWW, which can be used partly free of charge. • The thesaurus can of course show the user synonyms, narrower terms and related terms for the word. In this way, this system can be used to expand a search query, so that the query better covers the search concept.

Internet indexes: Google can expand a query: how? • If you want to retrieve more documents, you can ask Google to include synonyms of one or several of the words in your query automatically. • This has worked since 2003. • You can do this by putting a tilde ~ in front of the selected word. • Example of a query: word1 ~word2 word3 word4
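How the search engine carries out this expansion internally is not described here. Purely to illustrate the query syntax, the sketch below parses a query, finds the words marked with a tilde and replaces each by a group of alternatives taken from a synonym table. The table, the sample query and the function name are invented for this example.

```python
# Hypothetical synonym table; the engine's real synonym data is not known here.
SYNONYMS = {"cheap": ["inexpensive", "affordable"]}


def expand_tilde_query(query, synonyms):
    """Turn each query word into a group; ~word groups also hold synonyms."""
    groups = []
    for token in query.split():
        if token.startswith("~") and len(token) > 1:
            word = token[1:]
            groups.append([word] + synonyms.get(word, []))
        else:
            groups.append([token])
    return groups


print(expand_tilde_query("hotel ~cheap brussels", SYNONYMS))
# [['hotel'], ['cheap', 'inexpensive', 'affordable'], ['brussels']]
```

Each inner list can be read as an OR group and the outer list as an AND of those groups, so only the marked word is broadened.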
