Computer communication B

Computer communication B • Information retrieval • Repetition • Retrieval models • Wildcards • Web information retrieval • Digital libraries

IR • Information retrieval started as bibliography retrieval, became full-text term retrieval in a dataset, and was finally extended to web information retrieval • The information retrieval system analyses the contents of the information sources and the users' queries, and matches the two to retrieve the relevant items • COMPONENTS • The document subsystem • The indexing subsystem • The searching subsystem • The matching subsystem • The searching subsystem is one of the fundamental parts of an information retrieval system

IR searching Models • Searching models can be seen as searching strategies • Boolean search model • Probabilistic retrieval model • Vector processing model

The Boolean search model • IR systems use Boolean logic to let users express their search with these operators • George Boole introduced a system of symbolic logic built on three operators: • The logical sum + (OR) • Specifies alternatives between (or among) search terms • The logical product × (AND) • Specifies that two concepts must occur together • The logical difference – (NOT) • Excludes terms from the search

The Boolean operators • The logical sum + (OR) • house OR castle • The logical product × (AND) • house AND castle • The logical difference – (NOT) • house NOT castle • Boolean operators can be visualized with so-called Venn diagrams • [Figure: Venn diagrams of the sets House and Castle illustrating NOT, AND and OR]
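The three operators map directly onto set operations, which is what the Venn diagrams depict. The following is a minimal sketch, assuming a toy inverted index in which each index term points to a set of document numbers; the data in `INDEX` and the function name `boolean_search` are made up for illustration.

```python
# Hypothetical inverted index: index term -> set of document ids.
INDEX = {
    "house": {1, 2, 4},
    "castle": {2, 3},
}


def boolean_search(index, left, operator, right):
    """Evaluate 'left OPERATOR right' over an inverted index of sets."""
    a = index.get(left, set())
    b = index.get(right, set())
    if operator == "OR":   # logical sum: documents with either term
        return a | b
    if operator == "AND":  # logical product: documents with both terms
        return a & b
    if operator == "NOT":  # logical difference: left term without right term
        return a - b
    raise ValueError(f"unknown operator: {operator}")


print(sorted(boolean_search(INDEX, "house", "OR", "castle")))   # [1, 2, 3, 4]
print(sorted(boolean_search(INDEX, "house", "AND", "castle")))  # [2]
print(sorted(boolean_search(INDEX, "house", "NOT", "castle")))  # [1, 4]
```

Note that the result is an unordered set: every document either matches or does not, and nothing in the result says which match is more important.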

The Boolean model: pro and contra • It is an easy search model • Despite its simplicity, users often cannot use the three Boolean operators effectively, especially in more complicated queries • The search is sometimes imprecise: it returns too many items if the query is too broad, or too few if the query is too strict (with the risk of missing important items) • Boolean search does not permit ranking, i.e. the retrieved documents are not ordered by importance

Boolean search: example • Catalogue of the RUG library • There are index terms: • Boole indexed as an author is distinct from boolean or Boole as a title index term • The three Boolean operators are used • Wildcards are integrated (see later) • http://opc.ub.rug.nl/IMPLAND=Y/SRT=YOP/LNG=NE/DB=1/

Probabilistic retrieval models • Tackle the last problem outlined for the Boolean model: • Probabilistic models try to rank the retrieved documents in order of decreasing probability of being useful or relevant to the user

Vector space models • Documents are characterized/evaluated according to their index terms • Each document is represented by a vector • The dimensions of the vector are the index terms, so a document vector can have many dimensions • The value for an index term is the number of times that term appears in the document (sometimes 0) • A measure of the similarity between two documents is the cosine of the angle between their vectors • Queries are likewise interpreted as vectors
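The cosine measure described above can be sketched in a few lines. This is an illustration under simple assumptions: index terms are taken to be lowercase whitespace-separated words, term values are raw counts, and the function names `term_vector` and `cosine_similarity` are invented for this example.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a text as a sparse vector: index term -> number of occurrences."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = term_vector("house castle")
documents = {
    "d1": term_vector("the house near the castle"),
    "d2": term_vector("a castle a castle a castle"),
    "d3": term_vector("digital libraries"),
}

# Rank documents by decreasing similarity to the query vector.
ranking = sorted(documents, key=lambda d: cosine_similarity(query, documents[d]), reverse=True)
print(ranking)
```

Because the query is just another vector, the same function compares a query with a document or two documents with each other, and the scores give the ranking that the Boolean model lacks.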

Vector space models

Evaluation of a search • Precision • How many of the retrieved documents are relevant to the search? • Recall • How many of the relevant documents are retrieved by the search? • Fall-out • How many of the irrelevant documents are retrieved by the search?
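Each of the three measures is a ratio of set sizes. The sketch below assumes that the set of relevant documents is known in advance (as in a test collection); the function name `evaluate` and the example numbers are hypothetical.

```python
def evaluate(retrieved, relevant, collection):
    """Return (precision, recall, fall-out) for one search.

    retrieved:  documents returned by the search
    relevant:   documents that are actually relevant
    collection: all documents in the collection
    """
    retrieved, relevant, collection = set(retrieved), set(relevant), set(collection)
    hits = retrieved & relevant          # relevant documents that were found
    false_hits = retrieved - relevant    # irrelevant documents that were found
    irrelevant = collection - relevant   # all irrelevant documents

    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    fallout = len(false_hits) / len(irrelevant) if irrelevant else 0.0
    return precision, recall, fallout


# 10 documents, 4 of them relevant; the search returns 3, of which 2 are relevant.
precision, recall, fallout = evaluate(
    retrieved={1, 2, 5},
    relevant={1, 2, 3, 4},
    collection=range(1, 11),
)
print(precision, recall, fallout)  # 2/3, 1/2 and 1/6 respectively
```

A broad query tends to raise recall at the cost of precision, and a strict query does the opposite, which is the trade-off noted for Boolean search.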

Wildcards 1 • Wildcards are characters that can substitute for any subset of all possible characters • In other words, they stand for unknown subparts of a term • Wildcards are usually written as an asterisk * • The asterisk usually substitutes zero or more unknown characters • Example: aphas* → aphasiology, aphasia, aphasic, aphasics, aphasiological etc. • Wildcards are an advantage for the user of the system, but they are costly for the system itself • The user does not have to run a separate search for each variant • But the system needs to interpret the term and test (search) all the possible terms stemming from it
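The extra work on the system side can be illustrated by expanding a wildcard term against the vocabulary of index terms. This sketch uses `fnmatch.fnmatchcase` from the Python standard library, where `*` matches zero or more characters (it also gives `?` and `[...]` a special meaning, which is ignored here); the vocabulary and the function name `expand_wildcard` are made up for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical vocabulary of index terms.
VOCABULARY = [
    "apathy", "aphasia", "aphasic", "aphasics",
    "aphasiological", "aphasiology", "phase",
]


def expand_wildcard(pattern, vocabulary):
    """Return every index term matched by the wildcard pattern, testing each term in turn."""
    return [term for term in vocabulary if fnmatchcase(term, pattern)]


print(expand_wildcard("aphas*", VOCABULARY))
# ['aphasia', 'aphasic', 'aphasics', 'aphasiological', 'aphasiology']
```

The user types one term, while the system tests the pattern against the vocabulary and then has to search for every term in the expansion.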

Wildcards 2 • Wildcard characters usually substitute a group of letters that cannot stand alone as a word but can form a word when attached to a specific root • Sun* → wc: 0 = sun. wc: -s = suns. wc: -set = sunset … • Searching with a wildcard at the beginning of a word or within a word is harder (the number of possible matches is larger)

Wildcards 3: Permuterm index • Wildcard
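As an illustration of the general permuterm idea (not necessarily the exact variant meant on this slide): every term gets an end marker `$`, all rotations of the marked term are indexed, and a query with one `*` is rotated so that the wildcard ends up at the end, which turns any single-wildcard search into a prefix search. The function names and the vocabulary below are hypothetical, and only patterns with exactly one `*` are handled.

```python
from bisect import bisect_left

END = "$"  # end-of-term marker


def build_permuterm(vocabulary):
    """Return a sorted list of (rotation, term) pairs for every rotation of term + '$'."""
    entries = []
    for term in vocabulary:
        marked = term + END
        for i in range(len(marked)):
            entries.append((marked[i:] + marked[:i], term))
    entries.sort()
    return entries


def wildcard_lookup(entries, pattern):
    """Find the terms matching a pattern that contains exactly one '*'."""
    if pattern.count("*") != 1:
        raise ValueError("this sketch handles exactly one wildcard")
    prefix, suffix = pattern.split("*")
    key = suffix + END + prefix  # rotate the query so the wildcard is at the end
    matches = set()
    i = bisect_left(entries, (key,))  # first rotation >= key
    while i < len(entries) and entries[i][0].startswith(key):
        matches.add(entries[i][1])
        i += 1
    return sorted(matches)


entries = build_permuterm(["sun", "suns", "sunset", "reset", "moon"])
print(wildcard_lookup(entries, "sun*"))  # ['sun', 'suns', 'sunset']
print(wildcard_lookup(entries, "*set"))  # ['reset', 'sunset']
print(wildcard_lookup(entries, "s*t"))   # ['sunset']
```

The price is index size: a term of n letters contributes n + 1 rotations.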

Web information retrieval • IR was created for bibliography retrieval. Nevertheless, much information now has to be accessed on the web, and IR addresses this kind of search as well • Traditional and web IR differ in a number of characteristics

Web information retrieval • The web is far larger and more distributed than the traditional set of information sources • The web keeps growing • The web has different levels of depth for a search • The web has different types and formats of documents • The quality of documents on the web varies • The information on the web changes rapidly • Distributed users

Web information retrieval • The web is far larger than the traditional set of information sources • Not only is the amount of information and documents larger, but a retrieval system built for traditional IR also has to deal with a different set of standards (software etc.). In fact the web does not have a “set of standards” • As a consequence the search is more difficult

Web information retrieval • The web keeps growing • The amount of information on the web is growing (and will probably continue to grow) • Conventional text retrieval systems should be tested and adapted to work with larger datasets

Web information retrieval • The web has different levels of depth for a search • The web has two types of access: a free one, and a “deep” one accessible only with passwords or special programs. WIR can access only the surface information • The web has different types and formats of documents • Traditional IR works with texts. On the web there are several types of documents (images, sound files etc.). Both indexing and retrieval are therefore more complex

Web information retrieval • The quality of documents on the web varies • IR systems are not designed to check the quality of the information resources, so there is no control over quality • The information on the web changes rapidly • Traditional text retrieval systems are quite static compared with the web. Keeping track of the rapid changes is a challenge • Sources often move, and it is difficult to track them down

Web information retrieval • Distributed users • The builder of a conventional IR system knows approximately who its target users are. The builder of a web information retrieval system has no “typical” user

Search engines • Search engines are a kind of IR system • They let users run a search using search terms, keywords or key sentences • Most search engines allow the use of Boolean operators (AND, OR, NOT) • Special programs called “spiders” regularly collect information on web pages • The search engine finds documents that match the search • The search engine does not search the web for every query; it searches a database built by the spider programs. This database is regularly updated • There are also many types of search engines for different specialties (Google, Altavista news, Google Images, etc.)
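The separation between collecting pages and answering queries can be sketched as two steps: an index built ahead of time from whatever the spider has collected, and a search that consults only that index. The page texts, the example.org URLs and the function names are hypothetical, and the implicit AND between query terms is an assumption of this sketch rather than a description of any particular engine.

```python
from collections import defaultdict

# Pages as a spider might have collected them: URL -> page text (made-up data).
PAGES = {
    "http://example.org/a": "information retrieval on the web",
    "http://example.org/b": "digital libraries and information sources",
    "http://example.org/c": "castles of the web",
}


def build_index(pages):
    """Build the database once: term -> set of URLs containing that term."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


def search(index, query):
    """Answer a query from the index alone; the web itself is not visited."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())  # every query term must occur
    return results


index = build_index(PAGES)  # refreshed whenever the spider has new pages
print(sorted(search(index, "information web")))  # ['http://example.org/a']
```

A page that changes after the last spider visit is still answered from the old index entry until the database is rebuilt, which is why regular updates matter.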

Digital libraries • A digital library: • “must accomplish all essential services of traditional libraries and also exploit the well-known advantage of a digital storage” • Digital libraries provide access to different information sources in various forms (text, images, audio files etc.) • Digital libraries give access to a variety of information via different sources • The web • E-journals • Online databases • Remote digital libraries • Every digital library has a library-user interface

The digital library • [Figure: the USER reaches the www, online databases, e-journals and remote digital libraries through the digital library interface]

Digital libraries • Some digital libraries • Alexandria Digital Library Project • http://www.cdlib.org/ • http://www.gutemberg.org • http://www.theeuropeanlibrary.org

Digital libraries • Digital libraries use features of IR systems • Users can browse or search the collections • Some digital libraries allow searching across a network of digital libraries • Boolean search is the most widely used search model in digital libraries • The search is via keywords or sentences, with the use of wildcards

Introduction to modern information retrieval (Chowdhury, G.G.)
