Database Searching: Theory and Practice

LIS618 lecture 0 Thomas Krichel 2003-09-14

today's lecture • I will not talk about the strike. • A look at the course home page http://wotan.liu.edu/home/krichel/lis618n03a • administrative stuff • historical matters about the course • about me • business of database searching • indexes • the Boolean information retrieval model • practice example on Dialog

Organization • homepage http://wotan.liu.edu/home/krichel/lis618n03a • Contents to be discussed today. • Send mail to krichel@openlib.org • Your name • Your secret word for grades delivery • Interrupt me with as many questions as possible! • Ask for breaks!

Proposed Organization • Normal lecture • Quiz at the beginning of every lecture • Factually oriented, around 15 minutes • Remove worst performance • Average to form 50% • Search exercise 50% • Formal syllabus to be made early next week!

Search exercise • find victim of an information need • best to take someone you know in a professional capacity • conduct interview about an information need experienced by the victim, write down expectations • search in formal database and on web • discuss results with the victim • write essay, no longer than 7 pages.

about the course • This course is new wine in an old bottle • Officially a merger of • lis566 information resources on the Internet • mailing lists • usenet news • web searching • lis618 database searching • access and use of commercial databases

mix of theory and practice • I am not a database search practitioner. • Each database is different, practical skills are not easily transferable. • Thus my emphasis in the course is more on theory. • In the past, I did theory first, then practice. • This year I will try to mix. Some theory and some practice in every session.

What databases? • Dialog has been the traditional database covered. • They were the market leaders in online databases in the past. • Nowadays the field is much more open • In addition I have done Nexis, FirstSearch (OCLC) in the past. • But I am open to suggestions.

About me • Born 1965, in Völklingen (Germany) • Studied economics and social sciences at the Universities of Toulouse, Paris, Exeter and Leicester. • PhD in theoretical macroeconomics • Lecturer in Economics at the University of Surrey from 1993 to 2001 • Since 2001 assistant professor at the Palmer School

Why? • During my research assistantship period (1990 to 1993), I was constantly frustrated with difficult access to scientific literature. • At the same time, I discovered easy access to freely downloadable software over the Internet. • I decided to work towards downloadable scientific documents. This led to my library career (eventually).

Steps taken I • 1993 founded the NetEc project at http://netec.mcc.ac.uk, later available at http://netec.ier.hit-u.ac.jp as well as at http://netec.wustl.edu. • These are networking projects targeted to the economics community. The bulk is • Information about working papers • Downloadable working papers • Journal articles were added later

Steps taken II • Set up RePEc, a digital library for economics research. Catalogs • Research documents • Collections of research documents • Researchers themselves • Organizations that are important to the research process • Decentralized collection, model for the open archives initiative

Steps taken III • Co-founder of Open Archives Initiative • Work on the Academic Metadata Format • Co-founded rclis, a RePEc clone for Research in Computing, Library and Information Science

Interest in databases • From my point of view I have two interests in database searching • As a provider, I must understand how people search in order to provide some data that they can use and will use. • As an economist, I have a strong interest in information as a commodity. The database market is an important market place. • Main emphasis of course is still on databases.

Database searching (DS) • subset of the subject of information retrieval (IR) • DS is mainly thought of as applying to large structured databases, as opposed to web searching • for those, a general knowledge of what databases are seems useful • Concentrate on textual databases

traditional social model • user goes to a library • describes problem to the librarian • librarian does the search • without the user present • with the user present • hands over the result to the user • user fetches full-text or asks a librarian to fetch the full text.

economic rationale for traditional model • In olden days the cost of telecommunication was high. • database use costs • cost of communication • cost of access time to the database • the traditional model puts an upper bound on costs

disintermediation • with the cost of access time gone, the traditional model is under threat • there is disintermediation where the librarian loses her role • but that may not be good news for information retrieval results • user knows subject matter best • librarian knows searching best

Web searching • IR has received a lot of impetus through the web, which poses unprecedented search challenges. • with more and more data appearing on the web DS may be a subject in decline • it is primarily concerned with non-web databases • There are more and more web-based methods of searching

Public access vs quality • Now the public at large is able to do online searching. • At the same time the need for quality answers has grown. • Quality-filtered services will become more important. • In the current databases, there is a lot that would already be available for free mixed with quality-controlled stuff. • Publishers have direct offerings and intermediated vending is in decline.

Main theory part • Literature: "Modern Information Retrieval" by Ricardo Baeza-Yates and Berthier Ribeiro-Neto • Don't buy it. It is not a good book.

before the IR process • provider • define data that is available • documents that can be used • document operations • document structure • index • user • user need • IR system familiarity

the IR process • query expresses user need in a query language • processing of query yields retrieved documents • calculation of relevance ranking • examination of retrieved documents • possible relevance cycle

main problem • user is not an expert at the formulation of a query • garbage in garbage out, the retrieval yields poor result • ways out • design very intuitive interface for the query • give expert guidance

taxonomy of classic IR models • Boolean, or set-theoretic • fuzzy set models • extended Boolean • vector, or algebraic • generalized vector model • latent semantic indexing • neural network model • probabilistic • inference network • belief network

summary • There are three basic types of models in classic information retrieval. • Extensions of these types are a matter of research concern and require good mathematical skills. • All classic models treat documents as individual pieces.

key aid: index • an index is a list of terms, with a list of locations where the term is to be found. • The way to express locations usually depends on the form that the indexed data takes. • for a book, it is usually the page number, e.g. "shmoo 34, 75" • for computer files it is usually the name of the file plus the number of the byte where the indexed term starts, e.g. "krichel index.html 34, cv.html 890 1209" • there is usually more than one location of the term.
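The file-based index entry above can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper name build_index, the sample file contents, and the rule that any run of letters is a term are all made up here, and character offsets stand in for byte offsets.

```python
import re
from collections import defaultdict

def build_index(files):
    """Map each term to {file name: [start offsets]} (hypothetical helper)."""
    index = defaultdict(lambda: defaultdict(list))
    for name, text in files.items():
        # every run of letters counts as a term; offsets are character positions
        for match in re.finditer(r"[A-Za-z]+", text):
            index[match.group().lower()][name].append(match.start())
    return index

# made-up sample files, not taken from the lecture
files = {
    "index.html": "Home page of Krichel",
    "cv.html": "Krichel studied economics. Krichel teaches.",
}
index = build_index(files)
print(dict(index["krichel"]))  # locations of "krichel", grouped by file
```

As in the slide's example, one term can have several locations, both across files and within a single file.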

key aid: index terms • an index term is a part of the document that has a meaning on its own. • it is usually a noun. • retrieval based on index terms raises questions • semantics in query or document is lost • matching is done in the imprecise space of index terms • predicting relevance is a central problem • the IR model determines the process of relevance ranking

basic concept: weight of index term • given all nouns, not all appear to have the same relevance to the text • sometimes, we can have a simple measure of the importance of a term, example? • more generally, for each indexing term and each document we can associate a weight with the term and the document. • usually, if the document does not contain the term, its weight is zero
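One simple measure of importance, offered here as an assumption rather than as the example the lecture had in mind, is the share of a document's words taken up by the term. A minimal sketch, where term_weights and the sample sentence are hypothetical:

```python
from collections import Counter

def term_weights(document):
    """Weight of a term = fraction of the document's words that are this term."""
    words = document.lower().split()
    counts = Counter(words)
    return {term: count / len(words) for term, count in counts.items()}

# made-up sample document
doc = "digital library research on digital library users"
weights = term_weights(doc)
print(weights["digital"])             # term occurs twice among seven words
print(weights.get("economics", 0.0))  # term absent from the document: weight zero
```

A term that does not occur in the document gets weight zero, matching the last bullet above.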

Boolean model • in the Boolean model, the weight of an index term for a document is 1 if the term appears in the document. It is 0 otherwise. • This allows query terms to be combined with the Boolean operators AND, OR, and NOT • thus powerful queries can be written
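The 0/1 weights and the Boolean operators can be sketched as follows. The three documents, the function names, and the query are invented for illustration; splitting on whitespace is a simplifying assumption about what counts as a term.

```python
# made-up documents, keyed by a record number
docs = {
    1: "current awareness services in digital libraries",
    2: "digital library software",
    3: "current affairs in economics",
}

def weight(term, doc_id):
    """Boolean weight: 1 if the term appears in the document, 0 otherwise."""
    return 1 if term in docs[doc_id].split() else 0

def matches(doc_id):
    """Query: current AND digital AND NOT software."""
    return bool(weight("current", doc_id)
                and weight("digital", doc_id)
                and not weight("software", doc_id))

result = [doc_id for doc_id in docs if matches(doc_id)]
print(result)
```

A document either matches the query or it does not; with weights of only 0 and 1 there is nothing left to rank by.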

Classic implementation: dialog http://training.dialog.com/sem_info/courses/pdf_sem/dlg1.pdf http://training.dialog.com/sem_info/courses/pdf_sem/dlg2.pdf http://training.dialog.com/sem_info/courses/pdf_sem/dlg3.pdf http://training.dialog.com/sem_info/courses/pdf_sem/dlg4.pdf

Dialog is a databank • over 500 databases • these are also known as files and cover • references and abstracts for published literature, • business information and financial data; • complete text of articles and news stories; • statistical tables • Directories • DIALOG uses the Boolean model

DIALOG interface • is still rooted in "traditional" database systems • dismissed as "dial-a-dog" • it uses a command-driven interface • it is very complicated to learn fully • it is not suitable for the end-user • it therefore offers a valuable skill to the information professional • it is a challenge for a professor to teach

Accessing DIALOG • On the web, go to • http://www.dialogweb.com/ • Enter username and password • Forget about subaccount • then click on logon • On the next screen go to command search • "continue" at the next screen

two steps in DIALOG • step one: select databases (aka files) to look at • step two: perform searches on the selected databases • You may wonder why one does not have one single step like in a search engine. Discuss.

sample search • We want to know something about "current awareness in digital libraries" • From dialogweb command search: • databases • social sciences and humanities • library and information science • leads you to http://www.dialogweb.com/cgi/logoff?mode=guided&url=/cgi/dwframe?href=search.html

This is database selection… • At that screen you see a number of "files" with their numbers. • You can select those that you want to search • then you click "begin databases" • and you get back to the command search • "b numbers" it will say. That is the command to begin working with files.

Boolean search • Do a number of searches • s current(N)awareness • s digital(N)library • s digital(N)libraries • Each search retrieves a set of documents • The sets can be combined • s s1 and (s2 or s3)

What is the deal? • There are two stages. • At stage two we make Boolean queries. • Each query splits the records into matching and non-matching records. • The set of matching records is returned. • It can be further searched or combined with other sets using Boolean operators. • Try this at home.
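Combining numbered result sets behaves like ordinary set algebra. The sketch below mimics "s s1 and (s2 or s3)" with Python sets; the record numbers are invented, and nothing here talks to Dialog itself.

```python
# made-up record numbers; a real search would get these back from the database
s1 = {101, 102, 103, 107}   # records matching "current awareness"
s2 = {102, 104}             # records matching "digital library"
s3 = {103, 105, 107}        # records matching "digital libraries"

s4 = s1 & (s2 | s3)         # s1 AND (s2 OR s3)
s5 = s4 - s2                # narrow further: s4 NOT s2
print(sorted(s4))
print(sorted(s5))
```

Each combination produces a new set, which can itself be combined again, just as each search in the session yields a new numbered set.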

http://openlib.org/home/krichel Thank you for your attention!
