

  1. Text Based Information Retrieval: Document and Query Representation. Lecture I. Dr. Aboud Madlin

  2. PLAN • Introduction • Document and Query Representation • Controlled language index terms • Natural language index terms • Text analysis • Lexical Analysis • Index Term Weighting

  3. 1- Introduction • Documents • Logical units of text • Units of records (text & other components) • Units of semantic entities: units of text grouped together for a purpose • Units of unformatted text: text as written by the authors of documents • Text types studied by text linguistics • A collection can span different languages and subject domains

  4. Description of Text • First level • Letters • Syllables and Morphemes • Words • Phrases • Clauses • Sentences • Second level • Schematic : structure of the text • Thematic

  5. Ambiguity in Meaning • Ambiguity decreases as the unit of text grows: morpheme → word → phrase → sentence → text → domain

  6. Problems and Benefits • Natural language is a powerful means of communication • Text retrieval is an IR problem: the system may • not supply all documents relevant to the query • not understand the query (the information need) • not understand the content of the document text • use a poor strategy for matching query and document content

  7. Document Representation • Since • Documents are full of text • Not every word of the text is meaningful for searching/retrieval • Documents themselves do not have readily identifiable attributes such as author or title • Documents need to be processed and represented in concise and identifiable formats/structures

  8. Document Representation • Documents should be represented to help users identify and receive information from the system. • to identify authors and titles • to identify subjects • to provide summaries/abstracts • to classify subject categories

  9. Document Representation • Controlled language index terms - fixed terms • Assignment of thesaurus terms and subject, and classification codes • Natural language index terms • Extraction of words, phrases, and collocations from text • Possibly weighted

  10. Query Representation • Key terms (natural language or controlled) possibly connected with Boolean operators and weighted • Possibly generated from natural language queries • Query expansion and relevance feedback

  11. Controlled language index terms Document Surrogates • Each document should have a unique identifier • Accession (sequential) number • Classification number • Barcodes • ISBN number • Good for the computer but not enough for the user? • “Go to bookstore and get the book 0-471-14338-3.” • “Do you want to have 200737-103146 for dinner?” • Citation • A set of information to make it easy to identify a document.

  12. Index • Computerized Indexing • Indexing based on citations • Indexing based on full text • Subject indexing • Creating a set of control vocabularies (Thesaurus or Subject headings) to represent documents • Assigning terms of control vocabularies to documents

  13. Computerized Indexing • The computer creates indexing files based on document surrogates • to improve access speed • to increase access points • to improve precision • to reduce false drops • to identify similar documents

  14. Computerized Indexing: Title indexing • Sort all the titles alphabetically • Ignore a leading "a" or "the" • Convert all letters to uppercase • Matching always starts from the beginning of the title (not from individual words) • Most early IR systems (such as library catalogs) used title indexing

  15. Computerized Indexing: Keyword indexing • Parse every individual word out of the documents • First decision: what is a word? • Are digits words? • What about letter-digit combinations: B6, B12? • Is F-16 one word or two words? • Hyphens: online, on-line, on line? F-16? • Example: parser.c • List all the words alphabetically with pointers back to the documents: inverted indexing (see the tokenizer sketch below)
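A minimal sketch, in C, of one way to make these tokenization choices: letters, digits, and internal hyphens are treated as word characters, so "B12" and "F-16" each survive as a single lower-cased token. The function names are illustrative and are not taken from the parser.c mentioned above.

#include <ctype.h>
#include <stdio.h>

/* Return non-zero if c may appear inside a token.
   We choose to keep letters, digits, and hyphens,
   so "B12" and "F-16" each come out as one token. */
static int is_token_char(int c)
{
    return isalnum(c) || c == '-';
}

/* Print every token of `text`, lower-cased, one per line. */
static void tokenize(const char *text)
{
    const char *p = text;
    while (*p) {
        while (*p && !is_token_char((unsigned char)*p))
            p++;                                  /* skip separators */
        const char *start = p;
        while (*p && is_token_char((unsigned char)*p))
            p++;                                  /* scan one token  */
        for (const char *q = start; q < p; q++)
            putchar(tolower((unsigned char)*q));
        if (p > start)
            putchar('\n');
    }
}

int main(void)
{
    tokenize("On-line retrieval of F-16 manuals and vitamin B12 data.");
    return 0;
}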

  16. Computerized Indexing: Phrase indexing • There is no "safe" way to parse phrases out of titles or the full text of documents • One way to do phrase indexing is by position: if two words appear next to each other, they are (potentially) a phrase • Most phrase indexes are built manually

  17. Computerized Indexing: Inverted Indexing • Purpose: • Preparing documents for search engines to search • Objective: • Create a sorted list of words with pointers indicating which documents the words appear in, and WHERE • Process the list in many different ways to meet retrieval needs

  18. Computerized Indexing: Inverted Indexing • An inverted index consists of an ordered list of indexing terms, each associated with a set of document identification numbers • Retrieval is done by first searching the ordered list to find the indexing term, then using the document identification numbers to locate the documents

  19. Computerized Indexing: Inverted Indexing Examples • ISYS102 Introduction to information systems • ISYS110 Human computer interaction • ISYS300 Information retrieval theories and systems

  20. Computerized Indexing: Inverted Indexing Examples. Step 1: generate a list of all the (document, word) pairs • ISYS102 Introduction to information systems • ISYS110 Human computer interaction • ISYS300 Information retrieval theories and systems • (ISYS102, introduction), (ISYS102, to), (ISYS102, information), (ISYS102, systems), (ISYS110, human), (ISYS110, computer), (ISYS110, interaction), (ISYS300, information), (ISYS300, retrieval), (ISYS300, theories), (ISYS300, and), (ISYS300, systems)

  21. Computerized Indexing: Inverted Indexing Examples. Step 2: remove the stop words ("to", "and") • (ISYS102, introduction), (ISYS102, information), (ISYS102, systems), (ISYS110, human), (ISYS110, computer), (ISYS110, interaction), (ISYS300, information), (ISYS300, retrieval), (ISYS300, theories), (ISYS300, systems)

  22. Computerized Indexing: Inverted Indexing Examples. Step 3: invert the list to (word, document) pairs • (introduction, ISYS102), (information, ISYS102), (systems, ISYS102), (human, ISYS110), (computer, ISYS110), (interaction, ISYS110), (information, ISYS300), (retrieval, ISYS300), (theories, ISYS300), (systems, ISYS300)

  23. Computerized Indexing: Inverted Indexing Examples. Step 4: sort the list alphabetically • (computer, ISYS110), (human, ISYS110), (information, ISYS102), (information, ISYS300), (interaction, ISYS110), (introduction, ISYS102), (retrieval, ISYS300), (systems, ISYS102), (systems, ISYS300), (theories, ISYS300)

  24. Computerized Indexing: Inverted Indexing Examples. Step 5: merge identical words into one entry • computer: ISYS110 • human: ISYS110 • information: ISYS102, ISYS300 • interaction: ISYS110 • introduction: ISYS102 • retrieval: ISYS300 • systems: ISYS102, ISYS300 • theories: ISYS300
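The five steps above can be condensed into a short C sketch: collect the (term, course) pairs, drop the stop words, sort with qsort, and merge duplicate terms into a single posting line. The course titles and the tiny stop list are the ones from the example; everything else (names, sizes) is illustrative.

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_PAIRS 64

struct pair { char term[32]; const char *doc; };

static struct pair pairs[MAX_PAIRS];
static int npairs = 0;

/* Tiny stop list for the example ("to", "and", ...). */
static int is_stop(const char *w)
{
    static const char *stop[] = { "to", "and", "of", "the" };
    for (size_t i = 0; i < sizeof stop / sizeof *stop; i++)
        if (strcmp(w, stop[i]) == 0)
            return 1;
    return 0;
}

/* Steps 1-2: split a title into lower-cased words and drop stop words. */
static void add_doc(const char *doc, const char *title)
{
    char buf[256];
    strncpy(buf, title, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    for (char *p = buf; *p; p++)
        *p = (char)tolower((unsigned char)*p);
    for (char *w = strtok(buf, " "); w; w = strtok(NULL, " "))
        if (!is_stop(w) && npairs < MAX_PAIRS) {
            strncpy(pairs[npairs].term, w, sizeof pairs[npairs].term - 1);
            pairs[npairs].term[sizeof pairs[npairs].term - 1] = '\0';
            pairs[npairs].doc = doc;
            npairs++;
        }
}

/* Step 4: sort the (term, doc) pairs by term. */
static int cmp_pair(const void *a, const void *b)
{
    return strcmp(((const struct pair *)a)->term,
                  ((const struct pair *)b)->term);
}

int main(void)
{
    add_doc("ISYS102", "Introduction to information systems");
    add_doc("ISYS110", "Human computer interaction");
    add_doc("ISYS300", "Information retrieval theories and systems");

    qsort(pairs, (size_t)npairs, sizeof *pairs, cmp_pair);

    /* Step 5: merge identical terms into one posting line. */
    for (int i = 0; i < npairs; ) {
        printf("%-12s", pairs[i].term);
        int j = i;
        while (j < npairs && strcmp(pairs[j].term, pairs[i].term) == 0) {
            printf(" %s", pairs[j].doc);
            j++;
        }
        putchar('\n');
        i = j;
    }
    return 0;
}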

  25. Example: Create an inverted index for the following:

  26. Computerized Indexing: Subject Indexing • A human analytic process for identifying, selecting, and representing document concepts • Create indexing languages • Using standardized, limited vocabularies for indexing purposes • Assign indexing terms to documents • Using only the terms in the selected index language

  27. Computerized Indexing: Subject Indexing, Controlled Vocabulary • Goals: • To permit easy location of documents by topic • To define topic areas and hence relate one document to another • To provide multiple access points to documents • To enforce uniformity throughout an information retrieval system

  28. Computerized Indexing: Subject Indexing, Controlled Vocabulary • Formats: • Hierarchical/classified list • hierarchical subject descriptors • associative cross references • classification notation (codes) • Alphabetical list • includes both descriptors and other lead-in terms

  29. Main Components in a Controlled Vocabulary • the keyword/descriptor, linked to its broader terms, narrower terms, related terms, and synonymous terms

  30. Example • Descriptor: Data Base • Synonyms: DB • Broader terms: Computer Science, Software • Related terms: Information Retrieval, Management • Narrower terms: Primary key, Relation, Foreign key
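One way to picture how such an entry could be stored is the small C sketch below; the struct fields mirror the components on the previous slide and the sample data is the Data Base example above. The type and field names are illustrative, not taken from any particular thesaurus format.

#include <stdio.h>

/* One entry of a controlled vocabulary: a descriptor plus its
   broader, narrower, related and synonymous terms (NULL-terminated lists). */
struct vocab_entry {
    const char *descriptor;
    const char **broader;
    const char **narrower;
    const char **related;
    const char **synonyms;
};

static const char *bt[] = { "Computer Science", "Software", NULL };
static const char *nt[] = { "Primary key", "Relation", "Foreign key", NULL };
static const char *rt[] = { "Information Retrieval", "Management", NULL };
static const char *sy[] = { "DB", NULL };

static const struct vocab_entry database = { "Data Base", bt, nt, rt, sy };

static void print_list(const char *label, const char **list)
{
    printf("  %s:", label);
    for (; *list; list++)
        printf(" %s;", *list);
    putchar('\n');
}

int main(void)
{
    printf("Descriptor: %s\n", database.descriptor);
    print_list("Broader terms", database.broader);
    print_list("Narrower terms", database.narrower);
    print_list("Related terms", database.related);
    print_list("Synonyms", database.synonyms);
    return 0;
}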

  31. Advantages of Subject Indexing • Facilitates concept search • search by topics/subjects, not just by words • links related documents by subject terms • makes implicit information explicit • Provides a standard terminology to index and search documents • Uses a small indexing vocabulary • Helps the searcher find related terms

  32. Disadvantages of Subject Indexing • Expensive manual operations • to construct the controlled vocabulary • to assign terms to documents • Difficult to keep up to date • terminology changes very fast • new terms are added daily • Inconsistent human indexing • the same document may be assigned different indexing terms by different indexers • the user may not use the same terms to find documents as the indexer used to index them

  33. Two Examples of Document Representation • Controlled Vocabulary • human-based indexing • subject-based indexing • Inverted indexing • computer-based indexing • statistical-based indexing

  34. Considerations of Document Representation • Discriminating power • to identify a document uniquely, to reduce ambiguity • Descriptiveness • describe all the information as completely as possible • Similarity identification • to group similar documents • it is difficult for the computer to assign keywords, subject descriptors, or classification numbers to documents • Conciseness • simple and clear • reduces processing time and storage space • Needs of both the computer and the user

  35. Relationships among the four considerations • Higher discriminating power may lower the ability to identify similarities among documents • Good descriptiveness may defeat conciseness • What is good for the computer may not always be good for the user • A good representation should seek a balance of the four and take into consideration both the computer and the user

  36. Improving the Indexing: Assignment of Natural Language Index Terms • So far we have treated words simply as tokens when creating the inverted index • To improve the indexing, we should also consider • meanings of words • structures of language • word usage

  37. Text Analysis • General process • Lexical analysis: word (token) extraction • Removal of stop words • Stemming • Identification of phrases and collocations (optional) • Term weighting (frequency counts) • Zipf's law

  38. 1- Lexical Analysis • Process of tokenization • Text → {word} • In most languages • word = string of characters separated by white space and/or punctuation • Difficulties: • Abbreviations (e.g. "etc.", ...) • transformed to their original format using an MRD (Machine-Readable Dictionary) • Hyphenated terms (_, -) • Apostrophes • Numbers

  39. 2- Removal of Stop Words • Many of the most frequently used words in English are worthless for indexing; these words are called stop words • the, of, and, to, ... • Typically about 400 to 500 such words • Why do we need to remove stop words? • To reduce the indexing file size • stop words account for 20-30% of total word counts • To improve efficiency • stop words are not useful for searching

  40. 2- Removal of Stop Words • Building a stop list: • based on word classes: function words (e.g. articles, prepositions, ...) • based on a threshold frequency of occurrence of words: • in a general corpus that reflects a broad range of subjects • in the document collection (domain-specific stop list) • Potential problems of removing stop words • a small stop list does not improve the indexing much • a large stop list may eliminate words that are useful for someone or for some purpose • stop words may be part of phrases (e.g. put on, take off) • removal has to be applied to both the index and the queries
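A minimal sketch, in C, of the stop-word check itself: the stop list is kept alphabetically sorted so membership can be tested with bsearch. The handful of words shown is only a stand-in for a real list of 400-500 entries.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* A tiny, alphabetically sorted stand-in for a real stop list. */
static const char *stoplist[] = {
    "a", "an", "and", "are", "as", "at", "be", "by",
    "for", "from", "in", "is", "of", "on", "the", "to", "with"
};

static int cmp(const void *key, const void *elem)
{
    return strcmp((const char *)key, *(const char **)elem);
}

/* Return 1 if `word` is a stop word, 0 otherwise. */
static int is_stop_word(const char *word)
{
    return bsearch(word, stoplist,
                   sizeof stoplist / sizeof *stoplist,
                   sizeof *stoplist, cmp) != NULL;
}

int main(void)
{
    const char *sample[] = { "retrieval", "of", "information", "from", "text" };
    for (size_t i = 0; i < sizeof sample / sizeof *sample; i++)
        if (!is_stop_word(sample[i]))
            printf("index: %s\n", sample[i]);   /* keeps retrieval, information, text */
    return 0;
}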

  41. Some English Stop words

  42. 3- Stemming • Techniques used to find the root/stem of a word • Example: a lookup of "user engineering" conflates user (15), users (4), used (5), using (5) to the stem use, and engineering (12), engineered (23), engineer (12) to the stem engineer

  43. Advantages of stemming • Improving effectiveness • matching similar words • Reducing indexing size • combining words with the same root may reduce indexing size by as much as 40-50% • Criteria for stemming • correctness • retrieval effectiveness • compression performance

  44. Basic stemming methods: Use of tables and rules • Affix removal algorithms (suffixes, prefixes) • Remove the ending • if a word ends with a consonant other than s, followed by an s, then delete the s • if a word ends in es, drop the s • if a word ends in ing, delete the ing unless the remaining word consists of only one letter or of th • if a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter • ... • Transform the remaining word • if a word ends with ies but not eies or aies, then ies → y
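As an illustration, a toy C version of a few of the rules above (the plural -s, -es, -ing and -ed rules and the ies → y transformation). It is a sketch of table/rule-driven ending removal, not a complete or linguistically safe stemmer.

#include <stdio.h>
#include <string.h>

static int is_vowel(char c)
{
    return strchr("aeiou", c) != NULL;
}

/* Apply a few of the ending-removal rules listed above, in place. */
static void strip_ending(char *w)
{
    size_t n = strlen(w);

    /* ies -> y (but not eies or aies) */
    if (n > 4 && strcmp(w + n - 3, "ies") == 0 &&
        w[n - 4] != 'e' && w[n - 4] != 'a') {
        strcpy(w + n - 3, "y");
        return;
    }
    /* ends in es: drop the s */
    if (n > 3 && strcmp(w + n - 2, "es") == 0) {
        w[n - 1] = '\0';
        return;
    }
    /* consonant (other than s) followed by s: delete the s */
    if (n > 2 && w[n - 1] == 's' && w[n - 2] != 's' && !is_vowel(w[n - 2])) {
        w[n - 1] = '\0';
        return;
    }
    /* ends in ing: delete it unless one letter or "th" would remain */
    if (n > 3 && strcmp(w + n - 3, "ing") == 0) {
        size_t rem = n - 3;
        if (rem > 1 && !(rem == 2 && w[0] == 't' && w[1] == 'h'))
            w[rem] = '\0';
        return;
    }
    /* ends in ed preceded by a consonant: delete it unless one letter remains */
    if (n > 3 && strcmp(w + n - 2, "ed") == 0 && !is_vowel(w[n - 3]))
        w[n - 2] = '\0';
}

int main(void)
{
    char words[][16] = { "studies", "hates", "cats", "engineering", "engineered" };
    for (size_t i = 0; i < sizeof words / sizeof words[0]; i++) {
        strip_ending(words[i]);
        printf("%s\n", words[i]);   /* study hate cat engineer engineer */
    }
    return 0;
}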

  45. Some Methods: Porter Stemming Algorithm (1980) • Removes affixes by applying a set of condition/action rules • conditions on the stem • conditions on the suffix • conditions on the rules • different combinations of conditions activate different rules • Uses linguistic knowledge • Implementation: stem.c • Stem(word) { ... ReplaceEnd(word, step1a_rule); rule = ReplaceEnd(word, step1b_rule); if (rule == 106 || rule == 107) ReplaceEnd(word, step1b1_rule); ... }

  46. Basic stemming methods: Sound-based stemming • Soundex rules map each letter to a numeric equivalent: • B, F, P, V → 1 • C, G, J, K, Q, S, X, Z → 2 • D, T → 3 • L → 4 • M, N → 5 • R → 6 • A, E, I, O, U, W, Y → not coded • Words that sound similar often have the same code • The code is not unique • High compression rate
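A sketch in C of this sound-based coding. It keeps the first letter and pads/truncates to a four-character code, which is the usual Soundex convention rather than something the slide states; the digit table is the one above.

#include <ctype.h>
#include <stdio.h>

/* Digit for one letter according to the table above; 0 means "not coded". */
static char code_of(char c)
{
    switch (toupper((unsigned char)c)) {
    case 'B': case 'F': case 'P': case 'V':                 return '1';
    case 'C': case 'G': case 'J': case 'K':
    case 'Q': case 'S': case 'X': case 'Z':                 return '2';
    case 'D': case 'T':                                     return '3';
    case 'L':                                               return '4';
    case 'M': case 'N':                                     return '5';
    case 'R':                                               return '6';
    default: /* A, E, I, O, U, W, Y and anything else */    return 0;
    }
}

/* Build a four-character code: first letter kept, then digits,
   skipping uncoded letters and adjacent repeated digits. */
static void soundex(const char *word, char out[5])
{
    int len = 0;
    char prev = code_of(word[0]);
    out[len++] = (char)toupper((unsigned char)word[0]);
    for (const char *p = word + 1; *p && len < 4; p++) {
        char d = code_of(*p);
        if (d && d != prev)
            out[len++] = d;
        prev = d;
    }
    while (len < 4)
        out[len++] = '0';          /* pad short codes with zeros */
    out[4] = '\0';
}

int main(void)
{
    char c1[5], c2[5];
    soundex("Smith", c1);
    soundex("Smyth", c2);
    printf("%s %s\n", c1, c2);     /* both print S530 */
    return 0;
}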

  47. Basic stemming methods: N-gram stemmers • An n-gram is a sequence of n consecutive letters • Conflates terms based on the number of n-grams (sequences of n consecutive letters) that they share • Bigrams (digrams) or trigrams are often used • Terms that are strongly related by the number of shared n-grams are clustered • Heuristics help in detecting the root form • Language-independent technique • Example: the digrams of the word "statistics" are st ta at ti is st ti ic cs; the digrams of "statistical" are st ta at ti is st ti ic ca al

  48. Basic stemming methods: N-gram stemmers • The similarity of two words can be calculated with the Dice coefficient: S = 2C / (A + B) • where • A is the number of unique digrams in the first word • B is the number of unique digrams in the second word • C is the number of unique digrams shared by the two words
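A small C sketch of this measure for digrams (n = 2): collect the unique digrams of each word, count the shared ones, and apply S = 2C / (A + B). For the example above, "statistics" has 7 unique digrams, "statistical" has 8, and they share 6, so the program prints 0.80.

#include <stdio.h>
#include <string.h>

#define MAX_GRAMS 64

/* Collect the unique digrams of `word` into `grams`; return their count. */
static int unique_digrams(const char *word, char grams[][3])
{
    int count = 0;
    size_t len = strlen(word);
    for (size_t i = 0; i + 1 < len; i++) {
        char g[3] = { word[i], word[i + 1], '\0' };
        int seen = 0;
        for (int j = 0; j < count; j++)
            if (strcmp(grams[j], g) == 0) { seen = 1; break; }
        if (!seen && count < MAX_GRAMS)
            strcpy(grams[count++], g);
    }
    return count;
}

/* Dice similarity S = 2C / (A + B) over unique digrams. */
static double digram_similarity(const char *w1, const char *w2)
{
    char g1[MAX_GRAMS][3], g2[MAX_GRAMS][3];
    int a = unique_digrams(w1, g1);
    int b = unique_digrams(w2, g2);
    int c = 0;
    for (int i = 0; i < a; i++)
        for (int j = 0; j < b; j++)
            if (strcmp(g1[i], g2[j]) == 0) { c++; break; }
    return (a + b) ? 2.0 * c / (a + b) : 0.0;
}

int main(void)
{
    printf("%.2f\n", digram_similarity("statistics", "statistical"));  /* 0.80 */
    return 0;
}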

  49. 4- Identification of phrases and collocations (optional) • Phrases • Good indicators of a text's content (especially noun and prepositional phrases) • Important concepts in the subject domain • e.g. joint venture • Less ambiguous than the single words they are composed of

  50. 4- Identification of phrases and collocations (optional) • Recognition of phrases • Use of an MRD containing phrases: • only practical in restricted subject domains • Statistical approach: • assumption: words that often co-occur might denote a phrase • for phrases this is not always correct and meaningful • Linguistic (language-dependent) approach
