CS533 Information Retrieval Dr. Michal Cutler Lecture #24 April 28, 1999
Relevance feedback • The main idea • Issues • Query modification examples
Relevance Feedback • A technique for modifying a query • The weights of query terms may be modified and/or new terms added to the query • Relevance feedback is a very powerful technique, which yields significant improvement in retrieval results
Using Relevance feedback (traditional) • Use initial query to retrieve some items • Ask the user to judge retrieved items as relevant/non-relevant • (a binary true/false judgment, with no fuzzier notions such as “very” or “somewhat” relevant) • Modify query
Main idea Modified query Original query Relevant documents Non relevant documents Move query towards good items, away from bad ones
Changing the query - how? • Change the weights of query terms? • Add new query terms? How many new terms, and which ones, should be included in the modified query? • In TREC, usually only the “best” 20-100 terms from the relevant documents are added • What weights should new terms get? • Delete some query terms?
Modify based on what? • Original query? • Relevant retrieved documents? • Use or ignore (non relevant & retrieved) documents? • Use non-retrieved items? How?
All retrieved documents? • Long documents may cover many topics and move the query in undesirable directions • Should all retrieved documents be used, or maybe only some (for example ignore long ones)?
Part or whole document? • Use only “good passages” instead of the whole document?
Not enough information? • Query has 0 hits? • Hits but no relevant documents? • Only one relevant document?
Query Modification for vector space model • In this method the original query is ignored and a new query is formed based on the retrieved set • Given a set R of relevant retrieved items and a set N of nonrelevant retrieved items
Query Modification • Let Di be the weight vector for document i. • This method computes • an average weight vector from good items and • subtracts from it • the average vector of the bad items
Query Modification • The new query is: Q1 = (1/|R|) Σ{Di in R} Di − (1/|N|) Σ{Di in N} Di
Query Modification (2) • Original query Q0 is retained: Q1 = αQ0 + (β/|R|) Σ{Di in R} Di − (γ/|N|) Σ{Di in N} Di • The three parameters α, β, γ are determined experimentally, or using some learning technique
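This weighted combination of the original query with the mean relevant and nonrelevant vectors is the classical Rocchio update. A minimal Python sketch follows; the documents, term weights, and default parameter values are illustrative assumptions, not from the lecture:

```python
# Rocchio-style query modification (sketch; weights and parameters are illustrative).
# Q1 = alpha*Q0 + beta*(mean of relevant vectors) - gamma*(mean of nonrelevant vectors)

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    terms = set(q0)
    for d in relevant + nonrelevant:
        terms.update(d)
    q1 = {}
    for t in terms:
        pos = sum(d.get(t, 0.0) for d in relevant) / len(relevant) if relevant else 0.0
        neg = sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant) if nonrelevant else 0.0
        w = alpha * q0.get(t, 0.0) + beta * pos - gamma * neg
        if w > 0:                      # negative weights are usually dropped
            q1[t] = w
    return q1

q0 = {"retrieval": 1.0}
rel = [{"retrieval": 0.8, "feedback": 0.6}]
non = [{"retrieval": 0.2, "football": 0.9}]
print(rocchio(q0, rel, non))
```

Note how the query moves toward the relevant document (the new term "feedback" is added) and away from the nonrelevant one ("football" ends up with a negative weight and is dropped).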
Feedback without relevance information • In well-performing retrieval systems the top-ranked n documents (for small n) have high precision • Experiments (TREC) have shown that assuming all top n documents are relevant and performing positive feedback improves retrieval results
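A minimal sketch of this blind (pseudo-relevance) feedback idea: treat the top-n ranked documents as relevant and add their highest-weighted new terms to the query. The document vectors, parameter values, and positive-only update rule are illustrative assumptions:

```python
# Pseudo-relevance (blind) feedback sketch: assume the top-n ranked documents
# are relevant and add their highest-weighted terms to the query.
# Ranking order and term weights here are illustrative.

def blind_feedback(query, ranked_docs, n=2, terms_to_add=3, beta=0.5):
    top = ranked_docs[:n]                          # assume these are relevant
    pooled = {}                                    # mean term weights of the top docs
    for d in top:
        for t, w in d.items():
            pooled[t] = pooled.get(t, 0.0) + w / n
    # add only the best new terms (TREC systems typically add the "best" 20-100)
    candidates = sorted((t for t in pooled if t not in query),
                        key=lambda t: pooled[t], reverse=True)
    new_q = dict(query)
    for t in candidates[:terms_to_add]:
        new_q[t] = beta * pooled[t]
    return new_q

query = {"ir": 1.0}
ranked = [{"feedback": 0.9, "ir": 0.5}, {"query": 0.8, "feedback": 0.4}, {"noise": 1.0}]
print(blind_feedback(query, ranked, n=2, terms_to_add=1))
```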
Feedback with other retrieval models • Feedback has been employed for other ranking retrieval models such as the probabilistic and fuzzy models • Pure Boolean systems use feedback to extend Boolean queries
Using relevance feedback (Dunlop 1997) • Main idea: use also • (relevant & nonmatching) documents and/or • (non relevant & nonmatching) documents • Such documents can be found by browsing hypertext or hypermedia collections
Using relevance feedback (Dunlop 1997) • A (relevant & non matching) document should affect the query more than a (relevant & matching) one • A (non relevant & non matching) document should affect the query less than a (non relevant & matching) one
Automatic abstract generation • AI approach • IR approach • Examples of text extraction systems
Types of Abstracts • Automatic creation of summaries of texts • Short summaries indicating what a document is about • Longer summaries which capture the main information contained in the text
Abstracts types • Summary in response to a query • summarizing more than one text, or • creating a summary of portions of the text which are relevant to the query
Abstracts classification • Classified as indicative and/or informative • Indicative abstracts - help reader decide whether to read document • Informative abstracts - contain also informative material such as main results and conclusions.
Abstracts • In this case a user may not need to read the paper • Critical or comparative material is more difficult to generate and is ignored in this discussion
Evaluation criteria • Cohesion • Balance and coverage • Repetition • Length
Automatic abstracting • Information retrieval approach involves • Selection of portions from the text, • An attempt to make them belong together
Automatic abstracting • Artificial intelligence approaches to text summarization: • Extract semantic information, • Instantiate pre-defined constructs • Use instantiated constructs to generate a summary
An artificial intelligence approach • DeJong’s FRUMP system analyses news articles by: • Instantiating slots in one of a predefined set of scripts • Using the instantiated script to generate a summary
An artificial intelligence approach • In Rau’s SCISOR system, a detailed linguistic analysis of a text results in the construction of semantic graphs. • A natural language generator produces a summary from the stored material
The artificial intelligence approach • Systems are only capable of summarizing text in a narrow domain • The artificial intelligence approaches are fragile: if the system does not recognize the main topic, its extract may be erroneous
Automatic text extraction • First experiment reported by Luhn (1958) • Provided extracts • An extract is a set of sentences (or paragraphs) selected to provide a good indication of the subject matter of the document
Luhn’s approach 1. For each sentence • look for clues of its importance, • compute a score for the sentence based on the clues found in it
Luhn’s approach 2. Select all the sentences with a score above a threshold, or the highest scoring sentences up to a predefined sum of the scores 3. Print the sentences in their order of occurrence in the original text
Concept importance (Luhn) • Extracted words, • Eliminated stop words, • Did some conflation (stemming), • Selected words with frequency above a threshold • (First to associate concept importance with frequency)
Sentence importance (Luhn) • Looked for clusters of keywords in a sentence and • Based the sentence score on these clusters • A cluster is formed by significant words that occur close together (no more than 4 non-significant words between two significant words)
Sentence score (Luhn) • If the length of the cluster is X, • and it contains Y significant words, • the score of the cluster is Y²/X • The score of the sentence is the highest cluster score, or 0 if the sentence has no clusters
Example (Luhn) • Sentence is (- -[*-**- -*] - -) (11 words) • Length of cluster X = 7 (number of words in brackets) • Y = 4 (number of significant (*) words) • Sentence score is Y²/X = 16/7 ≈ 2.3
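Luhn's cluster detection and scoring can be sketched in Python. Representing a sentence as one boolean per word (True = significant keyword) is an assumption for illustration; the example sentence reproduces the worked example above:

```python
# Luhn sentence scoring sketch. A sentence is a list of booleans
# (True = significant keyword). A cluster runs from one significant word to
# another with at most 4 non-significant words between consecutive keywords.

def luhn_score(sentence, max_gap=4):
    positions = [i for i, sig in enumerate(sentence) if sig]
    if not positions:
        return 0.0                         # no keywords, no clusters
    best = 0.0
    start = prev = positions[0]
    count = 1
    for p in positions[1:]:
        if p - prev - 1 <= max_gap:        # still inside the same cluster
            prev = p
            count += 1
        else:                              # close the cluster, start a new one
            best = max(best, count * count / (prev - start + 1))
            start = prev = p
            count = 1
    best = max(best, count * count / (prev - start + 1))
    return best

# The example sentence (- -[*-**- -*] - -): 11 words,
# cluster of length X=7 containing Y=4 significant words -> 16/7
sent = [False, False, True, False, True, True, False, False, True, False, False]
print(luhn_score(sent))
```

Selecting the highest-scoring sentences and printing them in their original order (steps 2 and 3 of Luhn's approach) then amounts to sorting sentence indices by this score and re-sorting the chosen indices by position.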
Clues for word importance (Edmunson 1964) • Keywords in titles • Keywords selected from the title, subtitle and headings of the document have higher score • Edmunson eliminated stop words, and gave higher scores to terms from the main title than from lower level headings
Sentence importance clues (Edmunson) • The location of the sentence • Frequently the first and the last sentence of a paragraph are the most important sentences
Sentence importance clues (Edmunson) • Edmunson used this observation to score sentences using their location • in a paragraph, • in a document (first few and last few paragraphs are important), • below a heading, etc.
Clues for sentence importance (Edmunson) • Certain words and phrases, which are not keywords, provide information on sentence importance • Used 783 bonus words, which increase the sentence score, and • 73 stigma words, which decrease the score
Clues for sentence importance (Edmunson) • Bonus words include superlatives and value words such as “greatest” “significant” • Stigma words include anaphors and belittling expressions such as “hardly” “impossible”
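Combining Edmunson's cues (title keywords, sentence location, bonus/stigma words) into one sentence score can be sketched as follows. The word lists and cue weights below are illustrative stand-ins, not the actual 783 bonus / 73 stigma word lists:

```python
# Edmunson-style sentence scoring sketch combining three cues: title keywords,
# location, and bonus/stigma (cue) words. Lists and weights are illustrative.

BONUS = {"greatest", "significant"}     # stand-in for the 783 bonus words
STIGMA = {"hardly", "impossible"}       # stand-in for the 73 stigma words

def score_sentence(tokens, title_words, position, n_sentences,
                   w_title=2.0, w_loc=1.0, w_cue=1.0):
    score = 0.0
    # title cue: keywords from the title score higher
    score += w_title * sum(1 for t in tokens if t in title_words)
    # location cue: first/last sentence of a paragraph is often most important
    if position == 0 or position == n_sentences - 1:
        score += w_loc
    # cue-word score: bonus words raise it, stigma words lower it
    score += w_cue * (sum(1 for t in tokens if t in BONUS)
                      - sum(1 for t in tokens if t in STIGMA))
    return score

print(score_sentence(["a", "significant", "result"], {"result"}, 0, 5))
```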
Indicator phrases (Rush 1973) • Used a word control list as positive and negative indicators • Negative indicators eliminated sentences • Strong positives were expressions about the text topic: “our work”, “the purpose of”
Indicator constructs (Paice 1981) • More elaborate positive constructs: • “The main aim of the present paper is to describe” • “The purpose of this article is to review” • “our investigation has shown that”
Indicative constructs (Paice) • There are only 7 or 8 distinctive types of indicative phrases, which can be identified by matching against a template allowing substitution of alternative words or phrases • Not all texts contain such phrases • They are useful when they can be found
Skorokhod’ko’s extraction (1972) • Builds a semantic structure for the document • Generates a graph, • Sentences are nodes, • Sentences that refer to same concepts (thesaurus) are connected by an edge
Skorokhod’ko’s extraction • The most significant sentences are those which are related to a large number of other sentences • Such sentences are prime candidates for extraction
Skorokhod’ko’s extraction • Sentences are scored based on: • Number of sentences to which they are significantly related and • Degree of change in the graph structure which would result from a deletion of the sentence
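The first of these two scores (the number of sentences a sentence is significantly related to) is the sentence's degree in the graph. A minimal sketch; representing each sentence as a set of thesaurus concepts and connecting any two sentences that share a concept are simplifying assumptions:

```python
# Skorokhod'ko-style sketch: sentences are sets of (thesaurus) concepts;
# two sentences are connected by an edge if they share a concept, and a
# sentence's basic score is its degree (number of related sentences).
# The concept sets below are illustrative.

def degrees(sentences):
    n = len(sentences)
    deg = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if sentences[i] & sentences[j]:    # shared concept -> edge
                deg[i] += 1
                deg[j] += 1
    return deg

sents = [{"retrieval", "query"}, {"query", "feedback"}, {"weather"}]
print(degrees(sents))   # the "weather" sentence is related to nothing
```

The second score (how much the graph structure changes when a sentence is deleted) could be approximated by recomputing connectivity on the graph with that node removed, which this sketch omits.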
Text cohesion • Extracts discussed so far suffer from lack of cohesion • We discuss the lack of cohesion caused by explicit references in a sentence which can only be understood by referring to material elsewhere in the text
Text cohesion • This covers: • anaphoric references (he, she, etc.), • lexical or definite references (these objects, the oldest), and • the use of rhetorical connectives (“so”, “however”, “on the other hand”) • Other levels of coherence are not addressed
Text cohesion (Rush) • Rush attempted to deal with the problem of anaphors by either adding the preceding sentences, or, if more than three preceding sentences would need to be added, by deleting the sentence