CS533 Information Retrieval Dr. Michal Cutler Lecture #10 February 24, 1999
This class • Review of models covered so far • Knowledge in Artificial Intelligence • Knowledge based information retrieval
Models covered so far • Pure Boolean • Fuzzy Boolean models • Vector space model • Latent semantic indexing • Probabilistic models
Knowledge and Artificial Intelligence • Early AI systems were based on general problem solving and search techniques • They had limited usefulness because they lacked knowledge about the domain of the problems they dealt with • Knowledge based systems provided better solutions
Knowledge Representation • Knowledge representation and reasoning with knowledge became a major research area in AI • Many techniques such as semantic nets, frames, and production rules were used for representing knowledge
Knowledge Representation • It became evident that successful natural language understanding requires the representation of a large amount of “common sense” knowledge • Building knowledge based systems is time consuming and requires maintenance when the knowledge is dynamic
Knowledge Based IR • Knowledge based information retrieval attempts to identify the occurrence of high level concepts in a document • We will discuss a specific knowledge based information retrieval system • Lecture is based on on-line information on the definition and usage of topic sets
RUBRIC/TOPIC • Technique similar to frame based methods in natural language understanding • Concepts and their relationships represent the knowledge needed for retrieval • Evidential reasoning provides the link between a document and its concepts
What Are Topics? • A topic groups information related to a concept or a subject area. • Topics enable the encapsulation of knowledge • When you add topics to a Verity search application, users can select those topics when they define their queries.
Topic Organization • Topics organize search criteria in a format similar to an outline. • Operators and modifiers join related groups of search criteria. • Topics may be independent units, or units with relationships to other topics in a hierarchical structure.
Weight Assignments • By assigning weights to search criteria, the most relevant documents receive the highest scores. • A weight is between 0.01 and 1.00. • A weight of 1.00 indicates that a search criterion is of great interest. • A document's score is determined by accumulating the evidence for the search criteria, scaled by their weights.
Knowledge Bases • The knowledge base for a Topic application can consist of topic sets. • Knowledge bases offer the ability to find information without having to compose sophisticated queries. • Typically, the administrator of Topic sets up the knowledge base and maintains its contents.
Topic Sets • The subject of a topic is identified by its name. • In the example below, the topic is performing-arts. • This topic is composed of its name, performing-arts, and its evidence topics, ballet, drama, dance, opera, symphony, and mime.
The performing arts topic
performing-arts ACCRUE
    WORD ballet
    WORD drama
    WORD dance
    WORD opera
    WORD symphony
    NOT-WORD mime
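One way to picture the performing-arts topic is as a small tree: a named parent with an operator and a list of evidence topics. The sketch below is an illustration in Python only, not Verity's topic file format. The `Topic` class, its `negated` flag (standing in for the NOT modifier on mime), and the `evidence_words` helper are hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """Hypothetical in-memory form of a topic; not Verity's actual format."""
    name: str
    operator: str = "WORD"      # WORD marks an evidence topic (a leaf)
    negated: bool = False       # stands in for the NOT modifier
    children: list = field(default_factory=list)


performing_arts = Topic(
    "performing-arts", "ACCRUE",
    children=[
        Topic("ballet"), Topic("drama"), Topic("dance"),
        Topic("opera"), Topic("symphony"),
        Topic("mime", negated=True),
    ],
)


def evidence_words(topic):
    """Return (wanted, excluded) evidence words found under a topic."""
    wanted, excluded = [], []
    for child in topic.children:
        if child.operator == "WORD":
            (excluded if child.negated else wanted).append(child.name)
        else:
            sub_wanted, sub_excluded = evidence_words(child)
            wanted += sub_wanted
            excluded += sub_excluded
    return wanted, excluded


wanted, excluded = evidence_words(performing_arts)
print(wanted)
print(excluded)
```

Keeping the operator on the parent and the modifier on the child mirrors the slide: ACCRUE says how the evidence is combined, while NOT qualifies a single evidence topic.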
Operators and modifiers • Operators represent logic to be applied to evidence topics. • This logic qualifies the kinds of documents needed. • Modifiers apply further logic to evidence topics.
Adding a topic • A modifier can specify that documents containing an evidence topic not be included in the result. • More general and less general topics may be added. • The topic film is added to the performing-arts structure to form the top-level topic, art.
Topic hierarchy • Sophisticated topics are composed of top-level topics, subtopics, and evidence topics. • A topic set consists of several top-level topics. • Note that subtopics and evidence topics can be used by multiple top-level topics.
The performing arts topic
art ACCRUE
    performing-arts ACCRUE
        WORD ballet
        WORD drama
        WORD dance
        WORD opera
        WORD symphony
        NOT-WORD mime
    film ACCRUE
        WORD film
        motion-pictures OR
            WORD movie
        art-films OR
            ...
The liberal-arts topic • liberal-arts ACCRUE • literature ACCRUE • philosophy ACCRUE • language ACCRUE • history ACCRUE • art ACCRUE • performing-arts ACCRUE • film ACCRUE • visual-arts ACCRUE • video OR
Document Scoring • If an evidence topic is present (absent), its score is 1.00 (0.00). • If an evidence topic is weighted, • the scores of the evidence topics are multiplied by the weights, • then the resulting products are combined as specified by the operator of the parent topic.
Document Scoring • If this parent topic is, in turn, the child of another topic which is being searched, • its score is multiplied by its assigned weight, and • the resulting product is combined with the products of its siblings in a manner specified by the operator assigned to the parent topic.
Document Scoring • This process continues until the top-level topic being searched is reached. • For ACCRUE, the result is the highest product of each child's weight and score, with a little added to the score for each child which is present in the document.
Document Scoring • For AND (OR), the lowest (highest) product of each child's weight and score is taken. • If a child uses a proximity operator (PHRASE, SENTENCE, or PARAGRAPH), or a relational operator, the child receives a score of 1.00 if the topic is present, and a score of 0.00 if the topic is not present.
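The scoring rules on the last four slides can be sketched as one recursive function. This is a minimal illustration under stated assumptions, not Verity's implementation: `Topic`, `is_present`, `score`, and `ACCRUE_BONUS` are hypothetical names; a document is assumed to be a list of paragraphs, each a list of sentences, each a list of lowercase words; PHRASE is assumed to mean the words occur adjacent and in order. The slides do not say how much "a little" is for ACCRUE, so the bonus value is arbitrary, and it is applied only to present children beyond the first so that a single matching child yields exactly its product, as in the Boeing example that follows.

```python
from dataclasses import dataclass, field

ACCRUE_BONUS = 0.01  # assumption: the slides give no value for "a little"


@dataclass
class Topic:
    name: str
    operator: str = "WORD"   # WORD, AND, OR, ACCRUE, PHRASE, SENTENCE, PARAGRAPH
    weight: float = 1.0
    children: list = field(default_factory=list)


def is_present(topic, doc):
    """Presence test for a proximity topic (assumed semantics, see lead-in)."""
    words = [c.name.lower() for c in topic.children]
    if topic.operator == "PHRASE":      # adjacent and in order within a sentence
        n = len(words)
        return any(s[i:i + n] == words
                   for p in doc for s in p for i in range(len(s) - n + 1))
    if topic.operator == "SENTENCE":    # all words in one sentence
        return any(set(words) <= set(s) for p in doc for s in p)
    # PARAGRAPH: all words in one paragraph
    return any(set(words) <= {w for s in p for w in s} for p in doc)


def score(topic, doc):
    """Score a topic against a document, bottom-up."""
    op = topic.operator
    if op == "WORD":                    # evidence topic: 1.00 if present, else 0.00
        found = any(topic.name.lower() in s for p in doc for s in p)
        return 1.0 if found else 0.0
    if op in ("PHRASE", "SENTENCE", "PARAGRAPH"):
        return 1.0 if is_present(topic, doc) else 0.0
    # Multiply each child's score by its weight, then combine by operator.
    products = [c.weight * score(c, doc) for c in topic.children]
    if op == "AND":
        return min(products)
    if op == "OR":
        return max(products)
    if op == "ACCRUE":
        extra = max(0, sum(1 for p in products if p > 0) - 1)
        return min(1.0, max(products) + ACCRUE_BONUS * extra)
    raise ValueError(f"unknown operator: {op}")


# One paragraph with two sentences.
doc = [[["boeing", "defense", "contract"], ["the", "company", "grew"]]]
label = Topic("boeing-label", "AND", 0.5, [Topic("boeing"), Topic("company")])
print(score(label, doc))
```

The weight of a topic is applied by its parent, not by the topic itself, which is why `score(label, doc)` ignores the 0.5 on `label`.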
The Boeing topic
BOEINGCO OR
    0.5 boeing-comps OR
    0.5 boeing-label AND
        WORD Boeing
        WORD Company
    0.5 boeing-people ACCRUE
The Boeing topic
0.5 boeing-comps OR
    0.5 boeing-comp-service PHRASE
        WORD Boeing
        WORD computer
        WORD services
    0.5 boeing-aerospace SENTENCE
        WORD Boeing
        WORD aerospace
        WORD electronics
    0.5 boeing-defense PARAGRAPH
        WORD Boeing
        WORD defense
The Boeing topic
0.5 boeing-people ACCRUE
    0.8 paul-binder PHRASE
        WORD Paul
        WORD Binder
    0.5 arthur-hitsman PHRASE
        WORD Arthur
        WORD Hitsman
    0.3 ted-johnson PHRASE
        WORD Ted
        WORD Johnson
Example document • Evidence topics boeing, computer, and services appear in the same phrase • Evidence topics boeing and defense appear in the same paragraph • Evidence topics boeing and company appear in the document • Evidence topics ted and johnson appear in the same phrase
The scores • boeing-comps, which uses the OR operator, has a score of 0.50. • boeing-people, which uses the ACCRUE operator, has a score of 0.30. • BOEINGCO, which uses OR, compares the products of each child's weight and score, and takes the highest product • The document is scored as 0.50.
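The arithmetic behind these scores can be traced step by step. The sketch below uses the weights from the Boeing topic slides and the matches listed on the example document slide; it assumes the topics not listed there (the aerospace sentence, paul-binder, arthur-hitsman) are absent. Variable names are illustrative only.

```python
# Presence (1.0) or absence (0.0) of each child, read from the example document slide.
comp_service = 1.0    # boeing, computer, services appear in a phrase
aerospace = 0.0       # assumption: not listed, so treated as absent
defense = 1.0         # boeing and defense appear in a paragraph
ted_johnson = 1.0     # ted and johnson appear in a phrase
paul_binder = 0.0     # assumption: not listed, so treated as absent
arthur_hitsman = 0.0  # assumption: not listed, so treated as absent

# boeing-comps (OR): highest product of child weight and score.
boeing_comps = max(0.5 * comp_service, 0.5 * aerospace, 0.5 * defense)

# boeing-label (AND over WORD Boeing and WORD Company, both present): lowest product.
boeing_label = min(1.0, 1.0)

# boeing-people (ACCRUE): highest product; with one child present the slide reports 0.30.
boeing_people = max(0.8 * paul_binder, 0.5 * arthur_hitsman, 0.3 * ted_johnson)

# BOEINGCO (OR): each child's score times its 0.5 weight, then the highest product.
products = [0.5 * boeing_comps, 0.5 * boeing_label, 0.5 * boeing_people]
boeingco = max(products)

print(boeing_comps, boeing_label, boeing_people, boeingco)
```

Under these assumptions the winning branch is boeing-label: its score of 1.00 times its weight of 0.5 gives 0.50, ahead of 0.25 for boeing-comps and 0.15 for boeing-people.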
The Query Language • Evidence Operators • Proximity Operators • Relational Operators • Concept Operators • Boolean Operators • Score Operators • Modifiers
Relational Operators • Relational operators search document fields (such as AUTHOR). • They perform a filtering function by selecting documents that contain specified field values. • Documents retrieved using relational operators are not relevance-ranked.
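The filtering role of relational operators can be illustrated with a short sketch. It assumes documents are plain dictionaries with field names as keys and shows only an equality test; `filter_by_field` and the sample records are hypothetical and do not reproduce any actual operator syntax.

```python
def filter_by_field(docs, field_name, value):
    """Keep documents whose field equals value.

    The result is a plain selection: input order is preserved and
    no relevance score is computed.
    """
    return [d for d in docs if d.get(field_name) == value]


# Made-up records for illustration only.
docs = [
    {"id": 1, "AUTHOR": "Smith", "TITLE": "Ballet today"},
    {"id": 2, "AUTHOR": "Jones", "TITLE": "Opera houses"},
    {"id": 3, "AUTHOR": "Smith", "TITLE": "Film history"},
]

selected = filter_by_field(docs, "AUTHOR", "Smith")
print([d["id"] for d in selected])
```

A filter like this would typically be combined with a scored topic query, so that the field test narrows the candidate set and the topic supplies the ranking.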
Relational Operators Also string comparison operators
Concept Operators Used with scores
Advantages • Overcomes the need for vocabulary overlap between query and document • By using phrases and adjacency operators, it can deal with polysemy • Enables the specification of very precise queries • Top retrieved documents have high precision
Disadvantages • Defining and fine tuning a topic set requires substantial work • The knowledge base must be well maintained