Information Retrieval (1955–1992)
• Primary users
• Law clerks
• Reference librarians
• (Some) news organizations, product research, congressional committees, medical/chemical abstract searches
• Primary search models
• Boolean keyword searches on abstract, title, keyword
• Vendors
• Mead Data Central (Lexis–Nexis)
• Dialog
• Westlaw
• Total searchable online data: O(10 terabytes)
Information Retrieval (1993+)
• Primary users
• First-time computer users
• Novices
• Primary search modes
• Still Boolean keyword searches, with limited probabilistic models
• But FULL TEXT retrieval
• Vendors
• Lycos, Infoseek, Yahoo, Excite, AltaVista, Google
• Total online data: ???
Growth of the Web
[Chart: number of web sites / volume of web traffic, 1992–1998, showing exponential growth after Mosaic and Netscape; volume doubling every 6 months]
Observation
• Early IR systems basically extended library catalog systems, allowing
• Keyword searches
• Limited abstract searches, in addition to Author/Title/Subject, including Boolean combination functionality
• IR was seen as reference retrieval (full documents still had to be ordered/delivered by hand)
In Contrast
• Today, IR has a much wider role in the age of digital libraries
• Full document retrieval (hypertext, PostScript, or optical image (TIFF) representations)
• Question answering
Old View
• Function of IR: map queries to relevant documents
New View
• Satisfy the user's information need
• Infer goals/information need from:
• the query itself
• past user query history
• user profiling (aol.com vs. CS dept.)
• collective analysis of other user feedback on similar queries
• In addition, return information in a format useful/intelligible to the user
• weighted orderings
• clusterings of documents by different attributes
• visualization tools
• Text understanding techniques to extract the answer to a question, or at least a subregion of the text
• "Who is the current mayor of Columbus, Ohio?"
• We don't need the full AP/CNN article on city scandals, just the answer (and an available source for proof)
Boolean Systems
Function #1: Provide a fast, compact index into the database (of documents or references)
• Index granularity options
• Document number
• Page number within document
• Actual word offset
• Data structure: inverted file (e.g., postings lists for "Chihuahua" and "Nanny")
Boolean Operations
• Chihuahua AND Nanny → join (∩ of postings lists)
• Chihuahua OR Nanny → union (∪ of postings lists)
• Proximity searches: Chihuahua W/3 Nanny
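The inverted file and the Boolean operations above can be sketched in a few lines. The corpus and document IDs below are hypothetical; document-number granularity is assumed.

```python
# Hypothetical toy corpus; document IDs are the index granularity here.
docs = {
    1: "the chihuahua nanny arrived",
    2: "nanny services for dogs",
    3: "chihuahua breeding club",
}

# Build the inverted file: term -> set of document IDs containing it.
inverted = {}
for doc_id, text in docs.items():
    for term in text.split():
        inverted.setdefault(term, set()).add(doc_id)

def boolean_and(t1, t2):
    """AND = intersection (join) of posting sets."""
    return inverted.get(t1, set()) & inverted.get(t2, set())

def boolean_or(t1, t2):
    """OR = union of posting sets."""
    return inverted.get(t1, set()) | inverted.get(t2, set())

print(boolean_and("chihuahua", "nanny"))  # {1}
print(boolean_or("chihuahua", "nanny"))   # {1, 2, 3}
```

Proximity searches (W/3) would additionally require word-offset granularity in the postings, which this sketch omits.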
Vector IR Model
• Find an optimal mapping f( ) from documents d1, d2 and query Q to vectors V1, V2, VQ such that:
• Sim(Vi, VQ) ≈ Sim′(di, Q)
• Sim(V1, V2) ≈ Sim′(d1, d2)
• Typical similarity measure: cosine distance
Vector Models
• V1: a bit vector capturing the essence/meaning of D1
• Given a query Q1, compute Sim(Vi, Q1) for each document vector Vi
• Return the document maximizing Sim(Vi, Q1)
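The cosine similarity used above can be sketched directly from its definition; the term vectors below are hypothetical examples over a shared vocabulary.

```python
import math

def cosine_sim(v1, v2):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Hypothetical vectors: one weight per vocabulary term.
v_doc = [1, 2, 0, 1]
v_query = [1, 1, 0, 0]
print(round(cosine_sim(v_doc, v_query), 3))  # 0.866
```

Ranking then amounts to computing this score against every document vector and sorting.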
Dimensionality Reduction
• d1 → f( ) → V1: initial (term) vector representation
• Dimensionality reduction (SVD/LSI) → V̂1: a more compact, reduced-dimensionality model of d1
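An SVD/LSI-style reduction can be sketched with a truncated SVD of a term-document matrix. The matrix below is a hypothetical example (rows = terms, columns = documents); k is the number of latent dimensions kept.

```python
import numpy as np

# Hypothetical term-document matrix: rows = terms, columns = documents.
A = np.array([
    [3, 0, 1, 0],
    [0, 2, 0, 0],
    [1, 0, 2, 1],
    [0, 1, 0, 2],
    [2, 0, 1, 0],
], dtype=float)

# Truncated SVD: keep the k largest singular values (LSI-style reduction).
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T  # each row: one document in k-dim space

print(doc_vectors.shape)  # (4, 2): 4 documents reduced from 5 term dims to 2
```

Queries are folded into the same reduced space before computing similarities, which is what makes the representation more compact than the raw term vectors.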
Condensing Term Vectors
• Raw term vector: one dimension per surface word (e.g., separate counts in D1 for "Japanese", "Japan", "Nippon", "Nihon", and "the")
• Condensed vector: cluster related words into a single dimension
• Offset (dimension) options: hash(w), hash(cluster(w)), hash(cluster(stem(w)))
• Stemming examples: books → book; computer, computation → comput
Collocations (Phrasal Terms)
• "Soap opera" as a phrasal term: d1 contains the collocation "soap opera"; d2 contains "soap" and "opera" separately
• A phrasal-term dimension distinguishes the two documents (d1: 1, d2: 0)
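Detecting whether a document contains the collocation (rather than the two words independently) can be sketched as an adjacency check; the example sentences are hypothetical.

```python
def has_collocation(text, phrase=("soap", "opera")):
    """True if the two words occur adjacently, i.e., as a phrasal term."""
    tokens = text.lower().split()
    return any(
        (tokens[i], tokens[i + 1]) == phrase for i in range(len(tokens) - 1)
    )

d1 = "I watched a soap opera last night"
d2 = "She bought soap and tickets to the opera"

print(has_collocation(d1))  # True: 'soap opera' occurs as a collocation
print(has_collocation(d2))  # False: both words appear, but never adjacently
```

A real system would use a phrasal dictionary or collocation statistics rather than a hard-coded phrase, but the vector consequence is the same: a separate dimension that fires for d1 and not d2.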
Meaning Vectors
• f(d1) = m1, f(d2) = m2: abstractly, the vector is a compressed document (meaning preserving), a "meaning" or context vector representation
• Compression: m1 = m2 iff d1 = d2 (f( ) must be invertible)
• Summarization: m1 = m2 iff d1 and d2 are about the same thing (mean the same thing)
What is the optimal method for meaning-preserving compression?
• Issues
• Size of representation (ideally size(Vi) << size(Di))
• Cost of computing the vectors: a one-time cost at model creation
• Cost of the similarity function: it must be computed for each query, so minimizing it is crucial to speed
Header Processing
• Retain/model cross references to cited (Xref) articles/mail
Body Processing / Term Weighting
• 1. Remove (most) function words (but words such as "NOT" must be treated specially)
• 2. Downweight terms by frequency
• 3. Use text analysis to decide which function words carry meaning:
• "Mr. Dong Ving The of the Hanoi branch" (here "The" is part of a name)
• "The Who" (a band name)
• (use named entity analysis and phrasal dictionaries)
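Steps 1 and 2 above can be sketched as stopword removal followed by a tf·idf-style downweighting of frequent terms. The stop list and corpus are hypothetical; note that "not" is deliberately excluded from the stop list, per step 1.

```python
import math
from collections import Counter

# Hypothetical stop list; 'not' is kept because it carries Boolean meaning.
STOP_WORDS = {"the", "of", "a", "an", "and", "to", "in", "is"}

def weighted_terms(doc, corpus):
    """tf * idf weighting after removing (most) function words."""
    tokens = [w for w in doc.lower().split() if w not in STOP_WORDS]
    tf = Counter(tokens)
    n = len(corpus)
    weights = {}
    for term, count in tf.items():
        df = sum(1 for d in corpus if term in d.lower().split())
        # Terms occurring in many documents get downweighted.
        weights[term] = count * math.log((n + 1) / (df + 1))
    return weights

corpus = [
    "the chihuahua is a small dog",
    "the nanny walked the dog",
    "soap opera ratings fell",
]
w = weighted_terms(corpus[0], corpus)
# 'dog' appears in 2 docs, 'chihuahua' in 1, so 'chihuahua' weighs more.
print(w["chihuahua"] > w["dog"])  # True
```

Step 3 (named entity analysis to rescue meaningful function words like "The Who") would sit in front of the stop-list filter and is not shown here.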
Supervised Learning / Training
• Input: a data stream of messages
• Output: labeled (routed) messages into classes such as Project #1, Chihuahua Breeding Club, Personal, and Junk mail (labels A, B, C, J)
• One recognizer per class; training is in real time (ongoing), with collective discrimination among the recognizers
Other Related Problems: Mail/News Routing and Filtering
• Route a data stream into inboxes (Project #1 at work, Project #2 at work, Chihuahua breeding, Scuba club, Personal, Junk mail) to prioritize/partition mail reading
• These systems typically model long-term information needs (people put effort into training and user feedback that they aren't willing to invest for single query-based IR)
Features for classification • Subject line • Source/Sender • X-annotations • Date/time • Length • Other recipients • Message content (regions weighted differently)
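A minimal sketch of the supervised routing idea: each inbox (class) is modeled by term counts learned from labeled training messages, and new messages go to the best-matching class. The class labels, training messages, and scoring rule are all hypothetical simplifications; a real router would also use the header features listed above (sender, subject line, etc.) with region-specific weights.

```python
from collections import Counter, defaultdict

def train(labeled_messages):
    """Count term frequencies per class from (text, label) training pairs."""
    model = defaultdict(Counter)
    for text, label in labeled_messages:
        model[label].update(text.lower().split())
    return model

def route(text, model):
    """Route a message to the class whose term profile it overlaps most."""
    tokens = text.lower().split()
    scores = {
        label: sum(counts[t] for t in tokens) for label, counts in model.items()
    }
    return max(scores, key=scores.get)

# Hypothetical labeled training stream (A = club, B = work project, J = junk).
training = [
    ("chihuahua breeding club meeting tonight", "A"),
    ("puppy chihuahua photos from the club", "A"),
    ("quarterly project deadline and budget review", "B"),
    ("project milestone review scheduled", "B"),
    ("win free money now click here", "J"),
]
model = train(training)
print(route("chihuahua club newsletter", model))  # A
```

Because the model is updated from an ongoing labeled stream, training can continue in real time, matching the long-term-information-need setting described above.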
Probabilistic IR Models – Intermediate Topic Models/Detectors
• Documents d1, d2 are mapped by f( ) to vectors V1, V2
• Topic detectors (topic models) TD_A, TD_B, …, TD_E map each vector to a topic vector (e.g., TV1 = [0 1 0 1 0 0])
• Queries are represented in the same topic space (e.g., Q = [1 0 0 0 0 0]) for comparison
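A bank of topic detectors mapping a document to a binary topic vector can be sketched as follows. The topics, their term sets, and the firing threshold are all hypothetical; real topic detectors would be trained probabilistic models rather than keyword sets.

```python
# Hypothetical bank of topic detectors: each fires (1) when a document
# contains enough of its topic terms.
TOPIC_DETECTORS = {
    "sports": {"game", "score", "team"},
    "finance": {"stock", "market", "price"},
    "pets": {"dog", "chihuahua", "breeding"},
}

def topic_vector(tokens, detectors=TOPIC_DETECTORS, threshold=2):
    """Map a raw token list to a binary topic vector (one bit per detector)."""
    toks = set(tokens)
    return [1 if len(toks & terms) >= threshold else 0
            for terms in detectors.values()]

doc = "the team won the game with a record score".split()
print(topic_vector(doc))  # [1, 0, 0]: only the 'sports' detector fires
```

Comparing queries and documents in this low-dimensional topic space, rather than raw term space, is the point of the intermediate representation.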