Information Retrieval (1955-1992)



1. Information Retrieval (1955-1992)
• Primary users
  • Law clerks
  • Reference librarians
  • (Some) news organizations, product research, congressional committees, medical/chemical abstract searches
• Primary search models
  • Boolean keyword searches on abstract, title, and keyword fields
• Vendors
  • Mead Data Central (Lexis-Nexis)
  • Dialog
  • Westlaw
• Total searchable online data: O(10 terabytes)

2. Information Retrieval (1993+)
• Primary users
  • First-time computer users
  • Novices
• Primary search modes
  • Still Boolean keyword searches, with limited probabilistic models
  • But FULL TEXT retrieval
• Vendors
  • Lycos, Infoseek, Yahoo, Excite, AltaVista, Google
• Total online data: ???

3. Growth of the Web
[Chart: # of web sites / volume of web traffic, 1992-1998, growing exponentially; Mosaic and Netscape releases marked. Traffic volume doubling every 6 months.]

4. Observation
• Early IR systems basically extended library catalog systems, allowing
  • keyword searches, and
  • limited abstract searches, in addition to author/title/subject lookup, with Boolean combination functionality
• IR was seen as reference retrieval: full documents still had to be ordered/delivered by hand

5. In Contrast
• Today, IR has a much wider role in the age of digital libraries
  • Full document retrieval (hypertext, PostScript, or optical image (TIFF) representations)
  • Question answering

6. Old View vs. New View
• Old view: the function of IR is to map queries to relevant documents
• New view: satisfy the user's information need
  • Infer the user's goals/information need from:
    • the query itself
    • past user query history
    • user profiling (e.g., an aol.com user vs. a CS department user)
    • collective analysis of other users' feedback on similar queries

7. In addition, return information in a format useful/intelligible to the user
• Weighted orderings
• Clusterings of documents by different attributes
• Visualization tools
• Text understanding techniques to extract the answer to a question, or at least the relevant subregion of text
  • "Who is the current mayor of Columbus, Ohio?"
  • We don't need the full AP/CNN article on city scandals, just the answer (and an available source for proof)

8. Boolean Systems
Function #1: provide a fast, compact index into the database (of documents or references)
• Index options (granularity):
  • Document number
  • Page number within document
  • Actual word offset
• Data structure: inverted file
[Diagram: inverted-file posting lists for the terms "Chihuahua" and "Nanny"]
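
A minimal sketch of an inverted file in Python, indexing at word-offset granularity (the documents and IDs are illustrative):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a posting list of (doc_id, word_offset) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for offset, word in enumerate(text.lower().split()):
            index[word].append((doc_id, offset))
    return index

docs = {
    1: "the chihuahua nanny arrived",
    2: "a nanny for every chihuahua",
}
index = build_inverted_index(docs)
print(index["chihuahua"])   # [(1, 1), (2, 4)]
print(index["nanny"])       # [(1, 2), (2, 1)]
```

Coarser granularities (document or page number) shrink the index, at the cost of less precise proximity operations.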

9. Boolean Operations
• Chihuahua AND Nanny → join (∩ of posting lists)
• Chihuahua OR Nanny → union (∪ of posting lists)
• Proximity searches: Chihuahua W/3 Nanny (terms within 3 words of each other)
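
A sketch of the three operations over posting lists, assuming the word-offset index from the previous sketch (W/k is interpreted here as absolute offset distance ≤ k):

```python
# Posting lists: term -> list of (doc_id, word_offset) pairs
index = {
    "chihuahua": [(1, 1), (2, 4)],
    "nanny":     [(1, 2), (2, 1)],
}

def boolean_and(w1, w2):
    """AND: docs containing both terms (intersection of doc sets)."""
    return {d for d, _ in index[w1]} & {d for d, _ in index[w2]}

def boolean_or(w1, w2):
    """OR: docs containing either term (union of doc sets)."""
    return {d for d, _ in index[w1]} | {d for d, _ in index[w2]}

def within(w1, w2, k):
    """Proximity (w1 W/k w2): both terms in the same doc, within k words."""
    return {d1 for d1, o1 in index[w1]
               for d2, o2 in index[w2]
               if d1 == d2 and abs(o1 - o2) <= k}

print(boolean_and("chihuahua", "nanny"))   # {1, 2}
print(within("chihuahua", "nanny", 3))     # {1, 2}
```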

10. Vector IR Model
• Map each document and the query through a function f( ): f(d1) = V1, f(d2) = V2, f(Query) = VQ
• Find an optimal f( ) such that similarity in vector space tracks similarity of the underlying texts:
  • Sim(Vi, VQ) = Sim'(Di, Q)
  • Sim(V1, V2) ≈ Sim'(d1, d2)
• Typical measure: cosine distance
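
Cosine similarity over term vectors, as a minimal numpy sketch (the toy term space and vectors are illustrative):

```python
import numpy as np

def cosine_sim(v1, v2):
    """Cosine of the angle between two vectors; 1.0 = identical direction."""
    v1, v2 = np.asarray(v1, float), np.asarray(v2, float)
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(v1 @ v2 / denom) if denom else 0.0

# Toy term space: (chihuahua, nanny, scuba)
v1 = [3, 1, 0]   # document vector V1
vq = [1, 1, 0]   # query vector VQ
print(cosine_sim(v1, vq))   # ~0.894
```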

11. Vector Models
• Each document Di maps to a vector Vi (e.g., a bit vector) capturing the essence/meaning of Di
• The query Q1 maps to a vector in the same space
• Retrieval: find the document maximizing Sim(Vi, Q1)
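
The retrieval step as a sketch: score every document vector against the query and return the best match (the vectors are toy data):

```python
import numpy as np

def cosine_sim(v1, v2):
    v1, v2 = np.asarray(v1, float), np.asarray(v2, float)
    return float(v1 @ v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

def retrieve(doc_vectors, query_vec):
    """Return the id of the document whose vector best matches the query."""
    return max(doc_vectors, key=lambda d: cosine_sim(doc_vectors[d], query_vec))

doc_vectors = {1: [3, 1, 0], 2: [0, 1, 2]}   # V1, V2
print(retrieve(doc_vectors, [1, 1, 0]))      # 1: d1 best matches the query
```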

12. Dimensionality Reduction
• d1 → f( ) → V1: the initial (term) vector representation
• V1 → dimensionality reduction (SVD/LSI) → V̂1: a more compact, reduced-dimensionality model of d1
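
A sketch of LSI-style reduction via a plain SVD in numpy (the term-document matrix and the rank k are illustrative):

```python
import numpy as np

# Toy term-document count matrix: rows = terms, columns = documents
A = np.array([[3, 0, 1],
              [1, 2, 0],
              [0, 2, 2],
              [1, 0, 1]], dtype=float)

k = 2                                        # reduced dimensionality
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T    # each row: one document in k-dim space

print(doc_vectors.shape)                     # (3, 2): 3 documents, 2 latent dims
```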

13. Clustering Words
• Raw term vector for D1: a count stored at offset k = hash(w) for each surface word w (separate entries for "Japanese", "Japan", "Nippon", "Nihon", "The", ...)
• Condensed vector: store counts at offset k = hash(cluster(w)) or hash(cluster(stem(w))), merging related words into one dimension
• Stem examples: books → book; computer → comput; computation → comput
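
A sketch of the condensed vector: words are stemmed, mapped through a cluster table, then hashed to an offset (the suffix-stripping stemmer and the cluster table are crude stand-ins; a real system would use e.g. a Porter stemmer):

```python
from collections import Counter

CLUSTERS = {"japan": "japan", "nippon": "japan", "nihon": "japan"}  # toy clusters

def stem(w):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ation", "er", "s"):
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: -len(suffix)]
    return w

def condensed_vector(text, dims=16):
    """Count words at offset hash(cluster(stem(w))) % dims."""
    vec = Counter()
    for w in text.lower().split():
        key = stem(w)
        vec[hash(CLUSTERS.get(key, key)) % dims] += 1
    return vec

# All three surface forms land in the same dimension:
print(condensed_vector("Japan Nippon Nihon"))   # Counter({<one offset>: 3})
```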

14. Collocation (Phrasal Terms)
• Treat "soap opera" as a single phrasal term with its own dimension, alongside "soap" and "opera":
  • d1, which contains the phrase "soap opera": (soap, opera, soap_opera) = (0, 0, 1)
  • d2, which contains "soap" and "opera" separately: (soap, opera, soap_opera) = (1, 1, 0)
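
A sketch of phrase-aware tokenization: adjacent words found in a phrasal dictionary are merged into one term before vectorization (the dictionary is illustrative):

```python
PHRASES = {("soap", "opera"): "soap_opera"}   # toy phrasal dictionary

def tokenize_with_phrases(text):
    """Merge adjacent words that form a known collocation into one term."""
    words, out, i = text.lower().split(), [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in PHRASES:
            out.append(PHRASES[(words[i], words[i + 1])])
            i += 2
        else:
            out.append(words[i])
            i += 1
    return out

print(tokenize_with_phrases("my favorite soap opera"))   # [..., 'soap_opera']
print(tokenize_with_phrases("soap ruined the opera"))    # 'soap', 'opera' stay separate
```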

15. Meaning Vectors
• f(d1) = m1, f(d2) = m2: document → vector
• Abstractly, the vector is a compressed document (meaning preserving): a meaning or context vector representation
• Compression: m1 = m2 iff d1 = d2 (f( ) must be invertible)
• Summarization: m1 = m2 iff d1 and d2 are about the same thing (mean the same thing)

16. What is the optimal method for meaning-preserving compression?
• Issues:
  • Size of representation: ideally size(Vi) << size(Di)
  • Cost of computing the vectors: a one-time cost at model creation
  • Cost of the similarity function: computed for each query, so minimizing it is crucial to speed

17. Header Processing
• Retain/model cross-references to cited (Xref) articles/mail:

18. Body Processing / Term Weighting
1. Remove (most) function words (but words such as "NOT" must be treated specially)
2. Downweight terms by frequency
3. Use text analysis to decide which function words carry meaning:
  • "Mr. Dong Ving The of the Hanoi branch" ("The" is part of a name)
  • "The Who" (a band name)
  • Use named-entity analysis and phrasal dictionaries
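
Steps 1 and 2 as a sketch: stopword removal plus inverse-document-frequency downweighting (the stopword list is illustrative, and note that "not" is deliberately kept):

```python
import math
from collections import Counter

STOPWORDS = {"the", "of", "a", "an", "and", "to", "in", "is"}   # 'not' kept on purpose

def tfidf_vectors(docs):
    """tf-idf: term frequency, scaled down for terms common across documents."""
    tokenized = [[w for w in d.lower().split() if w not in STOPWORDS] for d in docs]
    df = Counter(w for toks in tokenized for w in set(toks))
    n = len(docs)
    return [{w: tf * math.log(n / df[w]) for w, tf in Counter(toks).items()}
            for toks in tokenized]

docs = ["the chihuahua is not a nanny", "the nanny of the chihuahua club"]
for vec in tfidf_vectors(docs):
    print(vec)   # terms in every doc get weight 0; rarer terms score higher
```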

19. Supervised Learning/Training: Collective Discrimination
[Diagram: an input data stream feeds a bank of recognizers, one per category (A: Project #1, B: Chihuahua Breeding Club, C: Personal, J: Junk mail); each message is routed to a labeled output bin, and training continues in real time (ongoing).]

20. Other Related Problems: Mail/News Routing and Filtering
• An incoming data stream is partitioned into inboxes (Project #1 at work, Project #2 at work, Chihuahua breeding, Scuba club, Personal, Junk mail) to prioritize/partition mail reading
• These systems typically model long-term information needs (people put effort into training and user feedback that they aren't willing to invest for single query-based IR)

21. Features for Classification
• Subject line
• Source/sender
• X-annotations
• Date/time
• Length
• Other recipients
• Message content (regions weighted differently)
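
A sketch of routing on such features with a simple Naive Bayes classifier (scikit-learn is assumed available; the feature encoding and training messages are toy data):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

# One feature dict per message, drawn from the feature list above
train = [
    {"subject:meeting": 1, "sender:boss@work.example": 1, "len:short": 1},
    {"subject:free": 1, "subject:win": 1, "sender:ads@junk.example": 1, "len:long": 1},
    {"subject:chihuahua": 1, "sender:club@dogs.example": 1, "len:short": 1},
]
labels = ["project1", "junk", "chihuahua_club"]

vec = DictVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(train), labels)

new_msg = {"subject:free": 1, "sender:ads@junk.example": 1, "len:long": 1}
print(clf.predict(vec.transform([new_msg])))   # ['junk']
```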

22. Probabilistic IR Models: Intermediate Topic Models/Detectors
• Term vectors V1 = f(d1), V2 = f(d2) are passed through a bank of topic detectors (topic models) TDA, TDB, ..., TDE
• Each document is then represented by a topic vector of detector outputs (e.g., TV1 = 0 1 0 1 0 0), and the query Q is compared against documents in topic space
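
A sketch of the intermediate layer under one plausible reading: each detector scores the term vector, and the detector outputs form the topic vector (the detectors here are toy keyword-weight rows, not a trained model):

```python
import numpy as np

# Term space: (chihuahua, nanny, scuba, dive)
TOPIC_DETECTORS = np.array([
    [1.0, 0.5, 0.0, 0.0],   # TD_A: dogs
    [0.0, 1.0, 0.0, 0.0],   # TD_B: childcare
    [0.0, 0.0, 1.0, 1.0],   # TD_C: diving
])

def topic_vector(term_vec):
    """Map a raw term vector V to a topic vector TV via the detector bank."""
    return TOPIC_DETECTORS @ np.asarray(term_vec, float)

v1 = [2, 1, 0, 0]            # document mostly about chihuahuas
print(topic_vector(v1))      # [2.5, 1.0, 0.0]: strongest on the dog topic
```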
