Improving Internet Search using Query Expansion and Focusing

Improving Internet Search using Query Expansion and Focusing. Prepared for The Fifth International Military Applications Symposium. Conference Theme: Military Personnel Research. Track: Recruiting I. Sumali J. Conlon, Wendy Wang, Jie Zhang, and Yue-hua She

Presentation Transcript


  1. Improving Internet Search using Query Expansion and Focusing. Prepared for The Fifth International Military Applications Symposium. Conference Theme: Military Personnel Research. Track: Recruiting I. Sumali J. Conlon, Wendy Wang, Jie Zhang, and Yue-hua She. Division of Management Information Systems, School of Business Administration, University of Mississippi, University, MS 38677. Phone: (662)-915-5470 • This research is supported by the Office of Naval Research under grant N00014-00-0668.

  2. Outline • Research Goals • Motivation • System Architecture • Data • Techniques • Conclusions and Future Research

  3. Research Goals: Improve precision and recall rates in Internet search. Future Goals: • Automatic Summarization • Web Mining

  4. Motivation: The Internet and other online information have been growing exponentially. Online information is • 20% numerical • 80% textual. Current web search engines provide low precision and low recall.
     $$\text{Precision rate} = \frac{\#\,\text{retrieved pages that are relevant}}{\#\,\text{retrieved pages}}$$
     $$\text{Recall rate} = \frac{\#\,\text{relevant pages retrieved}}{\#\,\text{relevant pages on the Web}}$$
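
The two rates defined on this slide can be computed directly once the retrieved and relevant pages are known as sets. The following is only a minimal sketch: the function name and the page identifiers are hypothetical, and in practice the full set of relevant pages on the Web is not known and has to be estimated.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of page identifiers."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # retrieved pages that are also relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical page identifiers, for illustration only.
retrieved = {"p1", "p2", "p3", "p4"}
relevant = {"p2", "p4", "p5", "p6", "p7"}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four retrieved pages are relevant, and two of the five relevant pages were retrieved.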

  5. Modern Information Retrieval by R. Baeza-Yates, Berthier Ribeiro-Neto, Ricardo Baeza-Yates. List Price: $50.00. Our Price: $50.00. See All New: from $45.49. See All Used: from $32.00. Availability: Usually ships within 24 hours. Edition: Paperback • Customers who bought this book also bought: • Readings in Information Retrieval (Morgan Kaufmann Series in Multimedia Information and Systems) by Karen Sparck Jones (Editor), et al • Managing Gigabytes: Compressing and Indexing Documents and Images by Ian H. Witten, et al • Understanding Search Engines: Mathematical Modeling and Text Retrieval (Software, Environments, Tools) by Michael W. Berry, Murray Browne • The Social Life of Information by John Seely Brown, Paul Duguid • UML Distilled: A Brief Guide to the Standard Object Modeling Language (2nd Edition) by Martin Fowler, Kendall Scott

  6. Database structure • Books(isbn, title, authors, list_price, our_price) • Related_books(main_book_isbn, related_books_isbn)
     SQL:
       SELECT isbn, title, authors
       FROM books, related_books
       WHERE main_book_isbn = :main_book_isbn
         AND isbn = related_books_isbn;
     • Customers who bought this book also bought: • Readings in Information Retrieval (Morgan Kaufmann Series in Multimedia Information and Systems) by Karen Sparck Jones (Editor), et al • Managing Gigabytes: Compressing and Indexing Documents and Images by Ian H. Witten, et al • Understanding Search Engines: Mathematical Modeling and Text Retrieval (Software, Environments, Tools) by Michael W. Berry, Murray Browne • The Social Life of Information by John Seely Brown, Paul Duguid • UML Distilled: A Brief Guide to the Standard Object Modeling Language (2nd Edition) by Martin Fowler, Kendall Scott
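
The lookup on this slide can be tried end to end with a small in-memory database. This is a sketch under stated assumptions: it uses Python's built-in `sqlite3` module as a stand-in for the Oracle database named later in the deck, the ISBN values are placeholders rather than real ISBNs, prices not given on the slides are left as NULL, and the `ORDER BY` clause and the `also_bought` helper are additions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (
        isbn TEXT PRIMARY KEY, title TEXT, authors TEXT,
        list_price REAL, our_price REAL
    );
    CREATE TABLE related_books (main_book_isbn TEXT, related_books_isbn TEXT);
""")

# Placeholder ISBNs; only the first book's prices appear on the slides.
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?, ?, ?)",
    [
        ("isbn-mir", "Modern Information Retrieval",
         "R. Baeza-Yates, Berthier Ribeiro-Neto", 50.00, 50.00),
        ("isbn-rir", "Readings in Information Retrieval",
         "Karen Sparck Jones (Editor), et al", None, None),
        ("isbn-mg", "Managing Gigabytes", "Ian H. Witten, et al", None, None),
        ("isbn-uml", "UML Distilled", "Martin Fowler, Kendall Scott", None, None),
    ],
)
conn.executemany(
    "INSERT INTO related_books VALUES (?, ?)",
    [("isbn-mir", "isbn-rir"), ("isbn-mir", "isbn-mg")],
)

# Same join as on the slide; ORDER BY is added only to make output stable.
RELATED_SQL = """
    SELECT isbn, title, authors
    FROM books, related_books
    WHERE main_book_isbn = :main_book_isbn
      AND isbn = related_books_isbn
    ORDER BY title
"""


def also_bought(connection, main_book_isbn):
    """Return (isbn, title, authors) rows related to the given book."""
    params = {"main_book_isbn": main_book_isbn}
    return connection.execute(RELATED_SQL, params).fetchall()


rows = also_bought(conn, "isbn-mir")
for isbn, title, authors in rows:
    print(f"{title} by {authors}")
```

The point of the slide's example carries over: the "also bought" list comes from an explicit relation between rows, not from any understanding of the book's text.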

  7. Motivation • Current Search Engines work best with • Proper Nouns – “Office of Naval Research” • Other unusual words – “Geophysics” • Problems: • Lack of syntactic and semantic knowledge • Search engines cannot understand human language

  8. 4 Levels of Current Information Retrieval techniques • Texts are “bags-of-words” • Statistical techniques • Linguistics--Syntax -- “Junior College” vs. “College Junior” – Use parser • Linguistics--Syntax and Semantics -- “white collar worker” includes “professional” – Use parser and semantic representation

  9. Linguistic Analysis is very difficult: “We still know very little about how linguistic phrases should be used, and what happens when we manipulate entities, concepts, and relations that these phrases denote, and not just the words used to make them” From Natural Language Information Retrieval [Strzalkowski, 1999, p. 16]

  10. Improving Internet Search. Improving recall rate: query expansion • Proper Nouns • Syntactic Analysis • Lexical Semantic Relations. Improving precision rate: focusing • Co-occurrences. Both use knowledge from a knowledge base.

  11. System Architecture

  12. Figure 2. Steps of improving recall rate. [Flowchart; labels as they appear on the slide: Input Request; Proper Noun (Y/N); Parser; Search the KB; Send request to the search engines; Query Expansion & Focusing; Calculate Weights; WWW; Threshold (Y/N); Discarded]

  13. The Parser (implemented in Prolog). Example sentence: “Who sells computers in LA.” [Parse tree; node labels: S, NP, VP, N, V, NP, N, PP, P, NP; leaves: Who, sells, computers, in, LA] Resulting structure: S(noun(“who”), verb_pp(“sells”, noun(“computers”), pp(“in”, noun_p1(“LA”))))

  14. Data • WordNet--Princeton • Proper noun lexicon • Idiom lexicon • Noun phrase lexicon • Co-occurrence lexicon

  15. Data (continued) • WordNet: a lexical database - Princeton University • Synonyms (automobile ISA motor vehicle) • Hypernyms (automobile ISA_KIND_OF …) • Hyponyms (…ISA_KIND_OF automobile) • Holonyms (professor ISA_member_of faculty) • Meronyms (air bag PART_OF automobile) 95,600 word forms [Miller et al., 1990; Miller, 1995]

  16. Semantic Network [Figure: network diagram; node labels: University, Institution of Higher Education, Educational Institution, Academia, School, Building, person, Organization, Naval Academy; edge labels: ISA, Is_Part_OF, Is_A_Kind_Of, Has_Member]

  17. Data (continued): The Knowledge Base
      isa(New York City, city), isa(CA, state), isa(Mississippi, state), isa(Texas, state), isa(IBM, company), isa(Ole Miss, university), isa(Informs, Organization)
      antonym(fast, slow), antonym(large, small), antonym(East, West)
      part_of(CPU, computer), part_of(Florida, Southeast US), part_of(Georgia, Southeast US), part_of(Miami, Florida)
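
The facts listed on this slide can be held as simple triples and queried, including by following `part_of` links transitively. The deck does not say how the knowledge base is stored or queried, so the triple layout and both helper functions below are assumptions made for illustration.

```python
# Facts copied from the slide as (relation, first, second) triples.
FACTS = [
    ("isa", "New York City", "city"),
    ("isa", "CA", "state"),
    ("isa", "Mississippi", "state"),
    ("isa", "Texas", "state"),
    ("isa", "IBM", "company"),
    ("isa", "Ole Miss", "university"),
    ("isa", "Informs", "Organization"),
    ("antonym", "fast", "slow"),
    ("antonym", "large", "small"),
    ("antonym", "East", "West"),
    ("part_of", "CPU", "computer"),
    ("part_of", "Florida", "Southeast US"),
    ("part_of", "Georgia", "Southeast US"),
    ("part_of", "Miami", "Florida"),
]


def instances_of(category, facts=FACTS):
    """Return every X with a fact isa(X, category), in listing order."""
    return [a for rel, a, b in facts if rel == "isa" and b == category]


def containing_wholes(entity, facts=FACTS):
    """Follow part_of links upward and return all wholes that contain entity."""
    found, frontier = [], [entity]
    while frontier:
        current = frontier.pop()
        for rel, part, whole in facts:
            if rel == "part_of" and part == current and whole not in found:
                found.append(whole)
                frontier.append(whole)
    return found


states = instances_of("state")
wholes = containing_wholes("Miami")
print(states)
print(wholes)
```

Following `part_of` upward is one way a request mentioning Miami could also be matched against material about Florida or the Southeast US.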

  18. Tools • Perl/CGI • Oracle 9 • Linux

  19. Traditional Information Retrieval technique: Vector-Space Model (tf*idf)
      $D_i = (W_{i1}, W_{i2}, W_{i3}, \ldots, W_{it})$, where $D_i$ is a document (or query) text and $W_{ik}$ is the weight of term $T_k$:
      $$W_{ik} = \frac{(\log(f_{ik}) + 1.0)\,\log(N/n_k)}{\sqrt{\sum_{j=1}^{t} \left[(\log(f_{ij}) + 1.0)\,\log(N/n_j)\right]^{2}}}$$
      $f_{ik}$ = occurrence frequency of $T_k$ in $Q_i$; $N$ = collection size; $n_k$ = number of documents with term $T_k$ assigned.
      $$S(D_i, Q_j) = \sum_{k=1}^{t} (d_{ik} \cdot q_{jk})$$
      $S$ = similarity of two texts (documents and queries).
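
The weighting and similarity formulas on this slide can be sketched in a few lines. The sketch makes several assumptions that are not in the deck: texts are tokenized by whitespace, the query is weighted as one more text in the collection (following the slide's "document (or query)" wording), and the sample texts are invented for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one length-normalized {term: weight} dict per text."""
    tokenized = [text.lower().split() for text in texts]
    n_texts = len(tokenized)  # N, the collection size
    # n_k: number of texts that contain term k at least once
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # f_ik for this text
        raw = {
            term: (math.log(f) + 1.0) * math.log(n_texts / doc_freq[term])
            for term, f in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return vectors


def similarity(doc_vec, query_vec):
    """Inner product over shared terms, as in S(D_i, Q_j)."""
    return sum(w * query_vec.get(term, 0.0) for term, w in doc_vec.items())


# Invented sample texts; the last one plays the role of the query.
texts = [
    "fast computers for sale",
    "high-speed computers and networks",
    "slow cooking recipes",
    "fast computers",
]
vectors = tfidf_vectors(texts)
query_vec = vectors[-1]
scores = [similarity(doc_vec, query_vec) for doc_vec in vectors[:-1]]
print(scores)
```

The second text shares only "computers" with the query, so it scores lower than the first even though "high-speed" means much the same as "fast"; a bag-of-words model cannot see that relation.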

  20. Our Techniques • Syntax “The car that is red” = “the red car” • Query Expansion - Capture same concept terms (phrases) • Focusing - Reduce useless phrases • Vector-Space Model

  21. Query Expansion • Original search phrase (x, y) • Database is queried for similar words (synonyms, hypernyms): x returns {x1, x2, …, xn}; y returns {y1, y2, …, ym} • Expanded queries:
      (x, y), (x, y1), (x, y2), …, (x, ym)
      (x1, y), (x1, y1), (x1, y2), …, (x1, ym)
      …
      (xn, y), (xn, y1), (xn, y2), …, (xn, ym)
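
The grid of expanded queries on this slide is the Cartesian product of each original word together with its similar words. A minimal sketch, assuming the similar words are already available in a dictionary; the `RELATED` entries and the function name are hypothetical stand-ins for what the lexical database would return.

```python
from itertools import product


def expand_query(terms, related):
    """Return every combination of each term or one of its related words."""
    alternatives = [[term] + related.get(term, []) for term in terms]
    return list(product(*alternatives))


# Hypothetical related words, for illustration only.
RELATED = {
    "fast": ["quick", "high-speed"],
    "computers": ["machines"],
}

queries = expand_query(("fast", "computers"), RELATED)
for query in queries:
    print(" ".join(query))
```

With n related words for x and m for y, the expansion yields (n + 1)(m + 1) queries, which is why the number of queries grows quickly when a word has many synonyms.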

  22. Query Expansion in Literature • Relevance Feedback: collect terms from the output of the original query • Synonyms: at Cornell and Siemens Corporate Research, Inc., did not improve precision and recall rates significantly (too many useless terms)

  23. Example: Search for “Fast Computers”. Synonyms for “Fast” (75 terms)

  24. Query Expansion will produce queries: Too Many Queries, most are not useful!

  25. Focusing Stage -- select only useful synonyms • Use co-occurrence data from the web • Compare the expanded list to the terms appearing in the co-occurrence data • Keep lexically related words if they overlap with co-occurrences • Send additional requests to the web
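
The overlap test described on this slide can be sketched as a filter over candidate synonyms. The deck does not specify how co-occurrence is measured, so this example assumes the simplest reading: a candidate is kept if it appears in any collected phrase that also contains the head word. The phrases are taken from the deck's “Fast Computers” example; the candidate list and function names are hypothetical.

```python
def cooccurring_words(head, phrases):
    """Collect the words that appear in the same phrase as the head word."""
    words = set()
    for phrase in phrases:
        tokens = phrase.lower().split()
        if head in tokens:
            words.update(token for token in tokens if token != head)
    return words


def focus(candidates, head, phrases):
    """Keep only the candidates that co-occur with the head word."""
    seen = cooccurring_words(head, phrases)
    return [candidate for candidate in candidates if candidate.lower() in seen]


# Co-occurrence phrases from the "Fast Computers" example in this deck.
PHRASES = [
    "fast computers",
    "very fast computers",
    "modern high-speed computers",
    "high-speed parallel computers",
    "new high-speed computers",
]

# Hypothetical candidate synonyms for "fast", for illustration only.
CANDIDATES = ["high-speed", "quick", "firm", "loyal"]

kept = focus(CANDIDATES, "computers", PHRASES)
print(kept)
```

Only the surviving candidates would then be used to build the additional requests sent to the web, which keeps the expanded query list short.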

  26. Search for “Fast Computers” will produce phrases like: • fast computers • fast pace computer • very fast computers • modern high-speed computers • interconnect high-speed computers • high-speed parallel computers • high-speed digital computers • first high-speed computers • latest high-speed computers • new high-speed computers • powerful high-speed computers • large high-speed computer • speeding up computers using… • small, high-speed computers

  27. Example: Search for “Fast Computers”. Synonyms for “Fast” (75 terms)

  28. Extension. Evaluate links: [Figure: link diagram; node labels: A; B, C, D, …, K; B1, B2, B3, …; B11, B12, B13; C1, C2, C3, …]

  29. Limitations • Domain Specific • Requires Sophisticated Parser (syntactic analysis) • Requires Extensive Knowledge Base

  30. Conclusion • Use NLP techniques to improve precision and recall rates • Query Expansion and Focusing • Knowledge base. Future Research • Automatic Text Summarization • Mining the Web • This technique can be used in recruiting!

  31. THANK YOU!
