
Content Based Search



Presentation Transcript


  1. Content Based Search Rajesh Kumar Jain Roll No: 07405402 (rkjain@cse.iitb.ac.in)

  2. Agenda-e-Day • Motivation • What Do People Want from Search Engine? • Types of Search Engines • Existing Search Engines (Google, Yahoo, Ask, AppliedSemantics) • INIS – International Nuclear Information System • AgroExplorer • Our approach – Functional Architecture with example • Conclusion and Future Work

  3. Motivation • The Web is a major source of information. • Need for search engines • Efficient and time saving. • Language barrier. • Most relevant documents. • Meaning Based Search • Used to retrieve the most relevant documents. • Multilingual Search • Used to eliminate the language barrier.

  4. What Do People Want from Search Engine? • Integrated Solutions • Distributed Solutions • Efficient, Flexible Indexing and Retrieval • Interfaces and Browsing • Effective Retrieval • Multimedia Retrieval • Information Extraction • Relevance Feedback

  5. Types of Search Engines • Individual Search engines • Compile their own databases. • Further classified as • Keyword based search engines. • Search on the keywords, e.g. Google. • Meaning based search engines. • Search on the meaning or semantics, e.g. AgroExplorer. • Meta Search engines • Do not compile their own databases. • Search the databases of different search engines, e.g. Dogpile. • Subject Directories • Created and maintained by human editors, e.g. LIBRARIANS' INDEX http://lii.org, INFOMINE http://infomine.ucr.edu, ACADEMIC INFO http://www.academicinfo.us

  6. Existing Search Engines – Google • Keyword Based Search • PageRank • Relative importance of the web page. • Anchor Text
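
The idea of PageRank as "relative importance of the web page" can be illustrated with a minimal power-iteration sketch. This is a textbook-style simplification, not Google's production algorithm; the toy link graph, the damping factor of 0.85 and the iteration count are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}.

    Every link target must also appear as a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its damped rank equally among its outlinks.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: C is linked from A, B and D.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
```

In this toy graph, C ends up with the highest rank because three pages link to it, while D, which nobody links to, keeps only the baseline share.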

  7. Existing Search Engines – Yahoo! • Yahoo! search http://search.yahoo.com • Huge (15 billion or more web pages) • Relevancy ranking (word proximity and placement), not popularity ranking • Capitalize OR, AND, or AND NOT. Put parentheses around words joined by OR. • No search-size word limit (Google limits you to 32 terms) • Services and tools similar to Google's

  8. Existing Search Engines – Yahoo! Differences between searching Google and Yahoo! Search • Parentheses around ORed terms (sometimes works without parentheses), e.g. ("global warming" OR "greenhouse effect") rise "sea level" (california OR "los angeles" OR "san diego" OR "san francisco") • Supports intitle: site: inurl: hostname: (hostname: matches an entire site name, e.g. hostname:google.com) • Shortcuts available at http://tools.search.yahoo.com/shortcuts

  9. Existing Search Engines – Ask.com • Ask.com http://ask.com • Subject-Specific Popularity ranking (links from pages on same subject as your search) • Search results analyzed to provide: • BROADER & NARROWER TERMS suggestions • Smaller database than Google or Yahoo! (about 2 billion) • No differences between basic searching in Google and searching Ask.

  10. Existing Search Engines – AppliedSemantics • Internet’s first meaning based search engine. • Used in Google AdSense (advertising solutions). • Uses CIRCA technology (Conceptual Information Retrieval and Communication Architecture). • CIRCA has • a scalable, language independent ontology. • Ontology has • Millions of words with their meanings • Conceptual relationships to other meanings.

  11. CIRCA • Identifies concepts related to specific words and phrases. • Finds how close “phrase A” is to “concept B”. • For a given query • Finds the distance between the query and various concepts in the database. • E.g. Query – “Colorado Bicycle trips”. • Possible concepts– region, bicycling, travel, etc.

  12. Existing Search Engines (image-only slide)

  13. INIS There are three major INIS products: • The INIS Database, which today contains 2.9 million bibliographic records; it is accessible by subscription only and currently has 1.3 million authorized users. • A unique collection of over 850 000 full-text documents (non-conventional "grey" literature – NCL) in 63 languages, including many documents that cannot easily be found anywhere else. • The INIS Multilingual Thesaurus – a major tool for describing nuclear information and knowledge in a structured form, which assists in multilingual and semantic searches.

  14. INIS-Features and Benefits • IAEA official design • Direct access to NCL documents in pdf format • Extended and configurable hyper-linking of external web addresses and emails, facilitating easier access to NCL documents on external systems or contacting authors • Weekly email notifications • Improved usability: • Allows users to see the query and its results at the same time • Allows users to preserve previously run queries for comparison purposes. • Displays records in reverse chronological order, giving users quick access to the latest records. • Better documentation: • Tool-tips assist users in performing tasks • Static help pages with "how-to" documents, manuals and glossary of terms can be opened in separate window for consultation.

  15. INIS-Features and Benefits • Improved configurability: • Allows users to fully customize the search mask and search results pages • The interface can be used in English, German and Spanish, with Portuguese to be added soon. More languages can be added upon demand • Anonymous users can register their own profiles and enjoy personalized features • Improved Index/Authority Navigator with search-composing assistant (CTRL-CLICK) • Increased data export capabilities: new formats (XML, Excel, formatted text, delimited text, HTML), sorting of exports • The type-ahead, search-ahead functionality "INIS Suggest" assists users when entering search terms and shows the hit count before the search is executed; this provides additional useful information when composing queries • Searches are much faster, now enabling queries that used to time out in the old system. Most queries are estimated to be between 5 and 20 times faster
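
The type-ahead behaviour described for "INIS Suggest" (showing matching terms with their hit counts before the search runs) can be sketched as a prefix lookup over a sorted term list. This is only an illustration of the general technique, not the INIS implementation; the function name, the index terms and the hit counts below are invented for the example.

```python
from bisect import bisect_left


def suggest(prefix, term_counts, limit=5):
    """Return up to `limit` (term, hit_count) pairs for terms starting with `prefix`."""
    terms = sorted(term_counts)  # a real system would keep this list pre-sorted
    prefix = prefix.lower()
    start = bisect_left(terms, prefix)  # first term that is >= prefix
    matches = []
    for term in terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match either
        matches.append((term, term_counts[term]))
        if len(matches) == limit:
            break
    return matches


# Invented index terms and hit counts, for illustration only.
index_counts = {"reactor": 1200, "reactor safety": 310, "radiation": 950, "radon": 75}
suggestions = suggest("rea", index_counts)
```

Because the hit count is read from a precomputed table, the user sees how many records a term would return without the full query being executed.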

  16. INIS-Features and Benefits • Support for concurrent users: a round-robin load balancer distributes the load among different databases • Improved maintenance: all update procedures are automated, require no human intervention and notify administrators in case of problems • Zero downtime per week: updates are transparent to users, who can use the system 24/7 without performance detriments.
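
A round-robin load balancer of the kind mentioned above simply hands each incoming request to the next database in a fixed rotation. The sketch below shows the principle only; the class name and the backend names `db-1` to `db-3` are hypothetical and not taken from the INIS system.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out backends in a fixed, repeating order."""

    def __init__(self, backends):
        backends = list(backends)
        if not backends:
            raise ValueError("at least one backend is required")
        self._rotation = cycle(backends)

    def next_backend(self):
        # Each call advances the rotation by one position.
        return next(self._rotation)


# Hypothetical database names.
balancer = RoundRobinBalancer(["db-1", "db-2", "db-3"])
assigned = [balancer.next_backend() for _ in range(5)]
```

Five consecutive requests are assigned to `db-1`, `db-2`, `db-3`, `db-1`, `db-2`, so each database sees roughly the same number of queries.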

  17. AgroExplorer • A meaning based multilingual search engine. • Agriculture domain. • UNL is used as interlingua. • Supports English, Hindi and Marathi. Methodology • User phrases the query in their native language. • System translates it to Universal Networking Language (UNL). • UNL corpus is searched. • Related documents in UNL are fetched. • Fetched documents are converted to the native language.

  18. AgroExplorer

  19. Query Output • Complete Expression Matching. • Retrieves completely relevant documents where query UNL graph is a subgraph of any sentence UNL graph. • Partial Expression Matching • Retrieves relevant documents where query UNL graph is a part of any sentence UNL graph. • Universal Word Matching • Search on Universal words which are concepts, not just keywords. • Keyword Based Matching. • Traditional search. Lucene search engine used.
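
The first three matching levels can be sketched by treating a UNL graph as a set of (relation, head, tail) triples between universal words. This is a simplified model assumed for illustration: real UNL expressions also carry attributes and scopes, the words and relations below are toy values, and the function is not AgroExplorer's actual code. Keyword matching via Lucene is left out.

```python
def universal_words(graph):
    """Collect every universal word that appears as head or tail of a relation."""
    return {word for _, head, tail in graph for word in (head, tail)}


def match_level(query, sentence):
    """Classify how a query graph matches a sentence graph (both sets of triples)."""
    if not query:
        return "none"
    if query <= sentence:
        return "complete"        # whole query graph is a subgraph of the sentence
    if query & sentence:
        return "partial"         # only some query relations occur in the sentence
    if universal_words(query) & universal_words(sentence):
        return "universal-word"  # no shared relation, but a shared concept
    return "none"


# Toy sentence graph: "The farmer grows rice in the field."
sentence = {("agt", "grow", "farmer"), ("obj", "grow", "rice"), ("plc", "grow", "field")}

complete = match_level({("agt", "grow", "farmer"), ("obj", "grow", "rice")}, sentence)
partial = match_level({("obj", "grow", "rice"), ("plc", "grow", "river")}, sentence)
concept = match_level({("obj", "sell", "rice")}, sentence)
nothing = match_level({("obj", "sell", "wheat")}, sentence)
```

The levels are checked from strictest to loosest, which mirrors the ordering of the list above: a document that matches completely is reported as such even though it would also satisfy the weaker criteria.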

  20. Multilingual Information Retrieval Need • Document collection contains documents in many languages. • User may not be fluent enough to express the query in the document language. Approaches • Machine translation for text translation • Thesaurus/Dictionary Based • Corpus Based (Subword clusters)

  21. Our Approach – Functional Architecture

  22. Example… • Commercial Description: 1. Automobile Radio and Stereo Retail Store; 2. Automobile Engine Rebuilding, Repair, and Exchange Workshop; 3. Car Repair and Retail Shop; 4. Jeep Repair and Retail Shop; and 5. Motor Mending and Replacement Workshop.

  23. Example… • For our search, we shall compare these encoding and retrieval techniques: • a flat list of words, • a structured list of words, • a flat list of word senses plus the linguistic ontology, • a structured list of word senses, using WordNet’s ontology.

  24. Method – Flat list of Words
      No.  Query              Descriptions found
      1    Automobile         1, 2
      2    Automobile Retail  1
      3    Car Repair         3
      4    Motor Repair       none
      5    Engine Repair      2
      6    Motor Exchange     none
      Both recall and precision of this method are very poor.
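
The flat-list results can be reproduced with a few lines of code: each description is reduced to a bag of lower-cased words, and a description is returned only if it contains every query word. The five descriptions come from the example slide; the function names and the all-words-must-match rule are assumptions made for this sketch.

```python
import re

# The five commercial descriptions from the example slide.
DESCRIPTIONS = {
    1: "Automobile Radio and Stereo Retail Store",
    2: "Automobile Engine Rebuilding, Repair, and Exchange Workshop",
    3: "Car Repair and Retail Shop",
    4: "Jeep Repair and Retail Shop",
    5: "Motor Mending and Replacement Workshop",
}


def words(text):
    """Lower-case the text and split it into a set of plain words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def flat_search(query):
    """Return the numbers of all descriptions containing every query word."""
    wanted = words(query)
    return [number for number, text in DESCRIPTIONS.items() if wanted <= words(text)]


results = {q: flat_search(q) for q in
           ["Automobile", "Automobile Retail", "Car Repair",
            "Motor Repair", "Engine Repair", "Motor Exchange"]}
```

"Motor Repair" finds nothing because description 5 says "Mending" rather than "Repair", and "Automobile" misses the car and jeep shops: literal word matching cannot see synonyms or more specific terms.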

  25. Method – Structured list of Words
      No.  Business type  Activity     Object  Market area
      1    Store          Retail       Radio   Automobile
           Store          Retail       Stereo  Automobile
      2    Workshop       Rebuilding   Engine  Automobile
           Workshop       Repair       Engine  Automobile
           Workshop       Exchange     Engine  Automobile
      3    Shop           Retail       Car
           Shop           Repair       Car
      4    Shop           Retail       Jeep
           Shop           Repair       Jeep
      5    Workshop       Replacement  Motor
           Workshop       Mending      Motor

  26. Method – Structured list of Words Recall remains the same because we have not eliminated the semantic-match problems.
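
A structured list can be sketched as records with named fields, so that a query states which role each word plays. The rows follow the structured table of the example; the field names, the `structured_search` helper and the exact-match rule are assumptions for illustration.

```python
from typing import NamedTuple


class Record(NamedTuple):
    number: int    # description number
    business: str  # business type
    activity: str
    obj: str       # object of the activity
    market: str    # market area ("" when the table leaves it empty)


RECORDS = [
    Record(1, "store", "retail", "radio", "automobile"),
    Record(1, "store", "retail", "stereo", "automobile"),
    Record(2, "workshop", "rebuilding", "engine", "automobile"),
    Record(2, "workshop", "repair", "engine", "automobile"),
    Record(2, "workshop", "exchange", "engine", "automobile"),
    Record(3, "shop", "retail", "car", ""),
    Record(3, "shop", "repair", "car", ""),
    Record(4, "shop", "retail", "jeep", ""),
    Record(4, "shop", "repair", "jeep", ""),
    Record(5, "workshop", "replacement", "motor", ""),
    Record(5, "workshop", "mending", "motor", ""),
]


def structured_search(**criteria):
    """Return description numbers having a record that matches every given field."""
    hits = {r.number for r in RECORDS
            if all(getattr(r, field) == value.lower()
                   for field, value in criteria.items())}
    return sorted(hits)


car_repair = structured_search(activity="repair", obj="car")
motor_repair = structured_search(activity="repair", obj="motor")
automobile_market = structured_search(market="automobile")
```

The query for motor repair still returns nothing, because "mending" and "repair" are different strings: adding structure does not solve the semantic-match problem.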

  27. Method – WordNet Synset and Linguistic ontology
      No.  Disambiguated description
      1    [car, auto, automobile, machine, motorcar], [radio receiver, receiving set, radio set, radio, tuner, wireless], [stereo, stereo system, stereophonic system], [retail, sell retail], [shop, store]
      2    [car, auto, automobile, machine, motorcar], [engine], [rebuilding], [repair, fix, fixing, mending, reparation], [substitution, exchange], [workshop, shop]
      3    [car, auto, automobile, machine, motorcar], [repair, fix, fixing, mending, reparation], [retail, sell retail], [shop, store]
      4    [jeep, landrover], [repair, fix, fixing, mending, reparation], [retail, sell retail], [shop, store]
      5    [motor], [repair, fix, fixing, mending, reparation], [replacement, replacing], [workshop, shop]

  28. Method – Flat list of Word senses and Linguistic ontology
      No.  Disambiguated query -> Descriptions found
      1    [car, auto, automobile, machine, motorcar] -> 1, 2, 3, 4
      2    [car, auto, automobile, machine, motorcar], [retail, sell retail] -> 1, 3, 4
      3    [car, auto, automobile, machine, motorcar], [repair, fix, fixing, mending, reparation] -> 2, 3, 4
      4    [motor], [repair, fix, fixing, mending, reparation] -> 2, 5
      5    [locomotive, engine, locomotive engine, railway locomotive], [repair, fix, fixing, mending, reparation] -> none
      6    [motor], [substitution, exchange] -> 2, 5
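
The word-sense results can be reproduced by matching on sense identifiers instead of words and by letting a description also match the more general senses above it in the hierarchy. In the sketch below each synset is named by one representative word, and queries are assumed to be already disambiguated. The three hypernym links (jeep under car, engine under motor, replacement under exchange) are assumptions inferred from the results in the table, not links looked up in WordNet.

```python
# Each description as a set of sense identifiers (one per synset).
DESCRIPTIONS = {
    1: {"car", "radio", "stereo", "retail", "store"},
    2: {"car", "engine", "rebuilding", "repair", "exchange", "workshop"},
    3: {"car", "repair", "retail", "store"},
    4: {"jeep", "repair", "retail", "store"},
    5: {"motor", "repair", "replacement", "workshop"},
}

# Assumed "is a kind of" links, inferred from the expected results.
HYPERNYM = {"jeep": "car", "engine": "motor", "replacement": "exchange"}


def expand(senses):
    """Add every more general sense reachable through the hypernym links."""
    expanded = set()
    for sense in senses:
        while sense is not None:
            expanded.add(sense)
            sense = HYPERNYM.get(sense)
    return expanded


def sense_search(*query_senses):
    """Return descriptions whose expanded senses cover every query sense."""
    wanted = set(query_senses)
    return [number for number, senses in DESCRIPTIONS.items()
            if wanted <= expand(senses)]


found = {
    1: sense_search("car"),
    2: sense_search("car", "retail"),
    3: sense_search("car", "repair"),
    4: sense_search("motor", "repair"),
    5: sense_search("locomotive", "repair"),  # "engine" in its locomotive sense
    6: sense_search("motor", "exchange"),
}
```

Synonyms are handled because "mending" and "repair" share one sense identifier, and generic queries are handled by the hierarchy: the jeep shop now answers a query about cars, while the locomotive sense of "engine" correctly matches nothing.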

  29. Method – Flat list of Word senses and Linguistic ontology • Decouple the user vocabulary from the data vocabulary, by covering the most common English words. • Increase recall, by exploiting the hierarchy to make generic queries and recognizing synonyms. • Increase precision, through the disambiguation mechanism and the ability to navigate the hierarchy to select specific queries.

  30. Conclusion and Future Work Conclusion • Meaning based search engines can capture the concept or idea expressed by the user in the query and can thus provide more accurate results than traditional keyword search engines. • Universal Networking Language (UNL) can be used as an effective interlingua to represent information in documents written in natural languages. • Multilingual search engines can help users access documents written in languages other than the query language. Future Work • The lack of a large scored, multilingual corpus and the adverse effects of polysemous words are found to be the cause of most of the limitations of multilingual information retrieval (MLIR) systems. Research efforts are being directed towards these fields and towards approaches that use interlinguas such as UNL, subword clusters, etc. effectively for MLIR.

  31. References • W. Bruce Croft, “What Do People Want from Information Retrieval?”, Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts, Amherst. • Joe Barker (jbarker@library.berkeley.edu) and John Kupersmith (jkupersm@library.berkeley.edu), “Beyond Google”, a “Know Your Library” Workshop, Teaching Library, University of California, Berkeley, Fall 2006. • D.W. Oard and B.J. Dorr, A survey of multilingual text retrieval, Institute of Advanced Computer Studies and Computer Science Department, University of Maryland, 1996. • Mrugank Surve, Sarvjeet Singh, Satish Kagathara, AgroExplorer Group and Pushpak Bhattacharyya, AgroExplorer: a Meaning Based Multilingual Search Engine, International Conference on Digital Libraries, Delhi, India, February 2004. • The UNL Center, The Universal Networking Language (UNL) Specifications, UNDL Foundation, 3rd edition, December 2004. • S. Singh, A Multilingual Meaning Based Search Engine, B.Tech Project Report, Indian Institute of Technology Bombay, 2003. • U. Hahn, K. Marko, S. Schulz, Subword Clusters as Light Weight Interlingua for Multilingual Document Retrieval, Proceedings of the 10th Machine Translation Summit of the International Association for Machine Translation (MT-Summit X), Phuket, Thailand, 2005.

  32. References (cont) • K. Marko, U. Hahn, S. Schulz, P. Daumke, and P. Nohama, Interlingual indexing across different languages, In RIAO 2004 – Conference Proceedings, Avignon, France, 26-28 April 2004. • Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd, The PageRank citation ranking: Bringing order to the web, Technical report, Stanford Digital Library Technologies Project, 1998. • K. Marko, S. Schulz, A. Medelyan and U. Hahn, Bootstrapping Dictionaries for Cross Language Information Retrieval, In SIGIR 2005, Proceedings of the 28th Annual International ACM SIGIR Conference, Salvador, Brazil, August 15-19, 2005.

  33. Thank You !
