
Text Information Retrieval and Applications – Advanced Topics

Text Information Retrieval and Applications – Advanced Topics. By J. H. Wang May 27, 2009. Outline. Advanced Retrieval Technologies Cross-Language Information Retrieval Multimedia Information Retrieval Semantic Retrieval Applications to IR Advanced Google Meta Search


Presentation Transcript


  1. Text Information Retrieval and Applications – Advanced Topics By J. H. Wang May 27, 2009

  2. Outline • Advanced Retrieval Technologies • Cross-Language Information Retrieval • Multimedia Information Retrieval • Semantic Retrieval • Applications to IR • Advanced Google • Meta Search • Search Result Clustering

  3. Advanced Retrieval Technologies • Cross-Language Information Retrieval (CLIR) • Multimedia IR (image, speech, music, video) • Semantic retrieval (XML, Semantic Web)

  4. Cross-Language Information Retrieval • Cross Language Information Retrieval (CLIR) -- A technology enabling users to query in one language and retrieve relevant documents written or indexed in another language

  5. Cross Language Web Search • A technology enabling users to query in one language and retrieve relevant Web pages written or indexed in another language

  6. Why “Cross-Language”? • Source: Global Reach (global-reach.biz/globstats)

  7. Internet World Users by Language

  8. Top Ten Languages Used in the Web Source: Internet World Stats (Mar. 31, 2009) More and more non-English users!

  9. Web Content • More and more non-English pages • Source: Network Wizards Internet Domain Survey (Jan. 1999)

  10. Chart of Web Content (by Language) [Source: Vilaweb.com, as quoted by eMarketer (Feb. 2001)] • Total Web pages: 313 B • English 68.4% • Japanese 5.9% • German 5.8% • Chinese 3.9% • French 3.0% • Spanish 2.4% • Russian 1.9% • Italian 1.6% • Portuguese 1.4% • Korean 1.3% • Other 4.6%

  11. Language Percent of Public Sites • English 72% • German 7% • Japanese 6% • Spanish 3% • French 3% • Italian 2% • Dutch 2% • Chinese 2% • Korean 1% • Portuguese 1% • Russian 1% • Polish 1% [Source: OCLC, 2002]

  12. Web Users and Pages (10 years ago) • Challenge of scalability! • Total users: 800M • Chinese users: 110M, including 87M (CN), 4.9M (HK), 11.6M (TW), 2.9M (MY), 2.14M (SG), 1.5M (US), and others • Source: Global Reach, 2004

  13. Number of Chinese Web Pages • 10,030,000,000 pages • Scalability problem!

  14. Number of Web Pages • The world’s largest search engine? • Billions of textual documents indexed, December 1995 – September 2003 • KEY: GG=Google, ATW=AllTheWeb, INK=Inktomi, TMA=Teoma, AV=AltaVista • Source: Search Engine Watch (Nov. 2004)

  15. Number of Web Pages • Estimated size: • Web pages in the world: 19.2 billion pages (indexed by Yahoo as of August 2005) • Websites in the world: 70,392,567 websites (indexed by Netcraft as of August 2005) • Web pages per website: 273 (19.2 billion ÷ 70,392,567, rounded to the nearest whole number) • Updated estimate: • 231,510,169 distinct websites (as found by the Netcraft Web Server Survey in April 2009) • About 63.2 billion web pages (231,510,169 websites × 273 pages per website) [Source: http://news.netcraft.com/archives/web_server_survey.html] [Source: http://www.boutell.com/newfaq/misc/sizeofweb.html]
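
The arithmetic behind the estimate on this slide can be checked in a few lines. This is only a sketch: the input figures are the ones quoted above, and the assumption that the 2005 pages-per-website ratio still holds in 2009 belongs to the estimate itself, not to any measurement.

```python
# Figures quoted on the slide (Yahoo index size and Netcraft surveys).
pages_2005 = 19_200_000_000   # pages indexed by Yahoo, August 2005
sites_2005 = 70_392_567       # websites reported by Netcraft, August 2005
sites_2009 = 231_510_169      # websites reported by Netcraft, April 2009

# Average pages per website, rounded to the nearest whole number.
pages_per_site = round(pages_2005 / sites_2005)

# Assumption: the 2005 ratio is still valid in 2009.
pages_2009 = sites_2009 * pages_per_site

print(pages_per_site)
print(f"{pages_2009 / 1e9:.1f} billion pages")
```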

  16. Number of Web Pages • 1 trillion unique URLs (We knew the web was big, by Jesse Alpert & Nissan Hajaj, Software Engineers, Web Search Infrastructure Team, 25 July 2008) • 19,200,000,000 pages (Mayer, Tim, 8 August 2005, Our Blog is Growing Up And So Has Our Index) • 320,000,000 pages (World Wide Web is 320 million and growing, BBC News Sci/Tech, 3 April 1998.) • 1,000,000,000 pages (Internet. How much information? 2000. Regents of the University of California.) • 800,000,000 pages (Maran, Ruth, and Paul Whitehead. "Web Pages." Internet and World Wide Web Simplified, 3rd ed. Foster City: IDG Books Worldwide, 1999. ) • 8,034,000,000 pages (Miller, Colleen. web sites: number of pages. NEC Research, IDC.) [Source: http://hypertextbook.com/facts/2007/LorantLee.shtml]

  17. Challenge of Cross-Language Web Search • Existing CLIR systems mostly rely on bilingual dictionaries and dictionary lookup • Translations for 81% of the search terms could not be found in common English-Chinese translation dictionaries • Examples: 中央處理器 (CPU), 電子商務 (E-commerce), 個人數位助理 (PDA), 雅虎 (Yahoo), 太空總署 (NASA), 星際大戰 (Star Wars), 非典型肺炎 (SARS), …

  18. Challenge • Existing CLIR systems mostly rely on bilingual dictionaries and dictionary lookup • Translations for 81% of the search terms could not be found in common English-Chinese translation dictionaries • How can effective translations be found automatically for query terms that are not in a dictionary?
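
The dictionary-lookup limitation described above can be illustrated with a minimal sketch. The dictionary `TOY_DICT` and the function `translate_query` are hypothetical names invented for this example, and the entries are toy data rather than a real lexicon.

```python
# Toy bilingual dictionary; entries are illustrative only.
TOY_DICT = {
    "porcelain": ["瓷器", "瓷", "陶瓷"],
    "museum": ["博物館"],
}


def translate_query(terms, dictionary):
    """Split query terms into translated and out-of-dictionary groups."""
    translated, unknown = {}, []
    for term in terms:
        key = term.lower()
        if key in dictionary:
            translated[term] = dictionary[key]
        else:
            # No entry: this is the "out-of-dictionary" case on the slide.
            unknown.append(term)
    return translated, unknown


query_terms = ["porcelain", "Yahoo", "NASA"]
translated, unknown = translate_query(query_terms, TOY_DICT)
oov_rate = len(unknown) / len(query_terms)
print(translated)
print(unknown, f"{oov_rate:.0%} out of dictionary")
```

Proper nouns such as Yahoo and NASA fall through the lookup, which is the gap that Web-based translation extraction is meant to fill.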

  19. Query Translation & CLIR in DL Chinese Query Mono-Lingual Document Search Chinese Digital Libraries 瓷器 Possible global use

  20. English Query Porcelain ? Query Translation & CLIR in DL Chinese Query Mono-Lingual Document Search Chinese Digital Libraries 瓷器 Need for CLIR services

  21. English Query Porcelain ? Query Translation & CLIR in DL Chinese Query Mono-Lingual Document Search Chinese Digital Libraries 瓷器 瓷器/瓷/陶瓷 Query Translation

  22. English Query Porcelain ? Query Translation & CLIR in DL Chinese Query Mono-Lingual Document Search Chinese Digital Libraries 瓷器 瓷器/瓷/陶瓷 Cost-ineffective to construct translation dictionaries Query Translation

  23. English Query Porcelain ? Query Translation & CLIR in DL Chinese Query Mono-Lingual Document Search Chinese Digital Libraries 瓷器 瓷器/瓷/陶瓷 Query Translation Using the Web as an online corpus to translate unknown terms Web

  24. Query Translation & CLIR in DL Chinese Query Mono-Lingual Document Search Chinese Digital Libraries 瓷器 故宮/故宮博物院 English Query Query Translation National Palace Museum ? Online Term Translation Suggestions  Web

  25. Query Translation & CLIR in DL Chinese Query Mono-Lingual Document Search Chinese Digital Libraries 瓷器 瓷器/瓷/陶瓷 English/Japanese/Korean Queries  Query Translation ? Auto- generated Translation Lexicons  Web

  26. CLIR • Conventional approach to query translation • Parallel documents as the corpus • Assume long queries • Problems of CLIR in digital libraries • No corpus for cross-lingual training • Short queries • “Out-of-dictionary” terms • Ex: proper nouns, new terminology, …

  27. Translation Lexicon Construction for CLIR • To use the Web as the corpus for query translation • Web mining techniques • Anchor-text-based [ACM TOIS ’04, ACM TALIP ’02] • Search-result-based [JCDL ’04] • To extract terms from real document collections as possible queries • Term extraction method [SIGIR ’97]

  28. Web Mining Approach to Term Translation Extraction The Web • LiveTrans: http://wkd.iis.sinica.edu.tw/LiveTrans/ Source query Anchor texts Academia Sinica LiveTrans Engine Target translations Search results 中央研究院/中研院

  29. National Palace Museum vs. 故宮博物院: Search-Result Page Noise • Mixed-language characteristic in Chinese pages • How to extract translation candidates? • Which candidates to choose?
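
One simple way to approach the first question, extracting candidates from mixed-language search results, is to collect the runs of Chinese characters that appear in snippets mentioning the English query and count them. The sketch below is an assumption for illustration, not the method of the cited papers; the snippets are invented and `candidate_counts` is a hypothetical helper.

```python
import re
from collections import Counter

# Runs of characters in the CJK Unified Ideographs block.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def candidate_counts(query, snippets):
    """Count Chinese character runs in snippets that mention the query."""
    counts = Counter()
    for snippet in snippets:
        if query.lower() in snippet.lower():
            counts.update(CJK_RUN.findall(snippet))
    return counts


# Invented search-result snippets with mixed Chinese and English text.
snippets = [
    "國立故宮博物院 National Palace Museum - 首頁",
    "故宮博物院 (National Palace Museum) 參觀資訊",
    "台北旅遊 - 故宮博物院 National Palace Museum",
    "Taipei 101 觀景台",
]

counts = candidate_counts("National Palace Museum", snippets)
print(counts.most_common(3))
```

Frequency alone leaves noise such as 首頁 (home page) among the candidates, which is why a further ranking step is needed to decide which candidates to choose.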

  30. Yahoo vs. 雅虎 -- Anchor-Text Set • Anchor text (link text) • The descriptive text of a link on a Web page • Anchor-text set • A set of anchor texts pointing to the same page (URL) • Multilingual translations • Yahoo/雅虎/야후 • America/美国/アメリカ • Anchor-text-set corpus • A collection of anchor-text sets 야후-USA Korea Yahoo Search Engine Yahoo! America http://www.yahoo.com • アメリカのYahoo! 美国雅虎 雅虎搜尋引擎 Japan Taiwan China
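
The anchor-text-set definition above maps directly onto a grouping step: collect every anchor text by the URL it points to. A minimal sketch follows, assuming links have already been crawled into (anchor text, target URL) pairs. The `links` data and the `example.org` URL are made up for illustration.

```python
from collections import defaultdict

# (anchor text, target URL) pairs as a crawler might collect them.
links = [
    ("Yahoo", "http://www.yahoo.com"),
    ("雅虎", "http://www.yahoo.com"),
    ("야후", "http://www.yahoo.com"),
    ("Academia Sinica", "http://example.org/sinica"),  # hypothetical URL
    ("中央研究院", "http://example.org/sinica"),
]


def build_anchor_text_sets(links):
    """Group anchor texts by the page (URL) they point to."""
    sets = defaultdict(set)
    for text, url in links:
        sets[url].add(text)
    return dict(sets)


anchor_sets = build_anchor_text_sets(links)
for url, texts in anchor_sets.items():
    print(url, sorted(texts))
```

Each resulting set mixes languages for the same page, which is what makes the anchor-text-set corpus usable as a source of translation pairs.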

  31. Term Translation Extraction from Different Resources • Source query: National Palace Museum • Target translations: 國立故宮博物院, 故宮, 故宮博物院 [Diagram components: Web spider, anchor-text corpus, search engine, search-result pages, term extraction, similarity estimation]
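
The similarity-estimation step can be sketched with a plain co-occurrence measure. The example below uses the Dice coefficient over anchor-text sets as a stand-in; the cited papers may use a different model, and the sets and function names here are hypothetical.

```python
def dice(source, target, anchor_sets):
    """Dice coefficient of two terms over a list of anchor-text sets."""
    n_source = sum(1 for s in anchor_sets if source in s)
    n_target = sum(1 for s in anchor_sets if target in s)
    n_both = sum(1 for s in anchor_sets if source in s and target in s)
    total = n_source + n_target
    return 2 * n_both / total if total else 0.0


def rank_translations(source, candidates, anchor_sets):
    """Order candidate translations by decreasing similarity to the source."""
    return sorted(candidates, key=lambda c: dice(source, c, anchor_sets),
                  reverse=True)


# Invented anchor-text sets; each set points to one page.
anchor_sets = [
    {"Yahoo", "雅虎", "야후"},
    {"Yahoo", "雅虎", "搜尋引擎"},
    {"Google", "搜尋引擎"},
]

ranking = rank_translations("Yahoo", ["搜尋引擎", "야후", "雅虎"], anchor_sets)
print(ranking)
```

A term that co-occurs with the source in every set scores 1.0, while a generic term such as 搜尋引擎 (search engine) is penalized because it also appears with other source terms.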

  32. LiveTrans: Cross-language Web Search

  33. More Examples

  34. More Examples

  35. Multimedia IR • Different forms of information need • Image retrieval • Speech information retrieval • Music information retrieval • Video information retrieval

  36. Image Retrieval • Content-based • Query by image content • Query by example (以圖找圖) • Similarity in visual features • Color, texture, shape, … • Relevance feedback • Text-based • Annotation
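
Similarity in color features is often illustrated with histogram intersection. The sketch below assumes images are given as non-empty lists of (R, G, B) tuples; the tiny "images" are made up, and real systems would combine color with texture and shape features.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Normalized joint RGB histogram of a non-empty list of (r, g, b) pixels."""
    step = 256 // bins_per_channel
    hist = [0.0] * bins_per_channel ** 3
    for r, g, b in pixels:
        index = ((r // step) * bins_per_channel ** 2
                 + (g // step) * bins_per_channel
                 + (b // step))
        hist[index] += 1
    return [count / len(pixels) for count in hist]


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical color distributions."""
    return sum(min(a, b) for a, b in zip(h1, h2))


# Tiny made-up images: mostly white scenes versus a green one.
query = [(250, 250, 250)] * 3 + [(10, 10, 10)]
database = {
    "polar_bear": [(240, 240, 240)] * 3 + [(30, 30, 30)],
    "forest": [(20, 120, 30)] * 4,
}

query_hist = color_histogram(query)
scores = {name: histogram_intersection(query_hist, color_histogram(px))
          for name, px in database.items()}
best_match = max(scores, key=scores.get)
print(scores, best_match)
```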

  37. Content-Based Image Retrieval (CBIR) • Example systems • CIRES (Content-based Image Retrieval System): http://amazon.ece.utexas.edu/~qasim/research.htm • SIMPLIcity: http://www-db.stanford.edu/IMAGE/ • National Museum of History: http://210.201.141.12/cgi-bin/cbir-query.cgi?tid=-1 • …

  38. Relevance Feedback (RF) Source: Dr. Cheng Image Similar images (no RF)

  39. Similar Images Using Relevance Feedback Image Similar images using RF
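
The slides do not say which relevance feedback method produced these results. As one common formulation, the Rocchio update moves the query feature vector toward images the user marked relevant and away from those marked non-relevant. The sketch below uses assumed default weights and made-up two-dimensional feature vectors.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return the query vector updated with user feedback (Rocchio)."""
    def centroid(vectors):
        if not vectors:
            return [0.0] * len(query)
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    rel_center = centroid(relevant)
    non_center = centroid(non_relevant)
    return [alpha * q + beta * r - gamma * n
            for q, r, n in zip(query, rel_center, non_center)]


# Made-up feature vectors (for example, two color-histogram bins).
query = [1.0, 0.0]
relevant = [[0.0, 1.0], [0.0, 1.0]]   # images the user marked as relevant
non_relevant = [[1.0, 0.0]]           # images the user rejected

updated = rocchio(query, relevant, non_relevant)
print(updated)
```

The updated vector is then used to rerun the similarity search, which is why the result list changes after feedback.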

  40. Automatic Image Annotation • Problem 1: Keywords? • Visual similarity against image banks with annotations • Example annotations: polar bear, ice, snow; white bear, snow, tundra; polar bears, snow, fight

  41. Spoken Document Retrieval • Spoken document retrieval • Indexing speech messages using speech recognition • Retrieving relevant messages for a text/speech query • Techniques • Document Processing: acoustic change detection, speech/non-speech detection, Mandarin/non-Mandarin detection, story segmentation, speaker recognition/clustering • Speech Recognition • Indexing/Retrieval

  42. SoVideo

  43. Music Information Retrieval • Finding a song by similar melody • Query by singing • Query by humming • Singer identification • Background noise • Singer voice model
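
Finding a song by a similar melody can be sketched by reducing each melody to its up/down/repeat contour and comparing contours with edit distance. This is one textbook simplification, not necessarily the approach behind the slide; the note sequences are MIDI pitch numbers typed in for illustration.

```python
def contour(pitches):
    """Encode a pitch sequence as U (up), D (down), R (repeat) steps."""
    steps = []
    for previous, current in zip(pitches, pitches[1:]):
        steps.append("U" if current > previous
                     else "D" if current < previous else "R")
    return "".join(steps)


def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


# Opening notes of two melodies as MIDI pitch numbers (illustrative).
songs = {
    "Twinkle": [60, 60, 67, 67, 69, 69, 67],
    "Ode to Joy": [64, 64, 65, 67, 67, 65, 64, 62],
}

# A hummed query: transposed up and with one wrong note.
hummed = [62, 62, 69, 69, 71, 70, 69]

distances = {title: edit_distance(contour(hummed), contour(notes))
             for title, notes in songs.items()}
best = min(distances, key=distances.get)
print(distances, best)
```

Working on the contour makes the match independent of the key the user hums in, and the edit distance tolerates a few wrong notes.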

  44. Video Information Retrieval • Differences from CBIR • Temporal information • Structural organization • Complexity of querying system • Techniques • Video segmentation • Keyframe identification
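
Video segmentation and keyframe identification can be sketched together: declare a shot boundary wherever the histogram of consecutive frames changes sharply, then take the first frame of each shot as its keyframe. The threshold, the grayscale frames, and the first-frame keyframe rule below are assumptions chosen for illustration.

```python
def gray_histogram(frame, bins=8):
    """Normalized intensity histogram of a non-empty list of 0-255 values."""
    step = 256 // bins
    hist = [0] * bins
    for value in frame:
        hist[value // step] += 1
    return [count / len(frame) for count in hist]


def shot_boundaries(frames, threshold=0.5):
    """Indices where a new shot starts, based on histogram difference."""
    hists = [gray_histogram(frame) for frame in frames]
    cuts = []
    for i in range(1, len(hists)):
        # Half the L1 distance lies in [0, 1] for normalized histograms.
        diff = sum(abs(a - b) for a, b in zip(hists[i - 1], hists[i])) / 2
        if diff > threshold:
            cuts.append(i)
    return cuts


def keyframe_indices(cuts):
    """Use the first frame of each shot as its keyframe."""
    return [0] + cuts


# Four tiny made-up frames: two dark ones followed by two bright ones.
frames = [
    [10, 20, 30, 15],
    [12, 22, 28, 18],
    [200, 220, 240, 210],
    [202, 222, 242, 212],
]

cuts = shot_boundaries(frames)
keys = keyframe_indices(cuts)
print(cuts, keys)
```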

  45. Semantic Retrieval • HTML vs. XML • Semantic Web (Agent, Ontology, RDF)

  46. Common Language of the Web • HTML • Link: Pi → Pj • URL (URI), anchor text • Part-of National Taiwan University http://www.ntu.edu.tw/ NTU

  47. Link Analysis – Hubs & Authorities in PageRank [Figure: link graph with node scores 100, 53, 50, 50, 50, 3, 9, 3, 3]
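
The idea that a page's score depends on the scores of the pages linking to it can be sketched with PageRank power iteration. The four-page graph below is invented and does not reproduce the figure on the slide. Hubs and authorities are usually computed by a separate algorithm (HITS), which this sketch does not cover.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank; links maps a page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Invented link graph: C is linked from A, B, and D.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

ranks = pagerank(graph)
top_page = max(ranks, key=ranks.get)
print({page: round(score, 3) for page, score in ranks.items()}, top_page)
```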

  48. Current Web Search • Keyword-based search (e.g., Google) • Full-text indexing • Page authority (link analysis) • Page popularity (query logs and user clicks) • Problems • Not specific • Data in pages have no semantic annotations • Ex: Yo-Yo Ma’s most recent CD • No topic disambiguation • Documents on different topics are mixed together • Ex: Yo-Yo Ma’s CDs, concerts, biography, gossip, …

  49. Search on Semantic Web • Metadata search • To increase precision and flexibility • Topic-based search • To help contextualize queries and overlay results in terms of a knowledge base

  50. XML (Extensible Markup Language) • More flexible tags • DTD (Document Type Definition) • Definition of the tags
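
To show how tags defined in a DTD make a query such as "most recent CD by an artist" answerable, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The catalog structure, element names, album titles, and years are all invented for illustration. Note that ElementTree reads the DTD but does not validate the document against it.

```python
import xml.etree.ElementTree as ET

# A small document with an internal DTD that defines the tags (made-up data).
DOC = """<!DOCTYPE catalog [
  <!ELEMENT catalog (cd*)>
  <!ELEMENT cd (artist, title, year)>
  <!ELEMENT artist (#PCDATA)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT year (#PCDATA)>
]>
<catalog>
  <cd><artist>Yo-Yo Ma</artist><title>Album A</title><year>2005</year></cd>
  <cd><artist>Yo-Yo Ma</artist><title>Album B</title><year>2008</year></cd>
  <cd><artist>Another Artist</artist><title>Album C</title><year>2009</year></cd>
</catalog>"""


def most_recent_cd(xml_text, artist):
    """Return the title of the newest CD by the given artist, or None."""
    root = ET.fromstring(xml_text)
    cds = [cd for cd in root.findall("cd") if cd.findtext("artist") == artist]
    if not cds:
        return None
    newest = max(cds, key=lambda cd: int(cd.findtext("year")))
    return newest.findtext("title")


print(most_recent_cd(DOC, "Yo-Yo Ma"))
```

Because artist, title, and year are separate tagged fields, the query can filter and sort on them directly, which keyword matching over plain HTML text cannot do reliably.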
