Text information retrieval and applications advanced topics


Text Information Retrieval and Applications – Advanced Topics

By J. H. Wang

May 27, 2009


Outline

Outline

  • Advanced Retrieval Technologies

    • Cross-Language Information Retrieval

    • Multimedia Information Retrieval

    • Semantic Retrieval

  • Applications to IR

    • Advanced Google

    • Meta Search

    • Search Result Clustering


Advanced retrieval technologies

Advanced Retrieval Technologies

  • Cross-Language Information Retrieval (CLIR)

  • Multimedia IR (image, speech, music, video)

  • Semantic retrieval (XML, Semantic Web)


Cross language information retrieval

Cross-Language Information Retrieval

  • Cross-Language Information Retrieval (CLIR): a technology that lets users query in one language and retrieve relevant documents written or indexed in another language


Cross language web search

Cross Language Web Search

  • A technology enabling users to query in one language and retrieve relevant Web pages written or indexed in another language


Why cross language

Why “Cross-Language”?

  • Source: Global Reach (global-reach.biz/globstats)


Internet world users by language

Internet World Users by Language


Top ten languages used in the web

Top Ten Languages Used in the Web

Source: Internet World Stats (Mar. 31, 2009)

More and more non-English users!


Web content

Web Content

More and more non-English pages

Source: Network Wizards Internet Domain Survey (Jan 99)


Chart of web content by language

Chart of Web Content (by Language)

[Source: Vilaweb.com, as quoted by eMarketer (Feb. 2001)]

  • Total Web pages: 313 B

    • English 68.4%

    • Japanese 5.9%

    • German 5.8%

    • Chinese 3.9%

    • French 3.0%

    • Spanish 2.4%

    • Russian 1.9%

    • Italian 1.6%

    • Portuguese 1.4%

    • Korean 1.3%

    • Other 4.6%


Language percent of public sites

Language Percent of Public Sites

  • English 72%

  • German 7%

  • Japanese 6%

  • Spanish 3%

  • French 3%

  • Italian 2%

  • Dutch 2%

  • Chinese 2%

  • Korean 1%

  • Portuguese 1%

  • Russian 1%

  • Polish 1%

[Source: OCLC, 2002]


Web users and pages 10 years ago

Web Users and Pages (10 years ago)

Challenge of Scalability!

Total Users: 800M

Chinese Users: 110M

Including 87M (CN), 4.9M (HK), 11.6M (TW), 2.9M (MY), 2.14M (SG), 1.5M (US), and others.

Source: Global Reach, 2004


Number of chinese web pages

Number of Chinese Web Pages

10,030,000,000 pages

Scalability Problem!


Number of web pages

Number of Web Pages

The world’s largest search engine?

Billions of Textual Documents Indexed, December 1995-September 2003

KEY: GG=Google, ATW=AllTheWeb, INK=Inktomi, TMA=Teoma, AV=AltaVista.

Source: Search Engine Watch (Nov. 2004)


Number of web pages1

Number of Web Pages

  • Estimated size:

    • Web pages in the world: 19.2 billion pages (indexed by Yahoo as of August 2005)

    • Websites in the world: 70,392,567 websites (indexed by Netcraft as of August 2005)

    • Web pages per website: 273 (rounding to the nearest whole number)

  • Updated estimate:

    • 231,510,169 distinct websites (as found by the Netcraft Web Server Survey in April 2009)

    • 63.2 billion Web pages (estimated as 231,510,169 websites × 273 pages per website)

[Source: http://news.netcraft.com/archives/web_server_survey.html]

[Source: http://www.boutell.com/newfaq/misc/sizeofweb.html]
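The arithmetic behind this estimate can be reproduced in a few lines. This is a minimal sketch using only the figures quoted on the slide; the function and variable names are illustrative, and the extrapolation assumes the 2005 pages-per-website ratio still held in 2009.

```python
# Figures quoted on the slide (Yahoo and Netcraft, August 2005; Netcraft, April 2009).
PAGES_2005 = 19_200_000_000
SITES_2005 = 70_392_567
SITES_2009 = 231_510_169


def pages_per_site(pages: int, sites: int) -> int:
    """Average pages per website, rounded to the nearest whole number."""
    return round(pages / sites)


def estimate_pages(sites: int, ratio: int) -> int:
    """Extrapolate a page count, assuming the given ratio still holds."""
    return sites * ratio


ratio = pages_per_site(PAGES_2005, SITES_2005)
estimate_2009 = estimate_pages(SITES_2009, ratio)
print(ratio)                                 # 273
print(f"{estimate_2009 / 1e9:.1f} billion")  # 63.2 billion
```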


Number of web pages2

Number of Web Pages

  • 1 trillion unique URLs (We knew the web was big, by Jesse Alpert & Nissan Hajaj, Software Engineers, Web Search Infrastructure Team, 25 July 2008)

  • 19,200,000,000 pages (Mayer, Tim, 8 August 2005, Our Blog is Growing Up And So Has Our Index)

  • 320,000,000 pages (World Wide Web is 320 million and growing, BBC News Sci/Tech, 3 April 1998.)

  • 1,000,000,000 pages (Internet. How much information? 2000. Regents of the University of California.)

  • 800,000,000 pages (Maran, Ruth, and Paul Whitehead. "Web Pages." Internet and World Wide Web Simplified, 3rd ed. Foster City: IDG Books Worldwide, 1999. )

  • 8,034,000,000 pages (Miller, Colleen. web sites: number of pages. NEC Research, IDC.)

[Source: http://hypertextbook.com/facts/2007/LorantLee.shtml]


Challenge of cross language web search

Challenge of Cross-Language Web Search

  • Existing CLIR systems mostly rely on bilingual dictionaries and dictionary lookup

  • Translations for 81% of search terms could not be found in common English-Chinese translation dictionaries

    • Examples: 中央處理器 (CPU), 電子商務 (E-commerce), 個人數位助理 (PDA), 雅虎 (Yahoo), 太空總署 (NASA), 星際大戰 (Star Wars), 非典型肺炎 (SARS), …


Challenge

Challenge

  • Existing CLIR systems mostly rely on bilingual dictionaries and dictionary lookup

  • Translations for 81% of search requests could not be found in common English-Chinese translation dictionaries

  • How can effective translations be found automatically for query terms that are not in a dictionary?
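To make the limitation concrete, here is a minimal sketch of dictionary-based query translation that reports which terms fall outside the dictionary. The dictionary entries and function name are illustrative assumptions, not taken from any real lexicon or CLIR system.

```python
# Toy bilingual dictionary; the entries are illustrative only.
EN_ZH_DICTIONARY = {
    "porcelain": ["瓷器", "瓷", "陶瓷"],
    "museum": ["博物館"],
}


def translate_query(query, dictionary):
    """Look up each query term in the dictionary.

    Returns (translations, unknown_terms): terms found in the dictionary
    map to their candidate translations; the rest are reported as
    out-of-dictionary so that another method can handle them.
    """
    translations, unknown_terms = {}, []
    for term in query.lower().split():
        if term in dictionary:
            translations[term] = dictionary[term]
        else:
            unknown_terms.append(term)
    return translations, unknown_terms


found, missing = translate_query("Porcelain SARS", EN_ZH_DICTIONARY)
print(found)    # {'porcelain': ['瓷器', '瓷', '陶瓷']}
print(missing)  # ['sars']
```

Proper nouns and new terminology tend to end up in the second list, which is the gap the Web-mining approach on the following slides tries to fill.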


Query translation clir in dl

Query Translation & CLIR in DL

[Diagram: a Chinese query (瓷器, “porcelain”) goes to mono-lingual document search over Chinese digital libraries. Callout: Possible global use]


Query translation clir in dl1

Query Translation & CLIR in DL

[Diagram: a Chinese query (瓷器) goes to mono-lingual document search over Chinese digital libraries; an English query (“Porcelain”) is marked with a question mark. Callout: Need for CLIR services]


Query translation clir in dl2

Query Translation & CLIR in DL

[Diagram: an English query (“Porcelain”) passes through query translation, giving 瓷器/瓷/陶瓷, and then joins the Chinese query (瓷器) at mono-lingual document search over Chinese digital libraries]


Query translation clir in dl3

Query Translation & CLIR in DL

[Diagram: an English query (“Porcelain”) passes through query translation, giving 瓷器/瓷/陶瓷, and then joins the Chinese query (瓷器) at mono-lingual document search over Chinese digital libraries. Callout: Cost-ineffective to construct translation dictionaries]


Query translation clir in dl4

Query Translation & CLIR in DL

[Diagram: an English query (“Porcelain”) passes through query translation, giving 瓷器/瓷/陶瓷, and then joins the Chinese query (瓷器) at mono-lingual document search over Chinese digital libraries; query translation draws on the Web. Callout: Taking the Web as an online corpus to deal with translation of unknown terms]


Query translation clir in dl5

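Query Translation & CLIR in DL

[Diagram: an English query (“National Palace Museum”) passes through query translation, giving 故宮/故宮博物院, and then joins the Chinese query (瓷器) at mono-lingual document search over Chinese digital libraries; query translation draws on online term translation suggestions from the Web]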


Query translation clir in dl6

Query Translation & CLIR in DL

[Diagram: English/Japanese/Korean queries pass through query translation, giving 瓷器/瓷/陶瓷, and then join the Chinese query (瓷器) at mono-lingual document search over Chinese digital libraries; query translation draws on translation lexicons auto-generated from the Web]



CLIR

  • Conventional approach to query translation

    • Parallel documents as the corpus

    • Assume long queries

  • Problems of CLIR in digital libraries

    • No corpus for cross-lingual training

    • Short queries

    • “Out-of-dictionary” terms

      • Ex: proper nouns, new terminologies, …


Translation lexicon construction for clir

Translation Lexicon Construction for CLIR

  • To use the Web as the corpus for query translation

    • Web mining techniques

      • Anchor-text-based [ACM TOIS ’04, ACM TALIP ’02]

      • Search-result-based [JCDL ’04]

  • To extract terms from real document collections as possible queries

    • Term extraction method [SIGIR ’97]


Web mining approach to term translation extraction

Web Mining Approach to Term Translation Extraction

  • LiveTrans: http://wkd.iis.sinica.edu.tw/LiveTrans/

[Diagram: a source query (Academia Sinica) is sent to the LiveTrans engine, which uses anchor texts and search results from the Web to return target translations (中央研究院/中研院)]


National palace museum vs search result page

National Palace Museum vs. 故宮博物院 -- Search-Result Page

Noise

  • Mixed-language characteristic in Chinese pages

  • How to extract translation candidates?

  • Which candidates to choose?
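One simple way to approach these two questions is sketched below: collect runs of Chinese characters from snippets that contain the English query, enumerate their character n-grams as candidates, and prefer candidates that appear in many snippets. This is only an illustration under stated assumptions; it is not the LiveTrans method, the snippets are invented, and a real system would need a stronger association measure to suppress noise.

```python
import re
from collections import Counter

# Runs of CJK ideographs inside a mixed-language snippet.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")

# Invented snippets standing in for search-result text.
SNIPPETS = [
    "國立故宮博物院 National Palace Museum 官方網站",
    "故宮博物院 (National Palace Museum) 參觀資訊",
    "National Palace Museum 故宮 門票",
]


def candidate_translations(query, snippets, min_len=2, max_len=7, top_k=2):
    """Rank Chinese n-grams by the number of query-bearing snippets they occur in."""
    doc_freq = Counter()
    for snippet in snippets:
        if query.lower() not in snippet.lower():
            continue  # only snippets that mention the source term count
        seen = set()
        for run in CJK_RUN.findall(snippet):
            for n in range(min_len, max_len + 1):
                for i in range(len(run) - n + 1):
                    seen.add(run[i:i + n])
        doc_freq.update(seen)  # count each n-gram once per snippet
    # Prefer frequent candidates; break ties in favour of longer strings.
    ranked = sorted(doc_freq, key=lambda t: (-doc_freq[t], -len(t), t))
    return ranked[:top_k]


print(candidate_translations("National Palace Museum", SNIPPETS))
# ['故宮', '故宮博物院']
```

Asking for more than two candidates here starts to return fragments such as 宮博物院, which illustrates why candidate selection is listed as a separate problem.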


Yahoo vs anchor text set

Yahoo vs. 雅虎 -- Anchor-Text Set

  • Anchor text (link text)

    • The descriptive text of a link on a Web page

  • Anchor-text set

    • A set of anchor texts pointing to the same page (URL)

    • Multilingual translations

      • Yahoo/雅虎/야후

      • America/美国/アメリカ

  • Anchor-text-set corpus

    • A collection of anchor-text sets

[Diagram: anchor texts on pages from different regions all point to http://www.yahoo.com: “Yahoo Search Engine” and “Yahoo! America”, 야후-USA (Korea), アメリカのYahoo! (Japan), 雅虎搜尋引擎 (Taiwan), 美国雅虎 (China)]
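The anchor-text-set definition above maps naturally onto a dictionary keyed by target URL. The sketch below assumes a crawler has already produced (anchor text, target URL) pairs; the function name and the second URL are hypothetical.

```python
from collections import defaultdict

# (anchor text, target URL) pairs as a crawler might collect them.
# The Yahoo texts mirror the slide; the example.com entry is hypothetical.
LINKS = [
    ("Yahoo Search Engine", "http://www.yahoo.com"),
    ("雅虎搜尋引擎", "http://www.yahoo.com"),
    ("美国雅虎", "http://www.yahoo.com"),
    ("アメリカのYahoo!", "http://www.yahoo.com"),
    ("야후-USA", "http://www.yahoo.com"),
    ("Example page", "http://www.example.com"),
]


def build_anchor_text_sets(links):
    """Group anchor texts by the URL they point to (one set per URL)."""
    sets = defaultdict(set)
    for text, url in links:
        sets[url].add(text)
    return dict(sets)


# The anchor-text-set corpus is simply the collection of all such sets.
corpus = build_anchor_text_sets(LINKS)
for text in sorted(corpus["http://www.yahoo.com"]):
    print(text)
```

Because texts in one set describe the same page, terms that co-occur across languages within a set (Yahoo/雅虎/야후) are candidate translations of one another.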


Term translation extraction from different resources

Term Translation Extraction from Different Resources

[Diagram: a Web spider builds the anchor-text corpus and a search engine supplies search-result pages; term extraction and similarity estimation over these resources map the source query (National Palace Museum) to target translations (國立故宮博物院, 故宮, 故宮博物院)]


Livetrans cross language web search

LiveTrans: Cross-language Web Search


More examples

More Examples


More examples1

More Examples


Multimedia ir

Multimedia IR

  • Different forms of information need

  • Image retrieval

  • Speech information retrieval

  • Music information retrieval

  • Video information retrieval


Image retrieval

Image Retrieval

  • Content-based

    • Query by image content

      • Query by example (search by image)

    • Similarity in visual features

      • Color, texture, shape, …

    • Relevance feedback

  • Text-based

    • Annotation
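As an illustration of content-based query by example, the sketch below ranks images by similarity of their color histograms. It is a toy under stated assumptions: images are non-empty lists of (r, g, b) pixels, the feature is a coarse 8-bin color histogram, and the similarity measure is histogram intersection. The image names and pixel values are invented, and real systems combine several features such as texture and shape.

```python
def color_histogram(pixels, bins_per_channel=2):
    """Normalized color histogram of a non-empty list of (r, g, b) pixels."""
    step = 256 // bins_per_channel
    hist = [0.0] * bins_per_channel ** 3
    for r, g, b in pixels:
        index = ((r // step) * bins_per_channel + g // step) * bins_per_channel + b // step
        hist[index] += 1
    return [count / len(pixels) for count in hist]


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical color distributions."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def query_by_example(query_pixels, collection):
    """Return image names ordered from most to least similar to the query."""
    query_hist = color_histogram(query_pixels)
    scored = [
        (histogram_intersection(query_hist, color_histogram(pixels)), name)
        for name, pixels in collection.items()
    ]
    return [name for _, name in sorted(scored, reverse=True)]


WHITE, BLUE, GREEN = (250, 250, 250), (20, 40, 200), (30, 180, 40)
COLLECTION = {
    "snow_scene": [WHITE] * 8 + [BLUE] * 2,
    "forest": [GREEN] * 9 + [WHITE] * 1,
    "sea": [BLUE] * 9 + [WHITE] * 1,
}
QUERY = [WHITE] * 7 + [BLUE] * 3

print(query_by_example(QUERY, COLLECTION))  # ['snow_scene', 'sea', 'forest']
```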


Content based image retrieval cbir

Content-Based Image Retrieval (CBIR)

  • Example systems

    • CIRES (Content-based Image Retrieval System): http://amazon.ece.utexas.edu/~qasim/research.htm

    • SIMPLIcity: http://www-db.stanford.edu/IMAGE/

    • National Museum of History: http://210.201.141.12/cgi-bin/cbir-query.cgi?tid=-1


Relevance feedback rf

Relevance Feedback (RF)

Source: Dr. Cheng

[Figure: a query image and the similar images retrieved without RF]


Similar images using relevance feedback

Similar Images Using Relevance Feedback

[Figure: a query image and the similar images retrieved using RF]


Automatic image annotation

Automatic Image Annotation

Problem 1: which keywords should annotate a new image?

[Figure: visually similar images from image banks with annotations, with keywords such as “polar bear ice snow”, “white bear snow tundra”, and “polar bears snow fight”]


Spoken document retrieval

Spoken Document Retrieval

  • Spoken document retrieval

    • Indexing speech messages using speech recognition

    • Retrieving relevant messages for a text/speech query

  • Techniques

    • Document Processing: acoustic change detection, speech/non-speech detection, Mandarin/non-Mandarin detection, story segmentation, speaker recognition/clustering

    • Speech Recognition

    • Indexing/Retrieval


Sovideo

SoVideo

http://slam.iis.sinica.edu.tw/demo.htm


Music information retrieval

Music Information Retrieval

  • Finding a song by similar melody

    • Query by singing

    • Query by humming

  • Singer identification

    • Background noise

    • Singer voice model

  • Demo:

    • http://slam.iis.sinica.edu.tw/demo.htm
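A common simplification for query by humming is to compare melodic contours rather than exact pitches, since people rarely hum in the original key. The sketch below is one such simplification, not the method behind the demo above: each melody is reduced to an up/down/repeat string and matched by edit distance. The pitch values are MIDI note numbers for the opening notes of two well-known tunes; all function names are illustrative.

```python
def contour(pitches):
    """Reduce a pitch sequence to U (up), D (down), R (repeat) steps."""
    return "".join(
        "U" if b > a else "D" if b < a else "R"
        for a, b in zip(pitches, pitches[1:])
    )


def edit_distance(s, t):
    """Levenshtein distance between two strings (dynamic programming)."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def best_match(hummed, songs):
    """Name of the song whose contour is closest to the hummed contour."""
    query = contour(hummed)
    return min(songs, key=lambda name: edit_distance(query, contour(songs[name])))


SONGS = {
    "Twinkle, Twinkle, Little Star": [60, 60, 67, 67, 69, 69, 67],
    "Ode to Joy": [64, 64, 65, 67, 67, 65, 64],
}
# Hummed a whole tone too high, with one wrong note near the end.
HUMMED = [62, 62, 69, 69, 71, 70, 69]

print(best_match(HUMMED, SONGS))  # Twinkle, Twinkle, Little Star
```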


Video information retrieval

Video Information Retrieval

  • Differences from CBIR

    • Temporal information

    • Structural organization

    • Complexity of querying system

  • Techniques

    • Video segmentation

    • Keyframe identification
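The two techniques can be sketched together: a shot boundary is declared wherever consecutive frames differ sharply, and one frame per shot is then kept as its keyframe. This is a minimal illustration that assumes each frame has already been reduced to a normalized histogram; the threshold, the choice of the middle frame as keyframe, and the sample data are all assumptions.

```python
def frame_difference(h1, h2):
    """Sum of absolute bin differences between two frame histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def segment_shots(histograms, threshold=0.5):
    """Split a frame sequence into shots, returned as (first, last) index pairs."""
    if not histograms:
        return []
    shots, start = [], 0
    for i in range(1, len(histograms)):
        if frame_difference(histograms[i - 1], histograms[i]) > threshold:
            shots.append((start, i - 1))  # a cut lies between frames i-1 and i
            start = i
    shots.append((start, len(histograms) - 1))
    return shots


def keyframes(shots):
    """Pick the middle frame of each shot as its keyframe."""
    return [(first + last) // 2 for first, last in shots]


# Invented 3-bin histograms: three similar frames, then an abrupt change.
FRAMES = [
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.75, 0.15, 0.10],
    [0.10, 0.10, 0.80],
    [0.10, 0.20, 0.70],
]

shots = segment_shots(FRAMES)
print(shots)             # [(0, 2), (3, 4)]
print(keyframes(shots))  # [1, 3]
```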


Semantic retrieval

Semantic Retrieval

  • HTML vs. XML

  • Semantic Web (Agent, Ontology, RDF)


Common language of the web

Common Language of the Web

  • HTML

    • Link: Pi Pj

    • URL (URI), anchor text

      • Part-of

[Example: the anchor texts “National Taiwan University” and “NTU” both link to http://www.ntu.edu.tw/]


Link analysis hubs authorities in pagerank

Link Analysis – Hubs & Authorities in PageRank

[Figure: example link graph with node scores 100, 53, 50, 50, 50, 9, 3, 3, 3]
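For reference, the core idea of PageRank is that a page's score is shared among the pages it links to, and the computation is repeated until scores settle. The sketch below is a basic power-iteration version with the usual damping factor of 0.85; the graph and function name are illustrative, and it does not reproduce the figure's numbers. Note that hubs and authorities are the two scores of the separate HITS algorithm, which is not implemented here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Basic PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Pages with no out-links spread their rank evenly over all pages.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-link graph: C is linked from both A and B.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(GRAPH)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this graph C collects rank from two pages and ends up highest, A follows because it receives all of C's share, and B comes last.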


Current web search

Current Web Search

  • Keyword-based search (e.g., Google)

    • Full text indexing

    • Page authority (link analysis)

    • Page popularity (query log and user’s click)

  • Problems

    • Not specific

      • Data in pages have no semantic annotations

      • Example: Yo-Yo Ma’s most recent CD

    • No topic disambiguation

      • Documents on different topics are mixed together

      • Example: Yo-Yo Ma’s CDs, concerts, biography, gossip, …


Search on semantic web

Search on Semantic Web

  • Metadata search

    • To increase precision and flexibility

  • Topic-based search

    • To help contextualize queries and overlay results in terms of a knowledge base


Xml extensible markup language

XML (Extensible Markup Language)

  • More flexible tags

  • DTD (Document Type Definition)

    • Definition of the tags


Xml search

XML Search

  • XML Text Search Engines

    • Amberfish (Etymon)

    • X3 (X-cubed) (DocSoft)

    • UltraSeek (Verity)

  • XML Structured Query Engines

    • Fxgrep

    • Cheshire II (UC Berkeley)

  • XML Query Languages

    • XQuery (W3C XMLQuery)

    • XQL

    • XML-QL
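To see why tagged data makes a request like “Yo-Yo Ma’s most recent CD” answerable, consider the sketch below. It uses Python's standard xml.etree.ElementTree module rather than any of the engines or query languages listed above, and the catalog (tag names, album titles, years) is entirely hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog: tag names, titles, and years are invented for illustration.
CATALOG_XML = """
<catalog>
  <cd><artist>Yo-Yo Ma</artist><title>Album A</title><year>2005</year></cd>
  <cd><artist>Yo-Yo Ma</artist><title>Album B</title><year>2008</year></cd>
  <cd><artist>Another Artist</artist><title>Album C</title><year>2009</year></cd>
</catalog>
"""


def most_recent_cd(xml_text, artist):
    """Title of the newest <cd> by the given artist, or None if there is none."""
    root = ET.fromstring(xml_text.strip())
    matches = [cd for cd in root.findall("cd") if cd.findtext("artist") == artist]
    if not matches:
        return None
    newest = max(matches, key=lambda cd: int(cd.findtext("year")))
    return newest.findtext("title")


print(most_recent_cd(CATALOG_XML, "Yo-Yo Ma"))  # Album B
```

A keyword search over the same text could match “Yo-Yo Ma” and “CD”, but it would have no way to know which number is a year or which string is a title.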


Semantic web

Semantic Web

  • "The Semantic Web is an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation." -- Tim Berners-Lee, James Hendler, Ora Lassila, The Semantic Web, Scientific American, May 2001


Semantic web1

Semantic Web

[Diagram: several agents connected through RDF and a shared ontology]


Semantic web2

Semantic Web

  • RDF (Resource Description Framework)

    • Common language

  • Ontology

    • Knowledge representation

  • Agent


Why semantic web

Why Semantic Web?

  • Standardizing knowledge sharing and reusability on the Web

  • Interoperable (independent of devices and platforms)

  • Machine readable—enabling intelligent processing of information


An example of semantic relation

An Example of Semantic Relation

[Diagram: a work is “written by” an author; a publisher “publishes” the work]


What is a software agent

What is a Software Agent?

  • A paradigm shift of information utilization from direct manipulation to indirect access and delegation

  • A kind of middleware between information demand (client) and information supply (server)

  • Software that is autonomous, personalized, adaptive, mobile, communicative, and social, and that can make decisions


What is ontology

What is Ontology?

  • An ontology is a formal and explicit specification of a shared conceptualization of a domain of interest (T. Gruber)

    • Formal semantics

    • Consensus of terms

    • Machine readable and processible

    • Model of real world

    • Domain specific


What is ontology 2

What is Ontology? (2)

  • Generalization of

    • Entity relationship diagrams

    • Object database schemas

    • Taxonomies

    • Thesauri

  • Conceptualization contains phenomena like

    • Concepts/classes/frames/entity types

    • Constraints

    • Axioms, rules


Agents and ontology

Agents and Ontology

  • Agents must have domain knowledge to solve domain-specific problems

  • Agents must have common sharable ontology to communicate and share knowledge with each other

  • The common sharable ontology must be represented in a standard format so that all software agents can understand and communicate


Agents and semantic web

Agents and Semantic Web

  • The Semantic Web provides structure for the meaningful content of Web pages, so that software agents roaming from page to page can carry out sophisticated tasks

    • An agent coming to a clinic’s web page will know Dr. Henry works at the clinic on Monday, Wednesday and Friday without having the full intelligence to understand the text…

    • The assumption is that Dr. Henry made the page using an off-the-shelf tool, as well as the resources listed on the Physical Therapy Association’s site


Knowledge representation on the web

Knowledge Representation on the Web

  • The challenge of the Web is to provide a language that expresses both data and rules for reasoning about the data (meta-data), and that allows rules from any existing knowledge representation system to be exported onto the Web

  • Adding logic to the Web means using rules to make inferences, choose actions, and answer questions. The logic must be powerful enough to be useful, but not so powerful that agents can be tricked into considering a paradox


Language layers on the web

Language Layers on the Web

Trust

DAML-L (logic)

Declarative Languages:

OIL, DAML+Ont

DC

PICS

XHTML SMIL

RDF

XML

HTML

Semantic Web infrastructure is built on the RDF data model


Languages on the web

Languages on the Web

  • HTML+URL

  • XML+DTD (Document Type Definition)

  • RDF+RDF schema


Statements rdf

Statements: RDF

  • The basic structure of RDF is object-attribute-value

  • In terms of labeled graph: [O]-A->[V]

[Figure: node O connected to node V by an arrow labeled A]
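The object-attribute-value structure can be modelled directly as a three-field record, with a query being a pattern in which some fields are left open. This is a minimal in-memory sketch, not an RDF library; the class and function names and the `ex:` identifiers are hypothetical.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    obj: str        # the resource being described (O)
    attribute: str  # the property of the resource (A)
    value: str      # the value of that property (V)


# Hypothetical statements; the "ex:" identifiers are placeholders.
STATEMENTS = [
    Statement("ex:book1", "ex:writtenBy", "ex:author1"),
    Statement("ex:book1", "ex:publishedBy", "ex:publisher1"),
    Statement("ex:author1", "ex:name", "Jane Doe"),
]


def match(statements, obj=None, attribute=None, value=None):
    """Return the statements that agree with every field that is not None."""
    return [
        s for s in statements
        if (obj is None or s.obj == obj)
        and (attribute is None or s.attribute == attribute)
        and (value is None or s.value == value)
    ]


def as_edge(statement):
    """Render a statement in the labeled-graph notation [O]-A->[V]."""
    return f"[{statement.obj}]-{statement.attribute}->[{statement.value}]"


for s in match(STATEMENTS, obj="ex:book1"):
    print(as_edge(s))
# [ex:book1]-ex:writtenBy->[ex:author1]
# [ex:book1]-ex:publishedBy->[ex:publisher1]
```

Because the value of one statement can be the object of another (ex:author1 above), a set of statements forms a graph that agents can traverse.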


Semantic web search engine

Semantic Web Search Engine

  • Swoogle: http://swoogle.umbc.edu/ [CIKM 2004]

  • SHOE (Simple HTML Ontology Extensions): http://www.cs.umd.edu/projects/plus/SHOE/search/

  • SWSE: http://www.swse.org/

  • http://www.semanticwebsearch.com/


Applications to ir

Applications to IR

  • Advanced Google

  • Meta Search

  • Search Result Clustering


What do users really want

What do Users Really Want?

  • Topic-based vs. keyword-based

    • “NTU”

  • How to improve current search engines?

  • Resources about Search Engines

    • Search Engine Watch: http://searchenginewatch.com/

    • Research Buzz: http://researchbuzz.com/


Advanced google

Advanced Google

  • Is Google good enough?

    • “NTU”

    • “NTU university”

    • “NTU university Singapore”

  • More and more Services

    • Google Web, Image, News, Video, Google Desktop Search, …

    • Google Groups, Gmail, Google Talk, Google Calendar, …

    • Google Mobile, Google SMS, Google Local, …

    • Google Print (Book Search), Google Maps, Google Earth, …

    • Google Scholar, Translate, Finance, Docs, Reader, …

  • More about Google Services

    • http://www.google.com/options/

    • Google Labs: http://labs.google.com/


More types of document search

More Types of Document Search

  • Google: Web, Image, News, Groups, Desktop (Office, mail), …

  • Microsoft: +Lookout (mail)

  • Yahoo: +Stata (mail), +Adobe (PDF)


Searching different media

Searching Different Media

  • Multimedia Search: MP3, Blog, messenger, mobile, …

    • Baidu.com: MP3, image, news, …

    • Singingfish.com (AOL): audio/video, …

    • GoFish.com: audio, video, mobile, games

    • AllTheWeb.com: pictures, audio, video, …

  • Blog search engines

    • Daypop, Bloogz, Waypath, …

  • A9.com (by Amazon)

    • Books, movies, …

    • Bookmark, history, discover, diary

  • Mobissimo.com

    • Airfare search, hotel search

  • Yahoo-OCLC toolbar: library search

    • Searching Open WorldCat (OCLC union catalog)


Different forms of presentation

Different Forms of Presentation

  • Clusty.com (by Vivisimo)

    • Clustering engine

  • Snap.com (by Idealab)

    • Sorting by popularity, satisfaction, Web popularity, Web satisfaction, domain, …

  • Alexa.com (by Amazon)

    • Average user review ratings, …

  • Visualization

    • TouchGraph Google Browser: http://www.touchgraph.com/TGGoogleBrowser.html

    • Kartoo.com: a visual meta search engine

    • Girafa

    • ConceptSpace

    • LostGoggles (formerly MoreGoogle): thumbnail preview


Focused search engines

Focused Search Engines

  • Scirus: http://scirus.landingzone.nl

    • For scientific information only

  • Google Scholar: http://scholar.google.com/

    • For scholarly literature


Some google hacks and searching tricks

Some Google Hacks and Searching Tricks

  • References:

    • Tara Calishain and Rael Dornfest, “Google Hacks,” O’Reilly

    • Kevin Hemenway and Tara Calishain, “Spidering Hacks,” O’Reilly

    • http://douweosinga.com/projects/googlehacks

    • Tara Calishain, “Web Search Garage,” Prentice Hall

    • Chris Sherman, “Google Power: Unleash the Full Potential of Google,” McGraw Hill


Further utilizing google

Further Utilizing Google…

  • Google API: http://www.google.com/apis/

    • 1,000 automated queries per day

  • Google Hacks

    • Google Talk

    • Word Color

    • Google Battle

    • Google Date

    • Google Best Time to Visit

    • Google Protocol


Meta federated search

Meta (Federated) Search

  • Searches several individual search engines, and their databases of Web pages, simultaneously

    • Ixquick, Metacrawler, Dogpile, …

  • Clustering meta-searchers

    • Vivisimo, KillerInfo, …

  • Meta-search engines for deep digging

    • SurfWax, Copernic Agent, …


Meta search engine

Meta Search Engine

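[Diagram: the user sends a query to the meta search engine, which forwards it to search engines SE1, SE2, …, SEn that index the Web, and returns their combined results]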
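The main design question for a meta search engine is how to merge the ranked lists that come back from SE1 … SEn. One common option, shown here as an assumption rather than what any of the engines named earlier actually did, is reciprocal rank fusion: each result earns 1/(k + rank) from every engine that returns it. The engine outputs and URLs below are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked URL lists; results ranked high by several engines rise to the top."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    # Sort by fused score, using the URL itself to break ties deterministically.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical ranked results from three engines for the same query.
SE1 = ["http://example.org/a", "http://example.org/b", "http://example.org/c"]
SE2 = ["http://example.org/b", "http://example.org/a", "http://example.org/d"]
SE3 = ["http://example.org/b", "http://example.org/c", "http://example.org/a"]

merged = reciprocal_rank_fusion([SE1, SE2, SE3])
print(merged)
```

Here page b comes first because two engines rank it at the top, and page d comes last because only one engine returns it. Duplicates across engines are merged automatically since scores are keyed by URL.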


Search result clustering

Search Result Clustering

  • Why search result clustering?

  • How does search result clustering (SRC) differ from document clustering?

    • In how an algorithm’s quality is assessed

    • Precision and recall vs. user-oriented, subjective assessment


Example of search result clustering

Example of Search Result Clustering

[Example: results for the ambiguous query “NTU” grouped into clusters: National Taiwan University; NTU Hospital; Nanyang Technological University, Singapore]
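A toy version of this grouping is sketched below: each snippet joins the existing cluster whose vocabulary it overlaps most (Jaccard similarity, ignoring the query term itself), or starts a new cluster if nothing is similar enough. The snippets, threshold, and function names are assumptions for illustration; production clustering engines also have to generate readable cluster labels, which this sketch does not attempt.

```python
def tokens(snippet, query):
    """Lower-cased words of a snippet, without the query term itself."""
    return {word for word in snippet.lower().split() if word != query.lower()}


def jaccard(a, b):
    """Overlap of two word sets, from 0 (disjoint) to 1 (identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster_snippets(snippets, query, threshold=0.15):
    """Single-pass clustering; returns lists of snippet indices."""
    clusters = []  # each cluster: {"members": [indices], "tokens": set of words}
    for index, snippet in enumerate(snippets):
        words = tokens(snippet, query)
        best = max(clusters, key=lambda c: jaccard(words, c["tokens"]), default=None)
        if best is not None and jaccard(words, best["tokens"]) >= threshold:
            best["members"].append(index)
            best["tokens"] |= words
        else:
            clusters.append({"members": [index], "tokens": set(words)})
    return [cluster["members"] for cluster in clusters]


# Invented result snippets for the ambiguous query "NTU".
SNIPPETS = [
    "NTU National Taiwan University admissions",
    "National Taiwan University NTU campus map",
    "NTU Hospital outpatient services",
    "Nanyang Technological University NTU Singapore",
    "NTU Singapore Nanyang Technological University ranking",
    "NTU Hospital emergency department",
]

print(cluster_snippets(SNIPPETS, "NTU"))  # [[0, 1], [2, 5], [3, 4]]
```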


Example clustering search engines

Example Clustering Search Engines

  • Vivisimo.com

    • Clusty.com

  • WebClust.com

  • KillerInfo.com

  • InfoNetWare.com

  • SnakeT (Snippet Aggregation for Knowledge ExTraction): http://roquefort.unipi.it/

    • A hierarchical clustering engine for snippets

  • Mooter.com


Example on vivisimo

Example on Vivisimo


Vivisimo cont

Vivisimo (cont.)


Clusty com

Clusty.com


Infonetware com

InfoNetWare.com


Thanks for your attention

Thanks for Your Attention!

