Research Problems in Digital Libraries: Data Mining and Text Mining

Research Problems in Digital Libraries: Data Mining and Text Mining • Jaime Carbonell and Raj Reddy, Carnegie Mellon University • April 21, 2006 • Talk presented at the CS50 symposium at CMU

Keepers of the Faith

Digital Libraries and Universal Access to Information • Create a Universal Digital Library containing all the books ever published • Unfortunately many of the books are in English • Not readable by over 80% of the population

Information Overload • If we read a book every day, we can read at most 40,000 books in a lifetime • Having millions of books online and accessible creates an information overload • “we have a wealth of information and scarcity of (human) attention!”, Herbert Simon • Multilingual search technology can help reduce the overload • It permits users to search very large databases quickly and reliably, independent of language and location

Understanding Language • Books in non-native languages remain incomprehensible to most people • Translation and summarization are essential for worldwide use • Current translation systems are not yet perfect • Language understanding systems have improved significantly in the past few decades • Systems based on statistical and linguistic techniques have shown significant performance improvements • Machine learning can improve performance further • Digitization projects will act as a test bed for validating language understanding systems research • e.g. The Million Book Digital Library Project

The Million Book Digital Library • Collaborative venture among many countries including USA, China and India • So far 400,000 books have been scanned in China and 200,000 in India • Content is made freely available around the globe • Those wishing to see the Video in the next slide should download from http://www.rr.cs.cmu.edu/MSRI.zip

Million Book Project: Status • 21 centers in India • 17 centers in China • 1 center in Egypt • Planned: Australia and Europe • About 600,000 books scanned • About 120,000+ accessible on the web from India • http://dli.iiit.ac.in/ • Uses 8 TB of storage • 10 TB server at CMU Library planned for July 2005 • 1,000,000 books by the end of 2007 • Capacity to scan a million pages a day expected to be operational by the end of 2006

Million Book Project: Research Challenges • Providing access to billions every day • Distributed cached servers in every country and region • Self-healing databases • Easy-to-use interfaces for billions • Text Mining Challenges • Multilingual information retrieval • Summarization • Text categorization • Named-entity identification • Novelty detection • Translation

Information Bill of Rights • Get the right information • To the right people • At the right time • On the right medium • In the right language • With the right level of detail

Relevant Text Mining Technologies • “…right information”: IR (search engines) • “…right people”: Classification, routing • “…right time”: Anticipatory analysis • “…right medium”: Info extraction, speech • “…right language”: Machine translation • “…right level of detail”: Summarization

… The Right Information: Next Generation Search Engines • Search Criteria Beyond Query-Relevance • Google: Popularity (link density, click freq, …) • Vivisimo: Panoramic view (clustering + labeling) • Information novelty (content differential, recency) • Trustworthiness of source • Appropriateness to user (difficulty level, …) • Hidden web: 10X visible web (Federated search) • “Find What I Mean” Principle • Search on semantically related terms • Induce user profile from past history, etc. • Disambiguate terms (e.g. “Jordan”)

Clustering (Vivisimo-style) Search vs Standard IR [Figure: a query against a set of documents; standard IR compared with cluster summaries]

MMR Ranking vs Standard IR [Figure: a query against a set of documents, comparing the documents picked by MMR and by standard IR; λ controls spiral curl]
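
A minimal sketch of Maximal Marginal Relevance (MMR) re-ranking, to make the role of λ concrete. It assumes bag-of-words cosine similarity and greedy selection. The names `cosine` and `mmr_rank` and the default `lam=0.7` are illustrative choices, not taken from the talk.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def mmr_rank(query, docs, lam=0.7, k=None):
    """Greedy MMR: return document indices in selection order.

    Each step picks the document maximizing
        lam * sim(doc, query) - (1 - lam) * max sim(doc, already selected).
    lam = 1.0 is plain relevance ranking; smaller lam favors novelty.
    """
    q = Counter(query.lower().split())
    vecs = [Counter(d.lower().split()) for d in docs]
    remaining = list(range(len(docs)))
    selected = []
    k = len(docs) if k is None else min(k, len(docs))

    while len(selected) < k:
        def score(i):
            relevance = cosine(vecs[i], q)
            # Redundancy is 0 while nothing has been selected yet.
            redundancy = max((cosine(vecs[i], vecs[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam=1.0` the ordering reduces to standard relevance ranking, so exact duplicates of a top hit are returned next to each other. Lowering `lam` pushes documents that repeat already-selected content further down the list.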

… In the Right Level of Detail: Synthetic Document = Summary++ • Extractive combo (tracking, MMR, …) • Centrality of info • KIT model relevant • Novelty (vs last time) • Entities, relations, dates, … + raw text • Later: contradiction & attitude detection • Combine: CMU, IBM (NE + rel extraction), UMD (user model, summ), Stanford (contradiction detection) [Figure labels: Entities; Relations; Audio transcripts; Textual summary; Texts (Eng, Arabic, Chinese …); Analyst zoom-in; Novel; Attitude; mixed Sources]

… In the Right Language (MT) [Figure: translation paths from Source (Arabic) to Target (English). Labels: Interlingua; Semantic Analysis; Sentence Planning; Transfer Rules; Syntactic Parsing; Text Generation; Direct: EBMT, SMT]

EBMT example • English: I would like to meet her. Mapudungun: Ayükefun trawüael fey engu. • English: The tallest man is my father. Mapudungun: Chi doy fütra chi wentru fey ta inche ñi chaw. • English: I would like to meet the tallest man. Mapudungun (new): Ayükefun trawüael Chi doy fütra chi wentru. Mapudungun (correct): Ayüken ñi trawüael chi doy fütra wentru engu.
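
The naive recombination in the example above can be sketched as a longest-match fragment lookup. This is a toy illustration, not the EBMT system from the talk: the fragment alignments in `FRAGMENTS` are assumptions read off the two example sentence pairs, and `translate_by_fragments` is a hypothetical helper name.

```python
# Fragment pairs taken from the two example sentences on the slide.
# The English/Mapudungun alignments are assumed for illustration only;
# a real EBMT system would learn them from a parallel corpus.
FRAGMENTS = {
    "i would like to meet": "Ayükefun trawüael",
    "her": "fey engu",
    "the tallest man": "chi doy fütra chi wentru",
    "is my father": "fey ta inche ñi chaw",
}


def translate_by_fragments(sentence, fragments=FRAGMENTS):
    """Translate by greedily matching the longest known source fragment."""
    words = sentence.lower().rstrip(".").split()
    out, i = [], 0
    while i < len(words):
        # Try the longest span starting at position i first.
        for j in range(len(words), i, -1):
            chunk = " ".join(words[i:j])
            if chunk in fragments:
                out.append(fragments[chunk])
                i = j
                break
        else:
            # No fragment matched: pass the word through untranslated.
            out.append(words[i])
            i += 1
    return " ".join(out)


print(translate_by_fragments("I would like to meet the tallest man"))
```

Gluing fragments together reproduces the "new" output on the slide (up to capitalization). It cannot produce the correct sentence, which changes the verb form and adds "engu", so fragment lookup alone is not enough.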

Illustration of Multi-Engine MT

[Figure: timeline with years 1986, 1991, 1993, 1996, 2000. Labels, in extraction order: Interlingua, Spoken Language, Multi Engine, Example Based, Statistical, Low Resource, Automatic MT Evaluation, Portable, Letras, Avenue, MEMT, METEOR, Diplomat, Tongues, GEBMT, KANT, MT Lab, KBMT-89, JANUS, C-STAR I, Pangloss, RADD - MT/TIDES, GALE, Enthusiast, TransTac, C-STAR II, ThaiLator, Nespole, Lingwear, Semantic Annotation, Speechalator, Q & A Extraction, CALL]

“Language of Life”: vocabulary chemical groups, properties of AA

Evolutionary Methods for Discovering Sequence → Structure Mapping [Figure: distribution of amino acids in a multiple sequence alignment across Human, Monkey, Mouse, Rat, Cow, Dog, Fly, Worm, Yeast; conserved properties across Rhodopsin]

Results: -Helical Rung Prediction • 1DBG: correctly identify 10 out of 11 rungs

Concluding Observations… and Exaggerations • Everything can be reduced to information • Information is the key to everything • All “natural” information has an underlying language (genomics, linguistics, …) • Information is at all levels of granularity • Subatomic → DNA/proteins → society → … • Information + language + computation = lifetime employment
