
Research Problems in Digital Libraries: Data Mining and Text Mining

Jaime Carbonell and Raj Reddy, Carnegie Mellon University. April 21, 2006. Talk presented at the CS50 symposium at CMU.


Presentation Transcript


  1. Research Problems in Digital Libraries: Data Mining and Text Mining • Jaime Carbonell and Raj Reddy, Carnegie Mellon University • April 21, 2006 • Talk presented at CS50 symposium at CMU

  2. Keepers of the Faith

  3. Digital Libraries and Universal Access to Information • Create a Universal Digital Library containing all the books ever published • Unfortunately many of the books are in English • Not readable by over 80% of the population

  4. Information Overload • If we read a book every day, we can read at most 40,000 books in a lifetime • Having millions of books online and accessible creates an information overload • “we have a wealth of information and scarcity of (human) attention!”, Herbert Simon • Multilingual search technology can help to reduce the overload • It permits users to search very large databases quickly and reliably, independent of language and location
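
The reading budget on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration: the 365-day year, the 80-year reading span, and the function names are assumptions, not figures from the talk.

```python
DAYS_PER_YEAR = 365  # assumption: ignore leap days


def books_read(years: float, books_per_day: float = 1.0) -> int:
    """Number of books finished when reading at a steady daily rate."""
    return int(years * DAYS_PER_YEAR * books_per_day)


def years_needed(books: int, books_per_day: float = 1.0) -> float:
    """Years of steady reading needed to finish a given number of books."""
    return books / (DAYS_PER_YEAR * books_per_day)


if __name__ == "__main__":
    # Hypothetical 80-year reading span at one book per day.
    print(books_read(80))
    # Years needed to reach the slide's ceiling of 40,000 books.
    print(round(years_needed(40_000), 1))
    # Share of a million-book library that 40,000 books represents.
    print(40_000 / 1_000_000)
```

At one book a day, 40,000 books takes roughly 110 years, so the slide's figure is a generous upper bound, and it still covers only 4% of a million-book collection.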

  5. Understanding Language • Books in non-native languages remain incomprehensible to most people • Translation and Summarization are essential for worldwide use • Current translation systems are not yet perfect • Language understanding systems have improved significantly in the past few decades • Systems based on statistical and linguistic techniques have shown significant performance improvements • Machine learning improves performance further • Digitization projects will act as a test bed for validating Language Understanding Systems Research • e.g. The Million Book Digital Library Project

  6. The Million Book Digital Library • Collaborative venture among many countries including USA, China and India • So far 400,000 books have been scanned in China and 200,000 in India • Content is made freely available around the globe • Those wishing to see the Video in the next slide should download from http://www.rr.cs.cmu.edu/MSRI.zip

  7. Million Book Project: Status • 21 centers in India • 17 centers in China • 1 center in Egypt • Planned: Australia and Europe • About 600,000 books scanned • About 120,000+ accessible on the web from India • http://dli.iiit.ac.in/ • Uses 8 TB of storage • 10 TB server at CMU Library planned for July 2005 • 1,000,000 books by the end of 2007 • Capacity to scan a million pages a day expected to be operational by the end of 2006

  8. Million Book Project: Research Challenges • Providing Access to Billions every day • Distributed Cached Servers in every country and region • Self-Healing Databases • Easy-to-use interfaces for Billions • Text Mining Challenges • Multilingual Information Retrieval • Summarization • Text Categorization • Named-Entity identification • Novelty Detection • Translation

  9. Information Bill of Rights • Get the right information • To the right people • At the right time • On the right medium • In the right language • With the right level of detail

  10. Relevant Text Mining Technologies • “…right information”: IR (search engines) • “…right people”: Classification, routing • “…right time”: Anticipatory analysis • “…right medium”: Info extraction, speech • “…right language”: Machine translation • “…right level of detail”: Summarization

  11. … The Right Information: Next Generation Search Engines • Search Criteria Beyond Query-Relevance • Google: Popularity (link density, click freq, …) • Vivisimo: Panoramic view (clustering + labeling) • Information novelty (content differential, recency) • Trustworthiness of source • Appropriateness to user (difficulty level, …) • Hidden web: 10X visible web (Federated search) • “Find What I Mean” Principle • Search on semantically related terms • Induce user profile from past history, etc. • Disambiguate terms (e.g. “Jordan”)

  12. Clustering (Vivisimo-style) Search vs Standard IR • [Figure: documents retrieved for a query, shown as a standard IR result and as clusters with cluster summaries]

  13. MMR Ranking vs Standard IR • [Figure: documents around a query, ranked by MMR and by standard IR] • λ controls spiral curl
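
The MMR (Maximal Marginal Relevance) ranking named on this slide is usually described as a greedy loop: at each step, pick the unselected document that maximizes λ · Sim(d, query) − (1 − λ) · max Sim(d, already selected). The sketch below is a minimal illustration under stated assumptions, not the system shown in the talk: Jaccard overlap of whitespace tokens stands in for whatever similarity measure a real engine would use, and the names `jaccard` and `mmr_rank` are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Set


def _tokens(text: str) -> Set[str]:
    """Lowercased whitespace tokens (a deliberately crude tokenizer)."""
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Token-set overlap in [0, 1]; a stand-in similarity (assumption)."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def mmr_rank(
    query: str,
    docs: Sequence[str],
    lam: float = 0.7,
    k: Optional[int] = None,
    sim: Callable[[str, str], float] = jaccard,
) -> List[int]:
    """Return document indices in MMR order.

    lam = 1.0 reduces to plain relevance ranking; smaller values
    penalize documents similar to ones already selected.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    limit = len(docs) if k is None else min(k, len(docs))
    selected: List[int] = []
    remaining = list(range(len(docs)))

    while remaining and len(selected) < limit:
        def score(i: int) -> float:
            relevance = sim(docs[i], query)
            # Redundancy is the closest match among documents already chosen.
            redundancy = max((sim(docs[i], docs[j]) for j in selected), default=0.0)
            return lam * relevance - (1.0 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    docs = [
        "digital library search",
        "digital library search engine",
        "translation search for every library",
    ]
    print(mmr_rank("digital library search", docs, lam=1.0))  # [0, 1, 2]
    print(mmr_rank("digital library search", docs, lam=0.3))  # [0, 2, 1]
```

With λ = 1.0 the near-duplicate second document keeps its second place; with λ = 0.3 it drops below the less relevant but more novel third document.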

  14. … In The Right Level of Detail: Synthetic Document = Summary++ • Extractive combo (tracking, MMR, …) • Centrality of info • KIT model relevant • Novelty (vs last time) • Entities, relations, dates, … + raw text • Later: contradiction & attitude detection • Combine: CMU, IBM (NE + rel extraction), UMD (user model, summ), Stanford (contradiction detection) • [Figure labels: Entities; Relations; Audio transcripts; Textual summary; Texts (Eng, Arabic, Chinese …); Analyst zoom-in; Novel; Attitude; mixed Sources]

  15. … In the Right Language (MT) • [Figure labels: Source (Arabic); Syntactic Parsing; Semantic Analysis; Interlingua; Sentence Planning; Text Generation; Target (English); Transfer Rules; Direct: EBMT, SMT]

  16. EBMT example • English: I would like to meet her. • Mapudungun: Ayükefun trawüael fey engu. • English: The tallest man is my father. • Mapudungun: Chi doy fütra chi wentru fey ta inche ñi chaw. • English: I would like to meet the tallest man • Mapudungun (new): Ayükefun trawüael Chi doy fütra chi wentru • Mapudungun (correct): Ayüken ñi trawüael chi doy fütra wentruengu.
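
The naive recombination step in this example can be sketched as a greedy longest-match lookup over stored fragments. This is a toy illustration, not the EBMT system from the talk: the two fragment alignments are assumptions read off the slide's "Mapudungun (new)" output, and `FRAGMENTS` and `translate_by_fragments` are hypothetical names.

```python
from typing import Dict, List

# Assumed fragment alignments, inferred from the two example pairs:
# the slide's "new" output is exactly these two target fragments joined.
FRAGMENTS: Dict[str, str] = {
    "i would like to meet": "Ayükefun trawüael",
    "the tallest man": "Chi doy fütra chi wentru",
}


def translate_by_fragments(sentence: str, fragments: Dict[str, str] = FRAGMENTS) -> str:
    """Cover the input left to right with the longest known source fragments.

    Target fragments are concatenated unchanged, which is why the result
    can be ungrammatical. Unknown words are passed through in <brackets>.
    """
    words: List[str] = sentence.lower().rstrip(".").split()
    out: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest span starting at position i first.
        for j in range(len(words), i, -1):
            chunk = " ".join(words[i:j])
            if chunk in fragments:
                out.append(fragments[chunk])
                i = j
                break
        else:
            out.append(f"<{words[i]}>")
            i += 1
    return " ".join(out)


if __name__ == "__main__":
    print(translate_by_fragments("I would like to meet the tallest man"))
```

The output reproduces the slide's "Mapudungun (new)" line, not the "correct" one: plain concatenation cannot adjust the verb form or drop the repeated article, which is the gap the slide points out.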

  17. Illustration of Multi-Engine MT

  18. [Timeline figure] • Years: 1986, 1991, 1993, 1996, 2000 • Themes: Interlingua, Spoken Language, Multi Engine, Example Based, Statistical, Low Resource, Automatic MT Evaluation, Portable • Labels: Letras, Avenue, MEMT, METEOR, Diplomat, Tongues, GEBMT, KANT, MT Lab, KBMT-89, JANUS, C-STAR I, Pangloss, RADD - MT/TIDES, GALE, Enthusiast, TransTac, C-STAR II, ThaiLator, Nespole, Lingwear, Semantic Annotation, Speechalator, Q & A Extraction, CALL

  19. “Language of Life”: vocabulary chemical groups, properties of AA

  20. Evolutionary Methods for Discovering Sequence → Structure Mapping • Distribution of amino acids • A Multiple Sequence Alignment: Human, Monkey, Mouse, Rat, Cow, Dog, Fly, Worm, Yeast • Conserved Properties across Rhodopsin

  21. Results: -Helical Rung Prediction • 1DBG: correctly identify 10 out of 11 rungs

  22. Concluding Observations… and Exaggerations • Everything can be reduced to Information • Information is the key to everything • All “natural” information has an underlying language (genomics, linguistics, …) • Information is at all levels of granularity • Subatomic → DNA/proteins → society → … • Information + language + computation = lifetime employment
