The New “Bill of Rights” of Information Society

Raj Reddy and Jaime Carbonell

Carnegie Mellon University

March 23, 2006

Talk at Google


New Bill of Rights

  • Get the right information

    • e.g. search engines

  • To the right people

    • e.g. categorizing, routing

  • At the right time

    • e.g. Just-in-Time (task modeling, planning)

  • In the right language

    • e.g. machine translation

  • With the right level of detail

    • e.g. summarization

  • In the right medium

    • e.g. access to information in non-textual media


Relevant Technologies

  • search engines → “…right information”

  • classification, routing → “…right people”

  • anticipatory analysis → “…right time”

  • machine translation → “…right language”

  • summarization → “…right level of detail”

  • speech input and output → “…right medium”



The Right Information

  • Right Information from future Search Engines

    • How to go beyond just “relevance to query” (all) and “popularity”

  • Eliminate massive redundancy, e.g. for the query “web-based email”

    • Should not result in

      • multiple links to different Yahoo sites promoting their email, or even non-Yahoo sites discussing just Yahoo email.

    • Should result in

      • a link to Yahoo email, one to MSN email, one to Gmail, one that compares them, etc.

  • First show trusted info sources and user-community-vetted sources

    • At least for important info (medical, financial, educational, …), I want to trust what I read, e.g.,

      • For new medical treatments

        • First info from hospitals, medical schools, the AMA, medical publications, etc. , and

        • NOT from Joe Shmo’s quack practice page or from the National Enquirer.

  • Maximum Marginal Relevance

  • Novelty Detection

  • Named Entity Extraction


Beyond Pure Relevance in IR

  • Current Information Retrieval Technology Only Maximizes Relevance to Query

  • What about information novelty, timeliness, appropriateness, validity, comprehensibility, density, medium, ...?

  • Novelty is approximated by non-redundancy!

    • we really want to maximize: relevance to the query, given the user profile and interaction history,

      P(U(f₁, …, fₙ) | Q & {C} & U & H)

        where Q = query, {C} = collection set,

        U = user profile, H = interaction history

    • ...but we don’t yet know how. Darn.


Maximal Marginal Relevance vs. Standard Information Retrieval

[Diagram: documents retrieved for a query. Standard IR selects purely by similarity to the query; MMR also spreads the selected documents away from one another to reduce redundancy.]
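The greedy MMR selection contrasted in the diagram can be sketched in a few lines of Python. This is a toy bag-of-words version; the whitespace tokenizer, λ = 0.7, and the example documents are illustrative choices, not from the talk:

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def mmr_rank(query, docs, lam=0.7, k=3):
    """Greedy Maximal Marginal Relevance: at each step pick the document
    that balances relevance to the query against redundancy with the
    documents already selected."""
    q = Counter(query.lower().split())
    vecs = {d: Counter(d.lower().split()) for d in docs}
    selected, candidates = [], list(docs)
    while candidates and len(selected) < k:
        def score(d):
            relevance = cosine(q, vecs[d])
            redundancy = max((cosine(vecs[d], vecs[s]) for s in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With λ = 1 this degenerates to standard relevance ranking; lowering λ trades relevance for diversity.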


Novelty Detection

  • Find the first report of a new event

  • (Unconditional) Dissimilarity with Past

    • Decision threshold on most-similar story

    • (Linear) temporal decay

    • Length-filter (for teasers)

  • Cosine similarity with standard (TF-IDF) term weights
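The detection rule above can be sketched as follows: flag a story as the first report of a new event when its best cosine match against all past stories falls below a decision threshold. The 0.5 threshold is illustrative, and the temporal decay and length filter are omitted:

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def is_first_story(story, past_stories, threshold=0.5):
    """Unconditional dissimilarity with the past: novel iff the
    most-similar past story stays under the threshold."""
    vec = Counter(story.lower().split())
    best = max((cosine(vec, Counter(p.lower().split())) for p in past_stories),
               default=0.0)
    return best < threshold
```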


New First Story Detection Directions

  • Topic-conditional models

    • e.g. “airplane,” “investigation,” “FAA,” “FBI,” “casualties” → topic, not event

    • “TWA 800,” “March 12, 1997” → event

    • First categorize into topic, then use maximally-discriminative terms within topic

  • Rely on situated named entities

    • e.g. “Arcan as victim,” “Sharon as peacemaker”


Link Detection in Texts

  • Find texts (e.g. news stories) that mention the same underlying events.

  • Could be combined with novelty detection (e.g. something new about an interesting event).

  • Techniques: text similarity, NE’s, situated NE’s, relations, topic-conditioned models, …


Named-Entity Identification

Purpose: to answer questions such as:

  • Who is mentioned in these 100 articles?

  • What locations are listed in these 2000 web pages?

  • What companies are mentioned in these patent applications?

  • What products were evaluated by Consumer Reports this year?


Named Entity Identification

President Clinton decided to send special trade envoy Mickey Kantor to the special Asian economic meeting in Singapore this week. Ms. Xuemei Peng, trade minister from China, and Mr. Hideto Suzuki from Japan’s Ministry of Trade and Industry will also attend. Singapore, who is hosting the meeting, will probably be represented by its foreign and economic ministers. The Australian representative, Mr. Langford, will not attend, though no reason has been given. The parties hope to reach a framework for currency stabilization.


Methods for NE Extraction

  • Finite-State Transducers w/variables

    • Example output:

      FNAME: “Bill” LNAME: “Clinton” TITLE: “President”

    • FSTs Learned from labeled data

  • Statistical learning (also from labeled data)

    • Hidden Markov Models (HMMs)

    • Exponential (maximum-entropy) models

    • Conditional Random Fields [Lafferty et al]
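A single hand-written rule of the finite-state flavor can be mimicked with a regular expression. The pattern below is a toy illustration (one rule, a fixed title list), not a transducer learned from labeled data as the slide describes:

```python
import re

# One toy rule: a title followed by one or two capitalized names,
# emitting TITLE/FNAME/LNAME slots like the example output above.
PERSON = re.compile(r"\b(President|Mr\.|Ms\.|Dr\.)\s+([A-Z][a-z]+)(?:\s+([A-Z][a-z]+))?")

def extract_people(text):
    people = []
    for match in PERSON.finditer(text):
        title, first, last = match.groups()
        if last is None:                # single name: treat it as the surname
            first, last = None, first
        people.append({"TITLE": title, "FNAME": first, "LNAME": last})
    return people
```

Statistical taggers (HMMs, CRFs) replace such brittle patterns with models that score every possible labeling of the token sequence.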


Named Entity Identification

Extracted Named Entities (NEs)

  People:
    • President Clinton
    • Mickey Kantor
    • Ms. Xuemei Peng
    • Mr. Hideto Suzuki
    • Mr. Langford

  Places:
    • Singapore
    • Japan
    • China
    • Australia


Role-Situated NE’s

Motivation: It is useful to know roles of NE’s:

  • Who participated in the economic meeting?

  • Who hosted the economic meeting?

  • Who was discussed in the economic meeting?

  • Who was absent from the economic meeting?


Emerging Methods for Extracting Relations

  • Link Parsers at Clause Level

    • Based on dependency grammars

    • Probabilistic enhancements [Lafferty, Venable]

  • Island-Driven Parsers

    • GLR* [Lavie], Chart [Nyberg, Placeway], LC-Flex [Rose’]

    • Tree-bank-trained probabilistic CF parsers [IBM, Collins]

  • Herald the return of deep(er) NLP techniques.

  • Relevant to new Q/A from free-text initiative.

  • Too complex for inductive learning (today).


Relational NE Extraction

Example: (Who does What to Whom)

"John Snell reporting for Wall Street. Today Flexicon Inc. announced a tender offer for Supplyhouse Ltd. for $30 per share, representing a 30% premium over Friday’s closing price. Flexicon expects to acquire Supplyhouse by Q4 2001 without problems from federal regulators"


Fact Extraction Application

  • Useful for relational DB filling, to prepare data for “standard” DM/machine-learning methods

    Acquirer    Acquiree      Share price   Year
    ____________________________________________
    Flexicon    Logi-truck    18            1999
    Flexicon    Supplyhouse   30            2001
    buy.com     reel.com      10            2000
    ...         ...           ...           ...
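One row of such a table can be filled by matching an announcement template against text like the tender-offer example on the previous slide. The pattern below is hypothetical and handles only this one phrasing; real fact extraction uses many learned or hand-built patterns:

```python
import re

# Hypothetical pattern for one announcement template:
# "<Acquirer> announced a tender offer for <Acquiree> for $<N> per share"
DEAL = re.compile(
    r"([A-Z][\w.]*(?:\s+(?:Inc|Ltd|Corp)\.)?)\s+announced a tender offer for\s+"
    r"([A-Z][\w.]*(?:\s+(?:Inc|Ltd|Corp)\.)?)\s+for \$(\d+) per share")

def extract_deal(text):
    """Return a relational-DB-ready record, or None if no deal is found."""
    match = DEAL.search(text)
    if match is None:
        return None
    return {"acquirer": match.group(1),
            "acquiree": match.group(2),
            "share_price": int(match.group(3))}
```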


“…right people”: Text Categorization


The Right People

  • User-focused search is key

    • If a 7-year-old working on a school project about taking good care of one’s heart types in “heart care”, she will want links to pages like

      • “You and your friendly heart”,

      • “Tips for taking good care of your heart”,

      • “Intro to how the heart works”, etc.

      • NOT the latest New England Journal of Medicine article on “Cardiological implications of immuno-active proteases”.

    • If a cardiologist issues the query, exactly the opposite is desired

    • Search engines must know their users better, and the user tasks

  • Social affiliation groups, for search and for automatically categorizing, prioritizing, and routing incoming info or search results. New machine-learning technology allows for scalable, high-accuracy hierarchical categorization.

    • Family group

    • Organization group

    • Country group

    • Disaster affected group

    • Stockholder group


Text Categorization

Assign labels to each document or web-page

  • Labels may be topics such as Yahoo-categories

    • finance, sports, NewsWorldAsiaBusiness

  • Labels may be genres

    • editorials, movie-reviews, news

  • Labels may be routing codes

    • send to marketing, send to customer service


Text Categorization

Methods

  • Manual assignment

    • as in Yahoo

  • Hand-coded rules

    • as in Reuters

  • Machine Learning (dominant paradigm)

    • Words in text become predictors

    • Category labels become “to be predicted”

    • Predictor-feature reduction (SVD, χ², …)

    • Apply any inductive method: kNN, NB, DT,…
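The dominant paradigm on this slide, with words as predictors and the category label as the thing predicted, can be illustrated with a tiny multinomial Naive Bayes classifier. The training data below is made up for the sketch; whitespace tokens stand in for real feature extraction:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over whitespace tokens, add-one smoothing."""

    def fit(self, docs, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for doc, label in zip(docs, labels):
            words = doc.lower().split()
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, doc):
        def log_prob(label):
            lp = math.log(self.label_counts[label] / sum(self.label_counts.values()))
            total = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in doc.lower().split():
                lp += math.log((self.word_counts[label][w] + 1) / total)
            return lp
        return max(self.label_counts, key=log_prob)
```

kNN or decision trees would slot in behind the same fit/predict interface, as the slide notes.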



“…right timeframe”: Just-in-Time, no sooner or later


Just-in-Time Information

  • Get the information to the user exactly when it is needed

    • Immediately, when the information is requested

    • Pre-positioned, if it takes time to fetch and download (e.g. HDTV video)

      • requires anticipatory analysis and pre-fetching

  • How about “push technology”, e.g. for stock alerts, reminders, breaking news?

    • Depends on user activity:

      • Sleeping or Don’t Disturb or in Meeting → wait your chance

      • Reading email → now if info is urgent, later otherwise

      • Group info before delivering (e.g. show 3 stock alerts together)

      • Info directly relevant to user’s current task → immediately
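The delivery rules above amount to a small decision function. This is a sketch; the state names, the three outcomes, and the fallback to batching are illustrative choices:

```python
def delivery_decision(user_state, urgent, relevant_to_current_task):
    """Decide when to push an item of information to the user,
    following the activity-dependent rules on this slide."""
    if user_state in ("sleeping", "do_not_disturb", "in_meeting"):
        return "hold"                        # wait your chance
    if relevant_to_current_task:
        return "deliver_now"                 # directly relevant to current task
    if user_state == "reading_email":
        return "deliver_now" if urgent else "batch_for_later"
    return "batch_for_later"                 # group with other pending alerts
```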


“…right language”: Translation


Access to Multilingual Information

  • Language Identification (from text, speech, handwriting)

  • Trans-lingual retrieval (query in one language, results in multiple languages)

    • Requires more than query-word out-of-context translation (see Carbonell et al 1997 IJCAI paper) to do it well

  • Full translation (e.g. of web page, of search results snippets, …)

    • General reading quality (as targeted now)

    • Focused on getting entities right (who, what, where, when mentioned)

  • Partial on-demand translation

    • Reading assistant: translation in context while reading an original document, by highlighting unfamiliar words, phrases, passages.

    • On-demand Text to Speech

  • Transliteration
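The first bullet, language identification from text, is often done with character n-gram profiles. A minimal sketch, assuming pre-built trigram profiles of comparable size per language (a crude overlap measure stands in for the usual rank-based "out-of-place" statistic):

```python
from collections import Counter

def trigram_profile(text):
    """Character-trigram counts, with padding so word boundaries count."""
    text = " " + text.lower() + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def identify_language(text, profiles):
    """Pick the language whose trigram profile overlaps most with the
    input. profiles: dict mapping language code -> trigram Counter."""
    probe = trigram_profile(text)
    def overlap(lang):
        prof = profiles[lang]
        return sum(min(probe[g], prof[g]) for g in probe)
    return max(profiles, key=overlap)
```

With profiles trained on megabytes rather than one sentence, and proper normalization for profile size, this simple scheme is already quite accurate.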


“…in the Right Language”

  • Knowledge-Engineered MT

    • Transfer-rule MT (commercial systems)

    • High-accuracy interlingual MT (domain-focused)

  • Parallel-Corpus-Trainable MT

    • Statistical MT (noisy channel, exponential models)

    • Example-Based MT (generalized G-EBMT)

    • Transfer-rule learning MT (corpus & informants)

  • Multi-Engine MT

    • Omnivorous approach: combines the above to maximize coverage & minimize errors


Types of Machine Translation

[Diagram: the MT pyramid from source (Arabic) to target (English). At the base, direct translation (EBMT); one level up, transfer rules connect syntactic parsing of the source to text generation in the target; at the apex, an interlingua connects semantic analysis to sentence planning.]


EBMT Example

English: I would like to meet her.
Mapudungun: Ayükefun trawüael fey engu.

English: The tallest man is my father.
Mapudungun: Chi doy fütra chi wentru fey ta inche ñi chaw.

English: I would like to meet the tallest man.
Mapudungun (new): Ayükefun trawüael Chi doy fütra chi wentru
Mapudungun (correct): Ayüken ñi trawüael chi doy fütra wentruengu.
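The splice in this example can be sketched as greedy longest-fragment matching over a table of aligned example fragments. The table below is hand-built from the two examples above; real EBMT induces such sub-sentential alignments from the parallel corpus, which is why the naive splice misses the agreement fixes visible in the "correct" line:

```python
def ebmt_translate(sentence, fragment_table):
    """Cover the input with the longest known source fragments and emit
    their aligned translations; unknown words pass through in brackets.
    fragment_table: dict mapping source word-sequences to target strings."""
    words = sentence.lower().rstrip(".").split()
    out, i = [], 0
    while i < len(words):
        for j in range(len(words), i, -1):        # longest fragment first
            fragment = " ".join(words[i:j])
            if fragment in fragment_table:
                out.append(fragment_table[fragment])
                i = j
                break
        else:
            out.append(f"<{words[i]}>")           # no example covers this word
            i += 1
    return " ".join(out)
```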


Multi-Engine Machine Translation

  • MT Systems have different strengths

    • Rapidly adaptable: Statistical, example-based

    • Good grammar: Rule-Based (linguistic) MT

    • High precision in narrow domains: KBMT

    • Minority Language MT: Learnable from informant

  • Combine results of parallel-invoked MT

    • Select best of multiple translations

    • Selection based on optimizing combination of:

      • Target language joint-exponential model

      • Confidence scores of individual MT engines
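The selection step can be sketched as a weighted combination of the two signals above. A crude add-one-smoothed unigram language model stands in for the target-language joint-exponential model, and the 0.5 weight is an arbitrary illustrative choice:

```python
import math
from collections import Counter

def lm_score(sentence, counts, total):
    """Length-normalized log-probability under an add-one-smoothed
    unigram model of the target language."""
    words = sentence.lower().split()
    denom = total + len(counts) + 1
    return sum(math.log((counts[w] + 1) / denom) for w in words) / max(len(words), 1)

def select_translation(candidates, counts, total, weight=0.5):
    """candidates: (translation, engine_confidence) pairs. Pick the
    hypothesis maximizing a weighted mix of target-language model score
    and the producing engine's own confidence."""
    def combined(pair):
        text, confidence = pair
        return weight * lm_score(text, counts, total) + (1 - weight) * confidence
    return max(candidates, key=combined)[0]
```

Here a fluent hypothesis from a low-confidence engine can beat a garbled one from an overconfident engine, which is the point of the combination.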



State of the Art in MEMT for New “Hot” Languages

We can do now:

  • Gisting MT for any new language in 2-3 weeks (given parallel text)

  • Medium-quality MT in 6 months (given more parallel text, an informant, a bilingual dictionary)

  • Improve-as-you-go MT

  • Fielded MT systems on PCs

We cannot do yet:

  • High-accuracy MT for open domains

  • Coping with spoken-only languages

  • Reliable speech-to-speech MT (but BABYLON is coming)

  • MT on your wristwatch


“…right level of detail”: Summarization


Right Level of Detail

  • Automated summarization with hyperlinked one-click drilldown on user-selected section(s).

  • Purpose-driven: summaries are in service of an information need, not one-size-fits-all (as in Shaom’s outline and the DUC NIST evaluations)

    • EXAMPLE: A summary of a 650-page clinical study can focus on

      • effectiveness of the new drug for target disease

      • methodology of the study (control group, statistical rigor,…)

      • deleterious side effects if any

      • target population of the study (e.g. acne-suffering teens, not eczema-suffering adults), depending on the user’s task or information query


Information Structuring and Summarization

  • Hierarchical multi-level pre-computed summary structure, or on-the-fly drilldown expansion of info.

    • Headline <20 words

    • Abstract 1% or 1 page

    • Summary 5-10% or 10 pages

    • Document 100%

  • Scope of Summary

    • Single big document (e.g. big clinical study)

    • Tight cluster of search results (e.g. Vivisimo)

    • Related set of clusters (e.g. conflicting opinions on how to cope with Iran’s nuclear capabilities)

    • Focused area of knowledge (e.g. What’s known about Pluto? Lycos has a good project on this via HotBot)

    • Specific kinds of commonly asked information (e.g. synthesize a bio on person X from any web-accessible info)
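For the single-document case, the simplest useful baseline is frequency-based extractive summarization: score each sentence by how frequent its content words are in the whole document and keep the top few. This is a minimal sketch (the stopword list is illustrative), not the purpose-driven summarizer the previous slide calls for:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "for", "on", "it"}

def summarize(text, n_sentences=1):
    """Frequency-based extractive summary: rank sentences by the average
    document frequency of their content words, return the top sentences
    in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)
    def score(sentence):
        tokens = [w for w in re.findall(r"[a-z']+", sentence.lower())
                  if w not in STOPWORDS]
        return sum(freq[w] for w in tokens) / max(len(tokens), 1)
    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return " ".join(s for s in sentences if s in top)
```

Varying `n_sentences` gives exactly the headline/abstract/summary/document levels listed above.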


Document Summarization

Types of Summaries


“…right medium”: Finding Information in Non-textual Media


Indexing and Searching Non-textual (Analog) Content

  • Speech  text (speech recognition)

  • Text  speech

    • TTS: FESTVOX by far most popular high-quality system

  • Handwriting  text (handwriting recognition)

  • Printed text  electronic text (OCR)

  • Picture  caption key words (automatically) for indexing and searching

  • Diagram, tables, graphs, maps  caption key words (automatically)


Conclusion


What is Text Mining?

  • Search documents, web, news

  • Categorize by topic, taxonomy

    • Enables filtering, routing, multi-text summaries, …

  • Extract names, relations, …

  • Summarize text, rules, trends, …

  • Detect redundancy, novelty, anomalies, …

  • Predict outcomes, behaviors, trends, …

  • Who did what to whom and where?


Data Mining vs. Text Mining

    Data Mining:

    • Data: relational tables

    • DM universe: huge

    • DM tasks:

      • DB “cleanup”

      • Taxonomic classification

      • Supervised learning with predictive classifiers

      • Unsupervised learning: clustering, anomaly detection

      • Visualization of results

    Text Mining:

    • Text: HTML, free form

    • TM universe: ~10³× DM

    • TM tasks: all the DM tasks, plus:

      • Extraction of roles, relations, and facts

      • Machine translation for multi-lingual sources

      • Parsing of NL queries (vs. SQL)

      • NL generation of results

