Opportunities in natural language processing

PowerPoint Slideshow about 'Opportunities in Natural Language Processing' - leon

Opportunities in Natural Language Processing


  • Overview of the field

    • Why are language technologies needed?

    • What technologies are there?

  • What are interesting problems where NLP can and can’t deliver progress?

    • NL/DB interface

    • Web search

    • Product Info, e-mail

    • Text categorization, clustering, IE

    • Finance, small devices, chat rooms

    • Question answering

What is Natural Language Processing?

  • Natural Language Processing

    • Process information contained in natural language text.

    • Also known as Computational Linguistics (CL), Human Language Technology (HLT), Natural Language Engineering (NLE)

  • Can machines understand human language?

    • Define ‘understand’

    • Understanding is the ultimate goal. However, one doesn’t need to fully understand to be useful.

What is it..

  • Analyze, understand and generate human languages just like humans do.

  • Applying computational techniques to language domain..

  • To explain linguistic theories, to use the theories to build systems that can be of social use..

  • Started off as a branch of Artificial Intelligence..

  • Borrows from Linguistics, Psycholinguistics, Cognitive Science & Statistics.

  • Make computers learn our language rather than we learn theirs.

Why Study NLP?

  • A hallmark of human intelligence.

  • Text is the largest repository of human knowledge and is growing quickly.

    • emails, news articles, web pages, IM, scientific articles, insurance claims, customer complaint letters, transcripts of phone calls, technical documents, government documents, patent portfolios, court decisions, contracts, ……

    • Are we reading any faster than before?

Why are language technologies needed?

  • Many companies would make a lot of money if they could use computer programmes that understood text or speech. Just imagine if a computer could be used for:

    • answering the phone, and replying to a question

    • understanding the text on a Web page to decide who it might be of interest to

    • translating a daily newspaper from Japanese to English (an attempt is made to do this already)

    • understanding text in journals / books and building an expert system based on that understanding


  • Also called Natural Language Processing (Application part)

  • Show me Star Trek..?? (Talk to your TV set)

  • Will my computer talk to me like another human ??

  • Will the search engine get me exactly what I am looking for??

  • Can my PC read the whole newspaper and tell me the important news only..??

  • Can my palmtop translate what that Japanese lady is telling me.. ??

  • Ahhh.. Can my PC do my English homework ??

  • Do you know how our brain processes language ??

NLP Applications

  • Question answering

    • Who is the first Taiwanese president?

  • Text Categorization/Routing

    • e.g., customer e-mails.

  • Text Mining

    • Find everything that interacts with BRCA1.

  • Machine (Assisted) Translation

  • Language Teaching/Learning

    • Usage checking

  • Spelling correction

    • Is that just dictionary lookup?

Application areas

  • Text-to-Speech & Speech recognition

  • Natural Language Dialogue Interfaces to Databases

  • Information Retrieval

  • Information Extraction

  • Document Classification

  • Document Image Analysis

  • Automatic Summarization

  • Text Proofreading – Spelling & Grammar

  • Machine Translation

  • Story understanding systems

  • Plagiarism detection

  • Can you think of anything else?

Big Deal

  • L = Words + rules + exceptions..

  • Ambiguity at all levels..

  • We speak different languages..

  • And language is a cultural entity..

  • So they are not equivalent..

  • Highly systematic but also complex..

  • Keeps changing.. New words, New rules and New exceptions..

  • Source : Electronic texts / Printed texts / Acoustic Speech Signal.. they are noisy..

  • Language looks obvious to us.. But it is a Big Deal ☺!

Where does it fit in the CS taxonomy?

[Taxonomy diagram; surviving labels: Artificial Intelligence, Natural Language Processing]

Early days..

  • How to measure the intelligence of a machine?

  • Turing test – Alan Turing (1950)

    • A machine can be accepted as intelligent if it can fool a judge into believing it is human in a teletyped conversation.

  • ELIZA by Weizenbaum (1966)

    • Pretends to be a psychiatrist and converses with a user on his problems.

    • Uses Keyword pattern matching

    • Many users thought the machine really understood their problem.

    • Many such systems exist now, e.g., Alan, Alice, David.

  • Can such tests be taken as a measure of intelligence? The debate goes on.
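The keyword pattern matching attributed to ELIZA above can be sketched in a few lines. This is only an illustration of the idea: the rules, the reply templates, and the names `RULES`, `DEFAULT_REPLY`, and `respond` are invented here and are not Weizenbaum's original script.

```python
import re

# Hypothetical rules in the spirit of ELIZA; not the original script.
# Each rule pairs a keyword pattern with a reply template.
RULES = [
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r"\bmy (mother|father)\b", re.IGNORECASE), "Tell me more about your {0}."),
    (re.compile(r"\bI feel (.+)", re.IGNORECASE), "How long have you felt {0}?"),
]
DEFAULT_REPLY = "Please go on."


def respond(utterance: str) -> str:
    """Return the reply for the first rule whose keyword pattern matches."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            # Echo the matched fragment back inside the canned template.
            return template.format(*match.groups())
    return DEFAULT_REPLY


print(respond("I am unhappy."))
print(respond("The weather is nice."))
```

Nothing here models meaning: the program only reflects fragments of the input back, which is why users' impression of being understood was so striking.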

Early days..


  • SHRDLU

    • Can understand natural language commands.

    • Developed by Terry Winograd at the MIT AI Lab (1968–70) using Lisp.

    • Works on a “Blocks World”, a simulated environment in which blocks such as coloured cubes, cylinders, and pyramids can be moved around, placed on top of each other, etc.

    • Understands a bit of anaphora.

    • Memory to store history.

    • Successful demonstration of AI.

What’s the world’s most used database?

  • Oracle?

  • Excel?

  • Perhaps, Microsoft Word?

    • Data only counts as data when it’s in columns?

    • But there’s oodles of other data: reports, spec. sheets, customer feedback, plans, …

    • “The Unix philosophy”

“Databases” in 1992

  • Database systems (mostly relational) are the pervasive form of information technology providing efficient access to structured, tabular data primarily for governments and corporations: Oracle, Sybase, Informix, etc.

  • (Text) Information Retrieval systems are a small market dominated by a few large systems providing information to specialized markets (legal, news, medical, corporate info): Westlaw, Medline, Lexis/Nexis

  • Commercial NLP market basically nonexistent

    • mainly DARPA work

“Databases” in 2002

  • A lot of new things seem important:

    • Internet, Web search, Portals, Peer-to-Peer, Agents, Collaborative Filtering, XML/Metadata, Data mining

  • Is everything the same, different, or just a mess?

  • There is more of everything, it’s more distributed, and it’s less structured.

  • Large textbases and information retrieval are a crucial component of modern information systems, and have a big impact on everyday people (web search, portals, email)

Linguistic data is ubiquitous

  • Most of the information in most companies, organizations, etc. is material in human languages (reports, customer email, web pages, discussion papers, text, sound, video) – not stuff in traditional databases

    • Estimates: 70%, 90% ?? [all depends how you measure]. Most of it.

  • Most of that information is now available in digital form:

    • Estimate for companies in 1998: about 60% [CAP Ventures/Fuji Xerox]. More like 90% now?

The problem

  • When people see text, they understand its meaning (by and large)

  • When computers see text, they get only character strings (and perhaps HTML tags)

  • We'd like computer agents to see meanings and be able to intelligently process text

  • These desires have led to many proposals for structured, semantically marked up formats

  • But often human beings still resolutely make use of text in human languages

  • This problem isn’t likely to just go away.

Levels of Language Analysis

  • Phonology

  • Morphology

  • Syntax

  • Semantics

  • Pragmatics

  • Discourse


  • Speech processing

    • Humans process speech remarkably well.

    • Speech interfaces can replace keyboards and monitors.

    • Convert acoustic signals to text.

    • Phonemes are the smallest recognizable speech units in a language.

    • Graphemes are the textual representation.

    • Phonemes can be identified using their phonetic & spectral features.

Speech – So is it difficult ?

  • “It's very hard to wreck a nice beach” (a mishearing of “It's very hard to recognize speech”)

  • Pronunciation of different speakers

  • Pace of speech

  • Speech ambiguity – Homophones

    • I ate eight cakes

    • That band is banned

    • I went to the mall nearby to buy some food

    • The Finnish were the first ones to finish

    • I know no James Bond.

Morphology: What is a word?

  • Morphology is all about the words.

  • Make more words from less ☺.

  • Structures and patterns in words

  • Analyzes how words are formed from minimal units of meaning, or morphemes, e.g., dogs = dog + s.

  • Words are a sequence of Morphemes.

    • Morpheme – smallest meaningful unit in a word. Free & Bound.

  • Inflectional Morphology – Same Part of Speech

    • Buses = Bus + es

    • Carried = Carry + ed

  • Derivational Morphology – Changes Part of Speech

    • Destruct + ion = Destruction (Noun)

    • Beauty + ful = Beautiful (Adjective)

  • Affixes – Prefixes, Suffixes & Infixes

  • Rules govern the fusion.
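The inflectional examples above (Buses = Bus + es, Carried = Carry + ed) can be sketched as a tiny rule-based analyzer. The rule set, the `SIBILANT_ENDINGS` tuple, and the `analyze` function are assumptions made for illustration; a real morphological analyzer needs a lexicon and many more spelling rules, and this toy version mis-analyzes words such as "houses".

```python
# Word endings after which English adds "es" rather than "s" (illustrative list).
SIBILANT_ENDINGS = ("s", "x", "z", "ch", "sh")


def analyze(word: str) -> tuple[str, str]:
    """Split an inflected English word into (stem, suffix); suffix is '' if no rule applies."""
    w = word.lower()
    if w.endswith("ied") and len(w) > 4:
        # y -> i spelling change: carried = carry + ed
        return w[:-3] + "y", "ed"
    if w.endswith("es") and w[:-2].endswith(SIBILANT_ENDINGS):
        # e-insertion after a sibilant: buses = bus + es
        return w[:-2], "es"
    if w.endswith("ed") and len(w) > 4:
        return w[:-2], "ed"
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        # plain plural or third-person s: dogs = dog + s
        return w[:-1], "s"
    return w, ""


for example in ("Buses", "Carried", "dogs", "glass"):
    print(example, analyze(example))
```

Each branch is one "rule governing the fusion": the suffix is not simply glued on, and the analyzer has to undo the spelling change to recover the stem.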

Morphology Is not as Easy as It May Seem to be

  • Examples from Woods et al. 2000 (plausible-looking but wrong analyses)

    • delegate (de + leg + ate): take the legs from

    • caress (car + ess, the feminine suffix): female car

    • cashier (cashy + er): more wealthy

    • lacerate (lace + rate): speed of tatting

    • ratify (rat + ify): infest with rodents

    • infantry (infant + ry): childish behavior

A Turkish Example [Oflazer & Guzey 1994]

  • uygarlastiramayabileceklerimizdenmissinizcesine

  • uygar/civilized las/BECOME tir/CAUS ama/NEG yabil/POT ecek/FUT ler/3PL imiz/POSS-1PL den/ABL mis/NARR siniz/2PL cesine/AS-IF

  • an adverb meaning roughly “(behaving) as if you were one of those whom we might not be able to civilize.”

Why not just Use a Dictionary?

  • How many words are there in a language?

    • English: OED 400K entries

    • Turkish: 600 × 10^6 forms

    • Finnish: 10^7 forms

  • New words are being invented all the time

    • e-mail

    • IM


  • Words convey meaning. But when they are put together they convey more.

  • Syntax is the grammatical structure of the sentence. Just like the syntax in programming languages.

    • structures and patterns in phrases

    • how phrases are formed by smaller phrases and words

  • Identifying the structure is the first step towards understanding the meaning of the sentence.

  • Syntactic Analysis (Parsing) = Process of assigning a parse tree to a sentence.

  • Constituents, Grammatical relations, subcategorization and dependencies.

Is that all?

  • Grammar of a language is very complex.

  • No one can write down the set of all rules that govern sentence construction.

  • Naturally the solution is Machine Learning.

  • Where do they learn from? – Tree banks.

    E.g., Penn Treebank – manually annotated trees for sentences (over 2 million words) from a large Wall Street Journal corpus.


  • What do you mean..?

  • Words – Lexical Semantics

  • Sentences – Compositional Semantics

  • Converting the syntactic structures to semantic format – meaning representation.

  • Semantics: the meaning of a word or phrase within a sentence

    • How to represent meaning?

      • Semantic network? Logic? Policy?

    • How to construct meaning representation?

      • Is meaning compositional?


  • Pragmatics: structures and patterns in discourses

  • A sentence standing alone may not mean much. It may be ambiguous.

  • What information is contained in the contextual sentences that is not conveyed in the actual sentence?

  • Discourse / Context makes utterances more complicated.

    • Implicatures:

      • How many times do you go skating each week?

    • Speech acts:

      • Do you know the time?

    • Anaphora – Resolving the pronoun’s reference. Co-reference resolution

      • “I read the book by Dr. Kalam. It was great”

      • “We gave the monkeys the bananas because they were hungry”

      • “We gave the monkeys the bananas because they were over-ripe”

      • Jane races Mary on weekends. She often beats her.

    • Ellipsis – Incomplete sentences

      • “What’s your name?”

      • “Srini, and yours?”

      • The second sentence is not complete, but what it means can be inferred from the first one.

Why is Natural Language Understanding difficult?

  • The hidden structure of language is highly ambiguous

  • Structures for: Fed raises interest rates 0.5% in effort to control inflation (NYT headline 5/17/00)

Challenges in NLP: Ambiguity

  • Words or phrases can often be understood in multiple ways.

    • Teacher Strikes Idle Kids

    • Killer Sentenced to Die for Second Time in 10 Years

    • They denied the petition for his release that was signed by over 10,000 people.

    • child abuse expert/child computer expert

    • Who does Mary love? (three-way ambiguous)

Probabilistic/Statistical Resolution of Ambiguities

  • When there are ambiguities, choose the interpretation with the highest probability.

  • Example: how many times do people say

    • “Mary loves …”

    • “the Mary love”

  • Which interpretation has the highest probability?
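The counting idea above can be sketched as follows: estimate how often each competing reading's cue phrase occurs and pick the most frequent one. The four-sentence `CORPUS`, the `READINGS` labels, and the function names are invented for illustration; a real system would count over a large corpus and use a proper probabilistic model.

```python
# Toy corpus, invented for illustration only.
CORPUS = [
    "mary loves john",
    "everyone knows mary loves chocolate",
    "mary loves to read",
    "the mary love story was a hit",
]

# Each reading of "Mary love(s)" is tied to a cue phrase whose frequency we count.
READINGS = {"verb": "mary loves", "noun": "the mary love"}


def phrase_count(phrase: str, corpus: list[str]) -> int:
    """Count occurrences of a phrase as a whole-token sequence in the corpus."""
    target = phrase.split()
    n = len(target)
    total = 0
    for sentence in corpus:
        tokens = sentence.split()
        total += sum(tokens[i:i + n] == target for i in range(len(tokens) - n + 1))
    return total


def most_likely(readings: dict[str, str], corpus: list[str]) -> str:
    """Return the reading whose cue phrase has the highest relative frequency."""
    counts = {name: phrase_count(cue, corpus) for name, cue in readings.items()}
    total = sum(counts.values()) or 1  # avoid division by zero on an empty corpus
    probabilities = {name: count / total for name, count in counts.items()}
    return max(probabilities, key=probabilities.get)


print(most_likely(READINGS, CORPUS))
```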

Challenges in NLP: Variations

  • Syntactic Variations

    • I was surprised that Kim lost

    • It surprised me that Kim lost

    • That Kim lost surprised me.

  • The same meaning can be expressed in different ways

    • Who wrote “The Language Instinct”?

    • Steven Pinker, an MIT professor and author of “The Language Instinct”, ……

  • Analyze the structure of a sentence

[Figure: parse tree for “The student put the book on the table”]

[Figure: two analyses of the headline “Teacher strikes idle kids”]

Enabling Technologies

  • Stemming

    • Reduce detects, detected, detecting, detect, to the same form.

  • POS Tagging

    • Determine for each word whether it is a noun, adjective, verb, …..

  • Parsing

    • sentence → parse tree

  • Word Sense Disambiguation

    • orange juice vs. orange coat

  • Learning from text
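As a rough illustration of the stemming item above, a crude suffix stripper is enough to reduce detects, detected, detecting, and detect to the same form. The `SUFFIXES` list and `crude_stem` function are a deliberately minimal sketch, not a published stemming algorithm, and they over- and under-stem many words.

```python
# Suffixes are tried in this order; a deliberately tiny, illustrative list.
SUFFIXES = ("ing", "ed", "s")


def crude_stem(word: str) -> str:
    """Strip one common inflectional suffix, keeping at least three letters of stem."""
    lower = word.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) - len(suffix) >= 3:
            return lower[: -len(suffix)]
    return lower


forms = ["detects", "detected", "detecting", "detect"]
print({form: crude_stem(form) for form in forms})
```

The same function maps "model" and "models" to one form, which is the kind of conflation a search engine relies on when it stems queries and documents.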

Translating user needs

User need

User query

For RDB, a lot of people know how to do this correctly, using SQL or a GUI tool

The answers coming out here will then be precisely what the user wanted

Translating user needs

User need

User query

For meanings in text, no IR-style query gives one exactly what one wants; it only hints at it

The answers coming out may be roughly what was wanted, or can be refined


Translating user needs

User need

NLP query

For a deeper NLP analysis system, the system subtly translates the user’s language

If the answers coming back aren’t what was wanted, the user frequently has no idea how to fix the problem


Aim: Practical applied NLP goals

Use language technology to add value to data by:

  • interpretation

  • transformation

  • value filtering

  • augmentation (providing metadata)

    Two motivations:

  • The amount of information in textual form

  • Information integration needs NLP methods for coping with ambiguity and context


Multi-dimensional Meta-data Extraction

Knowledge Extraction Vision

Terms and technologies

  • Text processing

    • Stuff like TextPad (Emacs, BBEdit), Perl, grep. Semantics and structure blind, but does what you tell it in a nice enough way. Still useful.

  • Information Retrieval (IR)

    • Implies that the computer will try to find documents which are relevant to a user while understanding nothing (big collections)

  • Intelligent Information Access (IIA)

    • Use of clever techniques to help users satisfy an information need (search or UI innovations)

Terms and technologies

  • Locating small stuff. Useful nuggets of information that a user wants:

    • Information Extraction (IE): Database filling

      • The relevant bits of text will be found, and the computer will understand enough to satisfy the user’s communicative goals

    • Wrapper Generation (WG) [or Wrapper Induction]

      • Producing filters so agents can “reverse engineer” web pages intended for humans back to the underlying structured data

    • Question Answering (QA) – NL querying

    • Thesaurus/key phrase/terminology generation

Terms and technologies

  • Big Stuff. Overviews of data:

    • Summarization

      • Of one document or a collection of related documents (cross-document summarization)

    • Categorization (documents)

      • Including text filtering and routing

    • Clustering (collections)

  • Text segmentation: subparts of big texts

  • Topic detection and tracking

    • Combines IE, categorization, segmentation

Terms and technologies

  • Digital libraries

  • Text (Data) Mining (TDM)

    • Extracting nuggets from text. Opportunistic.

    • Unexpected connections that one can discover between bits of human recorded knowledge.

  • Natural Language Understanding (NLU)

    • Implies an attempt to completely understand the text …

  • Machine translation (MT), OCR, Speech recognition, etc.

    • Now available wherever software is sold!

Problems and approaches

find all web pages containing the word Liebermann

read the last 3 months of the NY Times and provide a summary of the campaign so far

  • Some places where I see less value

  • Some places where I see more value

Natural Language Interfaces to Databases

  • This was going to be the big application of NLP in the 1980s

    • > How many service calls did we receive from Europe last month?

    • I am listing the total service calls from Europe for November 2001.

    • The total for November 2001 was 1756.

  • It has been recently integrated into MS SQL Server (English Query)

  • Problems: need largely hand-built custom semantic support (improved wizards in new version!)

    • GUIs more tangible and effective?
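The "hand-built custom semantic support" problem can be made concrete with a minimal sketch that handles exactly one question shape, the service-calls example above. The `service_calls` table, its columns, the sample rows, and the functions `previous_month`, `answer`, and `demo_db` are all hypothetical; this is not how MS SQL Server English Query works internally.

```python
import re
import sqlite3
from datetime import date, timedelta

# One hand-written question template; every new phrasing needs another one.
PATTERN = re.compile(
    r"how many service calls did we receive from (\w+) last month\??",
    re.IGNORECASE,
)


def previous_month(today: date) -> str:
    """Return the month before `today` as 'YYYY-MM'."""
    last_day_of_previous = today.replace(day=1) - timedelta(days=1)
    return last_day_of_previous.strftime("%Y-%m")


def answer(question: str, conn: sqlite3.Connection, today: date):
    """Translate the one supported question into SQL; return None if unsupported."""
    match = PATTERN.fullmatch(question.strip())
    if match is None:
        return None
    region = match.group(1)
    row = conn.execute(
        "SELECT COUNT(*) FROM service_calls "
        "WHERE region = ? AND substr(received, 1, 7) = ?",
        (region, previous_month(today)),
    ).fetchone()
    return row[0]


def demo_db() -> sqlite3.Connection:
    """Build a small in-memory database with invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE service_calls (region TEXT, received TEXT)")
    conn.executemany(
        "INSERT INTO service_calls VALUES (?, ?)",
        [
            ("Europe", "2001-11-03"),
            ("Europe", "2001-11-20"),
            ("Asia", "2001-11-05"),
            ("Europe", "2001-10-30"),
        ],
    )
    return conn


print(answer("How many service calls did we receive from Europe last month?",
             demo_db(), date(2001, 12, 10)))
```

Any rewording of the question falls outside the template and returns None, which is the brittleness that makes such interfaces expensive to build and maintain.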

NLP for IR/web search?

  • It’s a no-brainer that NLP should be useful and used for web search (and IR in general):

    • Search for ‘Jaguar’

      • the computer should know or ask whether you’re interested in big cats [scarce on the web], cars, or, perhaps a molecule geometry and solvation energy package, or a package for fast network I/O in Java

    • Search for ‘Michael Jordan’

      • The basketballer or the machine learning guy?

    • Search for laptop, don’t find notebook

    • Google doesn’t even stem:

      • Search for probabilistic model, and you don’t even match pages with probabilistic models.

NLP for IR/web search?

  • Word sense disambiguation technology generally works well (like text categorization)

  • Synonyms can be found or listed

  • Lots of people have been into fixing this

    • e-Cyc had a beta version with Hotbot that disambiguated senses, and was going to go live in 2 months … 14 months ago

    • Lots of startups:

      • LingoMotors

      • iPhrase “Traditional keyword search technology is hopelessly outdated”

NLP for IR/web search?

  • But in practice it’s an idea that hasn’t gotten much traction

    • Correctly finding linguistic base forms is straightforward, but produces little advantage over crude stemming, which merely conflates slightly too many words into the same equivalence class

    • Word sense disambiguation only helps on average in IR if over 90% accurate (Sanderson 1994), and that’s about where we are

    • Syntactic phrases should help, but people have been able to get most of the mileage with “statistical phrases” – which have been aggressively integrated into systems recently

NLP for IR/web search?

  • People can easily scan among results (on their 21” monitor) … if you’re above the fold

  • Much more progress has been made in link analysis, and use of anchor text, etc.

  • Anchor text gives human-provided synonyms

  • Link or click stream analysis gives a form of pragmatics: what do people find correct or important (in a default context)

  • Focus on short, popular queries, news, etc.

  • Using human intelligence always beats artificial intelligence

NLP for IR/web search?

  • Methods which make use of rich ontologies, etc., can work very well for intranet search within a customer’s site (where anchor-text, link, and click patterns are much less relevant)

    • But don’t really scale to the whole web

  • Moral: it’s hard to beat keyword search for the task of general ad hoc document retrieval

  • Conclusion: one should move up the food chain to tasks where finer grained understanding of meaning is needed

Product info

  • C-net markets this information

  • How do they get most of it?

    • Phone calls

    • Typing.

Inconsistency: digital cameras

  • Image Capture Device: 1.68 million pixel 1/2-inch CCD sensor

  • Image Capture Device Total Pixels Approx. 3.34 million Effective Pixels Approx. 3.24 million

  • Image sensor Total Pixels: Approx. 2.11 million-pixel

  • Imaging sensor Total Pixels: Approx. 2.11 million 1,688 (H) x 1,248 (V)

  • CCD Total Pixels: Approx. 3,340,000 (2,140[H] x 1,560 [V] )

    • Effective Pixels: Approx. 3,240,000 (2,088 [H] x 1,550 [V] )

    • Recording Pixels: Approx. 3,145,000 (2,048 [H] x 1,536 [V] )

  • These all came off the same manufacturer’s website!!

  • And this is a very technical domain. Try sofa beds.

Product information/ Comparison shopping, etc.

  • Need to learn to extract info from online vendors

  • Can exploit uniformity of layout, and (partial) knowledge of domain by querying with known products

  • E.g., Jango Shopbot (Etzioni and Weld)

    • Gives convenient aggregation of online content

  • Bug: not popular with vendors

    • A partial solution is for these tools to be personal agents rather than web services

Email handling

  • Big point of pain for many people

  • There just aren’t enough hours in the day

    • even if you’re not a customer service rep

  • What kind of tools are there to provide an electronic secretary?

    • Negotiating routine correspondence

    • Scheduling meetings

    • Filtering junk

    • Summarizing content

  • “The web’s okay to use; it’s my email that is out of control”

Text Categorization is a task with many potential uses

  • Take a document and assign it a label representing its content (MeSH heading, ACM keyword, Yahoo category)

  • Classic example: decide if a newspaper article is about politics, business, or sports?

  • There are many other uses for the same technology:

    • Is this page a laser printer product page?

    • Does this company accept overseas orders?

    • What kind of job does this job posting describe?

    • What kind of position does this list of responsibilities describe?

    • What position does this list of skills best fit?

    • Is this the “computer” or “harbor” sense of port?

Text Categorization

  • Usually, simple machine learning algorithms are used.

  • Examples: Naïve Bayes models, decision trees.

  • Very robust, very re-usable, very fast.

  • Recently, slightly better performance from better algorithms

    • e.g., use of support vector machines, nearest neighbor methods, boosting

  • Accuracy is more dependent on:

    • Naturalness of classes.

    • Quality of features extracted and amount of training data available.

  • Accuracy typically ranges from 65% to 97% depending on the situation

    • Note particularly performance on rare classes
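A Naïve Bayes text categorizer of the kind listed above fits in a few dozen lines. The sketch below is a multinomial model with add-one smoothing on the classic politics/business/sports task; the `NaiveBayes` class and the six training sentences in `TRAIN` are invented for illustration, and real systems train on far more data with better feature extraction.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens, with add-one smoothing."""

    def __init__(self):
        self.class_docs = Counter()              # documents seen per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocab = set()

    def train(self, docs):
        for text, label in docs:
            self.class_docs[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def predict(self, text):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            # Log prior plus summed log likelihoods of the known words.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Tiny invented training set.
TRAIN = [
    ("the senate passed the budget vote", "politics"),
    ("the president signed the election bill", "politics"),
    ("the team won the final match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("shares fell as the market closed", "business"),
    ("the company reported record profits", "business"),
]

classifier = NaiveBayes()
classifier.train(TRAIN)
print(classifier.predict("the striker scored in the match"))
```

The model is robust and fast because training is just counting; as the slide notes, accuracy then depends mostly on how natural the classes are and on the quality and quantity of the features and training data.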

Email response: “eCRM”

  • Automated systems which attempt to categorize incoming email, and to respond automatically to users who ask standard or frequently seen questions

  • Most but not all are more sophisticated than just keyword matching

  • Generally use text classification techniques

    • E.g., Echomail, Kana Classify, Banter

    • More linguistic analysis: YY software

  • Can save real money by doing 50% of the task close to 100% right

Recall vs. Precision

  • High recall:

    • You get all the right answers, but garbage too.

    • Good when incorrect results are not problematic.

    • More common from automatic systems.

  • High precision:

    • When all returned answers must be correct.

    • Good when missing results are not problematic.

    • More common from hand-built systems.

  • In general in these things, one can trade one for the other

    • But it’s harder to score well on both
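The trade-off above follows directly from the definitions: precision is the fraction of returned items that are correct, and recall is the fraction of correct items that were returned. The document identifiers and the `precision_recall` helper below are illustrative assumptions.

```python
def precision_recall(returned: set, relevant: set) -> tuple:
    """Return (precision, recall) for a set of returned items against the relevant set."""
    true_positives = len(returned & relevant)
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: four relevant documents in the collection.
RELEVANT = {"d1", "d2", "d3", "d4"}
BROAD = RELEVANT | {"d5", "d6", "d7", "d8"}   # everything right, plus garbage
NARROW = {"d1", "d2"}                         # only sure answers, but misses some

print("broad system: ", precision_recall(BROAD, RELEVANT))
print("narrow system:", precision_recall(NARROW, RELEVANT))
```

The broad system has perfect recall and mediocre precision, the narrow one the reverse, which matches the high-recall and high-precision profiles described in the bullets.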

Financial markets

  • Quantitative data are (relatively) easily and rapidly processed by computer systems, and consequently many numerical tools are available to stock market analysts

    • However, a lot of these are in the form of (widely derided) technical analysis

    • It’s meant to be information that moves markets

  • Financial market players are overloaded with qualitative information – mainly news articles – with few tools to help them (beyond people)

    • Need tools to identify, summarize, and partition information, and to generate meaningful links

Text Clustering in Browsing, Search and Organization

  • Scatter/Gather Clustering

    • Cutting, Pedersen, Karger, Tukey ’92, ’93

  • Cluster sets of documents into general “themes”, like a table of contents

  • Display the contents of the clusters by showing topical terms and typical titles

  • User chooses subsets of the clusters and re-clusters the documents within them

  • Resulting new groups have different “themes”


  • June 11, 2001: The latest KDnuggets Poll asked: What types of analysis did you do in the past 12 months?

    • The results, multiple choices allowed, indicate that a wide variety of tasks is performed by data miners. Clustering was by far the most frequent (22%), followed by Direct Marketing (14%), and Cross-Sell Models (12%)

  • Clustering of results can work well in certain domains (e.g., biomedical literature)

  • But it doesn’t appear compelling for the average user (Altavista, Northern Light)

CiteSeer/ResearchIndex

  • An online repository of papers, with citations, etc. Specialized search with semantics in it

  • Great product; research people love it

  • However it’s fairly low tech. NLP could improve on it:

    • Better parsing of bibliographic entries

    • Better linking from author names to web pages

    • Better resolution of cases of name identity

      • E.g., by also using the paper content

      • Cf. Cora, which did some of these tasks better

Chat rooms/groups/discussion forums/usenet

  • Many of these are public on the web

  • The signal to noise ratio is very low

  • But there’s still lots of good information there

  • Some of it has commercial value

    • What problems have users had with your product?

    • Why did people end up buying product X rather than your product Y

  • Some of it is time sensitive

    • Rumors in chat rooms can affect stock prices

      • Regardless of whether they are factual or not

Small devices

  • With a big monitor, humans can scan for the right information

  • On a small screen, there’s hugely more value from a system that can show you what you want:

    • phone number

    • business hours

    • email summary

      • “Call me at 11 to finalize this”

Machine translation

  • High quality MT is still a distant goal

  • But MT is effective for scanning content

  • And for machine-assisted human translation

  • Dictionary use accounts for about half of a traditional translator's time.

  • Printed lexical resources are not up-to-date

  • Electronic lexical resources ease access to terminological data.

  • “Translation memory” systems: remember previously translated documents, allowing automatic recycling of translations
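The translation-memory idea can be sketched as a fuzzy lookup: when a new sentence closely resembles one that was translated before, the stored translation is offered for reuse. The two-entry `MEMORY`, the 0.8 similarity cutoff, and the `lookup` function are assumptions for illustration; commercial systems use their own matching and segmentation schemes.

```python
import difflib

# Invented English-to-French memory of previously translated sentences.
MEMORY = {
    "Press the power button.": "Appuyez sur le bouton d'alimentation.",
    "Close the cover.": "Fermez le couvercle.",
}


def lookup(sentence: str, memory: dict, cutoff: float = 0.8):
    """Return (matched source sentence, stored translation), or None if nothing is close enough."""
    matches = difflib.get_close_matches(sentence, list(memory), n=1, cutoff=cutoff)
    if not matches:
        return None
    source = matches[0]
    return source, memory[source]


print(lookup("Press the power button now.", MEMORY))
print(lookup("Insert the battery.", MEMORY))
```

A near match is returned for the translator to review and adapt rather than accepted blindly, which is what makes this machine-assisted rather than automatic translation.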

Online technical publishing

  • Natural Language Processing for Online Applications: Text Retrieval, Extraction & Categorization, Peter Jackson & Isabelle Moulinier (Benjamins, 2002)

  • “The Web really changed everything, because there was suddenly a pressing need to process large amounts of text, and there was also a ready-made vehicle for delivering it to the world. Technologies such as information retrieval (IR), information extraction, and text categorization no longer seemed quite so arcane to upper management. The applications were, in some cases, obvious to anyone with half a brain; all one needed to do was demonstrate that they could be built and made to work, which we proceeded to do.”

Task: Information Extraction


  • A lot of information that could be represented in a structured semantically clear format isn’t

  • It may be costly, not desired, or not in one’s control (screen scraping) to change this.

  • Goal: being able to answer semantic queries using “unstructured” natural language sources

Information Extraction

  • Information extraction systems

    • Find and understand relevant parts of texts.

    • Produce a structured representation of the relevant information: relations (in the DB sense)

    • Combine knowledge about language and the application domain

    • Automatically extract the desired information

  • When is IE appropriate?

    • Clear, factual information (who did what to whom and when?)

    • Only a small portion of the text is relevant.

    • Some errors can be tolerated

Task: Wrapper Induction

  • Wrapper Induction

    • Sometimes, the relations are structural.

      • Web pages generated by a database.

      • Tables, lists, etc.

    • Wrapper induction usually targets regular relations that can be expressed by the structure of the document:

      • the item in bold in the 3rd column of the table is the price

  • Handcoding a wrapper in Perl isn’t very viable

    • sites are numerous, and their surface structure mutates rapidly

  • Wrapper induction techniques can also learn:

    • If there is a page about a research project X and there is a link near the word ‘people’ to a page that is about a person Y then Y is a member of the project X.

      • [e.g., Tom Mitchell’s Web->KB project]
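
The structural rule quoted above ("the item in bold in the 3rd column of the table is the price") can be written out by hand as a tiny wrapper. This is a minimal sketch using Python's standard `html.parser`. The class name `PriceWrapper`, the helper `extract_prices`, and the sample page are hypothetical. A wrapper induction system would learn the column index and the tag from labeled pages instead of hard-coding them as done here.

```python
from html.parser import HTMLParser


class PriceWrapper(HTMLParser):
    """Hand-coded wrapper: bold text in the 3rd cell of a table row is the price."""

    def __init__(self, price_column=3):
        super().__init__()
        self.price_column = price_column  # assumed column holding the price
        self.column = 0                   # 1-based index of the current cell
        self.in_bold = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.column = 0               # new row: restart cell counting
        elif tag == "td":
            self.column += 1
        elif tag == "b":
            self.in_bold = True

    def handle_endtag(self, tag):
        if tag == "b":
            self.in_bold = False

    def handle_data(self, data):
        text = data.strip()
        if text and self.in_bold and self.column == self.price_column:
            self.prices.append(text)


# Hypothetical database-generated page, for illustration only.
SAMPLE_PAGE = """
<table>
<tr><td>Widget</td><td>Blue</td><td><b>$9.99</b></td></tr>
<tr><td>Gadget</td><td><b>New</b></td><td><b>$24.50</b></td></tr>
</table>
"""


def extract_prices(html):
    """Run the wrapper over one page and return the extracted price strings."""
    wrapper = PriceWrapper()
    wrapper.feed(html)
    return wrapper.prices


print(extract_prices(SAMPLE_PAGE))
```

The bold "New" in the second row is ignored because it sits in column 2. A small layout change (an extra column, `<strong>` instead of `<b>`) breaks the rule, which is the maintenance problem that motivates learning wrappers.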

Examples of Existing IE Systems

  • Systems to summarize medical patient records by extracting diagnoses, symptoms, physical findings, test results, and therapeutic treatments.

  • Gathering earnings, profits, board members, etc. from company reports

  • Verification of construction industry specifications documents (are the quantities correct/reasonable?)

  • Real estate advertisements

  • Building job databases from textual job vacancy postings

  • Extraction of company take-over events

  • Extracting gene locations from biomed texts

Three generations of IE systems

  • Hand-Built Systems – Knowledge Engineering [1980s– ]

    • Rules written by hand

    • Require experts who understand both the systems and the domain

    • Iterative guess-test-tweak-repeat cycle

  • Automatic, Trainable Rule-Extraction Systems [1990s– ]

    • Rules discovered automatically from predefined templates, using methods like ILP (inductive logic programming)

    • Require huge, labeled corpora (effort is just moved!)

  • Statistical Generative Models [1997 – ]

    • One decodes the statistical model to find which bits of the text were relevant, using HMMs or statistical parsers

    • Learning usually supervised; may be partially unsupervised

Name Extraction via HMMs

The delegation, which included the commander of the U.N. troops in Bosnia, Lt. Gen. Sir Michael Rose, went to the Serb stronghold of Pale, near Sarajevo, for talks with Bosnian Serb leader Radovan Karadzic.

  • Prior to 1997 - no learning approach competitive with hand-built rule systems

  • Since 1997 - Statistical approaches (BBN, NYU, MITRE, CMU/JustSystems) achieve state-of-the-art performance
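
To make "decoding an HMM to find the names" concrete, here is a toy sketch of Viterbi decoding over two hidden states. Everything in it is an assumption for illustration: the states `OTHER` and `NAME`, the single capitalization feature, and the hand-set probabilities. It is not the model used by any of the systems named above, which estimate far richer parameters from annotated text.

```python
import math

STATES = ("OTHER", "NAME")

# Hypothetical, hand-set probabilities (a real system learns these from data).
START = {"OTHER": 0.8, "NAME": 0.2}
TRANS = {
    "OTHER": {"OTHER": 0.8, "NAME": 0.2},
    "NAME": {"OTHER": 0.4, "NAME": 0.6},
}
EMIT = {
    "OTHER": {"cap": 0.1, "lower": 0.9},
    "NAME": {"cap": 0.9, "lower": 0.1},
}


def feature(word):
    """Reduce a word to the single observation used here: capitalized or not."""
    return "cap" if word[:1].isupper() else "lower"


def viterbi(words):
    """Return the most probable state sequence for the words (log-space Viterbi)."""
    if not words:
        return []
    obs = [feature(w) for w in words]
    # Each column maps state -> (best log-probability, previous state).
    columns = [{
        s: (math.log(START[s]) + math.log(EMIT[s][obs[0]]), None)
        for s in STATES
    }]
    for o in obs[1:]:
        last = columns[-1]
        column = {}
        for s in STATES:
            prev = max(STATES, key=lambda p: last[p][0] + math.log(TRANS[p][s]))
            score = last[prev][0] + math.log(TRANS[prev][s]) + math.log(EMIT[s][o])
            column[s] = (score, prev)
        columns.append(column)
    # Follow the back-pointers from the best final state.
    state = max(STATES, key=lambda s: columns[-1][s][0])
    path = [state]
    for column in reversed(columns[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path


words = "talks with Bosnian Serb leader Radovan Karadzic".split()
for word, tag in zip(words, viterbi(words)):
    print(f"{word}\t{tag}")
```

With only a capitalization feature, this toy marks every capitalized run as `NAME`, so "Bosnian Serb" is tagged the same way as "Radovan Karadzic". Telling people from places and nationalities requires more states and word-level emission probabilities.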

Classified Advertisements (Real Estate)


<DATE>March 02, 1998</DATE>



OPEN 1.00 - 1.45<BR>

U 11 / 10 BERTRAM ST<BR>


3 brm freestanding<BR>

villa, close to shops & bus<BR>

Owner moved to Melbourne<BR>

ideally suit 1st home buyer,<BR>

investor & 55 and over.<BR>

Brian Hazelden 0418 958 996<BR>




  • Advertisements are plain text

  • Lowest common denominator: only thing that 70+ newspapers with 20+ publishing systems can all handle

Why doesn’t text search (IR) work?

What you search for in real estate advertisements:

  • Suburbs. You might think easy, but:

    • Real estate agents: Coldwell Banker, Mosman

    • Phrases: Only 45 minutes from Parramatta

    • Multiple property ads have different suburbs

  • Money: want a range not a textual match

    • Multiple amounts: was $155K, now $145K

    • Variations: offers in the high 700s [but not rents for $270]

  • Bedrooms: similar issues (br, bdr, beds, B/R)
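
A rough sketch of why extraction helps here: once bedroom counts and prices are pulled out as numbers, a range query becomes possible. The patterns and function names below are assumptions made for illustration, covering only the abbreviations listed above plus "brm" from the sample ad. Taking the last amount in an ad as the current price is a heuristic guess for "was X, now Y" phrasing. Expressions like "offers in the high 700s", and telling sale prices from rents, are deliberately not handled.

```python
import re

# Bedroom abbreviations seen above: br, bdr, beds, B/R, plus "brm" and "bedroom".
BEDROOM_RE = re.compile(
    r"(\d+)\s*(?:bedrooms?|beds?|brm|bdr|br|b/r)\b", re.IGNORECASE
)

# Dollar amounts such as $145K, $270 or $155,000.
PRICE_RE = re.compile(r"\$\s*(\d+(?:,\d{3})*(?:\.\d+)?)\s*(k)?\b", re.IGNORECASE)


def parse_bedrooms(text):
    """Return the first bedroom count found in the ad, or None."""
    match = BEDROOM_RE.search(text)
    return int(match.group(1)) if match else None


def parse_prices(text):
    """Return every dollar amount in the ad as whole dollars, in text order."""
    amounts = []
    for number, kilo in PRICE_RE.findall(text):
        value = float(number.replace(",", ""))
        if kilo:
            value *= 1000
        amounts.append(int(value))
    return amounts


def in_price_range(text, low, high):
    """Heuristic: treat the last amount mentioned as the current asking price."""
    amounts = parse_prices(text)
    return bool(amounts) and low <= amounts[-1] <= high


print(parse_bedrooms("3 brm freestanding villa, close to shops & bus"))
print(parse_prices("was $155K, now $145K"))
print(in_price_range("was $155K, now $145K", 140_000, 150_000))
```

A plain text search for "150" would match neither amount in "was $155K, now $145K", while the numeric check finds the ad for a 140K to 150K range.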

Machine learning

  • To keep up with and exploit the web, you need to be able to learn

    • Discovery: How do you find new information sources S?

    • Extraction: How can you access and parse the information in S?

    • Semantics: How does one understand and link up the information contained in S?

    • Pragmatics: What is the accuracy, reliability, and scope of information in S?

  • Hand-coding just doesn’t scale

Question answering from text

  • TREC 8/9 QA competition: an idea originating from the IR community

  • With massive collections of on-line documents, manual translation of knowledge is impractical: we want answers from textbases [cf. bioinformatics]

  • Evaluated output is 5 ranked answers, each a 50- or 250-byte snippet of text drawn from a 3 GB text collection and required to contain at least one concept of the semantic category of the expected answer type. (This is IR-style thinking, and it suggests the use of named entity recognizers.)

  • Scoring: each question earns the reciprocal of the rank of the highest-ranked correct answer.
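
The reciprocal scoring can be sketched in a few lines. The function names and the example judgments below are made up for illustration. Each question's five ranked snippets are represented only by whether a judge marked them correct.

```python
def reciprocal_rank(judgments):
    """Return 1/rank of the first correct answer, or 0.0 if none is correct.

    judgments: booleans in rank order, True where the snippet was judged correct.
    """
    for rank, correct in enumerate(judgments, start=1):
        if correct:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(judged_runs):
    """Average the per-question reciprocal ranks over all questions."""
    if not judged_runs:
        return 0.0
    return sum(reciprocal_rank(run) for run in judged_runs) / len(judged_runs)


# Hypothetical judgments for three questions, five snippets each.
EXAMPLE_RUNS = [
    [True, False, False, False, False],   # first correct at rank 1
    [False, True, False, False, False],   # first correct at rank 2
    [False, False, False, False, False],  # no correct snippet
]

print(mean_reciprocal_rank(EXAMPLE_RUNS))
```

A system therefore gains much more from moving a correct answer from rank 2 to rank 1 than from rank 5 to rank 4.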

Pasca and Harabagiu (200) show value of sophisticated NLP

  • Good IR is needed: paragraph retrieval based on SMART

  • Large taxonomy of question types and expected answer types is crucial

  • Statistical parser (modeled on Collins 1997) used to parse questions and relevant text for answers, and to build knowledge base

  • Controlled query expansion loops (morphological, lexical synonyms, and semantic relations) are all important

  • Answer ranking by simple ML method

Question Answering Example

  • How hot does the inside of an active volcano get?

  • get(TEMPERATURE, inside(volcano(active)))

  • “lava fragments belched out of the mountain were as hot as 300 degrees Fahrenheit”

  • fragments(lava, TEMPERATURE(degrees(300)), belched(out, mountain))

    • volcano ISA mountain

    • lava ISPARTOF volcano ⇒ lava inside volcano

    • fragments of lava HAVEPROPERTIESOF lava

  • The needed semantic information is in WordNet definitions, and was successfully translated into a form that can be used for rough ‘proofs’
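
One loose way to picture such a rough "proof" is as a search for a chain of relations linking a concept in the answer text to a concept in the question. The sketch below is only an illustration under that assumption, not the method of the system described. It stores the three relations from the example as triples and runs a breadth-first search over them. The function name `find_chain` is hypothetical.

```python
from collections import deque

# The three relations from the example, as (subject, relation, object) triples.
AXIOMS = [
    ("volcano", "ISA", "mountain"),
    ("lava", "ISPARTOF", "volcano"),
    ("fragments of lava", "HAVEPROPERTIESOF", "lava"),
]


def find_chain(start, goal, axioms):
    """Breadth-first search for a chain of triples leading from start to goal.

    Relations are followed from subject to object only, which is a
    simplification. Returns the list of triples used, or None.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, chain = queue.popleft()
        if node == goal:
            return chain
        for subj, rel, obj in axioms:
            if subj == node and obj not in seen:
                seen.add(obj)
                queue.append((obj, chain + [(subj, rel, obj)]))
    return None


# Link the text's "fragments of lava" to the question's "volcano".
for step in find_chain("fragments of lava", "volcano", AXIOMS):
    print(" ".join(step))
```

The chain found corresponds to the informal argument above: the fragments share the properties of lava, and lava is part of the volcano, so the reported temperature bears on the question.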


  • Complete human-level natural language understanding is still a distant goal

  • But there are now practical and usable partial NLU systems applicable to many problems

  • An important design decision is in finding an appropriate match between (parts of) the application domain and the available methods

  • But, used with care, statistical NLP methods have opened up new possibilities for high performance text understanding systems.

The End

Thank you!