How Do Computers Understand Texts?
Tobias Blanke

My contact details

  • Name: Tobias Blanke

  • Telephone: 020 7848 1975

  • Email: tobias.blanke@kcl.ac.uk

  • Address: 51 Oakfield Road (!); N4 4LD


Outline

  • How do computers understand texts so that you don’t have to read them?

  • The same steps apply every time:

    • We stay with searching for a long time.

  • How to use text analysis for Linked Data

    • You will build your own Twitter miner


Why? – A Simple question …

  • Suppose you have a million documents and a question – what do you do?

  • Solution: the user reads all the documents in the store, retains the relevant ones and discards all the others – perfect retrieval… NOT POSSIBLE!

  • Alternative: use a high-speed computer to read the entire document collection and extract the relevant documents.


Data Geeks are in demand

New research by the McKinsey Global Institute (MGI) forecasts a 50 to 60 percent gap between the supply and demand of people with deep analytical talent.

http://jonathanstray.com/investigating-thousands-or-millions-of-documents-by-visualizing-clusters



The problem of traditional text analysis is retrieval

  • Goal = find documents relevant to an information need from a large document set

(Diagram: an information need is expressed as a query; a "magic system" performs retrieval against the document collection and returns an answer list.)


Example

(Diagram: Google as the "magic system", the Web as the document collection.)


Search problem

  • First applications: in libraries (1950s)

    ISBN: 0-201-12227-8

    Author: Salton, Gerard

    Title: Automatic text processing: the transformation, analysis, and retrieval of information by computer

    Publisher: Addison-Wesley

    Date: 1989

    Content: <Text>

  • External attributes and internal attributes (content)

  • Search by external attributes = Search in databases

  • IR: search by content


Text Mining

  • Text mining describes the application of data mining techniques to the automated discovery of useful or interesting knowledge from unstructured text.

    • Task: Discuss with your neighbour what a system needs in order to

      • Determine who is a terrorist

      • Determine the sentiment of a text


The Big Picture

IR is easy ….

Let’s stay with search for a while


Search still is the biggest application

  • Security applications: Search for the villain

  • Biomedical applications: Semantic Search

  • Online media applications: Disambiguate Information

  • Sentiment analysis: Find ‘nice’ movies

  • Human consumption is still key


Why is the human so important?

  • Because we talk about information, and understanding remains a human domain

  • “There will be information on the Web that has a clearly defined meaning and can be analysed and traced by computer programs: there will be information, such as poetry and art, that requires the whole human intellect for an understanding that will always be subjective.” (Tim Berners-Lee, Spinning the Semantic Web)

  • “There is virtually no “semantics” in the semantic web. (…) Semantic content, in the Semantic Web, is generated by humans, ontologised by humans, and ultimately consumed by humans. Indeed, it is not unusual to hear complaints about how difficult it is to find and retain good ‘ontologists’.” (https://uhra.herts.ac.uk/dspace/bitstream/2299/3629/1/903250.pdf)


The Central Problem: The Human

(Diagram: the information seeker's concepts become query terms; the authors' concepts become document terms. Do these represent the same concepts?)


The Black Box

(Diagram: documents and a query enter a black box; results come out.)

Slide is from Jimmy Lin’s tutorial


Inside The IR Black Box

(Diagram: documents and the query each pass through a representation step; the document representation is stored in an index, and a comparison function matches the query representation against the index to produce results.)

Slide is from Jimmy Lin’s tutorial


Possible approaches

1. String matching (linear search in documents)

   - Syntactical

   - Difficult to improve

2. Indexing

   - Semantic

   - Flexible to further improvement


Indexing-Based IR: Similarity Text Analysis

(Diagram: the document and the query/document are each indexed – the query is additionally analysed – into keyword representations; query evaluation then asks: "How is this document similar to the query/another document?")

Slide is from Jimmy Lin’s tutorial


Main problems

  • Document indexing

    • How to best represent their contents?

  • Matching

    • To what extent does an identified information source correspond to a query/document?

  • System evaluation

    • How good is a system?

    • Are the retrieved documents relevant? (precision)

    • Are all the relevant documents retrieved? (recall)



Document indexing

  • Goal = Find the important meanings and create an internal representation

  • Factors to consider:

    • Accuracy to represent meanings (semantics)

    • Exhaustiveness (cover all the contents)

(Diagram: representations range from string to word to phrase to concept, trading coverage against accuracy.)

Slide is from Jimmy Lin’s tutorial


Text Representation Issues

  • In general, it is hard to capture such features from a text document

    • One, it is difficult to extract them automatically

    • Two, even if we did, it would not scale!

  • One simplification is to represent documents as a bag of words

    • Each document is represented as a bag of the words it contains, and each component of the bag represents some measurement of the relative importance of a single word.
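The bag-of-words representation can be sketched in a few lines of Python (an illustrative sketch, not the course's search.pl; the function name is my own, and Counter is Python's built-in multiset):

```python
from collections import Counter

def bag_of_words(text):
    # Represent a document as a bag (multiset) of lowercased words.
    # Word order is thrown away; only counts survive.
    return Counter(text.lower().split())

bag = bag_of_words("House Garden House door")
# 'house' occurs twice, 'garden' and 'door' once each
```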


Some immediate problems

  • How do we compare these bags of words to find out whether they are ‘similar’?

  • Let’s say we have three bags:

  • “House, Garden, House door”

  • “Household, Garden, Flat”

  • “House, House, House, Gardening”

  • How do we normalise these bags?

    • Why is normalisation needed?

    • What would we want to normalise?
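The three example bags can be built and length-normalised as follows (a minimal sketch; dividing by the Euclidean length is one common normalisation, and note that 'house', 'household' and 'gardening' remain distinct terms – which is exactly why stemming matters):

```python
import math
from collections import Counter

def normalise(bag):
    # Scale counts to unit Euclidean length so that long and
    # short documents become comparable.
    length = math.sqrt(sum(c * c for c in bag.values()))
    return {w: c / length for w, c in bag.items()}

bags = [Counter(["house", "garden", "house", "door"]),
        Counter(["household", "garden", "flat"]),
        Counter(["house", "house", "house", "gardening"])]
unit = [normalise(b) for b in bags]
# every normalised bag now has length 1
```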


Keyword selection and weighting

  • How to select important keywords?


Luhn’s Ideas

  • Frequency of word occurrence in a document is a useful measurement of word significance


Zipf and Luhn


Top 50 Terms

WSJ87 collection, a 131.6 MB collection of 46,449 newspaper articles (19 million term occurrences)

TIME collection, a 1.6 MB collection of 423 short TIME magazine articles (245,412 term occurrences)


Scholarship and the Long Tail

  • Scholarship follows a long-tailed distribution: the interest in relatively unknown items declines much more slowly than it would if popularity were described by a normal distribution

  • We have few statistical tools for dealing with long-tailed distributions

  • Other problems include ‘contested terms’

    Graham White, "On Scholarship" (in Bartscherer ed., Switching Codes)


Stopwords / Stoplist

  • Some words do not bear useful information. Common examples:

    of, in, about, with, I, although, …

  • Stoplist: contains stopwords, which are not used as index terms

    • Prepositions

    • Articles

    • Pronouns

  • http://www.textfixer.com/resources/common-english-words.txt
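Filtering against a stoplist is a one-liner. A toy sketch (the stoplist here is a tiny illustrative subset; a real system would load a full list such as the one linked above):

```python
# A toy stoplist -- real systems use much longer lists.
STOPWORDS = {"of", "in", "about", "with", "i", "although",
             "a", "an", "the", "is", "this"}

def remove_stopwords(tokens):
    # Keep only tokens that are not in the stoplist;
    # the survivors become candidate index terms.
    return [t for t in tokens if t.lower() not in STOPWORDS]

remove_stopwords("this is a document about text analysis".split())
# ['document', 'text', 'analysis']
```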


Stemming

  • Reason:

    • Different word forms may bear similar meaning (e.g. search, searching): create a “standard” representation for them

  • Stemming:

    • Removing some endings of words

      computer, compute, computes, computing, computed, computation → comput

Is it always good to stem?

Give examples!

Slide is from Jimmy Lin’s tutorial


Porter algorithm (Porter, M.F., 1980, An algorithm for suffix stripping, Program, 14(3): 130-137)

http://qaa.ath.cx/porter_js_demo.html

  • Step 1: plurals and past participles

    • SSES -> SS caresses -> caress

    • (*v*) ING -> (null) motoring -> motor

  • Step 2: adj->n, n->v, n->adj, …

    • (m>0) OUSNESS -> OUS callousness -> callous

    • (m>0) ATIONAL -> ATE relational -> relate

  • Step 3:

    • (m>0) ICATE -> IC triplicate -> triplic

  • Step 4:

    • (m>1) AL -> (null) revival -> reviv

    • (m>1) ANCE -> allowance -> allow

  • Step 5:

    • (m>1) E -> (null) probate -> probat

    • (m > 1 and *d and *L) -> single letter controll -> control

Slide is from Jimmy Lin’s tutorial
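The flavour of suffix stripping can be sketched in a few lines. This is NOT the Porter algorithm (no measure conditions, no recoding steps, only a handful of suffixes), just an illustration of why all six forms of "compute" above collapse to the stem "comput":

```python
def crude_stem(word):
    # A drastically simplified suffix stripper.  Tries longer
    # suffixes first and keeps at least a 3-letter stem.
    for suffix in ("ational", "ation", "ing", "ed", "es", "er", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

[crude_stem(w) for w in ("computer", "compute", "computes",
                         "computing", "computed", "computation")]
# every form maps to 'comput'
```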


Lemmatization

  • Transform to a standard form according to syntactic category (“produce” vs. the stem “produc-”)

    E.g. verb + ing → verb

    noun + s → noun

    • Needs POS tagging

    • More accurate than stemming, but needs more resources

Slide partly taken from Jimmy Lin’s tutorial


Index Documents (Bag of Words Approach)

(Diagram: the document "This is a document in text analysis" is indexed to the terms: analysis, document, text, is, this.)


Result of indexing

  • Each document is represented by a set of weighted keywords (terms):

    D1 {(t1, w1), (t2,w2), …}

    e.g. D1  {(comput, 0.2), (architect, 0.3), …}

    D2  {(comput, 0.1), (network, 0.5), …}

  • Inverted file:

    comput  {(D1,0.2), (D2,0.1), …}

    Inverted file is used during retrieval for higher efficiency.

Slide partly taken from Jimmy Lin’s tutorial
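Building a (weight-free) inverted file is straightforward. A sketch mapping each term to the set of documents containing it (function and variable names are my own; a real index would also store weights and positions):

```python
from collections import defaultdict

def build_inverted_index(docs):
    # Map each term to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {"D1": "this is a sample document with one sample sentence",
        "D2": "this is another sample document"}
index = build_inverted_index(docs)
# index["sample"] -> {"D1", "D2"}; index["sentence"] -> {"D1"}
```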


Inverted Index Example

(Diagram: Doc 1 = "This is a sample document with one sample sentence"; Doc 2 = "This is another sample document". The dictionary lists each distinct term; its postings record which documents the term occurs in.)

Slide is from ChengXiang Zhai



Similarity Models

  • Boolean model

  • Vector-space model

  • Many more


Boolean model

  • Document = Logical conjunction of keywords

  • Query = Boolean expression of keywords

    e.g. D = t1  t2  …  tn

    Q = (t1 t2)  (t3 t4)

    Problems:

  • many documents or few documents

  • End-users cannot manipulate Boolean operators correctly

    E.g. documents about poverty andcrime
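Given an inverted index (term → set of document ids), Boolean retrieval reduces to set operations. A sketch with an illustrative toy index:

```python
def boolean_and(index, terms):
    # Documents containing ALL query terms (conjunction).
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def boolean_or(index, terms):
    # Documents containing ANY query term (disjunction).
    result = set()
    for t in terms:
        result |= index.get(t, set())
    return result

index = {"poverty": {"D1", "D2"}, "crime": {"D2", "D3"}}
boolean_and(index, ["poverty", "crime"])  # {'D2'}
```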


Vector space model

  • Vector space = all the keywords encountered

    <t1, t2, t3, …, tn>

  • Document

    D = < a1, a2, a3, …, an>

    ai = weight of ti in D

  • Query

    Q = < b1, b2, b3, …, bn>

    bi = weight of ti in Q

  • R(D,Q) = Sim(D,Q)


Cosine Similarity

(Diagram: two document vectors dj and dk separated by angle θ.)

Similarity is calculated using the COSINE similarity between the two vectors.
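Cosine similarity is the dot product of two vectors divided by the product of their lengths. A sketch over sparse term-weight vectors represented as dicts (the function name is my own):

```python
import math

def cosine(d, q):
    # Cosine of the angle between two term-weight vectors.
    # Terms missing from a vector have weight 0.
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

cosine({"house": 1, "garden": 1}, {"house": 1, "garden": 1})  # 1.0
cosine({"house": 1}, {"garden": 1})                           # 0.0
```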


Tf/Idf

  • tf = term frequency

    • frequency of a term/keyword in a document

      The higher the tf, the higher the importance (weight) for the doc.

  • df = document frequency

    • no. of documents containing the term

    • distribution of the term

  • idf = inverse document frequency

    • the unevenness of term distribution in the corpus

    • the specificity of term to a document

      The more evenly a term is distributed, the less specific it is to a document

      weight(t,D) = tf(t,D) * idf(t)
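The weighting formula above can be computed directly; one common choice (an assumption here, since the slide does not spell it out) is idf(t) = log(N / df(t)):

```python
import math

def tf_idf(term, doc, corpus):
    # weight(t, D) = tf(t, D) * idf(t), with idf(t) = log(N / df(t)).
    # doc is a token list; corpus is a list of token lists.
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

corpus = [["the", "silver", "truck", "arrives"],
          ["the", "silver", "cannon", "fires", "silver", "bullets"],
          ["the", "truck", "is", "on", "fire"]]
tf_idf("silver", corpus[1], corpus)  # 2 * log(3/2)
tf_idf("the", corpus[0], corpus)     # 0.0 -- 'the' appears in every document
```

Note how a term occurring in every document (here "the") gets weight zero: it is maximally evenly distributed, hence specific to no document.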


Exercise

(1) Define the term/document matrix

  • D1: The silver truck arrives

  • D2: The silver cannon fires silver bullets

  • D3: The truck is on fire

(2) Compute TF/IDF from Reuters
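Part (1) can be sketched as follows: rows are documents, columns are the vocabulary terms, and each cell holds the raw term frequency (no stopping or stemming applied, so "fires" and "fire" stay distinct):

```python
docs = {"D1": "the silver truck arrives",
        "D2": "the silver cannon fires silver bullets",
        "D3": "the truck is on fire"}

# Sorted vocabulary of all distinct terms in the collection.
vocab = sorted({t for text in docs.values() for t in text.split()})

# Term/document matrix: one frequency row per document.
matrix = {d: [text.split().count(t) for t in vocab]
          for d, text in docs.items()}
```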


Let’s code our first text analysis engine

search.pl


Our corpus

  • A study on Kant’s critique of judgement

  • Aristotle's Metaphysics

  • Hegel’s Aesthetics

  • Plato’s Charmides

  • McGreedy’s War Diaries

  • Excerpts from the Royal Irish Society



Text Analysis is an Experimental Science!

  • Formulate a hypothesis

  • Design an experiment to answer the question

  • Perform the experiment

  • Does the experiment answer the question?

  • Rinse, repeat…


Test Collections

  • Three components of a test collection:

    • A collection of documents

    • A set of topics

    • Sets of relevant documents based on expert judgments

  • Metrics for assessing ‘performance’

    • Precision

    • Recall


Precision vs. Recall

(Venn diagram: within the set of all docs, the retrieved set and the relevant set overlap; the overlap determines precision and recall.)

Slide taken from Jimmy Lin’s tutorial
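Both metrics follow directly from the overlap in the diagram. A sketch (the function name is my own):

```python
def precision_recall(retrieved, relevant):
    # Precision = fraction of retrieved docs that are relevant.
    # Recall    = fraction of relevant docs that were retrieved.
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

precision_recall({"D1", "D2", "D3", "D4"}, {"D2", "D4", "D7"})
# precision 0.5 (2 of 4 retrieved are relevant),
# recall 2/3 (2 of 3 relevant were retrieved)
```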


The TREC experiments

  • Once per year

  • A set of documents and queries are distributed to the participants (the standard answers are unknown) (April)

  • Participants work (very hard) to construct and fine-tune their systems, and submit their answers (1000 per query) at the deadline (July)

  • NIST people manually evaluate the answers and provide correct answers (and classification of IR systems) (July – August)

  • TREC conference (November)


Towards Linked Data: Beyond the Simple Stuff




Web Mining

  • No stable document collection (spider, crawler)

  • Huge number of documents (partial collection)

  • Multimedia documents

  • Great variation of document quality

  • Multilingual problem


Exploiting Inter-Document Links

(Diagram: a link carries a description ("anchor text") from one page to another; pages act as hubs and authorities. Links indicate the utility of a doc. What does a link tell us?)

Slide is from ChengXiang Zhai



Information filtering

  • Instead of changing queries on stable document collection, we now want to filter an incoming document flow with stable interests (queries)

    • yes/no decision (instead of ordering documents)

    • Advantage: the description of user’s interest may be improved using relevance feedback (the user is more willing to cooperate)

    • The basic techniques used for IF are the same as those for IR – “Two sides of the same coin”

(Diagram: an incoming stream … doc3, doc2, doc1 enters the IF system, which consults a user profile and decides for each document: keep or ignore.)

Slide taken from Jimmy Lin’s tutorial

Let’s mine Twitter

  • Imagine you are a social scientist interested in the Arab Spring and the influence of social media, or something else

  • You know that social media plays an important role. Even the Pope tweets with an iPad!


Twitter API

  • It’s easy to get information out of Twitter

  • Search API: http://twitter.com/#!/search/house

  • http://twitter.com/statuses/public_timeline.rss


Twitter exercise

  • What do we want to look for?

  • Form Groups

  • Create an account with YahooPipes:

    • http://pipes.yahoo.com/pipes/

    • (You can use your Google one)

  • Create a Pipe. What do you see?


I. Access the keywords source

  • Fetch CSV module

    • Enter the URL of the CSV file: http://dl.dropbox.com/u/868826/Filter-Demo.csv

    • Use ‘keywords’ as the column name

II. Loop through each element in the CSV file and build a search URL formatted for RSS output.

  • Under Operators: fetch the Loop module

    • Fetch URL’s URL Builder into the Loop’s big field

      • As base use: http://search.twitter.com/search.atom

      • As query parameters use q in the first box and then item.keywords in the second

    • Assign results to item.loop:urlbuilder

III. Connect the CSV and Loop modules


IV. Search Twitter

  • Under Operators: fetch the Loop module

    • Fetch Sources’ Fetch Feed into the Loop’s big field

      • As URL use item.loop:urlbuilder

    • Emit all results

V. Connect the two Loop modules

VI. Sort

  • Under Operators: fetch the Sort module

  • Sort by item.y:published.utime in descending order

VII. Connect the Sort module to Pipe Output, the final module in every Yahoo Pipe.

VIII. Save and run the Pipe

More: http://www.squidoo.com/yahoo-pipes-guide



Group together similar documents

  • Idea

    • Frequent terms carry more information about the “cluster” they might belong to

    • Highly correlated frequent terms probably belong to the same cluster

http://www.iboogie.com/


Clustering Example

(Scatter plot: documents as points; visually separable groups form clusters.)

How many terms do these documents have?


English Novels

  • Normalise

  • Calculate similarity according to dot product


Let’s code again

Compare.pl


FReSH (Forging ReSTful Services for e-Humanities)

Creating Semantic Relationships


  • Digital edition of 6 newspapers / periodicals

  • Monthly Repository (1806 – 1837)

  • Northern Star (1837 – 1852)

  • The Leader (1850 – 1860)

  • English Women’s Journal (1858 – 1864)

  • Tomahawk (1867 – 1870)

  • Publisher’s Circular (1837-1959; NCSE: 1880-1890)


Semantic view

  • Chain of readings …


OCR Problems

Thin Compimy in fmmod to iiKu'-t tho dooiro ol'.those who seek, without Hpcoiilal/ioii, Hiifo and .profltublo invtwtmont for larj;o or Hinall HiiniH, at; a hi(jli<"r rulo of intoront tlian can be obtainod from tho in 'ihlio 1'uihIh, and on oh Hocuro a basin. Tho invoHlinont Hystom, whilo it olfors tho preutoHt advantages to tho public, nifordH to i(.H -moniberH n perfect Boourity, luul a hi^ hor rato ofintonmt than can bo obtained oluowhoro, 'I'ho capital of £250,000 in divided, for tho oonvonionco of invoiitmont and tninafor, into £1 bIiui-ob, of which 10a. only'wiUbe oallod.


  • N-grams

  • Latent Semantic Indexing

http://www.seo-blog.com/latent-semantic-indexing-lsi-explained.php
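One reason n-grams help with OCR-damaged text: a garbled word still shares character n-grams with its true form, so fuzzy matching remains possible where exact matching fails. A sketch using the Dice coefficient over trigram sets (the garbled "invtwtmont" is taken from the OCR sample above; function names are my own):

```python
def char_ngrams(word, n=3):
    # Overlapping character n-grams of a word.
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def ngram_similarity(a, b, n=3):
    # Dice coefficient over character n-gram sets:
    # 2 * |shared n-grams| / (|n-grams of a| + |n-grams of b|).
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

ngram_similarity("investment", "invtwtmont")   # low but non-zero
ngram_similarity("investment", "investrnent")  # much higher
```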


Demo


Producing Structured Information

Information Extraction


Information Extraction (IE)

  • IE systems

    • Identify documents of a specific type

    • Extract information according to pre-defined templates

    • Current approaches to IE focus on restricted domains, for instance news wires

http://www.opencalais.com/about

http://viewer.opencalais.com/


History of IE: Terror, fleets, catastrophes and management

The Message Understanding Conferences (MUC) were initiated and financed by DARPA (Defense Advanced Research Projects Agency) to encourage the development of new and better methods of information extraction. The character of this competition—many concurrent research teams competing against one another—required the development of standards for evaluation, e.g. the adoption of metrics like precision and recall.

http://en.wikipedia.org/wiki/Message_Understanding_Conference

The MUC-4 Terrorism Task

The task given to participants in the MUC-4 evaluation (1991) was to extract specific information on terrorist incidents from newspaper and newswire texts relating to South America.


Hunting for Things

  • Named entity recognition

    • Labelling names of things

    • An entity is a discrete thing like “King’s College London”

    • But also dates, places, etc.
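Named entity recognition can be illustrated, very crudely, by a gazetteer lookup; real NER systems use trained statistical models, and the names in the gazetteer below are illustrative assumptions:

```python
# A toy gazetteer -- a dictionary of known entity names and labels.
GAZETTEER = {
    "King's College London": "ORGANISATION",
    "London": "LOCATION",
    "Tobias Blanke": "PERSON",
}

def tag_entities(text):
    # Return (entity, label) pairs found in the text,
    # matching longer names first so "London" does not
    # fire inside "King's College London".
    found = []
    for name in sorted(GAZETTEER, key=len, reverse=True):
        if name in text:
            found.append((name, GAZETTEER[name]))
            text = text.replace(name, " ")  # avoid re-matching substrings
    return found

tag_entities("Tobias Blanke teaches at King's College London")
# [("King's College London", 'ORGANISATION'), ('Tobias Blanke', 'PERSON')]
```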


The aims: Things and their relations

  • Find and understand the limited relevant parts of texts

  • Clear, factual information (who did what to whom when?)

  • Produce a structured representation of the relevant information: relations

    • Terrorists have heads

    • Storms cause damage


Independent linguistic tools

A Text Zoner, which turns a text into a set of segments.

A Preprocessor, which turns a text or text segment into a sequence of sentences, each of which is a sequence of lexical items.

A Filter, which turns a sequence of sentences into a smaller set of sentences by filtering out irrelevant ones.

A Preparser, which takes a sequence of lexical items and tries to identify reliably determinable small-scale structures, e.g. names

A Parser, which takes a set of lexical items (words and phrases) and outputs a set of parse-tree fragments, which may or may not be complete.


Independent linguistic tools II

A Fragment Combiner, which attempts to combine parse-tree or logical-form fragments into a structure of the same type for the whole sentence.

A Semantic Interpreter, which generates semantic structures or logical forms from parse-tree fragments.

A Lexical Disambiguator, which indexes lexical items to one and only one lexical sense, or can be viewed as reducing the ambiguity of the predicates in the logical form fragments.

A Coreference Resolver which identifies different descriptions of the same entity in different parts of a text.

A Template Generator, which fills the IE templates from the semantic structures.

Off to Linked Data

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.61.6480&rep=rep1&type=pdf


Stanford NLP


More examples from Stanford

  • Use conventional classification algorithms to classify substrings of a document as ‘to be extracted’ or not.


Let’s code again

Great ideas


  • Parliament at Stormont 1921-1972

  • Transcripts of all debates - Hansards


  • Georeferencing: basic principles

    • Informal: based on place names

    • Formal: based on coordinates, etc.

  • Benefits

    • Resolving ambiguity

    • Ease of access to data objects

    • Integration of data from heterogeneous sources

    • Resolving space and time


DBpedia

  • Linked Data: All we need to do now is to return the results in the right format

  • For instance, extracting …

    • http://dbpedia.org/spotlight


Sponging


Thanks


Example from Stanford

The task given to participants in the MUC-4 evaluation (1991) was to extract specific information on terrorist incidents from newspaper and newswire texts relating to South America.

Part-of-speech taggers: systems that assign one and only one part-of-speech symbol (like Proper noun, or Auxiliary verb) to a word in a running text, and do so (usually) on the basis of statistical generalizations across very large bodies of text.

