Information Extraction Language Technology (A Machine Learning Approach) 24 March 2005

Antal van den Bosch and Walter Daelemans http://ilk.uvt.nl/~antalb/ltua

What is Information Extraction? • Input: unstructured text • Output: structured information that fills a pre-existing template (find salient information) • Most often stored in a database for further processing (e.g. data mining)

What isn’t information extraction? • Information retrieval (we need to extract info, not only find relevant documents) • Text understanding (only specific parts of the text are interesting) • large corpora can be used • possible to score objectively

Applications of IE • Can make information retrieval more precise • Summarization of documents in well-defined subject areas • Automatic generation of databases from text

Overview • Named entity recognition: recognizing relevant entities in text • Relation extraction: linking recognized entities that stand in particular relevant relations

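Named Entity Recognition Named Entity Recognition (NER) is a combination of concept chunking and labeling those chunks: we wish to identify textual information units that represent people, places, organizations, companies, bands, etc. Example (Dutch): De door het Amerikaanse National Hurricane Center als 'zeer gevaarlijk' omschreven orkaan Ivan nadert Cuba. Een overzicht over wat Ivan op de Kaaimaneilanden heeft aangericht, is er nog niet. Gouverneur Bruce Dinwiddy zei maandag dat duizenden mensen dakloos zijn geworden en dat ook belangrijke regeringsgebouwen zijn getroffen. (English: Hurricane Ivan, described by the American National Hurricane Center as 'very dangerous', is approaching Cuba. There is no overview yet of the damage Ivan has caused on the Cayman Islands. Governor Bruce Dinwiddy said on Monday that thousands of people have been made homeless and that important government buildings have also been hit.)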

Why NER? NER has many applications • prerequisite for information extraction • improving information retrieval • indexing • querying

Intuitively simple? What’s the problem? NER seems intuitively simple for humans. How do we determine whether or not a (string of) word(s) represents a name? • does the word start with a capital letter? (orthographic characteristics) • have we seen it before? (lists of names) • contextual clues How do we teach this to a computer?

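Some problems… • not every word that starts with a capital letter is a name • ex: “Soms is dat niet mogelijk ...” (“Sometimes that is not possible ...”; “Soms” is capitalised only because it starts the sentence) • no list can ever be complete • ex: “Antbeard en zijn bemanning voeren ...” (“Antbeard and his crew sailed ...”) • ex: “Wil je wat te drinken?” (“Would you like something to drink?”; “Wil” is also a first name, but here it is a verb) • context can be misleading • ex: “Er was geen land met Henk te bezeilen.” (idiom, literally “There was no land to sail to with Henk”, meaning “Henk was impossible to deal with”)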

Feature extraction A lot of different features can be extracted for use in (inductively) learning to classify NEs. Every word can be represented with a lot of different features. Example for the word “Floralux” in “… bedrijf dat Floralux inhuurde . In ‘81 …” (“… company that hired Floralux . In ‘81 …”): • string length? 8 • starts with a capital letter? YES • contains punctuation? NO • first word of the sentence? NO
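The four features listed for “Floralux” can be computed directly from the token sequence. The following is a minimal sketch, not the lecturers' implementation: the function name `word_features`, the feature names, and the rule that a token is sentence-initial when it follows ".", "!" or "?" are all assumptions made for illustration.

```python
import string

def word_features(tokens, i):
    """Return a few surface (orthographic) features for tokens[i]."""
    word = tokens[i]
    # Assumption: a token starts a sentence if it is the first token
    # or directly follows sentence-final punctuation.
    sentence_start = i == 0 or tokens[i - 1] in {".", "!", "?"}
    return {
        "length": len(word),
        "starts_with_capital": word[:1].isupper(),
        "contains_punctuation": any(ch in string.punctuation for ch in word),
        "first_in_sentence": sentence_start,
    }

# Pre-tokenised fragment from the slide (ASCII apostrophe used for '81).
tokens = "bedrijf dat Floralux inhuurde . In '81".split()
features = word_features(tokens, tokens.index("Floralux"))
print(features)
```

Each token then becomes one feature vector that a classifier can learn from; more features (suffixes, gazetteer lookups, and so on) can be added to the dictionary in the same way.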

Feature extraction (2) We represent the context by sliding a ‘window’ over the data which is anchored in the focus word; each position yields a left context, a focus word and a right context. “… het bedrijf dat Floralux inhuurde . In ‘81 bestond…”
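The sliding window can be sketched as a generator that yields one instance per token. The window width of three tokens on each side and the padding symbol "_" are assumptions; the slides do not state which values were used.

```python
def windows(tokens, left=3, right=3, pad="_"):
    """Yield (left_context, focus, right_context) for every token."""
    # Pad both ends so that the first and last tokens also get full contexts.
    padded = [pad] * left + list(tokens) + [pad] * right
    for i, focus in enumerate(tokens):
        centre = i + left
        yield (padded[centre - left:centre],
               focus,
               padded[centre + 1:centre + 1 + right])

tokens = "het bedrijf dat Floralux inhuurde . In '81 bestond".split()
instances = list(windows(tokens))

# The instance anchored on the focus word "Floralux":
left_ctx, focus, right_ctx = instances[3]
print(left_ctx, focus, right_ctx)
```

Every instance has the same fixed number of context slots, which is what makes it usable as a fixed-length feature vector.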

To split or not to split? • Two steps: determine boundaries, then determine types • One step: determine boundaries + types Example: Wolff , op het moment een journalist in Argentinië , speelde met Del Bosque bij Real Madrid in de jaren 70 . (Wolff, currently a journalist in Argentina, played with Del Bosque at Real Madrid in the 1970s.)
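One common way to encode "boundaries + types" as a single label per token is B/I/O tagging (Begin, Inside, Outside) with the entity type attached. The slides do not name an encoding, so the scheme, the function names and the entity types assigned to the example sentence below are assumptions for illustration only.

```python
def spans_to_bio(tokens, spans):
    """spans: list of (start, end_exclusive, entity_type) over token indices."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = "B-" + etype          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + etype          # remaining tokens of the entity
    return tags

def split_tags(tags):
    """Separate joint tags into a boundary sequence and a type sequence."""
    boundaries = [t.split("-")[0] for t in tags]
    types = [t.split("-", 1)[1] if "-" in t else None for t in tags]
    return boundaries, types

tokens = ("Wolff , op het moment een journalist in Argentinië , speelde met "
          "Del Bosque bij Real Madrid in de jaren 70 .").split()
# Illustrative annotation (assumed, not taken from the slides).
spans = [(0, 1, "PER"), (8, 9, "LOC"), (12, 14, "PER"), (15, 17, "ORG")]

joint = spans_to_bio(tokens, spans)         # one classifier: boundaries + types
boundaries, types = split_tags(joint)       # two classifiers: boundaries, then types
print(list(zip(tokens, joint)))
```

In the one-step setting a single classifier predicts the joint tags; in the two-step setting one classifier predicts `boundaries` and a second one assigns a type to each detected chunk.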

State of the art Many other languages have been targeted as well: Chinese, French, Japanese, Portuguese, Greek, Hindi, Romanian, Turkish, Norwegian, and so on…

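Information extraction • Named-entity recognition • Relation extraction • Coreference resolution Example (Dutch): PvdA-leider Wouter Bos gaat alleen voor minister-president. Vice-premier onder CDA-leider Balkenende is geen gedachte waar hij warm van wordt. "Hou het er maar op dat ik daar nee tegen zeg", aldus Bos woensdag voor RTL Nieuws. (English: PvdA leader Wouter Bos is only going for the post of prime minister. Being deputy prime minister under CDA leader Balkenende is not an idea he warms to. "Just assume that I say no to that," Bos said on Wednesday on RTL Nieuws.)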

Information extraction • Named-entity recognition has received a lot of attention in IE • Relation extraction is taking over as focal point of attention

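Relation extraction Example
Sentence: Eric Schmidt is directeur van Google . (Eric Schmidt is director of Google.)
POS tags: N N WW N VZ N . (N = noun, WW = verb, VZ = preposition)
Entities: [PER Eric Schmidt] [ORG Google]
Relation: directeur (director)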

Why relation extraction? • Named entities can be useful to enhance information retrieval • Not enough to answer certain types of information-seeking questions • For example • Wie is de directeur van Google? (Who is the director of Google?)

Why relation extraction? • Naïve strategy • Find documents in which [PER <unknown> ] and [ORG Google ] are within each other’s vicinity • Can produce nice results, but does not always work • Also, the user still has to find the answer • It would be better if the system produced the answer 'Eric Schmidt'.
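The naïve vicinity strategy can be sketched as a search for PER entities that occur within a few tokens of the ORG in question. This is only an illustration under stated assumptions: the entity spans are given by hand, the five-token distance threshold is arbitrary, and the second sentence ("Jan de Vries bezocht Google", "Jan de Vries visited Google") is invented here to show how the strategy can go wrong.

```python
def persons_near(entities, org_name, max_distance=5):
    """entities: list of (start, end_exclusive, type, text) token spans.

    Returns (person, gap) pairs sorted by the number of tokens between
    the person and the nearest mention of org_name.
    """
    orgs = [e for e in entities if e[2] == "ORG" and e[3] == org_name]
    hits = []
    for start, end, etype, text in entities:
        if etype != "PER":
            continue
        # Distance in tokens to each mention of the organisation.
        gaps = [o_start - end if o_start >= end else start - o_end
                for o_start, o_end, _, _ in orgs]
        if gaps and min(gaps) <= max_distance:
            hits.append((text, min(gaps)))
    return sorted(hits, key=lambda hit: hit[1])

# Tokens: Eric Schmidt is directeur van Google . Jan de Vries bezocht Google .
entities = [
    (0, 2, "PER", "Eric Schmidt"),
    (5, 6, "ORG", "Google"),
    (7, 10, "PER", "Jan de Vries"),
    (11, 12, "ORG", "Google"),
]
candidates = persons_near(entities, "Google")
print(candidates)
```

Both persons are returned, and the visitor ranks above the director because he stands closer to a mention of Google. Vicinity alone says nothing about which relation holds, so the user still has to find the answer.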

Examples • Some application areas are • News domain • Relations among the most typical named entities: Person, Organisation, Location, Misc • E.g. located in, parent of, part of • Biomedical domain • Relations among biomedical entities, such as DNA, proteins, diseases, etc. • Protein-protein interaction • Gene-disease relation • Every domain-specific application needs its own set of entities

Relation extraction • Difficult • Automatic systems still perform poorly • But a few reasonable solutions • Often only works in restricted domains • Techniques operating in the news domain are lousy in other domains, e.g. biomedical texts

Relations: implicit / explicit • Explicit relations are spelled out • Joe Cummings, Chairman of Sybase, spoke for four hours. • Implicit relations imply understanding a text • Sybase was scheduled to testify, and Chairman Joe Cummings spoke for four hours. • Most current research involves explicit relations

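Difficulties • A relation can be phrased in many ways • Eric Schmidt is de directeur van Google. (Eric Schmidt is the director of Google.) • Eric Schmidt, de nieuwe directeur van Google, verklaart ... (Eric Schmidt, the new director of Google, declares ...) • Eric Schmidt zet een volgende stap in zijn carrière. Sinds kort is hij de directeur van Google. (Eric Schmidt is taking the next step in his career. He recently became the director of Google.) • ...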

Assumptions • Delimit the task • Relations always connect two named entities • More complex relations between >2 entities are harder • Both entities are in the same sentence • Strong simplification • See next week (Veronique Hoste guest lecture)

Relation extraction • Relation detection • Is there a relation between two entities? • Relation classification • Which type does the relation between two entities have?

MUC • Message Understanding Conference • Has organized many information extraction competitions • Relation extraction has been a MUC competition task since 1998

ACE • Automatic Content Extraction • More recent than MUC • The ACE data is the most popular data set for relation extraction research

ACE • Relation types and subtypes • ROLE: relates a person to an organisation or geopolitical entity (member, owner, affiliate, client, citizen) • PART: generalised containment (subsidiary, physical part-of, set membership) • AT: permanent and transient locations (located, based-in, residence) • SOC: social relations among persons (parent, sibling, spouse, grandparent, associate)

Automatic RE: Pipeline • Relation extraction finds relations among pairs of named entities • Assuming that named entities have already been identified • Simple case of a pipeline, a heavily used architecture in language technology

Pipeline Tokeniser → Sentence splitter → Part-of-speech tagger → Syntactic parser → Named-entity recogniser → Relation extractor

Pipeline • Each part of a pipeline depends on the output of the parts before it • A weak point of the pipeline architecture is that errors tend to snowball: a mistake made by an early component propagates to every later component
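The dependency between pipeline components, and the way an early error propagates, can be illustrated with a toy pipeline. All three stages below are deliberately simplistic stand-ins written for this sketch (whitespace tokenisation, splitting on separate punctuation tokens, and "capitalised but not sentence-initial" as a crude name detector); they are assumptions and not the components used in the course.

```python
def tokenise(doc):
    # Toy tokeniser: assumes punctuation is already separated by spaces.
    doc["tokens"] = doc["text"].split()
    return doc

def split_sentences(doc):
    # Relies on the tokeniser having produced separate ".", "!" or "?" tokens.
    sentences, current = [], []
    for tok in doc["tokens"]:
        current.append(tok)
        if tok in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    doc["sentences"] = sentences
    return doc

def tag_capitalised(doc):
    # Toy name detector: capitalised tokens that are not sentence-initial.
    doc["candidates"] = [
        [tok for i, tok in enumerate(sent) if i > 0 and tok[:1].isupper()]
        for sent in doc["sentences"]
    ]
    return doc

def run_pipeline(text, stages):
    doc = {"text": text}
    for stage in stages:          # each stage reads what earlier stages wrote
        doc = stage(doc)
    return doc

stages = [tokenise, split_sentences, tag_capitalised]
good = run_pipeline("Soms is dat niet mogelijk . Ivan nadert Cuba .", stages)
bad = run_pipeline("Soms is dat niet mogelijk. Ivan nadert Cuba.", stages)
print(good["candidates"])
print(bad["candidates"])
```

In the second run the tokeniser leaves the full stops attached to the words. The sentence splitter therefore finds a single sentence, and the name detector returns different candidates, including the malformed token "Cuba.". One early mistake has changed the output of every later stage.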

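Eric Schmidt is directeur van Google. (Eric Schmidt is director of Google.) • [PER Eric Schmidt] werkzaam-bij [ORG Google] (werkzaam-bij = employed-at) • Jan de Vries is vakkenvuller bij Albert Heijn. (Jan de Vries is a shelf stacker at Albert Heijn.) • [PER Jan de Vries] werkzaam-bij [ORG Albert Heijn]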

PER is directeur van ORG. • PER is vakkenvuller bij ORG.

PER is directeur PREP ORG. • PER is vakkenvuller PREP ORG.

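PER is directeur PREP ORG. • PER is vakkenvuller PREP ORG. • Jan de Vries is fan van PSV. (Jan de Vries is a fan of PSV.) • PER is fan PREP ORG. (!) The generalised pattern now also matches a sentence that does not express an employment relation.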
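The generalisation steps on these slides (replace entities by their type, then prepositions by PREP) can be sketched as follows. The function names, the hand-written entity spans and the two-word preposition list are assumptions for illustration; a real system would take entities and part-of-speech tags from earlier pipeline components.

```python
PREPOSITIONS = {"van", "bij"}   # assumption: only the two seen on the slides

def generalise(tokens, entities):
    """Replace entity spans by their type and prepositions by PREP."""
    starts = {start: (end, etype) for start, end, etype in entities}
    out, i = [], 0
    while i < len(tokens):
        if i in starts:
            end, etype = starts[i]
            out.append(etype)
            i = end                      # skip the rest of the entity
        else:
            out.append("PREP" if tokens[i] in PREPOSITIONS else tokens[i])
            i += 1
    return out

def matches_template(pattern):
    """Over-general template: PER is <any word> PREP ORG ."""
    return (len(pattern) == 6
            and pattern[:2] == ["PER", "is"]
            and pattern[3:] == ["PREP", "ORG", "."])

examples = [
    ("Eric Schmidt is directeur van Google .", [(0, 2, "PER"), (5, 6, "ORG")]),
    ("Jan de Vries is vakkenvuller bij Albert Heijn .", [(0, 3, "PER"), (6, 8, "ORG")]),
    ("Jan de Vries is fan van PSV .", [(0, 3, "PER"), (6, 7, "ORG")]),
]
patterns = [generalise(sentence.split(), ents) for sentence, ents in examples]
for pattern in patterns:
    print(" ".join(pattern), matches_template(pattern))
```

All three sentences match the template, so a rule that maps it to werkzaam-bij would wrongly label the "fan" sentence as well. Distinguishing the cases requires knowing that "directeur" and "vakkenvuller" are professions and "fan" is not.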

Semantic lexicon (e.g. WordNet) • ... bewonderaar (admirer), fan, liefhebber (enthusiast) ... • ... directeur (director), accountant, vakkenvuller (shelf stacker), portier (doorman) ...

[PER Eric Schmidt] werkzaam-bij [ORG Google]

[PER Jan de Vries] werkzaam-bij [ORG Albert Heijn]

Similar?

PER ↑smain-su ↓smain-predc np-mod pp-obj1 ORG

Evaluation • Comparable to text classification and named entity recognition • Precision = number of correctly predicted relations / total number of predicted relations • Recall = number of correctly predicted relations / total number of relations in the text • F-score = 2 * precision * recall / (precision + recall)
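The three measures can be computed from a set of predicted relations and a set of gold-standard relations. In this sketch a relation is represented as a (person, relation, organisation) triple and a score is defined as 0.0 when its denominator is zero; both choices are assumptions, as is the example data, which reuses the entities from the earlier slides.

```python
def precision_recall_f(predicted, gold):
    """Return (precision, recall, F-score) for two collections of relations."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

gold = {
    ("Eric Schmidt", "werkzaam-bij", "Google"),
    ("Jan de Vries", "werkzaam-bij", "Albert Heijn"),
}
# The system finds both gold relations plus one spurious relation.
predicted = gold | {("Jan de Vries", "werkzaam-bij", "PSV")}

precision, recall, f_score = precision_recall_f(predicted, gold)
print(precision, recall, f_score)
```

Here two of the three predictions are correct (precision 2/3) and both gold relations are found (recall 1), which gives an F-score of 0.8 up to floating-point rounding.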
