
Information Extraction Based on Extraction Ontologies: Design, Deployment and Evaluation

Martin Labský, Vojtěch Svátek

Dept. of Knowledge Engineering, UEP

{labsky,[email protected]

AI Seminar, November 13th 2008


Agenda

  • Example applications of Web IE

  • Difficulties in practical applications

  • Extraction Ontologies

  • Extraction process

  • Experimental results

  • Future work and Conclusion

AI Seminar IE based on Extraction Ontologies


Example apps of Web IE (1/5): online products

Example apps of Web IE (2/5): contact information

Example apps of Web IE (3/5): seminars, events

Example apps of Web IE (4/5): bike products

Example apps of Web IE (5/5)

  • Store the extracted results in a DB to enable structured search over documents

    • information retrieval

    • database-like querying

    • e.g. online product search engine,

    • e.g. building a contact DB

  • Support for web page quality assessment

    • involved in an EU project MedIEQ to support medical website accreditation agencies

  • Source documents

    • internet, intranet, emails

    • can be very diverse

Difficulties in practical applications (1/3)

  • Requirements

    • quickly prototype IE applications

      • not necessarily with the best accuracy initially

      • often needed for a proof-of-concept application

      • then more work can be done to boost accuracy

    • the extraction model changes

      • meaning of to-be-extracted items may shift,

      • new items are often added or removed

Difficulties in practical applications (2/3)

  • Purely manual rules

    • writing extraction rules manually does not scale when more complex extraction rules need to be encoded

    • not easy to combine with trained models when training data become available in later phases

  • Training data

    • trainable IE systems often require large amounts of training data: these are typically not available for the desired task

    • when training data is collected, it is not easy to adapt it to modified or additional criteria

  • Wrappers

    • cannot rely on wrapper-only systems when extracting from multiple websites

    • non-wrapper systems often do not utilize regular formatting cues

Difficulties in practical applications (3/3)

  • It therefore seems worthwhile to exploit all of the following at the same time:

    • extraction knowledge from domain experts

    • training data

    • formatting regularities

Extraction ontologies

  • An extraction ontology is a part of a domain ontology transformed to suit extraction needs

  • Contains classes composed of attributes

    • more like UML class diagrams, less like ontologies where e.g. relations are standalone

    • also contains axioms related to classes or attributes

  • Classes and attributes are augmented with extraction evidence

    • manually provided patterns for content and context

    • axioms

    • value or length ranges

    • links to trained models

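[Figure: class diagram of an example class Person with attributes and cardinalities name {1}, degree {0-5}, email {0-2}, phone {0-3}, next to an element labeled Responsible]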
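The class-with-cardinalities idea can be sketched in a few lines of Python. This is only an illustration: the `Attribute` and `ExtractionClass` types and the `cardinality_violations` helper are hypothetical names and do not reflect the actual ontology file format used by the tool.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    name: str
    min_card: int  # minimum number of values allowed per instance
    max_card: int  # maximum number of values allowed per instance


@dataclass
class ExtractionClass:
    name: str
    attributes: list[Attribute] = field(default_factory=list)

    def cardinality_violations(self, instance: dict[str, list[str]]) -> list[str]:
        """Return the names of attributes whose value count is out of range."""
        bad = []
        for att in self.attributes:
            n = len(instance.get(att.name, []))
            if not att.min_card <= n <= att.max_card:
                bad.append(att.name)
        return bad


# The Person class from the slide: name {1}, degree {0-5}, email {0-2}, phone {0-3}
person = ExtractionClass("Person", [
    Attribute("name", 1, 1),
    Attribute("degree", 0, 5),
    Attribute("email", 0, 2),
    Attribute("phone", 0, 3),
])
```

A candidate instance with three email values, for example, would be flagged on the `email` attribute only.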

Extraction evidence provided by domain expert (1)

  • Patterns

    • for attributes and classes

    • for their content and context

    • patterns may be defined at the following levels:

      • word and character-level,

      • formatting tag level

      • level of labels (e.g. sentence breaks, POS tags)

  • Attribute value constraints

    • word length constraints, numeric value ranges

    • possible to attach units to numeric attributes

  • Axioms

    • may enforce relations among attributes

    • interpreted using the JavaScript scripting language

  • Simple co-reference resolution rules

Extraction evidence provided by domain expert (2)

  • Axioms

    • class level

    • attribute level

  • Patterns

    • class content

    • attribute value

    • attribute context

    • class context

  • Value constraints

    • word length

    • numeric value

Extraction evidence based on trained models (1)

  • Links to trainable classifiers

    • may classify attributes only

    • binary or multi-class

  • Trained models may use as features:

    • simple word level features (word itself, word type, possibly POS tags)

    • re-use all evidence provided by expert (patterns, axioms, constraints)

    • induced binary features based on word n-grams

[Figure: ontology snippets showing a classifier definition and a classifier usage]

Extraction evidence based on trained models (2)

  • Data representation for classifiers:

    • word sequence (1 word = 1 sample)

    • phrase set (sliding window method)

  • Tested trainable classifiers:

    • CRF++ (Conditional Random Fields) http://crfpp.sourceforge.net

    • algorithms from the Weka machine learning toolkit

      • SVM (Support Vector Machine)

      • JRip (rule induction)

      • http://www.cs.waikato.ac.nz/ml/weka

    • Hidden Markov Model extractor
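The two data representations listed above can be illustrated roughly as follows. This is one plausible reading of "word sequence" and "sliding window", not the tool's actual code; the function names and the `max_len` parameter are assumptions.

```python
def word_samples(tokens):
    """Word sequence representation: one sample per word."""
    return [(i, tok) for i, tok in enumerate(tokens)]


def phrase_samples(tokens, max_len):
    """Phrase set representation: slide a window over the tokens and emit
    every contiguous phrase of 1..max_len tokens as (start, end, text),
    with `end` exclusive."""
    samples = []
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            samples.append((start, end, " ".join(tokens[start:end])))
    return samples


tokens = ["Dr.", "John", "Burns"]
words = word_samples(tokens)
phrases = phrase_samples(tokens, max_len=2)
```

In the first representation a classifier labels each word; in the second it decides for each candidate phrase whether it is an attribute value.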

Extraction evidence based on trained models (3)

  • Feature induction

    • candidate features are all word n-grams of given lengths occurring inside or near training attribute values

    • pruning parameters:

      • point-wise mutual information thresholds:

      • minimal absolute occurrence count

      • maximum number of features
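A minimal sketch of n-gram feature induction with the three pruning parameters above. Several details are assumptions made for illustration: samples are phrase-level pairs `(tokens, is_value)`, each n-gram is counted once per sample, the occurrence count is taken over positive samples, and all function and parameter names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n_values):
    """Yield all word n-grams of the given lengths."""
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def induce_features(samples, n_values=(1, 2), min_pmi=0.0,
                    min_count=2, max_features=100):
    """samples: list of (tokens, is_value) pairs.
    Returns n-grams ranked by point-wise mutual information with the
    'is an attribute value' label, after pruning."""
    n_samples = len(samples)
    n_pos = sum(1 for _, is_value in samples if is_value)
    all_counts, pos_counts = Counter(), Counter()
    for tokens, is_value in samples:
        seen = set(ngrams(tokens, n_values))  # count once per sample
        all_counts.update(seen)
        if is_value:
            pos_counts.update(seen)
    scored = []
    for gram, c_pos in pos_counts.items():
        if c_pos < min_count:  # minimal absolute occurrence count
            continue
        # PMI = log( P(gram, value) / (P(gram) * P(value)) )
        pmi = math.log(c_pos * n_samples / (all_counts[gram] * n_pos))
        if pmi >= min_pmi:  # PMI threshold
            scored.append((pmi, gram))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [gram for _, gram in scored[:max_features]]  # feature cap


samples = [
    (["Dr.", "John", "Burns"], True),
    (["Dr.", "Mary", "Smith"], True),
    (["John", "said"], False),
    (["the", "seminar"], False),
]
features = induce_features(samples)
```

Each surviving n-gram would then become one binary feature for the classifier.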

Probabilistic model to combine evidence

  • Each piece of evidence E is equipped with 2 probability estimates with respect to predicted attribute A:

    • evidence precision P(A|E) ... prediction confidence

    • evidence coverage P(E|A) ... necessity of evidence (support)

  • Each attribute is assigned some low prior probability P(A)

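  • Let $\Phi_A = \{E_1, \dots, E_n\}$ be the set of evidence applicable to A

  • Assume conditional independence among the evidence in $\Phi_A$, given A and given ¬A: $P(E_1, \dots, E_n \mid A) = \prod_{i} P(E_i \mid A)$, and likewise for $\neg A$

  • Using Bayes formula we compute P(A | its evidence values) as:

    $$P(A \mid e_1, \dots, e_n) = \frac{1}{1 + \left(\frac{P(A)}{P(\neg A)}\right)^{n-1} \prod_{i=1}^{n} \frac{P(\neg A \mid e_i)}{P(A \mid e_i)}}$$

    where $e_i$ is the observed value of $E_i$ (the evidence either holds or does not). For evidence that does not hold, $P(A \mid \neg E) = \frac{(1 - P(E \mid A))\,P(A)}{1 - P(E)}$ with $P(E) = \frac{P(E \mid A)\,P(A)}{P(A \mid E)}$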
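A small sketch of this evidence combination, assuming only what the slide states: Bayes' rule and conditional independence of the evidence given A and given not-A. Under those assumptions the posterior odds equal the prior odds times one likelihood ratio per piece of evidence, which gives the closed form used in `combine`. The `Evidence` type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    precision: float  # P(A|E)
    coverage: float   # P(E|A)
    holds: bool       # whether the evidence was observed


def p_attribute_given_value(ev: Evidence, prior: float) -> float:
    """P(A | observed value of E), derived from precision, coverage and prior."""
    if ev.holds:
        return ev.precision
    p_e = ev.coverage * prior / ev.precision        # P(E) by Bayes' rule
    return (1.0 - ev.coverage) * prior / (1.0 - p_e)  # P(A | not E)


def combine(prior: float, evidence: list[Evidence]) -> float:
    """Combine all evidence values into P(A | evidence values)."""
    ratio = 1.0
    for ev in evidence:
        p = p_attribute_given_value(ev, prior)
        if p <= 0.0:
            return 0.0  # one piece of evidence rules the attribute out
        ratio *= (1.0 - p) / p
    prior_odds = prior / (1.0 - prior)
    n = len(evidence)
    return 1.0 / (1.0 + prior_odds ** (n - 1) * ratio)


p_two = combine(0.1, [Evidence(0.8, 0.5, True), Evidence(0.8, 0.3, True)])
p_missing = combine(0.1, [Evidence(0.8, 0.5, False)])
```

With no evidence the function returns the prior, and with a single piece of observed evidence it returns that evidence's precision. Evidence with high coverage that is not observed pulls the result below the prior.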

Extraction vs. domain ontologies

  • When existing domain ontologies are available:

    • identify relevant parts

    • reuse classes, attributes, cardinalities, some axioms

  • Transformation rules

    • reused parts of domain ontology may require transformation to fit into extraction ontology

      • due to extraction ontologies focusing on the way of presentation rather than semantics

    • identified typical transformation rules that could be used to transform parts of OWL-encoded ontologies

The extraction process (1/5)

  • Tokenize, build HTML formatting tree, apply sentence splitter, POS tagger

  • Match patterns

  • Apply trained models

  • Create Attribute Candidates (ACs)

    • For each created AC, let P_AC = P(A | its evidence values), computed using the probabilistic model for combining evidence

    • prune ACs below threshold

    • build document AC lattice, score ACs by log(PAC)

[Figure: example AC lattice over the text “Washington, DC”]

The extraction process (2/5)

  • Evaluate coreference resolution rules for each pair of ACs

    • e.g. “Dr. Burns” and “John Burns”

    • possible coreferring groups are remembered

    • rules are specified in the attribute’s value section

  • Compute the best scoring path BP through AC lattice

    • using dynamic programming

  • Run wrapper induction algorithm using all AC ∈ BP

    • wrapper induction algorithm described in next slides

    • if new local patterns are induced, apply them to:

      • rescore existing ACs

      • create new ACs

    • update AC lattice, recompute BP

  • Terminate here if no instances are to be generated

    • output all AC ∈ BP (n-best paths supported)
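The best-path step can be sketched with a simple dynamic program over token positions. This is an illustration under stated assumptions, not the system's algorithm: ACs are token spans scored by log(P_AC), every token not covered by a chosen AC costs a fixed background log-probability (`background_logp` is a made-up parameter), and n-best search and coreference are left out.

```python
import math
from typing import NamedTuple


class AC(NamedTuple):
    start: int      # index of first token
    end: int        # index after last token (exclusive), must be > start
    attribute: str
    prob: float     # P_AC, must be > 0


def best_path(n_tokens, acs, background_logp=math.log(0.1)):
    """Return (score, list of non-overlapping ACs) maximizing the summed
    log scores over the whole document."""
    by_end = {}
    for ac in acs:
        by_end.setdefault(ac.end, []).append(ac)
    best = [0.0] + [-math.inf] * n_tokens   # best[i]: best score for tokens [0, i)
    back = [None] * (n_tokens + 1)          # AC ending at i on the best path, if any
    for i in range(1, n_tokens + 1):
        best[i] = best[i - 1] + background_logp  # token i-1 is background
        for ac in by_end.get(i, []):
            cand = best[ac.start] + math.log(ac.prob)
            if cand > best[i]:
                best[i], back[i] = cand, ac
    path, i = [], n_tokens
    while i > 0:                            # follow back-pointers
        ac = back[i]
        if ac is None:
            i -= 1
        else:
            path.append(ac)
            i = ac.start
    path.reverse()
    return best[n_tokens], path


# Three tokens: "Washington", ",", "DC"
candidates = [
    AC(0, 1, "city", 0.6),
    AC(0, 3, "city", 0.7),
    AC(2, 3, "state", 0.5),
]
score, path = best_path(3, candidates)
```

Here the single candidate spanning all three tokens wins over the combination of two shorter candidates plus one background token.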

The extraction process (3/5)

  • Generate Instance Candidates (ICs) bottom-up

    • triangular trellis used to store partial ICs

    • when scoring new ICs, only consider axioms and patterns that already can be applied to the IC. Validity is not required.

    • pruning parameters: absolute and relative beam size at trellis node, maximum number of ACs that can be skipped, minimum IC probability

The extraction process (4/5)

  • IC generation: continued

  • When a new IC is created, its P(IC) is computed from 2 components:

    • a score based on its attribute candidates, where |IC| is the member attribute count, AC_skip is a non-member AC that is fully or partially inside the IC, and P_ACskip is the probability of that AC being a “false positive”

    • a score based on the set of evidence known for the class C, computed using the same probabilistic model as for ACs

  • The two scores are combined using the Prospector pseudo-Bayesian method
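The formulas for this slide did not survive in the transcript, so the following is only a sketch of how such a score could be computed. Two assumptions are made: the AC-based component is a geometric-mean style score normalized by |IC|, and the combination uses the commonly cited odds-multiplying form p1·p2 / (p1·p2 + (1−p1)(1−p2)). Both function names are hypothetical and all inputs are assumed to lie strictly between 0 and 1.

```python
import math


def ac_component(member_probs, skipped_fp_probs):
    """Assumed form of the AC-based score.
    member_probs: P_AC of each member attribute candidate.
    skipped_fp_probs: for each skipped non-member AC inside the IC,
    the probability that it is a false positive."""
    total = sum(math.log(p) for p in member_probs)
    total += sum(math.log(q) for q in skipped_fp_probs)
    return math.exp(total / len(member_probs))  # normalize by |IC|


def prospector(p1, p2):
    """Assumed pseudo-Bayesian combination of two probability estimates."""
    return p1 * p2 / (p1 * p2 + (1.0 - p1) * (1.0 - p2))


p_acs = ac_component([0.9, 0.9], [0.5])   # two members, one skipped AC
p_class = 0.8                             # class-level evidence score
p_ic = prospector(p_acs, p_class)
```

In this form a neutral estimate of 0.5 leaves the other score unchanged, and skipping a confident AC lowers the score of the instance.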

The extraction process (5/5)

  • Insert valid ICs into AC lattice

    • Valid ICs were assembled during IC generation phase

    • Score of a valid IC reflects all extraction evidence of its class

    • All unpruned valid ICs are inserted into the AC lattice together with their scores

  • The best path BP is calculated through the IC+AC lattice (n-best supported)

    • the search algorithm allows constraints to be defined over the extracted path(s)

      • e.g. min/max count of extracted instances

    • output all ACs and ICs on BP

Extraction evidence based on formatting

  • A simple wrapper induction algorithm

    • identify formatting regularities

    • turn them into “local” context patterns to boost contained ACs

  • Assemble distinct formatting subtrees rooted at block elements containing ACs from the best path BP currently determined by the system

  • For each subtree S and each attribute Att, calculate C(S,Att) and prec(Att|S)

  • If both C(S,Att) and prec(Att|S) reach defined thresholds, a new local context pattern is created with its precision set to prec(Att|S) and its recall close to 0 (in order not to harm potential singleton ACs).

[Figure: a formatting tree learned using known names like “John Doe” and applied to unknown names such as “Argentina Agosto”. In both cases a TD element contains a B element holding the name and an A_href element holding the email address.]
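A rough sketch of this wrapper induction step. It assumes that C(S,Att) counts how often a formatting subtree shape S contains an AC of attribute Att from the current best path, and that prec(Att|S) is that count divided by all occurrences of S. Subtree shapes are simplified to path strings such as "TD/B". The function name, thresholds and the recall value are made up for illustration.

```python
from collections import Counter


def induce_local_patterns(nodes, min_count=2, min_prec=0.5):
    """nodes: list of (shape, attribute_or_None), one entry per occurrence
    of a formatting subtree; the attribute is set when the subtree contains
    an AC from the current best path."""
    shape_total = Counter(shape for shape, _ in nodes)
    shape_att = Counter((shape, att) for shape, att in nodes if att is not None)
    patterns = []
    for (shape, att), count in shape_att.items():
        prec = count / shape_total[shape]
        if count >= min_count and prec >= min_prec:
            patterns.append({
                "shape": shape,
                "attribute": att,
                "precision": prec,
                "recall": 0.001,  # close to 0, so singleton ACs are not penalized
            })
    return patterns


nodes = [
    ("TD/B", "name"),        # known name, e.g. "John Doe"
    ("TD/B", "name"),
    ("TD/B", None),          # unknown name in the same formatting
    ("TD/A_href", "email"),
    ("TD/A_href", "email"),
    ("P/B", None),
]
patterns = induce_local_patterns(nodes)
```

The induced "TD/B" pattern can then be applied to the third, unlabeled node to create or boost a name candidate there.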

Experimental results: Seminar announcements

  • 485 English seminar announcement text documents

  • Manual: extraction ontology created based on seeing 40 randomly chosen documents, evaluated using remaining 445

  • Manual+CRF: same extraction ontology equipped with a CRF classifier used as further extraction evidence. 10-fold cross-validation using test set above

Cost of the IE system: Seminar announcements

  • Creation of extraction ontology: 1-2 person weeks

    • annotate 40 training documents (expect 1-2 days)

    • inspecting examples in 40 documents

    • writing patterns, axioms, iterating

  • Training inductive model in addition to ex. ontology

    • 2-3 person weeks to annotate training data (445 docs)

    • F-measure improvement of 2 to 6%

  • ex. ontologies allow for fast & flexible prototyping (annotation design changes quickly reflected)

  • then, for parts of the ex. ontology that need accuracy improvement, obtain more training data & reuse as features all manual extraction evidence already provided

Experimental results: Contact information

  • 109 English contact pages, 200 Spanish, 108 Czech

  • Named entity counts: 7000, 5000, 11000, respectively; instances not labeled

  • Only domain expert’s evidence and formatting pattern induction were used

  • Domain expert saw 30 randomly chosen documents, the rest was test data

  • Instance extraction done but not evaluated

Instance grouping

  • Vilain score F = 60-70%

  • Vilain recall = % of correct links recovered

  • Vilain precision = % of recovered links that are correct
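The link-based score above can be sketched as follows, using the standard MUC-style formulation in which recall is the sum over gold groups of (size minus number of parts the system splits the group into), divided by the sum of (size minus 1). Applying it to instance grouping, where each instance is a set of attribute value identifiers, is an assumption; the function names are hypothetical.

```python
def link_recall(key, response):
    """key, response: lists of sets of item ids (one set per group).
    Items of a key group that are missing from the response count as
    separate singleton parts."""
    num = den = 0
    for group in key:
        parts = set()
        singletons = 0
        for item in group:
            idx = next((i for i, r in enumerate(response) if item in r), None)
            if idx is None:
                singletons += 1
            else:
                parts.add(idx)
        num += len(group) - (len(parts) + singletons)
        den += len(group) - 1
    return num / den if den else 0.0


def link_score(key, response):
    """Return (precision, recall, F). Precision is recall with roles swapped."""
    r = link_recall(key, response)
    p = link_recall(response, key)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


gold = [{"a", "b", "c"}, {"d", "e"}]
system = [{"a", "b"}, {"c", "d", "e"}]
precision, recall, f_score = link_score(gold, system)
```

In this example the gold grouping needs 3 links, of which the system recovers 2, and the system proposes 3 links, of which 2 are correct.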

Experimental results: Bicycle descriptions

  • Hidden Markov Model

  • Trigram, naive topology

  • 103 labeled web pages, 12346 named entities,

  • Instances not labeled; instance extraction done but not evaluated

  • Single HMM for all extracted types:

    • 1 Background state

    • 1 Target, 1 Prefix and 1 Suffix state type for each extracted slot

    • =1+3*N states

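[Figure: HMM topology with one background state B and, for each extracted slot, a prefix state P, a target state T and a suffix state S (P’, T’, S’ for the next slot, and so on)]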
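The state inventory described above is easy to enumerate. A tiny sketch, with made-up slot names and a hypothetical state naming scheme:

```python
def hmm_states(slots):
    """One background state plus prefix, target and suffix states per slot."""
    states = ["B"]
    for slot in slots:
        states += [f"P_{slot}", f"T_{slot}", f"S_{slot}"]
    return states


states = hmm_states(["name", "price", "frame_size"])
```

For N slots this yields 1 + 3*N states, matching the count given on the slide.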

Bicycle structured search interface

Future work

  • Attempt to improve a seed extraction ontology by bootstrapping using relevant pages retrieved from the Internet

  • Adapt the structure of extraction ontology according to data

    • e.g. add new attributes to represent product features

Conclusions

  • Tool+tutorial available

    • http://eso.vse.cz/~labsky/ex/

  • Presented an extraction ontology approach to

    • allow for fast prototyping of IE applications

    • accommodate extraction schema changes easily

    • utilize all available forms of extraction knowledge

      • domain expert’s knowledge

      • training data

      • formatting regularities found in web pages

  • Results

    • indicate that extraction ontologies can serve as a quick prototyping tool

    • accuracy of the prototyped ontology can be improved when training data become available

Acknowledgements

  • The research was partially supported by the EC under contract FP6-027026, Knowledge Space of Semantic Inference for Automatic Annotation and Retrieval of Multimedia Content: K-Space.

  • The medical website application is carried out in the context of the EC-funded (DG-SANCO) project MedIEQ.
