Information Extraction Based on Extraction Ontologies: Design, Deployment and Evaluation

Martin Labský, Vojtěch Svátek

Dept. of Knowledge Engineering, UEP

{labsky,svatek}@vse.cz

AI Seminar, November 13th 2008

Agenda
  • Example applications of Web IE
  • Difficulties in practical applications
  • Extraction Ontologies
  • Extraction process
  • Experimental results
  • Future work and Conclusion

AI Seminar IE based on Extraction Ontologies

Example apps of Web IE (1/5): online products

Example apps of Web IE (2/5): contact information

Example apps of Web IE (3/5): seminars, events

Example apps of Web IE (4/5): bike products

Example apps of Web IE (5/5)
  • Store the extracted results in a DB to enable structured search over documents
    • information retrieval
    • database-like querying
    • e.g. online product search engine
    • e.g. building a contact DB
  • Support for web page quality assessment
    • applied in the EU project MedIEQ to support medical website accreditation agencies
  • Source documents
    • internet, intranet, emails
    • can be very diverse

Agenda
  • Example applications of Web IE
  • Difficulties in practical IE applications
  • Extraction Ontologies
  • Extraction process
  • Experimental results
  • Future work and Conclusion

Difficulties in practical applications (1/3)
  • Requirements
    • quickly prototype IE applications
      • not necessarily with the best accuracy initially
      • often needed for a proof-of-concept application
      • then more work can be done to boost accuracy
    • the extraction model changes
      • meaning of to-be-extracted items may shift,
      • new items are often added or removed

Difficulties in practical applications (2/3)
  • Purely manual rules
    • writing extraction rules manually does not scale when more complex extraction rules need to be encoded
    • not easy to combine with trained models when training data become available in later phases
  • Training data
    • trainable IE systems often require large amounts of training data: these are typically not available for the desired task
    • when training data is collected, it is not easy to adapt it to modified or additional criteria
  • Wrappers
    • cannot rely on wrapper-only systems when extracting from multiple websites
    • non-wrapper systems often do not utilize regular formatting cues

Difficulties in practical applications (3/3)
  • It seems worthwhile to exploit all of the following at the same time:
    • extraction knowledge from domain experts
    • training data
    • formatting regularities

Agenda
  • Example applications of Web IE
  • Difficulties in practical applications
  • Extraction Ontologies
  • Extraction process
  • Experimental results
  • Future work and Conclusion

Extraction ontologies
  • An extraction ontology is a part of a domain ontology transformed to suit extraction needs
  • Contains classes composed of attributes
    • more like UML class diagrams, less like ontologies where e.g. relations are standalone
    • also contains axioms related to classes or attributes
  • Classes and attributes are augmented with extraction evidence
    • manually provided patterns for content and context
    • axioms
    • value or length ranges
    • links to trained models

[Figure: example class Person with attributes and cardinalities: name {1}, degree {0-5}, email {0-2}, phone {0-3}; a second label reads "Responsible"]
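As a rough illustration of a class composed of attributes with cardinalities, the Person example can be sketched in Python. The names `Attribute`, `ExtractionClass` and `cardinality_violations` are hypothetical and are not the format used by the authors' tool; only the attribute names and cardinality ranges come from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Attribute:
    name: str
    min_card: int  # minimum number of values per instance
    max_card: int  # maximum number of values per instance


@dataclass
class ExtractionClass:
    name: str
    attributes: List[Attribute] = field(default_factory=list)

    def cardinality_violations(self, instance: Dict[str, List[str]]) -> List[str]:
        """Return one message per attribute whose value count is out of range."""
        violations = []
        for att in self.attributes:
            n = len(instance.get(att.name, []))
            if not att.min_card <= n <= att.max_card:
                violations.append(
                    f"{att.name}: {n} not in [{att.min_card}, {att.max_card}]"
                )
        return violations


# Cardinalities as shown on the slide.
person = ExtractionClass("Person", [
    Attribute("name", 1, 1),
    Attribute("degree", 0, 5),
    Attribute("email", 0, 2),
    Attribute("phone", 0, 3),
])
```

A candidate instance with one name and one email passes the check, while an instance without a name is reported as violating the `name {1}` cardinality.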

Extraction evidence provided by domain expert (1)
  • Patterns
    • for attributes and classes
    • for their content and context
    • patterns may be defined at the following levels:
      • word and character-level,
      • formatting tag level
      • level of labels (e.g. sentence breaks, POS tags)
  • Attribute value constraints
    • word length constraints, numeric value ranges
    • possible to attach units to numeric attributes
  • Axioms
    • may enforce relations among attributes
    • interpreted using the JavaScript scripting language
  • Simple co-reference resolution rules
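To make the three kinds of expert evidence concrete, here is a minimal Python sketch with one value pattern, one context pattern, one word-length constraint and one axiom relating two attributes. All patterns, thresholds and function names are invented for illustration; in the presented system axioms are written in JavaScript, not Python.

```python
import re

# Value pattern (character level): a rough shape of an email address.
EMAIL_VALUE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Context pattern: text immediately preceding a phone number.
PHONE_CONTEXT = re.compile(r"(?:phone|tel)\.?\s*:?\s*$", re.IGNORECASE)


def name_length_ok(value, min_words=2, max_words=4):
    """Word length constraint on a name attribute (assumed range)."""
    return min_words <= len(value.split()) <= max_words


def email_matches_name(name, email):
    """Axiom relating two attributes: the surname appears in the email's local part."""
    surname = name.split()[-1].lower()
    return surname in email.split("@")[0].lower()
```

Each of these checks would count as one piece of evidence for or against a candidate; none of them is decisive on its own.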

Extraction evidence provided by domain expert (2)

Axioms

  • class level
  • attribute level

Patterns

  • class content
  • attribute value
  • attribute context
  • class context

Value constraints

  • word length
  • numeric value

Extraction evidence based on trained models (1)
  • Links to trainable classifiers
    • may classify attributes only
    • binary or multi-class
  • Trained models may use as features:
    • simple word level features (word itself, word type, possibly POS tags)
    • re-use all evidence provided by expert (patterns, axioms, constraints)
    • induced binary features based on word n-grams

[Figure: ontology excerpt with callouts "classifier definition" and "classifier usage"]

Extraction evidence based on trained models (2)
  • Data representation for classifiers:
    • word sequence (1 word = 1 sample)
    • phrase set (sliding window method)
  • Tested trainable classifiers:
    • CRF++ (Conditional Random Fields) http://crfpp.sourceforge.net
    • algorithms from the Weka machine learning toolkit
      • SVM (Support Vector Machine)
      • JRip (rule induction)
      • http://www.cs.waikato.ac.nz/ml/weka
    • Hidden Markov Model extractor
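The two data representations can be sketched as follows: in the word-sequence view every token is one sample, while the sliding-window view enumerates all phrases up to a maximum length. The function names and the `max_len` parameter are assumptions made for this sketch.

```python
def word_samples(tokens):
    """Word sequence representation: one sample per token."""
    return [(i, tok) for i, tok in enumerate(tokens)]


def phrase_samples(tokens, max_len=3):
    """Phrase set representation: slide windows of 1..max_len tokens over the text."""
    phrases = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            end = start + length
            if end > len(tokens):
                break
            phrases.append((start, end, tokens[start:end]))
    return phrases
```

Sequence labelers such as CRFs fit the first representation naturally, whereas classifiers that judge whole candidate phrases fit the second.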

Extraction evidence based on trained models (3)
  • Feature induction
    • candidate features are all word n-grams of given lengths occurring inside or near training attribute values
    • pruning parameters:
      • point-wise mutual information thresholds
      • minimal absolute occurrence count
      • maximum number of features
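A minimal sketch of n-gram feature induction with the three pruning parameters listed above. The PMI is computed between "the sample contains the n-gram" and "the sample is an attribute value"; the sample format, the default thresholds and the function names are assumptions, and the original system may define its candidates and statistics differently.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def induce_features(samples, lengths=(1, 2), min_pmi=0.5, min_count=2, max_features=100):
    """samples: list of (tokens, is_attribute_value) pairs."""
    total = len(samples)
    positives = sum(1 for _, label in samples if label)
    count, count_pos = Counter(), Counter()
    for tokens, label in samples:
        grams = {g for n in lengths for g in ngrams(tokens, n)}
        count.update(grams)
        if label:
            count_pos.update(grams)
    scored = []
    for gram, c in count.items():
        # Prune by absolute occurrence count; skip n-grams never seen with the attribute.
        if c < min_count or count_pos[gram] == 0:
            continue
        pmi = math.log((count_pos[gram] / total) / ((c / total) * (positives / total)))
        if pmi >= min_pmi:
            scored.append((pmi, gram))
    # Keep the highest-PMI features, up to the maximum number of features.
    scored.sort(key=lambda x: (-x[0], x[1]))
    return [gram for _, gram in scored[:max_features]]
```

Each surviving n-gram would then become one binary feature for the classifier.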

Probabilistic model to combine evidence
  • Each piece of evidence E is equipped with 2 probability estimates with respect to predicted attribute A:
    • evidence precision P(A|E) ... prediction confidence
    • evidence coverage P(E|A) ... necessity of evidence (support)
  • Each attribute is assigned some low prior probability P(A)
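  • Let $\Phi_A$ be the set of evidence applicable to A
  • Assume conditional independence among the evidence in $\Phi_A$, both given $A$ and given $\neg A$
  • Using Bayes formula we compute P(A | its evidence values) as:

$$P(A \mid E \in \Phi_A) = \frac{1}{1 + \left(\frac{P(A)}{P(\neg A)}\right)^{|\Phi_A| - 1} \prod_{E \in \Phi_A} \frac{P(\neg A \mid E)}{P(A \mid E)}}$$

where each factor uses the observed value of $E$: the precision $P(A \mid E)$ when the evidence holds, and otherwise

$$P(A \mid \neg E) = \frac{(1 - P(E \mid A))\,P(A)}{1 - P(E)}, \qquad P(E) = \frac{P(E \mid A)\,P(A)}{P(A \mid E)}$$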
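The combination can be sketched in Python under the stated conditional-independence assumption. Each piece of evidence is given as a hypothetical `(precision, coverage, observed)` triple; the treatment of unobserved evidence via P(A|¬E) is derived here from precision, coverage and prior with Bayes' rule and is an assumption about how the original system handles that case.

```python
def p_given_not_observed(precision, coverage, prior):
    """P(A | not E), derived from P(A|E), P(E|A) and P(A)."""
    p_e = coverage * prior / precision          # P(E) = P(E|A) P(A) / P(A|E)
    return (1.0 - coverage) * prior / (1.0 - p_e)


def combine(prior, evidence):
    """evidence: list of (precision, coverage, observed) triples."""
    n = len(evidence)
    if n == 0:
        return prior
    # (P(A) / P(not A)) ** (n - 1) corrects for counting the prior n times.
    ratio = (prior / (1.0 - prior)) ** (n - 1)
    for precision, coverage, observed in evidence:
        p = precision if observed else p_given_not_observed(precision, coverage, prior)
        if p <= 0.0:
            return 0.0  # necessary evidence (coverage 1) is missing
        ratio *= (1.0 - p) / p
    return 1.0 / (1.0 + ratio)
```

With a single observed piece of evidence the result equals its precision; two observed pieces with precision 0.5 and a prior of 0.1 already yield 0.9, because each is far more confident than the prior.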

Extraction vs. domain ontologies
  • When existing domain ontologies are available:
    • identify relevant parts
    • reuse classes, attributes, cardinalities, some axioms
  • Transformation rules
    • reused parts of domain ontology may require transformation to fit into extraction ontology
      • due to extraction ontologies focusing on the way of presentation rather than semantics
    • identified typical transformation rules that could be used to transform parts of OWL-encoded ontologies

Agenda
  • Example applications of Web IE
  • Difficulties in practical applications
  • Extraction Ontologies
  • Extraction process
  • Experimental results
  • Future work and Conclusion

The extraction process (1/5)
  • Tokenize, build HTML formatting tree, apply sentence splitter, POS tagger
  • Match patterns
  • Apply trained models
  • Create Attribute Candidates (ACs)
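    • For each created AC, let $P_{AC}$ = P(A | its evidence values), computed with the probabilistic model for combining evidence
    • prune ACs below threshold
    • build document AC lattice, score ACs by $\log(P_{AC})$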

[Figure: example attribute candidates over the text “Washington, DC”]

The extraction process (2/5)
  • Evaluate coreference resolution rules for each pair of ACs
    • e.g. “Dr. Burns” and “John Burns” may co-refer
    • possible coreferring groups are remembered
    • rules are specified in the attribute’s value section of the ontology
  • Compute the best scoring path BP through AC lattice
    • using dynamic programming
  • Run wrapper induction algorithm using all AC ∈ BP
    • wrapper induction algorithm described in next slides
    • if new local patterns are induced, apply them to:
      • rescore existing ACs
      • create new ACs
    • update AC lattice, recompute BP
  • Terminate here if no instances are to be generated
    • output all AC ∈ BP (n-best paths supported)
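The best-path step can be sketched as a dynamic program over token positions. Candidates are hypothetical `(start, end, label, prob)` spans scored by log(prob); the fixed per-token background score `background_logp` is an assumption added so that uncovered tokens are not free, and is not taken from the slides.

```python
import math


def best_path(n_tokens, candidates, background_logp=math.log(0.1)):
    """candidates: list of (start, end, label, prob) with end exclusive, prob > 0."""
    by_end = {}
    for cand in candidates:
        by_end.setdefault(cand[1], []).append(cand)
    best = [0.0] + [-math.inf] * n_tokens   # best[i]: best score covering tokens [0, i)
    back = [None] * (n_tokens + 1)
    for i in range(1, n_tokens + 1):
        # Option 1: token i-1 is background.
        best[i] = best[i - 1] + background_logp
        back[i] = (i - 1, None)
        # Option 2: an attribute candidate ends at position i.
        for cand in by_end.get(i, []):
            score = best[cand[0]] + math.log(cand[3])
            if score > best[i]:
                best[i] = score
                back[i] = (cand[0], cand)
    path, i = [], n_tokens
    while i > 0:
        prev, cand = back[i]
        if cand is not None:
            path.append(cand)
        i = prev
    return best[n_tokens], path[::-1]
```

For three tokens such as "Washington , DC" with overlapping candidates, the program picks the single non-overlapping combination with the highest total score.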

The extraction process (3/5)
  • Generate Instance Candidates (ICs) bottom-up
    • triangular trellis used to store partial ICs
    • when scoring new ICs, only consider axioms and patterns that can already be applied to the IC. Validity is not required.
    • pruning parameters: absolute and relative beam size at each trellis node, maximum number of ACs that can be skipped, minimum IC probability

The extraction process (4/5)
  • IC generation: continued
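  • When a new IC is created, its P(IC) is computed from 2 components:

$$P_{AC}(IC) = \exp\left(\frac{\sum_{AC \in IC} \log P_{AC} + \sum_{AC_{skip}} \log P_{AC_{skip}}}{|IC|}\right)$$

where $|IC|$ is the member attribute count, $AC_{skip}$ is a non-member AC that is fully or partially inside the IC, and $P_{AC_{skip}}$ is the probability of that AC being a “false positive”.

$$P_{\Omega}(IC) = P(C \mid E \in \Omega_C)$$

where $\Omega_C$ is the set of evidence known for the class C, computed using the same probabilistic model as for ACs.

  • Scores are combined using the Prospector pseudo-Bayesian method:

$$P(IC) = \frac{P_{AC}(IC)\,P_{\Omega}(IC)}{P_{AC}(IC)\,P_{\Omega}(IC) + (1 - P_{AC}(IC))(1 - P_{\Omega}(IC))}$$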
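A small Python sketch of the two-component score. The `prospector` function is the standard Prospector pseudo-Bayesian combination of two probabilities; `p_ac_component` assumes a length-normalized sum of log-probabilities over member ACs and skipped ACs, which is a reconstruction and may differ from the authors' exact formula. Both function names are hypothetical.

```python
import math


def p_ac_component(member_probs, skipped_false_positive_probs):
    """Assumed form: exp of the log-probability sum, normalized by member count |IC|."""
    total = sum(math.log(p) for p in member_probs)
    total += sum(math.log(p) for p in skipped_false_positive_probs)
    return math.exp(total / len(member_probs))


def prospector(p1, p2):
    """Prospector pseudo-Bayesian combination of two probability estimates."""
    return p1 * p2 / (p1 * p2 + (1.0 - p1) * (1.0 - p2))


def p_ic(member_probs, skipped_false_positive_probs, p_class_evidence):
    return prospector(
        p_ac_component(member_probs, skipped_false_positive_probs),
        p_class_evidence,
    )
```

The Prospector combination leaves a score unchanged when the other component is 0.5 and pushes it toward 1 when both components agree above 0.5.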

The extraction process (5/5)
  • Insert valid ICs into AC lattice
    • Valid ICs were assembled during IC generation phase
    • Score of a valid IC reflects all extraction evidence of its class
    • All unpruned valid ICs are inserted into the AC lattice along with their scores
  • The best path BP is calculated through the IC+AC lattice (n-best supported)
    • the search algorithm allows constraints to be defined over the extracted path(s)
      • e.g. min/max count of extracted instances
    • output all ACs and ICs on BP

[Figure: AC lattice with an inserted instance candidate IC1]

Extraction evidence based on formatting
  • A simple wrapper induction algorithm
    • identify formatting regularities
    • turn them into “local” context patterns to boost contained ACs
  • Assemble distinct formatting subtrees rooted at block elements containing ACs from the best path BP currently determined by the system
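  • For each subtree S and attribute Att, calculate the support C(S,Att), i.e. the number of occurrences of S that contain an AC of Att from BP, and the precision prec(Att|S) = C(S,Att) / C(S), where C(S) is the number of all occurrences of S in the document
  • If both C(S,Att) and prec(Att|S) reach defined thresholds, a new local context pattern is created with its precision set to prec(Att|S) and its recall close to 0 (in order not to harm potential singleton ACs).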

[Figure: a formatting tree (TD containing B and A_href) learned using known names like “John Doe” (jdoe@web.ca) and applied to unknown names such as “Argentina Agosto” (aa@web.br)]
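The induction step can be sketched as simple counting. Here each text node is reduced to a hypothetical formatting path string such as "TD/B" instead of a full subtree, and `best_path_acs` maps already recognized texts to attribute names; the thresholds, the function name and the extra name "Jane Roe" are invented for the example.

```python
from collections import Counter


def induce_local_patterns(nodes, best_path_acs, min_count=2, min_prec=0.5):
    """nodes: list of (formatting_path, text); best_path_acs: {text: attribute}."""
    occurrences = Counter(path for path, _ in nodes)   # C(S)
    support = Counter()                                # C(S, Att)
    for path, text in nodes:
        att = best_path_acs.get(text)
        if att is not None:
            support[(path, att)] += 1
    patterns = {}
    for (path, att), c in support.items():
        prec = c / occurrences[path]                   # prec(Att | S)
        if c >= min_count and prec >= min_prec:
            patterns[(path, att)] = prec
    return patterns
```

A pattern induced for ("TD/B", "name") would then boost every text found under the same formatting path, including names the ontology did not recognize by itself.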

Agenda
  • Example applications of Web IE
  • Difficulties in practical applications
  • Extraction Ontologies
  • Extraction process
  • Experimental results
  • Future work and Conclusion

Experimental results: Seminar announcements
  • 485 English seminar announcement text documents
  • Manual: extraction ontology created based on seeing 40 randomly chosen documents, evaluated using remaining 445
  • Manual+CRF: same extraction ontology equipped with a CRF classifier used as further extraction evidence. 10-fold cross-validation using test set above

Cost of the IE system: Seminar announcements
  • Creation of extraction ontology: 1-2 person weeks
    • annotate 40 training documents (expect 1-2 days)
    • inspecting examples in 40 documents
    • writing patterns, axioms, iterating
  • Training inductive model in addition to ex. ontology
    • 2-3 person weeks to annotate training data (445 docs)
    • F-measure improves by 2 to 6%
  • ex. ontologies allow for fast & flexible prototyping (annotation design changes quickly reflected)
  • then, for parts of the ex. ontology that need accuracy improvement, obtain more training data & reuse as features all manual extraction evidence already provided

Experimental results: Contact information
  • 109 English contact pages, 200 Spanish, 108 Czech
  • Named entity counts: 7000, 5000 and 11000, respectively; instances were not labeled
  • Only domain expert’s evidence and formatting pattern induction were used
  • Domain expert saw 30 randomly chosen documents, the rest was test data
  • Instance extraction done but not evaluated

Instance grouping

  • Vilain score F = 60-70%
  • Vilain recall = % of correct links recovered
  • Vilain precision = % of recovered links that are correct
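The link-based score can be sketched as follows, using the formulation of Vilain et al. in which each group of size n counts as n - 1 links. Gold and extracted groupings are given as lists of sets of item identifiers; the function names are hypothetical and this is an illustration, not the evaluation code used for the reported numbers.

```python
def _link_score(key_sets, response_sets):
    """Fraction of the links in key_sets that survive in response_sets."""
    num = den = 0
    for s in key_sets:
        parts = set()
        unaligned = 0
        for item in s:
            idx = next((i for i, r in enumerate(response_sets) if item in r), None)
            if idx is None:
                unaligned += 1      # item missing from the response: its own part
            else:
                parts.add(idx)
        num += len(s) - (len(parts) + unaligned)
        den += len(s) - 1
    return num / den if den else 0.0


def vilain(key_sets, response_sets):
    recall = _link_score(key_sets, response_sets)
    precision = _link_score(response_sets, key_sets)
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f
```

If a gold group of three items is split into two extracted groups, one of its two links is lost, so recall is 0.5 while precision stays at 1.0.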

Experimental results: Bicycle descriptions
  • Hidden Markov Model
  • Trigram, naive topology
  • 103 labeled web pages, 12346 named entities
  • Instances not labeled; instance extraction done but not evaluated
  • Single HMM for all extracted types:
    • 1 Background state
    • 1 Target, 1 Prefix and 1 Suffix state type for each extracted slot
    • 1 + 3N states in total for N extracted slots

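[Figure: HMM topology with a background state B and prefix (P), target (T) and suffix (S) states per slot: P, T, S, P’, T’, S’, ...]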
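The state inventory can be sketched in a few lines. The state naming scheme and the set of allowed transitions below are assumptions made for illustration; the slides only state that there is one background state plus prefix, target and suffix states per slot, and do not spell out the transition structure of the "naive topology".

```python
def build_states(slots):
    """One background state plus prefix, target and suffix states per slot."""
    states = ["B"]
    for slot in slots:
        states += [f"P_{slot}", f"T_{slot}", f"S_{slot}"]
    return states


def build_transitions(slots):
    """Assumed topology: B -> prefix -> target -> suffix -> B, with self-loops."""
    allowed = {("B", "B")}
    for slot in slots:
        p, t, s = f"P_{slot}", f"T_{slot}", f"S_{slot}"
        allowed |= {("B", p), (p, p), (p, t), (t, t), (t, s), (s, s), (s, "B")}
    return allowed
```

For N slots, `build_states` returns 1 + 3N states, matching the count given on the slide.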

Bicycle structured search interface

Future work
  • Attempt to improve a seed extraction ontology by bootstrapping using relevant pages retrieved from the Internet
  • Adapt the structure of extraction ontology according to data
    • e.g. add new attributes to represent product features

Conclusions
  • Tool+tutorial available
    • http://eso.vse.cz/~labsky/ex/
  • Presented an extraction ontology approach to
    • allow for fast prototyping of IE applications
    • accommodate extraction schema changes easily
    • utilize all available forms of extraction knowledge
      • domain expert’s knowledge
      • training data
      • formatting regularities found in web pages
  • Results
    • indicate that extraction ontologies can serve as a quick prototyping tool
    • accuracy of the prototyped ontology can be improved when training data become available

Acknowledgements
  • The research was partially supported by the EC under contract FP6-027026, Knowledge Space of Semantic Inference for Automatic Annotation and Retrieval of Multimedia Content: K-Space.
  • The medical website application is carried out in the context of the EC-funded (DG-SANCO) project MedIEQ.
