
Semi-Automatic Content Extraction from Specifications

Krishnaprasad Thirunarayan

Department of Computer Science & Engineering

Wright State University

Aaron Berkovich and Dan Sokol

Cohesia Corporation

Extraction: Summarize in a prescribed vocabulary

[Diagram labels: Spec: Text; Spec: SDR; Domain Library]

Participants
  • Sponsor: National Science Foundation
      • SBIR: Phase I and Phase II
  • Industry: Cohesia Corporation
      • Developer of (B2B) content and lower-level infrastructure
  • University: Wright State University
      • User-level tools: conceptualization and design
  • Others: Geometric Software Solutions, …
      • Tool/Product development and integration
Outline
  • Background and Goal (What?)
  • Motivation (Why?)
  • Details (How?)
  • Conclusions
Manual Content Extraction
  • Input:
      • Paper-based specifications of a manufacturing task describing composition, processing, and testing of materials
      • Additional constraints imposed by customers and vendors
      • Appropriate Ontology and Domain Library defining standard vocabulary
  • Output:
      • An “equivalent” formalized description of specs in Specification Definition Representation (SDR)
  • Observation:
      • Specs originating from a common source (ASTM, SAE, GE) share vocabulary and structure.
      • An experienced extractor exploits the linguistic patterns found in specs to interpret them.
Assistance for Extraction

[Diagram labels: Document (Paper); Document (Text); Mark-Up Editor (Wizard); Document (SDR); Document Proofer]

Semi-automatic Content Extraction
  • Starting from an electronic version of a spec, develop a strategy for semantic markup, to assist in creating an “equivalent” SDR.
      • Semantic Markup: The task of overlaying an abstract syntax (“the essence”) on the “free-form” text.
          • Spec: Human-sensible
          • Mark-up: Computer-sensible
  • Automate routine mechanical tasks.
Semantic Mark-up

[Diagram labels: Spec Name; Spec Title; Revision; Revision Date; Procedure; Qualifier; Characteristic; Values; Value]

Ontology
  • (Gruber)
    • An ontology is an explicit specification of a conceptualization, which is an abstract, simplified view of the world that we wish to represent for some purpose.
SDL Ontology

[Diagram: entities Domain Library, Document, Revision, Reference, Procedure, Layer, Characteristic, and Value, linked with the multiplicities “1 or many”, “0, 1 or many”, and “Ref: 0, 1 or many”]
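
One possible reading of the ontology diagram can be sketched as Python dataclasses. The containment chain (Document, Revision, Procedure, Layer, Characteristic, Value), the placement of the "1 or many" constraint on revisions, and every field name below are assumptions inferred from the diagram labels; this is not the actual SDL schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Value:
    text: str


@dataclass
class Characteristic:
    name: str  # ideally a Domain Library term
    values: List[Value] = field(default_factory=list)
    references: List[str] = field(default_factory=list)  # Ref: 0, 1 or many


@dataclass
class Layer:
    name: str
    characteristics: List[Characteristic] = field(default_factory=list)
    references: List[str] = field(default_factory=list)


@dataclass
class Procedure:
    name: str
    layers: List[Layer] = field(default_factory=list)
    references: List[str] = field(default_factory=list)


@dataclass
class Revision:
    label: str
    procedures: List[Procedure] = field(default_factory=list)


@dataclass
class Document:
    name: str
    title: str
    revisions: List[Revision]

    def __post_init__(self):
        # Assumed meaning of "1 or many": a document has at least one revision.
        if not self.revisions:
            raise ValueError("a Document needs one or more Revisions")


# Illustrative, made-up content.
sulphur = Characteristic(name="Sulphur", values=[Value("< 2.0%")])
chemistry = Procedure(name="Chemistry", layers=[Layer("default", [sulphur])])
doc = Document("EXAMPLE-1", "Example spec", [Revision("A", [chemistry])])
```

Nesting the lists this way makes the "0, 1 or many" relations the default (empty lists), while the one mandatory relation is checked at construction time.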


Extraction: Spec to SDR

[Diagram labels: Spec: Text; Spec: SDR]

Fundamental Obstacles
  • The relation between the spec and its SDR rendition is “not linear”.
      • Same spec information duplicated in SDR in different contexts.
      • Contiguous block of information in SDR spread out in spec.
  • Equivalence of phrases hard to formalize.
  • Tables and footnotes abbreviate information in irregular and complicated ways.
Linearizing through Abstraction: Introducing Specification Definition Language

[Diagram labels: Original Spec; SDL; SDR; Manual (original); Manual (Ph-I); Compiled (Ph-I); Literal, Integrated, Semi-automatic (Ph-II)]

Original AMS-4976 spec is 8 pages. Its SDL equivalent is 15 pages.

Original AMS-5662J spec is 11 pages. Its SDR equivalent is 30 pages.

Business Background (Supply Chain)

[Diagram labels: Engine; Forger; Metal; Drawing; Spec]

Quality Issues
  • Transcription Errors
      • From spec to hand-written sheet to computer
  • Completeness
      • Info in spec but missing in SDR
  • Soundness
      • Info in SDR but not in spec
  • Uniformity of Form
  • Uniformity in Interpretation
      • Different understanding of the meaning while mapping to SDR (Ambiguity/Inconsistency)
Efficiency Issues
  • Minimize time/effort required.
    • Automate routine mechanizable tasks
      • Eliminate “cut-paste-modify” cycle
  • Minimize duplication of information.
    • Concise representation
      • Size of translation = O(Size of spec).
    • Update consistency
  • Flexible rendition into various external forms.
Essence of our Approach: Literal Translation
  • Conceptually, every piece of info in SDR owes its existence to phrases in spec.
  • Enable maintenance of correspondence between spec and its translation, and attempt to embed the translation into spec.
    • Requires compilation into SDL/SDR.
    • Cf. XML/XSL Technology
  • A semi-automatic approach is feasible only if the partially generated translations (annotations) are intelligible to an extractor in the context of the original spec and are systematically extensible.
      • Note that current manual extractions into SDL are not literal, even though SDL enables it to an extent.
SDL Studio and its Extension
  • SDL Studio enables the creation and editing of SDL documents. It has facilities to search the domain library and to compile SDL into an equivalent SDR.
  • This can be further enriched using
      • Improved Domain Library Search
      • Extraction and composition of SDL fragments
      • Providing templates for commonly occurring “procedures”
      • Table processor
      • etc …
Domain Library
  • Currently, it contains technical phrases pertinent to materials and processing requirements
  • Cohesia creates and maintains DLs for in-house use and for use by its clients such as GE, Alcoa, Allvac, etc.
  • Typical size: 10,000 phrases
Improving Domain Library Search
  • Goal: Mapping “equivalent” phrases to same Domain Library Term
  • Uses:
    • Techniques for prefix removal, stemming, and dealing with other variations for root recognition
    • Stop words elimination
    • Abbreviation expander and alias normalization
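
The normalization steps listed above could be combined roughly as follows. This is a minimal sketch: the stop-word list, the abbreviation table, and the prefix and suffix tables are hypothetical placeholders, not the contents of any real Domain Library tooling. The Columbium/Niobium alias is a well-known naming equivalence.

```python
import re

# All tables below are illustrative assumptions.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}
ABBREVIATIONS = {"max": "maximum", "min": "minimum", "temp": "temperature"}
ALIASES = {"columbium": "niobium"}
PREFIXES = ("pre", "non", "re")
SUFFIXES = ("ments", "ment", "ings", "ing", "ed", "es", "s")
MIN_ROOT = 4  # never strip an affix if fewer letters than this would remain


def root(word):
    """Strip at most one known prefix and one known suffix."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_ROOT:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_ROOT:
            word = word[:-len(suffix)]
            break
    return word


def normalize_phrase(phrase):
    """Lower-case, drop stop words, expand abbreviations and aliases, reduce to roots."""
    roots = []
    for word in re.findall(r"[a-z0-9]+", phrase.lower()):
        if word in STOP_WORDS:
            continue
        word = ABBREVIATIONS.get(word, word)
        word = ALIASES.get(word, word)
        roots.append(root(word))
    return tuple(roots)
```

With these tables, `normalize_phrase("Heat Treatment of the Forgings")` and `normalize_phrase("heat-treating forging")` produce the same tuple, so both phrases would map to the same Domain Library term.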
Algorithm Sketch

List[Phrase] dl;
Phrase ip; Int mt;
List[Word] dlwm, inwm; % with back references
List[Phrase] dlts;

begin
  dl := readAndBuildDomainLibrary();
  dlwm := buildWordMapAndBackLinks(dl); % delete stop words, link words to DLTs
  (ip,mt) := readInputPhraseAndMatchThreshold();
  inwm := buildWordMap(ip);
  dlts := buildDLTsListContainingMatchedWords(dlwm,inwm);
  dlts := evaluateAndFilterDLTs(dlts,mt);
end;
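
The sketch above can be approximated in Python with an inverted index from words to Domain Library Terms (DLTs). The slides do not say how candidate DLTs are evaluated, so the score used here (percentage of a DLT's words found in the input) is an assumption, as are the stop-word list and the function names. Words are compared by exact equality after lower-casing, which is simpler than the graded word matching the slides describe.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "be", "of", "shall", "the", "to"}  # hypothetical


def words_of(phrase):
    """Lower-cased words of a phrase with stop words removed."""
    return [w for w in re.findall(r"[a-z0-9]+", phrase.lower())
            if w not in STOP_WORDS]


def build_word_map(domain_library):
    """Map each word to the indices of the DLTs containing it (the back links)."""
    word_map = defaultdict(set)
    for index, dlt in enumerate(domain_library):
        for word in words_of(dlt):
            word_map[word].add(index)
    return word_map


def find_dlts(domain_library, word_map, input_phrase, match_threshold):
    """Return (DLT, score) pairs whose score reaches the threshold, best first."""
    input_words = set(words_of(input_phrase))
    candidates = set()
    for word in input_words:
        candidates |= word_map.get(word, set())
    results = []
    for index in sorted(candidates):
        dlt_words = set(words_of(domain_library[index]))
        # Assumed scoring rule: share of the DLT's words present in the input.
        score = 100 * len(dlt_words & input_words) // len(dlt_words)
        if score >= match_threshold:
            results.append((domain_library[index], score))
    results.sort(key=lambda pair: -pair[1])
    return results


library = ["tensile strength", "heat treatment", "melt method", "yield strength"]
index = build_word_map(library)
phrase = "Strength, tensile, shall be tested after heat treatment"
```

Because scoring looks only at word sets, one input phrase can yield several DLTs, and a DLT's words need not be contiguous or in order in the input. Lowering the threshold trades precision for recall.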

Matching words

Int wordMatch(w1,w2)
begin
  % normalized = vowels deleted, i.e., only consonants present
  if caseUniformAndCleanedMatch(w1,w2)
    return 100;
  if normalizedMatch(w1,w2)
    return 90;
  if orderedNormalizedMatch(w1,w2)
    return 70;
  % analyze for differences due to prefix and suffix
  if normalizedDifferenceInPrefixSuffixTables(w1,w2)
    return 90;
  return 0; % no match
end;
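
A runnable approximation of `wordMatch` might look like the following. Several details are assumptions because the slides only name the helper predicates: "ordered normalized match" is interpreted here as equality of the sorted consonant sequences (to tolerate transposed letters), the prefix and suffix tables are hypothetical, and a score of 0 is returned when nothing matches.

```python
import re

VOWELS = set("aeiou")
PREFIXES = ("re", "pre", "non", "un")               # hypothetical table
SUFFIXES = ("ing", "ed", "es", "s", "ment", "ion")  # hypothetical table


def clean(word):
    """Lower-case and keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", word.lower())


def normalize(word):
    """Delete vowels so that only consonants (and digits) remain."""
    return "".join(c for c in clean(word) if c not in VOWELS)


NORM_PREFIXES = {normalize(p) for p in PREFIXES}
NORM_SUFFIXES = {normalize(s) for s in SUFFIXES}


def differs_only_by_affix(n1, n2):
    """True if the longer normalized word is the shorter one plus a known affix."""
    short, long_ = sorted((n1, n2), key=len)
    if not short or len(short) == len(long_):
        return False
    if long_.startswith(short) and long_[len(short):] in NORM_SUFFIXES:
        return True
    if long_.endswith(short) and long_[:len(long_) - len(short)] in NORM_PREFIXES:
        return True
    return False


def word_match(w1, w2):
    """Score the similarity of two words: 100, 90, 70, or 0."""
    c1, c2 = clean(w1), clean(w2)
    if not c1 or not c2:
        return 0
    if c1 == c2:
        return 100
    n1, n2 = normalize(w1), normalize(w2)
    if n1 and n1 == n2:
        return 90
    if n1 and sorted(n1) == sorted(n2):
        return 70
    if differs_only_by_affix(n1, n2):
        return 90
    return 0
```

Under these assumptions, "gage" and "gauge" score 90 (same consonants), "tensile" and "tesnile" score 70 (transposed consonants), and "forging" and "forge" score 90 (the difference is a known suffix).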

Design Rationale
  • Input phrase may contain multiple DLTs.
  • DLT words may not appear contiguous in input.
  • Consonants are significant, and "correct" spellings may differ in vowels.
  • Robustness with respect to spelling errors such as transposition of letters or missing vowels.
  • Stemmers do not work satisfactorily for words appearing in DLTs. Instead, create tables customized to deal with the prefixes and suffixes that arise in practice, and normalize dynamically.
  • Err on the side of recall rather than precision.
  • Number of words < Number of DLTs
Overall Approach
  • Preprocessing: Obtain spec in plain text form (from MSWord format).
      • This is a practical alternative to scanning and OCR-ing a paper-based spec.
      • Saving it in HTML format has the benefit of isolating tables. On the downside, it retains formatting tags.
  • Semi-Automatic Extraction: Recognize phrases in spec text that are associated with a requirement and generate SDL fragments to assist an extractor.
Two Possible Avenues (From Document to SDL)
  • Iteratively annotate the document text with XML tags reflecting the SDL structure and ontology.
    • Generate various views of the document and SDL from this single XML Master.
  • Iteratively generate a sequence of progressively more detailed SDL documents from the spec text.
First Avenue: Via XML
  • Semi-automatic extraction is accomplished in two phases:
    • Initial automatic markup phase: Systematically recognize domain library terms in spec text and add suitable XML annotations. Then generate a first-cut SDL fragment.
    • Subsequent manual conversion phase: Extractor organizes the information and completes the translation into an equivalent SDL.
      • Further steps: As the tool matures, automation can be attempted to produce more detailed extractions.
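
The automatic markup phase can be illustrated with a small tagger that wraps recognized domain library terms in XML elements. This is only a sketch: the `<spec>` and `<dlt>` element names, the `term` attribute, and the sample sentence are invented for illustration, and matching is by exact (case-insensitive) phrase rather than the fuzzy matching described earlier.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def tag_dlts(text, domain_library):
    """Wrap each occurrence of a domain library term in a <dlt> element."""
    if not domain_library:
        return "<spec>" + escape(text) + "</spec>"
    canonical = {term.lower(): term for term in domain_library}
    # Try longer terms first so "Tensile Strength" wins over "Strength".
    terms = sorted(domain_library, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    pieces, position = [], 0
    for match in pattern.finditer(text):
        pieces.append(escape(text[position:match.start()]))
        term = canonical[match.group(1).lower()]
        pieces.append("<dlt term=%s>%s</dlt>"
                      % (quoteattr(term), escape(match.group(1))))
        position = match.end()
    pieces.append(escape(text[position:]))
    return "<spec>" + "".join(pieces) + "</spec>"


sample = "Tensile strength shall be < 130 ksi after heat treatment."
tagged = tag_dlts(sample, ["Tensile Strength", "Heat Treatment", "Strength"])
```

The original wording stays inside each element and the canonical term goes in an attribute, so the spec text can still be recovered from the annotated file.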
First Avenue: Via XML (cont’d)
  • Advantages:
    • Focus is on a single persistent XML Master that tries to maintain a link between the spec and the extractions.
    • All the processing is orchestrated on this XML file.
      • Implements various views of the XML source using XSL-FO and various transformations on the XML source using XSLT.
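
The idea of deriving several views from one XML master can be sketched without XSLT. The example below uses Python's `xml.etree.ElementTree` as a stand-in for the stylesheets the slides mention; the `<spec>` and `<dlt>` element names and the sample sentence are hypothetical.

```python
import xml.etree.ElementTree as ET

# A made-up annotated fragment standing in for the XML master.
MASTER = ('<spec>Material shall be <dlt term="Melt Method">vacuum melted</dlt> '
          'and given a <dlt term="Heat Treatment">heat treatment</dlt>.</spec>')


def original_text_view(master_xml):
    """Recover the plain spec text by dropping all markup."""
    return "".join(ET.fromstring(master_xml).itertext())


def dlt_view(master_xml):
    """List (domain library term, matched phrase) pairs in document order."""
    root = ET.fromstring(master_xml)
    return [(element.get("term"), element.text) for element in root.iter("dlt")]
```

Both views read the same master and never modify it, which is what keeps the link between the spec and its extraction intact.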
First Avenue: Via XML (cont’d)
  • Disadvantages:
    • There is a need to manage a separate SDL version to incorporate user inputs and corrections. This is because, even though it may be possible to represent SDL constructs using XML tags, it may not be possible to integrate user edits literally into the XML source.
Semantic-Markup Algorithm
  • Insert Structure Tags
  • Insert Ontology Tags
  • Infer Missing Char.
  • Group Char. & Values
  • Group C-Vs into Procedures
Functional Components

Text file → Structure Tagger → XML file → DLT Tagger (uses Domain Library) → XML file → Group Tagger → XML file → SDL Converter → SDL file

Tagging and Transforming
  • flex structTagger.flex
  • gcc lex.yy.c -lfl
  • a < GE.txt > GE.xml
  • java org.apache.xalan.xslt.Process -in GE.xml -xsl CSDLStylesheet.xsl -out GE.sdl
  • java org.apache.xalan.xslt.Process -in GE.xml -xsl CExpSDLStylesheet.xsl -out GE.exp.sdl
  • java org.apache.xalan.xslt.Process -in GE.xml -xsl OriginalStylesheet.xsl -out GE.org.txt
Second Avenue: SDL all along
  • As there is no obvious way of incorporating SDL edits into the XML source in general, try to generate legal SDL at different levels of detail all along.
  • Advantage: Yields SDL documents that can be immediately used in Spec Studio and extended by an extractor.
  • Disadvantage: This form does not retain correspondence with the original document explicitly.
Extraction Tool – Prototype Operation
  • Views: In the context of Spec
      • Plain text view
      • Text view with “requirement” phrases color coded and highlighted
      • View of domain library terms found in the spec
  • Views: In the context of SDL
      • Spec identity view + Large Note: Method D Extraction
      • Method C Extraction
      • Procedure view
      • Characteristic-value pair view

Additional Standalone Tools
  • Domain Library Browser
    • Given a word or a phrase, display all the domain library information related to it.
  • SDL Fragment Generator
    • Given a sentence, generate an SDL fragment that captures its essence.

These tools can assist an extractor in composing an SDL document incrementally.

Longer-term Vision
  • Marketplace continues to confirm the need for tools to capture the semantic interpretation of document content
  • Cohesia plans to productize the results of the research into a viable commercial product
Example Engineering Tasks
  • How to express and represent templates for well-known “procedures”?
    • Alternative to cut-paste-modify cycle
      • Tensile Test
      • Heat Treatment
      • Melt Method
      • Chemistry
      • Packaging
  • How to express and represent heterogeneous tables and non-trivial footnotes in a spec in a convenient and uniform way?
  • How to create, manipulate, and store specs in SDR and SDL among other forms and maintain interoperability?
Example Research Questions
  • What are the forms of extraction rules?
    • Phrase pattern matching
    • Theory of equivalence/subsumption
      • Example: Aliases / Equivalent Phrases
        • Creep = Plastic Strain
        • Delivery Condition = Surface Finish
        • Cause for Rejection = Rejection Criteria
        • Imperfections detrimental to usage of product = Free of injurious defects
    • Rules for interpreting “logic words”
      • Connectives: and, or, …
      • Quantifiers: all, every, each, …
      • Modifiers: over, under, more, less, …
      • Negation: not, no, unless, except, “free of” ...
      • Mismatch?
        • A, B, and C => {A,B,C} union/OR-logic
      • Distributive Laws?
        • Lot and order number => lot number and order number
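
Two of the rewrite rules above can be sketched as simple pattern-based functions. This is an illustration only: the regular expressions handle just the exact shapes shown in the examples, and the function names are invented. Real spec text would need checks against over-application, since not every "A and B C" phrase shares its head word.

```python
import re


def split_list(phrase):
    """Turn 'A, B, and C' (or 'A or B') into the list of its items."""
    parts = re.split(r",\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+", phrase.strip())
    return [part for part in parts if part]


def distribute_head(phrase):
    """Rewrite 'A and B head' as 'A head and B head'; leave other shapes alone."""
    match = re.fullmatch(r"(\w+) (and|or) (\w+) (\w+)", phrase.strip())
    if not match:
        return phrase
    first, connective, second, head = match.groups()
    return f"{first} {head} {connective} {second} {head}"
```

For example, `distribute_head("Lot and order number")` gives `"Lot number and order number"`, and `split_list("A, B, and C")` gives `["A", "B", "C"]`.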
Another Example Scenario

Buyers’ Purchase Order          Sellers’ Inventory
Melt Atmosphere = Inert Gas     Melt Atmosphere = Argon
Sulphur < 2.0%                  Sulphur < 1.7%
Niobium < 0.5%                  Columbium < 0.2%

Match?
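
One way to approach the matching question is to combine an alias table, an is-a table, and a comparison of numeric bounds. The sketch below is an assumption about how such a check could work, not the method used in the project. The data layout and function names are invented; the two table entries rely on general facts (columbium is an older name for niobium, and argon is an inert gas).

```python
ALIASES = {"columbium": "niobium"}  # two names for the same element
IS_A = {"argon": "inert gas"}       # specific term -> more general term


def canon(name):
    """Lower-case a term and replace it with its preferred alias."""
    key = name.strip().lower()
    return ALIASES.get(key, key)


def satisfies(offered, required):
    """True if every requirement is met by the offered characteristics.

    Both arguments map a characteristic name to (operator, value), where the
    operator is "=" (value is a term) or "<" (value is an upper bound).
    """
    offered = {canon(name): constraint for name, constraint in offered.items()}
    for name, (op, wanted) in required.items():
        if canon(name) not in offered:
            return False
        have_op, have = offered[canon(name)]
        if op == "=":
            if have_op != "=":
                return False
            have_term, wanted_term = canon(have), canon(wanted)
            if have_term != wanted_term and IS_A.get(have_term) != wanted_term:
                return False
        elif op == "<":
            # The offered upper bound must be at least as tight as required.
            if have_op != "<" or have > wanted:
                return False
        else:
            raise ValueError("unsupported operator: %r" % op)
    return True


purchase_order = {"Melt Atmosphere": ("=", "Inert Gas"),
                  "Sulphur": ("<", 2.0), "Niobium": ("<", 0.5)}
inventory = {"Melt Atmosphere": ("=", "Argon"),
             "Sulphur": ("<", 1.7), "Columbium": ("<", 0.2)}
```

Under these tables the inventory item satisfies the purchase order, but not the other way round: the check is a subsumption test rather than an equality test, which is why direction matters.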

  • What are the strategies for searching and matching?
    • Top-down: Template-driven expectations
    • Bottom-up: Identifying requirements present
    • Closure: Manual addition / modification / disambiguation
  • Relevant Information Extraction Research and Technologies
    • References
      • Message Understanding Conferences.
      • Work on NLP and IE at UMass, NYU, SRI, etc.
      • Search and Filtering tools.

[Diagram: path from a paper spec to SDR. Labels: Before; NSF SBIR Phase I; NSF SBIR Phase II; Spec Text on Paper; Read, Interpret, & Type; Paper Scanning; Spec Text as Electronic Image; Optical Character Recognition; Spec Text in HTML/XML; Extraction Wizard; SDL (XML); SDL Editor; SDL Compiler; SDR]