Semi-Automatic Content Extraction from Specifications

Krishnaprasad Thirunarayan, Department of Computer Science & Engineering, Wright State University • Aaron Berkovich and Dan Sokol, Cohesia Corporation

Extraction: Summarize in a prescribed vocabulary • [Diagram labels: Spec: Text; Spec: SDR; Domain Library]

Participants • Sponsor: National Science Foundation • SBIR: Phase I and Phase II • Industry: Cohesia Corporation • Developer of (B2B) content and lower-level infrastructure • University: Wright State University • User-level tools: conceptualization and design • Others: Geometric Software Solutions, … • Tool/Product development and integration

Outline • Background and Goal (What?) • Motivation (Why?) • Details (How?) • Conclusions

Background and Goal

Manual Content Extraction • Input: • Paper-based specifications of a manufacturing task describing composition, processing, and testing of materials • Additional constraints imposed by customers and vendors • Appropriate Ontology and Domain Library defining standard vocabulary

• Output: • An “equivalent” formalized description of specs in Specification Definition Representation (SDR) • Observation: • Specs originating from a common source (ASTM, SAE, GE) share vocabulary and structure. • An experienced extractor exploits the linguistic patterns found in specs to interpret them.

Assistance for Extraction • [Diagram labels: Document (Paper); Document (Text); Mark-Up Editor (Wizard); Document (SDR); Document Proofer; original]

Semi-automatic Content Extraction • Starting from an electronic version of a spec, develop a strategy for semantic markup, to assist in creating an “equivalent” SDR. • Semantic Markup: The task of overlaying an abstract syntax (“the essence”) on the “free-form” text. • Spec: Human-sensible • Mark-up: Computer-sensible • Automate routine mechanical tasks.

Semantic Mark-up • [Diagram labels: Spec Name; Spec Title; Revision; Procedure; Revision Date; Qualifier; Characteristic; Values; Value]

Ontology • (Gruber) • An ontology is an explicit specification of a conceptualization, which is an abstract, simplified view of the world that we wish to represent for some purpose.

SDL Ontology • [Diagram relating Document, Domain Library, Revision, Reference, Procedure, Layer, Characteristic, and Value; links are labeled with the cardinalities “1 or many”, “0, 1 or many”, and “Ref: 0, 1 or many”]

Extraction: Spec to SDR • [Diagram labels: Spec: Text; Spec: SDR]

Fundamental Obstacles • The relation between the spec and its SDR rendition is “not linear”. • Same spec information duplicated in SDR in different contexts. • Contiguous block of information in SDR spread out in spec. • Equivalence of phrases hard to formalize. • Tables and footnotes abbreviate information in irregular and complicated ways.

Linearizing through Abstraction: Introducing Specification Definition Language • [Diagram labels: Original Spec; SDL; SDR; Manual (original); Manual (Ph-I); Compiled (Ph-I); Literal, Integrated, Semi-automatic (Ph-II)] • Original AMS-4976 spec is 8 pages. Its SDL equivalent is 15 pages. • Original AMS-5662J spec is 11 pages. Its SDR equivalent is 30 pages.

Introducing Extraction Wizard

Motivation (Why?)

Business Background (Supply Chain) • [Diagram labels: Engine; Forger; Metal; Drawing (twice); Spec (twice)]

Diverse and Large number of specs and spec users

Quality Issues • Transcription Errors • From spec to hand-written sheet to computer • Completeness • Info in spec but missing in SDR • Soundness • Info in SDR but not in spec • Uniformity of Form • Uniformity in Interpretation • Different understanding of the meaning while mapping to SDR (Ambiguity/Inconsistency)

Efficiency Issues • Minimize time/effort required. • Automate routine mechanizable tasks • Eliminate “cut-paste-modify” cycle • Minimize duplication of information. • Concise representation • Size of translation = O(Size of spec). • Update consistency • Flexible rendition into various external forms.

Details (How?)

Essence of our Approach : Literal Translation • Conceptually, every piece of info in SDR owes its existence to phrases in spec. • Enable maintenance of correspondence between spec and its translation, and attempt to embed the translation into spec. • Requires compilation into SDL/SDR. • Cf. XML/XSL Technology

A semi-automatic approach is feasible only if the partially generated translations (annotations) are intelligible to an extractor in the context of the original spec, and are systematically extensible. • Note that current manual extractions into SDL are not literal, even though SDL enables this to an extent.

SDL Studio and its Extension • SDL Studio enables creation and editing of SDL documents. It has facilities to search the domain library and compile SDL into an equivalent SDR. • This can be further enriched using • Improved Domain Library Search • Extraction and composition of SDL fragments • Providing templates for commonly occurring “procedures” • Table processor • etc …

Domain Library Search Engine

Domain Library • Currently, it contains technical phrases pertinent to materials and processing requirements • Cohesia creates and maintains DLs for in-house use and for use by its clients such as GE, Alcoa, Allvac, etc. • Typical size: 10,000 phrases

Improving Domain Library Search • Goal: Mapping “equivalent” phrases to the same Domain Library Term (DLT) • Uses: • Techniques for prefix removal, stemming, and dealing with other variations for root recognition • Stop-word elimination • Abbreviation expander and alias normalization
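The normalization steps listed above can be sketched roughly as follows. This is only an illustration: the stop-word list, the abbreviation table, and the prefix/suffix tables are hypothetical placeholders (the slides do not give the actual entries), and the affix stripping is deliberately crude.

```python
import re

# Hypothetical tables; the actual entries are not given in the slides.
STOP_WORDS = {"the", "of", "a", "an", "and", "for", "in", "to", "shall", "be"}
ABBREVIATIONS = {"max": "maximum", "min": "minimum", "temp": "temperature"}
PREFIXES = ("re", "pre", "non")
SUFFIXES = ("ing", "ed", "es", "s")


def strip_affixes(word):
    """Remove at most one table prefix and one table suffix, keeping >= 3 letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    return word


def normalize_phrase(phrase):
    """Lowercase, expand abbreviations, drop stop words, and strip affixes."""
    roots = []
    for word in re.findall(r"[a-z0-9]+", phrase.lower()):
        expanded = ABBREVIATIONS.get(word, word)
        for part in expanded.split():
            if part not in STOP_WORDS:
                roots.append(strip_affixes(part))
    return roots
```

With these tables, "heat treating" and "Heat-treated" normalize to the same word list, which is the kind of equivalence the search is meant to recognize.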

Algorithm Sketch
  List[Phrase] dl; Phrase in; Int mt;
  List[Word] dlwm, inwm; % with back references
  List[Phrase] dlts;
  begin
    dl := readAndBuildDomainLibrary();
    dlwm := buildWordMapAndBackLinks(dl); % delete stop words, link words to DLTs
    (in,mt) := readInputPhraseAndMatchThreshold();
    inwm := buildWordMap(in);
    dlts := buildDLTsListContainingMatchedWords(dlwm,inwm);
    dlts := evaluateAndFilterDLTs(dlts,mt);
  end;
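A minimal Python sketch of the same flow is given below. It assumes exact word equality in place of the graded word matching, and it assumes a simple score (the percentage of a DLT's words found in the input phrase); neither the scoring formula nor the stop-word list comes from the slides.

```python
from collections import defaultdict

STOP_WORDS = {"the", "of", "a", "an", "and", "for", "in", "to"}  # hypothetical


def build_word_map(phrase):
    """Split a phrase into lowercase words and delete stop words."""
    return [w for w in phrase.lower().split() if w not in STOP_WORDS]


def build_word_map_and_back_links(dl):
    """Map each word to the set of DLTs that contain it (the back links)."""
    back_links = defaultdict(set)
    for dlt in dl:
        for word in build_word_map(dlt):
            back_links[word].add(dlt)
    return back_links


def search(dl, input_phrase, match_threshold):
    """Return DLTs whose assumed score meets the threshold, best first."""
    dlwm = build_word_map_and_back_links(dl)
    inwm = set(build_word_map(input_phrase))
    # Candidate DLTs: every DLT sharing at least one word with the input.
    candidates = set()
    for word in inwm:
        candidates |= dlwm.get(word, set())
    scored = []
    for dlt in candidates:
        dlt_words = build_word_map(dlt)
        matched = sum(1 for w in dlt_words if w in inwm)
        score = 100 * matched // len(dlt_words)  # assumed scoring rule
        if score >= match_threshold:
            scored.append((score, dlt))
    return [dlt for score, dlt in sorted(scored, key=lambda r: (-r[0], r[1]))]
```

Because candidates are collected through the word-to-DLT back links, only DLTs that share a word with the input are ever scored, and one input phrase can yield several DLTs.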

Matching words
  Int wordMatch(w1,w2)
  begin
    % normalized = vowels deleted, i.e., only consonants present
    if caseUniformAndCleanedMatch(w1,w2) return 100;
    if normalizedMatch(w1,w2) return 90;
    if orderedNormalizedMatch(w1,w2) return 70;
    % analyze for differences due to prefix and suffix
    if normalizedDifferenceInPrefixSuffixTables(w1,w2) return 90;
  end;
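One possible reading of this matcher in Python is sketched below. Several points are assumptions rather than facts from the slides: "ordered normalized match" is interpreted as equality of the sorted consonant strings (to tolerate transposed letters), the prefix and suffix tables are hypothetical, and a score of 0 is returned when no rule fires.

```python
VOWELS = set("aeiou")
PREFIXES = ("re", "pre", "non")       # hypothetical prefix table
SUFFIXES = ("ing", "ed", "es", "s")   # hypothetical suffix table


def clean(word):
    """Lowercase the word and drop punctuation."""
    return "".join(ch for ch in word.lower() if ch.isalnum())


def consonants(word):
    """Normalized form: the cleaned word with vowels deleted."""
    return "".join(ch for ch in clean(word) if ch not in VOWELS)


def differs_by_table_affix(c1, c2):
    """True if the longer normalized form is the shorter plus one table affix."""
    short, long_ = sorted((c1, c2), key=len)
    if not short or len(short) == len(long_):
        return False
    prefixes = [consonants(p) for p in PREFIXES]
    suffixes = [consonants(s) for s in SUFFIXES]
    return (any(p and long_ == p + short for p in prefixes)
            or any(s and long_ == short + s for s in suffixes))


def word_match(w1, w2):
    """Score two words in the same order of checks as the sketch above."""
    if clean(w1) == clean(w2):
        return 100
    c1, c2 = consonants(w1), consonants(w2)
    if c1 and c1 == c2:
        return 90
    if c1 and sorted(c1) == sorted(c2):  # assumed meaning of "ordered" match
        return 70
    if differs_by_table_affix(c1, c2):
        return 90
    return 0  # assumed fall-through score; not stated in the slides
```

For example, `word_match("strength", "strenght")` falls through to the sorted-consonant rule, while `word_match("annealed", "anneal")` is caught by the suffix table.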

Design Rationale • Input phrase may contain multiple DLTs. • DLT words may not appear contiguous in input. • Consonants are significant, and "correct" spellings may differ in vowels. • Robustness with respect to spelling errors such as transposition of letters or missing vowels. • Stemmers do not work satisfactorily for words appearing in DLTs. Instead, create tables customized to deal with prefixes and suffixes that arise in practice, and normalize dynamically. • Err on the side of recall rather than precision. • Number of words < Number of DLTs

Extraction Tool

Overall Approach • Preprocessing: Obtain spec in plain text form (from MSWord format). • This is a practical alternative to scanning and OCR-ing a paper-based spec. • Saving it in HTML format has the benefit of isolating tables. On the downside, it retains formatting tags. • Semi-Automatic Extraction: Recognize phrases in spec text that are associated with a requirement and generate SDL fragments to assist an extractor.

Two Possible Avenues (From Document to SDL) • Iteratively annotate the document text with XML tags reflecting the SDL structure and ontology. • Generate various views of the document and SDL from this single XML Master. • Iteratively generate a sequence of progressively detailed SDL documents from spec text.

First Avenue : Via XML • Semi-automatic extraction is accomplished in two phases: • Initial automatic markup phase: Systematically recognize domain library terms in spec text and add suitable XML annotations. Then generate a first-cut SDL fragment. • Subsequent manual conversion phase: Extractor organizes the information and completes the translation into an equivalent SDL. • Further steps: As the tool matures, automation can be attempted to produce more detailed extractions.
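The automatic markup phase might look roughly like the sketch below, which wraps recognized domain library terms in XML tags and then derives two simple views from the resulting master. The element names `spec` and `dlt` and the `term` attribute are hypothetical, matching is plain case-insensitive string matching rather than the graded search, and the real tool's tag vocabulary is not shown in the slides.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def markup(text, dlts):
    """Wrap each occurrence of a domain library term in a <dlt> element."""
    if not dlts:
        return "<spec>" + escape(text) + "</spec>"
    # Longest terms first so "tensile strength" wins over "strength".
    ordered = sorted(dlts, key=len, reverse=True)
    pattern = re.compile(
        "|".join(r"\b" + re.escape(t) + r"\b" for t in ordered), re.IGNORECASE)
    parts, pos = [], 0
    for m in pattern.finditer(text):
        parts.append(escape(text[pos:m.start()]))
        parts.append("<dlt term=%s>%s</dlt>"
                     % (quoteattr(m.group(0).lower()), escape(m.group(0))))
        pos = m.end()
    parts.append(escape(text[pos:]))
    return "<spec>" + "".join(parts) + "</spec>"


def found_terms(xml_master):
    """View 1: the domain library terms found, in document order."""
    return [e.get("term") for e in ET.fromstring(xml_master).iter("dlt")]


def original_text(xml_master):
    """View 2: the original spec text, recovered by dropping the tags."""
    return "".join(ET.fromstring(xml_master).itertext())
```

Since the tags only wrap existing text, the original spec can always be recovered from the annotated master, which is what keeps the link between the spec and its extraction.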

(cont’d) • Advantages: • Focus is on a single persistent XML Master that tries to maintain a link between the spec and the extractions. • All the processing is orchestrated on this XML file. • Implements various views of the XML source using XSL-FO and various transformations on the XML source using XSLT.

(cont’d) • Disadvantages: • There is a need to manage a separate SDL version to incorporate user inputs and corrections. This is because, even though it may be possible to represent SDL constructs using XML tags, it may not be possible to integrate user edits literally into the XML source.

Semantic-Markup Algorithm • Insert Structure Tags • Insert Ontology Tags • Infer Missing Char. • Group Char. & Values • Group C-Vs into Procedures
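The grouping steps can be illustrated with a small sketch over an already tagged token stream. The token kinds, the dictionary layout, and the handling of a missing characteristic (recorded as `None` for the extractor to fill in, with no actual inference) are all assumptions made for illustration; the slides do not describe the inference rules.

```python
def group(tokens):
    """Group (kind, text) tokens into procedures holding characteristic-value pairs.

    kind is one of "procedure", "characteristic", or "value" (hypothetical names).
    """
    procedures = []
    current_proc = None
    current_cv = None

    def ensure_procedure():
        # Open an unnamed procedure if none has been seen yet.
        nonlocal current_proc
        if current_proc is None:
            current_proc = {"procedure": None, "pairs": []}
            procedures.append(current_proc)

    for kind, text in tokens:
        if kind == "procedure":
            current_proc = {"procedure": text, "pairs": []}
            procedures.append(current_proc)
            current_cv = None
        elif kind == "characteristic":
            ensure_procedure()
            current_cv = {"characteristic": text, "values": []}
            current_proc["pairs"].append(current_cv)
        elif kind == "value":
            ensure_procedure()
            if current_cv is None:
                # Missing characteristic: leave a placeholder for the extractor.
                current_cv = {"characteristic": None, "values": []}
                current_proc["pairs"].append(current_cv)
            current_cv["values"].append(text)
    return procedures
```

A value that appears before any characteristic in its procedure ends up under a `None` placeholder, so the gap stays visible to the extractor instead of being silently dropped.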

Functional Components • Text file → Structure Tagger → XML file → DLT Tagger (with Domain Library) → XML file → Group Tagger → XML file → SDL Converter → SDL file

Tagging and Transforming
• flex structTagger.flex
• gcc lex.yy.c -lfl
• a < GE.txt > GE.xml
• java org.apache.xalan.xslt.Process -in GE.xml -xsl CSDLStylesheet.xsl -out GE.sdl
• …
• java org.apache.xalan.xslt.Process -in GE.xml -xsl CExpSDLStylesheet.xsl -out GE.exp.sdl
• java org.apache.xalan.xslt.Process -in GE.xml -xsl OriginalStylesheet.xsl -out GE.org.txt

Second Avenue: SDL all along • As there is no obvious way of incorporating SDL edits into the XML source in general, try to generate legal SDL at different levels of detail all along. • Advantage: Yields SDL documents that can be immediately used in Spec Studio and extended by an extractor. • Disadvantage: This form does not retain correspondence with the original document explicitly.

Extraction Tool – Prototype Operation

Views: In the context of Spec • Plain text view • Text view with “requirement” phrases color coded and highlighted • View of domain library terms found in the spec
Views: In the context of SDL • Spec identity view + Large Note: Method D Extraction • Method C Extraction • Procedure view • Characteristic-value pair view
