
On Embedding Machine-Processable Semantics into Documents

On Embedding Machine-Processable Semantics into Documents. Krishnaprasad Thirunarayan, Department of Computer Science & Engineering, Wright State University, Dayton, OH-45435, USA. Talk Outline: Background and Motivation (Why?), Goals (What?), Details (How?), Conclusions.



  1. On Embedding Machine-Processable Semantics into Documents Krishnaprasad Thirunarayan Department of Computer Science & Engineering Wright State University Dayton, OH-45435, USA

  2. Talk Outline • Background and Motivation (Why?) • Goals (What?) • Details (How?) • Conclusions

  3. Background and Motivation

  4. Content Extraction: Formalize doc, using controlled vocabulary (diagram labels: Heterogeneous Doc. Spec. Defn. Rep.)

  5. Problems with this approach to content extraction • Archiving the spec (for human comprehension) separately from its formalization is not conducive to traceability. • Manual extraction from the spec (from scratch) for each use is labor intensive, time consuming, and prone to typographical errors.

  6. Observation • Conceptually, every piece of information in an extraction owes its existence to a phrase in spec, and possibly, controlled vocabulary. • So, explore techniques to maintain correspondence between a spec fragment and its formalization.

  7. Goal

  8. General Problem • Embed domain-specific mark-up (annotations) into a human-sensible document • to make explicit the semantics of “content” text and complex data, and • to augment an interpretation in a modular fashion. • Document text: Human comprehensible • Semantic Mark-up: Machine processable

  9. Details (How?)

  10. Nature of Specs • Semi-structured • Heterogeneous • Text • Tables • Images • Constrained technical vocabulary • Available as MS Word document

  11. Pre-processing Spec • Abstract content from spec document by removing display-oriented information • Save text • Save tabular data, preserving grid layout • Retain links to images • … • Note: “Save As Text” option in MS Word is inadequate

  12. Heterogeneous Document

  13. XML generated by Majix

  14. ASCII Output

  15. Annotating Pre-processed Spec • Embedding Machine Processable Semantics • Recognizing and tagging text using controlled vocabulary • By-product of: Document Indexing and Semantic Search • Tagging tabular data to make explicit its semantics: Same grid layout, but different interpretation and dependencies based on headings • Explore: XML-based programming language Water for defining data and its behavior (semantics)
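
The step of recognizing and tagging controlled-vocabulary terms can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool used in the talk: the vocabulary entries, the `<term concept="...">` tag shape, and the function name `tag_terms` are all hypothetical.

```python
import re

# Hypothetical controlled vocabulary: surface phrase -> concept identifier.
VOCABULARY = {
    "tensile strength": "strength.tensile",
    "yield strength": "strength.yield",
    "thickness": "thickness",
}


def tag_terms(text, vocabulary=VOCABULARY):
    """Wrap each controlled-vocabulary phrase in an illustrative <term> tag."""
    # Try longer phrases first so multi-word terms win over their substrings.
    phrases = sorted(vocabulary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )

    def wrap(match):
        # Look up the concept case-insensitively, keep the original spelling.
        concept = vocabulary[match.group(1).lower()]
        return f'<term concept="{concept}">{match.group(1)}</term>'

    return pattern.sub(wrap, text)


tagged = tag_terms("Minimum Yield Strength depends on thickness.")
```

The same pass that inserts the tags also yields the positions of every recognized term, which is why indexing and semantic search fall out as a by-product.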

  16. Locating Controlled Vocabulary Terms

  17. Example Table

  18. Example of Tagged Table
      Thickness (mm)   Tensile Strength (ksi)   Yield Strength (ksi)
      table.<setHeading thickness strength.tensile strength.yield/>
      0.50 and under   165   155
      table.<addRow 0 0.50 165 155 />
      0.50 - 1.00      160   150
      table.<addRow 0.50 1.00 160 150 />
      1.00 - 1.50      155   145
      table.<addRow 1.00 1.50 155 145 />
      ...
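
As a rough sketch of how such annotations might be produced mechanically once the rows are available as data, the Python below formats `setHeading` and `addRow` tags in the shape shown on the slide. The helper names `heading_tag` and `add_row_tag` are hypothetical, and bounds are printed with two decimals, so the first lower bound appears as `0.00` rather than `0`.

```python
# Row data taken from the example table: (min thickness, max thickness,
# tensile strength, yield strength).
HEADING = ("thickness", "strength.tensile", "strength.yield")
ROWS = [
    (0.0, 0.50, 165, 155),
    (0.50, 1.00, 160, 150),
    (1.00, 1.50, 155, 145),
]


def heading_tag(columns):
    """Format the heading annotation (hypothetical helper)."""
    return "table.<setHeading " + " ".join(columns) + "/>"


def add_row_tag(smin, smax, tensile, yield_strength):
    """Format one row annotation (hypothetical helper)."""
    return f"table.<addRow {smin:.2f} {smax:.2f} {tensile} {yield_strength} />"


annotations = [heading_tag(HEADING)] + [add_row_tag(*row) for row in ROWS]
```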

  19. Example of Processing Code
      <defclass table rows=required=vector heading=optional=vector>
        <defmethod setHeading t=required ts=required ys=required>
          <set heading=<vector t ts ys/>/>
        </>
        <defmethod addRow smin smax ts ys>
          <set rows= table.rows.<insert <vector smin smax ts ys/>/>/>
        </>
        <defmethod computeYieldStrength> … </>
        <defmethod computeTensileStrength> … </>
        …
      </>

  20. (cont’d)
      <defclass table rows=required=vector heading=optional=vector>
        …
        <defmethod computeTensileStrength>
          <set temp=fluid.Thickness/>
          <set i=0/>
          <do>
            <until <and temp.<less table.rows.<get i/>.1/>
                        temp.<more_or_equal table.rows.<get i/>.0/> /> >
              table.rows.<get i/>.2
            </until>
            <set i=i.<plus 1/>/>
          </do>
        </>
      </>

  21. (cont’d)
      <defclass table rows=required=vector heading=optional=vector> … </>
      fluid.<set Thickness=0.60/>
      <try <set TensileStrength=table.<computeTensileStrength/>/>
           TensileStrength >
        "TABLE: out of range error occurred"
      </try>
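
For readers unfamiliar with Water, the lookup performed by the processing code above can be approximated in Python. This is a minimal sketch rather than a translation: the class and method names are hypothetical, the thickness is passed as an argument instead of being read from a fluid variable, and the out-of-range case is modeled with an exception.

```python
class OutOfRangeError(LookupError):
    """Raised when no table row covers the requested thickness."""


class StrengthTable:
    def __init__(self):
        self.heading = None
        self.rows = []

    def set_heading(self, t, ts, ys):
        self.heading = (t, ts, ys)

    def add_row(self, smin, smax, ts, ys):
        self.rows.append((smin, smax, ts, ys))

    def _find_row(self, thickness):
        # Same test as the Water loop: lower bound inclusive, upper exclusive.
        for row in self.rows:
            smin, smax = row[0], row[1]
            if smin <= thickness < smax:
                return row
        raise OutOfRangeError("TABLE: out of range error occurred")

    def compute_tensile_strength(self, thickness):
        return self._find_row(thickness)[2]

    def compute_yield_strength(self, thickness):
        return self._find_row(thickness)[3]


table = StrengthTable()
table.set_heading("thickness", "strength.tensile", "strength.yield")
table.add_row(0.0, 0.50, 165, 155)
table.add_row(0.50, 1.00, 160, 150)
table.add_row(1.00, 1.50, 155, 145)

try:
    tensile_strength = table.compute_tensile_strength(0.60)
except OutOfRangeError as error:
    tensile_strength = str(error)
```

With a thickness of 0.60 the second row matches, so the lookup yields a tensile strength of 160; a thickness outside every row takes the error branch instead.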

  22. Water • XML-based OO Scripting Language • Facilitates creating Web Services • Run methods remotely via web-browser • Generalizes dynamic typing to constraint checking • Conformance of actuals to formals

  23. Pros and cons • Encoding Improvement • Amount of tagging can be controlled by suitably delimiting table data and annotating it with corresponding “string-processing” method • Master Copy Update • Changes to the spec require manual modification of the archived annotated version. • Irregular Tables in Specs • Different units, etc.
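
The “string-processing” idea under Encoding Improvement could look roughly like the Python sketch below: the whole table block is delimited once, and a single routine turns its lines into rows, instead of tagging each row by hand. The row patterns and the name `parse_strength_rows` are assumptions made for this illustration, and treating “and under” as a lower bound of 0 follows the tagged example rather than any stated rule.

```python
import re

# The delimited table data, exactly as it appears in the pre-processed text.
RAW_TABLE = """\
0.50 and under 165 155
0.50 - 1.00 160 150
1.00 - 1.50 155 145
"""

NUMBER = r"(\d+(?:\.\d+)?)"
ROW_RANGE = re.compile(rf"^\s*{NUMBER}\s*-\s*{NUMBER}\s+(\d+)\s+(\d+)\s*$")
ROW_UNDER = re.compile(rf"^\s*{NUMBER}\s+and under\s+(\d+)\s+(\d+)\s*$")


def parse_strength_rows(block):
    """Turn a delimited text block into (min, max, tensile, yield) rows."""
    rows = []
    for line in block.splitlines():
        if not line.strip():
            continue
        match = ROW_RANGE.match(line)
        if match:
            low, high, tensile, yield_strength = match.groups()
            rows.append((float(low), float(high), int(tensile), int(yield_strength)))
            continue
        match = ROW_UNDER.match(line)
        if match:
            high, tensile, yield_strength = match.groups()
            # "X and under" is read as the range from 0 up to X.
            rows.append((0.0, float(high), int(tensile), int(yield_strength)))
            continue
        # Irregular rows are reported rather than silently skipped.
        raise ValueError(f"unrecognized table row: {line!r}")
    return rows


rows = parse_strength_rows(RAW_TABLE)
```

Rejecting unrecognized rows makes the irregular-table problem visible at annotation time, at the cost of needing one pattern per row shape.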

  24. Some Related Work • Microsoft Smart Tags • Recognize “controlled” words in Office 2003 documents and associate predefined list of actions with each occurrence • SHOE • Table data in a declarative (logic) language

  25. Prolog rendition
      strengthTableRow(0,    0.50, 165, 155).
      strengthTableRow(0.50, 1.00, 160, 150).
      strengthTableRow(1.00, 1.50, 155, 145).
      ...
      strengthTable(Thickness, TensileStrength, YieldStrength) :-
          strengthTableRow(L, U, TensileStrength, YieldStrength),
          L =< Thickness, U > Thickness.
      thicknessToTensileStrength(Thickness, TensileStrength) :-
          strengthTable(Thickness, TensileStrength, _).
      thicknessToYieldStrength(Thickness, YieldStrength) :-
          strengthTable(Thickness, _, YieldStrength).
      ?- thicknessToYieldStrength(0.6, YS).

  26. Conclusions

  27. A Step towards Holy Grail • Ultimately enable authoring and/or extracting the human-comprehensible and machine-processable parts of a document “hand in hand”, and keep them “side by side”.
