Document Engineering of Complex Software Specifications
June 4, 2007, MSc Thesis in Computer Science

Mehrdad Nojoumian

Supervisor: Professor T. C. Lethbridge

University of Ottawa

School of Information Technology and Engineering

Motivation and Goal

Problems triggering our motivation:

Software Specifications:

are dense and intricate (a large volume of material)

have complicated structures (many tables, figures, lists, code listings, etc.)

are difficult to browse and navigate

are mostly available only as PDF or as a single hypertext page

Major goal:

Re-engineer PDF-based documents (specifications, conference proceedings, e-books, etc.)

Illustrate how to make more usable versions of documents

Data Analyses

Headings and the document index carry the most important words in a document

UML Superstructure Specifications

The most frequent words among headings

Frequency of the previous words as found in the entire document

The most frequent words in the document index

Other OMG Specifications

Sorted document and heading tokens based on their frequency in two separate lists

Defined position of heading tokens among document tokens: P1, P2, …, PN

MP: Mean of [P1…PN]

NDT: Total number of document tokens

Percentage = (MP * 100) / NDT

The most frequent heading words (more than 2 occurrences) are among the most frequent words in the entire document
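The Percentage measure above can be illustrated with a minimal Python sketch. It is not the analysis code used in the thesis: the function name and the toy token lists are hypothetical, and it assumes that NDT counts the distinct tokens in the frequency-sorted document list and that positions are 1-based ranks.

```python
from collections import Counter


def heading_rank_percentage(document_tokens, heading_tokens):
    """Return (MP * 100) / NDT for heading tokens ranked among document tokens."""
    # Document tokens sorted by descending frequency (ties keep first-seen order).
    ranked = [token for token, _ in Counter(document_tokens).most_common()]
    # 1-based position of every distinct document token in that ranking.
    position = {token: index for index, token in enumerate(ranked, start=1)}
    # P1..PN: positions of the distinct heading tokens that occur in the document.
    positions = [position[t] for t in dict.fromkeys(heading_tokens) if t in position]
    if not positions:
        raise ValueError("no heading token occurs in the document")
    mp = sum(positions) / len(positions)  # MP: mean of [P1..PN]
    ndt = len(ranked)                     # NDT: assumed to be the distinct-token count
    return mp * 100 / ndt


# Toy data, invented for illustration only.
document_tokens = ["class"] * 5 + ["package"] * 3 + ["diagram"] * 2 + ["note"]
heading_tokens = ["class", "package"]
percentage = heading_rank_percentage(document_tokens, heading_tokens)
print(percentage)
```

A low percentage means the heading words sit near the top of the document's frequency ranking.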

Document Transformation

Transforming the raw input into a format more amenable to analysis (XML)

Extracting and refining the structure

Conversion Experiments:


Adobe Acrobat Professional 7.8

Microsoft Word 2003

Stylus Studio XML Enterprise Suite

ABBYY PDF Transformer 1.0



Low Volume

Clean & Understandable

Similarity to XML

Having Good Clues

Logical Structure Extraction

Java parsers

Fixed the mis-tagging introduced during the previous (conversion) phase

Extracted all headings listed in the document bookmarks

Removed unneeded information and XML tags

Formed the document logical structure in a clean XML format

Hypertext pages & Text Extraction

Produced a separate output page for each chapter, section, subsection, etc.

(1.html, 2.html, 2.1.html, etc.)

Generated a table of contents from the headings (displayed as a frame)

Connected hypertext outputs sequentially

XPath expressions

Programming approach

Formed major document elements

Anchors in long pages

Figures and their captions

Simple & Nested Lists

Dynamic Tables
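The page-generation steps on this slide (one page per heading, a table of contents, sequential links) can be sketched in Python as follows. This only illustrates the idea and is not the XSLT/Saxon implementation described in the thesis; the function names, the tuple layout of `sections`, and the `content` frame name are all assumptions.

```python
from html import escape


def build_pages(sections):
    """sections: ordered list of (number, title, body_text) tuples (hypothetical layout)."""
    names = [f"{number}.html" for number, _, _ in sections]
    pages = {}
    for i, (number, title, body) in enumerate(sections):
        links = []
        if i > 0:  # link back to the preceding page
            links.append(f'<a href="{names[i - 1]}">Previous</a>')
        if i < len(names) - 1:  # link forward to the following page
            links.append(f'<a href="{names[i + 1]}">Next</a>')
        pages[names[i]] = (
            f"<html><body><h1>{escape(number)} {escape(title)}</h1>"
            f"<p>{escape(body)}</p><p>{' | '.join(links)}</p></body></html>"
        )
    return pages


def build_toc(sections):
    """Table of contents whose links open in a frame named 'content' (assumed name)."""
    items = "".join(
        f'<li><a href="{number}.html" target="content">'
        f"{escape(number)} {escape(title)}</a></li>"
        for number, title, _ in sections
    )
    return f"<ul>{items}</ul>"


# Toy input, invented for illustration only.
sections = [
    ("1", "Scope", "Text of chapter 1."),
    ("2", "Conformance", "Text of chapter 2."),
    ("2.1", "Language Units", "Text of section 2.1."),
]
pages = build_pages(sections)
toc = build_toc(sections)
```

Keeping the pages in a dictionary keyed by file name leaves the decision of where to write them to the caller.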

Concept Extraction

UML Superstructure Specification

UML class & package hierarchies extraction

If the first child of a <Section> element contains the string ‘Class Descriptions’, the UML classes & packages can be found in the grandchildren of that <Section> element
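The rule above can be sketched with Python's standard `xml.etree.ElementTree`. The XML layout below is an assumption made for illustration: only the `<Section>` element name comes from the slide, while the `<Heading>` and `<Para>` elements, the sample headings, and the "(from Package)" heading pattern are hypothetical. The thesis itself used XPath and XSLT rather than this code.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment of the extracted logical structure.
SAMPLE = """
<Document>
  <Section>
    <Heading>7.3 Class Descriptions</Heading>
    <Section>
      <Heading>7.3.1 Abstraction (from Dependencies)</Heading>
      <Para>...</Para>
    </Section>
    <Section>
      <Heading>7.3.2 AggregationKind (from Kernel)</Heading>
    </Section>
  </Section>
  <Section>
    <Heading>7.4 Diagrams</Heading>
    <Section>
      <Heading>7.4.1 Structure diagram</Heading>
    </Section>
  </Section>
</Document>
"""

# Assumed heading shape: "<number> <ClassName> (from <Package>[, <Package>...])".
HEADING = re.compile(r"^[\d.]+\s+(\w+)\s*\(from\s+([^)]+)\)")


def extract_classes(root):
    """Return (class_name, [packages]) pairs found under 'Class Descriptions' sections."""
    found = []
    for section in root.iter("Section"):
        children = list(section)
        if not children:
            continue
        # Rule: the first child of the <Section> must contain 'Class Descriptions'.
        if "Class Descriptions" not in (children[0].text or ""):
            continue
        # Classes and packages are then read from the grandchildren.
        for child in children:
            for grandchild in child:
                if grandchild.tag != "Heading":
                    continue
                match = HEADING.match((grandchild.text or "").strip())
                if match:
                    packages = [p.strip() for p in match.group(2).split(",")]
                    found.append((match.group(1), packages))
    return found


classes = extract_classes(ET.fromstring(SAMPLE))
print(classes)
```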

Other specifications:

Common Warehouse Meta-model (CWM)

UML Infrastructure (UML Inf.)

Meta Object Facility (MOF)


How can we detect such a logical relation among heading elements automatically?

Cross Referencing

Developed an XSLT program to extract heading phrases and their corresponding hyperlinks

Filtered phrases that share a common substring, such as Association & AssociationClass

Removed phrases that map to many independent hypertext pages (different entries in user interfaces)

For the UML Superstructure Specification only, also used package names as cross-reference anchors

Finally, developed a Java program to replace hyperlinks in generated HTML pages
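The cross-referencing step can be sketched as below. This is a swapped-in technique, not the thesis's XSLT and Java programs: instead of filtering phrases with common substrings, it matches whole words and tries longer phrases first, so that AssociationClass is never linked as Association. The function name, the phrase-to-page mapping, and the file names are hypothetical, and the sketch works on plain text rather than on existing HTML markup.

```python
import re


def add_cross_references(text, targets):
    """Wrap each known heading phrase in a hyperlink to its page.

    targets: dict mapping heading phrase -> page file name (hypothetical data).
    """
    # Longest phrases first, so the longer of two overlapping phrases wins.
    phrases = sorted(targets, key=len, reverse=True)
    # \b keeps "Association" from matching inside "AssociationClass".
    pattern = re.compile(r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b")
    return pattern.sub(
        lambda m: f'<a href="{targets[m.group(1)]}">{m.group(1)}</a>', text
    )


# Toy data, invented for illustration only.
targets = {"Association": "7.3.3.html", "AssociationClass": "7.3.4.html"}
linked = add_cross_references(
    "An AssociationClass is both an Association and a Class.", targets
)
print(linked)
```

A production version would also need to skip text that is already inside tags or existing links.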

Usability of User Interfaces

Reasons for generating small hypertext pages:

A better sense of location (navigating)

Less chance of getting lost (scrolling)

Less overwhelming sensation (learning)

Statistical analysis (identifying topics of interest)

Faster downloading (entire document!)

Easier printing, cross-referencing among diverse specifications, etc.

User Interfaces Demo


A generic approach to re-engineering complex documents

A data analysis showing that words in headings provide a sufficient basis for document re-engineering

Extraction of the document logical structure in XML format

Various techniques for text & concept extraction using W3C technologies

Major software components for an “Integrated Document Engineering Tool”

Engineering Lessons & Challenges

Engineering Lessons:

Generating a clean XML file from PDF images requires sophisticated features to recognize each document element correctly and to deal with mis-tagging, page boundaries, etc.

The latest technologies play a remarkable role in engineering tasks: e.g., XPath 2.0, compared with parsing packages, offers a high-level interaction close to human language

Comprehensive data analysis can facilitate the DocEng process, give a better understanding, and help construct robust rules for such processing

Low Level Challenges:

Generating multiple hypertext pages with Saxon

Detecting errors in XSLT programming

Creating complicated XPath expressions, etc.

Future Work

Extracting the initial XML document independently of Adobe Acrobat

Automating the concept extraction procedure or creating some HCI features

Developing an automatic document analyzer for comprehensive data analyses

Investigating usability of current user interfaces to discover users’ demands

Adding interactive features to the UIs, e.g. online query submission against the XML files


Refereed Conference Paper:

M. Nojoumian & T. C. Lethbridge, “Extracting document structure to facilitate a KB creation for UML specifications”, in Proceedings of the 4th IEEE International Conference on Information Technology: New Generations (ITNG), pp. 393-400, Las Vegas, USA, 2007.

Invited to publish in the Journal of Computers (JOC):

M. Nojoumian & T. C. Lethbridge, “Document engineering of complex software specifications”, Academy Publisher.

Thank you very much


