Knowledge Modeling from Software Documentation By Madhuri Gopal, G.S Mahalakshmi V.Vani Vijayan
Agenda: • Objective • Project overview • Design Principles • Technology Stack • Approach and Methodology • Execution Framework • Modules Covered • Results
Objective The objective of this presentation is to understand the nuances of converting existing software documentation to an intelligent knowledge representation
Project Overview: Background • Traditional development, deployment and maintenance of conventional software applications demand higher quality and shorter time-to-market cycles in order to achieve customer delight. • This involves a formal, explicit and conventional representation of the knowledge base shared across stakeholders. • Existing SDLC documents do not support intelligent extraction and interpretation, either for downstream applications or for enhancements. • There is a growing need for effective and efficient use of software artifacts to deliver enhanced traceability as future needs change.
Challenges in the existing systems • More than 90% of existing software documentation is in the form of text. • Knowledge engineers create knowledge representations from scratch, which makes it difficult to reuse and enhance existing representations. • Existing knowledge representation techniques require domain knowledge and have a steep learning curve. • Differences in the conceptualization of the domain model lead to inconsistencies in its representation.
Design Principles Open/Closed Principle Software entities such as classes, modules and functions should be open for extension but closed for modification. Dependency Inversion Principle • High-level modules should not depend on low-level modules. Both should depend on abstractions. • Abstractions should not depend on details. Details should depend on abstractions.
Design Principles Contd.. Single Responsibility Principle A class should have only one reason to change. Liskov's Substitution Principle Derived types must be completely substitutable for their base types.
Technology Stack The architecture followed is a two-tier architecture. Front-end: Java Back-end: Files
Development Hardware • Processor: Intel(R) Core™ 2 Duo CPU T6400 @ 2.00 GHz • Memory (RAM): 4 GB • System type: 32-bit Operating System Tools used • CoreNLP: Stanford package for Natural Language Processing (NLP) • ConExp: open-source tool for creating formal concept lattices
Approach and Methodology • Software prototyping (incremental prototyping) is the development methodology used. • The final product is built as separate prototypes. • At the end, the separate prototypes are merged into an overall design. • Steps are: a) Identification of basic requirements b) Development of the initial prototype c) Review of the prototype d) Revision and enhancement of the prototype
Modules covered 1. Part-of-Speech (POS) tagging using a maximum-entropy tagger 2. Lemmatization to reduce the relevant terms extracted by POS tagging to their lemma forms 3. Named Entity Recognition (NER) using Conditional Random Fields (CRF) with Gibbs sampling for entity identification and extraction 4. Parsing to determine the grammatical structure w.r.t. Formal Parsed Grammar using a factored model
Modules covered contd… 5. Coreference resolution using tiers of deterministic models to determine the relative importance of different terms 6. Querying and manipulation of natural language text 7. Formal Concept Analysis to derive the relationships between attributes and objects, and also among attributes 8. Conversion of the formal concept lattice to XML for extraction of the knowledge representation
Input Sources Software engineering documents that are part of the MIL-STD-498 software development standard are used as input, consisting of: • Computer Operation Manual (COM) • Computer Programming Manual (CPM) • Database Design Description (DBDD) • Firmware Support Manual (FSM) • Interface Design Description (IDD) • Interface Requirements Specification (IRS) • Operational Concept Description (OCD) • Software Center Operator Manual (SCOM) • Software Design Description (SDD) • Software User Manual (SUM) • Software Version Description (SVD)
Input Sources Contd.. • Software Development Plan (SDP) • Software Input/Output Manual (SIOM) • Software Installation Plan (SIP) • Software Product Specification (SPS) • Software Requirements Specification (SRS) • System/Subsystem Design Description (SSDD) • System/Subsystem Specification (SSS) • Software Test Description (STD) • Software Test Plan (STP) • Software Test Report (STR) • Software Transition Plan (STrP)
Algorithm
Step 1:
    Tagger1 = POS_Tagging_Function(SRS)
    Tagger2 = POS_Tagging_Function(SDD)
    Tagger3 = POS_Tagging_Function(STD)
Step 2:
    Lemma_Form1 = Lemma_construction(Tagger1)
    Lemma_Form2 = Lemma_construction(Tagger2)
    Lemma_Form3 = Lemma_construction(Tagger3)
Step 3:
    NER1 = CRF_Gibbs_Function(Lemma_Form1)
    NER2 = CRF_Gibbs_Function(Lemma_Form2)
    NER3 = CRF_Gibbs_Function(Lemma_Form3)
Step 4:
    Parse1 = Parser(NER1)
    Parse2 = Parser(NER2)
    Parse3 = Parser(NER3)
Algorithm Contd..
Step 5:
    CoRef1 = Coreference_Resolution(Parse1)
    CoRef2 = Coreference_Resolution(Parse2)
    CoRef3 = Coreference_Resolution(Parse3)
Step 6:
    TREE_NODE = Query_Manipulation_function(CoRef1, CoRef2, CoRef3)
Step 7:
    Concept_Lattice = FCA(context, concept, TREE_NODE)
Step 8:
    XML_DOC = XML_Convert(Concept_Lattice)
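Steps 1 to 5 apply the same chain of annotators to each of the three documents. A minimal Python sketch of that chaining is given here for illustration only; the stage bodies are placeholders that merely record which stage ran (the project itself uses CoreNLP from Java), and the names `make_stage`, `annotate` and the sample sentences are assumptions, not part of the original work.

```python
from functools import reduce

# Stage names mirror Steps 1-5 of the algorithm; order matters.
STAGES = [
    "POS_Tagging_Function",
    "Lemma_construction",
    "CRF_Gibbs_Function",
    "Parser",
    "Coreference_Resolution",
]


def make_stage(name):
    """Build a placeholder stage that only records its own name."""
    def stage(doc):
        # Return a new dict so the output of earlier stages is never mutated.
        return {**doc, "annotations": doc["annotations"] + [name]}
    return stage


def annotate(text):
    """Run every stage in order over one document."""
    doc = {"text": text, "annotations": []}
    return reduce(lambda d, stage: stage(d), map(make_stage, STAGES), doc)


# Hypothetical one-sentence stand-ins for the three input documents.
documents = {
    "SRS": "The user enters a password.",
    "SDD": "The login module validates the password.",
    "STD": "The tester verifies the login screen.",
}
annotated = {name: annotate(text) for name, text in documents.items()}
```

Each entry of `annotated` then plays the role of CoRef1, CoRef2 and CoRef3, the inputs to Step 6.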
Implementation Steps The algorithm is mapped to the following series of steps: • Collection of existing software documents a) Software Requirements Specification (SRS): contains a set of use cases that describe system-user interaction, together with non-functional requirements such as design constraints and quality standards. b) Software Design Description (SDD): shows how the software system will be structured to represent the software components, interfaces and data necessary for the implementation phase. c) Software Test Description (STD): specifies the form of a set of documents for use in different stages of software testing.
Implementation Steps contd… • Extraction of relevant knowledge from the SRS, SDD and STD by using the following sequence of natural language processing steps: • POS tagging • Lemmatization • Named Entity Recognition • Syntactic Parsing • Coreference Resolution Input: SRS, SDD, STD Output: Annotated Text Corpora
Implementation Steps contd… Querying and manipulation of annotated text corpora and conversion to tree data structures • This step uses query manipulation tools to extract the relevant knowledge from the annotated text corpora. • The verb-subject, verb-object and verb-PP-complement pairs are extracted, and the syntactic dependencies between subject and verb, verb and object, and verb and PP complement are exploited to derive a meaningful hierarchical relationship. Input: Annotated SRS, SDD, STD Output: Tree Data Structure Representation
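The grouping of dependency pairs under their verb can be sketched as follows. This is an illustrative Python sketch, not the query tool used in the project: the relation labels (`nsubj`, `dobj`, `prep`), the triple layout and the function name `build_verb_tree` are assumptions, and a real parser's label set may differ.

```python
from collections import defaultdict

# Hypothetical mapping from dependency labels to the three roles of interest.
ROLE_OF = {"nsubj": "subject", "dobj": "object", "prep": "pp_complement"}


def build_verb_tree(triples):
    """Group (relation, governor, dependent) triples into verb -> role -> terms."""
    tree = defaultdict(lambda: defaultdict(list))
    for relation, governor, dependent in triples:
        role = ROLE_OF.get(relation)
        if role is not None:  # skip relations outside the three roles
            tree[governor][role].append(dependent)
    # Convert to plain dicts so the result prints and compares cleanly.
    return {verb: dict(roles) for verb, roles in tree.items()}


# Toy triples such as a dependency parser might emit for two sentences.
triples = [
    ("nsubj", "enter", "user"),
    ("dobj", "enter", "password"),
    ("prep", "enter", "form"),
    ("det", "password", "the"),
    ("nsubj", "validate", "system"),
    ("dobj", "validate", "password"),
]
tree = build_verb_tree(triples)
```

Each verb becomes a parent node whose children are its subject, object and PP-complement terms, which is the two-level hierarchy the next step turns into a formal context.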
Implementation Steps contd… Formation of Concept Lattice using Formal Concept Analysis • The hierarchical information and syntactic dependencies obtained by NLP give a relationship in which the verbs act as the set of objects and the verb-subject, verb-object and verb-PP-complement terms act as the set of attributes. • This relationship is written in the form of a matrix and given as input to ConExp, which transforms the matrix into a concept lattice. Input: Tree Data Structure Representation Output: Formal Concept Lattice
• The topmost element indicates an object that has no attributes (other than any shared by every object). • The bottommost element indicates an object that has all attributes. • The nodes in blue indicate the objects. • The nodes in orange depict the attributes.
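What the matrix-to-lattice step computes can be illustrated on a toy context. The Python sketch below is not ConExp's algorithm; it simply enumerates every formal concept by closing the objects' attribute sets under intersection. The verbs, the `role:term` attribute encoding and the function name `formal_concepts` are assumptions made for the example.

```python
def formal_concepts(context):
    """Return all (extent, intent) pairs of a context {object: set_of_attributes}."""
    all_attributes = frozenset().union(*context.values())
    # Concept intents are exactly the intersections of object attribute sets,
    # together with the full attribute set (the empty intersection).
    intents = {all_attributes}
    for attrs in context.values():
        intents |= {intent & attrs for intent in intents}
    concepts = []
    for intent in intents:
        # The extent is every object that carries all attributes of the intent.
        extent = frozenset(o for o, attrs in context.items() if intent <= attrs)
        concepts.append((extent, intent))
    # Largest extent first, so the top of the lattice comes first.
    return sorted(concepts, key=lambda c: (-len(c[0]), sorted(c[1])))


# Toy context: verbs are objects, role:term strings are attributes.
context = {
    "enter": {"subject:user", "object:password"},
    "validate": {"subject:system", "object:password"},
    "display": {"subject:system", "object:report"},
}
concepts = formal_concepts(context)
top, bottom = concepts[0], concepts[-1]
```

In this toy context `top` contains all three verbs and no attributes, while `bottom` contains all four attributes and no verbs, since no verb carries every attribute.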
Implementation Steps contd… Conversion of formal concept lattice to XML • The set of all attributes and their values is extracted for each object. • This provides an intermediate representation of the concept hierarchy before it is transformed into a knowledge representation. Input: Formal Concept Lattice Output: XML Format
Implementation Steps contd… Pseudocode for conversion of formal concept lattice to XML
Let n be the total number of objects and m be the total number of attributes.
For j = 1 to n
    Form an XML element with head = Ij
    For k = 1 to m
        If Ak is an attribute of Ij, add Ak to the list of attributes of the element
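The pseudocode can be sketched in runnable form with Python's standard `xml.etree.ElementTree` module. The element names (`concepts`, `object`, `attribute`), the 0/1 incidence matrix layout and the function name `context_to_xml` are assumptions for illustration; the slides do not specify the XML schema actually used.

```python
import xml.etree.ElementTree as ET


def context_to_xml(objects, attributes, incidence):
    """Emit one XML element per object, listing the attributes it has."""
    root = ET.Element("concepts")
    for j, obj in enumerate(objects):                # For j = 1 to n
        element = ET.SubElement(root, "object", name=obj)
        for k, attr in enumerate(attributes):        # For k = 1 to m
            if incidence[j][k]:                      # Ak is an attribute of Ij
                ET.SubElement(element, "attribute", name=attr)
    return root


# Toy incidence matrix: rows are objects, columns are attributes.
objects = ["enter", "validate"]
attributes = ["subject:user", "subject:system", "object:password"]
incidence = [
    [1, 0, 1],
    [0, 1, 1],
]
root = context_to_xml(objects, attributes, incidence)
xml_text = ET.tostring(root, encoding="unicode")
```

`xml_text` holds the serialized document, which can be written to a file as the intermediate representation.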
Conclusion • Software documentation practices vary among different organizations. • 53% of organizations deliver consistent software to the maintenance phase. • 16% update their documentation at all levels. • 53% of organizations have user manuals that are consistent with the system state. • 42% revise and modify their regression test case repositories. • 11% achieve full traceability among system documents, and only 5% have achieved traceability of change. On average, a software cost saving of 10-15% is expected, depending on the size and complexity of the software documentation.