Software Development

Software Development • The team • FSU: David Swofford, Mark Holder, Fredrik Ronquist • UBC: Wayne Maddison • UConn: Paul Lewis • Arizona: David Maddison • AMNH: Ward Wheeler • UT-Austin: Tandy Warnow • UNM: Bernard Moret, David Bader, Tiffani Williams • SDSC/UCSD: Mark Miller, Fran Berman, Philip Bourne, John Huelsenbeck

Software Development • Activities • Design of system architecture with APIs • Development of internal/external documentation system • Wrapping/interoperability of existing software • Implementation of new “solution modules” for existing and novel methods • Testing/usability assessment • Integration of database, tree estimation, and post-tree analysis functions • Implementation of scalar, parallel, and distributed versions

Overall design of system architecture • Modularity, communication, distributed processing: • Protocols for communication among components (modules) need to be defined in general, and in particular for key categories of modules (database, tree search engine, GUI). • Different styles of communication may be needed for different links: the database-tree search link may need to be very high speed; the link to the GUI less so. • The system needs to be designed to accommodate different styles of use (theoreticians feeding simulated data repeatedly into the tree search engines versus heavily interactive use by a single user interested in secondary conclusions about character evolution).

Overall design of system architecture • Modularity, communication, distributed processing (cont.): • The system needs to be “commandable” in different ways -- command line, GUI, or some more direct pipeline. • The system must be designed to be convenient for developers to add/try out their own modules. • The system should manage the distribution of computations to local machines or on a grid; this should be easy both for the user (i.e. it happens automatically depending on resources available and needs of the computation) and for the developer (code can be compiled for grid-ready computations with little modification). • Error handling must generate messages informative to the user and useful to the developer.
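The "commandable" and "easy to add modules" goals above can be illustrated with a minimal sketch. Everything here is hypothetical (the `REGISTRY` table, the `register` decorator, the `ModuleError` class, and the `count-taxa` command are assumptions for illustration, not part of any existing design): a single dispatch entry point that a command line, a GUI, or a pipeline could all call, with errors that carry one message for the user and another for the developer.

```python
from typing import Callable, Dict, List


class ModuleError(Exception):
    """Error carrying a user-facing message plus detail for the developer."""

    def __init__(self, user_message: str, developer_detail: str) -> None:
        super().__init__(user_message)
        self.user_message = user_message
        self.developer_detail = developer_detail


# Hypothetical registry: command name -> handler taking a list of string arguments.
REGISTRY: Dict[str, Callable[[List[str]], str]] = {}


def register(name: str):
    """Decorator that plugs a new module into the system under a command name."""
    def wrap(func: Callable[[List[str]], str]) -> Callable[[List[str]], str]:
        REGISTRY[name] = func
        return func
    return wrap


@register("count-taxa")
def count_taxa(args: List[str]) -> str:
    # Toy module: reports how many taxon names were passed in.
    return f"{len(args)} taxa"


def dispatch(command_line: str) -> str:
    """Run one command; the same entry point can back a CLI, a GUI, or a pipeline."""
    parts = command_line.split()
    if not parts:
        raise ModuleError("No command given.", "dispatch() received an empty string")
    name, args = parts[0], parts[1:]
    handler = REGISTRY.get(name)
    if handler is None:
        raise ModuleError(
            f"Unknown command '{name}'.",
            f"no handler registered; known commands: {sorted(REGISTRY)}",
        )
    return handler(args)
```

A developer trying out a new module would only need to write one decorated function; a front end would show `user_message` and log `developer_detail`.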

Overall design of system architecture • Data standards, logging: • Development of XML standards for communication among modules & data storage. Given NEXUS experience, this will not be trivial (e.g., not just simple data, but including assumptions, which grade into a programming language for computation, as in Hy-Phy batch files and in PAUP and Mesquite blocks). • Sophisticated logging of analyses, including allowing the "freezing" or snapshotting of calculations midstream and undo; the log could be presented as a user-friendly notebook.
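One way to picture the logging idea (snapshots, undo, and a notebook-style view) is the sketch below. It is an assumption-laden illustration, not a proposed design: the `AnalysisLog` class and its methods are hypothetical, and it snapshots by deep-copying the whole state, which would be too costly for real analyses.

```python
import copy
from typing import Any, Dict, List, Tuple


class AnalysisLog:
    """Records each step together with a snapshot of the state before it ran."""

    def __init__(self, state: Dict[str, Any]) -> None:
        self.state = state
        self._history: List[Tuple[str, Dict[str, Any]]] = []

    def apply(self, description: str, **changes: Any) -> None:
        # Freeze the current state, then apply the requested changes.
        self._history.append((description, copy.deepcopy(self.state)))
        self.state.update(changes)

    def undo(self) -> str:
        # Restore the snapshot taken before the most recent step.
        if not self._history:
            raise IndexError("nothing to undo")
        description, previous = self._history.pop()
        self.state = previous
        return description

    def notebook(self) -> List[str]:
        # Human-readable view of the steps still in effect.
        return [f"{i}. {d}" for i, (d, _) in enumerate(self._history, start=1)]
```

Usage: call `apply("set criterion to likelihood", criterion="likelihood")` for each step, `undo()` to roll back the last one, and `notebook()` to list what remains.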

Overview of Phorest classes (diagram) • NxsReader: base class providing the ability to read a NEXUS data file (manages a linked list of NxsBlock objects) • NxsBlock: base class providing the ability to read a private NEXUS block (list of commands and associated handlers) • Floor: provides a functional program with a simple, portable, command line interface • Canopy: derived class providing GUI (plots, dialogs, output window, etc.) • Members shown in the diagram: TreesManager treesMgr, TaxaManager taxaMgr, CharManager charMgr, OutputManager outMgr
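The reader/block relationship in the Phorest overview (a reader that manages a list of block objects, each block holding its own command handlers) can be sketched roughly as follows. This is a simplified illustration in Python, not the Phorest or NxsReader code: the `Reader` and `Block` classes are hypothetical stand-ins, and the parser simply splits on semicolons, ignoring NEXUS comments and quoted tokens.

```python
from typing import Dict, List, Optional


class Block:
    """Hypothetical stand-in for NxsBlock: handles the commands of one block type."""

    def __init__(self, name: str) -> None:
        self.name = name.upper()
        self.commands: List[str] = []

    def handle(self, command: str) -> None:
        # A real block would look up a handler per command; here we just record it.
        self.commands.append(command)


class Reader:
    """Hypothetical stand-in for NxsReader: routes each block to a registered handler."""

    def __init__(self) -> None:
        self.blocks: Dict[str, Block] = {}
        self.skipped: List[str] = []  # names of blocks nobody registered for

    def add(self, block: Block) -> None:
        self.blocks[block.name] = block

    def execute(self, text: str) -> None:
        stripped = text.lstrip()
        if not stripped.upper().startswith("#NEXUS"):
            raise ValueError("not a NEXUS file: missing #NEXUS header")
        current: Optional[Block] = None  # handler for the block being read, if any
        for raw in stripped[len("#NEXUS"):].split(";"):
            command = " ".join(raw.split())  # collapse runs of whitespace
            if not command:
                continue
            words = command.split()
            keyword = words[0].upper()
            if keyword == "BEGIN" and len(words) > 1:
                name = words[1].upper()
                current = self.blocks.get(name)
                if current is None:
                    self.skipped.append(name)
            elif keyword in ("END", "ENDBLOCK"):
                current = None
            elif current is not None:
                current.handle(command)
```

Commands inside a block with no registered handler are skipped, so a private block meant for one program does not break another.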

Documentation system (diagram) • Web forms: all information about a command is entered via web form, including command name, description, available options, and names of classes and/or functions responsible for handling the command • PHP -> MySQL: information about all commands is stored in a database • PHP -> XML: PHP scripts write out an XML file, which can then be converted to HTML, PDF, source code, etc. using the eXtensible Stylesheet Language (XSL) • XSL outputs: HTML (e.g. online documentation), PDF (e.g. command reference manual), C++/Java (source code to implement command)
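The single-source idea in this pipeline (one command record, several generated outputs) can be sketched as below. This is an illustration only and swaps in different tools: it uses Python's standard library instead of PHP/MySQL, and a plain function instead of an XSL stylesheet for the HTML step. The `CommandRecord` fields, the XML element names, and the example values are all hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from html import escape
from typing import List


@dataclass
class CommandRecord:
    """One row of command metadata, as it might be entered through a web form."""
    name: str
    description: str
    handler: str  # class/function responsible for the command
    options: List[str] = field(default_factory=list)


def to_xml(record: CommandRecord) -> ET.Element:
    # Intermediate XML form from which every output format is derived.
    cmd = ET.Element("command", name=record.name, handler=record.handler)
    ET.SubElement(cmd, "description").text = record.description
    options = ET.SubElement(cmd, "options")
    for opt in record.options:
        ET.SubElement(options, "option").text = opt
    return cmd


def to_html(cmd: ET.Element) -> str:
    # Stand-in for the XSL step: render the XML as an HTML fragment.
    items = "".join(f"<li>{escape(o.text or '')}</li>" for o in cmd.iter("option"))
    return (
        f"<h2>{escape(cmd.get('name', ''))}</h2>"
        f"<p>{escape(cmd.findtext('description', default=''))}</p>"
        f"<ul>{items}</ul>"
    )
```

Because the manual, the online help, and the generated source would all come from the same record, they cannot drift apart.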

Modules • Tree search and database modules should have top priority. • Core tree search engines • At least initially, Phorest/MrBayes/Poy can serve as modules to use for tree searches. Modifications will likely be needed to link them to databases and GUIs, including getting them to speak a common language for data structures and communication (XML/SOAP?). This common language/communication protocol needs to be designed. • Development of new tree search modules • Modifications to the tree search modules will be needed to incorporate improvements to algorithms from the theoreticians, and to accommodate new criteria or assumptions.
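To make the "common language" idea concrete, here is a minimal sketch of an XML message that any wrapped engine could be asked to accept. The message shape is purely an assumption for illustration (the `searchRequest` and `row` element names and the `engine`, `criterion`, and `taxon` attributes are hypothetical); the actual protocol remains to be designed.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Tuple


def encode_request(engine: str, criterion: str, matrix: Dict[str, str]) -> str:
    """Serialize a tree search request to an XML string."""
    root = ET.Element("searchRequest", engine=engine, criterion=criterion)
    for taxon, sequence in matrix.items():
        ET.SubElement(root, "row", taxon=taxon).text = sequence
    return ET.tostring(root, encoding="unicode")


def decode_request(message: str) -> Tuple[str, str, Dict[str, str]]:
    """Parse a request back into (engine, criterion, matrix)."""
    root = ET.fromstring(message)
    if root.tag != "searchRequest":
        raise ValueError(f"unexpected message type: {root.tag}")
    matrix = {row.attrib["taxon"]: row.text or "" for row in root.findall("row")}
    return root.attrib["engine"], root.attrib["criterion"], matrix
```

A wrapper around an existing program would decode such a message, translate it into that program's native input, and encode the resulting trees in a matching response format.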

Modules (cont.) • Database: • Issues in the database design include: • aligned versus unaligned sequence storage & retrieval • storage of assumptions and methods (see above re: XML) • whether the database stores not just input to analyses and output from analyses, but also information sufficient to replicate an analysis step by step (snapshotting etc.). • Our main concern is how the software interacts with the databases, and making sure the databases are designed in accordance with software needs.

Modules (cont.) • User-interfaces, including editing tools • A basic data matrix/sequence editor to prepare data for submission to the database or tree search engines. • Perhaps this could be achieved by adapting an existing tool, but we should be prepared to build one from scratch to be able to handle the flexibility we'll demand. An appealing and well-done editor integrated into a system to submit for tree searches could do a lot to attract users. • Tree viewers (possibly for large trees) and GUI front ends for the various commands in the system. • Some/all of this work by SDSC “professionals?”

Modules (cont.) • Post-tree analyses • Modules to determine implications of trees for conclusions on character evolution, speciation, etc. should be linkable into the tree search and database, especially as the trees themselves may be viewed as the intermediate, not the ultimate, product of an analysis. • Modules to aid design of the system • Linking Mesquite & other existing programs to our architecture could provide some of these services. • Simulation engines: • These should appear to the system much like the database, as a source of character data. • The software development team will be primarily responsible for defining the protocols for supplying character data; others will write the simulation code.
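The point that simulation engines should look to the system "much like the database" can be sketched as a shared interface. All names here are hypothetical (`CharacterSource`, `StoredSource`, `SimulatedSource`, `matrix_shape`), and the simulated source just draws uniform random nucleotides; it does not simulate evolution on a tree.

```python
import random
from typing import Dict, Protocol, Set, Tuple


class CharacterSource(Protocol):
    """Anything that can supply a taxon -> sequence matrix."""

    def fetch(self, n_taxa: int, n_sites: int) -> Dict[str, str]: ...


class StoredSource:
    """Stand-in for a database: serves rows from data it already holds."""

    def __init__(self, rows: Dict[str, str]) -> None:
        self._rows = dict(rows)

    def fetch(self, n_taxa: int, n_sites: int) -> Dict[str, str]:
        names = sorted(self._rows)[:n_taxa]
        return {name: self._rows[name][:n_sites] for name in names}


class SimulatedSource:
    """Stand-in for a simulation engine: generates rows on demand."""

    def __init__(self, seed: int) -> None:
        self._rng = random.Random(seed)

    def fetch(self, n_taxa: int, n_sites: int) -> Dict[str, str]:
        return {
            f"taxon{i + 1}": "".join(self._rng.choice("ACGT") for _ in range(n_sites))
            for i in range(n_taxa)
        }


def matrix_shape(source: CharacterSource, n_taxa: int, n_sites: int) -> Tuple[int, Set[int]]:
    # Downstream code depends only on the protocol, not on where the data came from.
    matrix = source.fetch(n_taxa, n_sites)
    return len(matrix), {len(seq) for seq in matrix.values()}
```

With this split, the software team owns the `fetch` contract while simulation authors are free to implement it however they like.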

Testing/usability assessment • Communication with users will be an integral part of the software development process. • Participation in workshop courses will both serve as outreach and give us valuable feedback about design. • Journaling system will allow us to track how users interact with the system, and allow problems to be reproduced, analyzed, and debugged

Resolving conflicting priorities • Some goals are common across the software suite • interoperability • ease of use • flexibility • Others will inevitably conflict, e.g.: • Mesquite: emphasizes modularity, extensibility, visualization • Phorest: emphasizes efficient use of memory, speed, native GUI feel, scalability • MrBayes: emphasizes rapid exploration and implementation of new ideas • This is the challenge: how to bundle everything together into a coherent package, without sacrificing the strengths of different approaches?

“Notion of overall schedule” • Year 1 • Develop a concrete set of use cases that document goals and focus further efforts • Define a system architecture, with a collection of simple APIs (Swofford, W. Maddison, Lewis, Holder, working with SDSC professional team) • Write wrappers around existing software from team members (e.g., PAUP*, MrBayes, Mesquite, Poy, GRAPPA) for temporary use on our systems (Swofford, W. Maddison, D. Maddison, Lewis, Holder, Huelsenbeck, Ronquist) • Continue implementation of Phorest with the ultimate goal of applying lessons learned there to improve the design of the ITR project and to maintain compatibility with the developing IT resource (Swofford, Lewis, Holder, Ronquist?) • Work with the database group to develop standards for data input and exchange formats, drawing from experience with NEXUS and other existing methods, incorporating XML and possibly other emerging metalanguages (Swofford, W. Maddison, Lewis, others) • Work with other team members on issues related to algorithm engineering and high-performance computing; set up a prototype analysis ("stunt run") of a few million-sequence datasets to demonstrate feasibility and gather some preliminary computational data (Moret, Bader, Berman, Wheeler, plus other members listed above)

“Notion of overall schedule” • Year 2 • Fully implement the architecture designed in the first year, providing a framework for installation of software modules for phylogeny reconstruction, post-tree analyses, performance evaluation, and simulations of evolution. • Begin populating the framework with solution modules (replacing old code with newly developed modules) • Implement a rough user interface for temporary use, with partial integration of the database, solution modules, and current simulation tools. • Release an alpha version of the software suite and make it available for testing. • Conduct a new stunt run, with improved simulated data and new modules; use the results to develop new plans for application and platform scalability.

“Notion of overall schedule” • Year 3 • Continue populating the computational framework with reconstruction algorithms, evaluation modules, etc., with the goal of replacing earlier re-wrapped software with new modules that (among other things) smoothly integrate database functions with software modules. • Release a beta version for use by outside users ("official" collaborators, plus other ATOL investigators, students, and participants in our annual workshops); this version should run on laptops through small SMPs. • Perform a formal evaluation effort by outside users based on this beta version, and plan revisions based on the results of this evaluation. • Work to make all of the code base run efficiently on large machines, including clusters of SMPs • Work with algorithms group to implement, test, and refine the best algorithms devised to date.

“Notion of overall schedule” • Year 4 • Implement post-tree analysis modules and integrate them with the database • Implement accepted recommendations from the user panel and experimental findings of Year 3. • Perform large-scale testing on large datasets produced by the simulation team • Work closely with ATOL partners on analysis of their data, identifying and learning from the problems exposed in that process

“Notion of overall schedule” • Year 5 • Incorporate modules contributed by international collaborators and others into the framework • Enable software suite for Grid usage. • Identify new development targets • Work closely with the SDSC team to produce a “final” package that includes all of our work, and run tests with this package on a wide variety of platforms. • Document and report the rate of software development over the years of the project and its success in attracting outside contributors, in the interest of improving the efficiency of large open-source software development projects.
