Large-Scale Collection of Application Usage Data to Inform Software Development

David M. Hilbert • Information and Computer Science, University of California, Irvine • Irvine, California 92697-3425 • dhilbert@ics.uci.edu • http://www.ics.uci.edu/~dhilbert/

Overview • Background and Motivation • Dissertation and Evaluation • Insights and Hypotheses • Progress and Schedule • Dissertation Outline • Future Research

Background and Motivation • Expectations influence designs, designs embody expectations • Mismatches between expectations and how applications are actually used can lead to breakdowns • Identification and resolution of mismatches can help improve fit between design and use • Behavior of applications, users, and usage environments is complex and unpredictable enough that observation is required • Research area: theories, methods, and techniques to enable large-scale incorporation of application usage data in development

Impact of the Internet • On the positive side • cheap, rapid, large-scale distribution of software for evaluation • simple transport mechanism for usage information and feedback • use and development becoming increasingly concurrent • should make incorporating usage information easier • On the negative side • reduces opportunities for traditional user testing • increases variety and distribution of users and usage situations • lack of scalable techniques and methods for incorporating usage information on a large scale

Current Approaches • Current approaches suffer from significant limitations • usability testing => scale (size, scope, location, duration) • beta testing => data quality (incentives, knowledge, detail) • The user feedback paradox • users not having problems => provide feedback, negative reactions • users having problems => withhold feedback, positive reactions • The impact assessment problem • impact on user population of suspected or reported problems and potential changes

Research Goals • Address issues of scale • enable larger scale evaluations (size, scope, location, duration) than currently possible with existing usability testing techniques • Address issues of data quality • enable higher quality data to be collected than currently possible with beta testers alone or existing automated techniques • Provide a complementary source of information • help address the feedback paradox and impact assessment problem in making design and effort allocation decisions

Research Direction • Explore the use of automated software monitoring techniques • capture information about user interactions on a large scale • compare actual use against developers’ expectations • help automate mismatch identification and resolution process • make incorporating information about users more palatable to developers
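As an illustration of comparing actual use against a developer's expectation, here is a minimal sketch. The event names and the `first_violation` helper are hypothetical and are not taken from the prototype; the expectation checked is simply "one action is expected to happen before another".

```python
def first_violation(events, before, after):
    """Return the index where `after` occurs with no earlier `before`, else None.

    `events` is a recorded sequence of (hypothetical) event names. The
    expectation is that `before` is seen at least once ahead of `after`.
    """
    seen_before = False
    for index, event in enumerate(events):
        if event == before:
            seen_before = True
        elif event == after and not seen_before:
            return index
    return None


# Hypothetical session: the designer expected the address to be edited first.
session = ["FOCUS:ZipCode", "EDIT:ZipCode", "EDIT:Address"]
print(first_violation(session, "EDIT:Address", "EDIT:ZipCode"))
```

A non-None result marks a mismatch between expected and actual use that could be reported back to developers.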

Dissertation • Technical issues • Abstraction Problem (data quality) • Selection Problem (data quality/scale) • Context Problem (data quality) • Reduction Problem (scale) • Evolution Problem (scale) • Hypothesis • all these problems can be addressed by embedding the right kinds of data collection mechanisms within an appropriate data collection architecture

Dissertation (cont’d) • Theoretical/methodological issues • aside from “technical issues”, it isn’t clear what data to collect and why, and how to incorporate results in development • since data collection and analysis can be expensive, guidance can increase the chances that the cost/benefit ratio will be favorable • Hypothesis • a theory and method based on usage expectations can be elaborated to provide motivation and guidance for incorporating data collection and analysis in development

Contributions • Identification of key issues limiting scalability and data quality inherent in current techniques • Solutions to the abstraction, selection, context, reduction, and evolution problems within a single data collection architecture • A reference architecture to provide design guidance regarding key components and relationships • Theory to motivate the significance of usage expectations in development and importance of collecting usage information • Methodological guidance regarding collection, analysis, interpretation, and incorporation of results in development

Evaluation • Prototype • demonstrate solutions to the abstraction, selection, context, reduction, and evolution problems within a single data collection architecture • Informal empirical evaluation • assess usability and utility of approach based on feedback from independent developers who integrated the prototype in a research demonstration scenario • Participant observation of an industrial project • foundation for an analytical evaluation of the techniques, reference architecture, theory, and method

The Abstraction Problem • Observation • questions about usage typically occur in terms of concepts at higher levels of abstraction than represented in data provided by application components • questions of usage can occur at multiple levels of abstraction • Hypothesis • simple “data abstraction” mechanisms (based on grammatical techniques) can be constructed to allow low-level data to be related to higher-level concepts such as UI and application features as well as users’ tasks and goals • this can impact the results of human and automated analyses
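The abstraction idea can be sketched roughly as follows. This is an assumption-laden illustration, not the prototype's mechanism: the rule table, the event names, and the `abstract` function are all hypothetical. Each rule lists alternative low-level event sequences that count as one higher-level event, in the spirit of a small grammar.

```python
# Hypothetical rules: higher-level event name -> alternative low-level sequences.
RULES = {
    "PrintInvoked": [["MENU:File", "MENU:Print"], ["BUTTON:PrintTool"]],
}


def contains(events, sequence):
    """True if `sequence` appears as a contiguous run inside `events`."""
    n = len(sequence)
    return any(events[i:i + n] == sequence for i in range(len(events) - n + 1))


def abstract(events):
    """Return the higher-level events recognized in a low-level event stream."""
    events = list(events)
    return [
        name
        for name, alternatives in RULES.items()
        if any(contains(events, seq) for seq in alternatives)
    ]


print(abstract(["MENU:File", "MENU:Print"]))
print(abstract(["BUTTON:PrintTool"]))
```

Both the menu path and the toolbar button map to the same feature-level concept, so analyses can be phrased in terms of "print was invoked" rather than individual UI events.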

The Selection Problem • Observation • the amount of data necessary to answer usage questions will typically be a relatively small subset of the much larger set of data that might be recorded at any given time • collecting too much data can make it difficult to separate events and patterns of interest from the “noise” • Hypothesis • simple “data selection” mechanisms (based on events, event sequences, values, and value vectors) can be constructed to allow important data to be captured - and unimportant data filtered - prior to reporting • this can impact the results of human and automated analyses, not to mention scalability
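A minimal sketch of selection prior to reporting, under the assumption that events are plain dictionaries with hypothetical `type` and `value` fields. The `select` function and its parameters are illustrative names only.

```python
def select(events, wanted_types, value_ok=lambda event: True):
    """Keep only events of a wanted type whose values pass `value_ok`.

    Everything else is dropped before reporting, so it never has to be
    transported or analyzed.
    """
    return [e for e in events if e["type"] in wanted_types and value_ok(e)]


# Hypothetical raw stream: mouse movement is noise for this usage question.
raw = [
    {"type": "MOUSE_MOVE", "value": (10, 12)},
    {"type": "VALUE_CHANGED", "value": ""},
    {"type": "VALUE_CHANGED", "value": "92697"},
]
kept = select(raw, {"VALUE_CHANGED"}, value_ok=lambda e: e["value"] != "")
print(kept)
```

Here selection combines an event-type filter with a value-based filter, which loosely corresponds to the "events" and "values" criteria named on the slide.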

The Context Problem • Observation • information required to interpret the significance of events may not be available in the events produced by application components • contextual information may be spread across multiple events or missing altogether, but is frequently available “for the asking” from the application, artifacts, or user • Hypothesis • simple “context-capture” mechanisms (that provide access to application, artifact, and user state information) can be exploited to allow context to be used in interpreting the significance of events • this can also help in capturing important information not available in events
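Context capture can be pictured as asking the application for state at the moment an event is observed. The sketch below assumes a hypothetical `query_property` callback standing in for whatever property-query interface the application exposes; the property names are made up for illustration.

```python
def capture_with_context(event, query_property, property_names):
    """Return a copy of `event` annotated with application state.

    `query_property` is a caller-supplied function (an assumption here)
    that returns the current value of a named property.
    """
    record = dict(event)
    record["context"] = {name: query_property(name) for name in property_names}
    return record


# Hypothetical application state that the event itself does not carry.
app_state = {"dialog": "Print", "document_pages": 12}
event = {"type": "BUTTON_PRESSED", "source": "OK"}
print(capture_with_context(event, app_state.get, ["dialog", "document_pages"]))
```

The bare event only says that "OK" was pressed; the attached context says which dialog it was pressed in and on what kind of document.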

The Reduction Problem • Observation • much of the analysis that will ultimately be performed to answer usage questions can actually be performed during data collection resulting in greatly reduced data reporting and post-hoc analysis needs • when analysis is left as last step it is often not performed • Hypothesis • simple “data reduction” mechanisms (e.g., for performing counts and other simple analyses during collection) can be constructed to reduce the amount of data that must ultimately be reported and analyzed • this can impact scalability and likelihood that data will be analyzed
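A minimal sketch of reduction during collection, assuming the usage question is a simple frequency count. The `CountingReducer` class is a hypothetical name; it uses the standard library's `collections.Counter`.

```python
from collections import Counter


class CountingReducer:
    """Counts events as they are observed instead of storing each one."""

    def __init__(self):
        self.counts = Counter()

    def observe(self, event_name):
        self.counts[event_name] += 1

    def report(self):
        """Return the summary that would be sent back, not the raw stream."""
        return dict(self.counts)


reducer = CountingReducer()
for name in ["Print", "Save", "Print", "Print"]:
    reducer.observe(name)
print(reducer.report())
```

Four observed events reduce to a two-entry summary; over a long session the report stays the size of the set of distinct event names rather than growing with the stream.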

The Evolution Problem • Observation • data collection needs will typically evolve over time (perhaps due to results of earlier data collection) more rapidly than the application • unnecessary coupling of data collection and application code can increase cost and even cripple evolution of data collection • Hypothesis • “evolvable” data collection mechanisms (based on encapsulating abstraction, selection, context-capture, and reduction decisions) can be constructed to allow data collection to evolve over time without impacting application deployment or use • this can impact the practicality of performing data collection
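One way to picture decoupling data collection from application code is to keep collection decisions in a separately loaded specification. The JSON layout, the agent names, and the `load_agents` and `fired` helpers below are assumptions for illustration, not the prototype's actual agent format.

```python
import json

# Two hypothetical versions of a collection spec, stored apart from the app.
SPEC_V1 = '{"agents": [{"name": "PrintUsed", "trigger": "MENU:Print"}]}'
SPEC_V2 = (
    '{"agents": [{"name": "PrintUsed", "trigger": "MENU:Print"},'
    ' {"name": "HelpUsed", "trigger": "MENU:Help"}]}'
)


def load_agents(spec_text):
    """Parse a spec; in practice the text might be fetched at startup."""
    return json.loads(spec_text)["agents"]


def fired(agents, events):
    """Names of agents whose trigger event appears in the event stream."""
    seen = set(events)
    return [agent["name"] for agent in agents if agent["trigger"] in seen]


events = ["MENU:Help", "MENU:Print"]
print(fired(load_agents(SPEC_V1), events))
print(fired(load_agents(SPEC_V2), events))
```

Moving from the first spec to the second changes what is collected without touching `fired` or any application code, which is the property the hypothesis is after.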

Approach • Expectation-Driven Event Monitoring (EDEM)

EDEM Architecture (figure; labels only) • Development Computer • HTTP Server • EDEM Server • Agent Specs • Collected Data • User Computer • Java Virtual Machine • Application UI Components • EDEM Active Agents • Top Level Window & UI Events • Property Queries • Property Values • Agent Specs saved w/ URL • Agent Specs loaded via URL • Agent Reports sent via E-mail

Reference Architecture (figure) • Data Capture • Abstraction, Selection, Context, Reduction • Data Packaging • Data Transport • Data Prep • Data Analysis • System Model of UI & App: Components, Events, Properties, Methods • Analyst Model of UI & App: Features, Dialogs, Controls, User-Supplied Values, User Tasks • Mapping

Reference Architecture (Word IV) (figure) • Instrumentation intertwined w/ app • Data Capture • Abstraction, Selection, Context, Reduction • Data Packaging • Data Transport • Data Prep • Data Analysis • System Model of UI & App: Components, Events, Properties, Methods • Analyst Model of UI & App: Features, Dialogs, Controls, User-Supplied Values, User Tasks • Mapping

Reference Architecture (Office IV) (figure) • Event monitoring infrastructure • Data Capture • Abstraction, Selection, Context, Reduction • Data Packaging • Data Transport • Data Prep • Data Analysis • TestWizard Database of Office UI • System Model of UI & App: Components, Events, Properties, Methods • Analyst Model of UI & App: Features, Dialogs, Controls, User-Supplied Values, User Tasks • Mapping

Reference Architecture (EDEM) (figure) • Event monitoring infrastructure • Data Capture • Abstraction, Selection, Context, Reduction • Data Packaging • Data Transport • Data Prep • Data Analysis • “Pluggable” Data Abstraction, Selection, Context-Capture, and Reduction Expectation Agents • System Model of UI & App: Components, Events, Properties, Methods • Analyst Model of UI & App: Features, Dialogs, Controls, User-Supplied Values, User Tasks • Mapping

Dissertation Progress (Product: Status; Comments) • Prototype: Needs Work; prototype requires porting and other extensions • Theory and Method: Needs Work; theory and method require further elaboration • Reference Architecture: Near Done; design guidance requires further elaboration • Survey: Done; N/A • Informal evaluation: Done; N/A • Participant observation: Near Done; further analysis of observations required

Dissemination Progress (Description, Venue: Status; X marks under the columns Prototype, Theory/Method, Reference Arch., Techniques Survey) • Conf. Demo, ICS97: Accepted; X • Conf. Demo, IUI98: Accepted; X X • Conf. Paper, ICSE98: Accepted; X X X • Conf. Paper, Agents98: Accepted; X X X • Work. Paper, CSCW98: Accepted; X X • Journ. Paper, IEEE TSE: In Review; X X X X • Journ. Paper, ACM Surveys: In Review; X

Schedule for Work Remaining (Product: Schedule; Comments) • Prototype extension: Dec-Jan ‘99; port, update event model, explicit support for 5 techniques • Theoretical elaboration: Jan-Feb ‘99; elaborate theory/method based on “participant observation” • Document results: Feb ‘99; should already be well into writing • Final defense: May ‘99; schedule ahead of time w/ Grudin • Buffer period: May-Jul ‘99; wrap up any loose ends

Dissertation Outline • Introduction (General Introduction) • Expectations in Software Development (highlight theory) • Impact of the Internet (problems and opportunities) • Problems with Current Practice (usability and beta testing) • Proposed Solution (foreshadow insights, approach, contributions) • Extracting Usage Data from User Interaction Events (State of the Art) • Synch and Search • Abstraction, Filtering, and Recoding • Counts and Summary Statistics • Sequence Detection • Sequence Comparison • Sequence Characterization • Visualization • Integrated Support

Dissertation Outline (cont’d) • Key Problems and Insights (Problem Statement) • The Abstraction Problem (meaningfulness) • The Selection Problem (meaningfulness) • The Context Problem (meaningfulness) • The Reduction Problem (scalability/practicality) • The Evolution Problem (scalability/practicality) • Interdependencies and Interactions • Need for Theoretical and Methodological Guidance

Dissertation Outline (cont’d) • Expectation-Driven Event Monitoring (Solution Statement) • Theory and Method (based on research and Microsoft experience) • Expectations in development • Identifying expectations • Integrating data collection in the development process • Analyzing data and interpreting results • A sample usage data collection process • Techniques for Addressing Current Limitations (description of prototype) • Data Abstraction • Data Selection • Context Capture • Data Reduction • Evolution • Reference Architecture (based on prototype and Microsoft experience) • Architectural components and relationships • Supporting large-scale data collection

Dissertation Outline (cont’d) • Experience and Evaluation (Evaluation of Solution) • The GTN scenario • Study Goals • Description • Results • Participant observation of an industrial project • Study Goals • Description • Results • Collection, analysis, and reporting goals • Challenges and limitations (addressed by this research) • Lessons learned (informing this research)

Dissertation Outline (cont’d) • Conclusions • Conclusions • Summary of Contributions • Future Research • References • Appendices

Future Research • Large-scale evaluation of research in practice • nature of usage information • issues in interpretation and incorporation of results • evolution and maintenance issues • Other possible extensions • exploit relationships between expectations and other requirements-related artifacts, e.g. use cases, cognitive walkthroughs, task analysis • explore issues of adaptability and reuse of infrastructure and default analyses • analysis of changes in usage over time • analysis of usage involving multiple cooperating users

Other Possible Applications • Support for adaptive UI/application behavior based on long-term information about user (or users’) actions • Support for "smarter" delivery of help/suggestions/assistance based on long-term information about user (or users’) actions • Support for monitoring of other component-based software systems • low-level data must be related to higher level concepts of interest • available information exceeds that which can practically be collected • data collection needs evolve over time more quickly than application

Research Process (figure; labels only) • Elements: Survey, Theory/Method, Prototype, Reference Architecture, GTN Scenario, Microsoft Experience • Connection labels: Motivation, Insight, Evaluation
