Better Logging to Improve Interactive Data Analysis Tools


Presentation Transcript


  1. Better Logging to Improve Interactive Data Analysis Tools Sara Alspaugh . . . . . . . . . alspaugh@eecs.berkeley.edu Archana Ganapathi . . . . . . . aganapathi@splunk.com Marti Hearst . . . . . . . . . . . . . . hearst@berkeley.edu Randy Katz . . . . . . . . . . . . . randy@eecs.berkeley.edu

  2. 09-28-2012 18:28:01.134 -0700 INFO AuditLogger - Audit:[timestamp=09-28-2012 18:28:01.134, user=splunk-system-user, action=search, info=granted, search_id='scheduler__nobody__testing__RMD56569fcf2f137b840_at_1348882080_101256', search='search index=_internal metrics per_sourcetype_thruput | head 100', autojoin='1', buckets=0, ttl=120, max_count=500000, maxtime=8640000, enable_lookups='1', extra_fields='', apiStartTime='ZERO_TIME', apiEndTime='Fri Sep 28 18:28:00 2012', savedsearch_name="sample scheduled search for dashboards (existing job case)"] event

  3. The same audit event, shown with the timestamp field highlighted.

  4. The same audit event, with the timestamp and user fields highlighted.

  5. The same audit event, with the timestamp, user, and action fields highlighted.

  6. The same audit event, annotated with the broader categories of information worth logging: event parameters, execution environment, configuration and version, stack trace, and action.
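An audit event in this key=value format can be parsed back into its fields mechanically. A minimal sketch (the parsing logic is an assumption for illustration, not Splunk's own parser; the event text is abridged from the slide):

```python
import re

def parse_audit_event(line):
    """Split a Splunk-style audit line into its key=value fields.

    Assumes the Audit:[...] body holds comma-separated key=value pairs,
    with values optionally wrapped in single or double quotes.
    """
    body = re.search(r"Audit:\[(.*)\]", line).group(1)
    fields = {}
    # Match key=value; a quoted value may contain commas and pipes.
    for m in re.finditer(r"(\w+)=('[^']*'|\"[^\"]*\"|[^,]*)", body):
        fields[m.group(1)] = m.group(2).strip("'\"")
    return fields

event = ("09-28-2012 18:28:01.134 -0700 INFO AuditLogger - "
         "Audit:[timestamp=09-28-2012 18:28:01.134, user=splunk-system-user, "
         "action=search, info=granted, "
         "search='search index=_internal | head 100', ttl=120, max_count=500000]")
fields = parse_audit_event(event)
```

Note that the quoted `search` value survives intact even though it contains spaces and a pipe, because the quoted alternative is tried before the bare-value one.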

  7. Motivation Why do we need better logging?

  8. Visualizing records of user activity to help optimize the user experience, using the Google Analytics Goal Flow tool.

  9. Applications of Good User Activity Records • recommenders • predictive interfaces • task guidelines • activity visualizations • traffic analysis • UX optimization Jaideep Srivastava, Robert Cooley, Mukund Deshpande and Pang-Ning Tan. “Web usage mining: discovery and applications of usage patterns from web data.” SIGKDD Explorations Newsletter. 2000.

  10. Examples of this in IDEA tools • SYF: Systematic yet flexible (Perer and Shneiderman) • social network analysis tool • task guidelines for exploring social network data • users can provide feedback on task usefulness • records when users have completed tasks • SeeDB (Parameswaran, Polyzotis, Garcia-Molina) • recommends visualizations for a given SQL query Adam Perer and Ben Shneiderman. “Systematic yet flexible discovery: guiding domain experts through exploratory data analysis.” Conference on Intelligent User Interfaces (IUI). 2008. Aditya Parameswaran, Neoklis Polyzotis, and Hector Garcia-Molina. “SeeDB: visualizing database queries efficiently.” International Conference on Very Large Databases (VLDB). 2013.

  11. “Understanding the domain experts’ tasks is necessary to defining the systematic steps for guided discovery. Although some professions such as physicians, field biologists, and forensic scientists have specific methodologies defined for accomplishing tasks, this is rarer in data analysis. Interviewing analysts, reviewing current software approaches, and tabulating techniques common in research publications are important ways to deduce these steps.”

  12. Some problems with logging • ICSE 2012 study of logging best practices • looks at four top OSS projects, finds logging is: • “often a subjective and arbitrary practice” • “seldom a core feature provided by the vendors” • “written as ‘after-thoughts’ after a failure” • “arbitrary decisions on when, what and where to log” Ding Yuan, Soyeon Park, and Yuanyuan Zhou. “Characterizing logging practices in open-source software.” International Conference on Software Engineering (ICSE). 2012.

  13. “. . . it is critical to gain access to a stream of user actions. Unfortunately, systems and applications have not been written with an eye to user modeling." Eric Horvitz, Jack Breese, David Heckerman, David Hovel, and Koos Rommelse. “The Lumière project: Bayesian user modeling for inferring the goals and needs of software users.” Conference on Uncertainty in Artificial Intelligence. 1998.

  14. Recommendations • Plan ahead to capture high-level user actions when designing the system. • Track detailed provenance for all events. • Observe intermediate user actions that are not “submitted” to the system. • Record the metadata and statistics of the data set being analyzed. • Collect user goals and feedback. • Work towards a standard for logging data analysis activity records.

  15. Recommendation #1 Plan ahead to capture high-level user actions when designing the system.

  16. High-level task: clustering in Excel
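Following this recommendation means logging the user's semantic action as a first-class record, rather than only the widget events that implemented it. A minimal sketch (the action vocabulary and field names are illustrative assumptions, not a standard):

```python
import io
import json
import time

def log_user_action(action, target, parameters, out):
    """Emit one structured record per high-level user action.

    `action` is a verb from an agreed vocabulary (e.g. "filter",
    "cluster", "sort"), not a raw UI event like "click" or "keypress".
    """
    record = {
        "ts": time.time(),
        "action": action,
        "target": target,         # the data object acted upon
        "parameters": parameters,  # arguments that made the action concrete
    }
    out.write(json.dumps(record) + "\n")

# A clustering step like the Excel example would yield one semantic
# record, instead of the dozens of clicks and drags that performed it.
buf = io.StringIO()
log_user_action("cluster", "sales.csv", {"k": 3, "columns": ["x", "y"]}, buf)
```

The payoff is that a later analysis can count "clusterings" directly, with no heuristic reconstruction from low-level event streams.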

  17. Examples of this in IDEA tools • HARVEST (Gotz and Zhou) • visual analytics tool that incorporates action semantics, not low-level events, as a core design element • based on a catalogue of common analytic actions derived through a review of many analytics systems • exposes high-level actions that retain rich semantics as the way of interacting with data David Gotz and Michelle Zhou. “Characterizing users’ visual analytic activity for insight provenance.” Symposium on Visual Analytics Science and Technology (VAST). 2008.

  18. “...work in this area has relied on either manually recorded provenance (e.g., user notes) or automatically recorded event-based insight provenance (e.g., clicks, drags, and key-presses), both approaches have fundamental limitations.”

  19. Recommendation #2 Track detailed provenance for all events.

  20. Sources of data transformation activity: triggered by a dashboard reload; issued from an external user script; interactively entered at the search bar. Bad if the same event is logged for each: 09-28-2012 18:28:01.134 -0700 INFO AuditLogger - Audit:[timestamp=09-28-2012 18:28:01.134, user=salspaugh, action=search, info=granted, search_id='scheduler__nobody__testing__RMD56569fcf2f137b840_at_1348882080_101256', search='search source=*access_log* | eval http_success = if(status=200, true, false) | timechart count by http_success', autojoin='1', buckets=0, ttl=120, max_count=500000, maxtime=8640000, enable_lookups='1', extra_fields='', apiStartTime='ZERO_TIME', apiEndTime='Fri Sep 28 18:28:00 2012', savedsearch_name=""]
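The ambiguity on slide 20 disappears if each logged event carries its origin explicitly. A minimal sketch (the `origin` values mirror the three sources named on the slide; the field names are assumptions):

```python
# The three distinct sources of search activity named on slide 20.
ORIGINS = {"dashboard_reload", "external_script", "search_bar"}

def make_search_event(user, search, origin):
    """Build a search event that records *how* the search was issued,
    so scheduled reloads and interactive queries stay distinguishable."""
    if origin not in ORIGINS:
        raise ValueError(f"unknown origin: {origin}")
    return {"user": user, "action": "search",
            "search": search, "origin": origin}

interactive = make_search_event("salspaugh",
                                "search source=*access_log* | head 10",
                                "search_bar")
scheduled = make_search_event("splunk-system-user",
                              "search source=*access_log* | head 10",
                              "dashboard_reload")
# Same query text, but the two records are no longer identical.
```

Without the `origin` field, a usage study would count the scheduled reload as if a person had typed the query, skewing any model of analyst behavior.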

  21. “...the log files do not differentiate between Show Me and Show Me Alternatives. These commands are implemented with the same code and the log entry is generated when the command is successfully executed.” Visualization recommendation in Tableau’s Show Me.

  22. Recommendation #3 Record the metadata and statistics of the data set being analyzed.

  23. Toy example: an influence diagram in which data influences action, and the corresponding conditional probability table P(action | data).
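A conditional probability table like the toy example on this slide can be estimated directly from activity records by counting. A minimal sketch (the records and category names are invented for illustration):

```python
from collections import Counter, defaultdict

def conditional_probability_table(records):
    """Estimate P(action | data) from (data_type, action) pairs."""
    counts = defaultdict(Counter)
    for data_type, action in records:
        counts[data_type][action] += 1
    return {d: {a: n / sum(c.values()) for a, n in c.items()}
            for d, c in counts.items()}

records = [("time_series", "timechart"), ("time_series", "timechart"),
           ("time_series", "stats"), ("categorical", "top")]
table = conditional_probability_table(records)
# table["time_series"]["timechart"] is 2/3: a recommender could rank
# "timechart" first when the user loads time-series data.
```

This is exactly where logging the data set's metadata pays off: without a recorded `data_type` per event, the conditioning variable cannot be reconstructed.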

  24. Initial recommendation ranking vs. recommendation ranking based on the data: the Wolfram Predictive Interface in Mathematica.

  25. Recommendation #4 Collect user goals and feedback.

  26. Recommendation #5 Work towards a standard for logging data analysis activity records.

  27. Conclusion • Goal: improve interactive data exploration and analysis (IDEA): interfaces, recommender systems, task guidelines, predictive suggestions • Problem: need better data to mine • Recommendations for logging IDEA activity • When you build your next system for IDEA, will you consider how you log user activity?
