Efficiency and Reliability of the Transit Data Lifecycle

Efficiency and Reliability of the Transit Data Lifecycle. A study of multimodal migration, storage, and retrieval techniques for public transit data. Presented by: Matthew Ahrens. Faculty Mentor: Dr. Uma Shama.

Presentation Transcript


  1. Efficiency and Reliability of the Transit Data Lifecycle • A study of multimodal migration, storage, and retrieval techniques for public transit data Presented by: Matthew Ahrens Faculty Mentor: Dr. Uma Shama

  2. Overview- Background • GeoGraphics Lab • Maintain public transit data for Regional Transit Authorities (RTAs) in the Commonwealth of Massachusetts. • Services • Digitizing of static schedule data • Dynamic and real-time vehicle location data • Consultation and expert advice role

  3. Overview- Background • This project • Interdisciplinary between Mathematics and Computer Science • Focus on real-world / business applications of data analysis • Time Span • Spring 2013 • exploratory analysis • Summer 2013 and ATP summer grant • Modeling experiments • Fall 2013 • Implementation and integration

  4. Overview- Background • This project – cont. • Evolved through several iterations • Original Purpose: Spatial analysis on ridership and vehicle location data • Four problem areas emerged, shifting the project's focus over time • 1. Concepts were unclear among authorities • 2. Data collection tools were inconsistent for historical analysis purposes • 3. Development on systems affected core features • 4. Documentation for systems existed only in code, with no clear point of injection

  5. Overview- Outline • Four sections • Abstraction and modeling of transit data • Analysis of design patterns and algorithms with comparison to existing systems • The design and implementation of a context-free data model • The design and implementation of a multimodal, application-level interface

  6. Abstraction • Research Questions • How can the different transit data protocols be described to compromise between conflicting definitions and structures? • Is there a compromise that can be reached that is still purposeful and clear? • Purpose • Comparison of three authorities • GTFS / GTFS-realtime • TCIP • Proprietary (various).

  7. Abstraction • GTFS Example • Pros: • Descriptive, data type or storage inclusive. • Separates fields required for definition from optional metadata • Cons: • Perspective of transit user • Many definitions do not have explicit relationships

  8. Abstraction • GTFS-Realtime Example • Pros: • Descriptive, data type or storage inclusive. • Separates fields required for definition from optional metadata • Cons: • Defined as a feed, no distinction or limitation of rate • Optional fields not purposeful for minimum definition or structure.

  9. Abstraction • TCIP Example • Pros: • Complete, covers every aspect of transit • Cons: • Vague • Concerned with relationships between data systems • Specifies medium over message, requires XML/XSD format but does not clearly define data elements

  10. Abstraction • Proprietary Example - Esri • Pros: • Shows relationships between geospatial definitions • Standard leader for GIS protocols (GML, OpenGeo) • Cons: • Concerned with GIS and use definitions over technical definitions • Missing most transit data concepts

  11. Abstraction • Methodology • Create an understandable, unambiguous definition for common transit concepts • Use as few primitives as possible to ease implementation • Use composition to aggregate data • Two options considered • Define an object-method relationship • Define a set-theoretical model of transit data structures

  12. Abstraction • Methodology • Remove implementation- and use-specific context from transit data structures • Find minimum required composition • Acknowledge commonly attributed metadata • Define data by production mechanism rate

  13. Abstraction • Disambiguation • Real-time • Produced frequently in real-time • Best represented as a signal or a message stream • Dynamic • Infrequent but unknown rate of production • Best represented as a feed • Static • Infrequent, known interval rate of production • File system or other static resource
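
The three categories on this slide can be read as a small decision rule on production rate. The sketch below is only an illustration of that reading; the names `Production` and `classify`, and the reduction of "rate" to two booleans, are assumptions and not part of the original presentation.

```python
from enum import Enum


class Production(Enum):
    """Production categories, each paired with the representation named on the slide."""
    REAL_TIME = "signal or message stream"
    DYNAMIC = "feed"
    STATIC = "file system or other static resource"


def classify(frequent: bool, interval_known: bool) -> Production:
    """Classify a data source by how it is produced (hypothetical helper).

    frequent:       data is produced frequently, in real time
    interval_known: the interval between productions is known in advance
    """
    if frequent:
        return Production.REAL_TIME
    # Infrequent data: a known interval makes it static, an unknown one dynamic.
    return Production.STATIC if interval_known else Production.DYNAMIC


avl_kind = classify(frequent=True, interval_known=False)
schedule_kind = classify(frequent=False, interval_known=True)
alert_kind = classify(frequent=False, interval_known=False)
```

Under these assumptions a vehicle location source is real-time, a published schedule is static, and an irregularly updated source is dynamic.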

  14. Abstraction • Results • Data flow model influenced the decision

  15. Abstraction • Results • Set Theoretical Model • Description • Define implementation-independent primitives • Compose transit data structures from those primitives • Define complex data structures as supersets of simple structures

  16. Abstraction • Commonly used examples • Primitives • Geolocation • Datetime • Unique, index-friendly ID (numeric, simple text) • Simple structure • Stop • Trip • Composite Structures • AVL • ETA

  17. Abstraction • Composition Example
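
The composition figure for this slide did not survive in the transcript. As a stand-in, here is a minimal sketch of how the primitives (geolocation, datetime, ID) might compose into simple and composite structures. Every field name below is an assumption chosen for illustration; the presentation does not list the actual fields.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Geolocation:
    """Primitive: a point on the map (field names are assumed)."""
    latitude: float
    longitude: float


@dataclass(frozen=True)
class Stop:
    """Simple structure: an ID plus a geolocation."""
    stop_id: int
    location: Geolocation


@dataclass(frozen=True)
class Trip:
    """Simple structure: an ID plus the ordered stop IDs it visits."""
    trip_id: int
    stop_ids: tuple


@dataclass(frozen=True)
class AVL:
    """Composite: vehicle ID + geolocation + datetime."""
    vehicle_id: int
    location: Geolocation
    observed_at: datetime


@dataclass(frozen=True)
class ETA:
    """Composite: references a vehicle, a trip, and a stop, plus a datetime."""
    vehicle_id: int
    trip_id: int
    stop_id: int
    expected_at: datetime


stop = Stop(stop_id=12, location=Geolocation(41.99, -70.97))
trip = Trip(trip_id=3, stop_ids=(12, 14, 15))
avl = AVL(vehicle_id=7, location=Geolocation(41.98, -70.96),
          observed_at=datetime(2013, 10, 1, 8, 30))
eta = ETA(vehicle_id=avl.vehicle_id, trip_id=trip.trip_id,
          stop_id=stop.stop_id, expected_at=datetime(2013, 10, 1, 8, 42))
```

Keeping the required fields to the minimum set and leaving metadata out of the core types mirrors the "minimum required composition" goal stated in the methodology.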

  18. Data Migration • Research Questions • What technologies, techniques, or models most efficiently and reliably move transit data from producer to consumer? • Which of those best embody the concepts of reuse and extensibility? • Which ones are most resistant to the need for modification and internal maintenance?

  19. Data Migration • Purpose • Perform exploratory work to set standards for handling data transit • Which of those best embody the concepts of reuse and extensibility? • Which ones are most resistant to the need for modification and internal maintenance?

  20. Data Migration • Methodology • Study of BusLocator, the current data migration technology for AVL and route-specific data • Duplication of Timer-event concurrency model for real-time data • Pull design pattern vs. Push design pattern • Approximation Algorithms

  21. Data Migration • BusLocator • C# Microsoft Solution in two parts • Windows Service using Timer-event concurrency • Pulls AVL data every 30 minutes • Pulls route data every 5 minutes • Sends via SOAP to WCF service • WCF • Webservice endpoint • Accepts data • Parses and stores in SQL tables

  22. Data Migration • Graphical Depiction

  23. Data Migration • Major bottlenecks • Event timer • Problems • Pulls too slowly to deliver real-time produced data for real-time consumption • Pulls over a timeframe, sending duplicates over the wire • Does not scale or load balance • SOAP XML message is large and metadata-heavy • Not optimal for real-time

  24. Data Migration • Effort to duplicate for ETA • Pull from ETA feed as REST service via XML

  25. Data Migration • Effort to duplicate for ETA • Purposes • Analytical use of AVL data as static resource, not real-time • Made easier to organize by set-theory model • Able to composite ETA from other sources • Able to automate analysis

  26. Data Migration • Effort to duplicate for ETA • Problems • AVL not complete for historical use • Led to development of clear definition of AVL and other transit data structures • Showed need for new system • Replace BusLocator • Define development framework for transit applications • Eliminate pull or approximate push design pattern

  27. Data Migration • Pull vs. Push • Pull design pattern • A.k.a. Request-response, on-demand • Client (unknown) sends request to Server/Source (known) • Server processes and responds • Push design pattern • Subscription pattern • Client establishes connection to Server • Server pushes response to client upon local event
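
The contrast between the two patterns can be sketched in a few lines. This is an in-process illustration only, with no networking; the `Source` class and its method names are hypothetical and not taken from BusLocator or any system in the presentation.

```python
class Source:
    """A data producer that supports both consumption patterns (illustrative)."""

    def __init__(self):
        self._latest = None
        self._subscribers = []

    def get_latest(self):
        """Pull: the client asks, the source responds with its current state."""
        return self._latest

    def subscribe(self, callback):
        """Push: the client registers once and is called on every local event."""
        self._subscribers.append(callback)

    def produce(self, record):
        """A local event: store the record and push it to every subscriber."""
        self._latest = record
        for callback in self._subscribers:
            callback(record)


source = Source()

# Push client: subscribes before any data is produced.
pushed = []
source.subscribe(pushed.append)

# Two records are produced before the pull client gets around to asking.
source.produce({"vehicle_id": 7, "seq": 1})
source.produce({"vehicle_id": 7, "seq": 2})

# Pull client: a single request sees only the most recent record.
pulled = [source.get_latest()]
```

Here the push client receives both records as they are produced, while the pull client sees only the second one, which is the gap a slow timer-driven pull leaves when data is produced in real time.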

  29. Data Migration • Pull best use cases • When data is not consumed as a stream • Need the most recent data once or on demand • Example

  30. Data Migration • Push approximating • Push is appropriate for real-time produced data • Goal • minimize time between production and availability for use • Problem • Push not supported by all web communication • Solution • Pull approximation

  31. Data Migration • Appx. 1 – timer event approximation • Goal • Predict the rate of production using historical data • Method • Exponential Moving Average • Use previous history and predictions to make future predictions • Keep track of the average interval between data updates • Take proportion of history for accuracy • Take proportion of predictions for smoothing

  32. Data Migration • Exponential Moving Average example • Real data hard to monitor, simulation was created • Simulate 10 vehicles • 10% chance of packet drop • Measurement criteria • Minimize difference between production time and consumption time • Minimize redundant data packets • Minimize dropped packets

  33. Data Migration • Exponential Moving Average example • Cache-free model was developed • Emulating current system • Adaptable to batch query and changing vehicle configuration • Measure average previous interval

  34. Data Migration • Exponential Moving Average example • Pseudocode
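
The pseudocode shown on this slide is not preserved in the transcript. The sketch below is a generic exponential moving average over the intervals between updates, written to match the description on the earlier slide (a proportion of observed history for accuracy, a proportion of the previous prediction for smoothing). The class name, the `alpha` parameter, and the starting interval are assumptions, not the author's code.

```python
class IntervalPredictor:
    """Predict the next update interval with an exponential moving average.

    prediction = alpha * observed_interval + (1 - alpha) * previous_prediction
    """

    def __init__(self, alpha: float, initial_interval: float):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha                  # weight given to observed history
        self.prediction = initial_interval  # current guess, in seconds
        self._last_seen = None              # timestamp of the previous update

    def observe(self, timestamp: float) -> float:
        """Record an update at `timestamp` and return the new prediction."""
        if self._last_seen is not None:
            interval = timestamp - self._last_seen
            self.prediction = (self.alpha * interval
                               + (1.0 - self.alpha) * self.prediction)
        self._last_seen = timestamp
        return self.prediction


predictor = IntervalPredictor(alpha=0.5, initial_interval=10.0)
first = predictor.observe(0.0)    # no interval yet, prediction unchanged
second = predictor.observe(20.0)  # 0.5 * 20 + 0.5 * 10
third = predictor.observe(40.0)   # 0.5 * 20 + 0.5 * 15
```

A timer driven by `prediction` would pull again after the predicted interval, so the pull rate drifts toward the observed production rate instead of staying fixed.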

  35. Data Migration • Exponential Moving Average example • Results

  36. Implementation: GLaaS Model and API • Goals • Taking the knowledge gained so far, implement and document a framework that exhibits best practices • Avoid anti-patterns • Choose the best medium for the job • Separate data, metadata, and implementation data • Keep business logic separate from data management • Migrate data near production rate • Multimodal retrieval and consumption mechanisms

  37. Implementation: GLaaS Model and API • Considerations • Security • Closed Pipe vs. Open Pipe • Authentication • Access level • Differential Privacy • Analysis protection • Reusability • Maintenance • Scalability • Documentation and Training

  38. GLaaS Model • Database Schema • Feature oriented • Consider transit data primitives as features • Make set defined elements required fields • Make metadata Optional fields • Design iterations • Trigger based trickle down model • Purpose • Fight over-index anti-pattern • Minimize select time purposefully • Output chain, batch-oriented

  39. GLaaS Model • Structure • Tables • Primary • Insert Entry point • Guaranteed for analysis use • Acts as contract and definition of feature • Trigger • On insert, pushes and updates specific tables • Specific • Select / update point • Only accessible by stored procedure • Info • Metadata chainable by indexed fields

  40. GLaaS Model • Refactoring • Triggers did not work the way intended • Appearance • Separate files, separate queries • Resemble event handling • Simple and Concurrent in imperative languages • Function • Append to insert query • Not concurrent • Artificial dependency • Traced • One failure invalidates entire insert -- including original
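
The last point, that one failure invalidates the entire insert including the original, can be reproduced in a small experiment. The sketch uses SQLite through Python's `sqlite3` module purely as a convenient stand-in; the presentation does not name SQLite, and the table and trigger names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE primary_avl (vehicle_id INTEGER, lat REAL, lon REAL);
CREATE TABLE specific_latest (
    vehicle_id INTEGER PRIMARY KEY,
    lat REAL NOT NULL,
    lon REAL NOT NULL
);
-- Trickle-down trigger: every primary insert is copied to the specific table.
CREATE TRIGGER avl_trickle AFTER INSERT ON primary_avl
BEGIN
    INSERT INTO specific_latest (vehicle_id, lat, lon)
    VALUES (NEW.vehicle_id, NEW.lat, NEW.lon);
END;
""")

# The primary table accepts a NULL latitude, but the specific table does not,
# so the trigger fails and takes the original insert down with it.
error = None
try:
    conn.execute("INSERT INTO primary_avl VALUES (?, ?, ?)", (7, None, -70.9))
except sqlite3.IntegrityError as exc:
    error = str(exc)

rows_after_failure = conn.execute(
    "SELECT COUNT(*) FROM primary_avl").fetchone()[0]

# A valid record passes through both tables.
conn.execute("INSERT INTO primary_avl VALUES (?, ?, ?)", (7, 42.0, -70.9))
primary_rows = conn.execute("SELECT COUNT(*) FROM primary_avl").fetchone()[0]
specific_rows = conn.execute("SELECT COUNT(*) FROM specific_latest").fetchone()[0]
```

After the failed insert the primary table is still empty, which is the artificial dependency described above: the trigger body runs as part of the insert statement rather than as an independent, concurrent handler.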

  41. GLaaS Model • Output variable • Represents inserted data similar to trigger • Called from and insert into primary stored procedures • Calls down the chain, separated by query delimiter • Enforces statically declared batching • Concurrent, let SQL environment make dependency decisions • Responsible for populating specific tables

  42. GLaaS Model • Results, integrity and protocol

  43. GLaaS Model • Explicit use of API and Stored Procedures • No direct application level queries • API only approved access point • Explicit enforcement of authentication by function not by data type • Eliminates need for application specific tables • Fights SQL injection
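
One reason a fixed access point helps against SQL injection is that caller input is bound as a parameter and never spliced into the query text. The sketch below illustrates that idea with a single approved function over SQLite; SQLite has no stored procedures, so the function `get_stops_by_name` only plays that role here, and the table and data are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stop (stop_id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO stop VALUES (?, ?)",
                 [(1, "Main St"), (2, "Campus Center")])


def get_stops_by_name(name: str):
    """The only approved way to query stops by name (hypothetical API call).

    The query text is fixed; `name` is passed as a bound parameter, so it is
    always treated as data and never parsed as SQL.
    """
    cursor = conn.execute(
        "SELECT stop_id, name FROM stop WHERE name = ?", (name,))
    return cursor.fetchall()


normal = get_stops_by_name("Main St")
attack = get_stops_by_name("x' OR '1'='1")
```

The classic injection string simply fails to match any stop name, because it is compared as a literal value.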

  44. GLaaS API • Multimodal approach to consumption • Mechanism for static, on-demand, and real-time consumption • File system and known URI • Similar to GTFS-realtime implementation • Application specific feed format • Request-Response • REST in several mediums • Binds to specific URI and HTTP Verb • Eliminates need for expensive header • SOAP backwards compatibility • Subscription model via push pattern • WebSocket

  45. GLaaS API • SOAP vs. REST • SOAP • XML defined package • URIs surrogate for Endpoints • 1 URI per service • Message header contains definitions and method bindings • RPC • Message data contains payload

  46. GLaaS API • SOAP vs. REST • SOAP definition example for AVL
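
The SOAP definition shown on this slide is not preserved in the transcript. To give a rough sense of the packaging overhead, the sketch below wraps one AVL record in a SOAP 1.1 envelope and compares it with the same record as JSON. The service namespace, the `SubmitAvl` operation name, and the record fields are hypothetical; only the SOAP envelope namespace is standard.

```python
import json
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
SERVICE_NS = "http://example.org/transit"              # hypothetical service

record = {"vehicle_id": 7, "lat": 42.0, "lon": -70.9}

# Envelope > Header + Body > operation element > one child per field.
envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
ET.SubElement(envelope, f"{{{SOAP_NS}}}Header")
body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
call = ET.SubElement(body, f"{{{SERVICE_NS}}}SubmitAvl")
for key, value in record.items():
    ET.SubElement(call, f"{{{SERVICE_NS}}}{key}").text = str(value)

soap_message = ET.tostring(envelope, encoding="unicode")
json_message = json.dumps(record)

overhead = len(soap_message) - len(json_message)
```

Even with an empty header, the envelope is several times longer than the JSON body for the same three fields, which is the "large, metadata heavy" cost noted for the existing system.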

  47. GLaaS API • SOAP vs. REST • REST • URI multiplexing via routes • URI structure relative to root bound to request definition • Request object definition and HTTP verb binds to method and response • Request messages • Only contain data needed for functionality • No header, light-weight • JSON, XML, URI-embedded, any custom data organization

  48. GLaaS API • SOAP vs. REST • REST
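
The REST example from this slide is also missing from the transcript. The sketch below shows the binding idea in miniature: a (HTTP verb, URI) pair selects a handler, and the request body carries only the data the handler needs. It is an in-memory dispatcher, not a web server, and the route paths, handler names, and `dispatch` function are all assumptions made for illustration.

```python
import json

ROUTES = {}     # (verb, path) -> handler
AVL_STORE = {}  # vehicle_id -> latest record (stands in for the database)


def route(verb, path):
    """Decorator that binds a handler to an HTTP verb and URI."""
    def register(handler):
        ROUTES[(verb, path)] = handler
        return handler
    return register


@route("PUT", "/avl")
def put_avl(body):
    AVL_STORE[body["vehicle_id"]] = body
    return {"stored": True}


@route("GET", "/avl")
def get_avl(body):
    return AVL_STORE.get(body["vehicle_id"])


def dispatch(verb, path, payload):
    """Look up the handler for (verb, path) and run it on the JSON payload."""
    handler = ROUTES.get((verb, path))
    if handler is None:
        return json.dumps({"error": "no route"})
    return json.dumps(handler(json.loads(payload)))


put_response = dispatch(
    "PUT", "/avl", json.dumps({"vehicle_id": 7, "lat": 42.0, "lon": -70.9}))
get_response = dispatch("GET", "/avl", json.dumps({"vehicle_id": 7}))
missing = dispatch("DELETE", "/avl", "{}")
```

The same URI serves different methods depending on the verb, so no method name or binding has to travel inside the message itself.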

  49. GLaaS API • Goals • Maintenance • Dynamically generated use documentation • Compartmentalized object definition • Requests • Response • Global Entry Point • Configuration • Application level authentication • Service Definition

  50. GLaaS API • Goals • Extensibility • Add data functionality to feature • Add specific tables • Add metadata specific data columns • Add application level functionality • Add request, response DTOs • Add service method bindings • Replication • Feature encapsulates protocol defined parts • Replicate abstraction model and appropriate retrieval mechanisms for new feature
