
Cassandra in xPatterns




Presentation Transcript


  1. Cassandra in xPatterns. Seattle Java User’s Group, May 2014. Claudiu Barbura, Sr. Director Engineering.

  2. Agenda
  • xPatterns Architecture
  • Export to NoSql API & GUI (Demo)
  • xPatterns dashboard application (Demo)
  • xPatterns monitoring and instrumentation (Demo)
  • Data model optimization
  • Publishing from HDFS/Hive/Shark to Cassandra
  • Generated REST APIs
  • Instrumentation
  • Throttling & auto-retries
  • Geo-Replication
  • Cross-data-center replication, encryption & failover
  • Lessons learned from 0.6 to 2.0.6

  3. Export to NoSql API
  • Datasets in the warehouse need to be exposed through high-throughput, low-latency, real-time APIs. Each application requires extra processing on top of the core datasets, so additional transformations are executed to build data marts inside the warehouse.
  • The exporter tool builds an efficient data model and exports data from a Shark/Hive table to a Cassandra column family through a custom Spark job with configurable throughput (a configurable number of Spark processors against a Cassandra ring). An instrumentation dashboard is embedded; logs, progress and instrumentation events are pushed through SSE.
  • Data modeling is driven by the read access patterns provided by the application engineer building dashboards and visualizations: lookup key, columns (record fields to read), paging, sorting, filtering.
  • The end result of a job run is a REST API endpoint (instrumented, monitored, resilient, geo-replicated) that uses the underlying generated Cassandra data model and feeds the data into the dashboards (see the sketch below).
  • A configuration API is provided for creating export jobs and executing them (ad hoc or scheduled).
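
  To make the generated artifacts concrete, here is a minimal sketch of what such an export could look like: a Cassandra data model derived from a read access pattern (lookup key, sort column, paging) and a Spark job copying a Hive/Shark table into it. This is not the xPatterns exporter itself; it assumes Spark 1.x-era APIs and the DataStax spark-cassandra-connector, and the keyspace, table and column names (xpatterns, provider_referrals, referral_aggregates) are illustrative only.

```scala
// Hypothetical sketch only: shows the shape of what the exporter generates, not its code.
// Assumes Spark 1.x with Hive support and the DataStax spark-cassandra-connector.
//
// CQL data model derived from the read access pattern
// "lookup by provider_id, sort by patient_count, page through referrals":
//
//   CREATE TABLE xpatterns.provider_referrals (
//     provider_id   text,
//     patient_count bigint,
//     referred_npi  text,
//     claim_count   bigint,
//     total_charged double,
//     PRIMARY KEY ((provider_id), patient_count, referred_npi)
//   ) WITH CLUSTERING ORDER BY (patient_count DESC, referred_npi ASC);

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import com.datastax.spark.connector._

object ExportToNoSql {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("export-to-nosql")
      .set("spark.cassandra.connection.host", "10.0.0.10")        // ring seed node
      .set("spark.cassandra.output.throughput_mb_per_sec", "5")   // throttle the export
    val sc = new SparkContext(conf)
    val hive = new HiveContext(sc)

    // Pull the data-mart table out of the Shark/Hive warehouse ...
    val rows = hive.sql(
      "SELECT provider_id, patient_count, referred_npi, claim_count, total_charged " +
      "FROM referral_aggregates")
      .map(r => (r.getString(0), r.getLong(1), r.getString(2), r.getLong(3), r.getDouble(4)))

    // ... and write it into the generated column family.
    rows.saveToCassandra("xpatterns", "provider_referrals",
      SomeColumns("provider_id", "patient_count", "referred_npi", "claim_count", "total_charged"))
  }
}
```

  Because each provider’s referrals land in a single partition, already sorted by the clustering column, the generated REST endpoint can serve lookup, sorting and paging as one slice read.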

  4. Mesos/Spark cluster

  5. Cassandra multi DC ring – write latency

  6. Nagios monitoring

  7. Referral Provider Network
  • One of the many applications we built for our largest healthcare customers using the xPatterns APIs and tools on the newly upgraded infrastructure: ELT Pipeline, Jaws, Export to NoSql API. The dashboard for the RPN application was built with D3.js and Angular against the generic API published by the export tool.
  • The application builds a graph of downstream and upstream referred and referring providers, grouped by specialty, with computed aggregates such as patient counts, claim counts and total charged amounts. RPN is used both for fraud detection and for aiding a clinic-buying decision by following the busiest graph paths.
  • The dataset behind the app consists of 8 billion medical records, from which we extracted 1.7 million providers (Shark warehouse) and built 53 million relationships in the graph (persisted in Cassandra).
  • While we demo the graph building we will also look at the Graphite instrumentation dashboard to analyze the runtime performance of the geo-replicated Cassandra read operations (latency in the 20-50 ms range). A sketch of such a read follows below.
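
  As a companion to the export sketch above, here is a hedged example of the kind of read the RPN dashboard’s generated endpoint performs: a single-partition lookup of one provider’s referrals, paged by the driver, at consistency level ONE as recommended in the lessons-learned slide. It uses the DataStax Java driver from Scala; the contact point, provider id and fetch size are made up for illustration, and this is not xPatterns code.

```scala
// Hypothetical read path behind the generated REST endpoint; illustrative only.
// Assumes the provider_referrals table sketched earlier and the DataStax Java driver 2.0.
import com.datastax.driver.core.{Cluster, ConsistencyLevel, SimpleStatement}
import scala.collection.JavaConverters._

object ReferralLookup {
  def main(args: Array[String]): Unit = {
    val cluster = Cluster.builder().addContactPoint("10.0.0.10").build()
    val session = cluster.connect("xpatterns")

    // One partition holds all referrals of a provider, already sorted by patient_count,
    // so paging and sorting in the dashboard map onto a single slice read.
    val stmt = new SimpleStatement(
      "SELECT referred_npi, patient_count, claim_count, total_charged " +
      "FROM provider_referrals WHERE provider_id = ?", "1234567890")
    stmt.setConsistencyLevel(ConsistencyLevel.ONE)  // "ONE is best ... for our use cases"
    stmt.setFetchSize(100)                          // driver-level paging for the dashboard

    val firstPage = session.execute(stmt).iterator().asScala.take(100)
    firstPage.foreach { row =>
      println(s"${row.getString("referred_npi")}: ${row.getLong("patient_count")} patients, " +
              s"${row.getLong("claim_count")} claims, $$${row.getDouble("total_charged")} charged")
    }
    cluster.close()
  }
}
```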

  8. Graphite – Cassandra multi DC ring

  9. VPC-to-VPC IPSEC Tunnel

  10. Lessons learned 0.6 - 2.0.6
  • NTP: synchronize ALL clocks (servers and clients)
  • Reduce the number of CFs (avoid OOM)
  • Keep rows not too skinny and not too wide (avoid OOM)
  • Less memory pressure during high-throughput writes
  • Reduced network I/O, fewer rows, more column slices
  • Key cache & bloom filter index size affects performance
  • Efficient compaction, avoid hot spots
  • Custom serialization and dynamic columns for maximum performance gain
  • Do not drop CFs before emptying them (truncate/compact first)
  • Monitoring, instrumentation, automatic restarts
  • ConsistencyLevel: ONE is best … for our use cases
  • Key cache, Snappy compression, vnodes
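
  A few of these lessons translate directly into schema settings. The sketch below is illustrative only (assumed keyspace/table names, Cassandra 2.0-era CQL, DataStax Java driver from Scala, not xPatterns code): it enables key caching and Snappy compression on a column family and empties a column family before dropping it.

```scala
// Illustrative only: a few of the lessons expressed as CQL settings applied via the driver.
// vnodes and NTP are cluster/OS-level concerns (cassandra.yaml num_tokens, ntpd on every
// server and client) and are not shown here.
import com.datastax.driver.core.Cluster

object LessonsAsSettings {
  def main(args: Array[String]): Unit = {
    val cluster = Cluster.builder().addContactPoint("10.0.0.10").build()
    val session = cluster.connect()

    // Key cache + Snappy compression on a read-heavy column family (Cassandra 2.0 syntax).
    session.execute(
      "ALTER TABLE xpatterns.provider_referrals " +
      "WITH caching = 'keys_only' " +
      "AND compression = { 'sstable_compression' : 'SnappyCompressor' }")

    // "Do not drop CFs before emptying them": truncate first, then drop.
    session.execute("TRUNCATE xpatterns.provider_referrals")
    session.execute("DROP TABLE xpatterns.provider_referrals")

    cluster.close()
  }
}
```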

  11. Q & A. Oh, btw … we’re hiring!
  • claudiu.barbura@atigeo.com
  • @claudiubarbura
