Multi-Data-Center Hadoop in a Snap

Presentation Transcript

  1. Multi-Data-Center Hadoop in a Snap • Dr. Konstantin Boudnik, Vice President, Open Source Development

  2. My background • 15-year Sun Microsystems veteran: JVM, distributed systems • Vice President, Apache Bigtop • Committer, PMC member & contributor to various ASF projects • Member of Apache IPMC • Early Hadoop committer

  3. WANdisco Background • WANdisco: Wide Area Network Distributed Computing • Enterprise-ready, high-availability software solutions that enable globally distributed organizations to meet today’s data challenges of secure storage, scalability and availability • Leader in tools for software engineers – Subversion • Apache Software Foundation sponsor • Highly successful IPO, London Stock Exchange, June 2012 (LSE:WAND) • US patent for active-active replication technology granted, November 2012 • Global locations: San Ramon (CA), Chengdu (China), Tokyo (Japan), Boston (MA), Sheffield (UK), Belfast (UK)

  4. Customers

  5. Non-Stop Hadoop: Non-Intrusive Plugin Provides Continuous Availability • In the LAN / Across the WAN • Active/Active

  6. 3 Key Problems For Multi Cluster Hadoop LAN / WAN

  7. Enterprise Ready Hadoop: Characteristics of Mission Critical Applications • Require 100% Uptime of Hadoop • SLAs, Regulatory Compliance • Require HDFS to be Deployed Globally • Share Data Between Data Centers • Data is Consistent, Not Eventually Consistent • Ease Administrative Burden • Reduce Operational Complexity • Simplify Disaster Recovery • Lower RTO/RPO • Allow Maximum Utilization of Resource • Within the Data Center • Across Data Centers

  8. Breaking Away from Active/Passive: What’s in a NameNode
     Single Standby: • Inefficient utilization of resource • Journal Nodes • ZooKeeper Nodes • Standby Node • Performance Bottleneck • Still tied to the beeper • Limited to LAN scope
     Active / Active: • All resources utilized • Only NameNode configuration • Scale as the cluster grows • All NameNodes active • Load balancing • Set resiliency (# of active NN) • Global Consistency
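The "Set resiliency (# of active NN)" item can be illustrated with a little arithmetic. The sketch below assumes the active NameNodes agree on updates by majority quorum; the slide does not state the agreement scheme, so both that assumption and the function name `tolerated_failures` are illustrative only.

```python
def tolerated_failures(active_namenodes: int) -> int:
    """Failures survivable while a majority of active NameNodes remains.

    Assumption (not stated on the slide): updates need a majority quorum.
    """
    if active_namenodes < 1:
        raise ValueError("need at least one active NameNode")
    # A majority is floor(n / 2) + 1 nodes, so at most (n - 1) // 2 may fail.
    return (active_namenodes - 1) // 2


# Resiliency for a few cluster sizes under the majority-quorum assumption.
resiliency = {n: tolerated_failures(n) for n in (1, 3, 5)}
```

Under this assumption, adding active NameNodes in pairs raises the number of simultaneous failures the cluster can absorb by one each time.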

  9. Breaking Away from Active/Passive: What’s in a Data Center
     Standby Datacenter: • Idle Resource • Single Data Center Ingest • Disaster Recovery Only • One way synchronization • DistCp • Error Prone • Clusters can diverge over time • Difficult to scale > 2 Data Centers • Complexity of sharing data increases
     Active / Active: • DR Resource Available • Ingest at all Data Centers • Run Jobs in both Data Centers • Replication is Multi-Directional active/active • Absolute Consistency • Single HDFS spans locations • ‘N’ Data Center support • Global HDFS allows appropriate data to be shared
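To make "One way synchronization" and "Clusters can diverge over time" concrete, here is a toy model in Python. It is not DistCp and does not model its options; the dictionaries, paths, and helper names are all hypothetical and stand in for two clusters' namespaces.

```python
def one_way_sync(source: dict, target: dict) -> None:
    """Copy every path from source to target (toy periodic one-way copy)."""
    target.update(source)


def diverged_paths(a: dict, b: dict) -> list:
    """Paths whose contents differ or that exist on only one side."""
    return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))


primary = {"/logs/day1": "v1"}
standby = {}

one_way_sync(primary, standby)
after_first_run = diverged_paths(primary, standby)

# Between scheduled runs: new ingest at the primary, and a write at the standby.
primary["/logs/day2"] = "v1"
standby["/reports/q1"] = "local"
before_next_run = diverged_paths(primary, standby)

# The next run pushes primary -> standby only; the standby-side write is never
# carried back, so the two namespaces stay different.
one_way_sync(primary, standby)
after_next_run = diverged_paths(primary, standby)
```

In this model the standby is stale between runs, and anything written at the standby site remains a permanent difference because the copy only ever flows one way.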

  10. Multiple Clusters: One Cluster Approach • Example Applications • HBase • RT Query • MapReduce • Poor Resource Management • Data Locality Issues • Network Use • Complex

  11. Multiple Clusters: Creating Multiple Clusters • Example Applications • HBase • RT Query • MapReduce • Need to share data between clusters • DistCp / Stale Data • Inefficient use of storage and/or network • Some clusters may not be available

  12. Cluster Zones: Zoning for Optimal Efficiency • 100% HDFS Consistency

  13. Multi Datacenter Hadoop: Disaster Recovery • Absolute Consistency • Maximum Resource Use • Lower Recovery Time/Point • WAN Replication • Replicate Only What You Want • Better Utilization of Power/Cooling • Lower TCO • LAN Speed Performance

  14. Architecture of a Non-Stop Hadoop

  15. Technical Use Cases • Eliminate Performance Bottleneck • HBase issues • Multi Data-Center Ingest • Information doesn't need to be sent to one DC and then copied back to the other using DistCp • Parallel ingest methods don’t require redirected data streams • Ingest data at, or close to, the source • Global Analysis (Logs, Click Streams, etc.) • Cluster Zones • Efficient use of resource based on application profile • HBase, MapReduce, Spark, etc. • Maximize Data Center Resource Utilization • All datacenters can be used to run different jobs concurrently • Disaster Recovery • Data is as current as possible (no periodic syncs) • Virtually zero downtime to recover from regional data center failure • Regulatory compliance
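The "no periodic syncs" point under Disaster Recovery can be put in numbers. This is a back-of-the-envelope sketch under a simple assumed model: a copy job starts every fixed interval and takes a fixed time to finish. The function name and the example figures are hypothetical, not measurements from the talk.

```python
def worst_case_rpo_minutes(sync_interval_min: float, copy_duration_min: float) -> float:
    """Upper bound on the age of data missing at the DR site with periodic copies.

    A write made just after one run starts is picked up by the next run, which
    begins one interval later and needs the copy duration to complete.
    """
    if sync_interval_min <= 0 or copy_duration_min < 0:
        raise ValueError("interval must be positive and duration non-negative")
    return sync_interval_min + copy_duration_min


# Hypothetical schedule: hourly copy job that takes 15 minutes to complete.
hourly_rpo = worst_case_rpo_minutes(60, 15)
```

With these hypothetical figures, a site failure could lose up to 75 minutes of writes; shortening the interval shrinks the bound but never removes it, which is the gap continuous replication is meant to close.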

  16. Non-Stop Hadoop Demonstration

  17. Q & A

  18. Thank you