Tree and Graph Processing On Hadoop

Tree and Graph Processing On Hadoop. Ted Malaska. Schedule: Intro; Overview of Hadoop and Eco-System; Summarize Tree Rooting; MR Overview/Implementation Options; HBase Overview/Implementation Options; Giraph Overview/Implementation Options; Spark Overview/Implementation Options; Summary.


Presentation Transcript


  1. Tree and Graph Processing On Hadoop Ted Malaska

  2. Schedule • Intro • Overview of Hadoop and Eco-System • Summarize Tree Rooting • MR Overview/Implementation Options • HBase Overview/Implementation Options • Giraph Overview/Implementation Options • Spark Overview/Implementation Options • Summary • Questions

  3. Intro • Hi there

  4. Overview of Hadoop and Eco-System [Diagram of the ecosystem. Category labels: Machine Learning, NoSql, Search, Batch, Ingestion, Streaming, RTQ, LFP. Components: Map Reduce, Pig, Crunch, Hive, Giraph, Sqoop, Flume, Kafka, NFS, Storm, Spark Streaming, Impala, Spark, Mahout, Oryx, R, Python, Streaming, SAS, HBase, Accumulo, Search, SolR. Underlying layers: Auditing and Monitoring, Security and Access Controls, HDFS.]

  5. In Scope for Tonight [Same ecosystem diagram as the previous slide.]

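  6. Summarize Tree Rooting • Basic Tree [Diagram: a tree whose vertices are labeled with their depth, from 0 at the True Root to 3 at the deepest level. Labeled parts: Vertex, Edge, Branches, Leaves, Depth, True Root.]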

  7. Summarize Tree Rooting • More Complex Tree [Diagram: a tree with depth labels 0 to 3 that also contains a Circular Link and a vertex with Multiple Parents.]

  8. Summarize Tree Rooting • Merging Trees • Borderline True Graph Problem [Diagram: trees with their own True Roots (depth 0) joined through a Multi Rooted Vertex; vertices carry depth labels 0 to 3.]

  9. Summarize Tree Rooting • Know your data

  10. Basic Storage Format • <NodeID>|<EdgeID> • Example • 101 • 101|201 • 101|202 • 201 • 202|301 • 301
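
As a rough illustration of the storage format above, the records can be loaded into an adjacency map. This is a minimal sketch that assumes `<EdgeID>` is the ID of the vertex at the other end of the edge, which is how the later rooting example reads; the function name `parse_records` is made up for illustration.

```python
def parse_records(lines):
    """Parse '<NodeID>' and '<NodeID>|<EdgeID>' records into {node: [targets]}.

    Assumption: <EdgeID> names the vertex the edge points to.
    """
    graph = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        node_id, _, edge_id = line.partition("|")
        graph.setdefault(node_id, [])          # every record declares its node
        if edge_id:
            graph[node_id].append(edge_id)     # record the outgoing edge
            graph.setdefault(edge_id, [])      # make sure the target exists too
    return graph


records = ["101", "101|201", "101|202", "201", "202|301", "301"]
graph = parse_records(records)
print(graph)
```

Registering the edge target as a node keeps the map complete even when a vertex (202 in the example) only ever appears inside edge records.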

  11. Preprocessing • Trimming Data • Nodes and edges have data • Data has weight • Normally linkage information is under 10% of true data size • Organize Data by Partitioning

  12. Basic Solution • Step 1: Identify Roots • Echo to all edges • Vertices that receive no echoes are roots • Root the root • Step 2: Walk the tree • Echo from the last newly rooted vertex to all edges • If a vertex is not already rooted, then root it • Example, input: 101 • 101|201 • 101|202 • 201 • 202|301 • 301 • After identifying roots: 101|R:101 • 101|201|R:101 • 101|202|R:101 • 201|R:Null • 202|301|R:Null • 301|R:Null • After the first walk: 101|R:101 • 101|201|R:101 • 101|202|R:101 • 201|R:101 • 202|301|R:101 • 301|R:Null • After the second walk: 101|R:101 • 101|201|R:101 • 101|202|R:101 • 201|R:101 • 202|301|R:101 • 301|R:101
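
The two steps above can be sketched in a few lines of single-machine Python. This is only an illustration of the idea, not the implementation from the talk; the adjacency map and the name `root_tree` are assumptions, and edges are taken to point from a vertex to the vertices it echoes to.

```python
def root_tree(graph):
    """Return {vertex: root} for every vertex reachable from a root."""
    # Step 1: every vertex echoes to its edges; vertices that hear nothing are roots.
    echoed = {target for targets in graph.values() for target in targets}
    rooted = {v: v for v in graph if v not in echoed}   # "root the root"

    # Step 2: echo from the newly rooted vertices until nothing new gets rooted.
    frontier = list(rooted)
    while frontier:
        next_frontier = []
        for vertex in frontier:
            for target in graph.get(vertex, []):
                if target not in rooted:                # already rooted: leave it alone
                    rooted[target] = rooted[vertex]
                    next_frontier.append(target)
        frontier = next_frontier
    return rooted


graph = {"101": ["201", "202"], "201": [], "202": ["301"], "301": []}
rooted = root_tree(graph)
print(rooted)
```

On the example data every vertex ends up with root 101, matching the final state on the slide. A vertex with multiple parents keeps whichever root reaches it first, and a pure circular link with no root is never rooted, which is one reason to know your data first.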

  13. Map Reduce • Massive parallel processing on Hadoop • Based on the Google 2004 MapReduce white paper • Able to process PBs of data

  14. Map Reduce [Diagram: Data Blocks feed Mappers, whose output passes through Sort & Shuffle.]

  15. Map Reduce • Self Joins • Always dumping two outputs: • Newly Rooted • Still Un-Rooted [Diagram: All Data feeds MR Stage 0 (Root Identifying), followed by MR Stage 1 (Rooting) and MR Stage 2 (Rooting). Each stage emits a Newly Rooted output and an Un-Rooted output that feed the next stage, while previously rooted data is set aside as Old Rooted.]

  16. Map Reduce • Great for large batch operations • No memory limit • Not good at iterations
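
The staged approach can be imitated in plain Python to show the shape of each job: a map phase that tags records, a shuffle that groups by key, and a reduce phase that emits two outputs (newly rooted and still un-rooted). This is a sketch under assumptions, not Hadoop code: `shuffle`, `stage0_identify_roots`, and `rooting_stage` are made-up names, and the reducer checks the un-rooted set directly where a real job would need a join or side data.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group (key, value) pairs by key, as sort & shuffle would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def stage0_identify_roots(nodes, edges):
    # Map: each node announces itself; each edge echoes to its target.
    mapped = [(n, "self") for n in nodes] + [(dst, "echo") for _, dst in edges]
    newly_rooted, unrooted = {}, set()
    # Reduce: a vertex that received no echo is a root.
    for vertex, values in shuffle(mapped).items():
        if "echo" in values:
            unrooted.add(vertex)
        else:
            newly_rooted[vertex] = vertex
    return newly_rooted, unrooted


def rooting_stage(edges, newly_rooted, unrooted):
    # Map: self-join the edges with the previous stage's newly rooted output.
    mapped = [(src, ("edge", dst)) for src, dst in edges]
    mapped += [(v, ("root", r)) for v, r in newly_rooted.items()]
    next_rooted = {}
    # Reduce: a rooted source passes its root to each un-rooted target.
    for _, values in shuffle(mapped).items():
        roots = [x for tag, x in values if tag == "root"]
        if not roots:
            continue
        for tag, dst in values:
            if tag == "edge" and dst in unrooted and dst not in next_rooted:
                next_rooted[dst] = roots[0]
    return next_rooted, unrooted - set(next_rooted)


nodes = ["101", "201", "202", "301"]
edges = [("101", "201"), ("101", "202"), ("202", "301")]

newly_rooted, unrooted = stage0_identify_roots(nodes, edges)
all_rooted = dict(newly_rooted)
stages = 0
while newly_rooted and unrooted:
    newly_rooted, unrooted = rooting_stage(edges, newly_rooted, unrooted)
    all_rooted.update(newly_rooted)
    stages += 1
print(all_rooted, stages)
```

Each pass of the loop corresponds to one full job, so the number of jobs grows with the depth of the tree; that is the iteration cost the slide warns about.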

  17. HBase • Largest and Most used NoSql Implementation in the World • Based on the Google 2006 BigTable white paper • Imagine it like a giant HashMap with keys and values • Handles 100k operations a second even on a small 10-node cluster

  18. HBase Getting [Diagram: a Client, the HBase Master, and three HBase Region Servers, each with a Block Cache.]

  19. HBase Putting [Diagram: a Client, the HBase Master, and three HBase Region Servers, each with a WAL, a MemStore, and an HFile.]

  20. HBase • Good for graph traversing • Bad for large batch processing • Scan rate about 8x slower than HDFS • Good for end of a long tail
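
To make the "giant HashMap" picture concrete, graph traversal can be seen as a chain of single-key lookups. The sketch below uses a plain Python dict as a stand-in for a table; it makes no HBase client calls, and the row layout (row key = vertex ID, one `parent` value per row) is purely an assumption for illustration.

```python
# A plain dict stands in for a key-value table: row key -> column values.
# Hypothetical layout: each row stores the ID of the vertex's parent.
table = {
    "101": {"parent": None},
    "201": {"parent": "101"},
    "202": {"parent": "101"},
    "301": {"parent": "202"},
}


def find_root(table, vertex_id):
    """Follow parent links upward, one point lookup per hop."""
    seen = set()
    current = vertex_id
    while True:
        if current in seen:
            raise ValueError(f"circular link detected at {current}")
        seen.add(current)
        row = table[current]            # stands in for a single-row get
        if row["parent"] is None:
            return current, len(seen)   # root ID and number of lookups used
        current = row["parent"]


root, lookups = find_root(table, "301")
print(root, lookups)
```

The work per query is proportional to the depth of the vertex rather than the size of the data set, which fits the slide's point that this style suits traversal and the tail end of a job better than large batch scans.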

  21. Giraph • System built for Large Batch Graph Processing • Based on Pregel 2009 white paper • Hardened by LinkedIn and Facebook • Reported to handle up to a trillion edges

  22. Giraph Loading [Diagram: a Master and four Workers, with the Workers loading Data Blocks.]

  23. Giraph (Bulk Synchronous Parallel) [Diagram: three Workers each perform Local vertex computing, followed by Communication and then Barrier synchronization.]

  24. Giraph • Most mature bulk graph processing out there • Of all the solutions, most graph focused
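
The bulk synchronous parallel model can be imitated on one machine to show how tree rooting fits it: in each superstep every vertex computes on the messages it received, sends new messages, and then all vertices wait at a barrier. This is a simplified simulation, not Giraph API code; `bsp_root` and the message format are assumptions, and every edge target must appear as a key in the map.

```python
def bsp_root(graph):
    """Simulate Pregel-style supersteps; return ({vertex: root}, supersteps)."""
    root_of = {v: None for v in graph}

    # Superstep 0: every vertex echoes to its edges.
    inbox = {v: [] for v in graph}
    for vertex, targets in graph.items():
        for target in targets:
            inbox[target].append("echo")
    supersteps = 1                                  # barrier

    # Superstep 1: vertices that received no echo root themselves and send their ID.
    outbox = {v: [] for v in graph}
    for vertex in graph:
        if not inbox[vertex]:
            root_of[vertex] = vertex
            for target in graph[vertex]:
                outbox[target].append(vertex)
    inbox = outbox
    supersteps += 1                                 # barrier

    # Later supersteps: un-rooted vertices adopt a received root and forward it.
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for vertex, messages in inbox.items():      # local vertex computing
            if messages and root_of[vertex] is None:
                root_of[vertex] = min(messages)     # deterministic pick on multiple parents
                for target in graph[vertex]:
                    outbox[target].append(root_of[vertex])
        inbox = outbox                              # communication, then barrier
        supersteps += 1
    return root_of, supersteps


graph = {"101": ["201", "202"], "201": [], "202": ["301"], "301": []}
root_of, supersteps = bsp_root(graph)
print(root_of, supersteps)
```

Unlike the staged batch jobs, the graph stays loaded between supersteps and only messages move, so extra depth costs another superstep rather than another full pass over the data.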

  25. Spark • At Berkeley around 2011, some asked if we could do better than MR • Take advantage of lower cost memory • Building on everything before

  26. Spark [Diagram: RDD Objects, e.g. Rdd1.join(rdd2).groupBy(…).filter(…); Dag Scheduler (like a queue planner); Task Scheduler; Cluster Manager; Spark Workers, each with Task Threads and a Block Manager.]

  27. Spark • Implementations • Onion MR approach with Basic Spark • Pregel approach with Bagel or GraphX • Bagel is a Façade over Generic Spark Functionality • GraphX is an effort to extend Spark • Less code • Learning curve • It's raw and will be changing a lot in the next year
