
Hadoop at ContextWeb


Presentation Transcript


  1. Hadoop at ContextWeb February 2009

  2. ContextWeb: Traffic
  • Traffic: up to 6,000 ad requests per second
  • Comscore trend data (chart)

  3. ContextWeb Architecture highlights
  • Pre-Hadoop aggregation framework
  • Logs are generated on each server and aggregated in memory into 15-minute chunks
  • Logs from different servers are aggregated into one log
  • Loaded into the DB
  • Multi-stage aggregation in the DB
  • About 20 different jobs end-to-end
  • Could take 2 hours to process through all stages

  4. Hadoop Data Set
  • Up to 100GB of raw log files per day (40GB compressed)
  • 40 different aggregated data sets; 15TB total to cover 1 year (compressed)
  • Multiply by 3 replicas …

  5. Architectural Challenges
  • How to organize the data set so that aggregated data sets stay fresh: logs are constantly appended to the main data set, and reports and aggregated data sets should be refreshed every 15 minutes
  • Mix of .NET and Java applications (80%+ .NET, 20% Java): how to make .NET applications write logs to Hadoop?
  • Some 3rd-party applications consume the results of MapReduce jobs (e.g. a reporting application): how to make 3rd-party or internal legacy applications read data from Hadoop?

  6. Hadoop Cluster
  • Today:
    • 26 nodes / 208 cores: Dell 2950, 1.8TB per node
    • 43TB total capacity
    • NameNode high availability using DRBD replication
    • Hadoop 0.17.1 -> 0.18.3
    • In-house Java framework developed on top of hadoop.mapred.* (see the sketch after this slide)
    • Pig and Perl streaming for ad-hoc reports
    • ~1,000 MapReduce jobs per day
    • Opswise scheduler
    • Exposing data to Windows: WebDAV server with WebDrive clients
    • Reporting application: QlikView
    • Cloudera support for Hadoop
    • Archival/backup: Amazon S3
  • By end of 2009:
    • ~50 nodes / 400 cores
    • ~85TB total capacity
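As context for the hadoop.mapred.* bullet above, here is a minimal sketch of what a job on that old API (the one shipped with Hadoop 0.17/0.18) looks like. It is purely illustrative, not ContextWeb's in-house framework; the job name, class names, and log-format assumption are mine.

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Hypothetical example: count ad-request log lines per hour using the old mapred API.
public class RequestCountJob {

  public static class HourMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      // Assumes the first tab-separated field is a timestamp like "2009-02-01 13:05:12";
      // keep only the date and hour as the aggregation key.
      String ts = line.toString().split("\t", 2)[0];
      out.collect(new Text(ts.substring(0, 13)), ONE);
    }
  }

  public static class SumReducer extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text key, Iterator<LongWritable> values,
                       OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      long sum = 0;
      while (values.hasNext()) sum += values.next().get();
      out.collect(key, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(RequestCountJob.class);
    conf.setJobName("hourly-request-count");
    conf.setMapperClass(HourMapper.class);
    conf.setCombinerClass(SumReducer.class);
    conf.setReducerClass(SumReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
```

The old API is driven entirely by JobConf and JobClient.runJob(); an in-house framework typically wraps exactly this boilerplate (input/output paths, key/value classes, compression settings) behind a few reusable base classes.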

  7. Internal Components
  • Disks: 2x 300GB 15k RPM SAS, hardware RAID 1 mirroring, SMART monitoring
  • Network: dual 1Gbps on-board NICs, Linux bonding with LACP

  8. Redundant Network Architecture
  • Linux bonding: see bonding.txt in the Linux kernel docs
  • LACP, aka 802.3ad, aka mode=4 (http://en.wikipedia.org/wiki/Link_Aggregation_Control_Protocol); must be supported by your switches
  • Throughput advantage: observed 1.76Gb/s
  • Allows for failure of either NIC, instead of a single heartbeat connection via crossover

  9. The Data Flow

  10. Partitioned Data Set: approach
  • Date/time is the dimension used for partitioning
  • Results of MapReduce jobs are segregated into daily and hourly directories
  • Each daily/hourly directory is regenerated if the input to an MR job contains data for that day/hour
  • A revision number is used for each directory/file, so multi-stage jobs can overlap during processing (see the sketch below)
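A small illustrative helper for the revision-number idea, assuming a directory layout like /data/aggregated/daily/2009-02-01/r0003; the paths, class, and method names here are hypothetical, not ContextWeb's actual code.

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: pick the next revision directory under a daily partition,
// so a new job run can write alongside the current revision and swap in when done.
public class RevisionedOutput {

  public static Path nextRevision(FileSystem fs, Path dayDir) throws IOException {
    int max = -1;
    if (fs.exists(dayDir)) {
      for (FileStatus status : fs.listStatus(dayDir)) {
        String name = status.getPath().getName();
        // Only consider directories named like r0001, r0002, ...
        if (name.matches("r\\d+")) {
          max = Math.max(max, Integer.parseInt(name.substring(1)));
        }
      }
    }
    // e.g. /data/aggregated/daily/2009-02-01/r0004
    return new Path(dayDir, String.format("r%04d", max + 1));
  }
}
```

The idea on the slide is that downstream consumers resolve the latest revision directory, so a day or hour can be regenerated while the previous revision is still being read.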

  11. Partitioned Data Set: processing flow

  12. Workflow • Opswise scheduler

  13. Getting Data in and out
  • Mix of .NET and Java applications (80%+ .NET, 20% Java)
  • How to make .NET applications write logs to Hadoop?
  • Some 3rd-party applications consume the results of MapReduce jobs (e.g. a reporting application)
  • How to make 3rd-party or internal legacy applications read data from Hadoop?

  14. Getting Data in and out: distcp
  • hadoop distcp <src> <trgt>
    • <src> – HDFS
    • <trgt> – /mnt/abc, a network share
  • Easy to start: just allocate storage on a network share
  • But…
    • Difficult to maintain once there are more than 10 types of data to copy
    • Needs extra storage outside of HDFS (oxymoron!)
    • An extra step in processing
    • Clean-up

  15. Getting Data in and out: WebDAV driver
  • A WebDAV server is part of the Hadoop source code tree; it needed some minor clean-up and was co-developed with IponWeb. Available at http://www.hadoop.iponweb.net/Home/hdfs-over-webdav
  • There are multiple commercial Windows WebDAV clients you can use (we use WebDrive)
  • Linux: mount modules are available from http://dav.sourceforge.net/

  16. Getting Data in and out: WebDAV

  17. WebDAV and compression
  • But your results are compressed…
  • Options:
    • Decompress files on HDFS – an extra step again
    • Refactor your application to read compressed files…
      • Java – OK
      • .NET – much more difficult; cannot decompress SequenceFiles
      • 3rd party – not possible

  18. WebDAV and compression
  • Solution: extend WebDAV to support compressed SequenceFiles, so the same driver can serve both compressed and uncompressed files (see the sketch below)
  • If a file with the requested name foo.bar exists, return it as is
  • If foo.bar does not exist, check whether there is a compressed version foo.bar.seq; decompress it on the fly and return it as if it were foo.bar
  • Outstanding issues:
    • Temporary files are created on the Windows client side
    • There are no native Hadoop (de)compression codecs on Windows
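A rough sketch of the fall-back-and-decompress logic described on this slide, using the stock SequenceFile.Reader API from the Hadoop 0.18 era; the class name and the tab-separated record formatting are assumptions, not the actual WebDAV server code.

```java
import java.io.IOException;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Hypothetical sketch: serve foo.bar as-is if it exists, otherwise look for
// foo.bar.seq and stream its records to the client uncompressed.
public class TransparentSeqFileServer {

  public static void serve(FileSystem fs, Path requested, OutputStream out, Configuration conf)
      throws IOException {
    if (fs.exists(requested)) {
      // Plain file: copy the bytes through unchanged (closes both streams when done).
      IOUtils.copyBytes(fs.open(requested), out, conf, true);
      return;
    }

    Path seq = new Path(requested.toString() + ".seq");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, seq, conf);
    try {
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      while (reader.next(key, value)) {
        // Assumes records render sensibly as tab-separated text; adjust to the real format.
        out.write((key + "\t" + value + "\n").getBytes("UTF-8"));
      }
    } finally {
      reader.close();
    }
  }
}
```

Served this way, a Windows or 3rd-party client just sees an ordinary text file, which is what lets .NET and legacy applications read MapReduce output without understanding SequenceFiles.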

  19. QlikView Reporting Application
  • Load from TXT files is supported
  • In-memory DB
  • AJAX support for integration into web portals
