
Building Web Analytics on Hadoop at CBS Interactive


Presentation Transcript


  1. Building Web Analytics on Hadoop at CBS Interactive Michael Sun michael.sun@cbsinteractive.com Big Data Workshop, Boston 03/10/2012

  2. Brands and Websites of CBS Interactive, Samples ENTERTAINMENT GAMES & MOVIES SPORTS TECH, BIZ & NEWS MUSIC

  3. CBSi Scale • Top 20 global web property • 235M worldwide monthly unique users • Hadoop cluster size: • Current workers: 40 nodes (260 TB) • This month: add 24 nodes, 800 TB total • Next quarter: ~ 80 nodes, ~ 1 PB • DW peak processing: > 500M events/day globally, doubling next quarter (ad logs) 1 - Source: comScore, March 2011

  4. Web Analytics Processing • Collect web logs for web metrics analysis • Web logs track clicks, page views, downloads, streaming video events, ad events, etc. • Provide internal metrics for web site monitoring • A/B testing • Biller apps, external reporting • Ad event tracking to support sales • Provide data services • Support marketing by providing data for data mining • User-centric datastore (stay tuned) • Optimize user experience

  5. Sample raw web log record (a clear-gif beacon hit):

2105595680218152960 125.83.8.253 - - [07/Mar/2012:16:00:00 +0000] GET /clear/c.gif?ts=1331136009989&sid=115&ld=www.xcar.com.cn&ldc=a4887174-530c-40df-8002-e06c199ba81a&xrq=fid%3D523%26page%3D10&brflv=10.3.183&brwinsz=1680x840&brscrsz=1680x1050&brlang=zh-CN&tcset=utf8&im=dwjs&xref=http%3A%2F%2Fwww.xcar.com.cn%2Fbbs%2Fforumdisplay.php&srcurl=http%3A%2F%2Fwww.xcar.com.cn%2Fbbs%2Fforumdisplay.php%3Ffid%3D523%26page%3D11&title=%E5%B8%95%E6%9D%B0%E7%BD%97%E8%AE%BA%E5%9D%9B_%E5%B8%95%E6%9D%B0%E7%BD%97%E7%A4%BE%E5%8C%BA_%E5%B8%95%E6%9D%B0%E7%BD%97%E8%BD%A6%E5%8F%8B%E4%BC%9A_PAJERO%E8%AE%BA%E5%9D%9B_XCAR%20%E7%88%B1%E5%8D%A1%E6%B1%BD%E8%BD%A6%E4%BF%B1%E4%B9%90%E9%83%A8 HTTP/1.1 200 42 clgf=Cg+5E02cT/eWAAAAo0Y http://www.xcar.com.cn/bbs/forumdisplay.php?fid=523&page=11 Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.802.30 Safari/535.1 SE 2.X MetaSr 1.0 -

Python-ETL schema definition describing this record:

    schemas.append(Schema((  # schemas[0]
        SchemaField('web_event_id', 'int', nullable=False, signed=True, bits=64),
        SchemaField('ip_address', 'string', nullable=False, maxlen=15, io_encoding='ascii'),
        SchemaField('empty1', 'string', nullable=False, maxlen=5, io_encoding='ascii'),
        SchemaField('empty2', 'string', nullable=True, maxlen=5, io_encoding='ascii'),
        SchemaField('req_date', 'string', nullable=True, maxlen=30, io_encoding='ascii'),
        SchemaField('request', 'string', nullable=True, maxlen=2000, on_range_error='truncate', io_encoding='ascii'),
        SchemaField('http_status', 'int', nullable=True, signed=True),
        SchemaField('bytes_sent', 'int', nullable=True, signed=True),
        SchemaField('cookie', 'string', nullable=True, maxlen=100, on_range_error='truncate', io_encoding='utf-8'),
        SchemaField('referrer', 'string', nullable=True, maxlen=1000, on_range_error='truncate', io_encoding='utf-8'),
        SchemaField('user_agent', 'string', nullable=True, maxlen=2000, on_range_error='truncate', io_encoding='utf-8'),
        SchemaField('is_clear_gif_mask', 'int', nullable=False, on_null='default', on_type_error='default', signed=True, bits=2),
    )))
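To show how such a record maps onto the schema above, here is a minimal, hypothetical parsing sketch in Python. The regular expression and the assumption that a single regex captures every field are illustrative only; this is not the Lumberjack parser.

    import re

    # Hypothetical illustration: split one raw record into the schema's field names.
    # The pattern is inferred from the sample line above; it is not Lumberjack code.
    LOG_PATTERN = re.compile(
        r'^(?P<web_event_id>\S+) (?P<ip_address>\S+) (?P<empty1>\S+) (?P<empty2>\S+) '
        r'\[(?P<req_date>[^\]]+)\] (?P<request>\S+ \S+ HTTP/[\d.]+) '
        r'(?P<http_status>\d+) (?P<bytes_sent>\d+) (?P<cookie>\S+) '
        r'(?P<referrer>\S+) (?P<user_agent>.+?) (?P<is_clear_gif_mask>\S+)$'
    )

    def parse_record(line):
        """Return a dict keyed by the schema field names, or None for a malformed line."""
        match = LOG_PATTERN.match(line.rstrip('\n'))
        return match.groupdict() if match else None

Applied to the sample record, parse_record() yields the ip_address, req_date, request, cookie, referrer, user_agent, and so on as named fields ready for type checking against the schema.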

  6. Modernize the platform • Web log processing on a proprietary platform had hit its limits • Code base was 10 years old • The version we used was no longer supported by the vendor • Not fault-tolerant • Upgrading to the newer version was not cost-effective • Data volume is increasing all the time • 300+ web sites • Video tracking increasing the fastest • Need to support new business initiatives • Use open-source systems as much as possible

  7. Hadoop to the Rescue / Research • Open source: scalable data-processing framework based on MapReduce • Processes PBs of data using the Hadoop Distributed File System (HDFS) • High throughput • Fault-tolerant • Distributed computing model • Based on a functional programming model • MapReduce (M|S|R) • Execution engine • Used as a cluster for ETL • Collect data (distributed harvester) • Analyze data (M/R, streaming + scripting + R, Pig/Hive) • Archive data (distributed archive)
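As a rough illustration of the "streaming + scripting" combination above, here is a minimal, hypothetical Hadoop Streaming job in Python that counts events per site id (sid). The tab-delimited layout and the field position are assumptions for the sketch, not the production job.

    #!/usr/bin/env python
    # mapper.py -- emit (sid, 1) per event; the field position is assumed for illustration
    import sys

    for line in sys.stdin:
        fields = line.rstrip('\n').split('\t')
        if len(fields) > 2:
            print('%s\t1' % fields[2])   # fields[2] assumed to hold the site id

    #!/usr/bin/env python
    # reducer.py -- sum counts per sid; Hadoop Streaming delivers mapper output sorted by key
    import sys

    current_sid, count = None, 0
    for line in sys.stdin:
        sid, n = line.rstrip('\n').split('\t')
        if sid != current_sid and current_sid is not None:
            print('%s\t%d' % (current_sid, count))
            count = 0
        current_sid = sid
        count += int(n)
    if current_sid is not None:
        print('%s\t%d' % (current_sid, count))

Submitted with the standard streaming jar, e.g. hadoop jar hadoop-streaming.jar -input /weblogs/2012/03/07 -output /tmp/sid_counts -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py (paths are placeholders).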

  8. The Plan • Build web log collection (codename Fido) • Apache web log piped to cronolog • Hourly M/R collector job to • Gzip hourly log files & checksum • Scp from web servers to Hadoop datanodes • Put on HDFS • Build Python ETL framework (codename Lumberjack) • Based on stdin/stdout streaming, one process / one thread • Can run stand-alone or on Hadoop • Pipeline • Filter • Schema • Build web log processing with Lumberjack • Parse • Sessionize • Lookup • Format data / load to DB
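To make the stdin/stdout design concrete, a small sketch of a filter in the spirit of Lumberjack is shown below. The field names and the filtering rule are invented for the example; the real Pipeline/Filter/Schema classes are not reproduced here.

    #!/usr/bin/env python
    # Illustrative stdin/stdout filter, not the actual Lumberjack API.
    # Because it only reads stdin and writes stdout, the same script can run
    # stand-alone (cat log | ./filter.py) or as a Hadoop Streaming mapper.
    import sys

    FIELDS = ('event_id', 'ip', 'url', 'user_agent')   # assumed record layout

    def parse(line):
        return dict(zip(FIELDS, line.rstrip('\n').split('\t')))

    def keep(rec):
        # toy rule: drop records without a URL; real filters apply IAB/robot checks etc.
        return bool(rec.get('url'))

    for line in sys.stdin:
        rec = parse(line)
        if keep(rec):
            sys.stdout.write('\t'.join(rec.get(f, '') for f in FIELDS) + '\n')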

  9. Web Analytics [architecture diagram: Sites, Apache Logs, log distribution by Fido, HDFS, Hadoop (Python-ETL, MapReduce, Hive), DW Database, Web metrics, Billers, Data mining, External data sources, CMS Systems]

  10. Clickmap

  11. Web Log Processing by Hadoop Streaming and Python-ETL • Parse web logs • IAB filtering and checking • Parse user agents by regex • IP range lookup • Look up product keys, etc. • Sessionization • Prepare-sessionize • Sessionize • Filter-unpack • Process huge dimensions (URL / page title) • Load facts • Format load data / load data to DB
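For the sessionization step, a simplified sketch is shown below. It assumes events arrive grouped by visitor cookie and sorted by timestamp (which the prepare-sessionize sort provides) and uses a 30-minute inactivity timeout, a common web analytics convention rather than a value stated in the talk.

    # Simplified sessionization sketch (illustrative field layout and timeout).
    SESSION_TIMEOUT = 30 * 60   # seconds of inactivity that ends a session (assumed)

    def sessionize(events):
        """events: iterable of (cookie, timestamp, payload), sorted by (cookie, timestamp).
        Yields (session_id, cookie, timestamp, payload)."""
        session_no = 0
        prev_cookie, prev_ts = None, None
        for cookie, ts, payload in events:
            new_visitor = cookie != prev_cookie
            timed_out = prev_ts is not None and ts - prev_ts > SESSION_TIMEOUT
            if new_visitor or timed_out:
                session_no += 1                      # start a new session
            yield ('%s-%d' % (cookie, session_no), cookie, ts, payload)
            prev_cookie, prev_ts = cookie, ts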

  12. Benefits to Ops • Reduced processing time to within SLA, saving 6 hours • Running in production for 2 years without any big issues • Withstood a 50%/year data-volume increase • Architecture designed to make adding new processing logic easy • Robust and fault-tolerant • Five dead datanodes, jobs still ran OK • Upgraded the JVM on a few datanodes while jobs were running • Reprocessed old data while processing the current day's data

  13. Conclusions I – Create a Tool Appropriate to the Job if Existing Ones Don't Have What You Want • The Python ETL framework and Hadoop Streaming together can do complex, big-volume ETL work • Python ETL framework • Home-grown, under review for open-source release • Rich functionality via Python • Extensible • NLS support • Runs on top of another platform, e.g. Hadoop, which provides • Distributed/parallel execution • Sorting • Aggregation

  14. Conclusions II – Power and Flexibility for Processing Big Data • Hadoop provides scale and computing horsepower • Robustness • Fault tolerance • Scalability • Significant reduction of processing time to reach SLA • Cost-effective • Commodity HW • Free SW • Currently: • Building multi-tenant Hadoop clusters using the Fair Scheduler

  15. Questions? michael.sun@cbsinteractive.com • Follow up on Lumberjack: lumberjack@cbsinteractive.com
