
Storage and Analysis of Tera-scale Data: 2 of 2

Storage and Analysis of Tera-scale Data: 2 of 2. 415 Database Class, 11/24/09, delip@jhu.edu. Previously … (Traditional) databases are not Swiss-Army knives. Large data problems require radically different solutions. Exploit the power of parallel I/O and computation.


Presentation Transcript


  1. Storage and Analysis of Tera-scale Data: 2 of 2 • 415 Database Class • 11/24/09 • delip@jhu.edu

  2. Previously … • (Traditional) Databases are not Swiss-Army knives • Large data problems require radically different solutions • Exploit the power of parallel I/O and computation • MapReduce as a framework for building reliable distributed data processing applications • Storing large data requires redesign from the ground up, i.e. filesystem (HDFS)

  3. Previously … • HDFS : A reliable open source distributed file system • HBase : A sorted multi-dimensional map for record oriented data • Not Relational • No query language other than map semantics (Get and Put)

  4. MapReduce is great but … Got to write all this for a WordCount!!!
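The code shown on this slide is not preserved in the transcript. As a rough illustration of why even WordCount takes several pieces, here is a minimal in-process sketch of the map, shuffle, and reduce steps. It is plain Python, not Hadoop code, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every whitespace-separated token."""
    for word in line.split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: collect all values that were emitted for the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the partial counts for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["the quick fox", "the lazy dog"])
```

A real Hadoop job additionally needs job configuration, input/output formats, and packaging, which is the overhead the following slides set out to remove.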

  5. MapReduce • Development cycles too long • Writing code • Packaging code • JOINs on large data too hard to implement in MapReduce • Today’s class: Keeping it Simple • Can we abstract users from MapReduce?

  6. Pig • Started in Fall 2007 at Yahoo! • Simplify MapReduce by capturing common data processing patterns • Results in improved productivity • Lowers barrier to entry for large data processing • Today: Runs 40% of Yahoo!’s large data jobs • Who else: Twitter, LinkedIn, AOL, … • Similar efforts elsewhere: Sawzall (Google), Hive (Facebook)

  7. Pig = Query Language + Interpreter • Language: Pig Latin • A data flow language • LOAD, STORE, FILTER, ORDER, GROUP, JOIN • Interpreter: Grunt • An execution environment to convert Pig Latin to MapReduce • Two modes • Local : JVM • Distributed: via Hadoop

  8. Pig Latin Example from Pittsburgh Hadoop Users Group

  9. Equivalent MapReduce code

  10. Pig Latin from an Example (Example courtesy: Yahoo! Research) • Find users who visit “good” pages

  11. Conceptual Dataflow

  12. Pig Latin script

  13. Pig Latin: The Language • Structure • Collection of STATEMENTS • Statement has an OPERATOR and ends in ‘;’

  14. Summary of Pig Latin Operators

  15. LOAD/STORE and Schemas
      grunt> records = LOAD 'input/sample.txt'
      >> AS (year:int, temperature:int, quality:int);
      grunt> records = LOAD 'input/sample.txt';
      grunt> STORE records INTO 'output/sample.out';
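To make the role of the schema concrete, the following Python sketch mimics loading tab-separated text with a declared schema and storing it again. It is only a model of the idea, not Pig's implementation: `load`, `store`, and `SCHEMA` are hypothetical names, the sample rows are invented, and mapping a failed cast to `None` is an assumption about how a null would be represented.

```python
import io

# Field names and types mirroring the AS (...) clause on the slide.
SCHEMA = [("year", int), ("temperature", int), ("quality", int)]


def load(stream, schema):
    """Parse tab-separated lines into dicts, casting each field per the schema."""
    records = []
    for line in stream:
        fields = line.rstrip("\n").split("\t")
        record = {name: None for name, _ in schema}
        for (name, cast), raw in zip(schema, fields):
            try:
                record[name] = cast(raw)
            except ValueError:
                record[name] = None  # assumption: a bad value becomes a null
        records.append(record)
    return records


def store(records, stream, schema):
    """Write records back out as tab-separated text, nulls as empty fields."""
    for r in records:
        cells = ("" if r[name] is None else str(r[name]) for name, _ in schema)
        stream.write("\t".join(cells) + "\n")


sample = io.StringIO("1950\t0\t1\n1950\t22\t1\n1949\t111\t-1\n")
records = load(sample, SCHEMA)
out = io.StringIO()
store(records, out, SCHEMA)
```

Without a schema (the second LOAD on the slide), fields would stay untyped and be addressed by position, as the STREAM slide does with `$0` and `$2`.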

  16. FILTER
      grunt> records = LOAD 'input/sample.txt'
      >> AS (year:int, temperature:int, quality:int);
      grunt> bad_records = FILTER records BY quality < 0;
      grunt> bad_years = FOREACH bad_records GENERATE year;
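FILTER and FOREACH both work one row at a time, so they can be pictured as a predicate and a projection applied to each record. A small Python sketch with invented sample rows:

```python
# Invented sample rows with the slide's three fields.
records = [
    {"year": 1950, "temperature": 0, "quality": 1},
    {"year": 1950, "temperature": 22, "quality": 1},
    {"year": 1949, "temperature": 111, "quality": -1},
]

# FILTER records BY quality < 0: keep only rows that satisfy the predicate.
bad_records = [r for r in records if r["quality"] < 0]

# FOREACH bad_records GENERATE year: project each row to a one-field tuple.
bad_years = [(r["year"],) for r in bad_records]
```

Because neither step needs to see more than one row, no grouping or shuffling of data is involved.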

  17. STREAM
      grunt> records = LOAD 'input/sample.txt'
      >> AS (year:int, temperature:int, quality:int);
      grunt> projected = FOREACH records GENERATE $0, $2;
      grunt> projected = STREAM records THROUGH `cut -f1,3`;
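STREAM hands each record to an external program as a line of text and reads the program's output back as records. The sketch below imitates that with Python's `subprocess` module; `stream_through` is a hypothetical helper and it assumes a Unix `cut` is on the PATH. Note that `cut` numbers fields from 1, so `-f1,3` selects the columns Pig Latin calls `$0` and `$2`.

```python
import subprocess


def stream_through(rows, argv):
    """Serialize rows as tab-separated lines, pipe them through an external
    command, and parse the command's output back into tuples of strings."""
    payload = "".join("\t".join(map(str, row)) + "\n" for row in rows)
    result = subprocess.run(
        argv, input=payload, capture_output=True, text=True, check=True
    )
    return [tuple(line.split("\t")) for line in result.stdout.splitlines()]


records = [(1950, 0, 1), (1949, 111, -1)]
projected = stream_through(records, ["cut", "-f1,3"])
```

The values come back as strings because the external program only sees text; any typing has to be reapplied afterwards.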

  18. JOIN
      grunt> records = LOAD 'input/sample.txt'
      >> AS (year:int, temperature:int, quality:int);
      grunt> sales = LOAD 'input/sales.txt'
      >> AS (year:int, profit:float);
      grunt> combined = JOIN records BY year, sales BY year;
      grunt> profit_year = FOREACH combined GENERATE profit, records::year;
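The result of an inner JOIN can be sketched in Python as follows. This uses a simple in-memory hash join, which is a stand-in chosen for brevity and not a description of how Pig executes joins on Hadoop. `inner_join` and the sample rows are invented; the `relation::field` prefixes follow Pig Latin's way of disambiguating fields that exist in both inputs.

```python
from collections import defaultdict


def inner_join(left_name, left, right_name, right, key):
    """Hash join: index the right input by key, then probe it with each left row.
    Output fields are prefixed with their relation name, e.g. 'sales::profit'."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    joined = []
    for lrow in left:
        for rrow in index.get(lrow[key], []):
            merged = {f"{left_name}::{k}": v for k, v in lrow.items()}
            merged.update({f"{right_name}::{k}": v for k, v in rrow.items()})
            joined.append(merged)
    return joined


records = [
    {"year": 1950, "temperature": 0, "quality": 1},
    {"year": 1950, "temperature": 22, "quality": 1},
    {"year": 1949, "temperature": 111, "quality": -1},
]
sales = [{"year": 1950, "profit": 10.5}, {"year": 1951, "profit": 3.0}]

combined = inner_join("records", records, "sales", sales, "year")
profit_year = [(row["sales::profit"], row["records::year"]) for row in combined]
```

Rows without a partner on the other side (1949 in `records`, 1951 in `sales`) drop out, as expected for an inner join.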

  19. GROUP
      grunt> records = LOAD 'input/sample.txt'
      >> AS (year:int, temperature:int, quality:int);
      grunt> combined = GROUP records BY quality;
      grunt> combined = GROUP records BY quality < AVG(quality);
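GROUP does not aggregate by itself; it produces one output row per key, holding the key and a bag of all input rows with that key. A Python sketch of that shape, with a hypothetical `group_by` helper and invented rows:

```python
from collections import defaultdict


def group_by(rows, key_fn):
    """Return (group_key, bag) pairs, where bag lists every row with that key."""
    bags = defaultdict(list)
    for row in rows:
        bags[key_fn(row)].append(row)
    return sorted(bags.items(), key=lambda kv: kv[0])


# (year, temperature, quality)
records = [(1950, 0, 1), (1950, 22, 1), (1949, 111, -1)]

# Group by a field value.
combined = group_by(records, lambda r: r[2])

# Group by a boolean expression: rows fall into a False bag and a True bag.
by_sign = group_by(records, lambda r: r[2] < 0)
```

Aggregates such as COUNT or AVG are then applied to each bag in a following FOREACH, as the WordCount slide does.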

  20. ORDER
      grunt> records = LOAD 'input/sample.txt'
      >> AS (year:int, temperature:int, quality:int);
      grunt> combined = ORDER records BY year, quality DESC;
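In `ORDER records BY year, quality DESC`, the DESC applies only to `quality`; `year` is ascending by default. One way to picture a mixed-direction sort is two stable passes, least significant key first. The rows below are invented for the example.

```python
# (year, temperature, quality)
records = [(1950, 0, 1), (1949, 111, -1), (1950, 22, 9), (1949, 78, 1)]

# Pass 1: secondary key, quality descending.
by_quality = sorted(records, key=lambda r: r[2], reverse=True)

# Pass 2: primary key, year ascending. Python's sort is stable, so rows with
# the same year keep the quality-descending order from pass 1.
combined = sorted(by_quality, key=lambda r: r[0])
```

On a cluster the same ordering has to hold across all output partitions, which is why a total sort is more involved than this single-machine version.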

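  21. Parallelism
      grunt> records = LOAD 'input/sample.txt'
      >> AS (year:int, temperature:int, quality:int);
      grunt> combined = GROUP records BY quality PARALLEL 50;
      The PARALLEL keyword can be attached to any operator that starts a reduce phase (e.g. GROUP, JOIN, ORDER); it sets the number of reduce tasks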
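What `PARALLEL 50` asks for is 50 reduce tasks, with the grouping keys spread across them. The sketch below shows the general idea of hash partitioning in Python; `partition` is a hypothetical helper and the modulo-of-hash rule is a simplification, not the exact partitioner Hadoop uses.

```python
def partition(rows, key_fn, num_reducers):
    """Assign each row to one of num_reducers buckets by hashing its key.
    All rows sharing a key land in the same bucket, so one reducer sees
    the whole group."""
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        buckets[hash(key_fn(row)) % num_reducers].append(row)
    return buckets


# (year, temperature, quality)
records = [(1950, 0, 1), (1950, 22, 1), (1949, 111, -1), (1951, 5, 4)]
buckets = partition(records, lambda r: r[2], 50)
```

With only a handful of distinct keys, most of the 50 buckets stay empty, so a large PARALLEL value helps only when there are many distinct keys.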

  22. User Defined Functions • Unlike standard SQL, Pig Latin can invoke user-defined functions in a query • Proprietary extensions such as PL/SQL do allow this
      grunt> records = LOAD 'input/sample.txt'
      >> AS (year:int, temperature:int, quality:int);
      grunt> REGISTER mypackage.jar;
      grunt> DEFINE MyFunc mypackage.MyFuncImpl.myFunc();
      grunt> combined = GROUP records BY MyFunc(quality);
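The slide does not say what `MyFunc` computes. Purely as an illustration, assume it maps a quality code to a coarse label; the Python sketch below then shows the effect of grouping by a user-defined function. The `udfs` table loosely plays the role of DEFINE (binding a short alias to an implementation), and every name in it is hypothetical.

```python
from itertools import groupby


def my_func(quality):
    """Hypothetical UDF: bucket a quality code into a coarse label."""
    return "bad" if quality < 0 else "ok"


# Alias table: the name used in the script -> the implementation behind it.
udfs = {"MyFunc": my_func}

# (year, temperature, quality)
records = [(1950, 0, 1), (1950, 22, 1), (1949, 111, -1)]


def group_key(row):
    """Grouping key: the UDF applied to the quality field."""
    return udfs["MyFunc"](row[2])


# groupby needs its input sorted by the same key it groups on.
combined = {
    label: list(bag)
    for label, bag in groupby(sorted(records, key=group_key), key=group_key)
}
```

The point is that the grouping key is the function's return value, not a stored column.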

  23. PIG LATIN Review

  24. Revisiting WordCount
      grunt> sentences = LOAD 'input/*.txt'
      >> USING TextLoader() AS (sentence:chararray);
      grunt> words = FOREACH sentences GENERATE FLATTEN(TOKENIZE(sentence)) AS word;
      grunt> word_kinds = GROUP words BY word;
      grunt> word_count = FOREACH word_kinds
      >> GENERATE group, COUNT(words);
      grunt> STORE word_count INTO 'output/wordcount';
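Each statement in the script corresponds to one step of a data flow. The Python sketch below mirrors those steps one-to-one, reusing the relation names from the slide. The input sentences are invented, and `str.split()` is only an approximation of TOKENIZE.

```python
# sentences = LOAD ... AS (sentence:chararray)
sentences = ["to be or not", "to be"]

# words = FOREACH sentences GENERATE FLATTEN(TOKENIZE(sentence)) AS word
# Tokenizing gives a bag of words per sentence; flattening merges the bags.
words = [word for sentence in sentences for word in sentence.split()]

# word_kinds = GROUP words BY word
# One entry per distinct word, holding a bag of all its occurrences.
word_kinds = {}
for word in words:
    word_kinds.setdefault(word, []).append(word)

# word_count = FOREACH word_kinds GENERATE group, COUNT(words)
word_count = {group: len(bag) for group, bag in word_kinds.items()}
```

Here the grouping is written out once in a loop; in Pig Latin it is a single GROUP statement, and the map, shuffle, and reduce plumbing is generated by the system.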

  25. No more this …

  26. Related Project: Hive • Started at Facebook, now open source • Like Pig, but supports a SQL-like query language (HiveQL) • Trend: Move towards in-database MapReduce • Allows existing DB applications to scale up • Makes MapReduce capabilities easily accessible • Business opportunity: www.vertica.com

  27. Summary (this and last class) • MapReduce as a radically different solution to large data problems • Exploit the power of parallel I/O and computation • Need to think from the “ground up” • Filesystem: HDFS • Table store: HBase • Basic MapReduce is too complicated for DB end users

  28. Summary (this and last class) • Efforts to simplify MapReduce-based data processing • Pig from Yahoo! • Pig Latin: a not-so-SQL-like language • A data flow language • LOAD, STORE, FILTER, ORDER, GROUP, JOIN • Facebook Hive supports a direct SQL interface • Emerging trend: Fusion of MapReduce and DB technologies

  29. Happy Thanksgiving!
