1 / 8

Better Performance for Big Data

Shuya Zhang; Shyam Sundar Somasundaram [10/03/13]. Better Performance for Big Data. Reference. [1] Bhasker Allene, Marco Righini, “ Better Performance for Big Data ” Intel White Paper , 2013 Intel Corporation. 1. What is Hadoop. Apache Hadoop an open-source software framework

jennis
Download Presentation

Better Performance for Big Data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Shuya Zhang; ShyamSundarSomasundaram [10/03/13] Better Performance for Big Data Reference [1] Bhasker Allene, Marco Righini, “Better Performance for Big Data” Intel White Paper, 2013 Intel Corporation. 1

  2. What is Hadoop • Apache Hadoop an open-source software framework • Supports data-intensive distributed applications • Enables the running of applications on large clusters of commodity Hardware • Derived from Google's MapReduce and Google File System (GFS) papers Hadoop is one of the poster children of Big Data, especially the most beloved one! 2

  3. The Intel Distribution for Hadoop framework • Oozieis a workflow scheduler • Hiveenables SQL queries on Hadoop • Hbaseis a non-relational, distributed database

  4. Oozie Workflow • Step1Convert EBCDIC to ASCII • Step 2Scan for New Columns • Step3Move Columns List to Metadata • Step4 Optimize Data for Hive • Step5 Move Data to Hive Warehouse • Step6 Drop and Create Hive Table Structure

  5. Data Flow & Data Optimization • Benefits: - Data is stored in a normalized way - Hive queries quite similar to RDBMS queries - The learning curve is minimized, with fewer computations and less disk space - Data is easily consumable for data analysis tools

  6. Comparing SQL and Hive Queries Query 1 • RDBMS: select distinct tp_ndg as N010 from scontrpf50m0130331 • Hive: SELECT DISTINCT N010 FROM O0A011 LIMIT10; Query 2 • RDBMS: SELECT DISTINCT B. COD_UO, b. Tp_conto, a. Tp_ndg as N010 FROM A JOIN SCONTRPF50M0130331 CUBOM0100M0130331 B ON A. NDG = B. NDG WHERE POSIZ_SOFF_INCAGLT = '1 'and a. TP_NDG in ('DIN', 'IOC', 'SPF') and b. dt_accs_rapprt> = '2013-03-01 'and b. dt_accs_rapprt <= '2013-03-15' ORDER BY B. COD_UO, b. Tp_conto, a. TP_NDG • HIVE: SELECT DISTINCT B.DOOR, B.TP_INCOME, A.N010 FROM O0A011 A JOIN CUBOM0100M0130331 B ON A.NDG = B.NDG WHERE N011 ='1' AND A.N010 in ('DIN', 'IOC', 'SPF') AND B.R021 > '130100' AND B.R021 < '130316'; ORDER BY DOOR, TP_INCOME, N010; Query 3 • RDBMS: select b. cod_uo, b. forma_tec, TP_NDG as N010, substring (a. sae_rae, 1, 3) as N003, a. u_segmgest_2004 as N088, a. u_modserv_gest as N089, sum (b. qc_rata_scd) as D505 from scon- trpf50m0130331 to join cubom0100m0130331 b on a. ndg = b. ndg where a. TP_NDG in ('DIN', 'IOC', 'SPF') and b. forma_tec in ('MW500', 'MW100', 'MW200') group by b. cod_uo, b. forma_tec, a. TP_NDG, substring (a. sae_rae, 1, 3), a. u_segmgest_2004, a. u_modserv_gest order by cod_uo, forma_tec, TP_NDG, sub- string (a. sae_rae, 1, 3), a. u_seg- mgest_2004, a. u_modserv_gest

  7. Tables Cubes Collections Structured Data Structured Data Structured/ Unstructured What is related with our course • Hadoop VS SQL http://www.youtube.com/watch?v=3Wmdy80QOvw#t=16 • New Trend: Big Data, Cloud Computing Database Management Systems RDBMS (Relational DBMS) NoSQL OLAP

  8. Questions?

More Related