HadoopDB

A presentation by Inneke Ponet. Outline: introduction, technologies for data analysis, HadoopDB, desired properties, layers of HadoopDB, HadoopDB components.


  1. HadoopDB Inneke Ponet

  2. Introduction • Technologies for data analysis • HadoopDB • Desired properties • Layers of HadoopDB • HadoopDB Components

  3. Introduction • More and more data needs to be stored and processed. • People want to do more and more complex calculations on their collected data. • Analytical databases are moving from high-end machines towards cheaper, lower-end machines. • The analytical database market is 27% of the database software market and is growing at a rate of 10.3% annually.

  4. Technologies for data analysis • Parallel databases: good performance, good efficiency. • MapReduce-based systems: superior scalability, good fault tolerance, good flexibility to handle unstructured data.

  5. Parallel databases • Support for standard relational tables and SQL. • Implement techniques for better performance: indexing, compression, materialized views, result caching, I/O sharing. • Data is partitioned (shared-nothing architecture), transparently to the end user.

  6. Shared-nothing architecture • The DBMSs of most analytical databases are deployed on a shared-nothing architecture: a collection of machines that are independent, are possibly virtual, have their own local disk and local main memory, and are connected by a high-speed network. • Scalability of machines. • Analysis tasks are easy to parallelize.

  7. MapReduce • A technology from Google: processes (un)structured data that is distributed over many nodes in a shared-nothing cluster; works at enormous scale. • Map and Reduce: run in parallel without communicating; Map-repartition-Reduce cycles.
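
The Map-repartition-Reduce cycle on this slide can be illustrated with a minimal single-process sketch. Word counting is an assumed example task, and the function names are hypothetical; a real cluster would run each map call and each reduce call on a different node.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def repartition(pairs):
    """Repartition (shuffle): bring all values with the same key together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(splits):
    # Each split is mapped independently; map calls never talk to each other.
    mapped = chain.from_iterable(map_phase(split) for split in splits)
    grouped = repartition(mapped)
    # Each key is reduced independently of every other key.
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["hadoop stores data", "hive queries data"])
```

Because map calls only see their own split and reduce calls only see their own key, both phases can run in parallel without communication; all data exchange happens in the repartition step.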

  8. MapReduce: advantages • No detailed query execution plan is fixed in advance, so at runtime the system can adjust to node failures and slow nodes by (re)assigning tasks to faster nodes. • Checkpoints the output to local disk, minimizing the work that must be redone in case of a failure.
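
Runtime reassignment can be sketched as a small deterministic simulation. This is not Hadoop's scheduler: the round-robin policy, the `run_job` function, and the `failing_nodes` set are assumptions made only for illustration.

```python
def run_job(tasks, nodes, failing_nodes):
    """Assign each task to a node and reassign it when that node fails.

    tasks: list of task ids; nodes: list of node names;
    failing_nodes: set of node names that fail every task (hypothetical).
    Returns a dict mapping each task to the node that completed it.
    """
    completed = {}
    for i, task in enumerate(tasks):
        # Try the nodes round-robin, starting at a different node per task.
        for attempt in range(len(nodes)):
            node = nodes[(i + attempt) % len(nodes)]
            if node not in failing_nodes:
                # The finished task's output is kept (checkpointed), so a
                # later failure elsewhere never forces this task to rerun.
                completed[task] = node
                break
        else:
            raise RuntimeError(f"no healthy node left for task {task}")
    return completed


result = run_job(["t0", "t1", "t2"], ["n0", "n1", "n2"], failing_nodes={"n1"})
```

Only the task that was first placed on the failing node is moved; the other tasks keep their original assignment, which is the point of splitting a job into small sub-tasks.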

  9. HadoopDB • Hybrid database: a combination of a traditional DBMS and MapReduce technology. • Developed by Yale University students Azza Abouzeid and Kamil Bajda-Pawlikowski. • It is free and open source.

  10. Desired properties • Performance • Fault tolerance • Heterogeneous environment • Flexible query interface • Scalability

  11. A. Performance • Primary characteristic used to distinguish systems. • MapReduce: modeling and loading data first, before processing; slower performance than parallel databases. • Cost saving: a faster software product is cheaper than a hardware upgrade or buying additional hardware.

  12. B. Fault tolerance • Successfully commit transactions. • Make progress on a workload. • Heterogeneity and scalability lead to more faults, BUT MapReduce has good fault tolerance: reassigning tasks; sub-tasks minimize the effect of faults. • Parallel databases: assumption that failures are rare; more testing => slower performance.

  13. C. Heterogeneous environment • Nodes don't always run on identical hardware or an identical virtual machine, so their performance differs. • Parallel databases: not tested on more than 100 nodes.

  14. D. Flexible query interface • Easy to make queries: SQL and non-SQL interface languages; use of tools. • Robust mechanism for writing UDFs. • Parallel databases: SQL, ODBC and UDFs. • MapReduce-based systems: it is possible (Hive), but not always (Hadoop).

  15. E. Scalability • Traditional DBMS: only scalable to 100 nodes. • MapReduce-based systems: designed to scale to thousands of nodes in a shared-nothing architecture.

  16. Desired properties

  17. Layers of HadoopDB • Communication: Hadoop • Database: PostgreSQL • Translation: Hive

  18. Hadoop • Communication layer of HadoopDB. • The Hadoop framework has two layers: Hadoop Distributed File System (HDFS), MapReduce framework. • Cost: free/open-source MapReduce.

  19. PostgreSQL • Relational DBMS. • (Possible) database layer of HadoopDB. • Cost: free/open source.

  20. Hive • Translation layer. • Processing of a SQL query: • Query → Abstract Syntax Tree. • MetaStore: schema of the table(s). • Logical query plan: DAG of relational operators. • Optimized plan. • Physical executable plan: MapReduce job(s). • XML plan: the DAG, serialized. • Hive Driver executes a Hadoop job.

  21. HadoopDB components • Database Connector: interface between independent database systems; extends the InputFormat class (of Hadoop); connects to any JDBC-compliant database. • Catalog: meta-information about the databases (connection parameters, metadata); stored as an XML file in HDFS, accessed by the master node and the worker/slave nodes.
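
To make the role of the catalog concrete, here is a sketch of how a node might read connection parameters and table metadata from an XML catalog. The element and attribute names (`catalog`, `node`, `database`, `table`, `partition_key`) are invented for this example and are not HadoopDB's actual catalog schema; the XML is also read from a string here rather than from HDFS.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog layout; HadoopDB's real XML schema may differ.
CATALOG_XML = """
<catalog>
  <node host="worker1">
    <database url="jdbc:postgresql://worker1:5432/db0"
              driver="org.postgresql.Driver">
      <table name="sales" partition_key="customer_id"/>
    </database>
  </node>
  <node host="worker2">
    <database url="jdbc:postgresql://worker2:5432/db0"
              driver="org.postgresql.Driver">
      <table name="sales" partition_key="customer_id"/>
    </database>
  </node>
</catalog>
"""


def load_catalog(xml_text):
    """Return {host: {"url", "driver", "tables"}} parsed from catalog XML."""
    root = ET.fromstring(xml_text.strip())
    catalog = {}
    for node in root.findall("node"):
        for db in node.findall("database"):
            catalog[node.get("host")] = {
                "url": db.get("url"),        # connection parameter
                "driver": db.get("driver"),  # connection parameter
                # metadata: table name -> partitioning key
                "tables": {
                    t.get("name"): t.get("partition_key")
                    for t in db.findall("table")
                },
            }
    return catalog


catalog = load_catalog(CATALOG_XML)
```

Keeping this information in one shared file lets the master node plan queries and lets each worker find its local database without any extra coordination service.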

  22. HadoopDB Components (2) • Data loader: • Global hasher: custom MapReduce job that writes files to HDFS; repartitions data upon loading. • Local hasher: copies a partition from HDFS to the local file system; partitions the file into smaller chunks.
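
The two hashing levels of the data loader can be sketched as follows. This is an in-memory illustration, not the loader's actual code: the CRC32 hash, the `local:` salt, and the function names are assumptions, and real partitions would be files in HDFS and on local disk rather than Python lists.

```python
import zlib


def stable_hash(value):
    """Deterministic hash of a key (Python's built-in hash() is salted)."""
    return zlib.crc32(str(value).encode("utf-8"))


def global_hash(rows, key, num_nodes):
    """Global hasher: split the data set into one partition per node."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        partitions[stable_hash(row[key]) % num_nodes].append(row)
    return partitions


def local_hash(partition, key, num_chunks):
    """Local hasher: split one node's partition into smaller chunks.

    The key is salted so that the chunk choice is not tied to the node
    choice made by the global hasher.
    """
    chunks = [[] for _ in range(num_chunks)]
    for row in partition:
        chunks[stable_hash(f"local:{row[key]}") % num_chunks].append(row)
    return chunks


rows = [{"id": i, "amount": i * 10} for i in range(100)]
partitions = global_hash(rows, "id", num_nodes=4)
chunks = local_hash(partitions[0], "id", num_chunks=3)
```

Because the hash is deterministic, two tables hashed on the same key send matching rows to the same node, which is what later allows some work to stay inside the local database.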

  23. HadoopDB Components (3) • SQL to MapReduce: parallel database front-end to process SQL queries. • HiveQL is transformed into MapReduce jobs, which: connect to tables stored in HDFS; consist of DAGs of relational operators that operate as iterators. • Hive makes no assumption about the collocation of tables: operations on multiple tables are done in the Reduce function. • This is NOT the case in HadoopDB: a join operation can be pushed to the database layer.
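
"Relational operators that operate as iterators" means that each operator pulls rows from its child one at a time. A minimal sketch with Python generators follows; the operator names and the sample table are made up for illustration and are not Hive's classes.

```python
def scan(table):
    """Leaf operator: yield the rows of a table one at a time."""
    yield from table


def filter_rows(rows, predicate):
    """Filter operator: pass on only the rows that satisfy the predicate."""
    for row in rows:
        if predicate(row):
            yield row


def project(rows, columns):
    """Select (projection) operator: keep only the named columns."""
    for row in rows:
        yield {column: row[column] for column in columns}


sales = [
    {"id": 1, "region": "eu", "amount": 5},
    {"id": 2, "region": "us", "amount": 30},
    {"id": 3, "region": "eu", "amount": 20},
]

# A small operator chain: scan -> filter -> project.
plan = project(filter_rows(scan(sales), lambda r: r["amount"] > 10),
               ["id", "region"])
result = list(plan)  # rows flow through the chain only when pulled here
```

Nothing is computed until the top operator is iterated, so no intermediate table is materialized between operators.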

  24. HadoopDB Components (4) • SQL/SMS planner: • Modifies Hive: updates the MetaStore. • Makes two passes over the physical plan: first, determine the partition keys for the ReduceSink operators; second, convert operators into SQL queries and push them into the database layer. • Only filter, select and aggregation operators are pushed.
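
The effect of pushing filter, select and aggregation operators into the database layer can be sketched with two in-memory SQLite databases standing in for the per-node PostgreSQL instances. The table, the query, and the assumption that the data is not partitioned on the grouping column (so a Reduce step must merge partial sums) are all illustrative choices, not taken from the slides.

```python
import sqlite3
from collections import defaultdict

# The part of the plan pushed into each node's database:
# filter (WHERE), select (column list) and aggregation (SUM ... GROUP BY).
PUSHED_SQL = (
    "SELECT region, SUM(amount) FROM sales "
    "WHERE amount > ? GROUP BY region"
)


def make_node(rows):
    """Create one node's local database (SQLite as a stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def map_task(conn, threshold):
    """Map: run the pushed-down SQL locally and emit partial aggregates."""
    return conn.execute(PUSHED_SQL, (threshold,)).fetchall()


def reduce_task(partials):
    """Reduce: merge the partial sums coming from all nodes."""
    totals = defaultdict(int)
    for region, subtotal in partials:
        totals[region] += subtotal
    return dict(totals)


nodes = [
    make_node([("eu", 5), ("eu", 20), ("us", 30)]),
    make_node([("eu", 15), ("us", 1)]),
]
partials = [row for conn in nodes for row in map_task(conn, 10)]
totals = reduce_task(partials)
```

Each database returns at most one row per region instead of every matching row, so far less data has to cross the network before the Reduce step.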

  25. HadoopDB Components (5)

  26. Questions?
