1 / 31

Apache Drill status

Apache Drill status. Michael Hausenblas , Chief Data Engineer EMEA, MapR HUG Munich, 2013- 04-19. Kudos to http://cmx.io /. Workloads. Batch processing (MapReduce) Light-weight OLTP ( HBase , Cassandra, etc.) Stream processing (Storm, S4) Search ( Solr , Elasticsearch )

shelley
Download Presentation

Apache Drill status

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Apache Drill status Michael Hausenblas, Chief Data Engineer EMEA, MapRHUG Munich, 2013-04-19

  2. Kudos to http://cmx.io/

  3. Workloads Batch processing (MapReduce) Light-weight OLTP (HBase, Cassandra, etc.) Stream processing (Storm, S4) Search (Solr, Elasticsearch) Interactive, ad-hoc query and analysis (?)

  4. Interactive Query at Scale Impala low-latency

  5. Use Case I • Jane, a marketing analyst • Determine target segments • Data from different sources

  6. Use Case II { "shipment": 100123, "supplier": "ACM", “timestamp": "2013-02-01", "description": ”first delivery today” }, { "shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02", "description": "hope you enjoy it” } … • Logistics – supplier status • Queries • How many shipments from supplier X? • How many shipments in region Y?

  7. Today’s Solutions • RDBMS-focused • ETL data from MongoDBand Hadoop • Query data using SQL • MapReduce-focused • ETL from RDBMS and MongoDB • Use Hive, etc.

  8. Requirements Support for different data sources Support for different query interfaces Low-latency/real-time Ad-hoc queries Scalable, reliable

  9. Google’s Dremel* *) http://research.google.com/pubs/pub36632.html

  10. Apache Drill Overview Inspired by Google’s Dremel Standard SQL 2003 support Other QL possible Plug-able data sources Support for nested data Schema is optional Community driven, open, 100’s involved

  11. High-level Architecture

  12. High-level Architecture node node node node Drillbit Drillbit Drillbit Drillbit Storage Process Storage Process Storage Process Storage Process Each node: Drillbit- maximize data locality Co-ordination, query planning, execution, etc, are distributed By default Drillbits hold all roles Any node can act as endpoint for a query

  13. High-level Architecture Curator/Zk node node node node Drillbit Drillbit Drillbit Drillbit Distributed Cache Distributed Cache Distributed Cache Distributed Cache Storage Process Storage Process Storage Process Storage Process Zookeeper for ephemeral cluster membership info Distributed cache (Hazelcast) for metadata, locality information, etc.

  14. High-level Architecture Curator/Zk node node node node Drillbit Drillbit Drillbit Drillbit Distributed Cache Distributed Cache Distributed Cache Distributed Cache Storage Process Storage Process Storage Process Storage Process Originating Drillbit acts as foreman, manages query execution, scheduling, locality information, etc. Streaming data communication avoiding SerDe

  15. Principled Query Execution Source Query Logical Plan Physical Plan Parser Optimizer Execution SQL 2003 DrQL MongoQL DSL parser API topology scanner API query: [ { @id: "log", op: "sequence", do: [ { op: "scan", source: “logs” }, { op: "filter", condition: "x > 3” },

  16. RPC Endpoint Drillbit Modules SQL Logical Plan Physical Plan Storage Engine Interface Scheduler HiveQL DFS Engine Foreman Optimizer Pig HBase Engine Operators Mongo Parser Distributed Cache

  17. Key Features Full SQL 2003 Nested data Optional schema Extensibility points

  18. Full SQL – ANSI SQL 2003 • SQL-like is often not enough • Integration with existing tools • Datameer, Tableau, Excel, SAP Crystal Reports • Use standard ODBC/JDBC driver

  19. Nested Data • Nested data becoming prevalent • JSON/BSON, XML, ProtoBuf, Avro • Some data sources support it natively(MongoDB, etc.) • Flattening nested data is error-prone • Extension to ANSI SQL 2003

  20. Optional Schema • Many data sources don’t have rigid schemas • Schema changes rapidly • Different schema per record (e.g. HBase) • Supports queries against unknown schema • User can define schema or via discovery

  21. Extensibility Points Source Query Logical Plan Physical Plan Parser Optimizer Execution Source query parser API Custom operators, UDF logical plan Serving tree, CF, topology  physical plan/optimizer Data sources &formats scanner API

  22. … and Hadoop? *) https://cloud.google.com/files/BigQueryTechnicalWP.pdf • HDFS can be a data source • Complementary use cases* • … use Apache Drill • Find record with specified condition • Aggregation under dynamic conditions • … use MapReduce • Data mining with multiple iterations • ETL

  23. Example { "id": "0001", "type": "donut", ”ppu": 0.55, "batters": { "batter”: [ { "id": "1001", "type": "Regular" }, { "id": "1002", "type": "Chocolate" }, … { "sales" : 700.0, "typeCount" : 1, "quantity" : 700, "ppu" : 1.0 } { "sales" : 109.71, "typeCount" : 2, "quantity" : 159, "ppu" : 0.69 } { "sales" : 184.25, "typeCount" : 2, "quantity" : 335, "ppu" : 0.55 } data source: donuts.json query:[ { op:"sequence", do:[ { op: "scan", ref: "donuts", source: "local-logs", selection: {data: "activity"} }, { op: "filter", expr: "donuts.ppu < 2.00" }, … result: out.json logical plan: simple_plan.json https://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo

  24. Status • Heavy development by multiple organizations • Available • Logical plan (ADSP) • Reference interpreter • Basic SQL parser • Basic demo • Basic HBase back-end

  25. Status April 2013 Extend SQL syntax Physical plan In-memory compressed data interfaces Distributed execution

  26. Contributing • Learn where and how to contributehttps://cwiki.apache.org/confluence/display/DRILL/Contributing • Jira, Git, Apache build and test tools • Preparing for dependencies • Hazelcast • Netflix Curator

  27. Contributing General contributions appreciated: Supersonic(?) Test data & test queries Use case scenarios (textual desc./SQL queries) Documentation

  28. Contributing • Dremel-inspired columnar format • Twitter’s Parquet • Hive’s ORC file • Integration with Hive metastore (?) • DRILL-13 Storage Engine: Define Java Interface • DRILL-15 Build HBase storage engine implementation

  29. Contributing • DRILL-48 RPC interface for query submission and physical plan execution • DRILL-53 Setup cluster configuration and membership mgmt system • Further schedule • Alpha Q2 • Beta Q3

  30. Kudos to … Julian Hyde, Pentaho Lisen Mu Tim Chen, Microsoft Chris Merrick, RJMetrics David Alves, UT Austin SreeVaadi, SSS/NGData Jacques Nadeau, MapR Ted Dunning, MapR

  31. Engage! Follow @ApacheDrill on Twitter Sign up at mailing lists (user | dev) http://incubator.apache.org/drill/mailing-lists.html Standing G+ hangouts every Tuesday at 18:00 CET Keep an eye on http://drill-user.org/

More Related