
James Taylor jtaylor@salesforce.com



Presentation Transcript


  1. Phoenix: We put the SQL back in NoSQL (James Taylor, jtaylor@salesforce.com)

  2. Agenda • What is Phoenix? • Why SQL? • What is next? • Q&A

  3. What is Phoenix?
     • SQL layer on top of HBase
     • Delivered as an embedded JDBC driver
     • Targets low-latency queries over HBase data
     • Columns modeled as multi-part row key and key values
     • Versioned schema repository
     • Query engine transforms SQL into puts, deletes, and scans
     • Uses native HBase APIs instead of Map/Reduce
     • Brings the computation to the data:
       • Aggregate, insert, and delete data through coprocessors
       • Push predicates through custom filters
     • 100% Java
     • Open source: https://github.com/forcedotcom/phoenix

  4. Why SQL?
     • Broaden HBase adoption
     • Give folks an API they already know
     • Reduce the amount of code users need to write
         SELECT TRUNC(date,'DAY'), AVG(cpu_usage)
         FROM web_stat
         WHERE domain LIKE 'Salesforce%'
         GROUP BY TRUNC(date,'DAY')
     • Performance optimizations transparent to the user
       • Aggregation
       • Stats gathering
       • Secondary indexing
     • Leverage existing tooling
       • SQL client

  5. But I can’t surface x,y,z in SQL…

  6. But I can’t surface x,y,z in SQL…

  7. But I can’t surface x,y,z in SQL… • Define multi-part row keys

  8. But I can’t surface x,y,z in SQL…
     • Define multi-part row keys
         CREATE TABLE web_stat (
           domain VARCHAR NOT NULL,
           feature VARCHAR NOT NULL,
           date DATE NOT NULL,
           usage BIGINT,
           active_visitor INTEGER,
           CONSTRAINT pk PRIMARY KEY (domain, feature, date)
         );
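A rough sketch of what a multi-part row key means in practice: the primary key columns are concatenated into one byte string whose byte-wise sort order matches the (domain, feature, date) order. The encoding here (zero-byte terminator after each string, 8-byte big-endian date) is an assumption made for illustration, not something stated on the slides, and `encode_row_key` is a hypothetical helper.

```python
import struct


def encode_row_key(domain: str, feature: str, epoch_millis: int) -> bytes:
    """Concatenate the primary key columns into a single sortable byte string.

    Assumed encoding (illustrative only): each variable-length string is
    followed by a zero byte, and the date is an 8-byte big-endian integer.
    """
    if "\x00" in domain or "\x00" in feature:
        raise ValueError("key parts must not contain the separator byte")
    return (
        domain.encode("utf-8") + b"\x00"
        + feature.encode("utf-8") + b"\x00"
        + struct.pack(">Q", epoch_millis)
    )


rows = [
    ("Salesforce.com", "Login", 1_357_000_000_000),
    ("Salesforce.com", "Dashboard", 1_357_000_000_000),
    ("Apple.com", "Login", 1_357_086_400_000),
    ("Salesforce.com", "Login", 1_356_000_000_000),
]

# Sorting the encoded keys gives the order in which a scan would return rows.
keys_in_scan_order = sorted(encode_row_key(*row) for row in rows)
```

Because the leading key parts come first in the byte string, all rows for one domain sit next to each other, which is what makes a range scan over a domain prefix cheap.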

  9. But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang custom function

  10. But I can’t surface x,y,z in SQL…
      • Define multi-part row keys
      • Implement my whizz-bang custom function
        • Derive class from ScalarFunction
        • Add annotation to define name, args, and types
        • Implement evaluate method
        • Register function
        • (blog on this coming soon: http://phoenix-hbase.blogspot.com/)

  11. But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang built-in function • Run snapshot in time queries

  12. But I can’t surface x,y,z in SQL…
      • Define multi-part row keys
      • Implement my whizz-bang built-in function
      • Run snapshot in time queries
        • Set CURRENT_SCN property on connection to earlier timestamp
        • Queries will see only rows before timestamp
        • Schema in-place at that point in time will be used
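The snapshot behaviour can be pictured with a toy versioned store: every write carries a timestamp, and a read with an earlier CURRENT_SCN only sees versions written before it. This is a conceptual sketch in plain Python, not Phoenix or HBase code; `VersionedTable` and its methods are hypothetical names.

```python
from typing import Dict, List, Optional, Tuple


class VersionedTable:
    """Toy store that keeps every (timestamp, value) version of a row."""

    def __init__(self) -> None:
        self._cells: Dict[str, List[Tuple[int, str]]] = {}

    def put(self, row_key: str, value: str, timestamp: int) -> None:
        self._cells.setdefault(row_key, []).append((timestamp, value))

    def get(self, row_key: str, current_scn: Optional[int] = None) -> Optional[str]:
        """Return the newest value, or the newest one strictly before current_scn."""
        versions = self._cells.get(row_key, [])
        visible = [v for v in versions if current_scn is None or v[0] < current_scn]
        if not visible:
            return None
        return max(visible, key=lambda v: v[0])[1]


table = VersionedTable()
table.put("row1", "v1", timestamp=100)
table.put("row1", "v2", timestamp=200)

latest = table.get("row1")                      # no snapshot: newest version
as_of_150 = table.get("row1", current_scn=150)  # only the write at 100 is visible
as_of_100 = table.get("row1", current_scn=100)  # nothing was written before 100
```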

  13. But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang built-in function • Run snapshot in time queries • Nest child entities inside of a row

  14. But I can’t surface x,y,z in SQL…
      • Define multi-part row keys
      • Implement my whizz-bang built-in function
      • Run snapshot in time queries
      • Nest child entities inside of a row
        • Declare new child entity as nested table
        • Prefix column qualifier of nested entities with:
          • table name + child primary key + child column name
        • Restrict join to be only through parent/child relation
        • Execute query by scanning nested child rows
        • TBD: https://github.com/forcedotcom/phoenix/issues/19

  15. But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang built-in function • Run snapshot in time queries • Nest child entities inside of a row • Prevent hot spotting on writes

  16. But I can’t surface x,y,z in SQL…
      • Define multi-part row keys
      • Implement my whizz-bang built-in function
      • Run snapshot in time queries
      • Nest child entities inside of a row
      • Prevent hot spotting on writes
        • “Salt” row key on upsert by mod-ing with cluster size
        • Query for fully qualified key by inserting salt byte
        • Range scan by concatenating results of scan over all possible salt bytes
        • Or alternately
          • Define column used for hash to derive row key prefix
        • TBD: https://github.com/forcedotcom/phoenix/issues/74
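The salting idea can be sketched in a few lines. This is a toy in-memory model, assuming a CRC32 hash and a fixed bucket count of 4 standing in for the cluster size; the slides do not specify the hash, and `SaltedTable` is a hypothetical name. Note that the sketch re-sorts the per-bucket results so the caller gets them back in row key order.

```python
import zlib

NUM_BUCKETS = 4  # assumption: stands in for the cluster size on the slide


def salt_byte(row_key: bytes) -> int:
    """Derive the salt from the key itself so that readers can recompute it."""
    return zlib.crc32(row_key) % NUM_BUCKETS


def salted(row_key: bytes) -> bytes:
    return bytes([salt_byte(row_key)]) + row_key


class SaltedTable:
    def __init__(self) -> None:
        self._rows = {}  # salted key -> value

    def upsert(self, row_key: bytes, value) -> None:
        self._rows[salted(row_key)] = value

    def get(self, row_key: bytes):
        # Point lookup: insert the salt byte and fetch the fully qualified key.
        return self._rows.get(salted(row_key))

    def range_scan(self, start: bytes, stop: bytes):
        # One scan per possible salt byte, then combine the results.
        results = []
        for bucket in range(NUM_BUCKETS):
            prefix = bytes([bucket])
            low, high = prefix + start, prefix + stop
            for key in sorted(self._rows):
                if low <= key < high:
                    results.append((key[1:], self._rows[key]))
        return sorted(results)  # restore row key order across buckets


table = SaltedTable()
for i in range(20):
    table.upsert(b"row%03d" % i, i)

scanned = table.range_scan(b"row005", b"row010")
```

The trade-off is visible in `range_scan`: writes with sequential keys are spread over the buckets, but every range query now costs one scan per bucket.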

  17. But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang built-in function • Run snapshot in time queries • Nest child entities inside of a row • Prevent hot spotting on writes • Increment atomic counter

  18. But I can’t surface x,y,z in SQL…
      • Define multi-part row keys
      • Implement my whizz-bang built-in function
      • Run snapshot in time queries
      • Nest child entities inside of a row
      • Prevent hot spotting on writes
      • Increment atomic counter
        • Surface the HBase put-and-increment functionality through the standard SQL sequence support
        • TBD: https://github.com/forcedotcom/phoenix/issues/18

  19. But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang built-in function • Run snapshot in time queries • Nest child entities inside of a row • Prevent hot spotting on writes • Increment atomic counter • Sample table data

  20. But I can’t surface x,y,z in SQL…
      • Define multi-part row keys
      • Implement my whizz-bang built-in function
      • Run snapshot in time queries
      • Nest child entities inside of a row
      • Prevent hot spotting on writes
      • Increment atomic counter
      • Sample table data
        • Support the standard SQL TABLESAMPLE clause
        • Implement filter that uses a skip next hint
        • Base next key on the table stats “guide posts”
        • TBD: https://github.com/forcedotcom/phoenix/issues/22

  21. But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang built-in function • Run snapshot in time queries • Nest child entities inside of a row • Prevent hot spotting on writes • Increment atomic counter • Sample table data • Declare columns at query time

  22. But I can’t surface x,y,z in SQL…
      • Define multi-part row keys
      • Implement my whizz-bang built-in function
      • Run snapshot in time queries
      • Nest child entities inside of a row
      • Prevent hot spotting on writes
      • Increment atomic counter
      • Sample table data
      • Declare columns at query time
          SELECT col1,col2,col3
          FROM my_table(col2 VARCHAR, col3 INTEGER)
          WHERE col3 > 10
        • TBD: https://github.com/forcedotcom/phoenix/issues/9

  23. Conclusion • Phoenix fits the 80/20 use case rule • Let us know what you’d like to see added • Get involved – we need your help! • Think about how your new feature can be surfaced in SQL

  24. Thank you! Questions/comments?

  25. Query Processing
      Product Metrics HTable
        Row Key: ORG_ID, DATE, FEATURE
        Key Values: TXNS, IO_TIME, RESPONSE_TIME
      • Scan
        • Start key: ORG_ID (:1) + DATE (:2)
        • End key: ORG_ID (:1) + DATE (:3)
      • Filter
        • Filter: IO_TIME > 100
      • Aggregation
        • Intercepts scan on region server
        • Builds map of distinct FEATURE values
        • Returns one row per distinct group
        • Client does final merge
          SELECT feature, SUM(txns)
          FROM product_metrics
          WHERE org_id = :1 AND date >= :2 AND date <= :3 AND io_time > 100
          GROUP BY feature
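The two-phase aggregation on this slide can be modelled with plain dictionaries: each region builds a partial map of FEATURE to SUM(txns) over the rows it holds, and the client merges the partial maps. This is a conceptual sketch only; the function names and the sample rows are made up for illustration and do not come from the talk.

```python
from collections import Counter


def region_aggregate(region_rows, org_id, start_date, end_date):
    """Region-side step: restrict to the key range, filter, and group."""
    partial = Counter()
    for row in region_rows:
        in_key_range = (
            row["org_id"] == org_id and start_date <= row["date"] <= end_date
        )
        if in_key_range and row["io_time"] > 100:
            partial[row["feature"]] += row["txns"]
    return partial  # one entry per distinct group seen in this region


def client_merge(partials):
    """Client-side step: add up the per-region partial sums."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


region_a = [
    {"org_id": "org1", "date": "2013-01-01", "feature": "Login", "txns": 10, "io_time": 150},
    {"org_id": "org1", "date": "2013-01-02", "feature": "Report", "txns": 5, "io_time": 90},
    {"org_id": "org2", "date": "2013-01-01", "feature": "Login", "txns": 99, "io_time": 500},
]
region_b = [
    {"org_id": "org1", "date": "2013-01-03", "feature": "Login", "txns": 7, "io_time": 120},
    {"org_id": "org1", "date": "2013-01-03", "feature": "Report", "txns": 4, "io_time": 300},
    {"org_id": "org1", "date": "2013-02-01", "feature": "Report", "txns": 50, "io_time": 300},
]

result = client_merge(
    region_aggregate(rows, "org1", "2013-01-01", "2013-01-31")
    for rows in (region_a, region_b)
)
```

Only the small per-group maps travel back to the client, not the raw rows, which is the point of pushing the aggregation to the region servers.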

  26. Phoenix Query Optimizations
      • Start/stop key of scan based on AND-ed columns
        • Through SUBSTR, ROUND, TRUNC, LIKE
      • Parallelized on client by chunking over start/stop key of scan
      • Aggregation on region-servers through coprocessor
        • Inline for GROUP BY over row key ordered columns
        • In memory map per group otherwise
      • WHERE clause executed through custom filters
        • Incremental evaluation with early termination
        • Evaluated through byte pointers
      • IN and OR over same column (in progress)
        • Becomes batched get or filter with next row hint
      • Top N queries (future)
        • Through coprocessor keeping top N rows
      • TABLESAMPLE (future)
        • Becomes filter with next row hint
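As an illustration of deriving a scan's start/stop key from a predicate, here is a sketch for the LIKE case: a pattern such as 'Salesforce%' on the leading row key column can become a key range instead of a full scan. The helper name is hypothetical and the logic is a simplified guess at the idea, not Phoenix's actual optimizer code.

```python
from typing import Optional, Tuple


def like_prefix_to_scan_range(pattern: str) -> Optional[Tuple[bytes, bytes]]:
    """Return (start_key inclusive, stop_key exclusive) for 'prefix%' patterns.

    Returns None when the pattern cannot be turned into a single key range,
    for example when a wildcard appears before the end.
    """
    if not pattern.endswith("%"):
        return None
    prefix = pattern[:-1]
    if not prefix or "%" in prefix or "_" in prefix:
        return None
    start = prefix.encode("utf-8")
    # Stop key: the smallest byte string greater than every key with this prefix.
    stop = bytearray(start)
    while stop and stop[-1] == 0xFF:
        stop.pop()
    if not stop:
        return None
    stop[-1] += 1
    return start, bytes(stop)


scan_range = like_prefix_to_scan_range("Salesforce%")
no_range = like_prefix_to_scan_range("%force")
```

With the range in hand, a key such as b"Salesforce.com" falls inside [start, stop) while b"Salesforge" falls outside, so the scan never has to touch rows for other domains.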

  27. Phoenix Performance

  28. Phoenix Performance
