1 / 21

Add A Billion Row Data Warehouse To Your App

Add A Billion Row Data Warehouse To Your App. with Redshift, sql and duct tape. James Crisp, Tech Principal @ Getup. Context. GetUp ! is an independent movement to build a progressive Australia and bring participation back into our democracy .

andres
Download Presentation

Add A Billion Row Data Warehouse To Your App

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Add A Billion Row Data Warehouse To Your App with Redshift, sql and duct tape James Crisp, Tech Principal @ Getup

  2. Context GetUp! is an independent movement to build a progressive Australia and bring participation back into our democracy. Lots of online campaigns, field actions, social media. Supported by small donations. 600K+ members. Budget, medicare, uni fees, barrier reef, forests etc.

  3. Big Rails App • CMS + Petitions, emailing MPs, donations etc • Email blasting & segmenting • Back office & mini-crm • 2 X [3 app, 2 worker, 1 DB servers], Au and Sg

  4. Data Warehouse… why?? • Reporting & exports • Experimental data science • Stop locking up transactional DB!! • More data sources (logs, CRM, ..) => customer • Different schema & faster queries

  5. Options • Read-only replica of transactional DB • Mongo / Cassandra / .. • Hadoop, Pig, Hive • Elastic search • BIG Sql, eg Redshift

  6. Why BIG Sql? • Team skills: tech & data scientists • Easy integration from SQL DB • Good hosted options • Fast performance, column based • Sets and aggregations • Can do JSON for less structured data

  7. Why Redshift? • Fully hosted & managed multi-node • Fast & Column based, semi-compressed • Relatively cheap and easy to try • Good import options • Massively expandable • (Security & backup options)

  8. What is Redshift… really? • Heavily modified fork of PostgreSQL 8 • Specialised data storage & query engine • Can use normal ODBC/JDBC/Postgres clients to connect

  9. string connString = "Driver={PostgreSQL Unicode};" + String.Format("Server={0};Database={1};" + "UID={2};PWD={3};Port={4};SSL=true;Sslmode=Require", server, DBName, masterUsername, masterUserPassword, port); OdbcConnectionconn = new OdbcConnection(connString); conn.Open(); OdbcDataAdapterda = new OdbcDataAdapter(sql, conn); da.Fill(ds); dt = ds.Tables[0]; foreach (DataRow row in dt.Rows) { // Do something useful }

  10. $DBConnectionString = "Driver={PostgreSQL UNICODE}:Server=$MyServer;Port=$MyPort;Database=$MyDB;Uid=$MyUid;Pwd=$MyPass;" $DBConn = New-Object System.Data.Odbc.OdbcConnection; $DBConn.ConnectionString = $DBConnectionString; $DBConn.Open(); $DBCmd = $DBConn.CreateCommand(); $DBCmd.CommandText = "SELECT * FROM mytable;"; $DBCmd.ExecuteReader(); $DBConn.Close();

  11. psql.exe -h $DBSERVER -U $DBUSER -d $DBName-f script.sql

  12. How much does it cost? • Min cluster size 2, leader is free • 2 X 2TB HDD, 15G RAM: Syd $2.50/h, US $1.70/h (reserved 1yr Syd $13.2K, US $8.8K) • Up to 2.56TB flash or 16TB HDD and 244GB RAM per node. Cluster up to 1.6 PB.

  13. Data Sources so far… Transactional DB Application request logs

  14. Hooking up DB Data DB Server Redshift MySQL S3 Map & Dump CSVs LOAD CSVs Fire Drop/Create tables, Load data

  15. Hooking up Request Logs App Servers Fire data load Upload JSON logs S3 Redshift Map & Load JSON

  16. Demos • Scripts & SQL for hooking up • AWS Redshift console • Connect & query

  17. What have we used it for? • Faster data science / reporting without locking up transactional DB • Combining sharded tables • Request logs (browser agent, params, ..) • Marking/tracking segments + debugging info

  18. Other uses • Data exploration with tools like Tableau • BI tools • Denormalised data • Dump in more data for single view of user – FB info, CRM, etc

  19. Conclusion • Easy to set up and use (for Win/.NET too) • Super fast • Reasonable price • Met our needs well so far We are hiring atm!

More Related