210 likes | 282 Views
Add A Billion Row Data Warehouse To Your App. with Redshift, sql and duct tape. James Crisp, Tech Principal @ Getup. Context. GetUp ! is an independent movement to build a progressive Australia and bring participation back into our democracy .
E N D
Add A Billion Row Data Warehouse To Your App with Redshift, sql and duct tape James Crisp, Tech Principal @ Getup
Context GetUp! is an independent movement to build a progressive Australia and bring participation back into our democracy. Lots of online campaigns, field actions, social media. Supported by small donations. 600K+ members. Budget, medicare, uni fees, barrier reef, forests etc.
Big Rails App • CMS + Petitions, emailing MPs, donations etc • Email blasting & segmenting • Back office & mini-crm • 2 X [3 app, 2 worker, 1 DB servers], Au and Sg
Data Warehouse… why?? • Reporting & exports • Experimental data science • Stop locking up transactional DB!! • More data sources (logs, CRM, ..) => customer • Different schema & faster queries
Options • Read-only replica of transactional DB • Mongo / Cassandra / .. • Hadoop, Pig, Hive • Elastic search • BIG Sql, eg Redshift
Why BIG Sql? • Team skills: tech & data scientists • Easy integration from SQL DB • Good hosted options • Fast performance, column based • Sets and aggregations • Can do JSON for less structured data
Why Redshift? • Fully hosted & managed multi-node • Fast & Column based, semi-compressed • Relatively cheap and easy to try • Good import options • Massively expandable • (Security & backup options)
What is Redshift… really? • Heavily modified fork of PostgreSQL 8 • Specialised data storage & query engine • Can use normal ODBC/JDBC/Postgres clients to connect
string connString = "Driver={PostgreSQL Unicode};" + String.Format("Server={0};Database={1};" + "UID={2};PWD={3};Port={4};SSL=true;Sslmode=Require", server, DBName, masterUsername, masterUserPassword, port); OdbcConnectionconn = new OdbcConnection(connString); conn.Open(); OdbcDataAdapterda = new OdbcDataAdapter(sql, conn); da.Fill(ds); dt = ds.Tables[0]; foreach (DataRow row in dt.Rows) { // Do something useful }
$DBConnectionString = "Driver={PostgreSQL UNICODE}:Server=$MyServer;Port=$MyPort;Database=$MyDB;Uid=$MyUid;Pwd=$MyPass;" $DBConn = New-Object System.Data.Odbc.OdbcConnection; $DBConn.ConnectionString = $DBConnectionString; $DBConn.Open(); $DBCmd = $DBConn.CreateCommand(); $DBCmd.CommandText = "SELECT * FROM mytable;"; $DBCmd.ExecuteReader(); $DBConn.Close();
How much does it cost? • Min cluster size 2, leader is free • 2 X 2TB HDD, 15G RAM: Syd $2.50/h, US $1.70/h (reserved 1yr Syd $13.2K, US $8.8K) • Up to 2.56TB flash or 16TB HDD and 244GB RAM per node. Cluster up to 1.6 PB.
Data Sources so far… Transactional DB Application request logs
Hooking up DB Data DB Server Redshift MySQL S3 Map & Dump CSVs LOAD CSVs Fire Drop/Create tables, Load data
Hooking up Request Logs App Servers Fire data load Upload JSON logs S3 Redshift Map & Load JSON
Demos • Scripts & SQL for hooking up • AWS Redshift console • Connect & query
What have we used it for? • Faster data science / reporting without locking up transactional DB • Combining sharded tables • Request logs (browser agent, params, ..) • Marking/tracking segments + debugging info
Other uses • Data exploration with tools like Tableau • BI tools • Denormalised data • Dump in more data for single view of user – FB info, CRM, etc
Conclusion • Easy to set up and use (for Win/.NET too) • Super fast • Reasonable price • Met our needs well so far We are hiring atm!