
Moving Data in Real Time into Amazon Redshift


Presentation Transcript


  1. Moving Data in Real Time into Amazon Redshift. By Matthew Lang, Director of Professional Services. May 30, 2019

  2. Welcome! “Continuent, the MySQL Availability Company”

  3. Who are you anyway? Most people call me Matt • Started as a Programmer/DBA for the Progress 4GL database • Wrote a real-time replicator between MySQL and the Progress database, called Pro2MySQL • Part-time musician • Live in Miami • Just learned how to do animations in PowerPoint

  4. Topics In this session, we will discuss: • What is Amazon’s Redshift? • Traditional methods of getting data out of MySQL* databases and into data warehouses • Introducing Tungsten Replicator • Extra tricks • Filters * MySQL is understood in a broad context, including MySQL, MariaDB, Percona Server, RDS MySQL, RDS Aurora and Google Cloud SQL

  5. Redshift

  6. Why Amazon Redshift? • High-performance, clustered analytics data warehouse • Easy to start, stop, and expand • PostgreSQL JDBC interface • Easy to access • Column-based store with parallel processing • Automated backups • Integrates with S3 • Online query editor

  7. ETL

  8. ETL, again. What really happens with ETL (Extract) • Need to query tables based on changes since the last read • Tables in the query usually need keys based on a timestamp (see the sketch below) • Read all necessary columns • If successful, record the success time, or the primary-key position for each table • If it breaks, find the error, fix it, and rerun the job
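
A minimal sketch of what such an extract query looks like in SQL; the orders table, its columns, and the checkpoint value are illustrative, not from the deck:

    -- Pull only the rows changed since the last successful run.
    -- The checkpoint timestamp comes from wherever the previous job recorded it.
    SELECT id, customer_id, total, updated_at
    FROM orders
    WHERE updated_at > '2019-05-29 02:00:00'  -- last recorded success time
    ORDER BY updated_at;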

  9. Your MySQL Database on ETL • The database spends time on large queries • Because of that, you can run ETL on a slave, but we need to run the nightly backup on that slave too • Buffer pool pollution: meaningful OLTP data is evicted from the buffer pool and replaced with ETL results, reducing cache hits (37%) • Can be controlled with innodb_old_blocks_pct and innodb_old_blocks_time (see the sketch below) • Data is stale as soon as the extract process is done • If it breaks, an admin fixes the error, but it might be too late to rerun a nightly job: too much database load during the day • So you run a "double" nightly job, and today's analytics stay stale • Repeat from step 1
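
A hedged sketch of tuning those two InnoDB variables; the values shown are illustrative, not recommendations from the deck:

    -- Keep large ETL scans from promoting their pages over hot OLTP data.
    SET GLOBAL innodb_old_blocks_pct = 20;     -- shrink the "old" sublist (default is 37%)
    SET GLOBAL innodb_old_blocks_time = 1000;  -- ms a page must survive before promotion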

  10. ETL (Transform) • The second phase of the ETL process • Sometimes combined with the Extract process, creating yet more load on our database • Filters: • Add a column (source db identifier) • Drop columns, like blobs or CC numbers • Combine or denormalize tables • Can be (another) time-consuming task (a small sketch follows below)
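
A minimal sketch of a transform step expressed in SQL; the table and column names are made up for illustration:

    -- Tag each row with its source database and leave sensitive or bulky columns behind.
    SELECT
        'us_east' AS source_db,   -- added column: source db identifier
        id, customer_id, total, updated_at
        -- blob and credit-card columns are simply not selected (dropped)
    FROM orders_extract;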

  11. ETL (Load) • Finally! Load into the target (Redshift and/or others) • What about rows that are updated? The target contains the old version of the row; the extract contains the new data • We must remove the old row and insert the new row (see the sketch below) • Multiple updates to the same row mean multiple deletes and inserts for that key
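
The delete-and-insert dance, sketched in SQL against hypothetical orders (target) and orders_incoming (extracted batch) tables:

    -- Remove the old versions of any rows present in the incoming batch...
    DELETE FROM orders
    WHERE id IN (SELECT id FROM orders_incoming);
    -- ...then insert the new versions.
    INSERT INTO orders
    SELECT id, customer_id, total, updated_at FROM orders_incoming;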

  12. Real-Time Replication into Redshift

  13. Data Warehouse Integration is Changing • Traditional data warehouse usage was based on dumps from the transactional store, loaded into the data warehouse • Data warehousing and analytics were done off the loaded historical data • Data warehouses often use merged data from multiple sources, which is difficult to combine and manage • Data warehouses are now frequently sources as well as targets for data, e.g.: • Export data to the data warehouse • Analyze the data • Feed summary data back to the application to display stats to users

  14. A Use Case for Real Time

  15. Tungsten Replicator: Data Warehouses. Tungsten Replicator is a fast database replication engine • Designed for speed and flexibility • Reads directly from the MySQL binary logs • Replicates to many targets

  16. How Redshift Replication Works

  17. MySQL Extraction • Read directly from the MySQL binary logs, or request the binary logs remotely • Write into the THL (Transaction History Log) • Our own format • Contains the transactional data • Contains additional metadata • Extract once [Diagram: the MySQL DBMS writes its binary logs; the master replicator (extractor) reads them and writes the THL. THL = events + metadata]

  18. Apply into Redshift • Extract data from the THL • Batch incoming rows into CSV files • The CSV is loaded into S3 using s3cmd • Redshift imports the CSV files into staging tables (see the COPY sketch below) • A "materialization" is performed to merge the staging data into the existing production tables [Diagram: the slave replicator (applier) reads the THL, a JS batch script writes CSV, s3cmd copies it to S3, and Redshift loads and merges it over JDBC]
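
A hedged sketch of the COPY step; the bucket, file name, staging table, and IAM role are placeholders, not Tungsten's actual generated names:

    -- Load one CSV batch from S3 into a Redshift staging table.
    COPY stage_orders
    FROM 's3://my-bucket/tungsten/orders-0001.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
    CSV;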

  19. How Materialization Works
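
The slide relied on a diagram, so here is a minimal SQL sketch of the merge pattern it describes: delete target rows that have newer versions in staging, insert the current versions, and clear the staging table. The table names and the opcode column are illustrative; the real applier also resolves multiple changes to the same key by sequence number.

    BEGIN;
    -- Remove rows that were updated or deleted upstream...
    DELETE FROM orders
    USING stage_orders s
    WHERE orders.id = s.id;
    -- ...re-insert the current version of inserted/updated rows...
    INSERT INTO orders (id, customer_id, total, updated_at)
    SELECT id, customer_id, total, updated_at
    FROM stage_orders
    WHERE opcode <> 'D';  -- skip rows whose final operation was a delete
    -- ...and empty the staging table for the next batch.
    DELETE FROM stage_orders;
    COMMIT;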

  20. Fun stuff available for Redshift with Tungsten Replicator • Long-term CDC collection • Save the CSV files in S3 even after merging changes • Provides a complete audit trail of all changes; for instance, track all price changes for a particular item • CSV file compression • Adds some overhead to the merge • Can save on storage costs • Uploads complete more quickly • Fan-in • Merge databases into a single schema • Perfect for sharded datasets • DDL translation • Filters, filters, and more filters!

  21. Replicating into a Single Schema [Diagram: separate USA, UK, and Japan databases each feed a replicator, which adds the DB name to each row and merges everything into a single schema]
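
With the database name added to each row, the merged table can be queried per source; a tiny illustrative query, assuming the added column is called dbname:

    -- Count rows per originating database in the merged schema.
    SELECT dbname, COUNT(*) AS row_count
    FROM orders
    GROUP BY dbname;  -- e.g. one row each for the USA, UK, and Japan sources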

  22. Filters • Over 40 filters included • Add filters at any stage of replication • Want even more? Write your own in JavaScript!

  23. Popular Filters • Database Transform – change the names of extracted schemas or tables • Dropcolumn – drops specified columns from the THL • Replicate – include or exclude tables to be replicated • Rowadddbname – add the database name to each row (just like in the example) • Timedelay – delay writing events to the THL; not useful for data warehouses, but for real-time targets you can specify a time delay to purposely keep them behind • ddltranslate – replicate DDL changes

  24. ddlscan

    MySQL source:

    CREATE TABLE `mytable` (
      `id` bigint(20) NOT NULL AUTO_INCREMENT,
      `stuff` varchar(128) DEFAULT NULL,
      `mytime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
      `units` decimal(8,5) DEFAULT NULL,
      `morestuff` text,
      PRIMARY KEY (`id`)
    );

    Redshift translation:

    CREATE TABLE matt.mytable (
      id BIGINT,
      stuff VARCHAR(512) /* VARCHAR(128) */,
      mytime TIMESTAMP,
      units DECIMAL(8,5),
      morestuff VARCHAR(65535) /* WARN: MySQL TEXT translated to max VARCHAR */,
      PRIMARY KEY (id)
    );

  25. Next Steps. Sign up for a private demo for your team or set up a POC: email us at sales@continuent.com. Learn more at your own pace • Training and webinar library at www.continuent.com/videos/ • White papers at www.continuent.com/white-papers/ • Read the documentation at http://docs.continuent.com/
