
Moving Data in Real Time into Amazon Redshift


Presentation Transcript


  1. Moving Data in Real Time into Amazon Redshift. By Matthew Lang, Director of Professional Services. May 30, 2019

  2. Welcome! “Continuent, the MySQL Availability Company”

  3. Who are you anyway? Most people call me Matt • Started as a Programmer/DBA for the Progress 4GL database • Wrote a real-time replicator between MySQL and the Progress database, called Pro2MySQL • Part-time musician • Live in Miami • Just learned how to do animations in PowerPoint

  4. Topics In this session, we will discuss: • What is Amazon’s Redshift? • Traditional methods of getting data out of MySQL* databases and into data warehouses • Introducing Tungsten Replicator • Extra tricks • Filters * MySQL is understood in a broad context, including MySQL, MariaDB, Percona Server, RDS MySQL, RDS Aurora and Google Cloud SQL

  5. Redshift

  6. Why Amazon Redshift? • High-performance, clustered analytics data warehouse • Easy to start, stop, and expand • PostgreSQL JDBC interface • Easy to access • Column-based store with parallel processing • Automated backups • Integrates with S3 • Online query editor

  7. ETL

  8. ETL, again. What really happens with ETL (Extract) • Need to query tables based on changes since the last read • Tables in the query usually need keys based on a timestamp (see the sketch below) • Read all necessary columns • If successful, record the success time, or the primary-key position for each table • If it breaks, find the error, fix it, and rerun the job
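
A minimal sketch of what such an extract query looks like in SQL; the orders table, its columns, and the checkpoint value are illustrative, not from the deck:

    -- Pull only the rows changed since the last successful run.
    -- The checkpoint timestamp comes from wherever the previous job recorded it.
    SELECT id, customer_id, total, updated_at
    FROM orders
    WHERE updated_at > '2019-05-29 02:00:00'  -- last recorded success time
    ORDER BY updated_at;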

  9. Your MySQL Database on ETL • The database spends time on large queries • Because of that, you can run ETL on a slave, but we need to run the nightly backup on that slave too • Buffer pool pollution: meaningful OLTP data is evicted from the buffer pool and replaced with ETL results, reducing cache hits (37%) • Can be controlled with innodb_old_blocks_pct and innodb_old_blocks_time (see the sketch below) • Data is stale as soon as the extract process is done • If it breaks, an admin fixes the error, but it might be too late to rerun a nightly job: too much database load during the day • So you run a "double" nightly job, and today's analytics stay stale • Repeat from step 1
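
A hedged sketch of tuning those two InnoDB variables; the values shown are illustrative, not recommendations from the deck:

    -- Keep large ETL scans from promoting their pages over hot OLTP data.
    SET GLOBAL innodb_old_blocks_pct = 20;     -- shrink the "old" sublist (default is 37%)
    SET GLOBAL innodb_old_blocks_time = 1000;  -- ms a page must survive before promotion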

  10. ETL (Transform) • The second phase of the ETL process • Sometimes combined with the Extract process, creating yet more load on our database • Filters: • Add a column (source db identifier) • Drop columns, like blobs or CC numbers • Combine or denormalize tables • Can be (another) time-consuming task (a small sketch follows below)
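
A minimal sketch of a transform step expressed in SQL; the table and column names are made up for illustration:

    -- Tag each row with its source database and leave sensitive or bulky columns behind.
    SELECT
        'us_east' AS source_db,   -- added column: source db identifier
        id, customer_id, total, updated_at
        -- blob and credit-card columns are simply not selected (dropped)
    FROM orders_extract;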

  11. ETL (Load) • Finally! Load into the target (Redshift and/or others) • What about rows that are updated? The target contains the old version of the row; the extract contains the new data • We must remove the old row and insert the new row (see the sketch below) • Multiple updates to the same row mean multiple deletes and inserts for that key
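
The delete-and-insert dance, sketched in SQL against hypothetical orders (target) and orders_incoming (extracted batch) tables:

    -- Remove the old versions of any rows present in the incoming batch...
    DELETE FROM orders
    WHERE id IN (SELECT id FROM orders_incoming);
    -- ...then insert the new versions.
    INSERT INTO orders
    SELECT id, customer_id, total, updated_at FROM orders_incoming;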

  12. Real-Time Replication into Redshift

  13. Data Warehouse Integration is Changing • Traditional data warehouse usage was based on dumps from the transactional store, loaded into the data warehouse • Data warehousing and analytics were done off the loaded historical data • Data warehouses often use merged data from multiple sources, which is difficult to combine and manage • Data warehouses are now frequently sources as well as targets for data, e.g.: • Export data to the data warehouse • Analyze the data • Feed summary data back to the application to display stats to users

  14. A Use Case for Real Time

  15. Tungsten Replicator: Data Warehouses. Tungsten Replicator is a fast database replication engine • Designed for speed and flexibility • Reads directly from the MySQL binary logs • Replicates to many targets

  16. How Redshift Replication Works

  17. MySQL Extraction • Read directly from the MySQL binary logs, or request the binary logs remotely • Write into the THL (Transaction History Log) • Our own format • Contains the transactional data • Contains additional metadata • Extract once [Diagram: the MySQL DBMS writes its binary logs; the master replicator (extractor) reads them and writes the THL. THL = events + metadata]

  18. Apply into Redshift • Extract data from the THL • Batch incoming rows into CSV files • The CSV is loaded into S3 using s3cmd • Redshift imports the CSV files into staging tables (see the COPY sketch below) • A "materialization" is performed to merge the staging data into the existing production tables [Diagram: the slave replicator (applier) reads the THL, a JS batch script writes CSV, s3cmd copies it to S3, and Redshift loads and merges it over JDBC]
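
A hedged sketch of the COPY step; the bucket, file name, staging table, and IAM role are placeholders, not Tungsten's actual generated names:

    -- Load one CSV batch from S3 into a Redshift staging table.
    COPY stage_orders
    FROM 's3://my-bucket/tungsten/orders-0001.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
    CSV;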

  19. How Materialization Works
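
The slide relied on a diagram, so here is a minimal SQL sketch of the merge pattern it describes: delete target rows that have newer versions in staging, insert the current versions, and clear the staging table. The table names and the opcode column are illustrative; the real applier also resolves multiple changes to the same key by sequence number.

    BEGIN;
    -- Remove rows that were updated or deleted upstream...
    DELETE FROM orders
    USING stage_orders s
    WHERE orders.id = s.id;
    -- ...re-insert the current version of inserted/updated rows...
    INSERT INTO orders (id, customer_id, total, updated_at)
    SELECT id, customer_id, total, updated_at
    FROM stage_orders
    WHERE opcode <> 'D';  -- skip rows whose final operation was a delete
    -- ...and empty the staging table for the next batch.
    DELETE FROM stage_orders;
    COMMIT;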

  20. Fun stuff available for Redshift with Tungsten Replicator • Long-term CDC collection • Save the CSV files in S3 even after merging changes • Provides a complete audit trail of all changes; for instance, track all price changes for a particular item • CSV file compression • Adds some overhead to the merge • Can save on storage costs • Uploads complete more quickly • Fan-in • Merge databases into a single schema • Perfect for sharded datasets • DDL translation • Filters, filters, and more filters!

  21. Replicating into a Single Schema [Diagram: separate USA, UK, and Japan databases each feed a replicator, which adds the DB name to each row and merges everything into a single schema]
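
With the database name added to each row, the merged table can be queried per source; a tiny illustrative query, assuming the added column is called dbname:

    -- Count rows per originating database in the merged schema.
    SELECT dbname, COUNT(*) AS row_count
    FROM orders
    GROUP BY dbname;  -- e.g. one row each for the USA, UK, and Japan sources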

  22. Filters • Over 40 filters included • Add filters at any stage of replication • Want even more? Write your own in JavaScript!

  23. Popular Filters • Database Transform – change the names of extracted schemas or tables • Dropcolumn – drops specified columns from the THL • Replicate – include or exclude tables to be replicated • Rowadddbname – add the database name to each row (just like in the example) • Timedelay – delay writing events to the THL; not useful for data warehouses, but for real-time targets you can specify a time delay to purposely keep them behind • ddltranslate – replicate DDL changes

  24. ddlscan

    MySQL source:

    CREATE TABLE `mytable` (
      `id` bigint(20) NOT NULL AUTO_INCREMENT,
      `stuff` varchar(128) DEFAULT NULL,
      `mytime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
      `units` decimal(8,5) DEFAULT NULL,
      `morestuff` text,
      PRIMARY KEY (`id`)
    );

    Redshift translation:

    CREATE TABLE matt.mytable (
      id BIGINT,
      stuff VARCHAR(512) /* VARCHAR(128) */,
      mytime TIMESTAMP,
      units DECIMAL(8,5),
      morestuff VARCHAR(65535) /* WARN: MySQL TEXT translated to max VARCHAR */,
      PRIMARY KEY (id)
    );

  25. Next Steps. Sign up for a private demo for your team or set up a POC: email us at sales@continuent.com. Learn more at your own pace • Training and webinar library at www.continuent.com/videos/ • White papers at www.continuent.com/white-papers/ • Read the documentation at http://docs.continuent.com/
