
Moving Data




  1. Moving Data CMSC 491/691 Hadoop-Based Distributed Computing Spring 2014 Adam Shook

  2. Agenda • Sqoop • Flume

  3. Apache Sqoop

  4. Sqoop - SQL to Hadoop • Sqoop is a tool designed to transfer data between Hadoop and relational databases • Top-level Apache project, originally developed at Cloudera • Use Sqoop to move data between an RDBMS and HDFS • Sqoop automates most of this process, relying on the database to describe the schema of the data to be imported • Sqoop uses MapReduce to import and export the data • Uses JDBC or a custom connector interface

  5. What Can You Do With Sqoop . . . • The input to the import process is a database table • Sqoop reads the table row-by-row into HDFS • The import process is performed in parallel; the output is a set of files containing a copy of the imported table • These files may be delimited text files, or binary Avro or SequenceFiles • Sqoop also generates a Java class that encapsulates one row of the imported data and can be reused in subsequent MapReduce processing of that data • Used to serialize / deserialize the SequenceFile format • And to parse the delimited-text form of a record
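As a minimal sketch (the connection string, credentials, and table name are the hypothetical ones reused in the examples slide later), Sqoop's codegen tool can produce this record class on its own, writing the .java source for the table without importing any data:
$ sqoop codegen --connect jdbc:mysql://database.example.com/hr \
    --username abc --password 123 --table employees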

  6. What Else Can You Do With Sqoop . . . • You can export an HDFS file to an RDBMS • Sqoop’s export process reads a set of delimited text files from HDFS in parallel • Parses them into records • Inserts them as rows into an RDBMS table • Incremental imports are supported • Sqoop includes commands which allow you to inspect the RDBMS you are connected to
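A hedged sketch of these two features, reusing the hypothetical HR database from the examples slide; the check column and last imported value are assumptions:
# incremental import: only fetch rows whose id is greater than the last imported value
$ sqoop import --connect jdbc:mysql://database.example.com/hr \
    --username abc --password 123 --table employees \
    --incremental append --check-column id --last-value 10000

# inspect the RDBMS you are connected to
$ sqoop list-tables --connect jdbc:mysql://database.example.com/hr \
    --username abc --password 123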

  7. Sqoop Is Flexible • Most aspects of the import, code generation, and export processes can be customized • You can control the specific row range or columns imported • You can specify particular delimiters and escape characters for the file-based representation of the data • You can specify the file format used. • Sqoop provides connectors for MySQL, PostgreSQL, Netezza, Oracle, SQL Server, and DB2. • There is also a generic JDBC connector
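For illustration only (column names and delimiter choices are assumptions), an import that selects specific columns and sets custom delimiters for the text output; swapping in a flag such as --as-sequencefile or --as-avrodatafile would change the file format instead:
$ sqoop import --connect jdbc:mysql://database.example.com/hr \
    --username abc --password 123 --table employees \
    --columns "id,name,salary" \
    --fields-terminated-by '\t' --escaped-by '\\'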

  8. To Use Sqoop • To use Sqoop, specify the tool you want to run and the arguments that control it • Standard syntax: sqoop tool-name [tool-arguments] • Help is available: sqoop help [tool-name] or sqoop import --help

  9. Sqoop Tools • Tools to import / export data: • sqoop import • sqoop import-all-tables • sqoop create-hive-table • sqoop export • Tools to inspect a database: • sqoop list-databases • sqoop list-tables

  10. Sqoop Arguments
  • Common tool arguments
  --connect      JDBC connect string
  --username     Username for authentication
  --password     Password for authentication
  • Import control arguments
  --append       Append data to an existing dataset in HDFS
  --as-textfile  Imports data as plain text (default)
  --table        Table to read
  --target-dir   HDFS target directory
  --where        WHERE clause used for filtering
  • Sqoop also provides arguments and options for output line formatting, input parsing, Hive, code generation, HBase and many others

  11. Sqoop Examples
  • Import an employees table from the HR database
  $ sqoop import --connect jdbc:mysql://database.example.com/hr \
      --username abc --password 123 --table employees
  • Import the employees table from the HR database, but only employees whose salary exceeds $70,000
  $ sqoop import --connect jdbc:mysql://database.example.com/hr \
      --username abc --password 123 --table employees \
      --where "salary > 70000"
  • Export new employee data into the employees table in the HR database
  $ sqoop export --connect jdbc:mysql://database.example.com/hr \
      --table employees --export-dir /new_employees

  12. Apache Flume

  13. What Is Flume? • Apache Flume is a distributed, reliable system for efficiently collecting, aggregating and moving large amounts of log data from many different sources into HDFS • Supports complex multi-hop flows where events may travel through multiple agents before reaching HDFS • Allows fan-in and fan-out flows • Supports contextual routing and backup routes (fail-over) • Events are staged in a channel on each agent and delivered to the next agent or to the terminal repository (such as HDFS) in the flow • An event is removed from a channel only after it has been stored in the channel of the next agent or in HDFS

  14. Flume Components • Event – data being collected • Flume Agent – source, channel, sink • Source – where the data comes from • Channel – repository for the data • Sink – next destination for the data

  15. How Does It Work? • A Flume event is a unit of data flow • A Flume agent is a (JVM) process that hosts the components (source, channel, sink) through which events flow from an external source to the next destination (hop) • A Flume source receives events sent to it by an external source like a web server • Format specific • When a Flume source receives an event, it stores it into one or more channels • A channel is a passive store • Can be a memory channel • Can be a durable (e.g., file-backed) channel
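As a hedged configuration sketch (agent and channel names follow the a1/c1 convention of the example later in the deck; the directory paths are assumptions), a memory channel versus a durable file channel:
a1.channels.c1.type = memory
# fast, but events are lost if the agent process dies

a1.channels.c2.type = file
a1.channels.c2.checkpointDir = /var/flume/checkpoint
a1.channels.c2.dataDirs = /var/flume/data
# file-backed, so staged events survive an agent restart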

  16. How Does It Work? • The sink removes the event from the channel and puts it into an external repository (HDFS) or forwards it to the source of the next Flume agent (next hop) in the flow • The source and sink within the given agent run asynchronously with the events staged in the channel.
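A minimal sketch of an HDFS sink configuration, assuming the a1/k1/c1 names from the later example and a hypothetical NameNode path:
# write events to time-bucketed directories in HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 300
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1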

  17. Single Agent

  18. Multi-Agent
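The original slide is a diagram; as a rough sketch of the idea (agent, component, host, and port names are hypothetical), one agent's Avro sink forwards events to the Avro source of the next agent:
# first hop: agent "web" sends events to the collector over Avro
web.sinks.fwd.type = avro
web.sinks.fwd.hostname = collector.example.com
web.sinks.fwd.port = 4545
web.sinks.fwd.channel = c1

# second hop: agent "collector" receives those events
collector.sources.in.type = avro
collector.sources.in.bind = 0.0.0.0
collector.sources.in.port = 4545
collector.sources.in.channels = c1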

  19. Multiplexing Channels
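This slide is also a diagram; a hedged sketch of a multiplexing channel selector (the header name, values, and channel names are assumptions) that routes events to different channels based on an event header:
# route by the value of the "state" event header
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CA = c1
a1.sources.r1.selector.mapping.NY = c2
a1.sources.r1.selector.default = c1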

  20. Consolidation

  21. Configuration • To define the flow within a single agent • list the sources / channels / sinks • point the source and sink to a channel • Basic syntax:
  # list the sources, sinks and channels for the agent
  <Agent>.sources = <Source>
  <Agent>.sinks = <Sink>
  <Agent>.channels = <Channel1> <Channel2>
  # set channel for source
  <Agent>.sources.<Source>.channels = <Channel1> <Channel2> ...
  # set channel for sink
  <Agent>.sinks.<Sink>.channel = <Channel1>

  22. Flume Example • This agent lets a user generate events and displays them on the console • Defines a single agent named a1 • a1 has a source that listens for data on port 44444 • a1 has a channel that buffers event data in memory • a1 has a sink that logs event data to the console
  # example.conf: A single-node Flume configuration
  # Name the components on this agent
  a1.sources = r1
  a1.sinks = k1
  a1.channels = c1

  23. Flume Example . . .
  # Describe/configure the source
  a1.sources.r1.type = netcat
  a1.sources.r1.bind = localhost
  a1.sources.r1.port = 44444
  # Describe the sink
  a1.sinks.k1.type = logger
  # Use a channel which buffers events in memory
  a1.channels.c1.type = memory
  a1.channels.c1.capacity = 1000
  a1.channels.c1.transactionCapacity = 100
  # Bind the source and sink to the channel
  a1.sources.r1.channels = c1
  a1.sinks.k1.channel = c1
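To try the example, the agent can be started with the flume-ng launcher and fed data over the netcat port; the --conf directory is an assumption about the local install layout:
$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 \
    -Dflume.root.logger=INFO,console

# in another terminal, send an event to the netcat source
$ telnet localhost 44444
Hello Flume!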

  24. Built-In Flume Sources • Avro • Thrift • Exec • JMS • Spooling Directory • NetCat • Sequence Generator • Syslog • HTTP

  25. Built-In Flume Sinks • HDFS • Logger • Avro • Thrift • IRC • File Roll • Null • HBase • Solr • Elastic Search

  26. Built-In Flume Channels • Memory • JDBC • File • Pseudo Transaction

  27. Flume Interceptors • Attach functions to sources for some type of transformation • Convert event to a new format • Add a timestamp • Change your car's oil
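For instance, the built-in timestamp interceptor can be attached to the netcat source from the earlier example with two configuration lines (names reuse that example):
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp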

  28. Code Examples • Let's take a look at some custom Flume code! • Twitter Source • PostgreSQL Sink
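The Twitter source and PostgreSQL sink themselves are not included in this transcript. As a rough, hedged sketch of what a custom sink looks like against Flume's developer API (the class name and the console "destination" are placeholders; a real PostgreSQL sink would open a JDBC connection in configure()/start() and write rows inside process()):
import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;

// Hypothetical custom sink: takes events from its channel and prints them.
public class ConsoleSink extends AbstractSink implements Configurable {

  @Override
  public void configure(Context context) {
    // read sink-specific properties (e.g. a JDBC URL) from the agent config here
  }

  @Override
  public Status process() throws EventDeliveryException {
    Channel channel = getChannel();
    Transaction txn = channel.getTransaction();
    txn.begin();
    try {
      Event event = channel.take();
      if (event == null) {           // channel is empty right now
        txn.commit();
        return Status.BACKOFF;
      }
      // deliver the event body to the external system
      System.out.println(new String(event.getBody()));
      txn.commit();
      return Status.READY;
    } catch (Throwable t) {
      txn.rollback();
      return Status.BACKOFF;         // signal the runner to retry after a delay
    } finally {
      txn.close();
    }
  }
}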

  29. Resources • http://sqoop.apache.org • http://flume.apache.org
