
Let's Break It Up: Using Informix with Hadoop


Presentation Transcript


  1. Let's Break It Up: Using Informix with Hadoop
     Pradeep Natarajan, IBM Corp.
     Session: C13
     Tue 5/17/2011 04:40p

  2. Agenda
     • Hadoop – What?
     • History
     • Hadoop – Why?
     • Hadoop Distributed File System
     • MapReduce algorithm
     • Hadoop – How?
     • Informix and Apache Hadoop

  3. Hadoop Overview
     • What is Hadoop?
       • Framework for very large scale data processing
       • Open source Apache project
       • Written in Java
       • Runs on Linux, Mac OS X, Windows, and Solaris
     • Hadoop core
       • Distributed file system
       • API & implementation of MapReduce
       • Web-based interface to monitor the cluster's health

  4. Hadoop Timeline
     • 2003 – Google's GFS paper
     • 2004 – Google's MapReduce paper
     • 2005 – Nutch using MapReduce
     • 2006 – Hadoop moves out of Nutch
     • 2007 – Yahoo! running a 1000-node Hadoop cluster
     • 2008 – Hadoop becomes a top-level Apache project
     • 2010 – IBM introduces a portfolio of solutions & services for Big Data: IBM InfoSphere BigInsights

  5. Hadoop Overview
     • Why Hadoop?
       • Large volumes of data
         • 100s of terabytes or petabytes of data
       • Need to scale out (i.e., lots of nodes)
       • Distributed file system
         • Uses cheap commodity hardware
         • In large clusters, nodes will fail
         • Automatic failover
         • Fault tolerance through data replication
       • Common infrastructure across all nodes

  6. Hadoop Cluster
     (diagram; source: Apache Hadoop)
     • Typically a 2-level architecture
       • Nodes are commodity Linux PCs
       • 40 nodes/rack
       • Uplink from each rack is 8 gigabit
       • Rack-internal bandwidth is 1 gigabit

  7. Hadoop Overview
     • When should you use Hadoop?
       • Processing lots of unstructured data
         • Ex.: web search, image analysis, searching log files
       • Parallelization is possible
       • Running batch jobs is acceptable
       • Access to cheap hardware (public cloud is acceptable?)

  8. Hadoop Overview
     • When NOT to use Hadoop?
       • Processor-intensive operations with little data
         • Ex.: calculating the 1,000,000th digit of π
       • Job is not easily parallelizable
       • Data is not self-contained
       • Need interactive processing or state-aware computation

  9. Hadoop Overview
     • Hadoop is NOT …
       • a replacement for an RDBMS
       • suitable for indexed/structured data
       • a substitute for ALL your data warehouses
       • a substitute for a high-availability SAN-hosted file system
       • a POSIX file system

  10. Powered By Hadoop

  11. Hadoop Distributed File System (HDFS)
      • Petabyte-scale file system for the cluster
      • Single namenode for the cluster
      • Files are append-only (no seek() for writes)
      • Optimized for streaming reads of large files
      • Data is split into large blocks
        • Block size = 128 MB (as opposed to 4 KB in Unix)
      • Blocks are replicated to multiple datanodes

  12. HDFS
      (architecture diagram; source: Apache Hadoop)

  13. HDFS
      • Client
        • Intelligent
        • Talks to the namenode to find the locations of blocks
        • Accesses data directly from the nearest datanode replicas
        • Can only append to existing files
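To illustrate the client's view, here is a minimal sketch using the HDFS Java API (org.apache.hadoop.fs.FileSystem). The namenode URI and file path are placeholders, not values from this presentation; in practice the configuration would come from core-site.xml.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder namenode URI; normally picked up from core-site.xml.
            conf.set("fs.default.name", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);

            // The client asks the namenode for block locations, then streams
            // the data directly from the nearest datanode replicas.
            Path file = new Path("/data/input.txt");  // hypothetical path
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file)));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
            in.close();
            fs.close();
        }
    }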

  14. HDFS
      • Name Node
        • Single namenode per cluster
        • Manages the file system namespace and metadata
        • Maps a file name to a set of blocks
        • Maps a block to the datanodes holding its replicas
      • Data Node
        • Lots of them (1000s)
        • Manages data blocks and sends them to clients
        • Data is replicated; failure is expected

  15. HDFS – File Write
      (diagram; source: Isabel Drost, FOSDEM 2010)

  16. HDFS – File Read
      (diagram; source: Isabel Drost, FOSDEM 2010)

  17. MapReduce Programming Model
      • Targets data-intensive computations
      • Input data format – specified by the user
      • Output – <key, value> pairs
      • Map & Reduce – user-specified algorithms
      • Data flow: Input → Map → intermediate <k, v> → Reduce → Output
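To make the model concrete, here is a minimal sketch of the canonical word-count example against Hadoop's org.apache.hadoop.mapreduce API (an illustration, not code from this presentation): map emits a <word, 1> pair for every token, and reduce sums the counts for each word.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: for each input line, emit <word, 1> for every token.
    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: sum the 1s emitted for each word and output <word, total>.
    class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }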

  18. MapReduce Programming Model
      (diagram; source: Owen O'Malley, Yahoo!)

  19. Hadoop MapReduce
      • Job tracker
        • One per cluster
        • Receives job requests from clients
        • Schedules and monitors MR jobs on the task trackers
      • Task tracker
        • Lots of them
        • Execute MR operations
        • Read blocks from datanodes

  20. Hadoop MapReduce
      (diagram; source: Isabel Drost, FOSDEM 2010)

  21. Hadoop MapReduce
      (diagram; source: Isabel Drost, FOSDEM 2010)

  22. Using Apache Hadoop
      • Requirements: Linux, Java 1.6, sshd, rsync
      • Configure SSH
      • Unpack Hadoop
      • Edit a few configuration files
      • Format the DFS on the namenode
      • Start all the daemon processes

  23. Using Apache Hadoop
      • Steps for running a Hadoop job (see the driver sketch after this list):
        • Compile your job into a JAR file
        • Copy the input data into HDFS
        • Execute bin/hadoop jar with the relevant arguments
        • Monitor tasks via the Web interface (optional)
        • Examine the output when the job is complete
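As referenced above, a minimal driver sketch that wires the WordCountMapper and WordCountReducer from the earlier sketch to input/output paths passed on the command line (the class and JAR names are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Driver: configures and submits the job; this class is packaged into
    // the JAR that is passed to bin/hadoop jar.
    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Compiled into, say, wordcount.jar, it would be launched with: bin/hadoop jar wordcount.jar WordCountDriver <input dir> <output dir>.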

  24. Informix and Hadoop
      • Sqoop (SQL-to-Hadoop)
        • Command-line tool
        • Connects Hadoop and traditional database systems
        • Imports tables/databases from a DBMS into HDFS
        • Generates Java classes to interact with the imported data
        • Exports MR results back to a database

  25. Informix and Hadoop
      (diagram: Hadoop ↔ Sqoop ↔ Informix database)
      • Sqoop uses a JDBC connection
        • IBM Data Server driver for JDBC (DRDA protocol):
          sqoop --connect jdbc:ids://myhost.ibm.com:9198/stores_demo \
                --table CUSTOMER --as-sequencefile
        • Informix JDBC driver (SQLI protocol):
          sqoop --connect jdbc:informix-sqli://myhost.ibm.com:9198/stores_demo:INFORMIXSERVER=ol_1170 \
                --table CUSTOMER --as-sequencefile
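A hedged sketch of consuming the imported records: Sqoop names the generated class after the table (here CUSTOMER) and, by convention, gives it one get_<column>() accessor per column, so the get_state() call below assumes the stores_demo customer table's state column; with --as-sequencefile the import should be readable as <LongWritable, CUSTOMER> pairs. All of these names should be verified against the code Sqoop actually generates.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Counts imported customers per state. CUSTOMER is assumed to be the
    // class Sqoop generated for the imported table (based on Sqoop's
    // table/column naming convention, not code from this presentation).
    public class CustomerStateMapper
            extends Mapper<LongWritable, CUSTOMER, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, CUSTOMER customer, Context context)
                throws IOException, InterruptedException {
            // get_state() is assumed from the customer table's "state" column;
            // trim() drops CHAR-column padding.
            context.write(new Text(customer.get_state().trim()), ONE);
        }
    }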

  26. References
      • Apache Hadoop wiki – http://wiki.apache.org/hadoop/
      • Apache Hadoop – http://hadoop.apache.org
      • Sqoop wiki – https://github.com/cloudera/sqoop/wiki/
      • Cloudera Sqoop – http://www.cloudera.com/blog/2009/06/introducing-sqoop/

  27. Questions?!?

  28. Let's Break It Up: Using Informix with Hadoop
      Pradeep Natarajan
      IBM Corp.
      pnatara@us.ibm.com
      (913) 599-7136
