
Big Data and Hadoop On Windows


Presentation Transcript


  1. Big Data and Hadoop On Windows – .Net SIG Cleveland (Image credit: morguefile.com/creative/imelenchon)

  2. About Me • Serkan Ayvaz, Sr. Systems Analyst, Cleveland Clinic; PhD Candidate, Computer Science, Kent State Univ. • LinkedIn: serkanayvaz@gmail.com • email: ayvazs@ccf.org • Twitter: @sayvaz

  3. Agenda • Introduction to Big Data • Hadoop Framework • Hadoop On Windows • Ecosystem • Conclusions

  4. What is Big Data? (“Hype?”) • “Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.” – Wikipedia

  5. What is new? • Enterprise data grows rapidly • Emerging market for vendors • New data sources • Competitive industries – a need for more insights • Asking different questions • Generating models instead of transforming data into models

  6. What is the problem? • Size of data: rapid growth; TBs to PBs are the norm for many organizations. As of 2012, data sets that were feasible to process in a reasonable amount of time were on the order of exabytes. • Variety of data: relational, device-generated data, mobile, logs, web data, sensor networks, social networks, etc. – structured, unstructured, and semi-structured • Rate of data growth: as of 2012, every day 2.5 quintillion (2.5×10^18) bytes of data were created – Wikipedia • Particularly large datasets: meteorology, genomics, complex physics simulations, biological and environmental research, Internet search, finance and business informatics

  7. Critique • Even as companies invest eight- and nine-figure sums to derive insight from information streaming in from suppliers and customers, less than 40% of employees have sufficiently mature processes and skills to do so. To overcome this insight deficit, “big data”, no matter how comprehensive or well analyzed, needs to be complemented by “big judgment”, according to an article in the Harvard Business Review. • Consumer privacy concerns raised by the increasing storage and integration of personal information

  8. Things to consider • Return on investment may differ • Asking the wrong questions won’t get you the right answers • Experts need to fit into the organization • Requires a leadership decision • You might be fine with traditional systems (for now)

  9. What is Hadoop? Hadoop Core: HDFS (storage) and MapReduce (processing) • Scalability • Scales horizontally; vertical scaling has limits • Scales seamlessly • Moves processing to the data, as opposed to traditional methods • Network bandwidth is a limited resource • Processes data sequentially in chunks, avoiding random access • Seeks are expensive, disk throughput is reasonable • Fault tolerance • Data replication • Economical • Commodity servers (“not low-end”) vs. specialized servers • Ecosystem • Integration with other tools • Open source • Innovative, extensible

  10. What can I do with Hadoop? • Distributed programming (MapReduce) • Storage, archiving legacy data • Transforming data • Analysis, ad hoc reporting • Looking for patterns • Monitoring / processing logs • Abnormality detection • Machine learning and advanced algorithms • Many more

  11. HDFS Blocks • Large enough to minimize the cost of seeks – 64 MB by default • The unit of abstraction makes storage management simpler than whole files • Fits well with the replication strategy and availability NameNode • Maintains the filesystem tree and metadata for all the files and directories • Stores the namespace image and edit log DataNode • Stores and retrieves blocks • Reports the blocks back to the NameNode periodically
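To make the block arithmetic concrete, here is a minimal C# sketch (illustrative only, not HDFS API code), assuming the 64 MB default block size above and the common HDFS default replication factor of 3:

using System;

class BlockMath
{
    const long BlockSize = 64L * 1024 * 1024; // 64 MB default block size
    const int ReplicationFactor = 3;          // common HDFS default (assumption)

    static void Main()
    {
        long fileSize = 1024L * 1024 * 1024;  // a hypothetical 1 GB file

        // Files are stored as fixed-size blocks; the last block may be partial.
        long blocks = (fileSize + BlockSize - 1) / BlockSize;

        // Every block is replicated across DataNodes for fault tolerance,
        // so the raw storage consumed is a multiple of the file size.
        long rawStorageMb = fileSize * ReplicationFactor / (1024 * 1024);

        Console.WriteLine("{0} blocks, {1} MB raw storage", blocks, rawStorageMb);
        // Prints: 16 blocks, 3072 MB raw storage
    }
}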

  12. HDFS Good for: • Large files – designed for them and shines with them • Fault tolerance – data replication within and across racks • Breaking data into smaller blocks • Data locality • The write-once, read-many-times access pattern, which it handles most efficiently. Not so good for: • Low-latency data access – HDFS is optimized for high-throughput data delivery, possibly at the expense of latency; consider HBase for low latency • Lots of small files – the NameNode holds filesystem metadata in memory, which limits the number of files a filesystem can hold • Multiple writers and arbitrary file modifications – files in HDFS may be written to by only a single writer

  13. Data Flow – HDFS read and write paths (diagrams from Hadoop: The Definitive Guide)

  14. MapReduce Programming • Splits input files into blocks • Operates on key-value pairs • Mappers filter and transform the input data • Reducers aggregate the mappers’ output • Handles processing efficiently in parallel • Moves code to the data – data locality • The same code runs on all machines • Some algorithms can be difficult to implement • Can be implemented in almost any language • Streaming MapReduce for Python, Ruby, Perl, PHP, etc. • Pig Latin as a data flow language • Hive for SQL users

  15. MapReduce • Programmers write two functions: map (k, v) → <k’, v’>* and reduce (k’, v’) → <k’, v’>* • All values with the same key are reduced together • For efficiency, programmers typically also write: partition (k’, number of partitions) → partition for k’ – often a simple hash of the key, e.g., hash(k’) mod n, which divides up the key space for parallel reduce operations – and combine (k’, v’) → <k’, v’>* – mini-reducers that run in memory after the map phase, used as an optimization to reduce network traffic • The framework takes care of the rest of the execution
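As a concrete illustration of the partition step, here is a minimal C# sketch of the simple hash partitioner described above (the class and method names are illustrative, not a real Hadoop API): it implements hash(k’) mod n, so every occurrence of a key lands in the same reduce partition.

class Partitioner
{
    // Assigns a key to one of numPartitions reduce partitions: hash(k') mod n.
    // Because the choice depends only on the key, all values sharing a key
    // land in the same partition and are reduced together.
    public static int PartitionFor(string key, int numPartitions)
    {
        // Mask off the sign bit so the modulo result is non-negative.
        return (key.GetHashCode() & int.MaxValue) % numPartitions;
    }
}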

  16. Simple example – Word Count

// Map and Reduce functions in JavaScript
// -------------------------------------------------------------
var map = function (key, value, context) {
    // Split the line on non-letter characters and emit (word, 1) per word.
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    // Sum all the counts emitted for this word.
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next());
    }
    context.write(key, sum);
};

  17. Divide and Conquer (diagram) – the input key-value pairs are processed by parallel map tasks and locally aggregated by combiners; partitioners split the keys across reducers; the shuffle-and-sort phase aggregates values by key; and the reduce tasks merge them into the final output

  18. How MapReduce Works?

Map(String docid, String text):
    for each word w in text:
        Emit(w, 1)

Reduce(String term, Iterator<Int> values):
    int sum = 0
    for each v in values:
        sum += v
    Emit(term, sum)

Source: Hadoop: The Definitive Guide

  19. How is it different from other systems? • Parallel – Message Passing Interface (MPI): suited to compute-intensive jobs, but has issues with larger data volumes – network bandwidth becomes the bottleneck and compute nodes sit idle • Hard to implement: the challenges of coordinating the processes in a large-scale distributed computation, handling partial failure, and managing checkpointing and recovery

  20. Comparing MapReduce to RDBMSs

  21. MapReduce • MapReduce is complementary to RDBMSs, not competing with them • MapReduce is a good fit for analyzing a whole dataset in batch • An RDBMS is good for point queries or updates – indexed to deliver low-latency retrieval of relatively small amounts of data • MapReduce suits applications where the data is written once and read many times • An RDBMS is good for datasets that are continually updated

  22. Hadoop on Windows Overview • HDInsight: HDInsight Server and HDInsight on the cloud – familiar tools and functionality • Hortonworks Data Platform: the Windows platform, 100% open source, contributions to the community • Apache Hadoop Core: the common framework of the open source community, shared by all distributions

  23. Hadoop on Windows • Standard Hadoop modules: HDFS, MapReduce, Pig, Hive • Monitoring pages • Easy installation and configuration • Integration with the Microsoft ecosystem: Active Directory, System Center, etc.

  24. Why is Hadoop on Windows important? • Windows Server’s large market share • Large developer and user community • Existing enterprise tools • Familiarity • Simplicity of use and management • Deployment options on both Windows Server and Windows Azure

  25. Architecture overview (diagram): user-facing self-service tools (data viewers, BI, visualization) sit on top of Hadoop [server and cloud], which is programmed via Java, Streaming, HiveQL, Pig Latin, .NET, and other languages; HDFS holds unstructured, semi-structured, and structured data alongside NoSQL and SQL stores, fed by external data (web, mobile devices, social media), legacy data, and RDBMSs

  26. Run Jobs • Submit a JAR file (Java MapReduce) • HiveQL • Pig Latin • .NET wrapper through Streaming • .NET MapReduce • LINQ to Hive • JavaScript Console • Excel Hive Add-In

  27. .NET MapReduce Example – NuGet packages:

install-package Microsoft.Hadoop.MapReduce
install-package Microsoft.Hadoop.Hive
install-package Microsoft.Hadoop.WebClient

• Reference Microsoft.Hadoop.MapReduce.dll • Create a class that implements HadoopJob<YourMapper> • Create a class called FirstMapper that implements MapperBase • Run the DLL using the MRRunner utility: > MRRunner -dll MyDll -class MyClass -- extraArg1 extraArg2 • Or invoke it from an exe: var hadoop = Hadoop.Connect(); hadoop.MapReduceJob.ExecuteJob<JobType>(arguments);

  28. .NET MapReduce Example

public class FirstJob : HadoopJob<SqrtMapper>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        HadoopJobConfiguration config = new HadoopJobConfiguration();
        config.InputPath = "input/SqrtJob";
        config.OutputFolder = "output/SqrtJob";
        return config;
    }
}

public class SqrtMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        int inputValue = int.Parse(inputLine);

        // Perform the work.
        double sqrt = Math.Sqrt((double)inputValue);

        // Write output data.
        context.EmitKeyValue(inputValue.ToString(), sqrt.ToString());
    }
}
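A minimal usage sketch for submitting FirstJob programmatically, following the Hadoop.Connect() pattern from the previous slide (the exact ExecuteJob overloads in the SDK may differ; the argument pass-through here is an assumption):

using Microsoft.Hadoop.MapReduce;

class Program
{
    static void Main(string[] args)
    {
        // Connect to the cluster and submit the job defined above;
        // extra command-line arguments are passed through to the job.
        var hadoop = Hadoop.Connect();
        hadoop.MapReduceJob.ExecuteJob<FirstJob>(args);
    }
}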

  29. Hadoop Ecosystem • Hadoop – Common, MapReduce, HDFS • HBase – column-oriented distributed database • Hive – distributed data warehouse with an SQL-like query platform • Pig – data transformation language • Sqoop – tool for bulk import/export between HDFS, HBase, Hive, and relational databases • Mahout – data mining algorithms • ZooKeeper – distributed coordination service • Oozie – workflow service for running and scheduling jobs

  30. What’s HBase? • A column-oriented distributed DB • Inspired by Google BigTable • Uses HDFS • Interactive processing • Can be used with or without MapReduce • PUT, GET, SCAN commands

  31. What’s Hive? • Translates HiveQL, an SQL-like language, into MapReduce • A distributed data warehouse • Stores tables as files in HDFS • Integrates with BI products on tabular data via Hive ODBC and JDBC drivers

  32. Hive Good for: • HiveQL – a familiar, high-level language • Batch jobs and ad hoc queries • Self-service BI tools via ODBC, JDBC • A schema, though not as strict as traditional RDBMSs • User-defined functions (UDFs) • Easy access to Hadoop data. Not so good for: • Updates or deletes – insert only • Indexing – limited indexes, a basic built-in optimizer, no caching • OLTP • Speed – not as fast as hand-written MapReduce

  33. Conclusion • Hadoop is great for its purposes and is here to stay, BUT it is not a common cure for every problem • Developing standards and best practices is very important – users may abuse the resources and scalability • Integration with the Windows platform: existing systems, tools, and expertise • Parallelization: easier to scale as needed • Economical: commodity hardware • Relatively short training and application development time with Windows

  34. Resources & References • Hadoop: The Definitive Guide by Tom White – http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/0596521979 • Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer, Morgan & Claypool Publishers, 2010 • Apache Hadoop – http://hadoop.apache.org/ • Microsoft Big Data page – http://www.microsoft.com/en-us/sqlserver/solutions-technologies/business-intelligence/big-data.aspx • Hortonworks Data Platform – http://hortonworks.com/products/hortonworksdataplatform/ • Hadoop SDK – http://hadoopsdk.codeplex.com/

  35. Thank you! Any Questions?
