Big Data Frameworks

Big Data infrastructure and technology are now implemented across industries such as banking, retail, insurance, healthcare and media. Big Data management functions such as storage, sorting, processing and analysis of such colossal volumes cannot be handled by existing database systems or technologies. Frameworks come into the picture in such scenarios. Frameworks are toolsets that offer innovative, cost-effective solutions to the problems posed by Big Data processing; they help provide insights, incorporate metadata and aid decision making aligned to business needs.

Presentation Transcript


  1. BIG DATA FRAMEWORKS Presented by Cuelogic Technologies

  2. Introduction. There are 3 V's that are vital for classifying data as Big Data: Volume, Velocity and Variety. Volume: data volumes are measured in terabytes, petabytes and so on. Velocity: this has to do with the high speed of data movement, like real-time data streaming at a rapid rate in microseconds. Variety: this involves the handling approach for both structured and unstructured data.

  3. Think about it: Implementation of Big Data infrastructure and technology can be seen in various industries like banking, retail, insurance, healthcare, media, etc. Big Data management functions like storage, sorting, processing and analysis for such colossal volumes cannot be handled by the existing database systems or technologies.

  4. There are many frameworks presently existing in this space. Some of the popular ones are Spark, Hadoop, Hive and Storm. Some, like Presto, score high on the utility index, while frameworks like Flink have great potential. There are still others which need some mention, like Samza, Impala, Apache Pig, etc. Some of these frameworks are briefly discussed below.

  5. Apache Hadoop. Hadoop is a Java-based platform created by Mike Cafarella and Doug Cutting. This open-source framework provides batch data processing as well as data storage services across a group of hardware machines arranged in clusters. Hadoop consists of multiple layers like HDFS and YARN that work together to carry out data processing.

  6. HDFS (Hadoop Distributed File System) is the storage layer that coordinates data replication and storage across the nodes of a cluster. In the event of a cluster node failure, the data can still be made available for processing. YARN (Yet Another Resource Negotiator) is the layer responsible for resource management and job scheduling. MapReduce is the software layer that functions as the batch processing engine.
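
The batch processing model that MapReduce provides can be sketched in plain Python. This is only a single-process illustration of the map, shuffle and reduce phases, not Hadoop's API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and chosen for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key before reducing."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order over an in-memory list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data frameworks", "big data processing"]))
```

In a real cluster the map and reduce calls would run on different nodes and the grouped data would move over the network; here everything stays in one process so the flow of data is easy to follow.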

  7. Pros: cost-effective solution, high throughput, multi-language support, compatibility with most emerging technologies in Big Data services, high scalability, fault tolerance, better suited for R&D, high availability through an excellent failure handling mechanism. Cons: vulnerability to security breaches, does not perform in-memory computation and hence suffers processing overheads, not suited for stream processing and real-time processing, issues in processing small files in large numbers.

  8. Apache Spark. It is a batch processing framework with enhanced data stream processing. With full in-memory computation and processing optimisation, it promises a lightning-fast cluster computing system.
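
Why in-memory computation saves work can be shown with a toy example. The sketch below is not Spark's API: `ToyDataset`, `read_from_disk` and the `reads` counter are hypothetical names, and the "disk" is just a Python list. It only illustrates the idea that transformations are recorded lazily and that a cached result is reused instead of being re-read from storage.

```python
class ToyDataset:
    """Hypothetical stand-in for a distributed dataset; not Spark's API."""

    def __init__(self, load, steps=()):
        self._load = load              # function that reads the source data
        self._steps = tuple(steps)     # recorded transformations, applied lazily
        self._keep_in_memory = False
        self._cached = None

    def map(self, fn):
        return ToyDataset(self._load, self._steps + (("map", fn),))

    def filter(self, fn):
        return ToyDataset(self._load, self._steps + (("filter", fn),))

    def cache(self):
        """Ask for the computed result to be kept in memory after first use."""
        self._keep_in_memory = True
        return self

    def collect(self):
        """Run the recorded steps, or return the in-memory copy if present."""
        if self._cached is not None:
            return list(self._cached)
        rows = self._load()
        for kind, fn in self._steps:
            if kind == "map":
                rows = [fn(row) for row in rows]
            else:
                rows = [row for row in rows if fn(row)]
        if self._keep_in_memory:
            self._cached = rows
        return list(rows)


reads = {"count": 0}  # counts how often the source is read


def read_from_disk():
    reads["count"] += 1
    return [1, 2, 3, 4, 5, 6]


squares_of_evens = (
    ToyDataset(read_from_disk)
    .filter(lambda x: x % 2 == 0)
    .map(lambda x: x * x)
    .cache()
)
first = squares_of_evens.collect()
second = squares_of_evens.collect()  # served from memory, no second read
```

Without the `cache()` call, every `collect()` would read the source again, which is the kind of repeated disk I/O that in-memory processing is meant to avoid.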

  9. The Spark framework is composed of five layers. HDFS and HBase: they form the first layer, the data storage systems. YARN and Mesos: they form the resource management layer. Core engine: this forms the third layer. Library: this forms the fourth layer, containing Spark SQL for SQL queries during stream processing, the GraphX and SparkR utilities for processing graph data, and MLlib for machine learning algorithms. The fifth layer contains an application programming interface for languages such as Java or Scala.

  10. Pros: scalability, lightning processing speeds through a reduced number of I/O operations to disk, fault tolerance, supports advanced analytics applications with superior AI implementation, and seamless integration with Hadoop. Cons: complexity of setup and implementation, language support limitation, not a genuine streaming engine.

  11. Storm. It is a platform-independent application development framework that can be used with any programming language and guarantees delivery of data with the least latency. In the Storm architecture, there are 2 kinds of nodes: the Master Node and the Worker/Supervisor Node. The master node monitors machines for failures and is responsible for task allocation. In case of a failure, the task is reassigned to another node.
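
The master node's two duties, task allocation and reassignment after a failure, can be sketched as follows. This is a simplified illustration and not Storm's scheduler: the round-robin allocation, the least-loaded reassignment rule, and the names `assign_tasks` and `handle_failure` are all assumptions made for the example.

```python
def assign_tasks(tasks, workers):
    """Initial allocation: spread tasks over worker nodes round-robin."""
    assignment = {worker: [] for worker in workers}
    for index, task in enumerate(tasks):
        assignment[workers[index % len(workers)]].append(task)
    return assignment


def handle_failure(assignment, failed_worker):
    """Move the failed worker's tasks to the least-loaded healthy workers."""
    orphaned = assignment.pop(failed_worker)
    if not assignment:
        raise RuntimeError("no healthy workers left")
    for task in orphaned:
        target = min(assignment, key=lambda worker: len(assignment[worker]))
        assignment[target].append(task)
    return assignment


assignment = assign_tasks(["t1", "t2", "t3", "t4", "t5"], ["w1", "w2", "w3"])
assignment = handle_failure(assignment, "w2")  # w2's tasks move to w1 and w3
```

After the simulated failure of `w2`, every task is still assigned to exactly one healthy worker, which is the property the master node is responsible for maintaining.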

  12. Pros: ease in setup and operation, high scalability, good speed, fault tolerance, support for a wide range of languages. Cons: complex implementation, debugging issues and not very learner-friendly.

  13. Apache Flink. Apache Flink, an open-source framework, is equally good for both batch and stream data processing. It is suited for cluster environments. It is based on the concept of transformations applied to streams. It is also described as the 4G of Big Data, and is claimed to be up to 100 times faster than Hadoop MapReduce.
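
The difference between stream and batch processing can be illustrated with a small plain-Python sketch. It does not use Flink; `running_average` and `batch_average` are hypothetical helpers. The streaming version applies a transformation to each entry as it arrives and emits an updated result every time, while the batch version needs the complete dataset first.

```python
def running_average(stream):
    """Stream style: consume one entry at a time, emit an updated average."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count


def batch_average(values):
    """Batch style: wait for the whole dataset, then compute once."""
    values = list(values)
    return sum(values) / len(values)


readings = [10, 20, 30, 40]
streamed = list(running_average(iter(readings)))  # one result per entry
final = batch_average(readings)                   # one result in total
```

The last value emitted by the streaming version equals the batch result, which is one way to see why a single engine can serve both styles: a batch is a stream that happens to end.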

  14. The Flink system contains multiple layers: Deploy Layer, Runtime Layer, Library Layer.

  15. Pros: low latency, high throughput, fault tolerance, entry-by-entry processing, ease of batch and stream data processing, compatibility with Hadoop. Cons: a few scalability issues.

  16. Hive. Apache Hive, designed by Facebook, is an ETL (Extract/Transform/Load) and data warehousing system. It is built on top of the Hadoop HDFS platform.

  17. The key components of the Hive architecture include Hive Clients, Hive Services, and Hive Storage and Computing. The Hive engine converts SQL queries or requests to MapReduce task chains. The engine comprises: Parser: it goes through the incoming SQL requests and sorts them. Optimizer: it goes through the sorted requests and optimises them. Executor: it sends tasks to the MapReduce framework.
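
The parser, optimizer and executor stages described above can be sketched as a toy pipeline. This is not Hive's implementation: it accepts only one restricted query shape, the "optimisation" is simply leaving out an unneeded filter stage, and the functions `parse`, `optimize` and `execute` and the in-memory `tables` dictionary are hypothetical.

```python
import re
from collections import Counter

QUERY_PATTERN = (
    r"SELECT\s+(\w+)\s*,\s*COUNT\(\*\)\s+FROM\s+(\w+)"
    r"(?:\s+WHERE\s+(\w+)\s*=\s*'([^']*)')?"
    r"\s+GROUP\s+BY\s+(\w+)"
)


def parse(query):
    """Parser: turn the restricted SQL text into a plan dictionary."""
    match = re.fullmatch(QUERY_PATTERN, query.strip(), flags=re.IGNORECASE)
    if match is None:
        raise ValueError("unsupported query")
    column, table, where_col, where_val, group_col = match.groups()
    if column != group_col:
        raise ValueError("selected column must match GROUP BY column")
    where = (where_col, where_val) if where_col else None
    return {"table": table, "group_by": group_col, "where": where}


def optimize(plan):
    """Optimizer: build the task chain, skipping the filter if unused."""
    stages = ["map", "reduce"]
    if plan["where"] is not None:
        stages.insert(0, "filter")
    return {**plan, "stages": stages}


def execute(plan, tables):
    """Executor: run each stage of the chain over an in-memory table."""
    rows = tables[plan["table"]]
    for stage in plan["stages"]:
        if stage == "filter":
            column, value = plan["where"]
            rows = [row for row in rows if row[column] == value]
        elif stage == "map":
            rows = [(row[plan["group_by"]], 1) for row in rows]
        elif stage == "reduce":
            counts = Counter()
            for key, value in rows:
                counts[key] += value
            rows = dict(counts)
    return rows


tables = {"orders": [
    {"city": "Pune", "status": "paid"},
    {"city": "Pune", "status": "open"},
    {"city": "NYC", "status": "paid"},
]}
query = "SELECT city, COUNT(*) FROM orders WHERE status = 'paid' GROUP BY city"
result = execute(optimize(parse(query)), tables)
```

The point of the sketch is the shape of the pipeline: SQL text goes in, a chain of map and reduce style tasks comes out, and only the last stage touches the data.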

  18. Pros: low latency, high throughput, fault tolerance, entry-by-entry processing, ease of batch and stream data processing, compatibility with Hadoop. Cons: a few scalability issues.

  19. Presto. Presto is an open-source distributed SQL tool most suited for smaller datasets of up to 3 TB. The Presto engine includes a coordinator and multiple workers. When a client submits queries, the coordinator parses and analyses them, plans their execution and distributes them for processing among the workers.
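
The coordinator and worker split can be sketched with a thread pool standing in for the worker processes. This is an illustration of the pattern only, not Presto's engine; `plan_splits`, `worker_partial_sum` and `coordinator_run` are hypothetical names, and the "query" is a simple sum.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_splits(rows, n_workers):
    """Coordinator step: divide the input into one split per worker."""
    return [rows[i::n_workers] for i in range(n_workers)]


def worker_partial_sum(split):
    """Worker step: compute a partial aggregate over its own split."""
    return sum(split)


def coordinator_run(rows, n_workers=3):
    """Plan the splits, hand them to workers, then merge the partial results."""
    splits = plan_splits(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker_partial_sum, splits))
    return sum(partials)


total = coordinator_run(list(range(1, 101)))
```

Each worker sees only its own split, and the coordinator only combines small partial results, so adding workers divides the scanning work without changing the answer.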

  20. Pros: least query degradation even in the event of increased concurrent query workload. It has a query execution rate that is three times faster than Hive. Ease in adding images and embedding links. Highly user-friendly. Cons: reliability issues.

  21. Impala. Impala is an open-source MPP (Massively Parallel Processing) query engine that runs on multiple systems under a Hadoop cluster. It has been written in C++ and Java.

  22. It is not coupled with its storage engine. It includes 3 main components: Impala Daemon (impalad), which is executed on every node where Impala is installed; Impala StateStore; Impala MetaStore. Impala has its own SQL-like query language.

  23. Pros: supports in-memory computation and hence accesses data directly from Hadoop nodes without data movement, smooth integration with BI tools like Tableau, ZoomData, etc., supports a wide range of file formats. Cons: no support for serialisation and deserialisation of data, inability to read custom binary files, table refresh needed for every record addition.

  24. Contact Us: +1 347 3748437, info@cuelogic.com, https://www.cuelogic.com/, Unit 610, 134 W 29th St, New York, NY 10001. Content Source: Cuelogic Blog
