
A framework for easy development of Big Data applications




  1. A framework for easy development of Big Data applications Rubén Casado ruben.casado@treelogic.com @ruben_casado

  2. Agenda • Big Data processing • Lambdoop framework • Lambdoop ecosystem • Case studies • Conclusions

  3. About me :-)

  4. Academics: PhD in Software Engineering • MSc in Computer Science • BSc in Computer Science • Work experience

  5. About Treelogic

  6. Treelogic is an R&D-intensive company with the mission of creating, boosting, developing and adapting scientific and technological knowledge to improve quality standards in our daily life

  7. TREELOGIC – Distributor and Sales

  8. Research lines (R&D): Computer Vision • Big Data • Terahertz technology • Data science • Social Media Analysis • Semantics. Solutions: Security & Safety • Justice • Health • Transport • Financial services • ICT tailored solutions

  9. 7 ongoing FP7 projects (ICT, SEC, OCEAN), coordinating 5 of them • 3 ongoing Eurostars projects, coordinating all of them

  10. Research & Innovation • More than 300 partners in the last 3 years • More than 40 projects with a budget over 120 MEUR • 7 years' experience in R&D projects • Overall participation in 11 European projects • Project coordinator in 7 European projects

  11. www.datadopter.com

  12. Agenda • Big Data processing • Lambdoop framework • Lambdoop ecosystem • Case studies • Conclusions

  13. What is Big Data? A massive volume of both structured and unstructured data, so large that it is difficult to process with traditional database and software techniques

  14. How is Big Data? Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization - Gartner IT Glossary -

  15. 3 problems Volume Variety Velocity

  16. 3 solutions Batch processing Real-time processing NoSQL

  17. 3 solutions Batch processing Real-time processing NoSQL

  18. Batch processing • Scalable • Large amounts of static data • Distributed • Parallel • Fault-tolerant • High latency → Volume

  19. Real-time processing • Low latency • Continuous, unbounded streams of data • Distributed • Parallel • Fault-tolerant → Velocity

  20. Hybrid computation model • Low latency • Massive data + streaming data • Scalable • Combines batch and real-time results → Volume + Velocity

  21. Hybrid computation model (diagram): all data → batch processing → batch results; new data → stream → real-time processing results; both result sets are combined into the final results
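A minimal sketch of the combination step this diagram describes (illustrative Java, not Lambdoop API; the class and method names are assumptions):

import java.util.HashMap;
import java.util.Map;

// Batch results cover all historical data; real-time results cover data
// seen since the last batch run. The final answer merges both views.
public class HybridCombiner {
    private final Map<String, Long> batchView = new HashMap<>();    // from the batch layer
    private final Map<String, Long> realtimeView = new HashMap<>(); // from the stream layer

    public void updateRealtime(String key, long delta) {
        realtimeView.merge(key, delta, Long::sum);
    }

    public void replaceBatchView(Map<String, Long> fresh) {
        batchView.clear();
        batchView.putAll(fresh);
        realtimeView.clear(); // the new batch run has absorbed the old stream data
    }

    // Final result = batch value + real-time delta for the same key.
    public long query(String key) {
        return batchView.getOrDefault(key, 0L) + realtimeView.getOrDefault(key, 0L);
    }
}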

  22. Processing paradigms • 2003 Inception • 2006 1st generation: batch processing • Large amounts of static data • Scalable solutions • Volume • 2010 2nd generation: real-time processing • Computing streaming data • Low latency • Velocity • 2014 3rd generation: hybrid computation • Lambda Architecture • Volume + Velocity

  23. Processing pipeline: DATA ACQUISITION → DATA STORAGE → DATA ANALYSIS → RESULTS

  24. Agenda • Big Data processing • Lambdoop framework • Lambdoop ecosystem • Case studies • Conclusions

  25. What is Lambdoop? • Open source framework • Software abstraction layer over Open Source technologies • Hadoop, HBase, Sqoop, Flume, Kafka, Storm, Trident, Avro, Redis • Common patterns and operations (aggregation, filtering, statistics…) already implemented. No MapReduce-like process • Same single API for the three processing paradigms • Batch processing similar to Pig / Cascading • Real-time processing using built-in functions, easier than Trident • Hybrid computation model transparent for the developer

  26. Why Lambdoop? • Building a batch processing application requires • MapReduce development • Using other Hadoop-related tools (Sqoop, ZooKeeper, HCatalog…) • Storage systems (HBase, MongoDB, HDFS, Cassandra…) • Real-time processing requires • Stream computing (S4, Storm, Samza) • Unbounded input (Flume, Scribe) • Temporary data stores (in-memory, Kafka, Kestrel)

  27. Why Lambdoop? • Building a hybrid computation system (Lambda Architecture) requires • Application logic defined in two different systems using different frameworks • Data serialized consistently and kept in sync between the two systems • The developer is responsible for reading, writing and managing two data storage systems, performing the final combination and serving the updated results

  28. Why Lambdoop? “One of the most interesting areas of future work is high level abstractions that map to a batch processing component and a real-time processing component. There's no reason why you shouldn't have the conciseness of a declarative language with the robustness of the batch/real-time architecture.” Nathan Marz “Lambda Architecture is an implementation challenge. In many real-world situations a stumbling block for switching to a Lambda Architecture lies with a scalable batch processing layer. Technologies like Hadoop (…) are there but there is a shortage of people with the expertise to leverage them.” Rajat Jain

  29. Lambdoop (diagram): static data and streaming data enter as Data objects and flow through the Operations of a Workflow, which produces new Data as output

  30. Lambdoop (diagram): one framework covering Batch, Hybrid and Real-Time processing

  31. Data Input • Information is represented as Data objects • Types: • StaticData • StreamingData • Every Data object has a Schema describing its fields (types, nullability, keys…) • A Data object is composed of Datasets

  32. Data Input • Dataset • A Data object is formed by one or more Datasets • All Datasets of a Data object share the same Schema • Datasets are formed by Register objects • A Register is composed of RegisterFields (see the sketch below)
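To make the hierarchy concrete, a sketch of traversing it (the accessors getDatasets(), getRegisters(), getField() and getValue() are assumptions for illustration, not confirmed Lambdoop API):

// Illustrative traversal of Data → Dataset → Register → RegisterField;
// only the type names come from the slides, the accessors are assumed.
Data input = new StaticData(loader);
for (Dataset ds : input.getDatasets()) {        // one or more per Data object
    for (Register r : ds.getRegisters()) {      // one Register per record
        RegisterField so2 = r.getField("SO2");  // a typed field of the record
        System.out.println(so2.getValue());
    }
}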

  33. Data Input • Schema • Very similar to Avro definition schemas • Allows defining the input data's structure: fields, types, nullable fields… • JSON format
{
  "type": "csv",
  "name": "AirQuality records",
  "fieldSeparator": ";",
  "PK": "",
  "header": "true",
  "fields": [
    {"name": "Station", "type": "string", "index": 0},
    {"name": "Title", "type": "string", "index": 1, "nullable": "true"},
    {"name": "Lat.", "type": "double", "index": 2, "nullable": "true"},
    {"name": "Long.", "type": "double", "index": 3, "nullable": "true"},
    …
    {"name": "PRB", "type": "double", "index": 20, "nullable": "true"}
  ]
}

  34. Data Input • Importing data into Lambdoop • Loaders: import information from multiple sources and store it in HDFS as Data objects • Producers: get streaming data and represent it as Data objects • Heterogeneous sources • Serialize information into Avro format

  35. Data Input • Static Data example: importing an Air Quality dataset from local logs to HDFS • Loader • Schema's path is files/csv/Air_quality_schema
// Read schema from a file
String schema = readSchemaFile(schema_file);
Loader loader = new CSVLoader("AQ.avro", uri, schema);
Data input = new StaticData(loader);

  36. Data Input • Streaming Data example: reading streaming sensor data from a TCP port • Producer • Weather stations emit messages to port 8080 • Schema's path is files/csv/Air_quality_schema
int port = 8080;
// Read schema
String schema = readSchemaFile(schema_file);
Producer producer = new TCPProducer("AirQualityListener", refresh, port, schema);
// Create Data object
Data data = new StreamingData(producer);

  37. Data Input • Extensibility • Users can implement their own data loaders/producers • Extend the Loader/Producer interface • Read data from the original source • Get and serialize the information (Avro format) according to the Schema • A sketch of a custom producer follows
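Purely as an illustration of those steps, a custom producer might look like this (the Producer base contract is not shown in the deck, so the constructor and method names below are assumptions):

// Hypothetical user-defined producer; only the names Producer, Register
// and Schema come from the slides, the rest is assumed.
public class HttpPollingProducer extends Producer {
    private final String endpoint;

    public HttpPollingProducer(String name, String schema, String endpoint) {
        super(name, schema);   // assumed base-class constructor
        this.endpoint = endpoint;
    }

    // Read from the original source, then serialize to Avro per the Schema.
    @Override
    public Register next() {
        String raw = httpGet(endpoint);
        return avroSerialize(raw, getSchema()); // assumed helper methods
    }

    private String httpGet(String url) { /* fetch one record */ return ""; }
    private Register avroSerialize(String raw, Schema s) { /* build Register */ return null; }
}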

  38. Operations • Unitary actions to process data • An Operation takes Data as input, processes the Data and produces another Data as output • Types of operations: • Aggregation: produces a single value per Dataset • Filter: output data has the same schema as input data • Group: produces several Datasets, grouping registers together • Projection: changes the Data schema, but preserves the records and their values • Join: combines different Data objects (see the sketch after this list)
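For instance, a grouping step followed by an aggregation might be declared as below; Filter, Avg and the Workflow usage match the examples on slides 43–45, while the Group constructor signature is an assumption:

// Hedged sketch: group air-quality registers by station, then average
// SO2 per group. Group's constructor signature is assumed.
Workflow wf = new BatchWorkflow(input);
Group byStation = new Group(new RegisterField("Station"));
Avg avgSo2 = new Avg(new RegisterField("SO2"));
wf.addOperation(byStation);
wf.addOperation(avgSo2);
wf.run();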

  39. Operations

  40. Operations • Extensibility (User Defined Operations): new operations can be defined by implementing a set of interfaces: • OperationFactory: factory used by the framework to obtain the batch, streaming and hybrid operation implementations when needed • BatchOperation: provides the MapReduce logic to process the input Data • StreamingOperation: provides Storm/Trident-based functions to process streaming registers • HybridOperation: provides the merging logic between streaming and batch results

  41. Operations • User Defined Operation interfaces
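Slide 41 showed these interfaces as a diagram; as a rough sketch of how a user-defined operation could wire them together (every method name below is an assumption, only the interface names come from slide 40):

// Hypothetical user-defined Median operation; method names are assumed.
public class MedianFactory implements OperationFactory {
    public BatchOperation getBatchOperation()         { return new MedianBatch(); }
    public StreamingOperation getStreamingOperation() { return new MedianStreaming(); }
    public HybridOperation getHybridOperation()       { return new MedianHybrid(); }
}
class MedianBatch implements BatchOperation {
    // would provide the MapReduce logic that processes the input Data
}
class MedianStreaming implements StreamingOperation {
    // would provide Storm/Trident functions for streaming registers
}
class MedianHybrid implements HybridOperation {
    // would merge the streaming and batch medians into one updated result
}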

  42. Workflows • Sequence of connected Operations. Manages tasks and resources (check-points) in order to produce an output from input data and a set of Operations • BatchWorkflow: runs a set of operations on a StaticData input and produces a new StaticData as output • StreamingWorkflow: operates on a StreamingData to produce another StreamingData • HybridWorkflow: combines static and streaming data to produce complete and updated results (StreamingData) • Workflow connections (diagram): the Data output of one Workflow can be the input of another, so Workflows can be chained and combined; a chaining sketch follows
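A chained pipeline might look like the sketch below; it reuses the API shown on the next slides and assumes that the Data returned by getResults() is valid input for another Workflow:

// Hedged sketch of workflow chaining; assumes getResults() output can
// feed a second workflow, as the connections diagram suggests.
Workflow first = new BatchWorkflow(input);
first.addOperation(new Filter(new RegisterField("Station"),
        ConditionType.EQUAL, new StaticValue("street 45")));
first.run();
Data intermediate = first.getResults();

Workflow second = new BatchWorkflow(intermediate);
second.addOperation(new Avg(new RegisterField("SO2")));
second.run();
Data output = second.getResults();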

  43. Workflows
// Batch processing example
String schema = readSchemaFile(schema_file);
Loader loader = new CSVLoader("AQ.avro", uri, schema);
Data input = new StaticData(loader);
Workflow wf = new BatchWorkflow(input);
// Add a filter operation
Filter filter = new Filter(new RegisterField("Title"), ConditionType.EQUAL, new StaticValue("street 45"));
// Calculate SO2 average on filtered input data
Avg avg = new Avg(new RegisterField("SO2"));
wf.addOperation(filter);
wf.addOperation(avg);
// Run the workflow
wf.run();
// Get the results
Data output = wf.getResults();

  44. Workflows
// Real-time processing example
Producer producer = new TCPPortProducer("QAtest", schema, config);
Data input = new StreamingData(producer);
Workflow wf = new StreamingWorkflow(input);
// Add a filter operation
Filter filter = new Filter(new RegisterField("Title"), ConditionType.EQUAL, new StaticValue("Estación Av. Castilla"));
// Calculate SO2 average on filtered input data
Avg avg = new Avg(new RegisterField("SO2"));
wf.addOperation(filter);
wf.addOperation(avg);
// Run the workflow
wf.run();
// Get the results
while (!stop) {
    Data output = wf.getResults();
    …
}

  45. Workflows
// Hybrid computation example
Producer producer = new PortProducer("catest", schema1, config);
StreamingData streamInput = new StreamingData(producer);
Loader loader = new CSVLoader("AQ.avro", uri, schema2);
StaticData batchInput = new StaticData(loader);
Data input = new HybridData(streamInput, batchInput);
Workflow wf = new HybridWorkflow(input);
// Add a filter operation
Filter filter = new Filter(new RegisterField("Title"), ConditionType.EQUAL, new StaticValue("street 34"));
wf.addOperation(filter);
// Calculate SO2 average on filtered input data
Avg avg = new Avg(new RegisterField("SO2"));
wf.addOperation(avg);
// Run the workflow
wf.run();
// Get the results
while (!stop) {
    Data output = wf.getResults();
}

  46. Results exploitation (diagram): operation results (Filter, RollUp, StdError, Avg, Select, Cube, Variance, Join…) feed into VISUALIZATION, EXPORT (CSV, JSON, …) and an ALARM SYSTEM
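The deck shows no export code; purely as an illustration of the EXPORT path (CSVExporter and its methods are hypothetical, not documented Lambdoop classes):

// Hypothetical export step; CSVExporter is illustrative only.
Data results = wf.getResults();
CSVExporter exporter = new CSVExporter("/tmp/so2_avg.csv");
exporter.export(results); // would write registers using the Data's Schema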

  47. Results exploitation • Visualization
/* Produce from Twitter */
TwitterProducer producer = new TwitterProducer(…);
Data data = new StreamingData(producer);
StreamingWorkflow wf = new StreamingWorkflow(data);
/* Add operations to the workflow */
wf.addOperation(new Count());
…
/* Get results from the workflow */
Data results = wf.getResults();
/* Show results. Set dashboard refresh */
Dashboard d = new Dashboard(config);
d.addChart(LambdoopChart.createBarChart(results, new RegisterField("count"), "Tweets count"));

  48. Results exploitation • Visualization
