Presentation Transcript

  1. Introduction to Avro and Integration with Hadoop

  2. What is Avro? • Avro is a serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes. • Avro provides a good way to convert unstructured and semi-structured data into structured data using schemas
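
A minimal sketch of what "compact binary format" means in code, assuming the "User" schema from the next slide has already been parsed into a Schema object named schema (and that the surrounding method declares throws IOException):

  import java.io.ByteArrayOutputStream;
  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericDatumWriter;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.avro.io.BinaryEncoder;
  import org.apache.avro.io.EncoderFactory;

  // Build one record and fill in every field.
  GenericRecord user = new GenericData.Record(schema);
  user.put("FirstName", "Joe");
  user.put("LastName", "Hadoop");
  user.put("isActive", true);
  user.put("Account", 12345);

  // Serialize to bytes: the output holds only the values, no field names
  // and no embedded schema, which is why the encoding is so compact.
  ByteArrayOutputStream out = new ByteArrayOutputStream();
  BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
  new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
  encoder.flush();
  byte[] bytes = out.toByteArray();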

  3. Creating your first Avro schema • Schema description:
  {
    "name": "User",
    "type": "record",
    "fields": [
      {"name": "FirstName", "type": "string", "doc": "First Name"},
      {"name": "LastName", "type": "string"},
      {"name": "isActive", "type": "boolean", "default": true},
      {"name": "Account", "type": "int", "default": 0}
    ]
  }

  4. Avro schema features • Primitive types (null, boolean, int, long, float, double, bytes, string) • Records:
  { "type": "record",
    "name": "LongList",
    "fields": [
      {"name": "value", "type": "long"},
      {"name": "description", "type": "string"}
    ]
  }
  • Others (Enums, Arrays, Maps, Unions, Fixed); see the snippets below
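
Small schema snippets for each of the complex types named above (the type and field names are illustrative, but the shapes follow the Avro specification):

  • Enum: {"type": "enum", "name": "Status", "symbols": ["ACTIVE", "INACTIVE"]}
  • Array: {"type": "array", "items": "string"}
  • Map: {"type": "map", "values": "long"}
  • Union (null or string): ["null", "string"]
  • Fixed: {"type": "fixed", "name": "MD5", "size": 16}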

  5. How to create an Avro record?
  String schemaDescription = " { \n"
      + " \"name\": \"User\", \n"
      + " \"type\": \"record\",\n"
      + " \"fields\": [\n"
      + "   {\"name\": \"FirstName\", \"type\": \"string\", \"doc\": \"First Name\"},\n"
      + "   {\"name\": \"LastName\", \"type\": \"string\"},\n"
      + "   {\"name\": \"isActive\", \"type\": \"boolean\", \"default\": true},\n"
      + "   {\"name\": \"Account\", \"type\": \"int\", \"default\": 0} ]\n"
      + "}";
  Schema.Parser parser = new Schema.Parser();
  Schema s = parser.parse(schemaDescription);
  GenericRecordBuilder builder = new GenericRecordBuilder(s);

  6. How to create an Avro record? (cont. 2) The first step in creating an Avro record is to define a JSON-based schema. Avro provides a parser that takes an Avro schema string and returns a Schema object. Once the Schema object exists, we create a builder that allows us to create records with default values.

  7. How to create an Avro record? (cont. 3)
  GenericRecord r = builder.build();
  System.out.println("Record" + r);
  r.put("FirstName", "Joe");
  r.put("LastName", "Hadoop");
  r.put("Account", 12345);
  System.out.println("Record" + r);
  System.out.println("FirstName:" + r.get("FirstName"));
  Output:
  {"FirstName": null, "LastName": null, "isActive": true, "Account": 0}
  {"FirstName": "Joe", "LastName": "Hadoop", "isActive": true, "Account": 12345}
  FirstName:Joe

  8. How to create an Avro schema dynamically?
  String[] fields = {"FirstName", "LastName", "Account"};
  Schema s = Schema.createRecord("Ex2", "desc", "namespace", false);
  List<Schema.Field> lstFields = new LinkedList<Schema.Field>();
  for (String f : fields) {
    lstFields.add(new Schema.Field(f, Schema.create(Schema.Type.STRING), "doc", new TextNode("")));
  }
  s.setFields(lstFields);
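
As a usage sketch (not on the original slide): the dynamic schema works with the same GenericRecordBuilder shown earlier. Note that all three fields, including Account, were created as string fields above, so the value is passed as text:

  GenericRecordBuilder builder = new GenericRecordBuilder(s);
  GenericRecord r = builder
      .set("FirstName", "Joe")
      .set("LastName", "Hadoop")
      .set("Account", "12345")  // a string field in this dynamic schema
      .build();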

  9. How to sort Avro records? You can specify which fields you would like to order on, and in which order, using the "order" attribute. Options: ascending, descending, ignore
  { "name": "isActive", "type": "boolean", "default": true, "order": "ignore" },
  { "name": "Account", "type": "int", "default": 0, "order": "descending" }
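
A sketch of how the order attribute takes effect (assuming r1 and r2 are two records of this schema): GenericData's comparator honors the declared order, skipping "ignore" fields and reversing the result for "descending" fields:

  int c = GenericData.get().compare(r1, r2, schema);
  if (c < 0) {
    // r1 sorts before r2 under the schema's declared field order
  }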

  10. How to write Avro records to a file?
  File file = new File("<file-name>");
  DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
  DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(writer);
  dataFileWriter.create(schema, file);
  for (Record rec : list) {
    dataFileWriter.append(rec);
  }
  dataFileWriter.close();
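
A related option the slide does not show: Avro data files support block compression, configured on the writer before create(). Deflate is built in; snappy requires the snappy-java library on the classpath:

  dataFileWriter.setCodec(CodecFactory.deflateCodec(6));  // compression level 1-9
  // or: dataFileWriter.setCodec(CodecFactory.snappyCodec());
  dataFileWriter.create(schema, file);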

  11. How to read Avro records from a file?
  File file = new File("<file-name>");
  DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>();
  DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(file, reader);
  while (dataFileReader.hasNext()) {
    Record r = (Record) dataFileReader.next();
    System.out.println(r.toString());
  }

  12. Running MapReduce Jobs on Avro Data 1. Set the input schema on AvroJob based on the schema read from the input path:
  File file = new File(DATA_PATH);
  DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>();
  DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(file, reader);
  Schema s = dataFileReader.getSchema();
  AvroJob.setInputSchema(job, s);

  13. Running MapReduce Jobs on Avro Data - Mapper
  public static class MapImpl extends AvroMapper<GenericRecord, Pair<String, GenericRecord>> {
    public void map(GenericRecord datum,
                    AvroCollector<Pair<String, GenericRecord>> collector,
                    Reporter reporter) throws IOException {
      ….
    }
  }

  14. Running MapReduce Jobs on Avro Data - Reducer
  public static class ReduceImpl extends AvroReducer<Utf8, GenericRecord, GenericRecord> {
    public void reduce(Utf8 key, Iterable<GenericRecord> values,
                       AvroCollector<GenericRecord> collector,
                       Reporter reporter) throws IOException {
      collector.collect(values.iterator().next());
      return;
    }
  }
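
The slides show the mapper and reducer but not how the job is wired together. A minimal sketch using the classic avro-mapred API (the input/output paths are placeholders, and reusing the input schema s as the output schema is an assumption):

  // Needs: org.apache.avro.mapred.{AvroJob, Pair} and
  // org.apache.hadoop.mapred.{JobConf, JobClient, FileInputFormat, FileOutputFormat}
  JobConf conf = new JobConf();
  conf.setJobName("avro-example");
  AvroJob.setInputSchema(conf, s);  // s: the Schema read from the input file above
  AvroJob.setMapOutputSchema(conf,
      Pair.getPairSchema(Schema.create(Schema.Type.STRING), s));
  AvroJob.setOutputSchema(conf, s);
  AvroJob.setMapperClass(conf, MapImpl.class);
  AvroJob.setReducerClass(conf, ReduceImpl.class);
  FileInputFormat.setInputPaths(conf, new Path("<input-dir>"));
  FileOutputFormat.setOutputPath(conf, new Path("<output-dir>"));
  JobClient.runJob(conf);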

  15. Running Avro MapReduce Jobs on Data with Different Schemas
  List<Schema> schemas = new ArrayList<Schema>();
  schemas.add(schema1);
  schemas.add(schema2);
  Schema schema3 = Schema.createUnion(schemas);
  This allows the job to read data from different sources and process both of them in the same mapper; see the sketch below.
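
A sketch of one way the mapper can tell the two kinds of records apart (a common pattern, not from the deck; the schema name "User" is illustrative): set the union as the input schema, then branch on each datum's concrete schema:

  AvroJob.setInputSchema(job, schema3);  // the union of schema1 and schema2

  // Inside map(): every GenericRecord carries the schema it was written with.
  if ("User".equals(datum.getSchema().getName())) {
    // handle records of schema1
  } else {
    // handle records of schema2
  }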

  16. Summary • Avro is a great tool for semi-structured and structured data • Simplifies MapReduce development • Provides a good compression mechanism • A great tool for conversion from existing SQL code • Questions?
