An Introduction to HDInsight

June 27th, 2013, @sqlbischmidt

Structured or Unstructured?

  • Structured data has an identifiable structure

    • Organized by columns and rows

    • For example, databases

  • Unstructured data has no such identifiable structure


  • Getting Started

    • Apache Hadoop-based service

    • Modern, cloud-based platform that manages data of any type and size

  • Big data does not provide value on its own; it must first be extracted, transformed, and loaded (ETL)

HDInsight (continued)

  • An HDInsight Azure instance consists of a head node (also called a namenode) and one or more data nodes

  • Benefits:

    • Integration into Social Media

    • Advanced Analytics

    • “Live” Changes

      • What’s the weather like right now?



  • MapReduce takes a large, unstructured data set and breaks it down by mapping, shuffling, and sorting the data, then reduces it to generate an output file

  • HDFS: Hadoop Distributed File System

    • Data gets distributed over multiple drives on multiple servers

  • JAR files: bundled, compiled MapReduce code that can be executed
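The map, shuffle/sort, and reduce steps can be illustrated with a word count. This is a minimal single-process sketch in Python, not Hadoop code: the function names and the sample input are assumptions made for illustration, and a real job would run the map and reduce steps in parallel across the data nodes.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_and_sort(pairs):
    """Shuffle/sort: gather the emitted values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single count."""
    return [(key, sum(values)) for key, values in grouped]


def word_count(lines):
    """Chain the three phases in the order a MapReduce job applies them."""
    return reduce_phase(shuffle_and_sort(map_phase(lines)))


# Hypothetical input standing in for a large unstructured data set.
sample = ["big data is big", "data is everywhere"]
for word, count in word_count(sample):
    print(f"{word}\t{count}")
```

Each phase only needs the output of the previous one, which is what lets a cluster spread the work over many machines.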

Pig

  • Pig is an alternative to writing Java code for creating and running MapReduce jobs.

  • The language is called Pig Latin

  • Using Pig is a good way to reduce the time needed to create MapReduce programs

  • Many algorithms can be written in fewer than five lines of Pig Latin code!

Pig (continued)

  • Pig Latin statements follow a general flow of:

    • LOAD

    • Transform (for example FILTER, GROUP, or FOREACH)

    • DUMP or STORE

  • Pig Latin can be written in either grunt mode (interactive) or script mode (batch)
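The load, transform, and output flow can be mimicked in a few lines of Python. This is only a stand-in for the shape of a Pig Latin script, not Pig itself: the city and temperature records, the function names, and the `min_temp` threshold are all hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical input; in Pig this would be a file read from HDFS.
RAW = "seattle,12\nphoenix,38\nseattle,16\nphoenix,42\n"


def load(text):
    """LOAD step: parse comma-separated text into (city, temp) records."""
    return [(city, int(temp)) for city, temp in csv.reader(io.StringIO(text))]


def transform(records, min_temp):
    """Transform step: filter by temperature, group by city, average each group."""
    groups = defaultdict(list)
    for city, temp in records:
        if temp >= min_temp:
            groups[city].append(temp)
    return {city: sum(temps) / len(temps) for city, temps in sorted(groups.items())}


def dump(result):
    """DUMP step: print results to the console (STORE would write them to a file)."""
    for city, average in result.items():
        print(city, average)


averages = transform(load(RAW), min_temp=15)
dump(averages)
```

Running the steps one at a time in a Python prompt is loosely comparable to interactive mode, while running the whole file is comparable to script mode.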

Hive

  • Hive provides a “SQL-like” language that sits on top of Hadoop

  • The language is commonly referred to as Hive Query Language (HiveQL or HQL)

  • Structure without modeling

  • Hive can handle larger data sets than a traditional single-server SQL database because it queries data in parallel across multiple nodes using MapReduce
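To give a feel for what a "SQL-like" query looks like, the sketch below runs a grouped count using Python's built-in `sqlite3` module. SQLite is used purely as a stand-in and runs on a single machine; the `weblogs` table and its rows are hypothetical. The assumption is that a Hive query of similar shape would be turned into MapReduce jobs and executed across the cluster instead.

```python
import sqlite3

# In-memory database standing in for data that would live in the cluster.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weblogs (page TEXT, visitor TEXT)")
conn.executemany(
    "INSERT INTO weblogs VALUES (?, ?)",
    [("/home", "ann"), ("/home", "bob"), ("/pricing", "ann"), ("/home", "cy")],
)

# A grouped aggregate: count the hits per page, busiest page first.
query = """
    SELECT page, COUNT(*) AS hits
    FROM weblogs
    GROUP BY page
    ORDER BY hits DESC
"""
rows = conn.execute(query).fetchall()
for page, hits in rows:
    print(page, hits)
conn.close()
```

The GROUP BY here plays the same role as the shuffle and reduce steps of a MapReduce job: rows with the same key are brought together and then collapsed into one result.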

Data Explorer

  • Data Explorer is currently in preview from Microsoft

  • Excel can connect directly to an HDInsight cluster to bring data in for analysis

  • That data can then be joined with other relational sources to “mash” the data together
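The "mash" step is essentially a join on a shared key. The Python sketch below shows the idea with two tiny hypothetical data sets. It does not use Excel or Data Explorer, and the product IDs, names, and mention counts are invented for illustration.

```python
# Hypothetical rows pulled from an HDInsight cluster: (product_id, mentions).
cluster_rows = [(1, 250), (2, 40), (3, 975)]

# Hypothetical rows from a relational table: product_id -> product name.
product_names = {1: "Widget", 2: "Gadget", 4: "Gizmo"}


def mash(cluster_rows, product_names):
    """Inner join on product_id: keep only the ids present in both sources."""
    return [
        (product_names[product_id], mentions)
        for product_id, mentions in cluster_rows
        if product_id in product_names
    ]


combined = mash(cluster_rows, product_names)
print(combined)
```

Rows without a match on either side (product 3 and product 4 here) drop out, which is the usual behavior of an inner join.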

Additional Resources

  • Apache Homepage


  • HDInsight


  • Hortonworks