Platform for Big Data, NoSQL and Relational Data. What makes sense for me? (+Azure)

Michael Epprecht, Technology Evangelist, michael.epprecht@microsoft.com, @fastflame

Agenda: Big Data; AllSQL, NoSQL, NewSQL, SomeSQL; Windows Azure

Big Data

WHAT IS BIG DATA? [Figure: nested tiers of data, plotted by volume (megabytes, gigabytes, terabytes, petabytes) against data complexity (variety and velocity).] Big Data tier (petabytes): click stream, wikis/blogs, sensors/RFID/devices, social sentiment, audio/video, log files, spatial & GPS coordinates, data market feeds, eGov feeds, weather, text/image. Web 2.0 tier (terabytes): advertising, mobile, collaboration, eCommerce, web logs, digital marketing, search marketing, recommendations. ERP/CRM tier (gigabytes): payables, payroll, inventory, contacts, deal tracking, sales pipeline.

The original Gartner three V’s (Feb 2001): http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf Volume (think data tiering): size of the data; manageability. Velocity (think CEP): speed at which data is received; latency to deliver data analysis. Variety (think ETL, ODS, email, social networks): differing formats of data; disparate source systems.

Big Data to Data Analytics. Variety: dealing with un/semi-structured and structured data. How do you mix oranges and apples? Compare textual data with relational data. Tooling: accessing the “variety” of different data sources. Determining “value”. Big Data = a proxy for doing more with existing data. Perspective: what you are doing. Hardware innovations over time: spinning disk vs. flash, GPGPU vs. CPU.

Replacing BI? Single version of the truth? Conformed dimensions (standardised data reporting): four different operational systems ETL’d into a single dimension. Does Big Data change that? NO! YES! Unstructured data is unstructured: can it be conformed? Report on detail or aggregations? No: this is analytics, we are data mining. It still needs standardisation and thought, i.e. a formal design process.

All data has structure; not all data has context. Data stored [in structure]: image -> png, jpg, bmp etc.; free text -> ASCII, Unicode, .docx, .xls etc.; sound -> mp3, mpeg. Data queried: image -> (?) face recognition, Kinect; free text -> grammar; sound -> pitch, note etc. Context? Image -> polygon; free text -> ??; sound -> bars in the music??

Has structure? Examples: “A1”, “difficulties”.

Has context? Stored in normal form (relational); stored in Unicode. “A1” could mean anything. “Difficulties”: the word itself has meaning. Notes: using normal form (relational), context is provided by the schema. New term time: “uncontexted data” (115 Bing references). Context gives data structure only when applied.

Big Data Processing

We’ve been hyped. The bandwagon is rolling. If you hear a new term, research it; it is probably nothing new.

Finally: what is Big Data (really)? Data analytics (stuff we already do). What is new? New toolsets to help with the variety of data; the industry waking up to the power of commodity kit; data science as a field (a combination of BI analyst, business analyst and BI developer). It’s still all about insights into our data. Hadoop: the platform of the next generation? Look out for the name change: Big Data will become Data Analytics.

A NEW SET OF QUESTIONS. How do I better predict future outcomes? How do I optimize my fleet based on weather and traffic patterns? What’s the social sentiment for my brand or products? Themes: advanced analytics, social & web analytics, live data feeds.

Common Big Data Customer Scenarios. Gain competitive advantage by moving first and fast in your industry: IT infrastructure optimization, legal discovery, social network analysis, traffic flow optimization, web app optimization, weather forecasting, healthcare outcomes, natural resource exploration, churn analysis, fraud detection, life sciences research, advertising analysis, equipment monitoring, smart meter monitoring.

What is Hadoop?

Massively Parallel Processing (MPP): chop a task up across multiple physical machines. High Performance Clustering (HPC). Distributed Data Processing (DDP): processing done locally on the data. MapReduce is based on something we know already.

Why MPP? Because enterprise kit for this performance is way too expensive: 100 machines with cheap DAS cost a fraction of a scale-up machine with expensive SAN infrastructure. Most NoSQL and NewSQL products are built with MPP and commodity kit as a design feature, as is the cloud computing model. Network connectivity is a key component (oh, hence take the processing to the data!). This follows the design paradigm that processing should move to the data, not the data to the processing.

What is Hadoop? An open source project coordinated by Apache. Analogous to an OS; core components: utilities, HDFS, MapReduce. Lots of other projects sit within the ecosystem: Mahout, Sqoop, Flume, Scribe, Oozie, Jaql, Hue, Hiho, Hive, Pig, HBase, … and more and more… There are V1.0.0 and V2.0.0 code branches.

HBase: persistent, distributed, in memory. Efficient at random reads/writes. A distributed, large-scale data store. Utilizes Hadoop for persistence. Both HBase and Hadoop are distributed.

In Hadoop MapReduce speak. Map: parse each input line to get the data you want; output a key (presented to a single reducer) and value pair (what we will likely aggregate). Shuffle: sort and move the same “keys” to the same node for reduction (can be expensive, so plan your data partitions properly). Reduce: aggregate the values. Output. http://developer.yahoo.com/hadoop/tutorial/module4.html
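The three phases can be sketched on a single machine in plain Python. This is only an illustration of the data flow, not Hadoop code: the log format (`<page> <bytes_sent>`) and all function names are assumptions made up for the example, and a real job would run the map and reduce steps on many nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Map: parse one input line and emit (key, value) pairs."""
    # Hypothetical log format: "<page> <bytes_sent>"
    page, bytes_sent = line.split()
    yield page, int(bytes_sent)

def shuffle_phase(pairs):
    """Shuffle: bring all values for the same key together, sorted by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_phase(key, values):
    """Reduce: aggregate all values that share a key."""
    return key, sum(values)

def run_job(lines):
    """Run map, then shuffle, then reduce over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return [reduce_phase(key, values) for key, values in shuffle_phase(pairs)]

LOG_LINES = ["/home 120", "/about 40", "/home 80"]
RESULT = run_job(LOG_LINES)
print(RESULT)
```

In a cluster the shuffle is the step that moves data over the network, which is why the slide warns that it can be expensive.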

MapReduce as SQL Map = SELECT FROM WHERE Reduce = GROUP BY
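A small sketch of that analogy, using Python's built-in `sqlite3` module next to a hand-written map and reduce. The `hits` table, its columns and the sample rows are hypothetical; the point is only that both versions should produce the same per-key totals.

```python
import sqlite3

# Hypothetical sample data: (page, HTTP status, bytes sent)
ROWS = [("/home", 200, 120), ("/about", 200, 40), ("/home", 500, 0), ("/home", 200, 80)]

def sql_version(rows):
    """SELECT ... FROM ... WHERE ... GROUP BY, run against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hits (page TEXT, status INTEGER, bytes_sent INTEGER)")
    conn.executemany("INSERT INTO hits VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT page, SUM(bytes_sent) FROM hits "
        "WHERE status = 200 GROUP BY page ORDER BY page"
    ).fetchall()
    conn.close()
    return result

def mapreduce_version(rows):
    """The same query expressed as a map step followed by a reduce step."""
    # Map = SELECT FROM WHERE: pick the columns we need, drop unwanted rows.
    mapped = [(page, bytes_sent) for page, status, bytes_sent in rows if status == 200]
    # Reduce = GROUP BY: aggregate the values for each key.
    totals = {}
    for page, bytes_sent in mapped:
        totals[page] = totals.get(page, 0) + bytes_sent
    return sorted(totals.items())

print(sql_version(ROWS))
print(mapreduce_version(ROWS))
```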

AllSQL, NoSQL, NewSQL and SomeSQL

AllSQL: data stored in normal form; ACID for consistency and durability; queries done using ANSI SQL. Basically what the majority of folk do. The majority of reporting products use SQL as an interface. Everybody knows SQL (despite its sins). Easy to understand and get going with.

NoSQL (Not Only SQL). Led by developers wanting: more flexible data structures (dynamic schema); the ability to store non-tabular data; higher scalability (scale out); lower hardware cost (build on commodity kit); open source (move away from proprietary products). Durability and consistency are not a primary concern. Data resilience is built into the product through replicas rather than expensive hardware and software solutions. Examples: see http://nosql-database.org/ (there are hundreds!): Azure Table Store, Google’s BigTable, Hadoop MapReduce, Cassandra, RavenDB, CouchDB, MongoDB.

NoSQL momentum. RDBMS cannot scale because of ACID (Atomicity, Consistency, Isolation, Durability). A swathe of new open source products. Data captured has value but is not readily accessible. NewSQL: will it “cure” the NoSQL problem?

NewSQL. Existing AllSQL products do not scale out well: single-machine design; the design is several decades old; expensive to create a DR/HA environment. Realisation: folk do not want to learn Java in order to report off their data; most toolsets use SQL as a method for reporting. Examples: VoltDB, NuoDB, Azure DB.

AllSQL, NoSQL, NewSQL and SomeSQL. The days when everything lives in SQL Server are going. BI/BA/DA {whatever you want to call it} is done across different data sources: semi/un/fully structured. Understand the non-relational world. The SQL language isn’t going anywhere. This isn’t about enterprise only; this affects us all.

Windows Azure

MANAGE any data, any size, anywhere. Unified monitoring, management & security across: non-relational, streaming, relational, data movement.

HADOOP INTEGRATED INTO THE DATA PLATFORM (non-relational). Microsoft HDInsight Server for on-premises; Windows Azure HDInsight Service for cloud. Enterprise-class security, HA & management. Seamlessly integrated with Microsoft BI tools. Windows simplicity and manageability. Provisioned in minutes on Windows Azure. Built on Hortonworks Data Platform (HDP).

Hadoop architecture: Business Intelligence (Excel, Power View…); Active Directory (security); pipeline/workflow (Oozie); metadata (HCatalog); graph (Pegasus); stats processing (RHadoop); data integration (ODBC / Sqoop / REST); scripting (Pig); query (Hive); machine learning (Mahout); NoSQL database (HBase); System Center; log file aggregation (Flume); distributed processing (MapReduce); distributed storage (HDFS).

Insights FOR ALL USERS through familiar tools, across GB, TB and PB of data. Audiences: BI professionals, business analysts, data scientists. Advanced analytics from Microsoft and 3rd parties; self-service analysis with PowerPivot & Power View; interactivity & exploration with Hadoop data in Excel.

Azure SQL Database

SQL Database Architecture

Architecture. Federation: an object contained within a user database; defines the scheme for the federation; represents the database being sharded. Federation root: the database that houses the federation object. Federation member: system-managed SQL databases that contain part, or “slices”, of the data. [Diagram: federations in SalesDB, with labels Orders_federation, Orders_Fed, Federation Root and Federation Members.] Syntax: CREATE FEDERATION fed_name (fed_key_label fed_key_type distribution_type)

Architecture, cont. Federation key: the key used for data distribution; types int, bigint, guid, varbinary. Atomic unit: represents a single instance of a federation key, i.e. all rows in all federated tables with the same federation key value. [Diagram: SalesDB as federation root and the Orders_Fed federation members; atomic units AUPK=5, AUPK=25, AUPK=35 sit in one member, and AUPK=1005, AUPK=1025, AUPK=1035 sit in the member with range [1000, 2000).]

Architecture, cont. Federated table: contains only atomic units for the member’s key range. Reference table: a non-federated table.

Repartitioning: dynamic partitioning. SPLIT members to spread workloads over more nodes; DROP members to shrink back to fewer nodes. Example: ALTER FEDERATION Orders_Fed SPLIT AT (tenant_id=7500) [Diagram: in SalesDB, the Orders_Fed member [5000, 10000) becomes two members, [5000, 7500) & [7500, 10000).]

Reliable routing: built-in data-dependent routing (DDR). Ensures apps can discover where the data is just-in-time; no “shard map” caching; guaranteed member routing. Example: USE FEDERATION Orders_Fed (tenant_id=7509) [Diagram: in SalesDB, Orders_Fed members [5000, 7500) & [7500, 10000).]
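To make the range arithmetic concrete, here is a toy Python model of one federation member being split and of a key being routed to the member that owns it. It is a sketch of the idea only, not how SQL Database implements federations; the `Federation` class and its method names are invented for the example, and the numbers are the ones from the slides.

```python
import bisect

class Federation:
    """Toy model of range-partitioned members (not the SQL Database implementation)."""

    def __init__(self, low, high):
        # boundaries[i] is the inclusive lower bound of member i;
        # the last entry is the exclusive upper bound of the final member.
        self.boundaries = [low, high]

    def split_at(self, key):
        """Mimics ALTER FEDERATION ... SPLIT AT: cut one member into two."""
        if not self.boundaries[0] < key < self.boundaries[-1]:
            raise ValueError("split point must fall strictly inside the key range")
        if key in self.boundaries:
            raise ValueError("a member already starts at this key")
        bisect.insort(self.boundaries, key)

    def members(self):
        """Return the members as (low, high) pairs meaning [low, high)."""
        return list(zip(self.boundaries, self.boundaries[1:]))

    def route(self, key):
        """Mimics USE FEDERATION ...: find the [low, high) member holding this key."""
        if not self.boundaries[0] <= key < self.boundaries[-1]:
            raise KeyError(key)
        index = bisect.bisect_right(self.boundaries, key) - 1
        return self.boundaries[index], self.boundaries[index + 1]

orders_fed = Federation(5000, 10000)   # the single member [5000, 10000)
orders_fed.split_at(7500)              # SPLIT AT (tenant_id=7500)
print(orders_fed.members())
print(orders_fed.route(7509))          # tenant_id=7509
```

Because ranges are half-open, the split key itself (7500) belongs to the upper member, which matches the `[5000, 7500) & [7500, 10000)` notation on the slides.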

Azure NoSQL (Azure Table Storage)

Table Storage concepts: Account -> Table -> Entity. [Diagram: account contoso holds the table customers (entities with Name = …, Email = … and Name = …, EMailAdd = …) and the table photos (entities with Photo ID = …, Date = …).]

Table details. Tables: create, query, delete; tables can have metadata. Not an RDBMS! Entities: insert; update (merge = partial update, replace = update entire entity); upsert; delete; query. Entity group transactions: multiple CUD operations in a single atomic transaction.
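The difference between the two update flavours is easy to miss, so here is a minimal in-memory sketch in Python. `ToyTable` is a hypothetical stand-in, not the storage service or any SDK, and its `upsert` is assumed to behave as insert-or-replace; the entity values are made up.

```python
class ToyTable:
    """In-memory stand-in for a table; not the storage service or its SDK."""

    def __init__(self):
        self._entities = {}  # (PartitionKey, RowKey) -> dict of properties

    def insert(self, pk, rk, props):
        """Insert a new entity; fails if the key is already taken."""
        if (pk, rk) in self._entities:
            raise KeyError("entity already exists")
        self._entities[(pk, rk)] = dict(props)

    def merge(self, pk, rk, props):
        """Partial update: listed properties change, the rest are kept."""
        self._entities[(pk, rk)].update(props)

    def replace(self, pk, rk, props):
        """Full update: the entity ends up with exactly the given properties."""
        if (pk, rk) not in self._entities:
            raise KeyError("entity does not exist")
        self._entities[(pk, rk)] = dict(props)

    def upsert(self, pk, rk, props):
        """Insert if missing, otherwise replace (an assumption for this sketch)."""
        self._entities[(pk, rk)] = dict(props)

    def get(self, pk, rk):
        """Return a copy of the stored properties."""
        return dict(self._entities[(pk, rk)])

table = ToyTable()
table.insert("contoso", "1", {"Name": "Ann", "Email": "ann@example.com"})
table.merge("contoso", "1", {"Email": "ann@contoso.example"})
after_merge = table.get("contoso", "1")      # Name is kept
table.replace("contoso", "1", {"Name": "Ann"})
after_replace = table.get("contoso", "1")    # Email is gone
print(after_merge, after_replace)
```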

Entity properties. An entity can have up to 255 properties and be up to 1MB in size. Mandatory properties for every entity: PartitionKey & RowKey (the only indexed properties), which uniquely identify an entity and define the sort order; Timestamp, used for optimistic concurrency and exposed as an HTTP ETag. No fixed schema for other properties: each property is stored as a <name, typed value> pair, and no schema is stored for a table. Properties can be the standard .NET types: String, binary, bool, DateTime, GUID, int, int64, and double.
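Optimistic concurrency with an ETag can be sketched as follows. This is a generic illustration in Python, not the table service's own logic: `VersionedStore`, the `if_match` parameter name and the `v1`, `v2` tag format are all assumptions made for the example. The idea is that a writer sends back the tag it read, and the write is rejected if someone else changed the entity in between.

```python
import itertools

class ConcurrencyError(Exception):
    """Raised when a writer presents a stale ETag."""

class VersionedStore:
    """Toy store that hands out a new ETag on every successful write."""

    def __init__(self):
        self._rows = {}                      # key -> (etag, properties)
        self._counter = itertools.count(1)   # source of fresh version numbers

    def get(self, key):
        """Return the current ETag and a copy of the properties."""
        etag, props = self._rows[key]
        return etag, dict(props)

    def put(self, key, props, if_match=None):
        """Write only if the entity is new or if_match equals the current ETag."""
        current = self._rows.get(key)
        if current is not None and if_match != current[0]:
            raise ConcurrencyError(f"stale ETag {if_match!r} for {key!r}")
        etag = f"v{next(self._counter)}"
        self._rows[key] = (etag, dict(props))
        return etag

store = VersionedStore()
store.put(("contoso", "1"), {"Name": "Ann"})

# Two clients read the same version of the entity.
etag_a, _ = store.get(("contoso", "1"))
etag_b, _ = store.get(("contoso", "1"))

# Client A writes first and wins; client B's tag is now stale.
store.put(("contoso", "1"), {"Name": "Ann A."}, if_match=etag_a)
try:
    store.put(("contoso", "1"), {"Name": "Ann B."}, if_match=etag_b)
    conflict = False
except ConcurrencyError:
    conflict = True
print(conflict)
```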

No Fixed Schema [Table residue: the only surviving cell is a FAV SPORT property with the value Canoeing.]

Querying: ?$filter=Last eq 'Wegner'
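A hedged sketch of how such a filter might be turned into a request URL with Python's standard library. The account name `myaccount` and the table name `customers` are placeholders, only a simple string `eq` comparison is handled, and a real request would also need authentication headers that are not shown here.

```python
from urllib.parse import quote

def odata_filter_url(account, table, prop, value):
    """Build a table query URL with a single '<prop> eq <string>' filter."""
    # OData string literals use single quotes; an embedded quote is doubled.
    literal = "'" + value.replace("'", "''") + "'"
    expression = f"{prop} eq {literal}"
    # Percent-encode the expression so spaces and quotes are URL-safe.
    return (
        f"https://{account}.table.core.windows.net/{table}()"
        f"?$filter={quote(expression)}"
    )

# 'myaccount' and 'customers' are hypothetical names for illustration.
URL = odata_filter_url("myaccount", "customers", "Last", "Wegner")
print(URL)
```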

Purpose of the PartitionKey. Entity locality: entities in the same partition are stored together, giving efficient querying and cache locality; endeavour to include the partition key in all queries. Entity group transactions: atomic multiple insert/update/delete in the same partition in a single transaction. Table scalability: target throughput of 500 tps per partition and several thousand tps per account; Windows Azure monitors the usage patterns of partitions and automatically load-balances them; each partition can be served by a different storage node; scale to meet the traffic needs of your table.
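Since an entity group transaction only spans one partition, a client typically has to bucket its pending changes by PartitionKey before submitting them. The Python sketch below shows just that grouping step; the entity dictionaries and the `group_by_partition` helper are hypothetical, and no service call is made.

```python
from collections import defaultdict

def group_by_partition(entities):
    """Bucket entities by PartitionKey so each bucket can form one atomic batch."""
    batches = defaultdict(list)
    for entity in entities:
        batches[entity["PartitionKey"]].append(entity)
    return dict(batches)

# Hypothetical pending inserts for a Products table.
PENDING = [
    {"PartitionKey": "Canoes", "RowKey": "1", "Name": "River runner"},
    {"PartitionKey": "Tents", "RowKey": "1", "Name": "Two-person"},
    {"PartitionKey": "Canoes", "RowKey": "2", "Name": "Lake cruiser"},
]

BATCHES = group_by_partition(PENDING)
for partition_key, batch in BATCHES.items():
    # Each batch touches a single partition, so it could be sent atomically.
    print(partition_key, [entity["RowKey"] for entity in batch])
```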

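Partitions and partition ranges. [Diagram: Server A serving all of Table = Products, shown next to the table split into two ranges, with Server A serving Table = Products [MinKey - Canoes) and Server B serving Table = Products [Canoes - MaxKey).]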

MANAGE ANY DATA, ANY SIZE, ANYWHERE. Unified monitoring, management & security. Non-relational: Hadoop on Windows, Hadoop on Azure. Relational: SQL Server Database & Parallel Data Warehouse. Streaming: StreamInsight. Data movement: Hadoop connectors & ETL.

[Diagram: the Windows Azure platform in layers.] Frameworks and services: caching, identity, service bus, media, cdn, big data, commerce, integration, analytics, hpc, mobile. Fabric: compute (virtual machines, web sites, cloud services); storage (SQL database, noSQL database, blob storage); networking (connect, virtual network, traffic manager). Global physical infrastructure: servers / network / datacenters. Characteristics: automated, managed resources, elastic, usage based. Regions: N Central US, S Central US, N Europe, W Europe, E Asia, SE Asia + 24 Edge CDN Locations.

www.microsoft.ch/shape

Questions?
