Data Liberty

Alternatives to the shackles of limited scale in data solutions

Andy Cross

Windows Azure MVP

Elastacloud



The Cloud for Modern Business

aka.ms/azuretry

Deploy fast in the cloud, scale elastically and minimize test cost

Activate your Windows Azure MSDN benefit at no additional charge

Grab your benefit

aka.ms/msdnsubscr


Social Media

Tell everyone I’m awesome #cloudbrew

I’m @andybareweb


Data value at scale requires technology choices;

often prioritising data read traversal over operational characteristics of create/update/delete

embracing hybrid data platforms with varied technology partners over homogenous estates

establishing alternative skillsets, augmented with entrenched languages, trusting cloud over maintenance

following robust engineering processes to provide rigour in a deterministic world


Bravery leads to rewards;

the winners will have data which shows them that they’ve won

the commoditised query turns energy sucking data silos into profit centres

new data traversal mechanisms lead to new connotative data expression

everything you already know is relevant and valid; the constraints on how it is applied are not


WHAT’S A DATA SCIENTIST’S FAVOURITE LANGUAGE?

Most developers have heard of Big Data. I’m going to show how Microsoft are increasingly relevant in this space.

My talk is about architecture and approach. Note we’re talking Big Data and not strictly Data Science.

But context is always worthwhile, so let’s start with the history.


Wikimedia commons

IBM have been a leader in Big Data for years.


We’re not as great as we’d hope; we’re often still bound by our ability to marshal our IO.

Just as the speed of loading punchcards was historically a limiting factor, we are now limited by our capacity to ingest data on individual machines.

This leads to ideas such as DFS and data locality.


During the evolution of data we eventually moved to client/server and this was a big step up from dBase et al of the time.

Fundamentally, however, the tabular, structured nature of data poses many challenges; not least the long-term effects of normalisation, which trades efficient storage in the short term for the compute required later to reconstruct sets.

This eventually leads to such ideas as NoSQL document and entity stores.
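
To make the normalisation trade-off concrete, here is a minimal C# sketch. It assumes two hypothetical in-memory "tables"; CustomerRow, AddressRow and Normalised are names invented for this illustration and do not come from any library. The join inside Rebuild is the compute that has to be paid on every read to reconstruct the set.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical normalised rows: the customer and its addresses live in separate "tables".
public sealed class CustomerRow { public int Id; public string Name; }
public sealed class AddressRow { public int CustomerId; public string City; }

public static class Normalised
{
    // Reconstructs "customer name -> cities" by joining the two row sets at read time.
    public static Dictionary<string, List<string>> Rebuild(
        IEnumerable<CustomerRow> customers,
        IEnumerable<AddressRow> addresses)
    {
        return customers
            .GroupJoin(
                addresses,
                customer => customer.Id,
                address => address.CustomerId,
                (customer, matches) => new
                {
                    customer.Name,
                    Cities = matches.Select(address => address.City).ToList()
                })
            .ToDictionary(entry => entry.Name, entry => entry.Cities);
    }
}
```

A document store avoids this step by keeping the cities inside the customer document, at the price of storing the data in a less compact form.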


Modelling of data provides a consistent challenge. Our world is highly connected and our brains are effective connectors of data. Real world data fits poorly into highly structured data sets.

This leads to semi-structured and unstructured data formats, and to data queryability through relationship traversal.


The technologies shown today are primarily written in non-.NET and non-Microsoft languages and frameworks. Every time we meet one, I’ll show examples ONLY in the .NET and Microsoft stacks.

There are obviously challenges beyond language to running the alternative stacks; but remember in the Cloud you aren’t responsible for tuning a Linux cluster which has been running for 5 years. You should provision for a duration that is bounded by the likelihood of the cluster requiring routine maintenance.


Hadoop – KEY FACTS

Open Source; Apache Foundation.

Java.

Map Reduce framework for job distribution; Distributed File System for file access.

In Windows Azure this is known as HDInsight.
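
The Map Reduce idea can be sketched in a few lines of C# before looking at the real SDK. This is a single-process imitation under stated assumptions: MiniMapReduce and its members are hypothetical names, nothing here calls Hadoop or HDInsight, and the "shuffle" is just a LINQ GroupBy rather than a distributed sort.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative only: imitates the map, shuffle and reduce phases in one process.
public static class MiniMapReduce
{
    // Map: turn one input line into zero or more (key, value) pairs.
    public static IEnumerable<KeyValuePair<string, int>> Map(string line)
    {
        var words = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var word in words)
        {
            yield return new KeyValuePair<string, int>(word.ToLowerInvariant(), 1);
        }
    }

    // Reduce: collapse every value seen for one key into a single result.
    public static int Reduce(string key, IEnumerable<int> values)
    {
        return values.Sum();
    }

    // Run: map every line, group the pairs by key (the "shuffle"), then reduce each group.
    public static Dictionary<string, int> Run(IEnumerable<string> lines)
    {
        return lines
            .SelectMany(line => Map(line))
            .GroupBy(pair => pair.Key, pair => pair.Value)
            .ToDictionary(group => group.Key, group => Reduce(group.Key, group));
    }
}
```

Calling MiniMapReduce.Run on a handful of lines returns a word count per key. In a real cluster the map calls run next to the data on many machines, which is where the Distributed File System comes in.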


Hadoop is O(n)

It exhibits linear performance; when the dataset doubles, the time taken to execute the algorithm doubles.


Let’s look at some scary Java

Any children should look away now.


Hadoop SDK

C# integration

Remote Data & Jobs

Hive in C#

Serialization


Jobs

public class SwedishSessionsJob : HadoopJob<SwedishSessionsMapper, SessionsReducer>
{
    // Tells Hadoop where to read the input from and where to write the results.
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        var config = new HadoopJobConfiguration()
        {
            InputPath = "\"/AllSessions/*.gz\"",
            OutputFolder = "/SwedishSessions/"
        };
        return config;
    }
}


Mapper

public class SwedishSessionsMapper : MapperBase
{
    // Called once per input line; emits ("SE", "1") for each Swedish session.
    public override void Map(string inputLine, MapperContext context)
    {
        if (inputLine.Contains("Country=Sweden"))
        {
            context.IncrementCounter("SwedishSession");
            context.EmitKeyValue("SE", "1");
        }
    }
}


Reducer

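public class SessionsReducer : ReducerCombinerBase
{
    // Called once per key; emits the number of values seen for that key.
    public override void Reduce(string key, IEnumerable<string> values, ReducerContext context)
    {
        // Count() needs System.Linq; the count is emitted as a string.
        context.EmitKeyValue(key, values.Count().ToString());
    }
}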


Testing Hadoop Queries

var inputData = "Country=Sweden&Name=Magnus";
var result = StreamingUnit.Execute<Jobs.SwedishJob>(new[] { inputData });
Assert.AreEqual("SE\t1", result.ReducerResult.First());


Skill reuse

Your existing development team can immediately realise value

The frameworks facilitate deterministic testing for highly reliable queries

Express elegant solutions in C#

Familiar Unit Testing patterns

Concise programmatic terseness

Complex logic is best expressed in programmatic form


Commoditised query

[Chart: value of query. Axes: Value, Cost, Time. Labels: Provision, Execute, De-provision, Action Cost.]


HDInsight wins.

Automated provisioning and job execution services.

Transient clusters limit exposure to poorly tooled* Java estate.

Persistence with Windows Azure Blob Storage as HDFS proxy known as Azure Storage Vault (ASV).

Persistence in Windows Azure SQL Database for Hive Metastore.

JavaScript console.

* Tools are great but not friendly


NoSQL Document and Entity Stores

Examples in MongoDB

Entity stores are similar; you can find a great example in Windows Azure Table Storage


What is a document database?

Relational Database

Document Database

{
    "_id" : ObjectId("51fccc57f82352d76653bdae"),
    "Name" : {
        "FirstName" : "Owen",
        "LastName" : "Grzegorek"
    },
    "Company" : "Howard Miller Co",
    "Address" : {
        "Line1" : "15410 Minnetonka Industrial Rd",
        "Line2" : "Minnetonka",
        "Line3" : "Hennepin",
        "Line4" : "MN",
        "Line5" : "55345"
    },
    "ContactDetails" : {
        "Phone" : "952-939-2973",
        "Fax" : "952-939-4663",
        "Email" : "[email protected]",
        "Web" : "http://www.owengrzegorek.com"
    }
}


{
    "Name" : "Isaac Abraham",
    "Age" : "33",
    "Football Team" : "Tottenham",
    "Icon" : <image>
}

{
    "Name" : "Richard Conway",
    "Books Published" : "12",
    "Specialises in" : "Data Science"
}

{
    "Name" : "Andy Cross",
    "Hometown" : "Blackpool"
}


MongoDB Key Facts

  • General Purpose Operational Database

    • Real-time updates, ad-hoc queries and batch processing

    • Maps nicely with popular programming models e.g. .NET

    • Schema-free documents – lightweight and quick to get up and running

  • High Performance

    • Embedding documents – no expensive joins across tables

    • Indexes allow query optimization

    • High-speed saving of data (writes)

  • High Availability

    • Built in replication

    • Built in failover

  • Easy Scalability

    • “Sharding” allows easily spreading data across multiple databases

    • Replicated data can be spread throughout the cluster
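
As a rough picture of what sharding means, the sketch below routes a document to one of several shards by hashing a shard key. It is an assumption-laden illustration, not MongoDB's actual routing: ShardRouter is a hypothetical name, the hash is an FNV-1a style hash computed over the characters of the key, and real deployments can also shard by key range.

```csharp
using System;

// Hypothetical router: maps a shard key to a shard number in [0, shardCount).
public static class ShardRouter
{
    public static int ShardFor(string shardKey, int shardCount)
    {
        if (shardKey == null) throw new ArgumentNullException(nameof(shardKey));
        if (shardCount <= 0) throw new ArgumentOutOfRangeException(nameof(shardCount));

        // A hand-rolled hash is used because string.GetHashCode is not guaranteed
        // to be stable between processes, and routing must be repeatable.
        unchecked
        {
            uint hash = 2166136261;
            foreach (char c in shardKey)
            {
                hash ^= c;
                hash *= 16777619;
            }
            return (int)(hash % (uint)shardCount);
        }
    }
}
```

The same key always lands on the same shard, so a read for that key needs to visit only one database. Changing shardCount moves most keys, which is one reason real systems use more careful schemes than a plain modulo.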


MongoDB is O(log n)

It exhibits logarithmic performance; when the dataset doubles, the time taken to execute the algorithm increases by a fixed amount
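
The "fixed amount" can be seen in a small C# sketch. It uses a binary search over a sorted array as a stand-in for an index lookup; that is an assumption made for illustration (IndexLookup is a hypothetical name, and a real database index is a tree structure rather than a flat array), but the growth pattern is the same.

```csharp
using System;

// Hypothetical helper: counts how many probes a binary search makes.
public static class IndexLookup
{
    // Builds the sorted keys 0, 1, ..., count - 1.
    public static int[] BuildKeys(int count)
    {
        var keys = new int[count];
        for (int i = 0; i < count; i++)
        {
            keys[i] = i;
        }
        return keys;
    }

    // Returns the number of probes made while searching sortedKeys for target,
    // whether or not the target is present.
    public static int CountProbes(int[] sortedKeys, int target)
    {
        int low = 0;
        int high = sortedKeys.Length - 1;
        int probes = 0;
        while (low <= high)
        {
            probes++;
            int mid = low + (high - low) / 2;
            if (sortedKeys[mid] == target) return probes;
            if (sortedKeys[mid] < target) low = mid + 1;
            else high = mid - 1;
        }
        return probes;
    }
}
```

Searching 2048 keys for a value just past the end takes exactly one more probe than searching 1024 keys, even though the dataset has doubled.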


Strengths of MongoDB

Good fit for .NET developers

Low barrier to entry

Uses well-known .NET technologies e.g. LINQ

Good migration path from SQL-style development

Works well as operational data store

Batch processing capability for map reduce

Flexible

Massively scalable with well-defined replication model

Self-managing – easily add new nodes

High performance writes and eventually consistent reads

Database is free to use (tooling is not!)

Popular, so a relatively large community

Designed for scalability

Low cost


Mongo SDK

There are many different ways to connect to MongoDB from a .NET project.

Official

Wrapper

Alternative

Tool


C# implementations

If your data is regularly structured, you can use domain classes:

public class Book
{
    public string Author { get; set; }
    public string Title { get; set; }
}

// "books" is the name of the collection
var books = database.GetCollection<Book>("books");

Book book = new Book
{
    Author = "Ernest Hemingway",
    Title = "For Whom the Bell Tolls"
};

books.Insert(book);


C# implementations

If your data is irregularly structured or semi-structured, you can use a BSON object model:

BsonDocument person = new BsonDocument {
    { "name", "John Doe" },
    { "address", new BsonDocument {
        { "street", "123 Main St." },
        { "city", "Centerville" },
        { "state", "PA" },
        { "zip", 12345 }
    }}
};

var people = database.GetCollection<BsonDocument>("people");

people.Insert(person);


NoSQL Document Wins

Semi-structured data first class citizen

Built in MapReduce

Operational and interactive

Massively scalable


Graph Databases, Neo4j KEY FACTS

Open Source; Neo Technology.

Java

Runs equally well on Windows or Linux. In Windows Azure there are VMDepot images able to be deployed in a few simple steps. Additionally the Azure Linux VMs are a good fit for this database engine.

There is an Open Source .NET SDK available through NuGet, actively maintained primarily by an Australian company, Readify.


Neo4j is O(1)

It exhibits constant-time performance; that is, the algorithm takes the same time to execute irrespective of the size of the dataset.


How O(1)?

  • Graphs don’t have tables. They don’t have collections.

  • They have nodes and relationships.

  • Rather than having to select out a whole table, we can identify a point on the graph

    • A start point

  • Follow the traversal of relationships from that point.
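
The points above can be sketched in plain C#. This is a toy in-memory graph, not Neo4j or its client: GraphNode and TinyGraph are hypothetical names, and the only relationship type modelled is a one-directional "friends" link.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A node holds direct references to the nodes it is related to.
public sealed class GraphNode
{
    public string Name { get; }
    public List<GraphNode> Friends { get; } = new List<GraphNode>();

    public GraphNode(string name)
    {
        Name = name;
    }
}

public static class TinyGraph
{
    // Starts at one node and follows two FRIENDS hops outwards.
    // Only the relationships reachable from the start point are touched;
    // nodes elsewhere in the graph are never visited.
    public static IEnumerable<string> FriendsOfFriends(GraphNode start)
    {
        return start.Friends
            .SelectMany(friend => friend.Friends)
            .Where(candidate => candidate != start)
            .Select(candidate => candidate.Name)
            .Distinct();
    }
}
```

The work done depends on how many relationships hang off the start point, not on how many nodes the graph holds in total, which is the intuition behind the claim on this slide.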


http://www.apcjones.com/arrows/#


Things we can do

Find all the things formed in Sweden

START sweden = node:countryIdx(country = "Sweden")
MATCH sweden<-[:FORMED_IN]-something
RETURN something;

Find friends of friends

START magnus = node:peopleIdx(name = "magnus")
MATCH magnus-[:FRIENDS]->friend-[:FRIENDS]->friendoffriend
RETURN friendoffriend;


Neo4j Client

Open source Neo4j Client


C# examples

var query = neo4Jclient.Cypher
    .Start(new
    {
        sweden = Node.ByIndexLookup("countryIdx", "country", "sweden")
    })
    .Match("sweden-[:FRIENDS]->friend-[:FRIENDS]->friendoffriend")
    .Return<Node<Friend>>("friendoffriend");


Graph Database Wins

  • Modelled domains match cognitive processes

  • Optimisation for traversal of relationships allows complex and “social” queries to emerge

    • LIKES of FRIENDS of COLLEAGUES

  • O(1) performance characteristics due to ability to START queries at arbitrary graph points.


Summary

  • HDInsight brings Hadoop to Azure

    • Suited to Data Volume, Variety, Variability etc

  • MongoDB brings Document stores

    • Suited to Data Volume, Operational concerns

  • Neo4j brings Graph database

    • Suited to data relationship traversal


