
„Big data”

Benczúr András

MTA SZTAKI

Big Data – the new hype
  • “big data” is when the size of the data itself becomes part of the problem
  • “big data” is data that becomes large enough that it cannot be processed using conventional methods
  • Google sorts 1PB in 33 minutes (07-09-2011)
  • Amazon S3 store contains 499B objects (19-07-2011)
  • New Relic: 20B+ application metrics/day (18-07-2011)
  • Walmart monitors 100M entities in real time (12-09-2011)

Source: The Emerging Big Data slide from the Intelligent Information Management DG INFSO/E2 Objective ICT-2011.4.4 Info day in Luxembourg on 26 September 2011

Big Data Planes

[Map of the Big Data landscape on two axes – Size (MegaByte → PetaByte) and Speed (batch → real time / fast data). Application areas and services: fraud detection, media pricing, power station sensors, Web content, navigation/mobility, news curation, IT logs, online reputation; offered as custom hardware, custom software and Big Data Services. “Big analytics” tools: Matlab, Revolution, GraphLab, SPSS, SAS, R, MOA, SciPy, Mahout. Data platforms: KDB, Vertica, S4, Netezza, Esper, HBase, Greenplum, MapR, InfoBright, Progress, MySql, Hadoop.]

Overview
  • Introduction – Buzzwords
  • Part I: Background
    • Examples
      • Mobility and navigation traces
      • Sensors, smart city
      • IT logs
      • Wind power
    • Scientific and Business Relevance
  • Part II: Infrastructures
    • NoSQL, Key-value store, Hadoop, H*, Pregel, …
  • Part III: Algorithms
    • Brief History of Algorithms
    • Web processing, PageRank with algorithms
    • Streaming algorithms
    • Entity resolution – detailed comparison (QDB 2011)

Navigation and Mobility Traces
  • Streaming data at mobile base stations
  • Privacy issues
    • Regulations let only anonymized data leave, beyond network operations and billing
    • Do regulation policy makers know about deanonymization attacks?
    • What your wife/husband will not know – your mobile provider will

Sensors – smart home, city, country, …
  • Road and parking slot sensors
  • Mobile parking traces
  • Public transport, Oyster cards
  • Bike hire schemes

Source: Internet of Things Comic Book, http://www.smartsantander.eu/images/IoT_Comic_Book.pdf

Corporate IT log processing
  • Traditional methods – aggregation into a Data Warehouse – fail
  • Our experience: 30-100+ GB/day, 3-60 M events
  • Identify bottlenecks
  • Optimize procedures
  • Detect misuse, fraud, attacks

Scientific and business relevance
  • VLDB 2011 (~100 papers):
    • 6 papers on MapReduce/Hadoop, 10 on big data (+keynote), 11 NoSQL architectures, 6 GPS/sensory data
    • tutorials, demos (Microsoft, SAP, IBM NoSQL tools)
    • session: Big Data Analysis, MapReduce, Scalable Infrastructures
  • EWEA 2011: 28% of papers on wind power raise data size issues
  • SIGMOD 2011: out of 70 papers, 10 on new architectures and extensions for analytics
  • Gartner 2011 trend No. 5: Next Generation Analytics - „significant changes to existing operational and business intelligence infrastructures”
  • The Economist 2010.02.27: „Monstrous amounts of data … Information is transforming traditional businesses”
  • News special issue on Big Data this April

New challenges in database technologies

Question of research and practice: Applicability to a specific problem? Applicability as a general technique?

Overview
  • Part I: Background
    • Examples
    • Scientific and Business Relevance
  • Part II: Infrastructures
    • NoSQL
    • Key-value stores
    • Hadoop and tools built on Hadoop
    • Bulk Synchronous Parallel, Pregel
    • Streaming, S4
  • Part III: Algorithms, Examples

Many external slide shows follow …
  • NoSQL introduction – www.intertech.com/resource/usergroup/NoSQL.ppt
  • Key-value stores
    • Berkeley DB – not distributed
    • Voldemort – behemoth.strlen.net/~alex/voldemort-nosql_live.ppt
    • Cassandra, Dynamo, …
    • Also available on top of Hadoop (below): HBase
  • Hadoop – Miki Erdélyi’s slides
  • HBase – datasearch.ruc.edu.cn/course/cloudcomputing20102/slides/Lec07.ppt
  • Cascading – will not be covered
  • Mahout – cwiki.apache.org/MAHOUT/faq.data/Mahout%20Overview.ppt
  • Why is something else needed? What else is needed?
    • Bulk Synchronous Parallel
    • GraphLab – Danny Bickson’s slides
    • MOA – http://www.slideshare.net/abifet/moa-5636332/download
  • Streaming
    • S4 – http://www.slideshare.net/alekbr/s4-stream-computing-platform

Use of large matrices
  • Main step in all distributed algorithms
    • Network based features in classification
    • Partitioning for efficient algorithms
    • Exploring the data, navigation (e.g. ranking to select a nice compact subgraph)
  • Hadoop apps (e.g. PageRank) move the entire data around in each iteration
  • Baseline C++ code keeps data local

[Chart comparing the three implementations: Hadoop, Hadoop + key-value store, best custom C++ code]

BSP vs. MapReduce
  • MapReduce: Data locality not preserved between Map and Reduce invocations or MapReduce iterations.
  • BSP: Tailored towards processing data with locality.
    • Proprietary: Google Pregel
    • Open-source (will be??… several flaws now): HAMA
    • Home developed C++ code base
  • Both: Easy parallelization and distribution.

Sidlo et al., Infrastructures and bounds for distributed entity resolution. QDB 2011

Overview
  • Part I: Background
  • Part II: Infrastructures
    • NoSQL, Key-value store, Hadoop, H*, Pregel, …
  • Part III: Algorithms, Examples, Comparison
    • Data and computation intense tasks, architectures
    • History of Algorithms
    • Web processing, PageRank with algorithms
    • Entity resolution – detailed, with algorithms
  • Summary, conclusions
Types of Big Data problems
  • Data intense
    • Web processing, info retrieval, classification
    • Log processing (telco, IT, supermarket, …)
  • Compute intense:
    • Expectation Maximization, Gaussian mixture decomposition, image retrieval, …
    • Genome matching, phylogenetic trees, …
  • Data AND compute intense:
    • Network (Web, friendship, …) partitioning, finding similarities, centers, hubs, …
    • Singular value decomposition

Hardware
  • Data intense:
    • Map-reduce (Hadoop), cloud, …
  • Compute intense:
    • Shared memory
    • Message passing
    • Processor arrays, …
    • → became an affordable choice recently, as graphics co-processors!
  • Data AND compute intense??
Big data: Why now?
  • Hardware is just getting better, cheaper?
  • But data is getting larger, easier to access
  • Bad news for algorithms slower than ~ linear
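A rough back-of-the-envelope illustration of the last point, with made-up record counts: when the data grows 100×, a linear algorithm needs about 100× the work, while a quadratic one needs 10,000×, far more than cheaper hardware can absorb.

```python
# Back-of-the-envelope: how the work grows when the data grows 100x,
# for algorithms of different complexities (illustrative numbers only).
import math

n_old, n_new = 10**6, 10**8   # hypothetical record counts: 1M -> 100M

for name, f in [("n",       lambda n: n),
                ("n log n", lambda n: n * math.log2(n)),
                ("n^2",     lambda n: n ** 2)]:
    print(f"{name:>8}: work grows {f(n_new) / f(n_old):,.0f}x")

# n       : work grows 100x      -- keeps pace with the data
# n log n : work grows 133x      -- still manageable
# n^2     : work grows 10,000x   -- the "bad news" above
```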

Moore’s Law: doubling in 18 months

But in a key aspect, the trend has changed!

From speed to number of cores

„Numbers Everyone Should Know”
  • Typical sizes
    • Disk: 10+ TB
    • RAM: 100+ GB
    • CPU cache: L2 1+ MB, L1 10+ KB
    • GPU on-board memory: global 4-8 GB, block-shared 10+ KB
  • RAM latencies
    • L1 cache reference: 0.5 ns
    • L2 cache reference: 7 ns
    • Main memory reference: 100 ns
    • Read 1 MB sequentially from memory: 250,000 ns
  • Intra-process communication and network
    • Mutex lock/unlock: 100 ns
    • Read 1 MB sequentially from network: 10,000,000 ns
  • Disk
    • Disk seek: 10,000,000 ns
    • Read 1 MB sequentially from disk: 30,000,000 ns

Jeff Dean, Google

Back to Databases, this means …

  • Read 1 MB sequentially…
    • memory 250,000 ns
    • network 10,000,000 ns
    • disk 30,000,000 ns

[Chart: number of transactions/second vs. number of CPUs (5, 10, 16 CPUs; 1000, 1600, 2000 transactions/sec) – linear speed-up (ideal) vs. the sub-linear speed-up actually achieved; shared-memory vs. distributed-memory (one memory module per CPU) configurations.]

Drawbacks of the distributed approach:
  • Cost
  • Security
  • Integrity control more difficult
  • Lack of standards
  • Lack of experience
  • Complexity of management and control
  • Increased storage requirements
  • Increased training cost

Connolly, Begg: Database systems: a practical approach to design, implementation, and management. International computer science series, Pearson Education, 2005

“The brief history of Algorithms”

[Timeline of models and platforms: theoretic models – P, NP; PRAM; external memory algorithms; SIMD, MIMD, message passing; Cray: vector processors; Thinking Machines: hypercube; CM-5: many vector processors; Google: Map-reduce; multi-core; many-core; cloud; flash disk.]

Earliest history: P, NP

[Figure: a small weighted graph (edge weights 1, 2, 5, 15, 25) illustrating the two problem classes.]

  • P: Graph traversal, spanning tree
  • NP: Steiner trees

Why do we care about graphs, trees?

  • Image segmentation
  • Entity Resolution

[Figure: an image-segmentation example, and an entity-resolution example where records are linked through shared name, ID and e-mail attributes.]

History of algs: spanning trees in parallel
  • iterative minimum spanning forest (see the sketch below)
  • every node is a tree at start; every iteration merges trees

Bentley: A parallel algorithm for constructing minimum spanning trees, 1980
Harish et al.: Fast Minimum Spanning Tree for Large Graphs on the GPU, 2009

[Figure: a small example graph with edge weights 1–8.]
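A minimal sequential sketch of the merge-trees-each-round idea above (Borůvka-style); the toy edge list is made up, and the cited papers run the same rounds in parallel or on the GPU.

```python
# Iterative minimum spanning forest: every node starts as its own tree,
# and each round merges every tree along its cheapest outgoing edge.

def min_spanning_forest(n, edges):            # edges: list of (weight, u, v)
    parent = list(range(n))

    def find(x):                              # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    while True:
        cheapest = {}                         # tree root -> cheapest outgoing edge
        for w, u, v in edges:
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            for r in (ru, rv):
                if r not in cheapest or w < cheapest[r][0]:
                    cheapest[r] = (w, u, v)
        if not cheapest:                      # no tree has an outgoing edge: done
            break
        for w, u, v in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv               # merge the two trees
                forest.append((u, v, w))
    return forest

print(min_spanning_forest(4, [(1, 0, 1), (3, 1, 2), (2, 0, 2), (4, 2, 3)]))
# [(0, 1, 1), (0, 2, 2), (2, 3, 4)]
```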

Overview
  • Part I: Background
  • Part II: Infrastructures
    • NoSQL, Key-value store, Hadoop, H*, Pregel, …
  • Part III: Algorithms, Examples, Comparison
    • Data and computation intense tasks, architectures
    • History of Algorithms
    • Web processing, PageRank with algorithms
    • Streaming algorithms
    • Entity resolution – detailed, with algorithms
  • Summary, conclusions

The Web is about data too

WEB 1.0 (browsers) – Users find data
WEB 2.0 (social networks) – Users find each other
WEB 3.0 (semantic Web) – Data find each other

WEB 4.0 – Data create their own Facebook page, restrict friends.

WEB 5.0 – Data decide they can work without humans, create their own language.

WEB 6.0 – Human users realize that they no longer can find data unless invited by data.

WEB 7.0 – Data get cheaper cell phone rates.

WEB 8.0 – Data hoard all the good YouTube videos, leaving human users with access to bad ’80s music videos only.

WEB 9.0 – Data create and maintain own blogs, are more popular than human blogs.

WEB 10.0 – All episodes of Battlestar Galactica will now be shown from the Cylons’ point of view.

Posted by John Klossner on Aug 03, 2009

Big Data interpretation: recommenders, personalization, info extraction

Partner approaches to hardware
  • Hanzo Archives (UK): Amazon EC2 cloud + S3
  • Internet Memory Foundation: 50 low-end servers
  • We: indexing 3TB compressed, .5B pages
    • Open source tools not yet mature
    • One week of processing on 50 old dual cores
    • Hardware worth approx. €10,000; Amazon price around €5,000

Text REtrieval Conference measurement
  • Documents stored in HBase tables over the Hadoop file system (HDFS)
  • Indexing:
    • 200 instances of our own C++ search engine
    • 40 Lucene instances, then the top 50,000 hits with our own search engine
    • SolR? Katta? Do they really work in real time? Ranking?
  • Realistic – even spam is important!
  • Spam
    • Obvious parallelization: each node processes all pages of one host (see the sketch below)
    • Link features (e.g. PageRank) cannot be computed in this way

M. Erdélyi, A. Garzó, and A. A. Benczúr: Web spam classification: a few features worth more (WebQuality 2011)
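A toy sketch of the host-level parallelization mentioned above: pages are keyed by their host, so one worker sees all pages of a host and can compute its content features locally; the URLs and the 'cheap'-word feature are made up.

```python
# Host-level parallelization sketch: key every page by its host, so one worker
# processes all pages of a host and can compute content features locally.
# (Link-based features such as PageRank still need the global graph.)
from collections import defaultdict
from urllib.parse import urlparse

pages = {                                   # made-up crawl fragment
    "http://a.example/1": "cheap pills buy now",
    "http://a.example/2": "cheap cheap cheap",
    "http://b.example/1": "conference call for papers",
}

by_host = defaultdict(dict)                 # host -> {url: text}, one partition per host
for url, text in pages.items():
    by_host[urlparse(url).netloc][url] = text

for host, docs in by_host.items():          # each partition could run on a separate node
    spam_word_ratio = sum("cheap" in t for t in docs.values()) / len(docs)
    print(host, f"pages mentioning 'cheap': {spam_word_ratio:.2f}")
```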

Distributed storage: HBase vs WARC files
  • WARC
    • Many, many medium-sized files: very inefficient with Hadoop
    • Either huge block size wasting space
    • Or data locality lost as blocks may continue at a non-local HDFS node
  • HBase
    • Data locality preserving ranges – cooperation with Hadoop
    • Experiments up to 3TB compressed Web data
  • WARC to HBase
    • One-time expensive step, no data locality
    • One-by-one inserts fail, very low performance
    • MapReduce jobs to create HFiles, the native HBase format
  • HFile transfer
    • HBase insertion rate 100,000 per hour

The Random Surfer Model

Starts at a random page—arrives at a quality page

Nodes = Web pages
Edges = hyperlinks

[Figure: random walk on the Web graph around page u]

PageRank: The Random Surfer Model

Chooses a random neighbor with probability 1 − ε

The Random Surfer Model

Or with probability ε “teleports” to a random page—gets bored and types a new URL

The Random Surfer Model

And continues with the random walk …

The Random Surfer Model

Until convergence … ?

[Brin, Page 98]

PageRank as Quality

A quality page is pointed to by several quality pages

PR(k+1) = PR(k)((1 − ε)M + ε·U) = PR(1)((1 − ε)M + ε·U)^k
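A minimal dense-matrix sketch of this power iteration on a made-up four-page graph, assuming ε = 0.15; a Web-scale run would of course use the distributed machinery discussed earlier.

```python
# Power iteration for PageRank on a toy 4-page graph:
#   PR(k+1) = PR(k) ((1 - eps) M + eps U)
# M is the row-stochastic link matrix, U the uniform teleport matrix.
import numpy as np

eps = 0.15
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}      # made-up hyperlinks
n = 4

M = np.zeros((n, n))
for u, outs in links.items():
    M[u, outs] = 1.0 / len(outs)                 # random neighbor with prob 1 - eps

U = np.full((n, n), 1.0 / n)                     # teleport to a random page with prob eps
G = (1 - eps) * M + eps * U

pr = np.full(n, 1.0 / n)                         # start from the uniform distribution
for _ in range(100):
    new = pr @ G                                 # one iteration: row vector times matrix
    if np.abs(new - pr).sum() < 1e-10:
        break
    pr = new

print(pr)                                        # page 2 collects the most weight
```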

Personalized PageRank

Or with probability ε “teleports” to a random page—selected from her bookmarks

Algorithmics
  • Estimated 10+ billion Web pages worldwide
  • PageRank (as floats)
    • fits into 40 GB storage
  • Personalization just to single pages:
    • 10 billion PageRank scores for each page
    • Storage exceeds several exabytes!
  • NB single-page personalization is enough:
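The arithmetic behind the two storage bullets above, assuming 4-byte floats and decimal units:

```python
# Back-of-the-envelope storage estimates (4-byte floats, decimal units).
pages = 10**10                          # ~10 billion Web pages

global_pagerank = pages * 4             # one float per page
print(global_pagerank / 10**9, "GB")    # 40.0 GB -> fits on a single disk

single_page_ppr = pages * pages * 4     # one score per (page, teleport page) pair
print(single_page_ppr / 10**18, "EB")   # 400.0 EB -> hopeless to precompute
```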

For certain things are just too big?

  • For light to reach the other side of the Galaxy … takes rather longer: five hundred thousand years.
  • The record for hitch hiking this distance is just under five years, but you don't get to see much on the way.

D Adams, The Hitchhiker's Guide to the Galaxy. 1979

Markov Chain Monte Carlo
  • Reformulation by simple tricks of linear algebra
    • From u, simulate N independent random walks
    • Database of fingerprints: ending vertices of the walks from all vertices
  • Query
    • PPR(u,v) := # ( walks u→v ) / N
    • N ≈ 1000 approximates the top 100 well

Fogaras, Rácz: Towards Scaling Fully Personalized PageRank, WAW 2004
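A toy sketch of the fingerprint idea, assuming ε = 0.15 and the same made-up graph as before: simulate N walks from u, record where they end, and estimate PPR(u, v) as the fraction of walks ending at v.

```python
# Monte Carlo personalized PageRank: from u run N random walks that stop with
# probability eps at each step; PPR(u, v) ~= (# walks ending at v) / N.
import random
from collections import Counter

eps = 0.15
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}      # same toy graph as before

def fingerprint(u, N=1000):
    ends = Counter()
    for _ in range(N):
        v = u
        while random.random() > eps:             # continue the walk w.p. 1 - eps
            v = random.choice(links[v])
        ends[v] += 1                             # walk over: record its end vertex
    return {v: c / N for v, c in ends.items()}   # the fingerprint database row of u

ppr_u = fingerprint(0)
print(sorted(ppr_u.items(), key=lambda kv: -kv[1]))   # approximate top list for u = 0
```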

SimRank: similarity in graphs

“Two pages are similar if pointed to by similar pages” [Jeh–Widom KDD 2002]:

  • Same trick: path pair summation (can be sampled [Fogaras–Rácz WWW 2005]) over path pairs
    u = w0, w1, . . . , wk−1, wk = v2
    u = w′0, w′1, . . . , w′k−1, w′k = v1
  • DB application e.g.: Yin, Han, Yu. LinkClus: efficient clustering via heterogeneous semantic links, VLDB ’06
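One way to sample this, following the coupled reverse-random-walk view used by the sampling papers cited above (a sketch only; the graph, the decay constant C and the walk-length cap are made up): start two walks on reversed edges from the two pages, and let a first meeting after t steps contribute C^t.

```python
# Sampled SimRank: run two random walks on reversed edges from pages a and b;
# a sample whose walks first meet after t steps contributes C**t.
import random

in_links = {0: [2], 1: [0], 2: [0, 1, 3], 3: []}   # reversed edges of the toy graph
C, N, max_len = 0.6, 5000, 10

def simrank_estimate(a, b):
    total = 0.0
    for _ in range(N):
        x, y = a, b
        for t in range(1, max_len + 1):
            if not in_links[x] or not in_links[y]:
                break                              # a walk got stuck: no meeting
            x = random.choice(in_links[x])
            y = random.choice(in_links[y])
            if x == y:
                total += C ** t                    # first meeting after t steps
                break
    return total / N

print(simrank_estimate(0, 1))   # "similar if pointed to by similar pages"
```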

Communication complexity bounding
  • Bit-vector probing (BVP): Alice has a bit vector x = (x1, x2, …, xm); Bob has a number 1 ≤ k ≤ m and must output xk after receiving B bits of communication from Alice
  • Theorem: B ≥ m for any protocol
  • Reduction from BVP to Exact-PPR-compare: Alice encodes x as a graph G with V vertices, where V² = m, and pre-computes exact PPR data of size D; from D alone, Bob can answer comparisons PPR(u,v) ? PPR(u,w) for vertices u, v, w, which reveal xk
  • Thus D = B ≥ m = V²

Theory of Streaming algorithms
  • Distinct values example – Motwani’s slides
    • Sequential, RAM algorithms
    • External memory algorithms
    • Sampling: negative result
    • The „sketching” technique
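As a small taste of the sketching technique from the list above, a K-minimum-values distinct-count sketch (the parameters and the synthetic stream are made up): hash every item into [0, 1), keep only the k smallest hash values, and estimate the number of distinct values from the k-th smallest.

```python
# K-minimum-values sketch: one pass, small memory, approximate distinct count.
import hashlib, heapq, random

def h(item):                                   # hash into [0, 1)
    d = hashlib.sha1(str(item).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

def distinct_estimate(stream, k=256):
    smallest = []                              # max-heap (negated) of the k smallest hashes
    seen = set()                               # hash values currently in the heap
    for item in stream:
        x = h(item)
        if x in seen:
            continue
        if len(smallest) < k:
            heapq.heappush(smallest, -x); seen.add(x)
        elif x < -smallest[0]:                 # smaller than the current k-th smallest
            seen.discard(-heapq.heappushpop(smallest, -x)); seen.add(x)
    if len(smallest) < k:
        return len(smallest)                   # fewer than k distinct values seen
    return (k - 1) / (-smallest[0])            # k-th smallest of n uniforms ~ k / (n + 1)

stream = (random.randint(0, 99999) for _ in range(1_000_000))
print(distinct_estimate(stream))               # close to 100,000, using ~k floats of state
```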
Overview
  • Part I: Background
    • Examples
    • Scientific and Business Relevance
  • Part II: Foundations Illustrated
    • Data and computation intense tasks, architectures
    • History of Algorithms
    • Web processing, PageRank with algorithms
    • Streaming algorithms
    • Entity resolution – detailed, with algorithms
  • Summary, conclusions

Distributed Computing Paradigms and Tools
  • Distributed Key-Value Stores:
    • distributed B-tree index for all attributes
    • Project Voldemort
  • MapReduce:
    • map → reduce operations
    • Apache Hadoop
  • Bulk Synchronous Parallel:
    • supersteps: computation → communication → barrier sync
    • Apache Hama

Entity resolution

[Figure: records r1, …, r5 with attributes A1, A2, A3, grouped into entities e1 = {r1, r2}, e2 = {r3, r4}, e3 = {r5}]

  • Records with attributes – r1, … , r5
  • Entities formed by sets of records – e1, e2, e3
  • Entity Resolution problem = partition the records into customers, …

A communication complexity lower bound
  • Set intersection cannot be decided by communicating less than Θ(n) bits [Kalyanasundaram, Schnitger 1992]
  • Implication: if data is over multiple servers, one needs Θ(n) communication to decide if it may have a duplicate with another node
  • Best we can do is communicate all data
  • Related area: Locality Sensitive Hashing
    • no LSH for „minimum”, i.e. to decide if two attributes agree
    • similar to negative results on Donoho’s Zero „norm” (number of non-zero coordinates)

Wait – how about blocking?
  • Blocking speeds up shared memory parallel algorithms [many in the literature, e.g. Whang, Menestrina, Koutrika, Theobald, Garcia-Molina. ER with Iterative Blocking, 2009]
  • Even if we could partition the data with no duplicates split, the lower bound applies just to test we are done
  • Still, blocking is a good idea – we may have to communicate much more than Θ(n) bits

Distributed Key-Value Store (KVS)
  • the record graph can be served from the KVS
    • KVS mainly provides random access to a huge graph that would not fit in main memory
    • computing nodes, many indexing nodes
  • implement a graph traversal
    • Breadth-First Search (BFS) with queues
    • spanning forest with Union-Find
    • basic textbook algorithms [Cormen-Leiserson-Rivest,…]
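A toy sketch of the spanning-forest-with-Union-Find variant from the list above; a plain dict stands in for the key-value store, so the record graph and its lookups are illustrative only.

```python
# Union-Find over a record graph whose adjacency lists live in a key-value store.
# A plain dict stands in for the KVS; a real deployment would issue get() calls.

kvs = {                                   # record id -> ids sharing some attribute value
    "r1": ["r2"], "r2": ["r1", "r3"], "r3": ["r2"],
    "r4": ["r5"], "r5": ["r4"],
}

parent = {r: r for r in kvs}

def find(r):
    while parent[r] != r:
        parent[r] = parent[parent[r]]     # path compression
        r = parent[r]
    return r

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[ra] = rb

for record in kvs:                        # one KVS lookup per record
    for neighbor in kvs[record]:
        union(record, neighbor)

entities = {}                             # component root -> resolved entity
for r in kvs:
    entities.setdefault(find(r), []).append(r)
print(list(entities.values()))            # [['r1', 'r2', 'r3'], ['r4', 'r5']]
```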
MapReduce (MR)
  • Sorting as the prime MR app
  • For each feature, sort to form a graph of records
    • MR has no data locality
    • All data is moved around the network in each iteration
  • Find connected components of the graph
    • Again the textbook matrix power method can be implemented in MR
    • Iterated matrix multiplication is not MR-friendly [Kang, Tsourakakis, Faloutsos. Pegasus framework, 2009]
    • This step will move huge amounts around: the whole data even if we have small components only
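A single-machine sketch of the iterative 'propagate the smallest label' pattern such a connected-components job follows; the map and reduce steps below are plain Python, not Hadoop API calls, and the record graph is made up. Every iteration regroups all (node, label) pairs, which is exactly the data movement the slide complains about.

```python
# Connected components the MapReduce way: map emits each node's current label
# to its neighbors, reduce keeps the minimum label per node, repeat to a fixed point.
from collections import defaultdict

edges = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]   # toy record graph
labels = {n: n for e in edges for n in e}            # start: every node labels itself

while True:
    # map: emit own label to self and to every neighbor
    emitted = [(n, labels[n]) for n in labels]
    for u, v in edges:
        emitted.append((v, labels[u]))
        emitted.append((u, labels[v]))

    # shuffle + reduce: keep the minimum label seen for each node
    groups = defaultdict(list)
    for node, label in emitted:
        groups[node].append(label)
    new_labels = {node: min(ls) for node, ls in groups.items()}

    if new_labels == labels:                         # fixed point: components found
        break
    labels = new_labels

print(labels)   # {'r1': 'r1', 'r2': 'r1', 'r3': 'r1', 'r4': 'r4', 'r5': 'r4'}
```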
Bulk Synchronous Parallel (BSP)
  • One master node + several processing nodes perform merge sort
  • Algorithm:
    • resolve local data at each node locally
    • send attribute values in sorted order
    • central server identifies and sends back candidates
    • find connected components
      • this is the prominent BSP app, very fast
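A schematic, single-process sketch of the supersteps listed above; the node partitions, attribute values and the exact-match rule are made up, and in a real BSP run each dictionary exchange would be message passing followed by a barrier.

```python
# BSP-flavoured entity resolution sketch: compute -> communicate -> barrier,
# repeated as supersteps; everything runs in one process here.
from collections import defaultdict

# Superstep 1 (local compute): each node holds and locally resolves its own records.
node_records = {
    "node0": {"r1": "ann@x.hu", "r2": "ann@x.hu", "r3": "bob@y.hu"},
    "node1": {"r4": "bob@y.hu", "r5": "cec@z.hu"},
}

# Superstep 2 (communicate): nodes send their (attribute value, record id) pairs.
inbox = defaultdict(list)                 # the central step's mailbox
for node, records in node_records.items():
    for rec, email in records.items():
        inbox[email].append(rec)
# ---- barrier: all messages delivered before the next superstep starts ----

# Superstep 3: matching values become candidate groups; connected components
# of the candidates are the resolved entities.
entities = list(inbox.values())
print(entities)   # [['r1', 'r2'], ['r3', 'r4'], ['r5']]
```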

Experiments: scalability
  • 15 older blade servers, 4GB memory, 3GHz CPU each
  • insurance client dataset (~2 records per entity)

[Charts: scalability results, broken down by HAMA phases and by Hadoop phases]

Conclusions
  • Big Data is founded on several subfields
    • Architectures – processor arrays, many-core affordable
    • Algorithms – design principles from the ‘90s
    • Databases – distributed, column oriented, NoSQL
    • Data mining, Information retrieval, Machine learning, Networks – for the top application
  • Hadoop and Stream processing are the two main efficient techniques
    • Limitations for data AND compute intense problems
    • Many emerging alternatives (e.g. BSP)
    • Selecting the right architecture is often the question

Questions?

András Benczúr

Head, Informatics Laboratory

http://datamining.sztaki.hu/

Institute for Computer Science and Control, Hungarian Academy of Sciences

Email: benczur@sztaki.hu