Big Data: What (you can get out of it), How (to approach it) and Where (this trend can be applied)
natalia_ponomareva at yahoo.com
Hadoop
• Hadoop is an open source distributed platform for data storage and computation that runs on commodity hardware
Adapted from the slides of Donald Miner
HDFS
• Works on top of a native file system (for example ext3 or xfs)
• Data is organized into files and directories
• Files are divided into blocks (64-128 MB)
• Files are distributed across cluster nodes
• Files are write-once
• The location of blocks can be used to optimize Map/Reduce execution
• Blocks are replicated for fault tolerance
• Data integrity is ensured via checksums
• HDFS is not good for random reads
• HDFS is optimized for streaming reads of files
• HDFS is based on the design of the Google File System
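The block, replication and checksum bullets above can be illustrated with a small Python sketch. It is not HDFS code: the replication factor of 3, the node names, and the round-robin placement are assumptions made for illustration (real HDFS placement also considers racks), and `plan_blocks` and `checksum` are hypothetical helper names.

```python
import math
import zlib

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the lower end of the range on the slide
REPLICATION = 3                # assumed replication factor

def plan_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Split a file of file_size bytes into blocks and pick nodes for each replica.

    Placement here is plain round-robin, a simplification for illustration.
    """
    n_blocks = math.ceil(file_size / block_size)
    copies = min(replication, len(nodes))  # cannot have more distinct replicas than nodes
    plan = []
    for i in range(n_blocks):
        replicas = [nodes[(i + r) % len(nodes)] for r in range(copies)]
        plan.append((i, replicas))
    return plan

def checksum(data: bytes) -> int:
    """CRC32 of a chunk of data; a mismatch on read signals corruption."""
    return zlib.crc32(data)

if __name__ == "__main__":
    for block, replicas in plan_blocks(200 * 1024 * 1024, ["n1", "n2", "n3", "n4"]):
        print(block, replicas)
```

Losing any single node in this plan still leaves two copies of every block, which is the point of the replication bullet.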
Map/Reduce Paradigm
• Jobs are described in terms of Mappers and Reducers
• Mappers receive input records and emit key/value pairs
• Pairs from mappers are automatically
  • Grouped by key
  • Sorted for each reducer
• Reducers get key/value pairs and emit the key/value result(s)
Word count (diagram)
• Distribute the documents among K computers (for example, a document containing "To be or not to be ...")
• Map: for each document, return a set of (word, frequency) pairs: (to,1), (to,1), (be,1), (be,1), (or,1), ...
• The pairs are grouped by word: (to,1,1,...), (be,1,1), (come,1,1,1), ...
• Reduce: count the occurrences of each word: To: 180, Be: 251, Come: 123, ...
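The word count flow above can be sketched in plain Python on a single machine. This is only an illustration of the map, group and reduce steps, not Hadoop code; the function names `mapper`, `shuffle`, `reducer` and `word_count` are hypothetical.

```python
from collections import defaultdict

def mapper(document):
    """Map step: emit a (word, 1) pair for every word in the document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group values by key and sort the keys, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reducer(word, counts):
    """Reduce step: sum the counts collected for one word."""
    return word, sum(counts)

def word_count(documents):
    pairs = (pair for doc in documents for pair in mapper(doc))
    return dict(reducer(word, counts) for word, counts in shuffle(pairs).items())

if __name__ == "__main__":
    print(word_count(["To be or not to be"]))
```

On a cluster, each mapper would run on the node holding its input block, and the grouping step would move pairs across the network to the reducers.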
Example 2: inner join
From the book “MapReduce design patterns” by Donald Miner and Adam Shook
Mapper class: users records
Mapper class: comments records
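The code shown on these slides is not reproduced here. As a rough illustration of the idea only (not the book's implementation), a reduce-side inner join can be sketched in Python: each mapper tags its records with the source table and emits them under the join key, and the reducer pairs up records that share a key. The `userID` key follows the Pig join example later in the deck; the other field names and all function names are hypothetical.

```python
from collections import defaultdict

def user_mapper(user):
    """Emit the user record under its join key, tagged 'U'."""
    yield user["userID"], ("U", user)

def comment_mapper(comment):
    """Emit the comment record under its join key, tagged 'C'."""
    yield comment["userID"], ("C", comment)

def join_reducer(user_id, tagged_records):
    """Inner join: emit output only when both sides have records for the key."""
    users = [rec for tag, rec in tagged_records if tag == "U"]
    comments = [rec for tag, rec in tagged_records if tag == "C"]
    for user in users:
        for comment in comments:
            yield {**user, **comment}

def inner_join(users, comments):
    groups = defaultdict(list)  # stands in for the framework's group-by-key step
    for user in users:
        for key, value in user_mapper(user):
            groups[key].append(value)
    for comment in comments:
        for key, value in comment_mapper(comment):
            groups[key].append(value)
    joined = []
    for key in sorted(groups):
        joined.extend(join_reducer(key, groups[key]))
    return joined

if __name__ == "__main__":
    users = [{"userID": 1, "name": "ann"}, {"userID": 2, "name": "bob"}]
    comments = [{"userID": 1, "text": "hi"}, {"userID": 3, "text": "orphan"}]
    print(inner_join(users, comments))
```

Users without comments and comments without a matching user produce no output, which is what makes the join an inner join.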
Cool things about Hadoop
• No schema is imposed: decide what you want when loading
• Keep the full original data!
• Store anything: media, text, logs
• Transparent parallelism and network programming
• Fault tolerance
  • Blocks are replicated
  • Only active nodes get assigned to jobs
  • Map/Reduce can handle slow mapper jobs: a duplicate of a slow-running mapper is created automatically, and the results of whichever mapper finishes first are used
• Scalability
Check it out: http://developer.yahoo.com/hadoop/tutorial/module2.html
Hadoop ecosystem
• Higher-level languages like Pig and Hive
• Cascalog
Pig
• Pig is a SQL-like query language that is executed as MapReduce jobs
• It is higher-level than Map/Reduce: FOREACH, GROUP BY, JOIN, DISTINCT, FILTER, etc.
• Custom loaders and storage functions
• Reads both structured and unstructured data
• It is a data flow language
Why use Pig
• Easier to adopt for non-Java programmers
• Runs without a compilation step
• Faster to write (not necessarily faster to execute)
• Built-in functions: count, group by, joins, filter
• Built-in optimization of execution
• Can still use Map/Reduce from Pig (use the mapreduce keyword)
• Very good for quick analytics
Word count example
    A = load './input.txt';
    B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
    C = group B by word;
    D = foreach C generate COUNT(B), group;
    store D into './wordcount';
Join example
    A = JOIN comments BY userID, users BY userID;
Pig drawbacks
• Might be clumsy to write tests for (but usually you don’t need tests for one-off analytics)
  • But cool for development: use Hawk!
• You can’t do everything (for example, ifs)
• Pig is not good for
  • Advanced string manipulations (can use UDFs)
  • Complex joins
  • Math
  • Complex aggregates
  • Iterative algorithms
  • But the majority can be addressed with UDFs
• Hard to reuse code (macros have limited functionality)
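Pig UDF
    REGISTER mylibrary.jar;                                      -- make the jar with the UDF available
    DEFINE ToUpperCase com.mine.pig.udf.ToUpperCase();           -- give the UDF class a short alias
    A = LOAD 'words_data' AS (word: chararray, position: int);
    B = FOREACH A GENERATE ToUpperCase(word);                    -- apply the UDF to every record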
Cascalog
• Cascalog is a compiler that produces sequences of Map/Reduce programs
• Clojure-based (Clojure is a functional programming language)
• Compiles to Java bytecode, so it can directly access all your Java-based code
• Granular testing and mocking
• Runs directly on Hadoop and EMR
• Wide variety of built-in functionality
  • Inner and outer joins
  • Aggregators
  • Functions
  • Subqueries
  • Sorting
• High performance
Check it out:
https://github.com/nathanmarz/cascalog/wiki
http://www.slideshare.net/nathanmarz/cascalog
http://www.slideshare.net/nathanmarz/cascalog-at-strange-loop
Examples
• Example 1: Clojure
  • (+ 1 2 3)
  • (* 3 5)
Check it out: http://www.slideshare.net/nathanmarz/cascalog-at-strange-loop
More examples
• Inner join
    user=> (?<- (stdout) [?person ?age ?gender] (age ?person ?age) (gender ?person ?gender))
• Full outer join
    user=> (?<- (stdout) [?person !!age !!gender] (age ?person !!age) (gender ?person !!gender))
• Count of followers
    user=> (?<- (stdout) [?person ?count] (person ?person) (follows ?person !!follower) (c/!count !!follower :> ?count))
• The numbers that equal their squares
    user=> (?<- (stdout) [?n] (integer ?n) (* ?n ?n :> ?n))
  Cascalog detects that we are trying to rebind the ?n variable and automatically filters out tuples where the output of the * predicate is not equal to the input.
Check it out: http://nathanmarz.com/blog/new-cascalog-features-outer-joins-combiners-sorting-and-more.html
What’s hot in the Big Data arena in New York
• Etsy
• Foursquare
• Spotify
• Knewton
• IntentMedia
Etsy’s Skyline
• Etsy: the world’s largest marketplace for handmade and vintage goods
• Practices continuous deployment (30-60 deploys per day)
• Optimized for recovering from failure rather than avoiding it
• A large number of metrics (250K) are emitted and routed to Skyline, the failure detection software
• Near real time: approx. 70 seconds of lag
• Runs on
  • A 150-node Hadoop cluster
Check it out: http://g33ktalk.com/etsy-a-deep-dive-into-monitoring-with-skyline/
Skyline: continued
• Anomalies are detected through a consensus model
• A metric is anomalous if its latest value is more than 3 s.d. above its moving average (statistical process control)
• By histogram
• By linear regression (distribution of residuals)
• Exponentially weighted moving averages (time series with a decay factor)
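The consensus idea above can be sketched in Python with two of the listed detectors voting on the latest value of a metric. This is not Skyline's code: the window size, the decay factor `alpha`, the number of votes required, and all function names are assumptions chosen for illustration.

```python
from statistics import mean, stdev

def three_sigma(series, window=10):
    """Flag the latest value if it is more than 3 s.d. above the moving average."""
    history, latest = series[-window - 1:-1], series[-1]
    return latest > mean(history) + 3 * stdev(history)

def ewma_outlier(series, alpha=0.3):
    """Same test, using an exponentially weighted mean and variance (decay factor alpha)."""
    history, latest = series[:-1], series[-1]
    avg, var = history[0], 0.0
    for x in history[1:]:
        diff = x - avg
        avg += alpha * diff
        var = (1 - alpha) * (var + alpha * diff * diff)
    return latest > avg + 3 * var ** 0.5

def is_anomalous(series, detectors=(three_sigma, ewma_outlier), consensus=2):
    """A metric is anomalous only if at least `consensus` detectors agree."""
    votes = sum(1 for detect in detectors if detect(series))
    return votes >= consensus

if __name__ == "__main__":
    steady = [10, 11, 10, 9, 10, 11, 10, 9, 10, 11, 10]
    print(is_anomalous(steady), is_anomalous(steady[:-1] + [50]))
```

Requiring agreement between detectors trades some sensitivity for fewer false alarms, which matters when hundreds of thousands of metrics are checked.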
Skyline: continued
• Problems
  • Seasonality
  • Spike influence (a spike raises the moving average)
  • Normality
  • Parameters
• As of now, it generates too much noise
Spotify
• Swedish company that lets users search for songs and play them on demand
• 20m tracks, with 20K more added per day
• Runs on
  • A 700-node Hadoop cluster
• Trying to
  • Recommend music to users
  • Provide intelligent search functionality
• Recommendations
  • Precomputed overnight
  • Collaborative-filtering type
  • Use signals such as the time the user started streaming a track, when she stopped, and IP address location; there is no rating info (the number of streams can be used instead)
  • Build vectors (fingerprints) of users and tracks
  • Use cosine similarity to find the top-scoring recommendations
  • Algorithms: matrix factorization, probabilistic latent semantic filtering, k-nearest neighbors to narrow down the potential candidates for recommendations
  • Problems: new users and new tracks
Check it out: http://vimeo.com/71889190
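The vector and cosine steps above can be sketched in Python. This shows only the scoring step under assumed inputs: the two-dimensional vectors, the track names, and the functions `cosine` and `recommend` are hypothetical, and in practice the vectors would come from a method such as matrix factorization.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend(user_vec, track_vecs, top_n=2):
    """Return the names of the top_n tracks whose vectors point closest to the user's."""
    ranked = sorted(track_vecs.items(),
                    key=lambda item: cosine(user_vec, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_n]]

if __name__ == "__main__":
    tracks = {"track_a": [2.0, 0.0], "track_b": [0.0, 1.0], "track_c": [1.0, 1.0]}
    print(recommend([1.0, 0.0], tracks))
```

Cosine similarity ignores vector length, so a heavy listener and a light listener with the same taste profile score tracks the same way.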
Foursquare
• Mobile app that lets users explore the city and connect with friends
• Utilizes location data
• Based on people checking in at restaurants, events, etc.
  • 30m people
  • 50m places
  • 3.5b check-ins
  • 5m check-ins per day
• Uses big data for
  • Place recommendation
  • How to influence users to go to some place
  • Place matching (where the user is checking in from)
• Algorithms: ensemble of simple models, Naïve Bayes, linear models, random forests, Gaussian mixture, combined with personal history and friends’ history
Check it out: http://vimeo.com/71889190
Foursquare
• Spatial models: they build Gaussian mixture models, e.g. what is the probability of being at this place given the info received from the phone
• Sentiment detection based on users’ reviews (Naïve Bayes)
• Collaborative filtering, Amazon style: people who like this also like that
• Real-time place recommendations based on
  • Location
  • Time of day
  • Personal check-in history
  • Friends’ preferences
  • Venue similarities
  • Aggregate historical data
  • Familiarity
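The spatial-model bullet above can be illustrated with a toy Python sketch: each venue gets a small Gaussian mixture over reported phone coordinates, and Bayes' rule turns the mixture likelihoods into a probability per venue. This is not Foursquare's model; the isotropic components, the priors, the coordinate units, and all names are assumptions made for illustration.

```python
import math

def gaussian_2d(point, center, sigma):
    """Density of an isotropic 2-D Gaussian with the given center and standard deviation."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    return math.exp(-(dx * dx + dy * dy) / (2 * sigma * sigma)) / (2 * math.pi * sigma * sigma)

def mixture_likelihood(point, components):
    """Weighted sum of component densities; components are (weight, center, sigma) triples."""
    return sum(w * gaussian_2d(point, center, sigma) for w, center, sigma in components)

def venue_posterior(point, venues):
    """P(venue | point) via Bayes' rule; venues maps name -> (prior, components)."""
    scores = {name: prior * mixture_likelihood(point, components)
              for name, (prior, components) in venues.items()}
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()} if total else scores

if __name__ == "__main__":
    venues = {
        "cafe": (0.5, [(1.0, (0.0, 0.0), 1.0)]),
        "bar": (0.5, [(0.7, (5.0, 5.0), 1.0), (0.3, (6.0, 5.0), 2.0)]),
    }
    print(venue_posterior((0.2, 0.1), venues))
```

The prior is where signals such as personal history or venue popularity could enter, while the mixture captures how noisy phone locations scatter around a venue.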
Knewton
• Adaptive learning platform
• Real-time recommendations tailored to a student
• Trying to determine what the student should work on next and how to learn it (depending on the learning style: visual, geometric approach, etc.)
• Their big clients: Arizona State University and University of Alabama
• Models engagement, boredom, frustration, proficiency, the extent to which a student knows or doesn’t know a particular topic
• Algorithms: Item Response Theory model (estimates the probability that a student is able to do something based on an answer to a particular question)
• Signals: click stream history (Did they check the review page? Did they check the hint? How long did it take them to answer? Did they change their mind when answering a question?)
• Runs on Amazon Web Services
Check it out: http://www.knewton.com/
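As an illustration of the Item Response Theory bullet above, the following Python sketch uses the standard logistic IRT form, in which the probability of a correct answer depends on the gap between student ability and question difficulty. The slides do not say which IRT variant Knewton uses, so the logistic form, the learning rate, and the function names are assumptions.

```python
import math

def p_correct(ability, difficulty, discrimination=1.0):
    """Logistic IRT: probability that a student answers a question correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

def update_ability(ability, difficulty, answered_correctly, lr=0.5, discrimination=1.0):
    """One gradient step on the log-likelihood of the observed answer.

    The estimate moves up after a correct answer and down after a wrong one,
    and moves more when the outcome was surprising.
    """
    outcome = 1.0 if answered_correctly else 0.0
    predicted = p_correct(ability, difficulty, discrimination)
    return ability + lr * discrimination * (outcome - predicted)

if __name__ == "__main__":
    ability = 0.0
    for difficulty, correct in [(0.0, True), (0.5, True), (1.0, False)]:
        ability = update_ability(ability, difficulty, correct)
    print(round(ability, 3), round(p_correct(ability, 1.0), 3))
```

A recommender could then pick the next question whose predicted probability of success is neither too high nor too low for the current ability estimate.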
IntentMedia
• End-to-end solution for e-commerce sites seeking to monetize their website traffic through advertising while still protecting conversions
• Online travel agencies convert perhaps 3% to 5% of site visitors
• IntentMedia can help sites monetize the rest of the visitors
• Combines consumer-intent data with Intent Media predictive analysis to serve competitors’ ads to consumers who are deemed unlikely to convert on the initial publisher’s site
• Runs on: Amazon Web Services; uses Pig, Cascalog, Hadoop
• Largest job: 25m records, 440 feature signals
Check it out: http://intentmedia.com
Q&A?
• Thanks!