Big Data: What (you can get out of it), How (to approach it) and Where (this trend can be applied)

natalia_ponomareva (at) yahoo.com

Hadoop
• Hadoop is an open-source distributed platform for data storage and computation that runs on commodity hardware
Adapted from the slides of Donald Miner

HDFS
• Works on top of a native file system (for example ext3 or xfs)
• Data is organized into files and directories
• Files are divided into blocks (64-128 MB)
• Files are distributed across cluster nodes
• Files are write-once
• The location of blocks can be used to optimize Map/Reduce execution
• Blocks are replicated for fault tolerance
• Data integrity is ensured via checksums
• HDFS is not good for random reads
• HDFS is optimized for streaming reads of files
• HDFS is based on the design of the Google File System
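
The block, replication, and checksum ideas above can be illustrated with a toy sketch. This is not HDFS code: the 8-byte block size, the replication factor of 3, the node names, and the round-robin placement are all assumptions made for the example (real HDFS uses much larger blocks and a rack-aware placement policy).

```python
import zlib

BLOCK_SIZE = 8     # bytes; a tiny stand-in for the 64-128 MB blocks on the slide
REPLICATION = 3    # assumed replication factor for this example


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list, replication: int = REPLICATION) -> dict:
    """Assign every block to `replication` distinct nodes, round-robin.

    Round-robin is a simplification chosen for this sketch only.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def verify(block: bytes, expected_checksum: int) -> bool:
    """Recompute the CRC32 checksum and compare it with the stored one."""
    return zlib.crc32(block) == expected_checksum


# Hypothetical file contents and node names.
data = b"To be or not to be, that is the question"
nodes = ["node1", "node2", "node3", "node4"]

blocks = split_into_blocks(data)
checksums = [zlib.crc32(block) for block in blocks]
placement = place_replicas(len(blocks), nodes)

# A reader can detect a corrupted replica and fall back to another copy.
corrupted = b"X" + blocks[0][1:]
print(verify(blocks[0], checksums[0]), verify(corrupted, checksums[0]))
```

Because each block lives on several nodes, losing one node leaves at least two other copies of every block in this toy layout.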

Map/Reduce Paradigm
• Jobs are described in terms of Mappers and Reducers
• Mappers receive input records and emit key/value pairs
• Pairs from mappers are automatically
  • Grouped by key
  • Sorted for each reducer
• Reducers receive each key with its grouped values and emit the key/value result(s)

Example 1: word count

Mapper class

Reducer class

Word count data flow
• Distribute the documents among K computers (for example, a document containing "To be or not to be")
• Map: for each document, return a set of (word, frequency) pairs: (to,1), (to,1), (be,1), (be,1), (or,1), …
• Group by key: (to,1,1,…), (be,1,1), (come,1,1,1), …
• Reduce: count the occurrences of each word: To: 180, Be: 251, Come: 123, …
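
The Mapper and Reducer code from the slides did not survive in this transcript. As a stand-in, here is a minimal single-process sketch of the same word count flow in Python. It is not the Hadoop Java API: the function names (`mapper`, `shuffle`, `reducer`) are made up for the example, and the grouping step that Hadoop performs automatically is simulated with a dictionary.

```python
from collections import defaultdict
from itertools import chain


def mapper(document):
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key and sort the keys, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reducer(word, counts):
    """Sum the occurrences collected for one word."""
    return word, sum(counts)


def word_count(documents):
    mapped = chain.from_iterable(mapper(doc) for doc in documents)
    return dict(reducer(word, counts) for word, counts in shuffle(mapped).items())


counts = word_count(["To be or not to be", "to come or to go"])
print(counts)
```

On a real cluster each mapper would run on the node that holds its input block, and each reducer would receive only a subset of the keys.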

Example 2: inner join (from “MapReduce Design Patterns” by Donald Miner and Adam Shook)

Mapper class: users records

Mapper class: comments records

Reducer: The actual join logic
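
The join code shown on these slides is not in the transcript. A reduce-side inner join can be sketched roughly as follows. This is an illustration in Python, not the book's Java code: each mapper tags its records with the data set they came from, and the reducer pairs users with comments that share a key. The field names (`id`, `name`, `userID`, `text`) and the "U"/"C" tags are assumptions made for the example.

```python
from collections import defaultdict


def user_mapper(user):
    """Key each user record by its id and tag it as coming from the users set."""
    yield user["id"], ("U", user)


def comment_mapper(comment):
    """Key each comment by the id of its author and tag it as a comment."""
    yield comment["userID"], ("C", comment)


def join_reducer(key, tagged_records):
    """Inner join: pair every user with every comment that shares the key."""
    users = [record for tag, record in tagged_records if tag == "U"]
    comments = [record for tag, record in tagged_records if tag == "C"]
    for user in users:
        for comment in comments:
            yield user["name"], comment["text"]


def run_join(users, comments):
    # Simulated shuffle: collect all tagged records under their join key.
    grouped = defaultdict(list)
    for user in users:
        for key, value in user_mapper(user):
            grouped[key].append(value)
    for comment in comments:
        for key, value in comment_mapper(comment):
            grouped[key].append(value)
    results = []
    for key in sorted(grouped):
        results.extend(join_reducer(key, grouped[key]))
    return results


# Hypothetical sample data.
users = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}, {"id": 3, "name": "cat"}]
comments = [
    {"userID": 1, "text": "first"},
    {"userID": 1, "text": "second"},
    {"userID": 2, "text": "hello"},
    {"userID": 4, "text": "orphan"},
]
joined = run_join(users, comments)
print(joined)
```

Keys that appear on only one side (user 3, comment author 4) produce no output, which is what makes this an inner join. Emitting the unmatched side with an empty partner instead would turn it into an outer join.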

Cool things about Hadoop
• No schema is imposed: decide what you want when loading
• Keep the full original data!
• Store anything: media, text, logs
• Transparent parallelism and network programming
• Fault tolerance
  • Blocks are replicated
  • Only active nodes get assigned to jobs
  • Map/Reduce handles slow mappers: a duplicate of a slow-running mapper is started automatically, and the result of whichever copy finishes first is used
• Scalability
Check it out: http://developer.yahoo.com/hadoop/tutorial/module2.html

Hadoop ecosystem
• Higher-level languages like Pig and Hive
• Cascalog

Pig
• Pig is a SQL-like query language whose scripts run as MapReduce jobs
• It is higher-level than Map/Reduce: FOREACH, GROUP BY, JOIN, DISTINCT, FILTER, etc.
• Custom loaders and storage functions
• Reads both structured and unstructured data
• It is a data flow language

Why use Pig
• Easier to adopt for non-Java programmers
• Scripts run without a compilation step
• Faster to write (not necessarily faster to execute)
Word count example
    A = load './input.txt';
    B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
    C = group B by word;
    D = foreach C generate COUNT(B), group;
    store D into './wordcount';
Join example
    A = JOIN comments BY userID, users BY userID;
• Built-in functions: count, group by, joins, filter
• Built-in optimization of execution
• Can still use Map/Reduce from Pig (use the MAPREDUCE keyword)
• Very good for quick analytics

Pig drawbacks
• Might be clumsy to write tests for (but usually you don’t need tests for one-off analytics)
• But cool for development: use Hawk!
• You can’t do everything (for example, ifs)
• Pig is not good for
  • Advanced string manipulations (can use UDFs)
  • Complex joins
  • Math
  • Complex aggregates
  • Iterative algorithms
  • But the majority can be addressed with UDFs
• Hard to reuse code (macros have limited functionality)

Pig UDF
    REGISTER mylibrary.jar;
    DEFINE ToUpperCase com.mine.pig.udf.ToUpperCase();
    A = LOAD 'words_data' AS (word: chararray, position: int);
    B = FOREACH A GENERATE ToUpperCase(word);

Cascalog
• Cascalog: a compiler that produces sequences of Map/Reduce programs
• Clojure-based (functional programming language)
• Compiles to Java bytecode, so it can directly access all your Java-based code
• Granular testing and mocking
• Runs directly on Hadoop and EMR
• Wide variety of built-in functionality
  • Inner and outer joins
  • Aggregators
  • Functions
  • Subqueries
  • Sorting
• High performance
Check it out:
https://github.com/nathanmarz/cascalog/wiki
http://www.slideshare.net/nathanmarz/cascalog
http://www.slideshare.net/nathanmarz/cascalog-at-strange-loop

Examples
• Example 1: Clojure (prefix notation)
  • (+ 1 2 3)
  • (* 3 5)
Check it out: http://www.slideshare.net/nathanmarz/cascalog-at-strange-loop

More examples
• Inner join
    user=> (?<- (stdout) [?person ?age ?gender] (age ?person ?age) (gender ?person ?gender))
• Full outer join
    user=> (?<- (stdout) [?person !!age !!gender] (age ?person !!age) (gender ?person !!gender))
• Count of followers
    user=> (?<- (stdout) [?person ?count] (person ?person) (follows ?person !!follower) (c/!count !!follower :> ?count))
• The numbers that equal their squares
    user=> (?<- (stdout) [?n] (integer ?n) (* ?n ?n :> ?n))
  Cascalog detects that we are trying to rebind the ?n variable and automatically filters out tuples where the output of the * predicate is not equal to the input.
Check it out: http://nathanmarz.com/blog/new-cascalog-features-outer-joins-combiners-sorting-and-more.html
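
For readers unfamiliar with Cascalog's variable prefixes, the semantics of three of these queries can be mimicked in plain Python. This is only an analogy under simplifying assumptions: the `age` and `gender` data are invented, each person has at most one row per relation (so dictionaries suffice), and `None` plays the role of the null that a `!!` variable may carry.

```python
# Hypothetical sample relations: person -> value.
age = {"alice": 28, "bob": 33, "chris": 40}
gender = {"alice": "f", "bob": "m", "emily": "f"}


def inner_join(left, right):
    """Keep only people present in both relations (all variables are ?-style)."""
    return [(p, left[p], right[p]) for p in sorted(left.keys() & right.keys())]


def full_outer_join(left, right):
    """Keep people present in either relation; a missing value becomes None (!!-style)."""
    return [(p, left.get(p), right.get(p)) for p in sorted(left.keys() | right.keys())]


def equal_to_own_square(numbers):
    """Rebinding ?n to the output of (* ?n ?n) acts as a filter n * n == n."""
    return [n for n in numbers if n * n == n]


print(inner_join(age, gender))
print(full_outer_join(age, gender))
print(equal_to_own_square(range(-3, 4)))
```

The design point the Cascalog queries make is that the join is implicit: using the same variable name (`?person`) in two predicates is what joins them, whereas the Python version has to spell out the key intersection or union.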

What’s hot in the Big Data arena in New York
• Etsy
• Foursquare
• Spotify
• Knewton
• IntentMedia

Etsy’s Skyline
• Etsy: the world’s largest marketplace for handmade and vintage goods
• Practices continuous deployment (30-60 deploys per day)
• Optimized for recovering from failure, rather than avoiding it
• About 250K metrics are emitted and routed to Skyline, the failure detection software
• Near real time: approx. 70 seconds of lag
• Runs on a 150-node Hadoop cluster
Check it out: http://g33ktalk.com/etsy-a-deep-dive-into-monitoring-with-skyline/

Skyline: continued
• Anomalies are detected through a consensus model
  • A metric is anomalous if its latest value is more than 3 standard deviations above its moving average (statistical process control)
  • By histogram
  • By linear regression (distribution of residuals)
  • Exponentially weighted moving averages (time series with a decay factor)
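
A consensus of simple detectors along these lines might look like the sketch below. It is not Skyline's actual code: only two of the listed tests are included (the histogram and linear-regression tests are omitted), and the window size, decay factor `alpha`, the choice to exclude the latest point from the history, and the consensus threshold are all assumptions made for the example.

```python
from statistics import mean, stdev


def three_sigma(series, window=10):
    """Latest value is more than 3 s.d. above the moving average of the preceding window."""
    history = series[-window - 1:-1]
    if len(history) < 2:
        return False
    return series[-1] > mean(history) + 3 * stdev(history)


def ewma(values, alpha=0.3):
    """Exponentially weighted moving average; larger alpha forgets old points faster."""
    average = values[0]
    for value in values[1:]:
        average = alpha * value + (1 - alpha) * average
    return average


def ewma_test(series, alpha=0.3):
    """Latest value is more than 3 s.d. above the EWMA of the earlier points."""
    history = series[:-1]
    if len(history) < 2:
        return False
    return series[-1] > ewma(history, alpha) + 3 * stdev(history)


def is_anomalous(series, tests=(three_sigma, ewma_test), consensus=2):
    """Flag the metric only if at least `consensus` tests agree."""
    votes = sum(1 for test in tests if test(series))
    return votes >= consensus


# Hypothetical metric values.
steady = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 10]
spike = steady[:-1] + [30]
print(is_anomalous(steady), is_anomalous(spike))
```

Requiring agreement between several tests is one way to cut down false alarms from any single detector, at the cost of possibly missing anomalies that only one test can see.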

Skyline: continued
• Problems
  • Seasonality
  • Spike influence (a spike raises the moving average)
  • Normality
  • Parameters
• As of now, generates too much noise

Spotify
• Swedish company that lets users search for songs and play them on demand
• 20m tracks; 20K more are added per day
• Runs on a 700-node Hadoop cluster
• Trying to
  • Recommend music to users
  • Provide intelligent search functionality
• Recommendations
  • Precomputed overnight
  • Collaborative-filtering type
  • Use signals like the time the user started streaming the track, when she stopped, and IP address location; there is no rating info (the number of streams can be used instead)
  • Build vectors (fingerprints) of users and tracks
  • Use cosine similarity to find the top-scoring recommendations
  • Algos: matrix factorization, probabilistic latent semantic filtering, k-nearest neighbors to narrow down the potential candidates for recommendations
  • Problems: new users and new tracks
Check it out: http://vimeo.com/71889190
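
The cosine scoring step can be sketched as follows. This is a toy illustration, not Spotify's pipeline: the three-dimensional vectors, the track names, and the `recommend` helper are invented for the example, and in practice the vectors would come from a model such as matrix factorization.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend(user_vector, track_vectors, top_n=2):
    """Return the names of the top_n tracks most similar to the user vector."""
    ranked = sorted(
        track_vectors.items(),
        key=lambda item: cosine(user_vector, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_n]]


# Hypothetical latent vectors for one user and three tracks.
user = [0.9, 0.1, 0.0]
tracks = {
    "track_a": [1.0, 0.0, 0.0],
    "track_b": [0.0, 1.0, 0.0],
    "track_c": [0.8, 0.2, 0.1],
}
print(recommend(user, tracks))
```

Cosine similarity ignores vector length, so a heavy listener and a light listener with the same taste profile score tracks the same way.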

Foursquare
• Mobile app that lets users explore the city and connect with friends
• Utilizes location data
• Based on people checking in at restaurants, events, etc.
  • 30m people
  • 50m places
  • 3.5b check-ins
  • 5m check-ins per day
• Use big data for
  • Place recommendation
  • How to influence users to go to some place
  • Place matching (where the user is checking in from)
• Algos: ensemble of simple models (Naïve Bayes, linear models, random forests, Gaussian mixtures) combined with personal history and friends’ history
Check it out: http://vimeo.com/71889190

Foursquare
• Spatial models: they compile Gaussian mixture models, e.g. what is the probability of being at this place given the info received from the phone
• Sentiment detection based on users’ reviews (Naïve Bayes)
• Collaborative filtering, Amazon style: people who like this also like that
• Real-time place recommendations based on
  • Location
  • Time of day
  • Personal check-in history
  • Friends’ preferences
  • Venue similarities
  • Aggregate historical data
  • Familiarity

Knewton
• Adaptive learning platform
• Real-time recommendations tailored to each student
• Tries to determine what the student should work on next and how to learn it (depending on the learning style: visual, geometric approach, etc.)
• Big clients: Arizona State University and University of Alabama
• Models engagement, boredom, frustration, and proficiency (the extent to which a student knows or doesn’t know a particular topic)
• Algos: Item Response Theory model (estimates the probability that a student is able to do something based on an answer to a particular question)
• Signals: click-stream history (Did they check the review page? Did they check the hint? How long did it take them to answer? Did they change their mind when answering a question?)
• Runs on Amazon Web Services
Check it out: http://www.knewton.com/
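
To make the Item Response Theory bullet concrete, here is a sketch of the common two-parameter logistic form. The slides do not say which IRT variant Knewton uses, so the formula choice, the parameter names, and the simple gradient-style ability update are all assumptions made for illustration.

```python
import math


def p_correct(ability, difficulty, discrimination=1.0):
    """Probability that a student with the given ability answers the item correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


def update_ability(ability, difficulty, correct, discrimination=1.0, learning_rate=0.5):
    """Nudge the ability estimate toward the observed answer.

    This is one gradient step on the log-likelihood of a single response;
    the learning rate is an arbitrary choice for this sketch.
    """
    observed = 1.0 if correct else 0.0
    predicted = p_correct(ability, difficulty, discrimination)
    return ability + learning_rate * discrimination * (observed - predicted)


# A hypothetical student starts at ability 0.0 and answers three items of rising difficulty.
ability = 0.0
for difficulty, correct in [(-1.0, True), (0.0, True), (1.0, False)]:
    ability = update_ability(ability, difficulty, correct)
print(round(p_correct(ability, 0.5), 3))
```

A surprising answer (a correct response on a hard item, or a miss on an easy one) moves the estimate more than an expected one, because the update is proportional to the gap between the observed and predicted outcome.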

IntentMedia
• End-to-end solution for e-commerce sites seeking to monetize their website traffic through advertising while still protecting conversions
• Online travel agencies convert perhaps 3% to 5% of site visitors
• IntentMedia can help sites monetize the rest of the visitors
• Combines consumer-intent data with Intent Media predictive analysis to serve competitors’ ads to consumers who are deemed unlikely to convert on the initial publisher’s site
• Runs on Amazon Web Services; uses Pig, Cascalog, Hadoop
• Largest job: 25m records, 440 feature signals
Check it out: http://intentmedia.com

Q&A? • Thanks!
