
Big Data: What (you can get out of it), How (to approach it) and Where (this trend can be applied)


Presentation Transcript


  1. Big Data: What (you can get out of it), How (to approach it) and Where (this trend can be applied) • natalia_ponomareva (at) yahoo.com

  2. Hadoop • Hadoop is an open-source distributed platform for data storage and computation that runs on commodity hardware (adapted from the slides of Donald Miner)

  3. HDFS • Works on top of a native file system (for example ext3 or xfs) • Data is organized into files and directories • Files are divided into blocks (64-128 MB) • Blocks are distributed across cluster nodes • Files are write-once • The location of blocks can be used to optimize Map/Reduce execution • Blocks are replicated for fault tolerance • Data integrity is ensured via checksums • HDFS is not good for random reads • HDFS is optimized for streaming reads of files • HDFS is based on the design of the Google File System
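To make the block, replication and checksum bullets concrete, here is a minimal in-memory sketch in Python. It is not the HDFS API: `split_into_blocks`, `place_replicas` and the round-robin placement are hypothetical names and choices made for illustration, the 16-byte block size is deliberately tiny, and the replication factor of 3 is an assumption (it matches the usual Hadoop default).

```python
import zlib


def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut a file's bytes into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list, replication: int = 3) -> dict:
    """Assign each block to `replication` distinct nodes.

    Naive round-robin placement: an assumption for this sketch, not the
    rack-aware policy a real cluster would use.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def checksum(block: bytes) -> int:
    """CRC32 of a block, stored at write time and re-checked at read time."""
    return zlib.crc32(block)


data = b"to be or not to be, that is the question"
blocks = split_into_blocks(data, block_size=16)
checksums = [checksum(b) for b in blocks]
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])

# A reader recomputes the checksum and compares it with the stored one.
corrupted = b"x" + blocks[0][1:]
is_intact = checksum(corrupted) == checksums[0]
```

Because every block lives on several nodes, a reader that detects a checksum mismatch can fall back to another replica of the same block.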

  4. Map/Reduce Paradigm • Jobs are described in terms of Mappers and Reducers • Mappers receive input records and emit key/value pairs • Pairs from mappers are automatically grouped by key and sorted for each reducer • Reducers receive each key with its grouped values and emit key/value results
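The paradigm can be sketched as a single-process driver in Python. This is only an illustration of the map, group/sort and reduce steps, not Hadoop's API: `run_mapreduce`, the sample records and the mapper/reducer below are all hypothetical.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run a mapper and a reducer over in-memory records.

    mapper(record) yields (key, value) pairs;
    reducer(key, values) yields (key, result) pairs.
    """
    # Map phase: every record may emit any number of key/value pairs.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # grouping by key

    # Reduce phase: keys are visited in sorted order.
    results = []
    for key in sorted(groups):
        results.extend(reducer(key, groups[key]))
    return results


# Hypothetical job: highest score per player.
def score_mapper(record):
    name, score = record
    yield name, score


def max_reducer(name, scores):
    yield name, max(scores)


best = run_mapreduce([("alice", 3), ("bob", 7), ("alice", 5)],
                     score_mapper, max_reducer)
```

On a real cluster the grouping step runs across machines; here a dictionary stands in for it.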

  5. Example 1: word count

  6. Mapper class

  7. Reducer class

  8. Word count data flow • Distribute the documents among K computers (e.g. a document containing "To be or not to be") • Map: for each document, return a set of (word, frequency) pairs, e.g. (to,1), (to,1), (be,1), (be,1), (or,1), … • The pairs are grouped by word, e.g. (to,1,1,…), (be,1,1), (come,1,1,1), … • Reduce: count the occurrences of each word, e.g. To: 180, Be: 251, Come: 123, …
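The mapper and reducer code from the slides did not survive in this transcript. As a stand-in, here is a minimal Python sketch of the same word-count flow. It is not the Java classes from the talk: `map_words`, `reduce_counts`, `word_count` and the two sample documents are hypothetical, and tokenizing on lowercase letters and apostrophes is an assumption.

```python
import re
from collections import defaultdict


def map_words(document):
    """Mapper: emit (word, 1) for every word in one document."""
    for word in re.findall(r"[a-z']+", document.lower()):
        yield word, 1


def reduce_counts(word, ones):
    """Reducer: sum the 1s collected for a single word."""
    return word, sum(ones)


def word_count(documents):
    # Group the mapper output by word (the step Hadoop does automatically).
    grouped = defaultdict(list)
    for document in documents:
        for word, one in map_words(document):
            grouped[word].append(one)
    # Reduce each group to a single (word, total) pair.
    return dict(reduce_counts(word, ones) for word, ones in sorted(grouped.items()))


counts = word_count(["To be or not to be", "To come or not to come"])
```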

  9. Example 2: inner join, from the book “MapReduce Design Patterns” by Donald Miner and Adam Shook

  10. Mapper class: user records (from “MapReduce Design Patterns” by Donald Miner and Adam Shook)

  11. Mapper class: comment records (from “MapReduce Design Patterns” by Donald Miner and Adam Shook)

  12. Reducer: The actual join logic
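The join code itself is not in this transcript. The following Python sketch shows one common way to do an inner join on the reduce side: each mapper tags its records with their source, and the reducer pairs up the two sides for every key. It is written in the spirit of the example, not copied from the book; the tags "U" and "C", the field names and the sample records are assumptions.

```python
from collections import defaultdict


def map_user(user):
    """Mapper for user records: key by user id, tag with 'U'."""
    yield user["user_id"], ("U", user)


def map_comment(comment):
    """Mapper for comment records: key by user id, tag with 'C'."""
    yield comment["user_id"], ("C", comment)


def reduce_inner_join(user_id, tagged_records):
    """Reducer: split records by tag, then emit every user/comment pair.

    If either side is empty the loops emit nothing, which is exactly
    inner-join behaviour.
    """
    users = [rec for tag, rec in tagged_records if tag == "U"]
    comments = [rec for tag, rec in tagged_records if tag == "C"]
    for user in users:
        for comment in comments:
            yield {"user_id": user_id, "name": user["name"], "text": comment["text"]}


def inner_join(users, comments):
    grouped = defaultdict(list)
    for user in users:
        for key, value in map_user(user):
            grouped[key].append(value)
    for comment in comments:
        for key, value in map_comment(comment):
            grouped[key].append(value)
    joined = []
    for key in sorted(grouped):
        joined.extend(reduce_inner_join(key, grouped[key]))
    return joined


users = [{"user_id": 1, "name": "ann"}, {"user_id": 2, "name": "bo"}]
comments = [
    {"user_id": 1, "text": "hi"},
    {"user_id": 1, "text": "bye"},
    {"user_id": 3, "text": "orphan"},
]
joined = inner_join(users, comments)
```

User 2 has no comments and the comment from user 3 has no matching user, so neither appears in the result.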

  13. Cool things about Hadoop • No schema imposed: decide what you want when loading • Keep the full original data! • Store anything: media, text, logs • Transparent parallelism and network programming • Fault tolerance: blocks are replicated; only active nodes get assigned to jobs; MapReduce handles slow mapper tasks by automatically launching a duplicate of a slow-running mapper and using the result of whichever copy finishes first • Scalability Check it out: http://developer.yahoo.com/hadoop/tutorial/module2.html

  14. Hadoop ecosystem • Higher-level languages like Pig and Hive • Cascalog

  15. Pig • Pig is a SQL-like query language that executes as MapReduce jobs • It is higher-level than Map/Reduce: FOREACH, GROUP BY, JOIN, DISTINCT, FILTER, etc. • Custom loaders and storage functions • Reads both structured and unstructured data • It is a data-flow language

  16. Why use Pig • Easier for non-Java programmers to adopt • Runs without a compilation step • Faster to write (not necessarily faster to execute) • Built-in functions: count, group by, joins, filter • Built-in execution optimizations • Can still use Map/Reduce from Pig (use the mapreduce keyword) • Very good for quick analytics
      Word count example:
        A = load './input.txt';
        B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
        C = group B by word;
        D = foreach C generate COUNT(B), group;
        store D into './wordcount';
      Join example:
        A = JOIN comments BY userID, users BY userID;

  17. Pig drawbacks • Might be clumsy to write tests for (but usually you don’t need tests for one-off analytics) • But cool for development: use Hawk! • You can’t do everything (for example, ifs) • Pig is not good for: advanced string manipulations (can use UDFs), complex joins, math, complex aggregates, iterative algorithms • But the majority of these can be addressed with UDFs • Hard to reuse code (macros have limited functionality)

  18. Pig UDF
        REGISTER mylibrary.jar;
        DEFINE ToUpperCase com.mine.pig.udf.ToUpperCase();
        A = LOAD 'words_data' AS (word: chararray, position: int);
        B = FOREACH A GENERATE ToUpperCase(word);

  19. Cascalog • Cascalog is a compiler that produces sequences of MapReduce programs • Clojure-based (Clojure is a functional programming language) • Compiles to Java bytecode, so it can directly access all your Java-based code • Granular testing and mocking • Runs directly on Hadoop and EMR • Wide variety of built-in functionality: inner and outer joins, aggregators, functions, subqueries, sorting • High performance Check it out: https://github.com/nathanmarz/cascalog/wiki http://www.slideshare.net/nathanmarz/cascalog http://www.slideshare.net/nathanmarz/cascalog-at-strange-loop

  20. Examples • Example 1: Clojure (prefix notation) • (+ 1 2 3) • (* 3 5) Check it out: http://www.slideshare.net/nathanmarz/cascalog-at-strange-loop

  21. More examples
      • Inner join
          user=> (?<- (stdout) [?person ?age ?gender] (age ?person ?age) (gender ?person ?gender))
      • Full outer join
          user=> (?<- (stdout) [?person !!age !!gender] (age ?person !!age) (gender ?person !!gender))
      • Count of followers
          user=> (?<- (stdout) [?person ?count] (person ?person) (follows ?person !!follower) (c/!count !!follower :> ?count))
      • The numbers that equal their squares
          user=> (?<- (stdout) [?n] (integer ?n) (* ?n ?n :> ?n))
        Cascalog detects that we are trying to rebind the ?n variable and automatically filters out tuples where the output of the * predicate is not equal to the input.
      Check it out: http://nathanmarz.com/blog/new-cascalog-features-outer-joins-combiners-sorting-and-more.html
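For readers who do not know Clojure, the intent of the four queries can be mimicked on small in-memory data in Python. This only illustrates the query semantics, not how Cascalog executes them; the sample relations are made up, and using dictionaries for `age` and `gender` assumes one value per person.

```python
# Hypothetical sample relations.
age = {"alice": 28, "bob": 33}
gender = {"alice": "f", "chris": "m"}
persons = ["alice", "bob", "chris"]
follows = [("alice", "bob"), ("alice", "chris"), ("bob", "alice")]  # (person, follower)
integers = [-1, 0, 1, 2, 3]

# Inner join: only people present in both relations.
inner = sorted((p, age[p], gender[p]) for p in age.keys() & gender.keys())

# Full outer join: everyone, with None where a side is missing
# (the role played by the !! variables in the query).
outer = sorted((p, age.get(p), gender.get(p)) for p in age.keys() | gender.keys())

# Count of followers: people without followers still appear, with a count of 0.
follower_count = {
    p: sum(1 for person, _follower in follows if person == p) for p in persons
}

# Numbers that equal their squares: keep n only when n * n gives n back.
fixed_points = [n for n in integers if n * n == n]
```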

  22. What’s hot in the Big Data arena in New York • Etsy • Foursquare • Spotify • Knewton • IntentMedia

  23. Etsy’s Skyline • Etsy: the world’s largest marketplace for handmade and vintage goods • Practices continuous deployment (30-60 deploys per day) • Optimized for recovering from failure rather than avoiding it • About 250K metrics are emitted and routed to Skyline, the failure-detection software • Near real time: approx. 70 seconds lag • Runs on a 150-node Hadoop cluster Check it out: http://g33ktalk.com/etsy-a-deep-dive-into-monitoring-with-skyline/

  24. Skyline: continued • Anomalies are detected through a consensus model • A metric is anomalous if its latest value is more than 3 standard deviations above its moving average (statistical process control) • By histogram • By linear regression (distribution of residuals) • By exponentially weighted moving averages (time series with a decay factor)
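A consensus of simple detectors can be sketched in a few lines of Python. This is not Skyline's code: the two tests below (a three-sigma rule and an exponentially weighted moving average check), the decay factor `alpha=0.3`, the use of the whole history as the averaging window and the "at least 2 votes" rule are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def three_sigma(series, k=3.0):
    """Flag the latest value if it is more than k std devs above the mean of the history."""
    history, latest = series[:-1], series[-1]
    sd = pstdev(history)
    return sd > 0 and latest > mean(history) + k * sd


def ewma_test(series, alpha=0.3, k=3.0):
    """Flag the latest value if it is far from the exponentially weighted average."""
    history, latest = series[:-1], series[-1]
    average = history[0]
    for value in history[1:]:
        # Recent points weigh more; older ones decay by (1 - alpha) each step.
        average = alpha * value + (1 - alpha) * average
    sd = pstdev(history)
    return sd > 0 and abs(latest - average) > k * sd


def is_anomalous(series, tests=(three_sigma, ewma_test), consensus=2):
    """A metric is anomalous only if at least `consensus` tests agree."""
    votes = sum(1 for test in tests if test(series))
    return votes >= consensus


steady = [10, 11, 9, 10, 12, 10, 11, 10]
spiky = [10, 11, 9, 10, 12, 10, 11, 50]
```

Requiring agreement between tests is one way to cut down on false alarms from any single detector.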

  25. Skyline: continued • Problems: seasonality; spike influence (a spike raises the moving average); normality; parameters • As of now, it generates too much noise

  26. Spotify • Swedish company that lets users search for songs and play them on demand • 20m tracks; 20K more are added per day • Runs on a 700-node Hadoop cluster • Trying to recommend music to users and provide intelligent search functionality • Recommendations are precomputed overnight and are of the collaborative-filtering type • Signals used: the time the user started streaming a track, when she stopped, IP-address location; there is no rating info (the number of streams can be used instead) • Build vectors (fingerprints) of users and tracks • Use cosine similarity to find the top-scoring recommendations • Algos: matrix factorization, probabilistic latent semantic filtering, k-nearest neighbors to narrow down the potential candidates for recommendations • Problems: new users and new tracks Check it out: http://vimeo.com/71889190
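Scoring tracks against a user vector with cosine similarity can be sketched as follows. The two-dimensional vectors and track names are made up for illustration; how the real vectors are built is not covered here.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend(user_vector, track_vectors, top_n=2):
    """Return the names of the top_n tracks most similar to the user vector."""
    ranked = sorted(
        track_vectors.items(),
        key=lambda item: cosine(user_vector, item[1]),
        reverse=True,
    )
    return [name for name, _vector in ranked[:top_n]]


# Hypothetical fingerprints.
user = [1.0, 0.0]
tracks = {"track_a": [2.0, 0.0], "track_b": [0.0, 3.0], "track_c": [1.0, 1.0]}
top = recommend(user, tracks)
```

Cosine similarity ignores vector length, so a heavily streamed track does not win just because its vector is larger.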

  27. Foursquare • Mobile app that lets users explore the city and connect with friends • Utilizes location data • Based on people checking in at restaurants, events, etc. • 30m people • 50m places • 3.5b check-ins • 5m check-ins per day • Use big data for: place recommendation; how to influence users to go to some place; place matching (where the user is checking in from) • Algos: ensemble of simple models (Naïve Bayes, linear models, random forests, Gaussian mixture) combined with personal history and friends’ history Check it out: http://vimeo.com/71889190

  28. Foursquare: continued • Spatial models: they build Gaussian mixture models, e.g. what is the probability of being at this place given the information received from the phone • Sentiment detection based on users’ reviews (Naïve Bayes) • Collaborative filtering, Amazon style: people who like this also like that • Real-time place recommendations based on: location, time of day, personal check-in history, friends’ preferences, venue similarities, aggregate historical data, familiarity
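The spatial-model bullet can be illustrated with a much-simplified Python sketch: one isotropic Gaussian per venue, combined with a prior via Bayes' rule. This is an assumption-heavy toy, not Foursquare's model; the venues, their coordinates (local metres), spreads and priors are all made up.

```python
import math


def gaussian_2d(point, mean, sigma):
    """Density of an isotropic 2-D Gaussian centred on `mean` with spread `sigma`."""
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    return math.exp(-(dx * dx + dy * dy) / (2 * sigma * sigma)) / (2 * math.pi * sigma * sigma)


def venue_posterior(point, venues):
    """P(venue | reported location), proportional to prior * likelihood."""
    weights = {
        name: venue["prior"] * gaussian_2d(point, venue["mean"], venue["sigma"])
        for name, venue in venues.items()
    }
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical venues, positions in metres on a local grid.
venues = {
    "cafe": {"mean": (0.0, 0.0), "sigma": 20.0, "prior": 0.5},
    "bar": {"mean": (100.0, 0.0), "sigma": 20.0, "prior": 0.5},
}
posterior = venue_posterior((10.0, 0.0), venues)
```

A phone reporting a position 10 m from the cafe and 90 m from the bar is attributed to the cafe with high probability.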

  29. Knewton • Adaptive learning platform • Real-time recommendations tailored to a student • Trying to determine what the student should work on next and how to learn it (depending on the learning style: visual, geometric approach, etc.) • Their big clients: Arizona State University and University of Alabama • They model engagement, boredom, frustration, proficiency, and the extent to which a student knows or doesn’t know a particular topic • Algos: Item Response Theory model (estimates the probability that a student is able to do something based on an answer to a particular question) • Signals: click-stream history (Did they check the review page? Did they check the hint? How long did it take them to answer? Did they change their mind when answering a question?) • Runs on Amazon Web Services Check it out: http://www.knewton.com/
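As a rough illustration of Item Response Theory, the sketch below uses the standard two-parameter logistic form, where the probability of a correct answer depends on the gap between student ability and question difficulty. Whether Knewton uses this exact variant is not stated in the talk; the grid-search estimator and the sample responses are assumptions for illustration.

```python
import math


def p_correct(ability, difficulty, discrimination=1.0):
    """Two-parameter logistic IRT: probability that the student answers correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


def log_likelihood(ability, responses):
    """Log-likelihood of observed (difficulty, was_correct) responses at a given ability."""
    total = 0.0
    for difficulty, was_correct in responses:
        p = p_correct(ability, difficulty)
        total += math.log(p if was_correct else 1.0 - p)
    return total


def estimate_ability(responses):
    """Pick the ability on a coarse grid (-4.0 to 4.0, step 0.1) that best explains the answers."""
    grid = [i / 10 for i in range(-40, 41)]
    return max(grid, key=lambda ability: log_likelihood(ability, responses))


# Hypothetical student: gets the easier questions right and the harder ones wrong.
responses = [(-1.0, True), (0.0, True), (1.0, False), (2.0, False)]
ability = estimate_ability(responses)
```

With an ability estimate in hand, `p_correct` can rank candidate questions by how likely the student is to answer them.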

  30. IntentMedia • End-to-end solution for e-commerce sites seeking to monetize their website traffic through advertising while still protecting conversions • Online travel agencies convert perhaps 3% to 5% of site visitors • IntentMedia can help sites monetize the rest of the visitors • Combines consumer-intent data with Intent Media predictive analysis to serve competitors’ ads to consumers who are deemed unlikely to convert on the initial publisher’s site • Runs on Amazon Web Services; uses Pig, Cascalog, Hadoop • Largest job: 25m records, 440 feature signals Check it out: http://intentmedia.com

  31. Q&A? • Thanks!
