
Machine Learning

Machine Learning. Getting Started. Andrew Loree. Got a question?. Andrew Loree www.andyloree.com andy@andyloree.com @LowOnDiskSpace. Goals. Outcome: What is Machine Learning (ML)? Understand the ML process Base knowledge of types of ML & algorithms


Presentation Transcript


  1. Machine Learning Getting Started Andrew Loree

  2. Got a question? Andrew Loree www.andyloree.com andy@andyloree.com @LowOnDiskSpace

  3. Goals Outcome: • What is Machine Learning (ML)? • Understand the ML process • Base knowledge of types of ML & algorithms • Learning path for starting to use ML

  4. What is Machine Learning? • Using data to find patterns and based upon those patterns predict the future • When is a prediction a guess? When it is not based upon “sufficient” observation, experience or scientific reasoning • Example questions: • How long until a production server is out of disk space? • Is this email spam? • Customer retention, product recommendations, marketing campaigns, fraud detection, credit worthiness, …

  5. Is Machine Learning… • …just Statistics? • …just Calculus/Matrix Algebra/Optimization Mathematics? • …just Computer Science/Engineering? • …just applied “domain knowledge”? • Answer is all of the above and somewhere in between • Philosophical questions are best left to the philosophers

  6. Is Machine Learning… • …just Artificial Intelligence? • …just Deep Learning? • Timeline: Artificial Intelligence (~1950s) → Machine Learning (boom, ~1980s) → Deep Learning (boom, ~2010)

  7. Does the Machine really Learn? • Pattern recognition comes from learning and past experience, and we use it every day • Which of these charges on my credit card were fraudulent? • When is there enough data, and when do you have too much? Enter ML

  8. Types of Questions (about Data) • Descriptive - how many of X did I sell? • Associative - is there an association between temperature and sales? (hypothesis) • Comparative - how many X sold versus Y? • Predictive - using associations and comparisons to predict sales of X • Machine Learning can answer predictive questions

  9. Framing your questions the ML way In order of importance: 1. Are you asking the right question? - ML is not magic, desired outcomes must be definable - Days until full? Will customer leave? Fraudulent charge? 2. Do you think you have the right data? - Prediction cannot overcome lack of data - Data insight (domain knowledge) is critical to success 3. What result is good enough? - 50% accuracy, 70%, 99%? No false positives allowed? - Wait, what is accuracy?

  10. Machine Learning Ethics • Bias: Confirmation,… • Perspective: Recommendations lead to more purchases (seller) vs. recommendations lead to higher debt (buyer) • Moral dilemmas

  11. The Machine (Supervised) Learning Process • Training Data (contains patterns) • One (or more) ML algorithms learn the patterns • A model is generated, used to predict against new data

  12. The Machine Learning Process: Data • Can come from multiple sources: Big Data stores, flat files, DBMS,… • Rarely in the right format • Do you have the right “features”? * • Preprocessing almost always required – usually the hardest part

  13. The Machine Learning Process: Algorithm • Which algorithm is the “right” one?* • How do you compare one algorithm's results to another's?

  14. The Machine Learning Process: Model • In most cases, the entire process is repeated many times • How stable are our results? • Rinse and repeat the process • Model Management – consuming and operationalizing models is a separate, but very critical topic

  15. Machine Learning: Terminology Training Data/Set – Prepared (training) data ready to use to create a model Three main ML categories: • Supervised learning - The outcome category or value of interest is included in the training data • Unsupervised learning - Organizes data in a way that describes its structure (clustering) • Reinforcement learning - Makes a choice, measures how “good” that choice was, and modifies the strategy going forward

  16. Machine Learning Model Types Regression – supervised learning problems, fitting data to a line or curve How long until I run out of disk space?
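The disk-space question on this slide can be sketched as a straight-line fit. This is a minimal illustration only, assuming disk usage grows roughly linearly; the function names, the sample measurements, and the 500 GB capacity are all hypothetical and do not come from the presentation.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def days_until_full(days, used_gb, capacity_gb):
    """Extrapolate the fitted line to the day usage reaches capacity."""
    slope, intercept = fit_line(days, used_gb)
    if slope <= 0:
        return None  # usage is flat or shrinking, so no fill date is predicted
    full_day = (capacity_gb - intercept) / slope
    return full_day - days[-1]

# Hypothetical daily measurements: day index and used space in GB.
days = [0, 1, 2, 3, 4]
used_gb = [400.0, 410.0, 420.0, 430.0, 440.0]
remaining = days_until_full(days, used_gb, capacity_gb=500.0)
print(f"Predicted days until full: {remaining:.1f}")
```

Real usage data is rarely this clean; the fit is only as good as the assumption that the trend continues.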

  17. Machine Learning Model Types Classification – supervised learning problems, capturing data in two (or more) classes Is it spam or ham?

  18. Machine Learning Model Types Clustering – unsupervised learning problems, when we don’t know the defined classes Market research from surveys to generate market segments
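One way to picture clustering is k-means on a single numeric feature. The sketch below is an assumption-laden illustration: the slide does not name an algorithm, and the survey ages and starting centroids are made up.

```python
def k_means_1d(values, centroids, iterations=10):
    """Group 1-D values around centroids by repeated assign-then-average steps."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        # Assignment step: each value joins its nearest centroid.
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical survey respondents' ages; no class labels are given.
ages = [18, 21, 23, 45, 48, 52]
centroids, segments = k_means_1d(ages, centroids=[18.0, 52.0])
print(segments)
```

The algorithm is never told which segment a respondent belongs to; the two groups emerge from the data alone.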

  19. Machine Learning: Terminology Feature - individual measurable property – prepared data. A combination of features for an observation is commonly called a “feature vector” Target Value (or Class) – our desired outcome of prediction; with supervised learning, the value is in the training data

  20. Text (SMS) Spam Which of these five messages are spam? • SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info • This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate. • I HAVE A DATE ON SUNDAY WITH WILL!! • Fine if that's the way u feel. That's the way its gota b • U GOIN OUT 2NITE?

  21. Text Spam: Features What makes these two messages spam? SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate. What if you cannot use the message text itself? What are the “features” that are common to spam messages? • Length of the message? • Number of numeric strings? • Number of web links? • Number of currency symbols? • Number of punctuation marks? • Others?
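The candidate features listed on this slide could be computed along these lines. This is a sketch, not the presenter's actual preprocessing; the function name, the regular expressions, and the set of currency and punctuation characters are assumptions.

```python
import re

CURRENCY = "£$€"          # assumed set of currency symbols
PUNCTUATION = "!?.,;:"    # assumed set of punctuation marks

def extract_features(message):
    """Turn one SMS message into a small numeric feature vector."""
    return {
        "length": len(message),
        "numeric_strings": len(re.findall(r"\d+", message)),
        "web_links": len(re.findall(r"https?://|www\.", message, flags=re.IGNORECASE)),
        "currency_symbols": sum(message.count(c) for c in CURRENCY),
        "punctuation": sum(1 for ch in message if ch in PUNCTUATION),
    }

# Two message fragments taken from the slides.
spam = "U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1!"
ham = "U GOIN OUT 2NITE?"
spam_features = extract_features(spam)
ham_features = extract_features(ham)
print(spam_features)
print(ham_features)
```

Each message becomes a feature vector that an algorithm can use without ever seeing the text itself.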

  22. Supervised Learning Example: Text Spam • Collection of SMS messages for mobile phone spam research • Contains a “training set” of 5,574 messages, marked either SPAM or HAM • Given just a message, how can we determine if the message is spam or ham? • Who doesn’t have domain knowledge of “spam” and texting? References: UC Irvine ML Repository: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection Contributions to the Study of SMS Spam Filtering: New Collection and Results: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/

  23. Text Spam: Demo • Weka Explorer • Load training data set • Try a couple of algorithms using our features with cross-validation • Compare results • Azure ML Studio • Show same solution

  24. ML Algorithms • Way too many to list • Commonly used: • Decision Trees • Random Forest • Support Vector Machines (SVM) • k-Nearest Neighbor - KNN • Linear Regression • Logistic Regression

  25. ML Algorithms: Decision Trees • Supervised learning, classification • Weka implements the C4.5 algorithm under the name J48

  26. ML Algorithms: Random Forest • Supervised learning, classification • Multiple decision trees

  27. ML Algorithms: Support Vector Machines • Supervised learning, classification • Separation by “hyperplane”, Weka version named SGD

  28. ML Algorithms: k-nearest Neighbors • Supervised learning, classification or regression • k is the number of nearest neighbors, measured by distance, used to make the prediction • Choose an odd number to avoid ties • Called IBk in Weka
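A bare-bones version of k-nearest neighbors classification might look like this. It is a sketch for illustration, not Weka's IBk; the training points are hypothetical (message length, number of numeric strings) pairs, and a real pipeline would normally scale the features first so that one of them does not dominate the distance.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical feature vectors: (message length, number of numeric strings).
train = [
    ((150, 5), "spam"),
    ((140, 4), "spam"),
    ((160, 6), "spam"),
    ((30, 0), "ham"),
    ((25, 1), "ham"),
    ((40, 0), "ham"),
]
prediction = knn_predict(train, (135, 3), k=3)
print(prediction)
```

With two classes and an odd k, the vote can never end in a tie, which is the reason for the advice on the slide.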

  29. ML Algorithms: Linear Regression • Supervised learning, regression • Continuous values

  30. ML Algorithms: Logistic Regression • Supervised learning, discrete (binary) values – (yes or no, A or B) • S-curve to fit against data
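The S-curve mentioned here is the logistic (sigmoid) function. The sketch below shows how a fitted model would turn a feature vector into a yes/no answer; the weights and bias are hand-picked for illustration only, since a real model would learn them from the training data.

```python
from math import exp

def sigmoid(z):
    """The logistic S-curve: maps any real number into the range (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical, hand-picked parameters (a real model would learn these).
weights = [0.02, 0.8]  # weights for: message length, number of numeric strings
bias = -4.0

p_spam = predict_proba([150, 4], weights, bias)
is_spam = p_spam >= 0.5  # threshold the probability to get a binary answer
print(f"P(spam) = {p_spam:.2f}, spam? {is_spam}")
```

The 0.5 threshold is a common default, not a requirement; moving it trades false positives against false negatives.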

  31. ML Algorithms: Cheat Sheet https://docs.microsoft.com/en-us/azure/machine-learning/studio/algorithm-cheat-sheet

  32. ML Algorithms: Considerations • Not all algorithms are the same • Accuracy • Other practical measures: • Training time • Memory requirements • Scalability https://docs.microsoft.com/en-us/azure/machine-learning/studio/algorithm-choice

  33. ML Algorithms: Testing • Different ways to “slice and dice” your training set data • Entire set • Percentage of set • Cross-validation - divide the set into subsets – generally best option
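Cross-validation can be sketched as splitting the row indices of the training set into k folds, with each fold taking one turn as the test set. The helper name below is hypothetical, and the sketch skips the shuffling step that tools normally apply before splitting.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every row is used for testing exactly once and for training k - 1 times, which is why the averaged result is usually more trustworthy than a single percentage split.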

  34. ML Algorithms: Evaluating Results • Confusion Matrix • Accuracy – closeness to the true value (% correct overall) • Precision – more important for non-binary classifications • Lots of others, some specific to the problem type (recall, F-measure,…)
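The measures on this slide all derive from the four cells of a binary confusion matrix. A minimal sketch, using made-up labels and predictions for eight messages:

```python
def confusion_counts(actual, predicted, positive="spam"):
    """Return (true_pos, false_pos, false_neg, true_neg) for a binary problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F-measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}

# Hypothetical true labels and model predictions.
actual = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
predicted = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]
counts = confusion_counts(actual, predicted)
scores = metrics(*counts)
print(counts, scores)
```

Accuracy counts every correct answer; precision asks how many flagged messages were really spam, and recall asks how much of the real spam was caught.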

  35. ML Algorithms: Pitfalls • Underfitting – when close enough isn’t close enough

  36. ML Algorithms: Pitfalls • Overfitting – memorization

  37. ML Algorithms: Pitfalls • Data Leakage – do NOT use your prediction value as input to the model • Sampling Bias – poor choices for training set data, e.g. predicting item sales for an entire store chain from a single store’s data • Predict Random Outcomes – fair and unfair coin flips, dependent outcomes

  38. ML Algorithms: Text Processing • Think of “search” on top of machine learning • All of the common problems applied to classic linguistics challenge machine learning to an extent: • Tokenization – word breaking • Stemming (and lemmatization) – walk, walking, walked, walks → walk • Domain specific dictionaries – company jargon, acronyms, emojis,… • Language used – not everyone writes the Queen’s English • Semantic search – understand “meaning” – may be a better option to generate processing features
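Tokenization and stemming can be illustrated with a deliberately naive sketch. The suffix-stripping rule below is an assumption made for this example and is far cruder than a real stemmer such as Porter's; it only shows the idea of collapsing walk, walking, walked and walks onto one token.

```python
import re

def tokenize(text):
    """Word breaking: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

SUFFIXES = ("ing", "ed", "s")  # assumed, intentionally tiny suffix list

def naive_stem(word):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = tokenize("Walk, walking, walked, walks!")
stems = [naive_stem(t) for t in tokens]
print(tokens)
print(stems)
```

A rule this simple mangles plenty of real words, which is one reason domain dictionaries and proper linguistic tooling matter.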

  39. ML Toolkits, Platforms & Libraries • Toolkit/Platforms • WEKA • R • Parts of Python SciPy • Microsoft Cognitive Toolkit (CNTK) • Libraries • Scikit-learn (Python) • JSAT • Accord.NET Framework • APIs • Azure ML • MLlib • PredictionIO • Operationalize • SQL Server Machine Learning Services/Machine Learning Server

  40. Got a question? Andrew Loree www.andyloree.com andy@andyloree.com @LowOnDiskSpace
