
Introduction to Machine Learning and Text Mining

Introduction to Machine Learning and Text Mining. Carolyn Penstein Rosé, Language Technologies Institute / Human-Computer Interaction Institute.


Presentation Transcript


  1. Introduction to Machine Learning and Text Mining. Carolyn Penstein Rosé, Language Technologies Institute / Human-Computer Interaction Institute

  2. Naïve Approach: When all you have is a hammer… (diagram labels: Data, Target Representation)

  3. Slightly less naïve approach: Aimless wandering… (diagram labels: Data, Target Representation)

  4. Expert Approach: Hypothesis driven (diagram labels: Data, Target Representation)

  5. Suggested Readings • Witten, I. H., Frank, E., Hall, M. (2011). Data Mining: Practical Machine Learning Tools and Techniques, third edition, Elsevier: San Francisco

  6. What is machine learning? • Automatically or semi-automatically • Inducing concepts (i.e., rules) from data • Finding patterns in data • Explaining data • Making predictions (diagram labels: Data, Learning Algorithm, Model, New Data, Classification Engine, Prediction)
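
The pipeline named on this slide (data goes into a learning algorithm, which produces a model, which a classification engine applies to new data) can be sketched in a few lines. This is only an illustration, not the method used in the lecture: the learner below simply maps each value of one attribute to its most frequent label, and the small weather-style dataset is made up for the example.

```python
from collections import Counter, defaultdict

def learn_rule(examples, attribute):
    """Learning algorithm: map each attribute value to its majority label."""
    counts = defaultdict(Counter)
    for features, label in examples:
        counts[features[attribute]][label] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

def predict(model, features, attribute, default="yes"):
    """Classification engine: look up the label for a new instance."""
    return model.get(features[attribute], default)

# Hypothetical training data, invented for this sketch.
training_data = [
    ({"Outlook": "sunny"}, "no"),
    ({"Outlook": "sunny"}, "no"),
    ({"Outlook": "overcast"}, "yes"),
    ({"Outlook": "rainy"}, "yes"),
    ({"Outlook": "rainy"}, "yes"),
    ({"Outlook": "rainy"}, "no"),
]

model = learn_rule(training_data, "Outlook")                      # Data -> Model
prediction = predict(model, {"Outlook": "overcast"}, "Outlook")   # New Data -> Prediction
print(model, prediction)
```

The induced model is just a lookup table of rules, which is the sense of "inducing concepts (i.e., rules) from data" in the bullets above.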

  7. Train Test

  8. Perfect on training data: If Outlook = sunny, no; else if Outlook = overcast, yes; else if Outlook = rainy and Windy = TRUE, no; else yes

  9. Performance on training data? Not perfect on testing data: If Outlook = sunny, no; else if Outlook = overcast, yes; else if Outlook = rainy and Windy = TRUE, no; else yes

  10. IMPORTANT! If you evaluate the performance of your rule on the same data you trained on, you won’t get an accurate estimate of how well it will do on new data. Rule: If Outlook = sunny, no; else if Outlook = overcast, yes; else if Outlook = rainy and Windy = TRUE, no; else yes
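
To make the warning concrete, here is a minimal sketch that encodes the rule from these slides as a function and scores it on two small datasets. The rule itself comes from the slides; the `train` and `test` instances are hypothetical, chosen only so that the rule fits the first set perfectly and the second imperfectly.

```python
def rule(instance):
    """The rule from the slides, written as a Python function."""
    if instance["Outlook"] == "sunny":
        return "no"
    elif instance["Outlook"] == "overcast":
        return "yes"
    elif instance["Outlook"] == "rainy" and instance["Windy"]:
        return "no"
    else:
        return "yes"

def accuracy(classifier, labeled):
    """Fraction of labeled instances the classifier gets right."""
    correct = sum(1 for inst, label in labeled if classifier(inst) == label)
    return correct / len(labeled)

# Hypothetical data the rule was fit to.
train = [
    ({"Outlook": "sunny", "Windy": False}, "no"),
    ({"Outlook": "overcast", "Windy": True}, "yes"),
    ({"Outlook": "rainy", "Windy": True}, "no"),
    ({"Outlook": "rainy", "Windy": False}, "yes"),
]

# Hypothetical held-out data; two instances disagree with the rule.
test = [
    ({"Outlook": "sunny", "Windy": False}, "yes"),
    ({"Outlook": "overcast", "Windy": False}, "yes"),
    ({"Outlook": "rainy", "Windy": True}, "no"),
    ({"Outlook": "rainy", "Windy": False}, "no"),
]

train_accuracy = accuracy(rule, train)
test_accuracy = accuracy(rule, test)
print(train_accuracy, test_accuracy)
```

On this invented data the training score is higher than the held-out score, which is why only the held-out number is a fair estimate of performance on new data.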

  11. Simple Cross Validation, Fold 1 • Let’s say your data has attributes A, B, and C • You want to train a rule to predict D • First train on 2, 3, 4, 5, 6, 7 • and apply the trained model to 1 • The result is Accuracy1 (Folds: TEST 1 | TRAIN 2 | TRAIN 3 | TRAIN 4 | TRAIN 5 | TRAIN 6 | TRAIN 7)

  12. Simple Cross Validation, Fold 2 • Let’s say your data has attributes A, B, and C • You want to train a rule to predict D • Now train on 1, 3, 4, 5, 6, 7 • and apply the trained model to 2 • The result is Accuracy2 (Folds: TRAIN 1 | TEST 2 | TRAIN 3 | TRAIN 4 | TRAIN 5 | TRAIN 6 | TRAIN 7)

  13. Simple Cross Validation, Fold 3 • Let’s say your data has attributes A, B, and C • You want to train a rule to predict D • Now train on 1, 2, 4, 5, 6, 7 • and apply the trained model to 3 • The result is Accuracy3 (Folds: TRAIN 1 | TRAIN 2 | TEST 3 | TRAIN 4 | TRAIN 5 | TRAIN 6 | TRAIN 7)

  14. Simple Cross Validation, Fold 4 • Let’s say your data has attributes A, B, and C • You want to train a rule to predict D • Now train on 1, 2, 3, 5, 6, 7 • and apply the trained model to 4 • The result is Accuracy4 (Folds: TRAIN 1 | TRAIN 2 | TRAIN 3 | TEST 4 | TRAIN 5 | TRAIN 6 | TRAIN 7)

  15. Simple Cross Validation, Fold 5 • Let’s say your data has attributes A, B, and C • You want to train a rule to predict D • Now train on 1, 2, 3, 4, 6, 7 • and apply the trained model to 5 • The result is Accuracy5 (Folds: TRAIN 1 | TRAIN 2 | TRAIN 3 | TRAIN 4 | TEST 5 | TRAIN 6 | TRAIN 7)

  16. Simple Cross Validation, Fold 6 • Let’s say your data has attributes A, B, and C • You want to train a rule to predict D • Now train on 1, 2, 3, 4, 5, 7 • and apply the trained model to 6 • The result is Accuracy6 (Folds: TRAIN 1 | TRAIN 2 | TRAIN 3 | TRAIN 4 | TRAIN 5 | TEST 6 | TRAIN 7)

  17. Simple Cross Validation, Fold 7 • Let’s say your data has attributes A, B, and C • You want to train a rule to predict D • Now train on 1, 2, 3, 4, 5, 6 • and apply the trained model to 7 • The result is Accuracy7 • Finally: average Accuracy1 through Accuracy7 (Folds: TRAIN 1 | TRAIN 2 | TRAIN 3 | TRAIN 4 | TRAIN 5 | TRAIN 6 | TEST 7)
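
The seven-fold procedure walked through above can be written as a short loop. This is a sketch under stated assumptions: the dataset (attributes A, B, C and label D) is invented, and `train_majority` is a stand-in learner that always predicts the most common training label, used here only so the example runs without any library.

```python
from collections import Counter

def make_folds(data, k):
    """Split data into k folds by dealing instances out round-robin."""
    return [data[i::k] for i in range(k)]

def train_majority(train):
    """Stand-in learner (an assumption): always predict the most common label."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda features: majority

def cross_validate(data, k, train_fn):
    """Hold out each fold once, train on the rest, and average the accuracies."""
    folds = make_folds(data, k)
    accuracies = []
    for i, test_fold in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train_fn(train)
        correct = sum(1 for features, label in test_fold if model(features) == label)
        accuracies.append(correct / len(test_fold))
    return accuracies, sum(accuracies) / k

# Hypothetical dataset: 14 instances with attributes A, B, C and label D.
labels = ["yes", "yes", "no", "yes", "yes", "yes", "no",
          "yes", "no", "yes", "yes", "yes", "yes", "no"]
data = [({"A": i % 2, "B": i % 3, "C": i % 5}, d) for i, d in enumerate(labels)]

accuracies, average = cross_validate(data, 7, train_majority)
print(accuracies, average)
```

Each instance is used for testing exactly once and for training in the other six rounds, so the averaged score never rewards a model for data it has already seen.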

  18. Working with Text

  19. Basic Idea: Represent text as a vector where each position corresponds to a term. This is called the “bag of words” approach. • Vector positions: Cheese, Cows, Eat, Hamsters, Make, Seeds • “Cows make cheese.” → 110010 • “Hamsters eat seeds.” → 001101

  20. Basic Idea: Represent text as a vector where each position corresponds to a term. This is called the “bag of words” approach. • Vector positions: Cheese, Cows, Eat, Hamsters, Make, Seeds • “Cows make cheese.” → 110010 • “Hamsters eat seeds.” → 001101 • But same representation for “Cheese makes cows.”!
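
The vectors on these two slides can be reproduced with a small sketch. The vocabulary order is taken from the slide; the `NORMALIZE` table that maps "makes" to "make" is an assumption added here so that the third sentence lines up with the slide's vocabulary (a real system would typically use a stemmer or lemmatizer instead).

```python
import re

# Vocabulary in the order shown on the slide.
VOCABULARY = ["cheese", "cows", "eat", "hamsters", "make", "seeds"]

# Hypothetical normalization table (an assumption for this sketch).
NORMALIZE = {"makes": "make"}

def bag_of_words(text, vocabulary=VOCABULARY):
    """Return a 0/1 string with one position per vocabulary term."""
    tokens = re.findall(r"[a-z]+", text.lower())
    present = {NORMALIZE.get(tok, tok) for tok in tokens}
    return "".join("1" if term in present else "0" for term in vocabulary)

v1 = bag_of_words("Cows make cheese.")
v2 = bag_of_words("Hamsters eat seeds.")
v3 = bag_of_words("Cheese makes cows.")
print(v1, v2, v3)
```

Because the representation records only which terms occur, word order is lost, so the first and third sentences receive the same vector.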

  21. Part of Speech Tagging (http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) 1. CC Coordinating conjunction 2. CD Cardinal number 3. DT Determiner 4. EX Existential there 5. FW Foreign word 6. IN Preposition or subordinating conjunction 7. JJ Adjective 8. JJR Adjective, comparative 9. JJS Adjective, superlative 10. LS List item marker 11. MD Modal 12. NN Noun, singular or mass 13. NNS Noun, plural 14. NNP Proper noun, singular 15. NNPS Proper noun, plural 16. PDT Predeterminer 17. POS Possessive ending 18. PRP Personal pronoun 19. PRP$ Possessive pronoun 20. RB Adverb 21. RBR Adverb, comparative 22. RBS Adverb, superlative

  22. Part of Speech Tagging, continued (http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) 23. RP Particle 24. SYM Symbol 25. TO to 26. UH Interjection 27. VB Verb, base form 28. VBD Verb, past tense 29. VBG Verb, gerund or present participle 30. VBN Verb, past participle 31. VBP Verb, non-3rd person singular present 32. VBZ Verb, 3rd person singular present 33. WDT Wh-determiner 34. WP Wh-pronoun 35. WP$ Possessive wh-pronoun 36. WRB Wh-adverb

  23. Basic Types of Features • Unigram: single words (e.g., prefer, sandwich, take) • Bigram: pairs of words next to each other (e.g., machine_learning, eat_wheat) • POS-Bigram: pairs of POS tags next to each other (e.g., DT_NN, NNP_NNP)
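
The three feature types listed above can all be produced by one n-gram helper. In this sketch the sentence and its Penn Treebank tags are supplied by hand (an assumption made so the example needs no tagger); in practice the tags would come from a part-of-speech tagger.

```python
def ngrams(items, n):
    """Join each run of n adjacent items with an underscore."""
    return ["_".join(items[i:i + n]) for i in range(len(items) - n + 1)]

# Hypothetical hand-tagged sentence (tags assigned manually for illustration).
tagged = [("the", "DT"), ("hamster", "NN"), ("eats", "VBZ"), ("seeds", "NNS")]

words = [word for word, _ in tagged]
tags = [tag for _, tag in tagged]

unigrams = ngrams(words, 1)      # single words
bigrams = ngrams(words, 2)       # pairs of adjacent words
pos_bigrams = ngrams(tags, 2)    # pairs of adjacent POS tags
print(unigrams, bigrams, pos_bigrams)
```

Unlike unigrams, the bigram and POS-bigram features keep some local word order, which is exactly what the plain bag-of-words vector discards.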

  24. Keep this picture in mind… • Machine learning isn’t magic • But it can be useful for identifying meaningful patterns in your data when used properly • Proper use requires insight into your data
