
Ling 570: Day 8 Classification, Mallet

Presentation Transcript


  1. Ling 570: Day 8 Classification, Mallet

  2. Roadmap • Open questions? • Quick review of classification • Feature templates

  3. Classification Problem Steps • Input processing: • Split data into training/dev/test • Convert data into a feature representation (aka Attribute Value Matrix) • Training • Testing • Evaluation

  4. Feature templates • Problem: predict the POS tag distribution of an unknown word • Input: “unfrobulate” • Input: “turduckenly”

  5. Feature templates • Problem: predict the POS tag distribution of an unknown word • Input: “unfrobulate” • Input: “turduckenly” • Features might include:

  6. Feature templates • Problem: predict the POS tag distribution of an unknown word • Input: “unfrobulate” • Input: “turduckenly” • Features might include: • Last three characters are “ate” • Last two characters are “ly”

  7. Feature templates • Problem: predict the POS tag distribution of an unknown word • Input: “unfrobulate” • Input: “turduckenly” • Features might include: • Last three characters are “ate” • Last two characters are “ly” • Feature templates generate features given an input • Template: Last three characters == XXX.

  8. Feature templates • Problem: predict the POS tag distribution of an unknown word • Input: “unfrobulate” • Input: “turduckenly” • Features might include: • Last three characters are “ate” • Last two characters are “ly” • Feature templates generate features given an input • Template: Last three characters == XXX. • Plug in XXX to get a binary-valued feature. • Templates generate many features
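
A minimal sketch of such a suffix template in Python (the function name and the "suffixN=" feature-naming scheme are illustrative, not from the slides):

    def suffix_features(word, lengths=(2, 3)):
        # Instantiate the template "last N characters == XXX" for one input word,
        # producing binary-valued features keyed by the suffix actually observed.
        features = {}
        for n in lengths:
            if len(word) >= n:
                features["suffix%d=%s" % (n, word[-n:])] = 1
        return features

    # suffix_features("unfrobulate") -> {"suffix2=te": 1, "suffix3=ate": 1}
    # suffix_features("turduckenly") -> {"suffix2=ly": 1, "suffix3=nly": 1}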

  9. Machine learning

  10. Classifiers • Wide variety • Differ on several dimensions • Supervision • Learning Function • Input Features

  11. Supervision in Classifiers • Supervised: • True label/class of each training instance is provided to the learner at training time • Naïve Bayes, MaxEnt, Decision Trees, Neural nets, etc

  12. Supervision in Classifiers • Supervised: • True label/class of each training instance is provided to the learner at training time • Naïve Bayes, MaxEnt, Decision Trees, Neural nets, etc • Unsupervised: • No true labels are provided for examples during training • Clustering: k-means; Min-cut algorithms

  13. Supervision in Classifiers • Supervised: • True label/class of each training instance is provided to the learner at training time • Naïve Bayes, MaxEnt, Decision Trees, Neural nets, etc • Unsupervised: • No true labels are provided for examples during training • Clustering: k-means; Min-cut algorithms • Semi-supervised: (bootstrapping) • True labels are provided for only a subset of examples • Co-training, semi-supervised SVM/CRF, etc

  14. Inductive Bias • What form of function is learned? • Function that separates members of different classes • Linear separator • Higher-order functions • Voronoi diagrams, etc

  15. Inductive Bias • What form of function is learned? • Function that separates members of different classes • Linear separator • Higher-order functions • Voronoi diagrams, etc • Graphically: a decision boundary separating + examples from - examples

  16. Machine Learning Functions • Problem: Can the representation effectively model the class to be learned?

  17. Machine Learning Functions • Problem: Can the representation effectively model the class to be learned? • Motivates selection of learning algorithm • [figure: scatter of - and + examples]

  18. Machine Learning Functions • Problem: Can the representation effectively model the class to be learned? • Motivates selection of learning algorithm • For this function, linear discriminant is GREAT! • [figure: scatter of - and + examples]

  19. Machine Learning Functions • Problem: Can the representation effectively model the class to be learned? • Motivates selection of learning algorithm • For this function, linear discriminant is GREAT! • Rectangular boundaries (e.g. ID trees) TERRIBLE! • [figure: scatter of - and + examples]

  20. Machine Learning Functions • Problem: Can the representation effectively model the class to be learned? • Motivates selection of learning algorithm • For this function, linear discriminant is GREAT! • Rectangular boundaries (e.g. ID trees) TERRIBLE! • Pick the right representation! • [figure: scatter of - and + examples]

  21. Machine Learning Features • Inputs: • E.g. words, acoustic measurements, parts-of-speech, syntactic structures, semantic classes, etc. • Vectors of features: • E.g. word: letters • ‘cat’: L1 = c; L2 = a; L3 = t • Parts of syntax trees?

  22. Machine Learning Features • Questions: • Which features and values should be used? • How should they relate to each other? • Issue 1: What values should they take? • Binary features – don’t do anything! • Real-valued features *may* need to be normalized • Can force the values to have 0 mean and unit variance • Compute the mean and variance of each real-valued feature on the training set • Replace the original value x with (x - mean) / standard deviation • Can also bin them or binarize them – often this works better • Issue 2: Which ones are important? • Feature selection is sometimes important • Current approach
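
A minimal sketch of the zero-mean, unit-variance normalization described above (Python; the function names are illustrative):

    import math

    def fit_normalizer(train_values):
        # Estimate the mean and standard deviation of one real-valued feature,
        # using only the training set.
        n = len(train_values)
        mean = sum(train_values) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in train_values) / n)
        return mean, (std if std > 0 else 1.0)

    def normalize(value, mean, std):
        # Replace the original value with its z-score: (x - mean) / std.
        return (value - mean) / std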

  23. Machine Learning Toolkits • Many learners, many tools/implementations

  24. Machine Learning Toolkits • Many learners, many tools/implementations • Some broad tool sets • weka • Java, lots of classifiers, pedagogically oriented

  25. Machine Learning Toolkits • Many learners, many tools/implementations • Some broad tool sets • weka • Java, lots of classifiers, pedagogically oriented • mallet • Java, classifiers, sequence learners • More heavy duty

  26. Mallet: intro and data prep

  27. Mallet • Machine learning toolkit • Developed at UMass Amherst by Andrew McCallum

  28. Mallet • Machine learning toolkit • Developed at UMass Amherst by Andrew McCallum • Java implementation, open source

  29. Mallet • Machine learning toolkit • Developed at UMass Amherst by Andrew McCallum • Java implementation, open source • Large collection of machine learning algorithms • Targeted to language processing • Naïve Bayes, MaxEnt, Decision Trees, Winnow, Boosting • Also, clustering, topic models, sequence learners

  30. Mallet • Machine learning toolkit • Developed at UMass Amherst by Andrew McCallum • Java implementation, open source • Large collection of machine learning algorithms • Targeted to language processing • Naïve Bayes, MaxEnt, Decision Trees, Winnow, Boosting • Also, clustering, topic models, sequence learners • Widely used, but • Research software: some bugs/gaps; odd documentation

  31. Installation • Installed on patas • /NLP_TOOLS/tool_sets/mallet/latest/ • Directories: • bin/: script files • src/: java source code • class/: java classes • lib/: jar files • sample-data/: Wikipedia docs for language ID, etc

  32. Environment • Should be set up on patas • $PATH should include • /NLP_TOOLS/tool_sets/mallet/latest/bin • $CLASSPATH should include • /NLP_TOOLS/tool_sets/mallet/latest/lib/mallet-deps.jar and /NLP_TOOLS/tool_sets/mallet/latest/lib/mallet.jar • Check: • which text2vectors • should print /NLP_TOOLS/tool_sets/mallet/latest/bin/text2vectors

  33. Mallet Commands • Mallet command types: • Data preparation • Data/model inspection • Training • Classification

  34. Mallet Commands • Mallet command types: • Data preparation • Data/model inspection • Training • Classification • Command line scripts • Shell scripts • Set up java environment • Invoke java programs • --help lists command line parameters for scripts

  35. Mallet Data • Mallet data instances: • Instance_id label f1 v1 f2 v2 ….. • Stored in internal binary format: “vectors” • Binary format used by learners, decoders • Need to convert text files to binary format
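
For example, two instances in that text representation might look like the following (instance ids, labels, and features are made up for illustration):

    doc01  en  the 3  cat 1  sat 1
    doc02  de  der 2  hund 1  schlief 1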

  36. Data Preparation • Built-in data importers • One class per directory, one instance per file • bin/mallet import-dir --input IF --output OF • Label is directory name • (Also text2vectors) • One instance per line • bin/mallet import-file --input IF --output OF • Line: instance label text ….. • (Also csv2vectors) • Create binary representation of text feature counts
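
For instance, with the one-instance-per-line importer, an input file and import command might look like this (file name and contents are hypothetical):

    # docs.txt -- each line: instance-id  label  text ...
    #   doc01  en  the cat sat on the mat
    #   doc02  de  der hund schlief unter dem tisch
    bin/mallet import-file --input docs.txt --output docs.vectors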

  37. Data Preparation • bin/mallet import-svmlight --input IF --output OF • Allows import of user constructed feature value pairs

  38. Data Preparation • bin/mallet import-svmlight --input IF --output OF • Allows import of user constructed feature value pairs • Format: • label f1:v1 f2:v2 ….. fn:vn • Features can be strings or indexes • (Also bin/svmlight2vectors)

  39. Data Preparation • bin/mallet import-svmlight --input IF --output OF • Allows import of user constructed feature value pairs • Format: • label f1:v1 f2:v2 ….. fn:vn • Features can be strings or indexes • (Also bin/svmlight2vectors) • If building test data separately from original • bin/mallet import-svmlight --input IF --output OF • --use-pipe-from previously_built.vectors

  40. Data Preparation • bin/mallet import-svmlight --input IF --output OF • Allows import of user constructed feature value pairs • Format: • label f1:v1 f2:v2 ….. fn:vn • Features can be strings or indexes • (Also bin/svmlight2vectors) • If building test data separately from original • bin/mallet import-svmlight --input IF --output OF • --use-pipe-from previously_built.vectors • Ensures consistent feature representation • Note: can’t mix svmlight models with others
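
A small, hypothetical example of the svmlight-style workflow (file names and features are made up):

    # train.sv -- one instance per line: label f1:v1 f2:v2 ...
    #   VB suffix3=ate:1 wordlen:11
    #   RB suffix2=ly:1 wordlen:11
    bin/mallet import-svmlight --input train.sv --output train.vectors

    # Build test vectors with the same feature mapping as the training data
    bin/mallet import-svmlight --input test.sv --output test.vectors \
        --use-pipe-from train.vectors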

  41. Accessing Binary Formats • vectors2info --input IF

  42. Accessing Binary Formats • vectors2info --input IF • --print-labels TRUE • Prints list of category labels in data set

  43. Accessing Binary Formats • vectors2info --input IF • --print-labels TRUE • Prints list of category labels in data set • --print-matrix sic • Prints all features and values by string and number • Returns original text feature-value list • Possibly out of order

  44. Accessing Binary Formats • vectors2info --input IF • --print-labels TRUE • Prints list of category labels in data set • --print-matrix sic • Prints all features and values by string and number • Returns original text feature-value list • Possibly out of order • vectors2vectors --input IF --training-file TNF --testing-file TTF --training-portion pct

  45. Accessing Binary Formats • vectors2info --input IF • --print-labels TRUE • Prints list of category labels in data set • --print-matrix sic • Prints all features and values by string and number • Returns original text feature-value list • Possibly out of order • vectors2vectors --input IF --training-file TNF --testing-file TTF --training-portion pct • Creates random training/test splits in some ratio
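
For example, an 80/20 random split could be produced like this (file names are hypothetical):

    vectors2vectors --input all.vectors \
        --training-file train.vectors --testing-file test.vectors \
        --training-portion 0.8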

  46. Building & Accessing Models • bin/mallet train-classifier --input vector_data_file --trainer classifiertype --training-portion 0.9 --output-classifier OF • Builds classifier model • Can also store model, produce scores, confusion matrix, etc

  47. Building & Accessing Models • bin/mallet train-classifier --input vector_data_file --trainer classifiertype --training-portion 0.9 --output-classifier OF • Builds classifier model • Can also store model, produce scores, confusion matrix, etc • --trainer: MaxEnt, DecisionTree, NaiveBayes, etc

  48. Building & Accessing Models • bin/mallet train-classifier --input vector_data_file --trainer classifiertype --training-portion 0.9 --output-classifier OF • Builds classifier model • Can also store model, produce scores, confusion matrix, etc • --trainer: MaxEnt, DecisionTree, NaiveBayes, etc • --report: train:accuracy, test:f1:en

  49. Building & Accessing Models • bin/mallet train-classifier --input vector_data_file --trainer classifiertype --training-portion 0.9 --output-classifier OF • Builds classifier model • Can also store model, produce scores, confusion matrix, etc • --trainer: MaxEnt, DecisionTree, NaiveBayes, etc • --report: train:accuracy, test:f1:en • Can also use pre-split training & testing files • e.g. output of vectors2vectors • --training-file, --testing-file

  50. Building & Accessing Models • bin/mallet train-classifier --input vector_data_file --trainer classifiertype --training-portion 0.9 --output-classifier OF • Builds classifier model • Can also store model, produce scores, confusion matrix, etc • --trainer: MaxEnt, DecisionTree, NaiveBayes, etc • --report: train:accuracy, test:f1:en • Example output:

    Confusion Matrix, row=true, column=predicted  accuracy=1.0
      label    0    1   |total
      0 de     1    .   |1
      1 en     .    1   |1
    Summary. train accuracy mean = 1.0 stddev = 0 stderr = 0
    Summary. test accuracy mean = 1.0 stddev = 0 stderr = 0
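
Putting the pieces together, a hypothetical end-to-end training run might look like one of the following (file names are made up; the --report specifiers are the ones shown above):

    # Train and evaluate on a random 90/10 split of one vectors file
    bin/mallet train-classifier --input all.vectors --trainer MaxEnt \
        --training-portion 0.9 --output-classifier langid.classifier \
        --report train:accuracy test:f1:en

    # Or use a pre-split pair produced by vectors2vectors
    bin/mallet train-classifier --training-file train.vectors \
        --testing-file test.vectors --trainer NaiveBayes \
        --output-classifier langid.classifier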
