
TagHelper & SIDE

TagHelper & SIDE. Carolyn Penstein Rosé, Language Technologies Institute / Human-Computer Interaction Institute. TagHelper Tools and SIDE. Define Summaries. Annotate Data. Visualize Annotated Data. TagHelper Tools uses text mining technology to automate annotation of conversational data.


Presentation Transcript


  1. TagHelper & SIDE • Carolyn Penstein Rosé • Language Technologies Institute / Human-Computer Interaction Institute

  2. TagHelper Tools and SIDE • Define Summaries • Annotate Data • Visualize Annotated Data • TagHelper Tools uses text mining technology to automate annotation of conversational data • SIDE facilitates rapid prototyping of reporting interfaces for group learning facilitators

  3. Setting Up Your Data For TagHelper

  4. Setting Up Your Data

  5. How do you know when you have coded enough data? • What distinguishes questions and statements? • “I” versus “you” is not a reliable predictor • Not all WH words occur in questions • Not all questions end in a question mark • You need to code enough to avoid learning rules that won’t work

  6. Creating a Trained Model

  7. Training and Testing • Start TagHelper tools by double-clicking on the portal.bat icon in your TagHelperTools2 folder • You will then see the following tool palette • The idea is that you will train a prediction model on your coded data and then apply that model to uncoded data • Click on Train New Models

  8. Loading a File • First click on Add a File • Then select a file

  9. Simplest Usage • Click “GO!” • TagHelper will use its default setting to train a model on your coded examples • It will use that model to assign codes to the uncoded examples

  10. More Advanced Usage • The second option is to modify the default settings • You get to the options you can set by clicking on >> Options • After you finish that, click “GO!”

  11. Evaluating Performance

  12. Performance report • The performance report tells you: • What dataset was used • What the customization settings were • At the bottom of the file are reliability statistics and a confusion matrix that tells you which types of errors are being made
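
The confusion matrix and a reliability statistic from the performance report can be illustrated with a minimal Python sketch. Cohen's kappa is used here as one common reliability statistic; that the report contains exactly this statistic is an assumption, and the labels and data are made up for illustration.

```python
from collections import Counter

def confusion_matrix(gold, predicted):
    """Count how often each (human code, predicted code) pair occurs."""
    return Counter(zip(gold, predicted))

def cohens_kappa(gold, predicted):
    """Agreement between two label sequences, corrected for chance agreement."""
    n = len(gold)
    observed = sum(g == p for g, p in zip(gold, predicted)) / n
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    # Chance agreement: probability that both sequences pick the same label at random.
    expected = sum(gold_counts[label] * pred_counts[label] for label in gold_counts) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical codes: Q = question, S = statement.
gold = ["Q", "Q", "S", "S", "S", "Q"]
pred = ["Q", "S", "S", "S", "Q", "Q"]

matrix = confusion_matrix(gold, pred)
kappa = cohens_kappa(gold, pred)
for (g, p), count in sorted(matrix.items()):
    print(f"human={g} predicted={p}: {count}")
print(f"kappa = {kappa:.3f}")
```

Off-diagonal cells such as (human=Q, predicted=S) show which kinds of errors the model makes.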

  13. Output File • The output file contains • The codes for each segment • Note that the segments that were already coded will retain their original code • The other segments will have their automatic predictions • The prediction column indicates the confidence of the prediction

  14. Overview of Basic Feature Extraction from Text

  15. Customizations • To customize the settings: • Select the file • Click on Options

  16. Classifier Options • The three main types of classifiers are Bayesian models (Naïve Bayes), functions (SMO), and trees (J48) • Rules of thumb: • SMO is state-of-the-art for text classification • J48 is best with small feature sets; it also handles contingencies between features well • Naïve Bayes works well for models where decisions are made based on accumulating evidence rather than hard and fast rules
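
To make "accumulating evidence" concrete, here is a minimal Python sketch of a Naïve Bayes text classifier with add-one smoothing. It is not TagHelper's implementation; the function names, example segments, and codes are hypothetical. Each word adds a small amount of log-probability to each code's score, and the code with the highest total wins.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (text, code) pairs that a human has already coded."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary

def predict(model, text):
    """Return the code whose accumulated log-probability is highest."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = math.log(count / total)  # prior evidence for this code
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for tok in text.lower().split():
            if tok in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

coded = [
    ("what is the answer ?", "Question"),
    ("how do you know ?", "Question"),
    ("the answer is nine .", "Statement"),
    ("i think so .", "Statement"),
]
model = train_naive_bayes(coded)
first = predict(model, "what do you think ?")
second = predict(model, "i think the answer is nine .")
print(first, second)
```

No single word decides the outcome; "think" counts toward Statement in the first example, but the other words outweigh it.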

  17. Basic Idea • Represent text as a vector where each position corresponds to a term • This is called the “bag of words” approach • Vocabulary: Cheese, Cows, Eggs, Hens, Lay, Make • “Cows make cheese” = 110001 • “Hens lay eggs” = 001110 • But “Cheese makes cows.” gets the same representation!
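
The vectors on this slide can be reproduced with a short Python sketch. The function name is hypothetical, and the scrambled sentence uses "make" rather than "makes" because this sketch does no stemming.

```python
import re

def bag_of_words(text, vocabulary):
    """Return a 0/1 vector with one position per vocabulary term."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return [1 if term in tokens else 0 for term in vocabulary]

vocabulary = ["cheese", "cows", "eggs", "hens", "lay", "make"]

v_cows = bag_of_words("Cows make cheese", vocabulary)
v_hens = bag_of_words("Hens lay eggs", vocabulary)
v_scrambled = bag_of_words("Cheese make cows.", vocabulary)

print(v_cows)       # [1, 1, 0, 0, 0, 1]
print(v_hens)       # [0, 0, 1, 1, 1, 0]
print(v_scrambled)  # same vector as v_cows: word order is lost
```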

  18. What can’t you conclude from “bag of words” representations? • Causality: “X caused Y” versus “Y caused X” • Roles and Mood: “Which person ate the food that I prepared this morning and drives the big car in front of my cat” versus “The person, which prepared food that my cat and I ate this morning, drives in front of the big car.” • Who’s driving, who’s eating, and who’s preparing food?

  19. Basic Anatomy: Layers of Linguistic Analysis • Phonology: The sound structure of language • Basic sounds, syllables, rhythm, intonation • Morphology: The building blocks of words • Inflection: tense, number, gender • Derivation: building words from other words, transforming part of speech • Syntax: Structural and functional relationships between spans of text within a sentence • Phrase and clause structure • Semantics: Literal meaning, propositional content • Pragmatics: Non-literal meaning, language use, language as action, social aspects of language (tone, politeness) • Discourse Analysis: Language in practice, relationships between sentences, interaction structures, discourse markers, anaphora and ellipsis

  20. Part of Speech Tagging • 1. CC Coordinating conjunction 2. CD Cardinal number 3. DT Determiner 4. EX Existential there 5. FW Foreign word 6. IN Preposition/subordinating conjunction 7. JJ Adjective 8. JJR Adjective, comparative 9. JJS Adjective, superlative 10. LS List item marker 11. MD Modal 12. NN Noun, singular or mass 13. NNS Noun, plural 14. NNP Proper noun, singular 15. NNPS Proper noun, plural 16. PDT Predeterminer 17. POS Possessive ending 18. PRP Personal pronoun 19. PRP$ Possessive pronoun 20. RB Adverb 21. RBR Adverb, comparative 22. RBS Adverb, superlative • http://www.ldc.upenn.edu/Catalog/docs/treebank2/cl93.html

  21. Part of Speech Tagging • 23. RP Particle 24. SYM Symbol 25. TO to 26. UH Interjection 27. VB Verb, base form 28. VBD Verb, past tense 29. VBG Verb, gerund/present participle 30. VBN Verb, past participle 31. VBP Verb, non-3rd ps. sing. present 32. VBZ Verb, 3rd ps. sing. present 33. WDT wh-determiner 34. WP wh-pronoun 35. WP$ Possessive wh-pronoun 36. WRB wh-adverb • http://www.ldc.upenn.edu/Catalog/docs/treebank2/cl93.html

  22. TagHelper Customizations • Feature Space Design • Think like a computer! • Machine learning algorithms look for features that are good predictors, not features that are necessarily meaningful • Look for approximations • If you want to find questions, you don’t need to do a complete syntactic analysis • Look for question marks • Look for wh-terms that occur immediately before an auxiliary verb
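
The two approximations for finding questions can be sketched in Python as cheap surface features. The word lists below are illustrative assumptions, not TagHelper's own lists, and the function name is hypothetical.

```python
import re

# Assumed, deliberately incomplete word lists for illustration only.
WH_TERMS = {"who", "what", "when", "where", "why", "which", "how"}
AUXILIARIES = {"is", "are", "was", "were", "do", "does", "did",
               "can", "could", "will", "would", "should"}

def question_features(text):
    """Extract two cheap surface features that approximate 'is a question'."""
    tokens = re.findall(r"[a-z']+", text.lower())
    wh_before_aux = any(
        first in WH_TERMS and second in AUXILIARIES
        for first, second in zip(tokens, tokens[1:])
    )
    return {"has_question_mark": "?" in text, "wh_before_aux": wh_before_aux}

f1 = question_features("what is the common denominator")
f2 = question_features("you think the answer is 9?")
f3 = question_features("I know what you mean.")
print(f1, f2, f3)
```

Neither feature is reliable alone: the first example has no question mark, and the second has no wh-term, which is why a learner is given both.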

  23. TagHelper Customizations • Feature Space Design • Punctuation can be a “stand in” for mood • “you think the answer is 9?” • “you think the answer is 9.” • Bigrams capture simple lexical patterns • “common denominator” versus “common multiple” • POS bigrams capture syntactic or stylistic information • “the answer which is …” vs “which is the answer” • Line length can be a proxy for explanation depth
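
Word bigrams and line length are simple to compute. The following Python sketch uses hypothetical function names and whitespace tokenization as a simplifying assumption; POS bigrams are omitted because they require a part-of-speech tagger.

```python
def word_bigrams(text):
    """Return adjacent word pairs joined with an underscore."""
    tokens = text.lower().split()
    return [a + "_" + b for a, b in zip(tokens, tokens[1:])]

def line_length(text):
    """Number of whitespace-separated tokens, a rough proxy for explanation depth."""
    return len(text.split())

denominator = set(word_bigrams("find the common denominator"))
multiple = set(word_bigrams("find the common multiple"))

# Both lines share the unigram "common"; only the bigrams tell them apart.
only_in_denominator = denominator - multiple
print(only_in_denominator)
print(line_length("ok"), line_length("because 12 is divisible by both 3 and 4"))
```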

  24. TagHelper Customizations • Feature Space Design • The “contains non-stop word” feature can be a predictor of whether a conversational contribution is contentful • “ok sure” versus “the common denominator” • The “remove stop words” option removes some distracting features • Stemming allows some generalization • Multiple, multiply, multiplication • Removing rare features is a cheap form of feature selection • Features that only occur once or twice in the corpus won’t generalize, so they are a waste of time to include in the vector space
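
A minimal Python sketch of two of these options follows. The stop-word list is a small hypothetical one chosen so that the slide's examples work; TagHelper's actual list is not given in the slides. The threshold of three occurrences follows the "once or twice" wording above.

```python
from collections import Counter

# Hypothetical stop list for illustration only.
STOP_WORDS = {"ok", "sure", "the", "a", "an", "is", "of", "to", "and", "yes", "no"}

def contains_non_stop_word(text, stop_words=STOP_WORDS):
    """True if at least one token is not a stop word, i.e. the line has content."""
    return any(tok not in stop_words for tok in text.lower().split())

def frequent_terms(documents, min_count=3):
    """Keep only terms that occur at least min_count times in the corpus."""
    counts = Counter(tok for doc in documents for tok in doc.lower().split())
    return {term for term, count in counts.items() if count >= min_count}

empty_line = contains_non_stop_word("ok sure")
content_line = contains_non_stop_word("the common denominator")

corpus = ["multiply the fractions", "multiply by nine", "then multiply again"]
kept = frequent_terms(corpus)
print(empty_line, content_line, kept)
```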

  25. Created Features

  26. Why create new features by hand? • Rules • For simple rules, it might be easier and faster to write the rules by hand instead of learning them from examples • Features • More likely to capture meaningful generalizations • Build in knowledge so you can get by with less training data

  27. Rule Language • ANY() is used to create lists • COLOR = ANY(red,yellow,green,blue,purple) • FOOD = ANY(cake,pizza,hamburger,steak,bread) • ALL() is used to capture contingencies • ALL(cake,presents) • More complex rules • ALL(COLOR,FOOD) * Note that you may wish to use part-of-speech tags in your rules!

  28. What can you do with this rule language? • You may want to generalize across sets of related words • Color = {red,yellow,orange,green,blue} • Food = {cake,pizza,hamburger,steak,bread} • You may want to detect contingencies • The text must mention both cake and presents in order to count as a birthday party • You may want to combine these • The text must include a Color and a Food
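
The semantics of ANY() and ALL() can be mimicked in a few lines of Python. This is only a sketch of the behavior described on the slides, not TagHelper's rule interpreter; the helper names `_matches` and `tokenize` are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def _matches(item, tokens):
    # An item is either a plain word or a previously defined rule (a function).
    return item(tokens) if callable(item) else item in tokens

def ANY(*items):
    """Rule that fires if at least one word or sub-rule matches."""
    return lambda tokens: any(_matches(item, tokens) for item in items)

def ALL(*items):
    """Rule that fires only if every word or sub-rule matches."""
    return lambda tokens: all(_matches(item, tokens) for item in items)

COLOR = ANY("red", "yellow", "green", "blue", "purple")
FOOD = ANY("cake", "pizza", "hamburger", "steak", "bread")
BIRTHDAY = ALL("cake", "presents")
COLOR_AND_FOOD = ALL(COLOR, FOOD)

r1 = COLOR_AND_FOOD(tokenize("I ate a green pizza"))
r2 = COLOR_AND_FOOD(tokenize("I ate a pizza"))
r3 = BIRTHDAY(tokenize("We had cake and opened presents"))
r4 = BIRTHDAY(tokenize("We had cake"))
print(r1, r2, r3, r4)
```

Each rule yields one true/false feature per segment, which can then be added to the vector alongside the automatically extracted features.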

  29. Advanced Feature Editing

  30. Advanced Feature Editing • For small datasets, first deselect “Remove rare features”

  31. Advanced Feature Editing • Next, click on Adv Feature Editing
