CS4705: Corpus Linguistics and Machine Learning Techniques

Presentation Transcript


  1. CS4705: Corpus Linguistics and Machine Learning Techniques

  2. Review • What do we know about so far? • Words (stems and affixes, roots and templates,…) • Ngrams (simple word sequences) • POS (e.g. nouns, verbs, adverbs, adjectives, determiners, articles, …)

  3. Some Additional Things We Could Find • Named Entities • Persons • Company Names • Locations • Dates

  4. What useful things can we do with this knowledge? • Find sentence boundaries, abbreviations • Find Named Entities (person names, company names, telephone numbers, addresses,…) • Find topic boundaries and classify articles into topics • Identify a document’s author and their opinion on the topic, pro or con • Answer simple questions (factoids) • Do simple summarization/compression

  5. But first, we need corpora… • Online collections of text and speech • Some examples • Brown Corpus • Wall Street Journal and AP News • ATIS, Broadcast News • TDT • Switchboard, CallHome • TRAINS, FM Radio, BDC Corpus • The Hansard parallel corpus of French and English • And many private research collections

  6. Next, we pose a question…the dependent variable • Binary questions: • Is this word followed by a sentence boundary or not? • A topic boundary? • Does this word begin a person name? End one? • Should this word or sentence be included in a summary? • Classification: • Is this document about medical issues? Politics? Religion? Sports? … • Predicting continuous variables: • How loud or high should this utterance be produced?

  7. Finding a suitable corpus and preparing it for analysis • Which corpora can answer my question? • Do I need to get them labeled to do so? • Dividing the corpus into training and test corpora • To develop a model, we need a training corpus • overly narrow corpus: doesn’t generalize • overly general corpus: doesn’t reflect the task or domain • To demonstrate how general our model is, we need a test corpus to evaluate the model • Development test set vs. held-out test set • To evaluate our model we must choose an evaluation metric • Accuracy • Precision, recall, F-measure,… • Cross validation (see the sketch below)
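
A minimal sketch of the split-train-evaluate workflow above, using scikit-learn (not named on the slide); the random arrays are stand-ins for features and labels extracted from a real labeled corpus:

    # Sketch: train/test split, the metrics above, and 10-fold cross-validation.
    # X and y are synthetic stand-ins for features/labels from a labeled corpus.
    import numpy as np
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score, precision_recall_fscore_support

    rng = np.random.default_rng(0)
    X = rng.random((1000, 4))        # stand-in feature matrix
    y = rng.integers(0, 2, 1000)     # stand-in binary labels (e.g. boundary or not)

    # Hold out a test set so the model is judged on data it never saw.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    clf = DecisionTreeClassifier().fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    print("accuracy:", accuracy_score(y_test, y_pred))
    p, r, f, _ = precision_recall_fscore_support(y_test, y_pred, average="binary")
    print("precision:", p, "recall:", r, "F-measure:", f)

    # Cross-validation: rotate which tenth of the corpus serves as test data.
    print("10-fold CV accuracy:", cross_val_score(clf, X, y, cv=10).mean())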

  8. Then we build the model… • Identify the dependent variable: what do we want to predict or classify? • Does this word begin a person name? Is this word within a person name? • Is this document about sports? The weather? International news? … • Identify the independent variables: what features might help to predict the dependent variable? • What is this word’s POS? What is the POS of the word before it? After it? • Is this word capitalized? Is it followed by a ‘.’? • Does ‘hockey’ appear in this document? • How far is this word from the beginning of its sentence? • Extract the values of each variable from the corpus by some automatic means

  9. A Sample Feature Vector for Sentence-Ending Detection
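
The table on this slide was an image that did not survive transcription. As a stand-in, here is a minimal Python sketch that builds the kind of feature vector slide 9 tabulated, from the features slide 8 lists (the tokens, POS tags, and field names are illustrative assumptions, not taken from the slide):

    # Sketch: a feature vector for "does tokens[i] end a sentence?"
    def sentence_end_features(tokens, pos_tags, i):
        word = tokens[i]
        return {
            "word": word.lower(),
            "pos": pos_tags[i],                                  # this word's POS
            "prev_pos": pos_tags[i - 1] if i > 0 else "BOS",     # POS of word before
            "next_pos": pos_tags[i + 1] if i + 1 < len(tokens) else "EOS",
            "capitalized": word[0].isupper(),
            "followed_by_period": i + 1 < len(tokens) and tokens[i + 1] == ".",
            "dist_from_sentence_start": i,
        }

    tokens = ["Dr", ".", "Smith", "arrived", "."]
    pos_tags = ["NNP", ".", "NNP", "VBD", "."]
    print(sentence_end_features(tokens, pos_tags, 0))
    # {'word': 'dr', 'pos': 'NNP', 'prev_pos': 'BOS', 'next_pos': '.',
    #  'capitalized': True, 'followed_by_period': True, 'dist_from_sentence_start': 0}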

  10. An Example: Finding Caller Names in Voicemail → SCANMail • Motivated by interviews, surveys and usage logs of heavy users: • Hard to scan new msgs to find those you need to deal with quickly • Hard to find msg you want in archive • Hard to locate information you want in any msg • How could we help?

  11. SCANMail Architecture [architecture diagram: Caller → SCANMail → Subscriber]

  12. Corpus Collection • Recordings collected from 138 AT&T Labs employees’ mailboxes • 100 hours; 10K msgs; 2500 speakers • Gender balanced; 12% non-native speakers • Mean message duration 36.4 secs, median 30.0 secs • Hand-transcribed and annotated with caller id, gender, age, entity demarcation (names, dates, telephone numbers) • Also recognized using an ASR engine

  13. Transcription and Bracketing [ Greeting: hi R ] [ CallerID: it's me ] give me a call [ um ] right away cos there's [ .hn ] I guess there's some [ .hn ] change [ Date: tomorrow ] with the nursery school and they [ um ] [ .hn ] anyway they had this idea [ cos ] since I think J's the only one staying [ Date: tomorrow ] for play club so they wanted to they suggested that [ .hn ] well J2 actually offered to take J home with her and then would she

  14. would meet you back at the synagogue at [ Time: five thirty ] to pick her up [ .hn ] [ uh ] so I don't know how you feel about that otherwise M_ and one other teacher would stay and take care of her till [ Date: five thirty tomorrow ] but if you [ .hn ] I wanted to know how you feel before I tell her one way or the other so call me [ .hn ] right away cos I have to get back to her in about an hour so [ .hn ] okay [ Closing: bye ] [ .nhn ] [ .onhk ]
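
A minimal sketch of recovering the labeled spans from this bracketing format, assuming the [ Label: text ] annotations are never nested; filler and breath tokens like [ um ] and [ .hn ] contain no colon and are skipped:

    # Sketch: extract (label, text) pairs such as ('Date', 'tomorrow') from
    # the bracketed transcription format shown on slides 13-14.
    import re

    BRACKET = re.compile(r"\[\s*([A-Za-z]+):\s*([^\[\]]+?)\s*\]")

    def extract_entities(transcript):
        return BRACKET.findall(transcript)

    sample = ("[ Greeting: hi R ] [ CallerID: it's me ] give me a call "
              "[ um ] right away cos there's some change [ Date: tomorrow ]")
    print(extract_entities(sample))
    # [('Greeting', 'hi R'), ('CallerID', "it's me"), ('Date', 'tomorrow')]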

  15. SCANMail Demo http://www.avatarweb.com/scanmail/ Audix extension: demo Audix password: (null)

  16. Information Extraction (Martin Jansche and Steve Abney) • Goals: extract key information from msgs to present in headers • Approach: • Supervised learning from transcripts (phone #’s, caller self-ids) • Combine machine learning techniques with simpler alternatives, e.g. hand-crafted rules • Two-stage approaches

  17. Features exploit structure of key elements (e.g. length of phone numbers) and of surrounding context (e.g. self-ids tend to occur at beginning of msg)

  18. Telephone Number Identification • Rules convert all numbers to standard digit format • Predict start of phone number with rules • This step over-generates • Prune with decision-tree classifier • Best features: • Position in msg • Lexical cues • Length of digit string • Performance: • .94 F on human-labeled transcripts • .95 F on ASR transcripts
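
A hedged sketch of the two-stage approach this slide describes: a deliberately permissive rule proposes digit-string candidates, and a pruning classifier, here stubbed with a simple threshold in place of the trained decision tree, filters them using features like those listed above. Feature names and thresholds are illustrative:

    # Stage 1 (rules): after numbers are normalized to digits, propose any
    # run of 7-11 digits, allowing spaces/dashes/dots between them.
    import re

    CANDIDATE = re.compile(r"\b(?:\d[\s\-.]?){7,11}\b")

    def candidate_features(msg, match):
        digits = re.sub(r"\D", "", match.group())
        return {
            "position_in_msg": match.start() / max(len(msg), 1),  # positional cue
            "digit_length": len(digits),                          # length of digit string
            "lexical_cue": "number" in msg[max(0, match.start() - 30):match.start()].lower(),
        }

    def find_phone_numbers(msg, prune):
        # Stage 2 (classifier): keep only candidates the pruner accepts.
        return [m.group().strip() for m in CANDIDATE.finditer(msg)
                if prune(candidate_features(msg, m))]

    msg = "hi it's me my number is 9 7 3 5 5 5 0 1 2 3 call me back"
    print(find_phone_numbers(msg, prune=lambda f: f["digit_length"] >= 7))
    # -> ['9 7 3 5 5 5 0 1 2 3']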

  19. Caller Self-Identifications • Predict start of id with classifier • 97% of id’s begin 1-7 words into msg • Then predict length of phrase • Majority are only 2-4 words long • Avoid risk of relying on correct speech recognition for names • Best cues to end of phrase are a few common words • ‘I’, ‘could’, ‘please’ • No actual names: they over-fit the data • Performance • .71 F on human-labeled • .70 F on ASR
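
A simplified sketch of the two predictions this slide describes, with both classifiers reduced to the cues the slide cites (a start within the first few words of the msg, an end at a common cue word or after four words). The trigger words "it's" and "this" are illustrative assumptions, not from the slide:

    END_CUES = {"i", "could", "please"}   # common words that tend to follow an id

    def find_self_id(words, max_start=7, max_len=4):
        # Step 1: predict where the self-identification starts.
        for start in range(min(max_start, len(words))):
            if words[start].lower() in ("it's", "this"):
                # Step 2: predict its length -- stop at a cue word or max_len.
                end = start + 1
                while (end < len(words) and end - start < max_len
                       and words[end].lower() not in END_CUES):
                    end += 1
                return " ".join(words[start:end])
        return None

    print(find_self_id("hi R it's me could you call me back".split()))
    # -> "it's me"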

  20. Introduction to Weka
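
Weka reads feature vectors from ARFF files, so the sentence-ending features sketched earlier could be handed to it as follows. This is a hedged sketch: the attribute names and the two data rows are illustrative, not from the slides:

    % sentence_end.arff -- sentence-ending feature vectors (cf. slide 9)
    @relation sentence_end
    @attribute pos {NNP, VBD, DT, PERIOD}
    @attribute next_pos {NNP, VBD, DT, PERIOD, EOS}
    @attribute capitalized {yes, no}
    @attribute followed_by_period {yes, no}
    @attribute dist_from_sentence_start numeric
    @attribute is_boundary {yes, no}
    @data
    NNP, PERIOD, yes, yes, 0, no
    VBD, PERIOD, no, yes, 3, yes

Weka's C4.5-style decision tree (J48) can then be trained from the command line; with no -T test file, Weka defaults to 10-fold cross-validation:

    java -cp weka.jar weka.classifiers.trees.J48 -t sentence_end.arff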
