
Natural Language Processing


gyala

Presentation Transcript


  1. Natural Language Processing

  2. Overview of this unit • Week 1 Natural Language Processing • Work in partners on lab with NLTK • Brainstorm and start projects using either or both NLP and speech recognition • Week 2 Speech Recognition • Speech lab • Finish projects and short critical reading • Week 3 • Present projects • Discuss reading

  3. Natural Language Processing • What is “Natural Language”?

  4. Components of Language • Phonetics

  5. Components of Language • Phonetics – the sounds which make up a word • e.g. “cat” – k a t

  6. Components of Language • Phonetics • Morphology

  7. Components of Language • Phonetics • Morphology – the rules by which words are composed • e.g. run + ing

  8. Components of Language • Phonetics • Morphology • Syntax

  9. Components of Language • Phonetics • Morphology • Syntax – rules for the formation of grammatical sentences • e.g. “Colorless green ideas sleep furiously.” is grammatical • “Colorless ideas green sleep furiously.” is not

  10. Components of Language • Phonetics • Morphology • Syntax • Semantics

  11. Components of Language • Phonetics • Morphology • Syntax • Semantics – meaning • e.g. “rose”

  12. Components of Language • Phonetics • Morphology • Syntax • Semantics • Pragmatics

  13. Components of Language • Phonetics • Morphology • Syntax • Semantics • Pragmatics – relationship of meaning to the context, goals and intent of the speaker • e.g. “Duck!”

  14. Components of Language • Phonetics • Morphology • Syntax • Semantics • Pragmatics • Discourse

  15. Components of Language • Phonetics • Morphology • Syntax • Semantics • Pragmatics • Discourse – 'beyond the sentence boundary'

  16. Natural Language Processing • Truly interdisciplinary

  17. Natural Language Processing • Truly interdisciplinary • Probabilistic methods

  18. Natural Language Processing • Truly interdisciplinary • Probabilistic methods • APIs

  19. NLTK • Natural Language Toolkit for Python

  20. NLTK • Natural Language Toolkit for Python • Text not speech

  21. NLTK • Natural Language Toolkit for Python • Text not speech • Corpora, tokenizers, stemmers, taggers, chunkers, parsers, classifiers, clusterers…

  22. NLTK • Natural Language Toolkit for Python • Text not speech • Corpora, tokenizers, stemmers, taggers, chunkers, parsers, classifiers, clusterers… • For example, with nltk imported and book bound to an already-loaded corpus: • >>> words = book.words() • >>> bigrams = nltk.bigrams(words) • >>> cfd = nltk.ConditionalFreqDist(bigrams) • >>> pos = nltk.pos_tag(words)

  23. Terminology

  24. Terminology • Token - An instance of a symbol, commonly a word, a linguistic unit

  25. Terminology • Tokenize – to break a sequence of characters into constituent parts • Often uses a delimiter like whitespace, special characters, newlines

  26. Terminology • Tokenize – to break a sequence of characters into constituent parts • Often uses a delimiter like whitespace, special characters, newlines • “The quick brown fox jumped over the log.”

  27. Terminology • Tokenize – to break a sequence of characters into constituent parts • Often uses a delimiter like whitespace, special characters, newlines • “The quick brown fox jumped over the log.” • “Mr. Brown, we’re confused by your article in the newspaper regarding widely-used words.”
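
The two example sentences can be run through a minimal tokenizer sketch. It uses only the Python standard library rather than NLTK's own tokenizers; the function names and the regular expression are illustrative assumptions, not something taken from the slides.

```python
import re

# Hypothetical pattern: a word that may contain internal hyphens or
# apostrophes, or else a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:[-'\u2019]\w+)*|[^\w\s]")

def whitespace_tokenize(text):
    """Split on runs of whitespace only; punctuation stays attached."""
    return text.split()

def regex_tokenize(text):
    """Split into word tokens and separate punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

simple = "The quick brown fox jumped over the log."
tricky = ("Mr. Brown, we\u2019re confused by your article in the "
          "newspaper regarding widely-used words.")

print(whitespace_tokenize(simple))
print(regex_tokenize(tricky))
```

On the second sentence this pattern keeps the contraction and "widely-used" whole but splits "Mr." into "Mr" and ".", which is one reason abbreviations usually need special handling.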

  28. Terminology • Lexeme – the set of forms taken by a single word; main entries in a dictionary • e.g. run [ruhn]: verb (run, ran, runs, running); noun (run); adjective (runny)

  29. Terminology • Morpheme – the smallest meaningful unit in the grammar of a language • Unladylike (un + lady + like) • Dogs (dog + s) • Technique (a single morpheme)

  30. Terminology • Sememe – a unit of meaning attached to a morpheme • dog – a domesticated carnivorous mammal • -s – a plural marker on nouns

  31. Terminology • Phoneme - the smallest contrastive unit in the sound system of a language • /k/ sound in the words kit and skill • /e/ in peg and bread • International Phonetic Alphabet (IPA)

  32. Terminology • Lexicon - A Vocabulary, a set of a language’s lexemes

  33. Terminology • Lexical Ambiguity - multiple alternative linguistic structures can be built for the input • e.g. “I made her duck”

  34. Terminology • Lexical Ambiguity - multiple alternative linguistic structures can be built for the input • e.g. “I made her duck” • We use POS tagging and word sense disambiguation to ATTEMPT to resolve these issues

  35. Terminology • Part of Speech - how a word is used in a sentence

  36. Terminology • Grammar – the syntax and morphology of a natural language

  37. Terminology • Corpus/Corpora - a body of text which may or may not include meta-information such as POS, syntactic structure, and semantics

  38. Terminology • Concordance – list of the usages of a word in its immediate context from a specific text • >>> text1.concordance("monstrous")
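
To show what a concordance contains, here is a small keyword-in-context sketch in plain Python. It is not NLTK's implementation; the function name, the `width` parameter and the sample tokens are assumptions made for illustration.

```python
def concordance(tokens, target, width=3):
    """Return (left context, match, right context) for each occurrence.

    `width` is the number of tokens of context kept on each side.
    Matching ignores case.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((" ".join(left), tok, " ".join(right)))
    return lines

# Made-up sample text, not a real corpus.
sample = "the monstrous whale and a most monstrous size".split()
for left, word, right in concordance(sample, "monstrous"):
    print(f"{left:>12} [{word}] {right}")
```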

  39. Terminology • Collocation – a sequence of words that occur together unusually often • e.g. red wine • >>> text4.collocations()
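
One way to make "unusually often" concrete is to compare how often a pair occurs with how often it would occur if the two words were independent. The sketch below scores adjacent pairs by pointwise mutual information. That scoring choice, the `min_count` cutoff and the sample tokens are assumptions for illustration; the slides do not say which measure NLTK uses.

```python
import math
from collections import Counter

def collocations(tokens, min_count=2):
    """Rank adjacent word pairs by PMI, skipping pairs seen < min_count times."""
    unigram = Counter(tokens)
    pairs = list(zip(tokens, tokens[1:]))
    bigram = Counter(pairs)
    n_uni, n_bi = len(tokens), len(pairs)
    scored = []
    for (w1, w2), count in bigram.items():
        if count < min_count:
            continue  # rare pairs get inflated PMI scores
        p_pair = count / n_bi
        p_independent = (unigram[w1] / n_uni) * (unigram[w2] / n_uni)
        scored.append(((w1, w2), math.log2(p_pair / p_independent)))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Made-up sample text.
tokens = "red wine is red wine not white bread".split()
print(collocations(tokens))
```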

  40. Terminology • Hapax – a word that appears once in a corpus • >>> fdist.hapaxes()
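
Finding hapaxes only requires counting. A minimal standard-library sketch follows; the function name mirrors the NLTK method shown above, but this is a stand-in written for illustration, not NLTK's code.

```python
from collections import Counter

def hapaxes(tokens):
    """Return the tokens that occur exactly once, in first-seen order."""
    counts = Counter(tokens)
    return [word for word, count in counts.items() if count == 1]

# Made-up sample text.
print(hapaxes("the cat saw the dog".split()))
```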

  41. Terminology • Bigram – sequential pair of words • From the sentence fragment “The quick brown fox…” • (“The”, “quick”), (“quick”, “brown”), (“brown”, “fox…”)
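
The pairing can be sketched in one line of plain Python by zipping the token list with itself shifted by one position. The function name is an illustrative choice.

```python
def bigrams(tokens):
    """Return every pair of adjacent tokens, in order."""
    return list(zip(tokens, tokens[1:]))

print(bigrams(["The", "quick", "brown", "fox"]))
```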

  42. Terminology • Frequency Distribution – tabulation of values according to how often a value occurs in a sample • e.g. word frequency in a corpus • word length in a corpus • >>> fdist = FreqDist(samples)
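
Both examples, word frequency and word length, can be tabulated with `collections.Counter` from the standard library. This is a rough stand-in for illustration, and the sample tokens are made up.

```python
from collections import Counter

tokens = "the quick brown fox jumped over the log".split()

# How often each word occurs.
word_freq = Counter(tokens)

# How often each word length occurs.
length_freq = Counter(len(tok) for tok in tokens)

print(word_freq.most_common(3))
print(sorted(length_freq.items()))
```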

  43. Terminology • Conditional Frequency Distribution – tabulation of values according to how often a value occurs in a sample given a condition • e.g. how often is a word tagged as a noun compared to a verb? • >>> cfd = nltk.ConditionalFreqDist(tagged_corpus)
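
A conditional frequency distribution is one frequency table per condition. The sketch below builds one from (word, tag) pairs with the standard library, treating the word as the condition. The helper name and the tiny tagged sample are assumptions for illustration.

```python
from collections import Counter, defaultdict

def conditional_freq(pairs):
    """Map each condition to a Counter of the values seen with it."""
    table = defaultdict(Counter)
    for condition, value in pairs:
        table[condition][value] += 1
    return table

# Made-up tagged sample: (word, tag) pairs.
tagged = [("run", "VERB"), ("run", "NOUN"), ("run", "VERB"), ("dog", "NOUN")]
cfd = conditional_freq(tagged)
print(cfd["run"]["VERB"], cfd["run"]["NOUN"])
```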

  44. Tagging • POS tagging

  45. Types of taggers - Default • Default – tags everything as a noun • Accuracy - .13
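
A default tagger and its accuracy measure can be sketched in a few lines of plain Python. The gold-standard sentence below is invented for illustration, so its score is not comparable to the figure on the slide.

```python
def default_tag(tokens, tag="NN"):
    """Assign the same tag to every token."""
    return [(tok, tag) for tok in tokens]

def accuracy(predicted, gold):
    """Fraction of (token, tag) pairs that match the gold standard."""
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)

# Hypothetical hand-tagged sentence.
gold = [("the", "DT"), ("dog", "NN"), ("runs", "VBZ"), ("fast", "RB")]
predicted = default_tag([word for word, _ in gold])
print(accuracy(predicted, gold))
```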

  46. Types of taggers - RE • Regular Expression – Uses a set of regexes to tag based on word patterns • Accuracy = .2
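
The idea can be sketched as an ordered list of (pattern, tag) rules in which the first matching rule wins. The rules below are a small hypothetical set chosen for illustration, written with the standard `re` module rather than NLTK.

```python
import re

# Hypothetical rules, tried in order; the last one is a catch-all.
PATTERNS = [
    (r".*ing", "VBG"),               # gerunds
    (r".*ed", "VBD"),                # simple past
    (r".*s", "NNS"),                 # plural nouns
    (r"-?[0-9]+(\.[0-9]+)?", "CD"),  # cardinal numbers
    (r".*", "NN"),                   # default: noun
]

def regex_tag(tokens, patterns=PATTERNS):
    """Tag each token with the tag of the first pattern that matches it fully."""
    tagged = []
    for tok in tokens:
        for pattern, tag in patterns:
            if re.fullmatch(pattern, tok):
                tagged.append((tok, tag))
                break
    return tagged

print(regex_tag(["running", "jumped", "dogs", "42", "fox"]))
```

Rule order matters: a word such as "is" would be tagged NNS by the plural rule, which is part of why this approach scores so low.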

  47. Types of taggers - Unigram • Unigram – learns the best possible tag for an individual word regardless of context • i.e. a lookup table • NLTK example accuracy = .46 • Supervised learning

  48. Types of taggers - Unigram • Based on conditional frequency analysis of a corpus • P (tag | word) • i.e. what is the probability of the word “run” having the tag “verb”?
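
A unigram tagger can be sketched as a lookup table that stores, for each word, the tag it carried most often in the training data. The function names, the `backoff` parameter and the training pairs below are assumptions for illustration; this is plain Python, not NLTK's tagger.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_words):
    """Build a word -> most frequent tag lookup table."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def unigram_tag(tokens, table, backoff=None):
    """Look each token up; unseen words get the backoff tag."""
    return [(tok, table.get(tok, backoff)) for tok in tokens]

# Made-up training data.
training = [("run", "VERB"), ("run", "VERB"), ("run", "NOUN"), ("the", "DET")]
table = train_unigram(training)
print(unigram_tag(["the", "run", "marathon"], table))
```

Because context is ignored, "run" is tagged as a verb everywhere, even in "the run".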

  49. Types of taggers – N gram • Ngram tagger – expands unigram tagger concept to include the context of the N-1 previous tokens • Including 1 previous token is a bigram tagger • Including 2 previous tokens is a trigram tagger

  50. Types of taggers – N gram • N-gram tagging can be modeled with a Hidden Markov Model • P (word | tag) * P (tag | previous N-1 tags) • e.g. the probability of the word “run” given the tag “verb” * the probability of the tag “verb” given that the previous tag was “noun”
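
The product of the two probabilities can be estimated from counts over a tagged corpus. The sketch below does this for the bigram case (one previous tag) without smoothing. The function names, the "<s>" start marker and the three training sentences are assumptions for illustration, and a full tagger would also need a search over tag sequences, which is omitted here.

```python
from collections import Counter

def train(tagged_sents):
    """Count emissions (tag, word) and transitions (prev_tag, tag)."""
    emit, trans = Counter(), Counter()
    tag_count, prev_count = Counter(), Counter()
    for sent in tagged_sents:
        prev = "<s>"  # hypothetical sentence-start marker
        for word, tag in sent:
            emit[(tag, word)] += 1
            trans[(prev, tag)] += 1
            tag_count[tag] += 1
            prev_count[prev] += 1
            prev = tag
    return emit, trans, tag_count, prev_count

def score(word, tag, prev_tag, model):
    """Estimate P(word | tag) * P(tag | prev_tag) from raw counts."""
    emit, trans, tag_count, prev_count = model
    if tag_count[tag] == 0 or prev_count[prev_tag] == 0:
        return 0.0
    p_word_given_tag = emit[(tag, word)] / tag_count[tag]
    p_tag_given_prev = trans[(prev_tag, tag)] / prev_count[prev_tag]
    return p_word_given_tag * p_tag_given_prev

# Made-up training sentences.
sents = [
    [("dogs", "NOUN"), ("run", "VERB")],
    [("the", "DET"), ("run", "NOUN")],
    [("cats", "NOUN"), ("run", "VERB")],
]
model = train(sents)
print(score("run", "VERB", "NOUN", model))
print(score("run", "NOUN", "DET", model))
```

Unlike the unigram lookup, the score for "run" now depends on the previous tag, so "run" after a determiner can come out as a noun.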
