Introduction to Natural Language Processing and Text Mining and the Basic Building Blocks


Introduction to Natural Language Processing and Text Mining and The basic building blocks. Sudeshna Sarkar Professor Computer Science & Engineering Department Indian Institute of Technology Kharagpur. Ambiguity. At last, a computer that understands you like your mother.


Presentation Transcript



Introduction to Natural Language Processing and Text Mining and the Basic Building Blocks

Sudeshna Sarkar

Professor

Computer Science & Engineering Department

Indian Institute of Technology Kharagpur



Ambiguity

At last, a computer that understands you like your mother.

-- 1985 McDonnell-Douglas Ad

Different interpretations:

  • The computer understands you as well as your mother understands you.

  • The computer understands that you like your mother.

  • The computer understands you as well as it understands your mother.

    Speech : ….. a computer that understands your lie cured mother …



Why is NLP difficult?

  • Natural Language is highly ambiguous.

    • Syntactic ambiguity

      • The president spoke to the nation about the problem of drug use in the schools from one coast to the other.

      • has 720 parses.

      • Ex:

        • “to the other” can attach to any of the previous NPs (ex. “the problem”), or the head verb -> 6 places

        • “from one coast” has 5 places to attach
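
How quickly attachment choices multiply can be seen with a small chart parser that counts parse trees instead of building them. The following is a minimal sketch under stated assumptions: the grammar, the lexicon, and the two example sentences are hypothetical toys chosen for illustration, and the code does not attempt to reproduce the 720 figure quoted above.

```python
# Toy CKY-style parse counter. The grammar and lexicon are illustrative
# assumptions, not the grammar behind the slide's 720-parse figure.

# Binary rules in Chomsky normal form: (left child, right child) -> parents
BINARY_RULES = {
    ("NP", "VP"): ["S"],
    ("V", "NP"): ["VP"],
    ("VP", "PP"): ["VP"],   # PP attaches to the verb phrase
    ("NP", "PP"): ["NP"],   # PP attaches to a noun phrase
    ("Det", "N"): ["NP"],
    ("P", "NP"): ["PP"],
}

LEXICON = {
    "I": ["NP"], "saw": ["V"], "the": ["Det"],
    "man": ["N"], "telescope": ["N"], "hill": ["N"],
    "with": ["P"], "on": ["P"],
}


def count_parses(words, start="S"):
    """Return the number of distinct parse trees rooted in `start`."""
    n = len(words)
    chart = {}  # (i, j, label) -> number of trees covering words[i:j]
    for i, word in enumerate(words):
        for label in LEXICON[word]:
            key = (i, i + 1, label)
            chart[key] = chart.get(key, 0) + 1
    # Fill wider spans from narrower ones (dynamic programming).
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (left, right), parents in BINARY_RULES.items():
                    ways = chart.get((i, k, left), 0) * chart.get((k, j, right), 0)
                    if ways:
                        for parent in parents:
                            key = (i, j, parent)
                            chart[key] = chart.get(key, 0) + ways
    return chart.get((0, n, start), 0)


one_pp = "I saw the man with the telescope".split()
two_pp = "I saw the man with the telescope on the hill".split()
print(count_parses(one_pp), count_parses(two_pp))  # prints: 2 5
```

Adding one more prepositional phrase raises the count from 2 to 5 in this toy grammar, which hints at why a sentence with several trailing PPs can have hundreds of parses.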



Why is NLP difficult?

  • Word category ambiguity

    • book -->verb? or noun?

  • Word sense ambiguity

    • bank --> financial institution? building? or river side?

  • Words can mean more than their sum of parts

    • make up a story

  • Fictitious worlds

    • People on Mars can fly.

  • Defining scope

    • People like ice-cream.

    • Does this mean that all (or some?) people like ice cream?

  • Language is changing and evolving

    • I’ll email you my answer.

    • This new S.U.V. has a compartment for your mobile phone.

    • Googling, …



Why is NLP hard?

  • Natural language is

    • Highly ambiguous at all levels

    • Complex

    • Probabilistic, fuzzy

    • Involves reasoning about the world

    • Deals with complex social interactions

  • Why is text tough?

    • Abstract concepts are difficult to represent

    • Countless combinations of subtle, abstract relationships among concepts

    • Many ways to represent similar concepts

    • Concepts are difficult to visualize

    • High dimensionality - Tens or hundreds of thousands of features



How is NLP doable?

  • But in some senses NLP is quite easy

    • Rough text features good enough for many useful tasks

  • Why is text easy?

    • Highly redundant data

    • Just about any simple algorithm can get “good” results for simple tasks:

      • Pull out “important” phrases

      • Find “meaningfully” related words

      • Create some sort of summary from documents



Levels of Text Processing

  • Word Level

    • Word Properties

    • Stop-Words

    • Stemming

    • Frequent N-Grams

    • Thesaurus (WordNet)

  • Sentence Level

  • Document Level

  • Document-Collection Level

  • Linked-Document-Collection Level

  • Application Level
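
Two of the word-level steps listed above, stop-word removal and frequent n-grams, can be sketched in a few lines. This is only an illustration: the stop-word list and the sample text are assumptions made up for the example, and stemming and thesaurus lookup are left out.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "from"}


def words(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def content_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def ngram_counts(tokens, n=2):
    """Count every run of n consecutive tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


text = "The president spoke to the nation. The president spoke about schools."
tokens = words(text)
print(content_words(tokens))
print(ngram_counts(tokens).most_common(2))
```

Counting n-grams before or after stop-word removal is a design choice; here the bigrams are counted on the full token stream so that phrases such as "the president" stay intact.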



Models and Algorithms

  • Models: formalisms used to capture the various kinds of linguistic structure.

    • State machines (FSAs, transducers, Markov models)

    • Formal rule systems (context-free grammars, feature systems)

    • Logic (predicate calculus, inference)

    • Probabilistic versions of all of these + others (Gaussian mixture models, probabilistic relational models, etc.)

  • Algorithms used to manipulate representations to create structure.

    • Search (A*, dynamic programming)

    • EM

    • Supervised learning, etc etc
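
Dynamic programming is named above only as a class of search algorithm. One familiar instance in language processing is minimum edit distance between two strings; the sketch below picks that example as an assumption for illustration (the slides do not single it out) and uses unit costs for insertion, deletion, and substitution.

```python
def edit_distance(source, target):
    """Levenshtein distance with unit insert, delete, and substitute costs."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] = cost of turning source[:i] into target[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i          # delete all of source[:i]
    for j in range(cols):
        dist[0][j] = j          # insert all of target[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            substitute = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,               # deletion
                dist[i][j - 1] + 1,               # insertion
                dist[i - 1][j - 1] + substitute,  # match or substitution
            )
    return dist[-1][-1]


print(edit_distance("kitten", "sitting"))  # prints: 3
```

Each table cell is computed once from three neighbouring cells, which is what turns an exponential search over edit sequences into a table fill of size len(source) × len(target).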



Language Processing Pipeline

  • Input

    • speech -> Phonetic/Phonological Analysis

    • text -> OCR/Tokenization

  • Stages, in order

    • Morphological and lexical analysis

    • Syntactic analysis

    • Semantic Interpretation

    • Discourse Processing

  • Tasks along the pipeline

    • POS tagging

    • WSD

    • Shallow parsing

    • Deep Parsing

    • Anaphora resolution

    • Integration



The Big Picture

  • Source Language Speech Signal -> Speech recognition -> Source text Analysis -> Target text Generation -> Speech Synthesis -> Target Language Speech Signal



Some Building Blocks

  • Source Language Analysis

    • Text Normalization

    • Morphological Analysis

    • POS Tagging

    • Parsing

    • Semantic Analysis

    • Discourse Analysis

  • Target Language Generation

    • Text Rendering

    • Morphological Synthesis

    • Phrase Generation

    • Role Ordering

    • Lexical Choice

    • Discourse Planning



Two Approaches

  • Symbolic

    • Encode all the necessary knowledge

    • Good when annotated data is not available

    • Allows steady development

    • The development can be monitored

    • Fits well with logic and reasoning in AI

  • Statistical

    • Learn language from its usage

    • Supervised learning requires large collections manually annotated with meta-tags

    • Development is almost blind

    • Few ways to check the correctness

    • Debugging is very frustrating



Resolve Ambiguities

  • We will introduce models and algorithms to resolve ambiguities at different levels.

  • part-of-speech tagging -- Deciding whether duck is verb or noun.

  • word-sense disambiguation -- Deciding whether make is create or cook.

  • lexical disambiguation -- Resolving part-of-speech ambiguity and resolving word-sense ambiguity are two important kinds of lexical disambiguation.

  • syntactic ambiguity -- her duck is an example of syntactic ambiguity, and can be addressed by probabilistic parsing.
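
Word-sense disambiguation can be illustrated with a simplified Lesk-style heuristic: pick the sense whose dictionary gloss shares the most words with the sentence. The sketch below is an assumption-heavy toy, not the method the slides go on to present: the sense inventory for "bank", the glosses, and the stop-word list are all hypothetical and written only for this example.

```python
import re

# Hypothetical mini sense inventory; the glosses are written for this example.
SENSES = {
    "bank": {
        "financial_institution":
            "an institution that accepts deposits of money and makes loans",
        "river_side":
            "sloping land beside a river or lake",
    }
}

# Small illustrative stop-word list so function words do not count as overlap.
STOP_WORDS = {"a", "an", "the", "of", "and", "or", "that", "at", "on", "i", "we"}


def content_tokens(text):
    """Return the set of lowercase non-stop-word tokens in the text."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS}


def disambiguate(word, sentence):
    """Choose the sense whose gloss overlaps most with the sentence context."""
    context = content_tokens(sentence) - {word}

    def overlap(sense):
        return len(context & content_tokens(SENSES[word][sense]))

    # On a tie, max() keeps the first sense listed in the inventory.
    return max(SENSES[word], key=overlap)


print(disambiguate("bank", "I deposited money at the bank"))
print(disambiguate("bank", "We sat on the bank of the river"))
```

With glosses this short the overlap is often zero or one word, so the heuristic is fragile; it is shown only to make the idea of context-based sense selection concrete.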



Languages

  • Languages: 39,000 languages and dialects (22,000 dialects in India alone)

  • Top languages:

    • Chinese/Mandarin (885M),

    • Spanish (332M),

    • English (322M),

    • Bengali (189M),

    • Hindi (182M),

    • Portuguese (170M), Russian (170M), Japanese (125M)

  • Source: www.sil.org/ethnologue, www.nytimes.com

  • Internet: English (128M), Japanese (19.7M), German (14M), Spanish (9.4M), French (9.3M), Chinese (7.0M)

  • Usage: English (1999-54%, 2001-51%, 2003-46%, 2005-43%)

  • Source: www.computereconomics.com



  • Tokenization

  • Segmentation

  • Stemming/ lemmatization
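
As a rough illustration of the first step, tokenization can be approximated with a single regular expression. This is a minimal sketch with an assumed token definition (runs of word characters, optionally joined by an apostrophe, or a single punctuation mark); it is not a full tokenizer and mishandles abbreviations such as "S.U.V.".

```python
import re

# A token is a run of word characters (keeping contractions such as "I'll"
# together) or a single character that is neither a word character nor space.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("I'll email you my answer."))
# prints: ["I'll", 'email', 'you', 'my', 'answer', '.']
```

Treating punctuation as separate tokens keeps sentence boundaries visible to later stages, at the cost of splitting abbreviations that a rule list would need to protect.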



Morphology

  • Morphology is the field of linguistics that studies the internal structure of words

  • How words are built up from smaller meaningful units called morphemes (morph = shape, logos = word)

  • We can usefully divide morphemes into two classes

    • Stems: The core meaning bearing units

    • Affixes: Bits and pieces that adhere to stems to change their meanings and grammatical functions

      • Prefix: un-, anti-, etc (a- ati- pra- etc)

      • Suffix: -ity, -ation, etc ( -taa, -ke, -ka etc)

      • Infix: are inserted inside the stem

        • Tagalog: um + hingi -> humingi

      • Circumfixes – precede and follow the stem

  • Turkish can have words with a lot of suffixes (agglutinative language). Many Indian languages also have agglutinative suffixes.



Examples (English)

  • “unladylike”

    • 3 morphemes, 4 syllables

      un- ‘not’

      lady ‘(well behaved) female adult human’

      -like ‘having the characteristics of’

    • Can’t break any of these down further without distorting the meaning of the units

  • “dogs”

    • 2 morphemes, 1 syllable

      -s, a plural marker on nouns
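
The two segmentations above can be reproduced with a naive affix-stripping procedure. The sketch is hypothetical throughout: the prefix, suffix, and stem inventories are assumptions sized to fit just these two words, and it handles at most one prefix and one suffix.

```python
# Hypothetical affix and stem inventories, just large enough for the examples.
PREFIXES = ["un", "anti"]
SUFFIXES = ["like", "s"]
STEMS = {"lady", "dog"}


def segment(word):
    """Split a word into [prefix-], stem, [-suffix]; return None if no split fits."""
    for prefix in [""] + PREFIXES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in [""] + SUFFIXES:
            if suffix and not rest.endswith(suffix):
                continue
            stem = rest[:len(rest) - len(suffix)]
            if stem in STEMS:
                parts = []
                if prefix:
                    parts.append(prefix + "-")
                parts.append(stem)
                if suffix:
                    parts.append("-" + suffix)
                return parts
    return None


print(segment("unladylike"))  # prints: ['un-', 'lady', '-like']
print(segment("dogs"))        # prints: ['dog', '-s']
```

Requiring the remaining stem to be in a lexicon is what stops the procedure from splitting a word like "uncle" into "un-" plus a non-word.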



Examples (Bengali)

  • “chhelederTaakei”

    • 5 morphemes

      chhele ‘boy’

      -der ‘plural genitive’

      -Taa ‘classifier’

      -ke ‘dative’

      -i ‘emphasizer’

      Can’t break any of these down further without distorting the meaning of the units

  • “atipraakrritake”

    ati-

    praakrrita

    -ke



Inflectional & Derivational Morphology

  • We can also divide morphology up into two broad classes

    • Inflectional

    • Derivational

  • Inflectional morphology is grammatical

    • number, tense, case, gender

  • Derivational morphology concerns word building

    • part-of-speech derivation

    • words with related meaning



Inflectional Morphology

  • Inflection:

    • Variation in the form of a word, typically by means of an affix, that expresses a grammatical contrast.

      • Doesn’t change the word class

      • Usually produces a predictable, nonidiosyncratic change of meaning. Eg, may add tense, number, person, mood, aspect

      • Serves a grammatical/semantic purpose different from the original

  • Highly systematic, though there may be irregularities and exceptions

    • Simplifies lexicon, only exceptions need to be listed

    • Unknown words may be guessable

    • After a combination with an inflectional morpheme, the meaning and class of the actual stem usually do not change.

    • eat / eats, pencil / pencils

    • khelaa / khele / khelchhila, bai / baiTAke / baiyera



Derivational Morphology

  • Derivation:

    • The formation of a new word or inflectable stem from another word or stem.

  • After a combination with a derivational morpheme, the meaning and the class of the actual stem usually change.

    • compute / computer, do / undo, friend / friendly

    • uygar / uygarlaş, kapı / kapıcı

    • udaara (J) / udaarataa (N)

    • bhadra / abhadra

    • baayu / baayabiiya

  • Irregular changes may happen with derivational affixes.

  • Fairly systematic, and predictable up to a point

    • Simplifies description of lexicon: regularly derived words need not be listed

    • Unknown words may be guessable

  • But …

    • Apparent derivations have specialised meaning

    • Some derivations missing



Morphological processes

  • Affixes: prefix, suffix, infix, circumfix

  • Vowel change (umlaut, ablaut)

  • Gemination, (partial) reduplication

  • Root and pattern

  • Stress (or tone) change

  • Sandhi



Concatenative Morphology

  • Morpheme+Morpheme+Morpheme+…

  • Stems: also called lemma, base form, root, lexeme

    • hope+ing -> hoping, hop+ing -> hopping

  • Affixes

    • Prefixes: anti-, dis- in Antidisestablishmentarianism

    • Suffixes: -ment, -arian, -ism in Antidisestablishmentarianism

    • Infixes: hingi (borrow) – humingi (borrower) in Tagalog

    • Circumfixes: sagen (say) – gesagt (said) in German

  • Agglutinative Languages

    • uygarlaştıramadıklarımızdanmışsınızcasına

    • uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına

    • Behaving as if you are among those whom we could not cause to become civilized



Morphophonemics

  • Morphemes and allomorphs

    • eg {plur}: +(e)s, vowel change, y -> ies, f -> ves, um -> a, ...

  • Morphophonemic variation

    • Affixes and stems may have variants which are conditioned by context

      • eg +ing in lifting, swimming, boxing, raining, hoping, hopping

    • Rules may be generalisable across morphemes

      • eg +(e)s in cats, boxes, tomatoes, matches, dishes, buses

      • Applies to both {plur} (nouns) and {3rd sing pres} (verbs)
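
The context-conditioned variants listed above can be approximated with a few spelling rules. The sketch below is a simplification under stated assumptions: the function names and the small list of -o nouns are hypothetical, the consonant-doubling rule is only reliable for one-syllable stems, and irregular forms are ignored.

```python
VOWELS = set("aeiou")
SIBILANT_ENDINGS = ("s", "x", "z", "ch", "sh")
# Stems in -o that take +es have to be listed; this set is illustrative only.
O_TAKES_ES = {"tomato", "potato"}


def add_s(stem):
    """Attach +(e)s: the same rule serves {plur} nouns and {3rd sing pres} verbs."""
    if stem.endswith(SIBILANT_ENDINGS) or stem in O_TAKES_ES:
        return stem + "es"                    # box -> boxes, match -> matches
    if stem.endswith("y") and len(stem) > 1 and stem[-2] not in VOWELS:
        return stem[:-1] + "ies"              # study -> studies
    return stem + "s"                         # cat -> cats


def add_ing(stem):
    """Attach +ing with e-deletion and consonant doubling."""
    if stem.endswith("e") and len(stem) > 2 and stem[-2] not in VOWELS:
        return stem[:-1] + "ing"              # hope -> hoping
    if (len(stem) >= 3
            and stem[-1] not in VOWELS and stem[-1] not in "wxy"
            and stem[-2] in VOWELS
            and stem[-3] not in VOWELS):
        return stem + stem[-1] + "ing"        # hop -> hopping
    return stem + "ing"                       # lift -> lifting


for stem in ["lift", "swim", "box", "rain", "hope", "hop"]:
    print(stem, "->", add_ing(stem))
for stem in ["cat", "box", "tomato", "match", "dish", "bus"]:
    print(stem, "->", add_s(stem))
```

Because `add_s` depends only on the shape of the stem, one function covers both noun plurals and third-person verb forms, which is the generalisation the last bullet points to.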



Templatic Morphology

  • Roots and Patterns

    • Example: Hebrew verbs

    • Root:

      • Consists of 3 consonants CCC

      • Carries basic meaning

    • Template:

      • Gives the ordering of consonants and vowels

      • Specifies semantic information about the verb

        • Active, passive, middle voice

    • Example:

      • lmd (to learn or study)

        • CaCaC -> lamad (he studied)

        • CiCeC -> limed (he taught)

        • CuCaC -> lumad (he was taught)
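
Root-and-pattern formation amounts to interleaving the root consonants with the vowels of the template. A minimal sketch, assuming templates are written with a capital C for each consonant slot as in the example above (the function name is hypothetical):

```python
def apply_template(root, template):
    """Fill each C slot in the template with the next root consonant."""
    slots = template.count("C")
    if slots != len(root):
        raise ValueError(
            "template has %d consonant slots, root has %d consonants"
            % (slots, len(root))
        )
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in template)


for template in ["CaCaC", "CiCeC", "CuCaC"]:
    print(template, "->", apply_template("lmd", template))
# prints:
# CaCaC -> lamad
# CiCeC -> limed
# CuCaC -> lumad
```

Keeping the root and the template as separate inputs mirrors the division of labour described above: the root carries the basic meaning and the template carries the voice information.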



Syntax and Morphology

  • Phrase-level agreement

    • Subject-Verb

      • John studies hard (STUDY+3SG)

    • Noun-Adjective

      • Achchhi Ladki

  • In some languages like Sanskrit, morphology contains a lot of information about structure



Morphology in NLP

  • Analysis vs synthesis

    • what does dogs mean? vs what is the plural of dog?

  • Analysis

    • Need to identify lexeme

      • Tokenization

      • To access lexical information

    • Inflections (etc.) carry information that will be needed by other processes (e.g. agreement is useful in parsing), and inflections can carry meaning (e.g. tense, number)

    • Morphology can be ambiguous

      • May need other process to disambiguate (eg German –en)

  • Synthesis

    • Need to generate appropriate inflections from underlying representation
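
The two directions can be contrasted with a lexicon-backed toy for English noun number. Everything here is an assumption for illustration: the mini lexicon, the SG/PL feature labels, and the function names are hypothetical, and only the plain +s rule plus listed exceptions is covered.

```python
# Hypothetical mini lexicon: regular nouns plus listed irregular plurals.
NOUNS = {"dog", "cat", "mouse"}
IRREGULAR_PLURALS = {"mouse": "mice"}
IRREGULAR_SINGULARS = {plural: lemma for lemma, plural in IRREGULAR_PLURALS.items()}


def generate(lemma, number):
    """Synthesis: lemma plus a number feature -> surface form."""
    if number == "SG":
        return lemma
    return IRREGULAR_PLURALS.get(lemma, lemma + "s")


def analyze(form):
    """Analysis: surface form -> list of (lemma, number) readings."""
    readings = []
    if form in NOUNS:
        readings.append((form, "SG"))
    if form in IRREGULAR_SINGULARS:
        readings.append((IRREGULAR_SINGULARS[form], "PL"))
    elif (form.endswith("s") and form[:-1] in NOUNS
          and form[:-1] not in IRREGULAR_PLURALS):
        readings.append((form[:-1], "PL"))
    return readings


print(analyze("dogs"))          # what does "dogs" mean?
print(generate("dog", "PL"))    # what is the plural of "dog"?
```

`analyze` returns a list because a surface form can have more than one reading, which is where a later disambiguation step would plug in.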

