





CSE467/567 Computational Linguistics

Carl Alphonce

cse-467-alphonce@cse.buffalo.edu

Computer Science & Engineering

University at Buffalo


Levels of processing

  • phonetics/phonology – sounds

  • morphology – word structure

  • syntax – sentence structure

  • semantics – meaning

  • pragmatics – goals of language use

  • discourse – utterances in context

CSE 467/567



Words have internal structure

  • readable = read + able

  • readability = read + able + ity

  • the structure of words can be described using a regular grammar

Chomsky hierarchy

Problem

  • I often need to find an e-mail, but I have thousands of e-mails in my various folders. Suppose I want to find an e-mail about geese. The e-mail may mention “geese” or “goose”; also, if the word appears at the start of a sentence, its initial letter will be capitalized. The search therefore needs to match “goose”, “geese”, “Goose” or “Geese”.

Regular expressions (in Perl)

“a regular expression is an algebraic notation for characterizing a set of strings” [p. 22]

Regular expressions are commonly used to specify search strings. For example, the UNIX utility program grep lets the user specify a pattern to search for in files.

Sequences of characters

Matching a sequence of characters

/…/

Examples:

/a/ matches the character ‘a’

/fred/ matches the string ‘fred’

Note:

/fred/ does not match the string ‘Fred’!

In other words, patterns are case-sensitive.
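A minimal sketch of literal matching and case sensitivity. It uses Python's re module rather than Perl (an assumption made for illustration); the text between the slashes is written as a Python string and behaves the same way for this pattern.

```python
import re

# The Perl pattern /fred/ corresponds to the Python pattern string "fred".
pattern = re.compile(r"fred")

# search() looks for the pattern anywhere in the string.
found_lower = pattern.search("alfred is here") is not None  # 'fred' occurs inside 'alfred'
found_upper = pattern.search("Fred is here") is not None    # capital 'F' is a different character

print(found_lower, found_upper)
```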

Character disjunction (character classes)

Square brackets are used to indicate disjunction of characters.

Examples:

/[Ff]/ matches either ‘f’ or ‘F’

/[Ff]red/ matches either ‘fred’ or ‘Fred’

This form of disjunction applies only at the character level. A set of characters in square brackets is sometimes referred to as a character class.
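The character class above can be tried out as follows. This is a sketch in Python's re module (assumed here in place of Perl); the sample strings are made up for illustration.

```python
import re

# [Ff] accepts exactly one character: either 'F' or 'f'.
pattern = re.compile(r"[Ff]red")

candidates = ["fred", "Fred", "FRED"]
matches = [s for s in candidates if pattern.search(s)]

# 'FRED' is rejected because only the first letter is covered by the class.
print(matches)
```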

Ranges

Sometimes it is useful to specify “any digit” or “any letter”.

“Any digit” can be written as /[0123456789]/, since any of the ten digits satisfies the pattern.

An alternative is to use a special range notation: /[0-9]/

Any letter can be specified as /[A-Za-z]/

Range notation does not extend the power of regular expressions, but gives us a convenient way to express them.
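To illustrate that range notation is only a shorthand, the sketch below (Python's re module, assumed in place of Perl; the sample text is arbitrary) compares the long and short forms of the digit class.

```python
import re

digit_long = re.compile(r"[0123456789]")  # every digit listed explicitly
digit_range = re.compile(r"[0-9]")        # the same set written as a range
letter = re.compile(r"[A-Za-z]")          # any upper- or lower-case ASCII letter

text = "CSE 467"

# findall() returns every non-overlapping match, in order.
digits = digit_range.findall(text)
letters = letter.findall(text)
same_result = digit_long.findall(text) == digits

print(digits, letters, same_result)
```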

Complementing character classes

To search for a character that is not in a character class, place a caret (^) immediately after the opening square bracket of the class.

Examples:

/[^a]/ matches any single character except ‘a’

/[^0-9]/ matches any single character except a digit
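A short sketch of complemented classes, again using Python's re module as a stand-in for Perl (an assumption); the input strings are invented for the example.

```python
import re

# The caret directly after '[' negates the class.
not_a = re.findall(r"[^a]", "banana")      # every character that is not 'a'
non_digits = re.findall(r"[^0-9]", "a1b2")  # every character that is not a digit

print(not_a, non_digits)
```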

Matching 0 or 1 occurrence

The ‘?’ matches zero or one occurrence of the preceding expression.

Examples:

/a?/ matches ‘a’ or ‘’ (nothing)

/cats?/ matches ‘cat’ or ‘cats’

Note that the “preceding expression”, in these examples, is a single letter. We’ll see how to form longer expressions later.
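The optional marker can be checked with a small sketch in Python's re module (assumed in place of Perl). It uses fullmatch, which requires the whole string to match, so that longer strings are not accepted by accident.

```python
import re

# 's?' makes the final 's' optional; '?' applies only to that one letter.
pattern = re.compile(r"cats?")

results = {s: bool(pattern.fullmatch(s)) for s in ["cat", "cats", "catss", "ca"]}

print(results)
```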

The Kleene star and plus

The Kleene star (*) matches zero or more occurrences of the preceding expression.

Examples:

/a*/ matches ‘’, ‘a’, ‘aa’, ‘aaa’, etc.

/[ab]*/ matches ‘’, ‘a’, ‘b’, ‘aa’, ‘ab’, ‘ba’, ‘bb’, etc.

The Kleene plus (+) matches one or more occurrences of the preceding expression.

The plus is a convenience rather than a necessity: /[ab]+/ is equivalent to /[ab][ab]*/
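The sketch below compares the star, the plus, and the expanded form of the plus on a few made-up strings. It is written with Python's re module (an assumption, standing in for Perl) and uses fullmatch so that the whole string must match.

```python
import re

star = re.compile(r"[ab]*")               # zero or more of 'a' or 'b'
plus = re.compile(r"[ab]+")               # one or more of 'a' or 'b'
plus_expanded = re.compile(r"[ab][ab]*")  # the plus rewritten with the star

samples = ["", "a", "ab", "bba", "abc"]

star_ok = [s for s in samples if star.fullmatch(s)]
plus_ok = [s for s in samples if plus.fullmatch(s)]
expanded_ok = [s for s in samples if plus_expanded.fullmatch(s)]

# Only the star accepts the empty string; 'abc' fails everywhere because of 'c'.
print(star_ok, plus_ok, expanded_ok)
```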

Wildcard

The period (.) matches any single character except the newline (\n).
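A small sketch of the wildcard, using Python's re module in place of Perl (an assumption). The pattern c.t and the test strings are invented for illustration.

```python
import re

# '.' stands for exactly one character, whatever it is (except a newline).
pattern = re.compile(r"c.t")

samples = ["cat", "cot", "c t", "ct", "c\nt"]
hits = [s for s in samples if pattern.fullmatch(s)]

# 'ct' has no middle character, and 'c\nt' has a newline in the middle.
print(hits)
```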

Anchors

Anchors are used to restrict a match to a particular position within a string.

^ anchors to the start of a string

$ anchors to the end of a string

/[Ff]red/ matches both ‘Fred’ and ‘Fred is home’

/^[Ff]red$/ matches ‘Fred’ but not ‘Fred is home’

\b anchors to a word boundary

\B anchors to a non-boundary
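The anchors above can be sketched as follows, using Python's re module as a stand-in for Perl (an assumption). The \b and \B patterns and the string ‘Freddy’ are hypothetical additions, chosen to show the boundary anchors in action.

```python
import re

unanchored = re.compile(r"[Ff]red")
anchored = re.compile(r"^[Ff]red$")    # the whole string must be 'fred' or 'Fred'
whole_word = re.compile(r"\bFred\b")   # 'Fred' with a word boundary on both sides
inside_word = re.compile(r"Fred\B")    # 'Fred' followed by another word character

unanchored_hit = bool(unanchored.search("Fred is home"))
anchored_sentence = bool(anchored.search("Fred is home"))
anchored_alone = bool(anchored.search("Fred"))
word_in_sentence = bool(whole_word.search("Fred is home"))
word_in_freddy = bool(whole_word.search("Freddy is home"))
nonboundary_in_freddy = bool(inside_word.search("Freddy is home"))

print(unanchored_hit, anchored_sentence, anchored_alone)
print(word_in_sentence, word_in_freddy, nonboundary_in_freddy)
```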

Conjunction

Two regular expressions are conjoined by juxtaposition (placing the expressions side by side).

Examples:

/a/ matches ‘a’

/m/ matches ‘m’

/am/ matches ‘am’ but not ‘a’ or ‘m’ alone

Disjunction

We have already seen disjunction of characters using the square bracket notation.

General disjunction is expressed using the vertical bar (|), also called the pipe symbol.

This form of disjunction allows us to match any one of several alternative patterns, not just single characters as with the [ ] form.
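A minimal sketch of general disjunction in Python's re module (assumed in place of Perl). The alternatives ‘cat’ and ‘dog’ and the sample sentence are made up for the example.

```python
import re

# Each side of '|' is a complete alternative pattern, not a single character.
pattern = re.compile(r"cat|dog")

found = pattern.findall("a cat chased a dog and a bird")

print(found)
```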

Grouping

  • Parentheses, ‘(’ and ‘)’, are used to group subpatterns of a larger pattern.

  • Ex: /[Gg](ee|oo)se/ matches ‘goose’, ‘geese’, ‘Goose’ and ‘Geese’
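The goose/geese search from the Problem slide can be sketched with the alternation placed inside a single group, so that | chooses only between ‘ee’ and ‘oo’. The code uses Python's re module rather than Perl (an assumption); the extra test words and the ungrouped comparison pattern are hypothetical.

```python
import re

# The parentheses confine the alternation to the two vowel pairs.
grouped = re.compile(r"[Gg](ee|oo)se")

words = ["goose", "geese", "Goose", "Geese", "gase", "moose"]
matched = [w for w in words if grouped.fullmatch(w)]

# Without a group around the alternation, '|' splits the whole pattern
# into two alternatives: '[Gg]ee' or 'oose'.
ungrouped = re.compile(r"[Gg]ee|oose")
ungrouped_geese = bool(ungrouped.fullmatch("geese"))  # neither alternative spells 'geese'
ungrouped_moose = bool(ungrouped.search("moose"))     # 'oose' is found inside 'moose'

print(matched, ungrouped_geese, ungrouped_moose)
```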

Replacement

In addition to matching, we can do replacements when a match is found:

Example:

To replace the British spelling ‘colour’ with the American spelling ‘color’, we can write:

s/colour/color/
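A sketch of the same substitution in Python's re module (assumed here in place of Perl). One difference worth noting: Perl's s/colour/color/ replaces only the first match unless the g modifier is added, whereas Python's re.sub replaces every match unless count is given. The sample sentence is invented.

```python
import re

text = "The colour of the sky; colours fade."

# count=1 mimics Perl's s/colour/color/ (first match only).
first_only = re.sub(r"colour", "color", text, count=1)

# The default mimics Perl's s/colour/color/g (all matches).
everywhere = re.sub(r"colour", "color", text)

print(first_only)
print(everywhere)
```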

Registers – saving matches

  • To save a match from part of a pattern, to reuse it later on, Perl provides registers

  • Registers are named \#, where # is the number of the register

  • Ex.

    DE DO DO DO DE DA DA DA

    IS ALL I WANT TO SAY TO YOU

    /(D[AEO].)*/ will match the first line

    /(D[AEO])(.D[AEO])\2\2\s\1(.D[AEO])\3\3/ matches it more specifically

    This pattern also matches strings like DA DE DE DE DA DO DO DO

    \s matches a whitespace character
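The register example can be sketched in Python's re module (assumed in place of Perl), where \1, \2 and \3 refer back to the saved groups in the same way. The back-references are written without intervening spaces, because groups 2 and 3 already capture the separator in front of each syllable. The non-matching string is a hypothetical counterexample.

```python
import re

# Group 1: first syllable; group 2: separator + second syllable;
# group 3: separator + the syllable that ends the line.
pattern = re.compile(r"(D[AEO])(.D[AEO])\2\2\s\1(.D[AEO])\3\3")

first_line = "DE DO DO DO DE DA DA DA"
other_line = "DA DE DE DE DA DO DO DO"
broken_line = "DE DO DA DO DE DA DA DA"  # second syllable is not repeated

m = pattern.fullmatch(first_line)
saved = m.groups() if m else None

other_ok = bool(pattern.fullmatch(other_line))
broken_ok = bool(pattern.fullmatch(broken_line))

print(saved, other_ok, broken_ok)
```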

For more information

  • Perl regular expression tutorial (perlretut)

    • http://perldoc.perl.org/perlretut.html

  • Perl regular expression reference page (perlre)

    • http://perldoc.perl.org/perlre.html
