# LING 388: Language and Computers

LING 388: Language and Computers. Sandiway Fong 9/20 Lecture 8. Administrivia. Homework 3 due tonight at midnight. Today’s Topic. Regular Expressions (RE) used for searching text (information extraction applications and text processing)

### LING 388: Language and Computers

Sandiway Fong

9/20

Lecture 8

Administrivia

• Homework 3

• due tonight at midnight

Today’s Topic

• Regular Expressions (RE)

• used for searching text (information extraction applications and text processing)

• an (industry) standard notation for specifying a search pattern

Write a Prolog program to enumerate the integer line, i.e. a program that would print out all and only the numbers on the integer line (given enough time…)

Where would you start?

• Program

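% nn/1 enumerates the natural numbers 1, 2, 3, ...
nn(1).
nn(N) :- nn(M), N is M+1.

% int/1 gives 0 first, then each natural number followed by its negation
int(0).
int(N) :- nn(M), (N = M ; N is -M).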

• Output:

?- int(X).

X = 0 ;

X = 1 ;

X = -1 ;

X = 2 ;

X = -2 ;

X = 3 ;

X = -3 ;

X = 4 ;

X = -4 ;

X = 5 ;

X = -5 ;

X = 6

Used predicate name int/1 since integer/1 is taken in SWI Prolog
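
For comparison, the same interleaving idea can be sketched outside Prolog. The Python generator below is only an illustration and not part of the lecture; the name `integers` is made up here. It yields 0 first, then each natural number followed by its negation, mirroring int/1.

```python
from itertools import count, islice

def integers():
    """Yield 0, 1, -1, 2, -2, ... so that every integer eventually appears."""
    yield 0
    for m in count(1):   # m plays the role of nn(M): 1, 2, 3, ...
        yield m          # corresponds to N = M
        yield -m         # corresponds to N is -M

# Take a finite prefix of the infinite enumeration.
first_twelve = list(islice(integers(), 12))
print(first_twelve)
```

Like the Prolog query, the generator never terminates on its own; `islice` plays the role of stopping after a few presses of `;`.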

[Figure: Regular Expressions and Regular Grammars]

Regular Expressions

• (formally) equivalent to

• finite state automata (FSA), and

• regular grammars

• used in

• string pattern matching

• typically for a single word form

• search text: unix (e)grep, Perl, Microsoft Word

• caution:

• differences in notation and implementation

• Regular Expressions

shorthand for describing sets of strings

• String

• sequence of zero or more characters

• (typically, unbroken by spaces)

• Examples

• aaa

• john

• mary45

• mary 45

• NT$

• ε (empty string)

• Regular Expressions

• shorthand

• stringⁿ

• exactly n occurrences of string

• n = 0,1,2,3,...

• examples

• a⁴b³ = aaaabbb

• (uv)² = uvuv

• ((ab)²(ba)²)² = ababbabaababbaba

• Note:

• parentheses are used to group sequences of characters (strings)
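
The repetition shorthand can be checked mechanically. The sketch below assumes Python, where `*` on strings repeats them, and uses the `{n}` repetition operator of Python's `re` module as the regex counterpart of the superscript; the variable names are illustrative only.

```python
import re

# String repetition mirrors the superscript shorthand.
a4b3 = "a" * 4 + "b" * 3              # a^4 b^3
uv2 = "uv" * 2                        # (uv)^2
nested = ("ab" * 2 + "ba" * 2) * 2    # ((ab)^2 (ba)^2)^2
print(a4b3, uv2, nested)

# The same nested count written as a regular expression with {n}.
pattern = re.compile(r"((ab){2}(ba){2}){2}")
matches_nested = pattern.fullmatch(nested) is not None
```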

shorthand for describing sets of strings

string+

set of one or more occurrences of string

i.e. the set {string¹, string², string³, ...}

Note: set is infinite

examples

a+

= {a, aa, aaa, aaaa, aaaaa, …}

(abc)+

= {abc, abcabc, abcabcabc, …}
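
Since the set is infinite, a program can only list a finite prefix of it. A minimal sketch, assuming Python and a hypothetical helper `plus_members`:

```python
def plus_members(s, k):
    """Return the first k members of s+, i.e. s, ss, sss, ..."""
    return [s * n for n in range(1, k + 1)]

a_plus = plus_members("a", 5)
abc_plus = plus_members("abc", 3)
print(a_plus)
print(abc_plus)
```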

Regular Expressions

shorthand for describing sets of strings

string*

set of zero or more occurrences of string

i.e. the set {string⁰, string¹, string², string³, ...}

string⁰ = ε (the empty string)

examples

a* = {ε, a, aa, aaa, aaaa, …}

(abc)* = {ε, abc, abcabc, …}

Note:

a a* = a+

a {ε, a, aa, aaa, aaaa, …}

= {a, aa, aaa, aaaa, aaaaa, …}

Regular Expressions

Language = a set of strings

• Wildcard Characters

matches a range of characters

. (period)

matches any single character

examples

.+ed

= set of all strings of length 3 or greater containing ed and having at least one character preceding it

matches: worked, bed, pre-education

does not match: ed, education

.*fix

= set of all strings of length 3 or greater containing fix

matches: prefix, infix, infixed, suffix, fix
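
These examples can be tried out directly. The sketch assumes Python's `re` module and uses `re.search`, which looks for the pattern anywhere inside the string; that corresponds to the "containing" reading used above.

```python
import re

# .+ed needs at least one character before "ed" somewhere in the word.
words = ["worked", "bed", "pre-education", "ed", "education"]
ed_hits = [w for w in words if re.search(r".+ed", w)]

# .*fix allows zero characters before "fix", so "fix" itself is included.
fix_words = ["prefix", "infix", "infixed", "suffix", "fix"]
fix_hits = [w for w in fix_words if re.search(r".*fix", w)]

print(ed_hits)
print(fix_hits)
```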

Regular Expressions

• Wildcard Characters

matches a range of characters

[characters] (list of matching characters)

matches any single character in the list

• examples

• [s,z]ation

• organization

• organisation

• [a-z]

• any character in the

• range lowercase a to z

• Note: not uppercase

• [0-9]

• any digit
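
A short sketch of the bracket notation, assuming Python's `re` module. One caution on notation: in Python (and grep) the characters inside the brackets are listed without separators, so `[sz]` is the usual spelling, and a comma written inside the brackets is itself treated as one of the matching characters.

```python
import re

spellings = ["organization", "organisation", "organiation"]
sz_hits = [w for w in spellings if re.search(r"[sz]ation", w)]

# With [s,z] the comma is one more character in the list.
comma_is_literal = re.search(r"[s,z]ation", ",ation") is not None

# Ranges: lowercase only, and digits.
lower_q = re.fullmatch(r"[a-z]", "q") is not None
upper_q = re.fullmatch(r"[a-z]", "Q") is not None
digit_7 = re.fullmatch(r"[0-9]", "7") is not None
```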

ASCII chart: computers only understand numbers

American Standard Code for Information Interchange.
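
The character codes explain why a range such as [a-z] makes sense: the lowercase letters occupy consecutive codes. A small illustration, assuming Python's built-in `ord` and `chr`:

```python
# Character codes for the endpoints of the three common ranges.
codes = {ch: ord(ch) for ch in ["0", "9", "A", "Z", "a", "z"]}
print(codes)

# Walking the codes from 'a' to 'z' recovers the lowercase alphabet.
lowercase = "".join(chr(c) for c in range(ord("a"), ord("z") + 1))
print(lowercase)
```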

• One of the most popular programs for searching files and returning lines that match a regular expression pattern is called grep

• name comes from Unix ed command g/re/p

• “search globally for lines matching the regular expression, and print them”

• [Source: http://en.wikipedia.org/wiki/Grep]

• Most programming languages, e.g. C, C++, Java (initially) etc., don’t come with regular expression search standard…

• However (later) programming languages, e.g. Perl, have standardized on grep’s syntax and expanded on its functionality.

• (Java has java.util.regex. Python has a re module.)

excerpts from the grep manpage

The caret ^ and the dollar sign $ are metacharacters that respectively match the empty string at the beginning and end of a line.

The symbol \b matches the empty string at the edge of a word

The symbols \< and \> respectively match the empty string at the beginning and end of a word.

terminology

word

unbroken sequence of digits, underscores and letters
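
The anchors can be illustrated as follows. This is a sketch in Python rather than grep: Python's `re` module needs the `re.MULTILINE` flag for ^ and $ to apply to each line, and it does not provide \< and \>, so \b is used for both word edges. The sample text is made up.

```python
import re

text = "the cat sat\ncatalog of cats"

# ^ and $ anchor at the beginning and end of each line (with MULTILINE).
line_starts = re.findall(r"^cat", text, flags=re.MULTILINE)
line_ends = re.findall(r"cats$", text, flags=re.MULTILINE)

# \b on both sides: "cat" as a whole word only.
whole_word = re.findall(r"\bcat\b", text)

# \b on the left only: any word beginning with "cat".
word_initial = re.findall(r"\bcat", text)
```

Only the second line starts with cat (in catalog) and ends with cats; the whole-word pattern finds just the cat in the first line, while the word-initial pattern also finds the starts of catalog and cats.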

Regular Expressions: grep

• Excerpts from the manpage

• A regular expression may be followed by one of several repetition operators:

• ? The preceding item is optional and matched at most once.

• * The preceding item will be matched zero or more times.

• + The preceding item will be matched one or more times.

• {n} The preceding item is matched exactly n times

• {n,} The preceding item is matched n or more times.

• {n,m} The preceding item is matched at least n times, but not more than m times.

Two regular expressions may be concatenated; the resulting regular expression matches any string formed by concatenating two substrings that respectively match the concatenated subexpressions.

disjunction

Two regular expressions may be joined by the infix operator |; the resulting regular expression matches any string matching either subexpression.
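
Each operator can be exercised on a few strings. The sketch assumes Python's `re` module, whose syntax for these operators agrees with the excerpt; `accepts` is a hypothetical helper that tests whether the whole string matches.

```python
import re

def accepts(pattern, s):
    """True if the entire string s is described by pattern."""
    return re.fullmatch(pattern, s) is not None

results = {
    "?": [accepts(r"ab?", s) for s in ("a", "ab", "abb")],
    "{n}": [accepts(r"a{3}", s) for s in ("aa", "aaa", "aaaa")],
    "{n,}": [accepts(r"a{2,}", s) for s in ("a", "aa", "aaaa")],
    "{n,m}": [accepts(r"a{2,3}", s) for s in ("a", "aa", "aaa", "aaaa")],
    "concatenation": accepts(r"ab*c+", "abbcc"),
    "|": [accepts(r"cat|dog", s) for s in ("cat", "dog", "cow")],
}
for operator, outcome in results.items():
    print(operator, outcome)
```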

Regular Expressions: GNU grep

Excerpts from the manpage

gupp(y|ies)

examples

guppy

guppies

Regular Expression

beds?

examples

bed

beds
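
The two patterns behave the same way in Python's `re` module, which is assumed for this sketch. (With GNU grep, the unescaped `|`, `?` and parentheses used here belong to the extended syntax, i.e. `egrep` or `grep -E`.) The non-matching words guppys and bedss are made-up test cases.

```python
import re

guppy = re.compile(r"gupp(y|ies)")   # disjunction inside a group
bed = re.compile(r"beds?")           # optional final s

guppy_results = {w: bool(guppy.fullmatch(w)) for w in ["guppy", "guppies", "guppys"]}
bed_results = {w: bool(bed.fullmatch(w)) for w in ["bed", "beds", "bedss"]}

print(guppy_results)
print(bed_results)
```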

Regular Expressions: Examples

• Example

• \b99 matches

• 99 in “there are 99 bottles …”

• but not in

• 99 in “there are 299 bottles …”

• Note:

• in $99, the $ is not part of a word, so there is a word edge between $ and 99, and \b99 will match 99 here

• word

• unbroken sequence of digits, underscores and letters
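
The three cases can be checked with Python's `re` module, assumed here because its \b uses the same notion of word (letters, digits and underscores). The third sentence is a made-up example containing $99.

```python
import re

pat = re.compile(r"\b99")

# 99 starts a word here, so there is a boundary before it.
in_99 = pat.search("there are 99 bottles") is not None

# Inside 299 the digits 2 and 9 are both word characters: no boundary.
in_299 = pat.search("there are 299 bottles") is not None

# $ is not a word character, so a boundary sits between $ and 99.
in_dollar = pat.search("it costs $99") is not None
```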

Regular Expressions: Examples

• Example (sheeptalk)

• baa!

• baaa!

• baaaa!

• regular expression

• baaa*!

• baa+!
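
Both expressions describe the same sheeptalk language: b, at least two a's, then !. A quick check, assuming Python's `re` module; the non-sheeptalk strings in `samples` are made up for contrast.

```python
import re

samples = ["ba!", "baa!", "baaa!", "baaaa!", "baa", "abaa!"]

# Two a's followed by zero or more a's ...
star_hits = [s for s in samples if re.fullmatch(r"baaa*!", s)]
# ... is the same as one a followed by one or more a's.
plus_hits = [s for s in samples if re.fullmatch(r"baa+!", s)]

print(star_hits)
print(plus_hits)
```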

Regular Expressions: Microsoft Word

• terminology:

• wildcard search

Regular Expressions: Microsoft Word

From American National Corpus (ANC), Slate Magazine 8/12/1999

By James Surowiecki

Will There Be Life After Greenspan?

If you blinked you missed it, but for a short while yesterday morning the stock and bond markets dived after a rumor that Alan Greenspan was resigning hit the Street. The story was quickly ... well, it wasn't exactly refuted, since Greenspan didn't say actually say "I'm not resigning," but it was rejected as unlikely, and both markets rebounded nicely. Fleeting as it was, the momentary episode of selling panic was interesting for a couple of reasons. In the first place, the rumor had all the makings of a story that was being floated by someone who had taken a large short position (in other words, who was wagering that the market was going down) and was trying to knock the market down after it opened strongly. There's something weirdly old-fashioned about the idea that a big Wall Street insider could say "Pssst! Hey, buddy, I hear Al's on his way out!" and send stocks tumbling. It fits our ideas of the 1920s, when the market was incredibly manipulable, or even of the 1980s more than it does the late 1990s. But the truth is that in the short run, markets can occasionally be pushed, especially when so many decisions to buy or sell are keyed off what everyone else in the market is doing. Chain reactions are not much harder to start (in fact, given how quickly price moves get noticed, they may be easier) than they were 70 years ago. All that notwithstanding, the interesting thing about the Greenspan resignation rumor was that it raised an obvious question: Would it really matter? As Jacob Weisberg just pointed out in "Ballot Box," Steve Forbes is apparently the only American who doesn't think Greenspan has done a terrific job as Fed chairman. And most of us would be happy to have Greenspan stay in office even after his current term expires in the middle of next year.

But it's interesting to note that in the past couple of months there have been more than a few voices--including those of economists Greg Mankiw and Robert Barr--suggesting that Greenspan has been more the beneficiary of good economic fundamentals than the creator of them.

That position may be a bit overstated, particularly since Greenspan has shown an unusual ability to let his thinking on inflation, productivity, and the economy's possible growth rate evolve in response to changing data. But the essential point, that the soundness of this economy does not depend on Greenspan's presence at the head of the Fed, is right. That might not be the case if Greenspan's successor were either an inflation dove like William Greider or a perma-bear like Jim Grant. But whoever would succeed Greenspan would be nothing of the sort. He or she would be, in a word, Greenspanian, still concerned about the possibility of an overheating economy but also convinced that important technological changes have allowed this economy to grow faster than in the past without sparking inflation. If anything, in fact, the bond market should have rallied on news that Greenspan might be stepping down, since he has long since stopped being paranoid enough for bondholders, who seem perpetually convinced that the United States is about to become Brazil. There are certainly Fed governors out there who would be far more likely to raise interest rates aggressively at the first hint of price pressures than Greenspan. The momentary sell-off, though, was not driven by any rational consideration of what Greenspan's departure might mean. Instead, everyone assumes that Greenspan's resignation will knock down the market, so Greenspan's resignation--or rumors of it--knocks down the market. But this is not the summer of 1998 or the fall of 1997. We don't need Greenspan to reassure us that the world isn't going to fall apart anymore. When he leaves, the market will hiccup. But it would be surprising if it did more than that.

Data file available on class homepage as Article247_300.txt

• Let’s create a regular expression in Microsoft Word to look for decades in the text (and highlight them)

• Example: … the late 1990s.
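
As a point of comparison only, here is how a decade pattern might look in grep-style syntax, sketched with Python's `re` module. Microsoft Word's wildcard notation differs, so this is not the Word answer. The pattern assumes a decade is written as three digits, a 0, and a final s; the snippet is quoted from the article above.

```python
import re

snippet = (
    "It fits our ideas of the 1920s, when the market was incredibly "
    "manipulable, or even of the 1980s more than it does the late 1990s. "
    "But this is not the summer of 1998 or the fall of 1997."
)

# Word edge, three digits, a 0, then s, then another word edge.
decade = re.compile(r"\b[0-9]{3}0s\b")
decades = decade.findall(snippet)
print(decades)
```

Plain years such as 1998 and 1997 are skipped because they lack the final s.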