LING/C SC/PSYC 438/538

Presentation Transcript


  1. LING/C SC/PSYC 438/538 Lecture 9 Sandiway Fong

  2. Administrivia
     • Homework 2 graded
     • Today's topics
        • Homework 3 review
        • Named Entity Recognition (NER)

  3. Homework 3 Review
     • Extract all the dollar amounts from WSJ9_001e.txt within the <TEXT> </TEXT> markup.
     • Examples (actual):
        • $23 million
        • $38.3 billion
        • C$1,000
        • $95,142
        • $3.01
        • $38.375
        • $45
        • 16.125 Canadian dollars
     • Compute:
        • how many dollar amounts you extracted
        • the largest dollar amount
        • the smallest dollar amount
        • the median dollar amount
     • Appendix:
        • list all the dollar amounts
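The restriction to the <TEXT> </TEXT> markup can be sketched as below. This is a minimal sketch, not the course template: the helper name `text_regions` and the sample string are hypothetical, and it assumes the tags appear literally as `<TEXT>` and `</TEXT>` and that the whole file has been read into one string.

```perl
use strict;
use warnings;

# Return the contents of every <TEXT> ... </TEXT> region in a document string.
# text_regions is a hypothetical helper name, not part of the homework template.
sub text_regions {
    my ($doc) = @_;
    my @regions;
    # /s lets . match newlines; .*? keeps each match inside a single region
    while ($doc =~ m{<TEXT>(.*?)</TEXT>}gs) {
        push @regions, $1;
    }
    return @regions;
}

# Made-up sample, not taken from WSJ9_001e.txt
my $sample = "<HL> Costs \$5 </HL>\n<TEXT>\nIt paid \$23 million.\n</TEXT>\n";
my @regions = text_regions($sample);
print scalar(@regions), " region(s)\n";
```

Only the text returned by `text_regions` would then be searched for dollar amounts, so the `$5` in the headline of the sample is ignored.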

  4. Homework 3 Review
     • Start with the code template from last time…

  5. Homework 3 Review
     • Examples (actual):
        • $23 million
        • $38.3 billion
        • C$1,000
        • $95,142
        • $3.01
        • $38.375
        • $45
        • 16.125 Canadian dollars
     • regex (part by part):
        • numeric:
           • \$
           • \$[\d\.,]+
           • \$[\d\.,]*\d
        • word:
           • (tr|[mb])illions?
        • numeric (word):
           • \$[\d\.,]*\d\s*((tr|[mb])illions?|)
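One reason to prefer \$[\d\.,]*\d over \$[\d\.,]+ is sentence punctuation: the second form must end on a digit, so a trailing comma or period is not swallowed. A minimal sketch, with a made-up sentence and a hypothetical helper `first_match`:

```perl
use strict;
use warnings;

# Return the first substring of $s matched by $re, or undef if there is none.
sub first_match {
    my ($re, $s) = @_;
    return $s =~ $re ? $& : undef;
}

my $greedy   = qr/\$[\d\.,]+/;     # may swallow trailing punctuation
my $anchored = qr/\$[\d\.,]*\d/;   # must end on a digit

my $s = 'It rose to $38.375, then fell to $45.';   # made-up sentence
print first_match($greedy, $s), "\n";     # $38.375,
print first_match($anchored, $s), "\n";   # $38.375
```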

  6. Homework 3 Review
     • regex so far:
        • \$[\d\.,]*\d\s*((tr|[mb])illions?|)
     • Decide which parts we want to extract and store them (group 1 is the numeric part, group 2 is the word):
          while ($line =~ /\$([\d\.,]*\d)\s*((tr|[mb])illions?|)/g) {
            $numeric = $1;
            $word = $2;           # save $2 first: a successful s/// resets the capture variables
            $numeric =~ s/,//g;   # remove commas
            …
          }
     • Compute value:
        • if $word is million: $numeric * 1000000
        • if $word is billion: $numeric * 1000000000
        • if $word is trillion: $numeric * 1000000000000
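The extraction and value computation can be put together roughly as follows. This is a sketch under assumptions, not the graded solution: the `%scale` table, the sub name `dollar_values`, and the stripping of a plural "s" from the word are illustrative choices.

```perl
use strict;
use warnings;

# Multiplier for each magnitude word (illustrative lookup table)
my %scale = (million => 1e6, billion => 1e9, trillion => 1e12);

# Return the numeric value of every $-prefixed amount found in a line.
sub dollar_values {
    my ($line) = @_;
    my @values;
    while ($line =~ /\$([\d\.,]*\d)\s*((tr|[mb])illions?|)/g) {
        my ($numeric, $word) = ($1, $2);   # copy both before any other match
        $numeric =~ s/,//g;                # remove commas
        $word =~ s/s$//;                   # "millions" -> "million"
        push @values, $numeric * ($scale{$word} // 1);   # no word: multiply by 1
    }
    return @values;
}

my @v = dollar_values('from $23 million to $95,142 and C$1,000');
print join(" ", @v), "\n";   # 23000000 95142 1000
```

Note that C$1,000 is counted here simply as 1000; no currency conversion is attempted.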

  7. Homework 3 Review
     • What about examples like these?
        • 16.125 Canadian dollars
        • 100 million Canadian dollars
     • regex:
        • numeric: \d[\d\.,]*
        • word: (tr|[mb])illions?
        • dollars: ([Cc]anadian|)\s+dollars?
     • code:
          while ($line =~ /(\d[\d\.,]*)\s*((tr|[mb])illions?|)\s*([Cc]anadian|)\s+dollars?/g) {
            $numeric = $1;
            $word = $2;           # save $2 first: a successful s/// resets the capture variables
            $numeric =~ s/,//g;   # remove commas
            …
          }
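A self-contained version of the spelled-out case might look like this. The sub name `spelled_dollar_values` and the `%scale` table are assumptions for illustration, and the input sentence is made up from the two examples on the slide.

```perl
use strict;
use warnings;

# Multiplier for each magnitude word (illustrative lookup table)
my %scale = (million => 1e6, billion => 1e9, trillion => 1e12);

# Return the value of every amount written as "<number> [word] [Canadian] dollars".
sub spelled_dollar_values {
    my ($line) = @_;
    my @values;
    while ($line =~ /(\d[\d\.,]*)\s*((tr|[mb])illions?|)\s*([Cc]anadian|)\s+dollars?/g) {
        my ($numeric, $word) = ($1, $2);   # copy both before any other match
        $numeric =~ s/,//g;                # remove commas
        $word =~ s/s$//;                   # "millions" -> "million"
        push @values, $numeric * ($scale{$word} // 1);
    }
    return @values;
}

my @v = spelled_dollar_values('16.125 Canadian dollars and 100 million Canadian dollars');
print join(" ", @v), "\n";   # 16.125 100000000
```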

  8. Homework 3 Review
     • Run:
        • perl dollar.perl WSJ9_001e.txt
     • Output:
        • Number: 342
        • Smallest: 1.55
        • Largest: 3100000000000
        • Median(#170,#171): 85944.5
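The four statistics can be computed from the list of extracted values roughly as below. This is a sketch with a hypothetical `summarize` helper and made-up input values. For an even count the median is the mean of the two middle elements; with 342 values those have zero-based indices 170 and 171, which is consistent with the label in the output above if that label is zero-based.

```perl
use strict;
use warnings;

# Return (count, smallest, largest, median) for a list of numbers.
sub summarize {
    my @sorted = sort { $a <=> $b } @_;   # numeric sort, not the default string sort
    my $n = @sorted;
    return (0) unless $n;
    my $mid = int($n / 2);
    my $median = $n % 2
        ? $sorted[$mid]                             # odd count: middle element
        : ($sorted[$mid - 1] + $sorted[$mid]) / 2;  # even count: mean of the two middle elements
    return ($n, $sorted[0], $sorted[-1], $median);
}

# Made-up values, not the homework data
my ($n, $min, $max, $median) = summarize(45, 3.01, 23e6, 95142);
print "Number: $n\nSmallest: $min\nLargest: $max\nMedian: $median\n";
```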

  9. Named Entity Recognition (NER)

  10. Named Entity Recognition (NER)
     • Named-Entity Recognition (NER), also called Identification and Extraction, tries to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages, etc. [paraphrased from http://en.wikipedia.org/wiki/Named-entity_recognition]

  11. Example: WSJ9_002.txt

  12. Illinois NER System
     • NLP systems might also compute: anaphora reference
     • http://cogcomp.cs.illinois.edu/demo/ner/

  13. Textbook
     • See JM Chapter 22: Information Extraction
        • 22.1 Named Entity Recognition
        • 22.2 Relation Detection and Classification
     • Also Chapter 21 for Anaphora Resolution

  14. JM Chapter 22

  15. JM Chapter 22

  16. JM Chapter 22
     • Ambiguity: sometimes systematic, sometimes not

  17. JM Chapter 22
     • Word-by-word labeling (IOB: "inside, outside, beginning")
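IOB labeling assigns one tag per token: B-TYPE for the first token of an entity, I-TYPE for the tokens that continue it, and O for everything else. A minimal sketch, in which the sentence, the span format, and the type names PER and LOC are made up for illustration:

```perl
use strict;
use warnings;

# Convert tokens plus entity spans into IOB tags.
# Each span is [start, end, type] with inclusive, zero-based token indices.
sub iob_tags {
    my ($tokens, $spans) = @_;
    my @tags = ('O') x @$tokens;                      # default: outside any entity
    for my $span (@$spans) {
        my ($start, $end, $type) = @$span;
        $tags[$start] = "B-$type";                    # beginning of the entity
        $tags[$_] = "I-$type" for $start + 1 .. $end; # inside the entity
    }
    return @tags;
}

my @tokens = qw(Mary Smith visited New York);
my @tags = iob_tags(\@tokens, [[0, 1, 'PER'], [3, 4, 'LOC']]);
print "$tokens[$_]\t$tags[$_]\n" for 0 .. $#tokens;
```

A tagger trained on such data predicts the tag column from the token column, one word at a time.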

  18. JM Chapter 22
     • POS information
     • Shape
     • Syntactic chunking
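A word-shape feature abstracts a token to its pattern of character classes. The mapping below (uppercase to X, lowercase to x, digit to d, everything else unchanged) is one common convention and is assumed here; the textbook's exact definition may differ.

```perl
use strict;
use warnings;

# Map a token to a coarse shape: uppercase -> X, lowercase -> x, digit -> d.
# Other characters (punctuation, symbols) are left unchanged.
sub word_shape {
    my ($token) = @_;
    (my $shape = $token) =~ s/[A-Z]/X/g;   # copy the token, then rewrite the copy
    $shape =~ s/[a-z]/x/g;                 # run before the digit step so its "d" output survives
    $shape =~ s/\d/d/g;
    return $shape;
}

print word_shape($_), "\n" for qw(Tucson WSJ9 C$1,000);
```

Shapes such as Xxxxxx (capitalized word) or X$d,ddd (currency amount) give a classifier evidence about a token's type without relying on the exact string.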

  19. JM Chapter 22

  20. JM Chapter 22
     • What features to use in making a decision (also used for machine learning)
