CC 437: Advanced Natural Language Engineering

  1. CC 437: Advanced Natural Language Engineering Week 6, Class Assignment 2 ANLE

  2. Goal of this class
     • We’ll go over the assignment in more detail
     • Software (SW) that may be used

  3. The system you have to build
     • Input: a string of words (possibly a complete sentence)
       LIST THE ESTATE AGENTS IN STRATFORD, LONDON.
       I AM LOOKING FOR A CAR MECHANIC IN WIVENHOE
     • Minimum output: a query for a Web search engine
       ("ESTATE AGENT" OR PROPERTY OR "REAL ESTATE") AND STRATFORD AND LONDON
     • Possible extension (10%): actually access a search engine
       E.g., GOOGLE: http://www.google.com/search?q=stratford+london+%22estate+agent%22+OR+%22real+estate%22+OR+property
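
The step from a list of query terms to the concrete search URL on slide 3 can be sketched as follows. This is a minimal illustration, not part of the assignment software: the function name `build_search_url` is hypothetical, and the URL layout is simply the one shown on the slide.

```python
from urllib.parse import urlencode

def build_search_url(terms, base="http://www.google.com/search"):
    """Join the terms with spaces and URL-encode them as the 'q' parameter.

    urlencode turns spaces into '+' and double quotes into %22,
    which is the encoding used in the slide's example URL.
    """
    query = " ".join(terms)
    return base + "?" + urlencode({"q": query})

# Terms in the same order as the slide's example URL.
terms = ["stratford", "london", '"estate agent"', "OR", '"real estate"', "OR", "property"]
print(build_search_url(terms))
```

Phrases are passed with their double quotes included, so that the search engine treats them as exact phrases.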

  4. Reminder: the basic pipeline in IE systems
     PREPROCESSING -> LEXICAL PROCESSING -> SYNTACTIC PROCESSING -> SEMANTIC PROCESSING -> DISCOURSE PROCESSING

  5. Processing steps
     Example input: List the estate agents in Stratford, London.
     Stages: PREPROCESSING, LEXICAL PROCESSING, SYNTACTIC PROCESSING, SEMANTIC PROCESSING, WEB ACCESS
     Steps: TOKENIZATION, POS TAGGING, TERM IDENTIFICATION, STOP WORDS, SYNONYMS

  6. Processing Steps, II
     • Preprocessing:
       • Possibly: eliminate stop words
         LIST THE ESTATE AGENTS IN STRATFORD LONDON
       • Possibly: XML markup

  7. Preprocessing, I: tokenizing
     List the estate agents in Stratford, London
     PARAGRAPH MARKUP; TOKENIZER
     <W C='w'>List</W> <W C='w'>the</W> <W C='w'>estate</W> <W C='w'>agents</W> <W C='w'>in</W> <W C='w'>Stratford</W> <W C='w'>,</W> <W C='w'>London</W>
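
A tokenizer that produces this kind of markup can be sketched with a single regular expression. This is a minimal sketch under the assumption that a token is either a run of word characters or one punctuation mark; the names `tokenize` and `to_markup` are made up for the example, and the assignment itself suggests Perl or the Xerox tools for this step.

```python
import re
from xml.sax.saxutils import escape

# A token is a run of word characters, or a single non-space, non-word character.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)

def to_markup(tokens):
    """Wrap every token in a <W C='w'> element, escaping XML special characters."""
    return " ".join("<W C='w'>{}</W>".format(escape(tok)) for tok in tokens)

print(to_markup(tokenize("List the estate agents in Stratford, London")))
```

The comma comes out as its own token, as on the slide, so later stages can interpret it separately.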

  8. Processing Steps, II
     • LEXICAL PROCESSING:
       • POS TAGGING
         THE -> THE/DT; ESTATE -> ESTATE/NN
       • STEMMING / LEMMATIZATION
         AGENTS -> AGENT (or even: AGENT + N + PL)

  9. Lexical Processing, I: POS tagging
     <W C='VB'>List</W> <W C='DT'>the</W> <W C='NN'>estate</W> <W C='NNS'>agents</W> <W C='IN'>in</W> <W C='NNP'>Stratford</W> <W C='CM'>,</W> <W C='NNP'>London</W>
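
To show the shape of the tagger's output without installing a real tagger, here is a toy stand-in. It is not the Brill tagger or the Xerox tagger: the lexicon `LEXICON` is a hypothetical hand-written table covering only the example sentence, and the fallback rule (capitalised unknown word = NNP, otherwise NN) is an assumption made for illustration.

```python
# Hypothetical mini-lexicon; a real tagger learns this from a corpus.
LEXICON = {
    "list": "VB",
    "the": "DT",
    "estate": "NN",
    "agents": "NNS",
    "in": "IN",
    ",": "CM",
}

def tag(tokens):
    """Return (token, tag) pairs using the lexicon and a simple fallback."""
    tagged = []
    for tok in tokens:
        pos = LEXICON.get(tok.lower())
        if pos is None:
            # Assumed fallback: unknown capitalised words are proper nouns.
            pos = "NNP" if tok[:1].isupper() else "NN"
        tagged.append((tok, pos))
    return tagged

def to_markup(tagged):
    """Render the pairs in the slide's <W C='TAG'> notation."""
    return " ".join("<W C='{}'>{}</W>".format(pos, tok) for tok, pos in tagged)

tokens = ["List", "the", "estate", "agents", "in", "Stratford", ",", "London"]
print(to_markup(tag(tokens)))
```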

  10. Lexical Processing, II: lemmatizing / stemming
     <W C='VB'>List</W> <W C='DT'>the</W> <W C='NN'>estate</W> <W C='NNS'>agent</W> <W C='IN'>in</W> <W C='NNP'>Stratford</W> <W C='CM'>,</W> <W C='NNP'>London</W>
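
The AGENTS -> AGENT step can be approximated with a few suffix rules applied only to words tagged NNS. This is a rough sketch, not a real lemmatizer: the rules below are assumptions that handle regular English plurals only, and the function names are invented for the example. As on the slide, the tag itself is left unchanged.

```python
def lemmatize_noun(word):
    """Strip a regular plural ending; irregular plurals are not handled."""
    lower = word.lower()
    if lower.endswith("ies") and len(lower) > 4:
        return word[:-3] + "y"          # agencies -> agency
    if lower.endswith(("ches", "shes", "sses", "xes")):
        return word[:-2]                # boxes -> box
    if lower.endswith("s") and not lower.endswith("ss"):
        return word[:-1]                # agents -> agent
    return word

def lemmatize(tagged):
    """Lemmatize plural nouns (NNS) and keep every other token as it is."""
    return [(lemmatize_noun(w) if t == "NNS" else w, t) for w, t in tagged]

print(lemmatize([("estate", "NN"), ("agents", "NNS"), ("London", "NNP")]))
```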

  11. Processing Steps, II
     • SYNTACTIC PROCESSING:
       • Identify terms: "ESTATE AGENT"
       • Remove stopwords (e.g., words tagged as DT, IN, VB, … )

  12. Practical (partial) parsing: identifying search terms, filtering
     <SEARCHTERM> <W C='NN'>estate</W> <W C='NN'>agent</W> </SEARCHTERM>
     <SEARCHTERM> <W C='NNP'>Stratford</W> </SEARCHTERM>
     <BOOL> <W C='CM'>,</W> </BOOL>
     <SEARCHTERM> <W C='NNP'>London</W> </SEARCHTERM>
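
One way to get from tagged tokens to the grouping on slide 12 is a single left-to-right pass. The sketch below assumes that stop words are exactly the tags DT, IN and VB, that adjacent common nouns (NN, NNS) form one search term, that adjacent proper nouns (NNP) form one search term, and that a comma (CM) becomes a BOOL element. The function name `chunk` and the tuple output format are choices made for this example.

```python
STOP_TAGS = {"DT", "IN", "VB"}
COMMON_NOUN_TAGS = {"NN", "NNS"}
PROPER_NOUN_TAGS = {"NNP"}

def chunk(tagged):
    """Group tagged tokens into ("SEARCHTERM", text) and ("BOOL", text) items."""
    chunks = []
    current, current_kind = [], None

    def flush():
        # Close the search term being built, if there is one.
        nonlocal current, current_kind
        if current:
            chunks.append(("SEARCHTERM", " ".join(current)))
        current, current_kind = [], None

    for word, tag in tagged:
        if tag in COMMON_NOUN_TAGS:
            kind = "common"
        elif tag in PROPER_NOUN_TAGS:
            kind = "proper"
        else:
            # Stop words and unknown tags end the current term and are dropped;
            # a comma additionally leaves a BOOL marker behind.
            flush()
            if tag == "CM":
                chunks.append(("BOOL", word))
            continue
        if kind != current_kind:
            flush()
            current_kind = kind
        current.append(word)
    flush()
    return chunks

tagged = [("List", "VB"), ("the", "DT"), ("estate", "NN"), ("agent", "NN"),
          ("in", "IN"), ("Stratford", "NNP"), (",", "CM"), ("London", "NNP")]
print(chunk(tagged))
```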

  13. Processing Steps, II
     • SEMANTIC PROCESSING: "ESTATE AGENT" OR PROPERTY
     • QUERY FORMATION:
       • Abstract query
       • Concrete query

  14. Semantic processing: finding synonyms (or better keywords); interpreting stop words
     <SEARCHTERM> <W C='NN'>estate</W> <W C='NN'>agent</W> </SEARCHTERM>
     <BOOL TYPE='OR'></BOOL>
     <SEARCHTERM> <W C='NN'>real</W> <W C='NN'>estate</W> </SEARCHTERM>
     <BOOL TYPE='AND'></BOOL>
     <SEARCHTERM> <W C='NNP'>Stratford</W> </SEARCHTERM>
     <BOOL TYPE='AND'> <W C='CM'>,</W> </BOOL>
     <SEARCHTERM> <W C='NNP'>London</W> </SEARCHTERM>
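
Synonym expansion and abstract query formation can be sketched together. In this illustration the table `SYNONYMS` is a hypothetical hand-written stand-in for a WordNet lookup, and the conventions (multi-word terms are quoted, alternatives are joined with OR inside parentheses, separate terms are joined with AND) are taken from the example query on slide 3.

```python
# Hypothetical synonym table; the assignment suggests WordNet for this step.
SYNONYMS = {"estate agent": ["property", "real estate"]}

def quote(term):
    """Upper-case a term and put multi-word terms in double quotes."""
    term = term.upper()
    return '"{}"'.format(term) if " " in term else term

def expand(term):
    """Return the term, OR-ed with its synonyms when it has any."""
    alternatives = [term] + SYNONYMS.get(term.lower(), [])
    if len(alternatives) == 1:
        return quote(term)
    return "(" + " OR ".join(quote(a) for a in alternatives) + ")"

def form_query(search_terms):
    """AND together the (possibly expanded) search terms."""
    return " AND ".join(expand(t) for t in search_terms)

print(form_query(["estate agent", "Stratford", "London"]))
```

For the example input this yields the minimum output given on slide 3, with straight double quotes around the phrases.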

  15. Available tools:
     • LINUX:
       • Overall system control: Shell scripts, Perl, Java
       • Tokenizing: Perl + Regular Expressions
       • POS: Brill tagger
       • Lexical Expansion: WordNet (Java interface, command line)
     • WINDOWS:
       • Overall system control: Java, Batch files, Perl
       • Tokenizing, POS tagging: Xerox (Tokenizer, POS + Lemmatizer)
       • WordNet: Use Java interface

  16. Marking Scheme

  17. Optionals
     • Write a simple Web page interface to your search engine
     • Write your own lexical resource (see following classes)

  18. Deadline
     • Friday, December 12th, 12:00
