
A Model for Learning Words by Crawling the Web



Presentation Transcript


  1. A Model for Learning Words by Crawling the Web
     Jeff Thomson, Sygys.com
     Rex Gantenbein, University of Wyoming
     CAINE November 2009

  2. Overview
     • Goal: create an autonomous language learning system
     • Use Web crawler technology
     • Extract meaning from paragraphs and sentences to create language understanding
     • Major issues
       • Irregularity of natural language constructions
       • Understanding paragraphs and sentences
       • Determining meaning of new words

  3. Handling irregularities
     • Most major parts of a language (English, anyway) can be generalized
     • Exceptions require preprocessing to fit them into generalizable categories
     • Example: inflectional endings on verbs
         bat       is
         bats      am
         batting   are
         batted    was
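The contrast on this slide between regular endings (bat, bats, batting, batted) and irregular forms (is, am, are, was) can be illustrated with a minimal sketch. The names `EXCEPTIONS`, `SUFFIXES`, and `reduce_verb` are hypothetical, and the suffix rules are an assumption chosen only to cover the slide's example; they are not the authors' rule set.

```python
# Hypothetical exception table: irregular forms mapped to a root form.
EXCEPTIONS = {"is": "be", "am": "be", "are": "be", "was": "be"}

# Hypothetical suffix rules for regular verbs, tried in this order.
SUFFIXES = ("ing", "ed", "s")


def reduce_verb(word: str) -> str:
    """Return a root form, checking exceptions before the general rules."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix in SUFFIXES:
        # Require at least three letters to remain after stripping.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            # Undo consonant doubling before "ing"/"ed": "batt" -> "bat".
            if suffix != "s" and len(stem) >= 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word
```

Rules this naive over-strip some words (for example, "pass" would lose its final "s"), which is one reason a preprocessing step for exceptions is needed at all.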

  4. Handling irregularities
     • Idiomatic phrases require understanding of the entire phrase in a colloquial context
       “Go jump in the lake” vs. “Go cook yourself an egg”
     • Pronoun resolution
       “Three boys each bought a pizza. They ate them in the park.”

  5. Extracting understanding
     • Paragraph understanding
       • Matching paragraph structure to common forms
       • Finding the nucleus of the paragraph’s meaning
     • Sentence understanding
       • Matching sentence structure to common forms
       • Determining the meaning of the words in the sentence

  6. Our approach
     • Exception-first processing
     • Preprocessing to handle irregularities
     • Linguistic classifications based on tree structure

  7. Our approach
     • Parser (incorporated into Web crawler) to determine structure
       • Some structures are disregarded when keywords are already classified
     • Word classification
       • Type, gender, number
       • Unknown words are analyzed according to rules using placement in sentence and surrounding classified words
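As a rough illustration of analyzing an unknown word from its placement and its already classified neighbors, the sketch below looks up the classes of the previous and next words and applies a small rule table. The lexicon, the class labels, and the rules in `CONTEXT_RULES` are all assumptions made for this example; the slides do not give the actual rules.

```python
# Hypothetical lexicon of already classified words.
LEXICON = {
    "the": "determiner",
    "a": "determiner",
    "boy": "noun",
    "ate": "verb",
}

# Hypothetical rules: (class of previous word, class of next word) -> guess.
# None means "no neighbor" or "neighbor not yet classified".
CONTEXT_RULES = {
    ("determiner", "verb"): "noun",
    ("determiner", None): "noun",
    ("determiner", "noun"): "adjective",
    ("noun", "determiner"): "verb",
}


def classify_unknown(words, index, lexicon=LEXICON):
    """Guess a class for words[index] from its classified neighbors."""
    prev_cls = lexicon.get(words[index - 1]) if index > 0 else None
    next_cls = lexicon.get(words[index + 1]) if index + 1 < len(words) else None
    return CONTEXT_RULES.get((prev_cls, next_cls))
```

For the made-up word in "the boy snarfed a pizza", the neighbors "boy" (noun) and "a" (determiner) match the last rule, so the guess is "verb". When no rule matches, the function returns `None` and the word stays unclassified.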

  8. Our approach
     • Keyword recognition
       • Use “word chains” (sequences of words) with application of linguistic knowledge
     • Word-level understanding
       • Reduce words to root form to process them as keywords
       • Reduce irregular forms using an exception database created at preprocessing
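The slides define word chains only as "sequences of words". One possible reading, assumed here, is contiguous runs of a fixed length (n-grams) that recur across sentences. The function names and the `min_count` threshold are hypothetical.

```python
from collections import Counter


def word_chains(text, length=2):
    """Return every contiguous run of `length` words in the text."""
    words = text.lower().split()
    return [tuple(words[i:i + length]) for i in range(len(words) - length + 1)]


def frequent_chains(sentences, length=2, min_count=2):
    """Return the chains that occur at least `min_count` times overall."""
    counts = Counter()
    for sentence in sentences:
        counts.update(word_chains(sentence, length))
    return [chain for chain, n in counts.items() if n >= min_count]
```

Chains that recur across many sentences are candidates for being treated as a unit, which is the kind of evidence an idiom such as "Go jump in the lake" would need. The sketch does no punctuation handling and applies no linguistic knowledge; it only counts.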

  9. System model
     • Exception database
       • Separates generalizable and exception verbs
       • Processes word endings
       • Scans exception database for exceptions
       • Processes “normal” words according to rules

  10. System model
     • Categorization generator
       • Separates generalizable and exception words
       • Processes word endings
       • Scans exception database for exceptions and processes these first
       • Processes “normal” words according to rules
     • Sentence parser with disregard capacity
     • Paragraph understanding rules

  11. System model
     • Web crawler searches for source material
     • Processes the material and enhances its own rules and exceptions
     • Eventually will learn enough to understand most material in a given language
     • Future work
       • Implement a pilot version of this system
       • Determine how to control for a “given” language
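The slides note that a pilot version was still future work, so the following is only a sketch of what the "process the material and enhance its own rules" loop might look like. It assumes the crawler hands over page text as plain strings (no network access is shown), and `learn_from_pages` and `guess_after_determiner` are hypothetical names with a deliberately trivial classification rule.

```python
def guess_after_determiner(words, index, lexicon):
    """Toy rule: an unknown word right after a determiner is taken as a noun."""
    if index > 0 and lexicon.get(words[index - 1]) == "determiner":
        return "noun"
    return None


def learn_from_pages(pages, lexicon, classify=guess_after_determiner):
    """Scan page texts and add newly classified words to the lexicon."""
    learned = {}
    for text in pages:
        words = text.lower().split()
        for i, word in enumerate(words):
            if word in lexicon:
                continue  # already classified; nothing to learn
            guess = classify(words, i, lexicon)
            if guess is not None:
                lexicon[word] = guess  # later pages can build on this entry
                learned[word] = guess
    return learned
```

Because the lexicon is updated in place, a word learned on one page is treated as known on every later page, which is the self-enhancing behavior the slide describes.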

  12. Questions?
