
Artificial Intelligence Research Centre Program Systems Institute Russian Academy of Science


Presentation Transcript


  1. Artificial Intelligence Research Centre, Program Systems Institute, Russian Academy of Science, 152020 Pereslavl-Zalessky, Russia

  2. INEX: Tools for Information Extraction. Artificial Intelligence Research Centre, Program Systems Institute, Russian Academy of Science, 152020 Pereslavl-Zalessky, Russia. +7 48535 98065, inex@epk.botik.ru

  3. Information extraction Objective: • extract meaningful information of a pre-specified type from (typically large amounts of) texts for further analytical purposes Output: • data structures of a pre-specified format (filled scenario templates)

  4. Examples • Sports report: <winner>, <loser>, <score>, <location>, <date>… • Database on rental accommodation opportunities: <location>,<renting price>, <bedrooms number>, <phone number>…
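
A filled scenario template can be pictured as a record with one field per slot. The sketch below is only an illustration for the rental accommodation example: the class name `RentalTemplate`, the field types, and the sample values are assumptions, not part of INEX.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class RentalTemplate:
    """Hypothetical scenario template; every slot starts out unfilled."""
    location: Optional[str] = None
    renting_price: Optional[str] = None
    bedrooms_number: Optional[int] = None
    phone_number: Optional[str] = None

    def filled_slots(self):
        # Return only the slots that extraction has filled so far.
        return {k: v for k, v in asdict(self).items() if v is not None}


# Illustrative values only.
ad = RentalTemplate(location="Pereslavl-Zalessky", bedrooms_number=2)
print(ad.filled_slots())
```

Keeping unfilled slots as `None` makes it easy to see which parts of a template are still missing after a text has been processed.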

  5. Possible IE application scenarios: • inference of new information (knowledge acquisition) • query formulation and answering in human-computer systems • automatic generation of abstracts and summaries • visualization of document content, etc.

  6. The 'Newsmaking' task • <newsmaker> • <type of newsmaker> (person or organization) • <message> • <type of message> (original, cited, a reference to another newsmaker)

  7. IE system architecture
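
The architecture diagram did not survive in this transcript. As a rough sketch, the stages named on the following slides could be chained as a sequence of functions over a shared document record. The stage order (taken from the slide order), the `make_stage` helper, and the record layout are all assumptions made for illustration.

```python
# Stage names as they appear on the later slides; the order is an assumption.
STAGE_NAMES = [
    "tokenisation",
    "sentence segmentation",
    "morphological analysis",
    "filtering",
    "disambiguation",
    "microsyntactic analysis",
    "macrosyntactic analysis",
    "named entity recognition",
    "information extraction rules",
    "coreference resolution",
    "merging partial results",
]


def make_stage(name):
    """Build a placeholder stage that only records that it has run."""
    def stage(doc):
        # A real stage would add its own annotations to the document here.
        return {**doc, "trace": doc["trace"] + [name]}
    return stage


def run_pipeline(text, stages):
    """Apply each stage in order to a simple document record."""
    doc = {"text": text, "trace": []}
    for stage in stages:
        doc = stage(doc)
    return doc


result = run_pipeline("Some input text.", [make_stage(n) for n in STAGE_NAMES])
print(result["trace"])
```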

  8. Tokenisation & sentence segmentation • Tokenisation: identification of words, punctuation marks, delimiters, special characters • Sentence segmentation: recognizing sentence boundaries
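
A minimal sketch of both steps using regular expressions. This is a naive illustration, not the INEX implementation: it assumes a sentence ends at `.`, `!` or `?` followed by whitespace and an uppercase letter, so abbreviations such as "Dr. Smith" would be split incorrectly.

```python
import re

# A token is either a run of word characters or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")
# Assumed boundary: end punctuation, whitespace, then an uppercase letter.
SENT_END_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    return [s for s in SENT_END_RE.split(text.strip()) if s]


def tokenise(sentence):
    return TOKEN_RE.findall(sentence)


text = "The match ended 2:1. It was played in Moscow!"
sentences = split_sentences(text)
tokens = [tokenise(s) for s in sentences]
print(sentences)
print(tokens[0])
```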

  9. Morphological analysis • maps every word-form of the input text to (a) canonical form(s) • recognizes the word's morphological properties Results are typically ambiguous.
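
The ambiguity mentioned here can be illustrated with a toy dictionary lookup that returns every reading of a word form. The lexicon entries and tag names below are invented English examples; a real analyser would rely on a full morphological dictionary.

```python
# Toy lexicon: word form -> list of (canonical form, morphological properties).
LEXICON = {
    "saw": [("see", "VERB past"), ("saw", "NOUN sing"), ("saw", "VERB pres")],
    "dogs": [("dog", "NOUN plur"), ("dog", "VERB 3sg")],
    "the": [("the", "DET")],
}


def analyse(word_form):
    """Return all readings of a word form; unknown words get one fallback reading."""
    key = word_form.lower()
    return LEXICON.get(key, [(key, "UNKNOWN")])


for lemma, properties in analyse("saw"):
    print(lemma, properties)
```

The word form "saw" yields three readings, which is why a later disambiguation step is needed.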

  10. Filtering • reduces the text to be subjected to further processing to potentially relevant portions
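
One simple way to filter, sketched under the assumption that relevance can be approximated by trigger words: keep only sentences containing a word associated with the target scenario. The trigger list below is hypothetical and loosely tied to the 'Newsmaking' task.

```python
# Hypothetical trigger words for statements made by a newsmaker.
TRIGGERS = {"said", "stated", "announced", "according"}


def is_relevant(sentence):
    """A sentence is kept if any of its words is a trigger word."""
    words = {w.strip(".,!?\"'").lower() for w in sentence.split()}
    return bool(words & TRIGGERS)


def filter_sentences(sentences):
    return [s for s in sentences if is_relevant(s)]


sample = [
    "The minister said that prices will fall.",
    "It rained all day.",
]
print(filter_sentences(sample))
```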

  11. Disambiguation • a side effect of other processes (e.g., microsyntactic analysis) • a stand-alone stage
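
As an illustration of the stand-alone variant, a single contextual rule might prefer noun readings directly after a determiner. The rule, the tag names, and the reading format (pairs of canonical form and properties) are assumptions chosen for this sketch.

```python
def disambiguate(prev_tag, readings):
    """Keep only noun readings after a determiner, if any exist; else keep all."""
    if prev_tag == "DET":
        nouns = [r for r in readings if r[1].startswith("NOUN")]
        if nouns:
            return nouns
    return readings


readings = [("see", "VERB past"), ("saw", "NOUN sing")]
print(disambiguate("DET", readings))   # after "the", only the noun reading remains
print(disambiguate("VERB", readings))  # no rule applies, ambiguity is kept
```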

  12. Microsyntactic analysis • identifies noun phrases (NP) • identifies some regularly formed constructions (numbers, dates, personal proper names)
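
Regularly formed constructions such as dates lend themselves to pattern matching. The sketch below recognizes dates only, and assumes a `DD.MM.YYYY` notation; other date formats, numbers, and personal names would need their own patterns.

```python
import re
from datetime import date

# Assumed date notation: day.month.year, e.g. 12.03.2004
DATE_RE = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b")


def find_dates(text):
    """Return all valid calendar dates written as DD.MM.YYYY in the text."""
    found = []
    for match in DATE_RE.finditer(text):
        day, month, year = (int(g) for g in match.groups())
        try:
            found.append(date(year, month, day))
        except ValueError:
            # Looks like a date but is not one (e.g. month 13); skip it.
            pass
    return found


print(find_dates("Rented on 12.03.2004, not 45.13.2004."))
```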

  13. Macrosyntactic analysis • identifies clause boundaries • constructs clause hierarchy within a sentence

  14. Named entity recognizer • identifies proper names • assigns semantic features to certain items
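
A minimal gazetteer-based sketch of this step: known names are looked up in the text and labelled with a semantic feature. The gazetteer entries and the labels are assumptions; real recognizers usually combine such lists with contextual rules.

```python
import re

# Hypothetical gazetteer: lowercase name -> semantic feature.
GAZETTEER = {
    "russian academy of science": "organization",
    "pereslavl-zalessky": "location",
    "russia": "location",
}


def tag_entities(text):
    """Return (surface form, label) pairs in order of appearance."""
    hits = []
    for name, label in GAZETTEER.items():
        # Word boundaries stop "russia" from matching inside "Russian".
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            hits.append((match.start(), match.group(0), label))
    return [(surface, label) for _, surface, label in sorted(hits)]


sentence = "The Russian Academy of Science has an institute in Pereslavl-Zalessky."
print(tag_entities(sentence))
```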

  15. Information extraction rules • a domain knowledge representation formalism (scenario templates) • a set of patterns to identify template elements in a text (covering the many possible ways to talk about the target event elements)

  16. An IE pattern includes: • a set of rules that define how to retrieve this pattern in a text • a set of constraints imposed on textual elements to fit into a particular slot of the target template
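
The two parts of a pattern can be sketched for the 'Newsmaking' task as a retrieval rule (here a regular expression) plus a slot constraint (here, the newsmaker must not be a pronoun). The regular expression, the pronoun list, and the list of known organizations are illustrative assumptions; the type of message is left unfilled because this one pattern cannot decide it.

```python
import re

# Retrieval rule: "<Capitalised name> said/stated/announced that <message>."
PATTERN = re.compile(
    r"(?P<newsmaker>[A-Z]\w*(?: [A-Z]\w*)*) "
    r"(?:said|stated|announced) that (?P<message>[^.]+)\."
)
# Constraint data (hypothetical).
PRONOUNS = {"He", "She", "It", "They"}
KNOWN_ORGANIZATIONS = {"Program Systems Institute"}


def extract(sentence):
    """Return a partially filled template, or None if the pattern does not apply."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    newsmaker = match.group("newsmaker")
    if newsmaker in PRONOUNS:
        # Slot constraint: a pronoun cannot fill the <newsmaker> slot directly.
        return None
    kind = "organization" if newsmaker in KNOWN_ORGANIZATIONS else "person"
    return {
        "newsmaker": newsmaker,
        "type_of_newsmaker": kind,
        "message": match.group("message").strip(),
        "type_of_message": None,  # not decided by this pattern
    }


print(extract("Ivan Petrov said that the system is ready."))
```

A pronoun rejected by the constraint is a natural candidate for the coreference resolver rather than a final slot value.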

  17. Coreference Resolver • recognizes different occurrences of the same entity in a text
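
A deliberately naive sketch of name coreference: a shorter mention is linked to a longer one when all of its words occur in the longer mention (for example "Petrov" and "Ivan Petrov"). This heuristic is an assumption made for illustration and ignores pronouns, nicknames, and two people sharing a surname.

```python
def resolve(mentions):
    """Map each distinct mention to the longest mention that contains its words."""
    canonical = []
    mapping = {}
    # Process longer mentions first so they become the canonical forms.
    for mention in sorted(set(mentions), key=len, reverse=True):
        words = set(mention.split())
        target = next((c for c in canonical if words <= set(c.split())), None)
        if target is None:
            canonical.append(mention)
            target = mention
        mapping[mention] = target
    return mapping


links = resolve(["Ivan Petrov", "Petrov", "Anna Sidorova", "Petrov"])
print(links)
```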

  18. Merging partial results • merging partially filled templates to produce a final, maximally filled template
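
Merging can be sketched as filling each slot from the first partial template that provides a value. Representing templates as dictionaries, treating `None` as "unfilled", and keeping the first value on conflict are assumptions of this sketch; the sample values are invented for the sports report example.

```python
def merge_templates(partials):
    """Combine partially filled templates; the first non-empty value per slot wins."""
    merged = {}
    for partial in partials:
        for slot, value in partial.items():
            if value is not None and merged.get(slot) is None:
                merged[slot] = value
            else:
                # Remember the slot even if it is still unfilled.
                merged.setdefault(slot, None)
    return merged


# Illustrative partial results from two different sentences.
first = {"winner": "Spartak", "loser": None, "score": "2:1"}
second = {"winner": None, "loser": "Dynamo", "location": "Moscow"}
print(merge_templates([first, second]))
```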
