
Annotation Free Information Extraction




  1. Annotation Free Information Extraction Chia-Hui Chang Department of Computer Science & Information Engineering National Central University chia@csie.ncu.edu.tw 10/4/2002

  2. Introduction • TEXT IE • AutoSlog-TS • Semi IE • IEPAD

  3. AutoSlog-TS: Automatically Generating Extraction Patterns from Untagged Text Ellen Riloff, University of Utah, AAAI-96

  4. AutoSlog-TS • AutoSlog-TS is an extension of AutoSlog • It operates exhaustively by generating an extraction pattern for every noun phrase in the training corpus. • It then evaluates the extraction patterns by processing the corpus a second time and generating relevance statistics for each pattern. • A more significant difference is that AutoSlog-TS allows multiple rules to fire if more than one matches the context.

  5. AutoSlog-TS Concept

  6. Relevance Rate • Pr(relevant text | text contains pattern_i) = rel-freq_i / total-freq_i, where rel-freq_i is the number of instances of pattern_i that were activated in relevant texts, and total-freq_i is the total number of instances of pattern_i that were activated in the training corpus. • The motivation behind the conditional probability estimate is that domain-specific expressions will appear substantially more often in relevant texts than in irrelevant texts.

  7. Rank function • Next, we use a rank function to rank the patterns in order of importance to the domain: relevance rate * log2(frequency) • So, a person only needs to review the most highly ranked patterns.
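The relevance rate and rank function above can be sketched in Python; the pattern statistics below are hypothetical, and we assume "frequency" in the rank function means the pattern's total frequency in the corpus:

```python
import math

def relevance_rate(rel_freq, total_freq):
    # Pr(relevant text | text contains pattern) = rel-freq / total-freq
    return rel_freq / total_freq

def rank_score(rel_freq, total_freq):
    # relevance rate * log2(frequency); "frequency" assumed = total-freq
    if total_freq == 0:
        return 0.0
    return relevance_rate(rel_freq, total_freq) * math.log2(total_freq)

# hypothetical pattern statistics: (pattern, rel-freq, total-freq)
patterns = [("<subj> exploded", 9, 10), ("took <dobj>", 30, 100)]
ranked = sorted(patterns, key=lambda p: rank_score(p[1], p[2]), reverse=True)
print(ranked[0][0])  # the highly relevant pattern ranks first
```

A reviewer would then inspect only the top of `ranked`, as the slide describes.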

  8. Experimental Results: Setup • We evaluated AutoSlog and AutoSlog-TS by manually inspecting the performance of their dictionaries in the MUC-4 terrorism domain. • We used the MUC-4 texts as input and the MUC-4 answer keys as the basis for judging “correct” output (MUC-4 Proceedings 1992). • Training data and dictionary sizes:
  – AutoSlog: 772 relevant texts; 1237 extraction patterns; 450 retained after manual review
  – AutoSlog-TS: 1500 texts (50% relevant); 32,345 extraction patterns, 11,225 after filtering; 210 retained after manual review

  9. Testing • To evaluate the two dictionaries, we chose 100 blind texts from the MUC-4 test set (50 relevant texts and 50 irrelevant texts). • We scored the output by assigning each extracted item to one of five categories: correct, mislabeled, duplicate, spurious, or missing. • Correct: the item matched against the answer keys. • Mislabeled: the item matched against the answer keys but was extracted as the wrong type of object. • Duplicate: the item was coreferent with an item in the answer keys. • Spurious: the item did not refer to any object in the answer keys. • Missing: an item in the answer keys that was not extracted.

  10. Experimental Results • We scored three items: perpetrators, victims, and targets.

  11. Experimental Results • We calculated recall as correct / (correct + missing) • and precision as (correct + duplicate) / (correct + duplicate + mislabeled + spurious)
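These two formulas can be written directly in Python; the counts below are hypothetical:

```python
def recall(correct, missing):
    # correct / (correct + missing)
    return correct / (correct + missing)

def precision(correct, duplicate, mislabeled, spurious):
    # (correct + duplicate) / (correct + duplicate + mislabeled + spurious)
    return (correct + duplicate) / (correct + duplicate + mislabeled + spurious)

# hypothetical counts for one item type (e.g. victims)
print(recall(correct=40, missing=10))                                 # 0.8
print(precision(correct=40, duplicate=5, mislabeled=3, spurious=12))  # 0.75
```

Note that duplicates count in the extractor's favor for precision (they refer to a real answer-key item) but do not add to recall.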

  12. Behind the scenes • In fact, we have reason to believe that AutoSlog-TS is ultimately capable of producing better recall than AutoSlog because it generates many good patterns that AutoSlog did not. • AutoSlog-TS produced 158 patterns with a relevance rate ≧ 90% and frequency ≧ 5. Only 45 of these patterns were in the original AutoSlog dictionary. • The higher precision demonstrated by AutoSlog-TS is probably a result of the relevance statistics.

  13. Future Directions • A potential problem with AutoSlog-TS is that there are undoubtedly many useful patterns buried deep in the ranked list, which cumulatively could have a substantial impact on performance. • The precision of the extraction patterns could also be improved by adding semantic constraints and, in the long run, creating more complex extraction patterns.

  14. IEPAD: Information Extraction based on Pattern Discovery C.-H. Chang, National Central University, WWW10

  15. Semi-structured Information Extraction • Information Extraction (IE) • Input: HTML pages • Output: A set of records

  16. Pattern Discovery based IE • Motivation • Display of multiple records often forms a repeated pattern • The occurrences of the pattern are spaced regularly and adjacently • Now the problem becomes ... • Find regular and adjacent repeats in a string

  17. IEPAD Architecture • Pattern Generator: takes an HTML page and produces candidate patterns • Pattern Viewer: lets users browse the patterns and select an extraction rule • Extractor: applies the extraction rule to HTML pages and produces the extraction results

  18. The Pattern Generator • Translator: HTML page → a token string • PAT Tree Constructor: token string → PAT trees and maximal repeats • Pattern Validator: maximal repeats → advanced patterns • Rule Composer: advanced patterns → extraction rules

  19. 1. Web Page Translation • Encoding of HTML source • Rule 1: Each tag is encoded as a token • Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore) • HTML Example: <B>Congo</B><I>242</I><BR> <B>Egypt</B><I>20</I><BR> • Encoded token string: T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
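A minimal sketch of this encoding in Python (the regex-based tokenization is an implementation assumption; the slide only specifies the two rules):

```python
import re

# Split the source into tags (<...>) and the text runs between them.
TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")

def encode(html):
    tokens = []
    for m in TOKEN_RE.finditer(html):
        piece = m.group()
        if piece.startswith("<"):
            tokens.append("T(%s)" % piece)   # Rule 1: each tag is a token
        elif piece.strip():
            tokens.append("T(_)")            # Rule 2: text becomes TEXT, i.e. T(_)
    return tokens

html = "<B>Congo</B><I>242</I><BR><B>Egypt</B><I>20</I><BR>"
print("".join(encode(html)))
```

Both records encode to the same seven-token sequence, which is exactly the repeated pattern the later steps look for.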

  20. Various Encoding Schemes

  21. 2. PAT Tree Construction • PAT tree: binary suffix tree, i.e., a Patricia tree constructed over all possible suffix strings of a text • Example encoding: T(<B>) 000, T(</B>) 001, T(<I>) 010, T(</I>) 011, T(<BR>) 100, T(_) 110 • The token string T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) encodes to the bit string 000110001010110011100

  22. The Constructed PAT Tree

  23. Definition of Maximal Repeats • Let α occur in S at positions p1, p2, p3, …, pk • α is left maximal if there exists at least one (i, j) pair such that S[pi−1] ≠ S[pj−1] • α is right maximal if there exists at least one (i, j) pair such that S[pi+|α|] ≠ S[pj+|α|] • α is a maximal repeat if it is both left maximal and right maximal

  24. Finding Maximal Repeats • Definition: • Let’s call the character S[pi−1] the left character of suffix pi • A node α is left diverse if at least two leaves in α’s subtree have different left characters • Lemma: • The path label of an internal node α in a PAT tree is a maximal repeat if and only if α is left diverse
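A brute-force sketch of these definitions (a PAT tree yields the same set in linear time; this quadratic-space version is only for illustration, with the string boundary treated as a unique left/right character):

```python
from collections import defaultdict

def maximal_repeats(s):
    # Collect the occurrence positions of every substring.
    occ = defaultdict(list)
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n + 1):
            occ[s[i:j]].append(i)
    result = set()
    for a, pos in occ.items():
        if len(pos) < 2:
            continue  # a repeat needs at least two occurrences
        # Left/right characters; None stands for the string boundary.
        lefts = {s[p - 1] if p > 0 else None for p in pos}
        rights = {s[p + len(a)] if p + len(a) < n else None for p in pos}
        # Left diverse and right diverse => maximal repeat.
        if len(lefts) > 1 and len(rights) > 1:
            result.add(a)
    return result

print(maximal_repeats("abcxabcy"))  # only "abc" survives
```

For example, "ab" is not maximal here: every occurrence is followed by "c", so it can always be extended to the right.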

  25. 3. Pattern Validator • Suppose the occurrences of a maximal repeat α are ordered by position such that p1 < p2 < p3 < … < pk, where pi denotes the position of each suffix in the encoded token sequence. • Characteristics of a pattern • Regularity: variance coefficient V(α) • Adjacency: density D(α)

  26. Pattern Validator (Cont.) • Basic screening: for each maximal repeat α, compute V(α) and D(α) a) check the pattern’s regularity: keep α only if V(α) < 0.5, otherwise discard it b) check the pattern’s density: keep α only if 0.25 < D(α) < 1.5, otherwise discard it
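The screening step can be sketched as follows; the exact formulas for V(α) and D(α) are assumptions (variance coefficient of the gaps between adjacent occurrences, and the fraction of the covered span occupied by the occurrences):

```python
import statistics

def variance_coefficient(positions):
    # Regularity V(a): std-dev of gaps between adjacent occurrences
    # divided by the mean gap (assumed formulation).
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return statistics.pstdev(gaps) / statistics.mean(gaps)

def density(positions, pattern_len):
    # Adjacency D(a): share of the span from the first to the last
    # occurrence that the occurrences themselves cover (assumed formulation).
    span = positions[-1] - positions[0] + pattern_len
    return len(positions) * pattern_len / span

def validate(positions, pattern_len):
    # Basic screening thresholds from the slide.
    return (variance_coefficient(positions) < 0.5
            and 0.25 < density(positions, pattern_len) < 1.5)

# Perfectly regular, adjacent repeats (e.g. records of 7 tokens) pass;
# irregularly spaced occurrences fail.
print(validate([0, 7, 14, 21], 7))  # True
print(validate([0, 3, 50], 3))      # False
```

Evenly spaced, back-to-back record occurrences give V(α) = 0 and D(α) = 1, the ideal case the thresholds are centered on.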

  27. 4. Rule Composer • Occurrence partition • Flexible variance threshold control • Multiple string alignment • Increase density of a pattern

  28. Occurrence Partition • Problem • Some patterns are divided into several blocks • Ex: Lycos and Excite pages, whose patterns have large regularity values • Solution • Cluster the occurrences of such a pattern; keep a cluster P only if V(P) < 0.1, then check its density

  29. Multiple String Alignment • Problem • Patterns with density less than 1 can extract only part of the information • Solution • Align the k−1 substrings among the k occurrences • A natural generalization of two-string alignment, which can be solved in O(n·m) time by dynamic programming, where n and m are the string lengths.

  30. Multiple String Alignment (Cont.) • Suppose “adc” is the discovered pattern for the token string “adcwbdadcxbadcxbdadcb” • If we have the following multiple alignment for the strings “adcwbd”, “adcxb”, and “adcxbd”:
  a d c w b d
  a d c x b -
  a d c x b d
• The extraction pattern can be generalized as “adc[w|x]b[d|-]”
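The generalization step can be sketched as a column-wise scan over the aligned rows, where "-" marks a gap:

```python
def generalize(aligned):
    # aligned: equal-length rows from the multiple alignment.
    # Columns where all rows agree stay literal; columns that differ
    # become an alternative group [x|y|...].
    pattern = ""
    for col in zip(*aligned):
        symbols = sorted(set(col), key=col.index)  # keep first-seen order
        if len(symbols) == 1:
            pattern += symbols[0]
        else:
            pattern += "[" + "|".join(symbols) + "]"
    return pattern

rows = ["adcwbd", "adcxb-", "adcxbd"]
print(generalize(rows))  # adc[w|x]b[d|-]
```

Each symbol here stands for one encoded token, so an alternative group records that a token position varies (or is missing) across records.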

  31. Pattern Viewer • Java-application based GUI • Web based GUI • http://www.csie.ncu.edu.tw/~chia/WebIEPAD/

  32. The Extractor • Matching the pattern against the encoded token string • Knuth-Morris-Pratt algorithm • Boyer-Moore algorithm • Alternatives in a rule: match the longest pattern • What is extracted? The whole record
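As a sketch of matching a generalized rule, one can translate it into a regular expression instead of using KMP or Boyer-Moore; sorting alternatives longest-first mirrors the "match the longest pattern" policy (the rule syntax handled here is an assumption based on the earlier "adc[w|x]b[d|-]" example):

```python
import re

def rule_to_regex(rule):
    # Translate a rule like "adc[w|x]b[d|-]" into a regex.
    # "-" inside a group means the column is optional.
    out = ""
    for m in re.finditer(r"\[([^\]]+)\]|(.)", rule):
        if m.group(1) is not None:
            alts = m.group(1).split("|")
            optional = "-" in alts
            # longest alternative first => longest match preferred
            alts = sorted((a for a in alts if a != "-"), key=len, reverse=True)
            out += "(?:" + "|".join(map(re.escape, alts)) + ")"
            if optional:
                out += "?"
        else:
            out += re.escape(m.group(2))
    return out

rx = re.compile(rule_to_regex("adc[w|x]b[d|-]"))
print([m.group() for m in rx.finditer("adcwbdadcxbadcxbd")])
# ['adcwbd', 'adcxb', 'adcxbd']  -- each match is one whole record
```

In IEPAD itself the symbols are encoded tokens rather than characters, but the matching logic is the same.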

  33. Experiment Setup • Fourteen sources: search engines • Performance measures • Number of patterns • Retrieval rate and Accuracy rate • Parameters • Encoding scheme • Thresholds control

  34. # of Patterns Discovered Using Block-Level Encoding • An average of 117 maximal repeats in our test Web pages

  35. Translation • Average page length is 22.7KB

  36. Accuracy and Retrieval Rate

  37. Summary • IEPAD: Information Extraction based on Pattern Discovery • Rule generator • The extractor • Pattern viewer • Performance • 97% retrieval rate and 94% accuracy rate

  38. Problems • Guarantees a high retrieval rate rather than a high accuracy rate • A generalized rule can extract more than the desired data • Currently applicable only when there are several records in a Web page

  39. References • TEXT IE • Riloff, E. (1996) Automatically Generating Extraction Patterns from Untagged Text, AAAI-96, pp. 1044-1049. • Riloff, E. (1999) Information Extraction as a Stepping Stone toward Story Understanding, in Computational Models of Reading and Understanding, Ashwin Ram and Kenneth Moorman, eds., The MIT Press.

  40. References • Semi-structured IE • D.W. Embley, Y.S. Jiang, and W.-K. Ng, Record-Boundary Discovery in Web Documents, SIGMOD'99 Proceedings. • C.-H. Chang and S.-C. Lui, IEPAD: Information Extraction based on Pattern Discovery, WWW10, pp. 681-688, May 2-6, 2001, Hong Kong. • B. Chidlovskii, J. Ragetli, and M. de Rijke, Automatic Wrapper Generation for Web Search Engines, The 1st Intern. Conf. on Web-Age Information Management (WAIM'2000), Shanghai, China, June 2000.
