
3 Typical Work on Automatic Relation Extraction

3 Typical Work on Automatic Relation Extraction. Three Important Approaches to Automatic Relation Extraction. 武文娟 (Wu Wenjuan), 2009.06.04. Outline. DIPRE, 1998; KnowItAll, 2005; Open IE, 2007. 1 DIPRE: Dual Iterative Pattern Expansion. Sergey Brin, Extracting Patterns and Relations from the World Wide Web,


Presentation Transcript


  1. 3 Typical Work on Automatic Relation Extraction (Three Important Approaches to Automatic Relation Extraction) • 武文娟 (Wu Wenjuan) • 2009.06.04

  2. Outline • DIPRE,1998 • KnowItAll, 2005 • Open IE, 2007

  3. 1 DIPRE: Dual Iterative Pattern Expansion • Sergey Brin, Extracting Patterns and Relations from the World Wide Web. In: Proceedings of the International Workshop on the Web and Databases, 1998.

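  4. 1 DIPRE: Dual Iterative Pattern Expansion • The first work to use an iterative method to discover patterns and relations among data entities; it successfully found author/title pairs. • Input: a seed sample of 5 books, as (author, title) pairs • Output: automatically expanded to 15,000 books • Some of these books are not listed even by Amazon, the largest online bookstore.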

  5. 1.1 Idea • There is a duality between patterns and relations • [Diagram labels: pattern, extract, tuple, discover]

  6. 1.2 Algorithm • Occurrence: a 7-tuple • (author, title, order, url, prefix, middle, suffix) • [Flowchart labels: R (tuple set); FindOccurrence(R, D); Occurrence; Generate & filter; Patterns; Search]
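
The FindOccurrence(R, D) step on this slide can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the implementation from the paper: the document collection D is modelled as a hypothetical dict from URL to page text, `order` is taken to be True when the author appears before the title, and the `context` and `max_gap` limits are made-up parameters.

```python
import re
from typing import Dict, Iterable, List, Tuple

# (author, title, order, url, prefix, middle, suffix), as on the slide.
Occurrence = Tuple[str, str, bool, str, str, str, str]


def find_occurrences(pairs: Iterable[Tuple[str, str]],
                     docs: Dict[str, str],
                     context: int = 10,
                     max_gap: int = 20) -> List[Occurrence]:
    """Find every place where a known (author, title) pair occurs close together."""
    found: List[Occurrence] = []
    for url, text in docs.items():
        for author, title in pairs:
            # Try both orders: author first (order=True) and title first (order=False).
            for order, first, second in ((True, author, title), (False, title, author)):
                regex = re.escape(first) + "(.{1,%d}?)" % max_gap + re.escape(second)
                for m in re.finditer(regex, text):
                    prefix = text[max(0, m.start() - context):m.start()]
                    suffix = text[m.end():m.end() + context]
                    found.append((author, title, order, url, prefix, m.group(1), suffix))
    return found


docs = {"http://example.org/list.html":
        "<li><b>Foundation</b> by Isaac Asimov (1951)</li>"}
occurrences = find_occurrences([("Isaac Asimov", "Foundation")], docs)
```

Here the single seed pair yields one occurrence with the title first, so `order` is False and the middle string is `</b> by `.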

  7. Pattern generation • Occurrence: a 7-tuple • (author, title, order, url, prefix, middle, suffix) • Procedure: take the occurrences O1, O2, …, Ok • group them by order and middle • for each Oi, call GenOnePattern(Oi) • check whether p is specific (YES / NO); if YES, output p • Pattern p: a 5-tuple • (order, urlprefix, prefix, middle, suffix) • URL: matches urlprefix* • Content: matches *prefix, author, middle, title, suffix*
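
As a rough sketch of what GenOnePattern might compute from one group of occurrences, the Python below builds the 5-tuple from the 7-tuples. The slide does not spell out the details, so the following are assumptions: `urlprefix` is the longest common prefix of the URLs, `prefix` is the longest common suffix of the occurrence prefixes, and `suffix` is the longest common prefix of the occurrence suffixes. The specificity test is left out.

```python
import os
from typing import List, NamedTuple, Optional


class Occurrence(NamedTuple):
    author: str
    title: str
    order: bool
    url: str
    prefix: str
    middle: str
    suffix: str


class Pattern(NamedTuple):
    order: bool
    urlprefix: str
    prefix: str
    middle: str
    suffix: str


def common_prefix(strings: List[str]) -> str:
    # os.path.commonprefix compares character by character, not path components.
    return os.path.commonprefix(strings)


def common_suffix(strings: List[str]) -> str:
    return common_prefix([s[::-1] for s in strings])[::-1]


def gen_one_pattern(group: List[Occurrence]) -> Optional[Pattern]:
    """Build one pattern from occurrences that share the same order and middle."""
    first = group[0]
    if any(o.order != first.order or o.middle != first.middle for o in group):
        return None  # the group was not formed by (order, middle)
    return Pattern(
        order=first.order,
        urlprefix=common_prefix([o.url for o in group]),
        prefix=common_suffix([o.prefix for o in group]),
        middle=first.middle,
        suffix=common_prefix([o.suffix for o in group]),
    )


group = [
    Occurrence("Isaac Asimov", "The Robots of Dawn", True,
               "http://example.org/books/a.html", "x<li>", ", <i>", "</i> 1983"),
    Occurrence("David Brin", "Startide Rising", True,
               "http://example.org/books/b.html", "y<li>", ", <i>", "</i>, Bantam"),
]
pattern = gen_one_pattern(group)
```

The example URLs, authors, and HTML fragments are invented for illustration only.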

  8. 1.3 Experiments • Corpus • A repository of 24 million web pages • 147 GB

  9. 1.3 Experiments: Initial sample

  10. 1.3 Experiments: 3 Patterns in First Iteration

  11. 1.3 Experiments: 4047 new pairs in First Iteration

  12. 1.3 Experiments: review

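  13. 1.4 Conclusion • DIPRE: • Among the earliest work on semi-supervised relation learning • Exploits the duality between relations and patterns: on a large-scale corpus such as the Web, starting from a small number of samples as seeds, it iteratively extracts new patterns and new instances.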

  14. Outline • DIPRE,1998 • KnowItAll, 2005 • Open IE, 2007

  15. KNOWITALL • Oren Etzioni et al., University of Washington • Unsupervised Named-Entity Extraction from the Web: An Experimental Study • AAAI 2005

  16. Introduction • Previous work: HMM, CRF • Small-scale corpora • Seed data must be provided • KNOWITALL: an unsupervised, domain-independent system that extracts information from the Web • Key challenges: • Ensuring precision: a novel generate-and-test architecture • Improving recall: • Pattern Learning (PL) • Subclass Extraction (SE) • List Extraction (LE)

  17. 1 Flowchart of the main components in KnowItAll • For every predicate: • creates extraction rules and discriminators • Train discriminators • “cities such as ” NPList

  18. Information Focus • The only domain-specific input is a set of predicates, which specify the domain of interest.

  19. Generic extraction patterns

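  20. Extraction Rules • Generic extraction patterns are combined with the predicate's labels to generate extraction rules for the domain • For Class1 = ‘city’, the rules are • “cities such as ” NPList • “towns such as ” NPList • Keywords: “cities such as ”, “towns such as ” (submitted to the search engine)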
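
The rule instantiation on this slide can be sketched in a few lines of Python. This is a toy illustration, not KnowItAll's code: the `pluralize` helper is a naive hypothetical stand-in for real morphology, and the rule is represented as a plain dict whose field names are invented here.

```python
from typing import Dict, List


def pluralize(noun: str) -> str:
    """Very naive English pluralization; enough for 'city' and 'town'."""
    if noun.endswith("y") and noun[-2:-1] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"


def make_rules(labels: List[str], template: str = "{plural} such as ") -> List[Dict[str, str]]:
    """Instantiate one generic pattern for every label of a predicate."""
    rules = []
    for label in labels:
        keywords = template.format(plural=pluralize(label))
        # 'keywords' is the query string sent to the search engine;
        # 'extract' names what to pull out of the text after the keywords.
        rules.append({"keywords": keywords, "extract": "NPList"})
    return rules


city_rules = make_rules(["city", "town"])
keywords = [rule["keywords"] for rule in city_rules]
```

For the labels `city` and `town`, the keyword strings come out as `cities such as ` and `towns such as `, matching the slide.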

  21. Discriminator • Used to check whether an extracted instance is valid • Based on PMI (pointwise mutual information)
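
A small sketch of the PMI score may help here. It assumes the hit-count form PMI(I, D) = hits(D + I) / hits(I), where I is the instance and D the discriminator phrase; the slide itself does not give the formula, so treat that definition as an assumption. The numbers are the hit counts and threshold quoted elsewhere in this deck for "Fes" and "<I> is a city".

```python
def pmi(hits_with_discriminator: int, hits_instance: int) -> float:
    """Hit-count PMI: how often the instance appears inside the discriminator phrase."""
    if hits_instance == 0:
        return 0.0  # no evidence at all for this instance
    return hits_with_discriminator / hits_instance


# "Fes is a city" has 14 hits, "Fes" alone has 446,000 hits.
score = pmi(14, 446_000)
threshold = 0.000016  # learned threshold T quoted for "<I> is a city"
fires = score > threshold
```

With these counts the score is about 3.1e-5, which is above the quoted threshold, so this discriminator would count as evidence for City(Fes).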

  22. Training discriminator: Bootstrapping

  23. The result of training • A set of discriminators, e.g. • Discriminator: <I> is a city • Learned threshold T: 0.000016 • Conditional probabilities: P(PMI > T | class) = 0.83, P(PMI > T | ¬class) = 0.08
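
One plausible way to turn these conditional probabilities into a probability for an instance is a naive Bayes combination over the discriminators. The slide does not name the combination method, so this is an assumption, as is the prior of 0.5 and the `class_probability` function name.

```python
from typing import Iterable, Tuple

# (fired, P(PMI > T | class), P(PMI > T | not class)) for one discriminator.
Evidence = Tuple[bool, float, float]


def class_probability(evidence: Iterable[Evidence], prior: float = 0.5) -> float:
    """Naive Bayes posterior that the instance belongs to the class."""
    p_class, p_not = prior, 1.0 - prior
    for fired, p_given_class, p_given_not in evidence:
        # Use P(f | class) when the discriminator fired, 1 - P(f | class) otherwise.
        p_class *= p_given_class if fired else 1.0 - p_given_class
        p_not *= p_given_not if fired else 1.0 - p_given_not
    return p_class / (p_class + p_not)


# One discriminator with the probabilities from the slide.
p_fired = class_probability([(True, 0.83, 0.08)])
p_silent = class_probability([(False, 0.83, 0.08)])
```

With a single discriminator, a PMI above the threshold raises the probability above the prior and a PMI below it lowers the probability; several independent discriminators multiply this effect.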

  24. An Example • Predicate: city • Bootstrapping: • Generate extraction rules and discriminators • Train all discriminators, and select the 5 best discriminators

  25. An Example: Trained discriminator

  26. An Example: Main cycle: Extract • Suppose that • the query is “and other cities” • from a rule with extraction pattern: NP “and other cities” • 2 instances: Fes, East Coast

  27. An Example: Main cycle: Assess • To compute the probability of City(Fes), the system sends six queries • “Fes” has 446,000 hits • “Fes is a city” has 14 hits • “cities Fes” has 201 hits • “cities such as Fes” has 10 hits • “cities including Fes” has 4 hits • “Fes and other towns” has 0 hits • Combining the probabilities from all discriminators, the final probability is 0.99815 • City(East Coast) • below threshold for all discriminators • The final probability is 0.00027

  28. 1.2 Experiment: noise tolerance

  29. 1.2 Experiment: finding negative training seeds for the assessor

  30. 1.2 Experiment: search cutoff metric • Signal to Noise ratio (STN): the ratio of positive to negative instances • Query Yield Ratio (QYR): the amount of new information extracted from n web pages

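  31. 2 How to improve recall • Pattern Learning (PL): learns • extraction rules • validation patterns for assessing the correctness of instances • Subclass Extraction (SE): • automatically identifies sub-concepts to make extraction easier • e.g. to extract instances of scientists, first find the sub-concepts of scientist (physicist, geographer, etc.), then extract instances of those sub-concepts • List Extraction (LE): • learns a “wrapper” for each list, and uses the wrapper to extract list elements • All three methods use the information extracted by the generic extraction patterns as their initial seeds, so none of them needs hand-supplied training data.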

  32. 2.1 Pattern Learning (PL) • Generic patterns are usually not the most effective patterns for a specific domain • “the film <film> starring” • “headquartered in <city>”

  33. Pattern Learning algorithm • Estimating recall & precision efficiently • take the positive examples of one class to be negative examples for all other classes • [Flowchart labels: I, a set of seed instances; Search; Context of i; Filter: recall & precision; Best patterns]
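
The estimation trick on this slide, treating the positive examples of the other classes as negatives, can be sketched as follows. This is a simplified illustration under assumptions: a candidate pattern is represented only by the set of strings it matched, precision and recall are plain unsmoothed ratios, and the function and variable names are hypothetical.

```python
from typing import Dict, Set, Tuple


def estimate_pattern_quality(matched: Set[str],
                             seeds_by_class: Dict[str, Set[str]],
                             target: str) -> Tuple[float, float]:
    """Estimate (precision, recall) of a pattern for the target class."""
    positives = seeds_by_class[target]
    # Positive seeds of every other class serve as negative examples.
    negatives: Set[str] = set()
    for cls, seeds in seeds_by_class.items():
        if cls != target:
            negatives |= seeds
    correct = len(matched & positives)
    wrong = len(matched & negatives)
    precision = correct / (correct + wrong) if correct + wrong else 0.0
    recall = correct / len(positives) if positives else 0.0
    return precision, recall


seeds = {"city": {"Paris", "Fes", "Oslo", "Rome"}, "film": {"Alien", "Heat"}}
# Strings matched by some candidate pattern; "Seattle" is in no seed set and is ignored.
matched = {"Paris", "Fes", "Oslo", "Alien", "Seattle"}
precision, recall = estimate_pattern_quality(matched, seeds, "city")
```

No hand-labeled negatives are needed: the film seeds play that role when scoring a city pattern, and the city seeds would do the same for a film pattern.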

  34. 3 of the most productive rules

  35. How to improve recall • Pattern Learning (PL) • Subclass Extraction (SE) • List Extraction (LE)

  36. 2.2 Subclass Extraction: Basic subclass extraction (SEbase) • Extracting candidate subclasses • The generic extraction rules extract subclasses along with instances. How can they be told apart? • Instances: proper nouns, capitalized: Scientists such as Einstein, Newton, … • Subclasses: common nouns: Scientists such as physical scientist, biologist, … • Assessing candidate subclasses: a combination method • Does the subclass name contain the superclass name? • “microbiologist” is a subclass of “biologist” • Is there a parent/child relation in WordNet? • SEbase Assessor: • bootstrap training method
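
The two surface cues on this slide, capitalization and name containment, can be sketched in Python as below. It is a deliberately crude illustration: real noun-phrase handling and the WordNet check are omitted, and the function names are hypothetical.

```python
from typing import List, Tuple


def split_candidates(noun_phrases: List[str]) -> Tuple[List[str], List[str]]:
    """Separate capitalized proper nouns (instances) from common nouns (subclasses)."""
    instances = [np for np in noun_phrases if np[:1].isupper()]
    subclasses = [np for np in noun_phrases if np[:1].islower()]
    return instances, subclasses


def name_contains_parent(candidate: str, parent: str) -> bool:
    """True if the candidate subclass name contains the superclass name."""
    return candidate != parent and parent in candidate


extracted = ["Einstein", "Newton", "physical scientist", "biologist"]
instances, subclasses = split_candidates(extracted)
```

The containment check accepts `microbiologist` under `biologist` and `physical scientist` under `scientist`, but not `biologist` under `scientist`, which is why the slide combines it with other evidence such as WordNet.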

  37. Rules for subclass extraction

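  38. Improving Subclass Extraction Recall • For the extracted candidate subclasses, use the last two rules in Table 2 to extract their siblings, yielding more candidate subclasses. • Two kinds of subclass • Context-independent subclass • Person - Priest • Context-dependent subclass • Person - Pharmacist • Two assessing methods • SEself: trains a classifier by self-training • SEiter: iteratively computes a confidence for each extraction rule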

  39. Experimental result: Context-independent subclass

  40. Experimental result: Context-dependent subclass

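  41. How to improve recall • Pattern Learning (PL) • Subclass Extraction (SE) • List Extraction (LE) • Unlike the first two methods, which process unstructured text, • LE exploits the structure of web pages to extract information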

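  42. 2.3 List Extractor • Many lists on web pages are generated from databases, so they usually have clear structural regularities • Basic approach • Locate the lists in a web page • Learn a wrapper that automatically extracts all items in the list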
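
The basic approach can be illustrated with a toy wrapper learner in Python. This is only a sketch under assumptions, not the List Extractor's actual algorithm: a wrapper is modelled as a (prefix, suffix) pair of literal strings shared by the contexts of the seed keywords, and the `context` window size is a made-up parameter.

```python
import os
import re
from typing import List, Optional, Tuple

Wrapper = Tuple[str, str]  # (text just before an item, text just after an item)


def learn_wrapper(html: str, seeds: List[str], context: int = 20) -> Optional[Wrapper]:
    """Learn the context shared by all seed keywords found in the page."""
    lefts, rights = [], []
    for seed in seeds:
        i = html.find(seed)
        if i < 0:
            continue  # this seed does not occur on the page
        lefts.append(html[max(0, i - context):i])
        rights.append(html[i + len(seed):i + len(seed) + context])
    if len(lefts) < 2:
        return None  # need at least two seeds to generalize
    # Longest common suffix of the left contexts, longest common prefix of the right ones.
    prefix = os.path.commonprefix([s[::-1] for s in lefts])[::-1]
    suffix = os.path.commonprefix(rights)
    if not prefix or not suffix:
        return None
    return prefix, suffix


def apply_wrapper(html: str, wrapper: Wrapper) -> List[str]:
    """Extract every item that sits between the wrapper's prefix and suffix."""
    prefix, suffix = wrapper
    # Lookaround keeps shared delimiters available for the next item.
    regex = "(?<=" + re.escape(prefix) + ")(.+?)(?=" + re.escape(suffix) + ")"
    return re.findall(regex, html)


page = "<ul><li>Paris</li><li>Fes</li><li>Oslo</li></ul>"
wrapper = learn_wrapper(page, ["Paris", "Oslo"])
items = apply_wrapper(page, wrapper) if wrapper else []
```

Two seed keywords are enough here to recover the unseen item `Fes`, which is the point of list extraction: known instances locate the list, and its regular structure yields the rest.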

  43. Learning a Wrapper

  44. An Example • W3 is the best wrapper • its HTML block is as small as possible • it matches as many keywords as possible

  45. Experiments of LE

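  46. Discussion • LE can extract a large amount of information with relatively few queries • Although its precision is not high enough, it • helps shrink the set of candidates, which greatly reduces the Assessor's workload • can find information that standard IE methods miss • e.g. rare cities in long selection lists in HTML documents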

  47. 2.4 Comparison of PL, SE, and LE: recall • [Chart panels: film, city, scientist] • For extracting instances of general concepts, SE is more effective

  48. Comparison of PL, SE, and LE: extraction rate • extraction rate = num(unique extractions) / num(queries)
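
The extraction-rate formula on this slide translates directly into code. The sketch below assumes extractions are compared as exact strings when deciding uniqueness; the function name and sample data are hypothetical.

```python
from typing import Iterable


def extraction_rate(extractions: Iterable[str], num_queries: int) -> float:
    """Unique extractions obtained per search-engine query."""
    if num_queries <= 0:
        raise ValueError("num_queries must be positive")
    return len(set(extractions)) / num_queries


# Three raw extractions, two of them unique, obtained with two queries.
rate = extraction_rate(["Fes", "Paris", "Fes"], 2)
```

A method that returns many duplicates per query scores low on this measure even if its raw extraction count is high.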

  49. the Trade-off between Recall and Precision

  50. 3 Conclusion • KnowItAll: Unsupervised information extraction from the Web • Input: a set of predicate names • No hand-labeled training examples of any kind • Precision • utilizes a novel generate-and-test architecture • Extractor, Assessor • Recall • Pattern Learning, Subclass Extraction, List Extraction
