Research on Chinese Automatic Summarization Zhang Jin spaces.msn/jadesor

Research on Chinese Automatic Summarization Zhang Jin http://spaces.msn.com/jadesor Institute of Computing Technology

Outline • Introduction • Basic Methods • Framework of Summarization System • Our Work • Challenges • DUC • Recent & Future Work

Introduction • Related technology • Brief history • Definition • Objective • Classification

NLP Related Technologies

A Brief History of Summarization DUC, held by NIST, from 2001 http://duc.nist.gov By Mark T. Maybury and Inderjeet Mani

Definition • A short passage whose purpose is to convey the main idea of the document without any explanation or comment. (GB6447-86) • A representation of a given document without any explanation or comment. It is unnecessary to know who wrote the summary. (ANSI) • A concise and accurate representation of the document without any explanation or comment. A summary is independent of its author. (ISO214-1976(E))

Objective • Concise • Accurate • Explicit (clear)

Summary Classification • Classified by user's requirement: Generic Summarization (GS) vs User-query Summarization (UQS) • Classified by text object: Single Document Summarization vs Multiple Document Summarization • Classified by method: Summarization Based on Extraction (SBE) vs Summarization Based on Understanding (SBU) • Classified by need for a training corpus: Supervised Summarization (SS) vs Unsupervised Summarization (US)

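Related Work • Research abroad: Work outside China mainly targets English-language text. Representative systems include: • Newsblaster, a multi-document summarization system from Columbia University (USA). It targets the news domain and summarizes each day's news stories on the same topic. The system currently receives more than ten thousand visits per day. • WebInEssence, developed at the University of Michigan (USA): a personalized, Web-based multi-document summarization and content recommendation system. • NeATS, a prototype system from the Information Sciences Institute at the University of Southern California (USA), another well-known multi-document summarization system. Companies doing research and development related to multi-document summarization include: • Vivisimo (http://www.vivisimo.com) • Infonetware (http://www.infonetware.com) Both companies cluster and organize the results returned by search engines, and document clustering is a key preprocessing step for multi-document summarization. • DUC (Document Understanding Conference)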

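Related Work • Research in China • Research on single-document summarization in China is comparatively mature, with work at institutions such as Northeastern University, Shanghai Jiao Tong University, the Chinese Academy of Sciences, and Harbin Institute of Technology. • For multi-document summarization, Fudan University developed a statistics-based automatic text survey system that uses the semantic relatedness of paragraphs within and across documents to produce a multi-document summary; Harbin Institute of Technology has studied maximal marginal relevance based on semantic similarity. • For document clustering, the Department of Computer Science and Technology at Peking University proposed a fast Web document clustering method, PCCS (partial clustering classification). Chen Ning et al. of the Chinese Academy of Sciences proposed a clustering method based on fuzzy concept graphs, and Wu Bin of the Institute of Computing Technology, Chinese Academy of Sciences, proposed a document clustering algorithm based on swarm intelligence.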

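Basic Methods The following seven methods have been proposed in the course of automatic summarization research and development: • Position method: P. E. Baxendale (USA) found that, among the sentences of manually written abstracts, 85% are the first sentence of a paragraph and 7% are the last sentence. G. Salton of Cornell University (USA) proposed locating the central paragraphs of a document and using them as the core of the summary. E.g., besides topic sentences and paragraph-initial and paragraph-final sentences, the second sentence of a paragraph often expresses the paragraph's topic. • Cue-string method: Documents often contain special cue words (phrases, strings, string chains) that clearly signal the topic and can be used to identify it. E.g., the cue-word dictionary in Edmundson's summarization system: • Bonus Words, which take positive values • Stigma Words, which take negative values • Null Words, which carry no weight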
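
The position and cue-word ideas on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the cue score simply counts +1 per Bonus Word and -1 per Stigma Word, and reusing Baxendale's 85% and 7% figures directly as position weights is our assumption, not something the slide prescribes.

```python
def cue_score(words, bonus, stigma):
    """Edmundson-style cue score: +1 per Bonus Word, -1 per Stigma Word.

    Null Words (and every other word) contribute 0.
    """
    return sum((w in bonus) - (w in stigma) for w in words)


def position_score(index, n_sentences, first=0.85, last=0.07):
    """Weight a sentence by its position inside its paragraph.

    The default weights reuse the 85% / 7% figures as-is (an assumption).
    """
    if index == 0:
        return first
    if index == n_sentences - 1:
        return last
    return 0.0
```

A combined sentence weight could then be a weighted sum of the two scores; how to balance them is left open here.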

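Basic Methods • Frequency statistics method: The so-called significant words (content words) that indicate a document's topic tend to be mid-frequency words. Computing a sentence's weight from the number of content words it contains was first proposed by [Luhn,1958]. [V.A.Oswald] argued that a sentence's weight should be computed from the number of representative "word strings" it contains, while Doyle emphasized the "word pairs" with the highest co-occurrence frequency. [Lisa.F.Rau,1995] used relative word frequency to build the ANES (Automatic News Extraction System) system. Related experiments show that high-frequency strings tend to be strongly related to the topic. • Information extraction method: Often used to summarize documents in specialized domains (such as weather forecasts). Based on the user's needs, the method first constructs a summary frame (Abstract Frame) in a form users find appealing. The frame specifies, as empty slots, the items to be obtained from the source text, and the filled frame is then converted into a summary (text or chart). The method is therefore often described as two-stage: extract the relevant information, then generate the summary.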
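
As a rough illustration of the frequency statistics method, the sketch below scores each sentence by how many mid-frequency words it contains. It is a simplification, not Luhn's full procedure: the function name and the `min_count` / `max_count` thresholds are assumptions introduced here, and sentences are assumed to be already segmented into word lists.

```python
from collections import Counter


def frequency_scores(sentences, min_count=2, max_count=None):
    """Score each sentence by the number of significant words it contains.

    A word is treated as significant when its corpus count lies in
    [min_count, max_count]; both thresholds are illustrative assumptions.
    """
    counts = Counter(w for s in sentences for w in s)
    if max_count is None:
        max_count = max(counts.values(), default=0)
    keep = {w for w, c in counts.items() if min_count <= c <= max_count}
    return [sum(1 for w in s if w in keep) for s in sentences]
```

Setting `max_count` below the count of the most frequent function words is what filters out the high-frequency tail.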

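Basic Methods • Frame method: Uses a document's headings and subheadings together with its semantic paragraphs to produce a so-called table-of-contents summary, which is also popular. Statistics show that the titles of most scientific and technical documents (99.8%) basically reflect the topic. The approach of Janos (Czech), which divides sentences into trunk sentences and branch-and-leaf sentences and keeps the trunk sentences while deleting the rest, can be placed in this category. • Understanding-based analysis: Understanding-based summarization usually comprises syntactic analysis, semantic analysis, information extraction, and summary generation; author-written abstracts belong to this category. Our research indicates that understanding should first focus on the discourse and paragraph levels, i.e., understanding should be layered, and higher-level understanding matters more than lower-level understanding. • Human-imitating algorithms: Integrated methods produced by learning from, imitating, and extending manual practice. Human abstractors do not necessarily read the whole text; they often concentrate on the title, introduction, conclusion, and topic sentences to find the topic, then select sentences, polish them, and lightly organize them into a summary. Much human experience deserves attention: for the same document, different users may have different interests and viewpoints, so the resulting summaries should differ.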

Framework of Summarization System

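Text Formalization A Chinese text is a string of Chinese characters and punctuation marks: characters form words, words form phrases, and these in turn form sentences, paragraphs, sections, chapters, and whole documents. Here we call characters, words, phrases, and so on semantic feature terms. In terms of the information a text carries, a Chinese text can be fully represented by the frequencies of its feature terms and their relative order. The Vector Space Model (VSM) proposed by [G. Salton, 1988] represents a text as a vector. It was applied successfully in the SMART system and is the most successfully applied model. Its core concepts are: • Term (character, word, sentence) • Term frequency • Vector space model (VSM) • Similarity measure (Similarity)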
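
A minimal sketch of the vector space model: each text becomes a term-frequency vector, and similarity is the cosine of the angle between two such vectors. The function names are hypothetical, and the texts are assumed to be already segmented into terms.

```python
import math
from collections import Counter


def to_vector(terms):
    """Represent a text as a sparse term-frequency vector."""
    return Counter(terms)


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Raw counts are used as term weights here; a tf-idf weighting could be substituted without changing the similarity function.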

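Document Structure Analysis • Paragraph analysis • Position of the paragraph • Position of the sentence within the paragraph • Sentence boundary analysis • Sentence-level punctuation marks • Single-character marks: 。/：/；/！/？ • Two-character marks: 。”/ ！”/？” • Anaphora resolution after sentence splitting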
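
Sentence boundary analysis with the marks listed on this slide can be sketched with a single regular expression. The function name is hypothetical, and treating a closing quote as part of the boundary only in the three two-character combinations is our reading of the slide.

```python
import re

# Two-character marks (。” ！” ？”) are tried before the single-character marks
# so that a closing quote stays attached to the sentence it ends.
_SENTENCE_END = re.compile(r'[。！？]”|[。：；！？]')


def split_sentences(text):
    """Split Chinese text into sentences at sentence-level punctuation."""
    sentences = []
    start = 0
    for match in _SENTENCE_END.finditer(text):
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()  # text after the last boundary mark, if any
    if tail:
        sentences.append(tail)
    return [s for s in sentences if s]
```

Note that a colon introducing a quotation is also treated as a boundary, which follows the slide's list of marks literally.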

Document Structure Analysis • Based on structural analysis, the sentences in a document can be divided into the following types: • Other:

Sentence Importance--Score • Statistical features • Tf • Tf-idf • Linguistic features • Location • Semantic • Integrative features
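
One way to turn the Tf-idf feature into a sentence score is sketched below. Treating each sentence as the "document" for the idf statistic, and summing tf-idf weights over a sentence's terms, are assumptions made for illustration; the function name is hypothetical.

```python
import math
from collections import Counter


def tfidf_sentence_scores(sentences):
    """Score each sentence as the sum of tf * idf over its terms.

    idf(t) = log(N / df(t)), where N is the number of sentences and
    df(t) is the number of sentences containing t (an assumed setup).
    """
    n = len(sentences)
    df = Counter(t for s in sentences for t in set(s))
    scores = []
    for s in sentences:
        tf = Counter(s)
        scores.append(sum(c * math.log(n / df[t]) for t, c in tf.items()))
    return scores
```

Terms that occur in every sentence get an idf of zero and so contribute nothing to the score.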

Redundancy-- Similarity • Cosine • Okapi BM25 • Pivoted TFIDF • ……

Summary Algorithm • Base-line Algorithm • Clustering Algorithm • Dual Iterate Algorithm • MMR Algorithm • ……

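Base-line Algorithm • Directly select the first and last sentences of each document; • Cluster the documents, then produce a single-document summary of the central document (Centroid Document Summary); • Others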
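
The first baseline is simple enough to state directly in code. A minimal sketch, assuming each document is already split into a list of sentences (the function name is hypothetical):

```python
def lead_tail_baseline(documents):
    """Build a summary from the first and last sentence of each document."""
    summary = []
    for sentences in documents:
        if not sentences:
            continue
        summary.append(sentences[0])
        if len(sentences) > 1:  # avoid duplicating a one-sentence document
            summary.append(sentences[-1])
    return summary
```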

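Clustering Algorithm • Theoretical basis: Text linguistics holds that the meaning of a discourse is hierarchical, namely: • Central meaning of the discourse = the central meanings of its component meaning segments, combined according to certain logical relations • Central meaning of a segment = the central meanings of its component sub-segments, combined according to certain logical relations • Central meaning of a segment = the central meanings of its lower-level sub-segments, combined according to certain logical relations, and so on until no smaller sub-segment can be split off.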

Clustering • Sentences clustering • Adaptive clustering of paragraphs

Sentences clustering • MD (Mutual Dependence): F(X), F(η): the frequencies of the two variables; F(S): their co-occurrence frequency; L: the length of the document. The range of MD is [0, 1/4 * logL). • Cluster sentences: An adaptive fuzzy C-means method is used to cluster sentences into C classes. Each class is taken to express one subtopic or one piece of background knowledge.

Adaptive clustering of paragraphs • By combining the K-means clustering algorithm with a novel clustering analysis algorithm, the number of distinct latent topic regions in a document can be captured adaptively. • Topic-representative sentences are selected from each topic region to form the final summary.

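Dual Iterate Algorithm • The importance of texts and the importance of feature terms stand in a dual relationship: • An important text is one that contains many important words; • An important word is one that appears often in important texts. Let Wf denote the text weight vector and Wt the term weight vector, i.e.: Wf = (Wf1, Wf2, …, Wfm); Wt = (Wt1, Wt2, …, Wtn). Wf and Wt start from random values. Each iteration performs a vector-matrix multiplication. For any given initial values the iteration converges, and the final stable values are eigenvectors of the matrices A*Tr(A) and Tr(A)*A respectively (Tr(A) denotes the transpose of A).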

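Dual Iterate Algorithm • Theorem: For any given initial vectors Wf0 and Wt0, the iteration converges. Wf stabilizes at an eigenvector of the matrix A*Tr(A), and Wt stabilizes at an eigenvector of Tr(A)*A. [Clever Team, 1999] • Corollary: The stable values Wf* and Wt* of Wf and Wt satisfy the following relation: • Eigenvector selection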
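
The update formula did not survive on these slides, so the sketch below assumes the usual mutual-reinforcement form from the cited Clever (HITS-style) work: Wf is set to A·Wt, Wt is set to Tr(A)·Wf, and both are rescaled to unit length after each step. A is assumed to be an m×n text-by-term weight matrix; the function names and the uniform (rather than random) starting vectors are illustrative choices made so the example is reproducible.

```python
import math


def _normalize(v):
    """Scale a vector to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v


def dual_iterate(A, iterations=50):
    """Mutually reinforce text weights Wf and term weights Wt.

    Assumed updates: Wf <- A * Wt, then Wt <- Tr(A) * Wf, each normalized.
    A is an m x n list of lists (m texts, n terms).
    """
    m, n = len(A), len(A[0])
    wf = [1.0] * m
    wt = [1.0] * n
    for _ in range(iterations):
        wf = _normalize([sum(A[i][j] * wt[j] for j in range(n)) for i in range(m)])
        wt = _normalize([sum(A[i][j] * wf[i] for i in range(m)) for j in range(n)])
    return wf, wt
```

With these updates, Wf after each round is proportional to A*Tr(A) applied to the previous Wf, which is why the stable value is an eigenvector of A*Tr(A); the same argument gives Tr(A)*A for Wt.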

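MMR Algorithm • MMR: The key feature of the MMR technique proposed by Goldstein et al. is that, when selecting summary sentences, a candidate must be highly relevant to the topic while being as non-redundant as possible with respect to the sentences already selected. This preserves relevance to the topic or user query while reducing redundant information and adding distinctive content, which yields a higher-quality summary. • MMI-MS: A multi-document summarization system developed at Yokohama National University (Japan) combines MMR with IGR (Information Gain Ratio) to select summary sentences; the combination is called MMI-MS (Maximal Marginal Importance – Multi-Sentence). • MMR-MD: Goldstein et al. proposed using MMR-MD (Maximal Marginal Relevance Multi-Document) in multi-document summarization systems. • MMR-SS: Liu Hanlei, Guan Yi, et al. of Harbin Institute of Technology proposed MMR-SS (Semantic Similarity based Maximal Marginal Relevance), a maximal marginal relevance method based on sentence semantic similarity, to select summary sentences and generate a generic summary on a single topic.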
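
A minimal sketch of greedy MMR selection, using the standard form score(s) = λ·Sim(s, query) − (1 − λ)·max over selected s' of Sim(s, s'). The function names, the default λ = 0.7, and the Jaccard word-overlap similarity are assumptions for illustration only; the MMI-MS, MMR-MD, and MMR-SS variants use different relevance and similarity terms that are not reproduced here.

```python
def jaccard(a, b):
    """Word-overlap similarity between two space-separated strings (illustrative)."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def mmr_select(candidates, query_sim, pair_sim, k, lam=0.7):
    """Greedily pick k sentences that are relevant yet mutually non-redundant.

    query_sim(s) scores relevance to the topic or query; pair_sim(s, t)
    scores redundancy between two sentences.
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr(c):
            redundancy = max((pair_sim(c, s) for s in selected), default=0.0)
            return lam * query_sim(c) - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected
```

For example, `mmr_select(sentences, lambda s: jaccard(s, query), jaccard, k=2)` picks the most query-relevant sentence first and then penalizes near-duplicates of it.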

MMI-MS

MMR-MD

MMR-SS

Our Work • Based on Dual Iterate Algorithm • Based on MMR Algorithm

Based on Dual Iterate Algorithm • demo

Based on MMR Algorithm • demo

Challenges • Compression requirement • Evaluate summarization

Compression requirement Extracts of single documents usually aim to be 5 to 30 percent of the source length. However, compression targets in summarizing multiple sources or in providing summaries for handheld devices are much smaller. These high reduction rates pose a challenge because they are hard to attain without a reasonable amount of background knowledge. • Summarization compression • Sentence compression

Compression requirement From Language Technology Institute, Carnegie Mellon University

Evaluation • Intrinsic • Users judge the quality of summarization by directly analyzing the summary • Extrinsic • Users judge a summary’s quality according to how it affects the completion of some other task

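Evaluation (cont’d) • Edmundson's evaluation • The first is objective evaluation, carried out after the summarization experiments on source texts the system has not seen. The summaries produced by each method are compared with the target summaries to obtain a sentence co-selection rate, and the methods are then compared by their average co-selection rates. • The second is subjective evaluation, also carried out after the experiments: experts compare the information contained in the machine summary and the target summary and assign the machine summary a grade: completely dissimilar, basically similar, very similar, or completely similar. • The third counts wrongly extracted sentences. Its purpose is to measure the gap between the summaries obtained in the experiment and the target summaries given beforehand, i.e., to evaluate the experimental result.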

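Evaluation (cont’d) • Peking University automatic evaluation model for machine summaries [Figure labels: source text (title), machine summarization system, expert summary, machine summary, evaluation] Evaluation is the core of automatic summary assessment. The following basic rules apply: ① Both expert summaries and machine summaries are stored in text files; ② The basic unit of comparison is the sentence. A sentence is the text between two sentence-level punctuation marks. Sentence-level marks include “。” “：” “；” “！” “？”; ③ To make expert and machine summaries comparable, experts may only extract sentences from the source text and may not compose new sentences from their own understanding of it; ④ The sentences of both expert and machine summaries are listed in their order of appearance in the source text; ⑤ Definition: overlap rate p = number of matched sentences / number of sentences in the expert summary × 100%. The overlap rate of each machine summary is the average of its overlap rates against the summaries given by three experts. Average overlap rate = (P1 + P2 + … + Pn) / n (Pi is the overlap rate relative to the i-th expert, and n is the number of experts)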
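
The overlap-rate definition in rule ⑤ translates directly into code. A minimal sketch, assuming summaries are lists of sentences extracted verbatim from the source text (as rule ③ requires) and that a match means exact string equality; the function names are hypothetical.

```python
def overlap_rate(machine, expert):
    """p = matched sentences / sentences in the expert summary * 100."""
    if not expert:
        return 0.0
    machine_set = set(machine)
    matched = sum(1 for s in expert if s in machine_set)
    return 100.0 * matched / len(expert)


def average_overlap_rate(machine, experts):
    """Mean of the overlap rates against each expert's summary."""
    rates = [overlap_rate(machine, e) for e in experts]
    return sum(rates) / len(rates) if rates else 0.0
```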

Evaluation (cont’d) • Other methods • Central China Normal University • Topic Completeness (TC) • Representation Entropy (RE) • Ratio of completeness and redundancy (R)

Evaluation (cont’d) • Other methods • Ranked & Absolute & Factoid Evaluation • Ranked Evaluation: evaluators are presented with the two original document sentences; they also see a list of hypothesis summaries and are asked to rank them relative to one another. • Absolute Evaluation: evaluators are presented with the reference summary and a hypothesis and are asked to produce an absolute score for the hypothesis. • Factoid Evaluation: evaluators manually inspect the information content of each hypothesis.

DUC The Document Understanding Conference (DUC) is a series of summarization evaluations that have been conducted by the National Institute of Standards and Technology (NIST) since 2001. Its goal is to further progress in automatic text summarization and enable researchers to participate in large-scale experiments in both the development and evaluation of summarization systems.

DUC 2005 System Task • Task definition in [Amigo et al, 04] • … topic-oriented, informative multi-document summarization, … compressed version of a set of documents … • Topic creation instructions • to formulate a topic out of interesting aspects • “At least 25 documents must each contribute some material to the answer” of a quest of the topic • Our view of the task • A general, and topic-oriented summary. • …….

DUC 2005 Targeted Sentences • Good DUC 2005 summary: an extract consisting of sentences that are • highly representative • highly relevant to the topic • general • specific: named entities are favored • minimally redundant

Recent & Future Work • Recent work • Feature selection • Clustering and minimal document set • Future work • Set up an efficient Chinese evaluation platform • Support multi-language • Delta-closure & Semantic

References 1. www.nist.org 2. www.duc.org 3. Paice, C. D. Constructing literature abstracts by computer: techniques and prospects. Information Processing & Management, 1990(1): 171-186. 4. Karen Sparck Jones. Summarizing: where are we now? where should we go? Talk at the ACL/EACL Workshop on Intelligent Scalable Text Summarization, 1997. 5. Zhiqi Wang, Yongcheng Wang. Research on mathematical model for automatic summarization. 2005. 6. G. Salton. Automatic Text Processing. Addison-Wesley Publishing Company, 1988. 7. Luhn. The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development, Vol. 2, No. 2, 1958. 8. Chen Ning, Chen An, Zhou Longxiang. Document clustering based on fuzzy concept graphs and its application on the Web. Journal of Software (软件学报), 2002(08): 1598-08. 9. Zheng Yi, Wu Bin. Thoughts inspired by bird flocks and ants: research on agent-based simulation and swarm intelligence [J]. Microcomputer World (微电脑世界), 2001(1): 7-13. 10. Bu Dongbo. Doctoral dissertation. 11. Edmundson, H. P. New Methods in Automatic Abstracting. Journal of the ACM, 1969, 16: 264-285.

References 12. Yu Shiwen, Duan Huiming, Tian Jianqiu. Principles and implementation of automatic evaluation of machine summaries. 13. Po Hu, Tingting He, Donghong Ji, et al. A Study of Chinese Text Summarization Using Adaptive Clustering of Paragraphs. CIT'04. 14. Yanjun Li, Soon M. Chung. Text Document Clustering Based on Frequent Word Sequences. CIKM'05, 293-294. 15. Wei Dai, Rohini Srihari. Minimal Document Set Retrieval. CIKM'05, 752-759. 16. Hal Daume III, Daniel Marcu. Generic Sentence Fusion is an Ill-Defined Summarization Task. 17. Yu ShiWen, Duan Huiming, Tian Jianqiu. The theory and implementation of automatic evaluation of mechanical abstraction [C]. 1997. 18. Liu Qun, Li Sujian. Word semantic similarity computation based on HowNet (《知网》). 19. Dong Zhendong, Dong Qiang. http://www.keenage.com 20. Goldstein, J., V. Mittal, and J. Carbonell. Creating and Evaluating Multi-Document Sentence Extract Summaries. In CIKM’00: Ninth International Conference on Information Knowledge Management. 2000. 21. Wang Jianhui, Hu Yunfa, Li Ronglu. Adaptively determining summary length.

Thanks a lot! Acknowledge

Question-Answering Q&A
