User oriented text summarization

By Asef Poormasoomi

Motivation
• Generic summaries do not cater to the user’s background and interests.
• Results show that each person has a different perspective on the same text.
• A good summary should therefore change according to the preferences of its reader.

Motivation
• Marcu (1997) found that the percent agreement of 13 judges over 5 texts from Scientific American was 71 percent.
• Rath (1961) found that extracts selected by four different human judges had only 25 percent overlap.
• Salton (1997) found that the most important 20 paragraphs extracted by 2 subjects had only 46 percent overlap.

User Feedback
• Query history: the most widely used implicit user feedback at present (http://www.google.com/psearch).
• Click data: when a user clicks on a document, that document is considered to be of more interest to the user than the unclicked ones.
• Attention time: often referred to as display time or reading time.
• Other types of implicit user feedback include display time, scrolling, annotation, bookmarking and printing behaviors.

Article 1: Generating Personalized Summaries Using Publicly Available Web Documents, 2008 IEEE, Chandan Kumar, Prasad Pingali, Vasudeva Varma

• The approach extracts the personal information of the user from information available on the web.
• Generic sentence scoring, in general: compute the probability distribution p(w|D) over the words w appearing in the input D.
• For each sentence S in the input, assign a weight equal to the average probability of the words in the sentence.
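The generic scoring step can be sketched in a few lines. This is a minimal illustration, assuming p(w|D) is estimated as a relative frequency and that sentences arrive as token lists; the function names are hypothetical and not taken from the paper.

```python
from collections import Counter

def word_probabilities(sentences):
    """Estimate p(w|D) as the relative frequency of w in the input D (assumption)."""
    counts = Counter(w for s in sentences for w in s)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def generic_score(sentence, p_w_d):
    """Weight of a sentence: average probability of its words."""
    if not sentence:
        return 0.0
    return sum(p_w_d.get(w, 0.0) for w in sentence) / len(sentence)

# Toy input D: two tokenized sentences.
docs = [["microsoft", "opens", "lab"], ["lab", "studies", "security"]]
p_w_d = word_probabilities(docs)
scores = [generic_score(s, p_w_d) for s in docs]
```

Both toy sentences receive the same score here because each contains the frequent word "lab" once and two words that occur once.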

• Estimating the user background model: a search engine is used to extract the personal information of the user from information available on the web.
• The person’s full name is submitted to a search engine as a quoted phrase (e.g., "Albert Einstein").
• The top n documents are retrieved.
• After stop-word removal and stemming, a unigram language model is learned on the extracted text content.
• This model can be interpreted as the probability p(w|U) of a word w being related to the person’s profile U.
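A rough sketch of learning the unigram user model from already retrieved page texts follows. The stop-word list and the suffix-stripping "stemmer" are toy stand-ins chosen for illustration (a real system would use a full stop list and a proper stemmer), and the search-engine retrieval step is deliberately left out.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "in", "is", "was"}  # toy list (assumption)

def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def user_model(retrieved_texts):
    """Learn p(w|U) as relative frequencies over the retrieved pages."""
    tokens = []
    for text in retrieved_texts:
        for raw in text.lower().split():
            word = raw.strip('.,;:"()')
            if word and word not in STOP_WORDS:
                tokens.append(crude_stem(word))
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Hypothetical text of the top-n pages returned for the quoted name.
pages = ["The physicist studies relativity.", "Relativity and the physics of light"]
p_w_u = user_model(pages)
```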

• User-specific sentence scoring: the term probabilities of the document set D, p(w|D), and of the user profile U, p(w|U), are merged using a linear weighted combination, which gives the score of a sentence S for user u.
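One plausible reading of the linear weighted combination is sketched below. The exact formula is not reproduced in this transcript, so the per-word mixing, the averaging over the sentence, and the weight name `lam` are assumptions.

```python
def user_specific_score(sentence, p_w_d, p_w_u, lam=0.5):
    """Average over words of lam * p(w|D) + (1 - lam) * p(w|U) (assumed form)."""
    if not sentence:
        return 0.0
    mixed = (
        lam * p_w_d.get(w, 0.0) + (1.0 - lam) * p_w_u.get(w, 0.0)
        for w in sentence
    )
    return sum(mixed) / len(sentence)

# Toy probabilities for a document set D and a security-oriented user U.
p_w_d = {"lab": 0.4, "security": 0.1, "translation": 0.1}
p_w_u = {"security": 0.6, "lab": 0.1}
s_translation = ["lab", "translation"]
s_security = ["lab", "security"]
```

With `lam=1.0` the two toy sentences tie, which corresponds to the generic score; lowering `lam` moves the security sentence ahead for this user.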

• After sentence scoring, redundancy is eliminated.
• Redundancy is identified by the number of terms that overlap between the already generated summary and the new sentence being considered.
• Sentences are arranged by chronological order between documents (i.e., based on the time stamp) and by order of occurrence within a document.
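The redundancy check and the final ordering could look roughly like this. The dictionary fields, the greedy selection by score, and the overlap threshold expressed as a fraction of the candidate's terms are illustrative assumptions; the slides only state that term overlap with the current summary is measured.

```python
def select_sentences(candidates, max_sentences=2, max_overlap=0.5):
    """candidates: dicts with 'tokens', 'score', 'timestamp', 'position'."""
    chosen, summary_terms = [], set()
    for cand in sorted(candidates, key=lambda c: c["score"], reverse=True):
        terms = set(cand["tokens"])
        overlap = len(terms & summary_terms) / len(terms) if terms else 1.0
        if overlap > max_overlap:
            continue  # too many terms are already covered by the summary
        chosen.append(cand)
        summary_terms |= terms
        if len(chosen) == max_sentences:
            break
    # Order by document time stamp, then by position within the document.
    return sorted(chosen, key=lambda c: (c["timestamp"], c["position"]))

candidates = [
    {"tokens": ["lab", "opens", "india"], "score": 0.9, "timestamp": 2, "position": 0},
    {"tokens": ["lab", "opens", "bangalore"], "score": 0.8, "timestamp": 1, "position": 0},
    {"tokens": ["research", "cryptography"], "score": 0.7, "timestamp": 1, "position": 3},
]
summary = select_sentences(candidates)
```

In the toy data the second candidate is dropped because two of its three terms already appear in the summary, and the remaining two sentences are emitted oldest document first.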


• Example: the topic for summary generation is "Microsoft to open research lab in India".
• Eight articles published in different news sources form the news cluster.
• The example shows the condensed summary (100 words) for two users: User A is from the NLP domain and User B is from the network security domain.
• The italic text in a user-specific summary shows the difference compared to the generic summary.

• Generic summary: The New Lab, Called Microsoft Research India, Goes Online In January, And Will Be Part Of A Network Of Five Research Labs That Microsoft Runs Worldwide, Said Padmanabhan Anandan, Managing Director Of Microsoft Research India. Microsoft’s Mission India, Formally Inaugurated Jan. 12, 2005, Is Microsoft’s Third Basic Research Facility Established Outside The United States. In Line With Microsoft’s Research Strategy Worldwide, The Bangalore Lab Will Collaborate With And Fund Research At Key Educational Institutions In India, Such As The Indian Institutes Of Technology, Anandan Said. Although Microsoft Research Doesn’t Engage In Product Development Itself, Technologies Researchers Create Can Make Their Way Into The Products The Company

• User A specific summary: The New Lab, Called Microsoft Research India, Goes Online In January, And Will Be Part Of A Network Of Five Research Labs That Microsoft Runs Worldwide, Said Padmanabhan Anandan, Managing Director Of Microsoft Research India. Microsoft’s Mission India, Formally Inaugurated Jan. 12, 2005, Is Microsoft’s Third Basic Research Facility Established Outside The United States. Microsoft Will Collaborate With The Government Of India And The Indian Scientific Community To Conduct Research In Indic Language Computing Technologies, This Will Include Areas Such As Machine Translation Between Indian Languages And English, Search And Browsing And Character Recognition. In Line With Microsoft’s Research Strategy Worldwide, The Bangalore Lab

• User B specific summary: The New Lab, Called Microsoft Research India, Goes Online In January, And Will Be Part Of A Network Of Five Research Labs That Microsoft Runs Worldwide, Said Padmanabhan Anandan, Managing Director Of Microsoft Research India. The Newly Announced India Research Group Focuses On Cryptography, Security, Algorithms And Multimedia Security, Ramarathnam Venkatesan, A Leading Cryptographer At Microsoft Research In Redmond, Washington, In The US, Will Head The New Group. Microsoft Research India will conduct a four-week summer school featuring lectures by leading experts in the fields of cryptography, algorithms and security. The program is aimed at senior undergraduate students, graduate students and faculty

• Evaluation
• The technique was evaluated with five research scholars working in different fields of computer science.
• News articles from the science and technology domain were considered for summarization; 25 topics were chosen, each with 5-10 articles.
• Each researcher was asked to judge the relevance of both versions of the summaries for all 25 topics (score 1-5).
• Results show that the users prefer profile-based personalized summaries to the generic summary produced by a general automatic summarization system.


• Evaluation
• The figure shows the scores given by a particular user across different topics.
• For most topics, the user finds the personalized summaries relevant.
• Personalized summaries for topics strongly related to the user’s domain are more relevant to the user.
• For topics not closely related to the user’s field, the personalized and generic summaries are quite similar.
• For a few rare topics, the user did not find the personalized summary better.


Article 2: User-Oriented Document Summarization through Vision-Based Eye-Tracking, 2009 ACM, Songhua Xu, Hao Jiang, Francis C.M. Lau

• Main idea
• The key idea is to rely on the attention (reading) time that individual users spend on single words in a document.
• The prediction of user attention over every word in a document is based on the user’s attention during previous reads.
• The algorithm tracks a user’s attention time over individual words using a vision-based commodity eye-tracking mechanism.
• User attention time over an arbitrary word is predicted by a data mining process.
• The system uses a simple web camera and an existing eye-tracking algorithm from the Opengazer project.
• The error of the detected gaze location on the screen is 1-2 cm, depending on which area of the screen the user is looking at (on a 19" monitor).

• Anchoring gaze samples onto individual words
• The detected gaze central point is positioned at (x, y) in screen space.
• For each word, compute the central display point, denoted (xi, yi).
• The computation uses the average width and height of a word’s display bounding box in the document.
• Each gaze sample detected by the eye-tracking module is assigned to the words in the document in this manner.
• The overall attention that a word receives is the sum of all the fractional gaze samples assigned to it in this process.
• During processing, stop words are removed.
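The anchoring step can be illustrated with a small sketch. The weighting formula is not reproduced in this transcript, so the Gaussian-style falloff scaled by the average word width and height, and the normalization of each gaze sample to a total of one, are assumptions made only for illustration.

```python
import math

def distribute_gaze(gaze, word_centers, avg_w, avg_h):
    """Split one gaze sample at (x, y) into fractional shares per word."""
    x, y = gaze
    raw = {}
    for word, (xi, yi) in word_centers.items():
        dx = (x - xi) / avg_w  # horizontal distance in units of average word width
        dy = (y - yi) / avg_h  # vertical distance in units of average word height
        raw[word] = math.exp(-0.5 * (dx * dx + dy * dy))  # assumed kernel
    total = sum(raw.values())
    if total == 0.0:
        return {w: 0.0 for w in raw}  # gaze is far from every word
    return {w: v / total for w, v in raw.items()}

def accumulate_attention(gazes, word_centers, avg_w, avg_h):
    """Overall attention per word: sum of its fractional gaze samples."""
    attention = {w: 0.0 for w in word_centers}
    for gaze in gazes:
        for w, share in distribute_gaze(gaze, word_centers, avg_w, avg_h).items():
            attention[w] += share
    return attention

centers = {"eye": (100, 100), "tracking": (200, 100)}  # hypothetical layout
attention = accumulate_attention([(100, 100), (110, 100)], centers, avg_w=50, avg_h=20)
```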

• Prediction of user attention over a sentence
• Attention time prediction for a word is based on the semantic similarity between words.
• Sim(wi, wj) denotes the semantic similarity between word wi and word wj, where Sim(wi, wj) ∈ [0, 1].
• The similarity is computed with the algorithm proposed in: Y. Li, Z. A. Bandar, and D. McLean. An approach for measuring semantic similarity between words using multiple information sources. IEEE Transactions on Knowledge and Data Engineering.
• For an arbitrary word w that is not among the words wi (i = 1, …, n) with attention samples, calculate the similarity between w and every wi, and then select the k words that share the highest semantic similarity with w (k is set to min(10, n)).
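A sketch of predicting attention for an unseen word from its k most similar sampled words is shown below. The similarity-weighted average is an assumed combination rule (the prediction formula is not reproduced in this transcript), and `toy_sim` is a hypothetical lookup table standing in for the Li, Bandar and McLean measure.

```python
def predict_attention(word, sampled_attention, sim, max_k=10):
    """Similarity-weighted average over the k most similar sampled words."""
    k = min(max_k, len(sampled_attention))  # k = min(10, n) as on the slide
    ranked = sorted(sampled_attention, key=lambda w: sim(word, w), reverse=True)[:k]
    weight_sum = sum(sim(word, w) for w in ranked)
    if weight_sum == 0:
        return 0.0  # nothing similar enough to base a prediction on
    return sum(sim(word, w) * sampled_attention[w] for w in ranked) / weight_sum

# Hypothetical similarity values in [0, 1].
TOY_SIM = {("camera", "webcam"): 0.9, ("camera", "screen"): 0.3}

def toy_sim(a, b):
    if a == b:
        return 1.0
    return TOY_SIM.get((a, b), TOY_SIM.get((b, a), 0.0))

sampled = {"webcam": 2.0, "screen": 1.0, "travel": 4.0}  # attention from past reads
predicted = predict_attention("camera", sampled, toy_sim)
```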

• Predicting user attention for sentences
• The total attention of a user on a sentence is estimated as the sum of the user’s attention over all the words in the sentence.
• AT(wi, Uj) is user Uj’s attention over the word wi, which is either sampled from the user’s previous reading activities via (1) or predicted via (2).
• Each word’s term carries a weight: 0 if the word wi is a stop word; 0.6 if there is no attention sample for user Uj over wi; 1 otherwise.
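The weighted sum over the words of a sentence can be sketched directly from the three cases on the slide. The stop-word list and the two dictionaries of sampled and predicted attention are toy assumptions.

```python
STOP_WORDS = {"the", "a", "of", "on"}  # toy list (assumption)

def sentence_attention(sentence, sampled, predicted):
    """Sum of weight * AT(w, U) over the words of one sentence."""
    total = 0.0
    for w in sentence:
        if w in STOP_WORDS:
            weight, at = 0.0, 0.0              # stop words contribute nothing
        elif w in sampled:
            weight, at = 1.0, sampled[w]       # attention observed in past reads
        else:
            weight, at = 0.6, predicted.get(w, 0.0)  # predicted attention, discounted
        total += weight * at
    return total

sampled = {"webcam": 2.0, "gaze": 3.0}
predicted = {"tracks": 1.0}
score = sentence_attention(["the", "webcam", "tracks", "gaze"], sampled, predicted)
```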


• A hybrid summarization approach
• Early experiments showed that the performance of the user-oriented document summarization algorithm depends heavily on the amount of available user attention time samples.
• To address this issue, the new method is integrated with a conventional automatic document summarization algorithm (MEAD).
• An indicator term equals 1 if sentence si is selected by MEAD in its document summarization result, and 0 otherwise.
• k is a free parameter and is user-tunable.

• Experiment results
• The document summarization results are compared with those generated by two popular text summarization algorithms.
• Two sets of articles are used: the first set is about science (60 articles from "Science" magazine), and the second is about entertainment and leisure (60 articles randomly selected from the travel and sports sections of the "New York Times").
• Twelve people with different knowledge backgrounds read selected articles from the two sets.
• They were asked to provide a summary of each article they had just read.

• Experiment results
• To measure performance, three measurements are introduced: recall (R), precision (P) and F-rate (F).
• SUe is the human summary result.
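The three measurements can be computed as below. The formulas are not reproduced in this transcript, so this sketch uses the standard set-based definitions and assumes summaries are compared as sets of sentence identifiers; `human` plays the role of SUe.

```python
def precision_recall_f(system, human):
    """Standard P, R and F between a system summary and a human summary."""
    system, human = set(system), set(human)
    hits = len(system & human)
    precision = hits / len(system) if system else 0.0
    recall = hits / len(human) if human else 0.0
    f = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f

# Sentence ids chosen by the algorithm and by the human reader (toy data).
p, r, f = precision_recall_f(system=[1, 2, 3, 4], human=[2, 3, 5])
```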


• Experiment results
• An experiment evaluates the performance of the hybrid approach under different settings of the parameter k.

Article 3: Webpage Summarization Using Clickthrough Data, 2005 ACM, Jiantao Sun, Dou Shen, Huajun Zeng, Qiang Yang, Yuchang Lu, Zheng Chen

• Main idea
• Use the extra knowledge in clickthrough data to improve Web-page summarization.
• A collection of clickthrough data can be represented by a set of triples <u, q, p> (user u, query q, clicked page p).
• Typically, a user’s query words reflect the true meaning of the target Web-page content.
• The new algorithm adapts two text-summarization methods to summarize Web pages.
• The first approach is based on significant-word selection, adapted from Luhn’s method.
• The second method is based on Latent Semantic Analysis (LSA).
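As a small illustration of the data representation, the triples can be grouped so that each page carries the query words of the users who clicked it. The tuple layout, whitespace tokenization, and example URLs are assumptions.

```python
from collections import Counter, defaultdict

def query_terms_by_page(triples):
    """Collect query-word counts per clicked page from <u, q, p> triples."""
    terms = defaultdict(Counter)
    for _user, query, page in triples:
        terms[page].update(query.lower().split())
    return terms

# Hypothetical clickthrough log.
log = [
    ("u1", "cheap paris flights", "http://example.com/flights"),
    ("u2", "Paris flights", "http://example.com/flights"),
    ("u1", "python tutorial", "http://example.com/python"),
]
page_terms = query_terms_by_page(log)
```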

• Problems: Web pages may have no associated query words, and the clickthrough data are often very noisy.
• Solution: a thematic lexicon, built using an annotated hierarchical taxonomy of Web pages such as the one provided by the ODP web site (http://dmoz.org/).

• Adapted Significant Word (ASW) method: each sentence is assigned a significance factor (based on word frequency), and the sentences with high significance factors are selected to form the summary; a customized significance factor is used.
• Adapted Latent Semantic Analysis (ALSA) method: the corpus can be represented by a term-document matrix.
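A simplified sketch of the significant-word idea follows. The customized factor is not reproduced in this transcript, so the mixing of page frequency with query frequency via `alpha`, the choice of the `top_n` heaviest words as significant, and the whole-sentence variant of Luhn's factor (significant words squared over sentence length) are all assumptions.

```python
from collections import Counter

def asw_scores(sentences, query_terms, alpha=0.5, top_n=2):
    """Score sentences by how densely they contain significant words."""
    page_tf = Counter(w for s in sentences for w in s)
    vocab = set(page_tf) | set(query_terms)
    # Assumed customized weight: blend of page frequency and query frequency.
    weight = {w: (1 - alpha) * page_tf[w] + alpha * query_terms.get(w, 0) for w in vocab}
    significant = set(sorted(weight, key=lambda w: (-weight[w], w))[:top_n])
    scores = []
    for s in sentences:
        hits = sum(1 for w in s if w in significant)
        scores.append(hits * hits / len(s) if s else 0.0)
    return scores

page = [
    ["cheap", "flights", "to", "paris"],
    ["contact", "us", "for", "details"],
    ["book", "paris", "flights", "online"],
]
scores = asw_scores(page, query_terms={"paris": 3, "flights": 2})
```

In the toy page, "paris" and "flights" become the significant words, so the two sentences that mention both outrank the contact line.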

• Summarizing Web pages not covered by clickthrough data
• Build a thematic lexicon.
• TS(c) represents a set of terms associated with category c.
• The thematic lexicon is a set of TS, which correspond to the categories in ODP.
• The lexicon is built as follows:
• First, the TS corresponding to each category is set to empty.
• For each page covered by the clickthrough data, its query words are added to the TS of its category.
• If a page belongs to more than one category, its query terms are added to the TS of all its categories.
• Finally, each term weight in each TS is multiplied by its Inverse Category Frequency (ICF).
• For each Web page that is not covered by the clickthrough data, first look up the lexicon for the TS according to the page’s category; then the summarization methods are applied.
• When a TS does not have sufficient terms, the TS corresponding to its parent category is used.
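The lexicon construction steps above can be sketched as follows. The ICF definition used here, log(number of categories / number of categories containing the term), is an assumption made by analogy with inverse document frequency, and the input dictionaries are hypothetical.

```python
import math
from collections import Counter, defaultdict

def build_lexicon(page_queries, page_categories):
    """Build TS(c) for every category and weight terms by an assumed ICF."""
    lexicon = defaultdict(Counter)  # every TS starts empty
    for page, terms in page_queries.items():
        for category in page_categories.get(page, []):
            lexicon[category].update(terms)  # add to every category of the page
    n_categories = len(lexicon)
    cat_freq = Counter(t for ts in lexicon.values() for t in ts)
    return {
        c: {t: tf * math.log(n_categories / cat_freq[t]) for t, tf in ts.items()}
        for c, ts in lexicon.items()
    }

page_queries = {"p1": ["paris", "hotel"], "p2": ["python", "hotel"]}
page_categories = {"p1": ["Travel"], "p2": ["Computers", "Travel"]}
lexicon = build_lexicon(page_queries, page_categories)
```

Terms that occur in every category, such as "hotel" in the toy data, end up with weight zero, which is the intended effect of the category-frequency discount.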

• Experiments
• The data set contains about 44.7 million records covering 29 days, from Dec 6, 2003 to Jan 3, 2004 (MSN search engine).
• 3,074,678 Web pages of the ODP directory were crawled.
• In the end, 1,125,207 Web pages were obtained, 260,763 of which were clicked by Web users using 1,586,472 different queries.
• DAT1 consists of 90 pages selected from the browsed pages.
• Three human evaluators were employed to summarize these pages.
• A relatively large-scale data set, denoted DAT2 (10,000 pages), is also used to evaluate the summarization methods.

• Summarization results on DAT1 (ASW)
• ROUGE is a software package adopted by DUC for automatic summarization evaluation (http://www.isi.edu/~cyl/ROUGE/).

• Summarization results on DAT1 (ALSA)

• Evaluating the summarization method using the thematic lexicon
• The clickthrough data contain only 260,763 pages, and the lexicon contains 141,869 categories, which is a subset of the ODP category structure.
• If the terms under a page’s category have more than P% overlap with the distinct terms in the Web page, they are used for summarization; otherwise, the lexicon terms of the parent category are used.
• This process continues until a category is found that covers enough query terms or the root of the thematic lexicon is reached.
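The fallback from a category to its ancestors might be sketched like this. Measuring overlap as the percentage of the page's distinct terms that also occur in TS, the default threshold value, and the `parent` mapping are assumptions made for illustration.

```python
def terms_for_page(page_terms, category, lexicon, parent, p_percent=10.0):
    """Walk up the category tree until a TS overlaps the page by more than P%."""
    distinct = set(page_terms)
    while category is not None:
        ts = lexicon.get(category, {})
        overlap = 100.0 * len(distinct & set(ts)) / len(distinct) if distinct else 0.0
        if overlap > p_percent:
            return category, ts
        category = parent.get(category)  # None once the root has been tried
    return None, {}

# Hypothetical lexicon and category hierarchy.
lexicon = {
    "Top/Travel/Europe": {"visa": 1.0},
    "Top/Travel": {"hotel": 2.0, "flights": 1.5},
    "Top": {},
}
parent = {"Top/Travel/Europe": "Top/Travel", "Top/Travel": "Top"}
used_category, used_terms = terms_for_page(
    ["cheap", "hotel", "flights", "paris"], "Top/Travel/Europe", lexicon, parent
)
```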

• Evaluating the summarization method using the thematic lexicon (ASW)

• Evaluating the summarization method using the thematic lexicon (ALSA)

thanks
