Shoaib Jameel, Wai Lam and Xiaojun Qian The Chinese University of Hong Kong

Ranking Text Documents Based on Conceptual Difficulty Using Term Embedding and Sequential Discourse Cohesion Shoaib Jameel, Wai Lam and Xiaojun Qian The Chinese University of Hong Kong

Outline • Introduction to Readability/Conceptual Difficulty • Motivation • Related Work • Our method (Sequential Term Transition Model (STTM)) • Empirical Evaluation • Conclusions and Future Work

Which of the two appears simpler to you? 1 http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm 2 http://en.wikipedia.org/wiki/Proton

Search for a keyword • Results: sometimes irrelevant, and in mixed order of readability

An attempt by Google

Our Objective • Query • Retrieve web pages (considering relevance) • Re-rank web pages based on readability • All steps accomplished automatically

What has been done so far? • Heuristic Readability formulae • Unsupervised approaches • Supervised approaches

Heuristic Readability Methods • Have been in use since the 1940s • Semantic component: number of syllables per word, length of the syllables per word, etc. • Syntactic component: length of sentences, etc.

Example: Flesch Reading Ease • Syllable splits: water -> wa-ter, proton -> pro-ton, embryology -> em-bry-ol-o-gy, star -> star • Problem: the formula consists of a syntactic component, a semantic component, and manually tuned numerical parameters
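As a hedged illustration of the formula behind this example, the sketch below uses the standard published Flesch Reading Ease constants (206.835, 1.015, 84.6), which do not appear in the transcript itself. Syllable counts are read off the hyphenated splits shown on the slide rather than computed, and the helper names are hypothetical.

```python
# Syllable splits copied from the slide; counting hyphen-separated parts
# avoids relying on an automatic syllabifier.
SPLITS = {
    "water": "wa-ter",
    "proton": "pro-ton",
    "embryology": "em-bry-ol-o-gy",
    "star": "star",
}


def syllable_count(word):
    """Number of syllables, taken from the slide's hyphenation."""
    return len(SPLITS[word].split("-"))


def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Standard Flesch Reading Ease score (higher means easier)."""
    words_per_sentence = total_words / total_sentences   # syntactic component
    syllables_per_word = total_syllables / total_words   # semantic component
    # 206.835, 1.015 and 84.6 are the manually tuned numerical parameters.
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


if __name__ == "__main__":
    for word in SPLITS:
        print(word, syllable_count(word))
```

Under this counting, "water" and "proton" both have two syllables, so the semantic component alone cannot tell the everyday word from the scientific one.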

Supervised Learning Methods • Language Models • Unigram Language Model based method • SVMs (Support Vector Machines) • Use of query logs and user profiles • Can address the problem on an individual basis

Smoothed Unigram Model [1] • Recasts the well-studied problem of readability as text categorization and uses straightforward techniques from statistical language modeling. [1] K. Collins-Thompson and J. Callan. 2005. Predicting reading difficulty with statistical language models. Journal of the American Society for Information Science and Technology 56(13), pp. 1448-1462.
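The idea of treating readability as text categorization with unigram language models can be sketched as follows. This is a minimal illustration, not the method of [1]: it assumes simple add-alpha smoothing (the cited work uses its own smoothing scheme), two made-up difficulty levels, and toy training sentences. All function names are hypothetical.

```python
import math
from collections import Counter


def train_unigram(docs, vocab, alpha=1.0):
    """Estimate P(word | level) with add-alpha smoothing over a shared vocabulary."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def train(labeled_docs, alpha=1.0):
    """Build one smoothed unigram model per difficulty level."""
    vocab = sorted({tok
                    for docs in labeled_docs.values()
                    for doc in docs
                    for tok in doc.lower().split()})
    return {level: train_unigram(docs, vocab, alpha)
            for level, docs in labeled_docs.items()}


def predict(models, text):
    """Return the level whose model gives the text the highest log-likelihood."""
    tokens = text.lower().split()

    def log_likelihood(model):
        # Out-of-vocabulary tokens are skipped (an assumption of this sketch).
        return sum(math.log(model[tok]) for tok in tokens if tok in model)

    return max(models, key=lambda level: log_likelihood(models[level]))


# Toy training data, invented for illustration only.
TRAINING = {
    "easy": ["the cat sat on the mat"],
    "hard": ["the proton carries positive electric charge"],
}
MODELS = train(TRAINING)
```

The dependence on `TRAINING` is exactly the limitation of supervised approaches: without labeled documents per level, there is nothing to estimate the models from.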

Smoothed Unigram Model • Limitation of their method: requires training data, which may be difficult to obtain

Domain-specific Readability • Jin Zhao and Min-Yen Kan. 2010. Domain-specific iterative readability computation. In Proceedings of the 10th annual joint conference on Digital libraries (JCDL '10). Based on the web link-structure algorithms HITS (Hypertext Induced Topic Search) and SALSA (Stochastic Approach for Link-Structure Analysis). • Xin Yan, Dawei Song, and Xue Li. 2006. Concept-based document readability in domain specific information retrieval. In Proceedings of the 15th ACM international conference on Information and knowledge management (CIKM '06). Based on an ontology. Tested only in the medical domain. I will focus on this work.

Overview • The authors state that Document Scope and Document Cohesion are important parameters in finding simple texts. • The authors use a controlled-vocabulary thesaurus called Medical Subject Headings (MeSH). • The authors point out that readability formulae are not directly applicable to web pages.

MeSH Ontology [Figure: the MeSH concept hierarchy, annotated with the directions "Concept difficulty increases" and "Concept difficulty decreases"]

Overall Concept Based Readability Score [equation shown on the slide], where: • DaCw = Dale-Chall Readability Measure • PWD = Percentage of difficult words • AvgSL = Average sentence length in di • len(ci, cj) = function to compute the shortest path between concepts ci and cj in the MeSH hierarchy • N = total number of domain concepts in document di • Depth(ci) = depth of the concept ci in the concept hierarchy • D = maximum depth of the concept hierarchy • Number of associations = total number of mutual associations among concepts • Their work focused on word-level readability, hence considered only the PWD
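To make Depth(ci) and len(ci, cj) concrete, here is a minimal sketch over a toy concept tree. The hierarchy below is invented for illustration and is not the actual MeSH tree; the convention that the root has depth 0 and the function names are assumptions as well.

```python
# Toy concept hierarchy (child -> parent); NOT the real MeSH tree.
PARENT = {
    "disease": None,
    "cardiovascular disease": "disease",
    "heart disease": "cardiovascular disease",
    "arrhythmia": "heart disease",
    "vascular disease": "cardiovascular disease",
}


def ancestors(concept):
    """Path from the concept up to the root, starting with the concept itself."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def depth(concept):
    """Number of edges between the concept and the root (root depth 0 assumed)."""
    return len(ancestors(concept)) - 1


def path_length(a, b):
    """Shortest path, in edges, between two concepts via their lowest common ancestor."""
    steps_from_b = {c: i for i, c in enumerate(ancestors(b))}
    for steps_from_a, c in enumerate(ancestors(a)):
        if c in steps_from_b:
            return steps_from_a + steps_from_b[c]
    raise ValueError("concepts are not in the same hierarchy")


MAX_DEPTH = max(depth(c) for c in PARENT)  # plays the role of D
```

In a tree, the shortest path between two concepts always passes through their lowest common ancestor, which is why walking up from both concepts is enough.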

Use of Query Log data • Studies have been conducted by the search engine companies • Requires proprietary data, not available publicly • Thus not very useful to the research community because it cannot be replicated • J. Kim, K. Collins-Thompson, P. N. Bennett, S. Dumais. Characterizing Web Content, User Interests, and Search Behavior by Reading Level and Topic. In Proceedings of WSDM 2012. (Microsoft Research) • Chenhao Tan, Evgeniy Gabrilovich, and Bo Pang. 2012. To each his own: personalized content selection based on text comprehensibility. In Proceedings of WSDM 2012. (Yahoo! Research)

Our approach • Sequential Term Transition Model (STTM) • A conceptual difficulty determination model which is: • Unsupervised • Does not require any knowledge base or annotated data

Methodology • We first build a term-document matrix W • We then perform Singular Value Decomposition (SVD) on the matrix • SVD: W ≈ W' = U S V^T • U is a T x f matrix of left singular vectors • V is a D x f matrix of right singular vectors • S is an f x f diagonal matrix of singular values • T is the number of terms in the vocabulary • D is the number of documents in the collection • f is the number of factors
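The decomposition above can be sketched with NumPy. The matrix values are made up, and the choice of raw counts (rather than any particular term weighting) is an assumption; the slide does not say how the matrix entries are weighted.

```python
import numpy as np

# Toy term-document matrix W: T = 5 terms (rows), D = 4 documents (columns).
W = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 3, 0, 1],
    [0, 0, 2, 2],
    [1, 0, 0, 3],
], dtype=float)


def truncated_svd(W, f):
    """Return U (T x f), S (f x f) and V (D x f) for the top f factors."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :f], np.diag(s[:f]), Vt[:f, :].T


U, S, V = truncated_svd(W, f=2)
W_approx = U @ S @ V.T  # rank-2 approximation W' of W
```

With f equal to the rank of W the product reproduces W exactly; smaller f gives the low-rank approximation W' in which related terms and documents are pulled together.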

Observations in the SVD space • Terms which are central to a document lie close to their document vectors • General terms lie far from their document vectors • Semantically related terms cluster close to each other • Unrelated terms cluster away from each other
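One way to probe these observations is to measure the cosine similarity between every term vector and every document vector in the latent space. The sketch below is only a measurement aid, not the authors' term difficulty formula. Scaling both term and document coordinates by the singular values is a common LSA convention and an assumption here, and the function names are hypothetical.

```python
import numpy as np


def latent_vectors(W, f):
    """Term and document coordinates in the f-dimensional SVD space."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    terms = U[:, :f] * s[:f]      # row i is the vector of term i
    docs = Vt[:f, :].T * s[:f]    # row j is the vector of document j
    return terms, docs


def unit_rows(M):
    """Normalize each row to unit length, leaving all-zero rows untouched."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    return M / np.where(norms == 0, 1.0, norms)


def term_document_cosines(W, f):
    """Matrix whose entry (i, j) is the cosine between term i and document j."""
    terms, docs = latent_vectors(W, f)
    return unit_rows(terms) @ unit_rows(docs).T


# Toy term-document matrix (values invented for illustration).
W = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 3, 0, 1],
    [0, 0, 2, 2],
    [1, 0, 0, 3],
], dtype=float)
COSINES = term_document_cosines(W, f=2)
```

A row of `COSINES` with uniformly high values for the documents containing the term would correspond to a term that sits close to its documents; low values would correspond to a term that sits far from them.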

Computing Term Difficulties [equation shown on the slide, annotated with:] • Normalized term vector • Normalized document vector • Matrix of normalized document vectors that contain the term

General Idea about Linear Embedding [Figure: a term t surrounded by documents D1 to D6, with labels w1 to w6]

Cohesion • When units tend to “stick together”, the property is called cohesion • We compute cohesion between terms in sequence • The more cohesive the terms in a document are, the easier it is for a person to comprehend the discourse

Computation of Cohesion • We know related terms cluster close to each other in the latent space obtained via SVD • We have to compute the cluster memberships of each of the terms as SVD does not directly give term memberships to clusters • We use k-means because of its simplicity and ability to handle large datasets
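Assigning cluster memberships to term vectors can be sketched with a small NumPy implementation of k-means (Lloyd's algorithm). The deterministic farthest-first initialization is an assumption chosen to keep the example reproducible; the slides do not say how the clusters were initialized or how k was chosen.

```python
import numpy as np


def kmeans(X, k, n_iter=100):
    """Cluster the rows of X into k groups; returns (labels, centroids)."""
    X = np.asarray(X, dtype=float)

    # Farthest-first initialization: start from the first point, then keep
    # adding the point farthest from all centroids chosen so far.
    centroids = [X[0]]
    for _ in range(1, k):
        nearest = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[np.argmax(nearest)])
    centroids = np.array(centroids)

    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated

    # Final assignment against the final centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(dists, axis=1), centroids


# Toy "term vectors": two well-separated groups in a 2-D latent space.
TERM_VECTORS = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
])
LABELS, CENTROIDS = kmeans(TERM_VECTORS, k=2)
```

Each row of `TERM_VECTORS` stands in for one term's coordinates in the SVD space; `LABELS` then gives the cluster membership that the SVD itself does not provide.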

How do we compute cohesion? • Take the sequence of terms W1 W2 W3 W4 W5 W6 W7 W8 W9 W10 • Determine the cluster memberships of two consecutive terms, starting with W1 and W2 • W1 in C1, W2 in C1: same cluster, so we conclude they are cohesive • W2 in C1, W3 in C1: same cluster, so we conclude they are cohesive • W3 in C1, W4 in C4: different clusters, so we compute the cosine similarity

Cohesion using cosine similarity • If the cluster centroids are close to each other, then the cosine similarity will be high • A high cosine similarity means that the two clusters are closely related
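The sequential procedure can be sketched as follows. Two details are assumptions of this sketch and are not stated on the slides: a pair of consecutive terms in the same cluster is scored as 1.0, and the per-pair scores are averaged into a single document-level value. The function names are hypothetical.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0


def cohesion_score(cluster_ids, centroids):
    """Average cohesion over consecutive term pairs of one document.

    cluster_ids: cluster index of each term, in reading order.
    centroids:   array whose row c is the centroid of cluster c.
    """
    pairs = list(zip(cluster_ids, cluster_ids[1:]))
    if not pairs:
        return 0.0
    scores = [
        1.0 if a == b else cosine(centroids[a], centroids[b])
        for a, b in pairs
    ]
    return sum(scores) / len(scores)


# Toy centroids: clusters 0 and 1 are orthogonal, so unrelated.
CENTROIDS = np.array([[1.0, 0.0], [0.0, 1.0]])
```

For the cluster sequence `[0, 0, 0, 1]` the first two transitions stay within a cluster and the last one crosses to an orthogonal cluster, so the three pair scores are 1, 1 and 0.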

Conceptual Difficulty Score [equation shown on the slide, annotated with:] • Conceptual difficulty score for document j • Cohesion score of document j • Term difficulty score for document j • Parameter in [0, 1] controlling the relative weights
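The exact equation did not survive in the transcript, so the following is only one plausible reading: a convex combination of the two scores weighted by a parameter in [0, 1]. Which term the weight multiplies, the use of (1 - cohesion) so that more cohesive documents count as easier, and the assumption that both inputs are scaled to [0, 1] are all assumptions of this sketch, as are the function names.

```python
def conceptual_difficulty(term_difficulty, cohesion, beta=0.5):
    """Hypothetical combination of term difficulty and cohesion for one document.

    Both inputs are assumed to lie in [0, 1]; higher output means harder.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    # Higher cohesion is assumed to lower difficulty, hence (1 - cohesion).
    return beta * term_difficulty + (1.0 - beta) * (1.0 - cohesion)


def rank_easiest_first(docs, beta=0.5):
    """Order document names from conceptually easiest to hardest.

    docs maps a document name to a (term_difficulty, cohesion) pair.
    """
    return sorted(docs, key=lambda name: conceptual_difficulty(*docs[name], beta=beta))
```

Setting `beta` to 1 would rank on term difficulty alone and setting it to 0 on cohesion alone, which is the sense in which the parameter controls the relative weights.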

Empirical Evaluation - Dataset • Standard test collections do not have readability judgments • We chose the psychology domain • Crawled web pages from Wikipedia, Psychology.com, Simple English Wikipedia • Total web page count = 167,400 • No term stemming • Tested both with and without stopwords

Retrieval of web pages • Indexed the web pages using a small-scale search engine (Zettair) • Retrieved web pages for a query based on relevance • Followed INEX’s query/topic generation guidelines • Re-ranked web pages based on conceptual difficulty • Annotated some top-10 documents for each query

Evaluation Metric • Normalized Discounted Cumulative Gain (NDCG) • Well suited to ranking evaluation because it takes into account the position of an item in the ranked list, unlike precision, recall, or rank-order correlation
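A minimal NDCG sketch is given below. It assumes the linear-gain form of DCG with a log2 position discount and a cutoff of k = 10; the slides do not say which gain variant or cutoff was used, and the ideal ranking is taken from the same list of judged documents.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of graded relevances in ranked order."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances, k=10):
    """DCG at k divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0
```

Because the discount grows with rank, placing a highly graded document lower in the list reduces the score, which is the position sensitivity that precision and recall lack.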

Results when β=0.5

Conclusions and Future Work • We proposed a conceptual difficulty ranking model • Requires no training data or ontology • Main novelty: use of a conceptual model • Significant improvement • In the future, we will study how the link structure of the web could aid conceptual difficulty ranking
