Extracting Key Terms From Noisy and Multi-theme Documents
Maria Grineva, Maxim Grinev and Dmitry Lizorkin
Institute for System Programming of RAS
Outline
• Key terms extraction: traditional approaches and applications
• Using Wikipedia as a knowledge base for Natural Language Processing
• Main techniques of our approach:
  • Wikipedia-based semantic relatedness
  • Network analysis algorithm to detect community structure in networks
• Our method
• Experimental evaluation
Key Terms Extraction
• Basic step for various NLP tasks:
  • document classification
  • document clustering
  • text summarization
  • inferring a more general topic of a text document
• Core task of Internet content-based advertising systems, such as Google AdSense and Yahoo! Contextual Match
• Web pages are typically noisy (sidebars/menus, comments, announcements of upcoming stories, etc.)
• Dealing with multi-theme Web pages (portal home pages, etc.)
Approaches to Key Terms Extraction
• Based on statistical learning:
  • use, for example, a frequency criterion (the TFxIDF model; see the baseline sketch after this list), keyphrase frequency, or the distance between terms normalized by the number of words in the document (KEA)
  • compute statistical features over the Wikipedia corpus (Wikify!)
  • require a training set
• Based on analyzing syntactic or semantic term relatedness within a document:
  • compute semantic relatedness between terms (using, for example, Wikipedia)
  • model the document as a semantic graph of terms and apply graph analysis techniques to it (TextRank)
  • no training set required
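To make the frequency-based family concrete, here is a generic TFxIDF baseline (not the authors' system): score a document's terms by tf-idf against a corpus and take the top scorers as key terms. It uses scikit-learn; the toy corpus is illustrative only.

```python
# Generic TFxIDF key-term baseline; the corpus below is a toy stand-in.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "apple unveils an itunes update for blind users",
    "stock markets fall as investors await earnings",
    "new itunes accessibility features praised by blind users",
]
vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
tfidf = vec.fit_transform(corpus)
terms = vec.get_feature_names_out()
scores = tfidf[0].toarray().ravel()  # tf-idf scores for the first document
print(sorted(zip(scores, terms), reverse=True)[:5])  # top candidate key terms
```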
Using Wikipedia as a Knowledge Base for Natural Language Processing
• Wikipedia (www.wikipedia.org) is a free, open encyclopedia
• Today Wikipedia is the biggest encyclopedia (more than 2.7 million articles in the English Wikipedia)
• It is kept up-to-date by millions of editors around the world
• It has a huge network of cross-references between articles, a large number of categories, redirect pages, and disambiguation pages => a rich resource for bootstrapping NLP and IR tasks
Basic Techniques of Our Method: Semantic Relatedness of Terms
• Semantic relatedness assigns a score to a pair of terms that represents the strength of the relationship between them
• We use Wikipedia to compute the semantic relatedness of terms
• We use semantic relatedness to model a document as a graph of terms
Basic Techniques of Our Method: Semantic Relatedness of Terms
• Wikipedia-based semantic relatedness for two terms can be computed using:
  • the links found within their corresponding Wikipedia articles
  • the Wikipedia category structure
  • the articles' textual content
• We use the Dice measure for Wikipedia-based semantic relatedness (see the sketch after this list)
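A minimal sketch of link-based Dice relatedness, 2·|A ∩ B| / (|A| + |B|) over the link sets of two articles. The `get_links` helper (returning the set of articles a given Wikipedia page links to) is a hypothetical stand-in for a real Wikipedia index.

```python
def dice_relatedness(article_a, article_b, get_links):
    """Dice coefficient over Wikipedia link sets: 2*|A & B| / (|A| + |B|)."""
    links_a, links_b = get_links(article_a), get_links(article_b)
    if not links_a and not links_b:
        return 0.0
    return 2 * len(links_a & links_b) / (len(links_a) + len(links_b))

# Toy link sets standing in for real Wikipedia pages:
toy = {"Apple Inc.": {"ITunes", "Steve Jobs", "IPod"},
       "ITunes": {"Apple Inc.", "IPod", "Music"}}
print(dice_relatedness("Apple Inc.", "ITunes", toy.__getitem__))  # 0.333...
```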
Basic Techniques of Our Method: Detecting Community Structure in Networks
• We discover communities of terms in the document graph
• Community: a densely interconnected group of nodes in a network
• Girvan-Newman algorithm for detecting community structure in networks (a sketch follows this list):
  • betweenness: how much an edge lies "in between" different communities
  • modularity: a partition is a good one if there are many edges within communities and only a few between them
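A hedged illustration using networkx rather than the authors' implementation: Girvan-Newman repeatedly removes the edge with the highest betweenness, and among the resulting partitions we keep the one with the best modularity.

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman, modularity

def best_communities(graph):
    """Run Girvan-Newman and return the partition maximizing modularity."""
    best, best_q = None, float("-inf")
    for partition in girvan_newman(graph):  # yields successively finer partitions
        q = modularity(graph, partition)
        if q > best_q:
            best, best_q = partition, q
    return best, best_q

g = nx.karate_club_graph()  # standard test network
communities, q = best_communities(g)
print(len(communities), round(q, 3))
```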
Our Method
• Candidate terms extraction
• Word sense disambiguation
• Building the semantic graph
• Discovering the community structure of the semantic graph
• Selecting valuable communities
Our Method: Candidate Terms Extraction
• Goal: extract all terms from the document and, for each term, prepare a set of Wikipedia articles that can describe its meaning
• Parse the input document and extract all possible n-grams (a minimal sketch follows this list)
• For each n-gram (plus its morphological variations), provide a set of Wikipedia article titles
  • "drinks", "drinking", "drink" => [Wikipedia:] Drink; Drinking
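A minimal sketch of the n-gram pass, enumerating all n-grams up to length 3. Mapping n-grams and their morphological variants to Wikipedia article titles would be a separate lookup against a Wikipedia title index, which is not shown here.

```python
import re

def extract_ngrams(text, max_n=3):
    """Return the set of all 1..max_n word n-grams in the text."""
    tokens = re.findall(r"[\w']+", text.lower())
    ngrams = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.add(" ".join(tokens[i:i + n]))
    return ngrams

print(sorted(extract_ngrams("Apple to make iTunes more accessible")))
```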
Our Method: Word Sense Disambiguation
• Goal: choose the most appropriate Wikipedia article from the set of candidate articles for each ambiguous term extracted in the previous step
• Uses Wikipedia disambiguation and redirect pages to obtain the candidate meanings of ambiguous terms
• Denis Turdakov, Pavel Velikhov. "Semantic Relatedness Metric for Wikipedia Concepts Based on Link Analysis and its Application to Word Sense Disambiguation." SYRCoDIS, 2008
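A hedged sketch of relatedness-based disambiguation, following the general idea of the Turdakov and Velikhov paper cited above (their metric differs in detail): for each ambiguous term, pick the candidate article with the highest total relatedness to the context articles. `relatedness` can be any pairwise score, e.g. the Dice measure sketched earlier; all names are illustrative.

```python
def disambiguate(candidates, context_articles, relatedness):
    """candidates: {term: [candidate Wikipedia articles]};
    returns {term: chosen article}."""
    chosen = {}
    for term, meanings in candidates.items():
        # Keep the meaning most related to the unambiguous context.
        chosen[term] = max(
            meanings,
            key=lambda m: sum(relatedness(m, c) for c in context_articles),
        )
    return chosen
```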
Our Method: Building the Semantic Graph
• Goal: build the document's semantic graph using semantic relatedness between terms
[Figure: semantic graph built from the news article "Apple to Make ITunes More Accessible For the Blind"]
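A sketch of graph construction, reusing `dice_relatedness` from the earlier example: connect two terms when their relatedness clears a threshold. The cutoff value below is an assumption; the slides do not state the exact threshold used.

```python
import itertools
import networkx as nx

def build_semantic_graph(articles, get_links, threshold=0.1):
    """Nodes are disambiguated terms; weighted edges are relatedness scores."""
    graph = nx.Graph()
    graph.add_nodes_from(articles)
    for a, b in itertools.combinations(articles, 2):
        w = dice_relatedness(a, b, get_links)
        if w >= threshold:  # assumed cutoff, not from the slides
            graph.add_edge(a, b, weight=w)
    return graph
```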
Our Method: Detecting Community Structure of the Semantic Graph
Our Method: Selecting Valuable Communities
• Goal: rank term communities so that:
  • the highest-ranked communities contain key terms
  • the lowest-ranked communities contain unimportant terms and possible disambiguation mistakes
• We use:
  • density of a community: the sum of its inner edges divided by the number of vertices in the community
  • informativeness: the sum of the keyphraseness measure (a Wikipedia-based TFxIDF analogue) over the community's terms
• Community rank: density × informativeness (a sketch follows)
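A sketch of the ranking formula. It assumes "sum of inner edges" means the sum of their relatedness weights, and uses a hypothetical `keyphraseness(term)` function for the Wikipedia-based TFxIDF analogue named on the slide.

```python
def community_rank(graph, community, keyphraseness):
    """Rank = density * informativeness for one community (a set of terms)."""
    inner = graph.subgraph(community)
    # Density: summed inner-edge weight per vertex (weight=1.0 if unweighted).
    density = sum(d.get("weight", 1.0)
                  for _, _, d in inner.edges(data=True)) / len(community)
    informativeness = sum(keyphraseness(t) for t in community)
    return density * informativeness
```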
Our Method: Selecting Valuable Communities
• In 73% of Web pages, a sharp decline in community scores separates key-term communities from unimportant ones
Advantages of the Method
• No training. Instead of training the system with hand-created examples, we use semantic information derived from Wikipedia
• Noise and multi-theme stability. Good at filtering out noise and discovering topics in Web pages
• Thematically grouped key terms. Significantly improves further inference of document topics using, for example, spreading activation over the Wikipedia category graph
• High accuracy. Evaluated using human judgments (later in this presentation)
Experimental Evaluation on a Noise-free Dataset
• Compared methods:
  • Classical: TFxIDF, Yahoo! Terms Extractor
  • Wikipedia-based: Wikify!, TextRank
• Evaluation on a noise-free dataset (blog posts) using human judgment
Experimental Evaluation on Web Pages
• Performance of our method on different kinds of Web pages
• Comparison to other methods
Experimental Evaluation on Web Pages
• Multi-theme stability evaluated on compound Web pages (popular news sites, portal homepages, etc.)
Thank You! Any Questions?
Email: upa@grinev.net, maxim@grinev.net, lizorkin@ispras.ru