Presentation Transcript

  1. Linking and Summarizing Information on Entities. Presented by Min-Yen Kan, Web IR / NLP Group (WING), Department of Computer Science, National University of Singapore, Singapore. This talk is archived at http://wing.comp.nus.edu.sg/~kanmy/talks/080407-nihLMC.htm

  2. Singapore, the garden city: 4M+ people, sandwiched between Malaysia and Indonesia; 50 km from the equator: hot and humid year-long. Known for: urban planning, fondness for acronyms, and aversion to bubble gum litterers :-D WING @ NUS http://wing.comp.nus.edu.sg • 1 postdoc, 6 Ph.D. students, 5 undergraduates • Projects in natural language processing, digital libraries, and information retrieval. NIH Lister Hill Medical Center

  3. Entity Centric Information Management “Collate all studies on SBP2 that report new findings in the last year.” “Oh, I meant the PROTEIN SBP2, not the gene.” “What other proteins does SBP2 bind to?” “Tell me more about the contradiction from previous results.” “Which Miller did the study on SBP2 in 2002?”

  4. Entity Centric Information Management Two consequences to discuss today: • Linkage Joint work with Yee Fan TAN, Dongwon LEE (PSU) et al. • Summarization Joint work with Ziheng LIN et al.

  5. What’s Entity Linkage? Aggregating data on an object together from heterogeneous resources. Problem: Entity names are ambiguous! • Medical terms • Person names • Products • Customer records. These problems exist even when we have controlled vocabulary and lexicons (Specialist, UMLS, MeSH). Example: “By UV cross-linking and immunoprecipitation, we show that SBP2 specifically binds selenoprotein mRNAs both in vitro and in vivo.” (Protein) “The SBP2 clone used in this study generates a 3173 nt transcript (2541 nt of coding sequence plus a 632 nt 3’ UTR truncated at the polyadenylation site).” (Gene)

  6. Examples of Split Records • Dongwon Lee, 110 E. Foster Ave. #410, State College, PA, 16802 vs. LEE Dong, 110 East Foster Avenue Apartment 410, University Park, PA 16802-2343 • Honda Fit vs. Honda Jazz • Joint Conf. on Digital Libraries vs. JCDL • Apple iPod Nano 4GB vs. 4GB iPod nano • Entity Linkage vs. De-duplication (Ironic, isn’t it?)

  7. All over the web! Jeffrey D. Ullman (Stanford University)

  8. Record linkage, formally defined • Input • Two lists of records, A and B • Output • For each record a in A and each record b in B, do a and b refer to the same entity? • Note: • Entities do not come with unique identifiers • To disambiguate (deduplicate) items in a single list L, we set A = B = L

  9. Talk Outline • Linkage using the Web • Introduction >> Record linkage using internal knowledge • String matching • Classification or clustering • Graphical formalisms • Blocking • Record linkage using search engines • Update Summarization

  10. Fellegi-Sunter model [Figure: frequency of true matches (*) and true non-matches (○) plotted against Similarity(a, b). Regions along the similarity axis: designate as definite non-match; no-decision region (hold for human review); designate as definite match. The overlap of the two distributions produces false non-matches and false matches.]
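The three regions on this slide can be read as a two-threshold decision rule. The sketch below is illustrative only: the function name and the threshold values 0.3 and 0.8 are assumptions, not figures from the talk.

```python
def classify_pair(similarity, lower=0.3, upper=0.8):
    """Map a similarity score to one of the three regions on the slide.

    `lower` and `upper` are hypothetical thresholds chosen for illustration.
    """
    if similarity >= upper:
        return "match"        # designate as definite match
    if similarity <= lower:
        return "non-match"    # designate as definite non-match
    return "review"           # no-decision region: hold for human review


for score in (0.95, 0.5, 0.1):
    print(score, classify_pair(score))
```

Moving the two thresholds apart sends more pairs to human review and should reduce false matches and false non-matches; moving them together does the opposite.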

  11. String matching • String similarity • Strings as ordered sequences: ([a], [b], [c]) ≠ ([c], [b], [a]) • Edit distance • Jaro and Jaro-Winkler • Strings as unordered sets: {[a], [b], [c]} = {[c], [b], [a]} • Jaccard similarity • Cosine similarity • Abbreviation matching • Pattern detection: e.g. “National Institutes of Health (NIH)”
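To make the ordered versus unordered distinction concrete, here is a minimal sketch of one measure from each family: Levenshtein edit distance over characters and Jaccard similarity over lowercase word tokens. The tokenization (whitespace split, lowercasing) is an assumption made for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance: strings treated as ordered sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def jaccard(a, b):
    """Jaccard similarity: strings treated as unordered token sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# Word order matters to edit distance but not to Jaccard.
print(edit_distance("a b c", "c b a"))
print(jaccard("a b c", "c b a"))
print(jaccard("Apple iPod Nano 4GB", "4GB iPod nano"))
```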

  12. Machine Learning • Create features • String similarity, relationships (e.g. collaborators) • Then learn a model • Naïve Bayes, Support Vector Machine, K-means, Agglomerative Clustering, … Same Person? • Yoojin Hong, Byung-Won On and Dongwon Lee. System Support for Name Authority Control Problem in Digital Libraries: OpenDBLP Approach. ECDL 2004. • Sudha Ram, Jinsoo Park and Dongwon Lee. Digital Libraries for the Next Millennium: Challenges and Research Directions. Information Systems Frontiers 1999.

  13. Graphical Methods: Social network analysis • Nodes: entities • Edges: relationships • Analysis • Connected components • Distance between nodes • Node/edge centrality • Cliques • Bipartite subgraphs • … [Figure: example network with nodes J. C. Latombe, T.-H. Chiang, D. Hsu, A. Dhanik, Y. Wang, L. Qiu, M.-Y. Kan, Y. F. Tan, H. Cui, T.-S. Chua]

  14. Talk Outline • Linkage using the Web • Introduction • Record linkage using internal knowledge >> Record linkage using search engines • Search Engine Features • Adaptive Queries • Query Probing • Update Summarization

  15. Record linkage using search engines Previously… • We assumed input data records contain sufficient information to perform linkage What if… • There is insufficient or only noisy information? • e.g., linking short forms to long forms Ask other people! • I.e., consult external (vs. internal) sources of knowledge • Use web as collective knowledge base

  16. Anatomy of Search Engine Results • Number of results • Ranked list • Title • Snippet • URL • Web page. Programmatically accessible through APIs

  17. Derivable Features • Counts: co-occurrence measure between count(q1), count(q2) and count(q1 and q2) • Hyperlinkage: count of web pages of q1 that point to pages of q2, and vice versa; incorporate additional indirect links with less weight (e.g., q1 → p → q2) • Snippets or web pages: (cosine) similarity using tokens; counts of specific terms, e.g. number of snippets for q1 containing the string q2; further natural language processing
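The slide does not say which co-occurrence measure is used, so the sketch below shows two common choices, Jaccard and Dice coefficients computed from hit counts. The function names and the counts in the example are hypothetical.

```python
def web_jaccard(count_q1, count_q2, count_both):
    """Jaccard coefficient from search-engine hit counts.

    count_both is the hit count for the conjunctive query "q1 and q2".
    """
    union = count_q1 + count_q2 - count_both
    return count_both / union if union > 0 else 0.0


def web_dice(count_q1, count_q2, count_both):
    """Dice coefficient from the same three counts."""
    total = count_q1 + count_q2
    return 2 * count_both / total if total > 0 else 0.0


# Made-up counts, for illustration only.
print(web_jaccard(100, 50, 25))
print(web_dice(100, 50, 25))
```

Reported result counts are estimates, so scores like these are better treated as one feature among several than as a decision on their own.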

  18. Web page features • Named entities (NE) • We consider people, organizations, locations • Each NE token is a feature • NE-targeted (NE-T) • Motivation: middle names and titles • For NEs having a token of target name • Extract tokens that are not in target name as features. Example: “Born Edward Charles Morrice Fox in Chelsea, London…” NE features: Charles, Chelsea, Morrice, Edward, Fox, London, … NE-T features: Charles, Morrice, …
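The NE-T rule can be sketched in a few lines. This assumes named entities have already been extracted by some tagger, and that the target name in the slide's example is "Edward Fox" (the slide does not state the target explicitly; this is inferred from the listed features).

```python
def ne_targeted_features(named_entities, target_name):
    """For each NE sharing a token with the target name, keep the other tokens."""
    target = {t.lower() for t in target_name.split()}
    features = []
    for entity in named_entities:
        tokens = entity.split()
        if any(t.lower() in target for t in tokens):
            features.extend(t for t in tokens if t.lower() not in target)
    return features


# Entities as they might be extracted from the slide's example sentence.
entities = ["Edward Charles Morrice Fox", "Chelsea", "London"]
print(ne_targeted_features(entities, "Edward Fox"))
```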

  19. Using URLs Where web pages are located is also useful • Hypothesis: If web pages of q1 and web pages of q2 overlap a lot, q1 and q2 are the same entity • Measure this using URL / Host information • Caveat: Not all hosts are equally telling • citeseer vs. harvard.edu for author names • pubmed vs. diabetes-info.com for diabetic terms • Solution: Weight by Inverse Host Frequency
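One way to realize inverse host frequency is by analogy with IDF: a host that appears in the results of many queries gets a low weight. The formula log(N / df) and the weighted-overlap score below are assumptions for illustration; the slide names the idea but not its exact definition. All URLs are made up.

```python
import math
from urllib.parse import urlparse


def host(url):
    return urlparse(url).hostname or ""


def inverse_host_frequency(results_by_query):
    """IHF(h) = log(N / number of queries whose results include host h)."""
    n = len(results_by_query)
    df = {}
    for urls in results_by_query.values():
        for h in {host(u) for u in urls}:
            df[h] = df.get(h, 0) + 1
    return {h: math.log(n / c) for h, c in df.items()}


def weighted_host_overlap(urls1, urls2, ihf):
    """IHF-weighted Jaccard overlap between the host sets of two result lists."""
    h1 = {host(u) for u in urls1}
    h2 = {host(u) for u in urls2}
    shared = sum(ihf.get(h, 0.0) for h in h1 & h2)
    total = sum(ihf.get(h, 0.0) for h in h1 | h2)
    return shared / total if total > 0 else 0.0


results = {
    "q1": ["http://citeseer.example/a", "http://cs.example.edu/~x"],
    "q2": ["http://citeseer.example/b", "http://cs.example.edu/~x/pubs"],
    "q3": ["http://citeseer.example/c", "http://other.example.org/"],
}
ihf = inverse_host_frequency(results)
print(weighted_host_overlap(results["q1"], results["q2"], ihf))
print(weighted_host_overlap(results["q1"], results["q3"], ihf))
```

In this toy data the aggregator-like host appears for every query, so it contributes nothing, and only the shared departmental host counts as evidence.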

  20. URL Features (cont.) • Page URLs Hypothesis: URL itself tells quite a lot • http://www.cs.ualberta.ca/~lindek/ • Home page of “lindek” • CS department, University of Alberta, Canada • MeURLin (Kan and Nguyen Thi, 2005) • Tokens (http, www, cs, ualberta, ca, lindek) • URI parts (scheme:http, hostname:cs, user:lindek, …) • N-grams (ca ualberta, ualberta cs, cs www, www lindek) • Length of tokens • …
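A rough sketch of how features like these could be pulled from a URL with the standard library. This is not the MeURLin implementation; in particular, building bigrams over the reversed host tokens followed by the path tokens is just one reading of the slide's n-gram example.

```python
import re
from urllib.parse import urlparse


def url_features(url):
    parts = urlparse(url)
    tokens = re.findall(r"[A-Za-z0-9]+", url)
    host_tokens = [t for t in (parts.hostname or "").split(".") if t]
    path_tokens = re.findall(r"[A-Za-z0-9]+", parts.path)
    # Assumed ordering: most general host part first, then path tokens.
    ordered = host_tokens[::-1] + path_tokens
    bigrams = [" ".join(pair) for pair in zip(ordered, ordered[1:])]
    return {
        "tokens": tokens,
        "scheme": parts.scheme,
        "bigrams": bigrams,
        "lengths": [len(t) for t in tokens],
    }


features = url_features("http://www.cs.ualberta.ca/~lindek/")
print(features["tokens"])
print(features["bigrams"])
```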

  21. Web search engine linkage Test whether q1 and q2 should be linked • Hypothesis: Web pages of q1 and web pages of q2 share some representative data I • Similar to disconnected triples: “Jeffrey D. Ullman” = 384K pgs; “Jeffrey D. Ullman” + “aho” = 174K pgs; “J. Ullman” = 124K pgs; “J. Ullman” + “aho” = 41K pgs; “Shimon Ullman” = 27.3K pgs; “Shimon Ullman” + “aho” = 66 pgs

  22. Evaluation - Full web pages in WEPS • Goal • To compare the usefulness of various features for the Web People Search Task • Architecture: Input web pages → Feature vectors → Cosine similarity + Single link hierarchical agglomerative clustering + Minimum similarity threshold → Clusters
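The architecture on this slide can be sketched compactly. Single-link agglomerative clustering stopped at a minimum similarity threshold yields the connected components of the graph whose edges join pairs at or above the threshold, so the sketch uses union-find instead of an explicit merge loop. The bag-of-words vectors are toy data; the default threshold 0.2 is the value quoted on the next slide.

```python
import math
from collections import Counter


def cosine(u, v):
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def single_link_clusters(vectors, threshold=0.2):
    """Single-link clustering with a minimum similarity threshold."""
    n = len(vectors)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


pages = [
    Counter("edward fox actor london".split()),
    Counter("fox actor film london".split()),
    Counter("database systems stanford".split()),
    Counter("stanford database theory".split()),
]
print(single_link_clusters(pages))
```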

  23. Evaluation • F(α = 0.5) and similarity threshold 0.2

  24. Evaluation - Author Disambiguation • Dataset • Manually-disambiguated dataset of 24 ambiguous names in computer science domain • Each ambiguous name represented 2 unique authors (k = 2) except for one where it represented 3 • Each name is attributed to 30 citations on average • Proportion of largest class ranges from 50% to 97% • Search engine • Google (http://www.google.com/)

  25. Evaluation Classification accuracy averaged over all names • Single link performs best • Good for clustering citations from different publication pages together (some pages list only selected publications) • Some authors have disparate research areas, not well represented by a centroid vector • Resolving hostnames to IP addresses gives best accuracy

  26. Discussion [Figures: per-name accuracies using single link; per-name average number of URLs returned per citation]

  27. Discussion • Apparent correlation between accuracy and average number of URLs returned per citation • Author names with few URLs tend to fare poorly since results are mainly aggregator web sites What’s the cost? • Lots of queries needed • Web page downloads are expensive • Hence, slow Can we speed this up? Sure thing…

  28. Query probing • Consider some publication venues: • Joint Conference on Digital Libraries • European Conference on Digital Libraries • Digital Libraries • Query probing • Use common n-gram “digital libraries” as query probe • If we can obtain information on all three conferences, we save two queries
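Finding a candidate probe can be as simple as intersecting the word n-grams of the names. This sketch assumes lowercase whitespace tokenization and a fixed n; how the talk's system actually selects probes is not described on the slide.

```python
def common_ngrams(names, n=2):
    """Word n-grams shared by every name; each is a candidate query probe."""
    def ngrams(name):
        tokens = name.lower().split()
        return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    sets = [ngrams(name) for name in names]
    return set.intersection(*sets) if sets else set()


venues = [
    "Joint Conference on Digital Libraries",
    "European Conference on Digital Libraries",
    "Digital Libraries",
]
print(common_ngrams(venues))
```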

  29. Adaptive querying Combine two methods when needed • Methods • Ms: stronger method but very slow (e.g. web page similarity) • Mw: weaker method but fast (e.g. host overlap) • Aim • Accuracy close to Ms • Significantly reduced running time compared with Ms • Algorithm • Execute Mw • If heuristic suggests that Mw results are likely incorrect, execute Ms
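The algorithm on this slide is a simple fallback pattern. In the sketch below, the heuristic "the weak score is close to the decision boundary" is a stand-in chosen for illustration; the slide does not specify which heuristic is used, and all names and numbers are hypothetical.

```python
def adaptive_link(pair, weak_method, strong_method, looks_unreliable):
    """Run the fast method Mw; fall back to the slow method Ms only when needed."""
    weak_score = weak_method(pair)
    if looks_unreliable(weak_score):
        return strong_method(pair), "Ms"
    return weak_score, "Mw"


def near_boundary(score, boundary=0.5, margin=0.2):
    """Hypothetical heuristic: scores near the decision boundary are doubtful."""
    return abs(score - boundary) < margin


# Stand-in scoring functions; a real Ms would download and compare web pages.
print(adaptive_link(("q1", "q2"), lambda p: 0.9, lambda p: 0.1, near_boundary))
print(adaptive_link(("q1", "q2"), lambda p: 0.55, lambda p: 0.1, near_boundary))
```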

  30. Entity Linkage - Conclusion • Important problem with a rich history • New external methods poll contextual evidence for judgment • Need to combine methods to obtain best aspect of each

  31. Talk Outline • Linkage using the Web >> Graph-based Update Summarization • Introduction • Timestamped Graphs • Evaluation and Conclusions “Now that all this data is linked, how do we process it?”

  32. Applications of Summarization Doing Less Work Decision Support

  33. More seriously: an exciting challenge ... ...put a book on the scanner, turn the dial to ‘2 pages’, and read the result... ...download 1000 documents from the web, send them to the summarizer, and select the best ones by reading the summaries of the clusters... ...forward the Japanese email to the summarizer, select ‘1 par’, and skim the translated summary. …get a weekly digest of new treatments and therapies for pressure ulcers (an update task)

  34. Simplifying summarization: Extractive Summarization Select important sentences verbatim from the input text to form a summary • Input: A text document with k sentences • Output: Top n (n << k) sentences with the highest numeric scores (each sentence in the input document is assigned a numeric score)
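The input/output contract above fits in a few lines once a scoring function is given. The scorer used in the example (token count) is a placeholder, and returning the selected sentences in document order is a design choice of this sketch rather than something the slide specifies.

```python
def extractive_summary(sentences, score, n):
    """Return the n highest-scoring sentences, verbatim, in document order."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]),
                    reverse=True)[:n]
    return [sentences[i] for i in sorted(ranked)]


document = ["a b", "a b c d", "a", "a b c"]
print(extractive_summary(document, score=lambda s: len(s.split()), n=2))
```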

  35. Summarization Heuristics for extractive summarization • Cue/stigma phrases • Sentence position (relative to document, section, paragraph) • Sentence length • TF×IDF, TF scores • Similarity (with title, context, query) Machine learning to tune weights by supervised learning Recently, graphical representations of text have shed new light on the summarization problem

  36. Revisiting Social Networks: Prestige One motivation was to model the problem as finding prestige of nodes in a social network • PageRank: random walk • In summarization, this led to TextRank and LexRank • Did we leave anything out of our representation for summarization? Yes, the notion of an evolving network

  37. Social networks change! Natural evolving networks (Dorogovtsev and Mendes, 2001) • Citation networks: New papers can cite old ones, but the old network is static • The Web: new pages are added with an old page connecting it to the web graph, old pages may update links

  38. Talk Outline • Linkage using the Web • Graph-based Update Summarization • Introduction >> Timestamped Graphs • Evaluation and Conclusion

  39. Evolutionary models for summarization Writers and readers often follow conventional rhetorical styles - articles are not written or read in an arbitrary way Consider the evolution of texts using a very simplistic model • Writers write from the first sentence onwards in a text • Readers read from the first sentence onwards of a text A simple model: sentences get added incrementally to the graph

  40. Timestamped Graph Construction These assumptions suggest that we iteratively add sentences into the graph in chronological order. At each iteration, consider which edges to add to the graph. • For single document: simple and straightforward: add 1st sentence, followed by the 2nd, and so forth, until the last sentence is added • For multi-document: treat it as multiple instances of single documents, which evolve in parallel; i.e., add 1st sentences of all documents, followed by all 2nd sentences, and so forth • NB: Doesn’t really model chronological ordering between articles, fix later
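A toy sketch of the construction loop may help. It makes several assumptions beyond the slide: at each timestep every node already in the graph adds a fixed number of directed edges, each to the most similar node it does not yet point to; edges are unweighted; and similarity is supplied by the caller. It illustrates the incremental idea and should not be read as the talk's exact algorithm.

```python
def word_jaccard(a, b):
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def build_tsg(docs, sim, edges_per_step=1):
    """docs: list of documents, each a list of sentence strings.

    Nodes are (doc_index, sentence_index) pairs; edges are directed pairs of nodes.
    """
    nodes, edges = [], set()
    steps = max((len(d) for d in docs), default=0)
    for t in range(steps):
        # Add the t-th sentence of every document that has one.
        for di, doc in enumerate(docs):
            if t < len(doc):
                nodes.append((di, t))
        # Each node picks its most similar not-yet-linked node(s).
        for u in nodes:
            candidates = [v for v in nodes if v != u and (u, v) not in edges]
            candidates.sort(key=lambda v: sim(docs[u[0]][u[1]], docs[v[0]][v[1]]),
                            reverse=True)
            for v in candidates[:edges_per_step]:
                edges.add((u, v))
    return nodes, edges


docs = [["a b", "a c"], ["a b d"]]
nodes, edges = build_tsg(docs, word_jaccard)
print(nodes)
print(sorted(edges))
```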

  41. Timestamped Graph Construction Model: • Documents as columns • di = document i • Sentences as rows • sj = jth sentence of document

  42. Timestamped Graph Construction • A multi document example [Figure: grid with columns doc1, doc2, doc3 and rows sent1, sent2, sent3]

  43. An example TSG: DUC 2007 D0703A-A

  44. Timestamped Graph Construction This is just one instance of a TSG. Let’s generalize and formalize them. Def: A timestamped graph algorithm tsg(M) is a 9-tuple (d, e, u, f, σ, t, i, s, τ) that specifies a resulting algorithm that takes as input the set of texts M and outputs a graph G. [Figure labels: (d, e, u, f) = properties of edges; (σ, t, i, s) = properties of nodes; τ = input text transformation function]

  45. Edge properties (d, e, u, f) • Edge Direction (d) • Forward, backward, or undirected • Edge Number (e) • number of edges to instantiate per timestep • Edge Weight (u) • weighted or unweighted edges • Inter-document factor (f) • penalty factor for links between documents in multi-document sets.

  46. Node properties (σ, t, i, s) • Vertex selection function σ(u, G) • One strategy: among those nodes not yet connected to u in G, choose the one with highest similarity according to u • Similarity functions: Jaccard, cosine, concept links (Ye et al., 2005) • Text unit type (t) • Most extractive algorithms use sentences as elementary units • Node increment factor (i) • How many nodes get added at each timestep • Skew degree (s) • Models how nodes in multi-document graphs are added • Skew degree = how many iterations to wait before adding the 1st sentence of the next document • Skip for today…

  47. Timestamped Graph Construction • Representations • We can model a number of different algorithms using this 9-tuple formalism: • (d, e, u, f, σ, t, i, s, τ) • The given toy example: • (f, 1, 0, 1, max-cosine-based, sentence, 1, 0, null) • LexRank graphs: • (u, N, 1, 1, cosine-based, sentence, Lmax, 0, null) • N = total number of sentences in the cluster; Lmax = the max document length • i.e., all sentences are added into the graph in one timestep, each connected to all others, and cosine scores are given to edge weights
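The 9-tuple maps naturally onto a small configuration record. The field names and the string encodings of each value below are illustrative choices; only the tuple positions and the two example settings come from the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TSGConfig:
    direction: str             # d: "forward", "backward", or "undirected"
    edges_per_step: int        # e: edges instantiated per timestep
    weighted: bool             # u: weighted or unweighted edges
    inter_doc_factor: float    # f: penalty factor for cross-document links
    vertex_selection: str      # sigma: vertex selection function
    text_unit: str             # t: text unit type
    node_increment: int        # i: nodes added per timestep
    skew_degree: int           # s: delay before the next document starts
    transform: Optional[str]   # tau: input text transformation (None = null)


# The toy example: (f, 1, 0, 1, max-cosine-based, sentence, 1, 0, null)
TOY = TSGConfig("forward", 1, False, 1.0, "max-cosine", "sentence", 1, 0, None)


def lexrank_config(docs):
    """(u, N, 1, 1, cosine-based, sentence, Lmax, 0, null) for a document cluster."""
    total_sentences = sum(len(d) for d in docs)        # N
    max_doc_length = max((len(d) for d in docs), default=0)  # Lmax
    return TSGConfig("undirected", total_sentences, True, 1.0,
                     "cosine", "sentence", max_doc_length, 0, None)


print(TOY)
print(lexrank_config([["s1", "s2"], ["s1", "s2", "s3"]]))
```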

  48. System Overview • Sentence splitting • Detect and mark sentence boundaries • Annotate each sentence with the doc ID and the sentence number • E.g., XIE19980304.0061: 4 March 1998 from Xinhua News; XIE19980304.0061-14: the 14th sentence of this document • Graph construction • Construct TSG in this phase

  49. System Overview • Sentence Ranking • Apply topic-sensitive random walk on the graph to redistribute the weights of the nodes • Sentence extraction • Extract the top-ranked sentences • Two different modified MMR re-rankers are used, depending on whether it is main or update task

  50. Talk Outline • Linkage using the Web • Graph-based Update Summarization • Introduction • Timestamped Graphs >> Evaluation and Conclusion