DIVERSUM: Towards Diversified Summarization of Entities in Knowledge Graphs

Marcin Sydow, Mariusz Pikula, Ralf Schenkel

Large Knowledge Graphs
• Amount of information available in knowledge representations (e.g., RDF) increasing rapidly
What are the 10 most important facts about Tom Cruise?
DESWeb Workshop, Long Beach, CA

Frequency-based summarization
Prefer facts frequently mentioned in or extracted from reference corpus (IMDB, Wikipedia)
bad summary for yellow press
great summary for movie critics

Diversity-Aware Summarization
• Cover many different facets of the entity
• Allow for subsequent refinement
Topic of this talk: Given an entity, compute diversity-aware summary with limited size

Background: Diversification in Text IR
Goal: represent all facets of a topic in search results
Model: represent facets through info nuggets N
Relevance of document d for info need u (both modeled as sets of info nuggets): [formula on slide]
[Figure: results for the example query "jaguar" in descending score order, without and with diversification]
[from Clarke et al, SIGIR 2008]
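
The relevance formula itself was an image on this slide. Assuming the slide follows the nugget model of Clarke et al. (SIGIR 2008) directly, with nuggets n_1, …, n_m treated as independent, the probability that document d is relevant to information need u is commonly written as below. The symbol m (the number of nuggets) is notation taken from that model, not from the slide text.

```latex
P(R = 1 \mid u, d) \;=\; 1 - \prod_{i=1}^{m} \bigl( 1 - P(n_i \in u)\, P(n_i \in d) \bigr)
```

Read from the inside out: a nugget contributes only if the user wants it and the document contains it, and the document is relevant as soon as at least one nugget contributes.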

From Text to Knowledge Graphs
Weight of arc (based on relative frequency among similar facts, informativeness, extraction confidence, freshness, …)
Relative frequency of label n_i in knowledge graph (global prior)
Relative frequency of label n_i around query entity (local prior)
Mapping of concepts from text to graphs:
• "query": entity in the graph
• "document": arc in the graph
• "info nugget": label of an arc
• "result": connected subgraph of the knowledge graph with up to k arcs
Remaining problem: Estimation of [formula on slide] and [formula on slide]
Example arc: Tom_Cruise isMarriedTo Mimi_Rogers
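
The two label priors named on this slide can be illustrated with a small sketch. It assumes that "around the query entity" means arcs directly incident to the query entity; the slide does not fix the neighbourhood, so that choice, the function name `label_priors`, and all triples other than the isMarriedTo arc are illustrative assumptions.

```python
from collections import Counter


def label_priors(arcs, q):
    """Return (global_prior, local_prior) as dicts mapping label -> relative frequency.

    arcs: list of (source, label, target) triples.
    Assumption: the local prior is computed over arcs that touch q directly.
    """
    global_counts = Counter(label for _, label, _ in arcs)
    local_counts = Counter(label for s, label, t in arcs if q in (s, t))

    def normalize(counts):
        total = sum(counts.values())
        # An entity with no arcs gets an empty prior instead of a division by zero.
        return {label: c / total for label, c in counts.items()} if total else {}

    return normalize(global_counts), normalize(local_counts)


# Illustrative toy graph; only the first triple appears on the slide.
arcs = [
    ("Tom_Cruise", "isMarriedTo", "Mimi_Rogers"),
    ("Tom_Cruise", "actedIn", "Top_Gun"),
    ("Tom_Cruise", "actedIn", "Rain_Man"),
    ("Val_Kilmer", "actedIn", "Top_Gun"),
]
global_prior, local_prior = label_priors(arcs, "Tom_Cruise")
```

In this toy graph the label actedIn is globally more frequent than it is around Tom_Cruise, which is the kind of difference the two priors are meant to capture.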

Formalizing the Optimization Problem
Given a query entity q, find a connected subgraph S containing q, with |S| ≤ k, that maximizes [formula on slide], where V(a|q,n) is the value of arc a for nugget n and query node q (measured by its weight)
Terms of the objective:
• Probability that user wants to see nugget n for entity q
• Coverage of nugget n in subgraph S
[similar to Agrawal et al, WSDM 2009]
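
The objective was an image on this slide. A plausible reconstruction, assuming it mirrors the diversification objective of Agrawal et al. (WSDM 2009) with arcs in place of documents and nuggets in place of categories, is given below. The notation P(n | q) for the probability that the user wants to see nugget n is an assumption; only V(a|q,n), S, q and k appear in the slide text.

```latex
\max_{\substack{S \ni q,\ S \text{ connected} \\ |S| \le k}}
\;\sum_{n} \underbrace{P(n \mid q)}_{\text{user wants nugget } n}
\;\underbrace{\Bigl( 1 - \prod_{a \in S} \bigl( 1 - V(a \mid q, n) \bigr) \Bigr)}_{\text{coverage of } n \text{ in } S}
```

Under this form, adding a second arc for an already well-covered nugget yields little gain, which is what pushes the summary towards different labels.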

Towards an Algorithm
• Problem difficult to solve in general (NP-hard)
• Approximation algorithms with tight bounds exist (and even exact when 1 nugget/doc)
• General setup: Pick greedily next arc that gives highest improvement
• Additional simplifications:
  • Handle arcs in order of distance to query entity
  • Consider each label at most once per distance
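
The general setup ("pick greedily next arc that gives highest improvement") can be sketched as follows, without the two additional simplifications. This is a minimal illustration, not the authors' implementation: it assumes one nugget per arc (its label), takes V(a|q,n) to be the arc weight when the label matches and 0 otherwise, and uses a hypothetical `label_prob` table for the probability that the user wants each label.

```python
def objective(selected, label_prob):
    """Sum over labels of label_prob * coverage, with coverage = 1 - prod(1 - weight)."""
    miss = {}  # label -> probability that no selected arc satisfies it
    for _source, label, _target, weight in selected:
        miss[label] = miss.get(label, 1.0) * (1.0 - weight)
    return sum(label_prob[label] * (1.0 - m) for label, m in miss.items())


def greedy_summary(arcs, q, k, label_prob):
    """Grow a connected set of at most k arcs around q by best marginal gain."""
    selected, covered = [], {q}
    while len(selected) < k:
        # Only arcs touching the current subgraph keep the summary connected.
        frontier = [a for a in arcs
                    if a not in selected and (a[0] in covered or a[2] in covered)]
        if not frontier:
            break
        base = objective(selected, label_prob)
        best = max(frontier, key=lambda a: objective(selected + [a], label_prob) - base)
        selected.append(best)
        covered.update((best[0], best[2]))
    return selected


# Hypothetical toy input: (source, label, target, weight).
arcs = [
    ("q", "X", "A1", 0.8),
    ("q", "X", "A2", 0.6),
    ("q", "Y", "C1", 0.8),
    ("A1", "Q", "B", 0.9),
]
label_prob = {"X": 0.5, "Y": 0.3, "Q": 0.2}
summary = greedy_summary(arcs, "q", 3, label_prob)
```

In this toy run the second X arc is never chosen: once X is covered with weight 0.8, a new label adds more to the objective than a repeat of X.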

Greedy Algorithm (don't read it, we'll have an example.)

Greedy Algorithm by Example (k=3)
[Figure: example graph around q with nodes A1, A2, A3, C1, C2, B, D, E and weighted arc labels X:0.8, X:0.6, X:0.3, Y:0.8, Y:0.6, Y:0.5, Q:0.9, Q:0.6, T:0.6]
Step 1: select nodes reachable from S at distance 1 from q
Step 2: among those, select most frequent arc label
Step 3: among those, select arc with max weight and add it to S
Repeat as long as there are unhandled arc labels at distance 1
Step 4: select nodes reachable from S at distance 2 from q
Step 5: among those, select most frequent arc label
Step 6: among those, select arc with max weight and add it to S
Done.
Preliminary Result S: [subgraph shown in figure]
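
The steps of the example can be sketched in code. The sketch below is an illustration under stated assumptions, not the algorithm from the talk: the graph topology is hypothetical (the figure's arcs cannot be recovered from the slide text, only the node names and label weights), arc direction is ignored when measuring distance, label frequency is counted among the candidate arcs of the current distance, and ties between labels are broken alphabetically.

```python
from collections import Counter, deque


def distances_from(arcs, q):
    """Breadth-first hop distance from q, ignoring arc direction."""
    neighbors = {}
    for source, _label, target, _weight in arcs:
        neighbors.setdefault(source, []).append(target)
        neighbors.setdefault(target, []).append(source)
    dist = {q: 0}
    queue = deque([q])
    while queue:
        node = queue.popleft()
        for other in neighbors.get(node, []):
            if other not in dist:
                dist[other] = dist[node] + 1
                queue.append(other)
    return dist


def layered_greedy(arcs, q, k):
    """Per distance level: take each label once, most frequent first, max weight arc."""
    dist = distances_from(arcs, q)
    selected, covered = [], {q}
    for d in range(1, max(dist.values()) + 1):
        # Candidate arcs lead from an already covered node at distance d-1
        # to a node at distance d, so the summary stays connected.
        candidates = [
            a for a in arcs
            if (a[0] in covered and dist[a[0]] == d - 1 and dist[a[2]] == d)
            or (a[2] in covered and dist[a[2]] == d - 1 and dist[a[0]] == d)
        ]
        counts = Counter(a[1] for a in candidates)
        for label in sorted(counts, key=lambda l: (-counts[l], l)):
            if len(selected) == k:
                return selected
            best = max((a for a in candidates if a[1] == label), key=lambda a: a[3])
            selected.append(best)
            covered.update((best[0], best[2]))
    return selected


# Hypothetical topology using the node names and weights listed on the slide.
arcs = [
    ("q", "X", "A1", 0.8), ("q", "X", "A2", 0.6), ("q", "X", "A3", 0.3),
    ("q", "Y", "C1", 0.8), ("q", "Y", "C2", 0.6),
    ("A1", "Q", "B", 0.9), ("A1", "Q", "D", 0.6),
    ("C1", "T", "E", 0.6), ("A2", "Y", "E", 0.5),
]
summary = layered_greedy(arcs, "q", 3)
```

On this assumed graph the summary takes one X arc and one Y arc at distance 1, then one Q arc at distance 2, at which point the budget k=3 is used up. The arc from A2 to E is never a candidate because A2 is not part of the summary.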

Conclusion and Future Work
Main contribution: Diversity-aware summarization with limited size
Future work:
• Extensive user study to evaluate different variants
• Comparison with other summarization methods
• Integration into a GUI for knowledge exploration
• Relaxation of independence assumptions between facts (prefer anticorrelated facts)
