SlideSeer: A DL of aligned document and presentation pairs

Min-Yen Kan
WING (Web IR / NLP Group), National University of Singapore

Scholarly Digital Libraries: what do we use them for?
• Find articles to print, read offline
• Browse, select research work
• Assess authors, publication venues, research groups
Papers (documents) don’t store all of the information about a discovery:
• Datasets
• Tools
• Implementation details / conditions
They also don’t help a person learn the research:
• Textbooks
• Slide presentations (we’ll focus on this)
Web IR / NLP Group @ NUS

Qualities of slide presentations
Good slide sets complement a document. They often:
• focus and highlight findings in the document
• create a bridge into the document itself
• are a visual and oral summary of a document
How can we leverage slides in a digital library?
What about poor slides?
“PowerPoint is presenter-oriented, not content-oriented or audience-oriented…”
The remedy? “Visual reasoning usually works more effectively when the relevant evidence is shown adjacent in space within the eyespan.” (Tufte, 2006)
[Slide image: “Four score and seven years ago”]

Documents and presentations as duals
Present identical or highly overlapping materials
• Document: for archival and reference purposes
• Presentation: for introducing and summarizing the work
As the two can be seen as duals, we should allow them to be viewed together.
• Would like random access of the presentation and document pair
Answer: find pairs of documents and presentations.

A model: MIT’s Open CourseWare
• Audio of lecture
• Slides in context
• Simplified transcript of lecture
A better answer: add fine-grained alignment.

Talk Outline
• Motivation
• Architecture
• 1. Resource Discovery
• 2. Alignment
• 3. User Interface
• Demo
• Status and Conclusions
[Architecture diagram, split into offline and online parts. Components: Search Engine, Resource discovery, Converters (cz-ppt2txt, cz-ppt2gif, pdftohtml, convert), Aligner, DataStore, WebServer, Javascript-enabled browser. Other labels: sv, dv, pv, ssv, search. Numbered stages: 1. Resource Discovery, 2. Alignment, 3. User Interface.]

1. Resource Discovery
Algorithm:
• Obtain suitable document metadata
• Web search to find candidate presentations
• Post-process to usable form

1. Resource Discovery – Obtaining Metadata
Start with CiteSeer (thanks to IST: CL Giles, I Councill)
• 750K records with parsed header metadata
• Complete with .pdf documents
Enhancement: Merge DBLP snapshot (Aug 2006; 1.2M docs) with CiteSeer
• Large-scale record linkage task; O(nm) complexity unacceptable
• Indexed DBLP into Lucene, then used each CiteSeer (CS) record to retrieve DBLP variants, resulting in O(n) complexity
• Result size: 1.5M
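
The indexing idea on this slide can be sketched without Lucene. The snippet below is a minimal illustration in plain Python, not the SlideSeer implementation: a small inverted index stands in for the Lucene index, and the function names, the `min_shared` cut-off and the sample titles are all assumptions made for the example. Each query record touches only the postings of its own tokens, which is what avoids comparing every CiteSeer record against every DBLP record.

```python
import re
from collections import defaultdict


def tokens(title):
    """Lowercase a title and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def build_index(records):
    """Map each token to the ids of the records whose title contains it."""
    index = defaultdict(set)
    for rec_id, title in records.items():
        for tok in tokens(title):
            index[tok].add(rec_id)
    return index


def candidates(query_title, index, min_shared=2):
    """Return ids sharing at least min_shared tokens with the query, best first.

    min_shared is a hypothetical cut-off chosen for this sketch.
    """
    counts = defaultdict(int)
    for tok in tokens(query_title):
        for rec_id in index.get(tok, ()):
            counts[rec_id] += 1
    hits = [(n, rec_id) for rec_id, n in counts.items() if n >= min_shared]
    return [rec_id for n, rec_id in sorted(hits, key=lambda h: (-h[0], h[1]))]


# Made-up sample records standing in for DBLP entries.
dblp = {
    "d1": "SlideSeer: A digital library of aligned document and presentation pairs",
    "d2": "Fast record linkage for bibliographic databases",
}
dblp_index = build_index(dblp)
matches = candidates("Slideseer: a digital library of aligned document pairs", dblp_index)
```

The retrieved candidates would still need a similarity check before two records are merged.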

1. Resource Discovery – Finding presentations
Google API on title, author to find corresponding presentation
• Use simple Jaccard similarity threshold to decide matches
• threshold λ3 for title+author similarity
[Diagram: CiteSeer + DBLP merge using the DBLP Lucene index; Web search (filetype:ppt) for presentations; steps labeled with thresholds λ1, λ2, λ3]
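
A Jaccard threshold test of the kind named on this slide might look like the following sketch. The tokenization, the record layout (`title`, `authors`) and the threshold value 0.6 are assumptions for illustration; the slide does not give the actual value of λ3.

```python
import re


def token_set(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two strings: |A ∩ B| / |A ∪ B| over token sets."""
    sa, sb = token_set(a), token_set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def is_match(record, candidate, threshold=0.6):
    """Accept a candidate presentation when title+author similarity clears the threshold.

    threshold is a hypothetical stand-in for the slide's λ3.
    """
    rec = record["title"] + " " + " ".join(record["authors"])
    cand = candidate["title"] + " " + " ".join(candidate["authors"])
    return jaccard(rec, cand) >= threshold
```

For example, `jaccard("a b c", "b c d")` shares two of four distinct tokens and evaluates to 0.5.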

1. Resource Discovery – Conversion
• Via pdftohtml
  • text
  • formatted text
• Via czppt2gif/convert
  • png
  • text
Final results: ~85% precision, recall difficult to calculate (~80%)
11K pairs after processing 200K of 1.5M records
Many caveats:
• only .pdf and .ppt formats currently handled
• conversion fails often, pdf conversion difficult
• current work: use OCR to redo text extraction

2. Alignment – Problem formulation
Q: What are we aligning?
A: Text of slides to document text
• Use paragraphs to delimit text units in documents
• Use document headers to delimit sections
Q: What type of alignment is necessary?
A: Depends. Presentation or document centered view?
• Presentation: 1 slide aligned to 0 or more paragraphs
• Document: 1 section aligned to 0 or more slides
Q: What’s the approach?
A: Two stages:
• Basic similarity measure to calculate a similarity matrix
• Alignment schemes to establish alignment mapping
(Callout: concentrate on this)
[Figure: similarity matrix of text units 1…p against slides 1…s]

2. Alignment – Related Work
• Narration to presentation alignment
  • Usually naturally synchronous: monotonic alignment
• Multilingual text alignment
  • Used in Machine Translation (MT)
  • Polynomial complexity (~O(n^3)) but heuristics tend to work well
• Slide/abstract to document alignment
  • Use Hidden Markov Model (HMM) for alignment
  • Doesn’t handle missing materials well
Desiderata:
• Should take context into account
• But shouldn’t enforce monotonicity
• Nil (zero) alignments needed, when materials don’t overlap

2. Alignment – Similarity Measures
Take text units, cut into tokens. Then calculate similarity using:
• Cosine
  • Standard IR metric
  • TF×IDF for token weight
  • Calculate slide, paragraph vector similarity using cosine
• Jaccard
  • unigram tokens
  • bigram
  • unigram + bigram
  • Use IDF weighting for tokens
For both schemes, use IDF weighting from WebBase corpus
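
The two measures on this slide can be sketched as follows. This is an illustrative version only: the `idf` dictionary stands in for weights derived from the WebBase corpus, the default weight of 1.0 for unseen tokens is an assumption, and the function names are made up for the example.

```python
import math
from collections import Counter


def cosine_tfidf(slide_tokens, para_tokens, idf):
    """Cosine similarity of two token lists under TF×IDF weighting."""
    s = {t: c * idf.get(t, 1.0) for t, c in Counter(slide_tokens).items()}
    p = {t: c * idf.get(t, 1.0) for t, c in Counter(para_tokens).items()}
    dot = sum(w * p.get(t, 0.0) for t, w in s.items())
    norm = math.sqrt(sum(w * w for w in s.values())) * math.sqrt(
        sum(w * w for w in p.values())
    )
    return dot / norm if norm else 0.0


def weighted_jaccard(slide_tokens, para_tokens, idf):
    """Jaccard similarity where each token counts by its IDF weight."""
    a, b = set(slide_tokens), set(para_tokens)
    inter = sum(idf.get(t, 1.0) for t in a & b)
    union = sum(idf.get(t, 1.0) for t in a | b)
    return inter / union if union else 0.0


def bigrams(token_list):
    """Adjacent token pairs, joined so they can be used like unigram tokens."""
    return [f"{x}_{y}" for x, y in zip(token_list, token_list[1:])]
```

The unigram + bigram variant can be obtained by passing `tokens + bigrams(tokens)` to `weighted_jaccard`, provided the IDF table also covers bigrams.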

2. Alignment – Schemes
Using matrix of <p,s> similarity, align using:
1. Max Similarity
  • Baseline
  • Can’t do nil alignment
2. Edit Distance
  • Efficient dynamic programming
  • But outputs only monotonic alignments
3. Local Jump Model
  • Variation on #2 to allow local backward jumps
  • Backward jumps within 5% of text units
  • Still doesn’t handle reordered sections
4. Hidden Markov Model
  • Word-based
  • Attempts to find origin of s in p
  • Only handles overlapping information
[Figures: paragraphs p1…p6 with the ordering p1>p2>p3>p4>p5>p6; slides si-5…si+5 with words wj-5…wj+5 around word wj]
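
To make the contrast between schemes 1 and 2 concrete, here is a small sketch over a similarity matrix `sim[i][j]` (slide i against paragraph j). The first function is the max-similarity baseline. The second is a simple monotonic dynamic program that maximizes total similarity subject to non-decreasing paragraph indices; it is a stand-in chosen for illustration, not the edit-distance formulation used in the talk.

```python
def max_similarity(sim):
    """Baseline: align every slide to its single most similar paragraph."""
    return [max(range(len(row)), key=row.__getitem__) for row in sim]


def monotonic_alignment(sim):
    """Pick one paragraph per slide, never moving backwards, maximizing total similarity."""
    n_slides, n_paras = len(sim), len(sim[0])
    best = [[0.0] * n_paras for _ in range(n_slides)]
    back = [[0] * n_paras for _ in range(n_slides)]
    best[0] = list(sim[0])
    for i in range(1, n_slides):
        # run_best tracks the best score of the previous slide over paragraphs 0..j
        run_best, run_arg = best[i - 1][0], 0
        for j in range(n_paras):
            if best[i - 1][j] > run_best:
                run_best, run_arg = best[i - 1][j], j
            best[i][j] = run_best + sim[i][j]
            back[i][j] = run_arg
    # Trace back from the best final paragraph.
    j = max(range(n_paras), key=best[-1].__getitem__)
    path = [j]
    for i in range(n_slides - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    return path[::-1]


sim = [
    [0.9, 0.1, 0.0],
    [0.1, 0.2, 0.8],
    [0.1, 0.7, 0.3],
]
baseline = max_similarity(sim)       # slide 2 jumps back to paragraph 1
monotone = monotonic_alignment(sim)  # slide 2 is forced to stay at paragraph 2
```

On this toy matrix the baseline lets the last slide jump backwards, while the monotonic version cannot, which is the limitation the Local Jump Model relaxes.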

2. Alignment – Span Extension
As Maximum Similarity does quite well, let’s extend the algorithm
Idea: post-process to extend from points to spans
• Retrieve top n (n=10) most similar paragraphs
• Try all (n choose 2) possible spans for alignment
alignment_score(x,y) = span_sim × ln(span_length)
• Slightly favor longer spans
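
A rough sketch of the span extension step follows. It rests on several assumptions that the slide does not spell out: span endpoints are chosen as pairs of the top-n paragraphs, span length is counted in paragraphs, and the similarity of a slide to a whole span is supplied by the caller as `span_sim(x, y)`. All names are hypothetical.

```python
import math
from itertools import combinations


def extend_to_span(sims, span_sim, n=10):
    """Return (score, x, y) for the best span, or None if fewer than two paragraphs.

    sims[j] is the similarity of one slide to paragraph j.
    span_sim(x, y) must return the slide's similarity to paragraphs x..y inclusive.
    """
    top = sorted(sorted(range(len(sims)), key=lambda j: -sims[j])[:n])
    best = None
    for x, y in combinations(top, 2):
        length = y - x + 1  # assumed unit: number of paragraphs in the span
        score = span_sim(x, y) * math.log(length)
        if best is None or score > best[0]:
            best = (score, x, y)
    return best


# Toy usage: span similarity approximated by the mean paragraph similarity.
sims = [0.1, 0.8, 0.7, 0.05]
mean_sim = lambda x, y: sum(sims[x:y + 1]) / (y - x + 1)
best_span = extend_to_span(sims, mean_sim, n=2)
```

Because every candidate span covers at least two paragraphs, the logarithm is always positive, so the length factor gently rewards longer spans without zeroing any score.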

2. Alignment – Alignment Correction
Neighboring alignments can help to correct a spurious one
• (a) monotonic alignment → ok
• (b) si jumps back from si-1, but then proceeds monotonically → probably ok, minor penalty
• (c) si jumps back, but si+1 jumps forward again → looks more like an error, major penalty applied
Final alignment score: alignment_score × (1 − penalty)
[Figure: three panels (a), (b), (c), each plotting slides si-1, si, si+1 against paragraphs p1…pn]
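
The three cases can be expressed as a small penalty function. This is one possible reading of the slide: the penalty values 0.1 and 0.5 are invented for illustration, and case (c) is detected here by checking whether the next slide lands at or beyond the paragraph of the previous slide.

```python
def jump_penalty(prev_p, cur_p, next_p, minor=0.1, major=0.5):
    """Penalty for slide s_i given paragraph indices aligned to s_i-1, s_i, s_i+1.

    minor and major are hypothetical values; the slide names them but gives no numbers.
    """
    if cur_p >= prev_p:
        return 0.0   # (a) no backward jump at s_i
    if next_p >= prev_p:
        return major  # (c) s_i+1 returns to where s_i-1 was, so s_i looks spurious
    return minor      # (b) the alignment carries on from the earlier position


def corrected_score(alignment_score, prev_p, cur_p, next_p):
    """Final alignment score: alignment_score × (1 − penalty)."""
    return alignment_score * (1.0 - jump_penalty(prev_p, cur_p, next_p))
```

For instance, with neighbors aligned to paragraphs 5 and 6, an alignment of the middle slide to paragraph 2 falls under case (c) and its score is halved under these example values.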

2. Alignment – Nil classifier
But not all text units should be aligned
Use machine learning (SVM) to learn a binary classifier
Features
• Similarity score
• Number of words on slide
  • Few words can indicate figures, pictures with less preference for alignment
• Words on slide
  • Cue phrases: “outline”, “questions”, “thanks”
• Alignment path
  • Jumping alignments (e.g., outline slides)
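
The feature list above could be turned into a numeric vector along these lines. The encoding is an assumption made for the sketch: the cue-phrase feature is a single binary flag, and the alignment-path feature is reduced to the size of the jump from the previous slide's paragraph. The resulting vectors would then be fed to an SVM trainer, which is not shown.

```python
CUE_PHRASES = ("outline", "questions", "thanks")  # cue phrases listed on the slide


def nil_features(slide_text, best_similarity, prev_p, cur_p):
    """Build a feature vector for deciding whether a slide should get a nil alignment.

    Returns [similarity score, word count, cue-phrase flag, jump size].
    """
    words = slide_text.lower().split()
    has_cue = any(w.strip(".,:;!?") in CUE_PHRASES for w in words)
    return [
        float(best_similarity),
        float(len(words)),
        1.0 if has_cue else 0.0,
        float(abs(cur_p - prev_p)),  # hypothetical encoding of the alignment path
    ]
```

A closing slide such as "Thanks! Questions?" yields a low word count and a set cue flag, both of which point toward a nil alignment.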

2. Alignment – Evaluation Dataset
• Manually compiled alignment dataset by author and fellow researcher
• Gold standard: annotate all acceptable spans, or nil
• 20 presentation and document pairs from databases
• Dataset is freely downloadable

2. Alignment – Evaluation
40%? Why is it so difficult?
• Noise in conversion process. Other studies have used clean data.
• Others have used soft accuracy (any overlap is correct)
Use Weighted Jaccard accuracy as metric
• Fractional accuracy for partially correct answers
• Give false positives (extra spurious alignments) less weight
[Chart: Weighted Jaccard Accuracy]
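
The slide does not give the formula for Weighted Jaccard accuracy, so the following is only one plausible form that satisfies the two stated properties: partial overlap earns fractional credit, and spurious extra paragraphs are discounted by a factor below 1. The discount `fp_weight=0.5`, the treatment of a correct nil as 1.0, and the use of a single gold span per slide are all assumptions.

```python
def weighted_jaccard_accuracy(predicted, gold, fp_weight=0.5):
    """Score one slide's predicted paragraph set against its gold paragraph set.

    fp_weight < 1 makes false positives cost less than misses (hypothetical value).
    """
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0  # correctly predicted nil alignment
    hits = len(predicted & gold)
    misses = len(gold - predicted)
    extras = len(predicted - gold)
    return hits / (hits + misses + fp_weight * extras)
```

With `fp_weight=1.0` this reduces to the ordinary Jaccard coefficient between the predicted and gold sets.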

3. User Interface – Rationale
How might fine-grained aligned pairs be utilized in a large DL?
Coordinated Views
• Learning / Comprehension
• Summarization
• Offline Viewing
Collection Interface
• Comparing pairs
• Searching for suitable materials

3. UI – Coordinated Views
Views (labeled slide centric or document centric in the figure):
• Gallery View
• Slideshow View
• Full Document View
• Document View
• Slide View
• Print View

SlideSeer Prototype Demo
Production environment differs from demo

3. UI – Collection Interface
• Searching
  • Lucene indexing of the static print view
  • Show title along with the set of results
• Spider-friendly
  • Main content loaded dynamically by Javascript, not spiderable
  • Currently use print view (as it is static) for spiderable interface
• URLs
  • Most material in the form <subject/surname/year/title/view/type?offset>
  • Implies hierarchy of papers
  • Constructed URLs to promote browsing access
• Simple keyboard shortcuts
  • For expert user navigation
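
As an illustration of the URL form <subject/surname/year/title/view/type?offset>, a path could be assembled as below. The slug rules (lowercasing, hyphens for spaces), the function name and the sample field values are all hypothetical; the slide only specifies the order of the path components.

```python
from urllib.parse import quote


def paper_url(subject, surname, year, title, view, kind, offset=None):
    """Build a path of the form /subject/surname/year/title/view/type?offset."""
    parts = [subject, surname, str(year), title, view, kind]
    # Hypothetical slug rule: trim, lowercase, spaces to hyphens, then percent-encode.
    path = "/".join(
        quote(p.strip().lower().replace(" ", "-"), safe="-") for p in parts
    )
    return f"/{path}?{offset}" if offset is not None else f"/{path}"


# Made-up field values for the example.
example = paper_url("nlp", "Kan", 2007, "Aligned Pairs", "slide", "gif", offset=3)
```

Because each component is a prefix of a longer path, truncating the URL at any slash gives a natural browsing level, which matches the hierarchy the slide describes.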

Conclusion
• Alignment of documents to presentations
  • Simple approach works well thus far
  • Tweaks to get more mileage out of simple approach
    • Span alignment, nil alignment modifications
  • But certainly more models to try!
    • 40% best performance, certainly much room to improve
Deployment status
• In Alpha (development)
• Beta hopefully in mid 2008
• Usability testing underway
• Interested in digital anthologies?
  • Join our mailing list (web: dAnth)
  • Current: text extraction project for ACL Anthology

Other slides

Future Work
• Planning to hook up current work in progress
  • 2 stage CRF/SVM re-ranking citation segmentation algorithm
  • Automatic keyphrase extraction program
  • Automatic synthetic image classification
  • Automatic de-duplication module
• Partnering with Simone Teufel (Cambridge U.) to do argumentative zoning of documents
  • What is a citation used for?

Poor slides
• Often represent a biased view of the full results
  • Cherry picking evidence to support claims
• Imply that evidence is independent (when it is statistically correlated)
• May summarize other findings inaccurately (secondary or tertiary sources)
