
Tell us the first step you would do to comprehend the below passage?



  1. Tell us the first step you would do to comprehend the below passage? Slumdog Millionaire, the latest megahit flick, tells the rags-to-riches story of a slum dweller. The movie, an adaptation of a novel, is based on the popular Indian version of the American game show - Who Wants to Be a Millionaire - which was well received by the masses. Freida Pinto is the heroine of the movie. She hails from Mumbai. Even though it was her debut movie, her exemplary performance has earned her offers for many Hollywood movies. Slumdog received numerous accolades from all over the world. Apart from the Oscars, some notable ones were the Toronto International Film Festival, Cannes, etc. Dept of CSE - IIT Bombay

  2. Discourse Segmentation CS 626 Course Seminar, Dept of CSE, IIT Bombay Group 1: Sriraj (08305034), Dipak (08305901), Balamurali (08405401)

  3. The way we go… • Introduction • Motivation • TextTiling • Context Vectors and Segmentation • Lexical Chains and Segmentation • Segmentation with LSA • Conclusion • References

  4. INTRODUCTION Discourse comes from the Latin word 'discursus'. Discourse: a continuous stretch of (especially spoken) language larger than a sentence, often constituting a coherent unit such as a sermon, argument, joke, or narrative (Crystal 1992). Discourse ranges from novels down to short conversations or even groans (cries).

  5. Beaugrande's definition of discourse • Cohesion - the grammatical relationship between parts of a sentence essential for its interpretation; • Coherence - the ordering of statements relates them to one another by sense; • Intentionality - the message has to be conveyed deliberately and consciously; • Acceptability - the communicative product needs to be satisfactory in that the audience approves it; • Informativeness - some new information has to be included in the discourse; • Situationality - the circumstances in which the remark is made are important; • Intertextuality - reference to the world outside the text or to the interpreters' schemata.

  6. DISCOURSE STRUCTURE - SALIENT FEATURES • Existence of a hierarchy • Segmentation at the semantic level • Domain-specific knowledge

  7. DISCOURSE SEGMENTATION "Partition of full-length text into coherent multi-paragraph units" - Marti Hearst

  8. MOTIVATION • Text Summarization • Question Answering • Sentiment Analysis • Topic Detection

  9. TEXTTILING Uses the TF-IDF concept within a document. Analogy - IR: Document -> Entire Corpus; NLP: Block -> Entire Document. A term used more inside a block weighs more. Adjacent blocks containing more related terms are evidence of strong cohesion.

  10. CONTD... Algorithm - • Divide the text into blocks (say, k sentences long). • Compute the cosine similarity between adjacent blocks: cos(b1, b2) = Σ_t w(t,b1)·w(t,b2) / sqrt(Σ_t w(t,b1)² · Σ_t w(t,b2)²) • Plot the smoothed, interpolated similarity against the sentence gap number. • The lowermost portions of the valleys are the boundaries.
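The steps above can be sketched in Python. This is a minimal illustration, not Hearst's original implementation: tokenization and smoothing are greatly simplified, and w(t,b) is taken to be the raw term count within the block.

```python
import math
import re
from collections import Counter

def cosine(b1, b2):
    """cos(b1,b2) = sum_t w(t,b1)*w(t,b2) / sqrt(sum_t w(t,b1)^2 * sum_t w(t,b2)^2)"""
    num = sum(b1[t] * b2[t] for t in b1 if t in b2)
    den = math.sqrt(sum(v * v for v in b1.values()) * sum(v * v for v in b2.values()))
    return num / den if den else 0.0

def texttile(sentences, k=2):
    """Score each sentence gap by the similarity of the k-sentence blocks
    on either side; the deepest valleys suggest topic boundaries."""
    tokens = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]
    sims = []
    for gap in range(k, len(sentences) - k + 1):
        left = sum(tokens[gap - k:gap], Counter())
        right = sum(tokens[gap:gap + k], Counter())
        sims.append(cosine(left, right))
    return sims
```

On a toy text whose topic shifts halfway through, the similarity dips to its minimum exactly at the shift, which is the valley the algorithm looks for.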

  11. CONTD... [Figure from [1]: smoothed similarity plotted against sentence gap number; valleys mark candidate boundaries.] Are we satisfied?

  12. TEXTTILING - WHAT WENT WRONG? • The same word need not be repeated - but a similar word could be. • WSD was not performed - polysemy issues. • Contextual information was not considered.

  13. CONTEXT VECTORS & SEGMENTATION Capture the contextual information in different blocks. Steps: • Encode the contextual information - context vector creation. • Create block vectors. • Measure similarity - instead of TF-IDF weights, use context vectors: cos(v, w) = v·w / (|v| |w|)
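A minimal sketch of the context-vector idea: each word's vector counts its neighbours in a training corpus, and a block is represented by the sum of its words' vectors. The window size and raw-count weighting here are illustrative assumptions, not Kaufmann's exact scheme.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences, window=2):
    """Each word's context vector counts the words co-occurring within +/-window."""
    vecs = defaultdict(Counter)
    for sent in sentences:
        words = sent.lower().split()
        for i, w in enumerate(words):
            for j in range(max(0, i - window), min(len(words), i + window + 1)):
                if j != i:
                    vecs[w][words[j]] += 1
    return vecs

def block_vector(block, vecs):
    """A block is represented by the sum of its words' context vectors."""
    v = Counter()
    for w in block.lower().split():
        v.update(vecs[w])
    return v

def cosine(v, w):
    num = sum(v[t] * w[t] for t in v if t in w)
    den = math.sqrt(sum(x * x for x in v.values()) * sum(x * x for x in w.values()))
    return num / den if den else 0.0
```

The payoff over plain TextTiling: two blocks using different but related words (say, "car" in one and "automobile" in the other) still score as similar because the words share contexts, even though they never match as terms.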

  14. DID IT DO THE TRICK? Yes! • Precision increased from 32% to 52% • Recall increased from 40% to 51% Let's try to improve a bit more!

  15. LEXICAL CHAINS • A technique for computing lexical cohesion. • A sequence of related words in the text. • Independent of the grammatical structure. • Provides a context for disambiguation. • Enables identification of the concept.

  16. Different forms of lexical cohesion • Repetition • Repetition through synonymy: police, officers • Word association through specialization/generalization: murder weapon, knife • Word association through a part-whole/whole-part relationship: committee, members • Statistical association between words: Osama Bin Laden and World Trade Center

  17. How • Uses an auxiliary resource (WordNet) to cluster words into sets of related concepts. • Areas of low cohesive strength are good indicators of topic boundaries. • Process: Tokenizer, Lexical chainer, Boundary detector.

  18. Process • Tokenizer: POS tagging and morphological analysis are done. • Lexical chainer: finds relations between tokens via single-pass clustering - the first token starts the first chain; each later token is added to the most recently updated chain with which it shares the strongest relationship.

  19. Process contd... • Boundary detection: a high concentration of chains beginning and ending between two adjacent textual units marks a boundary. • Boundary strength w(n, n+1) = E * S, where E = the number of lexical chains whose span ends at sentence n, and S = the number of chains that begin their span at sentence n+1. • Take the mean of all non-zero scores; this mean acts as the minimum allowable boundary strength.
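The boundary-strength computation above can be sketched directly; the chain spans are assumed to be precomputed (start, end) sentence indices produced by the lexical chainer.

```python
def boundaries(chains, n_sentences):
    """Detect topic boundaries from lexical chain spans.

    chains: list of (start, end) sentence indices spanned by each chain.
    Boundary strength between sentences n and n+1 is w(n, n+1) = E * S,
    where E = number of chains ending at n and S = number starting at n+1.
    The mean of the non-zero scores is the minimum allowable strength.
    """
    scores = []
    for n in range(n_sentences - 1):
        e = sum(1 for start, end in chains if end == n)
        s = sum(1 for start, end in chains if start == n + 1)
        scores.append(e * s)
    nonzero = [x for x in scores if x > 0]
    threshold = sum(nonzero) / len(nonzero) if nonzero else 0.0
    return [n for n, score in enumerate(scores) if score > 0 and score >= threshold]
```

For example, if two chains end at sentence 2 and two begin at sentence 3, the gap between sentences 2 and 3 scores 2 * 2 = 4 and is reported as a boundary.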

  20. And the Improvement is … • Evaluation Metrics • Precision • Recall

  21. Latent Semantic Analysis (LSA)

  22. Problems with frequency-vector-based similarity: short passages • The similarity estimate is inaccurate for short passages. • An additional occurrence of a common word (reflected in the numerator) causes a disproportionate increase in sim(x, y) unless the denominator is large.

  23. Problems with frequency-vector-based similarity contd (2): the term-matching problem • car, automobile; car, petrol • Similar or related but distinct words are counted as negative evidence. • Solutions: stemming; thesaurus- or WordNet-based similarity measures; Latent Semantic Analysis.
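A tiny numeric illustration of the short-passage problem from the previous slide: when the denominators are small, one extra occurrence of a shared word inflates the cosine sharply, while the same change barely moves the score for longer passages.

```python
import math

def cos(x, y):
    """Plain cosine similarity between two term-frequency vectors."""
    num = sum(a * b for a, b in zip(x, y))
    return num / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

# Two short "passages" over a small vocabulary, sharing only word 0:
short_a, short_b = [1, 1, 0, 0], [1, 0, 1, 0]
# The same passages with one extra occurrence of the shared word:
short_a2, short_b2 = [2, 1, 0, 0], [2, 0, 1, 0]
# cos jumps from 0.5 to 0.8 because the denominators are small.

# Longer passages (18 extra distinct words each) with the same change:
long_a = [1, 1, 0, 0] + [1] * 18 + [0] * 18
long_b = [1, 0, 1, 0] + [0] * 18 + [1] * 18
long_a2 = [2, 1, 0, 0] + [1] * 18 + [0] * 18
long_b2 = [2, 0, 1, 0] + [0] * 18 + [1] * 18
# Here the same extra occurrence moves cos far less (0.05 -> ~0.17).
```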

  24. Introduction to LSA • LSA stems from work in IR. • Represents word and passage meaning as high-dimensional vectors in a semantic space. • Does not use manually constructed dictionaries, knowledge bases, semantic networks, etc. • Meaning of a word: the average of the meaning of all passages in which it appears. • Meaning of a passage: the average of the meaning of all the words it contains.

  25. Training LSA • Input: a set of texts • Vocabulary: the terms occurring in them, arranged as a word-by-document matrix

  26. Training LSA contd (2) • The matrix values are scaled according to a general form of inverse document frequency. • Dimensionality reduction is performed using SVD.

  27. Training LSA contd (3) • A_k = U_k Σ_k V_k^T is the k-dimensional LSA approximation of the scaled word-by-document matrix A; row i of U_k Σ_k is the LSA feature vector for word w_i. • Benefits of applying SVD: A_k is a concise representation, so the storage and complexity of the similarity matrix are reduced; it captures the major structural associations between words and documents; noise is removed simply by omitting the less salient dimensions of U.

  28. Applying LSA • A sentence s_i is represented by its term-frequency vector f_i, where f_ij is the frequency of term j in s_i. • Meaning of s_i: the projection of f_i into the k-dimensional LSA space.
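Training and applying LSA as described on the last few slides can be sketched with NumPy. The matrix is a toy example, the IDF-style scaling is omitted, and the projection shown is the common "folding-in" convention rather than necessarily the exact formula used in [5].

```python
import numpy as np

# Toy word-by-document matrix A (rows = words, columns = training texts);
# the IDF-style scaling from the training slides is omitted for brevity.
vocab = ["car", "automobile", "cat", "dog"]
A = np.array([[2., 2., 0., 0.],
              [2., 2., 0., 0.],
              [0., 0., 1., 1.],
              [0., 0., 1., 1.]])

# Dimensionality reduction with SVD: A ~= Uk @ diag(sk) @ Vtk
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, sk = U[:, :k], s[:k]

def fold_in(freq):
    """Project a sentence's term-frequency vector into the k-dim LSA space
    (folding-in convention: s_hat = Sigma_k^-1 @ Uk.T @ f)."""
    return (Uk.T @ freq) / sk

def cos(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

f_car = np.array([1., 0., 0., 0.])   # sentence containing only "car"
f_auto = np.array([0., 1., 0., 0.])  # sentence containing only "automobile"
f_cat = np.array([0., 0., 1., 0.])   # sentence containing only "cat"
# In raw term space cos(f_car, f_auto) = 0; after folding in, the two
# sentences land on (almost) the same point of the LSA space.
```

Note how this addresses the term-matching problem from slide 23: "car" and "automobile" never co-occur as matching terms, yet their LSA representations coincide because they appear in the same training documents.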

  29. Significance of k • Finding the optimal dimensionality is an important step in LSA. • Hypothetically, the optimal space for the reconstruction has the same dimensionality as the source that generates the discourse. • The source generates passages by choosing words from a k-dimensional space in such a way that words in the same paragraph tend to be selected from nearby locations.

  30. LSA results • LSA is twice as accurate as the word-similarity-based co-occurrence vectors (error reduced from 22% to 11%). • LSA values become less accurate as more dimensions are incorporated into the feature vectors.

  31. Conclusion • TextTiling, context-vector-based similarity, lexical chaining, and LSA are all bag-of-words approaches. • Bag-of-words approaches are sufficient... to some extent. "LSA makes no use of word order, thus of syntactic relations or logic, or of morphology. Remarkably, it manages to extract reflections of passage and word meanings quite well without these aids, but it must still be suspected of resulting incompleteness or likely error on some occasions" - excerpt from [5].

  32. Contd... • LSA is purely statistical, whereas the other approaches use some form of external knowledge base in addition to statistical techniques. • Role of external knowledge: to move to the next level we need some linguistics. • We need the right mix of statistical and linguistic approaches to move forward.

  33. References [1] Hearst, M. A. 1993. TextTiling: A Quantitative Approach to Discourse Segmentation. Technical Report S2K-93-24, University of California at Berkeley. [2] Kaufmann, S. 1999. Cohesion and Collocation: Using Context Vectors in Text Segmentation. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 99-107. [3] Landauer, T. K., Foltz, P. W., and Laham, D. 1998. An Introduction to Latent Semantic Analysis. Discourse Processes, 25, pages 259-284. [4] Barzilay, R. and Elhadad, M. 1997. Using Lexical Chains for Text Summarization. In Proceedings of the Intelligent Scalable Text Summarization Workshop (ISTS-97), Madrid, Spain. [4] Choi, F. Y. Y. 2000. Advances in Domain Independent Linear Text Segmentation. In Proceedings of NAACL, pages 26-33. [5] Choi, F. Y. Y., Wiemer-Hastings, P., and Moore, J. 2001. Latent Semantic Analysis for Text Segmentation. In Proceedings of EMNLP, pages 109-117. [6] Stokes, N., Carthy, J., and Smeaton, A. F. 2002. Segmenting Broadcast News Streams Using Lexical Chains. In Proceedings of the 1st Starting AI Researchers Symposium (STAIRS 2002), volume 1, pages 145-154.

  34. Contd... [7] http://www.freewebs.com/hsalhi/Discourse%20Analysis%20Handout.doc [8] http://ilze.org/semio/005.htm [9] http://www.dfki.de/etai/SpecialIssues/Dia99/denecke/ETAI-00/node11.html [10] http://www.csi.ucd.ie/staff/jcarthy/home/Lex.html
