A Bootstrapping Method for Building Subjectivity Lexicons for Languages with Scarce Resources

Carmen Banea, Rada Mihalcea, University of North Texas (carmenb@unt.edu, rada@cs.unt.edu). Janyce Wiebe, University of Pittsburgh (wiebe@cs.pitt.edu). Topic: subjectivity analysis.

Presentation Transcript


  1. A Bootstrapping Method for Building Subjectivity Lexicons for Languages with Scarce Resources • Carmen Banea, Rada Mihalcea, University of North Texas, carmenb@unt.edu, rada@cs.unt.edu • Janyce Wiebe, University of Pittsburgh, wiebe@cs.pitt.edu

  2. Subjectivity analysis • Subjectivity analysis (opinions and sentiments) • Used in a wide variety of applications • Tracking sentiment timelines in news (Lloyd et al., 2005) • Review classification (Turney, 2002; Pang et al., 2002) • Mining opinions from product reviews (Hu and Liu, 2004) • Expressive text-to-speech synthesis (Alm et al., 2005) • Text semantic analysis (Wiebe and Mihalcea, 2006; Esuli and Sebastiani, 2006) • Question answering (Yu and Hatzivassiloglou, 2003) • Much work on subjectivity analysis has focused on English • Work on other languages: Japanese (Takamura et al., 2006), Chinese (Hu et al., 2005), German (Kim and Hovy, 2006)

  3. Proportion of Languages on the Web internetworldstats.com ~ updated November 30, 2007

  4. Objective • Develop a method for subjectivity analysis that • Requires few electronic resources • Can be easily ported to a new language • Applicable to the large number of languages that have scarce electronic resources

  5. Related Work • Tools that rely on manually or semi-automatically constructed lexicons • Yu and Hatzivassiloglou, 2003; Riloff and Wiebe, 2003; Kim and Hovy, 2006 • These lexicons enable efficient rule-based subjectivity and sentiment classifiers that rely on the presence of lexicon entries in text • These tools assume the availability of • advanced language processing tools: • Syntactic parsers (Wiebe, 2000), Information extraction (Riloff and Wiebe, 2003) • broad-coverage rich lexical resources • WordNet (Esuli and Sebastiani, 2006) • Our approach relates most closely to the method of Turney (2002) for the construction of lexicons annotated for polarity • We address the task of acquiring a subjectivity lexicon • We rely on fewer, smaller-scale resources

  6. Our Method • Based on bootstrapping • Requires: • A small seed set of subjective entries • One/multiple electronic dictionaries • A small training corpus (approx. 500,000 words) • Experiments focused on Romanian • Applicable to other languages as well

  7. Bootstrapping Process • [Flowchart; labels: seeds, query, Online dictionary, Candidate synonyms, Fixed filtering, Selected synonyms, "Max. no. of iterations?" (no / yes), Variable filtering]
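
The loop suggested by the flowchart labels can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `lookup` and `similarity` callables, the toy dictionary, the toy scores, and the choice to give seeds a score of 1.0 are all assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, Set


def bootstrap(
    seeds: Iterable[str],
    lookup: Callable[[str], Set[str]],     # hypothetical: word -> candidate synonyms
    similarity: Callable[[str], float],    # hypothetical: word -> similarity to the seeds
    threshold: float = 0.5,
    max_iterations: int = 5,
) -> Dict[str, float]:
    """Grow a lexicon from the seeds; returns word -> similarity score."""
    lexicon = {seed: 1.0 for seed in seeds}  # assumption: seeds are always kept
    frontier = set(lexicon)
    for _ in range(max_iterations):
        # Query the dictionary for every word selected in the previous round.
        candidates: Set[str] = set()
        for word in frontier:
            candidates |= lookup(word)
        candidates -= set(lexicon)  # skip words that are already in the lexicon
        # Fixed filtering: keep only candidates scoring above the threshold.
        selected = {}
        for candidate in candidates:
            score = similarity(candidate)
            if score > threshold:
                selected[candidate] = score
        if not selected:
            break  # nothing new to expand
        lexicon.update(selected)
        frontier = set(selected)
    return lexicon


# Toy stand-ins for an online dictionary and a similarity model (invented data).
TOY_DICTIONARY = {
    "dulce": {"placut", "dulceag", "gust"},
    "placut": {"agreabil"},
}
TOY_SCORES = {"placut": 0.8, "dulceag": 0.7, "gust": 0.2, "agreabil": 0.6}

lexicon = bootstrap(
    ["dulce"],
    lookup=lambda w: TOY_DICTIONARY.get(w, set()),
    similarity=lambda w: TOY_SCORES.get(w, 0.0),
)
```

With the toy data, "gust" is rejected by the fixed filter, while "agreabil" is reached in the second iteration through "placut".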

  8. Seed Set • 60 seeds, evenly sampled from verbs, nouns, adjectives and adverbs • Manually selected • Seed sources: • 11th-grade curriculum for Romanian Language and Literature • Translations of instances appearing in the OpinionFinder strong subjective lexicon (Wiebe and Riloff, 2005)

  9. Expansion • [Diagram; labels: Seed, Definition, Candidate synonyms] • Candidates: all open-class words longer than 3 letters that have a definition in the dictionary • Diacritics are removed • Romanian dictionary: http://www.dexonline.ro • Dictionaries for other languages are also available, or can be obtained from paper dictionaries through OCR
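
One way the expansion step might look in code, assuming the dictionary definition is already available as a string. The stopword list is a hypothetical stand-in for a real closed-class word list, and the function names are invented for this sketch; the slide does not say how open-class words were identified.

```python
import re
import unicodedata
from typing import Set

# Hypothetical stand-in for a closed-class (function word) list.
STOPWORDS = {"care", "este", "pentru", "foarte"}


def strip_diacritics(text: str) -> str:
    """Remove combining marks, e.g. 'plăcut' becomes 'placut'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def candidates_from_definition(
    definition: str, stopwords: Set[str] = STOPWORDS, min_len: int = 4
) -> Set[str]:
    """Collect words longer than 3 letters that are not in the stopword list."""
    text = strip_diacritics(definition.lower())
    tokens = re.findall(r"[a-z]+", text)
    return {t for t in tokens if len(t) >= min_len and t not in stopwords}


candidates = candidates_from_definition("Care are gust plăcut, dulceag")
```

A fuller version would also check that each candidate has its own dictionary entry before keeping it.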

  10. Filtering • Candidates are filtered based on a measure of similarity with the original seeds • We use Latent Semantic Analysis (LSA) (Dumais et al., 1988) trained on the SemCor corpus (Miller et al., 1993) • After each iteration, only candidates with an LSA score higher than a given threshold are selected for further expansion • Example: • Seed: dulce (sweet) • Candidate synonyms: cu gust dulce (sweet-tasting), placut (pleasant), dulceag (quasi-sweet)
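
A small sketch of threshold filtering, assuming each word already has a vector from some LSA model. Scoring a candidate by cosine similarity to the centroid of the seed vectors is an assumption made here for illustration; the slide only says that similarity to the original seeds is measured with LSA.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def filter_candidates(
    candidates: Dict[str, Vector], seed_vectors: List[Vector], threshold: float = 0.5
) -> Dict[str, float]:
    """Keep candidates whose similarity to the seed centroid exceeds the threshold."""
    target = centroid(seed_vectors)
    scored = {word: cosine(vec, target) for word, vec in candidates.items()}
    return {word: score for word, score in scored.items() if score > threshold}


# Invented 2-dimensional vectors, purely for illustration.
kept = filter_candidates(
    {"placut": [0.9, 0.1], "masa": [0.0, 1.0]},
    seed_vectors=[[1.0, 0.0], [1.0, 0.0]],
)
```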

  11. Filtering • Several iterations of the bootstrapping process result in a subjectivity lexicon consisting of a list of candidates ranked in decreasing order of similarity to the original seeds • A variable filtering threshold can be used to further restrict the similarity and obtain a purer lexicon • Filtering parameters: • Similarity threshold • Number of iterations
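
The ranking and the variable threshold could be applied after bootstrapping along these lines. The function names and scores are invented for the example; this is only a sketch of the idea.

```python
from typing import Dict, List, Tuple


def rank_lexicon(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort entries by decreasing similarity to the seeds."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def restrict(ranked: List[Tuple[str, float]], threshold: float) -> List[str]:
    """Variable filtering: keep only entries scoring above the stricter threshold."""
    return [word for word, score in ranked if score > threshold]


ranked = rank_lexicon({"placut": 0.9, "gustos": 0.55, "dulceag": 0.7})
purer = restrict(ranked, threshold=0.6)
```

Raising the threshold trades lexicon size for purity, which is why the slide lists it as a tunable parameter alongside the number of iterations.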

  12. Lexicon Acquisition

  13. Evaluation • Rule-based classifier of subjectivity • (Riloff and Wiebe, 2003) • Subjective sentence: three or more subjective entries • Objective sentence: two or fewer subjective entries • Gold standard data set • (Mihalcea, Banea and Wiebe, 2007) • 504 sentences from five SemCor documents (manually translated into Romanian) • Labeled by two annotators • Agreement (all): 83% (κ=0.67) • Agreement (uncertain removed): 89% (κ=0.77) • Baseline: 54% (all subjective)
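
The classification rule on this slide can be sketched as below. The tokenization and the assumption that lexicon entries are single lowercase words without diacritics are simplifications made for this example; multi-word entries would need phrase matching.

```python
import re
from typing import Set


def count_subjective_entries(sentence: str, lexicon: Set[str]) -> int:
    """Count tokens of the sentence that appear in the lexicon."""
    tokens = re.findall(r"\w+", sentence.lower())
    return sum(1 for token in tokens if token in lexicon)


def classify(sentence: str, lexicon: Set[str], min_entries: int = 3) -> str:
    """Label a sentence subjective if it has at least min_entries lexicon hits."""
    hits = count_subjective_entries(sentence, lexicon)
    return "subjective" if hits >= min_entries else "objective"


# Invented mini-lexicon, purely for illustration.
toy_lexicon = {"dulce", "placut", "frumos"}
label_a = classify("un gust dulce placut si frumos", toy_lexicon)
label_b = classify("un gust dulce", toy_lexicon)
```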

  14. Number of Iterations F-measure for the bootstrapping subjectivity lexicon over 5 iterations and an LSA threshold of 0.5
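
For reference, the standard per-class F-measure (harmonic mean of precision and recall) can be computed as in this sketch. The labels and data are invented, and the slides do not state how the overall F-measure is aggregated, so only the single-class case is shown.

```python
from typing import Sequence


def f_measure(gold: Sequence[str], predicted: Sequence[str], label: str = "subjective") -> float:
    """F1 for one class: 2PR / (P + R), or 0.0 when undefined."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold_labels = ["subjective", "subjective", "objective", "objective"]
predicted_labels = ["subjective", "objective", "subjective", "objective"]
score = f_measure(gold_labels, predicted_labels)
```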

  15. Similarity Threshold F-measure for the fifth bootstrapping iteration for varying LSA scores

  16. Comparison • Bootstrapping rule-based classifier: uses a 3,913-entry subjectivity lexicon obtained through 5 iterations and a similarity threshold of 0.5

  17. Conclusions • Our bootstrapping method uses few electronic resources: • A small seed set • One/multiple dictionaries • A small corpus of half a million words • A large subjectivity lexicon of approx. 4000 entries was extracted • Using an unsupervised rule-based classifier, a subjectivity F-measure of 66.20% and an overall F-measure of 61.69% can be achieved

  18. Questions?
