An effective statistical approach to blog post opinion retrieval

An Effective Statistical Approach to Blog Post Opinion Retrieval. Ben He, Craig Macdonald, Jiyin He, Iadh Ounis (CIKM 2008). Introduction. Blogs have recently emerged as a new grassroots publishing medium.


An Effective Statistical Approach to Blog Post Opinion Retrieval

Ben He, Craig Macdonald, Jiyin He, Iadh Ounis

(CIKM 2008)


  • Blogs have recently emerged as a new grassroots publishing medium.

  • A key feature that distinguishes blog content from other Web content is its subjective nature.

  • Bloggers tend to express opinions and comments towards some given targets, such as persons, organizations or products.


  • Under the TREC opinion finding task, only a handful of groups achieved an improvement over their baseline, using techniques such as NLP or SVM classifiers.

  • These approaches either involve considerable manual effort in collecting evidence for opinions, or lead to little improvement over a baseline that does not include any opinion finding feature.


  • This paper proposes a statistical and light-weight automatic dictionary-based approach.

  • It also shows that, despite its apparent simplicity, the approach provides statistically significant improvements over robust baselines, including the best TREC baseline run, without any manual effort.

The Statistical Dictionary-based Approach to Opinion Retrieval

  • Automatically generates a dictionary from the collection without requiring manual effort.

  • Assigns a weight to each term in the dictionary, which represents how opinionated the term is.

  • Assigns an opinion score to each document in the collection using the top weighted terms from the dictionary as a query.

  • Appropriately combines the opinion score with the initial relevance score produced by the retrieval baseline.

Dictionary Generation

  • To derive the dictionary, we filter out terms that are too frequent or too rare in the collection.

  • A term that appears too many times in the collection carries too little information, and a term that appears too few times is too specific. In either case it cannot be generalized across different queries as an indicator of opinion.

Dictionary Generation

  • We first rank all terms in the collection by their within-collection frequencies in descending order.

  • Terms whose ranks fall in the range (s·#terms, u·#terms) are selected for the dictionary.

  • We apply s = 0.00007 and u = 0.001.
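
The rank-based filter above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_dictionary`, the open-interval reading of the range, the alphabetical tie-breaking, and the assumption that #terms counts distinct terms are all choices made here for the example.

```python
from collections import Counter


def build_dictionary(tokens, s=0.00007, u=0.001):
    """Keep terms whose frequency rank lies strictly between s*#terms and u*#terms.

    Assumptions (not from the slides): #terms is the number of distinct terms,
    rank 1 is the most frequent term, and ties are broken alphabetically.
    """
    freq = Counter(tokens)
    # Sort by descending collection frequency, then alphabetically for stable ranks.
    ranked = [term for term, _ in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))]
    n_terms = len(ranked)
    low, high = s * n_terms, u * n_terms
    return [term for rank, term in enumerate(ranked, start=1) if low < rank < high]
```

With the slide's values s = 0.00007 and u = 0.001, the filter only selects anything once the vocabulary has more than a thousand or so distinct terms, so small experiments need larger values of `s` and `u`.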

Term Weighting

  • D(Rel): relevant document set.

  • D(opRel): opinionated relevant document set.

  • For each term t in the opinion term dictionary, we measure wopn(t), the divergence of the term’s distribution in D(opRel) from that in D(Rel).

  • This divergence value measures how a term stands out from the opinionated documents, compared with all relevant documents.

  • The higher the divergence is, the more opinionated the term is.

Term Weighting

  • A commonly used measure for term weighting is the KL divergence from a term’s distribution in a document set to its distribution in the whole collection.
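
As a sketch of how this measure would apply to the two document sets defined earlier, the KL-based weight of a term t can be written as below. The maximum-likelihood estimates and the symbols tf_x, l_x, tf_rel and l_rel (the term's frequency in, and the total token count of, D(opRel) and D(Rel) respectively) are notational assumptions, not the slide's own formula.

```latex
w_{opn}(t) = P(t \mid D(opRel)) \cdot \log_2 \frac{P(t \mid D(opRel))}{P(t \mid D(Rel))},
\qquad
P(t \mid D(opRel)) = \frac{tf_x}{l_x},
\quad
P(t \mid D(Rel)) = \frac{tf_{rel}}{l_{rel}}
```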

Term Weighting

  • The KL divergence measure considers only the divergence from one distribution to the other, and ignores how frequently a term occurs in the opinionated documents.

  • The weights of the terms in the opinion dictionary may therefore be biased towards terms that have high KL divergence values but carry little information in the opinionated document set D(opRel).

Term Weighting

  • Another method is the Bo1 term weighting model, which measures how informative a term is in the set D(opRel) against D(Rel).

    λ = tf_rel / N_rel
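
A minimal sketch of a Bo1 weight, using the standard Bose-Einstein form from the Divergence From Randomness framework together with the λ given above. The reading of the symbols is an assumption made for this example: `tf_x` is the term's frequency in D(opRel), `tf_rel` its frequency in D(Rel), and `n_rel` the number of documents in D(Rel).

```python
import math


def bo1_weight(tf_x, tf_rel, n_rel):
    """Standard Bo1 weight: tf_x * log2((1 + lam) / lam) + log2(1 + lam).

    Assumed symbol meanings (not confirmed by the slides):
    tf_x   - frequency of the term in the opinionated relevant set D(opRel)
    tf_rel - frequency of the term in the relevant set D(Rel)
    n_rel  - number of documents in D(Rel)
    """
    if tf_rel <= 0 or n_rel <= 0:
        raise ValueError("tf_rel and n_rel must be positive")
    lam = tf_rel / n_rel
    return tf_x * math.log2((1 + lam) / lam) + math.log2(1 + lam)
```

Unlike the KL weight, this form grows linearly with `tf_x`, so a term that occurs often in the opinionated documents is rewarded directly.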

Generating the Opinion Score

  • We take the top X weighted terms from the opinion dictionary (X = 100 in the experiments), and submit them to the retrieval system as a query Qopn.

  • The retrieval system assigns a relevance score to each document in the collection.

  • Such a relevance score reflects the extent to which the top weighted opinionated terms are informative in the document, capturing the overall opinionated nature of the document.

  • This is called the opinion score: Score(d, Qopn).
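
The steps above can be sketched as follows. The helper names `top_terms` and `opinion_scores` are hypothetical, and the scoring function is a toy stand-in (the fraction of a document's tokens that are query terms); a real system would apply its own document weighting model to Qopn instead.

```python
def top_terms(weights, x=100):
    """Return the x highest-weighted dictionary terms, used as the query Qopn."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:x]]


def opinion_scores(docs, q_opn):
    """Compute a stand-in Score(d, Qopn) for every document.

    docs maps a document id to its list of tokens. The score used here is a
    placeholder for whatever weighting model the retrieval system provides.
    """
    query = set(q_opn)
    scores = {}
    for doc_id, tokens in docs.items():
        hits = sum(1 for tok in tokens if tok in query)
        scores[doc_id] = hits / len(tokens) if tokens else 0.0
    return scores
```

Because the query is the same for every topic, these opinion scores can be computed once for the whole collection and reused across queries.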

Score Combination

  • Linear combination:

  • Log. combination:
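
The two combinations might look like the sketch below. Both functional forms are assumptions made for illustration, since the formulas themselves are not reproduced in this transcript: the linear form is taken to be a convex mix with weight `a`, and the logarithmic form is taken to normalize the opinion score into a probability-like value and add -k / log2 of it to the relevance score.

```python
import math


def linear_combination(rel_score, opn_score, a=0.25):
    """Assumed form: (1 - a) * Score(d, Q) + a * Score(d, Qopn)."""
    return (1 - a) * rel_score + a * opn_score


def log_combination(rel_score, opn_score, opn_score_sum, k=250):
    """Assumed form: Score(d, Q) - k / log2(P), with P = opn_score / opn_score_sum.

    opn_score_sum is the sum of opinion scores over the scored documents,
    so P is a probability-like share of the total opinion evidence.
    """
    if opn_score_sum <= 0:
        raise ValueError("opn_score_sum must be positive")
    p_op = opn_score / opn_score_sum
    if p_op <= 0:
        # No opinion evidence: the bonus tends to zero as P tends to zero.
        return rel_score
    if p_op >= 1:
        raise ValueError("P must be below 1 for the log form to be defined")
    return rel_score - k / math.log2(p_op)
```

In the logarithmic form the bonus is always positive for 0 < P < 1 and grows with P, so documents with a larger share of the opinion score are promoted more.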

Experiment: Data

  • Dataset: Blog06 collection.

  • We use the permalinks, which are the blog posts and their associated comments.

  • Each term is stemmed using Porter’s English stemmer, and standard English stopwords are removed.

Experiment: Baseline

  • InLB document weighting model:


Experiment: External Opinion Dictionary

  • We also manually generate a dictionary compiled from various external linguistic resources.

  • The dictionary contains approximately 12,000 English words, mostly adjectives, adverbs and nouns, which are supposed to be subjective.

  • In this paper, we refer to the manually compiled dictionary as the external dictionary, and to the automatically derived one as the internal dictionary.


Experiment: Evaluation

We use the Bo1 term weighting method and set a = 0.25 and k = 250.

Conclusions and Future Work

  • This paper has proposed an effective and practical approach to retrieving opinionated blog posts without the need for manual effort.

  • The use of the automatically generated internal dictionary provides a retrieval performance that is as good as the use of an external dictionary manually compiled from various linguistic resources.

Conclusions and Future Work

In the future:

  • Extend the work to detecting the polarity or the orientation of the retrieved opinionated documents.

  • Study the connection of the opinion finding task to question answering.

    • For example, extracting the opinionated sentences within a blog post about a given target.