Turning Down the Noise in the Blogosphere

Presentation Transcript



Turning Down the Noise in the Blogosphere

Khalid El-Arini, Gaurav Veda, Dafna Shahaf, Carlos Guestrin



  • Millions of blog posts published every day

  • Some stories become disproportionately popular

    • Hard to find information you care about



Our Goal: Coverage

  • Turn down the noise in the blogosphere

    • Select a small set of posts that covers the most important stories

[Figure: example posts selected for January 17, 2009]





Our Goal: Personalization

  • Tailor post selection to user tastes

[Figure: posts selected without personalization]

But, I like sports! I want articles like these…

[Figure: posts selected after personalization, based on Zidane's feedback]



Main Contributions

  • Formalize notion of covering the blogosphere

  • Near-optimal solution for post selection

  • Learn a personalized coverage function

    • No-regret algorithm for learning user preferences using limited feedback

  • Evaluate on real blog data

    • Conduct user studies and compare against Google and Yahoo!



Approach Overview

Blogosphere → Feature Extraction → Coverage Function → Post Selection



Document Features

  • Low level

    • Words, noun phrases, named entities

      • e.g., Obama, China, peanut butter

  • High level

    • e.g., Topics from a topic model

    • Topic = probability distribution over words

[Figure: example word distributions for an Inauguration topic and a National Security topic]



Coverage

  • cover_d(f) = amount by which document d covers feature f

  • cover_A(f) = amount by which a set of posts A = {d1, d2} covers feature f

[Figure: bipartite diagram linking posts to features, illustrating cover_d(f) for a single document d and cover_A(f) for a set A]



Simple Coverage: MAX-COVER

Find k posts that cover the most features

  • cover_A(f) = 1 if at least one post in A contains feature f

  • Problems with MAX-COVER:

    • Ignores feature significance in the document (e.g., a post that merely mentions "… at George Mason University in Fairfax, Va." counts as covering those named entities)

    • Ignores feature significance in the corpus



Feature Significance in Document

  • Solution: Define a probabilistic coverage function

    • cover_d(f) = P(feature f | post d)

    • e.g., a post that mentions Washington only in passing is not really about Washington: cover_d(Washington) = 0.01

    • e.g., with topics as features, cover_d(f) ≡ P(post d is about topic f)
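To make this concrete, here is a minimal sketch of probabilistic coverage with topic features; the posts, topic names, and proportions are hypothetical, and in practice the proportions would come from a trained topic model such as LDA.

```python
# Minimal sketch of probabilistic coverage with topic features.
# The posts, topic names, and proportions below are hypothetical;
# real values would come from a trained topic model (e.g., LDA).

# cover_d(f) = P(post d is about topic f), i.e., the topic proportion
topic_proportions = {
    "hudson_landing_post": {"aviation": 0.55, "new_york": 0.30, "washington": 0.01},
    "inauguration_post": {"inauguration": 0.70, "washington": 0.25},
}

def cover(d, f):
    """Probabilistic coverage of feature (topic) f by post d."""
    return topic_proportions[d].get(f, 0.0)

print(cover("hudson_landing_post", "washington"))  # 0.01: not really about Washington
print(cover("inauguration_post", "washington"))    # 0.25
```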



Feature Significance in Corpus

  • Some features are more important

    • Want to cover the important features

  • Solution:

    • Associate a weight wf with each feature f

      • e.g., frequency of feature in corpus

    • Cover an important feature using multiple posts

[Figure: corpus frequency distinguishes important features, e.g., "Barack Obama" appears in far more posts than "Carlos Guestrin"]
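As a minimal sketch of the frequency-based weighting suggested above (the toy corpus and features are hypothetical):

```python
# Minimal sketch of frequency-based feature weights w_f over a
# hypothetical toy corpus of extracted features per post.
from collections import Counter

corpus = [
    ["barack obama", "inauguration", "washington"],
    ["barack obama", "china", "economy"],
    ["carlos guestrin", "machine learning"],
]

counts = Counter(f for post in corpus for f in post)

# w_f = fraction of posts containing feature f
w = {f: c / len(corpus) for f, c in counts.items()}
print(w["barack obama"])     # ~0.67: frequent, hence important
print(w["carlos guestrin"])  # ~0.33
```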



Incremental Coverage

Two example posts, each covering the feature "Obama":

  • "Obama: Tight noose on Bin Laden as good as capture" — cover_d1(Obama) = 0.5

  • "What Obama's win means for China" — cover_d2(Obama) = 0.4

cover_A(f) = probability that at least one post in set A covers feature f

cover_{d1,d2}(Obama) = 1 – P(neither d1 nor d2 covers Obama)
                     = 1 – (1 – 0.5)(1 – 0.4)
                     = 0.7

cover_d1(Obama) < 0.7 < cover_d1(Obama) + cover_d2(Obama)

  • Gain due to covering a feature using multiple posts

  • Diminishing returns
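A minimal sketch of this set-coverage computation (the noisy-OR of the per-document coverages), using the probabilities from the slide:

```python
# cover_A(f) = 1 - prod over d in A of (1 - cover_d(f))
from math import prod

def cover_set(doc_coverages):
    """Probability that at least one post in the set covers the feature."""
    return 1.0 - prod(1.0 - c for c in doc_coverages)

print(cover_set([0.5]))       # 0.5
print(cover_set([0.5, 0.4]))  # 0.7: gain of only 0.2 from the second post
```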



Post Selection Optimization

  • Want to select a set of posts A that maximizes

        F(A) = Σ_{f ∈ F} w_f · cover_A(f)

    where F is the feature set, w_f is the weight on feature f, and cover_A(f) is the probability that set A covers feature f

  • This function is submodular

    • Exact maximization is NP-hard

    • The greedy algorithm gives a (1 – 1/e) ≈ 63% approximation, i.e., a near-optimal solution

    • We use CELF (Leskovec et al. 2007)
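Below is a minimal sketch of the lazy-greedy idea behind CELF for maximizing F(A); the toy posts, features, and weights are hypothetical, and this is an illustration of the technique, not the authors' implementation.

```python
# Lazy greedy (the idea behind CELF) for the submodular objective
# F(A) = sum_f w_f * cover_A(f). Hypothetical toy data.
import heapq

# cover_d(f) for a toy corpus: post -> {feature: probability}
posts = {
    "p1": {"obama": 0.7, "china": 0.2},
    "p2": {"obama": 0.6, "hudson": 0.8},
    "p3": {"hudson": 0.5, "gaza": 0.9},
}
w = {"obama": 3.0, "china": 1.0, "hudson": 2.0, "gaza": 2.0}

def F(A):
    """Weighted noisy-OR coverage objective for a set of posts A."""
    total = 0.0
    for f, wf in w.items():
        miss = 1.0
        for d in A:
            miss *= 1.0 - posts[d].get(f, 0.0)
        total += wf * (1.0 - miss)
    return total

def lazy_greedy(k):
    A, fA = [], 0.0
    # max-heap of (negated) marginal gains, stamped with |A| at computation time
    heap = [(-F([d]), d, 0) for d in posts]
    heapq.heapify(heap)
    while len(A) < k and heap:
        neg_gain, d, stamp = heapq.heappop(heap)
        if stamp == len(A):   # gain is fresh: by submodularity it is the best
            A.append(d)
            fA = F(A)
        else:                 # stale: recompute the gain and push back
            heapq.heappush(heap, (-(F(A + [d]) - fA), d, len(A)))
    return A

print(lazy_greedy(2))  # ['p2', 'p3'] on this toy data
```

Submodularity guarantees stale gains only overestimate, so a freshly stamped heap top is safe to take without recomputing the rest.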



Approach Overview

Blogosphere → Feature Extraction → Coverage Function → Post Selection (submodular function optimization)



Evaluating Coverage

  • Evaluate on real blog data from Spinn3r

    • 2-week period in January 2009

    • ~200K posts per day (after pre-processing)

  • Two variants of our algorithm:

    • TDN+LDA: high-level features (Latent Dirichlet Allocation topics)

    • TDN+NE: low-level features (named entities)

  • User study involving 27 subjects to evaluate topicality & redundancy



Topicality User Study

Post for evaluation:

  "Downed jet lifted from ice-laden Hudson River"
  NEW YORK (AP) – The airliner that was piloted to a safe emergency landing in the Hudson…

Question: Is this post topical? i.e., is it related to any of the major stories of the day (shown as a set of reference stories)?



Results: Topicality

[Figure: topicality scores (higher is better) for TDN+NE (named entities and common noun phrases as features), TDN+LDA (LDA topics as features), Google, and Yahoo!]

We do as well as Google & Yahoo!



Evaluation: Redundancy

Previously shown posts:

  • Israel unilaterally halts fire as rockets persist

  • Downed jet lifted from ice-laden Hudson River

  • Israeli-trained Gaza doctor loses three daughters and niece to IDF tank shell

  • ...

Question: Is this post redundant with respect to any of the previous posts?



Results: Redundancy

[Figure: redundancy scores (lower is better) for TDN+LDA, TDN+NE, Google, and Yahoo!]

  • Google performs poorly

  • We do as well as Yahoo!



Results: Coverage

  • Google: good topicality, high redundancy

  • Yahoo!: performs well on both, but uses rich features

    • Click-through rates, search trends, user voting, etc.

[Figure: topicality (higher is better) and redundancy (lower is better) for TDN+LDA, TDN+NE, Google, and Yahoo!]

We do as well as Yahoo! using only text-based features



Results: January 22, 2009

[Figure: example posts selected for January 22, 2009]



Personalization

  • People have varied interests

  • Our Goal: Learn a personalized coverage function using limited user feedback

[Figure: different users prefer different features, e.g., Barack Obama vs. Britney Spears]



Approach Overview

Blogosphere → Feature Extraction → Personalized Coverage Fn. → Pers. Post Selection (the coverage function and post selection are now personalized, driven by a Personalization feedback loop)



Modeling User Preferences

  • π_f represents user preference for feature f

  • Want to learn preference π over the features

  • The personalized weight on feature f combines the importance of the feature in the corpus with the user preference π_f

[Figure: example preference vectors (π_1, …, π_5) for a politico vs. a sports fan]



Learning User Preferences

Multiplicative Weights Update

[Figure: learned preference vector before any feedback, after 1 day of personalization, and after 2 days of personalization]
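A minimal sketch of a multiplicative-weights update over feature preferences π_f follows; the reward definition (features of liked posts get reward 1, all others 0) and the learning rate are illustrative assumptions, not necessarily the paper's exact update.

```python
# Multiplicative-weights update over feature preferences pi_f.
# The reward scheme and eta below are hypothetical illustrations.

eta = 0.5  # learning rate (hypothetical value)

def mw_update(pi, liked_features):
    """Multiplicatively up-weight preferred features, then renormalize."""
    for f in pi:
        reward = 1.0 if f in liked_features else 0.0
        pi[f] *= (1.0 + eta) ** reward
    total = sum(pi.values())
    return {f: v / total for f, v in pi.items()}

# start uniform over a toy feature set
pi = {"politics": 0.25, "sports": 0.25, "tech": 0.25, "gossip": 0.25}
pi = mw_update(pi, liked_features={"sports"})
print(pi)  # "sports" mass grows; the others shrink proportionally
```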



No-Regret Learning

Theorem: For TDN,

    avg(objective using the π learned by TDN) – avg(objective using the optimal fixed π) → 0

  • The optimal fixed π is chosen in hindsight, i.e., given all the user ratings in advance

  • i.e., we achieve no-regret



Approach Overview

Blogosphere → Feature Extraction → Personalized Coverage Fn. → Pers. Post Selection (submodular function optimization) → User Feedback → Personalization (online learning), which feeds back into the coverage function



Simulating a Sports Fan

  • Simulated user likes all posts from Fan House (a sports blog)

Personalization ratio = personalized objective / unpersonalized objective

[Figure: personalization ratio vs. days of sports personalization for Fan House (sports blog), Dead Spin (sports blog), and Huffington Post (politics blog), with the unpersonalized baseline shown for reference]



Personalizing for India

  • Like all posts about India

  • Dislike everything else

  • After 5 epochs:

  • 1. India keeps up pressure on Pakistan over Mumbai

  • After 10 epochs:

  • 1. Pakistan’s shift alarms the U.S.

  • 3. India among 20 most dangerous places in world

  • After 15 epochs:

  • 1. 26/11 effect: Pak delegation gets cold vibes

  • 3. Pakistan flaunts its all weather ties with China

  • 4. Benjamin Button gets 13 Oscar nominations [mentions Slumdog Millionaire]

  • 8. Miliband was not off-message, he toed the UK line on Kashmir



Personalization User Study

  • Generate personalized posts

    • Obtain user ratings

  • Generate posts without using feedback

    • Obtain user ratings




Personalization Evaluation

[Figure: average user ratings (higher is better) for personalized vs. unpersonalized posts]

Users like personalized posts more than unpersonalized posts



Summary

  • Formalized covering the blogosphere

  • Near-optimal optimization algorithm

  • Learned a personalized coverage function

    • No-regret learning algorithm

  • Evaluated on real blog data

    • Coverage: using only post content, we perform as well as other techniques that use richer features

    • Successfully tailor post selection to user preferences

www.TurnDownTheNoise.com

