SEARCHING THE BLOGOSPHERE

SEARCHING THE BLOGOSPHERE. Nilesh Bansal, Nick Koudas, University of Toronto. BLOGOSPHERE: 67M KNOWN BLOGS, 100K NEW EVERY DAY, DOUBLING EVERY 200 DAYS. WHAT ARE THEY WRITING ABOUT? PERSONAL LIFE, PRODUCT REVIEWS, POLITICS, TECHNOLOGY, TOURISM, SPORTS, ENTERTAINMENT. WHY SHOULD WE CARE?

Presentation Transcript
slide1

SEARCHING THE BLOGOSPHERE

Nilesh Bansal

Nick Koudas

University of Toronto

slide4

67M KNOWN BLOGS

100K NEW EVERY DAY

DOUBLING EVERY 200 DAYS

slide5

WHAT ARE THEY WRITING ABOUT?

PERSONAL LIFE

PRODUCT REVIEWS

POLITICS

TECHNOLOGY

TOURISM

SPORTS

ENTERTAINMENT

slide7

HUGE DATA REPOSITORY

WILL CONTINUE TO GROW

EXTRACT PUBLIC OPINION

VALUABLE INSIGHTS

slide8

KEY INSIGHTS

MARKET RESEARCH

PUBLIC RELATIONS STRATEGIES

CUSTOMER OPINION TRACKING

slide12

MACHINE CREATED WEBLOGS

MORE THAN HALF OF BLOGSPOT IS SPAM

33% OF WEBSPAM HOSTED AT BLOGSPOT

slide16

Gruhl et al., The Predictive Power of Online Chatter, KDD 2005

Kumar et al., On the Bursty Evolution of Blogspace, WWW 2003

Chi et al., Eigen-trend: trend analysis in the blogosphere based on singular value decompositions, CIKM 2006

Mishne et al., MoodViews: Tools for Blog Mood Analysis, AAAI-CAAW 2006

Mei et al., Topic sentiment mixture: modeling facets and opinions in weblogs, WWW 2007

slide19

CRAWLER RUNNING 24x7

TRACKING 9M BLOGS

INDEXING 70M ARTICLES

AGGREGATION AND PREPROCESSING

INTERACTIVE SEARCH AND ANALYSIS

slide20

ANY STREAMING TEXT SOURCE

NEWS

MAILING LISTS

FORUMS

SOCIAL MEDIA

slide22

Geo Search

Related Terms

Search Results

Popularity Curve

slide23

Taiwan Undersea Earthquake

Sumatra Earthquake

Hawaii Earthquake

slide24

December 15 2006

March 06 2007

slide28

CRAWLS RSS FEEDS

250 THOUSAND NEW POSTS DAILY

PING SERVER: WEBLOGS.COM

slide29

LINK BASED ANALYSIS IS NOT EFFECTIVE

SPAMMERS ARE INTELLIGENT

WE USE HEURISTICS

ONGOING BATTLE

[Wang et al.] Spam Double-Funnel: Connecting Web Spammers with Advertisers, WWW 2007

[Gyöngyi et al.] Combating Web Spam with TrustRank, VLDB 2004

[Kolari et al.] Detecting Spam Blogs, A Machine Learning Approach, AAAI 2006

slide30

INTERACTIVE APPLICATION

TWO SECOND RESPONSE TIME

HUGE AMOUNTS OF DATA

SEVEN THOUSAND UNIQUE IP ADDRESSES DAILY

SCALABILITY

slide32

BURST DETECTION

[Kleinberg] Bursty and Hierarchical Structure in Streams, DMKD 2003

[Fung et al.] Parameter Free Bursty Events Detection in Text Streams, VLDB 2005

slide33

POPULARITY = BASE + ZERO MEAN GAUSSIAN

BURST = STATISTICAL OUTLIER
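
A minimal sketch of the idea on this slide, assuming daily mention counts for one keyword are modelled as a constant base plus zero-mean Gaussian noise, and that a burst is any day lying far above the base. The function name `find_bursts`, the z-score threshold of 2.0, and estimating the base and spread from the whole series are assumptions made for illustration; the slides do not say how the parameters are estimated.

```python
from statistics import mean, stdev


def find_bursts(counts, z_threshold=2.0):
    """Return indices of days whose count is an unusually high outlier.

    Model (assumed): count = base + zero-mean Gaussian noise.
    The base is estimated as the sample mean and the noise scale as the
    sample standard deviation of the whole series.
    """
    if len(counts) < 2:
        return []  # stdev needs at least two observations
    base = mean(counts)
    sigma = stdev(counts)
    if sigma == 0:
        return []  # a flat series has no outliers
    return [i for i, c in enumerate(counts) if (c - base) / sigma > z_threshold]


# Hypothetical daily mention counts for a single keyword.
daily_mentions = [10, 12, 9, 11, 10, 48, 11, 10, 9, 12]
print(find_bursts(daily_mentions))  # the spike on day index 5 is flagged
```

Because the burst itself inflates the estimated mean and standard deviation, a more careful version might estimate them from a trailing window that excludes the day being tested.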

slide35

COLLOCATIONS

POINTWISE MUTUAL INFORMATION

EXPENSIVE

[Ott and Longnecker] An Introduction to Statistical Methods and Data Analysis

[Manning and Schütze] Foundations of Statistical Natural Language Processing

[Church and Hanks] Word Association Norms, Mutual Information and Lexicography, ACL 1989
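
To make the pointwise mutual information measure concrete, here is a small sketch that computes PMI(a, b) = log2(P(a, b) / (P(a) P(b))) over a toy collection. Probabilities are estimated as document frequencies, which is one common choice and an assumption here; the function name `pmi_scores` and the sample documents are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(documents):
    """Compute PMI for every pair of terms that co-occur in some document.

    documents: list of token lists.
    P(a) is the fraction of documents containing a; P(a, b) is the
    fraction containing both a and b (an assumption of this sketch).
    """
    n = len(documents)
    term_df = Counter()
    pair_df = Counter()
    for tokens in documents:
        vocab = sorted(set(tokens))  # sorted so each pair has one canonical order
        term_df.update(vocab)
        pair_df.update(combinations(vocab, 2))
    scores = {}
    for (a, b), joint in pair_df.items():
        p_joint = joint / n
        p_a = term_df[a] / n
        p_b = term_df[b] / n
        scores[(a, b)] = math.log2(p_joint / (p_a * p_b))
    return scores


docs = [
    ["taiwan", "earthquake"],
    ["taiwan", "earthquake", "cable"],
    ["hawaii", "earthquake"],
    ["hawaii", "surf"],
]
scores = pmi_scores(docs)
for pair, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(value, 3))
```

The pair counting loop is quadratic in the number of distinct terms per document, which gives a sense of why the slide labels the exact computation as expensive at the scale of tens of millions of articles.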

slide36

FAST COMPUTATION OF RELATED TERMS

RANDOM SAMPLE

MUTUAL INFORMATION IN EXPECTATION

USE TF WITH PRECOMPUTED IDF
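
One way to read the bullets above is: take a random sample of the documents matching the query, count term frequencies in that sample, and weight them by IDF values computed ahead of time. The sketch below follows that reading under stated assumptions. The function `related_terms`, its parameters, the linear scan used to find matching documents, and the toy IDF table are all hypothetical, and the sketch does not attempt to reproduce the slide's argument that this approximates mutual information in expectation.

```python
import random
from collections import Counter


def related_terms(query, documents, idf, sample_size=100, top_k=5, seed=0):
    """Rank terms that co-occur with `query`, using TF on a sample times IDF.

    documents: list of token lists (a real system would use an index).
    idf: precomputed mapping from term to inverse document frequency.
    """
    matching = [tokens for tokens in documents if query in tokens]
    if len(matching) > sample_size:
        # Only a random sample of the matching documents is examined.
        matching = random.Random(seed).sample(matching, sample_size)
    tf = Counter()
    for tokens in matching:
        tf.update(t for t in tokens if t != query)
    scored = {term: count * idf.get(term, 0.0) for term, count in tf.items()}
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scored, key=lambda term: (-scored[term], term))[:top_k]


docs = [
    ["earthquake", "taiwan", "cable", "the"],
    ["earthquake", "taiwan", "the"],
    ["earthquake", "hawaii", "the"],
    ["surf", "hawaii", "the"],
]
# Hypothetical precomputed IDF values; common words get a weight near zero.
idf = {"the": 0.0, "earthquake": 0.4, "taiwan": 1.0,
       "hawaii": 1.0, "cable": 2.0, "surf": 2.0}
print(related_terms("earthquake", docs, idf, top_k=2))
```

The cost per query depends on the sample size rather than on the number of matching documents, which is the point of sampling for an interactive application.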

slide38

POPULAR DOES NOT MEAN HOT

INTERESTING = SURPRISING

MIXTURE OF DIFFERENT SCORING FUNCTIONS

DEVIATION FROM EXPECTED
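
The distinction between popular and hot can be illustrated with a small scoring sketch: a keyword is scored by how far today's count deviates from what its history predicts, not by its raw count. The two scoring functions, the equal weights, and the sample data below are assumptions chosen for illustration; the slides do not specify which functions are mixed or how.

```python
from statistics import mean, stdev


def surprise(history, today):
    """Deviation of today's count from the expected count, in standard deviations."""
    sigma = stdev(history)  # requires at least two past observations
    return (today - mean(history)) / sigma if sigma > 0 else 0.0


def relative_growth(history, today):
    """Growth over the expected count, with +1 smoothing for rare terms."""
    expected = mean(history)
    return (today - expected) / (expected + 1.0)


def hot_score(history, today, weights=(0.5, 0.5)):
    """Weighted mixture of the two scoring functions (weights are hypothetical)."""
    return (weights[0] * surprise(history, today)
            + weights[1] * relative_growth(history, today))


# Hypothetical data: keyword -> (past daily counts, today's count).
keywords = {
    "the": ([1000, 1010, 990, 1005], 1002),  # very popular, not surprising
    "earthquake": ([10, 12, 9, 11], 60),     # less popular, very surprising
}
ranked = sorted(keywords, key=lambda k: hot_score(*keywords[k]), reverse=True)
print(ranked)
```

In this toy data the frequent word has the larger raw count but scores close to zero, while the rarer word ranks first because its count departs sharply from expectation.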

slide39

INTELLIGENT ALERT SERVICE

BURST SYNOPSIS

AUTHORITATIVE RANKING

slide40

JUST THE BEGINNING

Nilesh Bansal, Fei Chiang, Nick Koudas, Frank Wm. Tompa, Seeking Stable Clusters in the Blogosphere, to appear in VLDB 2007.

Nilesh Bansal, Nick Koudas, BlogScope: System for Online Analysis of High Volume Text Streams, to appear in VLDB 2007 (Demonstration Proposal).

slide41

THANK YOU. QUESTIONS?

Source: xkcd.com