DBSocial 2013, New York


Scalable, Continuous Tracking of Tag Co-Occurrences between Short Sets using (Almost) Disjoint Tag Partitions. DBSocial 2013, New York. Motivation: enBlogue (1). enBlogue identifies emergent topics. Input: a stream of documents annotated with hash-tags (e.g. Tweets).

Presentation Transcript

Scalable, Continuous Tracking of Tag Co-Occurrences between Short Sets using (Almost) Disjoint Tag Partitions

DBSocial 2013, New York

Motivation: enBlogue (1)
  • enBlogue: Identifies emergent topics
  • Input: A stream of documents annotated with hash-tags (e.g. Tweets)
  • Restricts the focus to the most recent documents using a sliding time window
Motivation: enBlogue (2)
  • Tracks the correlation of co-occurring hash-tags over time
  • Reports on unexpected changes in the correlation

[Figure: correlation plotted over time]

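Jaccard Coefficient
  • $T_i$: the set of ids of the documents annotated with tag $t_i$
  • Pair of tags: $J(t_1, t_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$
  • Set of n tags: $J(t_1, \dots, t_n) = \frac{|T_1 \cap \dots \cap T_n|}{|T_1 \cup \dots \cup T_n|}$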
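As a small illustration of the definition, the following sketch computes the Jaccard coefficient directly from sets of document ids. The function name `jaccard` and the example sets are hypothetical and not taken from the slides.

```python
def jaccard(*doc_sets):
    """Jaccard coefficient of one or more sets of document ids:
    size of the intersection divided by size of the union."""
    union = set().union(*doc_sets)
    if not union:
        # No documents at all: define the coefficient as 0 (an assumption).
        return 0.0
    intersection = set(doc_sets[0]).intersection(*doc_sets[1:])
    return len(intersection) / len(union)


# Hypothetical document-id sets for three tags.
T_a = {1, 2, 3, 4}
T_b = {3, 4, 5}
T_c = {4, 5, 6}

pair = jaccard(T_a, T_b)         # |{3, 4}| / |{1, 2, 3, 4, 5}|
triple = jaccard(T_a, T_b, T_c)  # |{4}| / |{1, 2, 3, 4, 5, 6}|
```

Storing the full id sets is only practical for small examples; the slides that follow work with counters instead.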
Jaccard Coefficient Computation
  • Maintain counters for all subsets of co-occurring tags
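One way to read this bullet is sketched below: for every arriving document, increment a counter for each non-empty subset of its tags, so that the counter of a subset holds the number of documents annotated with all tags in it. The names `update_counters` and `counters` are hypothetical, and expiring old documents from the sliding window is not covered here.

```python
from collections import Counter
from itertools import combinations


def update_counters(counters, tags):
    """Increment the counter of every non-empty subset of a document's tags.

    The number of subsets grows exponentially with the number of tags,
    so this is only reasonable for short tag sets.
    """
    tags = sorted(set(tags))
    for size in range(1, len(tags) + 1):
        for subset in combinations(tags, size):
            counters[frozenset(subset)] += 1


# Hypothetical stream of three documents.
counters = Counter()
for doc_tags in [["a", "b"], ["a", "b", "c"], ["a"]]:
    update_counters(counters, doc_tags)
```

After this run, `counters[frozenset({"a", "b"})]` is the number of documents carrying both `a` and `b`.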
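Inclusion – Exclusion Principle
  • Compute the cardinality of the union of n sets using the cardinalities of the intersections of all subsets of those sets:
    $|T_1 \cup \dots \cup T_n| = \sum_{\emptyset \neq S \subseteq \{1, \dots, n\}} (-1)^{|S|+1} \left| \bigcap_{i \in S} T_i \right|$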
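The principle can be sketched in a few lines, assuming the intersection cardinalities are available as a dictionary keyed by tag subsets. The names `union_size`, `jaccard_from_counters`, and the example counts are hypothetical.

```python
from itertools import combinations


def union_size(tags, intersection_count):
    """Size of the union of the tags' document sets via inclusion-exclusion.

    intersection_count maps a frozenset of tags to the number of documents
    annotated with all of them; missing subsets count as 0.
    """
    tags = sorted(set(tags))
    total = 0
    for size in range(1, len(tags) + 1):
        sign = 1 if size % 2 == 1 else -1  # odd-sized subsets add, even subtract
        for subset in combinations(tags, size):
            total += sign * intersection_count.get(frozenset(subset), 0)
    return total


def jaccard_from_counters(tags, intersection_count):
    """Jaccard coefficient computed from intersection counters only."""
    union = union_size(tags, intersection_count)
    if union == 0:
        return 0.0
    return intersection_count.get(frozenset(tags), 0) / union


# Hypothetical counts for three documents tagged {a, b}, {a, b, c}, {a}.
counts = {
    frozenset({"a"}): 3, frozenset({"b"}): 2, frozenset({"c"}): 1,
    frozenset({"a", "b"}): 2, frozenset({"a", "c"}): 1,
    frozenset({"b", "c"}): 1, frozenset({"a", "b", "c"}): 1,
}
union_abc = union_size(["a", "b", "c"], counts)  # 3 + 2 + 1 - 2 - 1 - 1 + 1
```

With these counts the union covers all three documents, and only one of them carries all three tags.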
Inclusion – Exclusion Principle: Advantages
  • Needs to maintain fewer counters
  • Adapts more easily to changes in the load
Problem
  • For each subset of co-occurring tags
    • Number of documents annotated with each tag
    • Number of documents annotated with all tags
  • A large number of co-occurring tag sets
  • New documents arrive fast, changing the counts

Solution: Let multiple nodes compute the Jaccard coefficient for different tag sets

Outline
  • Motivation
    • enBlogue
    • Jaccard Coefficient
    • Inclusion – Exclusion Principle
    • Problem
  • Idea
    • Architecture
    • Partition Tags
    • Updating Counters
  • Results
    • Theoretical Results
    • Experimental Results
  • Conclusion
Architecture

Nodes computing the partitions

Nodes computing the Jaccard coefficients

Partition Tags: Requisites
  • Treat tag-sets as inseparable units
  • Minimise the overlap of single tags tracked by different nodes
Partition Tags: Algorithm
  • Phase 1: Create an initial assignment of the tags to the nodes
    • Max-k cover: Selects k out of n sets that cover the maximum number of elements
  • Phase 2: Make sure all sets of tags are assigned to some node
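The two phases can be sketched as follows. This is not the authors' implementation: Phase 1 uses the standard greedy approximation for max-k cover, and the Phase 2 rule (send each remaining tag-set to the node that would gain the fewest new tags, breaking ties by load) is an assumption chosen to keep overlap small. The name `partition_tag_sets` and the example tag-sets are hypothetical.

```python
def partition_tag_sets(tag_sets, k):
    """Assign whole tag-sets to at most k nodes in two phases."""
    # Deduplicate while keeping the input order.
    remaining = list(dict.fromkeys(frozenset(s) for s in tag_sets))
    covered = set()
    nodes = []  # each node: {"tags": set of tags, "sets": list of tag-sets}

    # Phase 1 (greedy max-k cover): repeatedly pick the tag-set that
    # covers the most tags not yet covered, and seed one node with it.
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda s: len(s - covered))
        remaining.remove(best)
        covered |= best
        nodes.append({"tags": set(best), "sets": [best]})

    # Phase 2 (assumed rule): place each remaining tag-set on the node
    # where it introduces the fewest new tags; ties go to the lighter node.
    for s in remaining:
        target = min(nodes, key=lambda n: (len(s - n["tags"]), len(n["sets"])))
        target["tags"] |= s
        target["sets"].append(s)
    return nodes


example_sets = [{"a", "b", "c"}, {"c", "d"}, {"d", "e", "f"}, {"a", "b"}, {"e", "f"}]
nodes = partition_tag_sets(example_sets, k=2)
```

In this run only the tag `d` ends up on both nodes, which matches the "(almost) disjoint" goal in the title. Each tag-set stays on a single node, as the requisites demand.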

Partition Tags: Example

PHASE 1: MAX-2 COVER

PHASE 2: ASSIGNING REMAINING SETS

Finding nodes

Inverted Index
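The slide only names the data structure. A minimal sketch of how an inverted index from tags to nodes could be used to find the nodes that need to see a document is given below; the function names, node ids, and the routing rule (a document goes to every node tracking at least one of its tags) are assumptions.

```python
from collections import defaultdict


def build_inverted_index(node_tags):
    """Map each tag to the set of node ids that track it."""
    index = defaultdict(set)
    for node_id, tags in node_tags.items():
        for tag in tags:
            index[tag].add(node_id)
    return index


def nodes_for_document(index, doc_tags):
    """Node ids that track at least one of the document's tags."""
    targets = set()
    for tag in doc_tags:
        targets |= index.get(tag, set())
    return targets


# Hypothetical partition: node 0 and node 1 both track the tag "d".
index = build_inverted_index({0: {"a", "b", "c", "d"}, 1: {"d", "e", "f"}})
targets = nodes_for_document(index, ["a", "d"])
```

A lookup costs one dictionary access per tag of the document, independent of the number of nodes.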

Theoretical Expectation
  • k partitions
  • v total tags (vocabulary)
  • m randomly selected tags per set
  • n total tag-sets
Real Data Experiments
  • Dataset: Tweets of 15th March 2013
  • Partitions: 10
Conclusion
  • An algorithm to compute the Jaccard coefficient for tag-sets in a massive data stream.
  • Applicable to all measures using intersections and/or unions of sets (e.g. Dice).
  • Results show little replication of tags across nodes.
  • Load is distributed equally across the nodes.
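To illustrate the point about other measures, the sketch below derives both the Dice and the Jaccard coefficient of a tag pair from the same three counters. The function names and the example counts are hypothetical.

```python
def dice(count_a, count_b, count_both):
    """Dice coefficient: 2 * |A ∩ B| / (|A| + |B|)."""
    total = count_a + count_b
    return 2 * count_both / total if total else 0.0


def jaccard_pair(count_a, count_b, count_both):
    """Jaccard coefficient: |A ∩ B| / (|A| + |B| - |A ∩ B|)."""
    union = count_a + count_b - count_both
    return count_both / union if union else 0.0


# Hypothetical counters: 3 documents with tag a, 2 with tag b, 2 with both.
d = dice(3, 2, 2)
j = jaccard_pair(3, 2, 2)
```

Both functions read only per-tag and co-occurrence counts, so the same distributed counters can serve either measure.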