Scalable, Continuous Tracking of Tag Co-Occurrences between Short Sets using (Almost) Disjoint Tag P...

DBSocial 2013, New York - PowerPoint PPT Presentation



Scalable, Continuous Tracking of Tag Co-Occurrences between Short Sets using (Almost) Disjoint Tag Partitions. DBSocial 2013, New York. Motivation: enBlogue identifies emergent topics. Input: a stream of documents annotated with hash-tags (e.g. Tweets).

Presentation Transcript

Scalable, Continuous Tracking of Tag Co-Occurrences between Short Sets using (Almost) Disjoint Tag Partitions

DBSocial 2013, New York


Motivation: enBlogue (1)

  • enBlogue: Identifies emergent topics

  • Input: A stream of documents annotated with hash-tags (e.g. Tweets)

  • Restricts the focus to the more recent documents using a time sliding window


Motivation: enBlogue (2)

  • Tracks the correlation of co-occurring hash-tags over time

  • Reports on unexpected changes in the correlation

[Figure: correlation of co-occurring hash-tags plotted over time]


Jaccard Coefficient

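  • $T_i$: the set of ids of the documents annotated with tag $t_i$

  • Pair of tags: $J(t_1, t_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$

  • Set of n tags: $J(t_1, \dots, t_n) = \frac{|T_1 \cap \dots \cap T_n|}{|T_1 \cup \dots \cup T_n|}$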
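As a small illustration of the definition, the sketch below computes the Jaccard coefficient of a tag set as the size of the intersection of the per-tag document-id sets divided by the size of their union. The function name `jaccard`, the `docs_by_tag` mapping, and the sample data are assumptions made for this example; they do not come from the slides.

```python
from typing import Dict, Iterable, Set


def jaccard(tags: Iterable[str], docs_by_tag: Dict[str, Set[int]]) -> float:
    """Return |intersection| / |union| of the document-id sets of the given tags."""
    doc_sets = [docs_by_tag.get(tag, set()) for tag in tags]
    if not doc_sets:
        return 0.0
    union = set().union(*doc_sets)
    if not union:
        return 0.0
    intersection = set.intersection(*doc_sets)
    return len(intersection) / len(union)


# Hypothetical data: tag -> ids of the documents annotated with that tag.
docs_by_tag = {
    "a": {1, 2, 3},
    "b": {2, 3, 4},
    "c": {3, 4, 5},
}

pair_score = jaccard(["a", "b"], docs_by_tag)         # 2 shared docs, 4 docs in the union
triple_score = jaccard(["a", "b", "c"], docs_by_tag)  # 1 shared doc, 5 docs in the union
```

Storing the full document-id sets is only practical for illustration; a streaming system would keep counters instead of the sets themselves.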


Jaccard Coefficient Computation

  • Maintain counters for all subsets of co-occurring tags
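One way to picture this bookkeeping is sketched below: for every incoming document, a counter is incremented for each non-empty subset of its tags. The names `update_counters` and `max_size`, as well as the cap on subset size, are assumptions for this example (the cap reflects that the tag sets are short); the slides do not prescribe this exact scheme.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable


def update_counters(counters: Counter, doc_tags: Iterable[str], max_size: int = 3) -> None:
    """Increment the counter of every non-empty tag subset (up to max_size) of one document."""
    tags = sorted(set(doc_tags))  # canonical order, so each subset has a single key
    for size in range(1, min(max_size, len(tags)) + 1):
        for subset in combinations(tags, size):
            counters[subset] += 1


# Hypothetical stream of three documents, each given as its list of tags.
counters = Counter()
for doc in (["a", "b"], ["a", "b", "c"], ["b", "c"]):
    update_counters(counters, doc)
```

After the loop, `counters[("a", "b")]` holds the number of documents annotated with both `a` and `b`, that is, the size of the intersection for that pair.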


Inclusion – Exclusion Principle

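  • Compute the cardinality of the union of n sets from the cardinalities of the intersections of all non-empty subsets of those sets: $|T_1 \cup \dots \cup T_n| = \sum_{\emptyset \neq S \subseteq \{1, \dots, n\}} (-1)^{|S|+1} \left| \bigcap_{i \in S} T_i \right|$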
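The principle can be sketched in a few lines: intersections of odd-sized subsets are added, those of even-sized subsets are subtracted. The function names and the `intersection_count` dictionary below are assumptions for this example; the counts correspond to three hypothetical documents tagged {a, b}, {a, b, c} and {b, c}.

```python
from itertools import combinations
from typing import Dict, Iterable, Tuple


def union_size(tags: Iterable[str], intersection_count: Dict[Tuple[str, ...], int]) -> int:
    """Cardinality of the union of the tags' document sets via inclusion-exclusion."""
    ordered = sorted(set(tags))
    total = 0
    for size in range(1, len(ordered) + 1):
        sign = 1 if size % 2 == 1 else -1  # add odd-sized subsets, subtract even-sized ones
        for subset in combinations(ordered, size):
            total += sign * intersection_count.get(subset, 0)
    return total


def jaccard_from_counts(tags: Iterable[str], intersection_count: Dict[Tuple[str, ...], int]) -> float:
    """Jaccard coefficient computed from intersection counters only."""
    ordered = tuple(sorted(set(tags)))
    union = union_size(ordered, intersection_count)
    return intersection_count.get(ordered, 0) / union if union else 0.0


# Hypothetical intersection counters (sorted tag subset -> number of documents).
intersection_count = {
    ("a",): 2, ("b",): 3, ("c",): 2,
    ("a", "b"): 2, ("a", "c"): 1, ("b", "c"): 2,
    ("a", "b", "c"): 1,
}

union_abc = union_size(["a", "b", "c"], intersection_count)        # 7 - 5 + 1
jaccard_ab = jaccard_from_counts(["a", "b"], intersection_count)   # 2 / (2 + 3 - 2)
```

Only intersection counters are needed here; union sizes are derived on demand rather than stored.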


Inclusion – Exclusion Principle: Advantages

  • Needs to maintain fewer counters

  • Adapts more easily to changes in the load


Problem

  • For each subset of co-occurring tags we need:

    • Number of documents annotated with each tag

    • Number of documents annotated with all tags

  • A large number of co-occurring tag sets

  • New documents arrive quickly, changing the counts

Solution: Let multiple nodes compute the Jaccard coefficient for different tag sets


Outline

  • Motivation

    • enBlogue

    • Jaccard Coefficient

    • Inclusion – Exclusion Principle

    • Problem

  • Idea

    • Architecture

    • Partition Tags

    • Updating Counters

  • Results

    • Theoretical Results

    • Experimental Results

  • Conclusion


Architecture

Nodes computing the partitions

Nodes computing the Jaccard coefficients


Partition Tags: Requisites

  • Treat tag-sets as inseparable units

  • Minimise the overlap of single tags tracked by different nodes


Partition Tags: Algorithm

  • Phase 1: Create an initial assignment of the tags to the nodes

    • Max-k cover: selects k out of n sets that cover the maximum number of elements

  • Phase 2: Make sure all sets of tags are assigned to some node
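The two phases might be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: Phase 1 uses the standard greedy approximation of max-k cover to pick one seed tag set per node, and Phase 2 assigns every remaining tag set to the node where it adds the fewest new tags (ties go to the node holding fewer tags). The Phase 2 rule and all function names are assumptions for this example; the slides only state that every tag set must end up on some node.

```python
from typing import FrozenSet, List, Sequence, Set, Tuple


def greedy_max_k_cover(tag_sets: Sequence[FrozenSet[str]], k: int) -> List[int]:
    """Greedy approximation: repeatedly pick the set covering the most uncovered tags."""
    covered: Set[str] = set()
    remaining = list(range(len(tag_sets)))
    chosen: List[int] = []
    for _ in range(min(k, len(tag_sets))):
        best = max(remaining, key=lambda i: len(tag_sets[i] - covered))
        chosen.append(best)
        covered |= tag_sets[best]
        remaining.remove(best)
    return chosen


def partition_tag_sets(
    tag_sets: Sequence[FrozenSet[str]], k: int
) -> Tuple[List[List[FrozenSet[str]]], List[Set[str]]]:
    """Return (tag sets per node, single tags tracked per node)."""
    # Phase 1: each selected seed set opens one node.
    seeds = greedy_max_k_cover(tag_sets, k)
    node_sets = [[tag_sets[i]] for i in seeds]
    node_tags = [set(tag_sets[i]) for i in seeds]
    # Phase 2 (assumed rule): place each remaining set where it adds the fewest new tags.
    for i, tag_set in enumerate(tag_sets):
        if i in seeds:
            continue
        target = min(
            range(len(node_tags)),
            key=lambda n: (len(tag_set - node_tags[n]), len(node_tags[n])),
        )
        node_sets[target].append(tag_set)  # tag sets stay inseparable units
        node_tags[target] |= tag_set
    return node_sets, node_tags


# Hypothetical input: four tag sets distributed over two nodes.
tag_sets = [frozenset("ab"), frozenset("cd"), frozenset("abe"), frozenset("d")]
node_sets, node_tags = partition_tag_sets(tag_sets, k=2)
```

In this example the two nodes end up tracking {a, b, e} and {c, d}, so no single tag is replicated. Real tag-set collections generally cannot be split this cleanly, which is why the partitions are only almost disjoint.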


Partition Tags: Example

PHASE 1: MAX-2 COVER

PHASE 2: ASSIGNING REMAINING SETS


Update Counters


Finding Nodes

Inverted Index
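A minimal sketch of how an inverted index could be used to find the nodes responsible for an incoming document: the index maps each tag to the ids of the nodes tracking it, and a document is forwarded to every node that tracks at least one of its tags. The function names, the forwarding rule, and the sample partition are assumptions for this example; the slide itself only names the data structure.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, Sequence, Set


def build_inverted_index(node_tags: Sequence[Set[str]]) -> DefaultDict[str, Set[int]]:
    """Map each tag to the ids of the nodes that track it."""
    index: DefaultDict[str, Set[int]] = defaultdict(set)
    for node_id, tags in enumerate(node_tags):
        for tag in tags:
            index[tag].add(node_id)
    return index


def nodes_for_document(doc_tags: Iterable[str], index: DefaultDict[str, Set[int]]) -> Set[int]:
    """Return the ids of all nodes tracking at least one of the document's tags."""
    nodes: Set[int] = set()
    for tag in doc_tags:
        nodes |= index.get(tag, set())  # .get avoids creating entries for unknown tags
    return nodes


# Hypothetical partition: node 0 tracks {a, b, e}, node 1 tracks {c, d}.
index = build_inverted_index([{"a", "b", "e"}, {"c", "d"}])
targets = nodes_for_document(["a", "d"], index)
```

Here `targets` contains both node ids, because the document carries one tag from each partition; a document whose tags are all unknown is routed to no node.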



Theoretical Expectation

  • k partitions

  • v total tags (vocabulary)

  • m randomly selected tags per set

  • n total tag-sets


Theoretical Results


Real Data Experiments

  • Dataset: Tweets of 15th March 2013

  • Partitions: 10



Conclusion

  • An algorithm to compute the Jaccard coefficient for tag-sets in a massive data stream.

  • Applicable to all measures using intersection and/or unions of sets (e.g. Dice)

  • Results show little replication of tags across nodes

  • Load equally distributed to the nodes.
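To illustrate the point about other measures, the sketch below computes the Dice coefficient of a tag pair, $2|T_1 \cap T_2| / (|T_1| + |T_2|)$, from the same kind of intersection counters used for the Jaccard coefficient. The function name `dice` and the `counters` dictionary are assumptions for this example.

```python
from typing import Dict, Tuple


def dice(tag_a: str, tag_b: str, counters: Dict[Tuple[str, ...], int]) -> float:
    """Dice coefficient of a tag pair: 2 * |A and B| / (|A| + |B|)."""
    count_a = counters.get((tag_a,), 0)
    count_b = counters.get((tag_b,), 0)
    pair_key = tuple(sorted((tag_a, tag_b)))  # counters are keyed by sorted tag tuples
    both = counters.get(pair_key, 0)
    denominator = count_a + count_b
    return 2 * both / denominator if denominator else 0.0


# Hypothetical counters: 2 documents with "a", 3 with "b", 2 with both.
counters = {("a",): 2, ("b",): 3, ("a", "b"): 2}
dice_ab = dice("b", "a", counters)  # 2 * 2 / (2 + 3)
```

No additional counters are required, which is why the same partitioning scheme carries over to such measures.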


Thank you!

