Efficient Visualization of Document Streams

Miha Grčar {miha.grcar@ijs.si}

Vid Podpečan

Matjaž Juršič

Prof. Dr. Nada Lavrač

Jozef Stefan Institute, Dept. of Knowledge Technologies

Ljubljana, Slovenia

Discovery Science, Canberra, October 2010


Outline

  • Motivation

  • Original algorithm

    • Document corpus visualization pipeline

  • Our modified algorithm

    • Visualization of document streams

  • Experiments (speed tests)

  • Conclusions and further work

DS 2010


Motivation

Visualization of Document Corpora

Motivation

Goal: Visualization of Document Streams

Diagram labels: Document stream, Outdated documents

Corpus Visualization Pipeline

Paulovich et al. (2006)

Pipeline diagram: Document corpus, Corpus preprocessing, k-means clustering, Stress majorization, Neighborhoods computation, Least-squares interpolation, Layout


Corpus Visualization Pipeline

  • Corpus preprocessing

    • Tokenization

    • Stop-word removal

    • Lemmatization

    • n-grams

    Result: sparse TF-IDF vectors in a high-dimensional space

  • k-means clustering

  • Stress majorization

  • Neighborhoods

  • Least-squares interpolation
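The preprocessing steps listed above can be illustrated with a minimal sketch. The stop-word list, the regular expression, and the function names are assumptions made for this example, and lemmatization is left out entirely; the slides do not say which tools the authors used.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the authors' list).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequencies(text, max_n=2):
    """Return TF counts of n-grams (n = 1..max_n) after stop-word removal.

    Lemmatization is omitted here; a fuller pipeline would map each token
    to its lemma before the n-grams are built.
    """
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


tf = term_frequencies("The visualization of document streams and document corpora")
```

The resulting counts are the TF part of a sparse vector; the IDF weighting is applied afterwards.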


Corpus Visualization Pipeline

  • Corpus preprocessing

  • k-means clustering

    • Iterative method

  • Stress majorization

  • Neighborhoods

  • Least-squares interpolation


Corpus Visualization Pipeline

  • Corpus preprocessing

  • k-means clustering

  • Stress majorization

    • Iterative method

    • High-dimensional → 2D

  • Neighborhoods

  • Least-squares interpolation


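Corpus Visualization Pipeline

  • Corpus preprocessing

  • k-means clustering

  • Stress majorization

  • Neighborhoods

  • Least-squares interpolation

    • $p_i = \frac{1}{|N_p|} \sum_{r \in N_p} r$

    • $p_i + \sum_{r \in N_p} \left(-\frac{1}{k}\right) r = (0, 0)$, $k = |N_p|$

    • $c_i = (x_i^*, y_i^*)$

    • $AX = B$, solved as $\arg\min_X \{\|AX - B\|^2\}$ with an iterative method

Matrix illustration: each point contributes a row of $A$ with 1 in its own column and $-1/k$ in the columns of its neighbors, matched by $(0, 0)$ in $B$. Each control point contributes a row with a single 1, matched by its coordinates in $B$, from $(x_1^*, y_1^*)$ to $(x_r^*, y_r^*)$. $X$ holds the unknown layout coordinates $(x_1, y_1), \ldots, (x_n, y_n)$.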
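A small numerical sketch may help to read the system above. It builds $A$ and $B$ for a toy chain of three points and solves $\arg\min_X \|AX - B\|^2$. The function names and the toy data are assumptions, and a dense direct solver (`numpy.linalg.lstsq`) stands in for the iterative method mentioned on the slide.

```python
import numpy as np


def build_system(neighbors, control_idx, control_xy):
    """Build A and B: one neighborhood row per point, one row per control point."""
    n, r = len(neighbors), len(control_idx)
    A = np.zeros((n + r, n))
    B = np.zeros((n + r, 2))
    for i, nbrs in enumerate(neighbors):
        A[i, i] = 1.0                      # p_i ...
        for j in nbrs:
            A[i, j] = -1.0 / len(nbrs)     # ... minus the mean of its neighbors
    for c, (i, xy) in enumerate(zip(control_idx, control_xy)):
        A[n + c, i] = 1.0                  # control point pinned to (x*, y*)
        B[n + c] = xy
    return A, B


def interpolate(neighbors, control_idx, control_xy):
    """Least-squares layout of all points given the control-point positions."""
    A, B = build_system(neighbors, control_idx, control_xy)
    X, *_ = np.linalg.lstsq(A, B, rcond=None)
    return X


# Toy chain 0 - 1 - 2; points 0 and 2 are control points placed at (0,0) and (2,0).
neighbors = [[1], [0, 2], [1]]
X = interpolate(neighbors, control_idx=[0, 2], control_xy=[(0.0, 0.0), (2.0, 0.0)])
```

Because the control-point rows are soft constraints in a least-squares problem, the control points themselves are pulled slightly towards their neighbors; the middle point ends up halfway between them.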


Stream Visualization Pipeline

Pipeline diagram: Document stream, Buffer (FIFO), Preprocessing, k-means clustering, Stress majorization, Neighborhoods computation, Least-squares interpolation, Outdated documents


Stream Visualization Pipeline

  • Preprocessing

    • TF-IDF weights

      • TF: the number of times the term occurs in the document

      • DF: the number of documents in the corpus containing the term

      • IDF: log(|D| / DF), where |D| is the number of documents in the corpus

    • IDF cannot be computed over an (infinite) real-time stream, because |D| and DF are never final

  • k-means clustering

  • Stress majorization

  • Neighborhoods

  • Least-squares interpolation
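One way to make the TF-IDF definitions above usable on a stream is to keep DF counts only for the documents currently in a FIFO buffer. The sketch below illustrates that idea; the class name `StreamTfIdf` and its methods are hypothetical, and the details may differ from the authors' implementation.

```python
import math
from collections import Counter, deque


class StreamTfIdf:
    """Keeps TF vectors in a FIFO buffer and DF counts for the buffered documents."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()   # TF vectors of the buffered documents, oldest first
        self.df = Counter()     # vocabulary: term -> number of buffered docs with it

    def add(self, tokens):
        """Add a document; evict the oldest one if the buffer overflows."""
        tf = Counter(tokens)
        self.buffer.append(tf)
        self.df.update(tf.keys())          # each distinct term counts once per doc
        if len(self.buffer) > self.capacity:
            outdated = self.buffer.popleft()
            for term in outdated:
                self.df[term] -= 1
                if self.df[term] == 0:
                    del self.df[term]
        return tf

    def tfidf(self, tf):
        """TF-IDF weights of a buffered document, with |D| = current buffer size."""
        n_docs = len(self.buffer)
        return {t: c * math.log(n_docs / self.df[t]) for t, c in tf.items()}


model = StreamTfIdf(capacity=2)
model.add(["a", "b"])
model.add(["a", "c"])
last = model.add(["c", "d"])    # evicts the first document
weights = model.tfidf(last)
```

Storing plain TF vectors means that the TF-IDF weights can be recomputed whenever the DF values change, without reprocessing the text.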


Stream Visualization Pipeline

  • Preprocessing

    Diagram labels: TF vectors, vocabulary DF values, TF-IDF vector

  • k-means clustering

  • Stress majorization

  • Neighborhoods

  • Least-squares interpolation


Stream Visualization Pipeline

  • Preprocessing

  • k-means clustering

    • Warm start!

  • Stress majorization

  • Neighborhoods

  • Least-squares interpolation
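A warm start for k-means can be sketched as follows: after the buffer changes, clustering restarts from the previous centroids instead of a fresh initialization. The sketch uses plain Lloyd iterations with Euclidean distance on toy 2D data; both the distance measure and the function name `kmeans` are assumptions, not details taken from the slides.

```python
import numpy as np


def kmeans(X, centroids, max_iter=100):
    """Lloyd's algorithm started from the given centroids.

    Returns (centroids, labels, number of iterations performed).
    """
    centroids = np.array(centroids, dtype=float)
    for it in range(1, max_iter + 1):
        # Squared distance of every point to every centroid.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new = centroids.copy()
        for c in range(len(centroids)):
            members = X[labels == c]
            if len(members):               # keep an empty cluster's old centroid
                new[c] = members.mean(axis=0)
        if np.allclose(new, centroids):
            return new, labels, it
        centroids = new
    return centroids, labels, max_iter


# Initial buffer content: two well-separated groups.
X1 = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
c1, labels1, _ = kmeans(X1, X1[[0, 2]])

# The oldest point leaves the buffer, a new one arrives; reuse c1 as the start.
X2 = np.array([[0.0, 1.0], [10.0, 10.0], [10.0, 11.0], [0.0, 2.0]])
c2, labels2, _ = kmeans(X2, c1)
```

When only a small batch enters and leaves the buffer, the previous centroids are usually already close to a good solution, which is the motivation for warm starting.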


Stream Visualization Pipeline

  • Preprocessing

  • k-means clustering

  • Stress majorization

    • Warm start!

  • Neighborhoods

  • Least-squares interpolation
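The same warm-start idea applies to stress majorization: the previous 2D layout is used as the starting configuration. Below is a minimal sketch of the standard SMACOF update with unit weights; the function names and the toy data are assumptions, and the authors' implementation may differ.

```python
import numpy as np


def pairwise(X):
    """Euclidean distance matrix of the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def stress(X, delta):
    """Sum over pairs i < j of (d_ij(X) - delta_ij)^2."""
    return ((pairwise(X) - delta) ** 2).sum() / 2.0


def smacof(delta, X0, n_iter=100, tol=1e-9):
    """Stress majorization with unit weights, started from the layout X0."""
    X = np.array(X0, dtype=float)
    n = len(X)
    prev = stress(X, delta)
    for _ in range(n_iter):
        d = pairwise(X)
        with np.errstate(divide="ignore", invalid="ignore"):
            ratio = np.where(d > 0, delta / d, 0.0)
        B = -ratio
        B[np.diag_indices(n)] = ratio.sum(axis=1)
        X = B @ X / n                      # Guttman transform
        cur = stress(X, delta)
        if prev - cur < tol:
            break
        prev = cur
    return X


# Target distances taken from a known 2D configuration.
target = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
delta = pairwise(target)

# Warm start: a slightly outdated layout serves as the initial configuration.
X0 = target + np.array([[0.1, -0.05], [-0.1, 0.05], [0.05, 0.1]])
X = smacof(delta, X0)
```

Each update does not increase the stress, so a starting layout that is already close to a good one needs few iterations.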


Stream Visualization Pipeline

  • Preprocessing

  • k-means clustering

  • Stress majorization

  • Neighborhoods

    • Remove outdated instances

    • Add new instances

  • Least-squares interpolation


Stream Visualization Pipeline

  • Preprocessing

  • k-means clustering

  • Stress majorization

  • Neighborhoods

  • Least-squares interpolation

    • Remove outdated instances

    • Add new instances

    • Warm start!

Matrix illustration: the linear system over the points $(x_1, y_1), \ldots, (x_n, y_n)$, with right-hand-side entries $(0, 0)$ for the neighborhood rows and the control-point coordinates $(x_1^*, y_1^*), \ldots, (x_r^*, y_r^*)$ for the control-point rows.
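To illustrate what a warm start means for an iterative least-squares solver, the sketch below runs CGLS (conjugate gradient on the normal equations) from a given initial guess. CGLS is chosen only for this example, since the slides say "iterative method" without naming one. The 5 × 3 system is a toy chain of three points with two control points, restricted to the x-coordinate.

```python
import numpy as np


def cgls(A, b, x0, tol=1e-10, max_iter=100):
    """Minimize ||Ax - b||^2 by conjugate gradient on the normal equations.

    Starts from x0 and returns (x, number of iterations performed).
    """
    x = np.array(x0, dtype=float)
    r = b - A @ x
    s = A.T @ r
    p = s.copy()
    gamma = s @ s
    for it in range(max_iter):
        if np.sqrt(gamma) < tol:           # gradient small enough: converged
            return x, it
        q = A @ p
        alpha = gamma / (q @ q)
        x += alpha * p
        r -= alpha * q
        s = A.T @ r
        gamma_new = s @ s
        p = s + (gamma_new / gamma) * p
        gamma = gamma_new
    return x, max_iter


# Neighborhood rows (1 on the diagonal, -1/k for neighbors) and two control rows.
A = np.array([[1.0, -1.0, 0.0],
              [-0.5, 1.0, -0.5],
              [0.0, -1.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
b = np.array([0.0, 0.0, 0.0, 0.0, 2.0])   # control points at x = 0 and x = 2

x_cold, it_cold = cgls(A, b, np.zeros(3))  # cold start from the origin
x_warm, it_warm = cgls(A, b, x_cold)       # warm start from a previous solution
```

In the streaming setting, the previous layout plays the role of `x_cold`: rows and columns of outdated points are dropped, new ones are added, and the solver continues from coordinates that are already nearly right.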


Speed Tests

  • First 30,000 news items from Reuters Corpus Vol. 1 (“natural” rate: 1.4 news items / minute)

  • Experimental setting

    • Maximum rate?

    • 10 news items in a batch (u = 10)

    • Buffer capacity: nQ = 5,000 news items

    • 100 control points, 30 + 30 neighbors


Speed Tests

Processing delay: ~9 sec (+ 4 sec to form a batch)

Exit delay: ~4 sec

Exit frequency: ~1/4 batches per sec (2.5 docs / sec)

Pipeline diagram: Document stream, Buffer (FIFO), Preprocessing, k-means clustering, Stress majorization, Neighborhoods computation, Least-squares interpolation, Outdated documents
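The figures above fit together arithmetically, as the short calculation below shows. The variable names are chosen for this example; the numbers are the ones reported on the slides (u = 10, one batch leaving roughly every 4 seconds, a natural rate of 1.4 news items per minute).

```python
batch_size = 10                 # u: documents per batch
exit_delay_s = 4.0              # one batch leaves the pipeline roughly every 4 s
processing_delay_s = 9.0        # time a batch spends in the pipeline
natural_rate_per_min = 1.4      # arrival rate of the Reuters news stream

throughput = batch_size / exit_delay_s            # documents per second
batch_forming_s = batch_size / throughput         # time to collect one batch
total_delay_s = processing_delay_s + batch_forming_s

natural_rate = natural_rate_per_min / 60.0        # documents per second
headroom = throughput / natural_rate              # sustained rate vs. natural rate
```

Under these numbers the pipeline sustains 2.5 documents per second, about two orders of magnitude above the natural arrival rate of the corpus, with roughly 13 seconds between a document's arrival and its appearance in the layout.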


Conclusions and Further Work

  • Conclusions

    • Efficient online distance-preserving document stream visualization technique (2.5 docs / sec, 5 parallel processes)

    • Tricks: warm start, pipelining, parallelization

  • Further work

    • Performance at different buffer capacities (nQ) and batch sizes (u)?

    • Optimize k-means (done!) and k-NN (easy)

    • Find use cases, perform user studies

      • Decision making in financial domain (FIRST)

      • Press clipping (media monitoring)
