Efficient visualization of document streams

Efficient Visualization of Document Streams. Miha Grčar {miha.grcar@ijs.si}, Vid Podpečan, Matjaž Juršič, Prof. Dr. Nada Lavrač. Jozef Stefan Institute, Dept. of Knowledge Technologies, Ljubljana, Slovenia. Discovery Science, Canberra, October 2010.


Efficient Visualization of Document Streams

Miha Grčar {miha.grcar@ijs.si}

Vid Podpečan

Matjaž Juršič

Prof. Dr. Nada Lavrač

Jozef Stefan Institute, Dept. of Knowledge Technologies

Ljubljana, Slovenia

Discovery Science, Canberra, October 2010


Outline

  • Motivation

  • Original algorithm

    • Document corpus visualization pipeline

  • Our modified algorithm

    • Visualization of document streams

  • Experiments (speed tests)

  • Conclusions and further work

DS 2010


Motivation: Visualization of Document Corpora


Motivation
Goal: Visualization of Document Streams

[Figure labels: document stream, outdated documents]


Corpus Visualization Pipeline

Paulovich et al. (2006)

[Figure: pipeline from document corpus to layout, with the stages corpus preprocessing, k-means clustering, stress majorization, neighborhoods computation, and least-squares interpolation]


Corpus Visualization Pipeline

  • Corpus preprocessing

    • Tokenization

    • Stop-word removal

    • Lemmatization

    • n-grams

    • Result: sparse TF-IDF vectors in a high-dimensional space

  • k-means clustering

  • Stress majorization

  • Neighborhoods

  • Least-squares interpolation
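The preprocessing steps listed on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the stop-word list and the function names `tokenize` and `preprocess` are hypothetical, and lemmatization is left out because it needs a language-specific dictionary.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list; a real system would use a full one.
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}

def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def preprocess(text, max_n=2):
    """Return term counts (unigrams up to max_n-grams) for one document.

    Lemmatization is omitted here; it would map each token to its
    dictionary form before the n-grams are built.
    """
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    terms = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            terms.append(" ".join(tokens[i:i + n]))
    return Counter(terms)

tf = preprocess("The visualization of document streams")
```

Here `tf` holds three unigrams and two bigrams, each with count 1. These counts are the TF part of a sparse vector; the IDF weighting is applied afterwards.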


Corpus Visualization Pipeline

  • Corpus preprocessing

  • k-means clustering

    • Iterative method

  • Stress majorization

  • Neighborhoods

  • Least-squares interpolation


Corpus Visualization Pipeline

  • Corpus preprocessing

  • k-means clustering

  • Stress majorization

    • Iterative method

    • High-dimensional → 2D

  • Neighborhoods

  • Least-squares interpolation
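To illustrate why stress majorization is iterative, the sketch below repeats the standard Guttman transform with unit weights until the stress stops decreasing. It is an assumption-laden toy, not the code used in the talk: `delta` stands for the pairwise distances measured in the high-dimensional space, `X` is the current 2D layout, and the names `stress` and `majorize` are illustrative.

```python
import math

def stress(delta, X):
    """Sum of squared differences between target and layout distances."""
    n = len(X)
    return sum((delta[i][j] - math.dist(X[i], X[j])) ** 2
               for i in range(n) for j in range(i + 1, n))

def majorize(delta, X, max_iter=200, eps=1e-9):
    """Stress majorization with unit weights (repeated Guttman transform)."""
    n = len(X)
    X = [tuple(p) for p in X]
    prev = stress(delta, X)
    for _ in range(max_iter):
        new_X = []
        for i in range(n):
            x = y = 0.0
            for j in range(n):
                if i == j:
                    continue
                d = math.dist(X[i], X[j])
                if d > 0:  # coincident points contribute nothing
                    ratio = delta[i][j] / d
                    x += ratio * (X[i][0] - X[j][0])
                    y += ratio * (X[i][1] - X[j][1])
            new_X.append((x / n, y / n))
        X = new_X
        cur = stress(delta, X)
        if prev - cur < eps:  # stop once the stress no longer improves
            break
        prev = cur
    return X

# Target distances of a 3-4-5 triangle, starting from an arbitrary layout.
delta = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
start = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
layout = majorize(delta, start)
```

Because the starting layout is an argument, the same function can be restarted from a previous layout instead of an arbitrary one.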


Corpus Visualization Pipeline

  • Corpus preprocessing

  • k-means clustering

  • Stress majorization

  • Neighborhoods

  • Least-squares interpolation
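The neighborhoods stage can be pictured as a nearest-neighbor search over the sparse document vectors. The sketch below is a brute-force version under two assumptions that the slides do not state: vectors are stored as `{term: weight}` dictionaries, and similarity is measured with the cosine. The function names are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def nearest_neighbors(vectors, k):
    """Brute-force k-NN: for each vector, the indices of its k most similar others."""
    result = []
    for i, u in enumerate(vectors):
        scored = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        # Most similar first; ties are broken by the lower index.
        scored.sort(key=lambda s: (-s[0], s[1]))
        result.append([j for _, j in scored[:k]])
    return result

docs = [
    {"stream": 1.0, "document": 1.0},
    {"stream": 1.0, "document": 1.0, "layout": 1.0},
    {"cluster": 1.0},
    {"cluster": 1.0, "layout": 1.0},
]
neighbors = nearest_neighbors(docs, k=1)
```

With these four toy documents, `neighbors` is `[[1], [0], [3], [2]]`. The brute-force search is quadratic in the number of documents, so it is only meant to show what the stage computes.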


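Corpus Visualization Pipeline

  • Corpus preprocessing

  • k-means clustering

  • Stress majorization

  • Neighborhoods

  • Least-squares interpolation

    • $p_i = \frac{1}{|N_p|} \sum_{r \in N_p} r$

    • $p_i + \sum_{r \in N_p} \left(-\frac{1}{k}\right) r = (0, 0)$, $k = |N_p|$

    • $c_i = (x_i^*, y_i^*)$

    • $AX = B$, solved as $\arg\min_X \|AX - B\|^2$

    • Iterative method

[Figure: matrix form of $AX = B$. Each of the first $n$ rows of $A$ has a 1 on the diagonal and $-1/k$ in the columns of that point's neighbors; each of the remaining rows has a single 1 in the column of a control point. $X = ((x_1, y_1), \ldots, (x_n, y_n))^T$ and $B = ((0, 0), \ldots, (0, 0), (x_1^*, y_1^*), \ldots, (x_r^*, y_r^*))^T$]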
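The system on this slide can be assembled and solved in plain Python as follows. This is a sketch under stated assumptions: `neighbors[i]` lists the neighbor indices of point i, `control` maps a control-point index to its fixed 2D position, and the solver is a textbook conjugate-gradient method on the normal equations (CGLS), chosen only because the slide mentions an iterative method. The slides do not say which solver was used.

```python
def build_system(neighbors, control):
    """Build A and B for AX = B (neighbor rows first, then control-point rows)."""
    n = len(neighbors)
    A, B = [], []
    for i, nbrs in enumerate(neighbors):
        row = [0.0] * n
        row[i] = 1.0
        for j in nbrs:
            row[j] = -1.0 / len(nbrs)  # -1/k with k = |N_p|
        A.append(row)
        B.append((0.0, 0.0))
    for i, pos in control.items():
        row = [0.0] * n
        row[i] = 1.0
        A.append(row)
        B.append(tuple(pos))
    return A, B

def cgls(A, b, tol=1e-18, max_iter=1000):
    """Minimize ||Ax - b||^2 with conjugate gradients on the normal equations."""
    m, n = len(A), len(A[0])

    def matvec(v):   # A v
        return [sum(a * c for a, c in zip(row, v)) for row in A]

    def rmatvec(u):  # A^T u
        return [sum(A[i][j] * u[i] for i in range(m)) for j in range(n)]

    x = [0.0] * n
    r = list(b)               # residual b - A x for the start x = 0
    s = rmatvec(r)            # A^T r, the negative gradient up to a factor
    p = s[:]
    gamma = sum(v * v for v in s)
    for _ in range(max_iter):
        if gamma <= tol:      # squared gradient norm is small enough
            break
        q = matvec(p)
        alpha = gamma / sum(v * v for v in q)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        s = rmatvec(r)
        gamma_new = sum(v * v for v in s)
        p = [si + (gamma_new / gamma) * pi for si, pi in zip(s, p)]
        gamma = gamma_new
    return x

def interpolate(neighbors, control):
    """Solve for the x and y coordinates separately and pair them up."""
    A, B = build_system(neighbors, control)
    xs = cgls(A, [b[0] for b in B])
    ys = cgls(A, [b[1] for b in B])
    return list(zip(xs, ys))

# Three points in a chain; points 0 and 2 are control points.
layout = interpolate([[1], [0, 2], [1]], {0: (0.0, 0.0), 2: (4.0, 2.0)})
```

For this chain the result is approximately `[(1.0, 0.5), (2.0, 1.0), (3.0, 1.5)]`. The control points are pulled toward their targets rather than pinned to them, because the control-point rows are ordinary equations in the least-squares problem.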


Stream Visualization Pipeline

[Figure: pipeline with the stages preprocessing, k-means clustering, stress majorization, neighborhoods computation, and least-squares interpolation; further labels: document stream, buffer (FIFO), outdated documents]
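The FIFO buffer in this pipeline can be modelled with a bounded queue that hands back whatever it pushes out. The class below is a hypothetical sketch (the name `DocumentBuffer` and its method are not from the talk); it only shows how a new batch produces a set of outdated documents.

```python
from collections import deque

class DocumentBuffer:
    """Fixed-capacity FIFO buffer; returns the documents it pushes out."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.docs = deque()

    def add_batch(self, batch):
        """Append a batch and return the outdated documents that fall out."""
        self.docs.extend(batch)
        outdated = []
        while len(self.docs) > self.capacity:
            outdated.append(self.docs.popleft())
        return outdated

buffer = DocumentBuffer(capacity=5)
buffer.add_batch(["d1", "d2", "d3"])
outdated = buffer.add_batch(["d4", "d5", "d6", "d7"])
```

After the second batch, `outdated` is `["d1", "d2"]` and the buffer holds `d3` to `d7`. The later stages would then remove the outdated documents from their data structures and add the new ones.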


Stream Visualization Pipeline

  • Preprocessing

    • TF-IDF weights

      • TF: the number of times the term occurs in the document

      • DF: the number of documents in the corpus containing the term

      • IDF: log(|D| / DF)

    • Not possible to compute IDF from (infinite) real-time streams

  • k-means clustering

  • Stress majorization

  • Neighborhoods

  • Least-squares interpolation
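As a worked example of the definitions above, the following sketch computes TF-IDF weights for a small, finite corpus. The natural logarithm is an assumption (the slide does not name a base), and `tf_idf` is an illustrative name.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """corpus: list of token lists. Returns one {term: weight} dict per document."""
    tfs = [Counter(tokens) for tokens in corpus]
    # DF: iterating a Counter yields each distinct term once per document.
    df = Counter(term for tf in tfs for term in tf)
    n_docs = len(corpus)
    return [{t: c * math.log(n_docs / df[t]) for t, c in tf.items()}
            for tf in tfs]

corpus = [["stream", "stream", "layout"], ["stream", "cluster"]]
vectors = tf_idf(corpus)
```

The term "stream" occurs in both documents, so its IDF is log(2 / 2) = 0 and its weight vanishes, while "layout" gets 1 · log(2 / 1). The computation needs |D| and DF for the whole corpus, which is exactly what an unbounded stream cannot provide.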


Stream Visualization Pipeline

  • Preprocessing

  • k-means clustering

  • Stress majorization

  • Neighborhoods

  • Least-squares interpolation

[Figure labels: several TF vectors, vocabulary with DF values, TF-IDF vector]
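One plausible reading of this figure, offered here as an assumption rather than as the authors' design, is that the buffer keeps raw TF vectors while a vocabulary table tracks DF values for the buffered documents only, so that TF-IDF vectors can be derived on demand. The class below sketches that idea; `StreamingTfIdf` and its methods are hypothetical names.

```python
import math
from collections import Counter, deque

class StreamingTfIdf:
    """Keeps TF vectors in a FIFO buffer and DF values for the buffer contents."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()  # one TF vector (Counter) per buffered document
        self.df = Counter()    # term -> number of buffered documents containing it

    def add(self, tokens):
        """Add a document; drop the oldest one if the buffer overflows."""
        tf = Counter(tokens)
        self.buffer.append(tf)
        self.df.update(tf.keys())
        if len(self.buffer) > self.capacity:
            old = self.buffer.popleft()
            for term in old:
                self.df[term] -= 1
                if self.df[term] == 0:
                    del self.df[term]  # term no longer occurs in the buffer

    def vector(self, tf):
        """TF-IDF vector of a buffered document, using the current DF values."""
        n_docs = len(self.buffer)
        return {t: c * math.log(n_docs / self.df[t]) for t, c in tf.items()}

model = StreamingTfIdf(capacity=2)
model.add(["stream", "layout"])
model.add(["stream"])
model.add(["cluster"])  # pushes out the first document
weights = model.vector(model.buffer[0])
```

After the third document arrives, "layout" has left the vocabulary and "stream" is back to a DF of 1, so `weights` is `{"stream": log(2)}`. Updating DF costs time proportional to the size of the entering and leaving documents, independent of the buffer size.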


Stream Visualization Pipeline

  • Preprocessing

  • k-means clustering

    • Warm start!

  • Stress majorization

  • Neighborhoods

  • Least-squares interpolation
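The effect of a warm start on k-means can be seen on a toy example. The sketch below uses plain Lloyd iterations on 2D points with Euclidean distance, which is a simplification chosen for readability (the documents in the talk are high-dimensional TF-IDF vectors); the function name and the data are illustrative.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm. Returns (centroids, labels, iterations used)."""
    centroids = [tuple(c) for c in centroids]
    labels = None
    for iteration in range(1, max_iter + 1):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments are stable: converged
            return centroids, labels, iteration
        labels = new_labels
        # Update step: each centroid moves to the mean of its points.
        for j in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels, max_iter

# Buffer before the update, clustered from a cold start (first two points).
old = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
old_centroids, _, _ = kmeans(old, old[:2])

# One document leaves, one arrives.
new = [(1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
_, _, cold_iters = kmeans(new, new[:2])          # cold start
_, _, warm_iters = kmeans(new, old_centroids)    # warm start
```

On this data the cold start needs 3 iterations and the warm start 2, because the previous centroids are already close to the new solution when only a small batch changes.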


Stream Visualization Pipeline

  • Preprocessing

  • k-means clustering

  • Stress majorization

    • Warm start!

  • Neighborhoods

  • Least-squares interpolation


Stream Visualization Pipeline

  • Preprocessing

  • k-means clustering

  • Stress majorization

  • Neighborhoods

    • Remove outdated instances

    • Add new instances

  • Least-squares interpolation


Stream Visualization Pipeline

  • Preprocessing

  • k-means clustering

  • Stress majorization

  • Neighborhoods

  • Least-squares interpolation

    • Remove outdated instances

    • Add new instances

    • Warm start!

[Figure: matrix form of the system before and after removing outdated instances and adding new ones; unknowns $(x_1, y_1), \ldots, (x_n, y_n)$, right-hand side with $(0, 0)$ rows and control-point rows $(x_1^*, y_1^*), \ldots, (x_r^*, y_r^*)$]


Speed Tests

  • First 30,000 news items from Reuters Corpus Volume 1 (“natural” rate: 1.4 news items / minute)

  • Experimental setting

    • Maximum rate?

    • 10 news items in a batch (u = 10)

    • Buffer capacity: nQ = 5,000 news items

    • 100 control points, 30 + 30 neighbors




Speed Tests

Processing delay: ~9 sec + 4 sec to form a batch

Exit delay: ~4 sec

Exit frequency: ~1/4 batches per sec (2.5 docs / sec)

[Figure: stream visualization pipeline with the stages preprocessing, k-means clustering, stress majorization, neighborhoods computation, and least-squares interpolation; further labels: document stream, buffer (FIFO), outdated documents]
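The figures on this slide are related by simple arithmetic, reproduced below with the numbers from the speed-test slides (batch size u = 10, buffer capacity nQ = 5,000, natural rate 1.4 news items per minute). The variable names are illustrative, and the derived quantities are back-of-the-envelope values rather than measurements reported in the talk.

```python
batch_size = 10            # u, documents per batch
exit_interval = 4.0        # seconds between batches leaving the pipeline
processing_delay = 9.0     # seconds spent inside the pipeline
batch_forming_delay = 4.0  # seconds needed to collect one batch
buffer_capacity = 5000     # nQ, documents kept in the buffer
natural_rate = 1.4 / 60    # documents per second in the original news feed

# Throughput: one batch of 10 documents every 4 seconds.
docs_per_sec = batch_size / exit_interval

# Delay from a document's arrival until it appears in the layout.
total_delay = processing_delay + batch_forming_delay

# How much faster than the natural rate the pipeline can run.
speedup = docs_per_sec / natural_rate

# Time until the whole buffer has been replaced at maximum rate.
turnover_sec = buffer_capacity / docs_per_sec
```

This gives 2.5 documents per second, a total delay of about 13 seconds, a rate roughly 107 times the natural one, and a full buffer turnover in 2,000 seconds.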




Conclusions and Further Work

  • Conclusions

    • Efficient online distance-preserving document stream visualization technique (2.5 docs / sec, 5 parallel processes)

    • Tricks: warm start, pipelining, parallelization

  • Further work

    • Performance at different nQ and u?

    • Optimize k-means (done!) and k-NN (easy)

    • Find use cases, perform user studies

      • Decision making in financial domain (FIRST)

      • Press clipping (media monitoring)


