15 826 multimedia databases and data mining
This presentation is the property of its rightful owner.
Sponsored Links
1 / 42

15-826: Multimedia Databases and Data Mining PowerPoint PPT Presentation


  • 90 Views
  • Uploaded on
  • Presentation posted in: General

15-826: Multimedia Databases and Data Mining. Addendum to Lecture #21: Independent Component Analysis (ICA) Jia-Yu Pan and Christos Faloutsos. Must-read Material.

Download Presentation

15-826: Multimedia Databases and Data Mining

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


15 826 multimedia databases and data mining

15-826: Multimedia Databasesand Data Mining

Addendum to Lecture #21:

Independent Component Analysis (ICA)

Jia-Yu Pan and Christos Faloutsos

(c) C. Faloutsos and J-Y Pan (2012)


Must read material

Must-read Material

  • AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases, Jia-Yu Pan, Hiroyuki Kitagawa, Christos Faloutsos and Masafumi Hamamoto

    PAKDD 2004, Sydney, Australia

(c) C. Faloutsos and J-Y Pan (2012)


Outline

Outline

  • Motivation

  • Formulation

  • PCA and ICA

  • Example applications

  • Conclusion

(c) C. Faloutsos and J-Y Pan (2012)


Motivation q1 find patterns in data

Motivation: (Q1) Find patterns in data

  • Motion capture data: broad jumps

Left Knee

Energy exerted

Take-off

Right Knee

Landing

Energy exerted

(c) C. Faloutsos and J-Y Pan (2012)


Motivation q1 find patterns in data1

Human would say

Pattern 1: along diagonal

Pattern 2: along vertical axis

How to find these automatically?

Take-off

R:L=60:1

Landing

R:L=1:1

Motivation: (Q1) Find patterns in data

Each point is the measurement

at a time tick (total 550 points).

(c) C. Faloutsos and J-Y Pan (2012)


Motivation q2 find hidden variables

Motivation: (Q2) Find hidden variables

Hidden variables (=‘topics’ =

concepts)

Stock prices

Alcoa

American Express

“General trend”

Boeing

Citi Group

“Internet bubble”

(c) C. Faloutsos and J-Y Pan (2012)


Motivation q2 find hidden variables1

0.94

0.64

0.63

0.03

Motivation:(Q2) Find hidden variables

Caterpillar

Intel

“Internet bubble”

“General trend”

Hidden variables

(c) C. Faloutsos and J-Y Pan (2012)


Motivation q2 find hidden variables2

Motivation:(Q2) Find hidden variables

Caterpillar

Intel

?

B1,CAT

B2,INTC

B1,INTC

B2,CAT

?

?

Hidden variable 2

Hidden variable 1

(c) C. Faloutsos and J-Y Pan (2012)


Motivation find hidden variables

Motivation:Find hidden variables

  • There are two sound sources in a cocktail party…

=“blind source separation”

(= we don’t know the sources,

nor their mixing)

(c) C. Faloutsos and J-Y Pan (2012)


Outline1

Outline

  • Motivation

  • Formulation

  • PCA and ICA

  • Example applications

  • Conclusion

(c) C. Faloutsos and J-Y Pan (2012)


Formulation finding patterns

Formulation: Finding patterns

Given n data points,

each with m attributes.

Find patterns that describe

data properties the best.

(c) C. Faloutsos and J-Y Pan (2012)


Linear representation

Linear representation

  • Find vectors that describe the data set the best.

  • Each point: linear combination of the vectors (patterns):

(c) C. Faloutsos and J-Y Pan (2012)


Patterns as data vocabulary

Patterns as data “vocabulary”

Good pattern

≈ sparse coding

Only b1 is needed

to describe xi.

(Q) Given data xi’s,

compute hi,j’s and bi’s that are “sparse”?

(c) C. Faloutsos and J-Y Pan (2012)


Patterns in motion capture data

Left

Right

?

?

Data

matrix

Hidden

variables

Basis

vectors

Patterns in motion capture data

Sparse ~ non-Gaussian ~ “Independent”

b2

b1

n=550 ticks

“Independent”: e.g., minimize mutual information.

(c) C. Faloutsos and J-Y Pan (2012)


Patterns in motion capture data1

?

?

Data

matrix

Hidden

variables

Basis

vectors

Patterns in motion capture data

Sparse ~ non-Gaussian ~ “Independent”

X = (U L ) VT

(c) C. Faloutsos and J-Y Pan (2012)


Outline2

Outline

  • Motivation

  • Formulation

  • PCA and ICA

  • Example applications

    • Find topics in documents

    • Hidden variables in stock prices

  • Conclusion

(c) C. Faloutsos and J-Y Pan (2012)


Pattern discovery with ica autosplit pakdd 04 wiri 05

Pattern discovery with ICA: AutoSplit [PAKDD 04][WIRI 05]

(Q) Different

modalities

Video frames

Step 1:

Data points (matrix)

or

Stock prices

Step 2:

Compute patterns

(Q) What pattern?

or

Step 3:

Interpret patterns

(Q) How?

Text documents

Data mining

(Case studies)

(c) C. Faloutsos and J-Y Pan (2012)


Finding patterns in high dimensional data

details

Dimensionality

reduction

Finding patterns in high-dimensional data

PCA finds the hyperplane.

ICA finds the correct patterns.

(c) C. Faloutsos and J-Y Pan (2012)


Outline3

Outline

  • Motivation

  • Formulation

  • PCA and ICA

  • Example applications

    • Find topics in documents

    • Hidden variables in stock prices

    • Visual vocabulary for retinal images

  • Conclusion

(c) C. Faloutsos and J-Y Pan (2012)


Topic discovery on text streams

Topic discovery on text streams

  • Data: CNN headline news (Jan.-Jun. 1998)

  • Documents of 10 topics in one single text stream

    • Documents are sorted by date/time

    • Subsequent documents may have different topics

Topic 3

Topic 1

Topic 1

Date/Time

(c) C. Faloutsos and J-Y Pan (2012)


Topic discovery on text streams1

Topic discovery on text streams

  • Data: CNN headline news (Jan.-Jun. 1998)

  • Documents of 10 topics in one single text stream

    • FIND: the document boundaries

    • AND: the terms of each topic

Date/Time

(c) C. Faloutsos and J-Y Pan (2012)

??


Topic discovery on text streams2

Topic discovery on text streams

  • Known: number of topics = 10

  • Unknown: (1) topic of each document (2) topic description

Topic 3

Topic 1

Topic 1

Date/Time

(c) C. Faloutsos and J-Y Pan (2012)


Topic discovery in documents

Step 2

Step 3

aaron

animal

zoo

X[nxm’] = H[nxm’] B[m’xm’]

b’i = [0, 0.7, …, 0.6]

  • Find hyperplane (m’=10)

  • Find patterns

B’[10x3887]

Topic discovery in documents

Step 1

aaron

zoo

xi = [1, 5, …, 0]

Windowing

(n=1659)

X[nxm]

New stories

(30 words)

m=3887 (dictionary size)

(Q) What does b’i mean?

(c) C. Faloutsos and J-Y Pan (2012)


Step 3 interpret the patterns

Step 3: Interpret the patterns

aaron

animal

zoo

Top words: “animal”, “zoo”, …

b’i = [0, 0.7, …, 0.6]

A hidden topic!

m=3887 (dictionary size)

Topics

found

General idea: related to the data attributes

(c) C. Faloutsos and J-Y Pan (2012)


Step 3 evaluate the patterns

Step 3: Evaluate the patterns

AutoSplit finds correct topics.

(c) C. Faloutsos and J-Y Pan (2012)


Step 3 evaluate the patterns1

Step 3: Evaluate the patterns

AutoSplit’s topics are better than PCA.

(c) C. Faloutsos and J-Y Pan (2012)


Step 3 evaluate the patterns2

Step 3: Evaluate the patterns

Topic 1

Topic 2

PCA vectors mix the topics.

AutoSplit’s topics are better than PCA.

(c) C. Faloutsos and J-Y Pan (2012)


Outline4

Outline

  • Motivation

  • Formulation

  • PCA and ICA

  • Example applications

    • Find topics in documents

    • Hidden variables in stock prices

  • Conclusion

(c) C. Faloutsos and J-Y Pan (2012)


Find hidden variables djia stocks

Find hidden variables (DJIA stocks)

  • Weekly DJIA closing prices

    • 01/02/1990-08/05/2002, n=660 data points

    • A data point: prices of 29 companies at the time

Alcoa

American Express

Boeing

Caterpillar

Citi Group

(c) C. Faloutsos and J-Y Pan (2012)


Formulation find hidden variables

Formulation: Find hidden variables

AA1, …, XOM1

AAn, …, XOMn

H11, H12, …, H1m

Hn1, Hn2, …, Hnm

B11, B12, …, B1m

Bm1, Bm2, …, Bmm

?

=

?

Date

Hidden variable

Date

(c) C. Faloutsos and J-Y Pan (2012)


Characterize hidden variable by the companies it influences

Characterize hidden variable by the companies it influences

Caterpillar

Intel

B1,CAT

0.94

0.64

B2,INTC

0.63

0.03

B1,INTC

B2,CAT

“Internet bubble”

“General trend”

(c) C. Faloutsos and J-Y Pan (2012)


Companies related to hidden variable 1

Companies related to hidden variable 1

“General trend”

(c) C. Faloutsos and J-Y Pan (2012)


Companies related to hidden variable 11

Companies related to hidden variable 1

All companies are affected by the “general trend” variable (with weights 0.6~0.9), except AT&T.

(c) C. Faloutsos and J-Y Pan (2012)


General trend and outlier

General trend (and outlier)

“General trend”

AT&T

United Technologies

Walmart

Exxon Mobil

(c) C. Faloutsos and J-Y Pan (2012)


Companies related to hidden variable 2

Companies related to hidden variable 2

Techcompany

2000-2001 “Internet bubble”

(c) C. Faloutsos and J-Y Pan (2012)


Companies related to hidden variable 21

Companies related to hidden variable 2

Tech company

Companies affected by the “internet bubble” variable (with weights 0.5~0.6) are tech-related.

Other companies are un-related (weights < 0.15).

(c) C. Faloutsos and J-Y Pan (2012)


Outline5

Outline

  • Motivation

  • Formulation

  • PCA and ICA

  • Example applications

    • Find topics in documents

    • Hidden variables in stock prices

    • Visual vocabulary for retinal images

  • Conclusion

(c) C. Faloutsos and J-Y Pan (2012)


Conclusion

Conclusion

  • ICA: more flexible than PCA in finding patterns.

  • Many applications

    • Find topics and “vocabulary” for images

    • Find hidden variables in time series (e.g., stock prices)

    • Blind source separation

ICA

PCA

(c) C. Faloutsos and J-Y Pan (2012)


Citation

Citation

  • AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases, Jia-Yu Pan, Hiroyuki Kitagawa, Christos Faloutsos and Masafumi Hamamoto

    PAKDD 2004, Sydney, Australia

(c) C. Faloutsos and J-Y Pan (2012)


References

References

  • Jia-Yu Pan, Andre Guilherme Ribeiro Balan, Eric P. Xing, Agma Juci Machado Traina, and Christos Faloutsos. Automatic Mining of Fruit Fly Embryo Images. In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2006.

  • Arnab Bhattacharya, Vebjorn Ljosa, Jia-Yu Pan, Mark R. Verardo, Hyungjeong Yang, Christos Faloutsos, and Ambuj K. Singh. ViVo: Visual Vocabulary Construction for Mining Biomedical Images. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM), 2005.

  • Masafumi Hamamoto, Hiroyuki Kitagawa, Jia-Yu Pan, and Christos Faloutsos. A Comparative Study of Feature Vector-Based Topic Detection Schemes for Text Streams. In Proceedings of International Workshop on Challenges in Web Information Retrieval and Integration (WIRI), 2005, pp.125-130.

  • Jia-Yu Pan, Hiroyuki Kitagawa, Christos Faloutsos, and Masafumi Hamamoto. AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases. In Proceedings of the The Eighth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2004.

(c) C. Faloutsos and J-Y Pan (2012)


References1

References

  • Aapo Hyvärinen, Juha Karhunen, Erkki Oja: Independent Component Analysis, John Wiley & Sons, 2001

(c) C. Faloutsos and J-Y Pan (2012)


Software

Software

  • Open source software: ‘fastICA’ http://research.ics.tkk.fi/ica/fastica/

  • Or ‘autosplit’:

    www.cs.cmu.edu/~jypan/software/autosplit_cmu.tar.gz

(c) C. Faloutsos and J-Y Pan (2012)


  • Login