15 826 multimedia databases and data mining
Download
1 / 42

15-826: Multimedia Databases and Data Mining - PowerPoint PPT Presentation


  • 113 Views
  • Uploaded on

15-826: Multimedia Databases and Data Mining. Addendum to Lecture #21: Independent Component Analysis (ICA) Jia-Yu Pan and Christos Faloutsos. Must-read Material.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' 15-826: Multimedia Databases and Data Mining' - onan


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
15 826 multimedia databases and data mining

15-826: Multimedia Databasesand Data Mining

Addendum to Lecture #21:

Independent Component Analysis (ICA)

Jia-Yu Pan and Christos Faloutsos

(c) C. Faloutsos and J-Y Pan (2012)


Must read material
Must-read Material

  • AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases, Jia-Yu Pan, Hiroyuki Kitagawa, Christos Faloutsos and Masafumi Hamamoto

    PAKDD 2004, Sydney, Australia

(c) C. Faloutsos and J-Y Pan (2012)


Outline
Outline

  • Motivation

  • Formulation

  • PCA and ICA

  • Example applications

  • Conclusion

(c) C. Faloutsos and J-Y Pan (2012)


Motivation q1 find patterns in data
Motivation: (Q1) Find patterns in data

  • Motion capture data: broad jumps

Left Knee

Energy exerted

Take-off

Right Knee

Landing

Energy exerted

(c) C. Faloutsos and J-Y Pan (2012)


Motivation q1 find patterns in data1

Human would say

Pattern 1: along diagonal

Pattern 2: along vertical axis

How to find these automatically?

Take-off

R:L=60:1

Landing

R:L=1:1

Motivation: (Q1) Find patterns in data

Each point is the measurement

at a time tick (total 550 points).

(c) C. Faloutsos and J-Y Pan (2012)


Motivation q2 find hidden variables
Motivation: (Q2) Find hidden variables

Hidden variables (=‘topics’ =

concepts)

Stock prices

Alcoa

American Express

“General trend”

Boeing

Citi Group

“Internet bubble”

(c) C. Faloutsos and J-Y Pan (2012)


Motivation q2 find hidden variables1

0.94

0.64

0.63

0.03

Motivation:(Q2) Find hidden variables

Caterpillar

Intel

“Internet bubble”

“General trend”

Hidden variables

(c) C. Faloutsos and J-Y Pan (2012)


Motivation q2 find hidden variables2
Motivation:(Q2) Find hidden variables

Caterpillar

Intel

?

B1,CAT

B2,INTC

B1,INTC

B2,CAT

?

?

Hidden variable 2

Hidden variable 1

(c) C. Faloutsos and J-Y Pan (2012)


Motivation find hidden variables
Motivation:Find hidden variables

  • There are two sound sources in a cocktail party…

=“blind source separation”

(= we don’t know the sources,

nor their mixing)

(c) C. Faloutsos and J-Y Pan (2012)


Outline1
Outline

  • Motivation

  • Formulation

  • PCA and ICA

  • Example applications

  • Conclusion

(c) C. Faloutsos and J-Y Pan (2012)


Formulation finding patterns
Formulation: Finding patterns

Given n data points,

each with m attributes.

Find patterns that describe

data properties the best.

(c) C. Faloutsos and J-Y Pan (2012)


Linear representation
Linear representation

  • Find vectors that describe the data set the best.

  • Each point: linear combination of the vectors (patterns):

(c) C. Faloutsos and J-Y Pan (2012)


Patterns as data vocabulary
Patterns as data “vocabulary”

Good pattern

≈ sparse coding

Only b1 is needed

to describe xi.

(Q) Given data xi’s,

compute hi,j’s and bi’s that are “sparse”?

(c) C. Faloutsos and J-Y Pan (2012)


Patterns in motion capture data

Left

Right

?

?

Data

matrix

Hidden

variables

Basis

vectors

Patterns in motion capture data

Sparse ~ non-Gaussian ~ “Independent”

b2

b1

n=550 ticks

“Independent”: e.g., minimize mutual information.

(c) C. Faloutsos and J-Y Pan (2012)


Patterns in motion capture data1

?

?

Data

matrix

Hidden

variables

Basis

vectors

Patterns in motion capture data

Sparse ~ non-Gaussian ~ “Independent”

X = (U L ) VT

(c) C. Faloutsos and J-Y Pan (2012)


Outline2
Outline

  • Motivation

  • Formulation

  • PCA and ICA

  • Example applications

    • Find topics in documents

    • Hidden variables in stock prices

  • Conclusion

(c) C. Faloutsos and J-Y Pan (2012)


Pattern discovery with ica autosplit pakdd 04 wiri 05
Pattern discovery with ICA: AutoSplit [PAKDD 04][WIRI 05]

(Q) Different

modalities

Video frames

Step 1:

Data points (matrix)

or

Stock prices

Step 2:

Compute patterns

(Q) What pattern?

or

Step 3:

Interpret patterns

(Q) How?

Text documents

Data mining

(Case studies)

(c) C. Faloutsos and J-Y Pan (2012)


Finding patterns in high dimensional data

details

Dimensionality

reduction

Finding patterns in high-dimensional data

PCA finds the hyperplane.

ICA finds the correct patterns.

(c) C. Faloutsos and J-Y Pan (2012)


Outline3
Outline

  • Motivation

  • Formulation

  • PCA and ICA

  • Example applications

    • Find topics in documents

    • Hidden variables in stock prices

    • Visual vocabulary for retinal images

  • Conclusion

(c) C. Faloutsos and J-Y Pan (2012)


Topic discovery on text streams
Topic discovery on text streams

  • Data: CNN headline news (Jan.-Jun. 1998)

  • Documents of 10 topics in one single text stream

    • Documents are sorted by date/time

    • Subsequent documents may have different topics

Topic 3

Topic 1

Topic 1

Date/Time

(c) C. Faloutsos and J-Y Pan (2012)


Topic discovery on text streams1
Topic discovery on text streams

  • Data: CNN headline news (Jan.-Jun. 1998)

  • Documents of 10 topics in one single text stream

    • FIND: the document boundaries

    • AND: the terms of each topic

Date/Time

(c) C. Faloutsos and J-Y Pan (2012)

??


Topic discovery on text streams2
Topic discovery on text streams

  • Known: number of topics = 10

  • Unknown: (1) topic of each document (2) topic description

Topic 3

Topic 1

Topic 1

Date/Time

(c) C. Faloutsos and J-Y Pan (2012)


Topic discovery in documents

Step 2

Step 3

aaron

animal

zoo

X[nxm’] = H[nxm’] B[m’xm’]

b’i = [0, 0.7, …, 0.6]

  • Find hyperplane (m’=10)

  • Find patterns

B’[10x3887]

Topic discovery in documents

Step 1

aaron

zoo

xi = [1, 5, …, 0]

Windowing

(n=1659)

X[nxm]

New stories

(30 words)

m=3887 (dictionary size)

(Q) What does b’i mean?

(c) C. Faloutsos and J-Y Pan (2012)


Step 3 interpret the patterns
Step 3: Interpret the patterns

aaron

animal

zoo

Top words: “animal”, “zoo”, …

b’i = [0, 0.7, …, 0.6]

A hidden topic!

m=3887 (dictionary size)

Topics

found

General idea: related to the data attributes

(c) C. Faloutsos and J-Y Pan (2012)


Step 3 evaluate the patterns
Step 3: Evaluate the patterns

AutoSplit finds correct topics.

(c) C. Faloutsos and J-Y Pan (2012)


Step 3 evaluate the patterns1
Step 3: Evaluate the patterns

AutoSplit’s topics are better than PCA.

(c) C. Faloutsos and J-Y Pan (2012)


Step 3 evaluate the patterns2
Step 3: Evaluate the patterns

Topic 1

Topic 2

PCA vectors mix the topics.

AutoSplit’s topics are better than PCA.

(c) C. Faloutsos and J-Y Pan (2012)


Outline4
Outline

  • Motivation

  • Formulation

  • PCA and ICA

  • Example applications

    • Find topics in documents

    • Hidden variables in stock prices

  • Conclusion

(c) C. Faloutsos and J-Y Pan (2012)


Find hidden variables djia stocks
Find hidden variables (DJIA stocks)

  • Weekly DJIA closing prices

    • 01/02/1990-08/05/2002, n=660 data points

    • A data point: prices of 29 companies at the time

Alcoa

American Express

Boeing

Caterpillar

Citi Group

(c) C. Faloutsos and J-Y Pan (2012)


Formulation find hidden variables
Formulation: Find hidden variables

AA1, …, XOM1

AAn, …, XOMn

H11, H12, …, H1m

Hn1, Hn2, …, Hnm

B11, B12, …, B1m

Bm1, Bm2, …, Bmm

?

=

?

Date

Hidden variable

Date

(c) C. Faloutsos and J-Y Pan (2012)


Characterize hidden variable by the companies it influences
Characterize hidden variable by the companies it influences

Caterpillar

Intel

B1,CAT

0.94

0.64

B2,INTC

0.63

0.03

B1,INTC

B2,CAT

“Internet bubble”

“General trend”

(c) C. Faloutsos and J-Y Pan (2012)


Companies related to hidden variable 1
Companies related to hidden variable 1

“General trend”

(c) C. Faloutsos and J-Y Pan (2012)


Companies related to hidden variable 11
Companies related to hidden variable 1

All companies are affected by the “general trend” variable (with weights 0.6~0.9), except AT&T.

(c) C. Faloutsos and J-Y Pan (2012)


General trend and outlier
General trend (and outlier)

“General trend”

AT&T

United Technologies

Walmart

Exxon Mobil

(c) C. Faloutsos and J-Y Pan (2012)


Companies related to hidden variable 2
Companies related to hidden variable 2

Techcompany

2000-2001 “Internet bubble”

(c) C. Faloutsos and J-Y Pan (2012)


Companies related to hidden variable 21
Companies related to hidden variable 2

Tech company

Companies affected by the “internet bubble” variable (with weights 0.5~0.6) are tech-related.

Other companies are un-related (weights < 0.15).

(c) C. Faloutsos and J-Y Pan (2012)


Outline5
Outline

  • Motivation

  • Formulation

  • PCA and ICA

  • Example applications

    • Find topics in documents

    • Hidden variables in stock prices

    • Visual vocabulary for retinal images

  • Conclusion

(c) C. Faloutsos and J-Y Pan (2012)


Conclusion
Conclusion

  • ICA: more flexible than PCA in finding patterns.

  • Many applications

    • Find topics and “vocabulary” for images

    • Find hidden variables in time series (e.g., stock prices)

    • Blind source separation

ICA

PCA

(c) C. Faloutsos and J-Y Pan (2012)


Citation
Citation

  • AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases, Jia-Yu Pan, Hiroyuki Kitagawa, Christos Faloutsos and Masafumi Hamamoto

    PAKDD 2004, Sydney, Australia

(c) C. Faloutsos and J-Y Pan (2012)


References
References

  • Jia-Yu Pan, Andre Guilherme Ribeiro Balan, Eric P. Xing, Agma Juci Machado Traina, and Christos Faloutsos. Automatic Mining of Fruit Fly Embryo Images. In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2006.

  • Arnab Bhattacharya, Vebjorn Ljosa, Jia-Yu Pan, Mark R. Verardo, Hyungjeong Yang, Christos Faloutsos, and Ambuj K. Singh. ViVo: Visual Vocabulary Construction for Mining Biomedical Images. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM), 2005.

  • Masafumi Hamamoto, Hiroyuki Kitagawa, Jia-Yu Pan, and Christos Faloutsos. A Comparative Study of Feature Vector-Based Topic Detection Schemes for Text Streams. In Proceedings of International Workshop on Challenges in Web Information Retrieval and Integration (WIRI), 2005, pp.125-130.

  • Jia-Yu Pan, Hiroyuki Kitagawa, Christos Faloutsos, and Masafumi Hamamoto. AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases. In Proceedings of the The Eighth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2004.

(c) C. Faloutsos and J-Y Pan (2012)


References1
References

  • Aapo Hyvärinen, Juha Karhunen, Erkki Oja: Independent Component Analysis, John Wiley & Sons, 2001

(c) C. Faloutsos and J-Y Pan (2012)


Software
Software

  • Open source software: ‘fastICA’ http://research.ics.tkk.fi/ica/fastica/

  • Or ‘autosplit’:

    www.cs.cmu.edu/~jypan/software/autosplit_cmu.tar.gz

(c) C. Faloutsos and J-Y Pan (2012)


ad