Non-Negative Matrix Factorization for Statistical Analysis Stan Young Paul Fogel, Doug Hawkins NISS MPC Vienna 11July 20

Non-Negative Matrix Factorization for Statistical Analysis
Stan Young, Paul Fogel, Doug Hawkins
NISS
MPC Vienna, 11 July 2007
Outline
  • Introduction
  • Robust singular value decomposition
  • Non-negative matrix factorization
  • Inference with NMF

Data Blocks (Zoo)

  • 3-Way
  • PCA
  • Linear Regression
  • Canonical Correlation
  • PLS
  • “U” design
  • Multi-Block

Multiple Blocks

Two-way tables of data are ubiquitous; the usual analysis is PCA.

One response with a table of predictors is common; the usual analysis is linear regression.

Multiple two-way tables are becoming important: gene expression, proteomics, metabolomics.

Examples of Multiple Blocks

……..

Factor analysis of multiple data matrices.

Horst (1961, 1965); Kettenring, J. (1971).

Pittman, Sacks, Young. 2001.

3-Way Analysis.

See also www.niss.org/PowerArray

Martens. 2004.

“U” Analysis

Motivating Problem

Permute the rows and columns to find patterns.

  • Problems:
  • Large tables: tens to hundreds of rows and thousands of columns.
  • Missing data.
  • Outliers.
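
As a rough illustration of permuting rows and columns to expose patterns, the sketch below reorders a small table by row and column means and skips missing cells. The ordering rule, the function names, and the toy table are assumptions made for this example; the slides do not state which ordering is used.

```python
def mean_ignoring_missing(values):
    """Mean of the non-missing entries; None marks a missing cell."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else float("-inf")


def permute_by_means(table):
    """Sort rows and columns by descending mean (a simple stand-in ordering)."""
    n_rows, n_cols = len(table), len(table[0])
    row_order = sorted(range(n_rows),
                       key=lambda i: -mean_ignoring_missing(table[i]))
    col_order = sorted(
        range(n_cols),
        key=lambda j: -mean_ignoring_missing([table[i][j] for i in range(n_rows)]),
    )
    reordered = [[table[i][j] for j in col_order] for i in row_order]
    return row_order, col_order, reordered


# Toy table (hypothetical data) with one missing cell.
table = [
    [1, 9, None, 8],
    [0, 1, 0, 1],
    [2, 8, 1, 9],
]
row_order, col_order, reordered = permute_by_means(table)
for row in reordered:
    print(row)
```

Ordering by means is only one choice; ordering by the elements of a leading eigenvector, as in the wine example later in the talk, follows the same pattern with different scores.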
Matrix Factorization Methods
  • Principal component analysis.
  • Singular value decomposition.
  • Non-negative matrix factorization.
  • Independent component analysis.
  • Inference using NMF.
  • Area of active research.
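Understanding an SVD algorithm helps

$X = \lambda \cdot \mathrm{LHE} \cdot \mathrm{RHE}^{\top} + E$

Here LHE and RHE are the left-hand and right-hand eigenvectors (taken as column vectors), $\lambda$ is the singular value, and $E$ is the residual matrix. Each element is obtained from a simple regression of the form

$y = bx + e$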

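  • Guess at LHE.
  • Regress each column of X on LHE (the column plays the role of y and LHE the role of x in y = bx + e).
  • The regression coefficient is the corresponding element of RHE.
  • Switch LHE and RHE and iterate: alternating least-squares regression.
  • For robustness, use a robust regression method such as least trimmed squares.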
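
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' program: it uses ordinary least squares rather than least trimmed squares, fits a single component, and simply skips missing cells (marked `None`). The function name `rank_one_als` and the toy matrix are made up for the example.

```python
import math


def rank_one_als(X, n_iter=50):
    """Fit X ~ lam * u * v^T by alternating no-intercept regressions.

    Cells equal to None are treated as missing and left out of each regression.
    """
    n, p = len(X), len(X[0])
    u = [1.0] * n  # initial guess at the left-hand vector (LHE)
    v = [0.0] * p  # right-hand vector (RHE), filled in below
    for _ in range(n_iter):
        # Regress each column of X on u; the slope becomes an element of v.
        for j in range(p):
            rows = [i for i in range(n) if X[i][j] is not None]
            den = sum(u[i] ** 2 for i in rows)
            v[j] = sum(u[i] * X[i][j] for i in rows) / den if den else 0.0
        norm_v = math.sqrt(sum(b * b for b in v)) or 1.0
        v = [b / norm_v for b in v]
        # Switch roles: regress each row of X on v; the slope becomes an element of u.
        for i in range(n):
            cols = [j for j in range(p) if X[i][j] is not None]
            den = sum(v[j] ** 2 for j in cols)
            u[i] = sum(v[j] * X[i][j] for j in cols) / den if den else 0.0
    lam = math.sqrt(sum(a * a for a in u)) or 1.0
    return lam, [a / lam for a in u], v


# Toy rank-one table (hypothetical): outer product of [1, 2, 3] and [2, 1, 4],
# with the top-left cell removed to mimic a missing data point.
X = [
    [None, 1.0, 4.0],
    [4.0, 2.0, 8.0],
    [6.0, 3.0, 12.0],
]
lam, u, v = rank_one_als(X)
fitted_missing_cell = lam * u[0] * v[0]
print(round(fitted_missing_cell, 4))
```

Because each regression uses only the observed cells, the fitted value for the missing cell comes for free from the product of the estimated vectors. A robust variant would replace the two least-squares slopes with a trimmed estimate.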
California Versus All Challengers: The 1999 Cabernet Challenge
  • 47 wines judged by 32 wine experts
  • No data for 1 wine/expert
  • One missing data point
  • Results are ranks of wine by each judge

Original Data

The missing cell is colored yellow.


Plot of Eigenvalues

The plot suggests one or two components.


Component 1

Judges are divided into the following groups: 1-3, 4-7, 8-11, 12-26, 27-32

Wines are divided into the following groups: 1-4, 5-17, 18-27, 28-41, 42-46

Comments on the Wine Dataset

Most judges were consistent.

Three judges are at odds with the rest.

The wines divide into six classes; six wines group very well.

There is an apparent interaction of wines and judges.

One eigensystem captures most of the variance.

Key Matrix Factorization Papers
  • Good (1969) Technometrics – SVD.
  • Liu et al. (2003) PNAS – rSVD.
  • Lee and Seung (1999) Nature – NMF.
  • Kim and Tidor (2003) Genome Research.
  • Brunet et al. (2004) PNAS – microarray.
  • Fogel et al. (2007) Bioinformatics.
Contention: NMF finds “parts”

The elements of an SVD right-hand eigenvector come from a composite of mechanisms. (They are regression coefficients.)

NMF commits one vector to each mechanism. (True??)

“For such databases there is a generative model in terms of ‘parts’ and NMF correctly identifies the ‘parts’.”

NMF Algorithm

$A = WH + E$

A is the data matrix, with genes or compounds along one dimension and samples along the other.

Green are the “spectra”.

Red are the “weights”.

Optimize so that $\sum_{i,j} \left(a_{ij} - (WH)_{ij}\right)^2$ is minimized.

Start with random elements in red and green.
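
One common way to carry out this minimization is the multiplicative update rule of Lee and Seung. The sketch below is an illustration under that assumption; the slide does not say which optimizer the authors' program uses. The helper names, the toy matrix, and the iteration count are hypothetical, and plain Python lists are used only to keep the example self-contained.

```python
import random


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def squared_error(A, W, H):
    """Sum over all cells of (a_ij - (WH)_ij)^2."""
    WH = matmul(W, H)
    return sum((a - m) ** 2 for ra, rm in zip(A, WH) for a, m in zip(ra, rm))


def nmf(A, k, n_iter=500, seed=0, eps=1e-12):
    """Non-negative factorization A ~ W H using multiplicative updates."""
    rng = random.Random(seed)
    n, p = len(A), len(A[0])
    # Start with random positive elements in W and H.
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(p)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T A) / (W^T W H), element by element
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[h * a / (d + eps) for h, a, d in zip(rh, ra, rd)]
             for rh, ra, rd in zip(H, num, den)]
        # W <- W * (A H^T) / (W H H^T), element by element
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (d + eps) for w, a, d in zip(rw, ra, rd)]
             for rw, ra, rd in zip(W, num, den)]
    return W, H


# Toy non-negative matrix (hypothetical) built from two row patterns.
A = [
    [1.0, 0.0, 2.0],
    [2.0, 0.0, 4.0],
    [0.0, 3.0, 1.0],
    [0.0, 6.0, 2.0],
]
W_start, H_start = nmf(A, k=2, n_iter=1)
W, H = nmf(A, k=2, n_iter=500)
print(squared_error(A, W_start, H_start), squared_error(A, W, H))
```

Because every update multiplies a non-negative entry by a non-negative ratio, W and H stay non-negative throughout. The result depends on the random start, so several restarts are usually compared in practice.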

Scotch Whisky

Original matrix = Prototypical flavor patterns × Weights

Data source: Wishart, Whisky Classified

Golub, T.R. et al. (1999)
  • Group AML: acute myeloid leukemia
  • Group ALL: acute lymphoblastic leukemia
    • Subgroup ALL-T: T cell subtypes
    • Subgroup ALL-B: B cell subtypes
Gene and Sample Clustering

NMF clusters samples correctly.

Additional subgroup of ALL-B.

Brunet et al. (2004). PNAS 101, 4164–4169

ALL-B1 and ALL-B2 Genes

Figure: functional categories of the genes in two NMF clusters.

Cluster labels:

  • Cluster 1, ALL-B1 (33 genes)
  • Cluster 3, ALL-B2 (169 genes)

Category labels, in the order they appear on the slide:

  • Immune Response: 10 genes (p = 0.00019)
  • MHC class II: 5 genes
  • Proteasome: 7 genes (p = 0.00054)
  • MHC class I & II: 6 genes (p = 0.00018)
  • Immune Response: 28 genes (p = 0.00047)
  • RNA Processing: 11 genes (p = 0.00260)
  • DNA Repair and Replication: 11 genes (p = 0.01519)
  • Cell Growth and Proliferation: 61 genes
  • Cell Cycle: 12 genes
  • Transcription: 16 genes

Upregulation in ALL-B2 genes:

  • Higher rate of transcription and replication processes
  • More:
    • Proliferative nature compared with ALL-B1
    • Proteasomal activity
    • Energy production

Inference Strategy

Non-negative matrix factorization is used to group genes. [Unsupervised training.]

The testing alpha is allocated over these groups/vectors.

Within each group, genes are tested sequentially; there is no multiple-testing adjustment within a group.
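
The strategy can be made concrete with a small sketch. Two details are assumptions made here because the slide leaves them open: alpha is split equally across the groups, and testing within a group stops at the first gene that fails its test. The function name and the p-values are hypothetical.

```python
def sequential_tests(groups, alpha=0.05):
    """Count the genes declared significant in each group.

    groups maps a group name to p-values listed in the order the genes are tested.
    """
    alpha_per_group = alpha / len(groups)  # assumed: equal allocation of alpha
    rejected = {}
    for name, pvalues in groups.items():
        count = 0
        for p in pvalues:
            if p > alpha_per_group:
                break  # assumed: stop at the first non-significant gene
            count += 1
        rejected[name] = count
    return rejected


# Hypothetical p-values for genes ordered within two NMF vectors.
groups = {
    "vector 1": [0.001, 0.004, 0.02, 0.30, 0.01],
    "vector 2": [0.03, 0.001],
}
rejected = sequential_tests(groups, alpha=0.05)
print(rejected)
```

Each gene in a group is tested at the full level allocated to that group, which is why no further adjustment is applied within the group; the price is that a gene late in the order is never tested once an earlier one fails.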

Inference NMF Algorithm

[Figure: the blocks X and Y alongside the NMF factors W and H.]

  1. Compute NMF.
  2. Order Y by elements of W.
  3. Compute runs test on Y.
  4. Remove most important column of X.
  5. Repeat steps 1 to 3 (maintain order of H).
  6. Stop when runs test is not significant.

Fogel et al. (2007) Bioinformatics
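
Steps 2 and 3 can be sketched as follows. This is an illustration, not the published procedure: it assumes Y is a two-class label, uses the Wald-Wolfowitz runs test with a normal approximation, and reports a one-sided p-value for "too few runs". The function names and the toy weights are made up for the example.

```python
import math


def order_by_weights(y, w):
    """Return the labels y sorted by increasing weight w."""
    return [label for _, label in sorted(zip(w, y), key=lambda pair: pair[0])]


def runs_test(labels):
    """Wald-Wolfowitz runs test for a two-class sequence (both classes present).

    Returns (number of runs, z statistic, one-sided p-value for too few runs).
    """
    n1 = sum(1 for x in labels if x)
    n2 = len(labels) - n1
    n = n1 + n2
    runs = 1 + sum(1 for a, b in zip(labels, labels[1:]) if bool(a) != bool(b))
    mean = 2.0 * n1 * n2 / n + 1.0
    var = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n * n * (n - 1.0))
    z = (runs - mean) / math.sqrt(var)
    p_lower = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return runs, z, p_lower


# Hypothetical weights from one column of W and a two-class response Y.
w = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.95, 0.05]
y = [1, 0, 1, 0, 1, 0, 1, 0]
ordered = order_by_weights(y, w)
runs, z, p_lower = runs_test(ordered)
print(ordered, runs, round(z, 3), round(p_lower, 4))
```

Few runs after ordering mean that the weights separate the two classes, which is the signal the procedure looks for. With samples this small the normal approximation is rough, and an exact runs distribution would be preferable.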

Simulation

Genes 1-5: up-regulated by T1

Genes 6-10: up-regulated by T2

Genes 11-20: up-regulated by T1 and T2

NB: Genes within a mechanism are expected to be correlated.
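
A data set with this structure can be generated along the following lines. Everything beyond the gene-to-mechanism assignment is an assumption for illustration: T1 and T2 are treated as per-sample activity levels drawn uniformly from 0 to 1, and the number of samples, the effect size, and the noise level are arbitrary choices.

```python
import random


def simulate_expression(n_samples=200, effect=2.0, noise=0.3, seed=0):
    """Simulate non-negative expression for 20 genes driven by T1 and T2.

    Genes 1-5 respond to T1, genes 6-10 to T2, and genes 11-20 to both.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_samples):
        t1, t2 = rng.random(), rng.random()  # assumed activity of each mechanism
        row = []
        for gene in range(1, 21):
            uses_t1 = gene <= 5 or gene >= 11
            uses_t2 = gene >= 6
            signal = effect * (t1 * uses_t1 + t2 * uses_t2)
            # Clip at zero so the table stays non-negative, as NMF requires.
            row.append(max(0.0, signal + rng.gauss(0.0, noise)))
        data.append(row)
    return data


data = simulate_expression()
print(len(data), len(data[0]))
```

Genes that share a mechanism inherit the same per-sample activity, so they come out correlated, which is the property noted above.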

General Comments

SVD is the basis for most linear statistical systems.

Non-negative matrix factorization will become increasingly important.

Data sets are getting much bigger.

We are seeing complex, multi-block data sets.

We need good software to expand data analysis.

irMF Summary
  • NMF is an attractive alternative to SVD.
  • Mechanisms appear to be captured in separate vectors.
  • Genes can be tested sequentially within a right vector.
  • Many statistical problems are open for research.
More Information

NMF program and papers at www.niss.org/irMF

Stan Young : young@niss.org

Paul Fogel : paul.fogel@wanadoo.fr

More Information

NMF Code and papers at www.niss.org/irMF

Analysis of “L” design: www.niss.org/PowerArray

NMF roundtable luncheon at JSM 2007.

See also: www.niss.org/PowerMV

http://eccr.stat.ncsu.edu/

young@niss.org