Large Two-way Arrays


  1. Large Two-way Arrays Douglas M. Hawkins School of Statistics University of Minnesota doug@stat.umn.edu

  2. What are ‘large’ arrays? • # of rows at least in the hundreds and/or • # of columns at least in the hundreds

  3. Challenges/Opportunities • Logistics of handling data more tedious • Standard graphic methods work less well • More opportunity for assumptions to fail but • Parameter estimates more precise • Fewer model assumptions may be possible

  4. Settings • Microarray data • Proteomics data • Spectral data (fluorescence, absorption…)

  5. Common problems seen • Outliers/Heavy-tailed distributions • Missing data • Large # of variables hurts some methods

  6. The ovarian cancer data • Data set as I have it: • 15154 variables (M/Z values), % relative intensity recorded • 91 controls (clinical normals) • 162 ovarian cancer patients

  7. The normals • Gives us an array of 15154 rows, 91 columns. • Qualifies as ‘large’ • Spectrum very ‘busy’

  8. not to mention outlier-prone • Subtracting off a median for each M/Z and making a normal probability plot of the residuals
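
A minimal sketch of that diagnostic in Python (not the authors' code), assuming the 91 control spectra sit in a NumPy array `controls` with one row per M/Z value:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# controls: 15154 x 91 array, rows = M/Z values, columns = control subjects (assumed name)
residuals = controls - np.median(controls, axis=1, keepdims=True)  # remove each M/Z's median

# Normal probability (Q-Q) plot of the pooled residuals; heavy tails and
# outliers show up as points bending away from the reference line.
stats.probplot(residuals.ravel(), dist="norm", plot=plt)
plt.title("Residuals after subtracting per-M/Z medians")
plt.show()
```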

  9. Comparing cases, controls • First pass at a rule to distinguish normal controls from cancer cases: • Calculate two-sample t between groups for each distinct M/Z
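
In code, that first pass could look like the following sketch (array names are illustrative; `controls` and `cancers` hold the two groups with one row per M/Z):

```python
import numpy as np
from scipy import stats

# controls: 15154 x 91, cancers: 15154 x 162 (rows = M/Z values)
t_stat, p_val = stats.ttest_ind(cancers, controls, axis=1)   # one two-sample t per M/Z

# The M/Z values with the largest absolute t are the candidate discriminators
top_mz = np.argsort(-np.abs(t_stat))[:20]
```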

  10. Good news / bad news • Several places in spectrum with large separation (t=24 corresponds to around 3 sigma of separation) • Visually these seem to be isolated spikes • This is due to the large # of narrow peaks
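
A back-of-the-envelope check of the "t = 24 is about 3 sigma" statement, using the standard link between a two-sample t and the standardized mean difference with n₁ = 91 controls and n₂ = 162 cases:

$$
d \;\approx\; t\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}
\;=\; 24\sqrt{\frac{1}{91}+\frac{1}{162}} \;\approx\; 3.1 .
$$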

  11. Variability also differs

  12. Big differences in mean and variability • suggest conventional statistical tools: • Linear discriminant analysis • Logistic regression • Quadratic or regularized discriminant analysis using a selected set of features. Off-the-shelf software doesn’t like 15K variables, but the methods are very doable.
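
As one hedged illustration of the "selected set of features" route (not the authors' analysis), here is a scikit-learn sketch that keeps the top-|t| M/Z values and fits a penalized logistic regression:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# X: 253 cases x 15154 M/Z intensities, y: 0 = control, 1 = cancer (assumed names)
# t_stat: per-M/Z two-sample t statistics from the previous step
selected = np.argsort(-np.abs(t_stat))[:50]          # keep the 50 most separating M/Z values
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
print(cross_val_score(clf, X[:, selected], y, cv=5).mean())
# Note: for an honest error rate the selection step should be redone inside each
# cross-validation fold, otherwise the feature-selection bias mentioned later creeps in.
```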

  13. Return to beginning • Are there useful tools for extracting information from these arrays? • Robust singular value decomposition (RSVD) is one that merits consideration (see our two NISS tech reports)

  14. Singular value approximation • Some philosophy from Bradu (1984) • Write X for the n×p data array. • First remove structure you don’t want to see • The k-term SVD approximation is xij = Σt rit cjt + eij, summing over t = 1, …, k

  15. The rit are ‘row markers’. You could use them as plot positions for the proteins. • The cjt are ‘column markers’. You could use them as plot positions for the cases. They match their corresponding row markers. • The eij are error terms. They should mainly be small.

  16. Fitting the SVD • Conventionally done by principal component analysis. • We avoid this for two reasons: • PCA is highly sensitive to outliers • It requires complete data (an issue in many large data sets, if not this one) • The standard approach would use a 15K × 15K covariance matrix.

  17. Alternating robust fit algorithm • Take trial values for the column markers. Fit the corresponding row markers using robust regression on available data. • Use resulting row markers to refine column markers. • Iterate to convergence. • For robust regression we use least trimmed squares (LTS) regression.
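
A minimal NumPy sketch of the one-term alternating fit (not the authors' implementation; a concentration-step trimmed least squares stands in for full LTS, and missing entries are simply skipped):

```python
import numpy as np

def trimmed_slope(y, x, keep=0.8, n_csteps=3):
    """Robust no-intercept slope of y on x: trimmed least squares via concentration steps."""
    ok = ~np.isnan(y)                       # use only the available data
    y, x = y[ok], x[ok]
    h = max(2, int(keep * len(y)))          # number of best-fitting points retained
    b = x @ y / (x @ x)                     # ordinary least-squares start
    for _ in range(n_csteps):
        keep_idx = np.argsort((y - b * x) ** 2)[:h]
        b = x[keep_idx] @ y[keep_idx] / (x[keep_idx] @ x[keep_idx])
    return b

def alternating_robust_fit(X, n_iter=20, keep=0.8):
    """One-term robust SVD X ~ r c': returns row markers r and column markers c."""
    n, p = X.shape
    c = np.nanmedian(X, axis=0)             # trial column markers: column medians
    for _ in range(n_iter):                 # or iterate until the markers stop changing
        r = np.array([trimmed_slope(X[i, :], c, keep) for i in range(n)])
        r /= np.linalg.norm(r)              # unit-length row markers; the scale stays in c
        c = np.array([trimmed_slope(X[:, j], r, keep) for j in range(p)])
    return r, c
```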

  18. Result for the controls • First run, I just removed a grand median. • Plots of the first few row markers show fine structure like that of mean spectrum and of the discriminators

  19. But the subsequent terms capture the finer structure

  20. Uses for the RSVD • Instead of feature selection, we can use cases’ c scores as variables in discriminant rules. Can be advantageous in reducing measurement variability and avoids feature selection bias. • Can use as the basis for methods like cluster analysis.
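
A hedged sketch of that use: obtain a few terms by repeatedly fitting and removing one robust term (one simple way to get a k-term fit, not necessarily the authors' scheme), then hand each case's column-marker scores to an ordinary discriminant rule:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def robust_svd_scores(X, k=3, **kw):
    """k column-marker scores per case, by sequentially fitting and deflating one term."""
    X = X.copy()
    scores = []
    for _ in range(k):
        r, c = alternating_robust_fit(X, **kw)   # one-term fit from the sketch above
        scores.append(c)
        X -= np.outer(r, c)                      # remove the fitted term, refit the residual
    return np.column_stack(scores)               # cases x k

C = robust_svd_scores(spectra, k=3)              # spectra: 15154 x 253 (controls + cancers), assumed
lda = LinearDiscriminantAnalysis().fit(C, y)     # y: 0 = control, 1 = cancer
```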

  21. Cluster analysis use • Consider methods based on Euclidean distance between cases (k-means / Kohonen follow similar lines)
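
The displayed decomposition itself is not in the transcript; reconstructed from the description on the next slide (and assuming the usual orthogonality of the row-marker columns), the squared Euclidean distance between cases j and j′ splits as

$$
\sum_i (x_{ij}-x_{ij'})^2 \;=\;
\sum_t \Big(\sum_i r_{it}^2\Big)\,(c_{jt}-c_{j't})^2
\;+\; \sum_i (e_{ij}-e_{ij'})^2
\;+\; \text{cross terms}.
$$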

  22. The first term is the sum of squared differences in column markers, weighted by the squared Euclidean norm of the row markers. • The second term is noise: it adds no information and detracts from performance. • The third term, the cross-product, is approximately zero because of independence.

  23. This leads to… • The r, c scale is arbitrary. Make the row-marker columns length 1, absorbing the singular value into c. • Replace the column Euclidean distance with the squared distance between column markers. This removes the random variability. • Similarly, for k-means/Kohonen, replace each column profile with its SVD approximation.
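
For instance (illustrative only), a k-means run on the column-marker scores rather than on the raw 15154-long profiles:

```python
from sklearn.cluster import KMeans

# C: cases x k matrix of column-marker scores from the RSVD
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(C)
labels = km.labels_        # cluster membership for each case
```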

  24. Special case • If a one term SVD suffices, we get an ordination of the rows and columns. • Row ordination doesn’t make much sense for spectral data • Column ordination orders subjects ‘rationally’.

  25. The cancer group • Carried out RSVD of just the cancer cases • But this time removed the row median first • This corrects for overall abundance at each M/Z • The robust singular values are 2800, 1850, 1200, … • suggesting more than one dimension.

  26. No striking breaks in sequence. • We can cluster, but get more of a partition of a continuum. • Suggests that severity varies smoothly

  27. Back to the two-group setting • An interesting question (suggested by the Mahalanobis-Taguchi strategy) – are the cancer cases all alike? • Can address this by RSVD of the cancer cases and clustering on the column markers • Or use the controls to get a multivariate metric and place the cancers in this metric.

  28. Do a new control RSVD • Subtract row medians. • Get canonical variates for all versus just controls • (Or, as we have plenty of cancer cases, conventionally, of cancer versus controls) • Plot the two groups

  29. Supports earlier comment re lack of big ‘white space’ in the cancer group – a continuum, not distinct subpopulations • Controls look a lot more homogeneous than cancer cases.

  30. Summary • Large arrays – challenge and opportunity. • Hard to visualize or use graphs. • Many data sets show outliers / missing data / very heavy tails. • Robust-fit singular value decomposition can handle these; provides large data condensation.

  31. Some references
