Topics in learning from high dimensional data and large scale machine learning

Presentation Transcript


  1. Topics in learning from high dimensional data and large scale machine learning Ata Kaban School of Computer Science University of Birmingham

  2. High dimensional data • We have N data points (observations): T = {x1, x2, …, xN}, where each point has d attributes. The number of attributes, d, we call the dimensionality of the data. • In many application areas in science and engineering, unprecedented technological advances lead to increasingly high dimensional data sets. For instance, in genomics and proteomics, biomedical imaging, signal processing, astrophysics, finance, web, and market basket analysis it is not uncommon to have d in the order of thousands!

  3. Problem 1 • We have seen that the working of machine learning algorithms depends in some way or other on the geometry of the data – lengths of vectors, distances, angles, shapes. • High dimensional geometry is very different from low dimensional geometry. It defeats our intuitions: we can draw in 2D and imagine things in 3D, but what happens in higher d?

  4. Problem 2 • Most machine learning methods take computation time that increases quickly (exponentially) with the dimensionality of the data. • These and a suite of other issues caused by high dimensionality are usually referred to as “the curse of dimensionality”. • Fortunately, high dimensionality also has blessings!

  5. Concentration of norms • Generate points in 2D with coordinates drawn independently at random from a distribution with mean 0 and variance 1/d. Create a histogram of the norms (lengths) of these points. Repeat at larger dimensions. What happens?
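A minimal NumPy sketch of this simulation (the sample size and the list of dimensions are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 10_000

for d in [2, 10, 100, 1000]:
    # each coordinate ~ N(0, 1/d), so the expected squared norm is d * (1/d) = 1
    X = rng.normal(loc=0.0, scale=np.sqrt(1.0 / d), size=(n_points, d))
    norms = np.linalg.norm(X, axis=1)
    print(f"d={d:5d}  mean norm = {norms.mean():.3f}  std = {norms.std():.3f}")
# a histogram of `norms` narrows sharply around 1 as d grows
```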

  6. Near-orthogonality • Now, generate pairs of vectors in the same way and look at the distributions of their dot products. Recall, the dot product is 0 iff the vectors are orthogonal. What happens as you increase d?
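A sketch of the second simulation, using the same kind of random vectors as above (again, the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 10_000

for d in [2, 10, 100, 1000]:
    # pairs of vectors with i.i.d. N(0, 1/d) coordinates, as on the previous slide
    U = rng.normal(0.0, np.sqrt(1.0 / d), size=(n_pairs, d))
    V = rng.normal(0.0, np.sqrt(1.0 / d), size=(n_pairs, d))
    dots = np.einsum("ij,ij->i", U, V)  # row-wise dot products
    print(f"d={d:5d}  mean dot = {dots.mean():+.4f}  std = {dots.std():.4f}")
# the spread of the dot products shrinks like 1/sqrt(d):
# random high dimensional vectors are nearly orthogonal
```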

  7. What is happening as d → ∞? • We can see from the simulation plots that: • As d increases, any two of our random vectors end up being nearly orthogonal to each other • As d increases, any of our random vectors ends up having about the same length • Can we explain why these things are happening? • Yes we can, but we need some math tools for that… [on separate slides]
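The full argument is on the separate slides mentioned above; a brief sketch of the calculation, assuming (as in the simulations) coordinates that are i.i.d. with mean 0 and variance 1/d, shows where the concentration comes from:

```latex
% Expected squared length: each of the d coordinates contributes its variance 1/d.
\mathbb{E}\!\left[\|x\|^2\right] = \sum_{i=1}^{d}\mathbb{E}\!\left[x_i^2\right] = d\cdot\frac{1}{d} = 1 .

% Dot product of two independent such vectors: mean 0, and the variance vanishes.
\operatorname{Var}\!\left(u^\top v\right) = \sum_{i=1}^{d}\operatorname{Var}(u_i v_i)
  = \sum_{i=1}^{d}\mathbb{E}[u_i^2]\,\mathbb{E}[v_i^2] = d\cdot\frac{1}{d^2} = \frac{1}{d} \longrightarrow 0 .
```

So squared lengths have mean 1 with shrinking fluctuations (for Gaussian coordinates, Var(‖x‖²) = 2/d), and dot products concentrate around 0 – exactly what the histograms show.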

  8. Consequences for machine learning • When data has little structure (e.g. many attributes that are independent of each other – like in the data we generated in the earlier slides) then the ‘nearest neighbour’ is at about the same distance as the furthest one! • When the data does have structure (e.g. nicely separated classes, or it lives on a smaller dimensional subspace), then we can project our high dimensional data onto a small collection of random vectors without losing much of the structure! – Cheap dimensionality reduction by Random Projections
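A quick sketch of the first point, on structureless data (i.i.d. Gaussian coordinates; the point count and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 1000

for d in [2, 10, 100, 1000]:
    X = rng.normal(size=(n_points, d))   # structureless data: independent coordinates
    query = rng.normal(size=d)
    dists = np.linalg.norm(X - query, axis=1)
    print(f"d={d:5d}  nearest / furthest distance = {dists.min() / dists.max():.3f}")
# the ratio approaches 1 as d grows: the nearest neighbour of the query
# is barely closer than the furthest point
```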

  9. Random Projections • This result can be used for large scale machine learning. You just generate a k×d matrix, where k << d, with entries i.i.d. random, e.g. from a standard normal distribution, and pre-multiply your data points with it to get k-dimensional data that has much the same structure as the original.
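A minimal sketch of the recipe in plain NumPy. The sizes n, d, k and the 1/√k rescaling of the projection matrix are the usual illustrative choices, not prescribed by the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 10_000, 100                  # k << d

X = rng.normal(size=(n, d))                 # stand-in for high dimensional data
R = rng.normal(size=(k, d)) / np.sqrt(k)    # k x d matrix, i.i.d. N(0,1) entries, rescaled
Y = X @ R.T                                 # each row is now the k-dimensional projection R x

# compare a few pairwise distances before and after the projection
for i, j in [(0, 1), (2, 3), (4, 5)]:
    d_orig = np.linalg.norm(X[i] - X[j])
    d_proj = np.linalg.norm(Y[i] - Y[j])
    print(f"pair ({i},{j}): {d_orig:8.2f} -> {d_proj:8.2f}   ratio {d_proj / d_orig:.3f}")
# with the 1/sqrt(k) rescaling, pairwise distances are preserved up to a small
# distortion with high probability (a Johnson-Lindenstrauss type guarantee)
```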

  10. Summary • Curse of dimensionality for data that has little structure: • Nearly equal lengths • Near orthogonality • Nearest neighbour becomes meaningless (as well as other methods that also rely on distances) • Blessing of dimensionality for data that has structure • Random Projections can be used as a cheap dimensionality reduction technique that has surprisingly strong guarantees of preserving the data geometry.

  11. Related readings • R. J. Durrant and A. Kaban. When is 'Nearest Neighbor' Meaningful: A Converse Theorem and Implications. Journal of Complexity, Volume 25, Issue 4, August 2009, pp. 385-397. • Our tutorial at ECML’12, “Random Projections for Machine Learning and Data Mining: Theory and Applications” – with many references therein! http://www.ecmlpkdd2012.net/programme/tutorials/#random
