
Random Forest Photometric Redshift Estimation


Presentation Transcript


  1. Random Forest Photometric Redshift Estimation Samuel Carliles¹, Tamas Budavari², Sebastien Heinis², Carey Priebe³, Alex Szalay² Johns Hopkins University: ¹Dept. of Computer Science, ²Dept. of Physics & Astronomy, ³Dept. of Applied Mathematics & Statistics

  2. Photometric Redshifts • You know what they are • I did it on SDSS DR6 colors • zspec = f(u−g, g−r, r−i, i−z) • zphot = f̂(u−g, g−r, r−i, i−z) • ε = zphot − zspec • I did it with Random Forests
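
A minimal sketch of how the color inputs might be assembled, assuming a hypothetical catalog frame with u, g, r, i, z magnitudes and a zspec column (actual DR6 column names will differ; the data here is synthetic):

    # Hypothetical catalog: u, g, r, i, z magnitudes plus spectroscopic redshift
    set.seed(1)
    n   <- 1000
    dr6 <- data.frame(u = rnorm(n, 19.0), g = rnorm(n, 18.2), r = rnorm(n, 17.8),
                      i = rnorm(n, 17.5), z = rnorm(n, 17.3),
                      zspec = runif(n, 0, 0.5))

    # The four color features used as inputs to f
    colors <- with(dr6, data.frame(ug = u - g, gr = g - r,
                                   ri = r - i, iz = i - z, zspec = zspec))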

  3. Regression Trees • A Binary Tree • It partitions input training data into clusters of similar objects • Each new test object is matched with the cluster to which it is “closest” in the input space • The output value is the mean of the output values of training objects in its cluster
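
A single regression tree of this kind can be sketched with R's rpart package (not the code behind these slides, just an illustration of leaf-mean prediction on synthetic data):

    library(rpart)

    set.seed(2)
    n     <- 1000
    train <- data.frame(ug = rnorm(n), gr = rnorm(n), ri = rnorm(n), iz = rnorm(n))
    train$zspec <- 0.1 + 0.05 * train$ug - 0.03 * train$gr + rnorm(n, sd = 0.02)

    # method = "anova" grows a regression tree; each leaf predicts the mean
    # zspec of the training objects that landed in it
    tree  <- rpart(zspec ~ ug + gr + ri + iz, data = train, method = "anova")
    zphot <- predict(tree, newdata = train[1:5, ])   # leaf means for test objects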

  4. Building a Regression Tree • Starting at the root node, choose a dimension on which to split • Choose the point which “best” distinguishes clusters in that dimension • Points to the left of the split go in the left child, points to the right go in the right child • Repeat the process in each child node until every object is in its own leaf node

  5. How Do You Choose the Dimension and Split Point? • The best split point in a dimension is the one which minimizes resubstitution error in that dimension • The best dimension is the one with the lowest best resubstitution error

  6. What’s Resubstitution Error? • For a candidate split point, there are points to its left and points to its right • E = Σ_L (x − x̄_L)² / N_L + Σ_R (x − x̄_R)² / N_R, where x̄_L, x̄_R are the mean values and N_L, N_R the counts on each side • That’s the resubstitution error • Minimize it
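
The split search on these two slides can be sketched directly in R, following the formula above (a sketch on synthetic data, not the authors' code):

    # Resubstitution error of a candidate split s in one dimension
    resub_err <- function(x, y, s) {
      left  <- y[x <= s]
      right <- y[x >  s]
      mean((left - mean(left))^2) + mean((right - mean(right))^2)
    }

    # Best split point in one dimension: minimize resubstitution error
    best_split <- function(x, y) {
      s   <- sort(unique(x))
      s   <- s[-length(s)]                 # keep both children non-empty
      err <- sapply(s, resub_err, x = x, y = y)
      list(split = s[which.min(err)], error = min(err))
    }

    # Best dimension: the one whose best split has the lowest error
    set.seed(3)
    X <- matrix(rnorm(200), ncol = 4,
                dimnames = list(NULL, c("ug", "gr", "ri", "iz")))
    y <- 0.1 + 0.05 * X[, "ug"] + rnorm(50, sd = 0.02)
    errs <- apply(X, 2, function(col) best_split(col, y)$error)
    names(which.min(errs))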

  7. Randomizing a Regression Tree • Train it on a bootstrap sample • This is a sample of N objects chosen uniformly at random with replacement from the complete training set • Instead of choosing the best dimension to split on, choose the best from among a random subset of input dimensions
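
The two randomizations can be sketched in a few lines (illustrative only; names like `train` and the synthetic data are placeholders):

    set.seed(4)
    train <- data.frame(ug = rnorm(100), gr = rnorm(100),
                        ri = rnorm(100), iz = rnorm(100),
                        zspec = runif(100, 0, 0.5))

    # Bootstrap sample: N draws, uniformly at random, with replacement
    N    <- nrow(train)
    boot <- train[sample(N, N, replace = TRUE), ]

    # At each node, search only a random subset of the input dimensions
    mtry <- 2
    dims <- sample(c("ug", "gr", "ri", "iz"), mtry)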

  8. Random Forest • An ensemble of “randomized” Regression Trees • The ensemble estimate is the mean of the individual tree estimates • The individual tree errors are iid, so the Central Limit Theorem gives the distribution of their mean • Their mean is exactly the ensemble error, ε = zphot − zspec • That means we have the error distribution for that object!
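
One way to make this concrete with the randomForest package: predict.all = TRUE returns every tree's estimate, so the ensemble mean and a per-object error scale fall out directly. The sqrt(ntree) scaling below assumes the slide's iid picture; correlated trees would call for a larger scale (synthetic data, a sketch only):

    library(randomForest)

    set.seed(5)
    n <- 2000
    d <- data.frame(ug = rnorm(n), gr = rnorm(n), ri = rnorm(n), iz = rnorm(n))
    d$zspec <- 0.1 + 0.05 * d$ug - 0.03 * d$gr + rnorm(n, sd = 0.02)
    train <- d[1:1500, ]
    test  <- d[1501:2000, ]

    rf   <- randomForest(zspec ~ ., data = train, ntree = 200)
    pred <- predict(rf, test, predict.all = TRUE)

    zphot <- pred$aggregate                  # ensemble estimate: mean over trees
    sigma <- apply(pred$individual, 1, sd)   # per-object spread of tree estimates

    # CLT: if tree errors are iid, the ensemble error scale is sigma / sqrt(ntree)
    std_err <- (zphot - test$zspec) / (sigma / sqrt(rf$ntree))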

  9. Implemented in R • More training data -> better estimates • Forests converge pretty quickly in forest size • Training set size and input dimensionality are constrained by memory in the R implementation
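
What the R workflow might have looked like (a sketch on synthetic data; the randomForest docs note the x/y matrix interface is lighter on memory than the formula interface for large sets, which matters given the constraint above):

    library(randomForest)

    set.seed(6)
    n <- 5000
    X <- data.frame(ug = rnorm(n), gr = rnorm(n), ri = rnorm(n), iz = rnorm(n))
    y <- 0.1 + 0.05 * X$ug - 0.03 * X$gr + rnorm(n, sd = 0.02)

    rf <- randomForest(x = X, y = y, ntree = 500)
    plot(rf)    # OOB mean squared error vs. forest size: flattens quickly
    print(rf)   # OOB error and % variance explained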

  10. Results • Training set size = 80,000 • RMS error = 0.023

  11. Error Distribution / Standardized Error Distribution • Since we know the error distribution* for each object, we can standardize the errors, and the results should be standard normal over all test objects. Like in this plot! :) • If the standardized errors are standard normal, then we can predict how many of the errors fall between the tails of the distribution for different tail sizes. Like in this plot! (mostly)
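
The tail check described on this slide can be sketched as below; std_err is a stand-in here, and in practice would be the standardized errors computed as in the slide-8 sketch:

    # If standardized errors are N(0,1), the fraction inside the central
    # (1 - alpha) interval should match the normal prediction
    std_err <- rnorm(10000)      # placeholder for the real standardized errors
    for (alpha in c(0.32, 0.10, 0.05, 0.01)) {
      cut <- qnorm(1 - alpha / 2)
      cat(sprintf("central %2.0f%%: expected %.3f, observed %.3f\n",
                  100 * (1 - alpha), 1 - alpha, mean(abs(std_err) < cut)))
    }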

  12. Summary • Random Forest estimates come with Gaussian error distributions • 0.023 RMS error is competitive with other methodologies • This makes Random Forests a strong choice for photometric redshift estimation

  13. Future Work • The Cramér–Rao lower bound says bigger N gives better estimates from the same estimator • 80,000 objects is good, but we have far more than that available • Random Forests in R are extremely memory-inefficient (and therefore time-inefficient), I believe due to the FORTRAN implementation • So I’m writing a C# implementation
