2012-03-21

why stop at a few eigenvectors?

My loyal reader knows that I hate PCA. That said, Fergus and I are using it to model data from Oppenheimer's P1640 imaging spectrographic coronagraph. We find that the right number of eigenvectors to use from the PCA is in the hundreds, not the few to a dozen that astronomers are used to. The reason? Fergus and I need to represent the data at exceedingly high accuracy to find faint companions (think exoplanets) among the speckly noise. However, with hundreds of components, a PCA can properly model any exoplanet along with all the speckles, so we use a train-and-test framework, in which the pixels of interest for finding the exoplanet are not used in building the eigenvectors (which are then used to model the pixels of interest). That permits us to go to immense model complexity without over-fitting. I love it because it is so crazy; we are barely even compressing the signal with the PCA; we really are just using the PCA to figure out whether the pixels of interest are outliers relative to the pixels not of interest. Of course, because all pixels are of interest, we are cycling through all choices of the pixels of interest (and their complementary training sets). My job is to write this all up. By Friday! Luckily I am on a train to Baltimore tomorrow. If you have recently sent me email: Expect high latency.
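
Here is a minimal sketch (not our actual pipeline) just to make the train-and-test scheme concrete. It assumes the data are arranged as a matrix with one row per pixel (each row being, say, that pixel's values across wavelength channels or exposures) and treats the pixels of interest as held-out rows; the function name, the matrix layout, the default number of components, and the file name are all hypothetical.

    # Sketch of the train-and-test PCA described above (illustration only).
    # Assumed layout: X has one row per pixel, one column per feature
    # (e.g. wavelength channel or exposure); `test_rows` indexes the pixels
    # of interest, which are excluded when building the eigenvectors.
    import numpy as np

    def heldout_pca_residuals(X, test_rows, n_components=200):
        """Build eigenvectors without `test_rows`, then model those rows."""
        train = np.delete(X, test_rows, axis=0)
        mean = train.mean(axis=0)
        # Right singular vectors of the mean-subtracted training pixels.
        _, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
        basis = Vt[:n_components]                  # (K, n_features)
        Y = X[test_rows] - mean                    # the pixels of interest
        # Least-squares fit of each held-out row onto the training eigenvectors.
        coeffs, *_ = np.linalg.lstsq(basis.T, Y.T, rcond=None)
        model = coeffs.T @ basis
        return Y - model                           # large residuals = outliers

    # Cycle through every choice of pixels of interest (here, one pixel at a
    # time), using the complementary set as the training data each time.
    # X = np.load("cube.npy")                      # hypothetical data file
    # residuals = np.vstack([heldout_pca_residuals(X, [i])
    #                        for i in range(X.shape[0])])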

3 comments:

  1. Doesn't that mean your chosen basis set is inefficient at representing the pixels of non-interest? Or are the differences between the pixels of interest and non-interest in that <1% tail that you lose by truncating the number of eigenvectors?

    I'll hazard a guess that you cross-validate on the pixels of non-interest to choose the number of basis vectors?

  2. Doug: Not sure how to answer your first question, but a partial answer is provided by our answer to the second: Our objective is sensitivity to faint companions, so we actually choose the number of eigenvectors that optimizes that sensitivity (as best we can measure it). It turns out that this criterion is very similar to cross-validation, of course! (There is a toy cross-validation sketch at the end of this thread.)

  3. I think this is awesome! We need to start prodding astronomers out of their 2-to-3-dimensional shells! I love concrete astrophysical examples where 100s of dimensions are necessary to appropriately model the data.

    The fact that a large number of basis functions is optimal could be due to the inefficiency of your choice of basis (i.e. the data cannot be compactly represented by PCA eigenfunctions) and/or to the response of interest (here, the exoplanet signal) having a complicated relationship with the observed data. In a paper from a couple of years ago, we showed that 100s or 1000s of diffusion map basis vectors were optimal (in a cross-validated sense) for predicting photo-z's (http://arxiv.org/abs/0906.0995). In that case, we actually expanded the dimensionality of the problem from ~20 to 1000!!

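Following up on comment 2: the real criterion there is sensitivity to faint companions, which I won't try to reproduce here. As a stand-in, here is a toy sketch that picks the number of eigenvectors by straight cross-validation on held-out pixels (the thing Doug guessed in comment 1). It reuses the hypothetical heldout_pca_residuals() from the sketch above, and the grid of candidate K values is made up.

    # Toy model-selection sketch: scan candidate numbers of eigenvectors and
    # keep the one with the smallest held-out residual (cross-validation
    # stand-in for the sensitivity criterion discussed in comment 2).
    import numpy as np

    def choose_n_components(X, candidate_K=(50, 100, 200, 400, 800)):
        """Return the K with the smallest mean squared held-out residual."""
        scores = []
        for K in candidate_K:
            resid = np.vstack([heldout_pca_residuals(X, [i], n_components=K)
                               for i in range(X.shape[0])])
            scores.append(np.mean(resid ** 2))
        return candidate_K[int(np.argmin(scores))]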