predict data, not latent variables

At Galaxy Coffee today, Cristina Garcia (MPIA) spoke about the quasar–galaxy cross-correlation at redshift four. The quasars have huge clustering amplitude, so the cross-correlation is expected to have large amplitude too. She uses a clever galaxy selection technique and finds consistency between the data and expectations. Also at Galaxy Coffee, Željko Ivezić (UW) showed that there are simple situations in which the mean of the data is not the best estimator of the location of a distribution function. In one example, not only was there a better estimator, but its error shrank with more data as 1/N (rather than 1/sqrt(N)). He strongly advocated using likelihood functions to generate estimators.
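
A textbook case with this 1/N behavior is a uniform distribution of known width, where the midrange (the average of the smallest and largest data points) beats the mean as a location estimator. That choice of distribution is an assumption made here for illustration; this post does not record which example was actually shown. The sketch below uses only the Python standard library, and the function name `location_errors` and its parameters are hypothetical. It compares the RMS error of the two estimators by simulation.

```python
import random

def location_errors(n, trials=500, seed=0):
    """Return (rms_error_of_mean, rms_error_of_midrange) for n uniform draws.

    Assumption for illustration: data are uniform on [-0.5, 0.5],
    so the true location is 0 and the width is known.
    """
    rng = random.Random(seed)
    mean_sq = 0.0
    mid_sq = 0.0
    for _ in range(trials):
        x = [rng.uniform(-0.5, 0.5) for _ in range(n)]
        mean_est = sum(x) / n                # sample mean
        mid_est = 0.5 * (min(x) + max(x))    # midrange of the sample
        mean_sq += mean_est ** 2
        mid_sq += mid_est ** 2
    return (mean_sq / trials) ** 0.5, (mid_sq / trials) ** 0.5

for n in (10, 100, 1000):
    err_mean, err_mid = location_errors(n)
    print(f"N={n:5d}  mean: {err_mean:.5f}  midrange: {err_mid:.5f}")
```

Each factor of ten in N should cut the error of the mean by roughly sqrt(10) and the error of the midrange by roughly ten. For this distribution the likelihood is flat over every location consistent with the extreme data points, and the midrange sits at the center of that interval, which is one way to see why a likelihood-based approach leads away from the mean here.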

Fadely showed up in Heidelberg today, and we discussed star–galaxy classification improvements that could help PanSTARRS and LSST. As our loyal reader will recall, one issue with supervised methods for the problem of star–galaxy classification is that we don't have any good sets of labels, even when we have HST data or spectroscopy; there is a lot of label noise at the faint end, and good labels only exist for very small slices of the total population. We realized today that we could try to predict not the labels, but the things that go into making the labels, like "psf minus model" in SDSS or roundness and sharpness and so on. We vowed to give it a try. Key idea: Predict data, not latent variables!
