## 2017-12-19

### latent variable models: What's the point?

The only research time today was a call with Rix (MPIA) and Eilers (MPIA) about data-driven models of stars. The Eilers project is to determine the stellar luminosities from the stellar spectra, and to do so accurately enough that we can do Milky-Way mapping. And no, Gaia won't be precise enough for what we need. Right now Eilers is comparing three data-driven methods. The first is a straw man,
which is nearest-neighbor! Always a crowd-pleaser, and easy. The second is The Cannon, which is a regression, but fitting the data as a function of labels. That is, it involves optimizing a likelihood. The third is the GPLVM (or a modification thereof) where both the data and the labels are a nonlinear function of some uninterpretable latent variables.
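For concreteness, the nearest-neighbor straw man is only a few lines of code. This is a hypothetical sketch, not the Eilers implementation; the array names (`X_train` for spectra, `y_train` for labels) are invented:

```python
import numpy as np

def nearest_neighbor_labels(X_train, y_train, X_test):
    """Assign each test spectrum the labels of its closest training spectrum.

    X_train : (N, M) array of N training spectra, M pixels each
    y_train : (N, K) array of K labels per training star
    X_test  : (Q, M) array of Q test spectra
    """
    # squared Euclidean distance between every test and training spectrum
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    return y_train[d2.argmin(axis=1)]

# tiny fake example: three training "spectra" of two pixels, one label each
X_train = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[0.9, 0.1]])
print(nearest_neighbor_labels(X_train, y_train, X_test))  # [[10.]]
```

In practice one would use inverse-variance-weighted distances and a k-d tree or similar, but even this brute-force version makes a serviceable baseline.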

We spent some of our time talking about exactly what the benefits are of going from the straight regression to a latent-variable model. We need benefits, because the latent-variable model is far more computationally challenging. Here are some benefits:

The regression requires that you have a complete set of labels, complete in two senses. The first is that the label set must be sufficient to explain the spectral variability; if it isn't, the regression won't be precise. The second is that every star in the training set must have every label known; that is, you can't live with missing labels. Both problems are solved simply in the latent-variable model.

The regression also requires that you not have an over-complete set of labels. Imagine that you have label A and label B and a label that is effectively A+B. This will lead to singularities in the regression, but it is no problem for a latent-variable model. In the latent-variable model, all data and all known labels are generated as functions (nonlinear functions drawn from a Gaussian process, in our case) of the latent variables, and those functions can generate any and all data and labels we throw at them.

Another (and not unrelated) advantage of the latent-variable formulation is that we can have a function space for the spectra that is of higher (or lower) dimensionality than the label space, which can cover variance that isn't label-related.
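The over-complete-label pathology is easy to demonstrate: if one label is an exact combination of two others, the design matrix of a (here, linear) regression loses column rank and the normal equations become singular. A toy sketch, with made-up labels A, B, and A+B:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
A = rng.normal(size=N)
B = rng.normal(size=N)

# design matrix with an intercept and the two independent labels
good = np.column_stack([np.ones(N), A, B])
# now add a redundant label that is exactly A + B
bad = np.column_stack([np.ones(N), A, B, A + B])

print(np.linalg.matrix_rank(good))  # 3: full column rank, regression is fine
print(np.linalg.matrix_rank(bad))   # still 3, but with 4 columns: rank-deficient
# so the normal-equations matrix is (numerically) not invertible:
print(np.linalg.cond(bad.T @ bad))  # enormous condition number
```

A latent-variable model never forms this matrix: both A, B, and A+B are just outputs generated from the latents, so the redundancy costs nothing.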

Finally, the latent-variable model has the causal structure that best represents how stars really are: we think star properties are set by some unobserved physical parameters (relating to mass, age, composition, angular momentum, dynamo, convection, and so on), and the emerging spectrum and other observables are set by those intrinsic physical parameters!

One interesting thing about all this (and brought up to me by Foreman-Mackey last week) is that the latent-variable aspect of the model and the Gaussian-process aspect of the model are completely independent. We can get all of the (above) advantages of being latent-variable without the heavy-weight Gaussian process under the hood. That's interesting.
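A minimal version of that Foreman-Mackey point, with the Gaussian process swapped out for purely linear functions of the latents, is essentially PCA on the stacked spectra and labels. This is a sketch under invented names and fake rank-P data, not anything from the project; it shows the latent-variable advantages (joint generation of data and labels, label prediction for a new star) with no GP in sight:

```python
import numpy as np

rng = np.random.default_rng(42)
N, M, K, P = 200, 10, 2, 3   # stars, pixels, labels, latent dimensions

# fake data: latents z generate both spectra and labels linearly, plus noise
z = rng.normal(size=(N, P))
spectra = z @ rng.normal(size=(P, M)) + 0.01 * rng.normal(size=(N, M))
labels = z @ rng.normal(size=(P, K)) + 0.01 * rng.normal(size=(N, K))

# fit: stack spectra and labels and take the top-P singular directions;
# the latents and both linear "generating functions" come out of one SVD
Y = np.column_stack([spectra, labels])
Y0 = Y - Y.mean(axis=0)
U, s, Vt = np.linalg.svd(Y0, full_matrices=False)
V = Vt[:P]                        # linear maps from latents to (spectra, labels)
V_spec, V_lab = V[:, :M], V[:, M:]

# predict labels for a star from its spectrum alone: least-squares for its
# latent using the spectrum block of V, then map that latent to label space
x_new = spectra[0] - spectra.mean(axis=0)
z_new, *_ = np.linalg.lstsq(V_spec.T, x_new, rcond=None)
label_pred = z_new @ V_lab + labels.mean(axis=0)
print(label_pred, labels[0])      # the two should agree to roughly the noise level
```

The GP in the GPLVM replaces the linear maps `V_spec` and `V_lab` with flexible nonlinear functions; the latent-variable bookkeeping is unchanged.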