a mistake in an E-M algorithm

[I am on quasi-vacation this week, so only posting irregularly.]

I (finally—or really for the N-th time, because I keep forgetting) understood the basis of E-M algorithms for optimizing (what I call) marginalized likelihoods in latent-variable models. I then worked out the equations for the E-M step for factor analysis, and a generalization of factor analysis that I hope to use in my project with Christina Eilers (MPIA).
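For the record, here is a minimal numpy sketch of one E-M update for factor analysis as I understand it (my own notation and implementation, not Roweis's or Ng's; the model is x = Lam z + noise with unit-Gaussian latents and diagonal noise):

```python
import numpy as np

def em_step(X, Lam, Psi):
    """One E-M update for the factor-analysis model x = Lam @ z + noise,
    with z ~ N(0, I) and noise ~ N(0, diag(Psi)).
    X is (N, D) data, Lam is the (D, K) factor loadings, Psi is (D,)."""
    N, D = X.shape
    K = Lam.shape[1]
    # E-step: Gaussian posterior over the latents z for each datum
    G = np.linalg.inv(np.eye(K) + Lam.T @ (Lam / Psi[:, None]))  # (K, K) posterior covariance
    Ez = X @ (Lam / Psi[:, None]) @ G       # (N, K) posterior means
    Ezz = N * G + Ez.T @ Ez                 # sum over data of E[z z^T | x]
    # M-step: closed-form updates for the loadings and the noise variances
    Lam_new = (X.T @ Ez) @ np.linalg.inv(Ezz)
    Psi_new = np.mean(X**2, axis=0) - np.einsum('dk,nk,nd->d', Lam_new, Ez, X) / N
    return Lam_new, Psi_new
```

The subtlety the post is about lives in exactly these update equations, so writing them down explicitly is the only way I trust myself to compare derivations.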

Imagine my concern when I got a different update step than I find in the writings of my friend and mentor Sam Roweis (deceased), who is the source of all knowledge, as far as I am concerned! I spent a lot of time looking up stuff on the web, and most things agree with Roweis. But finally I found this note by Andrew Ng (Stanford / Coursera), which agrees with me (and disagrees with Roweis).

If you care about the weeds, the conflict is between equation (8) in those Ng notes and page 3 of these Roweis notes. It is a subtle difference, and it takes some work to translate notation. I wonder if the many documents that match Roweis derive from (possibly unconscious) propagation from Roweis, or whether the flow is in another direction, or whether it is just that the mistake is an easy one to make? Oddly, Ng decorates his equation (8) with a warning about an error you can easily make, but it isn't the error that Roweis made.

So much of importance in computer science and machine learning is buried in lecture notes and poorly indexed documents in user home pages. This is not a good state of affairs!


serious bugs; dimensionality reduction

Megan Bedell (Chicago) and I had a scare today: Although we can show that in very realistically simulated fake data (with unmodeled tellurics, wrong continuum, and so on) a synthetic spectrum (data-driven) beats a binary mask for measuring radial velocities, we found that, in real data from the HARPS instrument, the mask was doing better. Why? We went through a period of doubting everything we know. I was on the point of resigning. And then we realized it was a bug in the code! Whew.

Adrian Price-Whelan (Princeton) also found a serious bug in our binary-star fitting. The quantity we were calculating as the pericenter distance is really the pericenter distance of the primary star's orbit around the system barycenter. That's not the minimum separation of the two stars! Duh. That had us rolling on the floor laughing, as the kids say, especially since we might have gotten all the way to submission without noticing that absolutely critical bug.
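For concreteness, here is the bug in toy numerical form (all numbers made up): the primary's orbit about the barycenter is shrunk by the mass ratio, so its pericenter is not the stars' minimum separation.

```python
# Toy illustration (hypothetical numbers) of the bug described above.
m1, m2 = 1.0, 0.5      # stellar masses in solar units (made up)
a, e = 1.0, 0.3        # relative-orbit semi-major axis (AU) and eccentricity

a1 = a * m2 / (m1 + m2)        # semi-major axis of the primary's orbit about the barycenter
peri_primary = a1 * (1.0 - e)  # what the buggy code reported as "pericenter distance"
peri_true = a * (1.0 - e)      # actual minimum separation of the two stars

print(peri_primary)  # smaller than the true value by the mass ratio m2 / (m1 + m2)
print(peri_true)
```

The two quantities agree only in the limit where the secondary mass dominates, which is exactly backwards for typical binaries.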

At the end of the day, I gave the Königstuhl Colloquium, on the blackboard, about dimensionality reduction. I started with a long discussion about what is good and bad about machine learning, and then went (too fast!) through PCA, ICA, kernel-PCA, PPCA, factor analyzers, HMF, E-M algorithms, latent-variable models, and the GPLVM, drawing connections between them. The idea was to give the audience context and jumping-off points for their projects.



micro-tellurics

Today, in an attempt to make our simulated extreme-precision radial-velocity fake data as conservative as possible, Megan Bedell (Chicago) and I built a ridiculously pessimistic model for un-modeled (and unknown) telluric lines that could be hiding in the spectra: lines at amplitudes too low to be clearly seen in any individual spectrum, but bristling across the full wavelength range. Sure enough, these “micro-tellurics” (as you might call them) do indeed mess up radial-velocity measurements. The nice thing (from our perspective) is that they mess up the measurements in a way that is covariant with barycentric velocity, and they mess up synthetic-spectrum-based RV measurements less than binary-mask-based RV measurements.
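Here is a toy numpy sketch of what I mean by injecting micro-tellurics (all line positions, depths, and widths are made up; this is not our actual simulation):

```python
import numpy as np

rng = np.random.default_rng(42)
wave = np.linspace(5000.0, 5100.0, 4000)   # toy wavelength grid in Angstroms

# Bristle the full range with weak, narrow absorption lines, each too
# shallow to spot by eye in a single noisy spectrum.
n_lines = 300
centers = rng.uniform(wave[0], wave[-1], n_lines)
depths = rng.uniform(0.0005, 0.005, n_lines)   # fractional depths, well below per-pixel noise
width = 0.05                                   # Gaussian line width in Angstroms

telluric = np.ones_like(wave)
for c, d in zip(centers, depths):
    telluric *= 1.0 - d * np.exp(-0.5 * ((wave - c) / width) ** 2)

# Multiply this into a stellar spectrum; because the telluric pattern is
# fixed in the observatory frame, it shifts relative to the star with
# barycentric velocity, which is why it biases RVs covariantly with it.
```

The key property is that no single line is detectable, but the whole forest of them carries enough signal to perturb a cross-correlation.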

At MPIA Galaxy Coffee, Irina Smirnova-Pinchukova (MPIA) gave a great talk about her trip on a SOFIA flight.


machine learning, twins, excitation temperature

After our usual start at the Coffee Nerd, it was MW Group Meeting, where we discussed (separately) Cepheids and Solar-neighborhood nucleosynthesis. On the latter, Oliver Philcox (St Andrews) has taken the one-zone models of Jan Rybizki (MPIA) and made them 500 times faster using a neural-network emulator. This emulator is tuned to interpolate a set of (slowly computed) models very quickly and accurately. That's a good use of machine learning! Also, because of backpropagation, it is possible to take derivatives of the emulator with respect to its inputs (I think), which gives you gradients for optimization and sampling.
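To illustrate the point about derivatives, here is a toy one-hidden-layer emulator (hypothetical, not Philcox's network) in which the same chain rule that trains the network also delivers the gradient of the output with respect to the inputs:

```python
import numpy as np

# Hypothetical tiny "emulator": y = w2 . tanh(W1 @ x + b1) + b2.
rng = np.random.default_rng(1)
n_in, n_hidden = 3, 16
W1 = rng.normal(size=(n_hidden, n_in))
b1 = rng.normal(size=n_hidden)
w2 = rng.normal(size=n_hidden)
b2 = 0.0

def emulator(x):
    return w2 @ np.tanh(W1 @ x + b1) + b2

def emulator_grad(x):
    """Analytic dy/dx via the chain rule (the input-side of backprop)."""
    h = np.tanh(W1 @ x + b1)
    return (w2 * (1.0 - h ** 2)) @ W1

x = np.array([0.1, -0.2, 0.3])
g = emulator_grad(x)

# sanity check against a central finite difference in the first input
eps = 1e-6
dx = np.array([eps, 0.0, 0.0])
fd = (emulator(x + dx) - emulator(x - dx)) / (2 * eps)
```

Once the emulator is trained, this gradient comes essentially for free, which is what makes it attractive for gradient-based optimization and sampling.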

The afternoon's PSF Coffee meeting had presentations by Meg Bedell (Chicago) about Solar Twin abundances, and by Richard Teague (Michigan) about the protoplanetary disk TW Hya. On the former, Bedell showed that she can make extremely precise measurements, because a lot of theoretical uncertainties cancel out. She finds rock-abundance anomalies (that is, abundance anomalies that are stronger in high-condensation-temperature lines) all over the place, which is context for results from Semyeong Oh (Princeton). On TW Hya, Teague showed that it is possible to get pretty consistent temperature information out of line ratios. I would like to see two-dimensional maps of those: Are there embedded temperature anomalies in the disk?


latent-variable model; bound-saturating EPRV

Today, Christina Eilers (MPIA) and I switched her project over to a latent variable model. In this model, stellar spectra (every pixel of every spectrum) and stellar labels (Teff, logg, and so on for every star) are treated on an equal footing as “data”. Then we fit an underlying low-dimensional model to all these data (spectra and labels together). By the end of the day, cross-validation tests were pushing us to higher and higher dimensionality for our latent space, and the quality of our predictions was improving. This seems to work, and is a fully probabilistic generalization of The Cannon. Extremely optimistic about this!
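As a cartoon of the setup (not Eilers's actual code), here is a linear, SVD-based stand-in for the probabilistic model, with each star's spectrum and labels stacked into one data vector; label prediction then amounts to inferring the latent vector from the spectrum block alone:

```python
import numpy as np

rng = np.random.default_rng(0)
n_stars, n_pix, n_labels, K = 100, 50, 3, 5

# Fake "data": spectra and labels generated from a K-dimensional latent space.
Z_true = rng.normal(size=(n_stars, K))
A_true = rng.normal(size=(K, n_pix + n_labels))
Y = Z_true @ A_true + 0.01 * rng.normal(size=(n_stars, n_pix + n_labels))

# Fit a rank-K linear latent-variable model by SVD (a deterministic
# stand-in for the fully probabilistic version described in the text).
mean = Y.mean(axis=0)
U, s, Vt = np.linalg.svd(Y - mean, full_matrices=False)
A = Vt[:K]                                 # (K, n_pix + n_labels) loadings

A_spec, A_lab = A[:, :n_pix], A[:, n_pix:]  # spectrum block, label block

def predict_labels(spectrum):
    """Infer the latent vector from the spectrum alone, then read off labels."""
    z, *_ = np.linalg.lstsq(A_spec.T, spectrum - mean[:n_pix], rcond=None)
    return A_lab.T @ z + mean[n_pix:]
```

The cross-validation we are doing is exactly the question of how big K should be before the label predictions stop improving.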

Also today, Megan Bedell (Chicago) built a realistic-data simulation for our EPRV project, and also got pipelines working that measure radial velocities precisely. We have realistic, achievable methods that saturate the Cramér–Rao bound! This is what we had planned to do this week, not today! However, we have a serious puzzle: We can show that a data-driven synthetic spectral template saturates the bound for radial-velocity measurement, and that a binary mask template does not. But we find that the binary mask is so bad that we can't understand how the HARPS pipeline is doing such a great job. My hypothesis: We are wrong that HARPS is using a binary mask.
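For reference, the bound in question can be computed per pixel from the spectrum gradient; here is a minimal sketch (assuming independent Gaussian per-pixel noise; this is my toy version, not our pipeline code):

```python
import numpy as np

C = 299792458.0  # speed of light in m/s

def rv_crlb(wave, flux, sigma):
    """Cramér-Rao lower bound on the radial-velocity uncertainty (m/s).
    A small Doppler shift v moves the spectrum by dS/dv = (lambda / c) dS/dlambda,
    so the Fisher information is a sum of squared gradients over pixels."""
    dS_dlam = np.gradient(flux, wave)
    dS_dv = wave * dS_dlam / C
    fisher = np.sum((dS_dv / sigma) ** 2)
    return 1.0 / np.sqrt(fisher)
```

Sharper lines (bigger gradients) or lower noise shrink the bound; saturating it means the pipeline extracts all of the velocity information the pixels contain.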


linear models for stars

My loyal reader knows that my projects with Christina Eilers (MPIA) failed during the #GaiaSprint, and we promised to re-group. Today we decided to make one last attempt, using either heteroskedastic matrix factorization (or another factor-analysis-like method) or else probabilistic principal components analysis (or a heteroskedastic generalization thereof). The problem with these models is that they are linear in the data space. The benefit is that they are simple, fast, and interpretable. We start tomorrow.
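A minimal sketch of what I mean by heteroskedastic matrix factorization (alternating weighted least squares; my toy implementation, not the code we will actually use):

```python
import numpy as np

def hmf(Y, W, K, n_iter=50, seed=0):
    """Heteroskedastic matrix factorization: find A (N, K) and G (K, D)
    minimizing sum_ij W_ij (Y_ij - (A @ G)_ij)^2, where W holds per-entry
    inverse variances, by alternating exact weighted least squares."""
    rng = np.random.default_rng(seed)
    N, D = Y.shape
    G = rng.normal(size=(K, D))
    A = np.zeros((N, K))
    for _ in range(n_iter):
        for n in range(N):               # update each object's coefficients
            Gw = G * W[n]                # basis weighted by this row's variances
            A[n] = np.linalg.solve(Gw @ G.T, Gw @ Y[n])
        for d in range(D):               # update each pixel's basis values
            Aw = A * W[:, d][:, None]
            G[:, d] = np.linalg.solve(Aw.T @ A, Aw.T @ Y[:, d])
    return A, G
```

Because each row and column update is an exact weighted least squares, heterogeneous (per-pixel, per-object) uncertainties are handled natively, which is the whole point relative to plain PCA.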

I made a plausible paper plan with Megan Bedell (Chicago) for our extreme-precision radial-velocity project, in which we assess the information loss from various methods for treating the data. We want to make very realistic experiments and give very pragmatic advice.

I also watched as Adrian Price-Whelan (Princeton) used The Joker to find some very strange binary-star systems with red-clump-star primaries: Since an RC star has gone up the giant branch and come back down, it really can't have a companion with a small periastron distance! And yet...



adaptive sampling in The Joker

Various hacking sessions happened in undisclosed locations in Heidelberg this weekend. The most productive moment came while debugging a think-o about how we combine independent samplings in The Joker: Adrian Price-Whelan (Princeton) and I found a very efficient way to make our samplings adapt to the information in the data (likelihood). That is, we used a predictive adaptation to iteratively expand the number of prior samples we use to an appropriate size for our desired posterior output. (Reminder: The Joker is a rejection sampler.) This ended up speeding up our big parallel set of samplings by a factor of 8-ish!
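The idea, in cartoon form (not the actual Joker code; every name and number here is illustrative), is to use a small initial batch to estimate the acceptance rate, and then predict how many prior samples the full run needs:

```python
import numpy as np

def adaptive_rejection_sample(log_like, draw_prior, n_target,
                              n_initial=1024, rng=None):
    """Sketch of an adaptive rejection sampler: run a small initial batch of
    prior samples, estimate the acceptance rate, then predict how many more
    prior samples are needed to reach n_target accepted (posterior) samples."""
    if rng is None:
        rng = np.random.default_rng()

    def run_batch(n):
        samples = draw_prior(n)
        ll = log_like(samples)
        accept = rng.uniform(size=n) < np.exp(ll - ll.max())  # rejection step
        return samples[accept]

    kept = run_batch(n_initial)
    rate = max(len(kept), 1) / n_initial                      # estimated acceptance rate
    n_more = int(1.1 * max(n_target - len(kept), 0) / rate)   # 10% safety margin
    if n_more > 0:
        kept = np.concatenate([kept, run_batch(n_more)])
    return kept[:n_target]
```

A real implementation would iterate this prediction until the target is actually met; the point is that the data (through the likelihood) set the batch size, rather than a fixed, worst-case number of prior samples.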


M-dwarf spectral types; reionization

Jessica Birky (UCSD) and I met with Derek Homeier (Heidelberg) and Matthias Samland (MPIA) to update them on the status of the various things Birky has been doing, and discuss next steps. One consequence of this meeting is that we were able to figure out a few well-defined goals for Birky's project by the end of the summer:

Because of a combination of a too-small training set and optimization issues in The Cannon, we don't have a great model for M-dwarf stars (yet) as a function of temperature, gravity, and metallicity. That's too bad! But on the other hand, we do seem to have a good (one-dimensional) model of M-dwarf stellar spectra as a function of spectral type. So my proposal is the following: We use the type model to paint types onto all M-dwarf stars in the APOGEE data set; these types will probably correlate very well with temperature across a range of metallicities. We can then use those results to make recommendations about what spectral modeling would lead to a good model in the more physical parameters.

Late in the day, José Oñorbe (MPIA) gave a great talk about the empirical study of reionization. He began with a long and much-needed review of all the ways you can measure reionization: radio imaging, the Lyman-alpha forest, damping wings, cosmic-microwave-background polarization, and so on. This brought together a lot of threads I have been hearing about over the last few years. He then showed his own work on the Lyman-alpha forest, where they exploit the memory that the low-density gas retains of its thermal history. They get good results even with fairly toy models, which is very promising. All indicators, by the way, suggest a very late reionization (redshifts 7 to 9 for the mid-point of the process). That's good for observability.


planning; marginalization

I had phone calls with Megan Bedell (Chicago) and Lauren Anderson (Flatiron) to discuss near-term research plans. Anderson and I discussed whether the precise MW mapping we were doing could be used to measure the length, strength, and amplitude of the Milky Way bar. It looks promising, although (by bad luck), the 2MASS sensitivity to red-clump stars falls off right around the Galactic Center (even above the plane and out of the dust). There are much better surveys for looking at the Galactic center region.

Bedell and I contrasted our plans to build a data-driven extreme-precision radial-velocity (EPRV) pipeline with our plans to write something more information-theoretic and pragmatic about how to maximize RV precision. Because our data-driven pipeline requires some strong applied math, we might postpone that to the Fall, when we are co-spatial with math expertise in New York City.

I was pleased by a visit from Joe Hennawi (UCSB) and Fred Davies (MPIA / UCSB) in which they informed me that some comments I made about sampling approximations to marginalizations changed their strategy in analyzing very high redshift quasars (think z>7) for IGM damping wing (and hence reionization). We discussed details of how you can use a set of prior-drawn simulations to do a nuisance-parameter marginalization (in this case, over the phases of the simulation).
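The sampling approximation in question is just a Monte Carlo average of the conditional likelihood over prior draws of the nuisance parameter; schematically (with a made-up one-parameter stand-in for the simulation phases):

```python
import numpy as np

rng = np.random.default_rng(0)
phi_draws = rng.uniform(0.0, 2.0 * np.pi, size=1000)  # prior-drawn "simulations"

def log_like_conditional(data, theta, phi):
    """Toy conditional likelihood: the model depends on the nuisance phase phi."""
    model = theta * np.sin(phi)   # stand-in for a simulated signal
    return -0.5 * (data - model) ** 2

def log_like_marginal(data, theta):
    """Marginal likelihood p(d | theta) ~ mean over prior draws of p(d | theta, phi),
    computed stably in log space."""
    ll = log_like_conditional(data, theta, phi_draws)
    return np.log(np.mean(np.exp(ll - ll.max()))) + ll.max()
```

Each prior-drawn simulation plays the role of one phi sample, so the (expensive) simulations double as the quadrature points for the nuisance-parameter integral.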


graphical models; bugs

At MPIA MW group meeting, Semyeong Oh (Princeton) described her projects to find—and follow up—co-moving stellar pairs and groups in the Gaia TGAS data. She presented the hypothesis test (or model comparison) by showing the two graphical models, which was an extremely informative and compact way to describe the problem. This led to a post-meeting discussion of graphical models and how to learn about them. There is no really good resource for astronomers. We should write one!

I spent the afternoon with Matthias Samland (MPIA) and Jessica Birky (UCSD), debugging code! Samland is adding a new kind of systematics model to his VLT-SPHERE data analysis. Birky is hitting the limitations of some of our code that implements The Cannon. I got a bit discouraged about the latter: The Cannon is a set of ideas, not a software package! That's good, but it means that I don't have a perfectly reliable and extensible software package.


Simpson's paradox

I spent part of the day working through Moe & Di Stefano (2017), which is an immense and comprehensive paper on binary-star populations. The reason for my attention: Adrian Price-Whelan (Princeton) and I need a parameterization for the binary-star population work we are doing in APOGEE. We are not going to make the same choices as those made by Moe, but there is tons of relevant content in that paper. What a tour de force!

I spent part of the afternoon crashing the RAVE Collaboration meeting at the University of Heidelberg. I learned many things, though my main goal was to catch up with Matijevic, Minchev, Steinmetz, and Freeman! Ivan Minchev (AIP), in his talk, discussed relationships between age, metallicity, and Galactocentric radius for stars in the disk. He has a beautiful example of Simpson's paradox, in which, for small population slices (age slices), the lower-metallicity stars have higher tangential velocities, but overall the opposite is true, causing measured gradients to depend very strongly on the measurement uncertainties (because: Can you slice the populations finely enough in age?). We discussed paths to resolving this with a proper generative model of the data.
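The paradox is easy to reproduce with made-up numbers: give each age slice a negative metallicity-velocity slope, but separate the slice means so that the pooled trend comes out positive.

```python
import numpy as np

rng = np.random.default_rng(0)

def slope(x, y):
    """Best-fit linear slope of y against x."""
    return np.polyfit(x, y, 1)[0]

# Two toy age slices (all numbers invented). Within each slice, velocity
# decreases with metallicity; across slices, the metal-rich slice is faster.
feh_young = rng.normal(0.2, 0.1, 500)
v_young = 220.0 - 30.0 * (feh_young - 0.2) + rng.normal(0.0, 2.0, 500)
feh_old = rng.normal(-0.4, 0.1, 500)
v_old = 180.0 - 30.0 * (feh_old + 0.4) + rng.normal(0.0, 2.0, 500)

assert slope(feh_young, v_young) < 0   # within-slice trends are negative...
assert slope(feh_old, v_old) < 0
feh = np.concatenate([feh_young, feh_old])
v = np.concatenate([v_young, v_old])
assert slope(feh, v) > 0               # ...but the pooled trend is positive
```

This is why the measured gradient depends on the measurement uncertainties: if the age errors are too large to slice finely, you recover the pooled slope instead of the within-slice one, and only a generative model that includes the age structure gets it right.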