products of Gaussians are Gaussians

Adrian Price-Whelan (Flatiron) and I dusted off today our dormant project to write a paper about products of Gaussians. The idea is to provide closed-form refactoring of a product of a Gaussian prior times a Gaussian likelihood into a product of a posterior pdf times a marginalized likelihood. It involves some simple math that I first learned from Sam Roweis (deceased) but which was refined beautifully by Boris Leistedt (Imperial). I can't tell whether this project is really useful, or completely inside baseball. We were originally inspired to write this because of a stupid mistake I made when we built The Joker.


population inference: Bayesian or frequentist?

I had a conversation today with Adrian Price-Whelan (Flatiron) about doing population inferences with our catalog of binary companions from APOGEE. We have two options: Either build a hierarchical Bayesian model, with all the likelihoods that we have created for every star, whether or not it shows good evidence for a companion, or else we perform a set of injection/recovery tests to see what stars show good evidence for what kinds of companions and perform a frequentist populations analysis given those tests. You might think that there is a right answer to this, but there isn't: The different approaches involve different kinds of operations, and have different sensitivities to input assumptions.


Gauss-Markov theorem

Soledad Villar (NYU) and I discussed the Gauss–Markov theorem for the Nth time today. It's a strange theorem: It says that the best linear regressor in a regression problem is the one delivered by linear least squares. But it seems to conflate two kinds of linearity. It is both the case that the regressor must be linear in the features, and that the regression coefficients must be determined by a linear operation on the data. These are two different kinds of linearity I think! I'm confused about the assumptions that are required for this theorem; in particular it seems to be brittle when we generate the features by some non-trivial process in a latent space. I sure don't understand it yet!


the state of a spectrograph in calibration space

I did two bits of research with Lily Zhao (Yale) and Megan Bedell (Flatiron) today. One is that we opened up (with a shared screen) the outline of the Zhou's paper on Excalibur, our project to build a hierarchical, non-parametric calibration of the EXPRES spectrograph (and others). We re-organized it and outlined it and figured out remaining tasks for completion.

Zhao and I also discussed some matters relating to measuring the location in the spectrograph of the trace, which is 80-ish stripes on the detector laid down by the 80-ish echelle orders. The idea that we have—which is speculative—is that the locations of the traces might provide a kind of “simultaneous reference” for the wavelength solution. We can show that the spacecraft calibration state varies in a low-dimensional space; that means that we don't need all that much information to find and track the state. Maybe the trace positions will provide that information?! If so, we have given all spectrographs a simultaneous reference, whether they wanted it or not!


finding outer Solar System objects in TESS

Ben Montet (UNSW) put up the bat symbol for a project to find outer Solar System objects using NASA TESS data. It was the scale of problem I wanted: A stack of postage stamps from eleanor and a reflex angular velocity vector (really parallax vector) for Sedna. The issue is all those other pesky stars, and asteroids on closer orbits! I did some exploratory data analysis, in which I detrended each image using a PCA of all the images. But at the same time, others did pixel-based detrending like in CPM and others did more traditional image differencing with a reference image. It's nice: In all methods, Sedna sticks out beautifully! The question is: What methods will be best for finding objects further out in the Solar System? These are harder because they are fainter, but also because they don't move as far relative to the stars in a month!


How many neighbors?

By something of a coincidence, both Ana Bonaca (Harvard) and Adam Wheeler (Columbia) are performing data analyses that require finding and performing simple operations on the K nearest neighbors of each data point in a large data set. I love these kinds of models, because if your fancy machine-learning method isn't far better than K-nearest-neighbors, why would you use it? K-nearest neighbors is fast, simple to explain, and entirely reproducible. Most machine-learning methods are slow, almost impossible to explain to the uninitiated, and irreproducible (sensitive to seeds and initialization).

In both cases (Bonaca and Wheeler), we are looking at performance (in cross-validation, say) as a function of K. My surprise today is that both of them are getting big K values (in the hundreds) as optimal. We discussed why this must be the case. In general I think the model likes large K when the data lie in a locally smooth distribution in the space, or the features you are trying to predict only depend in smooth ways on the data. Or something like that?


what is an attack?

I did some math in a code notebook this morning for Teresa Huang (NYU), related to our project on adversarial attacks. One of our issues or innovations is to describe or evaluate an adversarial attack in a regression (rather than classification) setting. This, I believe, involves comparing how much one can move the labels (the regression output) relative to some kind of expectation: The model is successfully attacked if a small data perturbation leads to a large perturbation in the output label estimate. Large compared to what? I think some kind of prior expectation. Today I worked out how we estimate our prior expectation, given some sparse data from the physical models used to simulate the spectra and produce parameter estimates. We aren't trying to attack the physical models; we are trying to attack machine-learning surrogates for those models, like The Cannon and AstroNN.


generalizing a model to two telescopes

My day started with a call with Christina Eilers (MIT) about extending the spectroscopic parallax work we did with APOGEE DR14 and ESA Gaia to APOGEE DR16. This is a non-trivial extension because DR16 includes (for the first time) data from two different telescopes and this different detailed calibration, LSF, and so on. Eilers's first attempts at this seem very successful; with minimal changes she is getting comparable precision to what we got with the simpler single-telescope data. That's promising, because we didn't expect this to be easy.

She also showed that it works less well to train the model on one telescope and apply the model on the other. So something about the joint training is working. And yes, we test using held-out validation data.


photometric variations of short-period binaries

In the last day or two, Adrian Price-Whelan matched up our radial-velocity binary sample from APOGEE DR16 to NASA Kepler and TESS data. We had been inspired by asteroseismic science, so we were looking at short-period binaries. But: Oh. My. Goodness.

Every single binary below some period shows interesting time-domain behavior in its photometry! It shows ellipsoidal variations, or transits, or “heartbeat” oscillations, or tidally locked rotation. There is so much science in these light curves. And although many of the sources we find have been found previously by either Kepler/TESS teams or else APOGEE teams, we are sensitive to many “sub-threshold” variabilities at low amplitude, because low amplitude in the photometry, at short periods, does not necessarily translate into low amplitude in radial velocity. Now: What to do with all this? Launch on a big, comprehensive project, or many small, focused ones?


paper outline

On a low-research day, the only work I really did was discuss with Teresa Huang (NYU) and Soledad Villar (NYU) the outline and to-do items for an AAS-Journals re-write or re-boot of our paper on adversarial attacks against machine-learning methods in astronomy.


data-driven model of spectra; observing strategies

Today some actual work happened, although only a tiny bit! I spoke with Adam Wheeler (Columbia) about our project to find a local, low-dimensional representation of LAMOST spectra, which can then be used to find abundance outliers. The model is a linear latent-variable model, but executed locally among the K nearest neighbors in the training set near the test object. It is unsupervised, because it doesn't rely on any labels (except the knowledge of where in the spectrum each element has lines). I'm interested in whether approaches like this could be built up into a full abundance system, with clever self-calibration.

I met with Megan Bedell (Flatiron) who has been playing around with observing strategies for Terra Hunting Experiment. We would like to demonstrate that randomized observing strategies bring more information than regular observing strategies. But if we simulate the data as having white (uncorrelated) noise of known variance, the sensitivity of the data to planets of various periods depends only (or almost only) on the total exposure time and total survey duration, and barely on how the observations are scheduled. So what gives? I think the big deal is correlated noise: If the spectrographs drift in their calibration properties in some correlated way over time during the survey, then the regular data will alias those drifts into periodic signals. Or that's my conjecture. I'm not confident in it, but we can test it. Oh and by the way: You always have low-level correlated drifts in your instrument calibration, which are below the calibration precision but large enough that they can affect your results when you combine many thousands of exposures over many years.


spectrograph calibration: Which centroiding is best?

Lily Zhao (Yale) and I looked at the different centroiding methods, indexed by two integers: One integer is how many pixels to use near the peak of each line for the peak fitting. The other is what order of polynomial fitting (to the logarithm of the flux; quadratic in log flux is Gaussian in flux). She made many diagnostic plots and statistics. Now: How to decide which setting of the two integers is best? We tentatively settled on the mean absolute difference of pairs of frames that are taken within the same day (or night really). This uses an assumed stability of the spectrograph, but I think it's safe, because it's hard to imagine that our centroiding will correct for true spectrograph drifts! More soon; if this works, we can further improve the EXPRES pipelines.


how to measure a spectral line in a spectrogram

Lily Zhao (Yale) showed me issues late last week with the outputs of the EXPRES spectrograph line centroiding for its calibration (laser frequency comb and ThAr lamp) exposures. So early this morning I hacked up a parabolic fitting code and then a cubic fitting code. The former are analogous to what we used in SDSS imaging for stellar centroids and Vakili and I analyzed a few years ago. The cubic methods might be less stable, but they are able to capture line asymmetries. I passed these on to Zhao for testing against their pipeline, which fits Gaussians. The key idea, from my perspective, is that the central few pixels contain almost all of the information about the line center; as you go further out on the line, you get much more sensitive to details of your line-spread function without adding much to your centroid quality. It's engineering, of the kind I love.


asteroseismology with APOGEE?

Adrian Price-Whelan (Flatiron) and I discussed some of the results in our paper on binary companions in the APOGEE survey, because these results are launching points for many possible papers. We focused today on the stars with companions with too-short periods (like shorter than is physically possible); are these asteroseismic modes? Price-Whelan found TESS and ASASSN light curves for a few of them and they do look like they have large brightness variations at similar periods! So maybe we can do asteroseismology with SDSS-V! At the end of the day, Price-Whelan made me a notebook with a poster child example star for experiments.

[This post violates Betteridge's Law and/or Hinchliffe's Law.]


likelihood function

I've spent a lot of time in the last few weeks thinking about the differences between discriminative and generative models. One big difference is that a generative model has a concept (implicit or explicit, depending) of a likelihood function, or probability for the data given some kinds of parameters. I made this explicit today, making the generative model that is being studied by Villar (NYU) and Huang (NYU) and me a likelihood function that includes within its parameter space the true, correct generative likelihood function. We implemented an iterated least-squares optimizer for it and it seems to work and give sensible results. We aren't understanding some of the dependencies we are seeing on hyper-parameters.


more generative model work

Today it was back on the notebooks with Villar (NYU). We made improvements to the generative-model optimization, making it a model that explicitly optimizes a log-likelihood objective. That should be good for information-theoretic propertis. And indeed it seems to perform very well. We noticed some things, like that the results seem to be extremely sensitive to the dimensionality of the latent space. It may be that the models massively over-fit when the latent space gets too large, which perhaps isn't surprising. We started to develop some conjectures about the model performance in different settings; these could inform our work and writing on the subject.


generative latent-variable regression model

What does it look like to have a label-constrained latent-variable model that generates data and labels, for use in regression and inference? That's been the subject of discussion between me and Soledad Villar (NYU) and various students for many months now. We finally got serious and (in a collaborative notebook) Villar implemented her vision of this today. It works! And it works well; it appears to defeat similarly flexible discriminative models (where we are just trying to find a function of the data that predicts labels). If this holds up, it confirms my conjectures and my prejudice, so I am pretty enthusiastic about this.


made-to-measure; centroiding

I had a good conversation with Jason Hunt (Flatiron) today about the made-to-measure method of fitting observed kinematic data with equilibrium dynamical models. Hunt and collaborators have taken M2M from being an old-school method that works in the (indirectly and noisily observed) space of six-dimensional phase-space coordinates to something more observational. But not all the way to a fully data-centric model. I think it might be possible. And it relates to Hunt's big goal of applying M2M responsibly to the entire ESA Gaia data set. That's the kind of ambitious goal I can get behind! Late in the day I spent some time working out ideas of how we might make the M2M comparison between model and data as responsible as possible.

I also spoke with Lily Zhao (Yale) about centroiding spectral lines in a laser-frequency comb exposure with the EXPRES spectrograph. I said that we ought to try our polynomial fitting approximate methods there. They work well in multiple settings.


my attitude towards referee responses

Early in the day, Adrian Price-Whelan (Flatiron) called to discuss his latest response to referee. The referee report is (as many are) extremely constructive and valuable; the paper will be much better for every change we make in response. We discussed our philosophies about responding to referees. My approach is the following:

Every referee comment, confusion, or question is a real problem with the paper. The huge value of a referee report or submitting to a refereed journal is that you have called in a close read of your paper by a professional astronomer. That's amazingly valuable. Any place where that referee's close read led to a confusion or a comment or a question is a place where (by example!) a professional astronomer might have a comment or a question or a confusion! Even if you think the referee misunderstands, the misunderstanding is, at least at some level, because you didn't explain something well enough in the paper. So I recommend making a change in the paper text in response to every referee comment. Every comment is a real issue with the paper.


designing papers about adversaries

Teresa Huang (NYU), Soledad Villar (NYU), and I discussed the scope and format of papers about regression and adversarial attacks. The current thinking is that the paper for the astronomical literature would contain attacks against one generative and one discriminative regression for stellar parameters, and maybe also a toy example that is linear that is there for pedagogical purposes. We also discussed a more abstract paper on the subject that would be more mathematical, but we don't have the results for that yet!


a steady-state disk perturbation?

At the beginning of the day I discussed steady-state models of non-axisymmetric perturbations in the Milky Way disk with Frankel (MPIA), Rix (MPIA), and Eilers (MIT). Frankel had found some issues with the equations in our recent arXiv paper on this subject! So we discussed and we are modifying our equations to fix. Thank you, Frankel!

But a big conceptual point under discussion: With the exception of some bars, there are no truly global, steady-state, non-axisymmetric disk perturbations! When you model a perturbation as steady-state, you are really just saying that in some patch (or some annulus), there is an approximately steady-state solution, or that the pattern is changing more slowly than relevant dynamical times. That means that although we use a global pattern in our paper, really anything we find or derive is only locally true.