Eilers (MPIA) and I discussed puzzling results she was getting in which she could fit just about any data (including insanely random data) with the Gaussian Process latent variable model (GPLVM) but with no predictive power on new data. We realized that we were missing a term in the model: We need to constrain the latent variables with a prior (or regularization), otherwise the latent variables can go off to crazy corners of space and the data points have (effectively) nothing to do with one another. Whew! This all justifies a point we have been making for a while, which is that you never really understand a model until you implement it.
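The missing term is easy to write down once you see it: the GPLVM objective needs a penalty that keeps the latent variables near the origin. Here is a minimal toy sketch (my own illustrative code, not Eilers's implementation), with a squared-exponential kernel and a unit-Gaussian prior on the latents; without the `prior` term, nothing ties the latent positions together:

```python
import numpy as np

def rbf_kernel(X, lengthscale=1.0, amp=1.0):
    # squared-exponential kernel over latent positions X, shape (N, Q)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return amp * np.exp(-0.5 * d2 / lengthscale ** 2)

def gplvm_objective(X, Y, noise=0.1, prior_weight=1.0):
    # negative log marginal likelihood of data Y (N, D) under a GP
    # indexed by latent positions X, plus a Gaussian prior on X
    N, D = Y.shape
    K = rbf_kernel(X) + noise ** 2 * np.eye(N)
    _, logdet = np.linalg.slogdet(K)
    alpha = np.linalg.solve(K, Y)
    nll = 0.5 * D * logdet + 0.5 * np.sum(Y * alpha)
    # the term we were missing: without it the latents can wander off
    # to crazy corners of space and the data points decouple entirely
    prior = 0.5 * prior_weight * np.sum(X ** 2)
    return nll + prior
```

With the prior in place, sending the latents to huge values is heavily penalized, which is exactly what restores predictive power on held-out data.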
The day started with planning between Bedell (Flatiron), Foreman-Mackey (Flatiron), and me about a possible tri-linear model for stellar spectra. The model is that the star has a spectrum, which is drawn from a subspace in spectral space and Doppler-shifted, and the star is subject to telluric absorption, which is drawn from its own subspace in spectral space and also Doppler-shifted. The idea is to learn the telluric subspace using all the data ever taken with a spectrograph (HARPS, in this case). The point, of course, is to account for the tellurics by fitting them simultaneously and thereby get better radial velocities. This was all in preparation for Ben Montet (Chicago), who arrived later in the day for a two-week visit.
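As a sketch of what "tri-linear" means here: on a log-wavelength grid a Doppler shift is (to first order in v/c) just a translation, and if we work in log-flux the stellar and telluric components add. The function names and the single-basis-vector usage below are illustrative only, not our actual pipeline:

```python
import numpy as np

def doppler_shift(log_lam, log_flux, v, c=299792.458):
    # shift a spectrum tabulated on a log-wavelength grid by velocity v
    # (km/s); in log-lambda this is a translation (sign convention is
    # a choice), implemented here by linear interpolation
    return np.interp(log_lam, log_lam + v / c, log_flux)

def trilinear_model(log_lam, star_basis, a, v_star, tell_basis, b, v_tell):
    # star spectrum drawn from a subspace (rows of star_basis) and
    # shifted by v_star; telluric absorption drawn from its own
    # subspace and shifted by v_tell; in log-flux the product of the
    # two spectra becomes a sum
    star = doppler_shift(log_lam, a @ star_basis, v_star)
    tell = doppler_shift(log_lam, b @ tell_basis, v_tell)
    return star + tell  # log of (stellar flux times telluric transmission)
```

The model is linear in the stellar coefficients `a`, linear in the telluric coefficients `b`, and (locally) linear in the shifts, hence "tri-linear"; the telluric basis is what we would learn from the full HARPS archive.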
At lunch time, Mike Blanton (NYU) gave the CCPP brown-bag talk about SDSS-V. He did a nice job of explaining how you measure the composition of ionized gas by looking at thermal state. And etc!
In the morning I sat in on a meeting of the GALAH team, who are preparing for a data release to precede Gaia DR2. In that meeting, Jeffrey Simpson (USyd) showed me GALAH results on the Oh et al comoving pairs of stars. He finds that pairs from the Oh sample that are confirmed to have the same radial velocity (and are therefore likely to be truly comoving) have similar detailed element abundances, and the ones that aren't, don't. So awesome! But interestingly he doesn't find that the non-confirmed pairs are as different as randomly chosen stars from the sample. That's interesting, and suggests that we should make (or should have made) a carefully constructed null sample for A/B testing etc. Definitely for Gaia DR2!
In the afternoon, I joined the USyd asteroseismology group meeting. We discussed classification of seismic spectra using neural networks (I advised against) or kernel SVM (I advised in favor). We also discussed using very narrow (think: coherent) modes in red-giant stars to find binaries. This is like what my host Simon Murphy (USyd) does for delta-Scuti stars, but we would not have enough data to phase up little chunks of spectrum: We would have to do one huge simultaneous fit. I love that idea, infinitely! I asked them to give me a KIC number.
I gave two talks today, making it six talks (every one very different) in five days! I spoke about the pros and cons of machine learning (or what is portrayed as machine learning on TV) as my final Hunstead Lecture at the University of Sydney. I ended up being very negative on neural networks in comparison to Gaussian processes, at least for astrophysics applications. In my second talk, I spoke about de-noising Gaia data at Macquarie University. I got great crowds and good feedback at both places. It's been an exhausting but absolutely excellent week.
On this, day four of my Hunstead Lectures, Andy Casey (Monash) came into town, which was absolutely great. We talked about many things, including the mixture-of-factor-analyzers model, which is a good and under-used model in astrophysics. I think (if I remember correctly) that it can be generalized to heteroskedastic and missing data too. We also talked about using machine learning to interpolate models, and future projects with The Cannon.
At lunch I sat with Peter Tuthill (Sydney) and Kieran Larkin (Sydney) who are working on a project design that would permit measurement of the separation between two (nearby) stars to better than one millionth of a pixel. It's a great project; the designs they are thinking about involve making a very large, but very finely featured point-spread function, so that hundreds or thousands of pixels are importantly involved in the positional measurements. We discussed various directions of optimization.
My talk today was about The Cannon and the relationships between methods that are thought of as “machine learning” and the kinds of data analyses that I think will win in the long run.
Today I gave the third of my five talks in five days, as part of my Hunstead Lectures at Sydney. I spoke about MCMC sampling. A lot of what I said was a subset of things we write in our recent manual on MCMC. At the end of the talk there was some nice discussion of detailed balance, with contributions from Tuthill (USyd) and Sharma (USyd).
At lunch I grilled asteroseismology guru Tim Bedding (USyd) about measuring the large frequency separation delta-nu in a stellar light curve. My position is that you ought to be able to do this without explicitly taking a Fourier Transform, but rather as some kind of direct mathematical operation on the data. That is, I am guessing that there is a very good and clever frequentist estimator for it. Bedding expressed the view that there already is such a thing, in that there are methods for automatically generating delta-nu values. They do take a Fourier Transform under the hood, but they are nonetheless good frequentist estimators. But I want to work on sparser data, like Gaia and LSST light curves. I need to understand this all better. We also talked about how it is possible for a gastrophysics-y star to have oscillations with quality factors better than 10^5. Many stars do!
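To make the "frequentist estimator" idea concrete, here is a toy version of the standard trick (my sketch, not any published pipeline): build a power spectrum by explicit sinusoid projection (so irregular sampling is fine), then autocorrelate it; a comb of modes spaced by delta-nu produces an autocorrelation peak at lag delta-nu:

```python
import numpy as np

def estimate_dnu(t, flux, fmax, dnu_min, dnu_max, nfreq=4000):
    # crude estimator of the large frequency separation: the strongest
    # autocorrelation peak of the power spectrum, searched over a
    # plausible window of frequency lags [dnu_min, dnu_max]
    freqs = np.linspace(0.0, fmax, nfreq)
    df = freqs[1] - freqs[0]
    arg = 2.0 * np.pi * freqs[:, None] * t
    power = (np.cos(arg) @ flux) ** 2 + (np.sin(arg) @ flux) ** 2
    power = power - power.mean()
    ac = np.correlate(power, power, mode="full")[nfreq - 1:]  # lags >= 0
    lo, hi = int(dnu_min / df), int(dnu_max / df)
    return (lo + np.argmax(ac[lo:hi])) * df
```

This is still a Fourier Transform under the hood, which is exactly Bedding's point; the open question is what replaces it for sparse Gaia- or LSST-like sampling.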
That's all highly relevant to the work of Simon Murphy (USyd), who finds binary stars by looking at phase drifts in highly coherent delta-Scuti star oscillations. He and I spent an afternoon hacking on models for one of his delta-Scuti stars, with the hopes of measuring the quality factor Q and also maybe exploring new and more information-preserving methods for finding the binary companions. This method of finding binaries has similar sensitivity to astrometric methods, which makes it very relevant to the binaries that Gaia will discover.
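A cartoon of the phase-modulation idea (hypothetical function names; the real analyses are far more careful about windowing, mode selection, and uncertainties): fit a sinusoid at the fixed mode frequency in each chunk of the light curve and track the phase. A binary companion shows up as a periodic drift in these phases, from light-travel-time delays across the orbit:

```python
import numpy as np

def chunk_phases(t, flux, f0, chunk_days=10.0):
    # split the light curve into chunks; in each, fit amplitude and
    # phase of a sinusoid at fixed frequency f0 by linear least squares
    edges = np.arange(t.min(), t.max() + chunk_days, chunk_days)
    mids, phases = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (t >= lo) & (t < hi)
        if m.sum() < 10:
            continue
        w = 2.0 * np.pi * f0 * t[m]
        A = np.column_stack([np.cos(w), np.sin(w)])
        c, s = np.linalg.lstsq(A, flux[m], rcond=None)[0]
        mids.append(0.5 * (lo + hi))
        phases.append(np.arctan2(s, c))
    return np.array(mids), np.unwrap(np.array(phases))
```

The "more information-preserving" alternative we discussed would skip the chunking and fit one big model to all the data at once, which is what sparse data would force anyway.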
Today I gave my second of five Hunstead Lectures at University of Sydney. It was about finding planets in the Kepler and K2 data, using our non-stationary Gaussian Process or linear model as a noise model. This is the model we wrote up in our Research Note of the AAS. In the question period, the question of confirmation or validation of planets came up. It is very real that the only way to validate most tiny planets is to make predictions for other data. But when will we have data more sensitive than Kepler? This is a significant problem for much of bleeding-edge astronomy.
Early in the morning I had a long call with Jason Wright (PSU) and Bedell (Flatiron) about the assessment of the calibration programs for extreme-precision RV surveys. My position is that it is possible to assess the end-to-end error budget in a data-driven way. That is, we can use ideas from causal inference to figure out what parts of the RV noise are coming from telescope plus instrument plus software. Wright didn't agree: He believes that large parts of the error budget can't be seen or calibrated. I guess we better start writing some kind of paper here.
In the afternoon I had a great discussion with Buder (MPIA), Sharma (USyd), and Bland-Hawthorn (USyd) about the current status of detailed elemental abundance measurements in GALAH. The element–element plots look fantastic, and clear trends and high precision are evident, just looking at the data. To extract these abundances, Buder has made a clever variant of The Cannon which makes use of the residuals away from a low-dimensional model to measure the detailed abundances. They are planning on doing a large data release in April.
On the plane to Sydney, I started an outline for a paper with Bedell (Flatiron) on detailed elemental abundances, and the dimensionality or interpretability of the elemental subspace. I also started to plan the five talks I am going to give in five days as the Hunstead Lecturer. On arrival I went straight to University of Sydney and started lecturing. My first talk was on fitting a line to data, with a concentration on the assumptions and their role in setting procedures. That is, I emphasized that you shouldn't choose a procedure by which you fit your data: You should choose a set of assumptions you are willing to make about your data. Once you do that, the procedure will flow from the assumptions. After my talk I had a great lunch with graduate students at Sydney. The range of research around the table was remarkable. I plan to spend some of the week learning about asteroseismology.
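The fitting-a-line point can be made in a few lines of code: once you assume Gaussian noise with known per-point variances, no x-errors, and no outliers, maximum likelihood is forced to be weighted least squares; the procedure flows from the assumptions. A minimal sketch (illustrative, not lecture material verbatim):

```python
import numpy as np

def fit_line(x, y, sigma_y):
    # maximum-likelihood line fit under the stated assumptions:
    # y = m*x + b plus zero-mean Gaussian noise with known sigma_y
    A = np.vander(x, 2)               # design matrix, columns [x, 1]
    c_inv = 1.0 / sigma_y ** 2        # inverse variances (the weights)
    ATA = A.T @ (c_inv[:, None] * A)
    ATy = A.T @ (c_inv * y)
    best = np.linalg.solve(ATA, ATy)  # [slope, intercept]
    cov = np.linalg.inv(ATA)          # parameter covariance matrix
    return best, cov
```

Change any assumption (x-errors, outliers, unknown variances) and this procedure is no longer justified; a different generative model implies a different fit.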
In Friday parallel-working session, Bedell (Flatiron) showed me all 900-ish plots of every element against every element for her sample of 80 Solar twins. Incredible. Outrageous precision, and outrageous structure. And it is a beautiful case where you can just see the precision directly in the figures: There are clearly real features at very small scales. And hugely informative structures. This is the ideal data set for addressing something that has been interesting me for a while: What is the dimensionality of the chemical-abundance space? And can we see different nucleosynthetic processes directly in the data?
Late in the day, Jim Peebles (Princeton) gave the Astro Seminar. He spoke about three related issues in numerical simulations of galaxies: They make bulges that are too large and round; they make halos that have too many stars; and they don't create a strong enough bimodality between disks and spheroids. There were many galaxy simulators in the audience, so it was a lively talk, and a very lively dinner afterwards.
I had my weekly call with Bonaca (Harvard), about information theory and cold stellar streams. We discussed which streams we should be considering in our paper. We have combinatoric choices, because there are N streams and K Milky-Way parameters; we could constrain any combination of parameters with any combination of streams! And it is even worse than that, because we are talking about basis-function expansions for the Milky-Way potential, which means that K is tending to infinity! We tentatively decided to do something fairly comprehensive and live with the fact that we won't be able to fully interpret it with finite page charges.
The Gaia DR2 workshop and Stars Group meeting were both very well attended! At the former, Price-Whelan (Princeton) showed us PyGaia, a tool from Anthony Brown's group in Leiden to simulate the measurement properties of the Gaia Mission. It is really a noise model. And incredibly useful, and easy to use.
In the Stars meeting, so many things! Andrew Mann (Columbia) spoke about the reality or controversies around Planet 9, which got us arguing also about claims of extra-solar asteroids. Kopytova (ASU) described her project to sensitively find chemical-abundance anomalies among stars with companions, and asked the audience to help find ways that spurious effects could sneak in. Her method is very safe, so I think it would take a near-conspiracy, but Brewer (Yale) disagreed. Veselin Kostov (Goddard) talked about searching for circumbinary planets. This is a good idea! He has found a few in Kepler and believes there are more hidden. It is interesting for TESS for a number of reasons, one of which is that you can sometimes infer the period of the exoplanet from only a short stretch of transit data (much shorter than the period), by capitalizing on a double transit across the binary.
Didier Queloz (Cambridge) was in town for the day. Bedell (Flatiron) and I discussed with him next-generation projects for HARPS and new HARPS-like instruments. He is pushing for extended campaigns on limited sets of bright stars. I like this idea for its statistical and experimental-design simplicity! But (as he notes) it is hard to get the heterogeneous community behind such big projects. He has a project to pitch, however, if people are looking to buy in to new data sources. He, Bedell, and I discussed what we know about limits to precision in this kind of work. We aren't far apart, in that we all agree that HARPS (and its competitors) are extremely well calibrated machines, much better calibrated than the end-to-end precision obtained.
Today Kate Storey-Fisher (NYU) and I met with Mike Blanton (NYU) and Zhongxu Zhai (NYU) to discuss possible projects that Storey-Fisher and I have been talking about. We are thinking about trying to systematize (and pre-register) the search for anomalies in cosmological surveys. The idea (which is still vague) is to somehow lexicographically order all anomalies we could search for, and then search, such that we can keep exquisite track of the number of independent hypotheses we have checked.
Blanton and Zhai had some advice for us. One category of advice was around systematics: Anomalies and systematics in the data might appear similar! So we should think about anomalies that are somehow least sensitive to these systematics. One good thing is that we are working at the home of many of the tools that we need to make these assessments. Another category of advice was to think about what anomalies are motivated by questions of theory in the dark sector, in galaxy formation, or in the initial conditions. Theory-inspired (if not predicted) anomalies are more productive, in a scientific-literature sense, than randomly specified anomalies. We are close to being able to specify a project!
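On keeping "exquisite track" of hypotheses: even the crudest accounting makes the point. If we test n independent anomaly hypotheses, the global significance of the best local p-value follows from a standard Šidák-style correction (this is textbook multiple-testing arithmetic, not anything specific to our planned project):

```python
def global_significance(p_local, n_hypotheses):
    # probability that at least one of n independent tests fluctuates
    # to a local p-value this small or smaller, under the null
    return 1.0 - (1.0 - p_local) ** n_hypotheses
```

A locally exciting p = 0.01 anomaly is entirely unremarkable after a hundred independent searches, which is exactly why the ordering and counting of hypotheses has to be fixed (pre-registered) before looking.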
Taisiya Kopytova arrived in NYC for a few days to work on stellar abundances and orbital companions. Her project is very well designed: She has a set of red-giant stars in APOGEE where we know they have companions. For each of these stars with companions, she has found a set of matched stars—matched in stellar parameters—that don't have companions (or not companions that are detectable). She then compares the detailed chemical abundances between these two samples. The approach is extremely conservative and very robust to problems in the data: For a false effect to appear, it has to be an effect that causes a companion to be detected (or not detected)! And she finds signals.
One disturbing thing is that the results depend on signal-to-noise, and we get slightly different answers from the APOGEE DR13 and DR14 data. So we might need to match on signal-to-noise as well as on stellar parameters.
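The matched-control construction can be sketched in a few lines (my toy version, with hypothetical names, not Kopytova's actual code): for each star with a companion, take the nearest control star in standardized parameter space; signal-to-noise would simply become one more column to match on:

```python
import numpy as np

def matched_controls(params_with, params_without, n_match=1):
    # for each row of params_with (stars with companions), return the
    # indices of the n_match nearest rows of params_without (controls)
    # in standardized parameter space (e.g. Teff, logg, [Fe/H], S/N)
    mu = params_without.mean(axis=0)
    sd = params_without.std(axis=0)
    a = (params_with - mu) / sd
    b = (params_without - mu) / sd
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.argsort(d2, axis=1)[:, :n_match]
```

Because the two samples are matched in everything except companionship, any systematic that tracks stellar parameters cancels in the abundance comparison, which is what makes the design so conservative.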
In Friday parallel-working session, Megan Bedell (Flatiron) and I discussed (for the nth time) the scope of the first paper and next papers in our extreme-precision radial-velocity work. We realized that paper 1 is pretty much ready to go! We also realized that the point should not be about what people might be doing wrong, but about what things you can do that are easy and close to correct. In particular, a data-driven spectral model can come close to saturating the Cramér–Rao bound on radial velocity. This was not obvious at the outset, because some of the information in the data must go into the spectral model (and not the RV measurement). That's a good point!
Renée Hlozek (Toronto) gave the Astrophysics Seminar. In part she talked about the negative S–Z effect from voids. In another part, she talked about constraining light scalar dark matter with large-scale structure. Both problems I am interested in for near-future research. In the afternoon, she and Alex Malz (NYU) and I talked about advising and mentoring. Hlozek is a deep thinker about these things.
In our weekly meeting, Ana Bonaca (Harvard) and I discovered a super-subtle bug in how the covariance matrices we are making (context: the Cramér–Rao bound on Milky-Way parameters given observations (and a model) of cold stellar streams) are being plotted. Damn geometry is hard! But she fixed the code and all our covariances look really good now. We think we understand the trade-offs between different parameters, given different data. Time to write! And use the framework for planning new observations.
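For the record, the geometry that bit us is simple to state: the nsigma error ellipse of a 2x2 covariance matrix is a unit circle scaled by the square roots of the eigenvalues and rotated by the eigenvectors. A minimal sketch (my illustration of the standard construction, not Bonaca's plotting code):

```python
import numpy as np

def covariance_ellipse(C, nsigma=1.0, npts=100):
    # return (2, npts) points on the nsigma error ellipse of a 2x2
    # covariance matrix C: eigendecompose, scale a unit circle by
    # sqrt(eigenvalues), then rotate into the data frame
    vals, vecs = np.linalg.eigh(C)
    theta = np.linspace(0.0, 2.0 * np.pi, npts)
    circle = np.stack([np.cos(theta), np.sin(theta)])
    return vecs @ (nsigma * np.sqrt(vals)[:, None] * circle)
```

The classic bug is scaling by the eigenvalues instead of their square roots (or rotating the wrong way), which produces ellipses that look plausible but are quantitatively wrong.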
I spent the rest of the day not working on my NSF proposal, which is very bad!
Today Kathryn Johnston (Columbia) came through Flatiron to discuss Gaia DR2 projects in the Milky-Way halo. She made the very nice point that we could use a Gaia simulator like PyGaia to “observe” the Bullock & Johnston all-substructure simulations to see how halo substructure appears in a realistic DR2 data set. We discussed clustering algorithms and the relationships between applying clustering directly to the observed data vs transforming the data to some better space (invariants, say) vs doing some kind of inference or data-driven model that respects the Gaia noise model and so on. We are looking for methods that will be powerful, but simple, since we are looking for fast projects to do in the immediate follow-up period to the data release.
Our conversation veered into chemical-abundance space, where we all realized that Megan Bedell (Flatiron) is sitting on an amazing chemical-tagging data set. She only has 80 stars, but because they are Solar twins, they have exceedingly good chemical measurements. Can we use these to measure scattering processes in the Milky-Way disk?
We also briefly discussed something inspired by Alyssa Goodman (Harvard), who spoke first thing in the morning at the Scientific Visualization conference that is on at Flatiron: Can we measure our position relative to the disk plane, and maybe see fluctuations in that plane? Goodman says that the Sun is 25 pc above the plane, and that is obvious (she says) from the radio observations of HI gas. But Bovy (if I recall correctly) looked at this in Gaia DR1 and finds that our offset from the midplane is less than 10 pc. Is there an offset between stars and gas? If so, why? If not, who is wrong? Great set of questions for DR2.