Gaussian Processes for astronomers

I spent my research time today (much of it on planes and trains) working on my seminar (to be given tomorrow) about Gaussian Processes. I am relying heavily on the approach advocated by Foreman-Mackey, which is to start with weighted least squares (and the Bayesian generalization), then show how a kernel function in the covariance matrix changes the story, and then show how the Gaussian Process is latent in this solution. Not ready! But Foreman-Mackey made me a bunch of great demonstration figures.


DDD meeting, day 2

On the second day of the Moore Foundation meeting, I gave my talk (about flexible models for exoplanet populations, exoplanet transits, and exoplanet-discovering hardware calibration). After my talk, I had a great conversation with Emmanuel Candès (Stanford), who asked me very detailed questions about my prior beliefs. I realized in the conversation that I have been violating all my own rules: I have been setting my prior beliefs about hyper-parameters in the space of the hyper-parameters and not in the space of the data. That is, you can only assess the influence and consistency of the prior pdf (consistency with your actual beliefs) by flowing the prior through the probabilistic model and generating data from it. I bet if I did that for some of the problems I was showing, I would find that my priors are absurd. This is a great rule, which I often say to others but don't do myself: Always sample data from your prior (not just parameters). This is a rule for Bayes but also a rule for those of us who eschew realism! More generally, Candès's expressed the view that priors should derive from data—prior data—a view with which I agree deeply. Unfortunately, when it comes to exoplanet populations, there really aren't any prior data to speak of.

There were many excellent talks again today; again this is an incomplete set of highlights for me: Titus Brown (MSU) explained his work on developing infrastructure for biology and bioinformatics. He made a number of comments about getting customer (or user) stories right and developing with the current customer in mind. These resonated for me in my experiences of software development. He also said that his teaching and workshops and outreach are self-interested: They feed back deep and valuable information about the customer. Jeffrey Heer (UW) said similar things about his development of DataWrangler, d3.js, and other data visualization tools. (d3.js is github's fourth most popular repository!) He showed some beautiful visualizations. Heer's demo of DataWrangler simply blew away the crowd, and there were questions about it for the rest of the day.

Carl Kingsford (CMU) caused me (and others) to gasp when he said that the Sequence Read Archive of biological sequences cannot be searched by sequence. It turns out that searching for strings in enormous corpuses of strings is actually a very hard problem (who knew?). He is using a new structure called a Bloom Filter Tree, in which k-mers (length-k subsections) are stored in the nodes and the leaves contain the data sets that contain those k-mers. It is very clever and filled with all the lovely engineering issues that the Astrometry.net data structures were filled with lo so many years ago. Kingsford focuses on writing careful code, so the combination of clever data structures and well written code gets him orders of magnitude speed-ups over the competition.

Causal inference was an explicit or implicit component of many of the talks today. For example, Matthew Stephens (Chicago) is using natural genetic variations as a "randomized experiment" to infer gene expression and function. Laurel Larson (Berkeley) is looking for precursor events and predictors for abrupt ecological changes; since her work is being used to trigger interventions, she requires a causal model.

Blair Sullivan (NC State) spoke about performing inferences with provable properties on graphs. She noted that most interesting problems are NP hard on arbitrary graphs, but become easier on graphs that can be embedded (without crossing the edges) on a planar or low-genus space. This was surprising to me, but apparently the explanation is simple: Planar graphs are much more likely to have small sets of vertices that split the graph into disconnected sub-graphs. Another surprising thing to me is that "motif counting" (which I think is searching for identical subgraphs within a graph) is very hard; it can only be done exactly and in general for very small subgraphs (six-ish nodes).

The day ended with Laura Waller (Berkeley) talking about innovative imaging systems for microscopy, including light-field cameras, and then a general set of cameras that do non-degenerate illumination sequences and infer many properties beyond single-plane intensity measurements. She showed some very impressive demonstrations of light-field inferences with her systems, which are sophisticated, but built with inexpensive hardware. Her work has a lot of conceptual overlap with astronomy, in the areas of adaptive optics and imaging with non-degenerate masks.


DDD meeting, day 1

Today was the first day of a private finalist meeting for the new Moore Foundation Data Driven Discovery Individual Investigator grants. The format is a shootout of short talks and a few group activities. There were many extremely exciting talks at the meeting; here is just an incomplete smattering of highlights for me:

Ethan White (Utah State) showed amazing ecology data, with most data sources being people looking at things and counting things in the field, but then also some remote sensing data. He is using high-resolution time-series data on plants and animals to forecast changes in species and the ecosystem. He appeared to be checking his models "in the space of the data"—not in terms of reproduction of some external "truth"—which was nice.

Yaser Abu-Mostafa (Caltech) spoke about the practice and pitfalls of machine learning; in particular he is interested in new methods to thwart data "snooping" which is the name he gives to the problem "if you torture your data enough, it will confess".

Carey Priebe (JHU) opened his talk with the "Cortical Column Conjecture" which claims that the cortex is made up of many repeats of small network structures that are themselves, in some sense, computing primitives. This hypothesis is hard to test both because the graphs of neural connections in real brains are very noisy, and because inference on graphs (including finding repeated sub-graphs) is combinatorically hard.

Amit Singer (Princeton) is using micrographs of large molecules to infer three-dimensional structures. Each micrograph provides noisy data on a two-dimensional projection of each molecule; the collection of such projections provides enough information to both infer the Euler angles (three angles per molecule) and the three-dimensional structure (a very high-dimensional object). This project is very related to things LeCun, Barron, and I were talking about many years ago with galaxies.

Kim Reynolds (UT Southwestern Medical Center) is using genetic variation to determine which parts of a protein sequence are important for setting the structure and which are replaceable. She makes predictions about mutations that would not interrupt structure or function. She showed amazing structures of cellular flagella parts, and proposed that she might be able to create new kinds of flagella that would be structurally similar but different in detail.


empirical models for APOGEE spectra

I spent a chunk of the day with Melissa Ness (MPIA), fitting empirical models to APOGEE infrared spectra of stars. The idea is to do a simple linear supervised classification or regression, in which we figure out the dependence of the spectra on key stellar parameters, using a "training set" of stars with good stellar parameters. We worked in the pair-coding mode. By the end of the day we could show that we are able to identify regions of the spectrum that might serve as good metallicity indicators, relatively insensitive to temperature and log-g. The hopes for this project range from empirical metallicity index identification to label de-noising to building a full data-driven (supervised) stellar parameter pipeline. We ended our coding day pretty optimistic.


dust priors and likelihoods

Richard Hanson and Coryn Bailer-Jones (both MPIA) and I met today to talk about spatial priors and extinction modeling for Gaia. I showed them what I have on spatial priors, and we talked about the differences between using extinction measurements to predict new extinctions, using extinction measurements to predict dust densities, and so on. A key difference between the way I am thinking about it and the way Hanson and Bailer-Jones are thinking about it is that I don't want to instantiate the dust density (latent parameters) unless I have to. I would rather use the magic of the Gaussian Process to marginalize it out. We developed a set of issues for the document that I am writing on the subject. At Galaxy Coffee, Girish Kulkarni (MPIA) gave a great talk about the physics of the intergalactic medium and observational constraints from the absorption lines in quasar spectra.


quasar continuum blueward of Lyman alpha, Galactic center

If you go to the blue side of Lyman alpha, at reasonable redshifts (say 2), the continuum is not clearly visible, since the forest is dense and has a range of equivalent widths. Any study of IGM physics or radiation or clustering or ionization depends on an accurate continuum determination. What to do? Obviously, you should fit your continuum simultaneously with whatever else you are measuring, and marginalize out the posterior uncertainties on the continuum. Duh!

That said, few have attempted this. Today I had a long conversation with Hennawi, Eilers, Rorai, and KG Lee (all MPIA) about this; they are trying to constrain IGM physics with the transmission pdf, marginalizing out the continuum. We discussed the problem of sampling each quasar's continuum separately but having a universal set of IGM parameters. I advocated a limited case of Foreman-Mackey and my endless applications of importance sampling. Gibbs sampling would work too. We discussed how to deal with the fact that different quasars might disagree mightily about the IGM parameters. Failure of support can ruin your whole day. We came up with a clever hack that extends a histogram of samples to complete support in the parameters space.

In Milky Way group meeting, Ness (MPIA) showed that there appears to be an over-density of metal-poor stars in the inner one degree (projected) at the center of the Milky Way. She is using APOGEE data and her home-built metallicity indicators. We discussed whether the effect could be caused by issues with selection (either because of dust or a different explicit selection program in this center-Galaxy field). If the effect is real, it is extremely interesting. For example, even if the stars were formed there, why would they stay there?


extinction and dust, H-alpha photons

While "off the grid" for a long weekend, I spent time writing documents for Coryn Bailer-Jones (MPIA) and Dennis Zaritsky (Arizona). The former was about using spatial priors for inference of the three-dimensional dust density constrained by Gaia data. If you use a Gaussian Process spatial prior, you can perform the inference in extinction space (not dust space) and transfer extinction predictions to new points given extinction data without ever explicitly instantiating the dust density field. This is not a genius idea; it flows from the fact that any linear projection of a Gaussian pdf is itself a Gaussian pdf. The whole thing might not be computationally tractable, but at least it is philosophically possible. One issue with using a Gaussian Process here is that it puts support onto negative dust densities. I don't think that is a problem, but if it is a problem, the fixes are not cheap.

The latter document—for Zaritsky—is about finding the H-alpha photons that are coming from the outskirts of low-redshift galaxies by doing cross-correlations between SDSS spectroscopy and nearby galaxy centers. This project is designed to test or constrain some of the ideas in a talk at NYU by Juna Kollmeier a few months ago.


coffee, objectives in calibration

At MPIA Galaxy Coffee, Schmidt (UCSB) showed resolved spectroscopy of some highly magnified high-redshift galaxies to look for spatial variations of metallicity. Beautiful data! He also talked about the photon density at high redshift and the implications for reionization. McConnell (Hawaii) spoke about the need for more good black-hole mass determinations. He (inadvertently perhaps) showed that the BH-sigma (velocity dispersion) relation could easily have zero scatter, when you consider the uncertainties in both directions and the possibility that the sigma uncertainties are under-estimated. In general it is very hard to get a sigma on a sigma! Filed away for future thinking.

In the afternoon, I spoke with Wang, Schölkopf, Foreman-Mackey about Kepler calibration. We tentatively decided to make exoplanet discovery our primary objective. An objective is required, because we can set our "hyper-parameters" (our choices about model and model complexity) to optimize anything from exoplanet discovery to exoplanet characterization to stellar rotation determination. I am still worried about the flexibility of our models.


spiral structure in the Milky Way, K2

In Milky Way group meeting, Bovy showed results from a red-clump-star catalog derived from SDSS-III APOGEE and Schlafly showed extinction-corrected stellar maps sliced by distance from PanSTARRS. Both of these data sets plausibly contain good evidence of the spiral arms in stellar density, but neither of the first authors were willing to show me that evidence! I couldn't convince Bovy to help me out, but I think Schlafly might make the plot I want, which is the density as a function of Galactic X and Y for stars at low Z, divided by a smooth Galaxy model.

In a phone call late in the day, Ian Crossfield (Arizona) and Erik Petigura (Berkeley) and I discussed K2 data. I (perhaps unwisely) agreed to see if I can photometer the data, given pixel files.


exoplanet photometry, crossing the streams

Nick Cowan (Northwestern, Amherst) was in town today to give a seminar about exoplanet thermodynamics and climate. He showed nice results inferring the temperature distribution on the surfaces of hot jupiters and the same for degraded data on the Earth. He spent a lot of time talking about thermostats and carbon and water cycles on the Earth and Earth-like planets. In the morning I discussed my OWL photometry with him, and we proposed to try it on his massive amounts of Spitzer IRAC data on transiting exoplanets.

After lunch, a group of the willing (Rix, Bovy, Sesar, Price-Whelan, myself) discussed stream fitting in the Milky Way. We decided to fit multiple streams simultaneously, starting with Orphan, GD-1, and Palomar 5. The first step is to gather the data. Price-Whelan and I also have to modify our method so that it can take the heterogeneous kinds of data in the GD-1 data set.


centroiding and searching

I spoke with Vakili about centroiding stars. We are trying to finally complete a project started by Bovy ages ago to compare the best-possible centroiding of stars with a three-by-three pixel hack related to what is done in the SDSS pipelines. Vakili hit this issue because if you don't have good centroids, you can't get a good point-spread function model. Well, actually I shouldn't say "can't", because you can but then you need to make the centroids a part of the model that you learn along with the point-spread function. That may still happen, but along the way we are going to write up an analysis of the hack and also the Right Thing To Do.

Foreman-Mackey, Fadely, Hattori, and I discussed Hattori's search for exoplanets in the Kepler data. The idea is to build up a simple system based on simple components and then swap in more sophisticated components as we need them. We discussed a bit the question of "search scalar"—that is, what we compute as our objective function in the search. There is a likelihood function involved, but, as we often say in #CampHogg: Probabilistic inference is good at giving you probabilities; it doesn't tell you what to do. Search is a decision problem.


Gaia RVS spectra

As I mentioned a few days ago, Mark Cropper (UCL) brought up the question of how to infer clean, high-resolution spectra from the Gaia RVS data, which are afflicted by distortions caused by charge-transfer inefficiency. I spent my small amount of work time today drafting a document with my position, which is that if you can forward-model the CTE (which they can) then spectral extraction is not hard.


GaiaCal, day four

Tne theme of this short meeting has been that things we thought were real astrophysical parameters just might not be. Soderblom continued along this path when he showed that it is impossible to measure effective temperature to better than something on the order of 100 K. That limits the precision of model comparisons, even in the absence of modeling uncertainties (which are currently large). I argued that perhaps we should just give up on these ersatz "physical" parameters and go to predictions of observables. I was shouted down.

Finkbeiner showed unreal comparisons between SDSS and PanSTARRS photometry, with each survey synthesizing the other. He can show with enormous confidence which survey is responsible for which photometric residuals, just by projecting the residuals onto the sky or into the focal planes of the two instruments. This is an interesting kind of "causal inference". It is incredibly convincing; I must discuss it with Schölkopf next month.


GaiaCal, day three

It was a half-day at the meeting today. Highlights included a talk by Bergemann about stellar models, which (perhaps unwittingly) demonstrated that the effective temperature, spectrum, and color of a star are not unique predictions of a model. That is, all of them are time-dependent; we really need to prediction distributions, not values. In the questioning, Davies pointed out that the variability could be used as a predictor of the depth of the photospheric convection zone or turnover scale. We discussed that later; resolving to think more about it.

Creevey convinced me that asteroseismology might be the One True Path to reliable calibration of astrophysical determinations of stellar densities, internal structure, and ages. An asteroseismology mission with the same scope as Gaia would change everything forever! In the mean time, asteroseismology can be used to calibrate the stars we most care about.

At the end of the sessions, I spoke, emphasizing the need to preserve and transmit likelihood functions, echoing comments by Juric that this is a goal of the planning in the LSST software group.


GaiaCal, day two

A lot of the second day of #GaiaCal was about "benchmark stars". Indeed, my concerns about "truth" from yesterday got more severe: I don't think there is any possibility of knowing the "true" value of Fe/H or anything else for a star, so we should probably figure out ways of doing our science without making reference to such inaccessible things. One approach is to define the results of a certain set of procedures on a certain set of stars to be "truth" and then warp all models and methods to reproduce those truths. That's worth thinking about. During the day's proceedings, Blomme and Worley showed some incredible spectra and spectral results on the benchmarks, including abundance measurements for some 25 elements.

Along different lines, Cropper gave an astounding talk about the Gaia RVS spectral data reduction pipeline. It was astounding because it involves a complicated, empirical, self-calibrated model of charge-transfer efficiency and radiation damage, which is continuously updated as the mission proceeds. It also includes a spectral calibration (wavelength solution, astrometry) that is fit simultaneously with all spectral extractions in a complete self-calibration. Cropper asked the implicit question in his talk: "How do we extract CTE-undamaged spectra from CTE-damaged data?". I think I know how to answer that; resolved to write something about it.

Liu showed amazing empirical calibrations of the LAMOST spectra using SEGUE parameters and support vector machines. He has done a great job of tranferring the one system onto the other data, which gives hope for the benchmark plans that are emerging. Bovy gave a nice talk about ages (birth dates), chemical abundances, and orbital actions serving as stellar "tags" which can then be understood statistically and generatively.


GaiaCal, day one

Today was the first day of the Astrophysical calibration of Gaia and other surveys meeting at Ringberg Castle in Tegernsee, Germany. The meeting is about methods for combining data to produce good astrophyiscal parameters from current and future surveys. There were many great talks today, but highlights were the following:

Andrae showed work on simulated data to demonstrate that the final Gaia Catalog will have good parameters (effective temperature, surface gravities, and so on) for stars. He showed that the differences between models is much larger than any precision of the data, and that therefore no models are expected to be "good fits" in any sense. He showed that they can recalibrate or adjust the input spectral models to force them to agree with the data, and that this seems to work extremely well.

At the end of a talk by Korn, Fouesneau commented that every model incorporated into the Gaia pipelines ought to include (if it has it) predictions for wavelengths outside the Gaia bandpasses. This is because then the results of fitting to Gaia data can be used to make predictions in other surveys. This comment resonated with me, because in the talks today there was a lot of talk about the "true" values of astrophysical parameters, but since these latent values are inaccessible to direct observation, we need other ways to assess credibility.

In between talks, Bovy made a nice point, which is that if Gaia wants end users to be able to compute likelihoods (and not just get point estimates or best-fit values) for astrophysical parameters, they should project all models onto their various systems (the G magnitude system, the BP/RP low-resolution spectral system, and the RVS high-resolution spectral system). This would permit end users to re-compute likelihoods for all data for all models.


web cams and James Bradley

Over lunch, Markus Pössel (MPIA) mentioned that he can measure the sidereal day very accurately, using a fish-eye or wide-field web cam pointed at the sky. This led us to a discussion of whether it would be possible to repeat Bradley's experiments of the 1700s that measured stellar aberration, precession of the Earth's axis, and nutation. Pössel had the very nice realization that you don't have to specifically identify any individual stars in any images to do this experiment; you can just do cross-correlations of multi-pixel time series. That's brilliant! We decided to discuss again later this month along with a possible (high school) student researcher.

Before that, Roberto Decarli (MPIA) and I discussed various projects. The most interesting is whether or how you can "stack data" (combine information from many images or many parts of an image) but in interferometric imaging data. Decarli has shown that you can do this stacking in the fourier space rather than in the image space. That's excellent, because the noise properties of the data are (conceivably) known there, but never understood properly in the image space. I gave him my usual advice, which is to replace the stacking with some kind of linear fit or regression: Stacking in bins is like linear fitting but under hard assumptions about the noise model and properties of the sources. We agreed to test some ideas.


Gaia status, Kepler calibration

Great conversations today with Brewer about sampling and inference, Knut Jahnke and Stefanie Wachter (MPIA) and Doug Finkbeiner (CfA) about Euclid calibration, and Jennifer Hill (NYU) about data science. But that's not all:

In the morning, Coryn Bailer-Jones (MPIA) gave a status update on the Gaia mission, which was launched in December. It is performing to spec in almost all respects. Bailer-Jones gave us straight talk on three issues they now face: The scattered light (from sources not in the field of view) is much higher than expected, reducing the magnitude limits for all accuracies and capabilities. There is something (probably water) affecting the total photometric throughput. It looks like this can be mitigated with occasional thermal cycling of the optics. The "fundamental angle" between the two telescopes seems to be varying with time with a larger amplitude than expected. This can be modeled, so long as it is understood properly. I think I speak for the entire astronomical community when I say that we can't wait for early data releases and wish the Gaia Collaboration the best of luck in addressing these challenges.

In the afternoon, Dun Wang, Foreman-Mackey, Schölkopf, and I worked through Wang's results on Kepler data-driven calibration. I am pretty excited about this project: Wang has shown that when we "train" the model on data not near (in time) to an injected transit, the data recalibrated with the model produces unbiased (on average) inferences about the transit. We assigned Wang next steps.


cheap and dirty chemical abundances

It was a packed research day, with great conversations about data analysis and inference with Rix, Dalcanton, Fouesneau, Price-Whelan, Schölkopf, and others. I have so many things to do this summer!

The highlight conversation of the day was with Rix and Melissa Ness (MPIA) about the possibility that we might be able to build data-driven metallicity indicators or indices for APOGEE spectra by performing regressions of the data against known metallicities and other meta-data (confounders, if you will). We came up with a baseline plan and talked through the iterated linear algebra the project requires. The baseline project has strong overlap with things we worked on with Conroy & Choi this Spring. The idea is to have the (noisily) labeled data decide what operators on the data are the best predictors or estimators of chemical abundances by building an empirical predictive model of the labels.


what is science?

It was a relatively low-research day, although I did check in with So Hattori and Dun Wang on their projects. I had a short conversation with Hennawi about next projects with SDSS-III and SDSS-IV spectroscopy. (Today is the last day of SDSS-III observing; SDSS-IV starts tomorrow!) We talked about Hennawi's Princeton-esque philosophy that great science happens when there are precise measurements that can be made and precise theoretical predictions for those measurements. This is a high bar for astrophysics: Many research areas either have no precise measurements, or no precise predictions, or neither. Interesting to think about projects through this lens. I am not particularly endorsing it, of course, since lots of science happens through discovery and description too.