the future of astrophysical data analysis

Dan Foreman-Mackey (UW) crashed NYC today, surprising me, and disrupting my schedule. We began our day by arguing about the future of hierarchical modeling. His position is (sort-of) that the future is not hierarchical Bayes as it is currently done, but rather that we will be doing things that are much more ABC-like. That is, astrophysics theory is (generally) computational or simulation-based, and the data space is far too large for us to understand densities or probabilities in the data space. So we need ways to responsibly use simulations in inference. Right now the leading method is what is called (dumbly) ABC, for approximate Bayesian computation. I asked: So, are we going to do CMB component separation at the pixel level with ABC? This seems impossible at the present day, and DFM pointed out that ABC is best when precision requirements are low. When precision requirements are high, there aren't really options that have computer simulations inside the inference loop!
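For readers who haven't met ABC: the simplest, rejection-sampling flavor can be sketched in a few lines. This toy (mine, not DFM's, with a made-up Gaussian forward model and the sample mean as summary statistic) draws parameters from the prior, simulates data, and keeps only the draws whose simulated summary lands within epsilon of the observed summary:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy setup: data are draws from a Gaussian with unknown mean mu.
true_mu = 1.5
data = rng.normal(true_mu, 1.0, size=100)
obs_summary = np.mean(data)  # summary statistic of the observed data

def simulate(mu, rng):
    # Forward model: we can simulate data, but we pretend we cannot
    # write down its likelihood (the ABC premise).
    return rng.normal(mu, 1.0, size=100)

# Rejection ABC: draw from the prior, simulate, keep draws whose
# simulated summary is within epsilon of the observed summary.
epsilon = 0.05
prior_draws = rng.uniform(-5.0, 5.0, size=20000)
accepted = [mu for mu in prior_draws
            if abs(np.mean(simulate(mu, rng)) - obs_summary) < epsilon]

posterior_mean = np.mean(accepted)  # crude ABC posterior summary
```

The precision point is visible right in this sketch: shrinking epsilon improves the posterior approximation, but the acceptance rate collapses, which is why simulation-in-the-loop inference gets hard when the precision requirements are high.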

Many other things happened today. I spent time with Lauren Anderson (Flatiron), validating and inspecting the output of our parallax inferences. I spent a phone call with Fed Bianco (NYU) talking about how to adapt Gaussian Processes to make models of supernovae light curves. And Foreman-Mackey and I spent time talking about linear algebra, and also this blog post, with which we more-or-less agree (though perhaps it doesn't quite capture all the elements that contribute (positively and negatively) to the LTFDFCF of astronomers!).


linear algebra; huge models

I had a long conversation today with Justin Alsing (Flatiron) about hierarchical Bayesian inference, which he is thinking about (and doing) in various cosmological contexts. He is thinking about inferring a density field that simultaneously models the galaxy structures and the weak lensing, to do a next-generation (and statistically sound) lensing tomography. His projects are amazingly sophisticated, and he is not afraid of big models. We also talked about using machine learning to do emulation of expensive simulations, initial-conditions reconstruction in cosmology, and moving-object detection in imaging.

I also spent time playing with my linear algebra expressions for my document on finding identical stars. Some of the matrices in play are low-rank; so I ought to be able to either simplify my expressions or else simplify the number of computational steps. Learning about my limitations, mathematically! One thing I re-discovered today is how useful it is to use the Kusse & Westwig notation and conceptual framework for thinking about hermitian matrices and linear algebra.
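For the record (and hedged: these are not my actual expressions, just the generic trick), the reason low rank helps is the matrix-inversion lemma: inverting A + U Vᵀ only requires inverting A (cheap, here diagonal) and one small k-by-k matrix. A numerical check:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3  # big dimension, low rank

A = np.diag(rng.uniform(1.0, 2.0, size=n))   # cheap-to-invert part
U = rng.normal(size=(n, k))
V = rng.normal(size=(n, k))

# Direct O(n^3) inverse of the full matrix:
direct = np.linalg.inv(A + U @ V.T)

# Woodbury identity: only a k x k matrix needs a real inversion.
Ainv = np.diag(1.0 / np.diag(A))
small = np.linalg.inv(np.eye(k) + V.T @ Ainv @ U)   # k x k
woodbury = Ainv - Ainv @ U @ small @ V.T @ Ainv

err = np.max(np.abs(direct - woodbury))  # should be at machine precision
```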


ergodic stars; blinding and pre-registration

In the Stars group meeting, Nathan Leigh (AMNH) and Nick Stone (Columbia) spoke about 4-body scattering or 2-on-2 binary-binary interactions. These can lead to 3-1, 2-2, and 2-1-1 outcomes, with the latter being most common. They are using a fascinating and beautiful ergodic-hypothesis-backed method (constrained by conservation laws) to solve for the statistical input–output relations quasi-analytically. This is a beautiful idea and makes predictions about star-system evolution in the Galaxy.

In the Cosmology group meeting, Alex Malz (NYU) led a long and wide-ranging discussion of blinding (making statistical results more reliable by pre-registering code or sequestering data). The range of views in the room was large, but all agreed that you need to be able to do exploratory data analysis and also protect against investigator bias. My position is we better be doing some form of blinding for our most important questions, but I also think that we need to construct these methods to permit people to play with the data and permit public data releases that are uncensored and unmodified. One theme which came up is that astronomy's great openness is a huge asset here. Fundamentally we are protected (in part) by the availability of the data to re-analysis.


Local Group on the Local Group

Today, Kathryn Johnston (Columbia) organized a “Local Group on the Local Group” meeting at Columbia. Here are some highlights:

Lauren Anderson (Flatiron) gave an update on her data-driven model of the color–magnitude diagram of stars. This led to a conversation about which features in her deconvolved CMD are real, and whether there are too many red-clump stars given the total catalog size.

Steven Mohammed (Columbia) showed our GALEX Galactic-Plane survey data on the Gaia TGAS stars. The GALEX colors look very sensitive to metallicity and possibly other abundances. The audience suggested that we look at the full dependences on metallicity and temperature and surface gravity to see if we can break all degeneracies. This led to more discussion of the use of the Red Clump stars for Galactic science.

Adrian Price-Whelan (Columbia) presented a puzzle about the Galactic globular cluster system, which he has been thinking about. Are the distant clusters accreted? The in-situ formation hypothesis is unpalatable (it had to be many clusters at early times; should be many thin streams); the accreted hypothesis over-produces the smooth component of the stellar halo (unless dwarf galaxies had far more GCs per unit stellar mass in the past). These problems can be resolved, but only with strong predictions.

Yong Zheng (Columbia) spoke about the gaseous Magellanic stream and associated (or plausibly associated) high-velocity clouds. Many of the challenges in interpretation connect to the problem that we don't know where the gas is along the line of sight. She showed really nice data on something called Wright’s Cloud. For this huge structure—and for the stream as a whole—there is little to no associated stellar component.

Nicola Amorisco (Harvard) showed theoretical simulations of the accreted part of the MW (and MW-like-galaxy) halo, with the goal of finding stellar-halo observables that strongly co-vary with the assembly history of the dark-matter halo. Both theory and observations suggest large scatter in halo properties at Milky-Way-like masses, and much less scatter at higher masses (because of central-limit-like considerations). His results are promising for understanding the MW assembly history.

Glennys Farrar (NYU) spoke about the MW magnetic field, using rotation measures and CMB to constrain the model. She showed UHECR deflections in the inferred magnetic field, and also discussed implications of her results for electron and cosmic-ray diffusion. There are also tantalizing implications for the synchrotron spectrum and CMB component separation. One interesting comment: If her results are right for the scale and amplitude of the field, there are serious questions about origin and generation; is it primordial or generated on scales much larger than the galaxy?


identical, statistically speaking

My research activity today was to re-write, from scratch (well, I really started yesterday) my document on how you tell whether two noisily measured objects are identical. This is an old and solved problem! But I am writing the answer in astronomer-friendly form, with a few astronomy-related twists. I have no idea whether this is a paper, a section of a paper, or something else. My re-write was caused by the algebra I learned from Leistedt, and the customers (so to speak) of the document are Rix, Ness, and Hawkins, all of whom are thinking about finding identical pairs of stars.


how to write an April Fools' paper

I had a great visit to the University of Toronto Department of Astronomy and Astrophysics (and Dunlap Institute) today. I had great conversations about scintillometry (new word?) and the future of likelihood functions and component separation in the CMB. I also discussed pairwise velocity differences in cosmology, and probabilistic supernova classification. There is lots going on. I gave my talk on The Cannon, in which I was perhaps way too pessimistic about chemical tagging!

Early in the day, I ate Toronto-style (no, not Montreal-style) bagels with Dustin Lang (Toronto) and discussed many of the things we like to discuss, like finding very faint outer-Solar-System objects in all the data Lang wrangles, like the differences between accuracy and precision, and even how to define accuracy in astrophysics, and like April Fools' papers, which have to meet four criteria:

  1. conceptually interesting inference
  2. extremely challenging computation
  3. no long-term scientific value to the specific results found
  4. non-irrelevant mention of April 1 in abstract

It is a brutal set of requirements, but we have met them twice. I think this year is out (because of criterion 2), but maybe 2018?


math with Gaussians

My one piece of research news today was an email exchange with Boris Leistedt (NYU) in which he completely took me to school on math with Gaussians. My intuition (expressed this week) that there was an easier way to do all the operations I was doing was right! But everything else I was doing was not wrong but wrong-headed. Anyways, this should simplify some things right away. The key observation is that a product of Gaussians can be transformed into another product of Gaussians, in another basis, trivially. More soon!
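The identity in question is standard: the product of two Gaussian densities in x is, up to a constant that does not depend on x, another Gaussian, with inverse-variance-weighted mean and covariance. A quick numerical check (a sketch, with arbitrary made-up inputs):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3

def log_gauss(x, m, S):
    # Log-density of a multivariate Gaussian N(x; m, S).
    r = x - m
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + r @ np.linalg.solve(S, r))

a, b = rng.normal(size=d), rng.normal(size=d)
A = 2.0 * np.eye(d)
B = np.eye(d) + 0.5 * np.ones((d, d))

# Claimed identity: N(x; a, A) * N(x; b, B) = const * N(x; c, C), with
C = np.linalg.inv(np.linalg.inv(A) + np.linalg.inv(B))
c = C @ (np.linalg.solve(A, a) + np.linalg.solve(B, b))

# The log-difference should be the same constant at every test point x:
xs = rng.normal(size=(5, d))
diffs = [log_gauss(x, a, A) + log_gauss(x, b, B) - log_gauss(x, c, C)
         for x in xs]
spread = max(diffs) - min(diffs)
```

The constant of proportionality is itself a Gaussian, in (a − b) with covariance A + B, which is what makes these manipulations so useful in marginalizations.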


circular reasoning, continuity of globular clusters

In Stars group meeting, Lauren Anderson (Flatiron) showed our toy example that demonstrates why our method for de-noising the Gaia TGAS data works. That led to some useful conversation that might help us explain our project better. I didn't take all the notes I should have! One idea that came up is that if there are two populations, one only seen at very low signal-to-noise, then that second population can easily get pulled into the first. Another is the question of the circularity of the reasoning. Technically, our reasoning is circular, but it wouldn't be if we marginalized out the hyper-parameters (that is, the parameters of our color–magnitude diagram).

Also in the Stars meeting, Ruth Angus (Columbia) suggested how we might responsibly look for the differences in exoplanet populations with stellar age. And Semyeong Oh (Princeton) and Adrian Price-Whelan (Princeton) described their very successful observing run to follow up the comoving stellar pairs. Preliminary analyses suggest that many of the pairs (which we found only with transverse information) are truly comoving.

In Cosmology group meeting, Jeremy Tinker discussed the possibility of using halo-occupation-like approaches to determine how the globular cluster populations of galaxies form and evolve. This led to a complicated and long discussion, with many ideas and issues arising. I do think that various simple scenarios could be ruled out, making use of some kind of continuity argument (with sources and sinks, of course).

I spent some time hidden away working on multiplying and integrating Gaussians. I am doing lots of algebra, completing squares. I have the tiniest suspicion that there is an easier way, or that all of the math I am doing has a simple answer at the end, that I could have seen before starting?


half-pixel issues; building our own Gibbs sampler

First thing in the morning I met with Steven Mohammed (Columbia) and Dun Wang (NYU) to discuss GALEX calibration and imaging projects. Wang has a very clever astrometric calibration of the satellite, built by cross-correlating photons with the positions of known stars. This astrometric calibration depends on properties of the photons for complicated reasons that relate to the detector technology on board the spacecraft. Mohammed finds, in an end-to-end test of Wang's images, that there might be half-pixel issues in our calibration. We came up with methods for tracking that down.

Late in the day, I met with Ruth Angus (Columbia) to discuss the engineering in her project to combine all age information (and self-calibrate all methods). We discussed how to make a baby test where we can do the sampling with technology we are good at, before we write a brand-new Gibbs sampler from scratch. Why, you might ask, would any normal person write a Gibbs sampler from scratch when there are so many good packages out there? Because you always learn a lot by doing it! If our home-built Gibbs doesn't work well, we will adopt a package.
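In that spirit, here is about the smallest Gibbs sampler one can write. This is nothing like the full self-calibration model, just the pattern: alternate draws from each full conditional. The target here is a toy bivariate Gaussian with correlation rho:

```python
import numpy as np

rng = np.random.default_rng(7)
rho = 0.8  # target: zero-mean bivariate Gaussian, unit variances, corr rho

n_steps = 20000
x = y = 0.0
samples = np.empty((n_steps, 2))
for t in range(n_steps):
    # Each conditional of a bivariate Gaussian is itself Gaussian:
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    samples[t] = x, y

burn = samples[2000:]  # discard burn-in
est_rho = np.corrcoef(burn[:, 0], burn[:, 1])[0, 1]
```

The same loop structure carries over to the real problem; only the conditionals change (and become much harder to derive), which is where all the learning happens.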


statistics questions

I spent time today writing in the method section of the Anderson et al paper. I realized in writing it that we have been thinking about our model of the color–magnitude diagram as being a prior on the distance or parallax. But it isn't really, it is a prior on the color and magnitude, which for a given noisy, observed star, becomes a prior on the parallax. We will compute these implicit priors explicitly (it is a different prior for every star) for our paper output. We have to describe this all patiently and well!

At some point during the day, Jo Bovy (Toronto) asked a very simple question about statistics: Why does re-sampling the data (given presumed-known Gaussian noise variances in the data space) and re-fitting deliver samples of the fit parameters that span the same uncertainty distribution as the likelihood function would imply? This is only true for linear fitting, of course, but why is it true (and no, I don't mean what is the mathematical formula!)? My view is that this is (sort-of) a coincidence rather than a result, especially since it (to my mind) confuses the likelihood and the posterior. But it is an oddly deep question.
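Bovy's coincidence is easy to watch happen numerically. In this sketch (a straight-line fit with known Gaussian noise; all numbers made up), the parameter covariance from parametric re-sampling matches the likelihood-implied covariance sigma² (XᵀX)⁻¹:

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 50, 0.5
x = np.linspace(0, 1, n)
X = np.vander(x, 2)               # design matrix for a straight line
beta_true = np.array([2.0, -1.0])
y = X @ beta_true + rng.normal(0, sigma, n)

# Likelihood-implied parameter covariance for linear least squares:
cov_like = sigma**2 * np.linalg.inv(X.T @ X)

# Parametric bootstrap: re-sample the data with the known noise, re-fit.
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
boots = np.array([
    np.linalg.lstsq(X, X @ beta_hat + rng.normal(0, sigma, n), rcond=None)[0]
    for _ in range(4000)
])
cov_boot = np.cov(boots.T)
rel_err = np.max(np.abs(cov_boot - cov_like) / np.abs(cov_like))
```

The agreement is exact (in expectation) only because the estimator is linear in the data; for a nonlinear fit, the two distributions differ, which is part of why I call it a coincidence rather than a result.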


a prior on the CMD isn't a prior on distance, exactly

Today my research time was spent writing in the paper by Lauren Anderson (Flatiron) about the TGAS color–magnitude diagram. I think of it as being a probabilistic inference in which we put a prior on stellar distances and then infer the distance. But that isn't correct! It is an inference in which we put a prior on the color–magnitude diagram, and then, given noisy color and (apparent) magnitude information, this turns into an (effective, implicit) prior on distance. This Duh! moment led to some changes to the method section!


what's in an astronomical catalog?

The stars group meeting today wandered into dangerous territory, because it got me on my soap box! The points of discussion were: Are there biases in the Gaia TGAS parallaxes? and How could we use proper motions responsibly to constrain stellar parallaxes? Keith Hawkins (Columbia) is working a bit on the former, and I am thinking of writing something short with Boris Leistedt (NYU) on the latter.

The reason it got me on my soap-box is a huge set of issues about whether catalogs should deliver likelihood or posterior information. My view—and (I think) the view of the Gaia DPAC—is that the TGAS measurements and uncertainties are parameters of a parameterized model of the likelihood function. They are not parameters of a posterior, nor the output of any Bayesian inference. If they were outputs of a Bayesian inference, they could not be used in hierarchical models or other kinds of subsequent inferences without a factoring out of the Gaia-team prior.

This view (and this issue) has implications for what we are doing with our (Leistedt, Hawkins, Anderson) models of the color–magnitude diagram. If we output posterior information, we have to also output prior information for our stuff to be used by normals, down-stream. Even with such output, the results are hard to use correctly. We have various papers, but they are hard to read!

One comment is that, if the Gaia TGAS contains likelihood information, then the right way to consider its possible biases or systematic errors is to build a better model of the likelihood function, given their outputs. That is, the systematics should be created to be adjustments to the likelihood function, not posterior outputs, if at all possible.

Another comment is that negative parallaxes make sense for a likelihood function, but not (really) for a posterior pdf. Usually a sensible prior will rule out negative parallaxes! But a sensible likelihood function will permit them. The fact that the Gaia catalogs will have negative parallaxes is related to the fact that it is better to give likelihood information. This all has huge implications for people (like me, like Portillo at Harvard, like Lang at Toronto) who are thinking about making probabilistic catalogs. It's a big, subtle, and complex deal.
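To make the point concrete, here is a sketch (hypothetical numbers throughout, including the prior scale length) of how a downstream user turns a likelihood-catalog entry, even one with a negative measured parallax, into a perfectly proper distance posterior by supplying their own prior:

```python
import numpy as np

# Hypothetical catalog entry reporting likelihood information: a Gaussian
# likelihood in parallax, with a negative measured value. A posterior
# catalog could not sensibly report this; a likelihood catalog can.
varpi_hat, sigma = -0.3, 0.5                 # parallax, uncertainty (mas)

d = np.linspace(0.01, 10.0, 4000)            # distance grid (kpc)
dd = d[1] - d[0]
log_like = -0.5 * ((1.0 / d - varpi_hat) / sigma) ** 2

# The downstream user supplies the prior; here an exponentially
# decreasing space-density prior with (made-up) scale length L = 1.35 kpc.
L = 1.35
log_prior = 2.0 * np.log(d) - d / L

log_post = log_like + log_prior
post = np.exp(log_post - log_post.max())
post /= post.sum() * dd                      # normalize on the grid
mean_d = (d * post).sum() * dd               # posterior mean distance
```

Note that no step here required un-doing anybody else's prior, which is the whole argument for delivering likelihood information in the first place.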


snow day

[Today was a NYC snow day, with schools and NYU closed, and Flatiron on a short day.] I made use of my incarceration at home writing in the nascent paper about the TGAS color–magnitude diagram with Lauren Anderson (Flatiron). And doing lots of other non-research things.


toy problem

Lauren Anderson (Flatiron) and I met early to discuss a toy model that would elucidate our color–magnitude diagram model project. Context is: We want to write a section called “Why the heck does this work?” in our paper. We came up with a model so simple, I was able to implement it during the drinking of one coffee. It is, of course, a straight-line fit (with intrinsic width, then used to de-noise the data we started with).
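Our actual toy is the straight-line fit itself; here is an even more stripped-down sketch of the de-noising half (assuming the line and all variances are known, where our model of course infers them): each noisy point is shrunk toward the line by the ratio of intrinsic to total variance:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
m_true, b_true, V = 2.0, 0.5, 0.3**2   # slope, intercept, intrinsic variance
s = 1.0**2                              # (known) measurement variance

x = rng.uniform(0, 5, n)
y_true = m_true * x + b_true + rng.normal(0, np.sqrt(V), n)
y_obs = y_true + rng.normal(0, np.sqrt(s), n)

# De-noise each point by combining its measurement with the line "prior";
# this is the standard Gaussian shrinkage (inverse-variance) combination.
mu_line = m_true * x + b_true
w = V / (V + s)                         # shrinkage weight
y_denoised = w * y_obs + (1 - w) * mu_line

rms_raw = np.sqrt(np.mean((y_obs - y_true) ** 2))
rms_den = np.sqrt(np.mean((y_denoised - y_true) ** 2))
```

The de-noised points are closer to the truth than the raw measurements, and the gain is largest when the intrinsic width is small compared to the noise, which is exactly the regime of our CMD project.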


planning a paper sprint, completing a square

Lauren Anderson (Flatiron) and I are going to sprint this week on her paper on the noise-deconvolved color–magnitude diagram from the overlap of Gaia TGAS, 2MASS, and the PanSTARRS 3-d dust map. We started the day by making a long to-do list for the week, that could end in submission of the paper. My first job is to write down the data model for the data release we will do with the paper.

At lunch time I got distracted by my project to find a better metric than chi-squared to determine whether two noisily-observed objects (think: stellar spectra or detailed stellar abundance vectors) are identical or indistinguishable, statistically. The math involved completing a huge square (in linear-algebra space) twice. Yes, twice. And then the result is—in a common limit—exactly chi-squared! So my intuition is justified, and I know where it will under-perform.
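The common-limit statement can be checked numerically: for two objects that really are identical, the statistic (x1 − x2)ᵀ (C1 + C2)⁻¹ (x1 − x2) is chi-squared distributed with mean equal to the data dimension. A sketch with made-up diagonal covariances:

```python
import numpy as np

rng = np.random.default_rng(11)
d, n_pairs = 10, 2000
C1 = np.diag(rng.uniform(0.5, 1.5, d))   # per-object noise covariances
C2 = np.diag(rng.uniform(0.5, 1.5, d))

# Simulate pairs that ARE identical: same true vector, independent noise.
truth = rng.normal(size=(n_pairs, d))
x1 = truth + rng.normal(size=(n_pairs, d)) * np.sqrt(np.diag(C1))
x2 = truth + rng.normal(size=(n_pairs, d)) * np.sqrt(np.diag(C2))

# Chi-squared statistic on the difference, with summed covariances:
Cinv = np.linalg.inv(C1 + C2)
diff = x1 - x2
chi2 = np.einsum('ni,ij,nj->n', diff, Cinv, diff)

mean_chi2 = chi2.mean()   # close to d for truly identical pairs
```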


the Milky Way halo

At the NYU Astro Seminar, Ana Bonaca (Harvard) gave a great talk, about trying to understand the dynamics and origin of the Milky Way halo. She has a plausible argument that the higher-metallicity halo stars are the halo stars that formed in situ and migrated out, while the lower-metallicity stars were accreted. If this holds up, I think it will probably test a lot of things about the Galaxy's formation, history, and dark-matter distribution. She also talked about stream fitting to see the dark-matter component.

On that note, we started a repo for a paper on the information theory of cold stellar streams. We re-scoped the paper around information rather than the LMC and other peculiarities of the Local Group. Very late in the day I drafted a title and abstract. This is how I start most projects: I need to be able to write a title and abstract to know that we have sufficient scope for a paper.


The Cannon and APOGEE

I discussed some more the Cramér-Rao bound (or Fisher-matrix) computations on cold stellar streams being performed by Ana Bonaca (Harvard). We discussed how things change as we increase the numbers of parameters, and designed some possible figures for a possible paper.

I had a long phone call with Andy Casey (Monash) about The Cannon, which is being run inside APOGEE2 to deliver parameters in a supplemental table in data release 14. We discussed issues of flagging stars that are far from the training set. This might get strange in high dimensions.

In further APOGEE2 and The Cannon news, I dropped an email on the mailing lists about the radial-velocity measurements that Jason Cao (NYU) has been making for me and Adrian Price-Whelan (Princeton). His RV values look much better than the pipeline defaults, which is perhaps not surprising: The pipeline uses some cross-correlation templates, while Cao uses a very high-quality synthetic spectrum from The Cannon. This email led to some useful discussion about other work that has been done along these lines within the survey.
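I don't have Cao's code, but the template cross-correlation idea is simple enough to sketch: a Doppler shift is (to first order) a translation in log-wavelength, so one slides the template in velocity and keeps the shift that maximizes the correlation with the data. Everything here (the few-Gaussian-line template, the noise level) is made up:

```python
import numpy as np

rng = np.random.default_rng(2)
c = 299792.458  # speed of light, km/s

# Fake template: a few Gaussian absorption lines on a log-wavelength
# grid (uniform in velocity), vaguely APOGEE-like wavelengths.
loglam = np.linspace(np.log(15100.0), np.log(15200.0), 4000)
lines = [np.log(15120.0), np.log(15155.0), np.log(15185.0)]

def spectrum(ll):
    f = np.ones_like(ll)
    for l0 in lines:
        f -= 0.4 * np.exp(-0.5 * ((ll - l0) / 2e-5) ** 2)
    return f

v_true = 37.0  # km/s
data = spectrum(loglam - v_true / c) + rng.normal(0, 0.01, loglam.size)

# Grid search: shift the template in velocity, keep the shift that
# maximizes the cross-correlation with the data.
v_grid = np.arange(-200.0, 200.0, 0.5)
cc = [np.sum(data * spectrum(loglam - v / c)) for v in v_grid]
v_best = v_grid[int(np.argmax(cc))]
```

The quality of the template is what limits the precision of the RV, which is presumably why a high-quality synthetic spectrum from The Cannon beats generic templates.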


does the Milky Way disk have spiral structure?

At stars group meeting, David Spergel (Flatiron) was tasked with convincing us (and Price-Whelan and I are skeptics!) that the Milky Way really does have spiral arms. His best evidence came from infrared emission in the Galactic disk plane, but he brought together a lot of relevant evidence, and I am closer to being convinced than ever before. As my loyal reader knows, I think we ought to be able to see the arms in any (good) 3-d dust map. So, what gives? That got Boris Leistedt (NYU), Keith Hawkins (Columbia), and me thinking about whether we can do this now, with things we have in-hand.

Also at group meeting, Semyeong Oh (Princeton) showed a large group-of-groups she has found by linking together co-moving pairs into connected components by friends-of-friends. It is rotating with the disk but at a strange angle. Is it an accreted satellite? That explanation is unlikely, but if it turns out to be true, OMG. She is off to get spectroscopy next week, though John Brewer (Yale) pointed out that he might have some of the stars already in his survey.


finding the dark matter with streams

Today was a cold-stream science day. Ana Bonaca (Harvard) computed derivatives today of stream properties with respect to a few gravitational-potential parameters, holding the present-day position and orientation of the stream fixed. This permits computation of the Cramér-Rao bound on any inference or estimate of those parameters. We sketched out some ideas about what a paper along these lines would look like. We can identify the most valuable streams, the streams most sensitive to particular potential parameters, the best combinations of streams to fit simultaneously, and the best new measurements to make of existing streams.
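The machinery is worth writing down: for Gaussian noise, the Fisher matrix is Jᵀ C⁻¹ J, with J the matrix of derivatives of the model observables with respect to the parameters, and the Cramér-Rao bound is its inverse. Here is a sketch with a toy stand-in model (the real J comes from derivatives of Bonaca's generative stream code, not this made-up function):

```python
import numpy as np

# Toy stand-in for the expensive stream model: observables as a smooth
# function of two "potential" parameters.
def model(p):
    a, b = p
    t = np.linspace(0, 1, 50)
    return a * np.sin(2 * np.pi * t) + b * t**2

p0 = np.array([1.0, 0.5])   # fiducial parameters
sigma = 0.1                  # per-datum Gaussian noise
eps = 1e-6

# Numerical (central-difference) derivatives of the observables:
J = np.column_stack([
    (model(p0 + eps * e) - model(p0 - eps * e)) / (2 * eps)
    for e in np.eye(2)
])

# Fisher matrix and Cramér-Rao bound for homoskedastic Gaussian noise:
F = J.T @ J / sigma**2
crb = np.linalg.inv(F)       # lower bound on the parameter covariance
sigma_a = np.sqrt(crb[0, 0])  # best achievable uncertainty on parameter a
```

Everything we want (the most valuable streams, the best stream combinations, the best new measurements) drops out of comparing these bounds across data scenarios.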

Separately from this, I had a phone conversation with Adrian Price-Whelan (Princeton) about the point of doing stream-fitting. It is clear (from Bonaca's work) that fitting streams in toy potentials is giving us way-under-estimated error bars. This means that we have to add a lot more potential flexibility to get more accurate results. We debated the value of things like basis-function expansions, given that these are still in the regime of toy (but highly parameterized toy) models. We are currently agnostic about whether stream fitting is really going to reveal the detailed properties of the Milky Way's dark-matter halo. That is, for example, the properties that might lead to changes in what we think is the dark-matter particle.


LMC effect on streams; dust corrections

Ana Bonaca (Harvard) showed up for a week of (cold) stellar streams inference. Our job is either to resurrect her project to fit multiple streams simultaneously, or else choose a smaller project to hack on quickly. One thing we have been discussing by email is the influence of the LMC (and SMC and M31 and so on) on the streams. Will it be degenerate with halo quadrupole or other parameters? We discussed how we might answer this question without doing full probabilistic inferences: In principle we only need to take some derivatives. This is possible, because Bonaca's generative stream model is fast. We discussed the scope of a minimum-scope paper that looks at these things, and Bonaca started computing derivatives.

Lauren Anderson (Flatiron) and I looked at her dust estimates for the stars in Gaia DR1 TGAS. She is building a model of the color–magnitude diagram with an iterative dust optimization: At zeroth iteration, the distances are (generally) over-estimated; we dust-correct, fit the CMD, and re-estimate distances. Then we re-estimate dust corrections, and do it again. The dust corrections oscillate between over- and under-corrections as the distances oscillate between over- and under-estimates. But it does seem to converge!


similarities of stars; getting started in data science

I met with Keith Hawkins (Columbia) in the morning, to discuss how to find stellar pairs in spectroscopy. I fundamentally advocated chi-squared difference, but with some modifications, like masking things we don't care about, removing trends on length-scales (think: continuum) that we don't care about, and so on. I noted that there are things to do that are somewhat better than chi-squared difference, that relate to either hypothesis testing or else parameter estimation. I promised him a note about this, and I also owe the same to Melissa Ness (MPIA), who has similar issues but in chemical-abundance (rather than purely spectral) space. Late in the day I worked on this problem over a beer. I think there is a very nice solution, but it involves (as so many things like this do) a non-trivial completion of a square.

In the afternoon, I met with my undergrad-and-masters research group. Everyone is learning how to install software, and how to plot spectra, light curves, and rectangular data. We talked about projects with the Boyajian Star, and also with exoplanets in 1:1 resonances (!).


D. E. Shaw

The research highlight of my day was a trip to D. E. Shaw, to give an academic seminar (of all things) on extra-solar planet research. I was told that the audience would be very mathematically able and familiar with physics and engineering, and it was! I talked about the stationary and non-stationary Gaussian Processes we use to model stellar (stationary) and spacecraft (non-stationary) variability, how we detect exoplanet signals by brute-force search, and how we build and evaluate hierarchical models to learn the full population of extra-solar planets, given noisy observations. The audience was interactive and the questions were on-point. Of course many of the things we do in astrophysics are not that different—from a data-analysis perspective—from things the hedge funds do in finance. I spent my time with the D. E. Shaw people trying to understand the atmosphere in the firm. It seems very academic and research-based, and (unlike at many banks), the quantitative researchers run the show.


fitting stellar spectra and deblending galaxy images

Today was group meetings day. In the Stars meeting, John Brewer (Yale) told us about fitting stellar spectra with temperature, gravity, and composition, epoch-by-epoch for a multi-epoch radial-velocity survey. He is trying to understand how consistent his fitting is, what degeneracies there are, and whether there are any changes in temperature or gravity that co-vary with radial-velocity jitter. No results yet, but we had suggestions for tests to do. His presentation reinforced my idea (with Megan Bedell) to beat spectral variations against asteroseismological oscillation phase.

In the Cosmology meeting, Peter Melchior (Princeton) told us about attempts to turn de-blending into a faster and better method that is appropriate for HSC and LSST-generation surveys. He blew us away with a tiny piece of deep HSC imaging, and then described a method for deblending that looks like non-negative matrix factorization, plus convex regularizations. He has done his research on the mathematics around convex regularizations, reminding me that we should do a more general workshop on these techniques. We discussed many things in the context of Melchior's project; one interesting point is that the deblending problem doesn't necessarily require good models of galaxies (Dustin Lang and I always think of it as a modeling problem); it just needs to deliver a good set of weights for dividing up photons.
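This is not Melchior's method (which adds the convex regularizations and much more), but the NMF core is easy to sketch: factor a multi-band image stack V into non-negative per-source band fluxes W and profiles H with the classic multiplicative updates. A 1-D toy with two overlapping sources of different colors:

```python
import numpy as np

rng = np.random.default_rng(4)

# Two overlapping 1-D "galaxy" profiles with different colors, observed
# in two bands; rows of V are the bands.
x = np.linspace(0.0, 1.0, 200)
src1 = np.exp(-0.5 * ((x - 0.40) / 0.05) ** 2)
src2 = 0.7 * np.exp(-0.5 * ((x - 0.55) / 0.08) ** 2)
V = np.array([1.0 * src1 + 0.3 * src2,
              0.2 * src1 + 1.0 * src2]) + 1e-9  # keep strictly positive

# Classic multiplicative-update NMF: V ~ W H, everything non-negative.
# W (2 x 2) holds per-source band fluxes; H (2 x 200) the profiles.
k = 2
W = rng.uniform(0.1, 1.0, (2, k))
H = rng.uniform(0.1, 1.0, (k, x.size))
for _ in range(3000):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-12)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-12)

resid = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

The non-negativity is doing the deblending work: it forbids the negative-galaxy solutions that plague unconstrained factorizations, and the rows of H typically end up being a (permuted, rescaled) division of the photons between the two sources, which is exactly the "good set of weights" framing from the discussion.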