On a call with Dan Foreman-Mackey we had an insight into our combinatoric-degeneracies toy problem: We can construct a more general family of models by considering every way we can put K (unique, identifiable) pigeons into M (unique, identifiable) holes, with or without exclusion of pigeons (but preferably with). I think we can still make an analytically integrable likelihood function in all these cases. That's the next generalization of the problem. I was trying to build code that deterministically sets all parameters; Foreman-Mackey encouraged me to just pull the parameters from sensible distributions.
On the weekend I had two long conversations with Hans-Walter Rix about the Milky Way chemical abundance ratio gradients that I sent out to the APOGEE Collaboration this past week. Rix's view is that naive interpretation of poorly thought-out gradient estimates would set things in the field back; we should make plots of things that are (as much as possible) easy to predict, and we should interpret them with physically motivated models.
I agreed, and that led us down the path of working out the Right Thing To Do (tm). Of course this is to build a model of the star-formation history and inflow and outflow history of every molecular cloud in the Galaxy (and anything that has fallen into the Galaxy), and the IMF and all supernova and stellar-wind yields, and constrain this model with every star ever observed! So I did that on Sunday afternoon.
No, I didn't: I spent time thinking about what might be possible baby steps towards solving all of astronomy. Or, in other words: What would you do right now if you had 100,000 stars with 15 chemical abundances measured for each one? We have that!
After making and sending radial abundance gradients (in the Milky Way) around to the APOGEE collaboration, I realized that I could just as easily do vertical gradients. I made and sent those today. The interesting things appear to be that the alpha elements do not all track one another, and that the gradient differences among them in the radial direction are different from those in the vertical direction. Interpretation is not trivial, however, because of radial migration and its attendant implications for variation in height distributions.
I started to code up the idea that Daniel Foreman-Mackey and I talked about earlier in the week: Making toy inference problems that have analytic Bayesian evidence integrals but combinatoric degeneracies (labeling degeneracies) among sets of parameters. I got started, and worked out some of my conceptual issues. Not all of them, though! A conversation with Alex Barnett (SCDA) helped immensely.
A few weeks ago, Jason Sanders (Cambridge) computed isochrone-based distances for all the red giants for which Andy Casey, Melissa Ness, and I measured 15-dimensional chemical abundance information. Finally today I incorporated these distances, computed Galactocentric positions (approximately) and plotted chemical abundance trends in the Galaxy. They look super-promising! Here's one:
I also had a long phone call with Dan Foreman-Mackey about various things. We had a good idea on the call for testing MCMC methods in problems where there are labeling degeneracies. For example, if you do a 5-planet model for a stellar lightcurve, you can make an identical model by swapping all of the properties of the third and fourth planet in your list. That's a combinatoric degeneracy: It means that there are (at least) 5-factorial identical modes in your posterior pdf. We figured out a large family of analytic distributions with these properties, and even ones where we can compute (analytically) the fully marginalized likelihood (FML) or Bayesian evidence integral. This will give us benchmarks for testing codes that claim to compute this notoriously hard integral.
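One way to picture such a family is the following minimal sketch (a hypothetical construction for illustration, not necessarily the one we worked out): give K "planet" parameters independent Gaussian priors and symmetrize a Gaussian likelihood over all K! labelings. Swapping any two parameters then leaves the likelihood unchanged, so the posterior has K! identical modes, and because every permutation term is a product of Gaussians, the evidence integral is analytic.

```python
import math
from itertools import permutations

import numpy as np

# Hypothetical toy family: K "planet" parameters theta_k with independent
# Gaussian priors N(0, s2_prior); the likelihood is averaged over all K!
# labelings, so it is invariant under any permutation of the parameters.
K = 3
y = np.array([0.3, -1.1, 0.8])  # fake "data", one value per planet
s2_prior, s2_like = 1.0, 0.25

def log_likelihood(theta):
    # average over all labelings -> K! identical modes in the posterior
    terms = [np.sum(-0.5 * (np.asarray(p) - y) ** 2 / s2_like
                    - 0.5 * np.log(2 * np.pi * s2_like))
             for p in permutations(theta)]
    return np.logaddexp.reduce(terms) - math.log(math.factorial(K))

def log_evidence():
    # each permutation term integrates to prod_k N(y_k | 0, s2_like + s2_prior),
    # identically for every labeling, so the K! terms and the 1/K! cancel
    s2 = s2_like + s2_prior
    return float(np.sum(-0.5 * y ** 2 / s2 - 0.5 * np.log(2 * np.pi * s2)))
```

A Monte Carlo average of the likelihood over prior draws should agree with `log_evidence()`, which is exactly the kind of benchmark a claimed evidence-computing code could be checked against.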
We met today at Columbia for #AstroHackNY. Inspired by a problem from Ellie Schwab (CUNY), I spoke about the fitting of data in which you are presented with not just simple measurements, but some of the data are (only) upper limits. I gave a few strategies, from the life-hacking strategy of getting the data provider to replace the upper limits with forced measurements (who said measurements have to deviate significantly from zero to be useful?) to the ur-Bayesian strategy of including the missing data as (nuisance) parameters in the fit. After this, Alex Malz pointed me to the Kaplan–Meier estimator for survival analysis, and we figured out how we might convert the fitting problem into a survival-analysis analog.
After this, Nicholas Stone (Columbia) talked about the formation of stars (and thus binary stars, and thus black-hole binaries, and thus gravitational-wave events) in accretion disks. He has a whole theory of how to make binaries that are tight enough to lead to (relatively) prompt mergers. For me, the idea that you will have star formation in Q-unstable disks is a very good idea: In fact, this star formation ought to be self-regulating, because as stars form and heat up the disk, the disk becomes stable. Lots of predictions to be made here.
In between these things, and after, Adrian Price-Whelan and I discussed the visualization of the very faint, hard-to-see tidal stream he has found.
Boris Leistedt came by for an hour to discuss his progress on huge-survey photometric redshift determination. His method is truly novel: It is not a template-based method in the traditional sense, but it is also not a machine-learning method of the train-and-test variety (supervised regression, like Random Forest or equivalent): It uses the causal and noise model of a template method, but fits for the details of every one of the set of templates and also the luminosity functions for the galaxies generated by each template. This makes it possible to build a photometric redshift system entirely from the data, one that can be trained with a very small amount of spectroscopy. Most importantly, the model can be trained with a training set that does not have the same flux or redshift distribution as the bulk of the sources. That is, his method is perfectly matched to the future of imaging surveys. We planned the first paper.
[I spent the last week on vacation, doing only minimal writing and reading.]
I spent the weekend working through the draft of Andy Casey's The Cannon 2 paper, in which we use L1 regularization to permit The Cannon to build a model of stellar spectra with dozens of labels (think: abundances). And also Dun Wang's response to referee on the CPM self-calibration of the Kepler data. I also spent some time working out my priorities for March, April, and May.
The day started with Dun Wang, Steven Mohammed, David Schiminovich, and me meeting to discuss GALEX projects. Of course instead we brain-stormed projects we could do around the LIGO discovery of gravitational radiation. So many ideas! Rates, counterparts, and re-analysis of the raw data emerged as early leaders in the brain-storming session.
Adrian Price-Whelan crashed the party and showed me evidence he has of a disrupting globular cluster. Not many are known! So then we dropped everything and spent the day getting membership probabilities for stars in the field. The astrophysical innovation is that Price-Whelan found this candidate on theoretical grounds: What Milky Way clusters are most likely to be disrupting? The methodological innovation is that we figured out a way to do membership likelihoods without an isochrone model: We are completely data-driven! We fired a huge job into the NSF supercomputer Stampede. Holy crap, that computer is huge.
The day started at Columbia, where many hundreds of people showed up to listen to the announcement from the LIGO project. As expected, they announced the detection of a gravitational inspiral, merger, and ringdown from a pair of 30-ish solar-mass black holes. Incredible. The signal is so clear, you can just see it directly in the data stream. There was lots of great discussion after the press conference, led by Imre Bartos (Columbia), who did a great job of fielding questions. I asked about the large masses (larger than naively expected), and about the cosmological-constraint implications. David Schiminovich asked about the event rate, which looks high (especially because we all believe they have more inspirals in the data stream). Adrian Price-Whelan asked about the Earth-bound noise sources. And so on. It was a great party, and it is a great accomplishment for a very impressive experiment. And there will be much more, we all very much hope.
In the afternoon, I had the pleasure of serving on the committee of Henrique Moyses (NYU), who successfully defended a PhD on microscopic particles subject to non-conservative forces (and a lot of thermal noise). He has beautiful theoretical explanations for non-trivial experimental results on particles that are thermophoretic (are subject to forces caused by temperature gradients). Interestingly, the thermophoretic mechanisms are not well understood, but that didn't stop Moyses from developing a good predictive theory. Moyses made interesting comments on biological systems; it appears that driven, microscopic, fluctuating systems collectively work together to make our bodies move and work. That's incredible, and shows just how important this kind of work is.
The twitters are ablaze with rumors about the announcement from the LIGO project scheduled for tomorrow. We discussed this in group meeting today, with no embargo-breaking by anyone. That is, on purely physical, engineering, sociological, and psychological grounds we made predictions for the press release tomorrow. Here are my best predictions: First, I predict that the total signal-to-noise of any detected black-hole inspiral signal they announce will be greater than 15 in the total data set. That is, I predict that the likelihood function for the overall, scalar signal amplitude will have a half-width that is less than 1/15 of its mode. Second, I predict that the uncertainty on the sum of the two masses (that is, the total mass of the inspiral system, if any is announced) will be dominated by the (large, many hundreds of km/s) uncertainty in the peculiar velocity of the system (in the context that the system lives inside the cosmological world model). Awesome predictions? Perhaps not, but you heard them here first!
[Note to the world: This is not an announcement: I know nothing! This is just a pair of predictions from an outsider.]
We discussed the things that could be learned from any detection of a single black-hole inspiral signal, about star formation, black-hole formation, and galaxies. I think that if the masses of the detected black holes are large, then there are probably interesting things to say about stars or supernovae or star formation.
Today was the first meeting of #AstroHackNY, where we discuss data analysis and parallel work up at Columbia on Tuesday mornings. We discussed what we want to get out of the series, and started a discussion of why we do linear fitting the way we do, and what are the underlying assumptions.
Prior to that, I talked with Hans-Walter Rix about interpolation and gridding of spectral models. We disagree a bit on the point of all this, but we are trying to minimize the number of stellar model evaluations we need to do to get precise, many-element abundances with a very expensive physical model of stars. We also discussed the point that we probably have to cancel the Heidelberg #GaiaSprint, because of today's announcement from the Gaia Project.
In a day limited by health issues, I had a useful conversation with Leslie Greengard and Alex Barnett (SCDA, Dartmouth) about star-shades for nulling starlight in future exoplanet missions. They had ideas about how the electromagnetic field might be calculated, and issues with what might be being done with current calculations of this. These calculations are hard, because the star-shades under discussion for deployment at L1 are many times 10^7 wavelengths in diameter, and millions of diameters away from the telescope!
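To get a feel for why these scales are hard, here is a back-of-the-envelope calculation with illustrative numbers (not actual mission specs): a shade tens of meters across, tens of thousands of km from the telescope, in visible light.

```python
# Rough scale of the star-shade diffraction problem (illustrative numbers).
wavelength = 5e-7   # m, visible light
diameter = 30.0     # m -> tens of millions of wavelengths across
distance = 4e7      # m -> over a million shade diameters away

# Fresnel number a^2 / (lambda z), with a the shade radius: of order ten,
# so the field at the telescope is in the awkward Fresnel regime, far from
# both the geometric-optics and far-field (Fraunhofer) limits.
fresnel_number = (diameter / 2.0) ** 2 / (wavelength * distance)

print(diameter / wavelength)  # wavelengths across the shade
print(distance / diameter)    # shade diameters to the telescope
print(fresnel_number)
```

With these numbers the shade is 6e7 wavelengths across and ~1.3e6 diameters from the telescope, consistent with the scales quoted above; resolving wavelength-scale structure over such an aperture is what makes brute-force electromagnetic calculation infeasible.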
I also talked to Boris Leistedt about galaxy and quasar cosmology using imaging (and a tiny bit of spectroscopy), in which the three-dimensional mapping is performed with photometric redshifts, or more precisely models of the source spectral energy distributions that are modeled simultaneously with the density field and so on. We are working on a first paper with recommendations for LSST. The idea is that a small amount of spectroscopy and an enormous amount of imaging ought to be sufficient to build a model that returns a redshift and spectral energy distribution for every source.
Dun Wang, Steven Mohammed (Columbia), David Schiminovich and I met to discuss GALEX. Wang has absolutely beautiful images of the GALEX flat, and he can possibly separate the flat appropriate for stars from the flat appropriate for background photons. We realized we might need some robust estimation to deal with transient reflections from bright stars.
Matthew Penny (OSU) showed up and distracted us onto K2 matters; Penny is involved in our efforts to deliver photometry from the crowded fields of K2 Campaign 9 in the Milky Way bulge. Wang showed his CPM-based prediction of the crowded field in K2C0 test data, where he has an absolutely beautiful time-domain image model. This is like difference imaging, except that the prediction is made not from a master image, but from the time-domain behavior of other (spatially separated) pixels. The variable stars and asteroids stick out dramatically. So I think we are close to having a plan.
A day of mainly writing: Alex Malz nearly finished a NASA graduate fellowship proposal; I put comments on pages from Dun Wang's CPM paper; and I closed open issues on my MCMC tutorial. I had a long discussion with Tony Butler-Yeoman (Wellington) and Marcus Frean (Wellington) about our Oddity method for detecting anomalies (like astronomical sources) in imaging. They asked me two very good questions about writing for astronomers: How do you demonstrate to astronomers that this is a useful method that they want to try—with a few good examples or a large-scale statistical test? And how do you write a methods paper in astrophysics?
On the latter, I recommended our new methods-paper template, which is this: Introduction, then a full statement of all of the assumptions underlying the method. Then a demonstration that the method is best or very good under those assumptions (using fake data or analytical arguments). Then a demonstration that the method is okay on real data. Then a discussion, in which the assumptions are addressed, one by one: This permits discussion of advantages, disadvantages, limitations, and places where improvements are possible. The key idea of all this is that a good method should be the best possible method under some set of identifiable assumptions. I don't think that's too much to ask of a method (and yet it is not true of most of the things the data-analysis community does these days).
At group meeting, Chang Hoon Hahn, MJ Vakili, Kilian Walsh, and I had a discussion of the inference of variance: The idea is that there is an extremely dumb toy problem in the inference of a variance of a one-dimensional distribution of points that is directly analogous to the inferences of the two-point correlation function of galaxies in the Universe. I can show, with my toy problem, that conventional cosmological practice is wrong or biased. We got super-confused about terminology (variance of the variance, and the data or the estimator based on the data, and so on), which illustrates how hard this is going to be to write up!
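A minimal version of the bias in question can be demonstrated in a few lines (my sketch of the flavor of the toy problem, not the full analogy): the maximum-likelihood plug-in estimator of the variance of N Gaussian points is biased low by a factor (N-1)/N, and plug-in estimators of the two-point function suffer analogous effects.

```python
import numpy as np

# Infer the variance V of a 1-D Gaussian from N points, many times over,
# and compare the plug-in (divide-by-N) estimator to the Bessel-corrected one.
rng = np.random.default_rng(1)
N, trials, V_true = 5, 200000, 1.0

x = rng.normal(scale=np.sqrt(V_true), size=(trials, N))
V_ml = np.mean((x - x.mean(axis=1, keepdims=True)) ** 2, axis=1)  # divide by N
V_unbiased = V_ml * N / (N - 1)                                   # divide by N-1

print(V_ml.mean())        # ≈ (N-1)/N * V_true = 0.8: biased low
print(V_unbiased.mean())  # ≈ 1.0
```

And this is before the terminological minefield of the variance *of* these estimators, which is where the group-meeting confusion set in.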
In the afternoon I had my weekly tea with Phil Marshall (by videophone). We talked about the reproducibility crisis in the social and health sciences and how that might apply or be related to issues in astronomy. My view is that astronomy results fail to reproduce just like these other studies, but we don't notice it as much because we have stronger p-value requirements. But still subsequent studies tend to be inconsistent with previous studies. We discussed blinding and hypothesis registration; many astronomers are dead-set against these tools. We discussed why that is, and whether being against these is effectively being for irreproducibility.
I spoke at length with Daniel Foreman-Mackey about current projects, and also possible April Fools' projects. It is getting late to do the latter, since (as my loyal reader knows), we take our April Fools contributions very, very seriously. When we do them. One idea is to do some probabilistic modeling of the “Alien Megastructure” Kepler source. We also talked about recent breakthroughs with Bernhard Schölkopf and Dun Wang on doing ultra-crowded-field photometry with independent components analysis (ICA).
At lunch, Andrew Zirm (greenhouse.io) proposed that we start a Dumb Ideas in Data Science meetup. The idea is that so many good ideas are dumb ideas. And so many bad ideas! Anyway, I hope this happens.
In the afternoon, I launched the #GaiaSprint web page and registration information. If you want to hack on the Gaia data the moment it is released, then the #GaiaSprint is for you!
We finally got fully ready to pull the #GaiaSprint trigger. We expect to pull it tomorrow. This will be a meeting in Heidelberg, and another in New York City (at the brand-new Simons Center for Computational Astrophysics), both to occur after the Gaia First Data Release. The idea is that it is not a traditional meeting but more like a hack week, intended to facilitate exploitation of the new data. I also spent some time writing in our MCMC tutorial.