inference of variance; reproducibility

At group meeting, Chang Hoon Hahn, MJ Vakili, Kilian Walsh, and I had a discussion of the inference of variance: The idea is that there is an extremely dumb toy problem in the inference of the variance of a one-dimensional distribution of points that is directly analogous to the inference of the two-point correlation function of galaxies in the Universe. I can show, with my toy problem, that conventional cosmological practice is wrong or biased. We got super-confused about terminology (the variance of the variance, the data versus the estimator based on the data, and so on), which illustrates how hard this is going to be to write up!
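The toy problem is simple enough to sketch in a few lines. What follows is my own minimal illustration, not the exact example from group meeting: estimating the variance around the sample mean biases the answer low, for the same structural reason that two-point estimates built on data-derived means can be biased.

```python
import numpy as np

rng = np.random.default_rng(42)
true_var = 4.0         # known variance of the parent distribution
N, trials = 8, 200000  # small samples, many trials

# many small one-dimensional data sets drawn from the parent
data = rng.normal(0.0, np.sqrt(true_var), size=(trials, N))

# plug-in estimator: mean squared deviation from the *sample* mean
biased = np.mean((data - data.mean(axis=1, keepdims=True)) ** 2, axis=1)

# bias-corrected estimator (divide by N - 1 instead of N)
unbiased = biased * N / (N - 1)

print(biased.mean())    # systematically below true_var
print(unbiased.mean())  # close to true_var
```

The connection to the correlation function is that in both cases the mean (or mean density) is estimated from the same data used to form the second moment, which pulls the second moment low.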

In the afternoon I had my weekly tea with Phil Marshall (by videophone). We talked about the reproducibility crisis in the social and health sciences and how that might apply or be related to issues in astronomy. My view is that astronomy results fail to reproduce just like these other studies, but we don't notice it as much because we have stronger p-value requirements. But still subsequent studies tend to be inconsistent with previous studies. We discussed blinding and hypothesis registration; many astronomers are dead-set against these tools. We discussed why that is, and whether being against these is effectively being for irreproducibility.


#GaiaSprint is live; dumb ideas

I spoke at length with Daniel Foreman-Mackey about current projects, and also possible April Fools' projects. It is getting late to do the latter, since (as my loyal reader knows), we take our April Fools contributions very, very seriously. When we do them. One idea is to do some probabilistic modeling of the “Alien Megastructure” Kepler source. We also talked about recent breakthroughs with Bernhard Schölkopf and Dun Wang on doing ultra-crowded-field photometry with independent components analysis (ICA).

At lunch, Andrew Zirm (greenhouse.io) proposed that we start a Dumb Ideas in Data Science meetup. The idea is that so many good ideas are dumb ideas. And so many bad ideas! Anyway, I hope this happens.

In the afternoon, I launched the #GaiaSprint web page and registration information. If you want to hack on the Gaia data the moment it is released, then the #GaiaSprint is for you!


preparing for the #GaiaSprint

We finally got fully ready to pull the #GaiaSprint trigger. We expect to pull it tomorrow. This will be a meeting in Heidelberg, and another in New York City (at the brand-new Simons Center for Computational Astrophysics), both to occur after the Gaia First Data Release. The idea is that it is not a traditional meeting but more like a hack week, intended to facilitate exploitation of the new data. I also spent some time writing in our MCMC tutorial.


#AAAC, day 2, classification by discriminability

The morning started with the second day of the #AAAC meeting. Steve Kahn (Stanford) talked to us about the relationships among the superficially similar projects LSST, Euclid, and WFIRST. The argument is that they are highly complementary. I didn't really disagree, but it is not obvious that we as a community would be willing to spend a lot of money on WFIRST if we knew that LSST and Euclid are definitely going forward. I asked pointed questions and hope to follow up. Since WFIRST can do so many things, maybe it should slightly re-prioritize given the context?

In the afternoon, I talked with Amit Singer (Princeton), who was pretty adamant that the stuff I am doing on single-photon imaging is stupid and a waste of time! Late in the day, based on a comment by Greengard, Jeremy Magland and I formulated an awesome new clustering (or unsupervised classification) algorithm: Define the discriminability (of j from k) to be the empirical probability that a point from distribution j is closer to a neighbor in distribution j than to a point in distribution k. Now set the boundary (which could be an arbitrarily shaped surface) to maximize discriminability. Magland ended up getting pessimistic when we realized that it would be slow. But it is worth exploring.
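Here is a minimal sketch of my reading of that definition, on hypothetical two-dimensional Gaussian blobs (this is not Magland's code; the function name and test data are mine):

```python
import numpy as np

def discriminability(Xj, Xk):
    """Empirical probability that a point in sample Xj lies closer to its
    nearest neighbor within Xj than to its nearest point in Xk."""
    wins = 0
    for i, x in enumerate(Xj):
        d_within = np.min(np.linalg.norm(np.delete(Xj, i, axis=0) - x, axis=1))
        d_between = np.min(np.linalg.norm(Xk - x, axis=1))
        wins += d_within < d_between
    return wins / len(Xj)

rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0, size=(200, 2))
B = rng.normal(6.0, 1.0, size=(200, 2))  # well separated from A
C = rng.normal(0.5, 1.0, size=(200, 2))  # heavily overlapping with A

print(discriminability(A, B))  # near 1: A is easy to discriminate from B
print(discriminability(A, C))  # nearer 1/2: A and C are confusable
```

The brute-force nearest-neighbor search is O(n^2) per pair of groups, which is presumably the slowness that made Magland pessimistic; tree-based neighbor searches would help, but optimizing an arbitrarily shaped boundary is the expensive part.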


#AAAC, day 1

Today was day 1 of the two-day Astronomy and Astrophysics Advisory Committee (that advises NSF, NASA, and the DOE on places of intellectual and funding overlap). In the meeting (which is open to all by phone) I learned various things. One is that the LISA Pathfinder has successfully arrived at L1 and is apparently fully functional, including in its high-tech thruster system. The test masses are set to be released in weeks (if I remember correctly). John Carlstrom (Chicago) gave a nice presentation about CMB Stage 4 experiments, which are ramping up. There is a lot still to learn from the CMB. He emphasized that foregrounds are the dominant issue for many important experiments, along with the lensing distortions. I have ideas about both, but especially foregrounds: I don't think the CMB community is using the most sophisticated, tractable models that are out there. I made a mental note to contact people off-line about this.


photon pile-up

At Blanton–Hogg group meeting, Daniela Huppenkothen brought up photon pile-up in x-ray and gamma-ray detectors. The issue is that if two photons arrive at the same time, or in the same electronics-restricted time window, they will appear as one photon, but of higher energy. It is an issue for Chandra and for Fermi, among other assets. This pile-up leads to a distortion of the spectrum (and point-spread function, and so on) of very bright sources. We discussed how one might model this, given that it is easy to simulate but hard to describe with a likelihood function. We also came up with a ridiculously simple idea for testing cosmic-ray detection in time-series imaging, which really, really needs to be done.
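The "easy to simulate" half really is a few lines. This toy assumes Poisson arrivals per read-out window and an exponential source spectrum (both assumptions mine, not a model of Chandra or Fermi electronics):

```python
import numpy as np

rng = np.random.default_rng(1)

rate = 0.5         # expected photons per electronics window (bright source)
mean_energy = 1.0  # assumed exponential source spectrum
n_windows = 100_000

counts = rng.poisson(rate, size=n_windows)

recorded = []
for c in counts:
    if c == 0:
        continue
    # all photons landing in one window are read out as a single
    # event whose energy is the sum of the individual energies
    recorded.append(rng.exponential(mean_energy, size=c).sum())
recorded = np.array(recorded)

print(len(recorded), "events from", counts.sum(), "photons")
print(recorded.mean())  # exceeds mean_energy: the spectrum is hardened
```

Simulating forward is trivial; the likelihood is the hard direction, because each recorded event marginalizes over an unknown number of true photons in the window.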

In the afternoon, I did text writing and problem (exercise) writing in my MCMC tutorial. Stay on target. #AcWriYear


MCMC and the Milky Way

I put more exercises into my MCMC document, and put another figure into my single-photon microscopy document. I had a long conversation with Hans-Walter Rix about all the projects we can do with our results on detailed abundances from The Cannon for stellar clusters and in the Milky Way disk.

In the afternoon I started preparing for two Gaia Sprints. These are going to be hack weeks in which we try to exploit the Gaia first data release data in the late Summer and early Fall. The idea is to produce publishable science in one tough week. Watch this space for an announcement soon.


not writing a book; single-photon imaging code

Stoked by having finished my first first-author paper in a long while, I had a call with Daniel Foreman-Mackey in which I proposed to him that I try to finish a paper every week until I got through my backlog! He talked me down to one paper per month, and we agreed that our MCMC tutorial document should be next. He argued that we should add exercises (it is, after all, a chapter of the book I will never write). I agreed and wrote an exercise later in the day. I have a bunch more to go.

In the afternoon, I had a discussion with Leslie Greengard of my results on imaging molecules at random, unknown (yes, random=unknown) orientations with single-photon images. We discussed two big issues. The first is writing and testing analytic derivatives of my fully marginalized likelihood function (which is the objective function I (horror) optimize for this project). The other issue is representation for the molecule. We discussed many options and tentatively settled on a simple linear parameterization in real space (not Fourier space). Still confused; Greengard points out that it is confusing because there genuinely is no simple answer: There are no bases with elements that are compact in both Fourier space and real space, for deep, deep reasons.


candidate Wang

Today Dun Wang (NYU) passed his oral candidacy exam. His PhD thesis is pretty ambitious: A self-calibration of the Kepler Spacecraft main-mission data, an ultraviolet map of the Milky Way from GALEX data (which he will also self-calibrate), and photometry in crowded fields for the K2 mission!


all the plots, radial-velocity survey design

In response to requests from my team, I made a 256-page, 2560-panel plot of every single one of the 256 k-means clusters we found in abundance space, a few of which we published in our paper yesterday. I don't see much else in there that is easy to interpret, but it looks to me like the higher metallicity groups we see are actually multiple groups mashed together. So I resolved to run at higher values of K on the weekend.

In the afternoon, I had a call with Dan Foreman-Mackey about exoplanet populations. Thinking of the future, he observed that the goal of finding reliable targets for some kind of Terrestrial Planet Finder mission and the goal of understanding how typical planetary systems form and evolve might be very much at odds, especially when it comes to radial-velocity surveys. Indeed, there haven't been many population analyses of radial-velocity surveys to date, in part because most of them are not designed with long-term statistical goals in mind. We discussed a bit about what we might do to encourage a future in which both goals can be met, handily. I pointed out something that Charlie Lawrence (JPL) said to me at #AAS227 a couple of weeks ago: If some kind of TPF is going to cost billions of dollars, it is worth spending a few hundred million on the ground in preparation. So resources might be abundant.


paper submitted!

Today I presented our chemical-tagging results at Blanton–Hogg group meeting. We show that (now that we can see them with The Cannon) overdensities in chemical space appear also to be overdensities (or at least oddities) in phase space. I followed the meeting by making final edits to the paper, submitting it to the Astrophysical Journal and the APOGEE Collaboration, and putting it on the arXiv. I also sent it to friends and colleagues. This led to an email battle with Charlie Conroy (CfA), who believes our results are somewhere between trivial and wrong!


galaxy redshift model of everything

Today I produced a second draft of the chemical-tagging paper. Boris Leistedt came by the SCDA and updated me on his SED and redshift model for photometric surveys. I hesitate to even call this “photometric redshifts” because the ambition is so much greater: It is to get the luminosity distribution, the redshift dependence, and the spectral energy distribution for every type of object on the sky, plus assign types and redshifts to all the sources in a multi-band survey (or collection thereof). We even talked about constraining cosmology with galaxy counts, as was first proposed by Hubble so many years ago: It is a generative model for all the redshifted objects on the sky, plus perhaps an admixture of stars in our own Galaxy. Ambitious! And, I think, not impossible, at least if we keep our early goals limited.



chemical-tagging first draft

I decided to go all in on the chemical-tagging paper; I completed a first draft today, with boat-loads of help from Andy Casey and Melissa Ness over the weekend. Hoping to finish and submit this week.


chemical tagging, phase retrieval, and imaging with single photons

I spent the morning writing about chemical tagging. It appears that The Cannon now delivers precise enough chemical abundance measurements that we can find structure in abundance space (as I discovered in Florida this past weekend). I started to write this up into some kind of paper. In the afternoon, I joined a discussion with the usual mathematical suspects about phase retrieval and set intersection methods. This is very promising! Also, during that discussion, Leslie Greengard handed me a paper by Ayyer et al. that poses (pretty much) the question I solved over the break: Can you reconstruct a 3-d image from unknown projections, in each of which you get only one photon?


data-driven model of supernova yields

I spent the last few days working on abundance-space structure, with some detours to hang out with colleagues in town for the Future of AI meeting at NYU, including Bernhard Schölkopf. Today Schölkopf and I spent some quality time talking about our next round of projects, in two categories. In the first category, we talked about simple situations in astronomy in which independent components analysis might be useful. One is supernova yields: Jennifer Johnson (OSU) had asked me last weekend what kinds of supernovae create potassium; I promised her not an answer but a method for getting an answer. Schölkopf suggested that this is a perfect case for ICA: We want to matrix factorize, but in a way that separates causes, not variance! ICA is based on some great math, to which he pointed me.
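To make the "causes, not variance" point concrete, here is a numpy-only FastICA sketch on fabricated two-channel toy data (the sources and mixing matrix are invented for illustration; a real analysis would use a vetted implementation, such as scikit-learn's FastICA):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000

# two independent, non-Gaussian "yield" sources, standing in for two
# supernova channels that each imprint their own abundance pattern
s1 = rng.laplace(size=n)
s2 = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=n)
S = np.vstack([s1, s2])

A = np.array([[1.0, 0.6], [0.4, 1.0]])  # unknown mixing of the channels
X = A @ S                               # the "observed" abundances

# whiten the observations
X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(X @ X.T / n)
Z = (E / np.sqrt(d)) @ E.T @ X

# FastICA fixed-point iterations with the tanh nonlinearity
W = rng.normal(size=(2, 2))
for _ in range(200):
    g = np.tanh(W @ Z)
    W_new = g @ Z.T / n - np.diag((1.0 - g**2).mean(axis=1)) @ W
    U, _, Vt = np.linalg.svd(W_new)  # symmetric decorrelation
    W = U @ Vt

recovered = W @ Z  # matches the sources up to sign and permutation
```

The contrast with a variance-based factorization is the point: PCA would return orthogonal directions of maximal variance, which mix the two channels, while the non-Gaussianity-seeking iteration above separates the underlying causes.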

In the second category, we talked about the next generation of what astronomers call “image differencing”. We want to extend the Causal Pixel Model we built for Kepler self-calibration so that it works in situations (like LSST) in which there is heterogeneous temporal and spatial coverage of the sky. Then, if it works, we can use everything we have to predict the imaging data we are trying to subtract (or really just predict precisely).