separating noise from signal

Or Graur (JHU), Yuqian Liu (NYU), Maryam Modjaz (NYU), and Gabe Perez-Giz (NYU) came by today to pick my brain and Fadely's brain about interpreting spectral data. Their problem is that they want to analyze supernova spectral data for which they don't know the SN spectral type, the velocity broadening of the lines, the true spectral resolution, or the variance of the observational noise, and for which they expect the noise variance to depend on wavelength. We discussed proper probabilistic approaches, and also simple filtering techniques, to separate the signal from the noise. Obviously strong priors on supernova spectra help enormously, but the SN people want to stay as assumption-free as possible. In the end, a pragmatic filtering approach won out; we discussed ways to make the filtering sensible and not mix (too badly) the signal output with the noise output.
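To make the filtering idea concrete, here is a minimal sketch of one pragmatic option: a running-median filter that splits a spectrum into a smooth "signal" part and a residual "noise" part, and then estimates a wavelength-dependent noise variance from the residuals. This is an illustration and not necessarily the filter we settled on; the function names, the window size, and the synthetic spectrum are all assumptions made for the example.

```python
import numpy as np

def running_median(values, half_width):
    """Median of values in a sliding window of up to 2 * half_width + 1 pixels."""
    n = len(values)
    out = np.empty(n)
    for i in range(n):
        lo = max(0, i - half_width)
        hi = min(n, i + half_width + 1)
        out[i] = np.median(values[lo:hi])
    return out

def split_signal_noise(flux, half_width=10):
    """Split flux into a smooth part and a residual, plus a local variance estimate."""
    signal = running_median(flux, half_width)
    residual = flux - signal
    # For Gaussian noise, the median of the squared residual is about
    # 0.4549 times the variance, so rescale to get a variance estimate.
    local_var = running_median(residual**2, half_width) / 0.4549
    return signal, residual, local_var

# Synthetic test spectrum (hypothetical): one broad feature, with noise
# whose standard deviation grows with wavelength.
rng = np.random.default_rng(0)
wave = np.linspace(0.0, 1.0, 2000)
true_signal = 1.0 + 0.5 * np.exp(-0.5 * ((wave - 0.5) / 0.1) ** 2)
noise_sigma = 0.02 + 0.08 * wave
flux = true_signal + noise_sigma * rng.normal(size=wave.size)

signal, residual, local_var = split_signal_noise(flux, half_width=10)
print("blue-end variance estimate:", local_var[:500].mean())
print("red-end variance estimate:", local_var[1500:].mean())
```

The window half-width is the knob that controls the mixing: if it is too small, noise leaks into the signal output, and if it is too large, broad spectral features leak into the noise output.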



Aside from a blackboard talk by Maryam Modjaz (NYU) about supernova types and classification, it was a day of all talk. I spoke with Goodman and CampHogg about Hou's paper on marginalized likelihood calculation using the geometric path. I spoke with Vakili about how you get, in kernel PCA, from the (high-dimensional) feature space back to the original data space. (It's complicated.) I spoke with the exoSAMSI crew about exoplanet populations inference; Megan Shabram (PSU) is close to having a hierarchical inference of the exoplanet eccentricity distribution (as a function of period). Finally, I spoke with Foreman-Mackey about his new evil plan (why is there a new evil plan every four days?) to build an interim-prior-based sampling of the posterior density of exoplanet parameters for every KOI in the Kepler Catalog.


white-dwarf binaries, importance sampling

In the astro seminar, Carlos Badenes (Pitt) talked about white-dwarf–white-dwarf binaries and an inferred rate of inspiral, based on SDSS spectra split up exposure by exposure: The orbits of the soon-to-merge white dwarfs are so fast and short-period that even the twenty-minute intervals between spectral exposures in SDSS are long enough to show velocity changes! He finds a merger event rate for the binaries large enough to explain the type-Ia supernova rate, but only if he permits sub-Chandrasekhar total masses to make the SNe. That is, he gets enough events, but they tend to be low-mass.

Tim Morton (Princeton) spent the day at NYU to talk exoplanets, sampling, selection functions, marginalized likelihoods, and so on. We had a productive talk about making high-performance importance-sampling code to compute the marginalized likelihoods.
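For concreteness, here is a minimal sketch of importance sampling for a marginalized likelihood, on a toy problem where the answer is known analytically. The model (Gaussian data with unknown mean and a Gaussian prior), the proposal (a widened Gaussian centered on the posterior), and all function names are assumptions made for this example; none of it is the code Morton and I discussed.

```python
import numpy as np

def log_gauss(x, mean, var):
    """Log of a one-dimensional Gaussian density."""
    return -0.5 * ((x - mean) ** 2 / var + np.log(2.0 * np.pi * var))

def log_likelihood(mu, y, sigma):
    """Log likelihood of data y for each trial mean in mu (shape (K,))."""
    return np.sum(log_gauss(y[None, :], mu[:, None], sigma**2), axis=1)

def log_evidence_importance(y, sigma, tau, n_samples=20000, widen=1.5, seed=0):
    """Importance-sampling estimate of log Z = log integral of L(mu) p(mu) dmu."""
    rng = np.random.default_rng(seed)
    n = len(y)
    # Proposal q: Gaussian centered on the posterior mean, but wider than
    # the posterior so that the importance weights stay bounded.
    post_var = 1.0 / (n / sigma**2 + 1.0 / tau**2)
    post_mean = post_var * np.sum(y) / sigma**2
    q_var = widen**2 * post_var
    mu = post_mean + np.sqrt(q_var) * rng.normal(size=n_samples)
    log_w = (log_likelihood(mu, y, sigma)
             + log_gauss(mu, 0.0, tau**2)
             - log_gauss(mu, post_mean, q_var))
    # Average the weights in a numerically stable way.
    m = np.max(log_w)
    return m + np.log(np.mean(np.exp(log_w - m)))

def log_evidence_exact(y, sigma, tau):
    """Analytic log evidence: y is Gaussian with covariance sigma^2 I + tau^2 11^T."""
    n = len(y)
    cov = sigma**2 * np.eye(n) + tau**2 * np.ones((n, n))
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (y @ np.linalg.solve(cov, y) + logdet + n * np.log(2.0 * np.pi))

rng = np.random.default_rng(123)
y = rng.normal(1.0, 0.5, size=20)
estimate = log_evidence_importance(y, sigma=0.5, tau=2.0)
exact = log_evidence_exact(y, sigma=0.5, tau=2.0)
print("importance sampling:", estimate, "exact:", exact)
```

In real problems the posterior is not available in closed form, so the hard part is building a proposal that covers it with heavier tails; the toy above sidesteps that by construction.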


kernel PCA, tool of the devil

I spent too much time today trying to understand kernel PCA, inspired by Vakili's use of it to build a probabilistic model of galaxy images. Schölkopf would be disappointed with me! I don't see how it can give useful results. But then on further reflection, I realized that all my problems with kPCA are really just re-statements of my problems with PCA, detailed in my HMF paper: PCA delivers results that are not affine invariant. If you change the metric of your space, or the units of your quantities, or shear or scale things, you get different PCA components. That problem is even more severe and hard to control and incomprehensible as you generalize with the kernel trick.
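The units complaint is easy to demonstrate with ordinary PCA. The following is a minimal sketch on made-up two-dimensional Gaussian data (the covariance and the factor-of-ten unit change are arbitrary choices for illustration): rescale one coordinate, redo the PCA, map the new first component back to the original units, and compare.

```python
import numpy as np

def first_pc(data):
    """First principal component (unit vector) of the rows of data."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

rng = np.random.default_rng(42)
cov = np.array([[1.0, 0.3],
                [0.3, 4.0]])
data = rng.multivariate_normal(np.zeros(2), cov, size=5000)

# Change the units of the first coordinate by a factor of ten.
scale = np.array([10.0, 1.0])
pc_original = first_pc(data)
pc_rescaled = first_pc(data * scale)

# Express the rescaled-space component as a direction in the original units.
pc_mapped_back = pc_rescaled / scale
pc_mapped_back /= np.linalg.norm(pc_mapped_back)

# If PCA respected the unit change, this cosine would be 1.
cosine = abs(pc_original @ pc_mapped_back)
print("first PC, original units:", pc_original)
print("first PC, rescaled then mapped back:", pc_mapped_back)
print("|cosine| between them:", cosine)
```

In the original units the first component points mostly along the second coordinate; after the unit change it points mostly along the first. Nothing about the data changed except the units.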

I also don't understand how you go from the results of kPCA back to reconstructions in the original data space. But that is a separate problem, and just represents my weakness.


plotting spectra

In a low-research day, I discussed spectral plotting with Jeffrey Mei (NYUAD). This is serious wheel-reinvention: Every student who works on spectra pretty much has to build her or his own plotting tools.


insane robotics, populations

In a blast from the past, James Long (TAMU) called me today to discuss a re-start of what I like to call the "insane robot" project, in which we are fitting photometric data censored by an unknown (but assumed stationary) probabilistic process. This project was started with Joey Richards (wise.io), who wrote much of the code with Long's help, but it has been dormant for some time now. One astonishing thing: After a couple of years of disuse, the code was comprehensible and ran successfully. Let's hear it for well-documented, well-structured code!

Late in the day, Foreman-Mackey proposed a very simple approach to inferring exoplanet population parameters, based only on the content of the Kepler "Object of Interest" catalog. That is, a way to build a probabilistic model of this catalog that would be responsible and rigorous (though involving many simplifying assumptions, of course). It relates to projects by Subo Dong and others, who have been doing approximations to hierarchical inference; one goal would be to test those conclusions. The common theme between the exoplanet project and the insane robot project is that both require a parameterized model of the completeness or data censoring; we don't know with any reliability in either case the conditions under which an observation makes it into the catalog.


evaluating and sampling from pdfs

I spoke with MJ Vakili today about how to turn his prior over galaxy images into a probabilistic weak lensing measurement system. Any time we write down a probabilistic model, we need to be able to evaluate the probability of some set of parameters given data, or some set of data given parameters, and we also need to be able to sample from it: We need to be able to generate fair samples of artificial data given parameters, and generate fair samples of parameters given data. Vakili is tasked with making both kinds of operations first correct and second fast; the weak lensing community won't care that we are more righteous if we aren't practicable.


supernova cosmology, Hou's dissertation

Masao Sako (Penn) gave the astro seminar today, talking about supernova cosmology, now and in the near future. Afterwards we discussed the possibility that precise cosmological measurements may be reaching their maximum possible precisions, some from cosmic variance and some from complicated and random systematic issues (unknown unknowns, as it were).

Before and at lunch, CampHogg discussed the chapters and title for Hou's PhD thesis, which is about probabilistic inference in the exoplanet domain. This subject of discussion was inspired by Hou's extremely rapid write-up of his new MCMC method (which he is calling multi-canonical, but which we now think is probably a misnomer).


exoplanet migration

After the Spitzer Oversight Committee meeting came to a close, I got lunch with Heather Knutson (Caltech), during which I picked her brain about things exoplanet. She more-or-less agreed with my position that if any eta-Earth-like calculation is going to be precise, it will have to find new, smaller planets, beyond what was found by Petigura and company (in their pay-walled article, and recent press storm). That said, she was skeptical that CampHogg could detect smaller-sized planets than anyone else has.

Knutson described to me a beautiful project in which she is searching the hot Jupiters for evidence of more massive, outer planets, and she says she does find them. That is, she is building up evidence that migration is caused by interactions with heavier bodies. She even finds that more massive hot Jupiters tend to have even more massive long-period siblings. That's pretty convincing.



I spent the day at the Spitzer Science Center participating in a review of preparations for Spitzer's proposal to the NASA Senior Review, which is empowered to continue or terminate the ongoing missions. I also wrote text for the NSF proposal being submitted by Geha, Johnston, and me.


data science; data-driven models

Today I turned down an invitation to the White House. That might not be research, but it sure is a first for me! I turned it down to hang out more with Vanderplas (UW). I hope he appreciates that! At the White House Office of Science and Technology Policy (okay, perhaps this is just on the White House grounds), there was an announcement today of the Moore-Sloan Data Science Environment at NYU, UW, and Berkeley. This is the project I was working on all summer; it has come to fruition, and we start hiring this Spring. Look for our job ads, which will be for fellowship postdocs, software engineering and programming positions, quantitative evaluation (statistics) positions, and even tenure-track faculty positions (the latter coming from NYU, not Moore and Sloan, but related).

At lunch, Vanderplas, Foreman-Mackey, Fadely, and I discussed alternative publication models and how they relate to our research. Foreman-Mackey reasserted his goal of having any exoplanet discoveries we make come out on Twitter before we write them up. Vanderplas is wondering if there could be a scientific literature on blogs that would "play well" with the traditional literature.

Earlier in the morning, Vanderplas gave us some good feedback on our data-driven model of the Kepler focal plane. He had lots to say about these "uninterpretable" models. How do you use them as if they provide just a calibration, when what they really do is fit out all the signals without prejudice (or perhaps with extreme prejudice)? Interestingly, the Kepler community is already struggling with this, whether they know it or not: The Kepler PDC photometry is based on the residuals away from a data-driven model fit to the data.


objective filter design, Bayes

Jake Vanderplas (UW), internet-famous computational data-driven astrophysicist, showed up at NYU for a couple of days today. He showed us some absolutely great results on objective design of photometric systems for future large imaging surveys (like LSST). His method follows exactly my ideas about how this should be done—it is a scoop, from my perspective—he computes the information delivered by the photometric bandpasses about the quantities of interest from the observed objects, as a function of exposure time. Fadely, Vanderplas, and I discussed what things about the bandpasses and the survey observing strategy he should permit to vary. Ideally, it would be everything, at fixed total mission cost! He has many non-trivial results, not the least of which is that the bandpasses you want depend on the signal-to-noise at which you expect to be working.

In the afternoon, Hou, Goodman, Fadely, Vanderplas, and I had a conversation about Hou's recent work on full marginalization of the likelihood function. In the case of exoplanet radial-velocity data, he has been finding that our simple "multi-canonical" method is faster and more accurate than the much more sophisticated "nested sampling" method he has implemented. We don't fully understand all the differences and trade-offs yet, but since the multi-canonical method is novel for astrophysics, we decided to raise its priority in Hou's paper queue.



In a day of proposal and letter writing, Fadely came by for a work meeting. We discussed all his projects and publications and priorities. On the HST WFC3 self-calibration project, he is finding that the TinyTim PSF model is not good enough for our purposes: If we use it we will get a very noisy pixel-level flat. So we decided we have to suck it up and build our own model. Then we realized that in any small patch of the detector, we can probably make a pretty good model just empirically from all the stellar sources we see; the entire HST Archive is quite a bit of data. Other decisions include: We will model the pixel-convolved PSF, not the optical PSF alone. There is almost no reason to ever work with anything other than the pixel-convolved PSF; it is easier to infer (smoother) and also easier to use (you just sample it, you don't have to convolve it). We will work on a fairly fine sub-pixel grid to deal with the fact that the detector is badly sampled. We will only do a regularized maximum likelihood or MAP point estimate, using convex optimization. If all that works, this won't set us back too far.


potential expansions

Late in the day I zoomed up to Columbia to discuss streams with Bonaca, Johnston, Küpper, and Price-Whelan. We discussed things related to our upcoming NSF proposal. One idea in the proposal is to look at models of the Milky Way gravitational potential that make use of expansions. In these kinds of problems, issues arise regarding what expansion to use, and what order to go to. On the former, choices include expansions that are orthogonal in something you care about, like the potential or density, or expansions that are orthogonal in the context of the data you have. That is, the data constrain the potential incompletely, so an expansion that is orthogonal in the potential basis will not have coefficients that are independently constrained by the data; there will be data-induced covariances in the uncertainties. On the latter (what order), choices include, at one extreme, just making a heuristic or educated guess, and at the other extreme, going fully non-parametric and inferring an infinity of parameters. You can guess what I want to try! But we will probably put more modest goals in the proposal, somewhere in between. Amusingly, both of these problems (orthogonal expansions for incomplete observations, and choices about expansion order) come up in cosmology and have been well studied there.
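The point about data-induced covariances can be seen in a one-dimensional toy that has nothing to do with real potentials. In this sketch (my own illustrative setup, with Legendre polynomials standing in for a potential expansion), the basis is orthogonal on the full interval from -1 to 1, but the "data" only cover part of it, so the least-squares coefficient uncertainties become strongly correlated.

```python
import numpy as np
from numpy.polynomial import legendre

def coefficient_correlations(x, degree=3, sigma=1.0):
    """Correlation matrix of least-squares Legendre coefficients.

    Assumes independent Gaussian noise of standard deviation sigma on
    measurements taken at the positions x.
    """
    design = legendre.legvander(x, degree)
    cov = sigma**2 * np.linalg.inv(design.T @ design)
    std = np.sqrt(np.diag(cov))
    return cov / np.outer(std, std)

def max_offdiag(corr):
    """Largest absolute off-diagonal element of a correlation matrix."""
    off = corr - np.diag(np.diag(corr))
    return np.max(np.abs(off))

# Full coverage of the interval on which the basis is orthogonal,
# versus coverage of only a small part of it.
x_full = np.linspace(-1.0, 1.0, 200)
x_partial = np.linspace(0.0, 0.5, 200)

corr_full = coefficient_correlations(x_full)
corr_partial = coefficient_correlations(x_partial)
print("max |correlation|, full coverage:", max_offdiag(corr_full))
print("max |correlation|, partial coverage:", max_offdiag(corr_partial))
```

With full coverage the coefficients are nearly independent; with partial coverage some pairs are almost perfectly degenerate. An expansion orthogonalized with respect to where the data actually are would remove those correlations, at the cost of no longer being orthogonal in the function itself.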



Jeffrey Mei (NYUAD) came by to discuss his project to infer the dust extinction law from SDSS spectra of F stars. We talked about a "centering" issue: We are regressing out g-band brightness (flux; a brightness or distance proxy), H-delta equivalent width (a temperature proxy), and extinction amplitude from the Schlegel, Finkbeiner, & Davis map. The coefficient of the latter will be interpretable in terms of the extinction law (the dust law). Because we have regressed out the observed g-band brightness, we get that the mean effect of extinction is zero in the g-band; that is, our results about the dust extinction are distorted by what we choose, precisely, to regress out. Another way to put it: The brightness is a function of distance, temperature, and extinction. So if you regress that out, you distort your extinction results. The point is obvious, but it took us a while to figure that simple thing out! We have a fix, and Mei is implementing it.
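The centering issue shows up exactly in a noise-free toy. The sketch below is a deliberately stripped-down stand-in for the real regression: it works in magnitudes rather than fluxes, drops the temperature (H-delta) term, and uses a made-up inverse-wavelength extinction law, so every name and number in it is an assumption for illustration. It shows the distortion only, not our fix.

```python
import numpy as np

rng = np.random.default_rng(1)
n_stars = 500

# Hypothetical toy extinction law k(lambda), in magnitudes per unit amplitude.
wavelengths = np.linspace(0.4, 0.9, 6)  # microns; purely illustrative
k_true = 1.0 / wavelengths
g_index = 1  # pretend this wavelength stands in for the g band

# Each star has its own distance modulus and dust amplitude.
distance_modulus = rng.normal(10.0, 1.0, size=n_stars)
extinction_amp = rng.uniform(0.0, 1.0, size=n_stars)

# Toy magnitudes: m(lambda) = DM + A * k(lambda), with no noise.
mags = distance_modulus[:, None] + extinction_amp[:, None] * k_true[None, :]
g_mag = mags[:, g_index]

# Regress every wavelength on a constant, the observed g magnitude,
# and the dust amplitude, all at once.
design = np.column_stack([np.ones(n_stars), g_mag, extinction_amp])
coeffs, *_ = np.linalg.lstsq(design, mags, rcond=None)
recovered = coeffs[2]  # coefficient of the dust amplitude at each wavelength

print("true law:            ", k_true)
print("recovered coefficient:", recovered)
print("true law minus g value:", k_true - k_true[g_index])
```

The recovered coefficient is the true law minus its value in the g band, and it is exactly zero at the g-band wavelength: the observed brightness already contains the extinction, so regressing it out removes the mean effect.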


marginalizing over images

MJ Vakili delivered to me a draft manuscript on his prior over galaxy images. In the introduction, it notes that the only other times things like this have been done, the purpose has been to reduce the dimensionality of the space in which galaxy images are modeled or represented. That is a baby step, of course, towards a prior on images, but only a baby step, because principal component coefficients don't, in themselves, have a probabilistic interpretation or result in a generative model.

On the board in my office, Vakili explained how he would use the prior over images to make the best possible measurement of weak gravitational-lensing shear; it involves marginalizing out the unsheared galaxy image, which requires the prior of which we speak. The cool thing is that this solves—in principle—one of the ideas Marshall and I hatched at KIPAC@10, which was to use the detailed morphological features in galaxies that go beyond just overall ellipticity to measure the shear field. Now that's in principle; will it work in practice? Vakili is going to look at the GREAT3 data.


the (low mass) IMF

Dennis Zaritsky (Arizona, on sabbatical at NYU) gave the astro seminar today, about measuring the IMF for the purposes of understanding the mass-to-light ratios of stellar populations. He is using bound stellar clusters, measuring kinematic masses and visible and infrared photometry. He finds that there seem to be two different IMFs for stellar clusters, one for the oldest clusters and another for those less than 10 Gyr in age. But this difference also more-or-less maps onto metallicity (the older clusters are more metal poor) and onto environment (the younger clusters are Magellanic-Cloud disk clusters, the older clusters are Milky-Way bulge and halo clusters). So it is hard to understand the causal relationships in play. Zaritsky is confident that near-future observations will settle the questions.

At lunch, Fadely proposed that we enter the Strong-Lens Time Delay Challenge. We imagined an entry that involves multi-band Gaussian Processes (like those worked out by Hernitschek, Mykytyn, Patel, Rix, and me this summer) added to multi-band Gaussian Processes. Time to do some math.