comparing data-driven and theory-driven models

I gave the brown-bag talk in the Center for Cosmology and Particle Physics at lunch-time today. I talked about The Cannon, the data-driven model of stellar spectra that Ness, Rix, and I have built. I also used the talk as an opportunity to talk about machine learning and data science in the CCPP. Various good ideas came up from the audience. One is that we ought to be able to synthesize, with our data-driven model, the theory-driven model spectra that the APOGEE team uses to do stellar parameter estimation. That is a great idea: It would help identify where our models and the theory diverge, and it might even point to improvements both for The Cannon and for the APOGEE pipelines.


Gerry Neugebauer

I learned late on Friday that Gerry Neugebauer (Caltech) has died. Gerry was one of the most important scientists in my research life, and in my personal life. He co-advised my PhD thesis (along with Roger Blandford and Judy Cohen); we spent many nights together at the Keck and Palomar Observatories, and many lunches together with Tom Soifer and Keith Matthews at the Athenaeum (the Caltech faculty club).

In my potted history (apologies in advance for errors), Gerry was one of the first people (with Bob Leighton) to point an infrared telescope at the sky; he found far more sources bright in the infrared than anyone seriously expected. This started infrared astronomy. In time, he became the PI of the NASA IRAS mission, which has been one of the highest-impact (and incredibly high in impact-per-dollar) astronomical missions in NASA history. The IRAS data are still the primary basis for many important results and tools in astronomy, including galaxy clustering, infrared background, ultra-luminous galaxies, young stars, and the dust maps.

To a new graduate student at Caltech, Gerry was intimidating: He was gruff, opinionated, and never wrong (as far as I could tell). But if you broke through that very thin veneer of scary, he was the most loving, caring, thoughtful advisor a student could want. He patiently taught me why I should love (not hate) magnitudes and relative measurements. He showed me how a telescope worked by having me observe at Palomar at his side. He showed me how to test our imaging-data uncertainties, both theoretically and observationally, to make sure we weren't making mistakes. (He taught me to call them "uncertainties" not "errors"!) He helped me develop observing strategies and data-analysis strategies that minimize the effects of detector "memory" and non-linearities. He enjoyed data analysis so much, on one of our projects he insisted that he do the data analysis, so long as I (the graduate student) would be willing to write the paper! Uncharacteristically for then or now, he could run his group so efficiently that many of his students designed, built, and operated an astronomical instrument, from soup to nuts, in a few years of PhD! He had strong opinions about how to run a scientific project, how to write up the results, and even about how to typeset numbers. I obey these positions strictly now in all my projects.

Reading this back, it doesn't capture what I really want to say, which is that Gerry spent a huge fraction of his immense intellectual capability on students, postdocs, and others new to science. He cared immensely about mentoring. From working with Gerry I realized that if you want to propagate great ideas into astronomy, you do it not just by writing papers and giving seminars: You do it by mentoring well new generations of scientists who will, in turn, pass it on in their own work and their own students. Many of the world's best infrared astronomers are directly or indirectly a product of Gerry's wonderful mentoring. I was immensely privileged to get some of that!

[I am also the author of Gerry's only erratum ever in the scientific literature. Gerry was a bit scary the day we figured out that error!]


interstellar bands; PSF dictionaries

Gail Zasowski (JHU) gave an absolutely great talk today, about diffuse interstellar bands in the APOGEE spectra and their possible use as tools for mapping the interstellar medium and measuring the kinematics of the Milky Way. Her talk also made it very clear what a huge advance APOGEE is over previous surveys: There are APOGEE stars in the mid-plane of the disk on the other side of the bulge! She showed lots of beautiful data and some results that just scratch the surface of what can be learned about the interstellar medium with stellar spectra.

In CampHogg group meeting in the morning, we realized we can reformulate Vakili's work on the point-spread function in SDSS and LSST so that he never has to interpolate the data (to, for example, centroid the stars properly). We can always shift the models, never the data. We also realized that we don't need to build a PCA or KL basis for the PSF representation; we can use a dictionary and learn the dictionary elements along with the PSF. This is an exciting realization; it almost ensures that we will beat the existing methods for accuracy and flexibility. Also interesting: The linear algebra we wrote down permits us to make use of "convolutional methods" and also permits us to represent the PSF at pixel resolutions higher than the data (super-resolution).
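As a toy illustration of the shift-the-model idea (this is not Vakili's code): the sketch below centroids a star by evaluating a model PSF at trial sub-pixel centers and comparing each trial to the untouched pixel data. The 1-D Gaussian PSF, its width, the grid search, and all function names are assumptions made for this sketch; a real version would use the learned PSF in two dimensions.

```python
import math

def gaussian_psf_model(n_pix, center, sigma=1.5):
    """Evaluate a hypothetical 1-D Gaussian PSF at integer pixel centers.

    The sub-pixel shift lives entirely in `center`; the data are never resampled.
    """
    values = [math.exp(-0.5 * ((x - center) / sigma) ** 2) for x in range(n_pix)]
    total = sum(values)
    return [v / total for v in values]

def chi_squared(data, model):
    """Residual sum of squares after fitting the best linear amplitude."""
    amplitude = sum(d * m for d, m in zip(data, model)) / sum(m * m for m in model)
    return sum((d - amplitude * m) ** 2 for d, m in zip(data, model))

def fit_center(data, trial_centers):
    """Pick the trial center whose shifted model best matches the raw pixels."""
    return min(
        trial_centers,
        key=lambda c: chi_squared(data, gaussian_psf_model(len(data), c)),
    )

# Fake noise-free "star" with a known sub-pixel center (assumed values).
true_center = 7.3
data = [100.0 * v for v in gaussian_psf_model(15, true_center)]

trial_centers = [5.0 + 0.05 * k for k in range(81)]  # 5.00 through 9.00
best_center = fit_center(data, trial_centers)
print(best_center)
```

The point of the design is that the pixel values are only ever compared, never interpolated; all the sub-pixel freedom sits in the model evaluation.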


overlapping stars, stellar training sets

On the phone, Schölkopf, Wang, Foreman-Mackey, and I tried to understand how it is that we can fit some insanely variable stars in the Kepler data using other stars, when the variability seems so specific to each star. In one case we investigated, it turned out that the crazy variability of one star was perfectly matched by the variability of another, brighter star. What gives? It turns out that the two stars overlap on the detector, so their footprints actually share pixels! The shared variability arises because they are being photometered through overlapping apertures. We also learned that some stars in Kepler have been assigned non-contiguous apertures.

Late in the day, Gail Zasowski (JHU) showed up. I explained in detail The Cannon, the label-transfer code for stellar parameter estimation that Ness, Rix, and I wrote. She had many questions about our training set, both because it is too large (it contains some obviously wrong entries) and too small (it doesn't nearly cover all kinds of stars at all metallicities).


deep learning and exoplanet transits

At group meeting, Foreman-Mackey and Wang showed recent results on calibration of K2 and Kepler data, respectively, and Malz showed some SDSS spectra of the night sky. After group meeting, Elizabeth Lamm (NYU) came to ask about possible Data Science capstone projects. We pitched a project on finding exoplanets with Gaia data and another on finding exoplanet transits with deep learning! The latter project was based on Foreman-Mackey's realization that everything that makes convolutional networks great for finding kittens in video also makes them great for finding transits in variable-star light-curves. Bring it on!


half full or half empty?

Interestingly (to me, anyway), as I have been raving in this space about how awesome it is that Ness and I can transfer stellar parameter labels from a small set of "standard stars" to a huge set of APOGEE stars using a data-driven model, Rix (who is one of the authors of the method) has been seeing our results as requiring some spin or adjustment in order to be impressive to the stellar parameter community. I see his point: What impresses me is that we get good structure in the label (stellar parameter) space and we do very well where the data overlap the training sample. What concerns Rix is that many of our labels are clearly wrong or distorted, especially where we don't have good coverage in the training sample. We discussed ways to modify our method or our display of the output to make both points in a responsible way.

Late in the day, Foreman-Mackey and I discussed NYU's high-performance computing hardware and environment with Stratos Efstathiadis (NYU), who said he would look into increasing our disk-usage limits. Operating on the entire Kepler data set inside the compute center turns out to be hard, not because the data set is large, but rather because it is composed of so many tiny files. This is a problem, apparently, for distributed storage systems. We discussed also the future of high-performance computing in the era of Data Science.


making black holes from gravitons!

I am paying for a week of hacking in Seattle with some days of not research back here in New York City. The one research highlight of the day was Gia Dvali (NYU) telling us at lunch about his work on black holes as information processing machines. Along the way, he described the thought experiment of constructing a black hole by concentrating enormous numbers of gravitons in a small volume. Apparently this thought experiment, as simple as it sounds, justifies the famous black-hole entropy result. I was surprised! Now I am wondering what it would take, physically, to make this experiment happen. Like could you do this with a real phased array of gravitational radiation sources?


AstroData Hack Week, day 5

Today was Ivezic (UW) day. He gave some great lectures on useful methods from his new textbook. He was very nice on the subject of Extreme Deconvolution, which is the method Bovy, Roweis, and I developed a few years ago. He showed that it rocks on the SDSS stars. Late in the day, I met with Ivezic, Juric, and Vanderplas to discuss our ideas for Gaia "Zero-Day Exploits". We have to be ready! One great idea is to use the existing data we have (for example, PanSTARRS and SDSS), and the existing dynamical models we have of the Milky Way, to make specific predictions for every star in the Gaia catalog. Some of these predictions might be quite specific. I vowed to write one of my short data-analysis documents on this.

In the hack session, Foreman-Mackey generalized our K2 PSF model to give it more freedom, and then we wrote down our hyper-parameters, a hacky objective, and started experimental coding to find the best place to be working. By the wrap-up, he had found some pretty good settings.

The wrap-up session on Friday was deeply impressive and humbling! For more information, check out the HackPad (tm) index.

[ps. Next year this meeting will be called AstroHackWeek and it may happen in NYC. Watch this space.]


AstroData Hack Week, day 4

Today began with Bloom (UCB) talking about supervised methods. He put in a big plug for Random Forest, of course! Two things I liked about his talk: One, he emphasized the difficulties and time spent munging the data, identifying features, imputing, and so on. Two, he put in some philosophy at the end, to the effect that you don't have to understand every detail of these methods; much better to partner with someone who does. Then the intellectual load on data scientists is not out of control. For Bloom, data science is inherently collaborative. I don't perfectly agree, but I would agree that it works very, very well when it is collaborative. Back on the data munging point: Essentially all of the basic supervised methods presume that your "features" (inputs) are noise-free and non-missing. That's bad for us in general.

Based on Bloom's talk I came up with many hacks related to Random Forest. A few examples are the following: Visualize the content and internals of a Random Forest, and its use as a hacky generative model of the data. Figure out a clever experimental sequence to determine conditional feature importances, which is a combinatorial problem in general. Write a basic open-source RF code and get some community to fastify it. Figure out if there is any way to extend RF to handle noisy input features.

I didn't do any of these hacks! We insisted on pair-coding today among the hackers, and this was a good idea; I partnered with Foreman-Mackey and he showed me what he has done for K2 photometry. It is giving some strange results—it is using the enormous flat-field freedom we are giving it to fix all its ills—we thought about ways to test what is going wrong and how to give other parts of the model relevant freedom.


AstroData Hack Week, day 3

I gave the Bayesian inference lectures in the morning, and spent time chatting in the afternoon. In my lectures, I argued strongly for passing forward probabilistic answers, not just regularized estimators. In particular, many methods that are called "Bayesian" are just regularized optimizations! The key ideas of Bayes are that you can marginalize out nuisances and properly propagate uncertainties. Those are important ideas and both get lost if you are just optimizing a posterior pdf.
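Here is a toy version of that point, as a sketch and not anything from the lectures themselves: infer the mean of a few Gaussian measurements when the noise level is an unknown nuisance. The data values, the flat priors in the mean and in log-sigma, and the brute-force grids are all assumptions made for illustration. Marginalizing over sigma gives a visibly broader (more honest) pdf for the mean than fixing sigma at its optimum does.

```python
import math

data = [4.1, 5.3, 4.8, 6.0, 4.6]  # made-up measurements

# Brute-force grids over the parameter of interest (mu) and the nuisance (ln sigma).
mu_grid = [0.02 * i for i in range(501)]  # 0 to 10
ln_lo, ln_hi = math.log(0.1), math.log(10.0)
ln_sigma_grid = [ln_lo + i * (ln_hi - ln_lo) / 199 for i in range(200)]

def ln_posterior(mu, ln_sigma):
    """Gaussian log-likelihood with flat priors in mu and ln sigma (assumed)."""
    sigma = math.exp(ln_sigma)
    return sum(-0.5 * ((x - mu) / sigma) ** 2 - ln_sigma for x in data)

table = [[math.exp(ln_posterior(mu, ls)) for ls in ln_sigma_grid] for mu in mu_grid]

def mean_and_std(weights):
    """Mean and standard deviation of mu under unnormalized grid weights."""
    total = sum(weights)
    mean = sum(w * m for w, m in zip(weights, mu_grid)) / total
    var = sum(w * (m - mean) ** 2 for w, m in zip(weights, mu_grid)) / total
    return mean, math.sqrt(var)

# Bayes: marginalize out the nuisance by summing over ln sigma.
marginal = [sum(row) for row in table]
marg_mean, marg_std = mean_and_std(marginal)

# Regularized optimization: find the best sigma, then condition on it.
_, best_j = max(
    ((i, j) for i in range(len(mu_grid)) for j in range(len(ln_sigma_grid))),
    key=lambda ij: table[ij[0]][ij[1]],
)
conditional = [row[best_j] for row in table]
cond_mean, cond_std = mean_and_std(conditional)

print(marg_std, cond_std)
```

Both approaches agree on the central value; they disagree on the uncertainty, which is exactly the thing that gets lost when you only optimize.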


AstroData Hack Week, day 2

The day started with Huppenkothen (Amsterdam) and me meeting at a café to discuss what we were going to talk about in the tutorial part of the day. We quickly got derailed to talking about replacing periodograms and auto-correlation functions with Gaussian Processes for finding and measuring quasi-periodic signals in stars and X-ray binaries. We described the simplest possible project and vowed to give it a shot when she arrives at NYU in two months. Immediately following this conversation, we each talked for more than an hour about classical statistics. I focused on the value of standard, frequentist methods for getting fast answers that are reliable, easy to interpret, and well understood. I emphasized the value of having a likelihood function!
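For concreteness, one common way to encode "quasi-periodic" in a Gaussian Process is to multiply a periodic kernel by a slowly decaying envelope. We did not settle on a kernel in that conversation, so the functional form, the parameter names, and the default values below are all assumptions for this sketch.

```python
import math

def quasi_periodic_kernel(dt, amplitude=1.0, period=10.0, gamma=2.0, decay_time=50.0):
    """Covariance between two times separated by dt.

    Product of an exp-sine-squared (periodic) kernel and a
    squared-exponential envelope that lets the signal lose coherence.
    """
    periodic = math.exp(-gamma * math.sin(math.pi * dt / period) ** 2)
    envelope = math.exp(-0.5 * (dt / decay_time) ** 2)
    return amplitude ** 2 * periodic * envelope

k_zero = quasi_periodic_kernel(0.0)
k_half_period = quasi_periodic_kernel(5.0)
k_one_period = quasi_periodic_kernel(10.0)
print(k_zero, k_half_period, k_one_period)
```

Points one full period apart are strongly (but not perfectly) correlated, and points half a period apart are much less so; the period and decay time then become parameters to infer, in place of reading a peak off a periodogram.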

In the hack session, I spoke with Eilers (MPIA) and Hennawi (MPIA) about measuring absorption by the intergalactic medium in quasars subject to noisy (and correlated) continuum estimation. Foreman-Mackey explained to me that our failures on K2 the previous night were caused by the inflexibility of the (dumb) PSF model hitting the flexibility of the (totally unconstrained) flat-field. I discussed Gibbs sampling for a simple hierarchical inference with Sick (Queens). And I went through agonizing rounds of good-ideas-turned-bad on classifying pixels in Earth imaging data with Kapadia (Mapbox). On the latter, what is the simplest way to do clustering in the space of pixel histograms?

The research day ended with a discussion of Spectro-Perfectionism (Bolton and Schlegel) with Byler (UW). I told her about the long conversations among Roweis, Bolton, and me many years ago (late 2009) about this. We decided to do a close reading of it (the paper) tomorrow.


AstroData Hack Week, day 1

On my way to Seattle, I wrote up a two-page document about inferring the velocity distribution when you only get (perhaps noisy, perhaps censored) measurements of v sin i. When I arrived at the AstroData Hack Week, I learned that Foreman-Mackey and Price-Whelan had both come to the same conclusion that this would be a valuable and achievable hack for the week. Price-Whelan and I spent hacking time specifying the project better.
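The forward model that any such inference has to invert can be sketched in a few lines. This assumes isotropically oriented rotation axes (so cos i is uniform between 0 and 1) and Gaussian measurement noise; neither assumption, nor any of the names below, comes from the two-page document, and censoring is left out entirely.

```python
import math
import random

def draw_projected_velocities(n_stars, draw_true_velocity, noise_sigma, rng):
    """Simulate noisy v sin i measurements for randomly oriented stars."""
    observed = []
    for _ in range(n_stars):
        v = draw_true_velocity(rng)
        cos_i = rng.uniform(0.0, 1.0)  # isotropic orientations (assumed)
        sin_i = math.sqrt(1.0 - cos_i ** 2)
        observed.append(v * sin_i + rng.gauss(0.0, noise_sigma))
    return observed

rng = random.Random(42)

# Sanity check: with every star at v = 10 and no noise, the mean of
# v sin i divided by v should approach the mean of sin i, which is pi / 4.
samples = draw_projected_velocities(100000, lambda r: 10.0, noise_sigma=0.0, rng=rng)
mean_ratio = sum(samples) / len(samples) / 10.0
print(mean_ratio, math.pi / 4.0)
```

The inference problem is then to find the distribution of true velocities that, pushed through this projection and noise, reproduces the observed v sin i values.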

That said, Foreman-Mackey got excited about doing a good job on K2 point-source photometry. We talked out the components of such a model and tried to find the simplest possible version of the project, which Foreman-Mackey wants to approach by building a full, parameterized, physical model of the point-spread function, the spacecraft Euler angles, and the flat-field. Late in the day (at the bar) we found out that our first shot at this model is going badly off the rails: The flat-field and the point-spread function are degenerate (somewhat or totally?) in the naive model we have right now. Simple fixes didn't work.


GRB beaming, classifying stars

Andy Fruchter (STScI) gave the astrophysics seminar, on gamma-ray bursts and their host galaxies. He showed results from Modjaz (and others) on the metallicities of "broad-line type Ic" supernovae, which show that the ones associated with gamma-ray bursts are in much lower-metallicity environments than those not associated. I always react to this result by pointing out that this ought to put a very strong constraint on GRB beaming, because (if there is beaming) there ought to be "off-axis" bursts that we don't see as GRBs, but that we do see as BL-Ic supernovae. Both Fruchter and Modjaz claimed that the numbers make the constraint uninteresting, but I am surprised: The result is incredibly strong.

In group meeting, Fadely showed evidence that he can make a generative model of the colors and morphologies (think: angular sizes, or compactnesses) of faint, compact sources in the SDSS imaging data. That is, he can build a flexible model (using the "extreme deconvolution" method) that permits him to predict the compactness of a source given a noisy measurement of its five-band spectral energy distribution. This shows great promise to evolve into a non-parametric, model-free (that is: free of stellar or galaxy models) method for separating stars from galaxies in multi-band imaging. The cool thing is he might be able to create a data-driven star–galaxy classification system without training on any actual star or galaxy labels.


single-example learning

I pitched projects to new graduate students in the Physics and Data Science programs today; hopefully some will stick. Late in the day, I took out new Data Science Fellow Brenden Lake (NYU) for a beer, along with Brian McFee (NYU) and Foreman-Mackey. We discussed many things, but we were blown away by Lake's experiments on single-instance learning: Can a machine learn to identify or generate a class of objects from seeing only a single example? Humans are great at this but machines are not. He showed us comparisons between his best machines and experimental subjects found on the Mechanical Turk. His machines don't do badly!


crazy diversity of stars; cosmological anomalies

At CampHogg group meeting (in the new NYU Center for Data Science space!), Sanderson (Columbia) talked about her work on finding structure through unsupervised clustering methods, and Price-Whelan talked about chaotic orbits and the effect of chaos on the streams in the Milky Way. Dun Wang blew us all away by showing us the amazing diversity of Kepler light-curves that go into his effective model of stellar and telescope variability. Even in a completely random set of a hundred light-curves you get eclipsing binaries, exoplanet transits, multiple-mode coherent pulsations, incoherent pulsations, and lots of other crazy variability. We marveled at the range of things used as "features" in his model.

At lunch (with surprise, secret visitor and Nobel Laureate Brian Schmidt), I had a long conversation with Matt Kleban (NYU), following my conversation from yesterday with D'Amico. We veered onto the question of anomalies: Just as there are anomalies in the CMB, there are probably also anomalies in the large-scale structure, but no-one really knows how to look for them. We should figure out and look! Also, each anomaly known in the CMB should make a prediction for an anomaly visible (or maybe not) in the large-scale structure. That would make for a valuable research program.


searching in the space of observables

In the early morning, Ness and I talked by phone about The Cannon (or maybe The Jump; guess the source of the name!), our method for providing stellar parameter labels to stars without using stellar models. We talked about possibly putting priors in the label space; this might regularize the results towards plausible values when the data are ambiguous. That's for paper 2, not paper 1. She has drafted an email to the APOGEE-2 collaboration about our current status, and we talked about next steps.

In the late morning, I spoke with Guido D'Amico (NYU) about future projects in cosmology that I am interested in thinking about. One class of projects involves searching for new kinds of observables (think: large-scale structure mixed-order statistics and the like) that are tuned to have maximum sensitivity to the cosmological parameters of interest. I feel like there is some kind of data-science-y approach to this, given the incredible simulations currently on the market.


CDS space

Does it count as research when I work on the NYU Center for Data Science space planning? Probably not, but I spent a good fraction of the day analyzing plans and then discussing with the architects working to create schematics for the new space. We want a great diversity of offices, shared spaces (meeting rooms, offices, and carrels), and open space (studio space and lounge and cafe space). We want our space plans to be robust to our uncertainties about how people will want to use the space.



Hidden away in a cabin off the grid, I made writing progress on my project with Ness to make data-driven spectral models, and on my project to use hot planets as clocks for timing experiments. On the latter, I figured out some useful things about expected signal-to-noise and why there is a huge difference between checking local clock rates and looking at global, time-variable timing residuals away from a periodic model.


single transits, group meeting, robot DJs

The day started with a discussion with So Hattori about finding single transits in the Kepler data. We did some research and it seems like there may be no clear sample in the published literature, let alone any planet inference based on them. So we are headed in that direction. In group meeting, Foreman-Mackey told us about his approach to exoplanet search, Goodman told us about his approach to sampling that uses (rather than discards) the rejected likelihood calls (in the limit that they are expensive), and Vakili told us about probabilistic PSF modeling. On the latter, we had requests that he do something more like train and test.

A fraction of the group had lunch with Brian McFee (NYU), the new Data Science Fellow. McFee works on music, from a data analysis perspective. His past research was on song selection and augmenting or replacing collaborative filtering. His present research is on beat matching. So with the two put together he might have a full robot DJ. I have work for that robot!


calibration and search

A phone call between Wang, Foreman-Mackey, Schölkopf, and me started the day, and a conversation between Foreman-Mackey and me ended the day, both on the subject of modeling the structured noise in the Kepler data and the impact of such methods on exoplanet search. In the morning conversation, we discussed the relative value of using other stars as predictors for the target star (because co-variability of stars encodes telescope variability) compared to using the target star itself, but shifted in time (because the past and future behavior of the star can predict the present state of the star). Wang has a good system for exploring this at the pixel level. We gave him some final tasks before we reduce our scope and write a paper about it.
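The "other stars as predictors" half of that comparison can be sketched as a linear least-squares fit. This is not Wang's pixel-level system: the fake light curves, the shared systematic trend, the noise level, and the function names are all made up for illustration, and the fit is evaluated in-sample, with no train-and-test split.

```python
import math
import random

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def fit_from_predictors(target, predictors):
    """Predict the target as a constant plus a linear combination of other stars."""
    columns = [[1.0] * len(target)] + predictors
    ata = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]
    atb = [sum(u * t for u, t in zip(ci, target)) for ci in columns]
    weights = solve_linear(ata, atb)
    return [sum(w * col[k] for w, col in zip(weights, columns)) for k in range(len(target))]

def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))

# Fake data: every star sees the same telescope systematic, scaled differently.
rng = random.Random(0)
systematic = [math.sin(0.07 * t) + 0.3 * math.cos(0.31 * t) for t in range(200)]
predictors = [[g * s + rng.gauss(0.0, 0.01) for s in systematic] for g in (1.0, 0.6, 1.7)]
target = [2.0 * s + rng.gauss(0.0, 0.01) for s in systematic]

prediction = fit_from_predictors(target, predictors)
mean_target = sum(target) / len(target)
rms_before = rms([t - mean_target for t in target])
rms_after = rms([t - p for t, p in zip(target, prediction)])
print(rms_before, rms_after)
```

Swapping the predictor columns for time-shifted copies of the target itself gives the autoregressive alternative, so the two approaches can be compared within the same fitting code.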

In the afternoon conversation, we looked at Foreman-Mackey's heuristic "information criteria" that he is using for exoplanet search in the Kepler data. By the end of the day, his search scalar included such interesting components as the following: Each proposed exoplanet period and phase is compared not just to a null model, but to a model in which there is a periodic signal but with time-varying amplitude (which would be a false positive). Each peak in the search scalar is vetoed if it shares transits with other, higher peaks (to rule out period aliasing and transit–artifact pairings). A function of period is subtracted out, to model the baseline created by noise (which is a non-trivial function of period). Everything looks promising for exoplanet candidate generation.


psf interpolation

Vakili and I spoke about his probabilistic interpolation of the point-spread function in imaging using Gaussian Processes and a basis derived from principal components. The LSST atmospheric PSF looks very stable, according to Vakili's image simulations (which use the LSST code), so I asked for shorter exposures to see more variability. We talked about the point that there might be PSF variations that are fast in time but slow in focal-plane position (from the atmosphere) and others that might be slow in time but fast in focal-plane position (from the optics) and maybe hybrid situations (if, say, the optics have some low-dimensional but fast variability). All these things could be captured by a sophisticated model.


exoplanet search

Back when we were in Tübingen, Foreman-Mackey promised Schölkopf that he would find some new exoplanets by September 1. That's today! He failed, although he has some very impressive results recovering some known systems that are supposed to be hard to find. That is, it looks like existing systems are found at higher significance for us than for other groups, so we ought to be able to find some lower-significance systems. The key technology (we think) is our insanely flexible noise model for stellar (and Kepler) variability, that uses a Gaussian process not just of time, but of hundreds of other-star lightcurves. Foreman-Mackey and I talked extensively about this today, but we are still many days away from discoveries. We are bound to announce our discoveries on twitter (tm). Bound in the sense of "obligated". Let's see if we have the fortitude to do that!