AstroData Hack Week, day 5

Today was Ivezic (UW) day. He gave some great lectures on useful methods from his new textbook. He was very nice on the subject of Extreme Deconvolution, which is the method Bovy, Roweis, and I developed a few years ago. He showed that it rocks on the SDSS stars. Late in the day, I met with Ivezic, Juric, and Vanderplas to discuss our ideas for Gaia "Zero-Day Exploits". We have to be ready! One great idea is to use the existing data we have (for example, PanSTARRS and SDSS), and the existing dynamical models we have of the Milky Way, to make specific predictions for every star in the Gaia catalog. Some of these predictions might be quite specific. I vowed to write one of my short data-analysis documents on this.

In the hack session, Foreman-Mackey generalized our K2 PSF model to give it more freedom, and then we wrote down our hyper-parameters and a hacky objective, and started experimental coding to find the best place to be working. By the wrap-up, he had found some pretty good settings.

The wrap-up session on Friday was deeply impressive and humbling! For more information, check out the HackPad (tm) index.

[ps. Next year this meeting will be called AstroHackWeek and it may happen in NYC. Watch this space.]


AstroData Hack Week, day 4

Today began with Bloom (UCB) talking about supervised methods. He put in a big plug for Random Forest, of course! Two things I liked about his talk: One, he emphasized the difficulties and time spent munging the data, identifying features, imputing, and so on. Two, he put in some philosophy at the end, to the effect that you don't have to understand every detail of these methods; much better to partner with someone who does. Then the intellectual load on data scientists is not out of control. For Bloom, data science is inherently collaborative. I don't perfectly agree, but I would agree that it works very, very well when it is collaborative. Back on the data munging point: Essentially all of the basic supervised methods presume that your "features" (inputs) are noise-free and non-missing. That's bad for us in general.

Based on Bloom's talk I came up with many hacks related to Random Forest. A few examples are the following: Visualize the content and internals of a Random Forest, and its use as a hacky generative model of the data. Figure out a clever experimental sequence to determine conditional feature importances, which is a combinatorial problem in general. Write a basic open-source RF code and get some community to fastify it. Figure out if there is any way to extend RF to handle noisy input features.

I didn't do any of these hacks! We insisted on pair-coding today among the hackers, and this was a good idea; I partnered with Foreman-Mackey and he showed me what he has done for K2 photometry. It is giving some strange results: It is using the enormous flat-field freedom we are giving it to fix all its ills. We thought about ways to test what is going wrong and how to give other parts of the model relevant freedom.


AstroData Hack Week, day 3

I gave the Bayesian inference lectures in the morning, and spent time chatting in the afternoon. In my lectures, I argued strongly for passing forward probabilistic answers, not just regularized estimators. In particular, many methods that are called "Bayesian" are just regularized optimizations! The key ideas of Bayes are that you can marginalize out nuisances and properly propagate uncertainties. Those are important ideas and both get lost if you are just optimizing a posterior pdf.
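
As a toy illustration of the difference between optimizing and marginalizing (a sketch with made-up numbers, not anything taken from the lectures), consider estimating a mean `mu` when the noise level `sigma` is an unknown nuisance. The data values, the grid ranges, and the flat priors on `mu` and `sigma` are all assumptions chosen for simplicity.

```python
import math

def log_posterior(data, mu, sigma):
    """Log posterior, up to a constant, for Gaussian data.

    Assumes flat priors on mu and sigma over the grid (a simplifying choice).
    """
    chi2 = sum((y - mu) ** 2 for y in data) / sigma ** 2
    return -0.5 * chi2 - len(data) * math.log(sigma)

def mean_and_std(xs, weights):
    """Weighted mean and standard deviation of the grid points xs."""
    total = sum(weights)
    mean = sum(w * x for x, w in zip(xs, weights)) / total
    var = sum(w * (x - mean) ** 2 for x, w in zip(xs, weights)) / total
    return mean, math.sqrt(var)

data = [4.1, 5.3, 4.8, 6.0, 4.6]                # made-up measurements
mus = [2.0 + 0.02 * i for i in range(301)]      # parameter of interest
sigmas = [0.1 + 0.02 * j for j in range(246)]   # nuisance: unknown noise level

logp = [[log_posterior(data, mu, s) for s in sigmas] for mu in mus]
peak = max(max(row) for row in logp)
post = [[math.exp(v - peak) for v in row] for row in logp]

# "Optimize": fix the nuisance at its best-fit value and use only that slice.
j_best = max(range(len(sigmas)), key=lambda j: max(row[j] for row in post))
mu_opt, std_opt = mean_and_std(mus, [row[j_best] for row in post])

# "Marginalize": sum the posterior over the nuisance at every value of mu.
mu_marg, std_marg = mean_and_std(mus, [sum(row) for row in post])

print(f"sigma fixed at optimum: mu = {mu_opt:.2f} +/- {std_opt:.2f}")
print(f"sigma marginalized:     mu = {mu_marg:.2f} +/- {std_marg:.2f}")
```

The two central values nearly agree, but the marginalized uncertainty on `mu` is larger, because it accounts for not knowing `sigma`. That extra width is exactly what gets lost when you only optimize.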


AstroData Hack Week, day 2

The day started with Huppenkothen (Amsterdam) and me meeting at a café to discuss what we were going to talk about in the tutorial part of the day. We quickly got derailed to talking about replacing periodograms and auto-correlation functions with Gaussian Processes for finding and measuring quasi-periodic signals in stars and x-ray binaries. We described the simplest possible project and vowed to give it a shot when she arrives at NYU in two months. Immediately following this conversation, we each talked for more than an hour about classical statistics. I focused on the value of standard, frequentist methods for getting fast answers that are reliable, easy to interpret, and well understood. I emphasized the value of having a likelihood function!
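
For the Gaussian Process idea, one possible starting point (a sketch, not the project we specified) is a quasi-periodic kernel: a periodic covariance multiplied by a slowly decaying envelope. The kernel form, the parameter names (`amp`, `period`, `gamma`, `decay`), the fake data, and the trial-period grid are all assumptions made for illustration.

```python
import math
import random

def quasi_periodic_kernel(dt, amp, period, gamma, decay):
    """Periodic covariance whose coherence decays on the time scale `decay`."""
    periodic = math.exp(-gamma * math.sin(math.pi * dt / period) ** 2)
    envelope = math.exp(-0.5 * (dt / decay) ** 2)
    return amp ** 2 * periodic * envelope

def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            lower[i][j] = math.sqrt(s) if i == j else s / lower[j][j]
    return lower

def gp_log_likelihood(times, values, noise_sigma, **kernel_params):
    """Log-likelihood of zero-mean data under the GP plus white noise."""
    n = len(times)
    cov = [[quasi_periodic_kernel(times[i] - times[j], **kernel_params)
            + (noise_sigma ** 2 if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    lower = cholesky(cov)
    # Forward substitution: solve lower @ z = values, so values^T C^-1 values = z . z
    z = []
    for i in range(n):
        z.append((values[i] - sum(lower[i][k] * z[k] for k in range(i))) / lower[i][i])
    log_det = 2.0 * sum(math.log(lower[i][i]) for i in range(n))
    return -0.5 * sum(v * v for v in z) - 0.5 * log_det - 0.5 * n * math.log(2.0 * math.pi)

# Fake, irregularly sampled data with a period of 3 (arbitrary time units).
rng = random.Random(42)
times = sorted(rng.uniform(0.0, 30.0) for _ in range(40))
values = [math.sin(2.0 * math.pi * t / 3.0) + rng.gauss(0.0, 0.2) for t in times]

# Evaluate the likelihood at a few trial periods and keep the best one.
trial_periods = [2.0, 2.5, 3.0, 3.5, 4.0]
scores = {p: gp_log_likelihood(times, values, noise_sigma=0.2,
                               amp=1.0, period=p, gamma=2.0, decay=20.0)
          for p in trial_periods}
best_period = max(scores, key=scores.get)
print("best trial period:", best_period)
```

Unlike a periodogram peak, the output here is a likelihood function of the period (and of the coherence time and amplitude), so it can be optimized or sampled like any other likelihood.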

In the hack session, I spoke with Eilers (MPIA) and Hennawi (MPIA) about measuring absorption by the intergalactic medium in quasars subject to noisy (and correlated) continuum estimation. Foreman-Mackey explained to me that our failures on K2 the previous night were caused by the inflexibility of the (dumb) PSF model hitting the flexibility of the (totally unconstrained) flat-field. I discussed Gibbs sampling for a simple hierarchical inference with Sick (Queens). And I went through agonizing rounds of good-ideas-turned-bad on classifying pixels in Earth imaging data with Kapadia (Mapbox). On the latter, what is the simplest way to do clustering in the space of pixel histograms?

The research day ended with a discussion of Spectro-Perfectionism (Bolton and Schlegel) with Byler (UW). I told her about the long conversations among Roweis, Bolton, and me many years ago (late 2009) about this. We decided to do a close reading of it (the paper) tomorrow.


AstroData Hack Week, day 1

On my way to Seattle, I wrote up a two-page document about inferring the velocity distribution when you only get (perhaps noisy, perhaps censored) measurements of v sin i. When I arrived at the AstroData Hack Week, I learned that Foreman-Mackey and Price-Whelan had both come to the same conclusion that this would be a valuable and achievable hack for the week. Price-Whelan and I spent hacking time specifying the project better.
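
The forward model behind this kind of inference is short enough to sketch. What follows is an illustration under stated assumptions, not the content of the two-page document: spin axes are taken to be isotropically oriented (so cos i is uniform), the noise is Gaussian, censoring is left out, and the population parameters are made up.

```python
import math
import random

def draw_sin_i(rng):
    """Sine of the inclination for an isotropic spin axis (cos i uniform in [0, 1))."""
    cos_i = rng.random()
    return math.sqrt(1.0 - cos_i ** 2)

def projected_cdf(s, v):
    """P(v sin i < s | v) for isotropic orientations and no noise."""
    if s >= v:
        return 1.0
    return 1.0 - math.sqrt(1.0 - (s / v) ** 2)

def simulate_vsini(true_velocities, noise_sigma, rng):
    """Project each true velocity and add Gaussian measurement noise."""
    return [v * draw_sin_i(rng) + rng.gauss(0.0, noise_sigma) for v in true_velocities]

rng = random.Random(0)
true_v = [rng.gauss(50.0, 5.0) for _ in range(20000)]   # hypothetical population, km/s
observed = simulate_vsini(true_v, noise_sigma=2.0, rng=rng)

mean_ratio = sum(observed) / sum(true_v)
print(f"mean(v sin i) / mean(v) = {mean_ratio:.3f}; pi / 4 = {math.pi / 4:.3f}")
```

The simulated ratio should land close to pi / 4, the mean of sin i under isotropy. The inference problem is the reverse direction: given noisy values like `observed`, infer the distribution that generated `true_v`.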

That said, Foreman-Mackey got excited about doing a good job on K2 point-source photometry. We talked out the components of such a model and tried to find the simplest possible version of the project, which Foreman-Mackey wants to approach by building a full, parameterized, physical model of the point-spread function, the spacecraft Euler angles, and the flat-field. Late in the day (at the bar) we found out that our first shot at this model is going badly off the rails: The flat-field and the point-spread function are degenerate (somewhat or totally?) in the naive model we have right now. Simple fixes didn't work.
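
The degeneracy is easiest to see in the most naive setting imaginable, which is an assumption here and much simpler than the real model: one star that never moves on the detector, with a free sensitivity and a free PSF weight in every pixel. All numbers below are made up.

```python
def predict_pixels(flat, psf, flux):
    """Predicted counts per pixel: star flux times PSF weight times pixel sensitivity."""
    return [flux * f * p for f, p in zip(flat, psf)]

flat_a = [1.00, 1.00, 1.00, 1.00, 1.00]    # perfectly uniform detector
psf_a = [0.05, 0.25, 0.40, 0.25, 0.05]     # symmetric PSF weights

distortion = [1.2, 0.9, 1.0, 1.1, 0.8]     # arbitrary per-pixel factors
flat_b = [f * d for f, d in zip(flat_a, distortion)]
psf_b = [p / d for p, d in zip(psf_a, distortion)]

pixels_a = predict_pixels(flat_a, psf_a, flux=1000.0)
pixels_b = predict_pixels(flat_b, psf_b, flux=1000.0)
max_diff = max(abs(a - b) for a, b in zip(pixels_a, pixels_b))
print("largest difference between the two predictions:", max_diff)
```

The two parameter sets predict the same pixel values to within floating-point rounding, so the data cannot tell them apart. Breaking the degeneracy needs either motion of the star across pixels or restrictions on the PSF, the flat-field, or both.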


GRB beaming, classifying stars

Andy Fruchter (STScI) gave the astrophysics seminar, on gamma-ray bursts and their host galaxies. He showed Modjaz's (and others') results on the metallicities of "broad-line type Ic" supernovae, which show that the ones associated with gamma-ray bursts are in much lower-metallicity environments than those not associated. I always react to this result by pointing out that this ought to put a very strong constraint on GRB beaming, because (if there is beaming) there ought to be "off-axis" bursts that we don't see as GRBs, but that we do see as BLIc supernovae. Both Fruchter and Modjaz claimed that the numbers make the constraint uninteresting, but I am surprised: The result is incredibly strong.

In group meeting, Fadely showed evidence that he can make a generative model of the colors and morphologies (think: angular sizes, or compactnesses) of faint, compact sources in the SDSS imaging data. That is, he can build a flexible model (using the "extreme deconvolution" method) that permits him to predict the compactness of a source given a noisy measurement of its five-band spectral energy distribution. This shows great promise to evolve into a non-parametric, model-free (that is: free of stellar or galaxy models) method for separating stars from galaxies in multi-band imaging. The cool thing is he might be able to create a data-driven star–galaxy classification system without training on any actual star or galaxy labels.


single-example learning

I pitched projects to new graduate students in the Physics and Data Science programs today; hopefully some will stick. Late in the day, I took out new Data Science Fellow Brenden Lake (NYU) for a beer, along with Brian McFee (NYU) and Foreman-Mackey. We discussed many things, but we were blown away by Lake's experiments on single-instance learning: Can a machine learn to identify or generate a class of objects from seeing only a single example? Humans are great at this but machines are not. He showed us comparisons between his best machines and experimental subjects found on the Mechanical Turk. His machines don't do badly!


crazy diversity of stars; cosmological anomalies

At CampHogg group meeting (in the new NYU Center for Data Science space!), Sanderson (Columbia) talked about her work on finding structure through unsupervised clustering methods, and Price-Whelan talked about chaotic orbits and the effect of chaos on the streams in the Milky Way. Dun Wang blew us all away by showing us the amazing diversity of Kepler light-curves that go into his effective model of stellar and telescope variability. Even in a completely random set of a hundred light-curves you get eclipsing binaries, exoplanet transits, multiple-mode coherent pulsations, incoherent pulsations, and lots of other crazy variability. We marveled at the range of things used as "features" in his model.

At lunch (with surprise, secret visitor and Nobel Laureate Brian Schmidt), I had a long conversation with Matt Kleban (NYU), following my conversation from yesterday with D'Amico. We veered onto the question of anomalies: Just as there are anomalies in the CMB, there are probably also anomalies in the large-scale structure, but no-one really knows how to look for them. We should figure out how, and look! Also, each anomaly known in the CMB should make a prediction for an anomaly visible (or maybe not) in the large-scale structure. That would make for a valuable research program.


searching in the space of observables

In the early morning, Ness and I talked by phone about The Cannon (or maybe The Jump; guess the source of the name!), our method for providing stellar parameter labels to stars without using stellar models. We talked about possibly putting priors in the label space; this might regularize the results towards plausible values when the data are ambiguous. That's for paper 2, not paper 1. She has drafted an email to the APOGEE-2 collaboration about our current status, and we talked about next steps.

In the late morning, I spoke with Guido D'Amico (NYU) about future projects in cosmology that I am interested in thinking about. One class of projects involves searching for new kinds of observables (think: large-scale structure mixed-order statistics and the like) that are tuned to have maximum sensitivity to the cosmological parameters of interest. I feel like there is some kind of data-science-y approach to this, given the incredible simulations currently on the market.


CDS space

Does it count as research when I work on the NYU Center for Data Science space planning? Probably not, but I spent a good fraction of the day analyzing plans and then discussing with the architects working to create schematics for the new space. We want a great diversity of offices, shared spaces (meeting rooms, offices, and carrels), and open space (studio space and lounge and cafe space). We want our space plans to be robust to our uncertainties about how people will want to use the space.



Hidden away in a cabin off the grid, I made writing progress on my project with Ness to make data-driven spectral models, and on my project to use hot planets as clocks for timing experiments. On the latter, I figured out some useful things about expected signal-to-noise and why there is a huge difference between checking local clock rates and looking at global, time-variable timing residuals away from a periodic model.


single transits, group meeting, robot DJs

The day started with a discussion with So Hattori about finding single transits in the Kepler data. We did some research and it seems like there may be no clear sample in the published literature, let alone any planet inference based on them. So we are headed in that direction. In group meeting, Foreman-Mackey told us about his approach to exoplanet search, Goodman told us about his approach to sampling that uses (rather than discards) the rejected likelihood calls (in the limit that they are expensive), and Vakili told us about probabilistic PSF modeling. On the latter, we had requests that he do something more like train and test.

A fraction of the group had lunch with Brian McFee (NYU), the new Data Science Fellow. McFee works on music, from a data analysis perspective. His past research was on song selection and augmenting or replacing collaborative filtering. His present research is on beat matching. So with the two put together he might have a full robot DJ. I have work for that robot!


calibration and search

A phone call between Wang, Foreman-Mackey, Schölkopf, and me started the day, and a conversation between Foreman-Mackey and me ended the day, both on the subject of modeling the structured noise in the Kepler data and the impact of such methods on exoplanet search. In the morning conversation, we discussed the relative value of using other stars as predictors for the target star (because co-variability of stars encodes telescope variability) compared to using the target star itself, but shifted in time (because the past and future behavior of the star can predict the present state of the star). Wang has a good system for exploring this at the pixel level. We gave him some final tasks before we reduce our scope and write a paper about it.
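
The "other stars as predictors" option can be sketched as an ordinary linear regression. This is a toy at the light-curve level, not Wang's pixel-level system: the shared `trend`, the star amplitudes, and the noise level are invented, the time-shifted (autoregressive) alternative is not shown, and the fit is evaluated on the same data it was trained on.

```python
import math
import random

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_linear(predictors, target):
    """Least-squares weights for target ~ sum_k w_k * predictors[k] (normal equations)."""
    k = len(predictors)
    ata = [[sum(p * q for p, q in zip(predictors[i], predictors[j]))
            for j in range(k)] for i in range(k)]
    atb = [sum(p * y for p, y in zip(predictors[i], target)) for i in range(k)]
    return solve(ata, atb)

def rms(xs):
    """Root-mean-square of a list of numbers."""
    return math.sqrt(sum(x * x for x in xs) / len(xs))

rng = random.Random(1)
n_times = 200
trend = [math.sin(t / 15.0) for t in range(n_times)]   # shared telescope systematic
others = [[c * x + rng.gauss(0.0, 0.02) for x in trend] for c in (0.5, 1.0, 1.5)]
target = [0.8 * x + rng.gauss(0.0, 0.02) for x in trend]

weights = fit_linear(others, target)
prediction = [sum(w * star[t] for w, star in zip(weights, others)) for t in range(n_times)]
residual = [y - p for y, p in zip(target, prediction)]
print(f"rms before: {rms(target):.3f}, rms after: {rms(residual):.3f}")
```

Because the other stars share the telescope-induced variability but not the target's own astrophysical signal, subtracting the prediction removes the systematics in this toy. With real data the fit would need held-out data to avoid absorbing the signals being searched for.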

In the afternoon conversation, we looked at Foreman-Mackey's heuristic "information criteria" that he is using for exoplanet search in the Kepler data. By the end of the day, his search scalar included such interesting components as the following: Each proposed exoplanet period and phase is compared not just to a null model, but to a model in which there is a periodic signal but with time-varying amplitude (which would be a false positive). Each peak in the search scalar is vetoed if it shares transits with other, higher peaks (to rule out period aliasing and transit–artifact pairings). A function of period is subtracted out, to model the baseline created by noise (which is a non-trivial function of period). Everything looks promising for exoplanet candidate generation.
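
Of these components, the veto step is simple enough to sketch. The version below is a guess at the logic, not Foreman-Mackey's code: the peak dictionaries, the `tolerance` parameter, and the choice to compare against every higher-scoring peak (vetoed or not) are assumptions. The time-varying-amplitude comparison and the baseline subtraction are not shown.

```python
import math

def transit_times(period, t0, t_start, t_end):
    """Mid-transit times inside [t_start, t_end] for a period and reference epoch t0."""
    n_min = math.ceil((t_start - t0) / period)
    n_max = math.floor((t_end - t0) / period)
    return [t0 + n * period for n in range(n_min, n_max + 1)]

def shares_transit(times_a, times_b, tolerance):
    """True if any transit in one list falls within `tolerance` of one in the other."""
    return any(abs(a - b) <= tolerance for a in times_a for b in times_b)

def veto_peaks(peaks, t_start, t_end, tolerance):
    """Keep a peak only if it shares no transit with any higher-scoring peak."""
    ranked = sorted(peaks, key=lambda p: p["score"], reverse=True)
    all_times = [transit_times(p["period"], p["t0"], t_start, t_end) for p in ranked]
    kept = []
    for i, peak in enumerate(ranked):
        # Compare against every peak that scored higher than this one.
        if not any(shares_transit(all_times[i], all_times[j], tolerance) for j in range(i)):
            kept.append(peak)
    return kept

# Made-up search peaks: a real signal, two of its period aliases, and a distinct signal.
peaks = [
    {"period": 10.0, "t0": 3.0, "score": 50.0},
    {"period": 20.0, "t0": 3.0, "score": 30.0},   # alias at twice the period
    {"period": 5.0, "t0": 3.0, "score": 20.0},    # alias at half the period
    {"period": 7.3, "t0": 1.6, "score": 25.0},    # unrelated signal
]
kept = veto_peaks(peaks, t_start=0.0, t_end=90.0, tolerance=0.1)
print([p["period"] for p in kept])
```

In this example the two aliases are vetoed because they reuse transits of the 10-day peak, while the 7.3-day peak survives.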


psf interpolation

Vakili and I spoke about his probabilistic interpolation of the point-spread function in imaging using Gaussian Processes and a basis derived from principal components. The LSST atmospheric PSF looks very stable, according to Vakili's image simulations (which use the LSST code), so I asked for shorter exposures to see more variability. We talked about the point that there might be PSF variations that are fast in time but slow in focal-plane position (from the atmosphere) and others that might be slow in time but fast in focal-plane position (from the optics) and maybe hybrid situations (if, say, the optics have some low-dimensional but fast variability). All these things could be captured by a sophisticated model.


exoplanet search

Back when we were in Tübingen, Foreman-Mackey promised Schölkopf that he would find some new exoplanets by September 1. That's today! He failed, although he has some very impressive results recovering some known systems that are supposed to be hard to find. That is, it looks like existing systems are found at higher significance for us than for other groups, so we ought to be able to find some lower-significance systems. The key technology (we think) is our insanely flexible noise model for stellar (and Kepler) variability, that uses a Gaussian process not just of time, but of hundreds of other-star lightcurves. Foreman-Mackey and I talked extensively about this today, but we are still many days away from discoveries. We are bound to announce our discoveries on twitter (tm). Bound in the sense of "obligated". Let's see if we have the fortitude to do that!