utility and decisions

In the morning I woke up deciding that I had absolutely nothing new or interesting to say about model comparison (which I cast as a decision-making problem). By the end of the day I was writing furiously on it. I guess I realized that I might as well write down what I know and see if it stands on its own. It probably doesn't. I think the new things I have to add (maybe) are (a) pragmatic considerations about computation and precision and (b) an argument that you can never know your utility precisely (and rarely your priors). My point is that if you don't know the inputs precisely, there is no reason to do a huge computation. This has the consequence that if you are looking to split models that are close in marginalized likelihood you need to know a lot of unknowables precisely and do a lot of computation. If you are looking to split models that are far apart, you don't need to do anything serious; the difference will be obvious.

ps. If you want to know why you can never know your utility precisely, it is because it necessarily involves long-term considerations. The long term is the period over which all variables and assumptions can change. So you never have a precise model of it.
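To make the you-don't-know-your-inputs point concrete, here is a toy calculation (not from any real project; all numbers invented) with a one-parameter Gaussian model whose marginalized likelihood is analytic: widening the prior on the mean by a factor of 100 barely moves the posterior, but moves the log marginalized likelihood by several nats, easily enough to flip a comparison between models that are close.

```python
import numpy as np

def log_evidence(y, sigma, tau):
    # marginalized likelihood for y_i ~ N(mu, sigma^2) with prior mu ~ N(0, tau^2);
    # marginally, y is Gaussian with covariance sigma^2 I + tau^2 (ones matrix)
    n = len(y)
    C = sigma ** 2 * np.eye(n) + tau ** 2 * np.ones((n, n))
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + y @ np.linalg.solve(C, y))

def posterior_mean(y, sigma, tau):
    # conjugate posterior mean for mu; barely depends on tau once tau is broad
    n = len(y)
    return (n * y.mean() / sigma ** 2) / (n / sigma ** 2 + 1.0 / tau ** 2)

y = np.array([0.3, -0.1, 0.5, 0.2, 0.0, 0.4, -0.2, 0.1, 0.3, 0.6])
for tau in (1.0, 100.0):
    print(tau, log_evidence(y, 1.0, tau), posterior_mean(y, 1.0, tau))
```

Here the posterior mean shifts by about 0.02 while the log marginalized likelihood drops by about 4.5 nats as the prior widens. If you can't pin down the prior width, you can't pin down the evidence; that is (part of) my point.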


proposal, utility

Today a large group at NYU started the process of preparing our proposal to the Moore and Sloan Foundations. We got off to a challenging start, because we can all see that once you get a whole bunch of people in the room, resources get tight. But of course it is a sign of great health that there is so much we can do, and I am excited about paring what we might do down to what we will do.

Late in the day I worked more on my arguments about computing the marginalized likelihood (or Bayes integral). It is a complicated argument; I don't have an extremely simple version of it (yet). Yesterday, Gelman advised me to read Chapter 6 of his book Bayesian Data Analysis.


worlds collide

Foreman-Mackey and I had an internet chat with Andrew Gelman, Bob Carpenter, and Daniel Lee (all Columbia) regarding the hierarchical modeling package STAN. They want to see if it works in astronomical contexts, and we have astronomical contexts that need hierarchical inference. So it may be a match made in heaven. After the great success of exoSAMSI, we gave them the Kepler problem, and promised them reading material. They gave us the STAN manual!


serving likelihood

At the end of the day yesterday, Price-Whelan and I discussed his very nice project with Kathryn Johnston, in which they run the clock backwards on stars in stellar streams in the Milky Way and use their clustering properties back in time to identify better and worse gravitational potential models. They are using the empirical variance tensor of a set of 6-displacements (in phase space) as the objective function that they are trying to minimize.

As my loyal readers know, HOGG SMASH objective functions that are not justifiable in terms of probabilistic inference. So today I sketched out a method (in LaTeX) that fits the past-time distribution of stars with a Gaussian (with that pesky variance tensor), plus a bunch of nuisance parameters. That is, I made a likelihood function for their problem. What I have proposed is justifiable and probabilistic—so I am pleased—and it retains the penalize-with-the-variance aspect that APW and KVJ have been using. I have to admit, however, that inference with my likelihood function will be far, far slower than what they are doing now. It reminds me of our tag line for The Tractor: It took one hell of a long time, but I felt so probabilistically righteous.
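For concreteness, here is a minimal sketch of the core of that likelihood (the real note has nuisance parameters and more structure; this toy has none of that, and the point clouds are made up): treat the backwards-integrated 6-D phase-space positions as draws from a multivariate Gaussian whose covariance is the variance tensor, so that compact configurations get high likelihood.

```python
import numpy as np

def stream_log_likelihood(w, mean, cov):
    # log-likelihood of N 6-D phase-space points w (shape N, 6) under a
    # multivariate Gaussian with the given mean (6,) and covariance (6, 6);
    # small |cov| rewards compact clouds, mimicking minimize-the-variance
    n, d = w.shape
    resid = w - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum('ni,ij,nj->n', resid, np.linalg.inv(cov), resid)
    return -0.5 * (n * d * np.log(2.0 * np.pi) + n * logdet + maha.sum())

rng = np.random.default_rng(42)
tight = rng.normal(0.0, 0.1, size=(50, 6))   # stand-in for a good potential model
loose = rng.normal(0.0, 1.0, size=(50, 6))   # stand-in for a bad potential model

# evaluate each point cloud under its own empirical Gaussian
for w in (tight, loose):
    mu, C = w.mean(axis=0), np.cov(w, rowvar=False)
    print(stream_log_likelihood(w, mu, C))
```

A potential model that brings the stream stars back to a compact phase-space blob wins; unlike the raw variance-tensor objective, this quantity can be multiplied into priors and fed to emcee.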


streams and gravitational potential

Kathryn Johnston (Columbia), Ana Bonaca (Yale), Andreas Küpper (Yale), and Adrian Price-Whelan (Columbia) stopped in for the day to discuss all things streams-in-the-Milky-Way-Halo. We discussed objective functions, probabilistic inference, action–angle space, chaotic and integrable potentials, biases and functional testing of inferences, tidal stripping, and other matters of great importance. We spent quite a bit of time talking about how to make more realistic or asymmetric potential models and fit them, and how to test the evil done by the assumption (that almost everyone makes) of integrability (which cannot be true). We also discussed work by Sanderson and Helmi (Groningen) that makes use of information theory to do inference in a similar context; indeed information theory appeared multiple times in the conversation. We decided that there are many overlapping papers to write. My job in all this is appearing to be: Turn a sensible data analysis idea into a probabilistic generative model such that we can fire up emcee. Love that!


marginalized likelihood

On the trip home from San Francisco, I worked a bit on a note I want to write about the marginalized likelihood or Bayes factor or evidence integral, inspired by the very nice talk at exoSAMSI by Jim Berger (Duke) and a lunch conversation with him after.



I spent the day with design firm IDEO and a large number of scientists and stats people in San Francisco, finishing up a two-day meeting to design a program in Data Science supported by the Moore and Sloan Foundations. It was odd to interact with designers in a scientific context. Their ways of operating and thinking and communicating are very alien for scientists. However, they did a good job—via a well-developed and thought-out process—of bringing out important questions, ideas, and common themes for institution-wide and multi-institution interdisciplinary research in Data Science. Now we all head back to our home institutions and write proposals, while the Foundations and IDEO prepare some public documents about principles of design pertaining to Data Science initiatives. Odd!

Some of the most interesting ideas that came up involve going meta (applying data science to understand data science, using profiling to assess programmers and scientists and not just programs) and mash-ups (Foursquare plus TripAdvisor plus github plus Python packages). I hope we implement some of these ideas and they have legs.


more wavelets

I was taken away from the utopia that is exoSAMSI to a meeting in San Francisco with the Moore and Sloan Foundations about data science. On the flight, I worked on writing up what we learned yesterday about wavelets. The main thing we learned—and I have to credit this to Baines (Davis) and Dawson (CfA)—is that even in the dumbest wavelet basis, the wavelet amplitudes for a detrended G-type star lightcurve look very close to independent, Gaussian draws. This fully justifies our wavelet-based simple Gaussian Process approach. What do I mean by detrended? Not saying, just yet, because we don't yet really know what works best. The Moore and Sloan meeting was fun; it involved a lot of brainstorming.


exoSAMSI day seven, injection and recovery

While Baines (Davis) and Dawson (CfA) worked on the statistics of the amplitudes of wavelet-transform elements (to build intuition for our likelihood function), the rest of the team, including especially Montet (Caltech), Barclay (Ames), and Foreman-Mackey worked on doing approximate one-dimensional (lightcurve-level) injections of Earth-like planets on year-like orbital periods into a set of Solar-type stars chosen for us by Matijevic (Villanova). At time of writing (the day isn't over), the team is trying to close the loop of running an approximate search ("box least squares" or equivalent) on the injected lightcurves for exoplanet recovery. If we can close this loop, we feel like we might be able to achieve our goals, which include (but are not limited to) finding Earth-like planets around Sun-like stars on year-like orbits. Sound familiar? I worked on missing-data imputation and wavelet transforms.
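Schematically, the loop we are trying to close looks something like the following (everything here is invented for illustration: white noise only, a box-shaped transit, and a crude depth-over-scatter scan standing in for real box least squares):

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(0.0, 90.0, 0.02)               # ~90 days at roughly Kepler cadence
flux = 1.0 + rng.normal(0.0, 2e-4, t.size)   # white noise only (a big idealization)

# inject a box-shaped transit: period, epoch, duration, depth (all made up)
P_true, t0, dur, depth = 11.3, 2.0, 0.25, 1e-3
in_transit = ((t - t0) % P_true) < dur
flux[in_transit] -= depth

def box_depth_snr(t, flux, period, dur):
    # best per-phase-bin box depth, in units of the photometric scatter:
    # a crude stand-in for a box-least-squares statistic
    phase = t % period
    base, sig = np.median(flux), np.std(flux)
    best = 0.0
    for lo in np.arange(0.0, period, dur):
        sel = (phase >= lo) & (phase < lo + dur)
        if sel.sum() < 5:
            continue
        d = base - flux[sel].mean()           # mean depth in this phase bin
        best = max(best, d * np.sqrt(sel.sum()) / sig)
    return best

# recovery: scan trial periods and take the strongest box signal
periods = np.arange(5.0, 20.0, 0.05)
snrs = np.array([box_depth_snr(t, flux, P, dur) for P in periods])
P_found = periods[np.argmax(snrs)]
print(P_found)
```

Real lightcurves have correlated stellar and spacecraft variability, which is the entire point of the working group; this sketch only closes the loop logically, not realistically.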


exoSAMSI day six, likelihood function

We had the big battle about the form for a likelihood for exoplanet search and characterization late in the day, with Baines (Davis) leading the discussion. There was a huge disagreement about the realism of the model—as there always is—with some saying we should split stellar and spacecraft contributions to the lightcurve variability, and some wanting to mush them together. The former is the Right Thing To Do (tm) when you are going hierarchical, because it permits you to pool data from multiple stars on the same part of the detector (for the instrument model) and multiple stars of the same type scattered around in time and space (for the stellar model). That said, the mush-together option might be the right thing to do for an effective model where you want or need to treat every lightcurve separately. We chose the mush, and Baines and the gathered worked out some ideas about what kinds of models for stellar variability and spacecraft artifacts might work.

In the rest of the day, the team worked on selection of target stars, injection of false signals, running standard filter-based and fitting-based de-trenders and co-trenders, investigation of wavelet transforms, and statistical properties of stellar variations.


exoSAMSI day five, wavelets

Wavelets. Yes, you thought they were cool in the 1980s. Apparently they are back. Almost every statistician at this workshop has mentioned them to me, and after two days of hard work, I am convinced: If you wavelet transform the space in which you do your inference, you can lay down an independent (diagonal) Gaussian Process but still produce exceedingly nontrivial covariance functions. We are looking at whether we can model the Kepler data this way. The idea is to use the wavelets to make it possible to marginalize out all possible stellar variability consistent with the data.

Baines (Davis) has been instrumental in my understanding, after Carter (CfA) and Dawson (CfA) taught me the basics by battling it out about implementations of the wavelet transform. I am still working on the linear algebra of all this, but I think I am going to find that one issue is that—in order for the method to be fast and scalable—the wavelet part of the problem has to treat the data as uniformly sampled and homoskedastic. Those assumptions are always wrong (do I repeat myself?) but I think I decided today that we can just pretend they are true for the purposes of our rotation into the wavelet basis. The rest of the working group kicked ass while I had these deep, deep thoughts.
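Here is a toy version of that linear algebra (dumbest possible Haar basis, uniformly sampled, homoskedastic, all by assumption): an independent, diagonal Gaussian Process in wavelet space, with big variances on the smooth coefficients and small variances on the fine details, induces a dense and very nontrivial covariance back in the time domain.

```python
import numpy as np

def haar_matrix(n):
    # orthonormal Haar wavelet matrix for n a power of two (standard recursion)
    if n == 1:
        return np.array([[1.0]])
    h = haar_matrix(n // 2)
    top = np.kron(h, [1.0, 1.0])                 # smooth (scaling) rows
    bot = np.kron(np.eye(n // 2), [1.0, -1.0])   # fine detail rows
    return np.vstack([top, bot]) / np.sqrt(2.0)

n = 16
W = haar_matrix(n)

# diagonal covariance in wavelet space: large variance on the smooth half,
# small variance on the detail half
d = np.concatenate([np.full(n // 2, 4.0), np.full(n // 2, 0.1)])
C_time = W.T @ np.diag(d) @ W                    # induced time-domain covariance
print(np.round(C_time[:4, :4], 2))
```

Because C = W^T D W with W orthonormal, likelihood evaluations only ever need the diagonal D and a fast transform; that is the whole speed-and-scalability argument.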


exoSAMSI day four, centroids

Barclay (Ames) blew us away this morning by pointing out the incredibly useful fact that the Kepler sensitivity to a star is a very strong function of the centroid position of the star in the pixel space, and that the data from Kepler include this information (both measured and predicted). Awesome! I think this is going to have a big impact on our activity. Indeed, Montet (Caltech) pointed out that this alone is a publishable idea. Foreman-Mackey and Montet went off to build a model of this effect. As my loyal reader might guess, I have been pushing my working group hard with the idea that the output of this workshop should be papers, not just ideas or projects or folk knowledge.


exoSAMSI day three, likelihood function, wavelets

We generated a scope for the detrending working group at SAMSI, which is to concentrate on small planets on long periods around quiet G-type stars. That is, the signals that are most likely to make us famous. We set deliverables for each of the next three Fridays. We began the discussion about what kinds of instrumental and astrophysical effects we would need to consider and also how to test the methods we want to consider.

In the afternoon, Baines (Davis) and a bunch of other statisticians convinced me that wavelets are a great tool for modeling non-parametric, stochastic processes. I was surprised at first but then came around when I realized that because they are always linearly related to the time-domain data, they preserve Gaussianity; in many cases of interest you can design a wavelet basis in which the noise—or some component of the noise—is independent (diagonal in the covariance matrix).

Important piece of context: We agreed that there is a deep sense in which detrending is just writing down a better likelihood function. It would please me to no end if one of the outputs of this project was a new likelihood function for the Kepler mission.


exoSAMSI day two

Talks at the Kepler meeting today continued, with the end of the day being the start of developing working groups to tackle scientific problems in collaboration.

Talk highlights included Berger (Duke) going through some very simple arguments about how Bayesian inference protects you from many statistical and model-selection pitfalls. He showed a real example where Bayesian inference gets a different answer from frequentist inference, in a scientific problem on which he had actually worked; it is related to the Jeffreys-Lindley paradox, which is about distinguishing models with zero effect from models with some (unknown) non-zero effect. Clyde (Duke) and Cisewski (CMU) talked (in different contexts) about importance sampling (rather than MCMC). Adaptive importance sampling is obviously a very powerful technique we ought to be thinking about. At lunch, Berger backed this up, strongly endorsing importance sampling with adaptive proposal distributions in contexts where you want the Bayesian Evidence. (Didn't I say yesterday that you never want that?) Cisewski also talked about Approximate Bayesian Computation, which she described as "a method of last resort". But it looks like it might be valuable for many astrophysics problems: It replaces the likelihood with a thresholded "distance" between statistics of the data and generated data samples. In some cases, it asymptotically approaches true inference. Her seminar also effectively answered a question posed yesterday by Ragozzine (Florida) about how to combine inferences based on statistics computed on the data! ABC FTW?
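To record the importance-sampling idea before I forget it, here is a toy version of an evidence computation with a crudely adapted proposal (one-dimensional, Gaussian everything, nothing like a real exoplanet posterior):

```python
import numpy as np

rng = np.random.default_rng(3)

def log_like(theta):
    # sharply peaked likelihood, N(2, 0.1^2)
    return -0.5 * ((theta - 2.0) / 0.1) ** 2 - 0.5 * np.log(2.0 * np.pi * 0.1 ** 2)

def log_prior(theta):
    # broad prior, N(0, 5^2)
    return -0.5 * (theta / 5.0) ** 2 - 0.5 * np.log(2.0 * np.pi * 5.0 ** 2)

def evidence(mu_q, sig_q, n=20000):
    # importance-sampling estimate of Z with Gaussian proposal q = N(mu_q, sig_q^2)
    theta = rng.normal(mu_q, sig_q, n)
    log_q = -0.5 * ((theta - mu_q) / sig_q) ** 2 - 0.5 * np.log(2.0 * np.pi * sig_q ** 2)
    w = np.exp(log_like(theta) + log_prior(theta) - log_q)
    ess = w.sum() ** 2 / (w ** 2).sum()          # effective sample size
    return w.mean(), ess, theta, w

# pass 1: proposal = prior; almost all the weight sits on a handful of samples
Z1, ess1, theta, w = evidence(0.0, 5.0)

# adapt: refit the proposal to the weighted samples (width inflated for safety)
mu2 = np.sum(w * theta) / w.sum()
sig2 = 1.5 * np.sqrt(np.sum(w * (theta - mu2) ** 2) / w.sum())
Z2, ess2, _, _ = evidence(mu2, sig2)
print(Z1, ess1)
print(Z2, ess2)
```

The adapted proposal boosts the effective sample size by more than an order of magnitude in this toy; in many dimensions the naive first pass fails completely, hence (I assume) the emphasis on the adaptive part.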

Another highlight was Christiansen (Ames) showing that she can insert transits into the pixel-level data, run the pipelines, and recover them. She is doing this to test completeness and selection and noise models. But it is just as interesting from an inference perspective; the fact that she can insert realistic transits means (in principle) that we could be fitting in the pixel domain. I would be pleased to be doing that. Also, she showed evidence that de-trending (fitting out instrumental and stellar variability effects) perhaps ought to be done in the pixel space, not the integrated lightcurve space?

In the afternoon, we crowd-sourced projects for further study and split-ish into working groups. Despite the fact that I am dying to infer the population parameters of long-period planets from data coming only from single (isolated) transits, I decided to join the de-trending and noise model group. We want to improve Kepler's sensitivity by understanding better the device trends and noise.


exoSAMSI day one

Today was the first day of the SAMSI workshop on statistics and computation for Kepler data, organized by Ford (Florida) and a few others (including, absolutely minimally, me). The workshop is a long one, because the goal is to get astronomers and statisticians together to solve some real problems. In the first day, far more happened than I can possibly blog responsibly about; here are some personal highlights:

The day started with Barclay (Ames) explaining the mission and its data, with Ford enforcing strict communicate-with-the-statisticians speaking style! Payne (CfA) talked a bit about long-period candidates, where in a single Kepler source only one transit is seen. What can we conclude? A lot, I think; both the depth and duration bring important information; that's something I want to keep thinking about in the next few weeks. Morton (Caltech) talked by phone about false positives and (very detailed) efforts to remove them by probabilistic modeling. He said that the right thing to do is compute the Bayes Evidence, something I agree with in principle, but disagree with in practice. I disagree because such computations are very prior-dependent and very hard; if you don't know your priors extremely well, then you can get substantially wrong answers after a very large amount of work. Besides, decisions fall outside the realm of Bayes, as my loyal reader has heard me say more than once. Don't get me started about long-term future discounted free cash flow!

Carter (CfA) blew us away with the awesome physics of three-body (and more) transiting systems. There are huge transit timing variations, exoplanets around binary systems, and more, all of which is insane to model and hard to find (because they generally aren't strictly periodic; sometimes they aren't even close). He is living the dream: There is an n-body integration inside his likelihood function. That's something that we tried at CampHogg in the past, and one of the things that got us thinking about methods for doing MCMC that minimize likelihood calls per independent sample.

Loredo (Cornell), Baines (Davis), and I talked about hierarchical models. Loredo was one of the first (maybe the first) to use them in astronomy, and remains one of the clearest thinkers in the business about measuring stuff in astrophysics. I am glad he is here. Baines showed that you can enormously speed up inference in hierarchical models if you transform variables. The challenging thing is that you need to change variables across layers in the model. Then he showed how to benefit maximally from multiple parameterizations with a modified sampling method. This is probably extremely relevant to our (future) projects.
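I believe the cross-layer transformation Baines described is (at least closely related to) the centered versus non-centered reparameterization trick. A toy funnel, sampled directly rather than by MCMC, shows why it helps: in the centered coordinates the lower-layer parameter is strongly coupled to the upper-layer scale, while in the non-centered coordinates the layers decouple.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100000

# hierarchical toy: log_tau is the upper layer (a log-scale), theta the lower layer
log_tau = rng.normal(0.0, 1.0, n)

# centered: theta is drawn with scale exp(log_tau) directly (funnel geometry)
theta_c = rng.normal(0.0, 1.0, n) * np.exp(log_tau)

# non-centered: sample a unit-scale z; theta is derived as exp(log_tau) * z
z = rng.normal(0.0, 1.0, n)
theta_nc = np.exp(log_tau) * z               # same marginal distribution as theta_c

# |theta_c| is strongly correlated with log_tau; |z| is not
r_centered = np.corrcoef(np.abs(theta_c), log_tau)[0, 1]
r_noncentered = np.corrcoef(np.abs(z), log_tau)[0, 1]
print(r_centered, r_noncentered)
```

A sampler exploring (log_tau, z) sees a nearly isotropic Gaussian, while one exploring (log_tau, theta) has to navigate the funnel; the implied distribution of theta is identical either way.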



Krikamol Muandet (MPI-IS) and I discussed briefly at the end of the day his work on support measure machines, which are a generalization of support vector machines. He has generalized SVM (which is a supervised classifier for data points) to take as input probability distribution functions rather than points. This is an impressive accomplishment, and the resulting machinery has a beautiful internal structure, in which distances between data points look like chi-squared differences between probability clouds. But he has also applied this machinery to the training and test sets used in XDQSO to compete with Bovy et al. What he finds is that SMM beats SVM (because it uses error information) but that SMM loses to XD (because it doesn't know as much about the causal, generative processes as XD does). We discussed a bit the general question of how domain knowledge ought to be included in machine-learning algorithms, and I started to think that maybe this might be our superpower here at CampHogg. More on that after a few weeks of Kepler goodness at SAMSI in NC.



supervised vs unsupervised

At lunch (we found a Thai place where the food was actually hot enough to satisfy Muandet) CampHogg discussed with Marshall the differences between (and meanings of) supervised and unsupervised methods in machine learning. In supervised learning, you want (usually) to obtain a probability for a label given the data, by looking at many labeled-data instances (training data). This can proceed without producing anything like a generative model for the data. I argued that—while supervised methods have roles to play in astronomy—no important high-level problem is really a supervised problem, for various reasons: One is that it is rare that science (at a high level) involves label transfer or classification. Classification is usually done in service to something more important. Another is that the main goal is to make discoveries, not classify instances into known, already-discovered classes. Another is that science really is about creating causal, generative models of the data, while supervised learning explicitly deprecates that (in many cases, though not all, I should say; XDQSO and extreme deconvolution are counter-examples).

Very late in the day I wrote text in what I hope will be the Astronomical Journal write-up of the image-combination and rank-statistics project we did for NIPS (and finished so close to the deadline last week).


integer multiples of periods

One crazy thing about period fitting (which we are implementing because we are searching for exoplanets in the Kepler data) is that for every period you discover, there is a chance you are only discovering it because it is twice, or three times, or five times the fundamental period. Today, Foreman-Mackey and I worked on ruling out integer fractions of the best-fit period. We came up with some nice heuristics; there is a chance we can find new exoplanets in Kepler without too many false positives. More testing is needed! On a walk in the park with Marshall (visiting this week) and Fadely, we realized that what we are doing isn't even frequentist. Following up on discussions we started on twitter, we should entitle our paper on the subject "Incorrect" or "Approximately Frequentist" exoplanet discovery!
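One of the heuristics, roughly (with a sinusoid standing in for a transit signal, and all the numbers invented): fold the data at integer fractions P/k of the found period and keep the smallest fraction that still phases up coherently.

```python
import numpy as np

rng = np.random.default_rng(11)
t = np.sort(rng.uniform(0.0, 100.0, 2000))
P0 = 7.0                                       # the true, fundamental period
y = np.sin(2.0 * np.pi * t / P0) + rng.normal(0.0, 0.2, t.size)

def phase_dispersion(t, y, period, nbins=60):
    # mean within-bin variance of the folded data; small when the fold is coherent
    phase = (t % period) / period
    idx = np.minimum((phase * nbins).astype(int), nbins - 1)
    return np.mean([y[idx == b].var() for b in range(nbins) if np.any(idx == b)])

P_found = 3 * P0                               # pretend the search latched onto a multiple
candidates = [P_found / k for k in range(1, 6)]
scores = {round(P, 3): phase_dispersion(t, y, P) for P in candidates}

# keep the smallest integer fraction that folds about as coherently as P_found
good = [P for P in candidates
        if scores[round(P, 3)] < 1.5 * scores[round(P_found, 3)]]
P_best = min(good)
print(P_best)
```

Here the fold at P_found/3 is just as coherent as the fold at P_found, while the non-integer-related fractions (P_found/2, P_found/4, P_found/5) scramble the phases, so the heuristic recovers the fundamental.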