2012-12-05

emcee vs Gibbs

Tang, Marshall, and I have a problem for which we think Gibbs sampling will beat the ensemble sampler emcee (by Foreman-Mackey). The issue is that you can't really Gibbs sample with emcee: In Gibbs sampling you hold a bunch of parameters fixed while you independently sample the others, and swap back and forth (or cycle through groups of easily sampled subsets of parameters). In the ensemble (of the ensemble sampler), if the walkers carry different values for the parameters in the held-fixed subset, then the parameters in the sampling subset are being drawn from different conditional posterior PDFs. That breaks the whole ensemble concept. So we are confused. Tang's job is to implement simple Metropolis–Hastings Gibbs sampling and see if it beats emcee. I am afraid it will as the number of parameters gets large. This is just more annoying evidence that there is no one-size-fits-all or kitchen-sink sampler that solves all problems best. If I worked in stats I think I would try to make that thing.
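
(A minimal sketch of the kind of Metropolis-within-Gibbs sampler described above: cycle through blocks of parameters, Metropolis-updating one block while the others stay fixed. The block structure, step sizes, and toy Gaussian log-posterior below are placeholders, not the actual problem.)

```python
import numpy as np

def metropolis_within_gibbs(log_post, x0, blocks, step_sizes, n_steps, rng=np.random):
    """Cycle through parameter blocks, Metropolis-updating one block at a time
    while every other parameter is held fixed at its current value."""
    x = np.array(x0, dtype=float)
    lp = log_post(x)
    chain = np.empty((n_steps, x.size))
    for t in range(n_steps):
        for block, step in zip(blocks, step_sizes):
            prop = x.copy()
            prop[block] += step * rng.randn(len(block))  # Gaussian proposal on this block only
            lp_prop = log_post(prop)
            if np.log(rng.uniform()) < lp_prop - lp:     # Metropolis accept/reject
                x, lp = prop, lp_prop
        chain[t] = x
    return chain

# Toy usage: a 4-parameter Gaussian posterior (a placeholder), sampled in two blocks of two.
def log_post(x):
    return -0.5 * np.sum(x ** 2)

chain = metropolis_within_gibbs(log_post, np.zeros(4),
                                blocks=[[0, 1], [2, 3]],
                                step_sizes=[0.5, 0.5], n_steps=2000)
```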

6 comments:

  1. David,

    I'm not exactly sure about how it would work with the stretch moves, but people often implement differential evolution using difference vectors that incorporate only a subset of the parameters. It's a little easier to think about for D.E. because there the moves really are symmetric, while the stretch moves have the Z^(N-1) factor because they are not.

    In the extreme case, one could do D.E. moves that propose jumps that are single-parameter updates computed from differences between single coordinates of two randomly-chosen members of the ensemble.

    Will
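
    (A rough Python sketch of the single-coordinate differential-evolution move described in the comment above; the `gamma` scale, the walker bookkeeping, and the log-posterior are illustrative assumptions, not anyone's actual code. Because the move is symmetric, the plain Metropolis accept/reject rule applies.)

    ```python
    import numpy as np

    def single_coord_de_update(ensemble, log_probs, log_post, gamma=0.7, rng=np.random):
        """One pass of single-coordinate differential-evolution updates: for each
        walker, perturb one randomly chosen coordinate by a scaled difference of
        that same coordinate in two other, randomly chosen walkers."""
        n_walkers, n_dim = ensemble.shape
        for k in range(n_walkers):
            others = [j for j in range(n_walkers) if j != k]
            a, b = rng.choice(others, size=2, replace=False)
            i = rng.randint(n_dim)
            prop = ensemble[k].copy()
            prop[i] += gamma * (ensemble[a, i] - ensemble[b, i])  # symmetric proposal
            lp_prop = log_post(prop)
            if np.log(rng.uniform()) < lp_prop - log_probs[k]:
                ensemble[k], log_probs[k] = prop, lp_prop
        return ensemble, log_probs
    ```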

  2. Putting on my pedant hat for a moment...

    The original Metropolis et al. paper (1953) updated one particle (which was only two of their 2N variables) at a time. From the beginning, Metropolis(–Hastings) methods have included the option of cycling through the variables.

    To me, Gibbs sampling is the successive resampling of conditional distributions p(x_i|x_\i), for some subsets i (often individual scalar parameters/variables). Gibbs sampling (when tractable) can work better than simple Metropolis proposals applied to each variable in turn, especially with heavy-tailed distributions (e.g. Figure 2.1c of my PhD thesis).

    The term Metropolis-within-Gibbs does exist in the literature...I just don't like it. See paragraph 5 of http://www.jstor.org/stable/2346066

    As to your real issue. Yes, it's hard/interesting. Contact me offline if you have representative prior and likelihood functions for me to play with.

  3. Iain: Thanks for the correction! Do you mean that you reserve the term "Gibbs" for when you have analytic or fast or tractable non-Metropolis sampling methods for individual parameters (conditioned on everything else)?

I've indeed tried to use emcee to sample from a hierarchical model, and I have some preliminary results showing that emcee finds it really difficult to sample from it, while a Metropolis-within-Gibbs (MwG) method converges quite fast. I have to say, though, that I also tried my own implementation of full Metropolis and other general-purpose routines, and they had the same issues as emcee. The "one at a time" property of MwG really helps convergence. The reason, I think, is that the model (used to explain oscillations in solar coronal loops) consists of several sets of parameters, each explaining a fully independent subset of the observations. So, in principle, these sets of model parameters are independent, and the only connection between them occurs through the prior.

    Andres

  5. "If I worked in stats I think I would try to make that thing."

    You DO work in stats, among other things. Discreteness of fields is an illusion.

Hogg: yes, Gibbs sampling is about sampling exactly from conditional distributions p(x_i|x_\i). The new setting of x_i doesn't depend on the old. The usual methods are inverting the CDF or (adaptive) rejection sampling. Metropolis or slice sampling leaves the conditional distribution invariant, but the distribution of the new x_i depends on the old setting.

    Something else I meant to say here is that MCMC methods nicely compose. Alternating emcee and component-wise Metropolis updates might work better than either individually. If one of the updates is useless, the worst that happens is that you waste 50% of your CPU time.
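
    (A rough sketch of this kind of composition, assuming the emcee-2-era API in which run_mcmc returns a (position, log-probability, random-state) tuple; the toy Gaussian target, step size, walker count, and iteration counts are placeholders.)

    ```python
    import numpy as np
    import emcee  # assumption: emcee-2-style run_mcmc returning (pos, lnprob, state)

    def log_post(x):
        return -0.5 * np.sum(x ** 2)  # toy stand-in for the real posterior

    def componentwise_metropolis_sweep(x, lp, step=0.5, rng=np.random):
        """One Metropolis sweep over the coordinates of a single walker."""
        x = x.copy()
        for i in range(len(x)):
            prop = x.copy()
            prop[i] += step * rng.randn()
            lp_prop = log_post(prop)
            if np.log(rng.uniform()) < lp_prop - lp:
                x, lp = prop, lp_prop
        return x, lp

    n_walkers, n_dim = 16, 4
    pos = np.random.randn(n_walkers, n_dim)
    sampler = emcee.EnsembleSampler(n_walkers, n_dim, log_post)

    # Compose the two kernels: a few ensemble (stretch-move) steps, then a
    # component-wise Metropolis sweep over every walker, and repeat.
    for _ in range(100):
        pos, lnp, _ = sampler.run_mcmc(pos, 10)
        for k in range(n_walkers):
            pos[k], lnp[k] = componentwise_metropolis_sweep(pos[k], lnp[k])
    ```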
