inferring evolution, hidden Markov model

Sriram Sankararaman (Harvard) gave a great Computer Science Colloquium today about inferring the evolutionary tree (well, it isn't really a tree) from genetic information, particularly as regards humans and neandertals. He is able to show, using the statistics of DNA variability, that humans and neandertals interbred long after they separated (both geographically and as species). He was also able to show that there is statistical evidence for the sterility (infertility) of males after speciation. Awesome stuff, and closely related to cosmology in many ways: The models are of two-point statistics of the DNA sequences, not the sequences themselves, and the probabilistic modeling methods (approximate Gaussian likelihood functions and MCMC) are very similar indeed.

Prior to that, in group meeting, McFee and Huppenkothen jointly proposed a plan for clustering black hole timing data using a hidden Markov model: The idea is that the data are generated by a probability distribution that is set by a state, and there are finite probabilities of transitioning from state to state at each time step. This is a well-understood idea in machine learning, but also very close to how we think about the generation of the timing data, fundamentally. Great plan! Huppenkothen's first order of business is to run k-means in a feature space (for initialization of the HMM).
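
To make the HMM idea concrete, here is a minimal toy sketch, not the McFee and Huppenkothen plan itself. Everything specific in it is my assumption for illustration: two states, Gaussian emissions, a scalar "feature", the particular transition matrix, and the helper names `simulate_hmm` and `kmeans_1d`. It draws data from a state-dependent distribution with fixed per-step transition probabilities, then runs plain k-means on the observations to get starting guesses for the state means.

```python
import numpy as np


def simulate_hmm(transition, means, sigmas, n_steps, rng):
    """Draw a state sequence and Gaussian observations from a toy HMM."""
    n_states = transition.shape[0]
    states = np.empty(n_steps, dtype=int)
    states[0] = rng.integers(n_states)  # assumption: uniform initial state
    for t in range(1, n_steps):
        # Row i of `transition` holds the probabilities of moving from
        # state i to each state at the next time step.
        states[t] = rng.choice(n_states, p=transition[states[t - 1]])
    # The current state sets the distribution each datum is drawn from.
    observations = rng.normal(means[states], sigmas[states])
    return states, observations


def kmeans_1d(x, k, n_iter=50):
    """Plain Lloyd's algorithm on scalar features (hypothetical helper)."""
    # Deterministic start: evenly spaced quantiles of the data.
    centers = np.quantile(x, (np.arange(k) + 0.5) / k)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean()
    # Final assignment, consistent with the final centers.
    labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
    return centers, labels


rng = np.random.default_rng(42)
transition = np.array([[0.95, 0.05],
                       [0.10, 0.90]])  # assumed values; rows sum to one
means = np.array([0.0, 5.0])           # assumed per-state emission means
sigmas = np.array([1.0, 1.0])          # assumed per-state emission widths

states, observations = simulate_hmm(transition, means, sigmas, 2000, rng)
centers, labels = kmeans_1d(observations, k=2)
```

The k-means centers and labels ignore the time ordering entirely; they would only serve as initial values for the emission parameters and state assignments, with the HMM fit itself (transition probabilities included) left to something like Baum-Welch or sampling.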


  1. "data are generated by a probability distribution"

    One of my life goals is to get you to stop saying this.

  2. Is it the "by" or the "probability" that you dislike? Or the "generated"?

  3. I remember thinking this paper made a step in the right direction for HMMs in astronomy: http://adsabs.harvard.edu/abs/2014ApJ...791...24M