I added text to the fitting-a-line document about foreground–background mixture modeling for robust fitting. This is a big favorite of mine because it is fast, simple, and very close to the right thing to do. It also resolves some of my issues of a few days ago.
Phil Marshall showed up in Heidelberg today; he and Kasper Schmidt (MPIA) and I had lunch and discussed (among other things) the confirmation and rejection of hypotheses, which in the real world rarely goes according to either the Bayesian or the frequentist methodology. For example, the WMAP-1 paper was taken to be an awesome confirmation of the standard CDM model (and it was!) even though it had a bad chi-squared value (so frequentists were wrong to be excited) and the model was not competing against any serious alternative (so Bayesians had nothing to say at all). I think this all comes down to message length, but I certainly haven't worked it all out yet.
After my binary programming failures of yesterday, Bovy implemented a Gibbs sampler and solved my problem overnight. But then in discussing the issues, we concluded (as we always do) that the only principled way to deal with bad data or outliers is to model them. This means performing density estimation or modeling of the distribution function for the outliers simultaneously with performing the fit on the inliers, so all the data can be generatively modeled as a two-component mixture. One component is the inliers, with model parameters fitting those inliers. The other is the outliers, with model parameters describing the distribution of outliers. I think we may have to switch to that in the robust fitting section of the now-infamous fitting-a-straight-line document.
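A minimal sketch of the two-component mixture likelihood described above, assuming (this is a choice made for illustration, not something the post specifies) that the outliers are drawn from a single broad Gaussian in y. The parameter names `p_bad`, `y_bad`, and `v_bad` are hypothetical labels for the outlier fraction and the mean and variance of the outlier distribution.

```python
import math

def log_gaussian(x, mean, var):
    """Log of a one-dimensional Gaussian density."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def log_likelihood(params, xs, ys, sigmas):
    """Log-likelihood of a foreground (line) plus background (outlier) mixture.

    params = (m, b, p_bad, y_bad, v_bad); the last three describe the
    assumed Gaussian outlier distribution and are illustrative choices.
    """
    m, b, p_bad, y_bad, v_bad = params
    if not (0.0 < p_bad < 1.0) or v_bad <= 0.0:
        return -math.inf
    total = 0.0
    for x, y, s in zip(xs, ys, sigmas):
        # Inlier component: the straight line with the reported uncertainty.
        log_fg = math.log(1.0 - p_bad) + log_gaussian(y, m * x + b, s * s)
        # Outlier component: broad Gaussian, broadened further by the noise.
        log_bg = math.log(p_bad) + log_gaussian(y, y_bad, v_bad + s * s)
        # Stable log-sum-exp of the two components.
        hi = max(log_fg, log_bg)
        total += hi + math.log(math.exp(log_fg - hi) + math.exp(log_bg - hi))
    return total
```

Every data point contributes through both components, so no point is ever hard-classified as good or bad; the three outlier parameters would typically be treated as nuisance parameters and marginalized out.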
I worked into the wee hours on robust fitting with arbitrary assignments of the binary classification "good" or "bad" to each data point. Robust fitting methods like this are beautiful, but exponential in the number of data points (this is binary programming, discussed on this blog previously in a different context). I attempted to make use of the magic of sampling but with little success.
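To make the exponential scaling concrete, here is a brute-force sketch that tries every one of the 2^N good/bad assignments. The scoring rule (chi-squared of the good points plus a fixed `penalty` per rejected point) is a hypothetical stand-in chosen for illustration; it is not necessarily the objective used in the post.

```python
from itertools import product

def fit_line(xs, ys, sigmas):
    """Weighted least-squares slope and intercept."""
    w = [1.0 / (s * s) for s in sigmas]
    S = sum(w)
    Sx = sum(wi * x for wi, x in zip(w, xs))
    Sy = sum(wi * y for wi, y in zip(w, ys))
    Sxx = sum(wi * x * x for wi, x in zip(w, xs))
    Sxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    det = S * Sxx - Sx * Sx
    return (S * Sxy - Sx * Sy) / det, (Sxx * Sy - Sx * Sxy) / det

def best_assignment(xs, ys, sigmas, penalty=9.0):
    """Exhaustive search over all 2**N good/bad labelings (tiny N only)."""
    n = len(xs)
    best = None
    for flags in product((True, False), repeat=n):
        good = [i for i in range(n) if flags[i]]
        # A line needs at least two good points at distinct x values.
        if len({xs[i] for i in good}) < 2:
            continue
        m, b = fit_line([xs[i] for i in good],
                        [ys[i] for i in good],
                        [sigmas[i] for i in good])
        chi2 = sum(((ys[i] - m * xs[i] - b) / sigmas[i]) ** 2 for i in good)
        cost = chi2 + penalty * (n - len(good))  # hypothetical scoring rule
        if best is None or cost < best[0]:
            best = (cost, flags, m, b)
    return best
```

With 6 points this loop runs 64 times; with 60 points it would need about 10^18 fits, which is why sampling (or a mixture model) is needed instead.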
I spent most of the day reading carefully and commenting on Koposov's paper on modeling the cold GD-1 stream in a (flexible) Milky Way potential. The paper puts some strong constraints (fully marginalized, of course) on Galaxy and gravitational-potential parameters; for some parameters these are the only constraints at this length scale.
I also attended a nice short talk about transit timing by Monika Lendl (MPIA). It turns out that exoplanets that transit their parent stars produce timing sequences that are extremely sensitive to resonant perturbers. A Jupiter-mass transiting exoplanet can produce observable timing residuals when perturbed by even an Earth-mass (or smaller) perturber, if that perturber is interacting resonantly. This is potentially extremely sensitive, but it has one great problem: for a given timing-residual pattern, it may be easy to detect that there is a perturber, but it looks very difficult to determine the properties of that perturber uniquely.
I spent the day commenting on papers by various co-authors. Over coffee, Eric Bell got onto the subject of orthogonal charge transfer (OCT), which is one of the methods by which PanSTARRS is going to maintain good image quality. In OCT, the idea is to monitor bright stars on short timescales, and shift the charge on the CCD east, west, north, or south as the point-spread function shifts, to keep the light falling in the most compact PSF possible. This technology has been developed and tested well at Hawaii. I got interested, because it is a perfect problem (whether, when, and how to move the charge) to cast into the form of Bayesian decision theory, one of my current pet methodologies.
I revived the project with Schiminovich to measure the extragalactic intensity falling on the Milky Way from quasars as a function of redshift. As expected, as you approach Lyman-alpha, it comes from z<3 quasars, and from a broad redshift range within that.
I spent my high-quality time today reading very carefully the first science chapter of Ronin Wu's thesis. Her thesis will be on the star-formation and radiative properties of galaxies, and the first chapter is on the mid-infrared properties of extremely low-luminosity galaxies from the SDSS.
Vivi Tsalmantza and I attempted to re-start our project looking for double-redshifts in the SDSS spectroscopy. We discussed the issues of wavelength coverage—which, with empirical galaxy templates, is a function of the two redshifts—and interpolation, which ought to be done by cubic spline (with SDSS spectra, at least). I started writing up the issue of hypothesis testing in this situation of variable wavelength coverage.
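A small sketch of why the usable wavelength coverage depends on both redshifts: each template covers a fixed rest-frame range, which shifts by (1+z) into the observed frame and is then clipped by the spectrograph. All of the numbers and function names here are illustrative assumptions (the rest-frame template limits in particular are hypothetical).

```python
def covered_range(z, rest_min, rest_max, obs_min, obs_max):
    """Observed-frame wavelengths where a template at redshift z overlaps
    the spectrograph; returns None if there is no overlap."""
    lo = max(obs_min, rest_min * (1.0 + z))
    hi = min(obs_max, rest_max * (1.0 + z))
    return (lo, hi) if lo < hi else None

def joint_range(z1, z2, rest_min, rest_max, obs_min, obs_max):
    """Observed-frame range covered by templates at both redshifts."""
    a = covered_range(z1, rest_min, rest_max, obs_min, obs_max)
    b = covered_range(z2, rest_min, rest_max, obs_min, obs_max)
    if a is None or b is None:
        return None
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo < hi else None
```

Because the jointly covered range changes with the pair (z1, z2), different double-redshift hypotheses are effectively fit to different numbers of pixels, which is what complicates the hypothesis test.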
I spent the day reading and commenting on Dustin Lang's dissertation chapter on the kd-tree. Lang and Mierle have created for the Astrometry.net system the fastest kd-tree ever. The speed-up comes from tricks you can play with static data structures. That is, if you have a data set that changes rarely (or never) and you need to do fast lookup in the data set far more frequently than you need to edit the data set, you can build a static block in memory that contains the data, structured as a hierarchical binary tree, but without any pointer dereferencing or complicated (and therefore expensive and slow) structures. The thesis chapter on this is great, and it will make a great paper. The code is available under the GPLv2 license by request to Lang.
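The static-tree idea can be illustrated with an implicit layout, in which the tree structure lives entirely in index arithmetic on one flat array. This is a generic sketch under my own assumptions, not the layout Lang and Mierle actually use; a Python list still holds references internally, so it shows only the indexing scheme, not the memory-level speed-up.

```python
def build(points):
    """Arrange points so that the node for the slice [lo, hi) sits at its
    midpoint, with children [lo, mid) and [mid + 1, hi). No node objects."""
    pts = list(points)
    k = len(pts[0])

    def arrange(lo, hi, depth):
        if hi - lo <= 1:
            return
        axis = depth % k
        pts[lo:hi] = sorted(pts[lo:hi], key=lambda p: p[axis])
        mid = (lo + hi) // 2
        arrange(lo, mid, depth + 1)
        arrange(mid + 1, hi, depth + 1)

    arrange(0, len(pts), 0)
    return pts

def nearest(pts, query):
    """Return (point, squared distance) of the nearest neighbor of query."""
    k = len(query)
    best = [None, float("inf")]

    def visit(lo, hi, depth):
        if lo >= hi:
            return
        mid = (lo + hi) // 2
        axis = depth % k
        p = pts[mid]
        d2 = sum((a - b) ** 2 for a, b in zip(p, query))
        if d2 < best[1]:
            best[0], best[1] = p, d2
        diff = query[axis] - p[axis]
        near, far = ((lo, mid), (mid + 1, hi)) if diff < 0 else ((mid + 1, hi), (lo, mid))
        visit(near[0], near[1], depth + 1)
        # Search the far side only if the splitting plane is close enough.
        if diff * diff < best[1]:
            visit(far[0], far[1], depth + 1)

    visit(0, len(pts), 0)
    return best[0], best[1]
```

The price of this layout is that inserting or deleting a point means rebuilding the array, which is acceptable exactly in the regime described above: many lookups, few or no edits.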
On the plane I wrote in my document on straight-line fitting, and at MPIA I discussed it with Chien Peng (Victoria and MPIA). I hope to finish it up in the next week or two. I must finish it now if I am going to get onto the next topic and be ready for the IMPRS Summer School.
With Peng and Eric Bell (MPIA, Michigan) I discussed the possibility that the initial mass function of stars (the frequency distribution for stars of different masses at birth) might vary. It is a bit of an exaggeration to say that Bell believes that such a discovery (variation) would mean the end of galaxy astrophysics, while Peng and I believe that it is almost inevitable that it must vary. Bell noted that it would not be the end of astrophysics if we could measure and model the variations as a function of controlling parameters. That would be good, but it is possible that it depends on things nearly impossible to measure.
Bovy and I discussed two of his dynamical inference problems today. The first is with Reid's masers, where Bovy finds, when he marginalizes over even the most basic unknowns, that the masers are not highly informative about the potential (or rotation curve) of the Milky Way. That is not to say that we don't have much knowledge about the potential, but just that the masers aren't the source of it. We differ from Reid et al. in our conclusions because we aren't as confident about the input assumptions.
The second is the Solar System demo paper. Tremaine proposed some methods that treat the eccentricity distribution better and don't depend on the energy distribution. We are not sure that there is any way to write the problem down that is insensitive to the energy distribution that doesn't make other assumptions we are unhappy about. This all gets at the question of how generative your generative model needs to be. Does it really need to generate your observations? And if so, at what level? I think ideally at whatever level your observational uncertainties are simple (understood, uncorrelated, close to Gaussian, and so on).
Research time is getting scarce this week as I prepare for travel to Heidelberg for my annual summer spa there. I have spent what time I can reading and commenting on Dustin Lang's monumental PhD thesis (advisor: Sam Roweis). As my loyal reader knows, Lang is the author of the Astrometry.net automated calibration system, which is composed of clever geometric hashing, exceedingly fast kd-tree-based index lookup, Bayesian decision theory, and lots of web, script, and WCS candy. The thesis therefore contains both nice theory and beautiful engineering. It is a pleasure seeing it come together.
[According to Blogger (tm), this is my thousandth post to this research blog!]
I spent a few research-minutes today looking at the code of Aukosh Jagannath (NYU), who is working on the kinematics of cold streams in a toy galaxy, subject to perturbations by compact substructure, which leave behind coherent traces, in principle. He is making some fun movies; now can we actually detect substructure using streams?
On the plane home I finished working out the Bayesian framework for inferring the potential of the Milky Way and the velocity and mass distributions produced by the launch mechanism for hypervelocity stars. Now, on to the SDSS data, in which a few hypervelocity stars are known. I am not sure there are enough, and with only photometric parallaxes, the observations may not be highly constraining on the potential (which is what I care about). I am *sure* there is enough information already to interestingly constrain the velocity distribution produced by the launch mechanism.
I spent the day brushing up and giving my seminar on dynamical inference at OCIW. During it I was reminded what a good idea Tremaine's "two-star streams" is. We should do that. Any volunteers?
After a morning performing my Spitzer oversight duties, I worked on a simple, highly approximate, but maybe useful model of the fastest stars in the Galaxy, continuing where I left off on hyper-velocity stars.
Jay Anderson (STScI) gave the colloquium at OCIW today. He showed the incredible precision he can get in astrometry, proper motions, and photometry with HST when he models the image pixels directly, as we advocate. His color–magnitude diagrams for globular clusters are simply incredible, as is his astrometric precision (a few percent of a pixel) for point sources, despite the fact that the images are only barely sampled. The science is good and the engineering is sweet.
Phil Marshall and I spent the afternoon in (and near) Marshall's UCSB office, trying to extract all the stars we can from the enormous HST dataset he put together for our lens searching. During a break, I met Brendon Brewer (UCSB), who is a real Bayesian.