2008-10-22
rule of thumb
Phil Marshall emailed me, asking about the original citation / derivation of the rule of thumb that a source detected in an image at some signal-to-noise ratio [s/n] can be centroided to an accuracy of about the FWHM divided by [s/n]. It was funny he asked, because we had discussed the very same issue only days earlier in responding to the referee for the faint-motion paper. Rix found King (1983), which is probably the first paper to discuss this (I would be interested if anyone out there knows an earlier reference). Nowadays, the standard answer is the Cramér-Rao bound (Robert Lupton said this in response to a query from me), but that isn't quite the answer most people are looking for.
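As a sanity check, here is a toy Monte Carlo of the rule (my own illustrative setup, nothing from King 1983 or anyone else): centroid a one-dimensional Gaussian source in constant per-pixel Gaussian noise and compare the scatter of the fitted centers to FWHM/(S/N). With a matched-filter definition of S/N the scatter comes out near sqrt(2) sigma_PSF/(S/N), about 0.6 FWHM/(S/N), so the rule holds up to a factor of order unity that depends on how S/N is defined and on the noise regime.

```python
import numpy as np

# Toy Monte Carlo check of the centroiding rule of thumb
# sigma_x ~ FWHM / (S/N), assuming a 1-d Gaussian PSF and constant
# (background-limited) per-pixel Gaussian noise.  All numbers are
# illustrative choices.

rng = np.random.default_rng(42)

sigma_psf = 2.0                      # PSF width in pixels
fwhm = 2.355 * sigma_psf
flux = 1000.0                        # total source counts
noise = 10.0                         # per-pixel noise (counts)
x = np.arange(-20, 21, dtype=float)  # pixel grid

def psf(x, x0):
    return np.exp(-0.5 * ((x - x0) / sigma_psf) ** 2) / (np.sqrt(2.0 * np.pi) * sigma_psf)

template = flux * psf(x, 0.0)
snr = np.sqrt(np.sum(template ** 2)) / noise   # matched-filter S/N

def centroid(data):
    # brute-force maximum-likelihood centroid on a fine grid of trial centers
    trials = np.linspace(-2.0, 2.0, 801)
    chi2 = [np.sum((data - flux * psf(x, t)) ** 2) for t in trials]
    return trials[np.argmin(chi2)]

shifts = [centroid(flux * psf(x, 0.0) + noise * rng.standard_normal(x.size))
          for _ in range(500)]

print("measured centroid scatter:", np.std(shifts))
print("FWHM / (S/N)             :", fwhm / snr)
```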
2008-10-21
USNO-B and GALEX, supervised
I got stranded in Nantucket by high winds (cancelled ferries). This cost me Monday, and I spent parts of today making up for it. My research time was spent with Schiminovich, talking about what we should do with the SDSS and GALEX, and what we will do in the very short term. The very short term project is to use SDSS and GALEX to learn
what quasars look like and then find them all-sky with USNO-B1.0 and GALEX. Same with white dwarfs. This is a nice project in supervised methods for automated classification, something I was railing against in Ringberg.
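A minimal sketch of the supervised step I have in mind, purely for illustration: the file names, color choices, and classifier below are placeholders, not a real pipeline. The idea is just to train on sources with SDSS spectroscopic labels and then push the classifier over the all-sky GALEX plus USNO-B1.0 photometry.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical supervised-classification sketch: learn what quasars (and
# white dwarfs) look like from SDSS-labeled sources, then classify the
# all-sky USNO-B1.0 + GALEX matched catalog.  The .npy files and color
# choices are invented placeholders.

X_train = np.load("labeled_colors.npy")    # e.g. (FUV-NUV, NUV-B, B-R) per labeled source
y_train = np.load("labeled_classes.npy")   # 0 = star, 1 = quasar, 2 = white dwarf

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

X_allsky = np.load("allsky_colors.npy")    # same colors for the all-sky matches
probs = clf.predict_proba(X_allsky)        # per-class membership probabilities
quasar_candidates = np.flatnonzero(probs[:, 1] > 0.9)
```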
2008-10-18
AAVSO
Spent the afternoon at the AAVSO annual meeting in Nantucket (yes, my travel schedule is not sane). My word, are the AAVSO observers impressive! Every talk showed ridiculous light curves with incredible sampling and huge signal-to-noise, and many of the photometric measurements come from people working visually (with their eyes, no detectors). The data are consistent from observer to observer and highly scientifically productive. Of course, many of the AAVSO members use CCDs too, and these tend to be among the best-calibrated and best-understood hobbyist setups. Naturally, that is why I am here.
2008-10-17
minimum message length
On the plane home from Germany, I worked on various writing projects, including the transparency paper and my Class2008 proceedings. I tried to write down what minimum message length could say about the Milky-Way-reconstruction problem from astrometric measurements of stellar motions and parallaxes. I have a strong intuition that there is a correct—or at least very useful—approach that could be inspired by or directly derived in the context of the idea that the most probable (posterior-probable) model is the one that provides the best (lossless) compression of the data given the coding scheme suggested by your priors. If I could write it down, it might help with the upcoming GAIA data.
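In equations, the statement is just the standard two-part-code bookkeeping (nothing specific to the Galaxy problem): encode the model under a code built from the prior, then encode the data under a code built from the likelihood,

\[
  L(M, D) \;=\; -\log_2 p(M) \;-\; \log_2 p(D \mid M)
          \;=\; -\log_2 p(M \mid D) \;-\; \log_2 p(D) ,
\]

so minimizing the total message length over models M is exactly maximizing the posterior p(M | D), because the -log_2 p(D) term does not depend on the model. The hard (and interesting) part for the astrometric case is choosing the coding scheme, that is, the priors, for dynamical models of the Galaxy.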
2008-10-16
class2008, day three
On the third day of Classification and Discovery, I chaired a session on the time domain; I was blown away by the data from the CoRoT experiment. But I was even more fired up by Anthony Brown's description of the problem of inferring Galactic structure from GAIA data. This problem has so many awesome aspects, ranging from a good argument for generating the data with the model (think Lutz-Kelker problems with parallaxes) to a huge issue with priors (because the mission measures positions and velocities but not accelerations, and accelerations are what the Galaxy produces). I will say more about the latter when I get it sorted out in my head. GAIA really will provide the best inference problem ever encountered in astrophysics.
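To make the Lutz-Kelker point concrete (a textbook single-star example, not anything from Brown's talk): for a measured parallax \varpi with Gaussian uncertainty \sigma_\varpi, the posterior over the true distance d is

\[
  p(d \mid \varpi) \;\propto\; \exp\!\left[-\frac{(\varpi - 1/d)^2}{2\,\sigma_\varpi^2}\right]\, p(d),
  \qquad p(d) \propto d^2 \ \ \text{for a uniform space density},
\]

and the naive estimate d = 1/\varpi ignores the volume (prior) factor: because there are more stars with small true parallax available to scatter up than stars with large true parallax to scatter down, measured parallaxes are biased high and naive distances biased low. Generating the data with the model, that is, forward-modelling the parallax likelihood against a density model of the Galaxy, handles this automatically instead of requiring after-the-fact corrections.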
2008-10-15
class2008, day two
This morning concentrated on understanding galaxies in large surveys. Among a set of interesting talks about galaxy classification, Boris Haeussler gave a nice talk in which he put the standard 2-d galaxy-fitting codes through their paces and found some very interesting things, including underestimated errors, even when he puts in fake data for which the fitting codes are not making approximations! Vivienne Wild spoke about a robust PCA and its use in understanding rare populations such as post-starbursts and their role in galaxy continuity. Two of my favorite topics in one talk! The PCA adjustment is very smart although somewhat ad hoc (not described in terms of probabilistic inference). The post-starburst work is even better; it confirms our results suggesting that post-starbursts are key in the evolution of stellar mass from the blue sequence to the red sequence. There were many other good contributions, too numerous to mention, with a lot of people working on optimal extraction of information from spectra; very encouraging for the future of spectroscopy.
2008-10-14
class2008, day one
The afternoon of the first day of Classification and Discovery concentrated on classification methods, almost all supervised (learn from a training set, run on the larger data). I am largely against these methods, in part because very few of them make good use of the individual noise estimates, and in part because your training data are never the same, in important respects, as your real data. However, a nice discussion ensued, led in large part by Alexander Gray (Georgia Tech); in it I argued for generative models for classification, but of course these are only possible when you have a good model of both the Universe and your hardware!
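A cartoon of what I mean by a generative classifier that actually uses each object's noise estimate (the class models, numbers, and priors are invented for illustration): model each class as a Gaussian in the true color and convolve with the per-object measurement error before comparing.

```python
import numpy as np

# Toy generative classification in one "color" x: each class k is a
# Gaussian N(mu_k, tau_k^2) in the true color, and an object measured as
# x_i with reported error sigma_i has class likelihood
# N(x_i | mu_k, tau_k^2 + sigma_i^2).  All parameters below are made up.

classes = {
    "star":   {"mu": 0.8, "tau": 0.3, "prior": 0.95},
    "quasar": {"mu": 0.1, "tau": 0.4, "prior": 0.05},
}

def class_posteriors(x, sigma):
    """Posterior class probabilities for one measured color x with error sigma."""
    post = {}
    for name, c in classes.items():
        var = c["tau"] ** 2 + sigma ** 2          # intrinsic scatter + measurement noise
        like = np.exp(-0.5 * (x - c["mu"]) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
        post[name] = c["prior"] * like
    norm = sum(post.values())
    return {name: p / norm for name, p in post.items()}

# the same measured color is less decisive when its error bar is larger
print(class_posteriors(x=0.2, sigma=0.05))
print(class_posteriors(x=0.2, sigma=0.50))
```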
2008-10-10
more writing
Spent my research time today cleaning up my class2008 proceedings, which is now a full-on polemic about massive data analysis. In the process, I learned something about minimum message length in Bayesian model selection; we have been using this, but I didn't know how rich the subject is (though I don't like the persistent claim that it encodes Occam's razor, another good subject for a polemic). On the airplane to Germany I will have to convert all this into a talk.
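For reference, the sense in which people say the framework encodes Occam's razor is the standard evidence decomposition (I am quoting the textbook argument, not endorsing the slogan):

\[
  p(D \mid M) \;=\; \int p(D \mid \theta, M)\, p(\theta \mid M)\, d\theta
  \;\approx\; p(D \mid \hat\theta, M)\,\frac{\Delta\theta_{\rm posterior}}{\Delta\theta_{\rm prior}} ,
\]

where \hat\theta is the best-fit parameter value and the second factor, the so-called Occam factor, penalizes models whose prior parameter volume is much larger than the volume the data actually allow. In message-length terms it is roughly the extra bits needed to state the parameters to the precision the data warrant.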
2008-10-09
wrote like the wind
In a miraculous couple of hours, I cranked out the remainder of our class2008 proceedings (the necessity of automating calibration, and methodologies for automated discovery in the context of a comprehensive generative model) to make a zeroth draft. In writing this, I realized that we have actually demonstrated most of the key concepts in this automated-discovery area in our faint-source proper-motion paper.
Lang has promised me not just criticism, but a direct re-write of parts, within 24 hours.
2008-10-07
writing
In the small amount of research time I got today, I wrote my Class2008 proceedings as rapidly as possible.
2008-10-06
catalogs as image models
I worked more on my position
on catalogs, with some help from Lang. Here are some key ideas:
- Catalogs originated as a way for astronomers to communicate information about images. For example, Abell spent thousands of hours poring over images of the sky; his catalog communicated the information he found in those images, so that other workers would not have to repeat the effort. This was at a time when you couldn't just send them the data and the code.
- Why did the SDSS produce a catalog rather than just releasing the images? Because people want to search for sources and measure the fluxes of those sources, and they do this in standard ways; the SDSS made it easier for them by pre-computing all these fluxes and making them searchable. But the SDSS could have produced a piece of fast code and made it easy to run that code on the data instead; that would have been no worse (though harder to implement at the present day).
- One of the reasons people use the SDSS catalogs is not just that they are easy to use, but that they contain all of the Collaboration's knowledge about the data, encoded as proper data analysis procedures. But here it would have been more useful to produce code that knows about these things than a dataset that knows about these things, because the code would be readable (self-documenting), re-usable, and modifiable. Code passes on knowledge, whereas a catalog freezes it.
- The catalogs are ultimately frequentist, in that hard decisions (about, say, deblending) are made based on arithmetic operations on the data, and then the downstream data analysis proceeds according to those decisions, even when the real situation is that there is uncertainty. If, instead of a fixed catalog, there were a piece of code that takes any catalog and returns the likelihood of that catalog given the imaging (see the sketch after this list), we could analyze those decisions probabilistically and do real inference.
And other Important Things like that.
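Here is a minimal sketch of the kind of code I mean, assuming point sources, a known circular Gaussian PSF, and independent Gaussian pixel noise; a real version would use the survey's actual PSF, astrometric calibration, and noise model.

```python
import numpy as np

# Sketch of "a piece of code that takes any catalog and returns the
# likelihood of that catalog given the imaging".  Assumes point sources,
# a circular Gaussian PSF of known width, and independent Gaussian pixel
# noise described by an inverse-variance map.

def render(catalog, shape, psf_sigma):
    """Render a model image from a catalog of (x, y, flux) entries."""
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    model = np.zeros(shape)
    for x0, y0, flux in catalog:
        r2 = (xx - x0) ** 2 + (yy - y0) ** 2
        model += flux * np.exp(-0.5 * r2 / psf_sigma ** 2) / (2.0 * np.pi * psf_sigma ** 2)
    return model

def log_likelihood(catalog, image, invvar, psf_sigma=1.5):
    """ln p(image | catalog), up to a constant, for Gaussian pixel noise."""
    model = render(catalog, image.shape, psf_sigma)
    return -0.5 * np.sum(invvar * (image - model) ** 2)

# With this in hand, two alternative deblends of the same patch of sky can
# be compared by their likelihoods (or posteriors, given priors on
# catalogs) rather than frozen into a single hard decision.
```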
2008-10-05
catalogs polemic
I started writing my contribution to Classification and Discovery in Large Astronomical Surveys; I am writing about a generative model of every astronomical image ever taken. But right now the part I am most interested in is the part about catalogs being, explicitly, Bayesian models of the imaging on which they are based. If the community adopted this point of view, it would bring a number of advantages in the documentation, usability, communication, interoperability, construction, and analysis of astronomical catalogs. I am trying to make this argument very clear for the proceedings.
2008-10-03
lucky supernova, classification of algorithms
Alicia Soderberg (CfA) gave the astro seminar today, on a supernova she discovered via a soft X-ray flash apparently right at shock break-out, in other words at the beginning of the explosion, long before the optical light curve reached maximum. This permitted the study of the supernova from beginning to end. Unfortunately, her discovery involved an incredible amount of luck, and we will have to wait for the next generation of X-ray experiments to discover these routinely. In answer to an off-topic question from me, she said that to her knowledge there is no precursor activity that precedes break-out. I asked because this would be an interesting effect to look for in historical data sets.
In the evening I finished writing up my short document that describes my classification of standard data-analysis algorithms.
2008-10-02
insane theories, super-k-means
I can't say I did much research today, but while I failed to do research, Bovy (who is also attending a scientific meeting) looked at contemporary models that violate transparency in order to reconcile the type Ia supernova results with an Einstein-de Sitter Universe. These models are somewhat crazy, because they end up building epicycles to fix a problem that isn't really a problem, but in principle we will rule them all out with BOSS.
In my sliver of research time (and with Roweis's help), I figured out that PCA, k-means, mixture-of-Gaussians EM, the analysis we did in our insane local-standard-of-rest paper, and taking a weighted mean
are all different limits of one uber-problem that consists of fitting a distribution function to data with (possibly) finite individual-data-point error distributions. I am trying to write something up about this.
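In symbols (my notation, and glossing over details we still have to pin down), the common objective is something like

\[
  \ln \mathcal{L}(\theta) \;=\; \sum_i \ln \int p(x_i \mid x)\, f(x \mid \theta)\, dx ,
\]

where f(x | \theta) is the distribution function being fit and p(x_i | x) is the error distribution of data point i around the true value x. Taking p(x_i | x) to a delta function recovers the standard noise-free problems; taking f to be a single delta function with Gaussian errors gives the inverse-variance weighted mean; f a mixture of Gaussians with per-point errors gives the kind of analysis we did in the local-standard-of-rest paper; and k-means and PCA appear as zero-scatter limits of the mixture and linear-subspace (probabilistic PCA) cases, respectively.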
2008-10-01
Tolman and Etherington
In working on the More paper, I found myself looking through cosmography literature from 1929 through 1933. There is a series of papers by Tolman, in which he works out the Tolman test
for the expansion of the Universe, which I think of as being a test of transparency and Lorentz invariance. Tolman worked out the test in the context of one world model (de Sitter's); his interest was in understanding the possible physics underlying the steady-state model; Etherington generalized it to a wider range of world models in 1933. After Etherington's generalization, the community should have realized that the test doesn't really test expansion per se, but it does test relativity and electromagnetism in that context.
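For reference, the quantitative content of the test: in any expanding, transparent, metric (FRW-like) cosmology the bolometric surface brightness of a standard source of luminosity L and physical size R falls off as

\[
  \Sigma_{\rm bol} \;=\; \frac{L}{4\pi d_L^2}\Big/\left(\frac{R}{d_A}\right)^2
  \;=\; \frac{L}{4\pi R^2}\,(1+z)^{-4} ,
\]

using Etherington's reciprocity relation d_L = (1+z)^2 d_A. Any departure from the (1+z)^{-4} scaling, whether from absorption or scattering along the line of sight, photon non-conservation, or non-standard redshift physics, is what the observation is actually sensitive to; that is the sense in which it tests transparency and the underlying relativity and electromagnetism rather than expansion per se.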