I wrote scripts today to scrape images off the web and submit them to our nascent beta Astrometry.net system. We also set our short-term goals in a phone-con with Lang, Chen, and Lalimarmo.
Foreman-Mackey and I decided at lunch that the density of his calibration grid for SDSS Stripe 82 and the radius out to which he selects stars around each grid point are both complexity parameters in our calibration model. The first sets the number of free parameters, and the second sets the smoothness. We worked out a semi-practical way to search this space, applying cross-validation to choose the complexity parameters. It is an enormous amount of data; F-M has re-photometered every point source in every run, engineering-grade or science-grade, whether the source was detected or not. That's a good fraction of a billion measurements.
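In pseudocode, the search is just a two-parameter cross-validation loop over grid spacing and selection radius; here is a minimal sketch (the `fit` and `score` callables are hypothetical stand-ins, not F-M's actual machinery):

```python
import numpy as np

def cross_validate(fit, score, data, grid_spacings, radii, n_folds=5):
    """Pick the (grid spacing, selection radius) pair whose model,
    fit on training folds, scores best on the held-out folds."""
    rng = np.random.default_rng(42)
    folds = rng.integers(0, n_folds, size=len(data))
    best, best_score = None, -np.inf
    for spacing in grid_spacings:
        for radius in radii:
            total = 0.0
            for k in range(n_folds):
                train = [d for d, f in zip(data, folds) if f != k]
                held_out = [d for d, f in zip(data, folds) if f == k]
                model = fit(train, spacing, radius)
                total += score(model, held_out)
            if total > best_score:
                best, best_score = (spacing, radius), total
    return best
```

The expensive part, of course, is that `fit` means re-running the calibration at every complexity setting.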
In Jagannath and my project to fit dynamical models to streams in phase space, we have a simple problem, which is to take a diagonal covariance tensor for the noise in the observables (distances, angles, proper motions) and transform it, by the best first-order approximation, to a non-diagonal covariance tensor for the noise in the phase-space coordinates. This transformation is a bad idea because distance uncertainties plus the non-linear transformation take Gaussian noise in the observables to non-Gaussian noise in the phase-space coordinates. However, it is a very good idea because if we do this transformation and live with the inaccuracy it brings (it brings inaccuracy because we are treating the noise in phase space as Gaussian), our code becomes very fast! We are checking our math now (for the Nth time) and coding it up.
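For the record, the transformation itself is just the first-order Jacobian sandwich; a minimal sketch (the Jacobian and variances below are made up for illustration):

```python
import numpy as np

def propagate_covariance(jacobian, diag_variances):
    """First-order propagation of a diagonal covariance in the
    observables into a (generally non-diagonal) covariance in the
    phase-space coordinates: C_phase = J C_obs J^T."""
    C_obs = np.diag(diag_variances)
    return jacobian @ C_obs @ jacobian.T

# toy example: two observables mapping to two phase-space coordinates
J = np.array([[1.0, 0.5],
              [0.2, 2.0]])          # hypothetical d(phase)/d(observable)
var_obs = np.array([0.01, 0.04])    # diagonal noise variances
C_phase = propagate_covariance(J, var_obs)
```

The result is symmetric but non-diagonal, which is the whole point; the inaccuracy comes from pretending the transformed noise is still Gaussian.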
I spent most of the day at scicoder, Demetri Muna's workshop for astronomers who code. I spoke about the practice of building academic software systems—pair coding, functional testing, and using packages vs writing your own—and then went to lunch with a small group. On the "writing your own" point, I said that it is a good idea both to write your own and to use pre-built packages, because you learn so much by writing your own, and you get so much performance out of (well-built) industrial-strength code (though your own will do if performance isn't a problem). Partly my remarks are motivated by the point that academic programming is about learning, not just shipping. In the afternoon, Muna taught us about using R for browsing data and making plots. Tsalmantza and I wrote all of our heteroscedastic matrix-factorization stuff in R, so I was already a believer, though Python is my one true love.
Foreman-Mackey showed me a beautiful periodogram (well, actually a generalization of the periodogram to more general periodic functions) for one of the SDSS Stripe 82 RR Lyrae stars. We are so close to being able to start a proper search for new stars. Lang and I worked on tasking Lalimarmo and Chen (our GSOC team) on critical tasks for taking Astrometry.net to beta on the web. Part of the issue is keeping the development cycle non-onerous but data-preserving.
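The generalization in question can be sketched as a weighted least-squares fit of a truncated Fourier series at each trial period, scoring each period by its drop in chi-squared (the details here are illustrative, not F-M's actual implementation):

```python
import numpy as np

def ls_periodogram(t, y, yerr, periods, n_harmonics=2):
    """At each trial period, fit a truncated Fourier series by
    weighted linear least squares; the 'power' is the improvement
    in chi-squared over the best constant model."""
    w = 1.0 / yerr ** 2
    mean = np.sum(w * y) / np.sum(w)
    chi2_const = np.sum(w * (y - mean) ** 2)
    power = np.empty(len(periods))
    for i, period in enumerate(periods):
        phase = 2.0 * np.pi * t / period
        cols = [np.ones_like(t)]
        for n in range(1, n_harmonics + 1):
            cols += [np.cos(n * phase), np.sin(n * phase)]
        A = np.vstack(cols).T
        Aw = A * np.sqrt(w)[:, None]       # weighted design matrix
        yw = y * np.sqrt(w)
        coeffs, *_ = np.linalg.lstsq(Aw, yw, rcond=None)
        power[i] = chi2_const - np.sum((yw - Aw @ coeffs) ** 2)
    return power
```

With `n_harmonics=1` this reduces to something close to the classical floating-mean periodogram; RR Lyrae light curves want more harmonics.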
Lang and I, at the request of Zeljko Ivezic (Washington), worked on carving out from the Tractor just the minimal code required to grab a single SDSS field, grab the relevant catalog entries and calibration data, and synthesize the image. This permits comparison of SDSS images with the implicit catalog model.
I spent a good chunk of the day figuring out how detection of isolated sources in astronomical imaging relates to the Bayesian evidence, and then decision theory. I am nearly there. The nice thing is that if you see it as a subject in decision theory, you can infer things about investigators' utility by looking at the likelihood cut at which they cut off their catalogs. In the short term, it is teaching me something about measure theory in practice.
Lang and I are big on pair coding, where we are either co-spatial or else have an audio channel open (usually Skype) and work together in an editor (usually emacs, and using unix screen we can both be there when we are not co-spatial). Today we had the audio channel open, but Lang worked on getting our mixture-of-Gaussians PSF code working on the output of the SDSS K-L PSF model while I filed a final report for one of my grants. Not exactly pair coding, but it is more fun than grinding out code and reports alone!
The most surprising thing I learned today was from Benitez about the JPAS survey, a Spain–Brazil collaboration to do 56-band imaging to get large-scale structure. It is an ambitious project, and designed for maximum efficiency. It is also funded; I will be interested to see it proceed.
Talks by Richards, Bernstein, and Willman all set me up nicely; they all said that source classification and characterization at low signal-to-noise is extremely important scientifically, very difficult, and essentially probabilistic. They all showed incredible things we could do with respect to quasar science, weak lensing, and Local Group discovery if we can classify things properly at the faint end. After this crew, Gray, Lupton, and I spoke about computational methods, with Gray concentrating on classes of problems and the algorithms to make them fast and effective, me concentrating on producing probabilistic outputs (that is, no more hard, final, static catalogs), and Lupton talking about how it worked in SDSS and how it could work better in LSST and HSC. Lupton's talk closed the meeting, and it was a pleasure!
One constant note throughout the meeting, and especially today, was that a lot of science and discovery was enabled by the exquisite photometric calibration of the SDSS. I am proud of my (admittedly very small) contributions to that effort and the enormous amount they have paid off in so many areas.
There was a lot of transient discussion today, with talks about PTF and LSST. In Quimby's PTF talk, he showed Nick Konidaris's SED Machine, which looks like a prototype for fast transient response. On Monday Strauss noted that LSST will issue 100,000 alerts per night, so a lot of thinking has to go on about how to deal with that. The transient arguments continued into the semi-organized post-talk discussion over beer. Walkowicz's talk had some nice meta-discussion about the different choices you make for rare events vs common events, and, within that, for known types of events vs unknown types.
For me the highlight talk today was by Dave Monet, talking about the future of astrometry. He very rightly pointed out that if Gaia and LSST work as planned, it is not just that they will dominate all of astrometry, but more importantly that they will do just about as well as it is possible to do, ever. That is, you could only beat them with insanely expensive things like new kinds of launch vehicles. So the key point for an astrometrist is how to be doing things that make sense in this context. I agree! Also, in response to a comment by me, he endorsed my (strongly held) position that the point of astrometry is not to produce long lists of numbers (positions, parallaxes, and proper motions); the point of astrometry is to answer high-level science questions, and with Gaia-scale data no-one really knows how to do that at the limit of what's in-principle possible. One of the most interesting things Monet said is that there is no real point in working on the global astrometric solution right now; there are no critical science questions that depend on it, and Gaia will break you.
In a day in which Lang and I spent an inordinate amount of time formatting our Comet 17P/Holmes paper for error-free submission to the AJ (this was hilarious but not exactly enlightening scientifically), one bright spot was having lunch with Rob Fergus, his student Li Wan, and my student Dan Foreman-Mackey to discuss the first steps towards our probabilistic theory of everything. The short-term goal is to create a quantitative model of every pixel of digital or digitized astronomical imaging ever taken from any facility anywhere at any time in any bandpass. Once that's done, we will think about bigger projects. But seriously, we came up with some (still vague) first steps, one of which is for the computer scientists to read some astronomy papers, and for the astronomers to read some computer-science papers.
Today was the first day of Very Wide Field Surveys in the Light of Astro2010 in Baltimore. The talks were overwhelmingly extragalactic in emphasis, and there were many about not-yet-taken data. Highlights for me included: Jim Gunn (Princeton) making some of Blanton and my work on galaxies in SDSS his prime motivation for his new projects; Martha Haynes (Cornell) showing examples of dark galaxies from ALFALFA that contain substantial HI gas but no stars to the limits of very deep optical imaging; Matt Jarvis (Hertfordshire) explaining that LOFAR and similar surveys record the amplitudes and phases from many antennae, but then delete them after map-making, so certain kinds of reanalyses will be impossible, no matter what; Chris Martin (Caltech) showing one of the eclipsing binaries found by Schiminovich, Lang, and me in the GALEX data; Ned Wright (UCLA) showing an incredibly low-temperature brown dwarf (maybe the lowest ever?) from WISE; Steve Warren (Imperial) showing an embargoed (very) high-redshift quasar from UKIDSS with a beautiful spectrum and praising Daniel Mortlock for his excellent target-selection skills; and a great pair of rapid-fire poster sessions in which each poster at the meeting got one viewgraph and exactly sixty seconds of summary time. Many luminaries are present at the meeting, and many old friends too (and some in both categories simultaneously).
This past Spring I revived an old project to build some kind of professionally and statistically useful atlas of galaxies from the SDSS data. I spent some time today in the wilderness planning the stages of the project. When I look at the non-triviality of doing a good job on it, I think that even the infrastructure work required to generate the Atlas will produce several publications. First order of business: Get The Tractor running on very (angularly) large galaxies in the SDSS to measure their properties consistently.
Schiminovich and I met briefly today with his undergraduate researcher Adam Greenberg (Columbia), who is going to look at doing a better job of fitting the eclipses we have discovered in the GALEX time-domain data. His first order of business is to implement a solid-disk-on-solid-disk eclipse model; then look at better models. We discussed how to parameterize and initialize the model so that simple local optimization should work.
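The solid-disk model is just the lens-shaped overlap area of two circles, normalized by the stellar disk; a sketch (my normalization and names, not Greenberg's code):

```python
import numpy as np

def disk_overlap_area(d, r1, r2):
    """Area of intersection of two disks of radii r1 and r2 whose
    centers are separated by d (the standard two-circle formula)."""
    if d >= r1 + r2:                    # no overlap
        return 0.0
    if d <= abs(r1 - r2):               # smaller disk fully inside
        return np.pi * min(r1, r2) ** 2
    a1 = r1 ** 2 * np.arccos((d ** 2 + r1 ** 2 - r2 ** 2) / (2.0 * d * r1))
    a2 = r2 ** 2 * np.arccos((d ** 2 + r2 ** 2 - r1 ** 2) / (2.0 * d * r2))
    a3 = 0.5 * np.sqrt((-d + r1 + r2) * (d + r1 - r2)
                       * (d - r1 + r2) * (d + r1 + r2))
    return a1 + a2 - a3

def eclipse_flux(d, r_star, r_occulter):
    """Relative flux of a uniform-disk star behind an opaque disk."""
    return 1.0 - disk_overlap_area(d, r_star, r_occulter) / (np.pi * r_star ** 2)
```

Parameterize the separation d as a function of time and orbital phase, and simple local optimization from a sensible initialization should indeed work.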
Throughout the day, Phil Marshall, Douglas Applegate (Stanford), and I had a long and detailed discussion about Bayesian evidence (or Bayes factors) as compared to cross-validation in model selection. This is an issue that I have been thinking about a lot and these two helped me sharpen up and modify my thinking substantially. Despite being a Bayesian in practice, I don't believe that most uses of the Bayesian evidence in the literature are correct or justified, mainly because the integral depends so much on aspects of the prior which are (in practice) chosen by the investigator not via an introspective or probabilistic analysis of her or his true prior beliefs but rather more-or-less at random. That is, the prior doesn't really represent prior belief. In this, I am closer to Andrew Gelman (Columbia), who sees the prior as a pragmatic (and testable) regularization of the problem; indeed Gelman and I discussed this point in a separate thread this week.
Foreman-Mackey and I spent some time discussing his robust model of photometric calibration for multi-epoch imaging surveys. The basic idea is that every star is drawn from a distribution that is a mixture of variable and non-variable stars, and that every observation of every star is drawn from a distribution that is a mixture of good (inlier) and bad (outlier) measurements. These mixtures permit the model to be most constrained by the most valuable data, but at the expense of many nuisance parameters. It works remarkably well on real data from SDSS Stripe 82, and has great general applicability in the world of time domain astrophysics. Our current goal is to understand how to model variations in calibration parameters in time and angle.
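In cartoon form, the per-star likelihood looks something like this (all the mixture weights, variances, and names below are illustrative placeholders, not F-M's parameterization):

```python
import numpy as np

def lnlike_star(fluxes, ferrs, mean_flux,
                p_bad=0.01, bad_var=1e4, p_var=0.1, var_frac=0.04):
    """Two-level mixture: each measurement is an inlier or an outlier
    (a much broader Gaussian), and the star as a whole is non-variable
    or variable (extra intrinsic variance var_frac * mean_flux^2)."""
    def ln_gauss(x, mu, var):
        return -0.5 * ((x - mu) ** 2 / var + np.log(2.0 * np.pi * var))

    def ln_data(extra_var):
        var_in = ferrs ** 2 + extra_var
        ln_in = np.log(1.0 - p_bad) + ln_gauss(fluxes, mean_flux, var_in)
        ln_out = np.log(p_bad) + ln_gauss(fluxes, mean_flux, var_in + bad_var)
        return np.sum(np.logaddexp(ln_in, ln_out))

    ln_nonvar = np.log(1.0 - p_var) + ln_data(0.0)
    ln_var = np.log(p_var) + ln_data(var_frac * mean_flux ** 2)
    return np.logaddexp(ln_nonvar, ln_var)
```

The mixture indicators never get hard-assigned; marginalizing them out is what lets the valuable data dominate while wild measurements get gracefully ignored.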
Today we got OpenID working on Astrometry.net's nascent beta-test web site. This means that users can log in using one of their existing logins from some existing site they trust; we don't have to maintain our own user records or authentication system. I love web 2.0!
For days 4 and 5, the team (Chen and Lalimarmo) will be in Princeton, working with Dustin Lang. Good luck, team!
Today was all about database schema for the beta-test (rather than alpha-test) version of Astrometry.net. The idea is to get web-submitted images to insert into a sensible data management system, and then from there to the calibration system and back to the user, with all parts (submissions, images, results, and so on) stored in normal form. By the end of the day, there was a functioning new web system that does a good chunk of what our existing alpha-test site does but with much better underlying technology.
Astrometry.net's Google Summer of Code interns, Kevin Chen and Carlos Lalimarmo, showed up today to start a one-week code sprint and team meeting, in New York and Princeton. Their first job is to make our web presence look and act a little less 1994, if you know what I'm sayin'. By the end of the day there was a minimal API, but not yet any functional web pages. Because the GSOC is such a competitive program, we have two great coders in Chen and Lalimarmo; their excellence was evident even on day one.
On the plane home from La Palma, I read this paper by Gelman and Shalizi on the philosophy behind statistics. They, despite being Bayesians, argue that since we don't literally believe as True (with a capital T) any of our models (and I argued the same here), the literal probability interpretation of Bayesian reasoning is flawed. They argue that all the important moments of statistics occur at the points of model checking, model investigation, and the choice of which models to compute in the first place. Gelman and Shalizi's paper was a great post-meeting read; now enough philosophy and back to work!
On the last morning of the summer school, Lupton (Princeton) led (and I assisted in) a lab session on image processing, where the task was to patch missing data in an image using an interpolator based on a Gaussian process. During it, I realized (as Lupton had earlier) that an excellent cosmic-ray detector could be based on modeling an image with a Gaussian process, where the covariance of the process was set locally with the correct covariance expected given the point-spread function (and minimal assumptions about the properties of stars and galaxies in the images). I also realized that Lang and my Tractor project (current version number: vaporware) to model astronomical images could also be used as an interpolator for astronomical images (and a cosmic-ray detector; hell, it does everything, of course). All that said, interpolation is never necessary; missing data might be ugly, but it is only a problem if you are using algorithms you shouldn't be.
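The patching scheme from the lab session, in cartoon form: condition a Gaussian process on the good pixels and read off the conditional mean at the bad ones. This one-dimensional sketch uses a squared-exponential covariance rather than the PSF-derived covariance from the session:

```python
import numpy as np

def gp_interpolate(x_obs, y_obs, x_miss,
                   length_scale=1.0, signal_var=1.0, noise_var=0.01):
    """Gaussian-process conditional mean at missing locations:
    y_miss = K(x_miss, x_obs) [K(x_obs, x_obs) + noise I]^{-1} y_obs."""
    def kernel(a, b):
        d2 = (a[:, None] - b[None, :]) ** 2
        return signal_var * np.exp(-0.5 * d2 / length_scale ** 2)

    K = kernel(x_obs, x_obs) + noise_var * np.eye(len(x_obs))
    return kernel(x_miss, x_obs) @ np.linalg.solve(K, y_obs)
```

For images you would use two-dimensional pixel positions and set the covariance from the point-spread function, exactly as above.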
In other parts of the day, Tsalmantza and I worked on finishing our empirical spectral modeling method paper, with a goal of finishing next week.
In the morning, Lupton (Princeton) spoke about imaging data, with a lot of time spent on the (enormous, enormous) value of being band-limited, and how you measure sources with maximum-likelihood techniques. He gave big shout-outs to Lang and my preliminary (and unpublished) results from The Tractor: our project to build a model of all the astronomical imaging in the SDSS, improving on and giving a more transparent and modifiable probabilistic basis to the output of the SDSS software. One of the remarkable aspects of this project, which Lupton noted, is that a very simple model of galaxies and stars does a damned good job of explaining the vast majority of SDSS image pixels, so modeling is not only the best thing you can do, it is also close to saturating the information content in the data.
One thing I realized during his session is that imaging survey data analysis could be much more accurate and precise if we built (hierarchically) priors (maybe even physically motivated ones) on the point-spread function. All the regularizations in use are very heuristic and issues are clearly visible. Another realization I had is that, for faint sources found ab initio in some data, the optimization of the likelihood with respect to position ensures that any unmarginalized flux estimate will be an over-estimate. I have thought about this, years ago, but never delivered. Airplane project? Or maybe I should just offer beer to the first of my loyal readers who demonstrates the magnitude of the effect as a function of signal-to-noise and shows that marginalization corrects it.
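For what it's worth, the effect shows up in even the laziest toy Monte Carlo (everything below is my made-up setup; the beer challenge, done properly as a function of signal-to-noise, still stands):

```python
import numpy as np

rng = np.random.default_rng(0)

def psf(x, x0, sigma=1.0):
    return np.exp(-0.5 * (x - x0) ** 2 / sigma ** 2)

def ml_amplitude(data, profile):
    # maximum-likelihood amplitude for a known profile, uniform Gaussian noise
    return (profile @ data) / (profile @ profile)

x = np.arange(-10.0, 10.0, 0.5)
true_amp, noise = 1.0, 0.5          # a faint source: S/N of a few
trial_centers = np.linspace(-2.0, 2.0, 81)

amps_true, amps_best = [], []
for _ in range(2000):
    data = true_amp * psf(x, 0.0) + noise * rng.normal(size=len(x))
    amps_true.append(ml_amplitude(data, psf(x, 0.0)))
    # optimize the position: pick the trial center maximizing the
    # likelihood improvement (p.d)^2 / (p.p)
    best = max(trial_centers,
               key=lambda c: (psf(x, c) @ data) ** 2 / (psf(x, c) @ psf(x, c)))
    amps_best.append(ml_amplitude(data, psf(x, best)))

bias = np.mean(amps_best) - np.mean(amps_true)
```

Fitting the position first lets the fit soak up positive noise fluctuations near the source, so the flux at the optimized position comes out biased high, while the flux at the true position is unbiased.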
In the afternoon I spoke about hierarchical modeling, as demonstrated in our exoplanet distribution modeling projects and our extreme-deconvolution projects. Bailer-Jones (MPIA) objected to my use of the word "uninformative" to describe certain kinds of priors. I agree; I was using it because it is the jargon of the day. You are always injecting information with your priors; if you can go hierarchical, you inject correct information.
In the morning I wrote a Gaussian processes code to model radial velocity data, just for fun. I am definitely reinventing the wheel, but I am learning a lot. I am using Python classes to cache all the expensive matrix operations; this should make things as fast as they can be without serious engineering.
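The caching idea amounts to: factorize the kernel matrix once at construction, then reuse the factorization for every prediction and likelihood evaluation. A stripped-down sketch of the pattern (the names and kernel choices here are illustrative, not my actual code):

```python
import numpy as np

class GaussianProcess:
    """Toy sketch of the caching idea: the expensive matrix work (the
    Cholesky factorization of the kernel matrix) is done once at
    construction and reused by every subsequent method call."""

    def __init__(self, t, rv, rv_err, amp=1.0, scale=10.0):
        self.t, self.rv = np.asarray(t), np.asarray(rv)
        self.amp, self.scale = amp, scale
        K = self._kernel(self.t, self.t) + np.diag(np.asarray(rv_err) ** 2)
        self._chol = np.linalg.cholesky(K)          # cached once
        self._alpha = np.linalg.solve(
            self._chol.T, np.linalg.solve(self._chol, self.rv))

    def _kernel(self, a, b):
        d2 = (np.asarray(a)[:, None] - np.asarray(b)[None, :]) ** 2
        return self.amp ** 2 * np.exp(-0.5 * d2 / self.scale ** 2)

    def predict(self, t_new):
        """Conditional mean at new times; reuses the cached solve."""
        return self._kernel(t_new, self.t) @ self._alpha

    def lnlike(self):
        """Marginal likelihood, also from the cached factorization."""
        return (-0.5 * self.rv @ self._alpha
                - np.log(np.diag(self._chol)).sum()
                - 0.5 * len(self.rv) * np.log(2.0 * np.pi))
```

Since changing the hyperparameters invalidates the cache, a fancier version would key the cached factorization on the kernel parameters; that is the serious engineering I am avoiding.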
In the afternoon, Lupton (Princeton) talked about the SDSS and other large surveys. He said that the decisions they made to make the catalog would not all be agreed upon by all users, but they were science-driven, and driven by particular goals. Then, when asked how we could re-make those decisions and re-analyze the data, he essentially said "you can't". But he followed that by saying that he wants LSST to be different, with reanalysis possible through smart APIs or equivalent. This meshes nicely with things Anthony Brown said on day 1.
There were a bunch of talks on classifying variables, all using the Random Forest method. I have to learn more about that. A discussion following these talks got a little bit into the issues of generative modeling vs black-box classifying. I far, far prefer the former, of course, because it advances the science (and does a better job, I hope) while performing the classification.