code documentation as memory

Schiminovich and I met at an undisclosed location early in the morning and spent most of the day working on our eclipsing-white-dwarf paper, using the time-domain information in the GALEX raw data stream. As with so many projects, we are sitting on beautiful results but have no paper. We are resolved to fix that in the next few weeks. One thing I was reminded of is that if you don't work on a project for many months, and then have to write up a method section, you better hope you documented your code!


philosophers are different

I spent a couple of hours at NYU Philosophy, attending a seminar by Jim Weatherall (UC Irvine) about the status of geodesic motion in the absence of force (the law of inertia) in General Relativity and also in Newtonian Gravity. He gave a nice demonstration that when you view both theories in their geometric forms (natural for GR, lately done for Newtonian), the proofs of Newton's First Law in each case look pretty similar. He is attacking a long-held view (promoted by Einstein himself) that only in GR does inertial motion have a clear explanation, that is, that it is not a postulate. Apparently Eddington first made this point to Einstein, and Einstein was stoked about it. What Weatherall showed is that this is not really correct: In its geometric form, Newtonian Gravity provides the same proof (with actually slightly fewer assumptions, in part because causal structure is so simple in Newtonian Gravity).

As usual when going on safari in other departments (Math, CS, Biology, and so on), I learn as much about the practices of the other field as I do from the talk itself. Philosophy talks are scheduled for two hours: seminar for an hour, then questions for an hour, with a short break in between. Questions are handled formally by a moderator. It is an absolutely excellent format that encourages well-thought-out questions and serious, detailed answers; maybe we should consider adopting it?



In a strange coincidence (though perhaps not totally unexpected), I am teaching advanced electromagnetism to a few of our seniors (fourth-year undergraduates) and at the same time, after a long session staring at images from the Project 1640 coronagraph, Fergus and I decided that we need to at least discuss and explore the possibility that we might model the electromagnetic fields inside the instrument. That is, we need to figure out if it is possible to model not just the intensity field but the electric and magnetic fields (or, in the steady state, you can think of it as an amplitude and a phase at the detector surface). To my knowledge, except in radio astronomy, this has not been done: Optical (and near-optical) astronomers think of the "thing" in the telescope as being the intensity field (or worse, photons), not the electric and magnetic fields. The challenge is: Superposition really applies only to the electric and magnetic fields, not the intensity field; but at the same time, CCD-like detectors only measure (a noisy sampling of) the intensity field. Saturday night found me starting to write and test some very simple code, with delta-function sources and delta-function pixels.
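The point about superposition can be made in a few lines. Here is a toy sketch (not the code I actually started, and with made-up source positions and wavelength) of two delta-function sources illuminating a one-dimensional detector: the complex field amplitudes add, and only then does squaring give the intensity a CCD-like detector would sample. Adding the two intensities instead throws away the interference cross term.

```python
import numpy as np

# Toy sketch: two monochromatic point sources illuminate a 1-D
# "detector" at distance z; superpose complex amplitudes, then square.
wavelength = 1.0                      # arbitrary units
k = 2.0 * np.pi / wavelength          # wavenumber
x = np.linspace(-10.0, 10.0, 2048)    # detector coordinate
z = 100.0                             # source-to-detector distance
sources = [(-0.5, 1.0), (0.5, 1.0)]   # (position, amplitude) pairs

field = np.zeros_like(x, dtype=complex)
for x0, amp in sources:
    r = np.hypot(x - x0, z)                  # path length to each pixel
    field += amp * np.exp(1j * k * r) / r    # superpose field amplitudes

intensity = np.abs(field) ** 2       # what the detector actually samples

# The "wrong" answer -- summing intensities -- loses the fringes:
wrong = np.zeros_like(x)
for x0, amp in sources:
    r = np.hypot(x - x0, z)
    wrong += np.abs(amp * np.exp(1j * k * r) / r) ** 2
```

The difference between `intensity` and `wrong` is exactly the interference pattern, which is the information an amplitude-and-phase model would capture and an intensity-only model cannot.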


large galaxies, cosmic rays

It comes as a surprise to many that it is much harder to precisely measure the properties of very bright, nearby galaxies than it is to measure the properties of much more distant but similar objects! (Same for very bright stars too, in modern digital imaging.) Part of this is because at high signal-to-noise you see the (badly modeled) details of your point-spread function better. But the bigger issues are that nearby galaxies span field boundaries (in any blind survey, like SDSS), span flat-field and sky variations (because of their large angular sizes), and tend to be blended with background galaxies and foreground stars. Mykytyn, Foreman-Mackey, and I discussed all these issues over a long post-lunch meeting.

In the early morning, Andrew Flockhart (NYU), Fergus, and I discussed our project to use supervised classification machine-learning techniques to identify the cosmic rays robustly in single-epoch, single-exposure HST imaging. We decided to start with nearest-neighbor techniques and move to support vector machines, before going to any heavy machinery. We built our training set with multi-exposure imaging from the HST Archive.
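As a warm-up for the nearest-neighbor stage, here is a minimal sketch of the idea (not Flockhart's code; the patch features, sizes, and labels are all made up): cosmic rays are un-PSF-like single-pixel spikes, stars are PSF-like blobs, and a plain 1-nearest-neighbor classifier on raw pixel patches already separates toy versions of the two.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical training set: 5x5 pixel patches labeled 1 for
# "cosmic ray" (sharp spike) or 0 for "star" (PSF-like blob).
def make_patch(is_cr):
    yy, xx = np.mgrid[-2:3, -2:3]
    if is_cr:
        patch = np.zeros((5, 5))
        patch[2, 2] = 10.0                             # single-pixel spike
    else:
        patch = 10.0 * np.exp(-(xx**2 + yy**2) / 2.0)  # smooth blob
    return (patch + rng.normal(0.0, 0.3, (5, 5))).ravel()

labels = rng.integers(0, 2, 200)
patches = np.array([make_patch(l) for l in labels])
train_x, train_y = patches[:150], labels[:150]
test_x, test_y = patches[150:], labels[150:]

def nearest_neighbor_predict(x):
    """Classify by the label of the closest training patch (1-NN)."""
    d = np.sum((train_x - x) ** 2, axis=1)
    return train_y[np.argmin(d)]

predictions = np.array([nearest_neighbor_predict(x) for x in test_x])
accuracy = np.mean(predictions == test_y)
```

Real single-epoch HST patches are of course far messier than this, which is why we expect to need SVMs or heavier machinery eventually.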


responding to referee; the disk

I spent the day in Princeton; the morning with Bovy talking about the Milky Way disk and the afternoon with Lang working on the response-to-referee on the Comet Holmes paper. We are very, very behind schedule on that! We made figures that compare the Comet Holmes orbit we inferred to the NASA orbit. We don't get quite the right orbit, in part because our model of the data we scraped from the web is so crude.

Bovy and I discussed his results on the kinematics of mono-abundance subpopulations in the Milky Way disk, a follow-up to his paper on the spatial structure of those same populations. We also discussed his measurement of the disk rotation curve with APOGEE data; he gets a low-amplitude (relative to Reid et al.) rotation curve, which is intriguing.


software repository, licensing

Inspired by emails from Stumm and Foreman-Mackey, Lang and I had a long conversation about software repositories: SVN vs GIT, in the cloud vs at home, one repository or many, what level of organization, and so on. It is a difficult set of problems, and different solutions serve and support different kinds of development styles and communities. We have decided to migrate one sub-project from the Astrometry.net SVN repository to a github GIT repository as a trial balloon, and that led to a round of the endless discussion of licenses. Someone needs to write the definitive document on software licensing for astronomers and be done with it!

In the afternoon, I started to read Bovy's latest manuscript about chemical-abundance sub-populations in the Milky Way disk, but now in velocity space.


segmenting images and inferring motion

Over in Fergus's computer science group, Deqing Sun (Brown) gave a very nice talk about measuring motion in image sequences (think movies) by building a generative model of moving layers with sharp boundaries. He constructs a prior over image segmentations by segmenting the image using threshold-crossing of a (very local) smooth Gaussian process; this permits an analytic prior. The results are beautiful and effective and conform to common sense, and also come close to world-record performance on quantitative benchmark tests (with known ground truth). His system performs well in part because it is an (approximate, simplified, sensible) full generative model for the data: It has a large number of parameters, a proper prior over those parameters, and a sensible likelihood function, and he can optimize it. He didn't try to sample from the posterior PDF, but he has only worked (so far) at very high signal-to-noise.
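The threshold-crossing trick is easy to demonstrate. In this sketch (not Sun's code; the grid size and smoothing scale are arbitrary, and Gaussian-smoothed white noise stands in for a proper Gaussian-process sample), thresholding a smooth random field at zero yields a binary layer mask whose boundaries are sharp but whose shapes are smooth and plausible — exactly the kind of segmentation you want a prior to prefer.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)

# Smooth random field: Gaussian-smoothed white noise as a stand-in
# for a sample from a (very local) smooth Gaussian process.
field = gaussian_filter(rng.normal(size=(64, 64)), sigma=6.0)

# Threshold-crossing at zero: a two-layer segmentation with sharp
# boundaries but smooth, GP-controlled shapes.
mask = field > 0.0
```

Because the underlying field is Gaussian, probabilities of segment configurations can be written down analytically, which is what makes the prior tractable in his model.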



Foreman-Mackey returned from his furlough at Queens, where he was finishing a paper with Widrow on Andromeda. I quizzed him about some details of caching (very slow computations) in my Python RGB-to-CMYK code; he had good ideas. One thing he noted is that instead of doing if rgb in cache.keys(): it might be far faster to do try: cmyk = cache[rgb] and then catch the KeyError exception. Apparently that is all the rage in Python programming. He also promised to help me Python-package and docstring everything. Looking forward to it!

[Note added a few minutes later: Switching from the keys() check to the try style sped up the cache retrieval by a factor of 40!]
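For the record, here is a minimal sketch of the two styles (with a hypothetical toy cache; the real code caches slow RGB-to-CMYK conversions). The factor-of-40 win makes sense under Python 2, where dict.keys() builds a list and the membership check scans it in O(n); the try/except style ("easier to ask forgiveness than permission") goes straight to the O(1) hash lookup. In modern Python 3, keys() returns a view with O(1) membership, and the idiomatic fast check is simply if rgb in cache:.

```python
# Hypothetical toy cache mapping RGB triples to CMYK quadruples.
cache = {(r, g, b): (0, 0, 0, 0)
         for r in range(16) for g in range(16) for b in range(16)}

def lookup_with_keys(rgb):
    # "Look before you leap": under Python 2, cache.keys() builds a
    # list, so this membership test is an O(n) scan.
    if rgb in cache.keys():
        return cache[rgb]
    return None

def lookup_with_try(rgb):
    # EAFP: attempt the O(1) hash lookup and catch the miss.
    try:
        return cache[rgb]
    except KeyError:
        return None
```

Both functions return the same answers; only the cost of the hit path differs.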


planning and not doing

I worked on making a somewhat binding schedule for creation of my Large Galaxy Atlas and the associated publications; the rest of the day was spent on various complicated issues related to my position as Director of Undergraduate Studies; that is, forbidden content on this blog.



I spent the entire morning in the Hungarian Pastry Shop (Columbia hangout) with Schiminovich, discussing our various GALEX projects, but especially our discovery of white-dwarf companions (some of which are very low mass, possibly sub-stellar) and our evil plans to extract and distribute the full time-tagged photon list, along with the relevant spacecraft pointing and sensitivity data to make full use of them. I am so excited about this project: It will provide the highest time-resolution that is physically possible given the aperture, optics, and detectors on the spacecraft. If that doesn't lead to interesting time-domain science, I don't know what will.


making a catalog is not easy

Today David Mykytyn (NYU undergrad) and I specified Mykytyn's project to be the construction of a "Large Galaxy Catalog" from the SDSS imaging, using the Tractor to do the galaxy measurements (sizes, surface brightnesses, colors, and magnitudes). Today we discussed many of the complicating issues, which include (but are not limited to) the facts that: (1) angularly large galaxies often overlap multiple SDSS fields, usually taken on different nights, (2) they often overlap very bright stars, which can dominate the photon count over the face of the galaxy, (3) the SDSS software (optimized for angularly much smaller galaxies) often shreds them into many pieces, (4) lumps and bumps in the intensity field can be features of the galaxy or confusing foreground objects (stars) or background objects (distant galaxies). We have hacks for all of these issues (not yet implemented), but we would like some principled approaches. It's hard, because, as I have lamented before, despite a hundred years of expensive and painstaking work by thousands of very bright people, astronomers do not have a generative model for galaxies!


finding spectroscopic supernovae in real time

Back in November, Or Graur (AMNH, Tel Aviv) came to us at NYU and pitched a method for finding supernovae superimposed on the SDSS-III BOSS spectra of early-type galaxies. His system is essentially a generative model of both galaxies and supernovae, so it appeals to me. He has run it successfully on other data sets, but if we ran his stuff in BOSS on the mountain each night at the end of observing, we could discover and announce supernovae in real time. My only substantial research today was pitching this to the SDSS-III Collaboration; a pitch is required because we can only do this if the Collaboration accepts Graur as an External Collaborator.


pretty much zip

It being a holiday, I didn't do much here, except for some planning for my as-yet-unwritten Atlas.


printer calibration test strip

I made this printer test (8-bit CMYK TIFF file) for my RGB-to-CMYK conversion project. If you can print this out on a CMYK printer (no, there is absolutely no reason to look at it on the screen), and if you can be sure your print driver is not flattening to RGB before doing a reconversion to CMYK (this is hard to know, given the craziness of the print driver world), then printing this and comparing it to a screen view (no, there is absolutely no point in printing it) of the original (8-bit RGB JPEG) leads to a (very rough) printer calibration. In making this test strip, I have reduced my 12-parameter (already simplified) printer model to only 3 parameters. Paper (arXiv-only, I expect) and open-source code coming soon.


multi-band Tractor

One of the ways in which the Tractor is qualitatively better than other methods for measuring the properties of stars and galaxies in imaging is that it can fit multiple images—with different seeing, taken on different nights, and taken through different bandpasses—simultaneously, delivering consistent shapes, colors, and variability information despite heterogeneous data and with no requirement of "stacking" before measurement. All that is true in theory but until today most of this functionality was vapor-ware. Today, Lang and I (well, really Lang, with me watching) got the Tractor working on multi-band, heterogeneous imaging by permitting the "fluxes" or "magnitudes" of the objects to be arrays of values, one per band. In the future, we hope to work in spaces with well-defined priors, learned hierarchically, but we have a start. When we applied the code to a small snippet of SDSS data, we found some tiny band-to-band astrometric offsets, for which (along with photometric calibration and PSF) the Tractor can also fit.
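The core data-structure change is simple to illustrate. This is a toy sketch, emphatically not the actual Tractor API (the class, band list, and Gaussian "PSF" are all made up): each source carries one flux per band, and each image knows its own band and seeing, so the same source renders consistently into heterogeneous images with no stacking.

```python
import numpy as np

bands = ["u", "g", "r"]   # hypothetical band list

class ToySource:
    """A point source carrying an array of per-band fluxes."""
    def __init__(self, x, y, fluxes):
        self.x, self.y = x, y
        self.fluxes = dict(zip(bands, fluxes))   # one flux per band

    def render(self, band, psf_sigma, shape=(32, 32)):
        """Render this source into an image with the given band and seeing."""
        yy, xx = np.indices(shape)
        r2 = (xx - self.x) ** 2 + (yy - self.y) ** 2
        psf = np.exp(-0.5 * r2 / psf_sigma ** 2)
        psf /= psf.sum()                          # unit-flux PSF
        return self.fluxes[band] * psf

src = ToySource(16.0, 16.0, [10.0, 40.0, 80.0])
# Three images of the same source, each with its own seeing:
images = {b: src.render(b, psf_sigma=s)
          for b, s in zip(bands, [2.0, 1.2, 1.5])}
```

The payoff is that optimizing the per-band flux array against all images at once yields consistent colors and shapes even when no two images share seeing or epoch.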


more deconvolution

Hennawi and I started to write down a real likelihood function and some priors for the combining-without-stacking problem, thinking about quasars and other high-redshift objects. One of the key ideas is that we can deconvolve the spectral information in the low-spectral-resolution broad-band photometry to get a higher spectral-resolution spectral energy distribution, by using the idea that at different redshift we see similar populations but at different rest wavelengths. Blanton pointed us to this paper by Csabai et al.


don't stack your data (UV edition)

Hennawi is visiting NYU for a couple of days, and pitched to me a few projects all centered around the idea of getting more information out of a set of noisy observations than you can get by just stacking. One cool idea is to get medium-resolution spectral components (or a distribution in spectrum space) with only broad-band photometry. Another is to get the same even in the presence of variable and spiky IGM absorption. Another is to get IGM absorption statistics from broad-band photometry alone. And so on. He emphasized that the UV bump in quasars is not well observed (because it is in the UV and either cut off by the atmosphere or absorbed by the IGM), despite the fact that it is the most direct observable created by the accretion flow. All we did today is talk, but tomorrow we will write down likelihood functions (the first step in any project!).



I finally coded up and got running an RGB-to-CMYK conversion that is based on the physical properties of the printing device. The (perhaps insane) idea I have in mind is the following: Our RGB images of SDSS galaxies, if you view them on a normal RGB monitor, have a definite, quantitative relationship between the light hitting your eyes and the intensity hitting the telescope. It is non-trivial and non-linear, but it is quantitatively traceable and (lossily) invertible. When we print these out on a CMYK printer, this is not true, in part because the RGB-to-CMYK conversion is heuristic and doesn't in any way model the physical process of light hitting the page, being attenuated by the ink, and then reflecting. This process, for example, is multiplicative (not subtractive as is usually said). It is multiplicative with multipliers less than one, and (strongly) wavelength-dependent. The model I have built of this process can (in principle) make it once again true that the reflected light from the page (when viewed with a standard room illumination, say) has properties that are quantitatively and (lossily) invertible back to the intensity falling on the telescope. Why do I do these things?
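The multiplicative point deserves a concrete sketch. This is a toy version of the idea, not my actual conversion code: the transmission numbers below are invented, and three coarse wavelength bands stand in for a real spectral model. Each ink multiplies the illuminant by a wavelength-dependent factor less than one, raised to the ink coverage; it does not "subtract" a fixed amount of light.

```python
import numpy as np

# Made-up per-ink transmission factors in three coarse bands (R, G, B):
transmission = {
    "c": np.array([0.2, 0.9, 0.9]),   # cyan absorbs red, passes green/blue
    "m": np.array([0.9, 0.2, 0.9]),   # magenta absorbs green
    "y": np.array([0.9, 0.9, 0.2]),   # yellow absorbs blue
    "k": np.array([0.1, 0.1, 0.1]),   # black absorbs everything
}

def reflected_rgb(coverage, illuminant=np.ones(3)):
    """Light off the page: illuminant times prod_i T_i(lambda)**coverage_i,
    for CMYK ink coverages in [0, 1]."""
    out = illuminant.copy()
    for ink, c in coverage.items():
        out = out * transmission[ink] ** c   # multiplicative attenuation
    return out

paper = reflected_rgb({"c": 0.0, "m": 0.0, "y": 0.0, "k": 0.0})  # bare paper
cyan = reflected_rgb({"c": 1.0, "m": 0.0, "y": 0.0, "k": 0.0})   # full cyan
```

Because the forward model is an explicit physical map from ink coverages to reflected light, it can (in principle) be inverted, which is what makes the print quantitatively traceable back to the telescope intensity.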



I have been critical of Popper (or really of a cartoon version of Popper who lives in my mind) in the past, so when I passed a copy of his book The logic of scientific discovery in a Wellington, NZ bookstore, I picked it up. I have only just started, but I realize that his really important contribution—one with which I agree wholeheartedly, as (I hope) do all probabilistic reasoners—is that we should stop trying to solve any kind of problem of induction: We do not generate general rules by making repeated, specific observations! We use repeated observations to test hypotheses. We create generalizations, figure out their consequences, and test those against the data. The data do not create our laws; we create them.

Where I disagree with Popper is in the question of falsification. Popper believes (I think; I haven't read him yet!) that laws can only be falsified when compared with data. I believe that falsification is never absolute, and that falsification of competitors can be effectively confirmatory to the competing hypothesis.


over-estimated error variances

[On travel, so posting is irregular; see Rules.]

Astronomers spend a lot of their time estimating their errors (meaning the variances or standard deviations of the noise contributions to their measurements), as they should! However, often these error analyses make a lot of unwanted assumptions. For example, in the case of stellar metallicity measurements (the case I am working on with Bovy and Rix), the errors are estimated by looking at the variation across stellar models. Because the range of stellar models is both too large (many of these models are in fact ruled out by the data) and too small (many real observed stars are not well fit by any model), the errors estimated by standard error-estimation techniques can be either too small or too large, depending on how the modeling disagrees with reality.

In the limit of large amounts of data, any machine-learner will tell you that if your uncertainty variances matter (and they do), then you must be able to infer them along with the parameters of true interest. That is, when you have a lot of data, your data themselves probably tell you more about your measurement error properties (your noise model) than do any external or subsequent error analyses. The crazy thing is that it is clear from the informative detail we see in Figure 2 of this paper that the team-reported error variances on the SEGUE metallicity measurements are substantial over-estimates! There might be large biases in these measurements but there simply can't be large scatter. Now how to convince the world?
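A toy simulation (made-up numbers, not SEGUE data) shows how the data expose an over-estimated error variance. Suppose stars are observed twice, with a true per-measurement scatter of 0.1 dex but a reported error of 0.3 dex; repeat visits let the measurements themselves reveal the true noise level, because Var(m1 - m2) = 2 sigma^2 independent of the true metallicity distribution.

```python
import numpy as np

rng = np.random.default_rng(1)

true_sigma, reported_sigma = 0.1, 0.3   # hypothetical noise levels (dex)
truths = rng.normal(0.0, 1.0, 500)      # true metallicities, 500 stars

# Two independent visits per star, each with the *true* noise:
m1 = truths + rng.normal(0.0, true_sigma, 500)
m2 = truths + rng.normal(0.0, true_sigma, 500)

# Data-driven noise estimate: the truths cancel in the difference,
# so std(m1 - m2) / sqrt(2) estimates the per-measurement sigma.
inferred_sigma = np.std(m1 - m2) / np.sqrt(2.0)
```

The inferred sigma comes out near 0.1, far below the "reported" 0.3, which is the kind of internal evidence that convicts an over-estimated error bar.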