transparency: the monopole term

Worked on the paper I am writing with More on the transparency of the Universe, showing that the consistency of baryon acoustic feature (not oscillation) measurements with type Ia supernova measurements provides a non-trivial constraint on Lorentz invariance and transparency. Right now this constraint is not super precise, but it is highly complementary to measurements of absorption (presumably by dust) along lines of sight near galaxies, because there is no model-independent way to integrate the absorption signal correlated with galaxies up to the mean, global value, which is what I would call the monopole term.


mixture of delta functions

Spent the day working on a specialization of mixture-of-gaussians (as a model for a distribution function in a high-dimensionality space) to mixture-of-delta-functions (which would have terrible likelihood for any data set except when you consider that there are observational errors). With Bovy's help I realized that the method we published in 2005 in this unlikely place actually doesn't work for the zero-variance corner of model space. Have to figure out why.
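
To spell out why observational errors rescue the delta-function model, here is the likelihood written under an assumption made only for illustration: gaussian noise with known covariance S_n on each data point x_n. The amplitudes \alpha_k, locations \mu_k, and covariances V_k are hypothetical symbol names, not notation from the 2005 paper.

```latex
p(x) = \sum_{k=1}^{K} \alpha_k \, \delta(x - \mu_k) ,
\qquad
p(x_n \mid \{\alpha_k, \mu_k\}) = \sum_{k=1}^{K} \alpha_k \, \mathcal{N}(x_n \mid \mu_k, S_n)
```

Each term is finite as long as S_n is non-singular. In a mixture of gaussians with component covariances V_k the same expression would carry V_k + S_n in place of S_n, so the delta-function model is the V_k \to 0 corner of that model space.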



I spent part of the day thinking about and part of the day writing about a generalization of the k-means clustering algorithm to the case where there are missing data dimensions and dimensions measured with varying quality. That is, I am attempting to generalize it so that it clusters the data by chi-squared rather than uniform-metric squared distance. This, if I am right, will be a maximum-likelihood model for the situation that the underlying distribution is a set of delta functions and the data points are samples of that distribution but after convolution with gaussian errors (different for each data point). My loyal reader will recognize this as a statement of the archetypes problem on which I have been working for the last week or so.
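
A minimal sketch of what such a chi-squared k-means might look like, assuming independent (diagonal) gaussian errors per data point and encoding a missing dimension as zero inverse variance. The function name `chi2_kmeans` and its arguments are hypothetical, and this is an illustration of the idea rather than the code written that day.

```python
import numpy as np

def chi2_kmeans(x, ivar, centers, n_iter=50):
    """Cluster rows of x by chi-squared distance (hard assignments).

    x       : (N, D) data; values in missing dimensions are ignored
    ivar    : (N, D) inverse variances; 0 marks a missing dimension
    centers : (K, D) initial guesses for the delta-function locations
    """
    centers = np.array(centers, dtype=float)
    x = np.where(ivar > 0, x, 0.0)  # neutralize NaNs in missing slots
    for _ in range(n_iter):
        # chi2[n, k] = sum_d ivar[n, d] * (x[n, d] - centers[k, d])**2
        resid = x[:, None, :] - centers[None, :, :]
        chi2 = np.sum(ivar[:, None, :] * resid**2, axis=2)
        labels = np.argmin(chi2, axis=1)
        new_centers = centers.copy()
        for k in range(centers.shape[0]):
            members = labels == k
            weight = ivar[members].sum(axis=0)
            ok = weight > 0  # dimensions in which this cluster has information
            # inverse-variance-weighted mean replaces the plain k-means mean
            new_centers[k, ok] = (ivar[members] * x[members]).sum(axis=0)[ok] / weight[ok]
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

With all inverse variances equal this reduces to ordinary k-means. With unequal ones, each point pulls on a center only in the dimensions where it was actually measured, and in proportion to how well.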


Spitzer data, MW halo

Wu, Schiminovich, and I discussed data reduction for our large Spitzer program of spectroscopy.

At lunch—Columbia's Pizza Lunch—I described some of Koposov's results on the Milky Way halo potential as measured by a globular-cluster tidal stream.


faster code

Bovy has re-written all our "infer d-dimensional distribution functions when you have noisy data with missing values" code in C and it appears to be much faster than the (heroic) code written by Roweis and Blanton back in the day when we were all discovering the value of pair coding. Bovy and I spent some time discussing split and merge, which is a method for exploring mixture-of-gaussians models when you think you might be stuck in a local minimum.

Bovy and I also discussed the problem of comparing millions of SDSS spectra to one another in finite time. We figured out that the full N-squared calculation would take a year even if we coded it in machine language, so we want to do the full comparison only after trimming the tree with some reliable heuristics. We came up with a straw plan, but I am suspicious about its reliability (that is, we don't want to trim valid leaves) and effectiveness (that is, we want to massively speed things up).


extreme galaxy formation

My day-derailed-off-research-by-undergraduate-studies yesterday was interrupted by a nice talk by Schiminovich about what we have learned from GALEX and Spitzer about star formation in galaxies as a function of galaxy redshift and galaxy specific star-formation rate.

Today I plotted the number of archetypes required to represent a galaxy spectroscopic sample (from Moustakas) as a function of the statistical precision (as measured by chi-squared). The number monotonically increases with the precision, but differently than I expected.


integer programming

The approach Roweis and I are taking to constructing archetypes (small subsets of data points that represent all data points) is one of integer (or actually binary integer) programming. You have a large number of data points, and you include a small number of them and exclude the rest, subject to constraints (that each point in the large set be represented) while optimizing some cost function (the total number of archetypes, in the simplest case). In general, these problems are, indeed, NP-hard, as I suspected (below).

Roweis had the good idea of approximating the binary programming problem with a linear programming problem, and then post-processing the result. This is a great idea, and it works pretty well, as I discovered this morning, when everything came together and my code just worked. However, the number of archetypes we were getting in our post-processing was significantly larger than that expected given the performance of the linear program approximation.

It turns out that standard linear programming packages (open source glpk and commercial CPLEX, for example) have integer and binary programming capabilities. These also solve the linear program first and then post-process, but they do something extremely clever in the post-processing step and are much better than my greedy algorithm. They both come very close to saturating the linear programming optimal cost, for the problem we currently care about (although CPLEX does it much, much faster than glpk, in exchange for infinitely larger licensing fees).
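
The relax-then-post-process idea can be sketched as follows, assuming a boolean matrix `represents[i, j]` that says whether candidate j adequately represents data point i. The rounding step is a simple greedy heuristic chosen for illustration (not necessarily the greedy algorithm mentioned above), the function names are hypothetical, and SciPy's `linprog` stands in for glpk or CPLEX.

```python
import numpy as np
from scipy.optimize import linprog

def archetypes_lp_relaxation(represents):
    """Solve the linear relaxation of the binary archetype-selection problem."""
    a = np.asarray(represents, dtype=float)
    n_points, n_candidates = a.shape
    # minimize sum_j y_j  subject to  sum_j a[i, j] * y_j >= 1  and  0 <= y_j <= 1;
    # linprog wants A_ub @ y <= b_ub, so both sides are negated
    result = linprog(c=np.ones(n_candidates), A_ub=-a, b_ub=-np.ones(n_points),
                     bounds=(0.0, 1.0), method="highs")
    if not result.success:
        raise RuntimeError(result.message)
    return result.x, result.fun

def round_greedily(represents, y):
    """Turn fractional y into a valid 0/1 selection (illustrative heuristic)."""
    a = np.asarray(represents, dtype=bool)
    covered = np.zeros(a.shape[0], dtype=bool)
    chosen = []
    for j in np.argsort(-y):  # candidates with the largest LP weight first
        if np.any(a[:, j] & ~covered):
            chosen.append(int(j))
            covered |= a[:, j]
        if covered.all():
            break
    return chosen
```

The optimal cost of the relaxation is a lower bound on the number of archetypes any binary solution can achieve, so comparing `len(chosen)` with the returned cost shows how much the post-processing gives away.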

It was a very satisfying, research-filled day. As time goes on I will let my loyal readers know why we are interested in this.



I worked on code to generate from a set of delta-chi-squared values a linear program in CPLEX LP format for the archetypes project. Most of the difficulty was in formatting the lines, of course!
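
For concreteness, a sketch of such a generator, assuming a matrix `delta_chi2[i, j]` and a hypothetical `threshold` below which candidate j counts as representing data point i. The variable and constraint names (`y0`, `c0`, `cost`) and the conservative 72-character line wrap are choices made here for illustration.

```python
import numpy as np

def wrap_expression(label, terms, tail="", max_width=72):
    """Lay out 'label: t0 + t1 + ... tail' over lines no longer than max_width."""
    pieces = [terms[0]] + ["+ " + t for t in terms[1:]]
    if tail:
        pieces.append(tail)
    lines, current = [], " " + label + ":"
    for piece in pieces:
        if len(current) + 1 + len(piece) > max_width:
            lines.append(current)
            current = "  "  # indent for a continuation line
        current += " " + piece
    lines.append(current)
    return lines

def archetype_lp_text(delta_chi2, threshold, max_width=72):
    """Return the binary archetype-selection problem as CPLEX LP format text."""
    represents = np.asarray(delta_chi2) <= threshold
    n_points, n_candidates = represents.shape
    names = [f"y{j}" for j in range(n_candidates)]
    lines = ["Minimize"]
    lines += wrap_expression("cost", names, max_width=max_width)
    lines.append("Subject To")
    for i in range(n_points):
        terms = [names[j] for j in np.flatnonzero(represents[i])]
        if not terms:
            raise ValueError(f"data point {i} has no acceptable archetype")
        # every data point must be represented by at least one chosen archetype
        lines += wrap_expression(f"c{i}", terms, ">= 1", max_width=max_width)
    lines.append("Binary")
    lines += [" " + name for name in names]
    lines.append("End")
    return "\n".join(lines) + "\n"
```

Writing the returned text to a file gives something that, for example, `glpsol --lp` should be able to read. Dropping the `Binary` section (and bounding each variable between 0 and 1 instead) would give the linear relaxation.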


linear programming

Spent the day learning about linear programming, for the spectroscopic archetypes project Roweis and I are working on. Our project is an integer programming problem, which is NP-hard (I think), but we have a linear programming approximation. Linear programming is something I learned in high school; now there are lots of free codes that can deal with hundreds of thousands or millions of variables and constraints. Unfortunately, the languages with which the programs can be specified are a bit non-trivial; I have nearly figured out how to code my problem in one of those languages, but I don't know which language to use.


no research

I blame teaching (much as I love it), committees, and email.


tidal stream radial velocities

I briefly helped Koposov this morning on an observing proposal to follow up his statistical measurement of the proper motion of a cold tidal stream with stellar radial velocities. The combination of transverse and radial velocities with distance and angular information means that if this proposal is accepted, Koposov will have not only full 6d phase-space information, but he will have that along the length of a long stream. This permits extremely precise orbit modeling, or, as Rix would say, a direct measurement of the acceleration due to gravity (velocity of the stream and curvature of the stream makes acceleration of the stream).
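
Rix's remark can be made concrete with the standard kinematic relation, under the idealization that the stream traces the orbit of its stars. The symbols v (speed along the stream), R (local radius of curvature), \kappa = 1/R, and \Phi (gravitational potential) are names chosen here for illustration.

```latex
a_\perp = \frac{v^2}{R} = \kappa \, v^2 ,
\qquad
\vec{a} = -\nabla \Phi
```

So the measured speed and the measured curvature at each point along the stream give the component of -\nabla\Phi perpendicular to the stream at that point, with no orbit integration required.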


black-hole orbits

Today was black-hole-orbit day, with talks by Gabe Perez-Giz and Janna Levin (both Columbia) on methods for calculating and classifying all possible orbits around black holes. Their techniques make use of periodic orbits, which comprise a dense set that fully covers orbit space (at least for extreme mass ratio). Two nice talks and lots of discussion.


source variability

My only contributions to astrophysics knowledge today were (1) helping van Velzen extract and analyze galaxy (yes galaxy, not star) light curves from the SDSS Southern Stripe, and (2) discussing the V-max or Malmquist issues in flux-limited samples with Wu.


cosmic-ray anisotropy, regularization and convergence

I had the privilege of serving on the committee for Brian Kolterman's (NYU) PhD thesis defense today. He performed a set of very careful statistical tests of the angle and time distributions of about 10^11 few-TeV cosmic rays incident on the Milagro experiment. He finds an anisotropy to the distribution in celestial coordinates, he finds a time dependence to that anisotropy, and he finds the (expected, known) effect of the orbit of the Earth around the Sun. The most surprising thing is the time dependence of the (very small but very high significance) anisotropy. After the very nice defense, Gruzinov and I spent some time arguing about whether the anisotropy and its time derivative were reasonable in the context of any simple model in which the cosmic ray population is fed by supernova events throughout the disk of the Galaxy. I think I concluded that his results must put a strong constraint on the coherence or large-scale structure of the local magnetic field.

Bovy and I discussed the convergence and regularization of the mixture-of-gaussians model that he is fitting to the error-deconvolved velocity distribution in the disk in the Solar Neighborhood. We read some of the literature on EM and it was very instructive. Now Bovy has some serious coding to do. If he succeeds with all these enhancements, he will be hitting this problem with a very large hammer.


running code

I helped Sjoert van Velzen (NYU, Amsterdam) run our SDSS code to extract multiply observed objects in the SDSS Southern Stripe. He is looking for extremely variable AGN. Later, I rooted around with Wu in ancient directories looking for our copy of the PEGASE models for stellar populations. We found them. Can't say as I did much other research today!


shutting down star formation, galaxy templates

At group meeting, Wu and Moustakas spoke about galaxies with very high star formation rates, and galaxies that have just shut down their star formation abruptly. Wu is looking for the trigger for the cessation in these galaxies, which look like a generic phase in galaxy evolution. Moustakas has been looking at incredibly high velocity (thousand km/s) outflows of gas, possibly driven by very strong star formation.

After lunch I spoke with Bovy about my ideas to replace a PCA space with a set of hard templates in redshift determination and outlier finding for galaxy spectra. Although you need many more templates to represent the galaxies than you would need basis spectra to represent the spectra at some level of precision, you don't have to do nearly as much work to match spectra to individual unmixed templates as you have to do to find the position of the spectrum in the n-dimensional PCA space, in principle. Whether that is true in practice too, I don't yet know.
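
A minimal sketch of matching one spectrum against unmixed templates, assuming the spectrum and all templates are already on a common wavelength grid (so the redshift search is left out) and that each template is allowed only a free amplitude. The function name `best_template` is hypothetical.

```python
import numpy as np

def best_template(flux, ivar, templates):
    """Return (index, amplitude, chi2) of the best-fitting single template.

    flux      : (P,) observed spectrum
    ivar      : (P,) inverse variances; 0 masks a bad pixel
    templates : (K, P) template spectra on the same pixel grid
    """
    templates = np.asarray(templates, dtype=float)
    # maximum-likelihood amplitude of each template, fitted independently
    numer = templates @ (ivar * flux)
    denom = (templates**2) @ ivar
    amp = numer / denom
    resid = flux[None, :] - amp[:, None] * templates
    chi2 = np.sum(ivar[None, :] * resid**2, axis=1)
    k = int(np.argmin(chi2))
    return k, amp[k], chi2[k]
```

Each template costs one one-parameter fit, and a spectrum whose smallest chi-squared is still large is a candidate outlier. A template with no flux in any unmasked pixel would produce a division by zero here, so real use would need a guard for that case.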


information in an image, Auger

I had long conversations with Rob Fergus (NYU) and Lang about the information content of an image. Lang resolved some of my paradoxes; in particular, he figured out that you can't say anything about how much information is in your image unless you know the distribution function over images! That distribution function is incredibly hard to describe, of course, since even for tiny images like Fergus's, it has googol-cubed dimensions! Fergus's approach is to describe this function by sampling it, in some sense, but really what we have to do in our work is just approximate it in some sensible way.

In the afternoon, Jeff Allen (NYU) gave an excellent PhD candidacy exam talk in which he described how to reconstruct the physical properties of cosmic rays incident on the atmosphere from their fluorescence and ground-shower properties as measured by different Auger instruments. There are lots of puzzles in the reconstructions, and some of them will have mundane resolutions. There is an interesting possibility, however, that some will have extremely non-trivial resolutions.


computer failure

My household had a computer failure yesterday, which was impressive, since it was the first day of school. Fortunately, my ridiculous attention to backups paid off and we lost nothing. But it threw the supply chain into disarray and I spent today looking at purchasing some incredibly cheap spares. This is not research.


back to work

It was tough to get back to work in New York, and I spent most of the day getting ready for teaching and committees. That's not research.

I spent an hour talking with Bovy about the reconstruction of the velocity field of stars in the disk. This project is moving slowly just because our (well justified, correct) algorithm is extremely slow. I am very excited about the inference aspects of this project, because we are going to be able to make a lot of predictions about stellar radial velocities, and we will be able to test those predictions with extant data.

On a related note, I started writing a short polemic about representing posterior probability distributions (think Bayes) by sampling, and how that helps in complex scientific tasks.