K-nearest-neighbors spectral modeling

K-nearest-neighbors is all the rage in Camp Hogg these days. Pursuant to our conversations this week, I wrote a short document explaining how we could use it to determine G-dwarf ages with SEGUE spectra. The issue is that you want to see very subtle chromospheric activity on top of coincident absorption lines. You need a good model of each spectrum, better than theoretical models or spectral libraries can provide. My document explains how you could do it with the large extant data set, capitalizing on the neighbors in spectral space. Rix is about to hand me the spectra.
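
The scheme, in a toy sketch (all names and numbers here are made up, not the actual SEGUE pipeline): model each spectrum as the mean of its k nearest neighbors in pixel space, and look for the subtle activity signal in the residuals.

```python
import numpy as np

def knn_spectral_model(fluxes, target_idx, k=16):
    """Model one spectrum as the mean of its k nearest neighbors
    in pixel space (neighbors exclude the target itself)."""
    target = fluxes[target_idx]
    # squared Euclidean distance in spectral (pixel) space
    d2 = np.sum((fluxes - target) ** 2, axis=1)
    d2[target_idx] = np.inf          # never use the star as its own neighbor
    neighbors = np.argsort(d2)[:k]
    model = fluxes[neighbors].mean(axis=0)
    return model, target - model     # residuals show what's peculiar

# toy demo: 100 fake "spectra", one with an injected activity bump
rng = np.random.default_rng(42)
fluxes = 1.0 + 0.01 * rng.standard_normal((100, 50))
fluxes[0, 25] += 0.1                 # subtle "chromospheric" feature
model, resid = knn_spectral_model(fluxes, 0)
print(np.argmax(resid))              # the injected pixel should stand out
```

The point is that the neighbor average is a better model of this star than any spectral library could be, because the neighbors share its temperature, gravity, and abundances.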


data-driven chemical tagging

Rix and I spent a big chunk of the day talking about everything we have been doing since the summer. We veered off into my dream project of doing stellar chemical abundance analyses without a good physical model of the stellar atmosphere or emission. I think this is possible. And in discussion with Rix, I realized that it could be directly related (or very similar to) support vector machines with the kernel trick. The kernel can be flexible enough to adapt to star temperature and surface gravity, but not flexible enough to adapt to chemical abundance changes. I just want to figure out how to work with missing data and noise (as my loyal reader knows, this is my biggest problem with most standard machine-learning methods). We also discussed the NSF Portfolio Review, the modeling of tidal streams, age-dating stars, the three-dimensional distribution of dust in the Milky Way, and other stuff.


exoplanet statistics

On the plane to Tucson, I spent a few minutes trying to formalize a few ideas about limiting cases we might find in exoplanet population statistics, which we can use to build intuition about our full probabilistic inference. The first is that if the mutual inclinations are low (very low), the observed tranet (a transiting planet is a tranet) multiplicity distribution translates very simply (by geometric factors) into the true planet multiplicity distribution. The second is that if the multiplicity is high, the observed tranet multiplicity distribution constrains a ratio of the true planet multiplicity to the mutual inclination distribution width. The third is that if the multiplicity could somehow be found to be dynamically maximal (if that is even possible to define), then the tranet multiplicity distribution translates very simply into the true planet mutual inclination distribution. I am not sure that any of this is useful or even correct.
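
The first limiting case is easy to check in a toy Monte Carlo (semi-major axes made up; exactly zero mutual inclination assumed, so every system shares one line-of-sight inclination):

```python
import numpy as np

rng = np.random.default_rng(0)
n_sys, n_pl = 100_000, 3
a = np.array([10., 20., 40.])        # semi-major axes in stellar radii (made up)
# zero mutual inclination: one line-of-sight inclination per system
cos_i = rng.uniform(0, 1, size=(n_sys, 1))
transits = cos_i < 1.0 / a           # geometric transit condition (impact parameter < 1)
n_tranets = transits.sum(axis=1)
counts = np.bincount(n_tranets, minlength=n_pl + 1)
print(counts / n_sys)                # fractions with 0, 1, 2, 3 tranets
```

For these (sorted) semi-major axes the analytic answer is [0.9, 0.05, 0.025, 0.025]: in a perfectly coplanar system any transiting planet implies that all interior planets transit too, so the tranet counts are just nested differences of the geometric factors 1/a.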


Bayes at scale

Camp Hogg met with David Blei (Princeton) again today to continue discussing the inference generalities that came up last week. In particular, we were interested in discussing what aspects of our planned exoplanet inferences will be tractable and what aspects will not, along with whether we can co-opt technology from other domains where complex graphical models have been sampled or marginalized. Blei's initial reaction to our model is that it contains way too much data to be sampled and we will have to settle for optimization. He softened when we explained that we can sample a big chunk of it already, but still expects that fully marginalizing out the exoplanets on the inside (we are trying to infer population properties at the top level of the hierarchy, using things like the Kepler data at the bottom level) will be impossible. Along the way, we learned that our confusion about how to treat model parameters that adjust model complexity arises in part because that genuinely is confusing.

We also discussed prior and posterior predictive checks, which Blei says he wants to do at scale and for science. I love that! He has the intuition that posterior predictive checks could revolutionize probabilistic inference with enormous data sets. He gave us homework on this subject.
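
A posterior predictive check, in miniature (toy Gaussian data and a conjugate posterior of my own choosing; nothing here is Blei's actual homework): draw parameters from the posterior, replicate the data, and ask whether a test statistic of the real data looks typical among the replications.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, size=200)            # toy "observed" data

# conjugate posterior for the mean (known sigma = 1, flat prior):
# mu | y ~ Normal(ybar, 1/sqrt(n))
n, ybar = y.size, y.mean()

# posterior predictive check on a test statistic T = max(y)
T_obs = y.max()
T_rep = []
for _ in range(2000):
    mu = rng.normal(ybar, 1 / np.sqrt(n))     # draw from the posterior
    y_rep = rng.normal(mu, 1.0, size=n)       # replicate the data
    T_rep.append(y_rep.max())
p_value = np.mean(np.array(T_rep) >= T_obs)   # tail-area "Bayesian p-value"
print(round(p_value, 2))
```

An extreme p-value (near 0 or 1) would say the model can't reproduce this aspect of the data; the "at scale" question is how to do this for billions of data points and thousands of test statistics.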

Today was also Muandet's last day here in NYC. It was great to have him. Like most visitors to Camp Hogg, he goes home with more unanswered questions than he arrived with, and new projects to boot!



candidacy exam

Foreman-Mackey gave his PhD Candidacy exam talk today and passed, of course. He talked about hierarchical Bayesian modeling of exoplanet populations, which is his ambitious thesis topic. I don't think I have ever had a student better prepared to do an ambitious project with data! He also has an eleven-month-old arXiv-only paper with 25 citations already. In the morning, Fadely and I looked at results from Fadely's patch-based, nearest-neighbor-based image calibration system. It is fast and seems to be doing the right thing, but we don't understand all of the details or what we expect for the dynamics of the optimization. Late in the day, Fadely showed me some experiments that suggest that in pathological situations, it can converge to something that looks like the wrong answer.


support vector machines

Krik Muandet (MPI-IS) led a discussion today of support vector machines and extensions thereof. For me the most valuable part of the discussion was an explanation of the kernel trick, which is absolutely brilliant: It projects the data up into a much higher-dimensional space and permits accurate classification without enormous computational load. Indeed, the kernel obviates any creation of the higher-dimensional space at all. Muandet then went on to discuss his support measure machine extension of SVM; in this extension the points are replaced by probability distribution functions (yes, the data are now PDFs). I was pleased to see that the SMM contains, on the inside, something that looks very like chi-squareds or likelihood ratios. Jonathan Goodman asked how the SMM classifier differs from what you would get if you just ran SVM on a dense sampling of the input PDFs. Of course it has to differ, because it takes different inputs, but the question, correctly posed, is interesting. We ended the talk with lunch, at which we asked Muandet to do some demos that elucidate the relationship between SVM and nearest neighbors (Fadely's current passion).
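
A two-line illustration of the trick (my toy example, not Muandet's): for the quadratic kernel, the inner product in the lifted space equals the square of the inner product in the original space, so you never have to build the lifted space at all.

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for 2-d input:
    (x1^2, x2^2, sqrt(2) * x1 * x2)."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

lhs = phi(x) @ phi(y)          # inner product in the lifted space
rhs = (x @ y) ** 2             # kernel evaluated in the original space
print(np.isclose(lhs, rhs))    # True: identical, with no lifting required
```

In real SVMs the lifted space can be infinite-dimensional (as with the RBF kernel), and the trick is the only thing that makes the computation possible.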

I spent a big chunk of the afternoon discussing gravitational radiation detection with Lam Hui (Columbia). I think we have resolved the differences that emerged during his talk here this semester.


SQL injection

It was 1995 again here at Camp Hogg. At dotastronomy NYC hack day, one of the participants (whom I am leaving nameless unless he or she wants to self-identify in the comments) identified a SQL-injection vulnerability in the MAST (Hubble Space Telescope) astronomical data archive. I made the mistake of bug-reporting it late last night and then had to deal with the consequences of it today. It was my first experience of the grey-hat world (white hat: report bug without exploit; grey hat: exploit bug trivially and report it; black hat: exploit but don't report); grey-hat is effective but stressful and doesn't get you any love. The upshot is positive though: MAST will be more secure going forward.
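
For the record, the fix is standard and decades old: parameterized queries. A minimal sketch in Python with sqlite3 (hypothetical table; this is not MAST's actual schema or payload):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE targets (name TEXT, ra REAL)")
con.executemany("INSERT INTO targets VALUES (?, ?)",
                [("M31", 10.68), ("M33", 23.46)])

payload = "M31' OR '1'='1"   # classic injection string

# BAD: string formatting lets the payload rewrite the WHERE clause
unsafe = con.execute(
    f"SELECT * FROM targets WHERE name = '{payload}'").fetchall()

# GOOD: a parameterized query treats the payload as an inert literal
safe = con.execute(
    "SELECT * FROM targets WHERE name = ?", (payload,)).fetchall()

print(len(unsafe), len(safe))   # 2 0 -- the injection matched every row
```

The parameterized version never lets user input touch the query text, which is why every database API provides placeholders.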


statistical mechanics

I was in two exams today. The first was a candidacy exam for Henrique Moyses (NYU), who is working on the stochastic vortex, a response of random-walking particles to non-conservative forces. One of the main problems he has is how to measure the drift velocity field of a particle, when the secular drift is a tiny, tiny fraction of the random velocity. He has lots of samples, but fitting the steady part (mean) of the spatially varying stochastic velocity distribution function is not trivial. We vowed to discuss density estimation in January.
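
The zero-dimensional version of the problem (made-up numbers, ignoring the spatial variation that makes the real problem hard) shows the basic difficulty: a drift that is a thousandth of the random velocity only emerges from millions of samples.

```python
import numpy as np

rng = np.random.default_rng(3)
v_drift, sigma = 1e-3, 1.0               # drift a tiny fraction of the noise
N = 4_000_000
dx = v_drift + sigma * rng.standard_normal(N)   # per-step displacements

est = dx.mean()
err = dx.std() / np.sqrt(N)              # standard error of the mean
print(f"{est:.2e} +/- {err:.2e}")        # the drift only emerges because N is huge
```

The real problem is worse, because the mean varies with position, so the samples must be binned or the mean field must be fit with some flexible model, which is exactly the density-estimation conversation we deferred to January.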

The second exam was a PhD defense for Ivane Jorjadze (NYU), who has various results on systems of jammed particles. He has worked out quantitatively a thermodynamic analogy for non-thermal but randomly packed solids with some specific experimental systems. He has also worked out the elasticity and normal modes of jammed soft particles (like cells). Beautiful stuff, with good physics content and also very relevant to biological systems. Congratulations Dr. Jorjadze!


hack day

I had a great time at dotastronomy NYC hack day, organized by Muench (Harvard). Foreman-Mackey and I (with help from Schwamb, Cooper, Taub) worked on a probabilistic analysis of exoplanet populations based on the Kepler objects of interest list. I wrote text, Foreman-Mackey wrote code, Cooper drew a graphical model, and Schwamb and Taub read papers. In parallel, many hacks happened. The most impressive was Schiminovich's 9-year-old son Theo's calculation of exoplanet transit lightcurves (including the secondary eclipse) in Scratch. Theo went from not even knowing what an exoplanet is to doing this calculation in hours! Another impressive hack was Beaumont (Harvard), who hacked an OpenGL backend onto matplotlib, making the (notoriously slow but nice) plotting package outrageously fast for million-point plots. It was seriously impressive; I hope he pushes it to matplotlib development. There were many other nice hacks and a good time was had by all. Thanks to Muench, bitly, Seamless Astronomy, and Gorelick (bitly) for making it happen!


super-massive black-hole binaries

Bovy showed up for the day, in part to discuss quasar target selection with Muandet. Late in the day we discussed more general methods for determining the Solar motion relative to the full Milky Way disk, given that there is mounting evidence that the entire Solar Neighborhood is moving faster (relative to the MW Center) than the mean rotation speed.

In the late morning there was a talk by Tanaka (MPA) about binary super-massive black holes and their observability. As my loyal reader knows, I think these are more elusive than they should be: Either there are very few of them or they hide by turning off all AGN activity. Tanaka discussed all options but contributed something very interesting: His accretion models for these objects suggest that they should be UV and X-ray dark relative to typical quasars, at least in the late stages of inspiral. So we could test our ambiguous candidates by checking whether they are UV dark. He proposed several other interesting tests, most of which require new data and serious observing campaigns. At some point I found myself wondering whether ULIRGs (which show IR AGN activity and star formation) could be SMBH binaries. They are UV-poor and clearly the result of mergers.


arXiv modeling

Camp Hogg (which includes Muandet these days) had lunch with David Blei (Princeton), who is a computer scientist and machine-learning expert. He told us about projects he is doing to index and provide recommendations for arXiv papers, based (presumably) on his experience with author–topic modeling. Blei is a kindred spirit, because he favors methods that have a graphical model or probabilistic generative model underlying them. We agreed that this is beneficial, because it moves the decision making from "what algorithm should we use?" to more scientific questions like "what is causing our noise?" and "what aspects of the problem depend on what other aspects?". These scientific questions lay the assumptions and domain-knowledge input bare.

We talked about the value of having arXiv indexing, how automated paper recommendations might be used, what things could cause users to love or hate it, and what kinds of external information might be useful. We mentioned Twitter. Blei noted that any time that you have a set of user bibliographies—that is, the list of papers they care about or use—those bibliographies can help inform a model of what the papers are about. For example, a paper might be in the statistics literature, and have only statistics words in it, but in fact be highly read by physicists. That is an indicator that the paper's subject matter spills into physics, in some very real sense. One of Blei's interests is finding influential interdisciplinary papers by methods like these. And the nice thing is that external forums like Twitter, Facebook (gasp), and user histories at the arXiv effectively provide such bibliographies.

Late in the day we met up with Micha Gorelick (bitly) to discuss our plans for the dotastronomy hack day in New York City this weekend (organized by Gus Muench, Harvard). We are wondering if we could hack from idea to submittable paper in one day.


parameterizing exoplanets

Foreman-Mackey and I had a long conversation about parameterizing exoplanet models for transit lightcurves (and also radial-velocity measurements, direct-detection experiments, and astrometric data). The issue is that there are multiple angles—between the orbital plane and the plane normal to the line of sight, between the line of sight and the perihelion direction, and between the perihelion and both the ascending node and the fiducial time (zero of time). These angles are all differently degenerate for different kinds of data, and a well-measured angle can be hidden if you parameterize the model such that it appears only as a difference or sum of parameters. We also pair-coded some improvements to his transit-fitting code. The fits—to Kepler and Spitzer data—are sweet. Because his code is flexible, he can learn the limb darkening along with all the orbit and transit parameters.
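
A toy demonstration of the point (made-up posterior samples; the angle names ω and Ω are just placeholders): two angles can each look completely unconstrained even when their difference is measured exquisitely well, so the parameterization should expose the difference directly.

```python
import numpy as np

# fake posterior samples where (omega - Omega) is well measured
# but omega and Omega separately are not
rng = np.random.default_rng(7)
s = rng.uniform(0, 2 * np.pi, 10_000)        # poorly constrained sum
d = rng.normal(1.0, 0.01, 10_000)            # tightly constrained difference
omega = (s + d) / 2
Omega = (s - d) / 2

print(np.std(omega), np.std(Omega))          # both look unconstrained...
print(np.std(omega - Omega))                 # ...but their difference is tight
```

If you sample in (ω, Ω) directly, a naive look at the marginal uncertainties would tell you nothing is measured; sampling in (sum, difference) makes the well-measured combination explicit.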

At applied-math-meets-astronomy group meeting, we discussed various things, including using classification within a sampler to split the parameter space (in which there is a multi-modal likelihood function or posterior PDF) into a set of unimodal sub-models. On this subject, it is great to have Muandet visiting from MPI-IS; he is an expert on support vector machines and their extensions. Another thing we discussed is the possibility that, in the Goodman & Weare stretch move underlying emcee and Hou's code, we might be able to improve the acceptance rate by changing the proposal distribution in the one-dimensional walker-mixing direction. That is super cool and something to work on at sleep time.
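
For reference, the stretch move itself is only a few lines; here is a sketch of a single-walker update following Goodman & Weare (2010) (my own minimal transcription, not emcee's actual code):

```python
import numpy as np

def stretch_move(x_k, x_j, log_p, a=2.0, rng=np.random.default_rng()):
    """One Goodman & Weare stretch-move update of walker x_k, using a
    complementary walker x_j. Returns the (possibly unchanged) position."""
    ndim = x_k.size
    # draw z with density g(z) proportional to 1/sqrt(z) on [1/a, a]
    z = ((a - 1.0) * rng.uniform() + 1.0) ** 2 / a
    y = x_j + z * (x_k - x_j)              # propose along the line x_j -> x_k
    # detailed-balance acceptance probability includes the z^(ndim-1) factor
    log_accept = (ndim - 1) * np.log(z) + log_p(y) - log_p(x_k)
    if np.log(rng.uniform()) < log_accept:
        return y
    return x_k

# toy usage: one update against a standard-Gaussian target
log_p = lambda x: -0.5 * np.sum(x**2)
x_new = stretch_move(np.zeros(3), np.ones(3), log_p)
```

The proposal is one-dimensional, along the line connecting the two walkers, which is exactly the direction in which we were speculating the proposal density g(z) could be tuned.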


calibration and classification

At computer-vision-meets-astronomy group meeting, Fadely showed awesome results demonstrating that we can calibrate astronomical imaging (determine the dark and flat) without taking specific calibration data, or even identifying stars (let alone cross-matching stars in repeat observations). This is unreal! He also showed that the information that comes from high-signal-to-noise image patches is very different from the information that comes from low-signal-to-noise (read: blank) patches. Afterwards, Krik Muandet (MPI-IS) and I discussed how to evaluate his quasar target selection methods, which are based on support measure machines, a generalization of support vector machines to the space of probability distribution functions.


gravitational radiation all the time

Two long meetings today with Mike Kesden (NYU) and Gabe Perez-Giz (NYU) about gravitational radiation. In the first we discussed the idea (from Lam Hui) that binary pulsars could be used as resonant detectors of gravitational radiation. We agree with Hui's conclusions (that they can, but that they aren't all that sensitive at present) but we disagree with the interpretation of the result in terms of sources. We are trying to write a clarifying document. Or at least Kesden is.

In the second conversation we talked about doing an external analysis of extant LIGO data. I have various ideas about modeling the strain or strain noise using the (massive, abundant) LIGO housekeeping data, to increase the sensitivity of the experiment to real gravitational signals. We talked about a wide range of things and tentatively agreed that we should explore entering into a memorandum-of-understanding with the LIGO Collaboration. But first we are going to do a tiny bit more homework. That is, Perez-Giz is going to.


theories and measurements of galaxies

Annalisa Pillepich (UCSC) gave a seminar in the morning about the effects of baryons on galaxy formation, comparing dark-matter-only and more realistic simulations. She finds (as do Zolotov and collaborators) that baryons can have a big effect on the structure of dark-matter halos. She finds that baryons have a big enough effect to resolve many of the remaining issues in the comparison of simulations and observations of low-mass galaxies in the Local Group. Really her conclusion (and my reading of her and Zolotov's work) is that—though there are large uncertainties—the magnitudes of the effects of baryons are large enough that they could resolve most current issues, and other things might help resolve them too. So once again the dark matter could be vanilla CDM. That's bad!

In the afternoon, Stephane Courteau (Queen's) argued similarly that theories of galaxy evolution have enough uncertainties in them that they don't really predict the properties of the galaxy population well, but from the opposite perspective: He argued that no simulations get all the observed properties of the disk-galaxy population right at the present day. He argued that there needs to be very strong feedback to explain the population in current semi-analytic-style models. He also showed some amazingly good scaling relations for disk galaxies, measured carefully and lovingly by his group.

Late in the day, he argued with me that doing galaxy photometry by fitting simple parameterized models is always wrong. That's what I am doing with my Atlas project with Mykytyn and Patel. I don't agree with Courteau here—my view is that all photometry is in effect model fitting—but his points are good and need to be addressed. Translated into my language, effectively he argues for very flexible photometric models of galaxies; non-parametric models if you will. I agree but only on the condition that those flexible models are regularized so that in effect galaxies observed at different angular resolutions are being photometered equivalently. Well, at least for my purposes I need that (I want to find similar galaxies at different distances).


emcee vs Gibbs

Tang, Marshall, and I have a problem for which we think Gibbs sampling will beat the ensemble sampler emcee (by Foreman-Mackey). The issue is that you can't really Gibbs sample with emcee: In Gibbs sampling you hold a bunch of parameters fixed while you independently sample the others, and swap back and forth (or cycle through groups of easily sampled subsets of parameters). In the ensemble (of the ensemble sampler), if different walkers hold different values for the parameters in the fixed subset, then the parameters in the sampling subset are being drawn from different posterior PDFs. That breaks the whole ensemble concept. So we are confused. Tang's job is to implement simple Metropolis–Hastings Gibbs sampling and see if it beats emcee. I am afraid it will as the number of parameters gets large. This is just more annoying evidence that there is no one-size-fits-all or kitchen-sink sampler that solves all problems best. If I worked in stats I think I would try to make that thing.
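
Tang's assignment, in sketch form (toy Gaussian target of my own choosing, not his actual implementation): cycle through the coordinates, updating each one with a one-dimensional Metropolis step while holding the rest fixed.

```python
import numpy as np

def mh_within_gibbs(log_p, x0, n_steps, step=0.5, rng=np.random.default_rng(0)):
    """Metropolis-within-Gibbs: cycle through coordinates, updating each
    with a 1-d Gaussian random-walk Metropolis step."""
    x = np.array(x0, dtype=float)
    chain = np.empty((n_steps, x.size))
    for t in range(n_steps):
        for i in range(x.size):
            prop = x.copy()
            prop[i] += step * rng.standard_normal()   # perturb one coordinate
            if np.log(rng.uniform()) < log_p(prop) - log_p(x):
                x = prop                              # accept
        chain[t] = x
    return chain

# toy target: independent standard Gaussians in 5 dimensions
log_p = lambda x: -0.5 * np.sum(x**2)
chain = mh_within_gibbs(log_p, np.zeros(5), 5000)
print(chain.mean(axis=0).round(1))
```

The appeal in high dimensions is that each one-dimensional update can have a well-tuned step size, whereas a global proposal has to compromise across all the parameters at once.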


Bloom in town

In the morning, with the computer vision types at NIPS, it was just us astronomers. Fadely and I worked on making fake data to test our calibration systems, while Foreman-Mackey ran MCMC sampling on some Kepler transit data. In the afternoon, I met up with Josh Bloom (Berkeley) at an undisclosed location to discuss various things in play. We spent a bit of time realizing that we are very closely aligned on automated decision making. Somehow it has to involve your long-term future discounted free cash flow, measured in currency units.


magnetic fields

In a low-research day, Jonathan Zrake gave a nice brown-bag talk about how magnetic fields are amplified in turbulent plasma. I also had conversations with Fadely, Willman, and collaborators about star–galaxy separation, and Foreman-Mackey about author–topic modeling.