cython and fastness

Yesterday I put some stolen time into looking at whether we can measure short-period stellar oscillations in long-cadence Kepler data. The point is that Kepler long-cadence data have 30-min exposure times, but stellar oscillations in G dwarfs have 5-ish-min periods. The aperiodicity of the Kepler exposures might save the day: In principle, the exposure times drift relative to a strictly periodic schedule because of the light-travel-time variations to the Kepler field induced by the spacecraft's orbit. I wrote code to forward-model this and see if we can infer the very short stellar-oscillation periods. It looks (from preliminary experiments) like we can! If the noise is friendly, that is.
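To make this concrete, here is a minimal sketch of the effect. Everything in it is assumed or approximate (the 1765.5-s cadence, a 190-s light-travel-time amplitude, a 372.5-day orbit), and it treats the data as instantaneous samples rather than 30-min integrations. The point it demonstrates: under strictly regular sampling, a 5-min oscillation is indistinguishable from its low-frequency aliases, but the light-travel-time jitter breaks that degeneracy.

```python
import numpy as np

cadence = 1765.5                   # Kepler long-cadence spacing in seconds (approximate)
t_reg = np.arange(4000) * cadence  # a strictly periodic exposure schedule

# light-travel-time drift induced by the spacecraft orbit (amplitude and period assumed)
jitter = 190.0 * np.sin(2.0 * np.pi * t_reg / (372.5 * 86400.0))
t = t_reg + jitter                 # the actual, aperiodic sample times

P_true = 300.0                     # a 5-minute oscillation, way above the naive Nyquist limit
rng = np.random.default_rng(42)
y = np.sin(2.0 * np.pi * t / P_true) + 0.01 * rng.normal(size=t.size)

def chi2(t, y, period):
    """Residual sum of squares after a linear sin+cos fit at a trial period."""
    A = np.column_stack([np.sin(2.0 * np.pi * t / period),
                         np.cos(2.0 * np.pi * t / period)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ coef) ** 2)

# at the regular times t_reg, this alias period would fit exactly as well as P_true
P_alias = 1.0 / (1.0 / P_true - 5.0 / cadence)
```

At the jittered times, the fit at P_true beats the fit at P_alias by a wide margin, because the alias model accumulates a large phase error as the light-travel time drifts over the orbit; at the regular times the two fits would be numerically identical.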

Today I met up with Foreman-Mackey and he showed me how to convert the slow parts of my code over to cython. That sped up my code by a factor of 300. That's why I pay Foreman-Mackey the big bucks! So now I have a functional piece of code that just might be able to do asteroseismology way below the naive "Nyquist limit". Stoked! If we can get this to work, there are thousands (or even tens of thousands) of stars that might present to us their asteroseismological secrets.



At group meeting, Malz discussed this paper by Sheldon et al, which obtains a redshift distribution and individual-object photometric redshifts from photometric data and a heterogeneous training set of objects with spectroscopic redshifts. The Sheldon paper (which refers to the Cunha method) is an example of likelihood-free inference, in that it creates a redshift distribution and posteriors for redshift for individual objects without ever writing down a likelihood function. This is good, in a way, because it doesn't require parameterizing galaxy spectral energy distributions. But it is bad because, for one, any user of these probabilistic outputs wants likelihood information, not posterior information! It is also bad because we do have ideas about the probability of the data—we know things about the noise in photometric space—and these are the principal inputs to a likelihood function. Malz and I plan to write some kind of paper about all this, but we are still confused about exactly what that paper would contain.


AAAC, day 2

Today was the second day of the AAAC meeting in DC; I participated remotely from NYC. The most interesting discussions of the day were around proposal pressure and causes for low proposal acceptance rates (around 15 percent for most proposal categories at NSF and NASA). The number of proposals to the agencies is rising while budgets are flat or worse (for individual-investigator grants at NSF they are declining because of increasing costs of facilities). The question is: What is leading to the increase in proposal submissions? It is substantial; something like a factor of three over the last 25 years. At the same time, the AAS membership has increased, but only by tens of percent.

With data in hand, we can rule out a few easy hypotheses: It is not people just sending in lots of bad proposals; proposals are still very good, and even proposals rated Excellent and Very Good now have less than 50-percent acceptance rates. It is not proposals by young people; the increase seems to be coming from senior people (people with tenure). It is not more proposals in any one year by the same PI; most PIs put in only one proposal to (say) the NSF AAG call in a year. What it might be is that people are no longer waiting three years between proposal submissions; I know I don't! 25 years ago, a successful research-active astronomer would put in one proposal every three years. Now I think most of us put in a proposal most years. That might account for it, but we don't yet know. It is not absolutely trivial to get the data.

In addition to getting data from the funding agencies, we also plan to do some kind of survey of the community. This will give less "hard" data, but will be able to give answers to questions that we can't ask of the data at the agencies, about things like motivation, perception, and reasoning among proposal PIs.


AAAC, day 1

Today was a meeting in Washington, DC of the Astronomy and Astrophysics Advisory Committee that advises NSF, NASA, and the DOE on their overlapping interests in astronomy and astrophysics. A few highlights were the following:

NASA plans to start studies of some of the top mission concepts for big missions that might be put forward to the decadal survey in 2020. That is, they want the community to go into the 2020 decadal process with some really well researched and feasible, ambitious missions. The white-paper describing all this is here (PDF) and deserves reading and comment by the community. The best way to influence one of the mission studies is to join and get involved in the relevant "PAG" (whatever that is).

NSF and NASA did well in the FY15 budget process, but DOE HEP less well; this pressures DOE on its priorities. Nonetheless, it is attempting to move forward in all three areas of next-generation CMB experiment, next-generation dark-matter detection experiment, and something like DESI. The DOE feels a strong commitment in all these areas.

Dressler is chairing a committee to assess the "lessons learned" from the last decadal process. Somehow this slipped past me unnoticed! I wrote him email during the meeting and will send him comments tomorrow, but in fact I think it is too late to influence the survey of the survey, which is too bad. As my loyal reader (or friend) might know, I have issues with some of the decadal-survey processes. In particular, I was horrified by the non-disclosure agreement that members of the survey were asked to sign. (That's why I didn't participate.) It said, and I quote:

"When discussing the survey process with anyone from outside the survey committee, a panel member, or a consultant appointed to the survey, you should avoid discussing any matter other than the publicly available information, such as shown on the NRC’s public web pages.

"Information discussed should be limited to presentations and discussions held in open session, materials circulated in open session only, and other publicly accessible information. Discussions by the committee, subcommittees, and panels held in closed session are confidential and should not be discussed or revealed. Similarly draft documents circulated in closed sessions of a meeting or in the time periods between meetings should be treated as privileged documents and should not be shared outside the survey. This is particularly important for any documents that contain draft conclusions or recommendations, since these survey outputs are not final until the NRC review process for the report is complete."

Astronomers will either honor this agreement—in which case we will never be able to know what really happened in the decadal survey—or else they will violate it—in which case they are violating the law! Indeed, and of course, I know of many violations (since who can't talk about what happens behind closed doors?).

On the way home, I built a sandbox for looking at Nyquist-violating Fourier analysis of Kepler data. The idea is (and Vanderplas is a huge supporter of this) that if you have a great generative model of your data, there really is no true "limit". We shall see if we can do anything useful with it.


K2 search

I spent most of our snow day doing Center for Data Science administration, but when I got a bit of research time, I worked on Foreman-Mackey's K2 exoplanet search paper. I also had a great phone conversation with Rix about The Cannon, and especially what the next papers should be. We need a refactored code and code release paper (led by Anna Ho, MPIA) and we need to try using The Cannon to diagnose issues with the physical models. We also have to use The Cannon to tie together the stellar parameter "systems" of two (or more) surveys, since this was the whole point of the project, originally.


what can you learn from galaxy spectra?

In the lunch-time brown-bag spot, Mike Blanton gave a great talk about galaxy spectra from SDSS-IV, especially the MaNGA project. He drew an amazingly realistic galaxy spectrum on the board and talked about what can be learned from various spectral features, including galaxy age (or star-formation history), chemical abundances (including abundance anomalies that connect back to star-formation history), and mass function (or dwarf-to-giant ratio). The MaNGA survey will make all of these things possible as a function of position within the galaxy for ten-thousand galaxies, which is exciting. After the talk we discussed a bit what might happen if SDSS-IV APOGEE-2 took some long-exposure spectra of galaxies; presumably there would be tons of dwarf-to-giant information, and also a great measurement of the velocity distributions of stars of different types. Interesting to think about!


a little writing

In a work day that ended up being almost all not research, I did do a tiny bit of writing in my MCMC tutorial and some cosmological data-analysis fantasies.


finding exoplanets that aren't transiting

Avi Shporer (JPL) gave a nice talk this morning about finding planets in the Kepler data that are not transiting, using what's called "phase curves". There are multiple effects that make this possible: One is that close-in (hot) planets are heated asymmetrically, so they add to the lightcurve in a non-trivial way through the orbit. Another is relativistic beaming (seriously!), and a third is the ellipticity induced in the star tidally by the planet. Shporer has done a search of the Kepler data for such signals. He hasn't found many, but what he has found are all interesting. I got interested during the talk in the induced-ellipticity effect; this might have odd phase relations if the star is spinning.

In group meeting, we talked about various things, including Huppenkothen's Fermi proposal, Vakili's point-spread function model, and my proposals for cosmological inference using likelihood-free methods. I had lunch with Andrew Zirm (greenhouse.io) who used to work on galaxy evolution and is now a data scientist here in New York City. In the afternoon, Malz and I figured out the absolute simplest possible first project for doing hierarchical inference with probabilistic photometric redshifts.


stream team

Kathryn Johnston and Marla Geha both brought their groups to NYU to discuss tidal streams in the Milky Way. Ana Bonaca showed the calibration process underway on her huge survey of the Palomar 5 stream. This globular cluster tidal stream extends for many tens of degrees in the SDSS imaging but then goes off the edge of the survey. Her DECam imaging extends along the probable track. No results yet, but it is very exciting to see if the stream continues and what can be learned from it. Price-Whelan showed tantalizing evidence that, in a potential filled with chaotic orbits, tidal streams from progenitors on orbits with fast diffusion rates (short Lyapunov times) will produce streams with very different morphologies from those on orbits with slow diffusion rates (long Lyapunov times). It is not yet clear how simple or general this is, but it is possible that chaos will be visible directly in streams and that the long streams we see are on especially (or unusually) non-chaotic orbits. That could explain things about alignments and so on. Emily Sandford (Columbia) is looking at better methods for identifying substructure interactions in the morphologies (including density distribution) of tidal streams. She is able to find energy gaps very reliably and is now looking at identifying the interactions in the simulation that produced them. The streams she has (which are from realistic n-body simulations) are really strange and unlike our standard observed streams; this might all relate to Price-Whelan's project!


YCAA seminar

I spent the day at the Yale Center for Astronomy and Astrophysics (hosted by Tabby Boyajian), where I gave a seminar on data-driven models and The Cannon. I spent quite a bit of time distinguishing the model we use from vanilla machine learning; I really think that machine learning in its basic standard form is not that useful in astronomy (of course this depends in detail on your definition of the "scope" of "machine learning"). After my talk I did the rounds, and had many interesting discussions. One of these was with Andrew Hearin (Yale) who is working on generalizing the halo occupation model for large-scale structure modeling. He is convinced that we can't beat ten-percent-ish inferences about cosmology at 10-Mpc-ish scales unless we make the halo model far more general. At dinner, we had great conversations about the future of data-driven modeling and astrophysics. Thanks, Yale!


don't Yale at me!

I spent my research time today preparing a talk for Yale tomorrow. I am speaking about The Cannon (Ness, Rix, and my data-driven model for stellar spectra), which is doing a great job of labeling APOGEE spectra. It is a new talk, so I started from scratch. Big question: Will the projector be 4:3 or 16:9? Of course I can switch my slides with one line of LaTeX.


light echoes in LSST

Fed Bianco (NYU) gave a great talk today about eta Carinae and light echoes, mass loss, and the progenitors of stripped supernovae. Amazing stuff, especially the light echoes. One thing the whole audience was excited about was the idea that if you had truly high-quality synoptic imaging like LSST will provide, you might be able to do some amazing simultaneous modeling of the dust (the reflector) and time-variable point sources (the radiation field). My intuition that this "blind deconvolution" problem could be solved is based on the facts that (a) the radiation source is a very sparse set of point sources and (b) the dust lies in a (fairly) continuous, (fairly) isotropic distribution, while the light echo geometry is the surface of an expanding anisotropic ellipsoid. I bet that one isn't in the LSST Project Book.

In the afternoon, I spoke with Vakili about a possible huge speed-up in Cramer-Rao-saturating point-source centroiding methods, and I spoke with Huppenkothen about gamma-ray bursts in the Fermi data. On the latter, she is planning on putting in a data analysis proposal which would develop further our "Magnetron" project for modeling bursty bursts.


good ideas over beer

Three good ideas came up in discussion with Foreman-Mackey and Fadely today. First, we formulated the likelihood-free inference for cosmological parameters given large-scale structure data (that I mentioned two days ago at CtU 2015) and concluded that it might be possible to perform this inference inexpensively with data and simulations in hand (that is, if we can convince some of the BOSS people to help out). Second, we discussed expanding Fadely's work on modeling the SDSS imaging catalog output to an infinite mixture of Gaussians using something like the Dirichlet process. Fadely pointed out that this might be impossible because we are doing the XD thing of convolving each Gaussian with each individual data point's individual uncertainty variance tensor. We agreed to at least look at the math, or maybe employ one of our math friends to help. Third, we talked about finding planets in K2 data by simultaneously fitting each lightcurve with a transit model plus a Gaussian-process-combined set of other-star lightcurves. That is, what we have been thinking about for Kepler search but (a) fitting the nuisances simultaneously and marginalizing out, and (b) predicting that all this will be even more valuable for K2 data than it was in the Kepler data.


CtU 2015, day 3

I gave a second talk today at Computing the Universe, this time on nuisance parameter models. I continued my discussion of Gaussian processes and then made the following points: You must fit your nuisance simultaneously with your parameters of interest and you must marginalize them out; you don't want to estimate anything you don't care about. Don't fit and subtract or you will be subject to over- or under-fitting. The most conservative thing you can do is implement a model with a huge number of free parameters; it is not conservative to restrict the freedom of the nuisance model. Gaussian processes don't need to just be processes of spatial position or time; they can be of other features as well.
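As a toy illustration of the fit-and-subtract trap (all the numbers here are invented): the signal is an amplitude on a known template, and the nuisance is a cubic trend that partially overlaps the template. Fitting the nuisance alone and subtracting it absorbs part of the signal and biases the amplitude low; the simultaneous fit (which, for a linear nuisance under a flat prior, is the same as marginalizing it out) recovers the truth.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
g = np.sin(4.0 * np.pi * x)          # signal template; its amplitude is the parameter of interest
B = np.vander(x, 4)                  # cubic nuisance basis (overlaps the template substantially)
a_true = 1.0
y = a_true * g + B @ np.array([0.5, -1.0, 0.3, 2.0]) + 0.1 * rng.normal(size=x.size)

# fit-and-subtract: fit the nuisance alone, subtract, then estimate the amplitude
cN, *_ = np.linalg.lstsq(B, y, rcond=None)
a_sub = g @ (y - B @ cN) / (g @ g)   # biased low: the nuisance fit has eaten part of the signal

# simultaneous fit of signal plus nuisance; for a linear model with a flat
# prior, reading off the signal coefficient is equivalent to marginalization
A = np.column_stack([g, B])
cJ, *_ = np.linalg.lstsq(A, y, rcond=None)
a_joint = cJ[0]
```

The joint estimate also comes with the correct (inflated) uncertainty, because the covariance of the joint fit accounts for the signal-nuisance degeneracy; the subtracted estimate is both biased and falsely confident.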

Again, great talks for the rest of the day; highlights included the following: Abel (Stanford) showed that if you reinterpret n-body simulations as being a set of points that define the corners of mass-filled tetrahedra (rather than mass points), the effective resolution increases and the accuracy for some calculations increases. He also talked about sensible and clever mesh refinement tricks to deliver very high precision at low additional cost. Norvig (Google) talked about the engineering at Google that permits huge operations on huge clusters with lots of contributing people. Press (Texas) gave a tour of all the things that he thought were essential to data analysis in astrophysics that are not in the third edition of Numerical Recipes. He talked about graphical models, kernel methods, Hamiltonian MCMC, the lasso (and other L1 regularizations), Dirichlet processes, the multi-armed bandit, and methods for the solution of systems of linear and non-linear equations. Skillman (Stanford) talked about an immense (trillion-particle) simulation that he has done and fully released on the web for anyone to use. His talk was hilarious in part because the data set is so damned huge. He has built great tools for interacting with the data efficiently. Schneider (Livermore) talked about our MBI entry to the GREAT3 weak-lensing competition. He did a great job of explaining the hierarchical inference, and describing The Tractor and also our importance-sampling inference method.


CtU 2015, day 2

Today was the second day of Computing the Universe 2015 and Cosmo Hack Week. I spoke today about replacing frequentist correlation-function estimators with probabilistic inferences. I talked about likelihood-free inference (called "ABC" in previous posts, but I don't like that name), and about writing down a likelihood function based on Gaussian processes. Martin White (Berkeley) was my live critic, which was useful (and flattering).

There were many good talks; some highlights were the following: Cohn (Berkeley) talked about using unsupervised machine-learning methods to create features or descriptors of galaxy mass build-up histories, and then using a supervised regression to relate them to single-epoch observables; that seems like a productive way to think about designing new experiments. Budavari (JHU) spoke about hierarchical Bayes using GPUs to do the heavy lifting. He spoke mainly about GPU architecture and coding, but the hierarchical inference ideas he gave as his starting point are very relevant to Malz's projects; I have to get the two of them in touch. Font-Ribera (Berkeley) and Beutler (Berkeley) talked about BOSS large-scale structure results; the former about the IGM and baryon acoustic feature at redshift 2.5, and the latter about galaxies and the same feature at low redshift. Beautiful results in both cases. Schmittful (Berkeley) gave a beautifully clear talk on three-point functions and optimal estimators thereof. He has very nice results based on analysis of an Edgeworth expansion away from the Gaussian.

After the talks was hacking; at the hacking session I got together a team to argue out my ideas about CMB foregrounds. The discussion was very clarifying and I think there are some very simple baby steps towards a good project.


The Cannon discussion section

I put my research time today into the discussion section in the nearly finished paper by Ness et al on The Cannon, our project to put parameter and chemical abundance labels on stars without using stellar models. In my current practice, I built a part of the discussion around our assumptions: What assumptions did we make, why did we make them, and what would be the costs and benefits of changing them.


autocorrelation function

My only research activity on the airplane home from New Zealand was to work more in our MCMC users' manual. I wrote about the autocorrelation function and autocorrelation time of the chain, and how to use it in judging convergence and estimating uncertainties on sampling-based integral estimates.
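A minimal sketch of the estimator I have in mind (the AR(1) test chain and the fixed summation window are my choices here, not anything from the manual): compute the empirical autocorrelation function with an FFT, sum it to get the integrated autocorrelation time tau, and use n / tau as the effective number of independent samples.

```python
import numpy as np

def autocorr_time(chain, window=100):
    """Integrated autocorrelation time tau = 1 + 2 * sum_k rho_k,
    with the sum truncated at a hand-picked window."""
    x = chain - chain.mean()
    n = x.size
    f = np.fft.rfft(x, 2 * n)                  # zero-pad to avoid circular wrap-around
    acf = np.fft.irfft(f * np.conjugate(f))[:n]
    acf /= acf[0]                              # normalize so that rho_0 = 1
    return 1.0 + 2.0 * acf[1:window].sum()

# test on an AR(1) chain, where the true answer is (1 + rho) / (1 - rho)
rng = np.random.default_rng(1)
rho = 0.9                                      # true tau = 19
n = 100_000
eps = rng.normal(size=n)
chain = np.empty(n)
chain[0] = eps[0]
for i in range(1, n):
    chain[i] = rho * chain[i - 1] + eps[i]

tau = autocorr_time(chain)
n_eff = n / tau   # effective number of independent samples in the chain
```

The uncertainty on a sampling-based mean estimate then scales as chain.std() * sqrt(tau / n), not chain.std() / sqrt(n). In practice the window should be chosen adaptively (for example with Sokal's heuristic) rather than fixed by hand.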


functional testing an image model

Frean and I—hacking for the fourth day in a row at the Victoria University Vic Books location—started the day by understanding, in real and fake data, how our image model likelihood behaves as a function of position and region size. There were some surprises there, all obvious in retrospect: One was that if there is any gradient or non-stationarity in the background pixels, the model wants to split the background into two (or more) huge boxes. Another was that the success of the model in distinguishing source from background is not strongly dependent on the prior we put on the background histogram model. That is, even if we give the same prior on histograms to the background and the source, the model still separates them.

By the end of the day we had a rudimentary optimizer and sampler for our model. Even with all our index trickery the model evaluations are still expensive. We hope to build a greedy optimizer version called "OExtractor" as a swap-in replacement for you-know-what!


coding a new image model

Frean and I started coding up our generative image model. Since all that matters is histograms, we only need counts in regions of the image. Frean had a nice trick in which we pre-cumulate what you might call the "indicator" images, one per histogram bin, and then counts in bins become some very simple operations only on the corners of the region. Awesome and fast and very numpythonic.
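Here is my understanding of the trick, sketched with invented image shapes and bin counts: build one summed-area ("pre-cumulated indicator") table per histogram bin, after which the histogram of any rectangular region costs just four corner lookups per bin.

```python
import numpy as np

rng = np.random.default_rng(2)
nbins = 4
img = rng.integers(0, nbins, size=(64, 64))   # pixel values already digitized into histogram bins

# pre-cumulate one indicator image per bin (a summed-area table each);
# the extra leading row/column of zeros makes the corner algebra uniform
sat = np.zeros((nbins, 65, 65), dtype=np.int64)
for b in range(nbins):
    sat[b, 1:, 1:] = (img == b).cumsum(axis=0).cumsum(axis=1)

def region_hist(sat, r0, r1, c0, c1):
    """Histogram of img[r0:r1, c0:c1], from four corner lookups per bin."""
    return (sat[:, r1, c1] - sat[:, r0, c1]
            - sat[:, r1, c0] + sat[:, r0, c0])

h = region_hist(sat, 10, 30, 5, 25)
```

Sliding a region around the image therefore costs O(nbins) per position, independent of the region's area, which is what makes a dense likelihood map over positions and sizes affordable.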

We tested the method by swinging a patch over an SDSS image, and looking at the likelihood of a source-plus-background model as a function of source location. Obvious sources in the SDSS image stood out at huge confidence in this one-source likelihood map. And the model is just "the pixel histogram of a source is different from that of the background" plus a bit of "we know the background is close to flat" in quantile-based bins. Within the source and background regions there are no beliefs; that is, we treat the pixels within the regions as having been generated iid.

This model is very naive—it doesn't represent our complex beliefs about astronomical images. However, this will hurt us most when we think about compact sources, and least when we think about large, low surface-brightness (intensity) sources, which was the original goal of the project. We decided that we are not interested in bright sources: They are easy! This is a project about finding subtle features (and subtle problems with the data).


yet another generative model for images

Work continued today with Frean and Brewer. After a second long day of talk, we came up with a plan, which is a generative model for astronomical images and (limited) implementation goals. The model is that the image has compact patches generated by a "source" model and the rest is generated by a "background" model. Each of these submodels—every source and the background—has a unique generating pixel histogram; no other spatial information in the image is used at all! That is, the model is that there are contiguous patches, within each of which the pixel data are produced by iid draws from a unique histogram model.

The nice thing is that Frean has good "Dirichlet multinomial" technology for marginalizing over histograms under a particular prior, so the model can be marginalized trivially. This project violates many of my rules, because it makes very little use of my knowledge of astronomy, and it employs conjugate priors, which are always wrong. That said, all models are wrong, so sometimes expediency is a good idea. This project is very aligned with others of my rules, because we have been able to write down a full likelihood and priors over all of the nuisance parameters. That is, we have a generative model for astronomical images. It isn't a good model, but it is (or might be) very fast to compute. We also have a graphical-model representation of it all.
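The marginalization in question has a closed form. Here is a sketch (symmetric prior, made-up counts; since the pixels are treated as iid with known positions, no multinomial coefficient appears): the probability of a region's pixel values, with the histogram integrated out under a Dirichlet(alpha) prior, is a ratio of gamma functions.

```python
from math import lgamma

def log_dm(counts, alpha=1.0):
    """Log marginal likelihood of an ordered iid categorical sample:
    log Gamma(A) - log Gamma(A + N) + sum_b [log Gamma(alpha + n_b) - log Gamma(alpha)],
    where A = K * alpha and N = sum_b n_b (the Dirichlet-multinomial marginal)."""
    A = alpha * len(counts)
    N = sum(counts)
    out = lgamma(A) - lgamma(A + N)
    for nb in counts:
        out += lgamma(alpha + nb) - lgamma(alpha)
    return out

# a concentrated "background" histogram and a flatter "source" histogram (invented counts)
bg = [400, 80, 15, 5]
src = [40, 30, 20, 10]

# two separate histograms vs one shared histogram for all the pixels
split = log_dm(bg) + log_dm(src)
merged = log_dm([b + s for b, s in zip(bg, src)])
```

Because the two regions have genuinely different histograms, the split model wins; that difference in log marginal likelihood is the source-vs-background comparison, with no histogram parameters ever fit.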

We are ready to code, and hope to start tomorrow. One strong intuition we have is that there are very simple index-algebra and precomputation things we can do to make the (marginalized) likelihood values very easy to compute. I hope this ends up being true!


products of deep networks, SKA

I spent the day (which is a Monday, but I am blogging it on a Sunday, because of international date-line insanity) with Marcus Frean (University of Victoria, Wellington) and Brewer (who came down to Wellington from Auckland). We spent the day arguing about what we should try to accomplish this week. When Frean and I last spoke, it was about probabilistic models for astronomical imaging, where he was trying to find subtle sources (like low surface-brightness galaxies) in radio imaging from the SKA pathfinders. The options we argued about ranged from tiny tweaks of things we have already done, or running existing software on new data, all the way to building a probabilistic generative model of all astronomical data ever (my usual plan!).

In the afternoon, we talked about deep networks and why they are powerful and yet tractable to train. I still don't fully understand, but I am closer. But then we discussed Frean's observation that most data sets are generated by multiple causes, so you really want not just a deep network, but a graphical model that has two (or more) independent deep networks pointing at the data. That is, we want products of deep networks, if we really want to represent the structure of the data in the world. We stared at that for a while; it appears to maybe require the probabilistic generalization of deep networks (about which Frean has no fear). He has some example code that operates the product of deep belief nets on toy problems!

At lunch, we were joined by Melanie Johnston-Hollitt (Victoria), who is deeply involved in New Zealand's (proposed, I think) buy-in to the SKA project. We discussed radio astronomy data analysis, the management and operation of global astronomical projects (including Sloan and SKA), and New Zealand's role in the SKA. I am very pleased to see New Zealand in the SKA; this could be huge for New Zealand and for the whole astronomical community.


CMB foreground latent variable model

Vacation is over! I spent the train ride from Auckland to Wellington working on the first paper about The Cannon and also my MCMC tutorial. I also thought a bit more about CMB real-space analysis and foregrounds, which led to a breakthrough of sorts:

I realized that the ideal model for the CMB foregrounds is not to try to model each band with linear (or non-linear) combinations of other bands; the right idea is that there are a (hopefully small) number of latent variables that control the emission in many bands; the emission in any one band is a (possibly very non-linear) function of those latent variables. This model would consist of something like K latent-variable all-sky maps and a set of Gaussian processes (one for each observed band) that control how the latent variables generate the bands. Bonus points: Make the non-linear function (generated by the Gaussian process) also be a smooth function of frequency, which I think just involves a small meta-up. In writing this down, I realized that it is just a generalization of the Gaussian Process Latent Variable Model (GPLVM). Great project for Cosmo Hack Week coming up in a bit more than a week's time.
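A forward-model sketch of the idea, with everything about it assumed (one latent map, six bands, an RBF kernel, cheap GP draws on a grid): each band's emission is a distinct smooth nonlinear function, drawn from a Gaussian process, of the same per-pixel latent variable.

```python
import numpy as np

rng = np.random.default_rng(3)
npix, nband = 500, 6          # "pixels" of a toy all-sky map, observed bands
z = rng.normal(size=npix)     # the latent foreground map (one latent variable per pixel)

# one GP draw per band: a random smooth response curve f_b(z),
# sampled on a grid with an RBF kernel and evaluated by interpolation
grid = np.linspace(-4.0, 4.0, 200)
cov = np.exp(-0.5 * (grid[:, None] - grid[None, :]) ** 2) + 1e-6 * np.eye(grid.size)
L = np.linalg.cholesky(cov)
maps = np.empty((npix, nband))
for b in range(nband):
    f_b = L @ rng.normal(size=grid.size)    # a smooth nonlinear function of z
    maps[:, b] = np.interp(z, grid, f_b)    # that band's emission at every pixel
```

Inference would run this backwards: given the observed band maps, recover the latent map z and the per-band response functions jointly, which is exactly the GPLVM structure. The point of the per-band nonlinearity is that no fixed linear combination of bands would remove such a foreground.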