spike sorting and astronomical catalogs

In the morning, arguments continued with Magland about the neuroscience problem of spike sorting. He showed some beautiful visualizations of real spike-train data from live rat brains. We talked about the connection between spike sorting and decision making: Just like in astronomical catalog generation, unless you have a way to deliver a sampling or pdf in catalog space, a spike-sorting algorithm is a decision-making algorithm. As such, it must include a utility!

In the afternoon, I gave the Applied Mathematics Seminar at the Courant Institute at NYU. I spoke about exoplanet search, where we have used lots of applied linear algebra ideas.


spike sorting

On a totally different topic, I worked for a bit on the neuroscience spike-sorting problem with Magland. I had suggested a trivial E-M-like algorithm for the problem, and he was testing it out with some toy fake data. He found some anomalies in the scaling of the noise in the answer, which we spent time trying to unwind.


ages of stars, overlaps of sets

In the morning Keith Hawkins (Cambridge) showed up for my group meeting. He made the strong argument that if we could measure stellar ages, we could see structure in the Milky Way that won't be revealed by (blunt) chemical tracers alone. I agree, and he came to the right place!

In the afternoon at the SCDA, Charlie Epstein (Penn) gave Magland and me some very good arguments or intuitions about the chaotic map we have been playing with in the phase-retrieval problem. He argued that the projection operators can be designed to greatly increase the effective overlap of sets for which the map is looking. Recall that the goal is to find the overlap of two sets. In the conversation, he effectively generalized the concept of “reflection” (which might be, for example, across a line) to the equivalent for any kind of set (that is, for things other than lines and points). Awesome!
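To make the generalized reflection concrete, here is a minimal sketch. It assumes each set comes with a nearest-point projection P, and it defines the reflection through the set as 2 P(x) - x, which reduces to the ordinary mirror image when the set is a line. The two sets (a horizontal line and a disk), the function names, and the averaged-reflection (Douglas-Rachford-style) update are all illustrative assumptions; this is not necessarily the map discussed above.

```python
import numpy as np

def project_line(p, height=0.5):
    """Nearest point to p on the line y = height (hypothetical set A)."""
    return np.array([p[0], height])

def project_disk(p, radius=1.0):
    """Nearest point to p in the closed disk of this radius (hypothetical set B)."""
    norm = np.linalg.norm(p)
    return p.copy() if norm <= radius else p * (radius / norm)

def reflect(project, p):
    """Generalized reflection through a set: step to the set, then as far again."""
    return 2.0 * project(p) - p

def find_overlap(p0, n_iter=200):
    """Averaged-reflection iteration; returns the projection of the iterate onto A."""
    p = np.asarray(p0, dtype=float)
    for _ in range(n_iter):
        p = p + project_disk(reflect(project_line, p)) - project_line(p)
    return project_line(p)

overlap_point = find_overlap([3.0, 2.0])
print(overlap_point)  # a point on the line that also lies inside the disk
```

Both sets here are convex, which is the easy case. The phase-retrieval sets are not convex, and that is where the dynamics become chaotic.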


sampling in hard problems

In the morning, I had a call with Foreman-Mackey. We talked about various things. One is the possibility that we could fully sample the galaxy-deprojection or cryo-EM problems. My optimism comes from the fact that there are many samplings of low-level latent parameters that can be done independently at fixed high-level parameters. My pessimism comes from the fact that there are so many parameters. Foreman-Mackey was optimistic. We also talked about building a physical model for the Kepler focal plane (PSF, flat-field, and so on) for K2 data. We were a bit pessimistic about our options here, but we are contractually obliged to deliver something. We discussed ways we might combine data-driven and physics-driven approaches.

In the afternoon, Tarmo Aijo (SCDA) and the Rich Bonneau (SCDA) group talked with Greengard and me about their model for the time evolution of the human (gut) biome. They are using a set of Gaussian processes, manipulated into multinomials, to model the relative abundances of various components. It is an extremely sophisticated model, fully sampled by STAN, apparently. They asked us about speeding things up; we opined that it is unlikely (at the scale of their data) that the Gaussian processes are dominating the compute time.


stellar chemistry without models

In my tiny bit of research time today, I had a call with Rix, who wanted to talk about various projects related to APOGEE and The Cannon. In particular we discussed results from Bovy that show that individual star-forming clusters seem to truly form single-abundance populations, with a unique chemical signature. This result is remarkable because he achieved it without any use of physical stellar models! We discussed next steps for The Cannon, including a modification in which we control the model complexity differently at every wavelength. I have great hopes for this. My first-try technology: The Lasso.
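For readers who have not met the Lasso: it is least squares plus an L1 penalty, which drives unneeded coefficients exactly to zero. The sketch below is a minimal proximal-gradient (ISTA) solver on noiseless fake data for a single wavelength. The variable names, the fake data, and the choice of ISTA are illustrative assumptions, not the code of The Cannon.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=2000):
    """Minimize 0.5 * ||y - X w||^2 + lam * ||w||_1 by proximal gradient (ISTA)."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(0)
n_stars, n_terms = 50, 10                 # hypothetical training-set and model sizes
X = rng.normal(size=(n_stars, n_terms))   # fake label-derived design matrix
w_true = np.zeros(n_terms)
w_true[[1, 4]] = [2.0, -1.5]              # only two terms matter at this wavelength
y = X @ w_true                            # noiseless fake flux at one wavelength

w_weak = lasso_ista(X, y, lam=1e-3)       # light penalty: nearly least squares
lam_max = np.max(np.abs(X.T @ y))         # at or above this, every coefficient is zero
w_strong = lasso_ista(X, y, lam=1.1 * lam_max)
```

Controlling complexity differently at every wavelength would then amount to running a fit like this per wavelength pixel, each with its own `lam` (set, for example, by cross-validation).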

Also had a nice call with Dalya Baron and Dovi Poznanski (TAU) about deprojecting galaxies. We figured out that early-type galaxies in the SDSS would be a good place to start, since I know there is a result lurking there.


magic of mathematics

I spent my research time today in Magland's office, watching him explore the magic of the iterated maps I discussed yesterday. To recap: These are chaotic maps that do not optimize any scalar objective function (that we know) but which are attracted to fixed points that are related to (project to) solutions of the equations we want (phase retrieval with arbitrarily good data). We wondered how the author of these maps created them; we tried experiments in which we parameterized various choices and saw which maps work and which don't. We wondered about possible connections to MCMC, which is a stochastic iterated map (these are deterministic). The math is magical.


optimization and chaotic maps

My mind was blown today by the most remarkable solution to the phase-retrieval problem. (This is the problem of inferring the phase of a Fourier Transform given only the squared amplitude, as comes up in diffraction microscopy.) The solution is a chaotic iterated map, designed such that its fixed points are (related to) solutions to the equation, and such that its dynamics is attracted to those fixed points. The paper we found it in doesn't explain how they found the map or how they constructed it or how they tested it, but it just straight-up rocks on our test problems. And these problems are supposed to be officially NP-Hard. That is, the chaotic map takes us to the correct solution (we have a certificate, as it were) with very high probability, in a problem that is supposed to be combinatorially impossible. How this relates to optimization, I don't understand: The map is not justified as an optimizer of any specific objective function. How this relates to MCMC is possibly interesting: An MCMC is like a stochastic iterated map. Magland ended the day very excited about the breakthrough, even though it means that all the code he has been building and testing is now (probably) obviated.


fitting stars, CMDs, and galaxies, day 3

On the last day of the workshop, various people were asked to talk about where they would like to be (or the field to be) in ten years. Rita Tojeiro (St Andrews) gave a nice argument that large LSS surveys are going to be critical and valuable for understanding galaxy evolution. Aaron Dotter (ANU) noted that, among other things, we are going to know a huge amount about stellar binaries. This is important for understanding stars (dynamics and eclipses to get masses and radii), exoplanets (inferences depend on multiplicity), and star formation.

It was a great meeting and thanks go to Charlie Conroy (CfA) for organizing (and for paying for all the food!).


fitting stars, CMDs, and galaxies, day 2

The day began with Pieter van Dokkum (Yale) and me arguing about what's important and how to achieve it. In my view, the biggest issue with everything is that we don't believe any of the models, and yet we have to do science with them. How does that work? I don't think anyone has a clear answer. You can say that you are only getting answers subject to very strong assumptions (that's very true), but that doesn't tell you what to believe and what not to believe. Like, given that the line lists going into 1-D models are wrong in such-and-such ways, what conclusions about stars are in danger of being wrong? In some sense every model-based result we have is mixed with some probability of some kind of ill-specified null model that things are very wrong and anything (within reason, whatever that means) could be happening.

In my research, this really comes up in the question of whether we have true parameters for stars. That is, are we correct about log-g and T-eff and various chemical abundances? At some level it is not just that we can't be right fundamentally (stars don't even have steady-state values for these!) and not just that we can't be right in practice (our models give different answers depending on the data at hand, etc.), it is that we can't know that we are right even when we are as right as we can be. All we can really know is whether we do a good job of predicting new data. Compare models in the space of the data! I emphasized that today.

In response to all these issues, I said one thing that made people uneasy: I said we should focus our attention on problems that can be solved with the tools at hand. We should try to re-cast our projects in terms of things for which our models produce stable predictions (like relative measurements of various kinds). I don't think we should choose our scientific questions based on an abstract concept of “what's interesting”. I think we should choose on the concrete concept of what's possible. I not only think this is true, I think it is how it has always been in the history of science.


fitting stars, CMDs, and galaxies, day 1

Today was the first day of a workshop put together by Charlie Conroy (Harvard) to get people who model stars together with people who model stellar populations and people who model galaxies. I came because I want to understand better the “customers” for any further work we do modeling stars, either with The Cannon or else if we jump into the 1-D modeling problem.

Everyone at the workshop said something today, which was amazing (and valuable) and way too much to report here. One highlight was Phil Cargile (Harvard) showing us how he can update atomic line parameters using observations of the Sun. We discussed how this might be done to jointly improve the predictions for many stars. Another highlight was Alexa Villaume (UCSC) showing us the provenance of the stellar parameters in some of the calibrator-level star sets. It was horrifying (one thing was a weighted average of literature values, weighted by publication date).

A number of people in the room are computing spectra (of stars or star clusters or galaxies) on a grid and then interpolating the grid at likelihood-evaluation time. This started a discussion of whether you should interpolate the spectra themselves or just the log-likelihood value. I argued very strongly for the latter: The likelihood is lower in dimensionality and smoother than the spectrum in its variations. Not everyone agreed. Time for a short paper on this?


GPs in the Fourier domain

The day started with Dun Wang, Steven Mohammed (Columbia), David Schiminovich and me discussing the short-term plans for our work with GALEX. My top priority is to get the flat-field right, because if we can do that, I think we will be able to do everything else (pointing model, focal-plane distortion model, etc.).

Over lunch, Greengard and Jeremy Magland (SCDA) “reminded me” how the FFT works in the case of irregularly sampled data. This was in the context of using Gaussian-process kernels built not in real space but in Fourier space. And then Greengard and Magland more-or-less simultaneously suggested that maybe we can turn all our Gaussian process problems into convolution problems! The basic idea is that the matrix product of a kernel matrix and a vector looks very close to a convolution, and the product with the inverse matrix looks like a deconvolution. And we know how to do this fast in Fourier space. This could be huge for asteroseismology. The log-determinant may also be simple when we think about it all in Fourier space. We will reconvene this conversation late next week.
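The convolution idea is easiest to see in the simplest case, which is an assumption of this sketch and not the irregular-sampling case from lunch: regularly spaced times and a stationary kernel. Then the kernel matrix is symmetric Toeplitz, and its product with a vector is a slice of a circular convolution that the FFT computes in O(n log n). The kernel choice, grid, and function name below are illustrative.

```python
import numpy as np

def toeplitz_matvec_fft(first_col, v):
    """Multiply a symmetric Toeplitz matrix (given by its first column) by v via FFT."""
    n = len(v)
    # Embed the n x n Toeplitz matrix in a 2n x 2n circulant matrix.
    c = np.concatenate([first_col, [0.0], first_col[:0:-1]])
    v_pad = np.concatenate([v, np.zeros(n)])
    # A circulant product is a circular convolution: pointwise product in Fourier space.
    return np.fft.irfft(np.fft.rfft(c) * np.fft.rfft(v_pad), 2 * n)[:n]

n = 256
t = np.arange(n) * 0.1                       # regular time grid (assumption)
first_col = np.exp(-0.5 * (t / 1.5) ** 2)    # squared-exponential kernel at each lag
lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
K = first_col[lags]                          # dense kernel matrix, for comparison only

rng = np.random.default_rng(3)
v = rng.normal(size=n)
dense = K @ v                                # O(n^2)
fast = toeplitz_matvec_fft(first_col, v)     # O(n log n)
print(np.max(np.abs(dense - fast)))
```

With a fast product in hand, a solve against the kernel matrix (the "deconvolution") can be attempted with an iterative method such as conjugate gradients. Irregular sampling would need a non-uniform FFT in place of the plain one.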


the math behind optimization magic

There is much magical math these days around L1, LASSO, compressed sensing, and so on. These methods are having huge impacts across many fields, especially where data-driven models reign. There were two talks at SCDA today about these matters. In the morning, Christian Müller (SCDA) spoke about TREX, which is his set of methods for identifying predictors in very sparse problems. He showed incredible performance on a set of toy problems, and value in real problems. In the afternoon, Eftychios Pnevmatikakis (SCDA) reviewed a paper that proves some results related to the conditions under which an optimization problem (minimize blah subject to foo) will return the true or correct answer. There was a lot of geometry and there were some crazy sets. Definitions of “descent cone” and “statistical dimension” were introductions, for me, to some real math.



In the phase-retrieval problem, there are many objective functions one can write down for the problem, and many optimizers one can use, and many initializations, and many schedules for switching among objectives and optimizers. I spent my research time today playing in this playground. I got nothing awesome—everything gets stuck in local optima (not surprisingly).

One of the optimization methods I figured out turns the problem into a quadratic program with quadratic constraints (QCQP). This is convex if the constraints themselves are properly signed. They aren't! When they aren't, QCQP is apparently NP-Hard. So either this is going to be a tough optimization or else I am going to solve P = NP! Have I mentioned that I hate optimization?


phase retrieval code

I started a github repository with code for a probabilistic-modeling-based approach to phase retrieval. I started with optimization of a likelihood-like object. My current likelihood is clearly not convex, as expected: This is a quadratic (not a linear) model (the data is the squared norm of the Fourier transform of the image).
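A minimal sketch of why such an objective cannot be convex, assuming a Gaussian-noise chi-squared as the likelihood-like object (the function names and the toy image are illustrative, not the contents of the repository): a circularly shifted copy of the true image fits the data exactly as well as the truth, but the average of the two does not.

```python
import numpy as np

def power_spectrum(image):
    """Forward model: squared norm of the Fourier transform of the image."""
    return np.abs(np.fft.fft2(image)) ** 2

def chi_squared(image, data, sigma=1.0):
    """Likelihood-like objective: -2 ln(likelihood) under Gaussian noise, up to a constant."""
    resid = (data - power_spectrum(image)) / sigma
    return np.sum(resid ** 2)

rng = np.random.default_rng(42)
truth = rng.uniform(size=(8, 8))         # all-positive toy image
data = power_spectrum(truth)             # noiseless fake data

shifted = np.roll(truth, 3, axis=1)      # circular shift: same power spectrum
midpoint = 0.5 * (truth + shifted)

chi_truth = chi_squared(truth, data)
chi_shifted = chi_squared(shifted, data)
chi_mid = chi_squared(midpoint, data)
print(chi_truth, chi_shifted, chi_mid)
```

A convex function that attains its minimum at two points attains it everywhere on the segment between them, so a midpoint that fits worse than both endpoints rules out convexity. The shift is one of the trivial degeneracies; a support constraint removes it, but the objective stays quadratic in the image.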


phase retrieval

I came back from #AstroHackWeek and #DSEsummit all fired up to work on Cryo-EM, but then Leslie Greengard, Charlie Epstein (Penn) and Jeremy Magland (SCDA) distracted me on the phase retrieval problem: In one-dimensional problems, if you know the squared amplitude of the Fourier transform of an all-positive function on a bounded domain, you do not know the function: There are true degeneracies. In higher dimensions, there are degeneracies possible, but generic two-d and three-d all-positive scenes in bounded domains are uniquely specified by the norm of the Fourier transform the vast majority of the time. Or so we think. There are arguments, not proofs, I believe. Anyway, Magland has coded up (an improved version of) one of the standard algorithms for retrieving the phases of the Fourier transform and spent the last week or two testing and adjusting it. He seems to find that the solutions are unique, but that they are very hard to find. We spent the day arguing around this. I am trying not to get sucked in!


#DSEsummit, day 3

The day started with Josh Tucker (NYU) talking about the SMaPP lab at NYU, where they are doing observational work in Politics and Economics using data science methods and in a lab-like structure. The science is new, but so is the detailed structure of the lab, which is not a standard way of doing Political Science! He pointed out that some PIs in the audience have larger budgets for their individual labs than the entire NSF budget for Political Science! He showed very nice results of turning political-science theories into hypotheses about twitter data, and then performing statistical falsifications. Beautiful stuff, and radical. He showed that different players (protesters, opposition, and oppressive regimes) are all using diverse strategies on social media; the simple stories about twitter being democratizing are not (or no longer) correct.

In the afternoon, we returned from Cle Elum to UW, where I discussed problems of inference in exoplanet science with Foreman-Mackey, Elaine Angelino (UCB), and Eric Agol (UW). After we discussed some likelihood-free inference (ABC) ideas, Angelino pointed us to the literature on probabilistic programs, which seems highly relevant. In that same conversation, Foreman-Mackey pointed out the retrospectively obvious point that you can parameterize a positive-definite matrix using its LU decomposition and then never have to put checks on the eigenvalues. Duh! And awesome.
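A minimal sketch of the triangular-factor trick, using the Cholesky-style form (a lower-triangular L with positive diagonal, giving the matrix L L^T) as a swapped-in, closely related stand-in for the LU decomposition mentioned above. Exponentiating the diagonal is one assumed way to keep it positive, and the function name is hypothetical.

```python
import numpy as np

def positive_definite_from_params(params, n):
    """Map n(n+1)/2 unconstrained reals to a symmetric positive-definite matrix."""
    L = np.zeros((n, n))
    L[np.tril_indices(n)] = params       # fill the lower triangle
    diag = np.diag_indices(n)
    L[diag] = np.exp(L[diag])            # force a strictly positive diagonal
    return L @ L.T                       # positive definite by construction

rng = np.random.default_rng(1)
n = 4
params = rng.normal(size=n * (n + 1) // 2)   # any real values are allowed
cov = positive_definite_from_params(params, n)
eigenvalues = np.linalg.eigvalsh(cov)
print(eigenvalues)
```

A sampler or optimizer can then move freely in `params` space, and every proposal is a valid covariance matrix with no eigenvalue checks.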


#DSEsummit, day 2

In the morning, Katy Huff (UCB) gave an energizing talk about The Hacker Within, which is a program to have peers teach peers about their data-science (or programming or engineering) skills to improve everyone's scientific capabilities. The model is completely ground-up and self-organized, and she is trying to make it easy for other institutions to “get infected” by the virus. She had some case studies and insights about the conditions under which a self-organized peer-educational activity can be born and flourish. UW and NYU are now both going to launch something; I was very much reminded of #AstroHackNY, which is currently dormant.

Karthik Ram (UCB) talked about a really deep project on reproducibility: They have interviewed about a dozen scientists in great detail about their “full stack” workflow, from raw data to scientific results, identifying how reproducibility and openness are or could be introduced and maintained. But the coolest thing is that they are writing up the case studies in a book. This will be a great read: a comparative look at different disciplines, a snapshot of science in 2015, and a gift to people thinking about making their stack open and reproducible.

I had a great conversation with Stefan Karpinski (NYU) and Fernando Perez (UCB) about file formats (of all things). They want to destroy CSV once and for all (or not, if that doesn't turn out to be a good idea). Karpinski explained to me the magic of UTF8 encoding for text. My god is it awesome. Perez asked me to comment on the new STScI-supported ASDF format to replace FITS, and compare to HDF5. I am torn. I think ASDF might be slightly better suited to astronomers than HDF5, but HDF5 is a standard for a very wide community, who maintain and support it. This might be a case of the better is the enemy of the good (a phrase I learned from my mentor Gerry Neugebauer, who died this year). Must do more analysis and thinking.

In the afternoon, in the unconference, I participated in a discussion of imaging and image processing as a cross-cutting data-science methodology and toolkit. Lei Tian (UCB) described forward-modeling for super-resolution microscopy, and mentioned a whole bunch of astronomy-like issues, such as spatially variable point-spread function, image priors, and the likelihood function. It is very clear that we have to get the microscopists and astronomers into the same room for a couple days; I am certain we have things to learn from one another. If you are reading this and would be interested, drop me a line.


#DSEsummit, day 1

Today was the first day of the Moore-Sloan Data Science Environments annual summit, held this year in Cle Elum, Washington. We had talks about activities going on at UW; many of the most interesting to me were around reproducibility and open science. For example, there were discussions of reproducibility badges, where projects can be rated on a range of criteria and given a score. The idea is to make reproducibility a competitive challenge among researchers. A theme of this is that it isn't cheap to run fully reproducibly. That said, there are also huge advantages, not just to science, but also to the individual, as I have commented in this space before. It is easy to forget that when CampHogg first went fully open, we did so because it made it easier for us to find our own code. That sounds stupid, but it's really true that it is much easier to find your three-year-old code on the web than on your legacy computer.

Ethics came up multiple times at the meeting. Ethical training and a foregrounding of ethical issues in data science is a shared goal in this group. I wonder, however, if we got really specific and technical, whether we would agree what it means to be ethical with data. Sometimes the most informative and efficient data-science methods to (say) improve the fairness in distribution of services could easily conflict with concerns about privacy, for example. That said, this is all the more reason that we should encourage ethical discussions in the data science community, and also encourage those discussions to be specific and technical.


#AstroHackWeek 2015, day 5

In the morning, tutorials were again from Brewer and Marshall, this time on MCMC sampling. Baron got our wrap-up presentation document ready while I started writing a title, abstract, and outline for a possible first paper on what we are doing. This has been an amazingly productive week for me; Baron and I both learned a huge amount but also accomplished a huge amount. Now, to sustain it.

The hack-week wrap-up session blew my mind. Rather than summarize what I liked most, I will just point you to the hackpad where there are short descriptions and links out to notebooks and figures. Congratulations to everyone who participated, and to Huppenkothen, who did everything, and absolutely rocked it.


#AstroHackWeek 2015, day 4

The day started with tutorials on Bayesian reasoning and inference by Brewer and Marshall. Brewer started with a very elementary example, which led to lots of productive and valuable discussion, and tons of learning! Marshall followed with an example in a Jupyter notebook, built so that every person's copy of the notebook was solving a slightly different problem!

Baron and I got the full optimization of our galaxy deprojection project working! We can optimize both for the projections (Euler angles and shifts etc) of the galaxies into the images and for the parameters of the mixture of Gaussians that makes up the three-dimensional model of the galaxy. It worked on some toy examples. This is incredibly encouraging, since although our example is a toy, we also haven't tried anything non-trivial for the optimizer. I am very excited; it is time to set the scope of what would be the first paper on this.

In the evening, there was a debate about model comparison, with me on one side, and the tag team of Brewer and Marshall on the other. They argued for the evidence integral, and I argued against, in all cases except model mixing. Faithful readers of this blog know my reasons, but they include, first, an appeal to utility (because decisions require utilities) and then, second, a host of practical issues about computing expected utility (and evidence) integrals, that make precise comparisons impossible. And for imprecise comparisons, heuristics work great. Unfortunately it isn't a simple argument; it is a detailed engineering argument about real decision-making.