I spent the day working out derivatives of my diffraction imaging model with respect to parameters. The crazy thing is that there is a sum over components in the density model (for the molecule) and then a sum over photons within the image (or instance) and then a sum over orientation-angle samples, and finally a sum over images (or instances). And various of these sums have logs and exps inside them. So it is a mess! I wrote them out with a pen on paper, and then typed them up in a nascent paper. The short-term plan is to get to stochastic gradient.
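The core move in all of these derivatives is the same: the gradient of a log-sum-exp is a softmax-weighted sum of gradients, and for stochastic gradient you evaluate terms like this on a random minibatch of images. A minimal sketch, with a made-up one-parameter model standing in for the real density-plus-orientation model:

```python
import numpy as np

def logsumexp(a):
    """Numerically stable log-sum-exp of a 1-D array."""
    amax = np.max(a)
    return amax + np.log(np.sum(np.exp(a - amax)))

def grad_log_marginal(theta, photons, angles):
    """Gradient (w.r.t. theta) of log sum_j exp(a_j), where a_j is the
    photon log-likelihood at orientation sample j. The gradient is the
    softmax-weighted sum of the per-orientation gradients. The model
    a_j(theta) here is a made-up one-parameter toy, not the real thing."""
    a = np.array([-0.5 * np.sum((photons - theta * np.cos(p)) ** 2)
                  for p in angles])
    dadtheta = np.array([np.sum((photons - theta * np.cos(p)) * np.cos(p))
                         for p in angles])
    w = np.exp(a - logsumexp(a))  # softmax weights over orientation samples
    return np.sum(w * dadtheta)
```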
In the morning I gave an informal talk at the Simons SCDA, about my work on The Cannon, and my issues (both good things and bad things) with machine learning. I discussed the point (which inspires our ABC research, and which I also discussed with Kravtsov at Chicago) that quantitative natural science is now almost entirely computational—meaning that the theory is a simulation that makes artificial data—and this leads to changes in how we do inference and speak about realities.
In the afternoon, I spoke with Megan Bedell (Chicago) about the echelle spectroscopy radial-velocity data she has; she has done some dimensionality reduction, and there are, I think, promising opportunities to improve the end-to-end radial-velocity precision. I also worked on my celestial mechanics code for Price-Whelan; it is not working and I don't know why! My only option is to write proper tests. Tomorrow!

While all this was going on, in the background, Dun Wang has been steadily finding cool stuff in the K2C9 data, including, for example, this (previously known from the ground) baby!
At group meeting, Dun Wang showed new discoveries in the brand-new K2 Campaign 9 data. The K2 team released the data the moment it came off the spacecraft, in fact before they even ran their own data-reformatting code. This was no problem for us, because K2 god Geert Barentsen (Ames) has (clean-roomed? and) open-sourced some of the core code. Wang ran our image-prediction / image-differencing version of our CPM code to predict the data and identify sources with strange excursions. He found ones that looked like possible microlensing events (known and unknown) and published them to the K2C9 group. I asked him to re-run CPM with other parameters to make it less (and more) aggressive and thereby address the (probable) over-fitting. The next step will be to incorporate a microlensing model and fit the K2 systematics (the point of the CPM code) and the microlensing parameters simultaneously.
Later in the day I started actually writing the celestial mechanics code my new eclipsing-binary team (Price-Whelan, Ness, and Foreman-Mackey). It is probably all wrong, but it is core technology, so it needs to get instrumented with tests.
As part of my project with Adrian Price-Whelan (and also Melissa Ness and Dan Foreman-Mackey), I spent my research time today figuring out time derivatives of Kepler's equations. We need these so we can do simultaneous fitting of the eclipsing-binary light curve and the radial velocities revealed in the double-line spectrum. This was actual pen-on-paper calculus! It's been a while, though taking these derivatives reminded me that I have taken them many times in the past.
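The central trick is implicit differentiation: Kepler's equation E - e sin E = M defines the eccentric anomaly E only implicitly, but differentiating both sides with respect to time gives dE/dt = n / (1 - e cos E), where n = dM/dt is the mean motion. A minimal sketch (a Newton solver plus that derivative; conventions are my own):

```python
import numpy as np

def kepler_E_and_dEdt(M, e, n, tol=1e-12):
    """Solve Kepler's equation E - e sin E = M by Newton's method, then
    get the time derivative by implicit differentiation:
    dE/dt = n / (1 - e cos E), with n = dM/dt the mean motion."""
    E = M if e < 0.8 else np.pi  # standard starting guess
    for _ in range(100):
        dE = (E - e * np.sin(E) - M) / (1.0 - e * np.cos(E))
        E -= dE
        if abs(dE) < tol:
            break
    return E, n / (1.0 - e * np.cos(E))
```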
In the afternoon I had a great conversation with Duane Lee (Vanderbilt) about chemical tagging and nucleosynthesis. He is close to being able to fit our data in the Milky Way halo with a mixture of dwarf-galaxy stellar populations. That would be awesome! We talked about low-hanging fruit with our APOGEE chemical abundance data.
I have a dream! If we could get enough long-period eclipsing binaries with multi-epoch spectroscopy, we could go a long way towards building a truly data-driven model of stellar spectra. It would be truly data-driven, because we would use the gravitational model of the eclipsing binary to get stellar masses and radii, and thus give good label (truth) inputs to a model like The Cannon for the stellar spectra. (Yes, if you have an eclipsing binary and spectroscopy for radial velocities, you get everything.) And then we could get densities, masses, and radii of stars for the interpretation of transit and radial-velocity results on exoplanets, without relying on stellar models. There are lots of other things to do too, like build population models for binary stars, and exploit the stellar models for Milky Way science. And so on.
Today, because of a meeting cancellation, both Adrian Price-Whelan and I got the full day off from responsibilities, so we decided to use it very irresponsibly. We searched the (very incomplete and under-studied) Kepler eclipsing binary list for binaries with long periods, deep eclipse depths, and APOGEE spectroscopy. It turns out there are lots! We started with the system KIC 9246715, which is a red-giant pair.
In the APOGEE spectrum, the pair of velocities (double line) is clearly visible, and it clearly changes from epoch to epoch. We found the velocities at each epoch first by auto-correlation and then by modeling the spectrum as a sum of two single stars. A project is born!
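The second step, in cartoon form (this is an assumed toy model with one shared template, not our actual code): model the double-line (SB2) spectrum as a weighted sum of a template Doppler-shifted to two different velocities.

```python
import numpy as np

def two_star_model(wave, template, v1, v2, a1, a2, c=299792.458):
    """Toy SB2 model (an assumption, not our actual pipeline): a weighted
    sum of one template Doppler-shifted to velocities v1 and v2 (km/s).
    A wavelength observed at `wave` was emitted at wave / (1 + v/c), so we
    sample the rest-frame template there (linear interpolation)."""
    shift1 = np.interp(wave / (1.0 + v1 / c), wave, template)
    shift2 = np.interp(wave / (1.0 + v2 / c), wave, template)
    return a1 * shift1 + a2 * shift2
```

Fitting v1, v2, a1, a2 at each epoch (after an auto-correlation first guess) then traces out the two radial-velocity curves.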
I continued working on my document about release of data and code. Twitter (tm) continues awesome.
Research highlight of the day was a long discussion with Megan Bedell (Chicago) about the consistency of exoplanet-host radial-velocity measurements order-by-order in a many-order high-resolution echelle spectrograph. The measurements are made by cross-correlation with a binary (ugh) template, and some orders are consistently high and some are consistently low, and we very much hope there are other more subtle regularities to exploit. Why are there these discrepancies? Probably because the model is inflexible and wrong. Unfortunately we don't have access to it directly (yet) so we have to live with the cross-correlation functions. We discussed simple methods to discover regularities in the order-by-order offsets and results, and sent Bedell off with a long to-do list.
I ended the day with a long conversation with Kat Deck (Caltech). Among other things, we discussed what we would do with our lives if exoplanet research evolves into nothing other than atmosphere transmission spectroscopy and modeling. Of course neither of us considers this outcome likely!
I spent the day working on my document about releasing data and code. I tweeted (tm) some of the ideas in the paper and started responding to the storm of replies. The twitters are excellent for getting ideas from the community!
Although perhaps this doesn't count as research, I spent today at public middle school CIS 303 in the Bronx, for Career and College Day. I met lots of kids (and lots of other people talking about their careers) and said words and answered questions about my career and how I got here. The format was panels, interviewed by classrooms of kids. Most interesting idea of the day (and it was from other panelists): Success in a career requires empathy and the ability to listen. That's deep! Strongest impression of the day: Comparing this public middle school to that of my own daughter, I can (still) see a lot of disparity in the NYC public school system, and that disparity isn't just about money: It is also about discipline, school organization, and academic priorities. (These disparities are what got me studying education way back in the late eighties when I was in college.)
At the beginning of the day, I did get a bit of research in thanks to Jeremy Tinker (NYU), who showed the Blanton–Hogg group meeting what is currently going on with (finishing) BOSS data analyses and (starting) eBOSS ones. The combination of baryon acoustic feature scale measurements and redshift-distortion measurements leads to very strong constraints on cosmological parameters. I'd like to say more! But papers will appear within weeks.
I had a conversation today with Boris Leistedt about his work to build a latent model of galaxy SEDs and get template-space-marginalized photometric redshifts. I proposed that he instantiate the latent variables as observables (like rest-frame colors or line strengths or something); this will help the model break degeneracies and sensibly order the templates in the latent space. That is, it should regularize or simplify the model for inference. That's just an intuition. But it might also help people who have drunk less of our Kool Aid to understand!
Having sent my cosmology inference draft to various friendlies for them to beat it up, I returned to other priorities. I have the goal to finish two more papers before my sabbatical ends. The first would be something (in the Data Analysis Recipes series) about why you should (or shouldn't) release your data and code. I keep making the same arguments over and over in person and by email, so I should write them down once and for all! This is exactly why I started the fitting-a-line document o-so-many years ago. The second is a paper on diffraction imaging with very few photons. I booted up both of these projects today.
On the first, I made a draft table of all the things that come in on the pro and con sides for releasing data and code. There are lots of overlaps, and lots of things appear in both the pro and the con column! For example, documentation: This is a con, because it is a burden, but a pro because you are encouraged to do it (and benefit from it). And it applies to both data and code.
On the second, I worked through the mathematics of the Ewald sphere, in preparation for generalizing my code so that it doesn't have to work in the small-angle limit.
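For the record (the coordinate conventions here are my assumption: beam along +z, detector at distance L), dropping the small-angle limit just means mapping each detector position to a scattering vector that lives on the Ewald sphere, q = (2π/λ)(ŝ - ẑ), with ŝ the unit vector from sample to pixel:

```python
import numpy as np

def ewald_q(x, y, L, lam):
    """Map detector coordinates (x, y), at distance L along the beam
    (+z), to the 3-D scattering vector on the Ewald sphere, with no
    small-angle approximation: q = (2 pi / lam) * (s_hat - z_hat)."""
    k = 2.0 * np.pi / lam  # radius of the Ewald sphere
    r = np.sqrt(x * x + y * y + L * L)
    return k * np.array([x / r, y / r, L / r - 1.0])
```

By construction, every q returned lies on a sphere of radius k centered at -k ẑ; the small-angle limit recovers q_z ≈ 0.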
Sabbatical is an unreal experience. In case the loyal reader was wondering How awesome is your job?, let me add to the list of awesome the fact of having a year (every seven, in principle) with no teaching or committee duties, in which I can do whatever I want! One piece of evidence that this is happening right now is that I just finished the third (zeroth draft) first-author paper of this academic year, which is about six times as many first-author papers as I complete in a normal year (yes, my normal rate is 0.5 first-author papers per year; thankfully my junior colleagues are far more productive and permit my co-authorship).
Of course these three first-author papers are not really done: On one I need to respond to the referee, on one I am waiting for a bit of work from collaborators, and this new one is still very, very rough. But it is fully drafted, from end to end. The subject is: How we compare cosmological simulations to cosmological data, and the incorrect inferences we might be drawing because of the wrongness of all this.
Boris Leistedt dropped in today and we discussed his methods to build a physically possible model of galaxy spectral energy distributions and therefore photometric redshifts, but with an exceedingly flexible model. His method is brilliant because it is entirely data-driven (no fixed templates) and yet it respects the physics of special relativity (the Doppler shift), which the machine-learning methods do not.
He made the amusing point that his method can be trained with a training set that contains literally a single galaxy with a spectroscopic redshift! That is, even if you have only a single redshift, you can put photometric redshifts (with, admittedly, large error bars) on all the other photometric galaxies! That is a property that no other data-driven method has. The point is that if you have multi-wavelength data on a single galaxy with a redshift, you can make rough predictions about how other galaxies would look at other redshifts.
His real breakthrough is the idea of using Gaussian processes to put priors on the spectral energy distributions (templates): If the SEDs are drawn from a Gaussian process, then all of the photometry (which consists of linear projections of the SEDs) is also drawn from a Gaussian process. We discussed the magic of all of this.
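The key closure property, in a finite-dimensional cartoon (the real thing involves integrals of the SED against filter curves, but those are still linear operations): if f is Gaussian with mean m and covariance K, any linear projection P f is Gaussian with mean P m and covariance P K Pᵀ.

```python
import numpy as np

def project_gp(mean, cov, P):
    """If the SED f ~ N(mean, cov) (a finite-dimensional stand-in for a
    Gaussian process), then any linear projection y = P f -- for example,
    photometry obtained by integrating the SED against filter curves --
    is Gaussian with mean P @ mean and covariance P @ cov @ P.T."""
    return P @ mean, P @ cov @ P.T
```

So priors placed on SEDs propagate analytically to priors on photometry; no sampling of SEDs is required.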
I also read a proposal by Daniela Huppenkothen, and wrote words in my inference-of-variances paper.
For my inference-of-variances project, I got ABC working. It is delivering a correct posterior, even at a reasonable (finite) distance threshold. All is good! I put a figure showing this into the nascent paper.
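In its most minimal form (a sketch of the idea, not the paper's actual code; the prior and distance function here are my assumptions), ABC rejection sampling for this problem looks like:

```python
import numpy as np

def abc_variance(data, n_draws, eps, rng):
    """Minimal ABC rejection sampler for the variance of a zero-mean
    Gaussian (a sketch, not the paper's code). Assumed prior:
    sigma2 ~ Uniform(0.1, 10). Distance: |sample variance of simulated
    data - sample variance of real data|. Keep draws with distance < eps;
    the kept draws approximate the posterior as eps -> 0."""
    target = np.var(data)
    kept = []
    for _ in range(n_draws):
        sigma2 = rng.uniform(0.1, 10.0)
        sim = rng.normal(0.0, np.sqrt(sigma2), size=len(data))
        if abs(np.var(sim) - target) < eps:
            kept.append(sigma2)
    return np.array(kept)
```

The point of the figure is that even at reasonable finite eps, the kept draws match the exact posterior well.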
It was my pleasure to sit on the PhD defense committee of Gregory Green (Harvard) today. I had to do it remotely (for uninteresting reasons). Green has built a three-dimensional map of the dust in the Milky Way, by modeling every single star in the PanSTARRS data. This is an impressive feat computationally, since it is a huge problem, and also probabilistically, since most things you can write down are either intractable or wrong.
Being a good probabilistic reasoner, Green did something both tractable and correct, and got a beautiful map. His tours of the map in his presentation were mesmerizing. He was cagey about spiral structure; his method wouldn't necessarily find it even if it were there.
It was a great PhD defense based on absolutely great work. At the end of his talk he discussed ways he might do things that are even righter in the future, given our prior beliefs about the interstellar medium (and lots of new data). That's super interesting, and we hope to discuss when his dust settles (so to speak). Congratulations Dr Green!
I spent the day trying to understand the frequentist properties of empirical (sample) variances, and the properties (expectation value and variance) of estimators of the variance of the distribution that generated the sample. This is all related to my issues with cosmological inference, that I am trying to write up. I am not surprised cosmologists have made mistakes here; it is hard to understand even in the most trivial situation. I am working out the trivial case to make analogies to the real case.
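The trivial-case answers, for n iid Gaussian draws: the unbiased sample variance s² (ddof=1) has expectation exactly σ², and its own variance is 2σ⁴/(n-1), because (n-1)s²/σ² is chi-squared with n-1 degrees of freedom. A sketch (the cosmological analogy replaces s² with a power-spectrum estimate):

```python
import numpy as np

def var_of_sample_variance(sigma2, n):
    """For n iid Gaussian draws with true variance sigma2, the unbiased
    sample variance s^2 (ddof=1) has E[s^2] = sigma2 exactly, and
    Var[s^2] = 2 sigma2^2 / (n - 1), because (n-1) s^2 / sigma2 is
    chi-squared with n-1 degrees of freedom."""
    return 2.0 * sigma2 ** 2 / (n - 1)
```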
I spent small bits of time between things writing in my short paper about inferring the variance of a distribution, given only samples from that distribution. I am trying to have a companion paper for the ABC paper being written by Hahn and Vakili about large-scale structure. Foreman-Mackey gave me some advice this week about how to pitch the paper, and I am trying to obey it.
In the afternoon I had a long phone call with Megan Bedell (Chicago) about how high-precision radial velocity spectrograph data are turned into radial-velocity measurements. She wants to reconsider parts of the pipeline, without rebuilding the whole pipeline from scratch (which is something I would like to do some day). She is currently analyzing her data with a closed-source pipeline, which is, apparently, the standard practice in this field. That has all sorts of bad properties, including that no-one can assert with confidence what the code is doing, nor check it! Bedell and I discussed the stage at which the code goes from a cross-correlation function (as a function of velocity) at every echelle order to a single velocity. Apparently we can get the data-reduction black box to return enough information that we can reprocess this step.
It was my great pleasure to sit on the PhD defense committee for Adrian Price-Whelan (Columbia) today. He spoke about using phase-space structures, especially streams of stars, to constrain the gravitational field in the Milky Way. He pointed out that we should call it the “gravitational field” not the “gravitational potential” because the latter is only known up to a constant. (I think the argument is deeper: The potential exists only to be differentiated, never observed!)
Two important ideas that were under heavy discussion both in Price-Whelan's talk and in the discussion afterwards were the following: Price-Whelan may have the first ever empirical evidence for dynamical chaos outside of planetary systems. His evidence is in the form of morphological predictions for stellar streams. This argument is weak at present, but if the evidence grows stronger, he will be responsible for a new kind of fundamental measurement.
The other idea is that we might be making big mistakes interpreting the Milky Way entirely in terms of simple, integrable models. As my loyal reader knows, I have ranted about this for years, but my rants have been vapor-ware. Price-Whelan (along with Pearson at Columbia and Bonaca at Yale, I should add) is close to being able to address this point directly and quantitatively.
In addition to all of these things, Price-Whelan has delivered some great probabilistic methods and code, contributed heavily to important open-source astronomy projects, and led our community in education regarding code, data analysis, and research practice. It has been a great pleasure and opportunity for me to be involved in his work for (what his thesis points out is) the last decade!
I spent the day at the University of Chicago, hosted by Andrey Kravtsov (Chicago). Kravtsov and I discussed the great simplicity (on large scales) of the galaxy population, and also the teaching of stellar physics, in which he has made some very interesting innovations. Since he is a true scholar, we also talked briefly about the impact on philosophy of science when the models are fundamentally computational. He recommended this book, around which maybe we should build a reading group next year at the Simons SCDA.
I met with the large group of John Carlstrom (Chicago) and collaborators. They showed me the SPT data, and it is incredible: The CMB fluctuations are measured at enormous signal-to-noise and the point sources and SZ clusters are sharp points. Totally unlike any previous data. I met with Arieh Königl (Chicago), who predicted that we ought to see really cool relationships between planetary systems and the chemical abundances in their protoplanetary disks, mirroring things Schlaufman (OCIW) said to me a few weeks ago. Dan Holz (Chicago) and I discussed low-hanging fruit in the aftermath of the LIGO detection, and also exoplanet disk physics. We came around quickly to model interpolation or emulators or surrogates. Just like we have been discussing with Rix, Ting, and Conroy. With both students and postdocs I discussed possibilities for improving stellar radial velocity measurements and photometric calibration. What a great group of people at Chicago!
I spent some time with Adrian Price-Whelan discussing what we need to do to get ready for Gaia. We concluded that radial velocities will be extremely valuable; we should assemble every radial velocity for any star in the TGAS set prior to the Gaia data release. This might involve convincing surveys that have not yet done data releases to release some information on relevant subsamples. That would be a fun community-building project. After that, we discussed crazy projects related to life and intelligence. Price-Whelan is in the gap between thesis completion and defense, so he is currently open to crazy.
[I have been out on vacation; hence the lack of posts recently.]
Out of the blue came an email from So Hattori (NYUAD), who has found many single transits (long-period planet candidates) in the Kepler data. This is awesome! Foreman-Mackey and I discussed the goals for Hattori's project, and its relationships with Foreman-Mackey's current project (which has found a Saturn analog and has some occurrence rate results).
In the morning, I spent time with Leslie Greengard discussing various matters related to the description (on a computer, say) of continuous (and infinitely differentiable) surfaces. There are some outstanding problems, which seem like simple math problems but are unsolved. This has nothing to do with anything I am working on, but I could get hooked. Of course my position is that the determination of a surface given control points ought to be cast as an inference problem!
The remainder of the day was spent planning and outlining my argument in my “Inference of Variance” project: How the question of inferring the variance of a process (that generated some points or data, say) is related to the problem of cosmological parameter estimation, and how we can help the latter with work on the former.