I arrived in Heidelberg for my annual stay at MPIA today. I had a brief conversation with Rix about Milky Way streams, in preparation for Price-Whelan's arrival tomorrow. I also had a brief conversation with Finkbeiner (CfA) about PanSTARRS calibration. He can show that things are good at the milli-magnitude level, but I think he ought to be able to say even more precise things, given the number of stars and the regularities. In the end, most of his precision tests involve comparison to the SDSS-I imaging data, so that is what really limits the precision of his tests.
To test our pixel-level model that is designed to self-calibrate the Kepler data, we had Dun Wang insert signals into a raw Kepler pixel lightcurve and then check whether self-calibration fits them out or preserves them. That is, does linear fitting reduce the amplitudes of signals we care about or bias our results? The answer is a resounding yes. Even though a Kepler quarter has some 4000 data points, if we fit a pixel lightcurve with a linear model with more than a few dozen predictor pixels from other stars, the linear prediction will bias or over-fit the signal we care about. We spent some time in group meeting trying to understand how this could be: It indicates that linear fitting is crazy powerful. Wang's next job is to look at a train-and-test framework in which we only use time points far from the time points of interest to train the model. Our prediction is that this will protect us from the over-fitting. But I have learned the hard way that when fits get hundreds of degrees of freedom, crazy shit happens.
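Here is a minimal sketch of that injection test on fake data; it is not Wang's code. Everything in it (the array sizes, the three shared "spacecraft" trends, the noise levels, the width of the held-out window) is an assumption made for illustration. It injects a box-shaped dip into a target lightcurve, fits the target as a linear combination of predictor lightcurves, and compares how much of the dip survives when the fit is trained on all time points versus only on time points far from the signal.

```python
import numpy as np

def ols_predict(X_train, y_train, X_all):
    """Fit least-squares weights on the training rows, then predict every row."""
    weights, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return X_all @ weights

def recovered_depth(y, prediction, in_signal):
    """Depth of the dip that survives in the residuals (y - prediction)."""
    resid = y - prediction
    return resid[~in_signal].mean() - resid[in_signal].mean()

rng = np.random.default_rng(42)
n_times, n_pred = 2000, 500                  # toy sizes (assumed), not Kepler's
trends = rng.normal(size=(n_times, 3))       # shared "spacecraft" systematics (assumed)
X = trends @ rng.normal(size=(3, n_pred)) + 0.1 * rng.normal(size=(n_times, n_pred))
y = trends @ rng.normal(size=3) + 0.1 * rng.normal(size=n_times)

# Inject a box-shaped dip of known depth into the target lightcurve.
true_depth = 1.0
in_signal = np.zeros(n_times, dtype=bool)
in_signal[1000:1020] = True
y = y - true_depth * in_signal

# (a) Train on every time point, including those that contain the signal.
depth_all = recovered_depth(y, ols_predict(X, y, X), in_signal)

# (b) Train only on time points far from the signal, then predict everywhere.
far = np.ones(n_times, dtype=bool)
far[900:1120] = False
depth_far = recovered_depth(y, ols_predict(X[far], y[far], X), in_signal)

print(f"injected {true_depth:.2f}, all-points fit {depth_all:.2f}, held-out fit {depth_far:.2f}")
```

In this toy setup the all-points fit should absorb a noticeable fraction of the injected dip (roughly the ratio of predictors to data points), while the held-out fit cannot see the dip during training and so should leave it nearly intact. Whether real Kepler pixels behave this cleanly is exactly the open question.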
At group meeting my new student Dun Wang (NYU) showed very nice results in which he has used linear combinations of the brightness histories of Kepler pixels to predict the brightness histories of other Kepler pixels. The idea is that inasmuch as pixels coming from other stars co-vary with the pixel of interest, that co-variability must be caused by the satellite and instrument. That is, this is a way of calibrating out the coherent calibration "noise" that comes from the fact that the observations are being made with a time-varying device.
After group meeting, Michael Cushing (Toledo) showed up to discuss fitting spectra. As in previous conversations with Czekala and with Johnson, we talked about making a flexible model of calibration and an accurate noise model. We talked through the relative merits of putting complexity into the covariance function (as with Gaussian Processes) or into a parameterized model of calibration (which you fit simultaneously with everything else). These issues, of modeling calibration "noise" in spectroscopy and photometry, come up so much that I feel like we should organize a workshop. One thing that's nice about Cushing's plan is that he wants to use the expected intensity (not the observed intensity) to set the variance of the Poisson term in his noise model. That's the Right Thing To Do (tm). Of course it makes the likelihood function more complicated in some ways.
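To make the expected-intensity point concrete, here is a small sketch; it is not Cushing's model. It assumes a Gaussian approximation to the noise, unit gain, a known read-noise variance, and a one-parameter model (an amplitude times a made-up spectral template). The complication is visible in the code: because the variance depends on the model, the log-variance normalization term can no longer be dropped as a constant.

```python
import numpy as np

def ln_likelihood(data, expected, read_var):
    """Gaussian log-likelihood with the Poisson variance set by the expected counts."""
    var = read_var + expected          # assumes unit gain: Poisson variance = mean
    # The log(var) term depends on the model parameters, so it must be kept.
    return -0.5 * np.sum((data - expected) ** 2 / var + np.log(2.0 * np.pi * var))

rng = np.random.default_rng(1)
template = np.linspace(0.5, 1.5, 2000)   # hypothetical spectral shape
true_amp, read_var = 20.0, 4.0           # hypothetical amplitude and read-noise variance
data = rng.poisson(true_amp * template) + rng.normal(scale=np.sqrt(read_var), size=template.size)

# Scan the amplitude and pick the maximum-likelihood value.
amps = np.arange(15.0, 25.0, 0.05)
ln_like = np.array([ln_likelihood(data, a * template, read_var) for a in amps])
best_amp = amps[np.argmax(ln_like)]
print(f"true amplitude {true_amp:.2f}, best-fit amplitude {best_amp:.2f}")
```

Setting the variance from the observed counts instead would make low-scattering pixels look more precise than they are, which is the usual argument for doing it this way.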
I spent the day up at Columbia, with the Stream Team, which is Marla Geha (Yale), Kathryn Johnston (Columbia), me, and parts of our research groups. We discussed just-finished papers, next papers, and things that have come up in the literature. On that last point, we spent some time discussing this paper by Gibbons et al. on the Sagittarius stream. The paper makes potential inferences based on data, which is Good, but takes as its "data" a very limited set of measurements (a precession angle between apocenters, two apocenter distances, and a progenitor 6-d position) and nothing else, which is Bad. We discussed the point that the limited set of measurements they used is not even close to a set of sufficient statistics; in particular, Price-Whelan has shown that you can get multiple potentials for any precession angle, and that the overall shape of the stream and the radial velocities of the stars in the stream will distinguish these options. When your data set contains many measurements (as theirs does) and when your model can predict those measurements (as theirs can), you only hurt yourself by using subsets of the data or limited, derived quantities! (I said all of this to Evans and Belokurov a couple of months ago.) I don't want to harsh them out too much, though, because the stream literature has been rife with theory papers that don't confront data at all; this paper is a step in the right direction.
I spent the day in Berkeley, meeting with the Evaluation and Ethnography Working Group of the Moore–Sloan Data Science Environments. We discussed many issues of interest, but I was very focused on space: How can we use Data Science office and studio space to create and nurture participation and collaborations by many scientists? It was a very wide-ranging discussion, a lot of which was about sociological and psychological matters for which the primary data are qualitative. We spent some time talking about how to gather and analyze such data.
Today was day one of #ExoStat, hosted by Jessi Cisewski at CMU. The meeting is a follow-up to the #ExoSamsi meeting we held last summer at SAMSI, bringing together statisticians and exoplaneteers. The meeting was structured with talks in the morning and hacking in the afternoon. Two highlights in the morning talks were the following:
Dawson (UCB) talked about building flexible noise models for the Kepler lightcurves, and showed how those flexible noise models improved discovery and measurements of exoplanets. One of the most intriguing results in her talk was that she finds that impact parameter measurements are very biased or unstable, in the sense that she gets different answers with different priors; Foreman-Mackey and I have found the same recently. For Dawson this is critical, because she has identified that she can potentially say things about planetary system inclination distributions by measuring impact parameter variations.
Ian Czekala (CfA) spoke about flexible likelihood functions (noise models) for fitting stellar spectra. He finds, as many do, that though spectral models are incredibly good, there are smooth calibration issues and also individual lines that are slightly wrong. He has a covariance matrix (a non-stationary Gaussian process covariance matrix) that handles both of these things. On the "bad lines" issue, he puts in (for each bad line) a rank-one contribution to the covariance matrix that adds variance with the proper shape to be a varying line. This is a simple and beautiful idea. After his talk, we discussed modeling covariant groups of lines, and even how to discover such groups automatically. The long-term goal is automatic data-driven improvement of the spectral models.
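The rank-one idea is easy to sketch; this is not Czekala's code. It assumes a white-noise baseline covariance, a Gaussian line profile, and made-up numbers for the wavelength grid, line width, and amplitude variance. If the model gets one line's strength wrong, the residuals look like an unknown amplitude times the line profile; marginalizing over that amplitude with a zero-mean Gaussian prior adds the outer product of the profile with itself to the covariance.

```python
import numpy as np

def ln_likelihood(resid, cov):
    """Multivariate Gaussian log-likelihood of residuals under covariance cov."""
    _, logdet = np.linalg.slogdet(cov)
    chi2 = resid @ np.linalg.solve(cov, resid)
    return -0.5 * (chi2 + logdet + resid.size * np.log(2.0 * np.pi))

wave = np.linspace(5000.0, 5010.0, 200)      # hypothetical wavelength grid
sigma = 0.01                                 # hypothetical per-pixel noise
cov_base = sigma**2 * np.eye(wave.size)

# Rank-one term: variance with the shape of one (assumed Gaussian) line.
profile = np.exp(-0.5 * ((wave - 5004.0) / 0.3) ** 2)
amp_var = 0.1**2                             # prior variance on the line-strength error
cov_line = cov_base + amp_var * np.outer(profile, profile)

# Fake residuals: the model got this line's strength wrong by 0.08.
rng = np.random.default_rng(3)
resid = 0.08 * profile + sigma * rng.normal(size=wave.size)

ln_base = ln_likelihood(resid, cov_base)
ln_line = ln_likelihood(resid, cov_line)
print(f"ln L without line term: {ln_base:.1f}, with line term: {ln_line:.1f}")
```

With the rank-one term in place, the bad line stops dominating the chi-squared, so it should no longer drag the other parameters around. Modeling covariant groups of lines would presumably mean sharing one amplitude across several profiles.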
I wrote words for my Atlas and also for Adam Bolton (Utah). The latter was a probabilistic generalization of the k-means clustering algorithm that might be able to deliver high-quality quasar archetypes for automated redshift fitting in SDSS-IV.
At group meeting at the end of the day, Ruth Angus (Oxford) showed her attempts to measure reliable rotation periods for stars using photometric variability. Both because of the short lives of sunspots (and other surface features), and because of differential rotation, the variability induced by rotation is not periodic and certainly not harmonic. Nonetheless, it shows up in the autocorrelation function as a negative autocorrelation at half-period lags and a positive correlation at full period. That said, the autocorrelation functions are noisy and they are point estimates. She is trying to move to a probabilistic framework using Gaussian Processes. She showed some early results and we offered consulting.
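Here is a toy illustration of the autocorrelation signature described above; it is not Angus's analysis. It assumes a fake lightcurve: a sinusoid whose phase takes a random walk (a crude stand-in for spots that come and go), plus white noise, on a regular time grid. The period, phase-drift scale, and noise level are all made up.

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Sample autocorrelation function at integer lags 0..max_lag."""
    x = x - x.mean()
    norm = np.dot(x, x)
    return np.array([np.dot(x[: x.size - k], x[k:]) / norm for k in range(max_lag + 1)])

rng = np.random.default_rng(0)
period, n = 50, 5000                          # period in samples (assumed)
# Random-walk phase: the signal is quasi-periodic, not strictly periodic.
phase = 2.0 * np.pi * np.arange(n) / period + np.cumsum(rng.normal(scale=0.05, size=n))
flux = np.sin(phase) + 0.3 * rng.normal(size=n)

acf = autocorrelation(flux, 2 * period)

# Point estimate of the period: the first positive peak away from zero lag.
search = np.arange(period // 2 + 5, period + period // 2)
best_lag = search[np.argmax(acf[search])]
print(f"ACF at half period: {acf[period // 2]:.2f}, at full period: {acf[period]:.2f}")
print(f"estimated period: {best_lag} samples")
```

The peak location is only a point estimate with no uncertainty attached, which is the limitation that motivates the move to a probabilistic Gaussian Process framework.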
In the same meeting, Soichiro Hattori (NYUAD) showed some first results for planet search in the Kepler field, using a trivial "box model" and simple squared error objective. It looks like it works: When he scans through parameters, he gets minima in the right places. His next steps are to switch to a more sophisticated noise model (likelihood function) and then make it very fast.
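A box-model search with a squared-error objective can be sketched in a few lines; this is not Hattori's code. It assumes a fake lightcurve with white noise, a known transit duration, a baseline fixed at unity, and coarse brute-force grids in period and reference time. All of the numbers are invented.

```python
import numpy as np

def in_box(t, period, t0, duration):
    """True where t falls inside a transit box centered on t0 + k * period."""
    dt = (t - t0 + 0.5 * period) % period - 0.5 * period
    return np.abs(dt) < 0.5 * duration

def squared_error(t, flux, period, t0, duration):
    """Sum of squared residuals for a box model with the best-fit depth."""
    inside = in_box(t, period, t0, duration)
    if not inside.any():
        return np.sum((flux - 1.0) ** 2)
    depth = np.mean(1.0 - flux[inside])       # least-squares depth, baseline fixed at 1
    return np.sum((flux - (1.0 - depth * inside)) ** 2)

# Fake lightcurve with an injected box-shaped transit (all values hypothetical).
rng = np.random.default_rng(7)
t = np.arange(0.0, 90.0, 0.02)                # days
true_period, true_t0, duration = 7.3, 2.1, 0.3
flux = 1.0 - 0.002 * in_box(t, true_period, true_t0, duration) + 5e-4 * rng.normal(size=t.size)

# Brute-force scan over period and reference time; keep the minimum.
best = (np.inf, None, None)
for period in np.arange(6.0, 9.0, 0.02):
    for t0 in np.arange(0.0, period, 0.1):
        sse = squared_error(t, flux, period, t0, duration)
        if sse < best[0]:
            best = (sse, period, t0)
best_sse, best_period, best_t0 = best
print(f"best period {best_period:.2f} d, best t0 {best_t0:.2f} d")
```

Swapping the squared error for a proper likelihood (for example one with correlated noise) only changes `squared_error`; the double loop is the part that would need to be made fast.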
First thing in the morning, I broke it to Jeffrey Mei that his results on spectral features associated with MW dust are almost certainly strongly affected by the SDSS spectroscopic calibration pipeline. He took it well, and we realized that we can just re-run our analysis on the completely uncalibrated spectrograph counts! This might just work. He is tasked with understanding how to access the raw counts from the SDSS-III spectroscopic pipeline.
At arXiv coffee, Patel summarized two papers by Gezari and collaborators on discoveries in GALEX and PanSTARRS time-domain data: A shock breakout flash at the start of a supernova and a putative tidal-disruption flare from a star destroyed by a close encounter with a black hole. These results provide a strict lower limit to what we might find in a search of the GALEX photon list.
At the end of a day not filled with research, I met up with Mike O'Neil (NYU) and Jon Wilkening (Berkeley) to talk about building basis functions in kernel space for kernels to use in Gaussian Process fitting of density fields in one to three ambient dimensions. Apparently there is a connection between this problem and quadratures. It has something to do with the fact that every matrix you make has a finite number of sample points, but the positive-definite constraint on the kernel function must hold not just for every possible selection of spatial sample points but also in the infinite-dimensional limit. Quadratures relate integrals of functions to weighted sums of finite evaluations. As you can see from my vagueness, I don't really understand any of this yet, but if we could build a flexible basis for making covariance functions that are capable of representing (approximately) power laws with features, we could do a lot of interesting probabilistic cosmological inference.
I spent an hour this morning with Patel, discussing possible projects with the GALEX photon catalog. We tentatively decided to look for flashes or short-lived brightness increases, either on top of known sources or else in isolated regions. This project is interesting in itself, exercises the catalog, and also provides useful information for improving calibration and the instrument model.
At group meeting, Vakili showed a variant of a Gaussian Process, in which the latent function is still drawn from a Gaussian, but the data are related to the latent function by a fatter-tailed (t) distribution. He showed a beautiful simulation and output in which outliers totally mess up a Gaussian Process fit but are just straight-up ignored by the modified method. At this point, I don't even remember what the method is called, but it is extremely relevant to the quasar-fitting work by Mykytyn.
Foreman-Mackey may have actually finished his paper on exoplanet abundances today! I hope this is true and we submit tomorrow. I did work on the text for him, but only in the form of giving final comments. One of our main points is that the "rate" or "frequency" or "abundance" of Earth analogs should be expressed as an expected number per star per natural logarithm of period, per natural logarithm of radius. However, in the end, he also computed the number of planets that we expect to have in the Kepler field, with period between 200 and 400 days (Petigura's definition) and radius between 1 and 2 Earth radii (Petigura again), orbiting one of the 42,000 Sun-like stars, in such a way (inclination) that it would transit (conceivably) observably. The answer is nine. With large uncertainty. That is, we should be looking very hard for these Earth analogs, because there ought to be a few of them!
Adam Bolton (Utah) called and we discussed automated redshift finding in SDSS-IV and SDSS-III. Bolton is trying to make rigid "archetypes" to use as redshift-finding templates. I promised him that by Friday I would come up with a scheme for doing this in a data-driven way, using quasars from a wide range of redshifts. I may have over-promised!
In my new (June-only!) group meeting, we had short contributions from everyone. Highlights included Patel showing the relationship between galaxy angular size and the number of citations on NED; there are some interesting outliers (small galaxies with many citations and large galaxies with few citations). Another highlight was my new undergraduate researcher Soichiro Hattori (NYUAD), who started working for me at 10:30 but by the 15:30 group meeting had already made a plot relevant to his research! Hattori is going to search the Kepler data for Earth analogs, using Foreman-Mackey's Gaussian-Process-based likelihood function.
I spent a great day in Princeton at the birthday and retirement celebration for Ed Groth (Princeton), who was instrumental in the HST WFPC project and is the originator of the incredibly influential Groth Strip. There were many great talks and reminiscences; a few of the highlights for me were the following:
Ed MacDonald (who worked on oceanography for the Navy and NATO) talked about moving data by paper tape from experiment to computer center, and the fact that mundane tasks are an important part of all important scientific discoveries. He noted that Bob Dicke (the leader in the Gravity Group at Princeton) was never afraid of doing mundane things in support of scientific discovery.
Bill Wickes (formerly of HP) talked about many things, not the least of which was the importance of calculators in scientific research. Indeed, calculators featured heavily in the stories and photographs from Groth's early days. Wickes is responsible for inventing and designing and improving various HP calculators. He also talked about the Gravity Group attitude of "you sit on it until it works", which is a very good principle for science!
Bruce Partridge (Haverford) discussed the precise timing of the Crab Pulsar, done at Princeton by him and Groth and others, which led to the discovery of period derivatives, second derivatives, and glitches. The timing was done very cleverly; he showed the electronics diagram. The Gravity Group was always motivated to precisely measure anything for which there was simultaneously a hope of precise measurement and a precise quantitative prediction. He showed also that the search for gravitational radiation was already in the air way back then.
Jason Rhodes (JPL) and Todd Lauer (NOAO) talked about HST imaging. Rhodes and Groth wrote one of the first papers on weak gravitational lensing. Lauer pointed out that Groth was instrumental in starting the HST Archive and our understanding of the huge legacy value of digital data sets.
Finally, Jim Peebles (Princeton) talked about correlation functions, on which he worked with Groth, and which remain the key tool of cosmology today. He showed some lovely visualizations of hand-taken data on galaxy counts from the 1960s and 70s. He highlighted the ways in which Groth's career spanned the transition from "small science" to "big science", doing important things in both modes. It was a great day!
At the final (and all-day) meeting of #NYCastroML we discussed time-series analysis, including spectral analysis, filtering, and Bayesian inference. This was followed by a hack session during which I met with Schiminovich and his group to discuss GALEX photons and Rutger van Haasteren (Caltech) and Michele Vallisneri (Caltech) to discuss application of our HODLR linear algebra tools to gravitational wave detection.
The day ended with Lia Corrales (Columbia) giving a short seminar on x-ray studies of dust, where forward scattering permits (in principle) inference of the distribution of dust in space and also grain size. The talk made me think that if you could have many x-ray point sources measured (and good knowledge of the point-spread function), you could in principle fully map the dust in three-space, and also figure out the three-dimensional positions of all the point sources. Probably not feasible, but interesting to think about.
Foreman-Mackey and I pair-wrote some more in his exoplanet populations paper. We made strict (and explicit) definitions of "rate" and "rate density" and audited the document to be consistent with those definitions. A rate is a dimensionless expectation value for an integer draw from a Poisson distribution. A rate density is something that needs to be integrated over a finite volume in some parameter space to produce a rate. We reminded ourselves that the model is an "inhomogeneous Poisson process" (inhomogeneous because the rate density varies with planet period and radius) and said so where appropriate. We massaged the text around the issues of converting rate estimates from other projects into rate densities to compare with our results. And we finished the figure captions. So close. I also wrote a bit in my own Atlas.
[Added after the fact: Above I am talking about the "rate" of a process inside a discrete population: This is about the rate at which stars host planets. There is another use of "rate" in physics that is number per time; it has to be integrated over a time interval to get a dimensionless number. The words "rate" and "frequency" both have these double meanings of either dimensionless object (in discrete probability contexts) or else number per time (in time-domain physics contexts).]
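The distinction between a rate density and a rate can be sketched numerically; this is not the model in the paper. It assumes a made-up power-law rate density in log period and log radius (constant by default), a made-up number of stars, and it ignores transit probability and detection efficiency entirely. The rate density is integrated over a box in parameter space to get a dimensionless rate per star, and that rate times the number of stars is the expectation value for a Poisson draw.

```python
import numpy as np

def rate_density(ln_period, ln_radius, gamma0=0.05, alpha=0.0, beta=0.0):
    """Hypothetical rate density: planets per star per ln(period) per ln(radius)."""
    return gamma0 * np.exp(alpha * ln_period + beta * ln_radius)

def rate(period_range, radius_range, n_grid=200, **density_kwargs):
    """Integrate the rate density over a box in (ln period, ln radius), midpoint rule."""
    lp_edges = np.linspace(np.log(period_range[0]), np.log(period_range[1]), n_grid + 1)
    lr_edges = np.linspace(np.log(radius_range[0]), np.log(radius_range[1]), n_grid + 1)
    lp = 0.5 * (lp_edges[:-1] + lp_edges[1:])
    lr = 0.5 * (lr_edges[:-1] + lr_edges[1:])
    cell_area = (lp_edges[1] - lp_edges[0]) * (lr_edges[1] - lr_edges[0])
    density = rate_density(lp[:, None], lr[None, :], **density_kwargs)
    return np.sum(density) * cell_area

# Dimensionless rate per star in the box: periods 200-400 days, radii 1-2 Earth radii.
rate_per_star = rate((200.0, 400.0), (1.0, 2.0))

# Expected number in a hypothetical sample, and one Poisson realization of it.
n_stars = 1000                                # hypothetical sample size
expected_number = n_stars * rate_per_star
rng = np.random.default_rng(5)
realized_number = rng.poisson(expected_number)
print(f"rate per star {rate_per_star:.4f}, expected {expected_number:.1f}, drawn {realized_number}")
```

For a constant rate density the integral is just the density times the box area in log space, which is why quoting the density per natural logarithm of period and radius makes results from different period and radius bins directly comparable.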
Vakili and I spent some time in the morning discussing the next steps in his work on PSF interpolation. He can show that his method is better than any polynomial interpolation, and it also provides a probabilistic PSF—that is, it returns a probability distribution function over PSFs. We debated whether to write a paper using this on LSST simulation data, which is fake data (bad) but where we know the truth so we can assess accuracy (good), or else using this on SDSS data, which is real data (good) but where we have no extrinsic handle on truth (bad). His assignment is to figure out just how mature LSST simulation outputs are in the relevant regards. We also discussed applying for the NSF WPS call (PDF).
IMHO, a paper should not have a conclusions section. That's what the abstract is for! That said, many of my most trusted and respected colleagues disagree (including, for example, Johnston). My view is that the final section should be a "discussion" in which the results are put in the context of other work, evaluated and criticized. That last point is the most important: You only understand what your results are when you understand how they could be wrong. Indeed, a critical examination of the results in light of the assumptions makes both the assumptions and the results more clear.
Foreman-Mackey and I spent a lot of time on these issues today as he finishes his exoplanet population paper. Foreman-Mackey's assumptions are very strong, of course, although we argue that they are weaker than those of any other study in this area. One of the things I love about principled probabilistic inference is that it makes it very easy (or almost necessary) to be explicit about your assumptions. In related news, Foreman-Mackey argues that three independent groups now, despite making very different assumptions, have obtained very similar results on the exoplanet radius distribution at Earth-ish radii, so those results are very likely correct (or, conceivably, indicative of a problem with Kepler, which is a common component of all three).
My only substantial research was our weekly MCMC meeting with Goodman. We discussed Goodman's ideas about using all the likelihood calls in an MCMC run to inform a Gaussian-Process-based approximation to the likelihood function; the sampler proceeds by updating the approximation and moving by use of that approximation. He claims to have a provably correct method; we are just wondering whether there are problems for which this is a good idea in practice! We also discussed Hou's response to the referee; this is nearly done and nearly ready to resubmit.