2015-03-31

#astrohackny, day N

At #astrohackny, Ben Weaver (NYU) showed a huge number of binary-star fits to the APOGEE individual-exposure heliocentric radial velocity measurements. He made his code fast, but not yet sensible, in that it treats all possible radial-velocity curves as equally likely, when some are much more easily realized physically than others. In the end, we hope that he can adjust the APOGEE shift-and-add methodology and make better combined spectra.

Glenn Jones (Columbia) and Malz showed some preliminary results building a linear Planck foreground model, using things that look a lot like PCA or HMF. We argued out next steps towards making it a probabilistic model with more realism (the beam and the noise model) and more flexibility (more components or nonlinear functions). Also, the model has massive degeneracies; we talked about breaking those.

2015-03-30

light echoes

I hung out in the office while Tom Loredo (Cornell), Brendon Brewer (Auckland), Iain Murray (Edinburgh), and Huppenkothen all argued about using dictionary-like methods to model a variable-rate Poisson process or density. Quite an assemblage of talent in the room! At lunch-time, Fed Bianco talked about light echoes. It is such a beautiful subject: If the light echo is a linear response to the illumination, I have this intuition that we could (in principle) infer the full three-dimensional distribution of all the dust in the Galaxy and the time-variable illumination from all the sources. In principle!

2015-03-28

radical fake TESS data

This past week, the visit by Zach Berta-Thompson (MIT) got me thinking about possible imaging surveys with non-uniform exposure times. In principle, at fixed bandwidth, there might be far more information in a survey with jittered exposure times than in a survey with uniform exposure times. In the context of LSST I have been thinking about this in terms of saturation, calibration, systematics monitoring, dynamic range, and point-spread function. However, in the context of TESS the question is all about frequency content in the data: Can we do asteroseismology at frequencies way higher than the inverse mean exposure time if the exposure times are varied properly? This weekend I started writing some code to play in this sandbox, that is, to simulate TESS data but with randomized exposure times (though identical total data output).
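To make the sandbox idea concrete, here is a toy sketch (my own illustration here, not the code in question, and with made-up numbers throughout): under uniform exposures, a sinusoid above the Nyquist frequency of the mean cadence is degenerate with its alias, but jittered exposure times break that degeneracy.

```python
import numpy as np

def observe(freq, starts, lengths, n_sub=64):
    """Mean of a unit sinusoid over each exposure window (midpoint rule)."""
    out = np.empty(len(starts))
    for i, (t0, dt) in enumerate(zip(starts, lengths)):
        tt = t0 + dt * (np.arange(n_sub) + 0.5) / n_sub
        out[i] = np.mean(np.sin(2.0 * np.pi * freq * tt))
    return out

rng = np.random.default_rng(42)
n_exp = 1000
mean_dt = 30.0 / (60.0 * 24.0)  # 30-minute mean cadence, in days

# uniform cadence: identical, contiguous exposures
u_lengths = np.full(n_exp, mean_dt)
u_starts = np.cumsum(u_lengths) - u_lengths

# jittered cadence: contiguous exposures with randomized lengths,
# renormalized so the total observing time (and data output) is identical
j_lengths = rng.uniform(0.5, 1.5, n_exp) * mean_dt
j_lengths *= n_exp * mean_dt / j_lengths.sum()
j_starts = np.cumsum(j_lengths) - j_lengths

# a frequency well above the Nyquist frequency of the mean cadence,
# and the alias it maps onto under uniform sampling
freq = 3.3 / mean_dt
alias = freq + 1.0 / mean_dt

# under uniform sampling the two frequencies produce (anti)correlated light
# curves; under jittered sampling the correlation drops toward zero
r_uniform = np.corrcoef(observe(freq, u_starts, u_lengths),
                        observe(alias, u_starts, u_lengths))[0, 1]
r_jitter = np.corrcoef(observe(freq, j_starts, j_lengths),
                       observe(alias, j_starts, j_lengths))[0, 1]
```

The correlation coefficients are the crude stand-in for "frequency content": if two frequencies make indistinguishable light curves, no analysis can separate them.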

2015-03-26

probabilistic density inference, TESS cosmics

Boris Leistedt (UCL) showed up for the day; we discussed projects for the future when he is a Simons Postdoctoral Fellow at NYU. He even has a shared Google Doc with his plans, which is a very good idea (I should do that). In particular, we talked about small steps we can take towards fully probabilistic cosmology projects. One is performing local inference of large-scale structure to hierarchically infer (or shrink) posterior information about the redshift-space positions of objects with no redshift measurement (or imprecise ones).

Zach Berta-Thompson (MIT) reported on his efforts to optimize the hyper-parameters of my online robust statistics method for cosmic-ray mitigation in the TESS spacecraft. He found values for the two hyper-parameters such that, for some magnitude ranges, my method beats his simple and brilliant middle-eight-of-ten method. However, because my method is more complicated, and because its success seems to depend (possibly non-trivially) on his (somewhat naive) TESS simulation, he is inclined to stick with middle-eight-of-ten. I asked him for a full and complete search of the hyper-parameter space but agreed with his judgement in general.

2015-03-25

online, on-board robust statistics

Zach Berta-Thompson (MIT) showed up at NYU today to discuss the on-board data analysis performed by the TESS spacecraft. His primary concern is cosmic rays: With the thick detectors in the cameras, cosmic rays will affect a large fraction of pixels in a 30-minute exposure. Fundamentally, the spacecraft takes 2-second exposures and co-adds them on-board, so there are lots of options for cosmic-ray mitigation. The catch is that the computation all has to be done on board with limited access to RAM and CPU.

Berta-Thompson showed that a "middle-eight-of-ten" strategy (for every ten sub-exposures, average all but the highest and the lowest) does a pretty good job. I proposed something that looks like the standard "iteratively reweighted least squares" algorithm, but operating in an "online" mode where it can only see the last few elements of the past history. Berta-Thompson, Foreman-Mackey, and I tri-coded it in the Center for Data Science studio space. The default algorithm I wrote down didn't work well right out of the box, but there are two hyper-parameters to tune. We put Berta-Thompson onto tuning.
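For concreteness, here is a toy simulation of the middle-eight-of-ten idea (a sketch of my own, not the flight code; the hit rate and hit amplitudes are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_groups, n_sub = 90, 10   # 90 groups of ten short sub-exposures
true_flux = 100.0

# sub-exposures: Gaussian noise plus occasional large cosmic-ray hits
flux = true_flux + rng.normal(0.0, 1.0, (n_groups, n_sub))
hits = rng.random((n_groups, n_sub)) < 0.05  # assumed 5 percent hit rate
flux[hits] += rng.exponential(50.0, hits.sum())

# middle-eight-of-ten: within each group, drop the highest and lowest
# sub-exposure and average the remaining eight
s = np.sort(flux, axis=1)
middle_eight = s[:, 1:-1].mean(axis=1)

# naive co-add for comparison: average all ten sub-exposures
plain_mean = flux.mean(axis=1)

err_middle = np.abs(middle_eight - true_flux).mean()
err_plain = np.abs(plain_mean - true_flux).mean()
```

The trimmed co-add rejects nearly all single-hit groups at the cost of a small noise penalty; only groups with two or more hits leak contamination through, which is why the simple strategy does so well.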

2015-03-24

dissertation transits

Schölkopf, Foreman-Mackey, and I discussed the single-transit project, in which we are using standard machine learning and a lot of signal injections into real data to find single transits in the Kepler light curves. This is the third chapter of Foreman-Mackey's thesis, so the scope of the project is limited by the time available! Foreman-Mackey had a breakthrough on how to split the data (for each star) into train, validate, and test such that he could just do three independent trainings for each star and still capture the full variability. False positives remain dominated by rare events in individual light curves.

With Dun Wang, we discussed the GALEX photon project; his job is to see what about the photons is available at MAST, if anything, especially anything about the focal-plane coordinates at which they were detected (as opposed to celestial-sphere coordinates). This was followed by lunch at Facebook with Yann LeCun.

2015-03-23

Simons Center for Data Analysis

Bernhard Schölkopf arrived for a couple of days of work. We spent the morning discussing radio interferometry, Kepler light-curve modeling, and various things philosophical. We headed up to the Simons Foundation to the Simons Center for Data Analysis for lunch. We had lunch with Marina Spivak (Simons) and Jim Simons (Simons). With the latter I discussed the issues of finding exoplanet rings, moons, and Trojans.

After lunch we ran into Leslie Greengard (Simons) and Alex Barnett (Dartmouth), with whom we had a long conversation about the linear algebra of non-compact kernel matrices on the sphere. This all relates to tractable non-approximate likelihood functions for the cosmic microwave background. The conversation ranged from cautiously optimistic (that we could do this for Planck-like data sets) to totally pessimistic, ending on an optimistic note. The day ended with a talk by Laura Haas (IBM) about infrastructure (and social science) she has been building (at IBM and in academic projects) around data-driven science and discovery. She showed a great example of drug discovery (for cancer) by automated "reading" of the literature.

2015-03-20

health

I took a physical-health day today, which means I stayed at home and worked on my students' projects, including commenting on drafts, manuscripts, or plots from Malz, Vakili, and Wang.

2015-03-19

robust fitting, intelligence, and stellar systems

In the morning I talked to Ben Weaver (NYU) about performing robust (as in "robust statistics") fitting of binary-star radial-velocity functions to the radial velocity measurements of the individual exposures from the APOGEE spectroscopy. The goal is to identify radial-velocity outliers and improve APOGEE data analysis, but we might make a few discoveries along the way, a la what's implied by this paper.

At lunch-time I met up with Bruce Knuteson (Kn-X) who is starting a company (see here) that uses a clever but simple economic model to obtain true information from untrusted and anonymous sources. He asked me about possible uses in astrophysics. He also asked me if I know anyone in US intelligence. I don't!

In the afternoon, Tim Morton (Princeton) came up to discuss things related to multiple-star and exoplanet systems. One of the things we discussed is how to parameterize or build pdfs over planetary systems, which can have very different numbers of elements and parameters. One option is to classify systems into classes, and build a model of each (implicitly qualitatively different) class and then model the full distribution as a mixture of classes. Another is to model the "biggest" or "most important" planet first; in this case we build a model of the pdf over the "most important planet" and then deal with the rest of the planets later. Another is to say that every single star has a huge number of planets (like thousands or infinity) and just most of them are unobservable. Then the model is over an (effectively) infinite-dimensional vector for every system (most elements of which describe planets that are unobservable or will not be observed any time soon).

This infinite-planet descriptor sounds insane, but there are lots of tractable models like this in the world of non-parametrics. And the Solar System certainly suggests that most stars probably do have many thousands of planets (at least). You can guess from this discussion where we are leaning. Everything we figure out about planet systems applies to stellar systems too.

2015-03-18

Blanton-Hogg group meeting

Today was the first-ever instance of the new Blanton–Hogg combined group meeting. Chang-Hoon Hahn (NYU) presented work on the environmental dependence of galaxy populations in the PRIMUS data set and a referee report he is responding to. We discussed how the redshift incompleteness of the survey might depend on galaxy type. Vakili showed some preliminary results he has on machine-learning-based photometric redshifts. We encouraged him to go down the "feature selection" path to start; it would be great to know what SDSS catalog entries are most useful for predicting redshift! Sanderson presented issues she is having with building a hierarchical probabilistic model of the Milky Way satellite galaxies. She had issues with the completeness (omg, how many times have we had such issues at Camp Hogg!) but I hijacked the conversation onto the differences between binomial and Poisson likelihood functions. Her problem is very, very similar to that solved by Foreman-Mackey for exoplanets, but just with different functional forms for everything.
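On the binomial-versus-Poisson point: with many candidates and small per-candidate detection probability, the two likelihoods agree closely, which is part of why the choice is easy to gloss over. A quick numerical check (toy numbers of my own, nothing to do with Sanderson's actual satellite counts):

```python
import math

def log_binomial(k, n, p):
    """Log-likelihood of k detections among n candidates, each with probability p."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def log_poisson(k, lam):
    """Log-likelihood of k detections given an expected count lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

# many candidates, small per-candidate probability: the two should agree
n, p, k = 100000, 2.0e-4, 18
lb = log_binomial(k, n, p)
lp = log_poisson(k, n * p)
```

The distinction matters when the "number of trials" is itself small or uncertain, which is exactly the completeness regime the conversation was about.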

2015-03-17

#astrohackny, CMB likelihood

I spent most of #astrohackny arguing with Jeff Andrews (Columbia) about white-dwarf cooling age differences and how to do inference given measurements of white dwarf masses and cooling times (for white dwarfs in coeval binaries). The problem is non-trivial and is giving Andrews biased results. In the end we decided to obey the advice I usually give, which is to beat up the likelihood function before doing the full inference. Meaning: Try to figure out if the inference issues are in the likelihood function, the prior, or the MCMC sampler. Since all these things combine in a full inference, it makes sense to "unit test" (as it were) the likelihood function first.

Late in the day I discussed the CMB likelihood function with Evan Biederstedt. Our goal is to show that we can perform a non-approximate likelihood function evaluation in real space for a non-uniformly observed CMB sky (heteroskedastic noise and a cut sky). This involves solving—and taking the determinant of—a large matrix (50 million by 50 million in the case of Planck). I, for one, think we can do this, using our brand-new linear algebra foo.
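At toy scale the exact evaluation is textbook stuff; the entire challenge is scaling the solve and log-determinant to Planck-size matrices. A minimal sketch (an illustrative one-dimensional covariance of my own invention, not anything like the real Planck beam or noise model):

```python
import numpy as np

def gaussian_loglike(d, C):
    """Exact Gaussian log-likelihood: -0.5 * (d^T C^-1 d + ln det C + n ln 2 pi)."""
    n = len(d)
    sign, logdet = np.linalg.slogdet(C)
    assert sign > 0.0  # the covariance must be positive definite
    chi2 = d @ np.linalg.solve(C, d)
    return -0.5 * (chi2 + logdet + n * np.log(2.0 * np.pi))

rng = np.random.default_rng(1)
n = 300  # 300 "pixels" here; the real problem has ~50 million

# toy covariance: smooth signal correlations plus heteroskedastic pixel noise
x = np.linspace(0.0, 1.0, n)
C = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.1) ** 2)
C += np.diag(rng.uniform(0.1, 1.0, n))

d = np.linalg.cholesky(C) @ rng.normal(size=n)  # a draw from N(0, C)
ll = gaussian_loglike(d, C)
```

The dense solve and slogdet here cost O(n^3), which is exactly what has to be beaten down (by exploiting kernel structure) before this works at 50 million pixels.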

2015-03-16

probabilistic Cannon

The biggest conceptual issue with The Cannon (our data-driven model of stellar spectra) is that the system is a pure optimization or frequentist or estimator system: We presume that the training-data labels are precise and accurate, and we obtain, for each test-set spectrum, best-fit labels. In reality our labels are noisy, there are stars that could be used for training but they only have partial labels (logg only from asteroseismology, for example), and we don't have zero knowledge about the labels of the unlabeled spectra. This calls for Bayes. Foreman-Mackey drew a graphical model in the morning and suggested variational inference. Late in the afternoon, David Sontag (NYU) drew that same model and made the same suggestion! Sontag also pointed out that there are some new ideas in variational inference that might make the project an interesting project in the computer-science-meets-statistics literature too. Any takers?

2015-03-13

Tufts

I spent the day at Tufts, where I spoke about The Cannon. Conversation with the locals centered on galaxy evolution, about which there are many interesting projects brewing.

2015-03-12

GRE issues; binary star anomalies

Keivan Stassun (Vanderbilt) was at NYU all day, giving a morning talk about his very successful STEM PhD bridge program and an afternoon talk about stars (as they relate to exoplanets and other multiple systems). There was also a great discussion in-between, with academics from around the University in attendance. During lunch, Stassun emphasized that if there is one, single take-home thing we can do to improve the way we run our PhD programs, it is to stop using the GRE as an indicator of merit. He said that there is now abundant, redundant information and studies that show that GRE performance is a very strong function of sex and race, even controlling for scholastic aptitude. The adoption of the GRE was, of course, a very progressive thing: Let's judge applicants on objective measures of merit! But it turns out in practice that it does not measure merit. Most of us (myself included) think about the GRE anecdotally (what was it like for me, or for my students); but if we think about it systematically, I think we will find that we shouldn't be using it if what we want is to admit the best possible students. Stassun: Testify!

In the afternoon talk, Stassun showed some very tantalizing and very perplexing evidence that stars in trinary systems might be physically different from stars in binary systems! He showed that for "hard" eclipsing binaries, the consistency of the stellar radii and luminosities and masses with a deterministic set of relationships appears violated for binaries that have a distant tertiary companion. That is, the distant companion seems to affect the stars in the binary. The data set is still small, and it could be a fluke, but the observation makes clear predictions and presents an awesome physics puzzle. He also talked about the flicker method for determining stellar surface gravities, which I have discussed here previously.

2015-03-11

inferring evolution, hidden Markov model

Sriram Sankararaman (Harvard) gave a great Computer Science Colloquium today about inferring the evolutionary tree (well, it isn't really a tree) from genetic information, particularly as regards humans and neandertals. He is able to show, using the statistics of DNA variability, that humans and neandertals had intermixing long after they separated (both geographically and as species). He was also able to show that there is statistical evidence for the sterility (infertility) of males after speciation. Awesome stuff, and very related to cosmology in many ways: The models are of two-point statistics of the DNA sequences, not the sequences themselves, and the probabilistic modeling methods (approximate Gaussian likelihood functions and MCMC) are very similar indeed.

Prior to that, in group meeting, McFee and Huppenkothen jointly proposed a plan for clustering black hole timing data using a hidden Markov model: The idea is that the data are generated by a probability distribution that is set by a state, and there are finite probabilities of transitioning from state to state at each time step. This is a well-understood idea in machine learning, but also very close to how we think about the generation of the timing data, fundamentally. Great plan! Huppenkothen's first order of business is to run k-means in a feature space (for initialization of the HMM).
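A toy sketch of that initialization step (hypothetical two-dimensional features and made-up state parameters, not the real timing data): run k-means in feature space, then count empirical transitions between cluster labels to seed the HMM transition matrix.

```python
import numpy as np

rng = np.random.default_rng(3)

def kmeans(X, centers, n_iter=50):
    """Plain k-means from given initial centers; returns labels and centers."""
    centers = centers.copy()
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# toy "timing" features, e.g. (mean count rate, rms variability) per segment;
# two latent states with well-separated feature distributions
n = 400
states = (rng.random(n) < 0.5).astype(int)
X = np.where(states[:, None] == 0,
             rng.normal([1.0, 0.3], 0.1, (n, 2)),
             rng.normal([3.0, 1.0], 0.1, (n, 2)))

# initialize with two extreme points along the first feature, then iterate
init = X[[np.argmin(X[:, 0]), np.argmax(X[:, 0])]]
labels, centers = kmeans(X, init)

# seed the HMM transition matrix with empirical transitions between labels
T = np.zeros((2, 2))
for a, b in zip(labels[:-1], labels[1:]):
    T[a, b] += 1.0
T /= T.sum(axis=1, keepdims=True)
```

The per-state emission distributions would then be initialized from the feature means and scatters within each cluster, and everything refined jointly by the usual HMM expectation-maximization.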