quasars! exoplanets! dark matter at small scales!

CampHogg group meeting was impressive today, with spontaneous appearances by Andreu Font-Ribera (LBL), Heather Knutson (Caltech), and Lucianne Walkowicz (Adler). All three told us something about their research. Font-Ribera showed a two-dimensional quasar–absorption cross-correlation, which in principle contains a huge amount of information about both large-scale structure and the illumination of the IGM. He seems to find that IGM illumination is simple, or that the data are consistent with a direct relationship between IGM absorption and density.

Knutson showed us results from a study to see if stars hosting hot Jupiters on highly inclined (relative to the stellar rotation) orbits are different in their binary properties from those hosting hot Jupiters on co-planar orbits. The answer seems to be "no", although it does look like there is some difference between stars that host hot Jupiters and stars that don't. This all has implications for planet migration; it tends to push towards disk migration having a larger role.

We interviewed Walkowicz about the variability of the Sun (my loyal reader will recall that we loved her paper on the subject). She made a very interesting point for future study: The "plage" areas on the Sun (which are brighter than average) might be just as important as the sunspots (which are darker than average) in causing time variability. Also, the plage areas are very different from the sunspots in their emissivity properties, so they might really require a new kind of model. Time to call the applied mathematics team!

In the afternoon, Alyson Brooks (Rutgers) gave the astro seminar, on the various issues with CDM on small scales. Things sure have evolved since I was working in this area: She showed that the dynamical influence of baryonic physics (collapse, outflows, and so on) is either definitely or conceivably able to create the issues we see with galaxy density profiles at small scales, the numbers of visible satellites, the density distribution of satellites, and the sizes of disk-galaxy bulges. On the last of these it still seems like there is a problem, but on the face of it, there is not really any strong reason to be unhappy with CDM. As my loyal reader knows, this makes me unhappy! How can CDM be the correct theory at all scales? All that said, Brooks herself is hopeful that precise tests of CDM at galaxy scales will reveal new physics and she is doing some of that work now. She also gave great shout-outs to Adi Zolotov.


single transits, redshift likelihoods

A high research day today, for the first time in what feels like months! In group meeting in the finally-gutted NYU CDS studio space, So Hattori told us about some single transits in Kepler and his putative ability to find them. We unleashed some project management on him and now he has a great to-do list. No success in CampHogg goes unpunished! Along the way, he re-discovered a ridiculously odd Kepler target that has three transits from at least two different kinds of planets, neither of which seems periodic. Or maybe it is one planet around a binary host, or maybe worse? That launched an email thread with some Kepler peeps.

Also at group meeting, Dun Wang showed some near-final tests of the hyper-parameter choices in his data-driven model of the Kepler pixels. It is getting down to details, but details matter. We came up with one final possible simplification for his hyper-parameter choices for him to test this week.

In the afternoon, Alex Malz came by to discuss Spring courses and we ended up working through a menu of possible thesis projects. One that I pitched is so sweet: It is just to write down, very carefully, what we would do if, instead of a redshift catalog, we had a set of low-precision redshift likelihood functions (with SED or spectral nuisance parameters). Could we then get the luminosity function and spatial clustering of galaxies? Of course we could, but we would have to go hierarchical. Is this practical at LSST scale? Not sure yet.
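As a back-of-envelope version of the hierarchical idea, here is a toy sketch (the redshift grid, the Gaussian likelihood shape, and every number are my inventions, not the project's actual choices): each galaxy contributes a likelihood function over redshift rather than a point estimate, and an expectation-maximization loop infers the population distribution from those likelihoods.

```python
import numpy as np

rng = np.random.default_rng(42)

# toy setup: the true redshift distribution lives on a grid
z_grid = np.linspace(0.0, 2.0, 41)
true_nz = np.exp(-0.5 * ((z_grid - 0.8) / 0.3) ** 2)
true_nz /= true_nz.sum()

# each "galaxy" delivers a likelihood function over z, not a point estimate
n_gal = 500
z_true = rng.choice(z_grid, size=n_gal, p=true_nz)
sigma_z = 0.15
z_obs = z_true + sigma_z * rng.standard_normal(n_gal)
like = np.exp(-0.5 * ((z_obs[:, None] - z_grid[None, :]) / sigma_z) ** 2)

# hierarchical (EM) estimate of n(z): weight each galaxy's likelihood by
# the current population distribution, then re-estimate the distribution
nz = np.full(z_grid.size, 1.0 / z_grid.size)
for _ in range(200):
    resp = like * nz[None, :]                # posterior ~ likelihood x prior
    resp /= resp.sum(axis=1, keepdims=True)
    nz = resp.mean(axis=0)                   # updated population histogram

peak = z_grid[np.argmax(nz)]                 # should land near the true peak
```

The same pattern generalizes: swap the redshift histogram for a parameterized luminosity function or clustering model, and the Gaussian likelihoods for whatever the photometric pipeline delivers.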


text, R, and politics

Today at lunch Michael Blanton organized a Data Science event in which Ken Benoit (LSE) told us about quanteda, his package for manipulating text in R. This package does lots of the data massaging and munging that used to be manual work, and gets the text data into "rectangular" form for data analysis. It does lots of data analysis tasks too, but the munging was very interesting: Part of Benoit's motivation is to make text analyses reproducible from beginning to end. Benoit's example texts were amusing because he works on political speeches. He had examples from US and Irish politics. Some discussion in the room was about Python vs R; the key motivation for working in R is that it is by far the dominant language at the intersection of statistics and political science.


nuclear composition of UHECRs

Today Michael Unger (Karlsruhe) told us over lunch about ultra-high energy cosmic rays from Auger. There are many mysteries, but it does look like the composition moves to higher-Z nuclei as you go to higher energies, or at least that's my read. He told us also about a very intriguing extension to Auger which would make it possible to distinguish protons from iron in the ground detectors; if that became possible, it might be possible to do cosmic-ray imaging: It is thought that the cosmic magnetic fields are small enough that protons near the GZK cutoff should point back to their sources. So far this hasn't been possible, presumably because the iron (and other heavy elements) have charge-to-momentum ratios too large; they get heavily deflected by the magnetic fields they encounter.
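To make the charge argument concrete, here is a hedged back-of-envelope sketch using the standard rigidity scaling for deflections in extragalactic magnetic fields; the coefficient, field strength, and distance below are rough placeholders for intuition only, not Auger numbers.

```python
import numpy as np

def deflection_deg(Z, E_eV, B_nG=1.0, D_Mpc=50.0):
    """Characteristic deflection angle, using the approximate scaling
    theta ~ 0.8 deg * Z * (1e20 eV / E) * sqrt(D / 10 Mpc) * (B / 1 nG).
    Coefficient and inputs are rough; the point is the Z / E dependence."""
    return 0.8 * Z * (1e20 / E_eV) * np.sqrt(D_Mpc / 10.0) * B_nG

E = 6e19  # eV, near the GZK cutoff
theta_p = deflection_deg(1, E)    # proton
theta_fe = deflection_deg(26, E)  # iron: 26x the deflection at fixed energy
```

At fixed energy the deflection scales linearly with charge, so iron nuclei are bent roughly 26 times more than protons; that is why composition matters so much for any hope of cosmic-ray imaging.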


Math-Astrophysics collaboration proposal

I spent a big chunk of the day today trying to write a draft of a collaboration proposal (really a letter of intent) for the Simons Foundation. That is only barely research.


exoplanet compositions

Today Angie Wolfgang (UCSC) gave a short morning seminar about hierarchical inference of exoplanet compositions (like are they ice or gas or rock?). She showed that the super-Earth (1 to 4 Earth-radius) planet radius distribution fairly simply translates into a composition distribution, if you are willing to make the (pretty justified, actually) assumption that the planets are rocky cores with a hydrogen/helium envelope. She inferred the distribution of gas fractions for these presumed rocky planets and got some reasonable numbers. Nice! There is much more to do, of course, since she cut to a very clean sample, and hasn't yet looked at the interdependence of composition, period, and host-star properties. There is a lot to do in exoplanet populations still!


training convolutional nets to find exoplanets

In group meeting today, a good discussion arose about training a supervised method to find exoplanet transits. Data-Science Masters student Elizabeth Lamm (NYU) is working with us to use a convolutional net (think: deep learning) to find exoplanet transits in the Kepler data. Our rough plan is to train this net using real Kepler lightcurves into which we have injected artificial planets. This will give "true positive" training examples, but we also need "true negative" examples. Since transits are rare, most of the lightcurves would make good negative training data; even if we labeled all of the non-injected lightcurves as negatives, the rate of mislabeled examples would be a tiny fraction of a percent (like a hundredth of a percent).
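The injection scheme can be sketched in a few lines; the box-shaped transit, the white-noise lightcurves, and every number below are toy stand-ins for the real Kepler data and transit models:

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_box_transit(flux, depth, duration, t0):
    """Inject a box-shaped transit (a crude stand-in for a real transit
    model) into a copy of a flux time series."""
    out = flux.copy()
    out[t0:t0 + duration] -= depth
    return out

# white-noise stand-ins for Kepler lightcurves
n_lc, n_points = 200, 1000
lightcurves = 1.0 + 1e-3 * rng.standard_normal((n_lc, n_points))

# positives: lightcurves with injected transits at random epochs
positives = np.array([
    inject_box_transit(lc, depth=5e-3, duration=20,
                       t0=rng.integers(0, n_points - 20))
    for lc in lightcurves[:n_lc // 2]
])

# negatives: untouched lightcurves; real transits are rare enough that any
# unflagged ones contaminate the labels at only a tiny fraction of a percent
negatives = lightcurves[n_lc // 2:]

X = np.vstack([positives, negatives])
y = np.concatenate([np.ones(len(positives)), np.zeros(len(negatives))])
```

From here `X` and `y` would feed whatever supervised method one likes, convolutional net included.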

That said, there were various intuitions (about training) represented in the discussion. One intuition is that even this low rate of false negatives might lead to some kinds of over-fitting. Another is that perhaps we should up-weight in the training data true negatives that are "threshold crossing events" or, in other words, places where simple software systems think there is a transit but close inspection says there isn't. We finished the discussion in disagreement, but realized that Lamm's project is pretty rich!


K2 pointing model

Imagine a strange "game": A crazy telescope designer put thousands of tiny pixelized detectors in the focal plane of an otherwise stable telescope and put it in space. Each detector has an arbitrary position in the focal plane, an arbitrary orientation and pixel scale, and possibly even non-square (affine) pixels. But given the stability, the telescope's state at any time is set by only three Euler angles. How can you build a model of this? Ben Montet (Harvard CfA), Foreman-Mackey, and I worked on this problem today. Our approach is to construct a three-dimensional "latent-variable" space in which the telescope "lives" and then an affine transformation for each detector patch. It worked like crazy on the K2 data, which are the data from the two-wheel era of the NASA Kepler satellite. Montet is very optimistic about our abilities to improve both K2 and Kepler photometry.
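As a toy illustration (with made-up numbers, and showing only the easy half of the problem): if the shared latent pointing series were known, each detector patch's affine map would follow from plain linear least squares. The real model fits the latent series and the affine maps simultaneously.

```python
import numpy as np

rng = np.random.default_rng(1)

# a shared three-dimensional latent series (think: small Euler-angle
# excursions of the otherwise-stable telescope)
n_times = 300
latent = 1e-2 * rng.standard_normal((n_times, 3))

# each detector patch sees the latent series through its own affine map
# (encoding its unknown position, orientation, and pixel scale)
n_det = 5
true_A = rng.standard_normal((n_det, 2, 3))
true_b = rng.standard_normal((n_det, 2))
centroids = np.einsum('dij,tj->tdi', true_A, latent) + true_b[None, :, :]
centroids += 1e-4 * rng.standard_normal(centroids.shape)  # measurement noise

# with the latent series known, one patch's affine map is a linear fit
d = 2  # pick one detector patch
design = np.hstack([latent, np.ones((n_times, 1))])
coef, *_ = np.linalg.lstsq(design, centroids[:, d, :], rcond=None)
A_hat, b_hat = coef[:3].T, coef[3]
```

Alternating this per-patch fit with an update of the latent series (or optimizing everything jointly) is the shape of the full problem.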


single transits, new physics, K2

In my small amount of research time, I worked on the text for Hattori's paper on single transits in the Kepler data, including how we can search for them and what can be inferred from them. At lunch, Josh Ruderman (NYU) gave a nice talk on finding beyond-the-standard-model physics in the ATLAS experiment at LHC. He made a nice argument at the beginning of his talk that there must be new physics for three reasons: baryogenesis, dark matter, and the hierarchy problem. The last is a naturalness argument, but the other two are pretty strong arguments! In the afternoon, while I ripped out furniture, Ben Montet (Harvard) and Foreman-Mackey worked on centroiding stars in the K2 data.


three talks

Three great talks happened today. Two by Jason Kalirai (STScI) on WFIRST and the connection between white dwarf stars and their progenitors. One by Foreman-Mackey on the new paper on M-dwarf planetary system abundances by Ballard & Johnson. Kalirai did a good job of justifying the science case for WFIRST; it will do a huge survey at good angular resolution and great depth. He distinguished it nicely from Euclid. It also has a Guest Observer program. On the white-dwarf stuff he showed some mind-blowing color-magnitude diagrams; it is incredible how well calibrated HST is and how well Kalirai and his team can do crowded-field photometry, both at the bright end and at the faint end. Foreman-Mackey's journal-club talk convinced us that there is a huge amount to do in exoplanetary system population inference going forward; papers like Ballard & Johnson only barely scratch the surface of what we might be doing.


regression of continuum-normalized spectra

I had a short phone call this morning with Jeffrey Mei (NYUAD) about his project to find the absorption lines associated with high-latitude, low-amplitude extinction. The plan is to do regression of A- and F-star spectra against labels (in this case, H-delta EW as a temperature indicator and SFD extinction), just like the project with Melissa Ness (MPIA) (where the labels are stellar parameters instead). Mei and I got waylaid by the SDSS calibration system, but now we are working on the raw data, and continuum-normalizing before we regress. This gets rid of almost all our calibration issues. The remaining problem (which I don't know how to solve) is the redshift or rest-frame problem: We want to work on the spectra in the rest frame of the ISM, which we don't know!
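A toy version of the regression step (all numbers, line positions, and label values below are hypothetical): fit each pixel's continuum-normalized flux as a linear function of the labels, and look for pixels whose flux tracks extinction.

```python
import numpy as np

rng = np.random.default_rng(7)

# toy data: 300 stars, 50-pixel continuum-normalized spectra
n_star, n_pix = 300, 50
labels = np.column_stack([
    rng.standard_normal(n_star),      # "H-delta EW" temperature indicator
    rng.uniform(0.0, 0.1, n_star),    # "SFD extinction"
])

# flat continua plus one temperature-sensitive stellar line (pixel 20)
# and one extinction-tracking ISM line (pixel 35)
spectra = np.ones((n_star, n_pix))
spectra[:, 20] -= 0.05 * labels[:, 0]
spectra[:, 35] -= 0.50 * labels[:, 1]
spectra += 1e-3 * rng.standard_normal(spectra.shape)

# per-pixel linear regression of flux against the labels (plus an offset)
design = np.column_stack([np.ones(n_star), labels])   # (n_star, 3)
coef, *_ = np.linalg.lstsq(design, spectra, rcond=None)

# the extinction coefficient should light up only at the ISM pixel
ext_coef = coef[2]
ism_pixel = int(np.argmin(ext_coef))
```

The rest-frame problem enters because a real ISM line would be smeared across neighboring pixels by unknown line-of-sight velocities, which this toy ignores.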


measuring the positions of stars

At group meeting, Vakili showed his results on star positional measurements. We have several super-fast, approximate schemes that come close to saturating the Cramér–Rao bound, without requiring a good model of the point-spread function.

One of these methods is the (insane) method used in the SDSS pipelines, which was communicated to us in the form of code (since it isn't fully written up anywhere). This method (due to Lupton) is genius, fast, runs on minimal hardware with almost no overhead, and comes close to saturating the bound. Another of these is the method made up on the spot by Price-Whelan and me when we wrote this paper on digitization bandwidth, with a small modification (involving smoothing (gasp!) the image); the APW method is simpler and faster than the SDSS method on modern compute machinery.
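Here is a sketch of the smooth-then-refine idea (my own toy rendition, not the exact recipe of either the SDSS or the APW method): smooth the image, take the brightest pixel, and refine with a three-point parabola in each axis. The Gaussian star, the smoothing scale, and all numbers are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(3)

def make_star(shape, x0, y0, fwhm, flux):
    """A pixelized Gaussian star (stand-in PSF) plus white noise."""
    yy, xx = np.mgrid[:shape[0], :shape[1]]
    sig = fwhm / 2.355
    img = flux * np.exp(-0.5 * ((xx - x0) ** 2 + (yy - y0) ** 2) / sig ** 2)
    return img + rng.standard_normal(shape)

def smooth_peak_centroid(img, sigma=1.0):
    """Centroid by smoothing the image and refining the brightest pixel
    with a three-point parabola in each axis; no PSF model required."""
    sm = gaussian_filter(img, sigma)
    j, i = np.unravel_index(np.argmax(sm), sm.shape)

    def parab(m1, c, p1):
        denom = m1 - 2.0 * c + p1
        return 0.0 if denom == 0.0 else 0.5 * (m1 - p1) / denom

    dx = parab(sm[j, i - 1], sm[j, i], sm[j, i + 1])
    dy = parab(sm[j - 1, i], sm[j, i], sm[j + 1, i])
    return i + dx, j + dy

img = make_star((21, 21), x0=10.3, y0=9.7, fwhm=3.0, flux=200.0)
x, y = smooth_peak_centroid(img)
```

Even this crude version lands within a small fraction of a pixel at high signal-to-noise, which is the spirit of the result: simple, model-free schemes get close to the bound.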

Full-up PSF modeling should beat (very slightly) both of these methods, but it degrades in an unknown way as the PSF model goes wrong, and who is confident that he or she has a perfect PSF model? Vakili is going to have a nice paper on all this; we started writing it just as an aside to other things we are doing, but we realized that much of what we are learning is not really in the literature. Let's hear it for the analysis of astronomical engineering infrastructure!


software and literature; convex problems

Fernando Perez (Berkeley), Karthik Ram (Berkeley), and Jake Vanderplas (UW) all descended on CampHogg today, and we were joined by Brian McFee (NYU) and Jennifer Hill (NYU) to discuss an idea hatched by Hill at Asilomar to build a system to scrape the literature—both refereed and informal—for software use. The idea is to build a network and a recommendation system and alt metrics and a search system for software in use in scientific projects. There are many different use cases if we can understand how papers made use of software. There was a lot of discussion of issues with scraping the literature, and then some hacking. This has only just begun.

At lunch, I visited the Simons Center for Data Analysis. I ended up having a long conversation with Christian Mueller (Simons) about the intersection of statistics with convex optimization. Among other things, he is working on principled methods for setting the hyperparameters in regularized optimizations. He told me many things I didn't know about convex problems in data analysis. In particular, he indicated that there might be some very clever and provably optimal (or non-sub-optimal) ways to reduce the feature space for the "Causal Pixel Model" for Kepler pixels that Wang is working on.


Kepler occurrence rate review, day 2

Today the review committee wrote up and presented recommendations to the Kepler team on its close-out plans for planet occurrence rate inference. We noted that the big issues in occurrence rates (especially near Earth-like planets) are at the factor-of-two level and larger, so we recommended that the team focus on the big things and not spend time tracking down percent-level effects. After the review I had long talks with Jon Jenkins (Ames) and Tom Barclay (Ames) about Kepler projects and tools.


Kepler occurrence rate review, day 1

Today I got up at dawn's crack and drove to Mountain View for a review of the NASA Kepler team's planet occurrence rate inferences. It was an incredible day of talks and conversations about the data products and experiments needed to turn Kepler's planet (or object-of-interest) catalog into a rate density for exoplanets, and especially the probabilities that stars host Earth-like planets. We spent time talking about high-level priorities, but also low-level methodologies, including MCMC for uncertainty propagation, adaptive experimental design for completeness (efficiency) estimation, and the relative merits of forward modeling and counting planets in bins. On the latter, the Kepler team is creating (and will release publicly) everything needed for either approach.
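For concreteness, the simplest "counting planets in bins" estimator looks like this (toy numbers throughout, and a completeness that is assumed known; forward modeling replaces this division with a simulation of the full detection process):

```python
import numpy as np

rng = np.random.default_rng(5)

# one (period, radius) bin with an assumed-known detection efficiency
n_stars = 20000
true_rate = 0.02      # planets per star in the bin
completeness = 0.3    # probability a planet in the bin is detected

# simulate which stars host a planet in the bin, and which get detected
hosts = rng.random(n_stars) < true_rate
detected = hosts & (rng.random(n_stars) < completeness)

# counting in bins: detections over (stars times efficiency)
rate_hat = detected.sum() / (n_stars * completeness)
```

The estimator is unbiased when the completeness is right, but it hides exactly the hard parts — completeness uncertainty, reliability, and measurement error on planet properties — that the forward-modeling approaches confront head-on.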

One thing that pleased me immensely is that Foreman-Mackey's paper on the abundance of Earth analogs got a lot of play in the meeting as an exemplar of good methodology, and also an exemplar of how uncertain we are about the planet occurrence rate! The Kepler team—and increasingly the whole astronomical community—is coming around to the view that forward modeling methods (as in hierarchical probabilistic modeling or approximate Bayesian computation) are preferable to counting dots in bins.


DSE Summit, day 3

On the last day of the Summit, we spent the full meeting talking about the collaboration and deliverables for the funding agencies. That does not qualify as research. Late in the day I had a revelation about the relationship between ethnography and science. They are related, but not really the same. Some of the conclusions of ethnography have a factual or hypothesis-generating character, but ethnographic results do not really live in the same domain as scientific results. That is no knock on ethnography! Ethnographers can ask questions that we don't even know how to start to ask quantitatively.


DSE Summit, day 2

On the second day of the Moore–Sloan Data Science Summit, we did some awesome community-building exercises involving team problem-solving. We then discussed how the exercise relates to our ideas about collaboration and creativity. That was pretty fun!

At lunch I had a great conversation with Philip Stark (Berkeley) about finding signals in time series below the Nyquist (sampling) limit; in principle it is possible if you have a good idea what you are looking for or what's hidden there. We also talked about geometric descriptions of statistics: The world is infinite dimensional (there is a set of fields at every position in phase space) but observations are finite (noisy measurements of certain kinds of projections). This has lots of implications for the impact of priors (such as non-negativity), when they apply to the infinite-dimensional object (the latent variables, rather than the finite observations).

After lunch, it was probabilistic generalizations of periodograms with Jake Vanderplas (UW) and some frisbee, and then a discussion about the open spaces for Data Science that we are building at Berkeley, UW, and NYU. In all three, there are issues of setting the rules and culture of the space. I think the three institutions can make progress together that no one institution could make on its own.


DSE Summit, day 1

Today was the first day of the community-building meeting of the Moore-Sloan Data Science Environments, held at Asilomar (near Monterey, CA). The project is a collaboration between Berkeley, UW Seattle, and NYU; the meeting has about 100 attendees from across the three institutions. The day started with an unconference in the morning, in which I attended a discussion session on text and text analysis. After that, we got into small inter-institutional groups and worked out our commonalities (and then presented them as lightning talks), as a way to get to know one another and also introduce ourselves to the community. Much of the community building happened on the beach!


the Sun is normal; how did Jupiter form?

At group meeting, Wang reviewed Basri, Walkowicz, & Reiners (2013) on the variability of the Sun relative to the Kepler stars. It shows that (despite rumors to the contrary) the Sun's variability is very typical for a G-type dwarf star. It is a very nice piece of work; it just shows summary statistics, but they are nicely robust and insensitive to satellite systematics.

Also in group meeting, Vakili showed first results from a dictionary-learning approach to the point-spread function in LSST simulated imaging. He is using stochastic gradient descent, which I learned (in the meeting) is useful for starting off an optimization, even in cases where the full likelihood (or objective) function can be computed just fine.

After lunch, Roman Rafikov (Princeton) gave a nice talk about the formation of giant planets. He argued that distant planets (like those in HR 8799) might have a different formation mechanism than close-in planets (like Jupiter and hot Jupiters). One very interesting thing about planets—unlike stars—is that the structure is not just set by the composition; it is also set by the formation history.


what is the interstellar-medium rest frame?

I spoke with Jeffrey Mei (NYUAD) early in the morning (my time) to discuss his continuum-normalized SDSS spectra of standard stars. We are trying to look for absorption in the spectra that is associated with the interstellar medium by regressing the spectra against the Galactic reddening. This is a great project, but has many complicated issues. Not the least of these is that it is easy to shift the spectra to the stellar rest frame, or even the Solar System barycentric rest frame, but it is hard to shift them to the mean (line-of-sight) interstellar-medium rest frame. I have some ideas, or we could look for blurry features, blurred by the interstellar velocity differences. Maybe Na D will save us?


Jupiter analogs, and exoplanet music

As with every Wednesday, the highlight was group meeting, held in the Center for Data Science studio space. We discussed Hattori's search for Jupiter analogs in the Kepler data: The plan is to search with a top-hat function, and then, for the good candidates, do a hypothesis test of top-hat vs saw-tooth vs realistic transit shape. Then do parameter estimation on the ones that prefer the latter. This is a nice structure and highly achievable.
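The top-hat search step can be sketched as a sliding-window depth estimate (one injected box transit in white noise; all numbers are placeholders, and the saw-tooth and realistic-shape comparisons would follow the same pattern):

```python
import numpy as np

rng = np.random.default_rng(11)

# a toy lightcurve with one injected box "Jupiter-analog" transit
n = 2000
flux = 1.0 + 1e-3 * rng.standard_normal(n)
t0, dur, depth = 1200, 25, 4e-3
flux[t0:t0 + dur] -= depth

# top-hat search: the best-fit box depth at each epoch is the mean
# in-transit deficit relative to the baseline; scan all epochs
kernel = np.ones(dur) / dur
window_mean = np.convolve(flux, kernel, mode='valid')
baseline = np.median(flux)
depth_hat = baseline - window_mean
best = int(np.argmax(depth_hat))
```

The hypothesis-test stage would then re-fit the best candidates with each template shape and compare the likelihoods.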

After that, we discussed sonification of the Kepler data with Brian McFee (NYU) and also his tempogram method for looking at beat tracks in music (yes, music). We have some ideas about how these things might be related! At the end of group meeting, we worked on Foreman-Mackey's and Wang's AAS abstracts, both about calibrating out stochastic variability in Kepler light-curves to improve exoplanet search.