On the second day of the Moore Foundation meeting, I gave my talk (about flexible models for exoplanet populations, exoplanet transits, and exoplanet-discovering hardware calibration). After my talk, I had a great conversation with Emmanuel Candès (Stanford), who asked me very detailed questions about my prior beliefs. I realized in the conversation that I have been violating all my own rules: I have been setting my prior beliefs about hyper-parameters in the space of the hyper-parameters and not in the space of the data. That is, you can only assess the influence and consistency of the prior pdf (consistency with your actual beliefs) by flowing the prior through the probabilistic model and generating data from it. I bet if I did that for some of the problems I was showing, I would find that my priors are absurd. This is a great rule, which I often say to others but don't do myself: Always sample data from your prior (not just parameters). This is a rule for Bayes but also a rule for those of us who eschew realism! More generally, Candès expressed the view that priors should derive from data (prior data), a view with which I agree deeply. Unfortunately, when it comes to exoplanet populations, there really aren't any prior data to speak of.
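To make the "sample data from your prior" rule concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration, not one of the models from the talk: a toy population in which planet radii follow a power law above 0.5 Earth radii, a flat prior on the power-law index `alpha` that looks harmless in hyper-parameter space, and an arbitrary cutoff of 30 Earth radii for calling a fake data set absurd.

```python
import random

def sample_prior_predictive(n_draws, n_planets, rng):
    """Draw hyper-parameters from the prior, then draw a fake data set for each."""
    fake_data_sets = []
    for _ in range(n_draws):
        # Hypothetical prior: flat in the power-law index (an assumption).
        alpha = rng.uniform(0.1, 5.0)
        # Flow the prior through the toy model: Pareto-distributed radii,
        # in Earth radii, with an assumed minimum radius of 0.5.
        radii = [0.5 * rng.paretovariate(alpha) for _ in range(n_planets)]
        fake_data_sets.append((alpha, radii))
    return fake_data_sets

def fraction_absurd(fake_data_sets, max_radius=30.0):
    """Fraction of fake data sets that contain a planet larger than max_radius."""
    n_bad = sum(1 for _, radii in fake_data_sets if max(radii) > max_radius)
    return n_bad / len(fake_data_sets)

rng = random.Random(42)
draws = sample_prior_predictive(n_draws=2000, n_planets=100, rng=rng)
print(f"fraction of prior draws with absurd data: {fraction_absurd(draws):.2f}")
```

A flat prior on `alpha` looks innocuous as a prior on a hyper-parameter, but in this toy a substantial fraction of the fake catalogs contain planets far larger than anything plausible. That only becomes visible in the space of the data.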
There were many excellent talks again today; again this is an incomplete set of highlights for me. Titus Brown (MSU) explained his work on developing infrastructure for biology and bioinformatics. He made a number of comments about getting customer (or user) stories right and developing with the current customer in mind. These resonated with my own experiences of software development. He also said that his teaching and workshops and outreach are self-interested: They feed back deep and valuable information about the customer. Jeffrey Heer (UW) said similar things about his development of DataWrangler, d3.js, and other data visualization tools. (d3.js is GitHub's fourth most popular repository!) He showed some beautiful visualizations. Heer's demo of DataWrangler simply blew away the crowd, and there were questions about it for the rest of the day.
Carl Kingsford (CMU) caused me (and others) to gasp when he said that the Sequence Read Archive of biological sequences cannot be searched by sequence. It turns out that searching for strings in enormous corpora of strings is actually a very hard problem (who knew?). He is using a new structure called a Bloom Filter Tree, in which k-mers (length-k subsections) are stored in the nodes and the leaves contain the data sets that contain those k-mers. It is very clever and filled with all the lovely engineering issues that the Astrometry.net data structures were filled with lo so many years ago. Kingsford focuses on writing careful code, so the combination of clever data structures and well-written code gets him orders-of-magnitude speed-ups over the competition.
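The idea of a tree of Bloom filters can be sketched in a few lines of Python. This is a toy under stated assumptions, not Kingsford's implementation. The class and function names (`BloomFilter`, `Node`, `build_tree`, `search`), the filter size, the pairing strategy, and the 90 percent match threshold are all hypothetical choices. Each leaf holds a Bloom filter of one data set's k-mers, each internal node holds the union of its children's filters, and a query descends only into subtrees whose filter reports enough of the query's k-mers.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter; a Python int serves as the bit array."""
    def __init__(self, n_bits=4096, n_hashes=3):
        self.n_bits, self.n_hashes, self.bits = n_bits, n_hashes, 0

    def _positions(self, item):
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def union(self, other):
        merged = BloomFilter(self.n_bits, self.n_hashes)
        merged.bits = self.bits | other.bits
        return merged

def kmers(sequence, k):
    """Set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

class Node:
    def __init__(self, bloom, name=None, children=()):
        self.bloom, self.name, self.children = bloom, name, children

def build_tree(datasets, k):
    """Build leaves (one per data set), then pair nodes up until one root remains."""
    nodes = []
    for name, seq in datasets.items():
        bloom = BloomFilter()
        for kmer in kmers(seq, k):
            bloom.add(kmer)
        nodes.append(Node(bloom, name=name))
    while len(nodes) > 1:
        paired = []
        for i in range(0, len(nodes), 2):
            group = nodes[i:i + 2]
            if len(group) == 1:
                paired.append(group[0])
            else:
                union = group[0].bloom.union(group[1].bloom)
                paired.append(Node(union, children=tuple(group)))
        nodes = paired
    return nodes[0]

def search(node, query_kmers, threshold=0.9):
    """Return names of data sets whose filters match enough query k-mers."""
    hits = sum(1 for kmer in query_kmers if kmer in node.bloom)
    if hits < threshold * len(query_kmers):
        return []  # prune: nothing below this node can match
    if not node.children:
        return [node.name]
    found = []
    for child in node.children:
        found.extend(search(child, query_kmers, threshold))
    return found

datasets = {
    "sample_a": "ACGTACGTGGCATTACG",
    "sample_b": "TTTTGGGGCCCCAAAATT",
    "sample_c": "GGCATTACGACGTTTGA",
}
root = build_tree(datasets, k=5)
result = search(root, kmers("GGCATTACG", 5))
print(result)
```

Bloom filters have no false negatives, so a subtree that fails the threshold can be skipped safely. The price is occasional false positives, which is one of the engineering trade-offs (filter size, number of hashes, tree shape) that a real system has to tune.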
Causal inference was an explicit or implicit component of many of the talks today. For example, Matthew Stephens (Chicago) is using natural genetic variations as a "randomized experiment" to infer gene expression and function. Laurel Larsen (Berkeley) is looking for precursor events and predictors for abrupt ecological changes; since her work is being used to trigger interventions, she requires a causal model.
Blair Sullivan (NC State) spoke about performing inferences with provable properties on graphs. She noted that most interesting problems are NP-hard on arbitrary graphs, but become easier on graphs that can be embedded (without crossing the edges) in the plane or on a low-genus surface. This was surprising to me, but apparently the explanation is simple: Planar graphs are much more likely to have small sets of vertices that split the graph into disconnected sub-graphs. Another surprising thing to me is that "motif counting" (which I think is searching for identical subgraphs within a graph) is very hard; it can only be done exactly and in general for very small subgraphs (six-ish nodes).
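A rough sense of why exact motif counting blows up comes from a brute-force sketch. The Python below is an illustration under assumptions, not a method from the talk: it counts only the simplest family of motifs (fully connected k-node subsets, so triangles for k = 3), the function name `count_cliques` and the example graph are made up, and general motifs would need a subgraph-isomorphism check on top of this.

```python
from itertools import combinations
from math import comb

def count_cliques(edges, k):
    """Brute-force count of k-node subsets in which every pair is connected."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    count = 0
    # Every k-node subset is examined, which is the source of the cost.
    for subset in combinations(sorted(neighbors), k):
        if all(b in neighbors[a] for a, b in combinations(subset, 2)):
            count += 1
    return count

# Small hypothetical graph containing two triangles: {0, 1, 2} and {2, 3, 4}.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4), (4, 5)]
print(count_cliques(edges, 3))

# Number of 6-node subsets a brute-force search would visit in a 1000-node graph.
print(comb(1000, 6))
```

The number of candidate subsets grows roughly like n^k / k!, which is why the exhaustive approach stalls at around six nodes on graphs of any realistic size.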
The day ended with Laura Waller (Berkeley) talking about innovative imaging systems for microscopy, including light-field cameras, and then a general set of cameras that do non-degenerate illumination sequences and infer many properties beyond single-plane intensity measurements. She showed some very impressive demonstrations of light-field inferences with her systems, which are sophisticated, but built with inexpensive hardware. Her work has a lot of conceptual overlap with astronomy, in the areas of adaptive optics and imaging with non-degenerate masks.