I spent most of the day at Columbia's Data Science Institute, participating in a workshop on data science in the natural sciences. I learned a huge amount! There were way too many cool things to mention them all here, but here are some personal highlights:
Andrew Gelman (Columbia) talked about the trade-off between spatial resolution and what he called “statistical resolution”; he compared this trade-off to the one in political science between conceptual resolution (the number of questions you are trying to ask) and statistical resolution (the confidence with which you can answer those questions). He also talked about distribution (or expectation) propagation algorithms that permit you to analyze your data in parts and combine the inferences, without too much extra work.
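To make the "analyze in parts and combine" idea concrete, here is a minimal sketch of the simplest case I can think of, not of the algorithms Gelman described. It assumes Gaussian data with known noise `sigma` and a flat prior on the mean, so each part's posterior is exactly Gaussian and the combination is just a product of Gaussians. The function name `combine_gaussians` and all the numbers are made up for illustration; real expectation propagation handles non-Gaussian pieces by approximating them iteratively.

```python
import numpy as np

def combine_gaussians(means, variances):
    """Multiply independent Gaussian posteriors (flat prior assumed).

    Precisions add, and the combined mean is the precision-weighted mean.
    With a non-flat prior you would also have to avoid counting the
    prior once per part; that step is omitted in this toy version.
    """
    means = np.asarray(means)
    precisions = 1.0 / np.asarray(variances)
    var = 1.0 / precisions.sum()
    mean = var * (precisions * means).sum()
    return mean, var

# Toy data: made-up true mean 1.5, known noise sigma (both assumptions).
rng = np.random.default_rng(42)
sigma = 2.0
data = rng.normal(loc=1.5, scale=sigma, size=1200)

# Analyze the data in four parts, one Gaussian posterior per part.
parts = np.array_split(data, 4)
part_means = [p.mean() for p in parts]
part_vars = [sigma**2 / len(p) for p in parts]

# Combine the per-part inferences and compare to the all-at-once answer.
combined_mean, combined_var = combine_gaussians(part_means, part_vars)
full_mean, full_var = data.mean(), sigma**2 / len(data)
print(combined_mean, full_mean)
print(combined_var, full_var)
```

In this exactly-Gaussian case the combined answer matches the full-data answer; the interesting (and harder) regime is when the per-part posteriors are only approximately Gaussian.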
Joaquim Goes (Columbia) talked about ocean observing. He pointed out that although the total biomass in the oceans is far smaller than that on land, it cycles faster, so it is extremely important to the total carbon budget (and the natural sequestration of anthropogenic carbon). He talked about the Argo network of ocean float data (I think it is all public!) and using it to model the ocean.
John Wright (Columbia) pointed out that bilinear problems (like those that come up in blind deconvolution and matrix factorization and dictionary methods) are non-convex in principle, but we usually find good solutions in practice. What gives? He has results that in the limit of large data sets, all solutions become transformations of one another; that is, all solutions are good solutions. I am not sure exactly what conditions are required, but it is clearly very relevant theory for some of our projects.
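A toy version of "all solutions are transformations of one another" can be seen in noiseless low-rank matrix factorization. The sketch below is my own illustration, not Wright's setting or his theorem: it assumes an exactly rank-3 data matrix, uses plain alternating least squares (the helper `als` is hypothetical), and starts from two different random initializations. The two runs return different factors that reproduce the data equally well and differ only by an invertible matrix `T`.

```python
import numpy as np

def als(Y, rank, rng, n_iter=20):
    """Alternating least squares for Y ~ A @ B from a random start."""
    A = rng.normal(size=(Y.shape[0], rank))
    for _ in range(n_iter):
        # Fix A, solve for B; then fix B, solve for A.
        B = np.linalg.lstsq(A, Y, rcond=None)[0]
        A = np.linalg.lstsq(B.T, Y.T, rcond=None)[0].T
    return A, B

rng = np.random.default_rng(0)
rank = 3

# Made-up, exactly rank-3 data with no noise (an assumption of this toy).
Y = rng.normal(size=(30, rank)) @ rng.normal(size=(rank, 40))

# Two runs of the same non-convex problem from different initializations.
A1, B1 = als(Y, rank, rng)
A2, B2 = als(Y, rank, rng)

# The factors differ, but both products reproduce the data.
print(np.abs(A1 - A2).max())
print(np.abs(A1 @ B1 - Y).max(), np.abs(A2 @ B2 - Y).max())

# Find the matrix T that maps one solution onto the other: A1 = A2 @ T.
T = np.linalg.lstsq(A2, A1, rcond=None)[0]
print(np.abs(A2 @ T - A1).max())
print(np.abs(np.linalg.solve(T, B2) - B1).max())
```

The bilinear objective is non-convex in (A, B) jointly, but here every solution the optimizer finds is the same solution up to `T`. Whether and when that survives noise, sparsity constraints, or finite data is exactly the part I don't know the conditions for.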
There was a student panel moderated by Carly Strasser (Moore Foundation). The students brought up many important issues in data science, one of which is that there are translation issues when you work across disciplinary boundaries. That's something we have been discussing at NYU recently.