Today began with Bloom (UCB) talking about supervised methods. He put in a big plug for Random Forest, of course! Two things I liked about his talk: One, he emphasized the difficulties and time spent munging the data, identifying features, imputing, and so on. Two, he put in some philosophy at the end, to the effect that you don't have to understand every detail of these methods; much better to partner with someone who does. Then the intellectual load on data scientists is not out of control. For Bloom, data science is inherently collaborative. I don't perfectly agree, but I would agree that it works very, very well when it is collaborative. Back on the data munging point: Essentially all of the basic supervised methods presume that your "features" (inputs) are noise-free and non-missing. That's bad for us in general.
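(A minimal sketch of that munging point, not from Bloom's talk: in scikit-learn, say, missing feature values usually have to be imputed before any supervised fit, and the fitted model then treats every input value as exact. The data and parameter choices below are made up for illustration.)

```python
# Toy example: impute missing feature values before a Random Forest fit.
# This is an illustration only; supervised methods generally treat each
# feature value as exact, and missing entries usually must be filled first.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Fake data: 200 objects, 5 features, roughly 10 percent of entries missing.
X = rng.normal(size=(200, 5))
y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=200)
X[rng.random(X.shape) < 0.1] = np.nan

# Fill missing entries with per-feature medians, then fit as usual.
X_filled = SimpleImputer(strategy="median").fit_transform(X)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_filled, y)
print("training R^2:", model.score(X_filled, y))
```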
Based on Bloom's talk I came up with many hacks related to Random Forest. A few examples are the following: Visualize the content and internals of a Random Forest, and explore its use as a hacky generative model of the data. Figure out a clever experimental sequence to determine conditional feature importances, which is a combinatorial problem in general. Write a basic open-source RF code and get some community to fastify it. Figure out if there is any way to extend RF to handle noisy input features.
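(Here is a rough sketch of what the first two of those hacks might look like with scikit-learn. This is my own toy illustration, not anyone's actual code, and permutation importance is only a crude stand-in for the conditional importances I have in mind.)

```python
# Peek at the internals of a fitted Random Forest, then compare the built-in
# impurity-based importances with permutation importances on toy data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

X, y = make_regression(n_samples=500, n_features=6, n_informative=3,
                       noise=1.0, random_state=0)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# "Visualize the content and internals": depth and leaf counts per tree.
depths = [t.get_depth() for t in forest.estimators_]
leaves = [t.get_n_leaves() for t in forest.estimators_]
print("tree depths: min %d, max %d" % (min(depths), max(depths)))
print("leaves per tree: mean %.1f" % np.mean(leaves))

# Impurity-based importances vs. permutation importances (crude proxy for
# asking how much each feature matters given the others).
perm = permutation_importance(forest, X, y, n_repeats=10, random_state=0)
for j in range(X.shape[1]):
    print("feature %d: impurity %.3f, permutation %.3f"
          % (j, forest.feature_importances_[j], perm.importances_mean[j]))
```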
I didn't do any of these hacks! We insisted on pair-coding today among the hackers, and this was a good idea; I partnered with Foreman-Mackey and he showed me what he has done for K2 photometry. It is giving some strange results: it is using the enormous flat-field freedom we are giving it to fix all its ills. We thought about ways to test what is going wrong and how to give other parts of the model relevant freedom.
This might be interesting for the discussion: we have used RF and other ML techniques with "noise" perturbation to compute photo-zs. Here is a link to the papers and code:
http://lcdm.astro.illinois.edu/static/code/mlz/MLZ-1.1/doc/html/index.html
Matias
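(For concreteness, here is a toy sketch of the noise-perturbation idea Matias describes. It is not the MLZ code at the link above, just a made-up illustration of perturbing inputs within their quoted errors at prediction time so that the spread of outputs reflects the input noise.)

```python
# Toy noise-perturbation illustration: perturb an object's noisy "colours"
# within their errors many times and look at the spread of RF predictions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Fake photometry: 4 noisy colours and a redshift-like target.
n = 1000
true_colours = rng.uniform(0.0, 2.0, size=(n, 4))
z = 0.8 * true_colours[:, 0] + 0.3 * true_colours[:, 1] ** 2
colour_err = 0.05 * np.ones_like(true_colours)
observed = true_colours + colour_err * rng.normal(size=true_colours.shape)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(observed, z)

# Resample one object's colours within its errors; the histogram of the
# predictions is a crude photo-z "PDF".
x0, e0 = observed[0], colour_err[0]
draws = x0 + e0 * rng.normal(size=(500, x0.size))
pdz = forest.predict(draws)
print("z estimate: %.3f +/- %.3f (truth %.3f)" % (pdz.mean(), pdz.std(), z[0]))
```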