There is a paradox about deep learning, which everyone finds either incredibly unconvincing or totally paradoxical. I'm not sure which! But it is this: It is simultaneously the case that deep learning is so flexible that it can fit any data, including randomly generated data, and the case that when it is trained on real data, it generalizes well to new examples. I spent some time today discussing this with Soledad Villar (NYU), because I would like us to understand this a bit better in the context of possible astronomical applications of deep learning.
In many applications, people don't need to know why a method works; they just need to know that it does. But in our scientific applications, where we want to use the deep-learning model to de-noise or average over data, we actually need to understand in what contexts it is capturing the structure in the data and not just over-fitting the noise. Villar and I discussed how we might test these things, and what kinds of experiments might be illuminating. As my loyal reader might expect, I am interested in taking an information-theoretic attitude to the problem.
One relevant thing that Villar mentioned is that some research suggests that models train faster when the data have simpler structure. That's interesting, because it might be that somehow the deep models still have some internal sense of parsimony that is saving them; that could resolve the paradox. Or not!
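A toy version of the paradox can be sketched without any deep learning at all. The block below is only an illustration under loud assumptions: it swaps the deep network for an over-parameterized linear model (more features than training points), and it swaps training for the minimum-norm least-squares solution, which here plays the role of a built-in preference for parsimony. The function names, the sizes, and the Gaussian data are all hypothetical choices made for this sketch; none of this is an experiment from the discussion above.

```python
import numpy as np


def min_norm_fit(X, y):
    """Least-squares fit; for an underdetermined system (more columns than
    rows), np.linalg.lstsq returns the minimum-norm interpolating solution."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w


def run_experiment(n_train=300, n_test=2000, n_features=400, seed=0):
    """Fit the same flexible model to structured targets and to pure noise.

    All sizes are arbitrary illustrative choices (assumptions of this sketch).
    """
    rng = np.random.default_rng(seed)
    X_train = rng.standard_normal((n_train, n_features))
    X_test = rng.standard_normal((n_test, n_features))

    # "Real" data stand-in: targets are a fixed linear function of the inputs.
    w_true = rng.standard_normal(n_features)
    w_true /= np.linalg.norm(w_true)

    targets = {
        "structured": (X_train @ w_true, X_test @ w_true),
        # "Random" data: targets carry no information about the inputs.
        "random": (rng.standard_normal(n_train), rng.standard_normal(n_test)),
    }

    results = {}
    for name, (y_train, y_test) in targets.items():
        w = min_norm_fit(X_train, y_train)
        results[name] = {
            "train_mse": float(np.mean((X_train @ w - y_train) ** 2)),
            "test_mse": float(np.mean((X_test @ w - y_test) ** 2)),
        }
    return results


if __name__ == "__main__":
    for name, r in run_experiment().items():
        print(f"{name:>10s}: train MSE = {r['train_mse']:.2e}, "
              f"test MSE = {r['test_mse']:.2f}")
```

In this setup the model should drive the training error to essentially zero for both kinds of target, which is the "it can fit anything" half of the paradox. On held-out data the structured fit should do clearly better than predicting zero (the targets have unit variance), while the fit to noise should do worse than predicting zero. Whether anything like this carries over to real deep networks on real astronomical data is exactly the open question.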