If you have a training set of data X and (real-valued) labels Y, how do you learn a model that can predict a new Y given a new X? Soledad Villar (NYU) and I have been working on this for quite some time, just in the context of linear models and linear operators (that is, no deep learning or anything), and we have come up with so many options. You can learn a traditional discriminative model: do least squares on Y given X. That's standard practice, and it is guaranteed to be optimal in certain (very limited) senses. You can learn a minimal generative model, in which you generate X given Y and then impute a new Y by inference. That's how The Cannon works. You can build a latent-variable model that generates both X and Y from a set of latent variables and then (once again) impute a new Y by inference. And this latter option has many different forms, we realized, even if you stick to purely linear predictors of X given Y! Or you can build a Gaussian process, in which you find the parameters of a Gaussian that generates both X and Y jointly and then use the Gaussian-process math. (It turns out that this delivers something identical to the discriminative model for some choices.) Or you can do a low-rank approximation to that. OMG so many methods, every one of which is perfectly linear. That is, every one of these options can be used to make a linear operator that takes in a new X and, by a single matrix multiply, produces an estimate of the new Y.
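To make the single-matrix-multiply point concrete, here is a minimal numpy sketch (with made-up synthetic data and my own variable names, not the code from our project) of two of these options: the discriminative least squares of Y given X, and a simplified Cannon-like generative fit of X given Y that gets inverted at prediction time.

```python
import numpy as np

rng = np.random.default_rng(17)

# Hypothetical synthetic data: n examples, p features in X, q labels in Y.
n, p, q = 256, 30, 2
Y = rng.normal(size=(n, q))
A_true = rng.normal(size=(q, p))
X = Y @ A_true + 0.1 * rng.normal(size=(n, p))

# Option 1 (discriminative): least squares of Y given X.
# W_disc has shape (p, q); prediction is a single matrix multiply.
W_disc, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Option 2 (minimal generative, Cannon-like, simplified): least squares of
# X given Y, then impute a new Y by pseudo-inverting that forward operator.
A_hat, *_ = np.linalg.lstsq(Y, X, rcond=None)   # (q, p) forward model
W_gen = np.linalg.pinv(A_hat)                   # (p, q) imputation operator

# Every option ends up as a matrix acting on a new X.
X_new = rng.normal(size=(q,)) @ A_true + 0.1 * rng.normal(size=(p,))
print("discriminative Y estimate:", X_new @ W_disc)
print("generative Y estimate:    ", X_new @ W_gen)
```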
In the last few weeks, we have found that, in many settings, the standard discriminative method—the method that is protected by proofs about optimality—is the worst or near-worst of these options, as assessed by mean-squared error. The proofs are all about variance for unbiased estimators. But in the Real World (tm), where you don't know the truth, you can't know what's unbiased. (Empirically assessing bias is almost impossible when X is high-dimensional, because the relevant bias is conditional on location in the X space.) In the real world, you can only know what performs well on held-out data, and you can only assess that with brute measures like mean-squared error. So we find that the discriminative model—the workhorse of machine learning—should be thought of as a tool of last resort, at least in this linear world.
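And here, continuing the made-up setup above, is a sketch of the only kind of assessment that paragraph leaves us with: fit each linear operator on a training split and score it by mean-squared error on held-out data. (Which option wins depends on dimensionality, noise, and sample size; this toy example is not meant to reproduce the regimes in which the discriminative model loses.)

```python
import numpy as np

rng = np.random.default_rng(4)

# Same hypothetical setup; the only thing we can measure in practice is
# performance on held-out data, here by mean-squared error.
n, p, q, n_train = 256, 30, 2, 192
Y_all = rng.normal(size=(n, q))
A_true = rng.normal(size=(q, p))
X_all = Y_all @ A_true + 0.1 * rng.normal(size=(n, p))

X_tr, X_te = X_all[:n_train], X_all[n_train:]
Y_tr, Y_te = Y_all[:n_train], Y_all[n_train:]

# Discriminative: least squares of Y given X.
W_disc, *_ = np.linalg.lstsq(X_tr, Y_tr, rcond=None)

# Generative: least squares of X given Y, pseudo-inverted at test time.
A_hat, *_ = np.linalg.lstsq(Y_tr, X_tr, rcond=None)
W_gen = np.linalg.pinv(A_hat)

for name, W in [("discriminative", W_disc), ("generative", W_gen)]:
    mse = np.mean((X_te @ W - Y_te) ** 2)
    print(f"{name:>14s}  held-out MSE: {mse:.4f}")
```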
Will these results generalize to deep-learning contexts? I bet they will.