In a great conversation with Soledad Villar (NYU) today, we realized that we have (more than) 10 methods for linear regression! Hahaha. Meaning: more than 10 differently conceived methods for finding a linear relationship between features X and labels Y using some kind of optimization or inference. Some involve regularization, some involve dimensionality reduction, some involve latent variables, and so on. None of them align with my usual practice, because we have put ourselves in a sandbox where we don't know the data-generating process: that is, we only have the rectangular X array and the Y array; we don't have any useful metadata.
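To give a flavor of what "differently conceived" means here, a minimal sketch of three such methods applied to the same (X, Y) arrays: minimum-norm least squares, ridge regression, and principal-components regression. The data-generating process, sizes, and the regularization strength `lam` and component count `k` below are all made up for illustration; they are not the settings from our project.

```python
import numpy as np

# Made-up sandbox data: a true linear relation plus noise.
rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
Y = X @ beta_true + 0.1 * rng.normal(size=n)

# 1. Ordinary least squares via the pseudoinverse (minimum-norm solution).
beta_ols = np.linalg.pinv(X) @ Y

# 2. Ridge regression: L2 regularization with illustrative strength lam.
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# 3. Principal-components regression: regress on the top-k singular directions.
k = 5
U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_pcr = Vt[:k].T @ ((U[:, :k].T @ Y) / s[:k])

for name, b in [("OLS", beta_ols), ("ridge", beta_ridge), ("PCR", beta_pcr)]:
    print(name, "training MSE:", np.mean((X @ b - Y) ** 2))
```

All three return a coefficient vector for the same rectangular X and Y, but they encode different assumptions, which is exactly why they can disagree when the data-generating process is unknown.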
We are investigating two things: performance at predicting held-out data, and susceptibility to the double-descent phenomenon. It turns out that both depend remarkably strongly on how we (truly) generate the data. In our sandbox, we are gods.
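A toy version of the held-out-prediction experiment, where double descent can show up: fit minimum-norm least squares using the first p columns of X, sweep p past the number of training samples, and watch the held-out error. Everything here (sizes, noise level, the column-truncation scheme) is an illustrative assumption, not our actual setup; the held-out error typically spikes near p = n_train and can come back down beyond it.

```python
import numpy as np

# Made-up generating process: we are gods here, so we know beta_true.
rng = np.random.default_rng(1)
n_train, n_test, D = 20, 200, 60
X = rng.normal(size=(n_train + n_test, D))
beta_true = rng.normal(size=D) / np.sqrt(D)
Y = X @ beta_true + 0.5 * rng.normal(size=n_train + n_test)
Xtr, Ytr = X[:n_train], Y[:n_train]
Xte, Yte = X[n_train:], Y[n_train:]

ps = np.arange(1, D + 1)
errs = []
for p in ps:
    # Minimum-norm least squares using only the first p features.
    b = np.linalg.pinv(Xtr[:, :p]) @ Ytr
    errs.append(np.mean((Xte[:, :p] @ b - Yte) ** 2))
errs = np.array(errs)

print("held-out MSE peaks at p =", ps[np.argmax(errs)], "with n_train =", n_train)
```

Changing the generating process (noise level, feature correlations, whether the truth is linear at all) changes this curve dramatically, which is the sense in which the phenomenon depends on how the data are truly generated.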
Relevant tweet https://twitter.com/gabrielpeyre/status/1288338463255941120