I read a paper today that is just about completely wrong. It is "Linear regression in astronomy" by Isobe, Feigelson, Akritas, and Babu (1990). The paper presents five methods for fitting straight lines to data and compares them. I think I have three objections:
First, they present procedures, and do not show that any of those procedures optimizes anything that a scientist would care about. That is, they do not show that any procedure gives a best-fit line in any possible sense of the word "best". Now, of course, some of their procedures do produce a best-fit line under some assumptions, but they only give those assumptions for one (or two) of their five methods. In particular, the method they advocate has no best-fit interpretation whatsoever! Scientists do not trade in procedures; they trade in objectives, and choose procedures only when they are demonstrated to optimize their objectives, I hope.
Second, when deciding whether to fit for Y as a function of X or X as a function of Y, they claim that the decision should be based on the physics of X and Y! But the truth is that this decision should be based on the error properties of X and Y. If X has much smaller errors, then you must fit Y as a function of X; if Y has much smaller errors, then you must fit X as a function of Y; and if neither has much smaller errors, then that kind of linear fitting is invalid. This paper propagates a very dangerous misconception; it is remarkable that professional statisticians would say this. It is not a matter of statistical opinion; what is written in this paper is straight-up wrong.
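To make that second point concrete, here is a minimal sketch with toy numbers. Everything in it is an illustrative assumption rather than anything taken from the paper: the true slope of 2, the intercept of 1, the noise level, and the helper name `ols_slope`. X is generated with no errors at all and Y with substantial Gaussian errors, so fitting Y as a function of X is the appropriate choice; the sketch compares it with fitting X as a function of Y and inverting the slope.

```python
import numpy as np


def ols_slope(u, v):
    """Slope of the ordinary least-squares line that predicts v from u."""
    u_centered = u - u.mean()
    return float(np.dot(u_centered, v - v.mean()) / np.dot(u_centered, u_centered))


# Assumed toy generative model: x is known exactly, y carries all the noise.
rng = np.random.default_rng(42)
true_slope, true_intercept = 2.0, 1.0
x = rng.uniform(0.0, 10.0, size=5000)
y = true_slope * x + true_intercept + rng.normal(0.0, 3.0, size=x.size)

# Fit y as a function of x (matches the error properties of this data set).
slope_y_on_x = ols_slope(x, y)

# Fit x as a function of y, then invert so both numbers estimate dy/dx.
slope_x_on_y_inverted = 1.0 / ols_slope(y, x)

print(f"true slope:              {true_slope:.3f}")
print(f"y on x:                  {slope_y_on_x:.3f}")
print(f"x on y, slope inverted:  {slope_x_on_y_inverted:.3f}")
```

Under these assumptions the Y-on-X slope should land close to the true value, while the inverted X-on-Y slope should come out noticeably too steep, because that fit treats the noisy variable as if it were the well-measured one. Nothing about the physics of X and Y entered the simulation; only the error properties did.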
Third, they decide which of their methods performs best by applying all five methods to sets of simulated data. These data are simulated with certain assumptions, so all they have shown is that when you have data generated a certain way, one method does better at getting at the parameters of that generative model. But then, when you have a data set with a known generative model, you should just optimize the likelihood of that generative model. The simulated data tell you nothing in the situation where you don't know the generative model for your data, which is either always or never the case (not sure which). That is, if you know the generative model, then just use it directly to construct a likelihood (don't use the methods of this paper). If you don't, then you can't rely on the conclusions of this paper (and its ilk). Either way, this paper is useless.
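As a sketch of what "just optimize the likelihood" can look like, suppose (purely as an assumption for illustration) that the generative model is a straight line with negligible errors in X and independent Gaussian errors of known standard deviation on each Y. Maximizing that likelihood is the same as minimizing chi-squared, and it has a closed-form weighted least-squares solution. The function name `fit_line_max_likelihood` and the toy numbers are hypothetical.

```python
import numpy as np


def fit_line_max_likelihood(x, y, sigma_y):
    """Maximum-likelihood slope and intercept for y = m * x + b.

    Assumes negligible errors in x and independent Gaussian errors in y
    with known per-point standard deviations sigma_y. Returns the
    parameter vector [m, b] and its 2x2 covariance matrix.
    """
    design = np.vstack([x, np.ones_like(x)]).T   # columns: x, 1
    weights = 1.0 / sigma_y**2                   # inverse variances
    lhs = design.T @ (design * weights[:, None]) # A^T C^-1 A
    rhs = design.T @ (weights * y)               # A^T C^-1 y
    params = np.linalg.solve(lhs, rhs)
    covariance = np.linalg.inv(lhs)
    return params, covariance


# Assumed toy data drawn from exactly the model that the likelihood describes.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=200)
sigma_y = rng.uniform(0.5, 2.0, size=x.size)
y = 2.0 * x + 1.0 + sigma_y * rng.normal(size=x.size)

params, covariance = fit_line_max_likelihood(x, y, sigma_y)
slope, intercept = params
slope_err, intercept_err = np.sqrt(np.diag(covariance))
print(f"slope     = {slope:.3f} +/- {slope_err:.3f}")
print(f"intercept = {intercept:.3f} +/- {intercept_err:.3f}")
```

The design choice here is that the objective comes first: the Gaussian likelihood is written down from the assumed generative model, and the procedure (weighted least squares) is used only because it is the optimum of that objective. A different generative model would call for a different likelihood, not for a different recipe from a menu.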
Wow, I am disappointed that this is the state of our art. I hope I didn't sugar-coat that critique too much!