bad inputs on the RHS of a regression are bad

I discussed new results with Christina Eilers (MIT), who is trying to build a simple, quasi-linear causal model of the abundances of stars in our Galaxy. The idea is that some abundance trends are being set or adjusted by problems with the data, and we want to correct for that by a kind of self-calibration. It's all very clever (in my not-so-humble opinion). Today she showed that her results get much better (in terms of interpretability) if she trims out stars that get assigned very wrong dynamical actions by our action-computing code (thank you to Adrian Price-Whelan!). Distant, noisy stars can get bad actions because a noisy draw of the distance can make a star look effectively unbound! And in general, the action-estimation code has to make some wrong assumptions.

It's a teachable moment, though, because when you are doing a discriminative regression (predicting labels using features), you can't (easily) incorporate a non-trivial noise model in your feature space. In this case, it is safer to drop the training examples that have bad or noisy features than to use them. The labels are a different matter: You can (and we do) use noisy labels! This asymmetry is not good of course, but pragmatic data analysis suggests that (for now) we should just drop from the training set the stars with overly noisy features and proceed.
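To make the asymmetry concrete, here is a toy sketch in Python (not Eilers's model, and nothing to do with real stars or actions). It assumes a one-dimensional linear relation with a made-up slope of 2, Gaussian noise, and an invented 30 percent of objects with very noisy features; all names and numbers are illustrative assumptions. It compares an ordinary least-squares fit with noisy labels only, with noisy features included, and with the noisy-feature objects trimmed out.

```python
import numpy as np

def fit_slope(x, y):
    """Ordinary least-squares slope of y on x, fit with an intercept."""
    A = np.vstack([x, np.ones_like(x)]).T
    slope, _intercept = np.linalg.lstsq(A, y, rcond=None)[0]
    return slope

rng = np.random.default_rng(17)
n = 20000
true_slope = 2.0  # assumed toy value

# Noise-free features and labels (hypothetical toy data).
x_true = rng.normal(0.0, 1.0, size=n)
y_true = true_slope * x_true

# Case 1: noisy labels, clean features. The slope stays unbiased.
y_noisy = y_true + rng.normal(0.0, 1.0, size=n)
slope_noisy_labels = fit_slope(x_true, y_noisy)

# Case 2: same noisy labels, but 30 percent of objects get very noisy features.
# The fit has no noise model for x, so the slope is pulled toward zero.
sigma_x = np.where(rng.uniform(size=n) < 0.3, 3.0, 0.1)
x_obs = x_true + sigma_x * rng.normal(size=n)
slope_noisy_features = fit_slope(x_obs, y_noisy)

# Case 3: trim the objects with noisy features, keep the noisy labels.
# Here we cheat and use the known sigma_x; in real data you need some proxy.
keep = sigma_x < 1.0
slope_trimmed = fit_slope(x_obs[keep], y_noisy[keep])

print(f"noisy labels only:    {slope_noisy_labels:.2f}")
print(f"noisy features too:   {slope_noisy_features:.2f}")
print(f"noisy features cut:   {slope_trimmed:.2f}")
```

In this toy the label noise only adds scatter to the fitted slope, while the feature noise biases it low, and trimming the noisy-feature objects brings it back near the assumed true value. The cut here uses the known per-object noise level, which a real analysis would have to replace with something measurable.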
