Hogg's Research: bad inputs on the RHS of a regression are bad

2021-05-17

bad inputs on the RHS of a regression are bad

I discussed new results with Christina Eilers (MIT), who is trying to build a simple, quasi-linear causal model of the abundances of stars in our Galaxy. The idea is that some abundance trends are being set or adjusted by problems with the data, and we want to correct for that by a kind of self-calibration. It's all very clever (in my not-so-humble opinion). Today she showed that her results get much better (in terms of interpretability) if she trims out stars that get assigned very wrong dynamical actions by our action-computing code (thank you to Adrian Price-Whelan!). Distant, noisy stars can get bad actions because a noise draw on the distance can make a star look effectively unbound! And in general, the action-estimation code has to make some wrong assumptions.
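To make the unbound-star point concrete, here is a toy sketch (not the action-computing code mentioned above, and much cruder than it). It assumes distances come from inverting noisy parallax draws, and it uses the heliocentric tangential speed as a stand-in for a real bound-or-unbound test. The function name, the parallax values, and the round-number escape speed `V_ESC` are all assumptions chosen for illustration.

```python
import numpy as np

# Standard conversion: v_tan [km/s] = K * proper motion [mas/yr] * distance [kpc]
K = 4.74047

# Hypothetical round-number escape speed in km/s, for illustration only.
V_ESC = 550.0


def tangential_speed_draws(parallax_mas, parallax_err_mas, pm_mas_yr,
                           n_draws=10000, seed=42):
    """Draw noisy parallaxes, invert to distances, return tangential speeds."""
    rng = np.random.default_rng(seed)
    plx = rng.normal(parallax_mas, parallax_err_mas, size=n_draws)
    plx = plx[plx > 0]        # toy choice: discard unphysical negative draws
    dist_kpc = 1.0 / plx      # naive parallax inversion (an assumption)
    return K * pm_mas_yr * dist_kpc


# Same proper motion and same parallax uncertainty, different true distances.
near = tangential_speed_draws(1.0, 0.05, 5.0)    # about 1 kpc away
far = tangential_speed_draws(0.125, 0.05, 5.0)   # about 8 kpc away

frac_near = np.mean(near > V_ESC)
frac_far = np.mean(far > V_ESC)
print(f"fraction of draws faster than V_ESC: near={frac_near:.3f}, far={frac_far:.3f}")
```

In this toy, the nearby star never looks unbound, while a few percent of the draws for the distant star do, purely because the same parallax noise is a much larger fractional distance error.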

It's a teachable moment, though, because when you are doing a discriminative regression (predicting labels from features), you can't (easily) incorporate a non-trivial noise model in your feature space. In this case, it is safer to drop the stars with bad or noisy features than to use them. The labels are a different matter: You can (and we do) use noisy labels! This asymmetry is not good of course, but pragmatic data analysis suggests that, for now, we should just drop from the training set the stars with overly noisy features and proceed.
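Here is a minimal sketch of that asymmetry, using made-up data and a one-feature linear regression (nothing like the real model). Every label is noisy, and a minority of stars also have very noisy features. The simulated data, the 20 percent bad fraction, the cut at `x_sigma < 0.5`, and the function names are all assumptions for illustration.

```python
import numpy as np


def fit_slope(features, labels):
    """Ordinary least-squares fit of labels on features: (slope, intercept)."""
    design = np.vstack([features, np.ones_like(features)]).T
    coeffs, *_ = np.linalg.lstsq(design, labels, rcond=None)
    return coeffs


def simulate(n=5000, true_slope=2.0, seed=17):
    """Toy training set: noisy labels everywhere, noisy features for some."""
    rng = np.random.default_rng(seed)
    x_true = rng.normal(0.0, 1.0, size=n)
    # Labels are noisy for every star; least squares copes with this.
    y_obs = true_slope * x_true + rng.normal(0.0, 0.5, size=n)
    # About 20 percent of stars get very noisy features; the rest are clean.
    x_sigma = np.where(rng.uniform(size=n) < 0.2, 3.0, 0.05)
    x_obs = x_true + x_sigma * rng.normal(size=n)
    return x_obs, x_sigma, y_obs


x_obs, x_sigma, y_obs = simulate()

# Fit using every star, bad features included.
slope_all, _ = fit_slope(x_obs, y_obs)

# Fit after dropping the stars with overly noisy features.
keep = x_sigma < 0.5
slope_trim, _ = fit_slope(x_obs[keep], y_obs[keep])

print(f"true slope 2.0, all stars: {slope_all:.2f}, trimmed: {slope_trim:.2f}")
```

The fit on all stars is biased toward zero slope (the usual errors-in-variables attenuation), even though most stars are fine. The trimmed fit recovers something close to the true slope, despite the noisy labels and the smaller training set.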


