2023-10-18

biases from machine learning

Today I gave a talk (with these slides) at a meeting in Denver for the NSF initiative Harnessing the Data Revolution. I spoke about the necessity and also the dangers of using machine-learning methods in scientific projects. I brought up two very serious possible biases. The first is that if emulators are used to replace simulations, and they can't be easily checked (because the simulations they replace are too expensive to run), the emulators will lead to a confirmation-bias problem: We will only carefully check the emulations when they lead to results that we don't like! The second bias I raised is that if we perform joint analyses on objects (stars, say) that have been labeled (with ages, say) by a machine-learning regression, there will in general be strong biases in those joint analyses. For example, the average of 1000 age labels produced by a standard ML regression will not be anything like an unbiased estimate of the true average age of those stars. These biases are very strong and bad! That said, I also gave many examples of contexts where using machine-learning methods is not just okay but actually intellectually correct, in areas of instrument calibration, foregrounds, and other confounders.
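The label-averaging bias is easy to see in a toy simulation. The sketch below is my own illustration, not from the talk, and it assumes a made-up setup: true stellar ages, one noisy observable, and an ordinary least-squares regression standing in for the ML method. Because any regression trained to minimize prediction error shrinks its outputs toward the training-set mean, the average label for a genuinely old subpopulation comes out biased low.

```python
import numpy as np

rng = np.random.default_rng(17)

# Hypothetical training set: a broad population of stars with known ages
# (Gyr) and one noisy observable that traces age.
n_train = 10_000
age_train = rng.uniform(0.0, 14.0, n_train)
obs_train = age_train + rng.normal(0.0, 3.0, n_train)

# "Standard ML regression": least-squares fit predicting age from the
# observable. The fit shrinks predictions toward the training mean.
A = np.vstack([obs_train, np.ones(n_train)]).T
slope, intercept = np.linalg.lstsq(A, age_train, rcond=None)[0]

# Test population: 1000 genuinely old stars (ages 10-14 Gyr).
age_test = rng.uniform(10.0, 14.0, 1000)
obs_test = age_test + rng.normal(0.0, 3.0, 1000)
age_label = slope * obs_test + intercept

# The mean of the ML labels is systematically below the true mean age,
# pulled toward the training-set mean of ~7 Gyr.
print(f"true mean age:     {age_test.mean():.2f} Gyr")
print(f"mean of ML labels: {age_label.mean():.2f} Gyr")
```

Each individual label here is a perfectly reasonable point estimate, but the shrinkage is in the same direction for every star in the old subsample, so averaging 1000 labels does not average the bias away.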

The question period was great! We had 25 minutes of questions and answers, which ranged across a very wide set of topics, including statistics, experimental design, and epistemology.
