2020-04-22

How many neighbors?

By something of a coincidence, both Ana Bonaca (Harvard) and Adam Wheeler (Columbia) are performing data analyses that require finding the K nearest neighbors of each data point in a large data set and performing simple operations on them. I love these kinds of models, because if your fancy machine-learning method isn't far better than K-nearest-neighbors, why would you use it? K-nearest-neighbors is fast, simple to explain, and entirely reproducible. Most machine-learning methods are slow, almost impossible to explain to the uninitiated, and irreproducible (sensitive to seeds and initialization).
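The whole pipeline can be just a few lines. Here's a minimal sketch (not Bonaca's or Wheeler's actual code; all the names and numbers in it are made up) that finds the K nearest neighbors of every point with a k-d tree and averages a label over them:

```python
# Minimal sketch: K-nearest-neighbor prediction via a k-d tree.
# The data, label, and K below are illustrative, not from any real analysis.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(17)
X = rng.normal(size=(10000, 5))                       # fake data: 10^4 points in 5-d
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=len(X))   # fake smooth label plus noise
K = 8

tree = cKDTree(X)
# Query K+1 neighbors, because each point is its own nearest neighbor.
dist, idx = tree.query(X, k=K + 1)
y_pred = y[idx[:, 1:]].mean(axis=1)                   # mean label over the K neighbors
```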

In both cases (Bonaca and Wheeler), we are looking at performance (in cross-validation, say) as a function of K. My surprise today is that both of them are getting big values of K (in the hundreds) as optimal. We discussed why this must be the case. In general, I think the model likes large K when the data lie in a locally smooth distribution in the space, or when the features you are trying to predict depend only smoothly on the data. Or something like that?
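As a toy illustration of that intuition (assuming scikit-learn and fake data; this is not either of their actual analyses): when the label depends smoothly on the features, cross-validation keeps preferring larger K, because averaging more neighbors beats down the noise without adding much bias.

```python
# Sketch of the cross-validation-over-K experiment on a smooth, noisy target.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(5000, 2))
y = X[:, 0] + X[:, 1] ** 2 + 0.5 * rng.normal(size=len(X))  # smooth function plus noise

for K in [1, 4, 16, 64, 256]:
    score = cross_val_score(KNeighborsRegressor(n_neighbors=K),
                            X, y, cv=5).mean()
    print(f"K = {K:4d}  mean CV R^2 = {score:.3f}")
```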
