2020-06-25

reading a difficult (to me) paper

I participated in day 3 of #sdss2020 today, and even started to pitch a project that could make use of the (literally) millions of unassigned fiber–visits in SDSS-V. Yes, the SDSS-V machines are so high-throughput that, even while running multiple huge surveys, there will be millions of unassigned fiber–visits. My pitch is with Adrian Price-Whelan; it is our project to get a spectrum of every possible “type” of star, where we have a completely algorithmic definition of “type”. More on this tomorrow, I hope.

In the afternoon, I spent time with Soledad Villar (NYU) reading this paper (Hastie et al 2019) on regression. It contains some remarkable results about what they call “risk” (and what I would call mean squared error). This paper is one of the key papers analyzing the double descent phenomenon I described earlier. The idea is that when the number of free parameters of a regression becomes very nearly equal to the number of data points in the training set, the mean squared error goes completely to heck. This is interesting in its own right—I am learning about the eigenvalue properties of random matrices—but it is also avoidable with regularization. The paper explains both why and how. Villar and I are interested in avoiding it with dimensionality reduction, which is another kind of regularization, in a sense.
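If you want to see the phenomenon for yourself, here is a minimal numpy sketch (a toy Gaussian-features model I made up for illustration; the sizes, noise level, and ridge penalty are all arbitrary choices of mine, not anything taken from the paper). It sweeps the number of fitted features p through the training-set size n and compares minimum-norm least squares against a small ridge penalty:

import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, p_max = 50, 1000, 150
beta = rng.normal(size=p_max) / np.sqrt(p_max)   # true coefficients (toy model)

X_train = rng.normal(size=(n_train, p_max))
X_test = rng.normal(size=(n_test, p_max))
y_train = X_train @ beta + 0.5 * rng.normal(size=n_train)
y_test = X_test @ beta + 0.5 * rng.normal(size=n_test)

for p in (10, 25, 45, 50, 55, 100, 150):         # sweep p through n_train = 50
    Xtr, Xte = X_train[:, :p], X_test[:, :p]
    # minimum-norm least squares; pinv handles both p < n and p > n
    b_ols = np.linalg.pinv(Xtr) @ y_train
    # ridge regression with a small (arbitrarily chosen) penalty
    lam = 1e-1
    b_ridge = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(p), Xtr.T @ y_train)
    mse_ols = np.mean((Xte @ b_ols - y_test) ** 2)
    mse_ridge = np.mean((Xte @ b_ridge - y_test) ** 2)
    print(f"p = {p:3d}  OLS test MSE = {mse_ols:8.3f}  ridge test MSE = {mse_ridge:8.3f}")

Run as written, the unregularized column should spike dramatically near p = n_train = 50 (where the design matrix is nearly square and badly conditioned) and come back down on either side, while the ridge column stays tame throughout; that, in miniature, is the double descent story.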

Related somehow to all this, I have been reading a new (new to me, anyway) book on writing, aimed at mathematicians. The Hastie et al paper is written by math-y people, and it has some great properties, like giving a clear summary of all of its findings up-front, a section giving the reader intuitions for each of them, and clear and timely reminders of key findings along the way. It's written almost like a white paper. It's refreshing, especially for a non-mathematician reader like me. As you may know, I can't read a paper that begins with the word Let!
