I had the great privilege of a long conversation today with applied mathematician Mark Tygert (NYU), who works on, among many other things, linear algebra and statistics. I wanted to discuss with him the various things Lang and I are doing to optimize our billion-parameter model of the sky (or, really, our thousand-parameter model of a tiny sandbox part of the sky). He gave us various pieces of very valuable advice, including:
- The Python scipy.sparse module for sparse-matrix manipulation (its iterative solvers live in scipy.sparse.linalg) is probably using ARPACK under the hood, which is also (not coincidentally) the right thing to be using for big problems like ours.
- When your matrix is sparse, and you want to control its dynamic range by re-scaling rows (or columns) to help with numerics, it makes little difference whether you use the L1 norm, the L2 norm, or the L-infinity norm; sparseness makes them (close to) equivalent.
- If, after normalizing the rows, we take the SVD of our derivative matrix, the singular values should be comparable; if they are, conjugate gradient should work well (a sketch of this check follows the list).
- We probably shouldn't be using anything like the Gauss–Newton method; it relies too heavily on the objective function being very close to parabolic in the parameters. Conjugate gradient is almost certainly better; it is provably optimal among methods that aren't permitted to invert the relevant matrix (see the conjugate-gradient sketch after the list).
- If we switch to conjugate gradient, we get a new knob: the trade-off between taking CG steps and recomputing derivatives. That is both good and bad.
- The only other good alternative to CG is quasi-Newton. If we want to give that a shot, we should use the BFGS method and nothing else (a BFGS sketch follows the list as well).
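To make the row-scaling and singular-value advice concrete, here is a minimal sketch using scipy; the matrix A below is a random stand-in for our derivative matrix (the shape and density are made up for illustration), and svds is scipy's ARPACK-backed truncated SVD:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Random sparse stand-in for the derivative matrix; shapes and density are made up.
rng = np.random.default_rng(42)
A = sp.random(1000, 200, density=0.01, random_state=rng, format="csr")

# Re-scale each row by its L2 norm to control the dynamic range; for a sparse
# matrix the choice of L1, L2, or L-infinity norm makes little difference.
row_norms = np.sqrt(np.asarray(A.multiply(A).sum(axis=1)).ravel())
row_norms[row_norms == 0.0] = 1.0  # leave empty rows untouched
A_scaled = sp.diags(1.0 / row_norms) @ A

# ARPACK-backed truncated SVD: if the leading singular values are comparable,
# conjugate gradient on this system should behave well.
singular_values = spla.svds(A_scaled, k=10, return_singular_vectors=False)
print(np.sort(singular_values)[::-1])
```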
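And a sketch of the conjugate-gradient route on the same kind of made-up sparse system. CG wants a symmetric positive-definite system, so this applies it to the normal equations through a LinearOperator, without ever forming A^T A explicitly; our real code would of course plug in the model's own derivative matrix and residual vector:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Made-up sparse "derivative matrix" and data vector, for illustration only.
rng = np.random.default_rng(0)
A = sp.random(1000, 200, density=0.01, random_state=rng, format="csr")
b = rng.standard_normal(1000)

# Solve the normal equations A^T A x = A^T b by conjugate gradient, exposing
# A^T A only through its action on a vector.
normal_op = spla.LinearOperator(
    (A.shape[1], A.shape[1]),
    matvec=lambda x: A.T @ (A @ x),
)
x, info = spla.cg(normal_op, A.T @ b, maxiter=500)
print("converged" if info == 0 else f"cg returned info={info}")
```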
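Finally, a sketch of the quasi-Newton (BFGS) alternative via scipy.optimize.minimize, with a toy least-squares objective standing in for our real objective function:

```python
import numpy as np
from scipy.optimize import minimize

# Toy dense least-squares problem standing in for the real objective.
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 10))
b = rng.standard_normal(50)

def objective(x):
    # 0.5 * ||A x - b||^2
    r = A @ x - b
    return 0.5 * r @ r

def gradient(x):
    # Analytic gradient A^T (A x - b); BFGS benefits from an exact gradient.
    return A.T @ (A @ x - b)

result = minimize(objective, x0=np.zeros(10), jac=gradient, method="BFGS")
print(result.success, result.fun)
```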