stochastic gradient

Dilip Krishnan (NYU) came over to the CCPP to school us cosmologists in optimization today. The conversation (in the lounge) was wide-ranging, but the most interesting parts of it (from my perspective) were about stochastic-gradient methods in sparse problems.

A stochastic-gradient method (or online method) is one that permits you to look sequentially at parts of your data (not all the data at once) but nonetheless optimize global parameters that affect all of the data. The idea is: You have so much data, you can't fit it all on disk, so you just sample a bit of it, update your parameters according to what it says, throw away the bit of data, and grab the next bit. Iterate to convergence. Foreman-Mackey and I have been investigating this in the context of The Thresher.
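The idea can be sketched in a few lines; the linear least-squares model and all numbers below are invented for illustration (this is not The Thresher's actual model):

```python
import numpy as np

# Toy illustration of a stochastic-gradient (online) method: the data are
# read one small batch at a time, and no batch is ever revisited.
rng = np.random.default_rng(42)
n_data, n_params = 10000, 5
theta_true = rng.normal(size=n_params)
X = rng.normal(size=(n_data, n_params))
y = X @ theta_true + 0.1 * rng.normal(size=n_data)

theta = np.zeros(n_params)   # global parameters that affect all the data
batch_size, step = 100, 0.1

for start in range(0, n_data, batch_size):
    Xb = X[start:start + batch_size]   # grab a bit of data...
    yb = y[start:start + batch_size]
    # ...take a small step down the gradient of this batch's mean
    # squared error, then throw the batch away.
    grad = 2.0 * Xb.T @ (Xb @ theta - yb) / batch_size
    theta = theta - step * grad
```

After a single pass through the data, `theta` is already close to the global least-squares optimum, even though no step ever saw more than 100 data points.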

A sparse problem (in my parlance) is one in which most parameters only affect a small fraction of the data, and most data are only affected by a small fraction of the parameters. It is not obvious to me that the stochastic-gradient methods will work (or work well) on sparse problems, at least not without modification.

I promised Krishnan that I would deliver to him an example of a realistic, convex, sparse problem in astronomy. Convexity ensures that we have no disagreements about what constitutes success in optimization. It will provide a sandbox for investigating some of these issues.
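For concreteness, here is a toy sketch of the kind of problem I mean (my illustration; the self-calibration setup and all numbers are invented, and this is not necessarily the example I will deliver): each datum depends on only two of the parameters, so the quadratic (hence convex) chi-squared objective is sparse.

```python
import numpy as np

# Hypothetical sparse, convex problem: self-calibration. Each observed
# magnitude m_ij depends on only two parameters out of 250: the true
# magnitude of star j and the zero-point of image i.
rng = np.random.default_rng(0)
n_stars, n_images, n_obs = 200, 50, 2000
true_mag = rng.normal(15.0, 1.0, size=n_stars)
true_zp = rng.normal(0.0, 0.1, size=n_images)

star = rng.integers(n_stars, size=n_obs)    # which star each datum touches
image = rng.integers(n_images, size=n_obs)  # which image each datum touches
m_obs = true_mag[star] + true_zp[image] + 0.02 * rng.normal(size=n_obs)

def chi2(mag, zp):
    # Quadratic in (mag, zp), hence convex; each residual involves just
    # 2 of the 250 parameters, hence sparse.
    return np.sum((m_obs - mag[star] - zp[image]) ** 2)
```

(There is the usual degeneracy between a constant offset in the magnitudes and the zero-points, so the problem is convex but not strictly convex without a further constraint.)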


  1. Pedantic quibble regarding "You have so much data, you can't fit it all on disk": that might have been the motivation in the 1980s and earlier. These days we can store *gigabytes* of data in RAM. Even if your problem is so "small" that it's "only" 10^5 examples, you may still want stochastic optimization. In batch mode you chew through all 10^5 examples, move your parameters by epsilon, then start chewing through all the data again. In stochastic/online mode you make many tiny moves as you go. By the time you've read through all the data once you might be pretty close to an optimum. (Having said that, if I can get an answer with L-BFGS, I'll tend to just do that.)

    Sparse problems are common in industrial applications of stochastic machine learning. For example, spam filtering: email classification might be the result of weighting 10^5-10^6 possible features (presence of words, particular word combinations, and other features), but a particular email might only have 10^2 of these features. One doesn't want to explicitly touch the whole weight vector during an update. If regularization suggests shrinking every element after each update, then represent the weight vector as a scalar times a vector and update the scalar to scale every element at once. I learned that trick from Leon Bottou; there's other stuff on his site you probably want to know: http://leon.bottou.org/projects/sgd
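    The trick can be sketched as follows (my own reconstruction, not Bottou's actual code; the feature counts echo the spam-filter numbers above):

    ```python
    import numpy as np

    # Represent the weight vector as w = s * v. The L2 shrinkage
    # w <- (1 - eta * lam) * w is then O(1): just rescale the scalar s.
    # The gradient step touches only the example's ~10^2 active features.
    n_features = 10 ** 5
    eta, lam = 0.1, 1e-4

    def sgd_step(s, v, active, grad_active):
        s *= 1.0 - eta * lam                  # shrinks all 10^5 weights at once
        v[active] -= eta * grad_active / s    # since w_i = s * v_i
        return s

    s, v = 1.0, np.zeros(n_features)
    active = np.arange(100)                   # features present in one email
    s = sgd_step(s, v, active, np.ones(100))
    w_active = s * v[active]                  # same result as a dense update
    ```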

  2. Iain: Thanks for the comments! Actually, in the case we care about, there are 10 to 100 TB of data, so it is still a problem! But I agree with your points. Thanks for the links.

  3. I'll throw in modeling the distribution of stars in the Milky Way halo as a superposition of a large number of streams as a possible example that fits your sparseness criteria.