The approach Roweis and I are taking to constructing *archetypes* (small subsets of data points that represent *all* the data points) is one of integer (or, actually, binary integer) programming. You have a large number of data points, and you include a small number of them and exclude the rest, subject to constraints (that each point in the large set be represented) while optimizing some cost function (the total number of archetypes, in the simplest case). In general, these problems are indeed NP-hard, as I suspected (below).
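To make the setup concrete, here is a minimal sketch of the binary problem on a toy data set, solved by brute force. Everything specific in it is an assumption for illustration: the five points on a line, the rule that a candidate "represents" any point within distance 1, and the hypothetical helper name `smallest_archetype_set`. The post does not say how representation is defined in the real problem.

```python
from itertools import combinations


def smallest_archetype_set(represents, n_points):
    """Return a smallest list of candidates whose represented sets cover all points.

    represents[j] is the set of point indices that candidate j represents.
    Exhaustive search: exponential in the number of candidates, so toy sizes only.
    """
    candidates = sorted(represents)
    everything = set(range(n_points))
    for k in range(1, len(candidates) + 1):          # try the smallest sizes first
        for subset in combinations(candidates, k):   # each subset is one 0/1 assignment
            covered = set().union(*(represents[j] for j in subset))
            if covered == everything:                # every point is represented
                return list(subset)
    return None                                      # some point has no representative


# Toy data (assumed): points at positions 0..4; candidate j represents
# every point within distance 1 of it, including itself.
n_points = 5
represents = {j: {i for i in range(n_points) if abs(i - j) <= 1}
              for j in range(n_points)}

archetypes = smallest_archetype_set(represents, n_points)
print(archetypes)
```

The search visits up to 2^N subsets for N candidates, which is why this direct approach stops being usable almost immediately and an approximation is needed.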

Roweis had the good idea of approximating the binary programming problem with a linear programming problem, and then post-processing the result. This is a great idea, and it works pretty well, as I discovered this morning, when everything came together and my code *just worked.* However, the number of archetypes we were getting out of our post-processing was significantly larger than you would expect from the optimal cost of the linear programming approximation.
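The gap between the relaxed cost and the post-processed count can be seen on a three-point toy problem. The sketch below is not the post-processing code described here (which the post does not spell out); it is one simple greedy rounding rule, with a hand-written fractional solution standing in for the output of a linear programming solver. The names `round_lp_solution`, `represents`, and `x_frac` are hypothetical.

```python
import math


def round_lp_solution(x_frac, represents, n_points):
    """Greedily turn fractional LP weights into a 0/1 choice of archetypes.

    Candidates are visited in order of decreasing LP weight, and a candidate
    is kept only if it represents at least one point not yet covered.
    """
    order = sorted(represents, key=lambda j: -x_frac[j])
    chosen, covered = [], set()
    for j in order:
        if not represents[j] <= covered:   # j covers something new
            chosen.append(j)
            covered |= represents[j]
        if len(covered) == n_points:       # every point is represented
            break
    return chosen


# Toy problem (assumed): three points, each candidate represents itself and one neighbor.
represents = {0: {0, 1}, 1: {1, 2}, 2: {2, 0}}

# Fractional solution with x in [0, 1]; for this toy it is the LP optimum,
# since adding the three covering constraints gives 2 * sum(x) >= 3.
x_frac = {0: 0.5, 1: 0.5, 2: 0.5}

lp_cost = sum(x_frac.values())            # relaxed cost
lower_bound = math.ceil(lp_cost - 1e-9)   # no 0/1 solution can use fewer archetypes
archetypes = round_lp_solution(x_frac, represents, n_points=3)
print(lp_cost, lower_bound, len(archetypes))
```

The relaxed cost is a lower bound on the number of archetypes, so comparing it with `len(archetypes)` is a direct measure of how much the rounding step is giving away.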

It turns out that standard linear programming packages (open-source GLPK and commercial CPLEX, for example) have integer and binary programming capabilities. These also solve the linear program first and then post-process, but they do something extremely clever in the post-processing step and are much better than my greedy algorithm. For the problem we currently care about, they both come very close to matching the optimal cost of the linear program, which is a lower bound on the cost of any binary solution (although CPLEX does it much, much faster than GLPK, in exchange for infinitely larger licensing fees).
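For readers who want to try the comparison without either package, here is a minimal sketch using `scipy.optimize.milp` (SciPy 1.9 or later, which wraps the HiGHS solver). This is a swapped-in tool, not GLPK or CPLEX, and the 3-by-3 matrix `A` is hypothetical toy data. The same call solves the relaxed and the binary problem; only the `integrality` flags change.

```python
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

# A[i, j] = 1 if candidate j represents point i (assumed toy data).
A = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
])
n_points, n_candidates = A.shape

cost = np.ones(n_candidates)                 # minimize the number of archetypes
covering = LinearConstraint(A, lb=np.ones(n_points), ub=np.inf)  # each point represented
unit_box = Bounds(lb=0, ub=1)                # 0 <= x_j <= 1

# Linear programming relaxation: every variable is continuous.
relaxed = milp(cost, constraints=covering, bounds=unit_box,
               integrality=np.zeros(n_candidates))

# Binary program: every variable is an integer in [0, 1].
binary = milp(cost, constraints=covering, bounds=unit_box,
              integrality=np.ones(n_candidates))

chosen = np.flatnonzero(binary.x > 0.5)      # indices of the selected archetypes
print(relaxed.fun, binary.fun, chosen)
```

On this toy the relaxed optimum is fractional, so the binary cost sits strictly above it; the statement in the post is that on the real problem the two end up very close.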

It was a very satisfying, research-filled day. As time goes on I will let my loyal readers know *why* we are interested in this.