Here's a quotation from an email I sent to Schölkopf (MPI-IS) and Villar (JHU) today:
First, I believe (am I wrong?) that a CNN works by repeating precisely the same weights at every pixel. So if, in a CNN layer, there are k channels of 3x3 filters, there are only 9k weights that set all k responses of every pixel in the layer from the 3x3 patch of pixels centered on that pixel in the layer below. The economy comes not just from the sparsity of the connections (each pixel in one layer connects to only the 9 pixels below it), but also from the fact that the same 9k weights are used at every pixel (except maybe at the edges). That enforces a kind of translation symmetry.
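To make the weight-sharing concrete, here is a minimal sketch of such a layer in PyTorch; the single input channel and the choice k = 16 are my own assumptions, not part of the argument:

```python
import torch
import torch.nn as nn

k = 16  # number of channels (filters); an arbitrary choice
conv = nn.Conv2d(in_channels=1, out_channels=k, kernel_size=3,
                 padding=1, bias=False)

# The same 9k weights are reused at every pixel position.
print(conv.weight.shape)    # torch.Size([16, 1, 3, 3])
print(conv.weight.numel())  # 144 == 9 * k

x = torch.randn(1, 1, 28, 28)  # one single-channel 28x28 image
y = conv(x)                    # k responses at every pixel
print(y.shape)                 # torch.Size([1, 16, 28, 28])
```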
Okay, now, we could make a non-convolutional neural net (NCNN) layer as follows: Each pixel is connected, as in the CNN, to just the 3x3 pixels in the layer below, centered on that pixel. And again, there will be k channels and only 9k weights for the whole layer. The only difference is that at each pixel, a rotation (of 0, 90, 180, or 270 degrees) and a flip (the identity or a reflection across the x direction) get applied to the weight maps. That is, every pixel has the same k filters applied, but at each pixel one of the 8 rotation-reflection transformations has been assigned to the 3x3 weight maps (9k weights in all). This NCNN layer would, like the CNN layer, have only 9k weights, and it would be just as local and sparse as the matching CNN layer.
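Here is one possible sketch of such a layer, again in PyTorch; the class name NCNNLayer, the fixed random per-pixel assignment of the 8 transformations, and the initialization scale are my own choices, since the email does not pin them down:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NCNNLayer(nn.Module):
    """3x3 layer with 9k shared weights, but with one of the 8
    rotation-reflection transformations applied to the weight maps
    at each pixel (fixed and randomly assigned here; an assumption)."""

    def __init__(self, in_channels, out_channels, height, width):
        super().__init__()
        self.weight = nn.Parameter(0.1 * torch.randn(out_channels, in_channels, 3, 3))
        # Assign one of the 8 dihedral transformations to each pixel, once, at random.
        self.register_buffer("assignment", torch.randint(0, 8, (height, width)))

    def _dihedral_kernels(self):
        # The 8 rotated/flipped versions of the shared 3x3 kernels.
        versions = []
        for flipped in (False, True):
            w = torch.flip(self.weight, dims=[3]) if flipped else self.weight
            for r in range(4):
                versions.append(torch.rot90(w, r, dims=(2, 3)))
        return versions

    def forward(self, x):
        # Run all 8 convolutions, then pick, per pixel, the one assigned to that pixel.
        outs = torch.stack([F.conv2d(x, w, padding=1) for w in self._dihedral_kernels()])
        sel = self.assignment.expand(x.shape[0], outs.shape[2], -1, -1)  # (B, C, H, W)
        return torch.gather(outs, 0, sel.unsqueeze(0)).squeeze(0)

layer = NCNNLayer(in_channels=1, out_channels=16, height=28, width=28)
y = layer(torch.randn(4, 1, 28, 28))  # -> shape (4, 16, 28, 28)
print(sum(p.numel() for p in layer.parameters()))  # 144 == 9 * 16 shared weights
```

Note that this really is a light modification of a standard convolution: eight conv calls on the same 9k shared weights plus a fixed per-pixel selection.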
My conjecture is that the NCNN will perform far worse on image-recognition tasks than the CNN. It is also (fairly) easy (I believe) to build an NCNN by lightly modifying a CNN code, so the comparison is clean and straightforward. I am ready to bet substantial cash on this one.