Do we do paper discussions for fun? Figured I’d try one out to see who does CV here and maybe make some new friends. I’m not doing any professional CV work now (I did some professionally a few years back) but am getting back up to speed — wait, no: the field moves fast. Anyway, I found a pretty simple paper (https://arxiv.org/pdf/2007.13657.pdf) getting a bit of buzz on social media. Figured simple was good in case this flops socially (less time invested)…
I think the gist of the paper is that they found projected SGD with an additional (scalar) tuning parameter (to force the soft threshold to knock out more weights) improves CV tasks in MLP/FFNN topologies. This is reasonable, but isn’t it already well known? For example, convolutions are just structured, typically sparse weight matrices, and pretty much any form of regularization helps NNs… Apparently not: some of the broader conclusions from the paper suggest much of this isn’t known/talked about, even if it might seem obvious. There are some nice derivations (that I haven’t hand-checked), and ultimately it seemed to be a pretty easy thing to implement in Flux (my code is here:
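For anyone who wants the one-line version of the mechanism: after each gradient step you soft-threshold the weights, and the extra scalar (I’ll call it β, which may not match the paper’s notation) just scales the threshold so more weights get zeroed. A minimal sketch, with names of my own choosing:

```julia
# Soft-threshold (proximal) operator: shrinks w toward zero by t,
# zeroing anything with magnitude below t.
soft_threshold(w, t) = sign(w) * max(abs(w) - t, 0.0)

# One projected/proximal step applied in place after a gradient update.
# η = learning rate, λ = L1 penalty, β = the paper's extra scalar;
# β = 1 recovers the plain LASSO proximal step, β > 1 knocks out more weights.
function prox_step!(W, η, λ, β)
    @. W = soft_threshold(W, η * λ * β)
    return W
end
```

In a Flux training loop this would just run over each layer’s weight array right after the optimizer update.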
I did a basic test where I made 9 dummy random-normal features and 1 feature proportional (factor 1/2) to the target property value, trained with MSE. When we train with GD and the β term set to 1 (i.e., LASSO conditions), we pretty much always arrive at the correct LASSO solution (good). But, ironically… when you increase β you can miss the correct LASSO solution. Pretty obvious in hindsight, but yes: you can hit bad local minima depending on weight initialization/data, even when there is a systematic solution. In CV there’s a lot of noise/garbage, so maybe it ends up being beneficial, but it does seem like a bit of a head-scratcher to claim we are on our way to “learning” convolutions. I’d imagine there’s a way to penalize non-banded solutions to get the best of both worlds, but I haven’t bothered looking into that. Could be fun though.
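Here’s roughly what that toy test looks like without the Flux machinery, so it’s easy to poke at. This is my own reconstruction, not the actual code from the post: I set the target to half of the tenth feature (the exact scaling direction in my description above is ambiguous), and run plain proximal gradient descent:

```julia
using Random, LinearAlgebra

Random.seed!(42)
n, p = 200, 10
X = randn(n, p)            # 9 dummy features + 1 informative one
y = 0.5 .* X[:, p]         # target carries signal only from feature 10

soft_threshold(w, t) = sign(w) * max(abs(w) - t, 0.0)

# Proximal gradient descent for the L1-penalized least-squares problem.
# β = 1 is the LASSO case; β > 1 scales up the threshold.
function prox_gd(X, y; η=0.01, λ=0.01, β=1.0, iters=2000)
    n, p = size(X)
    w = zeros(p)
    for _ in 1:iters
        g = X' * (X * w .- y) ./ n        # gradient of (1/2n)·‖Xw − y‖²
        w .-= η .* g                       # gradient step
        @. w = soft_threshold(w, η * λ * β)  # proximal (projection) step
    end
    return w
end

w = prox_gd(X, y)
# With β = 1 and small λ, w should concentrate near 0.5 on feature 10
# (minus a small shrinkage bias) and near 0 elsewhere.
```

Cranking β up in this setup is an easy way to watch the informative weight get killed off along with the dummies.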
What’s interesting about this is: can Flux handle SparseArray weight matrices in “Dense” layers with any advantage? Or do they blow up at the GPU level with little performance boost? If there is an advantage, we’d be looking at far fewer FLOPs per gradient update when deployed, making some CV tasks potentially very lean on deployment with old-school MLP topologies… Kinda nice, even if the comparison is pedagogical and not attempting SOTA implementations.
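On the CPU side the mechanics at least work, since a Dense-style forward pass only needs a matvec and SparseArrays supports that. A sketch of the FLOP intuition (I haven’t checked what Flux or CUDA actually do with this; that’s exactly the open question):

```julia
using SparseArrays, LinearAlgebra, Random

Random.seed!(1)

# Sparsify a dense weight matrix: keep ~5% of the entries.
W_dense = randn(128, 256)
mask = rand(128, 256) .< 0.05
W_sparse = sparse(W_dense .* mask)
b = zeros(128)
x = randn(256)

# A Dense-layer-style forward pass; works for any AbstractMatrix weight.
relu(v) = max(v, 0.0)
forward(W, b, x) = relu.(W * x .+ b)

y = forward(W_sparse, b, x)

# The matvec cost scales with nnz(W_sparse) instead of 128 × 256,
# which is where the deployment savings would come from.
nnz(W_sparse), 128 * 256
```

Whether that nnz-proportional cost survives contact with GPU kernels (which generally prefer dense, regular access patterns) is the part I’d love to hear about from anyone who has tried it.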
Ultimately I’d like to hear other people’s thoughts. Am I missing something more important? See a nice way to make the Flux code more legible? Wanna say hi?