10 min Paper Discussion + Flux Impl.: Towards Learning Convolutions from Scratch

anon92994695 · August 2, 2020, 1:45am

Do we do paper discussions for fun? Figured I’d try it out to see who does CV here and maybe make some new friends. I’m not doing any professional CV work right now; I did some a few years back and am getting back up to speed (the field moves fast). Anyway, I found a pretty simple paper (https://arxiv.org/pdf/2007.13657.pdf) getting a bit of buzz on social media. Figured simple was good in case this flops socially (less time invested)…

I think the gist of the paper is that they found projected SGD with an additional (scalar) tuning parameter, which forces the soft threshold to knock out more weights, improves CV tasks in MLP/FFNN topologies. This is reasonable, but isn’t it already well known? For example, convolutions are just structured, typically sparse weight matrices, and pretty much any form of regularization helps NNs… I guess not: some of the broader conclusions from the paper suggest that a lot of this isn’t widely known or talked about, even if it might seem obvious. There seem to be some nice derivations (that I haven’t hand-checked), and ultimately it was pretty easy to implement in Flux (my code is here: blasso gif.jl · GitHub).
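
For anyone skimming, here is the update as I read it, written out as math. Treat this as my paraphrase and check it against the paper's algorithm box before relying on it. The symbols are my own labels: $\eta$ is the step size, $\lambda$ the L1 coefficient, $\beta \ge 1$ the extra scalar, and $L_B$ the loss on a mini-batch.

$$
\begin{aligned}
\theta_{t+1/2} &= \theta_t - \eta\left(\nabla L_B(\theta_t) + \lambda\,\operatorname{sign}(\theta_t)\right),\\
\theta_{t+1} &= \theta_{t+1/2} \odot \mathbb{1}\left[\,|\theta_{t+1/2}| \ge \beta\lambda\,\right].
\end{aligned}
$$

So the first line is an ordinary subgradient step on the L1-penalized loss, and the second line zeroes every weight whose magnitude falls below $\beta\lambda$. Larger $\beta$ means a more aggressive knockout.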

[Image: BLASSO wts]

I did a basic test where I made 9 dummy random normal features and 1 feature proportional (by a factor of 1/2) to the target property value, fit with an MSE loss. When we train with GD and the beta term set to 1 (i.e., LASSO conditions), we pretty much always arrive at the correct LASSO solution (good). But, ironically, when you increase beta you can miss the correct LASSO solution. Pretty obvious, but yeah, you can hit some bad local minima depending on weight initialization and data, even when there is a systematic solution. In CV there’s a lot of noise/garbage, so maybe it ends up being beneficial, but it does seem like a bit of a head-scratcher to claim we are on our way to “learning” convolutions. I’d imagine there’s a way to penalize non-banded solutions to get the best of both worlds. I haven’t bothered looking into that, but it could be fun.
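
A rough sketch of that toy test in plain Julia (no Flux), in case anyone wants to poke at it. This is not the code from my gist. The names `blasso_step!` and `fit_blasso` are made up for illustration, as are the step size, λ, sample count, and seeds, and the update itself is my reading of the paper rather than a verified copy of its algorithm.

```julia
using Random, LinearAlgebra

# One β-LASSO-style step (assumed form): a subgradient step on
# mean squared error + λ‖w‖₁, then zero every weight with |w| < β*λ.
function blasso_step!(w, X, y; η=0.05, λ=0.01, β=1.0)
    grad = 2 .* (X' * (X * w .- y)) ./ length(y)   # gradient of the MSE term
    w .-= η .* (grad .+ λ .* sign.(w))             # L1 subgradient step
    w[abs.(w) .< β * λ] .= 0                       # knock out small weights
    return w
end

# Full-batch training loop from a small random initialization.
function fit_blasso(X, y; β=1.0, steps=2000, rng=MersenneTwister(0))
    w = 0.1 .* randn(rng, size(X, 2))
    for _ in 1:steps
        blasso_step!(w, X, y; β=β)
    end
    return w
end

rng = MersenneTwister(42)
n = 200
y = randn(rng, n)              # target values
X = randn(rng, n, 10)          # 10 random normal features...
X[:, 10] .= 0.5 .* y           # ...then overwrite the last one: y = 2 * X[:, 10]

w1  = fit_blasso(X, y; β=1.0)  # LASSO conditions
w50 = fit_blasso(X, y; β=50.0) # very aggressive threshold (0.5)

println("β = 1:  ", round.(w1; digits=3))
println("β = 50: ", round.(w50; digits=3))
```

With β = 1 the informative weight should land close to 2 with the dummy weights at or near zero. With β = 50 the threshold is 0.5, and no weight can clear that in a single step from this initialization, so everything gets stuck at zero. That is an extreme version of the failure I described, but it shows the mechanism.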

What’s interesting about this: can Flux handle SparseArray weight matrices in “Dense” layers with any advantage? Or do they blow up at the GPU level with little performance boost? If they do work, we might be looking at far fewer FLOPs per gradient update, which could make some CV tasks very lean on deployment with old-school MLP topologies… Kinda nice, even if the comparison is pedagogical and not an attempt at a SOTA implementation.
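
To make the "convolutions are just structured sparse weight matrices" point concrete, here is a small sketch using only the SparseArrays standard library. It does not touch Flux or the GPU, so it says nothing about whether `Dense` layers accept sparse weights; it only shows how few entries the banded matrix actually stores. `conv_matrix` is a hypothetical helper I wrote for this post, and the kernel and input are arbitrary.

```julia
using SparseArrays, LinearAlgebra

# Build the (n-k+1) × n banded matrix whose product with x equals a
# "valid" 1-D cross-correlation of x with `kernel` (what DL libraries
# usually call a convolution).
function conv_matrix(kernel::AbstractVector, n::Integer)
    k = length(kernel)
    m = n - k + 1
    rows, cols, vals = Int[], Int[], Float64[]
    for i in 1:m, j in 1:k
        push!(rows, i)
        push!(cols, i + j - 1)
        push!(vals, kernel[j])
    end
    return sparse(rows, cols, vals, m, n)
end

kernel = [1.0, -2.0, 1.0]                       # second-difference stencil
n = 1000
W_sparse = conv_matrix(kernel, n)
W_dense  = Matrix(W_sparse)                     # same weights, stored densely
x = collect(range(0.0, 1.0; length=n)) .^ 2

y_sparse = W_sparse * x
y_dense  = W_dense * x
# Sliding-window computation, for comparison with the matrix products.
direct = [sum(kernel[j] * x[i + j - 1] for j in 1:length(kernel))
          for i in 1:(n - length(kernel) + 1)]

println("stored nonzeros: ", nnz(W_sparse), " of ", length(W_dense))
println("max abs difference: ", maximum(abs.(y_sparse .- y_dense)))
```

The sparse matrix stores 3 entries per output row instead of 1000, which is where the hoped-for FLOP savings would come from. Whether that translates into a real speedup inside a Flux layer, especially on GPU, is exactly the open question.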

Ultimately I’d like to hear other people’s thoughts. Am I missing something more important? See a nice way to make the Flux code more legible? Wanna say hi?
