I am working on an implementation of an ML algorithm called SLIDE (sub-linear deep learning engine) from an academic paper. I’ve included a link to a gist of the code: https://gist.github.com/outlace/c81a15ab77c1c9cebb56003545ee5512
Basically the algorithm tries to make a deep learning algorithm that scales sub-linearly with respect to the size of the parameter matrices of each layer. Normally a matrix multiplication of two n x n
matrices has a time complexity of O(n^2), but SLIDE proposes using a node sub-sampling scheme, so that given an n x m
matrix, where each row (or column, depending on how you setup the matmul) is a ‘node’ in the neural network, you sub-sample a small proportion of nodes (rows) s
and then do a matrix multiplication using this much smaller matrix s x m
where s << n
. If you increase the size of the matrix the sub-sampled number of rows s
will grow sub-linearly, so that your overall processing time grows sublinearly as you scale up the matrix.
Ok, so basically to implement this I just generate a list of indices of the nodes (rows) I want to use and then do the forward pass of my simple 2-layer fully connected neural network using these smaller matrices by doing a view/slice of the original parameter matrices during training. But it appears when Zygote/Flux is doing the backward pass to compute gradients it is using the full n x m
matrix and not my view/slice of it because the wall time is growing linearly as I increase the n
dimension, and it should be relatively constant as I am only sampling about 50 nodes regardless of the size of the original parameter matrix.
Is there anyway to get Zygote/Flux to do the backward pass using a view/slice so that I can reap the performance benefit? Or perhaps there’s an issue with my implementation. Any help is greatly appreciated.