Hi everyone, a while back I had a quick discussion with @outlace (over here) about implementing SLIDE. The basic idea is that you don't need expensive GPUs or other hardware accelerators to train neural nets: all you need is a multi-core CPU, and you can get the same performance, which sounds pretty amazing to me. I'd love to get a Julia implementation going based on the SLIDE paper, and there are a few spin-off ideas in other papers that are also worth looking at. Basically, I'm looking for other people who believe this is a good idea to explore and who want to implement it together. Please reach out if you're interested or want to know more.
For those who are interested in the papers that I mentioned:
(one of) the main SLIDE paper(s)
speeding up SLIDE with vectorized operations
MONGOOSE (another speedup through more efficient hash table updating)
distributed SLIDE, i.e. multi-CPU
P.S. - As far as I know the algorithm currently only handles dense layers, but there’s no reason to think it can’t be extended to CNNs for example. Creating an implementation for this is also on my list of goals, but we’d be in uncharted waters so it’s more of a research project.
So it's essentially a hash-lookup-based way of doing matrix-vector multiplication? It does sound like a nice use case for SimpleChains.jl (GitHub - PumasAI/SimpleChains.jl).
There are other unusual ways of training neural networks, such as the use of genetic algorithms in the early 90s that might be worth revisiting at the same time.
Yes, the basic idea is to sparsify matrix-vector multiplications so that only the output entries with the largest values are calculated. There's been some research showing that this kind of calculation works well for neural networks without losing (much) accuracy.
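To make that concrete, here's a quick pure-Python sketch of the core trick (Python just so it's runnable anywhere; a Julia version would be the actual goal). All names here are made up for illustration, not taken from any real SLIDE implementation: SimHash (random hyperplanes) buckets the rows of a weight matrix, and at inference time only the output neurons whose rows share the input's bucket get their dot products computed.

```python
# Illustrative sketch of LSH-sparsified matvec; not the authors' code.
import random

random.seed(0)

def simhash(planes, v):
    """Sign pattern of v against each random hyperplane, packed into an int."""
    code = 0
    for p in planes:
        dot = sum(pi * vi for pi, vi in zip(p, v))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code

d, n_out, n_planes = 8, 100, 4
planes = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_planes)]
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_out)]

# Build the hash table once (in SLIDE it is rebuilt periodically as W trains).
table = {}
for i, row in enumerate(W):
    table.setdefault(simhash(planes, row), []).append(i)

def sparse_matvec(x):
    """Return {neuron index: activation} for neurons in x's bucket only."""
    active = table.get(simhash(planes, x), [])
    return {i: sum(w * xi for w, xi in zip(W[i], x)) for i in active}

x = [random.gauss(0, 1) for _ in range(d)]
out = sparse_matvec(x)  # typically a small fraction of the 100 outputs
```

Since rows that hash like x tend to have a large dot product with x, the entries you skip are (probabilistically) the small ones, which is exactly the sparsification described above.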
I hadn't seen SimpleChains.jl before. Is it supposed to be a small framework for easily creating simple model architectures without too much hassle? And does it work with Flux?
I would like to get some experience with genetic algorithms but I don’t think it’d be a good fit in this project. Or do you think a SLIDE system could help performance of genetic algorithms in some way?
I tried prototyping SLIDE here: GitHub - outlace/SLIDE-Pose-Estimation
I could've implemented it poorly, but my results were not good. It does speed up large matrix multiplications, but the accuracy was much worse, requiring a lot more training iterations, so overall it did not improve training time. The use case in the SLIDE paper is neural networks with a very large output space, e.g. ~100,000 possible classification classes, hence an output matrix of W x 100,000. Since the output layer is very sparse with one-hot encoding, SLIDE seems particularly helpful there, but with smaller or denser output layers it doesn't seem to help.
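The intuition behind that use case can be shown in a few lines. This is not SLIDE itself, just a rough illustration (with invented names) of why a huge one-hot output layer is the sweet spot: for a 100,000-class softmax you can get a useful training signal from the true class plus a handful of sampled negatives, so computing all 100,000 logits is mostly wasted work.

```python
# Rough illustration of sampled logits for a very wide output layer.
import math
import random

random.seed(1)

n_classes, d = 100_000, 16

def class_row(c):
    """Pretend output-weight row for class c (deterministic fake weights)."""
    rng = random.Random(c)
    return [rng.gauss(0, 1) for _ in range(d)]

def sampled_logits(x, true_class, n_neg=20):
    """Compute logits for the true class plus a few random negatives only."""
    classes = {true_class} | {random.randrange(n_classes) for _ in range(n_neg)}
    return {c: sum(w * xi for w, xi in zip(class_row(c), x)) for c in classes}

x = [random.gauss(0, 1) for _ in range(d)]
logits = sampled_logits(x, true_class=42)

# A softmax over ~21 classes instead of 100,000:
z = max(logits.values())
probs = {c: math.exp(v - z) for c, v in logits.items()}
s = sum(probs.values())
probs = {c: p / s for c, p in probs.items()}
```

With a small or dense output layer there's no such slack to exploit, which matches the experience described above.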
I remember you mentioned that before. It was a little disappointing but I still wonder whether there’s anything that can be changed to make it work better for layers that are not very wide. Do you remember tuning any hyperparameters of the LSH tables to see whether decreasing the sparsity a bit can help?
All you need is a multi-core CPU and you can get the same performance, which sounds pretty amazing to me.
Please reach out if you’re interested or want to know more
[…] but there’s no reason to think it can’t be extended to CNNs for example
Could, e.g., AlphaZero.jl also be a use case for such a solution, or a similar one? I have to admit that I went extremely briefly through the papers provided above; however [another disclaimer], based on my limited knowledge, I believe I have encountered pretty good results on CPUs vs. GPUs with AlphaZero.jl (if interested, please see this [link]) [and another disclaimer: this was more or less my first contact with Julia, I am not a pro coder, and I never had / do not have any intention to debate any points of view or any coding strategy / implementation]. Also, as for the papers, I'm recalling this one: "A Survey of Deep Learning on CPUs: Opportunities and Co-optimizations" by Sparsh Mittal, Poonam Rajput, and Sreenivas Subramoney [link].
The use-case in the SLIDE paper is for neural networks with a very large output space, e.g. like 100,000 possible classification classes hence output matrix of W x 100,000.
Out of pure curiosity, do you guys know (approximately) what is the size of AlphaZero’s output space?
I did my best to get it to work by fiddling around with it. I could certainly be missing something, so others should keep trying. However, I feel like if it were practical to get SLIDE to work for "regular" neural networks, the authors would already have published and showcased this; the fact that they demonstrated it on a very particular kind of neural network makes me think they also couldn't find any advantage for the typical neural net.
I am now more interested in this paper: "Multiplying Matrices Without Multiplying". Also see the discussion at "Bolt: Faster matrix and vector operations that run on compressed data" on Hacker News.
I agree that using the algorithm as is probably won’t give the same great results that were shown for wide layers. Having said that I’m hopeful that it can serve as a solid foundation for an updated algorithm that’ll work great for smaller networks as well. To get there it’d be good to have a flexible code base to try things out, which in my opinion means Julia would be a better fit than the existing C++ code base.
That paper also looks pretty interesting, and it seems promising as well.
I'm not very familiar with AlphaZero, but I'm pretty sure it uses a neural net, so SLIDE could be used to train it. I don't think the network they use has 100,000 neurons in any of its layers, though, so the issue that outlace brought up would probably be noticeable (see above).
I don't know. I always thought that the predecessor had something like 80 layers and hundreds of thousands of neurons, but definitely significantly fewer than the 86 billion of my very personal ones. Just wanted to underline that, should there be such an opportunity, I'd probably be very much interested in the topic being discussed here. I have several years of Atari BASIC coding behind me, and recently I spotted that Julia Discourse was marking some anniversary of my activity here. Neural nets are of interest to me; however, I'd like to underline again that I am not a pro coder (developer). Nevertheless, I'd hope to try my best to be as contributive as possible. EDIT: I just spotted that @outlace is the author of a blog on ML/RL/AI and one of the authors of a book on reinforcement learning, so I am sorry about my informal tone. (I was not aware of it, and my approach here is usually easygoing.)
I think the SLIDE algorithm has some issues; you should read their articles. MONGOOSE is preferred over SLIDE.
I have a PyTorch implementation of another powerful algorithm, "Monarch", from HazyResearch. It has a 2-4x performance gain over PyTorch's Linear. If you want, I can paste the PyTorch code here.
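For context while we wait for that code: the Monarch factorization replaces a dense n-by-n weight with a product of block-diagonal matrices interleaved with a fixed permutation, cutting a matvec from O(n^2) toward O(n*sqrt(n)). Below is a from-scratch pure-Python sketch of that structure only; it is my simplification, not the HazyResearch implementation, and the helper names are invented.

```python
# Toy Monarch-style matvec: y = B2 * P * B1 * x with B1, B2 block-diagonal
# and P a fixed "transpose the sqrt(n) x sqrt(n) grid" permutation.
import math
import random

random.seed(3)

n = 16
blk = int(math.isqrt(n))       # block size = sqrt(n), so n/blk = blk blocks

def rand_blockdiag():
    """blk blocks, each blk x blk, representing a block-diagonal matrix."""
    return [[[random.gauss(0, 1) for _ in range(blk)] for _ in range(blk)]
            for _ in range(blk)]

def blockdiag_matvec(blocks, x):
    y = []
    for i, block in enumerate(blocks):
        chunk = x[i * blk:(i + 1) * blk]
        y.extend(sum(row[j] * chunk[j] for j in range(blk)) for row in block)
    return y

def transpose_perm(x):
    """View x as a blk x blk grid (row-major) and transpose it."""
    return [x[(i % blk) * blk + i // blk] for i in range(n)]

B1, B2 = rand_blockdiag(), rand_blockdiag()

def monarch_matvec(x):
    return blockdiag_matvec(B2, transpose_perm(blockdiag_matvec(B1, x)))

x = [random.gauss(0, 1) for _ in range(n)]
y = monarch_matvec(x)
```

The parameter count here is 2 * blk^3 = 2 * n^1.5 instead of n^2, which is where the speedup over a plain dense Linear comes from; how well this structure preserves accuracy is the empirical question the Monarch paper studies.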
Really interested in contributing to this. Actually, I would not even bother with convolution layers and would instead try what people are investigating with Transformers.
Do you know of an implementation of something like "Multiplying Matrices Without Multiplying" in Julia?