Hi. The ELM.jl implementation of Extreme Learning Machines looks fun, and the single example given on GitHub looks simple enough.
Is there any other documentation on how to use activation functions other than sigmoid, or perhaps even a customised loss function?
If you look at the implementation (https://github.com/lepisma/ELM.jl/blob/master/src/base.jl), the type HiddenLayer has an activation function, and only sigmoid is implemented. It seems like you could initialize the hidden layer with another activation function that you define yourself. Have you tried this?
Thanks. Since it's loadable as a package, I hadn't thought of modifying the source. Also, I had wanted to use softmax, so a completely different architecture would be needed. Maybe I should just regard this as an example and write my own from scratch.
It would be nice to be able to see how this compares with other methods for MNIST, for example (where images are vectorised).
If you wanted to, you could write your own ELM very easily.
You only need to adjust things if your data is too big to invert the covariance matrix, IIRC. Most people aren't dealing with that. This implementation should let you swap out the activation function for whatever you want, so alternatively you could just use that package.
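For what it's worth, here's a rough from-scratch sketch of a single-hidden-layer ELM (my own function names, not the ELM.jl API), with the activation passed in as a plain argument so you can swap in whatever you like:

```julia
using LinearAlgebra, Random

sigmoid(x) = 1 / (1 + exp(-x))

# X is n_samples × n_features, Y is n_samples × n_outputs (one-hot for classification).
function fit_elm(X, Y, n_hidden; act = sigmoid, rng = Random.default_rng())
    W = 2 .* rand(rng, size(X, 2), n_hidden) .- 1   # random input weights on (-1, 1)
    b = 2 .* rand(rng, 1, n_hidden) .- 1            # random biases on (-1, 1)
    H = act.(X * W .+ b)                            # hidden-layer activations
    β = pinv(H) * Y                                 # output weights via pseudoinverse
    return (W = W, b = b, β = β, act = act)
end

predict_elm(m, X) = m.act.(X * m.W .+ m.b) * m.β
```

For MNIST you'd just vectorise each image into a row of X, one-hot encode the labels into Y, and take the row-wise argmax of the predictions.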
Thanks very much!
I guess this is even easier than I thought, though I would have expected that more sophisticated methods for scaling the random weights, and regularised inversion with cross-validated hyper-parameters, would work much better without being too much more expensive.
Anyway, I guess I will find out!
Thanks again.
You can scale the weights. The ELM.jl package uses uniformly distributed random weights on (-1, 1), IIRC. There are probably dozens of papers on how scaling them in certain ways worked better for some problems. My understanding is that they should be random normal, but yeesh, I haven't thought about ELMs for a long time now.
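In terms of the sketch above, that's a one-line swap (the 0.1 scale here is just an arbitrary illustration, not a recommendation):

```julia
W = 2 .* rand(rng, size(X, 2), n_hidden) .- 1   # uniform on (-1, 1), as in ELM.jl (IIRC)
W = 0.1 .* randn(rng, size(X, 2), n_hidden)     # scaled random normal instead
```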
You can also cross-validate the hyperparameter, i.e. the reservoir size. This function doesn't do that for you, but that package does have CV utilities in it.
This function only does the implicit regularization of using a pseudoinverse (ridge parameter based on the spectral norm, IIRC, but fact-check that). But you could copy-paste it and add a scalar to the diagonal of the covariance before taking a general inverse, if you want explicit ridge regularization.
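Again in terms of the from-scratch sketch above, the explicit ridge version of the output-weight solve is just this (with λ being a hyperparameter you'd cross-validate alongside the reservoir size):

```julia
using LinearAlgebra

# Ridge-regularised replacement for β = pinv(H) * Y
ridge_output_weights(H, Y, λ) = (H' * H + λ * I) \ (H' * Y)
```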
Yeah, let us know if this works for you; if not, I'm sure we can help. ELMs are very simple.
It's important to mention that, although ELMs can be faster to compute, they do, theoretically, provide the same optimal answer (barring random chance and reservoir size) as, say, an LS-SVM with a Gaussian kernel after tuning. Also worth noting: LS-SVMs are the same as kernel ridge regression with a bias term. So those options are also worth exploring. ELMs sound cool because they are all "Extreme", but really they are a sort of shortcut to otherwise less random methods. That shortcut has a bit of a cost, but people doing online learning appreciate their speed. All of those methods should also be in ChemometricsTools.jl, but I would imagine there are other implementations around the ecosystem by now. Maybe not, though. If not, check the shootout script in that repository and add a CV loop or two if it's not already there.
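For comparison, a bare-bones kernel ridge regression with a Gaussian kernel (leaving out the bias term the LS-SVM formulation carries) is also only a few lines; again, the names here are my own:

```julia
using LinearAlgebra

gauss_kernel(X1, X2, σ) =
    [exp(-sum(abs2, x1 .- x2) / (2σ^2)) for x1 in eachrow(X1), x2 in eachrow(X2)]

function fit_krr(X, y, λ, σ)
    K = gauss_kernel(X, X, σ)
    α = (K + λ * I) \ y          # dual coefficients
    return (X = X, α = α, σ = σ)
end

predict_krr(m, Xnew) = gauss_kernel(Xnew, m.X, m.σ) * m.α
```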
Thanks so much. I have done a lot with SVMs (written papers, even). I am really just exploring ELMs because I came across a reference to them, had somehow missed them along the way, and wanted to have a bit of a play.
I think it is neat when ‘quick and dirty’ gets you a good deal of the way along (this is good for showing students also!).
Yes, I was talking about tuning the ridge parameter (or maybe a lasso one as well, which might reduce dimension).
Thanks again!
SVMs are awesome, I don't care what anyone says :).
I've never seen a Lasso ELM, but in reservoir learning (say, in echo state networks) it's very common to introduce sparsity in the weights. In general, sparsity is good anyway :). LS-SVM is basically a shortcut to SVMs, and it usually offers less sparsity than true SVMs do. That said, optimizing an LS-SVM only involves linear regression, so it's much faster. And also, undergraduates can write the code themselves! Maybe not understand the theory, but hey, maybe nowadays they could.
I wrote a projected gradient descent LASSO in a gist somewhere… But I think there are probably dozens of Lasso regression implementations out there in the Julia ecosystem nowadays :). Here's a nice demo of the classic way of doing LASSO, from ProximalOperators.jl: https://github.com/kul-forbes/ProximalOperators.jl/blob/master/demos/lasso.jl
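If you just want the idea without any packages, a bare-bones proximal gradient (ISTA) LASSO is short too; this is a from-scratch sketch, not the ProximalOperators.jl API:

```julia
using LinearAlgebra

soft_threshold(x, t) = sign(x) * max(abs(x) - t, 0)

# Minimise 0.5 * ||X*β - y||² + λ * ||β||₁ via ISTA.
function lasso_ista(X, y, λ; iters = 1_000)
    L = opnorm(X)^2                           # Lipschitz constant of the gradient
    β = zeros(size(X, 2))
    for _ in 1:iters
        grad = X' * (X * β - y)
        β = soft_threshold.(β .- grad ./ L, λ / L)
    end
    return β
end
```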
Anyway, yeah, let us know if you run into any trouble.
Thanks - you’ve given me lots to play with.
BTW, Convex.jl does Lasso in a flash, and only needs about 3 lines of code.
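Something like this, I believe (using SCS as an example solver, with X, y, and λ as before):

```julia
using Convex, SCS, LinearAlgebra

β = Variable(size(X, 2))
problem = minimize(0.5 * sumsquares(X * β - y) + λ * norm(β, 1))
solve!(problem, SCS.Optimizer)
β̂ = evaluate(β)
```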
I always forget how nice these optimizer packages are. Thanks for the tip! It'd be fun to compare the results of, say, Convex.jl with the old-school approaches for efficiency, but too many balls in the air right now.
Maybe I'm too late to the party, but CliMA's RandomFeatures.jl (https://github.com/CliMA/RandomFeatures.jl, "Modular random feature approximation in Julia") hosts different random feature models.