This RBM used the scalar product:
BenchmarkTools.Trial:
memory estimate: 1.42 mb
allocs estimate: 814
--------------
minimum time: 72.793 ms (0.00% GC)
median time: 73.184 ms (0.00% GC)
mean time: 74.461 ms (0.05% GC)
maximum time: 83.571 ms (0.00% GC)
--------------
samples: 68
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
Edit: same results with or without interpolation into @benchmark.
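(For anyone unfamiliar with the interpolation mentioned in the edit: BenchmarkTools lets you splice a global's value into the benchmarked expression with $, which avoids timing the lookup of a non-const global. A minimal illustration with a made-up array:)
using BenchmarkTools
v = rand(10000)
@benchmark sum(v)    # v is resolved as a non-const global on every evaluation
@benchmark sum($v)   # $ interpolates the value, so the benchmark sees a concrete array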
As a partial aside, I've been working on some code lately in which this naive implementation of sigmoid (which would be called invlogit in the statistics world) is exactly the main source of problems: this exact definition accounts for a large proportion of total time in the inner loop of my code, and it's also the primary source of numeric issues because of the limited range of x for which it produces a result that isn't exactly 0 or 1.
In statistics, you could probably do even better than optimizing sigmoid by noting that this function is almost always composed with other functions: you typically end up needing a mixture of log(sigmoid(x)) and log(1 - sigmoid(x)) when fitting logistic regression models, so optimizing those compositions could provide even greater improvements.
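For illustration, one common way of writing these in a numerically safer form looks roughly like this (just a sketch; the function names are made up, not from the code in this thread):
# sigmoid written so that exp() only ever sees a non-positive argument (no overflow to Inf)
stable_sigmoid(x) = x >= 0 ? 1 / (1 + exp(-x)) : exp(x) / (1 + exp(x))
# log(sigmoid(x)) computed directly via log1p, so it stays finite and accurate even where
# sigmoid(x) itself would round to exactly 0 or 1; log(1 - sigmoid(x)) is just logsigmoid(-x)
logsigmoid(x) = x >= 0 ? -log1p(exp(-x)) : x - log1p(exp(x))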
Thank you for taking the time to look at the code. I didn't know about the @view option or about the fact that .+= avoids reallocating memory. I have tested @Evizero's code, but I still get a lot of GC. This is what I get:
memory estimate: 1.33 gb
allocs estimate: 22845
minimum time: 601.147 ms (20.87% GC)
median time: 633.693 ms (20.14% GC)
mean time: 624.172 ms (20.33% GC)
maximum time: 648.480 ms (19.94% GC)
Should I test the code in Julia 0.6? I am testing it in Julia 0.5.
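(For reference, the two tricks mentioned above look roughly like this; a minimal sketch with made-up arrays, not the actual RBM code:)
A = rand(100, 100)
b = rand(100)
col = @view A[:, 1]   # a view shares memory with A instead of copying the column out
b .+= col             # fused in-place broadcast: updates b without allocating a new array (as of Julia 0.6)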
I'm surprised nobody has mentioned BLAS yet - I implemented exactly the same model a while ago, and BLAS gave a pretty good performance boost and zero memory allocation (at least in those earlier days of Julia). For example, in real-life settings the expression above would work with mini-batches of, say, 1000 vectors (e.g. size(ehp) == (255, 1000); size(x) == (784, 1000)) and can be rewritten as:
@benchmark Delta_W = lr .* (ehp * x' .- ehn * xneg')
BenchmarkTools.Trial:
memory estimate: 5.38 mb
allocs estimate: 22
--------------
minimum time: 12.316 ms (0.00% GC)
median time: 29.460 ms (0.00% GC)
mean time: 28.002 ms (0.76% GC)
maximum time: 50.396 ms (0.00% GC)
--------------
samples: 179
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
or using BLAS:
import Base.LinAlg.BLAS: gemm!, axpy!, scal!   # in-place BLAS wrappers (Base.LinAlg in Julia 0.5/0.6)
@benchmark begin
    gemm!('N', 'T', 1.0, ehp, x, 0.0, Delta_W)   # Delta_W = ehp * x'
    gemm!('N', 'T', 1.0, ehn, xneg, 0.0, buf)    # buf = ehn * xneg'
    axpy!(-1.0, buf, Delta_W)                    # Delta_W .-= buf
    scal!(length(Delta_W), lr, Delta_W, 1)       # Delta_W .*= lr
end
BenchmarkTools.Trial:
memory estimate: 0.00 bytes
allocs estimate: 0
--------------
minimum time: 10.181 ms (0.00% GC)
median time: 10.623 ms (0.00% GC)
mean time: 11.759 ms (0.00% GC)
maximum time: 45.104 ms (0.00% GC)
--------------
samples: 426
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
Although I haven't checked it on 0.6, so the latest Julia may still beat these results.
EDIT: I made a mistake which resulted in an unfair comparison. The gist I initially posted in this exact post was wrong, as its BLAS implementation was intended to do the whole loop at once while the others do just one iteration. This mistake does not affect anything I posted above.
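(For context, a rough setup for reproducing the two benchmarks above could look like the following; only the array sizes come from the post itself, the learning rate and buffer initialization are made up:)
lr      = 0.1                                        # assumed learning rate
x       = rand(784, 1000); xneg = rand(784, 1000)    # visible mini-batches
ehp     = rand(255, 1000); ehn  = rand(255, 1000)    # hidden-unit expectations
Delta_W = zeros(255, 784)                            # preallocated weight update (ehp * x' is 255x784)
buf     = zeros(255, 784)                            # preallocated scratch buffer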
@Evizero, it might be nice to have some version of this as a benchmark in the BaseBenchmarks package, to help us track Julia performance, if you want to do a pull request.
As soon as you are working with multiple vectors at once, so that you are doing BLAS3 operations (matrix multiplications etc.), then I agree that you definitely want to exploit a fast BLAS (like the OpenBLAS that Julia links). However, you don't generally need to call low-level BLAS functions like gemm! directly. A_mul_Bt! and similar high-level functions are just as efficient.
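For example, the first gemm! call in the snippet above could just as well be written with the high-level wrapper (Julia 0.5/0.6 spelling; on 0.7+ this became mul!(Delta_W, ehp, transpose(x))):
A_mul_Bt!(Delta_W, ehp, x)   # Delta_W = ehp * x', dispatching to BLAS gemm! for dense Float64 arrays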
For BLAS1 operations like axpy! and scal!, you are probably better off with the fusing broadcast operations. The increase in locality and reduction of other overheads that you get from fusing the loops will beat the minor optimizations that OpenBLAS can do for BLAS1 operations.
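Concretely, the axpy!/scal! pair above could be collapsed into a single fused in-place broadcast (assuming, as in that snippet, that the two gemm! products are already sitting in Delta_W and buf):
Delta_W .= lr .* (Delta_W .- buf)   # one pass over the data, no BLAS1 calls, no temporaries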
(And if most of your time is spent in BLAS3 operations, then you should expect essentially the same performance in Julia and NumPy, assuming that they are using the same BLAS library.)
Thanks a lot for your time. It turns out that most of the speedup came from a transpose that your code avoided, not from the fact that you used A_mul_B!.
In your version you wrote
Delta_W .+= lr .* (ehp .* x' .- ehn .* xneg')
In my version I had
Delta_W .+= lr * (x * ehp' - xneg * ehn')'
I think you'd be happy to hear that Julia in v0.6 will "take transposes seriously" via the type system, making it essentially a free operation. It was a huge thread, culminating in:
Most nightlies haven't had an update since then (only COPR has), so if you didn't build it from source or use COPR, you won't have that update. It will say how many days it is from the last master at the top of the REPL when you open it up.
@davidbp is talking about the outer matrix transpose. I don't think the RowVector PR affects that.
EDIT: also, it looks like lr * (...)' gets translated into A_mul_Bc(lr, ...), so I don't think (?) that the outer transpose influences much in your particular code.
Aside from that, also observe how your version uses * where the other uses .*. The * (without the dot) causes broadcast fusion to stop, which causes the creation of temporary arrays. So there are really a few non-obvious sources of performance penalties here.
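A minimal illustration of where fusion stops (made-up vectors, not the RBM code):
a = rand(1000); b = rand(1000); c = similar(a)
c .= 2 .* a .+ b   # fully fused: a single loop writing into c, no temporaries
c .= 2 * a .+ b    # 2 * a is materialized as a temporary array before the fused .+ and .=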
The change @ChrisRackauckas is talking about affects the inner vector transposes and so gives an additional performance boost.