Hi folks,
While building up some neural-network code that I can compare against my colleague’s Python code, I have a simple function that I’m calling over and over again:

function Log_Unnorm_Probs_B(x, W, Nh)
    @inline Log_1_plus_Exp(x) = x > 50.0 ? x : log(1.0 + exp(x))
    b  = view(W, :, 1)        # bias column
    ww = view(W, :, 2:Nh+1)   # weight columns
    x'*b .+ sum(Log_1_plus_Exp.(x'*ww), dims=2)
end

Now, to my eyes that seems like a bit too much computing time: it is vastly outperformed (by a factor of 20 or so) by my colleague’s Python+NumPy version, not to mention the corresponding CuArrays CUDA version, which is more than 100 times faster. I know CUDA is parallel and NumPy can also exploit multithreading, so I’m not looking for a fight about languages; I just want to know whether my code is reasonably fast in a Julia context, or whether it can be drastically improved.

Based on my profiling, most of the time is spent in broadcasting, specifically in the calculation of Log_1_plus_Exp.(x'*ww). Also, I don’t see any vectorization in the LLVM code.

I think you should use a vectorized version (or explicit SIMD) instead of broadcasting here. Maybe the compiler cannot generate efficient code because of that if/else branch.

Let’s see what others think

Edit: for some reason I looked at the wrong LLVM output. There is some SIMD going on.
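For anyone who wants to double-check this themselves, here is a minimal way to inspect the code generated for the scalar kernel (the name `softplus` is mine, just for illustration):

```julia
# Define the scalar kernel and print the LLVM IR it compiles to.
# In the output, vector types like <2 x double> or <4 x double>
# indicate SIMD instructions.
softplus(x) = x > 50.0 ? x : log(1.0 + exp(x))

@code_llvm softplus(1.0)
```

Note that this shows the code for a single scalar call; to see whether the *broadcast* vectorizes, you would inspect the broadcasted call instead.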

I am a bit puzzled why the NumPy code is so much faster. The number of allocations is fine, so most of the time is spent on actual number crunching. The question then is: is it related to threads, as you already mentioned, or does your Julia code do “extra/unneeded” calculations compared to the NumPy implementation?

Based on what I can see, most of the time is spent inside the log(1+exp(x)) code. Even after removing the branch, there doesn’t seem to be a very large speedup.
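For reference, a standard branch-free (and numerically stable) way to write log(1+exp(x)) uses log1p — a sketch, not necessarily faster here, and the name is mine:

```julia
# Stable softplus: max(x, 0) + log1p(exp(-|x|)) equals log(1 + exp(x))
# for all x, but never overflows exp() for large arguments.
softplus_stable(x) = max(x, 0.0) + log1p(exp(-abs(x)))
```

For large x the exp(-|x|) term vanishes and the result is just x, which is exactly what the `x > 50.0` branch in the original was approximating.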

I shaved off 15% of the runtime just by cleaning up the code and writing a loop; most of the remaining time is in log and exp, which do not appear to be fast. Using a vectorization library should help: for example, AppleAccelerate is roughly 5 times as fast for the log/exp part.
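A loop-based version of the reduction could look something like this (a sketch; the function name and structure are mine, not the code from the thread):

```julia
# Loop-based equivalent of sum(Log_1_plus_Exp.(A), dims=2), written so the
# compiler can vectorize the reduction with @simd. Iterating j in the outer
# loop keeps the inner loop走 down columns, i.e. contiguous in memory.
function rowwise_softplus_sum(A::AbstractMatrix)
    out = zeros(eltype(A), size(A, 1))
    @inbounds for j in axes(A, 2)
        @simd for i in axes(A, 1)
            x = A[i, j]
            out[i] += x > 50.0 ? x : log(1.0 + exp(x))
        end
    end
    return out
end
```

This avoids materializing the intermediate broadcasted matrix, which also accounts for the reduced allocations.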

Yeah, sorry about the Nsamples thing… I always forget things.
Regarding the NumPy code, I’ll ask my colleague; so far I only have the results of the same kind of calculations.
Regarding the log(1+exp(x)) function, I don’t know how to vectorize it in Julia other than using map(). I tried that and it didn’t give any significant benefit, which is why I didn’t include it here.
BTW, are you still getting 2 ms with 4096 samples? I keep getting about 100 ms, which is a lot more, on my Intel Xeon machine:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Stepping: 2
CPU MHz: 1237.593
CPU max MHz: 3300.0000
CPU min MHz: 1200.0000
BogoMIPS: 4808.79
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K

So then the conclusion is…? Is the code OK as it is? @DNF, you say log/exp doesn’t seem to be fast: is that a particular feature of Julia?
I mean, I use Julia because I do lots of number crunching which I want/need to be fast (I come from Fortran), so maybe these operations need to be improved? I do not know…

You can get nearly a 15% speedup without using any packages by feeding the function Log_1_plus_Exp as the first argument to sum; this also cuts the total allocated memory in half. Using xt = transpose(x) is just for clarity — remember that transpose is lazy. BTW, VML.jl isn’t supported on Windows yet.

function Log_Unnorm_Probs_B(x, W, Nh)
    Log_1_plus_Exp(x) = x > 50 ? x : log(1 + exp(x))
    b  = view(W, :, 1)
    ww = view(W, :, 2:Nh+1)
    xt = transpose(x)
    xt*b .+ sum(Log_1_plus_Exp, xt*ww, dims=2)
end
NNv = 300
NNh = 600
Nsamples = 4096
WW  = randn(NNv+1, NNh+1)
xxx = rand([0.0, 1.0], NNv+1, Nsamples)
xxx[1, :] .= 1.0

using BenchmarkTools
@btime Log_Unnorm_Probs_B(xxx, WW, NNh)
# 71.201 ms (16 allocations: 18.84 MiB)
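Since threads came up earlier: if part of the NumPy gap is multithreaded math, a threaded variant of the reduction might close some of it. A sketch, assuming Julia was started with multiple threads (e.g. `julia -t 8`); the function name is mine:

```julia
using Base.Threads

# Threaded row-wise softplus reduction. Each task owns a disjoint range of
# rows, so no locking is needed; the inner loop is a plain serial reduction.
function rowwise_softplus_sum_mt(A::AbstractMatrix)
    out = zeros(eltype(A), size(A, 1))
    @threads for i in axes(A, 1)
        s = zero(eltype(A))
        @inbounds for j in axes(A, 2)
            x = A[i, j]
            s += x > 50.0 ? x : log(1.0 + exp(x))
        end
        out[i] = s
    end
    return out
end
```

Note the trade-off versus the serial loop above: iterating rows in the outer loop gives each thread independent work, but the inner loop then strides across columns, so memory access is less friendly; whether it wins depends on thread count and matrix shape.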