# Is this function well optimized for speed?

Hi folks,
In the process of building up some neural network code that I can compare to my colleague’s Python code, I have a simple function that I’m calling over and over again:

``````
function Log_Unnorm_Probs_B(x, W, Nh)
    @inline Log_1_plus_Exp(x) = x > 50.0 ? x : log(1.0 + exp(x))
    b  = view(W, :, 1)
    ww = view(W, :, 2:Nh+1)
    x'*b .+ sum(Log_1_plus_Exp.(x'*ww), dims=2)
end
``````

a direct run of it yields, for instance

``````
NNv       = 300;
NNh       = 600;
WW        = randn(NNv+1,NNh+1);
xxx       = rand([0.0,1.0],NNv+1,Nsamples)
xxx[1,:] .= 1.0;

@btime Log_Unnorm_Probs_B(xxx,WW,NNh)

111.307 ms (18 allocations: 37.59 MiB)

4096×1 Array{Float64,2}:
3098.4794174438807
3063.7232498380554
3180.0858336955025
3100.0579926682003
2882.5404152783744
...
``````

Now to my eyes that seems to be a bit too much computing time… and it is vastly outperformed (by a factor of 20 or so) by my colleague’s Python+NumPy version, not to speak of the corresponding CuArrays CUDA version, which is more than 100 times faster. I know CUDA is parallel and NumPy can also exploit multithreading, so I’m not looking for a fight about languages; I just want to know whether my code is reasonably fast in a Julia context, or whether it can be drastically improved for speed.

Best regards and thanks,

Ferran


What is the value of `Nsamples`? I set it to `1_000` just for testing and got something around 2 ms:

``````
julia> @btime Log_Unnorm_Probs_B(xxx, WW, NNh)
2.332 ms (15 allocations: 940.58 KiB)
``````

EDIT: I realised I had a typo and `Nsamples` was 1000 in my test run.


Yes, I should have looked at the output array size

Based on my profiling, I see that most of the time is spent in broadcasting and in the calculation of `Log_1_plus_Exp.(x'*ww)`. Also, I don’t see any vectorization in the LLVM code.

I think you should use a vectorized version (or use SIMD) instead of broadcasting here. Maybe because of that `if`/`else`, the compiler cannot generate efficient code.

Let’s see what others think

Edit: for some reason I was looking at the wrong LLVM output. There is some SIMD going on.
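For reference, one way to check for SIMD yourself is to inspect the LLVM IR with `code_llvm` from `InteractiveUtils` (a sketch; `f` below is a stand-in for the thread’s `Log_1_plus_Exp` kernel):

```julia
using InteractiveUtils

# Stand-in for the Log_1_plus_Exp kernel from the thread
f(x) = x > 50.0 ? x : log(1.0 + exp(x))

# Print the LLVM IR for the Float64 method; vector types such as
# <2 x double> or <4 x double> in the output indicate SIMD code.
code_llvm(f, Tuple{Float64})
```

Note that the scalar method may look branchy while the broadcast over an array still vectorizes, so it is easy to look at the “wrong” LLVM, as happened above.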

I am a bit puzzled why the numpy code is so much faster. The number of allocations is fine, so most of the time is spent on actual number crunching. The question is then: is it related to threads, as you already mentioned, or do you do “extra/unneeded” calculations in your Julia code compared to the numpy implementation?

Could you maybe show the numpy code?

Based on what I can see, most of the time is spent inside the `log(1+exp(x))` code. Even when removing the branch, there doesn’t seem to be a very large speedup.
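For the record, the branch can be removed entirely with the numerically stable softplus identity `log(1 + exp(x)) == max(x, 0) + log1p(exp(-abs(x)))`, valid for all finite `x` (a sketch, not the code benchmarked above; `softplus_stable` is my own naming):

```julia
# Branch-free, overflow-safe softplus (an illustration, not the benchmarked code):
# log(1 + exp(x)) == max(x, 0) + log1p(exp(-abs(x)))
softplus_stable(x) = max(x, zero(x)) + log1p(exp(-abs(x)))

# Sanity check against the naive formula in the range where it doesn't overflow
naive(x) = log(1.0 + exp(x))
maximum(abs(softplus_stable(x) - naive(x)) for x in -30.0:0.5:30.0)  # tiny
```

Whether this is actually faster depends on how well `log1p` and `exp` vectorize; the `x > 50` branch is rarely taken anyway, so the branch itself may not be the bottleneck, which would be consistent with the observation above.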

I shaved off 15% of the runtime just by cleaning up the code and writing a loop; most of the rest is `log` and `exp`, which do not appear to be fast. Using a vectorization library should help. For example, AppleAccelerate is roughly 5 times as fast for the `log`/`exp` part.

Still not as fast as numpy, though.
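The loop rewrite mentioned above might look roughly like this (a sketch with my own naming; the actual loop code isn’t shown in the thread). The matrix products stay in BLAS, and the softplus and reduction are fused into one pass with no intermediate result matrix:

```julia
function log_unnorm_probs_loop(x, W, Nh)
    b  = view(W, :, 1)
    ww = view(W, :, 2:Nh+1)
    xt = transpose(x)
    out = xt * b    # Nsamples-element vector (the x'*b term), computed by BLAS
    A   = xt * ww   # Nsamples × Nh matrix, also BLAS
    @inbounds for j in axes(A, 2), i in axes(A, 1)
        a = A[i, j]
        out[i] += a > 50.0 ? a : log(1.0 + exp(a))
    end
    out             # a Vector, unlike the Nsamples×1 matrix of the original
end
```

Usage would be `log_unnorm_probs_loop(xxx, WW, NNh)` with the arrays defined earlier in the thread.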


I recommend MKL.jl, which is not limited to any particular OS (though I am not sure whether building Julia with MKL provides the VML functions).

The part of MKL that has these functions is the Vector Math Library (VML).

Edit: testing MKL.jl I didn’t gain any speedup, but VML gives a 4.4× speedup!

Yeah, sorry about the `Nsamples` thing… I always forget things.
Regarding the NumPy code, I’ll ask my colleague; so far I only have the results of the same kind of calculations.
Regarding the `log(1+exp(x))` function, I do not know how to vectorize it in Julia other than using `map()`. I tried that and it didn’t give any significant benefit, which is why I didn’t include it here.
BTW, are you still getting 2 ms with 4096 samples? I keep getting about 100 ms, which is a lot more, on my Intel Xeon computer:

``````
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Core(s) per socket:    8
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Stepping:              2
CPU MHz:               1237.593
CPU max MHz:           3300,0000
CPU min MHz:           1200,0000
BogoMIPS:              4808.79
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
``````
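On the `map` point above: `map(f, A)` and the broadcast `f.(A)` compute exactly the same elementwise result, so no speedup should be expected from switching between them (a minimal check, with `f` standing in for `Log_1_plus_Exp`):

```julia
# map and broadcasting both apply f elementwise to every entry of A
f(x) = x > 50.0 ? x : log(1.0 + exp(x))
A = randn(4, 3)
map(f, A) == f.(A)   # true: identical results, same amount of work
```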

So then the conclusion is…? Is the code OK as it is? @DNF, you say `log`/`exp` don’t seem to be fast: is that a particular feature of Julia?
I mean, I use Julia because I do lots of number crunching which I want/need to be fast (I come from Fortran), so maybe these operations need to be improved? I do not know…

Ferran.

Trying VML.jl I get a 4.4× speedup compared to broadcasting:

``````
julia> @btime log(1.0 .+ exp(rand(4096)))
20.200 μs (8 allocations: 128.31 KiB)

julia> @btime log.(1.0 .+ exp.(rand(4096)))
88.499 μs (4 allocations: 64.16 KiB)
``````

This library is probably what your Python distribution uses by default

You can get nearly a 15% speedup without any packages by passing the function `Log_1_plus_Exp` as the first argument to `sum`; it also cuts the total allocated memory in half. Using `xt = transpose(x)` is just for clarity; remember that `transpose` is lazy. BTW, VML.jl isn’t supported on Windows yet.

``````
function Log_Unnorm_Probs_B(x, W, Nh)
    Log_1_plus_Exp(x) = x > 50 ? x : log(1 + exp(x))
    b  = view(W, :, 1)
    ww = view(W, :, 2:Nh+1)
    xt = transpose(x)
    xt*b .+ sum(Log_1_plus_Exp, xt*ww, dims=2)
end

NNv = 300
NNh = 600
Nsamples = 4096
WW  = randn(NNv+1,NNh+1)
xxx = rand([0.0,1.0], NNv+1, Nsamples)
xxx[1,:] .= 1.0

using BenchmarkTools
@btime Log_Unnorm_Probs_B(xxx, WW, NNh)
71.201 ms (16 allocations: 18.84 MiB)
``````
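For what it’s worth, the equivalence behind that refactor is easy to check: `sum(f, A, dims=2)` gives the same result as `sum(f.(A), dims=2)` but skips the `A`-sized temporary (a minimal check, with `g` standing in for `Log_1_plus_Exp`):

```julia
g(x) = x > 50 ? x : log(1 + exp(x))
A = randn(5, 4)
# Same values; the left-hand side never materializes g.(A)
sum(g, A, dims=2) ≈ sum(g.(A), dims=2)
```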

Edit:

Thanks to @Amin_Yahyaabadi, Crown421/VML.jl now works nicely on Windows.


It is supported. The link I sent is the PR that I just reviewed and tested on Windows:
