Is this function well optimized for speed?

Hi folks,
in my process of building up a neural networks code which I can compare to my colleague’s Python code, I have a simple function that I’m calling over and over again

function Log_Unnorm_Probs_B(x, W, Nh)
    @inline Log_1_plus_Exp(x) = x > 50.0 ? x : log(1.0 + exp(x));
    b     = view(W,:,1)
    ww    = view(W,:,2:Nh+1);
    x'*b .+ sum(Log_1_plus_Exp.(x'*ww),dims=2)

a direct run of it yields, for instance

NNv       = 300;
NNh       = 600;
WW        = randn(NNv+1,NNh+1);
xxx       = rand([0.0,1.0],NNv+1,Nsamples)
xxx[1,:] .= 1.0;

@btime Log_Unnorm_Probs_B(xxx,WW,NNh)

  111.307 ms (18 allocations: 37.59 MiB)

4096×1 Array{Float64,2}:

Now to my eyes that seems to be a bit too much computing time… and is vastly outperformed (a factor of 20 or so) by my colleagues Python+Numpy version. Not to speak about the corresponding CuArrays CUDA version, which is more than 100 times faster. Now I know CUDA is parallel and Numpy can also exploit multithreading etc… so I’m not for the fight about languages, I just want to know if my code is significantly fast in a Julia context, or can be drastically improved for speed performance :slight_smile:

Best regards and thanks,


1 Like

What is the value of Nsamples? I set it to 1_000 just for testing and got something around 2 ms:

julia> @btime Log_Unnorm_Probs_B(xxx, WW, NNh)
  2.332 ms (15 allocations: 940.58 KiB)

EDIT: I realised I had a typo and Nsamples was 1000 in my test run.

4096 based on the answer

1 Like

Yes, I should have looked at the output array size :sweat_smile:

Based on my profiling code, I see that most of the time is spent in broadcasting, and the calculation of Log_1_plus_Exp.(x'*ww). Also, I don’t find any vectorization in the LLVM code.

I think, you should use a vectorized version (or use SIMD) instead of broadcasting here. Maybe because of that if else thing, compiler cannot generate efficient code.

Let’s see what others think

Edit: for some reason I looked at the wrong LLVM. There is some SIMD going on

I am a bit puzzled why the numpy code is so much faster. The number of allocations is fine, so most of the time is spent on actual number crunching. The question is then: is it related to threads, as you already mentioned or do you do “extra/unneeded” calculations in your Julia code, compared to the numpy implementation.

Could you maybe show the numpy code?

Based on what I can see, most of the time is spent inside the log(1+exp(x)) code. Even when removing the branch, there doesn’t seem to be a very large speedup.

I lost 15% runtime by just cleaning up the code, and writing a loop, most of the rest is log and exp which do not appear to be fast. Using a vectorization library should help. For example, AppleAccelerate is roughly 5 times as fast for the log/exp part.

Still not as fast as numpy, though.


I recommend MKL.jl which is not limited to any OS (but I am not sure if it builds Julia for VM functions)

Part of MKL that have these functions is Vector Math Library

Edit: Testing on MKL.jl I didn’t gain any speed up, but VML gives 4.4X speed up!

Yeah sorry for the Nsamples thing… I always forget things :frowning:
Regaring the numpy code, I’ll ask my colleague, so far I only have the results of the same kind of calculations.
Regarding the log(1+exp(x)) function, I do not know how to vectorize it in Julia, other than using map(). I tried that and it didn’t give any significant benefit, so that’s why I didn’t include it here.
BTW are you still getting 2ms with 4096 samples? I kee getting about 100ms which is a lot more, on my Xeon intel computer

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Stepping:              2
CPU MHz:               1237.593
CPU max MHz:           3300,0000
CPU min MHz:           1200,0000
BogoMIPS:              4808.79
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K

So then the conclussion is…? Is the code ok as it is? @DNF you say log/exp dosen’t seem to be fast: is that then a particular feature of Julia?
I mean, I use Julia because I do lots of number crunching which I want/need to be fast (I come from fortran), so maybe these operations have to be improved? I do not know…

Thanks for your help,


Trying VML.jl I get 4.4X speed up compared to broadcasting:

julia>@btime log(1.0 .+ exp(rand(4096)))
 20.200 μs (8 allocations: 128.31 KiB)

julia>@btime log.(1.0 .+ exp.(rand(4096)))
  88.499 μs (4 allocations: 64.16 KiB)

This library is probably what your Python distribution uses by default

You can achieve near 15% speedup without using any packages if you feed the function Log_1_plus_Exp(x) as a first argument to sum, it also cuts the total allocated memory in half. Using xt = transpose(x) is just for clarity, remember that transpose is lazy. BTW, VML.jl isn’t supported in Windows yet.

function Log_Unnorm_Probs_B(x, W, Nh)
    Log_1_plus_Exp(x) = x > 50 ? x : log(1+exp(x))
    b  = view(W,:,1)
    ww = view(W,:,2:Nh+1)
    xt = transpose(x)
    xt*b .+ sum(Log_1_plus_Exp, xt*ww, dims=2)

NNv = 300
NNh = 600
Nsamples = 4096
WW  = randn(NNv+1,NNh+1)
xxx = rand([0.0,1.0], NNv+1, Nsamples)
xxx[1,:] .= 1.0

using BenchmarkTools
@btime Log_Unnorm_Probs_B(xxx, WW, NNh)
    71.201 ms (16 allocations: 18.84 MiB)


Thanks to @Amin_Yahyaabadi, Crown421/VML.jl now works nice in Windows.


It is supported. The link I sent is the PR that I just reviewed and tested on Windows:


Yes, thanks @Amin_Yahyaabadi, your PR worked in Windows.

You’re welcome. We should mainly thank @crown421 for putting it together.