Is this function well optimized for speed?

Ferran_Mazzanti · November 13, 2019, 8:37am

Hi folks,
in my process of building up a neural networks code which I can compare to my colleague’s Python code, I have a simple function that I’m calling over and over again

function Log_Unnorm_Probs_B(x, W, Nh)
    @inline Log_1_plus_Exp(x) = x > 50.0 ? x : log(1.0 + exp(x));
    b     = view(W,:,1)
    ww    = view(W,:,2:Nh+1);
    x'*b .+ sum(Log_1_plus_Exp.(x'*ww),dims=2)
end;

a direct run of it yields, for instance

NNv       = 300;
NNh       = 600;
WW        = randn(NNv+1,NNh+1);
xxx       = rand([0.0,1.0],NNv+1,Nsamples)
xxx[1,:] .= 1.0;

@btime Log_Unnorm_Probs_B(xxx,WW,NNh)

  111.307 ms (18 allocations: 37.59 MiB)

4096×1 Array{Float64,2}:
 3098.4794174438807
 3063.7232498380554
 3180.0858336955025
 3100.0579926682003
 2882.5404152783744
...

Now to my eyes that seems to be a bit too much computing time… and is vastly outperformed (a factor of 20 or so) by my colleagues Python+Numpy version. Not to speak about the corresponding CuArrays CUDA version, which is more than 100 times faster. Now I know CUDA is parallel and Numpy can also exploit multithreading etc… so I’m not for the fight about languages, I just want to know if my code is significantly fast in a Julia context, or can be drastically improved for speed performance

Best regards and thanks,

Ferran

tamasgal · November 13, 2019, 9:41am

What is the value of Nsamples? I set it to 1_000 just for testing and got something around 2 ms:

julia> @btime Log_Unnorm_Probs_B(xxx, WW, NNh)
  2.332 ms (15 allocations: 940.58 KiB)

EDIT: I realised I had a typo and Nsamples was 1000 in my test run.

Amin_Yahyaabadi · November 13, 2019, 9:44am

4096 based on the answer

tamasgal · November 13, 2019, 9:46am

Yes, I should have looked at the output array size

Amin_Yahyaabadi · November 13, 2019, 10:02am

Based on my profiling code, I see that most of the time is spent in broadcasting, and the calculation of Log_1_plus_Exp.(x'*ww). Also, I don’t find any vectorization in the LLVM code.

I think, you should use a vectorized version (or use SIMD) instead of broadcasting here. Maybe because of that if else thing, compiler cannot generate efficient code.

Let’s see what others think

Edit: for some reason I looked at the wrong LLVM. There is some SIMD going on

tamasgal · November 13, 2019, 10:03am

I am a bit puzzled why the numpy code is so much faster. The number of allocations is fine, so most of the time is spent on actual number crunching. The question is then: is it related to threads, as you already mentioned or do you do “extra/unneeded” calculations in your Julia code, compared to the numpy implementation.

Could you maybe show the numpy code?

DNF · November 13, 2019, 10:10am

Based on what I can see, most of the time is spent inside the log(1+exp(x)) code. Even when removing the branch, there doesn’t seem to be a very large speedup.

I lost 15% runtime by just cleaning up the code, and writing a loop, most of the rest is log and exp which do not appear to be fast. Using a vectorization library should help. For example, AppleAccelerate is roughly 5 times as fast for the log/exp part.

Still not as fast as numpy, though.

Amin_Yahyaabadi · November 13, 2019, 10:21am

I recommend MKL.jl which is not limited to any OS (but I am not sure if it builds Julia for VM functions)
https://github.com/JuliaComputing/MKL.jl

Part of MKL that have these functions is Vector Math Library
https://github.com/Crown421/VML.jl

Edit: Testing on MKL.jl I didn’t gain any speed up, but VML gives 4.4X speed up!

Ferran_Mazzanti · November 13, 2019, 11:42am

Yeah sorry for the Nsamples thing… I always forget things
Regaring the numpy code, I’ll ask my colleague, so far I only have the results of the same kind of calculations.
Regarding the log(1+exp(x)) function, I do not know how to vectorize it in Julia, other than using map(). I tried that and it didn’t give any significant benefit, so that’s why I didn’t include it here.
BTW are you still getting 2ms with 4096 samples? I kee getting about 100ms which is a lot more, on my Xeon intel computer

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Stepping:              2
CPU MHz:               1237.593
CPU max MHz:           3300,0000
CPU min MHz:           1200,0000
BogoMIPS:              4808.79
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K

So then the conclussion is…? Is the code ok as it is? @DNF you say log/exp dosen’t seem to be fast: is that then a particular feature of Julia?
I mean, I use Julia because I do lots of number crunching which I want/need to be fast (I come from fortran), so maybe these operations have to be improved? I do not know…

Thanks for your help,

Ferran.

Amin_Yahyaabadi · November 13, 2019, 12:21pm

Trying VML.jl I get 4.4X speed up compared to broadcasting:

julia>@btime log(1.0 .+ exp(rand(4096)))
 20.200 μs (8 allocations: 128.31 KiB)

julia>@btime log.(1.0 .+ exp.(rand(4096)))
  88.499 μs (4 allocations: 64.16 KiB)

This library is probably what your Python distribution uses by default

Seif_Shebl · November 13, 2019, 10:08pm

You can achieve near 15% speedup without using any packages if you feed the function Log_1_plus_Exp(x) as a first argument to sum, it also cuts the total allocated memory in half. Using xt = transpose(x) is just for clarity, remember that transpose is lazy. BTW, VML.jl isn’t supported in Windows yet.

function Log_Unnorm_Probs_B(x, W, Nh)
    Log_1_plus_Exp(x) = x > 50 ? x : log(1+exp(x))
    b  = view(W,:,1)
    ww = view(W,:,2:Nh+1)
    xt = transpose(x)
    xt*b .+ sum(Log_1_plus_Exp, xt*ww, dims=2)
end

NNv = 300
NNh = 600
Nsamples = 4096
WW  = randn(NNv+1,NNh+1)
xxx = rand([0.0,1.0], NNv+1, Nsamples)
xxx[1,:] .= 1.0

using BenchmarkTools
@btime Log_Unnorm_Probs_B(xxx, WW, NNh)
    71.201 ms (16 allocations: 18.84 MiB)

Edit:

Thanks to @Amin_Yahyaabadi, Crown421/VML.jl now works nice in Windows.

Amin_Yahyaabadi · November 13, 2019, 10:59pm

It is supported. The link I sent is the PR that I just reviewed and tested on Windows:

Seif_Shebl · November 13, 2019, 11:21pm

Yes, thanks @Amin_Yahyaabadi, your PR worked in Windows.

Amin_Yahyaabadi · November 13, 2019, 11:29pm

You’re welcome. We should mainly thank @crown421 for putting it together.

Topic		Replies	Views
Performance of log.(x) General Usage performance , broadcast	11	2736	April 12, 2019
Fast logsumexp Performance benchmark	14	7674	June 26, 2019
Sum(log.(p * C)) is 2 to 4 times slower than in NumPy Performance python , matrices	19	1405	January 26, 2021
Comparing exp() performance on Julia versus numpy Performance	38	2053	August 8, 2022
Trivial code change causes vectorization failure Performance	3	700	February 1, 2019

Is this function well optimized for speed?

Related topics