Why are calls to this function slow, compared to Python+Numpy?


#1

I have a simple question. I have a function `b`, below, to which I make multiple calls. But I am experiencing that it is about 3× slower than a NumPy implementation in Python. Any idea on how to speed this up?

function b(a, k)
  return ((k^(a+1.0)-(k-1.0)^(a+1.0))/(a+1.0))^(1.0/a);
end

Testing function:

function test(s)
  g = zeros(s);
  for i =1:s
    g = b(a,i+1)^a;
  end
end

s = 500;
@time test(s)

Result:

0.000358 seconds (5.50 k allocations: 90.156 KiB)

I would expect an execution time of about 0.0001 secs. Any help will be appreciated.


#2

There are a couple of issues:

  1. You define g = zeros(s), so it’s an Array{Float64, 1}, but then you set g = b(a,i+1)^a, so here it becomes a scalar (presumably a Float64, depending on the type of a). You probably want to initialize g to zero(s), not zeros(s).

  2. a is not defined in your code. You probably use it as a global variable; at the very least make it a const, but for better performance it’s a good idea to pass it as a function argument.
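The global-variable issue is visible to the compiler: a function that reads a non-const global cannot infer its return type. A minimal sketch (the names `f_global` and `f_arg` are hypothetical, for illustration only):

```julia
b(a, k) = ((k^(a+1.0) - (k-1.0)^(a+1.0)) / (a+1.0))^(1.0/a)

a = 3.0                   # non-const global
f_global(k) = b(a, k)     # reads the global: return type cannot be inferred
f_arg(a, k) = b(a, k)     # takes a as an argument: fully inferred
```

`Base.return_types(f_arg, (Float64, Int))` reports `Float64`, while `f_global` is inferred as `Any`, which is what forces the allocations seen in the original timing.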

With these changes I get the performance you expect:

julia> using BenchmarkTools

julia> b(a, k) = ((k^(a+1.0)-(k-1.0)^(a+1.0))/(a+1.0))^(1.0/a)
b (generic function with 1 method)

julia> function test(s, a)
           g = zero(s)
           for i =1:s
               g = b(a,i+1)^a
           end
           return g
       end
test (generic function with 1 method)

julia> s = 500
500

julia> @btime test($s, 3.0)
  115.908 μs (0 allocations: 0 bytes)
1.2537550024999987e8

#3

Thanks for the answer @giordano. Sorry, but there were some bugs in the code I presented. Here is what I am trying to do. Is there a way to make the computation of b go faster?

function b(a, k)
  return ((k^(a+1.0)-(k-1.0)^(a+1.0))/(a+1.0))^(1.0/a);
end

function test(s,a)
  g = zeros(s);
  for i =1:s
    g[i] = b(a,i+1)^a;
  end
  return g;
end

s = 500;
@btime test(s,0.3)

183.833 μs (1 allocation: 4.06 KiB)


#4

You can avoid useless calculations, like raising to 1/a and then to a. For real and positive quantities it should be safe to remove both of them. On my machine this reduces the running time to a third, but this trick isn’t limited to Julia.
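Concretely, since (x^(1/a))^a == x for positive x, the two power operations cancel; a sketch of the simplification (`b_full`/`b_simpl` are hypothetical names, and the shortcut assumes the intermediate quantity is real and positive):

```julia
# Full round trip: raise to 1/a inside b, then back to a afterwards.
b_full(a, k) = ((k^(a+1.0) - (k-1.0)^(a+1.0)) / (a+1.0))^(1.0/a)

# Equivalent for positive intermediates: skip both powers.
b_simpl(a, k) = (k^(a+1.0) - (k-1.0)^(a+1.0)) / (a+1.0)
```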

The initial value of g isn’t relevant, so you can initialize it to an uninitialized vector with Vector{Float64}(s), rather than zeros(s). On my machine this shaves off one microsecond out of ~115.

I can’t see any other evident change you can apply to make your code run faster. Anyway, for the future, always keep in mind the tips provided at https://docs.julialang.org/en/stable/manual/performance-tips.html :wink:


#5

The following shaved off half the time for me:

function test(s,a)
  g = zeros(s);
  Threads.@threads for i =1:s
    g[i] = b(a,i+1)^a;
  end
end

@btime test($s,0.3)
61.320 μs (2 allocations: 4.11 KiB)

Before @threads
129.483 μs (1 allocation: 4.06 KiB)

with julia started with 4 threads.
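For reference, the thread count Julia was started with (set via the JULIA_NUM_THREADS environment variable) can be checked at runtime:

```julia
# Threads.@threads splits the loop iterations across this many threads.
println(Threads.nthreads())
```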


#6

But this is like cheating :smile:

How many physical CPUs do you have? I have 4, and when running Julia with 4 threads the running time decreases by a factor of ~4. With more than 4 threads performance gets worse, because of hyper-threading; maybe this is your case.


#7

Okay, guys, thanks for the answers. However, I cannot understand why the code below (with the line filling G just set to some fixed float number) runs in 50% of the time of the version that fills G using the b function. Is this computation so slow?

function gen_volterra_for(a,s,n,rand_1,rand_2)
  G = zeros(s+1);
  Z_1 = zeros(s);
  Z_2 = zeros(s+1);
  for i =1:s
    #G[i+1] = b(a,i+1)/(n^a);
    G[i+1] = 34.74;
    Z_1[i] = get_sigma_Z1(n)*rand_1[i]
    Z_2[i+1] = get_sigma_Z2(a,n)*(get_rho(a)*rand_1[i]+sqrt(1-get_rho(a)^2.0)*rand_2[i]);
  end
  return (direct_conv(Z_1,G)[1:s+1]+Z_2);
end

and

function gen_volterra_for(a,s,n,rand_1,rand_2)
  G = zeros(s+1);
  Z_1 = zeros(s);
  Z_2 = zeros(s+1);
  for i =1:s
    G[i+1] = b(a,i+1)/(n^a);
    #G[i+1] = 34.74;
    Z_1[i] = get_sigma_Z1(n)*rand_1[i]
    Z_2[i+1] = get_sigma_Z2(a,n)*(get_rho(a)*rand_1[i]+sqrt(1-get_rho(a)^2.0)*rand_2[i]);
  end
  return (direct_conv(Z_1,G)[1:s+1]+Z_2);
end

Setup:

a = -0.43;
rho = -0.9;
eta = 1.9;
S_0 = 1.0
xi = 0.235^2;
n = 312;
s = round(Int,n*1.0);
dt = 1.0/n;
mean = 0;


rand_nums = randn(3*n)
rand_1 = rand_nums[1:n]
rand_2 = rand_nums[n+1:2*n]

@btime volterra = gen_volterra_for(a,s,n,rand_1,rand_2)

Various utility functions


function b(a, k)
  return (k^(a+1.0)-(k-1.0)^(a+1.0))/(a+1.0);
end

function power_ker(a, x)
  return x^a;
end

function get_sigma_Z1(n)
  return sqrt(1.0/n);
end

function get_sigma_Z2(a,n)
  return sqrt(1.0/((2.0*a+1.0)*n^(2.0*a+1.0)))
end

function get_rho(a)
  return sqrt(2.0*a+1.0)/(a+1.0)
end

function direct_conv{T}(a::Array{T}, b::Array{T})
    m = length(a)
    n = length(b)
    c = zeros(T,m+n-1)
     @inbounds @simd for j=1:m
        @inbounds @simd for k=1:n
            c[j+k-1] += a[j]*b[k]
        end
    end
    return c
end
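As a side note on the loop itself, independent of how fast b is: get_sigma_Z1(n), get_sigma_Z2(a,n), get_rho(a), and n^a do not depend on i, yet are recomputed every iteration. A sketch of hoisting them out (`gen_volterra_hoisted` is a hypothetical name; the convolution step is omitted, and the helper definitions are repeated from above for self-containment):

```julia
# Helpers copied from the post, for self-containment.
b(a, k) = (k^(a+1.0) - (k-1.0)^(a+1.0)) / (a+1.0)
get_sigma_Z1(n) = sqrt(1.0/n)
get_sigma_Z2(a, n) = sqrt(1.0/((2.0*a+1.0)*n^(2.0*a+1.0)))
get_rho(a) = sqrt(2.0*a+1.0)/(a+1.0)

function gen_volterra_hoisted(a, s, n, rand_1, rand_2)
  G   = zeros(s+1)
  Z_1 = zeros(s)
  Z_2 = zeros(s+1)
  # None of these depend on i, so compute them once before the loop:
  inv_na = 1.0 / n^a
  s1  = get_sigma_Z1(n)
  s2  = get_sigma_Z2(a, n)
  rho = get_rho(a)
  c2  = s2 * sqrt(1.0 - rho^2.0)
  for i = 1:s
    G[i+1]   = b(a, i+1) * inv_na
    Z_1[i]   = s1 * rand_1[i]
    Z_2[i+1] = s2*rho*rand_1[i] + c2*rand_2[i]
  end
  return G, Z_1, Z_2
end
```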


#8

The power function (e.g. k^(a+1.0)) is currently not optimized very well and dominates your code. Eventually LLVM should automatically invoke a vector math library for this, but for now you might try something like VML.jl
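As a package-free workaround, the expensive k^(a+1) terms can also be batched into a single broadcast over the whole range, which computes each power exactly once (each k^(a+1) is reused as the (k-1)^(a+1) of the next entry); a sketch (`b_column` is a hypothetical name):

```julia
# Fill the whole G column of b values at once instead of calling b per iteration.
function b_column(a, s)
  ks = 1.0:(s+1)          # k = 1, 2, ..., s+1
  p  = ks .^ (a + 1.0)    # all k^(a+1) in one broadcast pass
  # b(a,k) = (k^(a+1) - (k-1)^(a+1)) / (a+1); each (k-1)^(a+1) is the
  # previous entry of p, and 0^(a+1) == 0 for a+1 > 0.
  return vcat(p[1], diff(p)) ./ (a + 1.0)
end
```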


#9

Is it all Julia code? I need this to work with the ForwardDiff package.
Edit: Never mind, it’s Intel MKL, so Fortran/C I guess.


#10
julia> @benchmark volterra = gen_volterra_for(a,s,n,rand_1,rand_2)
BenchmarkTools.Trial: 
  memory estimate:  139.97 KiB
  allocs estimate:  7807
  --------------
  minimum time:     777.090 μs (0.00% GC)
  median time:      794.260 μs (0.00% GC)
  mean time:        811.997 μs (0.92% GC)
  maximum time:     2.395 ms (59.16% GC)
  --------------
  samples:          6140
  evals/sample:     1
julia> function gen_volterra_for(a,s,n,rand_1,rand_2)
         G = zeros(s+1);
         Z_1 = zeros(s);
         Z_2 = zeros(s+1);
         @inbounds @simd for i =1:s
           G[i+1] = b(a,i+1)/(n^a);
           #G[i+1] = 34.74;
           Z_1[i] = get_sigma_Z1(n)*rand_1[i]
           Z_2[i+1] = get_sigma_Z2(a,n)*(get_rho(a)*rand_1[i]+sqrt(1-get_rho(a)^2.0)*rand_2[i]);
         end
         return (direct_conv(Z_1,G)[1:s+1]+Z_2);
       end
gen_volterra_for (generic function with 1 method)

julia> @benchmark volterra = gen_volterra_for(a,s,n,rand_1,rand_2)
BenchmarkTools.Trial: 
  memory estimate:  18.06 KiB
  allocs estimate:  6
  --------------
  minimum time:     245.693 μs (0.00% GC)
  median time:      247.533 μs (0.00% GC)
  mean time:        251.392 μs (0.48% GC)
  maximum time:     2.119 ms (83.37% GC)
  --------------
  samples:          10000
  evals/sample:     1

For reference, @inbounds contributed just under a microsecond. @simd deserves the credit.


#11

I do not get anything near the ~4x speedup you are experiencing. What could be the reason for this?


#12

If you never rebuilt your system image and are just using a generic installation, then you may not have the right compiler optimizations enabled.


#13

This computes the same thing as the original post about 10× faster on my machine, without “cheating” using threads:

function test(s, a)
  g = Vector{Float64}(s)
  ap1 = a + 1
  ap1inv = inv(ap1)
  @inbounds for i = 1:s
    k = i + 1.0
    @fastmath g[i] = (k^ap1 - (k-1)^ap1) * ap1inv
  end
  return g
end

#14

FastMath is a prickly beast…
“Original” is the corrected original (still doing zeros() and the (x^(1/a))^a round trip), “fm” is your version with @fastmath, and “no_fm” is your version minus the @fastmath macro.

julia> @benchmark original($s,0.3)
BenchmarkTools.Trial: 
  memory estimate:  4.06 KiB
  allocs estimate:  1
  --------------
  minimum time:     144.635 μs (0.00% GC)
  median time:      145.724 μs (0.00% GC)
  mean time:        147.749 μs (0.00% GC)
  maximum time:     205.575 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark fm($s,0.3)
BenchmarkTools.Trial: 
  memory estimate:  4.06 KiB
  allocs estimate:  1
  --------------
  minimum time:     134.556 μs (0.00% GC)
  median time:      134.811 μs (0.00% GC)
  mean time:        139.382 μs (0.00% GC)
  maximum time:     175.917 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark no_fm($s,0.3)
BenchmarkTools.Trial: 
  memory estimate:  4.06 KiB
  allocs estimate:  1
  --------------
  minimum time:     70.826 μs (0.00% GC)
  median time:      71.886 μs (0.00% GC)
  mean time:        72.413 μs (0.00% GC)
  maximum time:     104.729 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark original($s,3.0)
BenchmarkTools.Trial: 
  memory estimate:  4.06 KiB
  allocs estimate:  1
  --------------
  minimum time:     152.398 μs (0.00% GC)
  median time:      179.032 μs (0.00% GC)
  mean time:        178.279 μs (0.00% GC)
  maximum time:     321.960 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark fm($s,3.0)
BenchmarkTools.Trial: 
  memory estimate:  4.06 KiB
  allocs estimate:  1
  --------------
  minimum time:     9.448 μs (0.00% GC)
  median time:      9.947 μs (0.00% GC)
  mean time:        10.187 μs (0.00% GC)
  maximum time:     31.462 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark no_fm($s,3.0)
BenchmarkTools.Trial: 
  memory estimate:  4.06 KiB
  allocs estimate:  1
  --------------
  minimum time:     73.765 μs (0.00% GC)
  median time:      74.019 μs (0.00% GC)
  mean time:        75.585 μs (0.00% GC)
  maximum time:     110.715 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

I’ve experimented with @fastmath before and hadn’t been able to find a case where it actually made things faster, so I was excited to see one where it did.
However, while it is blisteringly fast for a = 3.0, it takes roughly twice as long when a = 0.3, or when a is only approximately 3.0…

julia> @benchmark fm($s, 3.00000000001)
BenchmarkTools.Trial: 
  memory estimate:  4.06 KiB
  allocs estimate:  1
  --------------
  minimum time:     134.603 μs (0.00% GC)
  median time:      134.761 μs (0.00% GC)
  mean time:        136.377 μs (0.00% GC)
  maximum time:     174.652 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Gross. I would stay away from @fastmath without good references on the conditions where it will actually consistently perform well. =/

EDIT: For this function, @fastmath is fast if a is -2.0 or an integer >= -1 (e.g. -1.0, 0.0, 1.0, 2.0, 3.0).
Otherwise it is nearly twice as slow as without @fastmath (on my machine).


#15

Okay, good to know. I will have \alpha \in (-0.5, 0), so from what I understand @fastmath will not give any speedup.