# Why are calls to this function slow, compared to Python+Numpy?

#1

I have a simple question. I have a function b, below, which I make multiple calls to. But I am finding that it is about 3x slower than a NumPy implementation in Python. Any idea how to speed this up?

function b(a, k)
return ((k^(a+1.0)-(k-1.0)^(a+1.0))/(a+1.0))^(1.0/a);
end


Testing function:

function test(s)
g = zeros(s);
for i =1:s
g = b(a,i+1)^a;
end
end

s = 500;
@time test(s)


Result:

0.000358 seconds (5.50 k allocations: 90.156 KiB)

I would expect an execution time of about 0.0001 secs. Any help will be appreciated.

#2

There are a couple of issues:

1. You define g = zeros(s), so it’s an Array{Float64, 1}, but then you set g = b(a,i+1)^a, so g becomes a scalar (presumably a Float64, depending on the type of a). You probably want to initialize g to zero(s), not zeros(s).

2. a is not defined in your code. You are probably using it as a global variable; make sure to at least make it a const, but for better performance it’s a good idea to pass it as a function argument.

With these changes I get the performance you expect:

julia> using BenchmarkTools

julia> b(a, k) = ((k^(a+1.0)-(k-1.0)^(a+1.0))/(a+1.0))^(1.0/a)
b (generic function with 1 method)

julia> function test(s, a)
g = zero(s)
for i =1:s
g = b(a,i+1)^a
end
return g
end
test (generic function with 1 method)

julia> s = 500
500

julia> @btime test($s, 3.0)
115.908 μs (0 allocations: 0 bytes)
1.2537550024999987e8

#3

Thanks for the answer @giordano. Sorry, but there were some bugs in the code I presented. Here is what I am trying to do. Is there a way to make the computation of b go faster?

function b(a, k)
return ((k^(a+1.0)-(k-1.0)^(a+1.0))/(a+1.0))^(1.0/a);
end

function test(s,a)
g = zeros(s);
for i =1:s
g[i] = b(a,i+1)^a;
end
end

s = 500;
@btime test(s,0.3)

183.833 μs (1 allocation: 4.06 KiB)

#4

You can avoid useless calculations, like raising to 1/a and then to a. For real and positive quantities it should be safe to remove both of them. On my machine this reduces running time to a third, but this isn’t limited to Julia. The initial value of g isn’t relevant, so you can initialize it to an uninitialized vector with Vector{Float64}(s) rather than zeros(s). On my machine this shaves one microsecond off of ~115. I can’t see any other evident change you can apply to make your code run faster. Anyway, for the future, always keep in mind the tips provided at https://docs.julialang.org/en/stable/manual/performance-tips.html

#5

The following shaved off half the time for me:

function test(s,a)
g = zeros(s);
Threads.@threads for i =1:s
g[i] = b(a,i+1)^a;
end
end

@btime test($s,0.3)
61.320 μs (2 allocations: 4.11 KiB)

129.483 μs (1 allocation: 4.06 KiB)


with Julia started with 4 threads.
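For readers who want to reproduce this, a sketch of how to launch Julia with a given thread count (assuming a Unix shell and `julia` on the PATH; in the Julia versions of this era the count is set via the JULIA_NUM_THREADS environment variable):

```shell
# Start Julia with 4 threads and print the thread count it actually sees
JULIA_NUM_THREADS=4 julia -e 'println(Threads.nthreads())'
```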

#6

But this is like cheating

How many physical CPUs do you have? I have 4, and running Julia with 4 threads the running time decreases by a factor of ~4. With more than 4 threads performance gets worse because of hyper-threading; maybe this is your case.

#7

Okay, thanks for the answers, guys. However, I cannot understand why the code below (with the line filling G just set to some random float number) runs in 50% of the time of the version that fills G using the b function. Is this computation so slow?

function gen_volterra_for(a,s,n,rand_1,rand_2)
G = zeros(s+1);
Z_1 = zeros(s);
Z_2 = zeros(s+1);
for i =1:s
#G[i+1] = b(a,i+1)/(n^a);
G[i+1] = 34.74;
Z_1[i] = get_sigma_Z1(n)*rand_1[i]
Z_2[i+1] = get_sigma_Z2(a,n)*(get_rho(a)*rand_1[i]+sqrt(1-get_rho(a)^2.0)*rand_2[i]);
end
return (direct_conv(Z_1,G)[1:s+1]+Z_2);
end



and

function gen_volterra_for(a,s,n,rand_1,rand_2)
G = zeros(s+1);
Z_1 = zeros(s);
Z_2 = zeros(s+1);
for i =1:s
G[i+1] = b(a,i+1)/(n^a);
#G[i+1] = 34.74;
Z_1[i] = get_sigma_Z1(n)*rand_1[i]
Z_2[i+1] = get_sigma_Z2(a,n)*(get_rho(a)*rand_1[i]+sqrt(1-get_rho(a)^2.0)*rand_2[i]);
end
return (direct_conv(Z_1,G)[1:s+1]+Z_2);
end



Setup:

a = -0.43;
rho = -0.9;
eta = 1.9;
S_0 = 1.0
xi = 0.235^2;
n = 312;
s = round(Int,n*1.0);
dt = 1.0/n;
mean = 0;

rand_nums = randn(3*n)
rand_1 = rand_nums[1:n]
rand_2 = rand_nums[n+1:2*n]

@btime volterra = gen_volterra_for(a,s,n,rand_1,rand_2)


Various utility functions


function b(a, k)
return (k^(a+1.0)-(k-1.0)^(a+1.0))/(a+1.0);
end

function power_ker(a, x)
return x^a;
end

function get_sigma_Z1(n)
return sqrt(1.0/n);
end

function get_sigma_Z2(a,n)
return sqrt(1.0/((2.0*a+1.0)*n^(2.0*a+1.0)))
end

function get_rho(a)
return sqrt(2.0*a+1.0)/(a+1.0)
end

function direct_conv{T}(a::Array{T}, b::Array{T})
m = length(a)
n = length(b)
c = zeros(T,m+n-1)
@inbounds @simd for j=1:m
@inbounds @simd for k=1:n
c[j+k-1] += a[j]*b[k]
end
end
return c
end
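A note for readers on Julia ≥ 1.0: the `direct_conv{T}(...)` parametric-method syntax above was removed in favor of `where` clauses. A minimal modern equivalent with the same behavior, plus a small sanity check of the full linear convolution it computes:

```julia
# direct_conv rewritten with modern `where` syntax; logic unchanged:
# c is the full linear convolution of a and b, of length m + n - 1.
function direct_conv(a::Vector{T}, b::Vector{T}) where {T}
    m, n = length(a), length(b)
    c = zeros(T, m + n - 1)
    @inbounds for j in 1:m, k in 1:n
        c[j+k-1] += a[j] * b[k]
    end
    return c
end

direct_conv([1.0, 2.0], [3.0, 4.0])  # → [3.0, 10.0, 8.0]
```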



#8

The power function (e.g. k^(a+1.0)) is currently not optimized very well, and dominates your code. Eventually, LLVM should automatically invoke a vector math library for this, but for now you might try something like VML.jl
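Since the loop in gen_volterra_for evaluates b(a, k) at consecutive k, one way to cut the pow cost without any external library is to reuse (k-1)^(a+1) from the previous iteration, halving the number of ^ calls. A minimal sketch of that idea; the name b_loop and the `Vector{Float64}(undef, s)` form (modern Julia syntax) are my own, not from the thread:

```julia
# Fill g[i] = b(a, i+1) = ((i+1)^(a+1) - i^(a+1)) / (a+1) for i = 1:s,
# computing only one power per iteration by caching the previous one.
function b_loop(a, s)
    g = Vector{Float64}(undef, s)
    ap1 = a + 1.0
    prev = 1.0           # (k-1)^(a+1) for the first iteration, where k = 2
    for i in 1:s
        cur = (i + 1.0)^ap1
        g[i] = (cur - prev) / ap1
        prev = cur       # reuse as (k-1)^(a+1) next time around
    end
    return g
end
```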

#9

Is it all Julia code? I need this to work with the ForwardDiff package.
Edit: Never mind, it’s Intel MKL, so Fortran/C I guess.

#10
julia> @benchmark volterra = gen_volterra_for(a,s,n,rand_1,rand_2)
BenchmarkTools.Trial:
memory estimate:  139.97 KiB
allocs estimate:  7807
--------------
minimum time:     777.090 μs (0.00% GC)
median time:      794.260 μs (0.00% GC)
mean time:        811.997 μs (0.92% GC)
maximum time:     2.395 ms (59.16% GC)
--------------
samples:          6140
evals/sample:     1
julia> function gen_volterra_for(a,s,n,rand_1,rand_2)
G = zeros(s+1);
Z_1 = zeros(s);
Z_2 = zeros(s+1);
@inbounds @simd for i =1:s
G[i+1] = b(a,i+1)/(n^a);
#G[i+1] = 34.74;
Z_1[i] = get_sigma_Z1(n)*rand_1[i]
Z_2[i+1] = get_sigma_Z2(a,n)*(get_rho(a)*rand_1[i]+sqrt(1-get_rho(a)^2.0)*rand_2[i]);
end
return (direct_conv(Z_1,G)[1:s+1]+Z_2);
end
gen_volterra_for (generic function with 1 method)

julia> @benchmark volterra = gen_volterra_for(a,s,n,rand_1,rand_2)
BenchmarkTools.Trial:
memory estimate:  18.06 KiB
allocs estimate:  6
--------------
minimum time:     245.693 μs (0.00% GC)
median time:      247.533 μs (0.00% GC)
mean time:        251.392 μs (0.48% GC)
maximum time:     2.119 ms (83.37% GC)
--------------
samples:          10000
evals/sample:     1


For reference, @inbounds contributed just under a microsecond. @simd deserves the credit.

#11

I do not get anything near the ~4x speedup you are experiencing. What could be the reason for this?

#12

If you never rebuilt your system image and are just using a generic installation, then you may not have the right compiler optimizations enabled.

#13

This computes the same thing as the original post about 10x faster on my machine, without “cheating” with threads:

function test(s,a)
g = Vector{Float64}(s)
ap1= a+1
ap1inv = inv(ap1)
@inbounds for i =1:s
k = i+1.0
@fastmath g[i] = (k^ap1-(k-1)^ap1)*ap1inv
end
end


#14

FastMath is a prickly beast…
“Original” is the corrected original (still doing zeros() and [ ^(1/a) ]^a), “fm” is your version with FastMath, and “no_fm” is your version minus the FastMath macro.

julia> @benchmark original($s,0.3)
BenchmarkTools.Trial:
memory estimate:  4.06 KiB
allocs estimate:  1
--------------
minimum time:     144.635 μs (0.00% GC)
median time:      145.724 μs (0.00% GC)
mean time:        147.749 μs (0.00% GC)
maximum time:     205.575 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     1

julia> @benchmark fm($s,0.3)
BenchmarkTools.Trial:
memory estimate:  4.06 KiB
allocs estimate:  1
--------------
minimum time:     134.556 μs (0.00% GC)
median time:      134.811 μs (0.00% GC)
mean time:        139.382 μs (0.00% GC)
maximum time:     175.917 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     1

julia> @benchmark no_fm($s,0.3)
BenchmarkTools.Trial:
memory estimate:  4.06 KiB
allocs estimate:  1
--------------
minimum time:     70.826 μs (0.00% GC)
median time:      71.886 μs (0.00% GC)
mean time:        72.413 μs (0.00% GC)
maximum time:     104.729 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     1

julia> @benchmark original($s,3.0)
BenchmarkTools.Trial:
memory estimate:  4.06 KiB
allocs estimate:  1
--------------
minimum time:     152.398 μs (0.00% GC)
median time:      179.032 μs (0.00% GC)
mean time:        178.279 μs (0.00% GC)
maximum time:     321.960 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     1

julia> @benchmark fm($s,3.0)
BenchmarkTools.Trial:
memory estimate:  4.06 KiB
allocs estimate:  1
--------------
minimum time:     9.448 μs (0.00% GC)
median time:      9.947 μs (0.00% GC)
mean time:        10.187 μs (0.00% GC)
maximum time:     31.462 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     1

julia> @benchmark no_fm($s,3.0)
BenchmarkTools.Trial:
memory estimate:  4.06 KiB
allocs estimate:  1
--------------
minimum time:     73.765 μs (0.00% GC)
median time:      74.019 μs (0.00% GC)
mean time:        75.585 μs (0.00% GC)
maximum time:     110.715 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     1


I’ve experimented with FastMath before and hadn’t been able to find a case where it actually made things faster, so I was excited to see one where it did.
However, while it is blisteringly fast for a = 3.0, it takes roughly twice as long when a = 0.3, or when a is only approximately 3.0…

julia> @benchmark fm($s, 3.00000000001)
BenchmarkTools.Trial:
memory estimate:  4.06 KiB
allocs estimate:  1
--------------
minimum time:     134.603 μs (0.00% GC)
median time:      134.761 μs (0.00% GC)
mean time:        136.377 μs (0.00% GC)
maximum time:     174.652 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     1


Gross. I would stay away from FastMath, without good references on conditions where it will actually consistently perform well. =/

EDIT: For this function, FastMath is fast if a = -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, or any integer >= -1.
Otherwise it is nearly twice as slow as without FastMath (on my machine).

#15

Okay, good to know. I will have α ∈ (-0.5, 0), so from what I understand FastMath will not give any speedup.