Why are calls to this function slow, compared to Python+Numpy?


#1

I have a simple question. I have a function `b`, below, to which I make multiple calls. But I am experiencing that it is about 3× slower than a NumPy implementation in Python. Any idea on how to speed this up?

function b(a, k)
  return ((k^(a+1.0)-(k-1.0)^(a+1.0))/(a+1.0))^(1.0/a);
end

Testing function:

function test(s)
  g = zeros(s);
  for i =1:s
    g = b(a,i+1)^a;
  end
end

s = 500;
@time test(s)

Result:

0.000358 seconds (5.50 k allocations: 90.156 KiB)

I would expect an execution time of about 0.0001 secs. Any help will be appreciated.


#2

There are a couple of issues:

  1. You define g = zeros(s), so it’s an Array{Float64, 1}, but then you set g = b(a,i+1)^a, so here it becomes a scalar (presumably a Float64, depending on the type of a). You probably want to initialize g to zero(s), not zeros(s).

  2. a is not defined in your code. You probably use it as a global variable; at the very least make it a const, but for better performance it’s a good idea to pass it as a function argument.
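The global-variable issue is visible to the compiler: a function that reads a non-const global cannot infer its return type. A minimal sketch (the names `f_global` and `f_arg` are hypothetical, for illustration only):

```julia
b(a, k) = ((k^(a+1.0) - (k-1.0)^(a+1.0)) / (a+1.0))^(1.0/a)

a = 3.0                   # non-const global
f_global(k) = b(a, k)     # reads the global: return type cannot be inferred
f_arg(a, k) = b(a, k)     # takes a as an argument: fully inferred
```

`Base.return_types(f_arg, (Float64, Int))` reports `Float64`, while `f_global` is inferred as `Any`, which is what forces the allocations seen in the original timing.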

With these changes I get the performance you expect:

julia> using BenchmarkTools

julia> b(a, k) = ((k^(a+1.0)-(k-1.0)^(a+1.0))/(a+1.0))^(1.0/a)
b (generic function with 1 method)

julia> function test(s, a)
           g = zero(s)
           for i =1:s
               g = b(a,i+1)^a
           end
           return g
       end
test (generic function with 1 method)

julia> s = 500
500

julia> @btime test($s, 3.0)
  115.908 μs (0 allocations: 0 bytes)
1.2537550024999987e8

#3

Thanks for the answer @giordano. Sorry, but there were some bugs in the code I presented. Here is what I am trying to do. Is there a way to make the computation of b go faster?

function b(a, k)
  return ((k^(a+1.0)-(k-1.0)^(a+1.0))/(a+1.0))^(1.0/a);
end

function test(s,a)
  g = zeros(s);
  for i =1:s
    g[i] = b(a,i+1)^a;
  end
  return g;
end

s = 500;
@btime test(s,0.3)

183.833 μs (1 allocation: 4.06 KiB)


#4

You can avoid useless calculations, like raising to 1/a and then to a. For real and positive quantities it should be safe to remove both of them. On my machine this reduces the running time to a third, but this trick isn’t limited to Julia.
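Concretely, since (x^(1/a))^a == x for positive x, the two power operations cancel; a sketch of the simplification (`b_full`/`b_simpl` are hypothetical names, and the shortcut assumes the intermediate quantity is real and positive):

```julia
# Full round trip: raise to 1/a inside b, then back to a afterwards.
b_full(a, k) = ((k^(a+1.0) - (k-1.0)^(a+1.0)) / (a+1.0))^(1.0/a)

# Equivalent for positive intermediates: skip both powers.
b_simpl(a, k) = (k^(a+1.0) - (k-1.0)^(a+1.0)) / (a+1.0)
```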

The initial value of g isn’t relevant, so you can initialize it to an uninitialized vector with Vector{Float64}(s), rather than zeros(s). On my machine this shaves off one microsecond out of ~115.

I can’t see any other evident change you can apply to make your code run faster. Anyway, for the future, always keep in mind the tips provided at https://docs.julialang.org/en/stable/manual/performance-tips.html :wink:


#5

The following shaved off half the time for me:

function test(s,a)
  g = zeros(s);
  Threads.@threads for i =1:s
    g[i] = b(a,i+1)^a;
  end
end

@btime test($s,0.3)
61.320 μs (2 allocations: 4.11 KiB)

Before @threads
129.483 μs (1 allocation: 4.06 KiB)

with julia started with 4 threads.
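For reference, the thread count Julia was started with (set via the JULIA_NUM_THREADS environment variable) can be checked at runtime:

```julia
# Threads.@threads splits the loop iterations across this many threads.
println(Threads.nthreads())
```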


#6

But this is like cheating :smile:

How many physical CPUs do you have? I have 4, and when running Julia with 4 threads the running time decreases by a factor of ~4. With more than 4 threads performance gets worse, because of hyper-threading; maybe this is your case.


#7

Okay, guys, thanks for the answers. However, I cannot understand why the code below (with the line filling G just set to some fixed float number) runs in 50% of the time of the version that fills G using the b function. Is this computation so slow?

function gen_volterra_for(a,s,n,rand_1,rand_2)
  G = zeros(s+1);
  Z_1 = zeros(s);
  Z_2 = zeros(s+1);
  for i =1:s
    #G[i+1] = b(a,i+1)/(n^a);
    G[i+1] = 34.74;
    Z_1[i] = get_sigma_Z1(n)*rand_1[i]
    Z_2[i+1] = get_sigma_Z2(a,n)*(get_rho(a)*rand_1[i]+sqrt(1-get_rho(a)^2.0)*rand_2[i]);
  end
  return (direct_conv(Z_1,G)[1:s+1]+Z_2);
end

and

function gen_volterra_for(a,s,n,rand_1,rand_2)
  G = zeros(s+1);
  Z_1 = zeros(s);
  Z_2 = zeros(s+1);
  for i =1:s
    G[i+1] = b(a,i+1)/(n^a);
    #G[i+1] = 34.74;
    Z_1[i] = get_sigma_Z1(n)*rand_1[i]
    Z_2[i+1] = get_sigma_Z2(a,n)*(get_rho(a)*rand_1[i]+sqrt(1-get_rho(a)^2.0)*rand_2[i]);
  end
  return (direct_conv(Z_1,G)[1:s+1]+Z_2);
end

Setup:

a = -0.43;
rho = -0.9;
eta = 1.9;
S_0 = 1.0
xi = 0.235^2;
n = 312;
s = round(Int,n*1.0);
dt = 1.0/n;
mean = 0;


rand_nums = randn(3*n)
rand_1 = rand_nums[1:n]
rand_2 = rand_nums[n+1:2*n]

@btime volterra = gen_volterra_for(a,s,n,rand_1,rand_2)

Various utility functions


function b(a, k)
  return (k^(a+1.0)-(k-1.0)^(a+1.0))/(a+1.0);
end

function power_ker(a, x)
  return x^a;
end

function get_sigma_Z1(n)
  return sqrt(1.0/n);
end

function get_sigma_Z2(a,n)
  return sqrt(1.0/((2.0*a+1.0)*n^(2.0*a+1.0)))
end

function get_rho(a)
  return sqrt(2.0*a+1.0)/(a+1.0)
end

function direct_conv{T}(a::Array{T}, b::Array{T})
    m = length(a)
    n = length(b)
    c = zeros(T,m+n-1)
     @inbounds @simd for j=1:m
        @inbounds @simd for k=1:n
            c[j+k-1] += a[j]*b[k]
        end
    end
    return c
end
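As a side note on the loop itself, independent of how fast b is: get_sigma_Z1(n), get_sigma_Z2(a,n), get_rho(a), and n^a do not depend on i, yet are recomputed every iteration. A sketch of hoisting them out (`gen_volterra_hoisted` is a hypothetical name; the convolution step is omitted, and the helper definitions are repeated from above for self-containment):

```julia
# Helpers copied from the post, for self-containment.
b(a, k) = (k^(a+1.0) - (k-1.0)^(a+1.0)) / (a+1.0)
get_sigma_Z1(n) = sqrt(1.0/n)
get_sigma_Z2(a, n) = sqrt(1.0/((2.0*a+1.0)*n^(2.0*a+1.0)))
get_rho(a) = sqrt(2.0*a+1.0)/(a+1.0)

function gen_volterra_hoisted(a, s, n, rand_1, rand_2)
  G   = zeros(s+1)
  Z_1 = zeros(s)
  Z_2 = zeros(s+1)
  # None of these depend on i, so compute them once before the loop:
  inv_na = 1.0 / n^a
  s1  = get_sigma_Z1(n)
  s2  = get_sigma_Z2(a, n)
  rho = get_rho(a)
  c2  = s2 * sqrt(1.0 - rho^2.0)
  for i = 1:s
    G[i+1]   = b(a, i+1) * inv_na
    Z_1[i]   = s1 * rand_1[i]
    Z_2[i+1] = s2*rho*rand_1[i] + c2*rand_2[i]
  end
  return G, Z_1, Z_2
end
```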


#8

The power function (e.g. k^(a+1.0)) is currently not optimized very well and dominates your code. Eventually LLVM should automatically invoke a vector math library for this, but for now you might try something like VML.jl
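As a package-free workaround, the expensive k^(a+1) terms can also be batched into a single broadcast over the whole range, which computes each power exactly once (each k^(a+1) is reused as the (k-1)^(a+1) of the next entry); a sketch (`b_column` is a hypothetical name):

```julia
# Fill the whole G column of b values at once instead of calling b per iteration.
function b_column(a, s)
  ks = 1.0:(s+1)          # k = 1, 2, ..., s+1
  p  = ks .^ (a + 1.0)    # all k^(a+1) in one broadcast pass
  # b(a,k) = (k^(a+1) - (k-1)^(a+1)) / (a+1); each (k-1)^(a+1) is the
  # previous entry of p, and 0^(a+1) == 0 for a+1 > 0.
  return vcat(p[1], diff(p)) ./ (a + 1.0)
end
```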


#9

Is it all Julia code? I need this to work with the ForwardDiff package.
Edit: Never mind, it’s Intel MKL, so Fortran/C I guess.


#10
julia> @benchmark volterra = gen_volterra_for(a,s,n,rand_1,rand_2)
BenchmarkTools.Trial: 
  memory estimate:  139.97 KiB
  allocs estimate:  7807
  --------------
  minimum time:     777.090 μs (0.00% GC)
  median time:      794.260 μs (0.00% GC)
  mean time:        811.997 μs (0.92% GC)
  maximum time:     2.395 ms (59.16% GC)
  --------------
  samples:          6140
  evals/sample:     1
julia> function gen_volterra_for(a,s,n,rand_1,rand_2)
         G = zeros(s+1);
         Z_1 = zeros(s);
         Z_2 = zeros(s+1);
         @inbounds @simd for i =1:s
           G[i+1] = b(a,i+1)/(n^a);
           #G[i+1] = 34.74;
           Z_1[i] = get_sigma_Z1(n)*rand_1[i]
           Z_2[i+1] = get_sigma_Z2(a,n)*(get_rho(a)*rand_1[i]+sqrt(1-get_rho(a)^2.0)*rand_2[i]);
         end
         return (direct_conv(Z_1,G)[1:s+1]+Z_2);
       end
gen_volterra_for (generic function with 1 method)

julia> @benchmark volterra = gen_volterra_for(a,s,n,rand_1,rand_2)
BenchmarkTools.Trial: 
  memory estimate:  18.06 KiB
  allocs estimate:  6
  --------------
  minimum time:     245.693 μs (0.00% GC)
  median time:      247.533 μs (0.00% GC)
  mean time:        251.392 μs (0.48% GC)
  maximum time:     2.119 ms (83.37% GC)
  --------------
  samples:          10000
  evals/sample:     1

For reference, @inbounds contributed just under a microsecond. @simd deserves the credit.


#11

I do not get anything near the ~4x speedup you are experiencing. What could be the reason for this?


#12

If you never rebuilt your system image and are just using a generic installation, then you may not have the right compiler optimizations enabled.


#13

This computes the same thing as the original post about 10× faster on my machine, without “cheating” using threads:

function test(s, a)
  g = Vector{Float64}(s)
  ap1 = a + 1
  ap1inv = inv(ap1)
  @inbounds for i = 1:s
    k = i + 1.0
    @fastmath g[i] = (k^ap1 - (k-1)^ap1) * ap1inv
  end
  return g
end

#14

FastMath is a prickly beast…
“Original” is the corrected original (still doing zeros() and the (x^(1/a))^a round trip), “fm” is your version with @fastmath, and “no_fm” is your version minus the @fastmath macro.

julia> @benchmark original($s,0.3)
BenchmarkTools.Trial: 
  memory estimate:  4.06 KiB
  allocs estimate:  1
  --------------
  minimum time:     144.635 μs (0.00% GC)
  median time:      145.724 μs (0.00% GC)
  mean time:        147.749 μs (0.00% GC)
  maximum time:     205.575 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark fm($s,0.3)
BenchmarkTools.Trial: 
  memory estimate:  4.06 KiB
  allocs estimate:  1
  --------------
  minimum time:     134.556 μs (0.00% GC)
  median time:      134.811 μs (0.00% GC)
  mean time:        139.382 μs (0.00% GC)
  maximum time:     175.917 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark no_fm($s,0.3)
BenchmarkTools.Trial: 
  memory estimate:  4.06 KiB
  allocs estimate:  1
  --------------
  minimum time:     70.826 μs (0.00% GC)
  median time:      71.886 μs (0.00% GC)
  mean time:        72.413 μs (0.00% GC)
  maximum time:     104.729 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark original($s,3.0)
BenchmarkTools.Trial: 
  memory estimate:  4.06 KiB
  allocs estimate:  1
  --------------
  minimum time:     152.398 μs (0.00% GC)
  median time:      179.032 μs (0.00% GC)
  mean time:        178.279 μs (0.00% GC)
  maximum time:     321.960 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark fm($s,3.0)
BenchmarkTools.Trial: 
  memory estimate:  4.06 KiB
  allocs estimate:  1
  --------------
  minimum time:     9.448 μs (0.00% GC)
  median time:      9.947 μs (0.00% GC)
  mean time:        10.187 μs (0.00% GC)
  maximum time:     31.462 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark no_fm($s,3.0)
BenchmarkTools.Trial: 
  memory estimate:  4.06 KiB
  allocs estimate:  1
  --------------
  minimum time:     73.765 μs (0.00% GC)
  median time:      74.019 μs (0.00% GC)
  mean time:        75.585 μs (0.00% GC)
  maximum time:     110.715 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

I’ve experimented with @fastmath before and hadn’t been able to find a case where it actually made things faster, so I was excited to see one where it did.
However, while it is blisteringly fast for a = 3.0, it takes roughly twice as long when a = 0.3, or when a is only approximately 3.0…

julia> @benchmark fm($s, 3.00000000001)
BenchmarkTools.Trial: 
  memory estimate:  4.06 KiB
  allocs estimate:  1
  --------------
  minimum time:     134.603 μs (0.00% GC)
  median time:      134.761 μs (0.00% GC)
  mean time:        136.377 μs (0.00% GC)
  maximum time:     174.652 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Gross. I would stay away from @fastmath without good references on the conditions where it will actually consistently perform well. =/

EDIT: For this function, @fastmath is fast if a is -2.0 or an integer >= -1 (e.g. -1.0, 0.0, 1.0, 2.0, 3.0).
Otherwise it is nearly twice as slow as without @fastmath (on my machine).


#15

Okay, good to know. I will have \alpha \in (-0.5, 0), so from what I understand @fastmath will not give any speedup.