How to optimise and be faster than Java?

In general, yes, the type conversion is probably not going to hurt anything, because of the genericity of the other variables. But in this case, where everything else is already known to be a Float and a Float is guaranteed to come out (there is a division, and everything coming in is already a Float), why not just write the literal and save the conversion? If you want to be absolutely rigorous about that literal, one(eltype(x)) would be the best option, in my opinion.

My impression is that you don’t really save the conversion: when it comes down to it, it’s optimized away, or at least it should be. Certainly, the performance difference is not measurable.

The gain is more generic code. Very often, people assume too much, and add too many type constraints. I always want to encourage people to move away from that mindset.

And, it leads to nicer, prettier, more readable code.
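As a toy sketch of the idea (the function name here is made up for illustration, not from the original code), writing literals generically keeps a function usable for any float type, with no conversions forced on the caller:

```julia
# Generic: works for Float64, Float32, etc., because the literal
# is written as one(eltype(x)) rather than a hard-coded 1.0.
relative_change(x, y) = x / y - one(eltype(x))

relative_change(3.0, 2.0)      # Float64 in, Float64 out
relative_change(3.0f0, 2.0f0)  # Float32 in, Float32 out, no promotion
```

Had the literal been written as `1.0`, the Float32 call would silently promote the result to Float64.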

Since your function is rather fast, the profiler needs many runs to gather statistics. So it’s better to use something like

using Profile

Profile.clear()
@profile (for i = 1:100_000; calcprojsimdi(pa, N); end)
Profile.print()

I’ve tried it with the first version of your code, and it looks like most of the time is spent in the pow function, so the most time-consuming functions are csi_MonthlyInterpolatedSpot and the like.

On my machine, I get the following results:

function csi_MonthlyInterpolatedSpot1(s::Proj, N::Int32)
    s.MonthlyInterpolatedSpot[1] = zero(Float64)
    for T = 2:N
        s.MonthlyInterpolatedSpot[T] = (1 / s.MonthlyInterpolatedZCB[T]) ^ (12.0 / (T-1)) - 1
    end
end

using BenchmarkTools

@btime csi_MonthlyInterpolatedSpot1($pa, $N)
# 12.611 μs (0 allocations: 0 bytes)

function csi_MonthlyInterpolatedSpot2(s::Proj, N::Int32)
    s.MonthlyInterpolatedSpot[1] = zero(Float64)
    for T = 2:N
        s.MonthlyInterpolatedSpot[T] = (1 / s.MonthlyInterpolatedZCB[T])
    end
end

@btime csi_MonthlyInterpolatedSpot2($pa, $N)
# 1.083 μs (0 allocations: 0 bytes)

So it would be interesting to compare the speed of pow in Java and Julia, to check whether it is true that the Java version is faster.
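For the Julia side, a minimal micro-benchmark of `^` alone might look like this (the array size and variable names are made up for illustration; the Java counterpart would time `Math.pow` over the same inputs):

```julia
using BenchmarkTools

x = rand(125) .+ 0.5   # bases, kept away from zero
p = rand(125)          # exponents

# time just the pow calls, with nothing else in the loop body
@btime broadcast(^, $x, $p);
```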


I’d try adding @inbounds to both, and getting rid of the inversion in the first (by making the exponent negative), to see if those help.

If pow is the limiting factor, then you can look at using VML or Vectorize.jl to get a faster pow function. But even @inbounds or a broadcasted calculation is a bit faster than what’s shown. Toy example:

function pow_loop(a, b, N)
    a[1] = zero(eltype(a))
    for T = 2:N
        a[T] = (1/b[T])^(12/(T-1)) - 1
    end
    return a
end

function pow_loop_inv_inb(a, b, N)
    a[1] = zero(eltype(a))
    @inbounds for T = 2:N
        a[T] = (b[T])^(-12/(T-1)) - 1
    end
    return a
end

function pow_loop_vector(a, b, N)
    a[1] = zero(eltype(a))
    @views a[2:N] .= b[2:N].^(-12 ./ ((2:N).-1)) .- 1
    return a
end

using Vectorize # this will only work with Intel's VML library
function pow_loop_vml(a, b, N)
    # the first exponent is -12/0 = -Inf, but a[1] is overwritten below
    Vectorize.pow!(a, b, (-12 ./ ((1:N) .- 1)))
    a .-= 1
    a[1] = zero(eltype(a))
    return a
end

Results:

using BenchmarkTools
a = rand(125); b = rand(125);

@btime pow_loop($a, $b, 125);
#  4.403 μs (0 allocations: 0 bytes)

@btime pow_loop_inv_inb($a, $b, 125);
#  2.508 μs (0 allocations: 0 bytes)

@btime pow_loop_vector($a, $b, 125);
#  2.703 μs (2 allocations: 96 bytes)

@btime pow_loop_vml($a, $b, 125);
#  1.826 μs (1 allocation: 1.06 KiB)