How to optimize computation with vectorized list operations and large arrays?

I want to do this type of computation:

ot=collect(range(start=-3,length=3600*4,stop=3));
st=rand(1000000);
sgf=zeros(ComplexF64, length(ot));
@time for nn in 1:length(ot)
    sgf[nn]=sum(1 ./ (ot[nn]+0.001*im.-st));
end

The output is 368.634083 seconds, which is dramatically slower than the MATLAB counterpart:

ot=linspace(-3,3,3600*4);
st=rand([1,1000000]);
tic
for nn=1:length(ot)
    sgf(nn)=sum(1./(ot(nn)+0.001i-st));
end
toc

MATLAB takes only 58 seconds on my laptop.
Then I realized that Julia spends most of its time on the calculation inside the summation:

@elapsed 1 ./ ((ot[1]+0.001*im).-st)
0.0234201

and when using a generator with collect:

@elapsed collect(ComplexF64,1.0/((ot[1]+0.001*im)-st[i]) for i in 1:length(st))
0.2392825

It is even slower.
So what causes Julia to compute this vectorized list operation at such a low speed? I think there should be some optimization methods.

Edit:
By substituting inv for 1/c, the summation over the complex list becomes much faster:

@elapsed @inbounds @fastmath sum(inv.(ot[1]+0.001*im.-st))
0.0039946

It is even faster than in MATLAB. However, the full for loop is still slow:

@elapsed for nn in 1:length(ot)
    @inbounds @fastmath sgf[nn]=sum(inv.(ot[nn]+0.001*im.-st));
end
81.828844

I guess there are still some aspects that could be optimized.

You are calculating with global variables. They are inherently type-unstable.

Instead, use a let block so that ot, st, and sgf are all local to the let block, then time that.
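
For example, a minimal sketch of the same timing wrapped in a let block:

let
    ot = collect(range(start=-3, length=3600*4, stop=3))
    st = rand(1_000_000)
    sgf = zeros(ComplexF64, length(ot))
    @time for nn in 1:length(ot)
        sgf[nn] = sum(1 ./ (ot[nn] + 0.001*im .- st))
    end
end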

I don’t think it’s that - reducing the problem size by a factor of 100 so the benchmarks don’t take all day:

julia> using BenchmarkTools

julia> ot = range(start=-3,length=3600*4,stop=3);

julia> st = rand(10000);

julia> sgf = zeros(ComplexF64, length(ot));

julia> function f1(ot, st, sgf)
           for nn in 1:length(ot)
               sgf[nn]=sum(1 ./ (ot[nn]+0.001*im.-st));
           end
           sgf
       end
f1 (generic function with 1 method)

julia> function f2(ot, st, sgf)
           sgf .= [sum(1 / (o + 0.001 * im - stᵢ) for stᵢ ∈ st) for o ∈ ot];
       end
f2 (generic function with 1 method)

julia> @btime f1($ot, $st, $sgf);
  3.460 s (28800 allocations: 2.15 GiB)

julia> @btime f2($ot, $st, $sgf);
  2.814 s (2 allocations: 225.05 KiB)

so something is really slow - I never work with complex numbers, so maybe it's a type instability somewhere (ot and st aren't Complex)?

The biggest problem is probably that sum(1 ./ (ot[nn] .+ 0.001*im .- st)) allocates a big array at every step. If you replace that with a generator it’s much faster:

begin
  ot=range(start=-3,length=3600*4,stop=3)  # lazy range
  st=rand(10^4)
  sgf=zeros(ComplexF64, length(ot))
  # function will be compiled:
  function work!(sgf::AbstractVector, st::AbstractVector, ot::AbstractVector)
    length(sgf) == length(ot) || error("bad lengths")
    for nn in eachindex(ot)
      # sgf[nn]=sum(1 ./ (ot[nn] .+ 0.001*im .- st))  # allocates 2.151 GiB as before
      sgf[nn] = sum(1 / (ot[nn] + 0.001*im - s) for s in st)
    end
  end
  @time work!(sgf, st, ot);  # 1.2 seconds, 7x faster
end

begin
  ot=collect(range(start=-3,length=3600*4,stop=3));
  st=rand(10^4)  # was 1_000_000 in the original
  sgf=zeros(ComplexF64, length(ot));
  @time for nn in 1:length(ot)
      sgf[nn]=sum(1 ./ (ot[nn] + 0.001*im .- st));
  end
end;  # original code, 10^4 not 10^6, takes 8.8 seconds

Edit: One much faster way is this:

using Tullio
function work_t!(sgf::AbstractVector, st::AbstractVector, ot::AbstractVector)
  @tullio sgf[nn] = 1 / (ot[nn] + 0.001*im - st[k])
end
@time work_t!(sgf, st, ot); # 0.21 seconds, first run
@time work_t!(sgf, st, ot); # 0.04 seconds, second run

This is multi-threaded, which accounts for a factor of 4 here (4-core machine). And it applies @fastmath, which might be cheating.


Exactly. ot and st are both real Float64. Only the final results are complex numbers.

On my laptop there is a 20x speed difference between

sgf[nn] = sum(1 / (ot[nn] + 0.001*im - s) for s in st)

and

sgf[nn] = sum(1 / (ot[nn] - s) for s in st)

For me, just adding @fastmath to that line gave a 20x boost. :wink:
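
Presumably that means the line becomes something like:

sgf[nn] = @fastmath sum(1 / (ot[nn] + 0.001*im - s) for s in st)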


Not sure it should be that big, but the complex calculations would be expected to be considerably slower: 1/r is a single CPU division for r a Float64, while 1/c for a complex c is a branchy software routine with several multiplications, divisions, and scaling checks to avoid overflow.
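
As a quick illustration of the per-operation gap, one could benchmark the scalar operations directly (a sketch; the exact numbers depend on the machine):

using BenchmarkTools
r = 1.5
z = 1.5 + 0.001im
@btime inv($r)  # a single hardware division
@btime inv($z)  # a branchy software routine with several multiplies and divides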

@fastmath does indeed fix it, but we can do better by writing out the inner loop as well:

function bar!(sgf, ot, st)
    @fastmath for i in eachindex(sgf, ot)
        val = zero(eltype(sgf))
        z = ot[i] + 0.001im
        for j in eachindex(st)
            val += inv(z - st[j])
        end
        sgf[i] = val
    end
    return sgf
end

Benchmark with 1_000_000 samples:

using BenchmarkTools

ot = range(-3, 3, 3600*4)  # don't use collect!!
st = rand(1_000_000)
sgf = zeros(ComplexF64, length(ot))

julia> @btime bar!($sgf, $ot, $st);
  9.360 s (0 allocations: 0 bytes)

If I add @fastmath Threads.@threads for i in eachindex(sgf, ot) with 12 threads (6 cores), I get:

julia> @btime bar!($sgf, $ot, $st);
  2.010 s (73 allocations: 8.39 KiB)

I suspect that Matlab uses threads as well.
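
For concreteness, here is a sketch of that threaded variant (the name bar_mt! is mine; @fastmath is applied to the inner loop):

function bar_mt!(sgf, ot, st)
    Threads.@threads for i in eachindex(sgf, ot)
        val = zero(eltype(sgf))
        z = ot[i] + 0.001im
        @fastmath for j in eachindex(st)
            val += inv(z - st[j])
        end
        sgf[i] = val
    end
    return sgf
end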


In principle 1/z is just equal to conj(z) / abs2(z), which is pretty fast. But the current Julia code is much more complex, with lots of intermediate calculations. The 30x slowdown should be the combination of a somewhat slower operation plus the prevention of SIMD.

@inbounds @fastmath fixes this.
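
A quick check of that identity for a well-scaled value (where abs2 can neither overflow nor underflow):

z = 1.5 + 0.001im
inv(z) ≈ conj(z) / abs2(z)  # true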


Is this a safe operation in this context?

I think the function inv makes a great contribution; it is much faster than 1.0/.

In this particular case, conj(x)/abs2(x) will be much faster than inv(x), as inv has to work even when abs2 would overflow (which will not happen on this test set). Applying this substitution, along with an fma and an invert-and-multiply, matches the fastmath version.

For reference, these are the versions I wrote without resorting to manual loops or multithreading.

testfun_1(ot,st) = map(o->sum(s->inv(complex(o-s,1e-3)),st),ot) # as written, mostly
testfun_2(ot,st) = map(o->sum(s->complex(o-s,-1e-3)/((o-s)^2+1e-3^2),st),ot) # manual unsafe inv
testfun_3(ot,st) = map(o->sum(s->complex(o-s,-1e-3)*inv(fma(o-s,o-s,1e-3^2)),st),ot) # manual implementation of fastmath version
testfun_4(ot,st) = map(o->sum(s->@fastmath(inv(complex(o-s,1e-3))),st),ot) # fastmath

ot = range(start=-3,length=3600*4,stop=3);
st = rand(10^5); # NOTE: was 10^6 in the original problem but I didn't want to wait for `testfun_1`

@time testfun_1(ot,st); # 26.35s
@time testfun_2(ot,st); # 1.40s
@time testfun_3(ot,st); # 0.68s
@time testfun_4(ot,st); # 0.68s

Take all my times and multiply them by 10 because I only used a length=10^5 st.

All this said, I wouldn't expect Julia to totally smash MATLAB at this problem, since the MATLAB version doesn't contain a hot loop (besides the vectorization, which it handles well). As we've seen, the dominant factor here is the implementation of complex inversion/division. In particular, it seems that Julia's inv(::Complex) would benefit from a better fastpath. However, I wouldn't advocate going as far as the invert-and-multiply. I think the current version does do invert-and-multiply (at least for /), and I've been meaning to make a PR to change it, as it currently results in dubious outcomes like

julia> complex(1e11)/complex(1e11)
0.9999999999999999 + 0.0im

I think the better version of inv is probably

function inv(w::ComplexF64)
    c, d = reim(w)
    absc, absd = abs(c), abs(d)
    cd = ifelse(absc>absd, absc, absd)
    if sqrt(floatmin(Float64)/2) <= cd <= sqrt(floatmax(Float64)/2)
        return conj(w) / abs2(w)
    end
    (isinf(c) | isinf(d)) && return complex(copysign(0.0, c), flipsign(-0.0, d))
    
    ϵ  = eps(Float64)
    bs = 2/(ϵ*ϵ)

    # scaling
    s = 1.0
    if cd >= floatmax(Float64)/2
        c *= 0.5; d *= 0.5; s = 0.5 # scale down c, d
    elseif cd <= 2floatmin(Float64)/ϵ
        c *= bs;  d *= bs;  s = bs  # scale up c, d
    end

    # inversion operations (robust_cinv is the existing helper in Base's complex.jl)
    if absd <= absc
        p, q = robust_cinv(c, d)
    else
        q, p = robust_cinv(-d, -c)
    end
    return ComplexF64(p*s, q*s)
end

A quick benchmark shows that this is roughly 2x faster for normal sized numbers.
This is now a PR: faster `inv` for normal sized `ComplexF64` by oscardssmith · Pull Request #47255 · JuliaLang/julia · GitHub.


I really appreciate your contribution.
I just found an interesting phenomenon: your version (please let me call it inv2, to distinguish it from inv) is 2x faster in a loop but slower in vectorized list computing. Here is the code:

function inv2(w::ComplexF64)
    c, d = reim(w)
    absc, absd = abs(c), abs(d)
    cd = ifelse(absc>absd, absc, absd)
    if sqrt(floatmin(Float64)/2) <= cd <= sqrt(floatmax(Float64)/2)
        return conj(w) / abs2(w)
    end
    (isinf(c) | isinf(d)) && return complex(copysign(0.0, c), flipsign(-0.0, d))
    
    ϵ  = eps(Float64)
    bs = 2/(ϵ*ϵ)

    # scaling
    s = 1.0
    if cd >= floatmax(Float64)/2
        c *= 0.5; d *= 0.5; s = 0.5 # scale down c, d
    elseif cd <= 2floatmin(Float64)/ϵ
        c *= bs;  d *= bs;  s = bs  # scale up c, d
    end

    # inversion operations (qualified as Base.robust_cinv so this standalone copy finds Base's helper)
    if absd <= absc
        p, q = Base.robust_cinv(c, d)
    else
        q, p = Base.robust_cinv(-d, -c)
    end
    return ComplexF64(p*s, q*s)
end
@elapsed @inbounds @fastmath sum(inv.((ot[1]+0.001*im).-st))
0.003922
@elapsed @inbounds @fastmath sum(inv2.((ot[1]+0.001*im).-st))
0.0168729
@elapsed @fastmath sum(inv(ot[1] + 0.001*im - s) for s in st)
0.213455
@elapsed @fastmath sum(inv2(ot[1] + 0.001*im - s) for s in st)
0.1648984

I think there must be some tricky points in vectorized function mapping.

Yeah. If you look at the PR, I've made it vectorize better.


Thanks so much for your help. Tullio is miraculously efficient, about 10x faster. But I found that the @tullio version allocates less memory (381.42 k vs 1.15 M for the for loop) and loses about 3 significant figures of accuracy. What is the mechanism of @tullio? Is it possible to rewrite it as a loop?
Edit: I may have put that badly. What I mean is: how can I use LoopVectorization to achieve a comparable speed for this problem?
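
For what it's worth, since LoopVectorization works on plain floats rather than Complex values, one way to try it is to split the real and imaginary parts by hand. A sketch (work_lv! is my name, untested here):

using LoopVectorization

function work_lv!(sgf::AbstractVector{ComplexF64}, st::AbstractVector, ot::AbstractVector)
    η = 0.001  # the small imaginary shift
    for nn in eachindex(ot, sgf)
        sre = 0.0  # running real part
        sim = 0.0  # running imaginary part
        @turbo for k in eachindex(st)
            d = ot[nn] - st[k]   # real part of the denominator
            r2 = d*d + η*η       # abs2 of the denominator
            sre += d / r2        # real part of 1/(d + η*im)
            sim -= η / r2        # imaginary part of 1/(d + η*im)
        end
        sgf[nn] = complex(sre, sim)
    end
    return sgf
end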