How to optimize computation with vectorized list operations and large arrays?

I want to do this type of computation:

ot=collect(range(start=-3,length=3600*4,stop=3));
st=rand(1000000);
sgf=zeros(ComplexF64, length(ot));
@time for nn in 1:length(ot)
    sgf[nn]=sum(1 ./ (ot[nn]+0.001*im.-st));
end

The output is 368.634083 seconds, which is dramatically slower than the MATLAB counterpart:

ot=linspace(-3,3,3600*4);
st=rand([1,1000000]);
tic
for nn=1:length(ot)
    sgf(nn)=sum(1./(ot(nn)+0.001i-st));
end
toc

MATLAB takes only 58 seconds on my laptop.
Then I realized that Julia spends most of its time on the calculation inside the summation:

@elapsed 1 ./ ((ot[1]+0.001*im).-st)
0.0234201

and when using a generator with collect:

@elapsed collect(ComplexF64,1.0/((ot[1]+0.001*im)-st[i]) for i in 1:length(st))
0.2392825

It is even slower.
So what causes Julia to compute this vectorized list operation at such a low speed? I think there should be some optimization methods.

Edit:
By substituting inv for 1/c, the summation over the complex list becomes much faster:

@elapsed @inbounds @fastmath sum(inv.(ot[1]+0.001*im.-st))
0.0039946

It is even faster than in MATLAB. However, the full for loop is still slow:

@elapsed for nn in 1:length(ot)
    @inbounds @fastmath sgf[nn]=sum(inv.(ot[nn]+0.001*im.-st));
end
81.828844

I guess there are still some aspects that could be optimized.

You are calculating with global variables. They are inherently type-unstable.

Instead, use a let block so that ot, st, and sgf are all local to the let block, then time that.
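
For example, a minimal sketch of the same timing wrapped in a let block:

let
    ot = collect(range(start=-3, length=3600*4, stop=3))
    st = rand(1_000_000)
    sgf = zeros(ComplexF64, length(ot))
    @time for nn in 1:length(ot)
        sgf[nn] = sum(1 ./ (ot[nn] + 0.001*im .- st))
    end
end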

I don’t think it’s that - reducing the problem size by a factor of 100 so the benchmarks don’t take all day:

julia> using BenchmarkTools

julia> ot = range(start=-3,length=3600*4,stop=3);

julia> st = rand(10000);

julia> sgf = zeros(ComplexF64, length(ot));

julia> function f1(ot, st, sgf)
           for nn in 1:length(ot)
               sgf[nn]=sum(1 ./ (ot[nn]+0.001*im.-st));
           end
           sgf
       end
f1 (generic function with 1 method)

julia> function f2(ot, st, sgf)
           sgf .= [sum(1 / (o + 0.001 * im - stᵢ) for stᵢ ∈ st) for o ∈ ot];
       end
f2 (generic function with 1 method)

julia> @btime f1($ot, $st, $sgf);
  3.460 s (28800 allocations: 2.15 GiB)

julia> @btime f2($ot, $st, $sgf);
  2.814 s (2 allocations: 225.05 KiB)

so something is really slow - I never work with complex numbers, so maybe it's a type instability somewhere (ot and st aren't Complex)?

The biggest problem is probably that sum(1 ./ (ot[nn] .+ 0.001*im .- st)) allocates a big array at every step. If you replace that with a generator it’s much faster:

begin
  ot=range(start=-3,length=3600*4,stop=3)  # lazy range
  st=rand(10^4)
  sgf=zeros(ComplexF64, length(ot))
  # function will be compiled:
  function work!(sgf::AbstractVector, st::AbstractVector, ot::AbstractVector)
    length(sgf) == length(ot) || error("bad lengths")
    for nn in eachindex(ot)
      # sgf[nn]=sum(1 ./ (ot[nn] .+ 0.001*im .- st))  # allocates 2.151 GiB as before
      sgf[nn] = sum(1 / (ot[nn] + 0.001*im - s) for s in st)
    end
  end
  @time work!(sgf, st, ot);  # 1.2 seconds, 7x faster
end

begin
  ot=collect(range(start=-3,length=3600*4,stop=3));
  st=rand(10^4)  # was 1_000_000 in the original
  sgf=zeros(ComplexF64, length(ot));
  @time for nn in 1:length(ot)
      sgf[nn]=sum(1 ./ (ot[nn] + 0.001*im .- st));
  end
end;  # original code, 10^4 not 10^6, takes 8.8 seconds

Edit: One much faster way is this:

using Tullio
function work_t!(sgf::AbstractVector, st::AbstractVector, ot::AbstractVector)
  @tullio sgf[nn] = 1 / (ot[nn] + 0.001*im - st[k])
end
@time work_t!(sgf, st, ot); # 0.21 seconds, first run
@time work_t!(sgf, st, ot); # 0.04 seconds, second run

This is multi-threaded, which accounts for a factor of 4 here (4-core machine). And it applies @fastmath, which might be cheating.


Exactly. ot and st are both real Float64. Only the final results are complex numbers.

On my laptop there is a 20x speed difference between

sgf[nn] = sum(1 / (ot[nn] + 0.001*im - s) for s in st)

and

sgf[nn] = sum(1 / (ot[nn] - s) for s in st)

For me, just adding @fastmath to that line gave a 20x boost. :wink:
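
Presumably that means the line becomes something like:

sgf[nn] = @fastmath sum(1 / (ot[nn] + 0.001*im - s) for s in st)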


Not sure it should be that big, but the complex calculations would be expected to be considerably slower: 1/r is a single CPU division for r a Float64, while 1/c for a complex c is a branchy software routine with several multiplications, divisions, and scaling checks to avoid overflow.
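
As a quick illustration of the per-operation gap, one could benchmark the scalar operations directly (a sketch; the exact numbers depend on the machine):

using BenchmarkTools
r = 1.5
z = 1.5 + 0.001im
@btime inv($r)  # a single hardware division
@btime inv($z)  # a branchy software routine with several multiplies and divides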

@fastmath does indeed fix it, but we can do better by writing out the inner loop as well:

function bar!(sgf, ot, st)
    @fastmath for i in eachindex(sgf, ot)
        val = zero(eltype(sgf))
        z = ot[i] + 0.001im
        for j in eachindex(st)
            val += inv(z - st[j])
        end
        sgf[i] = val
    end
    return sgf
end

Benchmark with 1_000_000 samples:

using BenchmarkTools

ot = range(-3, 3, 3600*4)  # don't use collect!!
st = rand(1_000_000)
sgf = zeros(ComplexF64, length(ot))

julia> @btime bar!($sgf, $ot, $st);
  9.360 s (0 allocations: 0 bytes)

If I add @fastmath Threads.@threads for i in eachindex(sgf, ot) with 12 threads (6 cores), I get:

julia> @btime bar!($sgf, $ot, $st);
  2.010 s (73 allocations: 8.39 KiB)

I suspect that Matlab uses threads as well.
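
For concreteness, here is a sketch of that threaded variant (the name bar_mt! is mine; @fastmath is applied to the inner loop):

function bar_mt!(sgf, ot, st)
    Threads.@threads for i in eachindex(sgf, ot)
        val = zero(eltype(sgf))
        z = ot[i] + 0.001im
        @fastmath for j in eachindex(st)
            val += inv(z - st[j])
        end
        sgf[i] = val
    end
    return sgf
end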


In principle 1/z is just equal to conj(z) / abs2(z), which is pretty fast. But the current Julia code is much more complex, with lots of intermediate calculations. The 30x slowdown should be the combination of a somewhat slower operation plus the prevention of SIMD.

@inbounds @fastmath fixes this.
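
A quick check of that identity for a well-scaled value (where abs2 can neither overflow nor underflow):

z = 1.5 + 0.001im
inv(z) ≈ conj(z) / abs2(z)  # true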


Is this a safe operation in this context?

I think the function inv makes a great contribution; it is much faster than 1.0/.

In this particular case, conj(x)/abs2(x) will be much faster than inv(x), as inv has to work even when abs2 would overflow (which will not happen on this test set). Applying this substitution, along with an fma and an invert-and-multiply, matches the fastmath version.

For reference, these are the versions I wrote without resorting to manual loops or multithreading.

testfun_1(ot,st) = map(o->sum(s->inv(complex(o-s,1e-3)),st),ot) # as written, mostly
testfun_2(ot,st) = map(o->sum(s->complex(o-s,-1e-3)/((o-s)^2+1e-3^2),st),ot) # manual unsafe inv
testfun_3(ot,st) = map(o->sum(s->complex(o-s,-1e-3)*inv(fma(o-s,o-s,1e-3^2)),st),ot) # manual implementation of fastmath version
testfun_4(ot,st) = map(o->sum(s->@fastmath(inv(complex(o-s,1e-3))),st),ot) # fastmath

ot = range(start=-3,length=3600*4,stop=3);
st = rand(10^5); # NOTE: was 10^6 in the original problem but I didn't want to wait for `testfun_1`

@time testfun_1(ot,st); # 26.35s
@time testfun_2(ot,st); # 1.40s
@time testfun_3(ot,st); # 0.68s
@time testfun_4(ot,st); # 0.68s

Take all my times and multiply them by 10 because I only used a length=10^5 st.

All this said, I wouldn't expect Julia to totally smash MATLAB at this problem, since the MATLAB version doesn't contain a hot loop (besides the vectorization, which it handles well). As we've seen, the dominant factor here is the implementation of complex inversion/division. In particular, it seems that Julia's inv(::Complex) would benefit from a better fastpath. However, I wouldn't advocate going as far as the invert-and-multiply. I think the current version does do invert-and-multiply (at least for /), and I've been meaning to make a PR to change it, as it currently results in dubious outcomes like

julia> complex(1e11)/complex(1e11)
0.9999999999999999 + 0.0im

I think the better version of inv is probably

function inv(w::ComplexF64)
    c, d = reim(w)
    absc, absd = abs(c), abs(d)
    cd = ifelse(absc>absd, absc, absd)
    if sqrt(floatmin(Float64)/2) <= cd <= sqrt(floatmax(Float64)/2)
        return conj(w) / abs2(w)
    end
    (isinf(c) | isinf(d)) && return complex(copysign(0.0, c), flipsign(-0.0, d))
    
    ϵ  = eps(Float64)
    bs = 2/(ϵ*ϵ)

    # scaling
    s = 1.0
    if cd >= floatmax(Float64)/2
        c *= 0.5; d *= 0.5; s = 0.5 # scale down c, d
    elseif cd <= 2floatmin(Float64)/ϵ
        c *= bs;  d *= bs;  s = bs  # scale up c, d
    end

    # inversion operations (robust_cinv is the existing helper in Base's complex.jl)
    if absd <= absc
        p, q = robust_cinv(c, d)
    else
        q, p = robust_cinv(-d, -c)
    end
    return ComplexF64(p*s, q*s)
end

A quick benchmark shows that this is roughly 2x faster for normal sized numbers.
This is now a PR: faster `inv` for normal sized `ComplexF64` by oscardssmith · Pull Request #47255 · JuliaLang/julia · GitHub.


I really appreciate your contribution.
I just found an interesting phenomenon: your version (please let me call it inv2, to distinguish it from inv) is 2x faster in a loop but slower in vectorized list computing. Here is the code:

function inv2(w::ComplexF64)
    c, d = reim(w)
    absc, absd = abs(c), abs(d)
    cd = ifelse(absc>absd, absc, absd)
    if sqrt(floatmin(Float64)/2) <= cd <= sqrt(floatmax(Float64)/2)
        return conj(w) / abs2(w)
    end
    (isinf(c) | isinf(d)) && return complex(copysign(0.0, c), flipsign(-0.0, d))
    
    ϵ  = eps(Float64)
    bs = 2/(ϵ*ϵ)

    # scaling
    s = 1.0
    if cd >= floatmax(Float64)/2
        c *= 0.5; d *= 0.5; s = 0.5 # scale down c, d
    elseif cd <= 2floatmin(Float64)/ϵ
        c *= bs;  d *= bs;  s = bs  # scale up c, d
    end

    # inversion operations (qualified as Base.robust_cinv so this standalone copy finds Base's helper)
    if absd <= absc
        p, q = Base.robust_cinv(c, d)
    else
        q, p = Base.robust_cinv(-d, -c)
    end
    return ComplexF64(p*s, q*s)
end
@elapsed @inbounds @fastmath sum(inv.((ot[1]+0.001*im).-st))
0.003922
@elapsed @inbounds @fastmath sum(inv2.((ot[1]+0.001*im).-st))
0.0168729
@elapsed @fastmath sum(inv(ot[1] + 0.001*im - s) for s in st)
0.213455
@elapsed @fastmath sum(inv2(ot[1] + 0.001*im - s) for s in st)
0.1648984

I think there must be some tricky points in vectorized function mapping.

Yeah. If you look at the PR, I've made it vectorize better.


Thanks so much for your help. Tullio is miraculously efficient, about 10x faster. But I found that the @tullio version allocates less memory (381.42 k vs 1.15 M for the for loop) and loses about 3 significant figures of accuracy. What is the mechanism of @tullio? Is it possible to rewrite it as a loop?
Edit: I may have put that badly. What I mean is: how can I use LoopVectorization to achieve a comparable speed for this problem?
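
For what it's worth, since LoopVectorization works on plain floats rather than Complex values, one way to try it is to split the real and imaginary parts by hand. A sketch (work_lv! is my name, untested here):

using LoopVectorization

function work_lv!(sgf::AbstractVector{ComplexF64}, st::AbstractVector, ot::AbstractVector)
    η = 0.001  # the small imaginary shift
    for nn in eachindex(ot, sgf)
        sre = 0.0  # running real part
        sim = 0.0  # running imaginary part
        @turbo for k in eachindex(st)
            d = ot[nn] - st[k]   # real part of the denominator
            r2 = d*d + η*η       # abs2 of the denominator
            sre += d / r2        # real part of 1/(d + η*im)
            sim -= η / r2        # imaginary part of 1/(d + η*im)
        end
        sgf[nn] = complex(sre, sim)
    end
    return sgf
end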