NonLinearRange

I was looking for a way to allow multi-core broadcasting to also act over non-linear ranges.
E.g. consider a complex exponential exp.(1im .* k0 .* x) with k0=1.35 being a scalar and x=0:255 a linear range.
This will be fairly slow, since the exponential needs to be evaluated at every position to produce the complex value there. The same would hold when implementing this as a comprehension, I assume.
Thinking sequentially, one could use an iterator which multiplies an internal state variable by the complex constant exp(1im*k0) to advance by one step in x. This requires only a complex-valued multiplication and is thus much faster sequentially. However, the broadcast in the example above might leverage some multi-core parallelism (does it?).
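To make the idea concrete, the sequential recurrence could look like the following minimal sketch (exp_recurrence is just an illustrative name):

function exp_recurrence(k0, x::AbstractRange)
    out = Vector{ComplexF64}(undef, length(x))
    w = exp(1im * k0 * first(x))   # one random-access evaluation for the start
    v = exp(1im * k0 * step(x))    # constant per-step multiplier
    for i in eachindex(out)
        out[i] = w
        w *= v                     # one complex multiplication per element
    end
    return out
end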

Ideally one would define a non-linear range: an iterator which also allows random access. The broadcast system should then be able to automatically distribute the work over the cores such that each core starts with a random-access operation but then runs over the elements it should process using the iterator.
I imagine that one would like to define such a non-linear range object like this:

exp_range = NonLinRange((arg)->exp(1im*arg), (s)->s*exp(1im*k0), k0*x)

with x being the UnitRange as above. The first argument specifies the random-access operation, the second specifies the non-linear iteration applied to the current state s to advance to the successive element.
Ideally such a NonLinearRange would keep its type upon multiplication or addition, which one could guarantee with appropriate boilerplate code.
Furthermore, any exp applied to a StepRangeLen object could then yield such a NonLinearRange object instead of a Vector. But this may break compatibility in some cases, since the result type would change.

Does anything like this exist in core Julia or a package? If not, would this not be a way to speed up code significantly? I would imagine that an efficient all-Julia FFT implementation would make use of something like this.

Broadcasting is not parallelized by default.

If this is a performance-critical operation, I would try not to materialize an array of values like this at all, and instead compute it as needed, i.e. just write a loop.

But it’s hard to give general advice here without knowing in what context you intend to use this.

You could certainly define a subtype of AbstractVector (because you allow random access) that implements a specialized iterator, including a specialized iterator for views (a SubArray), for loops over the array or portions thereof.

However, it would be a bit weird if a[i] gave a different number than the i-th value obtained from iteration, which would happen (due to roundoff errors) if you computed them from different algorithms (see below).
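For illustration, such a subtype could look like the following minimal sketch (NonLinRange and its fields are hypothetical names; getindex takes the accurate random-access path while iteration takes the cheap recurrence, so the mismatch just described applies):

struct NonLinRange{T,F,G,R<:AbstractRange} <: AbstractVector{T}
    f::F   # random access: element i is f(r[i])
    g::G   # state update: s -> g(s) advances by one element
    r::R   # underlying parameter range
end
NonLinRange(f, g, r) = NonLinRange{typeof(f(first(r))),typeof(f),typeof(g),typeof(r)}(f, g, r)

Base.size(nr::NonLinRange) = size(nr.r)
Base.getindex(nr::NonLinRange, i::Int) = nr.f(nr.r[i])   # accurate but expensive

function Base.iterate(nr::NonLinRange, (s, i) = (nr.f(first(nr.r)), 1))
    i > length(nr.r) && return nothing
    return s, (nr.g(s), i + 1)   # one multiplication instead of a fresh exp
end

For example, NonLinRange(cis, s -> s * cis(0.1), 0.1 * (0:255)) would then behave elementwise like exp.(1im .* 0.1 .* (0:255)), with iteration taking the cheap path.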

I would recommend against this technique for FFTs, for two reasons:

  1. It is inaccurate. Building up a sequence of N values (e^{i\phi})^n with simple repeated multiplications incurs O(\varepsilon N) root-mean-square roundoff error (Tasche & Zeuner, 2002), where \varepsilon is the machine precision. This is much, much worse than the error incurred by an FFT algorithm itself, which is typically O(\varepsilon \sqrt{\log N}). See also comments on FFT accuracy. You can easily see this for yourself if you compare cumprod(fill(cis(ϕ), n)) to cis.(ϕ .* (1:n)) for some random ϕ as a function of n (see the snippet after this list); note that the rms relative error between two arrays is norm(a-b) / norm(a).
  2. It is unnecessary: FFTs are performance-critical mainly in situations where you perform an FFT of the same size many times. In this case you can precompute a table of trigonometric values (and there are various ways to store or compress this table). (This is what FFTW.jl does when you call plan_fft.)
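A quick way to run the comparison from point 1 (a short script; the exact errors will vary with ϕ):

using LinearAlgebra: norm

ϕ = 2π * rand()
for n in (10^3, 10^4, 10^5)
    a = cumprod(fill(cis(ϕ), n))    # (e^{iϕ})^n by repeated multiplication
    b = cis.(ϕ .* (1:n))            # direct evaluation
    @show n, norm(a - b) / norm(a)  # rms relative error, growing with n
end

And the precomputation from point 2 is what planning does in FFTW.jl:

using FFTW
x = rand(ComplexF64, 1024)
p = plan_fft(x)   # precomputes twiddle factors etc. for this size
y = p * x         # reuse p for many transforms of the same size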

PS. Note that exp(im*x) can be more efficiently computed as cis(x) for x::Real.


Thanks Steven! Great to learn that broadcasting is not parallelized by default. This gives me a handle to further optimize code using LoopVectorization.jl, even though it seems to be deprecated.
Yes, you are right, the linear error growth and the unpredictable numerical results are a no-go for core Julia. But maybe it is OK as an extra type. Yes, subtyping AbstractVector for such a type may be the best choice.
Interesting to learn that FFTs do not use this trick. I guess all that pre-calculation is hidden in the plans or wisdom then. Without vectorization, I find that even exp(im*k*x) is surprisingly slow. I think @fastmath may help a bit.
Below is a test to check for speed improvements.
As you mentioned, the drift error is as big as 0.1 in this example for the all-sequential method. For the multi-threaded version it is considerably smaller.
The bad news is that LoopVectorization.jl does not seem to handle complex numbers; at least I could not get it to work. Maybe using a reinterpret and the sincos function could work (see the sketch at the end of this post). The good news is that @threads gives a 5x speedup (16 threads available).
The function fill_exp_seq_threads! below, which is relatively accurate, is actually (surprisingly) slower than bare-bone multithreading. Surprising, since fill_exp_seq! is purely sequential yet the fastest (even though single-threaded). It would be cool if the @turbo or @avx macros also worked for complex-valued broadcasts.

using LoopVectorization
using BenchmarkTools
using Base.Threads

sz = (10000,)

function fill_exp!(c)
    @threads for n in 1:size(c,1) # axes(c,1)
        c[n] = exp(im*n/10f0)
    end
    return c
end

# defaults chosen so that c[n] = exp(im*n/10f0), matching ref below
function fill_exp_seq!(c, v = exp(im/10f0), w = exp(im/10f0))
    for n in axes(c,1)
        c[n] = w
        w *= v
    end
    return c
end

function fill_exp_seq_threads!(c)
    nt =  Threads.nthreads()
    v = exp(im/10f0)
    ts = cld(size(c,1), nt)            # chunk length, rounded up so every element is covered
    @threads for t in 0:nt-1
        si = t*ts + 1                  # first index of this thread's chunk
        fi = min((t+1)*ts, size(c,1))  # last index, clamped to the array length
        w = exp(im*si/10f0)
        fill_exp_seq!((@view c[si:fi]), v, w)
    end
    return c
end


c = rand(ComplexF32, sz...)
ref = exp.(1f0*im.*(1:sz[1])./10f0); 
@btime fill_exp!($c); # 22 µs
@btime fill_exp_seq!($c); # 20 µs
@btime fill_exp_seq_threads!($c); # 28 µs
@btime $c .= exp.(im.*(1:sz[1])./10f0); # 117 µs
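For completeness, the reinterpret-plus-sincos idea mentioned above could look like the sketch below (fill_cis_reinterp! is a made-up name, using plain @simd since @turbo rejected the complex broadcast; none of the timings above used it):

function fill_cis_reinterp!(c::AbstractVector{Complex{T}}, k = one(T)/10) where {T}
    cr = reinterpret(reshape, T, c)   # view c as a 2×N real array: row 1 = re, row 2 = im
    @inbounds @simd for n in 1:length(c)
        s, co = sincos(k * n)         # c[n] = cis(k*n), matching ref above
        cr[1, n] = co
        cr[2, n] = s
    end
    return c
end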

In case you missed it further up, it’s recommended to use cis(x) instead of exp(im*x).


I think you misunderstood the comment:
exp(im*x) is faster than cis(x), but cis(x) is more accurate.

cis should be faster for the case of real x (where it just calls sincos), and that's what I find on my machine:

julia> @btime exp(im * $(Ref(3.0))[])
  17.103 ns (0 allocations: 0 bytes)
-0.9899924966004454 + 0.1411200080598672im

julia> @btime cis($(Ref(3.0))[])
  12.472 ns (0 allocations: 0 bytes)
-0.9899924966004454 + 0.1411200080598672im

(The Ref(3.0) trick is just to ensure that @btime doesn’t constant-fold away the whole calculation.)

They should have very similar accuracy. In fact, they give exactly the same results in the above example, and this should always be the case (for finite x) with the default exp(im*x) algorithm. exp(im*x) is slower than cis because it calls exp(real(im*x)) (which gives 1.0) and has a few extra branches.
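One can spot-check the claimed exact agreement directly; per the above, this should evaluate to true for any finite real inputs:

x = randn(1000)
all(cis.(x) .== exp.(im .* x))   # expected: true (bitwise-identical results)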


Sorry, I misunderstood your comment. At some previous time, I tested cis vs. exp and the latter was faster. But maybe I did something wrong or this changed in the meantime.

Great. I will migrate my code to using cis instead.

I have somewhat changed my mind and think that the proposed iteration scheme may actually be useful. I was able to fix the multi-threading issue with the iterative update (see below). By doing the iteration internally in Float64 and converting to Float32, I am still at Float32 machine precision for a length of 10k, yet the code is 4x faster than all other approaches. This may be worth it. But it sounds like a fair bit of work to make the broadcasting system understand and parallelize a generator like this.

using LoopVectorization
using BenchmarkTools
using Base.Threads
# using Plots

sz = (10000,)

function fill_exp_threads!(c)
    @inbounds @threads for n in 0:size(c,1)-1 # axes(c,1)
        c[n+1] = cis(n/10f0)
    end
    return c
end

function fill_exp_seq!(c, v = cis(1/10.0), w = complex(1.0))
    for n in axes(c,1)
        c[n] = w
        w *= v
    end
    return c
end

function fill_exp_seq_threads!(c, k=1/10.0)
    nt =  Threads.nthreads()
    v = cis(k)
    ts = cld(size(c,1), nt)            # chunk length, rounded up so every element is covered
    @sync for t in 0:nt-1              # @sync waits for all spawned tasks before returning
        si = t*ts + 1                  # first index of this task's chunk
        fi = min((t+1)*ts, size(c,1))  # last index, clamped to the array length
        w = cis((si-1)*k)              # accurate random-access start for this chunk
        Threads.@spawn fill_exp_seq!((@view c[si:fi]), v, w)
    end
    return c
end


c = zeros(ComplexF32, 10000)
ref = cis.((0:sz[1]-1)./10f0); 
@btime fill_exp_threads!($c); # 19 µs
@btime fill_exp_seq!($c); # 21.7 µs
@btime fill_exp_seq_threads!($c); # 4.5 µs
@btime $c .= exp.(im.*(0:sz[1]-1)./10f0); # 117 µs
@btime $c .= cis.((0:sz[1]-1)./10f0); # 49.4 µs (cis already includes the im factor)

@btime c = @avx cos.((0:sz[1]-1)./10f0); # Works for scalars
@btime c = @avx cis.((1:sz[1])./10f0); # FAILS
@btime $c .= @turbo cis.((1:sz[1])./10f0); # FAILS

fill_exp_threads!(c); 
@show maximum(abs.(c .- ref))
# fill_exp_seq!(c, exp(im/10f0), complex(1f0)); 
fill_exp_seq!(c); 
@show maximum(abs.(c .- ref))
# plot(abs.(c .- ref))
fill_exp_seq_threads!(c); 
@show maximum(abs.(c .- ref))
c .= exp.(im.*(0:sz[1]-1)./10f0); 
@show maximum(abs.(c .- ref))

… but the multithreading does not really pay off for sizes below 1000, and the speedup of the single-threaded version is then only about 2x.