Implementing dedispersion in Julia

I am trying to write an implementation of a basic algorithm in pulsar astronomy: dedispersion. What this essentially boils down to is adding the rows of a two-dimensional matrix together, but we shift each row by some value. The lower the row, the more the shift. Note that the shift is not circular, and it does not increase linearly. I am still new to Julia, and so I programmed it in a way that looks more like C/C++ here. Is there a better/more idiomatic/more performant way to do this? Like having a view into the 2D array that shifts as it goes down?
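For concreteness, here is a minimal sketch of the shifted sum I mean (the shifts vector is a hypothetical stand-in for the actual, non-linear dispersion delays):

# Minimal sketch of the shifted row-sum; `shifts` is a hypothetical
# stand-in for the real per-row dispersion delays.
function shifted_rowsum(A::AbstractMatrix, shifts::AbstractVector{Int})
    nrows, ncols = size(A)
    B = zeros(eltype(A), ncols)
    for i in 1:nrows
        s = shifts[i]
        for j in 1:(ncols - s)     # non-circular: shifted-out samples are dropped
            B[j] += A[i, j + s]
        end
    end
    return B
end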

I think your code looks like a good starting point. Memory access is contiguous; that's very nice.

One thing I noticed is that, while A is a matrix, you always access it using linear indices, e.g. A[(i + Δi) * nt + j]. Is there a reason for this?

To increase performance there are a couple of things you could try (all three are sketched together below):

  • Use @inbounds annotations
  • Use @simd annotation on your inner loop
  • Use @threads annotation on your outer loop
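
An untested sketch of your function with all three annotations applied:

function dedisp!(B, A, f₀, Δf, nf, nt, DM)
    Threads.@threads for i = 1:(nf-1)        # parallelize over channels
        δf = Δf / nf
        f = f₀ - δf
        Δi = Δt(f, f₀, DM)
        @simd for j = 1:nt                   # vectorize the inner loop
            @inbounds B[j] += A[(i + Δi) * nt + j]   # skip bounds checks
        end
    end
    B
end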

Also, can you post a typical benchmark, so people can better play around with your code? I am not familiar with the algorithm and have no idea what typical sizes for the arguments are.

I’d be careful with this: all threads would modify the same elements of B.


You are right; if multithreading is desired, one needs to do it differently. How exactly depends on the typical sizes of A.

Some typical benchmarks are right there in the repository, in the bench directory. Unfortunately, I could not upload the file that I used to make that benchmark, because it was almost 20 GB! But it should give you a fair idea of the kind of inputs I am trying to deal with.

As for the linear indexing for A, there is really no reason for it, except that I could not figure out how to do the same thing with Cartesian indices :sweat_smile:.

I am looking for a very simple, 10-line, copy-pasteable benchmark that gives me realistic eltypes and sizes. Like:

A = randn(100, 100) # is 100 a realistic size? Is Float64 realistic or Float32 ?
...
@btime dedisp(A, f₀, Δf, DM)

Edit: From the performance viewpoint, only the size and eltype of the data matter. The actual values are irrelevant; just put something random there.

Maybe it would be enough to keep nthreads() separate copies of B and sum them in the end?
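
An untested sketch of that idea, giving each chunk its own copy of B and a strided slice of the channels (Δt as defined in the thread):

function dedisp_percopy!(B, A, f₀, Δf, nf, nt, DM)
    nchunks = Threads.nthreads()
    # One private accumulator per chunk, so no two tasks write to the same array
    Bs = [zeros(eltype(B), nt) for _ in 1:nchunks]
    Threads.@threads for c in 1:nchunks
        Bloc = Bs[c]
        for i in c:nchunks:(nf-1)            # strided slice of channels for this chunk
            δf = Δf / nf
            f = f₀ - δf
            Δi = Δt(f, f₀, DM)
            @simd for j in 1:nt
                @inbounds Bloc[j] += A[(i + Δi) * nt + j]
            end
        end
    end
    for Bloc in Bs                           # reduce the private copies into B
        B .+= Bloc
    end
    B
end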


Yes, that is one possibility; the other is annotating the inner loop. Or maybe we need to solve the problem many times, and then it might be better to just call the single-threaded dedisp in parallel.
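
A minimal sketch of that last option (with a hypothetical collection inputs of matrices):

# One task per input matrix, each running the single-threaded dedisp;
# `inputs`, f₀, Δf and DM are hypothetical stand-ins here.
tasks = [Threads.@spawn dedisp(A, f₀, Δf, DM) for A in inputs]
results = fetch.(tasks)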

My bad :sweat_smile:! I had forgotten that the code for the benchmark wasn’t up yet. You can take a 2D array with Float32 eltype and 4096 × 128000 elements. You can reduce the second dimension if you want, to ensure that the array fits into the memory of your machine (these numbers already come to about 2 GB of data!). Then call the dedisp function as dedisp(A, 500.0, 200.0, 50.0). I will put up more details about the benchmark in the repo today :+1:t5:.

module Baseline

# Dispersion constant 𝓓 ≈ 4.149e3; with f in MHz and DM in pc cm⁻³,
# 𝓓 * DM * (1/f² − 1/f₀²) is the dispersion delay in seconds
const 𝓓 = 1.0 / 2.41e-4
# Delay of frequency f relative to the reference f₀, floored to an integer shift
Δt(f, f₀, DM) = floor(Int64, 𝓓 * DM * (1/f^2 - 1/f₀^2))

function dedisp!(B, A, f₀, Δf, nf, nt, DM)
    for i = 1:(nf-1)
        δf = Δf / nf                       # channel width
        f = f₀ - δf                        # one channel width below the reference
        Δi = Δt(f, f₀, DM)                 # integer shift for this channel
        for j = 1:nt
            B[j] += A[(i + Δi) * nt + j]   # shifted channel, read via linear indexing
        end
    end
    B
end

function dedisp(A, f₀, Δf, DM)
    nf, nt = size(A)
    B = Array{eltype(A)}(undef, nt)        # output accumulator, one value per sample
    fill!(B, 0)
    dedisp!(B, A, f₀, Δf, nf, nt, DM)
end

end

module Perf

const 𝓓 = 1.0 / 2.41e-4
Δt(f, f₀, DM) = floor(Int64, 𝓓 * DM * (1/f^2 - 1/f₀^2))

function dedisp!(B, A, f₀, Δf, nf, nt, DM)
    for i = 1:(nf-1)
        δf = Δf / nf
        f = f₀ - δf
        Δi = Δt(f, f₀, DM)
        offset = (i + Δi) * nt              # hoist the linear offset out of the inner loop
        @simd for j in 1:nt                 # vectorize the inner loop ...
            @inbounds B[j] += A[offset + j] # ... and skip bounds checks
        end
    end
    B
end

function dedisp(A, f₀, Δf, DM)
    nf, nt = size(A)
    B = Array{eltype(A)}(undef, nt)
    fill!(B, 0)
    dedisp!(B, A, f₀, Δf, nf, nt, DM)
end

end

A = randn(Float32, 16, 128000) # 16 channels instead of the suggested 4096, to keep the benchmark light
# Baseline.dedisp(A, 1f0, 1f0, 1f0)
using BenchmarkTools
using Test
#@test Baseline.dedisp(A, 500.0, 200.0, 50.0) ≈ Baseline.dedisp(A, 500.0, 200.0, 50.0)
@test Baseline.dedisp(A, 500.0, 200.0, 50.0) ≈ Perf.dedisp(A, 500.0, 200.0, 50.0)
@btime Perf.dedisp(A, 500.0, 200.0, 50.0)
@btime Baseline.dedisp(A, 500.0, 200.0, 50.0)

gives me:

  174.864 μs (2 allocations: 500.05 KiB)
  972.529 μs (2 allocations: 500.05 KiB)

Thanks for that :grin:! I got a 2x speedup with a 20 GB file, so that’s something. The modified release and benchmark are up on the repository. Really, really appreciate all the feedback.

Now we are talking numbers :slight_smile:. @simd doesn’t cut it for me, but checking with @tturbo from LoopVectorization or @batch from Polyester, like so,

module Perf
using Polyester

const 𝓓 = 1.0 / 2.41e-4
Δt(f, f₀, DM) = floor(Int64, 𝓓 * DM * (1/f^2 - 1/f₀^2))

function dedisp!(B, A, f₀, Δf, nf, nt, DM)
    for i = 1:(nf-1)
        δf = Δf / nf
        f = f₀ - δf
        Δi = Δt(f, f₀, DM)
        offset = (i + Δi) * nt
        @batch for j in 1:nt                 # Polyester threads the inner loop;
            @inbounds B[j] += A[offset + j]  # each j is written by exactly one task
        end
    end
    B
end

# dedisp wrapper unchanged from above

end

yields similar results for both:

  87.400 μs (2 allocations: 500.05 KiB)
  333.300 μs (2 allocations: 500.05 KiB)

on

julia> versioninfo()
Julia Version 1.9.0-DEV.81
Commit 8ae9dd5c5b (2022-02-24 19:13 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 12 × Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 5 on 12 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 5
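
For reference, the @tturbo variant is the same inner kernel with LoopVectorization’s threaded macro in place of @batch; an untested sketch of just that kernel (taking the data as a vector, e.g. vec(A), to keep the indexing simple):

using LoopVectorization

# Untested sketch: @tturbo both vectorizes and threads the loop itself,
# so no explicit @inbounds/@simd annotations are needed.
function accum!(B, Av::AbstractVector, offset, nt)
    @tturbo for j in 1:nt
        B[j] += Av[offset + j]
    end
    B
end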

I just checked the same thing here; SIMD is used regardless of the annotation.
I can get another 2x speedup by using threads (my machine has 32 hyperthreads):


function dedisp!(B, A, f₀, Δf, nf, nt, DM)
    # Precompute the linear offset of every channel once
    offsets = map(1:nf-1) do i
        δf = Δf / nf
        f = f₀ - δf
        Δi = Δt(f, f₀, DM)
        (i + Δi) * nt
    end
    # Chunk the sample axis: each thread owns a disjoint slice of B,
    # so there is no race on the accumulator
    chunksize = 4000
    Threads.@threads for jstart in 1:chunksize:nt
        jstop = min(jstart + chunksize - 1, nt)
        jrange = jstart:jstop
        for offset in offsets
            @simd for j in jrange
                @inbounds B[j] += A[offset + j]
            end
        end
    end
    B
end

But really, the problem seems bottlenecked by memory access. To confirm that memory access is the issue, I passed

# A matrix that fabricates its elements on demand, so “reading” it costs
# no memory traffic at all.
struct FakeMatrix{T} <: AbstractMatrix{T}
    size::Tuple{Int,Int}
end
Base.size(o::FakeMatrix) = o.size
Base.IndexStyle(::Type{<:FakeMatrix}) = IndexLinear()   # linear getindex is canonical
function Base.getindex(o::FakeMatrix{T}, i::Int) where {T}
    T(i)
end
A = FakeMatrix{Float32}((512, 128000))

and get an 11x speedup from threads (16x would be ideal, though; maybe load balancing needs more care).
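
A usage sketch of that check, assuming the dedisp wrapper and the threaded dedisp! from above are in scope:

using BenchmarkTools

# All “reads” of the FakeMatrix are computed on the fly, so whatever time
# remains here is compute and threading overhead, not memory traffic.
@btime dedisp($A, 500.0, 200.0, 50.0)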