Implementing dedispersion in Julia

I am trying to write an implementation of a basic algorithm in pulsar astronomy: dedispersion. What this essentially boils down to is adding the rows of a two-dimensional matrix together, but we shift each row by some value. The lower the row, the more the shift. Note that the shift is not circular, and it does not increase linearly. I am still new to Julia, and so I programmed it in a way that looks more like C/C++ here. Is there a better/more idiomatic/more performant way to do this? Like having a view into the 2D array that shifts as it goes down?
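For concreteness, here is a minimal sketch of the shifted sum I mean (the shifts vector is a hypothetical stand-in for the actual, non-linear dispersion delays):

# Minimal sketch of the shifted row-sum; `shifts` is a hypothetical
# stand-in for the real per-row dispersion delays.
function shifted_rowsum(A::AbstractMatrix, shifts::AbstractVector{Int})
    nrows, ncols = size(A)
    B = zeros(eltype(A), ncols)
    for i in 1:nrows
        s = shifts[i]
        for j in 1:(ncols - s)     # non-circular: shifted-out samples are dropped
            B[j] += A[i, j + s]
        end
    end
    return B
end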

I think your code looks like a good starting point. Memory access is contiguous; that's very nice.

One thing I noticed is that, while A is a matrix, you always access it using linear indices, e.g. A[(i + Δi) * nt + j]. Is there a reason for this?

To increase performance there are a couple of things you could try (all three are sketched together below):

  • Use @inbounds annotations
  • Use @simd annotation on your inner loop
  • Use @threads annotation on your outer loop
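
An untested sketch of your function with all three annotations applied:

function dedisp!(B, A, f₀, Δf, nf, nt, DM)
    Threads.@threads for i = 1:(nf-1)        # parallelize over channels
        δf = Δf / nf
        f = f₀ - δf
        Δi = Δt(f, f₀, DM)
        @simd for j = 1:nt                   # vectorize the inner loop
            @inbounds B[j] += A[(i + Δi) * nt + j]   # skip bounds checks
        end
    end
    B
end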

Also, can you post a typical benchmark, so people can better play around with your code? I am not familiar with the algorithm and have no idea what typical sizes for the arguments are.

I’d be careful with this: all threads would modify the same elements of B.


You are right; if multithreading is desired, one needs to do it differently. How exactly depends on the typical sizes of A.

Some typical benchmarks are right there in the repository, in the bench directory. Unfortunately, I could not upload the file that I used to make that benchmark, because it was almost 20 GB! But it should give you a fair idea of the kind of inputs I am trying to deal with.

As for the linear indexing for A, there is really no reason for it, except that I could not figure out how to do the same thing with Cartesian indices :sweat_smile:.

I am looking for a very simple, 10-line, copy-pasteable benchmark that gives me realistic eltypes and sizes. Like:

A = randn(100, 100) # is 100 a realistic size? Is Float64 realistic or Float32 ?
...
@btime dedisp(A, f₀, Δf, DM)

Edit: From the performance viewpoint, only the size and eltype of the data matter. The actual values are irrelevant; just put something random there.

Maybe it would be enough to keep nthreads() separate copies of B and sum them in the end?
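
An untested sketch of that idea, giving each chunk its own copy of B and a strided slice of the channels (Δt as defined in the thread):

function dedisp_percopy!(B, A, f₀, Δf, nf, nt, DM)
    nchunks = Threads.nthreads()
    # One private accumulator per chunk, so no two tasks write to the same array
    Bs = [zeros(eltype(B), nt) for _ in 1:nchunks]
    Threads.@threads for c in 1:nchunks
        Bloc = Bs[c]
        for i in c:nchunks:(nf-1)            # strided slice of channels for this chunk
            δf = Δf / nf
            f = f₀ - δf
            Δi = Δt(f, f₀, DM)
            @simd for j in 1:nt
                @inbounds Bloc[j] += A[(i + Δi) * nt + j]
            end
        end
    end
    for Bloc in Bs                           # reduce the private copies into B
        B .+= Bloc
    end
    B
end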


Yes, that is one possibility; the other is annotating the inner loop. Or maybe we need to solve the problem many times, and then it might be better to just call the single-threaded dedisp in parallel.
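
A minimal sketch of that last option (with a hypothetical collection inputs of matrices):

# One task per input matrix, each running the single-threaded dedisp;
# `inputs`, f₀, Δf and DM are hypothetical stand-ins here.
tasks = [Threads.@spawn dedisp(A, f₀, Δf, DM) for A in inputs]
results = fetch.(tasks)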

My bad :sweat_smile:! I had forgotten that the code for the benchmark wasn’t up yet. You can take a 2D array with Float32 eltype and 4096 × 128000 elements. You can reduce the second dimension if you want, to ensure that the array fits into the memory of your machine (these numbers already come to about 2 GB of data!). Then call the dedisp function as dedisp(A, 500.0, 200.0, 50.0). I will put up more details about the benchmark in the repo today :+1:t5:.

module Baseline

# Dispersion constant 𝓓 ≈ 4.149e3; with f in MHz and DM in pc cm⁻³,
# 𝓓 * DM * (1/f² − 1/f₀²) is the dispersion delay in seconds
const 𝓓 = 1.0 / 2.41e-4
# Delay of frequency f relative to the reference f₀, floored to an integer shift
Δt(f, f₀, DM) = floor(Int64, 𝓓 * DM * (1/f^2 - 1/f₀^2))

function dedisp!(B, A, f₀, Δf, nf, nt, DM)
    for i = 1:(nf-1)
        δf = Δf / nf                       # channel width
        f = f₀ - δf                        # one channel width below the reference
        Δi = Δt(f, f₀, DM)                 # integer shift for this channel
        for j = 1:nt
            B[j] += A[(i + Δi) * nt + j]   # shifted channel, read via linear indexing
        end
    end
    B
end

function dedisp(A, f₀, Δf, DM)
    nf, nt = size(A)
    B = Array{eltype(A)}(undef, nt)        # output accumulator, one value per sample
    fill!(B, 0)
    dedisp!(B, A, f₀, Δf, nf, nt, DM)
end

end

module Perf

const 𝓓 = 1.0 / 2.41e-4
Δt(f, f₀, DM) = floor(Int64, 𝓓 * DM * (1/f^2 - 1/f₀^2))

function dedisp!(B, A, f₀, Δf, nf, nt, DM)
    for i = 1:(nf-1)
        δf = Δf / nf
        f = f₀ - δf
        Δi = Δt(f, f₀, DM)
        offset = (i + Δi) * nt              # hoist the linear offset out of the inner loop
        @simd for j in 1:nt                 # vectorize the inner loop ...
            @inbounds B[j] += A[offset + j] # ... and skip bounds checks
        end
    end
    B
end

function dedisp(A, f₀, Δf, DM)
    nf, nt = size(A)
    B = Array{eltype(A)}(undef, nt)
    fill!(B, 0)
    dedisp!(B, A, f₀, Δf, nf, nt, DM)
end

end

A = randn(Float32, 16, 128000) # 16 channels instead of the suggested 4096, to keep the benchmark light
# Baseline.dedisp(A, 1f0, 1f0, 1f0)
using BenchmarkTools
using Test
#@test Baseline.dedisp(A, 500.0, 200.0, 50.0) ≈ Baseline.dedisp(A, 500.0, 200.0, 50.0)
@test Baseline.dedisp(A, 500.0, 200.0, 50.0) ≈ Perf.dedisp(A, 500.0, 200.0, 50.0)
@btime Perf.dedisp(A, 500.0, 200.0, 50.0)
@btime Baseline.dedisp(A, 500.0, 200.0, 50.0)

gives me:

  174.864 μs (2 allocations: 500.05 KiB)
  972.529 μs (2 allocations: 500.05 KiB)

Thanks for that :grin:! I got a 2x speedup with a 20 GB file, so that’s something. The modified release and benchmark are up on the repository. Really, really appreciate all the feedback.

Now we are talking numbers :slight_smile:. @simd doesn’t cut it for me, but checking with @tturbo from LoopVectorization or @batch from Polyester, like so,

module Perf
using Polyester

const 𝓓 = 1.0 / 2.41e-4
Δt(f, f₀, DM) = floor(Int64, 𝓓 * DM * (1/f^2 - 1/f₀^2))

function dedisp!(B, A, f₀, Δf, nf, nt, DM)
    for i = 1:(nf-1)
        δf = Δf / nf
        f = f₀ - δf
        Δi = Δt(f, f₀, DM)
        offset = (i + Δi) * nt
        @batch for j in 1:nt                 # Polyester threads the inner loop;
            @inbounds B[j] += A[offset + j]  # each j is written by exactly one task
        end
    end
    B
end

# dedisp wrapper unchanged from above

end

yields similar results for both:

  87.400 μs (2 allocations: 500.05 KiB)
  333.300 μs (2 allocations: 500.05 KiB)

on

julia> versioninfo()
Julia Version 1.9.0-DEV.81
Commit 8ae9dd5c5b (2022-02-24 19:13 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 12 × Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 5 on 12 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 5
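
For reference, the @tturbo variant is the same inner kernel with LoopVectorization’s threaded macro in place of @batch; an untested sketch of just that kernel (taking the data as a vector, e.g. vec(A), to keep the indexing simple):

using LoopVectorization

# Untested sketch: @tturbo both vectorizes and threads the loop itself,
# so no explicit @inbounds/@simd annotations are needed.
function accum!(B, Av::AbstractVector, offset, nt)
    @tturbo for j in 1:nt
        B[j] += Av[offset + j]
    end
    B
end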

I just checked the same thing here; SIMD is used regardless of the annotation.
I can get another 2x speedup by using threads (my machine has 32 hyperthreads):


function dedisp!(B, A, f₀, Δf, nf, nt, DM)
    # Precompute the linear offset of every channel once
    offsets = map(1:nf-1) do i
        δf = Δf / nf
        f = f₀ - δf
        Δi = Δt(f, f₀, DM)
        (i + Δi) * nt
    end
    # Chunk the sample axis: each thread owns a disjoint slice of B,
    # so there is no race on the accumulator
    chunksize = 4000
    Threads.@threads for jstart in 1:chunksize:nt
        jstop = min(jstart + chunksize - 1, nt)
        jrange = jstart:jstop
        for offset in offsets
            @simd for j in jrange
                @inbounds B[j] += A[offset + j]
            end
        end
    end
    B
end

But really, the problem seems bottlenecked by memory access. To confirm that memory access is the issue, I passed

# A matrix that fabricates its elements on demand, so “reading” it costs
# no memory traffic at all.
struct FakeMatrix{T} <: AbstractMatrix{T}
    size::Tuple{Int,Int}
end
Base.size(o::FakeMatrix) = o.size
Base.IndexStyle(::Type{<:FakeMatrix}) = IndexLinear()   # linear getindex is canonical
function Base.getindex(o::FakeMatrix{T}, i::Int) where {T}
    T(i)
end
A = FakeMatrix{Float32}((512, 128000))

and get an 11x speedup from threads (16x would be ideal, though; maybe load balancing needs more care).
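
A usage sketch of that check, assuming the dedisp wrapper and the threaded dedisp! from above are in scope:

using BenchmarkTools

# All “reads” of the FakeMatrix are computed on the fly, so whatever time
# remains here is compute and threading overhead, not memory traffic.
@btime dedisp($A, 500.0, 200.0, 50.0)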