SIMD: Need some help to speed up sampling a code vector

This is part of my work to implement a GNSS receiver in real time (see GNSSReceiver.jl).

One crucial part is the code generation that functions as a matched filter.

The code is binary and can be modeled as

code = Int32.(rand((-1, 1), 1023)) # GPS L1 has a length of 1023; GPS L5 has a length of 10230

It could in theory also be stored in a more compact form like a BitArray or BitIntegers, but I wasn’t able to increase performance with those formats.

The code needs to be sampled with the current code frequency and sampling frequency.

I came up with two implementations that have roughly the same performance:

function generate_code1!(sampled_code, phases, code, code_frequency, sampling_frequency)
    sampling_frequency_i32 = Base.SignedMultiplicativeInverse(floor(Int32, sampling_frequency))
    code_length = Int32(1023)
    code_frequency_i32 = floor(Int32, code_frequency)
    # First pass: integer phase for every sample, wrapped to the code length
    @inbounds for (idx, i) = enumerate(Int32(0):Int32(length(sampled_code)-1))
        phases[idx] = mod(div(code_frequency_i32 * i, sampling_frequency_i32), code_length)
    end
    # Second pass: gather the code chips at those phases
    @inbounds for i = 1:length(sampled_code)
        sampled_code[i] = code[phases[i] + 1]
    end
end

If you have never heard of SignedMultiplicativeInverse, you can read about it here. It is a nice way to implement integer division by a runtime constant.
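As a quick illustration (the divisor 5 is an arbitrary example, not from the receiver code): the inverse is precomputed once, and every subsequent div against it compiles to a multiply plus shifts instead of a hardware division.

```julia
# Precompute the multiplicative inverse of the divisor once...
inv5 = Base.SignedMultiplicativeInverse(Int32(5))

# ...then div against it avoids the slow idiv instruction entirely.
div(Int32(17), inv5)  # same result as div(17, 5), i.e. 3
```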
Here is the second implementation:

using FixedPointNumbers
function generate_code2!(sampled_code, code, code_frequency, sampling_frequency)
    FP = Fixed{Int, 53}  # fixed point with 53 fractional bits
    code_length_fp = FP(length(code))
    delta_fp = FP(code_frequency / sampling_frequency)  # phase step per sample
    phase_fp = FP(0)
    @inbounds for i = 1:length(sampled_code)
        sampled_code[i] = code[floor(Int, phase_fp) + 1]
        phase_fp += delta_fp
        # branchless wrap-around once the phase passes the code length
        phase_fp -= (phase_fp >= code_length_fp) * code_length_fp
    end
end

The second implementation might drift for very large sampled_code arrays because it is based on a delta, but accuracy isn’t my main concern.
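For a rough sense of scale (assuming the only error source is the one-time rounding of delta_fp to 53 fractional bits; the fixed-point additions themselves are exact), the drift grows linearly with the number of samples:

```julia
# delta_fp can be off from the true ratio by at most one unit in the
# last place, i.e. 2^-53 chips; each += is exact, so the worst-case
# accumulated drift after N samples is N * 2^-53 chips.
N = 5_000_000            # one second of samples at 5 MHz
max_drift = N * 2.0^-53  # worst-case phase error in chips
```

Even after a full second at 5 MHz this stays well below 1e-8 chips, which supports the point that accuracy is not the limiting factor here.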

In both cases I used an integer implementation instead of float because the conversion from float to integer floor(Int, phase) seemed to be too inefficient.
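As an aside on that conversion cost: when the phase is known to be nonnegative and in range, `unsafe_trunc` sidesteps the checked conversion that makes `floor(Int, x)` comparatively expensive (at the price of undefined results for out-of-range inputs):

```julia
phase = 3.91
# floor(Int32, x) guards against InexactError / overflow; unsafe_trunc
# skips those checks, which is safe here because the phase always lies
# in [0, 1023).
idx = unsafe_trunc(Int32, phase)  # 3
```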

Here is the code to run both functions:

using BenchmarkTools
code = Int32.(rand((-1, 1), 1023)) # GPS L1 has a length of 1023; GPS L5 has a length of 10230
num_samples = 2000
sampled_code1 = zeros(Int32, num_samples)
sampled_code2 = zeros(Int32, num_samples)
phases = zeros(Int32, num_samples)
@btime generate_code1!($sampled_code1, $phases, $code, $1023e3, $5e6)
    #1.488 μs (0 allocations: 0 bytes)
@btime generate_code2!($sampled_code2, $code, $1023e3, $5e6)
    #1.334 μs (0 allocations: 0 bytes)

Can we get any faster than this?
I wonder, because I see almost no vectorized instructions in @code_native.

Note that your example is missing a definition of the variable code.

Can you precompute and reuse the phases array in the first version? If so, I think that would be the fastest way overall. The main problem is that you essentially perform random loads from an array, and this cannot be vectorized AFAIK. Unless there is a way to compute the elements of code on the fly, I don’t think you can get faster. As a comparison, you could precompute phases and then time only the last loop of the first function. That gives you the absolute minimum time you can achieve with the approach shown here.
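The lower-bound measurement could be sketched like this (`gather_only!` is my name for it, not a function from the thread): time only the table-lookup half with phases precomputed, which bounds what any phase-computation scheme could achieve.

```julia
# Time only the gather loop, with phases precomputed outside the timing.
function gather_only!(sampled_code, phases, code)
    @inbounds for i in eachindex(sampled_code)
        sampled_code[i] = code[phases[i] + 1]
    end
    return sampled_code
end

# using BenchmarkTools
# @btime gather_only!($sampled_code1, $phases, $code)
```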

I like the fixed point number implementation better; it’s how NCOs work in hardware GPS receivers.

One option is to vectorize the fixed point index calculation and table lookup using gather instructions:

We can do a vectorized load from a lookup table if we can somehow convince Julia to emit a vgatherdps instruction. Whether this is faster than an indexing loop on a CPU is debatable:

The paper shows that it’s profitable for an i9-7900X processor with AVX-512:

(reg_standalone is the scalar indexing loop)

But a security update might make this fast vectorized lookup table code go 50% slower:

Another option is to run the code LFSRs in parallel instead of indexing into a lookup table:

I haven’t seen anyone do this for GNSS PRN generators, though.
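For reference, a scalar version of the classic two-LFSR GPS C/A generator looks like this (this is the textbook G1/G2 construction from IS-GPS-200, written by me for illustration, not code from GNSSReceiver.jl; the tap pair (2, 6) selects PRN 1). The "run the LFSRs in parallel" idea would then vectorize across the register state or across several PRNs at once.

```julia
# Generate one period (1023 chips) of a GPS C/A code as ±1 values.
# g2_taps picks the two G2 stages whose XOR forms the PRN-specific
# delayed output; (2, 6) corresponds to PRN 1.
function ca_code(g2_taps::NTuple{2,Int} = (2, 6))
    g1 = ones(Bool, 10)  # G1 register, all ones at the code epoch
    g2 = ones(Bool, 10)  # G2 register, all ones at the code epoch
    chips = Vector{Int32}(undef, 1023)
    for i in 1:1023
        out = g1[10] ⊻ (g2[g2_taps[1]] ⊻ g2[g2_taps[2]])
        chips[i] = out ? Int32(-1) : Int32(1)  # BPSK map: 1 -> -1, 0 -> +1
        new_g1 = g1[3] ⊻ g1[10]                             # G1 taps 3, 10
        new_g2 = g2[2] ⊻ g2[3] ⊻ g2[6] ⊻ g2[8] ⊻ g2[9] ⊻ g2[10]  # G2 taps
        g1 = [new_g1; g1[1:9]]  # shift right, feedback into stage 1
        g2 = [new_g2; g2[1:9]]
    end
    chips
end
```

A quick sanity check is that the first 10 chips of PRN 1 come out as octal 1440, the value listed in the interface specification.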


Note that your example is missing a definition of the variable code.

It was at the top, but I included it in the last code section now, too.

Can you precompute and reuse the phases array in the first version

No, unfortunately not, because the code_frequency will vary with every call.

The main problem is that you essentially perform random loads from an array and this cannot be vectorized AFAIK

There is some structure to this though. The phases look like this:

[1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,5,5,5,5, ...]

E.g., it might alternate between 3 and 4 repetitions of the same value, depending on the code and sampling frequency. There will always be some repetition (unless the sampling frequency and the code frequency are identical), and the values will be in ascending order.

How often will sampling_frequency vary?

This isn’t a binary code, since it can take three different values: -1, 0, and 1.

A binary code could be (depending on what you actually want here)

code = rand(Bool, 1023)
code = bitrand(1023)  # using Random
code = rand((-1, 1), 1023) # if you actually want +-1

Oops, yes, I wasn’t aware that sign can be 0.
Yes, I mean only -1s and 1s.

Thanks for the hint! I will update my code above.


With this particular pattern, the vectorization opportunity I see is

  • Read a single element from code into a simd vector of length 4 and vstore into sampled_code
  • Increment output index by 4
  • Repeat above 4 times
  • Decrement output index by 1
  • Repeat

I don’t know if you can easily generalize this for all patterns, as I don’t have a computer with me, and don’t see all possible patterns.
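A plain-Julia sketch of this run idea (the function name and the runs argument are mine; the 3s and 4s come from the pattern shown above): write each chip into its whole run of output slots at once, so the inner loop is a short fixed-length copy that the compiler can turn into wide stores.

```julia
# Hypothetical sketch: emit each code chip into its run of consecutive
# output samples instead of recomputing a phase per sample.
# runs[k] is the number of consecutive samples that take chip k.
function generate_code_runs!(sampled_code, code, runs)
    out = 1
    @inbounds for (k, r) in enumerate(runs)
        v = code[mod(k - 1, length(code)) + 1]  # wrap at the code length
        for _ in 1:r             # short fixed-length run: a candidate
            sampled_code[out] = v  # for a vector store after unrolling
            out += 1
        end
    end
    return sampled_code
end
```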


How often will sampling_frequency vary?

For a basic GNSS receiver this is fixed.
You could, however, use the estimated receiver time to improve the receiver clock and with it the sampling frequency. So it is not carved in stone :wink:


It’s also conceivable that you don’t need to explicitly calculate the phases vector, if you can just calculate the ‘skip pattern’ and choose different code paths based on that.


Yes, that’s more or less my second attempt (see generate_code2!).
It generates the phases (the skip pattern) by increasing the phase by a delta (see phase_fp += delta_fp).

Ok, but it still explicitly calculates a phase step at each index, and there’s no vectorization opportunity since each step depends on the previous one.

What I meant was, calculate the pattern, (4, 4, 4, 3) once and then choose a code path based on that.


Could you provide permalinks to the relevant snippets in the Git repo on Github?

As far as I understand, in your code rounding down and rounding towards zero should produce the same result, right? If so, it’d be better (faster) to use rounding towards zero, with rem.

FTR, the BenchmarkTools.jl manual says:

If the function you study mutates its input, it is probably a good idea to set evals=1 manually.

It seems like it might make sense to pass the sampling frequency via the type domain. E.g., with Val.

As far as I understand, in your code rounding down and rounding towards zero should produce the same result, right? If so, it’d be better (faster) to use rounding towards zero, with rem.

No, it must wrap around for negative numbers (I’m not sure if wrapping is the right word).
So phase -1 must be identical to phase 1022.
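Concretely, using the 1023-chip code length:

```julia
mod(-1, 1023)  # 1022 -- wraps around, which is what the phase needs
rem(-1, 1023)  # -1   -- rounds toward zero, wrong for negative phases
```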

There are actually some optimizations happening in the background.
If I use mod(phase, 1023) it is actually much faster than mod(phase, length(code)). So yes, I might want to use Val for the code length.
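A sketch of what the Val variant could look like (my naming, untested against the repo): the code length travels in the type domain, so mod sees a compile-time constant just like the literal 1023.

```julia
function generate_code_val!(sampled_code, phases, code, code_frequency,
                            sampling_frequency, ::Val{L}) where {L}
    sampling_frequency_i32 = Base.SignedMultiplicativeInverse(floor(Int32, sampling_frequency))
    code_frequency_i32 = floor(Int32, code_frequency)
    @inbounds for (idx, i) in enumerate(Int32(0):Int32(length(sampled_code) - 1))
        # L is a compile-time constant here, like the literal 1023
        phases[idx] = mod(div(code_frequency_i32 * i, sampling_frequency_i32), Int32(L))
    end
    @inbounds for i in eachindex(sampled_code)
        sampled_code[i] = code[phases[i] + 1]
    end
    return sampled_code
end

# generate_code_val!(sampled_code1, phases, code, 1023e3, 5e6, Val(1023))
```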

I just checked if the same works for the sampling_frequency

function generate_code3!(sampled_code, phases, code, code_frequency)
    sampling_frequency_i32 = Base.SignedMultiplicativeInverse(Int32(5_000_000))
    code_length = Int32(1023)
    code_frequency_i32 = floor(Int32, code_frequency)
    @inbounds for (idx, i) = enumerate(Int32(0):Int32(length(sampled_code)-1))
        phases[idx] = mod(div(code_frequency_i32 * i, sampling_frequency_i32), code_length)
    end
    @inbounds for i = 1:length(sampled_code)
        sampled_code[i] = code[phases[i] + 1]
    end
end

But there doesn’t seem to be an improvement:

@btime generate_code3!($sampled_code1, $phases, $code, $1023e3)
    #1.516 μs (0 allocations: 0 bytes)

Could you provide permalinks to the relevant snippets in the Git repo on Github?

Yes, here is the relevant code in the repo:


Okay, makes sense, but I’m a bit skeptical about this approach, because it really depends on the configured sampling and code frequency. The repetition could go from 1 to 16 or more (the repetition count is essentially sampling_frequency / code_frequency).

But is the pattern always N, N, N, ..., N-1? Or N, N, N, ..., N-n?

If you want vectorization, you need to look for ways of achieving it by leveraging patterns like this.

If sampling_frequency / code_frequency is an integer it will be that integer for all repetitions.
If it is something like 3.5 the repetition will alternate between 3 and 4.

But as I said it could be anything, also something like 3.9100684261974585.

Yes, but

Sorry for asking you to spell it out, I’m writing on a phone here.

The pattern can be calculated with

function skip_pattern(code_frequency = 1023e3, sampling_frequency = 4e6, num_samples = 1023)
    diff([0; accumulate((prev, x) -> prev + floor(Int, x - prev),
                        sampling_frequency / code_frequency * (1:num_samples), init = 0)])
end

This is probably a bit more complicated than it needs to be, but this is what I came up with :wink:

julia> skip_pattern(1023e3, 4e6)
1023-element Vector{Int64}:
 3
 4
 4
 ⋮
 4
 4
 4
julia> skip_pattern(1023e3, 3.5 * 1023e3)
1023-element Vector{Int64}:
 3
 4
 3
 4
 ⋮
 3
 4
 3

As I said, I cannot perform calculations on my phone, or at least it is very inconvenient. I just wanted to know

But I feel like I’m bothering you instead of helping you, so I will stop here.