SIMD: Need some help to speed up sampling a code vector

Your help is really appreciated. I’m not sure what you are asking for, though.
The pattern is not always N, N, …, N-1 or N, N, …, N-n.
I’ve provided a function to calculate the pattern above.

I cannot run your code, and the examples you print out do indeed follow the pattern I’m asking about (e.g. 4, 3, 4, 3, 4, 3, ...).

A counter example could be 4, 4, 3, 3, 4, 4, 3, 3, ...

The reason I ask is that if it follows this pattern, it should be relatively straightforward to SIMD-ify the code.
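For what it’s worth, the pattern is easy to check empirically. This is my own little helper (not from the code above), using exact integer arithmetic to avoid floating-point boundary issues; if the skip-pattern assumption holds, the hold-counts take at most two adjacent values:

```julia
# Sketch: count how many consecutive samples each code chip is held for.
# Frequencies are passed as exact integers (Hz) to avoid float rounding
# at the chip boundaries.
function repeat_counts(code_frequency::Int, sampling_frequency::Int, num_samples)
    counts = Int[]
    prev_idx = -1
    for i in 0:num_samples-1
        idx = fld(i * code_frequency, sampling_frequency)  # chip that sample i falls on
        if idx == prev_idx
            counts[end] += 1
        else
            push!(counts, 1)
            prev_idx = idx
        end
    end
    counts
end

counts = repeat_counts(1_023_000, 5_000_000, 2000)
sort(unique(counts[1:end-1]))  # → [4, 5]: only two adjacent hold lengths
```

The last count is excluded because the final chip is usually cut off mid-way by the end of the sample buffer.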

Alright, I think I’ve found something that’s faster using this skip pattern approach:

function generate_code4!(sampled_code, code, code_frequency, sampling_frequency)
    FP = Fixed{Int, 52}  # Fixed comes from FixedPointNumbers.jl
    delta_fp = FP(sampling_frequency / code_frequency)
    prev = 0
    @inbounds for i = 1:floor(Int, code_frequency / sampling_frequency * length(sampled_code))
        temp_code = code[mod(i, 1023)]  # note: mod1 would avoid a zero index when i hits a multiple of 1023
        skips = floor(Int, i * delta_fp - prev)
        for j = 1:skips
            sampled_code[prev + j] = temp_code
        end
        prev += skips
    end
end

It isn’t perfect yet: the samples are shifted by one, and it doesn’t calculate the last sample.
But it’s faster, and I can see a lot more vectorized instructions in code_native :slight_smile: .

code = Int32.(rand((-1, 1), 1023))
sampled_code4 = zeros(Int32, num_samples)
@btime generate_code4!($sampled_code4, $code, $1023e3, $5e6)
    # 1.131 μs (0 allocations: 0 bytes)

This is my first attempt using SIMD intrinsics from SIMD.jl:

using SIMD

function simd_nco!(output_vec, initial_phase, phase_step, lookup_table)
    # simple fixed-point number:
    #   int  11 bit
    #   frac 53 bit
    #   container 64 bit
    range_const = Vec{8, Int64}((0, 1, 2, 3, 4, 5, 6, 7))
    for idx in 0:fld(length(output_vec), 8)-1
        v0 = range_const + idx * 8
        v1 = initial_phase + v0 * phase_step
        v5 = Vec{8, Int32}(v1 >> 53) # extract integer part of the fixed-point number
        v2 = v5 % 1023
        @inbounds v3 = vgather(lookup_table, Vec{8, Int64}(v2 + 1))
        @inbounds vstore(v3, output_vec, idx * 8 + 1)
    end
    # TODO: this currently doesn't handle the remaining elements
    # that don't fit into a SIMD register
end


Unfortunately I need to convert back to Int64 vectors for gather indexing, because of “Support `vgather` with all Int types” (eschnett/SIMD.jl issue #98 on GitHub).

The output is identical to generate_code2!:

function generate_code2!(sampled_code, code, code_frequency, sampling_frequency)
    FP = Fixed{Int, 52}
    code_length_fp = FP(length(code))
    delta_fp = FP(code_frequency / sampling_frequency)
    phase_fp = FP(0)
    @inbounds for i = 1:length(sampled_code)
        sampled_code[i] = code[floor(Int, phase_fp) + 1]
        phase_fp += delta_fp
        # branchless wrap-around of the phase accumulator
        phase_fp -= (phase_fp >= code_length_fp) * code_length_fp
    end
end

num_samples = 2000
sampled_code = zeros(Int32, num_samples)
sampled_code2 = zeros(Int32, num_samples)

code = Int32.(rand((-1, 1), 1023))

generate_code2!(sampled_code, code, 1023e3, 5e6)

simd_nco!(sampled_code2, 0, (Fixed{Int,53}(1023e3/5e6).i), code)

sampled_code == sampled_code2

Performance is slightly better:

julia> @benchmark generate_code2!($sampled_code, $code, $1023e3, $5e6)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.986 μs …  3.621 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.999 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.012 μs ± 59.056 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▂█
  ▃██▆▂▂▂▂▂▂▃▅▃▂▂▁▁▁▂▁▁▁▁▁▁▁▁▂▂▁▁▂▂▂▂▂▂▂▂▂▁▂▂▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂ ▂
  1.99 μs        Histogram: frequency by time        2.29 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark simd_nco!($sampled_code2, $0, $(Fixed{Int,53}(1023e3/5e6).i), $code)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.169 μs …  3.769 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.176 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.181 μs ± 56.328 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▄█
  ▂██▄▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▁▁▁▂▂ ▂
  1.17 μs        Histogram: frequency by time        1.36 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

The current bottleneck is the modulo operation `v2 = v5 % 1023`. Changing it to modulo 1024 makes it much faster, because a power-of-two modulus reduces to bit masking:

julia> code = Int32.(rand((-1, 1), 1024))
1024-element Vector{Int32}:

julia> @benchmark generate_code2!($sampled_code, $code, $1023e3, $5e6)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.984 μs …  3.311 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.999 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.011 μs ± 57.494 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁▆█▇▃     ▃▅▃                                              ▂
  ██████▆▄▆▅███▆▁▁▁▁▄▁▁▁▁▁▁▁▃▁▁▁▁▅▅▆▃▃▅▅▄▄▄▄▁▁▁▁▄▅▄▅▅▅▆▅▄▅▅▅ █
  1.98 μs      Histogram: log(frequency) by time     2.29 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark simd_nco!($sampled_code2, $0, $(Fixed{Int,53}(1023e3/5e6).i), $code)
BenchmarkTools.Trial: 10000 samples with 117 evaluations.
 Range (min … max):  756.829 ns …  1.031 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     758.205 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   760.693 ns ± 10.146 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁▆█▆▁                                                        ▂
  █████▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▅▄▁▁▅██▆▃▅▃▄▃▃▅▁▅▆▇███████████▇▇██▇▆▇▆▇ █
  757 ns        Histogram: log(frequency) by time       790 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
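To make the bit-masking point concrete (my own illustration, not from the posts above): for non-negative integers, `x % 1024` is exactly `x & 1023`, a single cheap instruction per SIMD lane, whereas `x % 1023` is lowered to a multiply-and-shift division sequence.

```julia
# For non-negative x, a power-of-two modulus is a bit mask:
#   x % 2^k == x & (2^k - 1)
x = rand(0:10^9, 16)
@assert all(x .% 1024 .== x .& 1023)
# x % 1023 has no such shortcut, so LLVM emits a multiply + shift per lane.
```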

Thank you, I will try this code later today!

I think @DNF might be right here: the skip pattern is the better approach.
I have an example above your post.
I think that if you split the for loops carefully, you could omit the modulo function altogether by calculating the code up to the code length and then starting another loop for the remaining part.
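That loop split might look like this simplified sketch (my own, in plain Base Julia, ignoring the resampling step for clarity): whole code periods are written with a plain index, and a second loop handles the leftover part, so no mod appears anywhere in the hot path.

```julia
# Sketch: tile `code` into `out` without any modulo in the inner loop.
function tile_code!(out, code)
    n, L = length(out), length(code)
    full, rest = divrem(n, L)
    k = 0
    @inbounds for _ in 1:full      # whole periods: index runs 1:L, no mod needed
        for i in 1:L
            out[k += 1] = code[i]
        end
    end
    @inbounds for i in 1:rest      # remaining partial period
        out[k += 1] = code[i]
    end
    out
end

code = Int32.(rand((-1, 1), 1023))
out = zeros(Int32, 2000)
tile_code!(out, code)
```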

This should be:

code = rand((Int32(-1), Int32(1)), 1023)

Okay guys, this blew my mind.
Watch this:

function generate_code5!(
    sampled_code,
    code,
    code_frequency,
    sampling_frequency,
    ::Val{N}, # number of samples
) where {N}
    fixed_point = sizeof(Int) * 8 - 1 - ndigits(N; base = 2)
    FP = Fixed{Int,fixed_point}
    delta_fp = FP(sampling_frequency / code_frequency).i
    delta_sum = 0
    prev = 0
    local temp_code
    num_total_iterations = Int(fld(code_frequency * N, sampling_frequency))
    num_code_iterations = cld(num_total_iterations, length(code))
    num_inner_iterations = 4 # This should actually be Int(cld(sampling_frequency, code_frequency))
    @inbounds for k = 1:num_code_iterations
        for i = 1:min(num_total_iterations - length(code) * (k - 1), length(code))
            temp_code = code[i]
            for j = 1:num_inner_iterations
                sampled_code[prev+j] = temp_code
            end
            delta_sum += delta_fp
            prev = delta_sum >> fixed_point + 1
        end
    end
    # Sample remaining bits
    @inbounds for i = prev:N
        sampled_code[i] = temp_code
    end
end
sampled_code5 = zeros(Int32, 2000)
@btime generate_code5!($sampled_code5, $code, $1023e3, $4e6, Val(2000))
  # 143.262 ns (0 allocations: 0 bytes)

This is almost 10 times faster :heart_eyes:

It is based on the skip pattern approach (thanks @DNF).
The trick I used here was to make the inner loop a fixed length.
This still works even if the repetition alternates between two numbers: when the repetition is the lower number, the surplus sample is simply overwritten at the next index. Thanks to SIMD, this redundant work is essentially free.
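The overwrite trick can be illustrated without any intrinsics (a minimal sketch of my own, not the actual generate_code5! inner loop): always write a fixed-width block of W copies of the chip, but advance the cursor only by the true repeat count, so a repeat of W - 1 gets its extra sample overwritten by the next chip. The output buffer needs W - 1 slots of padding for the final write.

```julia
# Sketch: fixed-width writes with overlap; `out` needs W-1 padding slots.
function overwrite_fill!(out, chips, repeats, W)
    pos = 1
    @inbounds for (c, r) in zip(chips, repeats)
        for j in 0:W-1            # fixed-width write: trivially vectorizable
            out[pos + j] = c
        end
        pos += r                  # advance only by the true repeat count
    end
    out
end

out = zeros(Int, 7 + 1)           # repeats sum to 7, plus W-1 = 3? no: 1 extra write slot used here
overwrite_fill!(out, [10, 20], [4, 3], 4)
out[1:7]  # → [10, 10, 10, 10, 20, 20, 20]
```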

Ah, and I have eliminated the mod function by splitting the loop such that the index always stays between 1 and 1023.

There is still one caveat, though: I could only get it down to 143 ns when I hard-coded the 4 in the inner loop. It should actually be Int(cld(sampling_frequency, code_frequency)) (which equals 4 for sampling_frequency = 4e6 and code_frequency = 1023e3), but with that the benchmark slows down to 636.060 ns.

Do you guys have any advice on how to reach this speed without hard-coding the 4?


Having not read basically any of this thread: maybe try unrolling the inner loop using LLVMLoopInfo.jl (JuliaSIMD/LLVMLoopInfo.jl on GitHub, “pass loop info to LLVM”), or making the frequencies Vals.

I see a couple of possibilities.

You could pass num_inner_iterations as a Val, instead of the frequencies. (I don’t really understand what the fixed point is needed for, though I haven’t looked closely; I’d rather use Val for the inner loop.)
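Passing the repeat count as a Val could look like this sketch (a hypothetical minimal helper, not the thread’s full generate_code5!): the inner-loop length becomes a compile-time constant, so the compiler can unroll it just as if the 4 were hard-coded.

```julia
# Sketch: compile-time inner-loop length via Val, enabling full unrolling.
function fill_repeats!(out, code, ::Val{R}) where {R}
    k = 0
    @inbounds for c in code
        for _ in 1:R              # R is known at compile time -> unrolled
            out[k += 1] = c
        end
    end
    out
end

out = zeros(Int32, 12)
fill_repeats!(out, Int32[1, -1, 1], Val(4))  # → each chip repeated 4 times
```

A separate method would still be needed per distinct R, which is exactly what Val-specialization buys you.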

You could use explicit SIMD instructions from SIMD.jl with a fixed vector length, but skip ahead by num_inner_iterations for each write operation. Then perhaps you won’t need to use Val.