Your help is really appreciated. I'm not sure what you are asking for, though.
The pattern is not always N, N, ..., N-1 or N, N, ..., N-n.
I've provided a function to calculate the pattern above.
I cannot run your code, and the examples you print out do indeed follow the pattern I'm asking about (e.g. 4, 3, 4, 3, 4, 3, ...).
A counter example could be 4, 4, 3, 3, 4, 4, 3, 3, ...
The reason I ask is that if it follows this pattern it should be relatively straightforward to simd-fy the code.
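To make the repetition pattern explicit, here is a small hedged sketch (the function name and signature are mine, not from this thread) that counts how many samples land on each code chip; chip k covers the sample indices i with floor(i * code_frequency / sampling_frequency) == k:

```julia
# Hypothetical helper, not thread code: for a code clocked at
# `code_frequency` sampled at `sampling_frequency`, return how many
# samples fall on each of the first `nchips` chips.
function repeat_pattern(code_frequency, sampling_frequency, nchips)
    counts = Int[]
    for chip in 0:nchips-1
        first = ceil(Int, chip * sampling_frequency / code_frequency)
        last = ceil(Int, (chip + 1) * sampling_frequency / code_frequency) - 1
        push!(counts, last - first + 1)
    end
    counts
end

repeat_pattern(1023e3, 5e6, 9)  # [5, 5, 5, 5, 5, 5, 5, 5, 4] for these rates
```

For 5e6 / 1023e3 ≈ 4.888 samples per chip this gives runs of 5 with an occasional 4, i.e. the N, N, ..., N-1 shape under discussion.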
Alright, I think I've found something that's faster using this skip pattern approach:
function generate_code4!(sampled_code, code, code_frequency, sampling_frequency)
    FP = Fixed{Int, 52}
    delta_fp = FP(sampling_frequency / code_frequency)
    prev = 0
    @inbounds for i = 1:floor(Int, code_frequency / sampling_frequency * length(sampled_code))
        temp_code = code[mod(i, 1023)] # mod (not mod1) is the likely source of the one-sample shift noted below
        skips = floor(Int, i * delta_fp - prev)
        for j = 1:skips
            sampled_code[prev + j] = temp_code
        end
        prev = prev + skips
    end
end
It isn't perfect yet: the samples are shifted by one, and it doesn't calculate the last sample. But it's faster, and I can see a lot more vectorized instructions in code_native.
code = Int32.(rand((-1, 1), 1023))
sampled_code4 = zeros(Int32, num_samples)
@btime generate_code4!($sampled_code4, $code, $1023e3, $5e6)
# 1.131 μs (0 allocations: 0 bytes)
This is my first attempt using SIMD intrinsics from SIMD.jl:
using SIMD

function simd_nco!(output_vec, initial_phase, phase_step, lookup_table)
    # Simple fixed-point number:
    #   int:       11 bit
    #   frac:      53 bit
    #   container: 64 bit
    range_const = Vec{8, Int64}((0, 1, 2, 3, 4, 5, 6, 7))
    for idx in 0:fld(length(output_vec), 8) - 1
        v0 = range_const + idx * 8
        v1 = initial_phase + v0 * phase_step
        v5 = Vec{8, Int32}(v1 >> 53) # extract the integer part of the fixed-point number
        v2 = v5 % 1023
        @inbounds v3 = vgather(lookup_table, Vec{8, Int64}(v2 + 1))
        @inbounds vstore(v3, output_vec, idx * 8 + 1)
    end
    # TODO: this currently doesn't handle the remaining elements
    # that don't fit into a SIMD register.
end
Unfortunately I need to convert back to Int64 vectors for gather indexing, because of Support `vgather` with all Int types · Issue #98 · eschnett/SIMD.jl · GitHub.
The output is identical to that of generate_code2!:
function generate_code2!(sampled_code, code, code_frequency, sampling_frequency)
    FP = Fixed{Int, 52}
    code_length_fp = FP(length(code))
    delta_fp = FP(code_frequency / sampling_frequency)
    phase_fp = FP(0)
    @inbounds for i = 1:length(sampled_code)
        sampled_code[i] = code[floor(Int, phase_fp) + 1]
        phase_fp += delta_fp
        phase_fp -= (phase_fp >= code_length_fp) * code_length_fp
    end
end
num_samples = 2000
sampled_code = zeros(Int32, num_samples)
sampled_code2 = zeros(Int32, num_samples)
code = Int32.(rand((-1, 1), 1023))
generate_code2!(sampled_code, code, 1023e3, 5e6)
simd_nco!(sampled_code2, 0, (Fixed{Int,53}(1023e3/5e6).i), code)
sampled_code == sampled_code2
Performance is slightly faster:
julia> @benchmark generate_code2!($sampled_code, $code, $1023e3, $5e6)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.986 μs … 3.621 μs   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.999 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.012 μs ± 59.056 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram bars omitted]
 1.99 μs        Histogram: frequency by time        2.29 μs <
 Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark simd_nco!($sampled_code2, $0, $(Fixed{Int,53}(1023e3/5e6).i), $code)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.169 μs … 3.769 μs   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.176 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.181 μs ± 56.328 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram bars omitted]
 1.17 μs        Histogram: frequency by time        1.36 μs <
 Memory estimate: 0 bytes, allocs estimate: 0.
The current bottleneck is the modulo operation v2 = v5 % 1023. Changing it to modulo 1024 makes it much faster, because it reduces to bit masking:
julia> code = Int32.(rand((-1, 1), 1024))
1024-element Vector{Int32}:
julia> @benchmark generate_code2!($sampled_code, $code, $1023e3, $5e6)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.984 μs … 3.311 μs   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.999 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.011 μs ± 57.494 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram bars omitted]
 1.98 μs     Histogram: log(frequency) by time     2.29 μs <
 Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark simd_nco!($sampled_code2, $0, $(Fixed{Int,53}(1023e3/5e6).i), $code)
BenchmarkTools.Trial: 10000 samples with 117 evaluations.
 Range (min … max):  756.829 ns … 1.031 μs   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     758.205 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   760.693 ns ± 10.146 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram bars omitted]
 757 ns      Histogram: log(frequency) by time      790 ns <
 Memory estimate: 0 bytes, allocs estimate: 0.
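The bit-masking claim is easy to check directly: for non-negative integers, a remainder by a power of two is just an AND with (divisor - 1), one cheap vectorizable instruction, whereas 1023 has no such shortcut (a sketch, not thread code):

```julia
# For non-negative x, x % 2^k equals x & (2^k - 1): the remainder is just
# the low k bits. 1023 is not a power of two, so x % 1023 needs a real
# division (or a multiply-shift sequence), which is costlier under SIMD.
x = 123_456_789
@assert x % 1024 == x & 1023   # power-of-two modulo is a single AND
```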
Thank you, I will try this code later today!
I think @DNF might be right here, that the skip pattern is the better approach.
I have an example above your post.
I think that if you split the for loops carefully, you could omit the modulo function altogether, by calculating the code up until the code length and then starting another loop for the remaining part.
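A hedged sketch of that idea (the helper name is mine, and it only shows the modulo-elimination structure, not the resampling): iterate over whole code periods plus a partial tail, so every access is a plain 1:length(code) index:

```julia
# Hypothetical helper: tile `out` with repetitions of `code` without any
# modulo in the inner loops, by splitting into full periods plus a tail.
function fill_by_periods!(out, code)
    n, L = length(out), length(code)
    full, rest = divrem(n, L)
    pos = 0
    for _ in 1:full          # whole code periods
        for i in 1:L
            out[pos + i] = code[i]   # index is always in 1:L, no mod needed
        end
        pos += L
    end
    for i in 1:rest          # remaining partial period
        out[pos + i] = code[i]
    end
    out
end
```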
This should be
code = rand((Int32(-1), Int32(1)), 1023)
Okay guys, this blew my mind.
Watch this:
function generate_code5!(
    sampled_code,
    code,
    code_frequency,
    sampling_frequency,
    ::Val{N}, # number of samples
) where {N}
    fixed_point = sizeof(Int) * 8 - 1 - ndigits(N; base = 2)
    FP = Fixed{Int, fixed_point}
    delta_fp = FP(sampling_frequency / code_frequency).i
    delta_sum = 0
    prev = 0
    local temp_code
    num_total_iterations = Int(fld(code_frequency * N, sampling_frequency))
    num_code_iterations = cld(num_total_iterations, length(code))
    num_inner_iterations = 4 # This should actually be Int(cld(sampling_frequency, code_frequency))
    @inbounds for k = 1:num_code_iterations
        for i = 1:min(num_total_iterations - length(code) * (k - 1), length(code))
            temp_code = code[i]
            for j = 1:num_inner_iterations
                sampled_code[prev + j] = temp_code
            end
            delta_sum += delta_fp
            prev = delta_sum >> fixed_point + 1
        end
    end
    # Sample remaining bits
    @inbounds for i = prev:N
        sampled_code[i] = temp_code
    end
end
sampled_code5 = zeros(Int32, 2000)
@btime generate_code5!($sampled_code5, $code, $1023e3, $4e6, Val(2000))
# 143.262 ns (0 allocations: 0 bytes)
This is almost 10 times faster! It is based on the skip pattern approach (thanks @DNF).
The trick I used here was to make the inner loop a fixed length.
This still works even if the repetition varies between two numbers, because when the repetition is the lower number, the sampled code is simply overwritten at that index by the next chip. This extra work costs essentially nothing once the loop is vectorized with SIMD.
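A tiny illustration of the overwrite trick with made-up numbers: every chip writes a fixed 4 samples, and when a chip actually owns only 3 slots, the next chip's block starts one slot earlier and overwrites the surplus write:

```julia
out = zeros(Int, 8)
out[1:4] .= 11   # chip A always writes 4 samples, though it owns only 3 slots
out[4:7] .= 22   # chip B starts at slot 4, overwriting chip A's surplus write
@assert out == [11, 11, 11, 22, 22, 22, 22, 0]
```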
Ah, and I have eliminated the mod function by splitting the loop such that the index is always between 1 and 1023.
There is still one caveat, though: I could only get it down to 143 ns when I hard coded the 4 in the inner loop. This should actually be Int(cld(sampling_frequency, code_frequency)) (which is equal to 4 for sampling_frequency = 4e6 and code_frequency = 1023e3), but with that the benchmark drops to 636.060 ns.
Do you guys have any advice on how to get to this speed without hard coding the 4?
Having not read basically any of this thread, maybe try unrolling the inner loop using GitHub - JuliaSIMD/LLVMLoopInfo.jl: Pass loop info to LLVM or making the frequencies Vals.
I see a couple of possibilities.
You could pass num_inner_iterations as a Val, instead of the frequencies. (I don't really understand what the fixed point is needed for, though I haven't looked closely; I'd rather use Val for the inner loop.)
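A minimal sketch of that suggestion (the helper function and setup are mine, not the thread's final code): lifting the inner-loop length into a Val makes it a compile-time constant, so the compiler can fully unroll the loop just as with the hard-coded 4. Construct the Val once, outside the hot loop, so the dynamic dispatch is paid only once:

```julia
# Hypothetical helper: write `value` N times starting after `offset`,
# where N is a compile-time constant carried in the Val type parameter.
function fill_fixed!(out, value, offset, ::Val{N}) where {N}
    @inbounds for j in 1:N   # N is known at compile time, so unrollable
        out[offset + j] = value
    end
    out
end

inner = Int(cld(4e6, 1023e3))   # == 4 for these frequencies
out = zeros(Int32, 8)
fill_fixed!(out, Int32(1), 2, Val(inner))  # function barrier: dispatch once
```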
You could use explicit SIMD instructions with SIMD.jl and a fixed vector length, but skip num_inner_iterations for each write operation. Then perhaps you won't need to use Val.