Your help is really appreciated. I'm not sure what you are asking for, though.
The pattern is not always N, N, ..., N-1 or N, N, ..., N-n.
I've provided a function to calculate the pattern above.
I cannot run your code, and the examples you print out do indeed follow the pattern I'm asking about (e.g. 4, 3, 4, 3, 4, 3, ...).
A counter example could be 4, 4, 3, 3, 4, 4, 3, 3, ...
The reason I ask is that if it follows this pattern it should be relatively straightforward to simd-fy the code.
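To make the repetition pattern explicit, here is a small hedged sketch (the function name and signature are mine, not from this thread) that counts how many samples land on each code chip; chip k covers the sample indices i with floor(i * code_frequency / sampling_frequency) == k:

```julia
# Hypothetical helper, not thread code: for a code clocked at
# `code_frequency` sampled at `sampling_frequency`, return how many
# samples fall on each of the first `nchips` chips.
function repeat_pattern(code_frequency, sampling_frequency, nchips)
    counts = Int[]
    for chip in 0:nchips-1
        first = ceil(Int, chip * sampling_frequency / code_frequency)
        last = ceil(Int, (chip + 1) * sampling_frequency / code_frequency) - 1
        push!(counts, last - first + 1)
    end
    counts
end

repeat_pattern(1023e3, 5e6, 9)  # [5, 5, 5, 5, 5, 5, 5, 5, 4] for these rates
```

For 5e6 / 1023e3 ≈ 4.888 samples per chip this gives runs of 5 with an occasional 4, i.e. the N, N, ..., N-1 shape under discussion.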
Alright, I think I've found something that's faster using this skip pattern approach:
function generate_code4!(sampled_code, code, code_frequency, sampling_frequency)
    FP = Fixed{Int, 52}
    delta_fp = FP(sampling_frequency / code_frequency)
    prev = 0
    @inbounds for i = 1:floor(Int, code_frequency / sampling_frequency * length(sampled_code))
        temp_code = code[mod(i, 1023)] # mod (not mod1) is the likely source of the one-sample shift noted below
        skips = floor(Int, i * delta_fp - prev)
        for j = 1:skips
            sampled_code[prev + j] = temp_code
        end
        prev = prev + skips
    end
end
It isn't perfect yet: the samples are shifted by one, and it doesn't calculate the last sample. But it's faster, and I can see a lot more vectorized instructions in code_native.
code = Int32.(rand((-1, 1), 1023))
sampled_code4 = zeros(Int32, num_samples)
@btime generate_code4!($sampled_code4, $code, $1023e3, $5e6)
# 1.131 μs (0 allocations: 0 bytes)
This is my first attempt using SIMD intrinsics from SIMD.jl:
using SIMD

function simd_nco!(output_vec, initial_phase, phase_step, lookup_table)
    # Simple fixed-point number:
    #   int:       11 bit
    #   frac:      53 bit
    #   container: 64 bit
    range_const = Vec{8, Int64}((0, 1, 2, 3, 4, 5, 6, 7))
    for idx in 0:fld(length(output_vec), 8) - 1
        v0 = range_const + idx * 8
        v1 = initial_phase + v0 * phase_step
        v5 = Vec{8, Int32}(v1 >> 53) # extract the integer part of the fixed-point number
        v2 = v5 % 1023
        @inbounds v3 = vgather(lookup_table, Vec{8, Int64}(v2 + 1))
        @inbounds vstore(v3, output_vec, idx * 8 + 1)
    end
    # TODO: this currently doesn't handle the remaining elements
    # that don't fit into a SIMD register.
end
Unfortunately I need to convert back to Int64 vectors for gather indexing, because of Support `vgather` with all Int types · Issue #98 · eschnett/SIMD.jl · GitHub.
The output is identical to that of generate_code2!:
function generate_code2!(sampled_code, code, code_frequency, sampling_frequency)
    FP = Fixed{Int, 52}
    code_length_fp = FP(length(code))
    delta_fp = FP(code_frequency / sampling_frequency)
    phase_fp = FP(0)
    @inbounds for i = 1:length(sampled_code)
        sampled_code[i] = code[floor(Int, phase_fp) + 1]
        phase_fp += delta_fp
        phase_fp -= (phase_fp >= code_length_fp) * code_length_fp
    end
end
num_samples = 2000
sampled_code = zeros(Int32, num_samples)
sampled_code2 = zeros(Int32, num_samples)
code = Int32.(rand((-1, 1), 1023))
generate_code2!(sampled_code, code, 1023e3, 5e6)
simd_nco!(sampled_code2, 0, (Fixed{Int,53}(1023e3/5e6).i), code)
sampled_code == sampled_code2
Performance is slightly faster:
julia> @benchmark generate_code2!($sampled_code, $code, $1023e3, $5e6)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.986 μs … 3.621 μs   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.999 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.012 μs ± 59.056 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram bars omitted]
 1.99 μs        Histogram: frequency by time        2.29 μs <
 Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark simd_nco!($sampled_code2, $0, $(Fixed{Int,53}(1023e3/5e6).i), $code)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.169 μs … 3.769 μs   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.176 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.181 μs ± 56.328 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram bars omitted]
 1.17 μs        Histogram: frequency by time        1.36 μs <
 Memory estimate: 0 bytes, allocs estimate: 0.
The current bottleneck is the modulo operation v2 = v5 % 1023. Changing it to modulo 1024 makes it much faster, because it reduces to bit masking:
julia> code = Int32.(rand((-1, 1), 1024))
1024-element Vector{Int32}:
julia> @benchmark generate_code2!($sampled_code, $code, $1023e3, $5e6)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.984 μs … 3.311 μs   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.999 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.011 μs ± 57.494 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram bars omitted]
 1.98 μs     Histogram: log(frequency) by time     2.29 μs <
 Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark simd_nco!($sampled_code2, $0, $(Fixed{Int,53}(1023e3/5e6).i), $code)
BenchmarkTools.Trial: 10000 samples with 117 evaluations.
 Range (min … max):  756.829 ns … 1.031 μs   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     758.205 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   760.693 ns ± 10.146 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram bars omitted]
 757 ns      Histogram: log(frequency) by time      790 ns <
 Memory estimate: 0 bytes, allocs estimate: 0.
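The bit-masking claim is easy to check directly: for non-negative integers, a remainder by a power of two is just an AND with (divisor - 1), one cheap vectorizable instruction, whereas 1023 has no such shortcut (a sketch, not thread code):

```julia
# For non-negative x, x % 2^k equals x & (2^k - 1): the remainder is just
# the low k bits. 1023 is not a power of two, so x % 1023 needs a real
# division (or a multiply-shift sequence), which is costlier under SIMD.
x = 123_456_789
@assert x % 1024 == x & 1023   # power-of-two modulo is a single AND
```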
Thank you, I will try this code later today!
I think @DNF might be right here, that the skip pattern is the better approach.
I have an example above your post.
I think that if you split the for loops carefully, you could omit the modulo function altogether, by calculating the code up until the code length and then starting another loop for the remaining part.
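A hedged sketch of that idea (the helper name is mine, and it only shows the modulo-elimination structure, not the resampling): iterate over whole code periods plus a partial tail, so every access is a plain 1:length(code) index:

```julia
# Hypothetical helper: tile `out` with repetitions of `code` without any
# modulo in the inner loops, by splitting into full periods plus a tail.
function fill_by_periods!(out, code)
    n, L = length(out), length(code)
    full, rest = divrem(n, L)
    pos = 0
    for _ in 1:full          # whole code periods
        for i in 1:L
            out[pos + i] = code[i]   # index is always in 1:L, no mod needed
        end
        pos += L
    end
    for i in 1:rest          # remaining partial period
        out[pos + i] = code[i]
    end
    out
end
```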
This should be
code = rand((Int32(-1), Int32(1)), 1023)
Okay guys, this blew my mind.
Watch this:
function generate_code5!(
    sampled_code,
    code,
    code_frequency,
    sampling_frequency,
    ::Val{N}, # number of samples
) where {N}
    fixed_point = sizeof(Int) * 8 - 1 - ndigits(N; base = 2)
    FP = Fixed{Int, fixed_point}
    delta_fp = FP(sampling_frequency / code_frequency).i
    delta_sum = 0
    prev = 0
    local temp_code
    num_total_iterations = Int(fld(code_frequency * N, sampling_frequency))
    num_code_iterations = cld(num_total_iterations, length(code))
    num_inner_iterations = 4 # This should actually be Int(cld(sampling_frequency, code_frequency))
    @inbounds for k = 1:num_code_iterations
        for i = 1:min(num_total_iterations - length(code) * (k - 1), length(code))
            temp_code = code[i]
            for j = 1:num_inner_iterations
                sampled_code[prev + j] = temp_code
            end
            delta_sum += delta_fp
            prev = delta_sum >> fixed_point + 1
        end
    end
    # Sample remaining bits
    @inbounds for i = prev:N
        sampled_code[i] = temp_code
    end
end
sampled_code5 = zeros(Int32, 2000)
@btime generate_code5!($sampled_code5, $code, $1023e3, $4e6, Val(2000))
# 143.262 ns (0 allocations: 0 bytes)
This is almost 10 times faster! It is based on the skip pattern approach (thanks @DNF).
The trick I used here was to make the inner loop a fixed length.
This still works even if the repetition varies between two numbers, because when the repetition is the lower number, the sampled code is simply overwritten at that index by the next chip. This extra work costs essentially nothing once the loop is vectorized with SIMD.
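A tiny illustration of the overwrite trick with made-up numbers: every chip writes a fixed 4 samples, and when a chip actually owns only 3 slots, the next chip's block starts one slot earlier and overwrites the surplus write:

```julia
out = zeros(Int, 8)
out[1:4] .= 11   # chip A always writes 4 samples, though it owns only 3 slots
out[4:7] .= 22   # chip B starts at slot 4, overwriting chip A's surplus write
@assert out == [11, 11, 11, 22, 22, 22, 22, 0]
```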
Ah, and I have eliminated the mod function by splitting the loop such that the index is always between 1 and 1023.
There is still one caveat, though: I could only get it down to 143 ns when I hard coded the 4 in the inner loop. This should actually be Int(cld(sampling_frequency, code_frequency)) (which is equal to 4 for sampling_frequency = 4e6 and code_frequency = 1023e3), but with that the benchmark drops to 636.060 ns.
Do you guys have any advice on how to get to this speed without hard coding the 4?
Having not read basically any of this thread, maybe try unrolling the inner loop using GitHub - JuliaSIMD/LLVMLoopInfo.jl: Pass loop info to LLVM or making the frequencies Vals.
I see a couple of possibilities.
You could pass num_inner_iterations as a Val, instead of the frequencies. (I don't really understand what the fixed point is needed for, though I haven't looked closely; I'd rather use Val for the inner loop.)
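A minimal sketch of that suggestion (the helper function and setup are mine, not the thread's final code): lifting the inner-loop length into a Val makes it a compile-time constant, so the compiler can fully unroll the loop just as with the hard-coded 4. Construct the Val once, outside the hot loop, so the dynamic dispatch is paid only once:

```julia
# Hypothetical helper: write `value` N times starting after `offset`,
# where N is a compile-time constant carried in the Val type parameter.
function fill_fixed!(out, value, offset, ::Val{N}) where {N}
    @inbounds for j in 1:N   # N is known at compile time, so unrollable
        out[offset + j] = value
    end
    out
end

inner = Int(cld(4e6, 1023e3))   # == 4 for these frequencies
out = zeros(Int32, 8)
fill_fixed!(out, Int32(1), 2, Val(inner))  # function barrier: dispatch once
```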
You could use explicit SIMD instructions with SIMD.jl and a fixed vector length, but skip num_inner_iterations for each write operation. Then perhaps you won't need to use Val.