I like the fixed-point implementation better; it's how the NCOs in hardware GPS receivers work.
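For anyone following along, here is a minimal sketch of a fixed-point phase accumulator driving a table lookup. The function name `sample_code!`, the `Int8` chip type and the default of 32 fractional bits are illustrative assumptions, not taken from any particular package.

```julia
# Fixed-point code NCO sketch (hypothetical helper, not a library API).
# The phase is an unsigned integer: the low `frac_bits` bits hold the
# fractional chip, the bits above them hold the chip index.
function sample_code!(out::Vector{Int8}, code::Vector{Int8},
                      code_freq::Real, sample_freq::Real; frac_bits::Int = 32)
    step  = round(UInt64, code_freq / sample_freq * 2.0^frac_bits)  # chips per sample
    wrap  = UInt64(length(code)) << frac_bits                       # one full code period
    phase = UInt64(0)
    @inbounds for i in eachindex(out)
        out[i] = code[(phase >> frac_bits) + 1]  # integer part selects the chip
        phase += step                            # exact integer add, no accumulated rounding error
        phase >= wrap && (phase -= wrap)         # wrap at the end of the code; assumes step < wrap
    end
    return out
end
```

The code frequency is quantized once when `step` is computed; after that the accumulation itself is exact, which is the main attraction over a floating-point phase.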
One option is to vectorize the fixed-point index calculation and the table lookup using gather instructions.
We can do a vectorized load from a lookup table if we can somehow convince Julia to emit a gather
instruction such as vgatherdpd or vgatherdps. Whether this is faster than an indexing loop on a CPU is debatable:
- https://dl.acm.org/doi/abs/10.1145/3533737.3535089
- SIMD gather result in slow down
- > If you count shared memory scatter/gather, CPU SIMD already have both. Scatter... | Hacker News
- x86 - Intel vs AMD gather AVX performance - Stack Overflow
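To make the gather idea concrete, here is a rough sketch of a lookup loop with no loop-carried state, so each sample's index can be computed independently. The name `lookup_closed_form!` and the `Float32` table are assumptions for illustration. Whether LLVM actually turns the load into a gather depends on the CPU target and Julia version, and the integer remainder used for wrapping is expensive and may itself hinder vectorization, so this is a starting point for experiments rather than a fast implementation.

```julia
# Hypothetical helper: closed-form fixed-point lookup without a running accumulator.
function lookup_closed_form!(out::Vector{Float32}, table::Vector{Float32},
                             phase0::UInt64, step::UInt64, frac_bits::Int)
    n = UInt64(length(table))
    @inbounds @simd for i in eachindex(out)
        # Phase of sample i, computed independently of the other samples.
        # Assumes phase0 + step * (i - 1) does not overflow UInt64.
        phase  = phase0 + step * UInt64(i - 1)
        chip   = (phase >> frac_bits) % n   # integer part, wrapped to the table length
        out[i] = table[chip + 1]            # data-dependent load: a gather candidate
    end
    return out
end
```

Inspecting the output of `@code_native` for a call to this function is one way to see whether a gather instruction was emitted on a given machine.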
The paper shows that it's profitable for an i9-7900X processor with AVX-512:
(reg_standalone is the scalar indexing loop)
But a security update might make this fast vectorized lookup table code go 50% slower:
Another option is to run the code LFSRs in parallel instead of indexing into a lookup table:
- Generating more than one bit at a time with an LFSR
- https://ufdcimages.uflib.ufl.edu/AA/00/03/94/72/00001/AA00039472_00001.pdf
I haven’t seen anyone do this for GNSS PRN generators, though.
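As a rough illustration of the multi-bit idea, the sketch below uses the GPS C/A G1 register (polynomial x^10 + x^3 + 1, all-ones start) as an example. Its output obeys o[n] = o[n-3] ⊻ o[n-10], so three new bits depend only on bits already in the 10-bit window and can be produced with one shift and one XOR. The function names and the window layout (bit 0 = oldest output) are my own choices for illustration.

```julia
# Bit-serial reference: one chip per iteration.
function lfsr_serial(nbits::Integer)
    w = UInt32(0x3ff)                  # 10-bit window of upcoming outputs, bit 0 = oldest
    out = Vector{UInt8}(undef, nbits)
    for n in 1:nbits
        out[n] = w & 0x1               # oldest bit leaves the window
        new = ((w >> 7) ⊻ w) & 0x1     # o[n-3] ⊻ o[n-10]
        w = (w >> 1) | (new << 9)
    end
    return out
end

# Same sequence, three chips per iteration.
function lfsr_3bits(nbits::Integer)
    @assert nbits % 3 == 0
    w = UInt32(0x3ff)
    out = Vector{UInt8}(undef, nbits)
    for n in 1:3:nbits
        out[n]     = w & 0x1           # three oldest bits leave the window
        out[n + 1] = (w >> 1) & 0x1
        out[n + 2] = (w >> 2) & 0x1
        new3 = ((w >> 7) ⊻ w) & 0x7    # three feedback bits from one shift and one XOR
        w = (w >> 3) | (new3 << 7)
    end
    return out
end
```

The batch size of three comes from the shortest feedback lag in this recurrence; going wider needs a rewritten recurrence or a precomputed state-transition matrix, which this sketch does not attempt.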