Stupid question: did you measure with `@btime` (BenchmarkTools.jl) and check with a profiler? I'm only halfway qualified to talk about the CPU part of the question, but I would suspect that `@batch` (Polyester.jl) or `@tturbo` (LoopVectorization.jl) should do better for the CPU; see this thread for example.
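To make the comparison concrete, here is a minimal sketch of what I mean. The kernel `y[i] = a * x[i] + b` and the names `kernel_threads!`, `kernel_batch!` and `kernel_tturbo!` are made up for illustration, since I don't know what your actual loop looks like; whether `@batch` or `@tturbo` wins for your workload is exactly what the measurement should tell you. It assumes BenchmarkTools.jl, Polyester.jl and LoopVectorization.jl are installed and that Julia was started with several threads (e.g. `julia -t auto`).

```julia
using BenchmarkTools      # provides @btime
using LoopVectorization   # provides @tturbo
using Polyester           # provides @batch
using Profile             # stdlib sampling profiler

# Hypothetical stand-in kernel, baseline with Base threading.
function kernel_threads!(y, a, b, x)
    Threads.@threads for i in eachindex(y, x)
        @inbounds y[i] = a * x[i] + b
    end
    return y
end

# Same loop with Polyester's lightweight threading.
function kernel_batch!(y, a, b, x)
    @batch for i in eachindex(y, x)
        @inbounds y[i] = a * x[i] + b
    end
    return y
end

# Same loop with LoopVectorization's threaded SIMD macro.
function kernel_tturbo!(y, a, b, x)
    @tturbo for i in eachindex(y, x)
        y[i] = a * x[i] + b
    end
    return y
end

x = rand(Float32, 10^6)
y = similar(x)

# Interpolate globals with $ so the timings are not distorted by global lookups.
@btime kernel_threads!($y, 2f0, 1f0, $x)
@btime kernel_batch!($y, 2f0, 1f0, $x)
@btime kernel_tturbo!($y, 2f0, 1f0, $x)

# Sample the baseline to see where the time actually goes.
Profile.clear()
@profile for _ in 1:200
    kernel_threads!(y, 2f0, 1f0, x)
end
Profile.print(maxdepth = 12)
```

For a loop this simple the result is probably limited by memory bandwidth, so the three variants may end up close; a heavier loop body tends to show larger differences. Something of this size would also already count as an MWE.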
And is an illustrative MWE really out of reach?
As for your original question: I found this paper, but I haven’t seen any mention of such technology in this group recently. The first hit when searching Discourse is this thread.