I just learned about @code_native, and it’s really cool that you can see the assembly instructions for code easily.
So, C has “compiler intrinsics” that provide C-like functions that map to particular CPU instructions. Does Julia have an equivalent?
2 Likes
llvmcall
with CPU intrinsics. It is a pain, though. For an example, see https://discourse.julialang.org/t/compiling-to-branch-table/16599/9 and consider that the sext
of vector comparison results is essential (this is what x86 does natively) while SIMD.jl uses zext
. This is why the pmovmskb
/ _mm_movemask_epi8
composes well with pcmpeqb
/ _mm_cmpeq_epi8
, and does not compose with SIMD.jl
comparison.
The preferred way is to figure out the age old C idiom for whatever you want to do, because llvm was made for clang, and both clang and your processor’s instruction set were made for old C idioms. Hence this will often compile to something good.
The second preferred way is to hope that somebody has written a julia wrapper. Only afterwards one should look whether llvm exposes your intrinsic, under which crazy name and with what crazy calling convention.
Of course llvm only exposes the intrinsic if they have no idiom (in llvm IR!) that reliably compiles to whatever instruction you want, but good luck finding a documentation for that. I passionately hate the llvm docs for processor intrinsics. I often reverse the calling convention from the llvm unit tests; alternatively, compile from C to llvm IR with clang and reverse engineer that.
4 Likes
Godbolt + -emit-llvm
is pretty ok.
6 Likes
Thanks! I imagine there probably isn’t a lot of demand for such a feature, and those who really need it can do as you suggest.
I wrote a cryptographic function (scrypt) for fun in C++ once and made heavy use of SIMD, explicit prefetching and minimizing cache pollution to increase its performance. It was cool to be able to manipulate the CPU that directly and simply. Too bad that isn’t easily achievable here.
I think there would be a lot of demand for immintrin.jl
that contains most of of immintrin.h
(see https://software.intel.com/sites/landingpage/IntrinsicsGuide/) with the same API.
It’s just that somebody needs to annotate everything (hopefully via a script), compile it with clang, generate the julia code (with llvmcall
) and generate tests for everything (to see that behavior is correct; possibly checking that code_native
matches the icc/gcc/clang variant), possibly fixing or documenting everything that goes wrong. Afaik nobody went to that effort yet.
2 Likes
Yeah, I actually started looking how something like that should look (intrinsics.jl · GitHub)
What is annoying is that there is not a one to one correspondence with LLVM intrinsics and the intrinsics one write in C.
For example:
_mm_xor_si128(A, B)
is in LLVM: xor <2 x i64> %1, %0,
_mm_add_sd
is:
%3 = extractelement <2 x double> %1, i32 0, !dbg !357
%4 = extractelement <2 x double> %0, i32 0, !dbg !357
%5 = fadd double %4, %3, !dbg !357
%6 = insertelement <2 x double> %0, double %5, i32 0, !dbg !357
etc.
AFAIU, you basically need to write a mini-compiler for the intrinsics. Or maybe the intrinsic exists and clang just doesn’t emit it?
I found this file: https://github.com/llvm-mirror/clang/blob/master/www/builtins.py which perhaps are the ones that are “special cased” in LLVM.
1 Like
Actually, it is possible just to use the intrinsic even for _mm_add_sd
(it is just that clang doesnt use it).
const VE{N, T} = NTuple{N, VecElement{T}}
@generated function _mm_add_sd(a::VE{2,Float64}, b::VE{2, Float64})
exp = """
%3 = call <2 x double> @llvm.x86.sse2.add.sd(<2 x double> %0, <2 x double> %1)
ret <2 x double> %3
"""
return quote
Base.llvmcall(
("""
declare <2 x double> @llvm.x86.sse2.add.sd(<2 x double>, <2 x double>)
""",
$exp),
VE{2,Float64},
Tuple{VE{2,Float64}, VE{2, Float64}},
a, b)
end
end
julia> a = VE{2, Float64}((1.0,2.0))
(VecElement{Float64}(1.0), VecElement{Float64}(2.0))
julia> b = VE{2, Float64}((3.0,4.0))
(VecElement{Float64}(3.0), VecElement{Float64}(4.0))
julia> _mm_add_sd(a, b)
(VecElement{Float64}(4.0), VecElement{Float64}(2.0))
julia> @code_native _mm_add_sd(b, a)
.section __TEXT,__text,regular,pure_instructions
; Function _mm_add_sd {
; Location: REPL[3]:2
; Function macro expansion; {
; Location: REPL[3]:2
vaddsd %xmm1, %xmm0, %xmm0
retq
nopw %cs:(%rax,%rax)
;}}
2 Likes
Yeah. My best guess would be to use clang to compile everything in immintinsics.h
into functions, grab the llvm IR, and generate the surrounding @inline __mm_foobar(arg1::Type1, arg2::Type2) = llvmcall(...)
. At the same time, we would need to automatically generate cheap tests for each function (maybe steal some projects unit tests), and also write the generated code_native somewhere.
Then the stolen tests would check functionality and generated tests would check that the code_native matches if we generate a @cfunction
. Since we expect only 3-4 instructions (the one that the intrinsic corresponds to, plus maybe a handful of register moves and ret
to conform to C-ABI) there should be almost no spurious difference.
So the main tasks is to use a convenient C and ascii-llvm-IR parser/preprocessor to do all the text processing, and to write a monumentally annoying build-script for this. And if something here does not work in the obvious way this is of course a world of pain.
1 Like
There’s basically no point to use intrinsics for most (all) vector math operations.
1 Like
Sure, (SIMD.jl works well for normal arithmetic). There is, however, value in having something that works similar to C-intrinsics in how you write it in Julia for consistency (and learning) and also for stuff like aesdec
you need intrinsics AFAIU.
1 Like
For things that is not normal arithmatic and cannot be expressed as one, sure. I’m just pointing out that most of the intrinsics you’ve tested so far are implemented as a icc compatibility layer in clang/gcc and I’m 99% sure the corresponding actual intrinsics are meant for the backend and doesn’t actually expose any more functions. You shouldn’t be looking specifically for llvm intrinsics since there’s no point using them (they only make the code non-portable and potentially harder for the compiler to optimize).
1 Like
For example,
static __inline __m256 __DEFAULT_FN_ATTRS
_mm256_mul_ps(__m256 __a, __m256 __b)
{
return (__m256)((__v8sf)__a * (__v8sf)__b);
}
1 Like
Sure, but like, those were just examples to experiment with calling intrinsics at all.
Anyway, my point is that to be able to port something like https://github.com/cmuratori/meow_hash/blob/5ceaf1476baeef38c16ab95cbb4e18f6df20b05d/meow_intrinsics.h#L70-L100 it would be nice if there was something in Julia that made that convenient.
For fused operations I guess you also need to use intrinsics.
1 Like
Yes. The intrinsic set of llvm is designed for clang, such that all of immintinsics.h
can be compiled, not such that they are conveniently usable for humans. So we should grab the clang output, inline it and hope that llvm does not destroy its good properties in later passes.
E.g. converting <8 x i1> to i8
would be the obvious idiom, but is afaik not understood by llvm, needing the x86 intrinsic calls plus sext to <8 x i8>
to get pmovmskb
. This is a lot of work for humans, and just directly porting C code for the very innermost loop operations that uses _mm_something
would be very convenient. And, as Kristoffer said, aes intrinsics in llvm are madness.
Of course, general vector math without fancy stuff, like e.g. chacha20 works without llvmcall.
1 Like
Yes? I’m not saying you can’t make “compatibility layers”. I’m saying that you shouldn’t be looking for LLVM intrinsics. Moreover, you shouldn’t be looking for backend intrinsics, since most of the operations are implemented as normal operations and the rest as platform independent intrinsics.
1 Like
And if you really want to know how things should be implemented, you can just have a look at the clang implementation in the headers. it contains a minimum set of actual intrinsics that you need to find and call and many more operation you can just implement in normal julia or llvm code. The C implementation of _mm_add_sd
I pasted above, for example, should be fairely straightforward to implement in julia. If it does not yield the same code, it’s a compiiler bug that should be fixed.
Edit: and finally for the list of actual C intrinsics you need, I believe the GCC doc is usually pretty good. Maybe clang has one but I have always been using the GCC one.
1 Like
If I want to call _mm_aesdec_si128
I would right now see what LLVM outputs Compiler Explorer and then call that intrinsic @llvm.x86.aesni.aesdec
.
What should I do instead?
1 Like
Use it?
Just to be more clear, I’m replying to,
and
I’m just saying that if there’s a way to do it in julia code, don’t look for intrinsics, and when there’s a way to do it with generic intrinsics, don’t look for backend specific intrinsics. This is the right way to do things and also give the compiler the best input for optimization. I’m not saying that you shouldn’t use any backend intrinsics or other intrinsics, just that they shouldn’t be what you specifically look for.
1 Like
Alright, so do I understand you correctly in that you recommend:
- Implement the possible operations using the normal LLVM vector operations (this is basically SIMD.jl).
- Provide a compatibility layer on top of these for the arithmetic instrinsics (https://github.com/llvm-mirror/clang/blob/master/www/builtins.py) which could simplify in porting code from C.
- For the rest that do not have a vector operations, just use the intrinsic.
Yes. And use generic intrinsics as much as possible. The llvm langref, gcc intrinsics doc and the clang implementations in the headers are your friends.
1 Like