Julia equivalent of C compiler intrinsics?

BioTurboNick · November 7, 2018, 9:47pm

I just learned about @code_native, and it’s really cool that you can see the assembly instructions for code easily.

So, C has “compiler intrinsics” that provide C-like functions that map to particular CPU instructions. Does Julia have an equivalent?

foobar_lv2 · November 7, 2018, 10:24pm

llvmcall with CPU intrinsics. It is a pain, though. For an example, see https://discourse.julialang.org/t/compiling-to-branch-table/16599/9 and consider that the sext of vector comparison results is essential (this is what x86 does natively) while SIMD.jl uses zext. This is why the pmovmskb/ _mm_movemask_epi8 composes well with pcmpeqb / _mm_cmpeq_epi8, and does not compose with SIMD.jl comparison.

The preferred way is to figure out the age old C idiom for whatever you want to do, because llvm was made for clang, and both clang and your processor’s instruction set were made for old C idioms. Hence this will often compile to something good.

The second preferred way is to hope that somebody has written a julia wrapper. Only afterwards one should look whether llvm exposes your intrinsic, under which crazy name and with what crazy calling convention.

Of course llvm only exposes the intrinsic if they have no idiom (in llvm IR!) that reliably compiles to whatever instruction you want, but good luck finding a documentation for that. I passionately hate the llvm docs for processor intrinsics. I often reverse the calling convention from the llvm unit tests; alternatively, compile from C to llvm IR with clang and reverse engineer that.

kristoffer.carlsson · November 7, 2018, 10:33pm

Godbolt + -emit-llvm is pretty ok.

BioTurboNick · November 8, 2018, 4:13pm

Thanks! I imagine there probably isn’t a lot of demand for such a feature, and those who really need it can do as you suggest.

I wrote a cryptographic function (scrypt) for fun in C++ once and made heavy use of SIMD, explicit prefetching and minimizing cache pollution to increase its performance. It was cool to be able to manipulate the CPU that directly and simply. Too bad that isn’t easily achievable here.

foobar_lv2 · November 8, 2018, 7:01pm

I think there would be a lot of demand for immintrin.jl that contains most of of immintrin.h (see https://software.intel.com/sites/landingpage/IntrinsicsGuide/) with the same API.

It’s just that somebody needs to annotate everything (hopefully via a script), compile it with clang, generate the julia code (with llvmcall) and generate tests for everything (to see that behavior is correct; possibly checking that code_native matches the icc/gcc/clang variant), possibly fixing or documenting everything that goes wrong. Afaik nobody went to that effort yet.

kristoffer.carlsson · November 8, 2018, 7:38pm

Yeah, I actually started looking how something like that should look (intrinsics.jl · GitHub)

What is annoying is that there is not a one to one correspondence with LLVM intrinsics and the intrinsics one write in C.

For example:

_mm_xor_si128(A, B) is in LLVM: xor <2 x i64> %1, %0,

_mm_add_sd is:

  %3 = extractelement <2 x double> %1, i32 0, !dbg !357
  %4 = extractelement <2 x double> %0, i32 0, !dbg !357
  %5 = fadd double %4, %3, !dbg !357
  %6 = insertelement <2 x double> %0, double %5, i32 0, !dbg !357

etc.

AFAIU, you basically need to write a mini-compiler for the intrinsics. Or maybe the intrinsic exists and clang just doesn’t emit it?

I found this file: https://github.com/llvm-mirror/clang/blob/master/www/builtins.py which perhaps are the ones that are “special cased” in LLVM.

kristoffer.carlsson · November 8, 2018, 7:46pm

Actually, it is possible just to use the intrinsic even for _mm_add_sd (it is just that clang doesnt use it).

const VE{N, T} = NTuple{N, VecElement{T}}

@generated function _mm_add_sd(a::VE{2,Float64}, b::VE{2, Float64})
    exp = """
    %3 = call <2 x double> @llvm.x86.sse2.add.sd(<2 x double> %0, <2 x double> %1)
    ret <2 x double> %3
    """
    return quote
            Base.llvmcall(
            ("""
            declare <2 x double> @llvm.x86.sse2.add.sd(<2 x double>, <2 x double>)
            """,
            $exp),
               VE{2,Float64},
               Tuple{VE{2,Float64}, VE{2, Float64}},
               a,  b)
    end
end

julia> a = VE{2, Float64}((1.0,2.0))
(VecElement{Float64}(1.0), VecElement{Float64}(2.0))

julia> b = VE{2, Float64}((3.0,4.0))
(VecElement{Float64}(3.0), VecElement{Float64}(4.0))

julia> _mm_add_sd(a, b)
(VecElement{Float64}(4.0), VecElement{Float64}(2.0))

julia> @code_native _mm_add_sd(b, a)
        .section        __TEXT,__text,regular,pure_instructions
; Function _mm_add_sd {
; Location: REPL[3]:2
; Function macro expansion; {
; Location: REPL[3]:2
        vaddsd  %xmm1, %xmm0, %xmm0
        retq
        nopw    %cs:(%rax,%rax)
;}}

foobar_lv2 · November 8, 2018, 7:48pm

Yeah. My best guess would be to use clang to compile everything in immintinsics.h into functions, grab the llvm IR, and generate the surrounding @inline __mm_foobar(arg1::Type1, arg2::Type2) = llvmcall(...). At the same time, we would need to automatically generate cheap tests for each function (maybe steal some projects unit tests), and also write the generated code_native somewhere.

Then the stolen tests would check functionality and generated tests would check that the code_native matches if we generate a @cfunction. Since we expect only 3-4 instructions (the one that the intrinsic corresponds to, plus maybe a handful of register moves and ret to conform to C-ABI) there should be almost no spurious difference.

So the main tasks is to use a convenient C and ascii-llvm-IR parser/preprocessor to do all the text processing, and to write a monumentally annoying build-script for this. And if something here does not work in the obvious way this is of course a world of pain.

yuyichao · November 8, 2018, 7:48pm

There’s basically no point to use intrinsics for most (all) vector math operations.

kristoffer.carlsson · November 8, 2018, 7:50pm

Sure, (SIMD.jl works well for normal arithmetic). There is, however, value in having something that works similar to C-intrinsics in how you write it in Julia for consistency (and learning) and also for stuff like aesdec you need intrinsics AFAIU.

yuyichao · November 8, 2018, 7:54pm

For things that is not normal arithmatic and cannot be expressed as one, sure. I’m just pointing out that most of the intrinsics you’ve tested so far are implemented as a icc compatibility layer in clang/gcc and I’m 99% sure the corresponding actual intrinsics are meant for the backend and doesn’t actually expose any more functions. You shouldn’t be looking specifically for llvm intrinsics since there’s no point using them (they only make the code non-portable and potentially harder for the compiler to optimize).

yuyichao · November 8, 2018, 7:55pm

For example,

static __inline __m256 __DEFAULT_FN_ATTRS
_mm256_mul_ps(__m256 __a, __m256 __b)
{
  return (__m256)((__v8sf)__a * (__v8sf)__b);
}

kristoffer.carlsson · November 8, 2018, 8:01pm

Sure, but like, those were just examples to experiment with calling intrinsics at all.

Anyway, my point is that to be able to port something like https://github.com/cmuratori/meow_hash/blob/5ceaf1476baeef38c16ab95cbb4e18f6df20b05d/meow_intrinsics.h#L70-L100 it would be nice if there was something in Julia that made that convenient.

For fused operations I guess you also need to use intrinsics.

foobar_lv2 · November 8, 2018, 8:05pm

Yes. The intrinsic set of llvm is designed for clang, such that all of immintinsics.h can be compiled, not such that they are conveniently usable for humans. So we should grab the clang output, inline it and hope that llvm does not destroy its good properties in later passes.

E.g. converting <8 x i1> to i8 would be the obvious idiom, but is afaik not understood by llvm, needing the x86 intrinsic calls plus sext to <8 x i8> to get pmovmskb. This is a lot of work for humans, and just directly porting C code for the very innermost loop operations that uses _mm_something would be very convenient. And, as Kristoffer said, aes intrinsics in llvm are madness.

Of course, general vector math without fancy stuff, like e.g. chacha20 works without llvmcall.

yuyichao · November 8, 2018, 8:06pm

Yes? I’m not saying you can’t make “compatibility layers”. I’m saying that you shouldn’t be looking for LLVM intrinsics. Moreover, you shouldn’t be looking for backend intrinsics, since most of the operations are implemented as normal operations and the rest as platform independent intrinsics.

yuyichao · November 8, 2018, 8:09pm

And if you really want to know how things should be implemented, you can just have a look at the clang implementation in the headers. it contains a minimum set of actual intrinsics that you need to find and call and many more operation you can just implement in normal julia or llvm code. The C implementation of _mm_add_sd I pasted above, for example, should be fairely straightforward to implement in julia. If it does not yield the same code, it’s a compiiler bug that should be fixed.

Edit: and finally for the list of actual C intrinsics you need, I believe the GCC doc is usually pretty good. Maybe clang has one but I have always been using the GCC one.

kristoffer.carlsson · November 8, 2018, 8:11pm

If I want to call _mm_aesdec_si128 I would right now see what LLVM outputs Compiler Explorer and then call that intrinsic @llvm.x86.aesni.aesdec.

What should I do instead?

yuyichao · November 8, 2018, 8:15pm

Use it?

Just to be more clear, I’m replying to,

and

I’m just saying that if there’s a way to do it in julia code, don’t look for intrinsics, and when there’s a way to do it with generic intrinsics, don’t look for backend specific intrinsics. This is the right way to do things and also give the compiler the best input for optimization. I’m not saying that you shouldn’t use any backend intrinsics or other intrinsics, just that they shouldn’t be what you specifically look for.

kristoffer.carlsson · November 8, 2018, 8:17pm

Alright, so do I understand you correctly in that you recommend:

Implement the possible operations using the normal LLVM vector operations (this is basically SIMD.jl).
Provide a compatibility layer on top of these for the arithmetic instrinsics (https://github.com/llvm-mirror/clang/blob/master/www/builtins.py) which could simplify in porting code from C.
For the rest that do not have a vector operations, just use the intrinsic.

yuyichao · November 8, 2018, 8:18pm

Yes. And use generic intrinsics as much as possible. The llvm langref, gcc intrinsics doc and the clang implementations in the headers are your friends.