I’m reading about cross-language LTO between Rust and C/Fortran/etc., and here’s a simple example of constant folding that comes from inlining an integer range summation from Rust into a C program with literal integer inputs. Obviously we don’t routinely deal with linking like that, but I’m wondering if there’s something between a ccall and an llvmcall, where a C library function is called and its LLVM IR is optimized together with the Julia code.
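For concreteness, here is the gap I mean, sketched with the two mechanisms that exist today (the function names are just illustrative): a ccall is opaque to Julia’s optimizer, while Base.llvmcall hands it raw IR that inlines and constant-folds with the surrounding Julia code.

```julia
# ccall: an opaque call into a shared library; LLVM only sees a call
# to an external symbol and cannot fold it, even with literal inputs.
# (The exact library name varies by platform.)
cbrt_ccall(x::Float64) = ccall((:cbrt, "libm"), Float64, (Float64,), x)

# llvmcall: the IR body is spliced into the caller, so it participates
# in Julia's usual optimization pipeline.
add1(x::Int64) = Base.llvmcall("""
    %r = add i64 %0, 1
    ret i64 %r
    """, Int64, Tuple{Int64}, x)

f() = add1(41)  # @code_llvm f() shows this folds to `ret i64 42`
```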
Discussion about other contrasts and parallels (thin vs fat LTO versus the whole-program optimization that already happens within a Julia process) and about when cross-language optimization is actually beneficial (constant folding across the boundary is rare, and inlining often doesn’t help) is welcome; lots to learn.
Is it that simple? Doesn’t this require a lot of internals support to cooperate properly with the existing system?
I think that this would be a huge feature. Compile your shared library with clang -fembed-bitcode=all and have a ccall variant that permits cross-language inlining. Or maybe a flag for Libdl.dlopen, to look for bitcode sections in the library.
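The first half of that is already possible with stock tools; only the inlining-aware ccall (or dlopen flag) is hypothetical. A hedged sketch, assuming an ELF platform with clang and binutils on PATH, and mylib.c standing in for your library source:

```julia
# Embed bitcode alongside the machine code (works with stock clang today):
run(`clang -O2 -fPIC -fembed-bitcode=all -c mylib.c -o mylib.o`)
# On ELF the bitcode lands in a .llvmbc section, which we can confirm:
run(pipeline(`readelf -S mylib.o`, `grep llvmbc`))

# The second half is the hypothetical part, e.g. something like
#   Libdl.dlopen("libmylib.so", Libdl.RTLD_LAZY; load_bitcode = true)
# where Julia would find the bitcode sections and make them available
# for inlining at ccall sites. No such flag exists today.
```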
For starters, we could compile most of the Julia runtime with that, and have most of the cheap runtime calls inlined.
For seconds, this would make a lot of intrinsics programming much nicer, basically because immintrin.h is well documented while LLVM’s processor intrinsics are an underdocumented mess. So this would enable us to just write the kernel in C (ever tried to use the juicy aesenc instructions from Julia?).
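To make that pain concrete, here is roughly what aesenc takes today, as a hedged sketch: you have to dig the intrinsic name and signature out of LLVM’s x86 backend yourself, whereas in C it’s just the documented _mm_aesenc_si128 from immintrin.h. This assumes an x86-64 CPU with AES-NI and Julia’s module form of llvmcall:

```julia
# One AES round via the raw LLVM intrinsic; the name and <2 x i64>
# signature come from reading LLVM's x86 backend, not any Julia docs.
const V2I64 = NTuple{2,VecElement{Int64}}

@inline aesenc(state::V2I64, key::V2I64) =
    Base.llvmcall(("""
        declare <2 x i64> @llvm.x86.aesni.aesenc(<2 x i64>, <2 x i64>)
        define <2 x i64> @entry(<2 x i64> %a, <2 x i64> %b) alwaysinline {
            %r = call <2 x i64> @llvm.x86.aesni.aesenc(<2 x i64> %a, <2 x i64> %b)
            ret <2 x i64> %r
        }
        """, "entry"), V2I64, Tuple{V2I64,V2I64}, state, key)

state = (VecElement(Int64(1)), VecElement(Int64(2)))
key   = (VecElement(Int64(3)), VecElement(Int64(4)))
aesenc(state, key)  # one hardware AES encryption round
```

With bitcode-carrying libraries, that whole dance would collapse to a C kernel using _mm_aesenc_si128 plus an inlinable ccall.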
I’ve never done automatic differentiation, let alone used Enzyme, so a noob question: is that issue implying that Enzyme works, or is intended to work, on LLVM IR across ccall boundaries? What went wrong there exactly? I couldn’t tell how the working and failing functions differed after that Numba name-unmangling.
Yes, Enzyme works at the LLVM level, which means it couldn’t care less about the frontend language: as long as it receives LLVM IR/bitcode, it can do anything. And Enzyme is “just” an optimisation pass, so the same applies to other passes.
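For reference, the single-language case looks like this (a minimal sketch with the Enzyme.jl API as I understand it; square is just an illustrative function). Enzyme differentiates the LLVM IR that Julia generates, which is exactly why the cross-ccall case hinges on whether IR/bitcode for the callee is available at all:

```julia
using Enzyme

square(x) = x * x

# Reverse-mode AD on the IR of `square`; returns a tuple of derivative
# tuples, one entry per Active argument.
Enzyme.autodiff(Reverse, square, Active, Active(3.0))  # ((6.0,),)
```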
It has been suggested a few times in the past that BinaryBuilder enable -fembed-bitcode=all by default, but that’s a significant enough infrastructure change that we never did it in practice; in theory it should be doable.
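In the meantime, a single recipe can opt in without any BinaryBuilder changes, just by exporting the flag in its build script. A hedged sketch of the relevant fragment (the surrounding recipe is the standard build_tarballs skeleton; names and paths are illustrative):

```julia
# Fragment of a build_tarballs.jl script section; the CFLAGS export is
# the only point here, the rest is the usual build boilerplate.
script = raw"""
cd ${WORKSPACE}/srcdir/mylib
export CFLAGS="-fembed-bitcode=all ${CFLAGS}"
make -j${nproc}
make install prefix=${prefix}
"""
```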
It’s possible to do this manually in both directions; I have done a fair amount of experiments, but it involved a lot of manual interaction with Clang and LLD. The biggest problem I ran into was caching and cache invalidation (of Julia code, in the Julia → C direction, which I investigated the most). That might be better now with things like CompilerCaching.jl. Various other problems included mapping between architecture tuples (there are some weird permutations of macOS/Darwin when doing LTO) and managing the bitcode in memory/on disk.
ClangCompiler.jl is/was an attempt at getting some of this working less manually, but I’m not sure of the status of that project now.
BB2 with the local compiler shards would also be a massive help here; managing the toolchain was a huge headache.