Successful Static Compilation of Julia Code for use in Production

The StaticCompiler package was recently registered. This post records a successful experiment to statically compile a piece of Julia code into a small .so library on Linux, which is then loaded from Python and used in training of a deep learning model.

TLDR

Static compilation to a stand-alone library does work on Linux but has rather significant restrictions on the functionality available. It is mostly useful for core computational routines. Roughly speaking, if it could be implemented in plain C and you write your Julia code in a corresponding way, it is a good candidate for static compilation. The final library ended up at a size of 20 kB.

Notes

  • This post describes the state of static compilation in the first half of April 2022. The capabilities of StaticCompiler and surrounding tooling are expected to improve over time.

  • The code in question is proprietary and cannot be shared. Challenges and solutions will be presented by small representative examples. Unfortunately those will be of a toy size, in contrast to the full code, but that is how it is.

  • This should work similarly on Mac but at this point StaticCompiler is not supported on Windows.

Problem Description

Without going into how or why, I needed the functionality of some Julia code to be available in Python and the main option was to port the computational parts to C and compile it into a library which could be loaded from Python. The code in question was about 250 lines of code, split into five larger functions, one of which needed to be called from Python. Not huge by any means but far larger than typical test examples and actually intended to be used in production.

The point of this experiment was to see if I could avoid porting to C by using StaticCompiler to generate a stand-alone library.

Getting Started with StaticCompiler

As of writing, StaticCompiler works with Julia 1.7 and 1.8. I used 1.7.1 and 1.8.0-beta3 in my experiment. StaticCompiler was version 0.4.2.

Installing StaticCompiler

I'm using a clean environment for all demonstration examples:

$ mkdir test
$ cd test
$ julia --project=.
julia> using Pkg
julia> Pkg.add("StaticCompiler")

Generating a Stand-alone Library

For a first demonstration, consider this toy example to compute harmonic numbers, saved as test1.jl:

function test1(n)
    s = 0.0
    for i = 1:n
        s += 1 / i
    end
    return s
end

To compile this into a stand-alone library, run

julia> using StaticCompiler
julia> include("test1.jl")
julia> compile_shlib(test1, (Int, ), filename = "test1")

We can look at the output:

$ ls -l test1.so 
-rwxrwxr-x 1 gunnar gunnar 15736 apr 10 16:16 test1.so

A useful tool to inspect the generated library is the nm command:

$ nm test1.so 
0000000000004020 b completed.0
                 w __cxa_finalize@@GLIBC_2.2.5
0000000000001040 t deregister_tm_clones
00000000000010b0 t __do_global_dtors_aux
0000000000003e48 d __do_global_dtors_aux_fini_array_entry
0000000000004018 d __dso_handle
0000000000003e50 d _DYNAMIC
00000000000011c8 t _fini
00000000000010f0 t frame_dummy
0000000000003e40 d __frame_dummy_init_array_entry
000000000000209c r __FRAME_END__
0000000000004000 d _GLOBAL_OFFSET_TABLE_
                 w __gmon_start__
0000000000002008 r __GNU_EH_FRAME_HDR
0000000000001000 t _init
                 w _ITM_deregisterTMCloneTable
                 w _ITM_registerTMCloneTable
0000000000001100 T julia_test1
0000000000001070 t register_tm_clones
0000000000004020 d __TMC_END__

I don't have enough insight to say what all these symbols mean; let's just say that this is a working baseline we can compare against when things don't work. We can also see that the compiled function has been prefixed with julia_.

Calling the Library from Python

I have used ctypes to call the library. This is test1.py:

import ctypes

lib = ctypes.cdll.LoadLibrary("test1.so")
test1 = lib.julia_test1
test1.argtypes = (ctypes.c_int64,)
test1.restype = ctypes.c_double
print(test1(10))

Now we can run it:

$ python3 test1.py 
2.9289682539682538

Restrictions and Workarounds

Unfortunately there are a lot of things you can't do in your code if you want to compile to a stand-alone library. Specifically, you can't do anything that requires support from the Julia runtime, which for example includes:

  • heap allocations
  • threading
  • exceptions
  • dynamic dispatch
  • additional code generation
  • IO

In the following sections I will review the limitations I ran into and how I diagnosed and worked around them.

Type Instability

If we slightly change test1.jl to test2.jl like this:

function test2(n)
    s = 0
    for i = 1:n
        s += 1 / i
    end
    return s
end

and try to compile it

julia> include("test2.jl")
julia> compile_shlib(test2, (Int, ), filename = "test2")

the result is

ERROR: test2(Int64,) did not infer to a concrete type. Got Union{Float64, Int64}
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] compile_shlib(f::Function, types::Tuple{DataType}, path::String, name::String; filename::String, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ StaticCompiler ~/.julia/packages/StaticCompiler/S1AWw/src/StaticCompiler.jl:261
 [3] top-level scope
   @ REPL[6]:1

This is a problem already at the static compilation stage. We can diagnose it with @code_warntype:

julia> @code_warntype test2(10)
MethodInstance for test2(::Int64)
  from test2(n) in Main at /home/gunnar/pathology/whole-slide-vectors/blog/test2.jl:1
Arguments
  #self#::Core.Const(test2)
  n::Int64
Locals
  @_3::Union{Nothing, Tuple{Int64, Int64}}
  s::Union{Float64, Int64}
  i::Int64
Body::Union{Float64, Int64}
1         (s = 0)
    %2  = (1:n)::Core.PartialStruct(UnitRange{Int64}, Any[Core.Const(1), Int64])
          (@_3 = Base.iterate(%2))
    %4  = (@_3 === nothing)::Bool
    %5  = Base.not_int(%4)::Bool
          goto #4 if not %5
2   %7  = @_3::Tuple{Int64, Int64}
          (i = Core.getfield(%7, 1))
    %9  = Core.getfield(%7, 2)::Int64
    %10 = s::Union{Float64, Int64}
    %11 = (1 / i)::Float64
          (s = %10 + %11)
          (@_3 = Base.iterate(%2, %9))
    %14 = (@_3 === nothing)::Bool
    %15 = Base.not_int(%14)::Bool
          goto #4 if not %15
3         goto #2
4         return s

In the REPL the union types are colored red to stand out. The problem is that s is initialized as an Int on the first line and changes to Float64 in the loop, but if n < 1 the loop is never run, so the type of the return value depends on the input value.

The proper way to solve this is to initialize s to a Float64, as in test1.jl; then the type instability is gone. Another, inferior, option would be to instead return Float64(s) at the end (see the sketch below). This makes the output type inferable and the compilation goes through, leaving an internal type instability. In terms of static compilation that is actually okay, and the resulting library can successfully be called from Python. However, for the kind of computational code that can be statically compiled, type instabilities are usually a bad thing, so my recommendation would be to eliminate them before attempting static compilation.
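For illustration, the inferior variant would look like this:

function test2(n)
    s = 0
    for i = 1:n
        s += 1 / i
    end
    return Float64(s)  # return type is now inferable; s itself is still type unstable
end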

Exceptions

Exceptions require runtime support. This includes exceptions that never trigger, as well as potential exceptions you may not even be aware of in functions you call. One example I ran into is test3.jl:

function test3(x)
    return floor(Int, x)
end

This compiles fine with

julia> include("test3.jl")
julia> compile_shlib(test3, (Float64,), filename = "test3")

However, when you try to run this with test3.py,

import ctypes

lib = ctypes.cdll.LoadLibrary("test3.so")
test3 = lib.julia_test3
test3.argtypes = (ctypes.c_double,)
test3.restype = ctypes.c_int64
print(test3())

you fail already when trying to load the library:

$ python3 test3.py 
Traceback (most recent call last):
  File "test3.py", line 3, in <module>
    lib = ctypes.cdll.LoadLibrary("test3.so")
  File "/usr/lib/python3.8/ctypes/__init__.py", line 451, in LoadLibrary
    return self._dlltype(name)
  File "/usr/lib/python3.8/ctypes/__init__.py", line 373, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: test3.so: undefined symbol: jl_throw

This could be predicted from the nm output:

0000000000004038 b completed.0
                 w __cxa_finalize@@GLIBC_2.2.5
0000000000001070 t deregister_tm_clones
00000000000010e0 t __do_global_dtors_aux
0000000000003e18 d __do_global_dtors_aux_fini_array_entry
0000000000004030 d __dso_handle
0000000000003e20 d _DYNAMIC
0000000000001204 t _fini
0000000000001120 t frame_dummy
0000000000003e10 d __frame_dummy_init_array_entry
000000000000210c r __FRAME_END__
0000000000004000 d _GLOBAL_OFFSET_TABLE_
                 w __gmon_start__
0000000000002010 r __GNU_EH_FRAME_HDR
00000000000011e0 t gpu_gc_pool_alloc
00000000000011c0 t gpu_malloc
00000000000011d0 t gpu_report_oom
0000000000001000 t _init
                 w _ITM_deregisterTMCloneTable
                 w _ITM_registerTMCloneTable
                 U jl_invoke
                 U jl_throw
0000000000001130 T julia_test3
                 U malloc@@GLIBC_2.2.5
00000000000010a0 t register_tm_clones
0000000000004038 d __TMC_END__

Those U lines are bad news: undefined symbols.
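For a quick check (a shell one-liner of my own, not part of the StaticCompiler workflow) you can filter the nm output for undefined symbols:

$ nm test3.so | grep ' U '
                 U jl_invoke
                 U jl_throw
                 U malloc@@GLIBC_2.2.5

Undefined libc symbols like malloc are resolved when the library is loaded, but any undefined jl_* symbols mean that the library depends on the Julia runtime and cannot be loaded stand-alone.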

The reason for the potential exception is that the floating point input might be out of range for Int,

julia> test3(1e100)
ERROR: InexactError: trunc(Int64, 1.0e100)

If you know that this will not occur in your code you can work around the problem with

function test3(x)
    return unsafe_trunc(Int, floor(x))
end
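Note that unsafe_trunc is documented to return an arbitrary value, rather than throw, when the input is not representable in the target type, which is exactly what removes the dependency on the runtime.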

Vectors

Arrays are something you frequently need for computational tasks, and my code was no exception; in fact, it needed to pass multiple vectors and one 3D array from Python to Julia. Unfortunately, you run into the "no heap allocations" restriction just by trying to construct a vector. It is informative to see how this affects static compilation. test4.jl:

function test4()
    x = ones(Int, 3)
    return length(x)
end

Compilation works:

julia> include("test4.jl")
julia> compile_shlib(test4, (), filename = "test4")

However, when you try to run this with test4.py,

import ctypes

lib = ctypes.cdll.LoadLibrary("test4.so")
test4 = lib.julia_test4
test4.argtypes = ()
test4.restype = ctypes.c_int64
print(test4())

things go really badly:

$ python3 test4.py 
Segmentation fault (core dumped)

There is no hint from the nm output that this will happen, so the best way to find heap allocations is to look for allocations when running the Julia code normally, e.g.

julia> @allocated test4()
80
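Ideally this should report 0 before you attempt static compilation.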

So how can the heap allocations be worked around? To begin with you might get your vector data as input so you don't have to allocate the memory on the Julia side. The simplest way to pass a vector from Python to the library is with a separate data pointer and length, test5.py:

import ctypes
import numpy as np

lib = ctypes.cdll.LoadLibrary("test5.so")
test5 = lib.julia_test5
test5.argtypes = (ctypes.POINTER(ctypes.c_int64), ctypes.c_int64)
test5.restype = None
x = np.array([1, 2, 3])
test5(x.ctypes.data_as(ctypes.POINTER(ctypes.c_int64)), len(x))
print(x)

What do we do with the pointer and length on the Julia side? It's tempting to reach for unsafe_wrap, which does produce the vector we need, but it fails under static compilation because it needs to allocate some additional memory. Instead we have to resort to defining our own vector type. The bad news is that we also need to define the methods used on those vectors. The good news is that once we have done that, we can keep the code we wrote for ordinary vectors. test5.jl:

struct CustomVector{T}
    data::Ptr{T}
    len::Int
end
Base.getindex(x::CustomVector, i::Int) = unsafe_load(x.data, i)
Base.setindex!(x::CustomVector, y, i::Int) = unsafe_store!(x.data, y, i)
Base.length(x::CustomVector) = x.len

function test5(data, len)
    x = CustomVector(data, len)
    for i in 1:length(x)
        x[i] = x[i]^2
    end
    return
end

We can compile this function with

julia> include("test5.jl")
julia> compile_shlib(test5, (Ptr{Int}, Int), filename = "test5")

and run it:

$ python3 test5.py 
[1 4 9]

Note that the use of those unsafe_* functions implies that we may get a segmentation fault or memory corruption when indexing out of bounds. But bounds checking would require exceptions, or at least IO, and thus support from the runtime, so we are really on our own here, much as if we were programming in C.
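The same approach extends to the 3D array. As a minimal sketch (the names and layout here are illustrative, not the exact code from my project), a pointer-backed 3D wrapper could look like this, assuming the data is laid out in Julia's column-major order:

struct CustomArray3D{T}
    data::Ptr{T}
    dims::NTuple{3, Int}
end
Base.size(x::CustomArray3D) = x.dims
# Convert (i, j, k) to a column-major linear index, matching Julia's memory layout.
function Base.getindex(x::CustomArray3D, i::Int, j::Int, k::Int)
    d1, d2, _ = x.dims
    return unsafe_load(x.data, i + d1 * ((j - 1) + d2 * (k - 1)))
end

Since numpy arrays are row-major (C order) by default, the Python side needs to pass np.asfortranarray(a), or equivalently you can swap the index order on one side.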

Reverse Range Iteration

This one came as a nasty surprise to me. I had a loop that needed to run backwards; we can illustrate it with a reverse summation of the earlier harmonic number example, test6.jl:

function test6(n)
    s = 0.0
    for i in reverse(1:n)
        s += 1 / i
    end
    return s
end

Compilation works

julia> include("test6.jl")
julia> compile_shlib(test6, (Int, ), filename = "test6")

but nm shows undefined symbols and the library can't be loaded.

This is annoying, and should be fixable in a future Julia version, but not particularly hard to work around, e.g.

function test6(n)
    s = 0.0
    i = n
    while i >= 1
        s += 1 / i
        i -= 1
    end
    return s
end

IO

In this scenario IO is best done on the Python side, since IO very much requires runtime support; that keeps the Julia parts focused on pure computation.

In my code I had a special case that isn't intrinsically IO, of the kind

io = IOBuffer(data)   # data is a Vector{UInt8}
seek(io, 53)
n = read(io, Int32)

I rewrote this with the function

function get_value(T::DataType, x::AbstractVector{UInt8}, n::Integer)
    return unsafe_load(Ptr{T}(pointer(@view x[n:(n + sizeof(T) - 1)])))
end

and added

function get_value(T::DataType, x::CustomVector{UInt8}, n::Integer)
    return unsafe_load(Ptr{T}(x.data + n - 1))
end

allowing it to be used with the custom wrapped vector as well as ordinary vectors.
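The IOBuffer snippet above then becomes a single call. Note the one-based index: seek(io, 53) positions the stream at byte offset 53 (zero-based), which corresponds to index 54 here:

n = get_value(Int32, data, 54)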

Debugging

Debugging is hard work. For undefined symbols and segmentation faults, the best approach is to test smaller parts of the code separately, if necessary using a divide and conquer strategy to narrow down which code causes the problem. Isolate the problematic code in a separate function, try to understand why it causes the problem, and experiment with workarounds. If you need to ask for help, this approach also provides you with a nice Minimal (non-)Working Example.

You can try to catch segmentation faults with a debugger, but there are no debugging symbols available, so you probably won't get much help from that.

When you get to the point where everything runs but doesn't produce the correct results, it's still a good idea to test parts of the code separately, but you will probably reach a point where debug printing would help. Too bad that printing is IO and requires runtime support.

Here the StaticTools package (not yet registered) comes to the rescue. It only works on Julia 1.8, but it uses llvmcall magic to make printing possible without runtime support. This turned out to be invaluable for tracking down the final error in the interaction between Python and the compiled library.

Additionally StaticTools has some more tricks up its sleeve like malloc-backed vectors and custom strings, which might come in handy for your code.
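As a minimal sketch of how the printing looks (assuming StaticTools' printf and its c"..." static-string macro; remember that this requires Julia 1.8):

using StaticTools

function test_debug(n)
    s = 0.0
    for i in 1:n
        s += 1 / i
        printf(c"i = %d\n", i)  # llvmcall-based printf, no Julia runtime needed
        printf(c"s = %f\n", s)
    end
    return s
end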

Library Size

Most of the small examples above weigh in at about 16 kB, which is thus a lower bound on the library size. My full code ended up as a 20 kB library, which is really quite respectable and surprisingly close to the small examples. It's small enough that I'm comfortable checking it into my repository. (Yes, there are good reasons against checking in generated code, but it makes life oh so much easier when it comes to packaging the Python code.)

Conclusions

There certainly were hurdles along the way, and the workarounds were often rather similar to programming in C. The library interface function needed 19 arguments (most of which would otherwise have been packaged into a struct or an object), which is quite unwieldy but only needs to be written once on each side of the call.

Would it have been faster to port the code to C? Maybe, but I don't really think so. Also, this was probably the largest application of StaticCompiler to date, and possibly the first time it has been used to compile a library for production use, so I expect both StaticCompiler and the surrounding tooling to improve substantially in the future, making the process easier.

Acknowledgements

This wouldn't have been possible without the recent advances in the StaticCompiler package and its dependencies, as well as all the prior work that has gone into those packages.


This is an awesome update. Thanks for sharing it.


Great job! The progress in Julia's ecosystem is fantastic; I'm impressed.


I'd be interested in understanding what can be made possible here in the future and what can't. For example, are heap allocations a long way off? That seems like the most basic ingredient of a lot of programs that would be nice to have, then exceptions, then IO and threading. It's probably much harder to design complex code without allocations than without all the other parts.


What is needed is to be able to link the Julia runtime into the executable, probably via static linking.

Linking to the runtime is the easy part. The hard part is relocating pointers that were meant to be used within a session and get baked into the generated code.

I've been working on solving this, but it's a difficult and thorny problem. My WIP solution relies on having a running Julia session from which you would run the compiled code, like the examples in the readme of StaticCompiler.jl.

Those examples are in fact linked to the runtime and can use the garbage collector, allocate memory, throw errors, etc. There are still lots of bugs, e.g. IO can be hit and miss.


Maybe a naive question, but is this going to be easier with the work on saving more precompilation state that Tim Holy and Valentin Churavy are doing? (I'm thinking of this.)

(Edit: Give to Caesar what is Caesar's, add Churavy :smiley:)

(I'd say "Tim Holy and Valentin Churavy" since his contribution to that PR was huge.)

It may get easier, if for no other reason than reducing the number of external dependencies required to implement StaticCompiler. And yes, much of what that PR does is pointer relocation. It seems very doable to support heap allocations, and perhaps much of the restricted list short of "additional code generation."


It's great that you got something working (for production)!

FYI: There might be a way to get this to work on Windows too:

Cosmopolitan Libc allows "αcτµαlly pδrταblε εxεcµταblε", i.e. binary executables that run on Linux, Windows, and more. E.g.:

I've managed to compile Lua, QuickJS, and now Python2.7 and Python3.6. Are there web-friendly languages that would benefit more from a Cosmopolitan build?

I've brought this up before; it would be cool if this were also done for Julia. But in this case it seems that work isn't needed, since only the top-level language/runtime needs supporting(?), and as that's already done for Python, you could just call your compiled Julia library from that Python implementation.

I'm still somewhat interested in what breaks Windows support, since the Julia runtime isn't used (so it seems only pure computation is allowed), given all the limitations:

I think there might be another workaround for the heap: Libc.malloc and free etc. should also work? Could regular allocations work with just the GC disabled? It seems like the GC is the problem, not strictly allocations or the heap.

I looked a bit, and I see that Windows support was actually merged at one point, with testing (that later went away):

I'm not sure what's still missing, and whether something might still work on Windows with even more restrictions (if not via my proposed alternative above, which would have few or no restrictions compared to other platforms)?

I am wondering if you would consider copying this over to forem.julialang.org as an article. That forum is meant for long-form content such as this, would give it more traction, and is officially maintained by JuliaLang.
