The StaticCompiler package was recently registered. This post records a successful experiment to statically compile a piece of Julia code into a small .so library on Linux, which is then loaded from Python and used in the training of a deep learning model.
TLDR
Static compilation to a stand-alone library does work on Linux but has rather significant restrictions on the functionality available. It is mostly useful for core computational routines. Roughly speaking, if it could be implemented in plain C and you write your Julia code in a corresponding way, it is a good candidate for static compilation. The final library ended up at a size of 20 kB.
Notes
- This post describes the state of static compilation in the first half of April 2022. The capabilities of StaticCompiler and surrounding tooling are expected to improve over time.
- The code in question is proprietary and cannot be shared. Challenges and solutions will be presented through small representative examples. Unfortunately those are of toy size, in contrast to the full code, but that is how it is.
- This should work similarly on Mac, but at this point StaticCompiler is not supported on Windows.
Problem Description
Without going into how or why, I needed the functionality of some Julia code to be available in Python, and the main option was to port the computational parts to C and compile them into a library that could be loaded from Python. The code in question was about 250 lines, split into five larger functions, one of which needed to be called from Python. Not huge by any means, but far larger than typical test examples, and actually intended to be used in production.
The point of this experiment was to see if I could avoid porting to C by using StaticCompiler to generate a stand-alone library.
Getting Started with StaticCompiler
As of writing, StaticCompiler works with Julia 1.7 and 1.8. I used 1.7.1 and 1.8.0-beta3 in my experiment. StaticCompiler was version 0.4.2.
Installing StaticCompiler
I'm using a clean environment for all demonstration examples:
$ mkdir test
$ cd test
$ julia --project=.
julia> using Pkg
julia> Pkg.add("StaticCompiler")
Generating a Stand-alone Library
For a first demonstration, consider this toy example to compute harmonic numbers, saved as test1.jl:
function test1(n)
s = 0.0
for i = 1:n
s += 1 / i
end
return s
end
To compile this into a stand-alone library, run
julia> using StaticCompiler
julia> include("test1.jl")
julia> compile_shlib(test1, (Int, ), filename = "test1")
We can look at the output:
$ ls -l test1.so
-rwxrwxr-x 1 gunnar gunnar 15736 apr 10 16:16 test1.so
A useful tool to inspect the generated library is the nm command:
$ nm test1.so
0000000000004020 b completed.0
w __cxa_finalize@@GLIBC_2.2.5
0000000000001040 t deregister_tm_clones
00000000000010b0 t __do_global_dtors_aux
0000000000003e48 d __do_global_dtors_aux_fini_array_entry
0000000000004018 d __dso_handle
0000000000003e50 d _DYNAMIC
00000000000011c8 t _fini
00000000000010f0 t frame_dummy
0000000000003e40 d __frame_dummy_init_array_entry
000000000000209c r __FRAME_END__
0000000000004000 d _GLOBAL_OFFSET_TABLE_
w __gmon_start__
0000000000002008 r __GNU_EH_FRAME_HDR
0000000000001000 t _init
w _ITM_deregisterTMCloneTable
w _ITM_registerTMCloneTable
0000000000001100 T julia_test1
0000000000001070 t register_tm_clones
0000000000004020 d __TMC_END__
I don't have enough insight to say what all these symbols mean; let's just say that this is a working baseline that we can compare against when things don't work. We can also see that the compiled function has been prefixed with julia_.
Calling the Library from Python
I have used ctypes to call the library. This is test1.py:
import ctypes
lib = ctypes.cdll.LoadLibrary("test1.so")
test1 = lib.julia_test1
test1.argtypes = (ctypes.c_int64,)
test1.restype = ctypes.c_double
print(test1(10))
Now we can run it:
$ python3 test1.py
2.9289682539682538
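As a sanity check, calling the function directly in the Julia REPL gives the same value:
julia> test1(10)
2.9289682539682538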
Restrictions and Workarounds
Unfortunately there are a lot of things you can't do in your code if you want to compile to a stand-alone library. Specifically, you can't do anything that requires support from the Julia runtime, which for example includes:
- heap allocations
- threading
- exceptions
- dynamic dispatch
- additional code generation
- IO
In the following sections I will review the limitations I ran into and how I diagnosed and worked around them.
Type Instability
If we slightly change test1.jl into test2.jl like this:
function test2(n)
s = 0
for i = 1:n
s += 1 / i
end
return s
end
and try to compile it
julia> include("test2.jl")
julia> compile_shlib(test2, (Int, ), filename = "test2")
the result is
ERROR: test2(Int64,) did not infer to a concrete type. Got Union{Float64, Int64}
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:33
[2] compile_shlib(f::Function, types::Tuple{DataType}, path::String, name::String; filename::String, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ StaticCompiler ~/.julia/packages/StaticCompiler/S1AWw/src/StaticCompiler.jl:261
[3] top-level scope
@ REPL[6]:1
This is a problem already at the static compilation stage. We can diagnose it with @code_warntype:
julia> @code_warntype test2(10)
MethodInstance for test2(::Int64)
from test2(n) in Main at /home/gunnar/pathology/whole-slide-vectors/blog/test2.jl:1
Arguments
#self#::Core.Const(test2)
n::Int64
Locals
@_3::Union{Nothing, Tuple{Int64, Int64}}
s::Union{Float64, Int64}
i::Int64
Body::Union{Float64, Int64}
1 (s = 0)
%2 = (1:n)::Core.PartialStruct(UnitRange{Int64}, Any[Core.Const(1), Int64])
(@_3 = Base.iterate(%2))
%4 = (@_3 === nothing)::Bool
%5 = Base.not_int(%4)::Bool
goto #4 if not %5
2 %7 = @_3::Tuple{Int64, Int64}
(i = Core.getfield(%7, 1))
%9 = Core.getfield(%7, 2)::Int64
%10 = s::Union{Float64, Int64}
%11 = (1 / i)::Float64
(s = %10 + %11)
(@_3 = Base.iterate(%2, %9))
%14 = (@_3 === nothing)::Bool
%15 = Base.not_int(%14)::Bool
goto #4 if not %15
3 goto #2
4 return s
In the REPL the union types are colored red to stand out. The problem is that s is initialized as an Int on the first line and changes to Float64 in the loop, but if n < 1 the loop isn't run, so the type of the return value depends on the input value.
The proper way to solve this is to initialize s to a Float64, as in test1.jl, and the type instability is gone. Another, inferior, option would be to instead return Float64(s) at the end. This makes the output type possible to infer, and the compilation goes through, leaving an internal type instability. In terms of static compilation that is actually okay, and the resulting library can successfully be called from Python. However, for the kind of computational code that can be compiled statically, type instabilities are usually a bad thing, so my recommendation is to eliminate them before attempting static compilation.
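For completeness, the inferior variant described above would look like this:
function test2(n)
    s = 0
    for i = 1:n
        s += 1 / i
    end
    return Float64(s)  # makes the return type inferable; the internal instability remains
end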
Exceptions
Exceptions require runtime support. This includes exceptions that don't trigger, as well as potential exceptions you may not have been aware of in functions you call. One example I ran into is test3.jl:
function test3(x)
return floor(Int, x)
end
This compiles fine with
julia> include("test3.jl")
julia> compile_shlib(test3, (Float64,), filename = "test3")
However, when you try to run this with test3.py,
import ctypes
lib = ctypes.cdll.LoadLibrary("test3.so")
test3 = lib.julia_test3
test3.argtypes = (ctypes.c_double,)
test3.restype = ctypes.c_int64
print(test3(2.5))
you fail already when trying to load the library:
$ python3 test3.py
Traceback (most recent call last):
File "test3.py", line 3, in <module>
lib = ctypes.cdll.LoadLibrary("test3.so")
File "/usr/lib/python3.8/ctypes/__init__.py", line 451, in LoadLibrary
return self._dlltype(name)
File "/usr/lib/python3.8/ctypes/__init__.py", line 373, in __init__
self._handle = _dlopen(self._name, mode)
OSError: test3.so: undefined symbol: jl_throw
This could be predicted from the nm output:
0000000000004038 b completed.0
w __cxa_finalize@@GLIBC_2.2.5
0000000000001070 t deregister_tm_clones
00000000000010e0 t __do_global_dtors_aux
0000000000003e18 d __do_global_dtors_aux_fini_array_entry
0000000000004030 d __dso_handle
0000000000003e20 d _DYNAMIC
0000000000001204 t _fini
0000000000001120 t frame_dummy
0000000000003e10 d __frame_dummy_init_array_entry
000000000000210c r __FRAME_END__
0000000000004000 d _GLOBAL_OFFSET_TABLE_
w __gmon_start__
0000000000002010 r __GNU_EH_FRAME_HDR
00000000000011e0 t gpu_gc_pool_alloc
00000000000011c0 t gpu_malloc
00000000000011d0 t gpu_report_oom
0000000000001000 t _init
w _ITM_deregisterTMCloneTable
w _ITM_registerTMCloneTable
U jl_invoke
U jl_throw
0000000000001130 T julia_test3
U malloc@@GLIBC_2.2.5
00000000000010a0 t register_tm_clones
0000000000004038 d __TMC_END__
Those U lines are bad news: undefined symbols.
The reason for the potential exception is that the floating point input might be out of range for Int,
julia> test3(1e100)
ERROR: InexactError: trunc(Int64, 1.0e100)
If you know that this will not occur in your code, you can work around the problem with
function test3(x)
return unsafe_trunc(Int, floor(x))
end
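After recompiling, it's a good idea to check for remaining undefined symbols before trying to load the library from Python (this check is my suggestion, not from the original post); if nothing is printed, no undefined symbols remain:
$ nm test3.so | grep " U "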
Vectors
Arrays are something that you frequently need for computational tasks and my code was no exception; in fact it needed to pass multiple vectors and one 3D array from Python to Julia. Unfortunately you run into the "no heap allocation" restriction just by trying to construct a vector. It is informative to see how this affects static compilation. test4.jl:
function test4()
x = ones(Int, 3)
return length(x)
end
Compilation works:
julia> include("test4.jl")
julia> compile_shlib(test4, (), filename = "test4")
However, when you try to run this with test4.py,
import ctypes
lib = ctypes.cdll.LoadLibrary("test4.so")
test4 = lib.julia_test4
test4.argtypes = ()
test4.restype = ctypes.c_int64
print(test4())
things go really badly:
$ python3 test4.py
Segmentation fault (core dumped)
There is no hint from the nm output that this will happen, so the best way to find heap allocations is to look for allocations when running the Julia code, e.g.
julia> @allocated test4()
80
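For comparison, test1 from before is allocation free (with a warm-up call first, so that compilation itself is not measured):
julia> test1(10);
julia> @allocated test1(10)
0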
So how can the heap allocations be worked around? To begin with, you might get your vector data as input so that you don't have to allocate the memory on the Julia side. The simplest way to pass a vector from Python to the library is with a separate data pointer and length, test5.py:
import ctypes
import numpy as np
lib = ctypes.cdll.LoadLibrary("test5.so")
test5 = lib.julia_test5
test5.argtypes = (ctypes.POINTER(ctypes.c_int64), ctypes.c_int64)
test5.restype = None
x = np.array([1, 2, 3], dtype=np.int64)  # make the dtype explicit to match c_int64
test5(x.ctypes.data_as(ctypes.POINTER(ctypes.c_int64)), len(x))
print(x)
What do we do with the pointer and length on the Julia side? It's tempting to try unsafe_wrap, which does actually produce the vector we need, but it fails with static compilation because it needs to allocate some additional memory. Instead we have to resort to defining our own vector type. The bad news is that we also need to define the methods used on those vectors. The good news is that once we have done that, we can keep the code we wrote for ordinary vectors. test5.jl:
struct CustomVector{T}
data::Ptr{T}
len::Int
end
Base.getindex(x::CustomVector, i::Int) = unsafe_load(x.data, i)
Base.setindex!(x::CustomVector, y, i::Int) = unsafe_store!(x.data, y, i)
Base.length(x::CustomVector) = x.len
function test5(data, len)
x = CustomVector(data, len)
for i in 1:length(x)
x[i] = x[i]^2
end
return
end
We can compile this function with
julia> include("test5.jl")
julia> compile_shlib(test5, (Ptr{Int}, Int), filename = "test5")
and run it:
$ python3 test5.py
[1 4 9]
Note that the use of those unsafe_* functions implies that we may get a segmentation fault or memory corruption if we index out of bounds. But bounds checking would require exceptions, or at least IO, and thus support from the runtime, so we are really on our own, much like if we were programming in C.
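The same pattern extends to the 3D array mentioned earlier. Here is a minimal sketch of my own (CustomArray3D is an illustrative name, not from the original code), using column-major linear indexing to match Julia's native layout; on the Python side the NumPy array would need to be in Fortran order, e.g. via np.asfortranarray, for the indices to agree:
struct CustomArray3D{T}
    data::Ptr{T}
    dims::NTuple{3,Int}
end
# Column-major linear index: i + (j - 1)*d1 + (k - 1)*d1*d2
function Base.getindex(x::CustomArray3D, i::Int, j::Int, k::Int)
    d1, d2, _ = x.dims
    return unsafe_load(x.data, i + d1 * ((j - 1) + d2 * (k - 1)))
end
Base.size(x::CustomArray3D) = x.dims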
Reverse Range Iteration
This one came as a nasty surprise to me. I had a loop that needed to run backwards; we can illustrate it with a reverse summation of the earlier harmonic number example, test6.jl:
function test6(n)
s = 0.0
for i in reverse(1:n)
s += 1 / i
end
return s
end
Compilation works
julia> include("test6.jl")
julia> compile_shlib(test6, (Int, ), filename = "test6")
but nm shows undefined symbols and the library can't be loaded.
This is annoying, and should be fixable in a future Julia version, but it is not particularly hard to work around, e.g.
function test6(n)
s = 0.0
i = n
while i >= 1
s += 1 / i
i -= 1
end
return s
end
IO
In this scenario IO is best done on the Python side, since it very much requires runtime support; this keeps the Julia parts focused on pure computations.
In my code I had a special case that isnāt intrinsically IO, of the kind
io = IOBuffer(data) # data is a Vector{UInt8}
seek(io, 53)
n = read(io, Int32)
I rewrote this with the function
function get_value(T::DataType, x::AbstractVector{UInt8}, n::Integer)
return unsafe_load(Ptr{T}(pointer(@view x[n:(n + sizeof(T) - 1)])))
end
and added
function get_value(T::DataType, x::CustomVector{UInt8}, n::Integer)
return unsafe_load(Ptr{T}(x.data + n - 1))
end
allowing it to be used with the custom wrapped vector as well as ordinary vectors.
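As a usage sketch, the IOBuffer snippet above then becomes a single call. Note that seek takes a zero-based offset while get_value takes a one-based index, so offset 53 corresponds to index 54:
n = get_value(Int32, data, 54)  # same bytes as seek(io, 53); read(io, Int32)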
Debugging
Debugging is hard work. For undefined symbols and segmentation faults the best approach is to test smaller parts of the code separately, if necessary using a divide and conquer strategy to narrow down which code causes the problem. Isolate the problematic code in a separate function, try to understand why it causes the problem, and experiment with workarounds. If you need to ask for help, this approach also provides you with a nice Minimal (non-)Working Example.
You can try to catch segmentation faults with a debugger, but there are no debugging symbols available, so you probably won't get much help from that.
When you get to the situation where everything runs but doesn't produce the correct results, it's still a good idea to test parts of the code separately, but you will probably reach a point where debug printing would help. Too bad that printing is IO and requires runtime support.
Here the StaticTools package (not yet registered) comes to your rescue. It only works on Julia 1.8, but it uses llvmcall magic to make printing possible without runtime support. This turned out to be invaluable for tracking down the final error in the interaction between Python and the compiled library.
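To give a flavor (a sketch based on the StaticTools README; the package is young and its API may change), debug printing can look like this, with c"..." creating a statically allocated string:
using StaticTools
function debug_demo(n)
    s = 0.0
    for i = 1:n
        s += 1 / i
    end
    printf(c"s = %f\n", s)  # printing without support from the Julia runtime
    return s
end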
Additionally StaticTools has some more tricks up its sleeve like malloc-backed vectors and custom strings, which might come in handy for your code.
Library Size
Most of the small examples above weigh in at about 16 kB, which is thus a lower bound on the library size. My full code ended up as a 20 kB library, which is quite respectable and surprisingly close to the small examples. It's small enough that I'm comfortable checking it into my repository. (Yes, there are good reasons against checking in generated code, but it makes life oh so much easier when it comes to packaging the Python code.)
Conclusions
There certainly were hurdles along the way, and the workarounds were often rather similar to programming in C. The library interface function needed 19 arguments (most of which would otherwise have been packaged into a struct or an object), which is quite unwieldy, but it only needs to be written once on each side of the call.
Would it have been faster to port the code to C? Maybe, but I don't really think so. Also, this was probably the largest application of StaticCompiler to date, and possibly the first time it was used to compile a library for use in production, so I expect that both StaticCompiler and the surrounding tooling will improve substantially in the future, making the process easier.
Acknowledgements
This wouldn't have been possible without the recent advances in the StaticCompiler package and its dependencies, as well as all the prior work that has gone into those packages.