Julia on embedded devices & validation thereof

Just to give some optimism for future progress, you can get this number waaaaaay down with the sort of approach used in the (experimental) StaticCompiler.jl:

julia> using StaticCompiler, StaticTools

julia> hello() = println(c"Hello, world!")
hello (generic function with 1 method)

julia> compile_executable(hello, (), "./")
ld: warning: object file (./hello.o) was built for newer OSX version (10.14) than being linked (10.12)
"/Users/me/hello"

shell> ls -alh hello
-rwxr-xr-x  1 me  staff   8.4K May 22 16:38 hello

shell> /usr/bin/time -l ./hello
Hello, world!
        0.00 real         0.00 user         0.00 sys
   1798144  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
       455  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
         1  involuntary context switches

So 1.8 MB total memory usage (less than ls on my system) and an 8.4 kB executable.


Or for a less trivial example:

using StaticCompiler
using StaticTools
using LoopVectorization

@inline function mul!(C::MallocArray, A::MallocArray, B::MallocArray)
    @turbo for n ∈ indices((C,B), 2), m ∈ indices((C,A), 1)
        Cmn = zero(eltype(C))
        for k ∈ indices((A,B), (2,1))
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
    return C
end

function loopvec_matrix(argc::Int, argv::Ptr{Ptr{UInt8}})
    argc == 3 || return printf(stderrp(), c"Incorrect number of command-line arguments\n")
    rows = parse(Int64, argv, 2)            # First command-line argument
    cols = parse(Int64, argv, 3)            # Second command-line argument

    # LHS
    A = MallocArray{Float64}(undef, rows, cols)
    @turbo for i ∈ axes(A, 1)
        for j ∈ axes(A, 2)
           A[i,j] = i*j
        end
    end

    # RHS
    B = MallocArray{Float64}(undef, cols, rows)
    @turbo for i ∈ axes(B, 1)
        for j ∈ axes(B, 2)
           B[i,j] = i*j
        end
    end

    # Matrix multiplication
    C = MallocArray{Float64}(undef, cols, cols)
    mul!(C, B, A)

    # Print to stdout
    printf(C)

    # Clean up matrices
    free(A)
    free(B)
    free(C)
end

# Attempt to compile
path = compile_executable(loopvec_matrix, (Int64, Ptr{Ptr{UInt8}}), "./")

which gives us

$ ls -alh loopvec_matrix
-rwxr-xr-x  1 me  staff    21K May 22 16:30 loopvec_matrix

$ ./loopvec_matrix 10 3
3.850000e+02	7.700000e+02	1.155000e+03
7.700000e+02	1.540000e+03	2.310000e+03
1.155000e+03	2.310000e+03	3.465000e+03

$ /usr/bin/time -l ./loopvec_matrix 100 100
[output omitted...]
        0.04 real         0.00 user         0.00 sys
   2113536  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
       532  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
       127  voluntary context switches
         3  involuntary context switches

a 21 kB executable that uses 2.1 MB to multiply two 100x100 matrices. For comparison, ls:

$ /usr/bin/time -l ls -alh loopvec_matrix
-rwxr-xr-x  1 me  staff    21K May 22 16:30 loopvec_matrix
        0.00 real         0.00 user         0.00 sys
   2416640  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
       609  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
        14  involuntary context switches

Yeah, preallocating + zero allocations is great for improving the performance of isolated methods, and the approach can be scaled up to a whole program, but it's still very much a limitation. I was thinking more of Julia being used for smaller embedded programs once StaticCompiler.jl matures, not anything on the scale of video game programming where, from what I've read, people need allocators or an incremental GC (Unity, Unreal). There have been Discourse discussions about these things for soft real-time in general, but I don't know of any ongoing effort to implement them. I have read about recent work on improving escape analysis, in part to reduce garbage (remember, mutable/immutable is not synonymous with heap/stack). EDIT: This comment does a better job summarizing than I ever could.
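
For reference, the preallocation pattern in question looks roughly like this (a minimal sketch, not from the thread):

function smooth!(out::Vector{Float64}, x::Vector{Float64})
    # reuse a caller-provided buffer instead of allocating a new one
    @inbounds for i in 2:length(x)-1
        out[i] = (x[i-1] + x[i] + x[i+1]) / 3   # interior points only
    end
    return out
end

x = rand(1000)
out = similar(x)   # allocate once, up front
smooth!(out, x)    # then zero allocations per call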


Is it feasible to replace MallocArray+free with Array+escape analysis, if not now then one day?

If that lets the Julia compiler put that array on the stack, then probably yes in principle?

You can also use arrays that are designed to be on the stack today, like StrideArrays and StaticArrays. The main limitation is that you need to know their length at compile time, because otherwise that's effectively a runtime type instability.
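
For example, with StaticArrays.jl the length is encoded in the type, so the compiler can keep the data off the GC heap entirely (a minimal sketch):

using StaticArrays

v = SVector(1.0, 2.0, 3.0)     # SVector{3, Float64}: length fixed at compile time
M = @SMatrix [1.0 2.0 3.0;
              4.0 5.0 6.0;
              7.0 8.0 9.0]
w = M * v                      # computed without touching the GC heap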

If you link to libjulia rather than making a fully standalone executable (e.g., use StaticCompiler.compile rather than StaticCompiler.compile_executable), then actually a lot more should already be possible.
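
For instance, the linked-to-libjulia workflow from the StaticCompiler README looked roughly like this at the time (a sketch; the API is experimental and may well have changed):

using StaticCompiler

fib(n) = n <= 1 ? n : fib(n - 1) + fib(n - 2)

# Compile to disk; the result still calls into libjulia at runtime:
fib_compiled, path = compile(fib, Tuple{Int64}, "fib")
fib_compiled(10)     # 55

# ...later, even from a fresh session:
fib_loaded = load_function(path)
fib_loaded(10)       # 55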


You can use Libc.malloc or any other C API to allocate memory. You can then wrap a Julia array around it with unsafe_wrap. If you pass the keyword own = false to unsafe_wrap, the GC will not free the memory automatically. You can then manually Libc.free when you are done.
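
Concretely, that pattern looks something like this (a minimal sketch):

# Allocate room for 100 Float64s outside the GC heap:
p = convert(Ptr{Float64}, Libc.malloc(100 * sizeof(Float64)))

# Wrap it as a standard Vector{Float64}; own = false means the GC
# will not free the underlying memory when the Array is collected:
A = unsafe_wrap(Array, p, 100; own = false)

A .= 1.0        # use it like any other Array
Libc.free(p)    # manual cleanup; A must not be used after this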

This process does allocate a little memory through the Julia GC for the array data structure. If you can tolerate that, then you can use this to obtain a standard Array from manually allocated memory. Otherwise, you will need to use MallocArray as described above.

If you do want the GC to help free memory, I recently put together a package that uses a finalizer to free memory while using the standard Array interface for construction. This provides an abstract interface for use with custom memory allocator schemes.
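
The finalizer version of the same idea is roughly the following (a hypothetical helper for illustration, not the package's actual API):

function malloc_vector(::Type{T}, n::Integer) where T
    p = convert(Ptr{T}, Libc.malloc(n * sizeof(T)))
    A = unsafe_wrap(Array, p, n; own = false)
    # free the malloc'd buffer when the GC collects the wrapper:
    finalizer(_ -> Libc.free(p), A)
    return A
end

v = malloc_vector(Float64, 100)   # freed automatically, eventually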


Speaking of Julia on Arduinos – @sukera's blog post on exactly that just dropped!

https://seelengrab.github.io/articles/Running%20Julia%20baremetal%20on%20an%20Arduino/

Excerpt of "Running Julia baremetal on an Arduino"
  1. The initial unsafe_load from our pointer triggered undefined behavior, since the initial value of a given pointer is not defined. LLVM saw that, saw that we actually used the read value and eliminated both read & store due to it being undefined behavior and it being free to pick the value it "read" to be the one we wrote, making the load/store pair superfluous.
  2. The now empty loops serve no purpose, so they got removed as well.

In C, you can solve this problem by using volatile. That keyword is a very strict way of telling the compiler "Look, I want every single read & write from and to this variable to happen. Don't eliminate any and don't shuffle them around (except for non-volatile, you're free to shuffle those around)".

I only ever tried it out twice, so I had assumed unsafe_load was a guaranteed runtime action, not "optimized" to nothing. I had read about volatile variables in passing but always dismissed it as a language-specific thing I'll never have to care about. But now I'm wondering: what assumptions and intentions are behind optimizing away an action that seems intended for runtime I/O?

Well, so this is an LLVM thing as opposed to a Julia thing, but the key phrase is "undefined behavior"… in the event of undefined behavior, the compiler is free to do absolutely anything it wants – say, exit the program, delete your boot volume, whatever – it's just being nice by not doing that and instead just choosing a convenient nearby value. As Wikipedia puts it:

In the C community, undefined behavior may be humorously referred to as "nasal demons", after a comp.std.c post that explained undefined behavior as allowing the compiler to do anything it chooses, even "to make demons fly out of your nose".[1]


Ah, I just didn't understand what undefined behavior meant. The way I'm reading it now is that the compiler assumed, to some benefit, that this action would never happen. It's still sort of odd to me that volatile exists to make the compiler stop doing that for some things, as if you're telling it "no wait, this behavior isn't undefined actually."


Nope! It's just a pointer dereference under the hood, just like x = *p in C or C++ would be. It obeys almost all the same semantics as in those languages, in particular how LLVM optimizes it. In truth, Julia is not some magic pointer-free machine: it too uses pointers under the hood, simply because you need some sort of referencing scheme to say "I have a piece of memory somewhere, and this tells you where it is".

Well, while undefined behavior technically allows the compiler to do whatever (the C standard doesn't define what should happen), optimizing compilers (which is what all the compilers we regularly use are) will generally try to choose behavior that results in faster or smaller code. This means, for example, that if a read from an uninitialized variable (or a dereference of an uninitialized pointer!) happens, the compiler is free to choose behavior that allows it to eliminate more code (it's not actively making that choice; it's an emergent behavior of multiple optimization passes).

What volatile does in this specific case is communicate to the compiler that the read of this variable has a defined behavior, which actually is the case! Microcontrollers initialize their registers (and sometimes memory, though that's often done explicitly in the bootloader) to known-good values. So putting volatile there marks, for the compiler, that any read & write from/to that variable/address will be observed by something, i.e. it has a side effect that's not transparent to the compiler itself. It's not that the read then becomes defined behavior, but that the syntax & semantics move to a partly different set of definitions under which a read from an uninitialized variable is well defined.
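
Julia has no volatile keyword, but the blog post gets the same effect with llvmcall. The idea looks roughly like this (a sketch simplified from the post, assuming a 64-bit target where Ptr lowers to i64):

# A store the optimizer must not eliminate or reorder:
@inline function volatile_store!(p::Ptr{UInt8}, v::UInt8)
    Base.llvmcall(
        """
        %ptr = inttoptr i64 %0 to i8*
        store volatile i8 %1, i8* %ptr, align 1
        ret void
        """,
        Cvoid, Tuple{Ptr{UInt8}, UInt8}, p, v)
end

# ...and the matching load:
@inline function volatile_load(p::Ptr{UInt8})
    Base.llvmcall(
        """
        %ptr = inttoptr i64 %0 to i8*
        %v = load volatile i8, i8* %ptr, align 1
        ret i8 %v
        """,
        UInt8, Tuple{Ptr{UInt8}}, p)
end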


On Arduino, what if you let C handle the hardware register interfaces and just want a Julia algorithm to handle the processing of the inputs and outputs? Does that make the workflow any easier or more reliable?
Is that possible without going from Julia to C first?


I've done that: you compile a static library from Julia code exposing some functions, and then link it with the C code. It's a bit annoying to do, but it's possible.
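
The shared-library variant of that flow appears later in this thread via compile_shlib; a minimal sketch (hypothetical kernel, and note the exported symbol gets a julia_ prefix):

using StaticCompiler, StaticTools

# A kernel we want to call from C:
scale(x::Float64) = 2.0 * x

compile_shlib(scale, (Float64,), "./", "scale")
# Produces ./scale.so (or .dylib) exporting julia_scale, which the
# C side declares as:  double julia_scale(double);
# and links with:      cc main.c ./scale.so -o main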


I need to try this too, linking it as a library into C++ without requiring a Julia install anymore. I imagine that is possible with Windows and Linux C++ projects.

Ah yeah, there's an example of that here: https://github.com/brenhinkeller/StaticTools.jl#calling-compiled-julia-library-from-python

(the example is calling from Python, but once you have the .so/.dylib you can just dlopen it from any language you like)

Julia side:

using StaticCompiler
using StaticTools
using LoopVectorization
using Base: RefValue

@inline function mul!(C::MallocArray, A::MallocArray, B::MallocArray)
    @turbo for n ∈ indices((C,B), 2), m ∈ indices((C,A), 1)
        Cmn = zero(eltype(C))
        for k ∈ indices((A,B), (2,1))
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
    return 0
end

# this will let us accept pointers to MallocArrays
mul!(C::Ref,A::Ref,B::Ref) = mul!(C[], A[], B[])

# Note that we have to specify a concrete type for each argument when compiling!
# So not just any MallocArray, but in this case specifically MallocArray{Float64,2}
# (AKA MallocMatrix{Float64})
tt = (RefValue{MallocMatrix{Float64}}, RefValue{MallocMatrix{Float64}}, RefValue{MallocMatrix{Float64}})
compile_shlib(mul!, tt, "./", "mul_inplace")

Python side:

import ctypes as ct
import numpy as np

class MallocMatrix(ct.Structure):
    _fields_ = [("pointer", ct.c_void_p),
                ("length", ct.c_int64),
                ("s1", ct.c_int64),
                ("s2", ct.c_int64)]

def mmptr(A):
    # Build a MallocMatrix header over the numpy array's data. numpy is
    # row-major while Julia is column-major, so the dimensions are passed
    # swapped: the Julia side sees the transpose.
    ptr = A.ctypes.data_as(ct.c_void_p)
    a = MallocMatrix(ptr, ct.c_int64(A.size), ct.c_int64(A.shape[1]), ct.c_int64(A.shape[0]))
    return ct.byref(a)

lib = ct.CDLL("./mul_inplace.dylib")

A = np.ones((10,10))
B = np.ones((10,10))
C = np.ones((10,10))

Aptr = mmptr(A)
Bptr = mmptr(B)
Cptr = mmptr(C)

# B and A are swapped: since every matrix appears transposed to Julia,
# computing C^T = B^T * A^T is equivalent to C = A*B on the numpy side.
lib.julia_mul_inplace(Cptr, Bptr, Aptr)

Julia has a built-in interpreter (written in C++), which is packed into the libjulia-internal.so dynamic library. The good news is that this dynamic library is quite small (< 10 MB) and you can execute many useful programs on it (this is actually how Julia bootstraps itself). The bad news is that this library only contains the Core module, without the type inferencer. The Core library contains the bare data structures and relatively few methods. The type inferencer adds another 16 MB to the final product (I estimate this could be reduced to < 7 MB), though it's not really needed if your program is fully static: you can do cross-compilation at build time and remove the inferencer at runtime. That said, if your embedded device uses one of the architectures that Julia supports, you can already execute this small interpreter on that device without additional effort.

Maybe the hardest problem is cross-compilation, since currently you have to have a live Julia session to perform any kind of compilation (due to the dynamic nature of Julia, you have to gather some runtime metadata to compile). But where you build your program is generally not where you will deploy it. And this live Julia session bakes in many architecture assumptions, which impact the output binary (for example, the size of an integer differs across platforms). Ideally, Julia's internal compilation pipeline could be isolated and adapted to cross-compilation (and the other parts of the compiler likewise need sufficient isolation to break their implicit assumptions), but it seems we don't have enough people for that.

Another problem is the Base library: most of Base doesn't function correctly on embedded systems (libuv, the filesystem, tasks…). But this is easier to solve, since it's quite common to develop one's own standard library in embedded development, so it's not a big loss if one can't use Base.

In summary, if you want to do some work on embedded systems, you can first try working on a Linux-based embedded system using the pure interpreter I mentioned before. Then you have the Core library, and you can define your own base by selectively including some Base source files (array.jl, dict.jl…). Other approaches are also possible, like producing and linking binaries, but they are currently unstable (they rely on Julia's type inferencer to get rid of dynamic calls) and don't scale, so I would not recommend them here.
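
That "define your own base" step would look very roughly like Julia's own bootstrap (a hypothetical sketch; each included file needs its dependencies defined first):

# Hypothetical: a minimal standard library built on Core alone,
# pulling selected source files from julia/base/
baremodule MiniBase

using Core.Intrinsics

# Core.include evaluates a source file in the given module:
Core.include(MiniBase, "essentials.jl")
Core.include(MiniBase, "array.jl")
Core.include(MiniBase, "dict.jl")

end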
