Julia on embedded devices & validation thereof

Definitely have to cut down on the memory consumption. I don’t know how to measure the exact number for myself, but this blog post estimates 150 MB just to run a hello-world script. Most of it is apparently Julia allocating BLAS buffers, some comes from JIT compilation, and I assume some overhead is needed for dynamic dispatch. Can anyone give a clearer breakdown of the memory usage for each feature? I could imagine a kind of executable that sacrifices even JIT compilation to save memory.
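
For what it’s worth, one way to get a rough number yourself is to ask the OS for the process’s peak resident set size, which Julia exposes as `Sys.maxrss()`. A minimal sketch for the current session:

```julia
# Rough self-measurement: peak resident set size of the current Julia process.
# This includes the runtime, compiled code, and any BLAS buffers already touched.
mb = Sys.maxrss() / 1024^2
println("peak RSS so far: ", round(mb; digits = 1), " MB")
```

Note this is a high-water mark for the whole process, not a per-feature breakdown.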

To be clear, you can absolutely design a program where you preallocate all necessary mutable data (and BLAS buffers); once compilation is done, your program runs without heap-allocating any further data, so there is no more garbage. You mutate the preallocated data and otherwise work on small-enough immutable instances, which are reliably handled on the stack. Of course, you can’t use allocating code like sort; you have to search for or implement in-place versions like sort!.
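
A minimal sketch of that pattern (the function name `smallest!` is made up for illustration): preallocate once, then mutate in place with non-allocating methods like `sort!`. `InsertionSort` is used here because the default sorting algorithm may allocate scratch space on newer Julia versions:

```julia
using Random

# All mutable state lives in `buf`, allocated once up front.
function smallest!(buf::Vector{Float64})
    rand!(buf)                          # fill the preallocated buffer in place
    sort!(buf; alg = InsertionSort)     # in-place sort, no scratch allocation
    return buf[1]                       # small immutable result, handled on the stack
end

buf = Vector{Float64}(undef, 1_000)     # the one-time allocation
smallest!(buf)                          # first call triggers compilation
@assert (@allocated smallest!(buf)) == 0   # steady state: zero heap allocations
```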

3 Likes

For some complex programs this design would essentially mean allocating only large buffers from Julia and implementing a custom allocator on top of that.

As there is no support in Julia for custom allocators, this seems very hard to me, requiring a lot of metaprogramming and handling of edge cases. A real concern for me.
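
To illustrate what I mean, here is a hypothetical bump (“arena”) allocator along those lines; all names are made up, and a real version would also need to handle alignment, arbitrary element types, and thread safety:

```julia
# Hypothetical arena: one big buffer allocated from Julia up front, with views
# handed out as "allocations". Freeing is all-or-nothing via reset!.
mutable struct Arena
    buf::Vector{Float64}
    offset::Int
end
Arena(n::Integer) = Arena(Vector{Float64}(undef, n), 0)

function alloc!(a::Arena, n::Integer)
    a.offset + n <= length(a.buf) || error("arena exhausted")
    v = view(a.buf, a.offset + 1 : a.offset + n)
    a.offset += n
    return v
end

reset!(a::Arena) = (a.offset = 0; a)

arena = Arena(10_000)
x = alloc!(arena, 100)   # a SubArray backed by the arena, no new buffer
x .= 1.0
reset!(arena)            # "free" everything at once
```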

Even worse, multithreading is also incompatible with avoiding the GC, at least currently.

4 Likes

Unless I’m missing something, that describes an unresolved feature request (or, more precisely, the availability of code shown in a presentation) that is more than two years old, and it doesn’t actually provide the macro it seems to imply?

1 Like

It is way too soon to announce anything formal or concrete, but we have a side project on this: trying to plug Julia into Vitis HLS.
Aaaaand that’s a nightmare. But we hope to show some (preliminary) results soon :slight_smile:

4 Likes

Just to give some optimism for future progress, you can get this number waaaaaay down with the sort of approach used in the (experimental) StaticCompiler.jl:

julia> using StaticCompiler, StaticTools

julia> hello() = println(c"Hello, world!")
hello (generic function with 1 method)

julia> compile_executable(hello, (), "./")
ld: warning: object file (./hello.o) was built for newer OSX version (10.14) than being linked (10.12)
"/Users/me/hello"

shell> ls -alh hello
-rwxr-xr-x  1 me  staff   8.4K May 22 16:38 hello

shell> /usr/bin/time -l ./hello
Hello, world!
        0.00 real         0.00 user         0.00 sys
   1798144  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
       455  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
         1  involuntary context switches

So 1.8 MB total memory usage (less than ls on my system) and an 8.4 kB executable.

20 Likes

Or for a less trivial example:

using StaticCompiler
using StaticTools
using LoopVectorization

@inline function mul!(C::MallocArray, A::MallocArray, B::MallocArray)
    @turbo for n ∈ indices((C,B), 2), m ∈ indices((C,A), 1)
        Cmn = zero(eltype(C))
        for k ∈ indices((A,B), (2,1))
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
    return C
end

function loopvec_matrix(argc::Int, argv::Ptr{Ptr{UInt8}})
    argc == 3 || return printf(stderrp(), c"Incorrect number of command-line arguments\n")
    rows = parse(Int64, argv, 2)            # First command-line argument
    cols = parse(Int64, argv, 3)            # Second command-line argument

    # LHS
    A = MallocArray{Float64}(undef, rows, cols)
    @turbo for i ∈ axes(A, 1)
        for j ∈ axes(A, 2)
           A[i,j] = i*j
        end
    end

    # RHS
    B = MallocArray{Float64}(undef, cols, rows)
    @turbo for i ∈ axes(B, 1)
        for j ∈ axes(B, 2)
           B[i,j] = i*j
        end
    end

    # Matrix multiplication
    C = MallocArray{Float64}(undef, cols, cols)
    mul!(C, B, A)

    # Print to stdout
    printf(C)

    # Clean up matrices
    free(A)
    free(B)
    free(C)
end

# Attempt to compile
path = compile_executable(loopvec_matrix, (Int64, Ptr{Ptr{UInt8}}), "./")

which gives us

$ ls -alh loopvec_matrix
-rwxr-xr-x  1 me  staff    21K May 22 16:30 loopvec_matrix

$ ./loopvec_matrix 10 3
3.850000e+02	7.700000e+02	1.155000e+03
7.700000e+02	1.540000e+03	2.310000e+03
1.155000e+03	2.310000e+03	3.465000e+03

$ /usr/bin/time -l ./loopvec_matrix 100 100
[output omitted...]
        0.04 real         0.00 user         0.00 sys
   2113536  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
       532  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
       127  voluntary context switches
         3  involuntary context switches

a 21 kB executable that uses 2.1 MB to multiply two 100x100 matrices. For comparison, ls:

$ /usr/bin/time -l ls -alh loopvec_matrix
-rwxr-xr-x  1 me  staff    21K May 22 16:30 loopvec_matrix
        0.00 real         0.00 user         0.00 sys
   2416640  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
       609  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
        14  involuntary context switches
16 Likes

Yeah, preallocating to reach zero allocations is great for improving the performance of isolated methods, and the approach can be scaled up to a whole program, but it’s still very much a limitation. I was thinking more of Julia being used for smaller embedded programs once StaticCompiler.jl matures, not anything on the scale of video game programming, where, from what I’ve read, people need custom allocators or an incremental GC (Unity, Unreal). There have been Discourse discussions about these things for soft real-time in general, but I don’t know of any ongoing effort to implement them. I have read about recent work on improving escape analysis, in part to reduce garbage (remember, mutable/immutable is not synonymous with heap/stack). EDIT: This comment does a better job summarizing than I ever could.

2 Likes

Is it feasible to replace MallocArray+free with Array+escape analysis, if not now then one day?

If that lets the Julia compiler put that array on the stack, then probably yes in principle?

You can also use arrays that are designed to live on the stack today, like StrideArrays.jl and StaticArrays.jl. The main limitation is that you need to know their length at compile time, because otherwise that’s effectively a runtime type instability.
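
For example, with StaticArrays.jl (assuming the package is installed), the length is encoded in the type, so operations on small fixed-size vectors need no heap allocation; the function name `sqnorm` is just for illustration:

```julia
using StaticArrays

# The length 3 is part of the SVector type, so intermediate
# results of the broadcast stay on the stack.
sqnorm(v::SVector{3,Float64}) = sum(v .* v)

v = SVector(1.0, 2.0, 3.0)
sqnorm(v)                            # first call compiles
@assert (@allocated sqnorm(v)) == 0  # no heap allocation afterwards
```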

If you link to libjulia rather than making a fully standalone executable (e.g., use StaticCompiler.compile rather than StaticCompiler.compile_executable), then actually a lot more should already be possible.

1 Like

You can use Libc.malloc or any other C API to allocate memory. You can then wrap a Julia array around it with unsafe_wrap. If you pass the keyword own = false to unsafe_wrap, the GC will not free the memory automatically. You can then manually Libc.free when you are done.

This process does allocate a little memory through the Julia GC for the array data structure. If you can tolerate that, then you can use this to obtain a standard Array from manually allocated memory. Otherwise, you will need to use MallocArray as described above.
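
A sketch of that recipe: malloc a buffer, wrap it with `own = false` so the GC leaves the memory alone, then free it manually when done:

```julia
# Manually managed memory behind a standard Array.
n = 100
p = convert(Ptr{Float64}, Libc.malloc(n * sizeof(Float64)))
A = unsafe_wrap(Array, p, n; own = false)  # GC will NOT free this buffer
A .= 1.0                                   # use it like any Vector{Float64}
@assert sum(A) == n
Libc.free(p)                               # our responsibility to free
```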

If you do want the GC to help free memory, I recently put together a package that uses a finalizer to free memory while using the standard Array interface for construction. This provides an abstract interface for use with custom memory allocator schemes.

7 Likes

Speaking of Julia on Arduinos – @sukera’s blog post on exactly that just dropped!

https://seelengrab.github.io/articles/Running%20Julia%20baremetal%20on%20an%20Arduino/

10 Likes
Excerpt of "Running Julia baremetal on an Arduino"
  1. The initial unsafe_load from our pointer triggered undefined behavior, since the initial value of a given pointer is not defined. LLVM saw that, saw that we actually used the read value and eliminated both read & store due to it being undefined behavior and it being free to pick the value it “read” to be the one we wrote, making the load/store pair superfluous.
  2. The now empty loops serve no purpose, so they got removed as well.

In C, you can solve this problem by using volatile. That keyword is a very strict way of telling the compiler “Look, I want every single read & write from and to this variable to happen. Don’t eliminate any and don’t shuffle them around (except relative to non-volatile accesses; those you’re free to shuffle).”

I only ever tried it out twice, so I had assumed unsafe_load was a guaranteed runtime action, not something that could be “optimized” to nothing. I had read about volatile variables in passing but always dismissed them as a language-specific thing I’d never have to care about. But now I’m wondering: what assumptions and intentions are behind optimizing away an action that seems intended for runtime I/O?

Well, this is an LLVM thing as opposed to a Julia thing, but the key phrase is “undefined behavior”: in the event of undefined behavior, the compiler is free to do absolutely anything it wants (say, exit the program, delete your boot volume, whatever); it’s just being nice by not doing that and instead choosing a convenient nearby value. As Wikipedia puts it:

In the C community, undefined behavior may be humorously referred to as “nasal demons”, after a comp.std.c post that explained undefined behavior as allowing the compiler to do anything it chooses, even “to make demons fly out of your nose”.[1]

1 Like

Ah, I just didn’t understand what undefined behavior meant. The way I’m reading it now is that the compiler assumed to some benefit that this action would never happen. It’s still sort of odd to me that volatile exists to make the compiler stop doing that for some things, as if you’re telling it “no wait, this behavior isn’t undefined actually.”

1 Like

Nope! It’s just a pointer dereference under the hood, just like x = *p in C or C++ would be. It obeys almost all the same semantics as in those languages, in particular in how LLVM optimizes it. In truth, Julia is not some magic pointer-free machine; it too uses pointers under the hood, simply because you need some sort of referencing scheme to say “I have a piece of memory somewhere, and this tells you where it is”.

Well, while undefined behavior technically allows the compiler to do whatever (the C standard doesn’t define what should happen), generally optimizing compilers (as all the ones we regularly use are) will try to choose behavior that results in faster/smaller code. This means, for example, that if a read from an uninitialized variable (or a dereference of an uninitialized pointer!) happens, the compiler is free to choose behavior that allows it to eliminate more code (it’s not actively making that choice; it’s an emergent behavior of multiple optimization passes).

What volatile does in this specific case is communicate to the compiler that the read of this variable has a defined behavior, which actually is the case! Microcontrollers initialize their registers (and sometimes memory, though that’s often done explicitly in the bootloader) to known-good values. So putting volatile there marks for the compiler that any read & write from/to that variable/address will be observed by something, i.e. it has a side effect that’s not transparent to the compiler itself. It’s not so much that the read itself becomes defined behavior, but that the syntax & semantics move to a partly different set of definitions, under which a read from an uninitialized variable is well defined.

2 Likes

On Arduino, what if you let C handle the hardware register interfaces and just wanted a Julia algorithm to handle the processing of the inputs and outputs? Does that make the workflow any easier or more reliable?
Is that possible without going from Julia to C first?

1 Like

I’ve done that. What you do is compile a static library from Julia code containing some functions, and then link it with the C code. It’s a bit annoying to do, but it’s possible.

3 Likes

I need to try this, too: just linking it as a library into C++ without requiring a Julia install anymore. I imagine that is possible for Windows and Linux C++ projects.

Ah yeah, there’s an example of that here: https://github.com/brenhinkeller/StaticTools.jl#calling-compiled-julia-library-from-python

(the example is calling from python, but once you have the .so/.dylib you can just dlopen from any language you like)

julia side:

using StaticCompiler
using StaticTools
using LoopVectorization
using Base: RefValue

@inline function mul!(C::MallocArray, A::MallocArray, B::MallocArray)
    @turbo for n ∈ indices((C,B), 2), m ∈ indices((C,A), 1)
        Cmn = zero(eltype(C))
        for k ∈ indices((A,B), (2,1))
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
    return 0
end

# this will let us accept pointers to MallocArrays
mul!(C::Ref,A::Ref,B::Ref) = mul!(C[], A[], B[])

# Note that we have to specify a concrete type for each argument when compiling!
# So not just any MallocArray, but in this case specifically MallocArray{Float64,2}
# (AKA MallocMatrix{Float64})
tt = (RefValue{MallocMatrix{Float64}}, RefValue{MallocMatrix{Float64}}, RefValue{MallocMatrix{Float64}})
compile_shlib(mul!, tt, "./", "mul_inplace")

python side:

import ctypes as ct
import numpy as np

class MallocMatrix(ct.Structure):
    _fields_ = [("pointer", ct.c_void_p),
                ("length", ct.c_int64),
                ("s1", ct.c_int64),
                ("s2", ct.c_int64)]

def mmptr(A):
    ptr = A.ctypes.data_as(ct.c_void_p)
    a = MallocMatrix(ptr, ct.c_int64(A.size), ct.c_int64(A.shape[1]), ct.c_int64(A.shape[0]))
    return ct.byref(a)

lib = ct.CDLL("./mul_inplace.dylib")

A = np.ones((10,10))
B = np.ones((10,10))
C = np.ones((10,10))

Aptr = mmptr(A)
Bptr = mmptr(B)
Cptr = mmptr(C)

lib.julia_mul_inplace(Cptr, Bptr, Aptr)
1 Like