Bring Julia code to embedded hardware (ARM)

asprionj · January 23, 2019, 12:57pm

I’m using Julia as a framework for simulating physical systems and for prototyping signal-processing and control algorithms. Of course, the latter should eventually run on some embedded hardware (in my case, an ARM architecture).

As I see it, making Julia code (I intentionally don’t say Julia) runnable on embedded devices would solve this “two-language problem” as well. Today, control-related stuff is often prototyped in e.g. Python or Matlab/Simulink and then transformed to C code (manually, or automatically after spending a lot of money on several MathWorks toolboxes).

We can think of several ways to get Julia code onto the embedded hardware:

1. Re-write the code manually in C, write a (small) wrapper and call this code from Julia to validate against the prototype code. That’s today’s two-language approach.
2. Write a C-code generator using Julia’s strong metaprogramming capabilities. For not-too-complex algorithms, a subset of the entire functionality could do the job. That’s an advanced version of today’s two-language approach.
3. Extract the LLVM IR and (cross-) compile it for the embedded hardware, using the appropriate LLVM backend.
4. (Cross-) compile a binary (executable, library) containing the algorithms (methods), similar to PackageCompiler.jl.
5. Run Julia on the embedded hardware, AOT-compile all required methods when going into production. This would allow prototyping in Julia directly on the embedded hardware and then just “fix the code” at some point.
6. Have a minimal Julia “runtime” on the embedded hardware, without the JIT compiler, for which “modules” can be (cross-) compiled – this could probably simplify the build process of 4 and would provide the possibility to add interpreted “glue code” to connect the modules.
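
To make option 2 a bit more concrete, here is a minimal sketch of what a C-code generator does. The idea above is to write it in Julia using its metaprogramming; the Python version below only illustrates the general shape (walk an expression tree, emit a C function). The names `expr_to_c` and `generate_c`, the restriction to `float` arguments and the small operator subset are all assumptions made for this example.

```python
import ast

# Operators supported by this toy generator (an assumed, deliberately small subset).
_BINOPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}


def expr_to_c(node):
    """Recursively emit C source for a restricted arithmetic expression node."""
    if isinstance(node, ast.BinOp) and type(node.op) in _BINOPS:
        op = _BINOPS[type(node.op)]
        return f"({expr_to_c(node.left)} {op} {expr_to_c(node.right)})"
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return f"(-{expr_to_c(node.operand)})"
    if isinstance(node, ast.Name):
        return node.id
    if (isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)):
        # Emit every numeric literal as a single-precision C literal.
        return repr(float(node.value)) + "f"
    raise NotImplementedError(f"unsupported construct: {ast.dump(node)}")


def generate_c(name, args, expression):
    """Return C source for `float name(float args...)` computing `expression`."""
    tree = ast.parse(expression, mode="eval")
    # Reject variables that are not declared as function arguments.
    unknown = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)} - set(args)
    if unknown:
        raise ValueError(f"undeclared variables: {sorted(unknown)}")
    params = ", ".join(f"float {a}" for a in args)
    return f"float {name}({params}) {{\n    return {expr_to_c(tree.body)};\n}}\n"


if __name__ == "__main__":
    # One step of a PI control law, written as a plain expression.
    print(generate_c("pi_step", ["kp", "ki", "e", "integ"], "kp * e + ki * integ"))
```

A real generator would also have to handle types, loops, arrays and state, which is where most of the effort goes; the sketch covers only scalar expressions.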

If there’s a lot of flash and RAM, one could just use option 5. I guess that’s already possible today. However, on embedded devices, the entire OS (Yocto, Buildroot, Alpine, …) is usually stripped down to 2-4 MB, so one does not want a software package of some 100 MB on it.

What are your thoughts on this? Do you also see a (huge) benefit? What could be the best way to go (under which circumstances / requirements)?

gdkrmr · January 23, 2019, 1:28pm

Check out “Transpiling Julia to C – The LLVM-CBackend”, not sure if this is still working:

https://juliacomputing.com/blog/2016/03/10/j2c-announcement.html

asprionj · January 23, 2019, 4:59pm

Thanks for the hint; I had already read the mentioned post before though. Anyone tested this with a post-1.0 release?

What I don’t like is that going from LLVM IR back to C is like a “step backwards”. After all, LLVM is a compiler, and its IR is a representation that is closer to the final machine code than to C – at least I see it this way. Especially since in our case here, the IR was not produced from C code. So it’s like stopping one compiler in the middle of its work, going back to (another) source code, then compiling this again with a different compiler. Also, in the post it’s mentioned that:

This is not a true cross-compiler. This is not the fault of the backend, but of the Julia program generator phase which encodes many platform-specific assumptions as it generates function definitions.

jpsamaroo · January 23, 2019, 8:54pm

My recommendations are to, first, drop options 1 and 5. Drop 1 because it sucks. Drop 5 because it is impossible on resource-constrained hardware.

What I think we really should strive for is some mixture of options 2, 3, 4, and 6. Specifically:

For each target, have some sort of bootstrapping code to boot the system and get it into a state where it can be remotely programmed via a hard or wireless connection. Additionally, do either auto or manual detection and reporting of the target’s technical details (RAM, Flash, supported ISAs, etc.) that are readable by the connected host.
Once past the bootstrapping phase, we now implement a generic, cross-platform stripped-down Julia runtime. It will not use libuv and threads and all that jazz, but will provide a layer which is still compatible (as much as possible) with core Julia functionality. Things we will want here are exceptions and Tasks, and other stuff I’m forgetting to list, that are used in lots of today’s code.
Build an embedded LLVM bitcode/Julia SSAIR interpreter (also generic and cross-platform, as much as possible) that can be fed IR and interpret it directly, with the option to JIT-compile to native code if available. These already exist for LLVM bitcode IIRC.
Provide one or more packages to do the driving of converting high-level Julia to this IR, and handle all of the MCU communication and other supporting things (resets, autodetection, etc, etc.). These packages would also handle extracting LLVM bitcode/Julia SSAIR from post-lowering, type inference, and optimization passes, which are some of the more resource-intensive bits.
Additionally, when having an interpreter is not desired or possible, there should also be the option to change LLVM’s backend to directly emit native code for the target, with the loading of such code again handled by the above embedded runtime layer.
Where possible and supported, all dynamic dispatch and other functionality too complicated for the embedded system should be able to be “proxied” out to an attached host for handling, thus requiring a layer at the host level to handle this communication.
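
As an illustration of the “reporting of the target’s technical details” step, here is a minimal sketch of a fixed-layout descriptor that a target could send and a host could decode. Everything about it is hypothetical: the `JLTD` magic, the field order, the flag values and the function names are invented for this example and do not describe any existing protocol.

```python
import struct

# Hypothetical little-endian layout with no padding:
# 4-byte magic, RAM size in KiB (uint32), flash size in KiB (uint32), ISA flags (uint16).
_FORMAT = "<4sIIH"
_MAGIC = b"JLTD"

# Hypothetical capability flags.
ISA_THUMB = 0x1
ISA_THUMB2 = 0x2
ISA_FPU = 0x4


def pack_descriptor(ram_kib, flash_kib, isa_flags):
    """Target side: encode the descriptor as a 14-byte payload."""
    return struct.pack(_FORMAT, _MAGIC, ram_kib, flash_kib, isa_flags)


def unpack_descriptor(payload):
    """Host side: decode and validate a payload received from the target."""
    if len(payload) != struct.calcsize(_FORMAT):
        raise ValueError("unexpected descriptor length")
    magic, ram_kib, flash_kib, isa_flags = struct.unpack(_FORMAT, payload)
    if magic != _MAGIC:
        raise ValueError("not a target descriptor")
    return {"ram_kib": ram_kib, "flash_kib": flash_kib, "isa_flags": isa_flags}


if __name__ == "__main__":
    # A made-up Cortex-M-class target: 256 KiB RAM, 1 MiB flash, Thumb2 with FPU.
    payload = pack_descriptor(256, 1024, ISA_THUMB2 | ISA_FPU)
    print(unpack_descriptor(payload))
```

With something like this, the host-side packages could decide per target whether to send IR for interpretation or native code, based on the reported sizes and flags.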

So, this all sounds like a ton of work that no one really wants to shoulder. What benefits do we get from it at least?

We avoid the two-language problem entirely once we get a target-specific bootloader and runtime in place, because now everything is either LLVM or Julia (everything else is abstracted away).
Existing packages which don’t make huge use of dynamic dispatch and which don’t do many calls into the Julia runtime should be fully supported (such as numerics codes, with BLAS calls using a Julia-native BLAS package instead of OpenBLAS/MKL).
Regular users don’t have to know much about embedded programming other than how to connect and flash their MCU from a host, assuming a bootloader and runtime are available for their target.
We avoid duplication of effort, because only the bootloader and parts of the runtime need to be specific to the target. Anything else, like GPIO and other hardware peripherals, can be abstracted by the runtime and exposed as regular Julia types and methods.
The most expensive parts of running Julia code (everything up until SSAIR/LLVM bitcode generation) are handled on the host.

Now, how would we approach this in reality? Say I wanted to run Julia code on my embedded ARM board or FPGA soft-core, what would I need?

A bootloader that can boot the system, set up interrupts and load data from flash into RAM, bring up communication peripherals, and start the runtime.
A stripped-down Julia runtime. Depending on the target, this can be anything from just a simple interpreter REPL for IR, up to a full-fledged runtime with Tasks and exceptions and other cool stuff. However, the interface from this runtime to/from the host should still follow a consistent protocol and process so that any MCU, regardless of features, is detectable and controllable from the host.
A few packages installed on your host which know how to communicate over your desired wired/wireless peripheral (UART, USB, Wi-Fi, etc.) and get a connection to the MCU’s runtime layer. Additionally, a package that detects and exposes the hardware peripherals to your Julia code as native types and methods.

What do people think about this approach? Feasible but difficult? Impossible? A terrible, terrible idea for reasons I don’t yet grasp?

baggepinnen · January 23, 2019, 9:02pm

I disagree, we are doing it in our university lab. See Readme and documentation in our repo
https://gitlab.control.lth.se/labdev/LabConnections.jl

jpsamaroo · January 23, 2019, 9:14pm

By “resource-constrained”, I mean something incapable of running Linux, like an Arduino

baggepinnen · January 23, 2019, 9:31pm

OP mentioned he was going to use an ARM architecture, which is what we use (beaglebone), so I would say it’s certainly an option to consider.

asprionj · January 23, 2019, 10:27pm

That’s one major benefit of the entire idea. Thanks for mentioning it; I somehow forgot.

One example that possibly illustrates my “vision”: optimisation-based control. This can be online machine-learning of some kind, solving a static optimisation problem to define setpoints for the feedback controller, or even optimal control / model-predictive control (i.e. dynamic optimisation problem). This all boils down to coding the model structure, formulating the mathematical optimisation problem using the model, and solving it. Prototyping such things in Julia is great, due to the language and some awesome libraries already around. But how awesome would it be for a guy knowledgeable about this stuff but with no experience in embedded programming, to just run the exact same code directly on the series-controller hardware?!?

I did mention ARM and Linux (Yocto, Buildroot, Alpine, …) in the original post, yes. But of course, bringing Julia (code) to even smaller MCUs without a Linux system would be a great thing, too. But probably that’s two different stories. The first one (ARM, some flash and RAM) seems to be simpler to reach. After all, Julia can be (or will be again, soon) run on ARM. However, AFAIK that nowadays requires the entire ~260MB (?) junk to be installed? So for this use case, options 4 and 6 could be ways to go. Any thoughts on which may be more realistic to realise?

jpsamaroo · January 23, 2019, 11:11pm

Most ARM chips couldn’t handle Linux though. The BeagleBone uses an ARM chip which is a member of the Cortex-A line, which yes, often is deployed with Linux or another full-fat OS running on it. However, most of the ARM chips I have at home are of the Cortex-M variety, which could hardly run a decent RTOS, let alone a virtual memory-focused OS like Linux. Additionally, in terms of volume, I’m going to guess that Cortex-M has more yearly sales than Cortex-A, simply because it can fit in all sorts of nooks and crannies that Cortex-A could not.

jpsamaroo · January 23, 2019, 11:21pm

If you’re primarily interested in Linux-capable ARM architectures, then that support already exists within LLVM and many of the libraries that Julia relies on (including Julia itself). However, other than improving/adding ARM support to those libraries, there’s “not much else” that’s needed to get Julia running on ARM (Cortex-A).

However, as I replied to @baggepinnen, and as you alluded to in your reply, there is another, more constrained class of embedded controller that Julia currently cannot target, which in my opinion has the potential for some really amazing results when Julia code is somehow crammed in there. For me personally, implementing some simple spiking neural networks within a Cortex-M7F MCU running on an aerial robotics platform could be really cool and lead to some awesome emergent behavior (my area of research somewhat).

Of course, you are the OP, so please direct this conversation in whichever direction you prefer! Don’t let me derail things if you’d prefer to discuss virtual memory-supporting ARM platforms instead

asprionj · January 24, 2019, 8:23am

Thanks for all the contributions so far. I don’t want to solve exactly one problem/task that I currently have. It should be a rather generic discussion on whether there is enough benefit in creating easy ways of bringing Julia to any kind of MCU, and if so, how this could be achieved.

@jpsamaroo, you are perfectly right in your distinction between the M and A classes of ARM MCUs. We first aimed for an M4 for our main controller unit but then switched to an A (some A7/9, or could be the soon-to-be-released NXP i.MX8m bearing an A53) since we also need a web server, GUI, etc. – and we can afford the higher price tag for our application. However, flash memory and RAM are still a separate source of cost and should be kept as low as possible. For our auxiliary units (something like I/O-extension boards), we will still rely on M0 (possibly M3). So, considering these two “worlds”…

“ARM A” + Linux: So there’s LLVM support for such devices, and Julia can run on them. But still, I wouldn’t want to…
1. add another 300MB of flash just to have a full-blown Julia installation there, when my Linux OS requires 4MB and all binary application code (written e.g. in C) weighs another 2MB.
2. rely on JIT compilation in a (soft-) real-time environment. Sure, you could just call every method once when starting up the controller, but since that would always be the same, that’s just needless overhead greatly prolonging startup time.
I guess we would end up somewhere between options 3 and 4 for this use case, right? Of course, the amenities of option 6 would still be nice, but it would probably be much more work to get it done. The goal would be to create an easy and reliable way of getting Julia code onto the embedded hardware, such that even users “knowing nothing about embedded programming” could just do it. (BTW, who’s also waiting for the final season of GoT?)
Low-level on “ARM M”: I don’t think using Julia as a bare-metal embedded programming language is an option nor a goal. There would need to be at least some HAL and probably a (thin) layer in C to provide some environment for the Julia-generated modules. On top of that…
- Is there a chance of also going along options 3, 4 and 6? I.e. can the existing LLVM backend for ARM be adapted to those “small” MCUs?
- Another option would be to give the LLVM-CBackend another try (that is, LLVM IR to C to binary).
- And there’s option 2, actually generating C code from Julia. I guess Intel’s Julia2C did this. There are some interesting explanations in this julia-dev google-group thread, all from Hongbo Rong:

So far, the concept is: the user specifies 1 Julia function to be translated into native C; then in the Julia code generation phase, we recursively compile this function and all its direct and indirect callees into C; that is, the whole call graph is translated.

All macros and included Julia code have been processed before J2C happens. The generated C is standalone: a call graph rooted at a user-specified function is translated, including any Julia runtime function called. So it is a kind of whole-program translation. The C does not call back to libjulia during execution

Palli · January 26, 2019, 4:28pm

Would it help to integrate with MicroPython.org (“compact enough to fit and run within just 256k of code space and 16k of RAM”)? It already supports e.g. the ARM Cortex-M4F (on its official PyBoard), other non-ARM targets and, it seems, the MicroBit based on a Cortex-M0, and even allows interrupt handlers written in Python; MicroPython is a variant of Python 3.4. It has an inline assembler for Thumb2 but is otherwise interpreted, and its bytecode for the MicroPython VM (not the same as CPython’s, I believe) would, like Python’s, have better code density than ARM code, and even better than Thumb2 code, I would think.

Julia has good support for Python with PyCall (and pyjulia) and relying on it could help (at least to start with), but note it’s only to call official CPython’s libpython. I guess it wouldn’t do for MicroPython, or would at least need changes… because also MicroPython is its own OS/baremetal, not needing Linux.

I’m thinking at first a code-generator for [Micro]Python, rather than C, could be helpful (supporting all the MCUs not just ARM; in case it’s important). You would also want to have fast ARM assembly for at least part of your code, so compiling Julia (or LLVM?) to the Thumb2 inline assembly, could additionally (or instead) be helpful.

Note e.g. MicroPython has its own GC:

http://docs.micropython.org/en/latest/reference/constrained.html

A GC can be demanded at any time by issuing gc.collect() . It is advantageous to do this at intervals, firstly to pre-empt fragmentation and secondly for performance. A GC can take several milliseconds but is quicker when there is little work to do (about 1ms on the Pyboard). An explicit call can minimise that delay while ensuring it occurs at points in the program when it is acceptable.
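
A minimal sketch of the pattern described in that quote: run a periodic task and trigger a collection explicitly at points where a short pause is acceptable. The `gc` module and `gc.collect()` exist in both CPython and MicroPython; the function name `run_control_loop` and the interval of 100 iterations are assumptions chosen for this example.

```python
import gc


def run_control_loop(step, n_steps, gc_every=100):
    """Call step(k) for k = 0..n_steps-1, collecting garbage every gc_every steps.

    Returns the number of explicit collections that were triggered.
    """
    if gc_every <= 0:
        raise ValueError("gc_every must be positive")
    collections = 0
    for k in range(n_steps):
        step(k)
        # Collect between iterations, i.e. at a point where a pause is tolerable,
        # instead of letting the allocator trigger it in the middle of a step.
        if (k + 1) % gc_every == 0:
            gc.collect()
            collections += 1
    return collections


if __name__ == "__main__":
    samples = []
    n = run_control_loop(lambda k: samples.append(k * 0.5), n_steps=250)
    print(n, len(samples))
```

For hard timing requirements this only moves the pause to a known place; it does not remove it, which is part of why the thread keeps coming back to statically compiled, non-allocating code.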

I wouldn’t really worry about anything other than ARM. The smallest “computer” in the world is already based on Cortex-M0+. ARM has sold more than 100 billion units, and I wouldn’t worry about supporting competitors until we support ARM Cortex-M0/Thumb2 well.

mitkoge · January 29, 2019, 9:03am

While ARM Cortex-M0 is a massive target, ESP32 is still one of the most actively developed targets for MicroPython.
Maybe some kind of “microjulia” language-subset definition would be needed for restricted targets?
And some kind of dynamic/static compromise?
But then there would probably be alternatives to this subset (like Kotlin and Lua)?

asprionj · January 29, 2019, 8:24pm

MicroPython: The resource requirements are in fact small, that’s awesome. It’s an interpreter and thus very flexible and quick to “play around”, which is awesome, too. Finally, Python is a very “productive” (by simplicity of usage) language, that’s great.
But: I assume the roughly 100x performance gap as compared to C++ is a no-go in many applications.

With (statically, or any other form of AOT compiled) Julia we’d have both: simplicity of usage, prototyping language == final implementation, but performance of C++.

Now, I had another (quick) look at LLVM. I don’t know this compiler framework in any depth, just the rough overall idea. So there’s an “upper” layer generating the IR, and then there are backends generating e.g. machine code from this IR. There are cross-compilation-capable backends with which one can compile for e.g. ARM on another platform (also using an IR generated on this other platform). So this sounds like a straightforward route, but it seems (see second and third posts) that Julia adds platform-specific parts to the generated (IR?) code. Who could help with this? Which parts are these? Which (groups of) commands introduce platform-specific dependencies?

What about a “pure” IR-generation mode for Julia? One that would not support all features, or would support some only in a possibly sub-optimal but portable way. This IR would be free of any platform-specific stuff and could thus be (cross-)compiled for any platform for which an LLVM backend exists.

Palli · February 2, 2019, 7:46pm

Such compiling would be to machine code, but that’s not the only issue. Currently we have e.g. PackageCompiler.jl compiling to, I assume, the ELF file format (or similar, e.g. .exe on Windows). I assume it requires Linux or another OS, so for Cortex-M0 at least you need bare metal (also no full MMU).

Just compiling doesn’t solve all problems. For bare metal, I assume you’d also like to get rid of libuv (it assumes an OS; Keno is working on that, and such a version may already be available) and I guess more (e.g. nothing where you use memory-mapped files, or any files(?), would work).

There’s also JSExpr.jl to compile (a subset of Julia) to JavaScript. I assume compiling to [Micro]Python could be similar, possibly with small changes. Yes, the generated code would then be interpreted (without a JIT, so not benefiting from something like Google’s fast V8; in other contexts possibly PyPy).

I was thinking that a subset of your Julia code could be marked with a macro (maybe @inlineassembly) to generate MicroPython’s inline assembly instead. You could even start with that for all generated code.

Ok, didn’t realize, but I doubt most popular. I assume LLVM doesn’t support it, and if I read correctly neither MicroPython’s inline assembly, so either MicroPython isn’t too slow (not “100x” slower?), or that’s ok for some applications (or C code used with?).

Having MicroPython generation would at least help support this or other MCUs, but as I said, I don’t care too much about such support, so you can go either way and also interpreted and compiled isn’t mutually exclusive.

tshort · February 2, 2019, 9:03pm

This is a solid approach. It shares a lot of parallels with work that Keno has done on getting Julia to run on WebAssembly: no libuv, libjulia compiled to run on the target, and (for now) everything’s interpreted. Tasking is currently a big blocker–Keno has a plan, but it requires an LLVM pass to swap stacks. That’s a good bit of complex code still waiting for someone to tackle (issue).

Getting more static code requires some way to make Julia a better cross compiler. This issue would help a lot, but I don’t know that anyone is working on it. I started to build a static compiler in Julia, but I stopped work on it because the IR changed, and I was getting a little over my head. Every strategy I’ve looked at is a good bit of work.

asprionj · February 3, 2019, 3:08pm

Just wondering, how performant is the Julia interpreter? (I assume on par with Python, Lua, Ruby, …?) How large would the binary size (mainly libjulia then) be for this approach?

Is it a good idea to have a parallel way of generating LLVM IR/BC to Julia’s internal one? I don’t know much about Julia’s internals, but for me, removing and changing some parts of an existing, well-proven thing sounds like less work (but more complexity) than re-building a similar thing from scratch…?

BTW, thanks everyone for the valuable inputs, ideas and links.

tshort · February 4, 2019, 1:13am

The WebAssembly download is 42 MB, so the standard library does create a bit of a load. If LLVM is added to enable the JIT compiler, it’ll grow even more.

I suspect you’re right that it’d be less work but more complex to re-purpose Julia’s existing code generation. The complexity part is the kicker (especially the C++ requirement). I’ve looked but haven’t found an “easy” path to make that happen.

One way to re-purpose Julia’s compiler is to use the approach that CUDAnative uses. That approach works well for type-stable code that doesn’t allocate and doesn’t call out to any C code. Cassette could be used to swap out code that doesn’t work with the CUDAnative approach.

asprionj · February 5, 2019, 3:35pm

And what approach would that be? Do they explain it somewhere? (In a form understandable by non-experts in Julia and LLVM…)

These assumptions would probably apply to the vast majority of code that has to run on embedded controllers. I started to watch the Cassette talk from the Cambridge Meetup some time ago… have to finish that

Per · February 5, 2019, 4:39pm

There’s this paper by @maleadt et al.: https://arxiv.org/pdf/1712.03112.pdf I found it a very good read. (It’s now over a year old so details may have changed since the pre-release of Julia 0.7 that they refer to, but the explication of the general concept is likely still relevant.)

It does not go into great detail on the actual process of replacing/reconfiguring the IR. In particular, I can’t find any documentation on the interfaces listed in Table 2. With the help of Google I found CodegenParams in Julia’s Base module, but no docs…
