Bring Julia code to embedded hardware (ARM)

Now that sounds quite pessimistic… why so?
Sure, one needs enough memory to install and run Julia. Precompilation would only be necessary when something in the code changes (either while prototyping, or when deploying a “code upgrade” as opposed to a full “software upgrade”). So an embedded device that can self-compile changed logic or “scripts” may be quite useful for some applications.

About real-time applications and GC: You can disable and re-enable the GC (GC.enable). So you could disable it, run the critical code, and then either re-enable it for the non-time-critical parts, or manually trigger a collection (GC.gc) after every real-time loop iteration.
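A minimal sketch of that pattern, assuming a hypothetical `control_step!` doing the time-critical work:

```julia
# Sketch: bracket the time-critical section with GC control.
# `control_step!` is a hypothetical user function.
function realtime_iteration!(state)
    GC.enable(false)          # no collections during the critical section
    try
        control_step!(state)  # time-critical work (ideally allocation-free)
    finally
        GC.enable(true)       # allow collections again...
        GC.gc(false)          # ...and trigger an incremental one right now
    end
end
```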

Also, in very fast, time-critical embedded systems one would want almost everything statically allocated anyway. And when getting from a dynamic to a static variant of some code, Julia’s introspection macros (@code_...) can help a lot.
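For instance, @code_warntype shows type instabilities in a hot loop, and @allocated verifies that it no longer allocates. A hedged sketch, with an illustrative running-sum kernel:

```julia
# Sketch: check a hot loop for type instabilities and hidden allocations
# before moving it to a static-allocation style. Assumes out and x have
# equal length.
function running_sum!(out::Vector{Float64}, x::Vector{Float64})
    s = 0.0
    @inbounds for i in eachindex(x)
        s += x[i]
        out[i] = s
    end
    return out
end

x = rand(1000); out = similar(x)
@code_warntype running_sum!(out, x)              # abstract types show up in red
running_sum!(out, x)                             # warm up (compile) first
@assert (@allocated running_sum!(out, x)) == 0   # buffers are preallocated
```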


Any comments on my question related to official 32-bit ARM binaries?

Tier 3: Julia may or may not build. If it does, it is unlikely to pass tests. Binaries may be available in some cases. When they are, they should be considered experimental.

sounds more pessimistic than it actually is. ARM has 11 open bugs: Issues · JuliaLang/julia · GitHub

but all but two are for 64-bit ARM, which actually has tier 1 support. It’s just that 32-bit ARM isn’t a priority (tier 3 status exists to prevent ARM bugs from blocking releases; my understanding is that older builds mostly work, and I would expect the same for 1.6, if it isn’t out already).

Before 1.4.0 release: Support tier shuffle: Win32, ARMv7 -= 1, AArch64 += 1 (#667) · JuliaLang/www.julialang.org@55ff318 · GitHub

ARMv7 shifts from tier 2 to 3. We’ve had a number of issues on ARMv7,
which have made producing release binaries quite tricky due to build
failures and other issues.

About real-time applications and GC: You can disable and re-enable the GC

You can do that, but it’s not the best way: the code will still allocate memory, which is a no-no for hard real-time (it might be OK for soft real-time).

Better to avoid allocations entirely, as far as possible; see: Robot Locomotion - JuliaHub
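The gist of that approach: allocate all buffers once at startup and use only in-place operations per cycle. A minimal sketch (names and types are illustrative):

```julia
using LinearAlgebra

# All buffers are allocated once at startup and reused every cycle.
struct ControllerState
    K::Matrix{Float64}   # gain matrix
    u::Vector{Float64}   # output buffer
end

update!(cs::ControllerState, x) = mul!(cs.u, cs.K, x)  # in-place u = K*x

cs = ControllerState(rand(3, 3), zeros(3))
x = rand(3)
update!(cs, x)                              # warm up / compile
@assert (@allocated update!(cs, x)) == 0    # no per-cycle garbage
```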

Julia doesn’t build for armv7l in the official build setup, which is why there is no official build. Help to fix it would be appreciated. Recent example of a user tracking down a bug for this platform: Can't build Julia v1.6.0-beta1 for Arm32bit · Issue #39293 · JuliaLang/julia · GitHub

1 Like

That closed issue ends with “now builds and passes my initial tests”, so Julia 1.6 (and 1.7) does build?

There has been no official release after that, so :man_shrugging: Also, that issue was specific to Julia v1.6, but the official build on armv7l failed since Julia v1.4.2, so there might be other lingering issues.

Other people actively testing Julia master on these boards and reporting any issues would be very helpful. Since there is no CI for this platform, there is no easy way for the developers to catch regressions.

There are unofficial builds for armv7l:

That may be, but Julia’s design does not encourage or require making all of your allocations static. Just where they matter. Memory management and being able to escape strict typing when necessary are key advantages of Julia.

The niche that Julia targets does not really coincide with the constraints of embedded devices (even though that is a moving target), especially with real time requirements, so I guess that most people just make a rational decision about not investing into this, and choose a different language or hardware.

1 Like

Julia v1.5.4 does compile from source. I expect v1.6.1 to compile, since the backport of https://github.com/JuliaLang/julia/pull/40176 has already been merged. Issue armv7l - No loaded BLAS libraries were built with LP64 support · Issue #40199 · JuliaLang/julia · GitHub prevents building current master. I will continue to test and report issues.

2 Likes

The build reportedly isn’t successful on the official build machine though, that’s why there haven’t been official builds in a while for that platform.

Are there any universities working on cross-compiling Julia onto FPGA SoCs with ARM processors, and possibly even into VHDL? I’m primarily interested in United States-based universities, but I would be interested to know if there is ongoing research elsewhere. Thought this Discourse might be a good place to look.

I am currently testing the feasibility of option 5 in the OP: running Julia on the embedded hardware and using AOT compilation. Our current target platform has a Cortex-A7-based SoC, which can run the official 32-bit ARM binary release of Julia 1.6.1.

Other than the large storage requirements, the biggest problem we have found so far is the very long startup time for even relatively simple scripts. Initially this was largely caused by precompilation times for some standard-library functions and operators. For example, the precompilation of an “inverse divide” operator was taking ~14 seconds :zzz: . For comparison, this same statement took ~5 seconds to precompile on a Raspberry Pi 4 (Cortex-A72), and slightly under 1 second on an x86_64 laptop.

We can remove much of the precompilation delay by using PackageCompiler.create_sysimage. But the run-time overhead remains uncomfortably high, on the order of 10 seconds for a test script with fewer than 20 lines, or half that time if the filesystem data is cached in memory (which we cannot assume).
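For reference, a minimal sketch of our workflow (package and file names are illustrative):

```julia
using PackageCompiler

# Execute a representative script at build time so its methods get
# compiled into the image; alternatively, pass the output of a
# `julia --trace-compile=...` run via `precompile_statements_file`.
create_sysimage([:MyEmbeddedApp];
    sysimage_path = "embedded.so",
    precompile_execution_file = "test_script.jl")

# The script is then run with:  julia --sysimage embedded.so test_script.jl
```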

Any suggestions on how to further speed up the startup/load time of Julia scripts/modules/libraries? Or any tricks for getting more out of PackageCompiler?

We would be particularly interested in ways to remove unused code from the binaries, which might help with load times, and would reduce the storage requirements. I was hoping that PackageCompiler.create_app might help with that, but it does not seem to improve the startup time. It does help a bit in reducing storage requirements, though.
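Roughly what I tried (directory names are illustrative):

```julia
using PackageCompiler

# Bundles a project directory into a relocatable app with its own sysimage.
create_app("MyEmbeddedApp", "MyEmbeddedAppCompiled";
    precompile_execution_file = "test_script.jl",
    incremental = false)   # build the sysimage from scratch
```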

The filter_stdlibs argument in create_sysimage and create_app sounds promising, but I have not figured out how to use it; create_sysimage fails with compilation errors when I try it, and the documentation talks about “potential pitfalls” without further details. Any hints or pointers on using filter_stdlibs?
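For the record, this is the kind of invocation that fails for me (a sketch; my understanding of the “pitfall” is that stdlibs the project does not explicitly depend on get stripped, so any indirect use of them breaks):

```julia
using PackageCompiler

create_sysimage([:MyEmbeddedApp];
    sysimage_path = "embedded_minimal.so",
    precompile_execution_file = "test_script.jl",
    filter_stdlibs = true)   # keep only stdlibs the project depends on
```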

1 Like

@enriquer I think the people best suited to answer are the PackageCompiler people, who might not be following this ARM thread. So I might suggest reposting that in a separate thread to get the right audience.

To make sure: you compiled everything that is called in the script into the sysimage? That is, the remaining overhead is pure “Julia startup / initialisation”…? So the question would be whether (and if yes, how) this can be eliminated. An embedded system is usually expected (or even required) to start up quickly, so 10 s is quite a lot of time (it would be added on top of the startup time of the system itself, incl. Linux etc.).

1 Like

Thanks, that is a good point. I have reposted to a new “Performance” topic:

https://discourse.julialang.org/t/looking-for-advice-on-achieving-faster-startup-times

Please follow that link if you want to continue the discussion of my previous post from two days ago.

I just replied to this in the new thread.

The problem is that PackageCompiler is not really a compiler: it seems to just dump the JAOT-compiled code in memory to a shared object, which includes any core Julia runtime functions/variables that get executed/declared along the way. That is why it takes 5–10 min to compile even simple Julia scripts, and why the final system image is >100 MB (which includes debugging symbols, and that’s without the extra artifacts and libraries that get generated/copied). This, in my opinion, is absolutely unacceptable for embedded (that, and the garbage collector).

In my honest opinion, trying to get Julia to work on embedded is a complete waste of time, and the points discussed by Karpinski are moot at best in this domain (I already discussed the “safe” part in another topic: Julia for real time, worried about the garbage collector - #41 by 0xD3ADBEEF). Yes, it has a nice philosophy, but it needs a lot of infrastructure to carry it in the embedded space (a proper AOT compiler toolchain, drivers, HALs, etc.).

I would rather focus my efforts on making a good code generation package.

3 Likes

Hi @maleadt, could you share a link to the JuliaCon talk?

That didn’t happen, but GPUCompiler.jl is basically all that code isolated into a single package, easier to understand. It can also be used for non-GPU purposes, as StaticCompiler.jl demonstrates.
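For reference, the StaticCompiler.jl README demonstrates compiling a small runtime-free executable roughly like this (a sketch; the API may have changed since):

```julia
using StaticCompiler, StaticTools

# StaticTools' c"..." strings avoid Julia's GC-managed String type.
hello() = println(c"Hello from a static binary")

# Emits a small standalone executable at ./hello (no Julia runtime needed).
compile_executable(hello, (), "./")
```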

3 Likes

Getting Julia code to target FPGA programmable logic and FPGA ARM cores would be huge. I don’t think I would care about Julia’s flexibility and GC; I just want the algorithms to be transcribed into fixed point and FPGA IP on the programmable-logic side and the ARM-core side. I thought Xilinx open-sourced their LLVM front ends to better allow this sort of thing. Any thoughts lately on how feasible this is?

2 Likes

Some work was being done on getting Julia to work on FPGAs (although without LLVM); here is the paper: [2201.11522] High-level Synthesis using the Julia Language

4 Likes