Now that Julia w/o LLVM is a thing, how long before I can run Julia on RPi Pico?

So with Julia 1.8 as I understand it you can now separate out the Julia runtime and the LLVM runtime.

Stuff like the RPi Pico, which is based on a low-powered ARM core, is really ripe for a scientific programming language. Adafruit took the chip from the Pico and put it on a board with 8MB of flash; it’s got 264kB of RAM (corrected from 264MB)… Adafruit Feather RP2040 [Pink], ID 4884, $11.95 (Adafruit Industries)

I’m sure the next version will have 1MB RAM and 32MB of flash or something similar. At some point in the near future these kinds of “microcontrollers” will have sufficient resources to make running Julia very attractive, and suddenly enable a bunch of smart sensors and such using a language designed for scientific computing… Is this reasonable to expect sometime in the next 3-4 years?

Outside of the StaticCompiler.jl efforts, I doubt julia is going to run on something that small anytime soon. Even if we figure out some way to compile small julia binaries with some reduced version of the runtime, there would still need to be an effort to add support for the boards, i.e. registers, ports etc.


I think people had the same feeling about Go, but then https://tinygo.org/ came along. Perhaps we just need a Julia compiler that outputs Go :joy:

I guess it depends on what exactly you mean with “run Julia”. If you want to run Julia code (i.e., without the runtime+GC we have today), that is (shameless plug) possible today, if LLVM has a backend for the microcontroller you want to target. The caveat is that there’s no supporting julia libraries for the microcontrollers, so there is lots of stuff you have to roll on your own, like I2C, SPI, interrupt handlers etc. You also likely can’t use anything that requires the runtime, so (for now?) you can’t use GC, can’t use tasks/threading, can’t have default println("foo") (since that requires some semblance of IO, my wager is you’d have to implement your own IO object and pass that around), need to do something about custom linking steps to ensure interrupt handlers end up where you want them to and probably more I’m forgetting.

In my opinion, transpiling to another language is much more difficult/error prone than “just” writing specialized code :slight_smile: I’m imagining that we could have a package implementing some form of “bare-bones” runtime replacement, but I’m not an expert in that area and I don’t know how portable that would be across different microcontrollers.


You have a typo: 264 MB is 1000 times larger than the real figure. The chip has “lots of onboard RAM (264KB)”, which is a challenge; not that 264 MB wouldn’t be challenge enough. It’s interesting that 256KB is a lot in the embedded world, but tiny elsewhere.

Even with “Dual ARM Cortex-M0+ @ 133MHz” if you get it to work it’s going to be slow for everything you’re used to, only fast (enough) for embedded stuff.

Non-standard (ELKS) Linux can run on 256 KB (not MB) of RAM, but the Julia runtime currently needs more.

Size-wise you can put Julia on that chip (compiled binaries can be as small as 20 KB), but you would have to use StaticCompiler.jl (and StaticTools.jl) or similar, not Julia as is or PackageCompiler.jl, and only after it has been made to work for ARMv6-M.

I would recommend doing it with Python as here, just with the MicroPython (or CircuitPython) distribution:

Julia supports ARMv8 (tier 2) and ARMv7, but not ARMv6-M (which LLVM does support). ARMv6-M runs only “Thumb” instructions, almost all of them 16 bits long; ARMv7 also has 32-bit instructions, and ARMv8/AArch64 has only 32-bit length instructions.

The biggest challenge:

there is no memory management unit (MMU). A full-fledged operating system does not normally run on this class of processor.

[That limitation hasn’t stopped [Micro]Python from working.]

But feel free to try, since that would put Julia in the range of working on:

the “world’s smallest computer” […] based on the ARM Cortex-M0+ (and including RAM and wireless transmitters and receivers based on photovoltaics) – by University of Michigan researchers at the 2018 Symposia on VLSI Technology and Circuits with the paper “A 0.04mm3 16nW Wireless and Batteryless Sensor System with Integrated Cortex-M0+ Processor and Optical Communication for Cellular Temperature Measurement.” The device is 1/10th the size of IBM’s previously claimed world-record-sized computer

Even the original 1991 Linux according to Linus:
https://groups.google.com/g/comp.os.minix/c/dlNtH7RRrGA/m/SwRavCzVE7gJ

needs a MMU (sorry everybody)

So I would recommend going without an OS (even though Julia’s runtime depends on one), UNLESS Julia, with its full runtime, could run on something like:

  • ELKS, a 16-bit no-MMU Linux on Amstrad PC 2086 (thanks @pawoswm-arm)
  • Booting ELKS on an old 286 MB from 1,44MB floppy (thanks @xrayer)

[Note “286 MB” there means an Intel 80286 (which has an MMU) motherboard, presumably, i.e. not to be read as 286 megabytes. I’m not sure how much memory it has, but the Amstrad “PC2086 has 8086 processor, but it has full 640kB of memory”.]

The Intel 80286 had a 24-bit address bus and as such had a 16 MB physical address space, compared to the 1 MB address space of prior x86 processors.
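The address-space figures in that quote follow directly from the bus width. A minimal sketch of the arithmetic, assuming one byte per address and a 20-bit bus for the 8086/8088 (the function name is only for illustration):

```python
# Addressable bytes for a given address-bus width, assuming byte addressing.
def address_space_bytes(bus_bits: int) -> int:
    return 2 ** bus_bits

MIB = 1024 * 1024

i8086 = address_space_bytes(20)    # 8086/8088: 20-bit address bus (assumption stated above)
i80286 = address_space_bytes(24)   # 80286: 24-bit address bus, as in the quote

print(f"8086/8088: {i8086 // MIB} MB")
print(f"80286:     {i80286 // MIB} MB")
```

Either way, the 264 KB on the RP2040 is well below what even these old PCs could address.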

The original IBM PC had “16 KB – 256 KB” but it didn’t run Linux (or barely an OS by modern standards). It used Intel 8088 and it runs the same code as Intel 8086 that ELKS supports.

I’m not sure Julia strictly needs an MMU, only indirectly, since only the OS has access to it. So it seems theoretically possible to run the Julia runtime with GC (though likely cut down, e.g. no threads, no LLVM/compiler, maybe with the interpreter --compile=min [and/]or just compiled binaries).

I remember there being an option in the ELKS
configuration about how many 64 KB memory pages were
necessary, and the minimum was 4 (256 KB).

A bit off-topic, though maybe not in this context (linked from ELKS; Justine is awesome, and so is her SectorLISP, the smallest high-level GC language, smaller than the “tiniest organic creature”, which at 613 bytes would not fit on a boot sector):

δzd Encoding

One of the most size optimized pieces of code in the Cosmopolitan codebase is Python. This is where we really had to go beyond code golfing and pull out the artillery when it comes to size coding. One of the greatest weapons in our arsenal (which helped us get statically linked Python binaries down to 2mb in size) is what we call the Delta Zig-Zag Deflate compression scheme, or δzd for short.
[…]
One of the many symbols in the Python codebase that needed this treatment is _PyUnicode_PhrasebookOffset2 which was originally 178kb in size, but apply delta encoding, zig-zag encoding, and then finally DEFLATE, we got it down to 12kb. That’s a nearly 10x advantage over DEFLATE’s Huffman encoding alone, and BZIP2’s Burrows-Wheeler transform!
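The three steps named in the quote (delta, zig-zag, DEFLATE) can be sketched roughly as follows. This is not Cosmopolitan’s actual on-disk format: the fixed 32-bit little-endian packing, the function names, and the sample table are all assumptions made for illustration, and `zlib.compress` adds a small zlib header around the DEFLATE stream.

```python
import struct
import zlib

def zigzag(n: int) -> int:
    # Map signed to unsigned so small magnitudes stay small: 0, -1, 1, -2 -> 0, 1, 2, 3
    return (n << 1) if n >= 0 else ((-n << 1) - 1)

def unzigzag(z: int) -> int:
    return (z >> 1) if (z & 1) == 0 else -((z + 1) >> 1)

def dzd_encode(values):
    """Delta-encode, zig-zag, then DEFLATE. Each delta must fit in 31 bits."""
    prev = 0
    out = bytearray()
    for v in values:
        out += struct.pack("<I", zigzag(v - prev))
        prev = v
    return zlib.compress(bytes(out), 9)

def dzd_decode(blob):
    raw = zlib.decompress(blob)
    values, prev = [], 0
    for (z,) in struct.iter_unpack("<I", raw):
        prev += unzigzag(z)
        values.append(prev)
    return values

# Hypothetical lookup table: slowly growing offsets with small local jitter.
table = [i * 3 + (i % 7) for i in range(5000)]

plain = zlib.compress(struct.pack(f"<{len(table)}I", *table), 9)
packed = dzd_encode(table)

assert dzd_decode(packed) == table
print(len(plain), "bytes with DEFLATE alone")
print(len(packed), "bytes with delta + zig-zag + DEFLATE")
```

The idea is that offsets in a sorted or slowly changing table have small differences, so after the delta and zig-zag steps the byte stream is far more repetitive and DEFLATE has much more to work with.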

SectorLISP now supports garbage collection. This is the first time that a high-level garbage collected programming language has been optimized to fit inside the 512-byte boot sector of a floppy disk. Since we only needed 436 bytes, that means LISP has now outdistanced FORTH and BASIC to be the tiniest programming language in the world.


I did see ARMv6M mentioned for LLVM, so then according to you Julia (without the runtime) would work, no “if”. But below is mostly speculation on running even with the runtime.

Looking at the julia “binary” (main), it’s only 22472 bytes, so far so good. But even with just the first dependency, libjulia.so.1.8 at 229 KB, the combination is already close to the RAM budget at 251 KB (it’s unclear whether code needs to fit in RAM, or just in the larger 8 MB flash).
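The budget can be checked with a few lines; this is only a sketch of the arithmetic, assuming 1 KB = 1024 bytes and that both files would have to sit in the RP2040’s 264 KB of SRAM at once:

```python
KB = 1024

julia_main = 22_472      # bytes, size of the julia "binary" from the post
libjulia = 229 * KB      # libjulia.so.1.8, 229 KB from the post
ram = 264 * KB           # RP2040 on-chip SRAM

total = julia_main + libjulia
print(f"combined: {total / KB:.1f} KB of {ram // KB} KB")
print(f"left for stack, heap and data: {(ram - total) / KB:.1f} KB")
```

So even before any other dependency is counted, only a few KB would remain for the program’s own data.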

But there are many more dependencies, most of them each larger than the RAM, so it seems like an impossible task, BUT I just list the largest ones in order (and I only list those I’m confident can be eliminated):

-rwxr-xr-x 1 pharaldsson pharaldsson 221M sep 6 21:50 sys.so
-rwxr-xr-x 1 pharaldsson pharaldsson 79M sep 6 21:39 libLLVM-13jl.so
-rwxr-xr-x 1 pharaldsson pharaldsson 42M sep 6 21:51 libjulia-codegen.so.1.8
-rwxr-xr-x 1 pharaldsson pharaldsson 32M sep 6 21:39 libopenblas64_.0.3.20.so
-rwxr-xr-x 1 pharaldsson pharaldsson 18M sep 6 21:39 libstdc++.so.6.0.29
-rwxr-xr-x 1 pharaldsson pharaldsson 11M sep 6 21:51 libjulia-internal.so.1.8
-rwxr-xr-x 1 pharaldsson pharaldsson 8,7M sep 6 21:39 libgfortran.so.5.0.0
-rwxr-xr-x 1 pharaldsson pharaldsson 2,7M sep 6 21:39 libblastrampoline.so.5.0.2
-rwxr-xr-x 1 pharaldsson pharaldsson 2,3M sep 6 21:39 libmpfr.so.6.1.0
-rwxr-xr-x 1 pharaldsson pharaldsson 1,6M sep 6 21:39 libgit2.so.1.3.0
[…]
-rwxr-xr-x 1 pharaldsson pharaldsson 981K sep 6 21:39 libquadmath.so.0.0.0
-rwxr-xr-x 1 pharaldsson pharaldsson 745K sep 6 21:39 libumfpack.so.5.7.9
-rwxr-xr-x 1 pharaldsson pharaldsson 726K sep 6 21:39 libnghttp2.so.14.22.0
-rwxr-xr-x 1 pharaldsson pharaldsson 698K sep 6 21:39 libgmp.so.10.4.1
-rwxr-xr-x 1 pharaldsson pharaldsson 643K sep 6 21:39 libpcre2-8.so.0.11.0
-rwxr-xr-x 1 pharaldsson pharaldsson 638K sep 6 21:39 libcurl.so.4.8.0
-rwxr-xr-x 1 pharaldsson pharaldsson 638K sep 6 21:39 libmbedcrypto.so.2.28.0
[…]
-rwxr-xr-x 1 pharaldsson pharaldsson 466K sep 6 21:39 libgcc_s.so.1
-rwxr-xr-x 1 pharaldsson pharaldsson 305K sep 6 21:39 libmbedtls.so.2.28.0
-rwxr-xr-x 1 pharaldsson pharaldsson 303K sep 6 21:39 libssh2.so.1.0.1
-rwxr-xr-x 1 pharaldsson pharaldsson 222K sep 6 21:39 libopenlibm.so.4.0
-rwxr-xr-x 1 pharaldsson pharaldsson 209K sep 6 21:39 libklu.so.1.3.8
-rwxr-xr-x 1 pharaldsson pharaldsson 203K sep 6 21:39 libspqr.so.2.0.9
-rwxr-xr-x 1 pharaldsson pharaldsson 184K sep 6 21:39 libmbedx509.so.2.28.0
-rwxr-xr-x 1 pharaldsson pharaldsson 139K sep 6 21:39 libatomic.so.1.2.0 ???
-rwxr-xr-x 1 pharaldsson pharaldsson 117K sep 6 21:39 libz.so.1.2.12
[…]
-rwxr-xr-x 1 pharaldsson pharaldsson 11K sep 6 21:39 libsuitesparseconfig.so.5.10.1
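To put the listing in perspective, a rough tally of a few of the largest entries can be scripted. The sketch below copies six sizes from the listing by hand, assumes `ls -lh` style suffixes with binary units (K = 1024 bytes), and handles the decimal comma used by this locale; `parse_size` is a helper written for this example.

```python
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}

def parse_size(text: str) -> float:
    """Convert an `ls -lh` size such as '8,7M' or '981K' to bytes."""
    text = text.replace(",", ".")          # decimal comma in the listing's locale
    if text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)

# Size and name pairs copied from the listing above (subset).
listing = """\
221M sys.so
79M libLLVM-13jl.so
42M libjulia-codegen.so.1.8
32M libopenblas64_.0.3.20.so
11M libjulia-internal.so.1.8
8,7M libgfortran.so.5.0.0
"""

sizes = {name: parse_size(size)
         for size, name in (line.split() for line in listing.splitlines())}

total_mb = sum(sizes.values()) / UNITS["M"]
ram_bytes = 264 * 1024                     # RP2040 SRAM

print(f"{total_mb:.1f} MB in {len(sizes)} files")
print(f"largest file is about {max(sizes.values()) / ram_bytes:.0f}x the RP2040's RAM")
```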

You might think you need e.g. the C standard library (or C++ standard library), but you don’t actually even need those to run programs (just libc, or equivalent, to allow for portable code; it is not strictly needed).

The largest dependency is the Julia sysimage, sys.so, and it can be made radically smaller (while keeping the Julia runtime, e.g. GC and threads) or just eliminated when StaticCompiler.jl is used (then missing GC for now, and threads).

Note, a 16 KB Hello world binary seems huge to me (we could do it in a few bytes in assembly (or BASIC you could rely on) back in the days; plus the preinstalled OS/BIOS code).

Below I mention Cosmopolitan libc. That may seem off-topic, since it doesn’t target ARM, but it’s probably the smallest C library, or at least the only one supporting many operating systems at the same time while staying tiny. So it would be good to try to support Julia with it; besides, it does work on ARM too (through emulation, yes, not ideal for a microcontroller).

It can run “everywhere” (so ARM too) even in web browser: Actually Portable Executable

The most compelling use case for making x86-64-linux-gnu as tiny as possible, with the availability of full emulation, is that it enables normal simple native programs to run everywhere including web browsers by default.

https://justine.lol/cosmopolitan/

Cosmopolitan makes C a build-once run-anywhere language, similar to Java
[…]
# ~40kb static binary (can be ~16kb w/ MODE=tiny)
./hello.com

The above command fixes GCC so it outputs portable binaries that will run on every Linux distro in addition to Mac OS X, Windows NT, FreeBSD, OpenBSD, and NetBSD too. For details on how this works, please read the αcτµαlly pδrταblε εxεcµταblε blog post. This novel binary format is also optional, since hello.com.dbg is executable too, only on your local system since it’s an ELF binary.

αcτµαlly pδrταblε εxεcµταblε Actually Portable Executable

I found out that it’s possible to encode Windows Portable Executable files as a UNIX Sixth Edition shell script, due to the fact that the Thompson Shell didn’t use a shebang line. Once I realized it’s possible to create a synthesis of the binary formats being used by Unix, Windows, and MacOS, I couldn’t resist the temptation of making it a reality, since it means that high-performance native code can be almost as pain-free as web apps. Here’s how it works:
[…]
In the above one-liner, we’ve basically reconfigured the stock compiler on Linux so it outputs binaries that’ll run on MacOS, Windows, FreeBSD, OpenBSD, and NetBSD too. They also boot from the BIOS. […]

Platform Agnostic C / C++ / FORTRAN Tooling

Who could have predicted that cross-platform native builds would be this easy? As it turns out, they’re surprisingly cheap too. Even with all the magic numbers, win32 utf-8 polyfills, and bios bootloader code, exes still end up being roughly 100x smaller than Go Hello World:

[12 KB executable that runs on all platforms]
[…]

x86-64 Linux ABI Makes a Pretty Good Lingua Franca

[…]
It’ll be nice to know that any normal PC program we write will “just work” on Raspberry Pi and Apple ARM. All we have to do embed an ARM build of the emulator above within our x86 executables, and have them morph and re-exec appropriately, similar to how Cosmopolitan is already doing doing with qemu-x86_64, except that this wouldn’t need to be installed beforehand. The tradeoff is that, if we do this, binaries will only be 10x smaller than Go’s Hello World, instead of 100x smaller. The other tradeoff is the GCC Runtime Exception forbids code morphing, but I already took care of that for you, by rewriting the GNU runtimes.
[…]

bash hello.com              # runs it natively
./hello.com                 # runs it natively
./tinyemu.com hello.com     # just runs program
./emulator.com -t life.com  # show debugger gui
echo hello | ./emulator.com sha256.elf

[…]

DESCRIPTION

Emulates x86 Linux Programs w/ Dense Machine State Visualization
Please keep still and only watchen astaunished das blinkenlights

A bunch of almost unbelievably clever tech tricks come together into something practical with redbean 2: a webserver plus content in a single file that runs on any x86-64 operating system.

The project is the culmination – so far – of a series of remarkable, inspired hacks by programmer Justine Tunney: αcτµαlly pδrταblε εxεcµταblε, Cosmopolitan libc, and the original redbean. It may take a little time to explain what it does, so bear with us. We promise, you will be impressed.

To begin with, redbean uses a remarkable hack known as APE, which stands for Actually Portable Executable – which its author styles αcτµαlly pδrταblε εxεcµταblε. (If you know the Greek alphabet, this reads as “actmally pdrtable execmtable”, but hey, it looks cool.)

[Wow, that actually reads as “actually portable executable”, µ meant as u, not m[u]; δ is also just stylized o, not d[elta]. Maybe they’re trying to be helpful, or just missed the point of the Greek. I like that Justine is getting publicity, in this otherwise good Register article, and other news.]

[…]
Given this interest in programs that can be booted directly, without an OS at all, it may not surprise you that a future goal for redbean is to make it bootable: to embed a TCP/IP stack and network-card drivers, for a completely standalone tool.

It doesn’t (at least currently) support threads:
https://news.ycombinator.com/item?id=26297518

I don’t see why threads couldn’t be supported portably across the operating systems targeted (though likely not easily from the BIOS; then you’re basically shipping an OS with the libc), but threads aren’t always a requirement. Thread support is likely a requirement of the Julia runtime, just as it now is for Python (as of recently), even if you run with only one thread. I guess the thread support code could be fully disabled/removed in Julia.

StaticTools.jl already has a println equivalent (that works without the GC/runtime). Not that I see println as the most useful thing for microcontrollers.

Your Arduino project is amazing, what a great writeup. That needs to get broadcast out into the wider internet until some microcontroller specialists pick it up and run with it to make a package for such things.

Yes that was in fact a typo, I was thinking kB but typed MB.

Well it’s for embedded stuff so yes I’m aware. I’m thinking of stuff like monitoring sensors on a junker racing car which some friends want to build.

When it comes to microcontrollers like the RP2040, I think the right target is bare metal with a custom Julia allocator, scheduler, and GC designed for such small systems. Probably by the time that comes along the chips will include 4x the RAM. With the rapid growth of cheap computing, I fully expect something like the RP2040 to have megabytes of RAM and maybe hundreds of MB of flash in a few years (which would put them well beyond the Mac SE I had in high school).


The runtime has binary dependencies like libuv that you can’t just put on a microcontroller. Most of those don’t cross compile and/or don’t work on a microcontroller for lots of reasons, not least of which is that they require a kernel to run in.

All of the considerations regarding microcontrollers are without such libraries, i.e. without a julia runtime, which alone is already larger than the available RAM on most microcontrollers. Without some form of virtual memory layer, you’re not going to get very far, but getting that requires some form of hardware support.

This is not relevant for us, since we can’t use our binary dependencies on a microcontroller. For one, most don’t even have more than a single thread and certainly require explicit checks & orchestration just to launch, not to mention having to build them for the architecture. That’s a huge undertaking, as shown by the lengthy process of getting a julia native M1 build.

I’m aware of that, but I’m also aware that this “just” hooks into an existing libc to do the printing. There’s a reason different microcontrollers often provide their own libc, most often precisely because printing on a microcontroller is not as “trivial” as printing on an existing linux kernel is.

Haha, thank you :sweat_smile: I’m already keenly watching the static julia improvements teased at this year’s JuliaCon, but there’s quite a lot of stuff that’s required for microcontrollers to work seamlessly that’s also orthogonal to static binaries with a runtime (which is the focus of current efforts).

Yes, baremetal is the way to go for now. The chips you’re thinking of are probably closer to the System on a Chip variety, which will probably run linux already anyway. For those it’ll probably just be a matter of having an LLVM backend for our julia code, as well as having our binary dependencies run on that architecture, without the usual hassle of baremetal stuff.


Yeah, you can already run Linux + Julia on an RPi Zero 2 W, which is intended to be $15, but we will have to wait for supply chains to renormalize I guess… Perhaps you’re right: the hassle of bare metal isn’t really worth it if you can pay an extra $5 for a full Linux system :sweat_smile:

You’re already paying for the huge RAM of the chip, though not much: $1 for a single one of that 256 KB ARM chip (the cheapest I could find with at least 20 KB), versus $0.24 for the cheapest microcontroller I could find of any kind (with 496 bytes of RAM!). You can still buy a PIC with 16 bytes of RAM (the smallest RAM still sold), but those cost at minimum $0.63 (for 1), down to $0.58. For some reason I also got $0.07200 in my microcontroller search, but I don’t believe that “132KB FLASH SMART CARD IC” is an MCU, despite “Detailed Description - Microcontroller IC”, and it is 240,000 at that price, so $17,280, besides “0 In Stock”.

That might not happen (soon). I got curious about DRAM trends in density and size, but then realized that microcontrollers (most, if not all?) use SRAM, not DRAM. This one has “264kB on-chip SRAM in six independent banks”.

[An SRAM bit is 6 transistors, so this is about 12.5 million transistors. “IBM’s new 2 nm chip technology […] 50 billion transistors on a fingernail-sized chip” would mean about 500 million transistors per square millimeter, or about 1 GB if the whole chip were used for SRAM alone (about 6 GB as DRAM). So larger DRAM sizes will come with time; just how soon they reach microcontrollers is unclear, since those are always some process nodes behind and might not upgrade. Would they, even with good current equipment, just because it is obsolete? Also, there is still some older stock available, which is always going to be cheaper.]
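A quick back-of-envelope check of the transistor arithmetic in that bracketed note. It assumes 6 transistors per SRAM bit (from the post), 1 transistor per DRAM bit (an assumption; the capacitor is ignored), and that the whole 50-billion-transistor chip is spent on memory cells, which no real chip would do:

```python
SRAM_T_PER_BIT = 6   # six-transistor SRAM cell, as stated in the post
DRAM_T_PER_BIT = 1   # assumption: one transistor (plus a capacitor) per DRAM bit

def transistors_for_sram(n_bytes: int) -> int:
    return n_bytes * 8 * SRAM_T_PER_BIT

def bytes_from_transistors(transistors: int, per_bit: int) -> float:
    return transistors / per_bit / 8

rp2040_sram = 264 * 1024          # bytes of on-chip SRAM
chip = 50_000_000_000             # transistors on the quoted 2 nm chip

print(f"RP2040 SRAM: {transistors_for_sram(rp2040_sram) / 1e6:.1f} million transistors")
print(f"all-SRAM chip: {bytes_from_transistors(chip, SRAM_T_PER_BIT) / 1e9:.2f} GB")
print(f"all-DRAM chip: {bytes_from_transistors(chip, DRAM_T_PER_BIT) / 1e9:.2f} GB")
```

Real memories also need decoders, sense amplifiers and wiring, so these numbers are upper bounds at best.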

So why, then, is the smallest DRAM you can buy likely already at 1 GB? DRAM has to be made on a separate die (and is often put on top), but SRAM can be made on the same die as the logic (it also does not need to be refreshed, which is likely helpful for many microcontrollers that need to sleep and save power).

So the RAM as SRAM is always going to be competing with whatever else you can do with the transistors, like adding FPUs, more cache etc. And with flash IF it’s on the same die (I think it is, or can be, can anyone confirm?), and flash only needs 1 (floating-gate MOSFET) transistor per bit.

The cheapest 200 KB+ was actually the 264 KB Raspberry Pi RP2040TR7 at £0.88

but at the same price you can get non-ARM with 512 KB.

Then I got in my search “2.7M x 8” ordered with £0.88000:

1 : £62.58000
Cut Tape (CT)

1,000 : £57.71661
Tape & Reel (TR)

1 : £0.88000
Bulk

and this 512 KB ARM-based Cortex-M7 at the “same” price: https://www.digikey.co.uk/en/products/detail/nxp-usa-inc/MIMXRT1052CVL5A/7646296

1 : £0.88000
Bulk

1 : £5.37000
Tray

If I understand the pull request “fix `--compile=all` option, and some improvements for `--strip-ir`” (JeffBezanson, Pull Request #46935, JuliaLang/julia on GitHub) correctly, running without IR or compiler is currently broken? Might be fixed in an upcoming Julia 1.8 patch release though.

That issue has nothing to do with the Pi Pico, I believe, because it applies to PackageCompiler.jl (or full Julia; I’ve never used that option that way though, only as --compile=min, i.e. interpreting, not compiling), which will likely never support the Pico. Other packages that help compile for computers with 256 KB of RAM (or less) do not use --compile (or the runtime), I suppose.