Llamafile/tinyBLAS, and Julia small-install/binary challenge

Julia could replicate the idea behind llamafile (which works for CPUs and GPUs) and/or just use tinyBLAS:

llamafile lets you turn large language model (LLM) weights into executables.

Say you have a set of LLM weights in the form of a 4GB file (in the commonly-used GGUF format). With llamafile you can transform that 4GB file into a binary that runs on six OSes without needing to be installed.

This makes it dramatically easier to distribute and run LLMs. It also means that as models and their weights formats continue to evolve over time, llamafile gives you a way to ensure that a given set of weights will remain usable and perform consistently and reproducibly, forever.

We achieved all this by combining two projects that we love: llama.cpp (a leading open source LLM chatbot framework) with Cosmopolitan Libc (an open source project that enables C programs to be compiled and run on a large number of platforms and architectures). It also required solving several interesting and juicy problems along the way, such as adding GPU and dlopen() support to Cosmopolitan; you can read more about it in the project’s README.

I’ve wanted to make the Julia installation smaller, and I’ve succeeded (without breaking compatibility); the same idea could work for small compiled Julia programs.

The easiest way to make Julia smaller is to drop unneeded dependencies such as OpenBLAS. It’s fairly large, and could be made a dependency of just the LinearAlgebra stdlib rather than of Julia itself. It’s already possible to switch to an alternative BLAS like MKL.jl or BLIS.jl at runtime, so there’s arguably no reason to have any BLAS bundled at all.

But since having no BLAS at all, i.e. only the generic matmul fallback, isn’t considered good enough for default Julia, tinyBLAS could substitute for OpenBLAS; and since the original tinyBLAS is a fork of OpenBLAS, switching is likely easy.

If you look up tinyBLAS, you find a fork of OpenBLAS that is very old and unmaintained; I believe the only currently maintained tinyBLAS is the one at Mozilla, as part of llamafile. See e.g. llamafile/llamafile/tinyblas_cpu.h at main · Mozilla-Ocho/llamafile · GitHub (and other similarly named files) and llamafile/llamafile/tinyblas.cu at main · Mozilla-Ocho/llamafile · GitHub. [Also interesting there: “Make GeLU go 10x faster (take two)”.]

Julia doesn’t have built-in GPU support (though some external packages are listed with tier 1 support), but tinyBLAS has GPU support, so that might be a reason to include it. Matmul is O(n^3), but copying an n×n matrix is only O(n^2), so for large enough matrices, copying to the GPU, doing the matmul there, and copying the result back might be worthwhile, even transparently. [If the matrices fit in L3 cache then matmul is only O(n^2) in memory operations; doing it on the GPU might still be worthwhile, maybe(?).]

The plan for a smaller Julia would be to first get rid of OpenBLAS and any dependency that isn’t needed (feel free to make a PR at my GitHub for an unofficial fork to test out any idea related to this). It’s possibly better, or easier, not to drop OpenBLAS outright but rather to substitute tinyBLAS for it (e.g. without GPU support); Cosmopolitan Libc isn’t needed for that.

Other large dependencies of Julia include LLVM, and it could be dropped too, with Julia running by default with --compile=min, i.e. the interpreter (which could be enough for scripts; or keep LLVM, and can it be made smaller if it only has to support -O0 as the new default?). This is just to see how small Julia can be made; precompiled Julia code in packages should still work that way. Julia itself doesn’t need the C++ standard library either, only LLVM does (plus some Julia packages with C++ code in their JLL dependencies).

Mozilla sponsored our work as part of their MIECO program. Google also awarded me an open source peer bonus for my work on Cosmopolitan, which is a rare honor, and it’s nice to see our project listed up there among the greats, e.g. curl, linux, etc. In terms of this release, we’re living up to the great expectations you’ve all held for this project in a number of ways. The first is we invented a new linker that lets you build fat binaries which can run on these platforms:

  • AMD64
    • Linux
    • MacOS
    • Windows
    • FreeBSD
    • OpenBSD
    • NetBSD
  • ARM64
    • Linux
    • MacOS
    • FreeBSD
    • Windows (non-native)

It’s called apelink.c and it’s a fine piece of poetry that weaves together the Portable Executable, ELF, Mach-O, and PKZIP file formats into shell scripts that run on most PCs and servers without needing to be installed. This is an idea whose time has come; POSIX even changed their rules about binary in shell scripts specifically to let us do it.

[…]

Build Once Anywhere, Run Anywhere C/C++

One of the things we’re most happy with, is that Cosmo’s cross platform support is now good enough to support Cosmo development. We’ve traditionally only compiled code on x86 Linux. Devs using Cosmo would build their programs on Linux, and then copy the binaries to other OSes. Focusing on Linux-only helped us gain considerable velocity at the start of the project; the Cosmopolitan monorepo has two million lines of code. Today was the first day the whole thing compiled on Apple Silicon and Microsoft Windows systems, and using Cosmo-built tools.

Windows Improvements

In order to get programs like GNU Make and Emacs to work on Windows, we implemented new libraries for POSIX signals emulation. Cosmopolitan is now able to preempt i/o and deliver asynchronous signals on Windows, using a SetThreadContext() trick I learned from the Go developers. Cosmo does a considerably better job spawning processes now too. For example, we wrote a brand new posix_spawn() function that goes 10x faster than the posix_spawn() included with Cygwin.

[…]

Portability and Performance (Pick Two)

The end result is that if you switch your Linux build process to use cosmocc instead of cc then the programs you build, e.g. Bash and Emacs, will just work on the command prompts of totally different platforms like Windows and MacOS, and when you run your programs there, it’ll feel like you’re on Linux. However portability isn’t the only selling point. Cosmo Libc will make your software faster and use less memory too. For example, when I build Emacs using the cosmocc toolchain, Emacs thinks it’s building for Linux. Then, when I run it on Windows:

[screenshot]

It actually goes 2x faster than the native WIN32 port that the Emacs authors wrote on their own. Cosmo Emacs loads my dotfiles in 1.2 seconds whereas GNU Emacs on Windows loads them in 2.3 seconds. Many years ago when I started this project, I had this unproven belief that portability toil could be abstracted by having a better C library. Now I think this is all the proof we need that it’s not only possible to make software instantly portable, but actually better too. For example, one of the things you may be wondering is, “these fat binary files are huge, wouldn’t that waste memory?” The answer is no, because Cosmo only pages into memory the parts of the executable you need.

Boulder startup dylibso just announced a few weeks ago that they’ve adopted Cosmopolitan for their new product Hermit: Actually Portable Wasm.
