ReactantServer.jl: serve more models per GPU with Reactant.jl and XLA (gauging interest)

csvance · May 31, 2026, 6:00am

I have been building a Julia inference server and want to find out whether others would find it useful before investing more in polishing it for general use. Feedback, criticism, and “we already have this, it is called X” are all welcome.

The short version

ReactantServer.jl is a Julia inference server built to get far more models onto a GPU than its memory would normally hold. It is Julia first and fully extensible in Julia: the compiled model is the fast core, and everything around it, from pre- and post-processing to the serving logic itself, is ordinary Julia you can read and shape. Models are compiled ahead of time through Reactant.jl’s PJRT bindings, and the server speaks the KServe V2 inference API over gRPC, so existing Triton and KServe clients connect to it without changes.

It grew out of a concrete migration. We had been serving PyTorch computer-vision models on NVIDIA Triton and decided to move off it. Exporting those models to StableHLO and serving them through XLA gives us whole-program compiler optimization and, more importantly, the ability to keep far more models on each GPU. We kept KServe V2 as the wire protocol so our existing clients did not have to change.

What pushed us was stagnation on both sides. TorchScript, which our deployment leaned on, was deprecated years ago. torch.compile is a real step forward for LLMs, but in our experience it’s essentially useless on the static vision models we run. On the serving side, native support for the newer torch.export path took years to arrive. Compiling to StableHLO and serving through XLA steps around all of that: models from Lux.jl, PyTorch, and JAX become a portable compiled artifact, complete with whole-program optimizations like kernel fusion.

Key ideas it is built around

Fit more models than GPU memory holds. GPU memory is quickly becoming a dominant cost in inference infrastructure, so serving more models per card can directly lower cost per inference. Because only one model executes at a time, the GPU does not need every model’s weights resident at once. The server materializes every model’s weights into host RAM at startup, then transfers a model’s weights onto the GPU on demand when a request arrives, keeps them resident for reuse, and evicts cold models under a configurable GPU memory budget. Because the weights are already in RAM, an on-demand load is a single host-to-device transfer, on the order of a single inference rather than a reload from disk. The practical effect is that you can cram far more models onto one GPU than its memory would normally hold, paying only a small transfer cost when a cold model is first called. Needing fewer GPUs for the same catalog of models lowers compute cost.
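
To make the eviction idea concrete, here is a minimal sketch of a weight cache that keeps models resident under a byte budget. It assumes a least-recently-used policy, which is only one possible reading of "evicts cold models". `WeightCache` and `ensure_resident!` are hypothetical names for illustration, not part of the ReactantServer API, and the actual host-to-device copy is left out.

```julia
# Hypothetical sketch: names and policy (LRU) are assumptions, not ReactantServer's API.
mutable struct WeightCache
    budget::Int                 # bytes of device memory reserved for weights
    used::Int                   # bytes currently resident on the device
    sizes::Dict{String,Int}     # model name => weight size in bytes (all held in host RAM)
    resident::Vector{String}    # resident models, least recently used first
end

WeightCache(budget, sizes) = WeightCache(budget, 0, sizes, String[])

# Make `name` resident, evicting the coldest models until it fits.
# Returns the names of the evicted models. A real server would issue the
# host-to-device transfer where the model is pushed onto `resident`.
function ensure_resident!(c::WeightCache, name::String)
    evicted = String[]
    idx = findfirst(==(name), c.resident)
    if idx !== nothing                      # warm hit: only refresh recency
        deleteat!(c.resident, idx)
        push!(c.resident, name)
        return evicted
    end
    need = c.sizes[name]
    need <= c.budget || error("model $name does not fit in the weight budget")
    while c.used + need > c.budget          # evict least recently used first
        victim = popfirst!(c.resident)
        c.used -= c.sizes[victim]
        push!(evicted, victim)
    end
    push!(c.resident, name)
    c.used += need
    return evicted
end
```

With a 10-byte budget and three 4-byte models, calling `a`, `b`, `a`, then `c` evicts `b`, because `a` was touched more recently.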

Compiler-grade speed on static graphs, extended with dynamic Julia. Each model is compiled once into an optimized executable. XLA does whole-program optimization, fuses kernels, and plans layout, which is where the Julia ML stack and Reactant give a small team performance they would otherwise have to build by hand. The compiled graph is the fast static core, and you extend it with dynamic logic in plain Julia: data-dependent control flow, custom pre- and post-processing, anything that does not belong in a static graph runs as ordinary Julia wrapped around the model. More on how that looks below.

A cost-aware scheduler ties it together. Requests land on per-model queues, and a single dispatch loop picks the next model by a deficit-weighted fair-share policy, honors optional per-model latency budgets, and coalesces concurrent requests for the same model into one batched execution, which amortizes fixed per-launch overhead.
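
One dispatch step can be sketched as follows. This is a simplified illustration, not the project's scheduler: it assumes fairness is tracked as GPU seconds consumed divided by a per-model weight, picks the waiting model furthest behind its share, and coalesces up to `max_batch` queued requests. Latency budgets are left out, and `ModelQueue` and `dispatch!` are hypothetical names.

```julia
# Hypothetical sketch of a single dispatch step; not ReactantServer's actual scheduler.
mutable struct ModelQueue
    weight::Float64        # fair-share weight (larger = entitled to more GPU time)
    used::Float64          # GPU seconds consumed so far
    pending::Vector{Int}   # queued request ids, oldest first
end

# Pick the waiting model with the lowest weighted usage, take up to `max_batch`
# of its requests as one batch, and charge it the cost of that execution.
# Returns `(model_name, request_ids)` or `nothing` when every queue is empty.
function dispatch!(queues::Dict{String,ModelQueue}, cost_seconds::Dict{String,Float64};
                   max_batch::Int = 6)
    ready = sort([name for (name, q) in queues if !isempty(q.pending)])
    isempty(ready) && return nothing
    # sorted names make ties deterministic: argmin returns the first minimum
    best = ready[argmin([queues[n].used / queues[n].weight for n in ready])]
    q = queues[best]
    n = min(length(q.pending), max_batch)
    batch = splice!(q.pending, 1:n)       # coalesce requests into one execution
    q.used += cost_seconds[best]          # one launch is charged once, whatever the batch size
    return best, batch
end
```

Charging the launch once per batch rather than once per request is what lets coalescing amortize the fixed overhead in this model.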

Julia first, with a Triton-style model repository

You point the server at a model repository, a directory of model bundles, the same way you point Triton at one. Each bundle is a folder with the compiled model, its weights, a manifest describing the inputs and outputs, and an optional model.jl. The server discovers everything in the repository at startup and exposes the set over the KServe RepositoryIndex call.

The design is Julia first, and the clearest place that shows is pre- and post-processing. Those hooks are plain Julia in the bundle’s model.jl, registered with register_model. This earns Julia its place on the hot path: logic that is awkward to express as a static graph, such as data-dependent loops and early exits, is a few lines of ordinary Julia running right next to the model. When we migrated our PyTorch models, the parts that did not export cleanly to a static graph became Julia post-processing instead of being contorted to fit.

A bundle looks like this:

severity_grader/
  manifest.yaml          # input/output spec and compiled batch sizes
  model.b1.mlir          # compiled StableHLO, one module per batch size
  model.b3.mlir
  model.b6.mlir
  weights.safetensors    # shared across batch sizes
  model.jl               # Julia post-processing (optional)

# manifest.yaml
format_version: "2.0"
name: "severity_grader"
executable_inputs:
  - name: "INPUT__0"
    dtype: "u8"
    shape: "whn"            # w width, h height; n is the batch axis (Julia column-major)
    dims: { w: 224, h: 224 }
executable_outputs:        # the model's raw outputs
  - name: "OUTPUT__0"
    dtype: "f32"
    shape: "zn"             # z logits, n batch
    dims: { z: 9 }
client_outputs:            # what the caller sees after model.jl post-processing
  - name: "grade"
    dtype: "i64"
    shape: "yn"             # y is the ordinal class, one per sample
    dims: { y: 1 }
batching:
  compiled_batch_sizes: [1, 3, 6]

# model.jl
using ReactantServer: NamedTensor

# The model emits CORAL ordinal-regression logits. The grade is the number of consecutive
# leading logits above zero, stopping at the first that is not. That early break is awkward to
# express as a static XLA graph, but it is a plain loop in Julia, running right next to the model.
function postprocess(out::Vector{NamedTensor})
    logits = out[1].data::Array{Float32}        # OUTPUT__0 logits, (z, batch) in Julia column-major
    Z, B = size(logits)
    grade = zeros(Int64, 1, B)
    @inbounds for b in 1:B
        n = 0
        for i in 1:Z
            logits[i, b] > 0f0 ? (n = i) : break
        end
        grade[1, b] = n
    end
    return NamedTensor[NamedTensor("grade", grade)]
end

register_model("severity_grader"; postprocess=postprocess)
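
Since the bundle ships one compiled module per batch size, a coalesced batch has to be mapped onto one of them. The sketch below assumes the server picks the smallest compiled size that fits and zero-pads the trailing batch axis; that policy is an assumption for illustration, and `pick_batch_size` and `pad_batch` are hypothetical helpers rather than ReactantServer functions.

```julia
# Hypothetical helpers; the selection and padding policy is an assumption.

# Smallest compiled batch size that can hold `n` requests,
# or the largest compiled size if `n` exceeds all of them.
function pick_batch_size(compiled::Vector{Int}, n::Int)
    sizes = sort(compiled)
    idx = findfirst(>=(n), sizes)
    return idx === nothing ? last(sizes) : sizes[idx]
end

# Zero-pad a (w, h, n) input along the trailing batch axis up to `target` samples.
function pad_batch(x::Array{UInt8,3}, target::Int)
    w, h, n = size(x)
    n <= target || error("batch of $n exceeds compiled size $target")
    padded = zeros(UInt8, w, h, target)
    padded[:, :, 1:n] .= x
    return padded
end
```

With `compiled_batch_sizes: [1, 3, 6]`, two concurrent requests would run on the `b3` module with one padded slot, whose output is discarded.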

Multiple GPUs and shared memory

Each worker drives one GPU and serves the full KServe API on its own, so a single-GPU deployment needs nothing else. To scale out, you run one worker per GPU and put a gateway in front. The gateway automatically detects which worker is serving which model, using the RepositoryIndex each worker exposes, and presents a single API endpoint to clients. Every request is routed to the worker that holds the requested model with no manual configuration, and the request is forwarded unchanged, so clients see one server rather than a fleet.
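
The routing table the gateway needs is essentially the inverse of what each worker reports. A minimal sketch, assuming the gateway has already collected each worker's model list from its RepositoryIndex response (the gRPC call itself is omitted, and `build_routes` and the worker addresses are hypothetical):

```julia
# Hypothetical sketch: invert "worker => models it reports" into "model => workers".
function build_routes(index::Dict{String,Vector{String}})
    routes = Dict{String,Vector{String}}()
    # iterate workers in sorted order so each model's worker list is deterministic
    for worker in sort(collect(keys(index))), model in index[worker]
        push!(get!(routes, model, String[]), worker)
    end
    return routes
end

index = Dict("gpu0:8001" => ["severity_grader", "detector"],
             "gpu1:8001" => ["detector", "segmenter"])
routes = build_routes(index)
# routes["detector"] lists both workers; the other models map to one worker each
```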

For large tensors, the server also implements NVIDIA Triton’s system shared-memory extension. Clients register a shared-memory region and reference it from input and output tensors, so the tensor data never travels over the socket. Existing Triton shared-memory clients work unchanged.

Status

There is a complete path through every layer: load a StableHLO bundle, compile through Reactant/PJRT, schedule, serve over KServe V2 gRPC, return a result. The on-demand weight cache and the scheduler are implemented and tested. Conversion tooling turns a Lux.jl model or a PyTorch nn.Module (via torch.export and torchax) into a server-loadable bundle.

We are also working on Revise.jl support, so the server and the Julia pre/post-processing in model.jl can be edited and hot-reloaded without a full restart while developing.

It is deliberately narrow. It is XLA-centric and static-graph-centric, and it is not trying to be an LLM serving stack, a multi-framework server like Triton, or a hyperscale system. If you need those, they exist and do their jobs well.

What I am asking

Is there appetite in the Julia community for an XLA-based, KServe-compatible inference server?
If you serve models in production, would simple on-demand weight loading change what fits on your hardware? What does your model mix and traffic pattern look like?

wsmoses · June 1, 2026, 4:08am

Seems like a quite fun project!

Two quick logistical comments.

This package seems like it may end up touching some of the device-internal and/or JLL-internal methods from within Reactant. To reduce the likelihood of a mismatch occurring in the future, is this something you’d be willing to potentially upstream as a separate package within the Reactant.jl monorepo (e.g. like the separate ReactantCore package within Reactant.jl/lib at main · EnzymeAD/Reactant.jl · GitHub)? Of course you would be the maintainer of it, but it may make it easier for other folks to help and of course keep it tested (we have a reasonable number of GPU CI machines).

Also, in addition to the XLA-specific dialects, Reactant supports a variety of other MLIR dialects as well, which, if you’re using Reactant for execution, would presumably also be supported out of the box. If that’s the case, what would you think about calling it something like ReactantServer?

csvance · June 1, 2026, 5:22pm

When I originally built the server I wasn’t sure if I would be able to make the memory model work with Reactant (I needed to free weights manually, without waiting for GC and without a double free happening). Once I realized I could use Reactant as is and not maintain my own fork / PJRT binding, ReactantServer indeed became a better name for it. As you surmised, that means I am relying a bit on the package internals, and having it tested as part of Reactant plus having access to GPU runners would certainly help smooth out development and maintenance.

I’m working on getting the final sign off to open source now; I don’t expect any issue there based on the other things we have open sourced in the past. Once I have that I can start doing the groundwork to get it ready for general usage and the Reactant repo sounds like an ideal home for the project.

wsmoses · June 2, 2026, 3:40am

Sounds great! Let us know if there’s anything we can do to help in the meantime!

csvance · June 3, 2026, 4:10pm

Got the final sign off! Made a meta issue here to track the status of getting the server ready: ReactantServer.jl · Issue #2940 · EnzymeAD/Reactant.jl · GitHub

If anyone is interested in figuring out the best way to distribute models and requests across many different GPUs, that could be an interesting project. Currently each GPU manages its own assigned models, but in the long run it would be good to have an option to do this at the gateway level when serving with many GPUs/TPUs, multiple nodes, etc.

csvance · June 17, 2026, 6:41pm

Got the repo set up in EnzymeAD: GitHub - EnzymeAD/ReactantServer.jl: Production inference server for Reactant-compiled models, serving KServe V2 over gRPC · GitHub

For now, if you want to try ReactantServer, note that it is not yet a registered Julia package. The plan is to register it once we finish upstreaming various patches and register a gRPC server package.

Here is a brief overview of some of the most common configurations people will likely want and how they are currently handled. Support for more accelerators is planned. The documentation should be up to date, so you can check that for more details.

Single GPU

On-demand model loading
Batch coalescing
FIFO + fair scheduling options
Add/remove/update models without restart

If you have a small lab and you want to serve a ton of models you have trained without worrying about GPU capacity, on-demand loading with fair scheduling provides a balanced experience.

Multi GPU, models distributed without replication

On-demand model loading
ReactantServerGateway LPT-packing batch coalescing
Add/remove/update models without restart

You can logically think of this as seamlessly scaling up the single GPU case. It doesn’t provide as many guarantees as the fair scheduler, but the way LPT-packing distributes models implicitly tries to avoid a situation where infrequently called models are completely starved by frequently called ones.
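
For readers unfamiliar with the term, the sketch below shows the classic LPT (longest processing time first) heuristic: sort items by descending cost and place each on the currently least-loaded bin. It assumes the per-model cost is something like recent compute time; how ReactantServerGateway actually measures cost and handles memory limits is not shown here, and `lpt_pack` is a hypothetical name.

```julia
# Classic LPT heuristic; the cost metric and function name are assumptions.
# Returns (model => gpu index, total load per GPU).
function lpt_pack(costs::Dict{String,Float64}, ngpus::Int)
    load = zeros(Float64, ngpus)
    placement = Dict{String,Int}()
    # most expensive first; ties break by name so the result is deterministic
    order = sort(collect(keys(costs)); by = name -> (-costs[name], name))
    for name in order
        gpu = argmin(load)            # least-loaded GPU so far
        placement[name] = gpu
        load[gpu] += costs[name]
    end
    return placement, load
end

placement, load = lpt_pack(Dict("a" => 5.0, "b" => 4.0, "c" => 3.0, "d" => 2.0), 2)
# "a" and "d" share GPU 1, "b" and "c" share GPU 2, and both GPUs carry a load of 7.0
```

Because the heaviest models are spread out first, a frequently called model tends not to share a GPU with several other heavy ones, which is the implicit anti-starvation effect described above.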

Multi GPU, models replicated

ReactantServerGateway Round robin (not optimal for batch coalescing)

This is the configuration that needs the most work. It works, but it’s not going to maximize batch coalescing due to how round robin works.

Multi Node (bring your own control plane)

A gRPC control plane service endpoint is provided. Usually if you need this sort of thing you are at the scale where you would build a control plane tailored to your specific requirements, so it doesn’t really make much sense for the project to try to provide the actual control plane. However, we will try to provide the interface a control plane would use in order to integrate ReactantServer.

csvance · June 19, 2026, 2:09pm

Spent a decent amount of time today fixing bugs and optimizing performance for the multi-GPU configuration. The multi-GPU case with replicas is now fully supported via lpt_packing mode. You can either set a default number of replicas for all models or set it per model. The coalescing-aware routing modes fill_rr and fill_least have been added; they encourage batching while distributing a model across multiple GPUs. If you don’t care about batching or just want something simpler, you can still use the least_outstanding and round_robin scheduler modes instead.

The server now automatically detects whether the accelerator supports TF32, and automatically converts .mlir bundles that were lowered with TF32 to normal FP32 precision. So you can export your .mlir once and use it on both pre- and post-Ampere GPUs.

I’m working on adding some helpers for serving common types of models with data-dependent operations, like object detection. The idea is to produce StableHLO only for the dense backbone, then use Julia to handle NMS and other data-dependent ops in the postprocessing step.
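
As an illustration of why this kind of step suits plain Julia, here is a minimal greedy NMS sketch. It is not the helper being added to ReactantServer: the function names, the `(x1, y1, x2, y2)` box layout, and the 4×N column-major arrangement are all assumptions for the example.

```julia
# Illustrative greedy NMS; not ReactantServer's helper. Boxes are (x1, y1, x2, y2).

# Intersection-over-union of two boxes.
function iou(a, b)
    iw = max(0f0, min(a[3], b[3]) - max(a[1], b[1]))
    ih = max(0f0, min(a[4], b[4]) - max(a[2], b[2]))
    inter = iw * ih
    area_a = (a[3] - a[1]) * (a[4] - a[2])
    area_b = (b[3] - b[1]) * (b[4] - b[2])
    return inter / (area_a + area_b - inter)
end

# `boxes` is 4×N with one box per column, `scores` has length N.
# Returns the indices of the kept boxes, highest score first.
function nms(boxes::Matrix{Float32}, scores::Vector{Float32}; iou_threshold = 0.5f0)
    order = sortperm(scores; rev = true)
    keep = Int[]
    for i in order
        box = view(boxes, :, i)
        # keep the box only if it does not overlap too much with any kept box
        if all(iou(box, view(boxes, :, k)) <= iou_threshold for k in keep)
            push!(keep, i)
        end
    end
    return keep
end
```

The number of kept boxes depends on the data, which is exactly what a static graph with fixed output shapes handles poorly.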

csvance · June 22, 2026, 5:24am

In order to support standard object detection models, I implemented “meta models,” which let you chain together multiple compiled StableHLO programs with arbitrary Julia logic in between. This is what makes a model like a Faster R-CNN servable: its detection pipeline (RPN proposal NMS, roi_align, box decoding, per-class NMS) is data-dependent and can’t be captured as a single static graph, so the meta model runs two traced StableHLO stages with that glue expressed in pure Julia between them.

All StableHLO executables under a meta model are scheduled and placed on the same device for performance reasons, but it’s done in a way where the scheduler algorithms don’t need to be aware of it.

An end-to-end example is included for the torchvision fasterrcnn_resnet50_fpn model (a ResNet-50 + FPN GeneralizedRCNN, COCO-pretrained): it exports the model to StableHLO via torch.export, loads it with ReactantServer, and serves a request.

Finally, I did quite a bit of additional work on making the server robust under load. This resulted in a number of improvements to load shedding and deadline handling, maximizing throughput while minimizing canceled requests, etc. At work we are now testing the server as part of our next major release. We are already seeing close to a 50% increase in throughput over TorchScript models in NVIDIA Triton thanks to XLA, while being able to serve all of our models on even our oldest hardware node, where that wasn’t possible previously.

csvance · June 26, 2026, 3:19pm

Here is a quick update on recent improvements to the server.

Any setup that shuffles models on and off the GPU (single-GPU on-demand loading, or multi-GPU lpt_packing) can accumulate BFC arena fragmentation over time. It’s the kind of thing that bites well after deployment.

The fix: reset the arena on a cadence tied to model movement. For lpt_packing, the default is now to compact every time the scheduler recomputes placement. Compacting every scheduler run effectively means fragmentation can never cause an allocation to fail, with a configurable headroom factor for in-flight requests to models that just moved GPUs. Compaction can also be tied to the number of model loads/evictions (the natural trigger in single-device mode).

On compaction, each worker unloads its models and either lazily reloads on demand (eager, the default) or reloads everything before serving (scheduler).

We also now measure the maximum scratch buffer size at startup and automatically compute how much device memory can go to weights. A configurable wiggle-room factor keeps the estimate on the conservative side, so operation stays smooth even if new models are added after startup.

Other new defaults for safer multi-GPU operation:

Eager compaction every scheduler execution (so hysteresis now defaults to 0.0, no longer needed)
A flag preventing placements that would overcommit a GPU’s weights past what fits resident
Shorter first rebalance interval (1 min) for faster warmup, then 5 min regular
Compute-time EMA half-life tied to the scheduler window, so the fairness signal aligns with the rebalance cadence by default
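
For reference, an exponential moving average parameterized by half-life can be written as below. This is the standard formula, shown as a sketch; how ReactantServer actually computes its compute-time EMA is not confirmed here, and `ema_update` is a hypothetical name.

```julia
# Standard half-life EMA; the function name is hypothetical.
# `dt` is the time since the previous update, in the same unit as `half_life`.
# After one half-life, the previous estimate contributes exactly one half.
function ema_update(ema::Float64, sample::Float64, dt::Float64, half_life::Float64)
    decay = 2.0^(-dt / half_life)
    return decay * ema + (1 - decay) * sample
end
```

Setting `half_life` to the rebalance interval (for example 300 seconds for the 5 min cadence) means compute time observed before the last rebalance counts for at most half of the current estimate.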

Also new: a Grafana/Prometheus observability stack with preconfigured dashboards, so you can watch memory, scheduling, and throughput out of the box.

csvance · July 13, 2026, 3:11pm

We ran a load test over the weekend looking for any kind of stability issues / memory leaks / latency spikes. The test was conducted on ~100 different models spanning a huge range of computer vision applications with four GPUs. Everything held steady.

As of now here is where the priorities stand:

Setting up a release system where known working combinations of Reactant/Reactant_JLL are provided together with a Docker image where everything has been tested together.
Improving the deployment experience / documentation. One of the largest footguns is that when autotuning runs for the first time on startup, it interferes with the memory high water mark we use to estimate max scratch buffer size.
Working towards publishing gRPCServer.jl: this is the only remaining dependency that needs to be vendored / doesn’t have a compatible version in the general registry. Several patches were upstreamed to HTTP.jl and Prometheus.jl, allowing us to unvendor them.

j_u · July 18, 2026, 7:52pm

Hey! I’m wondering, is a gRPC interface really suitable if more GPUs are needed than a single node can provide, or would zero-copy RDMA semantics serve such a case much better? I am also wondering whether ReactantServer.jl is only for vision, or whether it is somehow possible to serve language models as well. What do you think, if I may ask?

csvance · July 18, 2026, 8:27pm

In the future we could support models that need multiple devices on a single node, as this is very well supported by PJRT. Currently a model must fit on a single device. A single model that needs multiple nodes gets pretty far outside of the core vision of the server, and it goes beyond what PJRT supports. I’d be open to someone contributing/maintaining an IFRT server package which provides the same interface externally, but it’s not something I personally have a use case for.

To answer your question about gRPC, the way I’m currently using the server is on a single node with host shared memory, so model inputs/outputs bypass being protobuf encoded/decoded entirely. I’d be open to contributions extending this, but keep in mind that if we are talking about device memory, that would have to be allocated by ReactantServer so it is accounted for. Otherwise we cannot guarantee deterministic memory usage. I suppose if you just want to load a set of models and never hot swap anything, that isn’t an issue.

As for language models, I wouldn’t rule it out entirely for things like BERT, embedding / cross encoder models, but there are already very well established tools for serving large language models like vLLM. I originally built ReactantServer with vision models in mind, so that’s where most of my development effort has gone.

j_u · July 19, 2026, 3:25pm

I understand. Thank you for your detailed reply. Very interesting project.

csvance · July 23, 2026, 4:11pm

It turns out that BERT based models are trivial to export and serve with ReactantServer. I added tutorials here for four common use cases: Transformer Text Models · ReactantServer.jl

Dense Embedding: sentence-transformers/all-MiniLM-L6-v2
Sparse Embedding: prithivida/Splade_PP_en_v2
Cross Encoders: cross-encoder/ms-marco-MiniLM-L6-v2
Classification: distilbert/distilbert-base-uncased-finetuned-sst-2-english

If your lab/company uses a vector-based knowledge base / retrieval system for RAG, you can already host the models needed for that with ReactantServer.

I also looked into serving LLMs. The main challenges are potentially needing many different compiled program sizes with padding / KV cache using StableHLO semantics / wrapping all of that up together in a way that can be effectively batched. In principle, there shouldn’t be any reason why you couldn’t serve a dense model like Qwen3.6-27B today. I don’t have time to look into this currently, but serving these sorts of models which can reasonably fit on a single workstation class GPU is something I would like to support in the future. Contributions are welcome!

yuriko_diaz11 · July 23, 2026, 10:59pm

Curious what the actual latency overhead is from swapping models. Does cold vs warm matter much?

csvance · July 24, 2026, 12:13am

On an A6000 (PCIe 4.0) I’ve observed between 10 ms and 70 ms for models between 100 MB and 1 GB. This can likely be improved somewhat by pinning the host memory, which we are not doing yet. It would also work better if all of the weights were a single contiguous transfer. With those two things addressed we should be able to transfer at close to the maximum rate for PCIe. So for 100 MB that would be around 4 ms, and for 1 GB around 40 ms.

If you are able to batch, you can amortize the cost of loading the weights across the entire batch.
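
The arithmetic behind those estimates, as a small sketch. The 25 GB/s effective rate is an assumption inferred from the 4 ms and 40 ms figures above, not a measured number, and the helper names are made up for the example.

```julia
# Ideal host-to-device transfer time in milliseconds, ignoring driver and launch overhead.
transfer_ms(bytes, bytes_per_second) = 1000 * bytes / bytes_per_second

# Share of a one-off weight load carried by each request in a coalesced batch.
per_request_ms(load_ms, batch) = load_ms / batch

pcie_rate = 25e9                              # assumed effective PCIe 4.0 x16 rate, bytes/s
small = transfer_ms(100e6, pcie_rate)         # 100 MB model
large = transfer_ms(1e9, pcie_rate)           # 1 GB model
amortized = per_request_ms(large, 6)          # cold 1 GB load spread over a batch of 6
```

Under that assumption `small` is about 4 ms and `large` about 40 ms, and the batch of six brings the cold-load share down to under 7 ms per request.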

j_u · July 27, 2026, 3:39pm

Thank you. I will look into this. I’m a private individual. In production, I’m currently running a small RAG pipeline 24/7. I’m looking to upgrade my embedding model from qwen3-embedding-0.6b (1024 dims) to something more performant, such as qwen3-embedding-8b (4096 dims) or jina-embeddings-v4-3.8b (2048 dims). Right now, I store vectors in FLOAT32, which take up roughly three times the space of my underlying data, even after basic database optimizations. I’m considering switching to INT8 or Binary quantization, though I’m still evaluating the best approach. My current RAG infra is mostly serverless - I’m looking to move away from that eventually.

As for the LLM semantics, currently I’m running Kimi K2.7 Code (not 24/7) and looking into MaxText to train a smaller model. I was considering Baseten Truss for serving, however, I think that your solution is potentially better.
