Arceus.jl — A Lightning-Fast Behavior Resolution System

Hi all,

I’m happy to present Arceus.jl, a fork of AliceRoselia’s high-performance package Arceus. It introduces a novel way to resolve behaviors for entities based on their component sets, using magic bitboards, an optimization technique originally developed in chess engines for O(1) move lookups.

Arceus.jl provides constant-time behavior resolution for ECS-style setups where entities are encoded as UInt64 bitfields. This is especially useful in games, simulations, or other systems where thousands of entities need to be matched to behaviors in real-time.

Explanations

You define a @traitpool, which is a set of traits (characteristics influencing the entity’s behavior) and subpools (subsets of traits and other subpools).
Once that is done, you create an instance of the trait pool with @make_traitpool and add only the traits that describe your object; you can add and remove traits with @addtrait and @removetrait.
Next, you create a lookup function with @lookup, specifying in its body how each trait affects the result when present. The first run may take a while, since computing the magic number is slow, but subsequent runs are fast (unless you add or remove a test in the lookup function).
You are then all set to build your lookup table with @make_lookup.
From there, for any instance of that trait pool, the behavior for its set of traits is retrieved in under 10ns, even on potato hardware.


Features

  • O(1) Behavior Lookup
    Behaviors are precomputed for all possible component combinations, enabling instant retrieval.

  • Magic Bitboard Caching
    Magic numbers are expensive to compute, so they’re deterministically cached to disk (CSV), enabling fast startup on subsequent runs.

  • ECS Compatibility
    Works with any archetype-based ECS that represents components using bitmasks or similar binary encodings.

  • Fine-Grained Bit Control
    You can allocate custom bit ranges for component pools, set precise trait positions, and manage trait dependencies.
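To give a feel for the magic-bitboard trick behind these features, here is a toy sketch in plain Julia. It is illustrative only, not Arceus.jl’s internal API: the trait bits are multiplied by a “magic” constant so that the interesting bits land in the top of the word, which is then used as a dense table index. `MAGIC` and `SHIFT` here are assumptions chosen so that the four 2-trait patterns hash to distinct slots.

```julia
# Toy sketch of the magic-bitboard idea (illustrative, not Arceus's API).
const MAGIC = 0x9e3779b97f4a7c15   # an arbitrary mixing constant (assumption)
const SHIFT = 60                   # keep the top 4 bits -> 16 table slots

magic_index(bits::UInt64) = ((bits * MAGIC) >> SHIFT) + 1

# Precompute a "behavior" (here simply the popcount) for every trait combination:
table = zeros(Int, 16)
for pattern in UInt64(0):UInt64(3)
    table[magic_index(pattern)] = count_ones(pattern)
end

table[magic_index(UInt64(0b11))]   # O(1) lookup: multiply, shift, index -> 2
```

A real magic number must hash every pattern of interest to a distinct slot; finding one is the expensive search that Arceus.jl caches to disk.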


Example

Here’s a simplified usage example:

using Arceus

@traitpool "EndingTrait" begin
    @trait start
    @trait meetX
    @trait gotSaber
    @trait lostAlly at 2
    @subpool side begin
        @trait good
        @trait evil 
    end
end

@make_traitpool GameEnding from "EndingTrait" begin
    @trait start
end

@addtrait GameEnding begin
    @trait meetX
    @trait lostAlly
    @trait side.good
end

f1 = @lookup END "EndingTrait" begin
    val = UInt(0)
    if @hastrait END.start
        val |= 1
    end
    if @hastrait END.meetX
        val |= 2 
    end
    if @hastrait END.gotSaber
        val |= 4
    end
    if @hastrait END.lostAlly
        val |= 8
    end
    if @hastrait END.side.good
        val |= 64
    end
    if @hastrait END.side.evil
        val |= 128
    end
    return val
end

@register_variable f1
ending = @make_lookup f1

println(ending[GameEnding])

This allows sub-10ns behavior lookup even on modest hardware.


Known Limitations

  • Heavy Initial Computation: Magic numbers are expensive to generate. It’s best to compute them once during initialization—after that, they are cached to a CSV file.
  • 64-bit Limit: The system operates on UInt64, meaning you’re limited to 64 component flags. Beyond that, the magic bitboard trick doesn’t apply.
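For readers unfamiliar with the encoding behind that limit: a UInt64 bitfield gives one flag per bit, hence the cap of 64 traits. A minimal sketch of such an encoding in plain Julia, independent of Arceus’s macros (the trait names are just examples):

```julia
# Each trait occupies one bit of a UInt64, so at most 64 flags fit.
const START  = UInt64(1) << 0
const MEET_X = UInt64(1) << 1
const SABER  = UInt64(1) << 2

entity = START | MEET_X            # this entity has two traits set
has(e, t) = (e & t) != 0           # membership test is a single AND

has(entity, MEET_X)                # -> true
has(entity, SABER)                 # -> false
count_ones(entity)                 # -> 2 traits set
```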

Links


Contribute

Contributions are very welcome—whether it’s suggestions, PRs, or feedback. And please consider supporting the original author of the system!


Summary

If you’re working with ECS architectures and need lightning-fast behavior resolution for complex trait combinations, Arceus.jl might be worth a look. Let me know what you think, and feel free to ask questions or open issues.


I don’t know if caching performance is important, but caching to an Arrow file (Arrow.jl) may be more performant and no harder than caching to CSV.


Thanks.
It would definitely help for caching bigger data easily.

Depending on the structure of the magic number tables, I suspect it may even not be a bad idea to roll your own binary format.

Oh also, where are you putting the cache files?


Right now it just sits in the same folder as the code, but I will fix that soon. And a custom binary format may really help; I’m on it.

Nice one! I think you’re nailing an all-around fast ECS infrastructure for Julia!

If you want to continue this line of work, I have additional challenges depending on how far you’re willing to go. Some of these I failed to do myself. If you want to try them, good luck.

  1. Add pext support (faster than magic lookup on some architectures). See Bit Manipulation Instruction Set - #5 by yuyichao.
  2. You seem to be implementing a traditional ECS in addition to this one. One realization about archetypal ECS is that each combination of traits might get used many times in an update. For example, one might update entities on fire with -5 HP and poisoned entities with -5 HP. One should be able to merge them and let the LLVM compiler efficiently update both.
  3. If you’re up for more challenges, take a look at something like XLA (accelerated linear algebra), Reactant, etc., to optimize not only scalar updates but linear-algebraic updates as well.
  4. Dynamically compiling for each archetype introduces a compiled Julia function. These functions currently stay in memory and do not get garbage-collected. This is a known issue. If you’re daring enough, you can fix it. (This problem is a really hard one; you need to fix Julia itself. Maybe keep this for later.)
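For context on point 1: pext (“parallel bits extract”, from the BMI2 instruction set) gathers the bits of a word selected by a mask into the low bits of the result, so the relevant trait bits can index a table directly with no magic multiply. A portable software sketch of its semantics (my own illustration, not code from either package):

```julia
# Software fallback mimicking the x86 BMI2 PEXT instruction: collect the
# bits of `x` at the positions set in `mask` into the low bits of the result.
function pext_fallback(x::UInt64, mask::UInt64)
    res = zero(UInt64)
    bb = one(UInt64)
    while mask != 0
        lsb = mask & (-mask)       # isolate the lowest set bit of mask
        if x & lsb != 0
            res |= bb
        end
        mask &= mask - 1           # clear that bit and move on
        bb <<= 1
    end
    return res
end

pext_fallback(UInt64(0b1101), UInt64(0b1010))   # -> 0b10
```

On hardware with BMI2, the same operation is a single instruction, which is why it can beat the magic-multiply lookup.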

See also FunctionWrappers.jl.

As to why you’d need both a bit-lookup-based ECS and an archetype-based ECS: bit lookup is only fast with simple precomputed behaviors. If you need to look up a virtual function, it starts to get slow.

You do know that coding things in Julia is a challenge, right? For example, take a look at this: Comparison tables for various Julia packages.

Julia beats everyone else hands-down. If you’re up to the challenge, you’re competing with Unity, Flecs, Bevy, etc. You decide if you want to or not. Good luck.

I’m surprised by your capability. Good luck with whatever you want to do further.

Give me just 3 days and you will be surprised.

I actually have a remedy for this: not writing the functions in the first place. I just have to find another way. I’m drafting a solution; nothing is sure yet.

As far as I’m concerned, compiling optimized Julia code introduces a function (even if an anonymous one) that stays in memory and does not get garbage-collected. I’m not sure how to bypass that, but dynamically compiling fused updates is probably the fastest way.

Fusing functions is non-trivial and doesn’t follow any specific rules.
The only way to do it would be to add a lot of annotations describing exactly how to fuse them, and I don’t think game devs will like that.
Instead, we can just attach game logic to components or something like that, then inline it (more or less similar to a fusion).

Also, there is a difference between a theoretical limitation and a practical one. In theory, compiled functions leak memory; in practice, it may take a really long time for that to become visible. Plus, one way to mitigate it is simply to limit how many functions get created: just don’t abuse multiple dispatch for system updates. If systems always take the same argument types, then for 10k systems (a very unlikely situation) we have just 10k functions, which seems fine.

What I thought of is this: Imagine you had an update that says “for every entity on fire, -5 HP” and then another that says “for every entity poisoned, -5 HP” and so on, what you could do is that you could list all the updates a specific archetype would go through each round and fuse them. The fusion rule would be as if you applied each update one after another.
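The idea described above can be sketched as composing the per-status update functions into a single function, so the compiler sees one loop body and each entity is loaded once. A minimal illustration with hypothetical function names, not a real API:

```julia
# Sketch of update fusion: compose per-status updates so both run back to back.
fire_update(hp)   = hp - 5      # "for every entity on fire, -5 HP"
poison_update(hp) = hp - 5      # "for every entity poisoned, -5 HP"

# Fuse by applying each update in sequence; splatting into a tuple keeps the
# chain type-stable so the compiler can inline the whole thing.
fuse(fs...) = hp -> foldl((acc, f) -> f(acc), fs; init = hp)

const fire_and_poison = fuse(fire_update, poison_update)

fire_and_poison(100)   # -> 90, as if each update ran one after another
```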

Oh…
If it’s just that, it’s pretty simple.
I thought you wanted fused expressions; I was already planning a full package just for that. :sweat_smile:

Wow, that was fast. I think I solved all your problems.
1 - I added pext support to Arceus, plus a software fallback and the option to use magic bitboards instead.
2 - Actually, I solved that a long time ago (10 days ago). My ECS is not archetypal, it’s columnar. Data are stored as in a database; it sacrifices memory for more performance. Technically, in such an ECS, every entity possesses ALL the components, so instead of making a system for every status, you just have:

using ReactiveECS

@component Health begin
    hp::Int
end

# By default, the value is OnFire(0,0)
@component OnFire begin
    damage::Int
    duration::Int
end

# By default, the value is Poisoned(0,0)
@component Poisoned begin
    damage::Int
    duration::Int
end

const Status = [:OnFire, :Poisoned]

@system StatusSystem

function ReactiveECS.run!(world, sys::StatusSystem, ref::WeakRef)
    E = world[sys]
    health = E.health
    hps = health.hp
    indices::Vector{Int} = ref.value
    for status in Status
        status_struct = getproperty(E, status)
        damages = status_struct.damage
        masks = status_struct.duration .> 0
        for i in indices
            hps[i] -= damages[i] * masks[i]
        end
    end
end

And voilà, one function updating all statuses. But I need some little handlers.

3 - Don’t count on this one. My potato computer supports neither CUDA nor XLA. If I can’t test it, I can’t make it.

4 - For this one, as I said, reducing the number of functions seems like the best way.

Guess I overestimated the problem; it took me less than a day to finish.


Oh, no… that’s not what I meant. What I meant is that, given arbitrary update functions for each status, you fuse them together and let the compiler do the trick, optimizing so that each entity is loaded into registers once, and so on.

From what I see, you iterate through the entities twice. This is not even cache-optimal.

As for the XLA… your computer doesn’t need to be powerful to support XLA. It works on CPU too.

Huh?
But if the user can fuse the functions with logic, why should I make something to fuse updates? Fusing arbitrary functions for updates seems convoluted, since you need to know for which entities the fusion applies.
Also, for that example, if the problem is iterating twice over the entities, you can have a version where they are processed once:

results = Vector{Int}(undef, length(health))
for status in Status
    status_struct = getproperty(E, status)
    mask = status_struct.duration .> 0
    results .+= status_struct.damage .* mask
end

for i in indices
    hps[i] -= results[i]
end

This is totally cache-friendly: the hps can stay in cache, so it is perfectly efficient. Even just adding @inbounds makes this faster than Overseer. And if you’re using LoopVectorization, you can write


# Totally valid given the nature of my ECS
@turbo for i in eachindex(hps)
    hps[i] -= results[i]
end

You would know which entities need a fused update if the ECS were archetypal: the entities that have all the required components get the update. You seem not to understand. Imagine you had 10 components, and each entity updates based on the combination of components it has; there would be 2^10 possible updates. If there is a million entities of each kind, the most efficient approach is to compile the fused update for each archetype. So, for example, when the archetype of entities with 5 status effects is updated, you combine the five updates, each of which can run arbitrary code, so you need a compiler to do the trick. Fortunately, dynamic compilation is easy in Julia, but you do need it for maximum speed. Manually fusing 1024 updates wouldn’t be practical, and even if it were, a large system would need to dynamically find the archetypes whose entity counts exceed a threshold and update those with fused updates. Which archetypes those are can depend on circumstances: if entities are suddenly targeted with some flame, then “on fire” entities suddenly become abundant, even if normally they’re rare.
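One way to sketch the per-archetype caching idea (hypothetical names and a toy scalar update; a real implementation would dynamically compile per-archetype kernels):

```julia
# Sketch: cache one fused update per archetype bitmask. Component i's update
# applies when bit i-1 of the mask is set. Purely illustrative.
const COMPONENT_UPDATES = (hp -> hp - 5,    # on fire
                           hp -> hp - 5,    # poisoned
                           hp -> hp + 2)    # regenerating

const FUSED = Dict{UInt64, Function}()

function fused_update(mask::UInt64)
    get!(FUSED, mask) do                    # build and cache on first use
        active = [COMPONENT_UPDATES[i] for i in 1:length(COMPONENT_UPDATES)
                  if ((mask >> (i - 1)) & 1) == 1]
        hp -> foldl((acc, f) -> f(acc), active; init = hp)
    end
end

fused_update(UInt64(0b011))(100)   # fire + poison -> 90
fused_update(UInt64(0b101))(100)   # fire + regen  -> 97
```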

Another issue with your design is that it lacks the optionality of components. Ideally, you don’t pay for what you don’t use. In this design, even if your entities don’t have the necessary components, they get updated anyway.

I know… I know… my ideal is pretty hard, and I failed too.

Also, using an undef initialization there is undefined behavior. The numbers stored in an undef-initialized variable can be anything that happened to be in the memory at the time.
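For what it’s worth, the safe pattern in Julia is to allocate with a defined value before accumulating into it; a minimal illustration (not the package’s code):

```julia
# Vector{Int}(undef, n) leaves arbitrary memory in the elements, so reading
# them before a full overwrite (e.g. with .+=) is unsafe. zeros gives a
# defined all-zero start.
damage = [1, 2, 3, 4]

results = zeros(Int, 4)     # defined: every element is 0
results .+= damage          # now accumulation is safe
results                     # -> [1, 2, 3, 4]
```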


I am only saying what should be the theoretically fastest way to do things. I’m not even sure it can be done viably in practice without too much effort; that is why I call them challenges. If they sound harsh, rest assured you do not need to do all of them, but you might be underestimating the scope of the challenge if you think it’s going to be simple. Accepting the full challenge means competing with Flecs, Bevy, Unity, etc.

Though to be fair, I might not even be a good game developer. I might be too obsessed with making games run as fast as possible lol.


In fact, simply no. Looping across the entities we don’t use speeds up the calculation (it seems weird, but yes) because vectorization is inefficient on a low number of entities. The main flaw of my ECS is that an entity can’t have multiple instances of the same component. In fact, this architecture removes the need for fused updates at the root. Why make archetypes for each combination when I can just make one big loop that blindly updates everything? The relevant entities are updated; the irrelevant ones are naturally ignored. Yes, with archetypes we may need to fuse updates, but with this model it’s simply not necessary; that problem is inherently solved. And combined with Arceus, it’s another way to solve that same problem.
As for the undef initializer: with Ints it usually comes out as 0, but if it’s too much of a problem you can use zeros(Int, length(health)).
So, as I said, a columnar ECS inherently doesn’t need fused updates; it’s something that can be handled with basic logic, and vectorization ensures you still get full speed.
Actually, I’m thinking about adding a way to partition memory to simulate archetype-like behavior, but I find that a bit useless and counter-intuitive, as well as unsuitable and unpredictable.
Maybe what you need is just a comparison between this model and the others (even if it’s clear which is better), which I won’t do; my PC is so old that game engines don’t even launch on it.
Also, this ECS totally suppresses the cost of queries; it depends only on the number of systems (23us for 100 systems, and I’m thinking about optimizing this to go even lower).
This ECS already competes with Bevy, Unity, and Flecs, and in fact already surpasses them on many points; for example, check this Game of Life example.
In terms of performance, this model is far more cache-efficient. The principal bottleneck comes from the precomputed archetype index (a vector of indices to the entities matching an archetype), but even that is manageable. For processing power, in synchronous mode (tested but not yet available), my ECS can process 100k entities in just 90us on my potato hardware; trust me, Unity or Bevy would already be done for.
Even with the need to synchronize systems, it takes just 161us.

One last point: speed isn’t everything. It’s really important, but ease of use, flexibility, and reloads matter too. Imagine making a fused update for a function, then wanting to remove a system at runtime. Also, what if OnFire produces a specific animation that Poisoned doesn’t? All these conditions complicate updates.

For these, I was thinking in-place updates and events would solve all this.


Uhh… I guess you did well with your hardware, but its limitations prevent you from running tests on larger systems. Never mind. I’d like you to know that if it were as easy as zeroing the updates of the components you don’t use, or checking whether a thing requires an update, it would’ve been simple. However, there are scenarios (say, a thousand possible components and a million entities) where you’d need archetypes and so on. Things might perform differently at different scales.

Anyway, I’d encourage you to go to the Flecs discord server so you could compare the system to other systems. Nice work regardless.

Also, no, undef doesn’t default to 0. The values are actually undefined, so they could be anything. Oftentimes they’re zero because that’s what happened to be in memory, but not always.