Scalability and Efficiency of Julia for a Multi-Agent Scheduling System?

I’m working on a Ph.D. project, trying to design a multi-agent scheduling system where I need high performance, scalability and reliability.

Initially, I thought about creating a metaheuristic/Matheuristic model to solve the scheduling problem. But with a change in the project direction, I’m starting to worry if Julia will scale enough.

I have a feeling that Julia wasn’t really built for handling big asynchronous systems, and I’m wondering if anyone here has experience or thoughts about using Julia in a similar scenario?

Here are my main worries:

  1. Compile Errors: After adding new components, I hit compile errors at runtime, meaning a lot of runs are basically just to find the next compile bug in the code.
  2. Concurrent Programming: Does Julia scale well for multi-agent systems? I haven’t found much about people creating real-time, IO-heavy multi-agent systems in Julia. Any reasons I should think twice about this?
  3. Handling Mutable Data Structures: I’m also worried about managing multiple writes to mutable data structures concurrently.

Given these, I’m considering statically typed alternatives like Rust or C++, but I would prefer not switching as I really like coding in Juila.

Kind regards

Well, I try to answer your concerns:

  • regarding compile errors: I guess you need good unit tests to discover them early on. And clearly defined interfaces between your components. Both should be possible, but details depend on what you are trying to achieve (what your code looks like)
  • concurrent programming: multi-threading is not scaling very well due to the garbage collector, but in 1.10 it is already much better than in 1.9 … Depending on the size of your problem multitasking might be a better choice than multi-threading. Again, hard to tell without more info about what you are doing…
  • handling of mutable data structures: I think this is only relevant if you use multi-threading. Atomic primitives are available even on a field level (at least for 1.10). But again, there are other forms of communication that might be more efficient.

Just my two cents… You did not share enough detail to give a better answer…

2 Likes

Thank you for the reply Uwe

I am trying to model a scheduling process that is made up of multiple actors Agent. Basically there are three kinds of Agents that are able to make decisions that influence what is actually scheduled, each optimizing a different part of the scheduling process (Each agent is optimizing a known problem, multi-dimension multi-knapsack problem (scheduler), resource-constrained project scheduling problem (work planner) and a variant of the flexible job shop (worker)). Each type of Agent has a mutable struct that holds its state (simplified here):

mutable struct SchedulerAgent <: Agent
    backlog::Vector{WorkOrder}
    scheduled_work_orders::Vector{WorkOrder}
    periods::Vector{Period}
    inbox::Channel{SchedulerMessage}
end
mutable struct WorkerAgent <: Agent
    id::Int
    skill::Symbol
    availability::Vector{Availability}
    assigned::Vector{WorkOrder}
    inbox::Channel{Message}
end
mutable struct WorkPlannerAgent <: Agent
    orders::Vector{WorkOrder}
    inbox::Channel{Message}
end

The Channel{Message} allow the agents to send messges to each other where the messages are subtypes of abstract type Message end

Each of these agent the run an asynchronous function on their state like so:

for sa in scheduler_agents # Vector{SchedulerAgent}
    @async scheduler_internal!(sa::SchedulerAgent, message_channel::Channel{Message})
end

for wp in work_planner_agents # Vector{WorkPlannerAgent}
    @async work_planner_internal!(wpa::WorkPlannerAgent, message_channel::Channel{Message})
end

for worker in workers # Vector{Worker}
    @async worker_internal!(wa::WorkerAgent, message_channel::Channel{Message})
end 

The different agent operate on different time lines as shown below. The SchedulerAgent aggregates jobs into two week periods, optimizing priority against supply chain, aggregated capacity. The WorkPlannerAgent schedules all activities that make up the scheduled jobs coming from the SchedulerAgent and provides specific start and finish times (For the next six weeks). The WorkerAgent takes the activities scheduled by the WorkPlannerAgent and allocates them to specific workers in the current two week period.

The idea is that there should run in real-time with user-inputs into the Agents with 1 SchedulerAgent, 1 WorkPlannerAgent, and up to 50 WorkerAgents.

I have coded a lot in Julia and somehow Julia does not feel completely right for this. I am far from having implemented all of this with only the SchedulerAgent are currently running with the implemented message system, and the WorkPlannerAgent is so far only running in isolation.

Cycle times are becoming increasingly bothersome (mostly fixing bugs that a static type checker would catch) and I am a little worried about this kind of concurrency in Julia. Julia seems for geared for a different kind of concurrency, focused more handling large number of computations, but I do not have enough experience to tell it that matters in any practical sense.

But again, Julia may be a good option. I lack the experience to tell for myself.

Using RemoteChannel together with something like ClusterManagers and Network-Topology may allow for a good setup. (See links below to Distributed.jl)
https://docs.julialang.org/en/v1/manual/distributed-computing/#ClusterManagers

https://docs.julialang.org/en/v1/manual/distributed-computing/#Specifying-Network-Topology-(Experimental)

I wonder what kinds of bugs these are, i.e. are they only reproducible in the concurrent environment and not amenable to unit testing?

In my humble opinion all your concerns are valid, and Rust (not C++) may be the choice that gives you more safety and good compile time erros. If that will end faster or better to develop is hard to say, maybe you start with another language and conclude that the tradeoffs are not worthwhile.

You may get some hints of problems before run time with JET.jl, and of course you’ll need to handle concurrency with atomics or locks.

Note, I don’t think your code uses concurrency at all. @async is all single threaded and just switches between tasks on the single thread unless something changed or I misunderstood it. Most likely what you want is to multi-thread your work, have a scheduler agent be sending work to the work planners, they all plan it in parallel on different threads etc.

Concurrent programming is hard for many people no matter what language you use. I’ve got a lot of experience with it and don’t find it particularly difficult myself but apparently it tends to screw with people’s intuitions. I started doing multi-threading in BeOS back in about 1995 or so and that really helped me develop good work-flow for concurrency. My opinion is that all the best kinds of primitives are here in Julia for you to use, but you need to know how to use them. It could help to read some stuff on Erlang since it’s also heavily concurrent oriented.

Isn‘t that the literal definition of concurrency?

No, it isn’t. See Concurrency (computer science) - Wikipedia

Yeah I read that. But I have also heard it said many times that concurrency does not necessitate parallelism, i.e. that interleaving tasks is also concurrency. I don’t think it helps the OP to discuss this distinction in-depth, but the problems here seem to be with concurrency, and I’m not sure they’d go away when multithreading is introduced.

1 Like

Just an exercise:

julia> using Base.Threads 
       
       mutable struct AtomicA
           @atomic x::Int
       end
       
       function update!(a::AtomicA)
           @atomic a.x += 1
           return a
       end
       
       function foo2(n)
           a = AtomicA(0)
           @sync for _ in 1:n
               @spawn update!(a) 
           end
           return a.x
       end
       
       all(foo2(n) == n for n in 1:1000)
true

or

julia> using Base.Threads 
       
       mutable struct Alock
           x::Int
       end
       
       function update!(a::Alock, a_lock::ReentrantLock)
           lock(a_lock) do
               a.x += 1
           end
           return a
       end
       
       function foo(n)
           a = Alock(0)
           a_lock = ReentrantLock()
           @sync for i in 1:n
               @spawn update!(a, a_lock) 
           end
           return a.x
       end
       
       all(foo(n) == n for n in 1:1000)
true

or

julia> using Base.Threads 
       
       mutable struct Awithlock{L<:ReentrantLock}
           lock::L
           x::Int
       end
       
       function update!(a::Awithlock)
           lock(a.lock) do 
               a += 1
           end
           return a
       end
       
       function foo(n)
           a = Awithlock(0)
           @sync for i in 1:n
               @spawn update!(a)
           end
           return a.x
       end
       
       all(foo(n) == n for n in 1:1000)
true

1 Like

Here it is Concurrency is not Parallelism. Anyway, terminologies aside, I‘d very much like to learn about any progress on the OP‘s problem.

2 Likes

Yes, I guess jargonwise perhaps concurrency and parallelism are different but related concepts. Still, from the point of view of wanting fast realtime responsiveness I think concurrent parallelism in multi threads is what’s going to get you the maximum responsiveness.

1 Like

A few years old, but I found this article helpful nonetheless: Concurrency in Julia [LWN.net]

2 Likes

I have not forgotten this discussion. I will write again when I have made a little more progress