Think of threads as a bunch of workers standing around waiting to be given tasks. When you call
```julia
@threads for i in 1:n
```
the compiler basically says: we have, let's say, 4 workers available, and we need to run this loop n times, so let's split the work into roughly `1:n÷4`, `n÷4+1:n÷2`, `n÷2+1:3n÷4`, `3n÷4+1:n` and give one chunk to each worker. They will now start working; unfortunately they don't all work at the same pace.
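As a rough sketch of that splitting (the variable names are my own, and the exact boundaries depend on the scheduler, so take this as illustration only):

```julia
# Rough illustration of how @threads might partition 1:n across 4 workers.
# These chunk boundaries are an assumption for illustration, not the exact
# ones Julia's scheduler computes.
n = 100
nworkers = 4
chunks = [(1 + (k - 1) * n ÷ nworkers):(k * n ÷ nworkers) for k in 1:nworkers]
# chunks == [1:25, 26:50, 51:75, 76:100]
```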
If each worker writes its answers into its own unique output slot, this isn't a problem; however, if they all need to place their results into a single pile, the order you end up with is effectively random.

In your toy example, `push!` places an answer at the end of the array (think of it as a queue), but now the answers are likely to arrive in the queue in a random order, depending on which worker happened to work fastest.
This explains a random ordering, but I would still think you should always get the same length.
There is a subtler problem, called a race condition, which could explain the varying length; it arises in the queue management itself. To add something to the queue, a worker needs to look up the queue length, write the new data at the memory location just beyond the old data, and then update the length. If the first worker is still writing its data and a second worker looks up the queue length before the first has had a chance to update it, the second worker will overwrite the same place in memory. That can be avoided with atomic actions (e.g. reading the value and incrementing the queue length as a single event). I am a bit surprised that `push!` would not do that natively (or is there a slightly slower thread-safe version of `push!` available somewhere?), but I have not yet ventured into threaded Julia (excluding BLAS's and the FFT's multi-threading). You can add that protection manually using locks, mutexes, semaphores etc.
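To make the race and one possible fix concrete, here is a minimal sketch (the function names are my own, and you need to start Julia with several threads, e.g. `julia -t 4`, to actually observe the race):

```julia
using Base.Threads

# Data race: push! is not thread-safe, so with multiple threads both the
# length and the contents of `results` can come out wrong.
function unsafe_collect(n)
    results = Int[]
    @threads for i in 1:n
        push!(results, i)          # concurrent push! calls can clobber each other
    end
    return results
end

# One fix: serialize the pushes with a lock, at some performance cost.
function locked_collect(n)
    results = Int[]
    lk = ReentrantLock()
    @threads for i in 1:n
        lock(lk) do
            push!(results, i)      # only one thread mutates the array at a time
        end
    end
    return results
end
```

`locked_collect` always returns all `n` values, though still in a nondeterministic order; if the order matters you would sort afterwards or preallocate the output instead.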
My personal opinion is that threads can be useful in certain cases, but they can also more easily get you into weird trouble. Message passing (`@spawn`, `Task`s and `Channel`s) is generally a safer approach, since the interactions between workers are clearly defined, as is using higher-level constructs created specifically for this kind of parallel work, like `SharedArrays`.
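A minimal sketch of that message-passing style, assuming all we want is each task's result delivered back through a `Channel` (the function name is my own):

```julia
using Base.Threads

# Each spawned task sends its result over a channel instead of mutating
# shared state directly, so no locking is needed in user code.
function channel_squares(n)
    results = Channel{Int}(n)          # buffered so put! never blocks
    @sync for i in 1:n
        Threads.@spawn put!(results, i^2)
    end
    close(results)                     # all tasks finished; no more values coming
    return collect(results)
end
```

The results still arrive in a nondeterministic order, but the channel itself is thread-safe, so the length is always correct.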
Unfortunately your toy example is so "toy" that I cannot comment on whether you can achieve that in a threaded manner. You are recursively calling a function and then trying to launch threaded actions (the `for` loop) at each level. I believe only the top level will be threaded; the inner levels will not be allowed to use threads.
- Do you only care about recursing, and the sequence of outputs doesn't matter? Then threading is OK; you just need to manage the locking around `push!`.
- Do you care about the sequence? Then see if you can determine the output size beforehand, allocate it, and ask each task to write into the correct location. That will require passing some extra information so each call to `foo!` knows where it fits into the tree.
- Do you want to reuse previous calls of the recursive function? Then some caching (memoization) mechanism is needed. I remember some Julia package that helps with that, but cannot off-hand remember the name.
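For the second bullet, a hedged sketch of the preallocation idea in the simple case where iteration `i` maps directly to slot `i` (a recursive tree would need the extra bookkeeping mentioned above; the function name is my own):

```julia
using Base.Threads

# Each iteration owns exactly one slot of the output, so the writes are
# disjoint: no lock is needed and the order is deterministic.
function squares_preallocated(n)
    out = Vector{Int}(undef, n)    # allocate the full output up front
    @threads for i in 1:n
        out[i] = i^2               # disjoint writes: no race, order preserved
    end
    return out
end
```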
If this doesn't answer your question sufficiently, maybe a slightly less toy example would help others understand the concept you need.