Question about optimal thread allocation for vector of problems of differing sizes

I have a question about multithreading. I have a vector v of pairwise thread-safe problems, say of length 16, and I have four threads.

As I understand it, if I write

Threads.@threads for i in 1:16
    do_problem(v[i])
end

this will assign thread 1 to v[1…4], thread 2 to v[5…8], etc.

My problem is that in fact problems v[1…4] are much bigger than the rest, and some of the problems further along are empty, so this assignment of threads is more or less the worst possible.

Thus my question: is there an equivalent piece of code that would instead schedule the work dynamically, so that the threads work their way along the vector of problems, each taking whichever one is next as it comes free? Initially threads one to four would work on v[1] to v[4]; when the first thread finishes (say the one on v[3]), it would move on to v[5], the next thread to come free would take v[6], and so on.

Does anyone have useful pointers / suggestions?

Thanks

Sean Matthews

You probably need task-based parallelism with Threads.@spawn.

Compare:

for i in 1:16; Threads.@spawn begin sleep(i > 8 ? 4 : 1); @show (i, Threads.threadid()) end end

Threads.@threads for i in 1:16; sleep(i > 8 ? 4 : 1); @show (i, Threads.threadid()) end

To be honest, I haven’t yet read how the task scheduler in Julia works, so maybe someone else can give more input, but @spawn seems to be the way to go.

FYI, with Transducers.jl it’s reduce(right, Map(do_problem), v; basesize=1), where basesize=1 means each task processes a single item (see Thread- and process-based parallelisms in Transducers.jl (+ some news)).
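A minimal runnable sketch of that call, assuming Transducers.jl is installed; do_problem and v here are stand-ins for the originals, and tcollect is the variant to reach for if you need every result rather than just the last one (right keeps only the final value, so the plain reduce form suits a do_problem that works by side effect):

using Transducers: Map, right, tcollect

do_problem(x) = sum(x)                        # stand-in for the real work
v = [rand(i <= 4 ? 10^6 : 10) for i in 1:16]  # four big problems, twelve tiny ones

# basesize=1 splits the input down to single items, so a free thread
# always picks up the next unprocessed problem.
reduce(right, Map(do_problem), v; basesize=1)

# Variant that keeps all 16 results, in input order:
results = tcollect(Map(do_problem), v; basesize=1)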

It’s better to wrap it in @sync, as in @sync for i in 1:16; Threads.@spawn ..., so that the code that follows runs only after all the tasks have finished.
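Applied to the original problem, that looks like the sketch below; the fetch-based variant is my assumption for the case where do_problem returns a value you want to keep:

# Side-effect version: block until every task is done.
@sync for i in 1:16
    Threads.@spawn do_problem(v[i])
end

# Result-collecting version: fetch waits on each task and keeps input order.
tasks = [Threads.@spawn do_problem(v[i]) for i in 1:16]
results = fetch.(tasks)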


@Sean_Matthews - I have been fighting the same issue. You are correct - Threads.@threads distributes the iterations in order. Threads.@spawn is closer, but has a subtlety: if one of the heavy tasks lands on the primary thread, funny things can happen on the result-collection side. I’ve only just learned about Transducers, so I’m going to take a look there.

I do have a non-registered package, ThreadPools.jl, that I’ve been using and that works pretty well. All you have to do is replace @threads with @bgthreads and you’ll get the more optimized load behavior, though you’ll lose the primary thread.
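For the loop in the original post, the swap is just this (a sketch assuming the package is installed; @bgthreads as described above):

using ThreadPools

ThreadPools.@bgthreads for i in 1:16
    do_problem(v[i])
end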

It has a couple of example demos that visually demonstrate the differences between the methods: for anything approaching a uniform workload, the existing methods are best, but for a heavily nonuniform workload, keeping the scheduler on the primary thread and the workers on background threads is a win. Example demo output (a snapshot, at 200 ms intervals, of which of 150 jobs - some of which are 10x the work - is running on each thread):

julia> Demo.run_with_outliers()


@bgthreads, Active Job Per Thread on 200ms Intervals

  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
 15  36  48  58  64  64  64  64  64 123 128 141   0   0   0   0   0   0   0   0
 20  35  47  47  75  85  94 106 106 106 106 106 106   0   0   0   0   0   0   0
  9   9   9   9   9   9  98 104 112 118 130 139   0   0   0   0   0   0   0   0
 19  32  49  57  76  89  99  99 113 121 125 125 125 125 125   0   0   0   0   0
 17  34  44  63  78  84  84  84  84  84  84  84   0   0   0   0   0   0   0   0
 21  37  43  62  66  66  66  66  66  66 131 140   0   0   0   0   0   0   0   0
 18  30  50  61  74  88  97 101 101 101 101 142 142   0   0   0   0   0   0   0


@threads, Active Job Per Thread on 200ms Intervals

  3   7   9   9   9   9   9   9   9  12  15  17   0   0   0   0   0   0   0   0
 22  24  27  29  32  36  37   0   0   0   0   0   0   0   0   0   0   0   0   0
 40  42  44  47  48  50  53  56   0   0   0   0   0   0   0   0   0   0   0   0
 60  64  64  64  64  64  64  66  66  66  66  66  66  68  71  74  76   0   0   0
 81  83  84  84  84  84  84  84  85  88  93  94   0   0   0   0   0   0   0   0
 98  99  99 101 101 101 101 103 106 106 106 106 106 106 110 114   0   0   0   0
118 122 125 125 125 125 125 126 128 130 132   0   0   0   0   0   0   0   0   0
135 139 141 142 145 149   0   0   0   0   0   0   0   0   0   0   0   0   0   0

Speed increase using Threads.@threads (ideal 14.3%): -12.6%

I still need to improve the API for the “active stack” case, though - it works well in my use case but needs wrapping in a simpler interface. And if I look at Transducers and decide it gets close, I’ll just do PRs in that direction.


Interesting package - I think I have a use case for it in my project. Originally I planned to communicate with the Julia process from some external app in a different language, but after getting familiar with the features from 1.3 I started thinking about writing a web-API service in Julia that will schedule calculations as threads and track their progress itself.

Cool. The package is currently in the registration-wait cycle and should come out in a couple of days. I’m adding some features I need (a result iterator, optional job logging for tuning purposes, and a simple wrapper for the active-stack case) for v0.2.0, hopefully in the next week. For my oddball use case, anyway, it has been a huge time-saver.


I don’t quite have the hang of this website yet (it has just told me that I should not post lots of separate notes, which makes sense) - anyway, just to say thanks for the feedback, all of which is useful.

Cheers

Sean Matthews

Thanks for that. Shall take a closer look.

@Sean_Matthews, here is an activity log excerpt from my current project that behaves exactly as you describe, since I am in the middle of this right now. The columns are the threads (there are 4, with the primary being used for management), and the numbers are the task id being worked on at that moment. (I’ve annotated the headers.)

You’ll notice that some threads get stuck on a long job, while others crank through the short ones until they also hit a long one. For example, in the first time step, Thread 2 gets through two jobs (1642 and 1643), ending at 1644, while Threads 3 and 4 are stuck on 1637 and 1641, respectively. In the next time step, 1645 and 1646 are taken on by Threads 3 and 2, while Thread 4 is still working on 1641. And so on. This was achieved by just using ThreadPools.@bgthreads instead of Threads.@threads. (@fgthreads keeps the primary thread in play, and is usually faster except for really nonuniform loading like mine.)

[Time]   Th1   Th2   Th3   Th4
33.200     -   1633  1637  1641  
33.250     -   1644  1637  1641  
33.300     -   1646  1645  1641  
33.350     -   1646  1645  1641    
33.400     -   1646  1645  1641   
33.450     -   1646  1645  1641   
33.500     -   1646  1645  1641   
33.550     -   1649  1645  1641  
33.600     -   1649  1645  1653   
33.650     -   1649  1645  1653    
33.700     -   1649  1645  1653   
33.750     -   1660  1657  1653   
33.800     -   1661  1657  1664