How to parallelize this (simple) function (?)

Hello, I’m (still) beginning with Julia. I wrote a simple (and probably quite naive) function which calculate the annual numbers of frostdays (sum of days with temperature lower than 0 Celsius).

Since this calculation is “grid-point independent”, I’d like to use parallel calculation (single machine, with lots of core (96), as shown by the Sys.CPU_CORES command). What would be the best way to do it?

Right now, the input array can be either a time x 1 , time x n or time x m x n Array.

Typical size of a matrix is like 40 000 x 5000 for 2D Array and 40 000 x 500 x 500 for a 3D Array.

Here’s the 2 methods for frostdays. Any suggestion is welcome, thanks!

function frostdays(data::Array{Float64, 2}, timeV::StepRange{Date, Base.Dates.Day})
  numYears = unique(Dates.year(timeV))
  FD = Array{Int64}(size(numYears, 1), size(data, 2))
  z = 1
  for iyear = numYears[1]:numYears[end]
    fgYear = findin(Dates.year(timeV), iyear)
    FD[z, :] = sum([isless(istep, 0) for istep in data[fgYear,:]], 1)
    z = z + 1
  end
  return FD
end
function frostdays(data::Array{Float64, 3}, timeV::StepRange{Date, Base.Dates.Day})
  numYears = unique(Dates.year(timeV))
  FD = Array{Int64}(size(numYears, 1), size(data, 2), size(data, 3))
  z = 1
  for iyear = numYears[1]:numYears[end]
    fgYear = findin(Dates.year(timeV), iyear)
    FD[z, :, :] = sum([isless(istep, 0) for istep in data[fgYear,:]], 1)
    z = z + 1
  end
  return FD  
end

Confession time: I’m having trouble following what frostdays is supposed to do absent any parallelism; I find the naming and indexing a bit confusing. Probably one of the other folks here can give you better advice.

That said, it looks like that inner for loop is pretty amenable to Threads.@threads parallelization. See here for the documentation: docs. Each loop is independently writing a row of FD and reading from (but not writing to) data, so different threads doing each iteration in parallel shouldn’t step on each others’ toes.

I think that @evanfields is right that @threads will be the easiest solution here. Something like

function frostdays(data::Array{Float64, 3}, timeV::StepRange{Date, Base.Dates.Day})
  years    = Dates.year(timeV)
  numYears = unique(years)
  FD       = zeros(Int64, (length(numYears), size(data, 2), size(data, 3)))

  Threads.@threads for i in 1:length(numYears)
    idx = searchsortedfirst(years, numYears[i]):searchsortedlast(years, numYears[i])
    Base.mapreducedim!(t -> t < 0, +, view(FD, i:i, :, :), view(data, idx, :, :))
  end
  return FD
end

should give some parallelism if you run Julia with JULIA_NUM_THREADS=96 julia (or more likely JULIA_NUM_THREADS=48 because of hyperthreading). I’ve tried to reduce the allocation in the loop body because it can be a bottleneck.

I think that you would want to set julia to use the number of hardware threads on the computer, which would include the hyperthreads. Or maybe the number of threads should be even higher:

Many thanks @andreasnoack ! Here’s the benchmark for your solution, compared to my initial version.

Results are for an Array of size 51134 x 5 x 5, which is smaller than our usual Array size. This is an impressive acceleration of the function.

frostdays → initial version from post #1

julia> @benchmark frostdays(data, d)
BenchmarkTools.Trial: 
  memory estimate:  67.42 mb
  allocs estimate:  5361
  --------------
  minimum time:     329.154 ms (0.54% GC)
  median time:      329.768 ms (0.56% GC)
  mean time:        329.812 ms (0.57% GC)
  maximum time:     330.756 ms (0.58% GC)
  --------------
  samples:          16
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

frostdays2, with JULIA_NUM_THREADS=48 → updated version from @andreasnoack

julia> @benchmark frostdays2(data, d)
BenchmarkTools.Trial: 
  memory estimate:  438.34 kb
  allocs estimate:  83
  --------------
  minimum time:     2.906 ms (0.00% GC)
  median time:      3.104 ms (0.00% GC)
  mean time:        3.232 ms (3.08% GC)
  maximum time:     162.196 ms (97.48% GC)
  --------------
  samples:          1588
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

It’s a function calculating the number of days where the minimum temperature is lower than 0 Celsius. This indicator is calculated for each year. Thanks for pointing out the “threads” approach.

About the number of threads, one thing I’m not sure about is how much threads should or could I use, assuming we are about 8 to 15 users on the server? I mean, if I set the number of threads to 96 and another also set this to 96, etc… where do we end up? Should we go to a more conservative approach where each user is limited to, say, 10 threads? Right now, I’m the only one using Julia, other people uses MATLAB, Python, R and other more specialized tools written in C or Fortran. Hence, my aim is to develop some functions in Julia and hopefully show them some speed measurements :slight_smile:

tried with JULIA_NUM_THREADS=96, it’s slightly slower than with 48.

julia> @benchmark frostdays2(data, d)
BenchmarkTools.Trial: 
  memory estimate:  437.88 kb
  allocs estimate:  75
  --------------
  minimum time:     4.365 ms (0.00% GC)
  median time:      4.654 ms (0.00% GC)
  mean time:        4.992 ms (0.00% GC)
  maximum time:     29.641 ms (0.00% GC)
  --------------
  samples:          1002
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

Looking only at the effect of multiple threads, here’s some more results.

With number of threads to 1, with a bigger Array (51134,50,50)

julia> @benchmark frostdays2(data, d)
BenchmarkTools.Trial: 
  memory estimate:  3.08 mb
  allocs estimate:  324
  --------------
  minimum time:     202.471 ms (0.00% GC)
  median time:      232.875 ms (0.00% GC)
  mean time:        231.440 ms (0.03% GC)
  maximum time:     259.529 ms (0.00% GC)
  --------------
  samples:          22
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

With number of threads to 48 (same Array of course)

julia> @benchmark frostdays2(data, d)
BenchmarkTools.Trial: 
  memory estimate:  3.08 mb
  allocs estimate:  216
  --------------
  minimum time:     38.158 ms (0.00% GC)
  median time:      39.743 ms (0.00% GC)
  mean time:        39.974 ms (0.00% GC)
  maximum time:     43.231 ms (0.00% GC)
  --------------
  samples:          126
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

With number of threads to 96

julia> @benchmark frostdays2(data, d)
BenchmarkTools.Trial: 
  memory estimate:  3.07 mb
  allocs estimate:  140
  --------------
  minimum time:     23.334 ms (0.00% GC)
  median time:      25.278 ms (0.00% GC)
  mean time:        27.879 ms (0.00% GC)
  maximum time:     101.619 ms (0.00% GC)
  --------------
  samples:          180
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

Perhaps a final (open) question on the matter of using an experimental feature: What’s the stability of it? I mean, what could change between here and Julia 1.0? Only the call to threads?

Yes in the blog post I showed that heuristics are not correct. People always say to match the number of cores, but it’s clear that in many cases where your parallelism is not 100% efficient, having more processes than cores can be beneficial. On the other hand, threads are more efficient than processes, so this difference would be much lower. It’s probably not too hard to get multithreading to be nearly perfect so that # of threads == # of cores is the best, but it’s always best to benchmark. And then you have the fact that it can be highly dependent on the input, like you have shown.

Given that, I would be afraid to give you any heuristic, other than this. If you do have multiple users on the server at once, you should add together to thread usage. Not entirely because they might not all overlap, but it should definitely would not be globally optimal to have the local optimal amount per user. I think that’s about the only heuristic I can know to be true :slight_smile:

I can’t find the thread but it was mentioned here on Discourse, but here’s what I remember. It’s called experimental because there’s still a few known bugs so it’s not quite production-ready. However it’s good to start trying it out and the syntax which is available at least shouldn’t change too much. (If anything, more ways of controlling threads will be added).

1 Like