Parallelize a double nested loop using Dagger for Monte Carlo

Hello everyone,
I am trying to parallelize this Monte Carlo function, which I wrote to price an option under the Black-Scholes model, and I would like to improve its performance by parallelizing the double nested loop. Looking around, I found that Dagger.jl might be a good solution, but when I used @sync the results were only slightly faster. Am I doing something wrong?

using Random: randn
using Dagger

############## Financial Parameters ##############
S = 100.00 # initial stock price
K = 100.00 # strike price
T = 1.0 # time to maturity in years
r = 0.3 # risk-free annual rate
sigma = 0.5 # annual volatility (standard deviation)
t = 0.0 # time at which to value the option
############## MonteCarlo Parameters ##############
steps::UInt = 20 # time steps for SDE
trials::UInt = 1000000 # number of trials

function montecarlo(start_price::F, strike_price::F, time_to_maturity::F, 
    risk_free::F, standard_deviation::F, time_steps::U, trials::U, 
    time::F) where {F <: Float64, U <: UInt64}
    dt = time_to_maturity / time_steps # differential time steps
    V::Float64 = 0
    dS::Float64 = 0
    for _ in 1:trials
        dS = 0.0
        for _ in 1:time_steps
            dS += ((risk_free - standard_deviation^2 / 2) * dt + 
            standard_deviation * sqrt(dt) * randn()) # log of differential evolution
        end
        # apply call option formula
        V +=  max(exp(log(start_price) + dS) - strike_price, 0)
    end
    # calculate the mean and add time evolution 
    V = V/trials * exp(-risk_free * (time_to_maturity - time)) 
    return V
end

function parallel(start_price::F, strike_price::F, time_to_maturity::F, 
    risk_free::F, standard_deviation::F, time_steps::U, trials::U, 
    time::F) where {F <: Float64, U <: UInt64}
    dt = time_to_maturity / time_steps # differential time steps
    V::Float64 = 0
    dS::Float64 = 0
    @sync for _ in 1:trials
        dS = 0.0
        for _ in 1:time_steps
            dS = dS + ((risk_free - standard_deviation^2 / 2) * dt + 
            standard_deviation * sqrt(dt) * randn()) # log of differential evolution
        end
        # apply call option formula
        V = V + max(exp(log(start_price) + fetch(dS)) - strike_price, 0)
    end
    # calculate the mean and add time evolution 
    V = V/trials * exp(-risk_free * (time_to_maturity - time)) 
    return V
end

Results:

@btime montecarlo(S, K, T, r, sigma, steps, trials, t) #62.635 ms (1 allocation: 16 bytes)
@btime parallel(S, K, T, r, sigma, steps, trials, t)#60.742 ms (9 allocations: 368 bytes)

I have set the number of threads to 8 on a quad-core machine, but I obtain only a very small performance improvement.

I have never used Dagger, but @sync by itself does not parallelize; you need to spawn the tasks.
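To illustrate the point (a minimal sketch of my own, not code from this thread): @sync only waits for tasks created inside its block, so without @spawn the loop body simply runs sequentially on the current task.

```julia
using Base.Threads: @spawn, nthreads

# Without @spawn: the loop body runs on the current task, one iteration
# at a time; @sync has no spawned tasks to wait for.
function serial_sum(n)
    s = 0.0
    @sync for i in 1:n
        s += sin(i)  # ordinary sequential work
    end
    return s
end

# With @spawn: each chunk becomes a task that can run on another thread,
# and @sync blocks until all of them have finished.
function parallel_sum(n)
    parts = zeros(nthreads())
    @sync for (t, r) in enumerate(Iterators.partition(1:n, cld(n, nthreads())))
        @spawn parts[t] = sum(sin, r)
    end
    return sum(parts)
end
```

The chunking also avoids a data race: each task writes only its own slot of `parts`.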

One way to do it (without Dagger here) could be:

julia> using ChunkSplitters, Base.Threads

julia> function montecarlo2(start_price::F, strike_price::F, time_to_maturity::F, 
           risk_free::F, standard_deviation::F, time_steps::U, trials::U, 
           time::F) where {F <: Float64, U <: UInt64}
           dt = time_to_maturity / time_steps # differential time steps
           nchunks = Threads.nthreads()
           V_chunk = zeros(nchunks)
           @sync for (idxs, ichunk) in chunks(1:trials, nchunks)
               @spawn for _ in idxs
                   dS = 0.0
                   for _ in 1:time_steps
                       dS += ((risk_free - standard_deviation^2 / 2) * dt + 
                       standard_deviation * sqrt(dt) * randn()) # log of differential evolution
                   end
                   # apply call option formula
                   V_chunk[ichunk] +=  max(exp(log(start_price) + dS) - strike_price, 0)
               end
           end
           # calculate the mean and add time evolution 
           V = sum(V_chunk)/trials * exp(-risk_free * (time_to_maturity - time)) 
           return V
       end
montecarlo2 (generic function with 1 method)

julia> @btime montecarlo2(S, K, T, r, sigma, steps, trials, t) #62.635 ms (1 allocation: 16 bytes)
  31.803 ms (51 allocations: 5.06 KiB)
33.028447068979474

julia> @btime montecarlo(S, K, T, r, sigma, steps, trials, t) #62.635 ms (1 allocation: 16 bytes)
  106.646 ms (1 allocation: 16 bytes)
33.07541293040824
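As a sanity check (my addition; `Phi` and `black_scholes_call` are names I made up, and SpecialFunctions.jl is assumed here for `erf`), the closed-form Black-Scholes call price for these parameters agrees with both Monte Carlo estimates:

```julia
using SpecialFunctions: erf

# Standard normal CDF via the error function.
Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))

# Closed-form Black-Scholes price of a European call.
function black_scholes_call(S, K, T, r, sigma)
    d1 = (log(S / K) + (r + sigma^2 / 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * Phi(d1) - K * exp(-r * T) * Phi(d2)
end

black_scholes_call(100.0, 100.0, 1.0, 0.3, 0.5)  # ≈ 33.06, between the two estimates above
```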

Thank you very much, I only wanted a parallelized version of it; I forgot that you could create chunks with threads and do that. Maybe it is not the optimal approach, but at least it is parallelized. Again, thank you!

Never mind, solved.


If you end up needing this to work in a distributed (multi-server or multi-process) situation, please let us know, since this approach will not generalize automatically to a distributed setting (whereas doing so with Dagger is easy).
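In case it helps later, here is a rough sketch (untested at scale; `payoff_chunk` and `montecarlo_dagger` are my own names, not from this thread) of how the same chunked pattern could map onto Dagger: one `Dagger.@spawn` per chunk, with `fetch` collecting the partial sums. If workers are added via `Distributed.addprocs`, Dagger's scheduler can place those chunk tasks on other processes.

```julia
using Dagger

# Sum of undiscounted call payoffs over `n` trials. It is a pure function
# of its arguments, so Dagger is free to run it on any thread or worker.
function payoff_chunk(S, K, r, sigma, dt, steps, n)
    acc = 0.0
    for _ in 1:n
        dS = 0.0
        for _ in 1:steps
            dS += (r - sigma^2 / 2) * dt + sigma * sqrt(dt) * randn()
        end
        acc += max(S * exp(dS) - K, 0.0)  # S*exp(dS) == exp(log(S) + dS)
    end
    return acc
end

function montecarlo_dagger(S, K, T, r, sigma, steps, trials; nchunks = Threads.nthreads())
    dt = T / steps
    n = cld(trials, nchunks)
    # One Dagger task per chunk; the last chunk may be smaller (or empty)
    # when nchunks does not divide trials evenly.
    tasks = [Dagger.@spawn payoff_chunk(S, K, r, sigma, dt, steps,
                                        min(n, trials - (i - 1) * n))
             for i in 1:nchunks]
    return sum(fetch, tasks) / trials * exp(-r * T)
end
```

For a multi-process run, `payoff_chunk` would also need to be defined on every worker (e.g. with `@everywhere`).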

That would be great. At the moment, my main focus at work is not to create the best possible Monte Carlo approach, but to compare it with the time required for a TTCross algorithm to solve the same problem. It would be interesting if (big if) they allow me to scale both methods across multiple machines to see which of the two scales better and is faster. If I get that possibility, I will write to you again.
Only two final considerations:

  • Thank you for creating such a simple distributed library as Dagger.jl that even I am able to use :heart: .

  • For the multi-asset case I noticed that the parallelized version is slower than the non-parallelized version; can I post it here to see if I am doing something wrong?
