Variation in computation time for execution of the exact same code

I have a block of code (about 1000 lines) that performs electromagnetic computation. The code is executed multiple times with random variation of key input variables. Out of curiosity I set the input variables to be the same for each of 10 iterations. Using @time I looked at the computation time for each iteration. Results below. Overall these times do not surprise me given the computational burden. However, there is quite a bit of variation from case to case. I suspect this is a Windows thing. However, I have to run about 50,000 iterations and would like to try to keep times closer to the 2 to 3 second range achieved in many runs. Any suggestions for how to keep times consistently closer to the best performing instances?

Simulations started…
2.626384 seconds (36.68 M allocations: 4.539 GiB, 11.39% gc time)
=> simulation 1 complete
4.110695 seconds (36.68 M allocations: 4.539 GiB, 44.26% gc time)
=> simulation 2 complete
7.725060 seconds (36.68 M allocations: 4.539 GiB, 66.71% gc time)
=> simulation 3 complete
3.774782 seconds (36.68 M allocations: 4.539 GiB, 43.79% gc time)
=> simulation 4 complete
2.395420 seconds (36.68 M allocations: 4.539 GiB, 9.77% gc time)
=> simulation 5 complete
2.786028 seconds (36.68 M allocations: 4.539 GiB, 23.08% gc time)
=> simulation 6 complete
4.374934 seconds (36.68 M allocations: 4.539 GiB, 46.56% gc time)
=> simulation 7 complete
7.798998 seconds (36.68 M allocations: 4.539 GiB, 66.30% gc time)
=> simulation 8 complete
3.917994 seconds (36.68 M allocations: 4.539 GiB, 42.09% gc time)
=> simulation 9 complete
2.303349 seconds (36.68 M allocations: 4.539 GiB, 10.46% gc time)
=> simulation 10 complete
=> Results Saved

42.319144 seconds (366.80 M allocations: 45.883 GiB, 44.93% gc time)
=> Batch complete.

The output tells you that in the runs that take longer, a large part of the time is spent doing garbage collection. The GC triggers when it needs to, and thus not during every trial. The way to reduce the variability in timing, and very likely also to increase overall performance dramatically, is to optimize your code so that it allocates less memory.
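
As a minimal illustration (generic array names, not your actual code), compare an allocating accumulation with a preallocated, in-place one:

# Allocating style: every iteration builds a fresh temporary vector,
# which the GC later has to clean up.
function accumulate_alloc(A)
    total = zeros(size(A, 1))
    for k in 1:size(A, 2)
        total = total + A[:, k]     # A[:, k] copies the column, + allocates a new vector
    end
    return total
end

# In-place style: one buffer is reused and slices become views.
function accumulate_inplace(A)
    total = zeros(size(A, 1))
    for k in 1:size(A, 2)
        total .+= @view A[:, k]     # no copy of the column, no new vector for the sum
    end
    return total
end

The second version allocates essentially only the output vector, so repeated calls do not build up GC pressure.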

5 Likes

@baggepinnen has the answer.
I was going to chip in and ask about the effects of hyperthreading and about your code running on different CPU cores.
Can you comment on how many CPU cores your workstation has?

I realise that you have a single-threaded application here, and aren't doing any distributed computation.
It is worth discussing CPU pinning though.
For HPC it is usually better to ‘pin’ processes to run on specific cores. The OS will allocate processes to cores, and usually does a good job of allocating long running processes.
However if processes have to switch cores then the caches are invalidated and have to be repopulated.

Also, operating system processes run on the cores of your CPU, as do your interactive processes, like looking at the Julia Discourse forums while your code is running.
This is referred to as 'OS jitter', i.e. OS processes interfere with the uniform runtimes of your code.

My point being: once you have worked on the memory allocations, there is lots of fun to be had if you REALLY want that fine-grained runtime control. Though I guess that is probably not worth pursuing!

1 Like

Ok, that makes sense. Based on information from these forums and documentation I try to avoid the big performance gotchas, but I’m not a programmer and suspect there is a lot more that could be done. Below is a simplified block of my code. Unfortunately I can’t include the whole thing and this block won’t run as-is, but it illustrates the layout and process. If anyone has suggestions for ways to further improve this I would appreciate it.

In a nutshell, the code is a finite-difference time-domain (FDTD) process that follows this sequence:


  • Outer loop (layer 1) generates simulation batches (desired is 500 per batch).

    • Inner loop (layer 2) iterates over time (up to 10000 time-step iterations).

      • Innermost loop (layer 3) iterates over spatial elements (up to 2000 elements). This layer updates the voltage and current arrays; each is 3 x 2000 (3 conductors with 2000 spatial elements each).

    • At the end of each layer-2 loop, the results are appended to two overall arrays of stored data. The ultimate size of these arrays is 3 x 2000 x 10000 (3 conductors, 2000 spatial elements, and 10000 time steps).

  • At the end of each layer-1 loop, some basic post-processing is completed on the final result arrays.



Code Block
After the code block I have included some additional timing performance data.

using LinearAlgebra
using NLsolve
using DelimitedFiles
using Dates

#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
#@@
#@@  Prelim - Define global input parameters with "const"
#@@
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@


#Physics constants
const μ0 = Float64(4*π*1e-7) #permeability of free space [H/m]
const ϵ0 = Float64(8.85418782e-12) #permittivity of free space [F/m]

#User defined constants
const cc = Matrix{Float64}([-10.0 14.67;0.0 14.67; 10.0 14.67]) #conductor coordinates 1st col = x values, 2nd col = y values (average height)
const nc = Int64(3)                              #number of phases
const rw = Float64(1.708/100)                        #radii of conductors [m] = cm/100.  Program not currently set to handle shield wires.

#derived constants
const Rdc = Float64(rdcPerKm/(1000.0*conductorsPerBundle)) #[Ω/m]
const ϵg = Float64(ϵ0*ϵgr)                                 #relative permittivity of ground [F/m]  
const a = Float64(σg/ϵg)                                   #some ratio


#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
#@@
#@@  Functions to implement voltage source options for the FDTD main function  
#@@
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

#define source voltage(s) and thevenin equivalent resistance

function stepSource()
    nt = collect(Int64,(0:NDT-1))  #time step index
    U = zeros(Float64,(nc,1,NDT)) #intermediate voltage variable (starts at t=0)
    U[1,1,50:NDT] .= 300000.0 #step function at n = 50, change first element to change number of phases
    return U[:,1,1:NDT-4]  #short four to account for offset in initialization
end

function surgeSource()
    nt = collect(Int64,(0:NDT-1))  #time step index
    U = zeros(Float64,(nc,1,NDT)) #intermediate voltage variable (starts at t=0)
    riseTime = Float64(100.0e-6) #seconds
    tailTime = Float64(750.0e-6) #seconds
    A = Float64(-1.0/tailTime) #-2.05e3 #-13333.33#
    B = Float64(-1.0/riseTime) #-5.64e5 #-100000#
    amplitude = Float64(500000.0*sqrt(2.0/3.0)*2.0) #volts - 3 per unit SOV on 500kV
    U[1,1,:] .= amplitude .* (exp.(A * Δt * nt) .- exp.(B * Δt * nt))
    amplitudeCorrection = Float64(abs(amplitude)/maximum(abs.(U)))
    return U[:,1,1:NDT-4]*amplitudeCorrection  #short four to account for offset in initialization
end


#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
#@@
#@@  Function to implement FDTD method  
#@@
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

#Matrix convention for update voltages and currents: [conductor x NDZ]
#Matrix convention for stored voltages and currents: [conductor x NDZ x time]

function fdtd(Lmat,Cmat,Amat,Bmat,aw,bw,ag,bg,θ,ncl1,ncl2,ncl3,trappedChargeV,
                    enableDynamicCapacitance,enableTerminalMOV,enableTrappedCharge)

    #Initialize local arrays and variables
    VV = zeros(Float64,(nc,NDZ)) #initialize array to store update voltage values
    II = zeros(Float64,(nc,NDZ)) #initialize array to store update current values
    I1 = zeros(Float64,(nc,NDZ)) 
    I2 = zeros(Float64,(nc,NDZ)) 
    I3 = zeros(Float64,(nc,NDZ))
    I4 = zeros(Float64,(nc,NDZ))
    Vout = zeros(Float64,(nc,NDZ,NDT)) #initialize array to store all output variables
    Iout = zeros(Float64,(nc,NDZ,NDT)) #initialize array to store all output variables
    cdynout = zeros(Float64,(nc,NDZ,NDT))
    Vs = zeros(Float64,(nc,1,NDT))
    cdyn = zeros(Float64,(nc,nc,NDZ)) #initialize array to store dynamic matrix values
 
   
    #Update dynamic capacitance array based on voltage results from the previous iteration
    function dynamicCapacitance(VV,Vpast,cdyn,Cmat,n)
        
        for k = 1:NDZ #iterate over spatial steps (generally around 500 to 1000 steps)

            for i = 1:nc #iterate over number of conductors (generally nc=3)
                pcmtemp[i,i] = updatePCM(VV[i,k],Ecp,GMRb,avgHeight,α,1,n)  #note cgeo is not needed for switching transients as Tc is neglected (see Li,Malik,Zhao)
            end
            cdyn[:,:,k] = inv(pcmtemp)
        end
        return cdyn
    end
    
    
    #called by previous function:  function used for dynamic capacitance calculation to determine charge
    function updatePCM(wireV,Ec,rgeo,h,α,caseCall,n)#,cgeo,Tc,t) #commented out information are variables for the steep front terms

        #use nsolve to find an iterative solution to the nonlinear equation for volts vs current
        function f!(F, rCorona)
            F[1] = (rgeo + wireV/(α*Ec))/(1 + (2*h-rCorona[1])/(2*h)*log((2*h-rCorona[1])/rCorona[1])) - rCorona[1]    
        end
        rCorona = nlsolve(f!,[rgeo]).zero[1]
        
        #calculate charge, equivalent radius, and updated potential coefficient matrix
        ρtotal = (2.0*π*ϵ0*α*Ec*rCorona*(2*h-rCorona)/(2.0*h))
        GMReq = 2.0*h/(exp(2.0*π*ϵ0*wireV/ρtotal) + 1.0)
        pcmii = 1/(2.0*π*ϵ0)*log(2.0*h/GMReq)

        return pcmii  #return updated potential coefficient matrix self term
    end    
    
    
    #voltage update equation and calculation of dynamic capacitance
    function update_V(VV,II,cdyn)
        for k = 2:(NDZ-1)  #iterate over spatial steps (between about 500 and 1000 steps)
            #voltage update equation
            VV[:,k] = (((Δz/Δt)*cdyn[:,:,k])^-1*((Δz/Δt)*cdyn[:,:,k])*VV[:,k] - 
                        ((Δz/Δt)*cdyn[:,:,k])^-1 * (II[:,k] - II[:,k-1])) # VV[i] on right side is the previous time-step.
        end
        return VV
    end


    #Update source boundary condition (first node of spatial steps)
    function update_src(VV,II,cdynNode1,Identity,Rs,n,ncl1,ncl2,ncl3)
        Cs = Float64(0.5e-6)
        Yc = ((Δz/Δt)*cdynNode1 + (Cs/Δt)*Identity)
        bkr = copy(Identity)
              
        VV[:,1] = inv(Yc + bkr*1.0/(2*Rs))*((Yc - bkr*1.0/(2*Rs))*VV[:,1] - (2*II[:,1]) 
                    + (bkr*1.0/(2*Rs))*(Vs[:,1,n+1] + Vs[:,1,n]))

        return VV
    end


    #Load boundary condition (last node of spatial steps)
    function update_load(VV,II,cdynNodeNDZ,Identity,RL,n)
        #RL = Rnominal
        if enableTerminalMOV && any(abs.(VV[:,NDZ]) .> 200000.0)
            Yx = (Δz/Δt)*cdynNodeNDZ + (1.0/RL)*I
            Yz = (Δz/Δt)*cdynNodeNDZ - (1.0/RL)*I

            #NLsolve iterative approach - includes coupling per the derived equations.  Must take that into account for stability purposes.
            function f!(F,v)
                F[1] = Yx[1,1]*v[1] + Yx[1,2]*v[2] + Yx[1,3]*v[3] - Yz[1,1]*VV[1,NDZ] - Yz[1,2]*VV[2,NDZ] - Yz[1,3]*VV[3,NDZ] - 
                        2*II[1,NDZ-1] + sign(v[1])*0.5*exp(10.0e-6*abs(v[1])) + sign(VV[1,NDZ])*0.5*exp(10.0e-6*abs(VV[1,NDZ])) - 1.0
                
                F[2] = Yx[2,1]*v[1] + Yx[2,2]*v[2] + Yx[2,3]*v[3] - Yz[2,1]*VV[1,NDZ] - Yz[2,2]*VV[2,NDZ] - Yz[2,3]*VV[3,NDZ] - 
                        2*II[2,NDZ-1] + sign(v[2])*0.5*exp(10.0e-6*abs(v[2])) + sign(VV[2,NDZ])*0.5*exp(10.0e-6*abs(VV[2,NDZ])) - 1.0
                
                F[3] = Yx[3,1]*v[1] + Yx[3,2]*v[2] + Yx[3,3]*v[3] - Yz[3,1]*VV[1,NDZ] - Yz[3,2]*VV[2,NDZ] - Yz[3,3]*VV[3,NDZ] - 
                        2*II[3,NDZ-1] + sign(v[3])*0.5*exp(10.0e-6*abs(v[3])) + sign(VV[3,NDZ])*0.5*exp(10.0e-6*abs(VV[3,NDZ])) - 1.0
                
            end
            VV[:,NDZ] = nlsolve(f!,VV[:,NDZ]).zero
        end #end of MOV branch (the non-MOV load case is omitted from this simplified excerpt)

        return VV
    end


    #current update equation (includes convolution terms because of the time-dependent impedance functions)
    function update_I(VV,II,Iout,Amat,Bmat,ZFINV,Bcoeff,n,Ψw,Ψg,aexpbw,expbw,aexpbg,expbg)
        
        for k = 1:(NDZ-1)  #iterate over spatial steps
           
            ΔI = (Iout[:,k,n-1] - Iout[:,k,n-2])
            #recursive convolution for internal frequency dependent impedance (uses Matrix Pencil coefficients)
            ∑Ψw = zeros(Float64,nc)

            for i = 1:size(aw)[1]  #iteration involves only about 5 terms
                Ψw[:,i,k] = aexpbw[i]*ΔI + expbw[i]*Ψw[:,i,k]
                ∑Ψw += Ψw[:,i,k]
            end
            
            
            #recursive convolution for external (ground) frequency dependent impedance
            ∑Ψg = zeros(Float64,nc)

            for i = 1:size(aexpbg)[3]  #iteration involves only about 3 terms
                Ψg[:,:,i,k] = aexpbg[:,:,i]*diagm(ΔI) + expbg[:,:,i] .* Ψg[:,:,i,k]
            end            
            ∑Ψg = sum(sum(Ψg[:,:,:,k],dims=3),dims=2)[:,1]
            

            #combine terms to get the update current
            I1[:,k] = ZFINV * ZP * II[:,k]
            I2[:,k] = ZFINV * Bcoeff * ∑Ψw
            I3[:,k] = ZFINV * ∑Ψg 
            I4[:,k] = ZFINV * (VV[:,k+1] - VV[:,k]) / Δz            
            
            
            II[:,k] = I1[:,k] - I2[:,k] - I3[:,k] - I4[:,k] 

        end
        return II,Ψw,Ψg
    end
    

    #outer time-stepping loop (order in which functions are called is important)
    for n = 1:NDT  #NDT is as large as 10000

        #update voltage (function iterates over all spatial steps, as many as 2000 steps)
        VV = update_V(VV,II,cdyn) #note recursive nature (VV is passed in, updated by the function, and stored again as VV)
        
        #update source node  (one spatial step only)
        VV = update_src(VV,II,cdyn[:,:,1],Identity,Rs,n,ncl1,ncl2,ncl3) 
        
        #update load (one spatial step only)
        VV = update_load(VV,II,cdyn[:,:,NDZ],Identity,RL,n) 
        
        #update current (function iterates over all spatial steps, as many as 2000 steps)
        II,Ψw,Ψg = update_I(VV,II,Iout,Amat,Bmat,ZFINV,Bcoeff,n,Ψw,Ψg,aexpbw,expbw,aexpbg,expbg) 
        
        #update capacitance (function iterates over all spatial steps, as many as 2000 steps)
        cdyn = dynamicCapacitance(VV,Vpast,cdyn,Cmat,n) 


        #Accumulate the results of each time step into the overall results matrices.
        #The largest these arrays get is (3 x 2000 x 10000) = (number of conductors x spatial steps x time steps)
        Vout[:,:,n] = VV 
        Iout[:,:,n] = II 

    end
    return Vout,Iout,cdynout,Vs
end


#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
#@@
#@@  Main function to control batch iteration and process output data from FDTD execution
#@@
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@


function main(simulations,θ,ncl1,ncl2,ncl3,
                    enableDynamicCapacitance,enableTerminalArrester,enableTrappedCharge,randomClosing)

    simulationBatch = 500 #as many as 500 batches of FDTD simulations
        
    Vout = zeros(Float64,(nc,NDZ,NDT)) #initialize array to store all output variables
    Iout = zeros(Float64,(nc,NDZ,NDT)) #initialize array to store all output variables
    cdynout = zeros(Float64,(nc,NDZ,NDT))

    for i = 1:simulationBatch
        println("Update FDTD")
        @time Vout,Iout,cdynout,Vs = fdtd(Lmat,Cmat,Amat,Bmat,aw,bw,ag,bg,θ,ncl1,ncl2,ncl3,trappedChargeV,
                                           enableDynamicCapacitance,enableTerminalArrester,enableTrappedCharge)

        #Post processing of output data for each batch
        Voutlg = copy(Vout)
        Voutlgabs = abs.(Vout)
        Voutlgmaxn = maximum(Voutlgabs,dims=3)[:,:,1]
        Voutlgmaxnnc = maximum(Voutlgmaxn,dims=1)[1,:]
        VLG[i,:] = Voutlgmaxnnc

        Voutpp = Vout - circshift(Vout,(1,0,0))
        Voutppabs = abs.(Voutpp)
        Voutppmaxn = maximum(Voutppabs,dims=3)[:,:,1]
        Voutppmaxnnc = maximum(Voutppmaxn,dims=1)[1,:]
        VPP[i,:] = Voutppmaxnnc

    end
    
    #setup automatic file naming to save out the post-processed data after each batch
    len = string("-L",Int64(kilometers))
    sims = string("-S",simulations)
    znum = string("-Z",NDZ)
    alt = string("-ALT",Int64(altitude))
    dateCode = Dates.format(Dates.now(), "yymmdd-HHMMSS")
    fileCode = string(len,sims,znum,arrest,alt)
    path="C:/Users/jleman/OneDrive - POWER Engineers, Inc/02_WSU Research/Energies Paper/"
    writedlm( string(path,dateCode,"_01_VoltageLG_",fileCode,".csv"), VLG, ',')
    writedlm( string(path,dateCode,"_02_VoltagePP_",fileCode,".csv"), VLG, ',')
    
    return VLG,VPP,Vout,Iout,Vs
end

Detailed Timing Results
This next set of time results is for:

  • 1 iteration of the outer loop (one simulation in the batch, layer 1)
  • 6 time-step iterations (inner loop, layer 2)
  • 500 spatial step iterations (innermost loop, layer 3).

Time Step Iteration 1
Update Voltage
0.001771 seconds (11.95 k allocations: 2.834 MiB)
Update Source
0.000008 seconds (29 allocations: 5.219 KiB)
Update Load
0.000012 seconds (20 allocations: 7.844 KiB)
Update Current
0.003318 seconds (46.41 k allocations: 4.576 MiB)
Update Capacitance
0.000069 seconds (500 allocations: 78.125 KiB)

Time Step Iteration 2
Update Voltage
0.002103 seconds (11.95 k allocations: 2.834 MiB)
Update Source
0.000007 seconds (29 allocations: 5.219 KiB)
Update Load
0.000013 seconds (20 allocations: 7.844 KiB)
Update Current
0.003011 seconds (46.41 k allocations: 4.576 MiB)
Update Capacitance
0.000063 seconds (500 allocations: 78.125 KiB)

Time Step Iteration 3
Update Voltage
0.001725 seconds (11.95 k allocations: 2.834 MiB)
Update Source
0.000016 seconds (29 allocations: 5.219 KiB)
Update Load
0.000009 seconds (20 allocations: 7.844 KiB)
Update Current
0.003045 seconds (46.41 k allocations: 4.576 MiB)
Update Capacitance
0.000083 seconds (500 allocations: 78.125 KiB)

(3 more iterations omitted)

Total time to complete time iterations (inner loop - layer 2):
0.752052 seconds (1.21 M allocations: 290.975 MiB, 11.06% gc time)
=> simulation 1 complete
=> Results Saved

Total time to complete outer loop iterations (outer loop - layer 1):
2.215731 seconds (3.84 M allocations: 1.079 GiB, 29.71% gc time)
=> Batch complete.

  • Observation 1: there is a comparatively large block of time (about 1.5 seconds) spent post-processing data in the outer loop layer. I'm not worried about that time. It's the time in the inner loop (layer 2) that adds up (see the final timing example below).
  • Observation 2: GC time for this example is less than 30%. However, during full simulations it is higher (see the final timing example below).


Final timing example
Here I run the same code with the following:

  • One iteration of the outer loop (one simulation in the batch, layer 1)
  • 6000 time-step iterations (inner loop, layer 2); this is the only change from the previous example
  • 500 spatial step iterations (innermost loop, layer 3).

Total time to complete time iterations (inner loop - layer 2):
244.647256 seconds (394.54 M allocations: 47.762 GiB, 85.97% gc time)
=> simulation 1 complete
=> Results Saved

Total time to complete outer loop iterations (outer loop - layer 1):
246.305489 seconds (394.55 M allocations: 48.434 GiB, 85.88% gc time)
=> Batch complete.

  • Observation 3: There are 1000 times as many time steps as in the previous timing example; however, the total compute time only increases by a factor of about 325.
  • Observation 4: % gc time increases when the time step count is closer to the maximum desired for the simulation.

I am running Windows 10 on an 8th-gen Core i7 laptop (6 cores) with 64 GB of RAM. I did try setting multiple threads (4, 6, and 12) in the kernel environment and it appeared to help some, but I am very new to the concept of hyper-threading and don't know for sure. My background is EE; Julia is something I've picked up in the last year to help with PhD research.

I see that Update Voltage and Update Current allocate the most and take more time, so you can dig deeper into those and see where the allocations come from. You could try to use views wherever possible since slicing an array allocates memory.
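
For example (a generic sketch, not your exact lines):

A = rand(3, 2000)

col = A[:, 10]         # slicing like this allocates a copy of the column
col = @view A[:, 10]   # a view: no copy, just a window into A

# @views turns every slice inside the block into a view:
@views for k in 1:size(A, 2) - 1
    A[:, k] .= A[:, k] .- A[:, k+1]   # element-wise, in place, no column copies
end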

1 Like

This is not super helpful, I know, but I suspect that you might be a programmer after all :grinning:

7 Likes

Ha!, very generous, but thanks.

You’re creating an enormous amount of temporary arrays, and creating new ones over and over, instead of updating them in place. Slicing also creates copies. Try to cut down on temporary allocations.

Throughout your code you also have a large number of type constructors that are unneeded and create visual noise (they can even make your code a bit slower).

Remove all those Matrix{Float64}, Int64 and Float64 calls; you can probably remove all of them, or almost all (I didn't check). Remove collect everywhere, etc. This won't speed up your code much, but it will make it cleaner.
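
For example, with the constants from the posted code, the wrappers can simply be dropped (same values, same types):

# the constructor just re-makes the value it was given:
Float64(1.708/100) === 1.708/100   # true

# so these are equivalent to the versions with the wrappers:
const rw = 1.708/100                               # already a Float64
const cc = [-10.0 14.67; 0.0 14.67; 10.0 14.67]    # already a Matrix{Float64}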

Getting rid of most of the temporary arrays, though, will probably help significantly.

1 Like

A quick follow-up question: I am running batches of simulations. Each simulation in the batch is taking longer than I hoped, and hyperthreading didn't help. However, each simulation in a batch is independent of the others in that batch. If I have 600 simulations, could I split them out and put 100 on each core? If so, where could I go to read about how to do this? FYI, I am using a Jupyter notebook.
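
To make the question concrete, is something along these lines the right direction? (Just an untested sketch of what I imagine; fdtd_model.jl and run_one_simulation are placeholders for my actual file and fdtd-plus-post-processing call.)

using Distributed
addprocs(6)   # one worker process per physical core

# make the model code available on every worker
@everywhere include("fdtd_model.jl")   # placeholder file defining run_one_simulation(i)

# the 600 simulations are independent, so pmap can distribute them across workers
results = pmap(run_one_simulation, 1:600)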

Thanks! Regarding type constructors, I guess I thought I needed to declare the type for type stability, but apparently that is not necessary due to the “const” designation?

Regarding arrays: I must have a major conceptual error then. I thought I was updating arrays. Apparently not. Could you give me a quick example of how I would change the code to update in place (or point me to a resource where I can read about it)?

Actually, it’s completely unrelated, and the const doesn’t have anything to do with it. For example, 3 is already an Int64, you are just taking an Int64 and passing it to the Int64 constructor. Same with Float64(1.708/100), you’re taking something that is already a Float64 and converting it to a Float64. There’s no declaration here, just construction.

This is worse: Matrix{Float64}([-10.0 14.67;0.0 14.67; 10.0 14.67]). Here you take something that is already a Matrix{Float64} and pass it to the Matrix{Float64} constructor, which will copy it and return another Matrix{Float64}, using twice the memory.

There are several different kinds, but here you have more than one thing going on

The right side of ∑Ψw += Ψw[:,i,k] creates a copy, because slices copy. Additionally, it's not updating ∑Ψw in place, but creating a new instance of it. Doing this

∑Ψw .+= Ψw[:,i,k]  # notice the dot

will update it in place, and doing this

∑Ψw .+= @view Ψw[:,i,k]

will create a view instead of a copy on the right side. And in fact, on the previous line you create another copy Ψw[:,i,k], instead of reusing a single one.

Here’s one

VV[:,k] = (((Δz/Δt)*cdyn[:,:,k])^-1*((Δz/Δt)*cdyn[:,:,k])*VV[:,k] - 
                        ((Δz/Δt)*cdyn[:,:,k])^-1 * (II[:,k] - II[:,k-1]))

Here you have a ton of temporary arrays: every slice creates a copy on every iteration of the loop, and every addition, multiplication and exponentiation creates a copy. (BTW, I wonder, do you really mean matrix^-1, or do you mean matrix.^-1?) Some of these may be fundamentally matrix operations, so you perhaps cannot just add @., but it is possible to do matrix multiplications and inverses in place.
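
To give a rough idea (a minimal, untested sketch reusing the variable names from your post, and keeping the algebra as you wrote it, i.e. M \ (M*v - ΔI)): the matrix and right-hand side can go into preallocated buffers, slices become views, and the solve can be done in place with LinearAlgebra's lu! and ldiv!:

using LinearAlgebra

# work buffers hoisted out of the loop (nc conductors per node)
M   = zeros(nc, nc)
rhs = zeros(nc)

@views for k = 2:(NDZ-1)
    M .= (Δz/Δt) .* cdyn[:, :, k]                       # scale into the buffer, no new matrix
    rhs .= M * VV[:, k] .- (II[:, k] .- II[:, k-1])     # one small temporary from M*VV[:,k]
    ldiv!(lu!(M), rhs)                                  # rhs now holds M \ rhs (M is overwritten too)
    VV[:, k] .= rhs
end

The same pattern (preallocate once, then broadcast, mul! or ldiv! into the buffers) applies to the other update functions as well.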

In general, your coding style is very ‘allocation heavy’ and Matlab-like; it takes some time to stop thinking this way, but don't give up :wink:

6 Likes

Excellent, thanks! Regarding the matrices, the variable cdyn is a stack of 3x3 matrices and I iterate through the stack (via node variable k). At each node I am inverting the 3x3 matrix. As I understand it, that means I would want to keep the inversion syntax I have, correct?

The result of the right side of the larger equation above is a [3x1] matrix. If that is the case, could I still use the VV[:,k] .= syntax (with the dot) to update VV in place? Also, can I wrap this whole statement in @views begin ... end to avoid the slice/copy issue?

If you want a complete code review, is it possible for you to post the full code somewhere? Something that a user could run. Or at least provide data to use as input for some of the functions.

1 Like

Well, then you can’t just use broadcasting, that’s true. But firstly, it’s not a good idea, numerically, to calculate the inverse at all. In stead of

A^-1 * b

you should normally use

A \ b

which has better numerical properties. You could also do this in place with ldiv!. (On top of this, you are calculating ((Δz/Δt)*cdyn[:,:,k])^-1 twice in the same expression.)

You can avoid slices by making cdyn into a vector of matrices instead of a 3D array, and II into a vector of vectors. In that case cdyn[k] would return a matrix which would not be a copy.
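
Roughly like this (a sketch using the sizes from your post):

# one nc×nc matrix per spatial node instead of one 3-D array
cdyn = [zeros(nc, nc) for _ in 1:NDZ]
II   = [zeros(nc) for _ in 1:NDZ]

A = cdyn[k]   # the k-th matrix itself, no copy (unlike cdyn[:, :, k])
b = II[k]     # likewise the current vector at node k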

And in fact, if your matrices are fixed to a 3x3 size, you can use StaticArrays instead, which will be vastly faster. Look at this example, where I implement two different methods, and call them on normal and static arrays:

foo(A, b) = A^-1 * b
bar(A, b) = A \ b

Now, benchmarking

julia> using BenchmarkTools, StaticArrays

julia> M = rand(3,3); b = rand(3);

julia> Ms = SMatrix{3,3}(M); bs = SVector{3}(b);

julia> @btime foo($M, $b);
  762.336 ns (6 allocations: 2.09 KiB)

julia> @btime bar($M, $b);
  459.127 ns (4 allocations: 416 bytes)

julia> @btime foo($Ms, $bs);
  20.089 ns (0 allocations: 0 bytes)

julia> @btime bar($Ms, $bs);
  8.571 ns (0 allocations: 0 bytes)

So, bar is faster (and better) than foo, and the speed difference between the fastest and slowest calls is almost 100 times!

6 Likes

There are so many things you can improve, that it may be a bit overwhelming. But it’s a good news/bad news sort of thing: There’s a lot of work to do to get this up to speed, but there is a tremendous potential for improvement.

4 Likes

Here’s another strange one. What is the purpose of nt? And why create U as a 3D array, but then throw away one of the dimensions? Why not just make it a 2D array in the first place? And why make the third dimension have length NDT, but only use 1:NDT-4. Just make it length NDT-4 in the first place:

Instead:

function stepSource()
    U = zeros(Float64, nc, NDT-4) #intermediate voltage variable (starts at t=0)
    U[1, 50:end] .= 300_000 #step function at n = 50, change first element to change number of phases
    return U  #short four to account for offset in initialization
end
2 Likes

The ‘nt’ variable line is just one I forgot to delete - good catch. Regarding the extra dimension of U: later in the code (in a line I deleted from this abbreviated block, but shown below), I copy the values of U to another matrix that has to be three-dimensional to play well with other calculations.

    #call one of the above functions to define a source type
    Vs[:,1,5:NDT] = stepSource()#surgeSource()#zeroSource()

I appreciated the detailed look. This has been a very helpful thread. Looks like I have plenty of homework tonight cleaning up my code per the above suggestions.

Have you tried using the profiler? It can be useful in determining where you spend most of your time. Also --track-allocations can be useful.
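
For example (assuming the main function above is your entry point):

using Profile

@profile main(args...)          # `args...` stands for whatever you normally pass to main
Profile.print(format = :flat)   # flat listing of where the time is spent

# For allocations, start Julia as
#   julia --track-allocations=user yourscript.jl
# and inspect the *.mem files written next to your source files afterwards.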

1 Like

What is the function of the $ prefix in the function calls?