Ok, that makes sense. Based on information from these forums and documentation I try to avoid the big performance gotchas, but I’m not a programmer and suspect there is a lot more that could be done. Below is a simplified block of my code. Unfortunately I can’t include the whole thing and this block won’t run asis, but it illustrates the layout and process. If anyone has suggestions for ways to further improve this I would appreciate it.
In a nut shell the code is a finite difference time domain process that follows this sequence:

Outer loop (layer 1) generates simulation batches (desired is 500 per batch)

Inner loop (layer 2) iterates over time (up to 10000 timestep iterations)
 Innermost loop (layer 3) iterates over spatial elements (up to 2000 elements)  this layer updates voltage and current arrays => each are 3 x 2000 (3 conductors with 2000 spatial elements each).

At the end of each layer 2 loop, the results are appended to two overall arrays of stored data. The ultimate size of these arrays is 3 x 2000 x 10000 (3 conductors, 2000 spatial elements, and 10000 time steps).

At the end of each layer 1 loop, some basic post processing is completed on the final result arrays.
Code Block
After the code block I have included some additional timing performance data.
using LinearAlgebra
using NLsolve
using DelimitedFiles
using Dates
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
#@@
#@@ Prelim  Define global input prameters with "const"
#@@
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
#Physics constants
const μ0 = Float64(4*π*1e7) #permeability of free space [H/m]
const ϵ0 = Float64(8.85418782e12) #permittivity of free space [F/m]
#User defined constants
const cc = Matrix{Float64}([10.0 14.67;0.0 14.67; 10.0 14.67]) #conductor coordinates 1st col = x values, 2nd col = y values (average height)
const nc = Int64(3) #number of phases
const rw = Float64(1.708/100) #radii of conductors [m] = cm/100. Program not currently set to handle shield wires.
#derived constants
const Rdc = Float64(rdcPerKm/(1000.0*conductorsPerBundle)) #[Ω/m]
const ϵg = Float64(ϵ0*ϵgr) #relative permittivity of ground [F/m]
const a = Float64(σg/ϵg) #some ratio
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
#@@
#@@ Functions to implement voltage source options for the FDTD main function
#@@
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
#define source voltage(s) and thevenin equivalent resistance
function stepSource()
nt = collect(Int64,(0:NDT1)) #time step index
U = zeros(Float64,(nc,1,NDT)) #intermediate voltage variable (starts at t=0)
U[1,1,50:NDT] .= 300000.0 #step function at n = 50, change first element to change number of phases
return U[:,1,1:NDT4] #short four to account for offset in initialization
end
function surgeSource()
nt = collect(Int64,(0:NDT1)) #time step index
U = zeros(Float64,(nc,1,NDT)) #intermediate voltage variable (starts at t=0)
riseTime = Float64(100.0e6) #seconds
tailTime = Float64(750.0e6) #seconds
A = Float64(1.0/tailTime) #2.05e3 #13333.33#
B = Float64(1.0/riseTime) #5.64e5 #100000#
amplitude = Float64(500000.0*sqrt(2.0/3.0)*2.0) #volts  3 per unit SOV on 500kV
U[1,1,:] .= amplitude .* (exp.(A * Δt * nt) . exp.(B * Δt * nt))
amplitudeCorrection = Float64(abs(amplitude)/maximum(abs.(U)))
return U[:,1,1:NDT4]*amplitudeCorrection #short four to account for offset in initialization
end
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
#@@
#@@ Function to implement FDTD method
#@@
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
#Matrix convention for update voltages and currents: [conductor x NDZ]
#Matrix convention for stored voltages and currents: [conductor x NDZ x time]
function fdtd(Lmat,Cmat,Amat,Bmat,aw,bw,ag,bg,θ,ncl1,ncl2,ncl3,trappedChargeV,
enableDynamicCapacitance,enableTerminalMOV,enableTrappedCharge)
#Initialize local arrays and variables
VV = zeros(Float64,(nc,NDZ)) #initialize array to store update voltage values
II = zeros(Float64,(nc,NDZ)) #initialize array to store update current values
I1 = zeros(Float64,(nc,NDZ))
I2 = zeros(Float64,(nc,NDZ))
I3 = zeros(Float64,(nc,NDZ))
I4 = zeros(Float64,(nc,NDZ))
Vout = zeros(Float64,(nc,NDZ,NDT)) #initialize array to store all output variables
Iout = zeros(Float64,(nc,NDZ,NDT)) #initialize array to store all output variables
cdynout = zeros(Float64,(nc,NDZ,NDT))
Vs = zeros(Float64,(nc,1,NDT))
cdyn = zeros(Float64,(nc,nc,NDZ)) #initialize array to store dynamic matrix values
#Update dynamic capacitance array based on voltage results from the previous iteration
function dynamicCapacitance(VV,Vpast,cdyn,Cmat,n)
for k = 1:NDZ #iterate over spatial steps (generally around 500 to 1000 steps)
for i = 1:nc #iterate over number of conductors (generally nc=3)
pcmtemp[i,i] = updatePCM(VV[i,k],Ecp,GMRb,avgHeight,α,1,n) #note cgeo is not needed for switching transients as Tc is neglected (see Li,Malik,Zhao)
end
cdyn[:,:,k] = inv(pcmtemp)
end
return cdyn
end
#called by previous function: function used for dynamic capacitance calculation to determine charge
function updatePCM(wireV,Ec,rgeo,h,α,caseCall,n)#,cgeo,Tc,t) #commented out information are variables for the steep front terms
#use nsolve to find an iterative solution to the nonlinear equation for volts vs current
function f!(F, rCorona)
F[1] = (rgeo + wireV/(α*Ec))/(1 + (2*hrCorona[1])/(2*h)*log((2*hrCorona[1])/rCorona[1]))  rCorona[1]
end
rCorona = nlsolve(f!,[rgeo]).zero[1]
#calculate charge, equivalent radius, and updated potential coefficient matrix
ρtotal = (2.0*π*ϵ0*α*Ec*rCorona*(2*hrCorona)/(2.0*h))
GMReq = 2.0*h/(exp(2.0*π*ϵ0*wireV/ρtotal) + 1.0)
pcmii = 1/(2.0*π*ϵ0)*log(2.0*h/GMReq)
return pcmii #return updated potential coefficient matrix self term
end
#voltage update equation and calculation of dynamic capcitance
function update_V(VV,II,cdyn)
for k = 2:(NDZ1) #iterate over spatial steps (between about 500 and 1000 steps)
#voltage update equation
VV[:,k] = (((Δz/Δt)*cdyn[:,:,k])^1*((Δz/Δt)*cdyn[:,:,k])*VV[:,k] 
((Δz/Δt)*cdyn[:,:,k])^1 * (II[:,k]  II[:,k1])) # VV[i] on right side is the previous timestep.
end
return VV
end
#Update source boundary condition (first node of spatial steps)
function update_src(VV,II,cdynNode1,Identity,Rs,n,ncl1,ncl2,ncl3)
Cs = Float64(0.5e6)
Yc = ((Δz/Δt)*cdynNode1 + (Cs/Δt)*Identity)
bkr = copy(Identity)
VV[:,1] = inv(Yc + bkr*1.0/(2*Rs))*((Yc  bkr*1.0/(2*Rs))*VV[:,1]  (2*II[:,1])
+ (bkr*1.0/(2*Rs))*(Vs[:,1,n+1] + Vs[:,1,n]))
return VV
end
#Load boundary condition (last node of spatial steps)
function update_load(VV,II,cdynNodeNDZ,Identity,RL,n)
#RL = Rnominal
if enableTerminalMOV && any(abs.(VV[:,NDZ]) .> 200000.0)
Yx = (Δz/Δt)*cdynNodeNDZ + (1.0/RL)*I
Yz = (Δz/Δt)*cdynNodeNDZ  (1.0/RL)*I
#NLsolve iterative approach  includes coupling per the derived equations. Must take that into accoutn for stability purposes.
function f!(F,v)
F[1] = Yx[1,1]*v[1] + Yx[1,2]*v[2] + Yx[1,3]*v[3]  Yz[1,1]*VV[1,NDZ]  Yz[1,2]*VV[2,NDZ]  Yz[1,3]*VV[3,NDZ] 
2*II[1,NDZ1] + sign(v[1])*0.5*exp(10.0e6*abs(v[1])) + sign(VV[1,NDZ])*0.5*exp(10.0e6*abs(VV[1,NDZ]))  1.0
F[2] = Yx[2,1]*v[1] + Yx[2,2]*v[2] + Yx[2,3]*v[3]  Yz[2,1]*VV[1,NDZ]  Yz[2,2]*VV[2,NDZ]  Yz[2,3]*VV[3,NDZ] 
2*II[2,NDZ1] + sign(v[2])*0.5*exp(10.0e6*abs(v[2])) + sign(VV[2,NDZ])*0.5*exp(10.0e6*abs(VV[2,NDZ]))  1.0
F[3] = Yx[3,1]*v[1] + Yx[3,2]*v[2] + Yx[3,3]*v[3]  Yz[3,1]*VV[1,NDZ]  Yz[3,2]*VV[2,NDZ]  Yz[3,3]*VV[3,NDZ] 
2*II[3,NDZ1] + sign(v[3])*0.5*exp(10.0e6*abs(v[3])) + sign(VV[3,NDZ])*0.5*exp(10.0e6*abs(VV[3,NDZ]))  1.0
end
VV[:,NDZ] = nlsolve(f!,VV[:,NDZ]).zero
return VV
end
#current update equation (includes convolution term because of the impedance time function ()
function update_I(VV,II,Iout,Amat,Bmat,ZFINV,Bcoeff,n,Ψw,Ψg,aexpbw,expbw,aexpbg,expbg)
for k = 1:(NDZ1) #iterate over spacial steps
ΔI = (Iout[:,k,n1]  Iout[:,k,n2])
#recursive convolution for internal frequency dependent impedance (uses Matrix Pencil coefficients)
∑Ψw = zeros(Float64,nc)
for i = 1:size(aw)[1] #iteration involves only about 5 terms
Ψw[:,i,k] = aexpbw[i]*ΔI + expbw[i]*Ψw[:,i,k]
∑Ψw += Ψw[:,i,k]
end
#recursive convolution for external (ground) frequency dependent impedance
∑Ψg = zeros(Float64,nc)
for i = 1:size(aexpbg)[3] #iteration invovles only about 3 terms
Ψg[:,:,i,k] = aexpbg[:,:,i]*diagm(ΔI) + expbg[:,:,i] .* Ψg[:,:,i,k]
end
∑Ψg = sum(sum(Ψg[:,:,:,k],dims=3),dims=2)[:,1]
#combine terms to get the update current
I1[:,k] = ZFINV * ZP * II[:,k]
I2[:,k] = ZFINV * Bcoeff * ∑Ψw
I3[:,k] = ZFINV * ∑Ψg
I4[:,k] = ZFINV * (VV[:,k+1]  VV[:,k]) / Δz
II[:,k] = I1[:,k]  I2[:,k]  I3[:,k]  I4[:,k]
end
return II,Ψw,Ψg
end
#outer ime stepping loop (order in which functions are called is important)
for n = 1:NDT #NDT is as large as 10000
#update voltage (function iterates over all spatial steps, as many as 2000 steps)
VV = update_V(VV,II,cdyn) #note recursive nature (VV is passed in, updated by the function, and stored again as VV)
#update source node (one spatial step only)
VV = update_src(VV,II,cdyn[:,:,1],Identity,Rs,n,ncl1,ncl2,ncl3)
#update load (one spatial step only)
VV = update_load(VV,II,cdyn[:,:,NDZ],Identity,RL,n)
#update current (function iterates over all spatial steps, as many as 2000 steps)
II,Ψw,Ψg = update_I(VV,II,Iout,Amat,Bmat,ZFINV,Bcoeff,n,Ψw,Ψg,aexpbw,expbw,aexpbg,expbg)
#update capcitance (function iterates over all spatial steps, as many as 2000 steps)
cdyn = dynamicCapacitance(VV,Vpast,cdyn,Cmat,n)
#Accumulate the results of each time step into the overall results matrices.
#The largest these arrays get is (3 x 2000 x 10000) = (number of conductors x spatial steps x time steps)
Vout[:,:,n] = VV
Iout[:,:,n] = II
end
return Vout,Iout,cdynout,Vs
end
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
#@@
#@@ Main function to control batch iteration and process output data from FDTD execution
#@@
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
function main(simulations,θ,ncl1,ncl2,ncl3,
enableDynamicCapacitance,enableTerminalArrester,enableTrappedCharge,randomClosing)
simulationBatch = 500 #as many as 500 batchs of FDTD simulations
Vout = zeros(Float64,(nc,NDZ,NDT)) #initialize array to store all output variables
Iout = zeros(Float64,(nc,NDZ,NDT)) #initialize array to store all output variables
cdynout = zeros(Float64,(nc,NDZ,NDT))
for i = 1:simulationBatch
println("Update FDTD")
@time Vout,Iout,cdynout,Vs = fdtd(Lmat,Cmat,Amat,Bmat,aw,bw,ag,bg,θ,ncl1,ncl2,ncl3,trappedChargeV,
enableDynamicCapacitance,enableTerminalArrester,enableTrappedCharge)
#Post processing of output data for each batch
Voutlg = copy(Vout)
Voutlgabs = abs.(Vout)
Voutlgmaxn = maximum(Voutlgabs,dims=3)[:,:,1]
Voutlgmaxnnc = maximum(Voutlgmaxn,dims=1)[1,:]
VLG[i,:] = Voutlgmaxnnc
Voutpp = Vout  circshift(Vout,(1,0,0))
Voutppabs = abs.(Voutpp)
Voutppmaxn = maximum(Voutppabs,dims=3)[:,:,1]
Voutppmaxnnc = maximum(Voutppmaxn,dims=1)[1,:]
VPP[i,:] = Voutppmaxnnc
end
#setup automatic file naming to save out the postprocessed data after each batch
len = string("L",Int64(kilometers))
sims = string("S",simulations)
znum = string("Z",NDZ)
alt = string("ALT",Int64(altitude))
dateCode = Dates.format(Dates.now(), "yymmddHHMMSS")
fileCode = string(len,sims,znum,arrest,alt)
path="C:/Users/jleman/OneDrive  POWER Engineers, Inc/02_WSU Research/Energies Paper/"
writedlm( string(path,dateCode,"_01_VoltageLG_",fileCode,".csv"), VLG, ',')
writedlm( string(path,dateCode,"_02_VoltagePP_",fileCode,".csv"), VLG, ',')
return VLG,VPP,Vout,Iout,Vs
end
Detailed Timing Results
This next set of time results is for:
 1 iteration of the outer loop (one simulation in the batch, layer 1)
 6 timestep iterations (inner loop, layer 2)
 500 spatial step iterations (innermost loop, layer 3).
Time Step Iteration 1
Update Voltage
0.001771 seconds (11.95 k allocations: 2.834 MiB)
Update Source
0.000008 seconds (29 allocations: 5.219 KiB)
Update Load
0.000012 seconds (20 allocations: 7.844 KiB)
Update Current
0.003318 seconds (46.41 k allocations: 4.576 MiB)
Update Capacitance
0.000069 seconds (500 allocations: 78.125 KiB)
Time Step Iteration 2
Update Voltage
0.002103 seconds (11.95 k allocations: 2.834 MiB)
Update Source
0.000007 seconds (29 allocations: 5.219 KiB)
Update Load
0.000013 seconds (20 allocations: 7.844 KiB)
Update Current
0.003011 seconds (46.41 k allocations: 4.576 MiB)
Update Capacitance
0.000063 seconds (500 allocations: 78.125 KiB)
Time Step Iteration 3
Update Voltage
0.001725 seconds (11.95 k allocations: 2.834 MiB)
Update Source
0.000016 seconds (29 allocations: 5.219 KiB)
Update Load
0.000009 seconds (20 allocations: 7.844 KiB)
Update Current
0.003045 seconds (46.41 k allocations: 4.576 MiB)
Update Capacitance
0.000083 seconds (500 allocations: 78.125 KiB)
…
3 more iterations
Total time to complete time iterations (inner loop  layer 2):
0.752052 seconds (1.21 M allocations: 290.975 MiB, 11.06% gc time)
=> simulation 1 complete
=> Results Saved
Total time to complete outer loop iterations (outer loop  layer 1):
2.215731 seconds (3.84 M allocations: 1.079 GiB, 29.71% gc time) (total time to complete outer loop
=> Batch complete.

Observation 1: there is a comparatively large block of time (1.5) spent post processing data in the outer loop layer. I’m not worried about that time. Its the time in the inner loop (layer 2) that adds up (see final timing example, below).

Observation 2: GC time for this example is less than 30%. However, during full simulations these are higher (see final timing example below).
Final timing example
Here I run the same code with the following:
 One iteration of the outer loop (one simulation in the batch, layer 1)
 6000 timestep iterations (inner loop, layer 2), this is the only change from the previous example
 500 spatial step iterations (innermost loop, layer 3).
Total time to complete time iterations (inner loop  layer 2):
244.647256 seconds (394.54 M allocations: 47.762 GiB, 85.97% gc time)
=> simulation 1 complete
=> Results Saved
Total time to complete outer loop iterations (outer loop  layer 1):
246.305489 seconds (394.55 M allocations: 48.434 GiB, 85.88% gc time)
=> Batch complete.

Observation 3: There are 1000 times as many time steps as the previous timing example, however the total time to compute only increases by a factor of 325.

Observation 4: % gc time increases when the time step count is closer to the maximum desired for the simulation.