Results update
It took me a couple of days to rewrite the old allocation-heavy code following the static array suggestions above, but I got it working. Here are the results:
Old code (allocation-heavy due to the use of base arrays and slicing):
455.657741 seconds (585.01 M allocations: 73.176 GiB, 88.12% gc time)
Revised code (using static vectors/matrices from StaticArrays.jl):
23.023161 seconds (71.79 M allocations: 3.910 GiB, 66.59% gc time)
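For anyone finding this later, here is a minimal sketch of the kind of change involved. It is not the actual code behind the timings above: the workload (summing norms of cross products of consecutive 3-vectors) and the names `total_base` and `total_static` are made up purely for illustration. It assumes StaticArrays.jl is installed.

```julia
using StaticArrays, LinearAlgebra

# Allocation-heavy style: every slice pts[:, i] copies the column
# into a new heap-allocated Vector{Float64}.
function total_base(pts::Matrix{Float64})
    acc = 0.0
    for i in 1:size(pts, 2)-1
        a = pts[:, i]        # allocates
        b = pts[:, i+1]      # allocates
        acc += norm(cross(a, b))
    end
    return acc
end

# Static style: each point is an SVector{3,Float64}, a fixed-size
# immutable value, so indexing does not allocate a temporary array.
function total_static(pts::Vector{SVector{3,Float64}})
    acc = 0.0
    for i in 1:length(pts)-1
        acc += norm(cross(pts[i], pts[i+1]))
    end
    return acc
end

# Hypothetical test data: 1000 random 3D points stored both ways.
pts  = rand(3, 1000)
spts = [SVector{3,Float64}(pts[:, i]) for i in 1:size(pts, 2)]

# Warm up (compile) both functions before measuring.
total_base(pts); total_static(spts)

println("base:   ", @allocated(total_base(pts)), " bytes allocated")
println("static: ", @allocated(total_static(spts)), " bytes allocated")
```

Both functions compute the same value; the difference is only in how many temporary arrays get created inside the loop, which is where the allocation count and GC time come from.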
The rewrite above is roughly a 20x speedup on its own (455.7 s down to 23.0 s). Combined with the improvements discussed on a related thread, that's about a 1200x improvement over my original ugly code. Thanks to all who contributed comments and suggestions!