I also ran the profiler, but I don't know how to interpret the results in a useful way. Unsurprisingly, several things seem to be going on, and I'm not sure which is the most significant.
unknown allocs
Flat Flat% Sum% Cum Cum% Name Inlined?
22 12.87% 12.87% 22 12.87% Alloc: Base.IntrusiveLinkedList{Task}
18 10.53% 23.39% 18 10.53% Alloc: Base.Threads.SpinLock
16 9.36% 32.75% 16 9.36% Alloc: Task
16 9.36% 42.11% 16 9.36% Alloc: NNlib.var"#539#540"{NNlib.var"#conv_part#538"{Array{Float32, 3}, Float32, Float32, SubArray{Float32, 5, Array{Float32, 5}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}, SubArray{Float32, 5, Array{Float32, 5}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}, SubArray{Float32, 5, Array{Float32, 5}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}, NNlib.DenseConvDims{3, 3, 3, 6, 3}, Int64, Int64, Int64}, UnitRange{Int64}, Int64}
16 9.36% 51.46% 16 9.36% Alloc: Base.GenericCondition{Base.Threads.SpinLock}
12 7.02% 58.48% 12 7.02% Alloc: Memory{Float32}
10 5.85% 64.33% 10 5.85% Alloc: Matrix{Float32}
10 5.85% 70.18% 10 5.85% Alloc: Array{Float32, 4}
8 4.68% 74.85% 8 4.68% Alloc: Array{Float32, 5}
7 4.09% 78.95% 7 4.09% Alloc: Profile.Allocs.BufferType
4 2.34% 81.29% 4 2.34% Alloc: Vector{UInt64}
4 2.34% 83.63% 4 2.34% Alloc: Memory{UInt64}
4 2.34% 85.96% 4 2.34% Alloc: Memory{Any}
4 2.34% 88.30% 4 2.34% Alloc: BitVector
3 1.75% 90.06% 3 1.75% Alloc: Tuple{Int64, Int64, Int64}
2 1.17% 91.23% 2 1.17% Alloc: Vector{Int64}
2 1.17% 92.40% 2 1.17% Alloc: Vector{Any}
2 1.17% 93.57% 2 1.17% Alloc: ReentrantLock
2 1.17% 94.74% 2 1.17% Alloc: Memory{Int64}
2 1.17% 95.91% 2 1.17% Alloc: InvalidStateException
2 1.17% 97.08% 2 1.17% Alloc: Channel{Any}
2 1.17% 98.25% 2 1.17% Alloc: Array{Float32, 3}
1 0.58% 98.83% 1 0.58% Alloc: NTuple{6, Int64}
1 0.58% 99.42% 1 0.58% Alloc: NNlib.PoolDims{3, 3, 3, 6, 3}
1 0.58% 100.00% 1 0.58% Alloc: @NamedTuple{alpha::Int64, beta::Int64}
0 0.00% 100.00% 171 100.00% with_logstate
0 0.00% 100.00% 171 100.00% with_logger_and_io_to_logs
0 0.00% 100.00% 171 100.00% with_logger
0 0.00% 100.00% 171 100.00% with_io_to_logs
0 0.00% 100.00% 4 2.34% vect
0 0.00% 100.00% 2 1.17% sync_end(::Channel{Any})
0 0.00% 100.00% 171 100.00% start_task
0 0.00% 100.00% 31 18.13% similar
0 0.00% 100.00% 171 100.00% run_inside_trycatch
0 0.00% 100.00% 171 100.00% run_expression
0 0.00% 100.00% 18 10.53% reshape
0 0.00% 100.00% 4 2.34% put_buffered(::Channel{Any}, ::Task)
0 0.00% 100.00% 4 2.34% put!
0 0.00% 100.00% 4 2.34% push!
0 0.00% 100.00% 6 3.51% permutedims!
0 0.00% 100.00% 17 9.94% permutedims
0 0.00% 100.00% 19 11.11% new_as_memoryref
0 0.00% 100.00% 6 3.51% meanpool_direct!
0 0.00% 100.00% 8 4.68% meanpool!
0 0.00% 100.00% 11 6.43% meanpool
0 0.00% 100.00% 171 100.00% maybe_record_alloc_to_profile
0 0.00% 100.00% 171 100.00% macro expansion
0 0.00% 100.00% 171 100.00% jl_toplevel_eval_flex
0 0.00% 100.00% 171 100.00% jl_interpret_toplevel_thunk
0 0.00% 100.00% 36 21.05% jl_gc_alloc_
0 0.00% 100.00% 171 100.00% jl_f_invokelatest
0 0.00% 100.00% 171 100.00% jl_f__apply_iterate
0 0.00% 100.00% 171 100.00% jl_apply
0 0.00% 100.00% 27 15.79% jl_alloc_genericmemory_unchecked
0 0.00% 100.00% 12 7.02% isperm
0 0.00% 100.00% 8 4.68% insert_singleton_spatial_dimension
0 0.00% 100.00% 171 100.00% ijl_toplevel_eval_in
0 0.00% 100.00% 171 100.00% ijl_toplevel_eval
0 0.00% 100.00% 16 9.36% ijl_new_task
0 0.00% 100.00% 128 74.85% ijl_gc_small_alloc
0 0.00% 100.00% 7 4.09% ijl_gc_managed_malloc
0 0.00% 100.00% 12 7.02% falses
0 0.00% 100.00% 171 100.00% eval_value
0 0.00% 100.00% 171 100.00% eval_stmt_value
0 0.00% 100.00% 171 100.00% eval_body
0 0.00% 100.00% 171 100.00% eval(::Module, ::Any)
0 0.00% 100.00% 171 100.00% do_call
0 0.00% 100.00% 106 61.99% conv_im2col!
0 0.00% 100.00% 106 61.99% conv_group
0 0.00% 100.00% 112 65.50% conv!
0 0.00% 100.00% 118 69.01% conv
0 0.00% 100.00% 171 100.00% compute
0 0.00% 100.00% 2 1.17% close
0 0.00% 100.00% 6 3.51% checkdims_perm
0 0.00% 100.00% 4 2.34% array_new_memory
0 0.00% 100.00% 12 7.02% _isperm
0 0.00% 100.00% 4 2.34% _growend!
0 0.00% 100.00% 171 100.00% _applychain
0 0.00% 100.00% 48 28.07% _Task
0 0.00% 100.00% 171 100.00% [unknown function]
0 0.00% 100.00% 4 2.34% Val
0 0.00% 100.00% 80 46.78% Task
0 0.00% 100.00% 18 10.53% SpinLock
0 0.00% 100.00% 6 3.51% ReentrantLock
0 0.00% 100.00% 11 6.43% MeanPool
0 0.00% 100.00% 22 12.87% IntrusiveLinkedList
0 0.00% 100.00% 29 16.96% GenericMemory
0 0.00% 100.00% 40 23.39% GenericCondition
0 0.00% 100.00% 21 12.28% Dense
0 0.00% 100.00% 118 69.01% Conv
0 0.00% 100.00% 14 8.19% Channel
0 0.00% 100.00% 171 100.00% Chain
0 0.00% 100.00% 12 7.02% BitArray
0 0.00% 100.00% 45 26.32% Array
0 0.00% 100.00% 11 6.43% *
0 0.00% 100.00% 4 2.34% (::Base.var"#_growend!##0#_growend!##1"{Vector{Any}, Int64, Int64, Int64, Int64, Int64, Memory{Any}, MemoryRef{Any}})()
0 0.00% 100.00% 171 100.00% #with_logger_and_io_to_logs#121
0 0.00% 100.00% 171 100.00% #with_io_to_logs#125
0 0.00% 100.00% 171 100.00% #run_expression#28
0 0.00% 100.00% 6 3.51% #meanpool_direct!#564
0 0.00% 100.00% 11 6.43% #meanpool#377
0 0.00% 100.00% 8 4.68% #meanpool!#361
0 0.00% 100.00% 6 3.51% #meanpool!#346
0 0.00% 100.00% 10 5.85% #init_cvn##2
0 0.00% 100.00% 11 6.43% #init_cvn##0
0 0.00% 100.00% 171 100.00% #handle##0
0 0.00% 100.00% 100 58.48% #conv_im2col!#536
0 0.00% 100.00% 118 69.01% #conv#124
0 0.00% 100.00% 106 61.99% #conv!#181
0 0.00% 100.00% 112 65.50% #conv!#143
0 0.00% 100.00% 171 100.00% #36
0 0.00% 100.00% 171 100.00% #32
0 0.00% 100.00% 171 100.00% #123
0 0.00% 100.00% 171 100.00% ##function_wrapped_cell#632
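
The Flat column above appears to count sampled allocations rather than bytes, so many small objects (for example the Task, SpinLock and GenericCondition entries) can rank above a few large arrays. One way to check which types matter by size is to tally both counts and bytes per type directly from `Profile.Allocs`. This is only a minimal sketch: `work` is a hypothetical stand-in for the real model call, and `counts` and `bytes` are illustrative names, not part of the original setup.

```julia
using Profile

# Hypothetical stand-in workload; replace with the real model call.
work() = [rand(Float32, 8, 8) * rand(Float32, 8, 8) for _ in 1:10]

work()                      # warm up so compilation is not profiled
Profile.Allocs.clear()      # drop any previously recorded samples
Profile.Allocs.@profile sample_rate=1 work()   # record every allocation
results = Profile.Allocs.fetch()

# Tally the number of allocations and the total bytes per allocated type.
counts = Dict{Any,Int}()
bytes = Dict{Any,Int}()
for a in results.allocs
    counts[a.type] = get(counts, a.type, 0) + 1
    bytes[a.type] = get(bytes, a.type, 0) + a.size
end

# Print types ordered by total bytes, largest first.
for (T, b) in sort(collect(bytes); by = last, rev = true)
    println(rpad(b, 12), rpad(counts[T], 8), T)
end
```

If the types that dominate by bytes differ from the ones that dominate by count, the byte ranking is probably the better guide to memory use, while the count ranking says more about how often the allocator is called.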