Custom Flux layer looking weird upon profiling

I’m developing some custom Conv and Dense layers to use with Flux for my research and I’m having trouble understanding the profiler output.

I’m using LeNet from Flux examples to study my implementation. The following code is the baseline, which builds the model with Flux layers.

function build_model(; imgsize=(28,28,1), nclasses=10)
    out_conv_size = (imgsize[1]÷4 - 3, imgsize[2]÷4 - 3, 16)
    
    return Chain(
            Conv((5, 5), imgsize[end]=>6, relu),
            MaxPool((2, 2)),
            Conv((5, 5), 6=>16, relu),
            MaxPool((2, 2)),
            flatten,
            Dense(prod(out_conv_size), 120, relu), 
            Dense(120, 84, relu), 
            Dense(84, nclasses)
          )
end

model = build_model() |> gpu

@time test = eval_loss_accuracy(test_loader, model)

Profile.clear()
Profile.@profile test = eval_loss_accuracy(test_loader, model)
pprof()

This results in a profile that looks like this:

function build_custom_model(; imgsize=(28,28,1), nclasses=10)
    out_conv_size = (imgsize[1]÷4 - 3, imgsize[2]÷4 - 3, 16)
    
    return Chain(
            CustomConv((5, 5), imgsize[end]=>6, relu),
            MaxPool((2, 2)),
            CustomConv((5, 5), 6=>16, relu),
            MaxPool((2, 2)),
            flatten,
            CustomDense(prod(out_conv_size)=>120, relu), 
            CustomDense(120=>84, relu), 
            CustomDense(84=>nclasses)
          )
end

model = build_custom_model() |> gpu

@time test = eval_loss_accuracy(test_loader, model)

Profile.clear()
Profile.@profile test = eval_loss_accuracy(test_loader, model)
pprof()

This results in a profile that looks like this:

The custom layers are reasonably optimized and written with CUDA.jl kernels. I do expect a slowdown from the baseline, as my custom layers perform ~8x more allocations, but the @time calls show an ~80x slowdown on the second run (i.e., without compile overheads).
The flame graphs show something I don’t understand. What is (::CUDA.var\"#10#13\"{CuStream, Timer, CuDevice, Base.Event})()? Why are 96.3% of the sampled events in that box when profiling the model built with my custom layer? For baseline, that is only 3.45%.

This becomes significantly worse when I profile with bigger inputs. For a CustomConv layer running on an ImageNet image sized input, I observe a ~350x slowdown. Profiling the CustomConv layer with this larger input shows 99.67% of the sampled events in (::CUDA.var\"#10#13\"{CuStream, Timer, CuDevice, Base.Event})().

What is happening? I don’t think this is correct behavior. Or, am I misunderstanding something? I’d be very grateful for any insights!

Thanks!

I’m not sure what that CUDA closure is, but generally when profiling GPU code you should use a GPU profiler, like NSight: Profiling · CUDA.jl. The problem with GPU code is that its asynchronous, so from the host side (i.e., what you’re currently visualizing in your profile trace) time spent may be misaccounted.