Custom Flux layer looking weird upon profiling

AMart · May 3, 2023, 6:53pm

I’m developing some custom Conv and Dense layers to use with Flux for my research and I’m having trouble understanding the profiler output.

I’m using LeNet from Flux examples to study my implementation. The following code is the baseline, which builds the model with Flux layers.

function build_model(; imgsize=(28,28,1), nclasses=10)
    out_conv_size = (imgsize[1]÷4 - 3, imgsize[2]÷4 - 3, 16)
    
    return Chain(
            Conv((5, 5), imgsize[end]=>6, relu),
            MaxPool((2, 2)),
            Conv((5, 5), 6=>16, relu),
            MaxPool((2, 2)),
            flatten,
            Dense(prod(out_conv_size), 120, relu), 
            Dense(120, 84, relu), 
            Dense(84, nclasses)
          )
end

model = build_model() |> gpu

@time test = eval_loss_accuracy(test_loader, model)

Profile.clear()
Profile.@profile test = eval_loss_accuracy(test_loader, model)
pprof()

This results in a profile that looks like this:

function build_custom_model(; imgsize=(28,28,1), nclasses=10)
    out_conv_size = (imgsize[1]÷4 - 3, imgsize[2]÷4 - 3, 16)
    
    return Chain(
            CustomConv((5, 5), imgsize[end]=>6, relu),
            MaxPool((2, 2)),
            CustomConv((5, 5), 6=>16, relu),
            MaxPool((2, 2)),
            flatten,
            CustomDense(prod(out_conv_size)=>120, relu), 
            CustomDense(120=>84, relu), 
            CustomDense(84=>nclasses)
          )
end

model = build_custom_model() |> gpu

@time test = eval_loss_accuracy(test_loader, model)

Profile.clear()
Profile.@profile test = eval_loss_accuracy(test_loader, model)
pprof()

This results in a profile that looks like this:

The custom layers are reasonably optimized and written with CUDA.jl kernels. I do expect a slowdown from the baseline, as my custom layers perform ~8x more allocations, but the @time calls show an ~80x slowdown on the second run (i.e., without compile overheads).
The flame graphs show something I don’t understand. What is (::CUDA.var\"#10#13\"{CuStream, Timer, CuDevice, Base.Event})()? Why are 96.3% of the sampled events in that box when profiling the model built with my custom layer? For baseline, that is only 3.45%.

This becomes significantly worse when I profile with bigger inputs. For a CustomConv layer running on an ImageNet image sized input, I observe a ~350x slowdown. Profiling the CustomConv layer with this larger input shows 99.67% of the sampled events in (::CUDA.var\"#10#13\"{CuStream, Timer, CuDevice, Base.Event})().

What is happening? I don’t think this is correct behavior. Or, am I misunderstanding something? I’d be very grateful for any insights!

Thanks!

maleadt · May 3, 2023, 8:11pm

I’m not sure what that CUDA closure is, but generally when profiling GPU code you should use a GPU profiler, like NSight: Profiling · CUDA.jl. The problem with GPU code is that its asynchronous, so from the host side (i.e., what you’re currently visualizing in your profile trace) time spent may be misaccounted.

Topic		Replies	Views
FluxML Basic Custom Layer with Custom Loss Function Machine Learning flux	3	1825	March 19, 2019
Flux: `nothing` gradients with custom layer General Usage	6	295	May 30, 2023
How to make a Custom Layer work on GPU? Machine Learning question , flux	3	792	September 2, 2020
Differentiating the Custom Layer and Regularizer New to Julia flux	7	557	June 14, 2022
Flux: Locally connected layer performance Machine Learning gpu , performance , flux , tullio , loopvectorization	2	543	February 21, 2021

Custom Flux layer looking weird upon profiling

Related topics