How to efficiently evaluate a Flux.jl neural network millions of times on the GPU?

PolarizedPoutine · October 17, 2022, 6:52pm

I have trained a small Flux.jl neural network that I am embedding into a larger model but I need to use the neural network to make millions (billions?) of predictions as part of running this larger model.

Since the larger model runs on the GPU I also want to evaluate the neural network on the GPU. And even if the network is too small to saturate the GPU, I think queuing up millions of evaluations should saturate the GPU and result in a significant speedup. However, I am not sure how to do this.

I tried calling the neural network inside a CUDA kernel so that I could launch many evaluations at once, but Chain and Dense layers are not isbits, so I don’t think they can be passed to a kernel. I’m also hesitant to write a custom kernel to evaluate the chain, since I plan to try out different chains/architectures and am looking for a more generic solution.
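As a rough illustration of the isbits point, here is a minimal sketch that checks it on the CPU. The two-layer chain is a shortened stand-in for the real model, and the `weight` field name assumes a reasonably recent Flux release.

```julia
using Flux

# Shortened stand-in for the chain in this post (assumption: any Dense chain behaves the same).
C = Chain(Dense(32, 128, relu), Dense(128, 31, relu))

isbitstype(typeof(C))               # false: the layers hold heap-allocated arrays
isbitstype(typeof(C[1].weight))     # false: Matrix{Float32} is not an isbits type
isbitstype(typeof((1.0f0, 2.0f0)))  # true: a plain tuple of Float32 values is isbits
```

Kernel arguments need to be isbits (or convertible to an isbits form), which is why the chain object cannot simply be handed to a hand-written kernel.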

Unfortunately, evaluating the chain in a loop such as for _ in 1:10^4; G(y); end doesn’t queue up many CUDA kernel launches that can then be executed in parallel. It probably also doesn’t help that each evaluation of the chain on the GPU incurs quite a few CPU allocations (102 in the benchmark below).

I’d appreciate any tips for speeding up batch chain evaluations on the GPU if anyone else has tried doing something similar!

CPU benchmark

using BenchmarkTools
using CUDA
using Flux

x = ones(Float32, 32)

C = Chain(
    Dense(32, 128, relu),
    Dense(128, 128, relu),
    Dense(128, 31, relu)
)

@benchmark C(x)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  11.436 μs … 395.964 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     17.751 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   17.526 μs ±   6.871 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

         ▂▆██▆▂           ▁▃▄▆▅▄▃▁▁                             
  ▁▁▂▂▃▅███████▇▄▃▂▂▂▃▃▅▆███████████▆▆▄▄▄▄▄▃▃▃▂▂▂▂▂▂▂▁▁▁▂▁▁▁▁▁ ▄
  11.4 μs         Histogram: frequency by time         26.5 μs <

 Memory estimate: 2.62 KiB, allocs estimate: 6.

GPU benchmark

y = CUDA.ones(32)

G = gpu(C)

CUDA.@time CUDA.@sync G(y)

@benchmark CUDA.@sync G(y)

  0.000273 seconds (102 CPU allocations: 5.766 KiB) (6 GPU allocations: 2.242 KiB, 9.46% memmgmt time)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  43.629 μs …  2.288 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     50.206 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   57.249 μs ± 27.861 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▅██▇▆▄▁                                                    
  ▂▆███████▇▆▅▃▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▃▂▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁ ▃
  43.6 μs         Histogram: frequency by time        97.7 μs <

 Memory estimate: 5.77 KiB, allocs estimate: 102.

jmair · October 17, 2022, 7:16pm

You want to pass a batch into the neural network, so your input becomes y = CUDA.ones(32, N), where N is the number of inputs you want to process in parallel and each column is one input. This is the easiest way to parallelise the execution. You should get an output matrix of size 31 by N.
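A minimal sketch of the batched call, reusing the chain from the question. The batch size N = 10_000 is an arbitrary value chosen for illustration, and `gpu` is assumed to fall back to the CPU (possibly with a warning) when no working GPU backend is loaded.

```julia
using Flux
# using CUDA   # load this as well to actually run on an NVIDIA GPU

# Same architecture as in the question.
C = Chain(
    Dense(32, 128, relu),
    Dense(128, 128, relu),
    Dense(128, 31, relu)
)

N = 10_000                      # hypothetical batch size

G = gpu(C)                      # move the parameters to the GPU if one is available
Y = gpu(ones(Float32, 32, N))   # one input per column

out = G(Y)                      # a single call evaluates all N inputs
size(out)                       # (31, N)

# Column 1 of the batched output matches evaluating input 1 on its own.
out[:, 1] ≈ G(Y[:, 1])
```

Each Dense layer then performs one matrix-matrix multiply for the whole batch instead of N separate matrix-vector products, so the per-call launch and allocation overhead is paid once per batch rather than once per input.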

PolarizedPoutine · October 17, 2022, 7:32pm

Thanks for pointing this out @jmair! Can’t believe I didn’t know that you can batch evaluate this easily.

jmair · October 17, 2022, 7:36pm

You’re welcome
