How to efficiently evaluate a Flux.jl neural network millions of times on the GPU?

I have trained a small Flux.jl neural network that I am embedding into a larger model but I need to use the neural network to make millions (billions?) of predictions as part of running this larger model.

Since the larger model runs on the GPU I also want to evaluate the neural network on the GPU. And even if the network is too small to saturate the GPU, I think queuing up millions of evaluations should saturate the GPU and result in a significant speedup. However, I am not sure how to do this.

I tried calling the neural network in a CUDA kernel so that I could launch many of them but Chains and Dense layers are not isbits so I don’t think you can use a kernel here. I’m also hesitant to write a custom kernel to evaluate the chain since I plan to try out different chains/architectures so I’m looking for a more generic solution.

Unfortunately, evaluating the chain in a loop, e.g. for _ in 1:10^4; G(y); end, doesn’t queue up many CUDA kernel launches that could then execute in parallel. It probably also doesn’t help that each evaluation of the chain on the GPU incurs quite a few CPU allocations (102 in the GPU benchmark below).

I’d appreciate any tips for speeding up batch chain evaluations on the GPU if anyone else has tried doing something similar!


CPU benchmark

using BenchmarkTools
using CUDA
using Flux

x = ones(Float32, 32)

C = Chain(
    Dense(32, 128, relu),
    Dense(128, 128, relu),
    Dense(128, 31, relu)
)

@benchmark C(x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  11.436 μs … 395.964 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     17.751 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   17.526 μs ±   6.871 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

         ▂▆██▆▂           ▁▃▄▆▅▄▃▁▁
  ▁▁▂▂▃▅███████▇▄▃▂▂▂▃▃▅▆███████████▆▆▄▄▄▄▄▃▃▃▂▂▂▂▂▂▂▁▁▁▂▁▁▁▁▁ ▄
  11.4 μs         Histogram: frequency by time         26.5 μs <

 Memory estimate: 2.62 KiB, allocs estimate: 6.

GPU benchmark

y = CUDA.ones(32)

G = gpu(C)

CUDA.@time CUDA.@sync G(y)
  0.000273 seconds (102 CPU allocations: 5.766 KiB) (6 GPU allocations: 2.242 KiB, 9.46% memmgmt time)

@benchmark CUDA.@sync G(y)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  43.629 μs …  2.288 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     50.206 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   57.249 μs ± 27.861 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▅██▇▆▄▁
  ▂▆███████▇▆▅▃▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▃▂▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁ ▃
  43.6 μs         Histogram: frequency by time        97.7 μs <

 Memory estimate: 5.77 KiB, allocs estimate: 102.

You want to pass a batch to the neural network: make the input a matrix, y = CUDA.ones(32, N), where N is the number of inputs you want to process in parallel and each column is one input. This is the easiest way to parallelise the execution, since the whole batch goes through the chain in a single call instead of N separate calls. The output is a 31 × N matrix, one column per input.
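For reference, here is a minimal sketch of what that looks like with the chain from the question. The batch size N = 10_000 is just an illustrative value, the weights are freshly initialised rather than trained, and I'm assuming gpu falls back to the CPU when no working CUDA device is available, so treat this as a rough outline rather than a tuned setup.

```julia
using CUDA
using Flux

# Same architecture as in the question (random weights, not the trained ones).
C = Chain(
    Dense(32, 128, relu),
    Dense(128, 128, relu),
    Dense(128, 31, relu)
)

# Move the model to the GPU; assumed to stay on the CPU if CUDA is not functional.
G = gpu(C)

N = 10_000                      # illustrative batch size, not a recommendation
Y = gpu(ones(Float32, 32, N))   # 32 × N matrix: one input per column

out = G(Y)                      # a single forward pass over all N columns
size(out)                       # 31 × N: one output column per input
```

Column i of out should match G applied to column i of Y on its own (up to floating-point rounding), so the batched call can stand in for the loop. If billions of inputs don't fit in GPU memory at once, the same call can be repeated over chunks of columns.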

Thanks for pointing this out @jmair! Can’t believe I didn’t know that you can batch evaluate this easily.

You’re welcome :slight_smile:
