An implementation of ResNet-18 uses a lot of GPU memory

I have been trying to build ResNet-18 in Julia. I have a working version, but it is slow and uses a lot of GPU memory. This is puzzling, since a similar architecture in Tensorflow runs 5-10 times faster and uses much less memory. I also found an implementation of VGG-16 that uses much less memory than my implementation of ResNet-18. What am I doing wrong?

The implementation of VGG-16 that doesn’t use much memory: https://julialang.kr/?p=2529

My ResNet-18:

```julia
using Statistics
using CuArrays
using Zygote
using Flux, Flux.Optimise
using Metalhead, Images
using Metalhead: trainimgs
using Images.ImageCore
using Flux: onehotbatch, onecold
using Base.Iterators: partition


Metalhead.download(Metalhead.CIFAR10)
X = trainimgs(Metalhead.CIFAR10)
labels = onehotbatch([X[i].ground_truth.class for i in 1:50000],1:10)

image(x) = x.img 
ground_truth(x) = x.ground_truth
image.(X[rand(1:end, 10)])

getarray(X) = float.(permutedims(channelview(X), (2, 3, 1)))
imgs = [getarray(X[i].img) for i in 1:50000]

batch_size = 1


train = ([(cat(imgs[i]..., dims = 4), labels[:,i]) for i in partition(1:49000, batch_size)])
train_data = train |>
  x -> map(y->gpu.(y),x)
valset = 49001:50000
valX = cat(imgs[valset]..., dims = 4)
valY = labels[:, valset]


identity_layer(n) = Chain(
                              Conv((3,3), n=>n, pad = (1,1), stride = (1,1)),
                              BatchNorm(n,relu),
                              Conv((3,3), n=>n, pad = (1,1), stride = (1,1)),
                              BatchNorm(n,relu)
                              )

convolution_layer(n) = Chain(
                             Conv((3,3), n=> 2*n, pad = (1,1), stride = (2,2)),
                             BatchNorm(2*n,relu),
                             Conv((3,3), 2*n=>2*n, pad = (1,1), stride = (1,1)),
                             BatchNorm(2*n,relu)
                             )

simple_convolution(n) = Chain(
                              Conv((1,1), n=>n, pad = (1,1), stride = (2,2)),
                              BatchNorm(n,relu)
                              )


m_filter(n) = Chain(
  Conv((3,3), n=>2*n, pad = (1,1), stride = (2,2)),
  BatchNorm(2*n,relu),
) |> gpu

struct Combinator
    conv::Chain
end |> gpu
Combinator(n) = Combinator(m_filter(n))


function (op::Combinator)(x, y)
  z = op.conv(y)
  return x + z
end

n = 7

m = Chain(

  ConvTranspose((n, n), 3 => 3, stride = n),
  Conv((7,7), 3=>64, pad = (3,3), stride = (2,2)),
  BatchNorm(64,relu),
  MaxPool((3,3), pad = (1,1), stride = (2,2)),
  SkipConnection(identity_layer(64), (variable_1, variable_2) -> variable_1 + variable_2),
  SkipConnection(identity_layer(64), (variable_1, variable_2) -> variable_1 + variable_2),
  SkipConnection(convolution_layer(64), Combinator(64)),
  SkipConnection(identity_layer(128), (variable_1, variable_2) -> variable_1 + variable_2),
  SkipConnection(convolution_layer(128), Combinator(128)),
  SkipConnection(identity_layer(256), (variable_1, variable_2) -> variable_1 + variable_2),
  SkipConnection(convolution_layer(256), Combinator(256)),
  SkipConnection(identity_layer(512), (variable_1, variable_2) -> variable_1 + variable_2),
  MeanPool((7,7)),
  x -> reshape(x, :, size(x,4)),
  Dense(512, 10),
  softmax,
) |> gpu


using Flux: crossentropy, Momentum, @epochs

loss(x, y) = sum(crossentropy(m(x), y))
opt = Momentum(0.01)


@epochs 5  train!(loss, params(m), train_data, opt)
```

Hello and welcome!
You can customize the memory allocator for CuArrays. Please check:
https://juliagpu.gitlab.io/CUDA.jl/usage/memory/

Example (add these lines before ‘using’ anything, at the start of your session):

ENV["JULIA_CUDA_VERBOSE"] = true
ENV["CUARRAYS_MEMORY_POOL"] = "split"
ENV["CUARRAYS_MEMORY_LIMIT"] = 8000_000_000

using CuArrays

You can also try to use larger batch sizes during training.
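
To illustrate the batch-size suggestion, here is a minimal sketch of grouping images into larger batches with `partition`, in the same way the original post does with `batch_size = 1`. The random `imgs` array and the value `batch_size = 4` are stand-ins chosen for illustration, not data or settings from the thread:

```julia
using Base.Iterators: partition

# Hypothetical stand-in data: ten 32×32 RGB images stored as Float32 arrays.
imgs = [rand(Float32, 32, 32, 3) for _ in 1:10]
batch_size = 4  # illustrative value; the original post uses 1

# Stack each group of images along a new 4th (batch) dimension.
batches = [cat(imgs[idx]..., dims = 4)
           for idx in partition(1:length(imgs), batch_size)]

# The last batch is smaller when the count is not a multiple of batch_size.
println(size.(batches))
```

With a batch size of 1, every training step launches GPU kernels for a single image, so larger batches usually make better use of the device.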

I also had similar issues with some custom networks and playing with the ENVs above helped a lot.
I never did a comparison with Tensorflow or other frameworks, though…

I see you do some `float.` conversions. Depending on the input element type, this can produce Float64. It’s better to use Float32 data throughout your code.
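
A small sketch of the element-type point, using plain Base arrays rather than the image data from the post (the variable names are illustrative only):

```julia
# `float` picks the floating-point type from the input's element type,
# so integer data ends up as Float64.
x = float.([1, 2, 3])
println(eltype(x))         # Float64

# Converting explicitly keeps the data in Float32.
x32 = Float32.(x)
println(eltype(x32))       # Float32

# Mixing the two promotes the result back to Float64.
w = rand(Float32, 3)
println(eltype(w .* x))    # Float64
println(eltype(w .* x32))  # Float32
```

Checking `eltype` on the arrays you feed to the model is a quick way to confirm that nothing has been promoted to Float64 along the way.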

As a side note, please format your code example using backticks; it’s much easier for others to read and understand.


Thanks! I’ll give that a shot.

I think Julia doesn’t use the GPU efficiently. It looks like it spends more time moving data around than computing on it.

[Image: gpu_usage_julia_flux]

Edit: I did a test run with Tensorflow and got this result:

Epoch 10/10 time=5.79 mins: step 7800 total loss=0.8476 loss=0.4077 reg loss=0.4400 accuracy=0.7989

Batch size was 64. Much faster than Flux.

In my case, your Flux implementation takes around 7 minutes per epoch with a batch size of 64, but my GPU might not be as fast as yours. It’s quite busy, at 100%.
Does Tensorflow train in 6 minutes per epoch or in total?

Edit: are you using FP16 on RTX2070?

I’m using an RTX 2070. It took 6 minutes in total. My full Tensorflow output can be seen here: https://pastebin.com/qa1Zgft3

If I use the same GPU metrics as you have above, I get the following results:

When I use Tensorflow, my GPU metrics are as follows:

I was given a hint that this might help me: https://juliagpu.gitlab.io/CUDA.jl/development/profiling/#Application-profiling-1. I haven’t had time to try it properly.

I gave the profiler a try (it’s the first time I’ve used it).
Here are the “trimmed down” results for the forward pass on a batch of 512 images:

==11960== Profiling application: julia
==11960== Profiling result:
 Type             Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   34.19%  2.03996s       232  8.7929ms  1.4400us  127.79ms  [CUDA memcpy HtoD]
                   16.30%  972.30ms        67  14.512ms  2.0727ms  134.15ms  ptxcall_anonymous25_1
                   11.07%  660.39ms        30  22.013ms  14.045ms  27.541ms  void cudnn::detail::implicit_convolve_sgemm
                   10.03%  598.66ms        12  49.888ms  49.718ms  50.754ms  void cudnn::detail::implicit_convolve_sgemm
                    8.66%  516.77ms        60  8.6128ms  1.7418ms  55.664ms  ptxcall_anonymous25_4
                    6.30%  376.12ms        15  25.075ms  16.165ms  31.079ms  void cudnn::detail::implicit_convolve_sgemm
                    5.91%  352.84ms         5  70.568ms  49.790ms  99.667ms  void cudnn::detail::implicit_convolve_sgemm
      API calls:   38.19%  6.88944s       563  12.237ms  5.8000us  232.92ms  cuMemAlloc
                   25.29%  4.56288s       262  17.416ms  9.9000us  317.13ms  cuMemFree
                   15.71%  2.83392s         8  354.24ms  1.0000us  2.83391s  cudaStreamCreateWithFlags
                   11.00%  1.98506s       230  8.6307ms  20.200us  32.151ms  cuMemcpyHtoD
                    4.68%  844.44ms         7  120.63ms     600ns  608.43ms  cudaFree
                    3.95%  712.55ms        10  71.255ms  932.20us  123.56ms  cuModuleLoadDataEx
                    1.08%  194.43ms         1  194.43ms  194.43ms  194.43ms  cuDevicePrimaryCtxRetain

Which version of Julia, Flux, and CuArrays did you use?

Julia 1.4, Flux 0.10.3. I did not explicitly install CuArrays since I don’t need it, but the version added by Flux is 1.7.0.

I did a couple of experiments, and ConvTranspose seems to be the culprit here. Just compare the performance of these two models:

m = Chain(
  ConvTranspose((n, n), 3 => 3, stride = n),
  Conv((7,7), 3=>64, pad = (3,3), stride = (2,2)),
  MeanPool((7,7)),
  x -> reshape(x, :, size(x,4)),
  Dense(512*32, 10),
  softmax,
) |> gpu

m = Chain(
  Conv((7,7), 3=>64, pad = (3,3), stride = (2,2)),
  MeanPool((7,7)),
  x -> reshape(x, :, size(x,4)),
  Dense(256, 512*32),
  Dense(512*32, 10),
  softmax,
) |> gpu
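
The difference between the two models can be worked out by hand. The sketch below uses the standard output-size formulas for convolution and transposed convolution and assumes 32×32 CIFAR-10 inputs and a pooling stride equal to the window size. The helper names `conv_out` and `convtranspose_out` are made up for this illustration and are not part of Flux:

```julia
# Standard output-size formulas for one spatial dimension.
# (Hypothetical helpers for illustration, not Flux functions.)
conv_out(len, k; stride = 1, pad = 0) = fld(len + 2pad - k, stride) + 1
convtranspose_out(len, k; stride = 1, pad = 0) = (len - 1) * stride + k - 2pad

input = 32  # CIFAR-10 images are 32×32

# First model: ConvTranspose((7, 7), stride = 7) upsamples before the Conv.
up     = convtranspose_out(input, 7, stride = 7)  # side length after upsampling
conv_a = conv_out(up, 7, stride = 2, pad = 3)     # after Conv((7,7), stride 2)
pool_a = conv_out(conv_a, 7, stride = 7)          # after MeanPool((7,7))
feat_a = pool_a^2 * 64                            # flattened features per image

# Second model: the Conv sees the raw 32×32 image.
conv_b = conv_out(input, 7, stride = 2, pad = 3)
pool_b = conv_out(conv_b, 7, stride = 7)
feat_b = pool_b^2 * 64

println((up, conv_a, pool_a, feat_a))
println((conv_b, pool_b, feat_b))
println("pixels per image grow by a factor of ", (up ÷ input)^2)
```

Under these assumptions, the feature counts match the `Dense(512*32, 10)` and `Dense(256, 512*32)` input sizes in the two models. The upsampling step turns each 32×32 image into a 224×224 one, so every later layer works on 49 times as many pixels, which is consistent with the higher memory use and run time.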