So far in my quest to get deep learning working in Julia, I have found FastAI to have multiple dependency issues (see my previous posts), and Flux itself currently requires a downgrade of AMDGPU to v0.6.x, so I ran into issues there as well.
But progress is being made!
In the case of Lux, I only needed to downgrade from v0.8 to v0.7.6 and was able to use the GPU to train a model.* However, it is much slower than on the CPU. In the tutorial (presumably run on NVIDIA/CUDA), they report ~6 s for the first epoch and ~0.4 s for each of the remaining ones.
I ran that code in two fresh REPL sessions, commenting out only the using LuxAMDGPU line for the CPU trial. In case someone wants to try it, here is the actual script used: using Zygote, ComponentArrays, Lux, SciMLSensitivity, Optimisers, Ordinar - Pastebin.com
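For completeness, the environment setup amounted to roughly the following (a minimal sketch, not my exact command history; the version pin is just the one that resolved for me):

using Pkg
Pkg.add(name="Lux", version="0.7.6")   # Lux v0.8 did not work for me; v0.7.6 did
Pkg.add("LuxAMDGPU")                   # trigger package for the AMD/ROCm backend
# The script starts with `using LuxAMDGPU`; for the CPU trial I simply
# commented that line out and let it fall back to the CPU.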
CPU:
julia> train(NeuralODE)
┌ Warning: No functional GPU backend found! Defaulting to CPU.
│
│ 1. If no GPU is available, nothing needs to be done.
│ 2. If GPU is available, load the corresponding trigger package.
│     a. LuxCUDA.jl for NVIDIA CUDA Support!
│     b. LuxAMDGPU.jl for AMD GPU ROCM Support!
│     c. Metal.jl for Apple Metal GPU Support!
└ @ LuxDeviceUtils ~/.julia/packages/LuxDeviceUtils/Dee3d/src/LuxDeviceUtils.jl:158
[1/9] Time 1.33s Training Accuracy: 50.96% Test Accuracy: 43.33%
[2/9] Time 0.1s Training Accuracy: 69.63% Test Accuracy: 66.0%
[3/9] Time 0.08s Training Accuracy: 77.93% Test Accuracy: 71.33%
[4/9] Time 0.08s Training Accuracy: 80.74% Test Accuracy: 76.67%
[5/9] Time 0.08s Training Accuracy: 82.52% Test Accuracy: 78.0%
[6/9] Time 0.09s Training Accuracy: 84.07% Test Accuracy: 78.67%
[7/9] Time 0.08s Training Accuracy: 85.33% Test Accuracy: 80.67%
[8/9] Time 0.08s Training Accuracy: 86.59% Test Accuracy: 81.33%
[9/9] Time 0.09s Training Accuracy: 87.7% Test Accuracy: 82.0%
GPU:
julia> train(NeuralODE)
[1/9] Time 2.88s Training Accuracy: 50.96% Test Accuracy: 43.33%
[2/9] Time 1.38s Training Accuracy: 69.63% Test Accuracy: 66.0%
[3/9] Time 2.12s Training Accuracy: 77.93% Test Accuracy: 71.33%
[4/9] Time 1.87s Training Accuracy: 80.74% Test Accuracy: 76.67%
[5/9] Time 2.14s Training Accuracy: 82.52% Test Accuracy: 78.0%
[6/9] Time 2.21s Training Accuracy: 84.07% Test Accuracy: 78.67%
[7/9] Time 4.31s Training Accuracy: 85.33% Test Accuracy: 80.67%
[8/9] Time 2.83s Training Accuracy: 86.59% Test Accuracy: 81.33%
[9/9] Time 2.78s Training Accuracy: 87.7% Test Accuracy: 82.0%
Can anyone verify whether this is simply a bad model for benchmarking GPU vs. CPU?
When monitoring GPU usage, I did see that very little time was spent actually doing computations (utilization was very “spiky”). Also, when I reran the GPU model in the same Julia session, VRAM usage kept growing and it got slower and slower.
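In case it helps others reproduce this, a minimal sanity check I plan to try next (not part of the pastebin script; the matrix size n is an assumption for illustration) is timing an isolated matrix multiply on CPU vs. GPU, since at this problem size kernel-launch overhead may simply dominate:

using AMDGPU, BenchmarkTools

n = 64                                   # assumed size, on the order of the NeuralODE MLP layers
A = rand(Float32, n, n); B = rand(Float32, n, n)
dA = ROCArray(A); dB = ROCArray(B)

@btime $A * $B                           # CPU matmul
@btime begin                             # GPU matmul, including launch/sync overhead
    $dA * $dB
    AMDGPU.synchronize()
end
# Stale device arrays from previous runs are only freed on garbage collection,
# so calling GC.gc(true) between reruns might relieve the growing VRAM usage
# (an assumption on my part; I have not verified it).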
From looking at the repos, I'm starting to suspect I decided to try Julia just a few months before the ML ecosystem worked out the AMDGPU kinks.
*Also, I decided to try the 1.10 release candidate since, for some reason, it ships a newer LLVM than the 1.9.4 I got from juliaup. Here is the current versioninfo:
julia> versioninfo()
Julia Version 1.10.0-rc1
Commit 5aaa9485436 (2023-11-03 07:44 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 64 × AMD Ryzen Threadripper 2990WX 32-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, znver1)
Threads: 11 on 64 virtual cores
julia> AMDGPU.versioninfo()
ROCm provided by: system
[+] HSA Runtime v1.1.0
@ /opt/rocm-5.7.1/lib/libhsa-runtime64.so
[+] ld.lld
@ /opt/rocm/llvm/bin/ld.lld
[+] ROCm-Device-Libs
@ /home/user1/.julia/artifacts/5ad5ecb46e3c334821f54c1feecc6c152b7b6a45/amdgcn/bitcode
[+] HIP Runtime v5.7.31921
@ /opt/rocm-5.7.1/lib/libamdhip64.so
[+] rocBLAS v3.1.0
@ /opt/rocm-5.7.1/lib/librocblas.so
[+] rocSOLVER v3.23.0
@ /opt/rocm-5.7.1/lib/librocsolver.so
[+] rocALUTION
@ /opt/rocm-5.7.1/lib/librocalution.so
[+] rocSPARSE
@ /opt/rocm-5.7.1/lib/librocsparse.so.0
[+] rocRAND v2.10.5
@ /opt/rocm-5.7.1/lib/librocrand.so
[+] rocFFT v1.0.21
@ /opt/rocm-5.7.1/lib/librocfft.so
[+] MIOpen v2.20.0
@ /opt/rocm-5.7.1/lib/libMIOpen.so
HIP Devices [2]
1. HIPDevice(name="AMD Radeon VII", id=1, gcn_arch=gfx906:sramecc+:xnack-)
2. HIPDevice(name="Radeon RX 580 Series", id=2, gcn_arch=gfx803)