ArrayFire and Flux

Partly as a consequence of having an AMD GPU, the only GPU package I've been able to get working with it is ArrayFire.jl.

ArrayFire v3.7.0 (OpenCL, 64-bit Linux, build a4485443)
[0] AMD: Ellesmere, 7999 MB
-1- INTEL: AMD Ryzen 7 2700X Eight-Core Processor         , 32178 MB

Generally, the benchmarks for standard matrix operations look considerably better on the GPU. The next step for me was to try to use my GPU with Flux.jl. I took a relatively naive approach, based on my understanding of how CUDA interfaces with Flux: convert the network's arrays into ArrayFire arrays (which, I think, requires first converting them to regular, untracked arrays), which is exactly what I did:

# Untrack the parameters with Tracker.data, then convert every array leaf to an AFArray
model = mapleaves(AFArray, mapleaves(Tracker.data, Chain(
  Dense(24, 24, σ),
  Dense(24, 24),
  softmax
)))

It’s a simple, small network that I’m just using for benchmarking, but I saw similar results across a number of different sizes and numbers of layers. For this to work, I had to add AFArray methods for a few function types, which I tried to do in as simple a way as possible:

# No-op methods so mapleaves passes activation functions through unchanged
ArrayFire.AFArray(func::typeof(σ)) = σ
ArrayFire.AFArray(func::typeof(identity)) = identity
ArrayFire.AFArray(func::typeof(softmax)) = softmax

And this worked, in the sense that it evaluated without error. Unfortunately, the GPU version was much slower for a single evaluation (0.048556 s on the GPU vs. 0.000040 s on the CPU). Is there something I’m doing wrong here, or something that could be sped up?
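For reference, a rough sketch of the kind of timing comparison I mean (the BenchmarkTools setup and the model_cpu/model_gpu names are illustrative stand-ins for the untracked and AFArray versions of the chain above, not my exact script):

using BenchmarkTools

x_cpu = rand(Float32, 24)
x_gpu = AFArray(x_cpu)

@btime $model_cpu($x_cpu)
# Copy the result back to the host so the timing includes the ArrayFire
# computation actually finishing, not just being queued on the device.
@btime Array($model_gpu($x_gpu))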

1 Like

Sorry, I don’t have the answer for you. I’m a Linux (and thus an AMD) user as well; especially with their CPUs completely crushing Intel in the past few years, I feel like they deserve more love in the Julia community (given how terribly Nvidia treats the FOSS community).

Unfortunately the devs don’t have/use AMD GPUs, so progress is slow. I honestly want to crowdsource some money to buy AMD GPUs for a few interested devs to make AMDGPU.jl just work…

1 Like

I believe that good progress is being made with AMD GPUs. cc @jpsamaroo

2 Likes

Yup, see https://github.com/JuliaGPU/AMDGPU.jl/. @jpsamaroo’s work there is funded IIRC, so it’s been progressing quite quickly!

4 Likes

Are you able to upgrade to a more recent version of Flux? Tracker and mapleaves have long since been deprecated in favour of Zygote and fmap.

cf.

...
using AMDGPU

model = Chain(
  Dense(24, 24, σ),
  Dense(24, 24),
  softmax
)

m_gpu = fmap(HSAArray, model)
...

(not tested since I don’t have a compatible GPU, but given how close the CUDA and AMDGPU APIs are I suspect it’s not too far off)

Unless you’re using a batch size in the high hundreds or thousands, the cost of multiplying 24xN matrices is going to be completely outweighed by communication and data transfer overhead with the GPU. These overheads should be amortized assuming your actual network is quite a bit larger.
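To make that concrete, here is a CPU-only illustration (the sizes and the BenchmarkTools usage are just for the sketch): the arithmetic in a 24×24 layer is so small that fixed per-call costs dominate until the batch gets large.

using BenchmarkTools

W = rand(Float32, 24, 24)
x1  = rand(Float32, 24, 1)      # batch of 1
x1k = rand(Float32, 24, 1000)   # batch of 1000

@btime $W * $x1     # mostly fixed per-call overhead
@btime $W * $x1k    # far less than 1000x slower: the work amortizes

On a GPU those fixed costs (kernel launches, host-device transfers) are much larger still, which is why a tiny network can easily come out slower there.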

1 Like

The problem at the moment is that I can’t even ~~install~~ build AMDGPU, since its dependencies are even more non-trivial than CUDA’s.

1 Like

Are you able to run anything else that uses ROCm? If so, I would 100% open an issue. The maintainers are usually very responsive and equally patient :slight_smile:

2 Likes

I was wondering about this. How would I ensure that batching is working properly (I’ve tried evaluating inputs in a loop, with similar performance unfortunately)? Is that handled through ArrayFire (or whatever GPU library is in use)?

I’ll give it a try, thanks!

I definitely need to give this a try. It looks like the first release has happened since I last tried to get this package working on my computer.

Batching is done through Flux. I believe the last dimension is the batch dimension by convention.
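For example, with a chain of Dense layers you can hand the model a matrix whose columns are the individual samples (a quick sketch, not tested against your exact setup):

using Flux

m = Chain(Dense(24, 24, σ), Dense(24, 24), softmax)

x_single = rand(Float32, 24)        # one sample
x_batch  = rand(Float32, 24, 512)   # 512 samples, batched along the last dimension

y = m(x_batch)                      # size(y) == (24, 512): one output column per sample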

1 Like

Using the latest released version of Flux (from Pkg) in this way yields this error:

UndefVarError: fmap not defined

Trying to install the master branch of Flux yields this strange error (considering AMDGPU is only on 0.1.0):

Unsatisfiable requirements detected for package GPUArrays [0c68f7d7]:
 GPUArrays [0c68f7d7] log:
 ├─possible versions are: [0.3.0-0.3.4, 0.4.0-0.4.2, 0.5.0, 0.6.0-0.6.1, 0.7.0-0.7.2, 1.0.0-1.0.4, 2.0.0-2.0.1, 3.0.0-3.0.1, 3.1.0, 3.2.0, 3.3.0, 3.4.0-3.4.1, 4.0.0] or uninstalled
 ├─restricted by compatibility requirements with AMDGPU [21141c5a] to versions: 2.0.0-2.0.1
 │ └─AMDGPU [21141c5a] log:
 │   ├─possible versions are: 0.1.0 or uninstalled
 │   └─restricted to versions * by an explicit requirement, leaving only versions 0.1.0
 └─restricted by compatibility requirements with CUDA [052768ef] to versions: 4.0.0 — no versions left
   └─CUDA [052768ef] log:
     ├─possible versions are: [0.1.0, 1.0.0-1.0.2, 1.1.0] or uninstalled
     └─restricted to versions 1 by Flux [587475ba], leaving only versions [1.0.0-1.0.2, 1.1.0]
       └─Flux [587475ba] log:
         ├─possible versions are: 0.11.0 or uninstalled
         └─Flux [587475ba] is fixed to version 0.11.0

Update: I’ve installed AMDGPU.jl (along with rocblas, rocrand, rocsparse, rocalution, rocfft, and MIOpen, though I’m not sure how necessary they all are for this) per @ToucheSir and @dpsanders’ suggestions. I then interfaced it with Flux in exactly the same way as I did with ArrayFire:

# Same trick as with ArrayFire: let mapleaves pass activation functions through unchanged
AMDGPU.HSAArray(func::typeof(σ)) = σ
AMDGPU.HSAArray(func::typeof(identity)) = identity
AMDGPU.HSAArray(func::typeof(softmax)) = softmax

# Untrack the parameters, then convert every array leaf to an HSAArray
model2 = mapleaves(HSAArray, mapleaves(Tracker.data, Chain(
  Dense(24, 24, σ),
  Dense(24, 24),
  softmax
)))

For this small network, the GPU version did outperform the CPU for evaluation of the neural network, though only by a little (1.498 μs on the GPU vs. 1.788 μs on the CPU).

However, trying this with a larger network like this:

Chain(
  Dense(604, 400, σ),
  Dense(400, 300),
  Dense(300, 197),
  Dense(197, 197),
  Dense(197, 197),
  Dense(197, 110),
  Dense(110, 20),
  softmax
)

did not show a significant performance improvement.

CPU:

BenchmarkTools.Trial: 
  memory estimate:  23.84 KiB
  allocs estimate:  117
  --------------
  minimum time:     53.960 μs (0.00% GC)
  median time:      62.500 μs (0.00% GC)
  mean time:        65.319 μs (2.06% GC)
  maximum time:     3.030 ms (96.96% GC)
  --------------
  samples:          10000
  evals/sample:     1

GPU:

BenchmarkTools.Trial: 
  memory estimate:  20.89 KiB
  allocs estimate:  22
  --------------
  minimum time:     201.469 μs (0.00% GC)
  median time:      206.269 μs (0.00% GC)
  mean time:        207.926 μs (0.38% GC)
  maximum time:     2.330 ms (85.28% GC)
  --------------
  samples:          10000
  evals/sample:     1
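
For anyone wanting to reproduce the comparison, the benchmark calls looked roughly like this (the input size matches the first layer; the model_cpu/model_gpu names are placeholders for the plain and HSAArray-converted versions of the larger chain):

using BenchmarkTools

x_cpu = rand(Float32, 604)
x_gpu = HSAArray(x_cpu)

@benchmark $model_cpu($x_cpu)   # the CPU trial above
@benchmark $model_gpu($x_gpu)   # the GPU trial above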
1 Like

Instead, a significant slowdown.

It would be great if someone could test this with a CUDA GPU to compare. However, as stated, you may not get competitive performance over the CPU without using very expensive layers like convolutions.
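
Untested on my end since I don’t have an Nvidia card either, but the CUDA-side comparison would look something like this on a recent Flux (the batch size is arbitrary):

using Flux, CUDA

model = Chain(
  Dense(604, 400, σ),
  Dense(400, 300),
  Dense(300, 197),
  Dense(197, 197),
  Dense(197, 197),
  Dense(197, 110),
  Dense(110, 20),
  softmax
) |> gpu                          # move the parameters to the GPU as CuArrays

x = cu(rand(Float32, 604, 512))   # a reasonably large batch to amortize transfer/launch overhead
y = model(x)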

Alternatively, it’s equally likely that AMDGPU.jl is at fault for being slow. HSAArray isn’t really optimized for anything, and it’s slated to be merged with the ROCArray, which is supposed to be the more performant and featureful array type (we mostly just need the HSAArray for tests). Both array types currently do very bad things as well, such as silently falling back to running operations on the CPU (which CuArrays explicitly avoids, and we will too soon). Much of this will be fixed in the next few months, and I’ll have better news :slightly_smiling_face:

2 Likes