ArrayFire and Flux

Ian_Slagle · July 11, 2020, 8:19pm

I have an AMD GPU, and so far the only GPU package I can get to work with it is ArrayFire.jl.

ArrayFire v3.7.0 (OpenCL, 64-bit Linux, build a4485443)
[0] AMD: Ellesmere, 7999 MB
-1- INTEL: AMD Ryzen 7 2700X Eight-Core Processor         , 32178 MB

Generally, the benchmarks for standard matrix operations look considerably better with it. The next step for me was to try to use my GPU with Flux.jl. I took a relatively naive approach, based on my understanding of how CUDA interfaces with Flux: convert the arrays of the network into ArrayFire arrays (which, I think, requires converting them to regular, untracked arrays first). That is exactly what I did:

model = mapleaves(AFArray, mapleaves(Tracker.data, Chain(
  Dense(24, 24, σ),
  Dense(24, 24),
  softmax
)))

It’s a simple, small network that I’m just using for benchmarking, but I saw similar results across a number of different sizes and numbers of layers. In order for this to work, I had to extend the AFArray function for a few types, which I tried to do in as simple a way as possible.

ArrayFire.AFArray(func::typeof(σ)) = σ
ArrayFire.AFArray(func::typeof(identity)) = identity
ArrayFire.AFArray(func::typeof(softmax)) = softmax

And this worked, as in, it evaluated without error. Unfortunately, the GPU version performed much more slowly (0.048556 vs. 0.000040 seconds) for a single evaluation. Is there something that I’m doing wrong here or could be sped up?

jling · July 11, 2020, 10:32pm

Sorry, I don’t have the answer for you. I am a Linux user and thus an AMD user as well. Especially with their CPUs completely crushing Intel in the past few years, I feel like they deserve more love in the Julia community (given how terribly Nvidia treats the FOSS community).

Unfortunately, devs don’t have or use AMD GPUs, so progress is slow. I honestly want to crowdsource some money to buy AMD GPUs for a few interested devs to make AMDGPU just work…

dpsanders · July 12, 2020, 12:24am

I believe that good progress is being made with AMD GPUs. cc @jpsamaroo

ToucheSir · July 12, 2020, 12:27am

Yup, see https://github.com/JuliaGPU/AMDGPU.jl/. @jpsamaroo’s work there is funded IIRC, so it’s been progressing quite quickly!

ToucheSir · July 12, 2020, 12:39am

Are you able to upgrade to a more recent version of Flux? Tracker and mapleaves have long since been deprecated in favour of Zygote and fmap.

c.f.

...
using AMDGPU

model = Chain(
  Dense(24, 24, σ),
  Dense(24, 24),
  softmax
)

m_gpu = fmap(HSAArray, model)
...

(not tested since I don’t have a compatible GPU, but given how close the CUDA and AMDGPU APIs are I suspect it’s not too far off)

Unless you’re using a batch size in the high hundreds or thousands, the cost of multiplying 24xN matrices is going to be completely outweighed by communication and data transfer overhead with the GPU. These overheads should be amortized assuming your actual network is quite a bit larger.
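To put rough numbers on how little work a 24x24 layer is, here is a back-of-the-envelope sketch in plain Julia. `dense_flops` is a hypothetical helper (not part of Flux), and it only counts the matrix multiply, ignoring the bias add and the activation:

```julia
# Approximate floating-point operations for one Dense(in_dim, out_dim) layer
# applied to `batch` samples: one multiply and one add per weight per sample.
# Hypothetical helper for illustration only; not a Flux function.
dense_flops(in_dim, out_dim, batch) = 2 * in_dim * out_dim * batch

# The work grows linearly with the batch size, while any fixed per-call
# cost (kernel launch, data transfer setup) stays roughly the same.
for batch in (1, 32, 1024)
    println("batch = ", batch, ": ", dense_flops(24, 24, batch), " FLOPs")
end
```

A single sample is on the order of a thousand operations per layer, which is why fixed per-call costs can easily dominate unless the batch (or the network) is much larger.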

jling · July 12, 2020, 12:56am

The problem at the moment is that I can’t even build AMDGPU, since its dependencies are even more non-trivial than CUDA’s.

ToucheSir · July 12, 2020, 1:09am

Are you able to run anything else that uses ROCm? If so, I would 100% open an issue. The maintainers are usually very responsive and equally patient

Ian_Slagle · July 12, 2020, 1:22am

I was wondering about this. How would I ensure that the batching is working properly (I’ve tried them in a loop with similar performance unfortunately)? Is that through ArrayFire (or whatever GPU library)?

I’ll give it a try thanks!

Ian_Slagle · July 12, 2020, 3:24am

I definitely need to give this a try. It looks like the first release has happened since I last tried to get this package working on my computer.

ToucheSir · July 12, 2020, 5:34am

Batching is done through Flux. I believe the last dimension is the batch dimension by convention.
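A minimal sketch of what that convention means, using plain Julia arrays so it runs without Flux or a GPU. The `dense` function below is a hypothetical stand-in for a Dense layer with identity activation, not Flux's own implementation:

```julia
# Hypothetical stand-in for a Dense layer with identity activation: y = W*x .+ b
dense(W, b, x) = W * x .+ b

W = rand(Float32, 24, 24)
b = rand(Float32, 24)

# 8 samples with 24 features each; the last dimension indexes the samples.
X = rand(Float32, 24, 8)

# Batched: a single call handles every column (sample) at once.
Y_batched = dense(W, b, X)

# Looped: one call per sample, then the outputs are glued back together.
Y_looped = reduce(hcat, [dense(W, b, X[:, i]) for i in 1:size(X, 2)])

@show size(Y_batched)
@show Y_batched ≈ Y_looped
```

Passing the whole matrix in one call is what lets the array library do a single large multiply instead of many small ones, so a loop over samples would not be expected to help.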

Ian_Slagle · July 12, 2020, 7:03pm

ToucheSir:

Are you able to upgrade to a more recent version of Flux? Tracker and mapleaves have long since been deprecated in favour of Zygote and fmap.

c.f.
...
using AMDGPU

model = Chain(
  Dense(24, 24, σ),
  Dense(24, 24),
  softmax
)

m_gpu = fmap(HSAArray, model)
...

Using the latest released (in Pkg) version of Flux like this yields this error:

UndefVarError: fmap not defined

Trying to install the master branch of Flux yields this strange error (considering AMDGPU is only on 0.1.0):

Unsatisfiable requirements detected for package GPUArrays [0c68f7d7]:
 GPUArrays [0c68f7d7] log:
 ├─possible versions are: [0.3.0-0.3.4, 0.4.0-0.4.2, 0.5.0, 0.6.0-0.6.1, 0.7.0-0.7.2, 1.0.0-1.0.4, 2.0.0-2.0.1, 3.0.0-3.0.1, 3.1.0, 3.2.0, 3.3.0, 3.4.0-3.4.1, 4.0.0] or uninstalled
 ├─restricted by compatibility requirements with AMDGPU [21141c5a] to versions: 2.0.0-2.0.1
 │ └─AMDGPU [21141c5a] log:
 │   ├─possible versions are: 0.1.0 or uninstalled
 │   └─restricted to versions * by an explicit requirement, leaving only versions 0.1.0
 └─restricted by compatibility requirements with CUDA [052768ef] to versions: 4.0.0 — no versions left
   └─CUDA [052768ef] log:
     ├─possible versions are: [0.1.0, 1.0.0-1.0.2, 1.1.0] or uninstalled
     └─restricted to versions 1 by Flux [587475ba], leaving only versions [1.0.0-1.0.2, 1.1.0]
       └─Flux [587475ba] log:
         ├─possible versions are: 0.11.0 or uninstalled
         └─Flux [587475ba] is fixed to version 0.11.0

Ian_Slagle · July 12, 2020, 11:10pm

Update: I’ve installed AMDGPU.jl (and rocblas, rocrand, rocsparse, rocalution, rocfft, and MIOpen, though I’m not sure how necessary they are for this) per @ToucheSir’s and @dpsanders’ suggestions. I then interfaced it with Flux in exactly the same way as I did with ArrayFire:

AMDGPU.HSAArray(func::typeof(σ)) = σ
AMDGPU.HSAArray(func::typeof(identity)) = identity
AMDGPU.HSAArray(func::typeof(softmax)) = softmax
model2 = mapleaves(HSAArray, mapleaves(Tracker.data, Chain(
  Dense(24, 24, σ),
  Dense(24, 24),
  softmax
)))

For this small network, the GPU version did outperform the CPU (though only by a little: 1.498 μs vs. 1.788 μs) for evaluation of the neural network.

However, trying this with a larger network like this:

Chain(
  Dense(604, 400, σ),
  Dense(400, 300),
  Dense(300, 197),
  Dense(197, 197),
  Dense(197, 197),
  Dense(197, 110),
  Dense(110, 20),
  softmax
)

did not yield significant performance improvements.

CPU:

BenchmarkTools.Trial: 
  memory estimate:  23.84 KiB
  allocs estimate:  117
  --------------
  minimum time:     53.960 μs (0.00% GC)
  median time:      62.500 μs (0.00% GC)
  mean time:        65.319 μs (2.06% GC)
  maximum time:     3.030 ms (96.96% GC)
  --------------
  samples:          10000
  evals/sample:     1

GPU:

BenchmarkTools.Trial: 
  memory estimate:  20.89 KiB
  allocs estimate:  22
  --------------
  minimum time:     201.469 μs (0.00% GC)
  median time:      206.269 μs (0.00% GC)
  mean time:        207.926 μs (0.38% GC)
  maximum time:     2.330 ms (85.28% GC)
  --------------
  samples:          10000
  evals/sample:     1

jling · July 13, 2020, 3:11am

Instead, a significant slowdown.

jpsamaroo · July 13, 2020, 9:19pm

It would be great if someone could test this with a CUDA GPU to compare. However, as stated, you may not get competitive performance over the CPU without using very expensive layers like convolutions.

Alternatively, it’s equally likely that AMDGPU.jl is at fault for being slow. HSAArray isn’t really optimized for anything, and it’s slated to be merged with the ROCArray, which is supposed to be the more performant and featureful array type (we mostly just need the HSAArray for tests). Both array types currently do very bad things as well, such as silently falling back to running operations on the CPU (which CuArrays explicitly avoids, and we will too soon). Much of this will be fixed in the next few months, and I’ll have better news then.
