Performance of view with CuArrays

Hi,

Let x and y be two 3D arrays of Float32 of size (200,200,200). I would like to perform the following operation:
y[i,:,:] = x[i+1,:,:] - x[i,:,:]  for i in 1:199

I have implemented this in a function that can be called with both Arrays and CuArrays:

function Btmult!(y, x)
    s = size(x)
    for i = 1:s[1]-1
        yi   = view(y, i, :, :)
        xi   = view(x, i, :, :)
        xip1 = view(x, i+1, :, :)
        yi .= xip1 .- xi
    end
end

I have also written a function acting on the third dimension, with the idea that it should be better suited to GPU computations (a quick stride check is sketched after the function):

function Btmult_plan!(y, x)
    s = size(x)
    for k = 1:s[3]-1
        yk   = view(y, :, :, k)
        xk   = view(x, :, :, k)
        xkp1 = view(x, :, :, k+1)
        yk .= xkp1 .- xk
    end
end
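A quick stride check illustrates the layout argument (Julia arrays are column-major, so a slice along the last dimension is one contiguous block of memory, while a slice along the first dimension is strided); the numbers in the comments are what strides returns for a 200×200×200 array:

x = zeros(Float32, 200, 200, 200)
strides(view(x, :, :, 1))   # (1, 200)     -> contiguous block
strides(view(x, 1, :, :))   # (200, 40000) -> neighbouring elements 200 floats apart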

The performance comparison surprised me because I expected the GPU to perform much faster.

diff=0.0
Btmult!(Array{Float32,3},Array{Float32,3}): t=0.07029846898
Btmult!(CuArray{Float32,3},CuArray{Float32,3}): t=0.00921697145
SpeedUp GPU/CPU=7.62706810597748
diff=0.0
Btmult_plan!(Array{Float32,3},Array{Float32,3}): t=0.0037557243699999996
Btmult_plan!(CuArray{Float32,3},CuArray{Float32,3}): t=0.0014109958899999999
SpeedUp GPU/CPU=2.661754294691815

What would be the most efficient way of implementing this operation on the GPU?

Laurent


Here is the MWE:

using Pkg
using CuArrays
# using BenchmarkTools

function Btmult!(y,x)
    s=size(x)
    for i=1:s[1]-1
        yi=view(y,i,:,:)
        xi=view(x, i,:, :)
        xip1=view(x, i+1,:, :)
        yi.=xip1.-xi
        # y[i,:,:].=x[i+1,:,:].-x[i,:,:]
    end
end
function Btmult_plan!(y,x)
    s=size(x)
    for k=1:s[3]-1
        yk=view(y,:,:,k)
        xk=view(x, :, :, k)
        xkp1=view(x, :, :, k+1)
        yk.=xkp1.-xk
        # y[:,:,k].=x[:,:,k+1].-x[:,:,k]
    end
end

function permute3D!(y,x)
    permutedims!(y,x,(2,3,1))
end

function init3D(n::Int64)
    xcpu=zeros(Float32,(n,n,n))
    copyto!(xcpu,1:length(xcpu))
    ycpu=zeros(Float32,(n,n,n))
    xgpu = CuArray{Float32}(n, n,n)
    ygpu = CuArray{Float32}(n, n,n)
    copyto!(xgpu,xcpu)
    copyto!(ygpu,ycpu)
    xcpu,ycpu,xgpu,ygpu
end

using LinearAlgebra
function norm_diff_cpugpu(xcpu,xgpu)
    xcpubis=similar(xcpu)
    copyto!(xcpubis,xgpu)
    return norm(xcpubis.-xcpu)
end

function test_binaryf!(y,x,bf)
    nsample=100
    bf(y,x)#warm up
    t=@elapsed for i=1:nsample
        bf(y,x)
    end
    t/=nsample
    return t
end


function test_binaryf_cpugpu(n::Int64,bf)
    (xcpu,ycpu,xgpu,ygpu)=init3D(n)
    tcpu=test_binaryf!(ycpu,xcpu,bf)
    tgpu=test_binaryf!(ygpu,xgpu,bf)
    println("diff=",norm_diff_cpugpu(ycpu,ygpu))
    println(string(bf),"(",string(typeof(ycpu)),",",string(typeof(xcpu)),"): t=", tcpu)
    println(string(bf),"(",string(typeof(ygpu)),",",string(typeof(xgpu)),"): t=", tgpu)
    println("SpeedUp GPU/CPU=",tcpu/tgpu)
    return
end


# test_binaryf_cpugpu(100,permute3D!)
test_binaryf_cpugpu(200,Btmult!)
test_binaryf_cpugpu(200,Btmult_plan!)

I would get rid of the outer loop and just use views.

function Btmult_new!(y,x)
    yv = @view y[:,:,1:end-1]
    xv = @view x[:,:,1:end-1]
    xvp1 = @view x[:,:,2:end]
    @. yv = xvp1 - xv
end

Using Btmult_new! defined above gave

diff=0.0
Btmult_new!(Array{Float32,3},Array{Float32,3}): t=0.005671426393
Btmult_new!(CuArray{Float32,3},CuArray{Float32,3}): t=5.9955349999999995e-6
SpeedUp GPU/CPU=945.941670426409

For comparison, Btmult_plan! gave

diff=0.0
Btmult_plan!(Array{Float32,3},Array{Float32,3}): t=0.005703055463
Btmult_plan!(CuArray{Float32,3},CuArray{Float32,3}): t=0.0019898782790000002
SpeedUp GPU/CPU=2.8660323212664203

Note: I bumped nsample up to 1000 for better statistics.

Thank you very much! Your implementation is much more elegant and implies only one GPU kernel launch. There is an obvious problem with my timing method (I guess the GPU kernels are launched asynchronously); I should include a copy of the result back to the CPU to force synchronization.
Unless a method for GPU synchronization is available?

CUDAdrv.synchronize() or CuArrays.@sync if you’re using master.
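Applied to the timing helper from the MWE, that could look like the following sketch (synchronizing before reading the clock so that the queued kernels have actually finished):

import CUDAdrv

function test_binaryf_sync!(y, x, bf; nsample=100)
    bf(y, x)                    # warm up (includes compilation)
    CUDAdrv.synchronize()
    t = @elapsed begin
        for i in 1:nsample
            bf(y, x)
        end
        CUDAdrv.synchronize()   # wait for all queued kernels before stopping the timer
    end
    return t / nsample
end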

Better to run all that code under nvprof to see the actual timings. You were probably launching way too many kernels, and depending on the kernel execution time that launch overhead may or may not have been hidden by the executions happening on the GPU. You can use nvvp (the visual counterpart of nvprof) to visualize that and see whether the kernels on the GPU are “tightly packed” or have gaps in between due to host overhead. You can of course avoid dealing with that by launching fewer kernels, which is always a good thing.

Thank you for the advice. I will try nvprof/nvvp.
BTW, do you launch it like this? (I get an error that it “returns a nonzero return code”)
nvvp /home/lolo/Public/julia-1.0.2/bin/julia TestGPU.jl

OK, nvprof works (I had to disable unified memory profiling because my GPU is too old).
I guess that the kernel is ptxcall_anonymous_1.
I still can’t use nvvp.

nvprof --unified-memory-profiling off /home/lolo/Public/julia-1.0.2/bin/julia TestGPU.jl 
==7820== NVPROF is profiling process 7820, command: /home/lolo/Public/julia-1.0.2/bin/julia TestGPU.jl
diff=0.0
Btmult_new!(Array{Float32,3},Array{Float32,3}): t=0.013200700000000001
Btmult_new!(CuArray{Float32,3},CuArray{Float32,3}): t=1.02e-5
SpeedUp GPU/CPU=1294.186274509804
==7820== Profiling application: /home/lolo/Public/julia-1.0.2/bin/julia TestGPU.jl
==7820== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   41.36%  41.233ms        11  3.7485ms  3.7080ms  3.9029ms  ptxcall_anonymous_1
                   35.45%  35.346ms         2  17.673ms  1.3760us  35.345ms  [CUDA memcpy DtoH]
                   23.19%  23.125ms         7  3.3036ms     704ns  11.573ms  [CUDA memcpy HtoD]
      API calls:   53.62%  406.69ms        27  15.063ms     792ns  181.41ms  cudaFree

Comparing the measured CPU time with the actual kernel time from nvprof, this means a speed-up of only about 4 (not very impressive…)

I haven’t looked at your code in detail, but it looks like you’re just doing a subtraction? That’s a trivial operation, so you can’t expect the GPU to shine here. It’s all about latency hiding, and you don’t have many computations to hide memory operations with. There might be room for optimization by improving how you load from memory (e.g. iteration order to avoid bank conflicts, cached loads, etc.), but in general it’s much easier to try to fuse kernels or otherwise increase the computational load.
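To make that concrete with a hypothetical variant of the broadcast version (not an operation from the application, just an illustration): piling more elementwise math onto the same dotted assignment still compiles to a single fused kernel, so on a memory-bound problem the extra arithmetic costs almost nothing.

function Btmult_fused!(y, x)
    yv   = @view y[:, :, 1:end-1]
    xv   = @view x[:, :, 1:end-1]
    xvp1 = @view x[:, :, 2:end]
    # the whole right-hand side fuses into one broadcast kernel launch
    @. yv = tanh(xvp1 - xv) + 0.5f0 * xv
    return y
end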

I totally agree on the need to fuse my different kernels, but I am just discovering the CuArrays.jl package and I wanted to start with basic computations (especially because I could not find a tutorial or documentation). For example, do you approve of the way this operation is implemented by @ksmcreynolds?

As you pointed out, this kernel has a very low arithmetic intensity: I expect its runtime to be limited by the VRAM bandwidth:

  • For 100x100x100: tgpu = 138.6 µs (the CPU is faster here, but the whole computation fits in cache)
Btmult_new!(Array{Float32,3},Array{Float32,3}): t=0.0002206
GPU activities:   64.36%  2.9138ms        21  138.75us  138.59us  139.62us  ptxcall_anonymous_1
  • For 200x200x200: tgpu = 1094 µs (×7.89) → tcpu/tgpu = 3.54
Btmult_new!(Array{Float32,3},Array{Float32,3}): t=0.0038772
GPU activities:   57.36%  22.987ms        21  1.0946ms  1.0942ms  1.0959ms  ptxcall_anonymous_1
  • For 400x400x400: tgpu = 8796 µs (×8.04) → tcpu/tgpu = 3.5
Btmult_new!(Array{Float32,3},Array{Float32,3}): t=0.030702550000000002
GPU activities:   56.78%  184.70ms        21  8.7953ms  8.6659ms  8.8032ms  ptxcall_anonymous_1

The GPU time scales with the problem size, but the CPU/GPU ratio does not match the CPU/GPU bandwidth ratio, which I would expect to be around 10 (330 GB/s for the 780 Ti compared to roughly 30 GB/s for my 6700K system).
For example, my comparison with permutedims! gave me a CPU/GPU speed-up factor of around 10.
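As a rough cross-check (my own back-of-the-envelope arithmetic for the n = 200 case, assuming the kernel streams two reads of x and one write of y per output element):

n     = 200
bytes = 3 * (n-1) * n * n * sizeof(Float32)  # ≈ 95.5 MB moved per call (2 reads + 1 write)
t_gpu = 1.0946e-3                            # ptxcall_anonymous_1 average from nvprof, in seconds
t_cpu = 0.0038772                            # measured CPU time, in seconds
bytes / t_gpu / 1e9                          # ≈ 87 GB/s, about a quarter of the 780 Ti peak
bytes / t_cpu / 1e9                          # ≈ 25 GB/s, close to the CPU's ~30 GB/s

So the CPU already runs close to its memory bandwidth, while the GPU kernel reaches only a fraction of its peak, which is consistent with the measured ratio of ~3.5 rather than ~10.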
Thank you again for your help.

Here’s a slightly more idiomatic Julia script to benchmark the performance:

using Test

using BenchmarkTools
using Statistics

using CuArrays
import CUDAdrv

function Btmult!(y,x)
    s=size(x)
    for i=1:s[1]-1
        yi=view(y,i,:,:)
        xi=view(x, i,:, :)
        xip1=view(x, i+1,:, :)
        yi.=xip1.-xi
        # y[i,:,:].=x[i+1,:,:].-x[i,:,:]
    end
end

function Btmult_plan!(y,x)
    s=size(x)
    for k=1:s[3]-1
        yk=view(y,:,:,k)
        xk=view(x, :, :, k)
        xkp1=view(x, :, :, k+1)
        yk.=xkp1.-xk
        # y[:,:,k].=x[:,:,k+1].-x[:,:,k]
    end
end

function Btmult_new!(y,x)
    yv = @view y[:,:,1:end-1]
    xv = @view x[:,:,1:end-1]
    xvp1 = @view x[:,:,2:end]
    @. yv = xvp1 - xv
end

function init3D(n::Int64)
    x = zeros(Float32, (n,n,n))
    copyto!(x,1:length(x))
    y = zeros(Float32, (n,n,n))
    x,y
end

function test_binaryf!(y, x, bf; nsample=100)
    for i in 1:nsample
        bf(y,x)
    end
    return
end

function benchmark(N=100)
    for bf in (Btmult!, Btmult_plan!, Btmult_new!)
        x,y = init3D(N)
        x_gpu, y_gpu = CuArray(x), CuArray(y)

        test_binaryf!(y, x, bf)
        test_binaryf!(y_gpu, x_gpu, bf)
        @test y ≈ Array(y_gpu)

        @show cpu = @benchmark begin
            test_binaryf!(y, x, $bf)
        end setup=((x,y) = init3D($N))

        @show gpu = @benchmark begin
            CuArrays.@sync test_binaryf!(y_gpu, x_gpu, $bf)
        end setup=((x,y) = init3D($N); (x_gpu, y_gpu) = (CuArray(x), CuArray(y)))

        @info "GPU vs CPU for $bf" judge(median(gpu),median(cpu))
    end
end
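(The whole comparison is then run by calling benchmark(), or e.g. benchmark(200) for the 200×200×200 case; the default problem size is N=100.)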

On my GTX 1080 this gives:

┌ Info: GPU vs CPU for Btmult!
│   judge(median(gpu), median(cpu)) =
│    BenchmarkTools.TrialJudgement: 
│      time:   -75.82% => improvement (5.00% tolerance)
└      memory: +1825.00% => regression (1.00% tolerance)
┌ Info: GPU vs CPU for Btmult_plan!
│   judge(median(gpu), median(cpu)) =
│    BenchmarkTools.TrialJudgement: 
│      time:   -62.68% => improvement (5.00% tolerance)
└      memory: +1025.02% => regression (1.00% tolerance)
┌ Info: GPU vs CPU for Btmult_new!
│   judge(median(gpu), median(cpu)) =
│    BenchmarkTools.TrialJudgement: 
│      time:   -89.90% => improvement (5.00% tolerance)
└      memory: +1158.42% => regression (1.00% tolerance)

So I’m getting a 10x improvement.

As for your performance, note that with older GPUs the overhead of our generated code (which includes 64-bit indexing, broadcast dimensionality padding, array bounds checks, etc.) is much more visible than it is with more recent GPUs like my GTX 1080, see e.g. Fig. 1 of https://arxiv.org/pdf/1810.08297.pdf. That might be affecting your performance here, since a 780 Ti is pretty old.


Thank you very much for the Julia idioms and the paper! It will take some time before I get fluent in Julia :wink:

Please do not insist on the respectable age of my graphics card, my wallet gets afraid.
A few months ago I ran some tests on a Volta card via AWS, and the performance boost was really impressive with non-trivial transport kernels. I remember that atomic operations made a big difference; the boost was less impressive on embarrassingly parallel kernels.

I am (over?)enthusiastic about the CuArrays approach and I really hope that high-level GPU programming will improve and eventually replace most manual CUDA kernel programming. I was in particular deeply impressed by the Futhark project and its ability to compile complex SOAC expressions (map, zip, reduce, …) efficiently. I am not a specialist, but I think that a lot of Futhark’s transformation rules could be implemented in pure Julia and enhance the expressiveness of CuArrays in the future.

a lot of Futhark’s transformation rules could be implemented in pure Julia and enhance the expressiveness of CuArrays in the future

Futhark is really built to facilitate these analyses and generate high-performance code, I’m not sure that’ll be easy to match with Julia in the near future. Although the new SSA IR could help here (as demonstrated by the TPU back-end), and there’s some research working on this (Mike’s Tokamak, Peter’s Tortilla.jl). For now, broadcast + some hand-optimized kernels looks to be as good as it’ll get, but that’s fine for most users. We have other benefits that often outweigh raw performance.
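For reference, a bare-bones hand-written kernel for this particular slice difference could look like the sketch below (assuming CUDAnative.jl; the linear-indexing trick and the launch configuration are illustrative, not tuned):

using CUDAnative, CuArrays

# y[:,:,k] = x[:,:,k+1] - x[:,:,k]: the two slices are exactly n1*n2 elements
# apart in memory, so a grid-stride loop over linear indices is enough.
function diff_kernel!(y, x, len, offset)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    while i <= len
        @inbounds y[i] = x[i + offset] - x[i]
        i += stride
    end
    return nothing
end

function Btmult_manual!(y, x)
    n1, n2, n3 = size(x)
    len    = n1 * n2 * (n3 - 1)   # number of output elements
    offset = n1 * n2              # distance between x[:,:,k] and x[:,:,k+1]
    threads = 256
    blocks  = cld(len, threads)
    @cuda threads=threads blocks=blocks diff_kernel!(y, x, len, offset)
    return y
end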

You are the specialist. My wishes as an end user are not necessarily realistic :wink: