CatViews for CUDA.jl?

Ahmed_Salih · February 8, 2023, 9:23pm

Hello!

I found a somewhat old package with a great feature that I would really like to use in my code. It would let me keep my current code structure and give me the flexibility I need. The package is CatViews.jl, with the example provided as:

A = randn(10, 10);
B = randn(10, 10);
a = view(A, :);      # no copying
b = view(B, :);      # no copying
x = CatView(a, b);   # no copying!!!

This works out of the box and is great. But if I make A and B CuArrays instead:

A = CuArray(randn(10, 10));
B = CuArray(randn(10, 10));
a = view(A, :);      # no copying
b = view(B, :);      # no copying
x = CatView(a, b);   # no copying!!!

x = CatView(a, b)
ERROR: StackOverflowError:
Stacktrace:
     [1] CatView(arr::Tuple{CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}})

Would anyone know if there is some way to fix this?

Basically, I want to concatenate A and B without allocations.

EDIT: From the error message I think I see what happens: a view of a CuArray is itself a CuArray, not an actual view type. Is there perhaps some other way to achieve my goal?

Kind regards

jling · February 8, 2023, 9:28pm

based on this:

JuliaGPU/CUDA.jl/blob/e9833ed71977b423586734a5f81151925e00d960/src/array.jl#L711-L721

@inline function Base.view(A::CuArray, I::Vararg{Any,N}) where {N}
    J = to_indices(A, I)
    @boundscheck begin
        # Base's boundscheck accesses the indices, so make sure they reside on the CPU.
        # this is expensive, but it's a bounds check after all.
        J_cpu = map(j->adapt(BackToCPU(), j), J)
        checkbounds(A, J_cpu...)
    end
    J_gpu = map(j->adapt(CuArray, j), J)
    unsafe_view(A, J_gpu, CuIndexStyle(I...))
end

I believe a view of a CUDA array is still a CuArray. So what needs to happen here is that CatView needs to know about CuArray.

Btw, why do you need this to begin with? First of all, CatView is kinda slow; secondly, for stuff to run fast on the GPU you probably want a contiguous array, not a cat view, anyway.

Ahmed_Salih · February 8, 2023, 9:32pm

How would I make CatView know about CuArrays?

I need it because I want to avoid allocating data on the GPU, and perhaps I can avoid rewriting my whole data structure if I get something like CatView to work on the GPU.

Kind regards

jling · February 8, 2023, 9:33pm

is that really your bottleneck?

jling · February 8, 2023, 9:34pm

try ChainedVector from SentinelArrays.jl (chainedvector.jl at 839aed26e53f62779259095f4efda360d09f9a63, JuliaData/SentinelArrays.jl on GitHub)

Ahmed_Salih · February 8, 2023, 9:43pm

As far as I have been told, the fastest GPU code is code that makes no data allocations on the GPU, so I aim to write GPU code without any allocations. I am running a particle simulation of 100k iterations, so I think it is a smart design choice, though I don't know that much about it tbh.

Thank you, I tried ChainedVector from SentinelArrays, but it uses scalar indexing and raised an error.

Kind regards

jling · February 8, 2023, 9:49pm

yeah, if you lazily chain two arrays together, you’re bound to do scalar indexing.
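
To see why a lazy chain ends up indexing one element at a time, here is a minimal sketch in plain Julia. `LazyCat` is a hypothetical type made up for illustration; it is not how CatViews.jl or SentinelArrays.jl are actually implemented. The only thing such a wrapper can do is pick the right parent on every single `getindex` call, which is exactly the access pattern a GPU array dislikes.

```julia
# Hypothetical illustration type, not taken from CatViews.jl or SentinelArrays.jl.
struct LazyCat{T,A<:AbstractVector{T},B<:AbstractVector{T}} <: AbstractVector{T}
    head::A
    tail::B
end

Base.size(v::LazyCat) = (length(v.head) + length(v.tail),)

# Every access has to decide which parent array holds index i.
function Base.getindex(v::LazyCat, i::Int)
    @boundscheck checkbounds(v, i)
    n = length(v.head)
    return i <= n ? v.head[i] : v.tail[i - n]
end

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0]
x = LazyCat(a, b)    # wraps a and b, no element data is copied

a[1] = 10.0          # x sees the change because it only holds references
total = sum(x)       # generic fallback: walks x element by element via getindex
```

The arrays here live on the CPU, so the element-wise walk is harmless. With parents stored on the GPU, the same generic fallback would read one element at a time from the host.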

sure, but that’s assuming you can keep everything else intact. In reality you can’t, and it may not be worth the effort after all.

besides, it’s possible to allocate once and re-use the same memory on GPU as well
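
A rough sketch of what allocate-once-and-reuse could look like for the concatenation case. It is shown with CPU arrays so it runs anywhere; `concat_into!` is a hypothetical helper name, and the loop body is a stand-in for a real simulation step. Note that this still copies the elements on every call; it only avoids allocating a new destination array each time.

```julia
x1 = rand(Float32, 100)
x2 = rand(Float32, 100)

# Allocate the combined buffer once, outside the hot loop.
x = similar(x1, length(x1) + length(x2))

# Hypothetical helper: copy a and b back to back into a preallocated dest.
function concat_into!(dest, a, b)
    copyto!(dest, 1, a, 1, length(a))
    copyto!(dest, length(a) + 1, b, 1, length(b))
    return dest
end

for step in 1:3
    x1 .+= 1f0               # stand-in for the real update of x1
    x2 .*= 0.5f0             # stand-in for the real update of x2
    concat_into!(x, x1, x2)  # reuses x, no new array is created
end
```

The five-argument `copyto!(dest, doffs, src, soffs, n)` is a Base function; whether this pattern is fast enough on the GPU for a given workload is something to measure rather than assume.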

Ahmed_Salih · February 8, 2023, 9:52pm

Yes, I am doing that in my code: allocating memory once and working on it continuously, trying to avoid resizing etc.

And I can rewrite my code to just increase the size of the vectors from the start - I just thought it would be more elegant with something like CatView

Kind regards, thank you for the help

wsshin · February 10, 2023, 3:14pm

If you know the sizes of A and B in advance, as in OP’s example (both 10×10), then you can use splitview from CatViews:

using CUDA, CatViews
x = cu(rand(200))  # CuVector storing the contents of A and B
(A, B), s, e = splitview(x, (10,10), (10,10))

where s and e store the indices of x corresponding to the start and end of A and B.

With this, you get

julia> typeof(A), typeof(B)
(CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})

julia> size(A), size(B)
((10, 10), (10, 10))

so A and B are CuMatrix, not Matrix.

Ahmed_Salih · February 10, 2023, 4:25pm

Thank you for your help!

This seems to be close to what I want, but not quite there. Ideally what I would want to do is as follows.

Imagine I have:

x1 = cu(rand(100))
x2 = cu(rand(100))

Then I want to construct a combined array/view without allocations, so that I get:

# x = vcat(x1,x2) but this allocates.. which is what I want to avoid
x = cu(rand(200))

This is where I hoped CatView could come in and Julia could really shine, by letting me keep my current data structure, which is very nice for my purpose.

Kind regards

wsshin · February 10, 2023, 5:48pm

If the lengths of x1 and x2 are predetermined, can’t you do the following?

x = cu(rand(200))  # 200 = length(x1) + length(x2)
(x1,x2), s, e = splitview(x, (100,), (100,))

I notice that splitview() allocates:

julia> @btime splitview($x, (100,), (100,));
  56.213 ns (2 allocations: 96 bytes)

but I don’t think it allocates the memory for the elements of x1 and x2. In other words, x1 and x2 are partial views of x: changing the contents of x1 and x2 changes the contents of x, and vice versa.
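
The aliasing can be checked with a small sketch. It uses Base's `view` on a CPU vector as a stand-in for `splitview`, which is an assumption made so that the snippet runs without a GPU or CatViews installed; the point is only that the pieces share memory with the parent buffer.

```julia
x = zeros(Float32, 200)   # one buffer holding both pieces
x1 = view(x, 1:100)       # first half, no element data copied
x2 = view(x, 101:200)     # second half, no element data copied

# Writing through the views changes the parent buffer...
x1 .= 1f0
x2 .= 2f0
parent_sum = sum(x)       # 100 * 1 + 100 * 2

# ...and writing to the parent changes what the views see.
x .= 0f0
views_sum = sum(x1) + sum(x2)
```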

Also note that README of CatViews recommends using splitview() over CatView() when possible, because it is faster.

Ahmed_Salih · February 10, 2023, 5:56pm

In pseudocode, what I am trying to do is:

x, s, e = splitview((x1,x2), (100,), (100,))

I want to output a concatenated array x, based on the values from x1 and x2. What I want to replicate is basically:

x = reduce(vcat,(x1,x2))
@CUDA.time x = reduce(vcat,(x1,x2))
  0.000699 seconds (113 CPU allocations: 6.500 KiB) (1 GPU allocation: 800 bytes, 1.63% memmgmt time)

But this allocates on the GPU, which I don’t want. This is where I hoped this package could help me keep my data structure, but “combine the arrays allocation free”

That is because I start from x1 and x2 and want to produce x.

You are showing me how to produce x1 and x2 from x, but I do not have x, because of my initial design choice.

Kind regards

wsshin · February 10, 2023, 6:06pm

That’s why I keep asking if you know the lengths of x1 and x2 in advance. If you do, you can create x first using the sum of the lengths of x1 and x2, and then create x1 and x2 using splitview(). If you define types containing x1 and x2 as fields, you can pass x1 and x2 created by splitview() to those types.
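
Roughly what I mean, as a sketch only: `Particles`, `pos`, and `vel` are hypothetical names standing in for your own types and fields, and Base's `view` on a CPU vector stands in for `splitview` on a CuVector so that the snippet runs anywhere.

```julia
# Hypothetical container type standing in for the existing data structure.
struct Particles{V<:AbstractVector}
    pos::V
    vel::V
end

# Create the combined buffer first, sized as the sum of the two lengths.
buf = zeros(Float32, 200)

# Hand the two pieces of the buffer to the type as its fields.
p = Particles(view(buf, 1:100), view(buf, 101:200))

# Code that only knows about the fields keeps working unchanged...
p.pos .= 1f0
p.vel .= -1f0

# ...while buf is already the concatenation, with no vcat needed.
combined_sum = sum(buf)
```

The rest of the code keeps using the fields as before, and the combined vector is available at any time without a separate concatenation step.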