I found a somewhat old package with a feature I would really like to use in my code. It would let me keep my current code structure while giving me the flexibility I need. The package is CatViews.jl, with the example provided as:
using CatViews

A = randn(10, 10);
B = randn(10, 10);
a = view(A, :); # no copying
b = view(B, :); # no copying
x = CatView(a, b); # no copying!!!
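For reference, a minimal sketch (assuming CatViews.jl is installed) of what "no copying" buys you: the concatenated view shares memory with the parent matrices, so writes through it land in A and B.

```julia
using CatViews

A = randn(10, 10)
B = randn(10, 10)
x = CatView(view(A, :), view(B, :))  # lazy concatenation, no copy

# Writing through the CatView mutates the underlying matrix:
x[1] = 42.0
A[1, 1] == 42.0  # x[1] aliases A[1,1] (column-major order)
```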
This works out of the box, which is great. But if I try making A and B CuArrays:
using CUDA, CatViews

A = CuArray(randn(10, 10));
B = CuArray(randn(10, 10));
a = view(A, :); # no copying
b = view(B, :); # no copying
x = CatView(a, b)
ERROR: StackOverflowError:
Stacktrace:
[1] CatView(arr::Tuple{CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}})
Would anyone know if there is some way to fix this?
Basically, I want to concatenate A and B without allocations.
EDIT: From the error message I think I see what happens: a view of a CUDA array is not an actual view (SubArray). Perhaps there is some other way to achieve my goal, then?
I believe a view of a CuArray is still a CuArray. So what needs to happen here is that CatView needs to know about CuArray.
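A quick way to check this claim (a hypothetical REPL session, assuming CUDA.jl is installed and a GPU is available):

```julia
using CUDA

A = CUDA.rand(10, 10)
a = view(A, :)

# Contiguous views of a CuArray are returned as CuArrays, not SubArrays,
# which is consistent with the constructor in the stack trace above
# receiving a Tuple of CuArrays and recursing.
typeof(a) <: CuArray
```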
Btw, why do you need this to begin with? First of all, CatView is kind of slow; secondly, for things to run fast on a GPU you probably really want a contiguous array, not a cat view, anyway.
I need it because I want to avoid allocating data on the GPU, and perhaps I can avoid rewriting my whole data structure if I can get something like CatView to work on the GPU.
As far as I have been told, the fastest GPU code is the kind with no data allocations on the GPU, so I aim to write GPU code without any allocations. I am running a particle simulation for about 100k iterations, so I think it is a smart design choice, without knowing all that much, to be honest.
I tried SentinelArrays and its ChainedVector, thank you, but it uses scalar indexing and produced an error.
Yes, I am doing that in my code: allocating memory once and working on it continuously, trying to avoid resizing, etc.
And I could rewrite my code to simply make the vectors larger from the start; I just thought it would be more elegant with something like CatView.
But I don’t think it allocates memory for the elements of x1 and x2. In other words, x1 and x2 are partial views of x: changing the contents of x1 and x2 changes the contents of x, and vice versa.
Also note that README of CatViews recommends using splitview() over CatView() when possible, because it is faster.
I want to output a concatenated array x based on the values of x1 and x2. What I want to replicate is basically:
x = reduce(vcat,(x1,x2))
@CUDA.time x = reduce(vcat,(x1,x2))
0.000699 seconds (113 CPU allocations: 6.500 KiB) (1 GPU allocation: 800 bytes, 1.63% memmgmt time)
But this allocates on the GPU, which I don’t want. This is where I hoped this package could help me keep my data structure while combining the arrays allocation-free.
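If a genuinely contiguous x is needed, one way to avoid the per-call GPU allocation (a sketch; the x1/x2 sizes here are made up) is to preallocate x once and copy into its segments each iteration. Unlike CatView this does move data, but it allocates nothing after setup:

```julia
using CUDA

x1 = CUDA.rand(100)
x2 = CUDA.rand(100)
n1, n2 = length(x1), length(x2)

x = CUDA.zeros(Float64, n1 + n2)   # allocated once, before the loop

# Each iteration: device-to-device copies into the preallocated buffer.
# No new GPU allocations, in contrast to reduce(vcat, (x1, x2)).
copyto!(view(x, 1:n1), x1)
copyto!(view(x, n1+1:n1+n2), x2)
```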
It is because I start from x1 and x2 and want to produce x.
You are showing me how to produce x1 and x2 from x, but I do not have x, because that was my initial design choice.
That’s why I keep asking if you know the lengths of x1 and x2 in advance. If you do, you can create x first using the sum of the lengths of x1 and x2, and then create x1 and x2 using splitview(). If you define types containing x1 and x2 as fields, you can pass x1 and x2 created by splitview() to those types.
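A sketch of that design on the GPU, assuming the lengths are known up front (done manually with views rather than splitview(), since CatViews may not handle CuArrays): allocate x once, then hand out views that alias its segments, so x is always the "concatenation" of x1 and x2 with no copy at all.

```julia
using CUDA

n1, n2 = 100, 50                   # lengths known in advance (assumed)
x = CUDA.zeros(Float64, n1 + n2)   # single allocation for everything

# Contiguous views of a CuArray are themselves CuArrays, so broadcasts
# and kernels work on them directly, and they alias segments of x:
x1 = view(x, 1:n1)
x2 = view(x, n1+1:n1+n2)

x1 .= 1.0   # writes land directly in x[1:n1]
x2 .= 2.0   # writes land directly in x[n1+1:end]
# x now already holds the concatenation of x1 and x2: no copy, no vcat.
```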