Dreaded CuArray only supports element types that are stored inline

GPU noob. I’ve simplified my problem here. I have a compression type that basically carries the information needed to produce the correct phase adjustment, but it is an abstract composite type that can have various memory footprints, as shown below.

I’m trying to design something that could allow a vector of these different objects to be passed into a GPU and processed.

I suppose I might be able to do something with function handles instead of composite types for the codeStyle. But is there a way to do it the way I am attempting?

Best Regards,
Allan B

using CUDA

# This file is going to test putting together a GPU version of an array of various different memory structures

struct defaultCompType
	a::Float64
	b::Float64
	c::Int64
end

abstract type codeStyle{CT} end

struct code1{CT<:NTuple} <: codeStyle{CT}
	code::CT
	s_chiplen::Float64
end

struct code2{CT<:NTuple} <: codeStyle{CT}
	B::Float64
	T::Float64
end

struct code3{CT<:NTuple} <: codeStyle{CT}
end

struct compressedCompType{CT<:NTuple}
	w::defaultCompType
	c::codeStyle{CT}
end


w1=compressedCompType{NTuple{}}(defaultCompType(1.0,2.0,3),code3{NTuple{}}())
w2=compressedCompType{NTuple{}}(defaultCompType(1.0,2.0,3),code3{NTuple{}}())
w3=compressedCompType{NTuple{}}(defaultCompType(1.0,2.0,3),code2{NTuple{}}(1.0,2.0))
w4=compressedCompType{Tuple{Int64,Int64,Int64}}(defaultCompType(1.0,2.0,3),code1{Tuple{Int64,Int64,Int64}}((1,2,3),4.0))

CuArray([w1,w2,w3])  # Works -- because the NTuple thing is not there, I guess

CuArray([w1,w2,w3,w4])  # Does not work -- what should I do?

I think I’ve decided to represent my bit vector as a UInt64 instead of a Tuple. I think this gives it a more concrete size.
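
For illustration, a minimal sketch of that idea (all names here are made up, not taken from the code above): packing the chips into a UInt64 plus a length gives every code the same fixed, isbits layout.

struct PackedCode
	bits::UInt64       # chip n is stored in bit n-1
	len::Int32         # number of valid chips (up to 64)
	s_chiplen::Float64
end

# Read chip i (1-based) as +1.0/-1.0 without allocating anything.
chip(c::PackedCode, i::Integer) = ((c.bits >> (i - 1)) & 0x1) == 0x1 ? 1.0 : -1.0

isbitstype(PackedCode)  # true, so a CuArray{PackedCode} stores its elements inline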

There are several things wrong with your code. First, your compressedCompType structs can never be isbits because they contain an abstract type:

abstract type codeStyle{CT} end
struct compressedCompType{CT<:NTuple}
    w::defaultCompType
    c::codeStyle{CT}
end

That field isn’t fully specialized. Instead, use something like:

struct compressedCompType{CT<:codeStyle}
    w::defaultCompType
    c::CT
end

… and a constructor to simplify things:

compressedCompType(w, c::CT) where CT = compressedCompType{CT}(w, c)
w1=compressedCompType(defaultCompType(1.0,2.0,3),code3{NTuple{}}())

Next, you can’t represent an array of different instances of these structs on the GPU, because their layout isn’t the same, and as a result these elements aren’t stored contiguously:

julia> [w1,w1]
2-element Vector{compressedCompType{code3{Tuple{Vararg{T, N}} where {N, T}}}}:
 compressedCompType{code3{Tuple{Vararg{T, N}} where {N, T}}}(defaultCompType(1.0, 2.0, 3), code3{Tuple{Vararg{T, N}} where {N, T}}())
 compressedCompType{code3{Tuple{Vararg{T, N}} where {N, T}}}(defaultCompType(1.0, 2.0, 3), code3{Tuple{Vararg{T, N}} where {N, T}}())

julia> isbitstype(eltype(ans))
true

# w2 has the same type as w1, but is a different object
julia> [w1,w2]
2-element Vector{compressedCompType{code3{Tuple{Vararg{T, N}} where {N, T}}}}:
 compressedCompType{code3{Tuple{Vararg{T, N}} where {N, T}}}(defaultCompType(1.0, 2.0, 3), code3{Tuple{Vararg{T, N}} where {N, T}}())
 compressedCompType{code3{Tuple{Vararg{T, N}} where {N, T}}}(defaultCompType(1.0, 2.0, 3), code3{Tuple{Vararg{T, N}} where {N, T}}())

julia> isbitstype(eltype(ans))
true

# w3 has a different type
julia> [w1,w2,w3]
3-element Vector{compressedCompType}:
 compressedCompType{code3{Tuple{Vararg{T, N}} where {N, T}}}(defaultCompType(1.0, 2.0, 3), code3{Tuple{Vararg{T, N}} where {N, T}}())
 compressedCompType{code3{Tuple{Vararg{T, N}} where {N, T}}}(defaultCompType(1.0, 2.0, 3), code3{Tuple{Vararg{T, N}} where {N, T}}())
 compressedCompType{code2{Tuple{Vararg{T, N}} where {N, T}}}(defaultCompType(1.0, 2.0, 3), code2{Tuple{Vararg{T, N}} where {N, T}}(1.0, 2.0))

julia> isbitstype(eltype(ans))
false

FWIW, these somewhat subtle details aren’t CUDA.jl’s choice; we just require array storage to be contiguous (see Base.allocatedinline), and the rest is Julia’s design.
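
For anyone hitting this, a quick host-side check before constructing the CuArray (a small sketch, reusing the w1/w2/w3 values from the output above):

julia> isbitstype(eltype([w1, w2]))   # a single concrete element type is stored inline
true

julia> CuArray([w1, w2]);             # so this uploads fine

julia> Base.allocatedinline(eltype([w1, w2, w3]))   # mixing code2/code3 widens the eltype
false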

Thank you. I think I figured this out… eventually. Not having programmed on GPUs before, and trying to convert probably poorly behaved CPU Julia code to the GPU, I’ve gone through a lot of iterations.

I restructured it so that, instead of using multiple dispatch to select the function calls, it uses a common data format plus an enum to pick which function to call at run time. Since this really is a run-time determination, that makes it much more palatable to the GPU…
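
Something like the following sketch captures that idea (all names and formulas here are made up, not the actual code): every entry has the same concrete, isbits layout, and an enum selects the behaviour at run time inside the kernel.

@enum CodeKind CODE_NONE CODE_PHASE CODE_LFM

struct CompressedComp
	a::Float64
	b::Float64
	c::Int64
	kind::CodeKind   # which phase formula to apply
	p1::Float64      # meaning depends on `kind` (e.g. chip length or bandwidth)
	p2::Float64
end

# Branching on an enum value is GPU-friendly; no dynamic dispatch is involved.
function phase(x::CompressedComp, t::Float64)
	if x.kind == CODE_PHASE
		return x.p1 * t                # placeholder formula
	elseif x.kind == CODE_LFM
		return x.p1 * t + x.p2 * t^2   # placeholder formula
	else
		return 0.0
	end
end

isbitstype(CompressedComp)  # true: an @enum value is backed by an Int32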

But I still have a fundamental flaw in my understanding.
The structure I was trying to get to work is really used in a vector to build a database of parameters that change over time. This database is searched at each “time” lookup to find the best parameters. I need to access this database on the GPU, where the time vector is the CuArray and can be distributed because its calculations are independent. How do I represent the database, i.e. the vector of structures containing the parameters, on the GPU? The database vector will always be searched sequentially to find the best parameter entry for each element of the time vector, which is computed in parallel.

I thought initially that every vector needed to be a CuArray, but really only the ones that are going to be distributed need to be CuArrays, don’t they? The database vector just needs to be looked up for each element of the CuArray time vector. How do I represent it?

Does that make sense?

Perhaps I need to start a new issue for this.

If I understand correctly… your database vector also needs to be a CuArray, as you can only access CuArrays from within a GPU kernel. Unless the database vector is really small, in which case it can also be a StaticArray.
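
A minimal sketch of that setup (with a made-up isbits Param struct standing in for the database entries): both the time vector and the database live in CuArrays, and the broadcasted closure captures the database so the sequential lookup runs on the device.

using CUDA

struct Param            # hypothetical isbits database entry
	s_start::Float64
	gain::Float64
end

db    = CuArray([Param(0.0, 1.0), Param(0.5, 2.0)])  # the parameter database, on the GPU
times = CuArray(collect(0.0:0.1:1.0))                # the independent values to parallelize over

# Sequential scan of the (small) database; inside a kernel `dbv` is a CuDeviceVector,
# so keep the argument untyped (or ::AbstractVector) rather than ::CuArray.
function best_gain(dbv, t)
	g = 0.0
	for p in dbv
		t >= p.s_start && (g = p.gain)
	end
	return g
end

# The closure captures `db`; each element of `times` is processed independently on the GPU.
signal = map(t -> best_gain(db, t) * t, times)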

Is there an easy way to see whether it is running on the CPU? I think I have CuArrays, plus a structure which should contain another CuArray (the database), and when I try to do a broadcast operation I get:
ERROR: This function is not intended for use on the CPU

ERROR: This function is not intended for use on the CPU
Stacktrace:
  [1] error(s::String)
    @ Base .\error.jl:35
  [2] arrayref(A::CuDeviceVector{Waveform{Float64}, 1}, index::Int64)
    @ CUDA C:\Users\user\.julia\packages\CUDA\DfvRa\src\device\utils.jl:42
  [3] getindex
    @ C:\Users\user\.julia\packages\CUDA\DfvRa\src\device\array.jl:192 [inlined]
  [4] copyto_unaliased!
    @ .\abstractarray.jl:1038 [inlined]
  [5] copyto!
    @ .\abstractarray.jl:1018 [inlined]
  [6] copyto_axcheck!
    @ .\abstractarray.jl:1127 [inlined]
  [7] Vector{Waveform{Float64}}(x::CuDeviceVector{Waveform{Float64}, 1})
    @ Base .\array.jl:626
  [8] Array
    @ .\boot.jl:484 [inlined]
  [9] convert
    @ .\array.jl:617 [inlined]
 [10] CuArray
    @ C:\Users\user\.julia\packages\CUDA\DfvRa\src\array.jl:292 [inlined]
 [11] CuArray
    @ C:\Users\user\.julia\packages\CUDA\DfvRa\src\array.jl:296 [inlined]
 [12] (CuArray{Waveform{Float64}})(xs::CuDeviceVector{Waveform{Float64}, 1})
    @ CUDA C:\Users\user\.julia\packages\CUDA\DfvRa\src\array.jl:303
 [13] convert
    @ C:\Users\user\.julia\packages\GPUArrays\fqD8z\src\host\construction.jl:4 [inlined]
 [14] TransmissionBufferGPU(buffer::CuDeviceVector{Waveform{Float64}, 1}, num::Int64, emptyWaveform::Waveform{Float64})
    @ rfEnviroSim.EnvTransmissionBuffer c:\Users\user\Documents\2020\Julia\development\rfEnviroSim\src\EnvTransmissionBuffer.jl:45
 [15] adapt_structure(to::CUDA.Adaptor, obj::TransmissionBufferGPU)
    @ rfEnviroSim.EnvTransmissionBuffer C:\Users\user\.julia\packages\Adapt\LAQOx\src\macro.jl:11
 [16] adapt
    @ C:\Users\user\.julia\packages\Adapt\LAQOx\src\Adapt.jl:40 [inlined]
 [17] Fix1
    @ .\operators.jl:1096 [inlined]
 [18] map
    @ .\tuple.jl:221 [inlined]
 [19] adapt_structure
    @ C:\Users\user\.julia\packages\Adapt\LAQOx\src\base.jl:3 [inlined]
 [20] adapt
    @ C:\Users\user\.julia\packages\Adapt\LAQOx\src\Adapt.jl:40 [inlined]
 [21] adapt_structure
    @ C:\Users\user\.julia\packages\Adapt\LAQOx\src\base.jl:18 [inlined]
 [22] adapt
    @ C:\Users\user\.julia\packages\Adapt\LAQOx\src\Adapt.jl:40 [inlined]
 [23] adapt_structure
    @ C:\Users\user\.julia\packages\Adapt\LAQOx\src\base.jl:30 [inlined]
 [24] adapt
    @ C:\Users\user\.julia\packages\Adapt\LAQOx\src\Adapt.jl:40 [inlined]
 [25] cudaconvert
    @ C:\Users\user\.julia\packages\CUDA\DfvRa\src\compiler\execution.jl:152 [inlined]
 [26] map
    @ .\tuple.jl:223 [inlined]
 [27] map
    @ .\tuple.jl:224 [inlined]
 [28] macro expansion
    @ C:\Users\user\.julia\packages\CUDA\DfvRa\src\compiler\execution.jl:100 [inlined]
 [29] #launch_heuristic#248
    @ C:\Users\user\.julia\packages\CUDA\DfvRa\src\gpuarrays.jl:17 [inlined]
 [30] _copyto!
    @ C:\Users\user\.julia\packages\GPUArrays\fqD8z\src\host\broadcast.jl:63 [inlined]
 [31] copyto!
    @ C:\Users\user\.julia\packages\GPUArrays\fqD8z\src\host\broadcast.jl:46 [inlined]
 [32] copy
    @ C:\Users\user\.julia\packages\GPUArrays\fqD8z\src\host\broadcast.jl:37 [inlined]
 [33] materialize
    @ .\broadcast.jl:860 [inlined]
 [34] getSignal(obj::TransmissionBufferGPU, s_time::CuArray{Float64, 1, CUDA.Mem.DeviceBuffer})
    @ rfEnviroSim.EnvTransmissionBuffer c:\Users\user\Documents\2020\Julia\development\rfEnviroSim\src\EnvTransmissionBuffer.jl:267
 [35] top-level scope
    @ c:\Users\user\Documents\2020\Julia\development\rfEnviroSim\test\runtest_EnvTransmissionBuffer.jl:49

The source looked like this:

struct TransmissionBufferGPU
	buffer::CuArray{Waveform{Float64}}  # Must be Organized in order of finish-to-start
	num::Int
	emptyWaveform::Waveform{Float64}  # Need this because the GPU can't create a blank
end
Adapt.@adapt_structure TransmissionBufferGPU

function TransmissionBufferGPU(TB::TransmissionBuffer{Float64}) 
	TransmissionBufferGPU(CuArray{Waveform{Float64}}(TB.buffer[TB.finishToStart]), TB.num,emptyWaveform(Float64))
end

function returnWaveformAt(tb::TransmissionBufferGPU,time::Float64)::Waveform{Float64} 
	if tb.num>0
		for wf in tb.buffer  # GPU Transmission Buffer is static, so in the correct order
			if (time >= wf.s_startWaveform)
				return wf
			end
		end
	end
	# Found nothing, return an empty waveform that produces no signals
	return tb.emptyWaveform 
end

function getSignal(obj::TransmissionBufferGPU,s_time::CuArray{Float64})
	function kernel(t::Float64)
		wf = returnWaveformAt(obj, t)
		getSignal(wf, t, 0.0)
	end
	kernel.(s_time)
end


# And to call it... (I'm leaving out the creation of the tb buffer.)
tb_gpu = cu(TransmissionBufferGPU(tb))
Fs = 100e6
Rref = 50.0
stime = CuArray(collect(0.19999:1e-8:(0.19999+1e-8*2047)))
s_gpu = getSignal(tb_gpu, stime)  # <<<< This line generates the stack trace error above
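
For what it's worth, frames [7] through [15] of that trace show Adapt rebuilding TransmissionBufferGPU with an already-converted CuDeviceVector, which the concretely typed buffer::CuArray{Waveform{Float64}} field then tries to convert back into a host CuArray. One common pattern (just a sketch, not necessarily the fix that was used here) is to leave the array type as a type parameter, so Adapt.@adapt_structure can rebuild the struct around whichever array type it is handed:

struct TransmissionBufferGPU{A<:AbstractVector}
	buffer::A                         # CuArray on the host, CuDeviceVector inside kernels
	num::Int
	emptyWaveform::Waveform{Float64}  # Need this because the GPU can't create a blank
end
Adapt.@adapt_structure TransmissionBufferGPU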

I just realized, I hadn’t tried a simpler test.

wf = tb.buffer[1]
s_gpu = map(x -> getSignal.(wf, x), stime)

This gives me an error which I need to run down:

ERROR: GPU broadcast resulted in non-concrete element type Any.
This probably means that the function you are broadcasting contains an error or type instability.
Stacktrace:
 [1] error(s::String)
   @ Base .\error.jl:35
 [2] copy
   @ C:\Users\user\.julia\packages\GPUArrays\fqD8z\src\host\broadcast.jl:34 [inlined]
 [3] materialize(bc::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Nothing, var"#5#6", Tuple{CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}})
   @ Base.Broadcast .\broadcast.jl:860
 [4] map(::Function, ::CuArray{Float64, 1, CUDA.Mem.DeviceBuffer})
   @ GPUArrays C:\Users\user\.julia\packages\GPUArrays\fqD8z\src\host\broadcast.jl:82
 [5] top-level scope
   @ REPL[13]:1

Even after fixing the smaller example, the database array doesn’t work.
I think it has to do with how the kernel that gets broadcast is built; I don’t know how to do this correctly. The wfv is a database and is not the same size as the parallel values it is evaluated against. How do I get the database onto the GPU for use in the kernel?

function getSignal(wfv::CuArray{Waveform},s_time::CuArray{Float64},s_d::CuArray{Complex{Float64}})
	
	function kernel(t::Float64)
		index = returnWaveformIndexAt(wfv, t)
		if index>0
			getSignal(wfv[index],t,0.0)
		else
			Complex(0.0)
		end
	end
	@sync s_d .= kernel.(s_time)
end

function returnWaveformIndexAt(wfv::CuArray{Waveform},time::Float64)::Int64
	if length(wfv)>0
		for ii in 1:lastindex(wfv)  # GPU Transmission Buffer is static, so in the correct order
			if (time >= wfv[ii].s_startWaveform)
				return ii
			end
		end
	end
	# Found nothing, return an empty waveform that produces no signals
	return 0
end

Then I get this error:

julia> getSignal(wfv,stime,s_d)
ERROR: InvalidIRError: compiling kernel #broadcast_kernel#17(CUDA.CuKernelContext, CuDeviceVector{ComplexF64, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, rfEnviroSim.EnvTransmissionBuffer.var"#kernel#26"{CuDeviceVector{Waveform, 1}}, Tuple{Base.Broadcast.Extruded{CuDeviceVector{Float64, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to returnWaveformIndexAt)

Ok. I think I accidentally solved it. Removing the type constraint gets it to run.
So… I guess I need to figure out what the correct constraint is.
Not sure what happened.

function returnWaveformIndexAt(wfv,time::Float64)::Int64
	if length(wfv)>0
		for ii in 1:lastindex(wfv)  # GPU Transmission Buffer is static, so in the correct order
			if (time >= wfv[ii].s_startWaveform)
				return ii
			end
		end
	end
	# Found nothing, return an empty waveform that produces no signals
	return 0
end

I believe the issue with the type constraint is that kernel calls returnWaveformIndexAt(::CuDeviceArray, ::Float64) and not returnWaveformIndexAt(::CuArray, ::Float64). The GPU compile error isn’t great, but if it were able to report it properly, CUDA.jl would essentially be throwing a MethodError here.
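
In other words, something like the following sketch keeps a meaningful constraint while still matching the device-side type, by dispatching on a common supertype (AbstractVector covers both CuArray and CuDeviceVector):

# Inside a kernel the captured buffer has been converted to a CuDeviceVector,
# so don't constrain the argument to the host-side CuArray type.
function returnWaveformIndexAt(wfv::AbstractVector{<:Waveform}, time::Float64)::Int64
	for ii in 1:lastindex(wfv)  # GPU Transmission Buffer is static, so in the correct order
		if time >= wfv[ii].s_startWaveform
			return ii
		end
	end
	# Found nothing; index 0 tells the caller to use the empty waveform
	return 0
end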
