How to properly use Tullio + CUDA on a package extension?

Hi!

I have a package that defines many functions which are the evaluation of complex expressions over a multidimensional grid. Sometimes I want the expressions to be allocating, and sometimes not, so there are a lot of tiny variations. I found that Tullio.jl is very useful to evaluate those expressions, since it provides an easy syntax and automatically handles parallelization. Here is an example:

function _free_propagation!(ψ₀,xs,ys,zs,qxs,qys,scaling,k)
    @tullio direct_phases[i,j] := k * ( xs[j]^2 + ys[i]^2 ) / 2
    @tullio ψ[i,j,l] := ψ₀[i,j] * cis( direct_phases[i,j] * ( 1 - scaling[l] ) / zs[l] ) / scaling[l]

    fft!(ψ,(1,2))

    @tullio reciprocal_phases[i,j] := - ( qxs[j]^2 + qys[i]^2 ) / 2k
    @tullio ψ[i,j,l] *= cis( reciprocal_phases[i,j] * zs[l] / scaling[l] )

    ifft!(ψ,(1,2))
    @tullio ψ[i,j,l] *= cis( - direct_phases[i,j] * ( 1 - scaling[l] ) * scaling[l] / zs[l])
end

Tullio also allows me to perform these computations on GPU, which is great! The thing is that I want to put this extra functionality on a package extension, in order for it not to burden the loading time of someone who doesn’t have a GPU. I was able to make it work, but I found a few drawbacks:

  • Tullio needs KernelAbstractions and CUDAKernels to perform calculations on the GPU. I could only make things work if those packages are direct dependencies of the original package.

  • I need to repeat all @tullio calls on the extension files. For example, in order to extend _free_propagation! I need to completely rewrite it:

function StructuredLight._free_propagation!(ψ₀::CuArray,xs::CuArray,ys::CuArray,zs::CuArray,qxs::CuArray,qys::CuArray,scaling::CuArray,k)
    @tullio direct_phases[i,j] := k * ( xs[j]^2 + ys[i]^2 ) / 2
    @tullio ψ[i,j,l] := ψ₀[i,j] * cis( direct_phases[i,j] * ( 1 - scaling[l] ) / zs[l] ) / scaling[l]

    fft!(ψ,(1,2))

    @tullio reciprocal_phases[i,j] := - ( qxs[j]^2 + qys[i]^2 ) / 2k
    @tullio ψ[i,j,l] *= cis( reciprocal_phases[i,j] * zs[l] / scaling[l] )

    ifft!(ψ,(1,2))
    @tullio ψ[i,j,l] *= cis( - direct_phases[i,j] * ( 1 - scaling[l] ) * scaling[l] / zs[l])
end

The calculations are exactly the same as in the previous case, the only difference is that now the function is defined with explicit CuArray argument types. This is a huge drawback, since, if i want to change the internals of this function, I need to perform the exact same change in two places, which makes to code harder to maintain.

I was wondering if there were any ways to circumvent these two problems. Thanks!

1 Like

The code is exactly the same except the CuArray dispatches?

What is the exact error message if you don’t copy paste the code but instead simply put the extra dependencies?

Isn’t the dispatch mechanism of the functions inside _free_propagation able to handle everything?

Btw: Are you doing free space light propagation? Since you are using the convolutional approach, an angular spectrum would be the same performance price but more accurate than Fresnel.

I haven’t tried but I don’t see a nice way, sadly.

When the macro runs, it checks whether :CuArray (etc.) is defined in the calling scope. If not, it doesn’t write CUDA methods – it can’t, as they want to dispatch on x::CuArray. This is a slightly hacky pre-extensions way of dealing with only some users loading CUDA.jl.

Now that we have extensions, Tullio could probably use one as a more reliable way to detect whether CUDA.jl is loaded. But I don’t think this will quite solve your problem, as this detection will still happen when the macro expansion runs. If someone loads your package without CUDA, @tullio must be expanded before it is usable; if they later load CUDA then it won’t be run again. The macro is still violating what Julia really expects of macros, that they always expand out to the same code, given the same source.

To generate code later, I believe it would need to be done by a generated function, not a macro. This is what I think @turbo does, at macro expansion time it simply stores everything in some struct, which is turned into code at compile time. Debugging what code is generated by such things is much harder, which is why I gave up on trying to write Tullio to work this way.

It’s just like you said, the code is exactly the same except the CuArray dispatches.

If I don’t copy the code, I get the scalar indexing warning, which means that the code that is being executed is the CPU version.

As for your last comment, I do indeed implement an angular spectrum version. The version that I showed is a little bit slower, but has the advantage of allowing to scale the final grid, which can also be useful. Here is a link to the package if you want to take a look: StrucutredLight.jl.

Doesn’t it solve the problem if the users load CUDA and KernelAbstractions before loading StructuredLight then?

Hi,
I face the same issue, I tried to play around with dependencies but I was not able to make Tullio run without Scalar indexing on CuArray excepted when I put CUDA and KernelAbstraction in the dependencies of my package (which is annoying because of the size of CUDA).
Is there any way to re-run the macro expansion somehow when CUDA is loaded?