AdaptiveCpp integration?

Hi all,

I’ve recently started looking at AdaptiveCpp for some of my C++ work, and I’m starting to get the feeling that it would be an excellent fit for Julia’s GPGPU support.

For those who don’t know, AdaptiveCpp, formerly known as hipSYCL, is a SYCL single-pass compiler and runtime with support for several backends, including OpenMP (for native execution on CPU), CUDA, HIP/ROCm, OpenCL and Level Zero/oneAPI. It even supports building CUDA and HIP code directly. Even though its primary modus operandi is as a single-pass compiler, for kernels it generally produces only an IR that is finalized (and cached) at runtime, depending on the specific backend and execution environment.

Although I don’t have in-depth knowledge of the Julia (or acpp) internals, I was inevitably led to draw a parallel (eh) with Julia’s AOT compilation, and suspect that the two would fit each other rather well. My understanding is that at the moment SYCL support in Julia is mainly via oneAPI.jl, and there are separate packages to interface with CUDA, HSA, and AMD ROCm. The new version of the OpenCL.jl package featuring support for native Julia kernels is also at odds with the main proprietary OpenCL platforms (NVIDIA and AMD) due to their lack of support for SPIR-V —something that acpp works around by transpiling the IR of its kernels into something that can be fed to the respective platforms.

I was wondering: is perchance anybody working on interfacing Julia with AdaptiveCpp? I feel like this could help bring the GPGPU support in Julia towards a more homogeneous interface.

(P.S. I wanted to tag this as sycl but apparently there isn’t a tag for that yet?)

It looks like there’s no Metal backend.

True. There are some relevant issues here:

Interestingly, some of the discussion also mentions JuliaGPU and Metal.jl

Probably because most of what constitutes “SYCL” is the C++ interface + compilers, which wouldn’t make sense on a Julia forum. As you noted, Julia GPU libraries already interface with the same underlying runtime components that SYCL implementations like those listed in AdaptiveCpp/doc/sycl-ecosystem.md at develop · AdaptiveCpp/AdaptiveCpp · GitHub interface with.

Making this distinction lets us clarify your second paragraph:

Not really. Intel uses sycl in the naming of a lot of their runtime APIs/libraries, but if you look at the AdaptiveCpp runtime it will also be wrapping differently-named APIs from AMD, Nvidia, Apple, etc. The Julia ecosystem just happens to split those wrappers between a few packages, so that people can program around an interface like Base’s AbstractArray functions/LinearAlgebra/GitHub - JuliaGPU/GPUArrays.jl: Reusable array functionality for Julia's various GPU backends. and plug in the backend(s) they want later.

The Julia GPU stack already does this via GitHub - JuliaGPU/GPUCompiler.jl: Reusable compiler infrastructure for Julia GPU backends.. That’s what lets one write a single kernel using GitHub - JuliaGPU/KernelAbstractions.jl: Heterogeneous programming in Julia and run it across multiple different GPU vendors.

If you want to understand how GPU programming interfaces are structured in Julia, the presentation I’ve been pulling screenshots from is https://www.youtube.com/watch?v=Q8fj8QbVpZM. The https://juliagpu.org/ site also has links to plenty of good resources.


Hi @ToucheSir and thank you very much for the reply. I had already gone through some of the material before posting (not the presentation though, thank you for sharing it), but the impression I got was that the Julia stack was largely fragmented. Your reply does highlight that behind the scenes there’s a unification process going on (with some growing pains, if I read some of the announcements correctly, but that’s expected); if I’m understanding this correctly, though, this doesn’t reach the “frontend”: the generic GPU packages carry big bold warnings that they are

not intended for end users!

which is … kind of the opposite of what I’m looking for to really go all-in with GPGPU for Julia :smiling_face_with_sunglasses:

From my perspective (which is that of someone who has been developing large-scale GPGPU software for nearly two decades, and has spent the last decade struggling to move away from the CUDA vendor lock-in), the main advantage of SYCL (especially with SYCL 2020 forgoing the OpenCL dependency) isn’t so much the single-source C++ DSL, but its definition of a solid “front-facing”, hardware-independent (if not agnostic) GPU acceleration framework —in SYCL, I don’t have to worry about the details of the backend (unless I want to), and —at least with AdaptiveCpp, which delivers on a promise that IMO oneAPI made but failed to keep— I can write my program once and have it run on any of my hardware without even caring about where it’s running (unless I want to). The fact that a sycl::buffer or sycl::malloc_device will allocate on an NVIDIA or Intel GPU isn’t something I have to care about —but my understanding is that in Julia I currently have to decide whether I want a CuArray from CUDA.jl or a oneArray from oneAPI.jl.
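To make that concrete, here’s a minimal sketch of what I mean (assuming both CUDA.jl and oneAPI.jl are installed):

using CUDA, oneAPI

a = CuArray{Float32}(undef, 1024)   # explicitly an NVIDIA allocation
b = oneArray{Float32}(undef, 1024)  # explicitly an Intel allocation
a .= 0f0                            # the code from here on can be generic,
b .= 0f0                            # but the allocation choice was not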

I understand the need for vendor-specific modules when interfacing with vendor-specific libraries, but what I’m missing is a vendor-agnostic user-facing module. My thought was that interfacing with AdaptiveCpp would provide this, but now I’m guessing that it would be unnecessary, since in Julia much of the needed infrastructure is already in place —what’s missing is just the higher-level vendor-agnostic interface? Or am I still misunderstanding something?

I feel like you’re missing some bits:

  • all the GPU packages share the same compiler infrastructure, GPUCompiler.jl, so there’s lots of collaboration behind the scenes
  • vendor-specific tools are necessary at some level of a vendor-agnostic stack in any case
  • you keep advocating for AdaptiveCpp, which is a project with its own merits, but have you looked at KernelAbstractions.jl (already mentioned above), which is a pure Julia project? What do you find is missing from it (besides the fact it supports Metal GPUs, unlike AdaptiveCpp)?

Perhaps you read some out-of-date documentation?

The unification process has already happened. There is one, singular, GPU array interface. Most of the ongoing work is around code sharing and bringing certain backends up to a higher level of functionality, which is exactly the same type of ongoing work that happens with C++ abstraction layers like AdaptiveCpp.

This is exactly the goal of the abstraction layers and interfaces in the Julia GPU ecosystem. I can write a function like

function f(x::AbstractArray)
  return sum(sin.(x) + cos.(x))
end

which uses only standard library functions and no external dependencies, and then call it with a CPU Array, a Metal MtlArray, a CUDA CuArray, or any other GPU array type. Just like with SYCL, things like callbacks and lambdas work too.
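For example (a quick sketch, assuming CUDA.jl is installed and a working NVIDIA GPU is available):

using CUDA

x = rand(Float32, 1_000)
f(x)            # runs on the CPU
f(CuArray(x))   # the same function, now running on the GPU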

But if by

you meant something lower-level than the above, that’s exactly what Home · KernelAbstractions.jl is for. As a C++ GPGPU person, you may be interested in GitHub - JuliaGPU/AcceleratedKernels.jl: Cross-architecture parallel algorithms for Julia's CPU and GPU backends. Targets multithreaded CPUs, and GPUs via Intel oneAPI, AMD ROCm, Apple Metal, Nvidia CUDA., which uses KernelAbstractions to create an entire library of vendor-agnostic algorithm implementations and templates. If Julia’s standard libraries are not “parallel STL” enough for you on GPU, then libraries like AcceleratedKernels are there to fill in the gaps.


I am reading the most up-to-date documentation I can find, which AFAICS is the latest available. At the very least, the principles of a completely backend-agnostic GPGPU Julia program aren’t as obvious from the documentation as they could be. And even after what I’ve learned from this thread, I still feel like something is missing.

(On the upside, if it’s just a documentation matter, then I’ll be happy to give a hand where I can, and if there is indeed still some aspect missing maybe I can contribute to that too!)

To show what I mean, let me try with an example. Let me state up front that what follows isn’t meant to diss the authors of any of the packages —I can easily guess the titanic efforts that go into their development. But I want to show what someone coming from my background faces when trying to write a vendor-neutral GPU program in Julia.

Let’s say that I want to write a simple Julia program that runs a simple kernel on GPU. The program should be such that

  1. a device is selected automatically if the user doesn’t specify one (possibly a performant one, but this is just a bonus);
  2. the user can override the automatic selection, and should be able to choose any computational device available on the machine and supported by the runtime;
  3. bonus points, the software informs the user about the device that is going to be used.

Aside from the device selection, the program does something very simple:

  1. allocates an array (on GPU);
  2. initializes the array (on GPU);
  3. verifies (on host) that the array is correctly initialized, informing the user in case of a mismatch (showing the first index, expected and computed values).
  4. bonus points, it shows the kernel runtime in case of successful execution.

I have some sample code to do this in SYCL here (the sample-select program allows selection of the platform and device by assigning the corresponding ordinals to environment variables SYCL_PLATFORM and SYCL_DEVICE; the details of how the user specifies the device override are —hopefully— not particularly relevant).

One of the key points I want to highlight about the above-linked SYCL examples is that at no point in my SYCL code is there any mention of any specific backend. Which backends are available, and which devices they expose to the user, is entirely managed by the SYCL runtime (and e.g. if I compile the code against different runtimes —Codeplay’s vs Intel’s vs AdaptiveCpp— I will end up seeing different devices, but that’s a runtime limitation, not a “my code” limitation).

So let’s say that I want to port this to Julia.

I go to juliagpu.org and in the Learn section I find that the solution to vendor-neutral GPU programming is KernelAbstractions.jl —so far so good. The 3-hour workshop video has a section on its use, and from it I learn that at least for memory allocations, “vendor neutrality” is achieved with an if cascade that remaps constants. Not exactly what I’m looking for, but maybe the situation has changed since 2021, when the seminar was recorded.
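For the record, the pattern I’m referring to is roughly the following (my own reconstruction of the idea described there, not code taken from the workshop):

use_cuda = true                     # hypothetical flag, set by the user somehow
if use_cuda
    using CUDA
    const ArrayType = CuArray       # remap the constant to the vendor array type
else
    const ArrayType = Array
end

x = ArrayType(rand(Float32, 1024))  # downstream code only refers to ArrayType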

So I go read the KA QuickStart documentation (I assume this is the latest version) and once again I’m shown different ways to do things for the CPU, CUDA, AMDGPU and oneAPI backends, each using its own vendor-specific device type to allocate memory.
(FWIW, the same goes for AcceleratedKernels.jl: the README shows an example that explicitly depends on one specific backend —Metal).

The first mention I see of a vendor-neutral way to handle memory allocation is the memcopy kernel example for KA, which shows the existence of KernelAbstractions.zeros and KernelAbstractions.ones. To learn about the non-initializing KernelAbstractions.allocate I have to browse the API section.

These allocations depend on a backend selection. So it looks like I can write the kernel, the kernel-calling function and even the allocation management in a vendor-neutral way, once the backend has been selected. Excellent, this is pretty close to what I can do in SYCL, where the device-specificity can be relegated (usually) to the queue construction.
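To illustrate, here’s roughly what the backend-neutral core of the little program I described above could look like, using only the documented KA API (a sketch; init! and run_demo are just names I made up for this example):

using KernelAbstractions

@kernel function init!(a)
    i = @index(Global)
    a[i] = Float32(i)
end

function run_demo(backend)
    n = 1024
    a = KernelAbstractions.allocate(backend, Float32, n)  # uninitialized device array
    t = @elapsed begin
        init!(backend)(a; ndrange = n)
        KernelAbstractions.synchronize(backend)
    end
    host = Array(a)                                        # copy back to host for verification
    for i in 1:n
        if host[i] != Float32(i)
            @error "Mismatch at index $i: expected $(Float32(i)), got $(host[i])"
            return
        end
    end
    @info "Kernel ran correctly in $(t) s"
end

run_demo(CPU())   # the only backend-specific bit; swap in CUDABackend(), ROCBackend(), ...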

Now the backend selection is the only thing missing. In SYCL writing a custom selector, while non-trivial, is rather straightforward and still vendor-neutral: you just iterate over the platforms (i.e. backends) announced by the SYCL runtime.

So now I go to the utils.jl file in the example, which is documented as the one choosing the backend. It’s an if cascade (with only two entries: CUDA is attempted, the fallback is the CPU).

So, unless I’m still missing something, to implement something with the requirements I listed above in JuliaGPU, what’s needed is, at the very least, a higher-level module (or something in KA) that tries to load each of the supported backends (I say try-load because, rather than having all of them as hard dependencies, it’d be more sensible for the user to only install the packages relevant to their hardware —e.g. no Metal.jl except on macOS, no AMDGPU.jl if there are no AMD GPUs, etc.), finds which of the loaded backends actually expose viable devices, and selects one of those by some internal criteria, while also offering the user a way to choose a different backend and device if they so desire.

There’s more after this (for example, I’d appreciate a consistent interface to access device properties regardless of backend), but at the very least, to answer @giordano’s question too, this is what I find is missing. (And truly, if there’s any way I can help, I’d be happy to give it a try.)

To give you a concrete example of use of KernelAbstractions.jl, this Pluto notebook has a section computing a matmul:

It works on 3 different backends: CPU, NVIDIA GPU, Intel GPU (my laptop happens to have all of them). All the code is completely backend-agnostic; the only place where the device used at runtime is selected is the backend variable passed as an argument to construct the output array (backend is set via the dropdown menu before the setting of a).

Thanks for sharing the notebook, @giordano, I think it illustrates very well what I was saying about platform selection: on my machine, oneAPI isn’t installed, and I have an AMD GPU that the notebook can’t leverage without additional changes. Similarly, anyone running your notebook on macOS would be missing the Metal backend. Moreover, since the notebook depends explicitly on CUDA.jl and oneAPI.jl, running it leads to their installation (a rather expensive process) even when the system doesn’t have an NVIDIA or Intel GPU.

Of course your notebook is just an example, but these same downsides affect any JuliaGPU “user-facing” package or program where the backend needs to be known/selected. A sampling of the Showcases on the juliagpu.org site confirms this: they all have an explicit dependency on a specific backend (typically just one: CUDA).

Coming from the OpenCL/SYCL world, this is definitely one place where the JuliaGPU ecosystem is lacking (in addition to a still-somewhat-inconsistent interface to querying backend and device properties).

I’ve tried writing some sample code that implements the idea I mentioned above, as a possible way to fill this gap:

module GPUSelector

import Pkg

const known = [:Metal, :CUDA, :AMDGPU, :oneAPI, :OpenCL]

# Try to load each known backend package that is installed, returning the list of those successfully loaded.
function try_load_avail_backends()

    loaded = Vector{Symbol}()
    for sym in known
        if sym == :Metal && !Sys.isapple()
            continue
        end
        try
            pkg = Base.identify_package(String(sym))
            if isnothing(pkg)
                @debug "$(sym) is not installed"
                continue
            end
            @eval using $sym
            push!(loaded, sym)
        catch e
            @warn "Error enabling $(sym): $(e)"
            #Base.showerror(stdout, e)
        end
    end
    return loaded
end

# Of the loaded backends, keep those that report themselves as functional; CPU is always appended as a fallback.
function enumerate_avail_backends()
    avail = Vector{Symbol}()
    for sym in loaded
        if sym == :OpenCL
            # the OpenCL module doesn't have functional(),
            # assume good
            @debug "Assuming $(sym) functional"
            push!(avail, sym)
            continue
        end
        mod = getfield(GPUSelector, sym)
        if mod.functional()
            push!(avail, sym)
        else
            @debug "$(sym) not functional"
            continue
        end
    end
    push!(avail, :CPU)
    return avail
end

const loaded = try_load_avail_backends()
const avail = enumerate_avail_backends()

get_available_backends() = avail

# This is needed because the backend for each module
# isn't available simply as module.backend()
function get_backend(sym::Symbol)
    if sym == :CPU
        @eval import KernelAbstractions
        return KernelAbstractions.CPU()
    elseif !(sym in known)
        throw(ArgumentError("Unknown backend $(sym)"))
    elseif !(sym in avail)
        return missing
    elseif sym == :CUDA
        return CUDABackend()
    elseif sym == :AMDGPU
        return ROCBackend()
    elseif sym == :oneAPI
        return oneAPIBackend()
    elseif sym == :OpenCL
        return OpenCLBackend()
    elseif sym == :Metal
        return MetalBackend()
    else
        error("Somebody forgot to update this with the backend class for $(sym)")
    end
end

select_backend() = get_backend(first(avail))
select_backend(::Nothing) = select_backend()
select_backend(sym::Symbol) = get_backend(sym)
select_backend(f::Function) = select_backend(f())

end

function main()
    @show GPUSelector.get_available_backends()
    backend = GPUSelector.select_backend() do
        if isempty(ARGS)
            nothing
        else
            backend_str = lowercase(first(ARGS))
            avail = GPUSelector.get_available_backends()
            lc_avail = Dict(lowercase(String(s)) => s for s in avail)
            if :AMDGPU in avail
                lc_avail["amd"] = :AMDGPU
            end
            if :CUDA in avail
                lc_avail["nvidia"] = :CUDA
                lc_avail["nv"] = :CUDA
            end
            found = get(lc_avail, backend_str, nothing)
            if isnothing(found)
                @warn "Unknown backend $(backend_str), using fallback"
            end
            found
        end
    end
    @show backend
end

if abspath(PROGRAM_FILE) == @__FILE__
    using .GPUSelector
    main()
end

The developer-facing functions would just be select_backend and get_available_backends (the latter mostly to be used in the function form of select_backend).
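For example, the intended usage (together with KernelAbstractions) would be along these lines:

import KernelAbstractions

backend = GPUSelector.select_backend()           # automatic pick among the available backends
# backend = GPUSelector.select_backend(:AMDGPU)  # or an explicit override
a = KernelAbstractions.zeros(backend, Float32, 1024)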

FWIW, I’d be more than happy to contribute this code to the JuliaGPU community (after some cleanup and review from more competent people 8-D; suggestions welcome), either as part of an existing module (maybe KernelAbstractions.jl itself?) or as its own standalone module.

(BTW, I think this sample also shows what I meant by “still-somewhat-inconsistent interface”: ideally, there should be no need for the get_backend function, because the backend exposed by each module would be available under module.backend() or something like that. Likewise, I had to “except” OpenCL on the functional() call because the version of the module installed on my system (0.10.2) does not have it. Not shown here: I’ve also had problems getting e.g. the available devices and their names from different backend modules, although that might have been just a bug.)
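Just to sketch what I mean: if every backend package exposed, say, a backend() function (purely hypothetical: no such function exists today), the whole if/elseif cascade in get_backend above could collapse to something like

# hypothetical: assumes each backend module defines a backend() constructor, which none of them do today
get_backend(sym::Symbol) =
    sym in avail ? getfield(GPUSelector, sym).backend() : missing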

Anyway, this has veered a bit off topic. I still think that a SYCL backend independent of Intel’s oneAPI would be NiceToHave™, and the AdaptiveCpp runtime would be an excellent choice if it’s ever developed, but no longer as a requirement for vendor-neutral GPU coding in Julia.
