How does a kernel function in KernelAbstractions.jl work when the backend is a CPU?

I’m trying to add GPU-based parallelization to my code. I found KernelAbstractions.jl and liked its backend (vendor) agnostic approach, especially the fall-back support for CPU multi-threading.

However, I didn’t find detailed documentation on how thread scheduling is configured for CPU-based parallel computation. Is it dynamic or static? How efficient is it compared to the Julia-native approaches, such as the @threads macro or @spawn + @sync?

Also, how many CPU threads does a kernel function use by default? Is it the same as the number of threads enabled in the Julia process, meaning if I set the environment variable JULIA_NUM_THREADS to 1, will it only use one thread? Or does it always spawn tasks on a number of threads equal to the number of physical cores on the CPU?

I would really appreciate it if someone could point me to any tutorial-style material for KernelAbstractions.jl other than the official documentation. Thanks!!


With the caveat that I haven’t used or contributed to KernelAbstractions.jl myself, it looks like the implementation for CPU is here: KernelAbstractions.jl/src/cpu.jl at main · JuliaGPU/KernelAbstractions.jl · GitHub

As you can see, it spawns Threads.nthreads() tasks and partitions the iteration space evenly between them. If the static_threads parameter is true, the tasks are spawned using Threads.@threads :static such that they are prevented from migrating between threads; otherwise, @sync and Threads.@spawn are used.
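The scheme described above can be sketched in plain Julia. This is a hedged reimplementation for illustration, not the actual KernelAbstractions.jl source; the function name `parallel_apply!` and its `static` keyword are made up to mirror the `static_threads` behavior described:

```julia
using Base.Threads

# Sketch: partition an iteration space of N items evenly across
# nthreads() tasks, following the scheme described above.
function parallel_apply!(f, A; static::Bool = false)
    N = length(A)
    # Split 1:N into contiguous chunks of nearly equal size,
    # at most one chunk per thread.
    chunks = collect(Iterators.partition(1:N, cld(N, nthreads())))
    if static
        # Like static_threads = true: tasks are pinned to threads
        # and cannot migrate.
        Threads.@threads :static for chunk in chunks
            for i in chunk
                A[i] = f(A[i])
            end
        end
    else
        # Dynamic path: one task per chunk, joined by @sync.
        @sync for chunk in chunks
            Threads.@spawn for i in chunk
                A[i] = f(A[i])
            end
        end
    end
    return A
end

A = collect(1.0:8.0)
parallel_apply!(x -> 2x, A)   # doubles every element of A in parallel
```

Either path does the same work; the difference is only whether tasks may migrate between threads while running.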

This is almost exactly how Threads.@threads itself works, with or without the :static scheduling option. It’s just reimplemented here in a way that fits the internal abstractions and interfaces of KernelAbstractions.jl. Crucially, it only uses the threads that Julia was started with; there’s no dark magic to create its own thread pool or anything like that. So yes, with JULIA_NUM_THREADS=1 it will run on a single thread.
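For completeness, here is a minimal CPU-backend kernel, written against my understanding of the documented KernelAbstractions.jl API (untested by me, so treat it as a sketch):

```julia
using KernelAbstractions

# @index(Global) gives this work-item's position in the ndrange.
@kernel function scale!(A, s)
    i = @index(Global)
    A[i] *= s
end

backend = CPU()                        # uses only the threads Julia was started with
A = collect(1.0:8.0)
kernel! = scale!(backend)              # instantiate the kernel for this backend
kernel!(A, 2.0; ndrange = length(A))   # launch one work-item per element
KernelAbstractions.synchronize(backend)
```

Running the same kernel on a GPU should only require swapping `CPU()` for the vendor backend (e.g. `CUDABackend()`) and passing a device array, which is the backend-agnostic appeal mentioned above.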

As for overhead, it looks like KernelAbstractions.jl will have the same overhead as Threads.@threads when static_threads is true and as @sync + Threads.@spawn otherwise. In the past, I’ve measured Threads.@threads to have slightly less overhead than the equivalent @sync + Threads.@spawn, which I think is because the @sync mechanism allocates a Channel to hold all the tasks. However, the difference shouldn’t be significant assuming your computation is large enough to warrant multithreading in the first place.
