Hi people,
I’m excited to announce the release of my new package, MoYe.jl, which implements layout algebra for GPU kernel programming.
Why this package?
Index bookkeeping on the GPU can be challenging. MoYe.jl abstracts it away by letting a Layout do the tedious job for you.
Key Concepts
Layout
At the core is the Layout struct. Mathematically, a Layout represents a function that maps logical coordinates to a one-dimensional physical index space. It consists of a Shape and a Stride: the Shape defines the domain, and the Stride defines the mapping via an inner product with the coordinates. Note that both the shape and the stride can be hierarchical (nested). Here is an example:
julia> @Layout (2, (2,2)) (1, (4,2))
(static(2), (static(2), static(2))):(static(1), (static(4), static(2)))
julia> print_layout(ans)
(static(2), (static(2), static(2))):(static(1), (static(4), static(2)))
      1   2   3   4
    +---+---+---+---+
 1  | 1 | 5 | 3 | 7 |
    +---+---+---+---+
 2  | 2 | 6 | 4 | 8 |
    +---+---+---+---+
This example shows that when we access the array using the one-dimensional coordinates 1, 2, …, 8, the memory indices actually visited are 1, 2, 5, 6, 3, 4, 7, 8.
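To make the inner-product idea concrete, here is a minimal sketch in plain Julia that reproduces the index sequence above without using MoYe.jl. It is not the package's implementation: `flatten_nested` and `layout_index` are hypothetical helper names, and the sketch assumes that linear coordinates are decomposed in column-major (leftmost-fastest) order, which is what the printed table suggests.

```julia
# Plain-Julia sketch of a hierarchical layout function (not MoYe.jl's code).
# `flatten_nested` and `layout_index` are hypothetical helper names.

# Flatten a nested tuple such as (2, (2, 2)) into (2, 2, 2).
flatten_nested(x::Integer) = (x,)
flatten_nested(t::Tuple) =
    isempty(t) ? () : (flatten_nested(first(t))..., flatten_nested(Base.tail(t))...)

# Map a 1-based linear coordinate to a 1-based memory index.
function layout_index(shape, stride, i::Integer)
    dims = flatten_nested(shape)
    strides = flatten_nested(stride)
    # Decompose the linear coordinate (column-major) into 0-based sub-coordinates.
    coords = Tuple(CartesianIndices(dims)[i]) .- 1
    # Inner product of sub-coordinates and strides, shifted back to 1-based.
    return 1 + sum(coords .* strides)
end

indices = [layout_index((2, (2, 2)), (1, (4, 2)), i) for i in 1:8]
println(indices)  # [1, 2, 5, 6, 3, 4, 7, 8]
```

Flattening works here because the inner product does not care about nesting; the hierarchy only matters for how coordinates are grouped into rows and columns.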
Tiling
Two primary macros are used: @tile and @parallelize.
@tile splits an array into blocks and then selects one of those blocks by its coordinate:
julia> a = MoYeArray(pointer([i for i in 1:48]), @Layout((6,8)))
6×8 MoYeArray{Int64, 2, ViewEngine{Int64, Ptr{Int64}}, Layout{2, Tuple{Static.StaticInt{6}, Static.StaticInt{8}}, Tuple{Static.StaticInt{1}, Static.StaticInt{6}}}}:
1 7 13 19 25 31 37 43
2 8 14 20 26 32 38 44
3 9 15 21 27 33 39 45
4 10 16 22 28 34 40 46
5 11 17 23 29 35 41 47
6 12 18 24 30 36 42 48
julia> @tile a (static(3), static(2)) (1, 1)
3×2 MoYeArray{Int64, 2, ViewEngine{Int64, Ptr{Int64}}, Layout{2, Tuple{Static.StaticInt{3}, Static.StaticInt{2}}, Tuple{Static.StaticInt{1}, Static.StaticInt{6}}}}:
1 7
2 8
3 9
julia> @tile a (static(3), static(2)) (1, 2)
3×2 MoYeArray{Int64, 2, ViewEngine{Int64, Ptr{Int64}}, Layout{2, Tuple{Static.StaticInt{3}, Static.StaticInt{2}}, Tuple{Static.StaticInt{1}, Static.StaticInt{6}}}}:
13 19
14 20
15 21
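For intuition, the same blocked tiling can be sketched on an ordinary Julia Array with views. This only illustrates the indexing and is not how @tile is implemented; `tile_view` is a hypothetical helper name.

```julia
# Plain-Julia sketch of blocked tiling on an ordinary Array (not MoYe.jl's code).
# `tile_view` is a hypothetical helper name.
function tile_view(A::AbstractMatrix, tile::NTuple{2,Int}, coord::NTuple{2,Int})
    # Block `coord` covers a contiguous range of rows and of columns.
    rows = (coord[1] - 1) * tile[1] .+ (1:tile[1])
    cols = (coord[2] - 1) * tile[2] .+ (1:tile[2])
    return view(A, rows, cols)  # a view, so writes go through to A
end

A = reshape(collect(1:48), 6, 8)
t11 = tile_view(A, (3, 2), (1, 1))
t12 = tile_view(A, (3, 2), (1, 2))
println(t11 == [1 7; 2 8; 3 9])        # true
println(t12 == [13 19; 14 20; 15 21])  # true
```

Each tile is a contiguous block, so the tile coordinate only shifts the starting row and column.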
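@parallelize distributes the elements of an array across multiple threads so that they can be processed in parallel. Given a thread layout and a thread coordinate, it returns the elements assigned to that thread: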
julia> threadtile1 = @parallelize a (static(3), static(2)) (1, 1) # 6 threads with layout 3 x 2
2×4 MoYeArray{Int64, 2, ViewEngine{Int64, Ptr{Int64}}, Layout{2, Tuple{Static.StaticInt{2}, Static.StaticInt{4}}, Tuple{Static.StaticInt{3}, Static.StaticInt{12}}}}:
1 13 25 37
4 16 28 40
Once the set of elements managed by the first thread is obtained, we can perform computations on them as if they were a regular array:
julia> for i in eachindex(threadtile1)
           threadtile1[i] = -threadtile1[i]
       end
julia> a
6×8 MoYeArray{Int64, 2, ViewEngine{Int64, Ptr{Int64}}, Layout{2, Tuple{Static.StaticInt{6}, Static.StaticInt{8}}, Tuple{Static.StaticInt{1}, Static.StaticInt{6}}}}:
-1 7 -13 19 -25 31 -37 43
2 8 14 20 26 32 38 44
3 9 15 21 27 33 39 45
-4 10 -16 22 -28 34 -40 46
5 11 17 23 29 35 41 47
6 12 18 24 30 36 42 48
Indeed, there is no need to consider the mapping from local index to global index during computation!
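The partitioning in this example can be mimicked on an ordinary Julia Array with strided views. This is a sketch under assumptions, not the package's implementation: `thread_view` is a hypothetical helper name, and it assumes each thread takes every 3rd row and every 2nd column starting at its own coordinate, which matches the output shown above for thread (1, 1).

```julia
# Plain-Julia sketch of an interleaved thread partition (not MoYe.jl's code).
# `thread_view` is a hypothetical helper name.
function thread_view(A::AbstractMatrix, threads::NTuple{2,Int}, coord::NTuple{2,Int})
    # Thread `coord` starts at its own position and steps by the thread-grid size.
    rows = coord[1]:threads[1]:size(A, 1)
    cols = coord[2]:threads[2]:size(A, 2)
    return view(A, rows, cols)
end

A = reshape(collect(1:48), 6, 8)
t11 = thread_view(A, (3, 2), (1, 1))
println(t11 == [1 13 25 37; 4 16 28 40])  # true

# Work on the local view; the writes land in the right places in A.
t11 .= .-t11
println(A[1, 1:4])  # [-1, 7, -13, 19]
```

Unlike a tile, a thread's elements are spread over the whole array, yet the code operating on the view never has to compute a global index.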
For more information on applying this paradigm in GPU programming, please refer to the documentation.
We welcome contributions and suggestions from the Julia community.