[ANN] MoYe.jl: Layout Algebra on GPU

Hi people,

I’m excited to announce the release of my new package, MoYe.jl, which brings layout algebra to GPU kernel programming.

Why this package?

Index bookkeeping in GPU kernels is tedious and error-prone. MoYe.jl moves that work into a Layout, which computes the mapping between logical coordinates and memory for you.

Key Concepts

Layout

At the core is the Layout struct. Mathematically, a Layout is a function that maps logical coordinates to a one-dimensional physical index space. It comprises a Shape and a Stride: the Shape defines the domain, and the Stride defines the mapping as an inner product with the coordinate. Note that both the shape and the stride can be hierarchical (nested). Here is an example:

julia> @Layout (2, (2,2)) (1, (4,2))
(static(2), (static(2), static(2))):(static(1), (static(4), static(2)))

julia> print_layout(ans)
(static(2), (static(2), static(2))):(static(1), (static(4), static(2)))
      1   2   3   4
    +---+---+---+---+
 1  | 1 | 5 | 3 | 7 |
    +---+---+---+---+
 2  | 2 | 6 | 4 | 8 |
    +---+---+---+---+

This example shows that when we access the array with the one-dimensional coordinates 1, 2, …, 8, the physical indices visited in memory are 1, 2, 5, 6, 3, 4, 7, 8.
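
To make the inner product concrete, here is a minimal sketch in plain Julia that reproduces the mapping above by hand. It does not use MoYe.jl at all; `flatten` and `layout_index` are hypothetical helpers written only for this illustration, and it assumes the leftmost mode varies fastest, which is what the printed table shows.

```julia
# Hypothetical helpers for illustration only; not part of MoYe.jl.

# Flatten a nested tuple such as (2, (2, 2)) into (2, 2, 2).
flatten(x::Integer) = (x,)
flatten(t::Tuple) = foldl((acc, x) -> (acc..., flatten(x)...), t; init = ())

# Map a 1-based linear coordinate to a 1-based physical index.
function layout_index(shape, stride, i::Integer)
    r = i - 1                        # switch to 0-based arithmetic
    offset = 0
    for (s, d) in zip(flatten(shape), flatten(stride))
        offset += (r % s) * d        # coordinate in this mode times its stride
        r ÷= s                       # move on to the next (slower) mode
    end
    return offset + 1
end

shape  = (2, (2, 2))
stride = (1, (4, 2))
indices = [layout_index(shape, stride, i) for i in 1:8]
println(indices)  # [1, 2, 5, 6, 3, 4, 7, 8]
```

The loop peels off one coordinate per mode and accumulates coordinate times stride, which is the inner product mentioned above.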

Tiling

Two macros do most of the work: @tile and @parallelize.

@tile splits an array into blocks of a given shape and selects one block by its coordinate:

julia> a = MoYeArray(pointer([i for i in 1:48]), @Layout((6,8)))
6×8 MoYeArray{Int64, 2, ViewEngine{Int64, Ptr{Int64}}, Layout{2, Tuple{Static.StaticInt{6}, Static.StaticInt{8}}, Tuple{Static.StaticInt{1}, Static.StaticInt{6}}}}:
 1   7  13  19  25  31  37  43
 2   8  14  20  26  32  38  44
 3   9  15  21  27  33  39  45
 4  10  16  22  28  34  40  46
 5  11  17  23  29  35  41  47
 6  12  18  24  30  36  42  48

julia> @tile a (static(3), static(2)) (1, 1)
3×2 MoYeArray{Int64, 2, ViewEngine{Int64, Ptr{Int64}}, Layout{2, Tuple{Static.StaticInt{3}, Static.StaticInt{2}}, Tuple{Static.StaticInt{1}, Static.StaticInt{6}}}}:
 1  7
 2  8
 3  9

julia> @tile a (static(3), static(2)) (1, 2)
3×2 MoYeArray{Int64, 2, ViewEngine{Int64, Ptr{Int64}}, Layout{2, Tuple{Static.StaticInt{3}, Static.StaticInt{2}}, Tuple{Static.StaticInt{1}, Static.StaticInt{6}}}}:
 13  19
 14  20
 15  21
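
For comparison, the same two blocks can be picked out of an ordinary column-major Matrix with Base views. This is only a sketch of the index arithmetic that tiling hides; `tile_view` is a hypothetical helper, not a MoYe.jl function, and it assumes the tile shape divides the array evenly.

```julia
A = reshape(collect(1:48), 6, 8)

# Hypothetical helper for illustration only; not part of MoYe.jl.
# Returns block (bi, bj) of A when A is cut into th×tw tiles.
function tile_view(A, (th, tw), (bi, bj))
    rows = (bi - 1) * th + 1 : bi * th
    cols = (bj - 1) * tw + 1 : bj * tw
    return view(A, rows, cols)
end

t11 = tile_view(A, (3, 2), (1, 1))   # rows 1:3, columns 1:2
t12 = tile_view(A, (3, 2), (1, 2))   # rows 1:3, columns 3:4
```

Here `t11` and `t12` hold the same elements as the two tiles printed above.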

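@parallelize distributes the elements of an array over a layout of threads and returns the elements owned by the thread at the given coordinate, so that multiple threads can process the array in parallel.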

julia> threadtile1 = @parallelize a (static(3), static(2)) (1, 1) # 6 threads with layout 3 x 2
2×4 MoYeArray{Int64, 2, ViewEngine{Int64, Ptr{Int64}}, Layout{2, Tuple{Static.StaticInt{2}, Static.StaticInt{4}}, Tuple{Static.StaticInt{3}, Static.StaticInt{12}}}}:
 1  13  25  37
 4  16  28  40

Once the set of elements managed by the first thread is obtained, we can perform computations on them as if they were a regular array:

julia> for i in eachindex(threadtile1)
           threadtile1[i] = -threadtile1[i]
       end

julia> a
6×8 MoYeArray{Int64, 2, ViewEngine{Int64, Ptr{Int64}}, Layout{2, Tuple{Static.StaticInt{6}, Static.StaticInt{8}}, Tuple{Static.StaticInt{1}, Static.StaticInt{6}}}}:
 -1   7  -13  19  -25  31  -37  43
  2   8   14  20   26  32   38  44
  3   9   15  21   27  33   39  45
 -4  10  -16  22  -28  34  -40  46
  5  11   17  23   29  35   41  47
  6  12   18  24   30  36   42  48

The writes land in the right places in a, and at no point did we have to work out the mapping from local index to global index ourselves!
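
To show what that mapping looks like when written by hand, here is a rough plain-Julia sketch of the same thread partition using a strided Base view. `thread_view` is a hypothetical helper, not MoYe.jl API, and it assumes the 3 x 2 thread grid simply repeats across the array, which matches the strides (3, 12) in the output above.

```julia
A = reshape(collect(1:48), 6, 8)

# Hypothetical helper for illustration only; not part of MoYe.jl.
# Elements owned by thread (ti, tj) when a tm×tn thread grid repeats over A.
function thread_view(A, (tm, tn), (ti, tj))
    return view(A, ti:tm:size(A, 1), tj:tn:size(A, 2))
end

t = thread_view(A, (3, 2), (1, 1))   # rows 1 and 4, columns 1, 3, 5, 7
t .= .-t                             # negate in place; writes go through to A
```

After this runs, rows 1 and 4 of `A` are negated in every other column, as in the printed array above.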

For more information on applying this paradigm in GPU programming, please refer to the documentation.

We welcome contributions and suggestions from the Julia community.
