# [ANN] MoYe.jl: Layout Algebra on GPU

Hi people,

I’m excited to announce the release of my new package, MoYe.jl, which brings layout algebra to GPU kernel programming.

## Why this package?

Index bookkeeping on the GPU is tedious and error-prone. MoYe.jl hides it behind `Layout`, which does that job for you.

## Key Concepts

### Layout

At the core is the `Layout` struct. Mathematically, a `Layout` represents a function that maps logical coordinates to a one-dimensional physical index space. It consists of a `Shape` and a `Stride`: the `Shape` defines the domain, and the `Stride` defines the mapping as an inner product with the coordinate. Note that both shape and stride can be hierarchical. Here is an example:

```julia
julia> @Layout (2, (2,2)) (1, (4,2))
(static(2), (static(2), static(2))):(static(1), (static(4), static(2)))

julia> print_layout(ans)
(static(2), (static(2), static(2))):(static(1), (static(4), static(2)))
      1   2   3   4
    +---+---+---+---+
 1  | 1 | 5 | 3 | 7 |
    +---+---+---+---+
 2  | 2 | 6 | 4 | 8 |
    +---+---+---+---+
```

This example shows that when we access the array with the one-dimensional coordinates 1, 2, …, 8, the memory indices actually visited are 1, 2, 5, 6, 3, 4, 7, 8.
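To make the mapping concrete, here is a minimal plain-Julia sketch that evaluates this layout by hand. It does not use MoYe.jl at all: `layout_index` is a hypothetical helper written only for illustration, and the hierarchical shape and stride are written in flattened form, `(2, 2, 2)` and `(1, 4, 2)`, on the assumption that flattening does not change the result for a one-dimensional coordinate.

```julia
# Hypothetical helper, not part of MoYe.jl: evaluate a flattened layout by hand.
# The 1-based linear coordinate is decomposed column-major over `shape`,
# then combined with `stride` via an inner product.
function layout_index(shape::NTuple{N,Int}, stride::NTuple{N,Int}, i::Int) where {N}
    rest = i - 1                      # work with 0-based coordinates
    offset = 0
    for (s, d) in zip(shape, stride)
        offset += (rest % s) * d      # coordinate in this mode times its stride
        rest ÷= s                     # move on to the next mode
    end
    return offset + 1                 # back to a 1-based index
end

flat_shape  = (2, 2, 2)               # (2, (2, 2)) flattened
flat_stride = (1, 4, 2)               # (1, (4, 2)) flattened

indices = [layout_index(flat_shape, flat_stride, i) for i in 1:8]
println(indices)                      # [1, 2, 5, 6, 3, 4, 7, 8]
```

The loop is just the inner product described above, applied one mode at a time.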

### Tiling

Tiling is done with two macros, `@tile` and `@parallelize`.

`@tile` splits an array into blocks and selects one block by its coordinate:

```julia
julia> a = MoYeArray(pointer([i for i in 1:48]), @Layout((6,8)))
6×8 MoYeArray{Int64, 2, ViewEngine{Int64, Ptr{Int64}}, Layout{2, Tuple{Static.StaticInt{6}, Static.StaticInt{8}}, Tuple{Static.StaticInt{1}, Static.StaticInt{6}}}}:
 1   7  13  19  25  31  37  43
 2   8  14  20  26  32  38  44
 3   9  15  21  27  33  39  45
 4  10  16  22  28  34  40  46
 5  11  17  23  29  35  41  47
 6  12  18  24  30  36  42  48

julia> @tile a (static(3), static(2)) (1, 1)
3×2 MoYeArray{Int64, 2, ViewEngine{Int64, Ptr{Int64}}, Layout{2, Tuple{Static.StaticInt{3}, Static.StaticInt{2}}, Tuple{Static.StaticInt{1}, Static.StaticInt{6}}}}:
 1  7
 2  8
 3  9

julia> @tile a (static(3), static(2)) (1, 2)
3×2 MoYeArray{Int64, 2, ViewEngine{Int64, Ptr{Int64}}, Layout{2, Tuple{Static.StaticInt{3}, Static.StaticInt{2}}, Tuple{Static.StaticInt{1}, Static.StaticInt{6}}}}:
 13  19
 14  20
 15  21
```
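As a rough mental model, the two tiles above are the same elements you would get from contiguous views of an ordinary `Matrix`. The sketch below is plain Julia, not MoYe.jl; `tile_view` is a hypothetical helper, and it only mirrors the results shown above rather than how `@tile` is implemented.

```julia
# Plain-Julia analogy on an ordinary Matrix; `tile_view` is a hypothetical helper.
A = reshape(collect(1:48), 6, 8)      # same 6×8 column-major data as `a`

function tile_view(A::AbstractMatrix, tilesize::NTuple{2,Int}, coord::NTuple{2,Int})
    tm, tn = tilesize
    i, j = coord
    rows = ((i - 1) * tm + 1):(i * tm)   # contiguous block of rows
    cols = ((j - 1) * tn + 1):(j * tn)   # contiguous block of columns
    return view(A, rows, cols)           # a view, so writes go back to A
end

t11 = tile_view(A, (3, 2), (1, 1))    # [1 7; 2 8; 3 9]
t12 = tile_view(A, (3, 2), (1, 2))    # [13 19; 14 20; 15 21]
```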

`@parallelize` distributes the elements of an array over a group of threads, so that they can be processed in parallel, and returns the elements owned by the thread at the given coordinate:

```julia
julia> threadtile1 = @parallelize a (static(3), static(2)) (1, 1) # 6 threads with layout 3 x 2
2×4 MoYeArray{Int64, 2, ViewEngine{Int64, Ptr{Int64}}, Layout{2, Tuple{Static.StaticInt{2}, Static.StaticInt{4}}, Tuple{Static.StaticInt{3}, Static.StaticInt{12}}}}:
 1  13  25  37
 4  16  28  40
```
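The result for thread (1, 1) matches a strided view of an ordinary `Matrix`: every third row and every second column, starting at the thread's own coordinate. The sketch below is plain Julia with a hypothetical helper, `thread_partition`; that other threads follow the same pattern is my assumption, inferred from the single output shown above.

```julia
# Plain-Julia analogy; `thread_partition` is a hypothetical helper, not MoYe.jl API.
A = reshape(collect(1:48), 6, 8)      # same 6×8 column-major data as `a`

function thread_partition(A::AbstractMatrix, threads::NTuple{2,Int}, coord::NTuple{2,Int})
    tm, tn = threads                  # thread layout, here 3 x 2
    ti, tj = coord                    # coordinate of this thread
    # Start at the thread's coordinate and step by the thread layout size.
    return view(A, ti:tm:size(A, 1), tj:tn:size(A, 2))
end

p11 = thread_partition(A, (3, 2), (1, 1))   # [1 13 25 37; 4 16 28 40]
strides(p11)                                # (3, 12), as in the Layout printed above
```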

Once the set of elements managed by the first thread is obtained, we can perform computations on them as if they were a regular array:

```julia
julia> for i in eachindex(threadtile1)
           threadtile1[i] = -threadtile1[i]
       end

julia> a
6×8 MoYeArray{Int64, 2, ViewEngine{Int64, Ptr{Int64}}, Layout{2, Tuple{Static.StaticInt{6}, Static.StaticInt{8}}, Tuple{Static.StaticInt{1}, Static.StaticInt{6}}}}:
 -1   7  -13  19  -25  31  -37  43
  2   8   14  20   26  32   38  44
  3   9   15  21   27  33   39  45
 -4  10  -16  22  -28  34  -40  46
  5  11   17  23   29  35   41  47
  6  12   18  24   30  36   42  48
```

The computation never has to deal with the mapping from local index to global index!
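To illustrate why local indexing is enough, here is a small plain-Julia simulation (no GPU, no MoYe.jl) in which six "threads" in a 3 x 2 arrangement each loop over their own strided view of a 6×8 matrix and stamp it with their id. The strided partition pattern and the column-major thread numbering are assumptions made for this sketch, not something taken from the package.

```julia
# Serial simulation of 6 threads; the partition pattern is an assumption.
owner = zeros(Int, 6, 8)              # records which thread touched each element

for tj in 1:2, ti in 1:3              # walk over the 3 x 2 thread layout
    tid = ti + 3 * (tj - 1)           # column-major linear thread id (assumed)
    part = view(owner, ti:3:6, tj:2:8)    # elements owned by this thread
    for k in eachindex(part)          # local indices only, no global arithmetic
        part[k] += tid
    end
end

println(owner[1:3, 1:2])              # [1 4; 2 5; 3 6]
```

Because the partitions are disjoint and together cover the whole matrix, every entry of `owner` ends up holding exactly one thread id.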