CUDA AES implementation help: Parallelizing execution on Array{UInt8, 16} type inputs

I’m trying to do AES on the GPU. My latest attempt looks something like this:

# AES has a similar function signature to this
# (note: Array{UInt8, 16} actually means a 16-dimensional array in Julia;
# a 16-element Vector{UInt8} is what is meant here)
function blockAdd(b::Vector{UInt8})::Vector{UInt8}
    a = zeros(UInt8, 16)
    for i in 1:16
        a[(i % 16) + 1] = b[i]*2 + 1
    end
    return a
end


function AESKernel!(ctextOut::CuArray, ctext::CuArray)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    start = 16*(i - 1) + 1
    last  = 16*i
    if last <= length(ctext)
        # allocating a regular Array inside a kernel is what breaks GPU compilation
        block = zeros(UInt8, 16)
        for j in 1:16
            block[j] = ctext[start + j - 1]
        end
        a = blockAdd(block)
        for j in 1:16
            ctextOut[start + j - 1] = a[j]
        end
    end
end


function AESGPUTest()
    randTextBlock = UInt8[i for i in 1:32]
    ctext = CuArray(randTextBlock)
    ctextOut = CuArray(UInt8[0 for i in 1:length(randTextBlock)])
    @cuda AESKernel!(ctextOut, ctext)
    print(ctextOut)
end

I’m using blockAdd as a stand-in for AES.

I am confident that I have a working AES implementation on the CPU side, and I was hoping it would be possible to:

  1. input a series of Array{UInt8,16} “blocks” (1 block per thread)
  2. have each thread mix either the input “block”, or a local copy of the “block”
  3. return the results to an output buffer
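For reference, the three steps above amount to something like this on the CPU (a self-contained sketch; mix16 is a made-up placeholder for the AES round function, not the real thing):

```julia
# made-up placeholder for the mixing: rotate and affine-map the bytes,
# using UInt8 literals so the arithmetic wraps instead of overflowing
mix16(b::Vector{UInt8}) = UInt8[b[(i % 16) + 1]*0x02 + 0x01 for i in 1:16]

function cpuReference(ctext::Vector{UInt8})
    ctextOut = zeros(UInt8, length(ctext))
    for blk in 1:length(ctext) ÷ 16        # step 1: one 16-byte block per "thread"
        r = 16*(blk - 1) + 1 : 16*blk
        ctextOut[r] = mix16(ctext[r])      # step 2: mix a local copy of the block
    end
    return ctextOut                        # step 3: results land in an output buffer
end
```

The GPU question is then how to express exactly this loop body as a kernel.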

I have also had trouble finding tutorials which go over this kind of thing.

Either direct help with the topic or pointers to tutorials would be greatly appreciated.


Can I ask which parameters are giving you trouble?

I spent Christmas on CUDA and wrote these lecture notes for students:
https://juliateachingctu.github.io/Scientific-Programming-in-Julia/dev/lecture_11/lecture/
Maybe they will help you.

Tomas


Specifically, I have Array{UInt8,16} “blocks” on which I am trying to perform various “mixing” operations. I am having difficulty figuring out how to get Julia to compile GPU code that takes a block and mixes it according to the AES standard.


In my latest attempt I was trying to make a local copy of the block to see if that would work, but I ran into compilation issues.

To answer your question: I am having trouble passing the Array{UInt8,16} blocks as input parameters to the individual threads.

I can see from this NVIDIA article that the algorithm can run on a GPU architecture:
Chapter 36. AES Encryption and Decryption on the GPU.

After some help from @Tomas_Pevny as well as some research of my own, it looks like the solution involves two realizations:

  1. GPU kernels want inputs that are plain values, not pointers. Luckily, since our inputs are so small, we can pack each set of 16 UInt8s into a UInt128 with the reinterpret function, so that we can pass an array of UInt128s to the kernel.
  2. To allocate memory on the GPU side, we can use the CuStaticSharedArray function, which gives us an array the kernel can operate on.

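The first realization is easy to sanity-check on the CPU: reinterpret reuses the same memory, so on a little-endian machine 16 bytes view as one UInt128 with bytes[1] in the lowest position (a quick check, not part of the kernel):

```julia
bytes = UInt8[1:16;]
words = reinterpret(UInt128, bytes)     # zero-copy view: 16 bytes -> 1 UInt128
@assert length(words) == 1
@assert words[1] % 256 == 0x01          # lowest byte is bytes[1] (little-endian)
@assert (words[1] >> 8) % 256 == 0x02   # next byte is bytes[2]
```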
Both of these together give me the following working code:

# Something like this, but more complicated
function blockAdd(in, out)
    for i in 1:16
        @inbounds out[(i%16)+1] = 2*in[i] + 1
    end
end


function AESKernel!(in, out)
    t = threadIdx().x
    i = (blockIdx().x - 1) * blockDim().x + t
    i > length(in) && return nothing

    # Allocate scratch space. Note: CuStaticSharedArray is shared by every
    # thread in the block, so each thread indexes its own 16-byte column
    # (this assumes at most 32 threads per block).
    shared_in  = CuStaticSharedArray(UInt8, (16, 32))
    shared_out = CuStaticSharedArray(UInt8, (16, 32))
    block_in  = @view shared_in[:, t]
    block_out = @view shared_out[:, t]

    # Unpack our block from our UInt128
    for j in 1:16
        @inbounds block_in[j] = (in[i] >> (8*(j-1))) % UInt8
    end

    # Now we can use our abstracted code
    blockAdd(block_in, block_out)

    # Repack the block
    acc = UInt128(0)
    for j in 1:16
        @inbounds acc += UInt128(block_out[j]) << (8*(j-1))
    end
    out[i] = acc

    return nothing
end

function AESGPUTest()
    randTextBlock = UInt8[i for i in 1:32]

    i1 = reinterpret(UInt128, randTextBlock)
    cu_i1 = CuArray(i1)

    cu_o1 = CuArray(zeros(UInt128, length(i1)))

    @cuda threads=2 AESKernel!(cu_i1, cu_o1)
    return Array(cu_o1)   # copy the result back to the host
end

Your code runs on the GPU, but you are not using it to its full power: threads=2 underutilizes the GPU. From a quick look at your code, you should be able to replace this

    for j in 1:16
        @inbounds block_in[j]     = in[i] >> (8*(j-1)) % 256
    end

    # Now we can use our abstracted code
    blockAdd(block_in, block_out)

by running 2*16 threads, where each thread processes one memory element. The reduction sum is a pain point; you can read about it in my lecture notes.
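A sketch of that layout (hypothetical and untested on real hardware): launch one CUDA block per 16-byte cipher block and 16 threads per block, let each thread unpack one byte, and synchronize before the mix and the repack. AESKernelPerByte! is a made-up name; the serial repack in thread 1 is the reduction pain point mentioned above.

```julia
using CUDA

# one 16-byte block per CUDA block, one thread per byte (launch with threads=16)
function AESKernelPerByte!(in, out)
    b = blockIdx().x                 # which 16-byte block
    j = threadIdx().x                # which byte within it, 1:16

    block_in  = CuStaticSharedArray(UInt8, 16)
    block_out = CuStaticSharedArray(UInt8, 16)

    @inbounds block_in[j] = (in[b] >> (8*(j-1))) % UInt8
    sync_threads()                   # all 16 bytes must be unpacked before mixing

    # the stand-in "mix": rotate, multiply, add (wrapping UInt8 arithmetic)
    @inbounds block_out[(j % 16) + 1] = 0x02*block_in[j] + 0x01
    sync_threads()

    # repack: a serial reduction done by thread 1 -- the pain point;
    # a tree reduction would parallelize this step
    if j == 1
        acc = UInt128(0)
        for k in 1:16
            @inbounds acc += UInt128(block_out[k]) << (8*(k-1))
        end
        @inbounds out[b] = acc
    end
    return nothing
end

# launch sketch: @cuda blocks=length(cu_i1) threads=16 AESKernelPerByte!(cu_i1, cu_o1)
```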

AES performs highly coupled operations on the memory blocks of 16 Bytes.

For proof of concept I did a rotation (check the indices in blockAdd), a multiply, and an add.

In practice, a researcher would be XORing the input ciphertext with a particular key.

The target application is some sort of brute force guessing of the key using a GPU.

Given that AES keys are 128+ bits, I think there will be more than enough room to spin up additional threads for additional keys once the kernel takes an array of candidate keys as input.
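With blocks packed as UInt128s, that XOR step becomes a single integer operation per candidate key, which is what makes the brute-force sweep cheap (a CPU sketch; the candidate key values here are made up):

```julia
block = reinterpret(UInt128, UInt8[1:16;])[1]
keys  = UInt128[0x0, 0xdeadbeef, 0x0123456789abcdef]  # hypothetical candidates
mixed = [block ⊻ k for k in keys]     # AddRoundKey-style XOR, one op per key
@assert mixed[1] == block             # XOR with zero is the identity
@assert mixed[2] ⊻ keys[2] == block   # XOR is its own inverse
```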