What does the relative performance look like if the matrices don't have such a high aspect ratio? (Last week I was GPU-ifying some code that was bottlenecked by square matrix multiplication, and found big speed increases from moving to CUDA.)