CUDA matmul performance

What does the relative performance look like if the matrices don't have such a high aspect ratio? (Last week I was GPU-ifying some code that was bottlenecked by square matrix multiplication, and I found big speed increases from moving to CUDA.)