CUDA v2 - performance regression on matrix multiplication

Yeah, 2.x is consistently faster on my system, as reported above.

For the profiling results, it would be better to run the multiplication several times and average the timings. A difference this large from a single run could be a fluke.
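To illustrate what I mean by averaging over several runs, here is a rough sketch in plain Python. It is not the benchmark from this thread: `matmul` is just a naive CPU stand-in workload, and `time_repeated` and `summarize` are hypothetical helper names I made up for the example. The idea is to do an untimed warm-up run first, then time several runs and compare the median or minimum rather than one sample.

```python
import statistics
import time


def matmul(a, b):
    """Naive matrix product of two lists of rows (stand-in workload only)."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def time_repeated(fn, *args, warmup=1, repeats=5):
    """Call fn(*args) `warmup` times untimed, then return `repeats` timings in seconds."""
    for _ in range(warmup):
        fn(*args)  # warm-up run, excluded from the measurements
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Summary statistics for a list of timings."""
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        "mean": statistics.mean(timings),
        "stdev": statistics.stdev(timings) if len(timings) > 1 else 0.0,
    }


if __name__ == "__main__":
    n = 64  # arbitrary size chosen for the example
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    print(summarize(time_repeated(matmul, a, b)))
```

If the spread between runs is comparable to the difference between versions, the single-run number is not telling us much. For a GPU workload the timed call would also need to wait for the device to finish, since kernel launches are asynchronous; otherwise only the launch overhead is measured.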