Did you compare the runtime of CLBlast.jl and basic BLAS from Base? If so, was BLAS using multiple threads? Or did you compare CLBlast.jl on the CPU and the GPU?
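In case it helps with the threading question, here is a minimal sketch of how one could check the thread count used by the default BLAS and time a matrix product with several threads and with one. The matrix size `n` and the variable names are just assumptions for illustration, and CLBlast.jl itself is left out on purpose; this only covers the CPU BLAS side of the comparison.

```julia
using LinearAlgebra

# Hypothetical problem size; adjust it to match the benchmark in question.
n = 512
A = rand(Float32, n, n)
B = rand(Float32, n, n)
C = zeros(Float32, n, n)

# Remember how many threads the default BLAS is using right now.
nthreads = BLAS.get_num_threads()
println("BLAS threads: ", nthreads)

# Warm-up call so that compilation time is not part of the measurement.
mul!(C, A, B)

# Time the product with the current (possibly multi-threaded) setting.
t_multi = @elapsed mul!(C, A, B)

# Time the same product with BLAS restricted to a single thread.
BLAS.set_num_threads(1)
t_single = @elapsed mul!(C, A, B)

# Restore the original thread count.
BLAS.set_num_threads(nthreads)

println("with $nthreads thread(s): $t_multi s, with 1 thread: $t_single s")
```

A single `@elapsed` call is noisy, so for a real comparison it is better to repeat the measurement several times or use a benchmarking package.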
In my experience, running OpenCL code directly on an Intel CPU is often about as fast as, or even faster than, running the same code on the integrated GPU. With a dedicated GPU, the difference can be much larger.