Slow code to compute b=A*x

sum2 and sum3 effectively have no inner loop. Because the results of the inner loop are never used, the compiler optimizes the loop away.
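To illustrate the effect, here is a minimal sketch in C. The original sum2 and sum3 are not reproduced here, so the function names `row_sums_discarded` and `matvec`, the row-major layout of `A`, and the square size `n` are all assumptions made for illustration. The first function accumulates a value that is never used, so an optimizing compiler is free to drop its inner loop. The second stores each row's result in `b`, which keeps the inner loop observable.

```c
#include <stddef.h>

/* Hypothetical stand-in for a loop like sum2/sum3: the inner loop
   accumulates into `acc`, but `acc` never reaches any output, so an
   optimizing compiler may delete the inner loop entirely. */
double row_sums_discarded(const double *A, const double *x, size_t n)
{
    double rows_visited = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double acc = 0.0;
        for (size_t j = 0; j < n; ++j)
            acc += A[i * n + j] * x[j];  /* result unused: dead code */
        (void)acc;            /* silences warnings; does not keep the loop alive */
        rows_visited += 1.0;  /* only the outer loop has a visible effect */
    }
    return rows_visited;
}

/* Same loop nest, but each row's dot product is written to b[i], so the
   inner loop contributes to an observable result (b = A*x, with A
   assumed to be n-by-n and stored row-major). */
void matvec(const double *A, const double *x, double *b, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        double acc = 0.0;
        for (size_t j = 0; j < n; ++j)
            acc += A[i * n + j] * x[j];
        b[i] = acc;
    }
}
```

When timing variants like these, it helps to use the output afterwards (for example, print or sum `b`), so the work being measured cannot be discarded. Comparing the generated assembly with and without optimization is one way to check whether the inner loop survived.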