As I commented in another thread, micro-optimizing a single vectorized function that performs one operation (in your case, summing the sqrt of the array elements) is rarely the best way to improve performance.