I want to store a matrix in Float32 and do several solves with Float64 right-hand sides. In LINPACK (and, I think, LAPACK) this just works. In Matlab (see below) it’s less clear. However, something bizarre happens with Julia in the solve phase. Here’s my story.
I’m reporting on lu!; qr! is similar, with more allocations.
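For reference, the qr! version of the experiment looks like this (a minimal sketch with illustrative names, not the timed runs below):

using LinearAlgebra
n = 1000; Aq = rand(n, n); Bq = Float32.(Aq); xq = rand(n); bq = Aq*xq;
AFq = qr!(Aq); BFq = qr!(Bq);          # QR factorizations in Float64 and Float32
z2 = BFq\bq; z3 = BFq\Float32.(bq);    # same mixed-precision solves as the lu! runs below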
What is happening? Is there a way I can fix this (MKL)? Is there something I’m missing about LAPACK?
julia> using LinearAlgebra, BenchmarkTools

julia> n=1000; A=rand(n,n); B=Float32.(A); x=rand(n); b=A*x;
julia> AC=copy(A);
julia> AF=lu!(A); BF=lu!(B);
julia> y1=AF\b; y2=BF\b; y3=BF\Float32.(b);
julia> [norm(y1-x), norm(y2-x), norm(y3-x)]./norm(x)
3-element Array{Float64,1}:
2.45032e-12
4.97286e-05
2.86294e-04
julia> [norm(AC*y1-b), norm(AC*y2-b), norm(AC*y3-b)]./norm(b)
3-element Array{Float64,1}:
7.71294e-16
6.27307e-08
3.17710e-07
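As a rough sanity check (a side note, not part of the timing question): the usual forward-error bound is roughly norm(y-x)/norm(x) on the order of cond(A)*eps(T), with eps(Float32) ≈ 1.2e-7 and eps(Float64) ≈ 2.2e-16, and the condition number of this particular matrix can be checked directly:

cond(AC)   # 2-norm condition number of the original Float64 matrix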
The residual and error norms look reasonable, and it seems that y2 is more accurate than y3. That is no surprise, and it has been the case in all the testing I’ve done. I’d be happy, but something is funny with the solve: I’d expect y2 and y3 to cost about the same, yet y3 is much faster than y2 …
julia> @btime $y1=$AF\$b;
278.196 μs (1 allocation: 7.94 KiB)
julia> @btime $y2=$BF\$b;
269.095 μs (1 allocation: 7.94 KiB)
julia> @btime $y3=$BF\Float32.($b);
88.929 μs (2 allocations: 8.13 KiB)
Why?
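One thing I can do to dig into this (a sketch, no output shown) is ask which \ methods the three solves dispatch to; my guess, which I have not verified, is that the mixed Float32/Float64 case takes a promoting generic path rather than the single-precision LAPACK triangular solve:

using InteractiveUtils      # loaded by default in the REPL
@which AF\b                 # Float64 factorization, Float64 rhs
@which BF\b                 # Float32 factorization, Float64 rhs (the slow one)
@which BF\Float32.(b)       # Float32 factorization, Float32 rhs (the fast one)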
Matlab gives consistent results and seems to compute y3 even when I ask for y2.
>> n=1000; A=rand(n,n); B=single(A); x=rand(n,1); b=A*x;
>> [l,u]=lu(A); [ls,us]=lu(B);
>> tic; y1=u\(l\b); toc
Elapsed time is 0.009094 seconds.
>> tic; y2=us\(ls\b); toc
Elapsed time is 0.006940 seconds.
>> tic; y3=us\(ls\single(b)); toc
Elapsed time is 0.006864 seconds.
>> [norm(y2-y3), norm(y1-x), norm(y2-x)]/norm(x)
ans =
1x3 single row vector
0 5.6362e-13 2.7937e-04
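For what it’s worth, since converting the right-hand side by hand already hits the fast path, one workaround is a small wrapper that demotes b before the solve and promotes the result afterwards (solve32 is just an illustrative name, not an API). It reproduces y3, so it is fast but, as the error norms above show, somewhat less accurate than y2.

# Illustrative wrapper: solve with a Float32 factorization and a Float64
# right-hand side by demoting b, then promoting the solution back.
solve32(F, b) = Float64.(F\Float32.(b))

y3b = solve32(BF, b);   # same as y3 above, just stored in Float64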