BigFloat performance, allocates like crazy!


#1

I have a program which allocates less than 35 MB and takes under 2 seconds with Float64 and Float32, but allocates 350 GB and takes more than 10 minutes with BigFloat, is this expected? The code is seemingly type stable. Are there other recommendations for high precision calculations without all the allocations, perhaps not as big as the 256 bits of BigFloat but still more than 64 bits? I would appreciate your comments.


#2

I think DoubleDouble is what I am looking for:

julia> using DoubleDouble

julia> using BenchmarkTools

julia> a = Double{Float64}(1); b = Double{Float64}(2);

julia> @btime a + b
  20.421 ns (1 allocation: 32 bytes)
Double(3.0, 0.0)
 - value: 3

julia> a = Float64(1); b = Float64(2);

julia> @btime a + b
  17.868 ns (1 allocation: 16 bytes)
3.0

julia> a = BigFloat(1); b = BigFloat(2);

julia> @btime a + b
  91.168 ns (2 allocations: 88 bytes)
3.000000000000000000000000000000000000000000000000000000000000000000000000000000

julia> a = Double{Float64}(1); b = Double{Float64}(2);

julia> @btime a * b
  32.091 ns (1 allocation: 32 bytes)
Double(2.0, 0.0)
 - value: 2

julia> a = Float64(1); b = Float64(2);

julia> @btime a * b
  19.327 ns (1 allocation: 16 bytes)
2.0

julia> a = BigFloat(1); b = BigFloat(2);

julia> @btime a * b
  111.090 ns (2 allocations: 88 bytes)
2.000000000000000000000000000000000000000000000000000000000000000000000000000000

julia> a = Double{Float64}(1); b = Double{Float64}(2);

julia> @btime a / b
  38.482 ns (1 allocation: 32 bytes)
Double(0.5, 0.0)
 - value: 0.5

julia> a = Float64(1); b = Float64(2);

julia> @btime a / b
  19.692 ns (1 allocation: 16 bytes)
0.5

julia> a = BigFloat(1); b = BigFloat(2);

julia> @btime a / b
  195.918 ns (2 allocations: 88 bytes)
5.000000000000000000000000000000000000000000000000000000000000000000000000000000e-01


#3

Also https://github.com/saschatimme/HigherPrecision.jl, should have probably done a better search :smile:


#4

MPFR Bigfloats are actually mutable because, since they are arbitrary precision, they are implemented on an array. Working with them with full mutability is kind of crazy though from what I understand, so they are exposed in Julia like any other number: immutable. But this does mean allocating a lot, and there’s no good solution for now except for using a smaller fixed precision number.


#5

Your timings are actually dramatically off, because you’re not interpolating the variables:

a, b = 2.0, 3.0
A, B = BigFloat(a), BigFloat(b)
aa, bb = Double(a), Double(b)

julia> @btime $a + $b
  1.456 ns (0 allocations: 0 bytes)
5.0

julia> @btime $A + $B
  95.115 ns (2 allocations: 104 bytes)
5.000000000000000000000000000000000000000000000000000000000000000000000000000000

julia> @btime $aa + $bb
  4.615 ns (0 allocations: 0 bytes)
Double(5.0, 0.0)
 - value: 5

As you can see, the difference is much more dramatic than your benchmarks indicated. It should, however, raise a big red flag when scalar float addition allocates and takes 20ns!

Takeaway lesson: interpolate non-constant variables!


#6

Of course, thanks for pointing that out, I always to forget to do that! New benchmarks:

julia> using HigherPrecision

julia> using BenchmarkTools

julia> a = Float32(1); b = Float32(2);

julia> A = BigFloat(1); B = BigFloat(2);

julia> aa = DoubleFloat64(1); bb = DoubleFloat64(2);

julia> @btime $a + $b
  1.093 ns (0 allocations: 0 bytes)
3.0f0

julia> @btime $A + $B
  70.460 ns (2 allocations: 88 bytes)
3.000000000000000000000000000000000000000000000000000000000000000000000000000000

julia> @btime $aa + $bb
  6.564 ns (0 allocations: 0 bytes)
3

julia> @btime $a * $b
  1.093 ns (0 allocations: 0 bytes)
2.0f0

julia> @btime $A * $B
  88.870 ns (2 allocations: 88 bytes)
2.000000000000000000000000000000000000000000000000000000000000000000000000000000

julia> @btime $aa * $bb
  1.823 ns (0 allocations: 0 bytes)
2

julia> @btime $a / $b
  1.093 ns (0 allocations: 0 bytes)
0.5f0

julia> @btime $A / $B
  169.787 ns (2 allocations: 88 bytes)
5.000000000000000000000000000000000000000000000000000000000000000000000000000000e-01

julia> @btime $aa / $bb
  7.658 ns (0 allocations: 0 bytes)
0.5

DoubleDouble doesn’t currently support some operations I am using like ^ but HigherPrecision looks incredible! I think the 60-150 times slowdown in scalar arithmetic compared to Float32 could explain the time difference I am observing, there is probably also some SIMD and caching magic for the Float32 going on behind the scenes in my real program. But HigherPrecision promises at least a 10 times speedup from BigFloat so looking forward to trying that after ironing a few things out.


#7

No, SIMD only applies to multiple operations that are done together. There’s no possible SIMD here. That difference is likely because the multi-precision type has to do 2-4 operations on 64-bit floats. Basically, they get higher precision by having two pieces, and smartly doing an extra operation or two to catch the overflowing error of the 1st set of numbers into the second set. The reason this is done is because quad precision isn’t actually supported on the die, so this is an implementation which just uses normal floating point registers instead of software emulation which would be slow.


#8

Yes I have a bunch of array operations and simd-able loops in the actual code which I believe the compiler maybe vectorizing for me. And a cache line and a register would hold more Float32s than DoubleFloat64s so that’s another source of speed difference.


#9

Okay then yes, that would probably add a difference that the microbenchmarks don’t show.


#10

Yup.


#11

Another one to try would be Dec128