Nextfloat is slower on 2D array than on 1D array

Running the following function with a 2D array is much slower than running it with the same array but flattened to a 1D array, and both return the same result. Why is the 2D version slower?

function test(data::Array{Float64}; iterations::Int = 100)
    output = copy(data)
    iteration = 0
    while iteration < iterations
        iteration += 1
        for i in eachindex(output)
            next = nextfloat(output[i])
            if next < 0.5
                output[i] = next
            end
        end
    end
    return output
end

which can be compared like this:

arr_2D = rand(Float64, 1000, 1000);
arr_1D = arr_2D[:]; # flat 1000 * 1000

@time test(arr_1D, iterations = 200);   # fast, e.g. 0.2 seconds
@time test(arr_2D, iterations = 200);   # slow, e.g. 1.3 seconds

test(arr_1D, iterations = 200) == test(arr_2D, iterations = 200)[:] # true
julia> @time test(arr_1D, iterations = 200);   # fast, e.g. 0.2 seconds
  0.198706 seconds (8.83 k allocations: 8.078 MiB, 9.40% gc time, 12.31% compilation time: 100% of which was recompilation)

julia> @time test(arr_2D, iterations = 200);   # slow, e.g. 1.3 seconds
  1.237346 seconds (11.62 k allocations: 8.216 MiB, 1.40% compilation time: 100% of which was recompilation)

julia> test(arr_1D, iterations = 200) == test(arr_2D, iterations = 200)[:] # true
true

julia> @time test(arr_1D, iterations = 200);   # fast, e.g. 0.2 seconds
  0.160560 seconds (3 allocations: 7.629 MiB, 2.95% gc time)

julia> @time test(arr_2D, iterations = 200);   # slow, e.g. 1.3 seconds
  1.207420 seconds (3 allocations: 7.629 MiB)

I am unable to determine whether this performance difference comes from using a 1D vs. a 2D array, from using nextfloat, or from a combination of both.

Welcome! I can’t imagine how there could be a difference — and I don’t see one. Did you try in a fresh session? Maybe you have some other (older) methods of test defined?

REPL session

julia> function test(data::Array{Float64}; iterations::Int = 100)
           output = copy(data)
           iteration = 0
           while iteration < iterations
               iteration += 1
               for i in eachindex(output)
                   next = nextfloat(output[i])
                   if next < 0.5
                       output[i] = next
                   end
               end
           end
           return output
       end
test (generic function with 1 method)

julia> arr_2D = rand(Float64, 1000, 1000);

julia> arr_1D = arr_2D[:]; # flat 1000 * 1000

julia> @time test(arr_1D, iterations = 200);   # fast, e.g. 0.2 seconds
  1.462853 seconds (5.77 k allocations: 8.047 MiB, 0.51% compilation time)

julia> @time test(arr_2D, iterations = 200);   # slow, e.g. 1.3 seconds
  1.472072 seconds (8.17 k allocations: 8.193 MiB, 0.52% compilation time)

julia> @time test(arr_1D, iterations = 200);   # fast, e.g. 0.2 seconds
  1.462221 seconds (2 allocations: 7.629 MiB, 0.22% gc time)

julia> @time test(arr_2D, iterations = 200);   # slow, e.g. 1.3 seconds
  1.462457 seconds (2 allocations: 7.629 MiB)

I did try in a fresh REPL session, and I also tried renaming the function to something different, and I still get the same result.

I did however notice that small changes to the function would change the result, e.g:

function test2(data::Array{Float64}; iterations::Int = 100)
    output = copy(data)
    iteration = 0
    while iteration < iterations
        iteration += 1
        for i in eachindex(output)
            if output[i] < 0.5
                output[i] = nextfloat(output[i])
            end
        end
    end
    return output
end

arr_2D = rand(Float64, 1000, 1000);
arr_1D = arr_2D[:]; # flat 1000 * 1000

@time test2(arr_1D, iterations = 200);   # slow, e.g. 0.9 seconds
@time test2(arr_2D, iterations = 200);   # slow, e.g. 0.9 seconds

test2(arr_1D, iterations = 200) == test2(arr_2D, iterations = 200)[:] # true
julia> @time test2(arr_1D, iterations = 200);   # slow, e.g. 0.9 seconds
  0.872082 seconds (9.40 k allocations: 8.118 MiB, 2.42% compilation time: 100% of which was recompilation)

julia> @time test2(arr_2D, iterations = 200);   # slow, e.g. 0.9 seconds
  0.954726 seconds (12.20 k allocations: 8.257 MiB, 5.36% gc time, 2.05% compilation time: 100% of which was recompilation)

So now they are both slow, but that does not really explain why the original test function is so much faster on the 1D array.

My initial hypothesis was that nextfloat is slow, as can be demonstrated by comparing with this almost equivalent test:

function test3(data::Array{Float64}; iterations::Int = 100)
    output = copy(data)
    iteration = 0
    while iteration < iterations
        iteration += 1
        for i in eachindex(output)
            next = output[i] + eps()
            if next < 0.5
                output[i] = next
            end
        end
    end
    return output
end

arr_2D = rand(Float64, 1000, 1000);
arr_1D = arr_2D[:]; # flat 1000 * 1000

@time test3(arr_1D, iterations = 200);   # very fast, e.g. 0.1 seconds
@time test3(arr_2D, iterations = 200);   # very fast, e.g. 0.1 seconds

test3(arr_1D, iterations = 200) == test3(arr_2D, iterations = 200)[:] # true
julia> @time test3(arr_1D, iterations = 200);   # very fast, e.g. 0.1 seconds
  0.100805 seconds (7.50 k allocations: 8.005 MiB, 18.88% gc time, 15.73% compilation time: 100% of which was recompilation)

julia> @time test3(arr_2D, iterations = 200);   # very fast, e.g. 0.1 seconds
  0.081458 seconds (10.09 k allocations: 8.135 MiB, 22.16% compilation time: 100% of which was recompilation)

The significant difference in speed between nextfloat and eps() was what led me to this in the first place. Any suggestions as to how to make nextfloat as fast as eps() while preserving the benefit of nextfloat, namely that it adapts its step to the value it is incrementing?

output[i] + eps() isn’t the same as nextfloat — it’s going to take bigger steps! And with bigger steps, you’re going to update the array less. In general, the duration here is going to be data-dependent: one random array might take longer or shorter than another.
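A quick sketch of the step-size difference (Float64 values; easy to check in a REPL):

```julia
x = 0.3
# eps() is the spacing between Float64 values near 1.0 — a fixed step.
# nextfloat(x) advances by the spacing near x, which is smaller for x < 1.
@assert nextfloat(x) - x < eps()
@assert x + eps() > nextfloat(x)   # adding eps() overshoots by several ulps
```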

I can promise you the “slowness” isn’t coming from 2D vs 1D or nextfloat or some other Julia problem. Julia is doing exactly what you asked it to do.


I agree that output[i] + eps() and nextfloat(output[i]) are not comparable, but it was the best simplification I could come up with for a bigger function (without random data) where I first observed this problem. There, the choice between eps() and nextfloat makes no difference to the result, only to the speed; the one problem is that adding eps() can become too small relative to the value.

I can promise you the “slowness” isn’t coming from 2D vs 1D or nextfloat or some other Julia problem. Julia is doing exactly what you asked it to do.

Is the speed difference related to randomness, my hardware, or the JIT, then?

I simply suspect you’re not measuring what you think you’re measuring.

If you’re comparing two iterative algorithms where one steps by eps() and the other steps by nextfloat, the latter is going to take twice as many steps for data in [0.5, 1.0). It’ll take 4x more steps in [0.25, 0.5). 8x more in …
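Those ratios follow from the floating-point spacing: the distance between consecutive Float64 values halves in each lower power-of-two interval, while eps() stays fixed. A small sketch to verify:

```julia
# eps() is the ulp at 1.0; the ulp halves in each lower binade.
@assert nextfloat(0.75) - 0.75 == eps() / 2   # [0.5, 1.0): 2x as many steps
@assert nextfloat(0.3)  - 0.3  == eps() / 4   # [0.25, 0.5): 4x
@assert nextfloat(0.2)  - 0.2  == eps() / 8   # [0.125, 0.25): 8x
```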

And then if you accidentally call one in place of the other when you think you’re testing the opposite, you’re going to get confused.

This makes no sense at all, but I can reproduce this.


For a slightly better MWE (after using BenchmarkTools for @benchmark):

julia> function test3(data::Array{Float64})
           output = copy(data)
           for i in eachindex(output)
               output[i] = nextfloat(output[i])
           end
           return output
       end
julia> @benchmark test3(arr_1D)
BenchmarkTools.Trial: 92 samples with 1 evaluation per sample.
 Range (min … max):  45.105 ms … 87.348 ms  ┊ GC (min … max):  1.90% … 42.24%
 Time  (median):     49.644 ms              ┊ GC (median):     4.18%
 Time  (mean ± σ):   54.777 ms ± 11.319 ms  ┊ GC (mean ± σ):  13.74% ± 13.55%

   ▂ ▅▇█▄▄
  ▆███████▆▆▃▃▃▁▁▅▃▁▃▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▃▅▅▃▁▃▃▃▇▅▁▁▃▁▁▁▁▁▁▁▁▁▃ ▁
  45.1 ms         Histogram: frequency by time        87.3 ms <

 Memory estimate: 68.66 MiB, allocs estimate: 2000006.

julia> @benchmark test3(arr_2D)
BenchmarkTools.Trial: 46 samples with 1 evaluation per sample.
 Range (min … max):   92.110 ms … 151.183 ms  ┊ GC (min … max):  1.77% … 21.01%
 Time  (median):     106.151 ms               ┊ GC (median):     5.72%
 Time  (mean ± σ):   109.749 ms ±  14.033 ms  ┊ GC (mean ± σ):  10.83% ±  7.46%

     ▄▄▄▁   █▁  █ ▄    ▁▁               ▁▁
  ▆▁▆████▁▆▁██▆▁█▆█▁▆▆▁██▁▁▁▁▆▁▁▁▁▁▁▁▆▆▁██▁▆▁▁▆▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▆ ▁
  92.1 ms          Histogram: frequency by time          151 ms <

 Memory estimate: 129.70 MiB, allocs estimate: 4000008.

I can reproduce this too, on 1.11.5.

Add @inbounds to the loop, and the difference disappears.
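For reference, here is a sketch of that change applied to the original function (test_inbounds is just an illustrative name; @inbounds is safe here because i comes from eachindex(output)):

```julia
function test_inbounds(data::Array{Float64}; iterations::Int = 100)
    output = copy(data)
    for _ in 1:iterations
        # eachindex only yields valid indices, so the bounds check can be elided
        @inbounds for i in eachindex(output)
            next = nextfloat(output[i])
            if next < 0.5
                output[i] = next
            end
        end
    end
    return output
end
```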

Right. The difference is where the length field (which we use to bounds-check) lives. For a Vector, the length is a field of the Vector itself, while for a Matrix we get the length from the underlying Memory, iirc.

It is quite surprising that this code doesn’t eliminate the bounds check automatically.


Hunh, sure enough. I was testing on 1.10. I see the difference in that @benchmark on 1.11, but I’m still not seeing that 10x slow/fast difference on the original example. There’s a branch in there, perhaps there’s some wild branch-predictor shenanigans going on for some CPUs? I’m on an M1.
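One way to probe that theory (a sketch, not from the thread; test_branchless is a hypothetical name) is to replace the if with ifelse, which evaluates both arguments and typically compiles to a select rather than a jump:

```julia
function test_branchless(data::Array{Float64}; iterations::Int = 100)
    output = copy(data)
    for _ in 1:iterations
        @inbounds for i in eachindex(output)
            next = nextfloat(output[i])
            # same semantics as the if: keep the old value when next >= 0.5
            output[i] = ifelse(next < 0.5, next, output[i])
        end
    end
    return output
end
```

If the branchless version runs at the same speed on both CPUs, the remaining gap in the original would point at branch prediction rather than bounds checks.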

I think the difference might be that the M1 has much smaller SIMD vectors (and more out-of-order capacity) than x86. My test3 removes the branches, so I think it is just that on x86 the bounds checks hurt more.