Nextfloat is slower on 2D array than on 1D array

Running the following function with a 2D array is much slower than running it with the same array but flattened to a 1D array, and both return the same result. Why is the 2D version slower?

function test(data::Array{Float64}; iterations::Int = 100)
    output = copy(data)
    iteration = 0
    while iteration < iterations
        iteration += 1
        for i in eachindex(output)
            next = nextfloat(output[i])
            if next < 0.5
                output[i] = next
            end
        end
    end
    return output
end

which can be compared like this:

arr_2D = rand(Float64, 1000, 1000);
arr_1D = arr_2D[:]; # flat 1000 * 1000

@time test(arr_1D, iterations = 200);   # fast, e.g. 0.2 seconds
@time test(arr_2D, iterations = 200);   # slow, e.g. 1.3 seconds

test(arr_1D, iterations = 200) == test(arr_2D, iterations = 200)[:] # true
julia> @time test(arr_1D, iterations = 200);   # fast, e.g. 0.2 seconds
  0.198706 seconds (8.83 k allocations: 8.078 MiB, 9.40% gc time, 12.31% compilation time: 100% of which was recompilation)

julia> @time test(arr_2D, iterations = 200);   # slow, e.g. 1.3 seconds
  1.237346 seconds (11.62 k allocations: 8.216 MiB, 1.40% compilation time: 100% of which was recompilation)

julia> test(arr_1D, iterations = 200) == test(arr_2D, iterations = 200)[:] # true
true

julia> @time test(arr_1D, iterations = 200);   # fast, e.g. 0.2 seconds
  0.160560 seconds (3 allocations: 7.629 MiB, 2.95% gc time)

julia> @time test(arr_2D, iterations = 200);   # slow, e.g. 1.3 seconds
  1.207420 seconds (3 allocations: 7.629 MiB)

I am unable to determine whether this performance difference comes from using a 1D vs. a 2D array, from using nextfloat, or from a combination of both.

Welcome! I can’t imagine how there could be a difference — and I don’t see one. Did you try in a fresh session? Maybe you have some other (older) methods of test defined?

REPL session

julia> function test(data::Array{Float64}; iterations::Int = 100)
           output = copy(data)
           iteration = 0
           while iteration < iterations
               iteration += 1
               for i in eachindex(output)
                   next = nextfloat(output[i])
                   if next < 0.5
                       output[i] = next
                   end
               end
           end
           return output
       end
test (generic function with 1 method)

julia> arr_2D = rand(Float64, 1000, 1000);

julia> arr_1D = arr_2D[:]; # flat 1000 * 1000

julia> @time test(arr_1D, iterations = 200);   # fast, e.g. 0.2 seconds
  1.462853 seconds (5.77 k allocations: 8.047 MiB, 0.51% compilation time)

julia> @time test(arr_2D, iterations = 200);   # slow, e.g. 1.3 seconds
  1.472072 seconds (8.17 k allocations: 8.193 MiB, 0.52% compilation time)

julia> @time test(arr_1D, iterations = 200);   # fast, e.g. 0.2 seconds
  1.462221 seconds (2 allocations: 7.629 MiB, 0.22% gc time)

julia> @time test(arr_2D, iterations = 200);   # slow, e.g. 1.3 seconds
  1.462457 seconds (2 allocations: 7.629 MiB)

I did try in a fresh REPL session, and I also tried renaming the function to something different, and I still get the same result.

I did however notice that small changes to the function would change the result, e.g:

function test2(data::Array{Float64}; iterations::Int = 100)
    output = copy(data)
    iteration = 0
    while iteration < iterations
        iteration += 1
        for i in eachindex(output)
            if output[i] < 0.5
                output[i] = nextfloat(output[i])
            end
        end
    end
    return output
end

arr_2D = rand(Float64, 1000, 1000);
arr_1D = arr_2D[:]; # flat 1000 * 1000

@time test2(arr_1D, iterations = 200);   # slow, e.g. 0.9 seconds
@time test2(arr_2D, iterations = 200);   # slow, e.g. 0.9 seconds

test2(arr_1D, iterations = 200) == test2(arr_2D, iterations = 200)[:] # true
julia> @time test2(arr_1D, iterations = 200);   # slow, e.g. 0.9 seconds
  0.872082 seconds (9.40 k allocations: 8.118 MiB, 2.42% compilation time: 100% of which was recompilation)

julia> @time test2(arr_2D, iterations = 200);   # slow, e.g. 0.9 seconds
  0.954726 seconds (12.20 k allocations: 8.257 MiB, 5.36% gc time, 2.05% compilation time: 100% of which was recompilation)

So now they are both slow, but that does not really explain why the original test function is so much faster on the 1D array.

My initial hypothesis was that nextfloat is slow, as can be demonstrated by comparing with this almost equivalent test:

function test3(data::Array{Float64}; iterations::Int = 100)
    output = copy(data)
    iteration = 0
    while iteration < iterations
        iteration += 1
        for i in eachindex(output)
            next = output[i] + eps()
            if next < 0.5
                output[i] = next
            end
        end
    end
    return output
end

arr_2D = rand(Float64, 1000, 1000);
arr_1D = arr_2D[:]; # flat 1000 * 1000

@time test3(arr_1D, iterations = 200);   # very fast, e.g. 0.1 seconds
@time test3(arr_2D, iterations = 200);   # very fast, e.g. 0.1 seconds

test3(arr_1D, iterations = 200) == test3(arr_2D, iterations = 200)[:] # true
julia> @time test3(arr_1D, iterations = 200);   # very fast, e.g. 0.1 seconds
  0.100805 seconds (7.50 k allocations: 8.005 MiB, 18.88% gc time, 15.73% compilation time: 100% of which was recompilation)

julia> @time test3(arr_2D, iterations = 200);   # very fast, e.g. 0.1 seconds
  0.081458 seconds (10.09 k allocations: 8.135 MiB, 22.16% compilation time: 100% of which was recompilation)

The significant difference in speed between nextfloat and eps() was what led me to this in the first place. Any suggestions as to how to make nextfloat as fast as eps() while preserving the benefit of nextfloat, namely that it adapts its step to the value it is incrementing?

output[i] + eps() isn’t the same as nextfloat — it’s going to take bigger steps! And with bigger steps, you’re going to update the array less. In general, the duration here is going to be data-dependent: one random array might take longer or shorter than another.
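A quick sketch of the step-size difference (Float64 values; easy to check in a REPL):

```julia
x = 0.3
# eps() is the spacing between Float64 values near 1.0 — a fixed step.
# nextfloat(x) advances by the spacing near x, which is smaller for x < 1.
@assert nextfloat(x) - x < eps()
@assert x + eps() > nextfloat(x)   # adding eps() overshoots by several ulps
```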

I can promise you the “slowness” isn’t coming from 2D vs 1D or nextfloat or some other Julia problem. Julia is doing exactly what you asked it to do.


I agree that output[i] + eps() and nextfloat(output[i]) are not comparable, but it was the best simplification I could come up with for a bigger function (without random data) where I first observed this problem. There, the choice between eps() and nextfloat makes no difference to the result, only to the speed; the one problem is that adding eps() can become too small relative to the value.

I can promise you the “slowness” isn’t coming from 2D vs 1D or nextfloat or some other Julia problem. Julia is doing exactly what you asked it to do.

Is the speed difference related to randomness, my hardware, or the JIT, then?

I simply suspect you’re not measuring what you think you’re measuring.

If you’re comparing two iterative algorithms where one steps by eps() and the other steps by nextfloat, the latter is going to take twice as many steps for data in [0.5, 1.0). It’ll take 4x more steps in [0.25, 0.5). 8x more in …
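Those ratios follow from the floating-point spacing: the distance between consecutive Float64 values halves in each lower power-of-two interval, while eps() stays fixed. A small sketch to verify:

```julia
# eps() is the ulp at 1.0; the ulp halves in each lower binade.
@assert nextfloat(0.75) - 0.75 == eps() / 2   # [0.5, 1.0): 2x as many steps
@assert nextfloat(0.3)  - 0.3  == eps() / 4   # [0.25, 0.5): 4x
@assert nextfloat(0.2)  - 0.2  == eps() / 8   # [0.125, 0.25): 8x
```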

And then if you accidentally call one in place of the other when you think you’re testing the opposite, you’re going to get confused.

This makes no sense at all, but I can reproduce this.


For a slightly better MWE (after using BenchmarkTools for @benchmark):

julia> function test3(data::Array{Float64})
           output = copy(data)
           for i in eachindex(output)
               output[i] = nextfloat(output[i])
           end
           return output
       end
julia> @benchmark test3(arr_1D)
BenchmarkTools.Trial: 92 samples with 1 evaluation per sample.
 Range (min … max):  45.105 ms … 87.348 ms  ┊ GC (min … max):  1.90% … 42.24%
 Time  (median):     49.644 ms              ┊ GC (median):     4.18%
 Time  (mean ± σ):   54.777 ms ± 11.319 ms  ┊ GC (mean ± σ):  13.74% ± 13.55%

   ▂ ▅▇█▄▄
  ▆███████▆▆▃▃▃▁▁▅▃▁▃▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▃▅▅▃▁▃▃▃▇▅▁▁▃▁▁▁▁▁▁▁▁▁▃ ▁
  45.1 ms         Histogram: frequency by time        87.3 ms <

 Memory estimate: 68.66 MiB, allocs estimate: 2000006.

julia> @benchmark test3(arr_2D)
BenchmarkTools.Trial: 46 samples with 1 evaluation per sample.
 Range (min … max):   92.110 ms … 151.183 ms  ┊ GC (min … max):  1.77% … 21.01%
 Time  (median):     106.151 ms               ┊ GC (median):     5.72%
 Time  (mean ± σ):   109.749 ms ±  14.033 ms  ┊ GC (mean ± σ):  10.83% ±  7.46%

     ▄▄▄▁   █▁  █ ▄    ▁▁               ▁▁
  ▆▁▆████▁▆▁██▆▁█▆█▁▆▆▁██▁▁▁▁▆▁▁▁▁▁▁▁▆▆▁██▁▆▁▁▆▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▆ ▁
  92.1 ms          Histogram: frequency by time          151 ms <

 Memory estimate: 129.70 MiB, allocs estimate: 4000008.

I can reproduce this too, on 1.11.5.

Add @inbounds to the loop, and the difference disappears.
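For reference, here is a sketch of that change applied to the original function (test_inbounds is just an illustrative name; @inbounds is safe here because i comes from eachindex(output)):

```julia
function test_inbounds(data::Array{Float64}; iterations::Int = 100)
    output = copy(data)
    for _ in 1:iterations
        # eachindex only yields valid indices, so the bounds check can be elided
        @inbounds for i in eachindex(output)
            next = nextfloat(output[i])
            if next < 0.5
                output[i] = next
            end
        end
    end
    return output
end
```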

Right. The difference is where the length field (which we use to bounds-check) lives. For a Vector, the length is a field of the Vector itself, while for a Matrix we get the length from the underlying Memory, iirc.

It is quite surprising that this code doesn’t eliminate the bounds check automatically.


Hunh, sure enough. I was testing on 1.10. I see the difference in that @benchmark on 1.11, but I’m still not seeing that 10x slow/fast difference on the original example. There’s a branch in there, perhaps there’s some wild branch-predictor shenanigans going on for some CPUs? I’m on an M1.
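One way to probe that theory (a sketch, not from the thread; test_branchless is a hypothetical name) is to replace the if with ifelse, which evaluates both arguments and typically compiles to a select rather than a jump:

```julia
function test_branchless(data::Array{Float64}; iterations::Int = 100)
    output = copy(data)
    for _ in 1:iterations
        @inbounds for i in eachindex(output)
            next = nextfloat(output[i])
            # same semantics as the if: keep the old value when next >= 0.5
            output[i] = ifelse(next < 0.5, next, output[i])
        end
    end
    return output
end
```

If the branchless version runs at the same speed on both CPUs, the remaining gap in the original would point at branch prediction rather than bounds checks.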

I think the difference might be that the M1 has much smaller SIMD vectors (and more out-of-order capacity) than x86. My test3 removes the branches, so I think it is just that on x86 the bounds checks hurt more.