EXCEPTION_ACCESS_VIOLATION on Windows but not MacOS

I am running the exact same code on Windows and Mac, with Julia 1.11.4 installed on both. I have to assume this error message relates to an out-of-bounds request. I’ll paste the huge error message below. I can share the code, but it’s a lot of code. The error relates to the use of the @view macro in two places that you can see in the code. Once again, it’s surprising that this works on Mac and not Windows. Are the underlying Julia code or the underlying linear algebra libraries really that different between the two platforms?

My purpose in getting a Windows machine is to have a high-end NVIDIA graphics card for working with CUDA from both Julia and Python (PyTorch). Too bad Windows is so often flawed relative to macOS, despite offering a wider variety of hardware.

After this initial inquiry, perhaps someone can point me to a way to debug and narrow down the problem so it is easier to diagnose.

In the code below (too bad we can’t paste line numbers), the error occurs on this line, at the end of the innermost loop body:

                    layer.grad_weight[fi, fj, ic, oc] += sum(local_patch .* err)

Here is the complete function:

function compute_grad_weight!(layer, n_samples)
    H_out, W_out, _, batch_size = size(layer.eps_l)
    f_h, f_w, _, _ = size(layer.grad_weight)
    # @assert f_h == 3 && f_w == 3  # given 3x3 filters (for clarity)

    # Initialize grad_weight to zero
    fill!(layer.grad_weight, 0.0) # no allocations; faster than assignment

    # Use @views to avoid copying subarrays
    @inbounds for oc in axes(layer.eps_l, 3)      # 1:out_channels
        # View of the error for this output channel (all spatial positions, all batches)
        err = @view layer.eps_l[:, :, oc, :]      # size H_out × W_out × batch_size
        for ic in axes(layer.a_below, 3)          # 1:in_channels
            # View of the input activation for this channel
            # (We'll slide this view for each filter offset)
            input_chan = @view layer.a_below[:, :, ic, :]   # size H_in × W_in × batch_size
            for fj in axes(layer.weight,2)
                for fi in axes(layer.weight,1)
                    # Extract the overlapping region of input corresponding to eps_l[:, :, oc, :]
                    local_patch = @view input_chan[fi:fi+H_out-1, fj:fj+W_out-1, :]
                    # Accumulate gradient for weight at (fi,fj, ic, oc)
                    layer.grad_weight[fi, fj, ic, oc] += sum(local_patch .* err)
                end
            end
        end
    end

    # Average over batch (divide by batch_size)
    layer.grad_weight .*= (1 / n_samples)
    return   # nothing
end

Here is the voluminous error message:

Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.       
Exception: EXCEPTION_ACCESS_VIOLATION at 0x2172d2a3166 -- getindex at .\essentials.jl:917 [inlined]
getindex at .\array.jl:930 [inlined]
getindex at .\subarray.jl:320 [inlined]
_getindex at .\abstractarray.jl:1358 [inlined]
getindex at .\abstractarray.jl:1312 [inlined]
_broadcast_getindex at .\broadcast.jl:644 [inlined]
_getindex at .\broadcast.jl:674 [inlined]
_broadcast_getindex at .\broadcast.jl:650 [inlined]
getindex at .\broadcast.jl:610 [inlined]
macro expansion at .\broadcast.jl:973 [inlined]
macro expansion at .\simdloop.jl:77 [inlined]
copyto! at .\broadcast.jl:972 [inlined]
copyto! at .\broadcast.jl:925 [inlined]
copy at .\broadcast.jl:897 [inlined]
materialize at .\broadcast.jl:872 [inlined]
compute_grad_weight! at C:\Users\lewis\code\Convolution\chatgpt_conv_code\src\sample_code.jl:597
in expression starting at REPL[10]:1
getindex at .\essentials.jl:917 [inlined]
getindex at .\array.jl:930 [inlined]
getindex at .\subarray.jl:320 [inlined]
_getindex at .\abstractarray.jl:1358 [inlined]
getindex at .\abstractarray.jl:1312 [inlined]
_broadcast_getindex at .\broadcast.jl:644 [inlined]
_getindex at .\broadcast.jl:674 [inlined]
_broadcast_getindex at .\broadcast.jl:650 [inlined]
getindex at .\broadcast.jl:610 [inlined]
macro expansion at .\broadcast.jl:973 [inlined]
macro expansion at .\simdloop.jl:77 [inlined]
copyto! at .\broadcast.jl:972 [inlined]
copyto! at .\broadcast.jl:925 [inlined]
copy at .\broadcast.jl:897 [inlined]
materialize at .\broadcast.jl:872 [inlined]
compute_grad_weight! at C:\Users\lewis\code\Convolution\chatgpt_conv_code\src\sample_code.jl:597
layer_backward! at C:\Users\lewis\code\Convolution\chatgpt_conv_code\src\sample_code.jl:566
unknown function (ip: 000002172d2aa8b7)
backprop! at C:\Users\lewis\code\Convolution\chatgpt_conv_code\src\sample_code.jl:801
#train_loop!#17 at C:\Users\lewis\code\Convolution\chatgpt_conv_code\src\sample_code.jl:864
train_loop! at C:\Users\lewis\code\Convolution\chatgpt_conv_code\src\sample_code.jl:821
unknown function (ip: 0000021703b9be70)
jl_apply at C:/workdir/src\julia.h:2157 [inlined]
do_call at C:/workdir/src\interpreter.c:126
eval_value at C:/workdir/src\interpreter.c:223
eval_stmt_value at C:/workdir/src\interpreter.c:174 [inlined]
eval_body at C:/workdir/src\interpreter.c:684
jl_interpret_toplevel_thunk at C:/workdir/src\interpreter.c:824
jl_toplevel_eval_flex at C:/workdir/src\toplevel.c:943
jl_toplevel_eval_flex at C:/workdir/src\toplevel.c:886
jl_toplevel_eval_flex at C:/workdir/src\toplevel.c:886
ijl_toplevel_eval at C:/workdir/src\toplevel.c:952 [inlined]
ijl_toplevel_eval_in at C:/workdir/src\toplevel.c:994
eval at .\boot.jl:430 [inlined]
eval_user_input at C:\workdir\usr\share\julia\stdlib\v1.11\REPL\src\REPL.jl:245
repl_backend_loop at C:\workdir\usr\share\julia\stdlib\v1.11\REPL\src\REPL.jl:342
#start_repl_backend#59 at C:\workdir\usr\share\julia\stdlib\v1.11\REPL\src\REPL.jl:327
start_repl_backend at C:\workdir\usr\share\julia\stdlib\v1.11\REPL\src\REPL.jl:324
#run_repl#72 at C:\workdir\usr\share\julia\stdlib\v1.11\REPL\src\REPL.jl:483
run_repl at C:\workdir\usr\share\julia\stdlib\v1.11\REPL\src\REPL.jl:469
jfptr_run_repl_10360.1 at C:\Users\lewis\AppData\Local\Programs\Julia-1.11.4\share\julia\compiled\v1.11\REPL\u0gqU_hz07T.dll (unknown line)
#1150 at .\client.jl:446
jfptr_YY.1150_15097.1 at C:\Users\lewis\AppData\Local\Programs\Julia-1.11.4\share\julia\compiled\v1.11\REPL\u0gqU_hz07T.dll (unknown line)
jl_apply at C:/workdir/src\julia.h:2157 [inlined]
jl_f__call_latest at C:/workdir/src\builtins.c:875
#invokelatest#2 at .\essentials.jl:1055 [inlined]
invokelatest at .\essentials.jl:1052 [inlined]
run_main_repl at .\client.jl:430
repl_main at .\client.jl:567 [inlined]
_start at .\client.jl:541
jfptr__start_75324.1 at C:\Users\lewis\AppData\Local\Programs\Julia-1.11.4\lib\julia\sys.dll (unknown line)
jl_apply at C:/workdir/src\julia.h:2157 [inlined]
true_main at C:/workdir/src\jlapi.c:900
jl_repl_entrypoint at C:/workdir/src\jlapi.c:1059
mainCRTStartup at C:/workdir/cli\loader_exe.c:58
BaseThreadInitThunk at C:\Windows\System32\KERNEL32.DLL (unknown line)
RtlUserThreadStart at C:\Windows\SYSTEM32\ntdll.dll (unknown line)
Allocations: 39319545 (Pool: 39318320; Big: 1225); GC: 63

Well, since your code claims to always be in bounds, if an out-of-bounds access happens, you’ll naturally get a segmentation fault. Remove the @inbounds and you should get a better error message, or run with --check-bounds=yes.

I suspect it has to do with your indexing - you iterate over the third axis, but index with that into the fourth dimension of layer.grad_weight.
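
To make this concrete, here is a minimal standalone sketch (not taken from the code above) of the difference. Inside a compiled function, @inbounds elides the check, so an out-of-range index may silently read the wrong memory; the checked version throws a BoundsError that reports the offending index. Starting Julia with --check-bounds=yes forces the checked behavior everywhere, even where @inbounds is written.

A = zeros(4, 4)
v = @view A[2:4, 2:4]            # 3×3 view into A

unchecked(v) = @inbounds v[4, 1] # index 4 is out of range for the view; check elided
checked(v)   = v[4, 1]           # same access, bounds check kept

unchecked(v)   # may return a stale or wrong value, or crash, depending on memory layout
checked(v)     # throws BoundsError and names the attempted index [4, 1]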


An unrelated comment. Since you’re concerned about unnecessary copying, this construction allocates a temporary array for the result of local_patch .* err. You can avoid it by doing mapreduce(splat(*), +, zip(local_patch, err)) instead. Or possibly dot(local_patch, err).
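
A quick self-contained sketch of those two suggestions, using stand-in 3×3 arrays rather than the actual local_patch and err views from the function above:

using LinearAlgebra   # for dot

local_patch = fill(0.5, 3, 3)
err         = fill(0.4, 3, 3)

s0 = sum(local_patch .* err)                        # allocates the element-wise product
s1 = mapreduce(splat(*), +, zip(local_patch, err))  # no temporary array
s2 = dot(local_patch, err)                          # inner product, also allocation-free
s0 ≈ s1 ≈ s2                                        # all three agree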

Of course. I’ll see if I get a better message.

But, I’ll point out that this code was developed and tested using Julia on a Mac.

The problem is in the Julia runtime for Windows. I don’t expect to be able to fix this.

Thanks. Great suggestions. I’ve been trying to purge the allocations and still miss obvious alternatives.

If you run this code without that @inbounds on your Mac, I’d expect you’d get a BoundsError just the same. Windows and Mac don’t necessarily react the same way to out-of-bounds accesses: they are different operating systems with possibly different page sizes, and thus different limits before the OS itself errors, which is what EXCEPTION_ACCESS_VIOLATION is communicating. Just because you don’t get an exception on one OS doesn’t mean the code itself is correct.

This difference in behavior is not generally due to an implementation detail in Julia itself.

That is quite a strong claim, and a minimal example of that would certainly be appreciated!


Fair enough.

The code provided runs on Mac without segfaulting and with correct answers. If macOS were allowing an out-of-bounds memory access (bad, bad), it would be a remarkable coincidence that the answers were always correct. Removing @inbounds on the Mac would not fix anything, but it would let Julia catch the out-of-bounds condition, terminate the running code gracefully, and report the error without abend’ing (ancient terminology). Will try it and report back.

I was not trying to be provocative. The code provided is the example. Hard to break it down, I realize.


You were right! The message is much more useful because it shows the array indices being attempted, which points directly to the problem.

Also, the suggestions to use mapreduce(…) or dot did run without a problem, but they didn’t produce convergence when training the model. So what the Mac code was doing was strange indeed: implicit zero padding (not literally, but “helpfully” returning zeros)? Who knows.

Back to the drawing board a little. A good discovery given how the error had been masked.

Thanks, all.

I am replying by email. I can go into Discourse and mark this closed, or go ahead and really fix the code and then report back.


The insidious thing about out-of-bounds accesses is that not all of them will always produce a segmentation fault, even on Mac or Linux - or Windows!

Under the hood, an OS chunks the available memory into what are called pages, which are blocks of memory usually a few kilobytes in size. When you read memory, the OS checks whether the page that memory is on is physically present in RAM and loads it if necessary (this is partly how processes are isolated from one another, as well as how you can allocate more memory than your physical RAM). A segmentation fault (or EXCEPTION_ACCESS_VIOLATION on Windows) happens when you’re trying to access memory that isn’t allocated to your program - but the OS can only check that at the granularity of a single page! If that check were to happen on every memory access, everything would be terribly slow. Due to differences in the size of those memory pages on different operating systems, faulty code can produce correct results on some of them, since the memory there may be (coincidentally!) zeroed, or the access may not touch a new page, masking the problem.

You would get faulty results even without a segmentation fault if the memory previously contained other data; the memory may, after all, be reused from a previous allocation without being zeroed in between. It really is just a coincidence that the code worked on Mac - it might not work on a different machine!
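
A tiny illustration of that last point (nothing to do with the thread’s code): memory handed back by the allocator is not guaranteed to be zeroed, so reading it “works” but can yield stale data.

a = rand(Float64, 1_000)           # fill some memory with non-zero values
a = nothing; GC.gc()               # drop the reference; when it gets reused is up to the GC
b = Vector{Float64}(undef, 1_000)  # `undef` promises no zeroing
any(!iszero, b)                    # often true: leftover bit patterns show through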

Apologies for the snarky reply; I was just really sure that the @inbounds was masking the actual problem :slight_smile: I hope you can find the real bug!

Fixed.

“Same” padding requires explicit padding for the backpropagation of convolution weights, though not for the activations (layer loss). Simple to add. Works on Mac and Windows.
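
For concreteness, here is a minimal sketch of the kind of fix described (a hypothetical pad_input helper, not the actual code in sample_code.jl): zero-pad the input activations so that every fi:fi+H_out-1, fj:fj+W_out-1 window in compute_grad_weight! stays in bounds.

# Assumes "same" convolutions with odd filter sizes, where pad = (f_h - 1) ÷ 2
# (e.g. pad = 1 for 3×3 filters).
function pad_input(a::AbstractArray{T,4}, pad::Int) where {T}
    H, W, C, B = size(a)
    padded = zeros(T, H + 2pad, W + 2pad, C, B)   # explicit zero padding
    padded[pad+1:pad+H, pad+1:pad+W, :, :] .= a   # copy the original into the center
    return padded
end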

I would never have discovered this problem if I hadn’t run the code on Windows. Even on Windows, the segfault with @inbounds was interesting: about 20-30 iterations would run before the crash. Fascinating, because I pre-allocate all the training arrays in advance. I’ve tried to get rid of allocations, but not all the way yet. So probably, on Windows, some allocations forced some data to move in memory, and after the move to the new page(s) the access crossed a page boundary.

Interesting that this hadn’t (yet) occurred on Mac, not because I was so brilliant as to avoid all allocations, but maybe because the page size is different and the pages were zero-initialized (does the OS even do that?), so it looked like the effect of (slightly misaligned) zero padding.

Gradient descent survives some surprising sloppiness as long as the sloppiness is consistent across feed-forward and backprop. But sometimes it’s not so resilient. Amazing that a new industry is built on such ad hoc (less politely, hacky) technology. (Shhh… don’t let the public or the VCs know!)

That’s the thing about out-of-bounds access: it’s undefined behaviour, so anything can happen. Your out-of-bounds access can hit anything. It can overwrite data, so you get a wrong result. It can overwrite internal book-keeping data in Julia (counters, pointers, whatever), which may cause some other access to go wrong much later in the execution. If you’re lucky, it hits non-mapped memory directly and you get an exception.

I quickly used dot(local_patch, err) because it’s easy and obvious. Allocations go down from 160 bytes to 16. (I forget the exact speed difference, but I think it’s about 0.6 of the time.) dot runs in ~19 ns on my quick machine using @benchmark.

But mapreduce(…) is even better: still 16 bytes allocated, but 1 allocation instead of 3, and it runs in about half the time of dot, at ~10.3 ns.

Writing an explicit loop using @inbounds is a tiny bit faster at ~9 ns, with 1 allocation of 16 bytes. But it’s less general.

Always something to learn…

EDIT: Reading your first post again, it looks like you may eventually want this code to run on a GPU. In that case, dot is probably the way to go, and if not, definitely mapreduce rather than one of the sum variants below. But I don’t think having a zip in there is GPU-friendly; I think the mapreduce invocation you want is the following:

mapreduce(*, +, local_patch, err)

The mapreduce expression is kind of obscure, and mapreduce is generally not as well optimized as it should be (it often creates unnecessary intermediate allocations). I’d recommend a simple and easy-to-understand generator sum:

sum(l * e for (l, e) in zip(local_patch, err))

or even

sum(local_patch[i] * err[i] for i in eachindex(local_patch, err))

Curious how they affect your benchmarks.


And to drive home the point from the thread above: @inbounds is dangerous, precisely because you can’t rely on anyone saving your bacon if the index is actually out of bounds. Sometimes you get a segfault, sometimes you corrupt some program state, and sometimes you simply get an arbitrary value, often but not always zero. Avoid @inbounds as much as you can, and if your benchmarks show that you really need it for performance, add a line of code in the same function, but outside the hot loop, that checks that all the indices that will be visited during the loop are in bounds. See this post for an example of how that might look: When does @inbounds increase performance? - #3 by danielwe
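
A minimal sketch of that pattern (not the exact code from the linked post), with a hypothetical checked_dotsum helper: hoist a single check out of the hot loop so the @inbounds inside it is provably safe.

function checked_dotsum(x::AbstractArray, y::AbstractArray)
    # One up-front check guarantees every index visited below is valid for both arrays.
    axes(x) == axes(y) || throw(DimensionMismatch("x and y must have matching axes"))
    s = zero(eltype(x)) * zero(eltype(y))
    @inbounds for i in eachindex(x, y)
        s += x[i] * y[i]
    end
    return s
end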

For your dining, dancing and micro-benchmarking pleasure!

Drumroll please!

The comprehensions win by a nose!

The original sum of the element-wise multiplication is by far the worst. Even the better expressions have a single allocation of 16 bytes; I can live with that. (Sorry, the miniature histogram plots wouldn’t survive copy/paste into my email client.)


> local_patch=fill(0.5, 3,3); err = fill(0.4,3,3)
3×3 Matrix{Float64}:
0.4 0.4 0.4
0.4 0.4 0.4
0.4 0.4 0.4

> @benchmark sum(l * e for (l, e) in zip($local_patch, $err)) (setup = (local_patch=fill(0.5,3,3), err=(fill(0.4,3,3))))
BenchmarkTools.Trial: 10000 samples with 999 evaluations per sample.
Range (min … max): 10.719 ns … 1.137 μs ┊ GC (min … max): 0.00% … 97.79%
Time (median): 12.303 ns ┊ GC (median): 0.00%
Time (mean ± σ): 12.765 ns ± 14.614 ns ┊ GC (mean ± σ): 2.75% ± 3.68%
Memory estimate: 16 bytes, allocs estimate: 1.

> @benchmark dot($local_patch, $err) (setup = (local_patch=fill(0.5,3,3), err=(fill(0.4,3,3))))

BenchmarkTools.Trial: 10000 samples with 999 evaluations per sample.
Range (min … max): 13.471 ns … 289.414 ns ┊ GC (min … max): 0.00% … 90.43%
Time (median): 17.267 ns ┊ GC (median): 0.00%
Time (mean ± σ): 17.334 ns ± 5.765 ns ┊ GC (mean ± σ): 0.98% ± 2.93%
Memory estimate: 16 bytes, allocs estimate: 1.

> @benchmark sum($local_patch .* $err) (setup = (local_patch=fill(0.5,3,3), err=(fill(0.4,3,3))))

BenchmarkTools.Trial: 10000 samples with 997 evaluations per sample.
Range (min … max): 21.773 ns … 654.213 ns ┊ GC (min … max): 0.00% … 94.03%
Time (median): 24.824 ns ┊ GC (median): 0.00%
Time (mean ± σ): 27.657 ns ± 31.318 ns ┊ GC (mean ± σ): 10.80% ± 8.90%
Memory estimate: 160 bytes, allocs estimate: 3.

> @benchmark mapreduce(splat(*), +, zip($err, $local_patch)) (setup = (local_patch=fill(0.5,3,3), err=(fill(0.4,3,3))))

BenchmarkTools.Trial: 10000 samples with 999 evaluations per sample.
Range (min … max): 11.386 ns … 297.255 ns ┊ GC (min … max): 0.00% … 93.35%
Time (median): 11.887 ns ┊ GC (median): 0.00%
Time (mean ± σ): 12.370 ns ± 4.880 ns ┊ GC (mean ± σ): 0.65% ± 1.60%
Memory estimate: 16 bytes, allocs estimate: 1.

> @benchmark sum(local_patch[i] * err[i] for i in eachindex(local_patch, err)) (setup = (local_patch=fill(0.5,3,3), err=(fill(0.4,3,3))))

BenchmarkTools.Trial: 10000 samples with 999 evaluations per sample.
Range (min … max): 10.928 ns … 305.222 ns ┊ GC (min … max): 0.00% … 93.48%
Time (median): 13.514 ns ┊ GC (median): 0.00%
Time (mean ± σ): 13.810 ns ± 5.085 ns ┊ GC (mean ± σ): 0.59% ± 1.60%
Memory estimate: 16 bytes, allocs estimate: 1.

Your setup argument isn’t doing what you intend; you’re creating a NamedTuple instead of a block containing two expressions. As a result, the actual benchmark is grabbing local_patch and err from global scope, which explains your allocation. The fix is to insert a semicolon in place of the comma, as follows:

julia> @benchmark sum(local_patch[i] * err[i] for i in eachindex(local_patch, err)) setup=(local_patch = fill(0.5,3,3); err = fill(0.4,3,3))

You should adjust the other benchmarking expressions similarly, and make sure you do not interpolate ($) variables that are created during setup; in other words, you should not use $ at all in these benchmarks.

Just to be sure, do this in a fresh Julia session where you have not created variables named local_patch and err in the global scope.

You should see significantly smaller numbers, especially for the versions that only have a single allocation in your current benchmarks.

Once again, awesome input. I have to re-study the BenchmarkTools docs.

You are so right. The order of the fastest outcomes changes a bit. No allocations at all for any of them except the original sum of the element-wise multiplication, and execution times are less than half those of the erroneously constructed benchmarks.

The comprehension with zip wins, then mapreduce, then dot. The absolute differences are so small that, among the top three, the choice might come down to style even in a hot loop. But the comprehension is brief and clear.

The comprehension with array indices is more costly, and the sum of the element-wise multiplication looks even worse than before.

OK. This exercise was very instructive and we can call it done.

@benchmark sum(l * e for (l, e) in zip(local_patch, err)) (setup = (local_patch=fill(0.5,3,3); err=(fill(0.4,3,3))))

BenchmarkTools.Trial: 10000 samples with 1000 evaluations per sample.
 Range (min … max):  3.541 ns … 13.708 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.625 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.740 ns ±  0.261 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂ █  █ ▅  ▂           ▃  ▇ ▅  ▂                            ▂
  █▁█▁▁█▁█▁▁█▁▇▁▁▆▁▆▁▁▆▁█▁▁█▁█▁▁█▁█▁▁▆▁▅▁▁▃▁▁▁▁▄▁▁▁▁▃▁▄▁▁▅▁▃ █
  3.54 ns      Histogram: log(frequency) by time      4.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
@benchmark mapreduce(splat(*), +, zip(err, local_patch)) (setup = (local_patch=fill(0.5,3,3); err=(fill(0.4,3,3))))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations per sample.
 Range (min … max):  3.500 ns … 18.583 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.792 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.749 ns ±  0.268 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▃  █  ▇  ▃     ▁  █  █  ▆ ▁▂                             ▂
  ▄▁█▁▁█▁▁█▁▁█▁▁▇▁▁█▁▁█▁▁█▁▁█▁██▁▇▁▁▆▁▁▅▁▁▅▁▁▃▁▁▃▁▁▄▁▁▄▁▁▄▁▃ █
  3.5 ns       Histogram: log(frequency) by time     4.33 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
using LinearAlgebra

julia> @benchmark dot(local_patch, err) (setup = (local_patch=fill(0.5,3,3); err=(fill(0.4,3,3))))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations per sample.
 Range (min … max):  5.541 ns … 20.833 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.666 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.838 ns ±  0.418 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃██ █▄ ▁            ▂ ▇▇ ▃▂                                ▂
  ███▁██▁██▁█▇▁▇▆▁▇▇▁▇█▁██▁██▁▇▇▁▅▅▁▄▅▁▃▁▁▅▅▁▁▁▁▅▄▁▇▅▁▆▆▁▅▃▄ █
  5.54 ns      Histogram: log(frequency) by time     7.17 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
@benchmark sum(local_patch[i] * err[i] for i in eachindex(local_patch, err)) (setup = (local_patch=fill(0.5,3,3); err=(fill(0.4,3,3))))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations per sample.
 Range (min … max):  4.625 ns … 21.292 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.166 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.988 ns ±  0.362 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▄  █  ▇  ▅   ▂  ▁                  ▂  ▇   █  ▄  ▂       ▂
  ▆▁▁█▁▁█▁▁█▁▁█▁▁▁█▁▁█▁▁▇▁▁▇▁▁▇▁▁▁▇▁▁▇▁▁█▁▁█▁▁▁█▁▁█▁▁█▁▁▇▁▁▅ █
  4.62 ns      Histogram: log(frequency) by time     5.38 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
@benchmark sum(local_patch .* err) (setup = (local_patch=fill(0.5,3,3); err=(fill(0.4,3,3))))


BenchmarkTools.Trial: 10000 samples with 998 evaluations per sample.
 Range (min … max):  14.988 ns … 719.857 ns  ┊ GC (min … max):  0.00% … 95.53%
 Time  (median):     16.492 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   19.495 ns ±  32.341 ns  ┊ GC (mean ± σ):  13.73% ±  8.06%

     ▆▅▂       █▁                                               
  ▂▂▃███▇▅▄▄▃▄███▆▃▃▃▂▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂▂▂▂ ▃
  15 ns           Histogram: frequency by time         25.2 ns <

 Memory estimate: 144 bytes, allocs estimate: 2.

Oh, interesting, this is a regression in Julia 1.11. In Julia 1.10, the eachindex and zip versions perform identically. What’s happening is probably the same as in this thread: Nextfloat is slower on 2D array than on 1D array - #8 by Oscar_Smith; that is, inboundsness is no longer inferred when using eachindex on 2D (or higher-dimensional) arrays.

Is it possible to introduce an explicit @inbounds annotation within the comprehension?