Nested loop is slow when indices are in a struct

I have a function that loops over a 2D array. Only some of the indices in the second dimension are used, and they are passed in as a unit range. Performance is much worse when the range is stored in a struct. Does anyone know what causes the difference?

using BenchmarkTools

struct MyStruct
  inds :: UnitRange
end

A    = rand(25000,4)
inds = 1:3
str  = MyStruct(inds)

function f1(A, inds)
  for i in inds
    for k = axes(A,1)
      A[k,i] = A[k,i] ^ 2
    end
  end
end

function f2(A, str)
  inds = str.inds
  for i in inds
    for k = axes(A,1)
      A[k,i] = A[k,i] ^ 2
    end
  end
end

@btime f1($A,$inds)
@btime f2($A,$str)

Output:

  33.200 μs (0 allocations: 0 bytes)
  5.780 ms (296937 allocations: 4.53 MiB)

Also, I notice that the first version gets noticeably faster if I create an intermediate view:

function f3(A,inds)
  for i in inds
    B = view(A,:,i)
    for k = axes(A,1)
      B[k] = B[k] ^ 2
    end
  end
end
@btime f3($A,$inds)
  5.817 μs (0 allocations: 0 bytes)

Is it bad practice to collect things in structs? Or am I missing something?

UnitRange without its type parameter is not a concrete type, so the compiler cannot infer the type of the field. Remember to specify the concrete type, UnitRange{Int64}, in the struct definition.
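A quick way to check whether a field type is concrete is `isconcretetype` together with `fieldtype`. The sketch below uses two hypothetical structs, `LooseRange` and `TightRange`, that mirror the two ways of declaring the field. They are illustrative names, not part of the original code.

```julia
# Hypothetical structs for illustration only.
struct LooseRange
    inds::UnitRange          # element type left open
end

struct TightRange
    inds::UnitRange{Int64}   # fully specified
end

# UnitRange alone is not concrete; UnitRange{Int64} is.
println(isconcretetype(UnitRange))          # false
println(isconcretetype(UnitRange{Int64}))   # true

# The same check applied to the declared field types.
println(isconcretetype(fieldtype(LooseRange, :inds)))  # false
println(isconcretetype(fieldtype(TightRange, :inds)))  # true
```

When the check returns false for a field, code that reads that field generally cannot be specialised on its type, which is where the allocations in f2 come from.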

After this change, all three versions take about the same time on my computer:

f1 → 3.462 μs (0 allocations: 0 bytes)
f2 → 3.550 μs (0 allocations: 0 bytes)
f3 → 3.487 μs (0 allocations: 0 bytes)

There should be no difference between f1 and f3.

julia> versioninfo()
Julia Version 1.10.10
Commit 95f30e51f4 (2025-06-27 09:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 32 × AMD Ryzen 9 7950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 32 virtual cores)
Environment:
  JULIA_EDITOR = code
  JULIA_VSCODE_REPL = 1

The problem is that the type of inds in MyStruct is not fully specified. If you use ::UnitRange{Int64} or a type parameter, you get the same performance:

struct MyStruct2{T}
  inds :: UnitRange{T}
end

str2 = MyStruct2(1:3)
@btime f1($A, $inds)  # 33.600 μs (0 allocations: 0 bytes)
@btime f2($A, $str2)  # 33.700 μs (0 allocations: 0 bytes)
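If changing the struct definition is not an option, a function barrier is another common workaround: read the field once and pass it to a separate function, so the loops are compiled for the concrete type of the range. This is only a sketch. `LooseStruct` and `square_columns!` are hypothetical names, and I have not benchmarked it here.

```julia
# Hypothetical struct with the same loosely typed field as MyStruct.
struct LooseStruct
    inds::UnitRange
end

# Kernel: compiled separately for each concrete type of `inds`.
function square_columns!(A, inds)
    for i in inds
        for k in axes(A, 1)
            A[k, i] = A[k, i]^2
        end
    end
    return A
end

# Barrier: one dynamic dispatch on s.inds, then the kernel runs type-stable code.
square_columns!(A, s::LooseStruct) = square_columns!(A, s.inds)

A = fill(2.0, 5, 4)
square_columns!(A, LooseStruct(1:3))   # squares columns 1 to 3 in place
```

The dispatch cost is paid once per call instead of once per loop iteration, so the field type no longer matters for the inner loops.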

Presumably the view helps because it moves some bounds checks out of the loop:

function f1ib(A, inds)
  for i in inds
    for k = axes(A,1)
      @inbounds A[k,i] = A[k,i] ^ 2
    end
  end
end


@btime f1ib($A, $inds)  # 10.000 μs (0 allocations: 0 bytes)
@btime f3($A, $inds)    # 10.100 μs (0 allocations: 0 bytes)
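To illustrate what "moving bounds checks out of the loop" could look like when written by hand, here is a sketch that checks each column once with `checkbounds` and then uses `@inbounds` in the inner loop. `f1hoisted` is a hypothetical name, and this is not necessarily what the compiler does for the view version.

```julia
function f1hoisted(A, inds)
    for i in inds
        # One check per column; throws a BoundsError if column i does not exist.
        checkbounds(A, axes(A, 1), i)
        for k in axes(A, 1)
            # Safe to skip per-element checks: the whole column was checked above.
            @inbounds A[k, i] = A[k, i]^2
        end
    end
    return A
end

A = fill(3.0, 4, 4)
f1hoisted(A, 1:3)   # squares columns 1 to 3 in place
```

Unlike a bare `@inbounds`, this version still raises an error for an out-of-range column index instead of reading or writing invalid memory.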