a = rand(Float32, h*w)
b = reshape(a, h, w)
c = similar(a)

I would like to iterate over array b, calculate a result based on element value and its 2D coordinates and put it in c. The order of elements in c is irrelevant and it can be constructed in a different way than shown above.

The options I see are:

Use push(c, result) to initially empty array c, which is very inefficient, unless there is a way to reserve memory for empty array.

Use two iterators of different shape in parallel, which I donāt know how to.

Use 2D iterator over array b and calculate 1D index (or iterator) from it to put the result in array c

What is a clear and concise way to do it? Performance is not critical here.

I donāt think the performance of push! is so bad (I donāt believe it naively allocates on every push but does something smarter to reduce the churn), especially for something that is not performance critical. I think it often makes for simpler code.

a = rand(Float32, h*w)
b = reshape(a, h, w)
c = similar(a)
foo(x) = 2 * x
foo(x, i) = x + i[2]
# this one doesn't care about the 2D index of b, but uses a linear index for both
for i in eachindex(b, c)
c[i] = foo(b[i])
end
# if you need the 2D array indices
ind = CartesianIndices(b)
for i in eachindex(b, c)
c[i] = foo(b[i], ind[i])
end

Thank you all for suggestions. Thinking my problem through, I noticed that array dimensionality in Julia is very fungible through reshape, which seems to be computationally cheap. I was stuck in a different language mindset where this was not a simple and cheap task. As a consequence, my initial requirement about iterating simultaneously over two arrays with different dimensions, but the same number of elements, is not necessary. I can restate it as:

a = rand(Float32, h*w)
b = reshape(a, h, w)
c = similar(b)

so I will be iterating over arrays b and c with the same dimensions, which makes it simpler. When different dimensions are required by a consuming function, I can just reshape resultant array c :

Here is my first stab at this with f0 using a plain integer indexes and f1 using CartesianIndices

depth2d = reshape(depth1d, width, height)
depth4d = similar(depth2d, Point3f)
function f0(depth2d, depth4d)
@inbounds for i = 1:width, j = 1:height
d = min(rangeinf, depth2d[i, j])
depth4d[i, j] = Point3f(d*(i-cx)/fl, d*(j-cy)/fl, d)
end
end
function f1(depth2d, depth4d)
for i in CartesianIndices(depth2d)
d = min(rangeinf, depth2d[i])
depth4d[i] = Point3f(d*(i[1]-cx)/fl, d*(i[2]-cy)/fl, d)
end
end
julia> judge(median(@benchmark f0(depth2d, depth4d)), median(@benchmark f1(depth2d, depth4d)))
BenchmarkTools.TrialJudgement:
time: -16.82% => improvement (5.00% tolerance)

Employing CartesianIndices doesnāt make this simple, tiny function any clearer, and is less performant. Was this expected?

This is not a self-contained example, so I canāt run it for proper comparison.

But, at least, you are using @inbounds in one function and not in the other. Also, remember to interpolate variables with $ when benchmarking with BenchmarkTools.

(I think nearly a third of my posts on this forum contain the phrase āremember to interpolateā.)

I thought that @inbounds in f1 was redundant, since CartesianIndices(depth2d) are guaranteed to be in range by design. That assumption was obviously wrong, because adding it makes f1 much faster now:

mutable struct Point3f
x::Float32
y::Float32
z::Float32
end
const rangeinf = 10.5f0
const fl = 554.3f0
const cx = 320.5f0
const cy = 240.5f0
const width = 640
const height = 480
depth1d = rand(Float32, width*height)
depth2d = reshape(depth1d, width, height)
depth4d = similar(depth2d, Point3f)
function f0(depth2d, depth4d)
@inbounds for i = 1:width, j = 1:height
d = min(rangeinf, depth2d[i, j])
depth4d[i, j] = Point3f(d*(i-cx)/fl, d*(j-cy)/fl, d)
end
end
function f1(depth2d, depth4d)
@inbounds for i in CartesianIndices(depth2d)
d = min(rangeinf, depth2d[i])
depth4d[i] = Point3f(d*(i[1]-cx)/fl, d*(i[2]-cy)/fl, d)
end
end
println(judge(median(@benchmark f1($depth2d, $depth4d)), median(@benchmark f0($depth2d, $depth4d))))

I switched the arguments in judge:

TrialJudgement(-23.80% => improvement)

Interpolating variables with $ when benchmarking, doesnāt affect the timings in this case, but itās good to know that this is a standard practice.

I know youāve moved targets a bit here, but this is pretty easy with zip. You can always do cleverer things, but this is probably the simplest and clearest:

a = rand(Float32, h*w)
b = reshape(a, h, w)
c = similar(a)
for (ib, ic) in zip(eachindex(b), eachindex(c))
c[ic] = f(b[ib])
end

Itās not necessarily inbounds when used to index into depth4d, though! What you can see and what the compiler can prove are two different things, which is precisely why we allow you to use @inbounds.

Iterating over CartesianIndices will never be faster than the for loops over the constituent indicesā¦ that is, if you have the loops in the right order! Your f0 is row-major. Fixing that makes it 40% faster than f1.

julia> function f0c(depth2d, depth4d)
@inbounds for j = 1:height, i = 1:width
d = min(rangeinf, depth2d[i, j])
depth4d[i, j] = Point3f(d*(i-cx)/fl, d*(j-cy)/fl, d)
end
end
f0c (generic function with 1 method)
julia> @btime f0($depth2d, $depth4d)
3.385 ms (307200 allocations: 9.38 MiB)
julia> @btime f0c($depth2d, $depth4d) # with for loops swapped
1.740 ms (307200 allocations: 9.38 MiB)
julia> @btime f1($depth2d, $depth4d)
2.124 ms (307200 allocations: 9.38 MiB)

For a thing as simple as Point3f, though, Iād highly recommend using an immutable struct. If you can do so, you get to:

Finally, note that iterating over CartesianIndices has special @simd support that can often help its performance, even if the loop itself isnāt simd-able:

julia> function f1s(depth2d, depth4d)
@inbounds @simd for i in CartesianIndices(depth2d)
d = min(rangeinf, depth2d[i])
depth4d[i] = Point3f(d*(i[1]-cx)/fl, d*(i[2]-cy)/fl, d)
end
end
f1s (generic function with 1 method)
julia> @btime f1s($depth2d, $depth4d)
387.986 Ī¼s (0 allocations: 0 bytes)

Excellent! This is the kind of response I was hoping for, thanks a lot. Iām going to benchmark this function against C++, so this improvement helps a lot.

using StructArrays, LoopVectorization
depth4d_SA = StructArray((x = similar(depth2d), y = similar(depth2d), z = similar(depth2d)))
function f2_SA(depth2d, depth4d)
ii, jj = axes(depth2d)
@avx for i in ii, j in jj
d = min(rangeinf, depth2d[i,j])
depth4ds.x[i,j] = d*(1.0*i - cx)/fl
depth4ds.y[i,j] = d*(1.0*j - cy)/fl
depth4ds.z[i,j] = d
end
end

Thatās interesting, but I donāt think I can use it. The consumer of depth4d requires:

eltype(depth4d) == Point3f

I made an improvement by precalculating the coefficients. Here is the version with repeatable input and more realistic rangeinf value for benchmarking:

using BenchmarkTools, Random
struct Point3f
x::Float32
y::Float32
z::Float32
end
const rangeinf = 0.95f0
const fl = 554.3f0
const cx = 320.5f0
const cy = 240.5f0
const width = 640
const height = 480
depth1d = rand(MersenneTwister(12345), Float32, width*height)
depth2d = reshape(depth1d, width, height)
depth4d = similar(depth2d, Point3f)
function f7(depth2d, depth4d)
dx = [(i-cx)/fl for i in 1:width]
dy = [(j-cy)/fl for j in 1:height]
@inbounds for j in 1:height
@inbounds for i in 1:width
d = min(rangeinf, depth2d[i, j])
depth4d[i, j] = Point3f(d*dx[i], d*dy[j], d)
end
end
end

I was curious how the timings will look like across different CPUs and Julia versions. Here are three functions benchmarked (the same globals as above):

function f7(depth2d, depth4d)
dx = [(i-cx)/fl for i in 1:width]
dy = [(j-cy)/fl for j in 1:height]
@inbounds for j in 1:height
@inbounds for i in 1:width
d = min(rangeinf, depth2d[i, j])
depth4d[i, j] = Point3f(d*dx[i], d*dy[j], d)
end
end
end
function f8(depth2d, depth4d)
dx = [(i-cx)/fl for i in 1:width]
dy = [(j-cy)/fl for j in 1:height]
@inbounds @simd for i in CartesianIndices(depth2d)
d = min(rangeinf, depth2d[i])
depth4d[i] = Point3f(d*dx[i[1]], d*dy[i[2]], d)
end
end
function f9(depth2d, depth4d)
dx = [(i-cx)/fl for i in 1:width]
dy = [(j-cy)/fl for j in 1:height]
ii, jj = axes(depth2d)
@avx for i in ii, j in jj
d = min(rangeinf, depth2d[i])
depth4d[i] = Point3f(d*dx[i[1]], d*dy[i[2]], d)
end
end

In conclusion, no performance difference between Julia v1.4.2, v1.5.0-rc1 and v1.6.0-DEV.305; significant performance jump on f7 and f8 from Haswell to Skylake CPU.

julia> depth4ds = StructArray{Point3f}((x = similar(depth2d), y = similar(depth2d), z = similar(depth2d)))
julia> eltype(depth4ds)
Point3f

With @avx, preallocating coefficients actually seems to slow things down a bit:

function f10(depth2d, depth4ds)
dx = [(i-cx)/fl for i in 1:width]
dy = [(j-cy)/fl for j in 1:height]
@avx for j in 1:height, i in 1:width
d = min(rangeinf, depth2d[i, j])
depth4ds.x[i, j] = d*dx[i]
depth4ds.y[i, j] = d*dy[j]
depth4ds.z[i, j] = d
end
end
function f11(depth2d, depth4ds)
@avx for j in 1:height, i in 1:width
d = min(rangeinf, depth2d[i, j])
depth4ds.x[i, j] = d*(1.0*i - cx)/fl
depth4ds.y[i, j] = d*(1.0*j - cy)/fl
depth4ds.z[i, j] = d
end
end

On such an inexpensive calculation, performance is going to be limited mostly by memory bandwidth/allocations, and instantiating those coefficient arrays causes allocation. How does f11 perform on your Skylake machine?

Incidentaly, the new functions crash Julia VS Code extension REPL:

julia> Internal Error: ERROR: MethodError: no method matching size(::StructArray{Point3f,2,NamedTuple{(:x, :y, :z),Tuple{Array{Float32,2},Array{Float32,2},Array{Float32,2}}},Int64})
The applicable method may be too new: running in world age 27143, while current world is 27185.
Closest candidates are:
size(::StructArray) at /home/paul/.julia/packages/StructArrays/OtfvU/src/structarray.jl:130 (method too new to be called from this world context.)
size(::AbstractArray{T,N}, ::Any) where {T, N} at abstractarray.jl:38
size(::BitArray{1}) at bitarray.jl:99
...
Stacktrace:
[1] treerender(::StructArray{Point3f,2,NamedTuple{(:x, :y, :z),Tuple{Array{Float32,2},Array{Float32,2},Array{Float32,2}}},Int64}) at /home/paul/.vscode/extensions/julialang.language-julia-0.16.8/scripts/packages/VSCodeServer/src/trees.jl:125
[2] treerender(::VSCodeServer.SubTree) at /home/paul/.vscode/extensions/julialang.language-julia-0.16.8/scripts/packages/VSCodeServer/src/trees.jl:54
[3] repl_getvariables_request(::VSCodeServer.JSONRPC.JSONRPCEndpoint, ::Nothing) at /home/paul/.vscode/extensions/julialang.language-julia-0.16.8/scripts/packages/VSCodeServer/src/request_handlers.jl:88
[4] dispatch_msg(::VSCodeServer.JSONRPC.JSONRPCEndpoint, ::VSCodeServer.JSONRPC.MsgDispatcher, ::Dict{String,Any}) at /home/paul/.vscode/extensions/julialang.language-julia-0.16.8/scripts/packages/JSONRPC/src/typed.jl:63
[5] macro expansion at /home/paul/.vscode/extensions/julialang.language-julia-0.16.8/scripts/packages/VSCodeServer/src/VSCodeServer.jl:83 [inlined]
[6] (::VSCodeServer.var"#53#55"{Bool,String})() at ./task.jl:358

They work fine in terminal. Iām going to file an issue.