# Iterating over 2D array and placing results in 1D array with the same number of elements

I have:

```
a = rand(Float32, h*w)
b = reshape(a, h, w)
c = similar(a)
```

I would like to iterate over array `b`, calculate a result based on element value and its 2D coordinates and put it in `c`. The order of elements in `c` is irrelevant and it can be constructed in a different way than shown above.

The options I see are:

1. Use `push!(c, result)` on an initially empty array `c`, which is very inefficient, unless there is a way to reserve memory for an empty array.
2. Use two iterators of different shape in parallel, which I don’t know how to do.
3. Use 2D iterator over array `b` and calculate 1D index (or iterator) from it to put the result in array `c`

What is a clear and concise way to do it? Performance is not critical here.

Use option (3), but create a new 2D array `b_new` of the same shape as `b`, and then do

`c = vec(b_new)`.

EDIT: `vec` does not allocate a new array; it just gives a view that looks like a vector.
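A quick sketch confirms this: mutating the result of `vec` is visible through the parent array, so no copy was made (array names here are illustrative):

```julia
# vec returns a vector that shares memory with the parent array (no copy)
b_new = rand(Float32, 3, 4)
c = vec(b_new)
c[1] = 42.0f0
@assert b_new[1, 1] == 42.0f0      # the write is visible through b_new
@assert length(c) == length(b_new) # same number of elements, just a 1D shape
```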


I don’t think the performance of `push!` is so bad (I don’t believe it naively allocates on every push but does something smarter to reduce the churn), especially for something that is not performance critical. I think it often makes for simpler code.

Also, are you aware that you can index into an array using both 2D and 1D indices?
For example

```
julia> a = randn(3,5)
3×5 Array{Float64,2}:
  0.180949   1.04227    0.59108   -1.29255   0.766903
 -1.05146    0.878143  -0.201572   1.16771  -0.603789
 -0.763806  -1.60095   -0.759488   1.00734   0.089463

julia> a[2,3]
-0.20157212679471334

julia> a[8]
-0.20157212679471334
```
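If you ever need that linear index explicitly rather than just using it, `LinearIndices` performs the conversion; a small sketch:

```julia
# LinearIndices maps 2D (cartesian) indices to linear, column-major indices
a = randn(3, 5)
li = LinearIndices(a)
@assert li[2, 3] == 8            # 2 + (3-1)*3 in a 3×5 array
@assert a[li[2, 3]] == a[2, 3]   # both spellings address the same element
```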

`sizehint!(v, 100)` reserves memory. But indeed `push!` is (very) clever about memory use.
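For completeness, a minimal sketch of the `sizehint!` + `push!` pattern from option (1) (sizes and values here are arbitrary):

```julia
c = Float32[]                 # start with an empty vector
sizehint!(c, 100)             # reserve capacity up front to avoid repeated regrowth
for x in 1:100
    push!(c, Float32(2x))     # cheap appends while within the hinted capacity
end
@assert length(c) == 100
@assert c[1] == 2.0f0
```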


You could do this:

```
a = rand(Float32, h*w)
b = reshape(a, h, w)
c = similar(a)

foo(x) = 2 * x
foo(x, i) = x + i

# this one doesn't care about the 2D index of b, but uses a linear index for both
for i in eachindex(b, c)
    c[i] = foo(b[i])
end

# if you need the 2D array indices
ind = CartesianIndices(b)
for i in eachindex(b, c)
    c[i] = foo(b[i], ind[i])
end
```

Thank you all for suggestions. Thinking my problem through, I noticed that array dimensionality in Julia is very fungible through `reshape`, which seems to be computationally cheap. I was stuck in a different language mindset where this was not a simple and cheap task. As a consequence, my initial requirement about iterating simultaneously over two arrays with different dimensions, but the same number of elements, is not necessary. I can restate it as:

```
a = rand(Float32, h*w)
b = reshape(a, h, w)
c = similar(b)
```

so I will be iterating over arrays `b` and `c` with the same dimensions, which makes it simpler. When different dimensions are required by a consuming function, I can just reshape the resultant array `c`:

```
reshape(c, h*w)
```

Or use `vec(c)`, which is more readable and doesn’t require you to keep track of the sizes.
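The two spellings are equivalent for an `Array`; a quick sketch, assuming some `h` and `w`:

```julia
h, w = 4, 5
c = rand(Float32, h, w)
@assert vec(c) == reshape(c, h*w)   # same elements, column-major order
@assert size(vec(c)) == (h*w,)      # vec needs no size bookkeeping
```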


Thanks. `vec(c)` is much better than `reshape(c, h*w)`.

Here is my first stab at this, with `f0` using plain integer indices and `f1` using `CartesianIndices`:

```
depth2d = reshape(depth1d, width, height)
depth4d = similar(depth2d, Point3f)

function f0(depth2d, depth4d)
    @inbounds for i = 1:width, j = 1:height
        d = min(rangeinf, depth2d[i, j])
        depth4d[i, j] = Point3f(d*(i-cx)/fl, d*(j-cy)/fl, d)
    end
end

function f1(depth2d, depth4d)
    for i in CartesianIndices(depth2d)
        d = min(rangeinf, depth2d[i])
        depth4d[i] = Point3f(d*(i[1]-cx)/fl, d*(i[2]-cy)/fl, d)
    end
end

julia> judge(median(@benchmark f0(depth2d, depth4d)), median(@benchmark f1(depth2d, depth4d)))
BenchmarkTools.TrialJudgement:
  time:   -16.82% => improvement (5.00% tolerance)
```

Employing `CartesianIndices` doesn’t make this simple, tiny function any clearer, and is less performant. Was this expected?

This is not a self-contained example, so I can’t run it for proper comparison.

But, at least, you are using `@inbounds` in one function and not in the other. Also, remember to interpolate variables with `$` when benchmarking with BenchmarkTools.

(I think nearly a third of my posts on this forum contain the phrase “remember to interpolate”.)


I thought that `@inbounds` in `f1` was redundant, since `CartesianIndices(depth2d)` are guaranteed to be in range by design. That assumption was obviously wrong, because adding it makes `f1` much faster now:

```
mutable struct Point3f
    x::Float32
    y::Float32
    z::Float32
end

const rangeinf = 10.5f0
const fl = 554.3f0
const cx = 320.5f0
const cy = 240.5f0
const width  = 640
const height = 480

depth1d = rand(Float32, width*height)
depth2d = reshape(depth1d, width, height)
depth4d = similar(depth2d, Point3f)

function f0(depth2d, depth4d)
    @inbounds for i = 1:width, j = 1:height
        d = min(rangeinf, depth2d[i, j])
        depth4d[i, j] = Point3f(d*(i-cx)/fl, d*(j-cy)/fl, d)
    end
end

function f1(depth2d, depth4d)
    @inbounds for i in CartesianIndices(depth2d)
        d = min(rangeinf, depth2d[i])
        depth4d[i] = Point3f(d*(i[1]-cx)/fl, d*(i[2]-cy)/fl, d)
    end
end

println(judge(median(@benchmark f1($depth2d, $depth4d)), median(@benchmark f0($depth2d, $depth4d))))
```

I switched the arguments in `judge`:

```
TrialJudgement(-23.80% => improvement)
```

Interpolating variables with `$` when benchmarking doesn’t affect the timings in this case, but it’s good to know that this is standard practice.


I know you’ve moved targets a bit here, but this is pretty easy with `zip`. You can always do cleverer things, but this is probably the simplest and clearest:

```
a = rand(Float32, h*w)
b = reshape(a, h, w)
c = similar(a)

for (ib, ic) in zip(eachindex(b), eachindex(c))
    c[ic] = f(b[ib])
end
```

It’s not necessarily inbounds when used to index into `depth4d`, though! What you can see and what the compiler can prove are two different things, which is precisely why we allow you to use `@inbounds`.

Iterating over `CartesianIndices` will never be faster than the `for` loops over the constituent indices… that is, if you have the loops in the right order! Your `f0` is row-major. Fixing that makes it 40% faster than `f1`.

```
julia> function f0c(depth2d, depth4d)
           @inbounds for j = 1:height, i = 1:width
               d = min(rangeinf, depth2d[i, j])
               depth4d[i, j] = Point3f(d*(i-cx)/fl, d*(j-cy)/fl, d)
           end
       end
f0c (generic function with 1 method)

julia> @btime f0($depth2d, $depth4d)
  3.385 ms (307200 allocations: 9.38 MiB)

julia> @btime f0c($depth2d, $depth4d) # with for loops swapped
  1.740 ms (307200 allocations: 9.38 MiB)

julia> @btime f1($depth2d, $depth4d)
  2.124 ms (307200 allocations: 9.38 MiB)
```

For a thing as simple as `Point3f`, though, I’d highly recommend using an immutable `struct`. If you can do so, you get:

```
julia> @btime f0($depth2d, $depth4d)
  1.015 ms (0 allocations: 0 bytes)

julia> @btime f0c($depth2d, $depth4d)
  378.240 μs (0 allocations: 0 bytes)

julia> @btime f1($depth2d, $depth4d)
  799.280 μs (0 allocations: 0 bytes)
```

Finally, note that iterating over `CartesianIndices` has special `@simd` support that can often help its performance, even if the loop itself isn’t simd-able:

```
julia> function f1s(depth2d, depth4d)
           @inbounds @simd for i in CartesianIndices(depth2d)
               d = min(rangeinf, depth2d[i])
               depth4d[i] = Point3f(d*(i[1]-cx)/fl, d*(i[2]-cy)/fl, d)
           end
       end
f1s (generic function with 1 method)

julia> @btime f1s($depth2d, $depth4d)
  387.986 μs (0 allocations: 0 bytes)
```

Excellent! This is the kind of response I was hoping for, thanks a lot. I’m going to benchmark this function against C++, so this improvement helps a lot.


Further improvements with LoopVectorization:

```
using StructArrays, LoopVectorization

depth4d_SA = StructArray((x = similar(depth2d), y = similar(depth2d), z = similar(depth2d)))

function f2_SA(depth2d, depth4ds)
    ii, jj = axes(depth2d)
    @avx for i in ii, j in jj
        d = min(rangeinf, depth2d[i,j])
        depth4ds.x[i,j] = d*(1.0*i - cx)/fl
        depth4ds.y[i,j] = d*(1.0*j - cy)/fl
        depth4ds.z[i,j] = d
    end
end
```
```
julia> @btime f1s($depth2d, $depth4d) # 378 μs on mbauman's machine
  657.600 μs (0 allocations: 0 bytes)

julia> @btime f2_SA($depth2d, $depth4d_SA)
  213.100 μs (0 allocations: 0 bytes)
```

That’s interesting, but I don’t think I can use it. The consumer of `depth4d` requires:

```
eltype(depth4d) == Point3f
```

I made an improvement by precalculating the coefficients. Here is the version with repeatable input and more realistic `rangeinf` value for benchmarking:

```
using BenchmarkTools, Random

struct Point3f
    x::Float32
    y::Float32
    z::Float32
end

const rangeinf = 0.95f0
const fl = 554.3f0
const cx = 320.5f0
const cy = 240.5f0
const width  = 640
const height = 480

depth1d = rand(MersenneTwister(12345), Float32, width*height)
depth2d = reshape(depth1d, width, height)
depth4d = similar(depth2d, Point3f)

function f7(depth2d, depth4d)
    dx = [(i-cx)/fl for i in 1:width]
    dy = [(j-cy)/fl for j in 1:height]

    @inbounds for j in 1:height
        @inbounds for i in 1:width
            d = min(rangeinf, depth2d[i, j])
            depth4d[i, j] = Point3f(d*dx[i], d*dy[j], d)
        end
    end
end
```
```
julia> @btime f7($depth2d, $depth4d)
  153.568 μs (2 allocations: 4.61 KiB)
```

I was curious how the timings would look across different CPUs and Julia versions. Here are the three functions benchmarked (same globals as above):

```
function f7(depth2d, depth4d)
    dx = [(i-cx)/fl for i in 1:width]
    dy = [(j-cy)/fl for j in 1:height]

    @inbounds for j in 1:height
        @inbounds for i in 1:width
            d = min(rangeinf, depth2d[i, j])
            depth4d[i, j] = Point3f(d*dx[i], d*dy[j], d)
        end
    end
end

function f8(depth2d, depth4d)
    dx = [(i-cx)/fl for i in 1:width]
    dy = [(j-cy)/fl for j in 1:height]

    @inbounds @simd for i in CartesianIndices(depth2d)
        d = min(rangeinf, depth2d[i])
        depth4d[i] = Point3f(d*dx[i[1]], d*dy[i[2]], d)
    end
end

function f9(depth2d, depth4d)
    dx = [(i-cx)/fl for i in 1:width]
    dy = [(j-cy)/fl for j in 1:height]

    ii, jj = axes(depth2d)
    @avx for i in ii, j in jj
        d = min(rangeinf, depth2d[i, j])
        depth4d[i, j] = Point3f(d*dx[i], d*dy[j], d)
    end
end
```

and the results:

```
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-4690S CPU @ 3.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, haswell)

1.4.2 f7:  250.133 μs (2 allocations: 4.61 KiB)
1.4.2 f8:  250.449 μs (2 allocations: 4.61 KiB)
1.4.2 f9:  269.958 μs (2 allocations: 4.61 KiB)

1.5.0-rc1.0 f7:  251.361 μs (2 allocations: 4.61 KiB)
1.5.0-rc1.0 f8:  252.037 μs (2 allocations: 4.61 KiB)
1.5.0-rc1.0 f9:  269.340 μs (2 allocations: 4.61 KiB)

1.6.0-DEV.305 f7:  250.898 μs (2 allocations: 4.61 KiB)
1.6.0-DEV.305 f8:  251.979 μs (2 allocations: 4.61 KiB)
1.6.0-DEV.305 f9:  269.641 μs (2 allocations: 4.61 KiB)
```
```
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)

1.4.2 f7:  150.748 μs (2 allocations: 4.61 KiB)
1.4.2 f8:  151.543 μs (2 allocations: 4.61 KiB)
1.4.2 f9:  203.714 μs (2 allocations: 4.61 KiB)

1.5.0-rc1.0 f7:  150.754 μs (2 allocations: 4.61 KiB)
1.5.0-rc1.0 f8:  150.387 μs (2 allocations: 4.61 KiB)
1.5.0-rc1.0 f9:  203.499 μs (2 allocations: 4.61 KiB)

1.6.0-DEV.305 f7:  150.611 μs (2 allocations: 4.61 KiB)
1.6.0-DEV.305 f8:  150.571 μs (2 allocations: 4.61 KiB)
1.6.0-DEV.305 f9:  203.472 μs (2 allocations: 4.61 KiB)
```

In conclusion: no performance difference between Julia v1.4.2, v1.5.0-rc1 and v1.6.0-DEV.305, but a significant performance jump on `f7` and `f8` from the Haswell to the Skylake CPU.

You can specify the eltype of a `StructArray`:

```
julia> depth4ds = StructArray{Point3f}((x = similar(depth2d), y = similar(depth2d), z = similar(depth2d)))

julia> eltype(depth4ds)
Point3f
```

With `@avx`, preallocating coefficients actually seems to slow things down a bit:

```
function f10(depth2d, depth4ds)
    dx = [(i-cx)/fl for i in 1:width]
    dy = [(j-cy)/fl for j in 1:height]

    @avx for j in 1:height, i in 1:width
        d = min(rangeinf, depth2d[i, j])
        depth4ds.x[i, j] = d*dx[i]
        depth4ds.y[i, j] = d*dy[j]
        depth4ds.z[i, j] = d
    end
end

function f11(depth2d, depth4ds)
    @avx for j in 1:height, i in 1:width
        d = min(rangeinf, depth2d[i, j])
        depth4ds.x[i, j] = d*(1.0*i - cx)/fl
        depth4ds.y[i, j] = d*(1.0*j - cy)/fl
        depth4ds.z[i, j] = d
    end
end
```
```
julia> @btime f10($depth2d, $depth4ds)
  206.199 μs (2 allocations: 4.61 KiB)

julia> @btime f11($depth2d, $depth4ds)
  200.400 μs (0 allocations: 0 bytes)
```

On such an inexpensive calculation, performance is going to be limited mostly by memory bandwidth/allocations, and instantiating those coefficient arrays causes allocation. How does `f11` perform on your Skylake machine?


I’m getting:

```
1.5.0-rc1.0 f10:  128.923 μs (2 allocations: 4.61 KiB)
1.5.0-rc1.0 f11:  108.357 μs (0 allocations: 0 bytes)
```

on the Skylake machine. Much faster!

Incidentally, the new functions crash the Julia VS Code extension REPL:

```
julia> Internal Error: ERROR: MethodError: no method matching size(::StructArray{Point3f,2,NamedTuple{(:x, :y, :z),Tuple{Array{Float32,2},Array{Float32,2},Array{Float32,2}}},Int64})
The applicable method may be too new: running in world age 27143, while current world is 27185.
Closest candidates are:
  size(::StructArray) at /home/paul/.julia/packages/StructArrays/OtfvU/src/structarray.jl:130 (method too new to be called from this world context.)
  size(::AbstractArray{T,N}, ::Any) where {T, N} at abstractarray.jl:38
  size(::BitArray{1}) at bitarray.jl:99
  ...
Stacktrace:
 treerender(::StructArray{Point3f,2,NamedTuple{(:x, :y, :z),Tuple{Array{Float32,2},Array{Float32,2},Array{Float32,2}}},Int64}) at /home/paul/.vscode/extensions/julialang.language-julia-0.16.8/scripts/packages/VSCodeServer/src/trees.jl:125
 treerender(::VSCodeServer.SubTree) at /home/paul/.vscode/extensions/julialang.language-julia-0.16.8/scripts/packages/VSCodeServer/src/trees.jl:54
 repl_getvariables_request(::VSCodeServer.JSONRPC.JSONRPCEndpoint, ::Nothing) at /home/paul/.vscode/extensions/julialang.language-julia-0.16.8/scripts/packages/VSCodeServer/src/request_handlers.jl:88
 dispatch_msg(::VSCodeServer.JSONRPC.JSONRPCEndpoint, ::VSCodeServer.JSONRPC.MsgDispatcher, ::Dict{String,Any}) at /home/paul/.vscode/extensions/julialang.language-julia-0.16.8/scripts/packages/JSONRPC/src/typed.jl:63
 macro expansion at /home/paul/.vscode/extensions/julialang.language-julia-0.16.8/scripts/packages/VSCodeServer/src/VSCodeServer.jl:83 [inlined]
```