Iterating over 2D array and placing results in 1D array with the same number of elements

I have:

a = rand(Float32, h*w)
b = reshape(a, h, w)
c = similar(a)

I would like to iterate over array b, calculate a result based on each element's value and its 2D coordinates, and put it in c. The order of elements in c is irrelevant, and c can be constructed in a different way than shown above.

The options I see are:

  1. Use push!(c, result) on an initially empty array c, which is inefficient unless there is a way to reserve memory for an empty array.
  2. Use two iterators of different shapes in parallel, which I don't know how to do.
  3. Use a 2D iterator over array b and calculate a 1D index (or iterator) from it to put the result in array c.

What is a clear and concise way to do it? Performance is not critical here.

How about:

Use option (3), but create a new 2D array b_new of the same shape as b, and then do

c = vec(b_new).

EDIT: vec does not allocate a new array; it just gives a view that looks like a vector.
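A quick check of that sharing behavior (a minimal sketch with made-up dimensions):

```julia
h, w = 3, 4
b_new = rand(Float32, h, w)
c = vec(b_new)               # an h*w-element vector sharing b_new's memory
c[1] = 0f0                   # write through the vector view ...
@assert b_new[1, 1] == 0f0   # ... and it is visible through the 2D array
```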


I don't think the performance of push! is so bad (I don't believe it naively allocates on every push but does something smarter to reduce the churn), especially for something that is not performance critical. I think it often makes for simpler code.

Also, are you aware that you can index into an array using both 2D and 1D indices?
For example

julia> a = randn(3,5)
3×5 Array{Float64,2}:
  0.180949   1.04227    0.59108   -1.29255   0.766903
 -1.05146    0.878143  -0.201572   1.16771  -0.603789
 -0.763806  -1.60095   -0.759488   1.00734   0.089463

julia> a[2,3]
-0.20157212679471334

julia> a[8]
-0.20157212679471334
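The two index styles are interchangeable because Julia arrays are column-major; a small sketch of the correspondence:

```julia
a = randn(3, 5)
# Column-major storage: a[i, j] is linear element i + (j - 1) * size(a, 1)
@assert a[2, 3] == a[2 + (3 - 1) * 3]       # i.e. a[8]
# LinearIndices / CartesianIndices convert between the two styles:
@assert LinearIndices(a)[2, 3] == 8
@assert CartesianIndices(a)[8] == CartesianIndex(2, 3)
```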

sizehint!(v, 100) reserves memory. But indeed push! is (very) clever about memory use.
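For example (a minimal sketch; the doubling is a stand-in for the real calculation):

```julia
# Reserve capacity up front so push! does not need to regrow the array.
n = 100
v = Float32[]
sizehint!(v, n)           # reserves memory for about n elements
for x in 1:n
    push!(v, 2f0 * x)     # toy calculation standing in for the real one
end
@assert length(v) == n
```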


You could do this:

a = rand(Float32, h*w)
b = reshape(a, h, w)
c = similar(a)

foo(x) = 2 * x
foo(x, i) = x + i[2]

# this one doesn't care about the 2D index of b, but uses a linear index for both
for i in eachindex(b, c)
    c[i] = foo(b[i])
end

# if you need the 2D array indices
ind = CartesianIndices(b)
for i in eachindex(b, c)
    c[i] = foo(b[i], ind[i])
end

Thank you all for suggestions. Thinking my problem through, I noticed that array dimensionality in Julia is very fungible through reshape, which seems to be computationally cheap. I was stuck in a different language mindset where this was not a simple and cheap task. As a consequence, my initial requirement about iterating simultaneously over two arrays with different dimensions, but the same number of elements, is not necessary. I can restate it as:

a = rand(Float32, h*w)
b = reshape(a, h, w)
c = similar(b)

so I will be iterating over arrays b and c with the same dimensions, which makes it simpler. When different dimensions are required by a consuming function, I can just reshape the resultant array c:

reshape(c, h*w)

Or use vec(c), which is more readable and doesn't require you to keep track of the sizes.


Thanks. vec(c) is much better than reshape(c, h*w).

Here is my first stab at this, with f0 using plain integer indices and f1 using CartesianIndices:

depth2d = reshape(depth1d, width, height)
depth4d = similar(depth2d, Point3f)

function f0(depth2d, depth4d)
  @inbounds for i = 1:width, j = 1:height
    d = min(rangeinf, depth2d[i, j])
    depth4d[i, j] = Point3f(d*(i-cx)/fl, d*(j-cy)/fl, d)
  end
end

function f1(depth2d, depth4d)
  for i in CartesianIndices(depth2d)
    d = min(rangeinf, depth2d[i])
    depth4d[i] = Point3f(d*(i[1]-cx)/fl, d*(i[2]-cy)/fl, d)
  end
end

julia> judge(median(@benchmark f0(depth2d, depth4d)), median(@benchmark f1(depth2d, depth4d)))
BenchmarkTools.TrialJudgement: 
  time:   -16.82% => improvement (5.00% tolerance)

Employing CartesianIndices doesn't make this simple, tiny function any clearer, and is less performant. Was this expected?

This is not a self-contained example, so I can't run it for proper comparison.

But, at least, you are using @inbounds in one function and not in the other. Also, remember to interpolate variables with $ when benchmarking with BenchmarkTools.

(I think nearly a third of my posts on this forum contain the phrase "remember to interpolate".)


I thought that @inbounds in f1 was redundant, since CartesianIndices(depth2d) are guaranteed to be in range by design. That assumption was obviously wrong, because adding it makes f1 much faster now:

mutable struct Point3f
  x::Float32
  y::Float32
  z::Float32
end

const rangeinf = 10.5f0
const fl = 554.3f0
const cx = 320.5f0
const cy = 240.5f0
const width  = 640
const height = 480

depth1d = rand(Float32, width*height)
depth2d = reshape(depth1d, width, height)
depth4d = similar(depth2d, Point3f)

function f0(depth2d, depth4d)
  @inbounds for i = 1:width, j = 1:height
    d = min(rangeinf, depth2d[i, j])
    depth4d[i, j] = Point3f(d*(i-cx)/fl, d*(j-cy)/fl, d)
  end
end

function f1(depth2d, depth4d)
  @inbounds for i in CartesianIndices(depth2d)
    d = min(rangeinf, depth2d[i])
    depth4d[i] = Point3f(d*(i[1]-cx)/fl, d*(i[2]-cy)/fl, d)
  end
end

println(judge(median(@benchmark f1($depth2d, $depth4d)), median(@benchmark f0($depth2d, $depth4d))))

I switched the arguments in judge:

TrialJudgement(-23.80% => improvement)

Interpolating variables with $ when benchmarking doesn't affect the timings in this case, but it's good to know that this is standard practice.


I know you've moved targets a bit here, but this is pretty easy with zip. You can always do cleverer things, but this is probably the simplest and clearest:

a = rand(Float32, h*w)
b = reshape(a, h, w)
c = similar(a)

for (ib, ic) in zip(eachindex(b), eachindex(c))
    c[ic] = f(b[ib])
end

It's not necessarily inbounds when used to index into depth4d, though! What you can see and what the compiler can prove are two different things, which is precisely why we allow you to use @inbounds.
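One way to make the shared bounds explicit without reaching for @inbounds is to iterate eachindex over both arrays, which checks once up front that their indices agree (a minimal sketch with made-up sizes):

```julia
A = rand(Float32, 4, 5)
B = similar(A)
# eachindex(A, B) verifies the arrays have compatible indices,
# then yields indices that are valid for both:
for i in eachindex(A, B)
    B[i] = 2f0 * A[i]
end
@assert B == 2f0 .* A
```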

Iterating over CartesianIndices will never be faster than the for loops over the constituent indices… that is, if you have the loops in the right order! Your f0 is row-major. Fixing that makes it 40% faster than f1.

julia> function f0c(depth2d, depth4d)
         @inbounds for j = 1:height, i = 1:width
           d = min(rangeinf, depth2d[i, j])
           depth4d[i, j] = Point3f(d*(i-cx)/fl, d*(j-cy)/fl, d)
         end
       end
f0c (generic function with 1 method)

julia> @btime f0($depth2d, $depth4d)
  3.385 ms (307200 allocations: 9.38 MiB)

julia> @btime f0c($depth2d, $depth4d) # with for loops swapped
  1.740 ms (307200 allocations: 9.38 MiB)

julia> @btime f1($depth2d, $depth4d)
  2.124 ms (307200 allocations: 9.38 MiB)

For a thing as simple as Point3f, though, I'd highly recommend using an immutable struct. If you can do so, you get to:

julia> @btime f0($depth2d, $depth4d)
  1.015 ms (0 allocations: 0 bytes)

julia> @btime f0c($depth2d, $depth4d)
  378.240 μs (0 allocations: 0 bytes)

julia> @btime f1($depth2d, $depth4d)
  799.280 μs (0 allocations: 0 bytes)

Finally, note that iterating over CartesianIndices has special @simd support that can often help its performance, even if the loop itself isn't simd-able:

julia> function f1s(depth2d, depth4d)
         @inbounds @simd for i in CartesianIndices(depth2d)
           d = min(rangeinf, depth2d[i])
           depth4d[i] = Point3f(d*(i[1]-cx)/fl, d*(i[2]-cy)/fl, d)
         end
       end
f1s (generic function with 1 method)

julia> @btime f1s($depth2d, $depth4d)
  387.986 μs (0 allocations: 0 bytes)

Excellent! This is the kind of response I was hoping for, thanks a lot. I'm going to benchmark this function against C++, so this improvement helps a lot.


Further improvements with LoopVectorization:

using StructArrays, LoopVectorization

depth4d_SA = StructArray((x = similar(depth2d), y = similar(depth2d), z = similar(depth2d)))

function f2_SA(depth2d, depth4ds)
    ii, jj = axes(depth2d)
    @avx for i in ii, j in jj
        d = min(rangeinf, depth2d[i,j])
        depth4ds.x[i,j] = d*(1.0*i - cx)/fl
        depth4ds.y[i,j] = d*(1.0*j - cy)/fl
        depth4ds.z[i,j] = d
    end
end
julia> @btime f1s($depth2d, $depth4d) # 378 μs on mbauman's machine
  657.600 μs (0 allocations: 0 bytes)

julia> @btime f2_SA($depth2d, $depth4d_SA)
  213.100 μs (0 allocations: 0 bytes)

That's interesting, but I don't think I can use it. The consumer of depth4d requires:

eltype(depth4d) == Point3f

I made an improvement by precalculating the coefficients. Here is the version with repeatable input and more realistic rangeinf value for benchmarking:

using BenchmarkTools, Random

struct Point3f
  x::Float32
  y::Float32
  z::Float32
end

const rangeinf = 0.95f0
const fl = 554.3f0
const cx = 320.5f0
const cy = 240.5f0
const width  = 640
const height = 480

depth1d = rand(MersenneTwister(12345), Float32, width*height)
depth2d = reshape(depth1d, width, height)
depth4d = similar(depth2d, Point3f)

function f7(depth2d, depth4d)
  dx = [(i-cx)/fl for i in 1:width]
  dy = [(j-cy)/fl for j in 1:height]

  @inbounds for j in 1:height, i in 1:width
    d = min(rangeinf, depth2d[i, j])
    depth4d[i, j] = Point3f(d*dx[i], d*dy[j], d)
  end
end
julia> @btime f7($depth2d, $depth4d)
  153.568 μs (2 allocations: 4.61 KiB)

I was curious how the timings would look across different CPUs and Julia versions. Here are three functions benchmarked (the same globals as above):

function f7(depth2d, depth4d)
  dx = [(i-cx)/fl for i in 1:width]
  dy = [(j-cy)/fl for j in 1:height]

  @inbounds for j in 1:height, i in 1:width
    d = min(rangeinf, depth2d[i, j])
    depth4d[i, j] = Point3f(d*dx[i], d*dy[j], d)
  end
end

function f8(depth2d, depth4d)
  dx = [(i-cx)/fl for i in 1:width]
  dy = [(j-cy)/fl for j in 1:height]

  @inbounds @simd for i in CartesianIndices(depth2d)
    d = min(rangeinf, depth2d[i])
    depth4d[i] = Point3f(d*dx[i[1]], d*dy[i[2]], d)
  end
end

function f9(depth2d, depth4d)
  dx = [(i-cx)/fl for i in 1:width]
  dy = [(j-cy)/fl for j in 1:height]

  ii, jj = axes(depth2d)
  @avx for i in ii, j in jj
    d = min(rangeinf, depth2d[i, j])
    depth4d[i, j] = Point3f(d*dx[i], d*dy[j], d)
  end
end

and the results:

Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-4690S CPU @ 3.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, haswell)

1.4.2 f7:  250.133 μs (2 allocations: 4.61 KiB)
1.4.2 f8:  250.449 μs (2 allocations: 4.61 KiB)
1.4.2 f9:  269.958 μs (2 allocations: 4.61 KiB)

1.5.0-rc1.0 f7:  251.361 μs (2 allocations: 4.61 KiB)
1.5.0-rc1.0 f8:  252.037 μs (2 allocations: 4.61 KiB)
1.5.0-rc1.0 f9:  269.340 μs (2 allocations: 4.61 KiB)

1.6.0-DEV.305 f7:  250.898 μs (2 allocations: 4.61 KiB)
1.6.0-DEV.305 f8:  251.979 μs (2 allocations: 4.61 KiB)
1.6.0-DEV.305 f9:  269.641 μs (2 allocations: 4.61 KiB)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)

1.4.2 f7:  150.748 μs (2 allocations: 4.61 KiB)
1.4.2 f8:  151.543 μs (2 allocations: 4.61 KiB)
1.4.2 f9:  203.714 μs (2 allocations: 4.61 KiB)

1.5.0-rc1.0 f7:  150.754 μs (2 allocations: 4.61 KiB)
1.5.0-rc1.0 f8:  150.387 μs (2 allocations: 4.61 KiB)
1.5.0-rc1.0 f9:  203.499 μs (2 allocations: 4.61 KiB)

1.6.0-DEV.305 f7:  150.611 μs (2 allocations: 4.61 KiB)
1.6.0-DEV.305 f8:  150.571 μs (2 allocations: 4.61 KiB)
1.6.0-DEV.305 f9:  203.472 μs (2 allocations: 4.61 KiB)

In conclusion: there is no performance difference between Julia v1.4.2, v1.5.0-rc1, and v1.6.0-DEV.305, but a significant performance jump in f7 and f8 from the Haswell to the Skylake CPU.

You can specify the eltype of a StructArray:

julia> depth4ds = StructArray{Point3f}((x = similar(depth2d), y = similar(depth2d), z = similar(depth2d)))

julia> eltype(depth4ds)
Point3f

With @avx, preallocating coefficients actually seems to slow things down a bit:

function f10(depth2d, depth4ds)
    dx = [(i-cx)/fl for i in 1:width]
    dy = [(j-cy)/fl for j in 1:height]

    @avx for j in 1:height, i in 1:width
        d = min(rangeinf, depth2d[i, j])
        depth4ds.x[i, j] = d*dx[i]
        depth4ds.y[i, j] = d*dy[j]
        depth4ds.z[i, j] = d
    end
end

function f11(depth2d, depth4ds)
    @avx for j in 1:height, i in 1:width
        d = min(rangeinf, depth2d[i, j])
        depth4ds.x[i, j] = d*(1.0*i - cx)/fl
        depth4ds.y[i, j] = d*(1.0*j - cy)/fl
        depth4ds.z[i, j] = d
    end
end
julia> @btime f10($depth2d, $depth4ds)
  206.199 μs (2 allocations: 4.61 KiB)

julia> @btime f11($depth2d, $depth4ds)
  200.400 μs (0 allocations: 0 bytes)

On such an inexpensive calculation, performance is going to be limited mostly by memory bandwidth/allocations, and instantiating those coefficient arrays causes allocation. How does f11 perform on your Skylake machine?
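For completeness, one hedged way to avoid those per-call allocations is to hoist the coefficient vectors out of the function and pass them in. The sketch below uses plain loops and an NTuple element as a stand-in for Point3f; the name f7_prealloc! is mine:

```julia
const rangeinf = 0.95f0           # same constants as earlier in the thread
const fl = 554.3f0
const cx = 320.5f0
const cy = 240.5f0
width, height = 640, 480

# Allocate the coefficient vectors once, outside the hot function ...
dx = Float32[(i - cx) / fl for i in 1:width]
dy = Float32[(j - cy) / fl for j in 1:height]

# ... and pass them in, so repeated calls allocate nothing.
function f7_prealloc!(out, depth2d, dx, dy)
    @inbounds for j in axes(depth2d, 2), i in axes(depth2d, 1)
        d = min(rangeinf, depth2d[i, j])
        out[i, j] = (d * dx[i], d * dy[j], d)   # NTuple stand-in for Point3f
    end
    return out
end

depth2d = rand(Float32, width, height)
out = similar(depth2d, NTuple{3,Float32})
f7_prealloc!(out, depth2d, dx, dy)
```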


I'm getting:

1.5.0-rc1.0 f10:  128.923 μs (2 allocations: 4.61 KiB)
1.5.0-rc1.0 f11:  108.357 μs (0 allocations: 0 bytes)

on the Skylake machine. Much faster!

Incidentally, the new functions crash the Julia VS Code extension REPL:

julia> Internal Error: ERROR: MethodError: no method matching size(::StructArray{Point3f,2,NamedTuple{(:x, :y, :z),Tuple{Array{Float32,2},Array{Float32,2},Array{Float32,2}}},Int64})
The applicable method may be too new: running in world age 27143, while current world is 27185.
Closest candidates are:
  size(::StructArray) at /home/paul/.julia/packages/StructArrays/OtfvU/src/structarray.jl:130 (method too new to be called from this world context.)
  size(::AbstractArray{T,N}, ::Any) where {T, N} at abstractarray.jl:38
  size(::BitArray{1}) at bitarray.jl:99
  ...
Stacktrace:
 [1] treerender(::StructArray{Point3f,2,NamedTuple{(:x, :y, :z),Tuple{Array{Float32,2},Array{Float32,2},Array{Float32,2}}},Int64}) at /home/paul/.vscode/extensions/julialang.language-julia-0.16.8/scripts/packages/VSCodeServer/src/trees.jl:125
 [2] treerender(::VSCodeServer.SubTree) at /home/paul/.vscode/extensions/julialang.language-julia-0.16.8/scripts/packages/VSCodeServer/src/trees.jl:54
 [3] repl_getvariables_request(::VSCodeServer.JSONRPC.JSONRPCEndpoint, ::Nothing) at /home/paul/.vscode/extensions/julialang.language-julia-0.16.8/scripts/packages/VSCodeServer/src/request_handlers.jl:88
 [4] dispatch_msg(::VSCodeServer.JSONRPC.JSONRPCEndpoint, ::VSCodeServer.JSONRPC.MsgDispatcher, ::Dict{String,Any}) at /home/paul/.vscode/extensions/julialang.language-julia-0.16.8/scripts/packages/JSONRPC/src/typed.jl:63
 [5] macro expansion at /home/paul/.vscode/extensions/julialang.language-julia-0.16.8/scripts/packages/VSCodeServer/src/VSCodeServer.jl:83 [inlined]
 [6] (::VSCodeServer.var"#53#55"{Bool,String})() at ./task.jl:358

They work fine in the terminal. I'm going to file an issue.

Sounds like this.
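For reference, "method too new" world-age errors like the one in that stack trace generally arise when code calls a method that was defined after the caller started running; Base.invokelatest is the standard workaround. A minimal sketch (the function names are made up):

```julia
# A method defined at runtime (here via @eval) is "too new" for the
# already-running caller; Base.invokelatest looks it up in the latest world.
function demo()
    @eval newsize() = 42          # define a method mid-execution
    try
        return newsize()          # MethodError: method too new for this world
    catch
        return Base.invokelatest(newsize)
    end
end
@assert demo() == 42
```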