Iterating over a 2D array and placing results in a 1D array with the same number of elements

I have:

h, w = 4, 3             # example dimensions
a = rand(Float32, h*w)
b = reshape(a, h, w)
c = similar(a)

I would like to iterate over array b, calculate a result based on each element’s value and its 2D coordinates, and put the result in c. The order of elements in c is irrelevant, and c can be constructed in a different way than shown above.

The options I see are:

  1. Use push!(c, result) on an initially empty array c, which is inefficient unless there is a way to reserve memory for an empty array.
  2. Use two iterators of different shapes in parallel, which I don’t know how to do.
  3. Use a 2D iterator over array b and calculate a 1D index (or iterator) from it to put the result in array c.

What is a clear and concise way to do it? Performance is not critical here.

How about:

Use option (3), but create a new 2D array b_new of the same shape as b, and then do

c = vec(b_new).

EDIT: vec does not allocate a new array; it just gives a view that looks like a vector.
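
A minimal sketch of that approach, with a placeholder function f standing in for the per-element calculation:

b_new = similar(b)
for i in CartesianIndices(b)
    b_new[i] = f(b[i], i)   # f is hypothetical; i is the CartesianIndex holding the 2D coordinates
end
c = vec(b_new)              # a vector view of b_new's data, no copy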


I don’t think the performance of push! is so bad (I don’t believe it naively allocates on every push but does something smarter to reduce the churn), especially for something that is not performance critical. I think it often makes for simpler code.

Also, are you aware that you can index into an array using both 2D and 1D indices?
For example

julia> a = randn(3,5)
3×5 Array{Float64,2}:
  0.180949   1.04227    0.59108   -1.29255   0.766903
 -1.05146    0.878143  -0.201572   1.16771  -0.603789
 -0.763806  -1.60095   -0.759488   1.00734   0.089463

julia> a[2,3]
-0.20157212679471334

julia> a[8]
-0.20157212679471334
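
For option (3), the explicit conversions between the two index styles are available via LinearIndices and CartesianIndices:

julia> LinearIndices(a)[2, 3]
8

julia> CartesianIndices(a)[8]
CartesianIndex(2, 3)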

sizehint!(v, 100) reserves memory. But indeed push! is (very) clever about memory use.
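
For example, a sketch of option (1) with preallocation, again with a placeholder transform f:

c = eltype(b)[]            # empty Vector{Float32}
sizehint!(c, length(b))    # reserve capacity for all results up front
for j in axes(b, 2), i in axes(b, 1)
    push!(c, f(b[i, j]))   # no reallocation, since capacity is reserved
end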


You could do this:

a = rand(Float32, h*w)
b = reshape(a, h, w)
c = similar(a)

foo(x) = 2 * x
foo(x, i) = x + i[2]

# this one doesn't care about the 2D index of b, but uses a linear index for both
for i in eachindex(b, c)
    c[i] = foo(b[i])
end

# if you need the 2D array indices
ind = CartesianIndices(b)
for i in eachindex(b, c)
    c[i] = foo(b[i], ind[i])
end

Thank you all for the suggestions. Thinking my problem through, I noticed that array dimensionality in Julia is very fungible through reshape, which seems to be computationally cheap. I was stuck in the mindset of a different language, where this was not a simple and cheap task. As a consequence, my initial requirement about iterating simultaneously over two arrays with different dimensions but the same number of elements is not necessary. I can restate it as:

a = rand(Float32, h*w)
b = reshape(a, h, w)
c = similar(b)

so I will be iterating over arrays b and c with the same dimensions, which makes things simpler. When different dimensions are required by a consuming function, I can just reshape the resultant array c:

reshape(c, h*w)

Or use vec(c), which is more readable and doesn’t require you to keep track of the sizes.
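
A quick REPL check that vec does not copy (m is just a small throwaway matrix):

julia> m = zeros(Float32, 2, 3);

julia> v = vec(m);

julia> v[1] = 1f0;

julia> m[1, 1]   # the mutation through v is visible in m
1.0f0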


Thanks. vec(c) is much better than reshape(c, h*w).

Here is my first stab at this, with f0 using plain integer indices and f1 using CartesianIndices:

depth2d = reshape(depth1d, width, height)
depth4d = similar(depth2d, Point3f)

function f0(depth2d, depth4d)
  @inbounds for i = 1:width, j = 1:height
    d = min(rangeinf, depth2d[i, j])
    depth4d[i, j] = Point3f(d*(i-cx)/fl, d*(j-cy)/fl, d)
  end
end

function f1(depth2d, depth4d)
  for i in CartesianIndices(depth2d)
    d = min(rangeinf, depth2d[i])
    depth4d[i] = Point3f(d*(i[1]-cx)/fl, d*(i[2]-cy)/fl, d)
  end
end

julia> judge(median(@benchmark f0(depth2d, depth4d)), median(@benchmark f1(depth2d, depth4d)))
BenchmarkTools.TrialJudgement: 
  time:   -16.82% => improvement (5.00% tolerance)

Employing CartesianIndices doesn’t make this simple, tiny function any clearer, and is less performant. Was this expected?

This is not a self-contained example, so I can’t run it for proper comparison.

But at the very least, you are using @inbounds in one function and not in the other. Also, remember to interpolate variables with $ when benchmarking with BenchmarkTools.

(I think nearly a third of my posts on this forum contain the phrase “remember to interpolate”.)


I thought that @inbounds in f1 was redundant, since the indices produced by CartesianIndices(depth2d) are guaranteed to be in range by design. That assumption was obviously wrong, because adding it makes f1 much faster now:

mutable struct Point3f
  x::Float32
  y::Float32
  z::Float32
end

const rangeinf = 10.5f0
const fl = 554.3f0
const cx = 320.5f0
const cy = 240.5f0
const width  = 640
const height = 480

depth1d = rand(Float32, width*height)
depth2d = reshape(depth1d, width, height)
depth4d = similar(depth2d, Point3f)

function f0(depth2d, depth4d)
  @inbounds for i = 1:width, j = 1:height
    d = min(rangeinf, depth2d[i, j])
    depth4d[i, j] = Point3f(d*(i-cx)/fl, d*(j-cy)/fl, d)
  end
end

function f1(depth2d, depth4d)
  @inbounds for i in CartesianIndices(depth2d)
    d = min(rangeinf, depth2d[i])
    depth4d[i] = Point3f(d*(i[1]-cx)/fl, d*(i[2]-cy)/fl, d)
  end
end

println(judge(median(@benchmark f1($depth2d, $depth4d)), median(@benchmark f0($depth2d, $depth4d))))

I switched the arguments in judge:

TrialJudgement(-23.80% => improvement)

Interpolating variables with $ when benchmarking doesn’t affect the timings in this case, but it’s good to know that this is standard practice.


I know you’ve moved targets a bit here, but this is pretty easy with zip. You can always do cleverer things, but this is probably the simplest and clearest:

a = rand(Float32, h*w)
b = reshape(a, h, w)
c = similar(a)

f(x) = 2x   # placeholder transform

for (ib, ic) in zip(eachindex(b), eachindex(c))
    c[ic] = f(b[ib])
end

It’s not necessarily inbounds when used to index into depth4d, though! (depth4d could, in principle, have different dimensions than depth2d.) What you can see and what the compiler can prove are two different things, which is precisely why we allow you to use @inbounds.

Iterating over CartesianIndices will never be faster than the for loops over the constituent indices… that is, if you have the loops in the right order! Your f0 is row-major, but Julia arrays are stored column-major, so the inner loop should run over the first index. Fixing that makes it 40% faster than f1.

julia> function f0c(depth2d, depth4d)
         @inbounds for j = 1:height, i = 1:width
           d = min(rangeinf, depth2d[i, j])
           depth4d[i, j] = Point3f(d*(i-cx)/fl, d*(j-cy)/fl, d)
         end
       end
f0c (generic function with 1 method)

julia> @btime f0($depth2d, $depth4d)
  3.385 ms (307200 allocations: 9.38 MiB)

julia> @btime f0c($depth2d, $depth4d) # with for loops swapped
  1.740 ms (307200 allocations: 9.38 MiB)

julia> @btime f1($depth2d, $depth4d)
  2.124 ms (307200 allocations: 9.38 MiB)

For a thing as simple as Point3f, though, I’d highly recommend using an immutable struct; concretely, that just means dropping the mutable keyword:
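
struct Point3f
  x::Float32
  y::Float32
  z::Float32
end

If you can do so, you get to: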

julia> @btime f0($depth2d, $depth4d)
  1.015 ms (0 allocations: 0 bytes)

julia> @btime f0c($depth2d, $depth4d)
  378.240 μs (0 allocations: 0 bytes)

julia> @btime f1($depth2d, $depth4d)
  799.280 μs (0 allocations: 0 bytes)

Finally, note that iterating over CartesianIndices has special @simd support that can often help its performance, even if the loop itself isn’t simd-able:

julia> function f1s(depth2d, depth4d)
         @inbounds @simd for i in CartesianIndices(depth2d)
           d = min(rangeinf, depth2d[i])
           depth4d[i] = Point3f(d*(i[1]-cx)/fl, d*(i[2]-cy)/fl, d)
         end
       end
f1s (generic function with 1 method)

julia> @btime f1s($depth2d, $depth4d)
  387.986 μs (0 allocations: 0 bytes)

Excellent! This is the kind of response I was hoping for, thanks a lot. I’m going to benchmark this function against C++, so this improvement helps a lot.


Further improvements with LoopVectorization:

using StructArrays, LoopVectorization

depth4d_SA = StructArray((x = similar(depth2d), y = similar(depth2d), z = similar(depth2d)))

function f2_SA(depth2d, depth4ds)
    ii, jj = axes(depth2d)
    @avx for i in ii, j in jj
        d = min(rangeinf, depth2d[i,j])
        depth4ds.x[i,j] = d*(1.0*i - cx)/fl
        depth4ds.y[i,j] = d*(1.0*j - cy)/fl
        depth4ds.z[i,j] = d
    end
end

julia> @btime f1s($depth2d, $depth4d) # 378 μs on mbauman's machine
  657.600 μs (0 allocations: 0 bytes)

julia> @btime f2_SA($depth2d, $depth4d_SA)
  213.100 μs (0 allocations: 0 bytes)

That’s interesting, but I don’t think I can use it. The consumer of depth4d requires:

eltype(depth4d) == Point3f

I made an improvement by precalculating the coefficients. Here is the version with repeatable input and a more realistic rangeinf value for benchmarking:

using BenchmarkTools, Random

struct Point3f
  x::Float32
  y::Float32
  z::Float32
end

const rangeinf = 0.95f0
const fl = 554.3f0
const cx = 320.5f0
const cy = 240.5f0
const width  = 640
const height = 480

depth1d = rand(MersenneTwister(12345), Float32, width*height)
depth2d = reshape(depth1d, width, height)
depth4d = similar(depth2d, Point3f)

function f7(depth2d, depth4d)
  dx = [(i-cx)/fl for i in 1:width]
  dy = [(j-cy)/fl for j in 1:height]

  @inbounds for j in 1:height
    @inbounds for i in 1:width
      d = min(rangeinf, depth2d[i, j])
      depth4d[i, j] = Point3f(d*dx[i], d*dy[j], d)
    end
  end
end
julia> @btime f7($depth2d, $depth4d)
  153.568 μs (2 allocations: 4.61 KiB)

I was curious how the timings would look across different CPUs and Julia versions. Here are three functions benchmarked (with the same globals as above):

function f7(depth2d, depth4d)
  dx = [(i-cx)/fl for i in 1:width]
  dy = [(j-cy)/fl for j in 1:height]

  @inbounds for j in 1:height
    @inbounds for i in 1:width
      d = min(rangeinf, depth2d[i, j])
      depth4d[i, j] = Point3f(d*dx[i], d*dy[j], d)
    end
  end
end

function f8(depth2d, depth4d)
  dx = [(i-cx)/fl for i in 1:width]
  dy = [(j-cy)/fl for j in 1:height]

  @inbounds @simd for i in CartesianIndices(depth2d)
    d = min(rangeinf, depth2d[i])
    depth4d[i] = Point3f(d*dx[i[1]], d*dy[i[2]], d)
  end
end

function f9(depth2d, depth4d)
  dx = [(i-cx)/fl for i in 1:width]
  dy = [(j-cy)/fl for j in 1:height]

  ii, jj = axes(depth2d)
  @avx for i in ii, j in jj
    d = min(rangeinf, depth2d[i, j])
    depth4d[i, j] = Point3f(d*dx[i], d*dy[j], d)
  end
end

and the results:

Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-4690S CPU @ 3.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, haswell)

1.4.2 f7:  250.133 μs (2 allocations: 4.61 KiB)
1.4.2 f8:  250.449 μs (2 allocations: 4.61 KiB)
1.4.2 f9:  269.958 μs (2 allocations: 4.61 KiB)

1.5.0-rc1.0 f7:  251.361 μs (2 allocations: 4.61 KiB)
1.5.0-rc1.0 f8:  252.037 μs (2 allocations: 4.61 KiB)
1.5.0-rc1.0 f9:  269.340 μs (2 allocations: 4.61 KiB)

1.6.0-DEV.305 f7:  250.898 μs (2 allocations: 4.61 KiB)
1.6.0-DEV.305 f8:  251.979 μs (2 allocations: 4.61 KiB)
1.6.0-DEV.305 f9:  269.641 μs (2 allocations: 4.61 KiB)

Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)

1.4.2 f7:  150.748 μs (2 allocations: 4.61 KiB)
1.4.2 f8:  151.543 μs (2 allocations: 4.61 KiB)
1.4.2 f9:  203.714 μs (2 allocations: 4.61 KiB)

1.5.0-rc1.0 f7:  150.754 μs (2 allocations: 4.61 KiB)
1.5.0-rc1.0 f8:  150.387 μs (2 allocations: 4.61 KiB)
1.5.0-rc1.0 f9:  203.499 μs (2 allocations: 4.61 KiB)

1.6.0-DEV.305 f7:  150.611 μs (2 allocations: 4.61 KiB)
1.6.0-DEV.305 f8:  150.571 μs (2 allocations: 4.61 KiB)
1.6.0-DEV.305 f9:  203.472 μs (2 allocations: 4.61 KiB)

In conclusion: there is no performance difference between Julia v1.4.2, v1.5.0-rc1, and v1.6.0-DEV.305, but there is a significant performance jump for f7 and f8 from the Haswell to the Skylake CPU.

You can specify the eltype of a StructArray:

julia> depth4ds = StructArray{Point3f}((x = similar(depth2d), y = similar(depth2d), z = similar(depth2d)))

julia> eltype(depth4ds)
Point3f

With @avx, preallocating coefficients actually seems to slow things down a bit:

function f10(depth2d, depth4ds)
    dx = [(i-cx)/fl for i in 1:width]
    dy = [(j-cy)/fl for j in 1:height]

    @avx for j in 1:height, i in 1:width
        d = min(rangeinf, depth2d[i, j])
        depth4ds.x[i, j] = d*dx[i]
        depth4ds.y[i, j] = d*dy[j]
        depth4ds.z[i, j] = d
    end
end

function f11(depth2d, depth4ds)
    @avx for j in 1:height, i in 1:width
        d = min(rangeinf, depth2d[i, j])
        depth4ds.x[i, j] = d*(1.0*i - cx)/fl
        depth4ds.y[i, j] = d*(1.0*j - cy)/fl
        depth4ds.z[i, j] = d
    end
end
julia> @btime f10($depth2d, $depth4ds)
  206.199 μs (2 allocations: 4.61 KiB)

julia> @btime f11($depth2d, $depth4ds)
  200.400 μs (0 allocations: 0 bytes)

On such an inexpensive calculation, performance is going to be limited mostly by memory bandwidth/allocations, and instantiating those coefficient arrays causes allocation. How does f11 perform on your Skylake machine?


I’m getting:

1.5.0-rc1.0 f10:  128.923 μs (2 allocations: 4.61 KiB)
1.5.0-rc1.0 f11:  108.357 μs (0 allocations: 0 bytes)

on the Skylake machine. Much faster!

Incidentally, the new functions crash the Julia VS Code extension’s REPL:

julia> Internal Error: ERROR: MethodError: no method matching size(::StructArray{Point3f,2,NamedTuple{(:x, :y, :z),Tuple{Array{Float32,2},Array{Float32,2},Array{Float32,2}}},Int64})
The applicable method may be too new: running in world age 27143, while current world is 27185.
Closest candidates are:
  size(::StructArray) at /home/paul/.julia/packages/StructArrays/OtfvU/src/structarray.jl:130 (method too new to be called from this world context.)
  size(::AbstractArray{T,N}, ::Any) where {T, N} at abstractarray.jl:38
  size(::BitArray{1}) at bitarray.jl:99
  ...
Stacktrace:
 [1] treerender(::StructArray{Point3f,2,NamedTuple{(:x, :y, :z),Tuple{Array{Float32,2},Array{Float32,2},Array{Float32,2}}},Int64}) at /home/paul/.vscode/extensions/julialang.language-julia-0.16.8/scripts/packages/VSCodeServer/src/trees.jl:125
 [2] treerender(::VSCodeServer.SubTree) at /home/paul/.vscode/extensions/julialang.language-julia-0.16.8/scripts/packages/VSCodeServer/src/trees.jl:54
 [3] repl_getvariables_request(::VSCodeServer.JSONRPC.JSONRPCEndpoint, ::Nothing) at /home/paul/.vscode/extensions/julialang.language-julia-0.16.8/scripts/packages/VSCodeServer/src/request_handlers.jl:88
 [4] dispatch_msg(::VSCodeServer.JSONRPC.JSONRPCEndpoint, ::VSCodeServer.JSONRPC.MsgDispatcher, ::Dict{String,Any}) at /home/paul/.vscode/extensions/julialang.language-julia-0.16.8/scripts/packages/JSONRPC/src/typed.jl:63
 [5] macro expansion at /home/paul/.vscode/extensions/julialang.language-julia-0.16.8/scripts/packages/VSCodeServer/src/VSCodeServer.jl:83 [inlined]
 [6] (::VSCodeServer.var"#53#55"{Bool,String})() at ./task.jl:358

They work fine in the terminal. I’m going to file an issue.

Sounds like this.