1d arrays vs 1-column matrices

Does this mean that storing vectors as x(n,) rather than x(n,1) has performance benefits?

I’ll try again: Is there any advantage in having 1-D arrays as x(n,) over x(n,1)? Matlab programmers use x(n,1) by instinct. Are these (we) people paying a performance penalty for that?

(I think you’re confusing people by using Fortran notation here.)

The memory storage format is exactly the same for a Vector (1d Array) or a Matrix (2d Array) with one column, and in both cases Julia allows you to do 1d “linear” indexing x[i] with no performance penalty. In fact, vectors as well as 1-column matrices in Julia can also be indexed as x[i,1], though this could be very slightly slower because of the extra index computation (which may get optimized away by the compiler in the Vector case, however). But there is a useful semantic distinction between 1d arrays and matrices that is erased in Matlab where nearly everything (including scalars!) is a matrix.
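For concreteness, here is a minimal REPL sketch of both points (the values are chosen for illustration):

julia> x = [1.0, 2.0, 3.0];        # Vector{Float64}: a 1d array

julia> X = reshape(x, 3, 1);       # 3×1 matrix sharing the same memory

julia> x[2], x[2,1], X[2], X[2,1]  # linear and trailing-1 indexing all work
(2.0, 2.0, 2.0, 2.0)

julia> x' * x                      # Vector: the dot product is a scalar
14.0

julia> X' * X                      # 1-column matrix: the result is a 1×1 matrix
1×1 Array{Float64,2}:
 14.0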

(Julia only allows you to resize a 1d Array, but it is possible to implement resizable multidimensional arrays; see ElasticArrays.jl.)
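Continuing the sketch above (the last two lines assume ElasticArrays.jl is installed):

julia> push!(x, 4.0);   # a Vector can grow in place

julia> push!(X, 4.0)    # a Matrix cannot
ERROR: MethodError: no method matching push!(::Array{Float64,2}, ::Float64)

julia> using ElasticArrays

julia> E = ElasticArray{Float64}(undef, 3, 0);   # the last dimension is resizable

julia> append!(E, rand(3)); size(E)
(3, 1)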

6 Likes

Yes, except when they use 1xn by instinct. The inconsistent use of 1xn and nx1 ‘vectors’ in Matlab is a major source of confusion and frustration (and bugs), which you avoid completely by consistently using a proper vector type. I highly recommend it, independently of any performance considerations.
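A quick illustration of why the ambiguity disappears:

julia> v = [1, 2, 3]; size(v)   # a Vector is just (n,): no 1xN-vs-Nx1 question arises
(3,)

julia> size(v')                 # a row shape appears only when you explicitly transpose
(1, 3)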

3 Likes

Is there any advantage in having 1-D arrays as x(n,) over x(n,1)?

For both Julia and Matlab, there shouldn’t be any performance penalty for a 1-D array, as everything is contiguous in memory. So even in Matlab, it doesn’t really matter whether you use Nx1 or 1xN. (Note also that in Matlab, 1:N defaults to a 1xN.)

I suspect you are getting at multidimensional arrays, where this matters more. Memory access is slow, and CPU caches fetch contiguous chunks of memory. You gain more cache benefit by accessing contiguous memory, which for both Matlab and Julia means column-major order. So an NxM array may be quicker to traverse by columns than by rows, and you might design the array or the algorithm accordingly (see the sketch below).

Note that this depends on what you’re storing. Julia can have an Array{Any,1} with elements of varying type and size; such elements are stored by reference rather than inline, so memory-access behavior is trickier to reason about (e.g., a single element may itself exceed the cache size). But if we’re talking about N-dimensional arrays of uniform, “typical” (e.g., Float64 and other isbits) elements, it is helpful to anticipate column ordering.
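Here is a minimal sketch of the column-order effect (the function names are mine, and the actual timings depend on your machine, so I leave timing them to the reader):

julia> function sum_by_columns(A)
           s = 0.0
           for j in axes(A, 2), i in axes(A, 1)   # inner loop walks down a column: contiguous
               s += A[i, j]
           end
           s
       end;

julia> function sum_by_rows(A)
           s = 0.0
           for i in axes(A, 1), j in axes(A, 2)   # inner loop strides across columns
               s += A[i, j]
           end
           s
       end;

julia> M = rand(4096, 4096);

julia> sum_by_columns(M) ≈ sum_by_rows(M)   # same result; @time each to see the cache effect
true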

Nah. I was asking something far more mundane. I’m a pretty decent matlab programmer and it never occurred to me that (n,) was any different from (n,1) in Julia. This thread, as the Zen people say, gave rise to doubt. So I asked, got a pretty clear no, and have walked away happy.

One thing I liked about the older versions of Matlab was that the only data type was the double precision complex array. This is Fortran-like brutality, which some of us appreciate.

May all your constants be Hollerith,

— Tim

2 Likes

But they are different.

2 Likes

The compiler has more information about a vector than about a matrix: from the type alone it knows that any trailing dimension is 1, whereas for a matrix the second dimension is only known at runtime. This can yield better performance in cases such as:

julia> x = rand(128); X = reshape(x, (length(x),1));

julia> typeof.((x, X))
(Array{Float64,1}, Array{Float64,2})

julia> A = rand(32,128); B = similar(A);

julia> @benchmark @. $B = $A + $x'
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     773.423 ns (0.00% GC)
  median time:      784.433 ns (0.00% GC)
  mean time:        792.276 ns (0.00% GC)
  maximum time:     1.226 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     104

julia> @benchmark @. $B = $A + $X'
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     805.562 ns (0.00% GC)
  median time:      816.124 ns (0.00% GC)
  mean time:        821.296 ns (0.00% GC)
  maximum time:     1.146 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     89

LoopVectorization is more extreme:

julia> using LoopVectorization

julia> @benchmark @avx @. $B = $A + $x'
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     592.760 ns (0.00% GC)
  median time:      594.268 ns (0.00% GC)
  mean time:        597.365 ns (0.00% GC)
  maximum time:     800.821 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     179

julia> @benchmark @avx @. $B = $A + $X'
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.022 μs (0.00% GC)
  median time:      1.027 μs (0.00% GC)
  mean time:        1.028 μs (0.00% GC)
  maximum time:     3.066 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

LLVM will often do “multiversioning”: it creates extra loop versions for special cases that can be optimized further, such as the matrix being Nx1. When LLVM does generate code for that special case, the performance difference may not be so large.
LoopVectorization doesn’t multiversion, hence the larger difference there.
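Roughly, multiversioning amounts to something like the following hand-written dispatch (just a sketch of the idea, not what LLVM literally emits; add_transposed! is my name):

julia> function add_transposed!(B, A, X)
           if size(X, 2) == 1     # special case: X is effectively a vector
               x = vec(X)         # no-copy reshape to a Vector
               @. B = A + x'      # the compiler now knows the row count is 1
           else
               @. B = A + X'      # general fallback loop
           end
           return B
       end;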

But you have to pick and choose which cases to optimize for.
I think LLVM will be faster than LoopVectorization on this next example if you don’t have AVX512 (I’m working on something that should let me address this in the next few weeks!), but if you do:

julia> C = rand(128,32);

julia> @benchmark @. $B = $A + $C'
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.534 μs (0.00% GC)
  median time:      2.543 μs (0.00% GC)
  mean time:        2.546 μs (0.00% GC)
  maximum time:     3.700 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> @benchmark @avx @. $B = $A + $C'
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.299 μs (0.00% GC)
  median time:      1.308 μs (0.00% GC)
  mean time:        1.309 μs (0.00% GC)
  maximum time:     3.969 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

My point being that there are a lot of trade-offs in which kinds of cases you optimize for.
Giving the compiler more information, e.g. by using a Vector instead of a Matrix where one axis equals 1, helps it.

9 Likes