Strange performance of literal array constructor

DNF · October 22, 2024, 9:19am

I was surprised by this performance measurement (Julia v1.11.1):

julia> foo() = Int8[1 2 3; 4 5 6; 7 8 9]
foo (generic function with 1 method)

julia> bar() = Matrix{Int8}([1 2 3; 4 5 6; 7 8 9])
bar (generic function with 1 method)

julia> @btime foo()
  203.167 ns (3 allocations: 176 bytes)
3×3 Matrix{Int8}:
 1  2  3
 4  5  6
 7  8  9

julia> @btime bar()
  68.238 ns (4 allocations: 240 bytes)
3×3 Matrix{Int8}:
 1  2  3
 4  5  6
 7  8  9

bar creates a Matrix{Int64}, which is then converted to Matrix{Int8}, while foo constructs the Matrix{Int8} directly. So why is the performance the opposite of what I expect?

It scales badly too. For 9x9 matrices the difference is almost a factor of 10.

For Vector I see the expected behaviour:

julia> foo() = Int8[1, 2, 3, 4, 5, 6, 7, 8, 9]
foo (generic function with 1 method)

julia> bar() = Vector{Int8}([1, 2, 3, 4, 5, 6, 7, 8, 9])
bar (generic function with 1 method)

julia> @btime foo();
  20.020 ns (2 allocations: 80 bytes)

julia> @btime bar();
  44.646 ns (3 allocations: 176 bytes)
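
One relevant detail is that the Vector literals are lowered to different functions than the matrix literals, so the two cases do not share a code path. The following is a minimal sketch of that equivalence using only Base; the variable names are illustrative, and it checks results and element types only, not timings.

```julia
# A typed vector literal T[a, b, c] is lowered to getindex(T, a, b, c),
# and an untyped literal [a, b, c] is lowered to Base.vect(a, b, c).
typed_literal   = Int8[1, 2, 3, 4, 5, 6, 7, 8, 9]
typed_call      = getindex(Int8, 1, 2, 3, 4, 5, 6, 7, 8, 9)

untyped_literal = [1, 2, 3, 4, 5, 6, 7, 8, 9]
untyped_call    = Base.vect(1, 2, 3, 4, 5, 6, 7, 8, 9)

# Same values and same container types on both paths.
@assert typed_literal == typed_call && typed_call isa Vector{Int8}
@assert untyped_literal == untyped_call && untyped_call isa Vector{Int}
```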

abraemer · October 22, 2024, 12:11pm

This is strange indeed. Maybe this is due to some inlining failure, and it used to behave the way you’d expect? If you add @inline, foo speeds up by roughly 10x for me, while bar still gains ~20%.

Without inlining:

julia> foo() = Int8[1 2 3; 4 5 6; 7 8 9]
foo (generic function with 1 method)
julia> bar() = Matrix{Int8}([1 2 3; 4 5 6; 7 8 9])
bar (generic function with 1 method)

julia> @btime foo();
  191.750 ns (3 allocations: 176 bytes)
julia> @btime bar();
  52.826 ns (4 allocations: 240 bytes)

With @inline:

julia> foo_inline() = @inline Int8[1 2 3; 4 5 6; 7 8 9]
foo_inline (generic function with 1 method)
julia> bar_inline() = @inline Matrix{Int8}([1 2 3; 4 5 6; 7 8 9])
bar_inline (generic function with 1 method)

julia> @btime foo_inline();
  23.982 ns (2 allocations: 96 bytes)
julia> @btime bar_inline();
  41.018 ns (4 allocations: 240 bytes)

With inlining, these numbers are comparable to (but a bit slower than) those of the Vector variants:

julia> foo_vec() = Int8[1, 2, 3, 4, 5, 6, 7, 8, 9]
foo_vec (generic function with 1 method)
julia> bar_vec() = Vector{Int8}([1, 2, 3, 4, 5, 6, 7, 8, 9])
bar_vec (generic function with 1 method)

julia> @btime foo_vec();
  15.676 ns (2 allocations: 80 bytes)
julia> @btime bar_vec();
  35.027 ns (3 allocations: 176 bytes)

EDIT:
I had a quick check across some versions: the numbers above are from 1.11.1. On Julia 1.10.5 it was the same. However, on the old 1.6, where everything is quite a bit slower, foo is indeed a bit faster than bar:

# Julia 1.6.7
julia> @btime foo()
  144.099 ns (2 allocations: 176 bytes)

julia> @btime bar()
  167.344 ns (3 allocations: 336 bytes)

jling · October 22, 2024, 12:22pm

This is the opposite, right? foo is faster here, and if you look at @code_typed it’s not very surprising.

jling · October 22, 2024, 12:25pm

Back to the original case, this is what’s happening:

julia> @code_typed foo()
CodeInfo(
1 ─ %1 = invoke Base.typed_hvcat(Main.Int8::Type{Int8}, (3, 3, 3)::Tuple{Int64, Int64, Int64}, 1::Int64, 2::Vararg{Int64}, 3, 4, 5, 6, 7, 8, 9)::Matrix{Int8}
└──      return %1
) => Matrix{Int8}

julia> @code_typed bar()
CodeInfo(
1 ─ %1 = invoke Base.hvcat((3, 3, 3)::Tuple{Int64, Int64, Int64}, 1::Int64, 2::Vararg{Int64}, 3, 4, 5, 6, 7, 8, 9)::Matrix{Int64}
│   %2 = invoke Matrix{Int8}(%1::Matrix{Int64})::Matrix{Int8}
└──      return %2
) => Matrix{Int8}

so it comes down to this:

julia> @b hvcat((3, 3, 3), 1, 2, 3, 4, 5, 6, 7, 8, 9)
35.928 ns (2 allocs: 144 bytes)

julia> @b Matrix{Int8}(hvcat((3, 3, 3), 1, 2, 3, 4, 5, 6, 7, 8, 9))
78.856 ns (4 allocs: 240 bytes)

julia> @b Base.typed_hvcat(Int8, (3, 3, 3), 1, 2, 3, 4, 5, 6, 7, 8, 9)
240.451 ns (3 allocs: 176 bytes)

There should probably be an issue about why typed_hvcat is slower than hvcat followed by a conversion.
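
The correspondence between the literal syntax and these calls can be checked directly, without a benchmark. This is a small sketch using only Base; the variable names are illustrative, and it says nothing about speed, only that both routes produce the same matrix.

```julia
# The tuple (3, 3, 3) gives the number of elements in each of the three rows.
typed_literal   = Int8[1 2 3; 4 5 6; 7 8 9]
typed_call      = Base.typed_hvcat(Int8, (3, 3, 3), 1, 2, 3, 4, 5, 6, 7, 8, 9)

untyped_literal = [1 2 3; 4 5 6; 7 8 9]
untyped_call    = Base.hvcat((3, 3, 3), 1, 2, 3, 4, 5, 6, 7, 8, 9)

# The literal and the explicit call agree in value and type.
@assert typed_literal == typed_call && typed_call isa Matrix{Int8}
@assert untyped_literal == untyped_call && untyped_call isa Matrix{Int}

# Converting the untyped result gives the same Matrix{Int8} as the typed path.
converted = Matrix{Int8}(untyped_call)
@assert converted == typed_call && converted isa Matrix{Int8}
```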

abraemer · October 22, 2024, 12:42pm

I meant the numbers of the inlined variants are similar to the Vector-based variants. But my statement is certainly ambiguous - will edit for clarity.

It could be that typed_hvcat is expected to be inlined. This is a somewhat brittle assumption, so perhaps it just broke due to some (unrelated) work on the compiler? As you can see from the timings, when it is inlined it is faster, as it should be.
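
For anyone who wants to reproduce the comparison without BenchmarkTools, here is a rough sketch using only Base. The helper seconds_per_call is hypothetical (not part of any package), call-site @inline requires Julia 1.8 or later, and a plain @elapsed loop is much noisier than @btime, so treat the results as indicative only.

```julia
foo() = Int8[1 2 3; 4 5 6; 7 8 9]
foo_inline() = @inline Int8[1 2 3; 4 5 6; 7 8 9]  # call-site @inline, Julia 1.8+

# Hypothetical helper: average wall-clock seconds per call of f over n calls.
function seconds_per_call(f, n)
    f()                          # warm-up call so compilation is not timed
    s = 0
    t = @elapsed for _ in 1:n
        s += Int(f()[end])       # touch the result so the call stays in the loop
    end
    return t / n
end

# Both definitions build the same matrix; only the timing should differ.
@assert foo() == foo_inline()

t_plain  = seconds_per_call(foo, 100_000)
t_inline = seconds_per_call(foo_inline, 100_000)
println("foo:        ", round(t_plain * 1e9; digits = 1), " ns per call")
println("foo_inline: ", round(t_inline * 1e9; digits = 1), " ns per call")
```

Inspecting @code_typed foo_inline() should also show whether the invoke of Base.typed_hvcat has disappeared.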

DNF · October 22, 2024, 1:21pm

So, this is perhaps more of a minimal example then:

julia> foo() = Int[1 2 3; 4 5 6; 7 8 9]
foo (generic function with 3 methods)

julia> bar() = [1 2 3; 4 5 6; 7 8 9]
bar (generic function with 1 method)

julia> @btime foo();
  209.524 ns (3 allocations: 224 bytes)

julia> @btime bar();
  36.052 ns (2 allocations: 144 bytes)

I can open an issue.
