Strange performance of literal array constructor

I was surprised by this performance measurement (Julia v1.11.1):

julia> foo() = Int8[1 2 3; 4 5 6; 7 8 9]
foo (generic function with 1 method)

julia> bar() = Matrix{Int8}([1 2 3; 4 5 6; 7 8 9])
bar (generic function with 1 method)

julia> @btime foo()
  203.167 ns (3 allocations: 176 bytes)
3×3 Matrix{Int8}:
 1  2  3
 4  5  6
 7  8  9

julia> @btime bar()
  68.238 ns (4 allocations: 240 bytes)
3×3 Matrix{Int8}:
 1  2  3
 4  5  6
 7  8  9

bar creates a Matrix{Int64}, which is then converted to a Matrix{Int8}, while foo constructs the Matrix{Int8} directly. So why is the performance the opposite of what I expect?
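To make the two construction paths explicit, here is a small sketch that spells out the intermediate array bar goes through. The variable names (tmp, converted, direct) are only for illustration, and the Int64 element type assumes a 64-bit machine.

```julia
# What bar() does: build with the default element type, then convert.
tmp = [1 2 3; 4 5 6; 7 8 9]      # element type is Int (Int64 on a 64-bit machine)
converted = Matrix{Int8}(tmp)    # a second array holding the converted copy

# What foo() does: ask for Int8 elements right away.
direct = Int8[1 2 3; 4 5 6; 7 8 9]

println(typeof(tmp))
println(typeof(converted))
println(typeof(direct))
println(converted == direct)     # both paths should hold the same values
```

Both paths end in a Matrix{Int8} with identical contents, so the difference in the timings is purely about how the array gets built.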

It scales badly too. For 9x9 matrices the difference is almost a factor of 10.

For Vector I see the expected behaviour:

julia> foo() = Int8[1, 2, 3, 4, 5, 6, 7, 8, 9]
foo (generic function with 1 method)

julia> bar() = Vector{Int8}([1, 2, 3, 4, 5, 6, 7, 8, 9])
bar (generic function with 1 method)

julia> @btime foo();
  20.020 ns (2 allocations: 80 bytes)

julia> @btime bar();
  44.646 ns (3 allocations: 176 bytes)

This is strange indeed. Maybe this is due to some inlining failure, and it used to behave the way you’d expect? If I add @inline, foo speeds up by roughly 10x for me, while bar still gains ~20%.

Without inlining:

julia> foo() = Int8[1 2 3; 4 5 6; 7 8 9]
foo (generic function with 1 method)
julia> bar() = Matrix{Int8}([1 2 3; 4 5 6; 7 8 9])
bar (generic function with 1 method)

julia> @btime foo();
  191.750 ns (3 allocations: 176 bytes)
julia> @btime bar();
  52.826 ns (4 allocations: 240 bytes)

With @inline:

julia> foo_inline() = @inline Int8[1 2 3; 4 5 6; 7 8 9]
foo_inline (generic function with 1 method)
julia> bar_inline() = @inline Matrix{Int8}([1 2 3; 4 5 6; 7 8 9])
bar_inline (generic function with 1 method)

julia> @btime foo_inline();
  23.982 ns (2 allocations: 96 bytes)
julia> @btime bar_inline();
  41.018 ns (4 allocations: 240 bytes)

These numbers (with inlining) are comparable to (but a bit slower than) the variant with Vector:

julia> foo_vec() = Int8[1, 2, 3, 4, 5, 6, 7, 8, 9]
foo_vec (generic function with 1 method)
julia> bar_vec() = Vector{Int8}([1, 2, 3, 4, 5, 6, 7, 8, 9])
bar_vec (generic function with 1 method)

julia> @btime foo_vec();
  15.676 ns (2 allocations: 80 bytes)
julia> @btime bar_vec();
  35.027 ns (3 allocations: 176 bytes)
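Since the typed Vector literal is the fast case in these timings, one possible workaround is to write the entries as a Vector and reshape it. This is only a sketch and I have not benchmarked it here; baz is a made-up name, and the entries have to be listed column by column because Julia arrays are column-major.

```julia
# Hypothetical workaround: typed Vector literal + reshape instead of the
# matrix literal. Entries are given column by column (column-major order).
baz() = reshape(Int8[1, 4, 7, 2, 5, 8, 3, 6, 9], 3, 3)

m = baz()
println(m == Int8[1 2 3; 4 5 6; 7 8 9])  # same contents as the matrix literal
```

The obvious downside is readability: the transposed ordering is easy to get wrong, so this only makes sense where the construction cost actually matters.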

EDIT:
I had a quick check across some versions: the numbers above are from 1.11.1. On Julia 1.10.5, it was the same. However, on the old 1.6, where everything is quite a bit slower, foo is indeed a bit faster than bar:

# Julia 1.6.7
julia> @btime foo()
  144.099 ns (2 allocations: 176 bytes)

julia> @btime bar()
  167.344 ns (3 allocations: 336 bytes)

This is the opposite, right? foo is faster here, and if you look at @code_typed it’s not very surprising.

Back to the original case, this is what’s happening:

julia> @code_typed foo()
CodeInfo(
1 ─ %1 = invoke Base.typed_hvcat(Main.Int8::Type{Int8}, (3, 3, 3)::Tuple{Int64, Int64, Int64}, 1::Int64, 2::Vararg{Int64}, 3, 4, 5, 6, 7, 8, 9)::Matrix{Int8}
└──      return %1
) => Matrix{Int8}

julia> @code_typed bar()
CodeInfo(
1 ─ %1 = invoke Base.hvcat((3, 3, 3)::Tuple{Int64, Int64, Int64}, 1::Int64, 2::Vararg{Int64}, 3, 4, 5, 6, 7, 8, 9)::Matrix{Int64}
│   %2 = invoke Matrix{Int8}(%1::Matrix{Int64})::Matrix{Int8}
└──      return %2
) => Matrix{Int8}

So it comes down to this:

julia> @b hvcat((3, 3, 3), 1, 2, 3, 4, 5, 6, 7, 8, 9)
35.928 ns (2 allocs: 144 bytes)

julia> @b Matrix{Int8}(hvcat((3, 3, 3), 1, 2, 3, 4, 5, 6, 7, 8, 9))
78.856 ns (4 allocs: 240 bytes)

julia> @b Base.typed_hvcat(Int8, (3, 3, 3), 1, 2, 3, 4, 5, 6, 7, 8, 9)
240.451 ns (3 allocs: 176 bytes)

There should probably be an issue about why typed_hvcat is slower than hvcat followed by a conversion.
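For anyone writing up that issue, a quick sanity check that the two lowered forms from the @code_typed output really produce the same matrix, so the only difference is speed. The names dims and vals are mine; the calls are the same ones shown in the @code_typed output above.

```julia
dims = (3, 3, 3)                      # three rows with three entries each
vals = (1, 2, 3, 4, 5, 6, 7, 8, 9)

# What bar() lowers to: untyped hvcat, then a conversion to Matrix{Int8}.
via_convert = Matrix{Int8}(hvcat(dims, vals...))

# What foo() lowers to: a single typed_hvcat call.
typed = Base.typed_hvcat(Int8, dims, vals...)

println(typed == via_convert)
println(typeof(typed) == typeof(via_convert))
```

Note that typed_hvcat is not exported from Base, which is why it needs the Base. prefix.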

I meant the numbers of the inlined variants are similar to the Vector-based variants. But my statement is certainly ambiguous - will edit for clarity.

It could be that typed_hvcat is expected to be inlined. This is a somewhat brittle assumption, so perhaps it just broke due to some (unrelated) work on the compiler? As you can see from the timings, when it is inlined it is faster, as it should be.

So, this is perhaps more of a minimal example then:

julia> foo() = Int[1 2 3; 4 5 6; 7 8 9]
foo (generic function with 3 methods)

julia> bar() = [1 2 3; 4 5 6; 7 8 9]
bar (generic function with 1 method)

julia> @btime foo();
  209.524 ns (3 allocations: 224 bytes)

julia> @btime bar();
  36.052 ns (2 allocations: 144 bytes)

I can open an issue.
