Using large NTuples makes Julia hang

Julia 1.7.2 on Linux x86_64 seems to hang when instantiating structures containing large NTuple types, as the following code shows:

struct Foo
    id:: UInt64
    tag::NTuple{32, UInt8}
    start_comment::NTuple{4096, UInt8}
    end_comment::NTuple{4096, UInt8}
end

string2ntuple(s::String, len) = ntuple(i -> i <= ncodeunits(s) ? codeunit(s, i) : '\0', len)

println("Going to create an array of Foo…")
dataset = [
    Foo(
        0,
        string2ntuple("test_tag", 32),
        string2ntuple("start_comment", 4096),
        string2ntuple("end_comment", 4096),
    ),
]
println("…done")

This code prints Going to create an array of Foo… and then hangs; using htop, I see that Julia is allocating a awful lot of memory (several GBs), and I have to press Ctrl+C to get back to the REPL.

The reason why I am creating these large NTuple types is because I need to create HDF5 datasets based on compound types, and a few of these require fixed-length strings with the same sizes as the ones in Foo above. To do this, I need to create a compound data type that has the same memory layout as the following C++ struct:

struct Foo {
    uint64_t id;
    uint8_t tag[32];
    uint8_t start_comment[4096];
    uint8_t end_comment[4096];
}

My Julia code closely follows the approach shown in the tests included in the HDF5.jl library, where however the size of the NTuple is significantly smaller (20 elements).

So, I have two questions:

  1. Is the fact that Julia requires a lot of memory to compile that code expected, or should I file a bug?
  2. Is there some other alternative approach that I could employ to define Foo in a way that matches the C++ definition?
4 Likes

strace the Julia process here and it is making many calls to brk()
I guess that is to be expected!

Maybe you need to use primitive types? Types · The Julia Language

1 Like

@cjdoris The only way I can figure is to define Foo with thousands of UInt8 fields:

struct Foo
    id::UInt8
    field0001::UInt8
    field0002::UInt8
    field0003::UInt8
    field0004::UInt8
    ... 
    field8134::UInt8
end

But I would avoid doing that! What was your idea?

I meant

primitive type TagString 256
primitive type CommentString 32768 end

Unfortunately, yes. Large tuples (which those NTuple are) take a heavy toll on the compiler due to the aggressive constant propagation & other optimizations that are going on for them. The trouble is that that NTuple type expands to a regular Tuple with 4096 UInt8 type parameters, for which most definitely no specialized methods in the whole compiler stack exists. Compiling the required methods for that is what sucks up all that memory.

Nevertheless, I think it shouldn’t be this bad, especially since as part of C-interop this is very much supported - the docs do mention using NTuple as the equivalent for that struct after all. I’d file an issue to see if this can be improved in the general case.

Going with primitive types instead is a nice idea at first glance, but IIRC suffers from other problems at the LLVM level as well as having to define a bunch of boilerplate indexing & accessor methods to behave the same as NTuple. Not to mention that conversion to & from those types is going to be very annoying to deal with (though may be eased when defining an appropriate convert method used in the ccall).

2 Likes

If the OP is only interested in converting these to/from String it’s really very little boilerplate. I’m interested what those LLVM issues are though.

Ok, while trying to check the original example in the OP with primitive type, I noticed that your (@Maurizio_Tomasi) string2ntuple function is what’s causing half the ruckus. You’re returning a large mixed tuple instead of NTuple{N, UInt8}. This forces the compiler to insert conversions in your constructor, which probably kills performance at creation of Foo. Julia uses UTF-8, and its Char is a unicode codepoint (which may be larger than a byte), not a UInt8:

julia> s1 = string2ntuple("test_tag", 32)
(0x74, 0x65, 0x73, 0x74, 0x5f, 0x74, 0x61, 0x67, '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0')

The tuple that’s returned above is a Tuple{UInt8, UInt8, UInt8, UInt8, UInt8, UInt8, UInt8, UInt8, Char, Char, Char, Char, Char, Char, Char, Char, Char, Char, Char, Char, Char, Char, Char, Char, Char, Char, Char, Char, Char, Char, Char, Char}, which isn’t the same as NTuple{32, UInt8}.

Fixing that and creation of Foo works like a charm:

julia> struct Foo
           id:: UInt64
           tag::NTuple{32, UInt8}
           start_comment::NTuple{4096, UInt8}
           end_comment::NTuple{4096, UInt8}
       end

julia> string2ntuple(s::String, len) = ntuple(i -> i <= ncodeunits(s) ? UInt8(codeunit(s, i)) : 0x0, len)
string2ntuple (generic function with 1 method)

julia> @time data = Foo(
               0,
               string2ntuple("test_tag", 32),
               string2ntuple("start_comment", 4096),
               string2ntuple("end_comment", 4096),
           ); # silence output
  0.009649 seconds (2.93 k allocations: 257.717 KiB, 96.70% compilation time)

julia> @time data = Foo(
               0,
               string2ntuple("test_tag", 32),
               string2ntuple("start_comment", 4096),
               string2ntuple("end_comment", 4096),
           ); # silence output, second run so compilation is done
  0.000276 seconds (9 allocations: 89.016 KiB)

julia> data.tag |> collect |> String
"test_tag\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"

But creating the array is still running (which also didn’t work for your string2ntuple version):

julia> @time dataset = Foo[data]

I’ll see what it reports once it’s done, but I’d say there’s a bug here. I’m also not sure whether that conversion Char -> UInt8 should take up this much in your bad string2ntuple, there may be a better way to write that function in the first place.

3 Likes

Ok, digging even further, even creation of the array is (sometimes?!) fine (all in a fresh session):

[sukera@tempman ~]$ julia -q
julia> struct Foo
           id:: UInt64
           tag::NTuple{32, UInt8}
           start_comment::NTuple{4096, UInt8}
           end_comment::NTuple{4096, UInt8}
       end

julia> string2ntuple(s::String, len) = ntuple(i -> i <= ncodeunits(s) ? codeunit(s, i) : 0x0, len)
string2ntuple (generic function with 1 method)

julia> @time dataset = Vector{Foo}(undef, 1);
  0.000011 seconds (1 allocation: 8.188 KiB)

julia> @time dataset[1] = Foo(
               0,
               string2ntuple("test_tag", 32),
               string2ntuple("start_comment", 4096),
               string2ntuple("end_comment", 4096),
           ); # silence output
  0.022675 seconds (4.25 k allocations: 344.307 KiB, 98.39% compilation time)

julia> @time dataset[1] = Foo(
               0,
               string2ntuple("test_tag", 32),
               string2ntuple("start_comment", 4096),
               string2ntuple("end_comment", 4096),
           ); # silence output
  0.000272 seconds (9 allocations: 89.016 KiB)

julia> @time data2 = Foo[];
  0.000010 seconds (1 allocation: 64 bytes)

julia> @time push!(data2, Foo(
               0,
               string2ntuple("test_tag", 32),
               string2ntuple("start_comment", 4096),
               string2ntuple("end_comment", 4096),
           )); # silence output
  0.014009 seconds (9.06 k allocations: 700.702 KiB, 97.74% compilation time)

You may notice that I’ve silenced all outputs - seems like this is what’s causing the slowdown, so there’s an issue there! Creating & typing the array inline (like dataset = Foo[data]) is also still borked.

For reference:

julia> @time show(stdout, dataset)
Foo[Foo(0x0000000000000000, [...] # I've ommited all the empty data
)] 32.848714 seconds (588.41 k allocations: 60.553 MiB, 0.04% gc time, 99.28% compilation time)

so that’s just for showing the result to the REPL.

julia> versioninfo()
Julia Version 1.9.0-DEV.171
Commit 842c27fad3 (2022-03-11 06:44 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 4 on 4 virtual cores
5 Likes

I’ve opened an issue to track this divergent/inconsistent behavior (and hopefully get it fixed, somehow):

https://github.com/JuliaLang/julia/issues/44834


EDIT:

Ok, I had it running in the background and it finally finished:

julia> @time data = Foo[Foo(
               0,
               string2ntuple("test_tag", 32),
               string2ntuple("start_comment", 4096),
               string2ntuple("end_comment", 4096),
           )]; # silence output
1255.934351 seconds (4.65 k allocations: 357.717 KiB, 100.00% compilation time)

Yeah, that definitely shouldn’t be the case here.

3 Likes

Yes, large tuples like this are know to cause very long compile times and should generally be avoided.

You could allocate the data as just plain bits and then make Foo just a wrapper around a pointer. Admittedly, it’s a little bit awkward to write:

julia> struct Foo
           ptr::Ptr{Cvoid}
       end

julia> function Base.getproperty(f::Foo, s::Symbol)
           ptr = getfield(f, :ptr)
           if s === :id
               unsafe_load(Ptr{UInt64}(ptr))
           elseif s === :tag
               unsafe_string(Ptr{UInt8}(ptr + sizeof(UInt64)), 32)
           elseif s === :start_comment
               unsafe_string(Ptr{UInt8}(ptr + sizeof(UInt64) + 32 * sizeof(UInt8)), 4096)
           elseif s === :end_comment
               unsafe_wrap(Ptr{UInt8}(ptr + sizeof(UInt64) + (32 + 4096) * sizeof(UInt8)), 4096)
           else
               error("type Foo has no field $s")
           end
       end

julia> function Base.setproperty!(f::Foo, s::Symbol, x)
           ptr = getfield(f, :ptr)
           if s === :id
               unsafe_store!(Ptr{UInt64}(ptr), UInt64(x))
           elseif s === :tag
               x = String(x)
               n = min(sizeof(x), 32)
               dest = Ptr{UInt8}(ptr + sizeof(UInt64))
               unsafe_copyto!(dest, pointer(x), n)
               for i in n:31
                   unsafe_store!(dest + i, 0x00)
               end
           elseif s === :start_comment
               x = String(x)
               n = min(sizeof(x), 4096)
               dest = Ptr{UInt8}(ptr + sizeof(UInt64) + 32 * sizeof(UInt8))
               unsafe_copyto!(dest, pointer(x), n)
               for i in n:4095
                   unsafe_store!(dest + i, 0x00)
               end
           elseif s === :end_comment
               x = String(x)
               n = min(sizeof(x), 4096)
               dest = Ptr{UInt8}(ptr + sizeof(UInt64) + (32 + 4096) * sizeof(UInt8))
               unsafe_copyto!(dest, pointer(x), n)
               for i in n:4095
                   unsafe_store!(dest + i, 0x00)
               end
           else
               error("type Foo has no field $s")
           end
           f
       end

julia> function Foo(p::Ptr, id, tag="", start_comment="", end_comment="")
           f = Foo(p)
           f.id = id
           f.tag = tag
           f.start_comment = start_comment
           f.end_comment = end_comment
       end
Foo

julia> _sizeof(::Type{Foo}) = sizeof(UInt64) + (32 + 4096 + 4096) * sizeof(UInt8)
_sizeof (generic function with 1 method)

julia> alloc_foos(n) = view(Vector{UInt8}(undef, n * _sizeof(Foo)), (0:n-1) .* _sizeof(Foo) .+ 1);

julia> dataset = alloc_foos(3)
3-element view(::Vector{UInt8}, 1:8232:16465) with eltype UInt8:
 0x70
 0x00
 0x00

julia> f = Foo(pointer(dataset, 2))
Foo(Ptr{Nothing} @0x000000000457da28)

julia> f.id = 2
2

julia> f.id
0x0000000000000002

Also keep in mind that you are responsible for making sure the underlying array is still rooted, so in a function you’d need to enclose any operations you do on a Foo inside GC.@preserve dataset begin ... end. I think it might make sense to have a language feature that would allow you to just insert some padding of a given size into some struct. This would probably make it much less awkward to deal with such cases.

2 Likes

I’d generally agree, but as shown above, creating & compiling the functions using that type generally seems fine, it’s printing and one version of creating an array that’s slow due to compilation. Even compiling & showing the NTuple on its own is reasonably fast:

julia> @time show(stdout, data.start_comment)
(0x73, [....]) 9.174167 seconds (118.85 k allocations: 5.076 MiB, 98.52% compilation time)

There is more going on here than just “the tuple is too large” imo.

2 Likes

How much large tuples effect compile times very much depends on the function. It’s not a constant overhead, the issue is that certain patterns in the compiler don’t scale well to large tuples. So it’s expected that certain functions are much more susceptible to this than others.

you should try this on a recent built of master. there have been a large number of PRs merged in the past 3 months that should speed this up.

Well the timings from above were from a build from early march, but since you’ve asked I built the current master.

Timing results
julia> struct Foo
           id:: UInt64
           tag::NTuple{32, UInt8}
           start_comment::NTuple{4096, UInt8}
           end_comment::NTuple{4096, UInt8}
       end

julia> string2ntuple(s::String, len) = ntuple(i -> i <= ncodeunits(s) ? codeunit(s, i) : 0x0, len)
string2ntuple (generic function with 1 method)

julia> @time dataset = Vector{Foo}(undef, 1);
  0.000028 seconds (2 allocations: 8.250 KiB)

julia> @time dataset[1] = Foo(
               0,
               string2ntuple("test_tag", 32),
               string2ntuple("start_comment", 4096),
               string2ntuple("end_comment", 4096),
           ); # silence output
  0.027287 seconds (3.44 k allocations: 329.291 KiB, 98.10% compilation time)

julia> @time dataset[1] = Foo(
               0,
               string2ntuple("test_tag", 32),
               string2ntuple("start_comment", 4096),
               string2ntuple("end_comment", 4096),
           ); # silence output
  0.000447 seconds (9 allocations: 89.016 KiB)

julia> @time data2 = Foo[];
  0.000013 seconds (2 allocations: 128 bytes)

julia> @time push!(data2, Foo(
               0,
               string2ntuple("test_tag", 32),
               string2ntuple("start_comment", 4096),
               string2ntuple("end_comment", 4096),
           )); # silence output
  0.013863 seconds (6.60 k allocations: 650.905 KiB, 96.77% compilation time)

julia> @time show(stdout, dataset)
Foo[Foo(0x0000000000000000, [...] # zeroed output omitted
)] 42.590245 seconds (452.47 k allocations: 57.910 MiB, 0.03% gc time, 99.29% compilation time)

The timings for everything is about the same, except there’s a 10s regression for show(stdout, ::Vector{Foo}).

This is my versioninfo():

julia> versioninfo()
Julia Version 1.9.0-DEV.306
Commit 5916faf418 (2022-04-02 12:17 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 4 on 4 virtual cores

I have @time [Foo(...)]; running in the background. I’ll post the results when it’s done.

EDIT:

Results are in - almost no change.

julia> @time [data]; # data is a `Foo`
1267.163785 seconds (11.92 k allocations: 842.575 KiB, 100.00% compilation time)

I’m not as patient as you, so I reduced the size to 1536 and did my experiments. My version for NTuple

using BenchmarkTools

# string2ntuple(::Val{N}, s::String) where {N} = ntuple(i -> i <= ncodeunits(s) ? codeunit(s, i) : 0x0, Val(N))
string2ntuple(::Val{N}, s::String) where {N} = ntuple(i -> i <= ncodeunits(s) ? codeunit(s, i) : 0x0, N)

struct Foo{NT, NC} 
    id:: UInt64
    tag::NTuple{NT, UInt8}
    start_comment::NTuple{NC, UInt8}
    end_comment::NTuple{NC, UInt8}
    function Foo{NT, NC}(id, tag, start_comment, end_comment) where {NT, NC} 
        new{NT, NC}(
            id, 
            string2ntuple(Val(NT), tag),
            string2ntuple(Val(NC), start_comment),
            string2ntuple(Val(NC), end_comment))
    end
end

function test()
    dataset = @btime Vector{Foo{32, 1536}}(undef, 1);

    dataset[1] = @btime Foo{32, 1536}(
        0, "test_tag", "start_comment", "end_comment"
    ); 

    data = @btime Foo{32, 1536}[]

    @btime push!(data, Foo{32, 1536}(
        0, "test_tag", "start_comment", "end_comment"
    )) setup=(data = Foo{32, 1536}[]); 

    data = @btime Foo{32, 1536}(
        0, "test_tag", "start_comment", "end_comment"
    ); 

    dataset = @time Foo{32, 1536}[data]
    dataset = @btime Foo{32, 1536}[$data]
    nothing
end

yields

  77.636 ns (1 allocation: 3.19 KiB)
  69.100 μs (8 allocations: 30.77 KiB)
  24.349 ns (1 allocation: 64 bytes)
  69.900 μs (9 allocations: 55.14 KiB)
  69.100 μs (8 allocations: 30.77 KiB)
 56.395801 seconds (4.64 k allocations: 263.561 KiB, 100.00% compilation time) # critical compilation time
  317.031 ns (1 allocation: 3.19 KiB)

and for StaticVector

using StaticArrays, BenchmarkTools

string2svec(::Val{N}, s::String) where {N} = SVector{N, UInt8}(ntuple(i -> i <= ncodeunits(s) ? codeunit(s, i) : 0x0, N))

struct Foo{NT, NC} 
    id:: UInt64
    tag::StaticVector{NT, UInt8}
    start_comment::StaticVector{NC, UInt8}
    end_comment::StaticVector{NC, UInt8}
    function Foo{NT, NC}(id, tag, start_comment, end_comment) where {NT, NC} 
        new{NT, NC}(
            id, 
            string2svec(Val(NT), tag),
            string2svec(Val(NC), start_comment),
            string2svec(Val(NC), end_comment))
    end
end

function test()
    dataset = @btime Vector{Foo{32, 1536}}(undef, 1);

    dataset[1] = @btime Foo{32, 1536}(
        0, "test_tag", "start_comment", "end_comment"
    ); 

    data = @btime Foo{32, 1536}[]

    @btime push!(data, Foo{32, 1536}(
        0, "test_tag", "start_comment", "end_comment"
    )) setup=(data = Foo{32, 1536}[]); 

    data = @btime Foo{32, 1536}(
        0, "test_tag", "start_comment", "end_comment"
    ); 

    dataset = @time Foo{32, 1536}[data]
    dataset = @btime Foo{32, 1536}[$data]
    nothing
end

yielding

  25.201 ns (1 allocation: 96 bytes)
  69.300 μs (11 allocations: 34.00 KiB)
  24.900 ns (1 allocation: 64 bytes)
  69.300 μs (12 allocations: 34.27 KiB)
  69.200 μs (11 allocations: 34.00 KiB)
  0.007673 seconds (4.64 k allocations: 260.451 KiB, 99.27% compilation time) # critical compilation time
  28.945 ns (1 allocation: 96 bytes)

This surprises me, so I’m assuming I’m going wrong somewhere?

Edit: this is on Julia Version 1.9.0-DEV.167

Did you measure both in the same session? I think StaticArrays is just tuples underneath, so most of the compilation cost would already have been paid?

1 Like

Foo is redefined, so not the same session I believe.

Note that I’m very deliberately not using BenchmarkTools here, since I’m interested in the compile time which isn’t captured by BenchmarkTools. This is also why this is all “benchmarked” in global scope - compilation only happens once anyway.

The code that’s produced is itself very fast, it’s almost not doing anything anyway.

It’s very probable that the N of NTuple plays a crucial role here - we’re talking about tuple compiler latency after all. I suspect (from what I can tell based on the LLVM IR) this has to do with each element of the tuple being handled as an element of an LLVM struct, complicating compilation, but I’m not well versed enough with the internals of the julia compiler to say for certain. It’s a well-known problem after all, I’m just very curious why this makes a difference for show and getindex(Foo, Foo(...)) but not for other calls. This doesn’t make much sense to me, and looking at the functions involved (at least for getindex) and what the LLVM IR looks like (see here for a gist that produces the IR for you - don’t worry, generating that is surprisingly fast!) I wonder if this is, in part, a reason why large NTuple are a bottleneck in the compiler.


This is sadly as far as I can go with investigating what’s slow, as I have no idea how to profile the compiler properly, how the relevant code paths work, what sort of assumptions/invariants need to be observed by that code etc.

Yeah, I tried to capture the difference in critical compile time between NTuple and StaticVector, also trying to eliminate all possible ‘external’ influence. Interestingly both of my measurements for NTuple and StaticVector via @time show nearly identical allocations, nearly 100% compilation time but quite different wall times (and the examples ‘feel’ quite differently when executing).

Yes. Some observations:

  1. Up to N=1024 everything seems half ways reasonable
  2. As far as I have noticed in this example
string2ntuple(::Val{N}, s::String) where {N} = ntuple(i -> i <= ncodeunits(s) ? codeunit(s, i) : 0x0, Val(N))

seems to be worse than

string2ntuple(::Val{N}, s::String) where {N} = ntuple(i -> i <= ncodeunits(s) ? codeunit(s, i) : 0x0, N)

for the other test cases.