How to avoid compiling methods for different types when @nospecialize doesn't seem to work?

Consider the following struct:

struct MetaTuple
    data::Tuple{Vararg{Union{Tuple{}, Tuple{Int16}, Tuple{Int16, Int16}, Tuple{Int16, Int16, Int16}, Tuple{Int16, Int16, Int16, Int16}}}}
end

Its data field is a tuple of tuples. The outer tuple has variable length (on the order of 1000), but the ‘inner’ tuples all have length in 0:4.

Actually, consider that I have millions of them[1], and that I would like to either 1) serialize them directly or 2) transform them into simpler structs for serialization.

I am running into the same problem in either case: because no two of the metatuples have the same shape, every call to serialize or flatten (defined in a moment) spends more than 99% of its time on compilation. It takes multiple seconds per struct to flatten them, no matter how many I have.

I know that this is what @nospecialize is for, but I haven’t been able to figure out how to make it work. I’ve tried a lot of approaches, to no avail. Here’s an example of code that I thought might work, but which doesn’t[2]:

module Scratch
using Random

struct MetaTuple
    data::Tuple{Vararg{Union{Tuple{}, Tuple{Int16}, Tuple{Int16, Int16}, Tuple{Int16, Int16, Int16}, Tuple{Int16, Int16, Int16, Int16}}}}
end

function fakeData()
    ntups = rand(500:1500)
    tlengths = rand(0:4, ntups)
    data = [tlen == 0 ? () : Int16.(Tuple(rand(1:ntups, tlen))) for tlen in tlengths]
    return MetaTuple(Tuple(data))
end

struct FlatStruct
    L::Tuple{Vararg{Int8}}
    data::Tuple{Vararg{Int16}}
end

function flatten(@nospecialize(mt))
    L = length.(mt.data)
    data = Vector{Int16}(undef, sum(L))
    n = 1
    for ix in eachindex(mt.data)
        for x in mt.data[ix]
            data[n] = x
            n += 1
        end
    end
    return FlatStruct(Tuple(L), Tuple(data))
end

mt = fakeData()
@time flatten(mt)

mt = fakeData()
@time flatten(mt)

end

Julia is still compiling from scratch the second (and every subsequent) time I call flatten.

  4.683906 seconds (8.90 M allocations: 539.773 MiB, 4.67% gc time, 99.85% compilation time)
  2.469198 seconds (6.29 M allocations: 362.959 MiB, 3.08% gc time, 99.85% compilation time)
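
Each value returned by fakeData has a data field with a different concrete tuple type, which is presumably why a fresh specialization gets compiled on every call. A quick check, using the definitions above:

typeof(fakeData().data) === typeof(fakeData().data)  # almost certainly false: the two random shapes differ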

My best guess is that it’s the value of mt.data that needs to be @nospecialized, rather than mt itself. But that approach doesn’t seem to work either:

function flatten(@nospecialize(X))
    L = length.(X)
    data = Vector{Int16}(undef, sum(L))
    n = 1
    for ix in eachindex(X)
        for x in X[ix]
            data[n] = x
            n += 1
        end
    end
    return FlatStruct(Tuple(L), Tuple(data))
end

mt = fakeData()
@time flatten(mt.data)

mt = fakeData()
@time flatten(mt.data)

which gives:

  1.677197 seconds (5.15 M allocations: 290.645 MiB, 3.50% gc time, 99.84% compilation time)
  4.487556 seconds (8.76 M allocations: 529.944 MiB, 3.50% gc time, 99.61% compilation time)

So there’s something I don’t understand about @nospecialize. Do you know what I’m doing wrong?


  1. And in case it matters, they each actually have several such data fields rather than just one. This example has been simplified for “clarity”. ↩︎

  2. Replacing flatten with serialize leads to exactly the same problems. ↩︎

One suggestion I have is defining a type that holds a Tuple of length 4 and has an additional integer indicating its actual length. Then you can write manual dispatches in your code for the 0/1/2/3/4 cases.
This emulates what the compiler would ideally be doing, but being explicit gives you more control.
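
A minimal sketch of that idea (FixedTuple and total are illustrative names, not from the original code):

struct FixedTuple
    len::Int8                # number of valid entries, always in 0:4
    vals::NTuple{4, Int16}   # fixed-size storage; slots past len are padding
end

# Pad an inner tuple of length 0-4 out to four slots.
FixedTuple(t::Tuple) = FixedTuple(Int8(length(t)),
    ntuple(i -> i <= length(t) ? Int16(t[i]) : Int16(0), 4))

# "Manual dispatch" on the stored length instead of on the tuple's type.
function total(ft::FixedTuple)
    ft.len == 0 && return Int16(0)
    ft.len == 1 && return ft.vals[1]
    ft.len == 2 && return ft.vals[1] + ft.vals[2]
    ft.len == 3 && return ft.vals[1] + ft.vals[2] + ft.vals[3]
    return ft.vals[1] + ft.vals[2] + ft.vals[3] + ft.vals[4]
end

Since every element now has the same concrete type, a function like total is compiled once and reused for all of them.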


This is a bit orthogonal, but why have an outer tuple with Vararg elements of that size instead of a Vector? In my experience, such large tuples are rarely worth it, especially if their usage ends up type unstable.


You want the field to be a concrete type; perhaps you can make a parametric struct:

struct MetaTuple{N}
    t::NTuple{N, Int}
end
function MetaTuple(t...)
    N = length(t)
    N > 4 && error("Blah")
    return MetaTuple{N}(NTuple{N, Int}(t))
end

It’s a great question, and the answer is for memory purposes. It’s all tuples because that’s the only way I can keep them all in RAM!

I’m not sure I follow. Vectors are also stored in RAM. Do you mean heap vs. stack? If so, the type instability in your struct will force the tuples to be placed on the heap as well (not that this will matter for performance - access to both is equally fast; the only difference is whether or not the compiler can constant-fold accesses to tuples away, which it can’t here due to the type instability).
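
A quick way to see that instability, using the MetaTuple definition above:

# The declared field type admits tuples of any length, so it is not concrete;
# the stored tuple is therefore heap-allocated and reached through a pointer.
isconcretetype(fieldtype(MetaTuple, :data))  # false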


Thanks! Your suggestion doesn’t work as a drop-in replacement, but it does point me toward the pieces I was missing. I haven’t yet figured out parametric types, so I’ll give this a look!

It’s specifically the size of a vector vs the size of a tuple.

struct MetaTuple
    data::Tuple{Vararg{Union{Tuple{}, Tuple{Int16}, Tuple{Int16, Int16}, Tuple{Int16, Int16, Int16}, Tuple{Int16, Int16, Int16, Int16}}}}
end
struct VectorTuple
    data::Vector{Union{Tuple{}, Tuple{Int16}, Tuple{Int16, Int16}, Tuple{Int16, Int16, Int16}, Tuple{Int16, Int16, Int16, Int16}}}
end

function fakeData()
    ntups = rand(500:1500)
    tlengths = rand(0:4, ntups)
    data = [tlen == 0 ? () : Int16.(Tuple(rand(1:ntups, tlen))) for tlen in tlengths]
    return (MetaTuple(Tuple(data)), VectorTuple(data))
end

(mt, vt) = fakeData()
println("Tuple: ", Base.summarysize(mt))
println("Vector: ", Base.summarysize(vt))

results in:

Tuple: 4154
Vector: 9156

For my actual use case, this makes the difference between about 30GB of data vs about 60GB.

(this is also the reason I’m using Int16s throughout)

I think you’re seeing a lot of duplication/increase in size due to all of those tuples needing to be boxed - they can’t be stored inline in the array since they’re all different sizes. In contrast, the vararg-tuple version does store a tag with each tuple. So I think a type stable version with a uniform element type can possibly beat both in terms of size (since there’s no type tag for each element to be stored) while likely being faster performance wise (due to not having to pointer chase and all things being close together in memory).
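
For example, a sketch of what such a uniform element type could look like (PackedTuple is an illustrative name; whether it actually wins on size depends on how long the inner tuples typically are):

struct PackedTuple
    len::Int8                # how many of the four slots are in use
    vals::NTuple{4, Int16}   # fixed-size payload, zero-padded past len
end

isbitstype(PackedTuple)      # true: a Vector{PackedTuple} stores its elements inline,
                             # with no per-element boxing

(mt, vt) = fakeData()
pt = [PackedTuple(Int8(length(t)),
                  ntuple(i -> i <= length(t) ? t[i] : Int16(0), 4)) for t in vt.data]
Base.summarysize(pt)         # compare against the Tuple/Vector numbers above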


Thanks, this is a very helpful comment as well!

Thanks again, @Sukera and @gustaphe!

Both of your answers were very helpful for me to think through the problem. I ended up with a great solution and reduced my deserialization time to 30s (an 80-fold reduction in startup time).

It was not true that I could beat both performance and size using a vector of uniform structs, but that’s because the vast majority of my tuples are actually size one or two (not mentioned in the original question), so it’s a lot of wasted space to have room for four values in each of them. By changing to a struct that packs uniformly into a vector, and nothing else, I achieved my speed target but ended up paging RAM. Luckily, there was some other regularity in the data that I was able to exploit to get the size back down pretty close to where it started, though, and the speedup is more than worth the new complexity.
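
For reference, one generic way to avoid that per-tuple padding waste is to keep the lengths and the concatenated values in two plain vectors, much like the FlatStruct from the question but with Vector fields (a sketch only, not the regularity-based trick described above):

struct PackedMeta
    lengths::Vector{Int8}    # length of each inner tuple (0-4)
    values::Vector{Int16}    # all inner-tuple entries, concatenated in order
end

PackedMeta(mt::MetaTuple) = PackedMeta(
    Int8[length(t) for t in mt.data],
    Int16[x for t in mt.data for x in t],
)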

It’s worth noting that although I found a solution to my underlying problem, the question as asked remains open. Luckily I was able to squeeze some more compression out of my data, and hopefully no future person finds themselves in quite the same predicament.

Cheers!
