# Advice on over- vs. proper use of type parameters

I am writing an immutable type in Julia and have realized while working on it that I have an opportunity to go a little crazy with the type parameters. Here’s a contrived example that deals with the same problem. I’m curious about advice for when this kind of use of parametric types is worth it and when it might be a bad idea. My own intuitions on the problem, for what they are worth, are below.

This question is a good resource but only a partial answer.

Problem. I want to write an efficient immutable type `StringTrie <: AbstractSet{String}` that implements a set of strings as a trie. A simple implementation might be like this:

``````struct TrieNode
endsword::Bool
children::Tuple
end
function TrieNode(s::String)
if length(s) == 0
return TrieNode(true, ())
else
return TrieNode(false, (s[1] => TrieNode(s[2:end]),))
end
end
function _find(n::TrieNode, c::Char)
for (ii,(k,v)) in enumerate(n.children)
k == c && return ii
end
return 0
end
Base.in(s::String, n::TrieNode) = begin
length(s) == 0 && return n.endsword
(c,s) = (s[1], s[2:end])
ii = _find(n, c)
return ii == 0 ? false : in(s, n.children[ii][2])
end
push(n::TrieNode, s::String) = begin
if length(s) == 0
return n.endsword ? n : TrieNode(true, n.children)
else
(c,s) = (s[1], s[2:end])
ii = _find(n, c)
if ii == 0
return TrieNode(n.endsword, (n.children..., c => TrieNode(s)))
else
(k,v) = n.children[ii]
new_v = push(v, s)
new_v === v && return n
ch = n.children
new_ch = (ch[1:ii-1]..., k=>new_v, ch[ii+1:end]...)
return TrieNode(n.endsword, new_ch)
end
end
end

struct StringTrie <: AbstractSet{String}
n::Int
root::Union{TrieNode,Nothing}
end
StringTrie() = StringTrie(0, nothing)
StringTrie(s::Vararg{String,N}) where {N} = reduce(push, s, init=StringTrie())
Base.length(t::StringTrie) = t.n
Base.in(s::String, t::StringTrie) = in(s, t.root)
push(t::StringTrie, s::String) = begin
t.root === nothing && return StringTrie(1, TrieNode(s))
new_root = push(t.root, s)
return new_root === t.root ? t : StringTrie(t.n+1, new_root)
end
``````

A simple test confirms this mostly works:

``````strings = ["foo", "bar", "baz", "qux", "qu", "ba", "foobar", ""]
st = StringTrie(strings...)
([s in st for s in strings], [reverse(s) in st for s in strings])
#[Out] := (Bool[1, 1, 1, 1, 1, 1, 1, 1], Bool[0, 0, 0, 0, 0, 0, 0, 1])
``````

An alternative to this method might be to lean really further into the type system in order to give the compiler lots of information. For this, one could go with the following code.

``````struct TrieNode{T <: Tuple}
endsword::Bool
children::T
end
function TrieNode(endsword::Bool, children::T) where {T<:Tuple}
return TrieNode{T}(endsword, children)
end
``````

The rest of this implementation is actually almost identical to the original, but it gives the compiler all the type information it could possibly want. As a consequence, however, the `typeof(trie.root)` for even a small trie becomes something awful:

``````strings = ["foo", "bar", "baz", "qux", "qu", "ba", "foobar", ""]
st = StringTrie(strings...)
typeof(st.root)
# [Out] := TrieNode{Tuple{Pair{Char,TrieNode{Tuple{Pair{Char,TrieNode{Tuple{Pair{Char,TrieNode{Tuple{Pair{Char,TrieNode{Tuple{Pair{Char,TrieNode{Tuple{Pair{Char,TrieNode{Tuple{}}}}}}}}}}}}}}}}}},Pair{Char,TrieNode{Tuple{Pair{Char,TrieNode{Tuple{Pair{Char,TrieNode{Tuple{}}},Pair{Char,TrieNode{Tuple{}}}}}}}}},Pair{Char,TrieNode{Tuple{Pair{Char,TrieNode{Tuple{Pair{Char,TrieNode{Tuple{}}}}}}}}}}}
``````

Note, however, that the API of `StringTrie` remains unharmed by this, so the problem is only potentially a performance one.

My understanding is that this particular implementation of `TrieNode` should have a completely flat profile of `Char`s and `Bool`s in memory with none of the data hierarchy explicitly needed, which is a big plus. However, it’s unclear to me how much compile-time I’m sacrificing by going this route. A trie like this with thousands of entries seems untenable, instinctively.

I also suspect that with some clever technomancy one could come up with similar examples where the extent to which the code leans on the compiler is less than in the last case but more than in the first two. What’s the ideal to aim for here? Also, what is the best way to benchmark? (Does `@benchmark` include/ignore compile time? Can they be benchmarked separately?). If I have code that generates types everywhere, will the llvm eventually garbage collect them or will they build up indefinitely?

1 Like

Very interesting question! I have a partial answer only, but maybe helpful:

My experience with chained types shows that you can get “interactively acceptable” compile times for up to a few hundred total type parameters. E.g. This test of Plugins.jl creates a chained list of 100 types, where each has 4 parameters (one of them links to the next item), and using that it merges 100 small functions into a single one. All of this is done in ~0.67 secs on my machine. For length 1000 it needs ~18 secs.

If the compilation runs only once, as usual, then `@benchmark` will not be useful. But it seems possible to force compilation using Val types in every round, and get out usable data. For a rough decision about usability, `@time` is good enough in my opinion.

As I remember, there is no garbage collection of compiled code and types, but recently I tried to search for this info and found nothing.

1 Like

I think, you can avoid excessively long type signatures if you use linked lists or BSTs to store children.

Should be something like

``````struct List{T}
data::T
next::Union{List{T}, Nothing}
end

struct TrieNode
endsword::Bool
children::Union{List{Pair{Char, TrieNode}}, Nothing}
end
``````

The performance will not be as good as with tuples, I think, but should compile faster.

1 Like