String pool / indirect array analogue to make a struct isbits?

I have a program which ingests JSON describing a list of objects. For the sake of example, let’s say each object is a Cartesian point with a name:

struct NamedPoint
    x::Float64
    y::Float64
    name::String
end

The actual use case is more complicated and not relevant here; the relevant feature is the pattern of bits-type fields like Float64 plus a String. In practice, for data with millions of input points, there are only ~10 distinct names. So it seems a bit of a waste to have every NamedPoint contain a String, which also means it is not isbits. Instead, each NamedPoint should be able to just contain a small integer indicating which name it has.
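To make the isbits point concrete, here is a minimal sketch comparing the two layouts. The type names PointWithString and PointWithId are made up for this illustration and are not part of the real code.

```julia
# Hypothetical types, for illustration only.
struct PointWithString
    x::Float64
    y::Float64
    name::String      # a String field is a reference to heap data
end

struct PointWithId
    x::Float64
    y::Float64
    name_id::UInt8    # small integer standing in for the name
end

# isbitstype reports whether every field is plain bits data.
string_is_bits = isbitstype(PointWithString)
id_is_bits = isbitstype(PointWithId)
println((string_is_bits, id_is_bits))
```

The first check should come out false and the second true, which is the property the question is after.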

This seems intuitively similar to the idea behind PooledArrays and IndirectArrays. But I don’t want an indirect or pooled array of just names; the code uses (and is way more readable with) NamedPoint objects.

My thought now is to just collect the unique names seen when parsing a JSON and build a dict UInt8 => name at parse time. Then each NamedPoint can have a UInt8 instead of a String name, and will be isbits. Does a package implementing this better than “roll your own” already exist? Is there a different recommended solution? I don’t love this mapping dictionary approach, since the mapping has to be passed around to helper functions and a NamedPoint is no longer meaningful without the associated name mapping dictionary.

I welcome any advice on how to maximize performance here while keeping the code well readable and maintainable. Thanks!

Do you know the set of possible names beforehand? If so, you could use an Enum, since they’re isbits. Under the hood, the @enum macro actually defines a primitive type.
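As a rough sketch of the Enum idea, assuming the names were known up front: the enum PointName, its members red/green/blue, and the struct EnumPoint are all hypothetical stand-ins.

```julia
# Hypothetical fixed set of names, known in advance.
@enum PointName red green blue

struct EnumPoint
    x::Float64
    y::Float64
    name::PointName   # enum values are isbits
end

# Lookup table from the JSON string to the enum value, built once.
const NAME_LOOKUP = Dict(string(n) => n for n in instances(PointName))

p = EnumPoint(1.0, 2.0, NAME_LOOKUP["green"])
println(isbitstype(EnumPoint))   # the struct stays isbits
println(string(p.name))          # recover the name as a String
```

The struct keeps a readable name field, and string(p.name) or Symbol(p.name) gets the text back without a user-managed mapping.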

If you don’t know the set of possible names beforehand, perhaps you could repurpose the Base.@enum code so that it can accept a vector of strings, so you could do something like the following:

names = ["A", "B"]
@myenum names

Good question; I should have put this in the top post:

  • Each input JSON to be processed has ~10-20 names
  • The set of possible names is not known in advance, i.e. any JSON could contain a never-before-seen name.
  • However, the set of possible names is fairly small, maybe ~100 elements.

I’m a bit confused about the @myenum definition for unknown names. That’d dynamically define a new type each time a new JSON is processed, right? (I should clarify that for this use case, a long-running (hopefully) Julia server accepts many JSON requests.)

Ah, ok. I was thinking about processing just one big JSON. Yes, that would create a new type each time you call @myenum. Although I suppose I got the syntax above wrong. It would also need to include a name for the type:

@myenum typename names
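For what it’s worth, something close to the hypothetical @myenum can be approximated without writing a new macro, by splicing the symbols into a regular @enum call with @eval. This is only a sketch: DynName and name_list are made-up names, it runs at top level, and it still defines a brand-new type (plus one global constant per name) every time it is evaluated.

```julia
# Hypothetical input; in practice this would come from the parsed JSON.
name_list = ["A", "B"]
syms = Symbol.(name_list)

# Splice the symbols into an ordinary @enum call and evaluate it at top level.
# This defines the type DynName and the constants A and B in the current module.
@eval @enum DynName $(syms...)

println(instances(DynName))
```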

Interestingly, it turns out that when you define an Enum, it adds a namemap method inside the Base.Enums module:

julia> @enum Fruit apple orange kiwi

julia> methods(Base.Enums.namemap)
# 31 methods for generic function "namemap":
[1] namemap(::Type{Base.MPFR.MPFRRoundingMode}) in Base.MPFR at Enums.jl:189
[2] namemap(::Type{LibGit2.Consts.OBJECT}) in LibGit2.Consts at Enums.jl:189
# Entries 3 - 30 omitted for brevity.
[31] namemap(::Type{Fruit}) in Main at Enums.jl:189

julia> Base.Enums.namemap(Fruit)
Dict{Int32,Symbol} with 3 entries:
  0 => :apple
  2 => :kiwi
  1 => :orange

So there’s still a dictionary involved, it’s just hidden away in a module.
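Since Base.Enums.namemap is an internal function, a similar integer-to-name table can be built from exported functions instead. A small sketch, reusing the Fruit enum from above; the variable name_of is just an illustrative name.

```julia
@enum Fruit apple orange kiwi

# instances, Int and Symbol are all part of the public Enum interface.
name_of = Dict(Int(f) => Symbol(f) for f in instances(Fruit))

println(name_of[2])   # symbol for the enum value with integer code 2
println(Fruit(1))     # and back from an integer code to the enum value
```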

Upon further reflection, my @enum suggestion probably wouldn’t be very useful anyways, since @enum has to be used at top-level:

julia> function bar()
           @enum Fruit apple orange kiwi
       end
ERROR: syntax: "toplevel" expression not at top level

And the same would probably be true for any macro that is creating new types.

Perhaps this is exactly what you don’t want, but here’s a mapping approach anyway. While it’s true that a NamedPoint name isn’t meaningful without the mapping, overloading getproperty makes this distinction almost completely invisible, I would say. The only thing I would be careful about is saving these to a file: make sure any such code goes through getproperty (i.e. p.name) rather than calling getfield explicitly, since getfield returns the integer and not the name.

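# NamedPoint is assumed to store the integer id in a field `s`:
struct NamedPoint
    x::Float64
    y::Float64
    s::Int
end

const NAME_TO_INT = Dict{String, Int}()
const INT_TO_NAME = Vector{String}()

# wherever you read from the json:
x, y, s = 1.0, 2.0, "A"   # placeholder for one parsed JSON record
if !haskey(NAME_TO_INT, s)
    # first time this name is seen: give it the next free index
    NAME_TO_INT[s] = length(NAME_TO_INT) + 1
    push!(INT_TO_NAME, s)
end

p = NamedPoint(x, y, NAME_TO_INT[s])

# to map back, either define an interface
name(p::NamedPoint) = INT_TO_NAME[getfield(p, :s)]

# or just overload getproperty so that p.name returns the String
Base.getproperty(p::NamedPoint, s::Symbol) = s === :name ? INT_TO_NAME[getfield(p, :s)] : getfield(p, s)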
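Putting the pieces together, here is a slightly more packaged sketch of the same idea, using a UInt8 id as in the original question. PooledPoint, NAME_IDS, NAME_POOL and intern! are names invented for this example. It also assumes a single task touches the pool; a multi-threaded server would probably need a lock around intern!.

```julia
# Hypothetical names throughout; a sketch, not a finished design.
struct PooledPoint
    x::Float64
    y::Float64
    id::UInt8            # index into NAME_POOL; room for up to 255 names
end

const NAME_IDS  = Dict{String, UInt8}()   # name => id
const NAME_POOL = String[]                # id => name

# Return the id for `name`, registering the name the first time it is seen.
function intern!(name::AbstractString)
    key = String(name)
    get!(NAME_IDS, key) do
        length(NAME_POOL) < typemax(UInt8) || error("name pool is full")
        push!(NAME_POOL, key)
        UInt8(length(NAME_POOL))
    end
end

# Convenience constructor so call sites can keep passing a String.
PooledPoint(x, y, name::AbstractString) = PooledPoint(x, y, intern!(name))

# p.name looks the String up in the pool; other fields pass through.
function Base.getproperty(p::PooledPoint, s::Symbol)
    s === :name ? NAME_POOL[getfield(p, :id)] : getfield(p, s)
end
Base.propertynames(::PooledPoint) = (:x, :y, :name)

a = PooledPoint(1.0, 2.0, "A")
b = PooledPoint(3.0, 4.0, "B")
c = PooledPoint(5.0, 6.0, "A")   # reuses the id already assigned to "A"

println(isbitstype(PooledPoint))
println((a.name, b.name, c.name))
```

Helper functions can keep writing p.name and never see the mapping. The trade-off is that the pool is global state, so the ids are only meaningful within the process that assigned them.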