I have a program which ingests JSON describing a list of objects. For the sake of example, let’s say each object is a Cartesian point with a name:
struct NamedPoint
    x::Float64
    y::Float64
    name::String
end
The actual use case is more complicated and not relevant here; the relevant feature is the pattern of bits-type fields like Float64 plus a String. In practice, for data with millions of input points, there are only ~10 distinct names. So it seems a bit of a waste to have every NamedPoint contain a String and thus not be isbits. Instead, each NamedPoint should be able to just contain a small integer indicating which name it has.
This seems intuitively similar to the idea behind PooledArrays and IndirectArrays. But I don’t want an indirect or pooled array of just names; the code uses (and is way more readable with) NamedPoint objects.
My thought now is to just collect the unique names seen when parsing a JSON document and build a dict UInt8 => name at parse time. Then each NamedPoint can have a UInt8 instead of a String name, and will be isbits. Does a package implementing this better than “roll your own” already exist? Is there a different recommended solution? I don’t love this mapping dictionary approach, since the mapping has to be passed around to helper functions and a NamedPoint is no longer meaningful without the associated name mapping dictionary.
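To make the idea concrete, here is a minimal sketch of the "roll your own" version. The names NamePool, intern!, PooledPoint and name_id are made up for illustration, and the sketch assumes at most 255 distinct names per pool so that the id fits in a UInt8:

```julia
# Hypothetical helper types; none of these names come from an existing package.
struct NamePool
    ids::Dict{String,UInt8}   # name -> small integer id
    names::Vector{String}     # id -> name (ids are 1-based indices)
end
NamePool() = NamePool(Dict{String,UInt8}(), String[])

# Return the id for `name`, registering the name the first time it is seen.
function intern!(pool::NamePool, name::AbstractString)
    get!(pool.ids, name) do
        length(pool.names) < typemax(UInt8) || error("more than 255 distinct names")
        push!(pool.names, name)
        UInt8(length(pool.names))
    end
end

# Same layout as NamedPoint, but with an integer id instead of a String.
struct PooledPoint
    x::Float64
    y::Float64
    name_id::UInt8
end

# Looking up the name requires the pool, which is the drawback described above.
name(pool::NamePool, p::PooledPoint) = pool.names[p.name_id]

pool = NamePool()
pts = [PooledPoint(1.0, 2.0, intern!(pool, "a")),
       PooledPoint(3.0, 4.0, intern!(pool, "b")),
       PooledPoint(5.0, 6.0, intern!(pool, "a"))]
```

With this layout isbitstype(PooledPoint) is true and repeated names share one id, but every helper that needs the name must also receive the pool.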
I welcome any advice on how to maximize performance here while keeping the code readable and maintainable. Thanks!
Do you know the set of possible names beforehand? If so, you could use an Enum, since they’re isbits. Under the hood, the @enum macro actually defines a primitive type.
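As a small sketch of the known-in-advance case (the type names PointName and EnumPoint and the members depot, store and office are invented for the example), a struct with an enum field stays isbits:

```julia
# Hypothetical fixed set of names, known before any JSON is parsed.
@enum PointName depot store office

struct EnumPoint
    x::Float64
    y::Float64
    name::PointName
end

# Lookup table for converting the strings found in the JSON into enum values.
const NAME_LOOKUP = Dict(string(n) => n for n in instances(PointName))

p = EnumPoint(1.0, 2.0, NAME_LOOKUP["store"])

isbitstype(PointName)   # the enum itself is a bits type
isbitstype(EnumPoint)   # and so is a struct that contains it
string(p.name)          # recovers the name without any external mapping
```

The mapping back to a name travels with the type here, so nothing extra has to be passed to helper functions.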
If you don’t know the set of possible names beforehand, perhaps you could repurpose the Base.@enum code so that it can accept a vector of strings, so you could do something like the following:
names = ["A", "B"]
@myenum names
Good question; I should have put this in the top post:
- Each input JSON to be processed has ~10-20 names
- The set of possible names is not known in advance, i.e. any JSON could contain a never-before-seen name.
- However, the set of possible names is fairly small, maybe ~100 elements.
I’m a bit confused about the @myenum definition for unknown names. That’d dynamically define a new type each time a new JSON is processed, right? (I should clarify that for this use case, a long-running (hopefully) Julia server accepts many JSON requests.)
Ah, ok. I was thinking about processing just one big JSON. Yes, that would create a new type each time you call @myenum. Although I suppose I got the syntax above wrong. It would also need to include a name for the type:
@myenum typename names
Interestingly, it turns out that when you define an Enum, it adds a namemap method inside the Base.Enums module:
julia> @enum Fruit apple orange kiwi
julia> methods(Base.Enums.namemap)
# 31 methods for generic function "namemap":
[1] namemap(::Type{Base.MPFR.MPFRRoundingMode}) in Base.MPFR at Enums.jl:189
[2] namemap(::Type{LibGit2.Consts.OBJECT}) in LibGit2.Consts at Enums.jl:189
# Entries 3 - 30 omitted for brevity.
[31] namemap(::Type{Fruit}) in Main at Enums.jl:189
julia> Base.Enums.namemap(Fruit)
Dict{Int32,Symbol} with 3 entries:
0 => :apple
2 => :kiwi
1 => :orange
So there’s still a dictionary involved, it’s just hidden away in a module.
Upon further reflection, my @enum suggestion probably wouldn’t be very useful anyway, since @enum has to be used at top level:
julia> function bar()
           @enum Fruit apple orange kiwi
       end
ERROR: syntax: "toplevel" expression not at top level
And the same would probably be true for any macro that is creating new types.
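For completeness, one way around the top-level restriction is to go through @eval, which evaluates the expression at module top level even when called from inside a function. The sketch below is only an illustration: define_enum and Label are made-up names, each call still defines a brand-new type plus one global constant per member, calling it again with the same names would fail with a redefinition error, and code compiled before the call may not see the new type because of world-age rules. So it does not make the approach a good fit for a long-running server.

```julia
# Hypothetical helper: define an enum type from names only known at run time.
function define_enum(tname::Symbol, labels::Vector{String})
    members = Symbol.(labels)
    # @eval evaluates the @enum expression at module top level, so this is
    # accepted inside a function even though a bare @enum is not.
    @eval @enum $tname $(members...)
    return nothing
end

# Called from top level; afterwards Label, alpha and beta are global constants.
define_enum(:Label, ["alpha", "beta"])

isbitstype(Label)           # still a bits type
Base.Enums.namemap(Label)   # the hidden value => name dictionary
```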
Perhaps this is exactly what you don’t want, but here’s a mapping approach anyway. While it’s true that a NamedPoint name isn’t meaningful without the mapping, overloading getproperty makes this distinction completely invisible, I would say. The only thing I would be careful of is when saving these to a file: make sure any code uses getproperty rather than explicitly getfield.
const NAME_TO_INT = Dict{String, Int}()
const INT_TO_NAME = Vector{String}()

# NamedPoint is assumed here to store the integer index in a field `s`
# (e.g. `s::Int`) in place of `name::String`.

# wherever you read from the JSON, with x, y, s taken from one record:
if !haskey(NAME_TO_INT, s)
    NAME_TO_INT[s] = length(NAME_TO_INT) + 1
    push!(INT_TO_NAME, s)
end
p = NamedPoint(x, y, NAME_TO_INT[s])

# to map back, either define an interface
name(p::NamedPoint) = INT_TO_NAME[getfield(p, :s)]

# or just overload getproperty (getfield avoids going back through the overload)
Base.getproperty(p::NamedPoint, f::Symbol) = f === :name ? INT_TO_NAME[getfield(p, :s)] : getfield(p, f)
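Putting the pieces above together, a self-contained sketch could look like this. The field name s, the helper intern_name, the convenience constructor and the propertynames overload are assumptions added for the example rather than part of the snippet above, and the global tables are not thread-safe, which may matter in a server handling concurrent requests:

```julia
struct NamedPoint
    x::Float64
    y::Float64
    s::Int          # index into INT_TO_NAME; assumed field name
end

const NAME_TO_INT = Dict{String,Int}()
const INT_TO_NAME = String[]

# Hypothetical helper: return the index for `s`, registering it if it is new.
function intern_name(s::AbstractString)
    get!(NAME_TO_INT, s) do
        push!(INT_TO_NAME, s)
        length(INT_TO_NAME)
    end
end

# Convenience constructor so callers can keep passing the name as a string.
NamedPoint(x, y, name::AbstractString) = NamedPoint(x, y, intern_name(name))

# `p.name` returns the string; every other property falls through to the field.
function Base.getproperty(p::NamedPoint, f::Symbol)
    f === :name ? INT_TO_NAME[getfield(p, :s)] : getfield(p, f)
end

# Advertise `name` instead of the raw index in tab completion and similar tools.
Base.propertynames(::NamedPoint) = (:x, :y, :name)

p = NamedPoint(1.0, 2.0, "home")
q = NamedPoint(3.0, 4.0, "home")
p.name                                 # the original string
getfield(p, :s) == getfield(q, :s)     # both points share one index
```

Existing code that reads p.name keeps working unchanged, while NamedPoint itself stays isbits.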