String pool / indirect array analogue to make a struct isbits?

evanfields · July 22, 2020, 4:31pm

I have a program which ingests JSON describing a list of objects. For the sake of example, let’s say each object is a Cartesian point with a name:

struct NamedPoint
    x::Float64
    y::Float64
    name::String
end

The actual use case is more complicated and not relevant here; the relevant feature is the pattern of bits fields like Float64 + a String. In practice, for data with millions of input points, there are only ~10 distinct names. So it seems a bit of a waste to have every NamedPoint contain a String - and thus not be isbits. Instead, each NamedPoint should be able to just contain a small integer indicating which name it has.

This seems intuitively similar to the idea behind PooledArrays and IndirectArrays. But I don’t want an indirect or pooled array of just names; the code uses (and is way more readable with) NamedPoint objects.

My thought now is to just collect the unique names seen when parsing a JSON and build a dict UInt8 => name at parse time. Then each NamedPoint can have a UInt8 instead of a String name, and will be isbits. Does a package implementing this better than “roll your own” already exist? Is there a different recommended solution? I don’t love this mapping dictionary approach, since the mapping has to be passed around to helper functions and a NamedPoint is no longer meaningful without the associated name mapping dictionary.
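A minimal sketch of the roll-your-own idea, for concreteness. The names `PooledPoint`, `NamePool`, `intern!`, and `name_of` are hypothetical and not taken from any package, and the sample records stand in for parsed JSON:

```julia
# Point that stores a small integer id instead of a String.
struct PooledPoint
    x::Float64
    y::Float64
    name_id::UInt8   # index into a NamePool, not the name itself
end

# Two-way mapping between names and ids (hypothetical helper type).
struct NamePool
    ids::Dict{String,UInt8}   # name => id
    names::Vector{String}     # id => name
end
NamePool() = NamePool(Dict{String,UInt8}(), String[])

# Return the id for `name`, assigning the next free id the first time it is seen.
function intern!(pool::NamePool, name::AbstractString)
    return get!(pool.ids, name) do
        length(pool.names) < typemax(UInt8) || error("more than 255 distinct names")
        push!(pool.names, name)
        UInt8(length(pool.names))
    end
end

# Look the name back up; the pool has to be passed in explicitly.
name_of(pool::NamePool, p::PooledPoint) = pool.names[p.name_id]

# Stand-in for records parsed from one JSON input.
records = [(1.0, 2.0, "A"), (3.0, 4.0, "B"), (5.0, 6.0, "A")]
pool = NamePool()
points = [PooledPoint(x, y, intern!(pool, s)) for (x, y, s) in records]
```

Here `isbitstype(PooledPoint)` holds, but `name_of` needs the pool as an argument, which is exactly the downside described above.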

I welcome any advice on how to maximize performance here while keeping the code well readable and maintainable. Thanks!

CameronBieganek · July 22, 2020, 5:06pm

Do you know the set of possible names beforehand? If so, you could use an Enum, since they’re isbits. Under the hood, the @enum macro actually defines a primitive type.
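As a rough illustration, assuming the names are known up front (the enum `PointName` and its members are made up for this example):

```julia
# Must be evaluated at top level, like any @enum definition.
@enum PointName depot customer hub

struct EnumPoint
    x::Float64
    y::Float64
    name::PointName   # stored inline as a small integer
end

p = EnumPoint(1.0, 2.0, customer)

# The struct is isbits because every field is.
is_bits = isbitstype(EnumPoint)

# The enum member prints as its name, so the label is still recoverable.
label = string(p.name)
```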

If you don’t know the set of possible names beforehand, perhaps you could repurpose the Base.@enum code so that it can accept a vector of strings, so you could do something like the following:

names = ["A", "B"]
@myenum names

evanfields · July 22, 2020, 5:12pm

Good question; I should have put this in the top post:

Each input JSON to be processed has ~10-20 names
The set of possible names is not known in advance, i.e. any JSON could contain a never-before-seen name.
However, the set of possible names is fairly small, maybe ~100 elements.

I’m a bit confused about the @myenum definition for unknown names. That’d dynamically define a new type each time a new JSON is processed, right? (I should clarify that for this use case, a long-running (hopefully) Julia server accepts many JSON requests.)

CameronBieganek · July 22, 2020, 5:27pm

Ah, ok. I was thinking about processing just one big JSON. Yes, that would create a new type each time you call @myenum. Although I suppose I got the syntax above wrong. It would also need to include a name for the type:

@myenum typename names

Interestingly, it turns out that when you define an Enum, it adds a namemap method inside the Base.Enums module:

julia> @enum Fruit apple orange kiwi

julia> methods(Base.Enums.namemap)
# 31 methods for generic function "namemap":
[1] namemap(::Type{Base.MPFR.MPFRRoundingMode}) in Base.MPFR at Enums.jl:189
[2] namemap(::Type{LibGit2.Consts.OBJECT}) in LibGit2.Consts at Enums.jl:189
# Entries 3 - 30 omitted for brevity.
[31] namemap(::Type{Fruit}) in Main at Enums.jl:189

julia> Base.Enums.namemap(Fruit)
Dict{Int32,Symbol} with 3 entries:
  0 => :apple
  2 => :kiwi
  1 => :orange

So there’s still a dictionary involved, it’s just hidden away in a module.
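Since `Base.Enums.namemap` is not exported, here is a sketch of building the same kind of mapping from exported functions only (`instances`, `Int`, `Symbol`, `string`). The variable names are just for illustration:

```julia
@enum Fruit apple orange kiwi

# id => name, equivalent in content to the namemap dictionary shown above
id_to_name = Dict(Int(f) => Symbol(f) for f in instances(Fruit))

# name => enum value, for turning a parsed string back into the enum
name_to_fruit = Dict(string(f) => f for f in instances(Fruit))
```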

CameronBieganek · July 22, 2020, 6:30pm

Upon further reflection, my @enum suggestion probably wouldn’t be very useful anyways, since @enum has to be used at top-level:

julia> function bar()
           @enum Fruit apple orange kiwi
       end
ERROR: syntax: "toplevel" expression not at top level

And the same would probably be true for any macro that is creating new types.

tomerarnon · July 23, 2020, 10:09pm

Perhaps this is exactly what you don’t want, but here’s a mapping approach anyway. While it’s true that a NamedPoint’s stored integer isn’t meaningful without the mapping, overloading getproperty makes this distinction essentially invisible to calling code, I would say. The only thing I would be careful of is saving these to a file: make sure any code that does so goes through getproperty rather than calling getfield explicitly, otherwise it will write the integer instead of the name.

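const NAME_TO_INT = Dict{String, Int}()
const INT_TO_NAME = Vector{String}()

# the struct now stores an integer index in field `s` instead of a String
struct NamedPoint
    x::Float64
    y::Float64
    s::Int
end

# wherever you read from the json:
x, y, s = 1.0, 2.0, "A" # placeholder values standing in for one record of the json
if !haskey(NAME_TO_INT, s)
    NAME_TO_INT[s] = length(NAME_TO_INT)+1
    push!(INT_TO_NAME, s)
end

p = NamedPoint(x, y, NAME_TO_INT[s])

# to map back, either define an interface
name(p::NamedPoint) = INT_TO_NAME[getfield(p, :s)]

# or just overload getproperty (getfield avoids re-entering getproperty)
Base.getproperty(p::NamedPoint, s::Symbol) = s === :name ? INT_TO_NAME[getfield(p, :s)] : getfield(p, s)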
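
A self-contained variant of the same idea, as a sketch only. It assumes a global pool is acceptable, moves the name-to-integer step into a convenience constructor, and also overloads `propertynames` so that `:name` shows up in place of the raw field. `IdPoint`, `NAMES`, and `NAME_IDS` are hypothetical names:

```julia
const NAMES = String[]                 # id => name
const NAME_IDS = Dict{String,Int}()    # name => id

struct IdPoint
    x::Float64
    y::Float64
    id::Int
end

# Convenience constructor: accepts the name and stores its id.
function IdPoint(x, y, name::AbstractString)
    id = get!(NAME_IDS, name) do
        push!(NAMES, name)
        length(NAMES)
    end
    return IdPoint(x, y, id)
end

# `p.name` looks the string up; every other property falls through to the field.
function Base.getproperty(p::IdPoint, s::Symbol)
    return s === :name ? NAMES[getfield(p, :id)] : getfield(p, s)
end

# Advertise `name` rather than the internal `id` field.
Base.propertynames(::IdPoint) = (:x, :y, :name)

p = IdPoint(1.0, 2.0, "A")
q = IdPoint(3.0, 4.0, "B")
r = IdPoint(5.0, 6.0, "A")
```

Note that a plain global pool like this is not synchronized, so concurrent requests would need a lock or a per-request pool.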
