Coimbination of categories / nominal features encoding

There is CategoricalArray that supports only mutually exclusive categories (you can have only one caterory from a dictionary).

But what if I need to store a combinations of several binary features, say:

  • “Visible”: true, false,
  • “Clicked”: true, false,
  • “Moved”: true, false,

Probably the most compact way (for a total number of features less than 64) would be to encode them into bit positions of UInt64 mask.

Then, each binary feature associates with its bit position, say:

  • “Visible”: 0b00001
  • “Moved”: 0b00010
  • “Clicked”: 0b00100

And the check is simple - bitwise & with feature mask:

  • check if x is “Visible”: x & 0b00001 > 0
  • check if x is “Moved”: x & 0b00010 > 0
  • check if x is “Clicked”: x & 0b00100 > 0

Further, what if I have not only binary features, but features with three and more values?
Then each feature itself is a mutually exclusive category with its own dictionary:

  • “Direction”: “top”, “down”, “left”, “right”

To encode them into bit positions of UInt64 mask along with three previous binary features, we need to provide it with its own bit range, non-overlapping with other features (external mask):

  • “Direction”: 0b01111000

And with value masks inside that external mask for each feature value:

  • “top”: 0b00001000
  • “down”: 0b00010000
  • “left”: 0b00100000
  • “right”: 0b01000000

To check for that feature values, we should first apply external mask, then compare with value mask:

  • check if x has feature “Direction” set to “top”: x & 0b01111000 == 0b00001000
  • check if x has feature “Direction” set to “down”: x & 0b01111000 == 0b00010000
    etc.

Looking back to binary features, they just have external mask equal to value mask.

That way, we can store a combination of features with two or more values inside one 64-bit number.

The question is: how should I organize data structures around this kind of categorical encoding, to be able to check and set individual feature values?

There is an inherent trade-off between data size and access speed (similar to BitVector vs Vector{Bool}).

If you want to use bitflags, and need generic code for this, you could experiment with a type like

struct FlagContainer{T}
    flags::T
end

where T is a Tuple of keys, eg (:top, :down, :left, :right), and clever code uses this to figure out the bit masks, possibly using Base.pure (see examples for the NamedTuple code in Base).

This should be a fun package to write.

2 Likes

This would be a nice package to have around. I’d love to see how others implement this too.

Did you mean @pure macro? Any references to that examples?

Actually I thougth of using named tuples with names for categories and values for masks.
It is unclear for now, how would I encode toogether:
a) binary features, that only have their own name and a mask, and
b) multi-valued features that have their own name, external mask, and also a set of value names and sub-masks.

Yes, that’s a typo, I mean Base.@pure.

I agree that implementing this well is not trivial, I would recommend that you experiment with something simple (single feature) and build up from there.

If you take the time to comb through discourse you’ll get an idea of how to appropriately use pure

Wrote a first version of bit categories:
https://github.com/sairus7/try_bitcategories.jl

There are some questions and insonsistence in:

  1. Value of two-valued BinaryFeature: boolean (present - true or absent - false) or equals to feature name and some default name when it is absent (in this case :no symbol)?
  2. Value of multi-valued ComplexFeature: name of one of it’s child BinaryFeature?
  3. Hierarchy: should ComplexFeature be a parent for several BinaryFeatures, or make it another child struct?
  4. Store external mask inside BinaryFeature to be able to apply it independently from the ComplexFeature?
  5. ComplexFeature and FeatureSet similarity - should it be the same object?
  6. Check for features: how to unify syntax for BinaryFeature, ComplexFeature and FeatureSet?
  7. Distinguish between mutual exclusive (unique) and combinative FeatureSets.
  8. Distinguish between ordinal and nominal features.
  9. Automatically building objects from a feature list without knowing their masks (this one is trivial - just check available bits).
2 Likes

@jw3126’s PhaseSpaceIO.jl has an interesting usage of Setfield.jl to get and set values packed in bits. Even if you don’t use Setfield.jl, I think it’s worth considering to support ConstructionBase.jl API so that it works with packages based on this API.

It would be nice if there were some sort of syntax for constructing these similar to @enum where you could just do:

@features Directions begin
   Top
   Bottom
   Left
   Right
end

You would probably need to add some extra syntax to this though to account for appropriate masking between features.

For enum-like interface [ANN] New package: BitFlags.jl — An enum-like type for bit flags

1 Like

I am using this approach for a while and I think it is a very good API for manipulating packed bit objects. You need to implement ConstructionBase.setproperties once an then it pays off very quickly. You don’t need to think about bitmasks ever again. (I also recommend overloading Base.show, Base.propertynames and adding convenient constructor).
See https://github.com/jw3126/PhaseSpaceIO.jl/blob/fc380dad764295f7b3995effc5e5440c315a409c/src/egs/particle.jl#L91-L115 for an example in the wild.

Made a repository in the past which might be related or of interest:

https://github.com/chakravala/PrimitiveBits.jl

Also, I use a lot of custom bit flags in DirectSum.jl and Grassmann.jl but this is specialized.

BitFlags.jl looks nice.


If you wanted to roll your own:

There is a BitSet in Base which is a subtype of AbstractSet{Int}.
Enums can be converted to Ints.
You could make a very small wrapper that does it.
Seems nicer than starting from BitVector


Don’t do it.
Base.@pure doesn’t mean what it sounds like it means.
It doesn’t mean “Thus function has no side-effects, and if it does then ignore them”,
It means “This function follows a very strict set of rules, and if you violate them or allow them to be violated, the compiler is allowed to catch fire”
And that set of rules includes: Not calling generic functions. Only built-ins and intrinsics.
The NamedTuple example is Base actually violates that, but only in very careful ways: in particular they only depend on functions that are completely defined already on all the types they allow, and which if overwridden would break the language anyway.
Which you too can take advantage of, but its really not trivial.
One needs to be very careful when writing not to allow types and methods you don’t fully control in to it.

Base.@pure is not for use under normal circumstances – its not exported for a reason.

1 Like