December 18, 2019, 3:15pm
There is CategoricalArray that supports only mutually exclusive categories (you can have only one caterory from a dictionary).
But what if I need to store a combinations of several binary features, say:
“Visible”: true, false,
“Clicked”: true, false,
“Moved”: true, false,
Probably the most compact way (for a total number of features less than 64) would be to encode them into bit positions of
Then, each binary feature associates with its bit position, say:
And the check is simple - bitwise & with feature mask:
x is “Visible”:
x & 0b00001 > 0
x is “Moved”:
x & 0b00010 > 0
x is “Clicked”:
x & 0b00100 > 0
Further, what if I have not only binary features, but features with three and more values?
Then each feature itself is a mutually exclusive category with its own dictionary:
“Direction”: “top”, “down”, “left”, “right”
To encode them into bit positions of UInt64 mask along with three previous binary features, we need to provide it with its own bit range, non-overlapping with other features (
value masks inside that external mask for each feature value:
To check for that feature values, we should first apply
external mask, then compare with value mask:
x has feature “Direction” set to “top”:
x & 0b01111000 == 0b00001000
x has feature “Direction” set to “down”:
x & 0b01111000 == 0b00010000
Looking back to binary features, they just have external mask equal to value mask.
That way, we can store a combination of features with two or more values inside one 64-bit number.
The question is: how should I organize data structures around this kind of categorical encoding, to be able to check and set individual feature values?
There is an inherent trade-off between data size and access speed (similar to
If you want to use bitflags, and need generic code for this, you could experiment with a type like
T is a
Tuple of keys, eg
(:top, :down, :left, :right), and clever code uses this to figure out the bit masks, possibly using
Base.pure (see examples for the
NamedTuple code in
This should be a fun package to write.
This would be a nice package to have around. I’d love to see how others implement this too.
December 19, 2019, 3:54pm
Did you mean
@pure macro? Any references to that examples?
Actually I thougth of using named tuples with names for categories and values for masks.
It is unclear for now, how would I encode toogether:
a) binary features, that only have their own name and a mask, and
b) multi-valued features that have their own name, external mask, and also a set of value names and sub-masks.
Yes, that’s a typo, I mean
I agree that implementing this well is not trivial, I would recommend that you experiment with something simple (single feature) and build up from there.
If you take the time to comb through discourse you’ll get an idea of how to appropriately use pure
December 20, 2019, 3:43pm
Wrote a first version of bit categories:
There are some questions and insonsistence in:
Value of two-valued
BinaryFeature: boolean (present -
true or absent -
false) or equals to feature name and some default name when it is absent (in this case
Value of multi-valued
ComplexFeature: name of one of it’s child
ComplexFeature be a parent for several
BinaryFeatures, or make it another child struct?
Store external mask inside
BinaryFeature to be able to apply it independently from the
FeatureSet similarity - should it be the same object?
Check for features: how to unify syntax for
Distinguish between mutual exclusive (unique) and combinative
Distinguish between ordinal and nominal features.
Automatically building objects from a feature list without knowing their masks (this one is trivial - just check available bits).
December 22, 2019, 7:02am
@jw3126’s PhaseSpaceIO.jl has an interesting usage of Setfield.jl to get and set values packed in bits. Even if you don’t use Setfield.jl, I think it’s worth considering to support ConstructionBase.jl API so that it works with packages based on this API.
It would be nice if there were some sort of syntax for constructing these similar to
@enum where you could just do:
@features Directions begin
You would probably need to add some extra syntax to this though to account for appropriate masking between features.
December 22, 2019, 8:15am
December 22, 2019, 10:10am
I am using this approach for a while and I think it is a very good API for manipulating packed bit objects. You need to implement
ConstructionBase.setproperties once an then it pays off very quickly. You don’t need to think about bitmasks ever again. (I also recommend overloading
Base.propertynames and adding convenient constructor).
See https://github.com/jw3126/PhaseSpaceIO.jl/blob/fc380dad764295f7b3995effc5e5440c315a409c/src/egs/particle.jl#L91-L115 for an example in the wild.
Made a repository in the past which might be related or of interest:
Also, I use a lot of custom bit flags in
DirectSum.jl and Grassmann.jl but this is specialized.
BitFlags.jl looks nice.
If you wanted to roll your own:
There is a
BitSet in Base which is a subtype of
Enums can be converted to
You could make a very small wrapper that does it.
Seems nicer than starting from
Don’t do it.
Base.@pure doesn’t mean what it sounds like it means.
It doesn’t mean “Thus function has no side-effects, and if it does then ignore them”,
It means “This function follows a very strict set of rules, and if you violate them or allow them to be violated, the compiler is allowed to catch fire”
And that set of rules includes: Not calling generic functions. Only built-ins and intrinsics.
The NamedTuple example is
Base actually violates that, but only in very careful ways: in particular they only depend on functions that are completely defined already on all the types they allow, and which if overwridden would break the language anyway.
Which you too can take advantage of, but its really not trivial.
One needs to be very careful when writing not to allow types and methods you don’t fully control in to it.
Base.@pure is not for use under normal circumstances – its not exported for a reason.