How to define an efficient `bswap_int` for user defined primitive types?

I am trying to define my own primitive type of bit-length 8*n for some n, e.g. below I have n=3.
The functions lshr_int and shl_int works directly on the newly-defined type but not bswap_int. I tried to check its definition using @which bswap_int(UInt(888)) and I see that it’s an intrinsic function so I can’t look at its implementation for UInt and try to adapt it.

I can build my own bswap_int using lshr_int and shl_int and | but is there a more efficient way?

primitive type  UInt24 <: Unsigned 24 end
x = unsafe_load(Ptr{UInt24}(pointer("abc")))

# bitshifts work fine
Base.lshr_int(x, 8)
Base.shl_int(x, 8)

# this will crash 
Base.bswap_int(x)

The background is that I am trying to build a more efficient string radixsort so being able to load the underlying bits of various length efficiently is key.

The fastest way to do it is using shifts:
((x & 0xff) << 16) | (x & 0xff00) | ((x & 0xff0000) >> 16)

1 Like

I get the following error:

julia> x & 0xff
ERROR: no promotion exists for UInt24 and UInt8
Stacktrace:

Actually if I define a type whose length is a multiple of word size then bitswap_int works fine. E.g.

primitive type Bits192 192 end
tmp = "abcdefedghiklmnopqrstuvwz"
a = unsafe_load(Ptr{Bits192}(pointer(tmp)))
Base.bswap_int(a)

I’m kind of unclear on what the benefit of a 24-bit integer type is. It doesn’t save space in registers and doesn’t save storage on disk unless you sacrifice decent alignment altogether which doesn’t seem worth it.

1 Like

it was just an example. i was trying to make a type that can load 3 bytes
at once from a string using unsafe_load. Also SAS has a 24bit numeric type
of for reading sas data this might become helpful.

the other examples i tried were 192 bits type.

It’s certainly doable but it would probably be easier to load bytes individually and put them into a standard integer type like you would in C.

A number of people have brought up their use cases for 24-bit numbers (such as representing RGB colors)
Using 33% more space (in memory or on disk) can end up affecting performance more than alignment issues (which generally aren’t even issues on most processors these days).

Better to actually get real evidence rather than stating opinions without the data to back them up.

1 Like

If you can guarantee that reading one byte past the end is safe (which isn’t that hard to do, by simply allocating a buffer one byte larger than needed), then you can simply do an unaligned 32-bit read and a mask faster than doing loading 3 bytes individually.
Even when you can’t, I’ve seen that LLVM optimizes a 24-bit load or store into two operations, a 16-bit one and an 8-bit one.