RFC: Compressed (base-6) floating point, taking 1/8 the space (new plan 1/64 compression for neural networks)

I’ll likely be making a package later, but this may already be useful for someone. I thought about posting under the performance category, but so far decided against it, since I’m not (yet) asking for tuning advice; this would of course help others with performance, i.e. memory bandwidth.

I’ll first show the (potential) downsides. These are likely fixed by B. once the bug is found.

A.

julia> decode_8(encode_8((-0.0, 12.0, 3.0, 4.0, 12.0, 3.0, 5.0, 1/3)))
(0.0, 12.0, 3.0, 4.0, 12.0, 3.0, 5.0, 0.3333333333333333)

You do not get signed zero back (and that’s before I bring posits into the mix; they don’t support it either, so you may want to get used to living without it). Will it be missed?

One of my goals is representing 1/3 exactly in floats (posits), and that seems possible by scaling all numbers by 3 (so binary base-2 * 3 = base-6 floats).

As people may know, 0.3 (like 1/3 and 0.1) isn’t exact in binary, so when I compress, I get:

julia> decode_8(encode_8((-0.0, 12.0, 3.0, 4.0, 12.0, 3.0, 5.0, 0.3)))
(1.7763568394002505e-15, 12.0, 3.0, 4.0, 12.0, 3.0, 5.0, 0.3000000000000001)

That’s too bad. Code so far:

julia> function encode_8(tuple; scale = 3.0)
         (a, b, c, d, e, f, g, h) = tuple .* scale; mid = (a + b + c + d + e + f + g + h) / 8
         (mid, b - mid, c - mid, d - mid, e - mid, f - mid, g - mid, h - mid)
       end

julia> decode_8((mid, b, c, d, e, f, g, h); scale = 3.0) = ((mid - b - c - d - e - f - g - h), b + mid, c + mid, d + mid, e + mid, f + mid, g + mid, h + mid) ./ scale
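
Note that the first element a is not stored explicitly: since a + b + … + h = 8*mid, decode_8 recovers it as mid minus the sum of the stored differences. A quick round-trip sketch (the chosen values keep all intermediate sums exactly representable, so only the final division by 3 rounds):

julia> t = (1/3, 2.0, -7.0, 4.5, 0.25, 9.0, -3.0, 1.0);

julia> all(decode_8(encode_8(t)) .== t)  # expected to hold for this input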

B.

julia> using SoftPosit
julia> function encode_8p(tuple; scale = 5.0)
         (a, b, c, d, e, f, g, h) = tuple .* scale; mid = (a + b + c + d + e + f + g + h) / 8
         Posit8.((mid, b - mid, c - mid, d - mid, e - mid, f - mid, g - mid, h - mid))
       end

julia> encode_8p((-0.0, 12.0, 3.0, 4.0, 12.0, 3.0, 5.0, 0.3); scale=3.0)
(Posit8(15.0), Posit8(20.0), Posit8(-5.5), Posit8(-2.75), Posit8(20.0), Posit8(-5.5), Posit8(0.25), Posit8(-14.0))

EDIT: I figured out the bug from “below”; it’s caused by the above. Even though the numbers seem sensible, Posit8 may not have enough precision. If I take out the averaging stuff and simply scale, it should work.

Can’t see the bug:

julia> function decode_8p((mid, b, c, d, e, f, g, h); scale = 3.0)
         (mid, b, c, d, e, f, g, h) = Float64.((mid, b, c, d, e, f, g, h)); ((mid - b - c - d - e - f - g - h), b + mid, c + mid, d + mid, e + mid, f + mid, g + mid, h + mid) ./ scale
       end

julia> decode_8p(encode_8p((-0.0, 12.0, 3.0, 4.0, 12.0, 3.0, 5.0, 0.3)))
(0.8333333333333334, 11.666666666666666, 3.1666666666666665, 4.083333333333333, 11.666666666666666, 3.1666666666666665, 5.083333333333333, 0.3333333333333333)

C.

I believe(d) that Posit8 has 2 significant decimal digits, but that’s just the maximum (or I uncovered a bug in SoftPosit.jl, possibly only in a conversion routine).

So even with no scaling (scale = 1.0 should disable it), I get less accuracy than that for the last number, 0.3:

julia> encode_8p((-0.0, 12.0, 3.0, 4.0, 12.0, 3.0, 5.0, 0.3); scale=1.0)
(Posit8(0.0), Posit8(12.0), Posit8(3.0), Posit8(4.0), Posit8(12.0), Posit8(3.0), Posit8(5.0), Posit8(0.3125))

julia> encode_8p((-0.0, 12.0, 3.0, 4.0, 12.0, 3.0, 5.0, 0.3); scale=3.0)
(Posit8(0.0), Posit8(32.0), Posit8(9.0), Posit8(12.0), Posit8(32.0), Posit8(9.0), Posit8(15.0), Posit8(0.875))

The error is larger with the scaling:

julia> 0.3*3-0.875
0.02499999999999991

julia> 0.3-0.3125
-0.012500000000000011

@JeffreySarnoff, a potential solution is using (your) IntervalArithmetic.jl. That means the floats take twice the space, though; I was aiming for 8 Float64 stored in one 64-bit register.

Another possibility, if people are bothered by precision loss, is using Posit16. I think your package currently stores [min, max], or does it use a midpoint? In either case, does it assume the same type for both numbers? Could we use, say, Posit16 for the midpoint(s) and Posit8 for the range? That would take a 64-bit float (or Float32) down to 16+8 = 24 bits, if not just 2 bytes.
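
To make that idea concrete, a hypothetical 24-bit “midpoint + radius” layout might look something like this (MidRad24, lo and hi are made-up names, not from IntervalArithmetic.jl):

using SoftPosit

struct MidRad24          # 16 + 8 = 24 bits of payload
    mid::Posit16         # midpoint in higher precision
    rad::Posit8          # radius/half-width in lower precision
end
MidRad24(x, r) = MidRad24(Posit16(Float64(x)), Posit8(Float64(r)))
lo(m::MidRad24) = Float64(m.mid) - Float64(m.rad)
hi(m::MidRad24) = Float64(m.mid) + Float64(m.rad)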

Do you really care about signed zero for compressed float formats?
I do not. They bring a good deal of baggage and are designed for use with Complex.

This will be a storage format only, as described here, but I have other ideas for how to work with posits: implicitly, as Float64, until really needing to store, to make it faster than Float32.

I made a new unsigned posit8, because 3 significant binary digits is awfully small, and a max of 4 fraction bits is then at least a bit better.

This is work in progress, and as written below it is dangerous to use (with min, or I think also average, values stored), as I explain:

Note, unchanged Posit8 has all integers from -16 to +16 exact, and is a superset of those numbers; that must be where I got 2 significant decimal digits from. That’s true, but not for an arbitrary two-digit integer, e.g. 21 or 99.

With that one more bit I suppose I get 0 to 32 exact; then base-30 is more practical, since I want to have 1.0 exact… but it’s only exact up to 0.9, 1.0 and then 1 1/3 (and I thought I would get 1 2/3 exact, but seemingly not), and then no more.

Then I can no longer store the average, so I store the min value (and likely must store the max value too). I should be able to get some extra bits for the remaining 6 or 5 numbers. That’s an unfinished idea.

If I store min and max, I can store them in [min, max] or [max, min] order, giving me one more bit to work with and use somewhere, and I can calculate the average from them. I can scale the difference numbers to get some more fraction bits.
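
A small sketch of that ordering trick (hypothetical helper names; note the extra bit is lost when min == max):

encode_minmax(lo, hi, extrabit::Bool) = extrabit ? (hi, lo) : (lo, hi)

function decode_minmax((x, y))
    extrabit = x > y            # recover the extra bit from the order
    mn, mx = minmax(x, y)
    (mn, mx, extrabit)
end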

As is:

julia> UPosit8(n) = UInt8((reinterpret(UInt16, Posit16(n)) << 1) >> 8)  # drop the sign bit, keep the next 8 bits

julia> UPosit8_to_Float64(n) = Float64(reinterpret(Posit16, UInt16(n) << 7))  # shift those 8 bits back into place, sign = 0

julia> function encode_8p(tuple; base = 30)
         (a, b, c, d, e, f, g, h) = tuple .* base//2
         m = min(a, b, c, d, e, f, g, h); # mid = (a + b + c + d + e + f + g + h) / 8
         (Posit8(m), UPosit8.((b - m, c - m, d - m, e - m, f - m, g - m, h - m)))
       end

julia> function decode_8p((min, (b, c, d, e, f, g, h)); base=30) # , T=Posit16)
         b, c, d, e, f, g, h = UPosit8_to_Float64.((b, c, d, e, f, g, h))
         min = Float64(min)
         (min, b + c + d + e + f + g + h, b + min, c + min, d + min, e + min, f + min, g + min, h + min) ./ base//2
       end

julia> function decode_8p((min, b, c, d, e, f, g, h); base=30) # earlier flat-tuple variant; evaluating it would overwrite the method above
         (a, b, c, d, e, f, g, h) = Float64.((b + c + d + e + f + g + h, b + min, c + min, d + min, e + min, f + min, g + min, h + min)) ./ base//2
       end

Alternative, that should be safe:
julia> function encode_8p(tuple; base=30, T=Posit16)
         (a, b, c, d, e, f, g, h) = tuple .* base//2;  # mid = (a + b + c + d + e + f + g + h) / 8
         T.((a, b, c, d, e, f, g, h))
       end

Note, the first two returned values are just debug values; I’ve not yet recovered the first input value, but it should be possible:

julia> decode_8p(encode_8p((-0.0, 1.3, 3.0, 4.0, 12.0, 3.0, 5.0, 0.3); base=30); base=30)
(0.0, 27.9, 1.2, 2.933333333333333, 4.0, 11.733333333333333, 2.933333333333333, 4.8, 0.3)

As you can see, 0.3 is exact as promised, and so is 1 + 1//3 (both unlike in regular floats or posits), but if I change/lower the min value, then the values change:

julia> decode_8p(encode_8p((-5.0, 1.3, 1 + 1//3, 12.0, 1 + 2//3, 0.9, 5.0, 0.3); base=30); base=30)
(-5.333333333333333, 54.4, 0.5333333333333333, 0.5333333333333333, 10.666666666666666, 1.0666666666666667, 0.5333333333333333, 4.266666666666667, -0.5333333333333333)

E.g. 0.3 changes to -0.533, which is kind of bad… :slight_smile: I think that’s because of the nature of the posit system (and low precision, and catastrophic cancellation). Working with (low-bit) posits has been illuminating, but I think I can work around this and make posits user-friendly.

No, I don’t worry about signed zero personally, and that’s the least of my worries now. :slight_smile: I was just informing about posits in general, not just my system based on them, and asking if others really want it. If so, it would be a deal-breaker and I might as well abandon my investigation.

After playing with Posit8 a bit, I can really see why most people wouldn’t accept it, because of low precision, at least without (your) interval arithmetic package.

Now I’m thinking 1/16 “compression” (of Float64), going to Posit4; or even lower 3- or 2-bit:

Posit8 IS useful for neural networks, and maybe mostly for that (so I’m going to concentrate only on that application for now, possibly 4x4 sub-matrices too). But it only gives at best a 2x speedup over Float16, and I was thinking maybe go all in and embrace the low precision. There are already binary and ternary networks, neither mainstream yet; I think that’s because having only {-1, 0, 1} as possible weights is rather restrictive (let alone {0, 1} for weights, and {0, 1} or {-1, 1} for inputs as in some binary networks), so you likely need a larger network to compensate.

I’m thinking that if networks are sparse (as they should be), then you might often only actually need to store one (or a few) values for every 16, and might get a lot better precision that way.

One way would be to store the locations of non-zeros in 16 bits, leaving 48 bits, i.e. at least 3 bits for each of the 16 posits (or 48/n bits for each non-zero value). That makes 48-bit (posit) precision possible when there’s only one value, at least better than Float32 in that case.
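
A hedged sketch of that layout idea (hypothetical helper; it only computes the occupancy mask and the resulting bit budget, not the actual packing):

function chunk16_budget(values::NTuple{16,Float64})
    mask = UInt16(0)                       # 16-bit map of non-zero positions
    for (i, v) in enumerate(values)
        v != 0 && (mask |= UInt16(1) << (i - 1))
    end
    n = count_ones(mask)
    bits_per_value = n == 0 ? 48 : 48 ÷ n  # 48 bits left to split among the non-zero values
    (mask, bits_per_value)
end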

Alternatively I could store the first and last index of the non-zeros, for 8 (possibly even fewer) bits, leaving 7 bits per value in the case of 8 non-zeros stored, 14 for 4 and 28 for 2, or for all of them if they are in a row. This might be better in case non-zeros cluster together; if not fully, I need to store zeros as if they were non-zeros.

In this paper, we propose ternary neural networks (TNNs) in order to make deep
learning more resource-efficient. We train these TNNs using a
teacher-student approach based on a novel, layer-wise greedy
methodology. Thanks to our two-stage training procedure, the
teacher network is still able to use state-of-the-art methods such
as dropout and batch normalization to increase accuracy and
reduce training time. Using only ternary weights and activations,
the student ternary network learns to mimic the behavior of its
teacher network without using any multiplication. Unlike its {-1,1}
binary counterparts, a ternary neural network inherently prunes
the smaller weights by setting them to zero during training. This
makes them sparser and thus more energy-efficient. We design a
purpose-built hardware architecture for TNNs and implement it
on FPGA and ASIC. We evaluate TNNs on several benchmark
datasets and demonstrate up to 3.1× better energy efficiency with
respect to the state of the art while also improving accuracy.
[…]
Courbariaux et al. [15] propose the BinaryConnect (BC) method for binarizing only the weights, leaving the inputs and the activations as real values. The same idea is also used as TernaryConnect (TC) in [9] and Ternary Weight Networks (TWN) in [8] with ternary {−1, 0, 1} weights instead of binary {−1, 1}.

See: Table II

We propose two efficient approximations to standard convolutional neural networks: Binary-Weight-Networks and XNOR-Networks. In Binary-Weight-Networks, the filters are approximated with binary values resulting in 32× memory saving. In XNOR-Networks, both the filters and the input to convolutional layers are binary. XNOR-Networks approximate convolutions using primarily binary operations. This results in 58× faster convolutional operations (in terms of number of the high precision operations) and 32× memory savings. XNOR-Nets offer the possibility of running state-of-the-art networks on CPUs (rather than GPUs) in real-time. Our binary networks are simple, accurate, efficient, and work on challenging visual tasks.


Since I’m going all in, I’ll go for 1/64 compression: I could do 6 bits for the first non-zero, but I choose to use fewer, say only 4 bits (for index÷4); and not store the last index, but rather the count of non-zeros, encoded in 2 bits as: 00 => 64 “stored” (see below), 01 => 32 stored, 10 => 16, 11 => 8 [EDIT: a 4-value case was dropped from the proposal, but of course this allows for fewer than 8 values too].

So when I have only 4 values, I get (64-4-2)/4 = 14.5 bits per non-zero value. I guess I could give every other value 15 bits and the others 14 bits, or to start with just use 14 bits for all. [EDIT: 14 bits is already overkill, and since float8 is enough, hopefully float7 is too.]

Then 7.25 bits for 8 values, 3.625 for 16, and 1.8125 bits for 32. For a ternary network that’s enough, since I need log2(3) ≈ 1.58 bits per value, or as a start I can store it as a binary network as below (or every other value binary, the rest ternary, interleaving as mentioned above):

0.90625 bits for 64 values is tricky. What I can simply do is store 1 bit for 58 of the values, and they could mean -1 or 1 for a binary network. I would simply miss out on storing some of the first and/or last numbers of the 64, and those would become zero (or NaN, NaR? Those are likely no good for a neural network… I doubt this will ever be too useful in other contexts, but it’s a possibility, or I could just go with 1/32 compression).
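
To spell out the bit budget above (assuming the 4-bit start index and the 2-bit count code; helper names are hypothetical):

stored_count(code) = 64 >> code                   # 0b00 => 64, 0b01 => 32, 0b10 => 16, 0b11 => 8
bits_per_value(code) = (64 - 4 - 2) / stored_count(code)
# bits_per_value.(0:3) gives 0.90625, 1.8125, 3.625 and 7.25 bits respectively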

Here is a copy of excerpts from a message James Demmel sent to the IEEE 754 mailing list; the links may be of interest.

For those of you interested in following the new IEEE standards committee on floating point for machine learning, here is a list of publicly available materials describing possibilities we have heard about …

A joint effort from Nvidia, Arm and Intel:
[2209.05433] FP8 Formats for Deep Learning

Tesla:
https://tesla-cdn.thron.com/static/SBY4B9_tesla-dojo-technology_OPNZ0M.pdf?xseo=&response-content-disposition=inline%3Bfilename%3D"tesla-dojo-technology.pdf"

A joint effort from Graphcore, AMD and Qualcomm:
Graphcore and AMD propose FP8 AI standard with Qualcomm support

A presentation from Nvidia about a logarithmic format (no mantissa bits):
https://media.mlsys.org/Conferences/MLSYS2021/Slides/DL_Chips_ML_Sys_0421.pdf


Thanks, great to know!

Not only does the float8 (E4M3) format match float16 across the board (with one exception), but where it matters (for the largest models) it slightly beats it!

i.e. if you look at: https://arxiv.org/pdf/2209.05433.pdf

Table 5: […] For F1 metrics higher is better, for perplexity lower is better. Best 8-bit result is bolded

It’s better for BERT Large and GPT-3 6.7B (the largest in that table).

This gives me some confidence that I might be on a good track with my 8x denser format. Note this part:

8-bit inference deployment is greatly simplified by FP8 training, as inference and training use the same datatypes. This is in contrast to int8 inference with networks trained in 32- or 16-bit floating point, which require post-training quantization (PTQ) calibration and sometimes quantization-aware training (QAT) in order to maintain model accuracy. Furthermore, even with quantization aware training some int8-quantized models may not completely recover the accuracy achieved with floating point [1].

and:

Our study covers the main modern neural network architectures - CNNs, RNNs, and Transformer-based models, leaving all the hyperparameters unchanged from the 16-bit baseline training sessions. Our training experiments include large, up to 175B parameter, language models

My worry with binary and ternary networks was that they might not work for Transformers, or might only work for some very specific neural-network applications.

I would want my format to be a drop-in replacement for neural networks, just a new datatype, without needing new hyperparameters.

Since int8 is worse than float8, what hope would there be for binary or ternary, since those are in effect at best int1 or int2? My format doesn’t strictly depend on those being good, though.

Since they are designing hardware, it’s understandable that they go with a format for a scalar number (even though SIMD vectors are a thing, they are usually just repeated scalars; a shared-exponent format is also a thing in some exotic hardware that I believe didn’t catch on, Intel bought that company).

But my format would only be 1-bit when a network is fully dense (well, allowing 58 of the 64 values in a chunk), 2-bit when you do away with 32 values in a row out of a 64-value chunk, and then “7.25 bits for 8 values” in a row of 64 values.

So my hypothesis, or hope, is that networks would be at least that sparse (on average).

It bugged me a bit that I could only store max 58 values out of 64, but that is avoidable. I can simply store all 64 as binary (plus allow for other cases than binary)!

I can test e.g. the last 8 bits: if they are all zero (the test can use the AL register, which is fast), that’s a good sign that I have sparse data (and the odds of a false positive are 1/256 for a random dense bit pattern).

Then I can interpret the bits differently, e.g. as ternary as the second option. You might think ternary needs 2 bits per value, but I think it might be best to decode them in pairs, as 3 bits, not 4. You just lose out on one bit pattern of the 3*3 = 9 combinations, e.g. (-1, -1), likely the most useless one. I could decode 36 ternary values that way, and possibly use the 1 or 2 leftover bits for location data.
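
A hedged sketch of that pairing (hypothetical helpers; the (-1, -1) combination is the one sacrificed):

function encode_ternary_pair(a::Int, b::Int)      # a, b ∈ {-1, 0, 1}
    idx = 3*(a + 1) + (b + 1)                     # 0..8; 0 would be (-1, -1)
    @assert idx != 0 "(-1, -1) is the dropped combination"
    UInt8(idx - 1)                                # 3-bit code, 0..7
end

decode_ternary_pair(code::UInt8) = ((Int(code) + 1) ÷ 3 - 1, (Int(code) + 1) % 3 - 1)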

And the last case would be to use float8 (I’m agnostic for now if E4M3 or Posit8, or maybe Posit7 or lower?).

It’s sort of important that decode is fast (much less so for encode), and it can be, so I just need to identify the common case for a fast path, and the others will not be much slower.

The encode function isn’t terribly hard, but in some cases it has to choose between encoding more values with down to even one bit each (so possibly quantizing), or sparseness, then possibly dropping some values to zero.

You can store all 64 values with the 58 that you have. Let the 58 be finite, nonspecial values. There are three likely special values: Inf, -Inf, NaN. Their encodings are bit patterns that must be checked anyway, and you are writing the checks (e.g. isnan(x) = reinterpret(UInt8, x) === 0x__) … so place them as best suits your approach (not in the 58 values, and if they were there – take them out and use a few additional conventional values instead). This still leaves 3 unassigned values. You might have +/- Huge (a nonspecific large finite value, a non-infinite numerousness). And for the last one, you may find a great purpose for it as you lay out the rest. Otherwise a useful Irrational?

Long story short … encodings are many, allocations of number ramps are many too. Tell me more specifics that elucidate your underlying intent.

You can store all 64 values with the 58 that you have.

Almost all of your comment, including that part, is cryptic to me (I’ll try to think about it and decode it), especially: “Otherwise a useful Irrational?”

When I encode all 64 values (into a UInt64), they need to be 0 or 1 (or -1 and 1, not sure which is more useful). There’s no place for NaN, or anything more, to be encoded in the case of that many values. Neither for the ternary case, as explained, though I could encode ternary plus a possible NaN in 2 bits, at the cost of 25% fewer (max.) ternary values encoded. Since NaNs (and infinities) aren’t that useful, or wanted, in neural networks, I think I’ll just skip them, or at best include them in the float8 case.

Infinities aren’t supported in E4M3, just as in Posit8; the formats are surprisingly similar, each with only one NaN bit pattern. In neural networks you simply clamp to the highest value allowed.

what does this mean?

When I encode all 64 values (into UInt64), then they need to be 0 or 1 (or -1 and 1, not sure which more useful)

[and I am happy to do the decoding … if it is relevant]

Are you using bit positions to flag each of 64 values?
Are you representing numbers this way? If so what is each?
Or are you still looking at 5 bits? If so why use a UInt64?

| 0b | 0b | 0b | 1b |
| b3 | b2 | b1 | b0 |

| 0b | 0b | 1b | 0b |
| b3 | b2 | b1 | b0 |

| 0b | 1b | 0b | 0b |
| b3 | b2 | b1 | b0 |

I think the miscommunication might be that you think I have 64 bit patterns for each value, and was talking about using 58 of them. I can see my proposal being confusing, with me abandoning the first one and then listing some variations on the new one for neural networks.

I meant that I want to take in a tuple (vector) of 64 Float64 values and encode all of them into a UInt64. I can do that! Before, I thought it needed to be sparse and that I could encode at most 58 of those 64.

Does that clarify?

yes

I’m conflicted about whether I should support just binary, or also ternary (in addition to the third format).

To simplify and have only two different formats, one dense and one sparse, I’m thinking:

  1. 64 binary values with a twist, call it pseudo-ternary, where every other value is 1 or 0, and the other values are -1 or 0.
  2. Then the other, sparse format is float8s; or maybe alternatively only allow, for that format, a bunch of float8, then float7, then maybe float5… to squeeze a few more values into the sparse format; plus location information for the whole row of such.

EDIT: When I think about it, storing location information is maybe not needed or wanted. It complicates encoding, meaning I would have to try all the possible locations and see which fits best.

Simpler for the sparse format: store the 8 lowest bits as zero (to signal it), and then 7 float8s, which have to be at the start of the vector.
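
A rough sketch of the resulting dispatch (hypothetical names; the sparse branch would call a byte-unpacking decoder like the decode_64_ further down, and the dense branch uses the pseudo-ternary sign pattern from 1. above):

is_sparse(n::UInt64) = iszero(n % UInt8)   # low 8 bits all zero signals the sparse format

function decode_chunk(n::UInt64)
    if is_sparse(n)
        decode_64_(n)                      # seven Posit8s packed in the upper bytes
    else                                   # dense: one weight per bit, alternating +1/0 and -1/0
        ntuple(i -> isodd(i) ? Float64((n >> (i - 1)) & 1) : -Float64((n >> (i - 1)) & 1), Val(64))
    end
end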

A cell in JPEG is 8x8: one DC component, then 63 frequency components. I’m thinking this would catch the DC and a few of the most important components with float8.

Do you know if neural networks are fed with fully decoded images, or possibly half-decoded JPEGs (which seems sensible, as I explain here)? I know Nvidia has JPEG hardware on its GPUs to make training of neural nets faster. It seems most sensible to only half-decode and keep the frequency components in the frequency domain. The DC component alone already gives a self-similar image in the “time” domain, just scaled down by 8 in both the X and Y directions, i.e. 64 times smaller. (It’s also very well possible that the neural networks figure nothing out from the frequency components, basically disabling 63/64 of the training data… but if not, then you want to bias computer vision toward the big picture and away from textures, and this might do that, while also using some of the texture info.)

The point of the format is to force a sparse representation of weights, and I think this format would induce that. To begin with you need to initialize the weights somehow, and I think that’s usually done randomly, so I think a random number for the dense format is good enough.

This is basically done as a proof of concept, i.e. encode_64_full, the main function. I just need to finish the trivial corresponding decode function and validate correctness (I believe the functions it calls are ok).

Those are some ugly (long) function definitions (the longest I’ve ever made; is there a better way to do this? see the aside after encode_64_full below):

using Statistics

function encode_64_full((w0, w1, w2, w3, w4, w5, w6, w7, w8, w9, w10, w11, w12, w13, w14, w15, w16, w17, w18, w19, w20, w21, w22, w23, w24, w25, w26, w27, w28, w29, w30, w31, w32, w33, w34, w35, w36, w37, w38, w39, w40, w41, w42, w43, w44, w45, w46, w47, w48, w49, w50, w51, w52, w53, w54, w55, w56, w57, w58, w59, w60, w61, w62, w63))
  codeb = encode_64b((w0, w1, w2, w3, w4, w5, w6, w7, w8, w9, w10, w11, w12, w13, w14, w15, w16, w17, w18, w19, w20, w21, w22, w23, w24, w25, w26, w27, w28, w29, w30, w31, w32, w33, w34, w35, w36, w37, w38, w39, w40, w41, w42, w43, w44, w45, w46, w47, w48, w49, w50, w51, w52, w53, w54, w55, w56, w57, w58, w59, w60, w61, w62, w63))
  code = encode_64((w0, w1, w2, w3, w4, w5, w6, w7, w8, w9, w10, w11, w12, w13, w14, w15, w16, w17, w18, w19, w20, w21, w22, w23, w24, w25, w26, w27, w28, w29, w30, w31, w32, w33, w34, w35, w36, w37, w38, w39, w40, w41, w42, w43, w44, w45, w46, w47, w48, w49, w50, w51, w52, w53, w54, w55, w56, w57, w58, w59, w60, w61, w62, w63))
  meanb = mean(abs.(decode_64b(codeb) .- (w0, w1, w2, w3, w4, w5, w6, w7, w8, w9, w10, w11, w12, w13, w14, w15, w16, w17, w18, w19, w20, w21, w22, w23, w24, w25, w26, w27, w28, w29, w30, w31, w32, w33, w34, w35, w36, w37, w38, w39, w40, w41, w42, w43, w44, w45, w46, w47, w48, w49, w50, w51, w52, w53, w54, w55, w56, w57, w58, w59, w60, w61, w62, w63)))
  meana = mean(abs.(decode_64(code) .- (w0, w1, w2, w3, w4, w5, w6, w7))); # println(meanb, "\n", meana);
  if meanb > meana
    ('b', codeb)  # Binary, dense
  else
    ('n', code .% UInt8)  # Not binary, sparse
  end
end
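
(As an aside on the “is there a better way?” question above: a possible, untested alternative is to take the whole chunk as one tuple or vector argument and slice it, instead of naming 64 variables; encode_64_full_nt is a hypothetical name.)

function encode_64_full_nt(w) # w: a 64-element tuple or vector
  codeb = encode_64b(w)                       # destructuring a 64-element argument still works
  code = encode_64(w)
  meanb = mean(abs.(decode_64b(codeb) .- w))
  meana = mean(abs.(decode_64(code) .- w[1:8]))
  meanb > meana ? ('b', codeb) : ('n', code .% UInt8)
end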

Running this function needs all the function definitions below.
The returned values may seem strange, but they are correctly clamped (or rounded) values, as intended:
julia> decode_64_full(encode_64_full((0, -0.5, 0.5, 1.5, 2.5, -1.5, -2.5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)))
(0.0, -1.0, 1.0, 0.0, 1.0, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)

julia> encode_64_full((0, -0.5, 0.5, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63))
('b', (false, true, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false))

julia> encode_64_full((0, -0.5, 0.5, 1.5, 2.5, -1.5, -2.5, 10, 20, 30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0))
('n', (0x00, 0xc8, 0x38, 0x44, 0x4a, 0xbc, 0xb6, 0x5a))

function decode_64_full((is_b, tuple))
  if is_b == 'b'
    decode_64b(tuple)
  else
    decode_64(tuple)
  end
end

using SoftPosit
PInt(n) = reinterpret(UInt8, Posit8(n))

function encode_64((w0, w1, w2, w3, w4, w5, w6, w7, w8, w9, w10, w11, w12, w13, w14, w15, w16, w17, w18, w19, w20, w21, w22, w23, w24, w25, w26, w27, w28, w29, w30, w31, w32, w33, w34, w35, w36, w37, w38, w39, w40, w41, w42, w43, w44, w45, w46, w47, w48, w49, w50, w51, w52, w53, w54, w55, w56, w57, w58, w59, w60, w61, w62, w63))
  (PInt(w0), PInt(w1), PInt(w2), PInt(w3), PInt(w4), PInt(w5), PInt(w6), PInt(w7)) # TODO: encode more than the first 8 values, or pad with UInt8(0)
end

function encode_64b((w0, w1, w2, w3, w4, w5, w6, w7, w8, w9, w10, w11, w12, w13, w14, w15, w16, w17, w18, w19, w20, w21, w22, w23, w24, w25, w26, w27, w28, w29, w30, w31, w32, w33, w34, w35, w36, w37, w38, w39, w40, w41, w42, w43, w44, w45, w46, w47, w48, w49, w50, w51, w52, w53, w54, w55, w56, w57, w58, w59, w60, w61, w62, w63))
  0.5 .<= (w0, -w1, w2, -w3, w4, -w5, w6, -w7, w8, -w9, w10, -w11, w12, -w13, w14, -w15, w16, -w17, w18, -w19, w20, -w21, w22, -w23, w24, -w25, w26, -w27, w28, -w29, w30, -w31, w32, -w33, w34, -w35, w36, -w37, w38, -w39, w40, -w41, w42, -w43, w44, -w45, w46, -w47, w48, -w49, w50, -w51, w52, -w53, w54, -w55, w56, -w57, w58, -w59, w60, -w61, w62, -w63)
end

julia> encode_64b((0, -0.5, 0.5, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63))
(false, true, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false, true, false)

Clamping is done; those need combining into one function, and they should return a UInt64, not (different) tuples.

function decode_64b((w0, w1, w2, w3, w4, w5, w6, w7, w8, w9, w10, w11, w12, w13, w14, w15, w16, w17, w18, w19, w20, w21, w22, w23, w24, w25, w26, w27, w28, w29, w30, w31, w32, w33, w34, w35, w36, w37, w38, w39, w40, w41, w42, w43, w44, w45, w46, w47, w48, w49, w50, w51, w52, w53, w54, w55, w56, w57, w58, w59, w60, w61, w62, w63))
  Float64.((w0, -w1, w2, -w3, w4, -w5, w6, -w7, w8, -w9, w10, -w11, w12, -w13, w14, -w15, w16, -w17, w18, -w19, w20, -w21, w22, -w23, w24, -w25, w26, -w27, w28, -w29, w30, -w31, w32, -w33, w34, -w35, w36, -w37, w38, -w39, w40, -w41, w42, -w43, w44, -w45, w46, -w47, w48, -w49, w50, -w51, w52, -w53, w54, -w55, w56, -w57, w58, -w59, w60, -w61, w62, -w63))
end

This might seem like a bug, but is as intended, because of clamping:
julia> decode_64b(encode_64b((0, -0.5, 0.5, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63)))
(0.0, -1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0)

PInt_to_float(n) = Float64(reinterpret(Posit8, n))

function decode_64((w0, w1, w2, w3, w4, w5, w6, dummy))
  PInt_to_float.((w0, w1, w2, w3, w4, w5, w6, UInt8(0))) # needs padding with more zeros (up to the full 64 values)
end

julia> decode_64(encode_64((0, -0.5, 0.5, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63)))
(0.0, -0.5, 0.5, 3.0, 4.0, 5.0, 6.0, 0.0, 0.0)

julia> function decode_64_(n::UInt64) # unpack seven Posit8 bytes from the upper bytes of n
         curr = n >> 8
         w6 = PInt_to_float(curr & 255); curr >>= 8
         w5 = PInt_to_float(curr & 255); curr >>= 8
         w4 = PInt_to_float(curr & 255); curr >>= 8
         w3 = PInt_to_float(curr & 255); curr >>= 8
         w2 = PInt_to_float(curr & 255); curr >>= 8
         w1 = PInt_to_float(curr & 255); curr >>= 8
         w0 = PInt_to_float(curr)
         (w0, w1, w2, w3, w4, w5, w6)
       end

julia> PInt_to_float(n) = Float64(reinterpret(Posit8, n % UInt8))

julia> decode_64_(UInt64(232555454547376583))
(3.814697265625e-6, 0.625, 0.34375, -0.0009765625, 0.15625, -0.0625, 0.5625)
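
A possible encode counterpart (untested sketch) that packs seven Posit8 bytes into a UInt64, keeping the low byte zero as the sparse tag:

function encode_64_(ws)                    # ws: exactly seven values
  n = UInt64(0)
  for w in ws
    n = (n << 8) | PInt(w)                 # PInt(w) == reinterpret(UInt8, Posit8(w))
  end
  n << 8                                   # low 8 bits stay zero (the sparse signal)
end
# decode_64_(encode_64_(ws)) should give back the Posit8-rounded values
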
@code_native encode_64_full((0, -0.5, 0.5, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63))

is huge (less of a worry; I’ll look into optimizing it, one reason might be that it’s still type-unstable), but

res = encode_64b((0, -0.5, 0.5, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63))
@code_native decode_64b(res)

is also huge (currently; I will look at it first), and more of a worry.

Hi @elrod, can you look at the (latter) assembly? Does anything stick out, and is it normal to get such long assembly for vectorized code? I haven’t timed it yet; it might be ok. Is there a good way to count instructions, other than scrolling through screenfuls of pages…? It would be nice to be able to pipe into less and wc.
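
(Something like this might work to count lines of the native code without leaving the REPL; an untested sketch using InteractiveUtils.code_native writing into an IOBuffer:)

using InteractiveUtils

buf = IOBuffer()
code_native(buf, decode_64b, (typeof(res),))   # res from the example above
countlines(seekstart(buf))                     # rough proxy for the instruction count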

It is not vectorized.


I did see v-prefixed instructions like vcvtsi2sd; the v prefix is for VEX (vector extensions), so it fooled me. I’d like to get my 64 values compressed into a UInt64, and while decoding, since we do not have 64 scalar registers, decode into SIMD registers.

Right now, testing the code, it’s 12x slower than I want it to be, at least (understandable, to a degree) [EDIT: 62x slower than Float32 (strange, I only got 22x slower before), and 2x slower vs Float16, which is maybe, maybe not, more fair to compare to]. But actually I’m not streaming over 1/64 of the memory yet, so it might actually be pretty good (even with the two allocations).

EDIT: My code is now seemingly 0.13% faster than Float16, even though it’s not directly comparable and it’s doing a bit more. Versus Float32 it’s 24x slower, and versus Float64 it’s 9.5x slower. I realized where my allocation was; it was in the timing code, and I’ve not yet updated the numbers. Does anyone know how to do this without spelling it all out, similar to this:

function time_code(A)
  s = 0
  for i in eachindex(A)[1:64:end]
    # note: encode_64_full takes a single 64-element tuple, so the arguments are wrapped in one
    decode_64_full(encode_64_full((A[i], A[i+1], 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0))) # alternatively A[i:i+63] # s += i; println(i)
  end
end
julia> A_orig2 = rand(64*10000); # I suppose the comparison (sum) isn't threaded; I don't want that, for fairness

julia> @time sum(A_orig2)
  0.000560 seconds (1 allocation: 16 bytes)
320055.14035831956

vs.

julia> function time_code(A)
         s = 0; for i in eachindex(A)[1:64:end]
           decode_64_full(encode_64_full((A[i:i+63]))) # s += i; println(i)
         end
       end

julia> @time time_code(A_orig2)
  0.006761 seconds (10.00 k allocations: 5.493 MiB)

julia> @time time_code(A_orig2)  # outdated timing
  0.010273 seconds (20.00 k allocations: 6.256 MiB)

julia> 0.006761/0.000560
12.073214285714286

julia> 0.010273/0.000638  # older, outdated timing; getting rid of 1 of 2 allocations gave a 51% speedup
16.101880877742946

julia> @code_native time_arr(A_orig2)  # shows optimized away

I do see e.g.:

 	vmovsd	%xmm10, 272(%rax)
	vmovsd	%xmm11, 280(%rax)
	vmovsd	%xmm12, 288(%rax)
	vmovsd	%xmm13, 296(%rax)
	vmovsd	%xmm14, 304(%rax)
	vmovsd	%xmm15, 312(%rax)
	vmovsd	%xmm0, 320(%rax)
	vmovsd	%xmm1, 328(%rax)
	vmovsd	%xmm2, 336(%rax)
	vmovsd	%xmm3, 344(%rax)
	vmovsd	%xmm4, 352(%rax)
	vmovsd	%xmm5, 360(%rax)
	vmovsd	%xmm6, 368(%rax)
	vmovsd	%xmm7, 376(%rax)
	vmovsd	%xmm8, 384(%rax)
	vmovsd	%xmm9, 392(%rax)

and what I don’t like:

	movl	%ecx, -68(%rsp)                 # 4-byte Spill
	movsbl	18(%rsi), %ecx
	movl	%ecx, -64(%rsp)                 # 4-byte Spill
	movsbl	20(%rsi), %ecx
	movl	%ecx, -52(%rsp)                 # 4-byte Spill
	movsbl	22(%rsi), %ecx
	movl	%ecx, -48(%rsp)                 # 4-byte Spill
	movsbl	24(%rsi), %ecx
	movl	%ecx, -56(%rsp)                 # 4-byte Spill
	movsbl	26(%rsi), %ecx
	movl	%ecx, -72(%rsp)                 # 4-byte Spill
	movsbl	28(%rsi), %ecx
	movl	%ecx, -80(%rsp)                 # 4-byte Spill
	movsbl	30(%rsi), %ecx
	movl	%ecx, -88(%rsp)                 # 4-byte Spill
	movsbl	32(%rsi), %ecx
	movl	%ecx, -96(%rsp)                 # 4-byte Spill
	movsbl	34(%rsi), %ecx
	movl	%ecx, -100(%rsp)                # 4-byte Spill
	movsbl	36(%rsi), %ecx
	movl	%ecx, -104(%rsp)                # 4-byte Spill
	movsbl	38(%rsi), %ecx
	movl	%ecx, -108(%rsp)                # 4-byte Spill
	movsbl	40(%rsi), %ecx
	movl	%ecx, -112(%rsp)                # 4-byte Spill
	movsbl	42(%rsi), %ecx
	movl	%ecx, -116(%rsp)                # 4-byte Spill
	movsbl	44(%rsi), %ecx
	movl	%ecx, -120(%rsp)                # 4-byte Spill
	movsbl	46(%rsi), %ecx
	movl	%ecx, -124(%rsp)                # 4-byte Spill
	movsbl	48(%rsi), %ecx
	movl	%ecx, -128(%rsp)                # 4-byte Spill
	movsbl	50(%rsi), %ecx
	movl	%ecx, -40(%rsp)                 # 4-byte Spill
	movsbl	52(%rsi), %ecx
	movl	%ecx, -44(%rsp)                 # 4-byte Spill
	movsbl	54(%rsi), %ecx
	movl	%ecx, -60(%rsp)                 # 4-byte Spill
	movsbl	56(%rsi), %ecx
	movl	%ecx, -76(%rsp)                 # 4-byte Spill
	movsbl	58(%rsi), %ecx
	movl	%ecx, -84(%rsp)                 # 4-byte Spill
	movsbl	60(%rsi), %ecx
	movl	%ecx, -92(%rsp)                 # 4-byte Spill