Julia then needs to support some Float8 for max speed (and most preferably sparse matrices), because that's what they use for their speed. With Julia code being generic, it should be easier than for many other languages.
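A quick sketch of what that genericness buys: nothing below hardcodes the element type, so a hypothetical `Float8` (the name is my assumption; no such type ships with Julia today) would slot in exactly like `Float16` does:

```julia
# Generic axpy: y .= a .* x .+ y, for any number type with * and +.
function axpy!(y, a, x)
    @inbounds for i in eachindex(x, y)
        y[i] = muladd(a, x[i], y[i])
    end
    return y
end

axpy!(zeros(Float16, 3), Float16(2), ones(Float16, 3))  # works today
# axpy!(zeros(Float8, 3), Float8(2), ones(Float8, 3))   # same code, once Float8 exists
```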
https://www.tachyum.com/datasheets/Prodigy%20Family%20SKUs%20V1.06%20220815.pdf
Note, their Float8 (FP8*) is 12 PFLOPS = 12,000 TFLOPS, or 133x their Float64 (DP), not the 8x you would expect from memory bandwidth alone.
I noticed the footnote on FP8,

*With sparsity

only after writing the rest here. What could it mean?
Why is that? My guess, and what I would do, is that operations (e.g. multiply, even division, if they support it that way) can be done with a 64 KB lookup table per operation (or 32 KB, exploiting symmetry for commutative ops), making even divide a 1-cycle operation, if supported: two 8-bit operands index 2^16 = 65,536 possible input pairs, each mapping to one 8-bit result.
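To make the lookup-table idea concrete (this is purely my illustration, not Tachyum's format: I'm assuming an E5M2-style encoding and ignoring zero, subnormals, Inf, and NaN), a complete 8-bit multiplier is just a 256x256 table of precomputed results:

```julia
# Sketch only: a hypothetical E5M2-style Float8 (1 sign, 5 exponent, 2
# mantissa bits) stored as a raw UInt8; normal numbers only, for brevity.
function to_f32(x::UInt8)
    s = (x >> 7) & 0x01
    e = Int((x >> 2) & 0x1f)
    m = Int(x & 0x03)
    v = (1.0f0 + m / 4.0f0) * 2.0f0^(e - 15)   # exponent bias 15
    return s == 0x01 ? -v : v
end

# Round a Float32 to the nearest of the 256 encodings (brute force is fine:
# the table is built once, offline).
function from_f32(v::Float32)
    best, bestd = 0x00, Inf32
    for x in 0x00:0xff
        d = abs(to_f32(x) - v)
        d < bestd && ((best, bestd) = (x, d))
    end
    return best
end

# The whole multiplier: 256 x 256 precomputed 8-bit results = 64 KB.
const MUL8 = [from_f32(to_f32(a) * to_f32(b)) for a in 0x00:0xff, b in 0x00:0xff]

mul8(a::UInt8, b::UInt8) = @inbounds MUL8[Int(a)+1, Int(b)+1]
```

Division would be a second, identical-sized table, which is why a 1-cycle divide becomes plausible at 8 bits.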
Float64 is only 8 times larger than Float8, so I suspect the former is just (partially?) emulated, explaining 133x vs. 8x. You can't even do Float16 practically with a lookup table.
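The size arithmetic behind both claims:

```julia
# Bytes for a full binary-op table: 2^(2n) input pairs, each storing an n-bit result.
table_bytes(n) = big(2)^(2n) * n ÷ 8
table_bytes(8)   # 65_536 bytes = 64 KB per operation (32 KB with symmetry)
table_bytes(16)  # 8_589_934_592 bytes = 8 GiB per operation: impractical
```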
Ok, if I'm wrong about lookup tables it doesn't matter too much; they could be part of the reason, or none of it, with the rest down to sparsity (133/8 = 16.625, so sparsity would need to account for roughly 16x). I just note they claim "With sparsity" only for FP8, not DP, so it seems they can't do sparsity (at least yet) at high(est) precision; otherwise I think they could claim it there too, with higher numbers. I suppose their "4096-bit matrix processor" takes sparsity into account, and works on some 16x compressed format.
Runs binaries for x86, Arm, and RISC-V in addition to native ISA
That's helpful and intriguing, if their (only) 30% speed-loss claim (using QEMU) is still valid. I'm guessing that's compared to their optimal case of "4096-bit matrix processor per core" vs. the competition. They also claimed to beat Intel on SPECint.
Their native ISA is VLIW. That need not be bad; maybe you need to recompile for each new chip (except when emulating), or maybe not, as with Itanium's EPIC (similar to VLIW). You didn't need to recompile Itanium code for new chips for it to still work, but I believe you did need to in order to get a performance increase.
My guess is their FP8 is Posits, since that's a more efficient use of the bits, and what I would use. I would at least like to hear good arguments against it (it's not like they need to keep compatibility with any other Float8; does any mainstream hardware support one?).
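For what it's worth, you can already play with 8-bit posits in Julia via the registered SoftPosit.jl package (software emulation, so this shows semantics, not speed); a minimal example, assuming its Posit8 type:

```julia
using SoftPosit                   # provides Posit8, Posit16, Posit32

a = Posit8(0.5)
b = Posit8(1.25)
Float64(a * b)                    # 0.625, an emulated posit multiply
sum(Posit8.([0.25, 0.25, 0.5]))   # generic Julia code runs on posit arrays
```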
⢠128 64-bit cores in a single socket up to 5+ GHz
⢠2 x 1024-bit vector units per core
⢠4096-bit matrix processor per core
⢠Out-of-Order, 4 instructions per clock
⢠Virtualization and Advanced RAS
[âŚ]
⢠5nm Process Technology
⢠64 mm x 84 mm FCLGA Package
See elsewhere:
Flip Chip Land Grid Array (FcLGA) packages are widely used in Mobile product applications due to their thin form factor and performance.