Tachyum processor with Julia

I know that new processors come and go. There is a classic hype cycle every time a new idea is announced (see Gartner hype cycle - Wikipedia). Many do not even make it to the plateau of productivity, because mainstream processors catch up with them on Feature X, and then you might as well get a conventional server with a mainstream processor.

The interesting thing here: "The Prodigy platform has 64 cores with fully coherent memory, barrier, lock and standard synchronization, including transactional memory. Single-threaded performance will be higher than a conventional core," the CEO said. "Each chip will have two 400 Gigabit Ethernet ports."

Would anyone care to comment if this architecture would be a good fit for Julia?

Sadly I missed the recent MIT talk on the future of Julia parallelism. Is it online somewhere?

As a quick addition, this is a really serious project. They are recruiting for kernel and compiler engineers.

The talk is posted on YouTube on the JuliaLang channel.

Which instruction set are they using? ARM, x86-64, or something completely new? It seems that they are working on a GCC back-end. For Julia, I would think that an LLVM back-end would be needed first if it is a completely new instruction set.

However, it is really hard to get a new instruction set established (see Itanium - Wikipedia, and the Itanium was x86 compatible).

But combining the power of a GPU with the programmability of a CPU would certainly be significant progress. It remains to be seen whether the bold claims of Tachyum materialize. It is strange to see these announcements without actual benchmarks (or are you aware of any?).

It wasn’t x86 compatible (unless they stuck an x86 co-processor on it later on!).
I worked on porting to it (even before it was publicly available - we got (slightly) early access to new processors from Intel and IBM so that they could have application vendors / partners ready when it was released).
That processor made me stop doing the assembly-language parts of the porting process myself; porting to the DEC Alpha chip in '93 was fun, but Itanium was hell.

2 Likes

I have no idea regarding the instruction set!

Strange that the conversation moves to discussing Itanium.
Itanium actually was very good for CFD workloads.
When I worked in Formula One I managed five SGI Altix machines which used Itanium, and they were very efficient for CFD.
They really came to an end when (a) the company which produced our package stopped being willing to port to Itanium - it was a lot easier to get commodity Xeon servers plus the tools to compile for them, etc. - and (b) we started to look at the electricity consumption versus multicore Xeon CPUs.

One very telling anecdote: the BX2 I had, with dual-core Itanium, ran solidly for four years. On the day the truck arrived to take it away to the knacker's yard, I had to stop users' running jobs in order to shut it down.

Regarding Itanium, don't forget that AMD were the ones to come out with the x86-64 architecture, and Intel had to catch up.

Well, Itanium was based on some of the same ideas as the Tachyum processor, i.e. do pretty much everything in software and skip all the fancy instruction-set interpretation, register renaming, speculative execution (which is what got them into the security mess lately), etc., that the x86 architecture has to do in silicon.
I do hope they're able to do a better job of dealing with that in the Tachyum than what happened with the 'Titanic'ium (as we liked to call it back then!).
Some of the real problems with the Itanium came from depending on compiler technology that wasn't really ready for it back in 2000 (no LLVM, for example, and you had to recompile/optimize for every specific processor - really annoying, and not easy to deal with when you just ship binaries).

Yes, because Intel was so sure that Itanium was the future of 64-bit, they ignored their cash cow.

Talking about speculative execution, and hence Meltdown and Spectre, there was a rather perceptive article in The Register recently.
The article postulated that, in order to keep up with Moore's Law scaling of performance, CPU engineers had to add features like speculative branching because programmers are 'too lazy' to learn how to parallelize code.
I stress that I am paraphrasing this a lot.

I don’t think there is sufficient information in the article to answer this, but given that they are going after “hyperscale datacenters”, they may be targeting a different audience.

In general I find new processor architectures exciting, but the hype gives me a migraine. The literature for this is unreadable.

Hopefully they are doing something great that’s simply being obscured by nauseating levels of hype.

1 Like

As a general comment on this subject, in my opinion any hardware manufacturer that doesn't ship their hardware together with excellent open-source LLVM back-end support these days is doing it wrong.

10 Likes

I got a response from Tachyum: there is an LLVM port in the works for late this year.
Perhaps they should come to JuliaCon.

Interesting video.
I wonder about the possible consequences for the language (and the language implementation). If we agree on the prominent importance of the subject (multithreading in Julia), wouldn't it be wise to wait for the conclusions of this work before Julia 1.0? Or is Julia's core team already confident that multithreading will not impact the language design?

2 Likes

This is something I’ve been extremely curious about myself.

Now that the company has announced the availability of their Prodigy processor, I wanted to ask whether there is any plan to make Julia compatible with their software stack.

Julia would need to support some Float8 type then, for maximum speed (and preferably sparse matrices too), because that is what they use to reach their headline numbers. Since Julia code is generic, this should be easier than for many other languages.
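
Julia's genericity is easy to demonstrate; in this minimal sketch `Float16` stands in for a `Float8` type, which doesn't exist in Julia today:

```julia
# Generic code hard-codes no concrete float type, so a future Float8
# (hypothetical today) would work here unchanged; Float16 stands in.
function axpy!(y::AbstractVector{T}, a::T, x::AbstractVector{T}) where {T<:AbstractFloat}
    @inbounds for i in eachindex(x, y)
        y[i] = muladd(a, x[i], y[i])
    end
    return y
end

x = rand(Float16, 4)
y = zeros(Float16, 4)
axpy!(y, Float16(2), x)   # swap in a Float8 element type once one exists
```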

https://www.tachyum.com/datasheets/Prodigy%20Family%20SKUs%20V1.06%20220815.pdf

Note, their Float8 (FP8*) figure is 12 PFLOPS = 12,000 TFLOPS, i.e. 133x their Float64 (DP) figure, not the 8x you would expect from memory bandwidth alone.

I noticed (for FP8)

*With sparsity

too late, after writing the rest here. What could it mean?

Why is that? My guess, and what I would do: operations on 8-bit floats (e.g. multiply, even division, if they support it that way) can be done with a 64 KB lookup table per operation (or 32 KB, exploiting symmetry for commutative ones) - even divide in 1 cycle, if supported.

Float64 is only 8 times larger than Float8, so I suspect the former is just (partially?) emulated, explaining 133x vs. 8x. You can't practically do Float16 with a lookup table: a full two-operand table would need 2^16 x 2^16 x 2 bytes = 8 GiB per operation.
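
To make the table idea concrete, here is a minimal Julia sketch. The 1-4-3 minifloat encoding, `decode_f8`, `encode_f8`, and `f8_mul` are all assumptions for illustration - Tachyum's actual FP8 format isn't documented here:

```julia
# Toy 1-4-3 minifloat (sign, 4 exponent bits, 3 mantissa bits); an assumed
# encoding for illustration only, not Tachyum's actual FP8 format.
function decode_f8(x::UInt8)
    s, e, m = (x >> 7) & 0x1, (x >> 3) & 0xf, x & 0x7
    bias = 7
    val = e == 0 ? (m / 8) * 2.0^(1 - bias) :        # subnormals
                   (1 + m / 8) * 2.0^(Int(e) - bias) # normals
    return s == 1 ? -val : val
end

# Encoding is the fiddly part; for a sketch, just pick the nearest code.
const CODES = [decode_f8(UInt8(i)) for i in 0:255]
encode_f8(v) = UInt8(argmin(abs.(CODES .- v)) - 1)

# 256 x 256 x 1 byte = 64 KiB, exactly the table size mentioned above
# (commutativity of multiply would halve it to 32 KiB).
const MUL = [encode_f8(decode_f8(UInt8(a)) * decode_f8(UInt8(b)))
             for a in 0:255, b in 0:255]
f8_mul(a::UInt8, b::UInt8) = @inbounds MUL[a + 1, b + 1]  # one load per multiply
```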

Ok, if I'm wrong about lookup tables, it doesn't matter too much; they could be part of the reason, or none of it, or it could all be due to sparsity: 133/8 = 16.625. I just note that they claim "With sparsity" only for FP8, not DP, so it seems they can't (yet) do sparsity at the highest precision - otherwise I think they would claim it there too, with higher numbers. I suppose their "4096-bit matrix processor" takes sparsity into account and works on some 16x-compressed format.
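
Spelling out that arithmetic (numbers from the datasheet above):

```julia
fp8_tflops = 12_000        # 12 PFLOPS FP8, "with sparsity"
dp_speedup = 133           # the FP8 : Float64 ratio noted above
fp8_tflops / dp_speedup    # ≈ 90 TFLOPS implied for Float64 (DP)
dp_speedup / 8             # = 16.625, the factor left after the 8x from operand width
```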

Runs binaries for x86, Arm, and RISC-V in addition to native ISA

That's helpful, and intriguing if their claim of (only) 30% speed loss (using QEMU) is still valid. I'm guessing that's comparing their optimal case of a "4096-bit matrix processor per core" vs. the competition. They also claimed to beat Intel on SPECint.

Their native ISA is VLIW. That need not be bad; maybe you need to recompile for each new chip (except when emulating), or not, as with Itanium's EPIC (which is similar to VLIW). You didn't need to recompile Itanium code for new chips for it to still work, but I believe you did need to in order to get a performance increase.

My guess is their FP8 is posits, as that is a more efficient use of the bits; it's what I would use. I would at least like to hear good arguments against it (it's not as if they need to keep compatibility with any other Float8 - does any mainstream hardware support one?).
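
For anyone wanting to experiment today, a sketch assuming the third-party SoftPosit.jl package and its Posit8 type (this does not imply Tachyum's FP8 actually is a posit):

```julia
# Assumes SoftPosit.jl (third-party package). An 8-bit posit spends bits
# on a variable-length "regime" instead of a fixed exponent field, which
# concentrates precision around 1.0 and extends dynamic range at the tails.
using SoftPosit

p = Posit8(0.3)       # round 0.3 to the nearest 8-bit posit
Float64(p)            # what the 8 bits actually store
Float64(p) - 0.3      # the rounding error at this precision
```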

• 128 64-bit cores in a single socket up to 5+ GHz
• 2 x 1024-bit vector units per core
• 4096-bit matrix processor per core
• Out-of-Order, 4 instructions per clock
• Virtualization and Advanced RAS
[…]
• 5nm Process Technology
• 64 mm x 84 mm FCLGA Package

See elsewhere:

Flip Chip Land Grid Array (FcLGA) packages are widely used in Mobile product applications due to their thin form factor and performance.

1 Like

@elrod [are you only working on dense?]
This (likely) sparsity factor of 133/8 = 16.625 got me thinking: how would you design hardware (or software - maybe we can do better for Julia) for sparse matrices? What I had written here was off-topic, so I moved it elsewhere: