Julia on Copilot+ PCs?

Microsoft has defined Copilot+ PCs as PCs with NPUs that offer a minimum of 40 TOPS (Apple has similar plans, I think). I’m not the kind of guy who tends to socialize with a machine, so I’m interested in whether this can be utilized in scientific computing…

  1. What are the possibilities/usefulness of using such NPUs for machine learning/neural networks within Julia? [Microsoft has stated that they have a native implementation of PyTorch on ARM PCs, I think.]
  2. Currently, the only Copilot+ PCs are based on the Qualcomm Snapdragon X processors, although AMD will release Copilot+ PCs this month (?, Strix platform), and Intel perhaps in September (?, Lunar Lake).
  • What are the chances that the Julia ecosystem will run in emulated mode on the Qualcomm processor? Any experience?
  • Any plans to offer Julia on the Qualcomm processor?

Julia does not yet support Windows on ARM, but since we support both Windows and ARM separately, it shouldn’t be too hard. It mostly depends on someone with the hardware getting it building and getting all the tests passing.
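For reference, the porting loop described above would look roughly like this on a Windows-on-ARM machine (a sketch only; it assumes an MSYS2-style toolchain like the one used for the existing x86 Windows builds, and uses the standard targets from Julia's own build docs):

```shell
# Sketch of the port workflow: build Julia from source on the new
# platform, then run the test suite to surface platform-specific failures.
git clone https://github.com/JuliaLang/julia.git
cd julia
make -j8          # build Julia and its bundled dependencies from source
make testall      # run the full test suite
```

Most of the actual work is in fixing whatever the build and `make testall` turn up, not in the commands themselves.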

I haven’t had much luck finding technical docs for the chips, but the TOPS figures are apparently measured at INT8 precision. Most of the discussion I have seen around the NPUs in development by Qualcomm and Intel (Lunar Lake) indicates that they will not be very useful for the higher-precision applications typical of scientific computing.

In terms of usefulness, the NPU Intel is shipping with Lunar Lake has lower peak TOPS than the integrated GPU they are shipping on the same package. I think most users writing Julia ML code would want to use the GPU. The advantage of the NPU is that it can run many low-power applications simultaneously, enabling applications like coding copilots and email assistants and so on. For running your own ML models that you want to finish ASAP, the NPU isn’t the optimal hardware.

For me, adoption of matrix extensions (Intel has XMX, Apple had AMX and is now using ARM’s SME for the M4) is more interesting. Apple’s AMX supports higher-precision matrix operations and can improve the performance of double-precision BLAS calls by ~2x via AppleAccelerate.jl. The AMX that Intel rolled out on some Xeon platforms a while ago only supported low-precision operations, and so did not offer similar usefulness. It looks like their data-center GPUs with the new XMX design support higher precisions, but all their marketing material on Lunar Lake talks about lower precisions, so we’ll have to wait and see, I guess.
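As a concrete illustration of the AppleAccelerate.jl point above, here is a minimal timing sketch. It assumes an Apple silicon Mac with AppleAccelerate.jl installed; loading the package swaps Julia's BLAS/LAPACK backend from the default OpenBLAS to Apple's Accelerate framework via libblastrampoline, and the second timing then exercises the Accelerate GEMM:

```julia
using LinearAlgebra

n = 1000
A = rand(n, n); B = rand(n, n)

A * B  # warm up the default backend (OpenBLAS)
t_openblas = @elapsed A * B

if Sys.isapple()
    # Loading AppleAccelerate replaces the BLAS/LAPACK backend with
    # Apple's Accelerate framework via libblastrampoline.
    using AppleAccelerate
    A * B  # warm up the new backend
    t_accelerate = @elapsed A * B
    println("OpenBLAS:   $(round(t_openblas; digits = 3)) s")
    println("Accelerate: $(round(t_accelerate; digits = 3)) s")
end
```

The exact speedup depends on the chip and matrix size; the ~2x figure quoted above is for Float64 inputs on M-series machines.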


Ah. I used Copilot to check the precision, and it says that both the Qualcomm X and Lunar Lake NPUs are INT8 precision. It doesn’t say anything about AMD’s chip.

So the “only thing” going for the Qualcomm is perhaps good battery life… its GPU appears to have poorer performance than those of both Lunar Lake and Strix. Not that battery life isn’t important – it is.