I would try adapting the stackoverflow code, which is a lot better explained and seems like it should have similar or better performance.
I copied and pasted @jd-forester’s code verbatim to perform benchmarks and tests and devised a different algorithm from scratch; I did not adapt any existing code.
Changing the 64 to 128 as @stevengj suggested and the magic numbers as @Dan suggested yields the counterexample (0x350caac24c5180855a2891ec9a655250, 0x5d043f9d0c547441717da21f58a17d54, 0xac4dd9d313c605d3e9c794646437b1b9)
, which I expect will stand as the wiki page notes that m must have fewer than 64 bits and I imagine that carries forward to m must have fewer than 128 bits once extended.
In addition to still not being correct over the whole domain, I benchmark the revised version as comparable in speed to the correct (I think) version I posted.
mul_mod wide_mul_mod steve_dan_wiki_mul_mod (still broken)
UInt16 43.684 ns 8.239 ns 31.100 ns
UInt32 40.280 ns 14.272 ns 30.405 ns
UInt64 60.862 ns 268.941 ns 274.094 ns
UInt128 568.878 ns 1.000 μs 560.923 ns
My version does not include a loop over every binary digit and as far as I can tell, all the linked implementations do (except the extended precision floating point trick).