Modular multiplication without overflow

I would try adapting the stackoverflow code, which is a lot better explained and seems like it should have similar or better performance.

I copied and pasted @jd-forester’s code verbatim to perform benchmarks and tests and devised a different algorithm from scratch; I did not adapt any existing code.

Changing the 64 to 128 as @stevengj suggested and the magic numbers as @Dan suggested yields the counterexample (0x350caac24c5180855a2891ec9a655250, 0x5d043f9d0c547441717da21f58a17d54, 0xac4dd9d313c605d3e9c794646437b1b9), which I expect will stand as the wiki page notes that m must have fewer than 64 bits and I imagine that carries forward to m must have fewer than 128 bits once extended.

In addition to still not being correct over the whole domain, I benchmark the revised version as comparable in speed to the correct (I think) version I posted.

          mul_mod   wide_mul_mod  steve_dan_wiki_mul_mod (still broken)
UInt16    43.684 ns    8.239 ns   31.100 ns
UInt32    40.280 ns   14.272 ns   30.405 ns
UInt64    60.862 ns  268.941 ns  274.094 ns
UInt128  568.878 ns    1.000 μs  560.923 ns

My version does not include a loop over every binary digit and as far as I can tell, all the linked implementations do (except the extended precision floating point trick).