Problem with: Issue: MurmurHash3 has undefined behavior

Hmmm, it compiled perfectly well for me, many times, using the same set of options that I believe Julia uses for MurmurHash3. Were you not using the options Julia uses on godbolt?

I changed to use the “modern” style with sizeof - but that didn’t change anything, I still get differences between the two, using casting vs. memcpy, with both -O2 and -O3.

The bigger differences I saw do seem to have been caused by the optimization level.
I’d originally had -O2 on, and then tried to copy the options that are in the Julia Makefile for src/support, and lost the -O2 accidentally. Sorry for that noise.

I hadn’t heard of godbolt before, but it seems like that’s a nice resource.

See here for what the problem is: Pointer arithmetic for void pointer in C - Stack Overflow

Ah, so, with stricter compiler options, like the “pedantic” one.
True, strictly speaking, p should have been cast to uint8_t * before use, I was just trying to see the assembly results, and without the -pedantic-errors option it compiled without warning.

It depends on compiler version. Recent clang errors with default options, and recent GCC gives a warning with default options

1 Like

What do you think of the bigger issue, should MurmurHash3 even be used at all for string hashing?

We have crc32c available, it is much faster, it has the property that you can perform the CRC on a string in chunks and produce the same result (which doesn’t work with MurmurHash, I’m not sure if it is just because it mixes in the length of the string to the result, or if there are other reasons).
For associative arrays, a 32-bit CRC seems to be more than enough to give a good spread.

I think there’ll be a time to think about it, but right now is not that time. It’s a non-breaking change and we have many bigger fish to fry before we get to it.

OK - it’s a one-line change though, that has a big impact on string hashing, which in turn affects Dict{String,...} performance, which in turn affects all sorts of things like JSON performance, etc.

Seems like low-hanging fruit, actually simpler to do than having to fix MurmurHash3.c to deal correctly with unaligned input strings for ARM-32 platforms.

The problem is not the complexity of the change, it’s assessing whether it’s a good idea. We’ll need to do some work to assess possible alternatives, benchmark them, think of the DOS implications, etc, all of which we can do, but not now.

I think it’s best to look at the evidence, as I’d like to take the time to do, rather than assuming that crc32 will do a good job.

7 Likes