Problem with: Issue: MurmurHash3 has undefined behavior

ScottPJones · March 19, 2018, 12:12pm

MurmurHash3 has undefined behavior

opened 12:11AM - 19 Mar 18 UTC

closed 10:01PM - 18 Dec 18 UTC

upstream hashing

Our implementation of MurmurHash3 has: https://github.com/JuliaLang/julia/blo…b/ed3798495d735a9608ef42ac266649a2a3fe085e/src/support/MurmurHash3.c#L60-L63 However, the pointer is not necessarily properly aligned, making this undefined behavior. The proper fix is to stop casting to an invalid pointer type and to use `memcpy`, which all modern compiler optimize to a proper sequence for unaligned access. This is the same issue as https://github.com/grpc/grpc/issues/13221

@keno: This doesn’t make sense at all to me. Why would you want to use a function to copy bytes, when trying to deal with unaligned access (on only 1 platform! i.e. 32-bit ARM) in a function that normally just loads 64-bits into a register?

There are two ways of dealing with this correctly (which should only be done for any platform(s) requiring aligned access), the most efficient would be to use a compiler flag to indicate the access is unaligned (on MS compilers, that was __unaligned, for the ARM compiler, apparently it is __packed:

When compiling C, variables are by default architecturally aligned. A global of type int (or uint32_t) will be 4-byte aligned in memory. Similarly, a pointer of type int* is expected to contain a 4-byte aligned address.

Where this is not the case (or may not be the case) the variable or pointer MUST be marked with the __packed keyword. This is a warning to the compiler that this variable, structure or pointer is potentially unaligned.

The other would be to simply define (only for platforms with this problems, i.e. ARM32!) getblock32 and getblock64 to pick up the bytes individually if necessary (but the code could be made to better handle the cases where the data is sufficiently aligned, even on platforms requiring alignment).

Related but somewhat off-topic, it would be really nice if Julia had a Ptr type that indicated to LLVM that the access was potentially unaligned, to handle these cases without compromising performance, that is something that I had to fake in Strs.jl.

Keno · March 19, 2018, 12:57pm

memcpy is the canonical way to bypass alignment and aliasing assumptions in C compilers. The compiler will turn it into whatever the most efficient sequence for unaligned memory access on the given architecture is.

ScottPJones · March 19, 2018, 1:04pm

In 38 years of programming in C I’ve never once heard that!
Definitely not “canonical” by any means.
So, you think you have to allocate some buffer and make a copy, just to deal with potentially unaligned data?

Keno · March 19, 2018, 1:30pm

Scott, I’m sorry, but that’s not how this works. If you want information from somebody that’s not the tone you choose. I’m generally happy to provide people with detailed technical explanations of gory compiler details, I find it kinda fun. I also often answer technical questions despite what I think is an unfortunate choice of manner of phrasing it. However, that gets exhausting after a while, so I stop doing so if the person asking is is not a newcomer (because they should now better). I’m still happy to provide explanations, but not this way. As a suggestion, here’s how you might have asked the question:

Huh, that’s funny? I’ve never heard that before. Do you have any references where I could learn more about this.

ScottPJones · March 19, 2018, 1:37pm

Sorry about the tone. I was just very surprised by your issue on GitHub, and your terse comment here that memcpy was the canonical way for dealing with alignment issues in C.

Please explain that statement more. You didn’t bother to respond to what I’d said about what I’d consider to be the “canonical” ways of dealing loading unaligned data into an integer in C.

I also was not trying to get information from you - I was rather trying to provide you with information that it seemed that you lacked (which greatly surprised me, because I wouldn’t consider you a newcomer either).

Keno · March 19, 2018, 1:49pm

Well, you see, this is precisely the problem. You assumed that you were correct and I was wrong and it came plainly across in your choice of tone. And while you may very well have been right about that, starting from that assumption is a bad thing, because it antagonizes people. Instead, you should start from the assumption that you’re wrong, ask for more information on the pieces that you don’t understand and if it turns out you’re right, that will generally emerge over the course of the conversation. I will reply to the technical question is a separate post, so these couple posts on decorum can be split and moved to OffTopic.

ScottPJones · March 19, 2018, 1:50pm

And that was the problem with your first response - which totally ignored the technical response that I had made, and repeated your idea of using memcpy.

ScottPJones · March 19, 2018, 1:51pm

There wasn’t anything that I didn’t understand.

Keno · March 19, 2018, 2:04pm

Scott, I had half written up a detailed note (with references!) on undefined behavior of unaligned pointer access in C and how to write standards compliant C code. In fact, I was gonna post it soon, as I promised in my previous post, because I thought we had come to a collegial understanding here. Unfortunately, your last two posts made me reconsider finishing it, which is too bad because I thought it was interesting. Because as I said in my original post, “that’s not how this works”. I generally don’t mind spending my morning composing an interesting post to explain things in detail, but you’re also not entitled to it - I’m quite busy these days. Maybe somebody else will feel inclined to engage with you on the technical point, but that this point, I don’t. I will stop engaging on this thread now.

FWIW, I could probably have expanded a little more on my original response, but I thought a quick an correct response that I thought would address the concern (it’s the canonical way to do it - any modern compiler will optimize it away, so it’s not slow) was better than none. I read discourse in the morning before work, where my time is limited. Sometimes I choose to spend some of my leisure hours on writing up detailed notes for the community, as I was going to in this topic, but I can’t do that for everything. I’m sorry if you thought the short reply was brash. I will keep that in mind and cease replying to you in the future unless I have the time to write up a detailed note.

jameson · March 19, 2018, 2:06pm

Then why did you need to ask a question? I agree with Keno that this is an interesting question, but really not a good way to approach asking it.

As Keno mentioned above, memcpy is a standards-compliant way to handle this. It’s up to each specific compiler to optimize it correctly (or not), or provide some compiler-specific extension (such as the ones you mentioned above). Those also may or may not be optimized.

LLVM really likes the memcpy operation, since it is very expressive. We lower anything that isn’t a small primitive load or store into memcpy, since the memcpy operation gives LLVM much better optimization potential than giving it just a load/store. In fact, its really happy to get memcpy. I have no idea if other compilers are quite so accommodating.

ScottPJones · March 19, 2018, 2:11pm

I didn’t. I tried to point out what I saw as a problem with the issue that Keno had raised.

But that is precisely what the getblock64 function is, it’s a primitive load of 64-bits into a register.

There are a number of other issues about the use of MurmurHash3 in Julia - using the 128-bit version on a 32-bit platform (and throwing away the top 64-bits always) seems to be rather overkill.

Something like crc32c would do better for things like string hashing for use in Dict, for example.

jameson · March 19, 2018, 2:25pm

For a time, we would lower almost everything to a memcpy, but it caused a bit of extra compilation time overhead (LLVM felt we were giving it too much information ), so we reverted back to using the more basic load/store instructions for simple enough cases.

it’s a primitive load of

No, it’s quite specifically not that – hence the bug report with using the wrong C idiom here and incurring UB, since we’re attempting to do a primitive load of something that isn’t.

ScottPJones · March 19, 2018, 2:25pm

If this is the canonical way in C to load an value from an unaligned pointer into an integer, can you please show how you’d rewrite getblock32 and getblock64 using memcpy, which would be more performant than using something like the __packed keyword that I suggested?

jameson · March 19, 2018, 2:44pm

Yichao already did https://github.com/JuliaLang/julia/issues/26512#issuecomment-374193242

ScottPJones · March 19, 2018, 4:04pm

I tested out the sort of change that Yichao did, on MurmurHash3.c, and the results were simply not the same.
Hashing is pretty performance critical, so I think it’s important to make sure this is made as fast as possible, and base decisions about how to best code something on evidence, not simply opinions.

Using memcpy in this way was a lot better than what I thought, compilers have definitely improved a lot in the last decade, to be sure, but still, it would be a performance degradation on all but the one platform with this problem
(I haven’t had time to fire up a Raspberry Pi to see what the resulting code would be there, comparing the memcpy approach to the __packed approach, but I think we should avoid slowing down all cases just to handle an issue on one.

Using casting:

	movslq	-20(%rbp), %rdx
	movl	(%rax,%rdx,4), %ecx

Using memcpy:

	movl	-20(%rbp), %ecx
	shll	$2, %ecx
	movslq	%ecx, %rdx
	movl	(%rax,%rdx), %ecx
	movl	%ecx, -24(%rbp)
	movl	-24(%rbp), %ecx

Keno · March 19, 2018, 4:45pm

When comparing assembly code generated by C compilers, please include compiler version and full options used (or even better a link to https://godbolt.org/ with the relevant snippet).

The presence of

seems to suggest either missing optimization options or a serious compiler bug that should be reported on the relevant bug tracker.

ScottPJones · March 19, 2018, 5:01pm

11:49 $ clang MurmurHash3.c -S -Wold-style-definition -Wstrict-prototypes -Wc++-compat
✔ /j
12:50 $ clang --version
Apple LLVM version 9.0.0 (clang-900.0.39.2)
Target: x86_64-apple-darwin17.5.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

The changed code was:

FORCE_INLINE uint32_t getblock32 ( const void * p, int i )
{
    uint32_t val;
    memcpy(&val, p + (i<<2), 4);
    return val;
}
FORCE_INLINE uint64_t getblock64 ( const void * p, int i )
{
    uint64_t val;
    memcpy(&val, p + (i<<3), 8);
    return val;
}

Maybe they’ll fix the C compiler missing optimization or bug, but in the meantime, I think it’s best to look at the evidence, as I was doing, and use a more performant solution to the issue with MurmurHash3, instead of assuming that the memcpy trick would do as good a job.

Just this weekend, I’d been looking in-depth at the performance issues with hashing of strings in Julia, which is why I already knew about some of the problems with the MurmurHash3 implementation, as well as issues with Julia’s use of it.

I really was just trying to be helpful, and I’m sorry that between your terse message and my “wonderful” communications skills we got off on a bad foot here.

yuyichao · March 19, 2018, 5:04pm

Fwiw, the difference in asm has nothing to do with loading. It’s a less optimum indexing plus a useless register spill.

ScottPJones · March 19, 2018, 5:06pm

Yes, but it doesn’t really matter how the difference in assembly occurs, what is important is currently, using memcpy in this fashion does not produce the most performant code.

Keno · March 19, 2018, 5:11pm

Scott, the code you posted doesn’t compile. In any case, I pasted the relevant snippet into godbolt myself and the generated assembly code is identical between the two ways to write it for modern compilers (e.g. clang trunk: Compiler Explorer). If you’d like to argue differently, I’ll take an appropriate godbolt link.

Topic		Replies	Views
Use of MurmurHash3 for hashing strings Internals & Design	29	8626	March 30, 2018
Use of `pointer` General Usage	19	6947	February 6, 2018
Unsafe_store! sometimes slower than arrays' setindex!? Performance	15	1443	November 30, 2017
Use "memcpy" instead of "memmove" to copy Array when there's no overlap? Internals & Design	7	2129	January 9, 2022
SIMD struggles, seeking solutions (with KangarooTwelve.jl) Performance	23	891	November 7, 2023

Problem with: Issue: MurmurHash3 has undefined behavior

Related topics