Problem with the issue “MurmurHash3 has undefined behavior”

@keno: This doesn’t make sense at all to me. Why would you want to use a function to copy bytes, when trying to deal with unaligned access (on only one platform, i.e. 32-bit ARM!) in a function that normally just loads 64 bits into a register?

There are two ways of dealing with this correctly (which should only be done for the platform(s) requiring aligned access). The most efficient would be to use a compiler keyword to indicate that the access is unaligned (on MS compilers, that was __unaligned; for the ARM compiler, apparently it is __packed):

When compiling C, variables are by default architecturally aligned. A global of type int (or uint32_t) will be 4-byte aligned in memory. Similarly, a pointer of type int* is expected to contain a 4-byte aligned address.

Where this is not the case (or may not be the case) the variable or pointer MUST be marked with the __packed keyword. This is a warning to the compiler that this variable, structure or pointer is potentially unaligned.

The other would be to simply define (only for platforms with this problem, i.e. ARM32!) getblock32 and getblock64 to pick up the bytes individually if necessary (but the code could be made to better handle the cases where the data is sufficiently aligned, even on platforms requiring alignment).

Related but somewhat off-topic: it would be really nice if Julia had a Ptr type that indicated to LLVM that the access was potentially unaligned, to handle these cases without compromising performance; that is something I had to fake in Strs.jl.

memcpy is the canonical way to bypass alignment and aliasing assumptions in C compilers. The compiler will turn it into whatever the most efficient sequence for unaligned memory access on the given architecture is.

In 38 years of programming in C I’ve never once heard that!
Definitely not “canonical” by any means.
So, you think you have to allocate some buffer and make a copy, just to deal with potentially unaligned data?

Scott, I’m sorry, but that’s not how this works. If you want information from somebody, that’s not the tone to choose. I’m generally happy to provide people with detailed technical explanations of gory compiler details; I find it kinda fun. I also often answer technical questions despite what I think is an unfortunate choice of manner of phrasing. However, that gets exhausting after a while, so I stop doing so if the person asking is not a newcomer (because they should know better). I’m still happy to provide explanations, but not this way. As a suggestion, here’s how you might have asked the question:

Huh, that’s funny? I’ve never heard that before. Do you have any references where I could learn more about this?


Sorry about the tone. I was just very surprised by your issue on GitHub, and your terse comment here that memcpy was the canonical way for dealing with alignment issues in C.

Please explain that statement more. You didn’t bother to respond to what I’d said about what I’d consider to be the “canonical” ways of loading unaligned data into an integer in C.

I also was not trying to get information from you - I was rather trying to provide you with information that it seemed that you lacked (which greatly surprised me, because I wouldn’t consider you a newcomer either).

Well, you see, this is precisely the problem. You assumed that you were correct and I was wrong, and it came across plainly in your choice of tone. And while you may very well have been right about that, starting from that assumption is a bad thing, because it antagonizes people. Instead, you should start from the assumption that you’re wrong, ask for more information on the pieces that you don’t understand, and if it turns out you’re right, that will generally emerge over the course of the conversation. I will reply to the technical question in a separate post, so these couple of posts on decorum can be split and moved to OffTopic.


And that was the problem with your first response - which totally ignored the technical response that I had made, and repeated your idea of using memcpy.

There wasn’t anything that I didn’t understand.

Scott, I had half written up a detailed note (with references!) on the undefined behavior of unaligned pointer access in C and how to write standards-compliant C code. In fact, I was gonna post it soon, as I promised in my previous post, because I thought we had come to a collegial understanding here. Unfortunately, your last two posts made me reconsider finishing it, which is too bad because I thought it was interesting. Because as I said in my original post, “that’s not how this works”. I generally don’t mind spending my morning composing an interesting post to explain things in detail, but you’re also not entitled to it - I’m quite busy these days. Maybe somebody else will feel inclined to engage with you on the technical point, but at this point, I don’t. I will stop engaging on this thread now.

FWIW, I could probably have expanded a little more on my original response, but I thought a quick and correct response that I thought would address the concern (it’s the canonical way to do it - any modern compiler will optimize it away, so it’s not slow) was better than none. I read discourse in the morning before work, when my time is limited. Sometimes I choose to spend some of my leisure hours on writing up detailed notes for the community, as I was going to in this topic, but I can’t do that for everything. I’m sorry if you thought the short reply was brash. I will keep that in mind and cease replying to you in the future unless I have the time to write up a detailed note.

Then why did you need to ask a question? I agree with Keno that this is an interesting question, but really not a good way to approach asking it.

As Keno mentioned above, memcpy is a standards-compliant way to handle this. It’s up to each specific compiler to optimize it correctly (or not), or provide some compiler-specific extension (such as the ones you mentioned above). Those also may or may not be optimized.

LLVM really likes the memcpy operation, since it is very expressive. We lower anything that isn’t a small primitive load or store into memcpy, since the memcpy operation gives LLVM much better optimization potential than giving it just a load/store. In fact, it’s really happy to get memcpy. I have no idea if other compilers are quite so accommodating.


I didn’t. I tried to point out what I saw as a problem with the issue that Keno had raised.

But that is precisely what the getblock64 function is, it’s a primitive load of 64-bits into a register.

There are a number of other issues with the use of MurmurHash3 in Julia - using the 128-bit version on a 32-bit platform (and always throwing away the top 64 bits) seems like overkill.

Something like crc32c would do better for things like string hashing for use in Dict, for example.

For a time, we would lower almost everything to a memcpy, but it caused a bit of extra compilation time overhead (LLVM felt we were giving it too much information :slightly_smiling_face:), so we reverted back to using the more basic load/store instructions for simple enough cases.

it’s a primitive load of

No, it’s quite specifically not that – hence the bug report: using the wrong C idiom here incurs UB, since we’re attempting to do a primitive load of something that isn’t one.

If this is the canonical way in C to load a value from an unaligned pointer into an integer, can you please show how you’d rewrite getblock32 and getblock64 using memcpy, such that it would be more performant than using something like the __packed keyword that I suggested?

Yichao already did

I tested out the sort of change that Yichao did, on MurmurHash3.c, and the results were simply not the same.
Hashing is pretty performance critical, so I think it’s important to make sure this is made as fast as possible, and base decisions about how to best code something on evidence, not simply opinions.

Using memcpy in this way was a lot better than I thought it would be; compilers have definitely improved a lot in the last decade, to be sure. But still, it would be a performance degradation on all but the one platform with this problem.
(I haven’t had time to fire up a Raspberry Pi to see what the resulting code would be there, comparing the memcpy approach to the __packed approach, but I think we should avoid slowing down all cases just to handle an issue on one.)

Using casting:

	movslq	-20(%rbp), %rdx
	movl	(%rax,%rdx,4), %ecx

Using memcpy:

	movl	-20(%rbp), %ecx
	shll	$2, %ecx
	movslq	%ecx, %rdx
	movl	(%rax,%rdx), %ecx
	movl	%ecx, -24(%rbp)
	movl	-24(%rbp), %ecx

When comparing assembly code generated by C compilers, please include the compiler version and the full options used (or even better, a link to godbolt with the relevant snippet).

The presence of

seems to suggest either missing optimization options or a serious compiler bug that should be reported on the relevant bug tracker.

11:49 $ clang MurmurHash3.c -S -Wold-style-definition -Wstrict-prototypes -Wc++-compat
✔ /j
12:50 $ clang --version
Apple LLVM version 9.0.0 (clang-900.0.39.2)
Target: x86_64-apple-darwin17.5.0
Thread model: posix
InstalledDir: /Applications/

The changed code was:

FORCE_INLINE uint32_t getblock32 ( const void * p, int i )
{
    uint32_t val;
    memcpy(&val, (const uint8_t *)p + (i<<2), 4);
    return val;
}

FORCE_INLINE uint64_t getblock64 ( const void * p, int i )
{
    uint64_t val;
    memcpy(&val, (const uint8_t *)p + (i<<3), 8);
    return val;
}

Maybe they’ll fix the missed optimization (or bug) in the C compiler, but in the meantime, I think it’s best to look at the evidence, as I was doing, and use a more performant solution to the issue with MurmurHash3, instead of assuming that the memcpy trick would do as good a job.

Just this weekend, I’d been looking in-depth at the performance issues with hashing of strings in Julia, which is why I already knew about some of the problems with the MurmurHash3 implementation, as well as issues with Julia’s use of it.

I really was just trying to be helpful, and I’m sorry that between your terse message and my “wonderful” :stuck_out_tongue_winking_eye:communications skills we got off on a bad foot here.

Fwiw, the difference in asm has nothing to do with loading. It’s a less optimal indexing plus a useless register spill.

Yes, but it doesn’t really matter how the difference in assembly occurs; what is important is that, currently, using memcpy in this fashion does not produce the most performant code.

Scott, the code you posted doesn’t compile. In any case, I pasted the relevant snippet into godbolt myself and the generated assembly code is identical between the two ways to write it for modern compilers (e.g. clang trunk: Compiler Explorer). If you’d like to argue differently, I’ll take an appropriate godbolt link.