Just for fun/learning, I created an implementation of the scrypt key derivation algorithm in pure Julia.
I had previously made a highly optimized C++/C# variant which used compiler intrinsics to work directly with CPU vector instructions (SSE/AVX/AVX2/NEON) and others (nontemporal prefetch, etc.), and I was curious how a naïve, pure Julia implementation would compare.
This is the result. It’s ~8x slower on reasonable parameters. Also didn’t implement parallel processing for the p parameter (not factored into that multiplier).
I suspect much of the slowdown is unnecessary allocations and data copying; I haven’t fully optimized all of the internals. If someone has tips on how I can go deeper with Julia and get down to the original C++ performance, I’d love to know.
Not sure it should be put into the registry in its current state, so I’ll hold off. Also since it’s in the crypto space, wouldn’t want to put a stamp of legitimacy on something that might be missing important security considerations.
UPDATE: SIMD.jl and some other measures led to create a package. Started at ~525x slower and is now only ~2x slower than my original library.
If you want to write high-performance code in julia you have to read https://docs.julialang.org/en/v1/manual/performance-tips/. I see you’re already failing the first tip by having non-const globals
Another thing that caught my eyes was the use of slices (e.g.
a[:, 1]). These allocate a new array every time, so you probably want to use views, which can be as easy as just putting
@view in front of these expressions. In 1.5, they should now be really fast. Your arrays also look fairly small, so you might want to look into StaticArrays.jl and replacing the use of regular arrays with
Most of the time I used views for that reason, don’t know why I didn’t in that most critical section. Changing the slices in salsa20!() alone got me down from ~8x slower to ~6x slower, and just over halved the number of allocations.
However, changing to views in salsatranspose!() actually made things worse. I suspect it’s because the slices are so small it’s easy to optimize that into operations in the CPU register?
Making the globals I had const didn’t change much, but by making the constant selectors, which had been inline for the salsatranspose!() function (e.g.
block[[2, 3, 4, 1], 1]) into const globals (e.g.
const selector = [2, 3, 4, 1] and
block[selector, 1]), I cut allocations down by another 25% and not quite half a second off processing time. Which is curious because that seems like an easy optimization for the compiler to handle. Is that something that should be suggested? And to whom would I raise that?
So now I’m at about ~5.5x slower than the C++/C# implementation. One thing I really like about Julia is it’s fairly easy to optimize once you have things working, and satisfying to see the allocation counts come down with @time.
Did you try the views with 1.5 or a previous Julia version? If not, I would try this again, since that might have been fixed.
The problem is that non-constant globals can be redefined at any time, so the compiler can’t make any assumptions about them.
Hmm, okay. I would still suggest looking into StaticArrays, since that looks like a case where this could make things quite a bit faster.
Remember to use
@btime from BenchmarkTools.jl to make sure you’re actually measuring the function.
@time includes allocations and time from compilation.
I’m not sure having the SalsaBlock as a primitive type is a good idea - the compiler is fairly smart about simple container structs which pass operations through to the underlying array. You might want to take a look at this talk for more information, if you decide to go forward with the wrapper type. It’s a little bit older but everything in it should still apply.
As far as I’m aware,
vcat returns a new array, which you use often. You might achieve better performance by eliding the intermediate via broadcasting over the generator:
workingbuffer[2:end] .= (prepareblock(view(element, i:i)) for i ∈ 1:length(element) - 1)
It may not be as easy as you think -
block[x] lowers to
getindex(block, x) which can be more complicated than a simple memory offset in something like C (for example, due to bounds checks). If profiling shows that failed constant propagation is indeed a missed optimization (after the obvious ones as mentioned in the performance tips are taken care of), I’m sure opening an issue (github.com/julialang/julia) would be appreciated. I doubt that this is the case though, since julia is sometimes too aggressive with constant propagation, to the point that sometimes we have to jump through hoops to make sure it isn’t done when running certain microbenchmarks.
It’s fairly late where I’m at so I haven’t taken a detailed look, but I’m almost sure an even more naive approach (without primitive type and reinterpreting memory, just UInt8 arrays instead) might be more performant.
Thanks for the tip on
BenchmarkTools.@btime. I knew to ignore the first compilation round for
@time, but I see the output is a bit more granular which is nice.
Also thanks for the pointers on array types. I’m sure it could be more performant by working with UInt8 directly, but one of my goals is to have code that is more self-explanatory, so I want to abstract away the byte-math.
Actually, by copying the data into an MMatrix from StaticArrays, as @simeonschaub suggested, for the critical loop and then copying back to the source array, I’m already within spitting distance of native performance
Just a nitpick, but julia is compiled to native assembly You can take a look at it with
@code_native (and the LLVM-IR with
@code_llvm), just don’t judge the performance of the original code solely by the output of those two. Also make sure you got the caveats from using those, such as the function you’re using it on always being specialised on it’s input types, even if a regular julia call might not result in specialization.
I’d also advise you to use the
@views macro to make the code easier to read - that way, you can keep the nice indexing syntax but still get the performance benefit of views, without having to manually construct them in the code.
I know! My comparison is the C++ code I wrote with compiler intrinsics, so this exercise has been about how to tell Julia the right things to get it there.
+1 to @views macro
If you’re interested in how to write similar code with explicit SIMD calls, check out SIMD.jl and maybe even LoopVectorization.jl or Tullio.jl. While I personally haven’t used Tullio.jl, I’ve heard very good things about it and the other two are well worth checking out if you care about performance. I’m not sure it would speed things up here though.
If you do end up with a faster version, please do post benchmarks of the different versions, preferrably with code! It’s always nice to point to a project where specific simple steps result in great performance increases without changing much of the code
End result: refactored and switched from StaticArrays to SIMD for the inner loops and now the Julia code is 2-3x faster than my C#/C++ implementation .
It started at ~16x slower. I documented the improvements I made in the Readme.
I guess I’ll register it now, too.
EDIT: Oops: I was comparing it to the Debug build of my original code . It’s actually still 10 times slower… but that means it used to be ~525x slower! So still some optimization I can try to eek out.