[ANN] ChaChaCiphers.jl: fast cryptographic RNG and stream ciphers

ChaChaCiphers.jl

I’m happy to announce ChaChaCiphers.jl, a pure-Julia implementation of the ChaCha stream cipher family. This package aims to provide

  • fast, cryptographically-secure, and reproducible random number generators for CPU and GPU, and
  • building blocks that can be used for cryptographic primitives such as ChaCha20-Poly1305.

Usage

To use ChaChaCiphers.jl for cryptographically secure random number generation (CRNG), you can create a new ChaChaStream instance and feed it in as the rng parameter to functions like rand or randn:

julia> using ChaChaCiphers

julia> rng = ChaCha20Stream();

julia> rand(rng, Float32, 3)
3-element Vector{Float32}:
 0.2602576
 0.7704005
 0.96076345

julia> rand(rng, 1:10)
9

julia> randn(rng)
-0.8894196311020175

The GPU interface is still a WIP, but you can use Random.rand! on a CuArray with a CUDAChaChaStream instance to fill it with random values:

julia> using CUDA, Random

julia> x = CuVector{Float32}(undef, 1024);

julia> rng = CUDAChaCha20Stream();

julia> Random.rand!(rng, x);

Performance

Here are some simple benchmarks to give you an idea of the package’s current performance. Because the security requirements for a CRNG are much higher than those of other PRNGs like Xoshiro, it is by necessity much slower than using the default RNG, although

  • it’s still much faster than reading from RandomDevice(), and
  • you can speed things up by using fewer ChaCha rounds (e.g., by using rng = ChaCha12Stream()):

CPU performance

shell> sh -c "lscpu | grep 'Model name'"
Model name:            Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz

julia> using BenchmarkTools, CUDA, Random, ChaChaCiphers

julia> rng = ChaCha20Stream();

julia> urandom = RandomDevice();

julia> x = Vector{UInt8}(undef, 2^20);

julia> @btime Random.rand!(rng, x);   # ChaChaCiphers.jl
  609.709 μs (4 allocations: 192 bytes)

julia> @btime Random.rand!(x);   # Default Xoshiro++ RNG
  30.512 μs (0 allocations: 0 bytes)

julia> @btime Random.rand!(urandom, x);   # OS CRNG
  5.848 ms (0 allocations: 0 bytes)

GPU performance

shell> nvidia-smi --query-gpu=gpu_name --format=csv -i 0
name
Quadro RTX 4000

julia> rng = CUDAChaCha20Stream();

julia> x = CuVector{UInt8}(undef, 2^30);

# Warm-up omitted

julia> CUDA.@elapsed Random.rand!(rng, x)
0.40175077f0

julia> CUDA.@elapsed Random.rand!(x)
0.024985567f0

What’s next

There’s still plenty of work to be done to improve performance, and eventually I’d like to expand the package to include other stream ciphers in the ChaCha family (e.g. XChaCha, and an IETF RFC 8439-compliant implementation of ChaCha20, as well as some more practical system improvements to improve security. On the whole though, I’d say it’s at least in its minimal usable state for CRNG-dependent applications.

14 Likes

Very nice performance! Would be nice to have CUDA and SIMD as optional dependencies though, if you’re e.g. on a system that uses a non-NVidia GPU or on a system that doesn’t have SIMD (as far as I could tell, there’s no graceful fallback, right?). I think Requires.jl is for now the canonical way to handle this.

One question I do have: since you only allow filling of <: BitInteger arrays, what’s the advantage of using a custom UnsafeView instead of a regular view and/or reinterpret for converting to UInt8? IME, the latter two are exactly as efficient & shouldn’t produce problems here, since there’s no padding anyway.

1 Like

Thanks!

Would be nice to have CUDA and SIMD as optional dependencies though, if you’re e.g. on a system that uses a non-NVidia GPU or on a system that doesn’t have SIMD (as far as I could tell, there’s no graceful fallback, right?).

Yeah, this is a good point (although, is Julia supported on any architectures that don’t have SIMD?). I reflexively rejected the idea of using Requires.jl because of the issues mentioned in Pkg.jl#1285 and because I’m not a huge fan of the idea adding a new dependency to manage my other dependencies. Since it doesn’t seem like Pkg.jl is going to support conditional dependencies any time soon, though, I’ll definitely look into adding Requires.jl to ChaChaCiphers.jl. :slightly_smiling_face:

One question I do have: since you only allow filling of <: BitInteger arrays, what’s the advantage of using a custom UnsafeView instead of a regular view and/or reinterpret for converting to UInt8 ?

So, I’m not actually using UnsafeView at the moment, and I might get rid of it if I can’t find a compelling use for it. That being said, when I saw how the standard library used UnsafeView (implemented here) for RNG, I benchmarked it against reinterpret(...) (which is what I originally used) and found that it was actually quite a lot faster. I’m afraid I didn’t keep a record of the benchmarks I ran for its performance in ChaCha, but at least as a naive benchmark, here’s the performance difference between sum(reinterpret(...)) and sum(UnsafeView(...)):

julia> using BenchmarkTools; using ChaChaCiphers: UnsafeView;

julia> const x = rand(UInt32, 2^20);

julia> const xr = reinterpret(UInt8, x);

julia> @btime sum(xr);
  3.343 ms (0 allocations: 0 bytes)

julia> GC.@preserve x begin
           p = pointer(x)
           xv = UnsafeView(p, sizeof(x))
           @btime sum(xv);
       end
  748.333 μs (1 allocation: 16 bytes)

which I’d say is roughly the performance difference I saw between reinterpret(...) and UnsafeView as well. That said, this package is the first time I’ve written Julia code in a while, so I’m sure I could’ve done some things differently to make working with reinterpret(...) faster.

Right - I keep forgetting that this is a problem…

An alternative avoiding Requires.jl would be to define an interface package for ciphersuites in general, with CPUChaCha20.jl or GPUChaCha20.jl being specific implementations for that. I think that’s how people usually work around those limitations, while also being able to version those dependencies properly, which is not possible with Requires.jl. That definitely would require a more thorough think about how that interface should look - though I’d definitely love to see something like that! This is a good read if you want to tackle this for cryptographic ciphersuites in the julia space (but I’m not entirely certain it’s a perfect fit for just the RNG part, as that has its own API already).

I’m 85% certain that’s the difference between being inbounds or not - UnsafeView in Base just throws all that out of the window because Mersenne Twister is just such a twisted beast. I think your benchmark also isn’t representative of your internal code :slight_smile: My concern is mostly with avoiding to rely on pointers, if it’s possible to express the same code without them. I’m not sure I’ll have the time today, but I may take a look at it over the coming days.