Speed up Julia code for simple Monte Carlo Pi estimation (compared to Numba)

In my Windows 10 environment, the CPU timings for the Monte Carlo estimation of π (Float64, n = 10^8) are as follows:

  • CPU 1 thread: 223 ms
  • CPU 12 threads: 38 ms
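
The code behind the 12-thread figure is in the previous post and is not reproduced here. For readers who want something to compare against, here is a minimal sketch of one way to parallelize the same estimator. It is not necessarily the version that was timed. The names `count_hits` and `mcpi_threads`, the `ntasks` keyword, and the use of the task index as RNG seed are all assumptions made for illustration.

```julia
using Random
using Base.Threads: @spawn, nthreads

# Count the samples that land inside the quarter circle, using an RNG
# owned by this task so that tasks never share generator state.
function count_hits(n, seed)
    rng = MersenneTwister(seed)  # seeding by task index is an assumption, chosen for reproducibility
    count(_ -> rand(rng)^2 + rand(rng)^2 ≤ 1, 1:n)
end

# Hypothetical multithreaded estimator: split n samples across ntasks tasks.
function mcpi_threads(n; ntasks = nthreads())
    q, r = divrem(n, ntasks)
    # The first r tasks take one extra sample so the chunk sizes sum to n.
    tasks = [@spawn count_hits(q + (i ≤ r), i) for i in 1:ntasks]
    4 * sum(fetch, tasks) / n
end
```

To try it, start Julia with several threads (for example `julia -t 12`) and run `@btime mcpi_threads(10^8);`. With a single thread it still runs, just without any speedup.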

I have also tried a GPU (CUDA) version, run in a Jupyter notebook.

CPU 1 thread version for Float32:

using BenchmarkTools
using Random
function mcpi_f32(n)
    rng = MersenneTwister()
    4count(_ -> rand(rng, Float32)^2 + rand(rng, Float32)^2 ≤ 1, 1:n)/n
end
@show mcpi_f32(10^8)
@btime mcpi_f32(10^8);
mcpi_f32(10 ^ 8) = 3.14160556
  239.164 ms (12 allocations: 19.66 KiB)

My very simple one-line CUDA version for Float32:

using BenchmarkTools
using CUDA
# x is taken from a GPU array filled by CUDA.rand; y is drawn inside the kernel by rand(Float32)
mcpi_f32_cu(n) = 4count(x -> x^2 + rand(Float32)^2 ≤ 1, CUDA.rand(n))/n
@show mcpi_f32_cu(10^8)
@btime mcpi_f32_cu(10^8);
mcpi_f32_cu(10 ^ 8) = 3.14173908
  6.894 ms (8189 allocations: 257.55 KiB)

CPU 1 thread version for Float64:

using BenchmarkTools
using Random
function mcpi_f64(n)
    rng = MersenneTwister()
    4count(_ -> rand(rng)^2 + rand(rng)^2 ≤ 1, 1:n)/n
end
@show mcpi_f64(10^8)
@btime mcpi_f64(10^8);
mcpi_f64(10 ^ 8) = 3.14161428
  225.706 ms (12 allocations: 19.66 KiB)

My very simple one-line CUDA version for Float64:

using BenchmarkTools
using CUDA
# x is taken from a GPU array filled by CUDA.rand; y is drawn inside the kernel by rand(Float64)
mcpi_f64_cu(n) = 4count(x -> x^2 + rand(Float64)^2 ≤ 1, CUDA.rand(Float64, n))/n
@show mcpi_f64_cu(10^8)
@btime mcpi_f64_cu(10^8);
mcpi_f64_cu(10 ^ 8) = 3.1418442
  12.518 ms (15601 allocations: 489.17 KiB)
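
As a sanity check on the four estimates above: each sample is a hit with probability π/4, so the estimator 4·hits/n has a standard error of 4·sqrt(p(1-p)/n) with p = π/4, which is about 1.6e-4 for n = 10^8. The sketch below compares each reported estimate with that value. The helper name `mcpi_stderr` and the labels are mine; the estimates are copied from the output above.

```julia
# Standard error of the estimator 4 * hits / n, where each sample is a hit
# with probability p = π/4 (binomial model).
mcpi_stderr(n) = 4 * sqrt((π / 4) * (1 - π / 4) / n)

n = 10^8
se = mcpi_stderr(n)
println("standard error for n = 10^8: ", se)

# Estimates copied from the runs above.
estimates = [
    "f32 CPU" => 3.14160556,
    "f32 GPU" => 3.14173908,
    "f64 CPU" => 3.14161428,
    "f64 GPU" => 3.1418442,
]

for (label, est) in estimates
    err = est - π
    println(rpad(label, 8), " error = ", err,
            " (", round(err / se, digits = 2), " standard errors)")
end
```

An error within a couple of standard errors is what one would expect from sampling noise alone, so a check like this cannot by itself show that a random number generator is good or bad.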

My CUDA versioninfo:

CUDA.versioninfo()
CUDA toolkit 11.4.1, artifact installation
CUDA driver 11.2.0
NVIDIA driver 462.31.0

Libraries: 
- CUBLAS: 11.5.4
- CURAND: 10.2.5
- CUFFT: 10.5.1
- CUSOLVER: 11.2.0
- CUSPARSE: 11.6.0
- CUPTI: 14.0.0
- NVML: 11.0.0+462.31
- CUDNN: 8.20.2 (for CUDA 11.4.0)
- CUTENSOR: 1.3.0 (for CUDA 11.2.0)

Toolchain:
- Julia: 1.6.2
- LLVM: 11.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

1 device:
  0: GeForce GTX 1650 Ti (sm_75, 833.594 MiB / 4.000 GiB available)

Summary: n = 10^8

Previous post:

  • Float64 CPU 1 thread: 223 ms
  • Float64 CPU 12 threads: 38 ms

One-line versions:

  • Float32 CPU 1 thread: 239 ms
  • Float32 GPU CUDA: 6.9 ms
  • Float64 CPU 1 thread: 226 ms
  • Float64 GPU CUDA: 12.5 ms
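
To put the numbers side by side, here is a small sketch that computes the speedup ratios from the `@btime` results reported above (roughly 35x for Float32 and 18x for Float64 against the single-thread CPU versions). The `timings` tuple and the `speedup` helper are names I made up for this illustration.

```julia
# Timings in milliseconds, copied from the @btime output above (n = 10^8).
timings = (f32_cpu = 239.164, f32_gpu = 6.894,
           f64_cpu = 225.706, f64_gpu = 12.518)

# How many times faster the GPU run is than the CPU run.
speedup(cpu_ms, gpu_ms) = cpu_ms / gpu_ms

println("Float32: GPU is ",
        round(speedup(timings.f32_cpu, timings.f32_gpu), digits = 1),
        "x faster than 1 CPU thread")
println("Float64: GPU is ",
        round(speedup(timings.f64_cpu, timings.f64_gpu), digits = 1),
        "x faster than 1 CPU thread")
```

Note that the GPU timings include filling the array with `CUDA.rand`, since that call sits inside the timed function.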