Speed up Julia code for simple Monte Carlo Pi estimation (compared to Numba)

genkuroki · August 22, 2021, 4:45am

In my Windows 10 environment, the CPU timings for the Monte Carlo estimation of \pi (Float64, n = 10^8) are as follows:

CPU 1 thread: 223 ms
CPU 12 threads: 38 ms
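
The 12-thread code itself is not shown in this post. As a rough sketch of how such a multi-threaded CPU version could look (the names `count_hits` and `mcpi_threaded` are hypothetical, and this is not necessarily the code that produced the 38 ms figure), the samples can be split into chunks, each counted in its own task with its own RNG:

```julia
using Random
using Base.Threads: @spawn, nthreads

# Hypothetical helper: count how many of m random points fall inside the
# unit quarter circle, using an RNG that is local to the calling task.
function count_hits(m)
    rng = MersenneTwister()
    count(_ -> rand(rng)^2 + rand(rng)^2 ≤ 1, 1:m)
end

# Hypothetical threaded estimator: split n samples over ntasks tasks.
function mcpi_threaded(n; ntasks = nthreads())
    # Chunk sizes differ by at most 1 and sum to n exactly.
    sizes = [div(n, ntasks) + (i ≤ rem(n, ntasks) ? 1 : 0) for i in 1:ntasks]
    tasks = map(sizes) do m
        @spawn count_hits(m)
    end
    4 * sum(fetch, tasks) / n
end
```

Julia has to be started with several threads (for example `julia -t 12`) for the tasks to run in parallel. Giving each task its own RNG avoids sharing mutable RNG state between threads.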

I have also tried a GPU version using CUDA (Jupyter notebook).

Single-threaded CPU version for Float32:

using BenchmarkTools
using Random
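function mcpi_f32(n)
    rng = MersenneTwister()  # randomly seeded RNG, local to this call
    # Count points (x, y), uniform in [0, 1)^2, that fall inside the unit quarter circle
    4count(_ -> rand(rng, Float32)^2 + rand(rng, Float32)^2 ≤ 1, 1:n)/n
end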
@show mcpi_f32(10^8)
@btime mcpi_f32(10^8);

mcpi_f32(10 ^ 8) = 3.14160556
  239.164 ms (12 allocations: 19.66 KiB)

My very simple one-line CUDA version for Float32:

using BenchmarkTools
using CUDA
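# x comes from CUDA.rand (generated on the GPU); the closure, including rand(Float32), runs on the device
mcpi_f32_cu(n) = 4count(x -> x^2 + rand(Float32)^2 ≤ 1, CUDA.rand(n))/n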
@show mcpi_f32_cu(10^8)
@btime mcpi_f32_cu(10^8);

mcpi_f32_cu(10 ^ 8) = 3.14173908
  6.894 ms (8189 allocations: 257.55 KiB)

Single-threaded CPU version for Float64:

using BenchmarkTools
using Random
function mcpi_f64(n)
    rng = MersenneTwister()  # randomly seeded RNG, local to this call
    # rand(rng) draws Float64 by default
    4count(_ -> rand(rng)^2 + rand(rng)^2 ≤ 1, 1:n)/n
end
@show mcpi_f64(10^8)
@btime mcpi_f64(10^8);

mcpi_f64(10 ^ 8) = 3.14161428
  225.706 ms (12 allocations: 19.66 KiB)

My very simple one-line CUDA version for Float64:

using BenchmarkTools
using CUDA
# Same as the Float32 version, but both coordinates are Float64
mcpi_f64_cu(n) = 4count(x -> x^2 + rand(Float64)^2 ≤ 1, CUDA.rand(Float64, n))/n
@show mcpi_f64_cu(10^8)
@btime mcpi_f64_cu(10^8);

mcpi_f64_cu(10 ^ 8) = 3.1418442
  12.518 ms (15601 allocations: 489.17 KiB)
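
As a sanity check on the four estimates printed so far, they can be compared with the statistical error expected for n = 10^8. This is only a sketch: `mc_stderr` is a hypothetical helper name, and it assumes each sample is an independent Bernoulli trial with success probability π/4, so that the estimator 4·hits/n has standard error 4·sqrt(p(1 − p)/n).

```julia
# Standard error of the estimator 4 * hits / n, assuming independent samples
# that land inside the quarter circle with probability p = π/4.
mc_stderr(n) = 4 * sqrt((π / 4) * (1 - π / 4) / n)

n = 10^8

# Estimates copied from the outputs shown above.
estimates = (
    mcpi_f32    = 3.14160556,
    mcpi_f32_cu = 3.14173908,
    mcpi_f64    = 3.14161428,
    mcpi_f64_cu = 3.1418442,
)

σ = mc_stderr(n)
println("expected standard error: ", σ)
for (name, est) in pairs(estimates)
    # Deviation from π in absolute terms and in units of the standard error
    println(rpad(string(name), 12), " error = ", est - π,
            "  (", round((est - π) / σ, digits = 2), " σ)")
end
```

The standard error comes out at about 1.6e-4, and all four estimates lie within two standard errors of π, so neither the CPU nor the GPU results look statistically suspicious.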

My CUDA version info:

CUDA.versioninfo()

CUDA toolkit 11.4.1, artifact installation
CUDA driver 11.2.0
NVIDIA driver 462.31.0

Libraries: 
- CUBLAS: 11.5.4
- CURAND: 10.2.5
- CUFFT: 10.5.1
- CUSOLVER: 11.2.0
- CUSPARSE: 11.6.0
- CUPTI: 14.0.0
- NVML: 11.0.0+462.31
- CUDNN: 8.20.2 (for CUDA 11.4.0)
- CUTENSOR: 1.3.0 (for CUDA 11.2.0)

Toolchain:
- Julia: 1.6.2
- LLVM: 11.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

1 device:
  0: GeForce GTX 1650 Ti (sm_75, 833.594 MiB / 4.000 GiB available)

Summary: n = 10^8

Float64 CPU 1 thread: 223 ms
Float64 CPU 12 threads: 38 ms

One-line versions:

Float32 CPU 1 thread: 239 ms
Float32 GPU CUDA: 6.9 ms
Float64 CPU 1 thread: 226 ms
Float64 GPU CUDA: 12.5 ms
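
To make the comparison explicit, the speedups implied by these timings can be worked out directly. This is just arithmetic on the rounded numbers in the summary; the field names and the `speedup` helper are made up for illustration.

```julia
# Timings in milliseconds, copied from the summary above.
timings_ms = (
    f64_cpu_1thread   = 223.0,
    f64_cpu_12threads = 38.0,
    f32_cpu_oneliner  = 239.0,
    f32_gpu_oneliner  = 6.9,
    f64_cpu_oneliner  = 226.0,
    f64_gpu_oneliner  = 12.5,
)

# How many times faster the `fast` timing is than the `slow` one.
speedup(slow, fast) = slow / fast

t = timings_ms
println("Float64, 12 threads vs 1 thread: ", round(speedup(t.f64_cpu_1thread, t.f64_cpu_12threads), digits = 1), "x")
println("Float32, GPU vs 1 CPU thread:    ", round(speedup(t.f32_cpu_oneliner, t.f32_gpu_oneliner), digits = 1), "x")
println("Float64, GPU vs 1 CPU thread:    ", round(speedup(t.f64_cpu_oneliner, t.f64_gpu_oneliner), digits = 1), "x")
println("GPU, Float32 vs Float64:         ", round(speedup(t.f64_gpu_oneliner, t.f32_gpu_oneliner), digits = 1), "x")
```

On these numbers, 12 threads give roughly a 5.9x speedup over 1 thread, while the GPU one-liners are roughly 35x (Float32) and 18x (Float64) faster than their single-threaded CPU counterparts. Float32 is about 1.8x faster than Float64 on this GPU.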
