In my Windows 10 environment, the CPU timings for the Monte Carlo estimation of π are as follows:
- CPU 1 thread: 223 ms
- CPU 12 threads: 38 ms
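The threaded code behind the 12-thread timing is not shown in this post. As a rough idea of what such a version can look like, here is a minimal sketch. The function name `mcpi_threads` and the chunking scheme are assumptions for illustration, not necessarily what was benchmarked above. Each chunk gets its own `MersenneTwister` so that no RNG is shared between threads.

```julia
using Random

# Hypothetical threaded variant: split n samples into one chunk per thread.
function mcpi_threads(n)
    nt = Threads.nthreads()
    counts = zeros(Int, nt)               # one hit counter per chunk
    Threads.@threads for t in 1:nt
        rng = MersenneTwister()           # independent RNG for this chunk
        m = n ÷ nt + (t ≤ n % nt)         # chunk sizes add up to exactly n
        counts[t] = count(_ -> rand(rng)^2 + rand(rng)^2 ≤ 1, 1:m)
    end
    4 * sum(counts) / n
end

@show Threads.nthreads()
@show mcpi_threads(10^6)
```

The number of threads is fixed when Julia starts, for example with `julia -t 12` or the `JULIA_NUM_THREADS` environment variable (the latter is what a Jupyter kernel would need).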
I have also tried a GPU (CUDA) version, run in a Jupyter notebook.
CPU single-thread version for Float32:
using BenchmarkTools
using Random
function mcpi_f32(n)
    rng = MersenneTwister()
    4count(_ -> rand(rng, Float32)^2 + rand(rng, Float32)^2 ≤ 1, 1:n)/n
end
@show mcpi_f32(10^8)
@btime mcpi_f32(10^8);
mcpi_f32(10 ^ 8) = 3.14160556
239.164 ms (12 allocations: 19.66 KiB)
My very simple one-line CUDA version for Float32:
using BenchmarkTools
using CUDA
mcpi_f32_cu(n) = 4count(x -> x^2 + rand(Float32)^2 ≤ 1, CUDA.rand(n))/n
@show mcpi_f32_cu(10^8)
@btime mcpi_f32_cu(10^8);
mcpi_f32_cu(10 ^ 8) = 3.14173908
6.894 ms (8189 allocations: 257.55 KiB)
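Note that the one-liner draws `x` on the host side with `CUDA.rand`, while the second coordinate comes from `rand(Float32)` called inside the predicate, which runs on the device. A variant that draws both coordinates as arrays up front can be sketched as below. The name `mcpi_xy` is hypothetical, and the example runs on ordinary CPU arrays. Passing two `CuArray`s instead is an assumption I have not benchmarked here.

```julia
# Generic array-based formulation: works for any array type that supports
# broadcasting and `count`. Using it with CuArrays is an untested assumption.
mcpi_xy(x, y) = 4count(identity, x.^2 .+ y.^2 .≤ 1) / length(x)

# CPU demonstration with plain Arrays
x = rand(Float32, 10^6)
y = rand(Float32, 10^6)
@show mcpi_xy(x, y)
```

On the GPU this would need two arrays plus a temporary Boolean mask, so for n = 10^8 in Float32 it takes roughly 0.9 GB of device memory, which may be too much for a 4 GiB card that is already partly in use. Processing the samples in smaller batches would avoid that.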
CPU single-thread version for Float64:
using BenchmarkTools
using Random
function mcpi_f64(n)
    rng = MersenneTwister()
    4count(_ -> rand(rng)^2 + rand(rng)^2 ≤ 1, 1:n)/n
end
@show mcpi_f64(10^8)
@btime mcpi_f64(10^8);
mcpi_f64(10 ^ 8) = 3.14161428
225.706 ms (12 allocations: 19.66 KiB)
My very simple one-line CUDA version for Float64:
using BenchmarkTools
using CUDA
mcpi_f64_cu(n) = 4count(x -> x^2 + rand(Float64)^2 ≤ 1, CUDA.rand(Float64, n))/n
@show mcpi_f64_cu(10^8)
@btime mcpi_f64_cu(10^8);
mcpi_f64_cu(10 ^ 8) = 3.1418442
12.518 ms (15601 allocations: 489.17 KiB)
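As a sanity check on the four estimates above, they can be compared with the statistical error expected for n = 10^8. Each sample is a hit with probability p = π/4, so the estimator 4 * hits / n has a standard error of 4 * sqrt(p * (1 - p) / n), about 1.6e-4 here. The sketch below only does this arithmetic on the values reported in this post. The variable names are mine.

```julia
# Estimates reported in this post for n = 10^8
n = 10^8
estimates = (cpu_f32 = 3.14160556, gpu_f32 = 3.14173908,
             cpu_f64 = 3.14161428, gpu_f64 = 3.1418442)

p = π / 4                               # probability of landing inside the quarter circle
stderr_pi = 4 * sqrt(p * (1 - p) / n)   # standard error of 4 * hits / n

@show stderr_pi
for (name, est) in pairs(estimates)
    z = (est - π) / stderr_pi           # deviation in units of the standard error
    println(name, ": error = ", est - π, ", z = ", round(z, digits = 2))
end
```

All four results lie within two standard errors of π, so the differences between the CPU and GPU estimates are consistent with ordinary Monte Carlo noise.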
My CUDA version info:
CUDA.versioninfo()
CUDA toolkit 11.4.1, artifact installation
CUDA driver 11.2.0
NVIDIA driver 462.31.0
Libraries:
- CUBLAS: 11.5.4
- CURAND: 10.2.5
- CUFFT: 10.5.1
- CUSOLVER: 11.2.0
- CUSPARSE: 11.6.0
- CUPTI: 14.0.0
- NVML: 11.0.0+462.31
- CUDNN: 8.20.2 (for CUDA 11.4.0)
- CUTENSOR: 1.3.0 (for CUDA 11.2.0)
Toolchain:
- Julia: 1.6.2
- LLVM: 11.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80
1 device:
0: GeForce GTX 1650 Ti (sm_75, 833.594 MiB / 4.000 GiB available)
Summary (n = 10^8):
Previous post:
- Float64 CPU 1 thread: 223 ms
- Float64 CPU 12 threads: 38 ms
One-line versions:
- Float32 CPU 1 thread: 239 ms
- Float32 GPU CUDA: 6.9 ms
- Float64 CPU 1 thread: 226 ms
- Float64 GPU CUDA: 12.5 ms
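To put the summary into ratios, the speedups can be computed directly from the timings listed above. This is only arithmetic on the reported numbers, and the field names are made up for the example.

```julia
# Timings in milliseconds, copied from the summary above (n = 10^8)
t = (cpu1_f64_prev = 223.0, cpu12_f64_prev = 38.0,
     cpu1_f32 = 239.0, gpu_f32 = 6.9,
     cpu1_f64 = 226.0, gpu_f64 = 12.5)

speedup_threads = t.cpu1_f64_prev / t.cpu12_f64_prev   # 12 threads vs 1 thread
speedup_gpu_f32 = t.cpu1_f32 / t.gpu_f32               # GPU vs 1 CPU thread, Float32
speedup_gpu_f64 = t.cpu1_f64 / t.gpu_f64               # GPU vs 1 CPU thread, Float64
f64_vs_f32_gpu  = t.gpu_f64 / t.gpu_f32                # Float64 cost relative to Float32 on the GPU

@show round(speedup_threads, digits = 1)
@show round(speedup_gpu_f32, digits = 1)
@show round(speedup_gpu_f64, digits = 1)
@show round(f64_vs_f32_gpu, digits = 2)
```

So the GPU is roughly 35 times faster than one CPU thread in Float32 and roughly 18 times faster in Float64, compared with a bit under 6 times for 12 CPU threads. The GPU timings include generating the random array with `CUDA.rand`, since that call is inside the benchmarked function.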