Square roots of integers


#1

I need a function that returns the square root of a positive integer less than Nmax but does the computation only once for a given integer.
In C, I would compute all values on the first call using a static variable like this:

#include <stdlib.h>
#include <math.h>

double sqrttable(unsigned int i)
{
    static double *cache = NULL;
    if (cache == NULL)
    {
        cache = (double*)malloc((Nmax+1)*sizeof(double));
        for (unsigned int k = 0; k <= Nmax; ++k) cache[k] = sqrt(k);
    }
    // add a check for i <= Nmax
    return cache[i];
}

What is the best way to achieve this in Julia?


#2
julia> const sqrttable=[sqrt(i) for i=1:1_000_000];
julia> sqrt_tab(x)=sqrttable[x];
julia> sqrt_tab_ib(x)=(@inbounds r= sqrttable[x]; r);
julia> vals=rand(1:1_000_000, 1_000_000);
julia> using BenchmarkTools
julia> @btime copy($vals);
  842.508 μs (2 allocations: 7.63 MiB)
julia> @btime map($sqrt, $vals);
  3.365 ms (3 allocations: 7.63 MiB)
julia> @btime map($sqrt_tab, $vals);
  10.001 ms (3 allocations: 7.63 MiB)
julia> @btime map($sqrt_tab_ib, $vals);
  9.612 ms (3 allocations: 7.63 MiB)

Unless your table is truly tiny or you have to use a CPU from the early 90s, you are better off not using a table. Cache latency > Memory bandwidth > instruction latency > instruction throughput

Note that the above is memory-bandwidth constrained (the CPU reads ahead out of order). If your CPU cannot predict the lookup (via out-of-order execution), your cache will blow up in your face even worse.

Note that the timings almost fit: 8x read-amplification means that the lookups should take 6.8 ms if your CPU never stalls due to latency.
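The arithmetic behind that estimate, as a quick sketch using the timings above:

```julia
# Back-of-the-envelope check of the ~6.8 ms claim, using the timings above.
copy_time_ms = 0.8425   # copy: one sequential read + one write per element
read_amp = 8            # a random lookup pulls a full 64-byte cache line = 8 Float64s
# If the lookup were purely bandwidth-bound, its cost would be roughly the
# copy time scaled by the read amplification:
predicted_ms = copy_time_ms * read_amp   # ≈ 6.74 ms, vs. ~10 ms measured
```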


#3

You can use a let block to create a new scope where the cache lives, define an anonymous function inside that scope, and then store the resulting anonymous function in a const global (otherwise it would be a non-constant global, which costs performance):

julia> const sqrttable = let cache = Float64[]
          for i in 1:100
             push!(cache, sqrt(i))
          end
          i -> cache[i] # this anonymous function is returned from the 
                        # let block and thus becomes bound to the
                        # const sqrttable
       end
(::#1) (generic function with 1 method)

julia> sqrttable(4)
2.0

julia> sqrttable(16)
4.0

although it’s worth noting that you don’t save much time by doing this:

julia> using BenchmarkTools

julia> @btime sqrttable(4)
  1.824 ns (0 allocations: 0 bytes)
2.0

julia> @btime sqrt(4)
  2.171 ns (0 allocations: 0 bytes)
2.0

#4

So just to extend: my benchmarks were of course a bit off, because I did not account for higher-level caches (which are always hot in the benchmark loop, and large enough to hold a sizable fraction of the table). Just to show you how fast your CPU is compared to its puny memory bus (repeat on your target computer):

julia> const sqrttable=[sqrt(i) for i=1:10_000_000];
julia> sqrt_tab_ib(x)=(@inbounds r= sqrttable[x]; r);
julia> vals=rand(1:10_000_000, 10_000_000);

julia> function cpx(V)
           res = similar(V)
           @simd for i = 1:length(V)
               @inbounds res[i] = V[i]
           end
           res
       end


julia> @btime copy(vals);
  48.847 ms (2 allocations: 76.29 MiB)
julia> @btime cpx(vals);
  39.668 ms (2 allocations: 76.29 MiB)
julia> @btime sqrt.(vals);
  66.954 ms (26 allocations: 76.30 MiB)
julia> @btime sqrt_tab_ib.(vals);
  159.656 ms (26 allocations: 76.30 MiB)

Vectorized sqrt is not much slower than memcopy from/to main memory. I have no idea why the default copy appears to be slow on my system (maybe I borked my 0.6 sysimg?).

In the table lookup case, each entry costs 1x write and (1+8)x read (one cache line = 8 Float64), while in the memcopy case we need 1x write and 1x read, for an expected factor of 5. I’d call that close enough.
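As a sketch, the factor-of-5 estimate just counts memory traffic per element:

```julia
# Memory traffic per element, counted in Float64-sized slots.
lookup_traffic = 1 + 1 + 8   # 1 write + 1 sequential read (the index) + 8 from the random cache line
copy_traffic   = 1 + 1       # 1 write + 1 read
expected_factor = lookup_traffic / copy_traffic   # = 5.0
```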

Postscript: My system sucks. The problem is not Julia’s slow copy; it is that memcpy is significantly slower than the SIMD copy on my system.

julia> function memcpx(V)
           res = similar(V)
           ccall(:memcpy, Ptr{Void}, (Ptr{Void}, Ptr{Void}, Csize_t), pointer(res), pointer(V), sizeof(V))
           res
       end
julia> @btime memcpx(vals);
  47.814 ms (2 allocations: 76.29 MiB)

Postscript 2: Meh, I’m really bad at guessing what I’m benchmarking. I need to pre-allocate buffers. All the copying above does not measure memory speed; it probably measures how fast the kernel is at faulting in freshly zeroed memory.

julia> memcpy(dst,src)=ccall(:memcpy, Ptr{Void}, (Ptr{Void}, Ptr{Void},Csize_t), pointer(dst), pointer(src), sizeof(src));
julia> src=zeros(10_000_000); dst=similar(src);
julia> @btime memcpy(dst,src)
  8.518 ms (7 allocations: 112 bytes)
julia> @btime zeros(10_000_000);
  36.124 ms (2 allocations: 76.29 MiB)

#5

Thank you both for showing me it is not a good idea. Even with a perfectly predictable access pattern, the gain is not significant:

julia> vals=collect(1:10_000_000);

julia> @btime sqrt.(vals);
  42.875 ms (26 allocations: 76.30 MiB)

julia> @btime sqrt_tab_ib.(vals);
  31.707 ms (26 allocations: 76.30 MiB)

#6

I have something similar:

julia> vals=rand(1:10_000_000, 10_000_000);

julia> @btime copy(vals);
  35.318 ms (2 allocations: 76.29 MiB)

julia> @btime cpx(vals);
  27.363 ms (2 allocations: 76.29 MiB)

julia> @btime memcpx(vals);
  33.627 ms (2 allocations: 76.29 MiB)

#7

memcpy is provided by the binaries of the Linux distribution, which are compiled for a generic x64 system.
Can’t Julia’s LLVM-based compilation achieve a better copy using the best SIMD instructions available on the actual CPU?

Edit: I guess that would not be consistent with the copy being limited only by memory bandwidth.


#8

On my machine, running version 0.7-alpha on Windows 10, I get no difference at all:

julia> const sqrttable = let cache = Float64[]
          for i in 1:100
             push!(cache, sqrt(i))
          end
          i -> cache[i] # this anonymous function is returned from the 
                        # let block and thus becomes bound to the
                        # const sqrttable
       end
(::#1) (generic function with 1 method)

julia> using BenchmarkTools

julia> @btime sqrttable(4)
  1.026 ns (0 allocations: 0 bytes)
2.0

julia> @btime sqrt(4)
  1.026 ns (0 allocations: 0 bytes)
2.0

#9

I suspect this has to do with constant propagation: your functions get compiled specifically for the literal input 4, so what you’re measuring is the time it takes to return the constant 2.0.
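To check for this, a common BenchmarkTools trick is to hide the literal behind an interpolated `Ref`, so the compiler cannot fold the whole call down to a constant (a sketch):

```julia
using BenchmarkTools

# Wrap the input in a Ref so it is no longer a compile-time literal;
# the [] dereference then happens inside the benchmark loop.
x = Ref(4)
@btime sqrt($x[])
```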


#10

You could also do @generated s(x) = :(sqrt(x)), but that’s not much faster either. Caching would probably make more sense if you are computing more complicated expressions than sqrt.


#11

Why would that help?


#12

In my post I said that it doesn’t give you any benefit, since it’s on par with taking sqrt on its own. However, it is another possible way of caching the results.


#13

No, it doesn’t cache anything; it’s just the same as s(x) = sqrt(x) (but slower to compile).

julia> @generated s(x) = :(sqrt(x))
s (generic function with 1 method)

julia> @code_llvm s(3)

define double @julia_s_62894(i64) #0 !dbg !5 {
top:
  %1 = icmp sgt i64 %0, -1
  br i1 %1, label %pass, label %fail

fail:                                             ; preds = %top
  call void @jl_throw(i8** inttoptr (i64 4482821056 to i8**))
  unreachable

pass:                                             ; preds = %top
  %2 = sitofp i64 %0 to double
  %3 = call double @llvm.sqrt.f64(double %2)
  ret double %3
}

julia> @code_llvm sqrt(3)

define double @julia_sqrt_62897(i64) #0 !dbg !5 {
top:
  %1 = icmp sgt i64 %0, -1
  br i1 %1, label %pass, label %fail

fail:                                             ; preds = %top
  call void @jl_throw(i8** inttoptr (i64 4482821056 to i8**))
  unreachable

pass:                                             ; preds = %top
  %2 = sitofp i64 %0 to double
  %3 = call double @llvm.sqrt.f64(double %2)
  ret double %3
}

#14

You’re right, it doesn’t quite work the way I thought it did. It does cache the code, but not the results.


#15

One way to speed up sqrt is to avoid the range check. This can allow the sqrt to be vectorized (if the caller can be vectorized otherwise, of course…).

You can do this with the @fastmath version of the function if you convert the integer input to float, or you can make sure you use an unsigned integer type, in which case LLVM should be smart enough to tell that the range check can be skipped.
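For example, a sketch of both approaches (the function names are made up):

```julia
# Convert to Float64 first, then take the sqrt under @fastmath, so LLVM
# may drop the negative-argument check:
sqrt_fast(x::Integer) = @fastmath sqrt(Float64(x))

# Or take an unsigned argument: it can never be negative, so LLVM can
# prove the range check away on its own:
sqrt_unsigned(x::Unsigned) = sqrt(x)
```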