Julia-ism for two-dimensional map?


#1

dear j wizards—is there an obvious julia-ism to express the following faster and better? if the answer is no, I am happy enough. I just want to know if I am missing the spirit:

xset= 10:15
yset= 12:14

m= Matrix{Float64}( length(xset)*length(yset), 3)
f(x,y) = sqrt( x^2 + y^2 )

i=1
for x in collect(xset)
    for y in collect(yset)
	m[i,:]=	[ x, y, f(x,y) ]
	i+=1
    end
end

#2

Not a wizard, but here’s a one-liner:

hcat([ [z(x,y) for x in xset for y in yset] for z in ((x,y)->x, (x,y)->y, f) ]...)

And here’s perhaps a more julian way:

using StaticArrays

vv = [ @SVector [x, y, f(x,y)]  for x in xset for y in yset ]

reinterpret(Float64, vv, (3, length(vv) ))

This one is quickest:

julia> @btime m_orig($xset, $yset, $f);
  4.604 μs (145 allocations: 6.16 KiB)

julia> @btime m_hcat($xset, $yset, $f);
  27.847 μs (65 allocations: 5.23 KiB)

julia> @btime m_svec($xset, $yset, $f);
  738.832 ns (14 allocations: 2.03 KiB)

For xset= 10:150; yset= 12:140 the hcat way does better than original, but still slower than svec one.


#3

If all you compare about is performance, this version is faster than the svec version on my computer (and doesn’t have to worry about problems with reinterpreted arrays starting in Julia 0.7):

function orig(xset, yset)
    i = 1
    m = Matrix{Float64}(undef, 3, length(xset)*length(yset))
    @inbounds for x in Float64.(xset) 
        @simd for y in Float64.(yset)
            m[:,i] .= ( x, y, f(x,y) )
            i += 1
        end
    end
    m
end

function svec(xset, yset)
    vv = [ @SVector [x, y, f(x,y)]  for x in xset for y in yset ]
    reinterpret(Float64, vv, (3, length(vv) ))
end

The calls to Float64.(_set) are important. I think because having to convert interferes with the @simd.

julia> @benchmark svec($xset, $yset)
BenchmarkTools.Trial: 
  memory estimate:  1.91 KiB
  allocs estimate:  10
  --------------
  minimum time:     890.643 ns (0.00% GC)
  median time:      970.143 ns (0.00% GC)
  mean time:        1.212 μs (17.50% GC)
  maximum time:     65.267 μs (94.76% GC)
  --------------
  samples:          10000
  evals/sample:     42

julia> @benchmark orig($xset, $yset)
BenchmarkTools.Trial: 
  memory estimate:  1.31 KiB
  allocs estimate:  8
  --------------
  minimum time:     640.097 ns (0.00% GC)
  median time:      663.758 ns (0.00% GC)
  mean time:        771.472 ns (11.81% GC)
  maximum time:     11.189 μs (88.50% GC)
  --------------
  samples:          10000
  evals/sample:     165

IterTools may be useful if you want to avoid for loops, or reinterprets/reshapes:
https://juliacollections.github.io/IterTools.jl/latest/#product(xs...)-1


#4

First off you can and should loop over your ranges without collecting them. There’s absolutely nothing wrong with doing explicit loops but if you want to go for compactness, here’s another one-liner:

reduce(vcat, [[x, y, f(x, y)]' for (y, x) in Iterators.product(yset, xset)])

#5

Good point about reinterpret and v0.7. On v0.6.2, after deleting undef, Elrod’s @simd version takes 712 ns. Adding @inbounds brings the svec version within a few percent:

svec(xset, yset) = @inbounds reinterpret(Float64, 
        [ @SVector [x, y, f(x,y)]  for x in xset for y in yset ], 
        (3, length(xset)*length(yset)) ) ## 727 ns

And GunnarFarneback’s reduce-vcat version is twice as fast as my hcat one. Removing the transpose and using hcat makes it about twice fast again, surprisingly?

redh(xset, yset) = reduce(hcat, [[x, y, f(x, y)] for (y, x) in Iterators.product(yset, xset)]) ## 8.531 μs

#6

thank you everybody. very instructive. one-liners that can be understood with one glance make it easier to understand code later on (IMHO). something about “number of lines of code” that one can glance at in one chunk. (PS: Sorry about the collect. I did not think much about my own version.)

I also learned that maybe I should not view StaticArrays as just another add-on, but as something that is very useful. But…

lordie…what is this? can someone give me a pointer, please?

regards,

/iaw

PS: May I point out that ns are not comparable across computers?


#7

lordie…what is this? can someone give me a pointer, please?

PS: May I point out that ns are not comparable across computers?

Yeah, just tried on another computer, and the SVector version is just slightly faster:

julia> @benchmark orig($xset, $yset)
BenchmarkTools.Trial: 
  memory estimate:  1.31 KiB
  allocs estimate:  8
  --------------
  minimum time:     407.743 ns (0.00% GC)
  median time:      430.580 ns (0.00% GC)
  mean time:        500.106 ns (10.39% GC)
  maximum time:     6.392 μs (88.50% GC)
  --------------
  samples:          10000
  evals/sample:     206

julia> @benchmark svec($xset, $yset)
BenchmarkTools.Trial: 
  memory estimate:  1.91 KiB
  allocs estimate:  10
  --------------
  minimum time:     404.726 ns (0.00% GC)
  median time:      425.562 ns (0.00% GC)
  mean time:        544.980 ns (19.62% GC)
  maximum time:     7.589 μs (90.31% GC)
  --------------
  samples:          10000
  evals/sample:     201

#8

“slightly faster”: I bet you that if you run this on a slightly different computer a few times, they will sometimes go one way and sometimes another way. I am guessing this is just sampling variation. looks about the same to me…and even if correct, this is fortunately trivial.


#9

This version involves no fancy tricks and is far faster than both for me:

julia> function loop(xset, yset)
           m = Matrix{Float64}(undef, 3, length(xset)*length(yset))
           i = 1
           @inbounds for x in xset, y in yset 
               m[1,i] = x
               m[2,i] = y
               m[3,i] = f(x,y)
               i += 1
           end
           m
       end

#10

Me too! About 5x faster.
I should’ve followed the “KISS” mantra.

This sort of thing makes me doubt myself. Why would I not at least have tried that? I feel like I should take a second look at a lot of my own code to make sure it isn’t needlessly complicated.
I don’t know much about compilers, but long term I’d expect simpler code to be easier to optimize as the compiler gets smarter…


#11

and, because I instigated this, here it is all put it together on an i7-7700k:

  1.111 ms (79491 allocations: 2.82 MiB)
  83.008 μs (103 allocations: 322.83 KiB)
  14.454 μs (2 allocations: 234.45 KiB) . <-- fastest is steven's simple loop with direct assignment
  28.244 ms (215534 allocations: 81.67 MiB)
  27.926 ms (215534 allocations: 81.67 MiB)
  157.606 μs (87 allocations: 772.67 KiB)
  24.193 ms (163685 allocations: 80.42 MiB)

for the code below. interestingly, when one replaces the matrix entries with

m[1,i]=...   with	m[:,i] .= [x,y,f(x,y)]

  1.093 ms (79491 allocations: 2.82 MiB)

then the performance drops by two orders of magnitude, to about 1.1ms. apparently, this makes all the difference in performance. looking at the memory allocations, this is what takes the extra time here. .= does not avoid it.


f(x,y) = sqrt( x^2 + y^2 );

xset= 1:100; yset= 1:100; i= 1;

function loop1( xset, yset )
    i= 1
    m= Matrix{Float64}( length(xset)*length(yset), 3)
    for x in xset
	for y in yset
	    m[i,:]=	[ x, y, f(x,y) ]
	    i+=1
	end#for x
    end#for##
    m
end;#loop1##

function loop2( xset, yset )
    i= 1
    m= Matrix{Float64}( 3, length(xset)*length(yset))
    @inbounds for x in Float64.(xset)
        @simd for y in Float64.(yset)
            m[:,i] .= ( x, y, f(x,y) )  ## fails
            i += 1
        end#for
    end#for
end#function

function loop3( xset, yset )
    m= Matrix{Float64}( 3, length(xset)*length(yset))
    i = 1
    @inbounds for x in xset, y in yset
	m[1,i] = x
	m[2,i] = y
        m[3,i] = f(x,y)
	i += 1
    end#for
end#function


using Iterators;

rv( xset, yset) = reduce(vcat, [[x, y, f(x, y)]' for (y, x) in Iterators.product(yset, xset)])

ra( xset, yset) = ([ [z(x,y) for x in xset for y in yset] for z in ((x,y)->x, (x,y)->y, f) ]...)

using StaticArrays

sa( xset, yset) = (vv= [ @SVector [x, y, f(x,y)]  for x in xset for y in yset ]; reinterpret(Float64, vv, (3, length(vv) )) )

rh( xset, yset) = reduce(hcat, [[x, y, f(x, y)] for (y, x) in Iterators.product(yset, xset)])  ## actually, this produces 10000x3

using BenchmarkTools

@btime z1= loop1( xset, yset );
@btime z2= loop2( xset, yset );
@btime z3= loop3( xset, yset );
@btime z4= rv( xset, yset );
@btime z5= rv( xset, yset );
@btime z6= ra( xset, yset );
@btime z7= rh( xset, yset );

#12

Use dollar signs to interpolate the variables into the @btime expression:

@btime z1= loop1( $xset, $yset );
@btime z2= loop2( $xset, $yset );
@btime z3= loop3( $xset, $yset );
@btime z4= rv( $xset, $yset );
@btime z5= rv( $xset, $yset );
@btime z6= ra( $xset, $yset );
@btime z7= rh( $xset, $yset );

#13

fortunately, it did not matter… ;-).

below is corrected:

loop1  1.125 ms (79491 allocations: 2.82 MiB)
loop2    84.387 μs (103 allocations: 322.83 KiB)
loop3    14.132 μs (2 allocations: 234.45 KiB)  <-- steveng's loop
loop3  1.089 ms (79491 allocations: 2.82 MiB) <-- steveng's loop, but with .= instead of direct
rv    26.252 ms (215531 allocations: 81.67 MiB)
sa       87.650 μs (20 allocations: 769.02 KiB)  <-- corrected
ra      157.509 μs (85 allocations: 772.58 KiB)
rh    22.710 ms (163682 allocations: 80.42 MiB)

#14

That is because for every i you are allocating a temporary array [x,y,f(x,y)] and then copying its entries into m with .=. Don’t allocate arrays in inner loops.


#15

indeed. I learned this from your post. this little tidbit should also be put prominently into the list of julia speedup recommendations, together with others (type invariance, etc.). I had thought the compiler were smart enough to recognize this. eventually, I think a mature compiler can recognize the construct and fold it back; but for now, it is what it is. (I am going to file an issue as a suggestion for a compiler improvement on this one.)

regards,

/iaw


#16

Great idea! Perhaps you could make PR for the documentation. You already have a nice minimal example.

http://wiki.c2.com/?SufficientlySmartCompiler


#17

would be happy to do so, but I do not really know how to do this. I don’t really use github other than for the most trivial of items. I have never collaborated with anyone on github. any my bandwidth right at the moment is very limited, too. (I tend to file issues instead, because I understand it; and I have occasionally edited wikipedia.)

I filed https://github.com/JuliaLang/julia/issues/26938 on the performance issue. I don’t know the internals, either, but I bet this one has a reasonably feasible fix. the allocation happens right before the deallocation, which should be detectable…eventually. and then this issue will go away altogether.

(I amended my request to include a doc notice.)

regards,

/iaw


#18

Note that you test rv twice and sa not at all.

And that the ranges being compared here are longer than the original original post… although even on the original data, stevengj’s loop is fastest.


#19

Another elegant solution would be using a struct taking three elements representing a row. It has the same performance as @stevengj loop, but keeps the original array without transposing.

struct xyf 
	x :: Int64
	y :: Int64
	f :: Float64
end 
function struct_loop(xset, yset)
	i = 1
	m = Array{xyf}(undef, length(xset)*length(yset))
	@inbounds for x in xset, y in yset
		m[i] = xyf(x, y, f(x,y))
		i += 1
	end 
	m
end 

@btime stevengj_loop($xset, $yset)
@btime struct_loop($xset, $yset)
  97.499 ns (1 allocation: 544 bytes)
  95.141 ns (1 allocation: 544 bytes)

One more method using tuples, with the same performance as struct_loop but shorter:

function tuple_loop(xset, yset)
	i = 1
	m = Array{typeof((0,0,0.0))}(undef, length(xset)*length(yset))
	@inbounds for x in xset, y in yset
		m[i] = (x, y, f(x,y))
		i += 1
	end 
	m
end 

#20

Good point, since it’s not really very matrix-like.

When I make a (one-line!) list comprehension using this struct, it’s much faster to use Iterators.product, which I find surprising – isn’t that the only thing which for ... for ... can mean?

struct_comp(xlist, ylist) = [xyf(x, y, f(x,y)) for x in xset for y in yset] ## 15.737 ms
struct_prod(xlist, ylist) = [xyf(x, y, f(x,y)) for (x,y) in Iterators.product(xset,yset)] ## 64.946 μs

Not quite as fast as the loop, but close. Much faster than SVector way above:

@btime struct_loop($xset, $yset) ## 50.979 μs
@btime svec($xset, $yset); ## 183.364 μs