Maximum(x) uses memory differently for different sized array views


#1

Hello all, it’s nice to be using and posting about Julia.

I need to use the maximum() function on SubArrays, and I have observed some confusing results that depend on the shape of the original array. Essentially, I have a large m and small n, and would have thought that an m by n matrix would yield better performance due to the column-major ordering. However, that is not what I’m observing in terms of memory when using the maximum() function, as in the example below.
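(Quick aside for readers, not part of the original post: Julia stores arrays in column-major order, so elements within a column are adjacent in memory. A minimal illustration:)

```julia
A = [1 3 5;
     2 4 6]      # 2×3 matrix
vec(A)           # walks memory in column-major order → [1, 2, 3, 4, 5, 6]
```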

function test()
  x = rand(512,2) # Array is 512 by 2
  y = x' # Array is 2 by 512

  @time results1 = maximum(view(x,1:128,:))
  @time results2 = maximum(view(y,:,1:128))

  # Check the calculation was the same with the percentage error
  println(maximum(abs,100.0*(results1-results2)/results1))
end

test()
0.000003 seconds (4 allocations: 144 bytes)
0.000002 seconds (1 allocation: 48 bytes)
0.0

I have tried Julia 0.5.0 and 0.6.0-rc3 on two different Windows machines, using the Atom IDE.
Is there something that I’m overlooking?


#2

(Please quote your code using backticks.)

Check out the BenchmarkTools.jl package for correct micro-benchmarking.

If you don’t use that package, you need to run the code twice, since the first call includes compilation time.
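To illustrate (a minimal sketch): the first call to a freshly defined function triggers JIT compilation, so @time on that call is dominated by compile time and its allocations, while later calls reflect only the run itself:

```julia
f(v) = sum(v)
v = rand(1000)
@time f(v)  # first call: includes compilation time and compiler allocations
@time f(v)  # second call: measures only the actual work
```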


#3

Thank you for the tip, I will investigate this package. In the meantime, I’ve added a bit about my development environment and machine.

I have indeed run the code multiple times and the memory allocation benchmarks are invariant (after the first run affected by compilation).


#4

Using the following:

using BenchmarkTools

function test()
    x = rand(512,2) # Array is 512 by 2
    y = x' # Array is 2 by 512

    @btime results1 = maximum(view(x, 1:128, :))
    @btime results2 = maximum(view(y, :, 1:128))

    # Check the calculation was the same with the percentage error
    println(maximum(abs,100.0*(results1-results2)/results1))
end

I get

julia 0.6> test()
  5.572 μs (35 allocations: 896 bytes)
  5.578 μs (37 allocations: 944 bytes)
0.0

Note the use of @btime (with a b), from BenchmarkTools.jl, instead of @time.
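(A side note, following the pattern recommended in the BenchmarkTools documentation: variables should be interpolated into @btime with `$`, otherwise they are treated as globals — and a local variable inside a function is not visible to the benchmark at all without interpolation. A minimal sketch:)

```julia
using BenchmarkTools

x = rand(512, 2)
# $x splices the value of x into the benchmark expression,
# so the timing measures the call itself, not global-variable lookup:
@btime maximum(view($x, 1:128, :))
```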


#5

Welcome to Julia @bingcrosby!

Another tip to improve your future code, you can use isapprox(x,y) to check if x and y are approximately the same. You can also use the unicode version of the function by typing \approx and pressing TAB.
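For instance (a small illustration of the tip above):

```julia
a = 0.1 + 0.2
b = 0.3
a == b          # false — floating-point rounding makes exact equality fail
isapprox(a, b)  # true
a ≈ b           # true — ≈ is typed as \approx then TAB
```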


#6

Thanks for the example. However, this code throws:

julia 0.6> test()
UndefVarError: x not defined.

#7

I’ve had more of a look at the problem and got BenchmarkTools working correctly. I’ve also implemented the trivial matrix max.

using BenchmarkTools

function my2DMatrixMax(x::SubArray{Float64,2})
    max = -Inf # start below any element so negative entries are handled too
    @inbounds for j = 1:size(x,2)
         for i = 1:size(x,1)
            if x[i,j] > max
                max = x[i,j]
            end
        end
    end
    return max
end

function test()
  x = rand(512,2) # Array is 512 by 2
  y = x' # Array is 2 by 512

  results1 = 0.0; results2 = 0.0; results3 = 0.0; results4 = 0.0

  # Based on column-major (FORTRAN) ordering, the first and third results should be quickest
  @btime $results1 = maximum(view($x,1:128,:))
  @btime $results2 = maximum(view($y,:,1:128))
  @btime $results3 = my2DMatrixMax(view($x,1:128,:))
  @btime $results4 = my2DMatrixMax(view($y,:,1:128))

  # Check the calculation was the same with the percentage error
  println(isapprox(results1,results2))
  println(isapprox(results2,results3))
  println(isapprox(results3,results4))
end

Running test() returns the following output:

julia> test()
  1.079 μs (4 allocations: 160 bytes)
  730.280 ns (1 allocation: 64 bytes)
  322.061 ns (1 allocation: 64 bytes)
  845.157 ns (1 allocation: 64 bytes)
true
true
true

Also, running Julia with --check-bounds=no does not change the numbers. My function performs as I would expect with regard to memory ordering, but the built-in function does not. I am also surprised that it is slow in both cases (relative to the results3 example).

Could anybody elucidate the performance I’m seeing?


#8

For view(y,:,1:128), the Base implementation of mapreduce realizes that you are actually looking at a memory-contiguous subset of y and uses fast “linear indexing”. Specifically, this is because Base.IndexStyle(view(y,:,1:128)) returns IndexLinear().

For view(x,1:128,:), Base.IndexStyle returns IndexCartesian(): because the view is not memory-contiguous, it falls back to mapfoldl, which uses the much slower start/next/done iteration. It seems like this is something that could be sped up for AbstractArray by using explicit loops over the coordinates.
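This is easy to check directly (a sketch; note that on Julia 1.x, x' is a lazy Adjoint wrapper, so collect is used here to materialize a plain 2×512 matrix as in the original example):

```julia
x = rand(512, 2)
y = collect(x')  # plain 2×512 Array, as x' was on 0.6

# Full leading dimension plus a contiguous column range → memory-contiguous:
IndexStyle(view(y, :, 1:128))  # IndexLinear()

# A restricted row range is not contiguous in memory:
IndexStyle(view(x, 1:128, :))  # IndexCartesian()
```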