Efficient creation of power series matrix or array of arrays

question

#1

Hi, Thank you for all the effort that has gone into getting Julia to this point.
I’m very new to Julia and trying to get my head around when I get faster run-times or smaller floating-point-operation (FPO) counts by handing things over to the compiler, and when I’m better off hand coding something.

Take this example:
Given x is an array of Float64 or BigFloat, create x, x^2, x^3, ..., x^n.

Obviously there are common sub-expressions that I could exploit to reduce the number of calculations, e.g. x2 = x*x; x4 = x2*x2, etc.
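For example, a hand-coded version that reuses the squares might look like this (just a sketch of the idea, not necessarily the fastest layout):

```julia
# Hand-coded reuse of common sub-expressions for x, x^2, x^3, x^4.
x  = [1.0, 2.0, 3.0, 4.0]
x2 = x .* x      # x^2: one multiply per element
x3 = x2 .* x     # x^3 reuses x^2
x4 = x2 .* x2    # x^4 reuses x^2 again
```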

How do I determine if the Julia compiler exploits those relations in something like the following:

x = [1, 2, 3, 4];
collect(ntuple(i -> (x::Array{Int64,1}) .^ (i::Int64), 4))

Alternatively:

julia> @time map(i -> x .^i ,[1,2,3,4])
  0.133190 seconds (72.89 k allocations: 3.724 MiB, 21.80% gc time)
4-element Array{Array{Int64,1},1}:
 [1, 2, 3, 4]    
 [1, 4, 9, 16]   
 [1, 8, 27, 64]  
 [1, 16, 81, 256]

In fact: Is the above the most efficient Julia way to build such a matrix or array of arrays, or am I better off hand coding it (accepting that this might limit the size of n)?

Appreciate any hints or tips.

Update:
David Sanders’ Slack comment pointed to Cassette.jl as a package suited to writing an FPO counter.

Mark Van de vyver
Is there a convenience macro/function that counts the floating point operations in compiled code/function?

David Sanders
You should be able to do that with Cassette.jl
Or you could probably set up a new type that wrapped floating point numbers and counted the operations. But Cassette effectively does that automatically for you

Such a counter would go a long way to answering the question “Which approach ends up using the fewest floating-point-operations”. However, it is beyond my current Julia-fu to work out how to make such a counter from Cassette.jl.
Some related/prior art in Python land is outlined in this Github issue.
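For what it’s worth, the simpler wrapper-type idea David mentions might be sketched like this (the names CountedFloat and OPCOUNT are made up for illustration):

```julia
# Sketch of a number type that counts arithmetic operations performed on it.
const OPCOUNT = Ref(0)

struct CountedFloat <: Number
    x::Float64
end

Base.:*(a::CountedFloat, b::CountedFloat) = (OPCOUNT[] += 1; CountedFloat(a.x * b.x))
Base.:+(a::CountedFloat, b::CountedFloat) = (OPCOUNT[] += 1; CountedFloat(a.x + b.x))

OPCOUNT[] = 0
v = CountedFloat(2.0)
w = v * v * v            # naive x^3: two multiplies
```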

HTH


#2

I would suggest that the first thing to try would simply be benchmarking with BenchmarkTools.jl which will generally provide more reproducible and meaningful results than simply running something once with @time. After all, I would guess from your question that you’re primarily interested in reducing the number of floating-point operations purely as a way to achieve faster run-time, but since run-time is also tied into cache locality, memory throughput, branch prediction, etc., it may be more effective to simply measure the thing you actually care about.
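For example (assuming the BenchmarkTools.jl package is installed), a more reliable comparison of the two variants from your question would be:

```julia
using BenchmarkTools

x = rand(1000)
# Interpolate x with $ so the global variable doesn't distort the timing.
@btime map(i -> $x .^ i, 1:4);
@btime collect(ntuple(i -> $x .^ i, 4));
```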

That said, Julia can certainly provide you as much information as you could ever want to digest. In particular, @code_typed, @code_lowered, @code_llvm, and @code_native provide increasingly low-level depictions of exactly what a given function will do, and if you really want to investigate exactly what is happening at each level, you certainly can. Jameson Nash had a cool talk at the last JuliaCon about building tools to explore all of the levels of compilation and optimization in depth which you might find interesting: https://www.youtube.com/watch?v=l0Go2S_L95M


#3

Thanks for the feedback, tips and pointers. I’ll try the BenchmarkTools and I’ll look into the JuliaCon talk you cite.

Acknowledged that, when dealing with arrays etc., the FPO count might get swamped by memory allocation and other issues/subtleties.

Nonetheless, especially when implementing/comparing algorithms, it is often the metric of interest; an effective/efficient implementation of the algorithm is a distinct issue.

Overall it seems that a FPO count is not yet part of the convenience/std output of BenchmarkTools, or Base.


#4

Unless you just need the first few powers, coding a doubling (divide and conquer) algorithm is the best way to ensure that you get the optimal algorithm.
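A sketch of such a doubling scheme for a single power x^n, assuming n ≥ 0 (this mirrors, but is not, the implementation in Base):

```julia
# Compute x^n with O(log n) multiplications by repeated squaring.
function pow_by_squaring(x, n::Integer)
    n == 0 && return one(x)
    r = one(x)
    while n > 1
        isodd(n) && (r *= x)   # fold in an odd factor
        x *= x                 # square
        n >>= 1                # halve the exponent
    end
    return r * x
end
```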


#5

In my opinion, FPO count is interesting from a theoretical perspective, but not a good measure when it comes to actual performance. Things like memory accesses (and caching thereof), branching (and prediction thereof), SIMD, loop unrolling, and characteristics in the input data, make FPO count a very unreliable measure. As suggested above, I would instead just use BenchmarkTools and the various code_ methods to analyze the code.

In your example, the problem can be solved in a single FPO per element, since to create row k, you can simply multiply row k-1 with the first row. Example:

julia> x = [1,2,3,4];

julia> r = copy(x);

julia> for n = 1:3 println(r .*= x) end
[1, 4, 9, 16]
[1, 8, 27, 64]
[1, 16, 81, 256]

In fact, with SIMD, multiple elements can be handled at the same time, leading to less than a single operation per element.

To avoid the memory access (reading the previous row), and to better exploit the way Julia stores matrices (column-major order), you should probably build the matrix column by column. To take advantage of SIMD then, you could either construct a few columns at a time, or likely better, construct several rows at a time within the same column. E.g. consider the first 8 powers of 3:

julia> println(3 .^ (1:8))
[3, 9, 27, 81, 243, 729, 2187, 6561]

If you can process 4 elements at a time, the next 4 can be constructed by multiplying the previous 4 with the constant value 3^4:

julia> println([3, 9, 27, 81] .* 81)
[243, 729, 2187, 6561]
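Putting that blocked idea together for a single scalar x, a sketch might look like this (the function name powers_blocked is made up):

```julia
# Build x^1..x^n in blocks: each new block is the previous block times x^block.
function powers_blocked(x, n; block=4)
    out = x .^ (1:block)        # seed: x^1 .. x^block
    step = x^block
    result = copy(out)
    while length(result) < n
        out = out .* step       # advance all `block` powers at once
        append!(result, out)
    end
    return result[1:n]
end
```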

#6

Thanks for taking time to respond and for the suggestions, I’ll look into SIMD, and appreciate the alternative you proposed.

The FPO count is a piece of the jigsaw more generally, and I was hoping to explore it and Julia in this simple setting - comparing what Julia ends up compiling to what was coded.

As mentioned before, acknowledged: in some (many?) cases it is likely swamped by other differences/issues that are implementation specific.

It is a separate topic, but given that so much effort can be expended on different algorithms, I’m curious whether such differences survive compilation, and then whether they exert influence greater than the other issues you mention. FPO count is one piece of the (complex) jigsaw.

I’ll definitely be using BenchmarkTools.jl in the way you suggest.

Thanks again.


#7

Not if you need all of the powers in a sequence—then it is faster to multiply by x one at a time. (Note that exponentiation-by-squaring is already the algorithm used for z^n in Julia.)

It sounds like he just wants a Vandermonde matrix, and the fastest way to do this is probably just two loops. For example:

function vander!(V::AbstractMatrix, x::AbstractVector, n=length(x))
    m = length(x)
    (m,n) == size(V) || throw(DimensionMismatch())
    for j = 1:m
        @inbounds V[j,1] = one(x[j])
    end
    for i = 2:n, j = 1:m
        @inbounds V[j,i] = x[j] * V[j,i-1]
    end
    return V
end
vander(x::AbstractVector, n=length(x)) = vander!(Array{eltype(x)}(undef, length(x), n), x, n)

Note that, as is common in Julia, I provide a vander routine that allocates an output matrix, and also a vander! routine that operates on a pre-allocated matrix (which is faster if you need to do this computation many times for different x with the same dimensions).

There are various ways you could modify the above to be faster (by optimizing cache-line or SIMD utilization, for example), but that kind of micro-optimization is only worth it for critical routines.

In contrast, the simplest code to do the above is probably vander(x, n) = x .^ (0:n-1)', which uses broadcasting. However, this is significantly less efficient because it computes each power of x independently (about 20× slower for a 1000×1000 Float64 Vandermonde matrix on my machine).

PS. Change one(x[j]) to x[j] if you want the first column to be x^1 and not x^0.
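For reference, here is how the two routines fit together (repeating the definitions above so the snippet runs standalone):

```julia
function vander!(V::AbstractMatrix, x::AbstractVector, n=length(x))
    m = length(x)
    (m, n) == size(V) || throw(DimensionMismatch())
    for j = 1:m
        @inbounds V[j, 1] = one(x[j])
    end
    for i = 2:n, j = 1:m
        @inbounds V[j, i] = x[j] * V[j, i-1]
    end
    return V
end
vander(x::AbstractVector, n=length(x)) = vander!(Array{eltype(x)}(undef, length(x), n), x, n)

# Allocating version:
V = vander([1.0, 2.0, 3.0])          # columns are x.^0, x.^1, x.^2

# Pre-allocated version, reusable across many calls of the same size:
buf = Matrix{Float64}(undef, 3, 3)
vander!(buf, [2.0, 4.0, 6.0])
```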


#8

I often use this code to create a Vandermonde matrix to help introduce Julia, by the way. The two-loop version is pretty compact and readable, is completely generic to any array-like container and any number-like type supporting one and * (including user-defined types), but performs similarly to the optimized numpy.vander routine on arrays of Float64.

If you look at the numpy.vander implementation, in contrast, it is a thin Python wrapper around a rather long and complicated C routine that is itself basically a shim to dispatch to the correct inner-loop kernel; the lowest level kernels are generated from a custom C-like template syntax. The reason for all of that complexity is to handle a multiplicity of types and array formats, but even so it only handles one container type (NumPy arrays) and a small number of “built-in” numeric scalar types.


#9

Nice! I posted this method above btw, but in a vector format instead of a matrix.

I find the remark about micro-optimization a bit odd/misleading – I consider cache locality crucial in algorithms like this, and not at all a micro-optimization. To see why, reverse the two for loops like this: for j = 1:m, i = 2:n, re-run, and you’ll see the performance drop 10-fold! Same algorithm, same FPO count, but it executes 10 times slower. That leaves it only twice as fast as the naive solution, meaning that cache locality can be more important than picking the right algorithm. (The reason for this is explained in more detail in the link in my previous post.)

SIMD can also lead to enormous improvements; there was a recent topic where we played with vectorized and branchless code and were able to make some sample code several hundred times faster. The code you posted will already be SIMD-vectorized (and loop-unrolled) automatically by the compiler btw, so it’s not something you need to enable manually.

But perhaps you knew all of this already, and meant that although your code is already high-performing, you could still optimize it further.


#10

Exactly. For example, optimizing cache lines further is not just a matter of reordering the loops, it requires blocking so that you do several outer-loop (i) iterations in every inner-loop (j) iteration. And to get the most out of SIMD the compiler typically needs a little help (I didn’t even use @simd above because it would have required me to break up the for statement).

It’s not so much that the performance gains from further optimizations of this sort are small, it’s that the tradeoff in code size/complexity may not be worth it for non-critical code.
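To make the blocking idea concrete, here is a rough sketch that strip-mines the rows so each strip stays in cache across all column iterations (the rowblock size of 64 is an arbitrary illustration, not a tuned value):

```julia
# Blocked variant of the two-loop Vandermonde fill: process rows in strips,
# running all columns for one strip before moving to the next strip.
function vander_blocked!(V::AbstractMatrix, x::AbstractVector, n=length(x); rowblock=64)
    m = length(x)
    V[:, 1] .= one(eltype(V))
    for j0 = 1:rowblock:m
        jhi = min(j0 + rowblock - 1, m)
        for i = 2:n, j = j0:jhi
            @inbounds V[j, i] = x[j] * V[j, i-1]
        end
    end
    return V
end

# Quick check against the direct broadcast construction:
xs = collect(1.0:5.0)
Vb = vander_blocked!(Matrix{Float64}(undef, 5, 5), xs; rowblock=2)
```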


#11

People interested in these issues will likely be interested in the following post by Tim Holy, updated Dec 2018 for Julia 1.1:

Multidimensional algorithms and iteration

I’m running some benchmarks that confirm the above discussions. When I have worked out the use of @simd in the code Steven posted, I’ll post the results here.


#12

Thanks Steven for the pointer to Vandermonde matrices and your intro-to-Julia notes.

The use of SIMD is beyond my ken right now, but the improvement made by Steven’s code (and SpecialMatrices.jl) would be enough for immediate needs.

julia> include("~/src/scratch/vandermonde.jl")
  19.904 ms (1004 allocations: 7.77 MiB)  # My naive map attempt
  2.612 ms (3 allocations: 7.63 MiB)      # SpecialMatrices.jl implementation
  12.217 ms (3 allocations: 7.63 MiB)     # SpecialMatrices.jl re-ordered for loop
  2.097 ms (2 allocations: 7.63 MiB)      # stevengj implementation
  12.770 ms (2 allocations: 7.63 MiB)     # stevengj re-ordered for loop

The code used is:

using BenchmarkTools

# The Vandermonde definitions and implementation using it comes from:
# https://github.com/JuliaMatrices/SpecialMatrices.jl
struct Vandermonde{T} <: AbstractArray{T, 2}
    c :: Vector{T}
end

import Base: getindex, isassigned, size  # needed to extend Base's functions for the new type

getindex(V::Vandermonde, i::Int, j::Int) = V.c[i]^(j-1)
isassigned(V::Vandermonde, i::Int, j::Int) = isassigned(V.c, i)

size(V::Vandermonde, r::Int) = (r==1 || r==2) ? length(V.c) :
    throw(ArgumentError("Invalid dimension $r"))
size(V::Vandermonde) = length(V.c), length(V.c)

function vander1(V::Vandermonde{T}) where T
    n = size(V, 1)
    M = Matrix{Float64}(undef, n, n)
    M[:,1] .= 1
    for j = 2:n
        for i = 1:n
            M[i,j] = M[i,j-1]*V.c[i]
        end
    end
    M
end
# This is vander1 with the order of the loops reversed
function vander2(V::Vandermonde{T}) where T
    n = size(V, 1)
    M = Matrix{Float64}(undef, n, n)
    M[:,1] .= 1
    for i = 1:n
        for j = 2:n
            M[i,j] = M[i,j-1]*V.c[i]
        end
    end
    M
end

function vander3(x::AbstractVector, n=length(x))
    m = length(x)
    V=Matrix{Float64}(undef, m, n)
    V[:,1] .= 1
    for i = 2:n, j = 1:m
        @inbounds V[j,i] = x[j] * V[j,i-1]
    end
    return V
end
# This changes the vander3 loop order: result is approx 6x slower
function vander4(x::AbstractVector, n=length(x))
    m = length(x)
    V=Matrix{Float64}(undef, m, n)
    V[:,1] .= 1
    for j = 1:m, i = 2:n
        @inbounds V[j,i] = x[j] * V[j,i-1]
    end
    return V
end

function vander0(x::AbstractVector)
    m = length(x)
    map(i -> x .^ i, collect(1:m))
end

n = 1000

# @btime const M=Matrix{Float64}(undef, n, n)
# @btime M[:,1] .= 1

a = collect(1:n)

@btime V0 = vander0($a);
@btime V1 = vander1(Vandermonde($a));
@btime V2 = vander2(Vandermonde($a)); 
@btime V3 = vander3($a); 
@btime V4 = vander4($a); 


#13

This is off-topic, but are Vandermonde matrices used in practice in numerical linear algebra applications? I thought they were very badly conditioned (except in some special cases). But since you are using them in a course, I must be mistaken.


#14

I had the same question; turns out: representing Discrete Fourier Transforms and fitting polynomials to points. Likely others too.


#15

I know about the DFT (one of the special cases), but for fitting general polynomials it has numerical problems, and in any case solving a linear system may not be as efficient as using, e.g., orthogonal polynomial families, which are incidentally better conditioned.


#16

I mainly use them for low-degree polynomial least-squares, where conditioning is not a problem. For Julia introductions, though, they are a nice example because they are easy to explain, easy to implement, and are available as a “built-in” library function in many high-level languages for comparison.
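For instance, the low-degree least-squares use looks like this: build an m×3 Vandermonde matrix for a quadratic and solve with the backslash operator (the data here is a made-up example with an exact quadratic, so the fit recovers the coefficients exactly):

```julia
using LinearAlgebra

# Fit a quadratic c1 + c2*x + c3*x^2 by least squares via a Vandermonde matrix.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = 2 .+ 3 .* x .+ 0.5 .* x .^ 2
V = x .^ (0:2)'        # columns: 1, x, x^2
c = V \ y              # least-squares solution of V*c ≈ y
```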

(I’ve never seen a case where the construction of the Vandermonde matrix was performance-critical, though.)