Cool Case of Faster than C++ performance

dataDiver · September 25, 2019, 2:25pm

This query was originally going to be me complaining about why I was getting inconsistent performance over some benchmark times, it turns out that it was at some kind of cache threshold and modifying the size of the vector fixed the issue.

My original code was in R and I decided to speed things up by implementing it in C++ using Rcpp:

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
#include <functional>
#include <cmath>
#include <Rcpp.h>
using namespace Rcpp;

double bananaObjective(NumericVector x)
{
  return 100*std::pow(x[1] - std::pow(x[0], 2.0), 2.0) + std::pow(1 - x[0], 2.0);
}

//[[Rcpp::export]]
NumericVector bananaObjective(NumericMatrix x)
{
  int n = x.nrow();
  NumericVector output(x.nrow());
  for(int i = 0; i < n; ++i)
  {
    NumericVector tmp = x(i, _);
    output[i] = bananaObjective(tmp);
  }
  return output;
}

//[[Rcpp::export]]
NumericVector benchmarkBanana(int len, int n)
{
  NumericVector output(len);
  for(int i = 0; i < len; ++i)
  {
    //arma::mat tmp(n, 2, arma::fill::randu);
    NumericVector v = runif(n*2);
    v.attr("dim") = Dimension(n, 2);
    NumericMatrix tmp = as<NumericMatrix>(v); 
    arma::wall_clock timer;
    timer.tic();
    auto x = bananaObjective(tmp);
    output[i] = timer.toc();
  }
  return output;
}

Okay before I carry on, there is this horrible myth that you can get okay performance from R if you pre-cache a vector even in a loop. The performance in an R loop is always bad but sometimes it can be less bad if you pre-cache. Anyway we can get a x10 speedup just by moving the functions to C++, here is the equivalent R code and benchmark for comparison:

.bananaObjective = function(x)
  100*(x[2] - (x[1])^2)^2 + (1 - x[1])^2

RBananaObjective = function(x)
{
  n = nrow(x)
  output = vector(mode = "numeric", length = n)
  for(i in 1:n)
    output[i] = .bananaObjective(x[i, ])
  return(output)
}

require(Rcpp)
sourceCpp("test.cpp")

n = 5E5; x = matrix(runif(n*2), ncol = 2)
system.time(RBananaObjective(x))
#   user  system elapsed
#  0.540   0.000   0.541
system.time(bananaObjective(x))
#   user  system elapsed
#  0.047   0.000   0.047

Okay great, now we can run the benchmark function in C++ 100 times maybe the returning to R each time has an overhead - apparently it doesn’t:

x = benchmarkBanana(100, 500000)
mean(x)
# [1] 0.04853786

So what about Julia?

function bananaObjective(x::Array{T, 1}) where {T <: AbstractFloat}
  return 100*(x[2] - (x[1])^2)^2 + (1 - x[1])^2
end

function bananaObjective(x::Array{T, 2}) where {T <: AbstractFloat}
  n = size(x)[1]
  output = zeros(T, n)
  for i in 1:n
    output[i] = bananaObjective(x[i, :])
  end
  return output
end


# bench(Float64, 100, 1000_000)
function bench(T::Type{<: AbstractFloat}, len::I, n::I) where {I <: Integer}
  times = zeros(T, len)
  for i in 1:len
    x = rand(T, n, 2)
    t1 = time(); bananaObjective(x); t2 = time()
    times[i] = t2 - t1
  end
  return times
end

import Statistics: mean
x = bench(Float64, 100, 500_000)
mean(x)
# 0.019942469596862793

That’s more than x2 the speed of the code implemented with Rcpp. What’s more this code is much easier to write than the C++ code and far more flexible and generic.

kristoffer.carlsson · September 25, 2019, 3:09pm

Cool! Just a few notes. We have a pretty good benchmarking tool (BenchmarkTools.jl) that can be used to easily benchmark functions without having to set up your own benchmark loop. We can install it and import it with:

julia> import Pkg; Pkg.add("BenchmarkTools")

julia> using BenchmarkTools

Now we simply prepend the function call we want to benchmark with @btime and it will run as many times as needed to get a representative result:

julia> const x = rand(100, 2)

julia> @btime bananaObjective(x);
  1.950 μs (101 allocations: 10.25 KiB)

Looking at your code. A good thing to know is that in Julia, matrices are column-major. Doing x[:, i] is, therefore, more efficient than x[i, :]. Another thing to note is that x[i, :] creates a copy of the row. That is not needed here, we can just use what is called a “view” over the original matrix. This can be done by prepending the indexing with @view.
Also, your function argument types are a bit restrictive which is unnecessary (and doesn’t help performance)

So, in total, we could write something like

function bananaObjective(x::AbstractVector{<:Number}) 
    return 100*(x[2] - (x[1])^2)^2 + (1 - x[1])^2
end

function bananaObjective(x::AbstractMatrix{T}) where T <: Number
    n = size(x, 2)
    output = zeros(T, n)
    for i in 1:n
        output[i] = bananaObjective(@view x[:, i])
    end
    return output
end

and benchmarking it as

julia> const x = rand(2, 100);

julia> @btime bananaObjective(x);
  140.582 ns (1 allocation: 896 bytes)

which is quite a lot faster (and we can see that the allocations have disappeared).

Going further, for arrays of this dimension it might make sense to use https://github.com/JuliaArrays/StaticArrays.jl but I don’t want to overwhelm you so I’ll skip that for now

Edit: Actually some StaticArray stuff can be found under the fold

StaticArrays

using StaticArrays
using BenchmarkTools

function bananaObjective(x::AbstractVector{<:Number}) 
    return 100*(x[2] - (x[1])^2)^2 + (1 - x[1])^2
end

function bananaObjective(x::AbstractVector{SVector{2, T}}) where T <: Number
    n = length(x)
    output = zeros(T, n)
    for i in 1:n
        output[i] = bananaObjective(x[i])
    end
    return output
end

const x = [rand(SVector{2}) for i in 1:100]
@btime bananaObjective(x)
# 112.251 ns (1 allocation: 896 bytes)

StaticArrays with preallocation of output and @inbounds to give automatic SIMD

using StaticArrays
using BenchmarkTools

function bananaObjective(x::AbstractVector{<:Number}) 
    return 100*(x[2] - (x[1])^2)^2 + (1 - x[1])^2
end

function bananaObjective!(output, x::AbstractVector{SVector{2, T}}) where T <: Number
    n = length(x)
    @inbounds for i in 1:n
        output[i] = bananaObjective(x[i])
    end
    return output
end

const x = [rand(SVector{2}) for i in 1:100]
const output = zeros(Float64, 100)
@btime bananaObjective!(output, x)
# 34.359 ns (0 allocations: 0 bytes)

piever · September 25, 2019, 3:40pm

If instead of execution time you want to minimize number of lines of code and increase readability, you can also use eachcol which return an iterator of columns of the matrix, so for example:

# here it's important to allow AbstractVector as we'll be using views
function bananaObjective(x::AbstractVector{<:Number})
    return 100*(x[2] - (x[1])^2)^2 + (1 - x[1])^2
end

x = rand(2, 100)

map(bananaObjective, eachcol(x))

But in practice for this case (length 2) StaticArrays is probably the right tool for the job.

dataDiver · September 25, 2019, 3:47pm

Thanks for your suggestions. To be honest I totally forgot about speed optimization, I was just comparing equivalent ‘vanilla’ implementations. Here’s something else, we can get > x10 speedup by getting rid of the sub-function:

function bananaObjective(x::Array{T, 2}) where {T <: AbstractFloat}
  n = size(x)[1]
  output = zeros(T, n)
  for i in 1:n
    output[i] = 100*(x[i, 2] - (x[i, 1])^2)^2 + (1 - x[i, 1])^2
  end
  return output
end
x = bench(Float64, 100, 500_000); mean(x)
# 0.0013012647628784179

Of course doing the same thing in C++ also gives a huge speedup:

//[[Rcpp::export]]
NumericVector bananaObjective(NumericMatrix x)
{
  int n = x.nrow();
  NumericVector output(x.nrow());
  for(int i = 0; i < n; ++i)
  {
    output[i] = 100*std::pow(x(i, 1) - std::pow(x(i, 0), 2.0), 2.0) + std::pow(1 - x(i, 0), 2.0);
  }
  return output;
}

Then in R

x = benchmarkBanana(100, 500000); mean(x)
# [1] 0.001865685

If find that with static arrays, you can get the same performance from regular arrays with easy speed optimizations such as @simd and @inbounds and other hacks. I also have an illness that makes me reticent about using more packages than I need. But the column major Julia matrices are highly relevant for speed - though it’s the same with R.

[later…] Okay I’ve just tried an equivalent implementation with the StaticArrays above but using vanilla arrays and with regular arrays it’s 1/3 slower so it’ probably worth using StaticArrays for this.

Topic		Replies	Views
Why is my code very slow General Usage performance	16	667	June 23, 2023
A problem about performance Performance	30	712	October 3, 2022
Ironic observation about `sort` and `sortperm` speed for "small integers" vs R Performance sort , sortperm , r	32	4623	February 4, 2018
OpenBLAS: Julia slower than R Performance linearalgebra	41	7738	March 26, 2019
Performance inv & solve in Julia & R Performance	15	1413	August 11, 2020

Cool Case of Faster than C++ performance

Related topics