Performance of logarithm calculation (when not to use the dot operator?)


#1

Edits for improved clarify in italics
I realised last week just how slow the log() function really is, as an example, comparing log() to fft(), it turns out log() is about 2 times slower, even though it is working on individual elements, see below:

data = randn(Float32, 100, 10000) + 1im*randn(Float32, 100, 10000)
amplitude = abs.(data)
d = log.(amplitude);
f = fft(data,1);

using BenchmarkTools

julia> @benchmark fft!(data,1)
BenchmarkTools.Trial: 
  memory estimate:  3.25 KiB
  allocs estimate:  51
  --------------
  minimum time:     3.204 ms (0.00% GC)
  median time:      3.327 ms (0.00% GC)
  mean time:        3.495 ms (0.00% GC)
  maximum time:     6.788 ms (0.00% GC)
  --------------
  samples:          1427
  evals/sample:     1

julia> @benchmark d .= log.(amplitude)
BenchmarkTools.Trial: 
  memory estimate:  96 bytes
  allocs estimate:  3
  --------------
  minimum time:     8.664 ms (0.00% GC)
  median time:      8.885 ms (0.00% GC)
  mean time:        9.175 ms (0.00% GC)
  maximum time:     15.413 ms (0.00% GC)
  --------------
  samples:          544
  evals/sample:     1

After some digging I concluded that the problem is that julia is not using simd to compute the log().
Some web searches led me to: http://software-lisc.fbk.eu/avx_mathfun/
where I found "an AVX translation of the SSE2 implementation " (I’m sure there are many other implementations), with some hacking of that code, I managed to get the log(Array) calculation part going doing 8 elements in parallel at a time (by removing the sin and cos functions) to get about a 7 times speed-up, however I ran into 2 issues:

  1. to prevent unnecessary memory allocation, I rather created a log!(outputVector, inputVector)
  2. this only works if I can call it as vector, the dot-fusion operator actually gets in the way here.

Is there a way to address point 2 by overloading a special case of the log. function?
It feels doubtful or am I missing something?

I considered packing this as a normal julia package, however it is platform dependant and would force a non-Julia interface.

At the same time, I’m also wondering if the same problem occurs for sin and cos and whether spending further time on that code to get those working as well, although I’m using them less often.


#2

Rather than defining a new log! function, I would recommend just using the broadcast notation, which is more powerful and flexible anyway. You can operate in place and avoid allocation by just doing y .= log.(x) where y is your pre-allocated output array.


#3

I realise I was a bit unclear in my original post:
The fast log function I found operates on 8 elements at a time using simd. If I call the log.() function, it operates on individual elements. I want to try to find a way to do a block call from the fusion operator to this function to do 8 log conversions in parallel.


#4

Right now, you would have to use something like https://github.com/JuliaMath/Yeppp.jl or https://github.com/JuliaMath/VML.jl.

Issue to track: https://github.com/JuliaLang/julia/issues/19640