Performance of logarithm calculation (when not to use the dot operator?)

lwabeke · April 24, 2018, 12:19pm

Edits for improved clarify in italics
I realised last week just how slow the log() function really is, as an example, comparing log() to fft(), it turns out log() is about 2 times slower, even though it is working on individual elements, see below:

data = randn(Float32, 100, 10000) + 1im*randn(Float32, 100, 10000)
amplitude = abs.(data)
d = log.(amplitude);
f = fft(data,1);

using BenchmarkTools

julia> @benchmark fft!(data,1)
BenchmarkTools.Trial: 
  memory estimate:  3.25 KiB
  allocs estimate:  51
  --------------
  minimum time:     3.204 ms (0.00% GC)
  median time:      3.327 ms (0.00% GC)
  mean time:        3.495 ms (0.00% GC)
  maximum time:     6.788 ms (0.00% GC)
  --------------
  samples:          1427
  evals/sample:     1

julia> @benchmark d .= log.(amplitude)
BenchmarkTools.Trial: 
  memory estimate:  96 bytes
  allocs estimate:  3
  --------------
  minimum time:     8.664 ms (0.00% GC)
  median time:      8.885 ms (0.00% GC)
  mean time:        9.175 ms (0.00% GC)
  maximum time:     15.413 ms (0.00% GC)
  --------------
  samples:          544
  evals/sample:     1

After some digging I concluded that the problem is that julia is not using simd to compute the log().
Some web searches led me to: http://software-lisc.fbk.eu/avx_mathfun/
where I found "an AVX translation of the SSE2 implementation " (I’m sure there are many other implementations), with some hacking of that code, I managed to get the log(Array) calculation part going doing 8 elements in parallel at a time (by removing the sin and cos functions) to get about a 7 times speed-up, however I ran into 2 issues:

to prevent unnecessary memory allocation, I rather created a log!(outputVector, inputVector)
this only works if I can call it as vector, the dot-fusion operator actually gets in the way here.

Is there a way to address point 2 by overloading a special case of the log. function?
It feels doubtful or am I missing something?

I considered packing this as a normal julia package, however it is platform dependant and would force a non-Julia interface.

At the same time, I’m also wondering if the same problem occurs for sin and cos and whether spending further time on that code to get those working as well, although I’m using them less often.

rdeits · April 24, 2018, 12:48pm

Rather than defining a new log! function, I would recommend just using the broadcast notation, which is more powerful and flexible anyway. You can operate in place and avoid allocation by just doing y .= log.(x) where y is your pre-allocated output array.

lwabeke · April 24, 2018, 12:53pm

I realise I was a bit unclear in my original post:
The fast log function I found operates on 8 elements at a time using simd. If I call the log.() function, it operates on individual elements. I want to try to find a way to do a block call from the fusion operator to this function to do 8 log conversions in parallel.

kristoffer.carlsson · April 24, 2018, 1:05pm

Right now, you would have to use something like https://github.com/JuliaMath/Yeppp.jl or https://github.com/JuliaMath/VML.jl.

Issue to track: https://github.com/JuliaLang/julia/issues/19640

Topic		Replies	Views
Performance of log.(x) General Usage performance , broadcast	11	2691	April 12, 2019
Sum of logs Performance	19	1037	May 10, 2022
Maximizing an objective function with many logs Performance tullio	6	719	April 14, 2021
Sum(log.(p * C)) is 2 to 4 times slower than in NumPy Performance python , matrices	19	1343	January 26, 2021
Speed up simple product accumulator loop Performance loops , tullio	6	855	August 17, 2020

Performance of logarithm calculation (when not to use the dot operator?)

Related topics