Latency discrepancy in a busy loop with and without IO

Hi!

I am trying to use Julia in the same domain where I previously used C/C++ (low latency).
In general, if you avoid allocations and do everything properly (type stable), the speed is impressive and comparable with C++.

One thing I discovered recently puzzled me, though; maybe someone can explain what is going on here.

I have an application that runs a busy (infinite) loop to check whether any data is available (via a C API) and then process it (the loop is basically the main application cycle). Reaction time is key here. What I found is that if I do some work (parsing etc.) in real time (right after the data-availability check), it takes 10 µs to parse the data, but if I replay the same data in a loop (i.e. reading it from a file), it takes 0.5-1 µs. I tried to simulate it with this example:

GC.enable(false)

# busy-wait for ~10 ms without yielding to the scheduler
@inline function soft_wait()
    target = time_ns() + 10_000_000  # 10 ms in ns
    while time_ns() < target
        #yield()
    end
end

# no-op stand-in for the wait, to benchmark the hot-loop case
@inline function none()
end

function bench(waitfn)
    dt_sum = 0
    iter_count = 100
    f = 0.0  # accumulator so the parse results are actually used

    for i in 1:iter_count
        waitfn()

        # do job
        t1 = time_ns()
        for j in 1:10
            f += Base.parse(Float64, "1.23342100000123")
        end
        t2 = time_ns()

        dt = t2 - t1
        dt_sum += dt
    end

    println("dt=", dt_sum/iter_count/1000, " us")
end

println("none")
for i in 1:10
bench(none)
end

println("wait")
for i in 1:10
bench(soft_wait)
end

println("none")
for i in 1:10
bench(none)
end


Results I got (in microseconds):

none
dt=1.25096 us
dt=1.16522 us
dt=1.13798 us
dt=1.13917 us
dt=1.13479 us
dt=1.14266 us
dt=2.31419 us
dt=1.5399 us
dt=1.54975 us
dt=1.5331 us
wait
dt=6.51385 us
dt=6.15658 us
dt=6.366 us
dt=7.63653 us
dt=6.78932 us
dt=7.07114 us
dt=5.68233 us
dt=6.79439 us
dt=7.0995 us
dt=7.70668 us
none
dt=1.59886 us
dt=1.57142 us
dt=1.617109 us
dt=2.73608 us
dt=1.57679 us
dt=1.79721 us
dt=1.58004 us
dt=2.63929 us
dt=1.5768 us
dt=1.57579 us

If you replace soft_wait with sleep, it becomes even worse.
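For reference, a minimal sketch of that variant (a reconstruction, not the exact code I ran):

@inline function sleep_wait()
    sleep(0.01)   # yields to Julia's task scheduler; OS timer resolution applies
end

and then call bench(sleep_wait) instead of bench(soft_wait).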

Why is there such a difference? I did the same benchmark in C++ (using sleep), and the latencies there are consistent, i.e. the same. I tried both Julia 1.11 and 1.10 on AlmaLinux, on two different boxes: one physical (3x difference), one VM (above 5x difference).

Weirdly, I don't see that stark or consistent a difference in the simulation, which, granted, may not be entirely faithful to the real situation. This is also on Windows, which is likely a significant difference.
none
dt=9.157 us
dt=8.018 us
dt=7.887 us
dt=7.316 us
dt=7.002 us
dt=6.342 us
dt=6.342 us
dt=6.418 us
dt=6.524 us
dt=8.164 us
wait
dt=8.512 us
dt=9.639 us
dt=8.169 us
dt=9.089 us
dt=9.763 us
dt=8.708 us
dt=9.654 us
dt=15.863 us
dt=33.01 us
dt=13.782 us
none
dt=6.796 us
dt=7.984 us
dt=8.556 us
dt=9.201 us
dt=12.353 us
dt=7.697 us
dt=8.678 us
dt=8.499 us
dt=7.182 us
dt=7.992 us

However, I do see the starker difference if I switch the wait function to Libc.systemsleep (~39 µs) or sleep (~30 µs), so this is a real effect. My immediate thought is that it's the state of the CPU and the multitasking OS, which could be many things. When a process truly sleeps (unlike your soft_wait), the OS takes the opportunity to do work on other processes; oversleeping often happens because the OS was stuck in another high-priority process before it could make its way back. Waiting also means the caches are not in the same state as in the unwaiting version, which can change performance. And the CPU may downclock to save energy at the cost of performance when it thinks it's not busy.

What gives me pause is that you don't observe the same thing with sleep in C++, where people have run into this phenomenon (links below). However, they are using different systems, and the people who don't see it aren't as likely to mention it.
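One way to probe this (a quick sketch of mine reusing your parse job, not something you posted) is to print each iteration's time instead of the average, so a slow first parse right after the wait (cold caches, downclocked CPU) would stand out:

# hedged sketch: same job as bench(), but with per-iteration timings
function bench_each(waitfn; iters = 20)
    f = 0.0
    for i in 1:iters
        waitfn()
        t1 = time_ns()
        for j in 1:10
            f += Base.parse(Float64, "1.23342100000123")
        end
        t2 = time_ns()
        println("iter $i: ", (t2 - t1) / 1000, " us")
    end
    return f   # return the accumulator so the work can't be optimized away
end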

Incidentally, it would also explain why your hard-coded 10 ms `soft_wait()` is more accurate and consistent than either `Libc.systemsleep(0.01)` or `sleep(0.01)`, neither of which your benchmark timed. Busy-waiting is a known technique for reducing latency, but it has its issues in a multitasking context; a common compromise is sketched after the timings below.
julia> @benchmark soft_wait()
BenchmarkTools.Trial: 500 samples with 1 evaluation per sample.
 Range (min … max):  10.000 ms … 10.052 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     10.000 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.001 ms ±  5.571 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █
  ██▄▁▄▁▅▄▁▄▄▁▅▅▄▅▁▁▄▁▄▁▁▁▄▄▁▁▁▁▁▅▄▄▁▁▅▁▄▄▁▁▁▁▁▁▁▁▁▁▄▁▁▄▁▄▁▄▅ ▆
  10 ms        Histogram: log(frequency) by time        10 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark Libc.systemsleep(0.01) # doesn't yield to task queue
BenchmarkTools.Trial: 321 samples with 1 evaluation per sample.
 Range (min … max):  12.887 ms …  22.358 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     15.455 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   15.539 ms ± 663.148 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                              ▅▆▆█▄█▂
  ▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▂▄▃▇▅███████▇▆█▇▄▅▃▃▂▁▁▂▂▃▂▁▁▁▁▁▁▁▁▁▂ ▃
  12.9 ms         Histogram: frequency by time         17.7 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sleep(0.01) # if in task, yields to queue
BenchmarkTools.Trial: 322 samples with 1 evaluation per sample.
 Range (min … max):  13.661 ms …  22.372 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     15.459 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   15.516 ms ± 595.392 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                   ▄█▅▇▄▅▄▁▁ █▁▂
  ▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▃▃▁▁▃▃▃▄▄▅▅▃▅▄▅█▅█████████▇████▇▆▇▇▆▄▇▃▅▃▃▃▃ ▄
  13.7 ms         Histogram: frequency by time         16.4 ms <

 Memory estimate: 112 bytes, allocs estimate: 4.
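If burning a whole core bothers you, that compromise (my sketch, not from your post; the 2 ms spin margin is an arbitrary guess) is to sleep through most of the interval and busy-spin only the tail:

# hedged sketch: free the core for most of the wait, spin the final stretch;
# spin_ns should exceed the worst-case oversleep of sleep() to be useful
function hybrid_wait(seconds; spin_ns = 2_000_000)
    target = time_ns() + round(UInt64, seconds * 1e9)
    coarse = seconds - spin_ns / 1e9
    coarse > 0 && sleep(coarse)   # coarse wait, may overshoot the margin
    while time_ns() < target      # fine-grained busy-wait up to the target
    end
end
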
It's worth noting that performance variation occurs even when you're just doing one very specific thing in a loop. That parse call goes as low as 627 ns, which is about 0.1x the lower timings of the 10-iteration loop, but it can go as high as 5.25 µs, just because my CPU and OS felt like it. The distribution of timings is fairly spread out up to 1.4 µs, so the variation in my results isn't surprising.
julia> @benchmark Base.parse(Float64, "1.23342100000123")
BenchmarkTools.Trial: 10000 samples with 171 evaluations per sample.
 Range (min … max):  627.485 ns …   5.249 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     637.427 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   688.542 ns ± 166.162 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅▃▁  ▁▁▁▁       ▂▁                                           ▁
  ████▇▅█████▆▅▄▆▆▆███▇▇▇▇▇▆▆▅▆▅▅▅▅▅▆▆▇▆▆▆▅▆▅▆▅▆▅▆▅▅▄▅▇▅▅▅▄▄▄▅▄ █
  627 ns        Histogram: log(frequency) by time        1.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Thanks, Benny.

I think you're right. As for a busy wait (i.e. we don't sleep, just loop), my guess is that it can happen due to context switches as well.
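One way to check that guess on Linux (a hedged sketch; the field names are as they appear in /proc/self/status) is to count the process's context switches around a run:

# read the voluntary/nonvoluntary context-switch counters for this process
function ctxt_switches()
    vals = Dict{String,Int}()
    for line in eachline("/proc/self/status")
        if startswith(line, "voluntary_ctxt_switches") ||
           startswith(line, "nonvoluntary_ctxt_switches")
            k, v = split(line, ':')
            vals[strip(k)] = parse(Int, strip(v))
        end
    end
    return vals
end

before = ctxt_switches()
bench(soft_wait)
after = ctxt_switches()
println("involuntary switches during run: ",
        after["nonvoluntary_ctxt_switches"] - before["nonvoluntary_ctxt_switches"])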

Probably what is happening is that the Linux scheduler (CFS, the "completely fair" scheduler) schedules other tasks when there are not enough cores and one task has been using a lot of CPU for some time.

So I think this benchmark is bogus with the current setup, and one should use an isolated core / CPU pinning to do it properly. I would guess these effects are gone in that case.
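For example (hedged: assumes Linux with core 3 isolated via isolcpus=3 on the kernel command line, and the ThreadPinning.jl package installed):

# pin from the shell before starting Julia:
#   $ taskset -c 3 julia --threads=1 bench.jl
#
# or pin from within Julia:
using ThreadPinning
pinthreads([3])   # pin Julia's thread(s) to CPU 3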