Why/where does this allocate?

I think this function in Julia allocates once, 16 bytes, at `poptask` (I can’t easily run that on its own):

julia> function wait()
           GC.safepoint()
           W = Base.workqueue_for(Threads.threadid())
           Base.poptask(W)
           result = Base.try_yieldto(Base.ensure_rescheduled)  # probably not here
           Base.process_events()  # NOT here
           # return when we come out of the queue
           return result
       end

It’s called from:

julia> s = "Palli"; @time (Base.iolock_begin(); Base.uv_write(stdout, convert(Ptr{UInt8}, pointer(s)), convert(UInt, 7)))
Palli  0.000051 seconds (2 allocations: 32 bytes)
7

Those are the 2 remaining allocations I’m trying to eliminate from `print` (plus one hidden one):

# caller must have acquired the iolock
julia> function uv_write(s::Base.LibuvStream, p::Ptr{UInt8}, n::UInt)
           uvw = Base.uv_write_async(s, p, n)
           ct = current_task()
           Base.preserve_handle(ct)
           Base.sigatomic_begin()
           Base.uv_req_set_data(uvw, ct)
           Base.iolock_end()
           local status
           try
               Base.sigatomic_end()
               # wait for the last chunk to complete (or error)
               # assume that any errors would be sticky,
               # (so we don't need to monitor the error status of the intermediate writes)
           status = wait()::Cint  # 1 allocation of 16 bytes
               Base.sigatomic_begin()
           finally
               # try-finally unwinds the sigatomic level, so need to repeat sigatomic_end
               Base.sigatomic_end()
               Base.iolock_begin()
               # ct.queue === nothing || list_deletefirst!(ct.queue::IntrusiveLinkedList{Task}, ct)
               # if uv_req_data(uvw) != C_NULL
                   # uvw is still alive,
                   # so make sure we won't get spurious notifications later
               #    uv_req_set_data(uvw, C_NULL)
               # else
                   # done with uvw
               #    Libc.free(uvw)
               # end
               Base.iolock_end();
               Base.unpreserve_handle(ct)
           end
           if status < 0
               throw(Base._UVError("write", status))
           end
           return Int(n)
       end

Note: I’ve commented out a bit compared to what’s in Base, and added `Base.` in some places so that I can easily substitute my version.

I believe nothing up to the `finally` allocates there except `wait()`, and then something after it. `uv_write_async` DOES allocate with `Libc.malloc`, which is then freed in that other function… maybe not a good code pattern; at first I thought I had found a memory leak. I know about that hidden allocation. Any good idea about how to eliminate that less-bad allocation as well?
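For context, the hidden allocation follows a common libuv pattern: `Libc.malloc` in one function, `Libc.free` in another (the writer allocates the request, the completion callback frees it). A minimal sketch of that ownership hand-off, with made-up names (`make_request`/`complete_request` are not Base functions):

```julia
# Sketch of the malloc-here/free-there ownership pattern (hypothetical names).
# The allocating function hands the raw pointer off; whoever receives it
# becomes responsible for freeing it exactly once.
function make_request(n::Int)
    p = Libc.malloc(n)            # untracked by the GC; @time won't count it
    p == C_NULL && throw(OutOfMemoryError())
    return Ptr{UInt8}(p)
end

function complete_request(p::Ptr{UInt8})
    # ... a real callback would consume the request here ...
    Libc.free(p)                  # the *only* place that frees it
end

req = make_request(16)
unsafe_store!(req, 0x42, 1)       # write into the raw buffer
val = unsafe_load(req, 1)
complete_request(req)
```

Because the memory never goes through the GC, `@time`, `@allocated`, and `@allocations` are all blind to it, which is exactly why this allocation stays "hidden".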

I’m a bit confused. How can you tell if this is really allocating, unless you measure with BenchmarkTools.jl and variable interpolation?

`@time` and `@allocated` are always reliable proof that you get an allocation somewhere(?!).

I always get (at least) 2 allocations.

Actually, for the code I’m looking at I get 2 OR 3 allocations. So if I eliminate those 2 (but not the occasional third one), those ways, as well as `@btime` and `@benchmark`, would show 0 allocations (for the minimum).

Getting there is my first step; knowing you always have 0 is a tougher goal. There’s a tool for that, but I think it would also miss the hidden 4th allocation.

The classic advice is “Don’t benchmark with non-const globals”, and s is a non-const global. As far as I remember, the resulting number of allocations is between zero and a lot.
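As a quick illustration of that advice (all names here are made up for the demo): reading a non-const global is type-unstable, so the result gets boxed, which shows up as extra allocations that vanish once the global is `const` (or interpolated with `$` in BenchmarkTools):

```julia
g_val = 1.0          # non-const global: type-unstable access
const c_val = 1.0    # const global: type-stable

add1(x) = x + 1.0
call_global() = add1(g_val)   # return type is Any, so the result is boxed
call_const()  = add1(c_val)   # fully inferred, no allocation

call_global(); call_const()   # warm up (compile) first

a_global = @allocated call_global()
a_const  = @allocated call_const()
```

After warm-up, `a_global` is nonzero (the boxed `Float64`) while `a_const` is 0, even though both compute the same thing.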

I just wonder why you don’t use the reliable method, that’s all.


Fair enough, but I believe it doesn’t matter for allocations, and `@btime`, if you mean that, is worse here.

What’s reliable here?

@time print(s_now_const)  # 1 either way, note println does more... I'm working on both.

The real reason I don’t use e.g.:

@benchmark print($s_now_const)
PalliPalliPalli [... cluttering my REPL until CTRL-C; also I just learned it's not safe in general (see a recent issue), though I believe it's ok here]

Yes, without interpolation and that tool, your code tends to be (partially or fully) optimized away, but I think never with side effects like printing. And in all cases it would show the same or fewer allocations, never more(?).

I wasn’t really looking for timings; I’m just used to doing `@time` for (timing and) allocations. `@allocations` is the macro for allocation counts only; it always shows GC-tracked allocations, and doesn’t support interpolation.
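For reference, a quick sketch of `@allocations` (available since Julia 1.9) next to `@allocated`. Both count only GC-tracked allocations, so `Libc.malloc` calls stay invisible to them:

```julia
make_buf() = Vector{UInt8}(undef, 32)   # GC-tracked allocation
make_buf()                              # warm up so compilation isn't counted

n_allocs = @allocations make_buf()      # number of allocations
n_bytes  = @allocated  make_buf()       # bytes allocated

# Libc.malloc/Libc.free are invisible to both counters:
p = Libc.malloc(32)
hidden = @allocations Libc.free(p)
```

So these macros are reliable for GC allocations, but a malloc-based "hidden" allocation like the uv request needs a different tool (e.g. a memory profiler or valgrind).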

I think this might be similar to this. If I remember correctly, `wait()` checks with the scheduler whether any other tasks or libuv “events” are ready, and will switch to them / call libuv callbacks, so you might see allocations from those tasks here (but I might be terribly wrong, you have to check). There is a `Ref` allocation when you write to IO, so maybe that’s it?


I did try the steps from here:

GitHub - JuliaCI/Coverage.jl: Take Julia code coverage and memory allocation…

and while I did get a file, it’s rather useless, showing only this line in it:

const s = "Palli"; @time (Base.iolock_begin(); Base.uv_write(stdout, convert(Ptr{UInt8}, pointer(s)), convert(UInt, 7)))

While that’s the cause, the strange thing is that when I removed the file and tried again, I couldn’t get a file with `user` or `all`, even when going further into the instructions:

julia> analyze_malloc(".")
ERROR: IOError: stat("./.julia/artifacts/e9b6449c9e90ab3718db7e31df350994ebe44003/include/gap/src/src/src/src/src/src/src/src/src/src/src/src/src/src/src/src/src/src/src/src/src/src/src/src/src/src/src/src/src/src/src/src/src/src/src/src/src/src/src/src/src"): too many symbolic links encountered (ELOOP)

I think I’ve confirmed that `wait` must be involved, and that `poptask`, which it calls, does the allocation, at least indirectly, likely because it calls:

so maybe the cause is here:

Where is the Ref?

For the new `uv_write_new` I’m making (to integrate `uv_write_async_new` into `uv_write`, and to work on the hidden `Libc.malloc` allocation), I get:

@time (Base.iolock_begin(); uv_write_new(stdout, convert(Ptr{UInt8}, pointer("Palli")), convert(UInt, 7)))
Palli  0.039388 seconds (12.11 k allocations: 588.969 KiB, 99.88% compilation time)
double free or corruption (out)

[2983938] signal 6 (-6): Aborted
in expression starting at REPL[14]:1

It seems to work, since it prints and `@time` works, but why might I get “double free or corruption (out)”? I’m not freeing at all (ok with a memory leak for now).

julia> function uv_write2(s::Base.LibuvStream, p::Ptr{UInt8}, n::UInt)
           #       function uv_write_async_new(s::Base.LibuvStream, p::Ptr{UInt8}, n::UInt)
           Base.check_open(s)
           local uvw  # declare before the loop so it is still in scope after `break`
           while true
               # NOTE: this Memory is not rooted anywhere (only its raw pointer is kept),
               # so the GC may reclaim it while libuv still uses the pointer, and the
               # write callback may end up Libc.free-ing a GC-managed pointer.
               uvw = Memory{UInt8}(undef, Base._sizeof_uv_write % Int).ptr  # pointer(Memory{UInt8}(undef, Base._sizeof_uv_write % Int))  # Libc.malloc(Base._sizeof_uv_write)
               Base.uv_req_set_data(uvw, C_NULL) # in case we get interrupted before arriving at the wait call
               nwrite = min(n, Base.MAX_OS_WRITE) # split up the write into chunks the OS can handle.
               # TODO: use writev instead of a loop
               err = ccall(:jl_uv_write,
                           Int32,
                           (Ptr{Cvoid}, Ptr{Cvoid}, UInt, Ptr{Cvoid}, Ptr{Cvoid}),
                           s, p, nwrite, uvw,
                           @cfunction(Base.uv_writecb_task, Cvoid, (Ptr{Cvoid}, Cint)))
               if err < 0
                   # Libc.free(uvw)
                   Base.uv_error("write", err)
               end
               n -= nwrite
               p += nwrite
               if n == 0
                   break  # was `return uvw`, which returned before ever reaching the wait below
               end
           end;
       # end; uvw = uv_write_async_new(s, p, n)
                  ct = current_task()
                  Base.preserve_handle(ct)
                  Base.sigatomic_begin()
                  Base.uv_req_set_data(uvw, ct)
                  Base.iolock_end()
                  local status
                  try
                      Base.sigatomic_end()
                      # wait for the last chunk to complete (or error)
                      # assume that any errors would be sticky,
                      # (so we don't need to monitor the error status of the intermediate writes)
                      status = wait()  # 1 allocation of 16 bytes
                      Base.sigatomic_begin()
                  finally
                      # try-finally unwinds the sigatomic level, so need to repeat sigatomic_end
                      Base.sigatomic_end()
                      Base.iolock_begin()
                      # ct.queue === nothing || list_deletefirst!(ct.queue::IntrusiveLinkedList{Task}, ct)
                      # if uv_req_data(uvw) != C_NULL
                          # uvw is still alive,
                          # so make sure we won't get spurious notifications later
                      #    uv_req_set_data(uvw, C_NULL)
                      # else
                          # done with uvw
                      #    Libc.free(uvw)
                      # end
                      Base.iolock_end();
                      Base.unpreserve_handle(ct)
                  end
                  # if status < 0
                  #    throw(_UVError("write", status))
                  # end
                   return Int(n)  # NB: n has been decremented to 0 by the inlined loop; Base's uv_write still has the caller's n here
              end
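A likely explanation for the “double free or corruption”, assuming the callback behaves as in Base: when the request’s data field is `C_NULL`, `uv_writecb_task` assumes the request has no owner and calls `Libc.free` on it. But the pointer above comes from a GC-managed `Memory`, not from `Libc.malloc`, so glibc aborts when asked to free it (and the unrooted `Memory` may additionally be collected while libuv still writes through the pointer). The safe pattern for handing a GC-managed buffer to C is to root it with `GC.@preserve` for the whole time C may use it; in the real code that would have to span the `wait()` until the write callback has fired. A minimal sketch, using `memcpy` as a stand-in for the libuv call:

```julia
# Sketch: rooting a GC-managed buffer while C code uses its raw pointer.
buf = Memory{UInt8}(undef, 16)
fill!(buf, 0x00)
str = "Palli"

GC.@preserve buf str begin
    # Inside @preserve, `buf` cannot be freed or moved by the GC,
    # so its raw pointer stays valid for the duration of the ccall.
    ccall(:memcpy, Ptr{Cvoid}, (Ptr{UInt8}, Ptr{UInt8}, Csize_t),
          pointer(buf), pointer(str), ncodeunits(str))
end

out = GC.@preserve buf unsafe_string(pointer(buf), 5)
```

Crucially, C must never `Libc.free` such a pointer; either keep `Libc.malloc` for anything the callback frees, or change the callback protocol so it never frees GC-owned memory.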

EDIT: I get my 0 allocations for tuples (not helping with the main issue of the thread) with `@generated` in front of the function here, getting me an `NTuple`! My first use of that obscure feature. But it still doesn’t help me enough: often ok, but not here, where I need a pointer to it for `ccall`.

Are (tuples and) `NTuple` inherently slow, and do they allocate more?

julia> mytuple_200() = ([UInt8(0) for i = 1:200]...,)

julia> @time mytuple_200()  # not always 5 allocations for first...?!
  0.000057 seconds (5 allocations: 2.109 KiB)

next time around:

julia> @btime mytuple_200()
  8.320 μs (3 allocations: 464 bytes)

julia> @btime my_smallertuple_()
  229.241 ns (3 allocations: 80 bytes)

Always 3 allocations minimum, independent of size.
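The allocations likely come from how `mytuple_200` is written, not from `NTuple` itself: the comprehension `[UInt8(0) for i = 1:200]` builds a heap `Vector` first, and splatting a runtime-length array into a tuple adds more. Building the tuple with a compile-time length avoids the heap entirely (a sketch; `make_tuple` is a made-up name):

```julia
# Length known at compile time via Val: the NTuple is isbits and is
# returned on the stack, so no GC allocation is needed.
make_tuple() = ntuple(_ -> 0x00, Val(200))

make_tuple()                       # warm up (compile) first
allocs = @allocated make_tuple()   # 0 bytes after compilation
```

So no `@generated` trickery should be needed just to get a zero-allocation fixed-size tuple.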

I was looking into SmallVector at:

and it’s based on `NTuple`; my code is based on a simplified `ntuple`, or the `Base._ntuple` it calls.

I realize I can’t get a pointer to it, so it’s not a substitute for `Memory` or (any) `Vector` type, for me. But at least I thought I would get fewer allocations that way, and then worry about whether getting a pointer to it would be possible with a hack.
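For what it’s worth, there is a known (still unsafe, use with care) trick for getting an address for an isbits `NTuple`: wrap it in a `Ref`, which has a stable address, and convert that to a pointer inside `GC.@preserve`:

```julia
# Sketch: obtaining a pointer to an NTuple's bytes via a Ref wrapper.
buf = Ref(ntuple(_ -> 0x00, Val(8)))   # Ref gives the tuple a stable address

GC.@preserve buf begin
    p = Ptr{UInt8}(Base.unsafe_convert(Ptr{NTuple{8,UInt8}}, buf))
    unsafe_store!(p, 0x41, 1)          # write 'A' into the first byte
end

first_byte = buf[][1]
```

Note the `Ref` wrapper is itself one GC allocation, so this helps with getting an address for `ccall`, not with reaching zero allocations.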

I was hoping this was about global scope, but no:

julia> function my_code()
         mytuple_200()
         mytuple_200()
         nothing
       end
my_code (generic function with 1 method)

julia> @btime my_code()
  16.614 μs (6 allocations: 928 bytes)

vs.

julia> @benchmark Memory{UInt8}(undef, 200)
BenchmarkTools.Trial: 10000 samples with 997 evaluations.
 Range (min … max):  19.397 ns …  28.090 μs  ┊ GC (min … max):  0.00% … 99.75%
 Time  (median):     29.724 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   73.294 ns ± 537.958 ns  ┊ GC (mean ± σ):  27.37% ±  6.50%