Regular println vs Core.stdout

julia> @time println(Core.stdout, "Hello")
Hello
  0.000027 seconds

julia> @time println("Hello")
Hello
  0.000074 seconds

julia> @time print(Core.stdout, "Hello")
Hello  0.000020 seconds

Why is it 3 times faster [see my follow-up comment: in practice it's slightly slower], and what are the pros and cons? I think, but am not sure, that it may not support threading (or rather uses no locking, thus possibly printing out of order? avoiding libuv?). I see it's being used in juliac as a current workaround, but should it be used more generally? Apparently neither allocates, but I think that's misleading, since Julia doesn't track some hidden allocations of print[ln]; are those allocations reduced with Core.stdout, or even eliminated?

help?> Core.stdout
  │ Warning
  │
  │  The following bindings may be internal; they may change or be removed in future versions:
  │
  │    •  Core.stdout

Will it be guaranteed to be there, because of juliac?

julia> @time print(Core.CoreSTDOUT, "Hello")
Core.CoreSTDOUTHello  0.000131 seconds (2 allocations: 64 bytes)

Are you sure this isn’t noise? You could try benchmarking using BenchmarkTools.jl (expect tons of “Hello, world” printed to the terminal, at this point you’ll probably be benchmarking your terminal’s ability to handle lots of output). Also, that’s four zeros before the first significant digit, do you truly care about this difference?

“Regular” println uses Base.stdout, which is a dynamic global variable. It can change types. It can be redirected. It can support IO contexts and all sorts of things.

Core.stdout is a truly constant binding that will not change.

julia> isconst(Core, :stdout)
true

julia> isconst(Base, :stdout)
false

FYI: Well, I'm not at all sure any more:

$ time julia +1.12 --project=. -e 'using BenchmarkTools; @btime print(Core.stdout, "Hello")' >/dev/null

real	0m5,635s
user	0m5,860s
sys	0m0,648s

$ time julia +1.12 --project=. -e 'using BenchmarkTools; @btime print("Hello")' >/dev/null

real	0m4,981s
user	0m5,709s
sys	0m0,168s

So why might it actually be (slightly; 13%) slower? And is sys time a large part of the reason? Why is it almost 4x higher?

It seems like a workaround to use only for juliac, for now. But it will work there, and not go away? And no huge downside (such as I speculated about) there? If I subtract sys from it (as if that were realistic), it's still 4% slower, otherwise 13% slower; very strange, since it has fewer capabilities, and you might therefore expect it to be faster.

EDIT: the threaded situation is similar. No bug there. I was using shell> which doesn't support redirection (which is kind of an annoying limitation of Julia's shell mode), so I've edited out what I wrote here previously.

Using @btime you have no idea whether those two invocations are doing the same number of iterations, so you can’t conclude anything from their time.

For a fixed number of iterations, I find Base.stdout to be more than 3x faster, and it seems to be related to using more cores even though Julia is running single-threaded:

$ time julia +beta --startup-file=no -e 'for _ in 1:10000000; print(Core.stdout, "Hello"); end' >/dev/null
julia +beta --startup-file=no -e  > /dev/null  1.87s user 3.48s system 181% cpu 2.937 total
$ time julia +beta --startup-file=no -e 'for _ in 1:10000000; print("Hello"); end' >/dev/null
julia +beta --startup-file=no -e 'for _ in 1:10000000; print("Hello"); end' >  0.45s user 1.42s system 631% cpu 0.295 total

You’re also measuring the Julia startup time here. Additionally, you should make sure @btime uses a fixed number of runs (as @danielwe has also mentioned in the meantime).

When I measure in Julia itself, I get (ignoring all prints)

julia> @btime println("Hello");
  59.400 μs (5 allocations: 80 bytes)

julia> @btime println(Core.stdout, "Hello");
  45.200 μs (0 allocations: 0 bytes)

julia> @btime print("Hello\n");
  31.200 μs (1 allocation: 16 bytes)

julia> @btime print(Core.stdout, "Hello\n");
  19.800 μs (0 allocations: 0 bytes)

So I do notice differences, but I’m not sure what we are actually benchmarking here (cf. @ForceBru 's post).


It seems like Core.stdout has a bug (memory leak?).

It seems to be at least 15x slower here.

$ hyperfine "julia +1.12 --project=. -e 'for _ in 1:20000000; print(\"Hello\"); end;' >/dev/null"
Benchmark 1: julia +1.12 --project=. -e 'for _ in 1:20000000; print("Hello"); end;' >/dev/null
  Time (mean ± σ):      1.514 s ±  0.049 s    [User: 2.324 s, System: 0.086 s]
  Range (min … max):    1.446 s …  1.601 s    10 runs
 
$ hyperfine "julia +1.12 --project=. -e 'for _ in 1:20000000; print(Core.stdout, \"Hello\"); end;' >/dev/null"
Benchmark 1: julia +1.12 --project=. -e 'for _ in 1:20000000; print(Core.stdout, "Hello"); end;' >/dev/null
 ⠦ Current estimate: 15.078 s

Only 50% slower here (see the EDIT below: after subtracting startup it's more like 5x), since it gets worse with higher loop counts:

$ hyperfine "julia +1.12 --project=. -e 'for _ in 1:200000; print(Core.stdout, \"Hello\"); end;' >/dev/null"
Benchmark 1: julia +1.12 --project=. -e 'for _ in 1:200000; print(Core.stdout, "Hello"); end;' >/dev/null
  Time (mean ± σ):     422.7 ms ±  21.3 ms    [User: 1186.3 ms, System: 132.4 ms]
  Range (min … max):   401.9 ms … 463.4 ms    10 runs

$ hyperfine "julia +1.12 --project=. -e 'for _ in 1:200000; print(\"Hello\"); end;' >/dev/null"
Benchmark 1: julia +1.12 --project=. -e 'for _ in 1:200000; print("Hello"); end;' >/dev/null
  Time (mean ± σ):     275.9 ms ±  19.3 ms    [User: 597.3 ms, System: 90.5 ms]
  Range (min … max):   254.6 ms … 306.3 ms    10 runs

EDIT: Note that at a low loop count the difference is misleading; I need to subtract startup to get 5x.

EDIT: So likely a wrong conclusion: for loop count 2, Core is slightly faster, or rather "the same"; I'm mostly measuring noise and/or startup cost.

Timing single calls isn’t reliable when they’re really fast.

julia> @time print(Core.stdout, "Hello")
Hello  0.000291 seconds

julia> @time for _ in 1:1000 print(Core.stdout, "Hello") end
#=omitted Hellos=#  0.028425 seconds

which would suggest that in the larger loop, each print(Core.stdout, ...) call is suddenly 10x faster than a single call. That’s obviously false. Since timings also vary, it’s best to use tools that try to account for that.

I’m inferring from the numbers that the subtracted startup was ~239.2 ms, which puts print("Hello") at ~36.7 ms. Note that your timings are intervals with variance, so taking a difference adds the variances up. Difference intervals that contain 0 don’t really say much, so the smaller the average timings or the differences, the less meaningful they are. The larger benchmark is probably measuring something real, though.

I tried this in an adaptation of the hyperfine benchmark because I was curious how redirecting to /dev/null could be affecting things, minus the startup noise. Seems like it might affect a lot, though I’m not confident doing it from within Julia is the same thing as the command line:

PS C:\Users\Benny> julia +rc -e '@time for _ in 1:200_000; print(devnull, \"Hello\"); end'
  0.000003 seconds

PS C:\Users\Benny> julia +rc -e '@time for _ in 1:200_000; print(Core.stdout, \"Hello\"); end'
#=omitted Hellos=#4.682942 seconds

A smaller but still large difference holds up in BenchmarkTools:

julia> @btime print(devnull, $("Hello"))
  2.100 ns (0 allocations: 0 bytes)

julia> @btime print(Core.stdout, $("Hello"))
#= omitted Hellos=#  18.300 μs (0 allocations: 0 bytes)

EDIT: Julia appears to implement print(devnull, ...) to do nothing, so it’s indeed different if redirecting at the command line doesn’t do that (timings suggest so).

unsafe_write(::DevNull, ::Ptr{UInt8}, n::UInt)::Int = n

unsafe_write(io::Union{Core.CoreSTDERR, Core.CoreSTDOUT}, x::Ptr{UInt8}, nb::UInt) = Core.unsafe_write(io, x, nb)

I use the min numbers, not the intervals, when subtracting.

I’m tracing down what it does, and it uses malloc_s (a non-standard malloc according to AI; not sure where it's defined) that is not freed, unless I missed it or it's not needed; is it like alloca?

I see it used a lot in Julia, but didn't find a definition of it there (I might have missed it), only for e.g. malloc_size and malloc_usable_size.

Yes, that 15x is huge and real, and the 5x may be approximate, but it's so far off that I think the two aren't comparable; it's likely a big underestimate, and to me closer to 15x. So it doesn't scale. Do you think malloc, or rather malloc_s, is the issue, with a missing corresponding free[_s]?

I think I've now ruled out its being defined in Julia; and seeing it in an unrelated place:

char *name = (char*)(len >= 256 ? malloc_s(alloc_len) : alloca(alloc_len));

I infer it's not a stack allocation like alloca [there].

Googling “malloc_s”, Google's AI tells me (basically a secure malloc wrapper, but where is it defined?! In GNU libc?):

The function malloc_s is not a standard C library function. It is likely a custom or secure version of the malloc function, often found in libraries designed for security-conscious or embedded systems. These “safe” versions of malloc often include additional error checking or initialization features, such as zero-filling the allocated memory.

Here’s a breakdown of what malloc_s might entail:

Security Considerations:
Standard malloc can return NULL if memory allocation fails, which can lead to crashes if not handled properly. malloc_s might be designed to handle errors more gracefully, potentially by logging errors or returning a specific error code rather than a NULL pointer.

Initialization:
malloc does not automatically initialize the allocated memory. malloc_s might initialize the allocated block to all zeros or some other default value, which can prevent issues with uninitialized data in some scenarios

FYI: https://stackoverflow.com/questions/10191981/memcpy-s-and-strcpy-s-error-but-manual-for-loop-works

In my example, strcpy_s and malloc_s throw an error

Implying C++?

It's not defined here (only other nonstandard variants):
Replacing malloc (The GNU C Library)
nor here:
malloc(3) - Linux manual page

It IS found once in this online book (why is it so hard to find a definition? it's driving me crazy…):

Secure Coding in C and C++

Off-topic but intriguing:

malloc_s IS actually defined in Julia (as a wrapper for malloc; you just need to search for “*malloc_s”…), and like any malloc it needs a corresponding free.

I recall I've seen this code in Julia before. The free was then in a DIFFERENT function. Such awfulness (is that standard programming? It was new to me). Back then it was matched; I confirmed it was NOT the leak I had suspected (but I also found out these mallocs are invisible to Julia's profiling tools), and now I'm trying to find it again. This is libuv code, and usage of it; if I recall, the free lives in the libuv codebase (likely related to uv_write), but the corresponding malloc_s IS in the JuliaLang codebase… so not just in a different function.

I want to find it again to see if there’s actually some reason that free isn’t triggered, i.e. a memory leak…

[E.g. free_write_req in libuv has two stray frees, but I think that's not the function I was looking for.]