Regular println vs Core.stdout

julia> @time println(Core.stdout, "Hello")
Hello
  0.000027 seconds

julia> @time println("Hello")
Hello
  0.000074 seconds

julia> @time print(Core.stdout, "Hello")
Hello  0.000020 seconds

Why is it 3 times faster [see my follow-up comment: in practice it's slightly slower], and what are the pros and cons? I think, but am not sure, that it may not support threading (or rather uses no locking, thus possibly printing out of order? avoiding libuv?). I see it's being used in juliac as a current workaround, but should it be used more generally? Apparently neither allocates, but I think that's misleading, since Julia doesn't track some hidden allocations of print[ln]; are those allocations reduced with Core.stdout, or even eliminated?

help?> Core.stdout
  │ Warning
  │
  │  The following bindings may be internal; they may change or be removed in future versions:
  │
  │    •  Core.stdout

Will it be guaranteed to be there, because of juliac?

julia> @time print(Core.CoreSTDOUT, "Hello")
Core.CoreSTDOUTHello  0.000131 seconds (2 allocations: 64 bytes)

Are you sure this isn’t noise? You could try benchmarking using BenchmarkTools.jl (expect tons of “Hello, world” printed to the terminal, at this point you’ll probably be benchmarking your terminal’s ability to handle lots of output). Also, that’s four zeros before the first significant digit, do you truly care about this difference?

“Regular” println uses Base.stdout, which is a dynamic global variable. It can change types. It can be redirected. It can support IO contexts and all sorts of things.

Core.stdout is a truly constant binding that will not change.

julia> isconst(Core, :stdout)
true

julia> isconst(Base, :stdout)
false

FYI: Well, I'm not at all sure any more:

$ time julia +1.12 --project=. -e 'using BenchmarkTools; @btime print(Core.stdout, "Hello")' >/dev/null

real	0m5,635s
user	0m5,860s
sys	0m0,648s

$ time julia +1.12 --project=. -e 'using BenchmarkTools; @btime print("Hello")' >/dev/null

real	0m4,981s
user	0m5,709s
sys	0m0,168s

So why might it actually be (slightly; 13%) slower? And is sys time a large part of the reason? Why is it almost 4x higher?

It seems like a workaround to use only for juliac, for now. But it will work there, and not go away? And no huge downside (such as I speculated about) there? If I subtract sys from it (as if that were realistic), it's still 4% slower, otherwise 13% slower; very strange, since it has fewer capabilities, and you might therefore expect it to be faster.

EDIT: the threaded situation is similar. No bug there. I was using shell> which doesn't support redirection (which is kind of an annoying limitation of Julia's shell mode), so I've edited out what I wrote here previously.

Using @btime you have no idea whether those two invocations are doing the same number of iterations, so you can’t conclude anything from their time.

For a fixed number of iterations, I find Base.stdout to be more than 3x faster, and it seems to be related to using more cores even though Julia is running single-threaded:

$ time julia +beta --startup-file=no -e 'for _ in 1:10000000; print(Core.stdout, "Hello"); end' >/dev/null
julia +beta --startup-file=no -e  > /dev/null  1.87s user 3.48s system 181% cpu 2.937 total
$ time julia +beta --startup-file=no -e 'for _ in 1:10000000; print("Hello"); end' >/dev/null
julia +beta --startup-file=no -e 'for _ in 1:10000000; print("Hello"); end' >  0.45s user 1.42s system 631% cpu 0.295 total

You’re also measuring the Julia startup time here. Additionally, you should make sure @btime uses a fixed number of runs (as @danielwe has also mentioned in the meantime).

When I measure in Julia itself, I get (ignoring all prints)

julia> @btime println("Hello");
  59.400 μs (5 allocations: 80 bytes)

julia> @btime println(Core.stdout, "Hello");
  45.200 μs (0 allocations: 0 bytes)

julia> @btime print("Hello\n");
  31.200 μs (1 allocation: 16 bytes)

julia> @btime print(Core.stdout, "Hello\n");
  19.800 μs (0 allocations: 0 bytes)

So I do notice differences, but I’m not sure what we are actually benchmarking here (cf. @ForceBru 's post).


It seems like Core.stdout has a bug (memory leak?).

It seems to be at least 15x slower here.

$ hyperfine "julia +1.12 --project=. -e 'for _ in 1:20000000; print(\"Hello\"); end;' >/dev/null"
Benchmark 1: julia +1.12 --project=. -e 'for _ in 1:20000000; print("Hello"); end;' >/dev/null
  Time (mean ± σ):      1.514 s ±  0.049 s    [User: 2.324 s, System: 0.086 s]
  Range (min … max):    1.446 s …  1.601 s    10 runs
 
$ hyperfine "julia +1.12 --project=. -e 'for _ in 1:20000000; print(Core.stdout, \"Hello\"); end;' >/dev/null"
Benchmark 1: julia +1.12 --project=. -e 'for _ in 1:20000000; print(Core.stdout, "Hello"); end;' >/dev/null
 ⠦ Current estimate: 15.078 s

Only 50% slower here (see the EDIT below: after subtracting startup it's more like 5x), since it gets worse with higher loop counts:

$ hyperfine "julia +1.12 --project=. -e 'for _ in 1:200000; print(Core.stdout, \"Hello\"); end;' >/dev/null"
Benchmark 1: julia +1.12 --project=. -e 'for _ in 1:200000; print(Core.stdout, "Hello"); end;' >/dev/null
  Time (mean ± σ):     422.7 ms ±  21.3 ms    [User: 1186.3 ms, System: 132.4 ms]
  Range (min … max):   401.9 ms … 463.4 ms    10 runs

$ hyperfine "julia +1.12 --project=. -e 'for _ in 1:200000; print(\"Hello\"); end;' >/dev/null"
Benchmark 1: julia +1.12 --project=. -e 'for _ in 1:200000; print("Hello"); end;' >/dev/null
  Time (mean ± σ):     275.9 ms ±  19.3 ms    [User: 597.3 ms, System: 90.5 ms]
  Range (min … max):   254.6 ms … 306.3 ms    10 runs

EDIT: Note that at a low loop count the difference is misleading; I need to subtract startup to get 5x.

EDIT: So likely a wrong conclusion: for loop count 2, Core is slightly faster, or rather "the same"; I'm mostly measuring noise and/or startup cost.

Timing single calls isn’t reliable when they’re really fast.

julia> @time print(Core.stdout, "Hello")
Hello  0.000291 seconds

julia> @time for _ in 1:1000 print(Core.stdout, "Hello") end
#=omitted Hellos=#  0.028425 seconds

which would suggest that in the larger loop, each print(Core.stdout, ...) call is suddenly 10x faster than a single call. That’s obviously false. Since timings also vary, it’s best to use tools that try to account for that.

I’m inferring from the numbers that the subtracted startup was ~239.2 ms, which puts print("Hello") at ~36.7 ms. Note that your timings are intervals with variance, so taking a difference adds the variances up. Difference intervals that contain 0 don’t really say much, so the smaller the average timings or the differences, the less meaningful they are. The larger benchmark is probably measuring something real, though.

I tried this in an adaptation of the hyperfine benchmark because I was curious how redirecting to /dev/null could be affecting things, minus the startup noise. Seems like it might affect a lot, though I’m not confident doing it from within Julia is the same thing as the command line:

PS C:\Users\Benny> julia +rc -e '@time for _ in 1:200_000; print(devnull, \"Hello\"); end'
  0.000003 seconds

PS C:\Users\Benny> julia +rc -e '@time for _ in 1:200_000; print(Core.stdout, \"Hello\"); end'
#=omitted Hellos=#4.682942 seconds

A smaller but still large difference holds up in BenchmarkTools:

julia> @btime print(devnull, $("Hello"))
  2.100 ns (0 allocations: 0 bytes)

julia> @btime print(Core.stdout, $("Hello"))
#= omitted Hellos=#  18.300 μs (0 allocations: 0 bytes)

EDIT: Julia appears to implement print(devnull, ...) to do nothing, so it’s indeed different if redirecting at the command line doesn’t do that (timings suggest so).

unsafe_write(::DevNull, ::Ptr{UInt8}, n::UInt)::Int = n

unsafe_write(io::Union{Core.CoreSTDERR, Core.CoreSTDOUT}, x::Ptr{UInt8}, nb::UInt) = Core.unsafe_write(io, x, nb)

I use the min numbers, not the intervals, when subtracting.

I’m tracing down what it does, and it uses malloc_s (a non-standard malloc according to AI; not sure where it's defined) that is not freed, unless I missed it or it's not needed; is it like alloca?

I see it used a lot in Julia, but didn't find a definition of it there (I might have missed it), only for e.g. malloc_size and malloc_usable_size.

Yes, that 15x is huge and real, and the 5x may be approximate, but it's so far off that I think the two aren't comparable; it's likely a big underestimate, and to me closer to 15x. So it doesn't scale. Do you think malloc, or rather malloc_s, is the issue, with a missing corresponding free[_s]?

I think I've now ruled out its being defined in Julia; and seeing it in an unrelated place:

char *name = (char*)(len >= 256 ? malloc_s(alloc_len) : alloca(alloc_len));

I infer it's not a stack allocation like alloca [there].

Googling “malloc_s”, Google's AI tells me (basically a secure malloc wrapper, but where is it defined?! In GNU libc?):

The function malloc_s is not a standard C library function. It is likely a custom or secure version of the malloc function, often found in libraries designed for security-conscious or embedded systems. These “safe” versions of malloc often include additional error checking or initialization features, such as zero-filling the allocated memory.

Here’s a breakdown of what malloc_s might entail:

Security Considerations:
Standard malloc can return NULL if memory allocation fails, which can lead to crashes if not handled properly. malloc_s might be designed to handle errors more gracefully, potentially by logging errors or returning a specific error code rather than a NULL pointer.

Initialization:
malloc does not automatically initialize the allocated memory. malloc_s might initialize the allocated block to all zeros or some other default value, which can prevent issues with uninitialized data in some scenarios

FYI: https://stackoverflow.com/questions/10191981/memcpy-s-and-strcpy-s-error-but-manual-for-loop-works

In my example, strcpy_s and malloc_s throw an error

Implying C++?

It's not defined here (only other nonstandard variants):
Replacing malloc (The GNU C Library)
nor here:
malloc(3) - Linux manual page

It IS found once in this online book (why is it so hard to find a definition? it's driving me crazy…):

Secure Coding in C and C++

Off-topic but intriguing:

malloc_s IS actually defined in Julia (as a wrapper for malloc; you just need to search for “*malloc_s”…), and like any malloc it needs a corresponding free.

I recall I've seen this code in Julia before. The free was then in a DIFFERENT function. Such awfulness (is that standard programming? It was new to me). Back then it was matched; I confirmed it was NOT the leak I had suspected (but I also found out these mallocs are invisible to Julia's profiling tools), and now I'm trying to find it again. This is libuv code, and usage of it; if I recall, the free lives in the libuv codebase (likely related to uv_write), but the corresponding malloc_s IS in the JuliaLang codebase… so not just in a different function.

I want to find it again to see if there’s actually some reason that free isn’t triggered, i.e. a memory leak…

[E.g. free_write_req in libuv has two stray frees, but I think that's not the function I was looking for.]