Julia slower than Python to sort and reverse a list of integers

It’s not a gotcha, because Python’s deep-learning frameworks are not written in Python. It’s a huge undertaking to create and maintain such a framework, one that solves some of the compiler’s problems for Python users and can be called from Python. It’s actually a great example of how bad the situation is: if you want something fast that isn’t in the standard toolkit, you are screwed, or you have a two-language problem. There is JAX, which allows some more flexibility; not that it matters for people who just want to call a ready-made tool, but it matters for people who are trying to develop or improve existing tools.

2 Likes

as a user I will experience fast code.

2 Likes

Yes, you might experience the same speed that some other programming languages get for free, because these massive frameworks compile code for you and do “magic” you wouldn’t need in a typed, compiled language. But lots of people have to do extra work for that to happen, instead of focusing on solving the real problem. Not to mention the bazillion small packages compiled for different Python versions, sometimes performing an operation as trivial as computing a Hamming distance (including a whole CI setup for one line of code, in the worst case). It will be fast as long as you stick to the framework and do all your ops within the limits of what it allows, without ever stepping outside of it, as that usually leads to huge overhead.

1 Like

If I recall correctly, TimSort is a stable sort, but QuickSort usually isn’t. Julia’s default sort is, however, stable in later versions (since 1.9, I believe), and very fast (it may even use a radix sort, I forget; in any case you shouldn’t rely on the default sort being plain QuickSort), so I think you can ignore the algorithmic differences. TimSort was actually designed to be fast on already sorted (or reverse-sorted) data, and I would also keep that in mind. [The implementation language doesn’t matter, except on the Python side; Julia is as fast as C.]
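
For anyone unsure what stability buys you: with a stable sort, elements that compare equal keep their original relative order. A minimal sketch (the data is made up for illustration):

pairs = [(2, "a"), (1, "b"), (2, "c"), (1, "d")]

# The default sort is stable (since 1.9), so among equal keys the original
# order survives: "b" stays before "d", and "a" before "c".
sorted_pairs = sort(pairs; by = first)
@assert sorted_pairs == [(1, "b"), (1, "d"), (2, "a"), (2, "c")]

# A specific algorithm can still be requested explicitly:
sort([3, 1, 2]; alg = QuickSort)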

1 Like

Lots of people have done lots of work to make Julia feel “magic” too. I don’t think it matters for the user experience whether that work is done in C or in Julia all the way down. BTW, Julia implements some things in C as well.

1 Like

Ok, ok, folks, let’s cool it a bit. It’s fine to post benchmarks where Julia is slow (heck, we want them). There are also different kinds of benchmarks: ones that compare language features, like the microbenchmarks, and ones that compare the user experience with whatever primitives are provided. Both are valid. No need to bicker about it. It is also good to dig into why something is or isn’t slow.

31 Likes

In fact, you don’t even need buffering, just a bit more memory:

@time let lines = readlines("random_numbers.txt")
    # parse every line, sort in place in descending order, and print in one write
    println(join(sort!(parse.(Int64, lines), rev = true), '\n'))
end

This one is equally fast, and significantly simpler.

1 Like

Are there any plans to finally make Julia’s standard IO functions (stdout, open, print, write, etc.) buffered by default, with a fixed-size, preallocated buffer (a couple of pages, e.g. 8–64 kibibytes), just like the FILE operations in C (and in many other common languages) are?

Unbuffered I/O can be very slow and is really only suitable for specialist use, as it usually triggers a whole kernel context switch for every single argument of a print() or write() statement, going through a huge code path in the operating system, often for each single number written. Unbuffered I/O can even become grotesquely slow in situations where the underlying file system has unbuffered synchronous writes as well (e.g. I’ve encountered operating systems that do that by default for some removable media, such as USB sticks).

Having to explicitly work around the severe performance penalty of unbuffered I/O by default, using hacks like join or IOBuffer (which both introduce potentially huge allocations, and corresponding L1 cache churn), is neither user-friendly nor a performant long-term solution.
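
To make that workaround concrete, here is a minimal sketch of the IOBuffer hack mentioned above (the variable names are just illustrative):

rows = rand(Int, 10^5)

# Unbuffered: roughly one write (and kernel round-trip) per println:
# for n in rows
#     println(stdout, n)
# end

# The workaround: accumulate everything in memory, then write once.
io = IOBuffer()
for n in rows
    println(io, n)
end
write(stdout, take!(io))  # one large write, but also one large allocation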

15 Likes

I’d really like this, but when I’ve asked other devs, they seem weirdly (imo) reluctant. I think the big holdup may be figuring out the right conditions under which stdout is buffered or not. I don’t think stderr should ever be buffered by default. It should definitely be easier to switch buffering on.

4 Likes

It’s nicer to do:

using BufferedStreams
bstdout = BufferedOutputStream(stdout)

and then write to bstdout. I.e., with BufferedStreams.jl (which defaults to a 128 KiB buffer) you can opt in to buffering pretty easily.
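
A minimal usage sketch; note the flush at the end, without which trailing output can sit in the buffer:

using BufferedStreams

bstdout = BufferedOutputStream(stdout)
for n in 1:5
    println(bstdout, n)  # lands in the 128 KiB buffer, not on the screen yet
end
flush(bstdout)           # pushes the buffered bytes out to the real stdout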

12 Likes

This looks very cool. I should remember this in the future.

I’m not sure what the issue is, but since it’s faster, and available, why not make it the default?

I mean, without having to do using BufferedStreams [...] couldn’t this be integrated as a standard library, or in some other way? But with no code changes for users, versus the status quo? [Code using that package should still work, as a no-op.]

Don’t e.g. C and C++ buffer by default too (and Python; would this conflict with interop with that language most importantly, or with any language that does such buffering? I’m trying to think up reasons for it not being the default). The only reason I can think of is if it’s redundant with buffering in some lower layer(s), i.e. the OS. If it’s a problem, it could be opt-out rather than opt-in? But it seems implementing that opt-out could be deferred, as I’m very unsure it’s actually needed.

I realize the trend is rather to get rid of stdlibs and keep Julia small, but this is just such a basic case of Julia being worse than it needs to be; everyone would want this, in particular people who wouldn’t currently know how to opt in.

It would be breaking.

For example, suppose I’m using println to output status updates in a long-running calculation — that would suddenly no longer work (the updates could appear extremely infrequently) without code changes to call flush(stdout) after every output.
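
E.g. a sketch of what every such loop would then need (do_work and nsteps are hypothetical stand-ins for the real calculation):

nsteps = 3
do_work(step) = sleep(1)  # placeholder for the long-running work

for step in 1:nsteps
    do_work(step)
    println("finished step $step")
    flush(stdout)  # would become necessary if stdout were buffered by default
end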

(C and some other languages automatically flush stdout after output that contains a newline, which mitigates this issue somewhat, at the price of losing some of the benefits of buffering for large amounts of output. C++ does this for std::endl.)

2 Likes

Thanks, BufferedStreams.jl looks promising, but its documentation seems rather confusing at first glance, and needs a major rewrite before this package can be recommended to any new Julia users.

Yes, I agree that it needs improvements, which should be easy to write since it is quite simple to use. (A great way to contribute!)

Would it need to be? I mean yes, with the semantics (and current syntax) of BufferedStreams.jl, but we could do better (better even than C, C++, and Python?).

I’m thinking we assume threading support from the OS (unlike C, I guess), just as Python does in all current versions: threading — Thread-based parallelism — Python 3.7.17 documentation

Changed in version 3.7: This module used to be optional, it is now always available.

Is it a bad idea to simply have a background thread that wakes up, say, every 0.1 sec and, if the buffers haven’t been flushed, flushes them? (See the sketch below.)

Then the large 128 KB buffer could stay; otherwise we could reconsider and make it smaller (why was it chosen that large?) and/or flush on end-of-line (which might not be optimal).
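
A minimal sketch of that periodic-flush idea, using a libuv Timer and the BufferedOutputStream from earlier (whether this is safe against concurrent writes is exactly the kind of question that would need answering):

using BufferedStreams

bstdout = BufferedOutputStream(stdout)

# Wake up every 0.1 s and push out whatever has accumulated in the buffer.
flusher = Timer(0.1; interval = 0.1) do _
    flush(bstdout)
end

println(bstdout, "appears within ~0.1 s, even without an explicit flush")
sleep(0.2)  # give the timer a chance to fire in this demo

# ... when done:
close(flusher); flush(bstdout)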

2 Likes
@time rows = rand(Int64, 999999)  # I'm rather sure this is faster in Julia; not the part I chose to look at right now. Even more likely for Float64, and for Float32, which Python doesn't have.
  0.004751 seconds (3 allocations: 7.630 MiB)

@time println.(rows);  # noticeably slow, partly because it's displaying in the terminal, but the same applies below
 10.833938 seconds (8.08 M allocations: 260.670 MiB, 0.35% gc time)

julia> @time using PyCall
  0.499943 seconds (414.26 k allocations: 27.650 MiB, 2.80% compilation time)

[EDIT: Was for some reason: @time using PyCall  # a bit slow but one-time-cost, ignore for now.
  2.573149 seconds (1.22 M allocations: 85.362 MiB, 0.49% gc time, 99.41% compilation time: 100% of which was recompilation) ]

# It's actually faster to print with Julia, calling to Python, even if summing up that one-time overhead... for this size.
@time py"print".(rows);
  7.284762 seconds (3.00 M allocations: 137.331 MiB, 2.55% gc time)

[EDIT: Strange, the allocation numbers aren't stable; I also see:
  7.617555 seconds (3.21 M allocations: 151.482 MiB, 2.82% gc time, 2.53% compilation time)
  7.559810 seconds (3.00 M allocations: 137.331 MiB, 2.49% gc time) ]

I think we can (and must) do better; until then, PyCall (or PythonCall) is your friend. And it's simpler code than Python's, which needs a loop, while Julia can use broadcasting (the dot). One thing to look at first: why fewer allocations, 3.00 M vs. 8.08 M?
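
One quick way to start digging (a sketch; the exact byte counts will vary by Julia version and terminal):

n = rand(Int)
println(stdout, n)                 # warm up, so compilation isn't counted
a = @allocated println(stdout, n)  # bytes allocated by a single println
@show a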

It’s slightly faster in Python…:

import time

start = time.time()

import numpy as np
numbers = np.random.randint(-2147483648, 2147483647, size=999999)

for n in numbers:
    print(n)

et = time.time()
elapsed_time = et - start
print('Execution time:', elapsed_time, 'seconds')

Execution time: 5.458845615386963 seconds [50% faster than pure Julia; 25–29%, i.e. 1.8 sec, faster than Julia with PyCall, plus the 0.5 sec fixed overhead.]

The number of allocations seems to explain the timing difference to a large degree, but not in full.

1 Like

@cjdoris I tried the same with:

$ julia -O0
julia> @time using PythonCall
  1.705963 seconds (1.43 M allocations: 89.960 MiB, 7.26% gc time, 0.92% compilation time)

[On defaults, not -O0:
  2.288473 seconds (1.43 M allocations: 90.056 MiB, 5.41% gc time, 1.62% compilation time) ]

@time PythonCall.println.(rows);
 10.990645 seconds (8.08 M allocations: 260.670 MiB, 0.16% gc time)
[Was: 11.140644 seconds (8.08 M allocations: 260.670 MiB, 0.17% gc time) ]

It’s 10.990645/7.284762 = 51% (or 11.140644/7.559810 = 47%) slower than PyCall for some reason. Likely because it allocates 8.08/3.00 = 2.7 times more… which is strange; and why does either allocate at all?

Seems a heck of a lot easier to do:

using BufferedStreams
bout = BufferedOutputStream(stdout)
println.(bout, rows); flush(bout)

and is 2x faster than Python on my machine.

9 Likes

Those three lines/four statements are non-discoverable (I couldn’t find BufferedStreams in Julia’s docs), more so than the current one line/statement in Julia, or the two lines/statements with PyCall.

@time (println.(bout, rows); flush(bout))
  3.953521 seconds (3.00 M allocations: 129.708 MiB, 1.13% gc time)

It’s actually just (3.953521/5.458845615386963) 28% faster than Python alone for me. And while, yes, it’s 48% faster than Julia calling Python for the print (is that what you meant and timed?), this is, admittedly, rather obscure:

py"print".(rows);

My point was: if you are comparing to Python, and it’s faster, then it’s natural to use (or at least test with) Python, as I did. I was curious whether it was as fast. Especially for new users, or those coming from Python, it’s obscure to have to know about flush (and BufferedStreams), which isn’t needed there.

I always liked your py macro, and missed it in PythonCall; I wasn’t really sure how to do the same until I figured out PythonCall.println.(rows); and it’s arguably a bit more readable.

[Why wasn’t calling Python for the print as fast as doing the printing there? My answer: since Python runs in the same process, it shares data, which is great (i.e. Julia numerical arrays become NumPy arrays). There should be no copying overhead (in most cases), but strings are an exception, since Julia uses UTF-8 and Python doesn’t yet (PyPy does). I didn’t know how to call PyPy, but with it, or with a future Python version that switches to UTF-8 internally, I expect the overhead to disappear; Julia will no longer be 48% faster, and the call will still have code (mental) overhead.]
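
For what it’s worth, a sketch of the no-copy array sharing with PyCall (if I read its docs right, PyObject on a Julia Array exposes the same memory to NumPy without copying, whereas strings get re-encoded):

using PyCall
np = pyimport("numpy")

a = rand(3)
pa = PyObject(a)           # NumPy array wrapping the same memory, no copy
a .= 0.0                   # mutate on the Julia side
@assert np.sum(pa) == 0.0  # the Python side sees the mutation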