EDIT: I found one speedup myself (see my answer below) but still interested in replies…
I have a critical loop which looks like this:
```julia
for j in 1:about_1000
    x = a_simple_calculation()
    @printf(op, " %g", x)  # this is line 241
end
```
All in all this loop runs about 12 million times, and takes about 27 seconds total, i.e. about 2 microseconds per loop. The profiler confirms that the slowest part is the @printf line inside the loop, not the calculation of x. The profiler’s output for this part was:
I also tried building a string (basically replacing the first @printf with the appropriate @sprintf command, and writing a line at a time), and that was actually slightly slower. So that suggests the problem is not the print buffering itself.
So… is there a fastest way to print (or make strings) in Julia? Or some other obvious improvement to this code?
Using join() improved it by about a factor of 5 (down to about 5 seconds from 27) but I am still interested in speeding it more if there is something obvious. I did this:
```julia
x = zeros(about_1000)
for j in 1:about_1000
    x[j] = a_simple_calculation()
end
s = join(x, " ")
@printf(op, " %s\n", s)
```
p.s. It was not character-for-character the same, because join() always shows floats as floats where @printf with “%g” does not (e.g. if the data is 0, join prints “0.0” but @printf with %g prints “0”), but the data appears to be numerically identical, which is what matters.
On my second example (using join and then printing a line or part of a line at a time), print is comparable in speed to @printf. In fact looking at the profiler, the join takes far longer (about 30 times as long) than either print or @printf. So the critical path is creating the string, not printing it.
So my question now becomes “What is the fastest way to put many numbers into a string?”, and the current winner is join, but of course I am greedy so I would like to know if there might be something better…
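To make the comparison concrete, here is a minimal, self-contained sketch of the `join` approach against building the same string through an `IOBuffer` (the values are stand-ins for the real per-loop results):

```julia
# Two candidates for turning many numbers into one string.
x = collect(1.0:5.0)           # stand-in for the computed results

# Candidate 1: join allocates one final String from the vector.
s1 = join(x, " ")              # "1.0 2.0 3.0 4.0 5.0"

# Candidate 2: print each number into an in-memory buffer,
# then materialize the String once at the end.
buf = IOBuffer()
for v in x
    print(buf, v, ' ')
end
s2 = String(take!(buf))        # same numbers, trailing space
```

Only benchmarking (e.g. with BenchmarkTools.jl) can say which wins for a given element type and length.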
An IOBuffer is an in-memory buffer/replacement for IO, so you can write to it like printing to stdout. The advantage is that you don’t hit the disk but stay in memory, so writing to it is usually faster than writing to disk.
Once you’re done with writing your results to it, you can consume all buffered data at once and write it out to disk in one swoop with e.g. write(stdout, take!(io_buf)).
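A minimal sketch of that pattern, using stand-ins for `about_1000` and `a_simple_calculation()` from the original post (the file name is illustrative):

```julia
using Printf

# Format everything into an in-memory buffer first...
io_buf = IOBuffer()
for j in 1:1000                 # stand-in for the ~1000-iteration inner loop
    x = sqrt(j)                 # stand-in for a_simple_calculation()
    @printf(io_buf, " %g", x)
end
println(io_buf)                 # terminate the line

# ...then hit the disk with a single write call.
open("results.txt", "w") do op
    write(op, take!(io_buf))
end
```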
Yes, I’m sure. The use of join() means the write-out-the-numbers part of the code now takes about 14% of the total run time, compared to about 64% before. I’m sure I could get that 14% down by doing a direct memory transfer (using JuliaPy, I guess) or even rewriting the Python 3rd-party module in Julia, but I’m up against diminishing returns now.
I’d like to add here that constructing the String may not be necessary at all, when printing to an IOBuffer and writing its entire content to disk afterwards. Strings in Julia are immutable, so joining them requires creating a new one. Using an IOBuffer should avoid all of that, and it’ll still have the same representation as if you had printed it to a file.
Yeah that looks good to me. Of course, only benchmarking can be the final judge for this, but from my intuition this should be pretty much ideal (apart from memory mapping the binary data instead of printing & parsing).
This setup obviously won’t work in a streaming setting, since you’d only write to disk/out of memory at the end, but for this usecase of creating a file it should be fine. It’ll also break if your data is too large to fit in memory, but at that point you can’t avoid hitting disk anyway (batching and partial writing to a known byte offset in a file could be done as well, though once you’re in that territory, it’s really just because you have way too much data to communicate).
The CSV.jl package seems to be at least twice as fast on my machine. Part of the reason is perhaps that @printf still does heap allocations, while CSV’s inner loops work hard to be allocation-free.
Unfortunately, CSV.jl doesn’t implement a method that works on simple arrays for some reason (CSV.jl#861), so you have to wrap the array in a “table” (which itself requires a matrix (Tables.jl#243), not a vector, so you also have to do a reshape):
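A sketch of that workaround, assuming the results are already collected in a vector `x` (the data and file name here are illustrative, not from the thread):

```julia
using CSV, Tables

x = rand(1000)  # stand-in for the collected results

# CSV.write wants a Tables.jl-compatible source; Tables.table wraps an
# AbstractMatrix, so the vector has to be reshaped into an n×1 matrix first.
CSV.write("results.csv", Tables.table(reshape(x, :, 1)))
```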
Yes, under the assumption that it’s possible to keep the full matrix in memory for CSV.jl to work on, I’m not surprised it’s faster! As far as I remember, CSV.jl already does some batching. That’s not quite what the OP asked about, though: the original post had the constraint of doing a calculation in each loop iteration.
If keeping all results at the same time in memory is an option, using CSV.jl or similar is going to be faster (though I admit that it’s kind of disappointing to see heap allocations from @sprintf…)
Yes, that does the write about another 50% faster (i.e. it takes about 2/3 as long as using join() and writing a line at a time). I wasn’t going to do any more optimisation because it was already fast enough (see my comment 7), but that was a very easy change to do. Thank you!
I might look at CSV.jl later, but it’s getting late here.