Int -> Str -> print(io,_) without allocations

I have billions of Ints that I wish to write to a file. Something similar to mmwrite() in MatrixMarket.jl. I am already using IOBuffer with a large enough sizehint. When I do print(io, n::Int), behind the scenes, a conversion to String is made, causing lots of allocations, slowing down the process.

Strings are variable-sized, hence allocate. Is there a more efficient way to do s=some_efficient_string_type(n); print(io,s) that would avoid allocations? Perhaps something related to ShortStrings.jl or StaticStrings.jl?

Related: print(iobuffer, number) without calling string(number)? · Issue #55835 · JuliaLang/julia · GitHub

1 Like

I solved this by writing separate digits to io.

using SparseArrays, MatrixMarket
const _digit_chars ::String = "0123456789";
@inline function _mat_p_ex(io::IO, p::Int) ::Nothing
   #= Export the position/index of a matrix entry. =#
   @inbounds begin n=ndigits(p); D=10^(n-1);  
      for _ = 1:n  
         d = (p Ă· D) % 10;  
         print(io,_digit_chars[d+1]);  
         D Ă·= 10 end end end;
@inline function _mat_w_ex(io::IO, w::tw) where {tw}
   #= Export the weight/value of a matrix entry. =#
   print(io,w) end;  # inefficient default
@inline function _mat_w_ex(io::IO, w::tw) where {tw<:Integer}
   w<0 && print(io,'-'); 
   _mat_p_ex(io, abs(w)) end;  # more efficient specialization
@inline function _entry_write(io::IO, i::Int, j::Int, w::tw, sep::Char) ::Nothing  where {tw}  
   _mat_p_ex(io,i); write(io,sep);  
   _mat_p_ex(io,j); write(io,sep);  
   _mat_w_ex(io,w); write(io,'\n'); 
   nothing end;
function _mat_export(X::SparseMatrixCSC{tw,tp}, m::Int, n::Int, sep::Char, io::IO, buf::IOBuffer, n_bytes::Int)  where {tw, tp}
   I = X.colptr;  pp = X.rowval;  ww = X.nzval;
   for j = 1:n 
      for k = I[j] : I[j+1]-1  
         i = pp[k]
         w = ww[k]  
         _entry_write(buf,i,j,w,sep);  
         if position(buf) ≥ n_bytes  write(io,take!(buf)) end end end;  
   write(io,take!(buf)); 
   nothing end;
function mat_ex(X::SparseMatrixCSC{tw,tp}, file::String, n_bytes::Int=100_000_000, sep::Char=' ', buf::IO=IOBuffer(sizehint=ceil(Int,n_bytes*1.1))) ::Nothing  where {tw,tp}
   #= Export a sparse matrix to the MatrixMarket.jl format, which is cross-language/platform (can be used in Python, MatLab, Mathematica, C++, C, Fortran, ...). =#
   m,n = size(X);  r = nnz(X);  
   open(file, "w") do io
      println(io, "%%MatrixMarket matrix coordinate integer general")
      println(io, "$m $n $r")
      _mat_export(X,m,n,sep,io,buf,n_bytes) end
end;

This is 2x faster than built-in method (for Int weights), since Chars don’t allocate:

n=10^6;  X = sprand(n, n, 1e-4, r-> rand(-1000:1000,r)); 

@time mat_ex(X, "/path/to/my/file.mtx")
 22.077753 seconds (866 allocations: 6.417 GiB, 1.83% gc time)

julia> @time MatrixMarket.mmwrite("/path/to/my/file.mtx", X);
 50.613167 seconds (1.15 G allocations: 32.051 GiB, 2.46% gc time)

Please let me know if there are any possible performance improvements.

1 Like

What would be an efficient way to export _mat_w_ex(io::IO, w::AbstractFloat) without allocations (char by char) and also preserve the accuracy? Scientific notation? Is base 2 or 10 better for preserving data/precision?

As I commented in the linked issue, for specifically writing to an IOBuffer you can probably do better by directly writing the digits in reverse order directly into the buffer, allowing you to just divrem by 10 over and over rather than having to update the divisor.

Note also that you can just do write(io, UInt8('0') + (d % UInt8)) to directly write an ASCII digit byte, rather than looking up from an array.

(If you know you are writing an ASCII character, it is also slightly more efficient to directly write the UInt8 conversion rather than the Char, since for a general Char it needs to check whether a multi-byte UTF8 representation is required. e.g. write(io, UInt8('\n')) is very slighty faster than print(io, '\n') or write(io, '\n').

PS. I would say that print is slightly preferred over write for String and Char data, because it is allowed to change the encoding to match the output stream, whereas write always writes raw UTF8 bytes. (print outputs strings whereas write outputs raw bytes. That’s why write(io, UInt8('\n')) is used in my example, above, to output a raw ASCII byte.) In practice, they are usually equivalent, as is the case here with an IOBuffer stream.

Don’t use a text format?

Why not use a binary format, e.g. HDF5.jl or even just raw bytes? That will be way faster than writing a text format, more space efficient, and will incur no data loss.

Text formats are a poor choice for writing large amounts of numeric data.

1 Like

True, but HDF5.jl, JLD.jl, Serialization.jl are not open formats (any potentially malicious code can hide in there). Sharing data to people without trusting them necessitates using an open format.

Thanks for the tips, will try it out.

You can’t write directly to the buffer of an IOBuffer, because the docs of IOBuffer states:

Passing data as scratch space to IOBuffer with write=true may give unexpected behavior
Once write is called on an IOBuffer, it is best to consider any previous references to data invalidated; in effect IOBuffer “owns” this data until a call to take!. Any indirect mutations to data
could lead to undefined behavior by breaking the abstractions expected by IOBuffer.

You have two options:

  • Carry around your own buffer, e.g. a Vector{UInt8}, and then write your integers to that. Then flush that buffer to your IO when it’s reached a certain size. However, this is pretty annoying. What if you are writing a bunch of different stuff, interleaved, and just not integers? Why should you implement the flushing logic yourself? Etc. etc.
  • Just use BufferIO.jl. For example:
using BufferIO
# use arbitrary IO sink, could be a file or whatever, here IOBuffer
io = BufWriter(IOBuffer())
@allocated write(io, 5) # returns zero
2 Likes

Which definition of “open format” does this refer to?

It’s certainly a problem that JLD will execute untrusted code during loading but it is news to me that HDF5 would.

As long as the loader doesn’t execute code within the data, it doesn’t really matter if billions of Ints are stored as text or in binary form. Nobody can review such amounts of data anyhow.

1 Like

Yes, HDF5.jl should only be used to read trusted data (they do effectively @inbounds based on data from the file for performance); there are sandboxed readers and viewers that can be used GitHub - usnistgov/h5wasm: A WebAssembly HDF5 reader/writer library, and the format is open, so you could implement a hardened subset of the format if required.

Oh, that’s scary, but sounds more like an implementation problem than a fundamental problem with the format. Or are there parts of it that are inherently unsafe?

Yes, if you run it in a sandbox it is safe, so there is no inherent issue with the format.