Int -> Str -> print(io,_) without allocations

I have billions of Ints that I wish to write to a file. Something similar to mmwrite() in MatrixMarket.jl. I am already using IOBuffer with a large enough sizehint. When I do print(io, n::Int), behind the scenes, a conversion to String is made, causing lots of allocations, slowing down the process.

Strings are variable-sized, hence allocate. Is there a more efficient way to do s=some_efficient_string_type(n); print(io,s) that would avoid allocations? Perhaps something related to ShortStrings.jl or StaticStrings.jl?

Related: print(iobuffer, number) without calling string(number)? · Issue #55835 · JuliaLang/julia · GitHub

1 Like

I solved this by writing separate digits to io.

using SparseArrays, MatrixMarket
const _digit_chars ::String = "0123456789";
@inline function _mat_p_ex(io::IO, p::Int) ::Nothing
   #= Export the position/index of a matrix entry. =#
   @inbounds begin n=ndigits(p); D=10^(n-1);  
      for _ = 1:n  
         d = (p Ă· D) % 10;  
         print(io,_digit_chars[d+1]);  
         D Ă·= 10 end end end;
@inline function _mat_w_ex(io::IO, w::tw) where {tw}
   #= Export the weight/value of a matrix entry. =#
   print(io,w) end;  # inefficient default
@inline function _mat_w_ex(io::IO, w::tw) where {tw<:Integer}
   w<0 && print(io,'-'); 
   _mat_p_ex(io, abs(w)) end;  # more efficient specialization
@inline function _entry_write(io::IO, i::Int, j::Int, w::tw, sep::Char) ::Nothing  where {tw}  
   _mat_p_ex(io,i); write(io,sep);  
   _mat_p_ex(io,j); write(io,sep);  
   _mat_w_ex(io,w); write(io,'\n'); 
   nothing end;
function _mat_export(X::SparseMatrixCSC{tw,tp}, m::Int, n::Int, sep::Char, io::IO, buf::IOBuffer, n_bytes::Int)  where {tw, tp}
   I = X.colptr;  pp = X.rowval;  ww = X.nzval;
   for j = 1:n 
      for k = I[j] : I[j+1]-1  
         i = pp[k]
         w = ww[k]  
         _entry_write(buf,i,j,w,sep);  
         if position(buf) ≥ n_bytes  write(io,take!(buf)) end end end;  
   write(io,take!(buf)); 
   nothing end;
function mat_ex(X::SparseMatrixCSC{tw,tp}, file::String, n_bytes::Int=100_000_000, sep::Char=' ', buf::IO=IOBuffer(sizehint=ceil(Int,n_bytes*1.1))) ::Nothing  where {tw,tp}
   #= Export a sparse matrix to the MatrixMarket.jl format, which is cross-language/platform (can be used in Python, MatLab, Mathematica, C++, C, Fortran, ...). =#
   m,n = size(X);  r = nnz(X);  
   open(file, "w") do io
      println(io, "%%MatrixMarket matrix coordinate integer general")
      println(io, "$m $n $r")
      _mat_export(X,m,n,sep,io,buf,n_bytes) end
end;

This is 2x faster than built-in method (for Int weights), since Chars don’t allocate:

n=10^6;  X = sprand(n, n, 1e-4, r-> rand(-1000:1000,r)); 

@time mat_ex(X, "/path/to/my/file.mtx")
 22.077753 seconds (866 allocations: 6.417 GiB, 1.83% gc time)

julia> @time MatrixMarket.mmwrite("/path/to/my/file.mtx", X);
 50.613167 seconds (1.15 G allocations: 32.051 GiB, 2.46% gc time)

Please let me know if there are any possible performance improvements.

1 Like

What would be an efficient way to export _mat_w_ex(io::IO, w::AbstractFloat) without allocations (char by char) and also preserve the accuracy? Scientific notation? Is base 2 or 10 better for preserving data/precision?

As I commented in the linked issue, for specifically writing to an IOBuffer you can probably do better by directly writing the digits in reverse order directly into the buffer, allowing you to just divrem by 10 over and over rather than having to update the divisor.

Note also that you can just do write(io, UInt8('0') + (d % UInt8)) to directly write an ASCII digit byte, rather than looking up from an array.

(If you know you are writing an ASCII character, it is also slightly more efficient to directly write the UInt8 conversion rather than the Char, since for a general Char it needs to check whether a multi-byte UTF8 representation is required. e.g. write(io, UInt8('\n')) is very slighty faster than print(io, '\n') or write(io, '\n').

PS. I would say that print is slightly preferred over write for String and Char data, because it is allowed to change the encoding to match the output stream, whereas write always writes raw UTF8 bytes. (print outputs strings whereas write outputs raw bytes. That’s why write(io, UInt8('\n')) is used in my example, above, to output a raw ASCII byte.) In practice, they are usually equivalent, as is the case here with an IOBuffer stream.

Don’t use a text format?

Why not use a binary format, e.g. HDF5.jl or even just raw bytes? That will be way faster than writing a text format, more space efficient, and will incur no data loss.

Text formats are a poor choice for writing large amounts of numeric data.

2 Likes

True, but HDF5.jl, JLD.jl, Serialization.jl are not open formats (any potentially malicious code can hide in there). Sharing data to people without trusting them necessitates using an open format.

Thanks for the tips, will try it out.

You can’t write directly to the buffer of an IOBuffer, because the docs of IOBuffer states:

Passing data as scratch space to IOBuffer with write=true may give unexpected behavior
Once write is called on an IOBuffer, it is best to consider any previous references to data invalidated; in effect IOBuffer “owns” this data until a call to take!. Any indirect mutations to data
could lead to undefined behavior by breaking the abstractions expected by IOBuffer.

You have two options:

  • Carry around your own buffer, e.g. a Vector{UInt8}, and then write your integers to that. Then flush that buffer to your IO when it’s reached a certain size. However, this is pretty annoying. What if you are writing a bunch of different stuff, interleaved, and just not integers? Why should you implement the flushing logic yourself? Etc. etc.
  • Just use BufferIO.jl. For example:
using BufferIO
# use arbitrary IO sink, could be a file or whatever, here IOBuffer
io = BufWriter(IOBuffer())
@allocated write(io, 5) # returns zero
4 Likes

Which definition of “open format” does this refer to?

It’s certainly a problem that JLD will execute untrusted code during loading but it is news to me that HDF5 would.

As long as the loader doesn’t execute code within the data, it doesn’t really matter if billions of Ints are stored as text or in binary form. Nobody can review such amounts of data anyhow.

2 Likes

Yes, HDF5.jl should only be used to read trusted data (they do effectively @inbounds based on data from the file for performance); there are sandboxed readers and viewers that can be used GitHub - usnistgov/h5wasm: A WebAssembly HDF5 reader/writer library, and the format is open, so you could implement a hardened subset of the format if required.

2 Likes

Oh, that’s scary, but sounds more like an implementation problem than a fundamental problem with the format. Or are there parts of it that are inherently unsafe?

1 Like

Yes, if you run it in a sandbox it is safe, so there is no inherent issue with the format.

The safetensors format is a safe way to store a bunch of matrices in a file and there’s a Julia package for it - SafeTensors.jl.

3 Likes

I’ve been looking at zmij/zmij.c at main · vitaut/zmij · GitHub and they use a really interesting approach where you do the reversal by writing bytes to Int64s in the right order and then you can just write those Int64s out in the right order.

1 Like

You could define your own binary format that’s good enough? If it’s just a linear sequence of ints write a 64-bit integer (count) followed by all the ints directly in binary.

@stevengj Using your suggestions about UInt8, I further decreased the runtime by about 20%, thanks!

for specifically writing to an IOBuffer you can probably do better by directly writing the digits in reverse order directly into the buffer, allowing you to just divrem by 10 over and over rather than having to update the divisor.

Would you mind giving me a code snippet of what you had in mind (ideally editing my _mat_export, _mat_w_ex)? Are you suggesting writing mirror images of numbers to the files? That will leave the wrong result in the end. I want the result that MatrixMarket.jl produces, as it is the standard, used by SuiteSparseMatrixCollection.jl and other collections.

You can’t write directly to the buffer of an IOBuffer, because the docs of IOBuffer states:

@jakobnissen I’m confused, are you referring to me or stevengj? I’m new to buffers, I first used one just a week ago. I don’t understand the docstring you cited. Could you give me a bit more info? Am I doing something unsafe in my code?

What would be an efficient way to export _mat_w_ex(io::IO, w::AbstractFloat) without allocations (char by char) and also preserve the accuracy? Scientific notation? Is base 2 or 10 better for preserving data/precision?

Any hints about this? I don’t want to reinvent the wheel, since numerical math is a topic, well covered by Julia. I don’t know all the details about how float/complex data types work. I’d like a function that works on Float16, Float32, Float64 and exports as many accurate decimals as the data contains, one character at a time.

If this is your requirement you clearly need to follow its specification rather than speculating about writing in base 2.

No matter how standard it is (in some context) it’s still an ASCII format and highly wasteful and unsuitable for storing billions of data. While it’s certainly possible to optimize writing data to the format, it seems like a misguided investment of time.

1 Like

I’m referring to Steven’s suggestion of writing directly to the buffer contained in an IOBuffer. This is unsafe - don’t do it. Specifically, if io::IOBuffer wraps some v::AbstractVector{UInt8}, then you should not mutate v directly, but only through io, using the IOBuffer API.

In the linked issue, IIUC, Steven suggests enhancing IOBuffer’s internal implementation to allow allocation-free writing to it. That would be a good idea, and would be perfectly safe, since it would happen in internal Base code.

2 Likes