Binary reading and matrix allocations: most performant way?

Would anyone be able to offer some insight into getting faster performance for the use case below?

Say I have a struct that holds a Matrix and my primary goal is to fill it with values, possibly transposed, and it could be a very large matrix:

struct mytest
	mat::Matrix
end

Now to define dimensions and create some binary data for the sake of a MWE for testing performance:

m = 5000
n = 50000
dtype = Float32
nbytes = sizeof(dtype) * m * n
bytes = rand(UInt8, nbytes)
tempfile = "tmp"
open(tempfile, "w") do fid
	write(fid, bytes)
end

Defining a few different functions for reading the sample data:

function testread1(fid::IOStream, dtype::Type, m::Int, n::Int)
	seek(fid, 0)
	nbytes = sizeof(dtype) * m * n
	temp = zeros(UInt8, nbytes)
	readbytes!(fid, temp, nbytes)
	temp2 = reinterpret(dtype, temp)
	return reshape(temp2, (m, n))'
end

function testread2(fid::IOStream, dtype::Type, m::Int, n::Int)
	seek(fid, 0)
	read!(fid, Array{dtype}(undef, m, n))'
end

function testread3(fid::IOStream, dtype::Type, m::Int, n::Int)::Matrix
	seek(fid, 0)
	read!(fid, Array{dtype}(undef, m, n))'
end

And finally, testing the performance of reading the data and then of allocating a mytest struct from the result:

using BenchmarkTools

fid = open(tempfile, "r")
@benchmark test1 = testread1(fid, dtype, m, n) # median 0.159 s
@benchmark test2 = testread2(fid, dtype, m, n) # median 0.129 s
@benchmark test3 = testread3(fid, dtype, m, n) # median 1.182 s

@benchmark a = mytest(test1) # median 1.239 s
@benchmark b = mytest(test2) # median 1.241 s
@benchmark c = mytest(test3) # median 0.00074 s

As you can see, testread1() and testread2() were fast, but they return an Adjoint{Float32, Matrix{Float32}} (or a similar lazy wrapper), which is then costly to convert into a plain Matrix when I allocate a mytest struct from it.

Any advice?

Update: I realize it’s basically the transpose operation that’s hurting performance, and I guess that’s the heart of my question. It often comes up in my work that I need to transpose a matrix, reshape it, etc., and then use it in a context that strictly requires a Matrix, not an Adjoint for example.
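For concreteness, here is a minimal sketch of the lazy vs. eager options on a toy matrix (not the benchmark data): `'` returns a lazy Adjoint wrapper with no copy, while permutedims and collect both materialize a plain Matrix at the cost of one full pass over the data:

```julia
using LinearAlgebra  # for the Adjoint type name

A = rand(Float32, 3, 4)

B = A'               # lazy Adjoint{Float32, Matrix{Float32}}; no copy made
C = permutedims(A)   # eager transpose: allocates and returns a plain Matrix
D = collect(A')      # also materializes a plain Matrix from the lazy wrapper

typeof(B)  # Adjoint{Float32, Matrix{Float32}}
typeof(C)  # Matrix{Float32}
```

So if a downstream context strictly requires a Matrix, the copy has to happen somewhere; permutedims at least makes it a single explicit step.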

How about using Memory-mapped IO?

https://docs.julialang.org/en/v1/stdlib/Mmap/
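A sketch of how that could look for the example above (assuming the same tempfile, m, n, and dtype from the original post; whether it helps depends on the access pattern, since pages are only faulted in on first touch):

```julia
using Mmap

# Sketch: map the file directly as an m×n Matrix{dtype} instead of reading it.
# Nothing is copied up front; the OS pages the data in lazily on access.
fid = open(tempfile, "r")
mat = Mmap.mmap(fid, Matrix{dtype}, (m, n))

# mat indexes like a normal Matrix, but producing a transposed plain Matrix
# would still require one full copy, e.g.:
# t = permutedims(mat)
```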


The biggest performance problem is probably type instability: Matrix is not a concrete type, and neither is Array{dtype} (the number of dimensions needs to be specified).
Unless it is impossible for some reason, that should be fixed first.
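A sketch of one way to fix both instabilities (the parametric struct and the names here are illustrative, not from the original post):

```julia
# Parameterize the struct so the field has a concrete type once T is known.
struct MyTest{T}
    mat::Matrix{T}
end

# Passing the element type as ::Type{T} lets the compiler specialize on it,
# so Matrix{T} (i.e. Array{T,2}) is concrete and the return type is inferable.
function testread_typed(fid::IOStream, ::Type{T}, m::Int, n::Int) where {T}
    seek(fid, 0)
    return read!(fid, Matrix{T}(undef, m, n))
end
```

With this, wrapping the result in a MyTest is just a field assignment, with no conversion and no dynamic dispatch on an abstractly typed field.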


I’m not sure of my measurements (I get different results on different attempts), but this should be slightly faster:

function testread4(fid::IOStream, dtype::Type, m::Int, n::Int)
	seek(fid, 0)
	rowbytes = sizeof(dtype) * m
	row = Vector{UInt8}(undef, rowbytes)
	mat = Matrix{dtype}(undef, n, m)
	@inbounds for r in 1:n
		readbytes!(fid, row, rowbytes)
		mat[r, :] .= reinterpret(dtype, row)
	end
	mat
end

# Or using this form: it seems as if, by using axes as the for range, the compiler automatically avoids bounds checks. Does anyone know if this is really the case?

function testread4(fid::IOStream, dtype::Type, m::Int, n::Int)
	seek(fid, 0)
	rowbytes = sizeof(dtype) * m
	row = Vector{UInt8}(undef, rowbytes)
	mat = Matrix{dtype}(undef, n, m)
	for r in axes(mat, 1)
		readbytes!(fid, row, rowbytes)
		mat[r, :] .= reinterpret(dtype, row)
	end
	mat
end
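Another variant along the same lines (my sketch, not benchmarked here): do one bulk read into the file's natural layout, then transpose in a single pass with permutedims! into a preallocated destination, avoiding the per-row temporary:

```julia
function testread5(fid::IOStream, ::Type{T}, m::Int, n::Int) where {T}
	seek(fid, 0)
	# One bulk read into the file's natural column-major m×n layout...
	src = read!(fid, Matrix{T}(undef, m, n))
	# ...then one eager transpose into a preallocated n×m plain Matrix.
	dst = Matrix{T}(undef, n, m)
	permutedims!(dst, src, (2, 1))
	return dst
end
```

This does allocate the data twice transiently (src and dst), but both the read and the transpose are single tight passes, and the result is a plain Matrix{T} with no lazy wrapper.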