Write/read multiple arrays of different types 300+ Million elements each. Brute force .jld seems stupid.

I am working with a semi-large dataset consisting of multiple arrays of different types, each with ~300 million elements.
Using the JLD package to load the data as .jld files takes around 60 minutes, which is simply too slow.

Writing/reading binary takes only ~60 seconds, but then I have trouble working with strings.

As an example, how does one convert a large Array{UInt8,1} back to Array{String,1}, as illustrated below.
Write to binary

    a = fill("hello world", 300*10^6);  # a = fill("hello world", 10^5) if RAM is an issue
    out = open("test.bin", "w")
    for s in a
        write(out, s)  # write each string's bytes back-to-back
    end
    close(out)

Read from binary

    out = open("test.bin", "r")
    a = read(out);  # returns a Vector{UInt8}
    close(out)

Then I am stuck at Char.(a[1:11]). How does one get back to the original array of “hello world”?

join(Char.(a[1:11])) returns "hello world", and

[join(Char.(a[1+11*i:11+11*i])) for i ∈ 0:100000-1]

returns your original array (in the 10^5 case above; RAM is an issue on this machine!).

But it appears to me this depends on the strings all having the same fixed width?
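Yes, this approach only works for fixed-width data. For that case, one way to avoid the indexing arithmetic of the comprehension above is to reshape the byte vector into a matrix with one column per string. A minimal sketch (`decode_fixed` is a hypothetical helper name; it assumes every string is exactly `w` bytes wide, e.g. `w = 11` for "hello world"):

```julia
# Decode fixed-width strings from a flat byte vector by reshaping it into a
# w-by-n matrix, where each column holds the bytes of one string.
function decode_fixed(bytes::Vector{UInt8}, w::Int)
    m = reshape(bytes, w, :)                 # one column per string
    [String(m[:, j]) for j in 1:size(m, 2)]  # m[:, j] copies, so String may take ownership
end

bytes = Vector{UInt8}(codeunits("hello world"^3))
decode_fixed(bytes, 11)  # → ["hello world", "hello world", "hello world"]
```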

To write variable-length strings in binary, you have to encode the length in some way, e.g. by putting an Int16 in front (if your strings are shorter than 2^15 bytes). Then you can use the String(::Vector{UInt8}) constructor after you have read that many bytes.

There are alternative solutions, such as mmapping the files etc, but the above is quite fast in practice.


This only works for fixed-width strings like “hello world”. But it is useful to see the application of join to concatenate the character arrays into single strings.

That is a great answer, but how would one go about implementing the Int16 in front of strings option?

read(io, Int16) and write(io, Int16(...)) would take care of this.
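To make the round trip concrete, here is a minimal sketch of a single length-prefixed string going through an in-memory IOBuffer (the variable names are just for illustration):

```julia
# Write an Int16 length prefix followed by the string's raw bytes,
# then read both back from the start of the buffer.
io = IOBuffer()
s = "hello world"
write(io, Int16(sizeof(s)))   # length prefix: number of bytes, not characters
write(io, s)                  # raw bytes of the string
seekstart(io)
len = read(io, Int16)         # → 11
str = String(read(io, len))   # → "hello world"
```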

JLD2 loads 10*10^7 strings in 3 seconds, so it should be faster than JLD for your dataset.


@Tamas_Papp could you show a functioning example of how one would go about writing e.g. ["1st string", "2nd string", "3rd string"] to binary and back to Array{String,1}?

It seems there is limited information on writing strings to binary. A simple example that can scale to 100+ million element arrays would be very useful for the community.

Something like

# T is the type we use for string length; it has to be consistent between read and write
function write_string(io::IO, T::Type{<:Signed}, str::AbstractString)
    raw = codeunits(str)
    write(io, T(length(raw)))
    write(io, raw)
end

function read_string(io::IO, T::Type{<:Signed})
    len = read(io, T)
    String(read(io, len))
end

tmp = tempname()
str = "Ἐγχειρίδιον Ἐπικτήτου"

open(tmp, "w") do io
    write_string(io, Int16, str)
end

open(tmp, "r") do io
    read_string(io, Int16)
end

but I am sure someone can optimize this.
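To handle a whole Array{String,1} (as asked above), the per-string functions just need a loop around them. A self-contained sketch along the same lines (`write_strings`/`read_strings` are hypothetical names, not part of any package):

```julia
# Length-prefixed storage for a whole Vector{String}.
# T is the integer type used for the length prefix; it must match on read and write.
function write_strings(io::IO, T::Type{<:Integer}, strs::AbstractVector{<:AbstractString})
    for s in strs
        raw = codeunits(s)
        write(io, T(length(raw)))  # prefix: byte length of this string
        write(io, raw)             # then the raw bytes
    end
end

function read_strings(io::IO, T::Type{<:Integer})
    strs = String[]
    while !eof(io)
        len = read(io, T)
        push!(strs, String(read(io, len)))
    end
    strs
end

tmp = tempname()
a = ["1st string", "2nd string", "3rd string"]
open(io -> write_strings(io, Int16, a), tmp, "w")
b = open(io -> read_strings(io, Int16), tmp, "r")
@assert b == a
```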


Well, it's a start @Tamas_Papp :slight_smile:
I will test it out on a 500+ million element string array, and get back with benchmarks.

Here is another attempt

function write_strs(f, strings)
    open(f, "w") do io
        for str in strings
            write(io, UInt16(sizeof(str)))
            write(io, str)
        end
    end
end

function read_strs(f)
    strings = String[]
    open(f, "r") do io
        while !eof(io)
            l = read(io, UInt16)
            s = Base.StringVector(l)
            for i in 1:l
                s[i] = read(io, UInt8)
            end
            push!(strings, String(s))
        end
    end
    return strings
end

strings = ["aba", "cbc", "αβ", "longerstring...."]
write_strs("foo", strings)

a = fill("hello world", 10^7)
write_strs("big", a)
@time b = read_strs("big")
@assert b==a

@kristoffer.carlsson I will benchmark your attempt against @Tamas_Papp on a 500+ million element Array{String,1} and get back with results.

JLD2 loads it in almost the same time as these handwritten functions, so it might be better to just go with that. The file size is a bit bigger, though.


The problem with basing my workflow on JLD2 is that if it is broken, I can’t just fix it easily. This was a problem in practice with the transition to 1.0 for me, and since then I have been wary of using it.


The 1.0 release should hopefully at least reduce that worry a bit.