Write/read multiple arrays of different types 300+ Million elements each. Brute force .jld seems stupid.

Mikkel-Holm · September 20, 2018, 12:25pm

I am working with a semi-large dataset consisting of multiple arrays of different types, each with ~300 million elements.
Loading the data from .jld files with the JLD package takes around 60 minutes, which is simply too slow.

Writing/reading raw binary takes only ~60 seconds, but then I have trouble working with strings.

As an example, how does one convert a large Array{UInt8,1} back to Array{String,1}, as illustrated below?
Write to binary:
a = fill("hello world", 300*10^6); # a = fill("hello world", 10^5) if RAM is an issue
out = open("test.bin", "w")
for s in a   # write(out, a) is rejected for arrays of strings on Julia 1.0+
    write(out, s)
end
close(out)
Read from binary:
out = open("test.bin", "r")
a = read(out);
close(out)

Then I am stuck at Char.(a[1:11]). How does one get back to the original array of “hello world”?

nilshg · September 20, 2018, 1:41pm

join(Char.(a[1:11])) returns "hello world", and

[join(Char.(a[1+11*i:11+11*i])) for i ∈ 0:100000-1]

returns your original array (in the 10^5 case above; RAM is an issue on this machine!).

But it appears to me this is dependent on fixed width?
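
The fixed-width idea can also be written with the String(::Vector{UInt8}) constructor instead of Char and join. Char.(bytes) maps each byte to its own code point, so it only round-trips ASCII, whereas String on a byte slice keeps multi-byte UTF-8 characters intact. A minimal sketch, assuming every record occupies exactly `width` bytes; the helper name `fixed_width_strings` is made up for illustration:

```julia
# Hypothetical helper: split a byte vector into strings of `width` bytes each.
# Assumes the file holds only fixed-width records and nothing else.
function fixed_width_strings(bytes::Vector{UInt8}, width::Integer)
    n, r = divrem(length(bytes), width)
    r == 0 || throw(ArgumentError("byte count is not a multiple of width"))
    # Each slice is a fresh copy, so handing it to String leaves `bytes` untouched.
    return [String(bytes[(i - 1) * width + 1 : i * width]) for i in 1:n]
end

io = IOBuffer()
foreach(s -> write(io, s), fill("hello world", 3))  # write strings one by one
strs = fixed_width_strings(take!(io), 11)
```

This still depends on a fixed width in bytes, not in characters, so it does not help with strings of varying length.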

Tamas_Papp · September 20, 2018, 1:59pm

To write variable-length strings in binary, you have to encode the length in some way, e.g. put an Int16 in front (if your strings are shorter than 2^15 bytes). Then you can use the String(::Vector{UInt8}) constructor after you have read that many bytes.

There are alternative solutions, such as mmapping the files etc, but the above is quite fast in practice.
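
For the mmap alternative, a rough sketch might look like the following. It assumes a file of records that each start with a 2-byte little-endian length followed by that many UTF-8 bytes; that layout and the function name `read_mmapped` are assumptions for illustration, not something fixed by this thread.

```julia
using Mmap

# Write a small sample file: little-endian UInt16 byte length, then the bytes.
path = tempname()
open(path, "w") do io
    for s in ["aba", "αβ", "longer string"]
        write(io, htol(UInt16(sizeof(s))))
        write(io, s)
    end
end

# Hypothetical reader that walks the memory-mapped bytes record by record.
function read_mmapped(path::AbstractString)
    bytes = Mmap.mmap(path)          # Vector{UInt8} backed by the file
    out = String[]
    pos = 1
    while pos <= length(bytes)
        # Decode the 2-byte little-endian length by hand.
        len = Int(bytes[pos]) | (Int(bytes[pos + 1]) << 8)
        pos += 2
        push!(out, String(bytes[pos:pos + len - 1]))  # the slice is a copy
        pos += len
    end
    return out
end

mm_strs = read_mmapped(path)
```

Each string is still copied out of the mapping here, so whether this beats plain buffered reads is something to time on the real data.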

Mikkel-Holm · September 20, 2018, 2:15pm

This only works for fixed-width strings like "hello world". But it is useful to see join applied to concatenate the character arrays into single strings.

Mikkel-Holm · September 20, 2018, 2:21pm

That is a great answer, but how would one go about implementing the Int16 in front of strings option?

Tamas_Papp · September 20, 2018, 2:22pm

read(io, Int16) and write(io, Int16(...)) would take care of this.
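
A minimal sketch of that pair of calls on an in-memory IOBuffer, for a single string (the variable names are illustrative only):

```julia
io = IOBuffer()
s = "2nd string"
write(io, Int16(sizeof(s)))   # length prefix: number of bytes, not characters
write(io, s)                  # the UTF-8 bytes of the string

seekstart(io)                 # rewind before reading back
len = read(io, Int16)         # read the prefix first
back = String(read(io, len))  # then exactly that many bytes
```

The same calls work on a file opened with open, since both are IO objects.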

kristoffer.carlsson · September 20, 2018, 2:22pm

JLD2 loads 10*10^7 strings in 3 seconds, so it should be faster than JLD for your dataset.

Mikkel-Holm · September 20, 2018, 2:37pm

@Tamas_Papp could you show a functioning example of how one would go about writing e.g. ["1st string", "2nd string", "3rd string"] to binary and back to Array{String,1}?

It seems there is limited information on writing strings to binary. A simple example that can scale to 100+ million element arrays would be very useful for the community.

Tamas_Papp · September 20, 2018, 2:59pm

Something like

# T is the type we use for string length, it has to be consistent between read and write
function write_string(io::IO, T::Type{<:Signed}, str::AbstractString)
    raw = codeunits(str)
    write(io, T(length(raw)))
    write(io, raw)
end

function read_string(io::IO, T::Type{<:Signed})
    len = read(io, T)
    String(read(io, len))
end

tmp = tempname()
str = "Ἐγχειρίδιον Ἐπικτήτου"

open(tmp, "w") do io
    write_string(io, Int16, str)
end

open(tmp, "r") do io
    read_string(io, Int16)
end

but I am sure someone can optimize this.
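
To extend the single-string example to a whole Array{String,1}, one possible sketch stores the element count first so the reader can preallocate. The Int64 count header, the Int32 length prefix, and the names `write_strings` and `read_strings` are assumptions chosen for illustration:

```julia
# Hypothetical layout: Int64 element count, then per string an Int32 byte
# length followed by the raw UTF-8 bytes.
function write_strings(path::AbstractString, strs::AbstractVector{<:AbstractString})
    open(path, "w") do io
        write(io, Int64(length(strs)))
        for s in strs
            raw = codeunits(s)
            write(io, Int32(length(raw)))
            write(io, raw)
        end
    end
    return path
end

function read_strings(path::AbstractString)
    open(path, "r") do io
        n = read(io, Int64)
        out = Vector{String}(undef, n)   # preallocate using the stored count
        for i in 1:n
            len = read(io, Int32)
            out[i] = String(read(io, len))
        end
        return out
    end
end

path = tempname()
strs = ["1st string", "2nd string", "3rd string", "αβγ"]
write_strings(path, strs)
roundtrip = read_strings(path)
```

The integers are written in the machine's native byte order, so the file is only portable between machines with the same endianness.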

Mikkel-Holm · September 20, 2018, 3:03pm

Well, it's a start, @Tamas_Papp.
I will test it out on a 500+ million element string array, and get back with benchmarks.

kristoffer.carlsson · September 20, 2018, 3:04pm

Here is another attempt

function write_strs(f, strings)
    open(f, "w") do io
        for str in strings
            write(io, UInt16(sizeof(str)))
            write(io, str)
        end
    end
end

function read_strs(f)
    strings = String[]
    open(f, "r") do io
        while !eof(io)
            l = read(io, UInt16)
            s = Base.StringVector(l)
            for i in 1:l
                s[i] = read(io, UInt8)
            end
            push!(strings, String(s))
        end
    end
    return strings
end

strings = ["aba", "cbc", "αβ", "longerstring...."]
write_strs("foo", strings)
read_strs("foo")

a = fill("hello world", 10^7)
write_strs("big", a)
@time b = read_strs("big")
@assert b==a
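
A possible variant of the reader fetches each string's bytes with a single read(io, l) call instead of a byte-by-byte loop. This is a sketch, not a measured improvement, and the name `read_strs_bulk` is made up here:

```julia
# Writer with the same layout as above: UInt16 byte length, then the bytes.
function write_strs_sketch(f, strings)
    open(f, "w") do io
        for str in strings
            write(io, UInt16(sizeof(str)))
            write(io, str)
        end
    end
end

# Hypothetical bulk reader: one read call per string.
function read_strs_bulk(f)
    strings = String[]
    open(f, "r") do io
        while !eof(io)
            l = read(io, UInt16)
            push!(strings, String(read(io, l)))
        end
    end
    return strings
end

f = tempname()
sample = ["aba", "cbc", "αβ", "longerstring...."]
write_strs_sketch(f, sample)
bulk = read_strs_bulk(f)
```

Note that UInt16(sizeof(str)) throws an InexactError for strings longer than 65535 bytes, so a wider prefix type would be needed for longer strings.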

Mikkel-Holm · September 20, 2018, 3:10pm

@kristoffer.carlsson I will benchmark your attempt against @Tamas_Papp's on a 500+ million element Array{String,1} and get back with results.

kristoffer.carlsson · September 20, 2018, 4:57pm

JLD2 loads it in almost the same time as these handwritten functions, so it might be better to just go with that. The file size is a bit bigger, though.

Tamas_Papp · September 21, 2018, 5:11am

The problem with basing my workflow on JLD2 is that if it is broken, I can’t just fix it easily. This was a problem in practice with the transition to 1.0 for me, and since then I have been wary of using it.

kristoffer.carlsson · September 21, 2018, 2:09pm

The 1.0 release should hopefully at least reduce that worry a bit.
