I am working with a semi-large dataset consisting of multiple arrays of different types, each with ~300 million elements.
Using the JLD package to load the data as .jld files takes around 60 minutes, which is simply too slow.
Writing/reading binary takes only ~60 seconds, but then I have trouble working with strings.
As an example, how does one convert a large Array{UInt8,1} back to an Array{String,1}, as illustrated below?
Write to binary:
a = fill("hello world", 300*10^6); # a = fill("hello world", 10^5) if RAM is an issue
out = open("test.bin", "w")
write(out, a)
close(out)
Read from binary:
out = open("test.bin", "r")
a = read(out);
Then I am stuck at Char.(a[1:11]). How does one get back to the original array of “hello world”?
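For the specific case in the question, where every string has the same byte length, the raw bytes can be cut into equal slices. The following is only a minimal sketch under that assumption (the helper name `split_fixed` is hypothetical, and the strings are written one at a time in a loop). It does not help with variable-length strings, because the file contains no delimiters or lengths.

```julia
# Hypothetical helper: split a byte vector into strings of `width` bytes each.
# Assumes every original string had exactly `width` bytes (code units).
function split_fixed(bytes::Vector{UInt8}, width::Integer)
    length(bytes) % width == 0 ||
        throw(ArgumentError("byte count is not a multiple of width"))
    n = length(bytes) ÷ width
    # bytes[range] makes a copy, which String then takes ownership of
    return [String(bytes[(i - 1) * width + 1 : i * width]) for i in 1:n]
end

path = tempname()
a = fill("hello world", 10^5)
open(path, "w") do io
    for s in a
        write(io, s)      # strings end up back to back, no separators
    end
end

bytes = read(path)        # Vector{UInt8} with the whole file
b = split_fixed(bytes, sizeof("hello world"))
```

Here `b` should compare equal to `a`. Note that `width` is a byte count, so for non-ASCII text it has to be `sizeof(str)`, not `length(str)`.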
To write variable-length strings in binary, you have to encode the length in some way, e.g. by putting an Int16 in front of each string (if your strings are shorter than 2^15 bytes). Then you can use the String(::Vector{UInt8}) constructor after you have read that many bytes.
There are alternative solutions, such as mmapping the files etc, but the above is quite fast in practice.
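As a rough illustration of the mmap alternative, the sketch below memory-maps a length-prefixed file with the `Mmap` standard library and parses it through an `IOBuffer`. The file layout (an Int16 byte count before each string) and the variable names are assumptions made for this example, not something prescribed above.

```julia
using Mmap

path = tempname()
strings = ["1st string", "2nd string", "3rd string"]

# Write each string as an Int16 byte count followed by its bytes.
open(path, "w") do io
    for s in strings
        write(io, Int16(sizeof(s)))
        write(io, s)
    end
end

data = Mmap.mmap(path)        # Vector{UInt8} backed by the file contents
buf = IOBuffer(data)          # read-only stream over the mapped bytes
result = String[]
while !eof(buf)
    len = read(buf, Int16)                 # length prefix
    push!(result, String(read(buf, len)))  # copy len bytes into a new String
end
```

Each string is still copied out of the mapping, so this mainly changes how the bytes get from disk into memory; whether it is faster than plain `read` calls would have to be measured.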
@Tamas_Papp could you show a functioning example of how one would write e.g. ["1st string", "2nd string", "3rd string"] to binary and read it back into an Array{String,1}?
It seems there is limited information on writing strings to binary. A simple example that can scale to 100+ million element arrays would be very useful for the community.
# T is the type used for the string length; it has to be the same for reading and writing
function write_string(io::IO, T::Type{<:Signed}, str::AbstractString)
    raw = codeunits(str)          # the raw bytes of the string
    write(io, T(length(raw)))     # length prefix; throws InexactError if it does not fit in T
    write(io, raw)
end

function read_string(io::IO, T::Type{<:Signed})
    len = read(io, T)
    String(read(io, len))         # read len bytes and turn them into a String
end

tmp = tempname()
str = "Ἐγχειρίδιον Ἐπικτήτου"

open(tmp, "w") do io
    write_string(io, Int16, str)
end

open(tmp, "r") do io
    read_string(io, Int16)
end
function write_strs(f, strings)
    open(f, "w") do io
        for str in strings
            # length prefix in bytes; UInt16 limits each string to 65535 bytes
            write(io, UInt16(sizeof(str)))
            write(io, str)
        end
    end
end

function read_strs(f)
    strings = String[]
    open(f, "r") do io
        while !eof(io)
            l = read(io, UInt16)
            s = Base.StringVector(l)   # internal Base function: byte buffer that String can take over
            for i in 1:l
                s[i] = read(io, UInt8)
            end
            push!(strings, String(s))
        end
    end
    return strings
end

strings = ["aba", "cbc", "αβ", "longerstring...."]
write_strs("foo", strings)
read_strs("foo")

a = fill("hello world", 10^7)
write_strs("big", a)
@time b = read_strs("big")
@assert b == a
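One possible variation, sketched here under my own assumptions rather than taken from the code above, is to store the number of strings as a header so the reader can preallocate the result, and to read each string's bytes with a single `read(io, len)` call instead of byte by byte. The function names `write_strs_counted` and `read_strs_counted` are hypothetical.

```julia
# Hypothetical variant: an Int64 element count first, then length-prefixed strings.
function write_strs_counted(path, strings)
    open(path, "w") do io
        write(io, Int64(length(strings)))      # header: number of strings
        for str in strings
            write(io, UInt16(sizeof(str)))     # throws InexactError above 65535 bytes
            write(io, str)
        end
    end
end

function read_strs_counted(path)
    open(path, "r") do io
        n = read(io, Int64)
        strings = Vector{String}(undef, n)     # preallocate using the header
        for i in 1:n
            len = read(io, UInt16)
            strings[i] = String(read(io, len)) # all bytes of one string in one call
        end
        return strings
    end
end

path = tempname()
sample = ["aba", "cbc", "αβ", "longerstring...."]
write_strs_counted(path, sample)
roundtrip = read_strs_counted(path)
```

Whether this is actually faster than the loop above on a 100+ million element array is something to check with `@time` on real data.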
The problem with basing my workflow on JLD2 is that if it is broken, I can’t just fix it easily. This was a problem in practice with the transition to 1.0 for me, and since then I have been wary of using it.