Read a text file directly as a vector of SVectors


#1

Here is my try:

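using StaticArrays

# Read a delimited text file into a Vector of SVector{D, T}, one SVector per line.
function read_dataset(filename, V::Type{SVector{D, T}}, sep::Char = ',') where {D, T}
    data = SVector{D, T}[]
    open(filename) do io
        for ss in eachline(io)
            s = split(ss, sep)
            # Parse the first D fields of the line into an SVector.
            push!(data, V(ntuple(k -> parse(T, s[k]), Val(D))))
        end
    end
    return data
end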

I feel like this is completely “wrong” though, because when I checked the source code for readdlm it seemed super complicated, and to understand it I would have to go very deep.

So my simple question is if there is already an existing method that loads a text file directly as a Vector{SVector} and not as a Matrix?

Note: reinterpret cannot work directly here because, in 99.9% of the data files I will need to load, each row holds the components of one “SVector” and successive rows are the discrete data points, e.g.:

-0.3999999999999999,0.3
1.076,-0.11999999999999997
-0.7408864000000001,0.32280000000000003
0.554322279213056,-0.22226592

etc. This means that I would first need to transpose the matrix, which does not really seem efficient?


#2

Not sure if there is a faster way, but why not just read it into a flat vector, line by line, then reinterpret at the end? If you know the size of the vector, you can also use sizehint! to allocate once.
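A minimal sketch of that suggestion, assuming StaticArrays is installed and that every line holds exactly D separated values. The name read_flat is hypothetical and used only for illustration:

```julia
using StaticArrays  # assumed to be installed; provides SVector

# Hypothetical helper: read all values into one flat Vector{T}, then
# reinterpret each consecutive group of D values as one SVector{D, T}.
function read_flat(filename, ::Type{SVector{D, T}}; sep::Char = ',') where {D, T}
    flat = T[]
    for line in eachline(filename)
        isempty(strip(line)) && continue          # skip blank lines
        fields = split(line, sep)
        length(fields) == D || error("expected $D fields, got $(length(fields))")
        for f in fields
            push!(flat, parse(T, f))
        end
    end
    # No transpose is needed: the values were stored row by row.
    return reinterpret(SVector{D, T}, flat)
end
```

Called as read_flat("data.txt", SVector{2, Float64}), this returns a reinterpreted view over the flat vector rather than a plain Vector; collect on the result makes an independent copy if one is needed.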


#3
  1. How can I tell the size of the vector I will need? Is there a function that tells me how many lines the file will have?
  2. Didn’t the way reinterpret works change in 0.7?

#4

You have to read the file, e.g. with readline and just count, or read by byte and just count the \ns, adjusting for whether the last line has a terminating newline or not.
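The byte-counting variant could be sketched as follows. The function name count_lines_bytes is hypothetical, and the sketch reads the whole file into memory for simplicity:

```julia
# Hypothetical helper: count lines by counting newline bytes.
function count_lines_bytes(filename)
    bytes = read(filename)                     # whole file as Vector{UInt8}
    isempty(bytes) && return 0
    n = count(==(UInt8('\n')), bytes)
    # A last line without a terminating newline still counts as a line.
    return bytes[end] == UInt8('\n') ? n : n + 1
end
```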

But a single pass with push! should be faster; you can always do a sizehint! at the end.


#5

Okay I clearly do not get it, so please explain a bit more…

After I have finished reading each line, I will already have pushed all the data in the vector. Thus I will also have counted everything and already filled my vector.

What is the meaning of sizehint! then? It is completely meaningless, right? I truly don’t get how sizehint! can be used at “the end”.


#6

There is countlines, but it will read the file anyway. If memory is not a problem, you can use readlines to read all lines at once into a vector of strings, then call length. Benchmarking both would be interesting. Also, if you have full control over the format, then encoding this information in the first line would probably be the fastest.
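As a rough sketch of the two-pass idea, countlines can feed sizehint! before the parsing pass. The name read_pairs is hypothetical, and plain tuples are used instead of SVectors only to keep the example free of package dependencies:

```julia
# Hypothetical helper: count lines first, reserve capacity, then parse.
function read_pairs(filename; sep::Char = ',')
    n = countlines(filename)                   # first pass over the file
    data = NTuple{2, Float64}[]
    sizehint!(data, n)                         # reserve room for n elements
    for line in eachline(filename)             # second pass: parse
        isempty(strip(line)) && continue       # skip blank lines
        a, b = split(line, sep)                # first two fields of the line
        push!(data, (parse(Float64, a), parse(Float64, b)))
    end
    return data
end
```

Whether the extra pass pays off compared with a single pass of push! depends on the file, so it is worth timing both on real data.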

Not sure about the second question.


#7

Sorry, I meant resize!. I am under the impression that it will free the extra storage, but I may be misreading the code.
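For reference, here is a small sketch of what resize! does to a vector's length when the buffer was allocated with an upper bound. The name fill_then_trim and the capacity argument are assumptions for illustration, and the sketch makes no claim about whether the underlying storage is released:

```julia
# Hypothetical helper: allocate up to `capacity` slots, fill some of them,
# then trim the vector to the number of values actually written.
function fill_then_trim(values, capacity::Integer)
    data = Vector{Float64}(undef, capacity)    # uninitialized buffer
    n = 0
    for x in values
        n += 1
        data[n] = x
    end
    resize!(data, n)                           # length(data) is now n
    return data
end
```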