How can I read back a delimited file with mixed column types involving vectors

liuyxpp · May 15, 2022, 2:18am

Say I have data in the following format

v = [(1, 0.1, [1, 2, 3]), (2, 0.2, [3,4,5])]

I can write it to a delimited file using

using DelimitedFiles
open("test.csv", "w") do io   # the mode must be the string "w"
    writedlm(io, v)
end

The content of test.csv is

1	0.1	[1, 2, 3]
2	0.2	[3, 4, 5]

I tried the following way to read it back:

w = open("test.csv", "r") do io
    readdlm(io)
end

However, it splits the vector into parts of strings.

With CSV.jl, it also errored:

using CSV
CSV.File("test.csv", types=[Int, Float64, Vector{Int}])

Does anyone have a better idea than splitting the vector manually before saving?

nilshg · May 15, 2022, 6:17am

You need to use a file format which supports reading and writing vectors, not a plain text file.

Same discussion was had here: DataFrames/CSV: how to read vectors from *.csv?

liuyxpp · May 15, 2022, 9:34am

I became aware of that discussion after I posted this. However, I don’t want to add heavy dependencies such as DataFrames.jl. I guess I will simply extract all the elements and encode the vector length in another column. Thanks!

rocco_sprmnt21 · May 15, 2022, 1:38pm

One possible way to do this:

map( r-> (@. eval(Meta.parse(*(w[r,end-2:end]...)))), axes(w,1))

Although I have no idea how inefficient it is.

lmiq · May 15, 2022, 2:39pm

For each line, replace the commas and square brackets with blanks, then split the data.
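A minimal sketch of that idea, assuming every row has the shape Int, Float64, then one integer vector. The helper name parse_row is hypothetical, and the two sample lines stand in for the contents of test.csv:

```julia
# Hypothetical helper: parse one line such as "1\t0.1\t[1, 2, 3]".
function parse_row(line::AbstractString)
    # Replace commas and square brackets with blanks, then split on whitespace.
    fields = split(replace(line, r"[\[\],]" => " "))
    id  = parse(Int, fields[1])
    x   = parse(Float64, fields[2])
    vec = parse.(Int, fields[3:end])   # everything left belongs to the vector
    return (id, x, vec)
end

# Sample lines in the same format that writedlm produced above.
lines = ["1\t0.1\t[1, 2, 3]", "2\t0.2\t[3, 4, 5]"]
rows = [parse_row(l) for l in lines]
```

To read from the file instead, eachline("test.csv") could replace the lines vector. This only works with a single trailing vector per row, because the bracket positions are discarded.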

rafael.guerra · May 15, 2022, 3:24pm

This is not going to win any coding prize but, fwiw:

m = []
open("test.csv", "r") do io
    while !eof(io)
       push!(m, eval(Meta.parse(readline(io))))
    end
end
m
2-element Vector{Any}:
 (1, 0.1, [1, 2, 3])
 (2, 0.2, [3, 4, 5])

m == v    # true

liuyxpp · May 16, 2022, 1:14am

Wow, that means I can even parse more complex tuples like (1, 0.1, [1, 2, 3], [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]) which is very close to my actual use case. Thanks!

rocco_sprmnt21 · May 16, 2022, 8:41am

I too had thought of a solution “close” to the one proposed by @rafael.guerra

map( r-> (@. eval(Meta.parse(*(w[r,end-2:end]...)))), axes(w,1))

arr=[]
open("arrayinfile.csv", "r") do io
    while !eof(io)
        w = readline(io)
        push!(arr,eval(Meta.parse(match(r"\[(.*?)\]", w).match)))
    end
end

Note that this (the solution proposed by @rafael.guerra) works only if you save the file with a comma as the delimiter rather than a tab, which is the default.
Note incidentally that the documentation is easy to misread here: delim is a positional argument with a default value, not a keyword argument, so writing delim=',' does not work.

julia> using DelimitedFiles

help?> writedlm
search: writedlm

  writedlm(f, A, delim='\t'; opts)

  Write A (a vector, matrix, or an iterable collection of      
  iterable rows) as text to f (either a filename string or an  
  IO stream) using the given delimiter delim (which defaults   
  to tab, but can be any printable Julia object, typically a   
  Char or AbstractString).

The syntax that works is the following:

open("arrayinfile.csv", "w") do io
   writedlm(io, v, ',')
end

liuyxpp · May 16, 2022, 10:22am

Sometimes I want to read the CSV by eye, so I chose “\t” as the delimiter. In the end I came up with the following solution, which works as expected.

trace = []
open(trace_file, "r") do io
    readline(io)  # skip the header line
    while !eof(io)
        line = replace(readline(io), "\t" => ",")
        push!(trace, eval(Meta.parse(line)))
    end
end
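If calling eval on file contents is a concern, the same tab-delimited layout can also be parsed field by field. This is only a sketch under assumptions: the column types (Int, Float64, Vector{Int}, Vector{Float64}) are fixed, every vector is non-empty, and parse_vec and parse_line are hypothetical helper names.

```julia
# Hypothetical helper: turn a field such as "[1, 2, 3]" into a Vector{T}.
# Assumes the vector is non-empty.
function parse_vec(::Type{T}, s::AbstractString) where {T}
    inner = strip(s, ['[', ']'])             # drop the surrounding brackets
    return parse.(T, strip.(split(inner, ',')))
end

# Hypothetical helper: parse one tab-delimited line with two vector columns.
function parse_line(line::AbstractString)
    f = split(line, '\t')
    return (parse(Int, f[1]), parse(Float64, f[2]),
            parse_vec(Int, f[3]), parse_vec(Float64, f[4]))
end

row = parse_line("1\t0.1\t[1, 2, 3]\t[0.1, 0.2, 0.3]")
```

Because the tab separates the columns and the comma separates the vector elements, the two never collide, so no delimiter replacement is needed.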

suavesito · May 16, 2022, 1:47pm

You should know that this approach is not really that fast.

On my computer I get

julia> v = [(rand(Int), rand(), rand(Int64, 3)) for _ in 1:1000];

julia> open("test.csv", "w") do io
            writedlm(io, v, ',')
       end

julia> function open_csv()
           m = Vector{Tuple{Int, Float64, Vector{Int}}}()
           open("test.csv", "r") do io
               while !eof(io)
                   push!(m, eval(Meta.parse(readline(io))))
               end
           end
           m
       end
open_csv (generic function with 1 method)

julia> open_csv(); # compile it

julia> @time @eval w = open_csv();
  0.273463 seconds (54.11 k allocations: 2.600 MiB)

julia> w == v
true

So just over 0.2 seconds for a fairly simple expression repeated only 1000 times. If you need to save and retrieve this data in larger amounts, you should consider another approach, such as saving it in a binary format.
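As one possible binary option, here is a sketch using the Serialization standard library. It is an illustration rather than a recommendation from this thread, and the temporary path is only a placeholder:

```julia
using Serialization

v = [(1, 0.1, [1, 2, 3]), (2, 0.2, [3, 4, 5])]

path = tempname()        # placeholder path for the example
serialize(path, v)       # write the whole object in Julia's binary format
w = deserialize(path)    # read it back with the element types intact
```

The resulting file is not human-readable, and the Serialization documentation does not promise that files stay readable across different Julia versions, so this fits short-term storage better than archiving.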

Or use a free-form CSV layout instead: for example, write the number of arrays, then the number of elements in each, and then the elements themselves.
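A sketch of such a length-prefixed layout, assuming a single vector per row so that only the element count needs to be stored. The helpers flatten_row and parse_flat are hypothetical names:

```julia
using DelimitedFiles

v = [(1, 0.1, [1, 2, 3]), (2, 0.2, [3, 4, 5, 6])]

# Hypothetical layout per row: id, x, n, followed by the n vector elements.
flatten_row(t) = (t[1], t[2], length(t[3]), t[3]...)

# Hypothetical reader for one line of that layout.
function parse_flat(line::AbstractString)
    f = split(line, '\t')
    n = parse(Int, f[3])                     # stored vector length
    return (parse(Int, f[1]), parse(Float64, f[2]), parse.(Int, f[4:3+n]))
end

path = tempname()                            # placeholder path for the example
open(path, "w") do io
    writedlm(io, map(flatten_row, v))        # tab-delimited, one row per tuple
end

w = [parse_flat(l) for l in eachline(path)]
```

The file stays plain text and easy to read by eye, and reading it back needs neither eval nor bracket handling.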

liuyxpp · May 16, 2022, 1:51pm

Thanks for benchmarking this. The time is totally fine with me.
