How can I read back a delimited file with mixed column types involving vectors

Say I have data in the following format

v = [(1, 0.1, [1, 2, 3]), (2, 0.2, [3,4,5])]

I can write it to a delimited file using

using DelimitedFiles
open("test.csv", "w") do io
    writedlm(io, v)
end

The content of test.csv is

1	0.1	[1, 2, 3]
2	0.2	[3, 4, 5]

I tried the following way to read it back:

w = open("test.csv", "r") do io
    readdlm(io)
end

However, it splits the vector into parts of strings.

With CSV.jl, it also errored:

using CSV
CSV.File("test.csv", types=[Int, Float64, Vector{Int}])

Does anyone have a better idea other than breaking the vector up manually before saving?

You need to use a file format which supports reading and writing vectors, not a plain text file.

The same question was discussed here: DataFrames/CSV: how to read vectors from *.csv?

I became aware of that discussion after I posted this. However, I don’t want to add heavy dependencies such as DataFrames.jl. I guess I will simply extract all elements and encode the vector length in another column. Thanks!

One possible way to do this:

map( r-> (@. eval(Meta.parse(*(w[r,end-2:end]...)))), axes(w,1))

Although I have no idea how inefficient it is.

For each line, replace commas and square brackets with blanks, then split the data.
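A minimal sketch of that idea, assuming every row has the shape (Int, Float64, Vector{Int}) as in the original post. The helper name `parse_row` is made up for illustration; it avoids `eval` and works for tab- or comma-delimited lines.

```julia
# Hypothetical helper: parse one line such as "1\t0.1\t[1, 2, 3]" without eval.
# Assumes the row layout is Int, Float64, then the elements of one Int vector.
function parse_row(line::AbstractString)
    # Turn brackets and commas into blanks, then split on whitespace (tabs included).
    cleaned = replace(line, r"[\[\],]" => " ")
    fields = split(cleaned)
    id = parse(Int, fields[1])
    x = parse(Float64, fields[2])
    vec = parse.(Int, fields[3:end])
    return (id, x, vec)
end

row = parse_row("1\t0.1\t[1, 2, 3]")
```

To read a whole file, something like `[parse_row(l) for l in eachline("test.csv")]` should work. With more than one vector per row this simple version can no longer tell where one vector ends, so the brackets would have to be kept as separators.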

This is not going to win any coding prize but, fwiw:

m = []
open("test.csv", "r") do io
    while !eof(io)
       push!(m, eval(Meta.parse(readline(io))))
    end
end
m
2-element Vector{Any}:
 (1, 0.1, [1, 2, 3])
 (2, 0.2, [3, 4, 5])

m == v    # true

Wow, that means I can even parse more complex tuples like (1, 0.1, [1, 2, 3], [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]) which is very close to my actual use case. Thanks!

I too had thought of a solution “close” to the one proposed by @rafael.guerra

map( r-> (@. eval(Meta.parse(*(w[r,end-2:end]...)))), axes(w,1))

arr=[]
open("arrayinfile.csv", "r") do io
    while !eof(io)
        w = readline(io)
        push!(arr,eval(Meta.parse(match(r"\[(.*?)\]", w).match)))
    end
end

Note that this (the solution proposed by @rafael.guerra) works only if you use a comma as the delimiter when saving the file, rather than the tab that writedlm uses by default.
Note incidentally that the documentation is easy to misread here: delim='\t' appears before the semicolon in the signature, so it is an optional positional argument, not a keyword argument delim=.

julia> using DelimitedFiles

help?> writedlm
search: writedlm

  writedlm(f, A, delim='\t'; opts)

  Write A (a vector, matrix, or an iterable collection of      
  iterable rows) as text to f (either a filename string or an  
  IO stream) using the given delimiter delim (which defaults   
  to tab, but can be any printable Julia object, typically a   
  Char or AbstractString).

The syntax that works is the following:

open("arrayinfile.csv", "w") do io
   writedlm(io, v, ',')
end

Sometimes I want to read the CSV by eye, so I chose "\t" as the delimiter. In the end I came up with the following solution, which works as expected.

trace = []
open(trace_file, "r") do io
    readline(io)  # skip the header line
    while !eof(io)
        # turn the tab delimiters into commas so the line parses as a tuple
        line = replace(readline(io), "\t" => ",")
        push!(trace, eval(Meta.parse(line)))
    end
end

You should know that this approach is not really that fast. :sweat_smile:

On my computer I get

julia> v = [(rand(Int), rand(), rand(Int64, 3)) for _ in 1:1000];

julia> open("test.csv", "w") do io
            writedlm(io, v, ',')
       end

julia> function open_csv()
           m = Vector{Tuple{Int, Float64, Vector{Int}}}()
           open("test.csv", "r") do io
               while !eof(io)
                   push!(m, eval(Meta.parse(readline(io))))
               end
           end
           m
       end
open_csv (generic function with 1 method)

julia> open_csv(); # compile it

julia> @time @eval w = open_csv();
  0.273463 seconds (54.11 k allocations: 2.600 MiB)

julia> w == v
true

So just over 0.2 seconds for a fairly simple expression repeated just 1000 times. If you need to save and retrieve this data in larger amounts, maybe you should consider another approach, such as saving it in a binary format.
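As one example of a binary route, here is a sketch using the Serialization standard library. The file name is an assumption, and as far as I know the serialized format is not guaranteed to stay readable across Julia versions, so it suits short-term storage better than archiving.

```julia
using Serialization

v = [(1, 0.1, [1, 2, 3]), (2, 0.2, [3, 4, 5])]

# Hypothetical file name, placed in a temporary directory for this sketch.
path = joinpath(mktempdir(), "test.jls")

serialize(path, v)      # write the whole object in Julia's binary format
w = deserialize(path)   # read it back with the element types preserved

w == v
```

The file is no longer readable by eye, which may matter if you want to inspect it by hand.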

Or use another free-style CSV format, for example one that stores the number of arrays, then the number of elements of each, and then the elements themselves. :smile:
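A rough sketch of such a length-prefixed layout, simplified to a single vector per row: id, value, number of elements, then the elements. The names `encode_row` and `decode_row` are made up for illustration, and the column types are assumed to be Int, Float64 and Vector{Int}.

```julia
# Hypothetical layout per row (tab-delimited): id, x, n, then the n vector elements.
encode_row(r) = join((r[1], r[2], length(r[3]), r[3]...), '\t')

function decode_row(line::AbstractString)
    f = split(line, '\t')
    n = parse(Int, f[3])                  # length of the vector that follows
    vec = parse.(Int, f[4:3+n])
    return (parse(Int, f[1]), parse(Float64, f[2]), vec)
end

v = [(1, 0.1, [1, 2, 3]), (2, 0.2, [3, 4, 5])]
lines = encode_row.(v)     # each entry is one line of the file
w = decode_row.(lines)

w == v
```

For several vectors per row, the same idea extends by writing a count of vectors first and then repeating the length-plus-elements group for each one.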

Thanks for benchmarking this. The time is totally fine with me :slight_smile:
