Is this an efficient way to write some information into a .csv file using Julia?


#1

Is the following efficient? I am mainly concerned about the push! calls.

using CSV, DataFrames

function writeToFile(filename::String, am::Vector{Am})
    # Start from an empty DataFrame with three Int columns
    df = DataFrame(name1=Int[], name2=Int[], name3=Int[])
    for a in am
        # Append one row per element of am
        push!(df, [a.name1, a.name2, a.name3])
    end
    CSV.write(filename, df)
end

#2
CSV.write(filename, am)

should work already if am is a Vector of NamedTuples.


#3

@piever, what do you mean by a NamedTuple?

I tried a toy example and got an error.

using CSV

mutable struct myType
    index::Int     # Int64
    sIndex::Int    # Int64
    myType() = new(0, 0)
end

function writeToFile(filename::String, a::Vector{myType})
    @assert length(a) > 0
    CSV.write(filename, a)
end

function main()
    # Julia 0.7+ needs `undef` to allocate an uninitialized vector
    aVector = Vector{myType}(undef, 6)
    for i in 1:6
        aVector[i] = myType()
        aVector[i].index = i
        aVector[i].sIndex = i
    end
    writeToFile("path to my csv", aVector)
end

main()

I got an error saying "ArgumentError: no default Tables.row implementation for type: Array{myType, 1}".


#4

The following only applies if you have the latest version of Tables.jl (which is 0.1.14); I’m not sure it will work in earlier versions.

A NamedTuple is an object such as (a=1, b="something"): like a Tuple, except that its fields have names. An Array of NamedTuples (for example v = [(a=i, b=i+1) for i in 1:3]) can be treated as a table, and therefore you can do CSV.write(filename, v).

I thought am was a vector of named tuples. If it isn’t, then instead of allocating a DataFrame (which could slow things down if your data is really big) you can simply create a generator of NamedTuples and write it to CSV:

itr = ((name1=a.name1, name2=a.name2, name3=a.name3) for a in am)
CSV.write(filename, itr)

This way you avoid allocating the DataFrame, and the data is streamed directly from your vector to the CSV file. Note that itr is probably also a better way to build a DataFrame: DataFrame(itr) gives you the DataFrame directly and avoids the temporary vectors [a.name1, a.name2, a.name3].
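To make the generator idea concrete without any packages, here is a minimal sketch in plain Julia. The Am struct below is only a stand-in (the field names come from the question, the Int field types are an assumption), and to_namedtuple is a hypothetical helper written for this illustration, not a function from CSV.jl or Tables.jl.

```julia
# Hypothetical stand-in for the thread's `Am` type; field names are taken
# from the question, the Int field types are an assumption.
struct Am
    name1::Int
    name2::Int
    name3::Int
end

# Illustrative helper (not part of any package): build a NamedTuple whose
# names and values mirror the fields of an arbitrary struct.
to_namedtuple(x::T) where {T} =
    NamedTuple{fieldnames(T)}(ntuple(i -> getfield(x, i), fieldcount(T)))

am = [Am(i, 2i, 3i) for i in 1:3]

# A lazy generator: no NamedTuple is built until something iterates over it.
itr = (to_namedtuple(a) for a in am)

# Collect only to inspect the rows here; a CSV writer would consume them one by one.
rows = collect(itr)
println(rows[2])   # (name1 = 2, name2 = 4, name3 = 6)
```

With CSV.jl loaded, passing itr to CSV.write(filename, itr) should then behave like the hand-written generator above, since both yield one NamedTuple per element.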


#5

@piever, is there a way to do the reverse? I mean, to read a .csv file into NamedTuples and then create a type/mutable struct from them. :wink:


#6

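To treat a DataFrame as an iterator of rows, you can use Tables.rows:

using CSV, DataFrames, Tables

struct MyType
    x::Int
    y::Float64
end

# Assumes the file has columns named x and y.
# CSV.read(filename) does the same on older CSV.jl versions; newer ones need an explicit sink.
df = CSV.File(filename) |> DataFrame
struct_vec = map(row -> MyType(row.x, row.y), Tables.rows(df))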

Alternatively, since a DataFrame uses columnar storage (it is basically a list of vectors), you can convert it to a named tuple of arrays with Tables.columntable(df) and then create a StructArray from it:

using StructArrays
cols = Tables.columntable(df)
StructArray{MyType}(cols)
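
To show what the row and column views of the same data amount to, here is a small sketch using only Base Julia. The sample values are made up for illustration, and the manual zip is just a stand-in for what Tables.rows or a StructArray would give you; it is not how those packages are implemented.

```julia
struct MyType
    x::Int
    y::Float64
end

# A "column table": one NamedTuple holding one vector per column
# (the same shape that Tables.columntable returns). Values are made up.
cols = (x = [1, 2, 3], y = [0.5, 1.5, 2.5])

# Row-wise view of the same data: walk the columns in lockstep
# and build one struct per row.
struct_vec = [MyType(x, y) for (x, y) in zip(cols.x, cols.y)]

# And back again: gather each field into its own vector.
cols_again = (x = [s.x for s in struct_vec], y = [s.y for s in struct_vec])

println(struct_vec[2].y)      # 1.5
println(cols_again == cols)   # true
```

The difference with StructArray is that it keeps the columns as they are and only presents them as an array of MyType, whereas the comprehension above copies the data into a new vector of structs.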

#7

Yes. Note that you can do table = CSV.File(file) |> Tables.columntable to materialize a csv file directly into a NamedTuple of Vectors, without constructing an intermediate DataFrame. In general, you can materialize a csv file directly into any Tables.jl-compatible “sink” function, such as SQLite.load!, Feather.write, IndexedTables.table, etc. It is nice not to have to go through a DataFrame when that’s not needed.
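
A small round trip may help tie the thread together. This sketch assumes CSV.jl and Tables.jl are installed; the file name example.csv and the sample rows are made up for illustration.

```julia
using CSV, Tables

# A vector of NamedTuples is already a table as far as Tables.jl is concerned.
rows = [(a = i, b = i + 1) for i in 1:3]

# Write to a throwaway file in a temporary directory.
path = joinpath(mktempdir(), "example.csv")
CSV.write(path, rows)

# Read it back as a NamedTuple of column vectors, with no DataFrame in between.
table = CSV.File(path) |> Tables.columntable

println(keys(table))             # (:a, :b)
println(table.a == [1, 2, 3])    # true
```

Any other Tables.jl-compatible sink could take the place of Tables.columntable on the right-hand side of the pipe.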