You can use CSV.jl and manually specify the column types as String, i.e.
source = CSV.Source("data.csv", types=[String,String,String])
That doesn’t deal with the gzip part, but I believe you can pass a stream instead of the filename to CSV.Source, so you might be able to just use the decompression streams from Libz.jl for that.
If you then load IterableTables, source will be an iterable table. If you want to iterate manually, you can call getiterator(source), which returns an object you can iterate over. Each element of that iterator is a NamedTuple with one field per column, and every field is a String. Note that if you put a function barrier into the whole thing, you can iterate in a type-stable manner. So the whole story would probably look like this (I haven’t tried this code!):
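using CSV, Libz, IterableTables  # assumed to provide everything used below

function my_processing_function(it)
    for i in it
        # Here you can access the individual columns as i.column_name
    end
end

# filename and number_of_columns are placeholders for your own values
file = open(filename, "r")
io = Libz.ZlibInflateInputStream(file)
csv = CSV.Source(io, types=fill(String, number_of_columns))
it = getiterator(csv)
my_processing_function(it)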
I left out any code that frees the various resources you are allocating here, i.e. you might want to wrap this in try...finally blocks that close them.
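To make the cleanup part concrete, here is a minimal sketch of the try...finally pattern using only Base Julia. It deliberately leaves out CSV.jl and Libz.jl: count_lines is a hypothetical stand-in for the real processing, and the temporary file only exists so the sketch is self-contained. Which of the stream and source objects above need an explicit close is something I have not checked.

```julia
# Hypothetical stand-in for the real processing: count the lines in a stream.
function count_lines(io::IO)
    n = 0
    for _ in eachline(io)
        n += 1
    end
    return n
end

# Write a small sample file so the sketch is self-contained.
path, tmp = mktemp()
write(tmp, "a,b,c\n1,2,3\n")
close(tmp)

file = open(path, "r")
n = try
    # The decompression stream and the CSV source would be created here,
    # each ideally guarded by its own try...finally.
    count_lines(file)
finally
    close(file)  # runs even if the processing above throws
    rm(path)     # remove the sample file
end
```

For plain files, the do-block form open(path, "r") do file ... end gives the same guarantee with less ceremony.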
Alternatively, you might be able to express your data transformation as a Query.jl query, in which case you could just query csv directly:
@from i in csv begin
    # Do whatever you want to do here
end
The query will automatically put things behind a function barrier, i.e. any transformation that happens inside the query itself should run in a type-stable way.
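In case the term is unfamiliar, here is a minimal sketch of a function barrier in plain Julia, independent of CSV.jl and Query.jl. It assumes a Julia version with built-in named tuples, and the names rows and total_length are made up for illustration.

```julia
# `rows` is a non-constant global, so code at top level cannot assume its type.
rows = [(a = "1", b = "x"), (a = "22", b = "y")]

# The barrier: when this function is called, Julia compiles a method
# specialized on the concrete type of `it`, so the loop body is type stable.
function total_length(it)
    n = 0
    for i in it
        n += length(i.a) + length(i.b)
    end
    return n
end

# Crossing the barrier: a single dynamic dispatch happens here, and
# everything inside total_length runs on known types.
total = total_length(rows)
```

The same loop written directly at top level would have to look up the type of rows and of every element at run time, which is what the barrier avoids.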
The whole CSV.jl integration works but is a bit clunky because of the manual resource management part. I’ve been thinking about writing a CSV parser that integrates more smoothly with the IterableTables/Query world, and it might not even be that difficult given all the great plumbing from TextParse.jl that I could reuse. Right now it is not high on my priority list, though: there are too many other things to do first that seem more important to me.