You can use CSV.jl and manually specify the column types as
source = CSV.Source("data.csv", types=[String,String,String])
That doesn’t handle the gzip part, but I believe you can pass a stream instead of a filename to CSV.Source, so you might be able to use the decompression streams from Libz.jl for that.
If you then load IterableTables, source will be an iterable table. If you want to iterate manually, you can call getiterator(source), which returns an iterator whose elements are NamedTuples with one String field per column. Note that if you put a function barrier between creating the iterator and looping over it, the iteration itself is type stable. So the whole story would probably look like this (I haven’t tried this code!):
using CSV, Libz, IterableTables
file = open(filename, "r")
io = Libz.ZlibInflateInputStream(file)
csv = CSV.Source(io, types=fill(String, number_of_columns))
it = getiterator(csv)
for i in it
    # Here you can access the individual columns as i.column_name
end
I left out any code that frees the resources you are allocating here, i.e. you might want to wrap this in try...finally blocks that close them again.
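The cleanup pattern mentioned above could look roughly like the sketch below. It uses only Base Julia so that it runs without any packages. count_lines and count_lines_in_file are hypothetical names made up for this illustration, and the decompression stream and CSV source are deliberately left out; each of them would get its own nested try...finally in the same way.

```julia
# Hypothetical helper: consume any IO stream and count its lines.
function count_lines(io::IO)
    n = 0
    for _ in eachline(io)
        n += 1
    end
    return n
end

# Open the file, do the work, and always close the handle again.
function count_lines_in_file(filename::AbstractString)
    file = open(filename, "r")
    try
        # A decompression stream or CSV source would be created from `file`
        # here, each inside its own nested try...finally block.
        return count_lines(file)
    finally
        close(file)  # runs even if count_lines throws
    end
end
```

For a single file handle, open(filename, "r") do file ... end does the same thing with less typing; the explicit form is shown because several resources have to be closed in reverse order here.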
Alternatively you might be able to express your data transformation as a Query.jl query, in which case you could just query the source directly:
@from i in csv begin
    # Do whatever you want to do here
end
The query will automatically put things behind a function barrier, i.e. any transformation that is happening inside the query itself should be run in a type stable way.
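In case the term is unfamiliar, here is a minimal sketch of a function barrier in plain Julia. It does not use CSV.jl or Query.jl at all; load_column, total_length and main are hypothetical names for this illustration, and load_column just stands in for any call whose return type the compiler cannot pin down to one concrete type.

```julia
# Type-unstable on purpose: the return type depends on a runtime value.
function load_column(as_strings::Bool)
    return as_strings ? ["1", "22", "333"] : [1, 22, 333]
end

# The kernel behind the barrier. Julia compiles a specialized version
# of this function for the concrete type of `col` it is called with.
function total_length(col)
    n = 0
    for x in col
        n += length(string(x))
    end
    return n
end

function main()
    col = load_column(true)   # the compiler cannot narrow this to one concrete type
    return total_length(col)  # one dispatch here; the loop inside is type stable
end
```

The point is that the loop lives in its own function, so the cost of the unknown type is paid once at the call instead of on every iteration.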
The whole CSV.jl integration works but is a bit clunky because of the manual resource management part. I’ve been thinking about writing a CSV parser that integrates more smoothly with the IterableTables/Query world, and it might not even be that difficult given all the great plumbing from TextParse.jl that I could reuse. Right now it is not high on my priority list, though; there are too many other things to do first that seem more important to me.