Iterate delimited file as NamedTuples

question

#1

I have (gzip-compressed) files with ;-delimited rows. I would like to iterate over the rows of each file so that in the body of the iteration I have a NamedTuple of the raw fields, i.e. unparsed strings. This is because

  1. I only need a subset of the columns, so parsing them all would be wasted,
  2. some numbers have a , as the decimal separator, some have weird strings for missing values,
  3. I mostly want to do tabulations on some fields (ie counting), or save certain transformed statistics in a Dict or similar.
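To illustrate point 2, a raw string field can be parsed lazily with a tiny helper; this is a hedged sketch on current Julia (the sentinel strings for missing values are made-up examples):

```julia
# Hedged sketch (stdlib only): parse a raw field that uses ',' as the
# decimal separator; the `missings` sentinels are hypothetical examples.
function parse_decimal(s::AbstractString; missings = ("", "NA", "-"))
    s in missings && return nothing
    return parse(Float64, replace(s, ',' => '.'))
end

parse_decimal("3,14")   # → 3.14
parse_decimal("NA")     # → nothing
```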

Total (uncompressed) data is about 400 GB. I have a hand-rolled solution using BioJulia/Libz.jl, readline, and split, which does everything except produce NamedTuples. I would be happy to sacrifice some speed to avoid numerical indexes if necessary.

I wonder if someone can point me in the right direction using the library ecosystem, e.g. CSV.Source and IterableTables, or similar. I tried but could not get single-pass reading or raw fields.


#2

You can use CSV.jl and manually specify the column types as Strings, i.e.

source = CSV.Source("data.csv", types=[String,String,String])

That doesn’t deal with the gzip part, but I believe you can pass a stream instead of the filename to CSV.Source, so you might be able to just use the Libz.jl decompression stuff for that.

If you then load IterableTables, source will be an iterable table. If you want to manually iterate over things, you can call getiterator(source), which returns an iterator whose elements are NamedTuples with the columns as Strings.

Note that if you put a function barrier into the whole thing, you can iterate in a type-stable manner. So the whole story would probably look like this (I haven’t tried this code!):

function my_processing_function(it)
    for i in it
        # Here you can access the individual columns as i.column_name
    end
end

file = open(filename, "r")
io = Libz.ZlibInflateInputStream(file)
csv = CSV.Source(io, types=fill(String,number_of_columns))
it = getiterator(csv)
my_processing_function(it)

I left out any code that frees the various resources you are allocating here, i.e. you might want to wrap this into various try...finally blocks that close the various resources this is allocating.

Alternatively, you might also be able to express your data transformation as a Query.jl query, in which case you could just query csv directly:

@from i in csv begin
    # Do whatever you want to do here
end

The query will automatically put things behind a function barrier, i.e. any transformation that is happening inside the query itself should be run in a type stable way.

The whole CSV.jl integration works but is a bit clunky because of the manual resource management part. I’ve been thinking about writing a CSV parser that integrates more smoothly with the IterableTables/Query world, and it might not even be that difficult given all the great plumbing from TextParse.jl that I could reuse. But right now it is not high on my priority list; there are too many other things to do first that seem more important to me.


#3

Thank you for your help. I benchmarked your solution against the “naive” one below:

using Libz

# Apply `f` to the fields of each `delim`-split line of `io`. Print a
# progress dot every `progress` lines; stop after `maxlines` lines
# (0 = no limit); `limit` caps the number of fields kept by `split`.
function dofields(f, io;
                  progress = 1000000, limit = 0, delim = ';', maxlines = 0)
    line = 0
    while !eof(io)
        if isa(progress, Integer) && line % progress == 0
            print(".")
        end
        line += 1
        if maxlines > 0 && line ≥ maxlines
            break
        end
        f(split(readline(io), delim; limit = limit))
    end
end

function stats(filename; options...)
    io = open(filename, "r")
    c = Dict{String,Int}()
    dofields(ZlibInflateInputStream(io);
             limit = 7, delim = ';', options...) do fields
                 kind = fields[6]
                 c[kind] = get(c, kind, 0) + 1
             end
    close(io)
    c
end

and using the getiterator approach takes about 10× longer (850 s vs 90 s for dofields above).

I wonder if I could have my cake and eat it too, and define a Data.Source for this particular purpose. I have looked at the functions, and it is unclear to me how to make a NamedTuple from a given set of column names (I guess I cannot use @NT since they are not known at compile time). I am actually reading the column names from another file.
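For reference, on current Julia (where NamedTuple is built in), a row type can be constructed from column names that are only known at run time, without @NT; a hedged sketch with made-up column names:

```julia
# Hedged sketch (current Julia): build a NamedTuple from column names
# known only at run time, e.g. read from another file.
colnames = ["date", "kind", "value"]        # hypothetical header row
fields   = ["2017-06-01", "foo", "3,14"]    # one raw, unparsed row

row = NamedTuple{Tuple(Symbol.(colnames))}(Tuple(fields))
row.kind  # → "foo"
```

Since the names are runtime values, this construction is type unstable on its own; keeping the loop that builds and consumes the rows behind a function barrier restores type stability, as discussed above.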


#4

Can you post the exact code that you used for timing the getiterator version? I’d like to dig into this and figure out what is going on. The data file is probably too big to post somewhere, or is the compressed version more easily handled? Or maybe we could create a script that creates a data file with fake data but the right size?

This part in the iterable tables documentation has some pointers on how to expose something as an iterable table. Here is the code for the DataTable implementation. Essentially one needs to implement isiterable, isiterabletable, getiterator, length, eltype, start, next and done. The key is to have start, next and done type stable, so one has to use generated functions for that.

Specifically, this is the code that creates the NamedTuple instances in each iteration, all inside a generated function.
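The generated-function technique can be sketched in a few lines on current Julia: the field names arrive as type parameters, so the constructor expression is spliced together at compile time and each call is type stable. This is a toy sketch of the idea, not the DataTable implementation itself:

```julia
# Toy sketch of the generated-function technique: `names` is a type
# parameter, so the indexing expressions are generated at compile time
# and the NamedTuple construction is type stable.
@generated function make_row(::Type{NamedTuple{names}}, values::Tuple) where {names}
    exprs = [:(values[$i]) for i in 1:length(names)]
    return :(NamedTuple{names}(($(exprs...),)))
end

make_row(NamedTuple{(:a, :b)}, ("x", "y"))  # → (a = "x", b = "y")
```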


#5

@davidanthoff: Thanks for looking into this. I made an MWE you can experiment with.

It appears that the indexing of NamedTuples accounts for most of the speed difference: if I just use a numerical index, it is much faster. But there is still a factor of 2-5 (strangely, depending on data size).

I appreciate your help, using named columns would be so much nicer.


#6

@Tamas_Papp Could you also post the code that uses a numerical index with the NamedTuple? Seems really strange that that would be faster…

The implementation that uses getiterator is allocating a lot more memory (about an order of magnitude). I’m trying to find out why that is, but I’m running into this: https://github.com/JuliaLang/julia/issues/21838


#7

@davidanthoff I posted a revision in the same gist, but now there is no difference between named and indexed columns. I updated packages in the meantime, but not Julia, so something must have fixed it. One less thing to worry about, I guess.