Reading a non-uniform CSV file

I am using CSV.jl to read a file of this format:

# 60 lines of completely random unformatted crap here
    1    1  Te  5S     1    0  0.5  -0.5
       1       1        0.0000000000
       1       2        0.0024000000
       1       3        0.0002413400
       2       1        0.0241243140
       2       2        0.1234214000
       2       2        0.0979007240
    2    1  Te  5S     1    0  0.5  0.5
       1       1        0.0000000000
       1       2        0.0024000000
       1       3        0.0002413400
       2       1        0.0241243140
       2       2        0.1234214000
       2       2        0.0979007240
    3    1  Bi  5S     1    0  0.5  -0.5
       1       1        0.0000000000
       1       2        0.0024000000
       1       3        0.0002413400
       2       1        0.0241243140
       2       2        0.1234214000
       2       2        0.0979007240

The real file is much larger of course, 1-2 billion lines.

I am interested in both the "header" line of each section (e.g. 1 1 Te 5S 1 0 0.5 -0.5) and the third column of each data section.

Currently I am reading the file like this:

function parsefile(filename)   # renamed: `parse` shadows Base.parse
    skip = 61      # the first header is on line 61, after the preamble
    ndata = 6      # data rows per section (kpts * bands)
    nsections = 3
    for n in 0:nsections-1
        secstart = skip + (ndata + 1) * n   # each section is 1 header + ndata data rows
        header = CSV.File(filename, header=false, datarow=secstart, limit=1,
                          ignorerepeated=true, delim=" ")
        # Process header
        data = CSV.File(filename, header=false, datarow=secstart + 1, limit=ndata,
                        threaded=false, ignorerepeated=true, delim=" ",
                        select=[3]).Column3
        # Process and reshape data
    end
end

Of course, this is very inefficient. Is there a better way of going about this? Using Pandas in Python I was able to read the whole file directly, pull out the headers, and reshape the remaining data into a 3-D array quite easily, but I haven't had much success doing the same in Julia.

I would consider just using primitives, e.g. read the lines with eachline, split on whitespace, and then parse the fields.
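A minimal sketch of that approach, assuming sections look like the sample (an 8-field header line followed by 3-field data lines); the function name `read_sections` and the `nskip` keyword are made up for illustration:

```julia
# Sketch: stream the file line by line with eachline, splitting on whitespace.
# Assumes header lines have 8 fields and data lines have 3, as in the sample.
function read_sections(io::IO; nskip=60)
    headers = Vector{Vector{SubString{String}}}()
    data = Float64[]                  # third column of every data row
    for (i, line) in enumerate(eachline(io))
        i <= nskip && continue        # skip the unformatted preamble
        fields = split(line)          # split splits on runs of whitespace by default
        if length(fields) == 8        # header line starting a new section
            push!(headers, fields)
        elseif length(fields) == 3    # data line: keep column 3
            push!(data, parse(Float64, fields[3]))
        end
    end
    return headers, data
end
```

The flat `data` vector can then be reshaped into a 3-D array, e.g. `reshape(data, kpts, bands, nsections)`, since the rows arrive in section order.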
