Reading a non-uniform CSV file

I am using CSV.jl to read a file of this format:

# 60 lines of completely random unformatted crap here
    1    1  Te  5S     1    0  0.5  -0.5
       1       1        0.0000000000
       1       2        0.0024000000
       1       3        0.0002413400
       2       1        0.0241243140
       2       2        0.1234214000
       2       2        0.0979007240
    2    1  Te  5S     1    0  0.5  0.5
       1       1        0.0000000000
       1       2        0.0024000000
       1       3        0.0002413400
       2       1        0.0241243140
       2       2        0.1234214000
       2       2        0.0979007240
    3    1  Bi  5S     1    0  0.5  -0.5
       1       1        0.0000000000
       1       2        0.0024000000
       1       3        0.0002413400
       2       1        0.0241243140
       2       2        0.1234214000
       2       2        0.0979007240

The real file is much larger of course, 1-2 billion lines.

I am interested in both the "header" line of each section (e.g. 1 1 Te 5S 1 0 0.5 -0.5) and the third column of each data section.

Currently I am reading the file like this:

function parsefile(filename)   # renamed: `parse` shadows Base.parse
    skip = 61      # the first header is on line 61, after the preamble
    ndata = 6      # data rows per section (kpts * bands)
    nsections = 3
    for n in 0:nsections-1
        secstart = skip + (ndata + 1) * n   # each section is 1 header + ndata data rows
        header = CSV.File(filename, header=false, datarow=secstart, limit=1,
                          ignorerepeated=true, delim=" ")
        # Process header
        data = CSV.File(filename, header=false, datarow=secstart + 1, limit=ndata,
                        threaded=false, ignorerepeated=true, delim=" ",
                        select=[3]).Column3
        # Process and reshape data
    end
end

Of course, this is very inefficient. Is there a better way of going about this? Using Pandas in Python I was able to read the whole file directly, pull out the headers, and reshape the remaining data into a 3-D array quite easily, but I haven't had much success doing the same in Julia.

I would consider just using primitives, e.g. read the lines with eachline, split on whitespace, and then parse the fields.
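A minimal sketch of that approach, assuming sections look like the sample (an 8-field header line followed by 3-field data lines); the function name `read_sections` and the `nskip` keyword are made up for illustration:

```julia
# Sketch: stream the file line by line with eachline, splitting on whitespace.
# Assumes header lines have 8 fields and data lines have 3, as in the sample.
function read_sections(io::IO; nskip=60)
    headers = Vector{Vector{SubString{String}}}()
    data = Float64[]                  # third column of every data row
    for (i, line) in enumerate(eachline(io))
        i <= nskip && continue        # skip the unformatted preamble
        fields = split(line)          # split splits on runs of whitespace by default
        if length(fields) == 8        # header line starting a new section
            push!(headers, fields)
        elseif length(fields) == 3    # data line: keep column 3
            push!(data, parse(Float64, fields[3]))
        end
    end
    return headers, data
end
```

The flat `data` vector can then be reshaped into a 3-D array, e.g. `reshape(data, kpts, bands, nsections)`, since the rows arrive in section order.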
