Reading Fixed-Width Column Data

wibble · May 31, 2017, 5:39pm

Hi All,

I’m trying to read in a file with rows that look like

 13    MEXICO                                                130971
 14    INDONESIA                                             126582
 15    UNITED KINGDOM                                        114486

Note the possible presence of spaces in the 2nd column. I haven’t been able to find any tool that will easily do this for me (saw this, but the suggestion there doesn’t work here). Does one exist that I’m not finding, or do I have to do a line-by-line regex or somesuch?

Thanks!

ScottPJones · May 31, 2017, 9:04pm

Do you have a description of the format?
Just going by those three lines, something like the following might be able to split up the data in the line:

function readmyfile(filename)
    ind, countries, val = Vector{Int}(), Vector{String}(), Vector(Int)()
    open(filename) do io
        while true
            line = readline(io)
            line == "" && break
            push!(ind, parse(Int, strip(line[1:7])))
            push!(countries, strip(line[8:60]))
            push!(val, parse(Int, strip(line[61:end])))
       end
    end
    (ind, countries, val)
end

Then, if you read the file, it will return three vectors with the data:

julia> readmyfile("/j/foo.txt")
([13, 14, 15], String["MEXICO", "INDONESIA", "UNITED KINGDOM"], [130971, 126582, 114486])

Tamas_Papp · June 1, 2017, 3:15pm

You can just read line by line, extract the columns using indices, then parse. The tricky part is getting the column indices, if you are lucky there is some metadata that describes that (eg US Census Bureau usually supplies such metadata).

wibble · June 4, 2017, 1:28pm

I’ve got a mod of Scott’s code working. Thanks for the help!

ScottPJones · June 4, 2017, 2:00pm

If my post solves your issue, you can mark it as the solution, to help other people with the same issue find it.

Guido_Gay · September 6, 2017, 2:44pm

Hi all.

I’m have a fixed format file named “data.txt”:

10622431120301612015
10522431120301612015
10922431120301612015
10622431120301612015
10522431120301612015
10422431120301612015
10322431120301612015
10722431120301612015
10522431120301612015
10612431120301612015
10622431120301612015
10212431120301612015
10922431120301612015
10922431120301612015
10522431120301612015
10422431120301612015

Variables are described in the file “metadata.csv” (sorry, names are in italian. Posizione = position index; Lunghezza= length; Acronimovariabile = variable name; FormatoCampo = variable type)

“Posizione”;“Lunghezza”;“Acronimovariabile”;“FormatoCampo”
1;1;“sesso”;“String”
2;2;“classeEta”;“String”
4;1;“statoCivile”;“String”
5;3;“cittadinanza”;“String”
8;1;“continente”;“String”
9;2;“zona”;“String”
11;2;“regione”;“String”
13;3;“provincia”;“String”
16;1;“soggiornoLungoperiodo”;“String”
17;4;“anno”;“Int”

The following script uses ScottPJones solution and seems to work. It first reads the metadata, then creates a function “readData” that reads the data.

function createVectors(metadata)
for i in 1:size(metadata)[1]
nome = metadata[i,3]
tipo = metadata[i,4]
eval(parse(“$nome = Vector{$tipo}()”))
end
end

function createFunction(metadata)
out = “”“function readData(filename,metadata)
createVectors(metadata)
open(filename) do io
while true
line = readline(io)
line == “” && break
“””
for i in 1:size(metadata)[1]
inizio = parse(Int,metadata[i,1])
fine = parse(Int,metadata[i,1]) + parse(Int,metadata[i,2]) - 1
nome = metadata[i,3]
if metadata[i,4] == “String”
out = “$out \n push!($nome,strip(line[$inizio:$fine]))”
else
out = “$out \n push!($nome,parse(Int,strip(line[$inizio:$fine])))”
end
end
out = “$out \n end\n end\n end\n”
return(out)
end

metadata,variables= readdlm(“metadata.csv”,‘;’,String;header=true)

eval(parse(createFunction(metadata)))

readData(“data.txt”,metadata);

Reading David P. Sanders, “Creating domain-specific languages in Julia using macros” I’m convinced there are (much) better solutions, but I can’t find them.
Any suggestions?
Thank you
Guido

Tamas_Papp · September 11, 2017, 2:03pm

Please format your code with backticks (```julia … ```).

It is not clear why you are constructing a function like this. Simply parse the metadata into some structure that has concrete type information, then write parser functions for each field, it will be very fast with SubStrings, or see
https://github.com/quinnj/WeakRefStrings.jl