Reading Fortran-style Complex Numbers From File

Part of my inherited research code is a highly parallelized Fortran90 program that produces, as its primary output, a large text file (~1 GB) of complex numbers in the following format:

 (1.014631909321741E-002,-1.825468883204626E-002)

As part of my migration to Julia, I’m trying to reimplement the analysis code that would operate on this file, and I’m having the darndest time getting Julia to parse the data in the file as complex numbers. I can get as far as the following:

wfn = open("wfn.dat")

str = strip(readline(wfn), [' ', '(', ')'])

This produces the following output:

"1.014631909321741E-002,-1.825468883204626E-002"

However, attempts to use something like val = parse(ComplexF64, str) to convert this to a complex number fails because parse expects an imaginary unit specifier like im or i, which is obviously not there. The ideal result is that val is converted to a complex number whose real and imaginary parts are given by the two numbers in succession. Is there a straightforward way to do this.

(Side note, I’d love a format=“fortran” argument for parse() or readdlm(), especially since this is Fortran’s default complex number format. If this exists somewhere, please let me know.)

This is something I would just ask an AI like chatgpt or claude. They are incredibly good at this.

you want to split first, then parse:

ulia> s = "1.014631909321741E-002,-1.825468883204626E-002"
"1.014631909321741E-002,-1.825468883204626E-002"

julia> re, im = split(s, ',')
2-element Vector{SubString{String}}:
 "1.014631909321741E-002"
 "-1.825468883204626E-002"

julia> val = complex(parse(Float64, re), parse(Float64, im))
0.01014631909321741 - 0.01825468883204626im

If you can get your fortran script to output the correct syntax directly, this also works:

julia> s2 = "0.01014631909321741 - 0.01825468883204626im"
"0.01014631909321741 - 0.01825468883204626im"

julia> parse(ComplexF64, s2)
0.01014631909321741 - 0.01825468883204626im

Personally, I would just preprocess the file with a sed or perl 1-liner to strip out the parentheses (you could also do this in Julia), and read it in Julia as CSV. This will be a lot faster if you have large files, because packages like CSV.jl are highly optimized. It’s also easy.

For example, this seems way easier than the code using split to manually parse every line, and is probably faster (and almost definitely faster if you modify it to use CSV.jl):

using DelimitedFiles
data = readdlm(IOBuffer(replace(read("complex.dat", String), r"[()]"=>"")), ',')
cplx_array = @views complex.(data[:,1], data[:,2])

Thank you! That solves it nicely. For the benefit of future readers, here’s a simple function (incorporating @langestefan’s code) that reads Fortran-style complex numbers, albeit only 1 per line:

function read_fortran_complex(io::IOStream)
    # can remove single space from character list if file
    # has no preceding space
    line = strip(readline(io), [' ', '(', ')'])
    re, im = split(line, ',')
    val = complex(parse(Float64, re), parse(Float64, im))
    return val
end

# example w/ data file complex.dat

len = 1000 # change to number of entries in file
cplx_array = complex(zeros(len)) # initialize as complex to avoid conversion error

file = open("complex.dat")

for i in 1:len
    cplx_array[i] = read_fortran_complex(file)
end

Caveat that the above code should be considered functional, not optimal. There’s probably fancier/faster ways to do this, but this is what works for me.

An option parsing directly as ComplexF64 in readdlm and which is Regexless:

using DelimitedFiles
readdlm(IOBuffer(replace(read(file, String), "("=>"", ")"=>"im", ","=>"+")), ComplexF64)

@ducksoverip, have you considered storing such large text files (~1 GB) as binary to save space and allow them to be read ~1000 times faster?

Would be nice to also get some benchmarks on this, considering how common this kind of task is.

Hmm—I tried implementing both yours and @stevengj 's suggestions as functions so I could compare them:

using DelimitedFiles, BenchmarkTools

# @langestefen's suggestion
function read_fortran_complex_split(io::IOStream, len)
    i = 0
    cplx_array = complex(zeros(len))
    for i in 1:len
        line = strip(readline(io), [' ', '(', ')'])
        real, imag = split(line, ',')
        cplx_array[i] = complex(parse(Float64, re), parse(Float64, imag))
    end
end

# @stevengj's suggestion
function read_fortran_complex_regex(io::IOStream)
    data = readdlm(IOBuffer(replace(read(io, String), r"[()]"=>"")), ',')
    cplx_array = @views complex.(data[:,1], data[:,2])
    return cplx_array
end

# @rafael.guerra's suggestion
function read_fortran_complex_noregex(io::IOStream)
    cplx_array = readdlm(IOBuffer(replace(read(io, String), "("=>"", ")"=>"im", ","=>"+")), ComplexF64)
    return cplx_array
end

# ~370 MB list of Fortran-format complex numbers
wfn = open("wfn.dat", "w+")

len = 7621363

@btime read_fortran_complex_split(wfn, len);

@btime read_fortran_complex_regex(wfn);

@btime read_fortran_complex_noregex(wfn);

Unfortunately, the last two functions fail with an ArgumentError: number of rows in dims must be > 0, got 0, for reasons beyond my ken, as Julia’s I/O handling remains an enigma. This happens whether I feed the file in as an already-open IOStream or as a hard-coded filename. That said, the function I wrote using @langestefan’s suggestion worked with an average time of 40 ms, which is plenty fast for me, though it requires you to know the length of the file in advance so you can allocate the array. Regarding saving the files as binary, I could, but a), I’m reluctant to change something that I know works and b), I neither read or produce so many large files so often as to need that level of speed and space-saving. This is all simulation data, so I can (and do) toss the large files after analysis.

Hi, in my suggestion this step is not needed, readdlm will output directly the complex array. Cheers.

I think this is causing your

you probably want

wfn = open("wfn.dat", "r")

And you shouldn’t be re-using wfn either

Assuming you mean storing the Float64 pairs in binary instead of text, I’ll also point out the downsides of losing text editor-readability and endian-neutrality (many multi-byte formats can order each value in 2 ways depending on the context, UTF-8 is an exception). Still, endianness not hard to deal with (picking ltoh or ntoh) in the rare case it’s not automatically handled, and parsing the 48 bytes of ASCII in the example is fundamentally more work than 8+8=16 bytes of direct floats.

Text editor-readability of GB size files is rarely useful, except possibly for the occasional debug session.

Your benchmark is broken, because by the time you call the latter two functions the file wfn has already been read to the end, so there is nothing left to read. Even for the first function, because @btime times it multiple times and returns the fastest result, the answer is meaningless.

Instead, try using seekstart to “rewind” the file each time you benchmark it.

open("wfn.dat", "r") do wfn
    @btime read_fortran_complex_split(seekstart($wfn), $len)
    @btime read_fortran_complex_regex(seekstart($wfn))
    @btime read_fortran_complex_noregex(seekstart($wfn))
end;

With this change, generating a data file with

len = 7621363
open("wfn.dat", "w") do f
    for i = 1:len
        println(f, '(', randn(), ',', randn(), ')')
    end
end

and fixing a typo in your read_fortran_complex_split (real instead of re), I get:

  1.357 s (53368742 allocations: 3.01 GiB)
  2.504 s (45772888 allocations: 2.53 GiB)
  2.444 s (22886462 allocations: 1.74 GiB)

Unfortunately, DelimitedFiles is rather slow; if you really care about performance reading CSV data, the CSV.jl package is faster.

(On the other hand, I would tend to optimize code first for simplicity over speed, and only worry about performance when it becomes a bottleneck.)

You could easily modify the code to avoid this by growing the array as you iterate, e.g.

function read_fortran_complex_split(io::IOStream)
    cplx_array = ComplexF64[]
    for line in eachline(io)
        real, imag = split(strip(line, (' ', '(', ')')), ',')
        push!(cplx_array, complex(parse(Float64, real), parse(Float64, imag)))
    end
    return cplx_array
end

Thank you, both for the explanations and for fixing my code. I’m still getting the hang of benchmarking, among many other things. My benchmarks ran from 10-13 seconds, but given that this is part of a port of a Fortran program that normally takes 10+ hours to run (it had like 12 nested DOloops), I’m pretty sure read times aren’t my bottleneck at the moment.

If the overall analysis or a hot spot is embarrassingly parallel and the original program didn’t multithread already, splitting a Julia port into a handful of still long Tasks over multiple OS threads over CPU cores is usually a relatively easy time-saver. Relatively because optimization is only easy if someone else implemented it; DIY multithreading in particular has to manage the multiplied rate of heap allocations and a risk of CPU-bound Tasks hogging threads.