Julia best practices when it comes to file handling and string/symbol usage. Please provide a code review of the following:

#1

Is this the right place to ask for a quick code review from anyone with expertise to offer?

This is a simplified version of my code.

I am reading a few files and their contents are streamed to multiple other files (filename dependent on the results of their contents passed through a function).

Here is some executable example code which I hope explains what I am doing. I added some print statements so that you can see the order that operations happen in.

# Ensure an empty directory for the execution of this question's code
tmp_dir = "/tmp/discourse-question-1234"
rm(tmp_dir, force=true, recursive=true)
mkdir(tmp_dir)

# Write example ".fakeq" files.
# In my real life problem, they would be ".fastq" (see https://en.wikipedia.org/wiki/FASTQ_format)
# and sample would not be known at this stage, simplifying to keep things relevant to question
open("$(tmp_dir)/pool1.fakeq", "w") do f
    write(f, "id1_sample1_ACGTA\n")
    write(f, "id2_sample3_CGTACG\n")
    write(f, "id3_sample2_GTACTAC\n")
    write(f, "id4_sample1_TACGGTAC\n")
    write(f, "id5_sample2_ACGTGTACG\n")
    write(f, "id6_sample3_CGTATACGTA\n")
    write(f, "id7_sample2_GTACCGTAC\n")
    write(f, "id8_sample1_TACGGTAC\n")
    write(f, "id9_sample1_ACGTGTA\n")
end

open("$(tmp_dir)/pool2.fakeq", "w") do f
    write(f, "id10_sample2_ACGTAACGTA\n")
    write(f, "id11_sample1_CGTACGCGTACG\n")
    write(f, "id12_sample3_GTACTACGTACTAC\n")
    write(f, "id13_sample2_TACGGTACTACGGTAC\n")
    write(f, "id14_sample1_ACGTGTACGACGTGTACG\n")
    write(f, "id15_sample3_CGTATACGTACGTATACGTA\n")
    write(f, "id16_sample2_GTACCGTACGTACCGTAC\n")
    write(f, "id17_sample1_TACGGTACTACGGTAC\n")
    write(f, "id18_sample1_ACGTGTAACGTGTA\n")
end

# This array can be in the order of 10 - 20 elements long
csv_header = [
    :identifier,
    :sample_name,
    :sequence,
    :sequence_length
]

# This array can be in the order of 25 - 50 elements long.
# In real-life problem, we know this list of samples up front
# and sample_name is calculated by matching an array of nucleotide
# 'barcode' sequences up against each sequence in the .fastq files
sample_names = [
    :sample1,
    :sample2,
    :sample3
]

# This array can be in the order of 4 - 12 elements long
# In real-life problem, we know this list of pools up front and each
# pool corresponds to a .fastq file mentioned above
pool_list = [
    :pool1,
    :pool2
]

sample_csv_mapping = Dict()  # Why can't this be defined in the try block? 
try
    #sample_csv_mapping = Dict()
    for sample_name in sample_names
        csv_stream = open("$(tmp_dir)/$(sample_name).csv", "w")
        println("OPENED: $(csv_stream.name)")
        sample_csv_mapping[sample_name] = csv_stream
        write(csv_stream, join(csv_header, ","), "\n")
    end
    for pool in pool_list
        open("$(tmp_dir)/$(pool).fakeq", "r") do f
            lines = readlines(f)
            for line in lines
                identifier, sample_name, sequence = split(line, "_")
                sequence_length = length(sequence)
                # Why is this not working as expected
                #write(sample_csv_mapping[Symbol(sample_name)], join(eval.(csv_header), ","), "\n")
                csv_row = [
                    identifier,
                    sample_name,
                    sequence,
                    sequence_length
                ]
                write(sample_csv_mapping[Symbol(sample_name)], join(csv_row, ","), "\n")
                println("WROTE_TO: $(sample_csv_mapping[Symbol(sample_name)].name)")
            end
        end
    end
finally
    for (sample, csv_stream) in sample_csv_mapping
        close(csv_stream)
        println("CLOSED: $(csv_stream.name)")
    end
end

I especially want to know why the commented-out line write(sample_csv_mapping[Symbol(sample_name)], join(eval.(csv_header), ","), "\n") does not work.

Also, when sample_csv_mapping is defined inside the try block, the finally block is unable to access it (see the line which is commented out).

Also, is there a more correct way in which to open a few files, write to them in a random and repeated order, then close all of them (even in case of an error)?


#2
help?> eval
search: eval evalfile @eval @evalpoly bytesavailable readavailable TypeVar UndefVarError prevfloat StridedVecOrMat

  eval(expr)

  Evaluate an expression in the global scope of the containing module. Every Module (except those defined with baremodule)
  has its own 1-argument definition of eval, which evaluates expressions in that module.

We see that eval evaluates in the global scope. However, identifier, etc are loop-local.
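A minimal demonstration of that scoping rule (my own example, not from the post):

```julia
# eval always evaluates in the module's global scope,
# so it can see globals but never function/loop locals
x = 1                        # a global
f() = (y = 2; eval(:y))      # y is local to f; eval cannot see it

@assert eval(:x) == 1        # global: fine
err = try
    f()                      # local: throws UndefVarError(:y)
    nothing
catch e
    e
end
@assert err isa UndefVarError
```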

The short fix is probably

for line in lines
    global identifier
    global sample_name
    global sequence
    identifier, sample_name, sequence = split(line, "_")
[...]

Otherwise, I would advise using CSV.jl and avoiding eval: it is really only intended for top-level code, such as loops that define functions.
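For reference, here is a minimal sketch of what the CSV.jl route could look like (my own sketch, not from this thread; it assumes the CSV and DataFrames packages are installed, and the file and column names are just illustrative):

```julia
using CSV, DataFrames

# One row per sequence record; column names mirror the question's csv_header
row = (identifier = "id1", sample_name = "sample1",
       sequence = "ACGTA", sequence_length = 5)

file = "sample1.csv"
# Append if the file already exists; otherwise CSV.write creates it
# with a header row derived from the column names
CSV.write(file, DataFrame([row]); append = isfile(file))
```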

Heuristic: Functionality goes into functions. Top-level code defines types, functions, macros, possibly loads external libraries (possibly in loops with eval if you e.g. need to define a bazillion of wrapper functions for some foreign library).

Then, top-level code calls a handful of functions, possibly in loops. But not more.
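Applied to the code in the question, that heuristic might look something like this (a sketch with names of my own choosing, not a drop-in replacement):

```julia
# Parse one "<id>_<sample>_<sequence>" record into its fields
function parse_record(line::AbstractString)
    identifier, sample_name, sequence = split(line, "_")
    return (identifier, sample_name, sequence, length(sequence))
end

# Stream one pool file, writing each record to its sample's open stream
function process_pool!(mapping::AbstractDict, path::AbstractString)
    for line in eachline(path)
        identifier, sample_name, sequence, len = parse_record(line)
        write(mapping[Symbol(sample_name)],
              join((identifier, sample_name, sequence, len), ","), "\n")
    end
end
```

Top-level code then just builds the sample-to-stream mapping and calls process_pool! once per pool.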


#3

In addition, you may want to look at BioSequences.jl, which has readers and writers for fastq files.


#4

I am already using the BioSequences.jl library in my real solution, and it is working well for me. I will also look into the CSV module. What does it do over and above what I am currently doing with the lower-level syntax? I felt that I didn’t need all the extra features and that they would bring extra overhead.

The files I am working with are 4 pools, each fastq file around 2 GB, split into about 30 tab-delimited files. So each resulting file is an average of around 250 MB and a maximum of around 500 MB.

I want to ensure I am light on RAM usage while leveraging all the CPU I can to speed up the process.

I don’t need to use eval at all, but I was hoping to avoid duplication (see csv_header and csv_row). Not the end of the world if I don’t.

Using eval there was a recent addition, after someone told me to use Symbols instead of strings. I don’t understand metaprogramming yet, so it confuses me. I am not really sure how Symbols help me in this use case, other than being slightly faster than strings because they are interned, yada yada.

I should really break things up into functions. I do this more naturally in Python. In Julia I am still figuring it out, and I tend to bang everything into one script while I am trying out a million different ways to do the thing.

I also want to try opening all the fastq files in parallel if possible, but that requires further exploration and is not worth it if it makes the code more difficult to understand.

Thanks for your tips. Will try to implement them this week.


#5

The point of the finally block is that it runs if something goes wrong in the try block. If something goes wrong, the variables may even be undefined:

try
    x = error("error!")
finally
    print("hello")
end

This is why Julia does not let you access them. The technical way to think about it is that everything between try and catch/finally has its own scope.
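The usual workaround (a quick sketch of mine, not from the original post) is to bind the variable before the try, so the name lives in the enclosing scope and finally can reach it:

```julia
# Create the dictionary *outside* try; only mutate it inside
streams = Dict{Symbol,IOStream}()
try
    streams[:sample1] = open(tempname(), "w")
    # ... write to the streams ...
finally
    # Runs even on error, and `streams` is always defined here
    foreach(close, values(streams))
end
```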


#6

It’s not all that much overhead, though it maybe doesn’t add a whole lot either - I suspect it will be faster/cleaner, though. Might be wrong.

For

write(sample_csv_mapping[Symbol(sample_name)], join(eval.(csv_header), ","), "\n")

I’m guessing you just want to switch that eval() to String(). So

write(sample_csv_mapping[Symbol(sample_name)], join(String.(csv_header), ","), "\n")

Also, if you’re just doing this manually (rather than making an intermediate DataFrame say), you don’t need the headers to be symbols, they can just be strings.

Rather than doing this, I’d use the open(file,"a") do f syntax. The “a” is for append, and this way you don’t need your finally block to close stuff.

This can be reduced to

for line in eachline("$(tmp_dir)/$(pool).fakeq")

or better

for line in eachline(joinpath(tmp_dir, "$pool.fakeq"))

These versions don’t require loading the whole file into memory (when you do readlines() it makes an array of all the lines, which you don’t need if you’re looping through).

Here’s a simplified version that I think does what you want:

# Setup functions
function randseq()
    len = rand(10:100)
    return join(rand("ATGC", len))
end

randsample() = rand(["sample1", "sample2", "sample3"])

# Generate fake files
tmp = mktempdir()
for p in ["pool1", "pool2"]
    open(joinpath(tmp, "$p.fakeq"), "w") do io
        for _ in 1:100
            write(io, "$(p)_$(randsample())_$(randseq())\n")
        end
    end
end

# Header for this version's rows (define once, matching the fields below)
csv_header = ["pool", "sample_name", "sequence", "sequence_length"]

# Line process function
function process_line(line, dir)
    (pool, sample, seq) = split(line, "_")
    len = length(seq)
    outfile = joinpath(dir, "$sample.csv")
    if !isfile(outfile)
        write(outfile, join(csv_header, ","), "\n")
    end
    open(outfile, "a") do io
        write(io, join([pool, sample, seq, len], ","), "\n")
    end
end


# Do the thing
for p in ["pool1", "pool2"]
    for line in eachline(joinpath(tmp, "$p.fakeq"))
        @info line
        process_line(line, tmp)
    end
end

#7

Also, FWIW, it might be worth changing the title of this post. It’s not really about Symbols, though I understand you got here trying to deal with them.


#8

@kevbonham

I am so grateful for your reply. 100% agree on your final comment too. Your solution has given me a new way of looking at the problem. As a user fresh to Julia, I really appreciate that.

Your code sample solves the problem for me. However, I now notice that each time it writes to a file, it opens the file, writes to it, and closes it. This is of course much simpler to code, but I was under the impression that the additional opening and closing of files would bring in a lot more overhead (hence my use of a hashtable to save that otherwise unnecessary opening and closing).

What are your thoughts on that?


#9

Thinking of an adequate subject for this post now. It is mostly about Julia best practices, I guess. Going to make the change now. Please comment if you think it is suitable or not.


#10

In principle, definitely true. In practice? If you’re doing this a few million times it might add a couple of seconds to your script at most. If you’re running this script 100 times a day, it’s worth optimizing maybe, but if you run it once for each sequencing batch you get back…

You can always benchmark it I suppose, but my general philosophy is to go with simple/clear/easy to maintain unless the time savings are really worth it to have more complicated / harder to maintain code. Of course “worth it” is in the eye of the beholder :laughing:
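For a first rough comparison, plain @elapsed from Base is enough (my own sketch; BenchmarkTools.jl would give more reliable numbers):

```julia
path = tempname()
n = 10_000

# Strategy 1: keep one handle open for all writes
t_keep = @elapsed open(path, "w") do io
    for i in 1:n
        write(io, "line $i\n")
    end
end

# Strategy 2: reopen the file in append mode for every write
t_reopen = @elapsed for i in 1:n
    open(path, "a") do io
        write(io, "line $i\n")
    end
end

println("keep-open: $(t_keep)s  reopen: $(t_reopen)s")
```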


#11

Yeah, you are very right here. It is my philosophy too. I guess because I am approaching Julia for the first time, I am approaching it more academically than I usually would. I definitely have to benchmark it. I still need to learn exactly how to do that, but I am sure a rough time estimate will do for this comparison.

FWIW, the batch is usually between 1 and 2 million sequences, and this part of the pipeline would usually only be run once, overnight. So I guess efficiency is not that critical after all.
