Extract relevant lines from large file using IDs from other file

Hi there,

I am new to Julia but interested to learn.

I would like to extract lines from a large file (32 GB; File B) based on IDs stored in another file (File A).

> FileA
ID01
ID02
ID03
...
IDXX
> FileB
ID05  information more_information  even_more_information  102
ID12  information more_information  even_more_information  115
ID89  information more_information  even_more_information  009
...

I wrote the following Julia script, but I was a little disappointed by its performance. I guess I am not yet familiar enough with the language to get it running fast enough:

# Read in query file

## Initiate empty vector
queries = Any[]

## Loop through file and populate vector with text
open("fileA") do query_file
        for ln in eachline(query_file)
                push!(queries, ln)
        end
end


# Loop through database file
open("FileB") do file
        ## Loop through lines
        for ln in eachline(file)
                # Extract ID with regex
                ID = match(r"^(\S)+", ln).match

                # search in queries
                if ID in queries
                        println(ln)
                end
        end
end

Any ideas to make this run much faster in Julia?

The above script was inspired by this Perl script:

open(IN, 'FileA');
while (<IN>) {
    chomp;
    $i_need{$_}++;
}

open(IN, 'FileB');
while (<IN>) {
    /^(\S+)\t/;
    if ($i_need{$1}) { print; }
}

Many thanks!

There are a few things that stand out immediately:

  1. You’re using non-constant global variables, which are particularly slow in Julia. Instead, put your actual code into a function and call it.
  2. Any[] is an abstractly-typed container, which will be slower than a container of a concrete type (like String).
  3. You’re building a vector and then repeatedly trying to search for items in it. That means that a vector is probably not the right data structure: a Set{String}() will make the ID in queries line much faster (it has O(1) lookup instead of O(N)).

Please read Performance Tips · The Julia Language, which covers the general Julia performance best practices (for example, “Avoid Global Variables” covers item 1 above).
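
For instance, putting points 1-3 together, a minimal sketch (the function name extract_matches is just a placeholder, and the file names are from your post) could look something like this:

function extract_matches(query_path, db_path)
        # A Set gives O(1) membership tests, and everything lives inside a
        # function, so there are no non-constant globals
        queries = Set{String}()
        for ln in eachline(query_path)
                push!(queries, ln)
        end

        for ln in eachline(db_path)
                ID = match(r"^\S+", ln).match
                if ID in queries
                        println(ln)
                end
        end
end

extract_matches("FileA", "FileB")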

3 Likes

While all of these are true, it’s also worth noting that if you’re putting these into a script and calling julia my_script.jl and comparing to perl my_script.pl, unless your files are absolutely massive, a major part of the difference you’re seeing is probably compile time.

This is one place Julia still lags (though it’s gotten much better). If you’re writing scripts with runtimes under 5 or 10 sec, Julia is going to be slower because it needs to compile everything up front each time.
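
If you want to see how much of the wall time is compilation, one option is to wrap the work in a function (main() is just a placeholder name here) and time it twice in the same session:

@time main()   # first call: includes compilation
@time main()   # second call: closer to the steady-state runtime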

1 Like

Thanks,

That’s useful feedback. Does this look more like it?

# Extract IDs from query file 
function get_ids(file)

        ## Initiate empty set
        queries = Set{String}()

        ## Loop through file and populate set with IDs
        for ln in eachline(file)
                push!(queries, ln)
        end

        return queries

end

# Check ids with db
function compare_ids(queries, db)

        # Loop through db_lines
        for ln in eachline(db)

                # Extract ID with regex
                ID = match(r"^(\S)+", ln).match

                # Search in queries
                if ID in queries
                        println(ln)
                end
        end
end

# Run the actual functions
open("FileA") do query_file

        queries = get_ids(query_file)

        open("FileB") do db_file

                compare_ids(queries, db_file)

        end

end

@kevbonham: The Perl script had a runtime of around 12 minutes, so a few seconds of compile time won’t make the difference, I guess.

1 Like

Even if the files are really large, if you are only doing logic on the first column, it is probably better to read both first columns into memory and then apply the matching function. Then you can do what you want with the result. If the number of unique IDs is not too large, you can take advantage of something like CategoricalArrays or similar. I’d guess this will be much faster, but I could be wrong.

Edit for reasoning: for every single line of file B, you are looping over every single ID from file A (the in check on a Vector is a linear scan).
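
For instance, a rough sketch of what I mean, using only Base Julia (it assumes tab-separated columns and that both first columns fit in memory):

ids_a = Set(eachline("FileA"))
ids_b = [first(split(ln, '\t')) for ln in eachline("FileB")]   # first column only
keep  = [id in ids_a for id in ids_b]                          # marks the matching rows of File B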

I have not tried out @tbeason’s suggestion, but the above script takes roughly 450 seconds with Julia, while the Perl script took about 713 seconds on the exact same files.

I am super happy with that result :wink:

1 Like

Great news! A couple of other thoughts:

Rather than making an empty set and then pushing each line to it, you can use a “comprehension” (not sure if Perl has a similar construction), which looks like this:

queries = Set(ln for ln in eachline(file))

Or in this case you could even just do:

queries = Set(eachline(file))

I don’t think there’s necessarily a performance benefit, but in my opinion it looks a little nicer. Also note: you don’t have to write Set{String} if you do it this way; the element type will be inferred automatically.

2 Likes

Have you tried reading blocks of, say, 10000 lines (as a DataFrame or whatever fits your data) instead of single lines, and then filtering there? Or using other packages such as CSV.jl?
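
For example, a minimal sketch of chunk-wise filtering with only Base Julia (10_000 is an arbitrary chunk size, and queries is the Set of IDs from File A):

open("FileB") do db
        for chunk in Iterators.partition(eachline(db), 10_000)
                # chunk is a vector of up to 10_000 lines
                for ln in chunk
                        first(split(ln, '\t')) in queries && println(ln)
                end
        end
end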

Nope. I was afraid that loading this huge table into memory and filtering would slow things down…

If you have 16 GB of RAM or more, I would suggest trying my approach. Loading the entire table is unnecessary: the first column is all that you need, and telling Julia that it can expect to reuse some entries (i.e. a CategoricalArray) will save on memory as well. If your IDs are all short like they are in your example, then I’d guess the other columns contribute much more to the overall file size, so the first column will not require too much memory anyway.
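
A small sketch of what I mean (it assumes tab-separated columns and that CategoricalArrays.jl is installed):

using CategoricalArrays

# Keep only File B's first column in memory; storing it as a CategoricalArray
# lets repeated IDs share storage.
first_col = categorical([first(split(ln, '\t')) for ln in eachline("FileB")])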

I’m not speaking about loading the whole table but splitting it into chunks.