I would like to extract lines from a large file (32 GB; FileB) based on IDs stored in another file (FileA).
> FileA
ID01
ID02
ID03
...
IDXX
> FileB
ID05 information more_information even_more_information 102
ID12 information more_information even_more_information 115
ID89 information more_information even_more_information 009
...
I wrote the following Julia script, but was a little disappointed by its performance. I guess that I am not familiar enough with the language to get this running fast enough:
# Read in query file
## Initiate empty vector
queries = Any[]
## Loop through file and populate vector with text
open("fileA") do query_file
for ln in eachline(query_file)
push!(queries, ln)
end
end
# Loop through database file
open("FileB") do file
## Loop through lines
for ln in eachline(file)
# Extract ID with regex
ID = match(r"^(\S)+", ln).match
# search in queries
if ID in queries
println(ln)
end
end
end
Any ideas to make this run much faster in Julia?
The above script was inspired by this Perl script:
open(IN, 'FileA');
while (<IN>) {
    chomp;
    $i_need{$_}++;
}

open(IN, 'FileB');
while (<IN>) {
    /^(\S+)\t/;
    if ($i_need{$1}) { print; }
}
There are a few things that stand out immediately:
You’re using non-constant global variables, which are particularly slow in Julia. Instead, put your actual code into a function and call it.
Any[] is an abstractly-typed container, which will be slower than a container of a concrete type (like String).
You’re building a vector and then repeatedly trying to search for items in it. That means that a vector is probably not the right data structure: a Set{String}() will make the ID in queries line much faster (it has O(1) lookup instead of O(N)).
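To make that last point concrete, here is a toy comparison (hypothetical sizes; a serious benchmark would use BenchmarkTools.jl's @btime):

ids = [string("ID", i) for i in 1:1_000_000]
id_vector = ids        # Vector{String}: membership test scans the whole vector, O(N)
id_set = Set(ids)      # Set{String}: membership test hashes the key, O(1) on average
@time "ID999999" in id_vector
@time "ID999999" in id_set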
Please read the Performance Tips page of the Julia manual, which covers the general Julia performance best practices (for example, “Avoid Global Variables” is item 1).
While all of these are true, it’s also worth noting that if you’re putting these into a script and calling julia my_script.jl and comparing to perl my_script.pl, unless your files are absolutely massive, a major part of the difference you’re seeing is probably compile time.
This is one place Julia still lags (though it’s gotten much better). If you’re writing scripts with runtimes under 5 or 10 sec, Julia is going to be slower because it needs to compile everything up front each time.
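One way to see this for yourself (a generic illustration, not your exact script): time the same call twice in a single Julia session. The first call includes compilation; the second shows the steady-state runtime.

count_lines(file) = count(_ -> true, eachline(file))

@time count_lines("FileA")   # first call: includes JIT compilation
@time count_lines("FileA")   # second call: runtime only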
That’s useful feedback. Does this look more like it?
# Extract IDs from query file
function get_ids(file)
    ## Initiate empty set
    queries = Set{String}()
    ## Loop through file and populate the set with IDs
    for ln in eachline(file)
        push!(queries, ln)
    end
    return queries
end
# Check ids with db
function compare_ids(queries, db)
    # Loop through db lines
    for ln in eachline(db)
        # Extract ID with regex
        ID = match(r"^(\S)+", ln).match
        # Search in queries
        if ID in queries
            println(ln)
        end
    end
end
# Run the actual functions
open("FileA") do query_file
queries = get_ids(query_file)
open("FileB") do db_file
compare_ids(queries, db_file)
end
end
@kevbonham: The Perl script had a runtime of around 12 minutes, so a few seconds of compile time will not make the difference, I guess.
Even if the files are really large, if you are only doing logic on the first column it is probably better to read just the first column of each file into memory and then apply the matching function. Then you can do what you want with the result. If the number of unique IDs is not too large, you can take advantage of something like CategoricalArrays or similar. I’d guess this will be much faster, but I could be wrong.
Edit for reasoning: for every single line of file B, you are searching through every single ID from file A.
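A minimal sketch of what I mean (untested; assuming whitespace-separated columns and the queries set built earlier in the thread):

# Read only the first field of each FileB line into memory
first_col = open("FileB") do db_file
    [first(split(ln; limit=2)) for ln in eachline(db_file)]
end
# Boolean mask of the rows whose ID also appears in FileA
hits = [id in queries for id in first_col]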
I have not tried out @tbeason’s suggestion yet, but the above script takes roughly 450 seconds with Julia, while the Perl script took about 713 seconds on the exact same files.
Rather than making an empty set and then pushing each line to it, you can use a “comprehension” (not sure if Perl has similar constructions), which looks like:
queries = Set(ln for ln in eachline(file))
Or in this case you could even just do:
queries = Set(eachline(file))
I don’t think there’s necessarily a performance benefit, but in my opinion it looks a little nicer. Also note that you don’t have to write Set{String} if you do it this way; the element type will be inferred automatically.
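For example, since eachline yields Strings, the inferred set is already concrete:

queries = Set(eachline("FileA"))
typeof(queries)   # Set{String}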
Have you tried reading blocks of, say, 10,000 lines (as a DataFrame or whatever fits your data) instead of single lines, and then filtering there?
Or using other packages such as CSV.jl?
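An untested sketch of the block idea in plain Base Julia (10_000 lines per block is an arbitrary choice; queries is the ID set from above):

open("FileB") do db_file
    for block in Iterators.partition(eachline(db_file), 10_000)
        for ln in block
            id = first(split(ln; limit=2))
            if id in queries
                println(ln)
            end
        end
    end
end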
If you have 16 GB of RAM or more, I would suggest trying my approach. Loading the entire table is unnecessary: the first column is all that you need, and telling Julia that it can expect to reuse some entries (i.e. a CategoricalArray) will save memory as well. If your IDs are all short, like in your example, then I’d guess the other columns contribute much more to the overall file size, so the first column will not require too much memory anyway.
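Roughly, the pooling step I have in mind might look like this (untested sketch; unwrap is from CategoricalArrays.jl and recovers the plain String for the Set lookup):

using CategoricalArrays

# Pool the repeated IDs: the column stores small integer codes internally
first_col = categorical([first(split(ln; limit=2)) for ln in eachline("FileB")])
# Membership test against the queries set built from FileA
hits = [unwrap(x) in queries for x in first_col]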