Background
The genotype data can be stored like:
01212
where, 0 and 2 are homozygotes, 1 is the heterozyges. Frequently they will be read into a Float64 matrix for some dot operations and GEMM things. My question is that if there is a faster way to read them as a float matrix than the bottom one I used. It is still cost 5+ times more time than the C++ codes.
Some binary operations
Binary I/O in C++
// write double array to binary
vector<double> dat;
// output
cout.write(reinterpret_cast<char*>(&dat[0]), dat.size()*sizeof(double));
// input
double g[len];
cin.read(reinterpret_cast<char*>(g), len * sizeof(double));
The stdin/out can be of course changed to other streams. But they are very convenient to be redirected to or from other streams.
Binary I/O in Julia
# Output
# -- Suppose dat is an array of Float64
write(file, dat)
## Input
reinterpret(Float64, read(file))
Some test
Test codes for writing
# write.jl
@time open("x", "w") do io
for i in 1:1000
for j in 1:600000
print(io, rand(0:2))
end
print(io, '\n')
end
end
//write.cpp
#include <iostream>
#include <random>
using namespace std;
int main(int argc, char *argv[])
{
ios_base::sync_with_stdio(false);
mt19937 rng;
uniform_int_distribution<int> uni(0,2);
for(auto i{0}; i<1000; ++i){
for(auto j{0}; j<600000; ++j) cout<<uni(rng);
cout<<'\n';
}
return 0;
}
Test results
g++ -O2 -Wall -std=c++17 write.cpp
time ./a.out >y
# real 0m22.186s
# user 0m19.984s
# sys 0m0.304s
time Julia write.jl
# 52.034295 seconds (1.20 G allocations: 53.649 GiB, 0.91% gc time)
#real 0m52.155s
#user 0m50.501s
#sys 0m1.537s
Test codes for reading
Note, I will convert those 012
to double for later GEMM.
# read.jl
gt = Float64[]
@time for line in eachline("x")
for c in line
push!(gt, Float64('c'-'0'))
end
end
#include <iostream>
#include <vector>
using namespace std;
int main(int argc, char *argv[])
{
ios_base::sync_with_stdio(false);
vector<double> gt;
for(string line; getline(cin, line);)
for(const auto&c:line) gt.push_back(c-'0');
return 0;
}
time Julia read.jl
# 17.118777 seconds (12.37 k allocations: 5.692 GiB, 1.53% gc time)
#real 0m17.346s
#user 0m10.429s
#sys 0m7.269s
gpp -O2 -Wall -std=c++17 read.cpp
time ./a.out <y
#real 0m3.263s
#user 0m0.894s
#sys 0m2.330s
Above can be much better if not to use vector
, but to deal the file line by line. The latter can increase the speed of the c++ codes by 8~9 times, meaning it is tens of times faster than the Julia codes.
Use C++ to convert to double stream then read by Julia
// The converter
#include <iostream>
using namespace std;
int main(int argc, char *argv[])
{
ios_base::sync_with_stdio(false);
for(string line; getline(cin, line);){
size_t len(line.length());
double t[len];
for(size_t i{0}; i<len; ++i) t[i] = line[i]-'0';
cout.write(reinterpret_cast<char*>(t), sizeof(double)*len);
}
return 0;
}
@time t = reinterpret(Float64, read(pipeline("y", `./a.out`)))
# 11.728569 seconds (1.05 M allocations: 9.061 GiB, 2.63% gc time)
The time reduced from 17s to 12s, meaning it is worth of the effort.