Hi, I am new to Julia, coming from R, and I am having a little trouble learning how to speed up my code.
As a simple demo example, I have a DataFrame with 3 columns x 8,679,568 rows; each row is a record of a protein pair. Row 1 and row 3 are the same pair (just with protein1 and protein2 swapped), so row 3 should be removed. I want to find the duplicates of the first 200 rows within the whole DataFrame and remove them.
P.S. The picture only shows the first 3 rows.
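Since the picture is not included here, a tiny mock-up of the structure (the protein names and the third column are made up by me; the real table has 8,679,568 rows):

using DataFrames

# mock-up: row 3 is the same pair as row 1, just with protein1 and protein2 swapped
df_demo = DataFrame(protein1 = ["p1", "p2", "p3"],
                    protein2 = ["p3", "p4", "p1"],
                    score    = [0.9, 0.5, 0.9])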
Here is the code I need to speed up (I know it is a naive solution):
@time begin
    pos = []
    for i in 1:200  # find duplicate rows only for the first 200 rows
        for j in i:8679568
            # if columns 1 and 2 of row i are the same as columns 2 and 1 of row j,
            # then row j is added to the vector "pos"
            if df.protein1[i] == df.protein2[j] && df.protein1[j] == df.protein2[i]
                push!(pos, j)
                break
            end
        end
    end
end
This code takes 21.660150 seconds (339.22 M allocations: 10.111 GiB, 8.08% gc time).
I have no idea how to speed up my code.
[Update] If I look for duplicates only for the first 100 rows, it takes just 4.774666 seconds (79.00 M allocations: 2.361 GiB, 7.43% gc time, 2.91% compilation time).
[Update 2] If it is not possible to speed up this simple code, what is the reason it takes 4 seconds to find duplicates for the first 100 rows but 21 seconds for the first 200? (From what I understand it should be about 8 seconds.)
[Update 3]
For testing, here is the link to the whole data: protein. It is about 70 MB in txt.gz format, and it contains some missing values in column 3, so I run dropmissing!(df) after loading it into Julia.
I checked the Julia performance tips, and now I know that I should put the code into functions instead of running it in global scope. Compared to the original code above, the code wrapped in a function takes 17.615409 seconds (339.22 M allocations: 10.111 GiB, 8.60% gc time); yes, it runs a little faster, but there is no improvement in memory allocations. Then, following the tips from @bkamins and @pdeffebach, I modified the code for type stability, and the memory allocation problem was solved: it takes 9.137771 seconds (1 allocation: 1.766 KiB) to search for duplicates of the first 200 rows, 35.960302 seconds for the first 400, and 214.447423 seconds (2 allocations: 7.953 KiB) for the first 1000.
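For reference, a minimal sketch of what the function-wrapped, type-stable version looks like (the function name find_dups is mine; the key changes are pos = Int[] instead of the untyped pos = [], and passing the columns as arguments so the loop only sees concrete vectors):

function find_dups(protein1::AbstractVector, protein2::AbstractVector, nrows::Integer)
    pos = Int[]               # concretely typed; [] would be a Vector{Any}
    n = length(protein1)
    for i in 1:nrows
        for j in i:n
            if protein1[i] == protein2[j] && protein1[j] == protein2[i]
                push!(pos, j)
                break
            end
        end
    end
    return pos
end

@time pos = find_dups(df.protein1, df.protein2, 200)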
P.S. I actually solved the problem with the code below, but any suggestion for reducing the memory allocations in the for-loop is still helpful. Thanks.
@time begin
    dropmissing!(df)
    df2 = Array(df)
    df2 = string.(df2)           # turn all items into strings for sorting
    df2 = sort(df2, dims = 2)    # within each row, sort the columns
    df2 = unique(df2, dims = 1)  # keep only the unique rows
end
It takes 9.367578 seconds (52.49 M allocations: 3.247 GiB, 27.73% gc time) to de-duplicate the whole dataset, not just the first 200 rows.
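For reference, a sketch of a single-pass variant that avoids the intermediate string matrix (the function name unique_pairs is mine, and it assumes the two protein columns hold string-like values); any improvement on this direction is also welcome:

using DataFrames

# keep the first occurrence of each unordered pair using a Set;
# minmax orders the two names so (a, b) and (b, a) map to the same key
function unique_pairs(df)
    seen = Set{NTuple{2, String}}()
    keep = falses(nrow(df))
    for (i, (a, b)) in enumerate(zip(df.protein1, df.protein2))
        key = minmax(String(a), String(b))
        if !(key in seen)
            push!(seen, key)
            keep[i] = true
        end
    end
    return df[keep, :]
end

@time df_unique = unique_pairs(df)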