How to speed up a for-loop with DataFrame access

Hi, I am new to Julia (coming from R), and I am having a little trouble learning how to speed up my code.

As a simple demo example, I have a DataFrame with 3 columns × 8,679,568 rows, where each row is a record of a protein pair. Row 1 and row 3 describe the same pair, so row 3 should be removed. I want to find, for each of the first 200 rows, its duplicate anywhere in the whole DataFrame.

P.S. The attached picture (omitted here) only shows the first 3 rows.
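As a toy stand-in for the picture (the real table has 8,679,568 rows; the name of the third column here is just a placeholder):

using DataFrames

# rows 1 and 3 describe the same pair with the protein columns swapped,
# so row 3 is the duplicate that should be removed
df = DataFrame(protein1 = ["A", "B", "B"],
               protein2 = ["B", "C", "A"],
               score    = [990, 150, 990])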

Here is the code I need to speed up (I know it is a naive solution):

@time begin
    pos = []
    for i in 1:200  # I will find the duplicate rows only for the first 200 rows.
        for j in i:8679568
            # if columns 1 and 2 in row i are the same as columns 2 and 1 in row j,
            # then row j is added to the vector "pos"
            if df.protein1[i] == df.protein2[j] && df.protein1[j] == df.protein2[i] 
                push!(pos,j)
                break
            end
        end
    end
end

The code takes 21.660150 seconds (339.22 M allocations: 10.111 GiB, 8.08% gc time).

I have no idea how to speed up my code.

[update] If I search for duplicates of just the first 100 rows, it takes only 4.774666 seconds (79.00 M allocations: 2.361 GiB, 7.43% gc time, 2.91% compilation time).

[update 2] If it is not possible to speed up this simple code, why does it take about 4 seconds to find duplicates for the first 100 rows but 21 seconds for the first 200 rows? (From what I understand it should be around 8 seconds.)

[update 3]
For testing, here is the link to the whole data: protein. It is about 70 MB in txt.gz format, and it contains some missing values in column 3, so I run dropmissing!(df) after loading it in Julia.

I checked the Julia performance tips, and now I know that I should put the code into functions instead of running it in global scope. Compared to the original code above, the version wrapped in a function takes 17.615409 seconds (339.22 M allocations: 10.111 GiB, 8.60% gc time); it runs a little faster, but there is no improvement in memory allocations. Then, following the tips from @bkamins and @pdeffebach, I modified the code for type stability, and the memory allocation problem was solved: it takes 9.137771 seconds (1 allocation: 1.766 KiB) to search for duplicates of the first 200 rows, 35.960302 seconds for the first 400, and 214.447423 seconds (2 allocations: 7.953 KiB) for the first 1000.

P.S. I actually solved the problem with the code below, but any suggestion to reduce memory allocation in the for-loop is still welcome. Thanks.

@time begin
    dropmissing!(df)
    df2 = Array(df)
    df2 = string.(df2) # turn all items into strings for sorting
    df2 = sort(df2, dims = 2) # within each row, sort the columns
    df2 = unique(df2, dims = 1) # only keep the unique rows
end

It takes 9.367578 seconds (52.49 M allocations: 3.247 GiB, 27.73% gc time) to de-duplicate the whole dataset, rather than only the first 200 rows.

ProfileView.jl gives me this picture (flame graph omitted here).

I think what you want here is innerjoin (see Joins · DataFrames.jl).
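For example, a rough, untested sketch of such a self-join (it assumes the pair columns are named protein1 and protein2, and adds a row-number column so matches can be traced back):

using DataFrames

df.row = 1:nrow(df)                      # remember the original row numbers
matches = innerjoin(df, df;
                    on = [:protein1 => :protein2, :protein2 => :protein1],
                    makeunique = true)
# each match pairs a row with a row whose protein columns are swapped;
# matches.row and matches.row_1 hold the two original row indices
# (note: a row with protein1 == protein2 will also match itself)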

Thanks for your helpful tips. I know this code is not a good way to solve the deduplication problem; I want to use this example to learn how to reduce memory allocation, but I have no idea how.

Nice problem. How many different proteins do you have, 4?

Edit: corrected the number of visible proteins…

Sorry for the confusion. I have 8,679,568 pairs of proteins, and many are duplicates; the picture is just a simple demo.

Yes, but I only see proteins A, B, C, D (so, sorry, 4)? So again, how big is the set of proteins?

There are about 20k proteins.


Allocations are the place to start. Using push! repeatedly grows the storage of the vector. Since you already know how many rows there are, we can allocate once and track which of the 200 rows has a duplicate at index j. This is slightly different from your pos, so I changed the name to dup:

@time begin
    dup = zeros(Int, 200)
    for i in 1:200
        for j in i:size(df, 1)
            if df.protein1[i] == df.protein2[j] && df.protein1[j] == df.protein2[i] 
                dup[i] = j
                break
            end
        end
    end
end

So now you have a list containing, for each of the 200 rows, either 0 or the index of the first duplicate found.

Thanks, I have tried it, but there is no significant difference; it still takes 22.897745 seconds (339.22 M allocations: 10.111 GiB, 8.56% gc time).

So, is it possible to decrease the memory allocation and speed up this code block?

There is no way that the code I posted is doing 339M allocations / allocating 10GiB of memory.

I don’t think you are posting enough of your actual code

@tiZ, are you running the calculation in global scope? Try this (didn’t test):

function calcdup(df)
    n = 200
    m = size(df, 1)
    dup = zeros(Int, n)
    for i in 1:n
        for j in i:m
            if df.protein1[i] == df.protein2[j] && df.protein1[j] == df.protein2[i] 
                dup[i] = j
                break
            end
        end
    end
    return dup
end

@time calcdup(df)

That’s why I was asking. Trying to establish some kind of baseline:

using BenchmarkTools

function setup()
    df = [(rand(1:20_000), rand(1:20_000)) for _ in 1:1_000_000]
end

function indices(df)
    idx = Dict{Int, Dict{Int, Vector{Int}}}()
    for (row_no, row) in enumerate(df)
        if !haskey(idx, row[1])
            idx[row[1]] = Dict{Int, Vector{Int}}()
        end
        if !haskey(idx[row[1]], row[2])
            idx[row[1]][row[2]] = Int[]
        end
        push!(idx[row[1]][row[2]], row_no)
    end
    idx
end

function duplicates(df, idx)
    dups = Dict{Int, Int}()
    for protein1 in keys(idx) 
        for protein2 in keys(idx[protein1])
            if haskey(idx, protein2) && haskey(idx[protein2], protein1)
                for row_no1 in idx[protein1][protein2], row_no2 in idx[protein2][protein1]
                    # @show df[row_no1], df[row_no2]
                    dups[row_no1] = row_no2
                end
            end
        end
    end
    dups
end

df = @btime setup()
idx = @btime indices($df)
dups = @btime duplicates($df, $idx)

yielding

  12.841 ms (2 allocations: 15.26 MiB)
  576.248 ms (2189014 allocations: 249.76 MiB)
  158.008 ms (18 allocations: 91.95 KiB)

That sounds correct. It would probably be useful if the OP added some artificial data generation to his code.

Edit: corrected benchmark.


If you have an untyped container like pos = [] anywhere in your code, it will be slow. Make sure to add a type to the container.

Also, make sure you put your code in functions.
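For instance, a quick illustration of the container issue:

pos = []             # Vector{Any}: every push! has to box the element
pos = Int[]          # concrete element type, no boxing (same as Vector{Int}())
sizehint!(pos, 200)  # optional: reserve space up front when the size is known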


Thanks. I ran the whole code directly instead of putting it into a function.

I learned a lot today.


Just curious: what does your code look like now? How much better are the results? Do you intend to find all duplicates in such a data set?

To remove duplicates, try something like this:

Set(Set.(df))

where df is the array of 8M+ pairs (p1, p2).

or

Set(Set.(Tables.namedtupleiterator(df[:,1:2])))

This seems faster:

Set(sort.([[a,b] for (a,b) in Tables.namedtupleiterator(df[:,1:2])]))

And this is faster still:

Set(([a>b ? [b,a] : [a,b] for (a,b) in zip(df.p1,df.p2)]))

Instead of this, it would be faster to use a function barrier: define calcdup(p1::AbstractVector, p2::AbstractVector) and then call it as calcdup(df.protein1, df.protein2). You also need to change m to length(p1).
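Roughly like this (untested sketch, keeping n = 200 hard-coded as in the calcdup above):

function calcdup(p1::AbstractVector, p2::AbstractVector)
    n = 200
    m = length(p1)
    dup = zeros(Int, n)
    for i in 1:n
        for j in i:m
            # the columns are passed as concretely typed vectors,
            # so this inner loop no longer allocates
            if p1[i] == p2[j] && p1[j] == p2[i]
                dup[i] = j
                break
            end
        end
    end
    return dup
end

@time calcdup(df.protein1, df.protein2)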


I am not convinced. I tried to adapt my synthetic benchmark to DataFrames:

using BenchmarkTools
using DataFrames

function setup()
    df = DataFrame(protein1 = [rand(1:20_000) for _ in 1:1_000_000], protein2 = [rand(1:20_000) for _ in 1:1_000_000])
end

function duplicates(df)
    n = 200
    dup = zeros(Int, n)
    for i in 1:n
        for j in i:size(df, 1)
            if df.protein1[i] == df.protein2[j] && df.protein1[j] == df.protein2[i] 
                dup[i] = j
                break
            end
        end
    end
    return dup
end

df = @btime setup()
dups = @btime duplicates($df)

yielding

  15.308 ms (54 allocations: 30.52 MiB)
  19.705 s (594182583 allocations: 8.85 GiB)

Please consider: this is checking the first 200 rows for duplicates in 1_000_000 rows in 20 seconds, whereas my baseline checks for all duplicates in the data set in around 1 second. So I’m curious to learn what I’m missing here.

Edit: one more remark: I also tried to use innerjoin but failed miserably to generate the necessary self-join. Any hints?


As @bkamins referenced above, the type instabilities are causing all the problems. Here is a version like the one recommended, but using DataFramesMeta.jl for easier syntax.

julia> using DataFramesMeta, BenchmarkTools;

julia> function setup()
           df = DataFrame(protein1 = [rand(1:20_000) for _ in 1:1_000_000], protein2 = [rand(1:20_000) for _ in 1:1_000_000])
       end;

julia> function duplicates(df)
           n = 200
           dup = zeros(Int, n)
           for i in 1:n
               for j in i:size(df, 1)
                   if df.protein1[i] == df.protein2[j] && df.protein1[j] == df.protein2[i]
                       dup[i] = j
                       break
                   end
               end
           end
           return dup
       end;
julia> function duplicates_dfm(df)
           n = 200
           dup = zeros(Int, n)
           @with df begin
               for i in 1:n
                   for j in i:length(:protein1)
                       if :protein1[i] == :protein2[j] && :protein1[j] == :protein2[i]
                           dup[i] = j
                           break
                       end
                   end
               end
           end
           return dup
       end;

julia> df = @btime setup();
  15.265 ms (33 allocations: 30.52 MiB)

julia> dups = @btime duplicates($df);
  31.710 s (990677773 allocations: 20.72 GiB)

julia> dups = @btime duplicates_dfm($df);
  193.206 ms (7 allocations: 2.03 KiB)