Count occurances for matrix rows (where column order does not matter)

KevinMoerman · December 7, 2022, 4:31pm

I would like to detect rows in a matrix which only occur once, also the column order does not matter. This relates to this discussion.

I have implemented several options (see below), and all work (note though I am a Julia novice so they might look silly), except they do not scale up well. They quickly become very slow. For instance occursOnce4 below takes 75 seconds for a (200406, 4) array on my machine:

I have also checked the counter function as part of DataStructures. That seems fast. However it produces counts on a unique set, which in this case is Fs, i.e. the column direction sorted version of my input. So I cannot use the counts without the accompanying unique row set (unsorted). Also functions like counter, and also unique unfortunately do not export the indices for the unique members.

Any help would be appreciated. Thanks!

Here is an example mini matrix:
F=[1 2 3 4; 5 6 7 8; 4 3 2 1; 7 8 6 5; 1 2 5 6; 8 7 6 5; 5 6 7 8]

and more complex examples can be created using something like:
F=repeat(rand(1:500,5,4),3,1)

For background I am parsing 4 noded quadrilateral face arrays and am looking for faces that are not doubled, i.e. the boundary faces of a volumetric hexahedral mesh.

function occursOnce1(F)    
    Fs=sort(F,dims=2)
    S=prod(string.(Fs),dims=2)
    L=zeros(Bool,size(F,1))
    for q=1:1:size(F,1)#Loop over faces
        L[q]=count(S.==S[q])==1            
    end    
    return L
end

function occursOnce2(F)    
    Fs=sort(F,dims=2)    
    L=zeros(Bool,size(F,1))
    for q=1:1:size(F,1)#Loop over faces
        L[q]=count([Fs[q,:]==Fs[i,:] for i=1:1:size(F,1)])==1            
    end    
    return L
end

function occursOnce3(F)    
    Fs=sort(F,dims=2) #Sorted across columns
    L=zeros(Bool,size(F,1)) #Initialize boolean array
    for q=1:1:size(F,1) #Loop over faces
        L[q]=count(Fs[q,1].==Fs[:,1] .&& Fs[q,2].==Fs[:,2] .&& Fs[q,3].==Fs[:,3] .&& Fs[q,4].==Fs[:,4]) ==1  #Check count          
    end    
    return L
end

function occursOnce4(F)
    Fs=sort(F,dims=2)
    L=[count(==(r),eachrow(Fs)) for r in eachrow(Fs)].==1    
    return L       
end

Dan · December 7, 2022, 6:00pm

This problem has a lot of optimizations possible, but a major issue would be to get down from O(n^2) complexity.

occursOnceX(F) = begin
    Fs = map(sort, eachrow(F))
    p = sortperm(Fs)
    n,m = size(F)
    v = zeros(Int,n-1)
    for c in 1:m, r in 2:n
        v[r-1]==0 && Fs[p[r]][c]!=Fs[p[r-1]][c] && ( v[r-1] = 1 )
    end
    res = zeros(Int,n)
    v[1]==1 && ( res[p[1]] = 1 )
    v[end]==1 && ( res[p[end]] = 1 )
    for i=2:n-1
        v[i]==1 && v[i-1]==1 && ( res[p[i]] = 1 )
    end
    return res
end

Now:

julia> occursOnceX(F)
7-element Vector{Int64}:
 0
 0
 0
 0
 1
 0
 0

seems okay (not thoroughly tested).
PS it takes < 1 second for 100_000 tall table.

halleysfifthinc · December 7, 2022, 6:13pm

Welcome @KevinMoerman! I’m not sure from your question and functions whether you need/prefer the results as a Boolean array or the indices of the undoubled faces, but here’s an attempt that returns the latter:

function onlyoneof(F)
   Ft = SVector{4}.(sort.(eachcol(collect(F'))))
   d = Dict{eltype(Ft),Int}()
   for i in eachindex(Ft)
       if haskey(d, Ft[i])
           delete!(d, Ft[i])
       else
           d[Ft[i]] = i
       end
   end

   return collect(values(d))
end

(Minimum) Timings of the functions with different sized Fs:

F=[1 2 3 4; 5 6 7 8; 4 3 2 1; 7 8 6 5; 1 2 5 6; 8 7 6 5; 5 6 7 8]
F2=repeat(rand(1:500,5,4),3,1)
F3=repeat(rand(1:500,10,4),10,1)
F4=repeat(rand(1:500,20,4),200,1)

Function	mini `F`	`F2`	`F3`	`F4`
`occursOnce1`	8.723 μs	12.432 μs	89.270 μs	55.069 ms
`occursOnce2`	4.051 μs	14.256 μs	554.927 μs	1.183 s
`occursOnce3`	2.640 μs	5.241 μs	93.486 μs	101.124 ms
`occursOnce4`	1.235 μs	1.704 μs	18.765 μs	21.516 ms
`onlyoneof`	911.700 ns	1.603 μs	10.459 μs	422.120 μs
`occursOnceX`	817.500 ns	1.646 μs	11.880 μs	890.390 μs

Dan · December 7, 2022, 6:19pm

has correctness issue with values appearing odd # of times > 1.

It would also be nice to benchmark with tall F. This version looks efficient. I wonder how it scales to 100_000 rows.

halleysfifthinc · December 7, 2022, 6:29pm

D’oh! Good catch.

Dan · December 7, 2022, 6:35pm

No worries. Happens to all. I’ve combined our two versions into one short function:

using StaticArrays, StatsBase

function onlyoneofX(F)
       Ft = sort.(SVector{4}.(eachrow(F)))
       cm = countmap(Ft)
       [ifelse(cm[v]==1,1,0) for v in Ft]
end

It returns a 0/1 vector as OP did, but easily adapted to return index list.
(A good improvement was swapping the SVector{4} with sort.).

giordano · December 7, 2022, 9:43pm

I doubt that @views does anything

julia> @macroexpand @views eachrow(F)
:(eachrow(F))

Dan · December 7, 2022, 10:12pm

You are right. Compilation stats threw off my quick @time.
I’ll edit the answer.

bertschi · December 7, 2022, 11:15pm

Unique has a method which takes a function as its first argument:

unique(f, itr)

  Returns an array containing one value from itr for each unique value
  produced by f applied to elements of itr.

julia> F=[5 7 6 8; 4 3 2 1; 7 8 6 5; 1 2 5 6; 8 7 6 5; 5 6 7 8];

julia> unique(sort, eachrow(F))
3-element Vector{Any}:
 [5, 7, 6, 8]
 [4, 3, 2, 1]
 [1, 2, 5, 6]

KevinMoerman · December 8, 2022, 10:37am

Thanks for the welcome! And also for the super quick and detailed reply! Can you share what code/tools you use for evaluating the performance of these variations?

KevinMoerman · December 8, 2022, 10:40am

Thanks all. These solutions look great. Lots of things to learn from. I consider myself a master Jedi in the “art” of MATLAB (I am the developer of the GIBBON project), but I am slowly realizing that converting super optimized MATLAB code nearly directly into Julia code quite often does not equal super optimized Julia code… at all. Then by optimizing the Julia code, and embracing the “Julian approach” it ends up faster than I thought was possible. Sadly for me though none of the structures/approaches/syntax to speed things up come naturally to me yet, leaving me feeling demoted to a Padawan learner. In MATLAB one avoids for-loops like the plague, with Julia one seems to have to embrace them like the cure. However the promise of unrivaled speed, and a fully open source software project, keep me motivated .

giordano · December 8, 2022, 11:38am

Honestly, that’s quite normal, especially if you don’t want to just translate the code but also make it so efficiently. That requires good knowledge of both languages and their idioms. If I were to translate some Julia code to Matlab I’m pretty sure I’d get something barely working but also terrible performance-wise (or more terrible than it needs to be anyway) and you’d be able to make it much better.

A warm recommendation is having a look at the performance tips in the documentation. Golden rules are:

non-constant globals are bad (type-annotated globals are ok, but they arrived only in Julia v1.8)
put your code in functions
avoid memory allocations as much as possible, especially in hot loops. Julia has a stop-the-world garbage collector (and at the moment it’s also not multi-threaded, but that will hopefully change at some point in the future), but if you have hot loops which don’t need to allocate memory at all because they can reuse buffers, then the GC doesn’t kick in at all
write type-stable code, @code_warntype is your friend. This goes hand-in-hand with previous point because type-unstable code causes extra memory allocations.

I promise that these will become more natural as you get familiar with the language. Note that big improvements to your code above came from changing the algorithm, which has little to do with the language.

Everyone’s favourite benchmarking package in Julia is BenchmarkTools.jl, I presume that’s what @halleysfifthinc used as well. For example, with your original function:

julia> using BenchmarkTools

julia> function occursOnce1(F)    
           Fs=sort(F,dims=2)
           S=prod(string.(Fs),dims=2)
           L=zeros(Bool,size(F,1))
           for q=1:1:size(F,1)#Loop over faces
               L[q]=count(S.==S[q])==1            
           end    
           return L
       end
occursOnce1 (generic function with 1 method)

julia> F=[1 2 3 4; 5 6 7 8; 4 3 2 1; 7 8 6 5; 1 2 5 6; 8 7 6 5; 5 6 7 8];

julia> @benchmark occursOnce1($F)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   9.666 μs …  3.346 ms  ┊ GC (min … max): 0.00% … 99.13%
 Time  (median):     11.305 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   12.417 μs ± 33.456 μs  ┊ GC (mean ± σ):  2.67% ±  0.99%

    ▁▅█▇▄▁                                                     
  ▁▄██████▇▆▆▅▅▄▄▃▃▃▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  9.67 μs         Histogram: frequency by time        24.3 μs <

 Memory estimate: 7.80 KiB, allocs estimate: 179.

julia> @btime occursOnce1($F); # to print only the minimum
  9.825 μs (179 allocations: 7.80 KiB)

Note that I interpolated F to avoid measuring the cost of accessing a non-constant global variable, which is unrelated to your function. An alternative would be to define a local variable F from the setup stage:

julia> @benchmark occursOnce1(F) setup=(F=[1 2 3 4; 5 6 7 8; 4 3 2 1; 7 8 6 5; 1 2 5 6; 8 7 6 5; 5 6 7 8])
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  10.139 μs …  3.815 ms  ┊ GC (min … max): 0.00% … 98.99%
 Time  (median):     11.386 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   12.660 μs ± 38.143 μs  ┊ GC (mean ± σ):  2.98% ±  0.99%

    ▄█▇▂                                                       
  ▂▅████▇▅▅▅▆▅▄▄▃▃▃▃▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  10.1 μs         Histogram: frequency by time        24.7 μs <

 Memory estimate: 7.80 KiB, allocs estimate: 179.

KevinMoerman · December 8, 2022, 12:46pm

This version of @halleysfifthinc’s proposed solution appears most efficient (as I do not care if I get linear indices or a Boolean array as the output):

function onlyoneof(F)
    Ft = sort.(SVector{4}.(eachrow(F)))
    d = Dict{eltype(Ft),Int}()
    for i in eachindex(Ft)
        if haskey(d, Ft[i])
            delete!(d, Ft[i])
        else
            d[Ft[i]] = i
        end
    end 
    return collect(values(d))
end

Thanks again all for the help!

KevinMoerman · December 8, 2022, 12:57pm

Great, this will be useful to get unique faces for sure. It does not offer the ones that occur once yet, as this requires a counting. This, i.e. the use of the unique function, would be useful for what I want if it exported the indices of the unique elements. It is a shame that, as far as I know, the default julia unique function does not support such additional outputs.

Here is the MATLAB implementation for instance which features such outputs:

[C,ia,ic] = unique(A,'rows')

Here the outputs ia and ic are really handy as they are defined as C =A(ia) and A = C(ic). Then the solution to the problem I pose here (defining “counts” to allow selection of those that occurred once) in MATLAB could feature:

[subInd] = ind2subn(size(C),ic); %Convert to subscript indices (custom function but you get the idea)
Ac=accumarray(subInd,ones(numel(ic),1),size(C)); %Use accumarray to do counts
Ac=reshape(Ac(ic),[size(A,1) 1]); %Reshape to match input

I wonder if julia’s unique function provided the outputs ia and ic, if that would actually offer the fastest approach to getting the entries that occur once.

Thanks again for posting this. Learning about the function f bit, and the use of eachrow is certainly very useful.

KevinMoerman · December 8, 2022, 1:01pm

@Dan (and perhaps @halleysfifthinc) earlier you mentioned this solution has:

correctness issue with values appearing odd # of times > 1.

I cannot see this error. Would you be able to edit this version of the solution to fix it if still present? ? Thanks

Dan · December 8, 2022, 1:50pm

The correctness issue did not disappear. Try the function with:

There are faster/shorter correct versions later in the thread.
(perhaps you do want lines which appear odd (not once) number of times, in that case, the OP should be revised).

KevinMoerman · December 8, 2022, 2:05pm

Thanks. For me that “faulty” solution appears to work, or at least I do not easily see what is wrong and the output (in terms of the mesh output with boundary faces) looks good. Is it that counts >1 are incorrectly treated? If so perhaps it still works for counts equal to 1, which is what I need?

In therms of speed it seems the best too, in terms of use for large arrays (I tested them for ~20000 faces). Or am I missing something?

Using

function onlyoneof(F)
    Ft = sort.(SVector{4}.(eachrow(F)))
    d = Dict{eltype(Ft),Int}()
    for i in eachindex(Ft)
        if haskey(d, Ft[i])
            delete!(d, Ft[i])
        else
            d[Ft[i]] = i
        end
    end 
    return collect(values(d))
end

I get:

julia> using BenchmarkTools; @benchmark onlyoneof($Fc)
BenchmarkTools.Trial: 372 samples with 1 evaluation.
 Range (min … max):  11.891 ms … 17.619 ms  ┊ GC (min … max): 0.00% … 8.41%
 Time  (median):     13.188 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   13.444 ms ±  1.009 ms  ┊ GC (mean ± σ):  3.32% ± 4.89%

     ▁▃▁▂▆▄▄▆▂ ▄█▆▁ ▁ ▃    ▄     ▃▂                            
  ▃▃▆█████████▇████▄█▆█▄▇▄██▄▄▆████▆█▆▆▆▃▇▃▆▃▄▃▃▄▃▄▃▄▁▃▃▁▁▁▁▃ ▄
  11.9 ms         Histogram: frequency by time        16.2 ms <

Whereas:

function onlyoneofX(F)
       Ft = sort.(SVector{4}.(eachrow(F)))
       cm = countmap(Ft)
       return [cm[v]==1 for v in Ft]
end

Gives me:

julia> using BenchmarkTools; @benchmark onlyoneofX($Fc)
BenchmarkTools.Trial: 244 samples with 1 evaluation.
 Range (min … max):  18.994 ms … 28.280 ms  ┊ GC (min … max): 0.00% … 5.69%
 Time  (median):     20.002 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.471 ms ±  1.229 ms  ┊ GC (mean ± σ):  3.18% ± 3.98%

    ▁█▇▃▃▁             ▁ ▁                                     
  ▄▅███████▇▇▃▄▃▁▄▃▅▇▆▇█▅█▅▆▄▄▆▃▃▃▁▃▃▁▁▁▁▁▁▁▁▁▁▁▁▃▁▃▁▁▁▁▁▁▁▁▃ ▃
  19 ms           Histogram: frequency by time        25.3 ms <

Dan · December 8, 2022, 2:18pm

If it fits the use case (which it totally might) then it’s fine. But it is important to understand the output, to make sure you aren’t surprised by future result. If you want even more speed, there is still more optimization possible. But I think it isn’t a bottleneck for your application now.

KevinMoerman · December 8, 2022, 2:30pm

Had a closer look. I see now what you mean and how it is wrong (uneven repetitions are wrongly seen as occurring once). Since that code appears faster for large arrays, can this issue be fixed easily enough? I’ll do some digging myself too but it might take me a long time. Thanks for your insights!

Dan · December 8, 2022, 2:38pm

The issue is easily fixed using the onlyoneofX function, at a cost of 50% more time. Is there a speed or memory problem to address still? (relative to rest of processing)

On the other hand, you might have actually wanted this alternative behaviour, as triangles touching 3 times could be a problem too (normal meshed surfaces have 2 triangles per edge or an even number generally).
Perhaps, in a future question, you can also provide the original MATLAB code, as any “gold” reference to a question works well to get an accurate answer.

Topic		Replies	Views
How to efficiently find columns of the matrix which are the same? New to Julia question , optimization	10	1174	November 7, 2023
Equal method to matlab unique in rows General Usage question	2	314	November 14, 2022
Most effective way to check if neighboring values in a matrix are the same Performance question , performance	19	2767	August 30, 2022
[Nerdsnipe warning] Speed up short vector comparisons to beat R Performance	37	1711	April 30, 2024
Return duplicate rows in array with no of times and index of first occurence General Usage question , array	6	707	July 13, 2022

Count occurances for matrix rows (where column order does not matter)

Related topics