Count occurances for matrix rows (where column order does not matter)

I would like to detect rows in a matrix which only occur once, also the column order does not matter. This relates to this discussion.

I have implemented several options (see below), and all work (note though I am a Julia novice so they might look silly), except they do not scale up well. They quickly become very slow. For instance occursOnce4 below takes 75 seconds for a (200406, 4) array on my machine:

I have also checked the counter function as part of DataStructures. That seems fast. However it produces counts on a unique set, which in this case is Fs, i.e. the column direction sorted version of my input. So I cannot use the counts without the accompanying unique row set (unsorted). Also functions like counter, and also unique unfortunately do not export the indices for the unique members.

Any help would be appreciated. Thanks!

Here is an example mini matrix:
F=[1 2 3 4; 5 6 7 8; 4 3 2 1; 7 8 6 5; 1 2 5 6; 8 7 6 5; 5 6 7 8]

and more complex examples can be created using something like:
F=repeat(rand(1:500,5,4),3,1)

For background I am parsing 4 noded quadrilateral face arrays and am looking for faces that are not doubled, i.e. the boundary faces of a volumetric hexahedral mesh.

function occursOnce1(F)    
    Fs=sort(F,dims=2)
    S=prod(string.(Fs),dims=2)
    L=zeros(Bool,size(F,1))
    for q=1:1:size(F,1)#Loop over faces
        L[q]=count(S.==S[q])==1            
    end    
    return L
end

function occursOnce2(F)    
    Fs=sort(F,dims=2)    
    L=zeros(Bool,size(F,1))
    for q=1:1:size(F,1)#Loop over faces
        L[q]=count([Fs[q,:]==Fs[i,:] for i=1:1:size(F,1)])==1            
    end    
    return L
end

function occursOnce3(F)    
    Fs=sort(F,dims=2) #Sorted across columns
    L=zeros(Bool,size(F,1)) #Initialize boolean array
    for q=1:1:size(F,1) #Loop over faces
        L[q]=count(Fs[q,1].==Fs[:,1] .&& Fs[q,2].==Fs[:,2] .&& Fs[q,3].==Fs[:,3] .&& Fs[q,4].==Fs[:,4]) ==1  #Check count          
    end    
    return L
end

function occursOnce4(F)
    Fs=sort(F,dims=2)
    L=[count(==(r),eachrow(Fs)) for r in eachrow(Fs)].==1    
    return L       
end

This problem has a lot of optimizations possible, but a major issue would be to get down from O(n^2) complexity.

occursOnceX(F) = begin
    Fs = map(sort, eachrow(F))
    p = sortperm(Fs)
    n,m = size(F)
    v = zeros(Int,n-1)
    for c in 1:m, r in 2:n
        v[r-1]==0 && Fs[p[r]][c]!=Fs[p[r-1]][c] && ( v[r-1] = 1 )
    end
    res = zeros(Int,n)
    v[1]==1 && ( res[p[1]] = 1 )
    v[end]==1 && ( res[p[end]] = 1 )
    for i=2:n-1
        v[i]==1 && v[i-1]==1 && ( res[p[i]] = 1 )
    end
    return res
end

Now:

julia> occursOnceX(F)
7-element Vector{Int64}:
 0
 0
 0
 0
 1
 0
 0

seems okay (not thoroughly tested).
PS it takes < 1 second for 100_000 tall table.

Welcome @KevinMoerman! I’m not sure from your question and functions whether you need/prefer the results as a Boolean array or the indices of the undoubled faces, but here’s an attempt that returns the latter:

function onlyoneof(F)
   Ft = SVector{4}.(sort.(eachcol(collect(F'))))
   d = Dict{eltype(Ft),Int}()
   for i in eachindex(Ft)
       if haskey(d, Ft[i])
           delete!(d, Ft[i])
       else
           d[Ft[i]] = i
       end
   end

   return collect(values(d))
end

(Minimum) Timings of the functions with different sized Fs:

F=[1 2 3 4; 5 6 7 8; 4 3 2 1; 7 8 6 5; 1 2 5 6; 8 7 6 5; 5 6 7 8]
F2=repeat(rand(1:500,5,4),3,1)
F3=repeat(rand(1:500,10,4),10,1)
F4=repeat(rand(1:500,20,4),200,1)
Function mini F F2 F3 F4
occursOnce1 8.723 ΞΌs 12.432 ΞΌs 89.270 ΞΌs 55.069 ms
occursOnce2 4.051 ΞΌs 14.256 ΞΌs 554.927 ΞΌs 1.183 s
occursOnce3 2.640 ΞΌs 5.241 ΞΌs 93.486 ΞΌs 101.124 ms
occursOnce4 1.235 ΞΌs 1.704 ΞΌs 18.765 ΞΌs 21.516 ms
onlyoneof 911.700 ns 1.603 ΞΌs 10.459 ΞΌs 422.120 ΞΌs
occursOnceX 817.500 ns 1.646 ΞΌs 11.880 ΞΌs 890.390 ΞΌs
1 Like

has correctness issue with values appearing odd # of times > 1.

It would also be nice to benchmark with tall F. This version looks efficient. I wonder how it scales to 100_000 rows.

D’oh! Good catch.

No worries. Happens to all. I’ve combined our two versions into one short function:

using StaticArrays, StatsBase

function onlyoneofX(F)
       Ft = sort.(SVector{4}.(eachrow(F)))
       cm = countmap(Ft)
       [ifelse(cm[v]==1,1,0) for v in Ft]
end

It returns a 0/1 vector as OP did, but easily adapted to return index list.
(A good improvement was swapping the SVector{4} with sort.).

I doubt that @views does anything

julia> @macroexpand @views eachrow(F)
:(eachrow(F))

You are right. Compilation stats threw off my quick @time.
I’ll edit the answer.

Unique has a method which takes a function as its first argument:

unique(f, itr)

  Returns an array containing one value from itr for each unique value
  produced by f applied to elements of itr.

julia> F=[5 7 6 8; 4 3 2 1; 7 8 6 5; 1 2 5 6; 8 7 6 5; 5 6 7 8];

julia> unique(sort, eachrow(F))
3-element Vector{Any}:
 [5, 7, 6, 8]
 [4, 3, 2, 1]
 [1, 2, 5, 6]
1 Like

Thanks for the welcome! And also for the super quick and detailed reply! Can you share what code/tools you use for evaluating the performance of these variations?

Thanks all. These solutions look great. Lots of things to learn from. I consider myself a master Jedi in the β€œart” of MATLAB (I am the developer of the GIBBON project), but I am slowly realizing that converting super optimized MATLAB code nearly directly into Julia code quite often does not equal super optimized Julia code… at all. Then by optimizing the Julia code, and embracing the β€œJulian approach” it ends up faster than I thought was possible. Sadly for me though none of the structures/approaches/syntax to speed things up come naturally to me yet, leaving me feeling demoted to a Padawan learner. In MATLAB one avoids for-loops like the plague, with Julia one seems to have to embrace them like the cure. However the promise of unrivaled speed, and a fully open source software project, keep me motivated :slight_smile:.

2 Likes

Honestly, that’s quite normal, especially if you don’t want to just translate the code but also make it so efficiently. That requires good knowledge of both languages and their idioms. If I were to translate some Julia code to Matlab I’m pretty sure I’d get something barely working but also terrible performance-wise (or more terrible than it needs to be anyway) and you’d be able to make it much better.

A warm recommendation is having a look at the performance tips in the documentation. Golden rules are:

I promise that these will become more natural as you get familiar with the language. Note that big improvements to your code above came from changing the algorithm, which has little to do with the language.

Everyone’s favourite benchmarking package in Julia is BenchmarkTools.jl, I presume that’s what @halleysfifthinc used as well. For example, with your original function:

julia> using BenchmarkTools

julia> function occursOnce1(F)    
           Fs=sort(F,dims=2)
           S=prod(string.(Fs),dims=2)
           L=zeros(Bool,size(F,1))
           for q=1:1:size(F,1)#Loop over faces
               L[q]=count(S.==S[q])==1            
           end    
           return L
       end
occursOnce1 (generic function with 1 method)

julia> F=[1 2 3 4; 5 6 7 8; 4 3 2 1; 7 8 6 5; 1 2 5 6; 8 7 6 5; 5 6 7 8];

julia> @benchmark occursOnce1($F)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   9.666 ΞΌs …  3.346 ms  β”Š GC (min … max): 0.00% … 99.13%
 Time  (median):     11.305 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   12.417 ΞΌs Β± 33.456 ΞΌs  β”Š GC (mean Β± Οƒ):  2.67% Β±  0.99%

    β–β–…β–ˆβ–‡β–„β–                                                     
  β–β–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–†β–…β–…β–„β–„β–ƒβ–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–‚
  9.67 ΞΌs         Histogram: frequency by time        24.3 ΞΌs <

 Memory estimate: 7.80 KiB, allocs estimate: 179.

julia> @btime occursOnce1($F); # to print only the minimum
  9.825 ΞΌs (179 allocations: 7.80 KiB)

Note that I interpolated F to avoid measuring the cost of accessing a non-constant global variable, which is unrelated to your function. An alternative would be to define a local variable F from the setup stage:

julia> @benchmark occursOnce1(F) setup=(F=[1 2 3 4; 5 6 7 8; 4 3 2 1; 7 8 6 5; 1 2 5 6; 8 7 6 5; 5 6 7 8])
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  10.139 ΞΌs …  3.815 ms  β”Š GC (min … max): 0.00% … 98.99%
 Time  (median):     11.386 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   12.660 ΞΌs Β± 38.143 ΞΌs  β”Š GC (mean Β± Οƒ):  2.98% Β±  0.99%

    β–„β–ˆβ–‡β–‚                                                       
  β–‚β–…β–ˆβ–ˆβ–ˆβ–ˆβ–‡β–…β–…β–…β–†β–…β–„β–„β–ƒβ–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚ β–ƒ
  10.1 ΞΌs         Histogram: frequency by time        24.7 ΞΌs <

 Memory estimate: 7.80 KiB, allocs estimate: 179.
1 Like

This version of @halleysfifthinc’s proposed solution appears most efficient (as I do not care if I get linear indices or a Boolean array as the output):

function onlyoneof(F)
    Ft = sort.(SVector{4}.(eachrow(F)))
    d = Dict{eltype(Ft),Int}()
    for i in eachindex(Ft)
        if haskey(d, Ft[i])
            delete!(d, Ft[i])
        else
            d[Ft[i]] = i
        end
    end 
    return collect(values(d))
end

Thanks again all for the help!

Great, this will be useful to get unique faces for sure. It does not offer the ones that occur once yet, as this requires a counting. This, i.e. the use of the unique function, would be useful for what I want if it exported the indices of the unique elements. It is a shame that, as far as I know, the default julia unique function does not support such additional outputs.

Here is the MATLAB implementation for instance which features such outputs:

[C,ia,ic] = unique(A,'rows')

Here the outputs ia and ic are really handy as they are defined as C =A(ia) and A = C(ic). Then the solution to the problem I pose here (defining β€œcounts” to allow selection of those that occurred once) in MATLAB could feature:

[subInd] = ind2subn(size(C),ic); %Convert to subscript indices (custom function but you get the idea)
Ac=accumarray(subInd,ones(numel(ic),1),size(C)); %Use accumarray to do counts
Ac=reshape(Ac(ic),[size(A,1) 1]); %Reshape to match input 

I wonder if julia’s unique function provided the outputs ia and ic, if that would actually offer the fastest approach to getting the entries that occur once.

Thanks again for posting this. Learning about the function f bit, and the use of eachrow is certainly very useful.

@Dan (and perhaps @halleysfifthinc) earlier you mentioned this solution has:

correctness issue with values appearing odd # of times > 1.

I cannot see this error. Would you be able to edit this version of the solution to fix it if still present? :point_up: ? Thanks

The correctness issue did not disappear. Try the function with:

There are faster/shorter correct versions later in the thread.
(perhaps you do want lines which appear odd (not once) number of times, in that case, the OP should be revised).

Thanks. For me that β€œfaulty” solution appears to work, or at least I do not easily see what is wrong and the output (in terms of the mesh output with boundary faces) looks good. Is it that counts >1 are incorrectly treated? If so perhaps it still works for counts equal to 1, which is what I need?

In therms of speed it seems the best too, in terms of use for large arrays (I tested them for ~20000 faces). Or am I missing something?

Using

function onlyoneof(F)
    Ft = sort.(SVector{4}.(eachrow(F)))
    d = Dict{eltype(Ft),Int}()
    for i in eachindex(Ft)
        if haskey(d, Ft[i])
            delete!(d, Ft[i])
        else
            d[Ft[i]] = i
        end
    end 
    return collect(values(d))
end

I get:

julia> using BenchmarkTools; @benchmark onlyoneof($Fc)
BenchmarkTools.Trial: 372 samples with 1 evaluation.
 Range (min … max):  11.891 ms … 17.619 ms  β”Š GC (min … max): 0.00% … 8.41%
 Time  (median):     13.188 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   13.444 ms Β±  1.009 ms  β”Š GC (mean Β± Οƒ):  3.32% Β± 4.89%

     ▁▃▁▂▆▄▄▆▂ β–„β–ˆβ–†β– ▁ β–ƒ    β–„     β–ƒβ–‚                            
  β–ƒβ–ƒβ–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–„β–ˆβ–†β–ˆβ–„β–‡β–„β–ˆβ–ˆβ–„β–„β–†β–ˆβ–ˆβ–ˆβ–ˆβ–†β–ˆβ–†β–†β–†β–ƒβ–‡β–ƒβ–†β–ƒβ–„β–ƒβ–ƒβ–„β–ƒβ–„β–ƒβ–„β–β–ƒβ–ƒβ–β–β–β–β–ƒ β–„
  11.9 ms         Histogram: frequency by time        16.2 ms <

Whereas:

function onlyoneofX(F)
       Ft = sort.(SVector{4}.(eachrow(F)))
       cm = countmap(Ft)
       return [cm[v]==1 for v in Ft]
end

Gives me:

julia> using BenchmarkTools; @benchmark onlyoneofX($Fc)
BenchmarkTools.Trial: 244 samples with 1 evaluation.
 Range (min … max):  18.994 ms … 28.280 ms  β”Š GC (min … max): 0.00% … 5.69%
 Time  (median):     20.002 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   20.471 ms Β±  1.229 ms  β”Š GC (mean Β± Οƒ):  3.18% Β± 3.98%

    β–β–ˆβ–‡β–ƒβ–ƒβ–             ▁ ▁                                     
  β–„β–…β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–ƒβ–„β–ƒβ–β–„β–ƒβ–…β–‡β–†β–‡β–ˆβ–…β–ˆβ–…β–†β–„β–„β–†β–ƒβ–ƒβ–ƒβ–β–ƒβ–ƒβ–β–β–β–β–β–β–β–β–β–β–β–β–ƒβ–β–ƒβ–β–β–β–β–β–β–β–β–ƒ β–ƒ
  19 ms           Histogram: frequency by time        25.3 ms <

If it fits the use case (which it totally might) then it’s fine. But it is important to understand the output, to make sure you aren’t surprised by future result. If you want even more speed, there is still more optimization possible. But I think it isn’t a bottleneck for your application now.

Had a closer look. I see now what you mean and how it is wrong (uneven repetitions are wrongly seen as occurring once). Since that code appears faster for large arrays, can this issue be fixed easily enough? I’ll do some digging myself too but it might take me a long time. Thanks for your insights!

2 Likes

The issue is easily fixed using the onlyoneofX function, at a cost of 50% more time. Is there a speed or memory problem to address still? (relative to rest of processing)

On the other hand, you might have actually wanted this alternative behaviour, as triangles touching 3 times could be a problem too (normal meshed surfaces have 2 triangles per edge or an even number generally).
Perhaps, in a future question, you can also provide the original MATLAB code, as any β€œgold” reference to a question works well to get an accurate answer.